Shell Pattern Matching

Parameter expansion

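Parameter expansion is the usual way to process strings and numbers in a script. It can replace external programs such as sed and cut, which speeds things up significantly because no extra process is started. As our experience with scripting grows, the ability to effectively manipulate strings and numbers will prove extremely valuable.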

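It is very helpful for checking a variable's value (for example, whether an argument starts with a dash, and otherwise treating it as a plain argument) and for extraction (for example, taking a file name and dropping its suffix). The manual section Shell parameter expansion covers every case but can be hard to follow; trying the forms out by hand makes them clear. A pipeline can achieve the same result, but with more effort.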

This Chinese summary is quite good: Shell扩展(Shell Expansions)-参数扩展(Shell Parameter Expansion). It gets one thing wrong: $$ is the PID of the current shell.

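Note the difference between a null and an unset variable. set -u reports an error when an undefined (that is, unset) variable is used. Null here means empty, for example var=: the variable exists but has no value: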

var=
# prints an empty line
echo $var
# true
[[ -z $var ]]
# false
[[ -n $var ]]

set -u
unset var
# error: var: unbound variable
echo $var
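
The three states (unset, null, non-empty) can be told apart in one helper. This is only a minimal sketch: `var_state` and `demo_var` are hypothetical names, and the `-v` test assumes bash 4.2 or newer.

```bash
# var_state NAME: report whether the variable called NAME is unset, null, or non-empty.
# (var_state is a hypothetical helper, not a bash built-in.)
var_state() {
    local name=$1
    if [[ ! -v $name ]]; then        # -v is true when the variable is set (bash 4.2+)
        echo unset
    elif [[ -z ${!name} ]]; then     # set, but the value is the empty string
        echo null
    else
        echo non-empty
    fi
}

unset demo_var
var_state demo_var    # unset
demo_var=
var_state demo_var    # null
demo_var=hello
var_state demo_var    # non-empty
```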

Bash’s various forms of parameter expansion can also distinguish between unset and null values:

# `w` is a literal here; write $w to substitute another variable's value

# if a is unset, w is used
a=${a-w}
# if a is unset or null (empty), w is used
# for example, for a positional parameter passed from outside
a=${a:-w}

# if a is unset or null (empty), w is assigned to a
# the leading : is a no-op command, so the expanded value is not run as a command
# cannot be used to assign positional parameters, for example ${3:=hello}
: ${a:=w}

# if a is unset or null (empty), w is written to stderr
# and a non-interactive script exits with an error
a=${a:?w}
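
A short sketch of how the `-`, `:-` and `:=` forms behave side by side. The function `greet` and the variable `color` are made up for illustration.

```bash
# greet [NAME]: falls back to "world" when NAME is unset or empty.
# (greet is a hypothetical example function.)
greet() {
    local name=${1:-world}
    echo "hello, $name"
}

unset color
first=${color-red}      # color is unset, so first is "red"
color=
second=${color-red}     # color is set but null, so second is empty
third=${color:-red}     # unset or null, so third is "red"
: "${color:=blue}"      # color is null, so it is assigned "blue"

greet                   # hello, world
greet Bob               # hello, Bob
echo "$first|$second|$third|$color"
```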

Expansions that return variable names:

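# expand to the names of variables that start with prefix
# the two forms differ only inside double quotes:
# "${!prefix@}" yields one word per name, "${!prefix*}" a single word
${!prefix*}
${!prefix@}
# list all variables whose names start with BASH
echo ${!BASH*}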

Indirect parameter expansion:

parameter="var"
var="hello"

# prints hello
echo ${!parameter}
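
Name expansion and indirect expansion combine naturally: the first yields the variable names, the second their values. A minimal sketch, where the `APP_` variables and `dump_prefixed` are hypothetical names chosen for the example:

```bash
APP_HOST=localhost
APP_PORT=8080

# dump_prefixed: print NAME=value for every variable whose name starts with APP_
dump_prefixed() {
    local name
    for name in "${!APP_@}"; do      # one word per matching variable name
        echo "$name=${!name}"        # ${!name} is the value of the variable named by $name
    done
}

dump_prefixed
```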

Substring expansion, using the examples from the Shell parameter expansion manual linked above:

$ string=01234567890abcdefgh
# 7 is start index
$ echo ${string:7}
7890abcdefgh
# 0 is the length, the number of characters to take
$ echo ${string:7:0}

$ echo ${string:7:2}
78
$ echo ${string:7:-2}
7890abcdef
# the space is required so that this is not confused with :-
$ echo ${string: -7}
bcdefgh
$ echo ${string: -7:0}

$ echo ${string: -7:2}
bc
$ echo ${string: -7:-2}
bcdef

# set $1 positional parameter
$ set -- 01234567890abcdefgh
$ echo ${1:7}
7890abcdefgh
$ echo ${1:7:0}

$ echo ${1:7:2}
78
$ echo ${1:7:-2}
7890abcdef
$ echo ${1: -7}
bcdefgh
$ echo ${1: -7:0}

$ echo ${1: -7:2}
bc
# negative length: an offset from the end, so stop 2 characters before the end
$ echo ${1: -7:-2}
bcdef

# array
$ array[0]=01234567890abcdefgh
$ echo ${array[0]:7}
7890abcdefgh
$ echo ${array[0]:7:0}

$ echo ${array[0]:7:2}
78
$ echo ${array[0]:7:-2}
7890abcdef
$ echo ${array[0]: -7}
bcdefgh
$ echo ${array[0]: -7:0}

$ echo ${array[0]: -7:2}
bc
$ echo ${array[0]: -7:-2}
bcdef
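
As a practical use of offsets and lengths, a fixed-width date stamp can be split without calling cut. The variable names below are illustrative only.

```bash
stamp=20240131
year=${stamp:0:4}      # 4 characters starting at index 0
month=${stamp:4:2}     # 2 characters starting at index 4
day=${stamp: -2}       # last 2 characters; note the space before -2
echo "$year-$month-$day"
```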

Other common usages, mainly string operations and especially pathnames; much faster than extracting with cut.

# get string length
${#string}
# note, the number of positional parameters
${#@}

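# check whether the first positional parameter starts with -
# # removes the shortest match of the pattern from the front
# operating on ${1}, for example ${1} is --verbose
# get: -verbose
${1#-}
# ## removes the longest match, but the pattern - matches only one dash
# get: -verbose (use the extglob pattern +(-) to strip every leading dash)
${1##-}

# get the file name from a download url
# \ escapes / in the path; it is optional here, ${1##*/} works the same
${1##*\/}
# get the path of a url
${1%\/*}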
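
The four operators `#`, `##`, `%` and `%%` are easiest to compare on one value. A small sketch; the URL is a made-up example on the reserved example.com domain.

```bash
url=https://example.com/downloads/archive.tar.gz

file=${url##*/}     # longest prefix matching */ removed   -> archive.tar.gz
dir=${url%/*}       # shortest suffix matching /* removed  -> https://example.com/downloads
base=${file%%.*}    # longest suffix matching .* removed   -> archive
ext=${file#*.}      # shortest prefix matching *. removed  -> tar.gz

printf '%s\n' "$file" "$dir" "$base" "$ext"
```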

This can also be used to check whether a string contains a substring:

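# ${1##*target*} is empty when $1 contains "target" (the whole string is removed)
# true when $1 is non-empty and does NOT contain "target"; use -z to test for presence
[ ! -z "${1##*target*}" ]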
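
Wrapped in a function, the same idea gives a reusable substring test. This is a sketch under two assumptions of mine: the helper name `contains` is hypothetical, and an empty haystack is treated as containing nothing.

```bash
# contains HAYSTACK NEEDLE: succeed when NEEDLE occurs inside HAYSTACK.
# Quoting "$needle" inside the pattern makes glob characters in it literal.
contains() {
    local haystack=$1 needle=$2
    [[ -n $haystack && -z ${haystack##*"$needle"*} ]]
}

if contains "hello world" "lo w"; then
    echo "found"
fi
```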

A pattern can also follow # or ##, which makes them more powerful, for example:

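# extended patterns such as +( ) need extglob enabled first
shopt -s extglob

# shortest match: remove one leading dash
${1#+(-)}
# longest match: remove all leading dashes
${1##+(-)}

# remove leading whitespace ([[:space:]] already includes [[:blank:]])
# note the double [[ ]] wrapper
${1##*([[:blank:]]|[[:space:]])}
# remove trailing whitespace
${1%%*([[:blank:]]|[[:space:]])}

# remove a .tar.gz or .tgz suffix; the @ operator is required
${1%%@(.tgz|.tar.gz)}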
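
Put together, these extended patterns make small string helpers possible without sed. A minimal sketch; `trim` and `strip_archive_suffix` are hypothetical helper names.

```bash
shopt -s extglob    # must be on before the patterns below are used

# trim STRING: remove leading and trailing whitespace
trim() {
    local s=$1
    s=${s##+([[:space:]])}    # longest run of leading whitespace
    s=${s%%+([[:space:]])}    # longest run of trailing whitespace
    printf '%s\n' "$s"
}

# strip_archive_suffix NAME: drop a trailing .tgz or .tar.gz
strip_archive_suffix() {
    printf '%s\n' "${1%@(.tgz|.tar.gz)}"
}

trim "   padded value   "
strip_archive_suffix pkg-1.0.tar.gz
```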

Parameter substitution (search and replace): ${parameter/pattern/string} replaces pattern in parameter with string.

foo=JPG.JPG 
# replace first match
# jpg.JPG
echo ${foo/JPG/jpg}
# replace all matches
# jpg.jpg
echo ${foo//JPG/jpg}
# replace only start
# jpg.JPG
echo ${foo/#JPG/jpg}
# replace only end
# JPG.jpg
echo ${foo/%JPG/jpg}
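
A typical use is tidying a file name before renaming it. The names below are invented for the example.

```bash
name="my holiday photo.JPG"
safe=${name// /_}               # replace every space with an underscore
renamed=${safe/%.JPG/.jpg}      # replace .JPG only when it is at the end
echo "$renamed"
```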

Case conversion
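
The note above leaves this heading empty, so the following is only a sketch of the case-modification forms I assume it refers to, which bash 4 and newer provide (`^^`, `,,`, `^`). The variable names are illustrative.

```bash
title="hello World"
upper=${title^^}          # every character to upper case
lower=${title,,}          # every character to lower case
capitalized=${title^}     # only the first character to upper case
printf '%s\n' "$upper" "$lower" "$capitalized"
```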

Shell Globs

A glob is a wildcard that is processed by the shell and expands into a list of arguments.

A glob is like a regular expression but less expressive and easier to use. Globs match file names, for example ls [0-9]?file*.txt, whereas regular expressions match text, for example ls | grep '[0-9].file.*\.txt'. Sometimes the distinction looks blurred depending on how you use them. Both can be used in conditions: globs with [[ == ]] and in case patterns, regular expressions with [[ =~ ]].

In ls [0-9]?file*.txt, ls itself does not support regular expressions or globs; the shell expands the glob and passes the resulting file names to ls. In grep '^A.*\.txt' *.txt, grep applies the regular expression to the contents of the files, whose names the shell has expanded from the glob *.txt.

Shell expansion types, in the order the shell performs them (top to bottom). Steps 2 to 5 are carried out together, left to right:

  1. brace expansion touch file{1..2}
  2. tilde expansion ls ~
  3. parameter and variable expansion ${1:1:1}, ${PATH}
  4. arithmetic expansion echo $((11 + 22))
  5. command substitution $() or ``
  6. word splitting
  7. filename expansion echo file*.txt
  8. quote removal echo "$USER"
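
One visible consequence of this order: brace expansion runs before variables are expanded, so a variable cannot be used as a range bound. A small sketch with illustrative variable names:

```bash
n=3
# Brace expansion sees the text {1..$n}, which is not a valid range, and leaves it
# alone; only afterwards is $n replaced, giving the literal string {1..3}.
literal=$(echo {1..$n})

# A C-style loop works with a variable bound instead.
counted=""
for ((i = 1; i <= n; i++)); do
    counted+="$i "
done

echo "$literal"
echo "$counted"
```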

Wildcards

ls *.txt
# ? is any one char
ls file?.txt
ls file??.txt

Character Set

Note that on Linux, depending on the locale setting, the range in this glob also matches the lowercase letters b to z; only a is left out:

ls /usr/sbin/[A-Z]*
# the default dictionary (collation) order is actually
# aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
# so [A-Z] runs from A to Z, including b to z, but not a

The fix is to use a POSIX character class (see the next section). The POSIX standard introduced a concept called a locale, which can be adjusted to select the character set needed for a particular location. We can see the language setting of our system using the following command:

echo $LANG
# usually is
en_US.UTF-8
# so the correct way to write the [A-Z]* pattern above is
ls /usr/sbin/[[:upper:]]*

Note the difference from brace expansion {}: brace expansion generates words, whereas a character set matches existing names:

# character set
# match one of them
ls [123abc]file.txt
ls file[0-9].txt
ls file[a-z9].txt

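# the result depends on the bash locale settings, the same issue as above,
# except that here LC_COLLATE has been changed:
# in the bash terminal, set LC_COLLATE=C (collation)
ls file[A-G].txt
# ! is inversion: do not match a-z
ls file[!a-z].txt
# put ! at the end to match a literal !, since it is special at the start
ls file[a-z!].txt
# put - at the end to match a literal -
ls file[az-].txt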

Character classes

# [:upper:] is the character class
# put char class in char set []
ls file[[:upper:]?].txt
ls file[[:lower:]?].txt
ls file[![:lower:][:space:]].txt

Other useful classes:

# numbers
[:digit:]
# upper and lower case
[:alpha:]
# upper and lower and numbers
[:alnum:]
# upper
[:upper:]
# lower
[:lower:]
# space, tab, carriage return, newline, vertical tab, and form feed.
# is superset of [:blank:]
[:space:]
# space and tab characters
[:blank:]
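
Character classes work in case patterns as well as in file name globs. A minimal sketch that classifies a single character; `classify` is a hypothetical helper name.

```bash
# classify CHAR: name the first character class that matches a one-character argument
classify() {
    case $1 in
        [[:upper:]]) echo upper ;;
        [[:lower:]]) echo lower ;;
        [[:digit:]]) echo digit ;;
        [[:blank:]]) echo blank ;;    # space or tab
        [[:space:]]) echo space ;;    # remaining whitespace, such as newline
        *)           echo other ;;
    esac
}

classify A
classify 7
classify ' '
```

The [[:blank:]] branch comes before [[:space:]] because every blank character is also a space character, and case stops at the first match.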

Shell globbing Options

Use the shopt command to configure glob behavior, for example nullglob, extglob, etc.

shopt -s extglob is needed when using extended pattern matching operators (see here). shopt -s nocasematch makes bash match case-insensitively in case and [[ ]] conditions; I learned this one from the Chinese edition of a bash tutorial. shopt is a bash built-in way to change settings, unlike set, which comes from POSIX.
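
A short sketch of what two of these options change. It assumes mktemp is available; the directory and variable names are illustrative.

```bash
workdir=$(mktemp -d)    # empty scratch directory, so *.log matches nothing

shopt -u nullglob
without=("$workdir"/*.log)    # unmatched glob stays as the literal pattern: 1 element
shopt -s nullglob
with=("$workdir"/*.log)       # unmatched glob expands to nothing: 0 elements
shopt -u nullglob

shopt -s nocasematch
if [[ README == readme ]]; then case_insensitive=yes; else case_insensitive=no; fi
shopt -u nocasematch

echo "${#without[@]} ${#with[@]} $case_insensitive"
rmdir "$workdir"
```

nullglob is handy in for loops over files, where the loop body should simply not run when nothing matches.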

Extended Globs

You need to enable it first:

shopt | grep extglob
shopt -s extglob

For example, create test cases:

touch file1.png photo.jpg photo photo.png file.png photo.png.jpg
# clean up after trying the patterns below
rm -f file1.png photo.jpg photo photo.png file.png photo.png.jpg

# @(match): match exactly one of the alternatives
# match photo.jpg
ls photo@(.jpg)
# matches only a file named file (none exists in the test cases)
ls @(file)
# photo.jpg or photo.png
ls photo@(.jpg|.png)

# ?(match): match 0 or 1
ls photo?(.jpg|.png)

# +(match): match 1 or more
ls photo+(.jpg|.png)

# *(match): match 0 or more
ls photo*(.jpg|.png)

# !(match): invert match
# everything except names that start with file or photo and end with .jpg or .png
# among the test files only photo is listed
ls !(+(file|photo)*+(.jpg|.png))
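
The same operators can be checked without touching the file system by matching names inside [[ == ]]. A sketch that reuses the test file names from above; `is_image` and `matched` are hypothetical names.

```bash
shopt -s extglob

# is_image NAME: succeed when NAME ends in .jpg, .jpeg or .png
is_image() {
    [[ $1 == *.@(jpg|jpeg|png) ]]
}

# collect the names that match photo followed by one or more image suffixes
matched=()
for name in file1.png photo.jpg photo photo.png file.png photo.png.jpg; do
    if [[ $name == photo+(.jpg|.png) ]]; then
        matched+=("$name")
    fi
done

echo "${matched[*]}"
```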

Mainly used on the command line, in if conditions with [[ == ]], and in case patterns; faster than regular expression matching with [[ =~ ]].

Brace Expansion

Useful for things like a for loop counter, or creating files with a common prefix or suffix.

# create file1.txt file2.txt file4.txt
touch file{1,2,4}.txt
touch file{1..1000}.txt

# expand from left to right, producing every combination
echo {a..c}{10..15}

# specify increase step
echo {1..100..2}
# can pad with leading zeros
echo {0001..10..2}
echo {10..0}

echo {a..z..2}

# can be nested
echo file-201{1..9}-{0{1..9},1{0..2}}-{1..30}.{tar,bak}.{tgz,bz2}
# create folder structure
mkdir -p 20{10..20}/{01..12}

To copy or rename a file easily:

# the empty alternative before the comma yields the path itself
# expands to: cp -f a/long/path/file a/long/path/file.bkp
cp -f a/long/path/file{,.bkp}
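
Capturing brace expansions in arrays is a safe way to inspect what they produce before running touch, mkdir or cp. A minimal sketch; the array names and report.txt are illustrative, and zero-padded ranges assume bash 4 or newer.

```bash
months=( {01..12} )                # zero-padded sequence: 01 02 ... 12
grid=( {a..b}{1..2} )              # every combination: a1 a2 b1 b2
backup_pair=( report.txt{,.bkp} )  # report.txt report.txt.bkp

echo "${#months[@]} months, first ${months[0]}, last ${months[11]}"
echo "${grid[*]}"
echo "${backup_pair[*]}"
```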

Regular Expression

Note the difference between regular expressions and globs: regular expressions match text, whereas globs are expanded by the shell. Places where regular expressions are used:

  • grep
  • sed
  • awk
  • if [[ =~ ]]
  • vim for search
  • less for search
  • find -regex
  • locate -regex

See Regular Expression Info. POSIX regular expressions come in two flavors, basic (BRE) and extended (ERE). ERE syntax is listed below; note that ERE treats ( ) { } ? + | as operators, whereas POSIX BRE needs a backslash for grouping and intervals (\( \) \{ \}) and has no ? + | at all:

  • . matches one char
  • [ ] character set
  • \ escape single char
  • | alternation: match to occur from among a set of expressions
  • ( ) pattern grouping, for example, separate | with others: ^(AA|BB|CC)+
  • ? * + { } repetition operators
  • ^abc leading anchor
  • abc$ trailing anchor
  • [^abc] negates the character set, ^ must appear first inside the brackets

Use ERE whenever possible. Support in GNU tools and bash:

  1. grep -E ERE, grep [-G] default is BRE
  2. sed -E ERE, sed default BRE
  3. awk only supports ERE
  4. [[ =~ ]] ERE

# match any one char, zero to 3 occurrences ({,3} is not defined by POSIX; {0,3} is the portable form)
.{,3}
# match any one char, 3 to 7 occurrences
.{3,7}
# match any one char, 3 or more occurrences
.{3,}
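
With anchors added, an interval expression becomes a length check inside [[ =~ ]]. A small sketch; the function name is hypothetical.

```bash
# length_between_3_and_7 STRING: succeed when STRING has 3 to 7 characters
length_between_3_and_7() {
    local re='^.{3,7}$'    # keep the regex in a variable so the shell does not reinterpret it
    [[ $1 =~ $re ]]
}

for word in ab abc abcdefg abcdefgh; do
    if length_between_3_and_7 "$word"; then
        echo "$word: ok"
    else
        echo "$word: rejected"
    fi
done
```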

Backreferences

A backreference recalls the text matched by an earlier group, which is stored in a buffer; at most nine are available, \1 to \9:

# \1 refers to the text matched by (ss)
(ss).*\1
(ss).*\1.*\1

# five-letter palindromes, for example radar, opapo
^(.)(.).\2\1$

POSIX ERE does not define backreferences; the GNU implementations support them as an extension.
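
The palindrome pattern can be tried with grep. This sketch assumes GNU grep (or another grep that accepts backreferences in -E mode); the word list is made up.

```bash
# keep only the five-letter palindromes from the list
palindromes=$(printf '%s\n' radar opapo hello | grep -E '^(.)(.).\2\1$')
echo "$palindromes"
```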

Bash Extended Regexp

Used in [[ =~ ]] in an if condition; it is simpler to write than extended globs but less efficient.

BASH_REMATCH: regular expression match, the matched text is placed into array BASH_REMATCH:

[[ abcdef =~ b.d ]]
# the matched text, bcd, is in BASH_REMATCH[0]
echo ${BASH_REMATCH[0]}
# if there is no match, BASH_REMATCH[0] is null

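${BASH_REMATCH[0]} is very useful because it holds the matched text. If the pattern contains several ( ) groups, the text matched by each group is stored in turn in ${BASH_REMATCH[n]}, where n is 1/2/3… and index 0 still holds the whole match.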
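
A minimal sketch of reading the groups back, using an invented date string; the variable names are illustrative.

```bash
date_re='^([0-9]{4})-([0-9]{2})-([0-9]{2})$'
input=2024-01-31

if [[ $input =~ $date_re ]]; then
    whole=${BASH_REMATCH[0]}    # the entire match
    yyyy=${BASH_REMATCH[1]}     # first ( ) group
    mm=${BASH_REMATCH[2]}       # second ( ) group
    dd=${BASH_REMATCH[3]}       # third ( ) group
    echo "$whole -> $yyyy / $mm / $dd"
fi
```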

Grep EREs

grep stands for global regular expression print. Stick to grep -E 'xxxx'.

  • grep -E -w matches only whole words.
  • grep -E -x matches only whole lines, the same as using anchors.
  • grep -E -o returns only the text that matches the expression.
  • grep -E -q is quiet mode, used to verify that the search item exists; it was the usual approach before [[ =~ ]] was available.
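
The options can be compared on one small input. A sketch with made-up sample text; it assumes a grep that supports -w and -o, as GNU grep does.

```bash
text=$'cat\nconcatenate\nthe cat sat'

whole_word=$(grep -E -w 'cat' <<<"$text")       # lines where cat is a whole word
whole_line=$(grep -E -x 'cat' <<<"$text")       # lines that are exactly cat
only_match=$(grep -E -o 's[a-z]t' <<<"$text")   # just the matching text
if grep -E -q 'con(cat)+' <<<"$text"; then      # no output, only the exit status
    found=yes
else
    found=no
fi

printf '%s\n' "$whole_word" "--" "$whole_line" "--" "$only_match" "--" "$found"
```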

Sed EREs

See my Sed blog.

Awk EREs

Awk supports only ERE by default; see my dedicated awk blog.
