U4 - Shell Pattern Matching
U4 - Shell Pattern Matching
Shell globbing
Pattern matching in the shell against filenames has metacharacters defined differently from the
rest of unix pattern matching prgorams. * is match any character except whitespace, ? is match
one character except whitespace. so *.c is match any filename ending with the two characters .c
(this will list out all c source files in the directory, assuming the directory's owner is sane).
grep, sed
Table of metacharacters:
Sed commands,
Examples
Wildcards have been around forever. Some even claim they appear in the hieroglyphics of the
ancient Egyptians. Wildcards allow you to specify succinctly a pattern that matches a set of
filenames (for example, *.pdf to get a list of all the PDF files). Wildcards are also often referred
to as glob patterns (or when using them, as "globbing"). But glob patterns have uses beyond just
generating a list of useful filenames. The bash man page refers to glob patterns simply as
"Pattern Matching".
First, let's do a quick review of bash's glob patterns. In addition to the simple wildcard characters
that are fairly well known, bash also has extended globbing, which adds additional features.
These extended features are enabled via the extglob option.
Pattern Description
* Match zero or more characters
? Match any single character
[...] Match any of the characters in a set
?(patterns) Match zero or one occurrences of the patterns (extglob)
*(patterns) Match zero or more occurrences of the patterns (extglob)
+(patterns) Match one or more occurrences of the patterns (extglob)
@(patterns) Match one occurrence of the patterns (extglob)
!(patterns) Match anything that doesn't match one of the patterns (extglob)
For example:
$ ls
a.jpg b.gif c.png d.pdf ee.pdf
$ ls *.jpg
a.jpg
$ ls ?.pdf
d.pdf
$ ls [ab]*
a.jpg b.gif
$ ls ?(*.jpg|*.gif)
a.jpg b.gif
When first using extended globbing, many of them didn't seem to do what I initially thought they
ought to do. For example, it appeared to me that, given a.jpg, the pattern ?(*.jpg|a.jpg)
should not match, because a.jpg matched both patterns, and the ? is "zero or one", right?
Wrong. My confusion was due to a misreading of the description: it's not the filename that
can match only once, it's the pattern that can match only once. Think of it terms of regular
expressions:
$ ls *.pdf
ee.pdf e.pdf .pdf
And while I'm comparing glob patterns to regular expressions, there's an important point to be
made that may not be immediately obvious: glob patterns are just another syntax for doing
pattern matching in general in bash. And you can use them in a number of different places:
The following example uses pattern matching in the expression of an if statement to test
whether a variable has a value of "something" or "anything":
$ shopt +s extglob
$ a=something
$ if [[ $a == +(some|any)thing ]]; then echo yes; else echo no; fi
yes
$ a=anything
$ if [[ $a == +(some|any)thing ]]; then echo yes; else echo no; fi
yes
$ a=nothing
$ if [[ $a == +(some|any)thing ]]; then echo yes; else echo no; fi
no
The following example uses pattern matching in a case statement to determine whether a file is
an image file:
shopt +s extglob
for f in $*
do
case $f in
!(*.gif|*.jpg|*.png)) # ! == does not match
echo "Not an image: $f"
;;
*)
echo "Image: $f"
;;
esac
done
$ bash script.sh a.jpg b.gif c.png d.pdf e.pdf
Image: a.jpg
Image: b.gif
Image: c.png
Not an image: d.pdf
Not an image: e.pdf
In the example above, the pattern !(*.gif|*.jpg|*.png) will match a filename if it's not a gif,
jpg or png.
The following example uses pattern matching in a %% parameter expansion to remove the
extension from all image files:
shopt -s extglob
for f in $*
do
echo ${f%%*(.gif|.jpg|.png)}
done
$ bash script.sh a.jpg b.gif c.png d.pdf e.pdf
a
b
c
d.pdf
e.pdf
A feature that I just recently became aware of is that you can do the above action in one fell
swoop: if you use "*" or "@" as the variable name, the transformation is done on all the
command-line arguments at once. [Note to self: always read the last half of the paragraph from
now on]:
shopt -s extglob
echo ${*%%*(.gif|.jpg|.png)}
$ bash script.sh a.jpg b.gif c.png d.pdf e.pdf
a b c d.pdf e.pdf
shopt -s extglob
array=($*)
echo ${array[*]%%*(.gif|.jpg|.png)}
$ bash script.sh a.jpg b.gif c.png d.pdf e.pdf
a b c d.pdf e.pdf
The biggest takeaway here is to stop thinking of wildcards as a mechanism just to get a list of
filenames and start thinking of them as glob patterns that can be used to do general pattern
matching in your bash scripts. Think of glob patterns as regular expressions in a different
language.