Regular Expressions
Using the different regular expressions
A regular expression is a way of describing string patterns in text. Regular expressions may be used to search for patterns in text and to modify the matched text. They can also be used to launch commands or applications based on the recognized patterns. Although regular expressions are very powerful, they have a steep learning curve and are not easy to read or debug. The best way to learn regular expressions is through constant practice and trial and error.
Linux command line tools like grep, sed, and awk are 'wrapper' tools for regular expression processing. Text editors like vi and emacs also allow the use of regular expressions in search and replace operations. Scripting languages like Perl, Python, and Tcl offer extensive regular expression support, while object-oriented languages like Java provide regular expression objects with an extensive set of methods. Linux command line shells also support a limited, regular-expression-like form of pattern matching via wildcards.
The implementation of regular expressions is fairly uniform across platforms and tools, but there are a few variations: some tools add features that are not available in others. We will look at regular expressions as used by the Linux search utilities sed and awk.
Regular expressions that describe text patterns are surrounded by forward slashes. The actual searched regular expression is in between the first pair of slashes. Options, switches, and replace patterns may follow after the second slash.
/<Search_Pattern>/
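As a minimal sketch of the slash notation, the commands below assume GNU sed on Linux and use a made-up sample text; the variable names are illustrative only. The first command uses a bare search pattern, the second places a replacement and an option after the second slash.

```shell
# Hypothetical sample input, used only for illustration.
text='alpha one
beta two
gamma one'

# /one/ is the search pattern between the first pair of slashes.
# -n suppresses automatic printing; the p command prints matching lines.
matched=$(printf '%s\n' "$text" | sed -n '/one/p')
printf '%s\n' "$matched"

# In a substitution, the replacement and the g option follow the second slash.
replaced=$(printf '%s\n' "$text" | sed 's/one/1/g')
printf '%s\n' "$replaced"
```

The first command prints only the two lines that contain 'one'; the second prints all three lines with every 'one' replaced by '1'.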
Using regular expressions
Regular expressions are constructed using combinations of meta-characters, which are characters with a special meaning, and literals or plain characters. Conventional notations (e.g. a backslash followed by a character) are used to indicate the following within a search pattern: position anchors like the start or end of a line, groups of characters to look for, a range of characters (a-z, 0-9 etc.), or a specified quantity of one or more characters. The following examples demonstrate regular expression construction. In the following examples, the matched part of the text is in bold:
Anchors
The caret '^' matches the beginning of a line while the '$' matches the end of a line.
| Pattern | Matches | Does Not Match |
| /^S/ | **S**he sells seashells; **S**ea water is salty | The rain in Spain; seven sunny Saturdays |
| /y$/ | Wacky and Tack**y**; A major is one and twent**y** | Yea ole sweet shop; SWEET AND SALTY |
| /^Just me$/ | **Just me** | just me |
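The anchor patterns from the table can be tried out with a short sketch like the one below. It assumes GNU sed and feeds a few of the table's sample lines through `sed -n '/pattern/p'`, which prints only the lines that match; the variable names are made up for this example.

```shell
# Sample lines taken from the table above.
lines='She sells seashells
The rain in Spain
Wacky and Tacky
Just me
just me'

# ^S : lines that begin with a capital S
starts_with_S=$(printf '%s\n' "$lines" | sed -n '/^S/p')

# y$ : lines that end with a lowercase y
ends_with_y=$(printf '%s\n' "$lines" | sed -n '/y$/p')

# ^Just me$ : lines that consist of exactly "Just me"
exact=$(printf '%s\n' "$lines" | sed -n '/^Just me$/p')

printf 'starts with S: %s\n' "$starts_with_S"
printf 'ends with y:   %s\n' "$ends_with_y"
printf 'exact line:    %s\n' "$exact"
```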
Groups and Character Ranges
- [abcde] - Matches any one of the characters a, b, c, d, e
- [a-e] - Same as above
- [a-z] - Matches any lowercase letter
- [^a-e] - Matches any character other than lower case a, b, c, d, e ('^' as the first character within square brackets stands for group negation and not beginning of line)
- \<pattern\> - Matches 'pattern' as a whole word only (i.e. bounded by non-word characters such as spaces or punctuation, or by the start or end of the line)
- . - Matches any single character
- \ - Escapes the metacharacter that follows so that it is matched literally (e.g. use \. to match an actual dot rather than have it stand for any character)
- \character - Used to search for control characters and certain groups (\s - any white space, \t - tab etc.)
| Pattern | Matches | Does Not Match |
| /[0-9]/ | We have **9** waiters; She has **4** children | No numbers in this string |
| /\<it\>/ | **it** is legitimate; its fresh, take **it** home | She has a petit frame |
| /^[a-zA-Z]$/ | Lines that consist of exactly one letter | Anything else |
| /[^a-z]/ | some**V**ariable; **A**nita | alllowercase |
| /^A.$/ | At, An, A@, A$ (lines made up of 'A' followed by any single character) | Anything else |
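A rough way to check the character-class patterns above is to run them through sed. The sketch below assumes GNU sed (the \< and \> word boundaries are GNU extensions) and sets LC_ALL=C so that ranges such as [a-z] cover only ASCII lowercase letters; the sample lines and variable names are chosen for illustration.

```shell
# Use the C locale so [a-z] and [A-Z] mean plain ASCII ranges (assumption of this sketch).
LC_ALL=C; export LC_ALL

lines='We have 9 waiters
No numbers in this string
take it home
She has a petit frame
A
At'

# Lines containing at least one digit
has_digit=$(printf '%s\n' "$lines" | sed -n '/[0-9]/p')

# Lines containing "it" as a whole word ("waiters" and "petit" do not count)
whole_word=$(printf '%s\n' "$lines" | sed -n '/\<it\>/p')

# Lines consisting of exactly one letter
single_letter=$(printf '%s\n' "$lines" | sed -n '/^[a-zA-Z]$/p')

# Lines consisting of 'A' followed by exactly one character
a_then_any=$(printf '%s\n' "$lines" | sed -n '/^A.$/p')

# Lines containing at least one character that is not a lowercase letter
non_lower=$(printf '%s\n' 'someVariable' 'alllowercase' | sed -n '/[^a-z]/p')

printf '%s\n' "$has_digit" "$whole_word" "$single_letter" "$a_then_any" "$non_lower"
```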
Quantitative Modifiers
- * - Matches zero or more instances of the preceding character/regular expression
- ? - Matches zero or one instance of the preceding character/regular expression (GNU sed's default basic syntax writes this as \?; plain ? applies with sed -E and in awk)
- + - Matches one or more instances of the preceding character/regular expression (\+ in GNU sed's default basic syntax; plain + with sed -E and in awk)
- \{n,m\} - Matches at least n and at most m instances of the preceding character
- \{n\} - Matches exactly n repetitions of the preceding character
- \{n,\} - Matches at least n instances of the preceding character
- \| - Conditional: matches either the preceding or the following character/regular expression
| Pattern | Matches |
| /^[A-Z].*$/ | Any line that starts with an uppercase letter |
| /is*/ | Matches i, is, iss, isss (i followed by 0 or more s) |
| /[0-9][0-9]*/ | Matches one or more digits |
| /[0-9][0-9]+/ | Matches two or more digits |
| /[0-9][0-9]?/ | Matches one or two digit numbers |
| /[Uu]ser ?[Nn]ame/ | Matches username, userName, Username, UserName with or without space between 'user' and 'name' |
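The quantifiers can be exercised with a sketch along these lines. It assumes GNU sed, uses the -E option so that + and ? can be written unescaped, and adds ^ and $ anchors to the digit patterns so that the whole line has to match; without the anchors, a pattern matches any line that merely contains a matching substring. The input lines and variable names are illustrative.

```shell
# Use the C locale so [0-9] and letter ranges behave as plain ASCII (assumption of this sketch).
LC_ALL=C; export LC_ALL

lines='7
42
2024
user name
UserName
usr'

# One or two digits on the line (? = zero or one of the preceding item)
one_or_two=$(printf '%s\n' "$lines" | sed -E -n '/^[0-9][0-9]?$/p')

# Two or more digits on the line (+ = one or more of the preceding item)
two_or_more=$(printf '%s\n' "$lines" | sed -E -n '/^[0-9][0-9]+$/p')

# Exactly four digits, using the basic-syntax interval \{4\}
exactly_four=$(printf '%s\n' "$lines" | sed -n '/^[0-9]\{4\}$/p')

# "user" and "name" in either capitalization, with an optional space between them
usernames=$(printf '%s\n' "$lines" | sed -E -n '/[Uu]ser ?[Nn]ame/p')

printf '%s\n' "$one_or_two" "$two_or_more" "$exactly_four" "$usernames"
```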
Group Operators
Group operators are used to define groups within the matched pattern. Each matched group may later be retrieved:
- \(<pattern>\) - pattern group delimiter
- \n - refers to the nth matched group (\1 for the first group, \2 for the second, and so on)
The following pattern matches lines that contain a sequence of letters followed by a space and a sequence of digits:
/[a-zA-Z]+ [0-9]+/
If we need the matched number or the matched letter sequence, we can write this regular expression in the following way (in sed's default basic syntax, where the group delimiters are escaped, GNU sed expects the + quantifier to be escaped as well):
/\([a-zA-Z]\+\) \([0-9]\+\)/
A sed command may now retrieve the letter sequence using \1 (the first group delimited by parentheses) and the number using \2 (the second group delimited by parentheses). For example, for the line 'Joseph 29110', \1 would return 'Joseph' and \2 would return '29110'.
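To make the group references concrete, here is a small sketch that applies them to the 'Joseph 29110' line. It assumes GNU sed; the first two commands use the basic syntax with escaped parentheses and \+, and the last one shows the same swap with the -E option, where parentheses and + are written unescaped. The variable names are made up for this example.

```shell
# Use the C locale so the letter ranges are plain ASCII (assumption of this sketch).
LC_ALL=C; export LC_ALL

line='Joseph 29110'

# Keep only the first group (the letters).
name_only=$(printf '%s\n' "$line" | sed 's/\([a-zA-Z]\+\) \([0-9]\+\)/\1/')

# Swap the two groups: number first, then the name.
swapped=$(printf '%s\n' "$line" | sed 's/\([a-zA-Z]\+\) \([0-9]\+\)/\2 \1/')

# The same swap in extended syntax (sed -E).
swapped_ere=$(printf '%s\n' "$line" | sed -E 's/([a-zA-Z]+) ([0-9]+)/\2 \1/')

printf '%s\n' "$name_only" "$swapped" "$swapped_ere"
```

This prints 'Joseph', then '29110 Joseph' twice.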
sed (stream editor)
The sed utility performs basic text substitutions and other transformations on input streams from any source (a file, standard input, a pipe etc.). The changes are specified using regular expressions. New lines may be inserted between certain patterns, lines that contain patterns may be deleted, and patterns may be searched for and replaced throughout the file (global search and replace). The sed utility is in fact a programming language, but its syntax is quite archaic and it is now mainly used for text manipulation. Sed is most often used at the command prompt with simple parameters; however, scripts that perform a sequence of transformations may also be written using sed. Note that the input file is not changed: sed writes the modified text to its output, which may be redirected to a file if necessary.
sed [options] "command1" [files]
sed [options] -e "command1" [-e "command2" ...] [files]
sed [options] -f sed_script [files]
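The three invocation forms can be compared with a sketch like the following. It assumes GNU sed and mktemp; the directory, the file contents, and the script name edit.sed are all made up for illustration. The last step reads the input file back to show that sed left it untouched.

```shell
# Create a scratch directory with a hypothetical config file.
workdir=$(mktemp -d)
printf '%s\n' 'host=sitex.com' 'mirror=sitex.com' > "$workdir/sampleconfig"

# Form 1: a single command given directly on the command line.
one=$(sed 's/sitex\.com/mysite.com/' "$workdir/sampleconfig")

# Form 2: several commands, each introduced by -e, applied in order to every line.
two=$(sed -e 's/sitex\.com/mysite.com/' -e 's/^host/server/' "$workdir/sampleconfig")

# Form 3: the same commands stored in a script file and loaded with -f.
printf '%s\n' 's/sitex\.com/mysite.com/' 's/^host/server/' > "$workdir/edit.sed"
three=$(sed -f "$workdir/edit.sed" "$workdir/sampleconfig")

# The input file still holds the original text.
original=$(cat "$workdir/sampleconfig")

printf '%s\n' "$one" "$two" "$three" "$original"
rm -rf "$workdir"
```

Forms 2 and 3 produce identical output here, since the script file contains the same two commands.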
sed Examples
- Substitutes one occurrence per line of sitex.com with mysite.com in the file 'sampleconfig' and writes output to file with name 'config'.
[ LinuxUser ] ~$ sed "s/sitex\.com/mysite\.com/" sampleconfig > config
- Substitutes all occurrences of ftp.sitex.com with ftp.mysite.com in the file 'sampleconfig' and writes output to file with name 'config'.
[ LinuxUser ] ~$ sed "s/ftp\.sitex\.com/ftp\.mysite\.com/g" sampleconfig > config
- Deletes every line that contains any character other than a letter, a digit, or a space
[ LinuxUser ] ~$ sed "/[^0-9a-zA-Z ]/d" somefile > onlyAlphaNumeric
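- Transliterates characters one for one: 'C' becomes 'c', while 'a' and 't' map to themselves. Unlike the s command, y works on single characters, so every 'C' in the input is changed and not only the one in 'Cat'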
[ LinuxUser ] ~$ sed "y/Cat/cat/" filea > fileb
- Deletes lines from 2 through 27 in file mydata.txt:
[ LinuxUser ] ~$ sed '2,27d' mydata.txt
- Processes more than one directive in one command:
[ LinuxUser ] ~$ cat myfile.txt
An apple is a fruit, so is an orange
[ LinuxUser ] ~$ sed -e 's/apple/orange/g' -e 's/orange/pear/g' myfile.txt
An pear is a fruit, so is an pear
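Two details of the examples above are easy to overlook: y replaces single characters wherever they occur, and multiple -e commands run in the order given, each one seeing the result of the previous one. The sketch below, which assumes GNU sed and uses invented sample lines and variable names, illustrates both.

```shell
line='Cat and Cow chat'

# y maps characters one for one, so every C becomes c.
translit=$(printf '%s\n' "$line" | sed 'y/Cat/cat/')

# s replaces the string "Cat" only; "Cow" keeps its capital C.
subst=$(printf '%s\n' "$line" | sed 's/Cat/cat/')

fruit='An apple is a fruit, so is an orange'

# apple -> orange first, then every orange (old and new) -> pear.
forward=$(printf '%s\n' "$fruit" | sed -e 's/apple/orange/g' -e 's/orange/pear/g')

# Reversing the order gives a different result:
# orange -> pear first, then apple -> orange.
reverse=$(printf '%s\n' "$fruit" | sed -e 's/orange/pear/g' -e 's/apple/orange/g')

printf '%s\n' "$translit" "$subst" "$forward" "$reverse"
```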
A script using sed
The following simple exercise finds the first occurrence in a file of any word from a list and returns the found word. This shell script uses $1 to refer to its first argument (the input file). If you save the script with the name sedScript.sh, make it executable with 'chmod ugo+x sedScript.sh' and execute it as follows:
[ LinuxUser ] ~$ ./sedScript.sh inputFile.txt
The script follows. As always, it starts with a line that says that the script should be run by the Bourne shell. Our list of words is apple, orange, grape, banana, and pear. This program will quit as soon as it finds one of these words in the input file.
#!/bin/sh
List='\<apple\>\|\<orange\>\|\<grape\>\|\<banana\>\|\<pear\>'
sed -e "
/$List/!d
/$List/{
s/\($List\).*/\1/
s/.*\($List\)/\1/
q
}" "$1"
- A variable 'List' is set to a regular expression that contains a list of words separated by the conditional operator to tell sed to search for 'apple' or 'orange' or 'grape' or 'banana' or 'pear'.
- The line '/$List/!d' tells sed to delete all lines that do not contain the pattern $List (d would delete the lines containing the pattern; !d deletes the lines that do not contain the pattern). These lines are not deleted from the input file; they are simply left out of sed's output.
- The next command uses the two regexes /\($List\).*/ and /.*\($List\)/ to isolate the first listed word on the line. The first substitution matches from the first listed word to the end of the line, because .* matches as many characters as possible (greedy matching); on a line containing 'apple orange banana', the whole line is matched and replaced by 'apple'. The second substitution then removes any text that precedes the word.
- The group operator - the pair of escaped parentheses \( \) - saves the found word, and we recover it using \1. The 's' command simply replaces the matched text with the found word held in \1.
- The program quits as soon as the word is found.
- Two regexes - /\($List\).*/ and /.*\($List\)/ - are used because the word may occur at the beginning, end, or middle of the line. Shortening this to /.*\($List\).*/ will not work: the leading .* is greedy, so when a line contains more than one matching word (e.g. 'orange pear banana'), \1 holds the last matching word rather than the first.
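The difference between the two-step substitution and the shortened pattern can be observed with a sketch like this one. It assumes GNU sed (for \<, \>, and \|) and feeds a made-up three-line sample through standard input instead of a file; the variable names are illustrative.

```shell
# Use the C locale for predictable matching (assumption of this sketch).
LC_ALL=C; export LC_ALL

List='\<apple\>\|\<orange\>\|\<grape\>\|\<banana\>\|\<pear\>'

# Hypothetical input: the first line has no listed word, the second has three.
sample='no fruit on this line
I like orange pear banana today
grape comes later'

# Two-step version from the script: returns the first listed word.
first=$(printf '%s\n' "$sample" | sed -e "
/$List/!d
/$List/{
s/\($List\).*/\1/
s/.*\($List\)/\1/
q
}")

# Shortened version: the greedy leading .* pushes the group to the last listed word.
short=$(printf '%s\n' "$sample" | sed -e "
/$List/!d
s/.*\($List\).*/\1/
q")

printf 'two-step:  %s\n' "$first"
printf 'shortened: %s\n' "$short"
```

With this input the two-step version yields 'orange' while the shortened one yields 'banana'; in both cases sed quits before it reaches the third line.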