Linux administration

Search Text Files Using Regular Expressions

Nguyen Hai Chau
Vietnam National University

grep prints lines from files which match a pattern
For example, to find an entry in the password file /etc/passwd relating to the user nancy:

$ grep nancy /etc/passwd

Use grep to find patterns, as well as simple strings
Patterns are expressed as regular expressions
Certain punctuation characters have special meanings
For example, this might be a better way to search for Nancy's entry in the password file:

$ grep '^nancy' /etc/passwd

The caret (^) anchors the pattern to the start of the line
- In the same way, $ acts as an anchor when it appears at the end of a string, making the pattern match only at the end of a line
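
The effect of the two anchors can be tried on a small sample file. The sketch below is only an illustration: the three passwd-style entries are made up, and a temporary file stands in for /etc/passwd.

```shell
# Build a small sample in passwd format (hypothetical entries, not real accounts).
sample=$(mktemp)
cat > "$sample" <<'EOF'
nancy:x:1001:1001:Nancy:/home/nancy:/bin/bash
bob:x:1002:1002:Friend of nancy:/home/bob:/bin/sh
carol:x:1003:1003:Carol:/home/carol:/bin/nancy
EOF

# Unanchored: counts all three lines, because "nancy" appears somewhere in each.
unanchored=$(grep -c 'nancy' "$sample")

# Anchored at the start: only the line that begins with "nancy".
at_start=$(grep '^nancy' "$sample")

# Anchored at the end: only the line that ends with "nancy".
at_end=$(grep 'nancy$' "$sample")

echo "lines containing nancy: $unanchored"
echo "starts with nancy:      $at_start"
echo "ends with nancy:        $at_end"

rm -f "$sample"
```

Only the anchored searches single out one entry. The unanchored search also picks up lines where "nancy" appears in the comment field or in a path.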

Some regexp special characters are also special to the shell, and so need to be protected with quotes or backslashes
We can match a repeating pattern by adding a modifier:

$ grep -i 'continued\.*'

Dot (.) on its own would match any character, so to match an actual dot we escape it with \
The * modifier matches the preceding character zero or more times
Similarly, the \+ modifier matches one or more times (in basic regexps \+ is a GNU extension)
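
The difference between the two modifiers can be seen on a few test lines. This is a minimal sketch that assumes GNU grep, since \+ is not part of the POSIX basic syntax. The input lines are invented for the example.

```shell
# Three hypothetical input lines, one per printf argument.
lines=$(printf '%s\n' 'to be continued' 'Continued...' 'continue')

# 'continued\.*' : zero or more literal dots, so a line with no dots still matches.
star=$(printf '%s\n' "$lines" | grep -ci 'continued\.*')

# 'continued\.\+' : at least one literal dot is required.
plus=$(printf '%s\n' "$lines" | grep -i 'continued\.\+')

echo "matches with *:  $star"    # 2: 'to be continued' and 'Continued...'
echo "matches with \\+: $plus"   # only 'Continued...'
```

Because * also accepts zero repetitions, 'continued\.*' matches exactly the same lines as plain 'continued'. The modifier only changes the result when at least one repetition is required.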

Multiple subpatterns can be provided as alternatives, separated with \|, for example:

$ grep 'fish\|chips\|pies' food.txt # first method
$ grep -E 'fish|chips|pies' food.txt    # second method

$ grep -i '\(cream\|fish\|birthday\) cakes' delicacies.txt
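
The grouping with \( and \) in the last command limits how far the alternatives reach. The sketch below compares the grouped pattern with an ungrouped one. The menu lines are made up for the illustration, and GNU grep is assumed for \| in basic regexps.

```shell
# Hypothetical contents standing in for delicacies.txt.
menu=$(printf '%s\n' 'Cream cakes' 'fish cakes' 'fish fingers' 'birthday cards')

# Grouped: one of the three words, followed by " cakes".
grouped=$(printf '%s\n' "$menu" | grep -i '\(cream\|fish\|birthday\) cakes')

# Ungrouped: "cream", or "fish", or "birthday cakes", anywhere in the line.
ungrouped=$(printf '%s\n' "$menu" | grep -ci 'cream\|fish\|birthday cakes')

echo "$grouped"                       # Cream cakes, fish cakes
echo "ungrouped matches: $ungrouped"  # 3, because 'fish fingers' now matches too
```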

$ grep '[Jj]oe [Bb]loggs' staff.txt

Any single character from the class matches; and ranges of characters can be expressed as 'a-z'
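
A short check of character classes and ranges, using invented staff names rather than a real staff.txt:

```shell
# Hypothetical contents standing in for staff.txt.
staff=$(printf '%s\n' 'Joe Bloggs' 'joe bloggs' 'JOE BLOGGS' 'Moe Bloggs' 'Room 42')

# [Jj] and [Bb] each match exactly one character from the class.
either_case=$(printf '%s\n' "$staff" | grep -c '[Jj]oe [Bb]loggs')

# A range inside a class: any line containing a digit.
with_digit=$(printf '%s\n' "$staff" | grep '[0-9]')

echo "Joe Bloggs in either case: $either_case"   # 2
echo "lines with a digit: $with_digit"           # Room 42
```

The all-capitals line is not matched, because the class only allows the first letter of each word to vary. The -i option would be the way to ignore case throughout.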

egrep runs grep in a different mode
- Same as grep -E
Special characters don't have to be marked with \
- So \+ is written +, \(...\) is written (...), etc
- In extended regexps, \+ is a literal +
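
The two syntaxes can be compared side by side. The sketch assumes GNU grep, where \+ is accepted in basic regexps. The input lines are arbitrary test data.

```shell
# Arbitrary test input.
sums=$(printf '%s\n' 'aaa' '1+1=2' 'b')

# Basic regexp: + needs a backslash to act as "one or more".
basic=$(printf '%s\n' "$sums" | grep 'a\+')

# Extended regexp: the same modifier is written without the backslash.
extended=$(printf '%s\n' "$sums" | grep -E 'a+')

# Extended regexp: with a backslash, + is just a literal plus sign.
literal_plus=$(printf '%s\n' "$sums" | grep -E '1\+1')

echo "basic:        $basic"          # aaa
echo "extended:     $extended"       # aaa
echo "literal plus: $literal_plus"   # 1+1=2
```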

sed reads input lines, runs editing-style commands on them, and writes them to stdout
sed uses regular expressions as patterns in substitutions. sed regular expressions use the same syntax as grep
For example, to use sed to put # at the start of each line:

$ sed -e 's/^/#/' < input.txt > output.txt

sed has simple substitution and translation facilities, but can also be used like a programming language
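
A few small substitutions and one translation, as a sketch. The input is piped in with printf instead of being read from input.txt, and the sample text is invented.

```shell
# Put # at the start of each line (^ matches the empty string at the line start).
commented=$(printf '%s\n' 'first line' 'second line' | sed -e 's/^/#/')

# Without a flag, s replaces only the first match on each line.
first_only=$(printf '%s\n' 'a-b-c' | sed -e 's/-/+/')

# The g flag replaces every match on the line.
all_matches=$(printf '%s\n' 'a-b-c' | sed -e 's/-/+/g')

# The y command translates characters one for one (a to A, b to B, c to C).
translated=$(printf '%s\n' 'a-b-c' | sed -e 'y/abc/ABC/')

echo "$commented"     # #first line, #second line
echo "$first_only"    # a+b-c
echo "$all_matches"   # a+b+c
echo "$translated"    # A-B-C
```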

a. Use grep to find information about the HTTP protocol in the file /etc/services.
b. Usually this file contains some comments, starting with the # symbol. Use grep with the -v option to ignore lines starting with # and look at the rest of the file in less.
c. Add another use of grep -v to your pipeline to remove blank lines (which match the pattern ^$).
d. Use sed (also in the same pipeline) to remove the information after the / symbol on each line, leaving just the names of the protocols and their port numbers.