Linux administration

Search Text Files Using Regular Expressions

Nguyen Hai Chau
Vietnam National University

Searching Files with grep

  • grep prints lines from files which match a pattern
  • For example, to find an entry in the password file /etc/passwd relating to the user nancy:
$ grep nancy /etc/passwd
  • grep has a few useful options:
    • -i makes the matching case-insensitive
    • -r searches through files in specified directories, recursively
    • -l prints just the names of files which contain matching lines
    • -c prints the count of matching lines in each file
    • -n numbers the matching lines in the output
    • -v reverses the test, printing lines which don't match
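
The options above can be tried on a small sample file. This is only a sketch: the file name people.txt and its contents are invented for the demonstration, and it assumes a shell where mktemp is available.

```shell
# Create a throwaway sample file (hypothetical contents, for illustration only)
tmpdir=$(mktemp -d)
printf '%s\n' 'Nancy likes tea' 'bob likes coffee' 'nancy likes cake' > "$tmpdir/people.txt"

# -i: ignore case, so both "Nancy" and "nancy" match
grep -i nancy "$tmpdir/people.txt"

# -c: print the number of matching lines instead of the lines themselves
count=$(grep -ic nancy "$tmpdir/people.txt")
echo "matching lines: $count"

# -n: prefix each matching line with its line number
numbered=$(grep -n nancy "$tmpdir/people.txt")
echo "$numbered"

# -v: print the lines that do NOT match
others=$(grep -iv nancy "$tmpdir/people.txt")
echo "$others"

# -r with -l: search the directory recursively, print only the matching file names
files=$(grep -rl coffee "$tmpdir")
echo "$files"

# Clean up the sample directory
rm -r "$tmpdir"
```

Options can be combined in one argument, as in -ic and -rl here.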

Pattern Matching

  • Use grep to find patterns, as well as simple strings
  • Patterns are expressed as regular expressions
  • Certain punctuation characters have special meanings
  • For example, this might be a better way to search for Nancy's entry in the password file:
$ grep '^nancy' /etc/passwd
  • The caret (^) anchors the pattern to the start of the line
    • In the same way, $ acts as an anchor when it appears at the end of a pattern, making the pattern match only at the end of a line
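
A minimal sketch of the effect of the two anchors. The passwd-style lines below are invented for illustration and are not taken from a real /etc/passwd.

```shell
# Hypothetical passwd-style data, fed to grep on standard input
sample='nancy:x:1001:1001::/home/nancy:/bin/bash
fnancy:x:1002:1002::/home/fnancy:/bin/sh
bob:x:1003:1003:friend of nancy:/home/bob:/bin/bash'

# Unanchored: every line containing "nancy" anywhere is counted
unanchored=$(printf '%s\n' "$sample" | grep -c 'nancy')
echo "unanchored matches: $unanchored"

# Anchored with ^: only the line that begins with "nancy"
start=$(printf '%s\n' "$sample" | grep '^nancy')
echo "$start"

# Anchored with $: only the line that ends with "/bin/sh"
end=$(printf '%s\n' "$sample" | grep '/bin/sh$')
echo "$end"
```

Note that '^nancy' would still match a user called nancyb. Including the field separator, as in '^nancy:', is one way to tighten the pattern further.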

Matching Repeated Patterns

  • Some regexp special characters are also special to the shell, and so need to be protected with quotes or backslashes

  • We can match a repeating pattern by adding a modifier:

$ grep -i 'continued\.*'
  • Dot (.) on its own would match any character, so to match an actual dot we escape it with \
  • The * modifier matches the preceding character zero or more times
  • Similarly, the \+ modifier matches one or more times (in basic regular expressions this is a GNU grep extension)
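
The difference between * and \+ can be seen on a few invented input lines. This sketch assumes GNU grep for the \+ form. The \{1,\} interval is the POSIX basic-regexp way of writing "one or more".

```shell
# Hypothetical input lines, for illustration only
lines='continued
continued...
Continued.
to be cont'

# \.* allows zero dots, so "continued" with no dot still matches
star=$(printf '%s\n' "$lines" | grep -ic 'continued\.*')
echo "with *: $star"

# \.\+ requires at least one dot (GNU extension in basic regexps)
plus=$(printf '%s\n' "$lines" | grep -ic 'continued\.\+')
echo "with \\+: $plus"

# \.\{1,\} is the portable spelling of the same requirement
interval=$(printf '%s\n' "$lines" | grep -ic 'continued\.\{1,\}')
echo "with \\{1,\\}: $interval"
```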

Matching Alternative Patterns

  • Multiple subpatterns can be provided as alternatives, separated with \|, for example:
$ grep 'fish\|chips\|pies' food.txt     # first method
$ grep -E 'fish|chips|pies' food.txt    # second method
  • Both commands find lines which match at least one of the words
  • Use \(...\) to enforce precedence:
$ grep -i '\(cream\|fish\|birthday\) cakes' delicacies.txt
  • Use square brackets to build a character class:
$ grep '[Jj]oe [Bb]loggs' staff.txt
  • Any single character from the class matches, and ranges of characters can be written in the form a-z, as in [a-z]
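
Alternation, grouping and character classes can be compared side by side on a small invented list. This is a sketch that assumes GNU grep, since \| in basic regexps is a GNU extension. LC_ALL=C is set for the range example because the meaning of a-z can vary with the locale.

```shell
# Hypothetical input lines, for illustration only
menu='fish cakes
cream cakes
rice cakes
Birthday Cakes
joe bloggs
Joe Bloggs
JOE BLOGGS'

# Alternation: lines containing "fish" or "cream"
alt=$(printf '%s\n' "$menu" | grep -c 'fish\|cream')
echo "alternation: $alt"

# Grouping: the alternatives apply only to the first word, " cakes" must follow
grouped=$(printf '%s\n' "$menu" | grep -ic '\(cream\|fish\|birthday\) cakes')
echo "grouped: $grouped"

# Character class: either case for the initial letters only
class=$(printf '%s\n' "$menu" | grep -c '[Jj]oe [Bb]loggs')
echo "class: $class"

# Range: lines starting with a lowercase ASCII letter
range=$(printf '%s\n' "$menu" | LC_ALL=C grep -c '^[a-z]')
echo "range: $range"
```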

Extended Regular Expression Syntax

  • egrep runs grep in a different mode
    • Same as grep -E
  • Special characters don't have to be marked with \
    • So \+ is written +, \(...\) is written (...), etc
    • In extended regexps, \+ is a literal +
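
The following sketch contrasts the two syntaxes on three invented lines. It assumes GNU grep, where \+ is accepted as a repetition operator in basic mode.

```shell
# Hypothetical input lines, for illustration only
text='a+b
aab
ab'

# Basic syntax: \+ means "one or more of the preceding character"
bre=$(printf '%s\n' "$text" | grep -c 'a\+b')
echo "basic a\\+b: $bre"

# Extended syntax: the same operator is written without the backslash
ere=$(printf '%s\n' "$text" | grep -Ec 'a+b')
echo "extended a+b: $ere"

# Extended syntax: with the backslash, + is an ordinary character
literal=$(printf '%s\n' "$text" | grep -E 'a\+b')
echo "extended a\\+b: $literal"
```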

sed

  • sed reads input lines, runs editing-style commands on them, and writes them to stdout
  • sed uses regular expressions as patterns in substitutions. sed regular expressions use the same syntax as grep
  • For example, to use sed to put # at the start of each line:
$ sed -e 's/^/#/' < input.txt > output.txt
  • sed has simple substitution and translation facilities, but can also be used like a programming language
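
A short sketch of the substitution and translation facilities, using invented input fed on standard input rather than real files.

```shell
# Hypothetical input lines, for illustration only
input='alpha
beta'

# s/^/#/ inserts # at the start of every line
commented=$(printf '%s\n' "$input" | sed -e 's/^/#/')
echo "$commented"

# The g flag replaces every match on a line, not just the first one
replaced=$(printf '%s\n' 'a-b-c' | sed -e 's/-/+/g')
echo "$replaced"

# y translates characters one for one: a to x, b to y, c to z
translated=$(printf '%s\n' 'abc' | sed -e 'y/abc/xyz/')
echo "$translated"
```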

Further Reading

Exercise

  • a. Use grep to find information about the HTTP protocol in the file /etc/services.
  • b. Usually this file contains some comments, starting with the # symbol. Use grep with the -v option to ignore lines starting with # and look at the rest of the file in less.
  • c. Add another use of grep -v to your pipeline to remove blank lines (which match the pattern ^$).
  • d. Use sed (also in the same pipeline) to remove the information after the / symbol on each line, leaving just the names of the protocols and their port numbers.