
East Grinstead Linux User Group (EGLUG)

Grep and Sed RegExp Examples Analysed

An excerpt out of a book from the Internet Archive was used as the basis for this workshop, which I delivered to Sussex LUG in August 2011. See bottom of page for list of resources used.

Command No Element Explanation
Find lines containing page numbers, e.g. -5- or - 5 - on its own on a line
(Initially using grep to display all the lines that will be targeted, so we don't lose anything valuable)
grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less
Command is run as normal user at the bash prompt
grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less 1 grep Search utility
grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less 2 ' Enclose in quotes. Single quotes are the safest to use, because they protect your regular expression from the shell
grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less 3 ^ (caret) The ^ character matches the beginning of the line
grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less 4 $ The $ character matches the end of the line
grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less 5 [[:space:]] [[:space:]] matches any white space including tabs
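grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less 6 * Zero or more of the preceding element; here, any amount of white space, including none
grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less 7 \- A literal dash. Outside square brackets a dash has no special meaning, so the backslash is not required; GNU grep accepts the escaped form, although recent versions may warn about a stray backslash
grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less 8 (space) A literal space character. Unlike [[:space:]], it matches only a space, not a tab or other white space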
grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less 9 * (space asterisk) Any number of spaces
grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less 10 [0-9] Matches any digit between 0 and 9
grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less 11 \{ and \} Braces, with backslash escape character, to contain number of repetitions of the preceding element
grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less 12 1,3 When placed inside curly braces, matches contents of preceding square brackets between 1 and 3 times.
grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less 13 input.txt Source file. Content of input.txt remains unchanged
grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less 14 | (pipe) Passes the output of the previous command (grep) as input to the next command (less).
grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less 15 less Displays output one screen at a time, allowing bi-directional scrolling. Screen will be filled with the lines containing page numbers.
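To try the pattern before pointing it at the real file, here is a minimal sketch that builds a small sample file and runs the same grep over it. The sample lines and the temporary directory are hypothetical, not part of the workshop files, and the dashes are written without the backslash, since a dash outside square brackets is already literal.

```shell
# Build a small sample file (invented content) in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '  First paragraph line' \
  '-5-' \
  '   - 12 -' \
  'a line with -5- in the middle' \
  '- 1234 -' \
  > "$tmpdir/input.txt"

# Same pattern as in the table, with the dashes left unescaped.
matches=$(grep '^[[:space:]]*- *[0-9]\{1,3\} *-$' "$tmpdir/input.txt")
printf '%s\n' "$matches"
```

Only the two page-number lines are printed: the -5- in mid-sentence fails the ^ anchor, and the four-digit number exceeds the \{1,3\} limit.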
Remove lines containing page numbers, e.g. -5- or - 5 - on its own on a line
(Now we know the lines that will be targeted, use the sed command to remove them)
sed -e '/^[[:space:]]*\- *[0-9]\{1,3\} *\-$/d' input.txt | less
sed -e '/^[[:space:]]*\- *[0-9]\{1,3\} *\-$/d' input.txt | less 16 sed Stream editor utility, used here to delete a line
sed -e '/^[[:space:]]*\- *[0-9]\{1,3\} *\-$/d' input.txt | less 17 -e Expression option
sed -e '/^[[:space:]]*\- *[0-9]\{1,3\} *\-$/d' input.txt | less 18 / Delimiters. Slashes surround expression to be matched to a line to be deleted
sed -e '/^[[:space:]]*\- *[0-9]\{1,3\} *\-$/d' input.txt | less 19 d Delete
To run in earnest, change the command to:
sed -e '/^[[:space:]]*\- *[0-9]\{1,3\} *\-$/d' test_file_00.txt > test_file_01.txt
where test_file_00.txt is the 33,000-line text file mentioned in the resource list
sed -e '/^[[:space:]]*\- *[0-9]\{1,3\} *\-$/d' test_file_00.txt > test_file_01.txt 20 > Redirects the output of sed into a file instead of the terminal
sed -e '/^[[:space:]]*\- *[0-9]\{1,3\} *\-$/d' test_file_00.txt > test_file_01.txt 21 test_file_01.txt Choose an output file. If it doesn't already exist, it will be created. It will be populated with the lines remaining after the lines containing page numbers have been removed.
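A quick sanity check, sketched here under the assumption of a small made-up test_file_00.txt in a temporary directory: the number of lines sed removes should equal the number of lines grep reported. The variable names are illustrative only, and the dashes are again left unescaped.

```shell
# Create a throwaway stand-in for test_file_00.txt (invented content).
tmpdir=$(mktemp -d)
printf '%s\n' \
  '  First paragraph line' \
  '-5-' \
  'continues here' \
  '   - 12 -' \
  '  Second paragraph' \
  > "$tmpdir/test_file_00.txt"

pattern='^[[:space:]]*- *[0-9]\{1,3\} *-$'

# Count the lines grep would target, then delete them with sed.
targeted=$(grep -c "$pattern" "$tmpdir/test_file_00.txt")
sed -e "/$pattern/d" "$tmpdir/test_file_00.txt" > "$tmpdir/test_file_01.txt"

# Compare line counts before and after the deletion.
before=$(wc -l < "$tmpdir/test_file_00.txt")
after=$(wc -l < "$tmpdir/test_file_01.txt")
echo "removed $((before - after)) of $before lines (grep counted $targeted)"
```

If the two counts disagree, the grep and sed patterns have drifted apart and are worth re-checking before overwriting anything.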

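To make the change permanent: with old versions of sed (<4), write the output to a temporary file and then replace the original with it; with GNU sed, use the "-i[suffix]" option to edit the file in place, keeping a backup with the given suffix: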
sed -i".bak" '3d' filename.txt
From http://en.kioskea.net/faq/1451-sed-delete-one-or-more-lines-from-a-file
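As a hedged illustration of the in-place option, the sketch below assumes GNU sed 4 or later and a throwaway four-line file; the file contents are invented for the example.

```shell
# Create a throwaway file to edit (invented content).
tmpdir=$(mktemp -d)
printf '%s\n' 'one' 'two' 'three' 'four' > "$tmpdir/filename.txt"

# Delete line 3 in place; the original is kept as filename.txt.bak.
sed -i.bak '3d' "$tmpdir/filename.txt"

# Show what changed between the backup and the edited file.
# diff exits non-zero when the files differ, hence the "|| true".
diff "$tmpdir/filename.txt.bak" "$tmpdir/filename.txt" || true
```

Keeping the .bak copy makes it easy to undo the edit by copying the backup over the edited file.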

 

Command No Element Explanation
Insert a blank line above every line beginning with one or more spaces
(Effectively places a line break between each paragraph)
sed '/^[[:space:]]/{x;p;x;}' input.txt | less
Command is run as normal user at the bash prompt
sed '/^[[:space:]]/{x;p;x;}' input.txt | less 1 sed Stream editor utility, used here to insert a line
sed '/^[[:space:]]/{x;p;x;}' input.txt | less 2 ' Enclose entire expression in single quotes
sed '/^[[:space:]]/{x;p;x;}' input.txt | less 3 / / Delimiters. The two slashes enclose the regular expression used as an address: the commands in the braces run only on lines that match it. Nothing is substituted here. To use a different delimiter for an address, write \cREGEXc, where c is the chosen character.
sed '/^[[:space:]]/{x;p;x;}' input.txt | less 4 ^ (caret) The ^ character matches the beginning of the line
sed '/^[[:space:]]/{x;p;x;}' input.txt | less 5 [[:space:]] Represents white space
sed '/^[[:space:]]/{x;p;x;}' input.txt | less 6 {} (Curly braces) Command grouping. Executes all the commands between the braces on each line that matches the address.
sed '/^[[:space:]]/{x;p;x;}' input.txt | less 7 ; Combines several sed commands on one line
sed '/^[[:space:]]/{x;p;x;}' input.txt | less 8 x Exchange command: eXchanges the pattern space with the hold buffer.
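sed '/^[[:space:]]/{x;p;x;}' input.txt | less 9 p Print command: prints the pattern space. Here it prints the empty hold buffer contents that "x" has just swapped in, which produces the blank line. (Unless sed was started with the "-n" option, the pattern space is also printed automatically at the end of each cycle, so a bare "p" would duplicate each line.)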
sed '/^[[:space:]]/{x;p;x;}' input.txt | less 10 input.txt Source file
sed '/^[[:space:]]/{x;p;x;}' input.txt | less 11 | (pipe) Passes the output of the previous command (sed) as input to the next command (less).
sed '/^[[:space:]]/{x;p;x;}' input.txt | less 12 less Displays output one screen at a time, allowing bi-directional scrolling
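The effect of x;p;x is easier to see on a few lines than on a full screen. This is a minimal sketch on an invented four-line sample in a temporary directory; it assumes the hold buffer starts out empty, which is the case when sed begins.

```shell
# Two short "paragraphs", each starting with an indented line (invented content).
tmpdir=$(mktemp -d)
printf '%s\n' \
  '  Para one starts' \
  'continues' \
  '  Para two starts' \
  'continues again' \
  > "$tmpdir/input.txt"

# On each indented line: swap in the empty hold buffer, print it
# (a blank line), then swap the real line back so it prints as usual.
separated=$(sed '/^[[:space:]]/{x;p;x;}' "$tmpdir/input.txt")
printf '%s\n' "$separated"
```

Each indented line is now preceded by an empty line, including the very first one.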
To run in earnest, change the command to:
sed '/^[[:space:]]/{x;p;x;}' test_file_01.txt > test_file_02.txt
where test_file_01.txt is the output from the previous command

 

Command No Element Explanation
Remove line breaks within a paragraph
(Append current line to previous line ONLY IF current line starts with alpha character)
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less
Command is run as normal user at the bash prompt
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 1 sed Stream editor. Here, we have a three-parter with a named label allowing for a loop
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 2 -e Allows a sed program to be written in several parts, making it more readable.
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 3 :a Creates a named label.
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 4 ' Enclose entirety of each expression in single quotes
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 5 $ Last line of the file
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 6 ! Not (Here, it pairs with "N" to mean do NOT append if it is the last line)
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 7 N Appends the next line to the current one
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 8 ; Combines several sed commands on one line
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 9 s Substitution command
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 10 / / / The three delimiters for the substitution command. It doesn't have to be the familiar slash; it can be any character that isn't in the string you're searching for.
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 11 \n Newline, written with a backslash escape
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 12 \( and \) Grouper with escape characters. Whatever is between the opening and closing parentheses is treated as a group for the purpose of referring back to it later.
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 13 [a-zA-Z] Matches any alpha character between a and z or A and Z
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 14 (space) The newline will be replaced by a space character.
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 15 \1 Back-reference to group number 1. It captures the contents of the group. In this expression, a newline and a single character are replaced by a space and the captured character. If there were a second group in the line, its back-reference would be \2 and so on up to 9.
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 16 t The "t" command branches to a named label if the last substitute command modified pattern space. This branching technique can be used to create loops in sed.
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 17 a The "a" refers to the already created named label.
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 18 P Print command. Here, if the substitution fails, one-liner prints out the pattern space up to the newline character.
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 19 D Delete command. Here, if the substitution fails, one-liner deletes the contents of pattern space up to the newline character.
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 20 input.txt Source file
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 21 | (pipe) Passes the output of the previous command (sed) as input to the next command (less).
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less 22 less Displays output one screen at a time, allowing bi-directional scrolling
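To watch the loop at work, here is a small sketch on invented sample text in a temporary directory. The sample leaves out blank lines to keep it short, and it includes one line beginning with a digit to show that only lines starting with a letter are pulled up onto the previous line.

```shell
# Sample text: continuation lines start with a letter, one line starts with a digit.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '  Para one starts' \
  'continues here' \
  '1928 was the year' \
  '  Para two starts' \
  'continues again' \
  > "$tmpdir/input.txt"

# Join a line to the previous one only when it begins with a letter.
joined=$(sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' "$tmpdir/input.txt")
printf '%s\n' "$joined"
```

The two "continues" lines are appended to the lines above them, while the line starting with 1928 and the indented lines stay on their own.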
To run in earnest, change the command to:
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' test_file_02.txt > test_file_03.txt
where test_file_02.txt is the output from the previous command

 

Resources Used

[1] Famous Sed One-Liners Explained by Peteris Krumins
http://www.catonmat.net/blog/sed-one-liners-explained-part-one/
http://www.catonmat.net/blog/sed-one-liners-explained-part-two/
http://www.catonmat.net/blog/sed-one-liners-explained-part-three/

[2] Sed - An Introduction and Tutorial by Bruce Barnett
http://www.grymoire.com/Unix/Sed.html

[3] Sed one-liners compiled by Eric Pement
http://www.catonmat.net/blog/wp-content/uploads/2008/09/sed1line.txt

[4] Sed - UNIX Stream Editor - Cheat Sheet by Peteris Krumins
http://www.catonmat.net/blog/sed-stream-editor-cheat-sheet/

[5] Extra large file to practice on (33,000 lines)
http://www.archive.org/details/Law_Of_Success_in_16_Lessons
Left menu gives links to a selection of formats.
[Update 16/09/2011] Please note: Format options include "PDF", "PDF with text" and "Full Text". I chose "PDF". I then selected all text and copied and pasted into a text file. This gives a different result to the same operation performed on "PDF with text". Please be aware that the latter option pastes portions of sentences out of order. The other point to note is that choosing "Full Text" makes it harder to craft a command to identify and remove page numbers.
I deliberately chose PDF, then selected all text and pasted into a text file named test_file_00.txt.
This results in text containing page headers and numbers; paragraphs broken into short line segments; and numerous extraneous characters and white space in need of tidying. This work, published in 1928 across multiple volumes, has paragraph indents of 2, 3, 4, 5 or 6 whitespace characters, as well as multiple spaces between words within some paragraphs, making for a particularly challenging exercise.
To test the commands, paste a few paragraphs into a file named input.txt.

[6] My commands
Remove the page numbers:
$ sed -e '/^[[:space:]]*\- *[0-9]\{1,3\} *\-$/d' test_file_00.txt > test_file_01.txt
Separate the paragraphs:
$ sed '/^[[:space:]]/{x;p;x;}' test_file_01.txt > test_file_02.txt
Remove the line breaks:
$ sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' test_file_02.txt > test_file_03.txt
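If the intermediate files are not needed, the three steps could also be chained with pipes. This is only a sketch on an invented five-line stand-in for test_file_00.txt in a temporary directory; the dashes are written without the backslash, which should match the same lines.

```shell
# Invented stand-in for the real 33,000-line file.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '  Para one starts' \
  '- 5 -' \
  'continues here' \
  '  Para two starts' \
  'continues again' \
  > "$tmpdir/test_file_00.txt"

# 1) remove page numbers, 2) separate paragraphs, 3) remove line breaks.
sed -e '/^[[:space:]]*- *[0-9]\{1,3\} *-$/d' "$tmpdir/test_file_00.txt" |
  sed '/^[[:space:]]/{x;p;x;}' |
  sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' \
  > "$tmpdir/test_file_03.txt"

cat "$tmpdir/test_file_03.txt"
```

Keeping the separate test_file_01.txt and test_file_02.txt files, as in the commands above, is still the safer choice while experimenting, because each stage can be inspected on its own.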

And there are plenty more operations that could be done on the file in question :-)

I hope this has been helpful.

Fay
Lugmaster
East Grinstead Linux User Group

 

Don't Go It Alone!

"Don't go it alone" image from http://lug.org.uk/linktous. This image was made for the UK LUG site by Jake Rayson of Faversham LUG / favlug.