An excerpt out of a book from the Internet Archive was used as the basis for this workshop, which I delivered to Sussex LUG in August 2011. See bottom of page for list of resources used.
| Command | No | Element | Explanation |
|---|---|---|---|
| Find lines containing page numbers, e.g. -5- or - 5 - on its own on a line
(Initially using grep to display all the lines that will be targeted, so we don't lose anything valuable) grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less |
|||
| Command is run as normal user at the bash prompt | |||
| grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less | 1 | grep | Search utility |
| grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less | 2 | ' | Enclose in quotes. Single quotes are the safest to use, because they protect your regular expression from the shell |
| grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less | 3 | ^ (caret) | The ^ character matches the beginning of the line |
| grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less | 4 | $ | The $ character matches the end of the line |
| grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less | 5 | [[:space:]] | [[:space:]] matches any white space including tabs |
| grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less | 6 | * | Zero or more repetitions of the preceding element; here, any amount of leading white space, including none |
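| grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less | 7 | \- | A literal dash. Outside a bracket expression a dash has no special meaning, so the backslash is not required; newer versions of GNU grep may print a "stray \" warning for it |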
| grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less | 8 | (space) | A literal space character. Unlike [[:space:]], it matches only a space, not a tab or other white space |
| grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less | 9 | * (space asterisk) | Zero or more spaces |
| grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less | 10 | [0-9] | Matches any digit between 0 and 9 |
| grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less | 11 | \{ and \} | Braces, each escaped with a backslash, enclosing the number of repetitions of the preceding element |
| grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less | 12 | 1,3 | When placed inside curly braces, matches contents of preceding square brackets between 1 and 3 times. |
| grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less | 13 | input.txt | Source file. Content of input.txt remains unchanged |
| grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less | 14 | | (pipe) | Passes the output of the previous command (grep) as input to the next command (less). |
| grep '^[[:space:]]*\- *[0-9]\{1,3\} *\-$' input.txt | less | 15 | less | Displays output one screen at a time, allowing bi-directional scrolling. Screen will be filled with the lines containing page numbers. |
| # Remove lines containing page numbers, e.g. -5- or - 5 - on its own on a line
(Now we know the lines that will be targeted, use the sed command to remove them) sed -e '/^[[:space:]]*\- *[0-9]\{1,3\} *\-$/d' input.txt | less |
|||
| sed -e '/^[[:space:]]*\- *[0-9]\{1,3\} *\-$/d' input.txt | less | 16 | sed | Stream editor utility, used here to delete a line |
| sed -e '/^[[:space:]]*\- *[0-9]\{1,3\} *\-$/d' input.txt | less | 17 | -e | Expression option |
| sed -e '/^[[:space:]]*\- *[0-9]\{1,3\} *\-$/d' input.txt | less | 18 | / | Delimiters. Slashes surround expression to be matched to a line to be deleted |
| sed -e '/^[[:space:]]*\- *[0-9]\{1,3\} *\-$/d' input.txt | less | 19 | d | Delete |
| To run in earnest, change the command to:
sed -e '/^[[:space:]]*\- *[0-9]\{1,3\} *\-$/d' test_file_00.txt > test_file_01.txt where test_file_00.txt is the 33,000-line text file mentioned in the resource list |
|||
| sed -e '/^[[:space:]]*\- *[0-9]\{1,3\} *\-$/d' test_file_00.txt > test_file_01.txt | 20 | > | Redirects the output to the file named next, overwriting that file if it already exists |
| sed -e '/^[[:space:]]*\- *[0-9]\{1,3\} *\-$/d' test_file_00.txt > test_file_01.txt | 21 | test_file_01.txt | Choose an output file. If it doesn't already exist, it will be created. It will be populated with the lines remaining after the lines containing page numbers have been removed. |
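
The preview-then-delete routine above can be tried on a small throwaway file first. This is only a sketch: the sample lines are invented for illustration rather than taken from the book, and the dashes are written without a backslash, which matches the same lines because a dash is literal outside a bracket expression.

```shell
# Build a small sample file (hypothetical content, not from the book).
sample=$(mktemp)
printf '%s\n' \
  '  First paragraph line' \
  '-5-' \
  '   - 12 -' \
  'a dash-separated range 3-4 stays' \
  '- 1234 -' > "$sample"

# Preview: which lines would be treated as page numbers?
preview=$(grep '^[[:space:]]*- *[0-9]\{1,3\} *-$' "$sample")
printf 'Would delete:\n%s\n\n' "$preview"

# Delete: the same pattern, handed to sed's d command.
cleaned=$(sed -e '/^[[:space:]]*- *[0-9]\{1,3\} *-$/d' "$sample")
printf 'Remaining:\n%s\n' "$cleaned"

rm -f "$sample"
```

The four-digit line and the line with a dash in the middle of a sentence are both left alone, because the pattern insists on one to three digits and on the dashes being the only other content of the line.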
To make the changes permanent: with old versions of sed (<4), write the output to a temporary file; with GNU sed, use the "-i[suffix]" option:
sed -i".bak" '3d' filename.txt
From http://en.kioskea.net/faq/1451-sed-delete-one-or-more-lines-from-a-file
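
Both approaches can be compared side by side on scratch copies. A minimal sketch, assuming a sed that accepts the suffix attached directly to -i (GNU sed does); the file names a.txt and b.txt and the four sample lines are made up for this example.

```shell
# Work in a throwaway directory so nothing valuable is touched.
workdir=$(mktemp -d)
printf '%s\n' 'line one' 'line two' 'line three' 'line four' > "$workdir/a.txt"
cp "$workdir/a.txt" "$workdir/b.txt"

# In-place edit: a.txt is changed and the original is kept as a.txt.bak.
sed -i.bak '3d' "$workdir/a.txt"

# Temporary-file approach: write the result elsewhere, then replace the original.
sed '3d' "$workdir/b.txt" > "$workdir/b.tmp" && mv "$workdir/b.tmp" "$workdir/b.txt"

in_place=$(cat "$workdir/a.txt")
via_temp=$(cat "$workdir/b.txt")
backup=$(cat "$workdir/a.txt.bak")
printf 'in place:\n%s\n\nvia temp file:\n%s\n\nbackup:\n%s\n' \
  "$in_place" "$via_temp" "$backup"

rm -rf "$workdir"
```

Redirecting straight back to the input file (sed '3d' b.txt > b.txt) is not an option, because the shell empties the file before sed reads it; hence the temporary file.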
| Command | No | Element | Explanation |
|---|---|---|---|
| Insert a blank line above every line beginning with one or more spaces
(Effectively places a line break between each paragraph) sed '/^[[:space:]]/{x;p;x;}' input.txt | less |
|||
| Command is run as normal user at the bash prompt | |||
| sed '/^[[:space:]]/{x;p;x;}' input.txt | less | 1 | sed | Stream editor utility, used here to insert a line |
| sed '/^[[:space:]]/{x;p;x;}' input.txt | less | 2 | ' | Enclose entire expression in single quotes |
| sed '/^[[:space:]]/{x;p;x;}' input.txt | less | 3 | / / | Delimiters. The two slashes enclose the pattern (address) that selects which lines the commands in braces act on. Nothing is replaced here, as this is not a substitution command. To use a different delimiter for an address, precede the first one with a backslash, e.g. \%pattern%. |
| sed '/^[[:space:]]/{x;p;x;}' input.txt | less | 4 | ^ (caret) | The ^ character matches the beginning of the line |
| sed '/^[[:space:]]/{x;p;x;}' input.txt | less | 5 | [[:space:]] | Represents white space |
| sed '/^[[:space:]]/{x;p;x;}' input.txt | less | 6 | {} (Curly braces) | Command grouping. Executes all the commands in "{...}" on each line that matches the address. |
| sed '/^[[:space:]]/{x;p;x;}' input.txt | less | 7 | ; | Combines several sed commands on one line |
| sed '/^[[:space:]]/{x;p;x;}' input.txt | less | 8 | x | Exchange command: eXchanges the pattern space with the hold buffer. |
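| sed '/^[[:space:]]/{x;p;x;}' input.txt | less | 9 | p | Print command: prints the pattern space. Here it prints the empty hold buffer contents swapped in by the first "x", which produces the blank line. (Unless sed was started with the "-n" option, it also prints the pattern space automatically at the end of each cycle, so a bare "p" would otherwise duplicate the line.) |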
| sed '/^[[:space:]]/{x;p;x;}' input.txt | less | 10 | input.txt | Source file |
| sed '/^[[:space:]]/{x;p;x;}' input.txt | less | 11 | | (pipe) | Passes the output of the previous command (sed) as input to the next command (less). |
| sed '/^[[:space:]]/{x;p;x;}' input.txt | less | 12 | less | Displays output one screen at a time, allowing bi-directional scrolling |
| To run in earnest, change the command to:
sed '/^[[:space:]]/{x;p;x;}' test_file_01.txt > test_file_02.txt where test_file_01.txt is the output from the previous command |
|||
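
The exchange, print, exchange sequence is easier to follow on a tiny input. A sketch only, using two invented paragraphs whose first lines are indented:

```shell
# Two short "paragraphs"; only their first lines begin with white space.
sample=$(mktemp)
printf '%s\n' \
  '  The first paragraph starts here' \
  'and continues on this line.' \
  '    The second paragraph starts here' \
  'and ends here.' > "$sample"

# For each indented line: swap in the (empty) hold buffer, print it as a
# blank line, then swap the real line back so sed prints it as usual.
separated=$(sed '/^[[:space:]]/{x;p;x;}' "$sample")
printf '%s\n' "$separated"

# Count the blank lines that were inserted.
blank_count=$(printf '%s\n' "$separated" | grep -c '^$')
printf 'blank lines inserted: %s\n' "$blank_count"

rm -f "$sample"
```

Because the very first line of the sample is indented, the output also begins with a blank line.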
| Command | No | Element | Explanation |
|---|---|---|---|
| Remove line breaks within a paragraph
(Append current line to previous line ONLY IF current line starts with alpha character) sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less |
|||
| Command is run as normal user at the bash prompt | |||
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 1 | sed | Stream editor. Here, we have a three-parter with a named label allowing for a loop |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 2 | -e | Allows a sed program to be written in several parts, making it more readable. |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 3 | :a | Creates a named label. |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 4 | ' | Enclose entirety of each expression in single quotes |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 5 | $ | Last line of the file |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 6 | ! | Not (Here, it pairs with "N" to mean do NOT append if it is the last line) |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 7 | N | Appends the next line to the current one |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 8 | ; | Combines several sed commands on one line |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 9 | s | Substitution command |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 10 | / / / | The three delimiters for the substitution command. It doesn't have to be the familiar slash; it can be any character that isn't in the string you're searching for. |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 11 | \n | Newline, written as an escape sequence |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 12 | \( and \) | Grouper with escape characters. Whatever is between the opening and closing parentheses is treated as a group for the purpose of referring back to it later. |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 13 | [a-zA-Z] | Matches any alpha character between a and z or A and Z |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 14 | (space) | The newline will be replaced by a space character. |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 15 | \1 | Back-reference to group number 1. It captures the contents of the group. In this expression, a newline and a single character are replaced by a space and the captured character. If there were a second group in the line, its back-reference would be \2 and so on up to 9. |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 16 | t | The "t" command branches to a named label if the last substitute command modified pattern space. This branching technique can be used to create loops in sed. |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 17 | a | The "a" refers to the already created named label. |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 18 | P | Print command. Here, if the substitution fails, one-liner prints out the pattern space up to the newline character. |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 19 | D | Delete command. Here, if the substitution fails, one-liner deletes the contents of pattern space up to the newline character. |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 20 | input.txt | Source file |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 21 | | (pipe) | Passes the output of the previous command (sed) as input to the next command (less). |
| sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' input.txt | less | 22 | less | Displays output one screen at a time, allowing bi-directional scrolling |
| To run in earnest, change the command to:
sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' test_file_02.txt > test_file_03.txt where test_file_02.txt is the output from the previous command |
|||
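
To watch the loop at work, the one-liner can be run on a few lines shaped like the output of the previous step. This is a sketch with invented sample text: blank lines separate the paragraphs, and only the first line of each paragraph is indented.

```shell
# Sample shaped like test_file_02.txt: a blank line before each paragraph.
sample=$(mktemp)
printf '%s\n' \
  '' \
  '  The first paragraph starts here' \
  'and continues on this line' \
  'and this one.' \
  '' \
  '    The second paragraph starts here' \
  'and ends here.' > "$sample"

# N appends the next line; s joins it if it starts with a letter; ta loops
# back to label a after a successful join; P;D emit the finished line.
joined=$(sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' "$sample")
printf '%s\n' "$joined"

rm -f "$sample"
```

Indented lines and blank lines do not start with a letter, so the substitution fails on them and the paragraph boundaries survive.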
[1] Famous Sed One-Liners Explained by Peteris Krumins
http://www.catonmat.net/blog/sed-one-liners-explained-part-one/
http://www.catonmat.net/blog/sed-one-liners-explained-part-two/
http://www.catonmat.net/blog/sed-one-liners-explained-part-three/
[2] Sed - An Introduction and Tutorial by Bruce Barnett
http://www.grymoire.com/Unix/Sed.html
[3] Sed one-liners compiled by Eric Pement
http://www.catonmat.net/blog/wp-content/uploads/2008/09/sed1line.txt
[4] Sed - UNIX Stream Editor - Cheat Sheet by Peteris Krumins
http://www.catonmat.net/blog/sed-stream-editor-cheat-sheet/
[5] Extra large file to practice on (33,000 lines)
http://www.archive.org/details/Law_Of_Success_in_16_Lessons
Left menu gives links to a selection of formats.
[Update 16/09/2011] Please note: Format options
include "PDF", "PDF with text" and "Full Text". I chose "PDF". I then
selected all text and copied and pasted into a text file. This gives a
different result to the same operation performed on "PDF with text". Please
be aware that the latter option pastes portions of sentences out of order.
The other point to note is that choosing "Full Text" makes it harder to
craft a command to identify and remove page numbers.
I deliberately chose PDF, then selected all text and pasted into a
text file named test_file_00.txt.
This results in text containing page headers and numbers; paragraphs
broken into short line segments; and numerous extraneous characters and
white space in need of tidying. This work, published in 1928 across
multiple volumes, has paragraph indents of 2, 3, 4, 5 or 6 whitespace
characters, as well as multiple spaces between words within some paragraphs,
making for a particularly challenging exercise.
To test the commands, paste a few paragraphs into a file named input.txt.
[6] My commands
Remove the page numbers:
$ sed -e '/^[[:space:]]*\- *[0-9]\{1,3\} *\-$/d' test_file_00.txt > test_file_01.txt
Separate the paragraphs:
$ sed '/^[[:space:]]/{x;p;x;}' test_file_01.txt > test_file_02.txt
Remove the line breaks:
$ sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D' test_file_02.txt > test_file_03.txt
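
If the intermediate files are not needed, the three steps can also be chained with pipes. A rough sketch on a few invented lines, not a replacement for the file-by-file approach, which makes it easier to inspect each stage; the dashes are left unescaped here, which matches the same lines.

```shell
# Sample with a page number sitting between two paragraphs.
sample=$(mktemp)
printf '%s\n' \
  '  The first paragraph starts here' \
  'and continues on this line.' \
  '- 5 -' \
  '    The second paragraph starts here' \
  'and ends here.' > "$sample"

# Step 1 removes page numbers, step 2 separates paragraphs,
# step 3 joins the broken lines within each paragraph.
tidied=$(
  sed -e '/^[[:space:]]*- *[0-9]\{1,3\} *-$/d' "$sample" |
  sed '/^[[:space:]]/{x;p;x;}' |
  sed -e :a -e '$!N;s/\n\([a-zA-Z]\)/ \1/;ta' -e 'P;D'
)
printf '%s\n' "$tidied"

rm -f "$sample"
```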
And there are plenty more operations that could be done on the file in question :-)
I hope this has been helpful.
Fay
Lugmaster
East Grinstead Linux User Group
"Don't go it alone" image from http://lug.org.uk/linktous. This image was made for the UK LUG site by Jake Rayson of Faversham LUG / favlug.