Author | Nejat Hakan
Email | nejat.hakan@outlook.de
PayPal Me | https://paypal.me/nejathakan
Text Processing and Searching
Introduction
Welcome to the world of text processing and searching on the Linux command line. In the Linux and Unix philosophy, text is the universal interface. Configuration files, log files, command outputs, source code – a vast amount of information is stored and transmitted as plain text. Being able to efficiently manipulate, search, and transform this text data directly from the command line is not just a convenience; it's a fundamental skill that unlocks immense power and automation capabilities.
Why rely on the command line when graphical text editors and IDEs exist?
- Automation: Command-line tools can be easily scripted. Repetitive text manipulation tasks that might take minutes or hours in a graphical editor can often be accomplished in seconds with a well-crafted command or script.
- Efficiency: For many tasks, especially on remote servers where graphical interfaces may be unavailable or slow, the command line is significantly faster.
- Integration: Command-line tools are designed to work together seamlessly using pipes (`|`), allowing you to chain simple tools to perform complex operations.
- Resourcefulness: These tools are typically lightweight and available on almost any Linux/Unix system, even minimal installations.
- Universality: The principles and many of the tools (like `grep`, `sed`, `awk`) are standard across Unix-like systems (including macOS and Windows Subsystem for Linux).
In this section, we will embark on a deep dive into the essential Linux utilities designed for text processing and searching. We will start with basic file viewing and gradually move towards powerful tools that utilize regular expressions for pattern matching and transformation. Each theoretical part will be followed by a hands-on "Workshop" section, providing practical, step-by-step exercises using real-world scenarios to solidify your understanding. Prepare to become proficient in harnessing the text-processing power of the Linux command line.
We will cover:
- Viewing and concatenating files (`cat`, `less`, `more`, `head`, `tail`).
- Searching text using patterns (`grep` and regular expressions).
- Stream editing for text transformation (`sed`).
- Advanced text processing with `awk`.
- Sorting and managing duplicate lines (`sort`, `uniq`).
- Counting lines, words, and characters (`wc`).
- Comparing file contents (`diff`, `comm`).
- Translating or deleting characters (`tr`).
- Combining tools using pipes (`|`).
Let's begin by learning how to simply view the contents of text files.
1. Viewing and Concatenating Files
Before you can process or search text, you often need to view it. Linux provides several fundamental utilities for displaying file contents directly in your terminal. Understanding their differences and optimal use cases is the first step towards effective command-line text manipulation.
Core Utilities
- `cat` (Concatenate):
    - Purpose: Originally designed to concatenate (link together) files, `cat` is most commonly used to display the entire content of one or more files to standard output (usually your terminal screen).
    - Usage: `cat [options] [file...]`
    - Behavior: Reads the specified files sequentially and writes their content to standard output. If no file is given, it reads from standard input (e.g., keyboard input until Ctrl+D).
    - Common Options:
        - `-n`: Number all output lines.
        - `-b`: Number only non-empty output lines.
        - `-s`: Squeeze multiple adjacent blank lines into a single blank line.
        - `-E`: Display a `$` at the end of each line.
    - Caveat: Be cautious using `cat` on very large files, as it will attempt to dump the entire content to your screen, which can be slow and overwhelming. It's best suited for small files or when you specifically need the entire content piped to another command.
- `less`:
    - Purpose: A powerful and widely preferred file pager. It allows you to view file content screen by screen, navigate forwards and backwards, and search within the file without loading the entire file into memory first. This makes it ideal for large files.
    - Usage: `less [options] [file...]`
    - Behavior: Displays one screenful of the file. You can then use navigation commands.
    - Key Navigation Commands (inside `less`):
        - `Space` or `f`: Move forward one screen.
        - `b`: Move backward one screen.
        - `d`: Move down (forward) half a screen.
        - `u`: Move up (backward) half a screen.
        - `j` or `Enter`: Move forward one line.
        - `k`: Move backward one line.
        - `g`: Go to the beginning of the file.
        - `G`: Go to the end of the file.
        - `/pattern`: Search forward for `pattern`. `n` repeats the search forward, `N` repeats it backward.
        - `?pattern`: Search backward for `pattern`. `n` repeats the search backward, `N` repeats it forward.
        - `h`: Display a help screen with more commands.
        - `q`: Quit `less`.
    - Advantages: Efficient for large files, rich navigation and search features.
- `more`:
    - Purpose: An older, simpler file pager than `less`. It allows forward navigation only.
    - Usage: `more [options] [file...]`
    - Behavior: Displays file content screen by screen.
    - Key Navigation Commands (inside `more`):
        - `Space`: Move forward one screen.
        - `Enter`: Move forward one line.
        - `/pattern`: Search forward for `pattern`.
        - `q`: Quit `more`.
    - Note: `less` is generally preferred over `more` due to its enhanced features (like backward scrolling). You might encounter `more` on older or minimal systems.
- `head`:
    - Purpose: Displays the beginning (the "head") of a file.
    - Usage: `head [options] [file...]`
    - Behavior: By default, it shows the first 10 lines of the specified file(s).
    - Common Options:
        - `-n <number>` or `-<number>`: Display the first `<number>` lines instead of 10.
        - `-c <bytes>`: Display the first `<bytes>` bytes instead of lines.
- `tail`:
    - Purpose: Displays the end (the "tail") of a file.
    - Usage: `tail [options] [file...]`
    - Behavior: By default, it shows the last 10 lines of the specified file(s).
    - Common Options:
        - `-n <number>` or `-<number>`: Display the last `<number>` lines instead of 10.
        - `-n +<number>`: Display lines starting from line `<number>`. For example, `tail -n +5` shows lines from the 5th line to the end.
        - `-c <bytes>`: Display the last `<bytes>` bytes instead of lines.
        - `-f`: "Follow" mode. `tail` does not exit after displaying the last lines but waits and displays new lines as they are appended to the file. This is extremely useful for monitoring log files in real time. Press `Ctrl+C` to exit follow mode.
Choosing the Right Tool
- For small files where you want to see everything at once: `cat`.
- For viewing files of any size, especially large ones, with navigation and search: `less`.
- For quickly checking the beginning of a file: `head`.
- For quickly checking the end of a file or monitoring a file for changes: `tail` (especially `tail -f`).
- For combining files sequentially: `cat file1 file2 > combined_file`.
- When piping output to another command that needs the entire content: `cat file | other_command`.
- When piping potentially large output for interactive viewing: `some_command | less`.
Understanding these basic viewing tools is crucial as they often form the first step in a text processing pipeline.
Workshop Viewing and Concatenating Files
Objective: To practice using `cat`, `less`, `head`, and `tail` for basic file viewing and monitoring.
Scenario: We will work with a simulated system log file and a configuration file snippet.
Setup:
-   Create a sample log file: Open your terminal and run the following commands to create a file named `system.log`:

    ```bash
    echo "[2023-10-26 10:00:01] INFO: System startup sequence initiated." > system.log
    echo "[2023-10-26 10:00:05] INFO: Network service started." >> system.log
    echo "[2023-10-26 10:00:10] DEBUG: Checking disk space..." >> system.log
    echo "[2023-10-26 10:00:12] INFO: Disk space OK." >> system.log
    for i in $(seq 1 20); do echo "[2023-10-26 10:01:$((10 + i))] VERBOSE: Heartbeat signal $i received." >> system.log; done
    echo "[2023-10-26 10:02:00] WARN: High CPU usage detected (95%)." >> system.log
    echo "[2023-10-26 10:02:05] INFO: Adjusting process priorities." >> system.log
    echo "[2023-10-26 10:02:15] ERROR: Failed to connect to database server [db01.internal]." >> system.log
    echo "[2023-10-26 10:02:20] INFO: Retrying database connection..." >> system.log
    echo "[2023-10-26 10:02:30] INFO: Database connection successful." >> system.log
    echo "[2023-10-26 10:03:00] INFO: System initialization complete." >> system.log
    ```

    - `>` redirects output, creating or overwriting the file.
    - `>>` redirects output, appending to the file if it exists.
    - The `for` loop adds 20 "VERBOSE" lines to make the file longer.
-   Create a sample configuration snippet: Create `config_part1.txt` (sample contents are sketched just after this setup list).

-   Create another configuration snippet: Create `config_part2.txt` (see the same sketch below).
Steps:
- View a small file with `cat`:
    - Command: `cat config_part1.txt`
    - Observe: The entire content of `config_part1.txt` is printed to your terminal.
    - Try with line numbers: `cat -n config_part1.txt`
    - Observe: Each line is now prefixed with its line number.
- Concatenate files with `cat`:
    - Command: `cat config_part1.txt config_part2.txt`
    - Observe: The content of `config_part1.txt` is displayed first, immediately followed by the content of `config_part2.txt`.
    - Command: `cat config_part1.txt config_part2.txt > network.conf`
    - Verify: Use `cat network.conf` to see the combined content stored in the new file.
- View the longer log file with `cat` (and see why it's often not ideal):
    - Command: `cat system.log`
    - Observe: The entire log file scrolls past, likely too fast to read comfortably. The beginning of the file might scroll off the screen.
- View the log file properly with `less`:
    - Command: `less system.log`
    - Observe: You see the first screenful of the file. The bottom line indicates the filename and your position.
    - Practice Navigation:
        - Press `Space` to go down one page.
        - Press `b` to go back one page.
        - Press `j` several times to move down line by line.
        - Press `k` several times to move up line by line.
        - Press `G` to jump to the end of the file.
        - Press `g` to jump back to the beginning.
    - Practice Searching:
        - Type `/ERROR` and press `Enter`. `less` will jump to the first occurrence of "ERROR".
        - Press `n` to find the next occurrence (if any).
        - Type `?INFO` and press `Enter`. `less` will search backwards for "INFO".
        - Press `n` to find the next occurrence backwards.
    - Quit: Press `q` to exit `less`.
- View the beginning of the log file with `head`:
    - Command: `head system.log`
    - Observe: You see the first 10 lines of the file.
    - Command: `head -n 5 system.log`
    - Observe: You see only the first 5 lines.
    - Command: `head -n 3 network.conf`
    - Observe: You see the first 3 lines of the combined configuration file.
- View the end of the log file with `tail`:
    - Command: `tail system.log`
    - Observe: You see the last 10 lines of the file.
    - Command: `tail -n 5 system.log`
    - Observe: You see only the last 5 lines.
    - Command: `tail -n +25 system.log`
    - Observe: You see the content starting from line 25 until the end. (Count the lines if you're unsure!)
- Monitor the log file with `tail -f` (simulate log updates):
    - Command: `tail -f system.log`
    - Observe: You see the last 10 lines, and the cursor waits. `tail` is now monitoring the file.
    - Open a second terminal window/tab. Do not close the first one yet.
    - In the second terminal, append a new line to the log file (an example append command is sketched just after these steps).
    - Switch back to the first terminal (where `tail -f` is running).
    - Observe: The new "ALERT" line you just added appears automatically in the `tail -f` output.
    - Repeat the append command in the second terminal a few times. Each new line will appear in the first terminal.
    - Go back to the first terminal and press `Ctrl+C` to stop the `tail -f` command.
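The exact append command is not preserved in this copy; any `echo` that adds a line containing "ALERT" works, for example (hypothetical message text):

```bash
echo "[2023-10-26 10:05:00] ALERT: Manual test entry from second terminal." >> system.log
```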
Cleanup (Optional):
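The cleanup commands are not shown in this copy; removing the files created in this workshop would look like:

```bash
rm -f system.log config_part1.txt config_part2.txt network.conf
```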
This workshop demonstrated how to use the fundamental viewing tools, highlighting the interactive paging of `less` and the real-time monitoring capability of `tail -f`.
2. Searching Text with grep
Perhaps the single most important text-processing tool on Linux is `grep` (Global search for Regular Expression and Print). It scans input (from files or standard input) line by line and prints lines that contain a match for a specified pattern. Its power comes from its speed and its support for regular expressions, a sophisticated way to define search patterns.
Basic Usage
The fundamental syntax is:
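```bash
grep [options] 'PATTERN' [file...]
```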
- `PATTERN`: The text or pattern you are searching for. If it contains spaces or special characters interpreted by the shell (like `*`, `?`, `[`), you must enclose it in single (`'`) or double (`"`) quotes. Single quotes are generally safer as they prevent the shell from interpreting most special characters within the quotes.
- `file...`: One or more files to search within. If omitted, `grep` reads from standard input (useful for pipes).
Key `grep` Options

`grep` has many options to modify its behavior. Here are some of the most crucial ones:
- `-i` (`--ignore-case`): Perform a case-insensitive search. `grep -i 'error' file.log` will match "error", "Error", "ERROR", etc.
- `-v` (`--invert-match`): Select non-matching lines. It prints all lines that do not contain the pattern.
- `-n` (`--line-number`): Prepend each matching line with its line number within the input file.
- `-c` (`--count`): Suppress normal output; instead, print a count of matching lines for each input file.
- `-l` (`--files-with-matches`): Suppress normal output; instead, print the name of each input file from which output would normally have been printed. Scanning stops on the first match. Useful when you just want to know which files contain a pattern.
- `-L` (`--files-without-match`): The opposite of `-l`. Print the names of files that do not contain the pattern.
- `-r` or `-R` (`--recursive`): Search recursively. If a file argument is a directory, `grep` searches all files under that directory. `-R` additionally follows symbolic links.
- `-w` (`--word-regexp`): Select only those lines containing matches that form whole words. The matched substring must be either at the beginning of the line or preceded by a non-word-constituent character. Similarly, it must be either at the end of the line or followed by a non-word-constituent character. Word-constituent characters are letters, digits, and the underscore. `grep -w 'err'` would match " err " but not "error".
- `-o` (`--only-matching`): Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
- `-A <num>` (`--after-context=<num>`): Print `<num>` lines of trailing context after matching lines.
- `-B <num>` (`--before-context=<num>`): Print `<num>` lines of leading context before matching lines.
- `-C <num>` or `-<num>` (`--context=<num>`): Print `<num>` lines of output context (both before and after).
- `-E` (`--extended-regexp`): Interpret `PATTERN` as an Extended Regular Expression (ERE). More on this below. (Equivalent to using the `egrep` command, which is often just a link to `grep`.)
- `-F` (`--fixed-strings`): Interpret `PATTERN` as a list of fixed strings (instead of regular expressions), separated by newlines, any of which is to be matched. (Equivalent to using the `fgrep` command.) This can be significantly faster if you don't need pattern-matching power.
Introduction to Regular Expressions (Regex)
Regular expressions are the heart of `grep`'s power. They are sequences of characters that define a search pattern. While a deep dive into regex is a topic in itself, understanding the basics is essential for using `grep` effectively.

There are different "flavors" of regex. `grep` primarily uses Basic Regular Expressions (BRE) by default, and Extended Regular Expressions (ERE) with the `-E` option. EREs are generally more intuitive as fewer characters need escaping.
Common Regex Metacharacters (Special Characters):
(Note: Some require escaping with `\` in BRE but not in ERE.)
- Anchors:
    - `^`: Matches the beginning of a line. `grep '^Error' file` matches lines starting with "Error".
    - `$`: Matches the end of a line. `grep 'complete.$' file` matches lines ending with "complete." (here the unescaped `.` acts as a wildcard; write `\.` if you want to insist on a literal dot, in both BRE and ERE).
- Character Representations:
    - `.` (Dot): Matches any single character (except newline). `grep 'l..e' file` matches "like", "love", "luxe", etc.
    - `[...]` (Bracket Expressions): Matches any one character enclosed in the brackets.
        - `[aeiou]`: Matches any single lowercase vowel.
        - `[0-9]`: Matches any single digit.
        - `[a-zA-Z]`: Matches any single uppercase or lowercase letter.
        - `[^...]`: Matches any single character not in the brackets. `[^0-9]` matches any non-digit character.
- Quantifiers (Specify Repetitions):
    - `*`: Matches the preceding item zero or more times. `grep 'ab*c' file` matches "ac", "abc", "abbc", "abbbc", etc.
    - `+` (ERE only, or `\+` in BRE): Matches the preceding item one or more times. `grep -E 'ab+c' file` matches "abc", "abbc", but not "ac".
    - `?` (ERE only, or `\?` in BRE): Matches the preceding item zero or one time. `grep -E 'colou?r' file` matches "color" and "colour".
    - `{n}` (ERE only, or `\{n\}` in BRE): Matches the preceding item exactly n times. `grep -E '[0-9]{3}'` matches exactly three digits.
    - `{n,}` (ERE only, or `\{n,\}` in BRE): Matches the preceding item n or more times. `grep -E 'go{2,}gle'` matches "google", "gooogle", etc.
    - `{n,m}` (ERE only, or `\{n,m\}` in BRE): Matches the preceding item at least n times, but no more than m times. `grep -E '[a-z]{3,5}'` matches 3, 4, or 5 lowercase letters.
- Alternation (OR):
    - `|` (ERE only, or `\|` in BRE): Matches either the expression before or the expression after the pipe. `grep -E 'error|warning'` matches lines containing either "error" or "warning".
- Grouping:
    - `(...)` (ERE only, or `\(...\)` in BRE): Groups expressions together. This is often used with quantifiers or alternation. `grep -E '(ab)+c'` matches "abc", "ababc", etc. It also captures the matched group for backreferences (used more in `sed`, but relevant).
- Escaping:
    - `\`: The backslash "escapes" the special meaning of a metacharacter, making it literal. To search for a literal dot, use `\.`. To search for a literal asterisk, use `\*`. To search for a literal backslash, use `\\`.
BRE vs ERE Example:
To match lines containing one or more digits:
- BRE: `grep '[0-9]\+' file`
- ERE: `grep -E '[0-9]+' file` or `egrep '[0-9]+' file`

Using ERE (`grep -E` or `egrep`) is often recommended for readability when your patterns involve quantifiers like `+`, `?`, `{}`, or alternation `|`.
`grep` is a versatile tool that forms the backbone of text searching on Linux. Mastering its options and basic regular expressions is a significant step towards command-line proficiency.
Workshop Searching with grep
Objective: To practice using `grep` with various options and basic regular expressions to search within files.
Scenario: We will use the `system.log` file created earlier and a sample data file containing user information.
Setup:
-   Ensure `system.log` exists: If you removed it previously, recreate it using the commands from the Workshop in section 1.

-   Create a user data file: Create a file named `users.txt`:

    ```bash
    cat << EOF > users.txt
    User ID,Name,Department,Status,Last Login
    101,Alice Smith,Engineering,Active,2023-10-25
    102,Bob Johnson,Sales,Inactive,2023-09-10
    103,Charlie Brown,Engineering,Active,2023-10-26
    104,David Williams,Support,Active,2023-10-26
    105,Eve Davis,Sales,Active,2023-10-24
    106,Frank Miller,Support,Pending,2023-10-20
    107,Grace Wilson,Engineering,active,2023-10-26
    EOF
    ```

    Note the inconsistent capitalization in "active" for user 107.
Steps:
- Simple Search: Find all lines in the log file containing the word "INFO".
    - Command: `grep 'INFO' system.log`
    - Observe: All lines containing "INFO" are displayed.
- Case-Insensitive Search: Find lines in `users.txt` containing "active", regardless of case.
    - Command: `grep 'active' users.txt` (Note: this only matches the lowercase "active" in user 107's line and inside "Inactive"; it misses the capitalized "Active" entries.)
    - Command: `grep -i 'active' users.txt`
    - Observe: The second command finds both "Active" and "active" lines.
- Invert Match: Find all lines in the log file that do not contain "VERBOSE".
    - Command: `grep -v 'VERBOSE' system.log`
    - Observe: The output includes INFO, DEBUG, WARN, and ERROR lines, but excludes the numerous VERBOSE lines.
- Line Numbers: Find lines containing "database" and show their line numbers.
    - Command: `grep -n 'database' system.log`
    - Observe: Each matching line is prefixed with its number (e.g., `27:...`, `28:...`).
- Count Matches: Count how many errors occurred.
    - Command: `grep -c 'ERROR' system.log`
    - Observe: The output should be `1`.
    - Command: `grep -c 'INFO' system.log`
    - Observe: The output shows the total count of INFO lines.
- Recursive Search (Setup): First, let's create a subdirectory and copy a file into it.
    - Command: `mkdir logs_archive`
    - Command: `cp system.log logs_archive/system_old.log`
    - Command: `echo "[2023-10-25 18:00:00] ERROR: System halted." >> logs_archive/system_old.log`
- Recursive Search (Execution): Search for "ERROR" in the current directory and subdirectories.
    - Command: `grep 'ERROR' system.log` (only finds the error in the current file)
    - Command: `grep -r 'ERROR' .` (searches recursively starting from the current directory `.`)
    - Observe: The second command finds the "ERROR" line in `system.log` and the two "ERROR" lines in `logs_archive/system_old.log`, prefixing each match with the filename.
- List Files Containing Matches: Find which files in the current directory and subdirectories contain the word "CPU".
    - Command: `grep -rl 'CPU' .`
    - Observe: Both `./system.log` and `./logs_archive/system_old.log` are printed, because the archived copy also contains the "CPU" warning line. Only filenames are shown, not the matching lines.
- Word Match: Find lines containing "DB" as a whole word vs. lines containing "DB" as part of a longer word (like "DBADMIN"). We'll use `echo` to pipe input to `grep`.
    - Command: `echo "DEBUG DB connection" | grep 'DB'` (matches)
    - Command: `echo "DEBUG DB connection" | grep -w 'DB'` (matches the whole word "DB")
    - Command: `echo "DEBUG DBADMIN connection" | grep 'DB'` (matches, as "DB" is part of "DBADMIN")
    - Command: `echo "DEBUG DBADMIN connection" | grep -w 'DB'` (does not match, as "DB" inside "DBADMIN" isn't a whole word)
- Using Basic Regex (Anchors): Find log entries that occurred exactly at the start of a minute (the timestamp's seconds field is `00`).
    - Command: `grep ':00]' system.log` (finds lines containing `:00]` anywhere on the line)
    - Because each log line begins with its timestamp, we can make this stricter with the start-of-line anchor `^`. The opening `[` must be escaped (`\[`) because it is special in regex, and the `.` wildcard stands in for the individual date and time digits. Use single quotes.
    - Command: `grep '^\[....-..-.. ..:..:00\]' system.log`
    - Observe: This matches only lines whose leading timestamp ends in `:00]` (the entries at 10:02:00 and 10:03:00).
- Using Basic Regex (Character Classes): Find user IDs in `users.txt` that are between 102 and 104 inclusive. User IDs are at the start of the line.
    - Command: `grep '^[1][0][2-4],' users.txt`
    - Observe: Matches lines starting with 102, 103, or 104, followed by a comma. `^` anchors to the start, `[1]` matches '1', `[0]` matches '0', and `[2-4]` matches '2', '3', or '4'.
- Using Extended Regex (Alternation): Find lines in the log containing either "ERROR" or "WARN".
    - Command: `grep -E 'ERROR|WARN' system.log`
    - Observe: Lines with either pattern are shown. Compare with `grep 'ERROR|WARN' system.log` (which searches for the literal string "ERROR|WARN" unless your `grep` defaults to ERE).
- Using Extended Regex (Escaped Parentheses): Find lines where the CPU usage was 90% or higher (i.e., 9 followed by one more digit, then '%').
    - Command: `grep -E 'CPU usage detected \(9[0-9]%\)' system.log`
    - Observe: Matches the "WARN: High CPU usage detected (95%)" line.
        - `-E`: Use Extended Regex.
        - `\(`, `\)`: Match literal parentheses. In ERE, `(` and `)` are used for grouping, so we escape them.
        - `9[0-9]%`: Match '9', followed by any single digit (`[0-9]`), followed by '%'.
- Context: Find the "database connection successful" message and show the line before and after it.
    - Command: `grep -C 1 'Database connection successful' system.log`
    - Observe: You see the "Retrying..." line, the "successful" line, and the "initialization complete" line.
    - Try with `-B 2` (2 lines before) and `-A 2` (2 lines after).
Cleanup (Optional):
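The cleanup commands are missing here; something like the following removes what this workshop created:

```bash
rm -rf logs_archive    # the archive directory created for the recursive search
rm -f users.txt        # recreated in the awk workshop if needed; keep system.log for later sections
```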
This workshop provided hands-on experience with common `grep` options and introduced the fundamentals of regular expressions for targeted searching. Experiment with different patterns and options on these files or other text files you have.
3. Stream Editing with sed
`sed` stands for Stream EDitor. Unlike interactive editors like `nano` or `vim`, `sed` is designed to perform text transformations non-interactively on an input stream (a file or piped data). It reads the input line by line, applies a set of specified commands to each line, and then outputs the transformed line. This makes it incredibly powerful for scripting and automating text modifications.
How `sed` Works: The Cycle

`sed` maintains two data buffers:

- Pattern Space: This is the primary workspace. `sed` reads one line from the input into the pattern space.
- Hold Space: This is an auxiliary buffer. You can copy data from the pattern space to the hold space and vice versa, allowing for more complex manipulations across multiple lines (though this is an advanced topic).
The basic `sed` cycle for each input line is:

1. Read a line from the input stream.
2. Remove the trailing newline character.
3. Place the line into the pattern space.
4. Execute the `sed` commands provided in the script sequentially. Each command might modify the content of the pattern space. Commands can be restricted to operate only on lines matching certain addresses (patterns or line numbers).
5. Once all commands are executed for the current line:
    - If the `-n` option was not used, print the (potentially modified) content of the pattern space, followed by a newline.
    - If the `-n` option was used, the pattern space is printed only if explicitly commanded (e.g., by the `p` command).
6. The pattern space is typically cleared (unless specific commands prevent it).
7. If there is more input, repeat from step 1.
Basic Syntax
sed [options] 'script' [input-file...]
sed [options] -e 'script1' -e 'script2' ... [input-file...]
sed [options] -f script-file [input-file...]
- `options`: Control `sed`'s behavior. Common ones include:
    - `-n` (`--quiet`, `--silent`): Suppress automatic printing of the pattern space. Only lines explicitly printed with the `p` command will appear in the output. This is very common.
    - `-e <script>`: Add the commands in `script` to the set of commands to be executed. Useful for multiple simple commands.
    - `-f <script-file>`: Read `sed` commands from the specified `script-file` instead of the command line. Essential for complex scripts.
    - `-i[SUFFIX]` (`--in-place[=SUFFIX]`): Edit files in place. Use with extreme caution! This overwrites the original file. If `SUFFIX` is provided (e.g., `-i.bak`), a backup of the original file is created with that suffix before modification. Without a suffix, the original is overwritten directly. Always test your `sed` scripts without `-i` first!
- `script`: One or more `sed` commands. If it contains shell metacharacters, enclose it in single quotes (`'`). The basic format of a command is `[address[,address]]command[arguments]`.
- `input-file...`: File(s) to process. If omitted, `sed` reads from standard input.
Addresses
Addresses specify which lines a command should apply to. If no address is given, the command applies to all lines.
- Line Number: A simple integer `N` applies the command only to line `N`.
- Range of Line Numbers: `N,M` applies the command to lines from `N` to `M` inclusive.
- `/pattern/`: A regular expression (BRE by default, ERE if the `-E` or `-r` option is used). The command applies to any line matching the pattern. Regex syntax is similar to `grep`.
- `/pattern1/,/pattern2/`: A range defined by patterns. The command applies starting from the first line matching `pattern1` up to and including the next line matching `pattern2`.
- `$`: Represents the last line of input. `1,$` means all lines.
- `!`: Appended to an address or address range, it negates the match, applying the command to lines that do not match the address(es). `1,10!d` deletes all lines except lines 1 through 10.
Common `sed` Commands

- `s/regexp/replacement/[flags]` (Substitute): This is the most frequently used command. It searches for `regexp` in the pattern space and replaces the first match with `replacement`.
    - `regexp`: A Basic Regular Expression (unless `-E` is used).
    - `replacement`: The text to substitute in. It can contain special characters:
        - `&`: Represents the entire matched `regexp`. `s/hello/(&)/` replaces "hello" with "(hello)".
        - `\1`, `\2`, ...: Represent text captured by the 1st, 2nd, ... parenthesized group `(...)` in the `regexp` (requires `\(...\)` in BRE, `(...)` in ERE). `s/\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\)/\3.\2.\1/` changes YYYY-MM-DD to DD.MM.YYYY.
    - `flags`: Modify the substitution behavior:
        - `g`: Global. Replace all occurrences of `regexp` on the line, not just the first.
        - `N` (a number): Replace only the Nth occurrence.
        - `i` or `I`: Case-insensitive matching for `regexp`.
        - `p`: Print the pattern space if a substitution was made. Often used with `-n`.
- `d` (Delete): Delete the pattern space. The current line is not printed, and the next cycle begins. `sed '/^#/d' config.txt` deletes all lines starting with `#`.
- `p` (Print): Print the current pattern space. Usually used with `-n` to selectively print only certain lines. `sed -n '/ERROR/p' file.log` prints only lines containing "ERROR".
- `a \text` (Append): Append `text` after the current line. The text is output when the next line is read or the input ends. `sed '/start_marker/a \New line inserted here' file.txt`.
- `i \text` (Insert): Insert `text` before the current line. `sed '1i \# Header added by sed' file.txt` inserts a header line before the first line.
- `c \text` (Change): Replace the selected line(s) entirely with `text`. `sed '/old_setting/c \new_setting=true' config.txt`.
- `y/source-chars/dest-chars/` (Transliterate): Translate characters. Replaces every character in `source-chars` found in the pattern space with the corresponding character in `dest-chars`. Similar to the `tr` command but within `sed`. `sed 'y/abc/ABC/'` changes 'a' to 'A', 'b' to 'B', and 'c' to 'C'.

`sed` is a dense but powerful tool. Starting with the `s`, `d`, and `p` commands covers a large percentage of common use cases. Remember to always test without the `-i` option first!
Workshop Stream Editing with sed
Objective: To practice using `sed` for common text transformations like substitution, deletion, insertion, and selective printing.
Scenario: We'll modify a sample configuration file and process log data.
Setup:
-   Create a sample configuration file: Create `server.conf`.

    ```bash
    cat << EOF > server.conf
    # Server Configuration File
    # Last updated: 2023-10-25
    ServerName primary.example.com
    ListenPort = 80
    # ListenPort = 443
    DocumentRoot /var/www/html
    ErrorLog /var/log/server_error.log
    # Security Settings (Use with caution)
    EnableSSL false
    SSLProtocol TLSv1.2 TLSv1.3
    # Obsolete Setting below
    AllowInsecureAuth no
    EOF
    ```

-   Create a small CSV data file: Create `data.csv`. Its contents are not shown in this copy; a plausible sample is sketched just after this setup list.
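A hypothetical `data.csv` consistent with how the file is used later (a header row, a numeric third column, and a status as the last field, with "FAIL" on the last line):

```bash
cat << EOF > data.csv
ID,Item,Quantity,Status
1,Widget,100,OK
2,Gadget,25,OK
3,Gizmo,42,FAIL
EOF
```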
Steps:
- Simple Substitution (First Occurrence): In `server.conf`, change the `ListenPort` from 80 to 8080.
    - Command: `sed 's/80/8080/' server.conf`
    - Observe: Only the `ListenPort = 80` line is changed. The output is printed to the terminal; the original file is untouched.
- Global Substitution: Suppose we wanted to replace all instances of "Server" with "Host" (case-sensitive).
    - Command: `sed 's/Server/Host/g' server.conf`
    - Observe: Both "ServerName" and "Server Configuration File" (in the comment) are changed to "HostName" and "Host Configuration File". The `g` flag makes the substitution global on each line.
- Case-Insensitive Global Substitution: Replace all occurrences of "server" (any case) with "APPLICATION".
    - Command: `sed 's/server/APPLICATION/gi' server.conf`
    - Observe: "ServerName", "server_error.log", and the other occurrences are changed. The `i` flag makes the search case-insensitive.
- Using `&` for Backreference: Enclose the document root path in double quotes.
    - Command: `sed 's#^DocumentRoot .*#DocumentRoot "&"#' server.conf` (using `#` as the delimiter because the pattern contains `/`)
    - Observe: `&` stands for the entire matched text, which here is the whole line `DocumentRoot /var/www/html`, so this produces `DocumentRoot "DocumentRoot /var/www/html"`. That is not quite what we want: only the path should be quoted. Let's refine with capture groups.
    - Better Command: `sed 's#^\(DocumentRoot \)\(.*\)#\1"\2"#' server.conf`
    - Observe: `\(DocumentRoot \)` captures the first part into `\1`, and `\(.*\)` captures the rest (the path) into `\2`. The replacement `\1"\2"` puts quotes around just the path, producing `DocumentRoot "/var/www/html"`.
- Deleting Lines: Remove all comment lines (starting with `#`) from `server.conf`.
    - Command: `sed '/^#/d' server.conf`
    - Observe: All lines starting with `#` are gone from the output.
- Deleting Lines in a Range: Remove the "Security Settings" comment and the settings below it, up to the "Obsolete Setting" comment.
    - Command: `sed '/# Security Settings/,/# Obsolete Setting/d' server.conf`
    - Observe: The lines from `# Security Settings` down to and including `# Obsolete Setting below` are deleted.
- Selective Printing with `-n` and `p`: Print only the lines containing "Log" from `server.conf`.
    - Command: `sed -n '/Log/p' server.conf`
    - Observe: Only the `ErrorLog` line is printed. The `-n` suppresses default output, and `p` prints only matching lines.
- Combining Commands (`-e`): Change `primary.example.com` to `main.domain.local` AND change `EnableSSL false` to `EnableSSL true`.
    - Command: `sed -e 's/primary\.example\.com/main.domain.local/' -e 's/EnableSSL false/EnableSSL true/' server.conf` (note: `.` is escaped as `\.` to match a literal dot)
    - Observe: Both substitutions are applied.
- Inserting Text: Insert a new setting `User webadmin` before the `DocumentRoot` line.
    - Command: `sed '/^DocumentRoot/i \User webadmin' server.conf`
    - Observe: The line `User webadmin` appears immediately before the `DocumentRoot` line.
- Appending Text: Append a comment `# End of Basic Settings` after the `ErrorLog` line.
    - Command: `sed '/^ErrorLog/a \# End of Basic Settings' server.conf`
    - Observe: The comment appears immediately after the `ErrorLog` line.
- Changing Lines: Change the "Obsolete Setting" line and the `AllowInsecureAuth` line below it to a single comment `# Authentication handled by upstream proxy`.
    - Command: `sed '/# Obsolete Setting/,/AllowInsecureAuth/c \# Authentication handled by upstream proxy' server.conf`
    - Observe: The two lines matching the range are replaced by the single new comment line.
- Processing CSV Data: Change the status "FAIL" to "CRITICAL" in `data.csv`.
    - Command: `sed 's/,FAIL$/,CRITICAL/' data.csv`
    - Observe: The last line's status is changed. We use `$` to anchor the match to the end of the line, ensuring we only change the status field.
- In-Place Edit (Simulation with Backup): Let's change `ListenPort` to `443` in the original file, but create a backup first.
    - First, verify the command: `sed 's/ListenPort = 80/ListenPort = 443/' server.conf` (looks correct)
    - Now, execute with `-i.bak`: `sed -i.bak 's/ListenPort = 80/ListenPort = 443/' server.conf`
    - Check:
        - `ls server.conf*` (you should see `server.conf` and `server.conf.bak`)
        - `cat server.conf` (shows the modified file with `ListenPort = 443`)
        - `cat server.conf.bak` (shows the original file with `ListenPort = 80`)
Cleanup (Optional):
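The cleanup commands are not shown here; removing the files created in this workshop would look like:

```bash
rm -f server.conf server.conf.bak data.csv
```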
This workshop covered fundamental `sed` operations. `sed` truly shines when combined with `grep` and other tools in pipelines for complex automated text-processing tasks. Remember that the `-i` option is powerful but potentially destructive; always test first!
4. Advanced Text Processing with awk
While `sed` operates primarily on lines, `awk` is designed for field-oriented processing. It treats each input line (called a record) as being composed of multiple fields, which are typically separated by whitespace (spaces or tabs) by default. `awk` allows you to easily extract, manipulate, and report on data based on these fields, making it exceptionally useful for processing structured text data like CSV files, log files, or command output.

`awk` is not just a command; it's a complete programming language with variables, arithmetic operations, string functions, control structures (if, loops), and associative arrays.
How `awk` Works: The Basic Model

`awk` reads its input (files or standard input) one record (line) at a time. For each record, it performs the following:
- Splits the record into fields: Based on the current Field Separator (FS), `awk` divides the record into fields.
    - The fields are accessible using variables: `$1` for the first field, `$2` for the second, and so on.
    - `$0` represents the entire, unmodified record.
- Evaluates patterns: `awk` checks the record against each `pattern { action }` rule provided in the script.
- Executes actions: If a pattern matches the current record (or if no pattern is specified, which matches every record), the corresponding `{ action }` block is executed. Actions typically involve printing, calculations, or manipulating variables.
- Repeats: Continues this process until all records are read.

`awk` Script Structure

An `awk` script consists of a series of `pattern { action }` rules.

- `pattern`: Specifies when the action should be executed. It can be:
    - Omitted: The action executes for every input record.
    - `/regexp/`: A regular expression (ERE). The action executes if the entire record (`$0`) matches the regex.
    - `expression`: A conditional expression (e.g., `$3 > 100`, `$1 == "ERROR"`). The action executes if the expression evaluates to true (non-zero or non-empty). Field comparisons are done numerically or lexicographically depending on context.
    - `pattern1, pattern2`: A range pattern (like `sed`). The action executes for all records starting from one matching `pattern1` up to the next one matching `pattern2`.
    - `BEGIN`: A special pattern. The associated action is executed once before any input records are read. Useful for initializing variables or printing headers.
    - `END`: A special pattern. The associated action is executed once after all input records have been processed. Useful for calculating totals or printing summaries.
- `{ action }`: A block of `awk` statements enclosed in curly braces. If omitted for a pattern, the default action is `{ print $0 }` (print the entire matching record). Common actions include:
    - `print expression1, expression2, ...`: Prints the expressions, separated by the Output Field Separator (OFS), a space by default. Without arguments, `print` is equivalent to `print $0`.
    - `printf format, expression1, ...`: Formatted printing, similar to C's `printf`.
    - Variable assignments (e.g., `count = count + 1` or `count++`).
    - Control structures (`if (...) { ... } else { ... }`, `while (...) { ... }`, `for (...) { ... }`).

Built-in Variables

`awk` provides many useful built-in variables:

- `$0`: The entire current input record.
- `$1`, `$2`, ... `$N`: The fields of the current record.
- `NF` (Number of Fields): The total number of fields in the current record. `$NF` refers to the last field.
- `NR` (Number of Records): The total number of input records processed so far (cumulative line number).
- `FNR` (File Number of Records): The record number within the current input file (resets for each file).
- `FS` (Field Separator): The regular expression used to separate fields on input. Default is whitespace (`" "`). Can be set using the `-F` command-line option (e.g., `-F,` for comma-separated) or by assigning to `FS` within the script (often in a `BEGIN` block).
- `OFS` (Output Field Separator): The string used to separate fields in the output of `print`. Default is a single space (`" "`).
- `ORS` (Output Record Separator): The string output after each record by `print`. Default is a newline (`"\n"`).
- `FILENAME`: The name of the current input file.
Basic `awk` Usage Examples
The commands for each of these examples are collected in the sketch after this list.

- Print specific fields: Print the first and third fields of each line.
- Print lines matching a pattern: Print lines where the second field is exactly "ERROR".
- Using BEGIN for a header: Print a header, then the user ID and name from `users.txt` (comma-separated).
    - `-F,`: Sets the field separator to a comma.
    - `BEGIN { ... }`: Prints the header before processing lines.
    - `NR > 1`: Pattern to skip the header line in the input file (`NR` is the record number).
    - `{ print $1, $2 }`: Prints the first and second fields for records where `NR > 1`.
- Using END for a summary: Count the number of lines.
- Performing calculations: Sum the values in the third column of `data.csv`.
    - `sum += $3`: For each data line (`NR > 1`), add the value of the 3rd field to the `sum` variable (`awk` initializes numeric variables to 0).
    - `END { ... }`: After processing all lines, print the final sum.
`awk`'s ability to handle fields and perform calculations makes it significantly more powerful than `sed` for structured data analysis and reporting directly on the command line.
Workshop Advanced Text Processing with awk
Objective: To practice using `awk` for field extraction, pattern matching, calculations, and using `BEGIN`/`END` blocks.
Scenario: We'll analyze web server log data and process the user data file.
Setup:
-   Create a sample web server log file: Create `access.log`. This simulates a common log format (IP Address, -, -, Timestamp, Request, Status Code, Size).

    ```bash
    cat << EOF > access.log
    192.168.1.10 - - [26/Oct/2023:10:15:01 +0000] "GET /index.html HTTP/1.1" 200 512
    10.0.0.5 - - [26/Oct/2023:10:15:05 +0000] "GET /images/logo.png HTTP/1.1" 200 2048
    192.168.1.10 - - [26/Oct/2023:10:16:10 +0000] "POST /login HTTP/1.1" 302 128
    172.16.5.20 - - [26/Oct/2023:10:17:00 +0000] "GET /styles.css HTTP/1.1" 200 1024
    10.0.0.5 - - [26/Oct/2023:10:17:30 +0000] "GET /favicon.ico HTTP/1.1" 404 50
    192.168.1.10 - - [26/Oct/2023:10:18:00 +0000] "GET /dashboard HTTP/1.1" 200 4096
    10.0.0.5 - - [26/Oct/2023:10:18:05 +0000] "GET /api/users HTTP/1.1" 500 100
    EOF
    ```

-   Ensure `users.txt` exists: If you removed it previously, recreate it using the commands from the Workshop in section 2.

    ```bash
    cat << EOF > users.txt
    User ID,Name,Department,Status,Last Login
    101,Alice Smith,Engineering,Active,2023-10-25
    102,Bob Johnson,Sales,Inactive,2023-09-10
    103,Charlie Brown,Engineering,Active,2023-10-26
    104,David Williams,Support,Active,2023-10-26
    105,Eve Davis,Sales,Active,2023-10-24
    106,Frank Miller,Support,Pending,2023-10-20
    107,Grace Wilson,Engineering,active,2023-10-26
    EOF
    ```
Steps:
- Extract Specific Fields: Print the IP address (field 1) and the requested path (field 7) from `access.log`.
    - Command: `awk '{ print $1, $7 }' access.log`
    - Observe: Each line shows the IP and the path requested. `awk` splits fields by whitespace by default.
- Using `BEGIN` for a Header: Print a header "IP Address -> Request Path" before the output from step 1.
    - Command: `awk 'BEGIN { print "IP Address -> Request Path" } { print $1, $7 }' access.log`
    - Observe: The header line appears first.
- Conditional Printing (Expression): Print only the requests that resulted in a `404` status code (field 9).
    - Command: `awk '$9 == 404 { print $0 }' access.log`
    - Observe: Only the line containing `/favicon.ico` (which has status 404) is printed. Note that `awk` treats `$9` as a number here for the comparison.
- Conditional Printing (Regex): Print requests made for PNG image files.
    - Command: `awk '$7 ~ /\.png$/ { print $0 }' access.log`
    - Observe: Only the line requesting `/images/logo.png` is printed.
        - `$7 ~ /pattern/`: This `awk` syntax checks whether field 7 matches the regular expression `/pattern/`.
        - `/\.png$/`: The regex matches a literal dot (`\.`) followed by `png` at the end of the string (`$`).
- Using `-F` for a Different Delimiter: Print the Name (field 2) and Department (field 3) for all users in `users.txt`.
    - Command: `awk -F, '{ print $2, $3 }' users.txt`
    - Observe: The command fails to skip the header, so the output also includes the header fields "Name Department".
    - Refined Command (skip header): `awk -F, 'NR > 1 { print "Name:", $2, "| Department:", $3 }' users.txt`
    - Observe: Sets comma as the delimiter (`-F,`), skips the first line (`NR > 1`), and prints formatted output for fields 2 and 3.
- Performing Calculations: Calculate the total bytes transferred (field 10) for all successful requests (status code 200, field 9) in `access.log`.
    - Command: `awk '$9 == 200 { total_bytes += $10 } END { print "Total bytes for successful requests:", total_bytes }' access.log`
    - Observe:
        - `$9 == 200`: The pattern matches only lines with status 200.
        - `{ total_bytes += $10 }`: The action adds the value of field 10 to the `total_bytes` variable for matching lines.
        - `END { ... }`: After processing all lines, prints the final sum.
- Counting Items: Count how many requests came from the IP address `10.0.0.5`.
    - Command: `awk '$1 == "10.0.0.5" { count++ } END { print "Requests from 10.0.0.5:", count }' access.log`
    - Observe: Counts lines where the first field is `10.0.0.5` and prints the total at the end.
- Using `printf` for Formatted Output: Print the status (field 4) and name (field 2) of users from `users.txt`, aligning the output neatly.
    - Command: `awk -F, 'NR > 1 { printf "Status: %-10s | Name: %s\n", $4, $2 }' users.txt`
    - Observe:
        - `-F,`: Comma delimiter.
        - `NR > 1`: Skip the header.
        - `printf format, val1, val2`: Formatted print.
        - `%-10s`: Format specifier for a string (`s`), left-aligned (`-`), in a field at least 10 characters wide.
        - `%s`: Format specifier for a simple string.
        - `\n`: Newline character. The output is neatly aligned.
- Modifying Field Separators: Print the User ID and Status from `users.txt`, but separate the output with a tab character. Also, skip the header.
    - Command: `awk -F, 'BEGIN { OFS="\t" } NR > 1 { print $1, $4 }' users.txt`
    - Observe:
        - `BEGIN { OFS="\t" }`: Sets the Output Field Separator to a tab before processing lines.
        - The `print $1, $4` action now uses a tab between the fields in the output.
- Combining Patterns and Actions: For requests in `access.log`, print "Large Request" followed by the line if the size (field 10) is greater than 1500 bytes, and print "Error Request" if the status code (field 9) is 500 or more.
    - Command: (the command is missing in this copy; a matching one is sketched just after these steps)
    - Observe: The lines for `logo.png` (2048 bytes) and `dashboard` (4096 bytes) are printed with "Large Request:". The line for `/api/users` (status 500) is printed with "Error Request:". Note that a single line can match multiple `pattern { action }` pairs.
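A command matching the behavior described in the last step (reconstructed from the description, not taken verbatim from the original) would be:

```bash
awk '$10 > 1500 { print "Large Request:", $0 } $9 >= 500 { print "Error Request:", $0 }' access.log
```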
Cleanup (Optional):
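The cleanup commands are not shown here; something like the following removes the files created above (both are recreated in later workshops if needed):

```bash
rm -f access.log users.txt
```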
This workshop demonstrated `awk`'s capability to parse structured data, perform conditional actions based on field values or patterns, calculate aggregates, and format output. `awk` is a versatile tool for data extraction, transformation, and reporting tasks on the Linux command line.
5. Sorting and Uniqueness (`sort`, `uniq`)

Often, after extracting or manipulating text, you need to organize it or remove duplicate entries. Linux provides two essential utilities for this: `sort` for ordering lines and `uniq` for identifying and filtering adjacent duplicate lines. They are frequently used together.

sort

The `sort` command rearranges lines from text files or standard input and prints the result to standard output. By default, it sorts based on comparing entire lines lexicographically (like dictionary order, but based on ASCII character values).
Basic Usage:
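```bash
sort [options] [file...]
```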
Common `sort` Options:

- `-r` (`--reverse`): Reverse the result of comparisons, sorting in descending order.
- `-n` (`--numeric-sort`): Compare according to string numerical value. Crucial for sorting numbers correctly (otherwise, "10" comes before "2").
- `-k POS1[,POS2]` (`--key=POS1[,POS2]`): Specify a sort key. Sort based on a field (key) within the line, not the entire line.
    - `POS1`: The starting position of the key. Fields are numbered starting from 1.
    - `POS2`: Optional ending position of the key. If omitted, the key extends to the end of the line.
    - Positions can have modifiers like `n` (numeric) or `r` (reverse) applied only to that key (e.g., `-k 3n` sorts numerically on field 3, `-k 1,1` sorts only on field 1).
- `-t <separator>` (`--field-separator=<separator>`): Specify the field separator character used when defining keys with `-k`. By default, fields are separated at the transition from non-blank to blank characters (handling multiple spaces/tabs gracefully). For specific delimiters like commas or colons, use `-t`.
- `-u` (`--unique`): Output only the first of an equal run. This is similar to piping the output through `uniq`, but `sort -u` can be more efficient as it handles uniqueness during the sort process. Note: when sorting by key, equality (and thus uniqueness) is decided by the sort key(s), not the entire line.
- `-f` (`--ignore-case`): Fold lowercase to uppercase characters for comparisons (case-insensitive sort).
- `-h` (`--human-numeric-sort`): Compare human-readable numbers (e.g., 2K, 1G). Requires GNU `sort`.
- `-M` (`--month-sort`): Compare as month names ("JAN" < "FEB" < ...).
- `--stable`: Use a stable sort algorithm. Lines that compare as equal maintain their original relative order. Default `sort` is usually stable, but explicitly requesting it can be necessary in complex scenarios or scripts relying on this behavior.
Important Note on Keys (`-k`):

- Fields are counted starting from 1.
- `-k 2` means the sort key starts at the beginning of field 2 and goes to the end of the line.
- `-k 2,2` means the sort key is only field 2.
- `-k 2,3` means the sort key includes field 2 and field 3.
- `-k 2.3,2.5` means the key starts at the 3rd character of field 2 and ends at the 5th character of field 2 (character positions also start from 1).
uniq

The `uniq` command filters adjacent matching lines from sorted input. It reads from standard input or a single file and writes unique lines to standard output. Crucially, `uniq` only detects duplicate lines if they are consecutive. This means you almost always need to use `sort` before piping data to `uniq`.
Basic Usage:
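```bash
uniq [options] [input-file [output-file]]
```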
Common `uniq` Options:

- `-c` (`--count`): Precede each output line with a count of how many times it occurred in the input (among adjacent lines).
- `-d` (`--repeated`): Only print duplicate lines, one for each group.
- `-u` (`--unique`): Only print lines that are not repeated (unique among adjacent lines).
- `-i` (`--ignore-case`): Ignore case differences when comparing lines.
- `-f <N>` (`--skip-fields=<N>`): Avoid comparing the first `N` fields. Fields are separated by whitespace by default.
- `-s <N>` (`--skip-chars=<N>`): Avoid comparing the first `N` characters. Applied after skipping fields (`-f`).
- `-w <N>` (`--check-chars=<N>`): Compare no more than `N` characters per line. Applied after skipping fields/chars.
Combining `sort` and `uniq`

The most common pattern is:

```bash
# Count occurrences of each line
command_producing_output | sort | uniq -c

# Get only unique lines
command_producing_output | sort | uniq

# Get only lines that appeared more than once
command_producing_output | sort | uniq -d
```
Alternatively, for just getting unique lines (equivalent to `sort | uniq`):
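```bash
command_producing_output | sort -u
```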
Choosing between `sort | uniq` and `sort -u` often depends on what you need next. If you need the counts (`uniq -c`), you must use `uniq`. `sort -u` might be slightly faster if you only need the unique lines themselves.
Workshop Sorting and Uniqueness
Objective: To practice sorting data numerically and lexicographically, using keys, and using `uniq` to count or filter duplicates.
Scenario: We'll work with a list of fruit names and quantities, and revisit the `access.log` file.
Setup:
-   Create a fruit list file: Create `fruits.txt`.

    ```bash
    cat << EOF > fruits.txt
    Apple,Red,15
    Banana,Yellow,25
    Orange,Orange,10
    Apple,Green,20
    Grape,Purple,50
    Banana,Green,18
    Orange,Blood,12
    Apple,Red,15
    EOF
    ```

    - Format: Fruit Name, Color, Quantity

-   Ensure `access.log` exists: Recreate it if necessary from the Workshop in section 4.

    ```bash
    cat << EOF > access.log
    192.168.1.10 - - [26/Oct/2023:10:15:01 +0000] "GET /index.html HTTP/1.1" 200 512
    10.0.0.5 - - [26/Oct/2023:10:15:05 +0000] "GET /images/logo.png HTTP/1.1" 200 2048
    192.168.1.10 - - [26/Oct/2023:10:16:10 +0000] "POST /login HTTP/1.1" 302 128
    172.16.5.20 - - [26/Oct/2023:10:17:00 +0000] "GET /styles.css HTTP/1.1" 200 1024
    10.0.0.5 - - [26/Oct/2023:10:17:30 +0000] "GET /favicon.ico HTTP/1.1" 404 50
    192.168.1.10 - - [26/Oct/2023:10:18:00 +0000] "GET /dashboard HTTP/1.1" 200 4096
    10.0.0.5 - - [26/Oct/2023:10:18:05 +0000] "GET /api/users HTTP/1.1" 500 100
    EOF
    ```
Steps:
- Default Sort: Sort `fruits.txt` lexicographically.
    - Command: `sort fruits.txt`
    - Observe: Lines are sorted alphabetically based on the entire line content. Note that "Apple,Green" comes before "Apple,Red".
- Reverse Sort: Sort `fruits.txt` in reverse alphabetical order.
    - Command: `sort -r fruits.txt`
    - Observe: The order from step 1 is reversed. "Orange,Orange" is now first.
- Numeric Sort (Incorrect without Key): Try sorting `fruits.txt` numerically (this won't work as expected yet).
    - Command: `sort -n fruits.txt`
    - Observe: The result is likely the same as the default sort, because the beginning of each line ("Apple", "Banana") is not numeric. `-n` applies to the whole line unless a key is specified.
- Sort by Key (Field): Sort `fruits.txt` based on the fruit name (field 1). Use comma as a delimiter.
    - Command: `sort -t, -k 1,1 fruits.txt`
    - Observe: All "Apple" lines are grouped, "Banana" lines are grouped, etc. Within each group, the original relative order might be preserved (depending on sort stability) or determined by subsequent fields. `-t,` sets the delimiter; `-k 1,1` means sort only on field 1.
- Sort by Numeric Key: Sort `fruits.txt` based on the quantity (field 3), numerically.
    - Command: `sort -t, -k 3n fruits.txt`
    - Observe: Lines are sorted from the smallest quantity (10) to the largest (50). `-k 3n` applies numeric sorting (`n`) specifically to the key starting at field 3.
- Sort by Key (Reverse Numeric): Sort by quantity (field 3) in descending order.
    - Command: `sort -t, -k 3nr fruits.txt`
    - Observe: Lines are sorted from the largest quantity (50) down to the smallest (10). `nr` applies a reverse numeric sort to the key.
- Secondary Sort Key: Sort primarily by fruit name (field 1), and secondarily by quantity (field 3, numerically).
    - Command: `sort -t, -k 1,1 -k 3n fruits.txt`
    - Observe: Lines are grouped by fruit name ("Apple", "Banana", ...). Within each group (e.g., "Apple"), the lines are sorted by quantity (15, 15, 20).
- Using `uniq` (Basic): Show only the unique lines from `fruits.txt`. Remember, `uniq` needs sorted input!
    - Command: `sort fruits.txt | uniq`
    - Observe: The duplicate line "Apple,Red,15" appears only once.
- Using `sort -u`: Achieve the same result as step 8 using `sort -u`.
    - Command: `sort -u fruits.txt`
    - Observe: The output should be identical to `sort fruits.txt | uniq`.
- Counting Occurrences with `uniq -c`: Count how many times each unique line appears in `fruits.txt`.
    - Command: `sort fruits.txt | uniq -c`
    - Observe: Each unique line is prefixed with its count. "Apple,Red,15" should have a count of `2`.
- Sorting the Counts: Show the counts from step 10, sorted from most frequent to least frequent line.
    - Command: `sort fruits.txt | uniq -c | sort -nr`
    - Observe: The output is sorted numerically (`-n`) and in reverse (`-r`) based on the count (which is the first field of `uniq -c`'s output). The line with count 2 appears first.
- Finding Duplicate Lines with `uniq -d`: Show only the lines that appear more than once in `fruits.txt`.
    - Command: `sort fruits.txt | uniq -d`
    - Observe: Only "Apple,Red,15" is printed.
- Finding Truly Unique Lines with `uniq -u`: Show only the lines that appear exactly once in `fruits.txt`.
    - Command: `sort fruits.txt | uniq -u`
    - Observe: All lines except "Apple,Red,15" are printed.
- Real-World Example (Top IP Addresses): Find the top 3 IP addresses accessing the web server based on `access.log`. The full pipeline is sketched just after these steps.
    - Step 1: Extract the IP addresses (field 1).
    - Step 2: Sort the IP addresses.
    - Step 3: Count occurrences of each IP.
    - Step 4: Sort the counts numerically in reverse order.
    - Step 5: Get the top 3 using `head`.
    - Observe the final output showing the count and IP address for the 3 most frequent visitors. This demonstrates a typical pipeline combining `awk`, `sort`, `uniq`, and `head`.
Cleanup (Optional):
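The cleanup commands are not shown here; something like the following removes the sample files (both are recreated later if needed):

```bash
rm -f fruits.txt access.log
```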
This workshop illustrated how `sort` orders data based on various criteria (lexical, numeric, key-based) and how `uniq` filters or counts adjacent duplicates in sorted input. Combining them is essential for summarizing and analyzing text data.
6. Counting Lines, Words, and Characters (`wc`)

Sometimes, you don't need to see the content itself, but rather get metrics about it: how many lines, words, or characters does a file or command output contain? The `wc` (Word Count) command is the standard utility for this task.
Basic Usage
`wc` reads from the specified files or standard input and prints counts based on the options provided, followed by the filename if input comes from a file. If multiple files are given, it prints counts for each file and then a total line.
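```bash
wc [options] [file...]
```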
Core `wc` Options

- `-l` (`--lines`): Print the newline counts. This effectively counts the number of lines.
- `-w` (`--words`): Print the word counts. Words are typically sequences of non-whitespace characters separated by whitespace.
- `-c` (`--bytes`): Print the byte counts.
- `-m` (`--chars`): Print the character counts. This can differ from byte counts (`-c`) for multibyte character encodings (like UTF-8). Often, `-m` is what you want for a "character" count on modern systems.
- `-L` (`--max-line-length`): Print the maximum display width (length of the longest line).
Default Behavior
If no options are specified, `wc` prints the line count, word count, and byte count (in that order), followed by the filename (if applicable).
Common Use Cases
- Counting lines in a file:
wc -l filename.txt
- Counting files in a directory:
ls -1 | wc -l
(Note:ls -1
lists one file per line) - Counting words in piped output:
grep 'ERROR' system.log | wc -w
(Counts words only in the lines containing "ERROR") - Checking if a file is empty: An empty file will have 0 lines, 0 words, and 0 bytes.
- Getting only the number: Often, you pipe the output of
wc -l
(which includes the filename) toawk '{print $1}'
or use command substitution if you just need the numeric value in a script.
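A minimal sketch, assuming a hypothetical `notes.txt`:

```bash
wc -l notes.txt | awk '{print $1}'   # strip the filename from wc's output
lines=$(wc -l < notes.txt)           # reading from stdin: wc prints no filename
echo "There are $lines lines."
```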
wc
is a simple but fundamental utility for quick summaries of text data size and structure.
Workshop Counting with wc
Objective: To practice using wc
to count lines, words, bytes, and characters in files and piped input.
Scenario: We will use the fruits.txt
file and output from other commands.
Setup:
- Recreate
fruits.txt
: (Note: Added an extra blank line at the end for demonstration)
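The original listing is not reproduced here; a plausible reconstruction, consistent with the observations below (eight data lines, a duplicated "Apple,Red,15", four fruit types, and a trailing blank line), would be:

```bash
cat << EOF > fruits.txt
Apple,Red,15
Banana,Yellow,30
Apple,Green,20
Orange,Orange,25
Grape,Purple,40
Apple,Red,15
Banana,Green,12
Grape,Green,35

EOF
```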
Steps:
-
Default
wc
Output: Runwc
onfruits.txt
without any options.- Command:
wc fruits.txt
- Observe: The output shows the line count, word count, and byte count, followed by the filename. The counts should reflect the 8 data lines plus the blank line (total 9 lines). Words are counted by whitespace separation, so "Apple,Red,15" counts as a single word because it contains no internal whitespace. The byte count depends on the file's encoding and line endings (LF vs CRLF).
- Command:
-
Count Lines: Get only the line count for
fruits.txt
.- Command:
wc -l fruits.txt
- Observe: Shows the number of lines (should be 9) and the filename.
- Command:
-
Count Words: Get only the word count for
fruits.txt
.- Command:
wc -w fruits.txt
- Observe: Shows the word count and the filename.
- Command:
-
Count Bytes: Get only the byte count for
fruits.txt
.- Command:
wc -c fruits.txt
- Observe: Shows the byte count and the filename.
- Command:
-
Count Characters: Get only the character count (useful for multi-byte encodings).
- Command:
wc -m fruits.txt
- Observe: Shows the character count and the filename. On systems using UTF-8 and standard Unix line endings (LF),
-m
and-c
might give the same result if the file only contains ASCII characters. If it contained multi-byte characters, they could differ.
- Command:
-
Find Longest Line: Find the length of the longest line in
fruits.txt
.- Command:
wc -L fruits.txt
- Observe: Shows the maximum line length (in display width) and the filename.
- Command:
-
Counting from Standard Input: Count the lines output by the
ls -1 /etc
command (listing files in /etc, one per line).- Command:
ls -1 /etc | wc -l
- Observe: Shows the number of files and directories directly within
/etc
. No filename is printed bywc
because it's reading from the pipe (standard input).
- Command:
-
Counting Filtered Output: Count how many different types of fruit are listed in
fruits.txt
(ignoring color and quantity).- Step 1: Extract the first field (fruit name).
- Step 2: Sort the names.
- Step 3: Get unique names.
- Step 4: Count the unique names.
- Observe: The final output should be 4, representing the four fruit types: Apple, Banana, Grape, and Orange.
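One way to assemble the four steps (a sketch, not necessarily the exact original command):

```bash
# Extract field 1, drop the trailing blank line so it is not counted as a "type",
# then sort, deduplicate, and count. awk -F, '{print $1}' would work equally well.
cut -d, -f1 fruits.txt | grep -v '^$' | sort | uniq | wc -l
```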
-
Multiple Files: Count lines, words, and bytes in both
fruits.txt
and a system file like/etc/passwd
(if readable).- Command:
wc fruits.txt /etc/passwd
- Observe:
wc
prints the counts forfruits.txt
, then the counts for/etc/passwd
, and finally atotal
line summing the counts from both files.
- Command:
Cleanup (Optional):
This workshop showed how wc
provides quick statistics about text data, both from files and standard input, making it useful for summaries and checks within scripts or command pipelines.
7. Comparing File Contents (diff
, comm
)
Comparing files to identify differences is a common task, especially when tracking changes in configuration files, source code, or datasets. Linux offers two primary tools for this: diff
, which shows the differences in detail, and comm
, which identifies common and unique lines between sorted files.
diff
The diff
command compares two files line by line and outputs a description of the changes required to make the first file identical to the second file. It's the basis for creating patches (.diff
or .patch
files).
Basic Usage:
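The general form is:

```bash
diff [OPTIONS] FILE1 FILE2
```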
Output Formats:
diff
supports several output formats. The most common are:
-
Normal Format (Default):
- Shows differences using action characters (
a
for append,d
for delete,c
for change) along with line numbers and the differing lines themselves. Lines fromfile1
are prefixed with<
, lines fromfile2
are prefixed with>
. - Example:
1a2
means after line 1 offile1
, you need to append line 2 offile2
.3d2
means delete line 3 fromfile1
(which corresponds to line 2's position infile2
after the deletion).5c5
means change line 5 offile1
to become line 5 offile2
. - This format is concise but can be hard to read for larger changes.
- Shows differences using action characters (
-
Context Format (
-c
or-C NUM
,--context[=NUM]
):- Shows differing sections along with several lines (default 3, or
NUM
lines) of surrounding context (unchanged lines). - Lines from
file1
are marked with!
(or-
for deleted). Lines fromfile2
are marked with!
(or+
for added). Unchanged context lines start with two spaces. File headers (*** file1
,--- file2
) are included. - Easier to understand the location of changes.
- Shows differing sections along with several lines (default 3, or
-
Unified Format (
-u
or-U NUM
,--unified[=NUM]
):- Similar to context format but more compact, avoiding repetition of context lines. This is the most common format for patches used in version control systems (like Git) and software distribution.
- File headers are
--- file1
and+++ file2
. - Change hunks start with
@@ -line1,count1 +line2,count2 @@
. - Lines unique to
file1
(deleted) start with-
. - Lines unique to
file2
(added) start with+
. - Context lines start with a space.
- Highly recommended for readability and sharing.
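For illustration, a unified diff of two hypothetical three-line files (old.txt and new.txt, header timestamps abbreviated) might look like this:

```diff
--- old.txt     2023-10-26 10:00
+++ new.txt     2023-10-26 10:05
@@ -1,3 +1,3 @@
 first line
-second line
+second line, revised
 third line
```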
Common diff
Options:
-i
(--ignore-case
): Ignore case differences in file contents.-w
(--ignore-all-space
): Ignore all whitespace differences (tabs, spaces).-b
(--ignore-space-change
): Ignore changes in the amount of whitespace (e.g., one space vs. two).-B
(--ignore-blank-lines
): Ignore changes whose lines are all blank.-q
(--brief
): Report only whether files differ, not the details. Prints "Files X and Y differ".-s
(--report-identical-files
): Report when two files are the same.-r
(--recursive
): Recursively compare subdirectories found. When comparing directories,diff
will compare files with the same name in both directories.-N
(--new-file
): Treat absent files as empty. Useful with-r
if a file exists in one directory but not the other.-y
(--side-by-side
): Output in two columns, showing lines side-by-side. Can be combined with--width=NUM
to control line width. Differences are marked.
comm
The comm
command compares two sorted files line by line and produces three columns of output:
- Lines unique to
file1
. - Lines unique to
file2
. - Lines common to both files.
Crucially, both input files MUST be sorted lexicographically for comm
to work correctly.
Basic Usage:
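The general form is:

```bash
comm [OPTIONS] FILE1 FILE2      # both files must already be sorted
```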
Common comm
Options:
-1
: Suppress output column 1 (lines unique tofile1
).-2
: Suppress output column 2 (lines unique tofile2
).-3
: Suppress output column 3 (lines common to both files).--check-order
: Check that the input is correctly sorted; signal an error if not.--nocheck-order
: Do not check input order (can be faster, but results are undefined if not sorted).--output-delimiter=STR
: Separate columns withSTR
instead of the default tab character.
Example Use Cases for comm
:
- Find lines present in
file2
but not infile1
:comm -13 file1 file2
(suppress unique to file1, suppress common). - Find lines common to both files:
comm -12 file1 file2
(suppress unique to file1, suppress unique to file2).
Choosing Between diff
and comm
- Use
diff
when you need to see the specific changes required to transform one file into another, especially for creating patches or understanding detailed modifications (code, configs). Unified format (-u
) is generally preferred. - Use
comm
when you have sorted lists and want to quickly find common items, or items unique to one list or the other. It's often used for set operations on lists (intersection, difference).
Workshop Comparing Files
Objective: To practice using diff
to identify changes between files in different formats and comm
to compare sorted lists.
Scenario: We'll create two versions of a simple configuration file and two lists of users.
Setup:
-
Create
config_v1.txt
: -
Create
config_v2.txt
(modified version): -
Create
users_group1.txt
(sorted): -
Create
users_group2.txt
(sorted):
Steps using diff
:
-
Default
diff
: Compareconfig_v1.txt
andconfig_v2.txt
.- Command:
diff config_v1.txt config_v2.txt
- Observe: The output uses the
c
(change) andd
(delete)/a
(add) notation. Notice how it describes changing lines 1-6 of v1 into lines 1-7 of v2. It can be hard to follow precisely which parts changed.
- Command:
-
Context
diff
: Compare using context format.- Command:
diff -c config_v1.txt config_v2.txt
- Observe: Easier to read. Shows file headers (
***
,---
) and context lines around the changes. Changed lines are marked with!
. Added lines with+
, deleted with-
.
- Command:
-
Unified
diff
(Recommended): Compare using unified format.- Command:
diff -u config_v1.txt config_v2.txt
- Observe: The most common patch format. Shows headers (
---
,+++
), hunks (@@ ... @@
), context lines (start with space), deleted lines (-
), and added lines (+
). This clearly shows the changes line by line.
- Command:
-
Ignore Whitespace Changes: Let's create a third file with only whitespace changes.
- Command:
cp config_v1.txt config_v1_ws.txt
- Command:
sed -i 's/=/ = /' config_v1_ws.txt
(Adds spaces around=
) - Command:
diff -u config_v1.txt config_v1_ws.txt
(Shows differences) - Command:
diff -uw config_v1.txt config_v1_ws.txt
(Ordiff -u -w
) - Observe: The
-w
option makesdiff
ignore the whitespace changes, reporting no differences (or only other non-whitespace differences if there were any).
- Command:
-
Brief Output: Just check if the files differ.
- Command:
diff -q config_v1.txt config_v2.txt
- Observe: Prints
Files config_v1.txt and config_v2.txt differ
. - Command:
diff -q config_v1.txt config_v1.txt
- Observe: Prints nothing, as the files are identical.
- Command:
diff -qs config_v1.txt config_v1.txt
- Observe:
-s
makes it report identical files:Files config_v1.txt and config_v1.txt are identical
.
- Command:
Steps using comm
:
Important: Remember comm
requires sorted input. Our users_*.txt
files were created sorted.
-
Default
comm
: Compareusers_group1.txt
andusers_group2.txt
.- Command:
comm users_group1.txt users_group2.txt
- Observe: Three columns (separated by tabs):
- Column 1:
charlie
,david
,frank
(unique to group1) - Column 2:
eve
,grace
,mallory
(unique to group2) - Column 3:
alice
,bob
(common to both)
- Column 1:
- Command:
-
Find Common Users: Show only users present in both groups.
- Command:
comm -12 users_group1.txt users_group2.txt
- Observe: Suppresses columns 1 and 2, leaving only
alice
andbob
.
- Command:
-
Find Users Only in Group 1: Show users present in group1 but not group2.
- Command:
comm -23 users_group1.txt users_group2.txt
- Observe: Suppresses columns 2 and 3, leaving
charlie
,david
,frank
.
- Command:
-
Find Users Only in Group 2: Show users present in group2 but not group1.
- Command:
comm -13 users_group1.txt users_group2.txt
- Observe: Suppresses columns 1 and 3, leaving
eve
,grace
,mallory
.
- Command:
-
Demonstrate Unsorted Input: Try
comm
on the unsorted config files.- Command:
comm config_v1.txt config_v2.txt
- Observe: The output is likely incorrect or might produce an error message (
comm: file 1 is not in sorted order
) depending on yourcomm
version and locale settings, because the files are not lexicographically sorted. This highlights the importance of sorting input forcomm
.
- Command:
Cleanup (Optional):
This workshop demonstrated how diff
provides detailed change information (especially with -u
) and how comm
efficiently finds commonalities and differences between pre-sorted lists.
8. Character Translation or Deletion (tr
)
The tr
command is a simple yet useful utility for translating or deleting characters. It reads from standard input and writes to standard output, making character-level substitutions or removals based on specified sets of characters. It does not take filenames as arguments; input must be redirected or piped to it.
Basic Syntax
-
Translate:
tr [options] SET1 SET2
- Replaces characters found in
SET1
with the corresponding character inSET2
. IfSET2
is shorter thanSET1
, the last character ofSET2
is repeated.
- Replaces characters found in
-
Delete:
tr [options] -d SET1
- Deletes all characters found in
SET1
from the input.
- Deletes all characters found in
-
Squeeze Repeats:
tr [options] -s SET1
- Replaces sequences of a character repeated multiple times in
SET1
with a single instance of that character. Can be combined with translation.
- Replaces sequences of a character repeated multiple times in
Defining Character Sets (SET1
, SET2
)
Sets are strings of characters. tr
provides several ways to define them:
- Literal Characters:
tr 'abc' 'xyz'
translates 'a' to 'x', 'b' to 'y', 'c' to 'z'. - Ranges:
a-z
represents all lowercase letters,0-9
all digits.tr 'a-z' 'A-Z'
converts input to uppercase. - Character Classes (POSIX): Predefined sets enclosed in
[:...:]
. These are often more portable than explicit ranges, especially across different locales.[:alnum:]
: Alphanumeric characters (a-z
,A-Z
,0-9
).[:alpha:]
: Alphabetic characters (a-z
,A-Z
).[:digit:]
: Digits (0-9
).[:lower:]
: Lowercase letters.[:upper:]
: Uppercase letters.[:space:]
: Whitespace characters (space, tab, newline, etc.).[:punct:]
: Punctuation characters.[:cntrl:]
: Control characters.[:print:]
: Printable characters (including space).[:graph:]
: Printable characters (not including space).[:xdigit:]
: Hexadecimal digits (0-9
,a-f
,A-F
).- Example:
tr '[:lower:]' '[:upper:]'
converts to uppercase (often preferred overa-z A-Z
for locale correctness).
- Octal Escapes:
\NNN
represents the character with octal value NNN.\n
is newline,\t
is tab. - Repetition:
[c*N]
inSET2
means N repetitions of characterc
.[c*]
means repeatc
indefinitely (useful ifSET2
needs to be as long asSET1
).
Common Options
-d
(--delete
): Delete characters inSET1
.SET2
must not be specified.-s
(--squeeze-repeats
): Squeeze repeated characters listed in the last specified set (SET1
if only one set,SET2
if translating).-c
or-C
(--complement
): Use the complement ofSET1
. All characters not inSET1
are selected.
Use Cases
- Changing case:
tr 'a-z' 'A-Z'
ortr '[:lower:]' '[:upper:]'
. - Deleting unwanted characters:
tr -d '[:digit:]'
removes all numbers.tr -d '\r'
removes carriage return characters (useful for DOS/Windows files). - Converting delimiters:
tr ',' '\t'
converts commas to tabs. - Squeezing whitespace:
tr -s ' '
replaces multiple spaces with a single space.tr -s '[:space:]'
squeezes all types of whitespace. - Extracting specific characters:
tr -cd '[:alnum:]'
deletes the complement of alphanumerics, effectively keeping only letters and numbers.
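As a small end-to-end sketch (filenames are hypothetical): strip DOS carriage returns and collapse runs of spaces in a single pipeline.

```bash
tr -d '\r' < report_dos.txt | tr -s ' ' > report_unix.txt
```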
tr
is efficient for simple, character-by-character transformations applied uniformly across an input stream.
Workshop Character Translation or Deletion
Objective: To practice using tr
for case conversion, character deletion, delimiter replacement, and squeezing repeated characters.
Scenario: We will manipulate text strings provided via echo
and standard input.
Setup: No specific files are needed; we will use echo
and direct input.
Steps:
-
Uppercase Conversion: Convert a lowercase string to uppercase.
- Command:
echo "hello world" | tr 'a-z' 'A-Z'
- Observe: Output is
HELLO WORLD
. - Alternative (POSIX classes):
echo "hello world 123" | tr '[:lower:]' '[:upper:]'
- Observe: Output is
HELLO WORLD 123
.
- Command:
-
Lowercase Conversion: Convert a mixed-case string to lowercase.
- Command:
echo "Linux Is FUN!" | tr '[:upper:]' '[:lower:]'
- Observe: Output is
linux is fun!
.
- Command:
-
Deleting Specific Characters: Remove all vowels (aeiou) from a string.
- Command:
echo "quick brown fox jumps over the lazy dog" | tr -d 'aeiouAEIOU'
- Observe: Output is
qck brwn fx jmps vr th lzy dg
.-d
deletes characters in the specified set.
- Command:
-
Deleting Non-Printable Characters (Example): Imagine a string with a control character (we'll simulate with backspace
\b
).- Command:
echo -e "This\b is a\b test" | cat -v
(Usecat -v
to visualize control chars like^H
for backspace) - Command:
echo -e "This\b is a\b test" | tr -d '[:cntrl:]'
- Observe: The
tr
command removes the backspace characters, resulting inThis is a test
.
- Command:
-
Replacing Delimiters: Convert spaces to newlines (one word per line).
- Command:
echo "This is a line of text" | tr ' ' '\n'
- Observe:
- Command:
-
Squeezing Repeated Characters: Remove extra spaces between words.
- Command:
echo "Too many spaces here" | tr -s ' '
- Observe: Output is
Too many spaces here
.-s
squeezes repeated spaces into one.
- Command:
-
Squeezing All Whitespace: Replace any sequence of whitespace characters with a single newline.
- Command:
echo -e "Field1\tField2 Field3\nField4" | tr -s '[:space:]' '\n'
- Observe:
Multiple spaces, tabs (
\t
), and existing newlines (\n
) are all treated as whitespace to be squeezed and replaced by a single newline.
- Command:
-
Complement and Delete (Keep Only Digits): Remove everything except digits from a string.
- Command:
echo "Order Number: ORD-12345 / Date: 2023-10-26" | tr -cd '[:digit:]'
- Observe: Output is
1234520231026
.-d
: Delete mode.-c
: Use the complement of the set. The set is[:digit:]
.- So, delete everything that is not a digit.
- Command:
-
Simple Substitution Cipher (ROT13): Rotate letters 13 places (A->N, B->O, ..., M->Z, N->A, ...).
- Command:
echo "Secret Message" | tr 'a-zA-Z' 'n-za-mN-ZA-M'
- Observe: Output is
Frperg Zrffntr
. Applying it again should decode it. - Command:
echo "Frperg Zrffntr" | tr 'a-zA-Z' 'n-za-mN-ZA-M'
- Observe: Output is
Secret Message
.
- Command:
-
Interactive Use: Try translating input directly from the keyboard.
- Command:
tr '()' '{}'
- Now type:
function(arg1, arg2)
and press Enter. - Observe the output:
function{arg1, arg2}
- Press
Ctrl+D
to end the input totr
.
- Command:
Cleanup: No files to remove.
This workshop showed how tr
performs efficient character-level translation, deletion, and squeezing, making it valuable for data cleaning and simple transformations within pipelines.
9. Combining Tools with Pipes
The true power of the Linux command line for text processing comes not just from individual tools, but from their ability to be combined using the pipe (|
) operator.
A pipe connects the standard output (stdout) of the command on its left to the standard input (stdin) of the command on its right. This allows you to build complex data processing workflows by chaining simple, specialized utilities together.
The Flow:
command1
executes. Its standard output (what it would normally print to the screen) is not printed to the screen.- Instead, that output is redirected (
piped
) to become the standard input forcommand2
. command2
executes, reading its input fromcommand1
. Its standard output is then piped tocommand3
.command3
executes, reading input fromcommand2
. Its standard output goes to the terminal (unless piped further).
Standard Error (stderr): It's important to note that pipes typically only redirect standard output. Error messages, which are usually sent to standard error (stderr), will still appear on your terminal by default and are not passed down the pipeline. You can redirect stderr using shell redirection like 2>&1
if needed (e.g., command1 2>&1 | command2
sends both stdout and stderr of command1 to command2).
Why Use Pipes?
- Modularity: Each command does one thing well (
grep
searches,sort
sorts,awk
processes fields,wc
counts). Pipes let you combine these specialists. - Efficiency: Data often flows through the pipeline without needing to be written to temporary files, saving disk I/O and time. Processing can happen concurrently to some extent.
- Flexibility: You can easily swap out tools or add new stages to the pipeline to modify the workflow.
- Readability (often): A well-constructed pipeline can clearly express a sequence of data transformations.
Example Pipeline Scenarios
Let's revisit some examples using pipelines:
-
Find the most frequent error messages in a log file:
- Goal: Extract lines containing "ERROR", isolate the message part, count unique messages, and show the top 5 most frequent.
- Assume: Log format like
[Timestamp] LEVEL: Message text
- Pipeline:
- Breakdown:
grep 'ERROR' system.log
: Select only lines containing "ERROR".awk -F': ' '{print $2}'
: Split lines by ": " and print the second field (the message text).sort
: Sort the messages alphabetically (needed foruniq
).uniq -c
: Count adjacent identical messages (now grouped bysort
).sort -nr
: Sort the counts numerically (n
) and in reverse (r
) to get highest counts first.head -n 5
: Display only the top 5 lines (most frequent errors).
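Assembled from the breakdown, the full pipeline reads:

```bash
grep 'ERROR' system.log | awk -F': ' '{print $2}' | sort | uniq -c | sort -nr | head -n 5
```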
-
List active users from
/etc/passwd
who use/bin/bash
:- Goal: Find lines in
/etc/passwd
ending with/bin/bash
and extract the username (first field). - Pipeline:
- Breakdown:
grep '/bin/bash$' /etc/passwd
: Find lines ending ($
) with/bin/bash
.awk -F: '{print $1}'
: Set field separator to colon (:
) and print the first field (username).sort
: Sort the usernames alphabetically (optional, but good practice for lists).
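Putting the stages together:

```bash
grep '/bin/bash$' /etc/passwd | awk -F: '{print $1}' | sort
```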
- Goal: Find lines in
-
Calculate the total size of
.txt
files in the current directory:- Goal: List files, filter for
.txt
, extract size, sum the sizes. - Pipeline (using
ls
- potentially fragile, see note below): - Breakdown:
ls -l *.txt
: Get long listing of.txt
files.grep '^-'
: Filter for regular files (lines starting with-
).awk '{total += $5} END {print total}'
: Sum the 5th field (size) and print the total.
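Chained together (see the note below for why a find-based variant is usually preferable in scripts):

```bash
ls -l *.txt | grep '^-' | awk '{total += $5} END {print total}'

# One possible sturdier variant, assuming GNU find's -printf:
# find . -maxdepth 1 -type f -name '*.txt' -printf '%s\n' | awk '{total += $1} END {print total}'
```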
- Note on parsing
ls
: Parsing the output ofls
is generally discouraged in scripts because its format can change and handle filenames with spaces poorly. Better alternatives often involvefind
or shell globbing with loops. However, for interactive use or simple cases, it's common. A more robust way usingfind
andawk
:
- Goal: List files, filter for
Mastering the art of combining these tools with pipes is arguably the most crucial skill for effective command-line text processing in Linux. It allows you to solve complex problems by assembling simple, well-understood components.
Workshop Combining Tools with Pipes
Objective: To build and understand multi-stage pipelines using grep
, awk
, sort
, uniq
, wc
, head
, tail
, and tr
.
Scenario: We will analyze the access.log
file more deeply and process a list of words.
Setup:
-
Recreate
access.log
:(Added one duplicate request for /index.html)cat << EOF > access.log 192.168.1.10 - - [26/Oct/2023:10:15:01 +0000] "GET /index.html HTTP/1.1" 200 512 10.0.0.5 - - [26/Oct/2023:10:15:05 +0000] "GET /images/logo.png HTTP/1.1" 200 2048 192.168.1.10 - - [26/Oct/2023:10:16:10 +0000] "POST /login HTTP/1.1" 302 128 172.16.5.20 - - [26/Oct/2023:10:17:00 +0000] "GET /styles.css HTTP/1.1" 200 1024 10.0.0.5 - - [26/Oct/2023:10:17:30 +0000] "GET /favicon.ico HTTP/1.1" 404 50 192.168.1.10 - - [26/Oct/2023:10:18:00 +0000] "GET /dashboard HTTP/1.1" 200 4096 10.0.0.5 - - [26/Oct/2023:10:18:05 +0000] "GET /api/users HTTP/1.1" 500 100 192.168.1.10 - - [26/Oct/2023:10:19:00 +0000] "GET /index.html HTTP/1.1" 200 512 EOF
-
Create a word list file: Create
words.txt
.
Steps:
-
Most Requested Pages: Find the top 3 most frequently requested pages (field 7) from
access.log
.- Command:
- Observe: The output should show counts and paths, with
/index.html
likely having a count of 2. - Breakdown:
awk '{print $7}'
: Extract the request path.sort
: Sort paths alphabetically foruniq
.uniq -c
: Count identical adjacent paths.sort -nr
: Sort by count (numeric, reverse).head -n 3
: Get the top 3 lines.
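The command for this step, built from the breakdown:

```bash
awk '{print $7}' access.log | sort | uniq -c | sort -nr | head -n 3
```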
-
Total Bytes Transferred per IP: Calculate the sum of bytes (field 10) for each unique IP address (field 1).
- Command:
- Observe: The output shows each unique IP and the total bytes transferred by that IP, sorted by bytes (most first).
- Breakdown:
awk '{ ip_bytes[$1] += $10 } ... '
: Uses anawk
associative arrayip_bytes
. For each line, it adds the bytes ($10
) to the element indexed by the IP address ($1
).... END { for (ip in ip_bytes) print ip, ip_bytes[ip] }
: After processing all lines, loop through the array indices (IPs) and print the IP and its accumulated byte count.sort -k 2nr
: Sort the output based on the second field (bytes), numerically (n
) and reversed (r
).
-
Unique Words (Case-Insensitive): Find all unique words in
words.txt
, ignoring case, and print them in lowercase.- Command:
- Observe: A sorted list of unique words, all in lowercase (
linux
,command
,line
,is
,powerful
,and
,flexible
,commands
,are
,great
,tools
). - Breakdown:
cat words.txt
: Output the file content.tr '[:upper:]' '[:lower:]'
: Convert everything to lowercase.tr -s '[:space:]' '\n'
: Convert all whitespace sequences (including original newlines) to single newlines (effectively putting each word on its own line).grep -v '^$'
: Remove any blank lines that might have been created.sort -u
: Sort the words and keep only unique occurrences.
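The pieces above chain into:

```bash
cat words.txt | tr '[:upper:]' '[:lower:]' | tr -s '[:space:]' '\n' | grep -v '^$' | sort -u
```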
-
Count 404 Errors: Count how many requests resulted in a 404 error in
access.log
.- Command:
- Observe: Should output
1
. (Note the spaces around 404 to avoid matching it in other fields like bytes). - Alternative using
awk
: - Observe: Also outputs
1
. This is often more precise thangrep
for field-based data.
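The elided commands might have looked like this (the grep pattern keeps spaces around 404 so it matches the status field rather than a byte count; the awk variant tests field 9, the status code, explicitly):

```bash
grep ' 404 ' access.log | wc -l
awk '$9 == 404' access.log | wc -l
```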
-
IP Addresses Making POST Requests: List the unique IP addresses that made POST requests.
- Command:
- Observe: Should output
192.168.1.10
. - Breakdown:
grep '"POST '
: Find lines containing the literal string"POST
(part of the request field).awk '{print $1}'
: Extract the IP address (field 1).sort -u
: Get the unique IP addresses.
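Combined into one command:

```bash
grep '"POST ' access.log | awk '{print $1}' | sort -u
```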
-
Extract Timestamps for Specific Page: Get just the timestamps (field 4, removing brackets) for requests to
/index.html
.- Command:
- Observe: Outputs the two timestamps associated with
/index.html
requests, without the square brackets. - Breakdown:
grep '"GET /index.html '
: Select lines for GET requests to/index.html
.awk '{print $4}'
: Extract the timestamp field (e.g.,[26/Oct/2023:10:15:01
).tr -d '[]'
: Delete the square bracket characters.
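The three stages above combine into:

```bash
grep '"GET /index.html ' access.log | awk '{print $4}' | tr -d '[]'
```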
Cleanup (Optional):
This final workshop emphasized how pipelines allow you to construct sophisticated queries and transformations by chaining together the tools learned throughout this chapter. Practice building your own pipelines to solve different text processing challenges.
Conclusion
Throughout this exploration of text processing and searching in Linux, we have journeyed from simple file viewing with cat
and less
to the powerful pattern matching of grep
, the line-oriented transformations of sed
, and the field-based processing prowess of awk
. We've learned how to organize data with sort
, manage duplicates with uniq
, obtain summaries with wc
, compare files with diff
and comm
, and perform character manipulations with tr
.
Perhaps most importantly, we've seen how the pipe operator (|
) allows these individual tools to be chained together, creating elegant and efficient solutions to complex text manipulation problems. This modular, pipeline-driven approach is a cornerstone of the Linux/Unix philosophy.
Mastering these command-line utilities offers significant advantages:
- Automation: Easily script repetitive text-processing tasks.
- Efficiency: Perform operations quickly, especially on large datasets or remote systems.
- Flexibility: Combine tools in novel ways to address unique challenges.
- Universality: These skills are transferable across various Unix-like environments.
The key to proficiency is practice. Experiment with these commands on different types of text files – logs, configuration files, code, CSV data, or even plain text documents. Try variations of the options, explore different regular expression patterns, and build increasingly complex pipelines. Consult the man
pages (man grep
, man sed
, etc.) for comprehensive details on each command's capabilities.
While graphical tools have their place, the command line remains an indispensable environment for anyone serious about working efficiently with text data in Linux. The tools and techniques covered here form a solid foundation for tackling a wide array of system administration, data analysis, and development tasks. Keep experimenting, keep learning, and you will unlock the full text-processing power of your Linux system.