Author Nejat Hakan
eMail nejat.hakan@outlook.de
PayPal Me https://paypal.me/nejathakan


Text Processing and Searching

Introduction

Welcome to the world of text processing and searching on the Linux command line. In the Linux and Unix philosophy, text is the universal interface. Configuration files, log files, command outputs, source code – a vast amount of information is stored and transmitted as plain text. Being able to efficiently manipulate, search, and transform this text data directly from the command line is not just a convenience; it's a fundamental skill that unlocks immense power and automation capabilities.

Why rely on the command line when graphical text editors and IDEs exist?

  1. Automation: Command-line tools can be easily scripted. Repetitive text manipulation tasks that might take minutes or hours in a graphical editor can often be accomplished in seconds with a well-crafted command or script.
  2. Efficiency: For many tasks, especially on remote servers where graphical interfaces may be unavailable or slow, the command line is significantly faster.
  3. Integration: Command-line tools are designed to work together seamlessly using pipes (|), allowing you to chain simple tools to perform complex operations.
  4. Resourcefulness: These tools are typically lightweight and available on almost any Linux/Unix system, even minimal installations.
  5. Universality: The principles and many of the tools (like grep, sed, awk) are standard across Unix-like systems (including macOS and Windows Subsystem for Linux).

In this section, we will embark on a deep dive into the essential Linux utilities designed for text processing and searching. We will start with basic file viewing and gradually move towards powerful tools that utilize regular expressions for pattern matching and transformation. Each theoretical part will be followed by a hands-on "Workshop" section, providing practical, step-by-step exercises using real-world scenarios to solidify your understanding. Prepare to become proficient in harnessing the text-processing power of the Linux command line.

We will cover:

  • Viewing and concatenating files (cat, less, more, head, tail).
  • Searching text using patterns (grep and regular expressions).
  • Stream editing for text transformation (sed).
  • Advanced text processing with awk.
  • Sorting and managing duplicate lines (sort, uniq).
  • Counting lines, words, and characters (wc).
  • Comparing file contents (diff, comm).
  • Translating or deleting characters (tr).
  • Combining tools using pipes (|).

Let's begin by learning how to simply view the contents of text files.

1. Viewing and Concatenating Files

Before you can process or search text, you often need to view it. Linux provides several fundamental utilities for displaying file contents directly in your terminal. Understanding their differences and optimal use cases is the first step towards effective command-line text manipulation.

Core Utilities

  • cat (Concatenate):

    • Purpose: Originally designed to concatenate (link together) files, cat is most commonly used to display the entire content of one or more files to standard output (usually your terminal screen).
    • Usage: cat [options] [file...]
    • Behavior: Reads the specified files sequentially and writes their content to standard output. If no file is given, it reads from standard input (e.g., keyboard input until Ctrl+D).
    • Common Options:
      • -n: Number all output lines.
      • -b: Number only non-empty output lines.
      • -s: Squeeze multiple adjacent blank lines into a single blank line.
      • -E: Display a $ at the end of each line.
    • Caveat: Be cautious using cat on very large files, as it will attempt to dump the entire content to your screen, which can be slow and overwhelming. It's best suited for small files or when you specifically need the entire content piped to another command.
  • less:

    • Purpose: A powerful and widely preferred file pager. It allows you to view file content screen by screen, navigate forwards and backwards, and search within the file without loading the entire file into memory first. This makes it ideal for large files.
    • Usage: less [options] [file...]
    • Behavior: Displays one screenful of the file. You can then use navigation commands.
    • Key Navigation Commands (inside less):
      • Space or f: Move forward one screen.
      • b: Move backward one screen.
      • d: Move down (forward) half a screen.
      • u: Move up (backward) half a screen.
      • j or Enter: Move forward one line.
      • k: Move backward one line.
      • g: Go to the beginning of the file.
      • G: Go to the end of the file.
      • /pattern: Search forward for pattern. n repeats the search forward, N repeats backward.
      • ?pattern: Search backward for pattern. n repeats the search backward, N repeats forward.
      • h: Display help screen with more commands.
      • q: Quit less.
    • Advantages: Efficient for large files, rich navigation and search features.
  • more:

    • Purpose: An older, simpler file pager than less. It allows forward navigation only.
    • Usage: more [options] [file...]
    • Behavior: Displays file content screen by screen.
    • Key Navigation Commands (inside more):
      • Space: Move forward one screen.
      • Enter: Move forward one line.
      • /pattern: Search forward for pattern.
      • q: Quit more.
    • Note: less is generally preferred over more due to its enhanced features (like backward scrolling). You might encounter more on older or minimal systems.
  • head:

    • Purpose: Displays the beginning (the "head") of a file.
    • Usage: head [options] [file...]
    • Behavior: By default, it shows the first 10 lines of the specified file(s).
    • Common Options:
      • -n <number> or -<number>: Display the first <number> lines instead of 10.
      • -c <bytes>: Display the first <bytes> bytes instead of lines.
  • tail:

    • Purpose: Displays the end (the "tail") of a file.
    • Usage: tail [options] [file...]
    • Behavior: By default, it shows the last 10 lines of the specified file(s).
    • Common Options:
      • -n <number> or -<number>: Display the last <number> lines instead of 10.
      • -n +<number>: Display lines starting from <number>. For example, tail -n +5 shows lines from the 5th line to the end.
      • -c <bytes>: Display the last <bytes> bytes instead of lines.
      • -f: "Follow" mode. tail does not exit after displaying the last lines but waits and displays new lines as they are appended to the file. This is extremely useful for monitoring log files in real-time. Press Ctrl+C to exit follow mode.

Choosing the Right Tool

  • For small files where you want to see everything at once: cat.
  • For viewing files of any size, especially large ones, with navigation and search: less.
  • For quickly checking the beginning of a file: head.
  • For quickly checking the end of a file or monitoring a file for changes: tail (especially tail -f).
  • For combining files sequentially: cat file1 file2 > combined_file.
  • When piping output to another command that needs the entire content: cat file | other_command.
  • When piping potentially large output for interactive viewing: some_command | less.

Understanding these basic viewing tools is crucial as they often form the first step in a text processing pipeline.
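
For orientation, here are a few typical invocations. The log path in the last command is only an assumption for illustration; on your system it might be /var/log/messages, or logs may only be available via journalctl.

# Page through a potentially long listing interactively
ls -l /etc | less

# Quickly check the beginning and end of a file
head -n 5 /etc/passwd
tail -n 5 /etc/passwd

# Watch a log file grow in real time (Ctrl+C to stop); the path is an assumption
tail -f /var/log/syslog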

Workshop Viewing and Concatenating Files

Objective: To practice using cat, less, head, and tail for basic file viewing and monitoring.

Scenario: We will work with a simulated system log file and a configuration file snippet.

Setup:

  1. Create a sample log file: Open your terminal and run the following commands to create a file named system.log:

    echo "[2023-10-26 10:00:01] INFO: System startup sequence initiated." > system.log
    echo "[2023-10-26 10:00:05] INFO: Network service started." >> system.log
    echo "[2023-10-26 10:00:10] DEBUG: Checking disk space..." >> system.log
    echo "[2023-10-26 10:00:12] INFO: Disk space OK." >> system.log
    for i in $(seq 1 20); do echo "[2023-10-26 10:01:$((10 + i))] VERBOSE: Heartbeat signal $i received." >> system.log; done
    echo "[2023-10-26 10:02:00] WARN: High CPU usage detected (95%)." >> system.log
    echo "[2023-10-26 10:02:05] INFO: Adjusting process priorities." >> system.log
    echo "[2023-10-26 10:02:15] ERROR: Failed to connect to database server [db01.internal]." >> system.log
    echo "[2023-10-26 10:02:20] INFO: Retrying database connection..." >> system.log
    echo "[2023-10-26 10:02:30] INFO: Database connection successful." >> system.log
    echo "[2023-10-26 10:03:00] INFO: System initialization complete." >> system.log
    

    • > redirects output, creating or overwriting the file.
    • >> redirects output, appending to the file if it exists.
    • The for loop adds 20 "VERBOSE" lines to make the file longer.
  2. Create a sample configuration snippet: Create config_part1.txt:

    cat << EOF > config_part1.txt
    # Network Settings
    IP_ADDRESS=192.168.1.100
    NETMASK=255.255.255.0
    GATEWAY=192.168.1.1
    EOF
    

  3. Create another configuration snippet: Create config_part2.txt:

    cat << EOF > config_part2.txt
    # DNS Settings
    DNS_SERVER_1=8.8.8.8
    DNS_SERVER_2=1.1.1.1
    EOF
    

Steps:

  1. View a small file with cat:

    • Command: cat config_part1.txt
    • Observe: The entire content of config_part1.txt is printed to your terminal.
    • Try with line numbers: cat -n config_part1.txt
    • Observe: Each line is now prefixed with its line number.
  2. Concatenate files with cat:

    • Command: cat config_part1.txt config_part2.txt
    • Observe: The content of config_part1.txt is displayed first, immediately followed by the content of config_part2.txt.
    • Command: cat config_part1.txt config_part2.txt > network.conf
    • Verify: Use cat network.conf to see the combined content stored in the new file.
  3. View the longer log file with cat (and see why it's often not ideal):

    • Command: cat system.log
    • Observe: The entire log file scrolls past, likely too fast to read comfortably. The beginning of the file might scroll off the screen.
  4. View the log file properly with less:

    • Command: less system.log
    • Observe: You see the first screenful of the file. The bottom line indicates the filename and your position.
    • Practice Navigation:
      • Press Space to go down one page.
      • Press b to go back one page.
      • Press j several times to move down line by line.
      • Press k several times to move up line by line.
      • Press G to jump to the end of the file.
      • Press g to jump back to the beginning.
    • Practice Searching:
      • Type /ERROR and press Enter. less will jump to the first occurrence of "ERROR".
      • Press n to find the next occurrence (if any).
      • Type ?INFO and press Enter. less will search backwards for "INFO".
      • Press n to find the next occurrence backwards.
    • Quit: Press q to exit less.
  5. View the beginning of the log file with head:

    • Command: head system.log
    • Observe: You see the first 10 lines of the file.
    • Command: head -n 5 system.log
    • Observe: You see only the first 5 lines.
    • Command: head -n 3 network.conf
    • Observe: You see the first 3 lines of the combined configuration file.
  6. View the end of the log file with tail:

    • Command: tail system.log
    • Observe: You see the last 10 lines of the file.
    • Command: tail -n 5 system.log
    • Observe: You see only the last 5 lines.
    • Command: tail -n +25 system.log
    • Observe: You see the content starting from line 25 until the end. (Count the lines if you're unsure!)
  7. Monitor the log file with tail -f (Simulate log updates):

    • Command: tail -f system.log
    • Observe: You see the last 10 lines, and the cursor waits. tail is now monitoring the file.
    • Open a second terminal window/tab. Do not close the first one yet.
    • In the second terminal, append a new line to the log file:
      echo "[$(date '+%Y-%m-%d %H:%M:%S')] ALERT: Disk space critically low!" >> system.log
      
    • Switch back to the first terminal (where tail -f is running).
    • Observe: The new "ALERT" line you just added appears automatically in the tail -f output.
    • Repeat the append command in the second terminal a few times. Each new line will appear in the first terminal.
    • Go back to the first terminal and press Ctrl+C to stop the tail -f command.

Cleanup (Optional):

rm system.log config_part1.txt config_part2.txt network.conf

This workshop demonstrated how to use the fundamental viewing tools, highlighting the interactive paging of less and the real-time monitoring capability of tail -f.
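
As a small extension (assuming you skipped the cleanup and system.log is still present), head and tail can be chained to extract an arbitrary line range:

# Print lines 21 through 25 of system.log: take the first 25 lines, then the last 5 of those
head -n 25 system.log | tail -n 5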

2. Searching Text with grep

Perhaps the single most important text-processing tool on Linux is grep (Global search for Regular Expression and Print). It scans input (from files or standard input) line by line and prints lines that contain a match for a specified pattern. Its power comes from its speed and its support for regular expressions, a sophisticated way to define search patterns.

Basic Usage

The fundamental syntax is:

grep [options] PATTERN [file...]
  • PATTERN: The text or pattern you are searching for. If it contains spaces or special characters interpreted by the shell (like *, ?, [), you must enclose it in single (') or double (") quotes. Single quotes are generally safer as they prevent the shell from interpreting most special characters within the quotes.
  • file...: One or more files to search within. If omitted, grep reads from standard input (useful for pipes).

Key grep Options

grep has many options to modify its behavior. Here are some of the most crucial ones; a short example combining several of them follows the list:

  • -i (--ignore-case): Perform a case-insensitive search. grep -i 'error' file.log will match "error", "Error", "ERROR", etc.
  • -v (--invert-match): Select non-matching lines. It prints all lines that do not contain the pattern.
  • -n (--line-number): Prepend each matching line with its line number within the input file.
  • -c (--count): Suppress normal output; instead, print a count of matching lines for each input file.
  • -l (--files-with-matches): Suppress normal output; instead, print the name of each input file from which output would normally have been printed. The scanning stops on the first match. Useful when you just want to know which files contain a pattern.
  • -L (--files-without-match): The opposite of -l. Print the names of files that do not contain the pattern.
  • -r or -R (--recursive): Search recursively. If a file argument is a directory, grep searches all files under that directory. -R additionally follows symbolic links.
  • -w (--word-regexp): Select only those lines containing matches that form whole words. The matched substring must be either at the beginning of the line, or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore. grep -w 'err' would match " err " but not "error".
  • -o (--only-matching): Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
  • -A <num> (--after-context=<num>): Print <num> lines of trailing context after matching lines.
  • -B <num> (--before-context=<num>): Print <num> lines of leading context before matching lines.
  • -C <num> or -<num> (--context=<num>): Print <num> lines of output context (both before and after).
  • -E (--extended-regexp): Interpret PATTERN as an Extended Regular Expression (ERE). More on this below. (Equivalent to using the egrep command, which is often just a link to grep).
  • -F (--fixed-strings): Interpret PATTERN as a list of fixed strings (instead of regular expressions), separated by newlines, any of which is to be matched. (Equivalent to using the fgrep command). This can be significantly faster if you don't need pattern matching power.
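
In practice, several of these options are often combined in a single invocation. The search pattern and starting directory below are placeholders for illustration only:

# Recursive (-r), case-insensitive (-i) search with line numbers (-n) and one line of context (-C 1)
grep -rin -C 1 'timeout' .

# List only the names of files under the current directory that mention "password"
grep -rl 'password' .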

Introduction to Regular Expressions (Regex)

Regular expressions are the heart of grep's power. They are sequences of characters that define a search pattern. While a deep dive into regex is a topic in itself, understanding the basics is essential for using grep effectively.

There are different "flavors" of regex. grep primarily uses Basic Regular Expressions (BRE) by default, and Extended Regular Expressions (ERE) with the -E option. EREs are generally more intuitive as fewer characters need escaping.

Common Regex Metacharacters (Special Characters):

(Note: Some require escaping \ in BRE but not in ERE)

  1. Anchors:

    • ^: Matches the beginning of a line. grep '^Error' file matches lines starting with "Error".
    • $: Matches the end of a line. grep 'complete\.$' file matches lines ending with "complete.". The dot is escaped as \. because an unescaped . is a wildcard that matches any single character (see below); this holds in both BRE and ERE.
  2. Character Representations:

    • . (Dot): Matches any single character (except newline). grep 'l..e' file matches "like", "love", "luxe", etc.
    • [...] (Bracket Expressions): Matches any one character enclosed in the brackets.
      • [aeiou]: Matches any single lowercase vowel.
      • [0-9]: Matches any single digit.
      • [a-zA-Z]: Matches any single uppercase or lowercase letter.
      • [^...]: Matches any single character not in the brackets. [^0-9] matches any non-digit character.
  3. Quantifiers (Specify Repetitions):

    • *: Matches the preceding item zero or more times. grep 'ab*c' file matches "ac", "abc", "abbc", "abbbc", etc.
    • + (ERE only, or \+ in BRE): Matches the preceding item one or more times. grep -E 'ab+c' file matches "abc", "abbc", but not "ac".
    • ? (ERE only, or \? in BRE): Matches the preceding item zero or one time. grep -E 'colou?r' file matches "color" and "colour".
    • {n} (ERE only, or \{n\} in BRE): Matches the preceding item exactly n times. grep -E '[0-9]{3}' matches exactly three digits.
    • {n,} (ERE only, or \{n,\} in BRE): Matches the preceding item n or more times. grep -E 'go{2,}gle' matches "google", "gooogle", etc.
    • {n,m} (ERE only, or \{n,m\} in BRE): Matches the preceding item at least n times, but no more than m times. grep -E '[a-z]{3,5}' matches 3, 4, or 5 lowercase letters.
  4. Alternation (OR):

    • | (ERE only, or \| in BRE): Matches either the expression before or the expression after the pipe. grep -E 'error|warning' matches lines containing either "error" or "warning".
  5. Grouping:

    • (...) (ERE only, or \(...\) in BRE): Groups expressions together. This is often used with quantifiers or alternation. grep -E '(ab)+c' matches "abc", "ababc", etc. It also captures the matched group for backreferences (used more in sed, but relevant).
  6. Escaping:

    • \: The backslash "escapes" the special meaning of a metacharacter, making it literal. To search for a literal dot, use \.. To search for a literal asterisk, use \*. To search for a literal backslash, use \\.

BRE vs ERE Example:

To match lines containing one or more digits:

  • BRE: grep '[0-9]\+' file
  • ERE: grep -E '[0-9]+' file or egrep '[0-9]+' file

Using ERE (grep -E or egrep) is often recommended for readability when your patterns involve quantifiers like +, ?, {}, or alternation |.

grep is a versatile tool that forms the backbone of text searching on Linux. Mastering its options and basic regular expressions is a significant step towards command-line proficiency.
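
The -o option pairs especially well with ERE patterns, because it prints only the matched text rather than whole lines. A quick illustration, using echo as the input source:

# Extract only the IPv4-looking addresses from a line of text, one per output line
echo "clients: 10.0.0.5 and 192.168.1.10" | grep -oE '[0-9]+(\.[0-9]+){3}'

# BRE vs ERE: the same "one or more digits" match written both ways
echo "order 42 shipped" | grep -o '[0-9]\+'
echo "order 42 shipped" | grep -oE '[0-9]+'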

Workshop Searching with grep

Objective: To practice using grep with various options and basic regular expressions to search within files.

Scenario: We will use the system.log file created earlier and a sample data file containing user information.

Setup:

  1. Ensure system.log exists: If you removed it previously, recreate it using the commands from the Workshop in section 1.

  2. Create a user data file: Create a file named users.txt:

    cat << EOF > users.txt
    User ID,Name,Department,Status,Last Login
    101,Alice Smith,Engineering,Active,2023-10-25
    102,Bob Johnson,Sales,Inactive,2023-09-10
    103,Charlie Brown,Engineering,Active,2023-10-26
    104,David Williams,Support,Active,2023-10-26
    105,Eve Davis,Sales,Active,2023-10-24
    106,Frank Miller,Support,Pending,2023-10-20
    107,Grace Wilson,Engineering,active,2023-10-26
    EOF
    
    Note the inconsistent capitalization in "active" for user 107.

Steps:

  1. Simple Search: Find all lines in the log file containing the word "INFO".

    • Command: grep 'INFO' system.log
    • Observe: All lines containing "INFO" are displayed.
  2. Case-Insensitive Search: Find lines in users.txt containing "active", regardless of case.

    • Command: grep 'active' users.txt (Note: misses user 107)
    • Command: grep -i 'active' users.txt
    • Observe: The second command finds both "Active" and "active" lines.
  3. Invert Match: Find all lines in the log file that do not contain "VERBOSE".

    • Command: grep -v 'VERBOSE' system.log
    • Observe: The output includes INFO, DEBUG, WARN, ERROR lines, but excludes the numerous VERBOSE lines.
  4. Line Numbers: Find lines containing "database" and show their line numbers.

    • Command: grep -n 'database' system.log
    • Observe: Each matching line is prefixed with its number (e.g., 27:... and 28:...). Note that the "Database connection successful" line is not matched because the search is case-sensitive.
  5. Count Matches: Count how many errors occurred.

    • Command: grep -c 'ERROR' system.log
    • Observe: The output should be 1.
    • Command: grep -c 'INFO' system.log
    • Observe: The output shows the total count of INFO lines.
  6. Recursive Search (Setup): First, let's create a subdirectory and copy a file into it.

    • Command: mkdir logs_archive
    • Command: cp system.log logs_archive/system_old.log
    • Command: echo "[2023-10-25 18:00:00] ERROR: System halted." >> logs_archive/system_old.log
  7. Recursive Search (Execution): Search for "ERROR" in the current directory and subdirectories.

    • Command: grep 'ERROR' system.log (Only finds the error in the current file)
    • Command: grep -r 'ERROR' . (Searches recursively starting from the current directory .)
    • Observe: The second command finds the "ERROR" line in system.log and the two "ERROR" lines in logs_archive/system_old.log, prefixing each match with the filename.
  8. List Files Containing Matches: Find which files in the current directory and subdirectories contain the word "CPU".

    • Command: grep -rl 'CPU' .
    • Observe: Only the filename system.log (prefixed with ./) is printed, as it's the only file containing "CPU".
  9. Word Match: Find lines containing the word "DB" (as a whole word) vs. lines containing "DB" as part of another word (like "DEBUG"). We'll use echo to pipe input to grep.

    • Command: echo "DEBUG DB connection" | grep 'DB' (Matches)
    • Command: echo "DEBUG DB connection" | grep -w 'DB' (Matches "DB")
    • Command: echo "DEBUG database connection" | grep 'DB' (Matches, as "DB" is in "DEBUG")
    • Command: echo "DEBUG database connection" | grep -w 'DB' (Does not match, as "DB" in "DEBUG" isn't a whole word)
  10. Using Basic Regex (Anchors): Find log entries that occurred exactly at the start of a minute (i.e., the seconds field of the timestamp is 00).

    • Command: grep ':00]' system.log (Finds lines whose timestamp contains :00])
    • Since the timestamp sits at the beginning of each line, we can make the match stricter with the start-of-line anchor ^. The [ is escaped because it would otherwise begin a bracket expression; the closing ] does not strictly need escaping outside brackets, but escaping it does no harm. Use single quotes.
    • Command: grep '^\[.*:00\]' system.log
    • Observe: This finds only the lines that begin with a bracketed timestamp whose seconds field is 00 (here, the "High CPU" WARN line and the final "initialization complete" line).
  11. Using Basic Regex (Character Classes): Find user IDs in users.txt that are between 102 and 104 inclusive. User IDs are at the start of the line.

    • Command: grep '^[1][0][2-4],' users.txt
    • Observe: Matches lines starting with 102, 103, or 104, followed by a comma. ^ anchors to the start, [1] matches '1', [0] matches '0', [2-4] matches '2', '3', or '4'.
  12. Using Extended Regex (Alternation): Find lines in the log containing either "ERROR" or "WARN".

    • Command: grep -E 'ERROR|WARN' system.log
    • Observe: Lines with either pattern are shown. Compare with grep 'ERROR|WARN' system.log (which searches for the literal string "ERROR|WARN" unless your grep defaults to ERE).
  13. Using Extended Regex (Quantifiers): Find lines where the CPU usage was 90% or higher (i.e., 9 followed by one more digit, then '%').

    • Command: grep -E 'CPU usage detected \(9[0-9]%\)' system.log
    • Observe: Matches the "WARN: High CPU usage detected (95%)" line.
      • -E: Use Extended Regex.
      • \(, \): Match literal parentheses. In ERE, ( and ) are for grouping, so we escape them.
      • 9[0-9]%: Match '9', followed by any single digit ([0-9]), followed by '%'.
  14. Context: Find the "database connection successful" message and show the line before and after it.

    • Command: grep -C 1 'Database connection successful' system.log
    • Observe: You see the "Retrying..." line, the "successful" line, and the "initialization complete" line.
    • Try with -B 2 (2 lines before) and -A 2 (2 lines after).

Cleanup (Optional):

rm users.txt system.log
rm -r logs_archive

This workshop provided hands-on experience with common grep options and introduced the fundamentals of regular expressions for targeted searching. Experiment with different patterns and options on these files or other text files you have.
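
As one last exercise (assuming you have not yet removed system.log), a small shell loop around grep -c gives a quick per-severity summary of the log:

# Count how many lines exist for each log level
for level in INFO DEBUG VERBOSE WARN ERROR; do
    echo "$level: $(grep -c "$level" system.log)"
done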

3. Stream Editing with sed

sed stands for Stream EDitor. Unlike interactive editors like nano or vim, sed is designed to perform text transformations non-interactively on an input stream (a file or piped data). It reads the input line by line, applies a set of specified commands to each line, and then outputs the transformed line. This makes it incredibly powerful for scripting and automating text modifications.

How sed Works: The Cycle

sed maintains two data buffers:

  1. Pattern Space: This is the primary workspace. sed reads one line from the input into the pattern space.
  2. Hold Space: This is an auxiliary buffer. You can copy data from the pattern space to the hold space and vice-versa, allowing for more complex manipulations across multiple lines (though this is an advanced topic).

The basic sed cycle for each input line is:

  1. Read a line from the input stream.
  2. Remove the trailing newline character.
  3. Place the line into the pattern space.
  4. Execute the sed commands provided in the script sequentially. Each command might modify the content of the pattern space. Commands can be restricted to operate only on lines matching certain addresses (patterns or line numbers).
  5. Once all commands are executed for the current line:
    • If the -n option was not used, print the (potentially modified) content of the pattern space, followed by a newline.
    • If the -n option was used, the pattern space is printed only if explicitly commanded (e.g., by the p command).
  6. The pattern space is typically cleared (unless specific commands prevent it).
  7. If there is more input, repeat from step 1.

Basic Syntax

sed [options] 'script' [input-file...]
sed [options] -e 'script1' -e 'script2' ... [input-file...]
sed [options] -f script-file [input-file...]
  • options: Control sed's behavior. Common ones include:
    • -n (--quiet, --silent): Suppress automatic printing of the pattern space. Only lines explicitly printed with the p command will appear in the output. This is very common.
    • -e <script>: Add the commands in script to the set of commands to be executed. Useful for multiple simple commands.
    • -f <script-file>: Read sed commands from the specified script-file instead of the command line. Essential for complex scripts.
    • -i[SUFFIX] (--in-place[=SUFFIX]): Edit files in place. Use with extreme caution! This overwrites the original file. If SUFFIX is provided (e.g., -i.bak), a backup of the original file is created with that suffix before modification. Without a suffix, the original is overwritten directly. Always test your sed scripts without -i first!
  • script: One or more sed commands. If containing shell metacharacters, enclose in single quotes ('). The basic format of a command is [address[,address]]command[arguments].
  • input-file...: File(s) to process. If omitted, sed reads from standard input.

Addresses

Addresses specify which lines a command should apply to. If no address is given, the command applies to all lines.

  • Line Number: A simple integer N applies the command only to line N.
  • Range of Line Numbers: N,M applies the command to lines from N to M inclusive.
  • /pattern/: A regular expression (BRE by default; ERE if the -E or -r option is used). The command applies to any line matching the pattern. Regex syntax is similar to grep.
  • /pattern1/,/pattern2/: A range defined by patterns. The command applies starting from the first line matching pattern1 up to and including the next line matching pattern2.
  • $: Represents the last line of input. 1,$ means all lines.
  • !: Appended to an address or address range, it negates the match, applying the command to lines that do not match the address(es). 1,10!d deletes all lines except lines 1 through 10.

Common sed Commands

  • s/regexp/replacement/[flags] (Substitute): This is the most frequently used command. It searches for regexp in the pattern space and replaces the first match with replacement.

    • regexp: A Basic Regular Expression (unless -E used).
    • replacement: The text to substitute in. Can contain special characters:
      • &: Represents the entire matched regexp. s/hello/(&)/ replaces "hello" with "(hello)".
      • \1, \2, ...: Represent text captured by the 1st, 2nd, ... parenthesized group (...) in the regexp (requires \(...\) in BRE, (...) in ERE). s/\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\)/\3.\2.\1/ changes YYYY-MM-DD to DD.MM.YYYY.
    • flags: Modify the substitution behavior:
      • g: Global. Replace all occurrences of regexp on the line, not just the first.
      • N (a number): Replace only the Nth occurrence.
      • i or I: Case-insensitive matching for regexp.
      • p: Print the pattern space if a substitution was made. Often used with -n.
  • d (Delete): Delete the pattern space. The current line is not printed, and the next cycle begins. sed '/^#/d' config.txt deletes all lines starting with #.

  • p (Print): Print the current pattern space. Usually used with -n to selectively print only certain lines. sed -n '/ERROR/p' file.log prints only lines containing "ERROR".

  • a \text (Append): Append text after the current line. The text is output when the next line is read or the input ends. sed '/start_marker/a \New line inserted here' file.txt.

  • i \text (Insert): Insert text before the current line. sed '1i \# Header added by sed' file.txt inserts a header line before the first line.

  • c \text (Change): Replace the selected line(s) entirely with text. sed '/old_setting/c \new_setting=true' config.txt.

  • y/source-chars/dest-chars/ (Transliterate): Translate characters. Replaces every character in source-chars found in the pattern space with the corresponding character in dest-chars. Similar to the tr command but within sed. sed 'y/abc/ABC/' changes 'a' to 'A', 'b' to 'B', 'c' to 'C'.

sed is a dense but powerful tool. Starting with the s, d, and p commands covers a large percentage of common use cases. Remember to always test without the -i option first!
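
Two self-contained illustrations of these commands (the input is generated with echo and printf, so no files are touched):

# Reorder an ISO date (YYYY-MM-DD) to DD.MM.YYYY using BRE capture groups
echo "Report date: 2023-10-26" | sed 's/\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\)/\3.\2.\1/'

# Print only lines 2 through 3: -n suppresses auto-printing, the address range selects lines, p prints them
printf 'one\ntwo\nthree\nfour\n' | sed -n '2,3p'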

Workshop Stream Editing with sed

Objective: To practice using sed for common text transformations like substitution, deletion, insertion, and selective printing.

Scenario: We'll modify a sample configuration file and process log data.

Setup:

  1. Create a sample configuration file: Create server.conf.

    cat << EOF > server.conf
    # Server Configuration File
    # Last updated: 2023-10-25
    
    ServerName primary.example.com
    ListenPort = 80
    # ListenPort = 443
    
    DocumentRoot /var/www/html
    ErrorLog /var/log/server_error.log
    
    # Security Settings (Use with caution)
    EnableSSL false
    SSLProtocol TLSv1.2 TLSv1.3
    # Obsolete Setting below
    AllowInsecureAuth no
    EOF
    

  2. Create a small CSV data file: Create data.csv.

    cat << EOF > data.csv
    ID,Timestamp,Value,Status
    1,202310261000,55.3,OK
    2,202310261005,57.1,OK
    3,202310261010,61.0,WARN
    4,202310261015,59.5,OK
    5,202310261020,65.2,FAIL
    EOF
    

Steps:

  1. Simple Substitution (First Occurrence): In server.conf, change the ListenPort from 80 to 8080.

    • Command: sed 's/80/8080/' server.conf
    • Observe: Only the ListenPort = 80 line is changed. The output is printed to the terminal; the original file is untouched.
  2. Global Substitution: Suppose we wanted to replace all instances of "Server" with "Host" (case-sensitive).

    • Command: sed 's/Server/Host/g' server.conf
    • Observe: Both "ServerName" and "Server Configuration File" (in the comment) are changed to "HostName" and "Host Configuration File". The g flag makes it global on each line.
  3. Case-Insensitive Global Substitution: Replace all occurrences of "server" (any case) with "APPLICATION".

    • Command: sed 's/server/APPLICATION/gi' server.conf
    • Observe: "ServerName", "server_error.log" are changed. The i flag makes the search case-insensitive.
  4. Using & for Backreference: Enclose the document root path in double quotes.

    • Command: sed 's#^DocumentRoot .*#DocumentRoot "&"#' server.conf (Using # as the delimiter means any / characters in the pattern or replacement need no escaping)
    • Observe: The & refers to the entire matched pattern, which here is the whole line DocumentRoot /var/www/html, so the result is DocumentRoot "DocumentRoot /var/www/html". That is not what we want; we need to capture just the path.
    • Better Command: sed 's#^\(DocumentRoot \)\(.*\)#\1"\2"#' server.conf
    • Observe: \(DocumentRoot \) captures the first part into \1, and \(.*\) captures the rest (the path) into \2. The replacement \1"\2" puts quotes around the path only, producing DocumentRoot "/var/www/html".
  5. Deleting Lines: Remove all comment lines (starting with #) from server.conf.

    • Command: sed '/^#/d' server.conf
    • Observe: All lines starting with # are gone from the output.
  6. Deleting Lines in a Range: Remove the "Security Settings" comment and the settings below it, up to the "Obsolete Setting" comment.

    • Command: sed '/# Security Settings/,/# Obsolete Setting/d' server.conf
    • Observe: The lines from # Security Settings down to and including # Obsolete Setting below are deleted.
  7. Selective Printing with -n and p: Print only the lines containing "Log" from server.conf.

    • Command: sed -n '/Log/p' server.conf
    • Observe: Only the ErrorLog line is printed. The -n suppresses default output, and p prints only matching lines.
  8. Combining Commands (-e): Change primary.example.com to main.domain.local AND change EnableSSL false to EnableSSL true.

    • Command: sed -e 's/primary\.example\.com/main.domain.local/' -e 's/EnableSSL false/EnableSSL true/' server.conf (Note: . is escaped \. to match literal dot)
    • Observe: Both substitutions are applied.
  9. Inserting Text: Insert a new setting User webadmin before the DocumentRoot line.

    • Command: sed '/^DocumentRoot/i \User webadmin' server.conf
    • Observe: The line User webadmin appears immediately before the DocumentRoot line.
  10. Appending Text: Append a comment # End of Basic Settings after the ErrorLog line.

    • Command: sed '/^ErrorLog/a \# End of Basic Settings' server.conf
    • Observe: The comment appears immediately after the ErrorLog line.
  11. Changing Lines: Change the "Obsolete Setting" line and the AllowInsecureAuth line below it to a single comment # Authentication handled by upstream proxy.

    • Command: sed '/# Obsolete Setting/,/AllowInsecureAuth/c \# Authentication handled by upstream proxy' server.conf
    • Observe: The two lines matching the range are replaced by the single new comment line.
  12. Processing CSV Data: Change the status "FAIL" to "CRITICAL" in data.csv.

    • Command: sed 's/,FAIL$/,CRITICAL/' data.csv
    • Observe: The last line's status is changed. We use $ to anchor the match to the end of the line, ensuring we only change the status field.
  13. In-Place Edit (Simulation with Backup): Let's try changing ListenPort to 443 in the original file, but create a backup first.

    • First, verify the command: sed 's/ListenPort = 80/ListenPort = 443/' server.conf (Looks correct)
    • Now, execute with -i.bak: sed -i.bak 's/ListenPort = 80/ListenPort = 443/' server.conf
    • Check:
      • ls server.conf* (You should see server.conf and server.conf.bak)
      • cat server.conf (Shows the modified file with ListenPort = 443)
      • cat server.conf.bak (Shows the original file with ListenPort = 80)

Cleanup (Optional):

rm server.conf server.conf.bak data.csv

This workshop covered fundamental sed operations. sed truly shines when combined with grep and other tools in pipelines for complex automated text processing tasks. Remember the -i option is powerful but potentially destructive; always test first!
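
As a closing example of sed in a pipeline (assuming you have not yet run the cleanup), the following strips comments and blank lines from server.conf and normalizes the spacing around the = sign, producing a compact view of the effective settings:

sed -e '/^#/d' -e '/^$/d' -e 's/ *= */=/' server.conf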

4. Advanced Text Processing with awk

While sed operates primarily on lines, awk is designed for field-oriented processing. It treats each input line (called a record) as being composed of multiple fields, which are typically separated by whitespace (spaces or tabs) by default. awk allows you to easily extract, manipulate, and report on data based on these fields, making it exceptionally useful for processing structured text data like CSV files, log files, or command output.

awk is not just a command; it's a complete programming language with variables, arithmetic operations, string functions, control structures (if, loops), and associative arrays.

How awk Works: The Basic Model

awk reads its input (files or standard input) one record (line) at a time. For each record, it performs the following:

  1. Splits the record into fields: Based on the current Field Separator (FS), awk divides the record into fields.
    • The fields are accessible using variables: $1 for the first field, $2 for the second, and so on.
    • $0 represents the entire, unmodified record.
  2. Evaluates patterns: awk checks the record against each pattern { action } rule provided in the script.
  3. Executes actions: If a pattern matches the current record (or if no pattern is specified, which matches every record), the corresponding { action } block is executed. Actions typically involve printing, calculations, or manipulating variables.
  4. Repeats: Continues this process until all records are read.

awk Script Structure

An awk script consists of a series of pattern { action } rules.

pattern1 { action1 }
pattern2 { action2 }
...
  • pattern: Specifies when the action should be executed. It can be:

    • Omitted: The action executes for every input record.
    • /regexp/: A regular expression (ERE). The action executes if the entire record ($0) matches the regex.
    • expression: A conditional expression (e.g., $3 > 100, $1 == "ERROR"). The action executes if the expression evaluates to true (non-zero or non-empty). Field comparisons are usually done numerically or lexicographically depending on context.
    • pattern1, pattern2: A range pattern (like sed). The action executes for all records starting from one matching pattern1 up to the next one matching pattern2.
    • BEGIN: A special pattern. The associated action is executed once before any input records are read. Useful for initializing variables or printing headers.
    • END: A special pattern. The associated action is executed once after all input records have been processed. Useful for calculating totals or printing summaries.
  • { action }: A block of awk statements enclosed in curly braces. If omitted for a pattern, the default action is { print $0 } (print the entire matching record). Common actions include:

    • print expression1, expression2, ...: Prints the expressions, separated by the Output Field Separator (OFS) (a space by default). Without arguments, print is equivalent to print $0.
    • printf format, expression1, ...: Formatted printing, similar to C's printf.
    • Variable assignments (e.g., count = count + 1 or count++).
    • Control structures (if (...) { ... } else { ... }, while (...) { ... }, for (...) { ... }).

Built-in Variables

awk provides many useful built-in variables:

  • $0: The entire current input record.
  • $1, $2, ... $N: The fields of the current record.
  • NF (Number of Fields): The total number of fields in the current record. $NF refers to the last field.
  • NR (Number of Records): The total number of input records processed so far (cumulative line number).
  • FNR (File Number of Records): The record number within the current input file (resets for each file).
  • FS (Field Separator): The regular expression used to separate fields on input. Default is whitespace (" "). Can be set using the -F command-line option (e.g., -F, for comma-separated) or by assigning to FS within the script (often in a BEGIN block).
  • OFS (Output Field Separator): The string used to separate fields in the output of print. Default is a single space (" ").
  • ORS (Output Record Separator): The string outputted after each record by print. Default is a newline ("\n").
  • FILENAME: The name of the current input file.

Basic awk Usage Examples

  • Print specific fields: Print the first and third fields of each line.
    awk '{ print $1, $3 }' data.txt
    
  • Print lines matching a pattern: Print lines where the second field is exactly "ERROR".
    awk '$2 == "ERROR" { print $0 }' log.txt
    # Or simply (default action is print $0):
    awk '$2 == "ERROR"' log.txt
    
  • Using BEGIN for a header: Print a header, then the user ID and name from users.txt (comma-separated).
    awk -F, 'BEGIN { print "User ID\tName" } NR > 1 { print $1, $2 }' users.txt
    
    • -F,: Sets the field separator to a comma.
    • BEGIN { ... }: Prints the header before processing lines.
    • NR > 1: Pattern to skip the header line in the input file (NR is the record number).
    • { print $1, $2 }: Prints the first and second fields for records where NR > 1.
  • Using END for a summary: Count the number of lines.
    awk 'END { print "Total lines:", NR }' data.txt
    
  • Performing calculations: Sum the values in the third column of data.csv.
    awk -F, 'NR > 1 { sum += $3 } END { print "Total Value:", sum }' data.csv
    
    • sum += $3: For each data line (NR > 1), add the value of the 3rd field to the sum variable (awk initializes numeric variables to 0).
    • END { ... }: After processing all lines, print the final sum.

awk's ability to handle fields and perform calculations makes it significantly more powerful than sed for structured data analysis and reporting directly on the command line.
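
Since awk's associative arrays were mentioned above but not yet shown, here is a brief sketch, assuming the comma-separated users.txt file (with a header line) used in the workshops of this section:

# Count users per department: the array index is the department name (field 3)
awk -F, 'NR > 1 { dept[$3]++ } END { for (d in dept) print d, dept[d] }' users.txt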

Workshop Advanced Text Processing with awk

Objective: To practice using awk for field extraction, pattern matching, calculations, and using BEGIN/END blocks.

Scenario: We'll analyze web server log data and process the user data file.

Setup:

  1. Create a sample web server log file: Create access.log. This simulates a common log format (IP Address, -, -, Timestamp, Request, Status Code, Size).

    cat << EOF > access.log
    192.168.1.10 - - [26/Oct/2023:10:15:01 +0000] "GET /index.html HTTP/1.1" 200 512
    10.0.0.5 - - [26/Oct/2023:10:15:05 +0000] "GET /images/logo.png HTTP/1.1" 200 2048
    192.168.1.10 - - [26/Oct/2023:10:16:10 +0000] "POST /login HTTP/1.1" 302 128
    172.16.5.20 - - [26/Oct/2023:10:17:00 +0000] "GET /styles.css HTTP/1.1" 200 1024
    10.0.0.5 - - [26/Oct/2023:10:17:30 +0000] "GET /favicon.ico HTTP/1.1" 404 50
    192.168.1.10 - - [26/Oct/2023:10:18:00 +0000] "GET /dashboard HTTP/1.1" 200 4096
    10.0.0.5 - - [26/Oct/2023:10:18:05 +0000] "GET /api/users HTTP/1.1" 500 100
    EOF
    

  2. Ensure users.txt exists: If you removed it previously, recreate it using the commands from the Workshop in section 2.

    cat << EOF > users.txt
    User ID,Name,Department,Status,Last Login
    101,Alice Smith,Engineering,Active,2023-10-25
    102,Bob Johnson,Sales,Inactive,2023-09-10
    103,Charlie Brown,Engineering,Active,2023-10-26
    104,David Williams,Support,Active,2023-10-26
    105,Eve Davis,Sales,Active,2023-10-24
    106,Frank Miller,Support,Pending,2023-10-20
    107,Grace Wilson,Engineering,active,2023-10-26
    EOF
    

Steps:

  1. Extract Specific Fields: Print the IP address (field 1) and the requested path (field 7) from access.log.

    • Command: awk '{ print $1, $7 }' access.log
    • Observe: Each line shows the IP and the path requested. awk splits fields by whitespace by default.
  2. Using BEGIN for Header: Print a header "IP Address -> Request Path" before the output from step 1.

    • Command: awk 'BEGIN { print "IP Address -> Request Path" } { print $1, $7 }' access.log
    • Observe: The header line appears first.
  3. Conditional Printing (Expression): Print only the requests that resulted in a 404 status code (field 9).

    • Command: awk '$9 == 404 { print $0 }' access.log
    • Observe: Only the line containing /favicon.ico (which has status 404) is printed. Note that awk treats $9 as a number here for the comparison.
  4. Conditional Printing (Regex): Print requests made for PNG image files.

    • Command: awk '$7 ~ /\.png$/ { print $0 }' access.log
    • Observe: Only the line requesting /images/logo.png is printed.
      • $7 ~ /pattern/: This awk syntax checks if field 7 matches the regular expression /pattern/.
      • /\.png$/: The regex matches a literal dot (\.) followed by png at the end of the string ($).
  5. Using -F for Different Delimiter: Print the Name (field 2) and Department (field 3) for all users in users.txt.

    • Command: awk -F, '{ print $2, $3 }' users.txt
    • Observe: The header line is not skipped, so the first line of output is "Name Department" (the header's own fields 2 and 3).
    • Refined Command (skip header): awk -F, 'NR > 1 { print "Name:", $2, "| Department:", $3 }' users.txt
    • Observe: Sets comma as delimiter (-F,). Skips the first line (NR > 1). Prints formatted output for fields 2 and 3.
  6. Performing Calculations: Calculate the total bytes transferred (field 10) for all successful requests (status code 200, field 9) in access.log.

    • Command: awk '$9 == 200 { total_bytes += $10 } END { print "Total bytes for successful requests:", total_bytes }' access.log
    • Observe:
      • $9 == 200: Pattern matches only lines with status 200.
      • { total_bytes += $10 }: Action adds the value of field 10 to the total_bytes variable for matching lines.
      • END { ... }: After processing all lines, prints the final sum.
  7. Counting Items: Count how many requests came from the IP address 10.0.0.5.

    • Command: awk '$1 == "10.0.0.5" { count++ } END { print "Requests from 10.0.0.5:", count }' access.log
    • Observe: Counts lines where the first field is 10.0.0.5 and prints the total at the end.
  8. Using printf for Formatted Output: Print the status (field 4) and name (field 2) of users from users.txt, aligning the output neatly.

    • Command: awk -F, 'NR > 1 { printf "Status: %-10s | Name: %s\n", $4, $2 }' users.txt
    • Observe:
      • -F,: Comma delimiter.
      • NR > 1: Skip header.
      • printf format, val1, val2: Formatted print.
      • %-10s: Format specifier for a string (s), left-aligned (-), in a field of minimum 10 characters wide.
      • %s: Format specifier for a simple string.
      • \n: Newline character. The output is neatly aligned.
  9. Modifying Field Separators: Print the User ID and Status from users.txt, but separate the output with a tab character. Also, skip the header.

    • Command: awk -F, 'BEGIN { OFS="\t" } NR > 1 { print $1, $4 }' users.txt
    • Observe:
      • BEGIN { OFS="\t" }: Sets the Output Field Separator to a tab before processing lines.
      • The print $1, $4 action now uses a tab between the fields in the output.
  10. Combining Patterns and Actions: For requests in access.log, print "Large Request" followed by the line if the size (field 10) is greater than 1500 bytes, and print "Error Request" if the status code (field 9) is 500 or more.

    • Command:
      awk '$10 > 1500 { print "Large Request:", $0 } $9 >= 500 { print "Error Request:", $0 }' access.log
      
    • Observe: The lines for logo.png (2048 bytes) and dashboard (4096 bytes) are printed with "Large Request:". The line for /api/users (status 500) is printed with "Error Request:". Note that a single input line can match multiple pattern {action} pairs and is then printed once per matching rule.

Cleanup (Optional):

rm access.log users.txt

This workshop demonstrated awk's capability to parse structured data, perform conditional actions based on field values or patterns, calculate aggregates, and format output. awk is a versatile tool for data extraction, transformation, and reporting tasks on the Linux command line.
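
As a final sketch (assuming access.log has not been removed yet), associative arrays also make per-group statistics straightforward, for example the request count and average response size per HTTP status code:

awk '{ sum[$9] += $10; cnt[$9]++ } END { for (s in cnt) printf "Status %s: %d requests, avg %.1f bytes\n", s, cnt[s], sum[s]/cnt[s] }' access.log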

5. Sorting and Uniqueness (sort, uniq)

Often, after extracting or manipulating text, you need to organize it or remove duplicate entries. Linux provides two essential utilities for this: sort for ordering lines and uniq for identifying and filtering adjacent duplicate lines. They are frequently used together.

sort

The sort command rearranges lines from text files or standard input and prints the result to standard output. By default, it sorts based on comparing entire lines lexicographically (like dictionary order, but based on ASCII character values).

Basic Usage:

sort [options] [file...]

Common sort Options:

  • -r (--reverse): Reverse the result of comparisons, sorting in descending order.
  • -n (--numeric-sort): Compare according to string numerical value. Crucial for sorting numbers correctly (otherwise, "10" comes before "2").
  • -k POS1[,POS2] (--key=POS1[,POS2]): Specify a sort key. Sort based on a field (key) within the line, not the entire line.
    • POS1: The starting position of the key. Fields are numbered starting from 1.
    • POS2: Optional ending position of the key. If omitted, the key extends to the end of the line.
    • Positions can have modifiers like n (numeric), r (reverse) applied only to that key (e.g., -k 3n sorts numerically on field 3, -k 1,1 sorts only on field 1).
  • -t <separator> (--field-separator=<separator>): Specify the field separator character used when defining keys with -k. Default separators are non-blank characters transitioning to blank characters (handling multiple spaces/tabs gracefully). For specific delimiters like commas or colons, use -t.
  • -u (--unique): Output only the first of each run of lines that compare equal. This is similar to piping the output through uniq, but sort -u can be more efficient as it handles uniqueness during the sort process. Note: equality is judged by the sort key(s) in use, so when sorting with -k, lines with equal keys are treated as duplicates even if they differ elsewhere.
  • -f (--ignore-case): Fold lower case to upper case characters for comparisons (case-insensitive sort).
  • -h (--human-numeric-sort): Compare human-readable numbers (e.g., 2K, 1G). Requires GNU sort.
  • -M (--month-sort): Compare as month names ("JAN" < "FEB" < ...).
  • --stable: Use a stable sort algorithm. Lines that compare as equal maintain their original relative order. Default sort is usually stable, but explicitly requesting it can be necessary in complex scenarios or scripts relying on this behavior.

Important Note on Keys (-k):

  • Fields are counted starting from 1.
  • -k 2 means the sort key starts at the beginning of field 2 and goes to the end of the line.
  • -k 2,2 means the sort key is only field 2.
  • -k 2,3 means the sort key includes field 2 and field 3.
  • -k 2.3,2.5 means the key starts at the 3rd character of field 2 and ends at the 5th character of field 2 (character positions also start from 1). A short demonstration of key-based sorting follows below.
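
A quick demonstration of why the per-key n modifier matters (the input is generated with printf, so no files are needed):

# Lexicographic sort on field 2: "100" sorts before "20" and "5"
printf 'pear,5\napple,20\nfig,100\n' | sort -t, -k 2,2

# Numeric sort on field 2: 5 < 20 < 100, as intended
printf 'pear,5\napple,20\nfig,100\n' | sort -t, -k 2,2n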

uniq

The uniq command filters adjacent matching lines from sorted input. It reads from standard input or a single file and writes unique lines to standard output. Crucially, uniq only detects duplicate lines if they are consecutive. This means you almost always need to use sort before piping data to uniq.

Basic Usage:

uniq [options] [input-file [output-file]]

Common uniq Options:

  • -c (--count): Precede each output line with a count of how many times it occurred in the input (among adjacent lines).
  • -d (--repeated): Only print duplicate lines, one for each group.
  • -u (--unique): Only print lines that are not repeated (unique among adjacent lines).
  • -i (--ignore-case): Ignore case differences when comparing lines.
  • -f <N> (--skip-fields=<N>): Avoid comparing the first N fields. Fields are separated by whitespace by default.
  • -s <N> (--skip-chars=<N>): Avoid comparing the first N characters. Applied after skipping fields (-f).
  • -w <N> (--check-chars=<N>): Compare no more than N characters per line. Applied after skipping fields/chars.

Combining sort and uniq

The most common pattern is:

# Count occurrences of each line
command_producing_output | sort | uniq -c

# Get only unique lines
command_producing_output | sort | uniq

# Get only lines that appeared more than once
command_producing_output | sort | uniq -d

Alternatively, for just getting unique lines (equivalent to sort | uniq):

command_producing_output | sort -u

Choosing between sort | uniq and sort -u often depends on what you need next. If you need the counts (uniq -c), you must use uniq. sort -u might be slightly faster if you only need the unique lines themselves.

Workshop Sorting and Uniqueness

Objective: To practice sorting data numerically and lexicographically, using keys, and using uniq to count or filter duplicates.

Scenario: We'll work with a list of fruit names and quantities, and revisit the access.log file.

Setup:

  1. Create a fruit list file: Create fruits.txt.

    cat << EOF > fruits.txt
    Apple,Red,15
    Banana,Yellow,25
    Orange,Orange,10
    Apple,Green,20
    Grape,Purple,50
    Banana,Green,18
    Orange,Blood,12
    Apple,Red,15
    EOF
    

    • Format: Fruit Name, Color, Quantity
  2. Ensure access.log exists: Recreate it if necessary from the Workshop in section 4.

    cat << EOF > access.log
    192.168.1.10 - - [26/Oct/2023:10:15:01 +0000] "GET /index.html HTTP/1.1" 200 512
    10.0.0.5 - - [26/Oct/2023:10:15:05 +0000] "GET /images/logo.png HTTP/1.1" 200 2048
    192.168.1.10 - - [26/Oct/2023:10:16:10 +0000] "POST /login HTTP/1.1" 302 128
    172.16.5.20 - - [26/Oct/2023:10:17:00 +0000] "GET /styles.css HTTP/1.1" 200 1024
    10.0.0.5 - - [26/Oct/2023:10:17:30 +0000] "GET /favicon.ico HTTP/1.1" 404 50
    192.168.1.10 - - [26/Oct/2023:10:18:00 +0000] "GET /dashboard HTTP/1.1" 200 4096
    10.0.0.5 - - [26/Oct/2023:10:18:05 +0000] "GET /api/users HTTP/1.1" 500 100
    EOF
    

Steps:

  1. Default Sort: Sort fruits.txt lexicographically.

    • Command: sort fruits.txt
    • Observe: Lines are sorted alphabetically based on the entire line content. Note that "Apple,Green" comes before "Apple,Red".
  2. Reverse Sort: Sort fruits.txt in reverse alphabetical order.

    • Command: sort -r fruits.txt
    • Observe: The order from step 1 is reversed. "Orange,Orange" is now first.
  3. Numeric Sort (Incorrect without Key): Try sorting fruits.txt numerically (this won't work as expected yet).

    • Command: sort -n fruits.txt
    • Observe: The result is likely the same as the default sort, because the beginning of each line ("Apple", "Banana") is not numeric. -n applies to the whole line unless a key is specified.
  4. Sort by Key (Field): Sort fruits.txt based on the fruit name (field 1). Use comma as a delimiter.

    • Command: sort -t, -k 1,1 fruits.txt
    • Observe: All "Apple" lines are grouped, "Banana" lines are grouped, etc. Within each group, the original relative order might be preserved (depending on sort stability) or determined by subsequent fields. -t, sets the delimiter. -k 1,1 means sort only based on field 1.
  5. Sort by Numeric Key: Sort fruits.txt based on the quantity (field 3), numerically.

    • Command: sort -t, -k 3n fruits.txt
    • Observe: Lines are sorted from the smallest quantity (10) to the largest (50). -k 3n applies numeric sorting (n) specifically to the key starting at field 3.
  6. Sort by Key (Reverse Numeric): Sort by quantity (field 3) in descending order.

    • Command: sort -t, -k 3nr fruits.txt
    • Observe: Lines are sorted from largest quantity (50) down to smallest (10). nr applies reverse numeric sort to the key.
  7. Secondary Sort Key: Sort primarily by fruit name (field 1), and secondarily by quantity (field 3, numerically).

    • Command: sort -t, -k 1,1 -k 3n fruits.txt
    • Observe: Lines are grouped by fruit name ("Apple", "Banana", ...). Within each group (e.g., "Apple"), the lines are sorted by quantity (15, 15, 20).
  8. Using uniq (Basic): Show only the unique lines from fruits.txt. Remember uniq needs sorted input!

    • Command: sort fruits.txt | uniq
    • Observe: The duplicate line "Apple,Red,15" appears only once.
  9. Using sort -u: Achieve the same result as step 8 using sort -u.

    • Command: sort -u fruits.txt
    • Observe: The output should be identical to sort fruits.txt | uniq.
  10. Counting Occurrences with uniq -c: Count how many times each unique line appears in fruits.txt.

    • Command: sort fruits.txt | uniq -c
    • Observe: Each unique line is prefixed with its count. "Apple,Red,15" should have a count of 2.
  11. Sorting the Counts: Show the counts from step 10, sorted from most frequent to least frequent line.

    • Command: sort fruits.txt | uniq -c | sort -nr
    • Observe: The output is sorted numerically (-n) and in reverse (-r) based on the count (which is the first field of uniq -c's output). The line with count 2 appears first.
  12. Finding Duplicate Lines with uniq -d: Show only the lines that appear more than once in fruits.txt.

    • Command: sort fruits.txt | uniq -d
    • Observe: Only "Apple,Red,15" is printed.
  13. Finding Truly Unique Lines with uniq -u: Show only the lines that appear exactly once in fruits.txt.

    • Command: sort fruits.txt | uniq -u
    • Observe: All lines except "Apple,Red,15" are printed.
  14. Real-World Example: Top IP Addresses: Find the top 3 IP addresses accessing the web server based on access.log.

    • Step 1: Extract IP addresses (field 1).
      awk '{ print $1 }' access.log
      
    • Step 2: Sort the IP addresses.
      awk '{ print $1 }' access.log | sort
      
    • Step 3: Count occurrences of each IP.
      awk '{ print $1 }' access.log | sort | uniq -c
      
    • Step 4: Sort the counts numerically in reverse order.
      awk '{ print $1 }' access.log | sort | uniq -c | sort -nr
      
    • Step 5: Get the top 3 using head.
      awk '{ print $1 }' access.log | sort | uniq -c | sort -nr | head -n 3
      
    • Observe the final output showing the count and IP address for the 3 most frequent visitors. This demonstrates a typical pipeline combining awk, sort, uniq, and head.

Cleanup (Optional):

rm fruits.txt access.log

This workshop illustrated how sort orders data based on various criteria (lexical, numeric, key-based) and how uniq filters or counts adjacent duplicates in sorted input. Combining them is essential for summarizing and analyzing text data.

6. Counting Lines, Words, and Characters (wc)

Sometimes, you don't need to see the content itself, but rather get metrics about the content: how many lines, words, or characters does a file or command output contain? The wc (Word Count) command is the standard utility for this task.

Basic Usage

wc [options] [file...]

wc reads from the specified files or standard input and prints counts based on the options provided, followed by the filename if input comes from a file. If multiple files are given, it prints counts for each file and then a total line.

Core wc Options

  • -l (--lines): Print the newline counts. This effectively counts the number of lines.
  • -w (--words): Print the word counts. Words are typically sequences of non-whitespace characters separated by whitespace.
  • -c (--bytes): Print the byte counts.
  • -m (--chars): Print the character counts. This can differ from byte counts (-c) for multibyte character encodings (like UTF-8). Often, -m is what you want for "character" count in modern systems.
  • -L (--max-line-length): Print the maximum display width (length of the longest line).

Default Behavior

If no options are specified, wc prints the line count, word count, and byte count (in that order), followed by the filename (if applicable).

wc filename
# Output: <lines> <words> <bytes> filename

Common Use Cases

  • Counting lines in a file: wc -l filename.txt
  • Counting files in a directory: ls -1 | wc -l (Note: ls -1 lists one file per line)
  • Counting words in piped output: grep 'ERROR' system.log | wc -w (Counts words only in the lines containing "ERROR")
  • Checking if a file is empty: An empty file will have 0 lines, 0 words, and 0 bytes.
  • Getting only the number: Often, you pipe the output of wc -l (which includes the filename) to awk '{print $1}' or use command substitution if you just need the numeric value in a script.
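
As a small sketch of that last point (data.txt is a hypothetical file name):

    # Redirecting the file into wc suppresses the filename, printing only the number
    wc -l < data.txt

    # Capturing the count in a shell variable via command substitution
    line_count=$(wc -l < data.txt)
    echo "data.txt has $line_count lines"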

wc is a simple but fundamental utility for quick summaries of text data size and structure.

Workshop Counting with wc

Objective: To practice using wc to count lines, words, bytes, and characters in files and piped input.

Scenario: We will use the fruits.txt file and output from other commands.

Setup:

  1. Recreate fruits.txt:
    cat << EOF > fruits.txt
    Apple,Red,15
    Banana,Yellow,25
    Orange,Orange,10
    Apple,Green,20
    Grape,Purple,50
    Banana,Green,18
    Orange,Blood,12
    Apple,Red,15
    
    EOF
    
    (Note: Added an extra blank line at the end for demonstration)

Steps:

  1. Default wc Output: Run wc on fruits.txt without any options.

    • Command: wc fruits.txt
    • Observe: The output shows the line count, word count, and byte count, followed by the filename. The line count reflects the 8 data lines plus the blank line (9 in total). Words are sequences of non-whitespace characters, so a line like "Apple,Red,15" counts as exactly one word, giving a word count of 8. The byte count depends on the line endings used (LF vs CRLF).
  2. Count Lines: Get only the line count for fruits.txt.

    • Command: wc -l fruits.txt
    • Observe: Shows the number of lines (should be 9) and the filename.
  3. Count Words: Get only the word count for fruits.txt.

    • Command: wc -w fruits.txt
    • Observe: Shows the word count and the filename.
  4. Count Bytes: Get only the byte count for fruits.txt.

    • Command: wc -c fruits.txt
    • Observe: Shows the byte count and the filename.
  5. Count Characters: Get only the character count (useful for multi-byte encodings).

    • Command: wc -m fruits.txt
    • Observe: Shows the character count and the filename. If the file contains only ASCII characters, -m and -c report the same number, since every ASCII character occupies one byte. With multi-byte (e.g., UTF-8) characters, the character count would be lower than the byte count.
  6. Find Longest Line: Find the length of the longest line in fruits.txt.

    • Command: wc -L fruits.txt
    • Observe: Shows the maximum line length (in display width) and the filename.
  7. Counting from Standard Input: Count the lines output by the ls -1 /etc command (listing files in /etc, one per line).

    • Command: ls -1 /etc | wc -l
    • Observe: Shows the number of files and directories directly within /etc. No filename is printed by wc because it's reading from the pipe (standard input).
  8. Counting Filtered Output: Count how many different types of fruit are listed in fruits.txt (ignoring color and quantity).

    • Step 1: Extract the first field (fruit name), skipping blank lines.
      awk -F, 'NF > 0 { print $1 }' fruits.txt
      # NF > 0 skips the blank line at the end of the file

    • Step 2: Sort the names.
      awk -F, 'NF > 0 { print $1 }' fruits.txt | sort

    • Step 3: Get unique names.
      awk -F, 'NF > 0 { print $1 }' fruits.txt | sort -u

    • Step 4: Count the unique names.
      awk -F, 'NF > 0 { print $1 }' fruits.txt | sort -u | wc -l

    • Observe: The final output should be 4, representing the four fruit types: Apple, Banana, Grape, and Orange.
  9. Multiple Files: Count lines, words, and bytes in both fruits.txt and a system file like /etc/passwd (if readable).

    • Command: wc fruits.txt /etc/passwd
    • Observe: wc prints the counts for fruits.txt, then the counts for /etc/passwd, and finally a total line summing the counts from both files.

Cleanup (Optional):

rm fruits.txt

This workshop showed how wc provides quick statistics about text data, both from files and standard input, making it useful for summaries and checks within scripts or command pipelines.

7. Comparing File Contents (diff, comm)

Comparing files to identify differences is a common task, especially when tracking changes in configuration files, source code, or datasets. Linux offers two primary tools for this: diff, which shows the differences in detail, and comm, which identifies common and unique lines between sorted files.

diff

The diff command compares two files line by line and outputs a description of the changes required to make the first file identical to the second file. It's the basis for creating patches (.diff or .patch files).

Basic Usage:

diff [options] file1 file2

Output Formats:

diff supports several output formats. The most common are:

  1. Normal Format (Default):

    • Shows differences using action characters (a for append, d for delete, c for change) along with line numbers and the differing lines themselves. Lines from file1 are prefixed with <, lines from file2 are prefixed with >.
    • Example: 1a2 means after line 1 of file1, you need to append line 2 of file2. 3d2 means delete line 3 from file1 (which corresponds to line 2's position in file2 after the deletion). 5c5 means change line 5 of file1 to become line 5 of file2.
    • This format is concise but can be hard to read for larger changes.
  2. Context Format (-c or -C NUM, --context[=NUM]):

    • Shows differing sections along with several lines (default 3, or NUM lines) of surrounding context (unchanged lines).
    • Lines from file1 are marked with ! (or - for deleted). Lines from file2 are marked with ! (or + for added). Unchanged context lines start with two spaces. File headers (*** file1, --- file2) are included.
    • Easier to understand the location of changes.
  3. Unified Format (-u or -U NUM, --unified[=NUM]):

    • Similar to context format but more compact, avoiding repetition of context lines. This is the most common format for patches used in version control systems (like Git) and software distribution.
    • File headers are --- file1 and +++ file2.
    • Change hunks start with @@ -line1,count1 +line2,count2 @@.
    • Lines unique to file1 (deleted) start with -.
    • Lines unique to file2 (added) start with +.
    • Context lines start with a space.
    • Highly recommended for readability and sharing.
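
As an illustration, a unified diff between two hypothetical versions of a small config file might look roughly like this (real output also includes modification timestamps in the file headers):

    --- app.conf.old
    +++ app.conf.new
    @@ -1,3 +1,3 @@
     # Main configuration
    -LogLevel=INFO
    +LogLevel=DEBUG
     MaxConnections=100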

Common diff Options:

  • -i (--ignore-case): Ignore case differences in file contents.
  • -w (--ignore-all-space): Ignore all whitespace differences (tabs, spaces).
  • -b (--ignore-space-change): Ignore changes in the amount of whitespace (e.g., one space vs. two).
  • -B (--ignore-blank-lines): Ignore changes whose lines are all blank.
  • -q (--brief): Report only whether files differ, not the details. Prints "Files X and Y differ".
  • -s (--report-identical-files): Report when two files are the same.
  • -r (--recursive): Recursively compare subdirectories found. When comparing directories, diff will compare files with the same name in both directories.
  • -N (--new-file): Treat absent files as empty. Useful with -r if a file exists in one directory but not the other.
  • -y (--side-by-side): Output in two columns, showing lines side-by-side. Can be combined with --width=NUM to control line width. Differences are marked.

comm

The comm command compares two sorted files line by line and produces three columns of output:

  1. Lines unique to file1.
  2. Lines unique to file2.
  3. Lines common to both files.

Crucially, both input files MUST be sorted lexicographically for comm to work correctly.

Basic Usage:

comm [options] file1 file2

Common comm Options:

  • -1: Suppress output column 1 (lines unique to file1).
  • -2: Suppress output column 2 (lines unique to file2).
  • -3: Suppress output column 3 (lines common to both files).
  • --check-order: Check that the input is correctly sorted; signal an error if not.
  • --nocheck-order: Do not check input order (can be faster, but results are undefined if not sorted).
  • --output-delimiter=STR: Separate columns with STR instead of the default tab character.

Example Use Cases for comm:

  • Find lines present in file2 but not in file1: comm -13 file1 file2 (suppress unique to file1, suppress common).
  • Find lines common to both files: comm -12 file1 file2 (suppress unique to file1, suppress unique to file2).
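
If your lists are not already sorted, a common bash idiom is to sort them on the fly with process substitution (list_a.txt and list_b.txt are hypothetical files):

    # Lines common to both lists, sorting each input first
    comm -12 <(sort list_a.txt) <(sort list_b.txt)

    # Lines only in list_a.txt
    comm -23 <(sort list_a.txt) <(sort list_b.txt)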

Choosing Between diff and comm

  • Use diff when you need to see the specific changes required to transform one file into another, especially for creating patches or understanding detailed modifications (code, configs). Unified format (-u) is generally preferred.
  • Use comm when you have sorted lists and want to quickly find common items, or items unique to one list or the other. It's often used for set operations on lists (intersection, difference).

Workshop Comparing Files

Objective: To practice using diff to identify changes between files in different formats and comm to compare sorted lists.

Scenario: We'll create two versions of a simple configuration file and two lists of users.

Setup:

  1. Create config_v1.txt:

    cat << EOF > config_v1.txt
    # Version 1 Configuration
    Hostname=server-alpha
    IPAddress=192.168.1.50
    AdminUser=admin
    LogLevel=INFO
    MaxConnections=100
    EOF
    

  2. Create config_v2.txt (modified version):

    cat << EOF > config_v2.txt
    # Version 2 Configuration (Updated)
    Hostname=server-beta
    IPAddress=192.168.1.55
    AdminUser=administrator
    LogLevel=DEBUG
    # MaxConnections=100 (Obsolete)
    MaxClients=150
    EOF
    

  3. Create users_group1.txt (sorted):

    cat << EOF | sort > users_group1.txt
    alice
    bob
    charlie
    david
    frank
    EOF
    

  4. Create users_group2.txt (sorted):

    cat << EOF | sort > users_group2.txt
    alice
    bob
    eve
    grace
    mallory
    EOF
    

Steps using diff:

  1. Default diff: Compare config_v1.txt and config_v2.txt.

    • Command: diff config_v1.txt config_v2.txt
    • Observe: The output uses the c (change) and d (delete)/a (add) notation. Notice how it describes changing lines 1-6 of v1 into lines 1-7 of v2. It can be hard to follow precisely which parts changed.
  2. Context diff: Compare using context format.

    • Command: diff -c config_v1.txt config_v2.txt
    • Observe: Easier to read. Shows file headers (***, ---) and context lines around the changes. Changed lines are marked with !. Added lines with +, deleted with -.
  3. Unified diff (Recommended): Compare using unified format.

    • Command: diff -u config_v1.txt config_v2.txt
    • Observe: The most common patch format. Shows headers (---, +++), hunks (@@ ... @@), context lines (start with space), deleted lines (-), and added lines (+). This clearly shows the changes line by line.
  4. Ignore Whitespace Changes: Let's create a third file with only whitespace changes.

    • Command: cp config_v1.txt config_v1_ws.txt
    • Command: sed -i 's/=/ = /' config_v1_ws.txt (Adds spaces around =)
    • Command: diff -u config_v1.txt config_v1_ws.txt (Shows differences)
    • Command: diff -uw config_v1.txt config_v1_ws.txt (Or diff -u -w)
    • Observe: The -w option makes diff ignore the whitespace changes, reporting no differences (or only other non-whitespace differences if there were any).
  5. Brief Output: Just check if the files differ.

    • Command: diff -q config_v1.txt config_v2.txt
    • Observe: Prints Files config_v1.txt and config_v2.txt differ.
    • Command: diff -q config_v1.txt config_v1.txt
    • Observe: Prints nothing, as the files are identical.
    • Command: diff -qs config_v1.txt config_v1.txt
    • Observe: -s makes it report identical files: Files config_v1.txt and config_v1.txt are identical.

Steps using comm:

Important: Remember comm requires sorted input. Our users_*.txt files were created sorted.

  1. Default comm: Compare users_group1.txt and users_group2.txt.

    • Command: comm users_group1.txt users_group2.txt
    • Observe: Three columns (separated by tabs):
      • Column 1: charlie, david, frank (unique to group1)
      • Column 2: eve, grace, mallory (unique to group2)
      • Column 3: alice, bob (common to both)
  2. Find Common Users: Show only users present in both groups.

    • Command: comm -12 users_group1.txt users_group2.txt
    • Observe: Suppresses columns 1 and 2, leaving only alice and bob.
  3. Find Users Only in Group 1: Show users present in group1 but not group2.

    • Command: comm -23 users_group1.txt users_group2.txt
    • Observe: Suppresses columns 2 and 3, leaving charlie, david, frank.
  4. Find Users Only in Group 2: Show users present in group2 but not group1.

    • Command: comm -13 users_group1.txt users_group2.txt
    • Observe: Suppresses columns 1 and 3, leaving eve, grace, mallory.
  5. Demonstrate Unsorted Input: Try comm on the unsorted config files.

    • Command: comm config_v1.txt config_v2.txt
    • Observe: The output is likely incorrect or might produce an error message (comm: file 1 is not in sorted order) depending on your comm version and locale settings, because the files are not lexicographically sorted. This highlights the importance of sorting input for comm.

Cleanup (Optional):

rm config_v1.txt config_v2.txt config_v1_ws.txt users_group1.txt users_group2.txt

This workshop demonstrated how diff provides detailed change information (especially with -u) and how comm efficiently finds commonalities and differences between pre-sorted lists.

8. Character Translation or Deletion (tr)

The tr command is a simple yet useful utility for translating or deleting characters. It reads from standard input and writes to standard output, making character-level substitutions or removals based on specified sets of characters. It does not take filenames as arguments; input must be redirected or piped to it.
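
For example, either of these feeds a hypothetical notes.txt into tr for uppercasing:

    tr '[:lower:]' '[:upper:]' < notes.txt      # input via redirection
    cat notes.txt | tr '[:lower:]' '[:upper:]'  # input via a pipe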

Basic Syntax

  1. Translate: tr [options] SET1 SET2

    • Replaces characters found in SET1 with the corresponding character in SET2. If SET2 is shorter than SET1, the last character of SET2 is repeated.
  2. Delete: tr [options] -d SET1

    • Deletes all characters found in SET1 from the input.
  3. Squeeze Repeats: tr [options] -s SET1

    • Replaces sequences of a character repeated multiple times in SET1 with a single instance of that character. Can be combined with translation.

Defining Character Sets (SET1, SET2)

Sets are strings of characters. tr provides several ways to define them:

  • Literal Characters: tr 'abc' 'xyz' translates 'a' to 'x', 'b' to 'y', 'c' to 'z'.
  • Ranges: a-z represents all lowercase letters, 0-9 all digits. tr 'a-z' 'A-Z' converts input to uppercase.
  • Character Classes (POSIX): Predefined sets enclosed in [:...:]. These are often more portable than explicit ranges, especially across different locales.
    • [:alnum:]: Alphanumeric characters (a-z, A-Z, 0-9).
    • [:alpha:]: Alphabetic characters (a-z, A-Z).
    • [:digit:]: Digits (0-9).
    • [:lower:]: Lowercase letters.
    • [:upper:]: Uppercase letters.
    • [:space:]: Whitespace characters (space, tab, newline, etc.).
    • [:punct:]: Punctuation characters.
    • [:cntrl:]: Control characters.
    • [:print:]: Printable characters (including space).
    • [:graph:]: Printable characters (not including space).
    • [:xdigit:]: Hexadecimal digits (0-9, a-f, A-F).
    • Example: tr '[:lower:]' '[:upper:]' converts to uppercase (often preferred over a-z A-Z for locale correctness).
  • Octal Escapes: \NNN represents the character with octal value NNN. \n is newline, \t is tab.
  • Repetition: [c*N] in SET2 means N repetitions of character c. [c*] means repeat c indefinitely (useful if SET2 needs to be as long as SET1).
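
A brief sketch of the escape and repetition syntax (GNU tr; expected output shown as comments):

    echo "a,b,c" | tr ',' '\t'      # a<TAB>b<TAB>c  (commas become tabs)
    echo "abc" | tr 'abc' '[x*]'    # xxx  ([x*] repeats x to the length of SET1)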

Common Options

  • -d (--delete): Delete characters in SET1. SET2 must not be specified.
  • -s (--squeeze-repeats): Squeeze repeated characters listed in the last specified set (SET1 if only one set, SET2 if translating).
  • -c or -C (--complement): Use the complement of SET1. All characters not in SET1 are selected.

Use Cases

  • Changing case: tr 'a-z' 'A-Z' or tr '[:lower:]' '[:upper:]'.
  • Deleting unwanted characters: tr -d '[:digit:]' removes all numbers. tr -d '\r' removes carriage return characters (useful for DOS/Windows files).
  • Converting delimiters: tr ',' '\t' converts commas to tabs.
  • Squeezing whitespace: tr -s ' ' replaces multiple spaces with a single space. tr -s '[:space:]' squeezes all types of whitespace.
  • Extracting specific characters: tr -cd '[:alnum:]' deletes the complement of alphanumerics, effectively keeping only letters and numbers.

tr is efficient for simple, character-by-character transformations applied uniformly across an input stream.
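
As a small combined sketch, cleaning up a hypothetical Windows-style CSV export (report.csv is an assumed filename):

    # Strip carriage returns, turn commas into tabs, then squeeze runs of spaces
    tr -d '\r' < report.csv | tr ',' '\t' | tr -s ' '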

Workshop Character Translation or Deletion

Objective: To practice using tr for case conversion, character deletion, delimiter replacement, and squeezing repeated characters.

Scenario: We will manipulate text strings provided via echo and standard input.

Setup: No specific files are needed; we will use echo and direct input.

Steps:

  1. Uppercase Conversion: Convert a lowercase string to uppercase.

    • Command: echo "hello world" | tr 'a-z' 'A-Z'
    • Observe: Output is HELLO WORLD.
    • Alternative (POSIX classes): echo "hello world 123" | tr '[:lower:]' '[:upper:]'
    • Observe: Output is HELLO WORLD 123.
  2. Lowercase Conversion: Convert a mixed-case string to lowercase.

    • Command: echo "Linux Is FUN!" | tr '[:upper:]' '[:lower:]'
    • Observe: Output is linux is fun!.
  3. Deleting Specific Characters: Remove all vowels (aeiou) from a string.

    • Command: echo "quick brown fox jumps over the lazy dog" | tr -d 'aeiouAEIOU'
    • Observe: Output is qck brwn fx jmps vr th lzy dg. -d deletes characters in the specified set.
  4. Deleting Non-Printable Characters (Example): Imagine a string with a control character (we'll simulate with backspace \b).

    • Command: echo -e "This\b is a\b test" | cat -v (Use cat -v to visualize control chars like ^H for backspace)
    • Command: echo -e "This\b is a\b test" | tr -d '[:cntrl:]'
    • Observe: The tr command removes the backspace characters, resulting in This is a test.
  5. Replacing Delimiters: Convert spaces to newlines (one word per line).

    • Command: echo "This is a line of text" | tr ' ' '\n'
    • Observe:
      This
      is
      a
      line
      of
      text
      
  6. Squeezing Repeated Characters: Remove extra spaces between words.

    • Command: echo "Too many spaces here" | tr -s ' '
    • Observe: Output is Too many spaces here. -s squeezes repeated spaces into one.
  7. Squeezing All Whitespace: Replace any sequence of whitespace characters with a single newline.

    • Command: echo -e "Field1\tField2 Field3\nField4" | tr -s '[:space:]' '\n'
    • Observe:
      Field1
      Field2
      Field3
      Field4
      
      Multiple spaces, tabs (\t), and existing newlines (\n) are all treated as whitespace to be squeezed and replaced by a single newline.
  8. Complement and Delete (Keep Only Digits): Remove everything except digits from a string.

    • Command: echo "Order Number: ORD-12345 / Date: 2023-10-26" | tr -cd '[:digit:]'
    • Observe: Output is 1234520231026.
      • -d: Delete mode.
      • -c: Use the complement of the set. The set is [:digit:].
      • So, delete everything that is not a digit.
  9. Simple Substitution Cipher (ROT13): Rotate letters 13 places (A->N, B->O, ..., M->Z, N->A, ...).

    • Command: echo "Secret Message" | tr 'a-zA-Z' 'n-za-mN-ZA-M'
    • Observe: Output is Frperg Zrffntr. Applying it again should decode it.
    • Command: echo "Frperg Zrffntr" | tr 'a-zA-Z' 'n-za-mN-ZA-M'
    • Observe: Output is Secret Message.
  10. Interactive Use: Try translating input directly from the keyboard.

    • Command: tr '()' '{}'
    • Now type: function(arg1, arg2) and press Enter.
    • Observe the output: function{arg1, arg2}
    • Press Ctrl+D to end the input to tr.

Cleanup: No files to remove.

This workshop showed how tr performs efficient character-level translation, deletion, and squeezing, making it valuable for data cleaning and simple transformations within pipelines.

9. Combining Tools with Pipes

The true power of the Linux command line for text processing comes not just from individual tools, but from their ability to be combined using the pipe (|) operator.

A pipe connects the standard output (stdout) of the command on its left to the standard input (stdin) of the command on its right. This allows you to build complex data processing workflows by chaining simple, specialized utilities together.

The Flow:

command1 | command2 | command3 ...
  1. command1 executes. Its standard output (what it would normally print to the screen) is not printed to the screen.
  2. Instead, that output is redirected (piped) to become the standard input for command2.
  3. command2 executes, reading its input from command1. Its standard output is then piped to command3.
  4. command3 executes, reading input from command2. Its standard output goes to the terminal (unless piped further).

Standard Error (stderr): It's important to note that pipes typically only redirect standard output. Error messages, which are usually sent to standard error (stderr), will still appear on your terminal by default and are not passed down the pipeline. You can redirect stderr using shell redirection like 2>&1 if needed (e.g., command1 2>&1 | command2 sends both stdout and stderr of command1 to command2).
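
For example, to search both the normal output and the error messages of a hypothetical command for the word "error":

    some_command 2>&1 | grep -i 'error'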

Why Use Pipes?

  • Modularity: Each command does one thing well (grep searches, sort sorts, awk processes fields, wc counts). Pipes let you combine these specialists.
  • Efficiency: Data often flows through the pipeline without needing to be written to temporary files, saving disk I/O and time. Processing can happen concurrently to some extent.
  • Flexibility: You can easily swap out tools or add new stages to the pipeline to modify the workflow.
  • Readability (often): A well-constructed pipeline can clearly express a sequence of data transformations.

Example Pipeline Scenarios

Let's revisit some examples using pipelines:

  1. Find the most frequent error messages in a log file:

    • Goal: Extract lines containing "ERROR", isolate the message part, count unique messages, and show the top 5 most frequent.
    • Assume: Log format like [Timestamp] LEVEL: Message text
    • Pipeline:
      grep 'ERROR' system.log | awk -F': ' '{print $2}' | sort | uniq -c | sort -nr | head -n 5
      
    • Breakdown:
      • grep 'ERROR' system.log: Select only lines containing "ERROR".
      • awk -F': ' '{print $2}': Split lines by ": " and print the second field (the message text).
      • sort: Sort the messages alphabetically (needed for uniq).
      • uniq -c: Count adjacent identical messages (now grouped by sort).
      • sort -nr: Sort the counts numerically (n) and in reverse (r) to get highest counts first.
      • head -n 5: Display only the top 5 lines (most frequent errors).
  2. List active users from /etc/passwd who use /bin/bash:

    • Goal: Find lines in /etc/passwd ending with /bin/bash and extract the username (first field).
    • Pipeline:
      grep '/bin/bash$' /etc/passwd | awk -F: '{print $1}' | sort
      
    • Breakdown:
      • grep '/bin/bash$' /etc/passwd: Find lines ending ($) with /bin/bash.
      • awk -F: '{print $1}': Set field separator to colon (:) and print the first field (username).
      • sort: Sort the usernames alphabetically (optional, but good practice for lists).
  3. Calculate the total size of .txt files in the current directory:

    • Goal: List files, filter for .txt, extract size, sum the sizes.
    • Pipeline (using ls - potentially fragile, see note below):
      ls -l *.txt | grep '^-' | awk '{total += $5} END {print total}'
      
    • Breakdown:
      • ls -l *.txt: Get long listing of .txt files.
      • grep '^-': Filter for regular files (lines starting with -).
      • awk '{total += $5} END {print total}': Sum the 5th field (size) and print the total.
    • Note on parsing ls: Parsing the output of ls is generally discouraged in scripts because its format can vary and it handles filenames containing spaces poorly. Better alternatives usually involve find or shell globbing with loops. For interactive use or simple cases, however, it's common. A more robust approach using find and awk:
      find . -maxdepth 1 -name '*.txt' -printf '%s\n' | awk '{total += $1} END {print total}'
      # Or using find -exec wc -c ... | awk ...
      

Mastering the art of combining these tools with pipes is arguably the most crucial skill for effective command-line text processing in Linux. It allows you to solve complex problems by assembling simple, well-understood components.

Workshop Combining Tools with Pipes

Objective: To build and understand multi-stage pipelines using grep, awk, sort, uniq, wc, head, tail, and tr.

Scenario: We will analyze the access.log file more deeply and process a list of words.

Setup:

  1. Recreate access.log:

    cat << EOF > access.log
    192.168.1.10 - - [26/Oct/2023:10:15:01 +0000] "GET /index.html HTTP/1.1" 200 512
    10.0.0.5 - - [26/Oct/2023:10:15:05 +0000] "GET /images/logo.png HTTP/1.1" 200 2048
    192.168.1.10 - - [26/Oct/2023:10:16:10 +0000] "POST /login HTTP/1.1" 302 128
    172.16.5.20 - - [26/Oct/2023:10:17:00 +0000] "GET /styles.css HTTP/1.1" 200 1024
    10.0.0.5 - - [26/Oct/2023:10:17:30 +0000] "GET /favicon.ico HTTP/1.1" 404 50
    192.168.1.10 - - [26/Oct/2023:10:18:00 +0000] "GET /dashboard HTTP/1.1" 200 4096
    10.0.0.5 - - [26/Oct/2023:10:18:05 +0000] "GET /api/users HTTP/1.1" 500 100
    192.168.1.10 - - [26/Oct/2023:10:19:00 +0000] "GET /index.html HTTP/1.1" 200 512
    EOF
    
    (Added one duplicate request for /index.html)

  2. Create a word list file: Create words.txt.

    cat << EOF > words.txt
    Linux
    Command
    Line
    Is
    Powerful
    And
    Flexible
    LINUX
    Commands
    Are
    Great
    Tools
    flexible
    EOF
    

Steps:

  1. Most Requested Pages: Find the top 3 most frequently requested pages (field 7) from access.log.

    • Command:
      awk '{print $7}' access.log | sort | uniq -c | sort -nr | head -n 3
      
    • Observe: The output should show counts and paths, with /index.html likely having a count of 2.
    • Breakdown:
      • awk '{print $7}': Extract the request path.
      • sort: Sort paths alphabetically for uniq.
      • uniq -c: Count identical adjacent paths.
      • sort -nr: Sort by count (numeric, reverse).
      • head -n 3: Get the top 3 lines.
  2. Total Bytes Transferred per IP: Calculate the sum of bytes (field 10) for each unique IP address (field 1).

    • Command:
      awk '{ ip_bytes[$1] += $10 } END { for (ip in ip_bytes) print ip, ip_bytes[ip] }' access.log | sort -k 2nr
      
    • Observe: The output shows each unique IP and the total bytes transferred by that IP, sorted by bytes (most first).
    • Breakdown:
      • awk '{ ip_bytes[$1] += $10 } ... ': Uses an awk associative array ip_bytes. For each line, it adds the bytes ($10) to the element indexed by the IP address ($1).
      • ... END { for (ip in ip_bytes) print ip, ip_bytes[ip] }: After processing all lines, loop through the array indices (IPs) and print the IP and its accumulated byte count.
      • sort -k 2nr: Sort the output based on the second field (bytes), numerically (n) and reversed (r).
  3. Unique Words (Case-Insensitive): Find all unique words in words.txt, ignoring case, and print them in lowercase.

    • Command:
      cat words.txt | tr '[:upper:]' '[:lower:]' | tr -s '[:space:]' '\n' | grep -v '^$' | sort -u
      
    • Observe: A sorted list of unique lowercase words: and, are, command, commands, flexible, great, is, line, linux, powerful, tools.
    • Breakdown:
      • cat words.txt: Output the file content.
      • tr '[:upper:]' '[:lower:]': Convert everything to lowercase.
      • tr -s '[:space:]' '\n': Convert all whitespace sequences (including original newlines) to single newlines (effectively putting each word on its own line).
      • grep -v '^$': Remove any blank lines that might have been created.
      • sort -u: Sort the words and keep only unique occurrences.
  4. Count 404 Errors: Count how many requests resulted in a 404 error in access.log.

    • Command:
      grep ' 404 ' access.log | wc -l
      
    • Observe: Should output 1. (Note the spaces around 404 to avoid matching it in other fields like bytes).
    • Alternative using awk:
      awk '$9 == 404 { count++ } END { print count }' access.log
      
    • Observe: Also outputs 1. This is often more precise than grep for field-based data.
  5. IP Addresses Making POST Requests: List the unique IP addresses that made POST requests.

    • Command:
      grep '"POST ' access.log | awk '{print $1}' | sort -u
      
    • Observe: Should output 192.168.1.10.
    • Breakdown:
      • grep '"POST ': Find lines containing the literal string "POST (part of the request field).
      • awk '{print $1}': Extract the IP address (field 1).
      • sort -u: Get the unique IP addresses.
  6. Extract Timestamps for Specific Page: Get just the timestamps (field 4, removing brackets) for requests to /index.html.

    • Command:
      grep '"GET /index.html ' access.log | awk '{print $4}' | tr -d '[]'
      
    • Observe: Outputs the two timestamps associated with /index.html requests, without the square brackets.
    • Breakdown:
      • grep '"GET /index.html ': Select lines for GET requests to /index.html.
      • awk '{print $4}': Extract the timestamp field (e.g., [26/Oct/2023:10:15:01).
      • tr -d '[]': Delete the square bracket characters.

Cleanup (Optional):

rm access.log words.txt

This final workshop emphasized how pipelines allow you to construct sophisticated queries and transformations by chaining together the tools learned throughout this chapter. Practice building your own pipelines to solve different text processing challenges.

Conclusion

Throughout this exploration of text processing and searching in Linux, we have journeyed from simple file viewing with cat and less to the powerful pattern matching of grep, the line-oriented transformations of sed, and the field-based processing prowess of awk. We've learned how to organize data with sort, manage duplicates with uniq, obtain summaries with wc, compare files with diff and comm, and perform character manipulations with tr.

Perhaps most importantly, we've seen how the pipe operator (|) allows these individual tools to be chained together, creating elegant and efficient solutions to complex text manipulation problems. This modular, pipeline-driven approach is a cornerstone of the Linux/Unix philosophy.

Mastering these command-line utilities offers significant advantages:

  • Automation: Easily script repetitive text-processing tasks.
  • Efficiency: Perform operations quickly, especially on large datasets or remote systems.
  • Flexibility: Combine tools in novel ways to address unique challenges.
  • Universality: These skills are transferable across various Unix-like environments.

The key to proficiency is practice. Experiment with these commands on different types of text files – logs, configuration files, code, CSV data, or even plain text documents. Try variations of the options, explore different regular expression patterns, and build increasingly complex pipelines. Consult the man pages (man grep, man sed, etc.) for comprehensive details on each command's capabilities.

While graphical tools have their place, the command line remains an indispensable environment for anyone serious about working efficiently with text data in Linux. The tools and techniques covered here form a solid foundation for tackling a wide array of system administration, data analysis, and development tasks. Keep experimenting, keep learning, and you will unlock the full text-processing power of your Linux system.