Author Nejat Hakan
eMail nejat.hakan@outlook.de
PayPal Me https://paypal.me/nejathakan


Advanced Shell Scripting and Automation

Introduction

Welcome back to your journey into the world of the shell! In previous explorations, you likely learned the fundamentals: navigating directories, executing commands, using pipes and redirection, and perhaps writing simple scripts with variables, loops (for, while), and conditional statements (if, case). Now, it's time to elevate those skills.

This chapter delves into advanced shell scripting techniques. But what does "advanced" really mean here? It's not about arcane commands known only to wizards. Instead, it's about writing scripts that are:

  1. More Robust: They handle errors gracefully and anticipate potential problems.
  2. More Efficient: They perform tasks faster or with fewer resources.
  3. More Flexible: They can adapt to different inputs and situations.
  4. More Reusable: They contain components (like functions) that can be used in multiple scripts.
  5. Capable of Automation: They can perform complex sequences of tasks without manual intervention, often on a schedule.

Mastering these techniques transforms the shell from a simple command executor into a powerful automation platform. Whether you need to manage system configurations, process large amounts of data, automate backups, or orchestrate complex workflows, advanced shell scripting is an invaluable skill. We'll cover functions, sophisticated error handling, advanced input/output techniques, regular expressions, argument parsing, and the foundational concepts of automation.

Let's begin building more powerful and professional shell scripts!

1. Functions: Building Reusable Code Blocks

Imagine you have a piece of code that you need to use multiple times within the same script, or perhaps across different scripts. Copying and pasting is inefficient and error-prone. If you need to change that code later, you have to find and update every single copy! This is where functions come in.

A function is a named block of code that performs a specific task. You define it once and can then "call" it (execute it) by its name whenever you need it.

Defining a Function

There are two common syntaxes for defining functions in Bash (and compatible shells):

# Syntax 1 (Recommended - POSIX standard)
function_name() {
    # Commands go here
    # ...
    # Optional: return a status code
    return 0 # 0 typically means success
}

# Syntax 2 (Bash-specific keyword)
function function_name {
    # Commands go here
    # ...
    # Optional: return a status code
    return 1 # Non-zero typically means failure
}
  • function_name: Choose a descriptive name for your function. Use letters, numbers, and underscores. Avoid special characters.
  • (): Parentheses are required after the function name in the first syntax. They signify that this is a function definition.
  • {}: Curly braces enclose the block of code that belongs to the function. There must be whitespace (a space or newline) after the opening brace {, and the last command must end with a semicolon or newline before the closing brace } (for example, { echo hi; }).
  • Commands: Any valid shell commands can be placed inside the function. These commands are executed sequentially when the function is called.
  • return <value>: (Optional) This command immediately stops the execution of the function and sets its exit status. The <value> should be an integer between 0 and 255. By convention:
    • 0 indicates that the function completed successfully.
    • Any non-zero value (1, 2, ..., 255) indicates that some kind of error or failure occurred within the function. If you don't use return, the function's exit status will be the exit status of the last command executed inside it.

Calling a Function

To execute the code inside a function (i.e., to "call" the function), you simply type its name on a line by itself, just like you would run any other command:

function_name

The shell will find the function definition and execute the commands within its curly braces.
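
As a minimal sketch (the function name print_separator and its message are arbitrary choices), a definition followed by two calls looks like this:

#!/usr/bin/env bash

# Define the function first...
print_separator() {
    echo "----------------------------------------"
}

# ...then call it like any other command, as often as needed.
print_separator
echo "Section one of the report"
print_separator

Note that a function must be defined before the first line that calls it, which is why function definitions usually appear near the top of a script.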

Passing Arguments (Parameters) to Functions

Functions become much more flexible when they can accept input data. These input values are called arguments or parameters. When you call a function, you can list the arguments you want to pass to it right after the function name, separated by spaces.

Inside the function's code block, you can access these arguments using special variables called positional parameters:

  • $1: Represents the first argument passed to the function.
  • $2: Represents the second argument passed to the function.
  • $3: Represents the third argument, and so on ($4, $5, ... $9, ${10}, ${11}, etc. Use curly braces for numbers greater than 9).
  • $0: Inside a function, $0 usually still refers to the name of the main script itself, not the function name (this can be a bit confusing, but it's how it works).
  • $#: Represents the total number of arguments passed to the function. This is very useful for checking if the correct number of arguments was provided.
  • $*: Represents all the arguments passed to the function as a single string (when double-quoted as "$*"). If the arguments were "hello world" and "goodbye", "$*" would expand to "hello world goodbye". The arguments are joined together using the first character of the special shell variable IFS (Internal Field Separator), which is typically a space.
  • $@: Represents all the arguments passed to the function as separate strings (when double-quoted as "$@"). If the arguments were "hello world" and "goodbye", "$@" would expand to "hello world" "goodbye" (two separate entities). This is generally the preferred way to handle multiple arguments, especially when you need to pass them on to another command inside the function, as it preserves arguments that contain spaces. A short demonstration follows this list.
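
The practical difference between "$@" and "$*" is easiest to see in a loop. The following minimal sketch (the function name show_args and its arguments are purely illustrative) passes an argument that contains a space:

#!/usr/bin/env bash

show_args() {
    echo "Looping over \"\$@\" (each argument stays intact):"
    for arg in "$@"; do
        echo "  [$arg]"
    done

    echo "Looping over \"\$*\" (all arguments merge into one string):"
    for arg in "$*"; do
        echo "  [$arg]"
    done
}

show_args "hello world" "goodbye"
# "$@" yields two items: [hello world] and [goodbye]
# "$*" yields one item:  [hello world goodbye]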

Example:

Let's create a function that takes a name and an age and prints a message.

#!/usr/bin/env bash

# Define a function to describe a person
describe_person() {
    # Check if exactly two arguments were provided
    if [ $# -ne 2 ]; then
        # Print an error message to standard error (stderr)
        echo "Error: describe_person requires exactly 2 arguments (name and age)." >&2
        echo "Usage: describe_person <name> <age>" >&2
        return 1 # Indicate failure (incorrect usage)
    fi

    # Assign arguments to meaningful local variables
    # Using 'local' prevents these variables from interfering with
    # variables outside the function.
    local person_name="$1"
    local person_age="$2"

    # Print the description
    echo "$person_name is $person_age years old."

    # Optional: Check if age is numeric (basic check)
    if ! [[ "$person_age" =~ ^[0-9]+$ ]]; then
         echo "Warning: Age '$person_age' does not look like a number." >&2
         # We could return a different non-zero status here if needed
    fi

    return 0 # Indicate success
}

# --- Script Main Body ---

echo "Calling function with correct arguments:"
describe_person "Alice" 30
# Save the exit status of the function call immediately; $? changes with every command.
status=$?
if [ $status -eq 0 ]; then
    echo "Function call succeeded."
else
    echo "Function call failed with status $status."
fi

echo # Blank line

echo "Calling function with incorrect number of arguments:"
describe_person "Bob"
status=$?
if [ $status -eq 0 ]; then
    echo "Function call succeeded."
else
    echo "Function call failed with status $status." # Should be 1 here
fi

echo # Blank line

echo "Calling function with potentially non-numeric age:"
describe_person "Charlie" "twenty-five"
# This call will succeed (return 0) but print a warning.

Variable Scope: local vs. Global

Understanding where variables "live" (their scope) is crucial for writing bug-free functions.

  • Global Scope: By default, any variable you define in the main part of your script is global. This means it can be accessed and modified from anywhere within the script, including inside any function. Furthermore, if you define a variable inside a function without using the local keyword, it also becomes a global variable! This can lead to unexpected behavior, where one function accidentally changes a variable that another part of the script (or another function) relies on.

  • Local Scope: To create a variable that exists only within the function where it is defined, you must declare it using the local keyword. This variable will be invisible outside the function, and it won't clash with any global variable that happens to have the same name.

Example:

#!/usr/bin/env bash

# Global variable
my_var="I am global"

modifier_function() {
    # This variable is local to modifier_function
    local my_var="I am local inside modifier_function"
    echo "Inside modifier_function: my_var = '$my_var'"

    # Modifying a global variable (implicitly)
    # This is generally bad practice unless intended!
    another_var="Set inside modifier_function"
}

reader_function() {
    # This variable is local to reader_function
    local another_var="I am local inside reader_function"
    echo "Inside reader_function: my_var = '$my_var'" # Accesses the global 'my_var'
    echo "Inside reader_function: another_var = '$another_var'" # Accesses its own local 'another_var'
}

# --- Script Main Body ---
echo "--- Before calling functions ---"
echo "Global scope: my_var = '$my_var'"
# echo "Global scope: another_var = '$another_var'" # This would cause an error if 'set -u' is active

echo # Blank line
echo "--- Calling modifier_function ---"
modifier_function
echo "Global scope: modifier_function set another_var = '$another_var'" # Now accessible globally

echo # Blank line
echo "--- Calling reader_function ---"
reader_function

echo # Blank line
echo "--- After calling functions ---"
echo "Global scope: my_var = '$my_var'" # Unchanged by modifier_function's local variable
echo "Global scope: another_var = '$another_var'" # Changed by modifier_function

Best Practice: Always use the local keyword for variables defined inside your functions unless you have a very specific, deliberate reason to modify a global variable. This makes your functions self-contained, easier to understand, and less prone to causing side effects.

2. Robust Error Handling

Simple scripts often just stop or produce strange output when something goes wrong (like a file not found, a command failing, or incorrect input). Professional scripts, however, need to anticipate and handle errors gracefully. This might involve:

  • Detecting that an error occurred.
  • Reporting the error clearly to the user (or a log file).
  • Deciding whether the script can continue or if it needs to stop.
  • Cleaning up any temporary resources (like files or directories) before exiting.

Exit Status Codes ($?)

As mentioned briefly with functions, every command you run in the shell finishes with an exit status (also called a return code or exit code). This is a number between 0 and 255 that indicates whether the command succeeded or failed.

  • 0: Success. The command completed its task without any problems.
  • 1-255: Failure. The command encountered an error. The specific non-zero number can sometimes indicate the type of error (e.g., 1 often means a general error, 2 might mean misuse of a command, 127 often means "command not found"), but these conventions are not strictly enforced across all commands. The key takeaway is: zero means success, non-zero means failure.

The shell automatically stores the exit status of the most recently executed command (or function, or pipeline) in a special variable: $?.

You can check this variable immediately after running a command to see if it worked.

# Example: Check if a directory exists
ls /etc/passwd # This file exists, ls should succeed
echo "Exit status of first ls: $?" # Output should be 0

ls /non/existent/directory/or/file # This path likely doesn't exist, ls should fail
echo "Exit status of second ls: $?" # Output should be non-zero (e.g., 1 or 2)

# Using the exit status in an 'if' statement
if cp important_data.txt /backup/location/; then
    # This block runs ONLY if 'cp' exits with status 0
    echo "Backup of important_data.txt completed successfully."
else
    # This block runs if 'cp' exits with a non-zero status
    cp_status=$? # Save the status immediately! $? changes with every command.
    echo "ERROR: Failed to copy important_data.txt!" >&2 # Send error to stderr
    echo "The 'cp' command exited with status: $cp_status" >&2
    # It's often good practice to exit the script if a critical step fails
    exit 1 # Exit the entire script with a failure status
fi

# If we reach here, the copy was successful (or the script exited)
echo "Script continues..."

Important: The value of $? is updated after every single command. If you need to use the exit status of a specific command later, save it to another variable immediately, as shown with cp_status=$?.

Also, notice >&2. This redirects the error messages to standard error (file descriptor 2). Standard output (stdout, file descriptor 1) is for the normal, expected output of a program, while standard error (stderr) is the conventional channel for error messages and diagnostics. Redirecting errors this way keeps them separate from regular output, which is useful if you're redirecting stdout to a file but still want to see errors on the terminal.

The set Command for Automatic Error Handling

Checking $? after every single command can make your scripts very long and repetitive. Bash provides some options, enabled using the set built-in command, that can make your scripts automatically more sensitive to errors.

  • set -e (or set -o errexit): This is a very powerful option. When set -e is active, the script will exit immediately if any simple command exits with a non-zero status. This prevents the script from blindly continuing after a critical failure.

    • What counts? The exit is triggered by a failing command unless it runs in an exempt context: as the condition tested by if or while, as part of a && or || list (other than the final command), inverted with !, or as any command in a pipeline except the last one (unless pipefail is also set).
    • Caveats: Because of these exceptions, set -e isn't a magic bullet. Sometimes a command failure is expected and handled (e.g., grep not finding a pattern might exit with 1, which could be okay). You can temporarily bypass set -e for a specific command by appending || true (e.g., command_that_might_fail || true), but use this judiciously. Despite its quirks, set -e catches many common errors and significantly improves robustness.
  • set -u (or set -o nounset): This option causes the script to treat attempts to expand (use the value of) an unset variable as an error, and the script will exit. Normally, using an unset variable just results in an empty string, which can hide typos or logic errors. set -u makes these errors obvious.

    • Example: If you typed echo "$massage" instead of echo "$message", set -u would cause the script to exit with an error instead of just printing a blank line.
    • If you need to check if a variable is set without triggering set -u, you can use parameter expansion checks like if [ -z "${my_var-}" ]; then ... (The hyphen - after the variable name tells bash to return an empty string if unset, rather than causing an error).
  • set -o pipefail: By default, the exit status of a pipeline (commands connected by |) is the exit status of the last command only. This means if an earlier command in the pipeline fails, but the last one succeeds, the pipeline as a whole is considered successful (exit status 0). set -o pipefail changes this behavior: the exit status of the pipeline becomes the exit status of the rightmost command in the pipeline that failed (returned non-zero), or zero if all commands in the pipeline succeeded. This is essential when you need to ensure that every step in a data processing pipeline worked correctly.

Recommended Practice: Start most of your scripts with these settings:

#!/usr/bin/env bash

# Exit on error, treat unset variables as errors, pipelines fail on first error
set -euo pipefail
# You can also write them on separate lines:
# set -e
# set -u
# set -o pipefail

# --- Rest of your script ---
echo "Hello, $USER" # This is fine

# mkdir /non_existent_parent/my_dir # With set -e, this failure would stop the script
# echo "$undefined_variable" # With set -u, this would stop the script

# Example of pipefail:
# Assume 'command1' fails (exit 1) and 'command2' succeeds (exit 0)
# command1 | command2
# Without pipefail: $? would be 0 (from command2)
# With pipefail: $? would be 1 (from command1)

These three settings form a great baseline for writing robust scripts.
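
One pattern worth internalizing with set -e is handling commands whose non-zero exit is expected rather than an error. The following sketch assumes a hypothetical log file named application.log; grep exiting with status 1 simply means "no match", which we do not want to kill the script:

#!/usr/bin/env bash
set -euo pipefail

logfile="application.log" # hypothetical example file

# Testing the command inside 'if' exempts it from 'set -e'.
if grep -q "FATAL" "$logfile"; then
    echo "Fatal errors found in $logfile" >&2
else
    echo "No fatal errors in $logfile"
fi

# Alternatively, tolerate the expected failure explicitly with '|| true'.
error_count=$(grep -c "ERROR" "$logfile" || true)
echo "Error lines counted: $error_count"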

The trap Command: Cleaning Up Gracefully

Sometimes, your script might create temporary files, lock files, establish network connections, or start background processes. What happens if the script exits unexpectedly – maybe because set -e triggered, or the user pressed Ctrl+C? Those temporary resources might be left behind in an inconsistent state.

The trap command allows you to specify one or more commands that should be executed automatically when your script receives certain signals from the operating system or encounters specific conditions.

Signals: Signals are a standard Unix mechanism for notifying processes about events. Some common signals you might want to trap are:

  • EXIT: This is a special pseudo-signal specific to Bash's trap. The command associated with EXIT is executed just before the script terminates, regardless of whether it's exiting normally (reaching the end or calling exit) or abnormally (due to set -e, an uncaught signal, etc.). This is ideal for cleanup tasks.
  • INT: Short for "Interrupt". Sent when the user presses Ctrl+C. Trapping this allows you to perform cleanup before exiting when the user interrupts the script.
  • TERM: Short for "Terminate". Sent by the system or other processes (like the kill command) to request termination. This is a more graceful shutdown request than KILL. Trapping TERM allows your script to shut down cleanly.
  • QUIT: Sent when the user presses Ctrl+\. Similar to INT, but often causes a core dump as well.
  • ERR: Another Bash pseudo-signal. The command associated with ERR is executed whenever a simple command fails (exits with a non-zero status), which allows for custom error handling or logging. Without set -e the script keeps running after the trap; with set -e the ERR trap runs first and then the script exits as usual.

Syntax:

trap 'command_or_function_to_run' SIGNAL1 SIGNAL2 ...
  • 'command_or_function_to_run': A string containing the command(s) to execute when any of the specified signals are received. This is often a call to a dedicated cleanup function. Using single quotes is important to prevent premature expansion of variables or commands within the trap string.
  • SIGNAL1 SIGNAL2 ...: A list of signal names (like EXIT, INT, TERM, ERR) or signal numbers.

Example:

#!/usr/bin/env bash
set -u # Use unset variable check

# Create a temporary directory safely using mktemp
# mktemp -d creates a unique temporary directory
TEMP_DIR=$(mktemp -d health_check_temp_XXXXXX)
echo "Created temporary directory: $TEMP_DIR"

# Flag to indicate if work was completed
WORK_COMPLETED=false

# Define a cleanup function
cleanup() {
    local exit_status=${1:-$?} # Get exit status passed to trap, or current $?
    echo # Blank line for clarity
    echo "--- Running cleanup ---"
    if [ -d "$TEMP_DIR" ]; then
        echo "Removing temporary directory: $TEMP_DIR"
        # Use 'rm -rf' carefully! Only on directories you know you created.
        rm -rf "$TEMP_DIR"
        echo "Temporary directory removed."
    else
        echo "Temporary directory '$TEMP_DIR' not found or already removed."
    fi

    if [ "$WORK_COMPLETED" = true ]; then
        echo "Script work was marked as completed."
    else
        echo "Script did not complete its main work."
    fi
    echo "Exiting with status: $exit_status"
    echo "--- Cleanup finished ---"
    # The trap handler should generally exit with the original status
    # However, the EXIT trap runs just *before* the final exit,
    # so we don't explicitly exit *from* the EXIT trap itself.
    # For INT/TERM traps, you might want 'exit $exit_status' at the end.
}

# Set the trap: Call the 'cleanup' function when the script receives
# EXIT, INT (Ctrl+C), or TERM signals.
# Each trap passes an exit status to 'cleanup' as its first argument ($1):
# the EXIT trap forwards the script's own status ($?), while the INT and
# TERM traps pass the conventional codes 130 and 143 explicitly.
trap 'cleanup $?' EXIT
trap 'cleanup 130; exit 130' INT # 130 is convention for exit after SIGINT
trap 'cleanup 143; exit 143' TERM # 143 is convention for exit after SIGTERM

# --- Main script logic ---
echo "Doing some simulated work..."
echo "Creating a temporary file..."
touch "$TEMP_DIR/workfile.txt"
echo "Current date: $(date)" > "$TEMP_DIR/workfile.txt"

sleep 10 # Simulate a long-running task

# If the script reaches this point, mark work as completed
WORK_COMPLETED=true
echo "Simulated work finished successfully."

# The EXIT trap will automatically run 'cleanup' when the script ends here.
# If you press Ctrl+C during the 'sleep', the INT trap runs 'cleanup'.
# If 'set -e' was active and a command failed, the EXIT trap would run 'cleanup'.

Using trap ensures that your script cleans up after itself, making it more reliable and preventing resource leaks, even when things go wrong. The EXIT trap is particularly useful for general-purpose cleanup.
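
The ERR pseudo-signal mentioned above can also be used on its own for lightweight error reporting without stopping the script. A minimal sketch (note that set -e is deliberately not used here, so execution continues after the failure):

#!/usr/bin/env bash

# Report every failing command, with its line number and exit status.
# LINENO and BASH_COMMAND are maintained by Bash itself.
trap 'echo "WARNING: command failed (line $LINENO, status $?): $BASH_COMMAND" >&2' ERR

echo "Step 1: this succeeds"
ls /definitely/does/not/exist   # fails; the ERR trap fires, the script carries on
echo "Step 2: still running"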

3. Advanced Input/Output Redirection and Pipelines

You're likely familiar with basic I/O redirection:

  • command > file: Redirect standard output (stdout) to file, overwriting it.
  • command >> file: Redirect stdout to file, appending to it.
  • command < file: Use file as the standard input (stdin) for command.
  • command1 | command2: Pipe the stdout of command1 into the stdin of command2.

Now let's explore more sophisticated ways to manipulate input and output streams.

Redirecting Standard Error (stderr)

Commands typically produce output on two streams:

  1. Standard Output (stdout): File descriptor 1. Used for the main, expected results of the command.
  2. Standard Error (stderr): File descriptor 2. Used for error messages, warnings, and diagnostic information.

By default, both stdout and stderr are displayed on your terminal. Redirection allows you to handle them separately or together.

  • command > output.log: Redirects only stdout (FD 1) to output.log. Errors (stderr, FD 2) still appear on the terminal.
  • command 2> error.log: Redirects only stderr (FD 2) to error.log. Normal output (stdout, FD 1) still appears on the terminal.
  • command > output.log 2> error.log: Redirects stdout to output.log AND stderr to error.log.
  • command > combined.log 2>&1: Redirects stdout (FD 1) to combined.log, and then redirects stderr (FD 2) to the same place that FD 1 is currently pointing to (which is combined.log). This is the standard way to capture all output (both normal and errors) into a single file. The order matters! 2>&1 must come after the initial redirection of stdout.
  • command &> combined.log: A shorter (Bash-specific) way to achieve the same result as command > combined.log 2>&1. Redirects both stdout and stderr to combined.log.

Why separate stderr? It allows you to log errors separately from normal output, making it easier to monitor for problems. It also prevents error messages from polluting the data stream if you're piping the command's stdout to another command.

Example:

# Run a command, saving normal output and errors separately
find /etc -name "*.conf" > found_configs.txt 2> find_errors.log

# Run a command, saving all output to one log file for later review
./my_complex_script.sh > script_run.log 2>&1
# Or using the shorter syntax:
# ./my_complex_script.sh &> script_run.log

# Run a command, process its normal output, but ignore errors
ls /etc /nonexistent 2>/dev/null | sort
# /dev/null is a special file that discards anything written to it.
# This command sorts the list of files in /etc, but the error message
# about /nonexistent not being found is discarded.

Here Documents (<<)

Sometimes you need to provide multiple lines of input text to a command directly within your script, without creating a separate temporary file. A here document allows you to do exactly this.

Syntax:

command << DELIMITER
This is the first line of input.
Variables like $HOME and $(pwd) are expanded.
Commands like `date` are also executed if backticks are used (less common now).
This is the last line before the delimiter.
DELIMITER
  • command: The command that will receive the text as its standard input.
  • << DELIMITER: Tells the shell to read the following lines as input until it encounters a line containing exactly DELIMITER.
  • DELIMITER: A unique word used to mark the end of the input block. EOF (End Of File) is a common convention, but you can use any string that doesn't appear in the input text itself.
  • Crucially: The closing DELIMITER must be on a line by itself, with no leading or trailing whitespace.
  • Expansion: By default, shell variables ($VAR), command substitutions ($(...)), and arithmetic expansions ($((...))) are processed within the here document text before it's passed to the command.
  • No Expansion: If you want to prevent any expansion and pass the text literally, quote the delimiter: << 'DELIMITER' or << "DELIMITER".

Example:

#!/usr/bin/env bash

# Create a simple HTML file using 'cat' and a here document
cat > index.html << EOF
<!DOCTYPE html>
<html>
<head>
    <title>My Simple Page</title>
</head>
<body>
    <h1>Welcome!</h1>
    <p>This page was generated on $(date) by user $USER.</p>
    <p>Your home directory is: $HOME</p>
</body>
</html>
EOF
# Note: The 'EOF' above has no leading spaces!

echo "Created index.html:"
cat index.html

echo # Blank line

# Example without expansion (using quoted delimiter)
echo "--- Here document with quoted delimiter ---"
cat << 'END_TEXT'
This will be printed literally.
The variable $USER will not be expanded.
The command $(date) will not be executed.
END_TEXT

Here documents are incredibly useful for embedding configuration file snippets, SQL commands, email bodies, or any multi-line text directly into your scripts.

Here Strings (<<<)

For providing just a single line or a short string (often from a variable) as standard input to a command, a here string is a more concise alternative to echo "string" | command.

Syntax:

command <<< "Some string data"

Or using a variable:

my_string="Error code: 404 - Not Found"
command <<< "$my_string" # Use quotes if string contains spaces or special chars

Example:

# Instead of: echo "123 abc 456" | awk '{print $2}'
# Use a here string:
awk '{print $2}' <<< "123 abc 456" # Output: abc

# Pass a variable's content to grep
error_log_line="[ERROR] Failed to connect to database."
grep "database" <<< "$error_log_line" # Output: [ERROR] Failed to connect to database.

# Perform arithmetic with bc (basic calculator)
result=$(bc <<< "10 * (5 + 2)")
echo "Calculation result: $result" # Output: Calculation result: 70

Here strings are convenient for simple stdin redirection without the overhead of echo and a pipe.
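
Another handy use is feeding a string to read so that it splits the words into separate variables. A small sketch (the contents of the record variable are purely illustrative):

#!/usr/bin/env bash

record="alice 1042 /home/alice"

# 'read' splits its input on $IFS (whitespace by default).
read -r username uid homedir <<< "$record"

echo "User: $username, UID: $uid, Home: $homedir"
# Output: User: alice, UID: 1042, Home: /home/alice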

Process Substitution (<() and >())

This is a more advanced and extremely powerful feature (available in Bash, Zsh, and Ksh, but not strictly POSIX sh). Process substitution allows you to treat the output of a command (or the input to a command) as if it were a file, without actually creating a named temporary file on disk.

The shell handles the magic behind the scenes, usually using named pipes (mkfifo) or file descriptors (/dev/fd/...).

  • <(command): Runs command asynchronously. Its standard output is connected to a special file descriptor (e.g., /dev/fd/63) or a named pipe. The <(...) construct then expands to the name of this file descriptor/pipe. You can use this name anywhere a command expects a filename as input, allowing you to feed the output of one command directly into another command that expects a file argument.

  • >(command): Runs command asynchronously. Its standard input is connected to a special file descriptor or named pipe. The >(...) construct expands to the name of this file descriptor/pipe. You can use this name anywhere a command expects a filename for output, allowing you to write output directly into the input stream of another command.

Use Cases for <() (Input):

Imagine a command like diff which normally compares two files. What if you want to compare the output of two commands? Process substitution makes this easy:

# Compare the list of files in /bin with the list of files in /usr/bin
# Without process substitution, you'd need temporary files:
# ls /bin > /tmp/bin_list.txt
# ls /usr/bin > /tmp/usr_bin_list.txt
# diff /tmp/bin_list.txt /tmp/usr_bin_list.txt
# rm /tmp/bin_list.txt /tmp/usr_bin_list.txt

# With process substitution:
diff <(ls /bin) <(ls /usr/bin)
# The shell runs 'ls /bin', connects its output to e.g. /dev/fd/63
# The shell runs 'ls /usr/bin', connects its output to e.g. /dev/fd/62
# The shell then executes: diff /dev/fd/63 /dev/fd/62
# No temporary files are manually created or cleaned up!

# Join files based on output of commands
join <(sort file1.txt) <(sort file2.txt)

Use Cases for >() (Output):

This is often used with commands like tee, which reads from stdin and writes to stdout and to one or more files. Process substitution lets tee write not just to files, but directly into the input of other commands.

# Log the output of 'make' to a file, and also filter it for errors
# and save errors to another file, without intermediate files.

# 'make' command output goes into 'tee'
# 'tee' writes one copy to 'build.log' (a regular file)
# 'tee' writes another copy to the process substitution '>(grep ...)'
# The 'grep' command receives the output as its stdin and writes matches to 'build_errors.log'
make | tee build.log >(grep -i 'error\|warning' > build_errors.log)

# Compare this to the non-process substitution way:
# make > /tmp/build_output.log
# cp /tmp/build_output.log build.log
# grep -i 'error\|warning' /tmp/build_output.log > build_errors.log
# rm /tmp/build_output.log

# Send output to multiple log processing pipelines simultaneously
# Note: The final >/dev/null might be needed because tee's own stdout
# (which is just a copy of its input here) might not be desired.
complex_data_generator | tee >(process_type_A > typeA.log) >(process_type_B > typeB.log) > /dev/null

Process substitution is a powerful way to connect commands in complex ways beyond simple pipelines, reducing the need for temporary files and making scripts cleaner and potentially more efficient.
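
One particularly common use of <() is feeding a while read loop in the current shell. Piping into the loop (command | while read ...) runs the loop body in a subshell, so any variables it sets are lost afterwards; redirecting from a process substitution avoids that. A short sketch (counting the files directly under /etc is just an example task):

#!/usr/bin/env bash

count=0

# 'done < <(...)' keeps the loop in the current shell, so the
# final value of 'count' survives after the loop finishes.
while IFS= read -r filename; do
    count=$((count + 1))
done < <(find /etc -maxdepth 1 -type f)

echo "Found $count regular files directly under /etc"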

4. Regular Expressions with grep, sed, and awk

Regular expressions (regex or regexp) are sequences of characters that define a search pattern. They are an incredibly powerful tool for matching and manipulating text based on patterns, rather than just fixed strings. Mastering regex is a skill in itself, but understanding the basics and how to use them with common shell tools is essential for advanced scripting.

Three indispensable command-line tools that heavily utilize regular expressions are:

  • grep (Global Regular Expression Print): Searches for lines containing text that matches a given pattern within files or standard input. It then prints the matching lines (by default).

  • sed (Stream Editor): Reads text line by line (from files or stdin), applies editing commands (often based on regex patterns), and prints the modified text to stdout. It's commonly used for search-and-replace operations.

  • awk: A versatile pattern-scanning and text-processing language. It reads input line by line, automatically splits each line into fields (columns, typically separated by whitespace), and allows you to perform actions based on patterns (including regex) or field values. It's excellent for extracting data, generating reports, and performing calculations on text data.

Basic Regex Concepts

While regex can get very complex, here are some fundamental building blocks:

  • Literal Characters: Most characters (like a, b, 1, _, -) match themselves literally.
  • Metacharacters: Special characters with meanings:
    • . (dot): Matches any single character (except newline).
    • *: Matches the preceding item zero or more times. E.g., a* matches "", a, aa, aaa.
    • +: Matches the preceding item one or more times (ERE/PERL). E.g., a+ matches a, aa, aaa, but not "".
    • ?: Matches the preceding item zero or one time (ERE/PERL). E.g., colou?r matches color and colour.
    • ^: Matches the beginning of the line. E.g., ^Error matches lines starting with "Error".
    • $: Matches the end of the line. E.g., \.log$ matches lines ending with ".log".
    • [] (Character Set): Matches any single character inside the brackets. E.g., [aeiou] matches any lowercase vowel. [0-9] matches any digit. [^0-9] matches any character that is not a digit.
    • | (Alternation): Acts like "OR" (ERE/PERL). E.g., error|warning matches lines containing "error" or "warning".
    • () (Grouping): Groups parts of the regex together. Used with repetition or alternation (ERE/PERL). E.g., (ab)+ matches ab, abab, ababab.
    • \ (Escape): Removes the special meaning of a metacharacter. E.g., \. matches a literal dot, \* matches a literal asterisk.
  • Quantifiers: Control how many times an item matches (*, +, ?, {n}, {n,}, {n,m}).
    • {n}: Matches the preceding item exactly n times. E.g., [0-9]{4} matches exactly four digits.
    • {n,}: Matches the preceding item n or more times. E.g., [a-z]{3,} matches three or more lowercase letters.
    • {n,m}: Matches the preceding item between n and m times (inclusive). E.g., [0-9]{2,4} matches two, three, or four digits.
  • ERE vs BRE: There are different "flavors" of regex. The two main ones in shell tools are Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE). ERE (enabled with grep -E, sed -E or sed -r on some systems, and default in awk) is generally more powerful and intuitive as characters like +, ?, |, () have their special meanings directly. In BRE (default for grep and sed), these characters need to be escaped (\+, \?, \|, \(\)) to have their special meaning, making patterns harder to read. It's often easier to use ERE.
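
To make the BRE/ERE difference concrete, here is the same optional-character pattern written both ways (the sample text is arbitrary; the escaped forms rely on GNU grep):

# BRE (default grep): ?, +, | and () only become special when escaped
echo "color colour" | grep 'colou\?r'

# ERE (grep -E): the same pattern reads more naturally
echo "color colour" | grep -E 'colou?r'

# Both commands match and print the line "color colour"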

Using grep

# Find lines containing the literal string "ERROR" in /var/log/syslog
grep "ERROR" /var/log/syslog

# Find lines containing "ERROR" (case-insensitive)
grep -i "error" /var/log/syslog

# Find lines containing "error" OR "warning" (case-insensitive, using ERE)
grep -E -i "error|warning" /var/log/syslog

# Find lines STARTING with a date like YYYY-MM-DD (using ERE)
grep -E "^[0-9]{4}-[0-9]{2}-[0-9]{2}" /var/log/messages

# Count the number of matching lines
grep -c "Failed password" /var/log/auth.log

# Print lines that DO NOT match the pattern
grep -v "DEBUG" application.log

# Print only the matching part of the line (GNU grep specific)
grep -o -E "[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+" access.log # Extract IP addresses

# Search recursively in a directory
grep -r "API_KEY" /etc/myapp/

Using sed

sed operates on a stream of text. Its most common use is the s (substitute) command.

# Syntax: sed 's/PATTERN/REPLACEMENT/FLAGS' input.txt

# Replace the FIRST occurrence of "apple" with "orange" on each line
sed 's/apple/orange/' input.txt

# Replace ALL occurrences of "apple" with "orange" on each line (g = global flag)
sed 's/apple/orange/g' input.txt

# Replace "red" or "blue" with "green" (using ERE and global flag)
sed -E 's/red|blue/green/g' input.txt

# Replace case-insensitively (GNU sed specific 'i' flag)
sed 's/error/ERROR/gi' input.txt

# Delete lines containing the word "temporary"
sed '/temporary/d' input.txt

# Print only lines containing "user=" (default is to print all lines)
sed -n '/user=/p' config.txt # -n suppresses default printing, p explicitly prints matches

# Insert text BEFORE lines matching a pattern
sed '/^\[SectionA\]/i # This is the configuration for Section A' config.ini

# Append text AFTER lines matching a pattern
sed '/^\[SectionB\]/a # End of Section B configuration' config.ini

# Edit a file directly, creating a backup first (use with caution!)
# Creates config.ini.bak before modifying config.ini
sed -i.bak 's/old_value/new_value/g' config.ini
# Edit in-place without backup (more dangerous!)
# sed -i 's/old_value/new_value/g' config.ini

Using awk

awk processes input field by field, making it great for columnar data.

# Default: Fields ($1, $2, $3, ...) are separated by whitespace. $0 is the whole line.

# Print the first field (column) of each line from /etc/passwd (separator is ':')
awk -F':' '{print $1}' /etc/passwd # -F specifies the field separator

# Print the username (field 1) and shell (field 7) from /etc/passwd
awk -F':' '{print "User:", $1, " Shell:", $7}' /etc/passwd

# Print lines where the third field (e.g., user ID) is greater than 999
awk -F':' '$3 > 999 {print $0}' /etc/passwd # {action} runs if pattern ($3 > 999) is true

# Calculate the sum of numbers in the second column of data.txt
# BEGIN block runs once before processing input.
# END block runs once after all lines are processed.
awk '{ sum += $2 } END { print "Total sum:", sum }' data.txt

# Print lines matching a regex AND where a field has a certain value
# Print lines containing "error" where the 5th field is "critical"
awk '$5 == "critical" && /error/ { print "Critical Error Found:", $0 }' logfile.log

# Change the output field separator
awk -F',' '{OFS=" | "; print $1, $3, $2}' input.csv # Print fields 1, 3, 2 separated by " | "

Learning grep, sed, and awk, along with basic regular expressions, dramatically increases your ability to manipulate and extract information from text, which is a core task in much shell scripting and automation.

5. Command Substitution: Capturing Command Output

Often in a script, you need to run a command and then use its output as:

  • The value assigned to a variable.
  • Part of a string.
  • An argument to another command.

This is achieved using command substitution.

Syntax (Modern and Recommended): $(...)

variable_name=$(command_to_execute)

The shell executes command_to_execute inside the parentheses. It captures the standard output of that command, removes any trailing newline characters, and substitutes the resulting string back into the command line or assignment.

Syntax (Legacy - Avoid if possible): `` (Backticks)

variable_name=`command_to_execute`

This older syntax does the same thing, but it has several disadvantages:

  • Readability: Backticks can be easily confused with single quotes (' ').
  • Nesting: Nesting command substitutions with backticks requires awkward backslash escaping and is much harder to read than the nested $(...) form, e.g. $(cmd1 $(cmd2)).

Always prefer the $(...) syntax.

Example:

#!/usr/bin/env bash
set -u # Ensure variables are set before use

# Get the current date and time in a specific format
current_timestamp=$(date +"%Y-%m-%d_%H-%M-%S")
echo "Script started at: $current_timestamp"

# Get the number of lines in a file
config_file="/etc/ssh/sshd_config"
if [ -f "$config_file" ]; then
    line_count=$(wc -l < "$config_file") # Use input redirection for efficiency
    # Trim leading/trailing whitespace that wc might add
    line_count=$(echo $line_count)
    echo "The file '$config_file' has $line_count lines."
else
    echo "Config file '$config_file' not found."
fi

# Get the current working directory
working_dir=$(pwd)
echo "Current directory is: $working_dir"

# Create a backup filename incorporating the timestamp and hostname
hostname=$(hostname -s) # Use -s for short hostname
backup_filename="/backups/${hostname}_backup_${current_timestamp}.tar.gz"
echo "Proposed backup filename: $backup_filename"

# Use command output directly within another command's arguments
# Find files modified in the last 2 days
echo "Files modified recently in /etc:"
# Note: find command arguments can be complex, use quotes carefully
find /etc -maxdepth 1 -type f -mtime -2 -exec ls -ld {} \;

# Using output as part of an echo command
echo "System Load Average: $(uptime | awk -F'load average: ' '{print $2}')"

Command substitution is fundamental for making scripts dynamic, allowing them to gather information from the system or other commands and act upon it.

6. Processing Script Arguments Professionally: getopts

While accessing arguments directly using $1, $2, $#, etc., works for very simple scripts, it quickly becomes cumbersome and non-standard when you want to implement features common in command-line tools:

  • Options (Flags): Arguments starting with a hyphen, like -v (verbose) or -h (help).
  • Options with Arguments: Options that require a value, like -f filename or -o output.log.
  • Optional Arguments: Some options or arguments might not always be required.
  • Order Independence: Users should ideally be able to provide options in any order (e.g., -v -f file vs -f file -v).
  • Error Handling: Detecting invalid options or missing arguments for options that require them.

Manually parsing the $@ array to handle all these cases is complex and error-prone. Thankfully, Bash provides a built-in command specifically for this purpose: getopts.

getopts parses options and their arguments from the script's positional parameters ($@) according to Unix conventions. It's used inside a while loop.

Basic Structure:

#!/usr/bin/env bash

# --- Default values for settings ---
verbose=false # Use true/false or 0/1
output_file=""
input_file=""

# --- Function to display help message ---
usage() {
    echo "Usage: $0 [-v] [-o <output_file>] [-f <input_file>] [remaining_args...]"
    echo "Options:"
    echo "  -v              Enable verbose output"
    echo "  -o <file>       Specify output file"
    echo "  -f <file>       Specify input file (required)"
    echo "  -h              Display this help message"
    exit 1
}

# --- Option parsing loop ---
# The 'optstring' defines the valid option letters.
# - A letter by itself (e.g., 'v', 'h') is a simple flag.
# - A letter followed by a colon (e.g., 'o:', 'f:') means that option requires an argument.
# - A leading colon in the optstring (e.g., ':vho:f:') enables "silent" error handling.
#   Instead of printing its own errors, getopts will:
#     - Set the 'opt' variable to '?' for an invalid option.
#     - Set the 'opt' variable to ':' for a missing option argument.
#     - Store the invalid option character or the option missing an argument in OPTARG.
while getopts ":vho:f:" opt; do
  case $opt in
    v)
      # Option -v was found
      verbose=true
      ;;
    h)
      # Option -h was found
      usage
      ;;
    o)
      # Option -o was found, its argument is in $OPTARG
      output_file="$OPTARG"
      ;;
    f)
      # Option -f was found, its argument is in $OPTARG
      input_file="$OPTARG"
      ;;
    \?)
      # Invalid option found (stored in $OPTARG)
      echo "Error: Invalid option: -$OPTARG" >&2
      usage
      ;;
    :)
      # Option requires an argument, but none was given (option char in $OPTARG)
      echo "Error: Option -$OPTARG requires an argument." >&2
      usage
      ;;
  esac
done

# --- Shift away processed options ---
# $OPTIND is the index of the next argument to be processed after getopts finishes.
# 'shift' removes arguments from the beginning of the positional parameters ($@).
# This command removes all the options and their arguments that getopts processed,
# leaving only the remaining non-option arguments in $@.
shift $((OPTIND-1))

# --- Validate required arguments and use the settings ---
# Example: Check if the required -f option was provided
if [ -z "$input_file" ]; then
    echo "Error: Input file must be specified with -f." >&2
    usage
fi

echo "--- Settings ---"
echo "Verbose: $verbose"
echo "Output File: '$output_file'"
echo "Input File: '$input_file'"
echo "Remaining Arguments: $@" # Display any arguments left after options

# --- Main script logic starts here ---
echo # Blank line
echo "Starting main script logic..."
# Use the variables $verbose, $output_file, $input_file, and $@ here

# Example of using verbose flag
if [ "$verbose" = true ]; then
    echo "Verbose mode enabled. Performing extra logging..."
fi

# Example of using output file
if [ -n "$output_file" ]; then
    echo "Output will be directed to $output_file"
    # exec > "$output_file" # Example: Redirect script's stdout
else
    echo "Output will be sent to standard output."
fi

echo "Processing input from $input_file..."
if [ ! -f "$input_file" ]; then
    echo "Error: Input file '$input_file' not found!" >&2
    exit 1
fi
# ... process the file ...

echo "Script finished."

Key elements explained:

  • getopts optstring varname: The core command.
    • optstring: Defines valid options and whether they take arguments (e.g., ":vho:f:"). The leading colon enables silent error handling.
    • varname (e.g., opt): The variable that getopts sets in each iteration of the loop to the option character found (e.g., v, h, o, f, or ?, : for errors).
  • while getopts ...; do ... done: The loop continues as long as getopts finds valid options in $@.
  • case $opt in ... esac: Used to handle the different option characters found by getopts.
  • $OPTARG: When an option requires an argument (like -o file), getopts stores that argument (file) in this variable.
  • $OPTIND: getopts maintains this variable, which holds the index of the next argument in $@ to be processed. It starts at 1.
  • shift $((OPTIND-1)): This crucial step removes all the options and their arguments (which getopts has already processed) from the list of positional parameters ($@). After the shift, $@ will only contain the remaining arguments that were not options (e.g., if the command was ./script.sh -v -f input.txt data1 data2, after the shift, $@ would contain "data1" "data2").

Using getopts is the standard, robust way to handle command-line options in shell scripts, making them behave like familiar Unix utilities.
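
As a quick illustration of how the example script above behaves (the filename ./process.sh is hypothetical), options can be supplied in any order, and getopts sorts them out:

# Both invocations produce the same settings.
./process.sh -v -f input.txt data1 data2
./process.sh -f input.txt -v data1 data2
# In both cases, after 'shift $((OPTIND-1))':
#   verbose=true, input_file=input.txt, and "$@" holds: data1 data2

# A missing option argument triggers the ':' case and prints the usage message:
./process.sh -f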

7. Automation: Scheduling Tasks with cron

Writing a useful script is great, but its power is truly unlocked when you can make it run automatically without your intervention. The classic and most common tool for scheduling tasks on Unix-like systems (Linux, macOS, BSD) is cron.

cron is a system daemon (a background service) that wakes up every minute, checks configuration files (called crontabs), and executes any commands scheduled to run at that specific minute.

The Crontab File

Each user on the system can have their own crontab file, specifying jobs to be run as that user. There is also typically a system-wide crontab (often /etc/crontab or files in /etc/cron.d/) for system administration tasks.

To edit your personal crontab, use the command:

crontab -e

This will open your crontab file in the default command-line text editor (like nano, vi, or vim). If you haven't used it before, it might be empty or contain comments explaining the format.

To simply view your current crontab without editing, use:

crontab -l

To remove your entire crontab, use (with caution!):

crontab -r

Crontab Format

Each line in a crontab file defines a single scheduled job (or is a comment starting with #). The format consists of 5 time/date fields followed by the command to execute:

# Use '#' for comments.
# Format: minute hour day_of_month month day_of_week command_to_run
# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12) OR jan,feb,mar,...
# │ │ │ │ ┌───────────── day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,...
# │ │ │ │ │
# * * * * * command_to_execute

Field Values:

  • *: Asterisk means "any value" or "every". For example, * in the minute field means "every minute".
  • number: A specific value (e.g., 30 in the minute field means "at 30 minutes past the hour").
  • value1,value2: A list of specific values (e.g., 0,15,30,45 in the minute field means run at 0, 15, 30, and 45 minutes past the hour).
  • start-end: A range of values (e.g., 9-17 in the hour field means from 9 AM to 5 PM inclusive).
  • */step: A step value. */15 in the minute field means "every 15 minutes" (equivalent to 0,15,30,45). 0 */2 * * * means "at minute 0 every 2 hours".

command_to_execute: This is the command or script you want to run. Crucially, always use absolute paths for your scripts and commands within crontabs, because cron runs with a very minimal environment and may not have your usual $PATH settings.

Example Crontab Entries:

# Comments are good practice! Explain what each job does.

# Run the health check script every hour at 15 minutes past the hour
# and append its output (stdout & stderr) to a log file.
15 * * * * /home/myuser/scripts/health_check.sh >> /home/myuser/logs/health_check.log 2>&1

# Perform a full backup using a backup script every Sunday at 2:30 AM.
# Discard normal output, but potentially log errors (often handled inside the script).
30 2 * * 0 /usr/local/bin/my_backup_script.sh --full > /dev/null

# Check for software updates every day at 4:00 AM (example for Debian/Ubuntu)
# Ensure PATH is set if needed, or use full paths like /usr/bin/apt-get
# 0 4 * * * /usr/bin/apt-get update && /usr/bin/apt-get -y upgrade > /var/log/apt/cron_upgrade.log 2>&1

# Run a custom data processing script every 10 minutes during work hours (Mon-Fri, 9am-5pm)
*/10 9-17 * * 1-5 /opt/data_scripts/process_incoming.sh --config /etc/data_scripts/config.ini

# Clean up temporary files older than 7 days every Monday at 1 AM
0 1 * * 1 /usr/bin/find /tmp -type f -mtime +7 -delete

Important Considerations for Cron Jobs:

  1. Absolute Paths: Always use full paths for scripts and commands (e.g., /home/user/scripts/myscript.sh, /usr/bin/python3). Don't rely on the $PATH environment variable, which is very limited in cron.
  2. Environment: cron runs jobs with a minimal environment. If your script relies on specific environment variables (like JAVA_HOME, PYTHONPATH, etc.), you must either:
    • Define them at the top of the crontab file (e.g., MAILTO="", PATH=/usr/local/bin:/usr/bin:/bin).
    • Define or source them inside your script.
  3. Permissions: Ensure that the script you are scheduling is executable by the user whose crontab you are editing (chmod +x /path/to/your/script.sh).
  4. Output Redirection: By default, cron captures any stdout or stderr produced by the command and tries to email it to the user who owns the crontab. This is often undesirable. It's standard practice to redirect output:
    • >> /path/to/logfile.log 2>&1: Append both stdout and stderr to a log file.
    • > /dev/null 2>&1: Discard all output (if you only care that the job runs or if the script handles its own logging).
  5. Working Directory: Cron jobs usually run from the user's home directory by default. If your script expects to be run from a specific directory (e.g., to find relative config files), you should cd to that directory within the crontab command or at the beginning of your script: * * * * * cd /path/to/app && ./run_app.sh
  6. Locking: If a script takes a long time to run and is scheduled frequently (e.g., every minute), you might end up with multiple instances running simultaneously. Consider implementing a locking mechanism (e.g., using flock or creating/checking for a lock file) within your script to prevent this.
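
As a minimal locking sketch using flock (the lock-file path is an arbitrary choice and must be writable by the user the cron job runs as), the following lines can be placed near the top of a frequently scheduled script:

#!/usr/bin/env bash
set -euo pipefail

LOCK_FILE="/var/lock/my_cron_job.lock" # hypothetical path

# Open the lock file on file descriptor 200 and try to take an
# exclusive, non-blocking lock. If another instance holds it, exit.
exec 200>"$LOCK_FILE"
if ! flock -n 200; then
    echo "Another instance is already running; exiting." >&2
    exit 0
fi

# --- long-running work goes here ---
sleep 30

# The lock is released automatically when the script exits and FD 200 is closed.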

cron is the standard, reliable way to schedule routine tasks, forming the backbone of automation on many Unix-like systems.

Conclusion

This chapter has equipped you with a powerful set of tools and techniques to move beyond basic shell scripting. You've learned how to create modular and reusable code with functions, how to make scripts resilient through robust error handling ($?, set -euo pipefail, trap), how to master input and output streams (>&, <<, <<<, <(), >()), how to harness the pattern-matching power of regular expressions with grep, sed, and awk, how to capture command results using command substitution ($()), how to parse command-line arguments professionally with getopts, and finally, how to automate script execution using the cron scheduler.

These concepts are the building blocks for creating sophisticated, reliable, and automated solutions for system administration, data processing, development workflows, and countless other tasks. Like any skill, mastery comes through practice. Apply these techniques to your own scripting challenges.

The following workshop chapter provides a hands-on project where you'll combine many of these advanced techniques to build a practical system monitoring script.


Workshop - Building a System Health Check Script

Objective

In this workshop, we will apply the concepts learned in the "Advanced Shell Scripting and Automation" chapter to build a practical, command-line system health monitoring script. This project provides hands-on experience with functions, error handling (set, $?, trap), command substitution, text processing (awk, grep), argument parsing (getopts), and generating formatted output.

Our script, health_check.sh, will perform the following tasks:

  1. Check the current CPU load average (1-minute).
  2. Check the percentage of available RAM.
  3. Check the percentage of used disk space for specified filesystems (defaulting to the root filesystem /).
  4. Allow the user to specify warning thresholds for CPU load and disk usage via command-line options.
  5. Allow the user to specify exactly which filesystem mount points to check via command-line options.
  6. Generate a clearly formatted report, highlighting any values that exceed the defined warning thresholds.
  7. Include a basic cleanup mechanism using trap.

Prerequisites

  • Access to a Linux or macOS terminal (the script will be primarily geared towards Linux due to command differences, e.g., free, but notes for macOS will be included where feasible).
  • A text editor (like nano, vim, emacs, VS Code, etc.).
  • Basic shell commands: touch, chmod, ./script.sh execution.
  • Standard Unix utilities available: uptime, free (Linux) / sysctl/vm_stat (macOS), df, awk, grep, sed, echo, date, mktemp, rm.

We will assume very little prior advanced scripting knowledge, reinforcing the concepts from the previous chapter.

Let's Begin!

We'll build the script step by step, adding features incrementally.

Step 1: Initial Script Setup and Basic Structure

First, let's create the script file, make it executable, and add the essential boilerplate, including our robustness settings and default values.

  1. Create the file:
    touch health_check.sh
    
  2. Make it executable:
    chmod +x health_check.sh
    
  3. Open health_check.sh in your text editor and add the following initial structure:
#!/usr/bin/env bash

# --- Robustness Settings ---
# Exit immediately if a command exits with a non-zero status.
set -e
# Treat unset variables as an error when substituting.
set -u
# Prevent errors in a pipeline from being masked.
set -o pipefail

# --- Default Configuration ---
# Warning threshold for 1-minute load average (use floating point)
readonly DEFAULT_LOAD_WARN_THRESHOLD="5.0"
# Warning threshold for disk usage percentage (integer)
readonly DEFAULT_DISK_WARN_THRESHOLD="85"
# Filesystems (mount points) to check by default (root filesystem)
# Using an array to hold multiple values
readonly DEFAULT_FILESYSTEMS_TO_CHECK=("/")
# Temporary directory base name
readonly TEMP_DIR_BASENAME="health_check_temp"

# --- Script Variables (will be populated by options or defaults) ---
LOAD_WARN_THRESHOLD=""
DISK_WARN_THRESHOLD=""
# Declare an array to hold the list of filesystems we'll actually check
declare -a FILESYSTEMS_TO_CHECK=()
# Variable to hold the path to our temporary directory
SCRIPT_TEMP_DIR=""

# --- Function Declarations (will be added later) ---
# We'll define functions like usage(), check_load(), check_memory(), etc. here

# --- Cleanup Function ---
cleanup() {
    # This function will be called by our trap
    local exit_status=${1:-$?} # Capture exit status if passed, otherwise use current $?
    echo # Add a newline for visual separation in output
    echo "--- Running cleanup ---"
    # Check if the temp directory variable was set and if the directory exists
    if [ -n "$SCRIPT_TEMP_DIR" ] && [ -d "$SCRIPT_TEMP_DIR" ]; then
        echo "Removing temporary directory: $SCRIPT_TEMP_DIR"
        # rm -rf is powerful, ensure we only remove what we created
        rm -rf "$SCRIPT_TEMP_DIR"
        echo "Temporary directory removed."
    # else
        # Optional: Add message if temp dir wasn't created or already removed
        # echo "Temporary directory not found or not created."
    fi
    echo "Exiting with status: $exit_status"
    echo "--- Cleanup finished ---"
    # We don't explicitly exit here in the EXIT trap handler itself
}

# --- Trap Setup ---
# Call the 'cleanup' function automatically when the script EXITS
# (normally or due to error/signal). Pass the exit status to cleanup.
trap 'cleanup $?' EXIT
# Optionally add traps for specific signals like INT (Ctrl+C) or TERM
# trap 'cleanup 130; exit 130' INT
# trap 'cleanup 143; exit 143' TERM

# --- Main Logic Function ---
main() {
    # Create a temporary directory for this script run
    # 'mktemp -d' creates a unique directory based on the template
    SCRIPT_TEMP_DIR=$(mktemp -d "${TEMP_DIR_BASENAME}_XXXXXX")
    echo "Using temporary directory: $SCRIPT_TEMP_DIR" # Log temp dir creation

    # --- Option Parsing (will be added in Step 2) ---
    # For now, just use defaults
    LOAD_WARN_THRESHOLD="$DEFAULT_LOAD_WARN_THRESHOLD"
    DISK_WARN_THRESHOLD="$DEFAULT_DISK_WARN_THRESHOLD"
    # Copy the default array elements to our working array
    FILESYSTEMS_TO_CHECK=("${DEFAULT_FILESYSTEMS_TO_CHECK[@]}")

    # --- Start the Checks ---
    echo # Blank line for spacing
    echo "--- System Health Check Report ---"
    # Use command substitution to include the date
    echo "Report generated on: $(date)"
    echo "Warning Thresholds: CPU Load >= $LOAD_WARN_THRESHOLD, Disk Usage >= $DISK_WARN_THRESHOLD%"
    echo # Blank line

    # --- Call Check Functions (will be implemented in Step 3) ---
    # check_cpu_load
    # check_memory
    # check_disk_usage

    echo # Blank line
    echo "--- Health Check Complete ---"
    # Note: The cleanup function runs automatically after this via the EXIT trap
}

# --- Script Entry Point ---
# Call the main function, passing all script arguments ($@) to it.
# This allows main() to later parse options passed to the script.
main "$@"

Explanation:

  • set -euo pipefail: Our standard safety net.
  • readonly Defaults: We define default thresholds and the default filesystem array using readonly to indicate they are constants. Using uppercase is a convention for constants.
  • Script Variables: We declare the variables that will hold the active configuration for this run (potentially overridden by options later). declare -a explicitly creates an array.
  • cleanup() function: Defines the actions to take just before the script exits. It checks if the temporary directory variable (SCRIPT_TEMP_DIR) was set and if the directory actually exists before attempting removal. This prevents errors if the script fails before mktemp runs.
  • trap cleanup EXIT: This is the crucial line that registers the cleanup function to be executed automatically whenever the script exits, for any reason.
  • mktemp -d ...: This command safely creates a unique temporary directory. Using mktemp is much safer than manually creating directories like /tmp/my_temp because it avoids race conditions and potential security issues if multiple instances run or if the directory name is predictable. We store the created path in SCRIPT_TEMP_DIR. Note that with a relative template like this one, the directory is created in the current working directory; a variation that places it under the system temporary directory is sketched after this list.
  • main() function: We encapsulate the core logic here. This improves organization. It currently initializes variables with defaults and prints the basic report structure.
  • main "$@": This executes the main function and passes all command-line arguments received by the script ($@) into it. This is essential for getopts later.
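
One detail worth knowing: because the template passed to mktemp is a relative name, the directory is created in whatever directory you run the script from. If you would rather keep it under the system temporary directory, a small variation of the line in main() (a sketch, using the conventional TMPDIR environment variable with /tmp as a fallback) looks like this:

# Variation: create the temporary directory under $TMPDIR (or /tmp if it is unset)
# instead of the current working directory.
SCRIPT_TEMP_DIR=$(mktemp -d "${TMPDIR:-/tmp}/${TEMP_DIR_BASENAME}_XXXXXX")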

Testing Step 1:

Save the script and run it:

./health_check.sh

You should see output similar to this (the temp dir name will vary):

Using temporary directory: health_check_temp_ABCDEF

--- System Health Check Report ---
Report generated on: Tue Apr 23 10:30:00 BST 2024
Warning Thresholds: CPU Load >= 5.0, Disk Usage >= 85%

--- Health Check Complete ---

--- Running cleanup ---
Removing temporary directory: health_check_temp_ABCDEF
Temporary directory removed.
Exiting with status: 0
--- Cleanup finished ---

Verify that the temporary directory mentioned in the output no longer exists after the script finishes (it should have been cleaned up). Also test pressing Ctrl+C while the script is running; the cleanup should still occur. Since the script currently finishes almost instantly, add a temporary sleep inside main() to give yourself time, as sketched below.
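
A minimal test aid (remove it again afterwards); place it inside main(), for example just before the final "--- Health Check Complete ---" echo:

# Temporary test aid: gives you time to press Ctrl+C and watch the EXIT trap fire.
echo "Sleeping for 30 seconds; press Ctrl+C now to test cleanup..."
sleep 30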

Step 2: Add Command-Line Option Parsing with getopts

Now, let's make the script more flexible by allowing users to override defaults using command-line options:

  • -l <load>: Set CPU load warning threshold.
  • -d <percent>: Set disk usage warning threshold.
  • -f <mount_point>: Specify a filesystem mount point to check (can be used multiple times).
  • -h: Display a help message.

  1. Add the usage() function (place it before the main() function definition):

# --- Helper Function: Display Usage ---
usage() {
    # Using a here document for the multi-line message
    cat << EOF
Usage: $0 [-l <load_threshold>] [-d <disk_threshold_percent>] [-f <mount_point>] [-h]

Performs basic system health checks.

Options:
    -l <load>       Warning threshold for 1-min CPU load average (float).
                    Default: ${DEFAULT_LOAD_WARN_THRESHOLD}
    -d <percent>    Warning threshold for disk usage percentage (integer).
                    Default: ${DEFAULT_DISK_WARN_THRESHOLD}
    -f <mount>      Filesystem mount point to check (can be specified multiple times).
                    Default: Check only root filesystem ('${DEFAULT_FILESYSTEMS_TO_CHECK[*]}')
    -h              Display this help message and exit.
EOF
    # Exit with a non-zero status after showing help
    exit 1
}
  2. Modify the main() function to include the getopts loop:
# --- Main Logic Function ---
main() {
    # Create a temporary directory for this script run
    SCRIPT_TEMP_DIR=$(mktemp -d "${TEMP_DIR_BASENAME}_XXXXXX")
    # echo "Using temporary directory: $SCRIPT_TEMP_DIR" # Can comment out if too noisy

    # --- Initialize with defaults BEFORE parsing options ---
    LOAD_WARN_THRESHOLD="$DEFAULT_LOAD_WARN_THRESHOLD"
    DISK_WARN_THRESHOLD="$DEFAULT_DISK_WARN_THRESHOLD"
    # Important: Make a copy of the default array. If the user uses -f, we clear this copy.
    FILESYSTEMS_TO_CHECK=("${DEFAULT_FILESYSTEMS_TO_CHECK[@]}")
    # Flag to track if user specified any filesystems via -f
    local user_specified_filesystems=false

    # --- Option Parsing Loop ---
    while getopts ":l:d:f:h" opt; do
        case $opt in
            l)
                LOAD_WARN_THRESHOLD="$OPTARG"
                # Basic validation: check if it looks like a number (integer or float)
                if ! [[ "$OPTARG" =~ ^[0-9]+(\.[0-9]+)?$ ]]; then
                    echo "Error: Invalid load threshold specified with -l: '$OPTARG'. Must be a number." >&2
                    usage
                fi
                ;;
            d)
                DISK_WARN_THRESHOLD="$OPTARG"
                # Basic validation: check if it's an integer
                if ! [[ "$OPTARG" =~ ^[0-9]+$ ]]; then
                    echo "Error: Invalid disk threshold specified with -d: '$OPTARG'. Must be an integer percentage." >&2
                    usage
                fi
                ;;
            f)
                # If this is the first time -f is used, clear the default array
                if [ "$user_specified_filesystems" = false ]; then
                    FILESYSTEMS_TO_CHECK=()
                    user_specified_filesystems=true
                fi
                # Add the specified filesystem mount point to our array
                FILESYSTEMS_TO_CHECK+=("$OPTARG")
                ;;
            h)
                usage # Display help and exit
                ;;
            \?)
                # Invalid option
                echo "Error: Invalid option specified: -$OPTARG" >&2
                usage
                ;;
            :)
                # Missing option argument
                echo "Error: Option -$OPTARG requires an argument." >&2
                usage
                ;;
        esac
    done

    # --- Shift away processed options ---
    # Remove options and their arguments from $@
    shift $((OPTIND-1))

    # --- Argument Validation ---
    # Check if any arguments remain after options (we don't expect any for this script)
    if [ $# -gt 0 ]; then
        echo "Error: Unexpected arguments provided: '$@'" >&2
        usage
    fi
    # Defensive check: warn if -f was used but the array is somehow still empty.
    # (Each -f appends its argument directly, so this should not normally happen.
    # Whether each path is a valid mount point is verified inside check_disk_usage.)
    if [ "$user_specified_filesystems" = true ] && [ ${#FILESYSTEMS_TO_CHECK[@]} -eq 0 ]; then
        echo "Warning: -f option used, but no filesystem paths were recorded." >&2
        # Treat this as a warning for now; change to 'exit 1' if you prefer a hard failure.
    fi


    # --- Start the Checks ---
    echo # Blank line for spacing
    echo "--- System Health Check Report ---"
    echo "Report generated on: $(date)"
    # Display the *actual* thresholds being used
    echo "Warning Thresholds: CPU Load >= $LOAD_WARN_THRESHOLD, Disk Usage >= $DISK_WARN_THRESHOLD%"
    # Display the filesystems that will be checked
    echo "Filesystems to check: ${FILESYSTEMS_TO_CHECK[*]}" # [*] joins with space
    echo # Blank line

    # --- Call Check Functions (still commented out; implemented and enabled in Step 3) ---
    # check_cpu_load
    # check_memory        # Assumes the Linux 'free' command for now
    # check_disk_usage

    echo # Blank line
    echo "--- Health Check Complete ---"
}

Explanation of Changes in main():

  • Initialization: Variables are now initialized to defaults before the getopts loop. A copy of the default filesystem array is made.
  • user_specified_filesystems flag: Tracks whether the -f option was used.
  • getopts loop: Parses options -l, -d, -f, -h.
    • Validation added for -l (number/float) and -d (integer) using regex matching (=~).
    • For -f, the FILESYSTEMS_TO_CHECK array is cleared only the first time -f is encountered, then the provided argument ($OPTARG) is appended using +=.
    • Error cases (\? and :) now call the usage function for consistent error reporting.
  • shift $((OPTIND-1)): Removes processed options.
  • Argument Validation: Checks if unexpected non-option arguments were passed.
  • Output: The report header now shows the actual thresholds and filesystems being used for this run.
  • Function Calls: The calls to the check functions stay commented out for now; we uncomment them in Step 3 once the functions actually exist.

Testing Step 2:

  • ./health_check.sh -h (Should display the usage message)
  • ./health_check.sh -l 10.5 -d 90 (Should run using custom thresholds)
  • ./health_check.sh -f / -f /home -f /var (Should run checking specified paths)
  • ./health_check.sh -f /data (Should run checking only /data)
  • ./health_check.sh -x (Should show "Invalid option" error and usage)
  • ./health_check.sh -f (Should show "Option -f requires an argument" error and usage)
  • ./health_check.sh -l abc (Should show "Invalid load threshold" error and usage)
  • ./health_check.sh some_arg (Should show "Unexpected arguments" error and usage)
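
With the check calls still commented out, a successful run such as ./health_check.sh -l 10.5 -d 90 should produce output roughly like the following (the date and the temporary directory suffix will differ on your machine):

--- System Health Check Report ---
Report generated on: Tue Apr 23 10:45:00 BST 2024
Warning Thresholds: CPU Load >= 10.5, Disk Usage >= 90%
Filesystems to check: /

--- Health Check Complete ---

--- Running cleanup ---
Removing temporary directory: health_check_temp_AbC123
Temporary directory removed.
Exiting with status: 0
--- Cleanup finished ---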

Step 3: Implement the Health Check Functions

Now, let's write the actual logic for checking CPU, memory, and disk. Place these function definitions before the main() function definition in your script, and once they are in place, uncomment the three check_* calls inside main().

3.1. check_cpu_load()

# --- Check Functions ---

### Checks the 1-minute CPU load average against the threshold ###
check_cpu_load() {
    # Use 'local' for variables inside functions
    local current_load
    local load_check_status="OK" # Assume OK initially

    echo "--- CPU Load ---"
    # Get load average from 'uptime'. The 1-min average is typically near the end.
    # Using awk to extract it robustly:
    # -F'[ ,:]+' splits by space, comma, or colon (one or more)
    # NF is Number of Fields. $(NF-2) is usually the 1-min avg.
    current_load=$(uptime | awk -F'[ ,:]+' '{print $(NF-2)}')

    echo "Current 1-minute load average: $current_load"

    # Compare floating point numbers using 'awk'
    # awk exits 0 if comparison is true, 1 if false.
    # We use 'exit !(condition)' because shell 'if' treats 0 as true (success).
    # So if load >= threshold (true -> 1), !(1) is 0, awk exits 0, and the 'if' block runs.
    if awk -v load="$current_load" -v threshold="$LOAD_WARN_THRESHOLD" 'BEGIN { exit !(load >= threshold) }'; then
        echo "WARNING: Load average ($current_load) is >= threshold ($LOAD_WARN_THRESHOLD)"
        load_check_status="WARNING"
    else
        echo "Status: OK"
    fi
    echo "Overall CPU Load Status: $load_check_status"
    echo # Blank line for spacing
}

Explanation:

  • Uses uptime and awk to reliably extract the 1-minute load average.
  • Uses awk again for floating-point comparison against LOAD_WARN_THRESHOLD. The exit !(condition) pattern is used to make awk's exit status compatible with shell if.
  • Prints the current load and a clear WARNING or OK status message.
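
If the exit !(condition) trick still feels opaque, here is a minimal standalone sketch you can paste into a terminal to see it in action (the function name and the sample numbers are just illustrative):

# Returns success (exit 0) when the first number is >= the second, using awk for float math.
is_greater_or_equal() {
    awk -v a="$1" -v b="$2" 'BEGIN { exit !(a >= b) }'
}

if is_greater_or_equal "3.2" "2.5"; then
    echo "3.2 >= 2.5 is true"   # This branch runs: awk exits 0, which the shell 'if' treats as success
else
    echo "3.2 >= 2.5 is false"
fi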

3.2. check_memory() (Linux Version using free)

### Checks available memory percentage (Linux using 'free') ###
check_memory() {
    local total_mem_kb avail_mem_kb total_mem_mb avail_mem_mb avail_mem_percent
    local mem_check_status="ERROR" # Default to error until we succeed

    echo "--- Memory Usage (Linux) ---"
    # Check if 'free' command exists
    if ! command -v free &> /dev/null; then
        echo "ERROR: 'free' command not found. Cannot check memory."
        echo "Overall Memory Status: $mem_check_status"
        echo # Blank line
        return # Exit the function early
    fi

    # Get memory details in Kilobytes using 'free' (no options needed)
    # Use awk to find the 'Mem:' line and extract total (col 2) and available (col 7)
    # The 'available' column (usually 7th) is generally preferred over 'free'
    # Output Format Example (may vary slightly):
    #               total        used        free      shared  buff/cache   available
    # Mem:        16018500     4017624     1186076      266780     10814800    12611376
    # Swap:        2097148           0     2097148
    local mem_line
    mem_line=$(free | grep '^Mem:')

    if [ -z "$mem_line" ]; then
         echo "ERROR: Could not parse 'Mem:' line from 'free' command output."
         free # Print raw output for debugging
         echo "Overall Memory Status: $mem_check_status"
         echo # Blank line
         return
    fi

    total_mem_kb=$(echo "$mem_line" | awk '{print $2}')
    avail_mem_kb=$(echo "$mem_line" | awk '{print $7}') # Assuming 7th field is 'available'

    if ! [[ "$total_mem_kb" =~ ^[0-9]+$ ]] || ! [[ "$avail_mem_kb" =~ ^[0-9]+$ ]]; then
        echo "ERROR: Failed to extract numeric memory values from 'free'."
        echo "Raw Mem line: $mem_line"
        echo "Overall Memory Status: $mem_check_status"
        echo # Blank line
        return
    fi

    # Convert KB to MB for potentially nicer display (integer division)
    total_mem_mb=$(( total_mem_kb / 1024 ))
    avail_mem_mb=$(( avail_mem_kb / 1024 ))

    # Calculate available percentage (using KB for accuracy before dividing)
    if [ "$total_mem_kb" -gt 0 ]; then
        avail_mem_percent=$(( (avail_mem_kb * 100) / total_mem_kb ))
        echo "Total Memory: ${total_mem_mb} MB"
        echo "Available Memory: ${avail_mem_mb} MB (${avail_mem_percent}%)"
        mem_check_status="OK" # Calculation succeeded
    else
        echo "ERROR: Total memory reported as zero. Cannot calculate percentage."
        echo "Overall Memory Status: $mem_check_status"
        echo # Blank line
        return
    fi

    # No threshold check for memory in this version, just reporting
    # (Could add a threshold check similar to disk usage if desired)
    echo "Overall Memory Status: $mem_check_status"
    echo # Blank line
}

Explanation (Linux):

  • Checks if the free command exists.
  • Parses the output of free using grep and awk to find the Mem: line and extract total/available memory (assuming KB output and field positions). Includes error checking for parsing.
  • Calculates MB and percentage available.
  • Reports the values. (No warning threshold is implemented for memory in this example, but it could be added).
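
If you do want a warning threshold for memory as well, a minimal sketch is shown below. DEFAULT_MEM_AVAIL_WARN_THRESHOLD is a hypothetical constant (not part of the workshop script) that you would add next to the other readonly defaults; the if block would go inside check_memory() right after avail_mem_percent is calculated.

# Hypothetical addition: warn when available memory drops to or below this percentage.
readonly DEFAULT_MEM_AVAIL_WARN_THRESHOLD="15"

# ... inside check_memory(), after avail_mem_percent has been computed:
if [ "$avail_mem_percent" -le "$DEFAULT_MEM_AVAIL_WARN_THRESHOLD" ]; then
    echo "WARNING: Available memory (${avail_mem_percent}%) is <= threshold (${DEFAULT_MEM_AVAIL_WARN_THRESHOLD}%)"
    mem_check_status="WARNING"
fi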

(Optional) check_memory() (macOS Version): macOS has no free command, so the check relies on sysctl and vm_stat instead, and the parsing is different. The version below is left commented out; if you are on macOS, use it in place of the Linux version above.

### Checks memory usage (macOS using sysctl/vm_stat - Approximation) ###
# check_memory() {
#     local total_mem_bytes total_mem_mb page_size pages_free pages_inactive pages_wired
#     local mem_available_approx_mb mem_available_percent
#     local mem_check_status="ERROR"
#
#     echo "--- Memory Usage (macOS Approximation) ---"
#     if ! command -v sysctl &> /dev/null || ! command -v vm_stat &> /dev/null; then
#         echo "ERROR: 'sysctl' or 'vm_stat' command not found. Cannot check memory."
#         echo "Overall Memory Status: $mem_check_status"
#         echo; return
#     fi
#
#     total_mem_bytes=$(sysctl -n hw.memsize)
#     page_size=$(sysctl -n hw.pagesize)
#
#     # Parse vm_stat - requires careful awk field selection
#     local vm_output
#     vm_output=$(vm_stat)
#     pages_free=$(echo "$vm_output" | awk '/Pages free:/ { print $3 }' | tr -d '.')
#     pages_inactive=$(echo "$vm_output" | awk '/Pages inactive:/ { print $3 }' | tr -d '.')
#     # Wired memory is actively used and cannot be paged out
#     pages_wired=$(echo "$vm_output" | awk '/Pages wired down:/ { print $4 }' | tr -d '.')
#
#     if ! [[ "$total_mem_bytes" =~ ^[0-9]+$ ]] || \
#        ! [[ "$page_size" =~ ^[0-9]+$ ]] || \
#        ! [[ "$pages_free" =~ ^[0-9]+$ ]] || \
#        ! [[ "$pages_inactive" =~ ^[0-9]+$ ]] || \
#        ! [[ "$pages_wired" =~ ^[0-9]+$ ]]; then
#         echo "ERROR: Failed to parse numeric values from sysctl or vm_stat."
#         echo "Overall Memory Status: $mem_check_status"
#         echo; return
#     fi
#
#     total_mem_mb=$(( total_mem_bytes / 1024 / 1024 ))
#     # Approximation: Available is often considered Free + Inactive
#     # More sophisticated checks might consider compressed memory etc.
#     mem_available_approx_mb=$(( ( (pages_free + pages_inactive) * page_size ) / 1024 / 1024 ))
#
#     if [ "$total_mem_mb" -gt 0 ]; then
#         mem_available_percent=$(( (mem_available_approx_mb * 100) / total_mem_mb ))
#         echo "Total Memory: ${total_mem_mb} MB"
#         echo "Available Memory (Approx): ${mem_available_approx_mb} MB (${mem_available_percent}%)"
#         mem_check_status="OK"
#     else
#         echo "ERROR: Total memory reported as zero."
#         echo "Overall Memory Status: $mem_check_status"
#         echo; return
#     fi
#
#     echo "Overall Memory Status: $mem_check_status"
#     echo # Blank line
# }

3.3. check_disk_usage()

### Checks disk usage percentage for specified mount points ###
check_disk_usage() {
    local fs_mount usage_percent human_readable_info df_line
    local disk_check_status="OK" # Assume OK overall unless a warning is found

    echo "--- Disk Usage ---"
    if [ ${#FILESYSTEMS_TO_CHECK[@]} -eq 0 ]; then
        echo "No filesystems specified for checking."
        echo "Overall Disk Status: SKIPPED"
        echo # Blank line
        return
    fi

    # Iterate over the array of mount points provided
    for fs_mount in "${FILESYSTEMS_TO_CHECK[@]}"; do
        local single_fs_status="ERROR" # Status for this specific filesystem
        echo "Checking filesystem mounted at: '$fs_mount'"

        # Verify the path exists and that df can report on it.
        # Use 'df -P' for POSIX-standard output (more reliable parsing).
        # Run df on the mount point itself, then use awk to keep only the data
        # line whose 6th field (the mount point) exactly matches the requested
        # path. Exact string comparison means no regex escaping is needed.
        if ! df_line=$(df -P "$fs_mount" 2>/dev/null | awk -v path="$fs_mount" 'NR > 1 && $6 == path {print}'); then
            echo "  ERROR: Could not get 'df -P' info for '$fs_mount'. Skipping."
            disk_check_status="ERROR" # Mark overall status as error
            continue # Skip to the next filesystem in the loop
        fi
        if [ -z "$df_line" ]; then
            echo "  ERROR: Path '$fs_mount' might not be a valid mount point or df failed."
            echo "  (Please provide mount points like '/', '/home', '/var', not arbitrary paths)."
            disk_check_status="ERROR"
            continue
        fi

        # Extract usage percentage (5th field), removing '%' sign
        usage_percent=$(echo "$df_line" | awk '{gsub(/%/, ""); print $5}')

        # Get human-readable info using 'df -h' separately for display
        # Need to handle potential errors here too
        if ! human_readable_info=$(df -h "$fs_mount" 2>/dev/null | awk -v path="$fs_mount" 'NR > 1 && $6 == path {print}'); then
            human_readable_info="Could not get human-readable info."
        fi

        # Validate extracted percentage
        if ! [[ "$usage_percent" =~ ^[0-9]+$ ]]; then
             echo "  ERROR: Failed to extract numeric usage percentage for '$fs_mount'."
             echo "  Raw df line: $df_line"
             disk_check_status="ERROR"
             single_fs_status="ERROR"
        else
             # Successfully parsed, now check threshold
             echo "  Usage: $usage_percent%"
             echo "  Details: $human_readable_info"
             single_fs_status="OK" # Parsed ok

             # Compare usage against threshold (integer comparison)
             if [ "$usage_percent" -ge "$DISK_WARN_THRESHOLD" ]; then
                 echo "  WARNING: Usage ($usage_percent%) is >= threshold ($DISK_WARN_THRESHOLD%)"
                 single_fs_status="WARNING"
                 # If any filesystem has a warning, the overall status is Warning
                 # unless an error occurred elsewhere.
                 if [ "$disk_check_status" != "ERROR" ]; then
                     disk_check_status="WARNING"
                 fi
             else
                 echo "  Status: OK"
             fi
        fi
         echo "  Filesystem '$fs_mount' Status: $single_fs_status"

    done # End of loop through filesystems

    echo "Overall Disk Status: $disk_check_status"
    echo # Blank line
}

Explanation:

  • Iterates through the FILESYSTEMS_TO_CHECK array.
  • Uses df -P "$fs_mount" to get POSIX-standard output for the specific mount point.
  • Uses awk 'NR > 1 && $6 == path {print}' to find the correct line (skipping header NR > 1) where the 6th field ($6) exactly matches the requested mount point path. Error handling is included if df fails or the path isn't found.
  • Parses the usage percentage (field 5) using awk, removing the % sign with gsub. Includes validation.
  • Runs df -h separately just to get the human-readable line for display purposes.
  • Compares the integer usage_percent with DISK_WARN_THRESHOLD using [ -ge ].
  • Sets individual and overall status indicators (OK, WARNING, ERROR).
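
To see why that awk expression works, here is what df -P output typically looks like (the numbers below are invented, and the exact header wording varies slightly between systems). awk skips the header line (NR > 1), keeps the data line whose 6th field equals the requested mount point, and field 5 carries the usage percentage:

$ df -P /
Filesystem     1024-blocks      Used Available Capacity Mounted on
/dev/sda2        102687672  56034092  41400140      58% /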

Step 4: Final Assembly and Testing

  1. Ensure Order: Make sure all function definitions (usage, check_cpu_load, check_memory, check_disk_usage, cleanup) are placed in the script file before the main() function definition, that the three check_* calls inside main() are now uncommented, and that main "$@" is the last line that executes.
  2. Review: Read through your complete health_check.sh script.

Comprehensive Testing:

Execute the script with various combinations of options and scenarios:

  • Defaults: ./health_check.sh
  • Help: ./health_check.sh -h
  • Custom Thresholds (Trigger Warnings): ./health_check.sh -l 0.1 -d 5
  • Custom Thresholds (Normal): ./health_check.sh -l 100 -d 99
  • Specific Filesystems: ./health_check.sh -f / -f /tmp (or other valid mount points on your system - use df command to see yours)
  • Invalid Filesystem Path: ./health_check.sh -f /nonexistent_mount
  • Invalid Option: ./health_check.sh -z
  • Missing Argument: ./health_check.sh -d
  • Unexpected Argument: ./health_check.sh extra_arg
  • Combination: ./health_check.sh -l 5.5 -f /var -f /home (combining a custom threshold with multiple -f options; a -v verbose option is not implemented, so passing one would trigger the invalid-option error)

For each test, examine the output carefully:

  • Are the correct thresholds and filesystems listed in the header?
  • Are the CPU, memory, and disk values reported?
  • Do warnings appear correctly when thresholds are exceeded?
  • Are errors handled gracefully (e.g., for invalid paths or options)?
  • Does the cleanup function run and report the correct exit status? Does it remove the temporary directory?

Workshop Summary

Fantastic! You have successfully built a functional system health check script by applying advanced shell scripting techniques:

  • Modularity: Used functions (usage, cleanup, check_*, main) to organize code.
  • Robustness: Implemented error handling using set -euo pipefail, $?, explicit checks (command -v, file existence, parsing validation), and a trap for cleanup.
  • Configuration: Parsed command-line options professionally using getopts.
  • Data Handling: Used command substitution ($()) extensively to capture output from date, uptime, free, df, awk, mktemp. Processed text using awk, grep, and sed. Handled multiple filesystem paths using arrays.
  • Resource Management: Safely created and removed temporary resources (mktemp, rm in trap).

This script serves as a solid base. You can enhance it further by adding more checks (network connectivity, running processes, service status), improving parsing robustness (e.g., using df --output flags if available), adding color output, or reading configuration from a file.
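
As an example of the parsing improvement mentioned above: on systems with GNU coreutils, df supports an --output option that removes most of the parsing work (it is not available in macOS/BSD df, which is why the workshop uses the portable df -P approach). A minimal sketch:

# GNU coreutils only: ask df for just the usage percentage of one mount point.
# 'tail -n 1' skips the header line; 'tr -dc' strips the '%' sign and whitespace.
usage_percent=$(df --output=pcent "$fs_mount" 2>/dev/null | tail -n 1 | tr -dc '0-9')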

This workshop demonstrates how the techniques from the previous chapter come together to create powerful and reliable automation tools using the shell.