Author | Nejat Hakan
nejat.hakan@outlook.de
PayPal Me | https://paypal.me/nejathakan |
Advanced Shell Scripting and Automation
Introduction
Welcome back to your journey into the world of the shell! In previous explorations, you likely learned the fundamentals: navigating directories, executing commands, using pipes and redirection, and perhaps writing simple scripts with variables, loops (`for`, `while`), and conditional statements (`if`, `case`). Now, it's time to elevate those skills.
This chapter delves into advanced shell scripting techniques. But what does "advanced" really mean here? It's not about arcane commands known only to wizards. Instead, it's about writing scripts that are:
- More Robust: They handle errors gracefully and anticipate potential problems.
- More Efficient: They perform tasks faster or with fewer resources.
- More Flexible: They can adapt to different inputs and situations.
- More Reusable: They contain components (like functions) that can be used in multiple scripts.
- Capable of Automation: They can perform complex sequences of tasks without manual intervention, often on a schedule.
Mastering these techniques transforms the shell from a simple command executor into a powerful automation platform. Whether you need to manage system configurations, process large amounts of data, automate backups, or orchestrate complex workflows, advanced shell scripting is an invaluable skill. We'll cover functions, sophisticated error handling, advanced input/output techniques, regular expressions, argument parsing, and the foundational concepts of automation.
Let's begin building more powerful and professional shell scripts!
1. Functions: Building Reusable Code Blocks
Imagine you have a piece of code that you need to use multiple times within the same script, or perhaps across different scripts. Copying and pasting is inefficient and error-prone. If you need to change that code later, you have to find and update every single copy! This is where functions come in.
A function is a named block of code that performs a specific task. You define it once and can then "call" it (execute it) by its name whenever you need it.
Defining a Function
There are two common syntaxes for defining functions in Bash (and compatible shells):
# Syntax 1 (Recommended - POSIX standard)
function_name() {
# Commands go here
# ...
# Optional: return a status code
return 0 # 0 typically means success
}
# Syntax 2 (Bash-specific keyword)
function function_name {
# Commands go here
# ...
# Optional: return a status code
return 1 # Non-zero typically means failure
}
- `function_name`: Choose a descriptive name for your function. Use letters, numbers, and underscores. Avoid special characters.
- `()`: Parentheses are required after the function name in the first syntax. They signify that this is a function definition.
- `{}`: Curly braces enclose the block of code that belongs to the function. There must be a space or newline after the opening brace `{` and before the closing brace `}`.
- Commands: Any valid shell commands can be placed inside the function. These commands are executed sequentially when the function is called.
- `return <value>`: (Optional) This command immediately stops the execution of the function and sets its exit status. The `<value>` should be an integer between 0 and 255. By convention:
  - `0` indicates that the function completed successfully.
  - Any non-zero value (`1`, `2`, ..., `255`) indicates that some kind of error or failure occurred within the function. If you don't use `return`, the function's exit status will be the exit status of the last command executed inside it.
Calling a Function
To execute the code inside a function (i.e., to "call" the function), you simply type its name on a line by itself, just like you would run any other command:
function_name
The shell will find the function definition and execute the commands within its curly braces.
Passing Arguments (Parameters) to Functions
Functions become much more flexible when they can accept input data. These input values are called arguments or parameters. When you call a function, you can list the arguments you want to pass to it right after the function name, separated by spaces.
Inside the function's code block, you can access these arguments using special variables called positional parameters:
- `$1`: Represents the first argument passed to the function.
- `$2`: Represents the second argument passed to the function.
- `$3`: Represents the third argument, and so on (`$4`, `$5`, ... `$9`, `${10}`, `${11}`, etc. Use curly braces for numbers greater than 9).
- `$0`: Inside a function, `$0` usually still refers to the name of the main script itself, not the function name (this can be a bit confusing, but it's how it works).
- `$#`: Represents the total number of arguments passed to the function. This is very useful for checking if the correct number of arguments was provided.
- `$*`: Represents all the arguments passed to the function as a single string. If the arguments were `"hello world"` and `"goodbye"`, `$*` would expand to `"hello world goodbye"`. The arguments are joined together using the first character of the special shell variable `IFS` (Internal Field Separator), which is typically a space.
- `$@`: Represents all the arguments passed to the function as separate strings (individual words). If the arguments were `"hello world"` and `"goodbye"`, `$@` would expand to `"hello world"` and `"goodbye"` (two separate entities). This is generally the preferred way to handle multiple arguments, especially when you need to pass them on to another command inside the function, as it preserves arguments that contain spaces.
Example:
Let's create a function that takes a name and an age and prints a message.
#!/usr/bin/env bash
# Define a function to describe a person
describe_person() {
# Check if exactly two arguments were provided
if [ $# -ne 2 ]; then
# Print an error message to standard error (stderr)
echo "Error: describe_person requires exactly 2 arguments (name and age)." >&2
echo "Usage: describe_person <name> <age>" >&2
return 1 # Indicate failure (incorrect usage)
fi
# Assign arguments to meaningful local variables
# Using 'local' prevents these variables from interfering with
# variables outside the function.
local person_name="$1"
local person_age="$2"
# Print the description
echo "$person_name is $person_age years old."
# Optional: Check if age is numeric (basic check)
if ! [[ "$person_age" =~ ^[0-9]+$ ]]; then
echo "Warning: Age '$person_age' does not look like a number." >&2
# We could return a different non-zero status here if needed
fi
return 0 # Indicate success
}
# --- Script Main Body ---
echo "Calling function with correct arguments:"
describe_person "Alice" 30
# Check the exit status of the last command (the function call).
# Save it immediately: $? is overwritten by every subsequent command.
status=$?
if [ "$status" -eq 0 ]; then
echo "Function call succeeded."
else
echo "Function call failed with status $status."
fi
echo # Blank line
echo "Calling function with incorrect number of arguments:"
describe_person "Bob"
status=$?
if [ "$status" -eq 0 ]; then
echo "Function call succeeded."
else
echo "Function call failed with status $status." # Should be 1 here
fi
echo # Blank line
echo "Calling function with potentially non-numeric age:"
describe_person "Charlie" "twenty-five"
# This call will succeed (return 0) but print a warning.
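The difference between `$*` and `$@` is easiest to see in a loop. Here is a minimal sketch (the function name `show_args` is purely illustrative):
#!/usr/bin/env bash

show_args() {
    echo "Received $# arguments."
    # "$@" preserves each argument as a separate word
    for arg in "$@"; do
        echo "  via \"\$@\": [$arg]"
    done
    # "$*" joins all arguments into one string using the first IFS character
    for arg in "$*"; do
        echo "  via \"\$*\": [$arg]"
    done
}

show_args "hello world" "goodbye"
# via "$@": two iterations -> [hello world] then [goodbye]
# via "$*": one iteration  -> [hello world goodbye]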
Variable Scope: `local` vs. Global
Understanding where variables "live" (their scope) is crucial for writing bug-free functions.
- Global Scope: By default, any variable you define in the main part of your script is global. This means it can be accessed and modified from anywhere within the script, including inside any function. Furthermore, if you define a variable inside a function without using the `local` keyword, it also becomes a global variable! This can lead to unexpected behavior, where one function accidentally changes a variable that another part of the script (or another function) relies on.
- Local Scope: To create a variable that exists only within the function where it is defined, you must declare it using the `local` keyword. This variable will be invisible outside the function, and it won't clash with any global variable that happens to have the same name.
Example:
#!/usr/bin/env bash
# Global variable
my_var="I am global"
modifier_function() {
# This variable is local to modifier_function
local my_var="I am local inside modifier_function"
echo "Inside modifier_function: my_var = '$my_var'"
# Modifying a global variable (implicitly)
# This is generally bad practice unless intended!
another_var="Set inside modifier_function"
}
reader_function() {
# This variable is local to reader_function
local another_var="I am local inside reader_function"
echo "Inside reader_function: my_var = '$my_var'" # Accesses the global 'my_var'
echo "Inside reader_function: another_var = '$another_var'" # Accesses its own local 'another_var'
}
# --- Script Main Body ---
echo "--- Before calling functions ---"
echo "Global scope: my_var = '$my_var'"
# echo "Global scope: another_var = '$another_var'" # This would cause an error if 'set -u' is active
echo # Blank line
echo "--- Calling modifier_function ---"
modifier_function
echo "Global scope: modifier_function set another_var = '$another_var'" # Now accessible globally
echo # Blank line
echo "--- Calling reader_function ---"
reader_function
echo # Blank line
echo "--- After calling functions ---"
echo "Global scope: my_var = '$my_var'" # Unchanged by modifier_function's local variable
echo "Global scope: another_var = '$another_var'" # Changed by modifier_function
Best Practice: Always use the `local` keyword for variables defined inside your functions unless you have a very specific, deliberate reason to modify a global variable. This makes your functions self-contained, easier to understand, and less prone to causing side effects.
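Note that `return` only carries a numeric status, not data. To get a result out of a function, the usual pattern is to print it and capture it with command substitution (covered in detail later). A minimal sketch, with an illustrative `get_timestamp` function:
get_timestamp() {
    # Print the result on stdout; the caller captures it.
    local ts
    ts=$(date +"%Y-%m-%d %H:%M:%S")
    echo "$ts"
}

now=$(get_timestamp)   # Command substitution captures the echoed value
echo "Captured timestamp: $now"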
2. Robust Error Handling
Simple scripts often just stop or produce strange output when something goes wrong (like a file not found, a command failing, or incorrect input). Professional scripts, however, need to anticipate and handle errors gracefully. This might involve:
- Detecting that an error occurred.
- Reporting the error clearly to the user (or a log file).
- Deciding whether the script can continue or if it needs to stop.
- Cleaning up any temporary resources (like files or directories) before exiting.
Exit Status Codes ($?)
As mentioned briefly with functions, every command you run in the shell finishes with an exit status (also called a return code or exit code). This is a number between 0 and 255 that indicates whether the command succeeded or failed.
- `0`: Success. The command completed its task without any problems.
- `1-255`: Failure. The command encountered an error. The specific non-zero number can sometimes indicate the type of error (e.g., `1` often means a general error, `2` might mean misuse of a command, `127` often means "command not found"), but these conventions are not strictly enforced across all commands. The key takeaway is: zero means success, non-zero means failure.
The shell automatically stores the exit status of the most recently executed command (or function, or pipeline) in a special variable: `$?`.
You can check this variable immediately after running a command to see if it worked.
# Example: Check the exit status of an existing and a non-existent path
ls /etc/passwd # This file exists, ls should succeed
echo "Exit status of first ls: $?" # Output should be 0
ls /non/existent/directory/or/file # This path likely doesn't exist, ls should fail
echo "Exit status of second ls: $?" # Output should be non-zero (e.g., 1 or 2)
# Using the exit status in an 'if' statement
if cp important_data.txt /backup/location/; then
# This block runs ONLY if 'cp' exits with status 0
echo "Backup of important_data.txt completed successfully."
else
# This block runs if 'cp' exits with a non-zero status
cp_status=$? # Save the status immediately! $? changes with every command. ('local' only works inside functions.)
echo "ERROR: Failed to copy important_data.txt!" >&2 # Send error to stderr
echo "The 'cp' command exited with status: $cp_status" >&2
# It's often good practice to exit the script if a critical step fails
exit 1 # Exit the entire script with a failure status
fi
# If we reach here, the copy was successful (or the script exited)
echo "Script continues..."
Important: The value of `$?` is updated after every single command. If you need to use the exit status of a specific command later, save it to another variable immediately, as shown with `cp_status=$?`.
Also, notice `>&2`. This redirects the error messages to standard error (file descriptor 2). Standard output (stdout, file descriptor 1) is for the normal, expected output of a program, while standard error (stderr) is the conventional channel for error messages and diagnostics. Redirecting errors this way keeps them separate from regular output, which is useful if you're redirecting stdout to a file but still want to see errors on the terminal.
The `set` Command for Automatic Error Handling
Checking `$?` after every single command can make your scripts very long and repetitive. Bash provides some options, enabled using the `set` built-in command, that can make your scripts automatically more sensitive to errors.
- `set -e` (or `set -o errexit`): This is a very powerful option. When `set -e` is active, the script will exit immediately if any simple command exits with a non-zero status. This prevents the script from blindly continuing after a critical failure.
  - Which failures trigger an exit? Basically, a command's failure counts unless it is part of a conditional test (like in `if [...]` or `while [...]`), part of a boolean `&&` or `||` list, inverted with `!`, or a non-final command in a pipeline (unless `pipefail` is also set).
  - Caveats: Because of these exceptions, `set -e` isn't a magic bullet. Sometimes a command failure is expected and handled (e.g., `grep` not finding a pattern might exit with 1, which could be okay). You can temporarily bypass `set -e` for a specific command by appending `|| true` (e.g., `command_that_might_fail || true`), but use this judiciously; see the sketch after this list. Despite its quirks, `set -e` catches many common errors and significantly improves robustness.
- `set -u` (or `set -o nounset`): This option causes the script to treat attempts to expand (use the value of) an unset variable as an error, and the script will exit. Normally, using an unset variable just results in an empty string, which can hide typos or logic errors. `set -u` makes these errors obvious.
  - Example: If you typed `echo "$massage"` instead of `echo "$message"`, `set -u` would cause the script to exit with an error instead of just printing a blank line.
  - If you need to check if a variable is set without triggering `set -u`, you can use parameter expansion checks like `if [ -z "${my_var-}" ]; then ...` (the hyphen `-` after the variable name tells Bash to expand to an empty string if the variable is unset, rather than raising an error).
- `set -o pipefail`: By default, the exit status of a pipeline (commands connected by `|`) is the exit status of the last command only. This means if an earlier command in the pipeline fails, but the last one succeeds, the pipeline as a whole is considered successful (exit status 0). `set -o pipefail` changes this behavior: the exit status of the pipeline becomes the exit status of the rightmost command in the pipeline that failed (returned non-zero), or zero if all commands in the pipeline succeeded. This is essential when you need to ensure that every step in a data processing pipeline worked correctly.
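For instance, under `set -euo pipefail` an "expected" failure like `grep` finding no matches must be handled explicitly. A minimal sketch (the file name `app.log` and the variable `DEBUG` are illustrative):
#!/usr/bin/env bash
set -euo pipefail

# grep exits 1 when nothing matches; '|| true' keeps set -e from
# killing the script. (app.log is an illustrative file name.)
matches=$(grep -c "ERROR" app.log || true)
echo "Found ${matches:-0} error lines."

# Safely test a possibly-unset variable under set -u:
# "${DEBUG-}" expands to "" if DEBUG is unset instead of raising an error.
if [ -n "${DEBUG-}" ]; then
    echo "Debug mode is on."
fi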
Recommended Practice: Start most of your scripts with these settings:
#!/usr/bin/env bash
# Exit on error, treat unset variables as errors, pipelines fail on first error
set -euo pipefail
# You can also write them on separate lines:
# set -e
# set -u
# set -o pipefail
# --- Rest of your script ---
echo "Hello, $USER" # This is fine
# mkdir /non_existent_parent/my_dir # With set -e, this failure would stop the script
# echo "$undefined_variable" # With set -u, this would stop the script
# Example of pipefail:
# Assume 'command1' fails (exit 1) and 'command2' succeeds (exit 0)
# command1 | command2
# Without pipefail: $? would be 0 (from command2)
# With pipefail: $? would be 1 (from command1)
These three settings form a great baseline for writing robust scripts.
The `trap` Command: Cleaning Up Gracefully
Sometimes, your script might create temporary files, lock files, establish network connections, or start background processes. What happens if the script exits unexpectedly – maybe because `set -e` triggered, or the user pressed `Ctrl+C`? Those temporary resources might be left behind in an inconsistent state.
The `trap` command allows you to specify one or more commands that should be executed automatically when your script receives certain signals from the operating system or encounters specific conditions.
Signals: Signals are a standard Unix mechanism for notifying processes about events. Some common signals you might want to trap are:
- `EXIT`: This is a special pseudo-signal specific to Bash's `trap`. The command associated with `EXIT` is executed just before the script terminates, regardless of whether it's exiting normally (reaching the end or calling `exit`) or abnormally (due to `set -e`, an uncaught signal, etc.). This is ideal for cleanup tasks.
- `INT`: Short for "Interrupt". Sent when the user presses `Ctrl+C`. Trapping this allows you to perform cleanup before exiting when the user interrupts the script.
- `TERM`: Short for "Terminate". Sent by the system or other processes (like the `kill` command) to request termination. This is a more graceful shutdown request than `KILL`. Trapping `TERM` allows your script to shut down cleanly.
- `QUIT`: Sent when the user presses `Ctrl+\`. Similar to `INT`, but often causes a core dump as well.
- `ERR`: Another Bash pseudo-signal. The command associated with `ERR` is executed whenever a simple command fails (exits with a non-zero status), which allows for custom error handling logic. (If `set -e` is active, the `ERR` trap still runs, but the script will exit immediately afterward.)
Syntax:
trap 'command_or_function_to_run' SIGNAL1 SIGNAL2 ...
- `'command_or_function_to_run'`: A string containing the command(s) to execute when any of the specified signals are received. This is often a call to a dedicated cleanup function. Using single quotes is important to prevent premature expansion of variables or commands within the trap string.
- `SIGNAL1 SIGNAL2 ...`: A list of signal names (like `EXIT`, `INT`, `TERM`, `ERR`) or signal numbers.
Example:
#!/usr/bin/env bash
set -u # Use unset variable check
# Create a temporary directory safely using mktemp
# mktemp -d creates a unique temporary directory
TEMP_DIR=$(mktemp -d health_check_temp_XXXXXX)
echo "Created temporary directory: $TEMP_DIR"
# Flag to indicate if work was completed
WORK_COMPLETED=false
# Define a cleanup function
cleanup() {
local exit_status=${1:-$?} # Get exit status passed to trap, or current $?
echo # Blank line for clarity
echo "--- Running cleanup ---"
if [ -d "$TEMP_DIR" ]; then
echo "Removing temporary directory: $TEMP_DIR"
# Use 'rm -rf' carefully! Only on directories you know you created.
rm -rf "$TEMP_DIR"
echo "Temporary directory removed."
else
echo "Temporary directory '$TEMP_DIR' not found or already removed."
fi
if [ "$WORK_COMPLETED" = true ]; then
echo "Script work was marked as completed."
else
echo "Script did not complete its main work."
fi
echo "Exiting with status: $exit_status"
echo "--- Cleanup finished ---"
# The trap handler should generally exit with the original status
# However, the EXIT trap runs just *before* the final exit,
# so we don't explicitly exit *from* the EXIT trap itself.
# For INT/TERM traps, you might want 'exit $exit_status' at the end.
}
# Set the trap: Call the 'cleanup' function when the script receives
# EXIT, INT (Ctrl+C), or TERM signals.
# The cleanup function will receive the exit status as its first argument ($1)
# for EXIT traps in newer Bash versions when exiting due to a signal.
# We pass the exit status explicitly for INT/TERM.
trap 'cleanup $?' EXIT
trap 'cleanup 130; exit 130' INT # 130 is convention for exit after SIGINT
trap 'cleanup 143; exit 143' TERM # 143 is convention for exit after SIGTERM
# --- Main script logic ---
echo "Doing some simulated work..."
echo "Creating a temporary file..."
touch "$TEMP_DIR/workfile.txt"
echo "Current date: $(date)" > "$TEMP_DIR/workfile.txt"
sleep 10 # Simulate a long-running task
# If the script reaches this point, mark work as completed
WORK_COMPLETED=true
echo "Simulated work finished successfully."
# The EXIT trap will automatically run 'cleanup' when the script ends here.
# If you press Ctrl+C during the 'sleep', the INT trap runs 'cleanup'.
# If 'set -e' was active and a command failed, the EXIT trap would run 'cleanup'.
Using `trap` ensures that your script cleans up after itself, making it more reliable and preventing resource leaks, even when things go wrong. The `EXIT` trap is particularly useful for general-purpose cleanup.
3. Advanced Input/Output Redirection and Pipelines
You're likely familiar with basic I/O redirection:
- `command > file`: Redirect standard output (stdout) to `file`, overwriting it.
- `command >> file`: Redirect stdout to `file`, appending to it.
- `command < file`: Use `file` as the standard input (stdin) for `command`.
- `command1 | command2`: Pipe the stdout of `command1` into the stdin of `command2`.
Now let's explore more sophisticated ways to manipulate input and output streams.
Redirecting Standard Error (`stderr`)
Commands typically produce output on two streams:
- Standard Output (stdout): File descriptor `1`. Used for the main, expected results of the command.
- Standard Error (stderr): File descriptor `2`. Used for error messages, warnings, and diagnostic information.
By default, both stdout and stderr are displayed on your terminal. Redirection allows you to handle them separately or together.
- `command > output.log`: Redirects only stdout (FD 1) to `output.log`. Errors (stderr, FD 2) still appear on the terminal.
- `command 2> error.log`: Redirects only stderr (FD 2) to `error.log`. Normal output (stdout, FD 1) still appears on the terminal.
- `command > output.log 2> error.log`: Redirects stdout to `output.log` AND stderr to `error.log`.
- `command > combined.log 2>&1`: Redirects stdout (FD 1) to `combined.log`, and then redirects stderr (FD 2) to the same place that FD 1 is currently pointing to (which is `combined.log`). This is the standard way to capture all output (both normal and errors) into a single file. The order matters! `2>&1` must come after the initial redirection of stdout.
- `command &> combined.log`: A shorter (Bash-specific) way to achieve the same result as `command > combined.log 2>&1`. Redirects both stdout and stderr to `combined.log`.
Why separate stderr? It allows you to log errors separately from normal output, making it easier to monitor for problems. It also prevents error messages from polluting the data stream if you're piping the command's stdout to another command.
Example:
# Run a command, saving normal output and errors separately
find /etc -name "*.conf" > found_configs.txt 2> find_errors.log
# Run a command, saving all output to one log file for later review
./my_complex_script.sh > script_run.log 2>&1
# Or using the shorter syntax:
# ./my_complex_script.sh &> script_run.log
# Run a command, process its normal output, but ignore errors
ls /etc /nonexistent 2>/dev/null | sort
# /dev/null is a special file that discards anything written to it.
# This command sorts the list of files in /etc, but the error message
# about /nonexistent not being found is discarded.
Here Documents (`<<`)
Sometimes you need to provide multiple lines of input text to a command directly within your script, without creating a separate temporary file. A here document allows you to do exactly this.
Syntax:
command << DELIMITER
This is the first line of input.
Variables like $HOME and $(pwd) are expanded.
Commands like `date` are also executed if backticks are used (less common now).
This is the last line before the delimiter.
DELIMITER
- `command`: The command that will receive the text as its standard input.
- `<< DELIMITER`: Tells the shell to read the following lines as input until it encounters a line containing exactly `DELIMITER`.
- `DELIMITER`: A unique word used to mark the end of the input block. `EOF` (End Of File) is a common convention, but you can use any string that doesn't appear in the input text itself.
- Crucially: The closing `DELIMITER` must be on a line by itself, with no leading or trailing whitespace.
- Expansion: By default, shell variables (`$VAR`), command substitutions (`$(...)`), and arithmetic expansions (`$((...))`) are processed within the here document text before it's passed to the command.
- No Expansion: If you want to prevent any expansion and pass the text literally, quote the delimiter: `<< 'DELIMITER'` or `<< "DELIMITER"`.
Example:
#!/usr/bin/env bash
# Create a simple HTML file using 'cat' and a here document
cat > index.html << EOF
<!DOCTYPE html>
<html>
<head>
<title>My Simple Page</title>
</head>
<body>
<h1>Welcome!</h1>
<p>This page was generated on $(date) by user $USER.</p>
<p>Your home directory is: $HOME</p>
</body>
</html>
EOF
# Note: The 'EOF' above has no leading spaces!
echo "Created index.html:"
cat index.html
echo # Blank line
# Example without expansion (using quoted delimiter)
echo "--- Here document with quoted delimiter ---"
cat << 'END_TEXT'
This will be printed literally.
The variable $USER will not be expanded.
The command $(date) will not be executed.
END_TEXT
Here documents are incredibly useful for embedding configuration file snippets, SQL commands, email bodies, or any multi-line text directly into your scripts.
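For example, a here document can feed SQL statements to a command-line client. A minimal sketch using `sqlite3` (assumes it is installed; the database and table names are illustrative). Note the quoted delimiter, which keeps the SQL free from shell expansion:
# Run a few SQL statements against an SQLite database in one go
sqlite3 inventory.db << 'SQL'
CREATE TABLE IF NOT EXISTS items (name TEXT, qty INTEGER);
INSERT INTO items VALUES ('widget', 42);
SELECT name, qty FROM items;
SQL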
Here Strings (`<<<`)
For providing just a single line or a short string (often from a variable) as standard input to a command, a here string is a more concise alternative to `echo "string" | command`.
Syntax:
command <<< "string"
Or using a variable:
my_string="Error code: 404 - Not Found"
command <<< "$my_string" # Use quotes if string contains spaces or special chars
Example:
# Instead of: echo "123 abc 456" | awk '{print $2}'
# Use a here string:
awk '{print $2}' <<< "123 abc 456" # Output: abc
# Pass a variable's content to grep
error_log_line="[ERROR] Failed to connect to database."
grep "database" <<< "$error_log_line" # Output: [ERROR] Failed to connect to database.
# Perform arithmetic with bc (basic calculator)
result=$(bc <<< "10 * (5 + 2)")
echo "Calculation result: $result" # Output: Calculation result: 70
Here strings are convenient for simple stdin redirection without the overhead of `echo` and a pipe.
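A particularly handy idiom is splitting a string into variables with `read` (the sample data is illustrative):
# Split one line into named fields without a pipe or subshell
line="alice 30 admin"
read -r name age role <<< "$line"
echo "name=$name age=$age role=$role" # name=alice age=30 role=admin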
Process Substitution (`<()` and `>()`)
This is a more advanced and extremely powerful feature (available in Bash, Zsh, and Ksh, but not strictly POSIX sh). Process substitution allows you to treat the output of a command (or the input to a command) as if it were a file, without actually creating a named temporary file on disk.
The shell handles the magic behind the scenes, usually using named pipes (`mkfifo`) or file descriptors (`/dev/fd/...`).
- `<(command)`: Runs `command` asynchronously. Its standard output is connected to a special file descriptor (e.g., `/dev/fd/63`) or a named pipe. The `<(...)` construct then expands to the name of this file descriptor/pipe. You can use this name anywhere a command expects a filename as input, allowing you to feed the output of one command directly into another command that expects a file argument.
- `>(command)`: Runs `command` asynchronously. Its standard input is connected to a special file descriptor or named pipe. The `>(...)` construct expands to the name of this file descriptor/pipe. You can use this name anywhere a command expects a filename for output, allowing you to write output directly into the input stream of another command.
Use Cases for `<()` (Input):
Imagine a command like `diff`, which normally compares two files. What if you want to compare the output of two commands? Process substitution makes this easy:
# Compare the list of files in /bin with the list of files in /usr/bin
# Without process substitution, you'd need temporary files:
# ls /bin > /tmp/bin_list.txt
# ls /usr/bin > /tmp/usr_bin_list.txt
# diff /tmp/bin_list.txt /tmp/usr_bin_list.txt
# rm /tmp/bin_list.txt /tmp/usr_bin_list.txt
# With process substitution:
diff <(ls /bin) <(ls /usr/bin)
# The shell runs 'ls /bin', connects its output to e.g. /dev/fd/63
# The shell runs 'ls /usr/bin', connects its output to e.g. /dev/fd/62
# The shell then executes: diff /dev/fd/63 /dev/fd/62
# No temporary files are manually created or cleaned up!
# Join files based on output of commands
join <(sort file1.txt) <(sort file2.txt)
Use Cases for `>()` (Output):
This is often used with commands like `tee`, which reads from stdin and writes to stdout and to one or more files. Process substitution lets `tee` write not just to files, but directly into the input of other commands.
# Log the output of 'make' to a file, and also filter it for errors
# and save errors to another file, without intermediate files.
# 'make' command output goes into 'tee'
# 'tee' writes one copy to 'build.log' (a regular file)
# 'tee' writes another copy to the process substitution '>(grep ...)'
# The 'grep' command receives the output as its stdin and writes matches to 'build_errors.log'
make | tee build.log >(grep -i 'error\|warning' > build_errors.log)
# Compare this to the non-process substitution way:
# make > /tmp/build_output.log
# cp /tmp/build_output.log build.log
# grep -i 'error\|warning' /tmp/build_output.log > build_errors.log
# rm /tmp/build_output.log
# Send output to multiple log processing pipelines simultaneously
# Note: The final >/dev/null might be needed because tee's own stdout
# (which is just a copy of its input here) might not be desired.
complex_data_generator | tee >(process_type_A > typeA.log) >(process_type_B > typeB.log) > /dev/null
Process substitution is a powerful way to connect commands in complex ways beyond simple pipelines, reducing the need for temporary files and making scripts cleaner and potentially more efficient.
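Another common real-world use of `<(...)` is feeding a `while read` loop: piping into `while` runs the loop in a subshell, so variables set inside are lost, whereas redirecting from a process substitution keeps the loop in the current shell. A minimal sketch (assumes a Linux system with `getent`; the UID cutoff of 1000 is an illustrative convention for regular users):
#!/usr/bin/env bash

count=0
# 'getent passwd | while read ...' would increment a copy of 'count'
# inside a subshell, and the result would vanish after the loop.
# Reading from <(...) keeps the loop in the current shell.
while IFS=: read -r user _ uid _; do
    if [ "$uid" -ge 1000 ]; then
        count=$((count + 1))
    fi
done < <(getent passwd)

echo "Accounts with UID >= 1000: $count"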
4. Regular Expressions with `grep`, `sed`, and `awk`
Regular expressions (regex or regexp) are sequences of characters that define a search pattern. They are an incredibly powerful tool for matching and manipulating text based on patterns, rather than just fixed strings. Mastering regex is a skill in itself, but understanding the basics and how to use them with common shell tools is essential for advanced scripting.
Three indispensable command-line tools that heavily utilize regular expressions are:
- `grep` (Global Regular Expression Print): Searches for lines containing text that matches a given pattern within files or standard input. It then prints the matching lines (by default).
- `sed` (Stream Editor): Reads text line by line (from files or stdin), applies editing commands (often based on regex patterns), and prints the modified text to stdout. It's commonly used for search-and-replace operations.
- `awk`: A versatile pattern-scanning and text-processing language. It reads input line by line, automatically splits each line into fields (columns, typically separated by whitespace), and allows you to perform actions based on patterns (including regex) or field values. It's excellent for extracting data, generating reports, and performing calculations on text data.
Basic Regex Concepts
While regex can get very complex, here are some fundamental building blocks:
- Literal Characters: Most characters (like `a`, `b`, `1`, `_`, `-`) match themselves literally.
- Metacharacters: Special characters with meanings:
  - `.` (dot): Matches any single character (except newline).
  - `*`: Matches the preceding item zero or more times. E.g., `a*` matches `""`, `a`, `aa`, `aaa`.
  - `+`: Matches the preceding item one or more times (ERE/PERL). E.g., `a+` matches `a`, `aa`, `aaa`, but not `""`.
  - `?`: Matches the preceding item zero or one time (ERE/PERL). E.g., `colou?r` matches `color` and `colour`.
  - `^`: Matches the beginning of the line. E.g., `^Error` matches lines starting with "Error".
  - `$`: Matches the end of the line. E.g., `\.log$` matches lines ending with ".log".
  - `[]` (Character Set): Matches any single character inside the brackets. E.g., `[aeiou]` matches any lowercase vowel. `[0-9]` matches any digit. `[^0-9]` matches any character that is not a digit.
  - `|` (Alternation): Acts like "OR" (ERE/PERL). E.g., `error|warning` matches lines containing "error" or "warning".
  - `()` (Grouping): Groups parts of the regex together. Used with repetition or alternation (ERE/PERL). E.g., `(ab)+` matches `ab`, `abab`, `ababab`.
  - `\` (Escape): Removes the special meaning of a metacharacter. E.g., `\.` matches a literal dot, `\*` matches a literal asterisk.
- Quantifiers: Control how many times an item matches (`*`, `+`, `?`, `{n}`, `{n,}`, `{n,m}`).
  - `{n}`: Matches the preceding item exactly `n` times. E.g., `[0-9]{4}` matches exactly four digits.
  - `{n,}`: Matches the preceding item `n` or more times. E.g., `[a-z]{3,}` matches three or more lowercase letters.
  - `{n,m}`: Matches the preceding item between `n` and `m` times (inclusive). E.g., `[0-9]{2,4}` matches two, three, or four digits.
- ERE vs BRE: There are different "flavors" of regex. The two main ones in shell tools are Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE). ERE (enabled with `grep -E`, `sed -E` or `sed -r` on some systems, and default in `awk`) is generally more powerful and intuitive, as characters like `+`, `?`, `|`, `()` have their special meanings directly. In BRE (default for `grep` and `sed`), these characters need to be escaped (`\+`, `\?`, `\|`, `\(\)`) to have their special meaning, making patterns harder to read. It's often easier to use ERE.
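To make the BRE/ERE difference concrete, here is the same quantified match written in both flavors (the file name `notes.txt` is illustrative):
# BRE (default): braces must be escaped to act as a quantifier
grep 'ab\{2,\}' notes.txt   # matches 'a' followed by two or more 'b's

# ERE (-E): the same pattern, written naturally
grep -E 'ab{2,}' notes.txt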
Using grep
# Find lines containing the literal string "ERROR" in /var/log/syslog
grep "ERROR" /var/log/syslog
# Find lines containing "ERROR" (case-insensitive)
grep -i "error" /var/log/syslog
# Find lines containing "error" OR "warning" (case-insensitive, using ERE)
grep -E -i "error|warning" /var/log/syslog
# Find lines STARTING with a date like YYYY-MM-DD (using ERE)
grep -E "^[0-9]{4}-[0-9]{2}-[0-9]{2}" /var/log/messages
# Count the number of matching lines
grep -c "Failed password" /var/log/auth.log
# Print lines that DO NOT match the pattern
grep -v "DEBUG" application.log
# Print only the matching part of the line (GNU grep specific)
grep -o -E "[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+" access.log # Extract IP addresses
# Search recursively in a directory
grep -r "API_KEY" /etc/myapp/
Using sed
`sed` operates on a stream of text. Its most common use is the `s` (substitute) command.
# Syntax: sed 's/PATTERN/REPLACEMENT/FLAGS' input.txt
# Replace the FIRST occurrence of "apple" with "orange" on each line
sed 's/apple/orange/' input.txt
# Replace ALL occurrences of "apple" with "orange" on each line (g = global flag)
sed 's/apple/orange/g' input.txt
# Replace "red" or "blue" with "green" (using ERE and global flag)
sed -E 's/red|blue/green/g' input.txt
# Replace case-insensitively (GNU sed specific 'i' flag)
sed 's/error/ERROR/gi' input.txt
# Delete lines containing the word "temporary"
sed '/temporary/d' input.txt
# Print only lines containing "user=" (default is to print all lines)
sed -n '/user=/p' config.txt # -n suppresses default printing, p explicitly prints matches
# Insert text BEFORE lines matching a pattern
sed '/^\[SectionA\]/i # This is the configuration for Section A' config.ini
# Append text AFTER lines matching a pattern
sed '/^\[SectionB\]/a # End of Section B configuration' config.ini
# Edit a file directly, creating a backup first (use with caution!)
# Creates config.ini.bak before modifying config.ini
sed -i.bak 's/old_value/new_value/g' config.ini
# Edit in-place without backup (more dangerous!)
# sed -i 's/old_value/new_value/g' config.ini
Using awk
`awk` processes input field by field, making it great for columnar data.
# Default: Fields ($1, $2, $3, ...) are separated by whitespace. $0 is the whole line.
# Print the first field (column) of each line from /etc/passwd (separator is ':')
awk -F':' '{print $1}' /etc/passwd # -F specifies the field separator
# Print the username (field 1) and shell (field 7) from /etc/passwd
awk -F':' '{print "User:", $1, " Shell:", $7}' /etc/passwd
# Print lines where the third field (e.g., user ID) is greater than 999
awk -F':' '$3 > 999 {print $0}' /etc/passwd # {action} runs if pattern ($3 > 999) is true
# Calculate the sum of numbers in the second column of data.txt
# BEGIN block runs once before processing input.
# END block runs once after all lines are processed.
awk '{ sum += $2 } END { print "Total sum:", sum }' data.txt
# Print lines matching a regex AND where a field has a certain value
# Print lines containing "error" where the 5th field is "critical"
awk '$5 == "critical" && /error/ { print "Critical Error Found:", $0 }' logfile.log
# Change the output field separator
awk -F',' 'BEGIN {OFS=" | "} {print $1, $3, $2}' input.csv # Print fields 1, 3, 2 separated by " | "
Learning `grep`, `sed`, and `awk`, along with basic regular expressions, dramatically increases your ability to manipulate and extract information from text, which is a core task in much shell scripting and automation.
5. Command Substitution: Capturing Command Output
Often in a script, you need to run a command and then use its output as:
- The value assigned to a variable.
- Part of a string.
- An argument to another command.
This is achieved using command substitution.
Syntax (Modern and Recommended): `$(command_to_execute)`
The shell executes `command_to_execute` inside the parentheses. It captures the standard output of that command, removes any trailing newline characters, and substitutes the resulting string back into the command line or assignment.
Syntax (Legacy - Avoid if possible): `` `command_to_execute` `` (backticks)
This older syntax does the same thing, but it has several disadvantages:
- Readability: Backticks can be easily confused with single quotes (`' '`).
- Nesting: Nesting command substitutions with backticks requires awkward backslash escaping and is much harder to read than the nested `$(...)` form (`$(cmd1 $(cmd2))`).
Always prefer the `$(...)` syntax.
Example:
#!/usr/bin/env bash
set -u # Ensure variables are set before use
# Get the current date and time in a specific format
current_timestamp=$(date +"%Y-%m-%d_%H-%M-%S")
echo "Script started at: $current_timestamp"
# Get the number of lines in a file
config_file="/etc/ssh/sshd_config"
if [ -f "$config_file" ]; then
line_count=$(wc -l < "$config_file") # Use input redirection for efficiency
# Trim leading/trailing whitespace that wc might add
line_count=$(echo $line_count)
echo "The file '$config_file' has $line_count lines."
else
echo "Config file '$config_file' not found."
fi
# Get the current working directory
working_dir=$(pwd)
echo "Current directory is: $working_dir"
# Create a backup filename incorporating the timestamp and hostname
hostname=$(hostname -s) # Use -s for short hostname
backup_filename="/backups/${hostname}_backup_${current_timestamp}.tar.gz"
echo "Proposed backup filename: $backup_filename"
# Use command output directly within another command's arguments
# Find files modified in the last 2 days
echo "Files modified recently in /etc:"
# Note: find command arguments can be complex, use quotes carefully
find /etc -maxdepth 1 -type f -mtime -2 -exec ls -ld {} \;
# Using output as part of an echo command
echo "System Load Average: $(uptime | awk -F'load average: ' '{print $2}')"
Command substitution is fundamental for making scripts dynamic, allowing them to gather information from the system or other commands and act upon it.
6. Processing Script Arguments Professionally: getopts
While accessing arguments directly using `$1`, `$2`, `$#`, etc., works for very simple scripts, it quickly becomes cumbersome and non-standard when you want to implement features common in command-line tools:
- Options (Flags): Arguments starting with a hyphen, like `-v` (verbose) or `-h` (help).
- Options with Arguments: Options that require a value, like `-f filename` or `-o output.log`.
- Optional Arguments: Some options or arguments might not always be required.
- Order Independence: Users should ideally be able to provide options in any order (e.g., `-v -f file` vs `-f file -v`).
- Error Handling: Detecting invalid options or missing arguments for options that require them.
Manually parsing the `$@` array to handle all these cases is complex and error-prone. Thankfully, Bash provides a built-in command specifically for this purpose: `getopts`.
`getopts` parses options and their arguments from the script's positional parameters (`$@`) according to Unix conventions. It's used inside a `while` loop.
Basic Structure:
#!/usr/bin/env bash
# --- Default values for settings ---
verbose=false # Use true/false or 0/1
output_file=""
input_file=""
# --- Function to display help message ---
usage() {
echo "Usage: $0 [-v] [-o <output_file>] [-f <input_file>] [remaining_args...]"
echo "Options:"
echo " -v Enable verbose output"
echo " -o <file> Specify output file"
echo " -f <file> Specify input file (required)"
echo " -h Display this help message"
exit 1
}
# --- Option parsing loop ---
# The 'optstring' defines the valid option letters.
# - A letter by itself (e.g., 'v', 'h') is a simple flag.
# - A letter followed by a colon (e.g., 'o:', 'f:') means that option requires an argument.
# - A leading colon in the optstring (e.g., ':vho:f:') enables "silent" error handling.
# Instead of printing its own errors, getopts will:
# - Set the 'opt' variable to '?' for an invalid option.
# - Set the 'opt' variable to ':' for a missing option argument.
# - Store the invalid option character or the option missing an argument in OPTARG.
while getopts ":vho:f:" opt; do
case $opt in
v)
# Option -v was found
verbose=true
;;
h)
# Option -h was found
usage
;;
o)
# Option -o was found, its argument is in $OPTARG
output_file="$OPTARG"
;;
f)
# Option -f was found, its argument is in $OPTARG
input_file="$OPTARG"
;;
\?)
# Invalid option found (stored in $OPTARG)
echo "Error: Invalid option: -$OPTARG" >&2
usage
;;
:)
# Option requires an argument, but none was given (option char in $OPTARG)
echo "Error: Option -$OPTARG requires an argument." >&2
usage
;;
esac
done
# --- Shift away processed options ---
# $OPTIND is the index of the next argument to be processed after getopts finishes.
# 'shift' removes arguments from the beginning of the positional parameters ($@).
# This command removes all the options and their arguments that getopts processed,
# leaving only the remaining non-option arguments in $@.
shift $((OPTIND-1))
# --- Validate required arguments and use the settings ---
# Example: Check if the required -f option was provided
if [ -z "$input_file" ]; then
echo "Error: Input file must be specified with -f." >&2
usage
fi
echo "--- Settings ---"
echo "Verbose: $verbose"
echo "Output File: '$output_file'"
echo "Input File: '$input_file'"
echo "Remaining Arguments: $@" # Display any arguments left after options
# --- Main script logic starts here ---
echo # Blank line
echo "Starting main script logic..."
# Use the variables $verbose, $output_file, $input_file, and $@ here
# Example of using verbose flag
if [ "$verbose" = true ]; then
echo "Verbose mode enabled. Performing extra logging..."
fi
# Example of using output file
if [ -n "$output_file" ]; then
echo "Output will be directed to $output_file"
# exec > "$output_file" # Example: Redirect script's stdout
else
echo "Output will be sent to standard output."
fi
echo "Processing input from $input_file..."
if [ ! -f "$input_file" ]; then
echo "Error: Input file '$input_file' not found!" >&2
exit 1
fi
# ... process the file ...
echo "Script finished."
Key elements explained:
- `getopts optstring varname`: The core command.
  - `optstring`: Defines valid options and whether they take arguments (e.g., `":vho:f:"`). The leading colon enables silent error handling.
  - `varname` (e.g., `opt`): The variable that `getopts` sets in each iteration of the loop to the option character found (e.g., `v`, `h`, `o`, `f`, or `?`, `:` for errors).
- `while getopts ...; do ... done`: The loop continues as long as `getopts` finds valid options in `$@`.
- `case $opt in ... esac`: Used to handle the different option characters found by `getopts`.
- `$OPTARG`: When an option requires an argument (like `-o file`), `getopts` stores that argument (`file`) in this variable.
- `$OPTIND`: `getopts` maintains this variable, which holds the index of the next argument in `$@` to be processed. It starts at 1.
- `shift $((OPTIND-1))`: This crucial step removes all the options and their arguments (which `getopts` has already processed) from the list of positional parameters (`$@`). After the `shift`, `$@` will only contain the remaining arguments that were not options (e.g., if the command was `./script.sh -v -f input.txt data1 data2`, after the shift, `$@` would contain `"data1" "data2"`).
Using `getopts` is the standard, robust way to handle command-line options in shell scripts, making them behave like familiar Unix utilities.
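Assuming the example script above is saved as `process.sh` (an illustrative name), typical invocations would look like this:
# Flags and option arguments may appear in any order; remaining
# non-option arguments are left in $@ after the shift.
./process.sh -f input.txt -v data1 data2
./process.sh -v -o results.log -f input.txt

# Omitting the required -f triggers the validation error and usage text
./process.sh data1

# An unrecognized option is caught by the '\?' case
./process.sh -x -f input.txt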
7. Automation: Scheduling Tasks with cron
Writing a useful script is great, but its power is truly unlocked when you can make it run automatically without your intervention. The classic and most common tool for scheduling tasks on Unix-like systems (Linux, macOS, BSD) is `cron`.
`cron` is a system daemon (a background service) that wakes up every minute, checks configuration files (called crontabs), and executes any commands scheduled to run at that specific minute.
The Crontab File
Each user on the system can have their own crontab file, specifying jobs to be run as that user. There is also typically a system-wide crontab (often `/etc/crontab` or files in `/etc/cron.d/`) for system administration tasks.
To edit your personal crontab, use the command:
crontab -e
This will open your crontab file in the default command-line text editor (like `nano`, `vi`, or `vim`). If you haven't used it before, it might be empty or contain comments explaining the format.
To simply view your current crontab without editing, use:
crontab -l
To remove your entire crontab, use (with caution!):
crontab -r
Crontab Format
Each line in a crontab file defines a single scheduled job (or is a comment starting with #
). The format consists of 5 time/date fields followed by the command to execute:
# Use '#' for comments.
# Format: minute hour day_of_month month day_of_week command_to_run
# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12) OR jan,feb,mar,...
# │ │ │ │ ┌───────────── day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,...
# │ │ │ │ │
# * * * * * command_to_execute
Field Values:
- `*`: Asterisk means "any value" or "every". For example, `*` in the minute field means "every minute".
- `number`: A specific value (e.g., `30` in the minute field means "at 30 minutes past the hour").
- `value1,value2`: A list of specific values (e.g., `0,15,30,45` in the minute field means run at 0, 15, 30, and 45 minutes past the hour).
- `start-end`: A range of values (e.g., `9-17` in the hour field means from 9 AM to 5 PM inclusive).
- `*/step`: A step value. `*/15` in the minute field means "every 15 minutes" (equivalent to `0,15,30,45`). `0 */2 * * *` means "at minute 0 every 2 hours".
- `command_to_execute`: This is the command or script you want to run. Crucially, always use absolute paths for your scripts and commands within crontabs, because `cron` runs with a very minimal environment and may not have your usual `$PATH` settings.
Example Crontab Entries:
# Comments are good practice! Explain what each job does.
# Run the health check script every hour at 15 minutes past the hour
# and append its output (stdout & stderr) to a log file.
15 * * * * /home/myuser/scripts/health_check.sh >> /home/myuser/logs/health_check.log 2>&1
# Perform a full backup using a backup script every Sunday at 2:30 AM.
# Discard normal output, but potentially log errors (often handled inside the script).
30 2 * * 0 /usr/local/bin/my_backup_script.sh --full > /dev/null
# Check for software updates every day at 4:00 AM (example for Debian/Ubuntu)
# Ensure PATH is set if needed, or use full paths like /usr/bin/apt-get
# 0 4 * * * /usr/bin/apt-get update && /usr/bin/apt-get -y upgrade > /var/log/apt/cron_upgrade.log 2>&1
# Run a custom data processing script every 10 minutes during work hours (Mon-Fri, 9am-5pm)
*/10 9-17 * * 1-5 /opt/data_scripts/process_incoming.sh --config /etc/data_scripts/config.ini
# Clean up temporary files older than 7 days every Monday at 1 AM
0 1 * * 1 /usr/bin/find /tmp -type f -mtime +7 -delete
Important Considerations for Cron Jobs:
- Absolute Paths: Always use full paths for scripts and commands (e.g., `/home/user/scripts/myscript.sh`, `/usr/bin/python3`). Don't rely on the `$PATH` environment variable, which is very limited in `cron`.
- Environment: `cron` runs jobs with a minimal environment. If your script relies on specific environment variables (like `JAVA_HOME`, `PYTHONPATH`, etc.), you must either:
  - Define them at the top of the crontab file (e.g., `MAILTO=""`, `PATH=/usr/local/bin:/usr/bin:/bin`).
  - Define or `source` them inside your script.
- Permissions: Ensure that the script you are scheduling is executable by the user whose crontab you are editing (`chmod +x /path/to/your/script.sh`).
- Output Redirection: By default, `cron` captures any stdout or stderr produced by the command and tries to email it to the user who owns the crontab. This is often undesirable. It's standard practice to redirect output:
  - `>> /path/to/logfile.log 2>&1`: Append both stdout and stderr to a log file.
  - `> /dev/null 2>&1`: Discard all output (if you only care that the job runs or if the script handles its own logging).
- Working Directory: Cron jobs usually run from the user's home directory by default. If your script expects to be run from a specific directory (e.g., to find relative config files), you should `cd` to that directory within the crontab command or at the beginning of your script: `* * * * * cd /path/to/app && ./run_app.sh`
- Locking: If a script takes a long time to run and is scheduled frequently (e.g., every minute), you might end up with multiple instances running simultaneously. Consider implementing a locking mechanism (e.g., using `flock` or creating/checking for a lock file) within your script to prevent this; a sketch follows this list.
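As a concrete illustration of that last point, here is a minimal `flock`-based guard (Linux; the lock-file path is illustrative) that makes overlapping runs exit instead of stacking up:
#!/usr/bin/env bash
set -euo pipefail

# Open file descriptor 200 on the lock file, then try to take an
# exclusive, non-blocking lock on it. If another instance already
# holds the lock, flock fails and we exit instead of running twice.
exec 200>/var/lock/health_check.lock
if ! flock -n 200; then
    echo "Another instance is already running; exiting." >&2
    exit 1
fi

# ... long-running work goes here; the lock is released when the script exits ...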
`cron` is the standard, reliable way to schedule routine tasks, forming the backbone of automation on many Unix-like systems.
Conclusion
This chapter has equipped you with a powerful set of tools and techniques to move beyond basic shell scripting. You've learned how to create modular and reusable code with functions, how to make scripts resilient through robust error handling (`$?`, `set -euo pipefail`, `trap`), how to master input and output streams (`>&`, `<<`, `<<<`, `<()`, `>()`), how to harness the pattern-matching power of regular expressions with `grep`, `sed`, and `awk`, how to capture command results using command substitution (`$()`), how to parse command-line arguments professionally with `getopts`, and finally, how to automate script execution using the `cron` scheduler.
These concepts are the building blocks for creating sophisticated, reliable, and automated solutions for system administration, data processing, development workflows, and countless other tasks. Like any skill, mastery comes through practice. Apply these techniques to your own scripting challenges.
The following workshop chapter provides a hands-on project where you'll combine many of these advanced techniques to build a practical system monitoring script.
Workshop - Building a System Health Check Script
Objective
In this workshop, we will apply the concepts learned in the "Advanced Shell Scripting and Automation" chapter to build a practical, command-line system health monitoring script. This project provides hands-on experience with functions, error handling (`set`, `$?`, `trap`), command substitution, text processing (`awk`, `grep`), argument parsing (`getopts`), and generating formatted output.
Our script, `health_check.sh`, will perform the following tasks:
- Check the current CPU load average (1-minute).
- Check the percentage of available RAM.
- Check the percentage of used disk space for specified filesystems (defaulting to the root filesystem `/`).
- Allow the user to specify warning thresholds for CPU load and disk usage via command-line options.
- Allow the user to specify exactly which filesystem mount points to check via command-line options.
- Generate a clearly formatted report, highlighting any values that exceed the defined warning thresholds.
- Include a basic cleanup mechanism using `trap`.
Prerequisites
- Access to a Linux or macOS terminal (the script will be primarily geared towards Linux due to command differences, e.g., `free`, but notes for macOS will be included where feasible).
- A text editor (like `nano`, `vim`, `emacs`, VS Code, etc.).
- Basic shell commands: `touch`, `chmod`, `./script.sh` execution.
- Standard Unix utilities available: `uptime`, `free` (Linux) / `sysctl` / `vm_stat` (macOS), `df`, `awk`, `grep`, `sed`, `echo`, `date`, `mktemp`, `rm`.
We will assume very little prior advanced scripting knowledge, reinforcing the concepts from the previous chapter.
Let's Begin!
We'll build the script step by step, adding features incrementally.
Step 1: Initial Script Setup and Basic Structure
First, let's create the script file, make it executable, and add the essential boilerplate, including our robustness settings and default values.
- Create the file: `touch health_check.sh`
- Make it executable: `chmod +x health_check.sh`
- Open `health_check.sh` in your text editor and add the following initial structure:
#!/usr/bin/env bash
# --- Robustness Settings ---
# Exit immediately if a command exits with a non-zero status.
set -e
# Treat unset variables as an error when substituting.
set -u
# Prevent errors in a pipeline from being masked.
set -o pipefail
# --- Default Configuration ---
# Warning threshold for 1-minute load average (use floating point)
readonly DEFAULT_LOAD_WARN_THRESHOLD="5.0"
# Warning threshold for disk usage percentage (integer)
readonly DEFAULT_DISK_WARN_THRESHOLD="85"
# Filesystems (mount points) to check by default (root filesystem)
# Using an array to hold multiple values
readonly DEFAULT_FILESYSTEMS_TO_CHECK=("/")
# Temporary directory base name
readonly TEMP_DIR_BASENAME="health_check_temp"
# --- Script Variables (will be populated by options or defaults) ---
LOAD_WARN_THRESHOLD=""
DISK_WARN_THRESHOLD=""
# Declare an array to hold the list of filesystems we'll actually check
declare -a FILESYSTEMS_TO_CHECK=()
# Variable to hold the path to our temporary directory
SCRIPT_TEMP_DIR=""
# --- Function Declarations (will be added later) ---
# We'll define functions like usage(), check_load(), check_memory(), etc. here
# --- Cleanup Function ---
cleanup() {
# This function will be called by our trap
local exit_status=${1:-$?} # Capture exit status if passed, otherwise use current $?
echo # Add a newline for visual separation in output
echo "--- Running cleanup ---"
# Check if the temp directory variable was set and if the directory exists
if [ -n "$SCRIPT_TEMP_DIR" ] && [ -d "$SCRIPT_TEMP_DIR" ]; then
echo "Removing temporary directory: $SCRIPT_TEMP_DIR"
# rm -rf is powerful, ensure we only remove what we created
rm -rf "$SCRIPT_TEMP_DIR"
echo "Temporary directory removed."
# else
# Optional: Add message if temp dir wasn't created or already removed
# echo "Temporary directory not found or not created."
fi
echo "Exiting with status: $exit_status"
echo "--- Cleanup finished ---"
# We don't explicitly exit here in the EXIT trap handler itself
}
# --- Trap Setup ---
# Call the 'cleanup' function automatically when the script EXITS
# (normally or due to error/signal). Pass the exit status to cleanup.
trap 'cleanup $?' EXIT
# Optionally add traps for specific signals like INT (Ctrl+C) or TERM
# trap 'cleanup 130; exit 130' INT
# trap 'cleanup 143; exit 143' TERM
# --- Main Logic Function ---
main() {
# Create a temporary directory for this script run
# 'mktemp -d' creates a unique directory based on the template
SCRIPT_TEMP_DIR=$(mktemp -d "${TEMP_DIR_BASENAME}_XXXXXX")
echo "Using temporary directory: $SCRIPT_TEMP_DIR" # Log temp dir creation
# --- Option Parsing (will be added in Step 2) ---
# For now, just use defaults
LOAD_WARN_THRESHOLD="$DEFAULT_LOAD_WARN_THRESHOLD"
DISK_WARN_THRESHOLD="$DEFAULT_DISK_WARN_THRESHOLD"
# Copy the default array elements to our working array
FILESYSTEMS_TO_CHECK=("${DEFAULT_FILESYSTEMS_TO_CHECK[@]}")
# --- Start the Checks ---
echo # Blank line for spacing
echo "--- System Health Check Report ---"
# Use command substitution to include the date
echo "Report generated on: $(date)"
echo "Warning Thresholds: CPU Load >= $LOAD_WARN_THRESHOLD, Disk Usage >= $DISK_WARN_THRESHOLD%"
echo # Blank line
# --- Call Check Functions (will be implemented in Step 3) ---
# check_cpu_load
# check_memory
# check_disk_usage
echo # Blank line
echo "--- Health Check Complete ---"
# Note: The cleanup function runs automatically after this via the EXIT trap
}
# --- Script Entry Point ---
# Call the main function, passing all script arguments ($@) to it.
# This allows main() to later parse options passed to the script.
main "$@"
Explanation:
- `set -euo pipefail`: Our standard safety net.
- `readonly` defaults: We define the default thresholds and the default filesystem array using `readonly` to indicate they are constants. Using uppercase names is a convention for constants.
- Script variables: We declare the variables that will hold the active configuration for this run (potentially overridden by options later). `declare -a` explicitly creates an array.
- `cleanup()` function: Defines the actions to take just before the script exits. It checks that the temporary directory variable (`SCRIPT_TEMP_DIR`) was set and that the directory actually exists before attempting removal. This prevents errors if the script fails before `mktemp` runs.
- `trap 'cleanup $?' EXIT`: This is the crucial line that registers the `cleanup` function to be executed automatically whenever the script exits, for any reason.
- `mktemp -d ...`: This command safely creates a unique temporary directory. Using `mktemp` is much safer than manually creating directories like `/tmp/my_temp` because it avoids race conditions and potential security issues if multiple instances run or if the directory name is predictable. We store the created path in `SCRIPT_TEMP_DIR`.
- `main()` function: We encapsulate the core logic here, which improves organization. It currently initializes variables with defaults and prints the basic report structure.
- `main "$@"`: This executes the `main` function and passes all command-line arguments received by the script (`$@`) into it. This is essential for `getopts` later.
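Before moving on, it may help to see the `mktemp` + `trap` pairing in isolation. Below is a minimal standalone sketch (the directory prefix is hypothetical) showing that the cleanup fires even when `set -e` aborts the script:

```bash
#!/usr/bin/env bash
set -euo pipefail
# Create a unique temporary directory; mktemp fills in the random suffix
work_dir=$(mktemp -d "demo_XXXXXX")
# Register removal for every exit path: success, failure, or signal
trap 'rm -rf "$work_dir"' EXIT
echo "Working in $work_dir"
false   # simulate a failure: set -e exits here, yet the trap still runs
```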
Testing Step 1:
Save the script and run it:
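Assuming you saved the file as `health_check.sh` (the name used throughout this workshop), make it executable and run it:

```bash
chmod +x health_check.sh
./health_check.sh
```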
You should see output similar to this (the temp dir name will vary):
Using temporary directory: health_check_temp_ABCDEF
--- System Health Check Report ---
Report generated on: Tue Apr 23 10:30:00 BST 2024
Warning Thresholds: CPU Load >= 5.0, Disk Usage >= 85%
--- Health Check Complete ---
--- Running cleanup ---
Removing temporary directory: health_check_temp_ABCDEF
Temporary directory removed.
Exiting with status: 0
--- Cleanup finished ---
Also try pressing `Ctrl+C` while it's running (if you add a `sleep` command inside `main` for testing) – the cleanup should still occur.
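To stage that interruption test, a hypothetical one-line addition inside `main()` buys you time:

```bash
sleep 30   # temporary test aid: press Ctrl+C during this pause
```

You should still see the `--- Running cleanup ---` messages, and the reported exit status will typically be 130 (128 + SIGINT).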
Step 2: Add Command-Line Option Parsing with getopts
Now, let's make the script more flexible by allowing users to override defaults using command-line options:
- `-l <load>`: Set CPU load warning threshold.
- `-d <percent>`: Set disk usage warning threshold.
- `-f <mount_point>`: Specify a filesystem mount point to check (can be used multiple times).
- `-h`: Display a help message.

- Add the `usage()` function (place it before the `main()` function definition):
# --- Helper Function: Display Usage ---
usage() {
# Using a here document for the multi-line message
cat << EOF
Usage: $0 [-l <load_threshold>] [-d <disk_threshold_percent>] [-f <mount_point>] [-h]
Performs basic system health checks.
Options:
-l <load> Warning threshold for 1-min CPU load average (float).
Default: ${DEFAULT_LOAD_WARN_THRESHOLD}
-d <percent> Warning threshold for disk usage percentage (integer).
Default: ${DEFAULT_DISK_WARN_THRESHOLD}
-f <mount> Filesystem mount point to check (can be specified multiple times).
Default: Check only root filesystem ('${DEFAULT_FILESYSTEMS_TO_CHECK[*]}')
-h Display this help message and exit.
EOF
# Exit with a non-zero status after showing help
exit 1
}
- Modify the `main()` function to include the `getopts` loop:
# --- Main Logic Function ---
main() {
# Create a temporary directory for this script run
SCRIPT_TEMP_DIR=$(mktemp -d "${TEMP_DIR_BASENAME}_XXXXXX")
# echo "Using temporary directory: $SCRIPT_TEMP_DIR" # Can comment out if too noisy
# --- Initialize with defaults BEFORE parsing options ---
LOAD_WARN_THRESHOLD="$DEFAULT_LOAD_WARN_THRESHOLD"
DISK_WARN_THRESHOLD="$DEFAULT_DISK_WARN_THRESHOLD"
# Important: Make a copy of the default array. If the user uses -f, we clear this copy.
FILESYSTEMS_TO_CHECK=("${DEFAULT_FILESYSTEMS_TO_CHECK[@]}")
# Flag to track if user specified any filesystems via -f
local user_specified_filesystems=false
# --- Option Parsing Loop ---
while getopts ":l:d:f:h" opt; do
case $opt in
l)
LOAD_WARN_THRESHOLD="$OPTARG"
# Basic validation: check if it looks like a number (integer or float)
if ! [[ "$OPTARG" =~ ^[0-9]+(\.[0-9]+)?$ ]]; then
echo "Error: Invalid load threshold specified with -l: '$OPTARG'. Must be a number." >&2
usage
fi
;;
d)
DISK_WARN_THRESHOLD="$OPTARG"
# Basic validation: check if it's an integer
if ! [[ "$OPTARG" =~ ^[0-9]+$ ]]; then
echo "Error: Invalid disk threshold specified with -d: '$OPTARG'. Must be an integer percentage." >&2
usage
fi
;;
f)
# If this is the first time -f is used, clear the default array
if [ "$user_specified_filesystems" = false ]; then
FILESYSTEMS_TO_CHECK=()
user_specified_filesystems=true
fi
# Add the specified filesystem mount point to our array
FILESYSTEMS_TO_CHECK+=("$OPTARG")
;;
h)
usage # Display help and exit
;;
\?)
# Invalid option
echo "Error: Invalid option specified: -$OPTARG" >&2
usage
;;
:)
# Missing option argument
echo "Error: Option -$OPTARG requires an argument." >&2
usage
;;
esac
done
# --- Shift away processed options ---
# Remove options and their arguments from $@
shift $((OPTIND-1))
# --- Argument Validation ---
# Check if any arguments remain after options (we don't expect any for this script)
if [ $# -gt 0 ]; then
echo "Error: Unexpected arguments provided: '$@'" >&2
usage
fi
# Defensive check: each -f appends its argument directly, so this array can't
# normally be empty when the flag was used. Whether the given paths are valid
# mount points is verified later, inside the disk usage check.
if [ "$user_specified_filesystems" = true ] && [ ${#FILESYSTEMS_TO_CHECK[@]} -eq 0 ]; then
echo "Warning: -f was used, but no filesystem paths were recorded." >&2
fi
# --- Start the Checks ---
echo # Blank line for spacing
echo "--- System Health Check Report ---"
echo "Report generated on: $(date)"
# Display the *actual* thresholds being used
echo "Warning Thresholds: CPU Load >= $LOAD_WARN_THRESHOLD, Disk Usage >= $DISK_WARN_THRESHOLD%"
# Display the filesystems that will be checked
echo "Filesystems to check: ${FILESYSTEMS_TO_CHECK[*]}" # [*] joins with space
echo # Blank line
# --- Call Check Functions (keep commented out until implemented in Step 3) ---
# check_cpu_load
# check_memory # Assuming Linux 'free' command for now
# check_disk_usage
echo # Blank line
echo "--- Health Check Complete ---"
}
Explanation of Changes in `main()`:
- Initialization: Variables are now initialized to defaults before the `getopts` loop, and a copy of the default filesystem array is made.
- `user_specified_filesystems` flag: Tracks whether the `-f` option was used.
- `getopts` loop: Parses the options `-l`, `-d`, `-f`, and `-h`.
  - Validation was added for `-l` (number/float) and `-d` (integer) using regex matching (`=~`).
  - For `-f`, the `FILESYSTEMS_TO_CHECK` array is cleared only the first time `-f` is encountered; then each provided argument (`$OPTARG`) is appended using `+=`.
  - The error cases (`\?` and `:`) now call the `usage` function for consistent error reporting.
- `shift $((OPTIND-1))`: Removes the processed options and their arguments from `$@`.
- Argument Validation: Checks whether unexpected non-option arguments were passed.
- Output: The report header now shows the actual thresholds and filesystems being used for this run.
- Function Calls: The calls to the check functions are still commented-out placeholders; we will enable them in Step 3.
Testing Step 2:
- `./health_check.sh -h` (Should display the usage message)
- `./health_check.sh -l 10.5 -d 90` (Should run using custom thresholds)
- `./health_check.sh -f / -f /home -f /var` (Should run checking the specified paths)
- `./health_check.sh -f /data` (Should run checking only `/data`)
- `./health_check.sh -x` (Should show "Invalid option" error and usage; sample output below)
- `./health_check.sh -f` (Should show "Option -f requires an argument" error and usage)
- `./health_check.sh -l abc` (Should show "Invalid load threshold" error and usage)
- `./health_check.sh some_arg` (Should show "Unexpected arguments" error and usage)
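For example, based on the `\?` branch above, the invalid-option test should produce something like this (followed by the rest of the usage text and then the cleanup messages from the EXIT trap):

```
$ ./health_check.sh -x
Error: Invalid option specified: -x
Usage: ./health_check.sh [-l <load_threshold>] [-d <disk_threshold_percent>] [-f <mount_point>] [-h]
...
```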
Step 3: Implement the Health Check Functions
Now, let's write the actual logic for checking CPU, memory, and disk. Place these function definitions before the main()
function definition in your script.
3.1. check_cpu_load()
# --- Check Functions ---
### Checks the 1-minute CPU load average against the threshold ###
check_cpu_load() {
# Use 'local' for variables inside functions
local current_load
local load_check_status="OK" # Assume OK initially
echo "--- CPU Load ---"
# Get load average from 'uptime'. The 1-min average is typically near the end.
# Using awk to extract it robustly:
# -F'[ ,:]+' splits by space, comma, or colon (one or more)
# NF is Number of Fields. $(NF-2) is usually the 1-min avg.
current_load=$(uptime | awk -F'[ ,:]+' '{print $(NF-2)}')
echo "Current 1-minute load average: $current_load"
# Compare floating point numbers using 'awk'
# awk exits 0 if comparison is true, 1 if false.
# We use 'exit !(condition)' because shell 'if' treats 0 as true (success).
# So if load > threshold (true -> 1), !(1) is 0, awk exits 0, 'if' block runs.
if awk -v load="$current_load" -v threshold="$LOAD_WARN_THRESHOLD" 'BEGIN { exit !(load >= threshold) }'; then
echo "WARNING: Load average ($current_load) is >= threshold ($LOAD_WARN_THRESHOLD)"
load_check_status="WARNING"
else
echo "Status: OK"
fi
echo "Overall CPU Load Status: $load_check_status"
echo # Blank line for spacing
}
Explanation:
- Uses `uptime` and `awk` to reliably extract the 1-minute load average.
- Uses `awk` again for floating-point comparison against `LOAD_WARN_THRESHOLD`. The `exit !(condition)` pattern is used to make `awk`'s exit status compatible with shell `if` (a standalone sketch of this pattern follows this list).
- Prints the current load and a clear WARNING or OK status message.
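The `exit !(condition)` trick is easy to verify in isolation. A quick sketch with hypothetical values:

```bash
# awk maps a true expression to exit status 0 (shell "success"), so the
# 'if' branch runs exactly when load >= threshold
if awk -v load="6.2" -v threshold="5.0" 'BEGIN { exit !(load >= threshold) }'; then
    echo "6.2 >= 5.0 -> the WARNING branch would run"
fi
```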
3.2. check_memory() (Linux Version using free)
### Checks available memory percentage (Linux using 'free') ###
check_memory() {
local total_mem_kb avail_mem_kb total_mem_mb avail_mem_mb avail_mem_percent
local mem_check_status="ERROR" # Default to error until we succeed
echo "--- Memory Usage (Linux) ---"
# Check if 'free' command exists
if ! command -v free &> /dev/null; then
echo "ERROR: 'free' command not found. Cannot check memory."
echo "Overall Memory Status: $mem_check_status"
echo # Blank line
return # Exit the function early
fi
# Get memory details in Kilobytes using 'free' (no options needed)
# Use awk to find the 'Mem:' line and extract total (col 2) and available (col 7)
# The 'available' column (usually 7th) is generally preferred over 'free'
# Output Format Example (may vary slightly):
# total used free shared buff/cache available
# Mem: 16018500 4017624 1186076 266780 10814800 12611376
# Swap: 2097148 0 2097148
local mem_line
# '|| true' stops 'set -e' from aborting the script when grep finds no match;
# the empty-result case is handled explicitly just below.
mem_line=$(free | grep '^Mem:' || true)
if [ -z "$mem_line" ]; then
echo "ERROR: Could not parse 'Mem:' line from 'free' command output."
free # Print raw output for debugging
echo "Overall Memory Status: $mem_check_status"
echo # Blank line
return
fi
total_mem_kb=$(echo "$mem_line" | awk '{print $2}')
avail_mem_kb=$(echo "$mem_line" | awk '{print $7}') # Assuming 7th field is 'available'
if ! [[ "$total_mem_kb" =~ ^[0-9]+$ ]] || ! [[ "$avail_mem_kb" =~ ^[0-9]+$ ]]; then
echo "ERROR: Failed to extract numeric memory values from 'free'."
echo "Raw Mem line: $mem_line"
echo "Overall Memory Status: $mem_check_status"
echo # Blank line
return
fi
# Convert KB to MB for potentially nicer display (integer division)
total_mem_mb=$(( total_mem_kb / 1024 ))
avail_mem_mb=$(( avail_mem_kb / 1024 ))
# Calculate available percentage (using KB for accuracy before dividing)
if [ "$total_mem_kb" -gt 0 ]; then
avail_mem_percent=$(( (avail_mem_kb * 100) / total_mem_kb ))
echo "Total Memory: ${total_mem_mb} MB"
echo "Available Memory: ${avail_mem_mb} MB (${avail_mem_percent}%)"
mem_check_status="OK" # Calculation succeeded
else
echo "ERROR: Total memory reported as zero. Cannot calculate percentage."
echo "Overall Memory Status: $mem_check_status"
echo # Blank line
return
fi
# No threshold check for memory in this version, just reporting
# (Could add a threshold check similar to disk usage if desired)
echo "Overall Memory Status: $mem_check_status"
echo # Blank line
}
Explanation (Linux):
- Checks if the `free` command exists.
- Parses the output of `free` using `grep` and `awk` to find the `Mem:` line and extract total/available memory (assuming KB output and field positions; a quick way to verify those positions follows this list). Includes error checking for parsing.
- Calculates MB and percentage available.
- Reports the values. (No warning threshold is implemented for memory in this example, but it could be added.)
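Before relying on those field positions, you can sanity-check them on your own machine with a quick one-liner (column 7 is `available` on reasonably recent procps versions of `free`):

```bash
# Print the two fields the function extracts from the 'Mem:' line
free | awk '/^Mem:/ {printf "total=%s kB, available=%s kB\n", $2, $7}'
```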
(Optional) check_memory() (macOS Version)
macOS requires `sysctl` and `vm_stat`; the parsing is different.
### Checks memory usage (macOS using sysctl/vm_stat - Approximation) ###
# check_memory() {
# local total_mem_bytes total_mem_mb page_size pages_free pages_inactive pages_wired
# local mem_available_approx_mb mem_available_percent
# local mem_check_status="ERROR"
#
# echo "--- Memory Usage (macOS Approximation) ---"
# if ! command -v sysctl &> /dev/null || ! command -v vm_stat &> /dev/null; then
# echo "ERROR: 'sysctl' or 'vm_stat' command not found. Cannot check memory."
# echo "Overall Memory Status: $mem_check_status"
# echo; return
# fi
#
# total_mem_bytes=$(sysctl -n hw.memsize)
# page_size=$(sysctl -n hw.pagesize)
#
# # Parse vm_stat - requires careful awk field selection
# local vm_output
# vm_output=$(vm_stat)
# pages_free=$(echo "$vm_output" | awk '/Pages free:/ { print $3 }' | tr -d '.')
# pages_inactive=$(echo "$vm_output" | awk '/Pages inactive:/ { print $3 }' | tr -d '.')
# # Wired memory is actively used and cannot be paged out
# pages_wired=$(echo "$vm_output" | awk '/Pages wired down:/ { print $4 }' | tr -d '.')
#
# if ! [[ "$total_mem_bytes" =~ ^[0-9]+$ ]] || \
# ! [[ "$page_size" =~ ^[0-9]+$ ]] || \
# ! [[ "$pages_free" =~ ^[0-9]+$ ]] || \
# ! [[ "$pages_inactive" =~ ^[0-9]+$ ]] || \
# ! [[ "$pages_wired" =~ ^[0-9]+$ ]]; then
# echo "ERROR: Failed to parse numeric values from sysctl or vm_stat."
# echo "Overall Memory Status: $mem_check_status"
# echo; return
# fi
#
# total_mem_mb=$(( total_mem_bytes / 1024 / 1024 ))
# # Approximation: Available is often considered Free + Inactive
# # More sophisticated checks might consider compressed memory etc.
# mem_available_approx_mb=$(( ( (pages_free + pages_inactive) * page_size ) / 1024 / 1024 ))
#
# if [ "$total_mem_mb" -gt 0 ]; then
# mem_available_percent=$(( (mem_available_approx_mb * 100) / total_mem_mb ))
# echo "Total Memory: ${total_mem_mb} MB"
# echo "Available Memory (Approx): ${mem_available_approx_mb} MB (${mem_available_percent}%)"
# mem_check_status="OK"
# else
# echo "ERROR: Total memory reported as zero."
# echo "Overall Memory Status: $mem_check_status"
# echo; return
# fi
#
# echo "Overall Memory Status: $mem_check_status"
# echo # Blank line
# }
3.3. check_disk_usage()
### Checks disk usage percentage for specified mount points ###
check_disk_usage() {
local fs_mount usage_percent human_readable_info df_line
local disk_check_status="OK" # Assume OK overall unless a warning is found
echo "--- Disk Usage ---"
if [ ${#FILESYSTEMS_TO_CHECK[@]} -eq 0 ]; then
echo "No filesystems specified for checking."
echo "Overall Disk Status: SKIPPED"
echo # Blank line
return
fi
# Iterate over the array of mount points provided
for fs_mount in "${FILESYSTEMS_TO_CHECK[@]}"; do
local single_fs_status="ERROR" # Status for this specific filesystem
echo "Checking filesystem mounted at: '$fs_mount'"
# Verify the path exists and df can report on it
# Use 'df -P' for POSIX standard output (more reliable parsing)
# Run df on the mount point itself, then let awk select the output line
# whose 6th field (the mount point) exactly matches the requested path.
# An exact string comparison avoids regex-escaping concerns entirely.
if ! df_line=$(df -P "$fs_mount" 2>/dev/null | awk -v path="$fs_mount" 'NR > 1 && $6 == path {print}'); then
echo " ERROR: Could not get 'df -P' info for '$fs_mount'. Skipping."
disk_check_status="ERROR" # Mark overall status as error
continue # Skip to the next filesystem in the loop
fi
if [ -z "$df_line" ]; then
echo " ERROR: Path '$fs_mount' might not be a valid mount point or df failed."
echo " (Please provide mount points like '/', '/home', '/var', not arbitrary paths)."
disk_check_status="ERROR"
continue
fi
# Extract usage percentage (5th field), removing '%' sign
usage_percent=$(echo "$df_line" | awk '{gsub(/%/, ""); print $5}')
# Get human-readable info using 'df -h' separately for display
# Need to handle potential errors here too
if ! human_readable_info=$(df -h "$fs_mount" 2>/dev/null | awk -v path="$fs_mount" 'NR > 1 && $6 == path {print}'); then
human_readable_info="Could not get human-readable info."
fi
# Validate extracted percentage
if ! [[ "$usage_percent" =~ ^[0-9]+$ ]]; then
echo " ERROR: Failed to extract numeric usage percentage for '$fs_mount'."
echo " Raw df line: $df_line"
disk_check_status="ERROR"
single_fs_status="ERROR"
else
# Successfully parsed, now check threshold
echo " Usage: $usage_percent%"
echo " Details: $human_readable_info"
single_fs_status="OK" # Parsed ok
# Compare usage against threshold (integer comparison)
if [ "$usage_percent" -ge "$DISK_WARN_THRESHOLD" ]; then
echo " WARNING: Usage ($usage_percent%) is >= threshold ($DISK_WARN_THRESHOLD%)"
single_fs_status="WARNING"
# If any filesystem has a warning, the overall status is Warning
# unless an error occurred elsewhere.
if [ "$disk_check_status" != "ERROR" ]; then
disk_check_status="WARNING"
fi
else
echo " Status: OK"
fi
fi
echo " Filesystem '$fs_mount' Status: $single_fs_status"
done # End of loop through filesystems
echo "Overall Disk Status: $disk_check_status"
echo # Blank line
}
Explanation:
- Iterates through the `FILESYSTEMS_TO_CHECK` array.
- Uses `df -P "$fs_mount"` to get POSIX-standard output for the specific mount point (a standalone one-liner follows this list).
- Uses `awk 'NR > 1 && $6 == path {print}'` to find the correct line (skipping the header with `NR > 1`) where the 6th field (`$6`) exactly matches the requested mount point `path`. Error handling is included if `df` fails or the path isn't found.
- Parses the usage percentage (field 5) using `awk`, removing the `%` sign with `gsub`. Includes validation.
- Runs `df -h` separately just to get the human-readable line for display purposes.
- Compares the integer `usage_percent` with `DISK_WARN_THRESHOLD` using `[ -ge ]`.
- Sets individual and overall status indicators (`OK`, `WARNING`, `ERROR`).
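The extraction logic can be exercised on its own before wiring it into the function. A minimal sketch for the root filesystem:

```bash
# Same idea as in check_disk_usage(): skip the header, exact-match the
# mount point in field 6, strip the '%' from field 5
df -P / | awk 'NR > 1 && $6 == "/" { gsub(/%/, "", $5); print "usage:", $5 "%" }'
```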
Step 4: Final Assembly and Testing
- Ensure Order: Make sure all function definitions (`usage`, `check_cpu_load`, `check_memory`, `check_disk_usage`, `cleanup`) are placed in the script file before the `main()` function definition, uncomment the three `check_*` calls inside `main()`, and confirm that `main "$@"` is the last line that executes.
- Review: Read through your complete `health_check.sh` script.
Comprehensive Testing:
Execute the script with various combinations of options and scenarios:
- Defaults: `./health_check.sh`
- Help: `./health_check.sh -h`
- Custom Thresholds (Trigger Warnings): `./health_check.sh -l 0.1 -d 5`
- Custom Thresholds (Normal): `./health_check.sh -l 100 -d 99`
- Specific Filesystems: `./health_check.sh -f / -f /tmp` (or other valid mount points on your system - use the `df` command to see yours)
- Invalid Filesystem Path: `./health_check.sh -f /nonexistent_mount`
- Invalid Option: `./health_check.sh -z`
- Missing Argument: `./health_check.sh -d`
- Unexpected Argument: `./health_check.sh extra_arg`
- Combination: `./health_check.sh -l 5.5 -f /var -f /home` (adding a `-v` verbose option would make a good exercise; as written, passing `-v` triggers the invalid-option error)
For each test, examine the output carefully:
- Are the correct thresholds and filesystems listed in the header?
- Are the CPU, memory, and disk values reported?
- Do warnings appear correctly when thresholds are exceeded?
- Are errors handled gracefully (e.g., for invalid paths or options)?
- Does the cleanup function run and report the correct exit status? Does it remove the temporary directory?
Workshop Summary
Fantastic! You have successfully built a functional system health check script by applying advanced shell scripting techniques:
- Modularity: Used functions (`usage`, `cleanup`, `check_*`, `main`) to organize code.
- Robustness: Implemented error handling using `set -euo pipefail`, `$?`, explicit checks (`command -v`, file existence, parsing validation), and a `trap` for cleanup.
- Configuration: Parsed command-line options professionally using `getopts`.
- Data Handling: Used command substitution (`$()`) extensively to capture output from `date`, `uptime`, `free`, `df`, `awk`, and `mktemp`. Processed text using `awk` and `grep`. Handled multiple filesystem paths using arrays.
- Resource Management: Safely created and removed temporary resources (`mktemp`, `rm` in a `trap`).
This script serves as a solid base. You can enhance it further by adding more checks (network connectivity, running processes, service status), improving parsing robustness (e.g., using df --output
flags if available), adding color output, or reading configuration from a file.
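For instance, on systems with GNU coreutils, `df --output` lets you request exactly the columns you need, sidestepping field-position assumptions entirely (this flag is not available on macOS/BSD `df`):

```bash
# GNU df only: print just the usage percentage for '/', e.g. "42"
df --output=pcent / | tail -n 1 | tr -d ' %'
```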
This workshop demonstrates how the techniques from the previous chapter come together to create powerful and reliable automation tools using the shell.