Author Nejat Hakan
eMail nejat.hakan@outlook.de
PayPal Me https://paypal.me/nejathakan


System Monitoring and Resource Management

Introduction Monitoring and Management Essentials

Welcome to the critical domain of System Monitoring and Resource Management on Linux. In any computing environment, from a personal laptop to vast server farms, understanding what the system is doing, how its resources are being utilized, and how to manage those resources effectively is paramount. Without proper monitoring, diagnosing performance issues becomes guesswork, potential problems go unnoticed until they cause outages, and capacity planning is impossible. Without effective resource management, critical processes might starve while unimportant tasks consume valuable CPU time or memory, leading to instability and poor performance.

Why is this so important?

  1. Performance Optimization: By observing resource usage (CPU, Memory, Disk I/O, Network), you can identify bottlenecks. Is an application slow because the CPU is maxed out, it's waiting for disk access, or it's constantly swapping memory? Monitoring provides the answers needed to tune the system or application.
  2. Stability and Reliability: Unexpected resource exhaustion (e.g., running out of memory or disk space) is a common cause of system crashes or hangs. Continuous monitoring allows you to foresee these situations and take corrective action before they cause critical failures. Spotting runaway processes consuming excessive resources is key to maintaining stability.
  3. Troubleshooting: When things go wrong (and they inevitably do), system logs and real-time monitoring data are your primary tools for diagnosis. Understanding system metrics helps you correlate events and pinpoint the root cause of a problem, whether it's a hardware fault, a software bug, or a configuration issue.
  4. Security Auditing: Monitoring system logs and network connections can help detect unauthorized access attempts, unusual process activity, or other potential security breaches. Resource usage patterns can sometimes indicate malware activity.
  5. Capacity Planning: By tracking resource utilization trends over time, administrators can make informed decisions about future hardware needs. Do you need more RAM? Faster disks? A more powerful CPU? Or perhaps another server entirely? Monitoring data provides the justification for upgrades or scaling.

Key Resources We Monitor and Manage:

  • CPU (Central Processing Unit): The "brain" of the computer. We monitor its utilization (how busy it is), load average (how many processes are waiting), and context switches.
  • Memory (RAM & Swap): Random Access Memory is crucial for active processes. We monitor total usage, free memory, cached data, and swap space usage (virtual memory on disk). Excessive swapping is often a sign of insufficient RAM.
  • Disk I/O (Input/Output): How quickly data can be read from and written to storage devices (HDDs, SSDs). We monitor throughput (MB/s), operations per second (IOPS), wait times, and device utilization. Slow disk I/O can severely impact overall system responsiveness.
  • Network I/O: The rate at which data is sent and received over network interfaces. We monitor bandwidth usage, packet counts, errors, and established connections.

This section will guide you through the fundamental tools and concepts needed to effectively monitor your Linux systems and manage their resources. We will start with essential command-line tools, delve into specific resource monitoring techniques, explore process management, understand system logging, and touch upon more advanced tools and concepts like control groups. Each technical sub-section will be followed by a hands-on workshop to solidify your understanding.

1. Essential Real-Time Monitoring Tools

Before diving into specific resources, let's familiarize ourselves with the workhorses of real-time system monitoring on the command line. These tools provide a dynamic overview of the system's current state.

top The Classic Task Manager

The top command provides a dynamic, real-time view of a running system. It displays system summary information as well as a list of tasks currently being managed by the kernel. Its output refreshes periodically (typically every 3 seconds), allowing you to observe changes as they happen.

Understanding the top Output:

The output is divided into two main parts: the summary area (top few lines) and the task area (the list of processes).

  • Summary Area:

    • top - 10:30:01 up 5 days, 1:15, 2 users, load average: 0.05, 0.15, 0.10
      • 10:30:01: Current system time.
      • up 5 days, 1:15: System uptime (how long since the last boot).
      • 2 users: Number of currently logged-in users.
      • load average: 0.05, 0.15, 0.10: System load average over the last 1, 5, and 15 minutes. This represents the average number of processes in the run queue (running or waiting for CPU time) plus those waiting for uninterruptible I/O. On a multi-core system, a load average equal to the number of CPU cores generally means the system is fully utilized. Values significantly higher indicate the system is overloaded.
    • Tasks: 250 total, 1 running, 249 sleeping, 0 stopped, 0 zombie
      • Total number of processes.
      • Breakdown by state: Running (actively using CPU or ready to), Sleeping (waiting for an event or resource), Stopped (suspended, e.g., by Ctrl+Z), Zombie (terminated but waiting for parent process to collect status).
    • %Cpu(s): 1.5 us, 0.8 sy, 0.0 ni, 97.5 id, 0.1 wa, 0.0 hi, 0.1 si, 0.0 st
      • CPU utilization breakdown (press 1 to toggle per-CPU view):
        • us: user space (running user processes)
        • sy: system/kernel space (running kernel tasks)
        • ni: nice (user processes with modified priority)
        • id: idle (CPU is not busy)
        • wa: wait (waiting for I/O operations to complete) - High wa often indicates a disk bottleneck.
        • hi: hardware interrupts
        • si: software interrupts
        • st: steal time (relevant in virtualized environments, time stolen by the hypervisor)
    • MiB Mem : 15890.5 total, 8140.2 free, 4150.3 used, 3600.0 buff/cache
    • MiB Swap: 2048.0 total, 2048.0 free, 0.0 used. 11250.8 avail Mem
      • Memory usage (RAM): Total, free, used, and buffered/cached memory. Linux uses free RAM extensively for caching disk data (buffers/cache) to speed up access. This cache is readily relinquished if applications need the memory.
      • Swap usage (Virtual Memory): Total, free, and used swap space. High swap usage usually indicates insufficient RAM for the current workload.
      • avail Mem: An estimation of how much memory is available for starting new applications, without swapping. This is often a more useful metric than free.
  • Task Area (Columns):

    • PID: Process ID (unique identifier).
    • USER: User owning the process.
    • PR: Priority (kernel scheduling priority).
    • NI: Nice value (user-space priority adjustment, lower is higher priority).
    • VIRT: Virtual Memory size used by the process (KB).
    • RES: Resident Memory size (physical RAM used, KB).
    • SHR: Shared Memory size (KB).
    • S: Process Status (R=Running, S=Sleeping, D=Disk Sleep, Z=Zombie, T=Stopped/Traced).
    • %CPU: Percentage of CPU time used by the process since the last update.
    • %MEM: Percentage of physical RAM used by the process.
    • TIME+: Total CPU time the task has consumed since it started, displayed with hundredth-of-a-second resolution (MM:SS.hh).
    • COMMAND: The command name or command line.

Interactive top Commands:

While top is running, press these keys:

  • q: Quit top.
  • h: Display help screen.
  • k: Kill a process (you'll be prompted for the PID and signal).
  • r: Renice a process (change its priority, prompts for PID and nice value).
  • f: Fields management (add/remove/reorder columns).
  • o or O: Add a filter expression to limit which tasks are shown (in modern procps-ng top; older versions used these keys to select the sort field).
  • M: Sort by memory usage (%MEM).
  • P: Sort by CPU usage (%CPU).
  • T: Sort by total CPU time (TIME+).
  • 1: Toggle summary CPU display between combined and per-CPU.
  • z: Toggle color display.
  • c: Toggle display between command name and full command line.
  • u: Filter by user (prompts for username).
  • Spacebar or Enter: Refresh display immediately.
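
Beyond the interactive view, top can also produce one-shot snapshots via batch mode, which is handy for scripting or for capturing the current state in a log file. A minimal example (the -o sort flag assumes the procps-ng top shipped by most modern distributions):

top -b -n 1 | head -n 15
# -b: batch mode (plain text output, no interactive screen)
# -n 1: produce a single iteration and then exit

top -b -n 1 -o %MEM | head -n 12
# -o %MEM: sort the snapshot by memory usage before printing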

htop An Enhanced Interactive Process Viewer

htop is often preferred over top because it offers several improvements:

  • Colorized Output: Easier to read and distinguish information.
  • Scrolling: You can scroll vertically and horizontally to see all processes and full command lines.
  • Easier Interaction: No need to enter PIDs for killing or renicing; you can select processes with arrow keys.
  • Mouse Support: If run in a terminal emulator that supports it.
  • Tree View: Press F5 to see parent-child relationships between processes.
  • Setup Menu: Press F2 to easily customize displayed meters, columns, colors, and options.

Understanding the htop Output:

  • Top Meters: Configurable graphical meters showing CPU (per core), Memory, and Swap usage. Load average, uptime, and task counts are also displayed.
  • Task Area: Similar columns to top, but often more intuitively arranged and configurable via F2 Setup.
  • Bottom Menu: Shows key function key shortcuts (F1 Help, F2 Setup, F3 Search, F4 Filter, F5 Tree View, F6 SortBy, F7 Nice-, F8 Nice+, F9 Kill, F10 Quit).

htop provides largely the same information as top but in a more user-friendly and visually appealing package. If it's not installed by default, it's usually available via the package manager (e.g., sudo apt install htop or sudo yum install htop).

ps Reporting a Snapshot of Current Processes

Unlike top and htop which are dynamic, ps (process status) provides a static snapshot of the processes running at the moment the command is executed. It's highly versatile due to its numerous options for selecting processes and customizing the output format.

Common ps Usage Patterns:

  1. BSD Syntax (common on Linux): ps aux

    • a: Show processes for all users.
    • u: Display user-oriented format (includes USER, %CPU, %MEM, VSZ, RSS, etc.).
    • x: Show processes not attached to a terminal (like daemons/services).
    • This is arguably the most common and useful invocation for a general overview.
    # Example Output Snippet (ps aux)
    USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root           1  0.0  0.1 169404 11928 ?        Ss   Jul10   0:02 /sbin/init splash
    root           2  0.0  0.0      0     0 ?        S    Jul10   0:00 [kthreadd]
    root         889  0.1  0.3 123456 50000 ?        Sl   Jul10   1:30 /usr/lib/some-service
    student     1234  0.5  1.0 876543 160000 pts/0   S+   10:00   0:05 gnome-terminal
    student     5678 12.3  5.5 1500000 880000 pts/1  R+   10:25   0:55 /usr/bin/firefox
    
  2. System V Syntax: ps -ef

    • -e: Show every process.
    • -f: Show full-format listing (includes UID, PID, PPID, C, STIME).
    • Often used to see parent/child process relationships (PPID).
    # Example Output Snippet (ps -ef)
    UID          PID    PPID  C STIME TTY          TIME CMD
    root           1       0  0 Jul10 ?        00:00:02 /sbin/init splash
    root           2       0  0 Jul10 ?        00:00:00 [kthreadd]
    root         889       1  0 Jul10 ?        00:01:30 /usr/lib/some-service
    student     1234    1200  0 10:00 pts/0    00:00:05 gnome-terminal
    student     5678    1234 12 10:25 pts/1    00:00:55 /usr/bin/firefox
    
  3. Custom Format: ps -eo <columns>

    • -e: Show every process.
    • -o: Specify user-defined format. You list the column names you want. Common columns: pid,ppid,user,%cpu,%mem,vsz,rss,stat,start,time,comm,args.
    • comm: Command name only.
    • args: Full command line with arguments.
    ps -eo pid,user,%cpu,%mem,comm --sort=-%cpu | head -n 10
    # Shows PID, User, %CPU, %MEM, Command Name for all processes
    # Sorted by %CPU descending (--sort=-%cpu), showing top 10 (head -n 10)
    

Key ps Output Columns (common to aux or -eo):

  • USER / UID: User owning the process.
  • PID: Process ID.
  • PPID: Parent Process ID.
  • %CPU: Approximate CPU utilization. Note: This is often averaged over the process's lifetime, unlike top's real-time view, unless specifically requested otherwise.
  • %MEM: Approximate physical memory (RAM) utilization.
  • VSZ (Virtual Set Size): Total virtual memory used by the process (in KB).
  • RSS (Resident Set Size): Physical memory (RAM) occupied by the process (in KB). This is often a more relevant metric than VSZ for actual RAM usage.
  • TTY: Controlling terminal (? means no controlling terminal, typical for daemons).
  • STAT / S: Process state (see top explanation: R, S, D, Z, T, etc.; + means foreground process group).
  • START / STIME: Time or date the process started.
  • TIME: Cumulative CPU time consumed by the process (often in MM:SS or HH:MM:SS format).
  • COMMAND / CMD / comm / args: The command being run.
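
To see the VSZ/RSS distinction on a live process, you can ask ps for just those columns; here the current shell's PID ($$) is used purely as a convenient example target:

ps -o pid,vsz,rss,comm -p $$
# Both VSZ and RSS are reported in KiB.
# RSS (resident) reflects actual RAM in use and is typically far smaller than VSZ.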

Combining ps with grep:

A very common use case is finding a specific process:

ps aux | grep firefox
# Find all processes with "firefox" in their command line

ps -ef | grep sshd
# Find processes related to the SSH daemon
Self-Grep Issue: Note that the grep command itself will often appear in the output. You can filter it out: ps aux | grep firefox | grep -v grep.
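
A cleaner alternative to the grep -v grep idiom is pgrep, which matches process names (or full command lines with -f) and never matches itself:

pgrep -a firefox
# -a: print the full command line next to each matching PID

pgrep -af sshd
# -f: match against the full argument list instead of just the process name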

Workshop Identifying and Inspecting Processes

Goal: To practice using top, htop, and ps to identify system activity and gather details about specific processes.

Scenario: Let's simulate a scenario where a background process starts consuming some resources, and we need to investigate it.

Steps:

  1. Open Two Terminals: You'll need one terminal (Terminal A) to run commands and another (Terminal B) to run a background task.

  2. Start a Background Task (Terminal B):

    • Run the following command. This will simply loop indefinitely, consuming a small amount of CPU. We add sleep 1 to prevent it from consuming 100% CPU, making it slightly more realistic for a background task.
      while true; do echo "Working..."; sleep 1; done &
      
    • The & runs the command in the background. Note the PID (Process ID) that is printed, e.g., [1] 12345. You'll use this PID later. If you miss it, don't worry, we'll find it.
  3. Monitor with top (Terminal A):

    • Run top.
    • Observe the process list. It might take a few refresh cycles. Look for a process named bash or sh (or potentially sleep) that is associated with your user and has a non-zero %CPU (though small due to sleep 1) and a TIME+ value that increments.
    • Press P to sort by CPU usage. Does your process appear near the top (it might not if the system is busy)?
    • Press M to sort by Memory usage.
    • Press c to toggle the full command line. Can you now see the while true; do ... command?
    • Make a note of the PID of your loop process as shown in top.
    • Press q to exit top.
  4. Monitor with htop (Terminal A):

    • If you have htop installed, run htop. (If not, you can install it: sudo apt update && sudo apt install htop or sudo yum install htop).
    • Observe the meters at the top.
    • Look for your process in the list. Use the Up/Down arrow keys to navigate.
    • Press F6 (SortBy) and select PERCENT_CPU.
    • Press F5 (Tree) view. Can you see your shell process (bash, zsh, etc.) and the sleep command running under it (or the while loop itself if represented that way)? Press F5 again to exit tree view.
    • Press F4 (Filter). Type your username and press Enter. Now only your processes are shown. Does this make it easier to find the loop? Press F4 again and Enter with an empty string to clear the filter.
    • Press F3 (Search). Type sleep and press Enter. htop will highlight matching processes. Press F3 again to find the next match.
    • Press F9 (Kill). Use the arrow keys to highlight your background loop process (the bash/sh one, not sleep directly if visible separately). Do not press Enter yet. Press Esc to cancel the kill operation. We'll kill it later.
    • Press F10 (Quit).
  5. Inspect with ps (Terminal A):

    • Run ps aux. Scan the output for your background loop process (look for while true or similar in the COMMAND column). Note its PID, USER, %CPU, %MEM, STAT (should be S for sleeping most of the time, occasionally R), and START time.
    • Run ps -ef. Find the process again. Note the PID and PPID (Parent Process ID). The PPID should correspond to the PID of the shell process running in Terminal B.
    • Let's assume the PID you found for the loop was 12345. Get specific details using -o:
      ps -eo pid,ppid,user,%cpu,%mem,stat,etime,args -p 12345
      # -p specifies the PID to display
      # etime shows the elapsed time since the process started
      
    • Find the process using pgrep (a utility to find PIDs by name or other attributes):
      pgrep -a -u $USER sleep
      # -a: Show full command line
      # -u $USER: Limit to your user's processes
      # sleep: Find processes matching "sleep" (might show the sleep part of the loop)
      
      pgrep -a -f "while true"
      # -f: Match against the full argument list
      
      This should give you the PID of the main loop shell.
  6. Terminate the Background Task (Terminal A or B):

    • You have the PID (let's say it's 12345). Use the kill command in either terminal:
      kill 12345
      
    • Go back to Terminal B. You should see a message like Terminated or [1]+ Terminated .... The loop has stopped.
    • Run ps aux | grep 12345 | grep -v grep. You should get no output, confirming the process is gone.

Conclusion: You've successfully used top, htop, and ps to monitor system activity in real-time, identify a specific process, inspect its details (PID, PPID, resource usage, state), and terminate it using its PID. These are fundamental skills for managing any Linux system.

2. CPU Monitoring and Analysis

The Central Processing Unit (CPU) is often the first place administrators look when diagnosing performance issues. Understanding how to monitor CPU utilization and interpret the related metrics is crucial.

Key CPU Concepts

  • Cores and Threads: Modern CPUs have multiple cores, each capable of executing instructions independently. Some cores support hyper-threading (or Simultaneous Multi-Threading - SMT), allowing a single physical core to appear as two logical processors to the OS, potentially increasing throughput for certain workloads. When monitoring, it's important to know if you're looking at total utilization across all logical processors or utilization per core/thread.
  • CPU Utilization: This is typically expressed as a percentage, indicating how much time the CPU spent doing useful work versus being idle. It's broken down into categories (as seen in top):
    • %us (user): Time spent executing user-space processes (applications). High us usually means application code is consuming CPU.
    • %sy (system): Time spent executing kernel-space code (system calls, kernel threads). High sy might indicate heavy I/O, intense networking, or kernel-level tasks.
    • %ni (nice): Time spent executing niced (lower priority) user processes.
    • %id (idle): Time the CPU had nothing to do. High id means the CPU is not a bottleneck.
    • %wa (I/O wait): Time the CPU spent waiting for I/O operations (like disk reads/writes) to complete. Important: This is time the CPU could have been doing something else but was stalled waiting for I/O. High wa strongly suggests an I/O bottleneck (often disk, sometimes network). The CPU itself isn't necessarily busy, but tasks waiting for I/O are preventing it from being truly idle.
    • %hi (hardware interrupts): Time spent servicing hardware interrupts (e.g., from network cards, disk controllers).
    • %si (software interrupts): Time spent servicing software interrupts (often related to network packet processing). High si can point to very high network traffic.
    • %st (steal time): In virtualized environments, this is time the hypervisor "stole" from this virtual CPU to run other tasks (like another VM or the hypervisor itself). High st indicates the VM isn't getting its fair share of CPU from the host.
  • Load Average: As seen in top and uptime, the load average (1, 5, 15-minute averages) represents the average number of tasks in the run queue (R state) or waiting for uninterruptible I/O (D state).
    • A load average consistently below the number of logical CPU cores indicates the system is generally not CPU-bound.
    • A load average consistently near or equal to the number of cores means the system is fully utilized.
    • A load average consistently above the number of cores indicates the system is overloaded – there are more tasks ready to run than available CPU cores can handle, leading to waiting times and reduced responsiveness. High load average can be caused by high CPU usage or high I/O wait.
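
A quick sanity check is to compare the current load averages against the number of logical CPUs. The snippet below is a minimal sketch using nproc and /proc/loadavg (the file the load averages come from):

nproc
# Number of logical processors

cat /proc/loadavg
# First three fields are the 1, 5 and 15-minute load averages;
# the remaining fields are runnable/total scheduling entities and the most recent PID.

awk -v cores="$(nproc)" '{printf "1-minute load per core: %.2f\n", $1 / cores}' /proc/loadavg
# A result well below 1.0 means the CPUs are keeping up with demand.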

Tools for CPU Monitoring

While top and htop give a good overview, other tools provide different perspectives:

  • mpstat (MultiProcessor Statistics): Part of the sysstat package (often needs installation: sudo apt install sysstat or sudo yum install sysstat). Excellent for viewing statistics per logical processor.

    • mpstat -P ALL: Show statistics for all CPUs (ALL) individually, plus a summary average.
    • mpstat -P ALL 1 5: Show stats for all CPUs every 1 second, 5 times.

    # Example Output (mpstat -P ALL 1 1)
    Linux 5.15.0-76-generic (...)  _x86_64_  (4 CPU)
    
    11:00:01 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
    11:00:02 AM  all    1.50    0.00    0.75    0.10    0.00    0.15    0.00    0.00    0.00   97.50
    11:00:02 AM    0    2.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   97.00
    11:00:02 AM    1    1.00    0.00    0.50    0.30    0.00    0.20    0.00    0.00    0.00   98.00
    11:00:02 AM    2    1.80    0.00    0.90    0.00    0.00    0.30    0.00    0.00    0.00   97.00
    11:00:02 AM    3    1.20    0.00    0.60    0.10    0.00    0.10    0.00    0.00    0.00   98.00
    
    This is invaluable for spotting imbalances (one core heavily loaded while others are idle) or understanding utilization patterns on multi-core systems.

  • vmstat (Virtual Memory Statistics): Provided by the procps package (installed by default on most distributions). While primarily a virtual memory tool (vm), it provides useful CPU context.

    • vmstat 1: Report every 1 second indefinitely.
    • vmstat 2 5: Report every 2 seconds, 5 times.
    # Example Output (vmstat 1 3)
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     1  0      0 8140100 150000 3450000    0    0     5    20  100  250  2  1 97  0  0
     2  0      0 8139900 150000 3450200    0    0     0   150  850 1500 15  5 80  0  0
     0  0      0 8139700 150000 3450400    0    0     0    80  500  900  5  2 93  0  0
    
    • Key CPU Columns: us, sy, id, wa, st (same meanings as in top).
    • Key Process Columns: r (runnable processes waiting for CPU), b (processes in uninterruptible sleep, often waiting for I/O). High r values correlate with high CPU load. High b values correlate with high I/O wait.
  • uptime: Quickly shows the load averages.

    uptime
    # Output: 11:15:01 up 5 days, 1:50,  2 users,  load average: 0.10, 0.18, 0.12
    

Workshop Generating and Analyzing CPU Load

Goal: To generate CPU load and observe its effect using various monitoring tools, focusing on per-core statistics and load average.

Tools Required: top or htop, mpstat, uptime, and the stress utility.

Steps:

  1. Install stress and sysstat:

    • On Debian/Ubuntu: sudo apt update && sudo apt install stress sysstat
    • On CentOS/RHEL/Fedora: sudo yum install epel-release && sudo yum install stress sysstat (or sudo dnf install stress sysstat)
    • sysstat provides mpstat. stress is a simple tool to impose CPU, memory, or I/O load.
  2. Check Initial State:

    • Open three terminals (A, B, C).
    • Terminal A: Run htop or top. Note the baseline CPU usage and load average. Press 1 in top to see per-CPU views if not already visible.
    • Terminal B: Run mpstat -P ALL 1. Observe the per-CPU idle (%idle) percentages. They should be high (close to 100%).
    • Terminal C: Run uptime. Note the initial load average.
  3. Generate CPU Load (Terminal C):

    • First, find out how many CPU cores (logical processors) you have: nproc
    • Let's generate load equivalent to one fully utilized core. Replace 1 with the number of cores if you want to stress more later.
      stress --cpu 1 --timeout 60s
      # --cpu 1: Start 1 worker process spinning on sqrt()
      # --timeout 60s: Stop after 60 seconds
      
  4. Observe While Under Load:

    • Terminal A (htop/top):
      • Watch the main CPU meter(s). You should see utilization increase significantly.
      • If using top with per-CPU view (press 1), one CPU line should show very low %idle. htop will show one CPU bar nearly full.
      • Find the stress process(es). They should be consuming close to 100% CPU (on one core). Note their PID and %CPU.
      • Watch the load average. The 1-minute average should start climbing towards 1.00.
    • Terminal B (mpstat):
      • Observe the output refreshing every second. One specific CPU line should show a dramatic drop in %idle and a corresponding increase in %usr. Other CPUs should remain mostly idle.
      • The all line will show the average utilization across all cores.
    • Terminal C: After stress finishes (60 seconds), it will exit. Run uptime again immediately. Compare the load averages to the initial values. The 1-minute average should be elevated (close to 1.00 if the test ran long enough), while the 5 and 15-minute averages will be lower but rising. Run uptime a few more times over the next few minutes and watch the averages decrease as the system recovers.
  5. (Optional) Generate More Load:

    • If you have multiple cores (e.g., nproc reported 4), try stressing all of them:
      stress --cpu 4 --timeout 60s
      
    • Now observe htop/top and mpstat. All CPU cores should show high utilization (low %idle). The load average in top and uptime should climb towards 4.00 (or the number of cores you stressed).
  6. (Optional) Generate I/O Wait Load:

    • I/O wait is harder to simulate perfectly with stress, but we can try:
      stress --io 4 --timeout 60s
      # --io 4: Start 4 workers syncing data to disk frequently
      
    • Observe top/htop. Look at the %wa value in the CPU summary line. Does it increase significantly?
    • Observe mpstat. Does %iowait increase?
    • Observe vmstat 1. Look at the wa column under cpu and the b column under procs. Do they increase?
    • Note: The effectiveness of --io depends heavily on your disk speed and system configuration.

Conclusion: You have used stress to create controlled CPU load and observed its impact using top, htop, mpstat, and uptime. You saw how load affects overall and per-core utilization percentages and how the system load average reflects the demand on the CPU(s). You also briefly explored how I/O-bound tasks affect the %wa metric. This hands-on experience helps in interpreting these metrics when analyzing real-world performance issues.

3. Memory Monitoring and Analysis

Memory (RAM) is another critical resource. Insufficient memory forces the system to use slower swap space (disk), drastically reducing performance. Understanding memory usage patterns is essential for system health.

Key Memory Concepts

  • RAM (Random Access Memory): Fast, volatile storage used by the CPU to hold running applications and their data.
  • Swap Space: A designated area on a hard drive or SSD used as "virtual memory" when physical RAM is full. Accessing swap is orders of magnitude slower than accessing RAM. Heavy swap usage is a major performance killer.
  • Physical vs. Virtual Memory:
    • Physical Memory (Resident Set Size - RSS): The actual amount of RAM a process occupies.
    • Virtual Memory (Virtual Set Size - VSZ): The total address space requested by a process. This includes code, data, shared libraries, and mapped files, some of which might be in RAM, some in swap, and some not loaded yet. VSZ is often much larger than RSS. RSS is usually the more important metric for actual RAM consumption.
  • Buffers: Temporary storage for raw disk blocks (metadata or file content). Used by the kernel to optimize block device I/O. Data written might be held in a buffer briefly before being written to disk.
  • Cache: Page cache holding data read from files on disk. If a file is read, its contents are stored in the page cache in RAM. Subsequent reads of the same file can be served quickly from the cache instead of going back to the slow disk.
  • Buffers vs. Cache: Historically distinct, modern Linux kernels often manage them similarly within the "page cache." The buff/cache value seen in tools like free and top represents the sum of memory used for both purposes. Crucially, most of this buff/cache memory is reclaimable. If applications need more RAM, the kernel will shrink the cache/buffers to free up space.
  • Free vs. Available Memory:
    • free: Memory that is completely unused. In Linux, this number might seem low because the kernel actively uses "free" RAM for buffers and cache to improve performance.
    • available: An estimation (available since kernel 3.14) of how much memory is truly available for starting new applications without resorting to swapping. It accounts for free memory plus reclaimable parts of buff/cache. available is generally the most useful metric to determine if the system is under memory pressure.
  • OOM Killer (Out Of Memory Killer): A Linux kernel process that activates when the system is critically low on memory and cannot reclaim enough (e.g., by shrinking caches or swapping). To prevent a total system lockup, the OOM Killer selects a process (based on heuristics like memory usage and "oom_score") and terminates (SIGKILLs) it to free up memory. While it saves the system from crashing, it means an application was forcibly killed. Seeing OOM Killer activity in logs (dmesg or journalctl) indicates severe memory pressure.
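
If you suspect the OOM Killer has fired, the kernel log records it. Either of the following is a quick check (the exact message wording varies between kernel versions):

sudo dmesg -T | grep -i -E "out of memory|oom-killer"
# -T: print human-readable timestamps

journalctl -k | grep -i oom
# -k: restrict the journal to kernel messages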

Tools for Memory Monitoring

  • free: The primary command-line tool for a quick snapshot of memory usage.

    • free: Shows values in kibibytes (KiB).
    • free -h: Shows values in human-readable format (MiB, GiB). This is usually preferred.
    • free -s 1: Refresh every 1 second.
    # Example Output (free -h)
                  total        used        free      shared  buff/cache   available
    Mem:           15Gi       4.0Gi       7.8Gi       150Mi       3.8Gi        11Gi
    Swap:         2.0Gi          0B       2.0Gi
    
    • Mem line: Physical RAM statistics.
      • total: Total installed RAM.
      • used: Calculated as total - free - buff/cache. Can be misleading alone.
      • free: Truly unused memory.
      • shared: Memory used by tmpfs (RAM-based file systems).
      • buff/cache: Memory used by kernel buffers and page cache.
      • available: Estimate of memory available for new applications. Focus on this!
    • Swap line: Swap space statistics. Non-zero used indicates swapping is occurring or has occurred.
  • top / htop: Provide real-time memory summary (similar to free) and per-process memory usage (VIRT, RES, SHR, %MEM). Sorting by %MEM (M in top, F6 in htop) quickly identifies memory-hungry processes.

  • vmstat: Reports virtual memory statistics over time.

    • vmstat 1
    # Example Output focusing on memory/swap columns
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     1  0      0 8140100 150000 3450000    0    0     5    20  100  250  2  1 97  0  0
     0  0   1024 8030000 150200 3550000   10   25    80    150 600  800 5  3 85  7  0
    
    • Memory Columns: swpd (amount of swap used), free, buff, cache.
    • Swap Columns: si (amount swapped in from disk per second), so (amount swapped out to disk per second). Sustained non-zero values for si and so indicate active swapping and likely insufficient RAM.
  • /proc/meminfo: A virtual file providing detailed memory statistics directly from the kernel. free, top, etc., parse this file. Useful for getting specific values or scripting.

    cat /proc/meminfo
    # Example Lines:
    # MemTotal:       16272896 kB
    # MemFree:         8335568 kB
    # MemAvailable:   11520896 kB
    # Buffers:          153600 kB
    # Cached:          3532800 kB
    # SwapTotal:       2097148 kB
    # SwapFree:        2097148 kB
    # ... many other details
    

  • smem: An advanced tool (may need installation) that provides more detailed reports on memory usage, particularly distinguishing between shared and private memory per process, giving a more accurate view of proportional usage (PSS - Proportional Set Size). smem -tk adds a totals line (-t) and prints sizes with human-readable unit suffixes (-k). A rough PSS check without smem is shown below.
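
As a rough illustration of RSS versus PSS without installing smem, kernels 4.14 and newer expose a per-process summary in /proc/<pid>/smaps_rollup; here the current shell ($$) is used as an example target, and the exact figures will differ on your system:

grep -E '^(Rss|Pss):' /proc/$$/smaps_rollup
# Pss counts shared pages proportionally (split between the processes sharing them),
# so it is usually lower than Rss for processes that rely heavily on shared libraries.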

Workshop Simulating Memory Pressure and Observing Swapping

Goal: To simulate a low-memory situation, observe the use of cache, witness swapping activity, and see how tools report these conditions.

Tools Required: free, top or htop, vmstat, stress (or stress-ng).

Steps:

  1. Install Tools (if needed):

    • Ensure sysstat (for vmstat) and stress (or stress-ng which has more memory options) are installed.
    • Debian/Ubuntu: sudo apt update && sudo apt install sysstat stress
    • CentOS/RHEL/Fedora: sudo yum install sysstat stress or sudo dnf install sysstat stress
  2. Establish Baseline:

    • Open three terminals (A, B, C).
    • Terminal A: Run free -h. Note the initial total, used, free, buff/cache, and available memory, plus swap usage.
    • Terminal B: Run vmstat 1. Observe the free, cache, si, and so columns. Note the initial lack of swap activity (si/so should be 0).
    • Terminal C: Run htop or top. Observe the memory summary line.
  3. Consume Cache (Optional but illustrative):

    • In a fourth terminal (D), or by reusing C temporarily, perform an operation that reads or writes a large amount of file data. This forces Linux to populate the page cache. Reading a large existing file or writing a temporary one both work.
      # Be careful with disk space if creating a file with dd
      # Option 1: Read large existing files (adjust path/size)
      # sudo find / -type f -size +100M -exec cat {} \; > /dev/null
      # Option 2: Write a 1 GiB temporary file through the page cache
      dd if=/dev/zero of=/tmp/cachefile bs=1M count=1024 status=progress
      # Delete it later with: rm /tmp/cachefile
      
    • Immediately after the command finishes, check free -h in Terminal A again. You should see:
      • free memory decreased significantly.
      • buff/cache increased significantly.
      • available memory decreased less than free, because the cache is reclaimable.
    • Check vmstat 1 in Terminal B. The cache column should have increased.
  4. Clear Caches (Optional, requires root):

    • To demonstrate cache reclaimability (use with caution, may temporarily impact performance):
      sync # Flush any pending writes to disk first
      echo 3 | sudo tee /proc/sys/vm/drop_caches
      
    • Check free -h in Terminal A again. free memory should increase, and buff/cache should decrease, returning closer to the initial state. available should also increase.
  5. Generate Memory Load (Terminal C):

    • Determine roughly how much available RAM you have from free -h. Let's aim to consume slightly more than that to force swapping. If you have 11Gi available, try allocating 12G. Adjust the 12G value based on your system.
      # Use stress to allocate memory
      # --vm N: Spawn N workers spinning on malloc()/free()
      # --vm-bytes SIZE: Allocate SIZE per worker
      # Let's start 1 worker allocating 12GB (adjust size!)
      stress --vm 1 --vm-bytes 12G --timeout 120s
      # If stress fails or doesn't consume enough, try stress-ng
      # stress-ng --vm 1 --vm-bytes 12G --timeout 120s
      
    • Warning: This might make your system temporarily unresponsive!
  6. Observe While Under Memory Pressure:

    • Terminal A (free -h): Run free -h periodically (or watch -n 1 free -h).
      • Watch available memory decrease rapidly.
      • Watch used swap increase from 0.
    • Terminal B (vmstat 1):
      • Watch the free memory column drop.
      • Watch the cache column likely decrease as the kernel tries to reclaim cache before swapping.
      • Crucially, watch the so (swap-out) column. You should see non-zero values as the system writes memory pages to the swap disk.
      • If the system becomes responsive enough for stress to free memory later, or if you allocate less, you might see si (swap-in) activity as swapped-out pages are needed again.
    • Terminal C (htop/top):
      • The memory summary line should show high RAM usage and increasing Swap usage.
      • Find the stress or stress-ng process. Its %MEM and RES (Resident Set Size) should be very high. Its VIRT (Virtual Size) might be even higher.
      • The system might feel sluggish. Observe CPU usage - you might see increased %sy (system CPU) and potentially %wa (I/O wait) due to the swapping activity (which involves disk I/O).
  7. After stress Finishes:

    • Continue monitoring with free -h and vmstat 1 for a minute or two.
    • Swap usage (used in free -h, swpd in vmstat) might remain high even after the process exits. Linux generally doesn't eagerly un-swap pages unless the memory is needed elsewhere or the page is accessed again (triggering swap-in).
    • Available memory should recover. Swap activity (si/so in vmstat) should return to 0.

Conclusion: You simulated memory pressure, observed how Linux uses free RAM for cache, how it reclaims cache when needed, and critically, what happens when physical RAM is exhausted – swapping. You used free, vmstat, and top/htop to monitor available memory, cache usage, swap usage, and swap I/O activity (si/so). Witnessing non-zero si/so is a strong indicator that the system needs more RAM for its workload.

4. Disk I/O Monitoring and Analysis

Disk Input/Output (I/O) performance is critical for application responsiveness, especially for databases, file servers, or any application that frequently reads or writes data. Slow disk I/O can lead to high %iowait CPU time, bottlenecking the entire system even if the CPU itself isn't busy.

Key Disk I/O Concepts

  • Throughput: The rate at which data is transferred, usually measured in Megabytes per second (MB/s) or Gigabytes per second (GB/s). High throughput is important for large file transfers or sequential reads/writes.
  • IOPS (Input/Output Operations Per Second): The number of read or write operations completed per second. High IOPS are crucial for workloads involving many small, random reads/writes, such as database lookups or virtual machine hosting. SSDs typically offer vastly higher IOPS than traditional HDDs.
  • Latency: The time it takes for a single I/O request to be completed, often measured in milliseconds (ms). Lower latency is better, meaning the disk responds faster. High latency directly impacts application responsiveness.
  • Queue Depth: The number of pending I/O requests waiting to be serviced by the disk device. A consistently high queue depth indicates the disk cannot keep up with the demand.
  • Utilization (%util): The percentage of time the disk device was busy processing I/O requests. A value close to 100% indicates the disk is saturated and is likely a bottleneck. However, high utilization on its own isn't always bad if latency remains low. A fast SSD might be 100% utilized but still providing excellent performance. Combine %util with latency/wait times for a better picture.
  • Service Time (svctm - often deprecated/misleading): Historically, the average time spent servicing I/O requests, including wait time. On modern kernels/tools, this value is often inaccurate and should be disregarded in favor of await.
  • Wait Time (await, r_await, w_await): The average time (in ms) an I/O request spends from when it's issued to when it's completed. This includes both queue time (waiting to be processed) and service time (actively being processed). await is a crucial indicator of disk performance as experienced by applications. r_await and w_await provide separate average wait times for read and write requests, respectively. High await times directly point to an I/O bottleneck.
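
These metrics are related: throughput is roughly IOPS multiplied by the average request size, which is why a workload of many small random reads can saturate a disk while reporting only modest MB/s. A quick way to watch them side by side (assuming the sysstat package described below is installed, with sda as a placeholder device name):

iostat -xk 2 sda
# Extended per-device statistics in kilobytes, refreshed every 2 seconds.
# Read r/s + w/s (IOPS), rkB/s + wkB/s (throughput), r_await/w_await (latency)
# and %util (saturation) together rather than relying on any single column.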

Tools for Disk I/O Monitoring

  • iostat: The standard tool for reporting CPU statistics and input/output statistics for devices and partitions. Part of the sysstat package.

    • iostat: Basic report with CPU and device I/O since boot.
    • iostat -d: Show only the device utilization report.
    • iostat -x: Show extended statistics (highly recommended). Includes await, %util, queue size, etc.
    • iostat -xk 1: Show extended stats (x) in kilobytes (k) every 1 second.
    • iostat -x /dev/sda 2 5: Show extended stats just for device /dev/sda, every 2 seconds, 5 times.
    # Example Output (iostat -xk 1)
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              1.50    0.00    0.75    0.10    0.00   97.65
    
    Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
    sda              1.50    5.00     60.00    120.00     0.10     2.00   6.25  28.57    2.50    5.10   0.05    40.00    24.00   1.50   0.98
    nvme0n1         25.00  150.00   1000.00  5000.00     0.50     5.00   1.96   3.23    0.15    0.40   0.10    40.00    33.33   0.05   0.88
    
    • Key Columns (-x mode):
      • r/s, w/s: Reads/Writes completed per second (IOPS = r/s + w/s).
      • rkB/s, wkB/s: Kilobytes read/written per second (Throughput). (Use -m for MB/s).
      • rrqm/s, wrqm/s: Read/Write requests merged per second by the kernel.
      • r_await, w_await: Average time (ms) for read/write requests to be served (including queue + service time). Very important metrics!
      • aqu-sz: Average queue length (number of requests waiting).
      • rareq-sz, wareq-sz: Average size (kB) of read/write requests.
      • %util: Percentage of elapsed (wall-clock) time during which I/O requests were issued to the device (a measure of device saturation).
  • iotop: An htop-like tool specifically for monitoring disk I/O usage per process. Requires root privileges. (Needs installation: sudo apt install iotop or sudo yum install iotop).

    • sudo iotop: Shows current I/O activity, updating periodically.
    • sudo iotop -o: Show only processes or threads actually doing I/O.
    • sudo iotop -a: Show accumulated I/O instead of bandwidth.
    # Example Output (sudo iotop -o)
    Total DISK READ:         1.20 M/s | Total DISK WRITE:       5.50 M/s
    Actual DISK READ:        0.80 M/s | Actual DISK WRITE:      3.00 M/s
      PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
     1234 be/4 student     800.00 K/s    2.50 M/s  0.00 %  5.50 % dd if=/dev/zero of=testfile bs=1M count=100
     5678 be/4 root          0.00 B/s  500.00 K/s  0.00 %  1.10 % [jbd2/sda1-8]
     9012 be/4 mysql       400.00 K/s    2.00 M/s  0.00 %  3.20 % mysqld --user=mysql ...
    
    • Shows PID, User, Disk Read rate, Disk Write rate, Swap In percentage, I/O wait percentage (IO>), and Command.
    • Excellent for quickly identifying which process is responsible for heavy disk activity seen in iostat.
  • vmstat: Provides basic block I/O stats.

    • vmstat 1
    • Columns bi (blocks received from a block device - read) and bo (blocks sent to a block device - written). Units are typically blocks (often 1KB). Useful for seeing if any disk activity is happening alongside memory/CPU stats.
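
vmstat also has a per-disk mode that reports cumulative counters (reads, writes, sectors, and milliseconds spent on I/O) for each block device, which can be a quick way to see which disk is active without installing anything extra:

vmstat -d 2
# -d: per-disk statistics (cumulative since boot), repeated every 2 seconds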

Workshop Generating Disk Load and Analyzing I/O Statistics

Goal: To generate different types of disk load (read and write) and observe the impact using iostat and iotop.

Tools Required: iostat, iotop, dd (usually pre-installed).

Steps:

  1. Install Tools (if needed):

    • Ensure sysstat (for iostat) and iotop are installed.
    • Debian/Ubuntu: sudo apt update && sudo apt install sysstat iotop
    • CentOS/RHEL/Fedora: sudo yum install sysstat iotop or sudo dnf install sysstat iotop
  2. Identify Target Device:

    • Use lsblk or df -h to identify a suitable disk partition with some free space (e.g., /dev/sda1, /dev/nvme0n1p2). We'll write a test file there. Avoid writing directly to the raw device (/dev/sda) unless you know what you are doing. Find your home directory's partition or use /tmp. Let's assume we're writing to a filesystem mounted from /dev/sda1.
  3. Establish Baseline:

    • Open three terminals (A, B, C).
    • Terminal A: Run iostat -xk 1. Observe the baseline r/s, w/s, rkB/s, wkB/s, await times, and %util for your target device (e.g., sda). They should be relatively low.
    • Terminal B: Run sudo iotop. You might need to enter your password. Observe the baseline. Press o to only show active processes.
    • Terminal C: This will be used to generate the load.
  4. Generate Write Load (Terminal C):

    • Use dd to write a moderately large file (e.g., 1GB) from /dev/zero (a source of infinite null bytes, low CPU overhead) to your chosen filesystem.
      # Adjust 'of=./testfile' path if needed. Use a filesystem on your target device.
      dd if=/dev/zero of=./testfile bs=1M count=1024 oflag=direct status=progress
      # bs=1M: Write in 1 Megabyte blocks
      # count=1024: Write 1024 blocks (1GB total)
      # oflag=direct: Try to bypass the page cache for writing. This generates more immediate physical I/O, making effects clearer in iostat/iotop. It requires filesystem support for direct I/O; if dd reports an error, simply remove oflag=direct.
      # status=progress: Show dd's progress.
      
  5. Observe While Under Write Load:

    • Terminal A (iostat): Watch the line for your target device (sda, nvme0n1, etc.).
      • w/s (writes per second) and wkB/s (write throughput) should increase significantly.
      • r/s and rkB/s should remain low.
      • Observe w_await (write await time). Does it increase? How much?
      • Observe %util. It should increase, potentially reaching 100% if dd can write faster than the disk can handle.
      • Observe aqu-sz (queue size). Does it grow?
    • Terminal B (iotop):
      • The dd process should appear prominently, showing high DISK WRITE values.
      • Note the IO> percentage for dd.
      • You might also see related kernel threads like jbd2 or kworker doing I/O, especially without oflag=direct.
  6. Clean Up Write Test File (Terminal C):

    rm ./testfile
    

  7. Generate Read Load (Terminal C):

    • First, create a file to read from (if you removed the previous one). We want physical reads, so ideally, clear caches first (requires root).
      # Create the file again (can use cache this time)
      dd if=/dev/zero of=./testfile bs=1M count=1024 status=progress
      
      # Clear caches (optional, requires root)
      sync
      echo 3 | sudo tee /proc/sys/vm/drop_caches
      
      # Now read the file using dd and discard the output
      dd if=./testfile of=/dev/null bs=1M iflag=direct status=progress
      # iflag=direct: Try to bypass cache for reading.
      # of=/dev/null: Discard the data read.
      
  8. Observe While Under Read Load:

    • Terminal A (iostat): Watch the line for your target device.
      • r/s (reads per second) and rkB/s (read throughput) should increase significantly.
      • w/s and wkB/s should remain low (unless metadata updates cause small writes).
      • Observe r_await (read await time).
      • Observe %util.
    • Terminal B (iotop):
      • The dd process should appear, showing high DISK READ values.
  9. Clean Up (Terminal C):

    rm ./testfile
    

  10. Stop Monitoring: Press Ctrl+C in Terminals A and B.

Conclusion: You generated controlled disk write and read loads using dd. You used iostat to observe key performance indicators like IOPS (r/s, w/s), throughput (rkB/s, wkB/s), latency (r_await, w_await), queue size (aqu-sz), and saturation (%util) for the specific device. You also used iotop to pinpoint the dd process as the source of the I/O activity. Analyzing these metrics helps you understand your storage performance limits and identify potential disk bottlenecks affecting applications. High await times (e.g., > 10-20ms for many workloads, though acceptable values vary greatly) and high %util are key signs of a bottleneck.

5. Network Monitoring and Analysis

Network performance is crucial for servers, workstations accessing network resources, and virtually any system connected to the internet. Monitoring network traffic helps diagnose connectivity issues, identify bandwidth hogs, detect security anomalies, and ensure services are reachable.

Key Network Concepts

  • Bandwidth: The maximum theoretical data transfer rate of a network link, often measured in Mbps (Megabits per second) or Gbps (Gigabits per second).
  • Throughput: The actual measured data transfer rate being achieved, usually lower than the theoretical bandwidth due to overhead, latency, congestion, etc. Measured in Mbps, Gbps, or often KB/s, MB/s in monitoring tools.
  • Latency: The time delay for a packet to travel from source to destination and back (Round Trip Time - RTT), typically measured in milliseconds (ms). High latency impacts interactive applications (SSH, web browsing) and protocols sensitive to delays.
  • Packets: Data is transmitted over networks in small units called packets. Monitoring includes packets sent (TX) and received (RX) per second.
  • Errors and Drops: Packets that were received corrupted (errors) or discarded (drops) usually due to network congestion, faulty hardware, or configuration issues. Non-zero error/drop counts indicate network problems.
  • Sockets and Connections: Network communication occurs via sockets. A socket is an endpoint defined by an IP address and a port number.
    • TCP (Transmission Control Protocol): Connection-oriented, reliable protocol (e.g., for HTTP, SSH, FTP). Connections go through states like LISTEN (waiting for incoming connection), ESTABLISHED (active connection), TIME_WAIT (waiting after connection close), CLOSE_WAIT, etc.
    • UDP (User Datagram Protocol): Connectionless, unreliable protocol (e.g., for DNS, DHCP, some streaming). Simpler, less overhead, but no guaranteed delivery.
  • Network Interface: The hardware (e.g., eth0, enp3s0, wlan0) or virtual device (e.g., lo - loopback) through which the system communicates with the network.
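
A handy way to get a feel for connection activity on a busy system is to count sockets per TCP state; a minimal sketch:

ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn
# Counts TCP sockets grouped by state (LISTEN, ESTAB, TIME-WAIT, ...)

ss -s
# Or let ss print its own summary of socket counts by type and state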

Tools for Network Monitoring

  • ip: The modern standard Linux tool for displaying and manipulating routing, network devices, interfaces, and tunnels. Replaces older tools like ifconfig and route.

    • ip addr show (or ip a): Show IP addresses and details for all interfaces. Look for RX/TX packet/byte counts, errors, drops.
    • ip -s link show <interface>: Show detailed statistics (-s) for a specific interface, including byte/packet counts, errors, drops, multicast.
    # Example (ip -s link show eth0)
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
        link/ether 00:11:22:33:44:55 brd ff:ff:ff:ff:ff:ff
        RX: bytes  packets  errors  dropped missed  mcast
        1234567890 1000100  0       5       0       1500
        TX: bytes  packets  errors  dropped carrier collsns
        987654321  950050   0       0       0       0
    
  • ss: The modern standard tool for investigating sockets. Replaces the older netstat. Excellent for seeing active connections and listening ports.

    • ss -tulnp: Show listening (-l) TCP (-t) and UDP (-u) sockets, disable name resolution (-n - faster), and show the process (-p) using the socket (requires root/sudo for -p). Very common and useful.
    • ss -tan: Show all (-a) TCP (-t) sockets, numeric (-n). Useful for seeing all active, listening, and waiting connections.
    • ss -tun: Show TCP (-t) and UDP (-u) sockets, numeric (-n).
    • ss -s: Show summary statistics for connections by state.

    # Example (sudo ss -tulnp)
    State    Recv-Q   Send-Q     Local Address:Port      Peer Address:Port   Process
    LISTEN   0        4096       127.0.0.53%lo:53             0.0.0.0:*       users:(("systemd-resolve",pid=678,fd=13))
    LISTEN   0        128            0.0.0.0:22             0.0.0.0:*       users:(("sshd",pid=889,fd=3))
    LISTEN   0        128               [::]:22                [::]:*       users:(("sshd",pid=889,fd=4))
    
    # Example (ss -tan state established)
    State        Recv-Q  Send-Q    Local Address:Port     Peer Address:Port   Process
    ESTAB        0       0         192.168.1.100:22       192.168.1.50:54321
    ESTAB        0       0         192.168.1.100:44332    104.18.32.111:443
    

  • netstat (Legacy): Still found on many systems but generally superseded by ip and ss. Common legacy usage:

    • netstat -tulnp: Similar to ss -tulnp.
    • netstat -i: Show interface statistics (similar to ip -s link).
    • netstat -r: Show routing table (use ip route instead).
  • iftop: A top-like utility for displaying bandwidth usage on an interface per connection. Excellent for identifying which hosts/ports are consuming the most bandwidth in real-time. Requires root privileges and installation (sudo apt install iftop or sudo yum install iftop).

    • sudo iftop: Monitors the first detected external interface.
    • sudo iftop -i <interface>: Specify which interface to monitor (e.g., eth0).
    • Interactive keys: n (toggle DNS resolution), s/d (toggle source/destination host display), p (toggle port display), L (toggle between linear and logarithmic bar scale), q (quit). Start iftop with the -B option if you prefer rates shown in bytes rather than bits per second.
  • nload: A simple command-line tool that displays network traffic (incoming/outgoing throughput) as graphs. Easy to quickly visualize current load. (Needs installation: sudo apt install nload or sudo yum install nload).

    • nload: Monitors all auto-detected interfaces (use arrows to switch).
    • nload <interface>: Monitor a specific interface.
  • ping: Sends ICMP ECHO_REQUEST packets to a host to test reachability and measure round-trip time (latency).

    • ping google.com
    • ping -c 5 8.8.8.8: Send 5 pings to IP 8.8.8.8.
  • traceroute / mtr: Shows the path (route) packets take to reach a destination host, displaying latency to each hop along the way. Useful for diagnosing network path issues. mtr provides a dynamic, updating view combining ping and traceroute.

    • traceroute google.com
    • mtr google.com (often preferred, may need installation).
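
mtr can also run non-interactively and print a one-shot report, which is convenient for sharing results:

mtr -rw -c 10 google.com
# -r: report mode (run, print a summary, exit)
# -w: wide report (do not truncate hostnames)
# -c 10: send 10 probes to each hop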

Workshop Monitoring Network Activity During a Download

Goal: To generate network traffic by downloading a file and observe the activity using ss, iftop, nload, and ip.

Tools Required: wget or curl (usually pre-installed), ss, ip, iftop, nload.

Steps:

  1. Install Tools (if needed):

    • Ensure iftop and nload are installed.
    • Debian/Ubuntu: sudo apt update && sudo apt install iftop nload wget
    • CentOS/RHEL/Fedora: sudo yum install iftop nload wget or sudo dnf install iftop nload wget
  2. Identify Network Interface:

    • Run ip a. Identify your primary active network interface (e.g., eth0, enp3s0, wlan0). It will have your main IP address. Let's assume it's eth0.
  3. Establish Baseline:

    • Open three terminals (A, B, C).
    • Terminal A: Run sudo iftop -i eth0 (replace eth0 if needed). Observe the baseline traffic (likely low). Note the scale: iftop reports rates in bits per second by default (Kb, Mb); restart it with the -B option if you prefer bytes per second.
    • Terminal B: Run nload eth0 (replace eth0 if needed). Observe the baseline incoming/outgoing graphs.
    • Terminal C: This will be used to generate traffic. Check initial connection state: ss -tan state established (should show few or no relevant connections). Check listening ports: sudo ss -tulnp.
  4. Generate Network Load (Terminal C):

    • Use wget to download a reasonably large file from a fast source. Linux distribution ISOs are good candidates. (Example: Ubuntu 22.04 Desktop ISO ~4.7GB). Find a current mirror link.
      # Example - find a current link from a distribution's website
      # wget http://releases.ubuntu.com/22.04.3/ubuntu-22.04.3-desktop-amd64.iso -O /dev/null
      # -O /dev/null: Download the file but discard it (saves disk space)
      # If the download is too slow/fast, find a different file/mirror.
      
    • Let the download run for a minute or two.
  5. Observe While Under Load:

    • Terminal A (iftop):
      • You should see a prominent connection between your host's IP and the download server's IP/hostname.
      • The "<=" line (traffic received by your host) should show significant bandwidth usage, matching the download speed reported by wget.
      • The "=>" line (traffic sent by your host) will show some traffic (TCP acknowledgements) but much less.
      • Observe the peak and cumulative transfer rates at the top.
      • Press p to toggle port display. You should see the connection using port 80 (HTTP) or 443 (HTTPS).
    • Terminal B (nload):
      • The "Incoming" graph should show significant activity, corresponding to the download speed.
      • The "Outgoing" graph should show much lower activity.
      • Note the current, average, max, and total transfer values.
    • Terminal C (Check connections while wget runs, maybe open another tab/terminal D):
      • Run ss -tan state established | grep '<server_ip>' (replace <server_ip> with the IP iftop shows for the download server, or the port like :80 or :443). You should see the active TCP connection used by wget.
      • Run ip -s link show eth0 (replace eth0). Compare the RX bytes and packets counts before and during/after the download. They should have increased significantly. Check errors and dropped counts - hopefully, they remain 0.
  6. Stop the Download: Press Ctrl+C in the wget terminal (Terminal C).

  7. Observe After Load:

    • Watch iftop and nload. The high traffic rates should quickly drop back to baseline levels.
    • Check ss -tan state established again. The connection to the download server should eventually disappear (might enter TIME_WAIT state first).
  8. Stop Monitoring: Press q in iftop and Ctrl+C in nload.

Conclusion: You generated network traffic using wget and monitored it effectively. iftop helped identify the specific connection responsible for the bandwidth usage and the hosts involved. nload provided a simple visual representation of the throughput. ss allowed you to inspect the state of the underlying TCP socket, and ip -s link provided cumulative statistics for the interface, including vital error counts. These tools are essential for understanding network utilization and diagnosing connectivity or performance problems.

6. Process Management and Control

Monitoring tells you what processes are doing and how they are using resources. Process management is about controlling those processes – terminating misbehaving ones, adjusting their priority, or starting them with specific characteristics.

Key Process Management Concepts

  • Process ID (PID): A unique number assigned to each running process. Used to identify the target for management commands.
  • Parent Process ID (PPID): The PID of the process that created this process. Forms a hierarchy or tree. Process 1 (init or systemd) is the ancestor of most user processes.
  • Process States: As seen in top/htop/ps:
    • R (Running or Runnable): Either actively using the CPU or waiting in the run queue for its turn.
    • S (Interruptible Sleep): Waiting for an event (e.g., I/O completion, signal, timer). Most processes spend most of their time in this state.
    • D (Uninterruptible Sleep): Waiting directly on hardware (usually disk I/O), cannot be interrupted by signals. Processes stuck in D state can indicate hardware or driver problems and are difficult to kill.
    • Z (Zombie): Process has terminated, but its exit status hasn't been collected by its parent process yet. It consumes minimal resources (just an entry in the process table). Persistent zombies usually indicate a bug in the parent process.
    • T (Stopped or Traced): Process execution has been suspended, usually by a signal like SIGSTOP (e.g., pressing Ctrl+Z in the terminal) or because it's being debugged (ptrace).
  • Signals: A standard mechanism in Unix-like systems for notifying processes of events or requesting actions. Processes can react to signals in predefined ways, ignore them, or be forcibly terminated. Common signals:
    • SIGTERM (15): The standard "polite" request to terminate. Allows the process to shut down gracefully (save files, close connections, etc.). This is the default signal sent by kill.
    • SIGKILL (9): The "force kill" signal. The kernel terminates the process immediately without giving it a chance to clean up. Should be used as a last resort if SIGTERM fails, as it can lead to data loss or corruption. Processes in D state usually cannot be killed even by SIGKILL.
    • SIGHUP (1): Hang Up signal. Historically used when a terminal connection was lost. Often used now to signal daemons to reload their configuration files.
    • SIGINT (2): Interrupt signal. Sent when you press Ctrl+C in the terminal. Usually requests termination.
    • SIGQUIT (3): Quit signal. Sent by Ctrl+\. Similar to SIGINT but can also trigger a core dump.
    • SIGSTOP (19): Stop signal. Suspends process execution (puts it in T state). Cannot be caught or ignored.
    • SIGCONT (18): Continue signal. Resumes a stopped process.
  • Priority and Niceness: Linux uses a priority system to schedule which runnable process gets CPU time next.
    • Priority: Internal kernel value (0-139). Lower number means higher priority. 0-99 are for real-time processes, 100-139 for user-space tasks.
    • Nice Value: User-space control (-20 to +19). Maps onto the priority range 100-139.
      • -20: Highest user-space priority (most likely to get CPU time).
      • 0: Default priority.
      • +19: Lowest user-space priority (will only run when higher priority tasks are idle).
    • Only root can increase a process's priority (decrease its nice value below 0). Any user can decrease their own process's priority (increase its nice value).
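
To make the difference between catchable and uncatchable signals concrete, here is a minimal shell sketch (the filename trap-demo.sh is just an illustration). It traps SIGTERM and SIGHUP, so kill <PID> and kill -HUP <PID> run the handlers, while kill -9 <PID> ends the script immediately with no chance to clean up:

#!/bin/bash
# trap-demo.sh - tiny signal-handling illustration (hypothetical filename)
trap 'echo "Caught SIGTERM, cleaning up..."; exit 0' TERM
trap 'echo "Caught SIGHUP, pretending to reload configuration..."' HUP
echo "My PID is $$ - try: kill $$   kill -HUP $$   kill -9 $$"
while true; do sleep 1; done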

Process Management Commands

  • kill <PID>: Sends a signal to a process specified by its PID.

    • kill 12345: Sends SIGTERM (15) to PID 12345 (requests graceful shutdown).
    • kill -9 12345 or kill -SIGKILL 12345: Sends SIGKILL (9) to PID 12345 (force kill). Use with caution!
    • kill -l: Lists all available signal names and numbers.
    • kill -HUP 6789 or kill -1 6789: Sends SIGHUP (1) to PID 6789 (often for config reload).
  • pkill <pattern>: Sends a signal to processes matching a pattern (usually the process name).

    • pkill firefox: Sends SIGTERM to all processes named firefox.
    • pkill -9 -u student sleep: Sends SIGKILL to all processes named sleep owned by user student.
    • pkill -f "python .*my_script\.py": Sends SIGTERM to processes whose full command line matches the pattern (-f flag). Be careful with patterns!
  • killall <process_name>: Similar to pkill, but matches exact process names only (unless options like -r for regex are used). Behavior can sometimes differ slightly from pkill.

    • killall nginx: Sends SIGTERM to all processes exactly named nginx.
    • killall -s SIGHUP nginx: Sends SIGHUP to nginx processes.
  • nice -n <niceness> <command>: Starts a command with a specific nice value.

    • nice -n 10 ./my_cpu_intensive_script.sh: Runs the script with reduced priority (nice value 10).
    • sudo nice -n -5 ./important_task: Runs task with increased priority (nice value -5, requires root).
  • renice <niceness> -p <PID>: Changes the nice value of a running process.

    • renice 15 -p 12345: Decreases the priority of PID 12345 (sets nice value to 15).
    • sudo renice -10 -p 6789: Increases the priority of PID 6789 (sets nice value to -10, requires root).
    • renice 5 -u student: Attempts to set the nice value to 5 for all processes owned by user student.
  • pgrep <pattern>: Finds PIDs matching a pattern (useful for getting the PID to use with kill or renice).

    • pgrep firefox: Prints PIDs of firefox processes.
    • pgrep -u root sshd: Prints PIDs of sshd processes owned by root.
    • pgrep -f "my_script\.py": Prints PIDs matching the full command line.
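
These commands combine naturally in small scripts. For example, a hedged sketch (the process name ffmpeg is only a placeholder) that lowers the priority of every matching process you own:

# Set nice value 10 on every process named 'ffmpeg' owned by the current user
for pid in $(pgrep -u "$USER" ffmpeg); do
    renice 10 -p "$pid"
done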

Workshop Managing Process States and Priorities

Goal: To practice starting processes, finding their PIDs, sending signals (SIGTERM, SIGKILL, SIGSTOP, SIGCONT), and adjusting priorities using nice and renice.

Tools Required: sleep, yes, ps, pgrep, kill, nice, renice, top or htop.

Steps:

  1. Start Sample Processes:

    • Open two or three terminals (A, B, C).
    • Terminal A: Start a simple background process that does nothing but wait.
      sleep 1000 &
      # Note the PID printed, e.g., [1] 23456
      
    • Terminal B: Start a CPU-intensive process in the background. The yes command outputs 'y' (or its argument) repeatedly, consuming CPU.
      yes > /dev/null &
      # Note the PID printed, e.g., [1] 23458
      
  2. Identify Processes:

    • Terminal C: Use ps and pgrep to find the PIDs.
      ps aux | grep sleep
      pgrep sleep
      ps aux | grep yes
      pgrep yes
      # Identify the PIDs for the 'sleep 1000' and 'yes > /dev/null' commands you started.
      # Let's assume sleep PID is 23456 and yes PID is 23458.
      
    • Run htop or top. Find both processes. Note their default NI (Nice) value (usually 0) and PR (Priority). The yes process should show high %CPU usage.
  3. Terminate Gracefully (SIGTERM):

    • Terminal C: Send SIGTERM to the sleep process using its PID.
      kill 23456
      
    • Check Terminal A. You should see a "Terminated" message.
    • Verify with ps aux | grep 23456 | grep -v grep or pgrep sleep. It should be gone.
  4. Attempt Graceful, Then Force Kill (SIGTERM -> SIGKILL):

    • Some processes ignore SIGTERM or take a long time to shut down. yes actually exits promptly on SIGTERM, so for this exercise we will only pretend that it did not.
    • Terminal C: Send SIGTERM to the yes process.
      kill 23458
      
    • Check Terminal B. It should terminate almost immediately.
    • For practice: Restart yes > /dev/null & in Terminal B and get its new PID (e.g., 23460). Now, pretend SIGTERM didn't work and you need to force it.
      kill -9 23460
      # OR
      kill -SIGKILL 23460
      
    • Check Terminal B. It should show "Killed". Verify with ps or pgrep that it's gone.
  5. Stop and Continue a Process (SIGSTOP, SIGCONT):

    • Terminal B: Start yes > /dev/null & again. Get its new PID (e.g., 23462).
    • Terminal C: Observe the yes process in htop/top. Note its high %CPU and R (Running) state.
    • Terminal C: Send the SIGSTOP signal.
      kill -SIGSTOP 23462
      
    • Observe in htop/top. The yes process's state (S column) should change to T (Stopped), and its %CPU usage should drop to 0.
    • Terminal C: Send the SIGCONT signal to resume it.
      kill -SIGCONT 23462
      
    • Observe in htop/top. The yes process should return to the R state and resume consuming CPU.
    • Clean up: kill 23462 (send SIGTERM).
  6. Run with Lower Priority (nice):

    • Terminal B: Start yes with a lower priority (higher nice value).
      nice -n 15 yes > /dev/null &
      # Get its PID (e.g., 23464)
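      # Optional extension (assumes 'taskset' from util-linux is installed):
      # pin two competing hogs to the same CPU core so niceness visibly matters;
      # the nice-15 process should then receive far less CPU time than the other.
      #   taskset -c 0 yes > /dev/null &
      #   taskset -c 0 nice -n 15 yes > /dev/null &
      # (kill these extra processes again before continuing)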
      
    • Terminal C: Observe in htop/top. Find the new yes process. Its NI value should be 15. Its PR value will be higher (lower priority) than the default. If other CPU-bound tasks were running at default priority, this yes process would get less CPU time.
  7. Change Priority of Running Process (renice):

    • Keep the yes process from step 6 (PID 23464, nice 15) running.
    • Terminal C: Change its priority back to the default nice value (0).
      renice 0 -p 23464
      
    • Observe in htop/top. The NI value for PID 23464 should change back to 0.
    • Terminal C: Try to increase its priority (lower nice value) without sudo.
      renice -5 -p 23464
      # This should fail with a "Permission denied" error.
      
    • Terminal C: Increase its priority using sudo.
      sudo renice -5 -p 23464
      
    • Observe in htop/top. The NI value should now be -5, and the PR value should be lower (higher priority).
    • Clean up: sudo kill 23464 (or just kill 23464).

Conclusion: You've practiced finding processes using ps and pgrep. You learned how to terminate processes using SIGTERM (graceful) and SIGKILL (forceful). You experimented with stopping (SIGSTOP) and resuming (SIGCONT) processes. Finally, you used nice to start a process with adjusted priority and renice to change the priority of a running process, observing the effects on the Nice (NI) value in top/htop and understanding the permissions required. These commands give you direct control over running tasks on your system.

7. System Logging

System logs are chronological records of events occurring on the system, generated by the kernel, system services, and applications. They are indispensable for troubleshooting problems, auditing security events, and understanding system behavior over time. Modern Linux systems primarily use systemd-journald, while traditional syslog also remains relevant.

Modern Logging systemd-journald

systemd-journald is a system service that collects and stores logging data. It captures syslog messages, kernel messages, standard output/error of services, and more. Its key features include:

  • Structured Logging: Logs can include key-value pairs (metadata) beyond the simple message string, allowing for powerful filtering.
  • Indexing: Logs are indexed, making searching and filtering very fast.
  • Centralized Collection: Gathers logs from various sources into one journal.
  • Volatility Control: Can store logs persistently on disk (usually under /var/log/journal) or just in memory (/run/log/journal). Configuration is in /etc/systemd/journald.conf.
  • Integration with systemd units: Easy to view logs specific to a service managed by systemd.

The journalctl Command:

This is the primary tool for querying the systemd journal.

  • journalctl: Show the entire journal (newest entries last). Press q to quit, use arrows/PageUp/PageDown to navigate.
  • journalctl -r: Show the journal in reverse order (newest entries first).
  • journalctl -n 20: Show the last 20 log entries.
  • journalctl -f: Follow the journal in real-time (like tail -f). New entries are printed as they arrive. Ctrl+C to exit.
  • journalctl -u <unit_name>: Show logs only for a specific systemd unit (service or target). Very useful!
    • journalctl -u sshd (Show logs for the SSH daemon service)
    • journalctl -u nginx.service
  • journalctl /path/to/executable: Show logs generated by a specific program.
    • journalctl /usr/sbin/sshd
  • journalctl --since "YYYY-MM-DD HH:MM:SS": Show logs since a specific time.
    • journalctl --since "2023-10-27 09:00:00"
    • journalctl --since "1 hour ago"
    • journalctl --since yesterday
  • journalctl --until "YYYY-MM-DD HH:MM:SS": Show logs until a specific time. Can be combined with --since.
  • journalctl -p <priority>: Filter by message priority. Priorities are: emerg (0), alert (1), crit (2), err (3), warning (4), notice (5), info (6), debug (7). Filtering by err also shows higher priorities (crit, alert, emerg).
    • journalctl -p err (Show errors and worse)
    • journalctl -p 3 (Same as above)
    • journalctl -p err..warning (Show only messages of priority err or warning)
  • journalctl _PID=<pid>: Show logs for a specific Process ID.
  • journalctl _COMM=<command_name>: Show logs for processes with a specific command name.
  • journalctl -k: Show only kernel messages (equivalent to dmesg).
  • journalctl -b: Show messages from the current boot.
    • journalctl -b -1: Show messages from the previous boot.
  • journalctl --disk-usage: Show how much disk space the persistent journal logs are using.
  • journalctl --vacuum-size=1G: Reduce journal size on disk to 1 Gigabyte (removes oldest logs).
  • journalctl --vacuum-time=2weeks: Remove journal entries older than two weeks.
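
These filters combine freely on one command line. As a rough sketch (it assumes the default short output format, where the fifth column is the syslog identifier), the following counts error-or-worse messages since boot per reporting program:

# Count err-and-worse messages since boot, grouped by the program that logged them
journalctl -b -p err --no-pager | grep -v '^--' | awk '{print $5}' | sort | uniq -c | sort -rn | head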

Traditional syslog (rsyslog, syslog-ng)

Before systemd-journald, syslog was the standard logging mechanism. Many systems still run a syslog daemon (like rsyslog - the most common, or syslog-ng) alongside journald. journald often forwards messages to rsyslog for traditional file-based logging.

  • Configuration: Typically /etc/rsyslog.conf and files within /etc/rsyslog.d/. These files define rules based on "facility" (type of program generating the message, e.g., kern, auth, mail, cron) and "priority" (same levels as journalctl) to determine which log file to write messages to.
  • Common Log Files (under /var/log/):
    • /var/log/syslog or /var/log/messages: General system messages.
    • /var/log/auth.log or /var/log/secure: Authentication-related messages (logins, sudo, ssh).
    • /var/log/kern.log: Kernel messages.
    • /var/log/dmesg: Kernel ring buffer messages from boot time (often overwritten or rotated).
    • /var/log/cron.log or within syslog/messages: Cron job execution logs.
    • /var/log/boot.log: System boot messages.
    • Application-specific logs: Many applications (like Apache, Nginx, databases) manage their own logs, often also under /var/log/.
  • Tools for Reading Text Logs:
    • tail -f <logfile>: Follow a specific log file in real-time.
    • less <logfile>: View a log file with scrolling and searching capabilities.
    • grep <pattern> <logfile>: Search for specific patterns within a log file.
    • zcat, zless, zgrep: Used to view/search compressed log files (often ending in .gz).
  • Log Rotation: Log files can grow indefinitely. The logrotate utility (configured via /etc/logrotate.conf and /etc/logrotate.d/) automatically manages log files – rotating them (e.g., renaming syslog to syslog.1), compressing old logs (syslog.1.gz), and eventually deleting the oldest ones to prevent disk space exhaustion.
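
As a concrete illustration of such a rule, a hypothetical drop-in like /etc/logrotate.d/myapp (the application name and path are placeholders) could contain standard logrotate directives such as:

/var/log/myapp/*.log {
    weekly          # rotate once per week
    rotate 4        # keep four old generations
    compress        # gzip rotated files
    missingok       # no error if the log file is absent
    notifempty      # skip rotation when the file is empty
}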

Workshop Exploring System Logs with journalctl and Text Files

Goal: To practice querying system logs using journalctl for systemd-based logging and standard tools for traditional text log files.

Tools Required: journalctl, logger, tail, less, grep, sudo.

Steps:

  1. Generate a Custom Log Message:

    • The logger command sends a message to the system logger (journald and/or syslog).
    • Open a terminal (A).
      logger "STUDENT_WORKSHOP_TEST_MESSAGE - Step 1"
      
  2. Find the Message with journalctl:

    • Terminal A:
      # View recent logs, look for your message
      journalctl -n 50
      
      # Filter by syslog tag. Note: logger's default tag is your username
      # (use 'logger -t <tag>' to set an explicit one), so filter on that:
      journalctl -t "$USER"
      
      # Follow the logs and generate another message
      journalctl -f &
      # Note the PID of the background journalctl process if needed later
      logger "STUDENT_WORKSHOP_TEST_MESSAGE - Step 2 Following"
      # You should see the Step 2 message appear immediately in the journalctl -f output.
      # Press Ctrl+C to stop following (or kill the background PID if needed)
      
  3. Explore journalctl Filtering:

    • Terminal A:
      # View logs from the SSH service (replace sshd if using a different name)
      journalctl -u sshd -n 20 -r # Last 20 sshd logs, newest first
      
      # View kernel messages from this boot
      journalctl -k -b 0 -n 30
      
      # View error messages (priority err or higher) from the last hour
      journalctl -p err --since "1 hour ago"
      
      # View logs for your current login session (find your session PID if needed)
      # Example: Find your shell's PID: echo $$
      # journalctl _PID=$(echo $$) # May not show much unless shell logs directly
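      # A per-user filter that is often more useful: match the journal's _UID field
      # journalctl _UID=$(id -u) -n 20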
      
  4. Explore Traditional Log Files (if applicable):

    • System configuration varies. journald might be the primary store, or logs might also be written to /var/log.
    • Terminal A: Check if the logger message went to a text file.
      # Check common locations
      sudo grep "STUDENT_WORKSHOP_TEST_MESSAGE" /var/log/syslog
      sudo grep "STUDENT_WORKSHOP_TEST_MESSAGE" /var/log/messages
      # If found, view the end of that file
      sudo tail -n 20 /var/log/syslog # (replace filename if needed)
      
    • Examine authentication logs:
      # View last 30 lines of auth log (use less for full view)
      sudo tail -n 30 /var/log/auth.log # (or /var/log/secure)
      sudo less /var/log/auth.log     # (Press 'q' to exit less)
      # Look for login attempts, sudo usage etc.
      
    • Examine kernel logs:
      sudo less /var/log/kern.log
      # Look for hardware detection, driver messages etc.
      
  5. Simulate an Event and Find Logs:

    • Terminal B: Attempt an invalid SSH login to your own machine (it will fail).
      ssh non_existent_user@localhost
      # Enter a dummy password when prompted. It should fail.
      
    • Terminal A: Look for evidence of the failed login attempt.
      # Check journald for sshd messages
      journalctl -u sshd -n 10 -r
      
      # Check traditional auth log
      sudo tail -n 10 /var/log/auth.log # (or /var/log/secure)
      # You should see lines indicating a failed password attempt for 'non_existent_user'.
      
  6. Check Log Rotation Config (Optional):

    • Terminal A: Look at the main logrotate configuration and specific rules.
      less /etc/logrotate.conf
      ls /etc/logrotate.d/
      less /etc/logrotate.d/rsyslog # (Example specific rule)
      
    • Look for settings like daily, weekly, rotate 4, compress, size.

Conclusion: You practiced using journalctl to view, follow, and filter logs from the systemd journal based on time, unit, priority, and other metadata. You also used traditional tools (tail, less, grep) to examine text-based log files in /var/log (if configured). You generated specific log entries using logger and simulated an event (failed SSH login) to practice finding relevant log information for troubleshooting. Understanding how to navigate and interpret system logs is a fundamental troubleshooting skill.

8. Advanced Monitoring and Resource Control

Beyond the standard command-line tools, several more advanced utilities offer consolidated views or finer control over system resources. Control Groups (cgroups) are a powerful kernel feature for limiting and isolating resource usage.

glances The All-in-One Monitoring Tool

glances is a cross-platform, curses-based monitoring tool written in Python. It aims to present a large amount of information from various system resources in a single view, dynamically adapting to the terminal size. It combines aspects of top, htop, iostat, nload, free, and more.

  • Features: CPU, Memory, Load, Process List, Network I/O, Disk I/O, Filesystem Usage, Sensors (temperature/voltage, if supported), Docker container stats, alerts, web UI, REST API.
  • Installation: Often requires installation (sudo apt install glances or sudo pip install glances). May need extra Python libraries for optional features (like sensors, Docker).
  • Usage: Simply run glances.
  • Interactive Keys: Similar to htop: q (Quit), 1 (Toggle per-CPU), m (Sort by MEM%), p (Sort by CPU%), i (Sort by I/O rate), d (Show/hide disk I/O), n (Show/hide network I/O), f (Show/hide filesystem), s (Show/hide sensors), l (Show/hide logs/alerts), h (Help).

glances is excellent for getting a quick, comprehensive overview of the system's current state.
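
Beyond the interactive curses view, glances can change its refresh interval and serve the same data over HTTP (the web UI mentioned above may require the optional dependencies, e.g. the 'glances[all]' pip install shown later in the workshop):

# Refresh every 5 seconds instead of the default
glances -t 5

# Run in web-server mode; by default the UI is served on port 61208
glances -w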

Control Groups (cgroups)

Control Groups are a Linux kernel feature that allows you to allocate, limit, prioritize, and account for resource usage (CPU, memory, network bandwidth, disk I/O) for collections of processes.

  • Hierarchy: Cgroups are organized hierarchically, usually mounted under /sys/fs/cgroup/. Different resource controllers (like cpu, memory, blkio, net_cls) manage specific resources within this hierarchy.
  • Use Cases:
    • Resource Limiting: Prevent a group of processes (e.g., a specific user's tasks, a web server, a container) from consuming excessive memory or CPU, ensuring fairness and stability.
    • Prioritization: Allocate more CPU shares to critical applications.
    • Accounting: Measure resource consumption by specific groups.
    • Freezing/Thawing: Stop and resume all processes within a cgroup.
  • Management: While you can interact directly with the /sys/fs/cgroup/ filesystem (creating directories, writing values to control files like memory.limit_in_bytes or cpu.shares), it's complex. Higher-level tools often manage cgroups:
    • systemd: Heavily utilizes cgroups for managing services, user sessions, and scopes. You can set resource limits directly in systemd unit files (e.g., MemoryLimit=, CPUShares=, TasksMax=; on current cgroup-v2 systems the preferred property names are MemoryMax= and CPUWeight=). The systemd-run command can launch transient processes within a dedicated cgroup scope. A sketch of such a unit drop-in follows after this list.
    • Containerization Platforms (Docker, Podman, Kubernetes): Rely extensively on cgroups to isolate containers and enforce resource limits defined for them.
    • Dedicated tools like libcgroup-tools (provides cgcreate, cgexec, etc.) exist but are less commonly used directly now compared to systemd or container tools.
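
For a permanently running service, the same limits can be declared in the unit itself. A minimal sketch of a drop-in file (the service name myapp.service and the values are placeholders):

# /etc/systemd/system/myapp.service.d/limits.conf  (hypothetical drop-in)
[Service]
MemoryMax=512M
CPUQuota=50%
TasksMax=200

After creating such a drop-in, run sudo systemctl daemon-reload and restart the service for the limits to take effect.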

Example using systemd-run (Simplified):

# Run 'stress' allocating 1GB RAM, but limit its cgroup scope to 500MB
# This should cause 'stress' to be killed by the OOM killer within its cgroup
sudo systemd-run --scope -p MemoryLimit=500M stress --vm 1 --vm-bytes 1G

# Check system logs for OOM kill message related to the scope
journalctl -k | grep -i oom

Cgroups are a powerful but advanced topic. For most users and administrators, interaction happens indirectly via systemd service management or container platforms. Understanding the concept is important for comprehending modern Linux resource management.

Workshop Using glances and Experimenting with systemd-run Limits

Goal: To explore the comprehensive view provided by glances and demonstrate a basic resource limit using systemd-run and cgroups.

Tools Required: glances, stress, systemd-run, journalctl.

Steps:

  1. Install glances and stress (if needed):

    • Debian/Ubuntu: sudo apt update && sudo apt install glances stress
    • CentOS/RHEL/Fedora: sudo yum install glances stress or sudo dnf install glances stress
    • (Optional) For full features: sudo pip install 'glances[all]'
  2. Explore glances:

    • Open a terminal (A). Run glances.
    • Maximize the terminal window for the best view.
    • Observe the different sections: CPU (overall and per-core if toggled with 1), Load, Memory (including cache/available), Swap, Network I/O, Disk I/O, Filesystem usage, Process list.
    • Use the interactive keys:
      • m: Sort processes by memory.
      • p: Sort processes by CPU.
      • i: Sort processes by I/O.
      • d: Toggle Disk I/O section visibility.
      • n: Toggle Network I/O section visibility.
      • f: Toggle Filesystem section visibility.
      • l: Toggle Logs/Alerts section visibility (might show warnings/criticals).
      • h: View the help screen.
      • q: Quit glances.
    • Run glances again. While it's running, generate some load in another terminal (B) (e.g., stress --cpu 1 or dd if=/dev/zero of=test bs=1M count=100 oflag=direct; delete the test file afterwards with rm test). Observe how glances reflects the CPU or Disk I/O load in real-time. Stop the load generation and quit glances.
  3. Experiment with systemd-run Memory Limit:

    • Terminal A: Prepare to watch the system logs for OOM (Out Of Memory) events.
      journalctl -kf &
      # -k: kernel messages, -f: follow
      
    • Terminal B: Run the stress command within a cgroup scope limited to 100MB of memory, but ask stress to allocate 200MB.
      sudo systemd-run --unit=stress-test --scope -p MemoryLimit=100M stress --vm 1 --vm-bytes 200M --verbose
      # --unit=stress-test: Gives the transient unit a name
      # --scope: Run the command in a transient scope unit (it stays attached to this shell instead of becoming a managed service)
      # -p MemoryLimit=100M: Apply a 100MB memory limit via cgroup memory controller
      # stress --vm 1 --vm-bytes 200M: Ask stress to allocate 200MB
      # --verbose: Make stress print more info
      
    • Observe Terminal B: The stress command will likely run for a short time and then terminate abruptly. You might see an error message from stress or just notice it exits.
    • Observe Terminal A: Watch the journalctl output. You should see kernel messages indicating an OOM kill event, mentioning the stress process and likely the stress-test.scope cgroup being killed due to exceeding its MemoryLimit. It might look something like:
      Memory cgroup out of memory: Killed process 12345 (stress) total-vm:..., anon-rss:..., file-rss:..., shmem-rss:...
      oom_reaper: reaped process 12345 (stress), now reaping process ...
      
    • Stop the journalctl -f process in Terminal A (Ctrl+C).
  4. Experiment with systemd-run CPU Limit (Optional):

    • CPU limits are often set using shares (CPUShares) or quotas (CPUQuota). Shares are relative priorities when CPU is contended, while quota is a hard limit on CPU time percentage. Let's try quota.
    • Terminal A: Run htop or top.
    • Terminal B: Run stress normally first to see it use 100% of one core.
      stress --cpu 1 &
      # Observe in htop/top - it uses 100% CPU (on one core).
      kill %1 # Kills the background job
      
    • Terminal B: Now run it within a scope limited to 20% CPU time.
      # CPUQuota=20% means the process(es) in this cgroup can use max 20% of one CPU core's time
      sudo systemd-run --unit=stress-cpu-test --scope -p CPUQuota=20% stress --cpu 1 &
      
    • Observe Terminal A (htop/top): Find the new stress process running within the stress-cpu-test.scope. Its CPU usage should be capped at approximately 20%, even though it's trying to run flat out.
    • Clean up: sudo killall stress or find the specific PID and kill it.
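
For a closer look at what systemd-run created, two companion tools that ship with systemd show the cgroup hierarchy and live per-cgroup resource usage; run them while one of the limited scopes is active:

# List the cgroup tree, including transient scopes such as stress-cpu-test.scope
systemd-cgls

# top-like live view of CPU, memory, and I/O per control group
sudo systemd-cgtop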

Conclusion: You used glances to get a consolidated, real-time view of system resources, demonstrating its utility as a comprehensive dashboard. You then experimented with systemd-run to leverage Control Groups (cgroups) for resource limiting. You successfully applied a MemoryLimit that triggered an OOM kill within the cgroup when exceeded, and you optionally applied a CPUQuota to restrict the CPU time available to a process. This demonstrates the power of cgroups in enforcing resource boundaries, a fundamental concept used heavily by systemd and containerization technologies.

Conclusion Summarizing Monitoring and Management

Effective system monitoring and resource management are not optional extras; they are fundamental requirements for maintaining stable, performant, and reliable Linux systems. Throughout this section, we've journeyed from basic real-time observation to specific resource analysis and active process control.

Key Takeaways:

  1. Real-Time Observation: Tools like top, htop, and glances provide immediate insight into the current state of CPU, memory, processes, and load average.
  2. Snapshot Analysis: ps gives detailed information about processes at a specific moment, crucial for scripting and targeted queries.
  3. Resource-Specific Tools: We delved into dedicated tools for deeper analysis:
    • CPU: mpstat (per-core stats), vmstat (run queue, context switches), uptime (load average). Understanding %user, %system, %idle, and especially %iowait is critical.
    • Memory: free (available vs free, cache), vmstat (swapping activity si/so), /proc/meminfo (details). Recognizing memory pressure and swapping is key.
    • Disk I/O: iostat (throughput, IOPS, await times, utilization), iotop (per-process I/O). High await times are a strong indicator of bottlenecks.
    • Network: ip (interface stats, errors), ss (socket states, connections), iftop/nload (real-time bandwidth per connection/interface), ping/mtr (latency/path).
  4. Process Control: We learned to manage processes using signals (kill, pkill, killall) for termination (SIGTERM, SIGKILL) or state changes (SIGSTOP, SIGCONT), and to influence scheduling with nice and renice.
  5. Logging: Understanding journalctl for querying the systemd journal and traditional tools (tail, grep, less) for /var/log files is essential for troubleshooting and auditing.
  6. Advanced Concepts: glances offers a unified view, and Control Groups (cgroups), often managed via systemd or container tools, provide powerful mechanisms for resource limiting and isolation.

The Continuous Cycle:

Monitoring and management form a continuous cycle:

  • Monitor: Regularly observe system metrics using appropriate tools.
  • Analyze: Interpret the data – identify trends, anomalies, bottlenecks, or errors.
  • Act: Take corrective action – kill runaway processes, adjust priorities, optimize configurations, plan for upgrades.
  • Verify: Confirm that the actions taken had the desired effect by monitoring again.

Mastering these tools and concepts empowers you to diagnose problems effectively, optimize performance proactively, and ensure the overall health and efficiency of your Linux environments. This knowledge is foundational for any system administrator, developer, or power user working with Linux. Remember that context is crucial – what constitutes "high" usage or a "problem" depends heavily on the specific system, its hardware, and its intended workload. Keep exploring, keep experimenting (safely!), and keep learning.