Author | Nejat Hakan |
nejat.hakan@outlook.de | |
PayPal Me | https://paypal.me/nejathakan |
System Monitoring and Resource Management
Introduction Monitoring and Management Essentials
Welcome to the critical domain of System Monitoring and Resource Management on Linux. In any computing environment, from a personal laptop to vast server farms, understanding what the system is doing, how its resources are being utilized, and how to manage those resources effectively is paramount. Without proper monitoring, diagnosing performance issues becomes guesswork, potential problems go unnoticed until they cause outages, and capacity planning is impossible. Without effective resource management, critical processes might starve while unimportant tasks consume valuable CPU time or memory, leading to instability and poor performance.
Why is this so important?
- Performance Optimization: By observing resource usage (CPU, Memory, Disk I/O, Network), you can identify bottlenecks. Is an application slow because the CPU is maxed out, it's waiting for disk access, or it's constantly swapping memory? Monitoring provides the answers needed to tune the system or application.
- Stability and Reliability: Unexpected resource exhaustion (e.g., running out of memory or disk space) is a common cause of system crashes or hangs. Continuous monitoring allows you to foresee these situations and take corrective action before they cause critical failures. Spotting runaway processes consuming excessive resources is key to maintaining stability.
- Troubleshooting: When things go wrong (and they inevitably do), system logs and real-time monitoring data are your primary tools for diagnosis. Understanding system metrics helps you correlate events and pinpoint the root cause of a problem, whether it's a hardware fault, a software bug, or a configuration issue.
- Security Auditing: Monitoring system logs and network connections can help detect unauthorized access attempts, unusual process activity, or other potential security breaches. Resource usage patterns can sometimes indicate malware activity.
- Capacity Planning: By tracking resource utilization trends over time, administrators can make informed decisions about future hardware needs. Do you need more RAM? Faster disks? A more powerful CPU? Or perhaps another server entirely? Monitoring data provides the justification for upgrades or scaling.
Key Resources We Monitor and Manage:
- CPU (Central Processing Unit): The "brain" of the computer. We monitor its utilization (how busy it is), load average (how many processes are waiting), and context switches.
- Memory (RAM & Swap): Random Access Memory is crucial for active processes. We monitor total usage, free memory, cached data, and swap space usage (virtual memory on disk). Excessive swapping is often a sign of insufficient RAM.
- Disk I/O (Input/Output): How quickly data can be read from and written to storage devices (HDDs, SSDs). We monitor throughput (MB/s), operations per second (IOPS), wait times, and device utilization. Slow disk I/O can severely impact overall system responsiveness.
- Network I/O: The rate at which data is sent and received over network interfaces. We monitor bandwidth usage, packet counts, errors, and established connections.
This section will guide you through the fundamental tools and concepts needed to effectively monitor your Linux systems and manage their resources. We will start with essential command-line tools, delve into specific resource monitoring techniques, explore process management, understand system logging, and touch upon more advanced tools and concepts like control groups. Each technical sub-section will be followed by a hands-on workshop to solidify your understanding.
1. Essential Real-Time Monitoring Tools
Before diving into specific resources, let's familiarize ourselves with the workhorses of real-time system monitoring on the command line. These tools provide a dynamic overview of the system's current state.
`top` - The Classic Task Manager

The `top` command provides a dynamic, real-time view of a running system. It displays system summary information as well as a list of tasks currently being managed by the kernel. Its output refreshes periodically (typically every 3 seconds), allowing you to observe changes as they happen.

Understanding the `top` Output:
The output is divided into two main parts: the summary area (top few lines) and the task area (the list of processes).
- Summary Area:

  ```
  top - 10:30:01 up 5 days, 1:15, 2 users, load average: 0.05, 0.15, 0.10
  ```

  - `10:30:01`: Current system time.
  - `up 5 days, 1:15`: System uptime (how long since the last boot).
  - `2 users`: Number of currently logged-in users.
  - `load average: 0.05, 0.15, 0.10`: System load average over the last 1, 5, and 15 minutes. This represents the average number of processes in the run queue (running or waiting for CPU time) plus those waiting for uninterruptible I/O. On a multi-core system, a load average equal to the number of CPU cores generally means the system is fully utilized. Values significantly higher indicate the system is overloaded.
  ```
  Tasks: 250 total, 1 running, 249 sleeping, 0 stopped, 0 zombie
  ```

  - Total number of processes, with a breakdown by state: Running (actively using CPU or ready to), Sleeping (waiting for an event or resource), Stopped (suspended, e.g., by `Ctrl+Z`), Zombie (terminated but waiting for the parent process to collect its exit status).
  ```
  %Cpu(s): 1.5 us, 0.8 sy, 0.0 ni, 97.5 id, 0.1 wa, 0.0 hi, 0.1 si, 0.0 st
  ```

  - CPU utilization breakdown (press `1` to toggle per-CPU view):
    - `us`: user space (running user processes)
    - `sy`: system/kernel space (running kernel tasks)
    - `ni`: nice (user processes with modified priority)
    - `id`: idle (CPU is not busy)
    - `wa`: wait (waiting for I/O operations to complete); high `wa` often indicates a disk bottleneck
    - `hi`: hardware interrupts
    - `si`: software interrupts
    - `st`: steal time (relevant in virtualized environments; time stolen by the hypervisor)
  ```
  MiB Mem : 15890.5 total, 8140.2 free, 4150.3 used, 3600.0 buff/cache
  MiB Swap: 2048.0 total, 2048.0 free, 0.0 used. 11250.8 avail Mem
  ```

  - Memory usage (RAM): total, free, used, and buffered/cached memory. Linux uses free RAM extensively for caching disk data (buffers/cache) to speed up access. This cache is readily relinquished if applications need the memory.
  - Swap usage (virtual memory): total, free, and used swap space. High swap usage usually indicates insufficient RAM for the current workload.
  - `avail Mem`: an estimate of how much memory is available for starting new applications without swapping. This is often a more useful metric than `free`.
- Task Area (Columns):
  - `PID`: Process ID (unique identifier).
  - `USER`: User owning the process.
  - `PR`: Priority (kernel scheduling priority).
  - `NI`: Nice value (user-space priority adjustment; lower means higher priority).
  - `VIRT`: Virtual memory size used by the process (KiB).
  - `RES`: Resident memory size (physical RAM used, KiB).
  - `SHR`: Shared memory size (KiB).
  - `S`: Process status (R=Running, S=Sleeping, D=Uninterruptible/Disk Sleep, Z=Zombie, T=Stopped/Traced).
  - `%CPU`: Percentage of CPU time used by the process since the last update.
  - `%MEM`: Percentage of physical RAM used by the process.
  - `TIME+`: Total CPU time consumed by the task (in hundredths of a second).
  - `COMMAND`: The command name or command line.
Interactive `top` Commands:

While `top` is running, press these keys:

- `q`: Quit `top`.
- `h`: Display help screen.
- `k`: Kill a process (you'll be prompted for the PID and signal).
- `r`: Renice a process (change its priority; prompts for PID and nice value).
- `f`: Fields management (add/remove/reorder columns).
- `o` or `O`: Change sorting order (prompts for sort field letter).
- `M`: Sort by memory usage (`%MEM`).
- `P`: Sort by CPU usage (`%CPU`).
- `T`: Sort by total CPU time (`TIME+`).
- `1`: Toggle summary CPU display between combined and per-CPU.
- `z`: Toggle color display.
- `c`: Toggle display between command name and full command line.
- `u`: Filter by user (prompts for username).
- `Spacebar` or `Enter`: Refresh display immediately.
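Beyond the interactive keys, `top` also has a batch mode (`-b`) that writes plain text to stdout, which is handy for logging or scripting; a quick sketch:

```shell
# Capture one non-interactive snapshot of top's output
# -b: batch mode (no screen control codes), -n 1: exit after one iteration
top -b -n 1 | head -n 15
```

Redirecting this into a file periodically (e.g., from cron) gives you a crude utilization history.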
`htop` - An Enhanced Interactive Process Viewer

`htop` is often preferred over `top` because it offers several improvements:
- Colorized Output: Easier to read and distinguish information.
- Scrolling: You can scroll vertically and horizontally to see all processes and full command lines.
- Easier Interaction: No need to enter PIDs for killing or renicing; you can select processes with arrow keys.
- Mouse Support: If run in a terminal emulator that supports it.
- Tree View: Press `F5` to see parent-child relationships between processes.
- Setup Menu: Press `F2` to easily customize displayed meters, columns, colors, and options.
Understanding the `htop` Output:

- Top Meters: Configurable graphical meters showing CPU (per core), Memory, and Swap usage. Load average, uptime, and task counts are also displayed.
- Task Area: Similar columns to `top`, but often more intuitively arranged and configurable via the `F2` Setup menu.
- Bottom Menu: Shows the function key shortcuts (`F1` Help, `F2` Setup, `F3` Search, `F4` Filter, `F5` Tree View, `F6` SortBy, `F7` Nice-, `F8` Nice+, `F9` Kill, `F10` Quit).
`htop` provides largely the same information as `top` but in a more user-friendly and visually appealing package. If it's not installed by default, it's usually available via the package manager (e.g., `sudo apt install htop` or `sudo yum install htop`).
`ps` - Reporting a Snapshot of Current Processes

Unlike `top` and `htop`, which are dynamic, `ps` (process status) provides a static snapshot of the processes running at the moment the command is executed. It's highly versatile due to its numerous options for selecting processes and customizing the output format.

Common `ps` Usage Patterns:
- BSD Syntax (common on Linux): `ps aux`
  - `a`: Show processes for all users.
  - `u`: Display user-oriented format (includes USER, %CPU, %MEM, VSZ, RSS, etc.).
  - `x`: Show processes not attached to a terminal (like daemons/services).
  - This is arguably the most common and useful invocation for a general overview.

  ```
  # Example Output Snippet (ps aux)
  USER         PID %CPU %MEM     VSZ    RSS TTY      STAT START   TIME COMMAND
  root           1  0.0  0.1  169404  11928 ?        Ss   Jul10   0:02 /sbin/init splash
  root           2  0.0  0.0       0      0 ?        S    Jul10   0:00 [kthreadd]
  root         889  0.1  0.3  123456  50000 ?        Sl   Jul10   1:30 /usr/lib/some-service
  student     1234  0.5  1.0  876543 160000 pts/0    S+   10:00   0:05 gnome-terminal
  student     5678 12.3  5.5 1500000 880000 pts/1    R+   10:25   0:55 /usr/bin/firefox
  ```
- System V Syntax: `ps -ef`
  - `-e`: Show every process.
  - `-f`: Show full-format listing (includes UID, PID, PPID, C, STIME).
  - Often used to see parent/child process relationships (PPID).

  ```
  # Example Output Snippet (ps -ef)
  UID          PID    PPID  C STIME TTY      TIME     CMD
  root           1       0  0 Jul10 ?        00:00:02 /sbin/init splash
  root           2       0  0 Jul10 ?        00:00:00 [kthreadd]
  root         889       1  0 Jul10 ?        00:01:30 /usr/lib/some-service
  student     1234    1200  0 10:00 pts/0    00:00:05 gnome-terminal
  student     5678    1234 12 10:25 pts/1    00:00:55 /usr/bin/firefox
  ```
- Custom Format: `ps -eo <columns>`
  - `-e`: Show every process.
  - `-o`: Specify a user-defined format by listing the column names you want. Common columns: `pid,ppid,user,%cpu,%mem,vsz,rss,stat,start,time,comm,args`.
  - `comm`: Command name only.
  - `args`: Full command line with arguments.
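For example, a compact "top five CPU consumers" report (the exact column list here is just one reasonable choice):

```shell
# Every process, selected columns only, sorted by CPU usage descending;
# head keeps the header line plus the first five processes
ps -eo pid,ppid,user,%cpu,%mem,rss,stat,comm --sort=-%cpu | head -n 6
```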
Key `ps` Output Columns (common to `aux` or `-eo`):

- `USER` / `UID`: User owning the process.
- `PID`: Process ID.
- `PPID`: Parent Process ID.
- `%CPU`: Approximate CPU utilization. Note: this is averaged over the process's lifetime, unlike `top`'s real-time view, unless specifically requested otherwise.
- `%MEM`: Approximate physical memory (RAM) utilization.
- `VSZ` (Virtual Set Size): Total virtual memory used by the process (in KiB).
- `RSS` (Resident Set Size): Physical memory (RAM) occupied by the process (in KiB). This is often a more relevant metric than VSZ for actual RAM usage.
- `TTY`: Controlling terminal (`?` means no controlling terminal, typical for daemons).
- `STAT` / `S`: Process state (see the `top` explanation: R, S, D, Z, T, etc.; `+` means foreground process group).
- `START` / `STIME`: Time or date the process started.
- `TIME`: Cumulative CPU time consumed by the process (often in `MM:SS` or `HH:MM:SS` format).
- `COMMAND` / `CMD` / `comm` / `args`: The command being run.
Combining `ps` with `grep`:

A very common use case is finding a specific process:

```
ps aux | grep firefox   # Find all processes with "firefox" in their command line
ps -ef | grep sshd      # Find processes related to the SSH daemon
```

Note that the `grep` command itself will often appear in the output. You can filter it out: `ps aux | grep firefox | grep -v grep`.
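A cleaner alternative to the `ps | grep` pipeline is `pgrep`, which never matches its own invocation:

```shell
# PIDs plus full command lines of processes whose command line contains "sshd"
pgrep -af sshd

# Only PIDs of processes whose name is exactly "firefox"
pgrep -x firefox
```

Here `-f` matches against the full command line, `-a` also prints that command line, and `-x` requires an exact name match.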
Workshop Identifying and Inspecting Processes
Goal: To practice using `top`, `htop`, and `ps` to identify system activity and gather details about specific processes.
Scenario: Let's simulate a scenario where a background process starts consuming some resources, and we need to investigate it.
Steps:
- Open Two Terminals: You'll need one terminal (Terminal A) to run commands and another (Terminal B) to run a background task.

- Start a Background Task (Terminal B):
  - Run the following command. It will simply loop indefinitely, consuming a small amount of CPU. We add `sleep 1` to prevent it from consuming 100% CPU, making it slightly more realistic for a background task.
  - The `&` runs the command in the background. Note the PID (Process ID) that is printed, e.g., `[1] 12345`. You'll use this PID later. If you miss it, don't worry, we'll find it.
- Monitor with `top` (Terminal A):
  - Run `top`.
  - Observe the process list. It might take a few refresh cycles. Look for a process named `bash` or `sh` (or potentially `sleep`) that is associated with your user and has a non-zero `%CPU` (though small due to `sleep 1`) and a `TIME+` value that increments.
  - Press `P` to sort by CPU usage. Does your process appear near the top (it might not if the system is busy)?
  - Press `M` to sort by Memory usage.
  - Press `c` to toggle the full command line. Can you now see the `while true; do ...` command?
  - Make a note of the PID of your loop process as shown in `top`.
  - Press `q` to exit `top`.
- Monitor with `htop` (Terminal A):
  - If you have `htop` installed, run `htop`. (If not, you can install it: `sudo apt update && sudo apt install htop` or `sudo yum install htop`.)
  - Observe the meters at the top.
  - Look for your process in the list. Use the Up/Down arrow keys to navigate.
  - Press `F6` (`SortBy`) and select `PERCENT_CPU`.
  - Press `F5` (`Tree`) view. Can you see your shell process (`bash`, `zsh`, etc.) and the `sleep` command running under it (or the `while` loop itself if represented that way)? Press `F5` again to exit tree view.
  - Press `F4` (`Filter`). Type your username and press Enter. Now only your processes are shown. Does this make it easier to find the loop? Press `F4` again and Enter with an empty string to clear the filter.
  - Press `F3` (`Search`). Type `sleep` and press Enter. `htop` will highlight matching processes. Press `F3` again to find the next match.
  - Press `F9` (`Kill`). Use the arrow keys to highlight your background loop process (the `bash`/`sh` one, not `sleep` directly if visible separately). Do not press Enter yet. Press `Esc` twice to cancel the kill operation. We'll kill it later.
  - Press `F10` (`Quit`).
- Inspect with `ps` (Terminal A):
  - Run `ps aux`. Scan the output for your background loop process (look for `while true` or similar in the `COMMAND` column). Note its PID, USER, %CPU, %MEM, STAT (should be `S` for sleeping most of the time, occasionally `R`), and START time.
  - Run `ps -ef`. Find the process again. Note the PID and PPID (Parent Process ID). The PPID should correspond to the PID of the shell process running in Terminal B.
  - Let's assume the PID you found for the loop was `12345`. Get specific details using `-o`.
  - Find the process using `pgrep` (a utility to find PIDs by name or other attributes). This should give you the PID of the main loop shell.
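The exact command listings for those last two sub-steps are not shown above; invocations along these lines (with `12345` as a placeholder PID) fit the description:

```shell
# Detailed one-off report for a single PID (replace 12345 with your actual PID)
ps -o pid,ppid,user,%cpu,%mem,stat,start,time,args -p 12345

# Newest (-n) process named bash owned by you; on your system the loop
# may run under sh or zsh instead, so adjust the name accordingly
pgrep -n -u "$USER" bash
```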
- Terminate the Background Task (Terminal A or B):
  - You have the PID (let's say it's `12345`). Use the `kill` command in either terminal.
  - Go back to Terminal B. You should see a message like `Terminated` or `[1]+ Terminated ...`. The loop has stopped.
  - Run `ps aux | grep 12345 | grep -v grep`. You should get no output, confirming the process is gone.
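The `kill` invocation itself (again with `12345` standing in for your PID):

```shell
kill 12345        # sends SIGTERM (signal 15), a polite request to terminate
# kill -9 12345   # SIGKILL: forceful, cannot be caught or ignored; last resort
```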
Conclusion: You've successfully used `top`, `htop`, and `ps` to monitor system activity in real-time, identify a specific process, inspect its details (PID, PPID, resource usage, state), and terminate it using its PID. These are fundamental skills for managing any Linux system.
2. CPU Monitoring and Analysis
The Central Processing Unit (CPU) is often the first place administrators look when diagnosing performance issues. Understanding how to monitor CPU utilization and interpret the related metrics is crucial.
Key CPU Concepts
- Cores and Threads: Modern CPUs have multiple cores, each capable of executing instructions independently. Some cores support hyper-threading (or Simultaneous Multi-Threading - SMT), allowing a single physical core to appear as two logical processors to the OS, potentially increasing throughput for certain workloads. When monitoring, it's important to know if you're looking at total utilization across all logical processors or utilization per core/thread.
- CPU Utilization: This is typically expressed as a percentage, indicating how much time the CPU spent doing useful work versus being idle. It's broken down into categories (as seen in `top`):
  - `%us` (user): Time spent executing user-space processes (applications). High `us` usually means application code is consuming CPU.
  - `%sy` (system): Time spent executing kernel-space code (system calls, kernel threads). High `sy` might indicate heavy I/O, intense networking, or kernel-level tasks.
  - `%ni` (nice): Time spent executing niced (lower priority) user processes.
  - `%id` (idle): Time the CPU had nothing to do. High `id` means the CPU is not a bottleneck.
  - `%wa` (I/O wait): Time the CPU spent waiting for I/O operations (like disk reads/writes) to complete. Important: this is time the CPU could have been doing something else but was stalled waiting for I/O. High `wa` strongly suggests an I/O bottleneck (often disk, sometimes network). The CPU itself isn't necessarily busy, but tasks waiting for I/O are preventing it from being truly idle.
  - `%hi` (hardware interrupts): Time spent servicing hardware interrupts (e.g., from network cards, disk controllers).
  - `%si` (software interrupts): Time spent servicing software interrupts (often related to network packet processing). High `si` can point to very high network traffic.
  - `%st` (steal time): In virtualized environments, this is time the hypervisor "stole" from this virtual CPU to run other tasks (like another VM or the hypervisor itself). High `st` indicates the VM isn't getting its fair share of CPU from the host.
- Load Average: As seen in `top` and `uptime`, the load average (1, 5, and 15-minute averages) represents the average number of tasks in the run queue (R state) or waiting for uninterruptible I/O (D state).
  - A load average consistently below the number of logical CPU cores indicates the system is generally not CPU-bound.
  - A load average consistently near or equal to the number of cores means the system is fully utilized.
  - A load average consistently above the number of cores indicates the system is overloaded: there are more tasks ready to run than the available CPU cores can handle, leading to waiting times and reduced responsiveness. A high load average can be caused by high CPU usage or high I/O wait.
Tools for CPU Monitoring

While `top` and `htop` give a good overview, other tools provide different perspectives:

- `mpstat` (MultiProcessor Statistics): Part of the `sysstat` package (often needs installation: `sudo apt install sysstat` or `sudo yum install sysstat`). Excellent for viewing statistics per logical processor.
  - `mpstat -P ALL`: Show statistics for all CPUs (`ALL`) individually, plus a summary average.
  - `mpstat -P ALL 1 5`: Show stats for all CPUs every 1 second, 5 times.

  This is invaluable for spotting imbalances (one core heavily loaded while others are idle) or understanding utilization patterns on multi-core systems.

  ```
  # Example Output (mpstat -P ALL 1 1)
  Linux 5.15.0-76-generic (...)  _x86_64_  (4 CPU)

  11:00:01 AM  CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest %gnice  %idle
  11:00:02 AM  all   1.50   0.00   0.75    0.10   0.00   0.15   0.00   0.00   0.00  97.50
  11:00:02 AM    0   2.00   0.00   1.00    0.00   0.00   0.00   0.00   0.00   0.00  97.00
  11:00:02 AM    1   1.00   0.00   0.50    0.30   0.00   0.20   0.00   0.00   0.00  98.00
  11:00:02 AM    2   1.80   0.00   0.90    0.00   0.00   0.30   0.00   0.00   0.00  97.00
  11:00:02 AM    3   1.20   0.00   0.60    0.10   0.00   0.10   0.00   0.00   0.00  98.00
  ```
- `vmstat` (Virtual Memory Statistics): Provided by the `procps` package on most distributions, so it's usually installed by default. While primarily for memory (`vm`), it provides useful CPU context.
  - `vmstat 1`: Report every 1 second indefinitely.
  - `vmstat 2 5`: Report every 2 seconds, 5 times.

  ```
  # Example Output (vmstat 1 3)
  procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
   r  b   swpd    free   buff   cache   si   so   bi   bo   in   cs us sy id wa st
   1  0      0 8140100 150000 3450000    0    0    5   20  100  250  2  1 97  0  0
   2  0      0 8139900 150000 3450200    0    0    0  150  850 1500 15  5 80  0  0
   0  0      0 8139700 150000 3450400    0    0    0   80  500  900  5  2 93  0  0
  ```

  - Key CPU columns: `us`, `sy`, `id`, `wa`, `st` (same meanings as in `top`).
  - Key process columns: `r` (runnable processes waiting for CPU), `b` (processes in uninterruptible sleep, often waiting for I/O). High `r` values correlate with high CPU load. High `b` values correlate with high I/O wait.
- `uptime`: Quickly shows the load averages.
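The same load figures also live in `/proc/loadavg`, which is convenient for scripts; a small sketch comparing the 1-minute load against the core count:

```shell
#!/bin/sh
# The first field of /proc/loadavg is the 1-minute load average
load1=$(cut -d' ' -f1 /proc/loadavg)
cores=$(nproc)
echo "1-min load: $load1, logical CPUs: $cores"
# awk handles the floating-point comparison; its exit status drives the branch
if awk -v l="$load1" -v c="$cores" 'BEGIN { exit !(l > c) }'; then
  echo "Run queue exceeds core count: possible overload"
else
  echo "Load within capacity"
fi
```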
Workshop Generating and Analyzing CPU Load
Goal: To generate CPU load and observe its effect using various monitoring tools, focusing on per-core statistics and load average.
Tools Required: `top` or `htop`, `mpstat`, `uptime`, and the `stress` utility.
Steps:
- Install `stress` and `sysstat`:
  - On Debian/Ubuntu: `sudo apt update && sudo apt install stress sysstat`
  - On CentOS/RHEL/Fedora: `sudo yum install epel-release && sudo yum install stress sysstat` (or `sudo dnf install stress sysstat`)
  - `sysstat` provides `mpstat`. `stress` is a simple tool to impose CPU, memory, or I/O load.
- Check Initial State:
  - Open three terminals (A, B, C).
  - Terminal A: Run `htop` or `top`. Note the baseline CPU usage and load average. Press `1` in `top` to see per-CPU views if not already visible.
  - Terminal B: Run `mpstat -P ALL 1`. Observe the per-CPU idle (`%idle`) percentages. They should be high (close to 100%).
  - Terminal C: Run `uptime`. Note the initial load average.
- Generate CPU Load (Terminal C):
  - First, find out how many CPU cores (logical processors) you have: `nproc`
  - Let's generate load equivalent to one fully utilized core. Replace `1` with the number of cores if you want to stress more later.
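The load-generation command itself (matching the 60-second run referenced in the next step):

```shell
# One worker spinning on sqrt() for 60 seconds; raise --cpu to load more cores
stress --cpu 1 --timeout 60s
```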
- Observe While Under Load:
  - Terminal A (`htop`/`top`):
    - Watch the main CPU meter(s). You should see utilization increase significantly.
    - If using `top` with per-CPU view (press `1`), one CPU line should show very low `%idle`. `htop` will show one CPU bar nearly full.
    - Find the `stress` process(es). They should be consuming close to 100% CPU (on one core). Note their `PID` and `%CPU`.
    - Watch the load average. The 1-minute average should start climbing towards `1.00`.
  - Terminal B (`mpstat`):
    - Observe the output refreshing every second. One specific `CPU` line should show a dramatic drop in `%idle` and a corresponding increase in `%usr`. Other CPUs should remain mostly idle.
    - The `all` line will show the average utilization across all cores.
  - Terminal C: After `stress` finishes (60 seconds), it will exit. Run `uptime` again immediately. Compare the load averages to the initial values. The 1-minute average should be elevated (close to 1.00 if the test ran long enough), while the 5 and 15-minute averages will be lower but rising. Run `uptime` a few more times over the next few minutes and watch the averages decrease as the system recovers.
- (Optional) Generate More Load:
  - If you have multiple cores (e.g., `nproc` reported 4), try stressing all of them, e.g., `stress --cpu 4 --timeout 60s`.
  - Now observe `htop`/`top` and `mpstat`. All CPU cores should show high utilization (low `%idle`). The load average in `top` and `uptime` should climb towards `4.00` (or the number of cores you stressed).
- (Optional) Generate I/O Wait Load:
  - I/O wait is harder to simulate perfectly with `stress`, but we can try its I/O and disk-write workers, e.g., `stress --io 2 --hdd 1 --timeout 60s`.
  - Observe `top`/`htop`. Look at the `%wa` value in the CPU summary line. Does it increase significantly?
  - Observe `mpstat`. Does `%iowait` increase?
  - Observe `vmstat 1`. Look at the `wa` column under `cpu` and the `b` column under `procs`. Do they increase?
  - Note: The effectiveness of `--io` depends heavily on your disk speed and system configuration.
Conclusion: You have used `stress` to create controlled CPU load and observed its impact using `top`, `htop`, `mpstat`, and `uptime`. You saw how load affects overall and per-core utilization percentages and how the system load average reflects the demand on the CPU(s). You also briefly explored how I/O-bound tasks affect the `%wa` metric. This hands-on experience helps in interpreting these metrics when analyzing real-world performance issues.
3. Memory Monitoring and Analysis
Memory (RAM) is another critical resource. Insufficient memory forces the system to use slower swap space (disk), drastically reducing performance. Understanding memory usage patterns is essential for system health.
Key Memory Concepts
- RAM (Random Access Memory): Fast, volatile storage used by the CPU to hold running applications and their data.
- Swap Space: A designated area on a hard drive or SSD used as "virtual memory" when physical RAM is full. Accessing swap is orders of magnitude slower than accessing RAM. Heavy swap usage is a major performance killer.
- Physical vs. Virtual Memory:
- Physical Memory (Resident Set Size - RSS): The actual amount of RAM a process occupies.
- Virtual Memory (Virtual Set Size - VSZ): The total address space requested by a process. This includes code, data, shared libraries, and mapped files, some of which might be in RAM, some in swap, and some not loaded yet. VSZ is often much larger than RSS. RSS is usually the more important metric for actual RAM consumption.
- Buffers: Temporary storage for raw disk blocks (metadata or file content). Used by the kernel to optimize block device I/O. Data written might be held in a buffer briefly before being written to disk.
- Cache: Page cache holding data read from files on disk. If a file is read, its contents are stored in the page cache in RAM. Subsequent reads of the same file can be served quickly from the cache instead of going back to the slow disk.
- Buffers vs. Cache: Historically distinct, modern Linux kernels often manage them similarly within the "page cache." The `buff/cache` value seen in tools like `free` and `top` represents the sum of memory used for both purposes. Crucially, most of this `buff/cache` memory is reclaimable: if applications need more RAM, the kernel will shrink the cache/buffers to free up space.
- Free vs. Available Memory:
  - `free`: Memory that is completely unused. In Linux, this number might seem low because the kernel actively uses "free" RAM for buffers and cache to improve performance.
  - `available`: An estimate (available since kernel 3.14) of how much memory is truly available for starting new applications without resorting to swapping. It accounts for `free` memory plus reclaimable parts of `buff/cache`. `available` is generally the most useful metric to determine whether the system is under memory pressure.
- OOM Killer (Out Of Memory Killer): A Linux kernel mechanism that activates when the system is critically low on memory and cannot reclaim enough (e.g., by shrinking caches or swapping). To prevent a total system lockup, the OOM Killer selects a process (based on heuristics like memory usage and its "oom_score") and terminates it with `SIGKILL` to free up memory. While this saves the system from crashing, it means an application was forcibly killed. Seeing OOM Killer activity in the logs (`dmesg` or `journalctl`) indicates severe memory pressure.
Tools for Memory Monitoring

- `free`: The primary command-line tool for a quick snapshot of memory usage.
  - `free`: Shows values in kibibytes (KiB).
  - `free -h`: Shows values in human-readable format (MiB, GiB). This is usually preferred.
  - `free -s 1`: Refresh every 1 second.

  ```
  # Example Output (free -h)
                 total        used        free      shared  buff/cache   available
  Mem:            15Gi       4.0Gi       7.8Gi       150Mi       3.8Gi        11Gi
  Swap:          2.0Gi          0B       2.0Gi
  ```

  - `Mem` line: Physical RAM statistics.
    - `total`: Total installed RAM.
    - `used`: Calculated as `total - free - buff/cache`. Can be misleading alone.
    - `free`: Truly unused memory.
    - `shared`: Memory used by `tmpfs` (RAM-based file systems).
    - `buff/cache`: Memory used by kernel buffers and page cache.
    - `available`: Estimate of memory available for new applications. Focus on this!
  - `Swap` line: Swap space statistics. Non-zero `used` indicates swapping is occurring or has occurred.
- `top` / `htop`: Provide a real-time memory summary (similar to `free`) and per-process memory usage (`VIRT`, `RES`, `SHR`, `%MEM`). Sorting by `%MEM` (`M` in `top`, `F6` in `htop`) quickly identifies memory-hungry processes.
- `vmstat`: Reports virtual memory statistics over time, e.g., `vmstat 1`.

  ```
  # Example Output focusing on memory/swap columns
  procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
   r  b   swpd    free   buff   cache   si   so   bi   bo   in   cs us sy id wa st
   1  0      0 8140100 150000 3450000    0    0    5   20  100  250  2  1 97  0  0
   0  0   1024 8030000 150200 3550000   10   25   80  150  600  800  5  3 85  7  0
  ```

  - Memory columns: `swpd` (amount of swap used), `free`, `buff`, `cache`.
  - Swap columns: `si` (amount swapped in from disk per second), `so` (amount swapped out to disk per second). Sustained non-zero values for `si` and `so` indicate active swapping and likely insufficient RAM.
- `/proc/meminfo`: A virtual file providing detailed memory statistics directly from the kernel. `free`, `top`, etc., parse this file. Useful for getting specific values or scripting.
- `smem`: An advanced tool (may need installation) that provides more detailed reports on memory usage, particularly distinguishing between shared and private memory per process, giving a more accurate view of proportional usage (PSS - Proportional Set Size). `smem -tk` adds a totals row and human-readable units.
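For scripting, you can pull individual fields straight out of `/proc/meminfo`:

```shell
# MemAvailable is the kernel's own estimate of memory usable without swapping
grep -E '^(MemTotal|MemFree|MemAvailable|SwapTotal|SwapFree):' /proc/meminfo
```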
Workshop Simulating Memory Pressure and Observing Swapping
Goal: To simulate a low-memory situation, observe the use of cache, witness swapping activity, and see how tools report these conditions.
Tools Required: `free`, `top` or `htop`, `vmstat`, and `stress` (or `stress-ng`).
Steps:
- Install Tools (if needed):
  - Ensure `stress` (or `stress-ng`, which has more memory options) is installed; `vmstat` ships with `procps` and is normally present already.
  - Debian/Ubuntu: `sudo apt update && sudo apt install sysstat stress`
  - CentOS/RHEL/Fedora: `sudo yum install sysstat stress` or `sudo dnf install sysstat stress`
Establish Baseline:
- Open three terminals (A, B, C).
- Terminal A: Run
free -h
. Note the initialtotal
,used
,free
,buff/cache
, andavailable
memory, plus swap usage. - Terminal B: Run
vmstat 1
. Observe thefree
,cache
,si
, andso
columns. Note the initial lack of swap activity (si
/so
should be 0). - Terminal C: Run
htop
ortop
. Observe the memory summary line.
- Consume Cache (Optional but illustrative):
  - In a fourth terminal (D), or reusing C temporarily, perform an operation that reads a large amount of data, which forces Linux to cache it. Reading a large system file or device often works.
  - Immediately after the command finishes, check `free -h` in Terminal A again. You should see:
    - `free` memory decreased significantly.
    - `buff/cache` increased significantly.
    - `available` memory decreased less than `free`, because the cache is reclaimable.
  - Check `vmstat 1` in Terminal B. The `cache` column should have increased.
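One way to generate such a large read (using a scratch file so the example is self-contained; any big readable file works too):

```shell
# Create a 1 GiB scratch file, then read it back; the read populates the page cache
dd if=/dev/zero of=/tmp/cachetest bs=1M count=1024 status=none
cat /tmp/cachetest > /dev/null
rm /tmp/cachetest        # clean up the scratch file
```

Note that writing the file itself also passes through the page cache, so `buff/cache` may already grow during the `dd` step.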
- Clear Caches (Optional, requires root):
  - To demonstrate cache reclaimability, you can ask the kernel to drop clean caches (use with caution; this may temporarily impact performance): `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches`
  - Check `free -h` in Terminal A again. `free` memory should increase, and `buff/cache` should decrease, returning closer to the initial state. `available` should also increase.
- Generate Memory Load (Terminal C):
  - Determine roughly how much available RAM you have from `free -h`. Let's aim to consume slightly more than that to force swapping. If you have 11Gi available, try allocating 12G. Adjust the `12G` value based on your system.

    ```
    # Use stress to allocate memory
    # --vm N: Spawn N workers spinning on malloc()/free()
    # --vm-bytes SIZE: Allocate SIZE per worker
    # Let's start 1 worker allocating 12GB (adjust size!)
    stress --vm 1 --vm-bytes 12G --timeout 120s

    # If stress fails or doesn't consume enough, try stress-ng
    # stress-ng --vm 1 --vm-bytes 12G --timeout 120s
    ```

  - Warning: This might make your system temporarily unresponsive!
- Observe While Under Memory Pressure:
    - Terminal A (`free -h`): Run `free -h` periodically (or `watch -n 1 free -h`). Watch `available` memory decrease rapidly and `used` swap increase from 0.
    - Terminal B (`vmstat 1`): Watch the `free` memory column drop and the `cache` column likely decrease as the kernel tries to reclaim cache before swapping. Crucially, watch the `so` (swap-out) column: you should see non-zero values as the system writes memory pages to the swap disk. If the system stays responsive enough for `stress` to free memory later, or if you allocate less, you may also see `si` (swap-in) activity as swapped-out pages are needed again.
    - Terminal C (`htop`/`top`): The memory summary line should show high RAM usage and increasing swap usage. Find the `stress` or `stress-ng` process: its `%MEM` and `RES` (Resident Set Size) should be very high, and its `VIRT` (Virtual Size) may be even higher. The system might feel sluggish; observe CPU usage, where you may see increased `%sy` (system CPU) and potentially `%wa` (I/O wait) due to the swapping activity (which involves disk I/O).
- After `stress` Finishes:
    - Continue monitoring with `free -h` and `vmstat 1` for a minute or two.
    - Swap usage (`used` in `free -h`, `swpd` in `vmstat`) might remain high even after the process exits. Linux generally doesn't eagerly un-swap pages unless the memory is needed elsewhere or a page is accessed again (triggering a swap-in).
    - Available memory should recover, and swap activity (`si`/`so` in `vmstat`) should return to 0.
Conclusion: You simulated memory pressure, observed how Linux uses free RAM for cache, how it reclaims cache when needed, and, critically, what happens when physical RAM is exhausted: swapping. You used `free`, `vmstat`, and `top`/`htop` to monitor available memory, cache usage, swap usage, and swap I/O activity (`si`/`so`). Witnessing sustained non-zero `si`/`so` is a strong indicator that the system needs more RAM for its workload.
4. Disk I/O Monitoring and Analysis
Disk Input/Output (I/O) performance is critical for application responsiveness, especially for databases, file servers, or any application that frequently reads or writes data. Slow disk I/O can lead to high `%iowait` CPU time, bottlenecking the entire system even when the CPU itself isn't busy.
Key Disk I/O Concepts
- Throughput: The rate at which data is transferred, usually measured in Megabytes per second (MB/s) or Gigabytes per second (GB/s). High throughput is important for large file transfers or sequential reads/writes.
- IOPS (Input/Output Operations Per Second): The number of read or write operations completed per second. High IOPS are crucial for workloads involving many small, random reads/writes, such as database lookups or virtual machine hosting. SSDs typically offer vastly higher IOPS than traditional HDDs.
- Latency: The time it takes for a single I/O request to be completed, often measured in milliseconds (ms). Lower latency is better, meaning the disk responds faster. High latency directly impacts application responsiveness.
- Queue Depth: The number of pending I/O requests waiting to be serviced by the disk device. A consistently high queue depth indicates the disk cannot keep up with the demand.
- Utilization (`%util`): The percentage of time the disk device was busy processing I/O requests. A value close to 100% indicates the disk is saturated and is likely a bottleneck. However, high utilization on its own isn't always bad if latency remains low: a fast SSD might be 100% utilized yet still provide excellent performance. Combine `%util` with latency/wait times for a better picture.
- Service Time (`svctm`, often deprecated/misleading): Historically, the average time the device spent servicing each request. On modern kernels/tools this value is often inaccurate and should be disregarded in favor of `await`.
- Wait Time (`await`, `r_await`, `w_await`): The average time (in ms) an I/O request spends from when it is issued to when it completes, including both queue time (waiting to be processed) and service time (actively being processed). `await` is a crucial indicator of disk performance as experienced by applications; `r_await` and `w_await` provide separate average wait times for read and write requests, respectively. High `await` times point directly to an I/O bottleneck.
Tools for Disk I/O Monitoring
- `iostat`: The standard tool for reporting CPU statistics and input/output statistics for devices and partitions. Part of the `sysstat` package.
    - `iostat`: Basic report with CPU and device I/O since boot.
    - `iostat -d`: Show only the device utilization report.
    - `iostat -x`: Show extended statistics (highly recommended). Includes `await`, `%util`, queue size, etc.
    - `iostat -xk 1`: Show extended stats (`-x`) in kilobytes (`-k`) every 1 second.
    - `iostat -x /dev/sda 2 5`: Show extended stats just for device `/dev/sda`, every 2 seconds, 5 times.
# Example Output (iostat -xk 1) avg-cpu: %user %nice %sys %iowait %steal %idle 1.50 0.00 0.75 0.10 0.00 97.65 Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util sda 1.50 5.00 60.00 120.00 0.10 2.00 6.25 28.57 2.50 5.10 0.05 40.00 24.00 1.50 0.98 nvme0n1 25.00 150.00 1000.00 5000.00 0.50 5.00 1.96 3.23 0.15 0.40 0.10 40.00 33.33 0.05 0.88
- Key Columns (`-x` mode):
    - `r/s`, `w/s`: Reads/writes completed per second (IOPS = `r/s` + `w/s`).
    - `rkB/s`, `wkB/s`: Kilobytes read/written per second (throughput; use `-m` for MB/s).
    - `rrqm/s`, `wrqm/s`: Read/write requests merged per second by the kernel.
    - `r_await`, `w_await`: Average time (ms) for read/write requests to be served, including queue + service time. Very important metrics!
    - `aqu-sz`: Average queue length (number of requests waiting).
    - `rareq-sz`, `wareq-sz`: Average size (kB) of read/write requests.
    - `%util`: Percentage of elapsed time during which I/O requests were issued to the device (device saturation).
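These columns are arithmetically related: average request size equals throughput divided by IOPS. Using the `nvme0n1` read numbers from the sample output above (1000 rkB/s at 25 r/s):

```shell
# rareq-sz ≈ rkB/s / (r/s): 1000 kB/s over 25 reads/s is 40 kB per read,
# matching the rareq-sz column (40.00) in the sample iostat output.
awk -v rkbs=1000 -v rs=25 'BEGIN { printf "avg read request size: %.1f kB\n", rkbs/rs }'
# prints: avg read request size: 40.0 kB
```

The same cross-check works for writes (`wkB/s / w/s ≈ wareq-sz`) and is a quick sanity test when reading unfamiliar `iostat` output.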
- `iotop`: An `htop`-like tool specifically for monitoring disk I/O usage per process. Requires root privileges. (Needs installation: `sudo apt install iotop` or `sudo yum install iotop`.)
    - `sudo iotop`: Shows current I/O activity, updating periodically.
    - `sudo iotop -o`: Show only processes or threads actually doing I/O.
    - `sudo iotop -a`: Show accumulated I/O instead of bandwidth.
# Example Output (sudo iotop -o) Total DISK READ: 1.20 M/s | Total DISK WRITE: 5.50 M/s Actual DISK READ: 0.80 M/s | Actual DISK WRITE: 3.00 M/s PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND 1234 be/4 student 800.00 K/s 2.50 M/s 0.00 % 5.50 % dd if=/dev/zero of=testfile bs=1M count=100 5678 be/4 root 0.00 B/s 500.00 K/s 0.00 % 1.10 % [jbd2/sda1-8] 9012 be/4 mysql 400.00 K/s 2.00 M/s 0.00 % 3.20 % mysqld --user=mysql ...
- Shows PID, user, disk read rate, disk write rate, swap-in percentage, I/O wait percentage (`IO>`), and command.
- Excellent for quickly identifying which process is responsible for heavy disk activity seen in `iostat`.
- `vmstat`: Provides basic block I/O stats (`vmstat 1`). The columns `bi` (blocks received from a block device, i.e. reads) and `bo` (blocks sent to a block device, i.e. writes) are typically in 1 KB blocks. Useful for seeing whether any disk activity is happening alongside the memory/CPU stats.
Workshop Generating Disk Load and Analyzing I/O Statistics
Goal: To generate different types of disk load (read and write) and observe the impact using iostat
and iotop
.
Tools Required: `iostat`, `iotop`, `dd` (usually pre-installed).
Steps:
- Install Tools (if needed):
    - Ensure `sysstat` (for `iostat`) and `iotop` are installed.
    - Debian/Ubuntu: `sudo apt update && sudo apt install sysstat iotop`
    - CentOS/RHEL/Fedora: `sudo yum install sysstat iotop` or `sudo dnf install sysstat iotop`
- Identify Target Device:
    - Use `lsblk` or `df -h` to identify a suitable disk partition with some free space (e.g., `/dev/sda1`, `/dev/nvme0n1p2`). We'll write a test file there. Avoid writing directly to the raw device (`/dev/sda`) unless you know what you are doing. Find your home directory's partition or use `/tmp`. Let's assume we're writing to a filesystem mounted from `/dev/sda1`.
- Establish Baseline:
    - Open three terminals (A, B, C).
    - Terminal A: Run `iostat -xk 1`. Observe the baseline `r/s`, `w/s`, `rkB/s`, `wkB/s`, `await` times, and `%util` for your target device (e.g., `sda`). They should be relatively low.
    - Terminal B: Run `sudo iotop`. You might need to enter your password. Observe the baseline. Press `o` to show only active processes.
    - Terminal C: This will be used to generate the load.
- Generate Write Load (Terminal C):
    - Use `dd` to write a moderately large file (e.g., 1GB) from `/dev/zero` (a source of infinite null bytes, low CPU overhead) to your chosen filesystem.

      ```bash
      # Adjust 'of=./testfile' path if needed. Use a filesystem on your target device.
      dd if=/dev/zero of=./testfile bs=1M count=1024 oflag=direct status=progress
      # bs=1M: write in 1 Megabyte blocks
      # count=1024: write 1024 blocks (1GB total)
      # oflag=direct: try to bypass the buffer cache for writing. This generates more
      #   immediate physical I/O, making effects clearer in iostat/iotop. It might
      #   require root or specific mount options; if it fails, remove oflag=direct.
      # status=progress: show dd's progress.
      ```
- Observe While Under Write Load:
    - Terminal A (`iostat`): Watch the line for your target device (`sda`, `nvme0n1`, etc.).
        - `w/s` (writes per second) and `wkB/s` (write throughput) should increase significantly; `r/s` and `rkB/s` should remain low.
        - Observe `w_await` (write await time). Does it increase? By how much?
        - Observe `%util`. It should increase, potentially reaching 100% if `dd` can write faster than the disk can handle.
        - Observe `aqu-sz` (queue size). Does it grow?
    - Terminal B (`iotop`):
        - The `dd` process should appear prominently, showing high `DISK WRITE` values.
        - Note the `IO>` percentage for `dd`.
        - You might also see related kernel threads like `jbd2` or `kworker` doing I/O, especially without `oflag=direct`.
- Clean Up Write Test File (Terminal C): Remove the test file with `rm ./testfile`.
- Generate Read Load (Terminal C):
    - First, create a file to read from (if you removed the previous one). We want physical reads, so ideally clear the caches first (requires root).

      ```bash
      # Create the file again (can use cache this time)
      dd if=/dev/zero of=./testfile bs=1M count=1024 status=progress

      # Clear caches (optional, requires root)
      sync
      echo 3 | sudo tee /proc/sys/vm/drop_caches

      # Now read the file using dd and discard the output
      dd if=./testfile of=/dev/null bs=1M iflag=direct status=progress
      # iflag=direct: try to bypass the cache for reading.
      # of=/dev/null: discard the data read.
      ```
- Observe While Under Read Load:
    - Terminal A (`iostat`): Watch the line for your target device.
        - `r/s` (reads per second) and `rkB/s` (read throughput) should increase significantly; `w/s` and `wkB/s` should remain low (unless metadata updates cause small writes).
        - Observe `r_await` (read await time).
        - Observe `%util`.
    - Terminal B (`iotop`):
        - The `dd` process should appear, showing high `DISK READ` values.
- Clean Up (Terminal C): Remove the test file with `rm ./testfile`.
- Stop Monitoring: Press `Ctrl+C` in Terminals A and B.
Conclusion: You generated controlled disk write and read loads using `dd`. You used `iostat` to observe key performance indicators for the specific device: IOPS (`r/s`, `w/s`), throughput (`rkB/s`, `wkB/s`), latency (`r_await`, `w_await`), queue size (`aqu-sz`), and saturation (`%util`). You also used `iotop` to pinpoint the `dd` process as the source of the I/O activity. Analyzing these metrics helps you understand your storage performance limits and identify potential disk bottlenecks affecting applications. High `await` times (e.g., > 10-20 ms for many workloads, though acceptable values vary greatly) combined with high `%util` are key signs of a bottleneck.
5. Network Monitoring and Analysis
Network performance is crucial for servers, workstations accessing network resources, and virtually any system connected to the internet. Monitoring network traffic helps diagnose connectivity issues, identify bandwidth hogs, detect security anomalies, and ensure services are reachable.
Key Network Concepts
- Bandwidth: The maximum theoretical data transfer rate of a network link, often measured in Mbps (Megabits per second) or Gbps (Gigabits per second).
- Throughput: The actual measured data transfer rate being achieved, usually lower than the theoretical bandwidth due to overhead, latency, congestion, etc. Measured in Mbps, Gbps, or often KB/s, MB/s in monitoring tools.
- Latency: The time delay for a packet to travel from source to destination and back (Round Trip Time - RTT), typically measured in milliseconds (ms). High latency impacts interactive applications (SSH, web browsing) and protocols sensitive to delays.
- Packets: Data is transmitted over networks in small units called packets. Monitoring includes packets sent (TX) and received (RX) per second.
- Errors and Drops: Packets that were received corrupted (errors) or discarded (drops) usually due to network congestion, faulty hardware, or configuration issues. Non-zero error/drop counts indicate network problems.
- Sockets and Connections: Network communication occurs via sockets. A socket is an endpoint defined by an IP address and a port number.
    - TCP (Transmission Control Protocol): Connection-oriented, reliable protocol (e.g., for HTTP, SSH, FTP). Connections go through states like `LISTEN` (waiting for an incoming connection), `ESTABLISHED` (active connection), `TIME_WAIT` (waiting after connection close), `CLOSE_WAIT`, etc.
    - UDP (User Datagram Protocol): Connectionless, unreliable protocol (e.g., for DNS, DHCP, some streaming). Simpler, with less overhead, but no guaranteed delivery.
- Network Interface: The hardware (e.g., `eth0`, `enp3s0`, `wlan0`) or virtual device (e.g., `lo`, the loopback) through which the system communicates with the network.
Tools for Network Monitoring
- `ip`: The modern standard Linux tool for displaying and manipulating routing, network devices, interfaces, and tunnels. Replaces older tools like `ifconfig` and `route`.
    - `ip addr show` (or `ip a`): Show IP addresses and details for all interfaces. Look for RX/TX packet/byte counts, errors, drops.
    - `ip -s link show <interface>`: Show detailed statistics (`-s`) for a specific interface, including byte/packet counts, errors, drops, multicast.
# Example (ip -s link show eth0) 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000 link/ether 00:11:22:33:44:55 brd ff:ff:ff:ff:ff:ff RX: bytes packets errors dropped missed mcast 1234567890 1000100 0 5 0 1500 TX: bytes packets errors dropped carrier collsns 987654321 950050 0 0 0 0
- `ss`: The modern standard tool for investigating sockets. Replaces the older `netstat`. Excellent for seeing active connections and listening ports.
    - `ss -tulnp`: Show listening (`-l`) TCP (`-t`) and UDP (`-u`) sockets, disable name resolution (`-n`, faster), and show the process (`-p`) using each socket (requires root/sudo for `-p`). Very common and useful.
    - `ss -tan`: Show all (`-a`) TCP (`-t`) sockets, numeric (`-n`). Useful for seeing all active, listening, and waiting connections.
    - `ss -tun`: Show TCP (`-t`) and UDP (`-u`) sockets, numeric (`-n`).
    - `ss -s`: Show summary statistics for connections by state.
- `netstat` (Legacy): Still found on many systems but generally superseded by `ip` and `ss`. Common legacy usage:
    - `netstat -tulnp`: Similar to `ss -tulnp`.
    - `netstat -i`: Show interface statistics (similar to `ip -s link`).
    - `netstat -r`: Show the routing table (use `ip route` instead).
- `iftop`: A `top`-like utility for displaying bandwidth usage on an interface per connection. Excellent for identifying which hosts/ports are consuming the most bandwidth in real time. Requires root privileges and installation (`sudo apt install iftop` or `sudo yum install iftop`).
    - `sudo iftop`: Monitors the first detected external interface.
    - `sudo iftop -i <interface>`: Specify which interface to monitor (e.g., `eth0`).
    - Interactive keys: `n` (toggle DNS resolution), `s`/`d` (toggle source/destination host display), `p` (toggle port display), `L` (toggle scale: bits/bytes), `q` (quit).
- `nload`: A simple command-line tool that displays network traffic (incoming/outgoing throughput) as graphs. An easy way to quickly visualize current load. (Needs installation: `sudo apt install nload` or `sudo yum install nload`.)
    - `nload`: Monitors all auto-detected interfaces (use the arrow keys to switch).
    - `nload <interface>`: Monitor a specific interface.
- `ping`: Sends ICMP ECHO_REQUEST packets to a host to test reachability and measure round-trip time (latency).
    - `ping google.com`
    - `ping -c 5 8.8.8.8`: Send 5 pings to IP 8.8.8.8.
- `traceroute`/`mtr`: Shows the path (route) packets take to reach a destination host, displaying latency to each hop along the way. Useful for diagnosing network path issues. `mtr` provides a dynamic, updating view combining `ping` and `traceroute`.
    - `traceroute google.com`
    - `mtr google.com` (often preferred; may need installation)
Workshop Monitoring Network Activity During a Download
Goal: To generate network traffic by downloading a file and observe the activity using `ss`, `iftop`, `nload`, and `ip`.
Tools Required: `wget` or `curl` (usually pre-installed), `ss`, `ip`, `iftop`, `nload`.
Steps:
- Install Tools (if needed):
    - Ensure `iftop` and `nload` are installed.
    - Debian/Ubuntu: `sudo apt update && sudo apt install iftop nload wget`
    - CentOS/RHEL/Fedora: `sudo yum install iftop nload wget` or `sudo dnf install iftop nload wget`
- Identify Network Interface:
    - Run `ip a`. Identify your primary active network interface (e.g., `eth0`, `enp3s0`, `wlan0`); it will carry your main IP address. Let's assume it's `eth0`.
- Establish Baseline:
    - Open three terminals (A, B, C).
    - Terminal A: Run `sudo iftop -i eth0` (replace `eth0` if needed). Observe the baseline traffic (likely low) and note the scale (e.g., Kb/Mb). Press `L` to toggle between bits (b) and bytes (B) per second.
    - Terminal B: Run `nload eth0` (replace `eth0` if needed). Observe the baseline incoming/outgoing graphs.
    - Terminal C: This will be used to generate traffic. Check the initial connection state with `ss -tan state established` (it should show few or no relevant connections) and the listening ports with `sudo ss -tulnp`.
- Generate Network Load (Terminal C):
    - Use `wget` to download a reasonably large file from a fast source. Linux distribution ISOs are good candidates (for example, the Ubuntu 22.04 Desktop ISO, ~4.7GB). Find a current mirror link.
    - Let the download run for a minute or two.
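A concrete command line might look like the following; the URL is a placeholder (substitute a current mirror link), and the rate limit plus `timeout` are only there to keep this example bounded; in the workshop you would simply let `wget` run and press `Ctrl+C` when done.

```shell
# Placeholder URL: substitute a current mirror link for any large file.
URL="https://releases.ubuntu.com/22.04/ubuntu-22.04-desktop-amd64.iso"
# --limit-rate caps bandwidth; timeout 30 aborts after 30 seconds of transfer.
timeout 30 wget --limit-rate=2m -O /tmp/testdownload.iso "$URL" \
    || echo "download stopped or failed (offline, or try another mirror)"
rm -f /tmp/testdownload.iso
```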
- Observe While Under Load:
    - Terminal A (`iftop`):
        - You should see a prominent connection between your host's IP and the download server's IP/hostname.
        - The "<=" direction (traffic coming in to your host) should show significant bandwidth usage, matching the download speed reported by `wget`.
        - The "=>" direction (outgoing) will show some traffic (TCP acknowledgements) but much less.
        - Observe the peak and cumulative transfer rates at the top.
        - Press `p` to toggle port display. You should see the connection using port 80 (HTTP) or 443 (HTTPS).
    - Terminal B (`nload`):
        - The "Incoming" graph should show significant activity, corresponding to the download speed; the "Outgoing" graph should show much lower activity.
        - Note the current, average, max, and total transfer values.
    - Terminal C (check connections while `wget` runs; perhaps open another tab/terminal D):
        - Run `ss -tan state established | grep '<server_ip>'` (replace `<server_ip>` with the IP `iftop` shows for the download server, or filter by port, e.g. `:80` or `:443`). You should see the active TCP connection used by `wget`.
        - Run `ip -s link show eth0` (replace `eth0`). Compare the RX `bytes` and `packets` counts before and during/after the download; they should have increased significantly. Check the `errors` and `dropped` counts, which should ideally remain 0.
- Stop the Download: Press `Ctrl+C` in the `wget` terminal (Terminal C).
- Observe After Load:
    - Watch `iftop` and `nload`. The high traffic rates should quickly drop back to baseline levels.
    - Check `ss -tan state established` again. The connection to the download server should eventually disappear (it might enter the `TIME_WAIT` state first).
- Stop Monitoring: Press `q` in `iftop` and `Ctrl+C` in `nload`.
Conclusion: You generated network traffic using `wget` and monitored it effectively. `iftop` helped identify the specific connection responsible for the bandwidth usage and the hosts involved. `nload` provided a simple visual representation of the throughput. `ss` allowed you to inspect the state of the underlying TCP socket, and `ip -s link` provided cumulative statistics for the interface, including vital error counts. These tools are essential for understanding network utilization and diagnosing connectivity or performance problems.
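The counters behind `ip -s link` are also exposed under `/sys/class/net`, which makes a quick throughput estimate easy to script. A minimal sketch (it samples `lo` so the example runs anywhere; substitute your real interface, e.g. `eth0`):

```shell
iface=lo                                                  # substitute e.g. eth0
rx1=$(cat /sys/class/net/$iface/statistics/rx_bytes)      # received-bytes counter
sleep 2
rx2=$(cat /sys/class/net/$iface/statistics/rx_bytes)
# The counter delta over the sampling interval gives average bytes per second.
echo "average RX on $iface: $(( (rx2 - rx1) / 2 )) bytes/s"
```

This is essentially what `nload` does internally, just without the graph.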
6. Process Management and Control
Monitoring tells you what processes are doing and how they are using resources. Process management is about controlling those processes – terminating misbehaving ones, adjusting their priority, or starting them with specific characteristics.
Key Process Management Concepts
- Process ID (PID): A unique number assigned to each running process. Used to identify the target for management commands.
- Parent Process ID (PPID): The PID of the process that created this process. Forms a hierarchy or tree. Process 1 (`init` or `systemd`) is the ancestor of most user processes.
- Process States: As seen in `top`/`htop`/`ps`:
    - `R` (Running or Runnable): Either actively using the CPU or waiting in the run queue for its turn.
    - `S` (Interruptible Sleep): Waiting for an event (e.g., I/O completion, signal, timer). Most processes spend most of their time in this state.
    - `D` (Uninterruptible Sleep): Waiting directly on hardware (usually disk I/O); cannot be interrupted by signals. Processes stuck in the `D` state can indicate hardware or driver problems and are difficult to kill.
    - `Z` (Zombie): The process has terminated, but its exit status hasn't been collected by its parent process yet. It consumes minimal resources (just an entry in the process table). Persistent zombies usually indicate a bug in the parent process.
    - `T` (Stopped or Traced): Process execution has been suspended, usually by a signal like `SIGSTOP` (e.g., pressing `Ctrl+Z` in the terminal) or because it's being debugged (`ptrace`).
- Signals: A standard mechanism in Unix-like systems for notifying processes of events or requesting actions. Processes can react to signals in predefined ways, ignore them, or be forcibly terminated. Common signals:
    - `SIGTERM` (15): The standard "polite" request to terminate. Allows the process to shut down gracefully (save files, close connections, etc.). This is the default signal sent by `kill`.
    - `SIGKILL` (9): The "force kill" signal. The kernel terminates the process immediately without giving it a chance to clean up. Use it as a last resort if `SIGTERM` fails, as it can lead to data loss or corruption. Processes in the `D` state usually cannot be killed even by `SIGKILL`.
    - `SIGHUP` (1): Hang-up signal. Historically sent when a terminal connection was lost; now often used to signal daemons to reload their configuration files.
    - `SIGINT` (2): Interrupt signal. Sent when you press `Ctrl+C` in the terminal. Usually requests termination.
    - `SIGQUIT` (3): Quit signal. Sent by `Ctrl+\`. Similar to `SIGINT` but can also trigger a core dump.
    - `SIGSTOP` (19): Stop signal. Suspends process execution (puts it in the `T` state). Cannot be caught or ignored.
    - `SIGCONT` (18): Continue signal. Resumes a stopped process.
- Priority and Niceness: Linux uses a priority system to schedule which runnable process gets CPU time next.
    - Priority: Internal kernel value (0-139). A lower number means higher priority; 0-99 are for real-time processes, 100-139 for user-space tasks.
    - Nice Value: User-space control (-20 to +19) that maps onto the priority range 100-139.
        - `-20`: Highest user-space priority (most likely to get CPU time).
        - `0`: Default priority.
        - `+19`: Lowest user-space priority (runs only when higher-priority tasks are idle).
    - Only `root` can increase a process's priority (decrease its nice value below 0). Any user can decrease their own process's priority (increase its nice value).
Process Management Commands
- `kill <PID>`: Sends a signal to a process specified by its PID.
    - `kill 12345`: Sends `SIGTERM` (15) to PID 12345 (requests graceful shutdown).
    - `kill -9 12345` or `kill -SIGKILL 12345`: Sends `SIGKILL` (9) to PID 12345 (force kill). Use with caution!
    - `kill -l`: Lists all available signal names and numbers.
    - `kill -HUP 6789` or `kill -1 6789`: Sends `SIGHUP` (1) to PID 6789 (often for config reload).
- `pkill <pattern>`: Sends a signal to processes matching a pattern (usually the process name).
    - `pkill firefox`: Sends `SIGTERM` to all processes named `firefox`.
    - `pkill -9 -u student sleep`: Sends `SIGKILL` to all processes named `sleep` owned by user `student`.
    - `pkill -f "python .*my_script\.py"`: Sends `SIGTERM` to processes whose full command line matches the pattern (`-f` flag). Be careful with patterns!
- `killall <process_name>`: Similar to `pkill`, but matches exact process names only (unless options like `-r` for regex are used). Behavior can sometimes differ slightly from `pkill`.
    - `killall nginx`: Sends `SIGTERM` to all processes exactly named `nginx`.
    - `killall -s SIGHUP nginx`: Sends `SIGHUP` to `nginx` processes.
- `nice -n <niceness> <command>`: Starts a command with a specific nice value.
    - `nice -n 10 ./my_cpu_intensive_script.sh`: Runs the script with reduced priority (nice value 10).
    - `sudo nice -n -5 ./important_task`: Runs the task with increased priority (nice value -5, requires root).
- `renice <niceness> -p <PID>`: Changes the nice value of a running process.
    - `renice 15 -p 12345`: Decreases the priority of PID 12345 (sets its nice value to 15).
    - `sudo renice -10 -p 6789`: Increases the priority of PID 6789 (sets its nice value to -10, requires root).
    - `renice 5 -u student`: Attempts to set the nice value to 5 for all processes owned by user `student`.
- `pgrep <pattern>`: Finds PIDs matching a pattern (useful for getting the PID to use with `kill` or `renice`).
    - `pgrep firefox`: Prints PIDs of `firefox` processes.
    - `pgrep -u root sshd`: Prints PIDs of `sshd` processes owned by `root`.
    - `pgrep -f "my_script\.py"`: Prints PIDs matching the full command line.
Workshop Managing Process States and Priorities
Goal: To practice starting processes, finding their PIDs, sending signals (`SIGTERM`, `SIGKILL`, `SIGSTOP`, `SIGCONT`), and adjusting priorities using `nice` and `renice`.
Tools Required: `sleep`, `yes`, `ps`, `pgrep`, `kill`, `nice`, `renice`, `top` or `htop`.
Steps:
- Start Sample Processes:
    - Open two or three terminals (A, B, C).
    - Terminal A: Start a simple background process that does nothing but wait.
    - Terminal B: Start a CPU-intensive process in the background. The `yes` command outputs 'y' (or its argument) repeatedly, consuming CPU.
- Identify Processes:
    - Terminal C: Use `ps` and `pgrep` to find the PIDs.
    - Run `htop` or `top`. Find both processes. Note their default `NI` (nice) value (usually 0) and `PR` (priority). The `yes` process should show high `%CPU` usage.
- Terminate Gracefully (`SIGTERM`):
    - Terminal C: Send `SIGTERM` to the `sleep` process using its PID (e.g., `kill 23456`; the PIDs here are examples).
    - Check Terminal A. You should see a "Terminated" message.
    - Verify with `ps aux | grep 23456 | grep -v grep` or `pgrep sleep`. It should be gone.
- Attempt Graceful, Then Force Kill (`SIGTERM` -> `SIGKILL`):
    - Some processes might ignore `SIGTERM` or take time to shut down. We'll simulate this with `yes`, which usually exits quickly on `SIGTERM`, but imagine it didn't.
    - Terminal C: Send `SIGTERM` to the `yes` process.
    - Check Terminal B. It should terminate almost immediately.
    - For practice: Restart `yes > /dev/null &` in Terminal B and get its new PID (e.g., 23460). Now, pretend `SIGTERM` didn't work and force it with `kill -9` on that PID.
    - Check Terminal B. It should show "Killed". Verify with `ps` or `pgrep` that it's gone.
- Stop and Continue a Process (`SIGSTOP`, `SIGCONT`):
    - Terminal B: Start `yes > /dev/null &` again. Get its new PID (e.g., 23462).
    - Terminal C: Observe the `yes` process in `htop`/`top`. Note its high `%CPU` and `R` (running) state.
    - Terminal C: Send the `SIGSTOP` signal.
    - Observe in `htop`/`top`. The `yes` process's state (`S` column) should change to `T` (stopped), and its `%CPU` usage should drop to 0.
    - Terminal C: Send the `SIGCONT` signal to resume it.
    - Observe in `htop`/`top`. The `yes` process should return to the `R` state and resume consuming CPU.
    - Clean up: `kill 23462` (sends `SIGTERM`).
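The signal commands for this step can be sketched end-to-end on a throwaway process (the PIDs in the text, e.g. 23462, are examples):

```shell
yes > /dev/null &  pid=$!

kill -STOP "$pid"                               # suspend: state column becomes T
sleep 1
state_stopped=$(ps -o stat= -p "$pid" | tr -d ' ')

kill -CONT "$pid"                               # resume: back to R (or S)
sleep 1
state_running=$(ps -o stat= -p "$pid" | tr -d ' ')

kill "$pid"                                     # SIGTERM to clean up
echo "while stopped: $state_stopped  after resume: $state_running"
```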
- Run with Lower Priority (`nice`):
    - Terminal B: Start `yes` with a lower priority (higher nice value).
    - Terminal C: Observe in `htop`/`top`. Find the new `yes` process. Its `NI` value should be 15, and its `PR` value will be higher (lower priority) than the default. If other CPU-bound tasks were running at default priority, this `yes` process would get less CPU time.
- Change Priority of Running Process (`renice`):
    - Keep the `yes` process from step 6 (e.g., PID 23464, nice 15) running.
    - Terminal C: Change its priority back to the default nice value (0). Note that on most systems lowering the nice value, even back toward 0, requires root; use `sudo` if you get a permission error.
    - Observe in `htop`/`top`. The `NI` value for PID 23464 should change back to 0.
    - Terminal C: Try to increase its priority (lower the nice value) without `sudo`; this should fail.
    - Terminal C: Increase its priority using `sudo`.
    - Observe in `htop`/`top`. The `NI` value should now be -5, and the `PR` value should be lower (higher priority).
    - Clean up: `sudo kill 23464` (or just `kill 23464`).
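Step 7 can be sketched the same way; note the permission asymmetry (lowering the nice value back toward 0 generally fails without root):

```shell
yes > /dev/null &  pid=$!
renice 15 -p "$pid"                     # raising the nice value: allowed for any user
ni_after=$(ps -o ni= -p "$pid" | tr -d ' ')
# Lowering the nice value (raising priority) normally needs root / CAP_SYS_NICE:
renice 0 -p "$pid" || echo "renice back to 0 refused: run it with sudo"
kill "$pid"                             # clean up
echo "nice value after renice 15: $ni_after"
```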
Conclusion: You've practiced finding processes using `ps` and `pgrep`. You learned how to terminate processes using `SIGTERM` (graceful) and `SIGKILL` (forceful). You experimented with stopping (`SIGSTOP`) and resuming (`SIGCONT`) processes. Finally, you used `nice` to start a process with adjusted priority and `renice` to change the priority of a running process, observing the effects on the nice (`NI`) value in `top`/`htop` and understanding the permissions required. These commands give you direct control over running tasks on your system.
7. System Logging
System logs are chronological records of events occurring on the system, generated by the kernel, system services, and applications. They are indispensable for troubleshooting problems, auditing security events, and understanding system behavior over time. Modern Linux systems primarily use systemd-journald
, while traditional syslog
also remains relevant.
Modern Logging systemd-journald
`systemd-journald` is a system service that collects and stores logging data. It captures `syslog` messages, kernel messages, standard output/error of services, and more. Its key features include:
- Structured Logging: Logs can include key-value pairs (metadata) beyond the simple message string, allowing for powerful filtering.
- Indexing: Logs are indexed, making searching and filtering very fast.
- Centralized Collection: Gathers logs from various sources into one journal.
- Volatility Control: Can store logs persistently on disk (usually under
/var/log/journal
) or just in memory (/run/log/journal
). Configuration is in/etc/systemd/journald.conf
. - Integration with
systemd
units: Easy to view logs specific to a service managed bysystemd
.
The `journalctl` Command:
This is the primary tool for querying the `systemd` journal.

- `journalctl`: Show the entire journal (newest entries last). Press `q` to quit; use arrows/PageUp/PageDown to navigate.
- `journalctl -r`: Show the journal in reverse order (newest entries first).
- `journalctl -n 20`: Show the last 20 log entries.
- `journalctl -f`: Follow the journal in real time (like `tail -f`); new entries are printed as they arrive. `Ctrl+C` to exit.
- `journalctl -u <unit_name>`: Show logs only for a specific `systemd` unit (service or target). Very useful!
    - `journalctl -u sshd` (show logs for the SSH daemon service)
    - `journalctl -u nginx.service`
- `journalctl /path/to/executable`: Show logs generated by a specific program, e.g. `journalctl /usr/sbin/sshd`.
- `journalctl --since "YYYY-MM-DD HH:MM:SS"`: Show logs since a specific time.
    - `journalctl --since "2023-10-27 09:00:00"`
    - `journalctl --since "1 hour ago"`
    - `journalctl --since yesterday`
- `journalctl --until "YYYY-MM-DD HH:MM:SS"`: Show logs until a specific time. Can be combined with `--since`.
- `journalctl -p <priority>`: Filter by message priority. Priorities are: `emerg` (0), `alert` (1), `crit` (2), `err` (3), `warning` (4), `notice` (5), `info` (6), `debug` (7). Filtering by `err` also shows the more severe priorities (`crit`, `alert`, `emerg`).
    - `journalctl -p err` (show errors and worse)
    - `journalctl -p 3` (same as above)
    - `journalctl -p warning..err` (show only messages in the given priority range)
- `journalctl _PID=<pid>`: Show logs for a specific process ID.
- `journalctl _COMM=<command_name>`: Show logs for processes with a specific command name.
- `journalctl -k`: Show only kernel messages (equivalent to `dmesg`).
- `journalctl -b`: Show messages from the current boot; `journalctl -b -1` shows messages from the previous boot.
journalctl --disk-usage
: Show how much disk space the persistent journal logs are using.journalctl --vacuum-size=1G
: Reduce journal size on disk to 1 Gigabyte (removes oldest logs).journalctl --vacuum-time=2weeks
: Remove journal entries older than two weeks.
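Because the priority names map to fixed numbers, they are easy to script against. The helper below is a hypothetical sketch (not part of `journalctl`) that mirrors the table above, useful for example when post-processing exported log data:

```shell
#!/usr/bin/env bash
# Map a syslog/journald priority name to its numeric level,
# mirroring the table used by `journalctl -p`.
prio_num() {
  case "$1" in
    emerg)   echo 0 ;;
    alert)   echo 1 ;;
    crit)    echo 2 ;;
    err)     echo 3 ;;
    warning) echo 4 ;;
    notice)  echo 5 ;;
    info)    echo 6 ;;
    debug)   echo 7 ;;
    *)       echo "unknown priority: $1" >&2; return 1 ;;
  esac
}

# "err and worse" means levels 0..3 -- lower numbers are MORE severe.
prio_num err      # -> 3
prio_num warning  # -> 4
```

The key point the helper encodes: severity decreases as the number increases, which is why `-p err` includes crit, alert, and emerg.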
Traditional syslog
(rsyslog
, syslog-ng
)
Before systemd-journald
, syslog
was the standard logging mechanism. Many systems still run a syslog
daemon (like rsyslog
- the most common, or syslog-ng
) alongside journald
. journald
often forwards messages to rsyslog
for traditional file-based logging.
- Configuration: Typically `/etc/rsyslog.conf` and files within `/etc/rsyslog.d/`. These files define rules based on "facility" (the type of program generating the message, e.g., `kern`, `auth`, `mail`, `cron`) and "priority" (the same levels as `journalctl`) to determine which log file receives each message.
- Common Log Files (under `/var/log/`):
  - `/var/log/syslog` or `/var/log/messages`: General system messages.
  - `/var/log/auth.log` or `/var/log/secure`: Authentication-related messages (logins, `sudo`, `ssh`).
  - `/var/log/kern.log`: Kernel messages.
  - `/var/log/dmesg`: Kernel ring buffer messages from boot time (often overwritten or rotated).
  - `/var/log/cron.log` or within `syslog`/`messages`: Cron job execution logs.
  - `/var/log/boot.log`: System boot messages.
  - Application-specific logs: Many applications (like Apache, Nginx, databases) manage their own logs, often also under `/var/log/`.
- Tools for Reading Text Logs:
  - `tail -f <logfile>`: Follow a specific log file in real-time.
  - `less <logfile>`: View a log file with scrolling and searching capabilities.
  - `grep <pattern> <logfile>`: Search for specific patterns within a log file.
  - `zcat`, `zless`, `zgrep`: View/search compressed log files (often ending in `.gz`).
- Log Rotation: Log files can grow indefinitely. The `logrotate` utility (configured via `/etc/logrotate.conf` and `/etc/logrotate.d/`) automatically manages log files – rotating them (e.g., renaming `syslog` to `syslog.1`), compressing old logs (`syslog.1.gz`), and eventually deleting the oldest ones to prevent disk space exhaustion.
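The rotate/compress/expire cycle that `logrotate` automates can be imitated by hand. The toy sketch below performs one rotation step in a scratch directory (no real system logs are touched; the filenames only mimic the `syslog`/`syslog.1.gz` convention):

```shell
#!/usr/bin/env bash
# Toy illustration of one logrotate cycle: rename, compress, start fresh.
# Operates on a temporary directory so no real logs are affected.
set -eu
dir=$(mktemp -d)
echo "old messages" > "$dir/syslog"

# Rotate: syslog -> syslog.1, then compress the rotated copy.
mv "$dir/syslog" "$dir/syslog.1"
gzip "$dir/syslog.1"           # produces syslog.1.gz

# The logging daemon would now be signaled to reopen its log file;
# here we simply create a fresh empty one.
: > "$dir/syslog"

ls "$dir"                      # syslog  syslog.1.gz
zcat "$dir/syslog.1.gz"        # old messages
```

A real `logrotate` run adds the policy layer on top of this mechanic: how often to rotate, how many generations to keep, and whether to compress.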
Workshop Exploring System Logs with journalctl
and Text Files
Goal: To practice querying system logs using journalctl
for systemd-based logging and standard tools for traditional text log files.
Tools Required: journalctl
, logger
, tail
, less
, grep
, sudo
.
Steps:
-
Generate a Custom Log Message:
- The `logger` command sends a message to the system logger (`journald` and/or `syslog`).
- Open a terminal (A) and send a distinctive test message, for example: `logger "STUDENT_WORKSHOP_TEST_MESSAGE - Step 1"`.
-
Find the Message with
journalctl
:- Terminal A:
# View recent logs, look for your message journalctl -n 50 # Filter by the 'logger' command specifically journalctl -t logger # Follow the logs and generate another message journalctl -f & # Note the PID of the background journalctl process if needed later logger "STUDENT_WORKSHOP_TEST_MESSAGE - Step 2 Following" # You should see the Step 2 message appear immediately in the journalctl -f output. # Press Ctrl+C to stop following (or kill the background PID if needed)
-
Explore
journalctl
Filtering:- Terminal A:
# View logs from the SSH service (replace sshd if using a different name) journalctl -u sshd -n 20 -r # Last 20 sshd logs, newest first # View kernel messages from this boot journalctl -k -b 0 -n 30 # View error messages (priority err or higher) from the last hour journalctl -p err --since "1 hour ago" # View logs for your current login session (find your session PID if needed) # Example: Find your shell's PID: echo $$ # journalctl _PID=$(echo $$) # May not show much unless shell logs directly
-
Explore Traditional Log Files (if applicable):
- System configuration varies: `journald` might be the primary store, or logs might also be written to `/var/log`.
- Terminal A: Check whether the `logger` message also reached a text file, e.g. `grep STUDENT_WORKSHOP /var/log/syslog` (Debian/Ubuntu) or `grep STUDENT_WORKSHOP /var/log/messages` (RHEL/CentOS).
- Examine authentication logs, e.g. `sudo tail -n 20 /var/log/auth.log` (or `/var/log/secure`).
- Examine kernel logs, e.g. `less /var/log/kern.log` or `dmesg | tail -n 20`.
-
Simulate an Event and Find Logs:
- Terminal B: Attempt an invalid SSH login to your own machine (it will fail), e.g. `ssh nosuchuser@localhost` with any password.
- Terminal A: Look for evidence of the failed login attempt, e.g. `journalctl -u sshd -n 20` (the unit is named `ssh` on Debian/Ubuntu) or `sudo grep "Failed\|Invalid" /var/log/auth.log`.
-
Check Log Rotation Config (Optional):
- Terminal A: Look at the main logrotate configuration and specific rules, e.g. `less /etc/logrotate.conf` and `ls /etc/logrotate.d/`.
- Look for settings like
daily
,weekly
,rotate 4
,compress
,size
.
Conclusion: You practiced using journalctl
to view, follow, and filter logs from the systemd journal based on time, unit, priority, and other metadata. You also used traditional tools (tail
, less
, grep
) to examine text-based log files in /var/log
(if configured). You generated specific log entries using logger
and simulated an event (failed SSH login) to practice finding relevant log information for troubleshooting. Understanding how to navigate and interpret system logs is a fundamental troubleshooting skill.
8. Advanced Monitoring and Resource Control
Beyond the standard command-line tools, several more advanced utilities offer consolidated views or finer control over system resources. Control Groups (cgroups) are a powerful kernel feature for limiting and isolating resource usage.
glances
The All-in-One Monitoring Tool
glances
is a cross-platform, curses-based monitoring tool written in Python. It aims to present a large amount of information from various system resources in a single view, dynamically adapting to the terminal size. It combines aspects of top
, htop
, iostat
, nload
, free
, and more.
- Features: CPU, Memory, Load, Process List, Network I/O, Disk I/O, Filesystem Usage, Sensors (temperature/voltage, if supported), Docker container stats, alerts, web UI, REST API.
- Installation: Often requires installation (`sudo apt install glances` or `sudo pip install glances`). May need extra Python libraries for optional features (like sensors, Docker).
- Usage: Simply run `glances`.
- Interactive Keys: Similar to `htop`: `q` (Quit), `1` (Toggle per-CPU), `m` (Sort by MEM%), `p` (Sort by CPU%), `i` (Sort by I/O rate), `d` (Show/hide disk I/O), `n` (Show/hide network I/O), `f` (Show/hide filesystem), `s` (Show/hide sensors), `l` (Show/hide logs/alerts), `h` (Help).
glances
is excellent for getting a quick, comprehensive overview of the system's current state.
Control Groups (cgroups)
Control Groups are a Linux kernel feature that allows you to allocate, limit, prioritize, and account for resource usage (CPU, memory, network bandwidth, disk I/O) for collections of processes.
- Hierarchy: Cgroups are organized hierarchically, usually mounted under
/sys/fs/cgroup/
. Different resource controllers (likecpu
,memory
,blkio
,net_cls
) manage specific resources within this hierarchy. - Use Cases:
- Resource Limiting: Prevent a group of processes (e.g., a specific user's tasks, a web server, a container) from consuming excessive memory or CPU, ensuring fairness and stability.
- Prioritization: Allocate more CPU shares to critical applications.
- Accounting: Measure resource consumption by specific groups.
- Freezing/Thawing: Stop and resume all processes within a cgroup.
- Management: While you can interact directly with the `/sys/fs/cgroup/` filesystem (creating directories, writing values to control files like `memory.limit_in_bytes` or `cpu.shares` on cgroup v1, or `memory.max` and `cpu.weight` on cgroup v2), it's complex. Higher-level tools usually manage cgroups:
  - `systemd`: Heavily utilizes cgroups for managing services, user sessions, and scopes. You can set resource limits directly in systemd unit files (e.g., `MemoryLimit=`, now `MemoryMax=` on cgroup v2; `CPUShares=`, now `CPUWeight=`; `TasksMax=`). The `systemd-run` command can launch transient processes within a dedicated cgroup scope.
  - Containerization Platforms (Docker, Podman, Kubernetes): Rely extensively on cgroups to isolate containers and enforce resource limits defined for them.
  - Dedicated tools like `libcgroup-tools` (provides `cgcreate`, `cgexec`, etc.) exist but are less commonly used directly now compared to systemd or container tools.
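As an illustration of the unit-file route, a hypothetical service could declare its limits directly. The service name `myapp` and its binary path are invented for this sketch; the directive names are the current systemd ones (systemd does not allow inline comments in unit files, so notes sit on their own lines):

```ini
# /etc/systemd/system/myapp.service -- illustrative example only
[Unit]
Description=Resource-limited example service

[Service]
ExecStart=/usr/local/bin/myapp
# Hard memory cap (cgroup v2 name; older cgroup v1 setups used MemoryLimit=)
MemoryMax=512M
# At most half of one CPU's worth of time
CPUQuota=50%
# Cap the number of tasks (processes/threads) in this unit's cgroup
TasksMax=100

[Install]
WantedBy=multi-user.target
```

After editing a unit file, `sudo systemctl daemon-reload` makes systemd pick up the new limits.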
Example using systemd-run
(Simplified):
# Run 'stress' allocating 1GB RAM, but limit its cgroup scope to 500MB
# This should cause 'stress' to be killed by the OOM killer within its cgroup
sudo systemd-run --scope -p MemoryLimit=500M stress --vm 1 --vm-bytes 1G
# Check system logs for OOM kill message related to the scope
journalctl -k | grep -i oom
Cgroups are a powerful but advanced topic. For most users and administrators, interaction happens indirectly via systemd
service management or container platforms. Understanding the concept is important for comprehending modern Linux resource management.
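Conceptually, every process already lives in some cgroup, and you can inspect the kernel's view without any special tooling. The sketch below assumes the cgroup v2 unified hierarchy mounted at `/sys/fs/cgroup`; on pure cgroup v1 systems the paths differ and the fallback branch is taken:

```shell
#!/usr/bin/env bash
# Show which cgroup the current process belongs to and, if readable,
# its memory limit. Assumes a cgroup v2 (unified) hierarchy.
cat /proc/self/cgroup          # v2 format: 0::/some/cgroup/path

# Extract the v2 path (the line starting with "0::"), if present.
cg=$(awk -F'::' '/^0::/ {print $2}' /proc/self/cgroup)

limit_file="/sys/fs/cgroup${cg}/memory.max"
if [ -n "$cg" ] && [ -r "$limit_file" ]; then
  echo "memory.max: $(cat "$limit_file")"   # "max" means no limit is set
else
  echo "no readable memory.max here (cgroup v1, or limit not delegated)"
fi
```

Services started with a `MemoryMax=`/`MemoryLimit=` setting will show a byte count instead of `max` in their cgroup's `memory.max` file.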
Workshop Using glances
and Experimenting with systemd-run
Limits
Goal: To explore the comprehensive view provided by glances
and demonstrate a basic resource limit using systemd-run
and cgroups.
Tools Required: glances
, stress
, systemd-run
, journalctl
.
Steps:
-
Install
glances
andstress
(if needed):- Debian/Ubuntu:
sudo apt update && sudo apt install glances stress
- CentOS/RHEL/Fedora:
sudo yum install glances stress
orsudo dnf install glances stress
- (Optional) For full features:
sudo pip install 'glances[all]'
-
Explore
glances
:- Open a terminal (A). Run
glances
. - Maximize the terminal window for the best view.
- Observe the different sections: CPU (overall and per-core if toggled with
1
), Load, Memory (including cache/available), Swap, Network I/O, Disk I/O, Filesystem usage, Process list. - Use the interactive keys:
m
: Sort processes by memory.p
: Sort processes by CPU.i
: Sort processes by I/O.d
: Toggle Disk I/O section visibility.n
: Toggle Network I/O section visibility.f
: Toggle Filesystem section visibility.l
: Toggle Logs/Alerts section visibility (might show warnings/criticals).h
: View the help screen.q
: Quitglances
.
- Run
glances
again. While it's running, generate some load in another terminal (B) (e.g.,stress --cpu 1
ordd if=/dev/zero of=test bs=1M count=100 oflag=direct
). Observe howglances
reflects the CPU or Disk I/O load in real-time. Stop the load generation and quitglances
.
-
Experiment with
systemd-run
Memory Limit:
- Terminal A: Prepare to watch the system logs for OOM (Out Of Memory) events: run `journalctl -f` and leave it following.
- Terminal B: Run the
stress
command within a cgroup scope limited to 100MB of memory, but askstress
to allocate 200MB.sudo systemd-run --unit=stress-test --scope -p MemoryLimit=100M stress --vm 1 --vm-bytes 200M --verbose # --unit=stress-test: Gives the transient unit a name # --scope: Creates a scope unit (doesn't track service lifecycle) # -p MemoryLimit=100M: Apply a 100MB memory limit via cgroup memory controller # stress --vm 1 --vm-bytes 200M: Ask stress to allocate 200MB # --verbose: Make stress print more info
- Observe Terminal B: The `stress` command will likely run for a short time and then terminate abruptly. You might see an error message from `stress` or just notice it exits.
- Observe Terminal A: Watch the `journalctl` output. You should see kernel messages indicating an OOM kill event, mentioning the `stress` process and likely the `stress-test.scope` cgroup being killed for exceeding its `MemoryLimit`.
- Stop the
journalctl -f
process in Terminal A (Ctrl+C
).
-
Experiment with
systemd-run
CPU Limit (Optional):
- CPU limits are often set using shares (`CPUShares`, called `CPUWeight` on cgroup v2) or quotas (`CPUQuota`). Shares are relative priorities when the CPU is contended, while a quota is a hard limit on CPU time percentage. Let's try quota.
- Terminal A: Run `htop` or `top`.
- Terminal B: Run `stress` normally first (`stress --cpu 1`) to see it use 100% of one core, then stop it with `Ctrl+C`.
- Terminal B: Now run it within a scope limited to 20% CPU time, e.g. `sudo systemd-run --unit=stress-cpu-test --scope -p CPUQuota=20% stress --cpu 1`.
- Observe Terminal A (
htop
/top
): Find the newstress
process running within thestress-cpu-test.scope
. Its CPU usage should be capped at approximately 20%, even though it's trying to run flat out. - Clean up:
sudo killall stress
or find the specific PID and kill it.
Conclusion: You used glances
to get a consolidated, real-time view of system resources, demonstrating its utility as a comprehensive dashboard. You then experimented with systemd-run
to leverage Control Groups (cgroups) for resource limiting. You successfully applied a MemoryLimit
that triggered an OOM kill within the cgroup when exceeded, and you optionally applied a CPUQuota
to restrict the CPU time available to a process. This demonstrates the power of cgroups in enforcing resource boundaries, a fundamental concept used heavily by systemd
and containerization technologies.
Conclusion Summarizing Monitoring and Management
Effective system monitoring and resource management are not optional extras; they are fundamental requirements for maintaining stable, performant, and reliable Linux systems. Throughout this section, we've journeyed from basic real-time observation to specific resource analysis and active process control.
Key Takeaways:
- Real-Time Observation: Tools like `top`, `htop`, and `glances` provide immediate insight into the current state of CPU, memory, processes, and load average.
- Snapshot Analysis: `ps` gives detailed information about processes at a specific moment, crucial for scripting and targeted queries.
- Resource-Specific Tools: We delved into dedicated tools for deeper analysis:
  - CPU: `mpstat` (per-core stats), `vmstat` (run queue, context switches), `uptime` (load average). Understanding `%user`, `%system`, `%idle`, and especially `%iowait` is critical.
  - Memory: `free` (available vs free, cache), `vmstat` (swapping activity `si`/`so`), `/proc/meminfo` (details). Recognizing memory pressure and swapping is key.
  - Disk I/O: `iostat` (throughput, IOPS, await times, utilization), `iotop` (per-process I/O). High `await` times are a strong indicator of bottlenecks.
  - Network: `ip` (interface stats, errors), `ss` (socket states, connections), `iftop`/`nload` (real-time bandwidth per connection/interface), `ping`/`mtr` (latency/path).
- Process Control: We learned to manage processes using signals (`kill`, `pkill`, `killall`) for termination (`SIGTERM`, `SIGKILL`) or state changes (`SIGSTOP`, `SIGCONT`), and to influence scheduling with `nice` and `renice`.
- Logging: Understanding `journalctl` for querying the systemd journal and traditional tools (`tail`, `grep`, `less`) for `/var/log` files is essential for troubleshooting and auditing.
- Advanced Concepts: `glances` offers a unified view, and Control Groups (`cgroups`), often managed via `systemd` or container tools, provide powerful mechanisms for resource limiting and isolation.
The Continuous Cycle:
Monitoring and management form a continuous cycle:
- Monitor: Regularly observe system metrics using appropriate tools.
- Analyze: Interpret the data – identify trends, anomalies, bottlenecks, or errors.
- Act: Take corrective action – kill runaway processes, adjust priorities, optimize configurations, plan for upgrades.
- Verify: Confirm that the actions taken had the desired effect by monitoring again.
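To make the cycle concrete, here is a tiny "monitor and analyze" sketch. It reads the 1-minute load average from `/proc/loadavg` and compares it per CPU against a threshold of 0.8, which is an arbitrary illustrative choice; a real deployment would alert or remediate rather than just print:

```shell
#!/usr/bin/env bash
# Minimal monitor/analyze step: flag when the 1-minute load average
# exceeds a per-CPU threshold. The 0.8 threshold is illustrative.
cpus=$(nproc)
load1=$(awk '{print $1}' /proc/loadavg)

# "Analyze": exit status of the awk program encodes the comparison.
if awk -v l="$load1" -v c="$cpus" 'BEGIN { exit !(l / c > 0.8) }'; then
  echo "HIGH load: $load1 across $cpus CPU(s) -- investigate with top/htop"
else
  echo "OK: load $load1 across $cpus CPU(s)"
fi
```

Run from cron or a systemd timer, a script like this is the "Monitor" and "Analyze" steps; the "Act" step is whatever you wire into the HIGH branch.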
Mastering these tools and concepts empowers you to diagnose problems effectively, optimize performance proactively, and ensure the overall health and efficiency of your Linux environments. This knowledge is foundational for any system administrator, developer, or power user working with Linux. Remember that context is crucial – what constitutes "high" usage or a "problem" depends heavily on the specific system, its hardware, and its intended workload. Keep exploring, keep experimenting (safely!), and keep learning.