Author | Nejat Hakan |
nejat.hakan@outlook.de | |
Network/Server Monitoring Nagios
Introduction to Nagios
Welcome to this comprehensive guide on Network and Server Monitoring with Nagios. In the world of IT infrastructure management, proactive monitoring is not just a best practice; it's a necessity. Downtime can lead to significant financial losses, damage to reputation, and decreased productivity. Nagios stands as one of the most established, powerful, and flexible open-source monitoring solutions available, empowering administrators to detect and resolve IT infrastructure problems before they affect critical business processes.
This guide is designed for university students and aspiring system administrators who wish to delve deep into Nagios, understand its architecture, learn how to install and configure it from scratch (self-hosting), and master its various features to monitor diverse environments effectively. We will cover everything from the fundamental concepts to advanced techniques, ensuring you gain practical, hands-on experience through detailed workshops.
What is Nagios?
Nagios, specifically Nagios Core, is an open-source application that provides monitoring and alerting services for servers, switches, applications, and services. It was originally created by Ethan Galstad and is now developed and maintained by a vibrant community. Nagios doesn't perform any monitoring itself; instead, it relies on plugins to do the actual work. It acts as a scheduler, a state manager, an alerter, and a central dashboard for the information gathered by these plugins.
The primary goals of Nagios are:
- Monitoring: To continuously check the status of hosts (servers, network devices) and the services running on them.
- Alerting: To notify administrators when problems arise, allowing for rapid response.
- Reporting: To provide historical data and reports on availability, performance, and incidents.
- Visibility: To offer a centralized view of the entire IT infrastructure's health.
Why is Monitoring Important?
Effective monitoring offers numerous benefits:
- Proactive Problem Detection: Identify issues before they escalate into major outages.
- Reduced Downtime: Faster problem resolution leads to increased availability of services.
- Capacity Planning: Track resource utilization (CPU, memory, disk, network bandwidth) to predict future needs.
- SLA Management: Verify that Service Level Agreements (SLAs) are being met.
- Security: Monitor for unauthorized changes or suspicious activities.
- Troubleshooting: Provide valuable data to diagnose complex problems quickly.
- Peace of Mind: Knowing that your systems are being watched over, even when you're not actively looking.
Core Concepts of Nagios
Understanding these fundamental concepts is crucial before diving into the practical aspects:
- Hosts: These are physical or virtual devices on your network that you want to monitor. Examples include servers, workstations, routers, switches, and printers. Each host has an address (IP or FQDN) and can be in states like `UP`, `DOWN`, or `UNREACHABLE`.
- Services: These are specific functionalities or resources associated with a host. Examples include CPU load, disk usage, memory usage, a running web server (HTTP), an SSH daemon, a specific process, or network connectivity (PING). Services have states like `OK`, `WARNING`, `CRITICAL`, or `UNKNOWN`.
- Plugins: These are external, executable scripts or programs that perform the actual checks. Nagios Core calls these plugins to determine the status of a host or service. Plugins return an exit code indicating status (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) and output text providing details. Thousands of plugins are available, covering almost any conceivable check.
- Commands: Nagios uses command definitions to specify how to execute plugins. These definitions include the plugin's path and any arguments it requires.
- Checks:
    - Active Checks: These are initiated by the Nagios server. Nagios schedules and executes plugins at regular intervals to check the status of hosts and services.
    - Passive Checks: These are initiated by external applications or processes on the monitored hosts. The results are submitted to Nagios for processing. This is useful for monitoring asynchronous events or services behind restrictive firewalls.
- States:
    - Host States: `UP` (reachable and responding), `DOWN` (the host itself is not responding), `UNREACHABLE` (an intermediate host, like a router, is down, preventing Nagios from reaching the target host).
    - Service States: `OK` (functioning correctly), `WARNING` (potential issue or approaching a threshold), `CRITICAL` (serious issue, service likely unavailable), `UNKNOWN` (unable to determine status, often due to plugin errors or misconfiguration).
    - State Types:
        - Soft State: A temporary, unconfirmed state. When a host or service first changes state, it enters a soft state. Nagios re-checks it a configurable number of times before confirming.
        - Hard State: A confirmed, persistent state. If the state remains the same after the configured number of check attempts (`max_check_attempts`), it transitions to a hard state. Notifications are sent only on hard state changes; event handlers also run during soft problem states, which lets them attempt a fix before anyone is notified.
- Notifications: When a host or service enters a hard problem state (or recovers), Nagios can send notifications to designated contacts (e.g., administrators) via various methods like email, SMS, or custom scripts.
- Contacts and Contact Groups: Contacts are individuals who receive notifications. Contact groups are collections of contacts, simplifying notification management.
- Timeperiods: These define when Nagios is allowed to perform checks or send notifications (e.g., "24x7", "workhours", "nonworkhours").
- Event Handlers: Optional scripts that can be executed when a host or service changes state, allowing for automated remediation attempts (e.g., restarting a failed service).
- NRPE (Nagios Remote Plugin Executor): A common addon used to execute Nagios plugins on remote Linux/Unix hosts. The Nagios server uses the `check_nrpe` plugin to connect to an NRPE daemon running on the remote host, which then executes local plugins.
- NSClient++: A versatile agent often used for monitoring Windows machines. It can act as an NRPE daemon, an NSCA client, and has its own built-in checks.
- NSCA (Nagios Service Check Acceptor): A daemon that runs on the Nagios server to accept passive check results submitted by external applications using the `send_nsca` client.
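The plugin contract described above (an exit code plus one line of output) can be sketched in a few lines of shell. This is a minimal illustration, not one of the official plugins: the function name `check_threshold` and its three arguments are hypothetical, and a standalone plugin script would `exit` with these codes rather than `return` them.

```shell
# Hypothetical plugin logic: compare a value against warning/critical thresholds.
# Exit codes follow the plugin convention: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
check_threshold() {
    value=$1; warn=$2; crit=$3
    case "$value" in
        ''|*[!0-9]*)
            echo "UNKNOWN - value '$value' is not a non-negative integer"
            return 3 ;;
    esac
    # Text after the "|" is performance data in label=value;warn;crit form.
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - value is $value | value=$value;$warn;$crit"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - value is $value | value=$value;$warn;$crit"
        return 1
    fi
    echo "OK - value is $value | value=$value;$warn;$crit"
    return 0
}

check_threshold 42 80 90
echo "exit code: $?"
```

Nagios reads only the exit code to decide the state; the text is what appears in the web interface and in notifications.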
Nagios Architecture Overview
A typical Nagios setup involves:
- The Nagios Server: This is the central machine where Nagios Core is installed. It is responsible for:
    - Scheduling Checks: Deciding when and how often to check hosts and services.
    - Executing Checks: Running plugins (either locally or via agents like NRPE for remote checks).
    - Processing Check Results: Determining the status of hosts/services based on plugin output.
    - State Management: Tracking current and historical states.
    - Event Correlation & Handling: Managing dependencies, escalations, and event handlers.
    - Notification Engine: Sending alerts to appropriate contacts.
    - Web Interface (CGI): Providing a visual dashboard for users to view status, history, and reports.
- Monitored Hosts: These are the remote machines or devices being monitored. They might have agents installed (like NRPE or NSClient++) to allow Nagios to execute plugins locally on them.
- Plugins: Reside on the Nagios server (for local checks or checks like PING/HTTP) and potentially on monitored hosts (executed by agents).
Data Flow (Active Check Example):
- Nagios Scheduler determines it's time to check a service on a remote host.
- Nagios process executes the `check_nrpe` plugin (on the Nagios server).
- `check_nrpe` connects to the NRPE daemon on the remote host.
- The NRPE daemon executes a specific local plugin (e.g., `check_disk`) on the remote host.
- The local plugin returns its status and output to the NRPE daemon.
- The NRPE daemon sends this information back to `check_nrpe` on the Nagios server.
- `check_nrpe` provides the result to the Nagios process.
- Nagios updates the service status. If a state change warrants it, it triggers notifications or event handlers.
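The last step depends on the soft/hard state logic introduced under Core Concepts. The sketch below is a deliberately simplified simulation of that logic, not Nagios's actual scheduler code: `simulate_states` is a hypothetical function whose first argument plays the role of `max_check_attempts`, and whose remaining arguments are a sequence of plugin exit codes.

```shell
# Simplified model: a problem stays SOFT until it has been seen
# max_attempts times in a row, then becomes HARD and triggers one notification.
simulate_states() {
    max_attempts=$1; shift
    attempt=0
    for rc in "$@"; do
        if [ "$rc" -eq 0 ]; then
            attempt=0                      # any OK result resets the counter
            echo "OK"
        elif [ "$attempt" -lt "$max_attempts" ]; then
            attempt=$((attempt + 1))
            if [ "$attempt" -eq "$max_attempts" ]; then
                echo "PROBLEM HARD ($attempt/$max_attempts) -> notify"
            else
                echo "PROBLEM SOFT ($attempt/$max_attempts) -> re-check"
            fi
        else
            echo "PROBLEM HARD ($attempt/$max_attempts)"   # already confirmed
        fi
    done
}

# One OK result, four CRITICAL results, then a recovery.
simulate_states 3 0 2 2 2 2 0
```

With three attempts configured, the first two CRITICAL results are reported as soft and only the third produces the "notify" line. The model ignores re-notification intervals and recovery notifications.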
Benefits of Self-Hosting Nagios
While cloud-based monitoring solutions exist, self-hosting Nagios Core offers several advantages, especially for learning and customization:
- Full Control: You have complete authority over the configuration, data, and security of your monitoring system.
- Customization: Tailor Nagios precisely to your needs, integrate custom plugins, and modify its behavior extensively.
- No Vendor Lock-in: Avoid dependency on a specific vendor's roadmap or pricing changes.
- Cost-Effective: Nagios Core is free and open-source. You only incur costs for the hardware/VM it runs on.
- Deep Learning Experience: Setting up and managing Nagios from scratch provides invaluable system administration skills.
- Data Privacy: Monitoring data, which can be sensitive, remains within your infrastructure.
- Flexibility: Integrate with other internal systems or tools as needed.
Prerequisites for this Guide
To make the most of this guide, you should have:
- Basic Linux Command-Line Skills: Familiarity with navigating directories, editing files, managing packages, and understanding permissions. Most workshops will assume a Debian/Ubuntu-based Linux distribution.
- Basic Networking Concepts: Understanding of IP addresses, TCP/IP, ports, DNS, and firewalls.
- A Virtualization Environment (Recommended): Software like VirtualBox, VMware Workstation/Player, or a cloud provider account (for creating VMs) will be highly beneficial for setting up a Nagios server and test client machines for workshops.
- Patience and Eagerness to Learn: Nagios is powerful but can have a steep learning curve initially. Persistence is key!
By the end of this guide, you will be well-equipped to deploy, configure, and manage a robust Nagios monitoring environment for your self-hosted services or small to medium-sized infrastructures. Let's begin this exciting journey into the world of Nagios!
1. Basic Nagios Setup and Configuration
This section covers the foundational steps to get a Nagios Core server up and running. We will start by installing Nagios Core and its essential plugins, then explore the structure of its configuration files, monitor our first host (the Nagios server itself), and finally set up basic email notifications. These steps are crucial for understanding how Nagios operates and for building more complex monitoring solutions later.
Installing Nagios Core
Installing Nagios Core involves several steps, from preparing the system with necessary dependencies to compiling Nagios and its plugins from source. While some distributions offer Nagios packages, compiling from source gives you the latest version and a better understanding of the components. We will primarily focus on a generic Linux environment, with specific workshop instructions for Debian/Ubuntu.
System Requirements:
- Operating System: A Linux distribution (e.g., Debian, Ubuntu, CentOS, RHEL).
- Web Server: Apache HTTP Server (or Nginx, but Apache is more traditionally used and simpler for initial setup with Nagios CGIs).
- PHP: Required for some web interface features, though the core CGIs are written in C.
- Compiler and Build Tools: GCC, make, and development libraries (like `build-essential` on Debian/Ubuntu).
- GD Graphics Library: For generating status maps and other graphical elements (optional but recommended).
- Sufficient Resources: At least 1 CPU core, 1GB RAM, and a few GBs of disk space for a small setup. Requirements grow with the number of hosts and services monitored.
Steps Overview:
- Install Prerequisites: Ensure your system has Apache, PHP, a C compiler, and essential libraries.
- Create Nagios User and Group: For security, Nagios processes should run under a dedicated unprivileged user.
- Download Nagios Core and Nagios Plugins: Get the latest stable tarballs from the official Nagios websites.
- Compile and Install Nagios Core: Configure, compile, and install the main Nagios application.
- Compile and Install Nagios Plugins: These are the scripts Nagios uses to perform checks.
- Configure Web Interface: Set up Apache to serve the Nagios web UI and secure it.
- Verify Configuration and Start Services: Check for errors and start Nagios and Apache.
Detailed Explanation of Steps:
1. Installing Prerequisites:
The specific packages depend on your Linux distribution.
- For Debian/Ubuntu based systems:

      sudo apt update
      sudo apt install -y autoconf gcc libc6 make wget unzip apache2 php libapache2-mod-php libgd-dev

  This command installs:
    - `autoconf`, `gcc`, `libc6`, `make`: Standard build tools.
    - `wget`, `unzip`: Utilities for downloading and extracting files.
    - `apache2`: The Apache web server.
    - `php`, `libapache2-mod-php`: PHP and the Apache module for PHP.
    - `libgd-dev`: Development files for the GD graphics library.
- For RHEL/CentOS based systems (example, package names might vary slightly):
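As a rough RHEL/CentOS counterpart to the Debian/Ubuntu command, the sketch below prints an install command instead of running it, so it can be reviewed and adapted first. The helper name `rhel_prereq_command` is hypothetical, and the package list is an assumption that mirrors the Debian list (build tools, Apache as `httpd`, PHP, GD); check the names against your release before using them.

```shell
# Hypothetical helper: prints (does not run) a prerequisite install command.
# Package names are assumptions and may differ between RHEL/CentOS releases.
rhel_prereq_command() {
    if command -v dnf >/dev/null 2>&1; then
        pm=dnf      # newer releases
    else
        pm=yum      # older releases
    fi
    echo "sudo $pm install -y gcc glibc glibc-common make wget unzip httpd php gd gd-devel"
}

rhel_prereq_command
```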
2. Create Nagios User and Group:
Nagios needs a user and group to run under. Additionally, a group for allowing external commands via the web interface is often created.
sudo useradd nagios
sudo groupadd nagcmd
sudo usermod -a -G nagcmd nagios
sudo usermod -a -G nagcmd www-data # Or apache, depending on your web server user
- `useradd nagios`: Creates a user named `nagios`.
- `groupadd nagcmd`: Creates a group named `nagcmd`.
- `usermod -a -G nagcmd nagios`: Adds the `nagios` user to the `nagcmd` group.
- `usermod -a -G nagcmd www-data`: Adds the web server user (e.g., `www-data` on Debian/Ubuntu, `apache` on CentOS) to the `nagcmd` group. This allows the web server to submit commands to Nagios.
3. Download Nagios Core and Nagios Plugins: Always check the official Nagios website (nagios.org) for the latest stable versions.
# Example versions, replace with latest
NAGIOS_CORE_VERSION="4.4.14" # Check for the latest stable version
NAGIOS_PLUGINS_VERSION="2.4.8" # Check for the latest stable version
cd /tmp
wget https://github.com/NagiosEnterprises/nagioscore/releases/download/nagios-${NAGIOS_CORE_VERSION}/nagios-${NAGIOS_CORE_VERSION}.tar.gz
wget https://nagios-plugins.org/download/nagios-plugins-${NAGIOS_PLUGINS_VERSION}.tar.gz
tar -zxvf nagios-${NAGIOS_CORE_VERSION}.tar.gz
tar -zxvf nagios-plugins-${NAGIOS_PLUGINS_VERSION}.tar.gz
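4. Compile and Install Nagios Core: Navigate into the extracted Nagios Core directory. The release tarball downloaded above unpacks to nagios-${NAGIOS_CORE_VERSION}/ (GitHub's automatically generated source archive would unpack to nagioscore-nagios-${NAGIOS_CORE_VERSION}/ instead).
cd /tmp/nagios-${NAGIOS_CORE_VERSION}/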
sudo ./configure --with-nagios-group=nagios --with-command-group=nagcmd --with-httpd-conf=/etc/apache2/sites-enabled/
- `./configure`: This script checks your system for dependencies and prepares the build environment.
    - `--with-nagios-group=nagios`: Specifies the Nagios group.
    - `--with-command-group=nagcmd`: Specifies the group for external commands.
    - `--with-httpd-conf=/etc/apache2/sites-enabled/`: (For Debian/Ubuntu Apache) Specifies where to install the Apache configuration snippet for Nagios. For RHEL/CentOS, this might be `/etc/httpd/conf.d/`. Adapt as needed.

If `./configure` completes without errors, proceed with compilation and installation:
sudo make all
sudo make install
sudo make install-init # Installs init script (e.g., /etc/init.d/nagios)
sudo make install-daemoninit # Installs systemd unit file if systemd is detected
sudo make install-config # Installs SAMPLE configuration files
sudo make install-commandmode # Installs and configures permissions for the external command file
sudo make install-webconf # Installs Apache config file for Nagios web UI
- `make all`: Compiles the Nagios binaries and CGIs.
- `make install`: Installs the compiled files, typically into `/usr/local/nagios/`.
- `make install-init` / `make install-daemoninit`: Installs the service script to manage the Nagios daemon (start, stop, restart). The latter is for systems using `systemd`.
- `make install-config`: Installs sample configuration files in `/usr/local/nagios/etc/`. Important: These are samples; you'll customize them. If you're upgrading, skip this step or back up your existing configs first.
- `make install-commandmode`: Sets up the directory and permissions for Nagios to process external commands.
- `make install-webconf`: Installs an Apache configuration file (e.g., `nagios.conf`) into the directory specified by `--with-httpd-conf` or a default location.
5. Compile and Install Nagios Plugins: Nagios Core needs plugins to actually perform checks.
cd /tmp/nagios-plugins-${NAGIOS_PLUGINS_VERSION}/
sudo ./configure --with-nagios-user=nagios --with-nagios-group=nagios --with-openssl
sudo make
sudo make install
- `./configure`: Prepares plugins for compilation.
    - `--with-nagios-user=nagios` and `--with-nagios-group=nagios`: Sets user/group ownership for some plugins.
    - `--with-openssl`: Enables SSL/TLS support for plugins that require it (e.g., `check_http` for HTTPS). This needs the OpenSSL development headers (the `libssl-dev` package on Debian/Ubuntu).
- `make`: Compiles the plugins.
- `make install`: Installs plugins, typically into `/usr/local/nagios/libexec/`.
6. Configure Web Interface:
- Enable Apache Modules: For Apache, CGI and rewrite modules are often needed. On Debian/Ubuntu:

      sudo a2enmod rewrite cgi

  For RHEL/CentOS, ensure `mod_cgi` is loaded. `mod_rewrite` is also good practice.
- Create Web Admin User: Nagios web interface access is typically protected by Basic Authentication.

      sudo htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin

  This command creates a new password file (`htpasswd.users`) and adds a user `nagiosadmin`. You'll be prompted to enter a password for this user. For subsequent users, omit the `-c` flag, otherwise the existing file is overwritten. The path `/usr/local/nagios/etc/htpasswd.users` is a common location, but it's defined in the Apache configuration for Nagios (e.g., in `/etc/apache2/sites-enabled/nagios.conf`). Ensure consistency.
- Review Apache Configuration for Nagios: The `make install-webconf` step should have created a file like `/etc/apache2/sites-enabled/nagios.conf` (Debian/Ubuntu) or `/etc/httpd/conf.d/nagios.conf` (RHEL/CentOS). Open this file and review it. Key things to check:
    - `ScriptAlias /nagios/cgi-bin/ /usr/local/nagios/sbin/`
    - `Alias /nagios /usr/local/nagios/share/`
    - `<Directory>` directives for `/usr/local/nagios/sbin/` and `/usr/local/nagios/share/` setting access controls.
    - `AuthUserFile` should point to `/usr/local/nagios/etc/htpasswd.users`.
    - `AuthName "Nagios Access"`
    - `AuthType Basic`
    - `Require valid-user`
7. Verify Configuration and Start Services:
- Verify Nagios Configuration: Before starting Nagios, it's crucial to verify its configuration.

      sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

  This command parses all your Nagios configuration files and reports any errors. Fix every error before proceeding; the goal is "Total Warnings: 0" and "Total Errors: 0".
- Start Nagios Service: If using systemd (common on modern Linux):

      sudo systemctl enable nagios
      sudo systemctl start nagios
      sudo systemctl restart apache2    # httpd on RHEL/CentOS

  If using init scripts:

      sudo /etc/init.d/nagios start
8. Accessing the Nagios Web Interface:
Open your web browser and navigate to `http://YOUR_SERVER_IP/nagios/`. You should be prompted for the username (`nagiosadmin`) and password you created earlier.
If successful, you'll see the Nagios Core dashboard. Initially, it might show a few items related to `localhost` if the sample configurations were used.
This completes the basic installation of Nagios Core and its plugins. The next step is to understand the configuration files that drive its behavior.
Workshop Installing Nagios Core on a Debian/Ubuntu System
Objective:
Perform a clean installation of Nagios Core and Nagios Plugins from source on a fresh Debian or Ubuntu virtual machine.
Prerequisites:
- A virtual machine (e.g., VirtualBox, VMware) running a minimal server installation of Debian (e.g., Debian 11/12) or Ubuntu Server (e.g., Ubuntu 20.04/22.04 LTS).
- SSH access to the VM or direct console access.
- Internet connectivity from within the VM.
- Root or sudo privileges on the VM.
Steps:
- Update System and Install Prerequisites: Log into your VM.

      sudo apt update
      sudo apt upgrade -y
      sudo apt install -y build-essential autoconf gcc libc6 make wget unzip apache2 php libapache2-mod-php libgd-dev

  `build-essential` is a meta-package that installs `gcc`, `make`, and other crucial build tools on Debian/Ubuntu.
- Create Nagios User and Group:

      sudo useradd nagios
      sudo groupadd nagcmd
      sudo usermod -a -G nagcmd nagios
      sudo usermod -a -G nagcmd www-data

- Download Nagios Core and Plugins: Go to the official Nagios Core releases page on GitHub (NagiosEnterprises/nagioscore) and the Nagios Plugins download page (nagios-plugins.org) to find the latest stable version numbers. Let's assume `4.4.14` for Core and `2.4.8` for Plugins for this workshop.

      cd /tmp
      wget https://github.com/NagiosEnterprises/nagioscore/releases/download/nagios-4.4.14/nagios-4.4.14.tar.gz
      wget https://nagios-plugins.org/download/nagios-plugins-2.4.8.tar.gz
      tar -zxvf nagios-4.4.14.tar.gz
      tar -zxvf nagios-plugins-2.4.8.tar.gz

- Compile and Install Nagios Core:

      # The release tarball unpacks to nagios-4.4.14/
      # (GitHub's generated source archive would unpack to nagioscore-nagios-4.4.14/).
      cd /tmp/nagios-4.4.14/
      # This configure command is tailored for Debian/Ubuntu's Apache setup.
      sudo ./configure --with-nagios-group=nagios --with-command-group=nagcmd --with-httpd-conf=/etc/apache2/sites-enabled/
      # If configure completes without error:
      sudo make all
      sudo make install
      # Check if your system uses systemd (most modern systems do).
      # If `systemctl` is available, your system likely uses systemd.
      if [ -d /run/systemd/system ]; then
          sudo make install-daemoninit   # For systemd
      else
          sudo make install-init         # For older init systems
      fi
      sudo make install-config
      sudo make install-commandmode
      # Installs the Apache config into the --with-httpd-conf directory:
      sudo make install-webconf

  Note: `./configure` only records the directory given with `--with-httpd-conf`. The Apache configuration file itself is installed by `make install-webconf`, so do not skip that step.
- Compile and Install Nagios Plugins:

      cd /tmp/nagios-plugins-2.4.8/
      sudo ./configure --with-nagios-user=nagios --with-nagios-group=nagios --with-openssl
      sudo make
      sudo make install

- Configure Web Interface: Enable necessary Apache modules:

      sudo a2enmod rewrite cgi

  Create the `nagiosadmin` web user (you will be prompted for a password):

      sudo htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin

  Verify the Apache configuration for Nagios. The file should be at `/etc/apache2/sites-enabled/nagios.conf`. Ensure it has lines like `AuthUserFile /usr/local/nagios/etc/htpasswd.users` and `Require valid-user`.
- Verify Nagios Configuration and Start Services: Check the sample Nagios configuration for errors:

      sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

  You should see "Total Warnings: 0" and "Total Errors: 0". If not, troubleshoot based on the error messages (common initial issues involve file permissions or paths).

  Enable and start the Nagios service (assuming systemd):

      sudo systemctl enable nagios
      sudo systemctl start nagios

  Ensure Apache is also enabled and running:

      sudo systemctl enable apache2
      sudo systemctl restart apache2

- Access the Nagios Web Interface: Find your VM's IP address (e.g., using `ip addr show` or `hostname -I`). Open a web browser on your host machine and navigate to `http://<VM_IP_ADDRESS>/nagios/`. Log in with username `nagiosadmin` and the password you set.

  You should now see the Nagios Core interface. It will likely be monitoring `localhost` with a few default checks defined in the sample configuration files.
Troubleshooting Tips for the Workshop:
- Permission Denied (Web Interface): If you see "Forbidden" errors, check Apache error logs (`/var/log/apache2/error.log`). This often relates to:
    - File permissions on `/usr/local/nagios/share` or `/usr/local/nagios/sbin`.
    - Incorrect Apache configuration (`nagios.conf`). Ensure `Require all granted` or appropriate `Require` directives are set for your Apache version (Apache 2.4 uses different syntax than 2.2). The default `nagios.conf` from `make install-webconf` usually handles this.
- "File not found" for CGIs: Ensure `mod_cgi` is enabled and `ScriptAlias` is correct.
- Nagios service fails to start: Check `sudo systemctl status nagios` and `sudo journalctl -xeu nagios` for detailed error messages. Often related to configuration errors identified by the `-v` check.
- Plugin errors (e.g., "Return code of 127 is out of bounds"): This often means the plugin was not found or is not executable. Check paths in `commands.cfg` and permissions in `/usr/local/nagios/libexec/`.
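The 127 in that last error is not produced by any plugin. It is the exit status a shell reports when the command it was asked to run does not exist (126 is the related status for a file that exists but is not executable). A small sketch makes this visible; the helper name `plugin_exit_status` and the plugin path are hypothetical.

```shell
# Runs a command, discards its output, and prints the exit status it produced.
plugin_exit_status() {
    rc=0
    "$@" >/dev/null 2>&1 || rc=$?
    echo "$rc"
}

# A plugin path that does not exist yields 127, which lies outside the
# 0-3 range Nagios accepts, hence "out of bounds".
plugin_exit_status /usr/local/nagios/libexec/check_does_not_exist
```

Running the exact `command_line` from `commands.cfg` by hand as the `nagios` user is usually the quickest way to reproduce such an error.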
This workshop provides a solid foundation. You now have a working Nagios server!
Understanding Nagios Configuration Files
Nagios's power and flexibility stem from its text-based configuration files. Understanding their structure and purpose is paramount to effectively managing a Nagios installation. All primary configuration files are typically located in `/usr/local/nagios/etc/` (or a similar path if Nagios was installed differently).
Main Configuration File (`nagios.cfg`):
This is the heart of Nagios's configuration. It's usually located at `/usr/local/nagios/etc/nagios.cfg`. This file tells Nagios:
- Paths to other configuration files: Using `cfg_file=` directives for object definitions and `cfg_dir=` for directories containing object definitions.
- Location of object cache file: `object_cache_file=/usr/local/nagios/var/objects.cache`
- Location of status data file: `status_file=/usr/local/nagios/var/status.dat`
- Log file location: `log_file=/usr/local/nagios/var/nagios.log`
- Global settings: Such as check execution options, logging options, performance tuning parameters (e.g., `interval_length`, `max_concurrent_checks`).
- User and group Nagios should run as: `nagios_user=nagios`, `nagios_group=nagios`.
- Event broker modules: For integrating with addons like PNP4Nagios or Mod_Gearman.
Example snippets from `nagios.cfg`:
# LOG FILE
log_file=/usr/local/nagios/var/nagios.log
# OBJECT CONFIGURATION FILE(S)
# You can specify individual object config files as shown below:
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/objects/templates.cfg
# You can also tell Nagios to process all config files ending with '.cfg'
# in a particular directory by using the cfg_dir directive as shown below:
cfg_dir=/usr/local/nagios/etc/servers
cfg_dir=/usr/local/nagios/etc/printers
cfg_dir=/usr/local/nagios/etc/switches
# NAGIOS USER AND GROUP
nagios_user=nagios
nagios_group=nagios
# CHECK RESULT PATH
# Directory where Nagios temporarily stores host and service check results before processing them.
check_result_path=/usr/local/nagios/var/spool/checkresults
Use `cfg_dir` directives to organize your object configuration files into logical subdirectories (e.g., `/usr/local/nagios/etc/objects/hosts/`, `/usr/local/nagios/etc/objects/services/`, or by device type like `/usr/local/nagios/etc/servers/`).
Resource Files (`resource.cfg`):
Usually located at `/usr/local/nagios/etc/resource.cfg` (or `private/resource.cfg`). This file stores user-defined macros, which work like variables that can be used throughout your Nagios configuration.
The most common use is to store sensitive information like passwords or SNMP community strings, or commonly used paths.
Example:
# Sets $USER1$ to be the path to the plugins directory
$USER1$=/usr/local/nagios/libexec
# Sets $USER2$ to a specific email address (resource files only accept $USERn$ names)
# $USER2$=youradmin@example.com
Nagios also provides built-in macros (e.g., `$HOSTADDRESS$`, `$SERVICESTATE$`). User-defined macros are named `$USERn$` (e.g., `$USER1$`, `$USER2$`) and are referenced with the dollar signs.
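To make the substitution concrete, the sketch below imitates macro expansion with `sed`. It is only an illustration of the idea, not how Nagios implements it: `expand_macros` is a hypothetical helper that handles just `$USER1$` and `$HOSTADDRESS$`, and the address `192.0.2.10` is a documentation-range placeholder.

```shell
# Replaces $USER1$ and $HOSTADDRESS$ in a command-line template.
# Values containing "|" or "&" would need escaping; this sketch does not handle that.
expand_macros() {
    template=$1; user1=$2; hostaddress=$3
    printf '%s\n' "$template" |
        sed -e 's|\$USER1\$|'"$user1"'|g' \
            -e 's|\$HOSTADDRESS\$|'"$hostaddress"'|g'
}

expand_macros '$USER1$/check_ping -H $HOSTADDRESS$' /usr/local/nagios/libexec 192.0.2.10
# prints: /usr/local/nagios/libexec/check_ping -H 192.0.2.10
```

Keeping the plugin directory in `$USER1$` means that moving the plugins later requires a change in one file only.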
Object Configuration Files:
These files define the actual elements Nagios monitors and interacts with. They are typically stored in `/usr/local/nagios/etc/objects/` or subdirectories specified by `cfg_dir` in `nagios.cfg`.
The common object types are:
- Hosts (`hosts.cfg` or similar): Define the physical/virtual machines and network devices.
- Services (`services.cfg` or similar): Define the specific checks for hosts.
- Contacts (`contacts.cfg`): Define individuals who receive notifications.

      define contact {
          contact_name                    nagiosadmin
          alias                           Nagios Administrator
          service_notification_period     24x7
          host_notification_period        24x7
          service_notification_options    w,u,c,r   ; Send notifications on warning, unknown, critical, recovery
          host_notification_options       d,u,r     ; Send notifications on down, unreachable, recovery
          service_notification_commands   notify-service-by-email
          host_notification_commands      notify-host-by-email
          email                           nagios@localhost   ; This should be a real email address
      }

- Contact Groups (`contactgroups.cfg` or similar): Group contacts together.
- Commands (`commands.cfg`): Define how Nagios executes plugins.

      define command {
          command_name    check_ping
          command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
      }

  Here, `$USER1$` is from `resource.cfg`, `$HOSTADDRESS$` is a Nagios macro for the host's IP, and `$ARG1$`, `$ARG2$` are arguments passed from the service definition.
- Timeperiods (`timeperiods.cfg`): Define specific time ranges for checks and notifications.

      define timeperiod {
          timeperiod_name 24x7
          alias           24 Hours A Day, 7 Days A Week
          sunday          00:00-24:00
          monday          00:00-24:00
          # ... and so on for all days
          saturday        00:00-24:00
      }

      define timeperiod {
          timeperiod_name workhours
          alias           Normal Work Hours
          monday          09:00-17:00
          tuesday         09:00-17:00
          wednesday       09:00-17:00
          thursday        09:00-17:00
          friday          09:00-17:00
      }

- Templates (often in `templates.cfg` or spread across object files): Allow you to define common properties for hosts and services, promoting inheritance and reducing redundancy. You define a template with generic settings, and then specific host/service definitions `use` that template, inheriting its properties and overriding them if needed.

      define host {
          name                            linux-server      ; The name of this host template
          notifications_enabled           1                 ; Host notifications are enabled
          event_handler_enabled           1                 ; Host event handler is enabled
          flap_detection_enabled          1                 ; Flap detection is enabled
          process_perf_data               1                 ; Process performance data
          retain_status_information       1                 ; Retain status information across program restarts
          retain_nonstatus_information    1                 ; Retain non-status information across program restarts
          check_command                   check-host-alive  ; Default command to check if a host is "alive"
          max_check_attempts              5
          notification_interval           60
          notification_period             24x7
          notification_options            d,u,r
          contact_groups                  admins
          register                        0                 ; DONT REGISTER THIS DEFINITION - ITS A TEMPLATE
      }

  The `register 0` line is crucial for templates; it tells Nagios this is not an actual object to monitor but a template to be used by other objects.
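The list above shows a template but no host that uses it. As a hedged illustration, the following sketch writes a host definition that inherits from `linux-server` and overrides one value. The host name `web01`, its alias, the address `192.0.2.10`, and the `OBJ_DIR` variable are all hypothetical; the file is written to a temporary directory so that running the sketch touches nothing under `/usr/local/nagios/etc/`.

```shell
# Target directory for the generated object file (temporary by default).
OBJ_DIR="${OBJ_DIR:-$(mktemp -d)}"

# The quoted 'EOF' keeps the shell from expanding anything inside the definition.
cat > "$OBJ_DIR/web01.cfg" <<'EOF'
define host {
    use                 linux-server        ; Inherit all settings from the template
    host_name           web01               ; Hypothetical host name
    alias               Example web server
    address             192.0.2.10          ; Placeholder address, replace with a real one
    max_check_attempts  3                   ; Overrides the template's value of 5
}
EOF

cat "$OBJ_DIR/web01.cfg"
```

Nagios would only read such a file if it sits in a directory named by a `cfg_dir` directive or is listed with `cfg_file` in `nagios.cfg`.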
CGI Configuration File (`cgi.cfg`):
Located at `/usr/local/nagios/etc/cgi.cfg`, this file controls aspects of the Nagios web interface (the CGIs).
Key settings:
- Main configuration file location: `main_config_file=/usr/local/nagios/etc/nagios.cfg`
- Physical HTML path: `physical_html_path=/usr/local/nagios/share`
- URL HTML path: `url_html_path=/nagios`
- Authentication and Authorization: Defines which users can view certain information or perform certain actions (e.g., submit commands).

      # AUTHENTICATION USAGE
      # This option controls whether or not the CGIs will use the
      # authentication and authorization functionality.
      # 0 = Don't use authentication functionality
      # 1 = Use authentication functionality
      use_authentication=1

      # DEFAULT USERNAME
      # This is the default username that the CGIs will use if
      # an authenticated user cannot be found.
      #default_user_name=guest

      # SYSTEM/PROCESS INFORMATION ACCESS
      # These are comma-delimited lists of authorized users who can
      # view system/process information in the CGIs.
      authorized_for_system_information=nagiosadmin
      authorized_for_configuration_information=nagiosadmin

      # COMMAND ACCESS
      # These are comma-delimited lists of authorized users who can
      # issue commands via the command CGI.
      authorized_for_all_host_commands=nagiosadmin
      authorized_for_all_service_commands=nagiosadmin

  It's essential to restrict `authorized_for_*` directives to trusted users.
Directory Structure Summary:
- `/usr/local/nagios/bin/`: Nagios executable (`nagios`).
- `/usr/local/nagios/sbin/`: CGI executables (e.g., `status.cgi`, `extinfo.cgi`).
- `/usr/local/nagios/libexec/`: Nagios plugins (e.g., `check_ping`, `check_http`).
- `/usr/local/nagios/etc/`: Main configuration files (`nagios.cfg`, `cgi.cfg`, `resource.cfg`) and object definitions (often in an `objects/` subdirectory).
- `/usr/local/nagios/share/`: HTML, CSS, JavaScript, and images for the web interface.
- `/usr/local/nagios/var/`: Variable data, such as logs (`nagios.log`), status data (`status.dat`), object cache (`objects.cache`), retention data (`retention.dat`), and spool directories (e.g., for check results).
Best Practices for Organizing Configuration:
- Use `cfg_dir` extensively: Create directories like `objects/hosts`, `objects/services`, `objects/templates`, `objects/contactgroups`, etc. Or, organize by device type/location: `etc/servers/`, `etc/network/`, `etc/applications/`.
- One object per file (for larger setups): For instance, each host definition in its own file within `etc/hosts/hostname.cfg`. This makes management with configuration management tools (Ansible, Puppet, Chef) easier.
- Leverage templates: Heavily use host and service templates to minimize redundancy and simplify bulk changes.
- Consistent naming conventions: Use clear and consistent names for hosts, services, templates, groups, etc.
- Version control: Store your Nagios configuration directory (`/usr/local/nagios/etc/`) in a version control system like Git. This allows you to track changes, revert to previous versions, and collaborate.
- Regularly validate configuration: Always run `/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg` before restarting/reloading Nagios after making changes.
Understanding these files and their relationships is the key to mastering Nagios. The sample configuration files provided by make install-config are an excellent starting point to explore.
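The cfg_dir and version-control suggestions can be sketched as a small helper that creates one directory per object type and prints the matching cfg_dir lines, which would then be added to nagios.cfg by hand. cfg_dir is a real nagios.cfg directive; the directory names are just the examples from the list above, and make_object_layout is a hypothetical function name.

```shell
# Create a per-object-type layout under a Nagios etc directory and print
# the cfg_dir lines that correspond to it.
make_object_layout() {
    etc_dir="$1"
    for sub in hosts services templates contactgroups; do
        mkdir -p "$etc_dir/objects/$sub" || return 1
        echo "cfg_dir=$etc_dir/objects/$sub"
    done
}

# Example (as a user allowed to write to the directory):
#   make_object_layout /usr/local/nagios/etc
#   cd /usr/local/nagios/etc && git init && git add . \
#       && git commit -m "Baseline Nagios configuration"
```

Printing the lines instead of editing nagios.cfg automatically keeps the change reviewable before it is validated with nagios -v.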
Workshop Exploring and Modifying Basic Configuration Files
Objective:
To familiarize yourself with the key Nagios configuration files, make a simple modification, and verify the changes.
Prerequisites:
- A working Nagios Core installation (from the previous workshop).
- SSH or console access to the Nagios server.
- A text editor (e.g., nano, vim).
Steps:
1. Locate Core Configuration Files: Navigate to the Nagios configuration directory (/usr/local/nagios/etc/) and list its contents. You should see nagios.cfg, cgi.cfg, resource.cfg, and a directory named objects.
2. Examine nagios.cfg: Open nagios.cfg with your text editor:
    - Look for log_file to see where Nagios logs its activities.
    - Find the cfg_file and cfg_dir directives. Note how object configuration files are included. The sample configuration usually includes several cfg_file entries pointing to files within the objects/ directory (e.g., objects/commands.cfg, objects/contacts.cfg, objects/localhost.cfg).
    - Observe settings like nagios_user and nagios_group.
    - Do not make any changes yet. Exit the editor.
3. Examine resource.cfg: Open resource.cfg:
    - You'll likely see a line like $USER1$=/usr/local/nagios/libexec. This macro is widely used in command definitions to specify the path to plugins.
    - You might also see commented-out examples for other $USERn$ macros.
    - Do not make any changes yet. Exit the editor.
4. Explore the objects/ Directory: List the contents of the objects/ directory. You should see files like commands.cfg, contacts.cfg, timeperiods.cfg, templates.cfg, and localhost.cfg. These files define the various Nagios objects.
5. Modify Contact Information in contacts.cfg: The default contact is often nagiosadmin with a placeholder email. Let's change this. Open contacts.cfg and find the define contact block for nagiosadmin. It will look something like this:

        define contact{
            contact_name    nagiosadmin         ; Short name of user
            use             generic-contact     ; Inherit default values from generic-contact template (defined above)
            alias           Nagios Admin        ; Full name of user
            email           nagios@localhost    ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
        }

    - Change the email directive from nagios@localhost to your actual email address (e.g., yourname@example.com). This is important for receiving notifications later.
    - You can also update the alias if you wish.
    - Save the file and exit the editor (Ctrl+X, then Y, then Enter in nano).
6. Examine localhost.cfg (Example Host/Service Definitions): Open localhost.cfg. This file typically contains definitions for monitoring the Nagios server itself (localhost).
    - Look for the define host block. Note its host_name (usually localhost), alias, and address (usually 127.0.0.1).
    - Observe the several define service blocks. These define checks like PING, SSH, HTTP, disk space, current users, etc., for localhost. Notice how each service is associated with host_name localhost.
    - Pay attention to the check_command directive in service definitions. This links to a command defined in commands.cfg.
    - Do not make any changes yet. Exit the editor.
7. Verify Configuration Changes: Any time you modify Nagios configuration files, you must verify them before reloading or restarting Nagios. Run /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg. If you only changed the email address, the check should finish with "Total Errors: 0" and the message "Things look okay". If there are errors, the output will indicate the file and line number causing the problem. You'll need to correct the error and re-verify.
8. Reload Nagios to Apply Changes: If verification is successful, reload Nagios. Reloading is preferred over restarting for minor configuration changes as it typically doesn't interrupt ongoing checks.
9. Check Nagios Log (Optional): You can tail the Nagios log file (/usr/local/nagios/var/nagios.log) to see if the reload was successful and observe its general activity. Look for lines indicating Nagios is re-reading configuration data. Press Ctrl+C to stop tailing.
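The verify, reload, and log-check steps belong together, so it can help to wrap them in one helper that refuses to reload when validation fails. This is a sketch under assumptions: it presumes Nagios runs as a systemd unit named nagios (adjust RELOAD_CMD if your installation uses a different service manager), and verify_and_reload is a hypothetical function name. The binary, configuration, and log paths are the ones used in this guide.

```shell
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"
NAGIOS_LOG="${NAGIOS_LOG:-/usr/local/nagios/var/nagios.log}"
# Assumption: Nagios is managed by systemd under the unit name "nagios".
RELOAD_CMD="${RELOAD_CMD:-systemctl reload nagios}"

# Run the pre-flight check; reload only when it succeeds.
verify_and_reload() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG"; then
        $RELOAD_CMD && echo "Nagios reloaded."
    else
        echo "Configuration errors found; Nagios was NOT reloaded." >&2
        return 1
    fi
}

# Example (as root):
#   verify_and_reload && tail -n 20 "$NAGIOS_LOG"
```

Gating the reload on the exit status of nagios -v means a typo in an object file cannot take the running monitoring process down with it.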
Outcome:
You have now successfully:
- Navigated and inspected the main Nagios configuration files.
- Understood the purpose of
nagios.cfg
,resource.cfg
, and object definition files. - Modified a contact's email address.
- Verified the configuration using the
-v
flag. - Reloaded Nagios to apply the change.
This workshop builds confidence in working with Nagios configuration. The email address change will be used when we set up notifications.
Monitoring Your First Host (localhost)
By default, the sample Nagios configuration (make install-config) often includes settings to monitor the Nagios server itself, referred to as localhost. This is an excellent starting point to understand how host and service definitions work and to see Nagios in action. If your installation didn't include this, or if you want to understand how it's done from scratch, this section will guide you.
Core Concepts Involved:
- Host Definition: Defines the machine Nagios will monitor. For localhost, the address is 127.0.0.1.
- Service Definitions: Define what specific aspects of the host will be monitored (e.g., PING, CPU load, disk space).
- Check Commands: Pre-defined commands in commands.cfg that Nagios uses to execute plugins with appropriate arguments.
- Plugins: The actual scripts in /usr/local/nagios/libexec/ that perform the checks.
Steps to Monitor localhost (if not already configured):
1. Define the Host Object for localhost: Create or edit a configuration file (e.g., /usr/local/nagios/etc/objects/localhost.cfg).

        define host{
            use             linux-server    ; Inherit default values from a template named 'linux-server'
                                            ; This template should be defined in templates.cfg or similar
            host_name       localhost
            alias           My Nagios Server (localhost)
            address         127.0.0.1
            contact_groups  admins          ; Who to notify if this host has problems
        }

    - use linux-server: This assumes you have a host template named linux-server defined (typically in templates.cfg). Templates provide default values for many directives (e.g., check_period, notification_options, max_check_attempts). If you don't have one, you'd need to specify all required parameters directly or create a simple one.
    - host_name: A unique name for this host within Nagios. localhost is conventional.
    - alias: A descriptive name shown in the web interface.
    - address: The IP address Nagios will use to check this host. For localhost, it's 127.0.0.1.
    - contact_groups admins: Specifies that members of the admins contact group should be notified. This group should be defined in contacts.cfg or a dedicated contactgroups.cfg.
2. Define Basic Service Checks for localhost: In the same file (localhost.cfg) or a separate services file, add service definitions.
    - PING Check (Host Liveness): Although hosts have an implicit PING check via their check_command (often check-host-alive, which uses check_ping), you can also define it as an explicit service for more detailed metrics and alerting.

            define service{
                use                     local-service   ; Inherit default values from a template named 'local-service'
                                                        ; This template is often found in templates.cfg
                host_name               localhost
                service_description     PING
                check_command           check_ping!100.0,20%!500.0,60%  ; Warning at 100ms/20% loss, Critical at 500ms/60% loss
            }

        - use local-service: Assumes a service template named local-service exists.
        - host_name localhost: Associates this service with the localhost host.
        - service_description: A descriptive name for this service (e.g., "PING", "HTTP Server").
        - check_command check_ping!100.0,20%!500.0,60%:
            - check_ping: This refers to a command definition in commands.cfg.
            - !: Separator for command arguments.
            - 100.0,20%: Argument 1 ($ARG1$) for check_ping, the Warning threshold (100ms round-trip average, 20% packet loss).
            - 500.0,60%: Argument 2 ($ARG2$) for check_ping, the Critical threshold (500ms RTA, 60% packet loss).
    - SSH Server Check:

            define service{
                use                     local-service
                host_name               localhost
                service_description     SSH Server
                check_command           check_ssh
            }

        - check_command check_ssh: Assumes a command check_ssh is defined, which uses the check_ssh plugin to see if an SSH server is listening on port 22.
    - HTTP Server Check (for Nagios Web UI itself):

            define service{
                use                     local-service
                host_name               localhost
                service_description     HTTP Web Server
                check_command           check_http
            }

        - check_command check_http: Assumes a command check_http is defined, which uses the check_http plugin. By default, it checks port 80 on the host's address.
    - Disk Space Check (Root Partition):

            define service{
                use                     local-service
                host_name               localhost
                service_description     Root Partition Disk Space
                check_command           check_local_disk!20%!10%!/  ; Warn if <20% free, Critical if <10% free, for path '/'
            }

        - check_command check_local_disk!20%!10%!/:
            - check_local_disk: A command typically using the check_disk plugin.
            - !20%: Warning threshold ($ARG1$).
            - !10%: Critical threshold ($ARG2$).
            - !/: Path to check ($ARG3$).
    - CPU Load Check:

            define service{
                use                     local-service
                host_name               localhost
                service_description     CPU Load
                check_command           check_local_load!5.0,4.0,3.0!10.0,6.0,4.0  ; Warn at 5,4,3 (1,5,15 min avg), Crit at 10,6,4
            }

        - check_command check_local_load!5.0,4.0,3.0!10.0,6.0,4.0:
            - check_local_load: A command typically using the check_load plugin.
            - !5.0,4.0,3.0: Warning thresholds for 1-min, 5-min, 15-min load averages.
            - !10.0,6.0,4.0: Critical thresholds.
3. Ensure Check Commands are Defined: The check_command directives in service definitions refer to commands defined in commands.cfg (or a similar file). These command definitions tell Nagios how to execute the actual plugin scripts. Example command definitions (these are usually present in the default commands.cfg):

        # 'check_local_disk' command definition
        define command{
            command_name    check_local_disk
            command_line    $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
        }

        # 'check_local_load' command definition
        define command{
            command_name    check_local_load
            command_line    $USER1$/check_load -w $ARG1$ -c $ARG2$
        }

        # 'check_ping' command definition (often used by 'check-host-alive' too)
        define command{
            command_name    check_ping
            command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
        }

        # 'check_ssh' command definition
        # Note: $ARG1$ can be used for options like -p <port>
        define command{
            command_name    check_ssh
            command_line    $USER1$/check_ssh $ARG1$ $HOSTADDRESS$
        }

        # 'check_http' command definition
        # Note: $ARG1$ can carry extra options such as -p <port> or -u /uri/.
        # If no argument is passed from the service definition, the plugin defaults apply.
        define command{
            command_name    check_http
            command_line    $USER1$/check_http -I $HOSTADDRESS$ $ARG1$
        }

    - $USER1$: This macro (from resource.cfg) points to /usr/local/nagios/libexec/.
    - $HOSTADDRESS$: A built-in Nagios macro that gets replaced with the address from the host definition.
    - $ARGn$: Placeholders for arguments passed from the service definition (after the ! in check_command).
    - Keep optional arguments unprefixed in command_line, as in check_http above. A definition such as "-p $ARG1$" would hand the plugin a bare -p whenever the service passes no argument, and the check would fail with a usage error.
4. Ensure Necessary Templates are Defined: The use linux-server and use local-service directives require these templates to be defined, usually in /usr/local/nagios/etc/objects/templates.cfg. The sample configuration provides these. A minimal linux-server host template:

        define host{
            name                    linux-server    ; Name of this template
            use                     generic-host    ; Inherit other defaults
            check_period            24x7
            check_interval          5
            retry_interval          1
            max_check_attempts      10
            check_command           check-host-alive
            notification_period     24x7
            notification_interval   30
            notification_options    d,u,r
            contact_groups          admins
            register                0               ; This is a template
        }

    A minimal local-service service template (check_interval and retry_interval replace the older normal_check_interval and retry_check_interval names):

        define service{
            name                    local-service   ; Name of this template
            use                     generic-service ; Inherit other defaults
            max_check_attempts      4
            check_interval          5
            retry_interval          1
            notification_period     24x7
            notification_options    w,u,c,r         ; Notify on warning, unknown, critical, recovery
            contact_groups          admins
            register                0               ; This is a template
        }

    These templates themselves often use even more generic templates like generic-host and generic-service, which define the absolute base defaults.
5. Add localhost.cfg to nagios.cfg: If you created a new file (e.g., localhost.cfg), ensure it's included by nagios.cfg. Add a line like cfg_file=/usr/local/nagios/etc/objects/localhost.cfg, then save and exit.
6. Verify and Reload Nagios: Run /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg and, if it reports no errors, reload Nagios.
7. View in Web Interface: Go to your Nagios web interface (http://YOUR_SERVER_IP/nagios/).
    - Click on "Hosts" in the left navigation pane. You should see localhost.
    - Click on "Services". You should see the services you defined (PING, SSH, HTTP, Disk, Load) associated with localhost.
    - Initially, services might be in a "Pending" state. After a few minutes, they should update to OK (green), WARNING (yellow), or CRITICAL (red) based on the check results.
By following these steps, you actively instruct Nagios to monitor various aspects of its own host system. This provides immediate feedback and a practical understanding of the relationship between hosts, services, commands, and plugins.
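Because a check_command ultimately expands to a plugin invocation, any check can be reproduced by hand, which is often the quickest way to debug a service definition. The sketch below assumes the plugins live in /usr/local/nagios/libexec (the usual value of $USER1$) and relies on the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). run_check is a hypothetical helper name.

```shell
# Assumption: this is what $USER1$ expands to on your system.
PLUGIN_DIR="${PLUGIN_DIR:-/usr/local/nagios/libexec}"

# Run a plugin with the given arguments, show its output, and translate
# the exit code into the state Nagios would record.
run_check() {
    plugin="$1"
    shift
    status=0
    "$PLUGIN_DIR/$plugin" "$@" || status=$?
    case "$status" in
        0) echo "=> OK" ;;
        1) echo "=> WARNING" ;;
        2) echo "=> CRITICAL" ;;
        *) echo "=> UNKNOWN" ;;
    esac
    return "$status"
}

# Examples mirroring the service definitions above:
#   run_check check_ping -H 127.0.0.1 -w 100.0,20% -c 500.0,60% -p 5
#   run_check check_disk -w 20% -c 10% -p /
#   run_check check_load -w 5.0,4.0,3.0 -c 10.0,6.0,4.0
```

Running the helper as the nagios user (for example through sudo -u nagios) gives the closest match to what the daemon sees, since permissions are a common cause of differences.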
Workshop Monitoring Localhost Services
Objective:
Ensure localhost is being monitored with at least PING, Disk Space (root partition), and Current Users checks. If these are already present from the sample config, review their definitions. If not, add them.
Prerequisites:
- A working Nagios Core installation.
- The nagiosadmin contact email configured to your actual email address (from the previous workshop).
- Text editor and sudo privileges.
Steps:
1. Inspect Existing localhost.cfg: Navigate to /usr/local/nagios/etc/objects/ and open localhost.cfg (or the file defining your localhost checks). Look for service definitions for:
    - PING
    - Root Partition (Disk Space)
    - Current Users

    A typical sample configuration will have these. For example:

        # ... other definitions ...
        define service{
            use                     local-service   ; Name of service template to use
            host_name               localhost
            service_description     PING
            check_command           check_ping!100.0,20%!500.0,60%
        }

        define service{
            use                     local-service   ; Name of service template to use
            host_name               localhost
            service_description     Root Partition
            check_command           check_local_disk!20%!10%!/
        }

        define service{
            use                     local-service   ; Name of service template to use
            host_name               localhost
            service_description     Current Users
            check_command           check_users!20!50
        }
        # ... other definitions ...

2. Understand the check_command for "Current Users": The check_users!20!50 command for "Current Users" means:
    - check_users: This is the command name defined in commands.cfg.
    - !20: This is $ARG1$, the warning threshold. If more than 20 users are logged in, it's a WARNING.
    - !50: This is $ARG2$, the critical threshold. If more than 50 users are logged in, it's a CRITICAL.

    Let's verify the check_users command definition in commands.cfg. Open commands.cfg and search for check_users. You should find a command_line that calls $USER1$/check_users (i.e., /usr/local/nagios/libexec/check_users) with the warning (-w) and critical (-c) arguments passed from the service definition. Exit nano.
3. Add a New Service Check (if one is missing or for practice): Swap Usage. Let's add a check for Swap Usage. First, we need to see if a command like check_local_swap or check_swap exists in commands.cfg. The sample configuration often includes:

        # 'check_local_swap' command definition
        define command{
            command_name    check_local_swap
            command_line    $USER1$/check_swap -w $ARG1$ -c $ARG2$
        }

    If this command exists, we can use it. If not, add this definition to commands.cfg.

    The check_swap thresholds are easy to get backwards. With the standard plugin, a percentage given to -w or -c refers to swap that is still free: the check goes to WARNING or CRITICAL when less than that percentage of swap is free. The warning value must therefore be larger than the critical value. Add the following block at the end of localhost.cfg (or amongst the other service definitions for localhost):

        define service{
            use                     local-service   ; Name of service template to use
            host_name               localhost
            service_description     Swap Usage
            check_command           check_local_swap!20%!10%   ; Warn if < 20% of swap is free, Critical if < 10% is free
        }

    Save and exit.
4. Verify and Reload Nagios: Run /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg. If there are errors (e.g., check_local_swap command not defined), add the command definition to commands.cfg as shown in step 3, then re-verify. If successful, reload Nagios.
5. Check in Web Interface: Go to your Nagios web UI. Under "Services", you should now see the "Swap Usage" service for localhost. It will initially be "Pending" and then transition to a status (likely OK if your system isn't heavily using swap). You can click on the service name to see its status details, including the plugin output (e.g., "SWAP OK - 100% free (2047 MB out of 2047 MB)").
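To look up what a check_command name expands to without opening an editor, the matching define command block can be printed straight from commands.cfg. This is a rough sketch using awk: show_command is a hypothetical helper, and it assumes one directive per line with the closing brace on its own line, as in the sample files. Depending on your sample configuration, the users check may be named check_local_users rather than check_users.

```shell
# Usage: show_command COMMAND_NAME PATH_TO_COMMANDS_CFG
# Prints the "define command" block whose command_name matches exactly.
show_command() {
    awk -v name="$1" '
        $1 == "define" && ($2 == "command" || $2 == "command{") {
            block = ""; inside = 1; found = 0
        }
        inside { block = block $0 "\n" }
        inside && $1 == "command_name" && $2 == name { found = 1 }
        inside && $1 == "}" {
            if (found) printf "%s", block
            inside = 0; found = 0
        }
    ' "$2"
}

# Example:
#   show_command check_local_swap /usr/local/nagios/etc/objects/commands.cfg
```

An empty result means the command is not defined in that file, which is exactly the situation that makes nagios -v fail in the verification step.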
Outcome:
You have reviewed existing localhost service checks and successfully added a new service check for Swap Usage. This reinforces the process of:
- Identifying a monitoring need (Swap Usage).
- Ensuring a suitable check_command exists (or defining one).
- Defining the service object, linking it to a host and the command.
- Verifying and reloading Nagios.
- Confirming the new service in the web interface.
This structured approach is fundamental to expanding Nagios monitoring.
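Since check_swap percentages count free swap, it helps to see how a given amount of free swap maps onto the three states. The function below only illustrates the "less than PERCENT free" rule from the plugin's help text; it is not the plugin's actual code. swap_state is a hypothetical name and all arguments are whole-number percentages.

```shell
# Usage: swap_state FREE_PCT WARN_PCT CRIT_PCT
swap_state() {
    free="$1"
    warn="$2"
    crit="$3"
    if [ "$free" -lt "$crit" ]; then
        echo "CRITICAL"
    elif [ "$free" -lt "$warn" ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

# With the thresholds from check_local_swap!20%!10%:
swap_state 100 20 10   # swap unused     -> prints OK
swap_state 15 20 10    # 15% still free  -> prints WARNING
swap_state 5 20 10     # 5% still free   -> prints CRITICAL
```

Reversing the order of the thresholds, for example passing 50 and 80, would report CRITICAL for a host that still has 70% of its swap free, which is why the warning value has to be the larger one.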
Basic Alerting and Notifications
Monitoring systems are most effective when they can alert administrators to problems. Nagios has a robust notification system that can inform contacts when hosts or services change state (e.g., go from OK to CRITICAL, or UP to DOWN). This section covers the basics of setting up email notifications.
Components Involved:
- Contacts: Definitions of individuals who should receive alerts, including their email addresses and notification preferences.
- Contact Groups: Collections of contacts. It's best practice to assign contact groups to hosts/services rather than individual contacts for easier management.
- Notification Commands: Commands Nagios uses to send out notifications (e.g., notify-host-by-email, notify-service-by-email). These typically use a local mail transfer agent (MTA) like sendmail or postfix, or a simple tool like mailx.
- Timeperiods: Define when notifications can be sent.
- Host and Service Definitions: These must specify which contact groups should be notified for them.
- Nagios Notification Logic: Nagios decides when to send notifications based on state changes (typically hard states), notification options (e.g., w,u,c,r for services), and timeperiods.
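The role of hard states in the notification logic is easier to follow with a small simulation. The sketch below walks through a sequence of check results and prints, for each one, whether the state is SOFT or HARD and whether a notification would go out. It is a deliberate simplification: it ignores notification_interval re-notifications, timeperiods, flapping, and changes between different non-OK states. simulate_checks is a hypothetical name, and the first argument stands in for max_check_attempts.

```shell
# Usage: simulate_checks MAX_CHECK_ATTEMPTS RESULT [RESULT ...]
# Each RESULT is OK, WARNING, CRITICAL, or UNKNOWN.
simulate_checks() {
    max_attempts="$1"
    shift
    attempt=0        # consecutive non-OK results seen so far
    hard_problem=0   # 1 once the problem has reached a HARD state
    for result in "$@"; do
        if [ "$result" = "OK" ]; then
            if [ "$hard_problem" -eq 1 ]; then
                echo "OK HARD notify=RECOVERY"      # hard recovery
            elif [ "$attempt" -gt 0 ]; then
                echo "OK SOFT notify=none"          # soft recovery
            else
                echo "OK HARD notify=none"
            fi
            attempt=0
            hard_problem=0
        elif [ "$hard_problem" -eq 1 ]; then
            echo "$result HARD notify=none"         # ongoing problem
        else
            attempt=$((attempt + 1))
            if [ "$attempt" -ge "$max_attempts" ]; then
                hard_problem=1
                echo "$result HARD notify=PROBLEM"
            else
                echo "$result SOFT notify=none"     # retry before alerting
            fi
        fi
    done
}

simulate_checks 4 OK CRITICAL CRITICAL CRITICAL CRITICAL OK
```

With max_check_attempts set to 4, the first three CRITICAL results are reported as SOFT, the fourth is the first HARD line and carries notify=PROBLEM, and the final OK produces the RECOVERY notification.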
Steps for Basic Email Notifications:
1. Ensure a Mail Transfer Agent (MTA) is Installed and Configured: Nagios itself doesn't send emails directly. It relies on a system command (like /usr/bin/mail or /usr/sbin/sendmail) to do so. For this command to work, your Nagios server needs an MTA.
    - Common MTAs: Postfix, Sendmail, Exim.
    - Simple Solution for Local/Relay: ssmtp or msmtp can be configured to relay emails through an external SMTP server (like Gmail or an institutional mail server). For basic testing, mailutils (which provides mail) might be sufficient if your server has a local MTA configured to send outbound mail (e.g., postfix installed with the "Internet Site" configuration).

    For this basic setup, let's assume mailutils (which provides mailx or mail) is sufficient if Postfix or Sendmail is already minimally configured on your server. If not, a minimal Postfix installation is often straightforward. During Postfix installation on Debian/Ubuntu, you'll be asked for the mail configuration type:
    - "Internet Site": If your server has a public FQDN and can send mail directly.
    - "Local only": If you only want mail delivered locally or if you'll configure a relay later.
    - For more complex relaying (e.g., via Gmail), Postfix needs further configuration (e.g., relayhost and SASL authentication in /etc/postfix/main.cf). This is beyond Nagios's basic setup but crucial for reliable external email delivery.
2. Define Contact(s): This was partially done in a previous workshop. Ensure your contact definition in /usr/local/nagios/etc/objects/contacts.cfg is complete.

        define contact{
            contact_name                    nagiosadmin
            use                             generic-contact             ; Inherits default options
            alias                           Nagios Administrator
            email                           your_actual_email@example.com   ; **CRUCIAL: Use a real, working email**
            host_notifications_enabled      1
            service_notifications_enabled   1
            host_notification_period        24x7
            service_notification_period     24x7
            host_notification_options       d,u,r           ; Down, Unreachable, Recovery
            service_notification_options    w,u,c,r,f,s     ; Warning, Unknown, Critical, Recovery, Flapping (start/stop), Scheduled Downtime (start/stop)
            host_notification_commands      notify-host-by-email
            service_notification_commands   notify-service-by-email
        }

    - email: Must be correct.
    - host_notification_options and service_notification_options: Define for which states notifications are sent. d,u,r and w,u,c,r are common starting points.
    - host_notification_commands and service_notification_commands: Specify the commands Nagios will use to send notifications. These are defined in commands.cfg.
3. Define Contact Group(s): In /usr/local/nagios/etc/objects/contacts.cfg or a dedicated contactgroups.cfg:

        define contactgroup{
            contactgroup_name   admins
            alias               System Administrators
            members             nagiosadmin     ; Comma-separated list of contact_names
        }

    Ensure nagiosadmin (or your contact's contact_name) is listed in members.
4. Verify Notification Commands: Check /usr/local/nagios/etc/objects/commands.cfg for notify-host-by-email and notify-service-by-email. Default definitions often look like this (each command_line must stay on a single line):

        define command{
            command_name    notify-host-by-email
            command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /usr/bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
        }

        define command{
            command_name    notify-service-by-email
            command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | /usr/bin/mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
        }

    - These commands use printf to format the email body and pipe it to /usr/bin/mail.
    - $CONTACTEMAIL$ is a Nagios macro that gets replaced with the contact's email address.
    - Many other Nagios macros (like $HOSTNAME$, $SERVICESTATE$, etc.) provide context.
    - Important: Ensure the path /usr/bin/mail is correct for your system. You can find it using which mail. If you send through a different program such as /usr/sbin/sendmail, adjust the whole command_line and not just the path, because sendmail does not accept the -s subject option.
5. Assign Contact Groups to Hosts and Services: Ensure your host and service definitions (e.g., in localhost.cfg) include the contact_groups directive. Example for a host in localhost.cfg:

        define host{
            use             linux-server
            host_name       localhost
            alias           My Nagios Server
            address         127.0.0.1
            contact_groups  admins      ; This line ensures 'admins' group gets notified
        }

    Example for a service in localhost.cfg:

        define service{
            use                     local-service
            host_name               localhost
            service_description     Root Partition
            check_command           check_local_disk!20%!10%!/
            contact_groups          admins      ; This line ensures 'admins' group gets notified
        }

    If you use templates (linux-server, local-service), it's common to define contact_groups admins in the template itself. This way, all hosts/services using that template automatically inherit the contact group assignment.
6. Enable Notifications Globally: This is usually the default. In /usr/local/nagios/etc/nagios.cfg, ensure notifications are globally enabled with enable_notifications=1.
7. Verify and Reload Nagios: Run /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg and, if it reports no errors, reload Nagios.
8. Test Notifications: The easiest way to test is to force a service into a problem state or send a custom notification.
    - Via Web Interface (Custom Notification):
        - Go to the Nagios web UI.
        - Click on "Services" and choose one of the services for localhost (e.g., "Swap Usage").
        - In the "Service Commands" section on the right, click "Send custom service notification".
        - Enter a comment (e.g., "Testing custom notification") and click "Commit".
        - Check your email. You should receive a "CUSTOM" notification. This primarily tests if the notification command and mail system work.
    - By Forcing a State Change (More Realistic Test): This is a bit trickier for localhost services that are normally OK. One way is to temporarily change a service check's thresholds to make it fail. Example: in localhost.cfg, change the check_command for "Swap Usage" so that it demands an unrealistic amount of free swap:

            # Test only: the percentages are "free swap" thresholds, so requiring
            # 100% free should trip the check as soon as any swap is in use.
            check_command   check_local_swap!100%!100%

        If your swap is completely unused and the service stays OK, apply the same trick to the Root Partition check instead (for example check_local_disk!99%!98%!/). Save, verify, and reload Nagios, then wait for Nagios to re-check the service (usually within 5 minutes, or its check_interval).
        - The service should go into a SOFT problem state first.
        - Nagios will re-check it (based on retry_interval and max_check_attempts in the service template).
        - Once it reaches a HARD problem state, a notification should be sent.
        - Check your email for a PROBLEM alert reporting the WARNING or CRITICAL state.
        - Remember to change the thresholds back to sensible values and reload Nagios again!
    - Check Nagios Log: If emails are not arriving, check /usr/local/nagios/var/nagios.log for entries related to notifications. It will show attempts to send notifications and any immediate errors from the notification command. Also, check your system's mail log (e.g., /var/log/mail.log or /var/log/maillog) for errors from the MTA.
9. Troubleshooting Notification Issues:
    - No emails:
        - Is the contact's email address correct?
        - Is the contact part of the assigned contact group?
        - Is the contact group assigned to the host/service?
        - Are notifications enabled for the contact, host/service, and globally (enable_notifications=1)?
        - Is the notification period for the contact and host/service allowing notifications at the current time?
        - Did the host/service reach a HARD state? Notifications are typically not sent for SOFT states.
        - Is the MTA (Postfix, Sendmail, etc.) correctly configured and able to send external emails? Test sending an email from the command line with the mail command (for example, pipe a short message into mail -s "Mail Test" your_real_email@example.com). If this doesn't arrive, the issue is with your server's mail setup, not Nagios itself.
        - Check /usr/local/nagios/var/nagios.log and /var/log/mail.log.
    - Notification delays: Check notification_interval in host/service definitions or templates. This is how long Nagios waits before re-notifying for an ongoing problem.
Setting up notifications correctly is vital. This basic email setup forms the groundwork for more advanced alerting strategies.
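When an alert does not arrive, the first question is whether Nagios tried to send it at all. Nagios writes a HOST NOTIFICATION or SERVICE NOTIFICATION line to nagios.log for each notification it dispatches, so filtering for those lines separates Nagios problems from mail-system problems. A minimal sketch, assuming the default log path used in this guide; recent_notifications is a hypothetical helper name.

```shell
NAGIOS_LOG="${NAGIOS_LOG:-/usr/local/nagios/var/nagios.log}"

# Show the last N notification entries (default 10) from the Nagios log.
recent_notifications() {
    count="${1:-10}"
    grep -E '(HOST|SERVICE) NOTIFICATION:' "$NAGIOS_LOG" | tail -n "$count"
}

# Example:
#   recent_notifications 5
# If entries appear here but no mail arrives, look at the MTA next
# (e.g. /var/log/mail.log or /var/log/maillog).
```

No matching lines at all usually points back to the checklist above: the state never became HARD, notifications are disabled somewhere, or the contact is not in the assigned group.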
Workshop Setting Up Email Notifications for Localhost Alerts
Objective: Configure and test email notifications for alerts generated by localhost services.
Prerequisites:
- A working Nagios Core installation.
- The nagiosadmin contact in contacts.cfg should have your real email address.
- mailutils package installed (sudo apt install mailutils).
- A basic MTA (like Postfix) installed and minimally configured to send mail from the server (even if only to relay hosts or if your server can send directly). If you installed Postfix, choose "Internet Site" or "Local only" initially. For "Internet Site", ensure your server has a resolvable FQDN. For "Local only", it might only deliver to local user mailboxes unless further configured.
    - Crucial Test: Can your server send an email from the command line to your target email address?

            echo "This is a test email from my Nagios server." | mail -s "Mail Test from $(hostname)" your_real_email@example.com

      If this email does not arrive, you must fix your server's mail system (Postfix, Sendmail, etc.) before proceeding with Nagios notifications. This might involve configuring relayhost in Postfix, checking firewall rules, or ensuring your server's IP is not blacklisted.
Steps:
-
Verify Contact and Contact Group Configuration:
- Open
/usr/local/nagios/etc/objects/contacts.cfg
. - Confirm the
nagiosadmin
contact definition:define contact{ contact_name nagiosadmin use generic-contact alias Nagios Administrator email your_real_email@example.com ; <-- ENSURE THIS IS YOUR EMAIL host_notification_options d,u,r service_notification_options w,u,c,r,f host_notification_commands notify-host-by-email service_notification_commands notify-service-by-email host_notification_period 24x7 service_notification_period 24x7 }
- Confirm the
admins
contact group definition and thatnagiosadmin
is a member: - Save any changes.
- Open
-
Verify Notification Commands:
- Open
/usr/local/nagios/etc/objects/commands.cfg
. - Check the
notify-host-by-email
andnotify-service-by-email
commands. Ensure the path to the mail executable (e.g.,/usr/bin/mail
) is correct for your system. You can find the path withwhich mail
.
- Open
-
Assign Contact Group to
localhost
and its Services:- Open
/usr/local/nagios/etc/objects/localhost.cfg
. - Ensure the
localhost
host definition hascontact_groups admins
. - Ensure relevant services (e.g., PING, Root Partition, Swap Usage) also have
contact_groups admins
. Often, this is inherited from a template likelocal-service
. Iflocal-service
template (intemplates.cfg
) already specifiescontact_groups admins
, you don't need to repeat it in every service definition that useslocal-service
. Verify thelocal-service
template intemplates.cfg
: Look fordefine service{ name local-service ... }
and ensure it containscontact_groups admins
. If not, add it.
- Open
-
Enable Notifications Globally (if not already):
- Check
/usr/local/nagios/etc/nagios.cfg
forenable_notifications=1
.
- Check
-
Validate Configuration and Reload Nagios:
-
Test Notification by Forcing a Service to a Critical State: We'll use the "Root Partition" check for this test as it's easy to manipulate its thresholds.
- Identify current free space: In the Nagios web UI, look at the "Root Partition" service for `localhost`. It will show something like "DISK OK - free space: / 15 GB (70%)". Note the percentage free. Let's say it's 70% free.
- Modify thresholds to trigger an alert: Edit `localhost.cfg` and find the "Root Partition" service. It might look like:

    define service{
        use                     local-service
        host_name               localhost
        service_description     Root Partition
        check_command           check_local_disk!20%!10%!/ ; Warn at 20% free, Crit at 10% free
    }

  Change the `check_command` to trigger a CRITICAL state based on your current free space. If you have 70% free, setting critical to 75% free will trigger it (i.e., critical if less than 75% free space). Save the file.
- Validate and Reload Nagios:
- Monitor in Web UI and Check Email:
  - In the Nagios UI, watch the "Root Partition" service. It will go to "Pending".
  - After its next scheduled check, it should change to a SOFT CRITICAL state. Nagios will show "(Service check is currently in a soft critical state)".
  - Nagios will re-check it (e.g., every minute if `retry_interval` is 1 in the `local-service` template). After `max_check_attempts` (e.g., 4), if it's still critical, it will enter a HARD CRITICAL state.
  - At this point, a notification should be sent. Check your email. You should receive an email titled something like "PROBLEM Service Alert: localhost/Root Partition is CRITICAL".
  - The email body will contain details from the notification command.
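To pick thresholds that are certain to trip the check, the arithmetic can be sketched as below. `force_critical_cmd` is a hypothetical helper, not a Nagios tool; it only prints a `check_command` value for you to paste into `localhost.cfg`. It also raises the warning threshold so that it stays at or above the critical one, which some `check_disk` versions may expect (an assumption worth verifying against your plugin version).

```shell
# Thresholds mean "alert when free space drops below N%", so a critical
# threshold above the current free percentage forces a CRITICAL result.
force_critical_cmd() {
    free_pct="$1"                  # current free space, e.g. 70 for "70% free"
    crit=$((free_pct + 5))         # critical threshold just above current free space
    warn=$((free_pct + 10))        # warning threshold above the critical one
    [ "$crit" -gt 100 ] && crit=100
    [ "$warn" -gt 100 ] && warn=100
    echo "check_local_disk!${warn}%!${crit}%!/"
}

force_critical_cmd 70    # prints: check_local_disk!80%!75%!/
```

Remember to restore the original thresholds afterwards, as described in the next step.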
-
Revert Changes and Test Recovery Notification:
- Once you've received the CRITICAL alert, change the "Root Partition" service check command back to its original, sensible values in `localhost.cfg` (for example, `check_local_disk!20%!10%!/`) and save the file.
- Validate and Reload Nagios:
- Monitor in Web UI and Check Email:
  - The service will eventually be re-checked.
  - It should return to an `OK` state. Because the service is recovering from a HARD problem state, the first OK result already counts as a hard recovery.
  - A RECOVERY notification should be sent. Check your email for a message like "RECOVERY Service Alert: localhost/Root Partition is OK".
Outcome:
If you received both the CRITICAL and RECOVERY emails, your basic email notification system is working! You have successfully:
- Confirmed all necessary configuration components for notifications.
- Triggered a real alert by changing service thresholds.
- Observed the SOFT to HARD state transition.
- Received a problem notification email.
- Reverted the change and received a recovery notification email.
Troubleshooting Reminder:
If emails don't arrive, the first place to check (after confirming the Nagios config and logs) is your server's mail system logs (e.g., `/var/log/mail.log` or `/var/log/maillog`). Then re-test sending mail from the command line. Common issues include relay access denied by your mail server, spam filters catching the mails, or an incorrect `mail` command path in Nagios.
This completes the basic setup and familiarization with Nagios. You are now ready to move on to more intermediate topics.
2. Intermediate Nagios Monitoring Techniques
Having mastered the basics of Nagios installation, configuration, and local monitoring, we now move to intermediate techniques. This section will focus on extending Nagios's reach to monitor remote systems (both Linux and Windows), teach you how to write your own custom plugins for specialized checks, and delve into more advanced object configuration options like host groups, service groups, and templates to manage your monitoring environment more efficiently.
Monitoring Remote Linux Hosts with NRPE
One of the most common tasks for a Nagios server is to monitor services and resources on remote Linux/Unix machines. While some checks like PING or HTTP can be done directly by the Nagios server, many checks (CPU load, disk space, specific processes, memory usage) require an agent running on the remote host. The Nagios Remote Plugin Executor (NRPE) is a popular solution for this.
What is NRPE?
NRPE consists of two main components:
- The NRPE daemon (`nrpe`): This daemon runs on the remote Linux host you want to monitor. It listens for connections from the Nagios server, executes pre-defined Nagios plugins locally on the remote host, and returns the results to the Nagios server.
- The `check_nrpe` plugin: This plugin resides on the Nagios server. Nagios uses `check_nrpe` to connect to the NRPE daemon on the remote host, specify which command (plugin) to run, and receive the output.
NRPE Architecture:
- Nagios server schedules a service check that uses the `check_nrpe` plugin.
- `check_nrpe` (on Nagios server) connects to the NRPE daemon on the remote host (typically on TCP port 5666).
- `check_nrpe` tells the NRPE daemon which pre-defined command to execute. These commands are configured in the NRPE daemon's configuration file (`nrpe.cfg`) on the remote host.
- The NRPE daemon executes the specified local plugin (e.g., `check_load`, `check_disk`) on the remote host.
- The local plugin returns its exit status and output string to the NRPE daemon.
- The NRPE daemon sends this result back to the `check_nrpe` plugin on the Nagios server.
- `check_nrpe` passes the result to the Nagios process for evaluation.
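The last three steps of this flow rely on the standard plugin convention: the exit status carries the state (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the output string carries the message. A small sketch of how that pair can be read when testing a plugin by hand; `state_name` and `run_plugin` are hypothetical helper names.

```shell
# Map a plugin exit status to the Nagios state name.
# 0-2 are mapped explicitly; 3 and anything unexpected are reported as UNKNOWN.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Run any plugin command line and print the state next to its output.
# The plugin's own exit status is passed through to the caller.
run_plugin() {
    output=$("$@")
    status=$?
    echo "$(state_name "$status"): $output"
    return "$status"
}
```

For example, on a host with the plugins installed, `run_plugin /usr/lib/nagios/plugins/check_users -w 5 -c 10` prints the state followed by the plugin's message.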
Security Considerations:
- If NRPE is built or run without SSL/TLS support, its communication is unencrypted plain text. Compile and run it with SSL/TLS support to encrypt the traffic.
- The NRPE daemon configuration (`nrpe.cfg`) on the remote host specifies which hosts are allowed to connect (`allowed_hosts`). This should be restricted to your Nagios server's IP address.
- NRPE can be configured not to accept command arguments from the `check_nrpe` plugin (`dont_blame_nrpe=0`, the default; setting it to `1` is generally discouraged for security). Instead, commands with their arguments are fully defined in the remote host's `nrpe.cfg`. This prevents the Nagios server from instructing the NRPE daemon to run commands with arbitrary arguments.
Steps to Monitor a Remote Linux Host using NRPE:
On the Remote Linux Host (the one to be monitored):
-
Install Nagios Plugins: The NRPE daemon needs Nagios plugins to execute locally. Even if it's not a full Nagios server, it needs `nagios-plugins`.

    # On Debian/Ubuntu:
    sudo apt update
    sudo apt install -y nagios-plugins
    # On RHEL/CentOS (requires EPEL repository):
    # sudo yum install epel-release
    # sudo yum install nagios-plugins-all
    # Or specific plugins like nagios-plugins-load, nagios-plugins-disk etc.

The plugins will typically be installed in `/usr/lib/nagios/plugins/` (Debian/Ubuntu) or `/usr/lib64/nagios/plugins/` (RHEL/CentOS). Note this path, as it's needed for `nrpe.cfg`.
-
Install NRPE Daemon:
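The install command for this step depends on the distribution. A hedged sketch, assuming the package is called `nagios-nrpe-server` on Debian/Ubuntu and `nrpe` (from EPEL) on RHEL/CentOS; `distro_family` and `install_nrpe` are hypothetical helper names.

```shell
# Detect the distribution family from an os-release file (default: /etc/os-release).
distro_family() {
    os_release="${1:-/etc/os-release}"
    # Source the file in a subshell so its variables do not leak into this shell.
    ids=$(. "$os_release" 2>/dev/null && echo "$ID $ID_LIKE")
    case " $ids " in
        *" debian "*|*" ubuntu "*)           echo "debian" ;;
        *" rhel "*|*" centos "*|*" fedora "*) echo "rhel" ;;
        *)                                   echo "unknown" ;;
    esac
}

# Install the NRPE daemon with the matching package manager (assumed package names).
install_nrpe() {
    case "$(distro_family)" in
        debian) sudo apt update && sudo apt install -y nagios-nrpe-server ;;
        rhel)   sudo yum install -y epel-release && sudo yum install -y nrpe ;;
        *)      echo "Unknown distribution; install NRPE manually." >&2; return 1 ;;
    esac
}
```

Run `install_nrpe` on the remote host once the functions are defined; nothing is installed until you call it.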
-
Configure NRPE Daemon (`nrpe.cfg`): With distribution packages, the configuration file is usually `/etc/nagios/nrpe.cfg` on both Debian/Ubuntu and RHEL/CentOS. Edit this file. Key settings to check/modify:
- `server_port=5666`: Default NRPE port.
- `allowed_hosts=127.0.0.1,::1,YOUR_NAGIOS_SERVER_IP`: Crucial for security! Replace `YOUR_NAGIOS_SERVER_IP` with the actual IP address of your Nagios Core server. Add IPv6 if needed.
- `dont_blame_nrpe=0`: This is the default and more secure setting. It means NRPE will not accept arguments with commands sent by `check_nrpe`. All command arguments must be defined in `nrpe.cfg` on the remote host. If you set this to `1`, you can pass arguments from `check_nrpe`, but this is less secure.
- `debug=0`: Set to `1` for verbose logging during troubleshooting.
- Command Definitions: This is where you define the commands that the Nagios server can request NRPE to run. The syntax is `command[command_name]=/path/to/plugin <arguments>`. The plugin path might vary. On Debian/Ubuntu it's often `/usr/lib/nagios/plugins/`. Examples:

      # Example: Check for users logged in
      # $USER1$ is not defined here, so use full path.
      # Path to plugins might be /usr/lib/nagios/plugins or /usr/lib64/nagios/plugins
      # Adjust this path based on where nagios-plugins were installed (Step 1).
      # Use `dpkg -L nagios-plugins` or `rpm -ql nagios-plugins` to find plugin paths.
      # Assuming /usr/lib/nagios/plugins/ for Debian/Ubuntu examples:
      command[check_users]=/usr/lib/nagios/plugins/check_users -w 5 -c 10
      command[check_load]=/usr/lib/nagios/plugins/check_load -r -w .15,.10,.05 -c .30,.25,.20
      # Example specific disk
      command[check_hda1]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /dev/hda1
      command[check_root_disk]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
      command[check_zombie_procs]=/usr/lib/nagios/plugins/check_procs -w 5 -c 10 -s Z
      command[check_total_procs]=/usr/lib/nagios/plugins/check_procs -w 150 -c 200

  - Define only the commands you need.
  - The `command_name` (e.g., `check_users`, `check_load`) is what you will specify from the Nagios server via `check_nrpe`.
-
Start/Restart NRPE Daemon and Enable it:
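A sketch of this step for systemd-based hosts. The unit name is an assumption: `nagios-nrpe-server` on Debian/Ubuntu, while EPEL packages on RHEL/CentOS typically call it `nrpe`. `restart_nrpe` and the `SYSTEMCTL` override are illustrative names; the override exists only so the function can be dry-run.

```shell
# Restart NRPE so it re-reads nrpe.cfg, enable it at boot, then show its status.
# Usage: restart_nrpe [unit-name]     (default unit: nagios-nrpe-server)
# Dry run: SYSTEMCTL="echo systemctl" restart_nrpe nrpe
restart_nrpe() {
    service="${1:-nagios-nrpe-server}"
    ${SYSTEMCTL:-sudo systemctl} restart "$service" &&
    ${SYSTEMCTL:-sudo systemctl} enable "$service" &&
    ${SYSTEMCTL:-sudo systemctl} --no-pager status "$service"
}
```

Chaining with `&&` stops at the first failing command, so a daemon that refuses to start is noticed before you move on to the firewall.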
-
Firewall Configuration (on Remote Host): If a firewall (like `ufw` or `firewalld`) is active on the remote host, allow incoming connections on TCP port 5666 from the Nagios server's IP.
- Using `ufw` (Debian/Ubuntu):
- Using `firewalld` (RHEL/CentOS):
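The rules for both firewall tools can be sketched as follows. The helper only prints the commands so you can review them before running them; `nrpe_firewall_rules` is a hypothetical name, and the `firewalld` variant assumes a rich rule in the default zone.

```shell
# Print the firewall commands that allow NRPE (TCP 5666) from one source IP.
# $1 = firewall tool ("ufw" or "firewalld"), $2 = Nagios server IP.
nrpe_firewall_rules() {
    case "$1" in
        ufw)
            echo "sudo ufw allow from $2 to any port 5666 proto tcp"
            ;;
        firewalld)
            echo "sudo firewall-cmd --permanent --add-rich-rule='rule family=\"ipv4\" source address=\"$2\" port port=\"5666\" protocol=\"tcp\" accept'"
            echo "sudo firewall-cmd --reload"
            ;;
        *)
            echo "unknown firewall tool: $1" >&2
            return 1
            ;;
    esac
}

nrpe_firewall_rules ufw 192.168.1.100
```

Restricting the rule to the Nagios server's address, rather than opening port 5666 to everyone, complements the `allowed_hosts` setting in `nrpe.cfg`.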
On the Nagios Core Server:
-
Install `check_nrpe` Plugin: The `check_nrpe` plugin source code is bundled with the NRPE source code download rather than with the Nagios Plugins package, so it may not have been built during the initial "Nagios Plugins" compilation. If `check_nrpe` is missing from `/usr/local/nagios/libexec/`, or if you need a specific version:

    cd /tmp
    # Download NRPE source (same version as daemon ideally, or a compatible one)
    # Check https://github.com/NagiosEnterprises/nrpe/releases for latest version
    NRPE_VERSION="4.1.0" # Example
    wget https://github.com/NagiosEnterprises/nrpe/releases/download/nrpe-${NRPE_VERSION}/nrpe-${NRPE_VERSION}.tar.gz
    tar -zxvf nrpe-${NRPE_VERSION}.tar.gz
    cd nrpe-${NRPE_VERSION}/
    # Configure to build plugin (and optionally agent if you want to build it from here too)
    # You may need openssl-devel or libssl-dev: sudo apt install libssl-dev OR sudo yum install openssl-devel
    # Building does not need root; only the install steps below do.
    ./configure --enable-ssl # Or without --enable-ssl if remote NRPE daemon doesn't use SSL
    make check_nrpe
    sudo cp src/check_nrpe /usr/local/nagios/libexec/
    sudo chown nagios:nagios /usr/local/nagios/libexec/check_nrpe
    sudo chmod 750 /usr/local/nagios/libexec/check_nrpe # Or 755

Test `check_nrpe` from the Nagios server's command line by running `/usr/local/nagios/libexec/check_nrpe -H REMOTE_HOST_IP`. This should return the NRPE version running on the remote host (e.g., "NRPE v4.1.0"). If it fails (timeout, connection refused):
- Verify NRPE daemon is running on remote host (`systemctl status nagios-nrpe-server`).
- Check `allowed_hosts` in `nrpe.cfg` on remote host.
- Check firewall on remote host.
- Check network connectivity between Nagios server and remote host on port 5666 (e.g., `telnet REMOTE_HOST_IP 5666` or `nc -zv REMOTE_HOST_IP 5666`).

Test executing a defined command: If you defined `command[check_users]=...` on the remote host, run `/usr/local/nagios/libexec/check_nrpe -H REMOTE_HOST_IP -c check_users`. This should output something like: "USERS OK - 1 users currently logged in".
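When several commands are defined on the remote host, the two manual tests above can be combined into one loop. This is a sketch: `probe_nrpe` is a hypothetical helper, and the default plugin path assumes a source install.

```shell
# Path assumes a source install; packaged installs often use /usr/lib/nagios/plugins/check_nrpe.
CHECK_NRPE="${CHECK_NRPE:-/usr/local/nagios/libexec/check_nrpe}"

# Query the daemon version first, then each named command.
# Stops at the first result that is not OK and returns that plugin's exit status.
# Usage: probe_nrpe REMOTE_HOST_IP check_users check_load ...
probe_nrpe() {
    host="$1"; shift
    "$CHECK_NRPE" -H "$host" || { echo "NRPE not reachable on $host" >&2; return 1; }
    for cmd in "$@"; do
        "$CHECK_NRPE" -H "$host" -c "$cmd" || return $?
    done
}
```

Running the version query first separates connectivity problems (firewall, `allowed_hosts`, daemon down) from problems with an individual command definition.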
-
Define `check_nrpe` Command in Nagios (if not already present): In `/usr/local/nagios/etc/objects/commands.cfg` on the Nagios server:

    define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 30 -c $ARG1$
    }

- `$USER1$/check_nrpe`: Path to the plugin.
- `-H $HOSTADDRESS$`: Specifies the remote host's IP address (taken from host definition).
- `-t 30`: Timeout of 30 seconds for the check.
- `-c $ARG1$`: The command name (defined in remote `nrpe.cfg`) to execute. `$ARG1$` will be replaced by the argument passed from the service definition.

If you compiled NRPE and `check_nrpe` with SSL support, you might need to add SSL-related arguments if your NRPE daemon requires them (e.g., certificate paths if client certs are used). For a basic SSL setup without client certificates, often no extra arguments are needed if both sides are compiled with SSL.
-
Define Host Object for the Remote Linux Host: Create a new config file in, for example, `/usr/local/nagios/etc/servers/remote-linux-host.cfg` (ensure this directory is included by a `cfg_dir` directive in `nagios.cfg`).

    define host{
        use             linux-server ; Inherit from your generic Linux server template
        host_name       remote-linux-server-01
        alias           My Remote Linux Server
        address         REMOTE_HOST_IP ; IP address of the remote host
        contact_groups  admins
    }

Replace `REMOTE_HOST_IP` with the actual IP.
-
Define Service Checks using `check_nrpe` for the Remote Host: In the same file (`remote-linux-host.cfg`):

    define service{
        use                     generic-service ; Inherit from your generic service template
        host_name               remote-linux-server-01
        service_description     Remote Users
        check_command           check_nrpe!check_users ; 'check_users' is the command name defined in nrpe.cfg on remote host
    }

    define service{
        use                     generic-service
        host_name               remote-linux-server-01
        service_description     Remote Load
        check_command           check_nrpe!check_load
    }

    define service{
        use                     generic-service
        host_name               remote-linux-server-01
        service_description     Remote Root Disk
        check_command           check_nrpe!check_root_disk
    }

The string after `!` in `check_nrpe!command_name` is passed as `$ARG1$` to the `check_nrpe` command definition, which then becomes the command name sent to the remote NRPE daemon.
-
Verify Configuration and Reload Nagios:
-
Check Web Interface: The new remote host and its services should appear in the Nagios web UI. They will initially be "Pending" and then update with statuses from the NRPE checks.
Troubleshooting NRPE:
- "Connection refused" or "Socket timeout" from `check_nrpe`:
  - Is NRPE daemon running on the remote host? (`ps aux | grep nrpe`, `systemctl status nagios-nrpe-server`).
  - Is the Nagios server's IP in `allowed_hosts` in `nrpe.cfg` on remote?
  - Is port 5666 open in the firewall on the remote host for the Nagios server's IP?
  - Network connectivity issues (routers, general network problems).
  - (If using xinetd for NRPE, ensure xinetd is configured and running).
- "CHECK_NRPE: Error - Could not complete SSL handshake.":
  - Mismatch in SSL/TLS compilation/configuration. Ensure both `check_nrpe` (on Nagios server) and `nrpe` daemon (on remote host) are compiled with or without SSL support consistently. If compiled with SSL, they usually negotiate. Some older versions had issues.
  - The NRPE daemon was compiled with specific ciphers that `check_nrpe` doesn't support.
  - The Nagios server's IP is missing from `allowed_hosts`. This is a common cause, because the daemon drops the connection before the handshake completes.
- "CHECK_NRPE: Received 0 bytes from daemon." or "Command ... not defined":
  - The command name sent by `check_nrpe` (e.g., `check_users`) is not defined in the `command[...]` directives in `nrpe.cfg` on the remote host, or there's a typo. Command names are case-sensitive.
  - Plugin execution error on the remote host. Check NRPE daemon logs on the remote host (syslog or a dedicated NRPE log if configured).
- Plugin execution errors (e.g., "(No output returned from plugin)" or plugin-specific errors):
  - The plugin path in `nrpe.cfg` on remote host is incorrect.
  - Plugin does not have execute permissions on remote host.
  - Plugin itself is failing. Try running the exact command line from `nrpe.cfg` manually on the remote host as the `nagios` user (or whatever user NRPE runs as) to debug. E.g., `sudo -u nagios /usr/lib/nagios/plugins/check_users -w 5 -c 10`.
NRPE is a powerful way to extend Nagios's monitoring capabilities to your entire Linux infrastructure.
Workshop Monitoring a Remote Linux Host via NRPE
Objective:
Set up monitoring for a remote Linux host (VM2) from your Nagios server (VM1). You will monitor CPU load, root disk space, and total running processes on VM2.
Prerequisites:
- Your Nagios Core server (VM1) from previous workshops.
- A second Linux virtual machine (VM2, e.g., Debian/Ubuntu server). This will be the "remote host".
- Network connectivity between VM1 and VM2. Ensure they can ping each other by IP.
- Know the IP addresses of both VM1 (Nagios server) and VM2 (remote host).
- Sudo/root access on both VMs.
Let's assume:
- VM1 (Nagios Server) IP: `192.168.1.100` (Replace with your actual IP)
- VM2 (Remote Linux Host) IP: `192.168.1.101` (Replace with your actual IP)
Part 1: Configure the Remote Linux Host (VM2)
-
Log in to VM2.
-
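Install Nagios Plugins and NRPE Server: On Debian/Ubuntu, run `sudo apt update` followed by `sudo apt install -y nagios-plugins nagios-nrpe-server`. This installs the necessary check plugins (like `check_disk`, `check_load`, `check_procs`) and the NRPE daemon.
-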
Configure NRPE Daemon on VM2: Edit `/etc/nagios/nrpe.cfg`:
- Find the `allowed_hosts` line. Modify it to include your Nagios Server's IP (VM1), for example `allowed_hosts=127.0.0.1,::1,192.168.1.100` (replace `192.168.1.100` with VM1's actual IP address).
- Ensure `dont_blame_nrpe=0` (this is default and more secure).
- Add or verify command definitions. The default `nrpe.cfg` on Debian/Ubuntu often comes with some pre-defined commands. Ensure these (or similar) are present and uncommented. The path to plugins is usually `/usr/lib/nagios/plugins/`.

      # These should exist or be added:
      command[check_users]=/usr/lib/nagios/plugins/check_users -w 5 -c 10
      command[check_load]=/usr/lib/nagios/plugins/check_load -r -w .15,.10,.05 -c .30,.25,.20
      command[check_disk_root]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
      command[check_procs_total]=/usr/lib/nagios/plugins/check_procs -w 250 -c 400

  The stock `nrpe.cfg` might only contain disk examples such as `check_hda1` or `check_sda1`. This workshop needs a dedicated command for the root partition (`/`), which is why `check_disk_root` is defined here, along with `check_procs_total` for the total process count. Adjust the thresholds as needed for your VM2.
- Save and exit `nrpe.cfg`.
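The `allowed_hosts` edit from this step can also be scripted. A minimal sketch, assuming the file already contains an uncommented `allowed_hosts=` line; `set_allowed_hosts` is a hypothetical helper name.

```shell
# Rewrite the allowed_hosts line in an nrpe.cfg file, keeping a .bak copy,
# then print the resulting line for confirmation.
# $1 = path to nrpe.cfg, $2 = Nagios server IP (VM1).
set_allowed_hosts() {
    sed -i.bak "s/^allowed_hosts=.*/allowed_hosts=127.0.0.1,::1,$2/" "$1"
    grep '^allowed_hosts=' "$1"
}
```

The real file is owned by root, so call the function from a root shell, for example `set_allowed_hosts /etc/nagios/nrpe.cfg 192.168.1.100`. The `.bak` copy makes it easy to roll back if the daemon refuses to start.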
-
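Restart and Enable NRPE Service on VM2: Run `sudo systemctl restart nagios-nrpe-server`, then `sudo systemctl enable nagios-nrpe-server` so the daemon also starts at boot.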
-
Configure Firewall on VM2 (if applicable): If `ufw` is active on VM2:

    sudo ufw allow from 192.168.1.100 to any port 5666 proto tcp comment 'Allow NRPE from Nagios Server'
    sudo ufw reload
    sudo ufw status # Verify the rule is active

(Replace `192.168.1.100` with VM1's IP).
Part 2: Configure the Nagios Server (VM1)
-
Log in to VM1.
-
Ensure `check_nrpe` Plugin is Installed: On a source install, `check_nrpe` is built from the NRPE source package and should be in `/usr/local/nagios/libexec/`. Verify its existence with `ls -l /usr/local/nagios/libexec/check_nrpe`. If it's not there, compile it from the NRPE source tarball as described in the main NRPE section. If it was installed via `apt install nagios-nrpe-plugin` (less common for source installs), it might be in `/usr/lib/nagios/plugins/check_nrpe`. If so, adjust paths in `commands.cfg` accordingly.
-
Test `check_nrpe` from VM1 to VM2: Run `/usr/local/nagios/libexec/check_nrpe -H 192.168.1.101` (replace `192.168.1.101` with VM2's IP). Expected output: `NRPE vX.Y.Z` (the version of NRPE running on VM2). If this fails, troubleshoot (firewall on VM2, `allowed_hosts` on VM2, NRPE service status on VM2).

Now test a specific command defined in VM2's `nrpe.cfg`:

    /usr/local/nagios/libexec/check_nrpe -H 192.168.1.101 -c check_users
    /usr/local/nagios/libexec/check_nrpe -H 192.168.1.101 -c check_load
    /usr/local/nagios/libexec/check_nrpe -H 192.168.1.101 -c check_disk_root
    /usr/local/nagios/libexec/check_nrpe -H 192.168.1.101 -c check_procs_total

Each should return an OK status with some output. If you get "Command not defined", double-check the command names in VM2's `nrpe.cfg` and ensure they match what you use with `-c`.
-
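Define `check_nrpe` Command in Nagios (VM1): Open `/usr/local/nagios/etc/objects/commands.cfg`. If it doesn't already exist, add the `check_nrpe` command definition from the main NRPE section (`command_name check_nrpe` with `command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -t 30 -c $ARG1$`). Save and exit.
-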
Create Configuration File for VM2 on Nagios Server (VM1): Create a new directory for remote server configs if you don't have one (`sudo mkdir -p /usr/local/nagios/etc/servers`). Tell Nagios to read configs from this directory: edit `/usr/local/nagios/etc/nagios.cfg` and add the line `cfg_dir=/usr/local/nagios/etc/servers` (if not already present). Save and exit.

Now, create the config file for VM2 in that directory (any file name ending in `.cfg` is read) and add the following content:

    define host{
        use             linux-server ; Name of host template to use
        host_name       vm2-remote-linux
        alias           Remote Linux VM2
        address         192.168.1.101 ; <<< IP Address of VM2
        contact_groups  admins
    }

    define service{
        use                     generic-service ; Name of service template to use
        host_name               vm2-remote-linux
        service_description     CPU Load via NRPE
        check_command           check_nrpe!check_load
    }

    define service{
        use                     generic-service
        host_name               vm2-remote-linux
        service_description     Root Disk Space via NRPE
        check_command           check_nrpe!check_disk_root
    }

    define service{
        use                     generic-service
        host_name               vm2-remote-linux
        service_description     Total Processes via NRPE
        check_command           check_nrpe!check_procs_total
    }

Replace `192.168.1.101` with VM2's actual IP. Save and exit.
-
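Verify Nagios Configuration and Reload (VM1): Run `sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg` and reload Nagios only if it reports no errors. If there are errors, read them carefully. They often point to typos or missing definitions.
-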
Check Nagios Web Interface (VM1): Open your Nagios web UI.
- Go to "Hosts". You should see `vm2-remote-linux`.
- Go to "Services". You should see the three new services associated with `vm2-remote-linux` (CPU Load, Root Disk Space, Total Processes).
- They will be in "Pending" state initially. After a few minutes, they should update with their actual status from VM2.
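The directory and `cfg_dir` setup from the "Create Configuration File" step can be made repeatable with a small helper. This is a sketch; `add_cfg_dir` is a hypothetical name and the defaults assume the source-install layout used in this guide.

```shell
# Create the directory for per-server configs and register it in nagios.cfg once.
# $1 = config directory, $2 = path to nagios.cfg (both optional).
add_cfg_dir() {
    dir="${1:-/usr/local/nagios/etc/servers}"
    cfg="${2:-/usr/local/nagios/etc/nagios.cfg}"
    mkdir -p "$dir" || return 1
    # -x matches the whole line, -F treats the text literally;
    # the directive is appended only if it is not already present.
    grep -qxF "cfg_dir=$dir" "$cfg" || echo "cfg_dir=$dir" >> "$cfg"
}
```

Because the append is guarded by `grep`, running the helper again for the next remote host does not duplicate the `cfg_dir` line. Run it from a root shell when operating on the real Nagios files.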
Outcome:
You have successfully configured Nagios to monitor key metrics (CPU load, disk space, total processes) on a remote Linux host (VM2) using NRPE. This involved:
- Setting up the NRPE daemon and necessary plugins on the remote host (VM2).
- Configuring firewall rules and allowed hosts for security.
- Ensuring the `check_nrpe` plugin is available on the Nagios server (VM1).
- Testing connectivity and command execution with `check_nrpe` from the command line.
- Defining the new host and its NRPE-based services in the Nagios server configuration.
- Verifying the setup in the Nagios web interface.
This workshop provides a practical template for adding more remote Linux hosts and more NRPE-based checks to your Nagios monitoring.
Monitoring Remote Windows Hosts with NSClient++
Monitoring Windows hosts requires a different agent than Linux's NRPE. The most popular and versatile agent for Windows is NSClient++. It can communicate with Nagios using various protocols, including NRPE, making the Nagios-side configuration very similar to monitoring Linux hosts via NRPE.
What is NSClient++?
NSClient++ is an agent designed for Windows systems (though it has Linux ports too) that allows a Nagios server to query performance metrics, service states, process information, and more.
Key features:
- Multiple protocols: It can listen for connections using Nagios's native `check_nt` protocol (older, less secure), NRPE (recommended for consistency if you already use it for Linux), and others.
- Extensible: Supports external scripts and PowerShell.
- Built-in checks: Provides many common Windows checks out-of-the-box (CPU, memory, disk, services, processes, event logs).
- Secure: Supports SSL/TLS for NRPE communication and certificate-based authentication.
Architecture (using NRPE protocol with NSClient++):
- Nagios server schedules a service check using the `check_nrpe` plugin.
- `check_nrpe` (on Nagios server) connects to NSClient++ on the Windows host (NSClient++ listening as an NRPE daemon, typically on TCP port 5666).
- `check_nrpe` tells NSClient++ which pre-defined command (alias) to execute. These command aliases are configured in NSClient++'s configuration file (`nsclient.ini` or `custom.ini`).
- NSClient++ executes the corresponding internal check module or an external script.
- NSClient++ returns the result (exit status and output string) to `check_nrpe`.
- `check_nrpe` passes the result to the Nagios process.
Steps to Monitor a Remote Windows Host using NSClient++ (with NRPE):
On the Remote Windows Host:
-
Download NSClient++: Go to the official NSClient++ website (nsclient.org) and download the latest stable version (usually an MSI installer for 64-bit Windows).
-
Install NSClient++: Run the MSI installer.
- Choose "Generic" or "Typical" setup type.
- Important Configuration during install:
  - Allowed hosts: Enter the IP address of your Nagios Core server. This is crucial for security.
  - Enable common check plugins: Ensure `NRPEServer` (or `NRPEServer module (check_nrpe)`) is enabled/ticked. You might also enable `CheckSystem` (for CPU, memory), `CheckDisk`, `CheckService`, etc.
  - Password (for `check_nt`): If you were to use `check_nt`, you'd set a password here. For NRPE, it's not directly used in the same way, but it's good to set something if prompted.
  - You might be asked if you want to allow arguments to be passed to NRPE commands. For better security, it's often recommended to define commands fully within NSClient++ and not allow arguments from Nagios (similar to `dont_blame_nrpe=0` in Linux NRPE). The installer may offer to enable arguments for convenience, so check which setting was applied: the examples later in this section that pass arguments with `-a` only work when arguments are allowed.
- Complete the installation. NSClient++ will be installed as a Windows service and should start automatically.
-
Configure NSClient++ (`nsclient.ini` or `custom.ini`): The configuration file is typically located at `C:\Program Files\NSClient++\nsclient.ini`. For modern versions, it's recommended to put custom settings in `C:\Program Files\NSClient++\custom.ini` to avoid overwrites during upgrades. Open `nsclient.ini` (or create/edit `custom.ini`) with a text editor (run as Administrator).
-
Enable Modules: Ensure necessary modules are enabled.
-
Configure NRPEServer settings:

    [/settings/NRPE/server]
    ; Allowed hosts: reconfirm this from the install step, or add the Nagios server IP here
    allowed hosts = YOUR_NAGIOS_SERVER_IP
    ; Allow arguments from Nagios (less secure, but often convenient for NSClient++)
    allow arguments = true
    ; Allow nasty characters (meta characters) in arguments (can be a security risk if not careful)
    ; Set to false for more security, which requires more careful command definitions
    allow nasty characters = true
    ; Port to listen on
    port = 5666
    ; Enable SSL/TLS (recommended) - ensure check_nrpe on Nagios server also supports it
    ; use ssl = true
    ; If using SSL but not full certificate validation (simpler setup)
    ; insecure = true

Comments are kept on their own lines here so that they cannot be read as part of a value. If `use ssl = true` is set, `check_nrpe` on the Nagios server must also be compiled with SSL support and might need the `-S` or appropriate SSL flags if the NSClient++ SSL setup is strict. For simplicity, you might start with `use ssl = false`.
-
Define Command Aliases (External Scripts/Aliases): If the corresponding module is loaded and `allow arguments = true` is set for NRPE, many built-in checks can be called directly without an alias. For instance, with `CheckSystem` loaded, `check_nrpe` can often call `check_cpu` and `check_memory` directly. For complex commands, or to restrict what Nagios can call, define aliases. These are typically defined under `[/settings/external scripts/alias]`, or under `[/settings/external scripts/scripts]` for actual scripts. Example pre-defined aliases (often in `nsclient.ini` already, or can be added):

    [/settings/external scripts/alias]
    alias_cpu = checkCPU warn=80 crit=90 time=5m time=1m time=30s
    alias_mem = checkMem MaxWarn=80% MaxCrit=90% ShowAll=long type=physical
    alias_disk_c = CheckDriveSize MinWarn=20% MinCrit=10% Drive=C: FilterType=FIXED
    ; Checks if Spooler service is running
    alias_service_spooler = checkServiceState CheckAll Spooler
    ; Warn if uptime < 1 day, Crit if < 1 hour (example)
    alias_uptime = checkUpTime MinWarn=1d MinCrit=1h

These `alias_` names are what you'd call from Nagios via `check_nrpe` (e.g., `check_nrpe -c alias_cpu`). If `allow arguments = true` in `[/settings/NRPE/server]`, you can often call the internal commands directly, for example `check_nrpe -c check_cpu -a "warn=load > 80" "crit=load > 90"`. This is very flexible but gives more control to the Nagios side.
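The "Enable Modules" step above refers to the `[/modules]` section of the configuration file. A hedged sketch of what that section might contain for the checks used in this guide; the exact module list and the `enabled` value syntax are assumptions based on recent NSClient++ releases, so compare with the file your installer generated.

```ini
[/modules]
; Listen for check_nrpe requests from the Nagios server
NRPEServer = enabled
; CPU, memory, uptime, process and service checks
CheckSystem = enabled
; Drive size and file checks
CheckDisk = enabled
; Needed for the alias and external script sections
CheckExternalScripts = enabled
```

A command such as `check_cpu` is only available once the module that provides it is enabled here.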
-
-
Restart NSClient++ Service: Open the Windows Services console (`services.msc`), find "NSClient++" (or "NSCP"), and restart it to apply configuration changes.
-
Configure Windows Firewall: Allow incoming connections on TCP port 5666 from your Nagios server's IP address.
- Open Windows Defender Firewall with Advanced Security.
- Go to "Inbound Rules".
- Click "New Rule..."
- Type of rule: "Port".
- Protocol and Ports: "TCP", Specific local ports: "5666".
- Action: "Allow the connection".
- Profile: Choose appropriate profiles (Domain, Private, Public - typically Domain and Private).
- Name: "NSClient++ NRPE (from Nagios)".
- (Optional but recommended) Scope: Restrict "Remote IP addresses" to your Nagios server's IP.
On the Nagios Core Server:
-
Ensure `check_nrpe` Plugin is Ready: The same `check_nrpe` plugin used for Linux hosts can be used for Windows hosts running NSClient++ in NRPE mode. Verify that `/usr/local/nagios/libexec/check_nrpe` exists and works (as in the Linux NRPE section).
-
Test `check_nrpe` from Nagios server to Windows host: Run `/usr/local/nagios/libexec/check_nrpe -H WINDOWS_HOST_IP`. It should return something like: `I seem to be doing fine...` or the NSClient++ version. If you get "CHECK_NRPE: Error - Could not complete SSL handshake," ensure `use ssl = false` in `nsclient.ini` if `check_nrpe` is not using SSL, or ensure both are configured for compatible SSL.

Test a specific command (assuming `allow arguments = true` in NSClient++ and `CheckSystem` module is loaded). The threshold expressions are quoted so that the shell does not treat `>` and `<` as redirections:

    # Test CPU check: warn if 5min avg > 80%, crit if > 90%
    /usr/local/nagios/libexec/check_nrpe -H WINDOWS_HOST_IP -c check_cpu -a "warn=load > 80" "crit=load > 90" time=5m
    # Test Memory check: warn if physical memory usage > 80%, crit if > 90%
    /usr/local/nagios/libexec/check_nrpe -H WINDOWS_HOST_IP -c check_memory -a type=physical "warn=used > 80%" "crit=used > 90%"
    # Test C: drive space: warn if free < 20GB, crit if free < 10GB
    /usr/local/nagios/libexec/check_nrpe -H WINDOWS_HOST_IP -c check_drivesize -a drive=C: "warn=free < 20G" "crit=free < 10G"

Or, if using aliases defined in `nsclient.ini` like `alias_cpu`, run `/usr/local/nagios/libexec/check_nrpe -H WINDOWS_HOST_IP -c alias_cpu`.
-
Define Host Object for the Windows Host: Create `/usr/local/nagios/etc/servers/windows-host-01.cfg` (or similar):

    define host{
        use             windows-server ; You might create a 'windows-server' host template
                                       ; Or use 'generic-host' or 'linux-server' if similar enough
        host_name       my-windows-server
        alias           My First Windows Server
        address         WINDOWS_HOST_IP ; IP of the Windows host
        contact_groups  admins
    }

It's good practice to create a `windows-server` host template in `templates.cfg` if you monitor many Windows machines.
-
Define Service Checks using `check_nrpe`: In the same file (`windows-host-01.cfg`):

    # Example using direct command calls (requires 'allow arguments = true' in NSClient++)
    define service{
        use                     generic-service
        host_name               my-windows-server
        service_description     Windows CPU Load
        check_command           check_nrpe!check_cpu!-a 'warn=load > 80' 'crit=load > 90' time=5m time=1m time=30s
    }

    define service{
        use                     generic-service
        host_name               my-windows-server
        service_description     Windows Memory Usage
        check_command           check_nrpe!check_memory!-a type=physical 'warn=used > 80%' 'crit=used > 90%'
    }

    define service{
        use                     generic-service
        host_name               my-windows-server
        service_description     Windows C Drive Space
        check_command           check_nrpe!check_drivesize!-a drive=C: 'warn=free < 20G' 'crit=free < 10G' ShowAll=long
    }

    define service{
        use                     generic-service
        host_name               my-windows-server
        service_description     Windows Uptime
        check_command           check_nrpe!check_uptime
        # For check_uptime, NSClient++ has default warn/crit values.
        # To specify, e.g., warn if uptime < 7d, crit < 1d:
        # check_command         check_nrpe!check_uptime!-a 'warn=uptime < 7d' 'crit=uptime < 1d'
    }

Important Note on Arguments: Nagios uses the `!` character to separate the command name from its arguments, so in `check_nrpe!check_cpu!-a ...` the value `check_cpu` becomes `$ARG1$` and the string starting with `-a` becomes `$ARG2$`. The `check_nrpe` command defined earlier only uses `$ARG1$`, so these services need a command definition whose `command_line` also includes `$ARG2$`; otherwise the thresholds never reach NSClient++. Because Nagios may hand the command line to a shell, expressions containing `>` or `<` are quoted above. If NSClient++ arguments themselves contain `!` or other characters that Nagios might misinterpret, careful quoting or alternative argument passing might be needed. NSClient++ parses the forwarded argument string.

If you used aliases like `alias_cpu` in `nsclient.ini`, the `check_command` is simply `check_nrpe!alias_cpu`. This is cleaner and more secure as the full check logic is on the client.
-
Verify Configuration and Reload Nagios:
-
Check Web Interface: The new Windows host and its services should appear.
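For the two styles described in the service-definition step, the Nagios-side objects might look like the sketch below. The `check_nrpe_args` command name is hypothetical (it is not defined elsewhere in this guide), and `alias_cpu` must exist in the NSClient++ configuration on the Windows host.

```
# Alias-based check: the thresholds live in nsclient.ini on the Windows host.
define service{
    use                     generic-service
    host_name               my-windows-server
    service_description     Windows CPU Load (alias)
    check_command           check_nrpe!alias_cpu
}

# Hypothetical command for the argument-passing style: forwards $ARG2$ to check_nrpe.
define command{
    command_name    check_nrpe_args
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 30 -c $ARG1$ $ARG2$
}
```

With the second definition in place, a service would use `check_nrpe_args!check_cpu!-a 'warn=load > 80' 'crit=load > 90'` so that the `-a ...` string is actually passed through.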
Troubleshooting NSClient++:
- Connection issues from Nagios server:
  - NSClient++ service running on Windows?
  - Nagios server IP in `allowed hosts` in `nsclient.ini` (`[/settings/NRPE/server]` section)?
  - Windows Firewall allowing port 5666 from Nagios server?
  - SSL/TLS mismatch? Try `use ssl = false` in `nsclient.ini` for initial testing if `check_nrpe` is not using SSL.
- "UNKNOWN: No handler for command" or similar from `check_nrpe`:
  - The command (e.g., `check_cpu`) or alias (e.g., `alias_cpu`) is not recognized by NSClient++.
  - Ensure the required module (e.g., `CheckSystem`) is enabled in `[/modules]` in `nsclient.ini`.
  - If using an alias, ensure it's correctly defined in `[/settings/external scripts/alias]`.
  - If `allow arguments = false` in NSClient++, you must use aliases that fully define the command.
- NSClient++ logs: Check `C:\Program Files\NSClient++\nsclient.log` for errors on the Windows host. You might need to increase log level in `nsclient.ini` (`[/settings/log]` section, e.g., `level = debug`). Remember to restart NSClient++ service after changing its .ini file.
- NSClient++ test command: On the Windows host, you can test commands locally. Open Command Prompt as Administrator, navigate to `C:\Program Files\NSClient++\`, then run `nscp test`. This gives you an interactive NSClient++ console. You can then type commands like `check_cpu warn=load>80 crit=load>90 time=5m` to see their output, or `alias_cpu` if you defined such an alias.
Monitoring Windows hosts with NSClient++ and NRPE provides a consistent approach with monitoring Linux hosts, simplifying Nagios configuration.
Workshop Monitoring a Remote Windows Host using NSClient++ and check_nrpe
Objective:
Install and configure NSClient++ on a remote Windows host (VM3) and monitor its CPU usage, memory usage, and C: drive space from your Nagios server (VM1) using the NRPE protocol.
Prerequisites:
- Your Nagios Core server (VM1) from previous workshops (IP: 192.168.1.100 - adjust as needed).
- A Windows virtual machine (VM3, e.g., Windows Server 2019/2022 or Windows 10/11). This will be the "remote Windows host." (IP: 192.168.1.102 - adjust as needed).
- Network connectivity between VM1 and VM3. Ensure VM1 can ping VM3 by IP.
- Administrator access on VM3.
- Sudo/root access on VM1.
- check_nrpe plugin working on VM1.
Part 1: Configure the Remote Windows Host (VM3)
-
Log in to VM3 as an Administrator.
-
Download NSClient++: Open a web browser on VM3 and go to https://nsclient.org/download/. Download the latest stable 64-bit MSI installer (e.g., NSCP-0.5.x.xx-x64.msi).
-
Install NSClient++ on VM3:
- Run the downloaded MSI installer.
- Click "Next" on the welcome screen. Accept the license agreement and click "Next."
- Setup Type: Choose "Typical." Click "Next."
- Configuration:
  - Allowed hosts address: Enter the IP address of your Nagios server (VM1), e.g., 192.168.1.100.
  - NSClient++ Monitoring Tools: Keep defaults or ensure "Enable NRPE server" (or similar wording for NRPEServer) is checked.
  - You can leave the password fields blank as we are focusing on NRPE.
  - Click "Next."
- Click "Install." If prompted by User Account Control, click "Yes."
- Once installation is complete, click "Finish." The NSClient++ service should start automatically.
-
Configure NSClient++ (nsclient.ini):
- Open File Explorer and navigate to C:\Program Files\NSClient++\.
- Open nsclient.ini with a text editor like Notepad, run as Administrator (right-click Notepad, "Run as administrator," then open the file).
- Enable Modules (verify): Under the [/modules] section, ensure the modules this workshop relies on (for example NRPEServer and CheckSystem, plus the modules that provide the disk and alias checks) are present and set to enabled (or uncommented).
- Configure NRPE Server Settings: Under the [/settings/NRPE/server] section (create it if it doesn't exist or is commented out):
; Undocumented key
; For simpler SSL, if used. Start without SSL for ease.
verify mode = none
; Alias for verify mode = none and allow-self-signed = true
insecure = true
; Allow arguments from NRPE client
allow arguments = true
; Allow "nasty" meta characters (may be needed because of special characters).
; Be cautious with this in production.
allow nasty characters = true
; Allowed hosts: your Nagios Server IP (VM1)
allowed hosts = 192.168.1.100
; Port to use for NRPE.
port = 5666
; SSL/TLS options - For initial workshop, keep SSL disabled for simplicity
use ssl = false
  Make sure allowed hosts correctly lists VM1's IP. For this workshop, we set use ssl = false to simplify the initial setup. In a production environment, you should enable SSL.
- Command Aliases (Optional but Good Practice): While allow arguments = true lets us call commands directly, let's define a few aliases under [/settings/external scripts/alias] for clarity and future security hardening. If this section doesn't exist, create it.
[/settings/external scripts/alias]
alias_cpu_long = checkCPU warn=load>80 crit=load>90 time=5m time=1m time=30s ShowAll=long
alias_mem_phys = checkMem MaxWarn=80% MaxCrit=90% type=physical ShowAll=long
alias_disk_c_space = CheckDriveSize MinWarn=20% MinCrit=10% Drive=C: ShowAll=long FilterType=FIXED
alias_win_uptime = checkUpTime MinWarn=24h MinCrit=2h ShowAll=long
- Save the nsclient.ini file.
-
Restart NSClient++ Service on VM3:
- Open "Services" (type services.msc in the Run dialog or Start menu search).
- Find "NSClient++ Monitoring Agent" (or similar, might be "NSCP").
- Right-click it and select "Restart."
-
Configure Windows Firewall on VM3:
- Search for "Windows Defender Firewall with Advanced Security" and open it.
- Click on "Inbound Rules" in the left pane.
- In the right pane, click "New Rule..."
- Rule Type: Select "Port," click "Next."
- Protocol and Ports: Select "TCP." Select "Specific local ports:" and enter 5666. Click "Next."
- Action: Select "Allow the connection." Click "Next."
- Profile: Keep "Domain," "Private," and "Public" checked (or adjust based on your network profile). Click "Next."
- Name: Enter a descriptive name, e.g., Nagios NRPE (NSClient++).
- Scope (Optional but Recommended): In the rule properties (after creation or during), go to the "Scope" tab. Under "Remote IP address," choose "These IP addresses," click "Add," and enter the IP address of your Nagios server (VM1, e.g., 192.168.1.100). Click "OK."
- Click "Finish."
Part 2: Configure the Nagios Server (VM1)
-
Log in to VM1.
-
Test check_nrpe from VM1 to VM3: Replace 192.168.1.102 with VM3's actual IP address.
/usr/local/nagios/libexec/check_nrpe -H 192.168.1.102
Expected output: I seem to be doing fine... or an NSClient++ version string. If it fails (timeout, connection refused):
- Verify NSClient++ service is running on VM3.
- Check allowed hosts in nsclient.ini on VM3.
- Check Windows Firewall rule on VM3.
- Ensure both sides agree on SSL. With use ssl = false in nsclient.ini, check_nrpe must also run without SSL: either it was compiled without SSL support, or you pass its -n option to disable SSL for the call.
Now test the aliases you defined (or direct commands if you prefer):
/usr/local/nagios/libexec/check_nrpe -H 192.168.1.102 -t 30 -c alias_cpu_long
/usr/local/nagios/libexec/check_nrpe -H 192.168.1.102 -t 30 -c alias_mem_phys
/usr/local/nagios/libexec/check_nrpe -H 192.168.1.102 -t 30 -c alias_disk_c_space
/usr/local/nagios/libexec/check_nrpe -H 192.168.1.102 -t 30 -c alias_win_uptime
Each should return an OK status with some output.
-
Define check_nrpe Command in Nagios (VM1) (if not already done): This should already exist from the Linux NRPE workshop. Verify in /usr/local/nagios/etc/objects/commands.cfg:
define command{
    command_name    check_nrpe
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 30 -c $ARG1$ $ARG2$
}
Adding $ARG2$ allows passing further arguments from Nagios service definitions if needed, which is common when directly calling NSClient++ internal checks (e.g. check_nrpe!check_cpu!-a warn=80 crit=90). If you only use aliases defined on the client, $ARG1$ is sufficient. Using $ARG2$ provides more flexibility.
-
Create Configuration File for VM3 on Nagios Server (VM1): In the /usr/local/nagios/etc/servers/ directory (created in previous workshop), create a configuration file for VM3 and add the following content:
define host{
    use                     generic-host            ; Or create a 'windows-server' template
    host_name               vm3-remote-windows
    alias                   Remote Windows VM3
    address                 192.168.1.102           ; <<< IP Address of VM3 (Windows host)
    contact_groups          admins
    ; For Windows, check-host-alive (ping) is usually fine as the default check_command
    ; If you create a windows-server template, you can set specific icons, etc.
    icon_image              win40.gif               ; Example icon (if images are in Nagios share)
    statusmap_image         win40.gd2
}
define service{
    use                     generic-service
    host_name               vm3-remote-windows
    service_description     Windows CPU Usage
    check_command           check_nrpe!alias_cpu_long
}
define service{
    use                     generic-service
    host_name               vm3-remote-windows
    service_description     Windows Memory Usage
    check_command           check_nrpe!alias_mem_phys
}
define service{
    use                     generic-service
    host_name               vm3-remote-windows
    service_description     Windows C Drive Space
    check_command           check_nrpe!alias_disk_c_space
}
define service{
    use                     generic-service
    host_name               vm3-remote-windows
    service_description     Windows Uptime
    check_command           check_nrpe!alias_win_uptime
}
Replace 192.168.1.102 with VM3's actual IP. Save and exit.
-
Verify Nagios Configuration and Reload (VM1):
-
Check Nagios Web Interface (VM1): Open your Nagios web UI.
- Go to "Hosts". You should see vm3-remote-windows.
- Go to "Services". You should see the four new services associated with vm3-remote-windows.
- They will be in "Pending" state initially. After a few minutes, they should update with their actual status from VM3. You should see CPU, Memory, Disk, and Uptime information.
Outcome:
You have successfully configured Nagios to monitor key metrics on a remote Windows host (VM3) using NSClient++ with the NRPE protocol. This involved:
- Installing and configuring NSClient++ on the Windows host (VM3).
- Defining command aliases in nsclient.ini for the checks.
- Configuring the Windows Firewall.
- Testing connectivity and command execution with check_nrpe from the Nagios server (VM1).
- Defining the new Windows host and its NRPE-based services in the Nagios server configuration.
- Verifying the setup in the Nagios web interface.
This workshop demonstrates the versatility of NRPE and how NSClient++ enables comprehensive Windows monitoring within a Nagios environment. For production, remember to enable and configure SSL for NRPE communication between Nagios and NSClient++.
Writing Custom Nagios Plugins
While Nagios comes with a vast library of official and community-contributed plugins, there will inevitably be situations where you need to monitor something unique to your environment for which no existing plugin is suitable. This is where writing custom Nagios plugins becomes essential. Nagios plugins are simple executables or scripts that adhere to a specific contract regarding exit codes and output format.
Plugin Development Guidelines:
- Executable: The plugin must be an executable file (script or compiled program). Common choices for scripts are Bash, Perl, Python, Ruby, or PowerShell (if executed via an agent like NSClient++).
- Exit Codes: The plugin must terminate with one of the following exit codes to indicate the status of the check:
  - 0: OK - The service is functioning correctly.
  - 1: WARNING - The service is in a warning state (e.g., approaching a threshold).
  - 2: CRITICAL - The service is in a critical state (e.g., threshold exceeded, service down).
  - 3: UNKNOWN - The status of the service could not be determined (e.g., plugin error, invalid arguments, resource unavailable).
  Any other exit code will typically be treated as UNKNOWN by Nagios, or may result in an error like "(Return code of X is out of bounds)".
- Output Format (STDOUT): The plugin must print at least one line of human-readable text to standard output (STDOUT). This is the primary status information displayed in the Nagios UI.
  - Single-Line Output:
    SERVICESTATUS: Plugin message | optional_performance_data
    Example: DISK OK - / (sda1) is 78% full. | /=5079MB;15280;17190;0;19100
  - Multi-Line Output (less common for main line, but possible for extended info): The first line follows the single-line format. Subsequent lines can provide additional details.
    SERVICESTATUS: Primary plugin message | optional_performance_data
    Additional line 1
    Additional line 2
    Nagios primarily cares about the first line for the main status text and performance data.
- Performance Data (Perfdata): Optionally, plugins can return performance data, which Nagios can process and store for graphing (e.g., with PNP4Nagios). Perfdata is appended to the first line of output after a pipe symbol (|).
  Format: 'label'=value[UOM];[warn];[crit];[min];[max]
  - label: A string label for the datasource (e.g., load1, disk_usage_c). Should be short and avoid spaces or special characters (except underscore).
  - value: The actual value of the metric (integer or float).
  - UOM (Unit of Measure): Optional. E.g., s (seconds), %, B (bytes), MB, GB, TB, c (a continuous counter).
  - warn: Optional warning threshold for this metric.
  - crit: Optional critical threshold for this metric.
  - min: Optional minimum value for graphing.
  - max: Optional maximum value for graphing.
  Multiple perfdata metrics can be returned, separated by spaces: ... | metric1=value1;w1;c1 metric2=value2;w2;c2 ...
  Example: CPU_LOAD OK - 1 min: 0.05, 5 min: 0.10, 15 min: 0.15 | load1=0.05;10;15;0 load5=0.10;8;12;0 load15=0.15;5;10;0
- Error Output (STDERR): Plugins should ideally not print anything to standard error (STDERR) during normal operation. STDERR output is often captured by Nagios but not displayed as the primary status. It might be logged or shown in extended info. For critical plugin failures, exiting with status 3 (UNKNOWN) and providing an error message on STDOUT is preferred.
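The exit-code and perfdata conventions above can be wrapped in two small shell helpers. This is a minimal sketch rather than an official library: the function names threshold_state and perfdata_token are made up for illustration, thresholds are treated as simple "greater than" integer limits, and trailing empty perfdata fields are trimmed.

```shell
# Map an integer value to a Nagios state code using "greater than" thresholds.
# Usage: threshold_state value warn crit
# Prints 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN for non-numeric input).
threshold_state() {
    for n in "$1" "$2" "$3"; do
        case "$n" in
            ''|*[!0-9]*) echo 3; return 0 ;;
        esac
    done
    if [ "$1" -gt "$3" ]; then
        echo 2
    elif [ "$1" -gt "$2" ]; then
        echo 1
    else
        echo 0
    fi
}

# Build one perfdata token: label=value[UOM];warn;crit;min;max
# Usage: perfdata_token label value uom warn crit min max
# Pass an empty string for any optional field that is not set.
perfdata_token() {
    token="$1=$2$3;$4;$5;$6;$7"
    # Trim trailing semicolons left behind by unset optional fields.
    while [ "${token%;}" != "$token" ]; do
        token=${token%;}
    done
    printf '%s\n' "$token"
}

state=$(threshold_state 85 80 90)
printf 'state code: %s | %s\n' "$state" "$(perfdata_token usage 85 % 80 90 0 100)"
```

A real plugin would finish with exit "$state" after printing its status line; the sketch only prints the code so that it can be pasted into an interactive shell without closing it.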
Choosing a Scripting Language:
- Bash: Excellent for simple checks, file system operations, or wrapping existing command-line tools. Widely available on Linux.
- Perl: Historically very popular for Nagios plugins due to its strong text processing and regular expression capabilities. Many existing plugins are written in Perl. The Nagios::Plugin Perl module (since renamed Monitoring::Plugin) simplifies development.
- Python: Increasingly popular due to its readability, extensive libraries, and ease of use. The nagiosplugin Python library can be helpful.
- PowerShell: The go-to for Windows-specific checks if you are writing scripts to be executed locally on a Windows machine (e.g., via NSClient++'s external script capabilities).
Basic Plugin Structure (Conceptual):
- Parse command-line arguments (thresholds, host/port, etc.).
- Perform the check (e.g., read a file, query an API, run a command).
- Analyze the result against thresholds.
- Determine the status (OK, WARNING, CRITICAL, UNKNOWN).
- Construct the output string (including performance data if applicable).
- Print the output string to STDOUT.
- Exit with the appropriate status code.
Example Bash Script Plugin:
Let's create a simple Bash script plugin that checks if a specific file exists and optionally if its size is within certain limits.
check_custom_file.sh
#!/bin/bash
# Exit codes
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

# Default values
FILE_PATH=""
WARN_SIZE_KB="" # Warn if size is GREATER than this in KB
CRIT_SIZE_KB="" # Critical if size is GREATER than this in KB
CHECK_EXISTS_ONLY=false

# --- Helper function for usage ---
print_usage() {
    echo "Usage: $0 -f <file_path> [-w <warn_size_kb> -c <crit_size_kb>] [-e]"
    echo " -f <file_path>: Path to the file to check."
    echo " -w <warn_size_kb>: Warning threshold for file size in KB (optional)."
    echo " -c <crit_size_kb>: Critical threshold for file size in KB (optional)."
    echo " -e: Check for existence only, ignore size checks (optional)."
    exit $STATE_UNKNOWN
}

# --- Parse command line arguments ---
while getopts "f:w:c:e" opt; do
    case ${opt} in
        f) FILE_PATH="${OPTARG}" ;;
        w) WARN_SIZE_KB="${OPTARG}" ;;
        c) CRIT_SIZE_KB="${OPTARG}" ;;
        e) CHECK_EXISTS_ONLY=true ;;
        *) print_usage ;;
    esac
done

# --- Validate arguments ---
if [ -z "${FILE_PATH}" ]; then
    echo "UNKNOWN: File path (-f) is mandatory."
    exit $STATE_UNKNOWN
fi
if ! ${CHECK_EXISTS_ONLY}; then
    if [ -n "${WARN_SIZE_KB}" ] && ! [[ "${WARN_SIZE_KB}" =~ ^[0-9]+$ ]]; then
        echo "UNKNOWN: Warning size (-w) must be a positive integer."
        exit $STATE_UNKNOWN
    fi
    if [ -n "${CRIT_SIZE_KB}" ] && ! [[ "${CRIT_SIZE_KB}" =~ ^[0-9]+$ ]]; then
        echo "UNKNOWN: Critical size (-c) must be a positive integer."
        exit $STATE_UNKNOWN
    fi
    if [ -n "${WARN_SIZE_KB}" ] && [ -n "${CRIT_SIZE_KB}" ] && [ "${WARN_SIZE_KB}" -ge "${CRIT_SIZE_KB}" ]; then
        echo "UNKNOWN: Warning size (-w) must be less than critical size (-c)."
        exit $STATE_UNKNOWN
    fi
fi

# --- Perform the check ---
if [ ! -f "${FILE_PATH}" ]; then
    echo "CRITICAL: File '${FILE_PATH}' does not exist or is not a regular file."
    exit $STATE_CRITICAL
fi
if ${CHECK_EXISTS_ONLY}; then
    echo "OK: File '${FILE_PATH}' exists."
    exit $STATE_OK
fi

# Check size (if thresholds are provided)
# Note: 'stat -c%s' is the GNU form; BSD/macOS stat uses 'stat -f%z' instead.
FILE_SIZE_BYTES=$(stat -c%s "${FILE_PATH}")
FILE_SIZE_KB=$((FILE_SIZE_BYTES / 1024))

# Always emit both threshold fields so value, warn, crit, and min stay in their
# positions even when only one threshold was given (empty fields are allowed).
PERFDATA="size=${FILE_SIZE_KB}KB;${WARN_SIZE_KB};${CRIT_SIZE_KB};0"
STATUS_MSG_PREFIX="File '${FILE_PATH}' size is ${FILE_SIZE_KB}KB"

# Check critical threshold first
if [ -n "${CRIT_SIZE_KB}" ] && [ "${FILE_SIZE_KB}" -gt "${CRIT_SIZE_KB}" ]; then
    echo "CRITICAL: ${STATUS_MSG_PREFIX} (Threshold > ${CRIT_SIZE_KB}KB) | ${PERFDATA}"
    exit $STATE_CRITICAL
fi

# Check warning threshold
if [ -n "${WARN_SIZE_KB}" ] && [ "${FILE_SIZE_KB}" -gt "${WARN_SIZE_KB}" ]; then
    echo "WARNING: ${STATUS_MSG_PREFIX} (Threshold > ${WARN_SIZE_KB}KB) | ${PERFDATA}"
    exit $STATE_WARNING
fi

# If we reach here, it's OK
echo "OK: ${STATUS_MSG_PREFIX} | ${PERFDATA}"
exit $STATE_OK
Integrating the Custom Plugin with Nagios:
-
Place the Plugin: Copy your plugin script (e.g., check_custom_file.sh) to the Nagios plugins directory on the Nagios server (/usr/local/nagios/libexec in this guide), or to the remote host if it's to be run by NRPE/NSClient++, and make sure it is executable.
-
Test the Plugin from Command Line: Always test thoroughly from the command line before defining it in Nagios.
# As nagios user for realistic permissions test
sudo -u nagios /usr/local/nagios/libexec/check_custom_file.sh -f /var/log/syslog -w 10000 -c 20000
# Expected: OK: File '/var/log/syslog' size is XXXKB | size=XXXKB;10000;20000;0

sudo -u nagios /usr/local/nagios/libexec/check_custom_file.sh -f /tmp/nonexistentfile -e
# Expected: CRITICAL: File '/tmp/nonexistentfile' does not exist...

# Create a large test file
sudo fallocate -l 25M /tmp/largefile.test
sudo -u nagios /usr/local/nagios/libexec/check_custom_file.sh -f /tmp/largefile.test -w 10000 -c 20000 # roughly 10MB, 20MB
# Expected: CRITICAL: File '/tmp/largefile.test' size is 25600KB (Threshold > 20000KB) | size=25600KB;10000;20000;0
sudo rm /tmp/largefile.test
-
Define a Nagios Command: Add a command definition in /usr/local/nagios/etc/objects/commands.cfg:
define command{
    command_name    check_custom_file
    command_line    $USER1$/check_custom_file.sh -f $ARG1$ -w $ARG2$ -c $ARG3$ $ARG4$
    ; ARG1=file, ARG2=warn_size, ARG3=crit_size
    ; ARG4 can be used for extra args like -e
}
- $ARG1$: Will be the file path.
- $ARG2$: Will be the warning size threshold.
- $ARG3$: Will be the critical size threshold.
- $ARG4$: Could be used to pass -e if checking existence only, or other optional flags.
-
Define a Nagios Service: Add a service definition for a host (e.g., localhost in localhost.cfg):
define service{
    use                     local-service
    host_name               localhost
    service_description     Syslog Size Check
    check_command           check_custom_file!/var/log/syslog!102400!204800
    ; Check /var/log/syslog, Warn > 100MB, Crit > 200MB
}
define service{
    use                     local-service
    host_name               localhost
    service_description     Important Config File Exists
    check_command           check_custom_file!/etc/my_app/important.conf!!!-e
    ; Check /etc/my_app/important.conf exists.
    ; Note the empty $ARG2$ and $ARG3$ (warn/crit sizes)
    ; $ARG4$ is -e for existence check
}
For the second service, notice the !!! after the file path: the two empty fields mean $ARG2$ and $ARG3$ are empty, and -e is passed as $ARG4$. The expanded command line still contains -w and -c without values, so you might need to adjust the command definition if it doesn't gracefully handle empty optional arguments passed this way, or create a separate command for the existence check. A more robust approach in commands.cfg could be:
# Command for file size check
define command{
    command_name    check_custom_file_size
    command_line    $USER1$/check_custom_file.sh -f $ARG1$ -w $ARG2$ -c $ARG3$
}
# Command for file existence check
define command{
    command_name    check_custom_file_exists
    command_line    $USER1$/check_custom_file.sh -f $ARG1$ -e
}
Then the service definitions would use check_custom_file_size or check_custom_file_exists accordingly. This is often cleaner.
-
Verify and Reload Nagios:
The new service(s) should appear in the Nagios UI.
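The caveat about empty $ARG2$ and $ARG3$ can be reproduced without Nagios. In this sketch, parse_like_plugin is a hypothetical function that reuses the option string of check_custom_file.sh and reports what getopts stored; the argument list mirrors the command line that results when the two threshold macros expand to nothing.

```shell
# Hypothetical stand-in for the option parsing in check_custom_file.sh.
parse_like_plugin() {
    file="" warn="" crit="" exists_only=false
    OPTIND=1    # reset so the function can be called more than once
    while getopts "f:w:c:e" opt; do
        case "$opt" in
            f) file=$OPTARG ;;
            w) warn=$OPTARG ;;
            c) crit=$OPTARG ;;
            e) exists_only=true ;;
            *) echo "usage error"; return 3 ;;
        esac
    done
    echo "file=$file warn=$warn crit=$crit exists_only=$exists_only"
}

# Empty $ARG2$ and $ARG3$ leave "-w -c -e" on the command line:
parse_like_plugin -f /etc/my_app/important.conf -w -c -e
```

The output shows warn=-c and an empty crit: getopts takes the next word as the value of -w, even though it looks like an option. The existence check still works only because -e skips the size validation, which is a good reason to prefer the two dedicated commands.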
Writing custom plugins is a powerful skill that allows you to tailor Nagios to almost any monitoring requirement. Remember to test thoroughly and handle errors gracefully within your plugin.
Workshop Creating a Simple Bash Script Plugin
Objective:
Write a Bash script plugin to check the number of active SSH sessions on the Nagios server (localhost). The plugin should issue a WARNING if the count exceeds a threshold and CRITICAL if it exceeds another.
Plugin Logic:
- The plugin is built around the who command. A stricter version could grep its output for pts/ (common for SSH sessions) or for specific IP patterns if known; for this workshop it simply counts the lines from who, each of which indicates an active terminal session and often correlates with SSH sessions on a server.
- It will accept warning (-w) and critical (-c) thresholds as arguments.
- It will output the number of sessions and performance data.
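If counting every who line turns out to be too coarse, the filter can be tightened. The sketch below counts only entries whose last field is a remote host in parentheses and skips local X displays such as (:0). count_remote_sessions is a hypothetical helper that reads who-style text on standard input, so it can be tried with canned data; the exact column layout of who varies between systems, so treat the pattern as an assumption to verify on your host.

```shell
# Hypothetical helper: count who-style lines that end in "(remote-host)".
count_remote_sessions() {
    awk '$NF ~ /^\(.*\)$/ && $NF !~ /^\(:/ { n++ } END { print n + 0 }'
}

# Canned sample instead of live `who` output:
printf '%s\n' \
    'alice    pts/0   2024-05-01 10:00 (192.168.1.5)' \
    'bob      tty1    2024-05-01 09:00' \
    'carol    pts/1   2024-05-01 11:00 (10.0.0.7)' \
    'dave     :0      2024-05-01 08:00 (:0)' |
    count_remote_sessions
```

On a live system the same function would be fed with who | count_remote_sessions, and its result could replace the plain line count in the workshop script.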
Steps:
-
Create the Plugin Script on the Nagios Server (VM1): Navigate to a temporary directory or your preferred script development location. Create a file named check_ssh_sessions.sh (for example with nano) and paste the following Bash script content:
#!/bin/bash
# Nagios Exit Codes
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

# Default thresholds (can be overridden by arguments)
WARN_THRESHOLD=""
CRIT_THRESHOLD=""

# --- Helper function for usage ---
print_usage() {
    echo "Usage: $0 -w <warn_sessions> -c <crit_sessions>"
    echo " -w <warn_sessions>: Warning threshold for number of active SSH sessions."
    echo " -c <crit_sessions>: Critical threshold for number of active SSH sessions."
    exit $STATE_UNKNOWN
}

# --- Parse Arguments ---
while getopts "w:c:" opt; do
    case ${opt} in
        w) WARN_THRESHOLD="${OPTARG}" ;;
        c) CRIT_THRESHOLD="${OPTARG}" ;;
        *) print_usage ;;
    esac
done

# --- Validate Arguments ---
if [ -z "${WARN_THRESHOLD}" ] || ! [[ "${WARN_THRESHOLD}" =~ ^[0-9]+$ ]]; then
    echo "UNKNOWN: Warning threshold (-w) must be a positive integer."
    exit $STATE_UNKNOWN
fi
if [ -z "${CRIT_THRESHOLD}" ] || ! [[ "${CRIT_THRESHOLD}" =~ ^[0-9]+$ ]]; then
    echo "UNKNOWN: Critical threshold (-c) must be a positive integer."
    exit $STATE_UNKNOWN
fi
if [ "${WARN_THRESHOLD}" -ge "${CRIT_THRESHOLD}" ]; then
    echo "UNKNOWN: Warning threshold (-w) must be less than critical threshold (-c)."
    exit $STATE_UNKNOWN
fi

# --- Perform the Check ---
# For this workshop, we count all lines from 'who' as a proxy for logged-in users.
# This is a simplistic approach: if you have many local console logins, the count
# will include them. A more specific filter might be `who | grep -c '(.*)'`, or you
# could count established connections to the local SSH port, e.g. the lines listed by
# `ss -tn state established '( sport = :ssh )'`.
CURRENT_SESSIONS=$(who | wc -l)

# --- Determine Status and Output ---
OUTPUT_MSG="Active sessions: ${CURRENT_SESSIONS}"
PERFDATA="sessions=${CURRENT_SESSIONS};${WARN_THRESHOLD};${CRIT_THRESHOLD};0" # Min value 0

if [ "${CURRENT_SESSIONS}" -ge "${CRIT_THRESHOLD}" ]; then
    echo "CRITICAL: ${OUTPUT_MSG} | ${PERFDATA}"
    exit $STATE_CRITICAL
elif [ "${CURRENT_SESSIONS}" -ge "${WARN_THRESHOLD}" ]; then
    echo "WARNING: ${OUTPUT_MSG} | ${PERFDATA}"
    exit $STATE_WARNING
else
    echo "OK: ${OUTPUT_MSG} | ${PERFDATA}"
    exit $STATE_OK
fi
Save the file (Ctrl+X, Y, Enter in nano).
-
Make the Plugin Executable and Move it:
chmod +x check_ssh_sessions.sh
sudo mv check_ssh_sessions.sh /usr/local/nagios/libexec/
-
Test the Plugin from the Command Line: Log in via SSH to your Nagios server a few times from different terminals to create some sessions. Then run these tests:
# Test OK state (assuming you have < 3 sessions)
/usr/local/nagios/libexec/check_ssh_sessions.sh -w 3 -c 5
# Expected: OK: Active sessions: X | sessions=X;3;5;0 (where X is your session count)

# Test WARNING state (adjust -w so current sessions >= warning but < critical)
# If you have 2 sessions, test with:
/usr/local/nagios/libexec/check_ssh_sessions.sh -w 1 -c 5
# Expected: WARNING: Active sessions: 2 | sessions=2;1;5;0

# Test CRITICAL state (adjust -c so current sessions >= critical)
# If you have 2 sessions, test with:
/usr/local/nagios/libexec/check_ssh_sessions.sh -w 1 -c 2
# Expected: CRITICAL: Active sessions: 2 | sessions=2;1;2;0

# Test argument validation
/usr/local/nagios/libexec/check_ssh_sessions.sh -w 5 -c 3 # Warn >= Crit
# Expected: UNKNOWN: Warning threshold (-w) must be less than critical threshold (-c).
/usr/local/nagios/libexec/check_ssh_sessions.sh -w foo -c bar
# Expected: UNKNOWN: Warning threshold (-w) must be a positive integer.
-
Define a Nagios Command for the Plugin: Open /usr/local/nagios/etc/objects/commands.cfg on your Nagios server (VM1) and add the following command definition:
define command{
    command_name    check_active_ssh_sessions
    command_line    $USER1$/check_ssh_sessions.sh -w $ARG1$ -c $ARG2$
}
Save and exit.
-
Define a Nagios Service to Use the Plugin: We'll add this service to monitor localhost (the Nagios server itself). Open /usr/local/nagios/etc/objects/localhost.cfg and add the following service definition:
define service{
    use                     local-service ; Name of service template to use
    host_name               localhost
    service_description     Active SSH Sessions
    check_command           check_active_ssh_sessions!3!5
    ; Warn if >= 3 sessions, Crit if >= 5 sessions
}
Adjust the thresholds 3 and 5 as appropriate for your server. Save and exit.
-
Verify Nagios Configuration and Reload:
-
Check in Nagios Web Interface:
- Open your Nagios web UI.
- Go to "Services." Look for the "Active SSH Sessions" service associated with localhost.
- It will initially be "Pending." After Nagios runs the check, it should display the status (OK, WARNING, or CRITICAL) based on the current number of sessions and your defined thresholds.
- Click on the service name to see the status output and performance data.
Outcome:
You have successfully created a custom Bash script plugin, integrated it into Nagios, and are now monitoring the number of active SSH sessions on your Nagios server. This workshop covered:
- Writing a Bash script that adheres to Nagios plugin guidelines (exit codes, output).
- Implementing argument parsing and validation.
- Performing a system check (who | wc -l).
- Formatting output with status messages and performance data.
- Defining the corresponding Nagios command and service.
- Testing the plugin thoroughly.
This practical experience provides a solid foundation for developing more complex custom plugins tailored to your specific monitoring needs.
Advanced Object Configuration
Nagios's object configuration provides powerful features to manage complex environments efficiently. Using host groups, service groups, templates, timeperiods, escalations, and dependencies allows you to create a more organized, maintainable, and intelligent monitoring setup.
Host Groups and Service Groups:
-
Host Groups: Collections of hosts. They simplify management and viewing. For example, you can create groups like linux-servers, windows-servers, web-servers, database-servers, network-switches.
- Viewing: The Nagios UI allows filtering by host group.
- Configuration: A service definition can reference hostgroup_name so that one check applies to every member host. Contact groups, by contrast, are assigned on hosts and services (usually through templates) rather than on the host group itself.
- Dependencies & Escalations: Can be defined based on host groups.
Definition (hostgroups.cfg or similar):
define hostgroup{
    hostgroup_name  web-servers
    alias           All Web Servers
    members         webserver01,webserver02,apache-prod ; Comma-separated list of host_name's
}
A host can be a member of multiple host groups, either by listing its host_name in the members directive of several host groups or by specifying the hostgroups directive in its own host definition (for example, hostgroups web-servers,linux-servers).
-
Service Groups: Collections of services. Similar to host groups, they aid in organization and viewing. A service group can contain services from different hosts.
- Viewing: The UI allows filtering by service group.
- Business Process Monitoring: Can be used to group services that constitute a critical business process (e.g., all services related to an e-commerce application).
Definition (servicegroups.cfg or similar): a servicegroup object takes a servicegroup_name, an alias, and a members list written as comma-separated host_name,service_description pairs. Alternatively, assign a service to service groups with the servicegroups directive in its own service definition.
Templates for Hosts and Services (Inheritance):
Templates are one of the most powerful features for reducing redundancy and simplifying configuration. You define a set of common properties in a template, and then host or service definitions can use that template to inherit those properties.
- How Inheritance Works:
  - Objects inherit all properties from the template(s) they use.
  - Properties defined directly in the object override those inherited from the template.
  - Multiple templates can be used (comma-separated list for the use directive). When the same property is set in more than one of them, the template listed first takes precedence.
  - Templates can also inherit from other templates, creating a hierarchy.
- register 0: Templates themselves are not actual hosts or services to be monitored. The register 0 directive in a template definition tells Nagios not to register it as a live object.
Example (templates.cfg):
# Generic host template
define host{
name generic-host ; Name of this template
notifications_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
notification_period 24x7
check_period 24x7
max_check_attempts 5
check_interval 5 ; Check every 5 minutes
retry_interval 1 ; Retry every 1 minute on failure
contact_groups admins ; Default contact group
register 0 ; THIS IS A TEMPLATE, DO NOT REGISTER
}
# Linux server template inheriting from generic-host
define host{
name linux-server
use generic-host ; Inherit from generic-host
check_command check-host-alive-ping ; Specific check command for Linux
icon_image linux40.png ; (Assuming you have icons)
statusmap_image linux40.gd2
register 0
}
# Generic service template
define service{
name generic-service
active_checks_enabled 1
passive_checks_enabled 1
parallelize_check 1
obsess_over_service 1
check_freshness 0
notifications_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 10 ; Check every 10 minutes
retry_check_interval 2 ; Retry every 2 minutes on failure
contact_groups admins
notification_options w,u,c,r ; Notify on warning, unknown, critical, recovery
notification_interval 60 ; Re-notify every 60 minutes for ongoing problem
notification_period 24x7
register 0
}
# In a host definition file (e.g., mywebserver.cfg)
define host{
use linux-server ; Inherits all settings from linux-server & generic-host
host_name my-web-01
alias Production Web Server 01
address 192.168.1.50
contact_groups web-admins,db-admins ; Overrides 'admins' from template
}
# In a service definition file
define service{
use generic-service
host_name my-web-01
service_description HTTP Check
check_command check_http
normal_check_interval 5 ; Override default of 10 mins from generic-service
}
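The override rule at work in this example (a directive set on the object wins over the same directive inherited from a template) can be imitated with a few lines of shell. This is a toy model, not how Nagios parses its configuration: resolve_directives is a hypothetical function that reads name=value lines ordered from most specific (the host itself) to least specific (generic-host) and keeps the first value seen for each name.

```shell
# Toy model of template inheritance: the first value seen for a directive wins.
resolve_directives() {
    awk -F= '!seen[$1]++'
}

{
    # my-web-01 (the object itself)
    echo 'contact_groups=web-admins,db-admins'
    echo 'address=192.168.1.50'
    # linux-server template
    echo 'check_command=check-host-alive-ping'
    # generic-host template
    echo 'contact_groups=admins'
    echo 'max_check_attempts=5'
} | resolve_directives
```

The contact_groups=admins line from generic-host is dropped because my-web-01 already set that directive, while max_check_attempts=5 survives because nothing closer to the object defines it.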
Timeperiods:
Timeperiods define when Nagios can perform checks and send notifications.
- Usage: Assigned to hosts, services (for check_period and notification_period), and contacts (for host_notification_period and service_notification_period).
- 24x7 is a common default. Others like workhours, nonworkhours, none (to disable) are useful.
Definition (timeperiods.cfg):
define timeperiod{
timeperiod_name us-workhours
alias Normal US Work Hours (Mon-Fri, 9am-5pm EST)
monday 09:00-17:00
tuesday 09:00-17:00
wednesday 09:00-17:00
thursday 09:00-17:00
friday 09:00-17:00
# Omitting saturday and sunday means they are not part of this timeperiod
}
define timeperiod{
timeperiod_name none
alias No Time Is A Good Time
# No day directives = never
}
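To make the semantics of us-workhours concrete, the following shell sketch mimics how a weekday-plus-range timeperiod is evaluated. It is an illustration only, not Nagios code: the function name in_us_workhours and its arguments (ISO weekday 1-7 and a zero-padded HHMM string) are assumptions made for this example, and the end of the range is treated here as exclusive.

```shell
#!/usr/bin/env bash
# Illustrative only: mimics the us-workhours timeperiod (Mon-Fri, 09:00-17:00).
# in_us_workhours is a hypothetical helper, not part of Nagios.
# $1 = ISO weekday (1=Monday ... 7=Sunday), $2 = time as zero-padded HHMM.
in_us_workhours() {
    local dow=$1 hhmm=$2
    # Saturday (6) and Sunday (7) have no directive, so they never match.
    (( dow >= 1 && dow <= 5 )) || return 1
    # 10# forces base 10 so values like 0900 are not parsed as octal.
    (( 10#$hhmm >= 900 && 10#$hhmm < 1700 ))
}

# Example: evaluate the current local time.
if in_us_workhours "$(date +%u)" "$(date +%H%M)"; then
    echo "inside us-workhours"
else
    echo "outside us-workhours"
fi
```

A timeperiod with no day directives, like none above, would correspond to a function that always returns failure.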
Escalations (Host and Service):
Escalations define modified notification rules if a problem persists. For example, notify a manager if a critical service is still down after 1 hour.
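- Trigger: Based on the number of notifications already sent for the problem (first_notification and last_notification). Because notifications repeat at a fixed notification_interval, that count also determines, indirectly, how long the problem has lasted before the escalation applies.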
- Action: Can notify different contact groups, use different notification intervals, or limit the escalation period.
Definition (escalations.cfg or similar):
define serviceescalation{
host_name my-critical-server
service_description Main Application Service
first_notification 3 ; Escalate after the 3rd notification for this service
last_notification 0 ; 0 means escalate for all subsequent notifications
contact_groups oncall-level2,managers ; Notify these groups
notification_interval 30 ; Notify these escalated contacts every 30 mins
escalation_period 24x7 ; Escalate during this timeperiod
escalation_options w,u,c ; Escalate for warning, unknown, critical states
}
define hostescalation{
hostgroup_name database-servers
first_notification 2
last_notification 5
contact_groups db-managers
notification_interval 60
escalation_options d,u ; Escalate for down, unreachable states
}
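Because first_notification counts notifications rather than minutes, it helps to translate it into elapsed time. The sketch below assumes that notifications repeat at a fixed notification_interval (in minutes, with the default interval_length of 60 seconds) until the escalation takes over; escalation_start_minutes is a hypothetical helper written for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: minutes between the first problem notification and the
# first escalated one, assuming a constant notification_interval until then.
# $1 = notification_interval (minutes), $2 = first_notification
escalation_start_minutes() {
    local interval=$1 first_notification=$2
    echo $(( (first_notification - 1) * interval ))
}

# generic-service uses notification_interval 60:
escalation_start_minutes 60 3   # first_notification 3 -> prints 120
escalation_start_minutes 60 2   # first_notification 2 -> prints 60
```

Under these assumptions, the serviceescalation above (first_notification 3) would begin roughly two hours after the first notification; reaching a manager after about one hour would call for first_notification 2.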
Dependencies (Host and Service):
Dependencies define relationships between hosts or services to prevent a flood of notifications during widespread outages and enable smarter root cause analysis.
- Purpose: If a "parent" host/service is down/critical, Nagios can suppress notifications for "child" hosts/services that depend on it. Checks for dependent items might also be suppressed.
- Example: If a core switch is down, all servers connected to it will become unreachable. Defining the servers as dependent on the switch prevents notifications for each server, only alerting for the switch.
- Execution Dependency: The check for the dependent item will not run if the dependency is not met.
- Notification Dependency: Notifications for the dependent item will be suppressed if the dependency is not met.
- Dependency Period: Defines when the dependency is active.
Definition (dependencies.cfg or similar):
define hostdependency{
host_name my-core-switch ; Master host (the one being depended upon)
dependent_host_name my-web-server ; Dependent host
notification_failure_criteria d,u ; If switch is d=DOWN or u=UNREACHABLE...
; ...suppress notifications for my-web-server
# execution_failure_criteria can also be d,u,o,p,n (o=UP, p=PENDING, n=NONE)
}
define servicedependency{
host_name my-db-server
service_description Database Service ; Master service (the one being depended upon)
dependent_host_name my-app-server
dependent_service_description Application UI ; Dependent service
notification_failure_criteria w,u,c ; If DB service is W, U, or C...
; ...suppress notifications for App UI service
inherits_parent 1 ; Also fail if a dependency of Database Service itself has failed
}
In both object types, host_name (and service_description) identify the master object, while dependent_host_name (and dependent_service_description) identify the object that depends on it.
inherits_parent=1 means this dependency inherits the dependencies of the master service: if Database Service itself depends on other services and one of those dependencies fails, this dependency is also considered failed.
Benefits of Advanced Object Configuration:
- Reduced Redundancy: Templates make configurations DRY (Don't Repeat Yourself).
- Easier Management: Changes to templates propagate to all inheriting objects. Groups simplify bulk operations and views.
- Smarter Alerting: Escalations ensure critical issues get appropriate attention. Dependencies reduce notification noise and help pinpoint root causes.
- Flexibility: Timeperiods allow fine-grained control over check and notification timing.
Mastering these advanced object configurations is key to scaling your Nagios deployment and making it an indispensable tool rather than a source of alert fatigue.
Workshop Implementing Host Groups and Service Templates
Objective:
Organize existing monitored hosts into logical host groups and create/use a more specific service template for a common type of check (e.g., disk space checks).
Prerequisites:
- A working Nagios Core installation with at least localhost and one remote host (Linux or Windows) monitored.
- For this workshop, let's assume you have:
  - localhost (your Nagios server).
  - vm2-remote-linux (a remote Linux server).
  - vm3-remote-windows (a remote Windows server).
  - All hosts currently use generic templates like linux-server or generic-host, and services use generic-service or local-service.
Part 1: Implementing Host Groups
-
Define Host Groups: Create or edit a file for host group definitions, e.g., /usr/local/nagios/etc/objects/hostgroups.cfg. If this file doesn't exist, create it and ensure it's included in nagios.cfg (e.g., cfg_file=/usr/local/nagios/etc/objects/hostgroups.cfg).
Add the following definitions:
define hostgroup{
hostgroup_name linux-servers
alias All Linux Servers
members localhost, vm2-remote-linux ; Add other Linux host_names if you have them
}
define hostgroup{
hostgroup_name windows-servers
alias All Windows Servers
members vm3-remote-windows ; Add other Windows host_names if you have them
}
define hostgroup{
hostgroup_name all-servers
alias All Monitored Servers
members localhost, vm2-remote-linux, vm3-remote-windows ; Explicitly list all, or use hostgroup recursion later
}
Alternatively, instead of listing members here, you can add the hostgroups directive to each host definition. For this workshop, members in the hostgroup definition is fine.
-
Verify and Reload Nagios:
-
Check in Nagios Web Interface:
- Go to the "Host Groups" link in the navigation pane. You should see your newly defined groups: linux-servers, windows-servers, and all-servers.
- Click on each group name to see its members.
- Under "Host Group Grid," you'll see a matrix view.
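The "Verify and Reload Nagios" step in Part 1 can be sketched as a small shell function that reloads only when the pre-flight check passes. The paths assume the source install under /usr/local/nagios used elsewhere in this guide, and the systemd unit name nagios is an assumption; verify_and_reload and the three variables are names chosen for this example.

```shell
#!/usr/bin/env bash
# Paths assume a source install under /usr/local/nagios; override via environment.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"
# Assumption: Nagios runs as a systemd unit named "nagios".
RELOAD_CMD="${RELOAD_CMD:-sudo systemctl reload nagios}"

verify_and_reload() {
    # "nagios -v <main config file>" runs the configuration pre-flight check.
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG"; then
        $RELOAD_CMD
    else
        echo "Configuration check failed; Nagios was not reloaded." >&2
        return 1
    fi
}

# Run it after editing any object file:
#   verify_and_reload
```

Gating the reload on the check result avoids taking monitoring down with a syntax error in a freshly edited file.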
Part 2: Implementing a Specific Service Template for Disk Checks
Let's say you want all your disk space checks to have a slightly different retry interval or re-notification interval than generic-service.
-
Define the Disk Service Template: Open your templates file, e.g., /usr/local/nagios/etc/objects/templates.cfg. Add a new service template definition. This template will inherit from generic-service and then override specific values.
define service{
name disk-service-template
use generic-service ; Inherit from our main generic service
normal_check_interval 15 ; Check disk space every 15 minutes
retry_check_interval 3 ; Retry every 3 minutes on failure
notification_interval 120 ; Re-notify every 2 hours for ongoing disk issues
register 0 ; This is a template
# You could also add specific contact_groups here if disk alerts go to a storage team
# contact_groups admins,storage-team
}
-
Apply the New Template to Disk Services: Now, find your existing disk space service definitions and change them to use this new template.
- For localhost (e.g., in localhost.cfg): Find the "Root Partition" service (or similar disk check):
define service{
# use local-service ; Old template
use disk-service-template ; New template
host_name localhost
service_description Root Partition
check_command check_local_disk!20%!10%!/
}
If you have other disk checks on localhost (like /home or /var), update them too.
- For vm2-remote-linux (e.g., in servers/vm2-linux.cfg): Find the "Root Disk Space via NRPE" service and change its use directive in the same way.
- For vm3-remote-windows (e.g., in servers/vm3-windows.cfg): Find the "Windows C Drive Space" service and change its use directive in the same way.
-
Verify and Reload Nagios:
-
Observe Changes (Subtle):
- In the Nagios UI, go to the "Services" view.
- Click on one of the disk services you modified (e.g., "Root Partition" for localhost).
- In the detailed view, look at "Check Interval," "Retry Interval," and "Notification Interval." They should now reflect the values from disk-service-template (e.g., Check Interval 15 min, Retry Interval 3 min). This confirms the template inheritance is working.
- The actual check behavior (thresholds for warning/critical) remains defined by the check_command in the service definition itself.
Outcome:
You have successfully:
- Organized your hosts into logical hostgroups, making them easier to view and manage.
- Created a specialized service template (disk-service-template) that inherits from a more generic template and customizes certain parameters.
- Applied this new template to relevant disk space services, demonstrating how templates can enforce consistent settings (like check frequency or notification behavior) for similar types of checks across multiple hosts.
This workshop illustrates how using host groups and refining service templates leads to a more structured, maintainable, and scalable Nagios configuration. As your monitored environment grows, these practices become increasingly vital.
3. Advanced Nagios Management and Optimization
With a solid understanding of basic and intermediate Nagios concepts, we now explore advanced topics. This section delves into passive checks using NSCA for monitoring asynchronous events, implementing event handlers for automated problem remediation, strategies for performance tuning and scaling your Nagios instance, essential security best practices, and a brief look at extending Nagios with popular addons. These advanced techniques will help you build a more robust, efficient, and secure monitoring solution.
Passive Checks and NSCA
So far, we've primarily focused on active checks, where Nagios initiates checks for hosts and services at regular intervals. However, there are scenarios where this model isn't ideal:
- Asynchronous Events: Monitoring events that don't occur regularly, such as the completion of a nightly backup job, a security alert from an intrusion detection system, or a user-triggered action.
- Services Behind Restrictive Firewalls: When the Nagios server cannot directly reach a service to check it.
- Resource Intensive Checks: For checks that are too resource-intensive to run frequently from the Nagios server.
- Distributed Monitoring: Aggregating results from other monitoring systems or remote agents.
For these situations, Nagios supports passive checks.
Active vs. Passive Checks:
-
Active Checks:
- Initiated by the Nagios server.
- Scheduled at regular intervals (defined by check_interval and retry_interval).
- Nagios executes a plugin to determine status.
- Example: Nagios pings a server every 5 minutes.
-
Passive Checks:
- Initiated by an external application or script on the monitored host (or another system).
- The external application performs the check and submits the result (status, output message) to Nagios.
- Nagios does not schedule these checks; it simply processes the results when they arrive.
- Example: A backup script on a remote server sends a "Backup OK" or "Backup FAILED" message to Nagios upon completion.
Nagios Service Check Acceptor (NSCA):
NSCA is a common addon used to facilitate passive checks. It consists of two parts:
- NSCA Daemon: Runs on the Nagios server. It listens on a specific TCP port (default 5667) for incoming passive check results.
- send_nsca Client: A utility run on the remote host (or any system that needs to submit a passive check result). It formats the check result and sends it to the NSCA daemon on the Nagios server.
NSCA Architecture:
- An external application/script on a (remote) host determines the status of a service (e.g., backup job completes).
- This application/script uses the send_nsca client utility to construct a message containing the target host name, service description (as defined in Nagios), status code, and plugin output.
- send_nsca sends this message to the NSCA daemon running on the Nagios server.
- The NSCA daemon receives the message, performs basic validation (and decryption if configured), and writes the check result to Nagios's external command file (nagios.cmd).
- The Nagios daemon periodically processes the external command file, reads the passive check result, and updates the status of the corresponding service.
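The last two steps above rely on Nagios's external command format. As a minimal sketch, the line that ends up in the command file for a passive service result uses the PROCESS_SERVICE_CHECK_RESULT external command; the helper name passive_result_line is hypothetical and exists only for this illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a passive service check result in the
# external command format that Nagios reads from its command file.
# $1 = host_name, $2 = service_description, $3 = return code, $4 = plugin output
passive_result_line() {
    local host=$1 service=$2 code=$3 output=$4
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$host" "$service" "$code" "$output"
}

passive_result_line "some-remote-server" "Nightly Backup Status" 0 "Backup OK"

# On the Nagios server itself, such a line could be written to the command file
# directly (path assumes the source install used in this guide):
#   passive_result_line ... > /usr/local/nagios/var/rw/nagios.cmd
```

NSCA exists precisely so that remote hosts do not need direct access to that file.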
Security:
- NSCA communication can be encrypted using various ciphers. Both the daemon and client must be configured with the same encryption method and password/key.
- The NSCA daemon configuration can restrict which hosts are allowed to send data.
- Firewall rules on the Nagios server should allow incoming connections on the NSCA port (e.g., TCP 5667) only from trusted IP addresses/networks.
Configuring Nagios for Passive Checks:
-
Service Definition: Services that receive passive check results need to be defined in Nagios, but configured to accept them.
define service{
host_name some-remote-server
service_description Nightly Backup Status
active_checks_enabled 0 ; Disable active checks for this service
passive_checks_enabled 1 ; Enable passive checks
check_period 24x7 ; Still need a check_period for freshness
max_check_attempts 1 ; Usually 1, as result is submitted directly
is_volatile 0 ; Or 1 if every result is important regardless of previous
contact_groups admins
# With active checks disabled, check_command is never run on a schedule;
# it is used only when freshness checking fires (highly recommended).
check_freshness 1 ; Enable freshness checking
freshness_threshold 90000 ; E.g., 3600*25 = 90000 seconds (25 hours)
; If no result received in 25 hours, service becomes stale.
check_command check_dummy!3!"No backup results received" ; Or a custom 'stale' check
; This command runs ONLY if freshness threshold is exceeded.
stalking_options o,w,u,c ; Log all state changes if desired
register 1
}
- active_checks_enabled 0: Crucial. Disables Nagios from actively checking this service.
- passive_checks_enabled 1: Crucial. Allows Nagios to accept results for this service.
- Freshness Checking:
  - check_freshness 1: Enables freshness checking.
  - freshness_threshold <seconds>: If Nagios doesn't receive a passive result for this service within this many seconds, it considers the service "stale."
  - When a service becomes stale, Nagios can optionally run an active check_command (like check_dummy or a custom alert) to force the service into a WARNING, CRITICAL, or UNKNOWN state and trigger notifications. This alerts you that the passive check mechanism itself might be failing.
  - check_dummy is a simple plugin that returns a predefined state and message. For example, check_dummy 2 "Service is stale" would force it to CRITICAL.
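The freshness rule can be pictured as a simple age comparison. The sketch below is a deliberate simplification (Nagios's own freshness logic takes further details into account), and is_stale is a hypothetical helper written for this illustration.

```shell
#!/usr/bin/env bash
# Simplified picture of freshness checking; not Nagios's actual implementation.
# $1 = epoch seconds of the last received result
# $2 = freshness_threshold in seconds
# $3 = current epoch seconds (optional, defaults to now)
is_stale() {
    local last=$1 threshold=$2 now=${3:-$(date +%s)}
    (( now - last > threshold ))
}

# A nightly job plus one hour of slack: 25 hours expressed in seconds.
threshold=$(( 25 * 3600 ))
echo "$threshold"   # prints 90000

last_result=$(( $(date +%s) - 26 * 3600 ))   # pretend the last result is 26 hours old
if is_stale "$last_result" "$threshold"; then
    echo "stale: the freshness check_command would run"
fi
```

Choosing a threshold slightly longer than the job's period avoids false alarms when a run finishes a little late.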
-
Nagios Main Configuration (nagios.cfg): Ensure passive check result processing is enabled (it usually is by default): accept_passive_service_checks=1 and check_external_commands=1.
Installing and Configuring NSCA (Nagios Server and Client):
On the Nagios Server:
-
Download and Install NSCA:
cd /tmp
# Check Nagios Exchange or GitHub for NSCA source.
# Example version, find latest stable from a trusted source.
# NagiosEnterprises/nsca on GitHub is a common source.
NSCA_VERSION="2.9.2" # Example, verify latest version
wget https://github.com/NagiosEnterprises/nsca/releases/download/v${NSCA_VERSION}/nsca-${NSCA_VERSION}.tar.gz
tar -zxvf nsca-${NSCA_VERSION}.tar.gz
cd nsca-${NSCA_VERSION}/
sudo ./configure --with-nsca-user=nagios --with-nsca-group=nagios # Or your nagios user/group
sudo make all
# This will build both the nsca daemon and send_nsca client.
# Install only the daemon on the server:
sudo cp src/nsca /usr/local/nagios/bin/
sudo cp sample-config/nsca.cfg /usr/local/nagios/etc/
sudo chown nagios:nagios /usr/local/nagios/bin/nsca /usr/local/nagios/etc/nsca.cfg
sudo chmod 750 /usr/local/nagios/bin/nsca
-
Configure NSCA Daemon (nsca.cfg): Edit /usr/local/nagios/etc/nsca.cfg. Key settings:
- nsca_user=nagios (if you ran configure with it, otherwise set it here)
- nsca_group=nagios
- server_port=5667
- command_file=/usr/local/nagios/var/rw/nagios.cmd (Nagios external command file path)
- password=your_secret_nsca_password (Choose a strong password if using simple password encryption)
- decryption_method=1 (for XOR encryption with password. 0=None, 1=XOR, 2=DES, etc. Other methods require libmcrypt). XOR is simple but not very strong. Consider stronger methods for production.
-
Firewall: Allow TCP port 5667 on the Nagios server from IPs that will send NSCA data.
-
Run NSCA Daemon: You can run it directly (for testing) or set it up as a systemd service (recommended). For systemd, create /etc/systemd/system/nsca.service with a unit that starts the daemon, then enable and start it.
On the Client Host (that will send passive results):
-
Install send_nsca client: If make all was run during NSCA compilation on the Nagios server, src/send_nsca was built. Copy this binary to the client machine (e.g., into /usr/local/bin/ or /usr/sbin/). Also copy the sample-config/send_nsca.cfg to /usr/local/nagios/etc/ (or /etc/nagios/) on the client.
# On Nagios server where you compiled NSCA:
# scp /tmp/nsca-${NSCA_VERSION}/src/send_nsca user@remote_client_ip:/tmp/
# scp /tmp/nsca-${NSCA_VERSION}/sample-config/send_nsca.cfg user@remote_client_ip:/tmp/
# On the remote client:
sudo cp /tmp/send_nsca /usr/local/bin/
sudo mkdir -p /usr/local/nagios/etc # Or /etc/nagios/
sudo cp /tmp/send_nsca.cfg /usr/local/nagios/etc/send_nsca.cfg
sudo chown root:root /usr/local/bin/send_nsca /usr/local/nagios/etc/send_nsca.cfg # Or appropriate user
sudo chmod +x /usr/local/bin/send_nsca
-
Configure send_nsca.cfg on Client: Edit /usr/local/nagios/etc/send_nsca.cfg on the client. Set:
- password=your_secret_nsca_password (Must match nsca.cfg on the server)
- encryption_method=1 (Must match decryption_method in nsca.cfg on the server)
-
Using send_nsca: The send_nsca client reads data from standard input or a file. The format is: <host_name>\t<svc_description>\t<return_code>\t<plugin_output>\n (Fields are tab-separated, ending with a newline).
- <host_name>: The host_name as defined in Nagios.
- <svc_description>: The service_description as defined in Nagios.
- <return_code>: 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN.
- <plugin_output>: Text message from the plugin.
Example usage in a script:
#!/bin/bash
NAGIOS_SERVER_IP="your_nagios_server_ip"
HOST_NAME="some-remote-server"
SERVICE_DESC="Nightly Backup Status"
NSCA_CONFIG="/usr/local/nagios/etc/send_nsca.cfg" # Path to send_nsca.cfg

# Simulate backup
echo "Running backup..."
sleep 10 # Simulate backup work
BACKUP_SUCCESS=true # or false

if $BACKUP_SUCCESS; then
    RETURN_CODE=0
    PLUGIN_OUTPUT="Backup completed successfully at $(date)"
else
    RETURN_CODE=2
    PLUGIN_OUTPUT="Backup FAILED at $(date) - Check logs for details."
fi

# Send to Nagios via NSCA
printf "%s\t%s\t%s\t%s\n" "${HOST_NAME}" "${SERVICE_DESC}" "${RETURN_CODE}" "${PLUGIN_OUTPUT}" | \
    /usr/local/bin/send_nsca -H "${NAGIOS_SERVER_IP}" -p 5667 -d $'\t' -c "${NSCA_CONFIG}"
echo "Result sent to Nagios."
- -H ${NAGIOS_SERVER_IP}: Specifies the Nagios server running NSCA daemon.
- -p 5667: NSCA port.
- -d $'\t': Specifies tab as the delimiter. In bash, "\t" inside double quotes is a literal backslash followed by t, not a tab, so $'\t' is used to pass a real tab character (tab is also the send_nsca default, so the option can be omitted).
- -c ${NSCA_CONFIG}: Path to send_nsca.cfg.
Passive checks with NSCA offer great flexibility for integrating various event sources and custom monitoring logic into Nagios. Remember that freshness checking is vital to ensure your passive check submission mechanisms are themselves working.
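Since the input format is positional, a stray tab or newline inside the plugin output shifts the fields. The following sketch builds one input line defensively; nsca_line is a hypothetical helper for this example, and replacing tabs and newlines with spaces is a design choice made here rather than something send_nsca requires.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints one tab-separated send_nsca input line.
# $1 = host_name, $2 = service_description, $3 = return code, $4 = plugin output
nsca_line() {
    local host=$1 service=$2 code=$3 output=$4
    # Only the four plugin return codes are meaningful to Nagios.
    case $code in
        0|1|2|3) ;;
        *) echo "invalid return code: $code" >&2; return 1 ;;
    esac
    # Tabs or newlines in the output would break the field layout.
    output=${output//[$'\t\n']/ }
    printf '%s\t%s\t%s\t%s\n' "$host" "$service" "$code" "$output"
}

nsca_line "some-remote-server" "Nightly Backup Status" 0 "Backup completed"

# Typical use, piping the line into the client:
#   nsca_line ... | /usr/local/bin/send_nsca -H <nagios_server> -c /usr/local/nagios/etc/send_nsca.cfg
```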
Workshop Implementing Passive Checks with NSCA
Objective:
Configure a passive service check for a simulated nightly cron job on a remote Linux host (VM2). The cron job script will use send_nsca to report its success or failure to the Nagios server (VM1).
Prerequisites:
- Nagios Server (VM1) and Remote Linux Host (VM2) set up. VM1 IP: 192.168.1.100, VM2 IP: 192.168.1.101 (adjust as needed).
- NRPE setup is not strictly needed for this NSCA workshop but VM2 should be defined as a host in Nagios.
- Root/sudo access on both VMs.
- Build tools (gcc, make) on VM1 for compiling NSCA.
Part 1: Setup NSCA on Nagios Server (VM1)
-
Install Build Dependencies (if not already present):
-
Download and Compile NSCA on VM1:
cd /tmp
NSCA_VERSION="2.9.2" # Or check for latest from NagiosEnterprises/nsca on GitHub
wget https://github.com/NagiosEnterprises/nsca/releases/download/v${NSCA_VERSION}/nsca-${NSCA_VERSION}.tar.gz
tar -zxvf nsca-${NSCA_VERSION}.tar.gz
cd nsca-${NSCA_VERSION}/
sudo ./configure --with-nagios-user=nagios --with-nagios-group=nagios
sudo make all
-
Install NSCA Daemon and Configuration on VM1:
-
Configure nsca.cfg on VM1: Make the following changes:
- Set password=MyNscaSecretPassword123 (Choose your own password).
- Set decryption_method=1 (XOR encryption). For stronger encryption, use another method if libmcrypt-dev was installed and you configured nsca with it.
- Ensure command_file=/usr/local/nagios/var/rw/nagios.cmd is correct.
- Save and exit.
-
Create systemd Service File for NSCA on VM1:
Paste the following:
[Unit]
Description=Nagios Service Check Acceptor
After=network.target
[Service]
Type=forking
User=nagios
Group=nagios
ExecStart=/usr/local/nagios/bin/nsca -c /usr/local/nagios/etc/nsca.cfg --daemon
# Adjust this path if the pid_file setting in nsca.cfg points elsewhere
PIDFile=/var/run/nsca.pid
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
[Install]
WantedBy=multi-user.target
With --daemon, NSCA forks into the background, which is why the unit uses Type=forking together with a PIDFile. Note that systemd does not accept a trailing comment on the same line as a setting, so the remark about the PID file sits on its own line. If NSCA does not write a PID file at that path (for example, because the nagios user cannot write to /var/run), point pid_file in nsca.cfg and PIDFile in the unit at a location that user can write to. An alternative is a Type=simple unit whose ExecStart keeps NSCA in the foreground; check the usage output of your NSCA build for a foreground option before relying on one.
Enable and start NSCA service:
-
Configure Firewall on VM1: If ufw is active:
sudo ufw allow from 192.168.1.101 to any port 5667 proto tcp comment 'Allow NSCA from VM2'
sudo ufw reload
(Replace 192.168.1.101 with VM2's IP).
-
Define Passive Service in Nagios on VM1: Ensure VM2 (vm2-remote-linux) is defined as a host. Then, edit its config file, e.g., /usr/local/nagios/etc/servers/vm2-linux.cfg. Add this service definition:
define service{
host_name vm2-remote-linux
service_description Simulated Cron Job Status
active_checks_enabled 0 ; Crucial: Disable active checks
passive_checks_enabled 1 ; Crucial: Enable passive checks
check_period 24x7
max_check_attempts 1
is_volatile 0
contact_groups admins
check_freshness 1 ; Enable freshness checking
freshness_threshold 86400 ; 24 hours (in seconds). If no result, it's stale.
check_command check_dummy!2!"CRON job results overdue" ; Command if stale
; Ensure 'check_dummy' command is defined in commands.cfg
; check_dummy!0!message (OK), !1!message (WARN), !2!message (CRIT)
stalking_options o,w,u,c
notes This service is updated passively by a cron job on vm2-remote-linux.
register 1
}
- Verify check_dummy command: Ensure /usr/local/nagios/etc/objects/commands.cfg has a check_dummy command definition. The check_dummy plugin should be in /usr/local/nagios/libexec/.
Validate and reload Nagios:
Part 2: Setup send_nsca on Remote Linux Host (VM2)
-
Copy send_nsca binary and send_nsca.cfg sample configuration to VM2: From VM1 (where you compiled NSCA, likely in /tmp/nsca-${NSCA_VERSION}/ if you followed the steps): First, ensure you know the username and IP address for VM2. Let's assume your_user@192.168.1.101.
# On VM1, navigate to the NSCA source directory where `make all` was run
cd /tmp/nsca-${NSCA_VERSION}/
# Securely copy the send_nsca binary
scp src/send_nsca your_user@192.168.1.101:/tmp/
# Securely copy the sample send_nsca.cfg configuration file
scp sample-config/send_nsca.cfg your_user@192.168.1.101:/tmp/
(Replace your_user with your actual username on VM2 and 192.168.1.101 with VM2's actual IP address). You will be prompted for the password for your_user on VM2.
-
Install send_nsca and its configuration file on VM2: Log in to VM2 via SSH. The files you copied should be in the /tmp/ directory.
# On VM2:
# Move the send_nsca binary to a standard location for executables
sudo mv /tmp/send_nsca /usr/local/bin/
# Create the directory for Nagios configuration files if it doesn't exist
sudo mkdir -p /usr/local/nagios/etc/
# Move the send_nsca.cfg configuration file to this directory
sudo mv /tmp/send_nsca.cfg /usr/local/nagios/etc/
# Set appropriate ownership and permissions
# send_nsca binary should be owned by root and executable
sudo chown root:root /usr/local/bin/send_nsca
sudo chmod 755 /usr/local/bin/send_nsca # rwxr-xr-x
# send_nsca.cfg can be owned by root, readable by relevant users/groups
sudo chown root:root /usr/local/nagios/etc/send_nsca.cfg
sudo chmod 644 /usr/local/nagios/etc/send_nsca.cfg # rw-r--r--
-
Configure send_nsca.cfg on VM2: This file tells send_nsca how to encrypt data and what password to use. It must match the settings in nsca.cfg on the Nagios server (VM1). Look for the following lines and modify them:
- password=MyNscaSecretPassword123
  - Important: This password must be exactly the same as the password you set in /usr/local/nagios/etc/nsca.cfg on VM1.
- encryption_method=1
  - This specifies the encryption algorithm. 1 usually stands for XOR. This must also match the decryption_method in nsca.cfg on VM1. If you used a different method on VM1, set it here accordingly.
Save the file and exit (Ctrl+X, then Y, then Enter in nano).
-
Create the Simulated Cron Job Script on VM2: This script will simulate a task (like a backup) and then use send_nsca to report its status to Nagios. Create a new script file, for example, in /opt/simulate_cron_job.sh. Paste the following content into the script:
#!/bin/bash
# Configuration Variables
NAGIOS_SERVER_IP="192.168.1.100" # <<<< IP Address of your Nagios Server (VM1)
HOST_NAME="vm2-remote-linux" # <<<< This MUST match the 'host_name' defined in Nagios for VM2
SERVICE_DESC="Simulated Cron Job Status" # <<<< This MUST match the 'service_description' for the passive service in Nagios
NSCA_CONFIG_FILE="/usr/local/nagios/etc/send_nsca.cfg" # Path to the send_nsca client config file

# Simulate job execution and determine success or failure
# For this workshop, we'll randomly make it succeed or fail.
echo "Simulating cron job execution..."
# sleep 5 # Optional: simulate work
if (( RANDOM % 2 )); then
    # Job succeeded
    RETURN_CODE=0 # Nagios OK state
    PLUGIN_OUTPUT="Simulated cron job completed successfully at $(date)."
    echo "Job SUCCEEDED. Sending OK status to Nagios."
else
    # Job failed
    RETURN_CODE=2 # Nagios CRITICAL state
    PLUGIN_OUTPUT="Simulated cron job FAILED at $(date). Please check application logs on VM2."
    echo "Job FAILED. Sending CRITICAL status to Nagios."
fi

# Prepare the data string for send_nsca
# Format: <host_name>\t<service_description>\t<return_code>\t<plugin_output>\n
DATA_STRING=$(printf "%s\t%s\t%s\t%s" "${HOST_NAME}" "${SERVICE_DESC}" "${RETURN_CODE}" "${PLUGIN_OUTPUT}")

# Send the data to Nagios server using send_nsca
# echo appends the trailing newline; its output is piped to send_nsca's standard input.
# $'\t' passes a real tab as the delimiter ("\t" in double quotes would be a literal backslash and t).
echo "${DATA_STRING}" | /usr/local/bin/send_nsca -H "${NAGIOS_SERVER_IP}" -p 5667 -d $'\t' -c "${NSCA_CONFIG_FILE}"
if [ $? -eq 0 ]; then
    echo "NSCA data sent successfully to ${NAGIOS_SERVER_IP}."
else
    echo "Error sending NSCA data. Check send_nsca execution and connectivity."
fi
Make the script executable:
-
Test the Script Manually on VM2: Execute the script a few times to see it send different statuses:
Each time, it will print whether it simulated a success or failure and indicate that it attempted to send data. -
Check Nagios Web Interface (VM1):
- Navigate to your Nagios UI on VM1.
- Go to the "Services" view. Find the service named "Simulated Cron Job Status" associated with the host vm2-remote-linux.
- Its status should update to either OK (green) or CRITICAL (red) based on what your script sent. This update might take a moment as Nagios processes its external command file.
- The "Status Information" column will display the PLUGIN_OUTPUT message from your script.
- The "Last Check" time will reflect when Nagios processed the passive result submitted by NSCA. It will not be the regular active check interval since active checks are disabled for this service.
- Troubleshooting: If the service status doesn't update or remains "Pending (No data received from host yet)":
  - On VM1 (Nagios Server):
    - Check the NSCA daemon status: sudo systemctl status nsca.
    - Examine Nagios logs: sudo tail -f /usr/local/nagios/var/nagios.log. Look for lines related to processing passive checks or NSCA errors.
    - Examine system logs for NSCA messages: sudo journalctl -u nsca or sudo grep nsca /var/log/syslog.
    - Verify the firewall rule for port 5667 is active and correct: sudo ufw status.
  - On VM2 (Client Host):
    - When you run /opt/simulate_cron_job.sh, does it report any errors from send_nsca itself?
    - Double-check the NAGIOS_SERVER_IP, HOST_NAME, and SERVICE_DESC variables in the script. They must exactly match Nagios configuration (case-sensitive).
    - Verify the password and encryption_method in /usr/local/nagios/etc/send_nsca.cfg match VM1's nsca.cfg.
    - Can VM2 reach VM1 on TCP port 5667? telnet 192.168.1.100 5667 (from VM2, use VM1's IP). If it connects, press Ctrl+] then type quit. If "Connection refused" or timeout, there's a network/firewall issue.
-
(Optional) Set up as a real cron job on VM2: To have this script run automatically, you can add it to the system's cron table. Open the cron table for editing (usually as root). If prompted, choose an editor (e.g., nano). Add a line to schedule the script, for example to run it every 5 minutes (for testing purposes) and append its standard output and standard error to /var/log/simulated_cron_job.log. For a real nightly job, you'd use a schedule like 0 2 * * * (2 AM every day). Save and exit the crontab. The cron daemon will automatically pick up the new schedule.
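The crontab entry described in the optional step can be assembled from three parts: the schedule, the script path, and the log redirection. The sketch below prints such entries; cron_line is a hypothetical helper for this illustration, and the printed line is what would be pasted into the crontab opened with, for example, sudo crontab -e.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a crontab entry that runs a script on a schedule
# and appends both stdout and stderr to a log file.
# $1 = cron schedule (five fields), $2 = script path, $3 = log file
cron_line() {
    local schedule=$1 script=$2 log=$3
    printf '%s %s >> %s 2>&1\n' "$schedule" "$script" "$log"
}

# Every 5 minutes, for testing:
cron_line '*/5 * * * *' /opt/simulate_cron_job.sh /var/log/simulated_cron_job.log
# Nightly at 02:00, as for a real job:
cron_line '0 2 * * *' /opt/simulate_cron_job.sh /var/log/simulated_cron_job.log
```

If the 5-minute test schedule is replaced by the nightly one, the freshness_threshold of 86400 seconds leaves no slack for a late run, so a slightly larger value may be worth considering.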
Outcome:
You have now successfully:
- Set up the NSCA daemon on your Nagios server (VM1) to listen for passive check results.
- Defined a passive service in Nagios on VM1, configured with freshness checking to alert if results stop arriving.
- Installed and configured the send_nsca client utility on the remote Linux host (VM2).
- Created a script on VM2 that simulates a cron job and uses send_nsca to report its success or failure status to Nagios.
- Observed these passively submitted check results appearing and updating in the Nagios web interface.
This workshop concretely demonstrates the power and utility of passive checks for integrating results from external systems or asynchronous events into your Nagios monitoring environment. The freshness checking component is vital as it monitors the passive check mechanism itself.
Event Handlers for Automated Remediation
Event handlers are scripts or commands that Nagios can execute when a host or service changes state. This powerful feature allows for automated problem remediation attempts, potentially resolving issues before manual intervention is required, or gathering diagnostic information when a problem occurs.
How Event Handlers Work:
- A host or service enters a problem state (e.g., a service becomes CRITICAL) or recovers (e.g., goes from CRITICAL to OK).
- If an event handler is defined for that host/service and state change, Nagios executes the specified command.
- The event handler script runs, performing actions like restarting a service, clearing a temporary directory, logging extra diagnostics, or even triggering actions on other systems.
- The event handler script should ideally be short-lived and not block Nagios for too long.
Key Concepts:
- State Changes: Event handlers can be triggered on various state changes:
- When a host/service goes into a SOFT problem state.
- When a host/service goes into a HARD problem state (most common).
- When a host/service recovers from a SOFT or HARD problem state (back to OK/UP).
- Event Handler Command: A Nagios command definition that specifies the script/executable to run and any arguments.
- Macros: Event handler commands can use Nagios macros to pass context about the host/service state to the script (e.g., $HOSTNAME$, $SERVICESTATE$, $SERVICEOUTPUT$, $HOSTSTATETYPE$, $SERVICESTATETYPE$).
- Global vs. Specific: Event handlers can be enabled globally in nagios.cfg (enable_event_handlers=1) and then defined per host/service or in templates.
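The state and state-type combinations above usually end up as a small decision table at the top of a handler script. The sketch below shows one way to express it in Bash. The function name decide_action, the action labels, and the choice to act early on the third SOFT attempt (which assumes max_check_attempts is 4) are illustrative assumptions, not Nagios requirements.

```shell
#!/bin/bash
# Hypothetical helper: decide what an event handler should do, given the
# values of $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$.
# Prints one of: none, early-restart, restart.
decide_action() {
    local state="$1" state_type="$2" attempt="$3"
    case "$state" in
        CRITICAL)
            case "$state_type" in
                SOFT)
                    # Assumption: with max_check_attempts=4, act on the 3rd
                    # soft attempt so a fix can land before the HARD state.
                    if [ "$attempt" -eq 3 ]; then
                        echo "early-restart"
                    else
                        echo "none"
                    fi ;;
                HARD)
                    echo "restart" ;;
                *)
                    echo "none" ;;
            esac ;;
        *)
            # OK, WARNING, UNKNOWN: nothing to remediate in this sketch.
            echo "none" ;;
    esac
}

# Example: what would happen for a service that just went HARD CRITICAL?
decide_action CRITICAL HARD 4
```

Separating the decision from the action keeps the risky part (the restart itself) in one place and lets the logic be exercised without touching any service.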
Defining an Event Handler:
- Write the Event Handler Script: This script will perform the desired action. It can be written in Bash, Python, Perl, etc. Example: A simple Bash script to attempt restarting a service (attempt_restart_service.sh):

    #!/bin/bash
    # Arguments passed by Nagios (defined in the command definition)
    HOSTNAME=$1
    SERVICEDESC=$2
    SERVICESTATE=$3       # e.g., CRITICAL, WARNING, UNKNOWN, OK
    SERVICESTATETYPE=$4   # e.g., SOFT, HARD
    SERVICEATTEMPT=$5     # current check attempt number, e.g., 1, 2, 3

    LOGFILE="/usr/local/nagios/var/event_handler.log"

    echo "$(date): Event Handler triggered for ${HOSTNAME}/${SERVICEDESC}" >> "${LOGFILE}"
    echo "State: ${SERVICESTATE}, Type: ${SERVICESTATETYPE}, Attempt: ${SERVICEATTEMPT}" >> "${LOGFILE}"

    # Only attempt a restart on a HARD CRITICAL state.
    # Nagios runs event handlers on state changes, so with this condition the
    # restart is attempted when the service first enters HARD CRITICAL
    # (i.e., once the attempt number has reached max_check_attempts).
    if [ "${SERVICESTATETYPE}" == "HARD" ] && [ "${SERVICESTATE}" == "CRITICAL" ]; then
        echo "Attempting to restart ${SERVICEDESC} on ${HOSTNAME}..." >> "${LOGFILE}"

        # How to restart depends on the service and host.
        # If the service is on a remote Linux host, you might use SSH.
        # Make sure SSH key-based authentication is set up for the 'nagios' user
        # to the remote host, and that the nagios user has sudo rights for that
        # specific service restart.
        # EXAMPLE:
        # if [ "${SERVICEDESC}" == "HTTP Web Server" ] && [ "${HOSTNAME}" == "my-web-server-01" ]; then
        #     ssh nagios@${HOSTNAME} "sudo systemctl restart apache2" >> "${LOGFILE}" 2>&1
        #     echo "Apache restart command sent." >> "${LOGFILE}"
        # fi

        # For a local service (on the Nagios server itself):
        if [ "${SERVICEDESC}" == "Local Apache Service" ] && [ "${HOSTNAME}" == "localhost" ]; then
            # Ensure nagios user has sudo rights for this command
            # e.g., in /etc/sudoers: nagios ALL=(ALL) NOPASSWD: /bin/systemctl restart apache2
            sudo systemctl restart apache2 >> "${LOGFILE}" 2>&1
            echo "Local Apache restart command executed." >> "${LOGFILE}"
        fi
        echo "--------------------------------------" >> "${LOGFILE}"
    else
        echo "No action taken (State: ${SERVICESTATE}, Type: ${SERVICESTATETYPE})." >> "${LOGFILE}"
        echo "--------------------------------------" >> "${LOGFILE}"
    fi

    exit 0   # Event handlers should typically exit 0
  - Place this script in /usr/local/nagios/libexec/.
  - Make it executable: sudo chmod +x /usr/local/nagios/libexec/attempt_restart_service.sh.
  - Ensure the nagios user has permissions to execute it and any commands within it (e.g., sudo rights if restarting system services). This is a major security consideration.
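- Define the Event Handler Command in Nagios: Add a command definition named service-restarter to /usr/local/nagios/etc/objects/commands.cfg. Its command_line should call the script and pass the macros in the order the script reads them: $HOSTNAME$, "$SERVICEDESC$", $SERVICESTATE$, $SERVICESTATETYPE$, $SERVICEATTEMPT$.
- Enable Event Handlers Globally: In /usr/local/nagios/etc/nagios.cfg, set enable_event_handlers=1. (This is usually the default.)
- Assign the Event Handler to a Service: In a service definition (e.g., for a local Apache service on localhost):

    define service{
        use                     local-service
        host_name               localhost
        service_description     Local Apache Service    ; Match this in your script
        check_command           check_http
        event_handler_enabled   1                       ; Enable event handler for THIS service
        event_handler           service-restarter       ; Name of the command to run
        contact_groups          admins
    }

  - event_handler_enabled 1: Enables it for this specific service.
  - event_handler service-restarter: Specifies the command.
- Verify and Reload Nagios: Validate the configuration (sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg) and, if no errors are reported, reload Nagios (sudo systemctl reload nagios).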
Considerations and Best Practices:
- Security: Granting the nagios user sudo rights is a significant security risk. Restrict these rights as much as possible (e.g., only for specific commands needed by event handlers). Use NOPASSWD with caution. SSH key-based auth for remote commands must also be secured.
- Idempotency: Event handler scripts should ideally be idempotent (running them multiple times has the same effect as running them once). This prevents issues if Nagios triggers them repeatedly.
- Avoid Loops: Be careful not to create event handler loops (e.g., an event handler causes a state change that triggers another event handler, etc.).
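- Keep Scripts Fast: Event handlers run synchronously in some Nagios versions/configurations, potentially blocking other checks, and Nagios terminates handlers that exceed event_handler_timeout (30 seconds in the sample nagios.cfg). Keep them quick or design them to background longer tasks.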
- Testing: Test event handlers thoroughly in a non-production environment.
- Logging: Log actions performed by event handlers for auditing and troubleshooting.
- Use for Diagnostics: Event handlers aren't just for remediation. They can gather diagnostic data (e.g., run top, ps, netstat, save logs) when a problem occurs, attaching this info to the notification or storing it centrally.
- State Type: Decide whether to trigger on SOFT or HARD states. Triggering on SOFT states can be aggressive. HARD states are usually preferred for remediation actions.
- Host Event Handlers: Similar concepts apply to host event handlers (e.g., try to reboot a server if it's DOWN, but this is risky).
Event handlers can significantly enhance Nagios's capabilities, transforming it from a passive monitoring system into one that can actively attempt to resolve issues. However, they must be implemented with care and strong attention to security.
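One way to approach the idempotency and loop concerns listed above is a simple cooldown guard: the handler records when it last acted and refuses to act again too soon. The sketch below is an illustration under stated assumptions. The function name cooldown_ok, the stamp-file location, and the 600-second window are hypothetical, and the guard is not atomic, so it does not protect against two handlers starting at the same instant.

```shell
#!/bin/bash
# Hypothetical guard: allow an action only if it has not been permitted
# within the last COOLDOWN seconds. The stamp file stores the epoch time
# of the last permitted run.
cooldown_ok() {
    local stamp_file="$1" cooldown="$2"
    local now last
    now=$(date +%s)
    if [ -f "$stamp_file" ]; then
        last=$(cat "$stamp_file")
        if [ $(( now - last )) -lt "$cooldown" ]; then
            return 1    # too soon: skip the action
        fi
    fi
    echo "$now" > "$stamp_file"
    return 0
}

# Usage inside a handler (path and restart command are placeholders):
# if cooldown_ok /usr/local/nagios/var/apache_restart.stamp 600; then
#     sudo systemctl restart apache2
# fi
```

With a guard like this, a flapping service triggers at most one restart per window, and the skipped invocations can still be logged for later review.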
Workshop Implementing an Event Handler to Log Extra Info
Objective:
Create an event handler that, when a service on localhost (e.g., "Swap Usage") enters a HARD CRITICAL state, logs detailed system information (like free -m, vmstat, df -h) to a specific file for later diagnosis. This is a non-remediating, diagnostic event handler.
Prerequisites:
- Working Nagios server (VM1).
- A service on localhost that you can easily force into a CRITICAL state (e.g., "Swap Usage" or the custom "Active SSH Sessions" check). We'll use "Swap Usage".
Steps:
- Create the Event Handler Script: On VM1, create /usr/local/nagios/libexec/log_system_diags.sh and paste the following content:

    #!/bin/bash
    # Nagios macros passed as arguments
    HOSTNAME=$1
    SERVICEDESC=$2
    SERVICESTATE=$3
    SERVICESTATETYPE=$4
    SERVICEOUTPUT=$5   # The plugin output for the service

    LOG_DIR="/usr/local/nagios/var/diag_logs"
    DIAG_FILE="${LOG_DIR}/${HOSTNAME}_${SERVICEDESC// /_}_$(date +%Y%m%d_%H%M%S).diag"

    # Create log directory if it doesn't exist
    mkdir -p "${LOG_DIR}"
    chown nagios:nagios "${LOG_DIR}"   # Ensure nagios user can write

    # Log basic event info
    echo "Event Handler: log_system_diags.sh triggered at $(date)" > "${DIAG_FILE}"
    echo "Host: ${HOSTNAME}" >> "${DIAG_FILE}"
    echo "Service: ${SERVICEDESC}" >> "${DIAG_FILE}"
    echo "State: ${SERVICESTATE} (${SERVICESTATETYPE})" >> "${DIAG_FILE}"
    echo "Plugin Output: ${SERVICEOUTPUT}" >> "${DIAG_FILE}"
    echo "-----------------------------------------" >> "${DIAG_FILE}"

    # Only gather diagnostics on HARD CRITICAL state
    if [ "${SERVICESTATETYPE}" == "HARD" ] && [ "${SERVICESTATE}" == "CRITICAL" ]; then
        echo "Gathering system diagnostics..." >> "${DIAG_FILE}"
        echo "" >> "${DIAG_FILE}"
        echo "=== df -h ===" >> "${DIAG_FILE}"
        df -h >> "${DIAG_FILE}" 2>&1
        echo "" >> "${DIAG_FILE}"
        echo "=== free -m ===" >> "${DIAG_FILE}"
        free -m >> "${DIAG_FILE}" 2>&1
        echo "" >> "${DIAG_FILE}"
        echo "=== vmstat 1 3 ===" >> "${DIAG_FILE}"   # 3 samples, 1 second apart
        vmstat 1 3 >> "${DIAG_FILE}" 2>&1
        echo "" >> "${DIAG_FILE}"
        echo "=== top -b -n 1 ===" >> "${DIAG_FILE}"   # Batch mode, 1 iteration
        top -b -n 1 >> "${DIAG_FILE}" 2>&1
        echo "" >> "${DIAG_FILE}"
        echo "Diagnostics gathering complete." >> "${DIAG_FILE}"
        # Also log to main Nagios event handler log for quick check
        echo "$(date): Diagnostics for ${HOSTNAME}/${SERVICEDESC} saved to ${DIAG_FILE}" >> /usr/local/nagios/var/event_handler.log
    else
        echo "No diagnostics gathered. State was ${SERVICESTATE} (${SERVICESTATETYPE})." >> "${DIAG_FILE}"
        echo "$(date): Event handler for ${HOSTNAME}/${SERVICEDESC} triggered, no action for state ${SERVICESTATE} (${SERVICESTATETYPE})." >> /usr/local/nagios/var/event_handler.log
    fi
    echo "-----------------------------------------" >> /usr/local/nagios/var/event_handler.log
    exit 0

  Make the script executable and set ownership:

    sudo chmod +x /usr/local/nagios/libexec/log_system_diags.sh
    sudo chown nagios:nagios /usr/local/nagios/libexec/log_system_diags.sh
    # Create the main event handler log file and set permissions
    sudo touch /usr/local/nagios/var/event_handler.log
    sudo chown nagios:nagios /usr/local/nagios/var/event_handler.log
- Define the Event Handler Command in Nagios: Edit /usr/local/nagios/etc/objects/commands.cfg and add:

    define command{
        command_name    log-diagnostics
        command_line    $USER1$/log_system_diags.sh $HOSTNAME$ "$SERVICEDESC$" $SERVICESTATE$ $SERVICESTATETYPE$ "$SERVICEOUTPUT$"
    }

  Note: We are quoting $SERVICEDESC$ and $SERVICEOUTPUT$ because they can contain spaces.
- Enable Event Handlers Globally (if not already): Check /usr/local/nagios/etc/nagios.cfg for enable_event_handlers=1.
- Assign the Event Handler to the "Swap Usage" Service on localhost: Edit /usr/local/nagios/etc/objects/localhost.cfg. Find the "Swap Usage" service definition and add the directives event_handler_enabled 1 and event_handler log-diagnostics to it. Save and exit.
- Verify and Reload Nagios: Validate the configuration (sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg) and, if no errors are reported, reload Nagios (sudo systemctl reload nagios).
-
Test the Event Handler: We need to force the "Swap Usage" service into a HARD CRITICAL state.
- Modify check_command for "Swap Usage" to guarantee CRITICAL: In localhost.cfg, temporarily change the check_command for "Swap Usage" to thresholds that will definitely be critical. Assuming check_local_swap wraps the standard check_swap plugin (as in the sample configuration), its thresholds express how much swap must remain free, not how much is used, so you force a problem state by raising them. Example: change the thresholds to check_local_swap!100%!100%, which demands that 100% of swap be free and therefore reports CRITICAL on practically any system that has swap configured.

    sudo nano /usr/local/nagios/etc/objects/localhost.cfg
    # Find Swap Usage service, modify check_command to:
    # check_command check_local_swap!100%!100%

  Save, validate (sudo /usr/local/nagios/bin/nagios -v ...), and reload Nagios (sudo systemctl reload nagios).
- Wait for State Change: Monitor the "Swap Usage" service in the Nagios UI. It will go through:
  - Pending
  - SOFT CRITICAL (after first check)
  - Retry checks...
  - HARD CRITICAL (after max_check_attempts for the service, e.g., 3 or 4)
- Check for Diagnostic File: Once the service is in a HARD CRITICAL state, the event handler should have run. Check the main event handler log (/usr/local/nagios/var/event_handler.log); you should see a line indicating diagnostics were saved. Then list the diagnostic log directory (/usr/local/nagios/var/diag_logs); you should see a new file named like localhost_Swap_Usage_YYYYMMDD_HHMMSS.diag. View its content: it should contain the output of df -h, free -m, vmstat, and top.
- Revert Changes:
  - Important: Change the check_command for "Swap Usage" in localhost.cfg back to the thresholds it had before the test.
  - Save, validate, and reload Nagios.
  - The service should eventually recover to OK. The event handler might log that it was triggered for an OK state but took no diagnostic action (as per our script's logic).
Outcome:
You have successfully implemented a diagnostic event handler that:
- Triggers on a specific service entering a HARD CRITICAL state.
- Executes a custom script to gather system information.
- Saves this information to a uniquely named file for later analysis.
This workshop demonstrates a safe and useful application of event handlers – gathering data without attempting risky automated fixes. This approach can be invaluable for troubleshooting intermittent or complex issues.
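The command definition in this workshop quotes $SERVICEDESC$ and $SERVICEOUTPUT$, and the script replaces spaces when it builds the file name. The short Bash sketch below illustrates why both matter, using ordinary shell word splitting as an analogy. The helper names count_args and sanitize_for_filename are made up for the illustration.

```shell
#!/bin/bash
# count_args prints how many positional arguments it received.
count_args() { echo "$#"; }

# sanitize_for_filename mirrors the ${SERVICEDESC// /_} expansion used in
# the handler script: every space becomes an underscore.
sanitize_for_filename() {
    local desc="$1"
    echo "${desc// /_}"
}

desc="Swap Usage"
count_args localhost $desc CRITICAL      # unquoted: split into 4 arguments, prints 4
count_args localhost "$desc" CRITICAL    # quoted: stays 3 arguments, prints 3
sanitize_for_filename "$desc"            # prints Swap_Usage
```

Without the quotes, "Usage" would arrive in the script as the third argument and every later argument would shift by one, so the state checks would compare the wrong values.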
Performance Tuning and Scaling Nagios
As the number of monitored hosts and services grows, Nagios performance can become a concern. Slow check execution, a lagging web interface, and delayed notifications are common symptoms. Effective tuning and scaling strategies are crucial for maintaining a responsive and reliable monitoring system.
Key Areas for Performance Optimization:
-
Hardware Resources:
- CPU: Nagios and its checks can be CPU-intensive. More cores are generally better than raw clock speed for parallel check execution.
- RAM: Sufficient RAM is needed to hold Nagios's state information, run plugins, and for the OS/web server. Monitor memory usage; swap usage is a bad sign.
- Disk I/O: Nagios writes status data, logs, and performance data frequently. Fast disks (SSDs) significantly improve performance, especially for I/O-bound operations like perfdata processing.
  - Place /usr/local/nagios/var/ (especially spool/ and perfdata/ if using addons) on a fast filesystem or separate fast disk.
- Nagios Configuration (nagios.cfg):
  - interval_length: Default is 60 seconds. This is the number of seconds in one "unit interval", the time unit in which check_interval, retry_interval, and notification_interval are expressed. Reducing it (e.g., to 10 or 30) allows more granular scheduling, but every interval defined in those units shrinks proportionally unless you adjust the object definitions, which raises check load. Increasing it (e.g., to 120) might be considered for very large installs if fine granularity isn't needed, but this is rare. The default of 60 is usually fine.
  - Check Scheduling and Execution:
    - max_concurrent_checks: (Obsolete in Nagios 4.x, which uses a more dynamic check scheduler). In older versions, this limited how many service checks could run simultaneously.
    - service_check_timeout / host_check_timeout: Global timeouts for checks. Ensure they are reasonable. Plugins that hang can tie up Nagios worker processes.
    - max_service_check_spread / max_host_check_spread: Spreads out initial checks when Nagios starts to avoid a "thundering herd."
  - Optimizing Check Execution:
    - Use compiled plugins where possible: Compiled C plugins are generally faster than script-based ones (Perl, Python, Bash).
    - Efficient plugins: Ensure custom plugins are written efficiently. Avoid unnecessary overhead.
    - Reduce plugin timeouts: Set appropriate timeouts within plugin calls (e.g., the -t option for many network plugins) so they don't hang indefinitely.
  - Object Configuration:
    - Templates: Use them extensively. Nagios processes templates efficiently.
    - Avoid overly complex dependencies: While useful, deeply nested or circular dependencies can add processing overhead.
  - use_large_installation_tweaks: (Default is 0/OFF in the sample nagios.cfg). Enables several internal optimizations for larger environments, at the cost of some features (for example, summary macros are no longer exported as environment variables). Consider enabling it on large installations.
  - enable_environment_macros: (Default 0/OFF). Enabling this makes more environment variables available to plugins but can add slight overhead. Only enable if strictly needed by a plugin.
-
Optimize Check Intervals:
- Not everything needs to be checked every minute or even every 5 minutes.
- Prioritize: Critical services get frequent checks (e.g., 1-5 mins). Less critical or stable services can have longer intervals (e.g., 10-30 mins, or even hourly for some things).
- Use different
check_interval
andretry_interval
settings in service templates for different classes of service.
-
Remote Agents and Passive Checks (NRPE, NSCA):
- For checks on remote hosts, offload execution to the remote host (NRPE, NSClient++). This distributes CPU load.
- Use NSCA for asynchronous events to avoid polling.
-
Web Interface Performance:
- CGI Optimization: The Nagios CGIs can be slow on large installations.
  - Ensure your web server (Apache) is well-configured (e.g., KeepAlive On, appropriate MaxRequestWorkers).
  - Consider alternatives like Nagios V-Shell or Thruk for a faster web interface, or modern UIs like NagVis if you only need visualization.
- cgi.cfg settings:
  - escape_html_tags=0 (Default is 1/ON): Turning this off can speed up CGIs but introduces a security risk (XSS) if plugin output is not sanitized. Use with extreme caution.
  - Limit the default number of items displayed in status pages (e.g., result_limit in Nagios Core 4).
-
Perfdata Processing:
- If you're graphing performance data (e.g., with PNP4Nagios, Grafana via InfluxDB), the processing of perfdata files can be I/O intensive.
- Broker Modules and Helper Daemons: Use tools like NPCD (for PNP4Nagios) or NDOUtils (to write to a database) for more efficient, asynchronous handling of performance and status data.
- process_performance_data=1 in nagios.cfg is needed.
- Ensure perfdata processing scripts/daemons are efficient and don't overload the Nagios server. Consider moving perfdata processing/storage to a separate server if the Nagios server is struggling.
-
Distributed Monitoring (Advanced Scaling): For very large environments (thousands of hosts, tens/hundreds of thousands of services), a single Nagios instance may not be sufficient.
- Mod_Gearman: A popular addon that distributes check execution to multiple "Gearman workers" (which can be on different servers). The Nagios server acts as a scheduler, offloading the actual check execution. This dramatically improves scalability.
- DNX (Distributed Nagios eXecutor): Another framework for distributing checks.
- Federated Nagios Servers: Multiple independent Nagios servers monitoring different parts of the infrastructure, with their status aggregated by a central "master" Nagios server (often using NSCA for passive updates or a tool like Thruk to view multiple backends).
-
Nagios Core 4.x Worker Architecture: Nagios Core 4 introduced a worker process model for check execution, significantly improving performance over Nagios 3.x.
- Nagios main process handles scheduling, event handling, etc.
- Separate worker processes are forked to execute checks.
- nagios.cfg has a check_workers setting to control the number of worker processes. If it is left unset, Nagios chooses a value automatically. Manual adjustment requires careful monitoring.
-
Monitoring Nagios Itself:
- Monitor the Nagios process, CPU/memory/disk usage of the Nagios server.
- Monitor the check result spool directory and any backlog on the Nagios external command file (nagios.cmd). If results or commands pile up, it indicates Nagios is falling behind.
- Monitor event latency (time between a problem occurring and a notification being sent).
-
Regular Maintenance:
- Archive or rotate Nagios logs (nagios.log, perfdata logs) and keep an eye on the size of the state retention file (retention.dat).
- Periodically review and optimize configurations. Remove unused checks or objects.
- Keep Nagios Core and plugins updated to benefit from performance improvements and bug fixes.
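The advice above to watch the check result spool can be turned into a small plugin-style check. The following is a sketch only, assuming Bash and find are available. The function name check_spool_backlog and the thresholds are hypothetical, and the example path is the usual source-install default for check_result_path, which may differ on your system.

```shell
#!/bin/bash
# Hypothetical plugin-style check: count files waiting in a check result
# spool directory and map the count to Nagios exit codes
# (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).
check_spool_backlog() {
    local dir="$1" warn="$2" crit="$3" count
    if [ ! -d "$dir" ]; then
        echo "SPOOL UNKNOWN - $dir is not a directory"
        return 3
    fi
    count=$(find "$dir" -maxdepth 1 -type f | wc -l)
    count=$(( count ))    # strip padding that some wc implementations add
    if [ "$count" -ge "$crit" ]; then
        echo "SPOOL CRITICAL - $count files waiting | files=$count;$warn;$crit"
        return 2
    elif [ "$count" -ge "$warn" ]; then
        echo "SPOOL WARNING - $count files waiting | files=$count;$warn;$crit"
        return 1
    fi
    echo "SPOOL OK - $count files waiting | files=$count;$warn;$crit"
    return 0
}

# Example call (path and thresholds are placeholders):
# check_spool_backlog /usr/local/nagios/var/spool/checkresults 500 2000
```

A steadily growing count between runs is the signal of interest; a brief spike right after a restart is usually harmless.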
Tools for Diagnosing Performance:
- top / htop: Monitor CPU and memory usage.
- iostat: Monitor disk I/O.
- vmstat: Monitor system activity, memory, swap, I/O.
- Nagios logs (nagios.log, with debug enabled if necessary, but be careful as debug logging itself adds overhead).
- nagiostats: A utility that comes with Nagios and provides statistics about check execution latencies, queue lengths, etc. (Run this periodically to get a snapshot.)
Scaling Nagios is an ongoing process of monitoring, analyzing, and tuning. Start with simple optimizations and move to more complex solutions like distributed monitoring only when necessary.
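Among the simple optimizations, adjusting check intervals is the easiest to reason about with a little arithmetic. The sketch below estimates the steady-state active check rate for a group of services that share one interval. It assumes interval_length=60 (so intervals are in minutes), ignores retries, and uses a made-up function name and made-up service counts.

```shell
#!/bin/bash
# Estimate the steady-state active check rate for a group of services
# that share one check interval (integer arithmetic, rounded up).
#   $1 = number of services, $2 = check_interval in minutes
checks_per_minute() {
    local services="$1" interval_min="$2"
    echo $(( (services + interval_min - 1) / interval_min ))
}

# 1500 services all checked every 5 minutes:
checks_per_minute 1500 5     # prints 300

# The same 1500 services split into 300 critical ones (every 5 minutes)
# and 1200 routine ones (every 15 minutes):
echo $(( $(checks_per_minute 300 5) + $(checks_per_minute 1200 15) ))   # prints 140
```

In this hypothetical split, moving the routine services to a 15-minute interval cuts the check rate by more than half without touching the critical ones.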
Workshop Analyzing Nagios Performance with nagiostats
Objective:
Use the nagiostats utility to get a snapshot of your Nagios instance's performance metrics and understand what they mean. This is a diagnostic workshop, not a tuning one, but it provides the data needed for tuning.
Prerequisites:
- A working Nagios Core installation that has been running for some time with several hosts and services being actively checked. The more activity, the more interesting nagiostats output will be.
- Access to the Nagios server's command line.
Steps:
- Locate nagiostats: The nagiostats utility is typically installed in the same directory as the main nagios executable (/usr/local/nagios/bin/ in this guide). If it's not found there, your installation might be different, but this is the standard location for source installs.
- Run nagiostats: Execute nagiostats, pointing it to your main Nagios configuration file (e.g., sudo /usr/local/nagios/bin/nagiostats -c /usr/local/nagios/etc/nagios.cfg). You should get output similar to this (values will vary greatly, and the exact field names and grouping differ between Nagios versions; Nagios Core 4, for instance, reports latency and execution time as min / max / avg triples):

    Nagios Stats 4.x.x
    Copyright (c) 2009-2020 Nagios Core Development Team and Community Contributors
    Copyright (c) 1999-2009 Ethan Galstad
    Last Modified: XXXX-XX-XX
    License: GPL

    CURRENT STATUS DATA
    ---------------------------------------------------------------------
    Status File:                  /usr/local/nagios/var/status.dat
    Status File Age:              2s
    Status File Version:          4.x.x

    PROGRAM STATUS DATA
    ---------------------------------------------------------------------
    Nagios Process ID:            12345
    Running Time:                 2d 3h 15m 30s
    Nagios User:                  nagios
    Nagios Group:                 nagios

    CHECK PROCESSING DATA
    ---------------------------------------------------------------------
    Services Checked:             1500
    Hosts Checked:                300
    Service Check Interval:       300s
    Host Check Interval:          300s
    Service Inter-Check Delay:    1.00s
    Host Inter-Check Delay:       0.50s
    Services Actively Checked:    25
    Hosts Actively Checked:       5

    EVENT QUEUE DATA
    ---------------------------------------------------------------------
    Queued Events:                0
    HIGH Latency Events:          0
    TOTAL Latency Events:         10
    AVG Latency Events:           0.05s
    MAX Latency Events:           0.20s

    SERVICE CHECK DATA
    ---------------------------------------------------------------------
    Total Services:               50
    Services Ok:                  48
    Services Warning:             1
    Services Unknown:             0
    Services Critical:            1
    Services Pending:             0
    Services Obsessing:           50
    Services Scheduled:           50
    Services Checked:             1500
    Checks Last 1/5/15/60 Min:    10 / 50 / 150 / 600
    Latency Last 1/5/15/60 Min:   0.01s / 0.02s / 0.02s / 0.03s
    Service Max Latency:          0.15s
    Avg Service Check Latency:    0.03s
    Total Service State Change:   5
    Avg Service State Change:     1.0%

    HOST CHECK DATA
    ---------------------------------------------------------------------
    Total Hosts:                  10
    Hosts Up:                     9
    Hosts Down:                   1
    Hosts Unreachable:            0
    Hosts Pending:                0
    Hosts Obsessing:              10
    Hosts Scheduled:              10
    Hosts Checked:                300
    Checks Last 1/5/15/60 Min:    2 / 10 / 30 / 120
    Latency Last 1/5/15/60 Min:   0.00s / 0.01s / 0.01s / 0.01s
    Host Max Latency:             0.05s
    Avg Host Check Latency:       0.01s
    Total Host State Change:      2
    Avg Host State Change:        0.5%

    EXTERNAL COMMAND DATA
    ---------------------------------------------------------------------
    External Commands Checked:    25
    ... (more stats)
- Analyze the Output - Key Sections and Metrics:
  - CHECK PROCESSING DATA:
    - Services Actively Checked / Hosts Actively Checked: How many services and hosts are checked actively (scheduled and run by Nagios) rather than passively. This is the load the scheduler has to start on time.
    - Service Inter-Check Delay / Host Inter-Check Delay: The delay Nagios inserts between consecutive checks to spread them out.
  - EVENT QUEUE DATA: (More relevant if you have many scheduled events or use an event broker)
    - Queued Events: Number of events (like checks, notifications) waiting to be processed. If this is consistently high, Nagios is falling behind.
    - HIGH Latency Events: Events that took too long to process.
    - AVG Latency Events / MAX Latency Events: Average and maximum time events spent in the queue. High latency means delays in checks and notifications.
  - SERVICE CHECK DATA / HOST CHECK DATA:
    - Checks Last 1/5/15/60 Min: Number of checks performed in these time windows. Gives an idea of check velocity.
    - Latency Last 1/5/15/60 Min: Average check latency in these windows. Latency is the delay between the time a check was scheduled to run and the time it actually started; it is distinct from execution time, which measures how long the plugin itself ran. This is a critical metric. Rising latency means Nagios cannot start checks on schedule, typically because of an overloaded server, too many checks, or slow plugins that keep workers busy.
    - Service Max Latency / Host Max Latency: The largest latency of any single check. Helps identify outliers.
    - Avg Service Check Latency / Avg Host Check Latency: Overall average latency for checks. Aim to keep these low (e.g., under 1-2 seconds for most environments, much lower for highly optimized ones).
  - Buffer Usage (Might be in EXTERNAL COMMAND DATA or a separate section depending on Nagios version and broker usage):
    - Metrics like buffer_slots_used, buffer_slots_free, total_buffer_slots.
    - If buffers (e.g., for external commands, check results) are consistently full, it's a sign of overload.
- Interpreting the Metrics for Potential Issues:
  - High Check Latencies (e.g., Avg Service Check Latency > few seconds): Nagios is not starting checks on schedule.
    - Investigate slow plugins: Use plugin timeouts, optimize custom scripts.
    - Network issues to remote hosts.
    - Nagios server CPU/Disk I/O bound.
    - Too many checks scheduled too frequently.
  - High Event Queue Latency/Many Queued Events:
    - Nagios core processing is a bottleneck.
    - Consider if event broker modules are slowing things down.
    - Server resources (CPU mainly).
  - Many Actively Checked Services/Hosts together with high latencies:
    - The active check load may be more than the server can start on schedule, leading to a backlog.
    - Consider longer check intervals, passive checks, or distributing check execution.
  - nagiostats shows "N/A" for some values: This can happen if Nagios has just restarted or if certain features (like an event broker) are not heavily used or configured.
- Run nagiostats Periodically: To understand trends, run nagiostats at different times, especially during peak load, and compare the output. You could even script this to collect data over time.
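Scripting the collection could look like the sketch below. It relies on the MRTG output mode of nagiostats (--mrtg with --data=...), which prints the requested variables one per line, with times in milliseconds. The helper names, the CSV location, and the thresholds are assumptions for illustration, and the nagiostats call is left commented out so the helpers can be tried on a machine without Nagios.

```shell
#!/bin/bash
# Hypothetical helper: classify an average latency value (milliseconds)
# against warning and critical thresholds. Prints OK, WARNING or CRITICAL.
classify_latency_ms() {
    local avg_ms="$1" warn_ms="$2" crit_ms="$3"
    if [ "$avg_ms" -ge "$crit_ms" ]; then
        echo "CRITICAL"
    elif [ "$avg_ms" -ge "$warn_ms" ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

# Hypothetical helper: append one timestamped sample to a CSV file.
#   $1 = CSV path, $2 = average latency (ms), $3 = average execution time (ms)
record_sample() {
    printf '%s,%s,%s\n' "$(date +%s)" "$2" "$3" >> "$1"
}

# Example collection step (not executed here; paths follow this guide's
# source install). AVGACTSVCLAT and AVGACTSVCEXT are the average active
# service check latency and execution time.
# readarray -t vals < <(/usr/local/nagios/bin/nagiostats \
#     -c /usr/local/nagios/etc/nagios.cfg --mrtg --data=AVGACTSVCLAT,AVGACTSVCEXT)
# record_sample /tmp/nagiostats.csv "${vals[0]}" "${vals[1]}"
# classify_latency_ms "${vals[0]}" 1000 5000
```

Run from cron every few minutes, a script like this builds a small time series that makes peak-load latency visible at a glance.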
Outcome:
By running nagiostats and examining its output, you've gained insight into:
- The volume of checks your Nagios instance is performing.
- The latency of these checks (how late they start relative to their schedule), which is a primary indicator of performance.
- The load on Nagios's internal event queue.
This information is the first step in diagnosing performance problems. If nagiostats reveals high latencies or queue buildups, you would then proceed to investigate the causes using techniques discussed in the "Performance Tuning and Scaling Nagios" theory section (e.g., checking server resources, optimizing plugins, adjusting check intervals, or considering distributed monitoring). This workshop equips you to gather the necessary baseline data.
Security Best Practices for Nagios
Securing your Nagios installation is paramount, as it has deep visibility into your infrastructure and can potentially execute commands. A compromised Nagios server could be a launchpad for wider attacks.
Key Security Areas:
-
Secure the Nagios Server OS:
- Minimal Installation: Install only necessary packages.
- Regular Updates: Keep the OS and all packages patched.
- Firewall: Use a host-based firewall (e.g., ufw, firewalld) to restrict access to necessary ports only (SSH, HTTP/HTTPS for web UI, NRPE/NSCA if applicable).
- Strong Passwords & SSH Key Authentication: For server access.
- Intrusion Detection/Prevention Systems (IDS/IPS): Consider deploying them.
- Disable Unused Services.
-
Secure the Web Interface:
- HTTPS: Always use HTTPS (SSL/TLS) to encrypt web traffic to the Nagios UI. Configure Apache/Nginx with a valid SSL certificate (e.g., from Let's Encrypt).
- Strong Authentication:
  - Use strong passwords for the htpasswd users accessing the web UI.
  - Change the default nagiosadmin username.
  - Store the htpasswd file securely with restricted permissions.
- Restrict Access:
  - In Apache/Nginx config, limit access to the /nagios URL to specific IP addresses or internal networks if possible (in Apache, for example, with Require ip <your_admin_network>).
- CGI Security (cgi.cfg):
  - use_authentication=1 (Ensure authentication is enabled).
  - Restrict authorized_for_* directives: Grant command execution rights (authorized_for_all_host_commands, authorized_for_all_service_commands, etc.) only to highly trusted administrator accounts. Avoid giving these rights to read-only users.
  - escape_html_tags=1: Keep this enabled (default) to prevent XSS vulnerabilities from plugin output, unless you fully trust and sanitize all plugin outputs.
-
Secure Nagios Core Configuration and Processes:
- Run as Unprivileged User: Nagios should run as a dedicated unprivileged user (e.g., nagios). The make install process usually sets this up.
- File Permissions:
  - /usr/local/nagios/etc/ (config files): Readable by Nagios user, writable only by root/admin. Sensitive info like passwords in resource.cfg should be highly restricted.
  - /usr/local/nagios/libexec/ (plugins): Executable by Nagios user. Writable only by root/admin.
  - /usr/local/nagios/var/rw/nagios.cmd (external command file): Writable by Nagios user and the web server user (if external commands from UI are allowed). Permissions are critical here (the setgid bit on the var/rw directory is set by make install-commandmode).
- Secure External Commands: Be extremely cautious if allowing external commands via the web UI or other means. This is a powerful feature that can be abused.
- Keep environment macros disabled (enable_environment_macros=0 in nagios.cfg) unless absolutely necessary for a plugin, as they can be a vector for injecting commands if plugins are not written carefully.
-
Secure Check Agents and Protocols:
- NRPE:
  - Use SSL/TLS encryption for NRPE communication (compile NRPE with SSL).
  - In nrpe.cfg on clients: allowed_hosts should strictly list only your Nagios server IP(s).
  - dont_blame_nrpe=0 (default): Do not allow command arguments from Nagios server. Define full commands with arguments in the client's nrpe.cfg. If you must allow arguments (dont_blame_nrpe=1), be extremely careful about what commands are exposed and validate inputs in your plugins.
  - Firewall NRPE port (5666) on clients to only allow Nagios server(s).
- NSClient++ (for Windows):
  - Use SSL/TLS for communication (e.g., when using NRPE listener).
  - In nsclient.ini: allowed hosts should list Nagios server IP(s).
  - If allowing arguments from Nagios, be cautious. Define secure aliases.
  - Use strong passwords if using older protocols like check_nt.
  - Firewall NSClient++ port on clients.
- NSCA:
  - Use encryption (e.g., DES, 3DES, or AES if compiled with libmcrypt; XOR is weak). Use strong passwords.
  - In nsca.cfg on server: Define allowed_hosts if your NSCA version supports it, or firewall the NSCA port (5667) to only allow trusted submitters.
- SNMP:
  - Use SNMPv3 (which provides encryption and authentication) instead of SNMPv1/v2c (which use plain-text community strings).
  - If using SNMPv1/v2c, use strong, non-default community strings.
  - Restrict SNMP access on devices to only the Nagios server's IP using ACLs.
-
Plugin Security:
- Source Plugins Carefully: Use official Nagios plugins or well-vetted community plugins. Be cautious with plugins from untrusted sources.
- Audit Custom Plugins: If writing custom plugins, audit them for security vulnerabilities (e.g., command injection, insecure handling of arguments). Sanitize all external input.
- Principle of Least Privilege: Plugins should run with the minimum privileges necessary.
-
Backup Nagios Configuration: Regularly back up /usr/local/nagios/etc/ and any custom plugins. Store backups securely. Consider version control (Git) for /usr/local/nagios/etc/.
Monitoring and Auditing:
- Monitor Nagios server logs and system logs for suspicious activity.
- Audit Nagios configurations regularly for security misconfigurations.
- The main Nagios log (nagios.log, with appropriate verbosity) can show external commands executed, notifications sent, state changes, etc.
By implementing these security best practices, you can significantly reduce the risk of your Nagios monitoring system being compromised. Security is an ongoing process, not a one-time setup.
Workshop Securing Nagios Web UI with HTTPS (Self-Signed Cert)
Objective:
Configure the Apache web server for your Nagios installation to use HTTPS with a self-signed SSL certificate. This encrypts the web traffic between your browser and the Nagios UI.
Note:
Self-signed certificates will cause browser warnings. For production, obtain a certificate from a trusted Certificate Authority (CA) or use Let's Encrypt. This workshop focuses on the mechanism.
Prerequisites:
- Working Nagios Core installation with Apache web server.
- `openssl` command-line tool installed (usually default on Linux).
- Apache's SSL module (`mod_ssl`) enabled.
Steps:
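- Enable Apache SSL Module (if not already enabled): on Debian/Ubuntu, run `sudo a2enmod ssl`. (On RHEL/CentOS, this might be `sudo yum install mod_ssl`, and then ensure it's loaded.)
- Create a Directory for SSL Certificates: for example, `sudo mkdir -p /etc/apache2/ssl`, the path used in the following steps.
- Generate a Self-Signed SSL Certificate and Private Key: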
Use `openssl` to generate a key and a certificate.
```shell
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -keyout /etc/apache2/ssl/nagios.key \
    -out /etc/apache2/ssl/nagios.crt
```
- `req -x509`: Create a self-signed X.509 certificate instead of a certificate signing request.
- `-nodes`: Do not encrypt the private key (so Apache can read it without a passphrase at startup). For higher security, you can encrypt the key, but Apache will then need the passphrase at each start.
- `-days 365`: Certificate validity (1 year).
- `-newkey rsa:2048`: Generate a new 2048-bit RSA private key.
- `-keyout /etc/apache2/ssl/nagios.key`: Path to save the private key.
- `-out /etc/apache2/ssl/nagios.crt`: Path to save the certificate.
You will be prompted for information for the certificate (Country Name, State, Locality, Organization, Common Name, etc.).
- Common Name (CN): This is important. Enter the FQDN or IP address of your Nagios server (how you access it in the browser, e.g., `nagios.yourdomain.com` or `192.168.1.100`). If it does not match the address you use, browsers will show a name-mismatch warning in addition to the self-signed warning.
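Current browsers generally match the host name against the certificate's subjectAltName extension rather than the CN, so it may help to set both. The sketch below does this without the interactive prompts. It assumes OpenSSL 1.1.1 or newer (for `-addext`), and the function name and its two parameters are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: generate a self-signed certificate whose CN and subjectAltName
# both carry the given host name. Requires OpenSSL 1.1.1+ for -addext.
generate_self_signed_cert() {
    local host="$1"   # FQDN used in the browser, e.g. nagios.example.com
    local dir="$2"    # directory that receives nagios.key and nagios.crt
    # For an IP address, the SAN entry would need to be "IP:<address>"
    # instead of "DNS:<name>".
    openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
        -keyout "$dir/nagios.key" \
        -out "$dir/nagios.crt" \
        -subj "/CN=$host" \
        -addext "subjectAltName=DNS:$host"
}
```

From a root shell on the workshop server, this would be called as `generate_self_signed_cert nagios.example.com /etc/apache2/ssl`. The result can be inspected with `openssl x509 -in /etc/apache2/ssl/nagios.crt -noout -text`.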
- Set Permissions for Key and Certificate:
The private key must be protected.
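One way to apply this is shown below as a small function so it can be tried on scratch files first. The function name and parameters are illustrative, and the chosen modes (600 for the key, 644 for the certificate) are a common convention rather than something this guide prescribes.

```shell
#!/usr/bin/env bash
# Sketch: tighten permissions on a private key / certificate pair.
secure_tls_files() {
    local key="$1" cert="$2"
    chmod 600 "$key" || return 1    # private key: owner read/write only
    chmod 644 "$cert" || return 1   # certificate is public, may be world-readable
    ls -l "$key" "$cert"            # show the resulting modes
}
```

On the workshop server this amounts to running `sudo chmod 600 /etc/apache2/ssl/nagios.key` and `sudo chmod 644 /etc/apache2/ssl/nagios.crt`, with both files owned by root.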
- Configure Apache to Use SSL for Nagios:
You need to modify your Apache configuration for Nagios. This is often in `/etc/apache2/sites-enabled/nagios.conf` (Debian/Ubuntu) or `/etc/httpd/conf.d/nagios.conf` (RHEL/CentOS). We will create a new virtual host configuration for HTTPS or modify the existing one. A common approach is to redirect HTTP to HTTPS.

Edit your existing Nagios Apache config file or the default SSL config file. For Debian/Ubuntu, Apache usually has a `default-ssl.conf` in `sites-available`. We can create a dedicated SSL vhost for Nagios. Let's assume you modify `/etc/apache2/sites-enabled/nagios.conf`. Back up the current file first:
```shell
sudo cp /etc/apache2/sites-enabled/nagios.conf /etc/apache2/sites-enabled/nagios.conf.backup
sudo nano /etc/apache2/sites-enabled/nagios.conf
```
Modify it to look something like this (this example sets up an HTTPS virtual host on port 443):
```apache
# Original HTTP VirtualHost for redirect (optional, but good practice)
<VirtualHost *:80>
    # e.g., nagios.example.com or server's IP
    ServerName your_nagios_server_fqdn_or_ip
    Redirect permanent /nagios https://your_nagios_server_fqdn_or_ip/nagios
    # If you want to redirect everything on port 80 to HTTPS:
    # Redirect permanent / https://your_nagios_server_fqdn_or_ip/
</VirtualHost>

# HTTPS VirtualHost for Nagios
<VirtualHost *:443>
    # Must match CN in cert or be covered by it
    ServerName your_nagios_server_fqdn_or_ip
    SSLEngine on
    SSLCertificateFile /etc/apache2/ssl/nagios.crt
    SSLCertificateKeyFile /etc/apache2/ssl/nagios.key

    # Nagios Specific Configuration (should be similar to your existing HTTP config)
    ScriptAlias /nagios/cgi-bin "/usr/local/nagios/sbin"
    <Directory "/usr/local/nagios/sbin">
        SSLRequireSSL
        Options ExecCGI
        AllowOverride None
        <IfVersion >= 2.3>
            <RequireAll>
                Require all granted
                # For Apache 2.4 Basic Authentication
                AuthType Basic
                AuthName "Nagios Access"
                AuthUserFile /usr/local/nagios/etc/htpasswd.users
                Require valid-user
            </RequireAll>
        </IfVersion>
        <IfVersion < 2.3>
            Order allow,deny
            Allow from all
            # For Apache 2.2 Basic Authentication
            AuthType Basic
            AuthName "Nagios Access"
            AuthUserFile /usr/local/nagios/etc/htpasswd.users
            Require valid-user
        </IfVersion>
    </Directory>

    Alias /nagios "/usr/local/nagios/share"
    <Directory "/usr/local/nagios/share">
        SSLRequireSSL
        Options None
        AllowOverride None
        <IfVersion >= 2.3>
            <RequireAll>
                Require all granted
                # For Apache 2.4 Basic Authentication
                AuthType Basic
                AuthName "Nagios Access"
                AuthUserFile /usr/local/nagios/etc/htpasswd.users
                Require valid-user
            </RequireAll>
        </IfVersion>
        <IfVersion < 2.3>
            Order allow,deny
            Allow from all
            # For Apache 2.2 Basic Authentication
            AuthType Basic
            AuthName "Nagios Access"
            AuthUserFile /usr/local/nagios/etc/htpasswd.users
            Require valid-user
        </IfVersion>
    </Directory>
</VirtualHost>
```
- Replace `your_nagios_server_fqdn_or_ip` with the actual FQDN or IP address.
- `SSLRequireSSL` inside `<Directory>` blocks ensures these directories are only accessed over SSL.
- This example keeps the Nagios alias and CGI script configurations, wrapping them in an SSL-enabled VirtualHost.
- If you have `Listen 80` and `Listen 443` in `ports.conf` (Debian/Ubuntu) or `httpd.conf` (RHEL/CentOS), these VirtualHosts should work.
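- Test Apache Configuration and Restart Apache: on Debian/Ubuntu, run `sudo apache2ctl configtest` and, if it reports `Syntax OK`, restart with `sudo systemctl restart apache2` (on RHEL/CentOS, `sudo apachectl configtest` and `sudo systemctl restart httpd`).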
- Test Accessing Nagios UI via HTTPS:
Open your web browser and navigate to `https://your_nagios_server_fqdn_or_ip/nagios/`.
- Browser Warning: You will likely see a security warning because the certificate is self-signed (not trusted by a public CA). This is expected for this workshop.
- You'll need to accept the risk and proceed (e.g., "Advanced" -> "Proceed to ... (unsafe)").
- Once you proceed, you should see the Nagios login prompt.
- Log in. The connection should now be encrypted (look for `https://` and a padlock icon, though it might have a warning overlay due to the self-signed cert).
- Test if HTTP access to `/nagios` redirects to HTTPS, if you configured the redirect.
Outcome:
You have successfully:
- Generated a self-signed SSL certificate and private key.
- Configured Apache to serve the Nagios web interface over HTTPS on port 443.
- (Optionally) Configured a redirect from HTTP to HTTPS for the Nagios URL.
While this uses a self-signed certificate (not for production public sites), it demonstrates the core steps for enabling SSL/TLS, which is crucial for securing sensitive Nagios web traffic. For production, replace the self-signed certificate with one from a trusted CA (e.g., Let's Encrypt is free and widely used).
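Because the workshop certificate is only valid for 365 days, it can be worth checking its remaining lifetime. The following is a rough sketch of such a check in the style of a Nagios plugin. The function name, messages, and threshold handling are assumptions; it relies on the `-checkend` option of `openssl x509`.

```shell
#!/usr/bin/env bash
# Sketch: report whether a certificate expires within a given number of days.
# Return codes follow the plugin convention (0 OK, 1 WARNING, 3 UNKNOWN).
cert_expires_within() {
    local cert="$1" days="$2"
    if [ ! -r "$cert" ]; then
        echo "UNKNOWN - cannot read $cert"
        return 3
    fi
    # -checkend takes seconds; exit status 0 means the certificate
    # will still be valid after that many seconds.
    if openssl x509 -in "$cert" -noout -checkend "$(( days * 86400 ))" >/dev/null; then
        echo "OK - certificate valid for more than $days days"
        return 0
    fi
    echo "WARNING - certificate expires within $days days"
    return 1
}
```

For example, `cert_expires_within /etc/apache2/ssl/nagios.crt 30` would start warning a month before the self-signed certificate needs to be regenerated.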
Extending Nagios with Addons (Brief Overview)
While Nagios Core provides a powerful monitoring engine, its functionality can be significantly extended and enhanced using various addons. These addons can provide features like advanced graphing, alternative web interfaces, distributed monitoring capabilities, and more. Here's a brief overview of some popular ones:
- Graphing and Visualization:
- PNP4Nagios: One of the most popular addons for graphing performance data collected by Nagios. It uses RRDtool (Round Robin Database Tool) to store and render graphs. It integrates well with Nagios and can display graphs directly within the Nagios UI (with some CGI modifications) or via its own web interface.
- How it works: Nagios writes perfdata to files. NPCD (Nagios Perfdata C Daemon), a bulk mode processor for PNP4Nagios, picks up these files and feeds data into RRDtool databases.
- NagVis: A powerful visualization addon that allows you to create custom maps and diagrams (e.g., network topology, datacenter layout, application flow) with Nagios status information overlaid. Status icons change color based on host/service states.
- Grafana with InfluxDB/Prometheus: A very popular modern approach. Nagios can send perfdata to a time-series database like InfluxDB (using a perfdata script or Telegraf) or be scraped by Prometheus (using an exporter). Grafana then connects to these databases to create rich, interactive dashboards. This offers more flexibility and power than older RRDtool-based solutions but requires setting up a separate TIG/Prometheus stack.
- Alternative Web Interfaces:
- Thruk: A modern, feature-rich web interface for Nagios (and other monitoring backends like Icinga, Naemon). It offers faster performance than the classic CGIs, a more customizable UI, advanced filtering, reporting, and multi-backend support (can connect to several Nagios instances).
- Nagios V-Shell: A PHP-based frontend for Nagios that aims to be faster and more user-friendly than the standard CGIs.
- Distributed Monitoring and Scaling:
- Mod_Gearman: As mentioned in scaling, this addon distributes Nagios check execution across multiple worker nodes using the Gearman job queue system. This significantly enhances the capacity of a Nagios setup.
- Nagios Core acts as the scheduler and submits check jobs to Gearman.
- Gearman workers (can be on separate servers) pick up jobs, execute plugins, and return results.
- DNX (Distributed Nagios eXecutor): An alternative framework for distributing Nagios checks.
- Configuration Management:
- While not strictly Nagios addons, tools like Ansible, Puppet, Chef, or SaltStack are often used to manage Nagios configuration files, especially in larger environments. They allow for templated, automated, and version-controlled deployment of host, service, and other object definitions.
- NConf, NagiosQL: Web-based configuration tools that allow you to manage Nagios object definitions through a GUI, storing them in a database and then generating the Nagios flat config files. These can simplify configuration for users less comfortable with direct file editing but add another layer of complexity.
- Database Integration:
- NDOUtils (Nagios Data Output Utilities): A broker module that exports Nagios status and historical data to a MySQL or PostgreSQL database. This data can then be used by other addons (like Nagios V-Shell, some reporting tools) or for custom querying and reporting. It's a core component for many advanced Nagios setups.
Choosing Addons:
- Identify your needs: What specific functionality are you missing? (e.g., better graphing, faster UI, scalability).
- Complexity: Some addons are simple to install, while others (like Mod_Gearman or a full Grafana stack) are more involved.
- Community and Support: Look for well-maintained addons with active communities.
- Compatibility: Ensure the addon is compatible with your Nagios Core version.
Installing and configuring these addons typically involves downloading them, following their specific installation instructions (which might include compiling code, installing dependencies, configuring Nagios broker modules, and setting up web server configurations), and then integrating them with your Nagios Core setup. Each addon has its own learning curve.
Starting with PNP4Nagios for graphing is often a good first step for extending Nagios, as visual data trends are very valuable.
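Graphing addons such as PNP4Nagios work from the performance data that plugins print after the `|` character, in the form `label=value[UOM];warn;crit;min;max`. To make that format concrete, here is a minimal sketch that extracts label and value pairs from one line of plugin output. The function name is hypothetical, and the parser deliberately ignores quoted labels containing spaces.

```shell
#!/usr/bin/env bash
# Sketch: print each perfdata metric in a plugin output line as "label value".
parse_perfdata() {
    local output="$1"
    # No "|" means the plugin emitted no performance data.
    [[ "$output" == *"|"* ]] || return 0
    local perfdata="${output#*|}"   # everything after the first "|"
    local item label value
    # Metrics are separated by whitespace (unquoted expansion splits them).
    for item in $perfdata; do
        label="${item%%=*}"
        value="${item#*=}"
        value="${value%%;*}"        # drop the ;warn;crit;min;max fields
        echo "$label $value"        # value keeps any unit suffix, e.g. 85%
    done
}
```

Calling `parse_perfdata "OK - load average: 0.42 | load1=0.42;5;10;0; load5=0.35;4;8;0;"` prints `load1 0.42` and `load5 0.35` on separate lines, which is essentially the series a graphing backend would store over time.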
Conclusion for Advanced Nagios
This section has taken you through several advanced aspects of Nagios management and optimization. You've learned about the flexibility of passive checks with NSCA for monitoring asynchronous events and firewalled services. We explored the power of event handlers for automated diagnostics and potential remediation, emphasizing the need for caution and security. Performance tuning strategies, from hardware considerations to configuration tweaks and distributed monitoring concepts, were discussed to help you scale your Nagios instance effectively. Critical security best practices were highlighted to protect your monitoring infrastructure. Finally, a brief overview of popular addons showed how Nagios Core's capabilities can be significantly extended.
By mastering these advanced techniques, you can transform Nagios from a basic monitoring tool into a highly sophisticated, efficient, and integral part of your IT operations, capable of handling complex environments and proactively contributing to system stability and reliability. Continuous learning and adaptation are key, as the landscape of IT infrastructure and monitoring tools continues to evolve.