Author Nejat Hakan
eMail nejat.hakan@outlook.de
PayPal Me https://paypal.me/nejathakan


Network/Server Monitoring Nagios

Introduction to Nagios

Welcome to this comprehensive guide on Network and Server Monitoring with Nagios. In the world of IT infrastructure management, proactive monitoring is not just a best practice; it's a necessity. Downtime can lead to significant financial losses, damage to reputation, and decreased productivity. Nagios stands as one of the most established, powerful, and flexible open-source monitoring solutions available, empowering administrators to detect and resolve IT infrastructure problems before they affect critical business processes.

This guide is designed for university students and aspiring system administrators who wish to delve deep into Nagios, understand its architecture, learn how to install and configure it from scratch (self-hosting), and master its various features to monitor diverse environments effectively. We will cover everything from the fundamental concepts to advanced techniques, ensuring you gain practical, hands-on experience through detailed workshops.

What is Nagios?

Nagios, specifically Nagios Core, is an open-source application that provides monitoring and alerting services for servers, switches, applications, and services. It was originally created by Ethan Galstad and is now maintained by Nagios Enterprises together with community contributors. Nagios doesn't perform any monitoring itself; instead, it relies on plugins to do the actual work. It acts as a scheduler, a state manager, an alerter, and a central dashboard for the information gathered by these plugins.

The primary goals of Nagios are:

  1. Monitoring: To continuously check the status of hosts (servers, network devices) and the services running on them.
  2. Alerting: To notify administrators when problems arise, allowing for rapid response.
  3. Reporting: To provide historical data and reports on availability, performance, and incidents.
  4. Visibility: To offer a centralized view of the entire IT infrastructure's health.

Why is Monitoring Important?

Effective monitoring offers numerous benefits:

  • Proactive Problem Detection: Identify issues before they escalate into major outages.
  • Reduced Downtime: Faster problem resolution leads to increased availability of services.
  • Capacity Planning: Track resource utilization (CPU, memory, disk, network bandwidth) to predict future needs.
  • SLA Management: Verify that Service Level Agreements (SLAs) are being met.
  • Security: Monitor for unauthorized changes or suspicious activities.
  • Troubleshooting: Provide valuable data to diagnose complex problems quickly.
  • Peace of Mind: Knowing that your systems are being watched over, even when you're not actively looking.

Core Concepts of Nagios

Understanding these fundamental concepts is crucial before diving into the practical aspects:

  • Hosts: These are physical or virtual devices on your network that you want to monitor. Examples include servers, workstations, routers, switches, printers, etc. Each host has an address (IP or FQDN) and can be in states like UP, DOWN, or UNREACHABLE.
  • Services: These are specific functionalities or resources associated with a host. Examples include CPU load, disk usage, memory usage, a running web server (HTTP), an SSH daemon, a specific process, or network connectivity (PING). Services have states like OK, WARNING, CRITICAL, or UNKNOWN.
  • Plugins: These are external, executable scripts or programs that perform the actual checks. Nagios Core calls these plugins to determine the status of a host or service. Plugins return an exit code (indicating status) and output text (providing details). There are thousands of plugins available for almost any conceivable check.
  • Commands: Nagios uses command definitions to specify how to execute plugins. These definitions include the plugin's path and any arguments it requires.
  • Checks:
    • Active Checks: These are initiated by the Nagios server. Nagios schedules and executes plugins at regular intervals to check the status of hosts and services.
    • Passive Checks: These are initiated by external applications or processes on the monitored hosts. The results are submitted to Nagios for processing. This is useful for monitoring asynchronous events or services behind restrictive firewalls.
  • States:
    • Host States: UP (responding to checks), DOWN (the host itself is not responding), UNREACHABLE (an intermediate host, such as a router defined as its parent, is down, so Nagios cannot reach the target host).
    • Service States: OK (functioning correctly), WARNING (potential issue or approaching a threshold), CRITICAL (serious issue, service likely unavailable), UNKNOWN (unable to determine status, often due to plugin errors or misconfiguration).
    • State Types:
      • Soft State: A temporary, unconfirmed state. When a host or service first changes to a problem state, it enters a soft state, and Nagios keeps re-checking it until the number of checks set by max_check_attempts has been reached.
      • Hard State: A confirmed, persistent state. If the problem is still present once max_check_attempts checks have been made, the state becomes hard. Notifications are sent only for hard states (including recoveries from them). Event handlers also run on soft state changes, so an automated fix can be attempted before anyone is notified.
  • Notifications: When a host or service enters a hard problem state (or recovers), Nagios can send notifications to designated contacts (e.g., administrators) via various methods like email, SMS, or custom scripts.
  • Contacts and Contact Groups: Contacts are individuals who receive notifications. Contact groups are collections of contacts, simplifying notification management.
  • Timeperiods: These define when Nagios is allowed to perform checks or send notifications (e.g., "24x7", "workhours", "nonworkhours").
  • Event Handlers: Optional scripts that can be executed when a host or service changes state, allowing for automated remediation attempts (e.g., restarting a failed service).
  • NRPE (Nagios Remote Plugin Executor): A common addon used to execute Nagios plugins on remote Linux/Unix hosts. The Nagios server uses the check_nrpe plugin to connect to an NRPE daemon running on the remote host, which then executes local plugins.
  • NSClient++: A versatile agent often used for monitoring Windows machines. It can act as an NRPE daemon, an NSCA client, and has its own built-in checks.
  • NSCA (Nagios Service Check Acceptor): A daemon that runs on the Nagios server to accept passive check results submitted by external applications using the send_nsca client.
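
The soft/hard distinction described under State Types is easier to follow with a trace. The following is a deliberately simplified sketch, not Nagios's actual state machine: classify_states is a hypothetical helper, it treats every OK result as a hard state, and it ignores transitions between different problem states. Its first argument plays the role of the max_check_attempts directive.

```shell
#!/bin/sh
# Simplified illustration of soft vs. hard states (NOT Nagios's real logic).
# Usage: classify_states MAX_CHECK_ATTEMPTS RESULT [RESULT...]
classify_states() {
    max_attempts=$1
    shift
    attempt=0
    for result in "$@"; do
        if [ "$result" = "OK" ]; then
            # Simplification: any OK result resets the counter and counts as hard.
            attempt=0
            echo "$result HARD"
            continue
        fi
        # Count consecutive problem results, capped at the retry limit.
        if [ "$attempt" -lt "$max_attempts" ]; then
            attempt=$((attempt + 1))
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "$result HARD (attempt $attempt/$max_attempts)"
        else
            echo "$result SOFT (attempt $attempt/$max_attempts)"
        fi
    done
}

# With a limit of 3, the first two CRITICAL results stay soft.
classify_states 3 OK CRITICAL CRITICAL CRITICAL CRITICAL OK
```

In this trace only the third consecutive CRITICAL result is reported as hard, which is the point at which a notification would normally go out.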

Nagios Architecture Overview

A typical Nagios setup involves:

  1. The Nagios Server:
    This is the central machine where Nagios Core is installed. It is responsible for:

    • Scheduling Checks: Deciding when and how often to check hosts and services.
    • Executing Checks: Running plugins (either locally or via agents like NRPE for remote checks).
    • Processing Check Results: Determining the status of hosts/services based on plugin output.
    • State Management: Tracking current and historical states.
    • Event Correlation & Handling: Managing dependencies, escalations, and event handlers.
    • Notification Engine: Sending alerts to appropriate contacts.
    • Web Interface (CGI): Providing a visual dashboard for users to view status, history, and reports.
  2. Monitored Hosts:
    These are the remote machines or devices being monitored. They might have agents installed (like NRPE or NSClient++) to allow Nagios to execute plugins locally on them.

  3. Plugins:
    Reside on the Nagios server (for local checks or checks like PING/HTTP) and potentially on monitored hosts (executed by agents).

Data Flow (Active Check Example):

  1. Nagios Scheduler determines it's time to check a service on a remote host.
  2. Nagios process executes the check_nrpe plugin (on the Nagios server).
  3. check_nrpe connects to the NRPE daemon on the remote host.
  4. The NRPE daemon executes a specific local plugin (e.g., check_disk) on the remote host.
  5. The local plugin returns its status and output to the NRPE daemon.
  6. The NRPE daemon sends this information back to check_nrpe on the Nagios server.
  7. check_nrpe provides the result to the Nagios process.
  8. Nagios updates the service status. If a state change warrants it, it triggers notifications or event handlers.
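
Steps 4 and 5 depend on the plugin contract: a plugin prints one line of status text and signals the state through its exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The sketch below shows that contract with a made-up check; check_usage and its three arguments are hypothetical and do not correspond to any plugin shipped with Nagios. The text after the | character is optional performance data.

```shell
#!/bin/sh
# Hypothetical plugin-style function: check_usage VALUE WARN CRIT
# Prints one status line and returns the matching plugin exit code.
check_usage() {
    value=$1
    warn=$2
    crit=$3
    # Anything that is not a plain non-negative integer yields UNKNOWN (3).
    for n in "$value" "$warn" "$crit"; do
        case "$n" in
            ''|*[!0-9]*)
                echo "UNKNOWN - usage: check_usage VALUE WARN CRIT"
                return 3
                ;;
        esac
    done
    perfdata="usage=${value}%;${warn};${crit}"
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - usage ${value}% | $perfdata"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - usage ${value}% | $perfdata"
        return 1
    fi
    echo "OK - usage ${value}% | $perfdata"
    return 0
}

# Demo: one healthy value and one above the critical threshold.
check_usage 42 80 90
check_usage 95 80 90 || echo "exit code: $?"
```

Nagios reads exactly these two things, the exit code and the printed line, whether the plugin runs on the Nagios server or is executed remotely by the NRPE daemon.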

Benefits of Self-Hosting Nagios

While cloud-based monitoring solutions exist, self-hosting Nagios Core offers several advantages, especially for learning and customization:

  • Full Control: You have complete authority over the configuration, data, and security of your monitoring system.
  • Customization: Tailor Nagios precisely to your needs, integrate custom plugins, and modify its behavior extensively.
  • No Vendor Lock-in: Avoid dependency on a specific vendor's roadmap or pricing changes.
  • Cost-Effective: Nagios Core is free and open-source. You only incur costs for the hardware/VM it runs on.
  • Deep Learning Experience: Setting up and managing Nagios from scratch provides invaluable system administration skills.
  • Data Privacy: Monitoring data, which can be sensitive, remains within your infrastructure.
  • Flexibility: Integrate with other internal systems or tools as needed.

Prerequisites for this Guide

To make the most of this guide, you should have:

  • Basic Linux Command-Line Skills:
    Familiarity with navigating directories, editing files, managing packages, and understanding permissions. Most workshops will assume a Debian/Ubuntu-based Linux distribution.
  • Basic Networking Concepts:
    Understanding of IP addresses, TCP/IP, ports, DNS, and firewalls.
  • A Virtualization Environment (Recommended):
    Software like VirtualBox, VMware Workstation/Player, or a cloud provider account (for creating VMs) will be highly beneficial for setting up a Nagios server and test client machines for workshops.
  • Patience and Eagerness to Learn:
    Nagios is powerful but can have a steep learning curve initially. Persistence is key!

By the end of this guide, you will be well-equipped to deploy, configure, and manage a robust Nagios monitoring environment for your self-hosted services or small to medium-sized infrastructures. Let's begin this exciting journey into the world of Nagios!

1. Basic Nagios Setup and Configuration

This section covers the foundational steps to get a Nagios Core server up and running. We will start by installing Nagios Core and its essential plugins, then explore the structure of its configuration files, monitor our first host (the Nagios server itself), and finally set up basic email notifications. These steps are crucial for understanding how Nagios operates and for building more complex monitoring solutions later.

Installing Nagios Core

Installing Nagios Core involves several steps, from preparing the system with necessary dependencies to compiling Nagios and its plugins from source. While some distributions offer Nagios packages, compiling from source gives you the latest version and a better understanding of the components. We will primarily focus on a generic Linux environment, with specific workshop instructions for Debian/Ubuntu.

System Requirements:

  • Operating System: A Linux distribution (e.g., Debian, Ubuntu, CentOS, RHEL).
  • Web Server: Apache HTTP Server (or Nginx, but Apache is more traditionally used and simpler for initial setup with Nagios CGIs).
  • PHP: Required for some web interface features, though the core CGIs are written in C.
  • Compiler and Build Tools: GCC, make, and development libraries (like build-essential on Debian/Ubuntu).
  • GD Graphics Library: For generating status maps and other graphical elements (optional but recommended).
  • Sufficient Resources: At least 1 CPU core, 1GB RAM, and a few GBs of disk space for a small setup. Requirements grow with the number of hosts and services monitored.

Steps Overview:

  1. Install Prerequisites: Ensure your system has Apache, PHP, a C compiler, and essential libraries.
  2. Create Nagios User and Group: For security, Nagios processes should run under a dedicated unprivileged user.
  3. Download Nagios Core and Nagios Plugins: Get the latest stable tarballs from the official Nagios websites.
  4. Compile and Install Nagios Core: Configure, compile, and install the main Nagios application.
  5. Compile and Install Nagios Plugins: These are the scripts Nagios uses to perform checks.
  6. Configure Web Interface: Set up Apache to serve the Nagios web UI and secure it.
  7. Verify Configuration and Start Services: Check for errors and start Nagios and Apache.

Detailed Explanation of Steps:

1. Installing Prerequisites:
The specific packages depend on your Linux distribution.

  • For Debian/Ubuntu based systems:

    sudo apt update
    sudo apt install -y autoconf gcc libc6 make wget unzip apache2 php libapache2-mod-php libgd-dev
    
    This command installs:

    • autoconf, gcc, libc6, make: Standard build tools.
    • wget, unzip: Utilities for downloading and extracting files.
    • apache2: The Apache web server.
    • php, libapache2-mod-php: PHP and the Apache module for PHP.
    • libgd-dev: Development files for the GD graphics library.
  • For RHEL/CentOS based systems (example, package names might vary slightly):

    sudo yum install -y gcc glibc glibc-common make gettext automake autoconf wget openssl-devel net-snmp net-snmp-utils epel-release
    sudo yum install -y httpd php gd gd-devel
    

2. Create Nagios User and Group:
Nagios needs a user and group to run under. Additionally, a group for allowing external commands via the web interface is often created.

sudo useradd nagios
sudo groupadd nagcmd
sudo usermod -a -G nagcmd nagios
sudo usermod -a -G nagcmd www-data  # Or apache, depending on your web server user

  • useradd nagios: Creates a user named nagios.
  • groupadd nagcmd: Creates a group named nagcmd.
  • usermod -a -G nagcmd nagios: Adds the nagios user to the nagcmd group.
  • usermod -a -G nagcmd www-data: Adds the webserver user (e.g., www-data on Debian/Ubuntu, apache on CentOS) to the nagcmd group. This allows the web server to submit commands to Nagios.
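
After running the commands above, it can be worth confirming that the memberships are what you expect. The following is a small sketch built on the standard id command; has_group and report_membership are hypothetical helper names, and the user names are the ones used in this guide (substitute apache for www-data on RHEL/CentOS).

```shell
#!/bin/sh
# has_group GROUP_LIST GROUP: succeed if GROUP appears in the
# space-separated GROUP_LIST (as printed by `id -nG`).
has_group() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# report_membership USER GROUP: print whether USER belongs to GROUP.
report_membership() {
    if ! groups_of_user=$(id -nG "$1" 2>/dev/null); then
        echo "$1: no such user"
        return 0
    fi
    if has_group "$groups_of_user" "$2"; then
        echo "$1: member of $2"
    else
        echo "$1: NOT a member of $2"
    fi
}

report_membership nagios nagcmd
report_membership www-data nagcmd
```

Note that a running Apache process only picks up a new group membership after it has been restarted.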

3. Download Nagios Core and Nagios Plugins: Always check the official Nagios website (nagios.org) for the latest stable versions.

# Example versions, replace with latest
NAGIOS_CORE_VERSION="4.4.14" # Check for the latest stable version
NAGIOS_PLUGINS_VERSION="2.4.8" # Check for the latest stable version

cd /tmp
wget https://github.com/NagiosEnterprises/nagioscore/releases/download/nagios-${NAGIOS_CORE_VERSION}/nagios-${NAGIOS_CORE_VERSION}.tar.gz
wget https://nagios-plugins.org/download/nagios-plugins-${NAGIOS_PLUGINS_VERSION}.tar.gz

tar -zxvf nagios-${NAGIOS_CORE_VERSION}.tar.gz
tar -zxvf nagios-plugins-${NAGIOS_PLUGINS_VERSION}.tar.gz
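
The name of the directory created by tar depends on how the tarball was packaged, so it is safer to ask the archive than to guess. The sketch below uses only tar and sed; tarball_topdir is a hypothetical helper name, and the demo builds a throwaway archive so it can be run anywhere.

```shell
#!/bin/sh
# tarball_topdir FILE: print the top-level directory name stored in a
# gzip-compressed tarball (taken from the first entry in the listing).
tarball_topdir() {
    tar -tzf "$1" | sed -n '1{s|/.*||;p;}'
}

# Demo on a temporary archive.
workdir=$(mktemp -d)
mkdir "$workdir/example-1.0"
echo "demo" > "$workdir/example-1.0/README"
tar -czf "$workdir/example-1.0.tar.gz" -C "$workdir" example-1.0
tarball_topdir "$workdir/example-1.0.tar.gz"   # prints: example-1.0
rm -rf "$workdir"
```

With the real download you could then change into the source tree with cd "/tmp/$(tarball_topdir /tmp/nagios-${NAGIOS_CORE_VERSION}.tar.gz)".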

4. Compile and Install Nagios Core: Navigate into the extracted Nagios Core directory.

cd /tmp/nagios-${NAGIOS_CORE_VERSION}/   # directory created by extracting the release tarball above
sudo ./configure --with-nagios-group=nagios --with-command-group=nagcmd --with-httpd-conf=/etc/apache2/sites-enabled/

  • ./configure: This script checks your system for dependencies and prepares the build environment.
    • --with-nagios-group=nagios: Specifies the Nagios group.
    • --with-command-group=nagcmd: Specifies the group for external commands.
    • --with-httpd-conf=/etc/apache2/sites-enabled/: (For Debian/Ubuntu Apache) Specifies where to install the Apache configuration snippet for Nagios. For RHEL/CentOS, this might be /etc/httpd/conf.d/. Adapt as needed.

If ./configure completes without errors, proceed with compilation and installation:

sudo make all
sudo make install
sudo make install-init     # Installs init script (e.g., /etc/init.d/nagios)
sudo make install-daemoninit # Installs systemd unit file if systemd is detected
sudo make install-config   # Installs SAMPLE configuration files
sudo make install-commandmode # Installs and configures permissions for the external command file
sudo make install-webconf  # Installs Apache config file for Nagios web UI

  • make all: Compiles the Nagios binaries and CGIs.
  • make install: Installs the compiled files, typically into /usr/local/nagios/.
  • make install-init / make install-daemoninit: Installs the service script to manage the Nagios daemon (start, stop, restart). The latter is for systems using systemd.
  • make install-config: Installs sample configuration files in /usr/local/nagios/etc/. Important: These are samples; you'll customize them. If you're upgrading, you might skip this or back up existing configs.
  • make install-commandmode: Sets up the directory and permissions for Nagios to process external commands.
  • make install-webconf: Installs an Apache configuration file (e.g., nagios.conf) into the directory specified by --with-httpd-conf or a default location.

5. Compile and Install Nagios Plugins: Nagios Core needs plugins to actually perform checks.

cd /tmp/nagios-plugins-${NAGIOS_PLUGINS_VERSION}/
sudo ./configure --with-nagios-user=nagios --with-nagios-group=nagios --with-openssl
sudo make
sudo make install

  • ./configure: Prepares plugins for compilation.
    • --with-nagios-user=nagios and --with-nagios-group=nagios: Sets user/group ownership for some plugins.
    • --with-openssl: Enables SSL/TLS support for plugins that require it (e.g., check_http for HTTPS).
  • make: Compiles the plugins.
  • make install: Installs plugins, typically into /usr/local/nagios/libexec/.

6. Configure Web Interface:

  • Enable Apache Modules: For Apache, CGI and rewrite modules are often needed.

    sudo a2enmod cgi rewrite  # For Debian/Ubuntu
    sudo systemctl restart apache2
    
    For RHEL/CentOS, ensure mod_cgi is loaded. mod_rewrite is also good practice.

  • Create Web Admin User: Nagios web interface access is typically protected by Basic Authentication.

    sudo htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin
    
    This command creates a new password file (htpasswd.users) and adds a user nagiosadmin. You'll be prompted to enter a password for this user. For subsequent users, omit the -c flag. The path /usr/local/nagios/etc/htpasswd.users is a common location, but it's defined in the Apache configuration for Nagios (e.g., in /etc/apache2/sites-enabled/nagios.conf). Ensure consistency.

  • Review Apache Configuration for Nagios: The make install-webconf step should have created a file like /etc/apache2/sites-enabled/nagios.conf (Debian/Ubuntu) or /etc/httpd/conf.d/nagios.conf (RHEL/CentOS). Open this file and review it. Key things to check:

    • ScriptAlias /nagios/cgi-bin/ /usr/local/nagios/sbin/
    • Alias /nagios /usr/local/nagios/share/
    • <Directory> directives for /usr/local/nagios/sbin/ and /usr/local/nagios/share/ setting access controls.
    • AuthUserFile should point to /usr/local/nagios/etc/htpasswd.users.
    • AuthName "Nagios Access"
    • AuthType Basic
    • Require valid-user
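
The review can be partly automated with grep. This is a minimal sketch, assuming the directives are written as in the list above; check_webconf is a hypothetical helper, and the sample file in the demo is a shortened stand-in for the real nagios.conf.

```shell
#!/bin/sh
# check_webconf FILE: report which expected directives are present in FILE.
# Returns non-zero if any of them is missing.
check_webconf() {
    conf=$1
    missing=0
    for directive in "ScriptAlias" "AuthUserFile" "AuthType" "Require valid-user"; do
        # Directives may be indented; Apache treats their names case-insensitively.
        if grep -qi -- "^[[:space:]]*$directive" "$conf"; then
            echo "found:   $directive"
        else
            echo "missing: $directive"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Demo against a small sample file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ScriptAlias /nagios/cgi-bin/ /usr/local/nagios/sbin/
<Directory "/usr/local/nagios/sbin">
   AuthName "Nagios Access"
   AuthType Basic
   AuthUserFile /usr/local/nagios/etc/htpasswd.users
   Require valid-user
</Directory>
EOF
check_webconf "$sample" && echo "all expected directives present"
rm -f "$sample"
```

On a real system you would point the function at /etc/apache2/sites-enabled/nagios.conf or /etc/httpd/conf.d/nagios.conf instead of the sample file.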

7. Verify Configuration and Start Services:

  • Verify Nagios Configuration: Before starting Nagios, it's crucial to verify its configuration.

    sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    
    This command will parse all your Nagios configuration files and report any errors. If there are errors, you must fix them before proceeding. "Total Warnings: 0" and "Total Errors: 0" is the goal.

  • Start Nagios Service: If using systemd (common on modern Linux):

    sudo systemctl enable nagios  # Enable Nagios to start on boot
    sudo systemctl start nagios
    sudo systemctl status nagios
    
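    If using init scripts:
    sudo /etc/init.d/nagios start
    # To enable on boot (distribution dependent, e.g., update-rc.d or chkconfig)

    In either case, make sure Apache is enabled at boot and restarted so that it picks up the Nagios configuration:
    sudo systemctl enable apache2 # or httpd
    sudo systemctl restart apache2 # or httpd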
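
Because restarting Nagios with a broken configuration stops monitoring altogether, the verification command from this step can serve as a gate before any restart. The sketch below assumes the summary lines look like the "Total Warnings: 0" and "Total Errors: 0" lines quoted above; summary_is_clean is a hypothetical helper, and it is stricter than Nagios itself because it also refuses to continue on warnings.

```shell
#!/bin/sh
# summary_is_clean: read `nagios -v` output on stdin and succeed only if
# both summary lines are present and both report zero.
summary_is_clean() {
    awk '
        /^Total (Warnings|Errors):/ { seen++; total += $NF }
        END { exit !(seen == 2 && total == 0) }
    '
}

# Demo with canned output instead of a real Nagios installation.
printf 'Total Warnings: 0\nTotal Errors:   0\n' | summary_is_clean \
    && echo "safe to (re)start Nagios"
printf 'Total Warnings: 0\nTotal Errors:   2\n' | summary_is_clean \
    || echo "fix the configuration first"
```

On a real server the same function could be fed directly: sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg | summary_is_clean && sudo systemctl restart nagios.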
    

8. Accessing the Nagios Web Interface:
Open your web browser and navigate to http://YOUR_SERVER_IP/nagios/. You should be prompted for the username (nagiosadmin) and password you created earlier. If successful, you'll see the Nagios Core dashboard. Initially, it might show a few items related to localhost if the sample configurations were used.

This completes the basic installation of Nagios Core and its plugins. The next step is to understand the configuration files that drive its behavior.

Workshop Installing Nagios Core on a Debian/Ubuntu System

Objective:
Perform a clean installation of Nagios Core and Nagios Plugins from source on a fresh Debian or Ubuntu virtual machine.

Prerequisites:

  • A virtual machine (e.g., VirtualBox, VMware) running a minimal server installation of Debian (e.g., Debian 11/12) or Ubuntu Server (e.g., Ubuntu 20.04/22.04 LTS).
  • SSH access to the VM or direct console access.
  • Internet connectivity from within the VM.
  • Root or sudo privileges on the VM.

Steps:

  1. Update System and Install Prerequisites: Log into your VM.

    sudo apt update
    sudo apt upgrade -y
    sudo apt install -y build-essential autoconf gcc libc6 make wget unzip apache2 php libapache2-mod-php libgd-dev
    

    • build-essential is a meta-package that installs gcc, make, and other crucial build tools on Debian/Ubuntu.
  2. Create Nagios User and Group:

    sudo useradd nagios
    sudo groupadd nagcmd
    sudo usermod -a -G nagcmd nagios
    sudo usermod -a -G nagcmd www-data # www-data is the Apache user on Debian/Ubuntu
    

  3. Download Nagios Core and Plugins: Go to the official Nagios Core releases page on GitHub (NagiosEnterprises/nagioscore) and the Nagios Plugins download page (nagios-plugins.org) to find the latest stable version numbers. Let's assume 4.4.14 for Core and 2.4.8 for Plugins for this workshop.

    cd /tmp
    wget https://github.com/NagiosEnterprises/nagioscore/releases/download/nagios-4.4.14/nagios-4.4.14.tar.gz
    wget https://nagios-plugins.org/download/nagios-plugins-2.4.8.tar.gz
    
    tar -zxvf nagios-4.4.14.tar.gz
    tar -zxvf nagios-plugins-2.4.8.tar.gz
    

  4. Compile and Install Nagios Core:

    cd /tmp/nagios-4.4.14/   # directory created by extracting nagios-4.4.14.tar.gz
    sudo ./configure --with-nagios-group=nagios --with-command-group=nagcmd --with-httpd-conf=/etc/apache2/sites-enabled/
    # This configure command is tailored for Debian/Ubuntu's Apache setup.
    
    # If configure completes without error:
    sudo make all
    sudo make install
    # Check if your system uses systemd (most modern systems do)
    # /run/systemd/system exists only when systemd is the running init system
    if [ -d /run/systemd/system ]; then
        sudo make install-daemoninit # For systemd
    else
        sudo make install-init     # For older init systems
    fi
    sudo make install-config
    sudo make install-commandmode
    # ./configure only records the --with-httpd-conf path; it does not install anything.
    # The Apache configuration file is installed by this target:
    sudo make install-webconf
    
    Note: If make install-webconf fails (e.g., because of permissions or a wrong path), check that the directory passed to --with-httpd-conf exists, then re-run the target.

  5. Compile and Install Nagios Plugins:

    cd /tmp/nagios-plugins-2.4.8/
    sudo ./configure --with-nagios-user=nagios --with-nagios-group=nagios --with-openssl
    sudo make
    sudo make install
    

  6. Configure Web Interface: Enable necessary Apache modules:

    sudo a2enmod cgi rewrite
    sudo systemctl restart apache2
    
    Create the nagiosadmin web user (you will be prompted for a password):
    sudo htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin
    
    Verify the Apache configuration for Nagios. The file should be at /etc/apache2/sites-enabled/nagios.conf.
    cat /etc/apache2/sites-enabled/nagios.conf
    
    Ensure it has lines like AuthUserFile /usr/local/nagios/etc/htpasswd.users and Require valid-user.

  7. Verify Nagios Configuration and Start Services: Check the sample Nagios configuration for errors:

    sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    
    You should see "Total Warnings: 0" and "Total Errors: 0". If not, troubleshoot based on the error messages (common initial issues involve file permissions or paths).

    Enable and start the Nagios service (assuming systemd):

    sudo systemctl enable nagios
    sudo systemctl start nagios
    sudo systemctl status nagios # Check it's active and running
    
    Ensure Apache is also enabled and running:
    sudo systemctl enable apache2
    sudo systemctl restart apache2 # Restart to ensure all new configs are loaded
    sudo systemctl status apache2
    

  8. Access the Nagios Web Interface: Find your VM's IP address (e.g., using ip addr show or hostname -I). Open a web browser on your host machine and navigate to http://<VM_IP_ADDRESS>/nagios/. Log in with username nagiosadmin and the password you set.

    You should now see the Nagios Core interface. It will likely be monitoring localhost with a few default checks defined in the sample configuration files.

Troubleshooting Tips for the Workshop:

  • Permission Denied (Web Interface): If you see "Forbidden" errors, check Apache error logs (/var/log/apache2/error.log). This often relates to:
    • File permissions on /usr/local/nagios/share or /usr/local/nagios/sbin.
    • Incorrect Apache configuration (nagios.conf). Ensure Require all granted or appropriate Require directives are set for your Apache version (Apache 2.4 uses different syntax than 2.2). The default nagios.conf from make install-webconf usually handles this.
  • "File not found" for CGIs: Ensure mod_cgi is enabled and ScriptAlias is correct.
  • Nagios service fails to start: Check sudo systemctl status nagios and sudo journalctl -xeu nagios for detailed error messages. Often related to configuration errors identified by the -v check.
  • Plugin errors (e.g., "Return code of 127 is out of bounds"): This often means the plugin was not found or is not executable. Check paths in commands.cfg and permissions in /usr/local/nagios/libexec/.
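
The "out of bounds" message in the last tip follows from the exit-code convention: Nagios expects 0 to 3 from a plugin, while the shell uses 127 for "command not found" and 126 for "found but not executable". The sketch below makes that visible; explain_rc is a hypothetical helper, and the path in the demo is intentionally one that does not exist.

```shell
#!/bin/sh
# explain_rc CODE: translate an exit code into its plugin meaning.
explain_rc() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        126) echo "out of bounds: file found but not executable" ;;
        127) echo "out of bounds: command not found" ;;
        *) echo "out of bounds: unexpected plugin exit code" ;;
    esac
}

# Running a missing command through the shell yields 127.
rc=0
sh -c '/nonexistent/libexec/check_example' 2>/dev/null || rc=$?
echo "exit code $rc: $(explain_rc "$rc")"
```

Running the plugin by hand as the nagios user, with the exact command line from commands.cfg, is usually the quickest way to see which of these cases applies.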

This workshop provides a solid foundation. You now have a working Nagios server!

Understanding Nagios Configuration Files

Nagios's power and flexibility stem from its text-based configuration files. Understanding their structure and purpose is paramount to effectively managing a Nagios installation. All primary configuration files are typically located in /usr/local/nagios/etc/ (or a similar path if Nagios was installed differently).

Main Configuration File (nagios.cfg):
This is the heart of Nagios's configuration. It's usually located at /usr/local/nagios/etc/nagios.cfg. This file tells Nagios:

  • Paths to other configuration files: Using cfg_file= directives for object definitions and cfg_dir= for directories containing object definitions.
  • Location of object cache file: object_cache_file=/usr/local/nagios/var/objects.cache
  • Location of status data file: status_file=/usr/local/nagios/var/status.dat
  • Log file location: log_file=/usr/local/nagios/var/nagios.log
  • Global settings: Such as check execution options, logging options, performance tuning parameters (e.g., interval_length, max_concurrent_checks).
  • User and group Nagios should run as: nagios_user=nagios, nagios_group=nagios.
  • Event broker modules: For integrating with addons like PNP4Nagios or Mod_Gearman.

Example snippets from nagios.cfg:

# LOG FILE
log_file=/usr/local/nagios/var/nagios.log

# OBJECT CONFIGURATION FILE(S)
# You can specify individual object config files as shown below:
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/objects/templates.cfg

# You can also tell Nagios to process all config files ending with '.cfg'
# in a particular directory by using the cfg_dir directive as shown below:
cfg_dir=/usr/local/nagios/etc/servers
cfg_dir=/usr/local/nagios/etc/printers
cfg_dir=/usr/local/nagios/etc/switches

# NAGIOS USER AND GROUP
nagios_user=nagios
nagios_group=nagios

# CHECK RESULT PATH
# Directory where Nagios temporarily stores host and service check results
# (active and passive) before it processes them.
check_result_path=/usr/local/nagios/var/spool/checkresults
It's highly recommended to use cfg_dir directives to organize your object configuration files into logical subdirectories (e.g., /usr/local/nagios/etc/objects/hosts/, /usr/local/nagios/etc/objects/services/, or by device type like /usr/local/nagios/etc/servers/).
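
A cfg_dir directive makes Nagios read every file ending in .cfg in that directory and its subdirectories, so it helps to see which files would be picked up. The following sketch mimics that selection with find; list_cfg_files is a hypothetical helper and the directory layout in the demo is invented for illustration.

```shell
#!/bin/sh
# list_cfg_files DIR: print the .cfg files under DIR (recursively), sorted.
list_cfg_files() {
    find "$1" -type f -name '*.cfg' | sort
}

# Demo on a throwaway directory tree.
demo=$(mktemp -d)
mkdir -p "$demo/servers/web" "$demo/switches"
touch "$demo/servers/web/web01.cfg" "$demo/switches/core-sw.cfg" "$demo/servers/notes.txt"
# notes.txt is ignored because it does not end in .cfg
list_cfg_files "$demo" | sed "s|^$demo/||"
rm -rf "$demo"
```

The directory named in a cfg_dir line has to exist before Nagios is verified or restarted, for example by creating it with sudo mkdir -p /usr/local/nagios/etc/servers.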

Resource Files (resource.cfg):
Usually located at /usr/local/nagios/etc/resource.cfg (or private/resource.cfg). This file is used to store user-defined macros. Macros are like variables that can be used throughout your Nagios configuration. The most common use is to store sensitive information like passwords or community strings for SNMP, or commonly used paths. Example:

# Sets $USER1$ to be the path to the plugins directory
$USER1$=/usr/local/nagios/libexec

# Sets $USER2$ to an SNMP community string (example value)
# $USER2$=public
Nagios predefines many macros (e.g., $HOSTADDRESS$, $SERVICESTATE$). User-defined macros in resource.cfg must be named $USERn$ (Nagios Core 4 accepts $USER1$ through $USER256$) and are referenced with the dollar signs included.
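
Macro substitution is plain text replacement performed just before a command runs. The sketch below imitates it for a check_ping command line like the one in the sample commands.cfg; replace_all is a hypothetical helper written for this illustration, and the address and threshold values are example data, not output from Nagios.

```shell
#!/bin/sh
# replace_all STRING SEARCH REPLACEMENT: literal (non-regex) replacement
# of every occurrence of SEARCH in STRING, using only POSIX shell features.
replace_all() {
    rest=$1
    out=
    while :; do
        case $rest in
            *"$2"*)
                out=$out${rest%%"$2"*}$3   # text before the match + replacement
                rest=${rest#*"$2"}         # continue after the match
                ;;
            *) break ;;
        esac
    done
    printf '%s\n' "$out$rest"
}

# Command line as it might appear in a command definition.
line='$USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5'

# Example values: $USER1$ from resource.cfg, the rest from host/service definitions.
line=$(replace_all "$line" '$USER1$' '/usr/local/nagios/libexec')
line=$(replace_all "$line" '$HOSTADDRESS$' '192.168.1.10')
line=$(replace_all "$line" '$ARG1$' '100.0,20%')
line=$(replace_all "$line" '$ARG2$' '500.0,60%')
echo "$line"
```

The printed line is the command Nagios would hand to the shell, which is also the command to run manually when a check misbehaves.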

Object Configuration Files:
These files define the actual elements Nagios monitors and interacts with. They are typically stored in /usr/local/nagios/etc/objects/ or subdirectories specified by cfg_dir in nagios.cfg. The common object types are:

  • Hosts (hosts.cfg or similar): Define the physical/virtual machines and network devices.
    define host {
        use             linux-server  ; Name of host template to use
        host_name       myserver1
        alias           My First Linux Server
        address         192.168.1.10
        contact_groups  admins
    }
    
  • Services (services.cfg or similar): Define the specific checks for hosts.
    define service {
        use                     generic-service  ; Name of service template to use
        host_name               myserver1
        service_description     PING
        check_command           check_ping!100.0,20%!500.0,60%
        contact_groups          admins
    }
    
  • Contacts (contacts.cfg): Define individuals who receive notifications.
    define contact {
        contact_name            nagiosadmin
        alias                   Nagios Administrator
        service_notification_period 24x7
        host_notification_period  24x7
        service_notification_options w,u,c,r  ; Send notifications on warning, unknown, critical, recovery
        host_notification_options d,u,r      ; Send notifications on down, unreachable, recovery
        service_notification_commands notify-service-by-email
        host_notification_commands  notify-host-by-email
        email                   nagios@localhost ; This should be a real email address
    }
    
  • Contact Groups (contactgroups.cfg or similar): Group contacts together.
    define contactgroup {
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 nagiosadmin, anotheradmin
    }
    
  • Commands (commands.cfg): Define how Nagios executes plugins.
    define command {
        command_name    check_ping
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
    }
    
    Here, $USER1$ is from resource.cfg, $HOSTADDRESS$ is a Nagios macro that expands to the host's address directive (an IP address or FQDN), and $ARG1$, $ARG2$ are the arguments passed from the service definition, separated by ! in check_command.
  • Timeperiods (timeperiods.cfg): Define specific time ranges for checks and notifications.
    define timeperiod {
        timeperiod_name 24x7
        alias           24 Hours A Day, 7 Days A Week
        sunday          00:00-24:00
        monday          00:00-24:00
        tuesday         00:00-24:00
        wednesday       00:00-24:00
        thursday        00:00-24:00
        friday          00:00-24:00
        saturday        00:00-24:00
    }
    
    define timeperiod {
        timeperiod_name workhours
        alias           Normal Work Hours
        monday          09:00-17:00
        tuesday         09:00-17:00
        wednesday       09:00-17:00
        thursday        09:00-17:00
        friday          09:00-17:00
    }
    
  • Templates (often in templates.cfg or spread across object files):
    Allow you to define common properties for hosts and services, promoting inheritance and reducing redundancy. You define a template with generic settings, and then specific host/service definitions use that template, inheriting its properties and overriding them if needed.
    define host {
        name                    linux-server    ; The name of this host template
        notifications_enabled   1               ; Host notifications are enabled
        event_handler_enabled   1               ; Host event handler is enabled
        flap_detection_enabled  1               ; Flap detection is enabled
        process_perf_data       1               ; Process performance data
        retain_status_information 1             ; Retain status information across program restarts
        retain_nonstatus_information 1          ; Retain non-status information across program restarts
        check_command           check-host-alive ; Default command to check if a host is "alive"
        max_check_attempts      5
        notification_interval   60
        notification_period     24x7
        notification_options    d,u,r
        contact_groups          admins
        register                0               ; DONT REGISTER THIS DEFINITION - ITS A TEMPLATE
    }
    
    The register 0 line is crucial for templates; it tells Nagios this is not an actual object to monitor but a template to be used by other objects.
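
To make the macro substitution in the check_ping command definition concrete, the following shell sketch mimics how a check_command string such as check_ping!100.0,20%!500.0,60% is combined with a command_line template. The function expand_check_command is a hypothetical helper written for illustration, not part of Nagios, and it only handles $USER1$, $HOSTADDRESS$, and $ARGn$; the $USER1$ value is assumed to be the one from the sample resource.cfg.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration only; Nagios performs this step internally.
expand_check_command() {
    local check_command="$1" line="$2" host_address="$3"
    local user1="/usr/local/nagios/libexec"   # assumed $USER1$ value (sample resource.cfg)
    local rest="" arg n=1

    # Everything after the first '!' is the argument list; the part before it
    # is the command name, which only selects the command_line template.
    case "$check_command" in
        *!*) rest="${check_command#*!}" ;;
    esac

    line="$(printf '%s\n' "$line" | sed -e 's|\$USER1\$|'"$user1"'|g' \
                                       -e 's|\$HOSTADDRESS\$|'"$host_address"'|g')"

    # Substitute $ARG1$, $ARG2$, ... in order, one '!'-separated value at a time.
    while [ -n "$rest" ]; do
        arg="${rest%%!*}"
        line="$(printf '%s\n' "$line" | sed -e 's|\$ARG'"$n"'\$|'"$arg"'|g')"
        case "$rest" in
            *!*) rest="${rest#*!}" ;;
            *)   rest="" ;;
        esac
        n=$((n + 1))
    done
    printf '%s\n' "$line"
}

expand_check_command 'check_ping!100.0,20%!500.0,60%' \
    '$USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5' \
    '127.0.0.1'
```

The printed line is the plugin invocation Nagios would run for that service on a host whose address is 127.0.0.1. Arguments containing characters that are special to sed (such as | or &) are not handled by this sketch.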

CGI Configuration File (cgi.cfg):
Located at /usr/local/nagios/etc/cgi.cfg, this file controls aspects of the Nagios web interface (the CGIs). Key settings:

  • Main configuration file location: main_config_file=/usr/local/nagios/etc/nagios.cfg
  • Physical HTML path: physical_html_path=/usr/local/nagios/share
  • URL HTML path: url_html_path=/nagios
  • Authentication and Authorization: Defines which users can view certain information or perform certain actions (e.g., submit commands).
    # AUTHENTICATION USAGE
    # This option controls whether or not the CGIs will use the
    # authentication and authorization functionality.
    # 0 = Don't use authentication functionality
    # 1 = Use authentication functionality
    use_authentication=1
    
    # DEFAULT USERNAME
    # This is the default username that the CGIs will use if
    # an authenticated user cannot be found.
    #default_user_name=guest
    
    # SYSTEM/PROCESS INFORMATION ACCESS
    # These are comma-delimited lists of authorized users who can
    # view system/process information in the CGIs.
    authorized_for_system_information=nagiosadmin
    authorized_for_configuration_information=nagiosadmin
    
    # COMMAND ACCESS
    # These are comma-delimited lists of authorized users who can
    # issue commands via the command CGI.
    authorized_for_all_host_commands=nagiosadmin
    authorized_for_all_service_commands=nagiosadmin
    
    It's essential to restrict authorized_for_* directives to trusted users.
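
When auditing who holds these permissions, it can help to list only the active authorized_for_* lines. The sketch below is one possible way to do that; list_cgi_authorizations is a hypothetical helper name, and the default path assumes the source-install layout used in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize the active authorized_for_* directives in cgi.cfg.
list_cgi_authorizations() {
    local cgi_cfg="${1:-/usr/local/nagios/etc/cgi.cfg}"
    # Commented-out lines start with '#', so they do not match the pattern.
    # Output format: "directive: user1,user2"
    awk -F= '/^[ \t]*authorized_for_/ {
        gsub(/[ \t]/, "", $1)
        printf "%s: %s\n", $1, $2
    }' "$cgi_cfg"
}

# Usage (on the Nagios server):
#   list_cgi_authorizations
#   list_cgi_authorizations /path/to/another/cgi.cfg
```

Any username that appears in the output but is not a trusted administrator is a candidate for removal.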

Directory Structure Summary:

  • /usr/local/nagios/bin/: Nagios executable (nagios).
  • /usr/local/nagios/sbin/: CGI executables (e.g., status.cgi, extinfo.cgi).
  • /usr/local/nagios/libexec/: Nagios plugins (e.g., check_ping, check_http).
  • /usr/local/nagios/etc/: Main configuration files (nagios.cfg, cgi.cfg, resource.cfg) and object definitions (often in an objects/ subdirectory).
  • /usr/local/nagios/share/: HTML, CSS, JavaScript, and images for the web interface.
  • /usr/local/nagios/var/: Variable data, such as logs (nagios.log), status data (status.dat), object cache (objects.cache), retention data (retention.dat), and spool directories (e.g., for check results).

Best Practices for Organizing Configuration:

  • Use cfg_dir extensively: Create directories like objects/hosts, objects/services, objects/templates, objects/contactgroups, etc. Or, organize by device type/location: etc/servers/, etc/network/, etc/applications/.
  • One object per file (for larger setups): For instance, each host definition in its own file within etc/hosts/hostname.cfg. This makes management with configuration management tools (Ansible, Puppet, Chef) easier.
  • Leverage templates: Heavily use host and service templates to minimize redundancy and simplify bulk changes.
  • Consistent naming conventions: Use clear and consistent names for hosts, services, templates, groups, etc.
  • Version control: Store your Nagios configuration directory (/usr/local/nagios/etc/) in a version control system like Git. This allows you to track changes, revert to previous versions, and collaborate.
  • Regularly validate configuration: Always run /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg before restarting/reloading Nagios after making changes.
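
As a rough illustration of the "cfg_dir plus one object per file" layout, the sketch below writes a single host definition into its own file. The function name generate_host_cfg, the host web01, and the address 192.0.2.10 are hypothetical; the linux-server template is assumed to exist, as in the sample templates.cfg.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write one host definition per file into a cfg_dir directory.
generate_host_cfg() {
    local host_name="$1" address="$2" out_dir="$3"
    mkdir -p "$out_dir"
    {
        printf 'define host {\n'
        printf '    use         linux-server\n'    # inherit defaults from the template
        printf '    host_name   %s\n' "$host_name"
        printf '    alias       %s\n' "$host_name"
        printf '    address     %s\n' "$address"
        printf '}\n'
    } > "$out_dir/$host_name.cfg"
}

# Usage (hypothetical host; run with sufficient privileges):
#   generate_host_cfg web01 192.0.2.10 /usr/local/nagios/etc/servers
```

For Nagios to pick up the generated files, nagios.cfg needs a matching line such as cfg_dir=/usr/local/nagios/etc/servers, and the configuration should be validated before reloading.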

Understanding these files and their relationships is the key to mastering Nagios. The sample configuration files provided by make install-config are an excellent starting point to explore.
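
The validate-before-reload habit can be wrapped in a small shell function so that a reload never happens with a broken configuration. This is a minimal sketch, assuming the source-install paths used in this guide and a systemd-managed nagios service; safe_reload and the three variables are names chosen here for illustration.

```shell
#!/usr/bin/env bash
# Assumed defaults; override them in the environment if your layout differs.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"
RELOAD_CMD="${RELOAD_CMD:-systemctl reload nagios}"

# Hypothetical helper: reload Nagios only if the pre-flight check passes.
safe_reload() {
    local output
    if output="$("$NAGIOS_BIN" -v "$NAGIOS_CFG" 2>&1)"; then
        # Intentionally unquoted so the command and its arguments are split.
        $RELOAD_CMD
    else
        printf '%s\n' "$output" >&2
        echo "Validation failed; Nagios was NOT reloaded." >&2
        return 1
    fi
}

# Usage (as root or via sudo):
#   safe_reload
```

On failure the full output of nagios -v is printed, which includes the file and line number of the offending definition.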

Workshop Exploring and Modifying Basic Configuration Files

Objective:
To familiarize yourself with the key Nagios configuration files, make a simple modification, and verify the changes.

Prerequisites:

  • A working Nagios Core installation (from the previous workshop).
  • SSH or console access to the Nagios server.
  • A text editor (e.g., nano, vim).

Steps:

  1. Locate Core Configuration Files: Navigate to the Nagios configuration directory:

    cd /usr/local/nagios/etc/
    ls -l
    
    You should see nagios.cfg, cgi.cfg, resource.cfg, and a directory named objects.

  2. Examine nagios.cfg: Open nagios.cfg with your text editor:

    sudo nano nagios.cfg
    

    • Look for log_file to see where Nagios logs its activities.
    • Find the cfg_file and cfg_dir directives. Note how object configuration files are included. The sample configuration usually includes several cfg_file entries pointing to files within the objects/ directory (e.g., objects/commands.cfg, objects/contacts.cfg, objects/localhost.cfg).
    • Observe settings like nagios_user and nagios_group.
    • Do not make any changes yet. Exit the editor.
  3. Examine resource.cfg: Open resource.cfg:

    sudo nano resource.cfg
    

    • You'll likely see a line like $USER1$=/usr/local/nagios/libexec. This macro is widely used in command definitions to specify the path to plugins.
    • You might also see commented-out examples for other $USERn$ macros.
    • Do not make any changes yet. Exit the editor.
  4. Explore the objects/ Directory:

    cd objects/
    ls -l
    
    You should see files like commands.cfg, contacts.cfg, timeperiods.cfg, templates.cfg, and localhost.cfg. These files define the various Nagios objects.

  5. Modify Contact Information in contacts.cfg: The default contact is often nagiosadmin with a placeholder email. Let's change this. Open contacts.cfg:

    sudo nano contacts.cfg
    
    Find the define contact block for nagiosadmin. It will look something like this:
    define contact{
        contact_name    nagiosadmin             ; Short name of user
        use             generic-contact         ; Inherit default values from generic-contact template (defined above)
        alias           Nagios Admin            ; Full name of user
        email           nagios@localhost        ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
    }
    

    • Change the email directive from nagios@localhost to your actual email address (e.g., yourname@example.com). This is important for receiving notifications later.
    • You can also update the alias if you wish.
    • Save the file and exit the editor (Ctrl+X, then Y, then Enter in nano).
  6. Examine localhost.cfg (Example Host/Service Definitions): Open localhost.cfg:

    sudo nano localhost.cfg
    
    This file typically contains definitions for monitoring the Nagios server itself (localhost).

    • Look for define host block. Note its host_name (usually localhost), alias, and address (usually 127.0.0.1).
    • Observe several define service blocks. These define checks like PING, SSH, HTTP, disk space, current users, etc., for localhost. Notice how each service is associated with host_name localhost.
    • Pay attention to the check_command directive in service definitions. This links to a command defined in commands.cfg.
    • Do not make any changes yet. Exit the editor.
  7. Verify Configuration Changes: Any time you modify Nagios configuration files, you must verify them before reloading or restarting Nagios.

    sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    
    If you only changed the email address, this command should complete successfully with:
    Total Warnings: 0
    Total Errors:   0
    Things look okay - No serious problems were detected during the pre-flight check
    
    If there are errors, the output will indicate the file and line number causing the problem. You'll need to correct the error and re-verify.

  8. Reload Nagios to Apply Changes: If verification is successful, reload Nagios. Reloading is preferred over restarting for minor configuration changes as it typically doesn't interrupt ongoing checks.

    sudo systemctl reload nagios
    # Or if not using systemd: sudo /etc/init.d/nagios reload
    

  9. Check Nagios Log (Optional): You can tail the Nagios log file to see if the reload was successful and observe its general activity.

    sudo tail -f /usr/local/nagios/var/nagios.log
    
    Look for lines indicating Nagios is re-reading configuration data. Press Ctrl+C to stop tailing.

Outcome:
You have now successfully:

  • Navigated and inspected the main Nagios configuration files.
  • Understood the purpose of nagios.cfg, resource.cfg, and object definition files.
  • Modified a contact's email address.
  • Verified the configuration using the -v flag.
  • Reloaded Nagios to apply the change.

This workshop builds confidence in working with Nagios configuration. The email address change will be used when we set up notifications.
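
If you prefer to script the email change from step 5 rather than edit the file by hand, something along the following lines could work. It is only a sketch: set_contact_email is a hypothetical helper, it rewrites every email directive in the given file, and it assumes the new address contains no | or & characters.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace the value of the "email" directive in a contacts file.
set_contact_email() {
    local file="$1" new_email="$2"
    cp -p "$file" "$file.bak"    # keep a backup of the original
    # Replace only the address after "email", leaving any trailing
    # "; comment" on the line untouched.
    sed -E "s|^([[:space:]]*email[[:space:]]+)[^[:space:];]+|\\1${new_email}|" \
        "$file.bak" > "$file"
}

# Usage (hypothetical address):
#   sudo bash -c '. ./set_contact_email.sh; set_contact_email /usr/local/nagios/etc/objects/contacts.cfg yourname@example.com'
```

As with a manual edit, run the nagios -v pre-flight check afterwards before reloading.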

Monitoring Your First Host (localhost)

The sample Nagios configuration installed by make install-config normally includes settings to monitor the Nagios server itself, referred to as localhost. This is an excellent starting point to understand how host and service definitions work and to see Nagios in action. If your installation didn't include this, or if you want to understand how it's done from scratch, this section will guide you.

Core Concepts Involved:

  • Host Definition: Defines the machine Nagios will monitor. For localhost, the address is 127.0.0.1.
  • Service Definitions: Define what specific aspects of the host will be monitored (e.g., PING, CPU load, disk space).
  • Check Commands: Pre-defined commands in commands.cfg that Nagios uses to execute plugins with appropriate arguments.
  • Plugins: The actual scripts in /usr/local/nagios/libexec/ that perform the checks.

Steps to Monitor localhost (if not already configured):

  1. Define the Host Object for localhost: Create or edit a configuration file (e.g., /usr/local/nagios/etc/objects/localhost.cfg).

    define host{
        use                     linux-server    ; Inherit default values from a template named 'linux-server'
                                                ; This template should be defined in templates.cfg or similar
        host_name               localhost
        alias                   My Nagios Server (localhost)
        address                 127.0.0.1
        contact_groups          admins          ; Who to notify if this host has problems
    }
    

    • use linux-server: This assumes you have a host template named linux-server defined (typically in templates.cfg). Templates provide default values for many directives (e.g., check_period, notification_options, max_check_attempts). If you don't have one, you'd need to specify all required parameters directly or create a simple one.
    • host_name: A unique name for this host within Nagios. localhost is conventional.
    • alias: A descriptive name shown in the web interface.
    • address: The IP address Nagios will use to check this host. For localhost, it's 127.0.0.1.
    • contact_groups admins: Specifies that members of the admins contact group should be notified. This group must be defined in contacts.cfg (as in the sample configuration) or in a dedicated contactgroups.cfg.
  2. Define Basic Service Checks for localhost: In the same file (localhost.cfg) or a separate services file, add service definitions.

    • PING Check (Host Liveness): Although hosts have an implicit PING check via their check_command (often check-host-alive which uses check_ping), you can also define it as an explicit service for more detailed metrics and alerting.

      define service{
          use                     local-service  ; Inherit default values from a template named 'local-service'
                                                 ; This template is often found in templates.cfg
          host_name               localhost
          service_description     PING
          check_command           check_ping!100.0,20%!500.0,60%
                                                 ; check_ping with Warning at 100ms/20% loss, Critical at 500ms/60% loss
      }
      

      • use local-service: Assumes a service template named local-service exists.
      • host_name localhost: Associates this service with the localhost host.
      • service_description: A descriptive name for this service (e.g., "PING", "HTTP Server").
      • check_command check_ping!100.0,20%!500.0,60%:
        • check_ping: This refers to a command definition in commands.cfg.
        • !: Separator for command arguments.
        • 100.0,20%: Argument 1 ($ARG1$) for check_ping - Warning threshold (100ms round-trip-average, 20% packet loss).
        • 500.0,60%: Argument 2 ($ARG2$) for check_ping - Critical threshold (500ms RTA, 60% packet loss).
    • SSH Server Check:

      define service{
          use                     local-service
          host_name               localhost
          service_description     SSH Server
          check_command           check_ssh
      }
      

      • check_command check_ssh: Assumes a command check_ssh is defined, which uses the check_ssh plugin to see if an SSH server is listening on port 22.
    • HTTP Server Check (for Nagios Web UI itself):

      define service{
          use                     local-service
          host_name               localhost
          service_description     HTTP Web Server
          check_command           check_http
      }
      

      • check_command check_http: Assumes a command check_http is defined, which uses the check_http plugin. By default, it checks port 80 on the host's address.
    • Disk Space Check (Root Partition):

      define service{
          use                     local-service
          host_name               localhost
          service_description     Root Partition Disk Space
          check_command           check_local_disk!20%!10%!/
                                                  ; Warn if <20% free, Critical if <10% free, for path '/'
      }
      

      • check_command check_local_disk!20%!10%!/:
        • check_local_disk: A command typically using the check_disk plugin.
        • !20%: Warning threshold ($ARG1$); WARNING when less than 20% of the space is free.
        • !10%: Critical threshold ($ARG2$); CRITICAL when less than 10% of the space is free.
        • !/: Path to check ($ARG3$).
    • CPU Load Check:

      define service{
          use                     local-service
          host_name               localhost
          service_description     CPU Load
          check_command           check_local_load!5.0,4.0,3.0!10.0,6.0,4.0
                                                  ; Warn at 5,4,3 (1,5,15 min avg), Crit at 10,6,4
      }
      

      • check_command check_local_load!5.0,4.0,3.0!10.0,6.0,4.0:
        • check_local_load: A command typically using check_load plugin.
        • !5.0,4.0,3.0: Warning thresholds for 1-min, 5-min, 15-min load averages.
        • !10.0,6.0,4.0: Critical thresholds.
  3. Ensure Check Commands are Defined: The check_command directives in service definitions refer to commands defined in commands.cfg (or a similar file). These command definitions tell Nagios how to execute the actual plugin scripts. Example command definitions (these are usually present in the default commands.cfg):

    # 'check_local_disk' command definition
    define command{
        command_name    check_local_disk
        command_line    $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
    }
    
    # 'check_local_load' command definition
    define command{
        command_name    check_local_load
        command_line    $USER1$/check_load -w $ARG1$ -c $ARG2$
    }
    
    # 'check_ping' command definition (often used by 'check-host-alive' too)
    define command{
        command_name    check_ping
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
    }
    
    # 'check_ssh' command definition
    define command{
        command_name    check_ssh
        command_line    $USER1$/check_ssh $ARG1$ $HOSTADDRESS$
    }
    # Note: $ARG1$ for check_ssh can be used for options like -p <port>
    
    # 'check_http' command definition
    define command{
        command_name    check_http
        command_line    $USER1$/check_http -I $HOSTADDRESS$ $ARG1$
    }
    # Note: For check_http, $ARG1$ can carry extra plugin options such as -p <port> or -u /uri/
    # If no argument is passed from the service definition, the plugin defaults apply (port 80).
    
    • $USER1$: This macro (from resource.cfg) points to /usr/local/nagios/libexec/.
    • $HOSTADDRESS$: A built-in Nagios macro that gets replaced with the address from the host definition.
    • $ARGn$: Placeholders for arguments passed from the service definition (after the ! in check_command).
  4. Ensure Necessary Templates are Defined: The use linux-server and use local-service directives require these templates to be defined, usually in /usr/local/nagios/etc/objects/templates.cfg. The sample configuration provides these. A minimal linux-server host template:

    define host{
        name                            linux-server    ; Name of this template
        use                             generic-host    ; Inherit other defaults
        check_period                    24x7
        check_interval                  5
        retry_interval                  1
        max_check_attempts              10
        check_command                   check-host-alive
        notification_period             24x7
        notification_interval           30
        notification_options            d,u,r
        contact_groups                  admins
        register                        0               ; This is a template
    }
    
    A minimal local-service service template:
    define service{
        name                            local-service   ; Name of this template
        use                             generic-service ; Inherit other defaults
        max_check_attempts              4
        check_interval                  5
        retry_interval                  1
        notification_period             24x7
        notification_options            w,u,c,r         ; Notify on warning, unknown, critical, recovery
        contact_groups                  admins
        register                        0               ; This is a template
    }
    
    These templates themselves often use even more generic templates like generic-host and generic-service, which define the absolute base defaults.

  5. Add localhost.cfg to nagios.cfg: If you created a new file (e.g., localhost.cfg), ensure it's included by nagios.cfg:

    sudo nano /usr/local/nagios/etc/nagios.cfg
    
    Add a line like:
    cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
    
    Save and exit.

  6. Verify and Reload Nagios:

    sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    # If no errors:
    sudo systemctl reload nagios
    

  7. View in Web Interface: Go to your Nagios web interface (http://YOUR_SERVER_IP/nagios/).

    • Click on "Hosts" in the left navigation pane. You should see localhost.
    • Click on "Services". You should see the services you defined (PING, SSH, HTTP, Disk, Load) associated with localhost.
    • Initially, services might be in a "Pending" state. After a few minutes, they should update to OK (green), WARNING (yellow), CRITICAL (red), or UNKNOWN (orange) based on the check results.

By following these steps, you actively instruct Nagios to monitor various aspects of its own host system. This provides immediate feedback and a practical understanding of the relationship between hosts, services, commands, and plugins.
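
Because a plugin is just a program that prints one line of status text and exits with a code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN), you can run one by hand to see roughly what Nagios would see. The wrapper below is a minimal sketch; run_plugin is a hypothetical helper name, and the example plugin path assumes the source-install layout used in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a plugin command line and translate its exit code.
run_plugin() {
    local output status state
    # Capture the plugin output and its exit code without aborting on failure.
    output="$("$@" 2>&1)" && status=0 || status=$?
    case "$status" in
        0) state="OK" ;;
        1) state="WARNING" ;;
        2) state="CRITICAL" ;;
        3) state="UNKNOWN" ;;
        *) state="INVALID" ;;   # outside the 0-3 plugin convention
    esac
    printf '%s (exit %s): %s\n' "$state" "$status" "$output"
}

# Usage (on the Nagios server, ideally as the nagios user):
#   run_plugin /usr/local/nagios/libexec/check_ping -H 127.0.0.1 -w 100.0,20% -c 500.0,60% -p 5
```

Running the exact command line from a command definition this way is a quick method for separating plugin problems from configuration problems.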

Workshop Monitoring Localhost Services

Objective:
Ensure localhost is being monitored with at least PING, Disk Space (root partition), and Current Users checks. If these are already present from the sample config, review their definitions. If not, add them.

Prerequisites:

  • A working Nagios Core installation.
  • The nagiosadmin contact email configured to your actual email address (from the previous workshop).
  • Text editor and sudo privileges.

Steps:

  1. Inspect Existing localhost.cfg: Navigate to /usr/local/nagios/etc/objects/ and open localhost.cfg (or the file defining your localhost checks).

    cd /usr/local/nagios/etc/objects/
    sudo nano localhost.cfg
    
    Look for service definitions for:

    • PING
    • Root Partition (Disk Space)
    • Current Users

    A typical sample configuration will have these. For example:

    # ... other definitions ...
    
    define service{
        use                             local-service         ; Name of service template to use
        host_name                       localhost
        service_description             PING
        check_command                   check_ping!100.0,20%!500.0,60%
    }
    
    define service{
        use                             local-service         ; Name of service template to use
        host_name                       localhost
        service_description             Root Partition
        check_command                   check_local_disk!20%!10%!/
    }
    
    define service{
        use                             local-service         ; Name of service template to use
        host_name                       localhost
        service_description             Current Users
        check_command                   check_users!20!50
    }
    # ... other definitions ...
    

  2. Understand the check_command for "Current Users": The check_users!20!50 command for "Current Users" means:

    • check_users: This is the command name defined in commands.cfg.
    • !20: This is $ARG1$, the warning threshold. If more than 20 users are logged in, it's a WARNING.
    • !50: This is $ARG2$, the critical threshold. If more than 50 users are logged in, it's a CRITICAL.

    Let's verify the check_users command definition in commands.cfg. Open commands.cfg:

    sudo nano commands.cfg
    
    Search for check_users. You should find something like:
    define command{
        command_name    check_users
        command_line    $USER1$/check_users -w $ARG1$ -c $ARG2$
    }
    
    This confirms that $USER1$/check_users (i.e., /usr/local/nagios/libexec/check_users) is called with the warning (-w) and critical (-c) arguments passed from the service definition. Exit nano.

  3. Add a New Service Check, Swap Usage (if one is missing or for practice): Let's add a check for swap usage. First, we need to see whether a command like check_local_swap or check_swap exists in commands.cfg.

    sudo grep -i swap commands.cfg
    
    The sample configuration often includes:
    # 'check_local_swap' command definition
    define command{
        command_name    check_local_swap
        command_line    $USER1$/check_swap -w $ARG1$ -c $ARG2$
    }
    
    If this command exists, we can use it. If not, you would add this definition to commands.cfg. Assuming it exists, add the following service definition to localhost.cfg:
    sudo nano localhost.cfg
    
    Add this block at the end of the file (or amongst other service definitions for localhost):
    define service{
        use                     local-service         ; Name of service template to use
        host_name               localhost
        service_description     Swap Usage
        check_command           check_local_swap!20%!10%
                                                  ; Warn if less than 20% of swap is free, Critical if less than 10% is free
    }
    
    The check_swap thresholds are easy to misread. In the standard Nagios Plugins check_swap, the percentages given to -w and -c refer to the amount of swap that is still free, not the amount used, so the critical value must be the smaller of the two. With 20% and 10%, the service goes to WARNING once more than 80% of swap is in use and to CRITICAL once more than 90% is in use. Reversing the logic (for example 50%!80%) would report CRITICAL while 79% of swap is still free. If your plugin package differs, confirm the semantics with /usr/local/nagios/libexec/check_swap --help. Save the file and exit.

  4. Verify and Reload Nagios:

    sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    
    If there are errors (e.g., check_local_swap command not defined), you'll need to add the command definition to commands.cfg as shown in step 3, then re-verify. If successful:
    sudo systemctl reload nagios
    

  5. Check in Web Interface: Go to your Nagios web UI. Under "Services", you should now see the "Swap Usage" service for localhost. It will initially be "Pending" and then transition to a status (likely OK if your system isn't heavily using swap). You can click on the service name to see its status details, including the plugin output (e.g., "SWAP OK - 100% free (2047 MB out of 2047 MB)") and any performance data the plugin provides.

Outcome:
You have reviewed existing localhost service checks and successfully added a new service check for Swap Usage. This reinforces the process of:

  1. Identifying a monitoring need (Swap Usage).
  2. Ensuring a suitable check_command exists (or defining one).
  3. Defining the service object, linking it to a host and the command.
  4. Verifying and reloading Nagios.
  5. Confirming the new service in the web interface.

This structured approach is fundamental to expanding Nagios monitoring.
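
The direction of the swap thresholds is the part most likely to trip people up: the standard Nagios Plugins check_swap treats the -w and -c percentages as swap that is still free. The sketch below illustrates only that comparison logic; swap_state is a hypothetical function and not the plugin itself, and the values 20 and 10 are example thresholds.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of "percent free" thresholds (integers, no % sign).
swap_state() {
    local free_pct="$1" warn_pct="$2" crit_pct="$3"
    if [ "$free_pct" -lt "$crit_pct" ]; then
        echo "CRITICAL"      # less free swap than the critical threshold
    elif [ "$free_pct" -lt "$warn_pct" ]; then
        echo "WARNING"       # less free swap than the warning threshold
    else
        echo "OK"
    fi
}

# Evaluate three sample readings against -w 20% -c 10%.
for free in 100 15 5; do
    printf '%3s%% free -> %s\n' "$free" "$(swap_state "$free" 20 10)"
done
```

With these thresholds, 100% free is OK, 15% free is WARNING, and 5% free is CRITICAL, which is why the critical percentage is the smaller number.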

Basic Alerting and Notifications

Monitoring systems are most effective when they can alert administrators to problems. Nagios has a robust notification system that can inform contacts when hosts or services change state (e.g., go from OK to CRITICAL, or UP to DOWN). This section covers the basics of setting up email notifications.

Components Involved:

  • Contacts: Definitions of individuals who should receive alerts, including their email addresses and notification preferences.
  • Contact Groups: Collections of contacts. It's best practice to assign contact groups to hosts/services rather than individual contacts for easier management.
  • Notification Commands: Commands Nagios uses to send out notifications (e.g., notify-host-by-email, notify-service-by-email). These typically use a local mail transfer agent (MTA) like sendmail, postfix, or a simple tool like mailx.
  • Timeperiods: Define when notifications can be sent.
  • Host and Service Definitions: These must specify which contact groups should be notified for them.
  • Nagios Notification Logic: Nagios decides when to send notifications based on state changes (typically hard states), notification options (e.g., w,u,c,r for services), and timeperiods.
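
As a simplified picture of how notification options filter alerts, the sketch below checks whether a service state is covered by an options string such as w,u,c,r. It is a deliberately reduced model: should_notify is a hypothetical function, and the real Nagios logic also considers hard versus soft states, timeperiods, downtime, and flapping.

```shell
#!/usr/bin/env bash
# Hypothetical, simplified model of service_notification_options filtering.
# Returns 0 if the state is covered by the options, 1 if not, 2 if unknown state.
should_notify() {
    local state="$1" options="$2" letter
    case "$state" in
        WARNING)  letter="w" ;;
        UNKNOWN)  letter="u" ;;
        CRITICAL) letter="c" ;;
        OK)       letter="r" ;;   # a return to OK is sent as a recovery
        *)        return 2 ;;
    esac
    # Wrap both sides in commas so "c" cannot match inside another token.
    case ",$options," in
        *",$letter,"*) return 0 ;;
        *)             return 1 ;;
    esac
}

if should_notify CRITICAL "w,u,c,r"; then
    echo "CRITICAL is covered by w,u,c,r"
fi
if ! should_notify WARNING "c,r"; then
    echo "WARNING is filtered out by c,r"
fi
```

A contact with options c,r would therefore hear about critical problems and recoveries but not about warnings.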

Steps for Basic Email Notifications:

  1. Ensure a Mail Transfer Agent (MTA) is Installed and Configured:
    Nagios itself doesn't send emails directly. It relies on a system command (like /usr/bin/mail or /usr/sbin/sendmail) to do so. For this command to work, your Nagios server needs an MTA.

    • Common MTAs:
      Postfix, Sendmail, Exim.
    • Simple Solution for Local/Relay:
      ssmtp or msmtp can be configured to relay emails through an external SMTP server (like Gmail or an institutional mail server). For basic testing, mailutils (which provides mail) might be sufficient if your server has a local MTA configured to send outbound mail (e.g. postfix installed with "Local only" or "Internet Site" config).

    For this basic setup, the mail command provided by mailutils is enough, as long as Postfix or Sendmail is already minimally configured on your server. If no MTA is installed, a minimal Postfix installation is usually straightforward:

    sudo apt install postfix mailutils
    
    During Postfix installation on Debian/Ubuntu, you'll be asked for the mail configuration type.

    • "Internet Site": If your server has a public FQDN and can send mail directly.
    • "Local only": If you only want mail delivered locally or if you'll configure a relay later.
    • For more complex relaying (e.g., via Gmail), Postfix needs further configuration (e.g., /etc/postfix/main.cf for relayhost, SASL authentication). This is beyond Nagios's basic setup but crucial for reliable external email delivery.
  2. Define Contact(s):
    This was partially done in a previous workshop. Ensure your contact definition in /usr/local/nagios/etc/objects/contacts.cfg is complete.

    define contact{
        contact_name            nagiosadmin
        use                     generic-contact     ; Inherits default options
        alias                   Nagios Administrator
        email                   your_actual_email@example.com ; **CRUCIAL: Use a real, working email**
        host_notifications_enabled    1
        service_notifications_enabled   1
        host_notification_period      24x7
        service_notification_period   24x7
        host_notification_options     d,u,r           ; Down, Unreachable, Recovery
        service_notification_options  w,u,c,r,f,s     ; Warning, Unknown, Critical, Recovery, Flapping (start/stop), Scheduled Downtime (start/stop)
        host_notification_commands    notify-host-by-email
        service_notification_commands notify-service-by-email
    }
    

    • email: Must be correct.
    • host_notification_options and service_notification_options: Define for which states notifications are sent. d,u,r and w,u,c,r are common starting points.
    • host_notification_commands and service_notification_commands: Specify the commands Nagios will use to send notifications. These are defined in commands.cfg.
  3. Define Contact Group(s):
    In /usr/local/nagios/etc/objects/contacts.cfg or a dedicated contactgroups.cfg:

    define contactgroup{
        contactgroup_name       admins
        alias                   System Administrators
        members                 nagiosadmin       ; Comma-separated list of contact_names
    }
    
    Ensure nagiosadmin (or your contact's contact_name) is listed in members.

  4. Verify Notification Commands:
    Check /usr/local/nagios/etc/objects/commands.cfg for notify-host-by-email and notify-service-by-email. Default definitions often look like this:

    define command{
        command_name    notify-host-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /usr/bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
    }
    
    define command{
        command_name    notify-service-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | /usr/bin/mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
    }
    

    • These commands use printf to format the email body and pipe it to /usr/bin/mail.
    • $CONTACTEMAIL$ is a Nagios macro that gets replaced with the contact's email address.
    • Many other Nagios macros (like $HOSTNAME$, $SERVICESTATE$, etc.) provide context.
    • Important: Ensure the path /usr/bin/mail is correct for your system. You can find it with which mail. If you switch to a different program such as /usr/sbin/sendmail (find it with which sendmail), note that it generally does not take the subject through a -s option the way mail does, so the whole command_line has to be adapted, not just the path.
  5. Assign Contact Groups to Hosts and Services:
    Ensure your host and service definitions (e.g., in localhost.cfg) include the contact_groups directive. Example for a host in localhost.cfg:

    define host{
        use             linux-server
        host_name       localhost
        alias           My Nagios Server
        address         127.0.0.1
        contact_groups  admins  ; This line ensures 'admins' group gets notified
    }
    
    Example for a service in localhost.cfg:
    define service{
        use                     local-service
        host_name               localhost
        service_description     Root Partition
        check_command           check_local_disk!20%!10%!/
        contact_groups          admins  ; This line ensures 'admins' group gets notified
    }
    
    If you use templates (linux-server, local-service), it's common to define contact_groups admins in the template itself. This way, all hosts/services using that template automatically inherit the contact group assignment.

  6. Enable Notifications Globally:
    In /usr/local/nagios/etc/nagios.cfg, ensure notifications are globally enabled:

    enable_notifications=1
    
    This is usually the default.

  7. Verify and Reload Nagios:

    sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    # If no errors:
    sudo systemctl reload nagios
    

  8. Test Notifications:
    The easiest way to test is to force a service into a problem state or send a custom notification.

    • Via Web Interface (Custom Notification):

      1. Go to the Nagios web UI.
      2. Click on "Services" and choose one of the services for localhost (e.g., "Swap Usage").
      3. In the "Service Commands" section on the right, click "Send custom service notification".
      4. Enter a comment (e.g., "Testing custom notification") and click "Commit".
      5. Check your email. You should receive a "CUSTOM" notification. This primarily tests if the notification command and mail system work.
    • By Forcing a State Change (More Realistic Test):
      This is a bit trickier for localhost services that are normally OK. One way: Temporarily change a service check's thresholds to make it fail. Example: Modify the "Swap Usage" check in localhost.cfg:

      sudo nano /usr/local/nagios/etc/objects/localhost.cfg
      
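      Change the check_command for "Swap Usage" to thresholds that are guaranteed to put the check into a problem state. Assuming the stock check_local_swap command (a wrapper around check_swap), both arguments are minimum free-swap thresholds rather than usage limits, so raising them to 100% should force a problem state:
      # Original: check_local_swap!50%!80%
      # Test: require 100% free swap for both WARNING and CRITICAL (should fail on any host with swap configured)
      check_command           check_local_swap!100%!100%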
      
      Save, verify, and reload Nagios:
      sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
      sudo systemctl reload nagios
      
      Wait for Nagios to re-check the service (usually within 5 minutes, or its check_interval).

      • The service should go into a SOFT problem state first.
      • Nagios will re-check it (based on retry_interval and max_check_attempts in the service template).
      • Once it reaches a HARD problem state, a notification should be sent.
      • Check your email for a PROBLEM notification reporting the WARNING or CRITICAL state.
      • Remember to change the thresholds back to sensible values and reload Nagios again!
    • Check Nagios Log:
      If emails are not arriving, check /usr/local/nagios/var/nagios.log for entries related to notifications. It will show attempts to send notifications and any immediate errors from the notification command. Also, check your system's mail log (e.g., /var/log/mail.log or /var/log/maillog) for errors from the MTA.
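
When reading the log, the entries of interest are the ones Nagios writes each time it runs a notification command. A minimal sketch, assuming the usual "HOST NOTIFICATION:" and "SERVICE NOTIFICATION:" line prefixes in nagios.log; the function name show_notifications is made up for this example:

```shell
# show_notifications LOGFILE [COUNT]
# Print the last COUNT (default 20) notification entries from a Nagios log.
show_notifications() {
    grep -E '(HOST|SERVICE) NOTIFICATION:' "$1" | tail -n "${2:-20}"
}

# Typical use on the Nagios server (log path as used in this guide):
#   show_notifications /usr/local/nagios/var/nagios.log 5
```

If this prints nothing after a service has reached a HARD problem state, Nagios never attempted a notification, which points at the contact, contact group, or notification settings rather than at the mail system.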

Troubleshooting Notification Issues:

  • No emails:
    • Is the contact's email address correct?
    • Is the contact part of the assigned contact group?
    • Is the contact group assigned to the host/service?
    • Are notifications enabled for the contact, host/service, and globally (enable_notifications=1)?
    • Is the notification period for the contact and host/service allowing notifications at the current time?
    • Did the host/service reach a HARD state? Notifications are typically not sent for SOFT states.
    • Is the MTA (Postfix, Sendmail, etc.) correctly configured and able to send external emails? Test sending an email from the command line:
      echo "Test email body" | mail -s "Test Email from Nagios Server" your_actual_email@example.com
      
      If this doesn't arrive, the issue is with your server's mail setup, not Nagios itself.
    • Check /usr/local/nagios/var/nagios.log and /var/log/mail.log.
  • Notification delays: Check notification_interval in host/service definitions or templates. This is how long Nagios waits before re-notifying for an ongoing problem.
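
The first items on this checklist can be verified from the shell instead of by eye. A small sketch, assuming the plain key=value layout of nagios.cfg; cfg_value is a hypothetical helper name, not part of Nagios:

```shell
# cfg_value FILE KEY
# Print the value of a "key=value" directive from a Nagios main config file.
# Commented-out lines are ignored. If the key occurs more than once, this
# sketch prints the last occurrence (an assumption of the sketch).
cfg_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# Typical use (path as used in this guide):
#   cfg_value /usr/local/nagios/etc/nagios.cfg enable_notifications
```

A result of 1 confirms the global switch; anything else means notifications are disabled for the whole Nagios instance regardless of contact settings.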

Setting up notifications correctly is vital. This basic email setup forms the groundwork for more advanced alerting strategies.

Workshop Setting Up Email Notifications for Localhost Alerts

Objective: Configure and test email notifications for alerts generated by localhost services.

Prerequisites:

  • A working Nagios Core installation.
  • The nagiosadmin contact in contacts.cfg should have your real email address.
  • mailutils package installed (sudo apt install mailutils).
  • A basic MTA (like Postfix) installed and configured so the server can send mail, either directly or through a relay host. If you installed Postfix, "Internet Site" requires the server to have a resolvable FQDN, while "Local only" delivers to local user mailboxes only, so external addresses will not work until a relay is configured.
    • Crucial Test: Can your server send an email from the command line to your target email address?
      echo "This is a test email from my Nagios server." | mail -s "Mail Test from $(hostname)" your_real_email@example.com
      
      If this email does not arrive, you must fix your server's mail system (Postfix, Sendmail, etc.) before proceeding with Nagios notifications. This might involve configuring relayhost in Postfix, checking firewall rules, or ensuring your server's IP is not blacklisted.

Steps:

  1. Verify Contact and Contact Group Configuration:

    • Open /usr/local/nagios/etc/objects/contacts.cfg.
    • Confirm the nagiosadmin contact definition:
      define contact{
          contact_name            nagiosadmin
          use                     generic-contact
          alias                   Nagios Administrator
          email                   your_real_email@example.com ; <-- ENSURE THIS IS YOUR EMAIL
          host_notification_options     d,u,r
          service_notification_options  w,u,c,r,f
          host_notification_commands    notify-host-by-email
          service_notification_commands notify-service-by-email
          host_notification_period      24x7
          service_notification_period   24x7
      }
      
    • Confirm the admins contact group definition and that nagiosadmin is a member:
      define contactgroup{
          contactgroup_name       admins
          alias                   Nagios Administrators
          members                 nagiosadmin
      }
      
    • Save any changes.
  2. Verify Notification Commands:

    • Open /usr/local/nagios/etc/objects/commands.cfg.
    • Check the notify-host-by-email and notify-service-by-email commands. Ensure the path to the mail executable (e.g., /usr/bin/mail) is correct for your system.
      # Example:
      # command_line    /usr/bin/printf "..." | /usr/bin/mail -s "..." $CONTACTEMAIL$
      
      You can find the path with which mail.
  3. Assign Contact Group to localhost and its Services:

    • Open /usr/local/nagios/etc/objects/localhost.cfg.
    • Ensure the localhost host definition has contact_groups admins.
      define host{
          # ... other settings ...
          contact_groups          admins
      }
      
    • Ensure relevant services (e.g., PING, Root Partition, Swap Usage) also have contact_groups admins. Often, this is inherited from a template like local-service. If local-service template (in templates.cfg) already specifies contact_groups admins, you don't need to repeat it in every service definition that uses local-service. Verify the local-service template in templates.cfg:
      sudo nano /usr/local/nagios/etc/objects/templates.cfg
      
      Look for define service{ name local-service ... } and ensure it contains contact_groups admins. If not, add it.
  4. Enable Notifications Globally (if not already):

    • Check /usr/local/nagios/etc/nagios.cfg for enable_notifications=1.
  5. Validate Configuration and Reload Nagios:

    sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    # If OK:
    sudo systemctl reload nagios
    

  6. Test Notification by Forcing a Service to a Critical State: We'll use the "Root Partition" check for this test as it's easy to manipulate its thresholds.

    • Identify current free space: In the Nagios web UI, look at the "Root Partition" service for localhost. It will show something like "DISK OK - free space: / 15 GB (70%)". Note the percentage free. Let's say it's 70% free.
    • Modify thresholds to trigger an alert: Edit localhost.cfg:
      sudo nano /usr/local/nagios/etc/objects/localhost.cfg
      
      Find the "Root Partition" service. It might look like:
      define service{
          use                             local-service
          host_name                       localhost
          service_description             Root Partition
          check_command                   check_local_disk!20%!10%!/  ; Warn at 20% free, Crit at 10% free
      }
      
      Change the check_command to trigger a CRITICAL state based on your current free space. If you have 70% free, setting critical to 75% free will trigger it (i.e., critical if less than 75% free space).
      # For example, if you have 70% free space:
      # Set Warning if less than 80% free, Critical if less than 75% free
      check_command                   check_local_disk!80%!75%!/
      
      Save the file.
    • Validate and Reload Nagios:
      sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
      sudo systemctl reload nagios
      
    • Monitor in Web UI and Check Email:
      1. In the Nagios UI, watch the "Root Partition" service. It will go to "Pending".
      2. After its next scheduled check, it should change to a SOFT CRITICAL state. The service detail page shows the state type next to the attempt counter (e.g., "Current Attempt: 1/4 (SOFT state)").
      3. Nagios will re-check it (e.g., every minute if retry_interval is 1 in local-service template). After max_check_attempts (e.g., 4), if it's still critical, it will enter a HARD CRITICAL state.
      4. At this point, a notification should be sent. Check your email. You should receive an email titled something like "PROBLEM Service Alert: localhost/Root Partition is CRITICAL".
      5. The email body will contain details from the notification command.
  7. Revert Changes and Test Recovery Notification:

    • Once you've received the CRITICAL alert, change the "Root Partition" service check command back to its original, sensible values in localhost.cfg:
      # Revert to: Warn at 20% free, Crit at 10% free (or your original values)
      check_command                   check_local_disk!20%!10%!/
      
      Save the file.
    • Validate and Reload Nagios:
      sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
      sudo systemctl reload nagios
      
    • Monitor in Web UI and Check Email:
      1. The service will eventually be re-checked.
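      2. It should return to an OK state. Because the service was in a HARD CRITICAL state, the first OK result is treated as a HARD recovery straight away; there is no SOFT OK phase in this case.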
      3. A RECOVERY notification should be sent. Check your email for a message like "RECOVERY Service Alert: localhost/Root Partition is OK".
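
The threshold trick used in steps 6 and 7 works because the check_local_disk arguments are minimum free-space percentages. The sketch below is a simplified model of that comparison with whole-number percentages only; disk_state is a hypothetical helper, not the check_disk plugin itself:

```shell
# disk_state FREE_PCT WARN_PCT CRIT_PCT
# CRITICAL if less than CRIT_PCT is free, WARNING if less than WARN_PCT is
# free, otherwise OK. Integer percentages only.
disk_state() {
    if [ "$1" -lt "$3" ]; then
        echo CRITICAL
    elif [ "$1" -lt "$2" ]; then
        echo WARNING
    else
        echo OK
    fi
}

disk_state 70 20 10   # original thresholds, prints OK
disk_state 70 80 75   # test thresholds from step 6, prints CRITICAL
```

With 70% free, raising the critical threshold above 70 is enough to trigger the alert, and restoring 20%/10% returns the service to OK.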

Outcome:
If you received both the CRITICAL and RECOVERY emails, your basic email notification system is working! You have successfully:

  • Confirmed all necessary configuration components for notifications.
  • Triggered a real alert by changing service thresholds.
  • Observed the SOFT to HARD state transition.
  • Received a problem notification email.
  • Reverted the change and received a recovery notification email.
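
How long the SOFT to HARD transition takes can be estimated from the service template. A rough sketch, assuming the default interval_length of 60 seconds (so retry_interval is in minutes) and counting from the first failed check; minutes_to_hard is a name invented for this example:

```shell
# minutes_to_hard MAX_CHECK_ATTEMPTS RETRY_INTERVAL
# The first failure is attempt 1, so (MAX_CHECK_ATTEMPTS - 1) retries remain,
# each RETRY_INTERVAL minutes apart.
minutes_to_hard() {
    echo $(( ($1 - 1) * $2 ))
}

minutes_to_hard 4 1   # workshop example values, prints 3
```

The wait before the first failed check is not included; that part depends on check_interval and on when the service was last checked.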

Troubleshooting Reminder:
If emails don't arrive, the first place to check (after confirming Nagios config and logs) is your server's mail system logs (e.g., /var/log/mail.log or /var/log/maillog) and re-test sending mail from the command line. Common issues include relay access denied by your mail server, spam filters catching the mails, or incorrect mail command path in Nagios.

This completes the basic setup and familiarization with Nagios. You are now ready to move on to more intermediate topics.

2. Intermediate Nagios Monitoring Techniques

Having mastered the basics of Nagios installation, configuration, and local monitoring, we now move to intermediate techniques. This section will focus on extending Nagios's reach to monitor remote systems (both Linux and Windows), teach you how to write your own custom plugins for specialized checks, and delve into more advanced object configuration options like host groups, service groups, and templates to manage your monitoring environment more efficiently.

Monitoring Remote Linux Hosts with NRPE

One of the most common tasks for a Nagios server is to monitor services and resources on remote Linux/Unix machines. While some checks like PING or HTTP can be done directly by the Nagios server, many checks (CPU load, disk space, specific processes, memory usage) require an agent running on the remote host. The Nagios Remote Plugin Executor (NRPE) is a popular solution for this.

What is NRPE?
NRPE consists of two main components:

  1. The NRPE daemon (nrpe): This daemon runs on the remote Linux host you want to monitor. It listens for connections from the Nagios server, executes pre-defined Nagios plugins locally on the remote host, and returns the results to the Nagios server.
  2. The check_nrpe plugin: This plugin resides on the Nagios server. Nagios uses check_nrpe to connect to the NRPE daemon on the remote host, specify which command (plugin) to run, and receive the output.

NRPE Architecture:

  1. Nagios server schedules a service check that uses the check_nrpe plugin.
  2. check_nrpe (on Nagios server) connects to the NRPE daemon on the remote host (typically on TCP port 5666).
  3. check_nrpe tells the NRPE daemon which pre-defined command to execute. These commands are configured in the NRPE daemon's configuration file (nrpe.cfg) on the remote host.
  4. The NRPE daemon executes the specified local plugin (e.g., check_load, check_disk) on the remote host.
  5. The local plugin returns its exit status and output string to the NRPE daemon.
  6. The NRPE daemon sends this result back to the check_nrpe plugin on the Nagios server.
  7. check_nrpe passes the result to the Nagios process for evaluation.
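
The request and response flow above can be modelled in a few lines of shell. This is a toy illustration only: both functions are stand-ins invented for this example, nothing crosses the network, and the "plugins" are fake entries that just print a line and return a status.

```shell
# nrpe_daemon COMMAND_NAME
# Stand-in for the remote daemon: look up a predefined command name and
# "run" it. Unknown names are rejected, as with an undefined command[...].
nrpe_daemon() {
    case "$1" in
        check_users) echo "USERS OK - 1 users currently logged in"; return 0 ;;
        check_load)  echo "LOAD WARNING - (fake plugin output)"; return 1 ;;
        *)           echo "NRPE: Command '$1' not defined"; return 3 ;;
    esac
}

# check_nrpe_sim COMMAND_NAME
# Stand-in for check_nrpe: forward the command name, then hand the output
# line and exit status back to the caller (Nagios, in the real setup).
check_nrpe_sim() {
    nrpe_daemon "$1"
}

check_nrpe_sim check_users; echo "exit status: $?"
```

The point of the model is that the Nagios side only ever sends a command name; what actually runs is decided by the table on the remote host.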

Security Considerations:

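  • NRPE traffic is only encrypted if both the NRPE daemon and check_nrpe are built with SSL/TLS support (distribution packages usually are). Without it, requests and results cross the network as unencrypted plain text.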
  • The NRPE daemon configuration (nrpe.cfg) on the remote host specifies which hosts are allowed to connect (allowed_hosts). This should be restricted to your Nagios server's IP address.
  • NRPE should be configured to reject command arguments sent by the check_nrpe plugin (dont_blame_nrpe=0, the default; setting it to 1 is generally discouraged for security). Instead, commands and their arguments are fully defined in the remote host's nrpe.cfg. This prevents a client that can reach the NRPE daemon from passing arbitrary arguments to the plugins it runs.
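
Both settings can be audited with a couple of shell helpers. A sketch under simplifying assumptions: allowed_hosts is a single comma-separated line of literal addresses (no CIDR ranges or hostnames), and the function names are invented for this example.

```shell
# host_allowed CFG_FILE IP
# Succeeds if IP is listed as an exact entry in allowed_hosts.
host_allowed() {
    sed -n 's/^allowed_hosts=//p' "$1" | tr ',' '\n' | tr -d ' ' | grep -Fqx "$2"
}

# args_allowed CFG_FILE
# Succeeds if the less secure dont_blame_nrpe=1 is set.
args_allowed() {
    grep -Eq '^dont_blame_nrpe=1' "$1"
}

# Typical use on the remote host:
#   host_allowed /etc/nagios/nrpe.cfg YOUR_NAGIOS_SERVER_IP && echo "server is allowed"
#   args_allowed /etc/nagios/nrpe.cfg && echo "warning: command arguments are accepted"
```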

Steps to Monitor a Remote Linux Host using NRPE:

On the Remote Linux Host (the one to be monitored):

  1. Install Nagios Plugins: The NRPE daemon needs Nagios plugins to execute locally. Even if it's not a full Nagios server, it needs nagios-plugins.

    # On Debian/Ubuntu:
    sudo apt update
    sudo apt install -y nagios-plugins
    # On RHEL/CentOS (requires EPEL repository):
    # sudo yum install epel-release
    # sudo yum install nagios-plugins-all # Or specific plugins like nagios-plugins-load, nagios-plugins-disk etc.
    
    The plugins will typically be installed in /usr/lib/nagios/plugins/ (Debian/Ubuntu) or /usr/lib64/nagios/plugins/ (RHEL/CentOS). Note this path, as it's needed for nrpe.cfg.

  2. Install NRPE Daemon:

    # On Debian/Ubuntu:
    sudo apt install -y nagios-nrpe-server
    # On RHEL/CentOS (requires EPEL repository):
    # sudo yum install nrpe
    

  3. Configure NRPE Daemon (nrpe.cfg): The configuration file is usually /etc/nagios/nrpe.cfg on both Debian/Ubuntu and RHEL/CentOS. Edit this file:

    sudo nano /etc/nagios/nrpe.cfg # Adjust path if needed
    
    Key settings to check/modify:

    • server_port=5666: Default NRPE port.
    • allowed_hosts=127.0.0.1,::1,YOUR_NAGIOS_SERVER_IP: Crucial for security! Replace YOUR_NAGIOS_SERVER_IP with the actual IP address of your Nagios Core server. Add IPv6 if needed.
    • dont_blame_nrpe=0: This is the default and more secure setting. It means NRPE will not accept arguments with commands sent by check_nrpe. All command arguments must be defined in nrpe.cfg on the remote host. If you set this to 1, you can pass arguments from check_nrpe, but this is less secure.
    • debug=0: Set to 1 for verbose logging during troubleshooting.
    • Command Definitions: This is where you define the commands that the Nagios server can request NRPE to run. The syntax is command[command_name]=/path/to/plugin <arguments>. The plugin path might vary. On Debian/Ubuntu it's often /usr/lib/nagios/plugins/. Examples:
      # Example: Check for users logged in
      # $USER1$ is not defined here, so use full path.
      # Path to plugins might be /usr/lib/nagios/plugins or /usr/lib64/nagios/plugins
      # Adjust this path based on where nagios-plugins were installed (Step 1).
      # Use `dpkg -L nagios-plugins` or `rpm -ql nagios-plugins` to find plugin paths.
      # Assuming /usr/lib/nagios/plugins/ for Debian/Ubuntu examples:
      
      command[check_users]=/usr/lib/nagios/plugins/check_users -w 5 -c 10
      command[check_load]=/usr/lib/nagios/plugins/check_load -r -w .15,.10,.05 -c .30,.25,.20
      # Example for a specific disk device (keep comments on their own line):
      command[check_hda1]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /dev/hda1
      command[check_root_disk]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
      command[check_zombie_procs]=/usr/lib/nagios/plugins/check_procs -w 5 -c 10 -s Z
      command[check_total_procs]=/usr/lib/nagios/plugins/check_procs -w 150 -c 200
      
      • Define only the commands you need.
      • The command_name (e.g., check_users, check_load) is what you will specify from the Nagios server via check_nrpe.
  4. Start/Restart NRPE Daemon and Enable it:

    # On Debian/Ubuntu (uses systemd typically):
    sudo systemctl enable nagios-nrpe-server
    sudo systemctl start nagios-nrpe-server
    sudo systemctl status nagios-nrpe-server
    # On RHEL/CentOS (may use systemd or init):
    # sudo systemctl enable nrpe
    # sudo systemctl start nrpe
    # sudo systemctl status nrpe
    

  5. Firewall Configuration (on Remote Host): If a firewall (like ufw or firewalld) is active on the remote host, allow incoming connections on TCP port 5666 from the Nagios server's IP.

    • Using ufw (Debian/Ubuntu):
      sudo ufw allow from YOUR_NAGIOS_SERVER_IP to any port 5666 proto tcp
      sudo ufw reload
      
    • Using firewalld (RHEL/CentOS):
      sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="YOUR_NAGIOS_SERVER_IP" port port="5666" protocol="tcp" accept'
      sudo firewall-cmd --reload
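
Before moving to the Nagios server, it helps to list exactly which command names the remote host now exposes, since those are the only names check_nrpe can request. A minimal sketch, assuming the command[name]=... syntax shown in step 3; nrpe_commands is a made-up helper name:

```shell
# nrpe_commands CFG_FILE
# Print the name inside command[...] for every uncommented definition.
nrpe_commands() {
    sed -n 's/^command\[\([^]]*\)]=.*/\1/p' "$1"
}

# Typical use on the remote host:
#   nrpe_commands /etc/nagios/nrpe.cfg
```

Definitions pulled in from other files through include directives are not followed by this sketch.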
      

On the Nagios Core Server:

  1. Install check_nrpe Plugin: check_nrpe is distributed with the NRPE source code, not with the Nagios Plugins package, so it is only present if you built or installed NRPE on the Nagios server earlier. If check_nrpe is missing from /usr/local/nagios/libexec/, or you need a specific version, build it from the NRPE source:

    cd /tmp
    # Download NRPE source (same version as daemon ideally, or a compatible one)
    # Check https://github.com/NagiosEnterprises/nrpe/releases for latest version
    NRPE_VERSION="4.1.0" # Example
    wget https://github.com/NagiosEnterprises/nrpe/releases/download/nrpe-${NRPE_VERSION}/nrpe-${NRPE_VERSION}.tar.gz
    tar -zxvf nrpe-${NRPE_VERSION}.tar.gz
    cd nrpe-${NRPE_VERSION}/
    # Configure to build plugin (and optionally agent if you want to build it from here too)
    # You may need openssl-devel or libssl-dev: sudo apt install libssl-dev OR sudo yum install openssl-devel
    # configure and make do not need root; only the copy into libexec below does
    ./configure --enable-ssl # Or without --enable-ssl if remote NRPE daemon doesn't use SSL
    make check_nrpe
    sudo cp src/check_nrpe /usr/local/nagios/libexec/
    sudo chown nagios:nagios /usr/local/nagios/libexec/check_nrpe
    sudo chmod 750 /usr/local/nagios/libexec/check_nrpe # Or 755
    
    Test check_nrpe from Nagios server's command line:
    /usr/local/nagios/libexec/check_nrpe -H REMOTE_HOST_IP
    
    This should return the NRPE version running on the remote host (e.g., "NRPE v4.1.0"). If it fails (timeout, connection refused):

    • Verify NRPE daemon is running on remote host (systemctl status nagios-nrpe-server).
    • Check allowed_hosts in nrpe.cfg on remote host.
    • Check firewall on remote host.
    • Check network connectivity between Nagios server and remote host on port 5666 (e.g., telnet REMOTE_HOST_IP 5666 or nc -zv REMOTE_HOST_IP 5666).

    Test executing a defined command: If you defined command[check_users]=... on the remote host:

    /usr/local/nagios/libexec/check_nrpe -H REMOTE_HOST_IP -c check_users
    
    This should output something like: "USERS OK - 1 users currently logged in".

  2. Define check_nrpe Command in Nagios (if not already present): In /usr/local/nagios/etc/objects/commands.cfg on the Nagios server:

    define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 30 -c $ARG1$
    }
    

    • $USER1$/check_nrpe: Path to the plugin.
    • -H $HOSTADDRESS$: Specifies the remote host's IP address (taken from host definition).
    • -t 30: Timeout of 30 seconds for the check.
    • -c $ARG1$: The command name (defined in remote nrpe.cfg) to execute. $ARG1$ will be replaced by the argument passed from the service definition.

    If you compiled NRPE and check_nrpe with SSL support, you might need to add SSL-related arguments if your NRPE daemon requires them (e.g., certificate paths if client certificates are used). For the basic anonymous (unauthenticated) SSL mode, extra arguments are often not needed as long as both sides are compiled with SSL.

  3. Define Host Object for the Remote Linux Host: Create a new config file in, for example, /usr/local/nagios/etc/servers/remote-linux-host.cfg (ensure this directory is included by a cfg_dir directive in nagios.cfg).

    define host{
        use                     linux-server    ; Inherit from your generic Linux server template
        host_name               remote-linux-server-01
        alias                   My Remote Linux Server
        address                 REMOTE_HOST_IP  ; IP address of the remote host
        contact_groups          admins
    }
    
    Replace REMOTE_HOST_IP with the actual IP.

  4. Define Service Checks using check_nrpe for the Remote Host: In the same file (remote-linux-host.cfg):

    define service{
        use                     generic-service ; Inherit from your generic service template
        host_name               remote-linux-server-01
        service_description     Remote Users
        check_command           check_nrpe!check_users
                                        ; 'check_users' is the command name defined in nrpe.cfg on remote host
    }
    
    define service{
        use                     generic-service
        host_name               remote-linux-server-01
        service_description     Remote Load
        check_command           check_nrpe!check_load
    }
    
    define service{
        use                     generic-service
        host_name               remote-linux-server-01
        service_description     Remote Root Disk
        check_command           check_nrpe!check_root_disk
    }
    
    The string after ! in check_nrpe!command_name is passed as $ARG1$ to the check_nrpe command definition, which then becomes the command name sent to the remote NRPE daemon.

  5. Verify Configuration and Reload Nagios:

    sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    # If no errors:
    sudo systemctl reload nagios
    

  6. Check Web Interface: The new remote host and its services should appear in the Nagios web UI. They will initially be "Pending" and then update with statuses from the NRPE checks.
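
Since every NRPE service definition follows the same pattern, the blocks from step 4 can be generated rather than typed. A small sketch that prints one definition per call; nrpe_service is a hypothetical helper, and the generic-service template and check_nrpe command are the ones defined earlier in this section:

```shell
# nrpe_service HOST_NAME DESCRIPTION NRPE_COMMAND
# Print a service definition that runs NRPE_COMMAND through check_nrpe.
nrpe_service() {
    cat <<EOF
define service{
    use                     generic-service
    host_name               $1
    service_description     $2
    check_command           check_nrpe!$3
}
EOF
}

nrpe_service remote-linux-server-01 "Remote Total Procs" check_total_procs
```

Redirect the output into your host's config file, then run the usual nagios -v check before reloading.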

Troubleshooting NRPE:

  • "Connection refused" or "Socket timeout" from check_nrpe:
    • Is NRPE daemon running on the remote host? (ps aux | grep nrpe, systemctl status nagios-nrpe-server).
    • Is the Nagios server's IP in allowed_hosts in nrpe.cfg on remote?
    • Is port 5666 open in the firewall on the remote host for the Nagios server's IP?
    • Network connectivity issues (routers, general network problems).
    • (If using xinetd for NRPE, ensure xinetd is configured and running).
  • "CHECK_NRPE: Error - Could not complete SSL handshake.":
    • Mismatch in SSL/TLS support or configuration between check_nrpe (on the Nagios server) and the nrpe daemon (on the remote host). Both must be built consistently, either with or without SSL. This error is also commonly reported when the Nagios server's IP address is missing from allowed_hosts, so re-check that setting first.
    • If the NRPE daemon was compiled with specific ciphers and check_nrpe doesn't support them.
  • "CHECK_NRPE: Received 0 bytes from daemon." or "Command ... not defined":
    • The command name sent by check_nrpe (e.g., check_users) is not defined in the command[...] directives in nrpe.cfg on the remote host, or there's a typo. Command names are case-sensitive.
    • Plugin execution error on the remote host. Check NRPE daemon logs on the remote host (syslog or a dedicated NRPE log if configured).
  • Plugin execution errors (e.g., "(No output returned from plugin)" or plugin-specific errors):
    • The plugin path in nrpe.cfg on remote host is incorrect.
    • Plugin does not have execute permissions on remote host.
    • Plugin itself is failing. Try running the exact command line from nrpe.cfg manually on the remote host as the nagios user (or whatever user NRPE runs as) to debug. E.g., sudo -u nagios /usr/lib/nagios/plugins/check_users -w 5 -c 10.
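
When running a plugin by hand, the exit status matters as much as the text it prints, because Nagios derives the state from it (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A minimal wrapper for manual debugging; run_plugin is a name invented for this example:

```shell
# run_plugin COMMAND [ARGS...]
# Run a plugin command line, let its output through, then print the exit
# status and the state Nagios would derive from it.
run_plugin() {
    rc=0
    "$@" || rc=$?
    case "$rc" in
        0) state=OK ;;
        1) state=WARNING ;;
        2) state=CRITICAL ;;
        3) state=UNKNOWN ;;
        *) state="invalid exit status" ;;
    esac
    echo "exit=$rc state=$state"
}

run_plugin true   # prints exit=0 state=OK

# Typical use on the remote host:
#   run_plugin sudo -u nagios /usr/lib/nagios/plugins/check_users -w 5 -c 10
```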

NRPE is a powerful way to extend Nagios's monitoring capabilities to your entire Linux infrastructure.

Workshop Monitoring a Remote Linux Host via NRPE

Objective:
Set up monitoring for a remote Linux host (VM2) from your Nagios server (VM1). You will monitor CPU load, root disk space, and total running processes on VM2.

Prerequisites:

  • Your Nagios Core server (VM1) from previous workshops.
  • A second Linux virtual machine (VM2, e.g., Debian/Ubuntu server). This will be the "remote host".
  • Network connectivity between VM1 and VM2. Ensure they can ping each other by IP.
  • Know the IP addresses of both VM1 (Nagios server) and VM2 (remote host).
  • Sudo/root access on both VMs.

Let's assume:

  • VM1 (Nagios Server) IP: 192.168.1.100 (Replace with your actual IP)
  • VM2 (Remote Linux Host) IP: 192.168.1.101 (Replace with your actual IP)

Part 1: Configure the Remote Linux Host (VM2)

  1. Log in to VM2.

  2. Install Nagios Plugins and NRPE Server:

    sudo apt update
    sudo apt install -y nagios-plugins nagios-nrpe-server
    
    This installs the necessary check plugins (like check_disk, check_load, check_procs) and the NRPE daemon.

  3. Configure NRPE Daemon on VM2: Edit /etc/nagios/nrpe.cfg:

    sudo nano /etc/nagios/nrpe.cfg
    

    • Find the allowed_hosts line. Modify it to include your Nagios Server's IP (VM1):
      allowed_hosts=127.0.0.1,::1,192.168.1.100
      
      (Replace 192.168.1.100 with VM1's actual IP address).
    • Ensure dont_blame_nrpe=0 (this is default and more secure).
    • Add or verify command definitions. The default nrpe.cfg on Debian/Ubuntu often comes with some pre-defined commands. Ensure these (or similar) are present and uncommented. The path to plugins is usually /usr/lib/nagios/plugins/.
      # These should exist or be added:
      command[check_users]=/usr/lib/nagios/plugins/check_users -w 5 -c 10
      command[check_load]=/usr/lib/nagios/plugins/check_load -r -w .15,.10,.05 -c .30,.25,.20
      command[check_disk_root]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
      command[check_procs_total]=/usr/lib/nagios/plugins/check_procs -w 250 -c 400
      
      The stock nrpe.cfg may only ship device-specific disk examples such as check_hda1 or check_sda1, so check_disk_root is defined here to target the root partition (/) explicitly. Adjust the check_procs_total thresholds (-w 250 -c 400) as needed for your VM2.
    • Save and exit nrpe.cfg.
  4. Restart and Enable NRPE Service on VM2:

    sudo systemctl restart nagios-nrpe-server
    sudo systemctl enable nagios-nrpe-server
    sudo systemctl status nagios-nrpe-server # Verify it's active (running)
    

  5. Configure Firewall on VM2 (if applicable): If ufw is active on VM2:

    sudo ufw allow from 192.168.1.100 to any port 5666 proto tcp comment 'Allow NRPE from Nagios Server'
    sudo ufw reload
    sudo ufw status # Verify the rule is active
    
    (Replace 192.168.1.100 with VM1's IP).
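Before moving to the Nagios server, it can help to confirm that every command name you plan to call with check_nrpe -c is really defined in nrpe.cfg. The following is a minimal sketch, not part of NRPE itself: the function names and the SAMPLE text are hypothetical, and it only understands the simple command[name]=... syntax shown above.

```python
import re

# Matches lines such as:
# command[check_users]=/usr/lib/nagios/plugins/check_users -w 5 -c 10
COMMAND_RE = re.compile(r"^\s*command\[(?P<name>[^\]]+)\]\s*=\s*(?P<line>.+?)\s*$")


def parse_nrpe_commands(config_text):
    """Return {command_name: command_line} for every uncommented command[...] entry."""
    commands = {}
    for raw in config_text.splitlines():
        if raw.lstrip().startswith("#"):
            continue  # commented-out definitions are ignored by NRPE too
        match = COMMAND_RE.match(raw)
        if match:
            commands[match.group("name")] = match.group("line")
    return commands


def missing_commands(config_text, expected):
    """List the expected command names that the config does not define."""
    defined = parse_nrpe_commands(config_text)
    return [name for name in expected if name not in defined]


# Hypothetical excerpt of /etc/nagios/nrpe.cfg used only for illustration.
SAMPLE = """\
allowed_hosts=127.0.0.1,::1,192.168.1.100
dont_blame_nrpe=0
#command[check_hda1]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /dev/hda1
command[check_users]=/usr/lib/nagios/plugins/check_users -w 5 -c 10
command[check_disk_root]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
"""

if __name__ == "__main__":
    print(missing_commands(SAMPLE, ["check_users", "check_load", "check_disk_root"]))
```

On VM2 you could pass the contents of /etc/nagios/nrpe.cfg instead of SAMPLE; any name the function reports would produce a "Command not defined" style error when called through check_nrpe.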

Part 2: Configure the Nagios Server (VM1)

  1. Log in to VM1.

  2. Ensure check_nrpe Plugin is Installed: check_nrpe is built from the NRPE source package; it is not part of the nagios-plugins tarball. Verify its existence:

    ls -l /usr/local/nagios/libexec/check_nrpe
    
    If it is missing, compile it from the NRPE source tarball as described in the main NRPE section. If you installed it with apt install nagios-nrpe-plugin instead (less common alongside a source install), it will be at /usr/lib/nagios/plugins/check_nrpe; adjust the paths in this workshop and in commands.cfg accordingly. For source installs, it should be in /usr/local/nagios/libexec/.

  3. Test check_nrpe from VM1 to VM2:

    /usr/local/nagios/libexec/check_nrpe -H 192.168.1.101
    
    (Replace 192.168.1.101 with VM2's IP). Expected output: NRPE vX.Y.Z (the version of NRPE running on VM2). If this fails, troubleshoot (firewall on VM2, allowed_hosts on VM2, NRPE service status on VM2).

    Now test a specific command defined in VM2's nrpe.cfg:

    /usr/local/nagios/libexec/check_nrpe -H 192.168.1.101 -c check_users
    /usr/local/nagios/libexec/check_nrpe -H 192.168.1.101 -c check_load
    /usr/local/nagios/libexec/check_nrpe -H 192.168.1.101 -c check_disk_root
    /usr/local/nagios/libexec/check_nrpe -H 192.168.1.101 -c check_procs_total
    
    Each should return an OK status with some output. If you get "Command not defined", double-check the command names in VM2's nrpe.cfg and ensure they match what you use with -c.

  4. Define check_nrpe Command in Nagios (VM1): Open /usr/local/nagios/etc/objects/commands.cfg:

    sudo nano /usr/local/nagios/etc/objects/commands.cfg
    
    Add the following definition if it doesn't already exist:
    define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 30 -c $ARG1$
    }
    
    Save and exit.

  5. Create Configuration File for VM2 on Nagios Server (VM1): Create a new directory for remote server configs if you don't have one:

    sudo mkdir -p /usr/local/nagios/etc/servers
    
    Tell Nagios to read configs from this directory. Edit /usr/local/nagios/etc/nagios.cfg:
    sudo nano /usr/local/nagios/etc/nagios.cfg
    
    Add this line (if not already present):
    cfg_dir=/usr/local/nagios/etc/servers
    
    Save and exit.

    Now, create the config file for VM2:

    sudo nano /usr/local/nagios/etc/servers/vm2-linux.cfg
    
    Add the following content:
    define host{
        use             linux-server            ; Name of host template to use
        host_name       vm2-remote-linux
        alias           Remote Linux VM2
        address         192.168.1.101           ; <<< IP Address of VM2
        contact_groups  admins
    }
    
    define service{
        use                     generic-service         ; Name of service template to use
        host_name               vm2-remote-linux
        service_description     CPU Load via NRPE
        check_command           check_nrpe!check_load
    }
    
    define service{
        use                     generic-service
        host_name               vm2-remote-linux
        service_description     Root Disk Space via NRPE
        check_command           check_nrpe!check_disk_root
    }
    
    define service{
        use                     generic-service
        host_name               vm2-remote-linux
        service_description     Total Processes via NRPE
        check_command           check_nrpe!check_procs_total
    }
    
    Replace 192.168.1.101 with VM2's actual IP. Save and exit.

  6. Verify Nagios Configuration and Reload (VM1):

    sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    # If "Total Warnings: 0" and "Total Errors: 0":
    sudo systemctl reload nagios
    
    If there are errors, read them carefully. They often point to typos or missing definitions.

  7. Check Nagios Web Interface (VM1): Open your Nagios web UI.

    • Go to "Hosts". You should see vm2-remote-linux.
    • Go to "Services". You should see the three new services associated with vm2-remote-linux (CPU Load, Root Disk Space, Total Processes).
    • They will be in "Pending" state initially. After a few minutes, they should update with their actual status from VM2.

Outcome:
You have successfully configured Nagios to monitor key metrics (CPU load, disk space, total processes) on a remote Linux host (VM2) using NRPE. This involved:

  • Setting up the NRPE daemon and necessary plugins on the remote host (VM2).
  • Configuring firewall rules and allowed hosts for security.
  • Ensuring the check_nrpe plugin is available on the Nagios server (VM1).
  • Testing connectivity and command execution with check_nrpe from the command line.
  • Defining the new host and its NRPE-based services in the Nagios server configuration.
  • Verifying the setup in the Nagios web interface.

This workshop provides a practical template for adding more remote Linux hosts and more NRPE-based checks to your Nagios monitoring.
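To reuse the workshop layout for further hosts, the host and service blocks can be generated instead of typed by hand. The sketch below is only an illustration under stated assumptions: the helper name render_remote_host, the host vm4-remote-linux and the address 192.168.1.103 are hypothetical, and the templates simply mirror the definitions used for VM2.

```python
HOST_TEMPLATE = """\
define host{{
    use             {template}
    host_name       {host_name}
    alias           {alias}
    address         {address}
    contact_groups  admins
}}
"""

SERVICE_TEMPLATE = """\
define service{{
    use                     {template}
    host_name               {host_name}
    service_description     {description}
    check_command           check_nrpe!{nrpe_command}
}}
"""


def render_remote_host(host_name, alias, address, checks,
                       host_template="linux-server",
                       service_template="generic-service"):
    """Build one host block plus one service block per (description, NRPE command) pair."""
    blocks = [HOST_TEMPLATE.format(template=host_template, host_name=host_name,
                                   alias=alias, address=address)]
    for description, nrpe_command in checks:
        blocks.append(SERVICE_TEMPLATE.format(template=service_template,
                                              host_name=host_name,
                                              description=description,
                                              nrpe_command=nrpe_command))
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical fourth VM; the NRPE command names must exist in its nrpe.cfg.
    print(render_remote_host(
        "vm4-remote-linux", "Remote Linux VM4", "192.168.1.103",
        [("CPU Load via NRPE", "check_load"),
         ("Root Disk Space via NRPE", "check_disk_root")],
    ))
```

The output can be saved under /usr/local/nagios/etc/servers/ and verified with nagios -v before reloading, exactly as in step 6.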

Monitoring Remote Windows Hosts with NSClient++

Monitoring Windows hosts requires a different agent than Linux's NRPE. The most popular and versatile agent for Windows is NSClient++. It can communicate with Nagios using various protocols, including NRPE, making the Nagios-side configuration very similar to monitoring Linux hosts via NRPE.

What is NSClient++?
NSClient++ is an agent designed for Windows systems (though it has Linux ports too) that allows a Nagios server to query performance metrics, service states, process information, and more. Key features:

  • Multiple protocols: It can listen for connections using Nagios's native check_nt protocol (older, less secure), NRPE (recommended for consistency if you already use it for Linux), and others.
  • Extensible: Supports external scripts and PowerShell.
  • Built-in checks: Provides many common Windows checks out-of-the-box (CPU, memory, disk, services, processes, event logs).
  • Secure: Supports SSL/TLS for NRPE communication and certificate-based authentication.

Architecture (using NRPE protocol with NSClient++):

  1. Nagios server schedules a service check using the check_nrpe plugin.
  2. check_nrpe (on Nagios server) connects to NSClient++ on the Windows host (NSClient++ listening as an NRPE daemon, typically on TCP port 5666).
  3. check_nrpe tells NSClient++ which pre-defined command (alias) to execute. These command aliases are configured in NSClient++'s configuration file (nsclient.ini or custom.ini).
  4. NSClient++ executes the corresponding internal check module or an external script.
  5. NSClient++ returns the result (exit status and output string) to check_nrpe.
  6. check_nrpe passes the result to the Nagios process.
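The request and response flow above can be modelled in a few lines. This is a toy model only, not NSClient++ or NRPE code: ToyAgent, fake_cpu_check, the fixed 12% load and the message texts are all hypothetical and exist purely to show how a command name is resolved to a handler and how an exit status plus an output string travel back.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def fake_cpu_check(args):
    """Hypothetical handler: compares a fixed 12% load with warn/crit options."""
    options = dict(arg.split("=", 1) for arg in args)
    warn = int(options.get("warn", 80))
    crit = int(options.get("crit", 90))
    load = 12  # stand-in for a real measurement
    if load >= crit:
        return CRITICAL, f"CRITICAL: CPU load {load}%"
    if load >= warn:
        return WARNING, f"WARNING: CPU load {load}%"
    return OK, f"OK: CPU load {load}%"


class ToyAgent:
    """Resolves a command name to a handler, like an agent resolving an alias."""

    def __init__(self, commands, allow_arguments=False):
        self.commands = commands
        self.allow_arguments = allow_arguments

    def handle(self, command, args=()):
        handler = self.commands.get(command)
        if handler is None:
            return UNKNOWN, f"UNKNOWN: no handler for command {command}"
        if args and not self.allow_arguments:
            # Mirrors the idea of refusing arguments sent by the server.
            return UNKNOWN, "UNKNOWN: arguments are not allowed"
        return handler(list(args))


if __name__ == "__main__":
    agent = ToyAgent({"alias_cpu": fake_cpu_check})
    print(agent.handle("alias_cpu"))
    print(agent.handle("alias_cpu", ["warn=10"]))
```

The design point is that the server only ever sends a name (and, if permitted, arguments); what actually runs is decided by the agent's own configuration.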

Steps to Monitor a Remote Windows Host using NSClient++ (with NRPE):

On the Remote Windows Host:

  1. Download NSClient++: Go to the official NSClient++ website (nsclient.org) and download the latest stable version (usually an MSI installer for 64-bit Windows).

  2. Install NSClient++: Run the MSI installer.

    • Choose "Generic" or "Typical" setup type.
    • Important Configuration during install:
      • Allowed hosts: Enter the IP address of your Nagios Core server. This is crucial for security.
      • Enable common check plugins: Ensure NRPEServer (or NRPEServer module (check_nrpe)) is enabled/ticked. You might also enable CheckSystem (for CPU, memory), CheckDisk, CheckService, etc.
      • Password (for check_nt): If you were to use check_nt, you'd set a password here. For NRPE, it's not directly used in the same way, but it's good to set something if prompted.
      • You might be asked if you want to allow arguments to be passed to NRPE commands. For better security, it's often recommended to define commands fully within NSClient++ and not allow arguments from Nagios (similar to dont_blame_nrpe=0 in Linux NRPE). However, NSClient++ often defaults to allowing arguments for convenience.
    • Complete the installation. NSClient++ will be installed as a Windows service and should start automatically.
  3. Configure NSClient++ (nsclient.ini or custom.ini): The configuration file is typically located at C:\Program Files\NSClient++\nsclient.ini. For modern versions, it's recommended to put custom settings in C:\Program Files\NSClient++\custom.ini to avoid overwrites during upgrades. Open the nsclient.ini (or create/edit custom.ini) with a text editor (run as Administrator).

    • Enable Modules: Ensure necessary modules are enabled.

      [/modules]
      ; Common modules
      CheckSystem = enabled
      CheckDisk = enabled
      ; Enable if you want to check event logs
      CheckEventLog = enabled
      ; Essential for NRPE
      NRPEServer = enabled
      ; Needed for the aliases and scripts under [/settings/external scripts]
      CheckExternalScripts = enabled
      ; HELPSystem = enabled   ; For listing commands, useful for debug
      

    • Configure NRPEServer settings:

      [/settings/NRPE/server]
      ; Allowed hosts: reconfirm the value from the install step, or add the Nagios server IP here
      allowed hosts = YOUR_NAGIOS_SERVER_IP
      ; Allow arguments from Nagios (less secure, but often convenient for NSClient++)
      allow arguments = true
      ; Allow nasty characters (meta characters) in arguments (can be a security risk if not careful).
      ; Set to false for more security; this requires more careful command definitions.
      allow nasty characters = true
      ; Port to listen on
      port = 5666
      ; Enable SSL/TLS (recommended) - ensure check_nrpe on Nagios server also supports it
      ; use ssl = true ; or false if not using SSL yet.
      ; insecure = true ; If using SSL but not full certificate validation (simpler setup)
      
      If use ssl = true is set, check_nrpe on the Nagios server must also be compiled with SSL support, and a strict NSClient++ SSL setup may require additional SSL options on the check_nrpe side. With use ssl = false, call check_nrpe with -n so that it does not attempt an SSL handshake. For simplicity, you might start with use ssl = false.

    • Define Command Aliases (External Scripts/Aliases): Many built-in checks can be called directly once their module is loaded; for instance, with CheckSystem enabled, check_nrpe can call check_cpu or check_memory, and allow arguments = true lets Nagios pass thresholds to them. For complex commands, or to restrict what Nagios can call, define aliases instead. Aliases are defined under [/settings/external scripts/alias] and actual scripts under [/settings/external scripts/scripts]; both rely on the CheckExternalScripts module being enabled under [/modules]. Example aliases (often in nsclient.ini already, or can be added):

      [/settings/external scripts/alias]
      alias_cpu = checkCPU warn=80 crit=90 time=5m time=1m time=30s
      alias_mem = checkMem MaxWarn=80% MaxCrit=90% ShowAll=long type=physical
      alias_disk_c = CheckDriveSize MinWarn=20% MinCrit=10% Drive=C: FilterType=FIXED
      ; Checks if the Spooler service is running
      alias_service_spooler = checkServiceState CheckAll Spooler
      ; Example: warn if uptime < 1 day, crit if < 1 hour
      alias_uptime = checkUpTime MinWarn=1d MinCrit=1h
      
      These alias_ names are what you'd call from Nagios via check_nrpe (e.g., check_nrpe -c alias_cpu). If allow arguments = true in [/settings/NRPE/server], you can often call the internal commands directly, for example:
      check_nrpe -c check_cpu -a 'warn=load>80' 'crit=load>90'
      This is very flexible but gives more control to the Nagios side.

  4. Restart NSClient++ Service: Open the Windows Services console (services.msc), find "NSClient++" (or "NSCP"), and restart it to apply configuration changes.

  5. Configure Windows Firewall: Allow incoming connections on TCP port 5666 from your Nagios server's IP address.

    • Open Windows Defender Firewall with Advanced Security.
    • Go to "Inbound Rules".
    • Click "New Rule..."
    • Type of rule: "Port".
    • Protocol and Ports: "TCP", Specific local ports: "5666".
    • Action: "Allow the connection".
    • Profile: Choose appropriate profiles (Domain, Private, Public - typically Domain and Private).
    • Name: "NSClient++ NRPE (from Nagios)".
    • (Optional but recommended) Scope: Restrict "Remote IP addresses" to your Nagios server's IP.
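A quick sanity check of the NSClient++ settings discussed above can be scripted. The sketch below uses Python's configparser as a stand-in reader; it is an assumption that this matches how the agent itself parses the file (in particular, trailing "; comment" text is tolerated here but may be treated differently by NSClient++). The function names and SAMPLE_INI are hypothetical, and only plain IP addresses in allowed hosts are handled.

```python
import configparser


def load_ini(text):
    """Parse nsclient.ini-style text; section names like /modules are kept verbatim."""
    parser = configparser.ConfigParser(
        delimiters=("=",),               # values such as Drive=C: contain colons
        comment_prefixes=(";", "#"),
        inline_comment_prefixes=(";",),  # tolerate "value ; comment" in this sketch
        interpolation=None,              # values such as MinWarn=20% contain %
        strict=False,
    )
    parser.optionxform = str  # keep key case, e.g. NRPEServer
    parser.read_string(text)
    return parser


def nrpe_config_problems(text, nagios_ip):
    """Return a list of human-readable problems; an empty list means the basics look fine."""
    parser = load_ini(text)
    problems = []
    if parser.get("/modules", "NRPEServer", fallback="").lower() != "enabled":
        problems.append("NRPEServer is not enabled under [/modules]")
    allowed = parser.get("/settings/NRPE/server", "allowed hosts", fallback="")
    hosts = [h.strip() for h in allowed.split(",") if h.strip()]
    if nagios_ip not in hosts:
        problems.append(f"{nagios_ip} is not listed in allowed hosts")
    return problems


# Hypothetical configuration text used only for illustration.
SAMPLE_INI = """\
[/modules]
CheckSystem = enabled
NRPEServer = enabled

[/settings/NRPE/server]
allowed hosts = 192.168.1.100
port = 5666
use ssl = false
"""

if __name__ == "__main__":
    print(nrpe_config_problems(SAMPLE_INI, "192.168.1.100"))
```

Running it against a copy of the real nsclient.ini before restarting the service catches the two most common mistakes: a disabled NRPEServer module and a missing Nagios server address.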

On the Nagios Core Server:

  1. Ensure check_nrpe Plugin is Ready: The same check_nrpe plugin used for Linux hosts works for Windows hosts running NSClient++ in NRPE mode. Verify that /usr/local/nagios/libexec/check_nrpe exists and works (as in the Linux NRPE section).

  2. Test check_nrpe from Nagios server to Windows host:

    /usr/local/nagios/libexec/check_nrpe -H WINDOWS_HOST_IP
    
    Should return something like: I seem to be doing fine... or the NSClient++ version. If you get "CHECK_NRPE: Error - Could not complete SSL handshake," the two sides disagree about SSL: if nsclient.ini has use ssl = false, add -n to the check_nrpe command line to disable SSL on the client side; otherwise make sure both ends are configured for compatible SSL.

    Test a specific command (assuming allow arguments = true in NSClient++ and CheckSystem module is loaded):

    # Quote arguments containing < or > so the shell does not treat them as redirections.
    # Test CPU check: warn if 5min avg > 80%, crit if > 90%
    /usr/local/nagios/libexec/check_nrpe -H WINDOWS_HOST_IP -c check_cpu -a 'warn=load>80' 'crit=load>90' time=5m
    # Test Memory check: warn if physical memory usage > 80%, crit if > 90%
    /usr/local/nagios/libexec/check_nrpe -H WINDOWS_HOST_IP -c check_memory -a type=physical 'warn=used>80%' 'crit=used>90%'
    # Test C: drive space: warn if free < 20GB, crit if free < 10GB
    /usr/local/nagios/libexec/check_nrpe -H WINDOWS_HOST_IP -c check_drivesize -a drive=C: 'warn=free<20G' 'crit=free<10G'
    
    Or, if using aliases defined in nsclient.ini like alias_cpu:
    /usr/local/nagios/libexec/check_nrpe -H WINDOWS_HOST_IP -c alias_cpu
    

  3. Define Host Object for the Windows Host: Create /usr/local/nagios/etc/servers/windows-host-01.cfg (or similar):

    define host{
        use             windows-server  ; You might create a 'windows-server' host template
                                        ; Or use 'generic-host' or 'linux-server' if similar enough
        host_name       my-windows-server
        alias           My First Windows Server
        address         WINDOWS_HOST_IP ; IP of the Windows host
        contact_groups  admins
    }
    
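    The sample templates.cfg shipped with Nagios Core already contains a windows-server host template. It assigns hosts to the windows-servers hostgroup, which the sample configuration defines in objects/windows.cfg, so enable that file in nagios.cfg (or define the hostgroup yourself) before using the template. If your installation lacks the template, it's good practice to create one in templates.cfg when you monitor many Windows machines.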

  4. Define Service Checks using check_nrpe: In the same file (windows-host-01.cfg):

    # Example using direct command calls (requires 'allow arguments = true' in NSClient++)
    define service{
        use                     generic-service
        host_name               my-windows-server
        service_description     Windows CPU Load
        check_command           check_nrpe!check_cpu!-a 'warn=load>80' 'crit=load>90' time=5m time=1m time=30s
    }
    
    define service{
        use                     generic-service
        host_name               my-windows-server
        service_description     Windows Memory Usage
        check_command           check_nrpe!check_memory!-a type=physical 'warn=used>80%' 'crit=used>90%'
    }
    
    define service{
        use                     generic-service
        host_name               my-windows-server
        service_description     Windows C Drive Space
        check_command           check_nrpe!check_drivesize!-a drive=C: 'warn=free<20G' 'crit=free<10G' ShowAll=long
    }
    
    define service{
        use                     generic-service
        host_name               my-windows-server
        service_description     Windows Uptime
        check_command           check_nrpe!check_uptime
        # For check_uptime, NSClient++ has default warn/crit values.
        # To specify, e.g., warn if uptime < 7d, crit < 1d:
        # check_command           check_nrpe!check_uptime!-a 'warn=uptime<7d' 'crit=uptime<1d'
    }
    
    Important Note on Arguments: Nagios uses the ! character to separate the command name from its arguments, so check_nrpe!check_cpu!-a ... sets $ARG1$ to check_cpu and $ARG2$ to the string starting with -a. For $ARG2$ to reach the agent, the check_nrpe command definition in commands.cfg must include it, for example: command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -t 30 -c $ARG1$ $ARG2$. If an argument itself must contain a !, escape it with a backslash or define an alias on the client instead. Because Nagios hands the final command line to a shell, quote any argument that contains <, > or other shell metacharacters. NSClient++ then parses the argument string it receives.

    If you used aliases like alias_cpu in nsclient.ini:

    define service{
        use                     generic-service
        host_name               my-windows-server
        service_description     Windows CPU Load (alias)
        check_command           check_nrpe!alias_cpu
    }
    
    This is cleaner and more secure as the full check logic is on the client.

  5. Verify Configuration and Reload Nagios:

    sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    sudo systemctl reload nagios
    

  6. Check Web Interface: The new Windows host and its services should appear.
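How a check_command such as check_nrpe!check_cpu!-a ... turns into a command line can be illustrated with a small model. This is a simplified sketch, not the Nagios implementation: expand_check_command and nrpe_args are hypothetical helpers, escaped \! sequences are not handled, and the command definition with $ARG2$ is assumed. It also shows why arguments containing < or > are quoted before they reach the shell.

```python
import shlex


def nrpe_args(arguments):
    """Build the $ARG2$ string, quoting anything a shell would misinterpret."""
    return "-a " + " ".join(shlex.quote(arg) for arg in arguments)


def expand_check_command(command_line, check_command, macros):
    """Split 'name!arg1!arg2' on '!' and substitute $ARGn$ and the given macros."""
    name, *args = check_command.split("!")
    expanded = command_line
    for macro, value in macros.items():
        expanded = expanded.replace(f"${macro}$", value)
    for index in range(1, 33):  # Nagios provides $ARG1$ through $ARG32$
        value = args[index - 1] if index <= len(args) else ""
        expanded = expanded.replace(f"$ARG{index}$", value)
    return name, expanded.strip()


# Assumed command definition (note the trailing $ARG2$).
COMMAND_LINE = "$USER1$/check_nrpe -H $HOSTADDRESS$ -t 30 -c $ARG1$ $ARG2$"
MACROS = {"USER1": "/usr/local/nagios/libexec", "HOSTADDRESS": "192.168.1.102"}

if __name__ == "__main__":
    service_command = "check_nrpe!check_cpu!" + nrpe_args(
        ["warn=load>80", "crit=load>90", "time=5m"])
    print(service_command)
    print(expand_check_command(COMMAND_LINE, service_command, MACROS)[1])
```

Unused $ARGn$ macros expand to an empty string, which is why a service that only names an alias still works with a definition that ends in $ARG2$.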

Troubleshooting NSClient++:

  • Connection issues from Nagios server:
    • NSClient++ service running on Windows?
    • Nagios server IP in allowed hosts in nsclient.ini ([/settings/NRPE/server] section)?
    • Windows Firewall allowing port 5666 from Nagios server?
    • SSL/TLS mismatch? Try use ssl = false in nsclient.ini for initial testing if check_nrpe is not using SSL.
  • "UNKNOWN: No handler for command" or similar from check_nrpe:
    • The command (e.g., check_cpu) or alias (e.g., alias_cpu) is not recognized by NSClient++.
    • Ensure the required module (e.g., CheckSystem) is enabled in [/modules] in nsclient.ini.
    • If using an alias, ensure it's correctly defined in [/settings/external scripts/alias].
    • If allow arguments = false in NSClient++, you must use aliases that fully define the command.
  • NSClient++ logs: Check C:\Program Files\NSClient++\nsclient.log for errors on the Windows host. You might need to increase log level in nsclient.ini ([/settings/log] section, e.g., level = debug). Remember to restart NSClient++ service after changing its .ini file.
  • NSClient++ test command: On the Windows host, you can test commands locally: Open Command Prompt as Administrator, navigate to C:\Program Files\NSClient++\, then run nscp test. This gives you an interactive NSClient++ console. You can then type commands like check_cpu warn=load>80 crit=load>90 time=5m to see their output. Or alias_cpu if you defined such an alias.
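When working through the connection checklist above, it is useful to separate "port not reachable" (service stopped, firewall) from "port reachable but NRPE refuses" (allowed hosts, SSL mismatch). The sketch below only attempts a plain TCP connection; the function name and message texts are hypothetical, and it says nothing about whether the NRPE handshake itself would succeed.

```python
import socket


def nrpe_port_reachable(host, port=5666, timeout=5.0):
    """Try a plain TCP connect and return (reachable, explanation)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"TCP connection to {host}:{port} succeeded"
    except ConnectionRefusedError:
        # Host answered but nothing listens: agent service stopped or wrong port.
        return False, f"{host}:{port} refused the connection"
    except socket.timeout:
        # No answer at all: typically a firewall silently dropping packets.
        return False, f"{host}:{port} timed out"
    except OSError as exc:
        return False, f"{host}:{port} unreachable: {exc}"


# Example call from the Nagios server (address taken from the workshop setup):
#   print(nrpe_port_reachable("192.168.1.102"))
```

If this reports success while check_nrpe still fails, the problem is more likely in allowed hosts or in the SSL settings than in the firewall.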

Monitoring Windows hosts with NSClient++ and NRPE provides a consistent approach with monitoring Linux hosts, simplifying Nagios configuration.

Workshop Monitoring a Remote Windows Host using NSClient++ and check_nrpe

Objective:
Install and configure NSClient++ on a remote Windows host (VM3) and monitor its CPU usage, memory usage, and C: drive space from your Nagios server (VM1) using the NRPE protocol.

Prerequisites:

  • Your Nagios Core server (VM1) from previous workshops (IP: 192.168.1.100 - adjust as needed).
  • A Windows virtual machine (VM3, e.g., Windows Server 2019/2022 or Windows 10/11). This will be the "remote Windows host." (IP: 192.168.1.102 - adjust as needed).
  • Network connectivity between VM1 and VM3. Ensure VM1 can ping VM3 by IP.
  • Administrator access on VM3.
  • Sudo/root access on VM1.
  • check_nrpe plugin working on VM1.

Part 1: Configure the Remote Windows Host (VM3)

  1. Log in to VM3 as an Administrator.

  2. Download NSClient++: Open a web browser on VM3 and go to https://nsclient.org/download/. Download the latest stable 64-bit MSI installer (e.g., NSCP-0.5.x.xx-x64.msi).

  3. Install NSClient++ on VM3:

    • Run the downloaded MSI installer.
    • Click "Next" on the welcome screen. Accept the license agreement and click "Next."
    • Setup Type: Choose "Typical." Click "Next."
    • Configuration:
      • Allowed hosts address: Enter the IP address of your Nagios server (VM1), e.g., 192.168.1.100.
      • NSClient++ Monitoring Tools: Keep defaults or ensure "Enable NRPE server" (or similar wording for NRPEServer) is checked.
      • You can leave the password fields blank as we are focusing on NRPE.
      • Click "Next."
    • Click "Install." If prompted by User Account Control, click "Yes."
    • Once installation is complete, click "Finish." The NSClient++ service should start automatically.
  4. Configure NSClient++ (nsclient.ini):

    • Open File Explorer and navigate to C:\Program Files\NSClient++\.
    • Open nsclient.ini with a text editor like Notepad, run as Administrator (right-click Notepad, "Run as administrator," then open the file).
    • Enable Modules (verify): Under the [/modules] section, ensure these are present and set to enabled (or uncommented):
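      CheckSystem = enabled
      CheckDisk = enabled
      NRPEServer = enabled
      ; Required for the aliases defined under [/settings/external scripts/alias]
      CheckExternalScripts = enabled
      ; HELPSystem = enabled ; Useful for debugging, can be left commented out for production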
      
    • Configure NRPE Server Settings: Under the [/settings/NRPE/server] section (create it if it doesn't exist or is commented out):
      ; SSL verification settings; only relevant once SSL is enabled. Start without SSL for ease.
      verify mode = none
      insecure = true
      
      ; Allow arguments from NRPE client
      allow arguments = true
      
      ; Allow "nasty" meta characters (may be needed because of special characters); be cautious with this in production
      allow nasty characters = true
      
      ; Allowed hosts: your Nagios Server IP (VM1)
      allowed hosts = 192.168.1.100
      
      ; Port to use for NRPE.
      port = 5666
      
      ; SSL/TLS options - For initial workshop, keep SSL disabled for simplicity
      use ssl = false
      
      Make sure allowed hosts correctly lists VM1's IP. For this workshop, we set use ssl = false to simplify the initial setup. In a production environment, you should enable SSL.
    • Command Aliases (Optional but Good Practice): While allow arguments = true lets us call commands directly, let's define a few aliases under [/settings/external scripts/alias] for clarity and future security hardening. If this section doesn't exist, create it.
      [/settings/external scripts/alias]
      alias_cpu_long = checkCPU warn=80 crit=90 time=5m time=1m time=30s ShowAll=long
      alias_mem_phys = checkMem MaxWarn=80% MaxCrit=90% type=physical ShowAll=long
      alias_disk_c_space = CheckDriveSize MinWarn=20% MinCrit=10% Drive=C: ShowAll=long FilterType=FIXED
      alias_win_uptime = checkUpTime MinWarn=24h MinCrit=2h ShowAll=long
      
    • Save the nsclient.ini file.
  5. Restart NSClient++ Service on VM3:

    • Open "Services" (type services.msc in the Run dialog or Start menu search).
    • Find "NSClient++ Monitoring Agent" (or similar, might be "NSCP").
    • Right-click it and select "Restart."
  6. Configure Windows Firewall on VM3:

    • Search for "Windows Defender Firewall with Advanced Security" and open it.
    • Click on "Inbound Rules" in the left pane.
    • In the right pane, click "New Rule..."
    • Rule Type: Select "Port," click "Next."
    • Protocol and Ports: Select "TCP." Select "Specific local ports:" and enter 5666. Click "Next."
    • Action: Select "Allow the connection." Click "Next."
    • Profile: Keep "Domain," "Private," and "Public" checked (or adjust based on your network profile). Click "Next."
    • Name: Enter a descriptive name, e.g., Nagios NRPE (NSClient++).
    • Scope (Optional but Recommended): In the rule properties (after creation or during), go to the "Scope" tab. Under "Remote IP address," choose "These IP addresses," click "Add," and enter the IP address of your Nagios server (VM1, e.g., 192.168.1.100). Click "OK."
    • Click "Finish."

Part 2: Configure the Nagios Server (VM1)

  1. Log in to VM1.

  2. Test check_nrpe from VM1 to VM3: Replace 192.168.1.102 with VM3's actual IP address.

    /usr/local/nagios/libexec/check_nrpe -H 192.168.1.102 -t 30
    
    Expected output: I seem to be doing fine... or an NSClient++ version string. If it fails (timeout, connection refused):

    • Verify NSClient++ service is running on VM3.
    • Check allowed hosts in nsclient.ini on VM3.
    • Check Windows Firewall rule on VM3.
    • Ensure both ends agree on SSL: with use ssl = false in nsclient.ini, check_nrpe must not use SSL either. If your check_nrpe was compiled with SSL support, add -n to the command line to disable it.

    Now test the aliases you defined (or direct commands if you prefer):

    /usr/local/nagios/libexec/check_nrpe -H 192.168.1.102 -t 30 -c alias_cpu_long
    /usr/local/nagios/libexec/check_nrpe -H 192.168.1.102 -t 30 -c alias_mem_phys
    /usr/local/nagios/libexec/check_nrpe -H 192.168.1.102 -t 30 -c alias_disk_c_space
    /usr/local/nagios/libexec/check_nrpe -H 192.168.1.102 -t 30 -c alias_win_uptime
    
    Each should return an OK status with some output.

  3. Define check_nrpe Command in Nagios (VM1) (if not already done): The command exists from the Linux NRPE workshop. Open /usr/local/nagios/etc/objects/commands.cfg and make sure its command_line ends with $ARG2$:

    define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 30 -c $ARG1$ $ARG2$
    }
    
    Note: $ARG2$ lets a service definition pass further arguments, which is common when calling NSClient++ internal checks directly (e.g. check_nrpe!check_cpu!-a 'warn=load>80' 'crit=load>90'). If you only use aliases defined on the client, $ARG1$ is sufficient; services that omit the second argument are unaffected because $ARG2$ then expands to an empty string.

  4. Create Configuration File for VM3 on Nagios Server (VM1): In the /usr/local/nagios/etc/servers/ directory (created in previous workshop):

    sudo nano /usr/local/nagios/etc/servers/vm3-windows.cfg
    
    Add the following content:
    define host{
        use             linux-server    ; Supplies check-host-alive and max_check_attempts, which generic-host alone does not; or use a 'windows-server' template
        host_name       vm3-remote-windows
        alias           Remote Windows VM3
        address         192.168.1.102   ; <<< IP Address of VM3 (Windows host)
        contact_groups  admins
        ; For Windows, check-host-alive (ping) is usually fine as the default check_command
        ; If you create a windows-server template, you can set specific icons, etc.
        icon_image      win40.gif       ; Example icon (if images are in Nagios share)
        statusmap_image win40.gd2
    }
    
    define service{
        use                     generic-service
        host_name               vm3-remote-windows
        service_description     Windows CPU Usage
        check_command           check_nrpe!alias_cpu_long
    }
    
    define service{
        use                     generic-service
        host_name               vm3-remote-windows
        service_description     Windows Memory Usage
        check_command           check_nrpe!alias_mem_phys
    }
    
    define service{
        use                     generic-service
        host_name               vm3-remote-windows
        service_description     Windows C Drive Space
        check_command           check_nrpe!alias_disk_c_space
    }
    
    define service{
        use                     generic-service
        host_name               vm3-remote-windows
        service_description     Windows Uptime
        check_command           check_nrpe!alias_win_uptime
    }
    
    Replace 192.168.1.102 with VM3's actual IP. Save and exit.

  5. Verify Nagios Configuration and Reload (VM1):

    sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    # If "Total Warnings: 0" and "Total Errors: 0":
    sudo systemctl reload nagios
    

  6. Check Nagios Web Interface (VM1): Open your Nagios web UI.

    • Go to "Hosts". You should see vm3-remote-windows.
    • Go to "Services". You should see the four new services associated with vm3-remote-windows.
    • They will be in "Pending" state initially. After a few minutes, they should update with their actual status from VM3. You should see CPU, Memory, Disk, and Uptime information.

Outcome:
You have successfully configured Nagios to monitor key metrics on a remote Windows host (VM3) using NSClient++ with the NRPE protocol. This involved:

  • Installing and configuring NSClient++ on the Windows host (VM3).
  • Defining command aliases in nsclient.ini for the checks.
  • Configuring the Windows Firewall.
  • Testing connectivity and command execution with check_nrpe from the Nagios server (VM1).
  • Defining the new Windows host and its NRPE-based services in the Nagios server configuration.
  • Verifying the setup in the Nagios web interface.

This workshop demonstrates the versatility of NRPE and how NSClient++ enables comprehensive Windows monitoring within a Nagios environment. For production, remember to enable and configure SSL for NRPE communication between Nagios and NSClient++.

Writing Custom Nagios Plugins

While Nagios comes with a vast library of official and community-contributed plugins, there will inevitably be situations where you need to monitor something unique to your environment for which no existing plugin is suitable. This is where writing custom Nagios plugins becomes essential. Nagios plugins are simple executables or scripts that adhere to a specific contract regarding exit codes and output format.

Plugin Development Guidelines:

  1. Executable:
    The plugin must be an executable file (script or compiled program). Common choices for scripts are Bash, Perl, Python, Ruby, or PowerShell (if executed via an agent like NSClient++).
  2. Exit Codes:
    The plugin must terminate with one of the following exit codes to indicate the status of the check:
    • 0: OK - The service is functioning correctly.
    • 1: WARNING - The service is in a warning state (e.g., approaching a threshold).
    • 2: CRITICAL - The service is in a critical state (e.g., threshold exceeded, service down).
    • 3: UNKNOWN - The status of the service could not be determined (e.g., plugin error, invalid arguments, resource unavailable).
    Any other exit code will typically be treated as UNKNOWN by Nagios, or may result in an error like "(Return code of X is out of bounds)".
  3. Output Format (STDOUT):
    The plugin must print at least one line of human-readable text to standard output (STDOUT). This is the primary status information displayed in the Nagios UI.

    • Single-Line Output:
      SERVICESTATUS: Plugin message | optional_performance_data
      Example:
      DISK OK - / (sda1) is 27% full. | /=5079MB;15280;17190;0;19100
    • Multi-Line Output (less common, but useful for extended info): The first line follows the single-line format. Subsequent lines can provide additional details.
      SERVICESTATUS: Primary plugin message | optional_performance_data
      Additional line 1
      Additional line 2
      Nagios primarily cares about the first line for the main status text and performance data.
  4. Performance Data (Perfdata):
    Optionally, plugins can return performance data, which Nagios can process and store for graphing (e.g., with PNP4Nagios). Perfdata is appended to the first line of output after a pipe symbol (|).
    Format: 'label'=value[UOM];[warn];[crit];[min];[max]

    • label: A string label for the datasource (e.g., load1, disk_usage_c). Should be short and avoid spaces or special characters (except underscore).
    • value: The actual value of the metric (integer or float).
    • UOM (Unit of Measure): Optional. E.g., s (seconds), %, B (bytes), MB, GB, TB, c (a continuous counter, such as bytes transmitted on an interface).
    • warn: Optional warning threshold for this metric.
    • crit: Optional critical threshold for this metric.
    • min: Optional minimum value for graphing.
    • max: Optional maximum value for graphing.

    Multiple perfdata metrics can be returned, separated by spaces:
    ... | metric1=value1;w1;c1 metric2=value2;w2;c2 ...
    Example:
    CPU_LOAD OK - 1 min: 0.05, 5 min: 0.10, 15 min: 0.15 | load1=0.05;10;15;0 load5=0.10;8;12;0 load15=0.15;5;10;0
  5. Error Output (STDERR):
    Plugins should ideally not print anything to standard error (STDERR) during normal operation. STDERR output is often captured by Nagios but not displayed as the primary status. It might be logged or shown in extended info. For critical plugin failures, exiting with status 3 (UNKNOWN) and providing an error message on STDOUT is preferred.
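
The output and perfdata layout described above can be sketched as a few small Bash helpers. This is only an illustration: the function names perfdata_token, plugin_text, and plugin_perfdata are hypothetical and not part of Nagios. The first builds one perfdata token and keeps empty threshold positions so later fields stay aligned; the other two split the first line of plugin output at the pipe symbol.

```shell
#!/bin/bash
# Hypothetical helpers, not part of Nagios or the standard plugins.

# Build one perfdata token: label=value[UOM];[warn];[crit];[min];[max]
# Empty fields in the middle are kept as empty positions.
perfdata_token() {
    local label="$1" value="$2" uom="$3" warn="$4" crit="$5" min="$6" max="$7"
    local token="${label}=${value}${uom};${warn};${crit};${min};${max}"
    # Trailing empty fields may be dropped, so strip trailing semicolons.
    while [ "${token%;}" != "${token}" ]; do
        token="${token%;}"
    done
    printf '%s\n' "${token}"
}

# Print the human-readable part of an output line (everything before the pipe).
plugin_text() {
    local line="$1"
    line="${line%%|*}"
    printf '%s\n' "${line% }"
}

# Print the perfdata part of an output line (everything after the first pipe).
plugin_perfdata() {
    local line="$1"
    case "${line}" in
        *"|"*) line="${line#*|}"; printf '%s\n' "${line# }" ;;
        *)     printf '\n' ;;
    esac
}
```

For example, perfdata_token size 512 KB "" 2000 0 "" yields size=512KB;;2000;0: the warning position is left empty, so 2000 is still read as the critical threshold and 0 as the minimum.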

Choosing a Scripting Language:

  • Bash:
    Excellent for simple checks, file system operations, or wrapping existing command-line tools. Widely available on Linux.
  • Perl:
    Historically very popular for Nagios plugins due to its strong text processing and regular expression capabilities. Many existing plugins are written in Perl. The Nagios::Plugin Perl module (since renamed Monitoring::Plugin) simplifies development.
  • Python:
    Increasingly popular due to its readability, extensive libraries, and ease of use. The nagiosplugin Python library can be helpful.
  • PowerShell:
    The go-to for Windows-specific checks if you are writing scripts to be executed locally on a Windows machine (e.g., via NSClient++'s external script capabilities).

Basic Plugin Structure (Conceptual):

  1. Parse command-line arguments (thresholds, host/port, etc.).
  2. Perform the check (e.g., read a file, query an API, run a command).
  3. Analyze the result against thresholds.
  4. Determine the status (OK, WARNING, CRITICAL, UNKNOWN).
  5. Construct the output string (including performance data if applicable).
  6. Print the output string to STDOUT.
  7. Exit with the appropriate status code.
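
Steps 3 to 7 of this structure can be condensed into one reusable function. The sketch below is a minimal illustration, assuming a non-negative integer metric where higher values are worse; the name evaluate_threshold is hypothetical. It returns the state code instead of exiting, so the calling plugin decides when to exit.

```shell
#!/bin/bash
# Hypothetical helper: compare a non-negative integer with warn/crit
# thresholds (higher is worse), print one status line, return the state code.
evaluate_threshold() {
    local label="$1" value="$2" warn="$3" crit="$4"
    local state=0 text="OK" n

    # Input that cannot be judged is reported as UNKNOWN (3).
    for n in "${value}" "${warn}" "${crit}"; do
        case "${n}" in
            ''|*[!0-9]*)
                echo "UNKNOWN: ${label}: non-numeric input '${n}'"
                return 3
                ;;
        esac
    done

    # Steps 3-4: check the critical threshold first, then the warning one.
    if [ "${value}" -ge "${crit}" ]; then
        state=2; text="CRITICAL"
    elif [ "${value}" -ge "${warn}" ]; then
        state=1; text="WARNING"
    fi

    # Steps 5-6: one line on STDOUT, perfdata after the pipe.
    echo "${text}: ${label} is ${value} | ${label}=${value};${warn};${crit};0"

    # Step 7: the caller exits with this code.
    return "${state}"
}
```

A plugin built on it would parse its arguments and measure the value (steps 1 and 2), then end with a call such as evaluate_threshold sessions "${count}" "${warn}" "${crit}" followed by exit $?.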

Example Bash Script Plugin:
Let's create a simple Bash script plugin that checks if a specific file exists and optionally if its size is within certain limits.

check_custom_file.sh

#!/bin/bash

# Exit codes
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

# Default values
FILE_PATH=""
WARN_SIZE_KB="" # Warn if size is GREATER than this in KB
CRIT_SIZE_KB="" # Critical if size is GREATER than this in KB
CHECK_EXISTS_ONLY=false

# --- Helper function for usage ---
print_usage() {
    echo "Usage: $0 -f <file_path> [-w <warn_size_kb> -c <crit_size_kb>] [-e]"
    echo "  -f <file_path>: Path to the file to check."
    echo "  -w <warn_size_kb>: Warning threshold for file size in KB (optional)."
    echo "  -c <crit_size_kb>: Critical threshold for file size in KB (optional)."
    echo "  -e: Check for existence only, ignore size checks (optional)."
    exit $STATE_UNKNOWN
}

# --- Parse command line arguments ---
while getopts "f:w:c:e" opt; do
    case ${opt} in
        f) FILE_PATH="${OPTARG}" ;;
        w) WARN_SIZE_KB="${OPTARG}" ;;
        c) CRIT_SIZE_KB="${OPTARG}" ;;
        e) CHECK_EXISTS_ONLY=true ;;
        *) print_usage ;;
    esac
done

# --- Validate arguments ---
if [ -z "${FILE_PATH}" ]; then
    echo "UNKNOWN: File path (-f) is mandatory."
    exit $STATE_UNKNOWN
fi

if ! ${CHECK_EXISTS_ONLY}; then
    if [ -n "${WARN_SIZE_KB}" ] && ! [[ "${WARN_SIZE_KB}" =~ ^[0-9]+$ ]]; then
        echo "UNKNOWN: Warning size (-w) must be a positive integer."
        exit $STATE_UNKNOWN
    fi
    if [ -n "${CRIT_SIZE_KB}" ] && ! [[ "${CRIT_SIZE_KB}" =~ ^[0-9]+$ ]]; then
        echo "UNKNOWN: Critical size (-c) must be a positive integer."
        exit $STATE_UNKNOWN
    fi
    if [ -n "${WARN_SIZE_KB}" ] && [ -n "${CRIT_SIZE_KB}" ] && [ "${WARN_SIZE_KB}" -ge "${CRIT_SIZE_KB}" ]; then
        echo "UNKNOWN: Warning size (-w) must be less than critical size (-c)."
        exit $STATE_UNKNOWN
    fi
fi

# --- Perform the check ---
if [ ! -f "${FILE_PATH}" ]; then
    echo "CRITICAL: File '${FILE_PATH}' does not exist or is not a regular file."
    exit $STATE_CRITICAL
fi

if ${CHECK_EXISTS_ONLY}; then
    echo "OK: File '${FILE_PATH}' exists."
    exit $STATE_OK
fi

# Check size (if thresholds are provided)
FILE_SIZE_BYTES=$(stat -c%s "${FILE_PATH}")
FILE_SIZE_KB=$((FILE_SIZE_BYTES / 1024))
# Always emit the warn and crit positions (possibly empty) so that
# the minimum value stays in the fourth field of the perfdata token
PERFDATA="size=${FILE_SIZE_KB}KB;${WARN_SIZE_KB};${CRIT_SIZE_KB};0"

STATUS_MSG_PREFIX="File '${FILE_PATH}' size is ${FILE_SIZE_KB}KB"

# Check critical threshold first
if [ -n "${CRIT_SIZE_KB}" ] && [ "${FILE_SIZE_KB}" -gt "${CRIT_SIZE_KB}" ]; then
    echo "CRITICAL: ${STATUS_MSG_PREFIX} (Threshold > ${CRIT_SIZE_KB}KB) | ${PERFDATA}"
    exit $STATE_CRITICAL
fi

# Check warning threshold
if [ -n "${WARN_SIZE_KB}" ] && [ "${FILE_SIZE_KB}" -gt "${WARN_SIZE_KB}" ]; then
    echo "WARNING: ${STATUS_MSG_PREFIX} (Threshold > ${WARN_SIZE_KB}KB) | ${PERFDATA}"
    exit $STATE_WARNING
fi

# If we reach here, it's OK
echo "OK: ${STATUS_MSG_PREFIX} | ${PERFDATA}"
exit $STATE_OK

Integrating the Custom Plugin with Nagios:

  1. Place the Plugin: Copy your plugin script (e.g., check_custom_file.sh) to the Nagios plugins directory on the Nagios server (or on the remote host if it's to be run by NRPE/NSClient++).

    sudo cp check_custom_file.sh /usr/local/nagios/libexec/
    sudo chown nagios:nagios /usr/local/nagios/libexec/check_custom_file.sh
    sudo chmod +x /usr/local/nagios/libexec/check_custom_file.sh
    

  2. Test the Plugin from Command Line: Always test thoroughly from the command line before defining it in Nagios.

    # As nagios user for realistic permissions test
    sudo -u nagios /usr/local/nagios/libexec/check_custom_file.sh -f /var/log/syslog -w 10000 -c 20000
    # Expected: OK: File '/var/log/syslog' size is XXXKB | size=XXXKB;10000;20000;0
    
    sudo -u nagios /usr/local/nagios/libexec/check_custom_file.sh -f /tmp/nonexistentfile -e
    # Expected: CRITICAL: File '/tmp/nonexistentfile' does not exist...
    
    # Create a large test file
    sudo fallocate -l 25M /tmp/largefile.test
    sudo -u nagios /usr/local/nagios/libexec/check_custom_file.sh -f /tmp/largefile.test -w 10000 -c 20000 # 10MB, 20MB
    # Expected: CRITICAL: File '/tmp/largefile.test' size is 25600KB (Threshold > 20000KB) | size=25600KB;10000;20000;0
    sudo rm /tmp/largefile.test
    

  3. Define a Nagios Command: Add a command definition in /usr/local/nagios/etc/objects/commands.cfg:

    define command{
        command_name    check_custom_file
        command_line    $USER1$/check_custom_file.sh -f $ARG1$ -w $ARG2$ -c $ARG3$ $ARG4$
                                                        ; ARG1=file, ARG2=warn_size, ARG3=crit_size
                                                        ; ARG4 can be used for extra args like -e
    }
    

    • $ARG1$: Will be the file path.
    • $ARG2$: Will be the warning size threshold.
    • $ARG3$: Will be the critical size threshold.
    • $ARG4$: Could be used to pass -e if checking existence only, or other optional flags.
  4. Define a Nagios Service: Add a service definition for a host (e.g., localhost in localhost.cfg):

    define service{
        use                     local-service
        host_name               localhost
        service_description     Syslog Size Check
        check_command           check_custom_file!/var/log/syslog!102400!204800
                                        ; Check /var/log/syslog, Warn > 100MB, Crit > 200MB
    }
    
    define service{
        use                     local-service
        host_name               localhost
        service_description     Important Config File Exists
        check_command           check_custom_file!/etc/my_app/important.conf!!-e
                                        ; Check /etc/my_app/important.conf exists.
                                        ; Note the empty $ARG2$ and $ARG3$ (warn/crit sizes)
                                        ; $ARG4$ is -e for existence check
    }
    
    For the second service, the !! leaves $ARG2$ and $ARG3$ empty, and -e is passed as $ARG4$. The resulting command line is check_custom_file.sh -f /etc/my_app/important.conf -w -c -e, so getopts takes -c as the value of -w. The script tolerates this only because -e skips the size validation. Defining separate commands in commands.cfg avoids relying on that side effect:
    # Command for file size check
    define command{
        command_name    check_custom_file_size
        command_line    $USER1$/check_custom_file.sh -f $ARG1$ -w $ARG2$ -c $ARG3$
    }
    # Command for file existence check
    define command{
        command_name    check_custom_file_exists
        command_line    $USER1$/check_custom_file.sh -f $ARG1$ -e
    }
    
    Then the service definitions would use check_custom_file_size or check_custom_file_exists accordingly. This is often cleaner.

  5. Verify and Reload Nagios:

    sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    sudo systemctl reload nagios
    
    The new service(s) should appear in the Nagios UI.
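
The effect of empty $ARG2$ and $ARG3$ macros on the single four-argument command can be reproduced without Nagios. The sketch below copies the option string of check_custom_file.sh into a hypothetical function, parse_demo, and prints what getopts actually assigns.

```shell
#!/bin/bash
# Re-create the option parsing of check_custom_file.sh in a function so the
# result of empty macros can be observed directly. parse_demo is hypothetical.
parse_demo() {
    local OPTIND=1 opt file="" warn="" crit="" exists_only=false
    while getopts "f:w:c:e" opt; do
        case "${opt}" in
            f) file="${OPTARG}" ;;
            w) warn="${OPTARG}" ;;
            c) crit="${OPTARG}" ;;
            e) exists_only=true ;;
            *) echo "usage error"; return 3 ;;
        esac
    done
    echo "file=${file} warn=${warn} crit=${crit} exists_only=${exists_only}"
}

# The arguments as they look once the empty macros have been substituted:
parse_demo -f /etc/my_app/important.conf -w -c -e
# prints: file=/etc/my_app/important.conf warn=-c crit= exists_only=true
```

The warning threshold silently becomes the string -c. With a dedicated existence command, the call is simply parse_demo -f /etc/my_app/important.conf -e and no threshold is set at all.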

Writing custom plugins is a powerful skill that allows you to tailor Nagios to almost any monitoring requirement. Remember to test thoroughly and handle errors gracefully within your plugin.
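
Command-line tests become repeatable when they are scripted. The following sketch assumes nothing about Nagios itself: expect_state is a hypothetical helper that runs any command, compares its exit code with the expected plugin state, and reports the first line of output.

```shell
#!/bin/bash
# Hypothetical test helper: run a command and compare its exit code with the
# expected plugin state (0-3). Returns 0 on a match, 1 otherwise.
expect_state() {
    local expected="$1"; shift
    local output first_line rc=0
    output=$("$@" 2>&1) || rc=$?
    first_line=$(printf '%s\n' "${output}" | head -n 1)
    if [ "${rc}" -eq "${expected}" ]; then
        echo "PASS (rc=${rc}): ${first_line}"
        return 0
    fi
    echo "FAIL (expected ${expected}, got ${rc}): ${first_line}"
    return 1
}
```

With the plugin from this section installed, a check such as expect_state 2 /usr/local/nagios/libexec/check_custom_file.sh -f /tmp/nonexistentfile -e should report PASS, because a missing file is expected to produce CRITICAL.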

Workshop Creating a Simple Bash Script Plugin

Objective:
Write a Bash script plugin to check the number of active SSH sessions on the Nagios server (localhost). The plugin should issue a WARNING when the count reaches one threshold and a CRITICAL when it reaches a second, higher threshold.

Plugin Logic:

  • The plugin counts the lines printed by the who command. Each line is a logged-in terminal session, which on a headless server usually corresponds to an SSH session; local console logins are counted too. Filtering on pts/ or on the remote address would narrow the count to remote sessions.
  • It will accept warning (-w) and critical (-c) thresholds as arguments.
  • It will output the number of sessions and performance data.

Steps:

  1. Create the Plugin Script on the Nagios Server (VM1): Navigate to a temporary directory or your preferred script development location. Create a file named check_ssh_sessions.sh:

    nano check_ssh_sessions.sh
    
    Paste the following Bash script content:
    #!/bin/bash
    
    # Nagios Exit Codes
    STATE_OK=0
    STATE_WARNING=1
    STATE_CRITICAL=2
    STATE_UNKNOWN=3
    
    # Default thresholds (can be overridden by arguments)
    WARN_THRESHOLD=""
    CRIT_THRESHOLD=""
    
    # --- Helper function for usage ---
    print_usage() {
        echo "Usage: $0 -w <warn_sessions> -c <crit_sessions>"
        echo "  -w <warn_sessions>: Warning threshold for number of active SSH sessions."
        echo "  -c <crit_sessions>: Critical threshold for number of active SSH sessions."
        exit $STATE_UNKNOWN
    }
    
    # --- Parse Arguments ---
    while getopts "w:c:" opt; do
        case ${opt} in
            w) WARN_THRESHOLD="${OPTARG}" ;;
            c) CRIT_THRESHOLD="${OPTARG}" ;;
            *) print_usage ;;
        esac
    done
    
    # --- Validate Arguments ---
    if [ -z "${WARN_THRESHOLD}" ] || ! [[ "${WARN_THRESHOLD}" =~ ^[0-9]+$ ]]; then
        echo "UNKNOWN: Warning threshold (-w) must be a positive integer."
        exit $STATE_UNKNOWN
    fi
    if [ -z "${CRIT_THRESHOLD}" ] || ! [[ "${CRIT_THRESHOLD}" =~ ^[0-9]+$ ]]; then
        echo "UNKNOWN: Critical threshold (-c) must be a positive integer."
        exit $STATE_UNKNOWN
    fi
    if [ "${WARN_THRESHOLD}" -ge "${CRIT_THRESHOLD}" ]; then
        echo "UNKNOWN: Warning threshold (-w) must be less than critical threshold (-c)."
        exit $STATE_UNKNOWN
    fi
    
    # --- Perform the Check ---
    # For this workshop, every line printed by 'who' counts as one session.
    # This includes local console logins, so it is only a proxy for SSH sessions.
    # Stricter alternatives: `who | grep -c '(.*)'` counts entries that show a remote origin,
    # and `ss -tn state established '( sport = :ssh )'` lists established connections to the local SSH port.
    CURRENT_SESSIONS=$(who | wc -l)
    
    # --- Determine Status and Output ---
    OUTPUT_MSG="Active sessions: ${CURRENT_SESSIONS}"
    PERFDATA="sessions=${CURRENT_SESSIONS};${WARN_THRESHOLD};${CRIT_THRESHOLD};0" # Min value 0
    
    if [ "${CURRENT_SESSIONS}" -ge "${CRIT_THRESHOLD}" ]; then
        echo "CRITICAL: ${OUTPUT_MSG} | ${PERFDATA}"
        exit $STATE_CRITICAL
    elif [ "${CURRENT_SESSIONS}" -ge "${WARN_THRESHOLD}" ]; then
        echo "WARNING: ${OUTPUT_MSG} | ${PERFDATA}"
        exit $STATE_WARNING
    else
        echo "OK: ${OUTPUT_MSG} | ${PERFDATA}"
        exit $STATE_OK
    fi
    
    Save the file (Ctrl+X, Y, Enter in nano).

  2. Make the Plugin Executable and Move it:

    chmod +x check_ssh_sessions.sh
    sudo cp check_ssh_sessions.sh /usr/local/nagios/libexec/
    sudo chown nagios:nagios /usr/local/nagios/libexec/check_ssh_sessions.sh
    

  3. Test the Plugin from the Command Line: Log in via SSH to your Nagios server a few times from different terminals to create some sessions. Then run these tests:

    # Test OK state (assuming you have < 3 sessions)
    /usr/local/nagios/libexec/check_ssh_sessions.sh -w 3 -c 5
    # Expected: OK: Active sessions: X | sessions=X;3;5;0 (where X is your session count)
    
    # Test WARNING state (adjust -w so current sessions >= warning but < critical)
    # If you have 2 sessions, test with:
    /usr/local/nagios/libexec/check_ssh_sessions.sh -w 1 -c 5
    # Expected: WARNING: Active sessions: 2 | sessions=2;1;5;0
    
    # Test CRITICAL state (adjust -c so current sessions >= critical)
    # If you have 2 sessions, test with:
    /usr/local/nagios/libexec/check_ssh_sessions.sh -w 1 -c 2
    # Expected: CRITICAL: Active sessions: 2 | sessions=2;1;2;0
    
    # Test argument validation
    /usr/local/nagios/libexec/check_ssh_sessions.sh -w 5 -c 3 # Warn >= Crit
    # Expected: UNKNOWN: Warning threshold (-w) must be less than critical threshold (-c).
    
    /usr/local/nagios/libexec/check_ssh_sessions.sh -w foo -c bar
    # Expected: UNKNOWN: Warning threshold (-w) must be a positive integer.
    

  4. Define a Nagios Command for the Plugin: Open /usr/local/nagios/etc/objects/commands.cfg on your Nagios server (VM1):

    sudo nano /usr/local/nagios/etc/objects/commands.cfg
    
    Add the following command definition:
    define command{
        command_name    check_active_ssh_sessions
        command_line    $USER1$/check_ssh_sessions.sh -w $ARG1$ -c $ARG2$
    }
    
    Save and exit.

  5. Define a Nagios Service to Use the Plugin: We'll add this service to monitor localhost (the Nagios server itself). Open /usr/local/nagios/etc/objects/localhost.cfg:

    sudo nano /usr/local/nagios/etc/objects/localhost.cfg
    
    Add the following service definition:
    define service{
        use                     local-service           ; Name of service template to use
        host_name               localhost
        service_description     Active SSH Sessions
        check_command           check_active_ssh_sessions!3!5
                                                        ; Warn if >= 3 sessions, Crit if >= 5 sessions
    }
    
    Adjust the thresholds 3 and 5 as appropriate for your server. Save and exit.

  6. Verify Nagios Configuration and Reload:

    sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    # If "Total Warnings: 0" and "Total Errors: 0":
    sudo systemctl reload nagios
    

  7. Check in Nagios Web Interface:

    • Open your Nagios web UI.
    • Go to "Services." Look for the "Active SSH Sessions" service associated with localhost.
    • It will initially be "Pending." After Nagios runs the check, it should display the status (OK, WARNING, or CRITICAL) based on the current number of sessions and your defined thresholds.
    • Click on the service name to see the status output and performance data.

Outcome:
You have successfully created a custom Bash script plugin, integrated it into Nagios, and are now monitoring the number of active SSH sessions on your Nagios server. This workshop covered:

  • Writing a Bash script that adheres to Nagios plugin guidelines (exit codes, output).
  • Implementing argument parsing and validation.
  • Performing a system check (who | wc -l).
  • Formatting output with status messages and performance data.
  • Defining the corresponding Nagios command and service.
  • Testing the plugin thoroughly.

This practical experience provides a solid foundation for developing more complex custom plugins tailored to your specific monitoring needs.
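
As a possible refinement, the session count can be restricted to entries that show a remote origin. The sketch below assumes that who prints the origin in parentheses as the last field (for example "(192.0.2.10)"); local graphical or multiplexer sessions may also appear in parentheses on some systems, so the result is still an approximation. The function name count_remote_sessions is hypothetical, and it reads STDIN so it can be tried with canned input.

```shell
#!/bin/bash
# Hypothetical helper: count lines of who-style input whose last field is a
# parenthesised origin such as "(192.0.2.10)".
count_remote_sessions() {
    awk '$NF ~ /^\(.+\)$/ { n++ } END { print n + 0 }'
}
```

In the workshop plugin, the assignment would then read CURRENT_SESSIONS=$(who | count_remote_sessions).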

Advanced Object Configuration

Nagios's object configuration provides powerful features to manage complex environments efficiently. Using host groups, service groups, templates, timeperiods, escalations, and dependencies allows you to create a more organized, maintainable, and intelligent monitoring setup.

Host Groups and Service Groups:

  • Host Groups: Collections of hosts. They simplify management and viewing. For example, you can create groups like linux-servers, windows-servers, web-servers, database-servers, network-switches.

    • Viewing: The Nagios UI allows filtering by host group.
    • Configuration: A service definition can name a host group in its hostgroup_name directive, which applies that service to every member host. Contact groups and check settings are assigned on the hosts and services themselves (usually through templates), not on the host group object.
    • Dependencies & Escalations: Can be defined based on host groups.

    Definition (hostgroups.cfg or similar):

    define hostgroup{
        hostgroup_name  web-servers
        alias           All Web Servers
        members         webserver01,webserver02,apache-prod ; Comma-separated list of host_name's
    }
    
    A host can be a member of multiple host groups, either by listing its host_name in the members directive of each group or by specifying hostgroups in its host definition:
    define host{
        host_name       webserver01
        # ... other settings ...
        hostgroups      web-servers,linux-servers ; This host is in two groups
    }
    

  • Service Groups: Collections of services. Similar to host groups, they aid in organization and viewing. A service group can contain services from different hosts.

    • Viewing: The UI allows filtering by service group.
    • Business Process Monitoring: Can be used to group services that constitute a critical business process (e.g., all services related to an e-commerce application).

    Definition (servicegroups.cfg or similar):

    define servicegroup{
        servicegroup_name       critical-db-services
        alias                   Critical Database Services
        members                 dbserver01,Oracle Listener,dbserver01,Oracle Tablespace Usage,dbserver02,MySQL Status
                                ; Format: host_name,service_description,host_name,service_description,...
    }
    
    Alternatively, assign a service to service groups in its definition:
    define service{
        host_name           webserver01
        service_description HTTP Port 80
        # ... other settings ...
        servicegroups       http-services,ecommerce-app-services
    }
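
The members directive of a service group is a flat list of alternating host names and service descriptions, which is easy to get wrong by hand. As a small sketch, the hypothetical helper below joins "host|service" pairs into that format; the pipe separator is an arbitrary choice of this example.

```shell
#!/bin/bash
# Hypothetical helper: join "host|service" pairs into the
# host_name,service_description,host_name,service_description,... format.
servicegroup_members() {
    local pair out=""
    for pair in "$@"; do
        out="${out:+${out},}${pair%%|*},${pair#*|}"
    done
    printf '%s\n' "${out}"
}

servicegroup_members "dbserver01|Oracle Listener" "dbserver02|MySQL Status"
# prints: dbserver01,Oracle Listener,dbserver02,MySQL Status
```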
    

Templates for Hosts and Services (Inheritance):
Templates are one of the most powerful features for reducing redundancy and simplifying configuration. You define a set of common properties in a template, and then host or service definitions can use that template to inherit those properties.

  • How Inheritance Works:
    • Objects inherit all properties from the template(s) they use.
    • Properties defined directly in the object override those inherited from the template.
    • Multiple templates can be used (comma-separated list for the use directive). When the same property appears in more than one of them, the value from the template listed first wins.
    • Templates can also inherit from other templates, creating a hierarchy.
  • register 0: Templates themselves are not actual hosts or services to be monitored. The register 0 directive in a template definition tells Nagios not to register it as a live object.

Example (templates.cfg):

# Generic host template
define host{
    name                            generic-host    ; Name of this template
    notifications_enabled           1
    event_handler_enabled           1
    flap_detection_enabled          1
    process_perf_data               1
    retain_status_information       1
    retain_nonstatus_information    1
    notification_period             24x7
    check_period                    24x7
    max_check_attempts              5
    check_interval                  5               ; Check every 5 minutes
    retry_interval                  1               ; Retry every 1 minute on failure
    contact_groups                  admins          ; Default contact group
    register                        0               ; THIS IS A TEMPLATE, DO NOT REGISTER
}

# Linux server template inheriting from generic-host
define host{
    name                            linux-server
    use                             generic-host    ; Inherit from generic-host
    check_command                   check-host-alive-ping ; Specific check command for Linux
    icon_image                      linux40.png     ; (Assuming you have icons)
    statusmap_image                 linux40.gd2
    register                        0
}

# Generic service template
define service{
    name                            generic-service
    active_checks_enabled           1
    passive_checks_enabled          1
    parallelize_check               1
    obsess_over_service             1
    check_freshness                 0
    notifications_enabled           1
    event_handler_enabled           1
    flap_detection_enabled          1
    process_perf_data               1
    retain_status_information       1
    retain_nonstatus_information    1
    is_volatile                     0
    check_period                    24x7
    max_check_attempts              3
    normal_check_interval           10              ; Check every 10 minutes
    retry_check_interval            2               ; Retry every 2 minutes on failure
    contact_groups                  admins
    notification_options            w,u,c,r         ; Notify on warning, unknown, critical, recovery
    notification_interval           60              ; Re-notify every 60 minutes for ongoing problem
    notification_period             24x7
    register                        0
}

Using Templates in Object Definitions:

# In a host definition file (e.g., mywebserver.cfg)
define host{
    use             linux-server    ; Inherits all settings from linux-server & generic-host
    host_name       my-web-01
    alias           Production Web Server 01
    address         192.168.1.50
    contact_groups  web-admins,db-admins ; Overrides 'admins' from template
}

# In a service definition file
define service{
    use                     generic-service
    host_name               my-web-01
    service_description     HTTP Check
    check_command           check_http
    normal_check_interval   5 ; Override default of 10 mins from generic-service
}
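
Because a template carries all shared settings, the per-host part of a definition shrinks to a few lines that can be generated. The sketch below is one way to do that, assuming an inventory of "host_name,alias,address" lines; the function name generate_hosts and the inventory format are assumptions of this example, not Nagios features.

```shell
#!/bin/bash
# Hypothetical generator: emit one host definition per
# "host_name,alias,address" line on STDIN, inheriting from the template in $1.
generate_hosts() {
    local template="$1" name alias address
    while IFS=, read -r name alias address; do
        # Skip blank lines and comment lines in the inventory.
        case "${name}" in ''|'#'*) continue ;; esac
        printf 'define host{\n'
        printf '    use             %s\n' "${template}"
        printf '    host_name       %s\n' "${name}"
        printf '    alias           %s\n' "${alias}"
        printf '    address         %s\n' "${address}"
        printf '}\n\n'
    done
}
```

A call such as generate_hosts linux-server < hosts.csv would print the definitions to STDOUT (hosts.csv is a hypothetical file name); redirect the output into a .cfg file that nagios.cfg includes, then verify the configuration before reloading.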

Timeperiods:
Timeperiods define when Nagios can perform checks and send notifications.

  • Usage: Assigned to hosts, services (for check_period and notification_period), and contacts (for host_notification_period and service_notification_period).
  • 24x7 is a common default. Others like workhours, nonworkhours, none (to disable) are useful.

Definition (timeperiods.cfg):

define timeperiod{
    timeperiod_name         us-workhours
    alias                   Normal US Work Hours (Mon-Fri, 9am-5pm EST)
    monday                  09:00-17:00
    tuesday                 09:00-17:00
    wednesday               09:00-17:00
    thursday                09:00-17:00
    friday                  09:00-17:00
    # Omitting saturday and sunday means they are not part of this timeperiod
}

define timeperiod{
    timeperiod_name         none
    alias                   No Time Is A Good Time
    # No day directives = never
}
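
To make the meaning of a range such as 09:00-17:00 concrete, the sketch below tests whether a given time of day falls inside it. It only illustrates the idea and is not how Nagios is implemented: the function names are hypothetical, and treating the start as inclusive and the end as exclusive is an assumption of this example.

```shell
#!/bin/bash
# Hypothetical helpers illustrating a timeperiod range check.

# Convert HH:MM to minutes since midnight.
to_minutes() {
    local hh="${1%%:*}" mm="${1##*:}"
    # 10# forces base 10 so that "09" is not rejected as an invalid octal number.
    echo $(( 10#${hh} * 60 + 10#${mm} ))
}

# Return 0 if time $2 (HH:MM) lies inside range $1 (HH:MM-HH:MM), else 1.
# Boundary handling (start inclusive, end exclusive) is an assumption.
in_timerange() {
    local range="$1" now start end
    now=$(to_minutes "$2")
    start=$(to_minutes "${range%%-*}")
    end=$(to_minutes "${range##*-}")
    [ "${now}" -ge "${start}" ] && [ "${now}" -lt "${end}" ]
}
```

With the us-workhours definition, in_timerange "09:00-17:00" "12:30" succeeds on a weekday, while "18:00" fails, so a notification at that time would be held back.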

Escalations (Host and Service):
Escalations define modified notification rules if a problem persists. For example, notify a manager if a critical service is still down after 1 hour.

  • Trigger: Based on the notification number, via first_notification and last_notification. Elapsed time enters only indirectly, through the notification_interval that spaces those notifications.
  • Action: Can notify different contact groups, use different notification intervals, or limit the escalation period.

Definition (escalations.cfg or similar):

define serviceescalation{
    host_name               my-critical-server
    service_description     Main Application Service
    first_notification      3           ; Escalation starts with the 3rd notification for this service
    last_notification       0           ; 0 means escalate for all subsequent notifications
    contact_groups          oncall-level2,managers ; Notify these groups
    notification_interval   30          ; Notify these escalated contacts every 30 mins
    escalation_period       24x7        ; Escalate during this timeperiod
    escalation_options      w,u,c       ; Escalate for warning, unknown, critical states
}

define hostescalation{
    hostgroup_name          database-servers
    first_notification      2
    last_notification       5
    contact_groups          db-managers
    notification_interval   60
    escalation_options      d,u         ; Escalate for down, unreachable states
}
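
The interplay of first_notification and last_notification can be expressed as a short predicate. This is a sketch of the rule as described above, not Nagios source code, and the function name escalation_applies is hypothetical.

```shell
#!/bin/bash
# Hypothetical predicate: does an escalation with the given
# first_notification ($2) and last_notification ($3) cover notification $1?
escalation_applies() {
    local n="$1" first="$2" last="$3"
    # Notifications before first_notification are never escalated.
    [ "${n}" -ge "${first}" ] || return 1
    # last_notification 0 means the escalation never stops.
    [ "${last}" -eq 0 ] || [ "${n}" -le "${last}" ]
}
```

For the hostescalation above (first 2, last 5), notifications 2 through 5 go to db-managers, while notification 1 and notification 6 onward use the normal contacts. For the serviceescalation (first 3, last 0), every notification from the third onward is escalated.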

Dependencies (Host and Service):
Dependencies define relationships between hosts or services to prevent a flood of notifications during widespread outages and enable smarter root cause analysis.

  • Purpose: If a "parent" host/service is down/critical, Nagios can suppress notifications for "child" hosts/services that depend on it. Checks for dependent items might also be suppressed.
  • Example: If a core switch is down, all servers connected to it will become unreachable. Defining the servers as dependent on the switch prevents notifications for each server, only alerting for the switch.
  • Execution Dependency: The check for the dependent item will not run if the dependency is not met.
  • Notification Dependency: Notifications for the dependent item will be suppressed if the dependency is not met.
  • Dependency Period: Defines when the dependency is active.

Definition (dependencies.cfg or similar):

define hostdependency{
    host_name                       my-core-switch      ; Master host (the one being depended upon)
    dependent_host_name             my-web-server       ; Dependent host
    notification_failure_criteria   d,u                 ; If switch is d=DOWN or u=UNREACHABLE...
                                                        ; ...suppress notifications for my-web-server
    # execution_failure_criteria takes the same options: o=UP, d=DOWN, u=UNREACHABLE, p=PENDING, n=NONE
}

define servicedependency{
    host_name                       my-db-server
    service_description             Database Service    ; Master service (the one being depended upon)
    dependent_host_name             my-app-server
    dependent_service_description   Application UI      ; Dependent service
    notification_failure_criteria   w,u,c               ; If DB service is W, U, or C...
                                                        ; ...suppress notifications for App UI service
    inherits_parent                 1                   ; Also honor the Database Service's own dependencies
}

inherits_parent=1 means the dependency inherits the dependencies of the master service: if Database Service itself depends on other services and one of those dependencies fails, this dependency is considered failed as well.
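
How a failure criteria list is applied can be sketched as a simple membership test: the dependency fails when the current state of the master (the object being depended upon) is one of the listed letters. The function name dependency_failed is hypothetical, and the sketch assumes a comma-separated list without spaces.

```shell
#!/bin/bash
# Hypothetical predicate: return 0 if the master's state letter ($2) appears
# in the comma-separated failure criteria ($1), e.g. "d,u" or "w,u,c".
# The option "n" (none) means the dependency never fails.
dependency_failed() {
    local criteria="$1" state="$2" item
    local IFS=','
    for item in ${criteria}; do
        [ "${item}" = "n" ] && return 1
        [ "${item}" = "${state}" ] && return 0
    done
    return 1
}
```

With notification_failure_criteria set to w,u,c, dependency_failed "w,u,c" c succeeds, so notifications for the dependent service are suppressed while the master service is CRITICAL; with the master in state o (OK), the test fails and notifications go out as usual.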

Benefits of Advanced Object Configuration:

  • Reduced Redundancy: Templates make configurations DRY (Don't Repeat Yourself).
  • Easier Management: Changes to templates propagate to all inheriting objects. Groups simplify bulk operations and views.
  • Smarter Alerting: Escalations ensure critical issues get appropriate attention. Dependencies reduce notification noise and help pinpoint root causes.
  • Flexibility: Timeperiods allow fine-grained control over check and notification timing.

Mastering these advanced object configurations is key to scaling your Nagios deployment and making it an indispensable tool rather than a source of alert fatigue.

Workshop Implementing Host Groups and Service Templates

Objective:
Organize existing monitored hosts into logical host groups and create/use a more specific service template for a common type of check (e.g., disk space checks).

Prerequisites:

  • A working Nagios Core installation with at least localhost and one remote host (Linux or Windows) monitored.
  • For this workshop, let's assume you have:
    • localhost (your Nagios server).
    • vm2-remote-linux (a remote Linux server).
    • vm3-remote-windows (a remote Windows server).
    • All hosts currently use generic templates like linux-server or generic-host, and services use generic-service or local-service.

Part 1: Implementing Host Groups

  1. Define Host Groups: Create or edit a file for host group definitions, e.g., /usr/local/nagios/etc/objects/hostgroups.cfg. If this file doesn't exist, create it and ensure it's included in nagios.cfg (e.g., cfg_file=/usr/local/nagios/etc/objects/hostgroups.cfg).

    sudo nano /usr/local/nagios/etc/objects/hostgroups.cfg
    
    Add the following definitions:
    define hostgroup{
        hostgroup_name  linux-servers
        alias           All Linux Servers
        members         localhost, vm2-remote-linux  ; Add other Linux host_names if you have them
    }
    
    define hostgroup{
        hostgroup_name  windows-servers
        alias           All Windows Servers
        members         vm3-remote-windows           ; Add other Windows host_names if you have them
    }
    
    define hostgroup{
        hostgroup_name  all-servers
        alias           All Monitored Servers
        members         localhost, vm2-remote-linux, vm3-remote-windows
                                                    ; Explicitly list all, or use hostgroup recursion later
    }
    
    Alternatively, instead of listing members here, you can add the hostgroups directive to each host definition. For this workshop, members in the hostgroup definition is fine.

  2. Verify and Reload Nagios:

    sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    sudo systemctl reload nagios
    

  3. Check in Nagios Web Interface:

    • Go to the "Host Groups" link in the navigation pane. You should see your newly defined groups: linux-servers, windows-servers, and all-servers.
    • Click on each group name to see its members.
    • Under "Host Group Grid," you'll see a matrix view.
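
The verify-then-reload step recurs after every configuration change, so it can be wrapped in a function that reloads only when the pre-flight check succeeds. The sketch below assumes the paths used throughout this guide and that nagios -v exits with a non-zero status when verification fails; the variable names and verify_and_reload are hypothetical, and the function must run with root privileges for the real reload.

```shell
#!/bin/bash
# Hypothetical wrapper: reload Nagios only if the configuration verifies.
# The three variables can be overridden, e.g. for a dry run.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"
RELOAD_CMD="${RELOAD_CMD:-systemctl reload nagios}"

verify_and_reload() {
    if "${NAGIOS_BIN}" -v "${NAGIOS_CFG}" >/dev/null 2>&1; then
        # Intentionally unquoted so the command and its arguments are split.
        ${RELOAD_CMD}
    else
        echo "Configuration check failed; not reloading." >&2
        return 1
    fi
}
```

Setting RELOAD_CMD="echo would reload" before calling verify_and_reload gives a dry run that shows whether the reload would have happened.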

Part 2: Implementing a Specific Service Template for Disk Checks

Let's say you want all your disk space checks to have a slightly different retry interval or re-notification interval than generic-service.

  1. Define the Disk Service Template: Open your templates file, e.g., /usr/local/nagios/etc/objects/templates.cfg.

    sudo nano /usr/local/nagios/etc/objects/templates.cfg
    
    Add a new service template definition. This template will inherit from generic-service and then override specific values.
    define service{
        name                            disk-service-template
        use                             generic-service     ; Inherit from our main generic service
        normal_check_interval           15                  ; Check disk space every 15 minutes
        retry_check_interval            3                   ; Retry every 3 minutes on failure
        notification_interval           120                 ; Re-notify every 2 hours for ongoing disk issues
        register                        0                   ; This is a template
        # You could also add specific contact_groups here if disk alerts go to a storage team
        # contact_groups                admins,storage-team
    }
    

  2. Apply the New Template to Disk Services: Now, find your existing disk space service definitions and change them to use this new template.

    • For localhost (e.g., in localhost.cfg):

      sudo nano /usr/local/nagios/etc/objects/localhost.cfg
      
      Find the "Root Partition" service (or similar disk check):
      define service{
          # use                             local-service ; Old template
          use                             disk-service-template ; New template
          host_name                       localhost
          service_description             Root Partition
          check_command                   check_local_disk!20%!10%!/
      }
      
      If you have other disk checks on localhost (like /home or /var), update them too.

    • For vm2-remote-linux (e.g., in servers/vm2-linux.cfg):

      sudo nano /usr/local/nagios/etc/servers/vm2-linux.cfg
      
      Find the "Root Disk Space via NRPE" service:
      define service{
          # use                             generic-service ; Old template
          use                             disk-service-template ; New template
          host_name                       vm2-remote-linux
          service_description             Root Disk Space via NRPE
          check_command           check_nrpe!check_disk_root
      }
      

    • For vm3-remote-windows (e.g., in servers/vm3-windows.cfg):

      sudo nano /usr/local/nagios/etc/servers/vm3-windows.cfg
      
      Find the "Windows C Drive Space" service:
      define service{
          # use                             generic-service ; Old template
          use                             disk-service-template ; New template
          host_name                       vm3-remote-windows
          service_description     Windows C Drive Space
          check_command           check_nrpe!alias_disk_c_space ; Or your direct check_drivesize command
      }
      

  3. Verify and Reload Nagios:

    sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    sudo systemctl reload nagios
    

  4. Observe Changes (Subtle):

    • In the Nagios UI, go to the "Services" view.
    • Click on one of the disk services you modified (e.g., "Root Partition" for localhost).
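    • Look up the service's "Check Interval," "Retry Interval," and "Notification Interval." If the service detail page does not list them, open "Configuration" in the navigation pane, choose Services as the object type, and find the service there. The values should now reflect disk-service-template (e.g., check interval 15 min, retry interval 3 min, notification interval 120 min). This confirms the template inheritance is working.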
    • The actual check behavior (thresholds for warning/critical) remains defined by the check_command in the service definition itself.
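
Besides the web interface, the object files themselves can confirm which services picked up the new template. The following is a minimal sketch, not part of Nagios: the function name count_template_uses is made up for illustration, and it assumes the template name contains no regular-expression metacharacters. The demo builds a throwaway directory so it can run anywhere; on a real server you would point it at /usr/local/nagios/etc instead.

```shell
#!/bin/bash
# Count active (non-commented) "use <template>" lines below a config directory.
# Usage: count_template_uses <template_name> <config_dir>
count_template_uses() {
    local template="$1" dir="$2"
    # Anchoring at the start of the line skips lines such as "# use old-template".
    grep -rEh -- "^[[:space:]]*use[[:space:]]+${template}([[:space:];]|\$)" "$dir" \
        | wc -l | awk '{print $1}'
}

# Demo against a temporary directory with one sample service definition.
demo_dir="$(mktemp -d)"
cat > "$demo_dir/localhost.cfg" <<'EOF'
define service{
    # use                 local-service ; Old template
    use                 disk-service-template ; New template
    host_name           localhost
}
EOF
count_template_uses disk-service-template "$demo_dir"
rm -rf "$demo_dir"
```

In this workshop the count for disk-service-template should equal the number of disk services you edited.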

Outcome:
You have successfully:

  • Organized your hosts into logical hostgroups, making them easier to view and manage.
  • Created a specialized service template (disk-service-template) that inherits from a more generic template and customizes certain parameters.
  • Applied this new template to relevant disk space services, demonstrating how templates can enforce consistent settings (like check frequency or notification behavior) for similar types of checks across multiple hosts.

This workshop illustrates how using host groups and refining service templates leads to a more structured, maintainable, and scalable Nagios configuration. As your monitored environment grows, these practices become increasingly vital.

3. Advanced Nagios Management and Optimization

With a solid understanding of basic and intermediate Nagios concepts, we now explore advanced topics. This section delves into passive checks using NSCA for monitoring asynchronous events, implementing event handlers for automated problem remediation, strategies for performance tuning and scaling your Nagios instance, essential security best practices, and a brief look at extending Nagios with popular addons. These advanced techniques will help you build a more robust, efficient, and secure monitoring solution.

Passive Checks and NSCA

So far, we've primarily focused on active checks, where Nagios initiates checks for hosts and services at regular intervals. However, there are scenarios where this model isn't ideal:

  • Asynchronous Events: Monitoring events that don't occur regularly, such as the completion of a nightly backup job, a security alert from an intrusion detection system, or a user-triggered action.
  • Services Behind Restrictive Firewalls: When the Nagios server cannot directly reach a service to check it.
  • Resource Intensive Checks: For checks that are too resource-intensive to run frequently from the Nagios server.
  • Distributed Monitoring: Aggregating results from other monitoring systems or remote agents.

For these situations, Nagios supports passive checks.

Active vs. Passive Checks:

  • Active Checks:

    • Initiated by the Nagios server.
    • Scheduled at regular intervals (defined by check_interval and retry_interval).
    • Nagios executes a plugin to determine status.
    • Example: Nagios pings a server every 5 minutes.
  • Passive Checks:

    • Initiated by an external application or script on the monitored host (or another system).
    • The external application performs the check and submits the result (status, output message) to Nagios.
    • Nagios does not schedule these checks; it simply processes the results when they arrive.
    • Example: A backup script on a remote server sends a "Backup OK" or "Backup FAILED" message to Nagios upon completion.

Nagios Service Check Acceptor (NSCA):
NSCA is a common addon used to facilitate passive checks. It consists of two parts:

  1. NSCA Daemon: Runs on the Nagios server. It listens on a specific TCP port (default 5667) for incoming passive check results.
  2. send_nsca Client: A utility run on the remote host (or any system that needs to submit a passive check result). It formats the check result and sends it to the NSCA daemon on the Nagios server.

NSCA Architecture:

  1. An external application/script on a (remote) host determines the status of a service (e.g., backup job completes).
  2. This application/script uses the send_nsca client utility to construct a message containing the target host name, service description (as defined in Nagios), status code, and plugin output.
  3. send_nsca sends this message to the NSCA daemon running on the Nagios server.
  4. The NSCA daemon receives the message, performs basic validation (and decryption if configured), and writes the check result to Nagios's external command file (nagios.cmd).
  5. The Nagios daemon periodically processes the external command file, reads the passive check result, and updates the status of the corresponding service.
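
To make step 4 more concrete, the sketch below builds the kind of line that ends up in the external command file for a passive service result, using Nagios's PROCESS_SERVICE_CHECK_RESULT external command. The function name format_passive_result and its optional fifth argument (an epoch timestamp) are illustrative choices, not part of Nagios or NSCA. The example only prints the line.

```shell
#!/bin/bash
# Build one external-command line for a passive service check result.
# Usage: format_passive_result <host_name> <service_description> <return_code> <plugin_output> [epoch]
format_passive_result() {
    local host="$1" svc="$2" code="$3" output="$4"
    local now="${5:-$(date +%s)}"   # default to the current time
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$now" "$host" "$svc" "$code" "$output"
}

# Example: print the line instead of writing it anywhere.
format_passive_result "some-remote-server" "Nightly Backup Status" 0 "Backup OK"

# A script running on the Nagios server itself could redirect such a line into
# the command file (path as used elsewhere in this guide):
#   format_passive_result ... > /usr/local/nagios/var/rw/nagios.cmd
```

NSCA does this step for you when results arrive over the network; writing the line by hand is mainly useful for local scripts and for testing a passive service definition.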

Security:

  • NSCA communication can be encrypted using various ciphers. Both the daemon and client must be configured with the same encryption method and password/key.
  • The NSCA daemon configuration can restrict which hosts are allowed to send data.
  • Firewall rules on the Nagios server should allow incoming connections on the NSCA port (e.g., TCP 5667) only from trusted IP addresses/networks.

Configuring Nagios for Passive Checks:

  1. Service Definition: Services that receive passive check results need to be defined in Nagios, but configured to accept them.

    define service{
        host_name               some-remote-server
        service_description     Nightly Backup Status
        active_checks_enabled   0           ; Disable active checks for this service
        passive_checks_enabled  1           ; Enable passive checks
        check_period            24x7        ; Still need a check_period for freshness
        max_check_attempts      1           ; Usually 1, as result is submitted directly
        is_volatile             0           ; Or 1 if every result is important regardless of previous
        contact_groups          admins
        # Define a check_command even for a passive-only service. With freshness
        # checking enabled, it runs ONLY when the freshness threshold is exceeded.
        check_freshness         1           ; Enable freshness checking
        freshness_threshold     90000       ; 25 hours (25 * 3600 seconds)
                                            ; If no result arrives within 25 hours, the service is stale.
        check_command           check_dummy!3!No backup results received ; Or a custom 'stale' check
                                            ; No quotes around the message if the command definition already quotes $ARG2$.
        stalking_options        o,w,u,c     ; Optional: also log output changes while the state stays the same
        register                1
    }
    

    • active_checks_enabled 0: Crucial. Disables Nagios from actively checking this service.
    • passive_checks_enabled 1: Crucial. Allows Nagios to accept results for this service.
    • Freshness Checking:
      • check_freshness 1: Enables freshness checking.
      • freshness_threshold <seconds>: If Nagios doesn't receive a passive result for this service within this many seconds, it considers the service "stale."
      • When a service becomes stale, Nagios runs the service's check_command as an active check (for example check_dummy or a custom alert script) to force the service into a WARNING, CRITICAL, or UNKNOWN state and trigger notifications. This alerts you that the passive check mechanism itself might be failing.
      • check_dummy is a simple plugin that returns a predefined state and message. For example: check_dummy 2 "Service is stale" would force it to CRITICAL.
  2. Nagios Main Configuration (nagios.cfg): Ensure passive check result processing is enabled (it usually is by default).

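    # nagios.cfg accepts comments only on their own lines, starting with '#'.
    # NSCA hands results to Nagios through the external command file:
    check_external_commands=1
    accept_passive_service_checks=1
    # Only needed if you plan to use passive host checks:
    accept_passive_host_checks=1
    # Global switch for service freshness checking:
    check_service_freshness=1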
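
The freshness_threshold value is plain arithmetic: the interval at which you expect results, plus some slack, converted to seconds. A small sketch of that calculation follows; the function name freshness_threshold_seconds and the idea of passing a separate grace period are assumptions made for illustration.

```shell
#!/bin/bash
# Convert "expected interval + grace period" (both in hours) to seconds.
# Usage: freshness_threshold_seconds <expected_interval_hours> <grace_hours>
freshness_threshold_seconds() {
    local interval_hours="$1" grace_hours="$2"
    echo $(( (interval_hours + grace_hours) * 3600 ))
}

# A nightly job (every 24 hours) with one hour of slack prints 90000.
freshness_threshold_seconds 24 1
```

Choosing the slack is a judgment call: too little produces false "stale" alerts when a job runs long, too much delays the alert when results stop arriving.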
    

Installing and Configuring NSCA (Nagios Server and Client):

On the Nagios Server:

  1. Download and Install NSCA:

    cd /tmp
    # Check Nagios Exchange or GitHub for NSCA source.
    # Example version, find latest stable from a trusted source.
    # NagiosEnterprises/nsca on GitHub is a common source.
    NSCA_VERSION="2.9.2" # Example, verify latest version
    wget https://github.com/NagiosEnterprises/nsca/releases/download/v${NSCA_VERSION}/nsca-${NSCA_VERSION}.tar.gz
    tar -zxvf nsca-${NSCA_VERSION}.tar.gz
    cd nsca-${NSCA_VERSION}/
    
    # Building does not need root. Run ./configure --help to confirm the option names for your NSCA version.
    ./configure --with-nsca-user=nagios --with-nsca-grp=nagios # Or your nagios user/group
    make all
    # This will build both the nsca daemon and send_nsca client.
    # Install only the daemon on the server:
    sudo cp src/nsca /usr/local/nagios/bin/
    sudo cp sample-config/nsca.cfg /usr/local/nagios/etc/
    sudo chown nagios:nagios /usr/local/nagios/bin/nsca /usr/local/nagios/etc/nsca.cfg
    sudo chmod 750 /usr/local/nagios/bin/nsca
    

  2. Configure NSCA Daemon (nsca.cfg): Edit /usr/local/nagios/etc/nsca.cfg:

    sudo nano /usr/local/nagios/etc/nsca.cfg
    
    Key settings:

    • nsca_user=nagios (if you ran configure with it, otherwise set it here)
    • nsca_group=nagios
    • server_port=5667
    • command_file=/usr/local/nagios/var/rw/nagios.cmd (Nagios external command file path)
    • password=your_secret_nsca_password (Choose a strong password if using simple password encryption)
    • decryption_method=1 (XOR with the password. 0 = none, 1 = XOR, 2 = DES; the sample nsca.cfg lists the full set, and the methods beyond XOR require NSCA to be built with libmcrypt). XOR only obfuscates the traffic, so choose a stronger method for production.
  3. Firewall: Allow TCP port 5667 on the Nagios server from IPs that will send NSCA data.

  4. Run NSCA Daemon: You can run it directly or set it up as a systemd service. Directly (for testing):

    sudo /usr/local/nagios/bin/nsca -c /usr/local/nagios/etc/nsca.cfg --daemon # Run as a standalone daemon in the background
    
    To run it via systemd (recommended): Create /etc/systemd/system/nsca.service:
    [Unit]
    Description=Nagios Service Check Acceptor
    After=network.target
    
    [Service]
    Type=forking
    User=nagios
    Group=nagios
    ExecStart=/usr/local/nagios/bin/nsca -c /usr/local/nagios/etc/nsca.cfg --daemon
    ExecReload=/bin/kill -HUP $MAINPID
    Restart=on-failure
    
    [Install]
    WantedBy=multi-user.target
    
    Then:
    sudo systemctl daemon-reload
    sudo systemctl enable nsca
    sudo systemctl start nsca
    sudo systemctl status nsca
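
Once the daemon is started, it helps to confirm that something is actually accepting connections on the NSCA port. The sketch below uses bash's built-in /dev/tcp redirection together with the coreutils timeout command; the function name port_open and the default three-second wait are illustrative assumptions. It only tests that a TCP connection can be opened and says nothing about passwords or encryption settings.

```shell
#!/bin/bash
# Return 0 if a TCP connection to <host>:<port> can be opened, non-zero otherwise.
# Usage: port_open <host> <port> [timeout_seconds]
port_open() {
    local host="$1" port="$2" wait="${3:-3}"
    # The child bash opens the connection on file descriptor 3 and exits at once.
    timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

if port_open 127.0.0.1 5667 2; then
    echo "NSCA port 5667 is accepting connections"
else
    echo "Nothing is accepting connections on port 5667"
fi
```

Run it on the Nagios server with 127.0.0.1, or from a client with the server's IP address to test the firewall path as well.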
    

On the Client Host (that will send passive results):

  1. Install send_nsca client: If make all was run during NSCA compilation on the Nagios server, src/send_nsca was built. Copy this binary to the client machine (e.g., into /usr/local/bin/ or /usr/sbin/). Also copy the sample-config/send_nsca.cfg to /usr/local/nagios/etc/ (or /etc/nagios/) on the client.

    # On Nagios server where you compiled NSCA:
    # scp /tmp/nsca-${NSCA_VERSION}/src/send_nsca user@remote_client_ip:/tmp/
    # scp /tmp/nsca-${NSCA_VERSION}/sample-config/send_nsca.cfg user@remote_client_ip:/tmp/
    
    # On the remote client:
    sudo cp /tmp/send_nsca /usr/local/bin/
    sudo mkdir -p /usr/local/nagios/etc # Or /etc/nagios/
    sudo cp /tmp/send_nsca.cfg /usr/local/nagios/etc/send_nsca.cfg
    sudo chown root:root /usr/local/bin/send_nsca /usr/local/nagios/etc/send_nsca.cfg # Or appropriate user
    sudo chmod +x /usr/local/bin/send_nsca
    

  2. Configure send_nsca.cfg on Client: Edit /usr/local/nagios/etc/send_nsca.cfg on the client.

    sudo nano /usr/local/nagios/etc/send_nsca.cfg
    
    Set:

    • password=your_secret_nsca_password (Must match nsca.cfg on the server)
    • encryption_method=1 (Must match nsca.cfg on the server)
  3. Using send_nsca: The send_nsca client reads data from standard input or a file. The format is: <host_name>\t<svc_description>\t<return_code>\t<plugin_output>\n (Fields are tab-separated, ending with a newline).

    • <host_name>: The host_name as defined in Nagios.
    • <svc_description>: The service_description as defined in Nagios.
    • <return_code>: 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN.
    • <plugin_output>: Text message from the plugin.

    Example usage in a script:

    #!/bin/bash
    NAGIOS_SERVER_IP="your_nagios_server_ip"
    HOST_NAME="some-remote-server"
    SERVICE_DESC="Nightly Backup Status"
    NSCA_CONFIG="/usr/local/nagios/etc/send_nsca.cfg" # Path to send_nsca.cfg
    
    # Simulate backup
    echo "Running backup..."
    sleep 10 # Simulate backup work
    BACKUP_SUCCESS=true # or false
    
    if $BACKUP_SUCCESS; then
        RETURN_CODE=0
        PLUGIN_OUTPUT="Backup completed successfully at $(date)"
    else
        RETURN_CODE=2
        PLUGIN_OUTPUT="Backup FAILED at $(date) - Check logs for details."
    fi
    
    # Send to Nagios via NSCA
    printf "%s\t%s\t%s\t%s\n" "${HOST_NAME}" "${SERVICE_DESC}" "${RETURN_CODE}" "${PLUGIN_OUTPUT}" | \
    /usr/local/bin/send_nsca -H "${NAGIOS_SERVER_IP}" -p 5667 -d $'\t' -c "${NSCA_CONFIG}"
    
    echo "Result sent to Nagios."
    

    • -H ${NAGIOS_SERVER_IP}: Specifies the Nagios server running NSCA daemon.
    • -p 5667: NSCA port.
    • -d $'\t': Passes a literal tab character as the field delimiter (bash quoting; a plain "\t" would hand send_nsca a backslash and a t). Tab is also send_nsca's default delimiter, so the option can be omitted.
    • -c ${NSCA_CONFIG}: Path to send_nsca.cfg.
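
Because send_nsca depends on the exact field layout, it can be worth building the record in one place. The following sketch does that; nsca_record is a hypothetical helper name, and flattening tabs and newlines in the plugin output to spaces is a conservative choice made here, not something send_nsca requires.

```shell
#!/bin/bash
# Print one tab-separated record in the layout send_nsca reads on standard input.
# Usage: nsca_record <host_name> <service_description> <return_code> <plugin_output>
nsca_record() {
    local host="$1" svc="$2" code="$3" output="$4"
    local tab=$'\t' nl=$'\n'
    case "$code" in
        0|1|2|3) ;;
        *) echo "nsca_record: return code must be 0-3, got '$code'" >&2; return 1 ;;
    esac
    # A stray tab or newline in the output would shift the fields; replace with spaces.
    output=${output//"$tab"/ }
    output=${output//"$nl"/ }
    printf '%s\t%s\t%s\t%s\n' "$host" "$svc" "$code" "$output"
}

# Example: print a record (pipe it to send_nsca once the client is configured).
nsca_record "some-remote-server" "Nightly Backup Status" 0 "Backup completed"
#   nsca_record ... | /usr/local/bin/send_nsca -H <server> -c /usr/local/nagios/etc/send_nsca.cfg
```

Rejecting return codes outside 0 to 3 early gives a clear error in the job's own log instead of an unexpected state in Nagios.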

Passive checks with NSCA offer great flexibility for integrating various event sources and custom monitoring logic into Nagios. Remember that freshness checking is vital to ensure your passive check submission mechanisms are themselves working.

Workshop Implementing Passive Checks with NSCA

Objective:
Configure a passive service check for a simulated nightly cron job on a remote Linux host (VM2). The cron job script will use send_nsca to report its success or failure to the Nagios server (VM1).

Prerequisites:

  • Nagios Server (VM1) and Remote Linux Host (VM2) set up. VM1 IP: 192.168.1.100, VM2 IP: 192.168.1.101 (adjust as needed).
  • NRPE setup is not strictly needed for this NSCA workshop but VM2 should be defined as a host in Nagios.
  • Root/sudo access on both VMs.
  • Build tools (gcc, make) on VM1 for compiling NSCA.

Part 1: Setup NSCA on Nagios Server (VM1)

  1. Install Build Dependencies (if not already present):

    sudo apt update
    sudo apt install -y build-essential libmcrypt-dev # libmcrypt-dev for more encryption options if desired
    

  2. Download and Compile NSCA on VM1:

    cd /tmp
    NSCA_VERSION="2.9.2" # Or check for latest from NagiosEnterprises/nsca on GitHub
    wget https://github.com/NagiosEnterprises/nsca/releases/download/v${NSCA_VERSION}/nsca-${NSCA_VERSION}.tar.gz
    tar -zxvf nsca-${NSCA_VERSION}.tar.gz
    cd nsca-${NSCA_VERSION}/
    
    # Building does not need root. Run ./configure --help to confirm the option names for your NSCA version.
    ./configure --with-nsca-user=nagios --with-nsca-grp=nagios
    make all
    

  3. Install NSCA Daemon and Configuration on VM1:

    sudo cp src/nsca /usr/local/nagios/bin/
    sudo cp sample-config/nsca.cfg /usr/local/nagios/etc/
    sudo chown nagios:nagios /usr/local/nagios/bin/nsca /usr/local/nagios/etc/nsca.cfg
    sudo chmod 750 /usr/local/nagios/bin/nsca
    

  4. Configure nsca.cfg on VM1:

    sudo nano /usr/local/nagios/etc/nsca.cfg
    
    Make the following changes:

    • Set password=MyNscaSecretPassword123 (Choose your own password).
    • Set decryption_method=1 (XOR). For stronger encryption, pick one of the other methods listed in the file; they are available only if libmcrypt-dev was installed before you ran configure.
    • Ensure command_file=/usr/local/nagios/var/rw/nagios.cmd is correct.
    • Save and exit.
  5. Create systemd Service File for NSCA on VM1:

    sudo nano /etc/systemd/system/nsca.service
    
    Paste the following:
    [Unit]
    Description=Nagios Service Check Acceptor
    After=network.target
    
    [Service]
    Type=forking
    User=nagios
    Group=nagios
    ExecStart=/usr/local/nagios/bin/nsca -c /usr/local/nagios/etc/nsca.cfg --daemon
    ExecReload=/bin/kill -HUP $MAINPID
    Restart=on-failure
    
    [Install]
    WantedBy=multi-user.target
    
    Because --daemon makes NSCA fork into the background, the unit uses Type=forking. No PIDFile= is set: according to the comments in the sample nsca.cfg, NSCA writes its pid_file only when it is started by root, and this unit starts it as the nagios user, so systemd is left to detect the main process on its own. If the service fails to start or is reported as dead right after starting, check sudo journalctl -u nsca for the reason.

    Enable and start NSCA service:

    sudo systemctl daemon-reload
    sudo systemctl enable nsca
    sudo systemctl start nsca
    sudo systemctl status nsca # Verify it's active and running. Check logs if issues.
    

  6. Configure Firewall on VM1: If ufw is active:

    sudo ufw allow from 192.168.1.101 to any port 5667 proto tcp comment 'Allow NSCA from VM2'
    sudo ufw reload
    
    (Replace 192.168.1.101 with VM2's IP).

  7. Define Passive Service in Nagios on VM1: Ensure VM2 (vm2-remote-linux) is defined as a host. Then, edit its config file, e.g., /usr/local/nagios/etc/servers/vm2-linux.cfg:

    sudo nano /usr/local/nagios/etc/servers/vm2-linux.cfg
    
    Add this service definition:
    define service{
        host_name               vm2-remote-linux
        service_description     Simulated Cron Job Status
        active_checks_enabled   0       ; Crucial: Disable active checks
        passive_checks_enabled  1       ; Crucial: Enable passive checks
        check_period            24x7
        max_check_attempts      1
        is_volatile             0
        contact_groups          admins
        check_freshness         1       ; Enable freshness checking
        freshness_threshold     86400   ; 24 hours (in seconds). If no result, it's stale.
        check_command           check_dummy!2!CRON job results overdue ; Command run only if stale
                                        ; Ensure 'check_dummy' command is defined in commands.cfg
                                        ; No quotes around the message: the command definition already quotes $ARG2$
                                        ; First argument: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
        stalking_options        o,w,u,c
        notes                   This service is updated passively by a cron job on vm2-remote-linux.
        register                1
    }
    

    • Verify check_dummy command: Ensure /usr/local/nagios/etc/objects/commands.cfg has:
      define command{
          command_name    check_dummy
          command_line    $USER1$/check_dummy $ARG1$ "$ARG2$"
      }
      
      The check_dummy plugin should be in /usr/local/nagios/libexec/.

    Validate and reload Nagios:

    sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    sudo systemctl reload nagios
    

Part 2: Setup send_nsca on Remote Linux Host (VM2)

  1. Copy send_nsca binary and send_nsca.cfg sample configuration to VM2: From VM1 (where you compiled NSCA, likely in /tmp/nsca-${NSCA_VERSION}/ if you followed the steps): First, ensure you know the username and IP address for VM2. Let's assume your_user@192.168.1.101.

    # On VM1, navigate to the NSCA source directory where `make all` was run
    cd /tmp/nsca-${NSCA_VERSION}/ 
    
    # Securely copy the send_nsca binary
    scp src/send_nsca your_user@192.168.1.101:/tmp/
    
    # Securely copy the sample send_nsca.cfg configuration file
    scp sample-config/send_nsca.cfg your_user@192.168.1.101:/tmp/
    
    (Replace your_user with your actual username on VM2 and 192.168.1.101 with VM2's actual IP address). You will be prompted for the password for your_user on VM2.

  2. Install send_nsca and its configuration file on VM2: Log in to VM2 via SSH. The files you copied should be in the /tmp/ directory.

    # On VM2:
    # Move the send_nsca binary to a standard location for executables
    sudo mv /tmp/send_nsca /usr/local/bin/
    
    # Create the directory for Nagios configuration files if it doesn't exist
    sudo mkdir -p /usr/local/nagios/etc/
    
    # Move the send_nsca.cfg configuration file to this directory
    sudo mv /tmp/send_nsca.cfg /usr/local/nagios/etc/
    
    # Set appropriate ownership and permissions
    # send_nsca binary should be owned by root and executable
    sudo chown root:root /usr/local/bin/send_nsca
    sudo chmod 755 /usr/local/bin/send_nsca # rwxr-xr-x
    
    # send_nsca.cfg contains the shared password, so keep it readable by root only.
    # (If the job runs as another user, give that user's group read access instead.)
    sudo chown root:root /usr/local/nagios/etc/send_nsca.cfg
    sudo chmod 600 /usr/local/nagios/etc/send_nsca.cfg # rw-------
    

  3. Configure send_nsca.cfg on VM2: This file tells send_nsca how to encrypt data and what password to use. It must match the settings in nsca.cfg on the Nagios server (VM1).

    sudo nano /usr/local/nagios/etc/send_nsca.cfg
    
    Look for the following lines and modify them:

    • password=MyNscaSecretPassword123
      • Important: This password must be exactly the same as the password you set in /usr/local/nagios/etc/nsca.cfg on VM1.
    • encryption_method=1
      • This specifies the encryption algorithm. 1 usually stands for XOR. This must also match the decryption_method in nsca.cfg on VM1. If you used a different method on VM1, set it here accordingly.

    Save the file and exit (Ctrl+X, then Y, then Enter in nano).

  4. Create the Simulated Cron Job Script on VM2: This script will simulate a task (like a backup) and then use send_nsca to report its status to Nagios. Create a new script file, for example, in /opt/simulate_cron_job.sh:

    sudo nano /opt/simulate_cron_job.sh
    
    Paste the following content into the script:
    #!/bin/bash
    
    # Configuration Variables
    NAGIOS_SERVER_IP="192.168.1.100" # <<<< IP Address of your Nagios Server (VM1)
    HOST_NAME="vm2-remote-linux"    # <<<< This MUST match the 'host_name' defined in Nagios for VM2
    SERVICE_DESC="Simulated Cron Job Status" # <<<< This MUST match the 'service_description' for the passive service in Nagios
    NSCA_CONFIG_FILE="/usr/local/nagios/etc/send_nsca.cfg" # Path to the send_nsca client config file
    
    # Simulate job execution and determine success or failure
    # For this workshop, we'll randomly make it succeed or fail.
    echo "Simulating cron job execution..."
    # sleep 5 # Optional: simulate work
    
    if (( RANDOM % 2 )); then
        # Job succeeded
        RETURN_CODE=0 # Nagios OK state
        PLUGIN_OUTPUT="Simulated cron job completed successfully at $(date)."
        echo "Job SUCCEEDED. Sending OK status to Nagios."
    else
        # Job failed
        RETURN_CODE=2 # Nagios CRITICAL state
        PLUGIN_OUTPUT="Simulated cron job FAILED at $(date). Please check application logs on VM2."
        echo "Job FAILED. Sending CRITICAL status to Nagios."
    fi
    
    # Prepare the data string for send_nsca
    # Format: <host_name>\t<service_description>\t<return_code>\t<plugin_output>\n
    # (Command substitution strips a trailing newline; echo adds it back below.)
    DATA_STRING=$(printf "%s\t%s\t%s\t%s" "${HOST_NAME}" "${SERVICE_DESC}" "${RETURN_CODE}" "${PLUGIN_OUTPUT}")
    
    # Send the data to Nagios server using send_nsca
    # The data string is piped to send_nsca's standard input.
    # $'\t' passes a literal tab as the delimiter (tab is also send_nsca's default).
    echo "${DATA_STRING}" | /usr/local/bin/send_nsca -H "${NAGIOS_SERVER_IP}" -p 5667 -d $'\t' -c "${NSCA_CONFIG_FILE}"
    
    if [ $? -eq 0 ]; then
        echo "NSCA data sent successfully to ${NAGIOS_SERVER_IP}."
    else
        echo "Error sending NSCA data. Check send_nsca execution and connectivity."
    fi
    
    Make the script executable:
    sudo chmod +x /opt/simulate_cron_job.sh
    

  5. Test the Script Manually on VM2: Execute the script a few times to see it send different statuses:

    sudo /opt/simulate_cron_job.sh
    
    Each time, it will print whether it simulated a success or failure and indicate that it attempted to send data.

  6. Check Nagios Web Interface (VM1):

    • Navigate to your Nagios UI on VM1.
    • Go to the "Services" view. Find the service named "Simulated Cron Job Status" associated with the host vm2-remote-linux.
    • Its status should update to either OK (green) or CRITICAL (red) based on what your script sent. This update might take a moment as Nagios processes its external command file.
    • The "Status Information" column will display the PLUGIN_OUTPUT message from your script.
    • The "Last Check" time will reflect when Nagios processed the passive result submitted by NSCA. It will not be the regular active check interval since active checks are disabled for this service.
    • Troubleshooting: If the service status doesn't update or remains "Pending (No data received from host yet)":
      • On VM1 (Nagios Server):
        • Check the NSCA daemon status: sudo systemctl status nsca.
        • Examine Nagios logs: sudo tail -f /usr/local/nagios/var/nagios.log. Look for lines related to processing passive checks or NSCA errors.
        • Examine system logs for NSCA messages: sudo journalctl -u nsca or sudo grep nsca /var/log/syslog.
        • Verify the firewall rule for port 5667 is active and correct: sudo ufw status.
      • On VM2 (Client Host):
        • When you run /opt/simulate_cron_job.sh, does it report any errors from send_nsca itself?
        • Double-check the NAGIOS_SERVER_IP, HOST_NAME, and SERVICE_DESC variables in the script. They must exactly match Nagios configuration (case-sensitive).
        • Verify the password and encryption_method in /usr/local/nagios/etc/send_nsca.cfg match VM1's nsca.cfg.
        • Can VM2 reach VM1 on TCP port 5667? telnet 192.168.1.100 5667 (from VM2, use VM1's IP). If it connects, press Ctrl+] then type quit. If "Connection refused" or timeout, there's a network/firewall issue.
  7. (Optional) Set up as a real cron job on VM2: To have this script run automatically, you can add it to the system's cron table. Open the cron table for editing (usually as root):

    sudo crontab -e
    
    If prompted, choose an editor (e.g., nano). Add a line to schedule the script. For example, to run it every 5 minutes (for testing purposes):
    */5 * * * * /opt/simulate_cron_job.sh >> /var/log/simulated_cron_job.log 2>&1
    
    This runs the script every 5 minutes and appends its standard output and standard error to /var/log/simulated_cron_job.log. For a real nightly job, you'd use a schedule like 0 2 * * * (2 AM every day). Save and exit the crontab. The cron daemon will automatically pick up the new schedule.
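
For a real nightly job, the random success/failure branch in the workshop script would be replaced by the job's own exit status. The sketch below shows one way to do that mapping; job_result is a hypothetical helper, and treating every non-zero exit status as CRITICAL is an assumption you may want to refine (for example, mapping some statuses to WARNING).

```shell
#!/bin/bash
# Run a command and print "<return_code><TAB><plugin_output>" for Nagios.
# Usage: job_result <label> <command> [args...]
job_result() {
    local label="$1"; shift
    local status
    if "$@" >/dev/null 2>&1; then
        printf '0\t%s completed successfully\n' "$label"
    else
        status=$?   # exit status of the command that just failed
        printf '2\t%s FAILED with exit status %s\n' "$label" "$status"
    fi
}

# Demo with commands that always succeed or always fail.
job_result "Nightly backup" true
job_result "Nightly backup" false
```

Prefixing the printed line with the host name and service description, separated by tabs, gives the record that send_nsca expects.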

Outcome:
You have now successfully:

  • Set up the NSCA daemon on your Nagios server (VM1) to listen for passive check results.
  • Defined a passive service in Nagios on VM1, configured with freshness checking to alert if results stop arriving.
  • Installed and configured the send_nsca client utility on the remote Linux host (VM2).
  • Created a script on VM2 that simulates a cron job and uses send_nsca to report its success or failure status to Nagios.
  • Observed these passively submitted check results appearing and updating in the Nagios web interface.

This workshop concretely demonstrates the power and utility of passive checks for integrating results from external systems or asynchronous events into your Nagios monitoring environment. The freshness checking component is vital as it monitors the passive check mechanism itself.

Event Handlers for Automated Remediation

Event handlers are scripts or commands that Nagios can execute when a host or service changes state. This powerful feature allows for automated problem remediation attempts, potentially resolving issues before manual intervention is required, or gathering diagnostic information when a problem occurs.

How Event Handlers Work:

  1. A host or service enters a problem state (e.g., a service becomes CRITICAL) or recovers (e.g., goes from CRITICAL to OK).
  2. If an event handler is defined for that host/service and state change, Nagios executes the specified command.
  3. The event handler script runs, performing actions like restarting a service, clearing a temporary directory, logging extra diagnostics, or even triggering actions on other systems.
  4. The event handler script should ideally be short-lived and not block Nagios for too long.

Key Concepts:

  • State Changes: Event handlers can be triggered on various state changes:
    • When a host/service goes into a SOFT problem state.
    • When a host/service goes into a HARD problem state (most common).
    • When a host/service recovers from a problem state (HARD OK/UP).
  • Event Handler Command: A Nagios command definition that specifies the script/executable to run and any arguments.
  • Macros: Event handler commands can use Nagios macros to pass context about the host/service state to the script (e.g., $HOSTNAME$, $SERVICESTATE$, $SERVICEOUTPUT$, $HOSTSTATETYPE$, $SERVICESTATETYPE$).
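  • Global vs. Specific: Event handling is switched on or off globally in nagios.cfg (enable_event_handlers=1). Individual handlers are then assigned per host/service or in templates. nagios.cfg also accepts global_host_event_handler and global_service_event_handler for a handler that should run on every host or service state change.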
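
The state-change concepts above can be sketched as a small decision helper that a handler script might call before doing anything. This is only an illustration: the function name should_remediate and its optional soft_trigger argument are hypothetical, and the policy (act on HARD CRITICAL, or on a late SOFT retry) is one reasonable choice, not the only one.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide whether a remediation should run.
# Arguments mirror $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$.
# The optional 4th argument (soft_trigger) is an assumption of this sketch.
should_remediate() {
    local state=$1 state_type=$2 attempt=$3 soft_trigger=${4:-3}

    # Reject anything that is not a plain numeric attempt counter.
    case "$attempt" in
        ''|*[!0-9]*) return 1 ;;
    esac

    # Only CRITICAL problems are candidates for remediation here.
    [ "$state" = "CRITICAL" ] || return 1

    case "$state_type" in
        HARD) return 0 ;;                             # confirmed problem
        SOFT) [ "$attempt" -ge "$soft_trigger" ] ;;   # late retry: act early
        *)    return 1 ;;
    esac
}

# Example use inside a handler (arguments as passed by the command definition):
# if should_remediate "$3" "$4" "$5"; then
#     echo "would attempt remediation"
# fi
```

Acting on a late SOFT retry gives the fix a chance to work before the service reaches a HARD state and notifications go out. Acting only on HARD is the more conservative option.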

Defining an Event Handler:

  1. Write the Event Handler Script: This script will perform the desired action. It can be written in Bash, Python, Perl, etc. Example: attempt_restart_service.sh, a simple Bash script that attempts to restart a service.

    #!/bin/bash
    
    # Arguments passed by Nagios (defined in command)
    HOSTNAME=$1
    SERVICEDESC=$2
    SERVICESTATE=$3     # e.g., CRITICAL, WARNING, UNKNOWN, OK
    SERVICESTATETYPE=$4 # e.g., SOFT, HARD
    SERVICEATTEMPT=$5   # current check attempt as a plain number, e.g., 1, 2, 3
    
    LOGFILE="/usr/local/nagios/var/event_handler.log"
    
    echo "$(date): Event Handler triggered for ${HOSTNAME}/${SERVICEDESC}" >> ${LOGFILE}
    echo "State: ${SERVICESTATE}, Type: ${SERVICESTATETYPE}, Attempt: ${SERVICEATTEMPT}" >> ${LOGFILE}
    
    # Only attempt a restart when the service is in a HARD CRITICAL state.
    # $SERVICEATTEMPT$ is the current check attempt as a plain number (e.g., 3);
    # compare it with $MAXSERVICEATTEMPTS$ if you also want to act on a late SOFT retry.
    # Nagios runs the handler when the service first enters the HARD problem state,
    # so this condition does not fire again on every subsequent check.
    
    if [ "${SERVICESTATETYPE}" == "HARD" ] && [ "${SERVICESTATE}" == "CRITICAL" ]; then
        echo "Attempting to restart ${SERVICEDESC} on ${HOSTNAME}..." >> ${LOGFILE}
    
        # How to restart depends on the service and host
        # If the service is on a remote Linux host, you might use SSH:
        # Make sure SSH key-based authentication is set up for the 'nagios' user
        # to the remote host, and that the nagios user has sudo rights for that specific service restart.
        # EXAMPLE:
        # if [ "${SERVICEDESC}" == "HTTP Web Server" ] && [ "${HOSTNAME}" == "my-web-server-01" ]; then
        #    ssh nagios@${HOSTNAME} "sudo systemctl restart apache2" >> ${LOGFILE} 2>&1
        #    echo "Apache restart command sent." >> ${LOGFILE}
        # fi
    
        # For a local service (on the Nagios server itself):
        if [ "${SERVICEDESC}" == "Local Apache Service" ] && [ "${HOSTNAME}" == "localhost" ]; then
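            # Ensure nagios user has sudo rights for this command (edit sudoers with visudo).
            # The path in the rule must match the one sudo resolves; check it with `command -v systemctl`.
            # e.g., in /etc/sudoers: nagios ALL=(root) NOPASSWD: /usr/bin/systemctl restart apache2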
            sudo systemctl restart apache2 >> ${LOGFILE} 2>&1
            echo "Local Apache restart command executed." >> ${LOGFILE}
        fi
        echo "--------------------------------------" >> ${LOGFILE}
    else
        echo "No action taken (State: ${SERVICESTATE}, Type: ${SERVICESTATETYPE})." >> ${LOGFILE}
        echo "--------------------------------------" >> ${LOGFILE}
    fi
    
    exit 0 # Event handlers should typically exit 0
    

    • Place this script in /usr/local/nagios/libexec/.
    • Make it executable: sudo chmod +x /usr/local/nagios/libexec/attempt_restart_service.sh.
    • Ensure the nagios user has permissions to execute it and any commands within it (e.g., sudo rights if restarting system services). This is a major security consideration.
  2. Define the Event Handler Command in Nagios: Add to /usr/local/nagios/etc/objects/commands.cfg:

    define command{
        command_name    service-restarter
        command_line    $USER1$/attempt_restart_service.sh $HOSTNAME$ "$SERVICEDESC$" $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
    }
    

  3. Enable Event Handlers Globally: In /usr/local/nagios/etc/nagios.cfg:

    enable_event_handlers=1
    
    (This is usually the default).

  4. Assign the Event Handler to a Service: In a service definition (e.g., for a local Apache service on localhost):

    define service{
        use                     local-service
        host_name               localhost
        service_description     Local Apache Service ; Match this in your script
        check_command           check_http
        event_handler_enabled   1                   ; Enable event handler for THIS service
        event_handler           service-restarter   ; Name of the command to run
        contact_groups          admins
    }
    

    • event_handler_enabled 1: Enables it for this specific service.
    • event_handler service-restarter: Specifies the command.
  5. Verify and Reload Nagios:

    sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    sudo systemctl reload nagios
    

Considerations and Best Practices:

  • Security: Granting the nagios user sudo rights is a significant security risk. Restrict these rights as much as possible (e.g., only for specific commands needed by event handlers). Use NOPASSWD with caution. SSH key-based auth for remote commands must also be secured.
  • Idempotency: Event handler scripts should ideally be idempotent (running them multiple times has the same effect as running them once). This prevents issues if Nagios triggers them repeatedly.
  • Avoid Loops: Be careful not to create event handler loops (e.g., an event handler causes a state change that triggers another event handler, etc.).
  • Keep Scripts Fast: Event handlers run synchronously in some Nagios versions/configurations, potentially blocking other checks, and Nagios kills a handler that exceeds event_handler_timeout (30 seconds by default). Keep them quick or design them to background longer tasks.
  • Testing: Test event handlers thoroughly in a non-production environment.
  • Logging: Log actions performed by event handlers for auditing and troubleshooting.
  • Use for Diagnostics: Event handlers aren't just for remediation. They can gather diagnostic data (e.g., run top, ps, netstat, save logs) when a problem occurs, attaching this info to the notification or storing it centrally.
  • State Type: Decide whether to trigger on SOFT or HARD states. Triggering on SOFT states can be aggressive. HARD states are usually preferred for remediation actions.
  • Host Event Handlers: Similar concepts apply to host event handlers (e.g., try to reboot a server if it's DOWN, but this is risky).
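
The idempotency and loop-avoidance points above can be approximated with a simple cooldown guard. The sketch below is an assumption-laden example, not a Nagios feature: the function name cooldown_ok and the stamp-file path in the usage comment are hypothetical. It is also not safe against two handlers racing at the same instant, which a real deployment might address with flock.

```shell
#!/usr/bin/env bash
# Hypothetical helper: allow an action at most once per MIN_SECONDS.
# Usage: cooldown_ok STAMP_FILE MIN_SECONDS
# Returns 0 and refreshes the stamp when the action may run, 1 otherwise.
cooldown_ok() {
    local stamp=$1 min_gap=$2 now last
    now=$(date +%s)

    if [ -f "$stamp" ]; then
        last=$(cat "$stamp" 2>/dev/null)
        # Treat an empty or corrupt stamp as "never ran".
        case "$last" in
            ''|*[!0-9]*) last=0 ;;
        esac
        if [ $((now - last)) -lt "$min_gap" ]; then
            return 1    # still inside the cooldown window
        fi
    fi

    # Record the time of this permitted run.
    printf '%s\n' "$now" > "$stamp"
}

# Example use in a handler (stamp path is an assumption):
# if cooldown_ok /usr/local/nagios/var/apache_restart.stamp 600; then
#     sudo systemctl restart apache2
# fi
```

With a guard like this, a flapping service triggers at most one restart per window, which limits the damage if a handler and a check start feeding each other.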

Event handlers can significantly enhance Nagios's capabilities, transforming it from a passive monitoring system into one that can actively attempt to resolve issues. However, they must be implemented with care and strong attention to security.

Workshop Implementing an Event Handler to Log Extra Info

Objective:
Create an event handler that, when a service on localhost (e.g., "Swap Usage") enters a HARD CRITICAL state, logs detailed system information (like free -m, vmstat, df -h) to a specific file for later diagnosis. This is a non-remediating, diagnostic event handler.

Prerequisites:

  • Working Nagios server (VM1).
  • A service on localhost that you can easily force into a CRITICAL state (e.g., "Swap Usage" or the custom "Active SSH Sessions" check). We'll use "Swap Usage".

Steps:

  1. Create the Event Handler Script: On VM1, create /usr/local/nagios/libexec/log_system_diags.sh:

    sudo nano /usr/local/nagios/libexec/log_system_diags.sh
    
    Paste the following content:
    #!/bin/bash
    
    # Nagios Macros passed as arguments
    HOSTNAME=$1
    SERVICEDESC=$2
    SERVICESTATE=$3
    SERVICESTATETYPE=$4
    SERVICEOUTPUT=$5 # The plugin output for the service
    
    LOG_DIR="/usr/local/nagios/var/diag_logs"
    DIAG_FILE="${LOG_DIR}/${HOSTNAME}_${SERVICEDESC// /_}_$(date +%Y%m%d_%H%M%S).diag"
    
    # Create log directory if it doesn't exist
    mkdir -p ${LOG_DIR}
    chown nagios:nagios ${LOG_DIR} # Ensure nagios user can write
    
    # Log basic event info
    echo "Event Handler: log_system_diags.sh triggered at $(date)" > ${DIAG_FILE}
    echo "Host: ${HOSTNAME}" >> ${DIAG_FILE}
    echo "Service: ${SERVICEDESC}" >> ${DIAG_FILE}
    echo "State: ${SERVICESTATE} (${SERVICESTATETYPE})" >> ${DIAG_FILE}
    echo "Plugin Output: ${SERVICEOUTPUT}" >> ${DIAG_FILE}
    echo "-----------------------------------------" >> ${DIAG_FILE}
    
    # Only gather diagnostics on HARD CRITICAL state
    if [ "${SERVICESTATETYPE}" == "HARD" ] && [ "${SERVICESTATE}" == "CRITICAL" ]; then
        echo "Gathering system diagnostics..." >> ${DIAG_FILE}
        echo "" >> ${DIAG_FILE}
    
        echo "=== df -h ===" >> ${DIAG_FILE}
        df -h >> ${DIAG_FILE} 2>&1
        echo "" >> ${DIAG_FILE}
    
        echo "=== free -m ===" >> ${DIAG_FILE}
        free -m >> ${DIAG_FILE} 2>&1
        echo "" >> ${DIAG_FILE}
    
        echo "=== vmstat 1 3 ===" >> ${DIAG_FILE} # 3 samples, 1 second apart
        vmstat 1 3 >> ${DIAG_FILE} 2>&1
        echo "" >> ${DIAG_FILE}
    
        echo "=== top -b -n 1 ===" >> ${DIAG_FILE} # Batch mode, 1 iteration
        top -b -n 1 >> ${DIAG_FILE} 2>&1
        echo "" >> ${DIAG_FILE}
    
        echo "Diagnostics gathering complete." >> ${DIAG_FILE}
        # Also log to main Nagios event handler log for quick check
        echo "$(date): Diagnostics for ${HOSTNAME}/${SERVICEDESC} saved to ${DIAG_FILE}" >> /usr/local/nagios/var/event_handler.log
    else
        echo "No diagnostics gathered. State was ${SERVICESTATE} (${SERVICESTATETYPE})." >> ${DIAG_FILE}
        echo "$(date): Event handler for ${HOSTNAME}/${SERVICEDESC} triggered, no action for state ${SERVICESTATE} (${SERVICESTATETYPE})." >> /usr/local/nagios/var/event_handler.log
    fi
    echo "-----------------------------------------" >> /usr/local/nagios/var/event_handler.log
    
    exit 0
    
    Make the script executable and set ownership:
    sudo chmod +x /usr/local/nagios/libexec/log_system_diags.sh
    sudo chown nagios:nagios /usr/local/nagios/libexec/log_system_diags.sh
    # Create the main event handler log file and set permissions
    sudo touch /usr/local/nagios/var/event_handler.log
    sudo chown nagios:nagios /usr/local/nagios/var/event_handler.log
    

  2. Define the Event Handler Command in Nagios: Edit /usr/local/nagios/etc/objects/commands.cfg:

    sudo nano /usr/local/nagios/etc/objects/commands.cfg
    
    Add:
    define command{
        command_name    log-diagnostics
        command_line    $USER1$/log_system_diags.sh $HOSTNAME$ "$SERVICEDESC$" $SERVICESTATE$ $SERVICESTATETYPE$ "$SERVICEOUTPUT$"
    }
    
    Note: We are quoting $SERVICEDESC$ and $SERVICEOUTPUT$ because they can contain spaces.

  3. Enable Event Handlers Globally (if not already): Check /usr/local/nagios/etc/nagios.cfg for enable_event_handlers=1.

  4. Assign the Event Handler to the "Swap Usage" Service on localhost: Edit /usr/local/nagios/etc/objects/localhost.cfg:

    sudo nano /usr/local/nagios/etc/objects/localhost.cfg
    
    Find the "Swap Usage" service definition and modify it:
    define service{
        use                     local-service
        host_name               localhost
        service_description     Swap Usage
        check_command           check_local_swap!20%!10% ; Example: Warn if <20% of swap is free, Crit if <10% is free
        event_handler_enabled   1                        ; Enable for this service
        event_handler           log-diagnostics          ; Use our new command
        contact_groups          admins
    }
    
    Save and exit.

  5. Verify and Reload Nagios:

    sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    sudo systemctl reload nagios
    

  6. Test the Event Handler: We need to force the "Swap Usage" service into a HARD CRITICAL state.

    • Modify check_command for "Swap Usage" to guarantee CRITICAL: In localhost.cfg, temporarily replace the thresholds in the check_command for "Swap Usage" with values that are certain to be critical. The stock check_swap plugin behind check_local_swap interprets its thresholds as the percentage of swap that must remain free, so check_local_swap!100%!100% should report CRITICAL regardless of actual usage. You can confirm this first by running the plugin by hand: /usr/local/nagios/libexec/check_swap -w 100% -c 100%.
      sudo nano /usr/local/nagios/etc/objects/localhost.cfg
      # Find Swap Usage service, modify check_command to:
      # check_command           check_local_swap!100%!100%
      
      Save, validate (sudo /usr/local/nagios/bin/nagios -v ...), and reload Nagios (sudo systemctl reload nagios).
    • Wait for State Change: Monitor the "Swap Usage" service in the Nagios UI. It will go through:
      1. Pending
      2. SOFT CRITICAL (after first check)
      3. Retry checks...
      4. HARD CRITICAL (after max_check_attempts for the service, e.g., 3 or 4)
    • Check for Diagnostic File: Once the service is in a HARD CRITICAL state, the event handler should have run. Check the main event handler log:
      sudo tail -n 20 /usr/local/nagios/var/event_handler.log
      
      You should see a line indicating diagnostics were saved. Then, check the diagnostic log directory:
      sudo ls -lt /usr/local/nagios/var/diag_logs/
      
      You should see a new file named like localhost_Swap_Usage_YYYYMMDD_HHMMSS.diag. View its content:
      sudo cat /usr/local/nagios/var/diag_logs/localhost_Swap_Usage_*.diag
      
      It should contain the output of df -h, free -m, vmstat, and top.
  7. Revert Changes:

    • Important: Change the check_command for "Swap Usage" in localhost.cfg back to the sensible thresholds it had before the test.
    • Save, validate, and reload Nagios.
    • The service should eventually recover to OK. The event handler might log that it was triggered for an OK state but took no diagnostic action (as per our script's logic).

Outcome:
You have successfully implemented a diagnostic event handler that:

  • Runs on state changes of a specific service and gathers diagnostics only when it enters a HARD CRITICAL state.
  • Executes a custom script to gather system information.
  • Saves this information to a uniquely named file for later analysis.

This workshop demonstrates a safe and useful application of event handlers – gathering data without attempting risky automated fixes. This approach can be invaluable for troubleshooting intermittent or complex issues.
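
Because the handler in this workshop writes a new .diag file on every invocation, the diag_logs directory grows over time. A minimal clean-up sketch follows. The function name prune_diag_logs and the 14-day retention in the usage comment are assumptions, and it relies on the -maxdepth and -delete options found in GNU and BSD find.

```shell
#!/usr/bin/env bash
# Hypothetical helper: delete *.diag files older than roughly DAYS days.
# Usage: prune_diag_logs DIRECTORY DAYS
# Prints the path of every file it removes; other files are left alone.
prune_diag_logs() {
    local dir=$1 days=$2

    # Nothing to do if the directory has not been created yet.
    [ -d "$dir" ] || return 0

    find "$dir" -maxdepth 1 -type f -name '*.diag' -mtime +"$days" -print -delete
}

# Example (e.g., from a daily cron job run as the nagios user):
# prune_diag_logs /usr/local/nagios/var/diag_logs 14
```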

Performance Tuning and Scaling Nagios

As the number of monitored hosts and services grows, Nagios performance can become a concern. Slow check execution, a lagging web interface, and delayed notifications are common symptoms. Effective tuning and scaling strategies are crucial for maintaining a responsive and reliable monitoring system.

Key Areas for Performance Optimization:

  1. Hardware Resources:

    • CPU: Nagios and its checks can be CPU-intensive. More cores are generally better than raw clock speed for parallel check execution.
    • RAM: Sufficient RAM is needed to hold Nagios's state information, run plugins, and for the OS/web server. Monitor memory usage; swap usage is a bad sign.
    • Disk I/O: Nagios writes status data, logs, and performance data frequently. Fast disks (SSDs) significantly improve performance, especially for I/O-bound operations like perfdata processing.
      • Place /usr/local/nagios/var/ (especially spool/ and perfdata/ if using addons) on a fast filesystem or separate fast disk.
  2. Nagios Configuration (nagios.cfg):

    • interval_length: Default is 60 seconds. This is the fundamental time unit for scheduling: check_interval, retry_interval, and notification_interval are all expressed in multiples of it, so changing it rescales every interval in your configuration. Reducing it (e.g., to 10 or 30) allows for more granular scheduling but increases CPU load as Nagios wakes up more often. For very large installs, increasing it slightly (e.g., to 120) might be considered if extreme granularity isn't needed, but this is rare. The default 60 is usually fine.
    • Check Scheduling and Execution:
      • max_concurrent_checks: (Obsolete in Nagios 4.x, which uses a more dynamic check scheduler). In older versions, this limited how many service checks could run simultaneously.
      • service_check_timeout / host_check_timeout: Global timeouts for checks. Ensure they are reasonable. Plugins that hang can tie up Nagios worker processes.
      • max_service_check_spread / max_host_check_spread: Spreads out initial checks when Nagios starts to avoid a "thundering herd."
    • Optimizing Check Execution:
      • Use compiled plugins where possible: Compiled C plugins are generally faster than script-based ones (Perl, Python, Bash).
      • Efficient plugins: Ensure custom plugins are written efficiently. Avoid unnecessary overhead.
      • Reduce plugin timeouts: Set appropriate timeouts within plugin calls (e.g., -t option for many network plugins) so they don't hang indefinitely.
    • Object Configuration:
      • Templates: Use them extensively. Nagios processes templates efficiently.
      • Avoid overly complex dependencies: While useful, deeply nested or circular dependencies can add processing overhead.
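    • use_large_installation_tweaks: (Default is 0/OFF). Setting it to 1 enables several internal shortcuts for larger environments at the cost of some features (for example, summary macros are no longer exported as environment variables). Review the documented caveats before turning it on.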
    • enable_environment_macros: (Default 0/OFF). Enabling this makes more environment variables available to plugins but can add slight overhead. Only enable if strictly needed by a plugin.
  3. Optimize Check Intervals:

    • Not everything needs to be checked every minute or even every 5 minutes.
    • Prioritize: Critical services get frequent checks (e.g., 1-5 mins). Less critical or stable services can have longer intervals (e.g., 10-30 mins, or even hourly for some things).
    • Use different check_interval and retry_interval settings in service templates for different classes of service.
  4. Remote Execution and Passive Checks (NRPE, NSCA):

    • For checks on remote hosts, offload execution to the remote host (NRPE, NSClient++). This distributes CPU load.
    • Use NSCA for asynchronous events to avoid polling.
  5. Web Interface Performance:

    • CGI Optimization: The Nagios CGIs can be slow on large installations.
      • Ensure your web server (Apache) is well-configured (e.g., KeepAlive On, appropriate MaxRequestWorkers).
      • Consider alternatives like Nagios V-Shell or Thruk for a faster web interface, or modern UIs like NagVis if you only need visualization.
    • cgi.cfg settings:
      • escape_html_tags=0 (Default is 1/ON): Turning this off can speed up CGIs but introduces a security risk (XSS) if plugin output is not sanitized. Use with extreme caution.
      • Limit the default number of items displayed in status pages (e.g., result_limit in Nagios Core 4).
  6. Perfdata Processing:

    • If you're graphing performance data (e.g., with PNP4Nagios, Grafana via InfluxDB), the processing of perfdata files can be I/O intensive.
    • Broker Modules: Use broker modules like NPCD (for PNP4Nagios) or NDOUtils (to write to a database) for more efficient, asynchronous perfdata handling.
    • process_performance_data=1 in nagios.cfg is needed.
    • Ensure perfdata processing scripts/daemons are efficient and don't overload the Nagios server. Consider moving perfdata processing/storage to a separate server if Nagios server is struggling.
  7. Distributed Monitoring (Advanced Scaling): For very large environments (thousands of hosts, tens/hundreds of thousands of services), a single Nagios instance may not be sufficient.

    • Mod_Gearman: A popular addon that distributes check execution to multiple "Gearman workers" (which can be on different servers). The Nagios server acts as a scheduler, offloading the actual check execution. This dramatically improves scalability.
    • DNX (Distributed Nagios eXecutor): Another framework for distributing checks.
    • Federated Nagios Servers: Multiple independent Nagios servers monitoring different parts of the infrastructure, with their status aggregated by a central "master" Nagios server (often using NSCA for passive updates or a tool like Thruk to view multiple backends).
  8. Nagios Core 4.x Worker Architecture: Nagios Core 4 introduced a worker process model for check execution, significantly improving performance over Nagios 3.x.

    • Nagios main process handles scheduling, event handling, etc.
    • Separate worker processes are forked to execute checks.
    • nagios.cfg has the check_workers setting to control the number of worker processes. When it is not set, Nagios chooses a value based on the number of CPU cores. Manual adjustment requires careful monitoring.
  9. Monitoring Nagios Itself:

    • Monitor the Nagios process, CPU/memory/disk usage of the Nagios server.
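    • Monitor the check result spool directory (by default /usr/local/nagios/var/spool/checkresults in a source install). If files pile up there, it indicates Nagios is falling behind. The external command file (nagios.cmd) is a named pipe, so watch for writers blocking on it rather than for its size.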
    • Monitor event latency (time between a problem occurring and a notification being sent).
  10. Regular Maintenance:

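    • Archive or rotate Nagios logs (nagios.log, perfdata logs). Note that retention.dat is the state retention file rather than a log; Nagios rewrites it itself, so do not rotate or delete it while Nagios is running.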
    • Periodically review and optimize configurations. Remove unused checks or objects.
    • Keep Nagios Core and plugins updated to benefit from performance improvements and bug fixes.
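
To make the check-interval advice above concrete, it helps to estimate how many checks per second a configuration demands. The sketch below is simple arithmetic, not a Nagios tool: the function name checks_per_second is hypothetical, and it assumes checks are spread evenly and that intervals are given in units of interval_length (60 seconds unless overridden).

```shell
#!/usr/bin/env bash
# Hypothetical helper: average checks per second required by a group of objects.
# Usage: checks_per_second OBJECT_COUNT CHECK_INTERVAL [INTERVAL_LENGTH]
#   CHECK_INTERVAL  is in units of INTERVAL_LENGTH (as in object definitions)
#   INTERVAL_LENGTH is in seconds and defaults to 60
checks_per_second() {
    local objects=$1 check_interval=$2 interval_length=${3:-60}

    # rate = objects / (check_interval * interval_length)
    LC_ALL=C awk -v n="$objects" -v i="$check_interval" -v u="$interval_length" \
        'BEGIN { if (i * u <= 0) exit 1; printf "%.2f\n", n / (i * u) }'
}

# Example: 1500 services checked every 5 minutes
# checks_per_second 1500 5
```

For 1500 services on a 5-minute interval this prints 5.00, and moving the same services to a 1-minute interval raises it to 25.00. Comparing such figures with what the server sustains in practice shows how much headroom a longer interval would buy.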

Tools for Diagnosing Performance:

  • top/htop: Monitor CPU and memory usage.
  • iostat: Monitor disk I/O.
  • vmstat: Monitor system activity, memory, swap, I/O.
  • Nagios logs (nagios.log with debug enabled if necessary, but be careful as debug logging itself adds overhead).
  • nagiostats: A utility that comes with Nagios, provides statistics about check execution latencies, queue lengths, etc.
    /usr/local/nagios/bin/nagiostats -c /usr/local/nagios/etc/nagios.cfg
    
    (Run this periodically to get a snapshot).

Scaling Nagios is an ongoing process of monitoring, analyzing, and tuning. Start with simple optimizations and move to more complex solutions like distributed monitoring only when necessary.
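
As one way to act on the "Monitoring Nagios Itself" point, the check result spool can be watched with a small plugin-style function. This is a sketch under assumptions: the name check_spool_backlog and the thresholds in the usage comment are hypothetical, the spool path is the source-install default, and counting files is only a rough proxy for backlog because Nagios may keep more than one file per pending result.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report how many files are waiting in a spool directory.
# Usage: check_spool_backlog DIRECTORY WARN_COUNT CRIT_COUNT
# Prints one status line and returns a Nagios-style code
# (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).
check_spool_backlog() {
    local dir=$1 warn=$2 crit=$3 count

    if [ ! -d "$dir" ]; then
        echo "UNKNOWN - $dir is not a directory"
        return 3
    fi

    # Count regular files directly inside the directory.
    count=$(find "$dir" -maxdepth 1 -type f | wc -l | tr -d '[:space:]')

    if [ "$count" -ge "$crit" ]; then
        echo "CRITICAL - $count files waiting in $dir"
        return 2
    elif [ "$count" -ge "$warn" ]; then
        echo "WARNING - $count files waiting in $dir"
        return 1
    fi
    echo "OK - $count files waiting in $dir"
    return 0
}

# Example (thresholds are placeholders to tune for your environment):
# check_spool_backlog /usr/local/nagios/var/spool/checkresults 500 2000
```

Wrapped in a script that exits with the function's return code, this could itself be scheduled as a Nagios service check.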

Workshop Analyzing Nagios Performance with nagiostats

Objective:
Use the nagiostats utility to get a snapshot of your Nagios instance's performance metrics and understand what they mean. This is a diagnostic workshop, not a tuning one, but it provides the data needed for tuning.

Prerequisites:

  • A working Nagios Core installation that has been running for some time with several hosts and services being actively checked. The more activity, the more interesting nagiostats output will be.
  • Access to the Nagios server's command line.

Steps:

  1. Locate nagiostats: The nagiostats utility is typically installed in the same directory as the main nagios executable.

    ls -l /usr/local/nagios/bin/nagiostats
    
    If it's not found there, your installation might be different, but this is the standard location for source installs.

  2. Run nagiostats: Execute nagiostats pointing it to your main Nagios configuration file.

    sudo /usr/local/nagios/bin/nagiostats -c /usr/local/nagios/etc/nagios.cfg
    
    You should get output along these lines (an illustrative sketch: the exact labels, layout, and values depend on your Nagios version and environment):
    Nagios Stats 4.x.x
    Copyright (c) 2009-2020 Nagios Core Development Team and Community Contributors
    Copyright (c) 1999-2009 Ethan Galstad
    Last Modified: XXXX-XX-XX
    License: GPL
    
    CURRENT STATUS DATA
    ---------------------------------------------------------------------
    Status File:                  /usr/local/nagios/var/status.dat
    Status File Age:              2s
    Status File Version:          4.x.x
    
    PROGRAM STATUS DATA
    ---------------------------------------------------------------------
    Nagios Process ID:            12345
    Running Time:                 2d 3h 15m 30s
    Nagios User:                  nagios
    Nagios Group:                 nagios
    
    CHECK PROCESSING DATA
    ---------------------------------------------------------------------
    Services Checked:             1500
    Hosts Checked:                300
    Service Check Interval:       300s
    Host Check Interval:          300s
    Service Inter-Check Delay:    1.00s
    Host Inter-Check Delay:       0.50s
    Services Actively Checked:    25
    Hosts Actively Checked:       5
    
    EVENT QUEUE DATA
    ---------------------------------------------------------------------
    Queued Events:                0
    HIGH Latency Events:          0
    TOTAL Latency Events:         10
    AVG Latency Events:           0.05s
    MAX Latency Events:           0.20s
    
    SERVICE CHECK DATA
    ---------------------------------------------------------------------
    Total Services:               50
    Services Ok:                  48
    Services Warning:             1
    Services Unknown:             0
    Services Critical:            1
    Services Pending:             0
    Services Obsessing:           50
    Services Scheduled:           50
    Services Checked:             1500
    Checks Last 1/5/15/60 Min:    10 / 50 / 150 / 600
    Latency Last 1/5/15/60 Min:   0.01s / 0.02s / 0.02s / 0.03s
    Service Max Latency:          0.15s
    Avg Service Check Latency:    0.03s
    Total Service State Change:   5
    Avg Service State Change:     1.0%
    
    HOST CHECK DATA
    ---------------------------------------------------------------------
    Total Hosts:                  10
    Hosts Up:                     9
    Hosts Down:                   1
    Hosts Unreachable:            0
    Hosts Pending:                0
    Hosts Obsessing:              10
    Hosts Scheduled:              10
    Hosts Checked:                300
    Checks Last 1/5/15/60 Min:    2 / 10 / 30 / 120
    Latency Last 1/5/15/60 Min:   0.00s / 0.01s / 0.01s / 0.01s
    Host Max Latency:             0.05s
    Avg Host Check Latency:       0.01s
    Total Host State Change:      2
    Avg Host State Change:        0.5%
    
    EXTERNAL COMMAND DATA
    ---------------------------------------------------------------------
    External Commands Checked:    25
    ... (more stats)
    

  3. Analyze the Output - Key Sections and Metrics:

    • CHECK PROCESSING DATA:

      • Services Actively Checked / Hosts Actively Checked: How many services and hosts are monitored with active checks (as opposed to passive ones). This reflects the size of the active workload, not the number of checks running at that instant.
      • Service Inter-Check Delay / Host Inter-Check Delay: The delay Nagios inserts between scheduling consecutive checks so that they are spread out rather than bunched together.
    • EVENT QUEUE DATA: (More relevant if you have many scheduled events or use an event broker)

      • Queued Events: Number of events (like checks, notifications) waiting to be processed. If this is consistently high, Nagios is falling behind.
      • HIGH Latency Events: Events that took too long to process.
      • AVG Latency Events / MAX Latency Events: Average and maximum time events spent in the queue. High latency means delays in checks and notifications.
    • SERVICE CHECK DATA / HOST CHECK DATA:

      • Checks Last 1/5/15/60 Min: Number of checks performed in these time windows. Gives an idea of check velocity.
      • Latency Last 1/5/15/60 Min: Average check latency in these windows. In Nagios, latency is the delay between the time a check was scheduled to run and the time it actually started, which is distinct from how long the plugin takes to execute. This is a critical metric: rising latency means Nagios cannot start checks on time, typically because slow plugins tie up workers, too many checks are scheduled, or the server is overloaded.
      • Service Max Latency / Host Max Latency: The longest delay any single check experienced. Helps identify outliers.
      • Avg Service Check Latency / Avg Host Check Latency: Overall average latency for checks. Aim to keep these low (e.g., under 1-2 seconds for most environments, much lower for highly optimized ones).
    • Buffer Usage (exact label and location depend on the Nagios version):

      • nagiostats reports external command buffer usage, for example as Used/High/Total Command Buffers.
      • If buffers (e.g., for external commands, check results) are consistently full, it's a sign of overload.
  4. Interpreting the Metrics for Potential Issues:

    • High Check Latencies (e.g., Avg Service Check Latency of more than a few seconds):

      • Investigate slow plugins: Use plugin timeouts, optimize custom scripts.
      • Network issues to remote hosts.
      • Nagios server CPU/Disk I/O bound.
      • Too many checks scheduled too frequently.
    • High Event Queue Latency/Many Queued Events:

      • Nagios core processing is a bottleneck.
      • Consider if event broker modules are slowing things down.
      • Server resources (CPU mainly).
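    • Large Number of Actively Checked Services/Hosts:

      • Every active check consumes scheduler and worker time. If latencies are also high, the server may not be able to run all checks within their scheduled intervals.
      • Consider lengthening intervals for less critical checks or converting suitable ones to passive checks.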
    • nagiostats shows "N/A" for some values: This can happen if Nagios has just restarted or if certain features (like an event broker) are not heavily used or configured.

  5. Run nagiostats Periodically: To understand trends, run nagiostats at different times, especially during peak load, and compare the output. You could even script this to collect data over time.
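
The periodic collection suggested in the last step could be scripted along the following lines. This is a sketch, not a finished collector: the function name append_sample and the CSV path are hypothetical. The usage comment relies on the --mrtg and --data options and the AVGACTSVCLAT, AVGACTSVCEXT, and AVGACTHSTLAT variables listed by nagiostats --help (times in milliseconds, one value per line by default), so check that help output on your version before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append one timestamped CSV row built from
# newline-delimited values read on stdin.
# Usage: <command printing one value per line> | append_sample CSV_FILE
# Row format: epoch_seconds,value1,value2,...
append_sample() {
    local csv=$1 row

    # Join the input lines with commas.
    row=$(paste -s -d, -)

    # Refuse to write an empty sample (e.g., if the producer failed).
    [ -n "$row" ] || return 1

    printf '%s,%s\n' "$(date +%s)" "$row" >> "$csv"
}

# Example (paths assume a default source install; e.g., run from cron every 5 minutes):
# /usr/local/nagios/bin/nagiostats -c /usr/local/nagios/etc/nagios.cfg \
#     --mrtg --data=AVGACTSVCLAT,AVGACTSVCEXT,AVGACTHSTLAT \
#     | append_sample /usr/local/nagios/var/nagiostats_history.csv
```

Collecting latency and execution time side by side makes it easier to tell whether checks start late (scheduling pressure) or simply take long to run (slow plugins or targets).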

Outcome:
By running nagiostats and examining its output, you've gained insight into:

  • The volume of checks your Nagios instance is performing.
  • The latency of these checks (how late they start relative to their schedule), which is a primary indicator of performance.
  • The load on Nagios's internal event queue.

This information is the first step in diagnosing performance problems. If nagiostats reveals high latencies or queue buildups, you would then proceed to investigate the causes using techniques discussed in the "Performance Tuning and Scaling Nagios" theory section (e.g., checking server resources, optimizing plugins, adjusting check intervals, or considering distributed monitoring). This workshop equips you to gather the necessary baseline data.

Security Best Practices for Nagios

Securing your Nagios installation is paramount, as it has deep visibility into your infrastructure and can potentially execute commands. A compromised Nagios server could be a launchpad for wider attacks.

Key Security Areas:

  1. Secure the Nagios Server OS:

    • Minimal Installation: Install only necessary packages.
    • Regular Updates: Keep the OS and all packages patched.
    • Firewall: Use a host-based firewall (e.g., ufw, firewalld) to restrict access to necessary ports only (SSH, HTTP/HTTPS for web UI, NRPE/NSCA if applicable).
    • Strong Passwords & SSH Key Authentication: For server access.
    • Intrusion Detection/Prevention Systems (IDS/IPS): Consider deploying them.
    • Disable Unused Services.
  2. Secure the Web Interface:

    • HTTPS: Always use HTTPS (SSL/TLS) to encrypt web traffic to the Nagios UI. Configure Apache/Nginx with a valid SSL certificate (e.g., from Let's Encrypt).
    • Strong Authentication:
      • Use strong passwords for the htpasswd users accessing the web UI.
      • Change the default nagiosadmin username.
      • Store htpasswd file securely with restricted permissions.
    • Restrict Access:
      • In Apache/Nginx config, limit access to the /nagios URL to specific IP addresses or internal networks if possible.
      • Require ip <your_admin_network>
    • CGI Security (cgi.cfg):
      • use_authentication=1 (Ensure authentication is enabled).
      • Restrict authorized_for_* directives: Grant command execution rights (authorized_for_all_host_commands, authorized_for_all_service_commands, etc.) only to highly trusted administrator accounts. Avoid giving these rights to read-only users.
      • escape_html_tags=1: Keep this enabled (default) to prevent XSS vulnerabilities from plugin output, unless you fully trust and sanitize all plugin outputs.
  3. Secure Nagios Core Configuration and Processes:

    • Run as Unprivileged User: Nagios should run as a dedicated unprivileged user (e.g., nagios). The make install process usually sets this up.
    • File Permissions:
      • /usr/local/nagios/etc/ (config files): Readable by Nagios user, writable only by root/admin. Sensitive info like passwords in resource.cfg should be highly restricted.
      • /usr/local/nagios/libexec/ (plugins): Executable by Nagios user. Writable only by root/admin.
      • /usr/local/nagios/var/rw/nagios.cmd (external command file): Writable by Nagios user and the web server user (if external commands from UI are allowed). Permissions are critical here (make install-commandmode sets the setgid bit on the rw directory so that the command file inherits the command group).
    • Secure External Commands: Be extremely cautious if allowing external commands via the web UI or other means. This is a powerful feature that can be abused.
    • Keep environment macros disabled (enable_environment_macros=0 in nagios.cfg) unless absolutely necessary for a plugin, as they can be a vector for injecting commands if plugins are not written carefully.
  4. Secure Check Agents and Protocols:

    • NRPE:
      • Use SSL/TLS encryption for NRPE communication (compile NRPE with SSL).
      • In nrpe.cfg on clients: allowed_hosts should strictly list only your Nagios server IP(s).
      • dont_blame_nrpe=0 (default): Do not allow command arguments from Nagios server. Define full commands with arguments in client's nrpe.cfg. If you must allow arguments (dont_blame_nrpe=1), be extremely careful about what commands are exposed and validate inputs in your plugins.
      • Firewall NRPE port (5666) on clients to only allow Nagios server(s).
    • NSClient++ (for Windows):
      • Use SSL/TLS for communication (e.g., when using NRPE listener).
      • In nsclient.ini: allowed hosts should list Nagios server IP(s).
      • If allowing arguments from Nagios, be cautious. Define secure aliases.
      • Use strong passwords if using older protocols like check_nt.
      • Firewall NSClient++ port on clients.
    • NSCA:
      • Use encryption (preferably AES if compiled with libmcrypt; DES and 3DES are dated and XOR is weak). Use strong passwords.
      • In nsca.cfg on server: Define allowed_hosts if your NSCA version supports it, or firewall the NSCA port (5667) to only allow trusted submitters.
    • SNMP:
      • Use SNMPv3 (which provides encryption and authentication) instead of SNMPv1/v2c (which use plain-text community strings).
      • If using SNMPv1/v2c, use strong, non-default community strings.
      • Restrict SNMP access on devices to only the Nagios server's IP using ACLs.
  5. Plugin Security:

    • Source Plugins Carefully: Use official Nagios plugins or well-vetted community plugins. Be cautious with plugins from untrusted sources.
    • Audit Custom Plugins: If writing custom plugins, audit them for security vulnerabilities (e.g., command injection, insecure handling of arguments). Sanitize all external input.
    • Principle of Least Privilege: Plugins should run with the minimum privileges necessary.
  6. Backup Nagios Configuration: Regularly back up /usr/local/nagios/etc/ and any custom plugins. Store backups securely. Consider version control (Git) for /usr/local/nagios/etc/.

  7. Monitoring and Auditing:

    • Monitor Nagios server logs and system logs for suspicious activity.
    • Audit Nagios configurations regularly for security misconfigurations.
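    • The main Nagios log (nagios.log) can serve as an audit trail: it records restarts, notifications sent, and external commands executed, provided the corresponding logging options (e.g., log_notifications, log_external_commands) are enabled in nagios.cfg.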
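
As a rough illustration of the log review in item 7, the sketch below counts the external commands recorded in a Nagios log by command name. It assumes log_external_commands=1 is set in nagios.cfg and that entries follow the usual "[epoch] EXTERNAL COMMAND: NAME;args" layout. The function name summarize_external_commands and the sample log lines are made up for the demonstration and are not real log output.

```shell
#!/bin/sh
# Count external commands by name in a Nagios log file.
# Assumes lines of the form: [epoch] EXTERNAL COMMAND: NAME;arg1;arg2...
summarize_external_commands() {
    awk -F'EXTERNAL COMMAND: ' '
        NF > 1 {
            split($2, parts, ";")      # parts[1] is the command name
            count[parts[1]]++
        }
        END {
            for (name in count) printf "%d %s\n", count[name], name
        }
    ' "$1" | sort -k1,1nr -k2,2
}

# Demonstration with fabricated sample data.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
[1700000000] EXTERNAL COMMAND: DISABLE_NOTIFICATIONS;
[1700000060] SERVICE ALERT: web01;HTTP;CRITICAL;HARD;3;Connection refused
[1700000120] EXTERNAL COMMAND: ACKNOWLEDGE_SVC_PROBLEM;web01;HTTP;1;1;0;admin;looking into it
[1700000180] EXTERNAL COMMAND: DISABLE_NOTIFICATIONS;
EOF

summary=$(summarize_external_commands "$sample_log")
echo "$summary"
rm -f "$sample_log"
```

On a real server you would point the function at /usr/local/nagios/var/nagios.log (path assumed from the default source install). An unexpected spike in commands such as DISABLE_NOTIFICATIONS is worth a closer look.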

By implementing these security best practices, you can significantly reduce the risk of your Nagios monitoring system being compromised. Security is an ongoing process, not a one-time setup.
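
Because security is ongoing, some of the settings above can be spot-checked by a script that runs from cron. The following is a minimal sketch, not a complete audit tool. The helper names (cfg_value, expect_setting, world_writable) are hypothetical, and the demonstration runs against throwaway sample files rather than a live installation.

```shell
#!/bin/sh
# Print the value of "key=value" from a Nagios-style config file,
# skipping comment lines. If the key is repeated, the last one is printed.
cfg_value() {
    # $1 = file, $2 = key
    awk -F= -v key="$2" '
        /^[ \t]*#/ { next }
        index($0, "=") == 0 { next }
        {
            k = $1
            gsub(/[ \t]/, "", k)
            if (k == key) {
                v = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", v)
                val = v
            }
        }
        END { if (val != "") print val }
    ' "$1"
}

# Report PASS/FAIL for one expected setting.
expect_setting() {
    # $1 = file, $2 = key, $3 = expected value
    actual=$(cfg_value "$1" "$2")
    if [ "$actual" = "$3" ]; then
        echo "PASS $2=$actual"
    else
        echo "FAIL $2=${actual:-<unset>} (expected $3)"
    fi
}

# List world-writable files under a directory (POSIX find syntax).
world_writable() {
    find "$1" -type f -perm -0002
}

# Demonstration against sample files (contents are illustrative only).
demo=$(mktemp -d)
cat > "$demo/cgi.cfg" <<'EOF'
# sample only
use_authentication=1
escape_html_tags=0
EOF
cat > "$demo/nrpe.cfg" <<'EOF'
allowed_hosts=127.0.0.1,192.0.2.10
dont_blame_nrpe=0
EOF
touch "$demo/resource.cfg"
chmod 666 "$demo/resource.cfg"    # deliberately too permissive for the demo

report=$(
    expect_setting "$demo/cgi.cfg" use_authentication 1
    expect_setting "$demo/cgi.cfg" escape_html_tags 1
    expect_setting "$demo/nrpe.cfg" dont_blame_nrpe 0
)
echo "$report"
loose=$(world_writable "$demo")
echo "World-writable: $loose"
rm -rf "$demo"
```

In the demonstration the escape_html_tags line is reported as FAIL and resource.cfg is listed as world-writable, which are the two conditions deliberately planted in the sample files.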

Workshop Securing Nagios Web UI with HTTPS (Self-Signed Cert)

Objective:
Configure the Apache web server for your Nagios installation to use HTTPS with a self-signed SSL certificate. This encrypts the web traffic between your browser and the Nagios UI.

Note:
Self-signed certificates will cause browser warnings. For production, obtain a certificate from a trusted Certificate Authority (CA) or use Let's Encrypt. This workshop focuses on the mechanism.

Prerequisites:

  • Working Nagios Core installation with Apache web server.
  • openssl command-line tool installed (usually default on Linux).
  • Apache's SSL module (mod_ssl) installed (Step 1 enables it if necessary).

Steps:

  1. Enable Apache SSL Module (if not already enabled):

    sudo a2enmod ssl
    sudo systemctl restart apache2
    
    (On RHEL/CentOS, this might be sudo yum install mod_ssl and then ensure it's loaded).

  2. Create a Directory for SSL Certificates:

    sudo mkdir /etc/apache2/ssl  # For Debian/Ubuntu
    # For RHEL/CentOS, common paths are /etc/pki/tls/certs and /etc/pki/tls/private
    # Adjust paths accordingly in subsequent steps if not on Debian/Ubuntu.
    

  3. Generate a Self-Signed SSL Certificate and Private Key:
    Use openssl to generate a key and a certificate.

    sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
        -keyout /etc/apache2/ssl/nagios.key \
        -out /etc/apache2/ssl/nagios.crt
    

    • req -x509: Output a self-signed X.509 certificate instead of a certificate signing request (CSR).
    • -nodes: Do not encrypt the private key ("no DES"), so Apache can read it without a passphrase at startup. OpenSSL 3.0 deprecates this option in favor of the equivalent -noenc. For higher security, you can encrypt the key, but Apache will then need the passphrase at each start.
    • -days 365: Certificate validity (1 year).
    • -newkey rsa:2048: Generate a new 2048-bit RSA private key.
    • -keyout /etc/apache2/ssl/nagios.key: Path to save the private key.
    • -out /etc/apache2/ssl/nagios.crt: Path to save the certificate.

    You will be prompted for information for the certificate (Country Name, State, Locality, Organization, Common Name, etc.).

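    • Common Name (CN): This is important. Enter the FQDN or IP address of your Nagios server exactly as you type it in the browser (e.g., nagios.yourdomain.com or 192.168.1.100). If the name in the certificate and the name in the URL don't match, browsers will show an additional name-mismatch warning. Note that current browsers match the hostname against the subjectAltName extension rather than the CN.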
  4. Set Permissions for Key and Certificate:
    The private key must be protected.

    sudo chmod 600 /etc/apache2/ssl/nagios.key
    sudo chmod 644 /etc/apache2/ssl/nagios.crt
    

  5. Configure Apache to Use SSL for Nagios:
    You need to modify your Apache configuration for Nagios. This is often in /etc/apache2/sites-enabled/nagios.conf (Debian/Ubuntu) or /etc/httpd/conf.d/nagios.conf (RHEL/CentOS). We will create a new virtual host configuration for HTTPS or modify the existing one. A common approach is to redirect HTTP to HTTPS.

    Edit your existing Nagios Apache config file or the default SSL config file. For Debian/Ubuntu, Apache usually has a default-ssl.conf in sites-available. We can create a dedicated SSL vhost for Nagios. Let's assume you modify /etc/apache2/sites-enabled/nagios.conf. Back up the current file first:

    sudo cp /etc/apache2/sites-enabled/nagios.conf /etc/apache2/sites-enabled/nagios.conf.backup
    sudo nano /etc/apache2/sites-enabled/nagios.conf
    
    Modify it to look something like this (this example sets up an HTTPS virtual host on port 443):

    # Original HTTP VirtualHost for redirect (optional, but good practice)
    <VirtualHost *:80>
        # e.g., nagios.example.com or the server's IP.
        # Apache does not allow a comment on the same line as a directive.
        ServerName your_nagios_server_fqdn_or_ip
        Redirect permanent /nagios https://your_nagios_server_fqdn_or_ip/nagios
        # If you want to redirect everything on port 80 to HTTPS:
        # Redirect permanent / https://your_nagios_server_fqdn_or_ip/
    </VirtualHost>
    
    # HTTPS VirtualHost for Nagios
    <VirtualHost *:443>
        # Must match the name in the certificate or be covered by it
        ServerName your_nagios_server_fqdn_or_ip
    
        SSLEngine on
        SSLCertificateFile      /etc/apache2/ssl/nagios.crt
        SSLCertificateKeyFile   /etc/apache2/ssl/nagios.key
    
        # Nagios Specific Configuration (should be similar to your existing HTTP config)
        ScriptAlias /nagios/cgi-bin "/usr/local/nagios/sbin"
        <Directory "/usr/local/nagios/sbin">
            SSLRequireSSL
            Options ExecCGI
            AllowOverride None
            <IfVersion >= 2.3>
                <RequireAll>
                    Require all granted
                    # For Apache 2.4 Basic Authentication
                    AuthType Basic
                    AuthName "Nagios Access"
                    AuthUserFile /usr/local/nagios/etc/htpasswd.users
                    Require valid-user
                </RequireAll>
            </IfVersion>
            <IfVersion < 2.3>
                Order allow,deny
                Allow from all
                # For Apache 2.2 Basic Authentication
                AuthType Basic
                AuthName "Nagios Access"
                AuthUserFile /usr/local/nagios/etc/htpasswd.users
                Require valid-user
            </IfVersion>
        </Directory>
    
        Alias /nagios "/usr/local/nagios/share"
        <Directory "/usr/local/nagios/share">
            SSLRequireSSL
            Options None
            AllowOverride None
            <IfVersion >= 2.3>
                <RequireAll>
                    Require all granted
                    # For Apache 2.4 Basic Authentication
                    AuthType Basic
                    AuthName "Nagios Access"
                    AuthUserFile /usr/local/nagios/etc/htpasswd.users
                    Require valid-user
                </RequireAll>
            </IfVersion>
            <IfVersion < 2.3>
                Order allow,deny
                Allow from all
                # For Apache 2.2 Basic Authentication
                AuthType Basic
                AuthName "Nagios Access"
                AuthUserFile /usr/local/nagios/etc/htpasswd.users
                Require valid-user
            </IfVersion>
        </Directory>
    </VirtualHost>
    

    • Replace your_nagios_server_fqdn_or_ip with the actual FQDN or IP address.
    • SSLRequireSSL inside <Directory> blocks ensures these directories are only accessed over SSL.
    • This example keeps the Nagios alias and CGI script configurations, wrapping them in an SSL-enabled VirtualHost.
    • If you have Listen 80 and Listen 443 in ports.conf (Debian/Ubuntu) or httpd.conf (RHEL/CentOS), these VirtualHosts should work.
  6. Test Apache Configuration and Restart Apache:

    sudo apache2ctl configtest
    # Expected output: Syntax OK
    sudo systemctl restart apache2
    

  7. Test Accessing Nagios UI via HTTPS:
    Open your web browser and navigate to https://your_nagios_server_fqdn_or_ip/nagios/.

    • Browser Warning: You will likely see a security warning because the certificate is self-signed (not trusted by a public CA). This is expected for this workshop.
      • You'll need to accept the risk and proceed (e.g., "Advanced" -> "Proceed to ... (unsafe)").
    • Once you proceed, you should see the Nagios login prompt.
    • Log in. The connection should now be encrypted (look for https:// and a padlock icon, though it might have a warning overlay due to the self-signed cert).
    • Test if HTTP access to /nagios redirects to HTTPS, if you configured the redirect.
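
For repeatable setups, the interactive prompts in Step 3 can be replaced with command-line options. The sketch below assumes OpenSSL 1.1.1 or newer (needed for -addext) and uses nagios.example.com as a placeholder hostname. It writes to a temporary directory rather than /etc/apache2/ssl so it can be tried safely. It also shows one way to confirm that a key and a certificate belong together before pointing Apache at them.

```shell
#!/bin/sh
# Generate a throwaway self-signed certificate without prompts and inspect it.
# The hostname is a placeholder; -addext requires OpenSSL 1.1.1 or newer.
host="nagios.example.com"
workdir=$(mktemp -d)

openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -subj "/CN=$host" \
    -addext "subjectAltName=DNS:$host" \
    -keyout "$workdir/nagios.key" \
    -out "$workdir/nagios.crt" 2>/dev/null

chmod 600 "$workdir/nagios.key"

# Show the subject and the validity window of the new certificate.
cert_info=$(openssl x509 -in "$workdir/nagios.crt" -noout -subject -dates)
echo "$cert_info"

# Confirm key and certificate belong together by comparing public keys.
cert_pub=$(openssl x509 -in "$workdir/nagios.crt" -noout -pubkey)
key_pub=$(openssl pkey -in "$workdir/nagios.key" -pubout)
if [ "$cert_pub" = "$key_pub" ]; then
    match="yes"
else
    match="no"
fi
echo "Key matches certificate: $match"

rm -rf "$workdir"
```

The subjectAltName entry is included because browsers match the hostname against that extension. For an IP-based URL, the entry would use the IP: prefix instead of DNS:.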

Outcome:
You have successfully:

  • Generated a self-signed SSL certificate and private key.
  • Configured Apache to serve the Nagios web interface over HTTPS on port 443.
  • (Optionally) Configured a redirect from HTTP to HTTPS for the Nagios URL.

While this uses a self-signed certificate (not for production public sites), it demonstrates the core steps for enabling SSL/TLS, which is crucial for securing sensitive Nagios web traffic. For production, replace the self-signed certificate with one from a trusted CA (e.g., Let's Encrypt is free and widely used).
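
Whichever certificate you use, it will eventually expire (after 365 days in this workshop), so it is worth checking its remaining lifetime. The sketch below wraps openssl x509 -checkend in a function that follows the Nagios plugin convention of exit code 0 for OK and 2 for CRITICAL. The function name check_cert_expiry and the demo certificate are illustrative assumptions, not part of the standard plugin set.

```shell
#!/bin/sh
# check_cert_expiry FILE DAYS
# Prints a Nagios-style status line. Returns 0 (OK) if the certificate is
# still valid DAYS from now, 2 (CRITICAL) if it expires sooner or cannot
# be read.
check_cert_expiry() {
    cert=$1
    days=$2
    seconds=$((days * 86400))
    if openssl x509 -in "$cert" -noout -checkend "$seconds" >/dev/null 2>&1; then
        echo "OK - $cert is valid for more than $days days"
        return 0
    else
        echo "CRITICAL - $cert expires within $days days or is unreadable"
        return 2
    fi
}

# Demonstration: a throwaway certificate that is valid for 30 days.
tmp=$(mktemp -d)
openssl req -x509 -nodes -days 30 -newkey rsa:2048 -subj "/CN=demo.invalid" \
    -keyout "$tmp/demo.key" -out "$tmp/demo.crt" 2>/dev/null

rc_short=0
check_cert_expiry "$tmp/demo.crt" 7 || rc_short=$?
rc_long=0
check_cert_expiry "$tmp/demo.crt" 60 || rc_long=$?
rm -rf "$tmp"
```

With a 30-day certificate, the 7-day check returns OK and the 60-day check returns CRITICAL. Saved as a script in the plugins directory, the same logic could be wired into a Nagios command definition.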

Extending Nagios with Addons (Brief Overview)

While Nagios Core provides a powerful monitoring engine, its functionality can be significantly extended and enhanced using various addons. These addons can provide features like advanced graphing, alternative web interfaces, distributed monitoring capabilities, and more. Here's a brief overview of some popular ones:

  1. Graphing and Visualization:

    • PNP4Nagios: One of the most popular addons for graphing performance data collected by Nagios. It uses RRDtool (Round Robin Database Tool) to store and render graphs. It integrates well with Nagios and can display graphs directly within the Nagios UI (with some CGI modifications) or via its own web interface.
      • How it works: Nagios writes perfdata to files. NPCD (Nagios Perfdata C Daemon), a bulk mode processor for PNP4Nagios, picks up these files and feeds data into RRDtool databases.
    • NagVis: A powerful visualization addon that allows you to create custom maps and diagrams (e.g., network topology, datacenter layout, application flow) with Nagios status information overlaid. Status icons change color based on host/service states.
    • Grafana with InfluxDB/Prometheus: A very popular modern approach. Nagios can send perfdata to a time-series database like InfluxDB (using a perfdata script or Telegraf) or be scraped by Prometheus (using an exporter). Grafana then connects to these databases to create rich, interactive dashboards. This offers more flexibility and power than older RRDtool-based solutions but requires setting up a separate TIG (Telegraf, InfluxDB, Grafana) or Prometheus stack.
  2. Alternative Web Interfaces:

    • Thruk: A modern, feature-rich web interface for Nagios (and other monitoring backends like Icinga, Naemon). It offers faster performance than the classic CGIs, a more customizable UI, advanced filtering, reporting, and multi-backend support (can connect to several Nagios instances).
    • Nagios V-Shell: A PHP-based frontend for Nagios that aims to be faster and more user-friendly than the standard CGIs.
  3. Distributed Monitoring and Scaling:

    • Mod_Gearman: As mentioned in the discussion of scaling, this addon distributes Nagios check execution across multiple worker nodes using the Gearman job queue system. This significantly enhances the capacity of a Nagios setup.
      • Nagios Core acts as the scheduler and submits check jobs to Gearman.
      • Gearman workers (can be on separate servers) pick up jobs, execute plugins, and return results.
    • DNX (Distributed Nagios eXecutor): An alternative framework for distributing Nagios checks.
  4. Configuration Management:

    • While not strictly Nagios addons, tools like Ansible, Puppet, Chef, or SaltStack are often used to manage Nagios configuration files, especially in larger environments. They allow for templated, automated, and version-controlled deployment of host, service, and other object definitions.
    • NConf, NagiosQL: Web-based configuration tools that allow you to manage Nagios object definitions through a GUI, storing them in a database and then generating the Nagios flat config files. These can simplify configuration for users less comfortable with direct file editing but add another layer of complexity.
  5. Database Integration:

    • NDOUtils (Nagios Data Output Utilities): A broker module that exports Nagios status and historical data to a MySQL database. This data can then be used by other addons (such as NagVis through its NDO backend, or some reporting tools) or for custom querying and reporting. It's a core component for many advanced Nagios setups.

Choosing Addons:

  • Identify your needs: What specific functionality are you missing? (e.g., better graphing, faster UI, scalability).
  • Complexity: Some addons are simple to install, while others (like Mod_Gearman or a full Grafana stack) are more involved.
  • Community and Support: Look for well-maintained addons with active communities.
  • Compatibility: Ensure the addon is compatible with your Nagios Core version.

Installing and configuring these addons typically involves downloading them, following their specific installation instructions (which might include compiling code, installing dependencies, configuring Nagios broker modules, and setting up web server configurations), and then integrating them with your Nagios Core setup. Each addon has its own learning curve.

Starting with PNP4Nagios for graphing is often a good first step for extending Nagios, as visual data trends are very valuable.
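
To see what a graphing addon actually consumes, it helps to look at the performance data itself. Plugins append it to their output after a "|" character, as space-separated items of the form label=value[UOM];warn;crit;min;max. The sketch below splits such a line into its items. The function name parse_perfdata and the sample output line are illustrative, and the parser is deliberately simple: it does not handle quoted labels that contain spaces.

```shell
#!/bin/sh
# Split plugin output into its perfdata items, one "label value warn crit"
# per line. Missing thresholds are printed as "-".
parse_perfdata() {
    printf '%s\n' "$1" | awk -F'|' '
        NF < 2 { next }                        # no perfdata section
        {
            n = split($2, items, " ")
            for (i = 1; i <= n; i++) {
                split(items[i], kv, "=")
                m = split(kv[2], fields, ";")
                value = fields[1]
                sub(/[^0-9.+-].*$/, "", value)  # drop a unit suffix such as "s" or "%"
                printf "%s %s %s %s\n", kv[1], value,
                    (m >= 2 && fields[2] != "" ? fields[2] : "-"),
                    (m >= 3 && fields[3] != "" ? fields[3] : "-")
            }
        }
    '
}

# Sample line in the style of a load check (values are made up).
sample='OK - load average: 0.10, 0.25, 0.30|load1=0.100;5.000;10.000;0; load5=0.250;4.000;6.000;0; load15=0.300;3.000;4.000;0;'
parsed=$(parse_perfdata "$sample")
echo "$parsed"
```

Each output line holds one metric with its warning and critical thresholds, which is essentially the information a tool like PNP4Nagios stores per data source.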

Conclusion for Advanced Nagios

This section has taken you through several advanced aspects of Nagios management and optimization. You've learned about the flexibility of passive checks with NSCA for monitoring asynchronous events and firewalled services. We explored the power of event handlers for automated diagnostics and potential remediation, emphasizing the need for caution and security. Performance tuning strategies, from hardware considerations to configuration tweaks and distributed monitoring concepts, were discussed to help you scale your Nagios instance effectively. Critical security best practices were highlighted to protect your monitoring infrastructure. Finally, a brief overview of popular addons showed how Nagios Core's capabilities can be significantly extended.

By mastering these advanced techniques, you can transform Nagios from a basic monitoring tool into a highly sophisticated, efficient, and integral part of your IT operations, capable of handling complex environments and proactively contributing to system stability and reliability. Continuous learning and adaptation are key, as the landscape of IT infrastructure and monitoring tools continues to evolve.