Nagios
From Debian Clusters
From the official Nagios website, Nagios is
- a host and service monitor designed to inform you of network problems before your clients, end-users or managers do.
Nagios can monitor many different aspects of a cluster through a web interface. It runs on one node and can check various services on that node and on other nodes: pinging machines to see whether they are up, attempting to access web services, and checking that SSH is up and running. It can also be configured to notify different users in the event of errors. Depending on user level, it can be configured very simply or with a lot of complexity.
The section below, for anyone new to Nagios, explains the basic definitions necessary for setup. There are also tutorials on
- Nagios Installation and Configuration
- Creating Your Own Nagios Plugin
- Nagios NRPE Addon Installation and Configuration - executing remote commands
Understanding Nagios Basics
Although Nagios can be very simple (or very complicated, depending on the setup), it does require some very specific declarations (which it calls definitions). To get the most out of Nagios, you'll want to know about host definitions, hostgroup definitions, service definitions, and servicegroup definitions.
Host Definitions
Each host that Nagios will monitor needs a host definition. These look something like the following:
define host {
    host_name gyrfalcon.raptor.loc
    address 192.168.1.200
    use generic-host
}
- host_name is the fully qualified name of the host
- address is the IP address of the host
- use generic-host means this host will use the generic-host template, but this can be further customized.
If you have fifty hosts to monitor, you'll need fifty host definitions. The Nagios Installation and Configuration tutorial will walk you through scripting this so you don't have to write them by hand.
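As a rough illustration of what scripting host definitions could look like, here is a minimal Python sketch. It is not the script from the tutorial; the function name host_definition, the hosts dictionary, and the second host's address are hypothetical and only there to show the idea.

```python
def host_definition(host_name, address, template="generic-host"):
    """Render one Nagios host definition as text."""
    return (
        "define host {\n"
        f"    host_name {host_name}\n"
        f"    address {address}\n"
        f"    use {template}\n"
        "}\n"
    )


# Hypothetical inventory: the second entry's address is made up for illustration.
hosts = {
    "gyrfalcon.raptor.loc": "192.168.1.200",
    "kestrel.raptor.loc": "192.168.1.201",
}

# One definition per host, separated by blank lines.
config_text = "\n".join(host_definition(name, addr) for name, addr in hosts.items())
print(config_text)
```

The output can be redirected into whichever configuration file your Nagios setup reads host definitions from.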
Hostgroup Definitions
For ease of reference in the service definitions, hosts are grouped together with hostgroup definitions. These look something like this:
define hostgroup {
    hostgroup_name nodes
    alias Worker Nodes
    members kestrel.raptor.loc, owl.raptor.loc, goshawk.raptor.loc, ...snipped..., peregrine.raptor.loc
}
- hostgroup_name is how you'll refer to these hosts
- alias is what will show up in the Nagios web interface
- members is a comma-separated list of the fully qualified domain names of each of the hosts in this group
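Since the members line is just a comma-separated list, a hostgroup definition can be built from a list of host names. The Python sketch below assumes you already have that list; the function name hostgroup_definition is hypothetical.

```python
def hostgroup_definition(name, alias, members):
    """Render a Nagios hostgroup definition from a list of host names."""
    member_list = ", ".join(members)
    return (
        "define hostgroup {\n"
        f"    hostgroup_name {name}\n"
        f"    alias {alias}\n"
        f"    members {member_list}\n"
        "}\n"
    )


# A subset of the hosts named in the example above.
worker_nodes = ["kestrel.raptor.loc", "owl.raptor.loc", "goshawk.raptor.loc"]

hostgroup_text = hostgroup_definition("nodes", "Worker Nodes", worker_nodes)
print(hostgroup_text)
```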
Service Definitions
Each monitoring command to be run is defined as a service. Services are defined as shown below.
define service {
    hostgroup_name ping-servers
    service_description PING
    check_command check_ping!100.0,20%!500.0,60%
    use generic-service
    notification_interval 0 ; set > 0 if you want to be renotified
}
- hostgroup_name is a comma-separated list of all the hostgroups that have this service monitored
- service_description will show up on the Nagios web interface
- check_command is the name of the command to be executed, followed by an exclamation point (!) separated list of arguments
- use generic-service specifies that this service should use the generic-service template
- notification_interval is how often the admin should be renotified if the service stays down
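To make the check_command format concrete, this small Python sketch splits the value from the example above into the command name and its arguments. The helper name split_check_command is hypothetical; what each argument means depends on how the command itself is defined in your configuration.

```python
def split_check_command(check_command):
    """Split a check_command value into (command name, list of arguments)."""
    name, *args = check_command.split("!")
    return name, args


command, args = split_check_command("check_ping!100.0,20%!500.0,60%")
print(command)  # check_ping
print(args)     # ['100.0,20%', '500.0,60%']
```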
Servicegroup Definitions
The last definition I'll cover is a servicegroup definition. These aren't strictly necessary, but they can help you group hosts and services together to better view them in the Nagios web interface. (These are accessed under the "Servicegroup" links on the left bar.) They basically group hosts together based on services. You define which hosts and which services are grouped together, and this is strictly for display and notification purposes. A servicegroup definition could look like this:
define servicegroup {
    servicegroup_name pbsmoms
    alias All PBS Nodes
    members gyrfalcon.raptor.loc, MPI, goshawk.raptor.loc, MPI, ...snipped..., peregrine.raptor.loc, MPI
}
- servicegroup_name is how this servicegroup will be referred to
- alias will show up in the Nagios web interface
- members is a comma-separated list of this form: host 1 full name, host 1 service, host 2 full name, host 2 service, etcetera. All of mine in the above example are running the MPI service, but that doesn't have to be the case.
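Because the members list alternates host and service, it is easy to get out of step when writing it by hand. A minimal Python sketch, assuming you keep the pairs as (host, service) tuples, can build the line for you; the function name servicegroup_members is hypothetical.

```python
def servicegroup_members(pairs):
    """Flatten (host, service) pairs into a servicegroup members value."""
    return ", ".join(f"{host}, {service}" for host, service in pairs)


# Two of the hosts from the example above, both running the MPI service.
pairs = [
    ("gyrfalcon.raptor.loc", "MPI"),
    ("goshawk.raptor.loc", "MPI"),
]

members = servicegroup_members(pairs)
print(members)  # gyrfalcon.raptor.loc, MPI, goshawk.raptor.loc, MPI
```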

