Debian Clusters for Education and Research: The Missing Manual

Nagios


From the official Nagios website, Nagios is

a host and service monitor designed to inform you of network problems before your clients, end-users or managers do.

Nagios can be used to monitor many different aspects of a cluster through a web interface. It runs on one node and can check various services on that node and on other nodes: pinging machines to see if they are up, attempting to access web services, and checking that SSH is up and running. It can also be configured to notify different users in the event of errors. Depending on your needs, the configuration can be very simple or quite complex.

Below is a section for anyone new to Nagios that explains the basic definitions necessary for setup.

Understanding Nagios Basics

Although Nagios can be very simple (or very complicated, depending on the setup), it does require some very specific declarations (which it calls definitions). In order to get the most out of Nagios, you'll want to know about host definitions, hostgroup definitions, service definitions, and servicegroup definitions.

Host Definitions

Each host that Nagios will monitor needs a host definition. These look something like the following:

define host {
        host_name   gyrfalcon.raptor.loc
        address     192.168.1.200
        use         generic-host
}
  • host_name is the fully qualified domain name of the host
  • address is the IP address of the host
  • use generic-host means this host will use the generic-host template, but this can be further customized.

If you have fifty hosts to monitor, you'll need fifty host definitions. The Nagios Installation and Configuration tutorial will walk you through scripting this so you don't have to write them by hand.
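To give a feel for what such scripting can look like, here is a minimal Python sketch that prints one host definition per machine. It is not the script from the tutorial; the function names and the second host's address are made up for illustration, and the output simply mirrors the layout of the example above.

```python
def host_definition(host_name, address, template="generic-host"):
    """Return the text of one Nagios host definition."""
    return (
        "define host {\n"
        f"        host_name   {host_name}\n"
        f"        address     {address}\n"
        f"        use         {template}\n"
        "}\n"
    )


def host_definitions(hosts):
    """Build one definition per (host_name, address) pair, separated by blank lines."""
    return "\n".join(host_definition(name, addr) for name, addr in hosts)


# Hypothetical inventory: only gyrfalcon's address comes from the example above.
hosts = [
    ("gyrfalcon.raptor.loc", "192.168.1.200"),
    ("kestrel.raptor.loc", "192.168.1.201"),  # assumed address, for illustration only
]

if __name__ == "__main__":
    print(host_definitions(hosts))
```

Redirecting the output into a file that your Nagios configuration includes would save typing the fifty definitions by hand; where that file lives depends on your installation.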

Hostgroup Definitions

For ease of reference in the service definitions, hosts are grouped together with hostgroup definitions. These look something like this:

define hostgroup {
        hostgroup_name  nodes
        alias           Worker Nodes
        members         kestrel.raptor.loc, owl.raptor.loc, goshawk.raptor.loc, ...snipped..., peregrine.raptor.loc
}
  • hostgroup_name is how you'll refer to these hosts
  • alias is what will show up in the Nagios web interface
  • members is a comma-separated list of the fully qualified domain names of each of the hosts in this group
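Since the members line is just the host names joined with commas, it can be generated from the same list of hosts used for the host definitions. The following is a minimal sketch under that assumption; the function name hostgroup_definition is hypothetical and not part of Nagios.

```python
def hostgroup_definition(name, alias, members):
    """Return the text of one Nagios hostgroup definition.

    members is a list of fully qualified host names.
    """
    member_list = ", ".join(members)
    return (
        "define hostgroup {\n"
        f"        hostgroup_name  {name}\n"
        f"        alias           {alias}\n"
        f"        members         {member_list}\n"
        "}\n"
    )


# Host names taken from the example above (the snipped ones are left out).
workers = ["kestrel.raptor.loc", "owl.raptor.loc", "goshawk.raptor.loc"]

if __name__ == "__main__":
    print(hostgroup_definition("nodes", "Worker Nodes", workers))
```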

Service Definitions

Each monitoring command to be run is defined as a service. Services are defined as shown below.

define service {
        hostgroup_name                  ping-servers
        service_description             PING
        check_command                   check_ping!100.0,20%!500.0,60%
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}
  • hostgroup_name is a comma-separated list of all the hostgroups that have this service monitored
  • service_description will show up on the Nagios web interface
  • check_command is the name of the command to be executed, followed by an exclamation point (!) separated list of arguments
  • use generic-service specifies that this service should use the generic-service template
  • notification_interval is how often (in minutes by default) the admin should be renotified if the service stays down; 0 means no renotification
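The check_command format is easier to see when it is pulled apart. The sketch below splits the value on exclamation points the way the bullet above describes; the helper name split_check_command is made up for illustration. Nagios passes the arguments to the command definition as the macros $ARG1$, $ARG2$, and so on. What they mean depends on that command definition; for check_ping they are typically the warning and critical thresholds (round-trip time in milliseconds, then packet loss), but check the command definition on your own system.

```python
def split_check_command(check_command):
    """Split a check_command value into (command name, list of arguments)."""
    name, *args = check_command.split("!")
    return name, args


name, args = split_check_command("check_ping!100.0,20%!500.0,60%")

if __name__ == "__main__":
    print("command:", name)
    # Number the arguments from 1 to match the $ARG1$, $ARG2$ macro names.
    for position, value in enumerate(args, start=1):
        print(f"$ARG{position}$ = {value}")
```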

Servicegroup Definitions

The last definition I'll cover is a servicegroup definition. These aren't strictly necessary, but they can help you group hosts and services together to better view them in the Nagios web interface. (These are accessed under the "Servicegroup" links on the left bar.) They basically group hosts together based on services. You define which hosts and which services are grouped together, and this is strictly for display and notification purposes. A servicegroup definition could look like this:

define servicegroup {
        servicegroup_name       pbsmoms
        alias                   All PBS Nodes
        members                 gyrfalcon.raptor.loc, MPI, goshawk.raptor.loc, MPI, ...snipped..., peregrine.raptor.loc, MPI
}
  • servicegroup_name is how this servicegroup will be referred to
  • alias will show up in the Nagios web interface
  • members is a comma-separated list of this form: host 1 full name, host 1 service, host 2 full name, host 2 service, and so on. All of the hosts in the above example are running the MPI service, but that doesn't have to be the case.
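Because the members line alternates host and service, it is easy to get out of step when writing it by hand. Here is a minimal sketch that builds the line from (host, service) pairs and reads it back into pairs; both function names are hypothetical helpers, not part of Nagios.

```python
def servicegroup_members(pairs):
    """Flatten (host, service) pairs into a servicegroup members string."""
    return ", ".join(f"{host}, {service}" for host, service in pairs)


def parse_members(members):
    """Turn a members string back into a list of (host, service) pairs."""
    items = [item.strip() for item in members.split(",")]
    if len(items) % 2 != 0:
        raise ValueError("members must alternate host and service names")
    # Even positions are hosts, odd positions are their services.
    return list(zip(items[0::2], items[1::2]))


# Hosts from the example above (the snipped ones are left out).
pairs = [("gyrfalcon.raptor.loc", "MPI"), ("goshawk.raptor.loc", "MPI")]

if __name__ == "__main__":
    print(servicegroup_members(pairs))
```

The odd-length check catches the most common mistake, a host listed without its service.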
