...A place for sharing IT monitoring knowledge

Saturday 28 May 2011

Nagios: Service checks based on host status


Notice

This article applies to Nagios Core 2.x and 3.x. Luckily Nagios Core 4 natively manages the inhibition of service notifications when the service parent (for instance its host) is not UP. Read about this and other Nagios 4 Core features at Nagios Core 4: Overview.


It seems reasonable to assume that when a host switches to a DOWN or UNREACHABLE state, Nagios stops checking its services: why check them if Nagios itself has determined that the host is not UP?

For better or worse, this is not the case: Nagios keeps running regular checks on the services of a non-UP host. The resulting state of each service check depends on how the check handles the unavailability of its data source.

Whatever advantages this behaviour has, it also has some drawbacks:

  • Too much information causes confusion: a set of service alarms caused by a single host failure can hide real problems in services on other hosts.
  • Resources are consumed running checks that are bound to fail.
  • The failure of the host and its services triggers a notification storm.

It therefore seems desirable, if not for all service types then at least for many, to take some steps to avoid these problems:

  1. Setting service states that reflect the real situation, such as UNKNOWN.
  2. Inhibiting notifications related to service state changes.
  3. Disabling active checks of services while their host is not UP.

These steps should reduce, to a greater or lesser extent, the problems related to misleading information, resource consumption and notification storms.


Howto
So now the question is: how to do it? There are different approaches, each with its pros and cons. Without analyzing them all, the best solution seems to be using Nagios external commands to perform all the previous tasks every time the host status changes.

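The required external commands are the ones used in the algorithm further down:
  • PROCESS_SERVICE_CHECK_RESULT, for setting the status of a service.
  • ENABLE_HOST_SVC_NOTIFICATIONS and DISABLE_HOST_SVC_NOTIFICATIONS, for enabling and disabling notifications for all services on a host.
  • ENABLE_HOST_SVC_CHECKS and DISABLE_HOST_SVC_CHECKS, for enabling and disabling active checks of all services on a host.

All these commands are issued from a script designed to handle host status changes. The script might take these command line arguments:
  • Host name, available through the $HOSTNAME$ host macro.
  • Host status, available (in numeric format: 0 = UP, 1 = DOWN, 2 = UNREACHABLE) through the $HOSTSTATEID$ host macro.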
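
As an illustration of what issuing an external command involves, here is a minimal Python sketch. Nagios reads external commands as lines of the form "[timestamp] COMMAND;arg1;arg2" written to its command file. The function names and the command file path below are assumptions made for this example (the real path is whatever the command_file directive in nagios.cfg says), not part of the original solution.

```python
import time


def format_external_command(name, *args, timestamp=None):
    """Return one line in Nagios external command syntax.

    Format: [unix_timestamp] COMMAND_NAME;arg1;arg2...
    """
    if timestamp is None:
        timestamp = int(time.time())
    fields = ";".join(str(field) for field in (name,) + args)
    return f"[{timestamp}] {fields}\n"


def submit(line, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Append a command line to the Nagios command file (a named pipe).

    The default path is an assumption: check command_file in nagios.cfg.
    """
    with open(command_file, "a") as pipe:
        pipe.write(line)
```

For example, format_external_command("DISABLE_HOST_SVC_CHECKS", "web01") builds the line that disables active checks for every service on the hypothetical host web01; passing it to submit() hands it to Nagios.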

In pseudocode, the script algorithm could look like this:

if HOSTSTATEID = 0 then
  # Host has changed to an UP status

  # Force status for all host services
  for each host Service
    # Submit an external command to set, as service status,
    # its previous value ($LASTSERVICESTATEID$ macro)
    ExternalCommand(PROCESS_SERVICE_CHECK_RESULT, HostName, Service,
                    $LASTSERVICESTATEID:HostName:Service$)
  endfor

  # Enable notifications for all host services
  ExternalCommand(ENABLE_HOST_SVC_NOTIFICATIONS, HostName)

  # Enable active checks for all host services
  ExternalCommand(ENABLE_HOST_SVC_CHECKS, HostName)
else
  # Host has changed to a non-UP status

  # Disable active checks for all host services
  ExternalCommand(DISABLE_HOST_SVC_CHECKS, HostName)

  # Disable notifications for all host services
  ExternalCommand(DISABLE_HOST_SVC_NOTIFICATIONS, HostName)

  # Set UNKNOWN (3) status for all host services
  for each host Service
    ExternalCommand(PROCESS_SERVICE_CHECK_RESULT, HostName, Service, 3)
  endfor
endif
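
The pseudocode above could be translated into Python roughly as follows. This is only a sketch under some assumptions: the list of services and their previous states are supplied by the caller (Nagios has no macro that lists the services of a host), the function name build_commands and the plugin output texts are invented for the example, and the function only builds the command lines, leaving the write to the Nagios command file to the caller.

```python
import time

UP = 0       # numeric host state meaning UP
UNKNOWN = 3  # service return code meaning UNKNOWN


def build_commands(host, host_state_id, services, last_states=None, timestamp=None):
    """Return the external command lines for one host status change.

    services    -- service descriptions bound to the host (assumed input)
    last_states -- {service: previous state id}, used when the host is UP
    """
    ts = int(time.time()) if timestamp is None else timestamp
    last_states = last_states or {}

    def cmd(name, *args):
        # External command syntax: [timestamp] NAME;arg1;arg2...
        return f"[{ts}] " + ";".join([name, *map(str, args)])

    lines = []
    if host_state_id == UP:
        # Restore the previous state of every service, then re-enable
        # notifications and active checks.
        for service in services:
            state = last_states.get(service, UNKNOWN)
            lines.append(cmd("PROCESS_SERVICE_CHECK_RESULT", host, service,
                             state, "Host is UP again, previous state restored"))
        lines.append(cmd("ENABLE_HOST_SVC_NOTIFICATIONS", host))
        lines.append(cmd("ENABLE_HOST_SVC_CHECKS", host))
    else:
        # Stop checks and notifications first, then flag every service UNKNOWN.
        lines.append(cmd("DISABLE_HOST_SVC_CHECKS", host))
        lines.append(cmd("DISABLE_HOST_SVC_NOTIFICATIONS", host))
        for service in services:
            lines.append(cmd("PROCESS_SERVICE_CHECK_RESULT", host, service,
                             UNKNOWN, "Host is not UP"))
    return lines
```

Keeping the command building separate from the write to the command file makes the logic easy to test without a running Nagios instance.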


Configuration
Once the script is written, you must define a command object so that Nagios can use it:

define command {
    command_name setSvcStatusByHostStatus
    command_line -h $HOSTNAME$ -s $HOSTSTATEID$
}

In the previous example the host name is passed to the script with the -h argument and the host state ID with the -s argument. Note that command_line must begin with the path to the script itself, which is not shown here.
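
On the script side, these two arguments could be read as in the following Python sketch. The function name parse_args and the attribute names host and state_id are assumptions made for the example. One detail worth noting is that argparse reserves -h for its help message by default, so the automatic help option has to be switched off before -h can carry the host name.

```python
import argparse


def parse_args(argv=None):
    """Parse the -h (host name) and -s (host state id) arguments."""
    # add_help=False frees -h, which argparse otherwise binds to --help.
    parser = argparse.ArgumentParser(
        add_help=False,
        description="Set service status according to host status",
    )
    parser.add_argument("-h", dest="host", required=True,
                        help="host name ($HOSTNAME$ macro)")
    parser.add_argument("-s", dest="state_id", type=int, required=True,
                        choices=(0, 1, 2),
                        help="host state id: 0=UP, 1=DOWN, 2=UNREACHABLE")
    # Keep a help option available under its long name only.
    parser.add_argument("--help", action="help",
                        help="show this help message and exit")
    return parser.parse_args(argv)
```

Calling parse_args(["-h", "web01", "-s", "1"]) yields an object whose host attribute is "web01" and whose state_id attribute is the integer 1.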
Finally, the command must be set as a host event handler. If the solution is suitable for managing the status changes of all hosts, set the command as the global host event handler in the main Nagios configuration (usually stored in the nagios.cfg file):

global_host_event_handler = setSvcStatusByHostStatus

If it is not meant to be used on all hosts, set it instead as the event handler of each suitable host:

define host {
...
event_handler setSvcStatusByHostStatus
...
}

Centreon
The previous solution is fully supported by Centreon:
  • The command is defined like any other command. The only thing to keep in mind is to define it as "check" type so that it appears in the event handler configuration lists.
  • You can set the value of global_host_event_handler through the "Global host event handler" field located on the "Checking options" tab in the Configuration>Nagios>Nagios.cfg menu.
  • You can set the event_handler directive for each host using the "Event handler" field located on the "Data management" tab of Configuration>Hosts>(host name).


9 comments:

  1. Great article, helped me a lot! Thank you!
    But there is one thing I can't figure out: how can I determine which services are under a host? I couldn't find any Nagios macro that could send this information to my script. Therefore I don't know how to solve your for loop:

    # Force status for all host services
    for each host Service
      # Submit an external command to set, as service status,
      # its previous value ($LASTSERVICESTATEID$ macro)
      ExternalCommand(PROCESS_SERVICE_CHECK_RESULT, HostName, Service,
                      $LASTSERVICESTATEID:HostName:Service$)
    endfor

    Could you please give me any advice? Thank you in advance!

  2. Hi Honza:

    Happy to know that my article helped you. As for your question, you are right: there are no macros for getting all the services on a host.

    You have to get them by parsing the Nagios configuration files, but fear not: the fantastic Perl Nagios::Config library is here to help us. This script shows how to get all the services of a given host:

    #!/usr/bin/perl

    use strict;
    use warnings;
    use Nagios::Config;

    # $ARGV[0]: Nagios config file name, $ARGV[1]: host name
    my $Parser = Nagios::Config->new(Filename => $ARGV[0], Version => 2);
    my $Host = $Parser->find_object($ARGV[1], 'Nagios::Host');

    # Print the description of every service bound to the host
    if ( defined $Host ) {
        foreach my $Service ( $Host->list_services ) {
            printf "%s\n", $Service->{'service_description'};
        }
    }

    It takes two command line arguments: the Nagios config file name and the name of the host whose services you want to list. The script outputs, one per line, the value of the service_description field of every service bound to the host name you pass as second argument.

    You can get the Nagios::Config library from CPAN: http://search.cpan.org/~duncs/Nagios-Object-0.21.16/lib/Nagios/Config.pm

    Hope it was helpful :)

  3. Vicente - could you post or forward to me (stvlange@gmail.com) the scripts you use for this? I tried the Perl script you listed above and am not getting any output, and I'd love to see your actual scripts (not just the metalanguage). I'm really new to using external commands.

  4. Hello,
    I must admit that your approach is very interesting, but I would like to see the code, because I don't know how to implement it properly.
    I'm focused on setting the UNKNOWN status on each service of a down node. That's why I think your post is worth it.
    Regards!

    Replies
    1. Thanks for your feedback, Siser. I've developed a public release of the script, but I'm dealing with a bug in the underlying Nagios::Object Perl library it needs. This library is used for retrieving which services are bound to the target host, and hence which services must be handled in order to avoid a notification storm.

      The bug has been reported to its developers, who have kindly checked it and promised an early resolution. As soon as they fix it I'll test the script and, if you agree, I'll send it to you by email (I'll send it to Steve Lange too) so that you can test it as a beta prior to its release.

  5. Hello Vicente!
    Would it also be possible to get your public script once the problem in the Perl library is solved? I have been looking for such a solution for a long time.
    Thank you very much in advance.
    Regards!

    Replies
    1. Hello Vicente!
      I have scripted it on my own. Thank you very much for your input and your explanation. The Nagios::Object Perl library seems to be fixed now, because it is working for me!

    2. Hi, would you mind sharing the script you used to get this working?

