Integrating Nagios with Test Driven Development

A while back I realized something important:

Monitoring tools are to the sysadmin what testing tools are to the developer.

Recently I realized that there need to be more ways to bring both toolsets together. Here’s one, tying the Nagios monitoring toolset to anything that emits the popular Test Anything Protocol.

The sysadmin community loves monitoring tools. Partly because of Tom Limoncelli’s time management book, I’ve been using Nagios for several years now to monitor all sorts of IT hardware and software. It includes features for scheduling check scripts and contact rules (page me during weekdays, page someone else at night), and dashboard views for the visually oriented (curiously unnecessary once you have alerting configured well). I’ve found it takes an investment of time and attention to configure all the checks you want, but it’s worth it. Because of its open-source, extensible nature, Nagios is especially good for monitoring weird things that other enterprise monitoring systems aren’t even aware of. There are a lot of books about Nagios available, and I’d recommend them for anyone new to Nagios. I found James Turnbull’s book published by Apress particularly useful, though it looks like it might be due for a new edition by now.

The developer community has become increasingly interested in software quality control and test driven development (TDD). I first became interested because of the Perl community’s emphasis on Kwalitee testing ever so long ago. There are even more good books and online resources about testing than there are about monitoring. I have been noticing, though, that Perl’s simple and ancient Test Anything Protocol (TAP) has become something of a standard, with testing tools in many languages (including JavaScript, PHP, Python, Ruby, Java, C, C++, C#/VB/.NET and database-specific languages) producing it, covering everything from low-level unit tests in C to frontend cross-browser functional testing tools like Selenium. And TAP is so simple that it’s easy to wrap a TAP harness around other test frameworks.
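For anyone who hasn’t seen TAP up close, here is a minimal sketch of a Perl test script that emits it, using the core Test::More module. The test descriptions and the skip reason are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Test::More tests => 3;    # the plan: prints "1..3"

# Each assertion prints one "ok N - description" or "not ok N - ..." line.
ok( 1 + 1 == 2, 'arithmetic still works' );
is( lc('TAP'), 'tap', 'lc() lowercases' );

# Skipped tests are reported as passing, with a "# skip" directive.
SKIP: {
    skip 'no network in this sketch', 1;
    ok( 0, 'never actually run' );
}
```

Run from the command line, this prints:

    1..3
    ok 1 - arithmetic still works
    ok 2 - lc() lowercases
    ok 3 # skip no network in this sketch

That plain-text stream is all a TAP consumer needs to read.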

One strategy from the security world on which sysadmins and developers both agree is enumerating goodness (take note, “antivirus” is Doing It Wrong). Basically, it’s too hard to guess all the myriad ways your technology might fail and then write monitoring or test scripts for them. Instead, focus your monitoring and testing on covering the important functionality of your product.

But at what level of detail should we monitor? When the monitoring system wakes you up at 2 am, which do you want it to say?

  1. CRITICAL. No users can log in to http://myapp/login.
  2. CRITICAL. The app can’t contact the database.
  3. CRITICAL. The database server is down/unpingable.

My preference is “4. All of the above”, because the user-facing effect is sometimes hard to guess from the sysadmin-facing event (and vice versa). Nagios is good at #3 and maybe #2. #1 is more in the realm of functional testing, and your app may already have functional test scripts that could provide this sort of information.

check_tap.pl

A Nagios plugin for consuming Test Anything Protocol. Basically it combines Test::Harness with Nagios::Plugin.

Read, download, or fork check_tap.pl from the GitHub Gist.
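To give a feel for the TAP-consuming half of that combination, here is a rough sketch that counts failures in a chunk of TAP text. It is not the plugin’s actual code: it uses TAP::Parser (which ships in the Test::Harness distribution) directly, and count_failures() and the sample TAP are hypothetical names and data chosen for illustration.

```perl
use strict;
use warnings;
use TAP::Parser;

# Count the failing tests in a chunk of TAP text.
# count_failures() is an illustrative helper, not part of check_tap.pl.
sub count_failures {
    my ($tap) = @_;
    my $parser = TAP::Parser->new( { tap => $tap } );
    $parser->run;                       # consume the whole TAP stream
    my @failed = $parser->failed;       # test numbers that failed
    return scalar @failed;
}

my $sample = <<'TAP';
1..3
ok 1 - login page loads
not ok 2 - database reachable
not ok 3 - new report # TODO not implemented yet
TAP

printf "%d test(s) failed\n", count_failures($sample);
```

This prints "1 test(s) failed": test 3 is marked TODO, so the parser does not count it as a failure. A Nagios plugin then only has to turn that number into an exit status.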

Plugin Documentation (more or less straight from check_tap.pl --help)

check_tap.pl

This plugin allows Nagios to check the output of anything that emits Test Anything Protocol output. So you can wed Nagios’s monitoring and alerting infrastructure to your unit and functional tests for deep application-level monitoring in development or even in production.

Usage:

check_tap.pl [ -v|--verbose ] [-t <timeout>]
[ -c|--critical=<critical threshold> ]
[ -w|--warning=<warning threshold> ]
[ -s|--script = '</full/path/to/test.t>' ] (Required. multiple OK)
[ -e|--exec = '/full/path/to/runnable ARGS' ]
[ -l|--lib = '/path/to/perl/libs' ] (Multiple OK)

-?, --usage

Print usage information

-h, --help

Print detailed help screen

-V, --version

Print version information

--extra-opts=[section][@file]

Read options from an ini file. See http://nagiosplugins.org/extra-opts for usage and examples.

-s, --script="/path/to/executable/test.t args"

REQUIRED. Defines the path to the test script you want to run. Use multiple -s flags to run multiple tests.

-l, --lib="/path/to/perl/lib/dir"

Optional path for Perl libs to add to @INC. Use multiple -l flags to specify multiple lib dirs.

-e, --exec="/path/to/executable args"

Defines a non-Perl executable with which you want to run the --script.

-w, --warning=INTEGER:INTEGER

Minimum and maximum number of allowable test FAILURES, outside of which a warning will be generated. Default is 0 tolerable failures.

-c, --critical=INTEGER:INTEGER

Minimum and maximum number of allowable test FAILURES, outside of which a critical will be generated. Default is 0 tolerable failures.

-t, --timeout=INTEGER

Seconds before plugin times out (default: 15)

-v, --verbose

Show details for command-line debugging (can repeat up to 3 times)

Verbosity

Use -v to see a bit more info in the one line, including the first test that failed. This is especially useful because Nagios will include it in the alert/notification.

Use -vv to see test summary and failures.

Use -vvv to see full test script output.

Warning and Critical Thresholds

THRESHOLDs for -w and -c specify the allowable number of test failures before the plugin returns WARNING or CRITICAL. Use ‘max’ or ‘min:max’.

The default of 0 tolerated failures is good for people like you who have high standards. But you might want to crank up the CRITICAL threshold if you want to differentiate between WARNING and CRITICAL amounts of fail.

See more threshold examples at http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT

Examples:

    check_tap.pl -s /full/path/to/testfoo.pl

will run ‘testfoo.pl’ and return OK if 0 tests fail, but CRITICAL if any fail (excluding TODO or SKIPped tests, of course).

    check_tap.pl -s /full/path/to/testfoo.pl -c 2

will return OK if 0 tests fail, WARNING if more than 0 tests fail, and CRITICAL if more than 2 fail.
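The threshold behaviour in these examples can be sketched in a few lines of Perl. This is a simplified illustration, not the plugin’s code: outside_range() and status_for() are hypothetical helpers, and they handle only the ‘max’ and ‘min:max’ forms, not the rest of the Nagios range syntax.

```perl
use strict;
use warnings;

# True if $value falls outside the range described by $spec.
# Only the 'max' and 'min:max' forms are handled in this sketch;
# a bare 'max' is treated as the range 0..max.
sub outside_range {
    my ( $value, $spec ) = @_;
    my ( $min, $max ) = $spec =~ /:/ ? split( /:/, $spec, 2 ) : ( 0, $spec );
    return $value < $min || $value > $max;
}

# Map a failure count to a Nagios status, checking CRITICAL first.
sub status_for {
    my ( $failures, $warning, $critical ) = @_;
    return 'CRITICAL' if outside_range( $failures, $critical );
    return 'WARNING'  if outside_range( $failures, $warning );
    return 'OK';
}

# Mirrors "-c 2" with the default warning threshold of 0.
for my $failures ( 0, 1, 3 ) {
    printf "%d failure(s): %s\n", $failures, status_for( $failures, 0, 2 );
}
```

This prints OK for 0 failures, WARNING for 1, and CRITICAL for 3, matching the second example above.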

Non-Perl and remote test scripts

    check_tap.pl -e '/usr/bin/ruby -w' -s /full/path/to/testfoo.rb

will run ‘testfoo.rb’ using Ruby with the -w flag.

You can use any shell command and argument which produces TAP output, for example:

    check_tap.pl -e '/usr/bin/curl -sk' -s 'http://url/to/mytest.php'
    check_tap.pl -e '/usr/bin/cat' -s '/path/to/testoutput.tap'

In fact, anything TAP::Harness or prove regards as a source or executable.

Remember that Nagios or NRPE will likely be running this command as a different, less-privileged user than you’re using now.

License

This nagios plugin is free software, and comes with ABSOLUTELY NO WARRANTY. It may be used, redistributed and/or modified under the terms of the GNU General Public Licence (see http://www.fsf.org/licensing/licenses/gpl.txt).

Installation and Nagios configuration

Save check_tap.pl anywhere that makes sense and that the nagios user can access. The default plugin location depends on your distribution. For RHEL and its ilk, plugins are in /usr/lib/nagios/plugins/.

In misccommands.cfg:

# runs Perl test script given in $ARG1$
define command {
        command_name    check_tap
        command_line    /usr/lib/nagios/plugins/check_tap.pl -v \
               -s $ARG1$ -w $ARG2$ -c $ARG3$
}
 
# gets TAP output from the full URL in $ARG1$
define command {
        command_name    check_tap_remote_url
        command_line    /usr/lib/nagios/plugins/check_tap.pl -v \
               -e '/usr/bin/curl -sk' -s $ARG1$ \
               -w $ARG2$ -c $ARG3$
}
 
# gets TAP output from the relative URL in $ARG1$ at host $HOSTADDRESS$
define command {
        command_name    check_tap_remote_host
        command_line    /usr/lib/nagios/plugins/check_tap.pl -v \
              -e '/usr/bin/curl -sk' \
              -s http://$HOSTADDRESS$$ARG1$ \
              -w $ARG2$ -c $ARG3$
}

You can also use NRPE for *nix or NSClient++/nscp or NC_Net for Windows (you’ll also need Perl, like Strawberry Perl) to run check_tap.pl on a machine other than the Nagios server. I’ll leave that as an exercise for the reader, or maybe a future blog post.

In services.cfg:

define service {
  use                  generic-service
  check_command        check_tap!/path/to/my/test.t!0!0
  service_description  Local test script on Nagios server
}
 
define service {
  use                  generic-service
  check_command        check_tap_remote_url!http://webhost99/nagios/mytest.php!0!0
  service_description  Remote test script accessible by URL
}
 
define service {
  use                  generic-service
  host_name            webhost1, webhost2
  check_command        check_tap_remote_host!/nagios/mytest.php!0!0
  service_description  Remote test script
}

Host-based security for URL-accessible test scripts

It would be good to limit access to the URL-accessible test scripts to your Nagios server and development/admin network. For Apache, put something like this in .htaccess or httpd.conf (this is Apache 2.2 syntax; Apache 2.4 uses Require directives instead):

Order deny,allow
Deny from all
Allow from nagioshost.example.com
Allow from developers.example.com

For IIS you can click on stuff for the same result or use the deny element in your web.config file. For other web servers, RTFM.

Doughnuts for Developers

Wow! Now I can use Nagios’s monitoring, alerting, performance logging, acknowledging, scheduling, dashboarding, and thresholding features for my unit and functional tests! I can run them with different rules in build and production! I can detect random and senseless acts of system administration! I can detect new bugs as soon as I write them! I can hook Nagios up to the sprinkler system to put a damper on that hotheaded programmer in the next cube! I can write frontend functional tests with Selenium or AutoIt or AppleScript and have Nagios alert (the sysadmins of course) when any of my app’s core functionality breaks!!!

There are plenty of other test harnesses out there, but Nagios has some big advantages, especially for monitoring production systems.

Shortcuts for Sysadmins

Two last things which are more good news for sysadmins:

Custom test scripts for things Nagios won’t easily check

Writing your own Nagios plugins is a little bit hard, even with the example plugin I worked on ever so long ago. Writing TAP test scripts, on the other hand, is easy! So, after you’ve set up Nagios monitoring for all the low-hanging fruit like hardware utilization and network availability, write a test script for some of the harder to monitor signs of enumerable goodness. Here’s an example:

#!/usr/bin/perl
use warnings; use strict;
 
# This allows the script to run as a CGI or on the command line.
# We have to output the header in a BEGIN block or Test::Simple will
#   output the plan too soon.
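BEGIN {
    if ($ENV{REQUEST_METHOD}) {
        # 'require' loads CGI only when actually running as a CGI;
        # a 'use' here would load it unconditionally at compile time.
        require CGI;
        print CGI::header("text/plain");
    }
}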
 
use Test::Simple tests=>4;
use Test::File;
 
# These use Test::File to test file permissions and size.
# See http://search.cpan.org/perldoc?Test::File
file_writeable_ok "path/to/cache";  
file_not_writeable_ok "path/to/index.html"; 
my $backup_file = "/usr/local/backups/myapp-backup.tar.bz2";
file_min_size_ok $backup_file, 911377;
 
# -M is a file test operator returning the file's age in days since it was
# last modified, measured from when this script started.
# See http://perldoc.perl.org/functions/-X.html
ok( -M $backup_file < 1, "Backup file is less than 1 day old" );

Just run that script as a CGI or with NRPE etc. and you’ve got instant monitoring of things like backup success, which is otherwise kind of hard to hook in to Nagios. You could get even fancier with checksums or whatever you want.
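As a rough sketch of the checksum idea, the following compares a file’s SHA-256 digest against a known value using the core Digest::SHA module. file_sha256() is a hypothetical helper, and the temporary file containing "abc" is only there to keep the example self-contained; in practice you would point it at your own file and expected digest.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA;
use File::Temp qw(tempfile);
use Test::More tests => 1;

# Hex SHA-256 digest of a file's contents, read in binary mode.
# (Illustrative helper; the name is an assumption, not from any module.)
sub file_sha256 {
    my ($path) = @_;
    return Digest::SHA->new(256)->addfile( $path, 'b' )->hexdigest;
}

# Stand-in for a real file: a temp file containing the bytes "abc".
my ( $fh, $file ) = tempfile( UNLINK => 1 );
binmode $fh;
print {$fh} 'abc';
close $fh;

# The well-known SHA-256 digest of "abc"; replace with your file's digest.
my $expected = 'ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad';

is( file_sha256($file), $expected, 'file matches its known SHA-256 checksum' );
```

A mismatch shows up as a single "not ok" line, which check_tap.pl can then count like any other failure.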

See Test::Tutorial for more. Or look for TAP tools for your favorite language; they’re probably out there.

Tolerating ambiguity

In some cases, a certain amount of failure is acceptable. I’ve been monitoring connectivity between crucial hosts with a test script. I don’t want to be bothered by this check if one of our Citrix VMs isn’t available to the gateway server for a while, but it’s a problem if they’re ALL inaccessible because of a firewall issue or something. The configurable warning/critical thresholds on check_tap.pl allow me to define how much aggregate failure is worth getting excited about, something that’s hard to do with vanilla Nagios configuration.

Here’s the gist of my test script. ping_ok() and service_ok() come from a module I wrote 50 years ago to check basic connectivity/responsiveness with Net::Ping.

diag "Checking Citrix farm connectivity from DMZ\n";
my @stas = qw(
   citrixdc01.mydomain.com
   citrixdc02.mydomain.com
);
foreach (@stas) {
    ping_ok($_, 'http');	
    service_ok($_, 'http', "XML service on $_ is responding in some way" );
}
 
 
 
my @citrix_farm = qw(
   citrixps01.mydomain.com
   citrixps02.mydomain.com
   ...
   citrixps29.mydomain.com
   citrixps30.mydomain.com
);
foreach (@stas, @citrix_farm) {
    service_ok($_, '1494', "ICA service on $_ is responding in some way" );
}

I can just use the “warning” and “critical” arguments to check_tap.pl to set my thresholds for how many inaccessible servers I want to tolerate.
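The module behind ping_ok() and service_ok() isn’t shown here, so the following is only a guess at what a service_ok()-style helper might look like, built on the core IO::Socket::INET and Test::More modules rather than Net::Ping. tcp_port_open() and this version of service_ok() are hypothetical, and the loopback listener exists only so the sketch runs without a real server farm.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket::INET;
use Test::More;

# Hypothetical helper: true if a TCP connection to $host:$port succeeds
# within $timeout seconds. $port may be a number or a service name.
sub tcp_port_open {
    my ( $host, $port, $timeout ) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => $timeout || 3,
    ) or return 0;
    close $sock;
    return 1;
}

# Hypothetical stand-in for service_ok(): one TAP test per host/port pair.
sub service_ok {
    my ( $host, $port, $name ) = @_;
    return ok( tcp_port_open( $host, $port ),
        $name || "$port on $host accepts connections" );
}

# Demo against a throwaway listener on the loopback interface.
my $listener = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1',
    LocalPort => 0,              # let the OS pick a free port
    Listen    => 1,
    Proto     => 'tcp',
) or die "cannot listen: $!";

service_ok( '127.0.0.1', $listener->sockport, 'loopback listener is reachable' );
done_testing();
```

Each unreachable host becomes one failed test, so the -w and -c thresholds translate directly into "how many servers may be down".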

So there you go. Sysadmins and developers rejoice!