Integrating Nagios with Test Driven Development
A while back I realized something important:
Monitoring tools are to the sysadmin what testing tools are to the developer.
Recently I realized that there need to be more ways to bring both toolsets together. Here’s one, tying the Nagios monitoring toolset to anything that emits the popular Test Anything Protocol.
The sysadmin community loves monitoring tools. Partly because of Tom Limoncelli’s time management book, I’ve been using Nagios for several years now to monitor all sorts of IT hardware and software. It includes features for scheduling check scripts and contact rules (page me during weekdays, page someone else at night), and dashboard views for the visually oriented (curiously unnecessary once you have alerting configured well). I’ve found it takes investment of time and attention to configure all the checks you want, but it’s worth it. Because of its open-source, extensible nature, Nagios is especially good for monitoring weird things that other enterprise monitoring systems aren’t even aware of. There are a lot of books about Nagios available, and I’d recommend them for anyone new to Nagios. I found James Turnbull’s book published by Apress particularly useful, though it looks like it might be due for a new edition by now.
The developer community has become increasingly interested in software quality control and test driven development (TDD). I first became interested because of the Perl community’s emphasis on Kwalitee testing ever so long ago. There are even more good books and online resources about testing than there are about monitoring. Everywhere I look, though, I notice that Perl’s simple and ancient Test Anything Protocol (TAP) has become something of a standard, with testing tools in many languages (including JavaScript, PHP, Python, Ruby, Java, C, C++, C#/VB/.NET and database-specific languages) producing it, everything from low-level unit tests in C to frontend cross-browser functional testing tools like Selenium. And TAP is so simple that it’s easy to wrap a TAP harness around other test frameworks.
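If you haven't seen TAP before, it is just lines of text: a plan (`1..N`) followed by one `ok` or `not ok` line per test. Here is a minimal sketch of a producer, written in Python rather than Perl to underline that the protocol is language-neutral. The `tap_lines` helper and the sample checks are made up for illustration; a real project would more likely use an existing TAP library for its language.

```python
def tap_lines(results):
    """Render (passed, description) pairs as TAP lines.

    `results` is a list of (bool, str) tuples. This helper is hypothetical
    and only covers the plan line and plain ok / not ok test lines.
    """
    lines = [f"1..{len(results)}"]  # the plan: how many tests to expect
    for number, (passed, description) in enumerate(results, start=1):
        status = "ok" if passed else "not ok"
        lines.append(f"{status} {number} - {description}")
    return lines


if __name__ == "__main__":
    checks = [
        (1 + 1 == 2, "arithmetic still works"),
        ("nagios".upper() == "NAGIOS", "upper-casing works"),
    ]
    print("\n".join(tap_lines(checks)))
```

Anything that can print lines like these can, in principle, be consumed by a TAP harness.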
One strategy–from the security world–on which sysadmins and developers both agree is enumerating goodness (take note, “antivirus” is Doing It Wrong). Basically, it’s too hard to guess all the myriad ways your technology might fail and then write monitoring or test scripts for them. Instead, focus your monitoring and testing on covering the important functionality of your product.
But at what level of detail should we monitor? When the monitoring system wakes you up at 2 am, which do you want it to say?
1. CRITICAL. No users can log in to http://myapp/login.
2. CRITICAL. The app can’t contact the database.
3. CRITICAL. The database server is down/unpingable.
My preference is “4. All of the above”, because the user-facing effect is sometimes hard to guess from the sysadmin-facing event (and vice versa). Nagios is good at #3 and maybe #2. #1 is more in the realm of functional testing, and your app may already have functional test scripts that could provide this sort of information.
check_tap.pl
A Nagios plugin for consuming Test Anything Protocol. Basically it combines Test::Harness with Nagios::Plugin.
Read, download, or fork check_tap.pl from the GitHub Gist.
Plugin Documentation (more or less straight from check_tap.pl --help)
check_tap.pl
This plugin allows Nagios to check the output of anything that emits Test Anything Protocol output. So you can wed Nagios’s monitoring and alerting infrastructure to your unit and functional tests for deep application-level monitoring in development or even in production.
Usage:
check_tap.pl [ -v|--verbose ] [-t <timeout>]
[ -c|--critical=<critical threshold> ]
[ -w|--warning=<warning threshold> ]
[ -s|--script = '</full/path/to/test.t>' ] (Required. multiple OK)
[ -e|--exec = '/full/path/to/runnable ARGS' ]
[ -l|--lib = '/path/to/perl/libs' ] (Multiple OK)
-?, --usage
Print usage information
-h, --help
Print detailed help screen
-V, --version
Print version information
--extra-opts=[section][@file]
Read options from an ini file. See http://nagiosplugins.org/extra-opts for usage and examples.
-s, --script="/path/to/executable/test.t args"
REQUIRED. Defines the path to the test script you want to run. Use multiple -s flags to run multiple tests.
-l, --lib="/path/to/perl/lib/dir"
Optional path for Perl libs to add to @INC. Use multiple -l flags to specify multiple lib dirs.
-e, --exec="/path/to/executable args"
Defines a non-Perl executable with which you want to run the --script.
-w, --warning=INTEGER:INTEGER
Minimum and maximum number of allowable test FAILURES, outside of which a warning will be generated. Default is 0 tolerable failures.
-c, --critical=INTEGER:INTEGER
Minimum and maximum number of allowable test FAILURES, outside of which a critical will be generated. Default is 0 tolerable failures.
-t, --timeout=INTEGER
Seconds before plugin times out (default: 15)
-v, --verbose
Show details for command-line debugging (can repeat up to 3 times)
Verbosity
Use -v to see a bit more info in the one line, including the first test that failed. This is especially useful because Nagios will include it in the alert/notification.
Use -vv to see test summary and failures.
Use -vvv to see full test script output.
Warning and Critical Thresholds
THRESHOLDs for -w and -c specify the allowable amount of test failures before the plugin returns WARNING or CRITICAL. Use ‘max’ or ‘min:max’.
The default of 0 tolerated failures is good for people like you who have high standards. But you might want to crank up the CRITICAL threshold if you want to differentiate between WARNING and CRITICAL amounts of fail.
See more threshold examples at http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
Examples:
check_tap.pl -s /full/path/to/testfoo.pl
will run ‘testfoo.pl’ and return OK if 0 tests fail, but CRITICAL if any fail. Excluding TODO or SKIPped tests, of course.
check_tap.pl -s /full/path/to/testfoo.pl -c 2
will return OK if 0 tests fail, WARNING if more than 0 tests fail, and CRITICAL if more than 2 fail.
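To make the threshold arithmetic concrete, here is a rough sketch of how a failure count could map to a Nagios status under the 'max' and 'min:max' forms described above. This is not the plugin's actual code (check_tap.pl is Perl and leans on Nagios::Plugin for this); the function names are hypothetical, and the fuller Nagios range syntax (`~:`, `@`, open-ended ranges) is deliberately left out.

```python
import re

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def outside(threshold, value):
    """True if `value` falls outside a 'max' or 'min:max' threshold.

    Simplified on purpose: only the two forms mentioned in the help text
    are accepted; anything else raises ValueError.
    """
    match = re.fullmatch(r"(?:(\d+):)?(\d+)", threshold)
    if not match:
        raise ValueError(f"unsupported threshold: {threshold!r}")
    low = int(match.group(1) or 0)  # a bare 'max' means the range 0..max
    high = int(match.group(2))
    return value < low or value > high


def status_for(failures, warning="0", critical="0"):
    """Check the critical range first so it takes priority over warning."""
    if outside(critical, failures):
        return CRITICAL
    if outside(warning, failures):
        return WARNING
    return OK
```

With the defaults, `status_for(1)` is CRITICAL; with `critical="2"`, one or two failures are only a WARNING and three or more are CRITICAL, which matches the two examples above.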
Non-Perl and remote test scripts
check_tap.pl -e '/usr/bin/ruby -w' -s /full/path/to/testfoo.r
will run ‘testfoo.r’ using Ruby with the -w flag.
You can use any shell command and argument which produces TAP output, for example:
check_tap.pl -e '/usr/bin/curl -sk' -s 'http://url/to/mytest.php'
check_tap.pl -e '/usr/bin/cat' -s '/path/to/testoutput.tap'
In fact, anything TAP::Harness or prove regards as a source or executable.
Remember that Nagios or NRPE will likely be running this command as a different, less-privileged user than you’re using now.
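Whichever command produces the TAP, what matters to the plugin is how many tests in that stream failed. check_tap.pl leaves the parsing to Perl's harness; the sketch below is only a simplified stand-in, in Python, for the counting rule mentioned in the examples: a `not ok` line counts as a failure unless it carries a TODO or SKIP directive. The `count_failures` name is hypothetical, and details such as plan mismatches, subtests, and escaped `#` characters are ignored.

```python
import re

# A test line starts with "ok" or "not ok"; anything else (plan, diagnostics) is skipped.
TEST_LINE = re.compile(r"^(not ok|ok)\b(.*)$")
# TODO and SKIP directives follow a '#' and are case-insensitive.
DIRECTIVE = re.compile(r"#\s*(todo|skip)\b", re.IGNORECASE)


def count_failures(tap_text):
    """Count 'not ok' lines that are not marked TODO or SKIP."""
    failures = 0
    for line in tap_text.splitlines():
        match = TEST_LINE.match(line)
        if not match:
            continue
        failed = match.group(1) == "not ok"
        has_directive = DIRECTIVE.search(match.group(2)) is not None
        if failed and not has_directive:
            failures += 1
    return failures


if __name__ == "__main__":
    sample = "\n".join([
        "1..4",
        "ok 1 - login page loads",
        "not ok 2 - database reachable",
        "not ok 3 - new feature # TODO not written yet",
        "ok 4 # SKIP no LDAP in dev",
    ])
    print(count_failures(sample))
```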
License
This nagios plugin is free software, and comes with ABSOLUTELY NO WARRANTY. It may be used, redistributed and/or modified under the terms of the GNU General Public Licence (see http://www.fsf.org/licensing/licenses/gpl.txt).
Installation and Nagios configuration
Save check_tap.pl anywhere that makes sense and the nagios user can access. The default plugin location depends on your distribution. For RHEL and ilk, plugins are in /usr/lib/nagios/plugins/.
In misccommands.cfg:
    # runs Perl test script given in $ARG1$
    define command {
        command_name  check_tap
        command_line  /usr/lib/nagios/plugins/check_tap.pl -v \
            -s $ARG1$ -w $ARG2$ -c $ARG3$
    }

    # gets TAP output from the full URL in $ARG1$
    define command {
        command_name  check_tap_remote_url
        command_line  /usr/lib/nagios/plugins/check_tap.pl -v \
            -e '/usr/bin/curl -sk' -s $ARG1$ \
            -w $ARG2$ -c $ARG3$
    }

    # gets TAP output from the relative URL in $ARG1$ at host $HOSTADDRESS$
    define command {
        command_name  check_tap_remote_host
        command_line  /usr/lib/nagios/plugins/check_tap.pl -v \
            -e '/usr/bin/curl -sk' \
            -s http://$HOSTADDRESS$$ARG1$ \
            -w $ARG2$ -c $ARG3$
    }
You can also use NRPE for *nix, or NSClient++/nscp or NC_Net for Windows (where you’ll also need a Perl, such as Strawberry Perl), to run check_tap.pl on a machine other than the Nagios server. I’ll leave that as an exercise for the reader, or maybe a future blog post.
In services.cfg:
    define service {
        use                  generic-service
        check_command        check_tap!/path/to/my/test.t!0!0
        service_description  Local test script on Nagios server
    }

    define service {
        use                  generic-service
        check_command        check_tap_remote_url!http://webhost99/nagios/mytest.php!0!0
        service_description  Remote test script accessible by URL
    }

    define service {
        use                  generic-service
        host_name            webhost1, webhost2
        check_command        check_tap_remote_host!/nagios/mytest.php!0!0
        service_description  Remote test script
    }
Host-based security for URL-accessible test scripts
It would be good to limit access to the URL-accessible test scripts to your Nagios server and development/admin network. So for Apache, .htaccess or httpd.conf:
    Order deny,allow
    Deny from all
    Allow from nagioshost.example.com
    Allow from developers.example.com
For IIS you can click on stuff for the same result or use the deny element in your web.config file. For other web servers, RTFM.
Doughnuts for Developers
Wow! Now I can use Nagios’s monitoring, alerting, performance logging, acknowledging, scheduling, dashboarding, and thresholding features for my unit and functional tests! I can run them with different rules in build and production! I can detect random and senseless acts of system administration! I can detect new bugs as soon as I write them! I can hook Nagios up to the sprinkler system to put a damper on that hotheaded programmer in the next cube! I can write frontend functional tests with Selenium or AutoIt or AppleScript and have Nagios alert (the sysadmins of course) when any of my app’s core functionality breaks!!!
There are plenty of other test harnesses out there, but Nagios has some big advantages, especially for monitoring production systems.
Shortcuts for Sysadmins
Two last things which are more good news for sysadmins:
Custom test scripts for things Nagios won’t easily check
Writing your own Nagios plugins is a little bit hard, even with the example plugin I worked on ever so long ago. Writing TAP test scripts, on the other hand, is easy! So, after you’ve set up Nagios monitoring for all the low-hanging fruit like hardware utilization and network availability, write a test script for some of the harder to monitor signs of enumerable goodness. Here’s an example:
    #!/usr/bin/perl
    use warnings;
    use strict;

    # This allows the script to run as a CGI or on the command line.
    # We have to output the header in a BEGIN block or Test::Simple will
    # output the plan too soon.
    BEGIN {
        if ($ENV{REQUEST_METHOD}) {
            use CGI qw(header);
            print header("text/plain");
        }
    }

    use Test::Simple tests => 4;
    use Test::File;

    # These use Test::File to test file permissions and size.
    # See http://search.cpan.org/perldoc?Test::File
    file_writeable_ok "path/to/cache";
    file_not_writeable_ok "path/to/index.html";

    my $backup_file = "/usr/local/backups/myapp-backup.tar.bz2";
    file_min_size_ok $backup_file, 911377;

    # -M is a file test operator returning the file's last modified age in days.
    # See http://perldoc.perl.org/functions/-X.html
    ok( -M $backup_file < 1, "Backup file is newer than 1 day old" );
Just run that script as a CGI or with NRPE etc. and you’ve got instant monitoring of things like backup success, which is otherwise kind of hard to hook in to Nagios. You could get even fancier with checksums or whatever you want.
See Test::Tutorial for more. Or look for TAP tools for your favorite language; they’re probably out there.
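As one illustration of "your favorite language", here is a rough Python take on the backup checks from the Perl script above, printing TAP by hand instead of using a TAP library. The `backup_checks` and `emit_tap` helpers are hypothetical names; the path and minimum size are the ones from the Perl example, and the one-day freshness limit mirrors its `-M` test.

```python
import os
import time


def backup_checks(path, min_size, max_age_days=1.0, now=None):
    """Return (passed, description) pairs for a backup file.

    `now` can be passed in to make the freshness check testable.
    """
    now = time.time() if now is None else now
    exists = os.path.isfile(path)
    size_ok = exists and os.path.getsize(path) >= min_size
    fresh = exists and (now - os.path.getmtime(path)) < max_age_days * 86400
    return [
        (exists, f"{path} exists"),
        (size_ok, f"{path} is at least {min_size} bytes"),
        (fresh, f"{path} is newer than {max_age_days} day(s)"),
    ]


def emit_tap(results):
    """Print a plan line followed by one ok / not ok line per check."""
    print(f"1..{len(results)}")
    for number, (passed, description) in enumerate(results, start=1):
        print(f"{'ok' if passed else 'not ok'} {number} - {description}")


if __name__ == "__main__":
    emit_tap(backup_checks("/usr/local/backups/myapp-backup.tar.bz2", 911377))
```

A script like this could be handed to check_tap.pl through the -e and -s options in the same way as the Ruby example in the plugin documentation.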
Tolerating ambiguity
In some cases, a certain amount of failure is acceptable. I’ve been monitoring connectivity between crucial hosts with a test script. I don’t want to be bothered by this check if one of our Citrix VMs isn’t available to the gateway server for a while, but it’s a problem if they’re ALL inaccessible because of a firewall issue or something. The configurable warning/critical thresholds on check_tap.pl allow me to define how much aggregate failure is worth getting excited about–something that’s hard to do with vanilla Nagios configuration.
Here’s the gist of my test script. ping_ok() and service_ok() come from a module I wrote 50 years ago to check basic connectivity/responsiveness with Net::Ping.
    diag "Checking Citrix farm connectivity from DMZ\n";

    my @stas = qw(
        citrixdc01.mydomain.com
        citrixdc02.mydomain.com
    );
    foreach (@stas) {
        ping_ok($_, 'http');
        service_ok($_, 'http', "XML service on $_ is responding in some way");
    }

    my @citrix_farm = qw(
        citrixps01.mydomain.com
        citrixps02.mydomain.com
        ...
        citrixps29.mydomain.com
        citrixps30.mydomain.com
    );
    foreach (@stas, @citrix_farm) {
        service_ok($_, '1494', "ICA service on $_ is responding in some way");
    }
I can just use the “warning” and “critical” arguments to check_tap.pl to set my thresholds for how many inaccessible servers I want to tolerate.
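My ping_ok() and service_ok() helpers aren't shown here, so as a stand-in, here is a hedged Python sketch of the same idea: one TAP test per host, each passing if a TCP connection to the port succeeds. The `tcp_probe` and `farm_tap` names are invented for this example, and the probe is injectable so the TAP generation can be exercised without touching the network.

```python
import socket
import sys


def tcp_probe(host, port, timeout=3.0):
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def farm_tap(hosts, port, probe=tcp_probe):
    """Return TAP lines with one test per host; `probe` is swappable for testing."""
    lines = [f"1..{len(hosts)}"]
    for number, host in enumerate(hosts, start=1):
        status = "ok" if probe(host, port) else "not ok"
        lines.append(f"{status} {number} - port {port} on {host} accepts connections")
    return lines


if __name__ == "__main__":
    # Hosts come from the command line; 1494 is the ICA port used above.
    print("\n".join(farm_tap(sys.argv[1:], 1494)))
```

Because every unreachable host is one failed test, thresholds such as -w 2 -c 10 (illustrative numbers only) would warn once more than two hosts are unreachable and go critical once more than ten are.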
So there you go. Sysadmins and developers rejoice!
Well, monitoring tools like Nagios are mostly used against production or mission-critical infrastructure/apps, while unit tests are part of development practice, and there are dedicated tools (called Continuous Integration tools) for checking unit test output. Jenkins/Hudson, CruiseControl, Goldberg, Go, etc. are a few of them. These tools can monitor a code repository (git/svn/hg etc.), pull in new commits, run the unit/functional tests automatically, and then notify/alert accordingly.
If your unit tests are failing, then it means they are not going through a CI tool. To me, testing and monitoring are pretty different.
Nevertheless, I like the idea :-). We do use nagios-cucumber for workflow monitoring against some of our apps.
Cool! Nagios-cucumber looks like a great tool. I knew I wasn’t the only one thinking along those lines.
You’re right that Nagios isn’t a Continuous Integration tool. I’m just thinking that functionality can suffer because of things other than new commits, especially on a production app. So reusing existing functional tests for monitoring seems like a good way to increase monitoring coverage.
And I think sysadmins could benefit from using test tools to prove assertions about the environment and infrastructure underneath the application.
Thanks for your feedback, Ranjib!
You guys are not alone. I have recently been trying to run some web automation tests, and I would like to integrate them with Nagios because I don’t really need the Git pulling part. I just need the tests to run every minute and notify me when things fail. But I do totally see the need to bring the best of both worlds together here. I would use Jenkins and integrate it with Nagios if I had to.
But kudos to Nathan for creating this plugin, so I can actually use only Nagios and not have to spin up an additional server for Jenkins.
Thanks, Henry. See also my Nagios Conf talk last year titled “Monitoring The User Experience For Availability and Performance” for more in the functional/frontend tests department.
Nathan, thanks for the link to your slides. That’s exactly what I am currently working on. I originally set up Selenium to work with Jenkins, but I would like to have Nagios trigger the tests instead of Jenkins.
Can you tell me if there is a way to trigger check B after check A, and only when check A was successful? This was easy to do with Jenkins, but if I can do the same with Nagios, that would be the best. I am using check_mk on top of Nagios.