Wednesday, March 20, 2013

Puppet Monitoring: How to Monitor the Success or Failure of Puppet Runs

https://puppetlabs.com/blog/puppet-monitoring-how-to-monitor-the-success-or-failure-of-puppet-runs/

Wednesday, March 20, 2013 12:30 PMPuppet Monitoring: How to Monitor the Success or Failure of Puppet RunsPuppet LabsJesse Aukeman

This post, written by LogicMonitor's Director of Tech Ops, Jesse Aukeman, originally appeared on HighScalability.com on February 19, 2013. It has been reprinted with permission on our blog.

If you are like us, you are running some type of linux configuration management tool. The value of centralized configuration and deployment is well known and hard to overstate. Puppet is our tool of choice. It is powerful and works well for us, except when things don't go as planned. Failures of puppet can be innocuous and cosmetic, or they can cause production issues, for example when crucial updates do not get properly propagated.

Why?

In the most innocuous cases, the puppet agent craps out (we run puppet agent via cron). As nice as puppet is, we still need to goose it from time to time to get past some sort of network or host resource issue. A more dangerous case is when an administrator temporarily disables puppet runs on a host in order to perform some test or administrative task and then forgets to reenable it. In either case it's easy to see how a host may stop receiving new puppet updates. The danger here is that this may not be noticed until that crucial update doesn't get pushed, production is impacted, and it's the client who notices.

How to implement monitoring?

Monitoring is clearly necessary in order to keep on top of this. Rather than just monitoring the status of the puppet server (a necessary, but not sufficient, state), we would like to monitor the success or failure of actual puppet runs on the end nodes themselves. For that purpose, puppet has a built in feature to export status info about its last run into a file (by default /var/lib/puppet/state/last_run_summary.yaml). This file contains all sorts of useful performance information and looks something like this:

---  time:  ssh_authorized_key: 0.007671  total: 20.4510050844269  rvm_gem: 1.608662  service: 3.282576  user: 0.022397  exec: 0.00584  rvm_system_ruby: 0.633996  group: 0.013463  last_run: 1360018865  file: 7.273795  config_retrieval: 7.30157208442688  package: 0.300229  filebucket: 0.000804  changes:  total: 0  resources:  total: 281  skipped: 6  changed: 0  scheduled: 0  out_of_sync: 0  failed_to_restart: 0  restarted: 0  failed: 0  version:  config: 1360014335  puppet: "3.0.0"  events:  total: 0  success: 0  failure: 0  

This is a useful summary of the last job run info, and a great basis for monitoring. With this alone, there are a number of approaches to expose this information to a monitoring solution (some type of host based agent, exposing the data points via snmp, etc). For our particular approach, we had already been working with another puppet tool named MCollective along with its RegistrationMetaData plugin for MongoDB. MCollective (Marionette Collective) is a framework for server orchestration that allows parallel job execution. The Registration plugin works within this MCollective framework and allows all hosts to send "registration" information into the collective that can be processed and centrally stored. This plugin will register all kinds of interesting information about puppet and with a slight modification it will also send the last_run_summary info from the yaml file mentioned earlier. As we already had MCollective running and registering this info, it was easy for us to choose this direction.

In our instance, we are using MongoDB as our central registration database. Because Mongodb is "schemaless" it can handle your registration data however structured (i.e. it's simple to add additional data, change data, etc). All of our servers are part of a mcollective "collective" and they periodically send their registration info (including puppet facts and last run summary info) into the "collective". The MCollective registration agent receives the registration info and stores it in our central Mongo database in json format.

We did have to make a slight modification to the ruby code for the Meta registration plugin in order to expose the puppet_last_run_summary info. Here is snippet of the meta.rb registration plugin with the changes.

class Meta [],                 :facts => {},                 :classes => [],                 :collectives => [],                 :puppet_last_run_summary => [],} # added puppet_last_run to results        cfile = Config.instance.classesfile        if File.exist?(cfile)        result[:classes] = File.readlines(cfile).map {|i| i.chomp}      end        # hackery to populate puppet last run info      puppet_last_run_summary = "/var/lib/puppet/state/last_run_summary.yaml"      if File.exist?(puppet_last_run_summary) then        result[:puppet_last_run_summary] = YAML.load_file(puppet_last_run_summary)      end        result[:identity] = Config.instance.identity      result[:agentlist] = Agents.agentlist      result[:facts] = PluginManager["facts_plugin"].get_facts      result[:collectives] = Config.instance.collectives.sort  

Now here is a sanitized excerpt of some of the registration info produced by a query of the Mongo database, including the puppet_last_run_summary info:

PRIMARY> db.nodes.find({ fqdn: /prod/ }).pretty()  {  	"facts" : {  		"operatingsystemrelease" : "13.2",  		"domain" : "prod",  		"sshrsakey" : "blah",  		"os_maj_version" : "14",  		"puppet_http_server" : "puppet",  		"ps" : "ps -ef",  		"augeasversion" : "0.9.0",  		"hostname" : "prod",  		"productname" : "PowerEdge",  		"architecture" : "x86_64",  		...  	"fqdn" : "prod.server",  	"classes" : [  		"settings",  		"default",  		"ant",  		"puppet",  		"ruby",  		"snmpd",  		"iptables",  		"sudoers",  		"syslog",  		...  	],  	"lastseen" : 1360020774,  	"puppet_last_run_summary" : {  		"events" : {  			"success" : 1,  			"failure" : 0,  			"total" : 1  		},  		"version" : {  			"puppet" : "3.0.0",  			"config" : 1360035015  		},  	"resources" : {  		"failed_to_restart" : 0,  		"skipped" : 6,  		"scheduled" : 0,  		"changed" : 1,  		"restarted" : 0,  		"total" : 412,  		"failed" : 0,  		"out_of_sync" : 1  	},  	"changes" : {  		"total" : 1  	},  	"time" : {  		"user" : 0.025634,  		"service" : 3.856809,  		"group" : 0.014908,  		"config_retrieval" : 8.66118216514587,  		"last_run" : 1360035047,  		"rvm_system_ruby" : 0.484688,  		"rvm_gem" : 3.570474,  		"filebucket" : 0.000629,  		"package" : 0.359344,  		"total" : 25.4297361651459,  		"file" : 8.242915,  		"ssh_authorized_key" : 0.004949,  		"exec" : 0.208204  	}  }  ...  

Now that we have all this data centrally stored in a database, if you are using a monitoring application such as LogicMonitor, it's easy to graph this data and setup alerting based on it.

The first piece of information inside the "puppet_last_run_summary" is under the "time" subsection and has key value "last_run". As it suggests, this data indicates the time that puppet last ran. It is stored as a unix timestamp (number of seconds since the epoch, ie 00:00:00 UTC Jan 1st, 1970). This value will be updated after each puppet run on the node itself, and then this updated value will be propagated to the database when the next registration run occurs. By comparing this timestamp against the current time, you can compute how long it has been since the last puppet run.

Now that we have this information in a Mongo database we should be able to easily track it, plot it in a graph, etc.. Here's a snippet of groovy code that demonstrates how you could easily pull data from the database.

import com.mongodb.*;    mongohost = "mongodb.host"  fqdn = "puppetagent.host"Mongo m = new Mongo(mongohost,27018)    db = m.getDB("puppet")coll = db.getCollection("nodes")  doc = coll.findOne([identity:fqdn] as BasicDBObject)    seconds_since_last_run = System.currentTimeMillis() / 1000 - doc.puppet_last_run_summary.time.last_run    println "seconds_since_last_run: " + time_since_last_puppet_run  println "events_failure: " + doc.puppet_last_run_summary.events.failure  println "events_success: " + doc.puppet_last_run_summary.events.success    m.close()  

Using something similar I was able to pull the data into LogicMonitor and setup graphs to track it. You can see below that the puppet runs are semi-random but generally occur every 15 minutes. The sawtooth pattern is a confluence of the puppet cron schedule, the registration interval, and the interval between monitoring checks. We could potentially lower the periods between registration runs, but this would be an unnecessary increase in overhead, and the current resolution of data points is more than sufficient for our purposes.

We are now able to set a threshold for alerting. For example, we may want to generate warning alerts if puppet has not updated for a period of 1 hour, and further escalations, at the 2 and 3 hour marks.

To extend this example a bit further, we could create additional graphs for any other datapoints tracked in the puppet_last_run_summary file.

Here are graphs tracking puppet events and the time per puppet run:

and here is an example of when something has gone off the rails:

In the above graph you can easily see that puppet runs stopped occurring at approximately 18:00. Shortly after this time, an alert would be generated once the time exceeded configured thresholds, and administrators would be notified so that they may take corrective action.

Here's an example of the alert generated within the LogicMonitor application:

All of our hosts are automatically added to the Mcollective registration, and thus are also automatically added to puppet last run monitoring. This gives us peace of mind that we will alway be notified if there are issues with puppet updates.



No comments:

Post a Comment