----
Chef Reporting API and Resource Updates
// Chef Blog
Have you ever wanted to find a list of nodes that updated a specific resource in a period of time? Such as "show me all the nodes in production that had an application service restart in the last hour"? Or, "which nodes have updated their apt cache recently?" For example,
% knife report resource 'role:supermarket-app AND chef_environment:supermarket-prod' execute 'apt-get update' execute[apt-get update] changed in the following runs: app-supermarket1.example.com 2230cf30-6d95-4e43-be18-211137eaf802 @ 2014-10-07T14:07:03Z app-supermarket2.example.com c5e4d7bf-95a6-4385-9d8e-c6f5617ed79b @ 2014-10-07T14:14:04Z app-supermarket3.example.com c4c4b4bb-91b6-4f73-9876-b24b093c7f1e @ 2014-10-07T14:09:54Z app-supermarket4.example.com 3eb09034-7539-4a3c-af6d-5b01d35bc63f @ 2014-10-07T13:31:56Z app-supermarket5.example.com aa48c1d3-da91-4031-a43d-582a577cbf2d @ 2014-10-07T13:35:15Z Use 'knife runs show' with the run UUID to get more info
I have released a new knife plugin to do that, but first some background.
At CHEF, we run the community's cookbook site, Supermarket. We monitor the systems that run the site with Sensu. The current infrastructure runs instances on Amazon Web Services EC2, with an Elastic Load Balancer (ELB) in front of them. As a corrective action for a Supermarket outage, CHEF's operations team added a new check for elevated HTTP 500 responses from the application servers behind the ELB. One thing we found was that when Supermarket was deployed, and the unicorn
server restarted, we would see elevated 500's, but the site often wouldn't actually be impacted.
The Sensu check is run from a "relay" node. That is, it isn't run on the application servers or the Sensu server – it's run out of band since it's for the ELB. One might imagine we could have similar checks for other services that aren't run on "managed nodes," but that's neither here nor there. The issue is that we get an alert message that looks like this:
Sensu Alerts ALERT - [i-d1dfd5d9/check-elb-backend-500] - CheckELBMetrics CRITICAL: supermarket-elb; Sum of HTTPCode_Backend_5XX is 2538.0. (expected lower than 30.0); (HTTPCode_Backend_5XX '2538.0' within 300 seconds between 2014-08-19 13:33:36 +0000 to 2014-08-19 13:38:36 +0000) [Playbook].
The first part, [i-d1dfd5d9/check-elb-backend-500]
is the node name and the check that alerted. The node name here is the monitoring relay that runs the check, not the actual node or nodes where Supermarket was deployed and restarted. This is where Chef Reporting comes into play. In Chef Reporting, we can view information about recent Chef client runs, which gives us a graph like this.
If we go look at the reports in the Chef Manage console, we can drill down to something like this.
This shows that unicorn was restarted in this run. That's great, but if I'm getting this alert at a time when I'm not particularly coherent (e.g, 2AM), I want a command in a playbook that I can run to get more information quickly without having to log into the webui and click around imprecisely. CHEF publishes a knife-reporting
gem that has a couple handy sub-commands to retrieve this run data. For example, we can list runs.
% knife runs list node_name: i-3022aa3b run_id: 9eccd8f6-876b-4a57-87ac-0b3e7b7ef1e7 start_time: 2014-08-21T17:03:56Z status: started node_name: i-a09424a8 run_id: f2b7871a-149b-4fd3-abdc-d74a838d719a start_time: 2014-08-21T17:00:23Z status: success
Or, we can display a specific run.
% knife runs show eecb04fb-11df-438a-8e81-dd610eb66616 run_detail: data: end_time: 2014-08-20T17:50:12Z node_name: i-9f22aa94 run_id: eecb04fb-11df-438a-8e81-dd610eb66616 run_list: ["role[base]","role[supermarket-app]"] start_time: 2014-08-20T17:45:37Z status: success total_res_count: 261 updated_res_count: 17 run_resources: cookbook_name: supermarket cookbook_version: 2.7.2 duration: 209 final_state: enabled: false running: true id: unicorn initial_state: enabled: false running: true name: unicorn result: restart type: service uri: https://api.opscode.com/organizations/supermarket/reports/org/runs/eecb04fb-11df-438a-8e81-dd610eb66616/15
This is handy, but a little limited. What if I want to display only the runs containing the service[unicorn]
resource?
That's where my knife-report-resource
plugin helps. At first, it was very much specific to finding unicorn restarts on Supermarket app servers. However, I wanted to make it more general purpose as I think people would want to be able to find when arbitrary resources were updated. This is how it works:
- Query the Chef Server for a particular set of nodes. For example,
'role:supermarket-app AND chef_environment:supermarket-prod'
. - Get all the Chef client runs for a specified time period up until the current time. By default, it starts from one hour ago, but we can pass an ISO8601 timestamp.
- Iterate over all the runs looking for runs by the nodes that were returned by the search query, gathering the specified resource type and name.
- Display some nice output with the node's FQDN, the run's UUID, and a timestamp.
From the earlier example:
% knife report resource 'role:supermarket-app AND chef_environment:supermarket-prod' execute 'apt-get update' execute[apt-get update] changed in the following runs: app-supermarket1.example.com 2230cf30-6d95-4e43-be18-211137eaf802 @ 2014-10-07T14:07:03Z app-supermarket2.example.com c5e4d7bf-95a6-4385-9d8e-c6f5617ed79b @ 2014-10-07T14:14:04Z app-supermarket3.example.com c4c4b4bb-91b6-4f73-9876-b24b093c7f1e @ 2014-10-07T14:09:54Z app-supermarket4.example.com 3eb09034-7539-4a3c-af6d-5b01d35bc63f @ 2014-10-07T13:31:56Z app-supermarket5.example.com aa48c1d3-da91-4031-a43d-582a577cbf2d @ 2014-10-07T13:35:15Z Use `knife runs show` with the run UUID to get more info
Then, we can drill down further into one of these runs with the knife-reporting
plugin.
% knife runs show 2230cf30-6d95-4e43-be18-211137eaf802 run_detail: data: end_time: 2014-10-07T14:07:03Z node_name: i-d7fed0df run_id: 2230cf30-6d95-4e43-be18-211137eaf802 run_list: ["role[base]","role[supermarket-app]"] start_time: 2014-10-07T14:03:59Z status: success total_res_count: 271 updated_res_count: 12 run_resources: cookbook_name: chef-client cookbook_version: 3.6.0 duration: 99 final_state: enabled: true running: false id: chef-client initial_state: enabled: true running: true name: chef-client result: enable type: runit_service uri: https://api.opscode.com/organizations/supermarket/reports/org/runs/2230cf30-6d95-4e43-be18-211137eaf802/0 ... cookbook_name: supermarket cookbook_version: 2.11.0 duration: 8506 final_state: id: apt-get update initial_state: name: apt-get update result: run type: execute uri: https://api.opscode.com/organizations/supermarket/reports/org/runs/2230cf30-6d95-4e43-be18-211137eaf802/5
Hopefully you find this plugin useful! It is a RubyGem, and is available on RubyGems.org, and the source is available on GitHub.
----
Shared via my feedly reader
Sent from my iPhone
No comments:
Post a Comment