Tuesday, October 7, 2014

Chef Reporting API and Resource Updates [feedly]



----
Chef Reporting API and Resource Updates
// Chef Blog

This post originally appeared on jtimberman's Code Blog.

Have you ever wanted to find a list of nodes that updated a specific resource in a period of time? Such as "show me all the nodes in production that had an application service restart in the last hour"? Or, "which nodes have updated their apt cache recently?" For example,

% knife report resource 'role:supermarket-app AND chef_environment:supermarket-prod' execute 'apt-get update'  execute[apt-get update] changed in the following runs:  app-supermarket1.example.com 2230cf30-6d95-4e43-be18-211137eaf802 @ 2014-10-07T14:07:03Z  app-supermarket2.example.com c5e4d7bf-95a6-4385-9d8e-c6f5617ed79b @ 2014-10-07T14:14:04Z  app-supermarket3.example.com c4c4b4bb-91b6-4f73-9876-b24b093c7f1e @ 2014-10-07T14:09:54Z  app-supermarket4.example.com 3eb09034-7539-4a3c-af6d-5b01d35bc63f @ 2014-10-07T13:31:56Z  app-supermarket5.example.com aa48c1d3-da91-4031-a43d-582a577cbf2d @ 2014-10-07T13:35:15Z  Use 'knife runs show' with the run UUID to get more info  

I have released a new knife plugin to do that, but first some background.

At CHEF, we run the community's cookbook site, Supermarket. We monitor the systems that run the site with Sensu. The current infrastructure runs instances on Amazon Web Services EC2, with an Elastic Load Balancer (ELB) in front of them. As a corrective action for a Supermarket outage, CHEF's operations team added a new check for elevated HTTP 500 responses from the application servers behind the ELB. One thing we found was that when Supermarket was deployed, and the unicorn server restarted, we would see elevated 500's, but the site often wouldn't actually be impacted.

The Sensu check is run from a "relay" node. That is, it isn't run on the application servers or the Sensu server – it's run out of band since it's for the ELB. One might imagine we could have similar checks for other services that aren't run on "managed nodes," but that's neither here nor there. The issue is that we get an alert message that looks like this:

Sensu Alerts  ALERT - [i-d1dfd5d9/check-elb-backend-500] - CheckELBMetrics CRITICAL: supermarket-elb; Sum of HTTPCode_Backend_5XX is 2538.0. (expected lower than 30.0); (HTTPCode_Backend_5XX '2538.0' within 300 seconds between 2014-08-19 13:33:36 +0000 to 2014-08-19 13:38:36 +0000) [Playbook].

The first part, [i-d1dfd5d9/check-elb-backend-500] is the node name and the check that alerted. The node name here is the monitoring relay that runs the check, not the actual node or nodes where Supermarket was deployed and restarted. This is where Chef Reporting comes into play. In Chef Reporting, we can view information about recent Chef client runs, which gives us a graph like this.

If we go look at the reports in the Chef Manage console, we can drill down to something like this.

This shows that unicorn was restarted in this run. That's great, but if I'm getting this alert at a time when I'm not particularly coherent (e.g, 2AM), I want a command in a playbook that I can run to get more information quickly without having to log into the webui and click around imprecisely. CHEF publishes a knife-reporting gem that has a couple handy sub-commands to retrieve this run data. For example, we can list runs.

% knife runs list  node_name:  i-3022aa3b  run_id:     9eccd8f6-876b-4a57-87ac-0b3e7b7ef1e7  start_time: 2014-08-21T17:03:56Z  status:     started    node_name:  i-a09424a8  run_id:     f2b7871a-149b-4fd3-abdc-d74a838d719a  start_time: 2014-08-21T17:00:23Z  status:     success  

Or, we can display a specific run.

% knife runs show eecb04fb-11df-438a-8e81-dd610eb66616  run_detail:    data:    end_time:          2014-08-20T17:50:12Z    node_name:         i-9f22aa94    run_id:            eecb04fb-11df-438a-8e81-dd610eb66616    run_list:          ["role[base]","role[supermarket-app]"]    start_time:        2014-08-20T17:45:37Z    status:            success    total_res_count:   261    updated_res_count: 17  run_resources:    cookbook_name:    supermarket    cookbook_version: 2.7.2    duration:         209    final_state:      enabled: false      running: true    id:               unicorn    initial_state:      enabled: false      running: true    name:             unicorn    result:           restart    type:             service    uri:              https://api.opscode.com/organizations/supermarket/reports/org/runs/eecb04fb-11df-438a-8e81-dd610eb66616/15

This is handy, but a little limited. What if I want to display only the runs containing the service[unicorn] resource?

That's where my knife-report-resource plugin helps. At first, it was very much specific to finding unicorn restarts on Supermarket app servers. However, I wanted to make it more general purpose as I think people would want to be able to find when arbitrary resources were updated. This is how it works:

  1. Query the Chef Server for a particular set of nodes. For example, 'role:supermarket-app AND chef_environment:supermarket-prod'.
  2. Get all the Chef client runs for a specified time period up until the current time. By default, it starts from one hour ago, but we can pass an ISO8601 timestamp.
  3. Iterate over all the runs looking for runs by the nodes that were returned by the search query, gathering the specified resource type and name.
  4. Display some nice output with the node's FQDN, the run's UUID, and a timestamp.

From the earlier example:

% knife report resource 'role:supermarket-app AND chef_environment:supermarket-prod' execute 'apt-get update'  execute[apt-get update] changed in the following runs:  app-supermarket1.example.com 2230cf30-6d95-4e43-be18-211137eaf802 @ 2014-10-07T14:07:03Z  app-supermarket2.example.com c5e4d7bf-95a6-4385-9d8e-c6f5617ed79b @ 2014-10-07T14:14:04Z  app-supermarket3.example.com c4c4b4bb-91b6-4f73-9876-b24b093c7f1e @ 2014-10-07T14:09:54Z  app-supermarket4.example.com 3eb09034-7539-4a3c-af6d-5b01d35bc63f @ 2014-10-07T13:31:56Z  app-supermarket5.example.com aa48c1d3-da91-4031-a43d-582a577cbf2d @ 2014-10-07T13:35:15Z  Use `knife runs show` with the run UUID to get more info

Then, we can drill down further into one of these runs with the knife-reporting plugin.

% knife runs show 2230cf30-6d95-4e43-be18-211137eaf802  run_detail:    data:    end_time:          2014-10-07T14:07:03Z    node_name:         i-d7fed0df    run_id:            2230cf30-6d95-4e43-be18-211137eaf802    run_list:          ["role[base]","role[supermarket-app]"]    start_time:        2014-10-07T14:03:59Z    status:            success    total_res_count:   271    updated_res_count: 12  run_resources:    cookbook_name:    chef-client    cookbook_version: 3.6.0    duration:         99    final_state:      enabled: true      running: false    id:               chef-client    initial_state:      enabled: true      running: true    name:             chef-client    result:           enable    type:             runit_service    uri:              https://api.opscode.com/organizations/supermarket/reports/org/runs/2230cf30-6d95-4e43-be18-211137eaf802/0  ...    cookbook_name:    supermarket    cookbook_version: 2.11.0    duration:         8506    final_state:    id:               apt-get update    initial_state:    name:             apt-get update    result:           run    type:             execute    uri:              https://api.opscode.com/organizations/supermarket/reports/org/runs/2230cf30-6d95-4e43-be18-211137eaf802/5

Hopefully you find this plugin useful! It is a RubyGem, and is available on RubyGems.org, and the source is available on GitHub.


----

Shared via my feedly reader


Sent from my iPhone

No comments:

Post a Comment