Friday, March 1, 2013

ApacheCon North America Report: Troubleshooting CloudStack

There’s no shortage of talks I wanted to attend at ApacheCon North America, but I was determined not to miss Kirk Kosinski’s troubleshooting talks on Tuesday. Kirk presented two talks, one on the top 10 networking issues, and another on using logs to diagnose issues with CloudStack.
Kirk is an escalation engineer for Citrix, who has worked with CloudPlatform and CloudStack for about two years. He’s no stranger to the common issues (and weirder ones) that folks run into when wrangling their open source clouds.
No doubt Kirk could talk for hours about issues he’s helped troubleshoot, but alas – he was only allotted two.

Common Networking Issues

Kirk spent a lot of time on the first two major issues that people run into when deploying CloudStack – namely, VLAN issues.
When you hear “troubleshooting CloudStack” you might be thinking “oh, CloudStack has a lot of glitches.” Actually, most of these problems aren’t CloudStack bugs; they have a lot to do with the environment that you’re deploying CloudStack in.
For example, Kirk talked about VLAN issues caused by misconfigured switches. How do you spot this? Kirk says that one telltale sign is DHCP may work for some instances, but not all.
To confirm it, one way is to run tcpdump on different interfaces to see what traffic is (or isn’t) making its way in.
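Here’s a minimal sketch of what that check might look like, not something Kirk showed in the talk. It builds a tcpdump command line that watches for DHCP traffic (UDP ports 67 and 68) on one interface, optionally limited to a single VLAN tag. The helper name, the interface names, and VLAN 100 are all assumptions for illustration; substitute whatever your hosts actually use.

```python
import shlex


def dhcp_capture_command(interface, vlan_id=None, packet_count=20):
    """Build a tcpdump argv list that captures DHCP traffic on one interface.

    The interface and VLAN ID are placeholders supplied by the caller.
    """
    bpf = "udp port 67 or udp port 68"
    if vlan_id is not None:
        # Match 802.1Q-tagged frames for this VLAN, then DHCP inside them.
        bpf = f"vlan {vlan_id} and ({bpf})"
    # -n: skip name resolution, -e: print link-level headers (MACs, VLAN tags),
    # -c: stop after N packets, -i: interface to listen on.
    return ["tcpdump", "-n", "-e", "-c", str(packet_count), "-i", interface, bpf]


if __name__ == "__main__":
    # Hypothetical interface names; print one command per capture point.
    for iface in ("eth0", "xenbr0"):
        print(shlex.join(dhcp_capture_command(iface, vlan_id=100)))
```

Run the printed command on each hop (physical NIC, bridge, and so on) while an instance requests an address. If the requests show up on one interface but never on the next, that’s a strong hint about where the VLAN traffic is getting lost.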
Another issue is VLAN problems caused by the hypervisor. For instance, substandard NIC drivers might introduce problems with VLAN configurations because they’re not designed for that kind of work. Kirk says he sees that often on XenServer.
Symptoms can be similar to those of switch misconfigurations: dropped traffic, certain hosts unable to “talk” to one another, etc.
For example, users might want to modify settings directly in the database if they’re not editable via the UI or APIs. While that might work, you can also introduce a world of hurt by editing one setting and missing others.
One solution, Kirk says, is to make sure you’re using the most recent network drivers and versions - e.g., make sure XenServer is up-to-date with all patches.
Bonding also can be a source of problems – but Kirk notes that CloudStack actually provides scripts to set up network bonding properly. Use those if you’re using network bonding with CloudStack!

Open vSwitch

Open vSwitch is the default on newer versions of XenServer, and “it’s great as long as it works,” says Kirk. When it doesn’t, you run into all kinds of weirdness.
Problems include “weirdness” that “defies explanation” and is “hard to troubleshoot” where some traffic is slow or packets are dropped.
He does note that it’s not common for Open vSwitch to have problems, but he has run into a fair number of problems over time with it.
Again, the solution is to make sure you’re up to date with patches for your hypervisors and network drivers.

Security Groups

CloudStack’s security groups allow or block traffic at the hypervisor level. Problems include traffic being blocked that should be allowed, or allowed that should be blocked.
Kirk says he hasn’t seen many problems with KVM, but has seen some issues with XenServer 6.0.2 without the update pack. Security groups and vSphere don’t mix, says Kirk.
To troubleshoot, Kirk says to look at the iptables rules on the hosts and the ebtables rules on the hypervisor to see what’s being allowed through (or not).
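As a rough sketch of the iptables half of that advice, the snippet below filters iptables-save style output down to the rules for one chain and tallies their targets, so you can see at a glance what a chain accepts or drops. The sample text and the chain name in it are made up for illustration; they are not output from a real CloudStack host, and the helper names are my own.

```python
def rules_for_chain(iptables_save_text, chain):
    """Return the '-A <chain> ...' rule lines from iptables-save style output."""
    prefix = f"-A {chain} "
    return [
        line.strip()
        for line in iptables_save_text.splitlines()
        if line.strip().startswith(prefix)
    ]


def verdict_counts(rules):
    """Count how many rules jump to each target (ACCEPT, DROP, ...)."""
    counts = {}
    for rule in rules:
        tokens = rule.split()
        if "-j" in tokens:
            idx = tokens.index("-j")
            if idx + 1 < len(tokens):
                target = tokens[idx + 1]
                counts[target] = counts.get(target, 0) + 1
    return counts


# Illustrative input only: a hypothetical chain name and rule set.
SAMPLE = """\
*filter
:INPUT ACCEPT [0:0]
:i-2-10-VM - [0:0]
-A i-2-10-VM -p tcp -m tcp --dport 22 -j ACCEPT
-A i-2-10-VM -p icmp -j ACCEPT
-A i-2-10-VM -j DROP
-A FORWARD -j ACCEPT
COMMIT
"""

if __name__ == "__main__":
    rules = rules_for_chain(SAMPLE, "i-2-10-VM")
    for rule in rules:
        print(rule)
    print(verdict_counts(rules))
```

In practice you would feed it the saved rule dump from the host in question rather than the sample string, and compare what you find against the rules the security group is supposed to have.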

Read the Fine Logs

The following talk, about using the logs for troubleshooting, was entertaining and illuminating. I spent a little less time taking notes in this one, because the specifics were less important than the overall gist of the talk:
  • Read the logs, looking for ERROR, WARN, and other clues that indicate where there’s a problem – and ignoring the rest
  • Scroll up
  • Ignore “avoid set”
First, a lot of folks are put off by digging into logs. The first time you encounter system logs, it can be a scary proposition. There’s a lot of junk in the log that (at first, at least) seems to be completely senseless.
Take a deep breath, and get ready to plunge in. Learn to use some of the tools that admins have relied on for decades (e.g. grep) to find the telltale lines in the log that start to indicate a problem.
Then? Scroll up after you hit the errors and start seeing what happened in the timeframe immediately before the error.
Finally, Kirk expressed a bit of frustration with the “avoid set” message that appears in logs frequently. This indicates that, for some reason, CloudStack is not using a given host or cluster to start a new instance. It doesn’t indicate a problem in and of itself. Yet it seems to be a frequently cited “issue” in reports.
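The three tips above can be sketched in a few lines of Python. This is my own illustration, not a tool from the talk: it looks for ERROR and WARN lines, skips anything mentioning “avoid set,” and keeps the few lines that came just before each hit so you can “scroll up” without leaving the output. The sample log lines are invented and only loosely shaped like a real log.

```python
import re
from collections import deque

# Whole-word match so e.g. "WARNING" does not count as "WARN".
LEVEL_RE = re.compile(r"\b(ERROR|WARN)\b")


def find_problems(lines, context=3, ignore=("avoid set",)):
    """Yield (line_number, preceding_lines, line) for each ERROR/WARN line.

    Lines containing any phrase in `ignore` are never reported, but they
    still appear in the preceding context of later hits.
    """
    recent = deque(maxlen=context)  # rolling window of the last few lines
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        noisy = any(phrase in line.lower() for phrase in ignore)
        if LEVEL_RE.search(line) and not noisy:
            yield number, list(recent), line
        recent.append(line)


# Invented sample lines, not real CloudStack output.
SAMPLE_LOG = [
    "2013-02-26 10:00:01 INFO  starting deployment",
    "2013-02-26 10:00:02 DEBUG host 5 is in avoid set",
    "2013-02-26 10:00:03 DEBUG no suitable hosts found",
    "2013-02-26 10:00:04 ERROR unable to start instance",
]

if __name__ == "__main__":
    for number, before, line in find_problems(SAMPLE_LOG, context=2):
        for earlier in before:
            print("   ", earlier)
        print(f"{number:>3}:", line)
```

To run it against a real log, pass an open file object instead of the sample list; the function only needs something it can iterate over line by line.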
The “avoid set” issue, and a few other comments during the log-reading session, lead me to believe we could do well spending some time looking at the logs and asking whether we could revamp the errors and reports issued by CloudStack to be easier to read and work with.
All in all, I really enjoyed Kirk’s sessions – it’s always a pleasure to attend a presentation from someone who knows their subject area so well, and who’s interested in sharing that knowledge. Have a troubleshooting tip? Please share it!
