http://blogs.vmware.com/cloudops/?p=561
By: Pierre Moncassin
A chance conversation with a retired airline captain first brought home to me the paradox of automation. It goes something like this: Never assume that complete automation means removing the human element.
The veteran pilot was adamant that a commercial aircraft could be landed safely with the autopilot – but, he explained, contrary to what some people believe, that does not mean the human pilot can just push a button and sleep through the landing. Instead, it means that the autopilot handles the predictable, routine elements of the landing while the pilot plays the vital role of supervising the maneuver and reacting to any unforeseen situations.
We've seen a similar paradox at play in workflow automation situations faced by some of our enterprise customers. Here's a typical scenario: A customer has deployed an automated provisioning workflow using VCO along with vCD and/or VCO. They have relied on VCO scripting to automate the provisioning steps so that end users can provision infrastructure just by "pushing a button." As with the aircraft autopilot (though hopefully less life-threatening), the automated workflows work well until an unexpected situation occurs – there's an error in the infrastructure, a component with a key dependency changes, or the key dependency itself changes.
This often means a failed workflow, and sometimes an error message that the end user struggles to interpret. After a couple of "failed workflow" experiences, the end user is quickly discouraged, user satisfaction plummets and… need I say more?
Well, this is not what automation is supposed to be all about: we want maximum user satisfaction. The missing element here is an error recovery mechanism, one that very often involves human intervention. So how does that work?
One approach, in terms of VCO workflows, is to build error handling into the workflows. It is not possible to predict all error situations, of course, but it is possible to detect error situations and issue an error message to an administrator; this at least enables the interception of the condition, which may be simple to fix.
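As a minimal sketch of that idea, written in Python rather than VCO's own scripting and with every name hypothetical (`run_step`, `notify_admin`, `StepFailed`, `admin_inbox`), a small wrapper can intercept a failure in any provisioning step and pass the details to an administrator:

```python
from typing import Callable, List


class StepFailed(Exception):
    """Raised when a workflow step fails; carries the step name for the administrator."""

    def __init__(self, step_name: str, cause: Exception):
        super().__init__(f"step '{step_name}' failed: {cause}")
        self.step_name = step_name
        self.cause = cause


# Hypothetical stand-in for an e-mail or ticketing integration.
admin_inbox: List[str] = []


def notify_admin(message: str) -> None:
    admin_inbox.append(message)


def run_step(step_name: str, action: Callable[[], object]) -> object:
    """Run one step; on any error, alert the administrator and re-raise as StepFailed."""
    try:
        return action()
    except Exception as exc:  # not every error can be predicted, so catch broadly
        notify_admin(f"Workflow step '{step_name}' failed: {exc!r}")
        raise StepFailed(step_name, exc) from exc


def allocate_storage() -> str:
    # Simulated infrastructure error, for illustration only.
    raise RuntimeError("datastore full")


try:
    run_step("allocate_storage", allocate_storage)
except StepFailed as failure:
    failed_step = failure.step_name  # the workflow knows exactly where it stopped
```

The wrapper does not try to fix anything itself. It only guarantees that the condition is intercepted and that a human hears about it, along with the name of the step that failed.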
A second and more advanced part of the solution is to build modular scripts – that way you fix each problem only once and, of course, make your scripts more robust and repeatable over time.
The third part of the solution is to build re-startable workflows. This essentially means giving an administrator or process owner the ability to undo steps at any point in the flow. In the case of a straightforward VM provisioning workflow, the solution might be as simple as removing the VM and automatically restarting the workflow from the beginning.
Or, it could be more complex – perhaps your resources have run out (maybe additional storage needs provisioning), or an issue arises with network settings. In these cases, you may need to troubleshoot before the workflow can re-start. But the point remains the same: A re-startable workflow gives your end users the best chance to complete their original request, rather than stare at an error message.
With error detection, you can roll back to the initial state and flag the error. Once the error is resolved, the administrator can either "resume" or restart from that known point with a known configuration, or at least with knowledge no worse than you had before.
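The resume-or-restart behavior can be sketched as follows. This is an illustration in Python under stated assumptions, not VCO code: the `Step` and `Workflow` classes, the step functions, and the `capacity` pool are all hypothetical, and each step is assumed to leave no side effects when it fails.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    name: str
    do: Callable[[Dict], None]    # performs the step, mutating shared state
    undo: Callable[[Dict], None]  # reverses the step


@dataclass
class Workflow:
    steps: List[Step]
    state: Dict = field(default_factory=dict)
    completed: int = 0            # number of steps finished so far (the checkpoint)
    error: Optional[str] = None   # last flagged error, for the administrator

    def run(self) -> bool:
        """Run from the checkpoint; on failure, flag the error and stop."""
        self.error = None
        for step in self.steps[self.completed:]:
            try:
                step.do(self.state)
            except Exception as exc:
                self.error = f"{step.name}: {exc}"
                return False
            self.completed += 1
        return True

    def rollback(self) -> None:
        """Undo completed steps in reverse order, back to the initial state."""
        while self.completed > 0:
            self.completed -= 1
            self.steps[self.completed].undo(self.state)


capacity = {"storage_gb": 0}  # simulated resource pool that has run out


def create_vm(state: Dict) -> None:
    state["vm"] = "vm-01"


def remove_vm(state: Dict) -> None:
    state.pop("vm", None)


def attach_disk(state: Dict) -> None:
    if capacity["storage_gb"] < 50:
        raise RuntimeError("insufficient storage")
    state["disk_gb"] = 50


def detach_disk(state: Dict) -> None:
    state.pop("disk_gb", None)


wf = Workflow([Step("create_vm", create_vm, remove_vm),
               Step("attach_disk", attach_disk, detach_disk)])
first_attempt = wf.run()      # fails at attach_disk; create_vm stays checkpointed
flagged = wf.error            # what the administrator sees
capacity["storage_gb"] = 100  # the administrator provisions more storage
resumed = wf.run()            # resumes at attach_disk without recreating the VM
```

Calling `wf.rollback()` instead of the second `wf.run()` would undo the completed steps in reverse order, which corresponds to the simpler "remove the VM and start again" option.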
Crucially, all the error and exception handling is hidden from the user. That allows the request to complete (or to at least have a better chance of completing) – making for a much better experience for the end user.
It is up to the script designers to decide how much of the error they want to share with the end users – a decision that should be made with the administrator responsible for overseeing the process and responding to exceptions. The goal, though, is to keep end users happy and blissfully unaware of error situations as long as their request is satisfied!
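One way to picture that decision is a request handler that sends the full error detail to the administrator's log and returns only a neutral status to the end user. This is a sketch under assumptions: the function name `handle_request`, the logger name, and the message wording are all hypothetical, and it is written in Python rather than VCO scripting.

```python
import logging
from typing import Callable, Dict

log = logging.getLogger("workflow.admin")  # hypothetical logger name


def handle_request(request_id: str, provision: Callable[[], str]) -> Dict[str, str]:
    """Return only what the end user should see; full detail goes to the admin log."""
    try:
        resource = provision()
    except Exception:
        # Full traceback for the administrator; nothing technical reaches the user.
        log.exception("request %s failed", request_id)
        return {"status": "pending", "message": "Your request is being processed."}
    return {"status": "complete", "message": f"{resource} is ready."}


def broken_network_setup() -> str:
    # Simulated failure, for illustration only.
    raise RuntimeError("network settings invalid")


user_view = handle_request("REQ-1", broken_network_setup)
```

How much detail belongs in the user-facing message is exactly the design choice described above. This sketch takes the most conservative option and shares none of it.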
To reiterate my original point: Despite the apparent automaticity of these resolutions, they will have been the result of human intervention along the way.
Finally, as a further step towards optimum organization, I recommend looking at the broader picture of governance around the cloud-related processes. How does the resolution team interact with the Service Desk, for example? Are there policies about when to re-provision instead of repair? Is there a specific organization to manage the cloud-based services? See our whitepaper "Organizing for the Cloud" for an introduction to optimizing the whole IT organization to leverage a cloud infrastructure. But I digress…
In summary – if you are worried that workflow failures may impact your end users:
- Build resilience in your VCO workflows and related scripts
- Build in mechanisms to facilitate human resolution for unpredictable situations
- Create re-startable VCO workflows
- Identify a process owner who has responsibility and accountability for managing exceptions and errors
Thank you to my colleague David Burgess, who helped me formulate several of the key ideas in this post.
For more, browse our blog for some of our previous posts on automation.
Follow @VMwareCloudOps on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.