Tuesday, December 13, 2016

CloudStack upgrades – best practices
http://www.shapeblue.com/cloudstack-upgrades-best-practices/

Introduction

Upgrading CloudStack can sometimes be a little daunting – but as the 5 P's proverb goes: Proper Planning Prevents Poor Performance. With planning, testing and the right strategy, upgrades will have a high chance of success and minimal impact on your CloudStack end users.

The CloudStack upgrade process is documented in the release notes for each CloudStack version (e.g. version 4.6 to 4.9 upgrade guide). These upgrade notes should always be followed in detail since they cover both the base CloudStack upgrade and the required hypervisor upgrade changes. This blog article aims to add some advice and best practices to make the overall upgrade process easier.

The following outlines some best practice steps which can – and should – be carried out in advance of the actual CloudStack upgrade. This ensures all steps are done accurately and without the pressure of an upgrade change window.

Strategy – in-place upgrade vs. parallel builds

The official CloudStack upgrade guides are based on an in-place upgrade process. This process works absolutely fine – it does however have some drawbacks:

  • It doesn't easily lend itself to technology refreshes – if you want to carry out hardware or OS level upgrades or configuration changes to your CloudStack management and MySQL servers at the time of upgrade this adds complexity and time.
  • To ensure you have a rollback point in case of problems with the upgrade you need to either take VM snapshots of the infrastructure (if your management and MySQL servers are Virtual Machines) or otherwise back up all configurations.
  • In case of a rollback scenario you also have to ensure that all CloudStack logs from the upgraded system, as well as the database, are backed up before reverting to any snapshots / backups – such that a post-mortem investigation can be carried out.

To get around these issues it is easier and less risky to simply prepare and build new CloudStack management and MySQL servers in advance of the actual upgrade:

  • This means you can carry out the upgrade on brand new servers, whilst leaving the old servers offline for a much quicker rollback.
  • This also means in the case of a rollback scenario the newly built servers can simply be disabled and left for further investigation whilst the original infrastructure is brought back online.
  • Since most people run their CloudStack management infrastructure as Virtual Machines the cost of adding a handful of new VMs for this purpose is negligible compared to the benefits of a more efficient and less risky upgrade process.

By using this parallel build process the upgrade process will include a few more steps – these will be covered later in this article:

  • Upgrade starts with stopping and disabling CloudStack management on the original infrastructure.
  • The original databases are then backed up and copied to the new MySQL server(s), and MySQL is disabled on the original server(s).
  • Once completed the new CloudStack management servers are configured to point to the new MySQL server(s).
  • The upgrade is then run on the new CloudStack management servers, pointing to the new MySQL server(s).
  • As an additional step the CloudStack "host" global setting needs to be reconfigured.

New CloudStack management infrastructure build

Follow the standard documented build procedure for the new (upgrade) version of CloudStack:

Build new management servers but without carrying out the following steps:

  • Do not carry out the steps for seeding system templates.
  • Do not carry out the cloudstack-setup-databases step.
  • Do not run cloudstack-setup-management.

Build new MySQL servers:

  • Ensure the cloud username / password is configured.
  • Create new empty databases for cloud / cloud_usage / cloudbridge.
  • If possible configure the MySQL master-slave configuration.
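
As a rough illustration of the MySQL preparation above, the sketch below prints the SQL for the three empty databases and the cloud user so it can be reviewed before being run. The function name, the '%' host wildcard and the scope of the grants are assumptions made for this example and are not taken from the official build procedure; a password containing a single quote would also need escaping.

```shell
#!/bin/sh
# Print SQL that creates the three empty CloudStack databases and a "cloud" user.
# Assumptions: MySQL 5.x syntax; the user may connect from any host ('%').
emit_bootstrap_sql() {
    cloud_password=$1
    for db in cloud cloud_usage cloudbridge; do
        printf 'CREATE DATABASE IF NOT EXISTS `%s`;\n' "$db"
    done
    printf "CREATE USER 'cloud'@'%%' IDENTIFIED BY '%s';\n" "$cloud_password"
    for db in cloud cloud_usage cloudbridge; do
        printf "GRANT ALL PRIVILEGES ON \`%s\`.* TO 'cloud'@'%%';\n" "$db"
    done
    printf 'FLUSH PRIVILEGES;\n'
}
```

Once reviewed, the output could be applied with something like: emit_bootstrap_sql '<cloud db password>' | mysql -u root -p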

Upgrade lab testing

As with any technical project, testing the upgrade before carrying it out in production is important. If you don't have a lab that mirrors production, it is worth the effort to build one.

A couple of things to keep in mind:

  • Try to make the upgrade lab as close to production as possible.
  • Make sure you use the same OSes, the same patch level and the same software versions (especially Java and Tomcat).
  • Use the same hardware model for the hypervisors if possible. These don't have to be the same CPU / memory spec. but should otherwise match as closely as possible.
  • Use the same storage back-ends for primary and secondary storage. If this is not possible, at least use the same protocols (NFS, iSCSI, FC, etc.).
  • Try to match your production cloud infrastructure with regards to number of zones.
  • If your production infrastructure has multiple clusters or pods also try to match this.
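
One way to check how closely the lab matches production is to compare package inventories from both sides. The sketch below assumes RPM-based management servers, with each inventory saved beforehand using "rpm -qa"; the function name and the file handling are illustrative only.

```shell
#!/bin/sh
# Print packages whose presence or version differs between two inventories.
# Each input file is expected to hold one package per line, e.g. the saved
# output of "rpm -qa" (an assumption: RPM-based management servers).
diff_inventories() {
    prod_sorted=$(mktemp) || return 1
    lab_sorted=$(mktemp) || return 1
    sort "$1" > "$prod_sorted"
    sort "$2" > "$lab_sorted"
    # comm -3 hides lines common to both files: unindented lines exist only in
    # the first (production) file, tab-indented lines only in the second (lab).
    comm -3 "$prod_sorted" "$lab_sorted"
    rm -f "$prod_sorted" "$lab_sorted"
}
```

An empty result means both inventories list exactly the same package versions.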

Once the lab has been built make sure it is prepared with a test workload similar to production:

  • Configure a selection of test domains, accounts and users.
  • Create VM infrastructure similar to production – e.g. configure isolated and shared networks, VPCs, VPN connections etc.
  • Create a number of VMs across multiple guest OSes – at the very minimum try to have both Windows and Linux VMs running.
  • Make sure you also have VMs with additional disks attached.
  • Keep in mind your post-upgrade testing may be destructive – i.e. you may delete or otherwise break VMs. To cover for this it is good practice to prepare a lot more VMs than you think you may need.

For the test upgrade:

  • Ensure you have rollback points – you may want to run the test upgrade multiple times. An easy way is to take VM snapshots and SAN/NAS snapshots.
  • Using the parallel build process create a new management server and carry out the upgrade (see below).
  • Investigate and fix any problems encountered during the upgrade process and add the fixes to the production upgrade plan.

Once upgraded, carry out thorough functional, regression and user acceptance testing. The nature of these tests depends very much on the type and complexity of the production cloud, but they should at the very minimum include:

User actions
  • Logins with passwords created prior to the upgrade.
  • Creation of new domains / accounts / users.
  • Deletion of old and new domains / accounts / users.
All VM lifecycle actions
  • Create / delete new VMs as well as deletion of existing VMs.
  • Start / stop of new and existing VMs.
  • Additions and removals of disks.
  • Additions and removals of networks to VMs.
Network lifecycle actions
  • Creation of new isolated / shared / VPC networks.
  • Deletion of existing as well as new isolated / shared / VPC networks.
  • Connectivity testing from VMs across all network types.
Storage lifecycle actions
  • Additions and deletions of disks.
Other
  • Test any integrated systems – billing portals, custom GUIs / portals, etc.

If any problems are encountered these need to be investigated and addressed prior to the production upgrade.

Production DB upgrade test

The majority of the CloudStack upgrade is done against the underlying databases. As a result the lab testing discussed above will not highlight any issues resulting from:

  • Schema changes
  • Database or table sizes
  • Internal database constraints

To test for this it is good practice to carry out the following test steps. This process should be carried out as close in time to the actual upgrade as possible:

  • Build a completely network isolated management server using the new upgrade version of CloudStack. For this test simply use the same server for management and MySQL.
  • Import a copy of the production databases.
  • Start the upgrade by running the standard documented cloudstack-setup-databases configuration command as well as cloudstack-setup-management. The latter will complete the server configuration and start the cloudstack-management service.
  • Monitor the CloudStack logs (/var/log/cloudstack/management/management-server.log).

This will highlight any issues with the underlying SQL upgrade scripts. If any problems are encountered these need to be investigated thoroughly and fixed.
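
Rather than reading the whole log by eye, a small filter can pull out the lines that most likely indicate a failed upgrade step. This is only a sketch: the function name is made up, and the assumption that problems appear as lines containing " ERROR " or "Exception" should be adjusted to whatever your logging configuration actually emits.

```shell
#!/bin/sh
# Print log lines that suggest a failed upgrade step.
# Returns 0 if nothing matched, 1 if suspicious lines were printed,
# 2 if the log file could not be read.
scan_upgrade_log() {
    log_file=${1:-/var/log/cloudstack/management/management-server.log}
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 2
    fi
    # Assumed pattern: log4j-style " ERROR " level or any Java exception name.
    if grep -E ' ERROR |Exception' "$log_file"; then
        return 1
    fi
    return 0
}
```

Calling scan_upgrade_log without arguments checks the default management server log path mentioned above.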

Please note – if any problems are found and subsequently fixed the original database needs to be imported again and the full upgrade restarted, otherwise the upgrade process will attempt to carry out upgrade steps which have already been completed – which in most cases will lead to additional errors.

Once all issues have been resolved and the upgrade completes without errors, note the time taken for the upgrade. The database upgrade process can take from minutes up to multiple hours depending on the size of the production databases – something which needs to be taken into account when planning for the production upgrade. If the upgrade takes longer than expected it may also be prudent to consider database housekeeping of event and usage tables to cut down on size and thereby speed up the upgrade process.
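
To see which tables are worth housekeeping, the largest tables can be listed from MySQL's information_schema. The sketch below only prints the query; the function name and the 20-row limit are arbitrary choices for this example.

```shell
#!/bin/sh
# Print a query listing the largest tables in the CloudStack databases.
# information_schema.tables and its data_length / index_length columns are
# standard MySQL; sizes are approximate for InnoDB tables.
emit_table_size_sql() {
    cat <<'SQL'
SELECT table_schema, table_name,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
FROM information_schema.tables
WHERE table_schema IN ('cloud', 'cloud_usage')
ORDER BY (data_length + index_length) DESC
LIMIT 20;
SQL
}
```

The output could be run against the test copy of the databases with: emit_table_size_sql | mysql -u root -p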

To prevent re-occurrence of problems during the production upgrade it is key that all database upgrade fixes and workarounds are documented and added to the production upgrade plan.

Install system VM templates

The official upgrade documentation specifies which system VM templates to upload prior to the upgrade.

Please note – whether you do the upgrade in-place or on a parallel build the system VM templates need to be uploaded on the original CloudStack infrastructure (i.e. it can not be done on the new infrastructure when using parallel builds).

In other words – the template upload can and should be done in the days prior to the upgrade. Uploading these will not have any adverse effect on the existing CloudStack instance, since the new templates are simply uploaded in a similar fashion to any other user template upload. Carrying this out in advance also ensures the templates are fully uploaded by the time the production upgrade is carried out, thereby cutting down the upgrade process time.

Other things to consider prior to upgrade

  • Always make sure the production CloudStack infrastructure is healthy prior to the upgrade – an upgrade will very seldom fix underlying issues – these will however quickly cause problems during the upgrade and lead to extended downtime and rollbacks.
  • Once the upgrade has been completed it is very difficult to backport any post-upgrade user changes back to the original database in case of a rollback scenario. In other words – it is prudent to disable user access to the CloudStack infrastructure until the upgrade has been deemed a success.
  • It is good practice to at the very minimum have the CloudStack MySQL servers configured in a master-slave configuration. If this has not been done on the original infrastructure it can be configured on the newly built MySQL servers prior to any database imports, again cutting down on processing time during the actual production upgrade.
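
If master-slave replication is configured on the new MySQL servers, it is worth confirming it is healthy before the upgrade window. The sketch below reads the output of SHOW SLAVE STATUS\G (as printed on the slave) and checks the two standard thread status fields; the function name is an assumption made for this example.

```shell
#!/bin/sh
# Read the output of "SHOW SLAVE STATUS\G" on stdin and succeed only if both
# replication threads (Slave_IO_Running and Slave_SQL_Running) report "Yes".
replication_healthy() {
    status=$(cat)
    io=$(printf '%s\n' "$status" | awk -F': ' '/Slave_IO_Running:/ {print $2}')
    sql=$(printf '%s\n' "$status" | awk -F': ' '/Slave_SQL_Running:/ {print $2}')
    [ "$io" = "Yes" ] && [ "$sql" = "Yes" ]
}
```

On the slave this could be used as: mysql -u root -p -e 'SHOW SLAVE STATUS\G' | replication_healthy && echo "replication OK"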

Upgrade

Step 1 – review all documentation

The official upgrade documentation is updated for each version of CloudStack (e.g. CloudStack 4.6 to 4.9 upgrade guide) – and should always be taken into consideration during an upgrade.

Step 2 – confirm all system VM templates are in place

In the CloudStack GUI confirm that the previously uploaded system VM templates show the following:

  • Status: Download complete
  • Ready: Yes

Step 3 – stop all running CloudStack services

On all existing CloudStack servers stop and disable the cloudstack-management and cloudstack-usage services:

# service cloudstack-management stop
# service cloudstack-usage stop
# chkconfig cloudstack-management off
# chkconfig cloudstack-usage off

Step 4 – back up existing databases and disable MySQL

On the existing MySQL servers back up all CloudStack databases:

# mysqldump -u root -p cloud > /root/cloud-backup.sql
# mysqldump -u root -p cloud_usage > /root/cloud_usage-backup.sql
# mysqldump -u root -p cloudbridge > /root/cloudbridge-backup.sql
# service mysqld stop
# chkconfig mysqld off
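
A truncated dump is easy to miss, so a quick sanity check on each backup file may be worthwhile. By default mysqldump ends its output with a "-- Dump completed" comment line, which the sketch below looks for; the function name is illustrative, and dumps taken with --skip-comments will not carry this trailer.

```shell
#!/bin/sh
# Succeed if a mysqldump file is non-empty and ends with the
# "-- Dump completed" trailer that mysqldump writes by default.
# Treat a failure as a prompt to look closer, not as proof of corruption.
dump_is_complete() {
    [ -s "$1" ] && tail -n 1 "$1" | grep -q '^-- Dump completed'
}
```

For example: dump_is_complete /root/cloud-backup.sql || echo "cloud-backup.sql looks incomplete"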

Step 5 – copy and import databases on the new MySQL master server

# scp root@<original MySQL server IP>:/root/cloud*.sql .
# mysql -u root -p cloud < cloud-backup.sql
# mysql -u root -p cloud_usage < cloud_usage-backup.sql
# mysql -u root -p cloudbridge < cloudbridge-backup.sql
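
Before running the imports it may also be worth confirming that the dump files survived the copy intact. The sketch below uses sha256sum from GNU coreutils; the function names and the checksum file name cloud-backups.sha256 are assumptions, and the checksum file has to be copied to the new server alongside the dumps.

```shell
#!/bin/sh
# On the original MySQL server: record checksums next to the dump files.
write_checksums() {
    ( cd "$1" && sha256sum cloud*.sql > cloud-backups.sha256 )
}

# On the new MySQL server, after copying the dumps and the checksum file:
# fails if any dump differs from what was recorded on the original server.
verify_checksums() {
    ( cd "$1" && sha256sum -c cloud-backups.sha256 )
}
```

For example, write_checksums /root on the original server, then verify_checksums on whichever directory the files were copied into.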

Step 6 – update the "host" global setting

The parallel builds method introduces new management servers whilst disabling the old ones. If no load balancers are being used, the "host" global setting needs to be updated to specify which new management server the hypervisors should check into (if a new load balancer is introduced, use its VIP instead):

# mysql -p
mysql> update cloud.configuration set value="<new management server IP address | new loadbalancer VIP>" where name="host";
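
Since this step edits the database directly, a small guard against typos may help. The sketch below prints the update followed by a query to read the value back, and refuses values that are empty or contain anything other than letters, digits, dots, colons and hyphens. The function name and the validation rule are assumptions made for this example.

```shell
#!/bin/sh
# Print the UPDATE for the "host" global setting plus a read-back query.
# Refuses empty values and values with unexpected characters (quotes, spaces...).
emit_host_update_sql() {
    case $1 in
        ''|*[!A-Za-z0-9.:-]*)
            echo "refusing suspicious host value: '$1'" >&2
            return 1 ;;
    esac
    printf 'UPDATE cloud.configuration SET value="%s" WHERE name="host";\n' "$1"
    printf 'SELECT name, value FROM cloud.configuration WHERE name="host";\n'
}
```

The output could then be piped into mysql, e.g. emit_host_update_sql <new management server IP address> | mysql -p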

Step 7 – carry out all hypervisor specific upgrade steps

Following the official upgrade documentation, upgrade all hypervisors as specified.

Step 8 – configure and start the first management server

On the first management server only, carry out the following steps to configure connectivity to the new MySQL (master) server and start the management service. Note the "cloudstack-setup-databases" command is executed without the "--deploy-as" option:

# cloudstack-setup-databases <cloud db username>:<cloud db password>@<cloud db host>
# cloudstack-setup-management
# service cloudstack-management status

Step 9 – monitor startup

  • Monitor the CloudStack service startup:
    # tail -f /var/log/cloudstack/management/management-server.log
  • Open the CloudStack GUI and check and confirm all hypervisors check in OK.

Step 10 – update system VMs

  • Destroy the CPVM and make sure this is recreated with the new system VM template.
  • Destroy the SSVM and again make sure this comes back online.
  • Restart guest networks "with cleanup" to ensure all Virtual Routers are recreated with the new template.
  • The versions of system VMs / VRs can be checked in any of the following ways:
    • Check /etc/cloudstack-release locally on each appliance.
    • Check the vm_instance / vm_template tables in the cloud database.
    • Update the "minreq.sysvmtemplate.version" global setting to the new system VM template version – the CloudStack GUI will then show which VRs require upgrading.

Step 11 – configure and start additional management servers

Following the same steps as in step 8 above, configure and start the cloudstack-management service on any additional management servers.

Rollback

Rollbacks of upgrades should only be carried out in worst case scenarios. A rollback is not an exact science and may require additional steps not covered in this article. In addition any rollback should be carried out as soon as possible after the original upgrade – otherwise the new infrastructure will simply have too many changes to VMs, networks, storage, etc. which can not be easily backported to the original database.

Step 1 – disable new CloudStack management / MySQL servers

On the new CloudStack management servers stop and disable the management service:

# service cloudstack-management stop
# chkconfig cloudstack-management off

On the new MySQL server stop and disable MySQL:

# service mysqld stop
# chkconfig mysqld off

Step 2 – tidy up new VMs on the hypervisor infrastructure

Using hypervisor specific tools identify all VMs (system VMs and VRs) created since the upgrade. These instances need to be deleted since the old management infrastructure and databases have no knowledge of them.

Step 3 – enable the old CloudStack management / MySQL servers

On the original MySQL servers:

# service mysqld start
# chkconfig mysqld on

On the original CloudStack management servers:

# service cloudstack-management start
# chkconfig cloudstack-management on

Step 4 – restart system VMs and VRs

System VMs and VRs may restart automatically (since the original instances are gone). If these do not come online follow the procedure described in upgrade step 10 to recreate these.

Conclusion

Many CloudStack users are reluctant to upgrade their platforms – however not upgrading, and thereby falling behind on technology and security, tends to be the greater risk in the long run. Following these best practices should reduce the overall workload, risk and stress of upgrades, both by allowing a lot of work to be carried out upfront and by providing a relatively straightforward rollback procedure.

As always we're happy to receive feedback, so please get in touch with any comments, questions or suggestions.

About The Author

Dag Sonstebo is a Cloud Architect at ShapeBlue, The Cloud Specialists. Dag spends most of his time designing, implementing and automating IaaS solutions based on Apache CloudStack.
