In a previous article, I described how dom0 event channels can impose a hard limit on VM density.
Event channels were just one hard limit the XenServer engineering team needed to overcome to allow XenServer 6.2 to support up to 500 Windows VMs or 650 Linux VMs on a single host.
In my talk at the 2013 Xen Developer Summit towards the end of October, I spoke about a further six hard limits and some soft limits that we overcame along the way to achieving this goal. This blog article summarises that journey.
Firstly, I'll explain what I mean by hard and soft VM density limits. A hard limit is where you can run a certain number of VMs without any trouble, but you are unable to run one more. Hard limits arise when there is some finite, unsharable resource that each VM consumes a bit of. On the other hand, a soft limit is where performance degrades with every additional VM you have running; there will be a point at which it's impractical to run more than a certain number of VMs because they will be unusable in some sense. Soft limits arise when there is a shared resource that all VMs must compete for, such as CPU time.
Here is a run-down of all seven hard limits, how we mitigated them in XenServer 6.2, and how we might be able to push them even further back in future:
- dom0 event channels
  - Cause of limitation: XenServer uses a 32-bit dom0. This means a maximum of 1,024 dom0 event channels.
  - Mitigation for XenServer 6.2: We made a special case for dom0 to allow it up to 4,096 dom0 event channels.
  - Mitigation for future: Adopt David Vrabel's proposed change to the Xen ABI to provide unlimited event channels.
- blktap2 device minor numbers
  - Cause of limitation: blktap2 only supports up to 1,024 minor numbers, caused by #define MAX_BLKTAP_DEVICE in blktap.h.
  - Mitigation for XenServer 6.2: We doubled that constant to allow up to 2,048 devices.
  - Mitigation for future: Move away from blktap2 altogether?
- aio requests in dom0
  - Cause of limitation: Each blktap2 instance creates an asynchronous I/O context for receiving 402 events; the default system-wide number of aio requests (fs.aio-max-nr) was 444,416 in XenServer 6.1.
  - Mitigation for XenServer 6.2: We set fs.aio-max-nr to 1,048,576.
  - Mitigation for future: Increase this parameter yet further. It's not clear whether there's a ceiling, but it looks like this would be okay.
- dom0 grant references
  - Cause of limitation: Windows VMs used receive-side copy (RSC) by default in XenServer 6.1. In netbk_p1_setup, netback allocates 22 grant-table entries per virtual interface for RSC. But dom0 only had a total of 8,192 grant-table entries in XenServer 6.1.
  - Mitigation for XenServer 6.2: We could have increased the size of the grant table, but for other reasons RSC is no longer the default for Windows VMs in XenServer 6.2, so this limitation no longer applies.
  - Mitigation for future: Continue to leave RSC disabled by default.
- Connections to xenstored
  - Cause of limitation: xenstored uses select(2), which can only listen on up to 1,024 file descriptors; qemu opens 3 file descriptors to xenstored.
  - Mitigation for XenServer 6.2: We made two qemu watches share a connection.
  - Mitigation for future: We could modify xenstored to accept more connections, but in the future we expect to be using upstream qemu, which doesn't connect to xenstored, so it's unlikely that xenstored will run out of connections.
- Connections to consoled
  - Cause of limitation: Similarly, consoled uses select(2), and each PV domain opens 3 file descriptors to consoled.
  - Mitigation for XenServer 6.2: We use poll(2) rather than select(2). This has no such limitation.
  - Mitigation for future: Continue to use poll(2).
- dom0 low memory
  - Cause of limitation: Each running VM eats about 1 MB of dom0 low memory.
  - Mitigation for future: Using a 64-bit dom0 would remove this limit.
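The xenstored and consoled items come down to the same constraint: select(2) can only watch descriptors numbered below FD_SETSIZE (typically 1,024 on Linux), whereas poll(2) takes an array of descriptors and has no such cap. Below is a minimal sketch of the poll-based approach, written in Python purely for brevity. The helper name `wait_readable` is hypothetical and this is not XenServer code.

```python
import os
import select


def wait_readable(fds, timeout_ms=1000):
    """Return the descriptors in fds that are readable, using poll(2).

    Hypothetical helper for illustration only. Unlike select(2), poll(2)
    places no FD_SETSIZE ceiling on the descriptor numbers it can watch.
    """
    poller = select.poll()
    for fd in fds:
        poller.register(fd, select.POLLIN)
    # poll() returns (fd, event_mask) pairs for descriptors with pending events.
    return [fd for fd, events in poller.poll(timeout_ms) if events & select.POLLIN]


if __name__ == "__main__":
    read_end, write_end = os.pipe()
    os.write(write_end, b"x")
    print(wait_readable([read_end]))  # the read end is reported as readable
    os.close(read_end)
    os.close(write_end)
```

The same loop works unchanged however many descriptors are registered, which is the property that matters once a host carries hundreds of VMs.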
Summary of limits
Okay, so what does this all mean in terms of how many VMs you can run on a host? Well, since some of the limits concern your VM configuration, it depends on the type of VM you have in mind.
Let's take the example of Windows VMs with PV drivers, each with 1 vCPU, 3 disks and 1 network interface. Here is the number of those VMs you'd have to run on a host in order to hit each limitation:
| Limitation | XS 6.1 | XS 6.2 | Future |
|---|---|---|---|
| dom0 event channels | 150 | 570 | no limit |
| blktap minor numbers | 341 | 682 | no limit |
| aio requests | 368 | 869 | no limit |
| dom0 grant references | 372 | no limit | no limit |
| xenstored connections | 333 | 500 | no limit |
| consoled connections | no limit | no limit | no limit |
| dom0 low memory | 650 | 650 | no limit |
The first limit you'd reach in each release is the smallest figure in its column. In XenServer 6.1 that is dom0 event channels, which cap us at 150 of these VMs. In XenServer 6.2 it is the number of xenstored connections, which caps us at 500 VMs per host. In the future, none of these limits will apply, but there will surely be an eighth limit when running many more than 500 VMs on a host.
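Finding the first limit is just a minimum over each column. As a small sketch (the dictionary layout and the function name `first_limit` are illustrative choices, with `None` standing for "no limit"), the Windows figures can be checked like this:

```python
# VMs per host needed to hit each limit, for Windows VMs with PV drivers
# (1 vCPU, 3 disks, 1 network interface). None stands for "no limit".
WINDOWS_LIMITS = {
    "dom0 event channels":   {"XS 6.1": 150,  "XS 6.2": 570,  "Future": None},
    "blktap minor numbers":  {"XS 6.1": 341,  "XS 6.2": 682,  "Future": None},
    "aio requests":          {"XS 6.1": 368,  "XS 6.2": 869,  "Future": None},
    "dom0 grant references": {"XS 6.1": 372,  "XS 6.2": None, "Future": None},
    "xenstored connections": {"XS 6.1": 333,  "XS 6.2": 500,  "Future": None},
    "consoled connections":  {"XS 6.1": None, "XS 6.2": None, "Future": None},
    "dom0 low memory":       {"XS 6.1": 650,  "XS 6.2": 650,  "Future": None},
}


def first_limit(limits, release):
    """Return (limitation, vm_count) for the lowest finite limit, or None."""
    finite = {
        name: per_release[release]
        for name, per_release in limits.items()
        if per_release[release] is not None
    }
    if not finite:
        return None
    name = min(finite, key=finite.get)
    return name, finite[name]


for release in ("XS 6.1", "XS 6.2", "Future"):
    print(release, first_limit(WINDOWS_LIMITS, release))
```

For XS 6.1 this reports dom0 event channels at 150, and for XS 6.2 it reports xenstored connections at 500, matching the discussion above.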
What about Linux guests? Here's where we stand for paravirtualised Linux VMs each with 1 vCPU, 1 disk and 1 network interface:
| Limitation | XS 6.1 | XS 6.2 | Future |
|---|---|---|---|
| dom0 event channels | 225 | 1000 | no limit |
| blktap minor numbers | 1024 | 2048 | no limit |
| aio requests | 1105 | 2608 | no limit |
| dom0 grant references | no limit | no limit | no limit |
| xenstored connections | no limit | no limit | no limit |
| consoled connections | 341 | no limit | no limit |
| dom0 low memory | 650 | 650 | no limit |
This explains why the supported limit for Linux guests can be as high as 650 in XenServer 6.2. Again, in the future, we'll likely be limited by something else above 650 VMs.
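Several entries in the two tables can be reproduced from the constants quoted in the list of hard limits. The sketch below assumes a deliberately simple model: one blktap2 device and one 402-event aio context per virtual disk, and nothing else drawing on the same budget. The function name `vms_before_limit` is hypothetical. The model does not reproduce every row; the event channel and xenstored figures, for instance, depend on per-VM costs that are not spelled out in full here.

```python
def vms_before_limit(budget, per_vm):
    """Whole number of VMs that fit when each one consumes per_vm units."""
    return budget // per_vm


# Windows VMs: 3 disks and 1 network interface each.
print(vms_before_limit(1024, 3))             # blktap minors, XS 6.1 -> 341
print(vms_before_limit(2048, 3))             # blktap minors, XS 6.2 -> 682
print(vms_before_limit(444_416, 402 * 3))    # aio requests, XS 6.1  -> 368
print(vms_before_limit(1_048_576, 402 * 3))  # aio requests, XS 6.2  -> 869
print(vms_before_limit(8192, 22))            # grant references, XS 6.1 -> 372

# Linux VMs: 1 disk each, and 3 consoled descriptors per PV domain.
print(vms_before_limit(1024, 1))             # blktap minors, XS 6.1 -> 1024
print(vms_before_limit(2048, 1))             # blktap minors, XS 6.2 -> 2048
print(vms_before_limit(1024, 3))             # consoled connections, XS 6.1 -> 341
```

Integer division is used because a VM that only partly fits cannot be started, which is exactly what makes these hard limits.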
What about the soft limits?
After having pushed the hard limits such a long way out, we then needed to turn our attention towards ensuring that there weren't any soft limits that would make it infeasible to run a large number of VMs in practice.
Felipe Franciosi has already described how qemu's utilisation of dom0 CPUs can be reduced by avoiding the emulation of unneeded virtual devices. The other major change in XenServer 6.2 to reduce dom0 load was to reduce the amount of xenstore traffic. This was achieved by replacing code that polled xenstore with code that registers watches on xenstore and by removing some spurious xenstore accesses from the Windows guest agent.
These changes combine to keep dom0 CPU load very low, which means that VMs can remain healthy and responsive even when a very large number of them are running on a host.