ROOT CAUSE:
During a period of higher CPU load, softlockup detection was triggered and rebooted some of the AMD hosts in short succession. Note that this is only possible through a combination of the four factors outlined under Additional Information below: F1, F2 and F3 are all required to trigger the issue, while F4 is a contributing (but not mandatory) factor.
ADDITIONAL INFORMATION:
We have been liaising constantly with our hardware and software vendors on the compute and storage aspects of our cloud stack, as well as reviewing our own internal code and set-up, which together form the majority of the system used to deliver services to customers.
Factors contributing to the fault:
F1. There is a known issue with softlockup detection in the Linux kernel that produces false positive detections: a soft lockup is reported even though none has actually occurred. We discovered the existence of this bug only after the incident.
F2. The false positive softlockup detection can only be triggered if userspace processes are running on core 0. No such processes were supposed to run there, but some did, due to an error in CPU allocation for cloud servers on the AMD compute hosts.
F3. A previously unknown bug was discovered in our storage system module (protocol processing), which holds the CPU for longer than necessary. It cannot cause a true soft lockup, but it can be detected as one. The bug was not caught in testing, because CPU affinity and CPU allocation were always configured correctly there.
F4. Far more CPU time is spent on the other tasks running on core 0 than we anticipated, making a softlockup detection much more likely. This is specific to the AMD hosts, as multi-processor coordination is handled exclusively by core 0.
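Factors F2 and F4 both hinge on which cores userspace processes are permitted to run on. As a minimal, hypothetical sketch (not our actual allocation code), the Linux scheduler affinity API can be used to inspect a process's core mask and move it off core 0:

```python
import os

# Hypothetical illustration of the CPU-allocation check involved in F2:
# inspect which cores this process may run on, and exclude core 0 if
# other cores are available (mirroring the intent of mitigation M2).
pid = os.getpid()
allowed = os.sched_getaffinity(pid)  # set of cores this process may use
print(f"current affinity: {sorted(allowed)}")

if 0 in allowed and len(allowed) > 1:
    os.sched_setaffinity(pid, allowed - {0})

print(f"new affinity: {sorted(os.sched_getaffinity(pid))}")
```

An equivalent check can be done from the shell with `taskset -pc <pid>`; the point is that any userspace work left schedulable on core 0 is exposed to the false positive in F1.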
CURRENT LESSONS LEARNED & ACTIONS TAKEN:
We have mitigated the sources of instability on the current AMD hosts as follows:
M1. We have disabled the "Panic on softlockup" sysctl setting, so that a false positive detection no longer triggers a reboot of the host.
M2. We ensured that VMs are never scheduled on cores 0 and 1, or any cores that are used by the storage protocol processing.
M3. We will fix the bug in the storage module itself in due course; in the meantime, M1 and M2 prevent it from triggering host reboots.
M4. We have moved all storage processes away from core 0, so that there is on average 50% more CPU time available for these processes.
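Mitigation M1 corresponds to a standard Linux kernel sysctl. A minimal sketch of the setting (the exact file location is illustrative; the key name is the standard kernel sysctl):

```
# e.g. in an /etc/sysctl.d/ drop-in file:
# report soft lockups in the kernel log, but do not panic (and so do not reboot)
kernel.softlockup_panic = 0
```

With this set, a false positive from the watchdog (F1) is logged rather than escalated to a host reboot, which breaks the chain of factors that caused the outage.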
Additionally, we are accelerating the transition to the new Intel fabric for all client computing, previously planned for late July/early August, in order to avoid further potential issues from running both AMD and Intel based systems in parallel. We are contacting all customers separately about this. We have also implemented, or are in the process of implementing, further measures based on lessons learned from the outage:
improved packet storm controls on internal networking interfaces (packet storms were an alternate root cause theory that we investigated and disproved)
improved logging of packet flows on the compute nodes including latencies
a number of bugs have been reported to our storage vendor and fixes are in progress
Please accept our sincere apologies for the disruption this situation has caused. We believe we have identified and mitigated the source of the problems so as to avoid any repeat; as outlined above, the fault required several overlapping factors to combine before it could propagate. This was a serious outage in our Zurich cloud. However, incidents of this magnitude remain rare and, with the hardware and software upgrades we have implemented and will implement in the coming weeks, will become rarer still.
CloudSigma strives to do everything possible to minimize any inconvenience to our customers. We appreciate your patience and welcome any feedback.
Thank you for your understanding.