[ZRH] Power and networking issues in Zurich
Incident Report for CloudSigma
Postmortem

ROOT CAUSE:

During a period of higher CPU load, softlockup detection was triggered and rebooted some of the AMD hosts in short succession. Note that this is only possible through a combination of the four factors outlined under Additional Information below: F1, F2 and F3 are mandatory for triggering the issue, while F4 is a contributing (but not mandatory) factor.

ADDITIONAL INFORMATION:

We have been liaising constantly with our hardware and software vendors on the compute and storage aspects of our cloud stack, as well as reviewing our own internal code and set-up, which forms the majority of the system used to deliver services to customers.

Factors contributing to the fault:

  • F1. There is a known issue with softlockup detection in the Linux kernel that produces false positive detections: a soft lockup is reported even though none has actually occurred. We only discovered the existence of this bug after the incident.

  • F2. The false positive softlockup detection can only be triggered if userspace processes are running on core 0. There were not supposed to be any, but there were, due to an error in CPU allocation for cloud servers on the AMD compute hosts (one way to spot such processes is sketched after this list).

  • F3. A new bug was discovered in our storage system module (protocol processing) that holds the CPU for longer than necessary. It cannot produce a true soft lockup, but it can be detected as one. The bug was not caught in testing because CPU affinity and CPU allocation were always configured correctly in that environment.

  • F4. Considerably more CPU time is spent on the other tasks running on core 0 than we had anticipated, which makes a softlockup detection far more likely. This is specific to the AMD hosts, as multi-processor coordination is handled exclusively by core 0.

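For illustration, the sketch below is a minimal Python example for a Linux host (not our production tooling) showing one way to list userspace processes whose CPU affinity includes core 0, i.e. the condition described in F2.

```python
import os

def userspace_tasks_allowed_on_core(core):
    """Yield (pid, comm) for userspace processes allowed to run on `core`."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        pid = int(entry)
        try:
            # Kernel threads have an empty /proc/<pid>/cmdline; skip them.
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                if not f.read():
                    continue
            # os.sched_getaffinity returns the set of CPUs the process may run on.
            if core in os.sched_getaffinity(pid):
                with open(f"/proc/{pid}/comm") as f:
                    yield pid, f.read().strip()
        except OSError:
            # The process exited or is not accessible; skip it.
            continue

if __name__ == "__main__":
    # Print anything that could end up running on core 0.
    for pid, comm in userspace_tasks_allowed_on_core(0):
        print(pid, comm)
```
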
CURRENT LESSONS LEARNED & ACTIONS TAKEN:

We have mitigated the sources of instability on the current AMD hosts as follows:

  • M1. We have disabled the "panic on softlockup" sysctl setting. This means that a false positive detection no longer triggers a reboot of the host (see the sketch after this list).

  • M2. We have ensured that VMs are never scheduled on cores 0 and 1, or on any cores used by storage protocol processing.

  • M3. We are going to fix the bug in the storage module in due course.

  • M4. We have moved all storage processes away from core 0, so that there is on average 50% more CPU time available for these processes.

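For illustration, the following minimal Python sketch (not our production automation; it assumes a Linux host, root privileges, and that the relevant process IDs are already known) shows the two kinds of change behind M1, M2 and M4: turning off panic-on-softlockup and restricting a process to cores other than 0 and 1.

```python
import os

RESERVED_CORES = {0, 1}                                        # cores kept free of VM workloads (M2)
ALLOWED_CORES = set(range(os.cpu_count())) - RESERVED_CORES

def disable_softlockup_panic():
    # Equivalent to `sysctl kernel.softlockup_panic=0` (M1): a false positive
    # detection is then only logged instead of rebooting the host.
    with open("/proc/sys/kernel/softlockup_panic", "w") as f:
        f.write("0\n")

def pin_off_reserved_cores(pid):
    # Restrict the given process to the non-reserved cores (M2/M4).
    # os.sched_setaffinity expects the set of CPUs the process may run on.
    os.sched_setaffinity(pid, ALLOWED_CORES)
```
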
Additionally, we are accelerating the transition of all client computing to the new Intel fabric, originally planned for late July/early August, in order to avoid further potential issues from running both AMD and Intel based systems in parallel. We are contacting all customers separately from this notice. In addition, we have implemented, or are in the process of implementing, the following additional measures and lessons learned from the outage:

  • improved packet storm controls on internal networking interfaces (packet storms were investigated as an alternative root cause and subsequently ruled out)

  • improved logging of packet flows on the compute nodes including latencies

  • a number of bugs have been reported to our storage vendor and fixes for them are in progress

Please accept our sincere apologies for the disruption this situation has caused. We believe we have identified and mitigated the source of the problems so as to avoid any repeat; as outlined above, a combination of several overlapping factors was required for the issue to occur. This was a serious outage in our Zurich cloud, but incidents of this magnitude remain rare and, with the hardware and software upgrades we have implemented and will implement in the coming weeks, will become rarer still.

CloudSigma tries to do everything possible to minimize any inconvenience to our customers. We appreciate your patience and welcome any feedback.

Thank you for your understanding.

Posted Jul 22, 2015 - 07:37 UTC

Resolved
As of 60 minutes ago, all compute has been fully recovered. Our support and operations teams will be following up with affected customers with a full root cause analysis.
Posted Jul 16, 2015 - 19:24 UTC
Update
Most customer computing is now recovered. We are continuing to recover the remaining cloud servers affected by this incident. A full root cause analysis will be supplied to all affected customers.
Posted Jul 16, 2015 - 17:34 UTC
Identified
The network has been stabilised and our engineers are now working to recover client computing that may have been impacted by this incident.
Posted Jul 16, 2015 - 16:55 UTC
Update
We are seeing heavy packet loss against one of the storage VLANs and this is causing some stability issues with virtual machines also. Our engineers are continuing to isolate the issue in order to resolve it.
Posted Jul 16, 2015 - 16:26 UTC
Investigating
We are currently experiencing a partial outage of our Zürich cloud and our engineers are investigating this. We are working to recover affected systems and customer cloud servers and will be updating affected customers again shortly.
Posted Jul 16, 2015 - 16:03 UTC