Host outage in Zurich 2.0
Incident Report for CloudSigma
Postmortem

We have performed a root cause analysis of yesterday's host crashes.

Findings identified thus far:

The issue was traced to legacy storage nodes with deduplication enabled. These nodes were identified last year as unstable when drive deletion is enabled. They were removed from allocation at that time but remain in production because customer legacy drives are still stored on them.

The instability identified last year was traced to a bug related to iSCSI connections that can cause a full compute host crash when storage nodes are slow to respond during drive delete operations involving deduplicated data.

A recent patch to the agent software running on the storage nodes accidentally re-enabled deletion on these storage nodes. Unfortunately, drives from these storage nodes were mounted across a significant proportion of the compute nodes in our Zurich cloud.
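
One way to guard against a patch silently resetting a deliberately disabled setting is to treat such settings as pinned when patch defaults are applied. The following is only a minimal sketch under stated assumptions: the setting names (`drive_deletion_enabled`, `agent_version`) and the function `apply_patch_defaults` are hypothetical and do not describe the actual agent software.

```python
# Hypothetical setting names; the real agent configuration is not described in this report.
PINNED_KEYS = {"drive_deletion_enabled"}


def apply_patch_defaults(current, patch_defaults, pinned_keys=PINNED_KEYS):
    """Return the configuration after a patch, keeping pinned settings unchanged."""
    merged = dict(current)
    for key, value in patch_defaults.items():
        if key in pinned_keys and key in current:
            # A pinned setting that was set on purpose is never overwritten by a patch.
            continue
        merged[key] = value
    return merged


# Illustrative legacy node with deletion deliberately disabled.
legacy_node = {"drive_deletion_enabled": False, "agent_version": "1.0"}
patch = {"drive_deletion_enabled": True, "agent_version": "1.1"}
patched = apply_patch_defaults(legacy_node, patch)
```

In this sketch the agent version is updated while deletion stays disabled on the legacy node.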

In short, human error resulted in a rollback of configuration settings on a small number of storage nodes, which created significant instability in the cloud.

Current Lessons Learned & Action Taken: Going forward, we have modified our procedures to ensure that equipment scheduled for decommission is NOT updated, as updates might cause unpredictable behaviour, unless leaving it without the update presents a security risk to client computing. We are also accelerating the decommissioning of the limited number of storage nodes with known problems in relation to deduplication.
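
The revised update rule can be illustrated with a short sketch. This is an assumption-laden example, not our actual tooling: the `Node` type, the node names, and the `nodes_to_update` function are hypothetical and serve only to show the rule that equipment scheduled for decommission is skipped unless the update is a security fix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # Hypothetical inventory record; field names are illustrative only.
    name: str
    scheduled_for_decommission: bool


def nodes_to_update(nodes, is_security_fix):
    """Select nodes eligible for an update under the revised procedure."""
    return [
        node
        for node in nodes
        # Nodes awaiting decommission are only touched for security fixes.
        if is_security_fix or not node.scheduled_for_decommission
    ]


fleet = [Node("storage-legacy-1", True), Node("storage-2", False)]
routine = [n.name for n in nodes_to_update(fleet, is_security_fix=False)]
security = [n.name for n in nodes_to_update(fleet, is_security_fix=True)]
```

Here a routine update reaches only `storage-2`, while a security fix reaches both nodes.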

As this outage was due directly to human error, we do not expect a repeat of the instability as a result of this issue.

Please accept our sincere apologies for the inconvenience caused.

While it is important for CloudSigma to perform maintenance such as this to ensure the quality of our services, we try to do everything possible to minimize any inconvenience to our customers. We appreciate your patience and welcome any feedback.

Posted Feb 07, 2014 - 14:11 UTC

Resolved
We have resolved the outage of the past few hours.
We are also carrying out an initial root cause analysis, which we will share with you as soon as possible.
In the course of that analysis we have identified what was causing the problem.
Posted Feb 05, 2014 - 18:38 UTC
Investigating
At the moment we are experiencing issues with some of our hosts in Zurich.
Posted Feb 05, 2014 - 17:44 UTC