Storage node in Zurich 2.0 is down

Incident Report for CloudSigma

Postmortem

Our current working hypothesis is that the drives we were supplied are not suited to the workload we run on them, combined with network issues affecting the iSCSI connections themselves. Drive replacement starts tomorrow, 2013-08-28, at 10:00 UTC and will continue over the following 72 hours. This will not cause any service disruption for our customers.

We have already identified and mitigated several factors that contributed to the problems, as disclosed previously: for example, limiting disk work queues and lowering the number of drives per storage box. However, we have yet to fully identify the root cause of the most recent outages.

We are also testing different scenarios in our lab by changing kernels and by stress testing our lab cloud stack with VMs and iSCSI targets to simulate and reproduce the outages.

We will keep you posted on this until the issue is fully resolved.

ACTION UNTIL NEXT UPDATE: Replace drives in our storage boxes.

NEXT UPDATE: 2013-08-30, 14:00 UTC

Posted Aug 27, 2013 - 15:11 UTC

Resolved

The storage node has been rebooted and is operational again.

Posted Aug 26, 2013 - 19:03 UTC

Investigating

One of our storage nodes in Zurich 2.0 just went down. We are currently investigating.

Posted Aug 26, 2013 - 18:41 UTC
