Network issue in ZRH
Incident Report for CloudSigma
Postmortem

We have performed a root cause analysis of the network issue that occurred two days ago.

Findings identified thus far:

Two days ago, the port-channels on two of our core switches in ZRH flapped for approximately 9 minutes. This caused significant disruption within the cloud infrastructure. Our logs indicate that the switches were failing VRRP domain master election and started flooding the network with ARP traffic.

Please note that some VMs lost connectivity to their drives for a few minutes, which put those drives into a read-only state. In such cases the issue can usually be resolved simply by stopping and then starting the VM via the UI and/or the API.
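For customers who script this step, the stop-then-start sequence can be sketched roughly as follows. This is an illustration only, not our official client: the `VMClient` interface, its `stop`, `start` and `status` methods, and the state names "stopped" and "running" are all assumptions made for the example and would need to be mapped onto the actual API calls you use. The point it shows is that the VM should be fully stopped before it is started again, rather than rebooted in place.

```python
import time
from typing import Protocol


class VMClient(Protocol):
    """Hypothetical minimal client interface (an assumption, not a real API)."""

    def stop(self, vm_id: str) -> None: ...
    def start(self, vm_id: str) -> None: ...
    def status(self, vm_id: str) -> str: ...


def wait_for(client, vm_id, wanted, timeout=120.0, interval=2.0, sleep=time.sleep):
    """Poll the VM status until it equals `wanted` or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if client.status(vm_id) == wanted:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"VM {vm_id} did not reach state {wanted!r}")
        sleep(interval)


def power_cycle(client, vm_id, timeout=120.0, interval=2.0, sleep=time.sleep):
    """Stop the VM, wait until it is fully stopped, then start it again."""
    client.stop(vm_id)
    # Assumed state name: "stopped". Starting before this point would not
    # be a full power cycle.
    wait_for(client, vm_id, "stopped", timeout, interval, sleep)
    client.start(vm_id)
    # Assumed state name: "running".
    wait_for(client, vm_id, "running", timeout, interval, sleep)
```

With a real client object that implements these three methods, the call would be `power_cycle(client, "<your VM id>")`.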

Current Lessons Learned & Actions Taken: Since this is very unexpected behavior in a scenario that had been tested thoroughly beforehand, we have contacted the networking equipment manufacturer directly and escalated the issue to the highest possible level.

In parallel we are analyzing logs and testing different possible scenarios in our lab.

Please accept our sincere apologies for the inconvenience caused.

We appreciate your patience and welcome any feedback.

Posted Mar 28, 2014 - 08:52 UTC

Resolved
CloudSigma would like to inform you that we experienced significant packet loss within our network in the ZRH cloud location this afternoon. It seems that the port-channels flapped, which caused the network interruption. The issue is now resolved, but during the above-mentioned window you may have experienced difficulties accessing virtual machines within the ZRH cloud.

All support activities were conducted under the direct supervision of CloudSigma’s operations staff.

IMPACT:
This event affected network connectivity within our cloud in Zurich. As the network issue affected some of our compute and storage resources, some running virtual machines may have lost network connectivity to their drives, leaving those drives in a read-only state. In such a case please power cycle your VM, that is, stop it fully and then start it again (a simple restart will not do the job). If the issue still persists, please do not hesitate to contact our support department via live chat and/or email at support@cloudsigma.com.

Please accept our sincere apologies for the inconvenience caused.

We will update you as soon as we have more information on the root cause of the network issue.

Next Update at: 2014-03-27, 15:30 UTC; 16:30 CET

While it is important for CloudSigma to perform maintenance such as this to ensure the quality of our services, we try to do everything possible to minimize any inconvenience to our customers. We appreciate your patience and welcome any feedback.
Posted Mar 26, 2014 - 11:46 UTC
Investigating
We are experiencing a network issue in our Zurich cloud. Our administrators are already working on a resolution. There will be another update shortly. Your patience and understanding are appreciated.
Posted Mar 26, 2014 - 11:10 UTC