Host Issue, ZRH 2.0

Incident Report for CloudSigma

Postmortem

Today we performed a root cause analysis of the host crash.

Two things have been identified:

1) Our storage system dropped iSCSI connections between the storage nodes and the client physical hosts, which left the guest VMs' drives on the host read-only.

2) Guests with drives in read-only or offline mode caused the physical host to crash.

We suspect that the storage system's behaviour was caused by too many iSCSI connections/targets, by a job that mass-deleted drives and ran at the same time the problems occurred, or by both. To confirm these hypotheses we are recreating the situation in our lab.

In the meantime, as a precaution and assuming our hypotheses are correct, we are taking the following actions to prevent the same problem from recurring:

Decrease the upper limit of iSCSI connections by 50%
Implement a rate-limited job queue for the deletion of drives
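
As an illustration only, a rate-limited deletion queue could look roughly like the sketch below. This is not our production code: the class name `RateLimitedDeleteQueue`, the `delete_fn` callback and the `max_per_second` parameter are hypothetical names chosen for the example, and the real job queue may work differently.

```python
import time
from collections import deque
from typing import Callable, Deque, List, Optional


class RateLimitedDeleteQueue:
    """FIFO queue that runs at most `max_per_second` deletions per second.

    Hypothetical sketch: `delete_fn` stands in for whatever call actually
    removes a drive on the storage system.
    """

    def __init__(
        self,
        delete_fn: Callable[[str], None],
        max_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self._delete_fn = delete_fn
        self._min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._pending: Deque[str] = deque()
        self._next_allowed: Optional[float] = None

    def submit(self, drive_id: str) -> None:
        """Queue a drive for deletion instead of deleting it immediately."""
        self._pending.append(drive_id)

    def drain(self) -> List[str]:
        """Delete all queued drives in order, pausing between deletions."""
        processed: List[str] = []
        while self._pending:
            now = self._clock()
            if self._next_allowed is not None and now < self._next_allowed:
                # Wait until the next deletion slot opens.
                self._sleep(self._next_allowed - now)
            drive_id = self._pending.popleft()
            self._delete_fn(drive_id)
            processed.append(drive_id)
            self._next_allowed = self._clock() + self._min_interval
        return processed
```

With this shape, a mass delete request only calls `submit` for each drive, and a single worker calling `drain` spreads the actual deletions out over time, so the storage and host machines never receive a burst of deletions at once.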

The root cause of the host crash itself is still being investigated: we do not yet know why a host would crash just because drives are in a read-only state.

The cause is most likely confined to the kernel, where the iSCSI management code resides, so we will now test other kernels to see whether we can reproduce the problem in the lab.

Next Update: 2013-08-09, 12:00 UTC

We are in the process of recreating the situation in our lab. The results and analysis should be ready at the beginning of next week.

Next Update: 2013-08-12, 14:00 UTC

We are still performing root cause analysis of last week's host crash.

The following actions have been taken so far:

Limit iSCSI connections per storage node to 50% of the previous limit to lower the load on the boxes (we have seen no problems since the last outage) - DONE
Rate limit mass deletion of drives via a job queue, since this is a heavy operation that affects the configuration of both storage and host machines - IN PROGRESS
Add alerts to our monitoring system so that if a host detects abnormal iSCSI connection behaviour, the on-duty IT operations staff are called - DONE
Lab setup to find the root cause by testing different kernels and by applying heavy load with VMs and iSCSI targets - IN PROGRESS
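
To illustrate the alerting item above, a host-side check could be sketched as follows. This is a simplified example and not our actual monitoring code: the function `check_iscsi_sessions`, the `Alert` type and the healthy state name `LOGGED_IN` are assumptions made for the example, and the observed session states would come from whatever the host's iSCSI tooling reports.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Assumed name of the healthy session state; adjust to what the tooling reports.
HEALTHY_STATE = "LOGGED_IN"


@dataclass(frozen=True)
class Alert:
    target: str
    reason: str


def check_iscsi_sessions(
    expected_targets: Iterable[str], observed: Dict[str, str]
) -> List[Alert]:
    """Return one alert per expected target whose session is missing or unhealthy.

    `observed` maps a target name to its current session state.
    """
    alerts: List[Alert] = []
    for target in expected_targets:
        state = observed.get(target)
        if state is None:
            alerts.append(Alert(target, "session missing"))
        elif state != HEALTHY_STATE:
            alerts.append(Alert(target, f"unexpected state {state}"))
    return alerts
```

Any non-empty result would then be forwarded to the paging system so that the on-duty operator is called.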

Next Update: 2013-08-14, 14:00 UTC

We are still performing root cause analysis of last week's host crash.

The following actions have been taken since the last update:

Rate limit mass deletion of drives via a job queue, since this is a heavy operation that affects the configuration of both storage and host machines - DONE
Lab setup to find the root cause by testing different kernels and by applying heavy load with VMs and iSCSI targets - IN PROGRESS

Next Update: 2013-08-16, 14:00 UTC

The following test is still ongoing:

Lab setup to find the root cause by testing different kernels and by applying heavy load with VMs and iSCSI targets - IN PROGRESS

Next Update: 2013-08-19, 14:00 UTC

The following test is ongoing: lab setup to find the root cause by testing different kernels and by applying heavy load with VMs and iSCSI targets. The Linux iSCSI implementation appears to be the source of the issue, so we are running through different scenarios to find the best solution.

Next Update: 2013-08-28, 14:00 UTC

We have already identified several contributing factors and mitigated them, as previously disclosed, for example by limiting work queues and the number of drives per storage box. However, we have yet to fully identify the root cause of the recent outages.

Our current working hypothesis is that the drives we were supplied are not suited to the use case we are using them for, combined with some network issues affecting the iSCSI connections themselves.

We will keep you posted on this until the issue is fully resolved.

Posted Aug 08, 2013 - 19:39 UTC

Resolved

The issue has been resolved. All guests are now in an active state.
Please accept our sincere apologies for the inconvenience caused.

Posted Aug 08, 2013 - 11:22 UTC

Investigating

We are experiencing an issue with one of our hosts and are working on it. Some VMs may be unavailable.

Posted Aug 08, 2013 - 11:06 UTC
