Today we performed a root cause analysis of the host crash.
Two things have been identified:
1) Our storage system dropped iSCSI connections between the storage nodes and the client physical hosts, which rendered the VM guests' drives on those hosts read-only.
2) Guests with drives in read-only or offline mode caused the physical host to crash.
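For customers who want to check whether a guest was affected, the read-only condition described above can be detected from inside a Linux guest. The following is a minimal sketch, not part of our tooling: it assumes a Linux guest whose drives appear under `/dev/`, and the function name `read_only_mounts` and the sample data are purely illustrative.

```python
def read_only_mounts(mounts_text, device_prefix="/dev/"):
    """Return (device, mount_point) pairs that are mounted read-only.

    mounts_text is expected in the /proc/mounts format:
    device mount_point fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fstype, options = fields[:4]
        if not device.startswith(device_prefix):
            continue  # ignore pseudo filesystems such as proc or sysfs
        if "ro" in options.split(","):
            result.append((device, mount_point))
    return result


# Illustrative sample only; the device names are hypothetical.
SAMPLE = """\
/dev/vda1 / ext4 rw,relatime 0 0
/dev/vdb1 /data ext4 ro,relatime 0 0
proc /proc proc rw,nosuid 0 0
"""

affected = read_only_mounts(SAMPLE)
```

On a real guest, the same function could be fed the contents of `/proc/mounts` instead of the sample text. A drive that is intentionally mounted read-only would also be reported, so the result is only a hint, not proof of a dropped iSCSI connection.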
The storage system's behaviour is suspected to be correlated with too many iSCSI connections/targets, with a job that mass-deleted drives and ran at the same time the problems occurred, or with both. To confirm these hypotheses, we are recreating the situation in our lab.
In the meantime, as a precaution, these are the lessons learned and the actions taken to prevent the same problem, assuming our hypotheses are correct:
Regarding the host crash itself, we are still investigating why a host would crash merely because drives are in a read-only state.
The cause should be limited to the kernel itself, where the iSCSI management code resides, so we will now test other kernels and see whether we can reproduce the problems in the lab.
Next Update: 2013-08-09, 12:00 UTC
We are in the process of recreating the situation in our lab. The results and the analysis should be ready at the beginning of next week.
Next Update: 2013-08-12, 14:00 UTC
We are still performing the root cause analysis of last week's host crash.
The following actions have been taken:
Next Update: 2013-08-14, 14:00 UTC
We are still performing the root cause analysis of last week's host crash.
The following actions have been taken since the last update:
Next Update: 2013-08-16, 14:00 UTC
We are still performing the following test:
- Lab setup to find the root cause by changing kernels and by loading hosts heavily with VMs and iSCSI targets - IN PROGRESS
Next Update: 2013-08-19, 14:00 UTC
The following test is ongoing:
- Lab setup to find the root cause by changing kernels and by loading hosts heavily with VMs and iSCSI targets.
The Linux iSCSI code seems to be the source of the issue, so we are running different scenarios to find the best solution.
Next Update: 2013-08-28, 14:00 UTC
We have already identified and mitigated several factors that lead to problems, as previously disclosed, for example by limiting work queues and the number of drives per storage box. However, we have yet to fully identify the root cause of the last outages.
Currently, our working hypothesis is that the drives we were supplied are not suited to our use case, combined with some network issues affecting the iSCSI connections themselves.
We will keep you posted on this until the issue is fully resolved.