Outage storage systems

20-10-2019 06:23:00 - 21-10-2019 17:40:00
Status: resolved

As of 06:23h we're again experiencing problems with the availability of our shared storage systems. Engineers are investigating this outage. As a result, our shared Linux hosting platform and customers using shared storage solutions provided by BIT are experiencing degraded availability.


Update placed at 07:42h
The instability of the storage platform has been resolved. Unfortunately we are hitting a bug in the Ceph MDS systems which causes problems for services on CephFS. We are still investigating what the bug entails and what triggers it. The investigation is complicated by the fact that the bug does not occur when the software is put in debug mode, a so-called Heisenbug.

Update placed at 10:48h
We are going to make configuration changes to prevent this issue from happening. If we do hit the bug again, these changes should lead to shorter downtime of the shared filesystem (faster failover). While we are reconfiguring the system, the shared filesystem will not be available.

Update placed at 12:40h
The configuration changes are in place. We hope to achieve a more stable system this way. There is a noticeable performance impact, but as far as we can tell it is acceptable.

Update placed on 21-10-2019 at 07:50h
Unfortunately we're observing instability on the storage platform again. Additional changes have been made to increase stability.

Update placed at 08:04h
The platform has been stable again since 07:54h.

Update placed at 08:33h
We still observe intermittent instability on the platform. Engineers are still investigating and making changes to improve stability.

Update placed at 11:12h
At around 10:45h we implemented a fix on the cluster that was recommended to us by the CephFS developers. Before that change the cluster became unstable at irregular intervals, so we cannot yet be sure that the fix has solved the problem definitively. Our engineers continue to monitor the system meticulously.

Update placed at 17:31h
The fix advised by the CephFS developers has cleared the instability of the cluster. It has not solved the underlying root cause of the instability. A change is in preparation that will help identify the root cause. After enabling that change we can start working on solving the root cause. We will apply the change during an emergency maintenance window that is yet to be announced.

The incident has been resolved. If you have any questions regarding this incident, please contact the Customer Care department by phone on +31 (0)318 648 688 or by email on support@bit.nl.