Disturbance on customer storage and customer VMs

17-12-2014 12:29:55 - 19-12-2014 11:41:55
Status: resolved

We have noticed a disturbance on our customer storage and customer VM platform.

The underlying cause is being investigated by our engineers.

Follow this incident on our website for updates.
Update placed at 12:53
During a forced takeover to the other filer in the MetroCluster, the redundant filer became unavailable. That caused bit.nl to be unavailable. In such cases, bit.org becomes available for updates on incidents. One of the filers is available again and bit.nl can be checked for updates. The cause and impact of the incident are under investigation.
Update placed at 12:56
We are speaking with NetApp TAC on this matter. BIT engineers are working to get all services back online. Most services are back; we are still working on shared Windows websites.
Update placed at 13:05
One of the two filers in the MetroCluster is back online. That means the storage is not redundant yet.
Update placed at 13:29
Specialists from the storage supplier have arrived. Together with BIT engineers they are investigating the cause of the incident.
Update placed at 13:40
The root cause of the incident has not been found yet. Until it has been found, we will not make the storage redundant again, because doing so might trigger the problem again. All services except Windows websites are available again. Please contact us if you continue to experience problems.

Update placed at 14:10
Windows websites are back online. Please contact us if you continue to experience problems.

Update placed at 15:57
Our vendor has done an initial investigation of the crash reports and has concluded that the crash of the first filer was not related to hardware or high availability. Therefore we have re-enabled the MetroCluster and the setup is redundant again. It will take our vendor another 48 hours to do a complete analysis of the crash reports. In addition, a bug has been identified that caused the crash of the second filer.

Update placed at 17:30
At this moment the MetroCluster is stable. However, each filer has one defective disk. These disks need to be replaced by spares, which will take approximately 10 hours. Until the replacement disks are operational, neither filer has any spare disks. Data is always stored on both filers in the MetroCluster, so it is still redundant. BIT engineers will stay on site until the MetroCluster is back in a completely stable state.

Update placed at 00:40
After replacing the first disk, the tests we performed did not give the expected and desired result. We are discussing the issue with NetApp technical support and are trying to find a way to resolve it.

Update placed at 04:49
We have done a fair amount of troubleshooting. We are going to perform some more tests to make sure we have isolated the issue. As soon as those tests have been conducted we will perform maintenance. We will describe the possible impact of this maintenance in a new update.

Update placed at 07:08
We have created a plan to recover from the issues we are facing. The plan is being executed at the moment. Based on all the tests we have done, we should be able to resolve all issues. As soon as the procedures and tests have been performed we will add an update.

Update placed at 08:32
All maintenance to restore fully redundant operation has been completed successfully, and all required checks have been performed. The NetApp MetroCluster is fully redundant again.

Update placed at 11:32
NetApp Technical Support and our storage supplier have gone through all logs and have not found any discrepancies. However, because we want to assure ourselves that the NetApp MetroCluster is functioning correctly (in case of maintenance or an emergency), we will test the MetroCluster functionality in an emergency maintenance window on Friday 19 December at 08:30. We will perform a "takeover / giveback" of filer NetApp2. We do not expect any impact, but the risk of impact is increased during the test. This emergency maintenance will also be announced in a separate emergency maintenance notice.

Update placed at 19-12-2014 - 11:37
The takeover / failover test was completed successfully. An RFO (in Dutch) will appear in the Dutch news section soon.