We would like to share some details on the power outage incidents that occurred on October 2nd and October 7th, and what we are doing to prevent them from happening in the future.
What happened
On October 2nd, our Burnaby Datacentre (YVR1) lost primary power at 9:40 AM PDT. A grid power outage lasting approximately 10 seconds occurred at this time. Our primary UPS failed to transfer the datacentre load to itself, and instead faulted and cut the load power completely. Power was fully restored at 11:09 AM PDT by our on-site technician, who manually instructed the UPS to turn the load power back on. Customer servers and remaining services were then restored sequentially, and we had technicians on-site for the remainder of the day bringing back anything still affected.
Before we could complete our planned maintenance with our vendor, scheduled for October 8th, we encountered another grid power fluctuation on October 7th that took down power again. The UPS again kept the load power off after grid power was restored. Because our team happened to be near the datacentre at the time, we were able to restore everything in under 30 minutes.
Status updates and complete timeline of events can be found on our external status page, status.cloudsingularity.net.
How we fixed it
Working with the UPS manufacturer, we narrowed the failure down to two possible causes:
- there was a power module fault, or
- one or more battery modules needed replacing
The fault really came down to those two components. The initial diagnostics pointed to the power module, which made sense as it is responsible for regulating and controlling the power, but in the end it turned out to be one or more battery modules that needed replacement. In short, the UPS has 3 strings of battery modules, and each string contains 4 modules. The UPS only needs a single healthy string to operate and power the load, meaning the other two strings can fail without impact, in essence making the battery backup N+2 redundant.
String A: A1 | A2 | A3 | A4
String B: B1 | B2 | B3 | B4
String C: C1 | C2 | C3 | C4
However, if a single battery module fails, the UPS cannot use that entire string. We believe one or more modules failed in each of the three strings, leaving the UPS with no battery power to transfer the load to.
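To make the failure mode concrete, here is a minimal sketch (in Python, purely illustrative; the UPS itself obviously does not run this) of the string logic described above: one failed module disables its entire string, and with a bad module in each of the three strings, no battery power is available for the transfer.

```python
# Illustrative model of the battery layout described above:
# 3 strings (A, B, C) of 4 modules each. True = module healthy.
from typing import Dict, List

modules: Dict[str, List[bool]] = {
    "A": [True, True, False, True],   # one failed module disables string A
    "B": [True, False, True, True],   # ... and string B
    "C": [False, True, True, True],   # ... and string C
}

def string_available(string_modules: List[bool]) -> bool:
    """A string is only usable if every module in it is healthy."""
    return all(string_modules)

def ups_has_battery_power(layout: Dict[str, List[bool]]) -> bool:
    """The UPS can transfer the load to battery if at least one full string is usable."""
    return any(string_available(mods) for mods in layout.values())

if __name__ == "__main__":
    for name, mods in modules.items():
        print(f"String {name} usable: {string_available(mods)}")
    # With one failed module in each string, no string is usable,
    # so there is nothing for the UPS to transfer the load onto.
    print("UPS battery power available:", ups_has_battery_power(modules))
```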
We ended up replacing all 12 battery modules, and will be running further diagnostics on them.
Improvements for the future
In order to prevent this type of failure across multiple battery strings, we are planning to implement more stringent maintenance of the UPS. Traditionally, UPS maintenance by the vendor is done twice per year. This was clearly not enough, so we will be performing monthly testing of the UPS using both its built-in self-test diagnostics and external testing when necessary. These tests will be scheduled as normal planned maintenance windows.
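As a rough illustration only of what automating the monthly built-in self-test could look like, assuming a UPS monitored via Network UPS Tools (NUT), which is just one possible tooling choice and not necessarily what our vendor's equipment uses, a small wrapper like the following could trigger the quick battery test and log the result during a maintenance window:

```python
# Illustrative sketch: trigger a UPS built-in battery self-test via
# Network UPS Tools (NUT) and record the outcome. The UPS name, user,
# and log path below are hypothetical; adapt to the actual tooling in use.
import datetime
import subprocess

UPS_NAME = "yvr1-ups@localhost"   # hypothetical NUT UPS identifier
NUT_USER = "upsadmin"             # hypothetical NUT user with instant-command rights

def run_battery_self_test() -> None:
    # Ask the UPS to start its quick battery self-test.
    # upscmd will prompt for the password; in practice credentials would
    # come from secure configuration rather than an interactive prompt.
    subprocess.run(
        ["upscmd", "-u", NUT_USER, UPS_NAME, "test.battery.start.quick"],
        check=True,
    )

def log_test_result() -> None:
    # Read back the last self-test result reported by the UPS driver.
    result = subprocess.run(
        ["upsc", UPS_NAME, "ups.test.result"],
        check=True, capture_output=True, text=True,
    )
    timestamp = datetime.datetime.now().isoformat(timespec="seconds")
    with open("/var/log/ups-selftest.log", "a") as log:
        log.write(f"{timestamp} {UPS_NAME} {result.stdout.strip()}\n")

if __name__ == "__main__":
    run_battery_self_test()
    log_test_result()
```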
Conclusion
We understand how critical it is for your services to be available at all times, and the trust placed in us as your infrastructure service provider. Any form of downtime is regrettable, and we are committed to continuously improving our infrastructure. We have some exciting changes coming up over the next few months that will vastly improve both our infrastructure redundancy and network resiliency. More details will be forthcoming in the next few weeks.