On 15 January, Realtime Trains and RailMiles suffered a 30-minute outage between 1038 and 1108 as a result of an issue with power delivery to our London DC.
Summary
Two separate issues combined to cause a loss of connectivity to our switching equipment in the London data centre (DC). The direct cause was a malfunction of the management function of the ‘A’ side power distribution unit (PDU) in our London DC, which was compounded by a wiring issue on the ‘B’ side core switch, resulting in a loss of connectivity between our routers, firewalls and servers.
Detailed reason for outage
On the evening of 14 January, the ‘A’ side PDU stopped reporting power information and lost remote reboot control, which led to a request to reboot its management interface. We contract with our transit provider in London, who liaises with the DC on our behalf. The DC provides the PDUs as part of our contract.
Our upstream links terminate on our routers and continue directly into the firewalls. Each firewall is connected directly to its core switch, with a trunk link between the core switches. We have a pair of each networking device, forming the ‘A’ and ‘B’ sides. Most of our networking equipment has only a single PSU per device, so to mitigate the risk of a power failure we run an ‘A’ side and a ‘B’ side, with each side connected to its corresponding PDU.
On installation, we mistakenly connected the ‘B’ side core switch to the ‘A’ side PDU and vice versa. This meant that the failure of either PDU would break the through path on both sides, making each PDU a single point of failure. We were already aware of a wiring issue elsewhere in the networking equipment and were planning a visit in the coming weeks to investigate further; we were not previously aware that the core switch PSUs had been cross-wired.
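To illustrate why the crossed wiring was a single point of failure, the sketch below models the rack's power feeds with illustrative labels (these are not our actual device names or configuration) and checks which sides keep a full through path when one PDU loses power.

```python
# Toy model of the rack's power wiring; device and PDU names are
# illustrative only. It checks whether a through path
# (router -> firewall -> core switch) survives on either side
# when a single PDU loses power.

# Intended wiring: each side's devices fed from its own PDU.
intended = {
    "router-a": "pdu-a", "firewall-a": "pdu-a", "core-switch-a": "pdu-a",
    "router-b": "pdu-b", "firewall-b": "pdu-b", "core-switch-b": "pdu-b",
}

# Actual wiring: the core switch PSUs were crossed at installation.
actual = {**intended, "core-switch-a": "pdu-b", "core-switch-b": "pdu-a"}

def surviving_sides(wiring, failed_pdu):
    """Return the sides that still have a full through path."""
    sides = []
    for side in ("a", "b"):
        chain = (f"router-{side}", f"firewall-{side}", f"core-switch-{side}")
        if all(wiring[dev] != failed_pdu for dev in chain):
            sides.append(side)
    return sides

for name, wiring in (("intended", intended), ("actual", actual)):
    for pdu in ("pdu-a", "pdu-b"):
        print(name, pdu, "->", surviving_sides(wiring, pdu) or "no path: outage")
```

With the intended wiring, losing either PDU still leaves one side with a complete path; with the actual wiring, losing either PDU takes out both sides.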
Failover between our London and Manchester DCs is normally performed automatically. There is a direct connection between the London and Manchester DCs, and each also connects to a third monitoring location to provide quorum. The system was designed on the assumption that there would always be connectivity between all three locations. The outage in London did not cut connectivity to the routers or firewalls, but it did cut connectivity to the monitoring servers.
There were two connected causes of the outage: the loss of power on the ‘A’ side PDU and a wiring mistake made when the core switches were installed. Our automatic failover did not run because of an unexpected split-brain scenario, and the manual failover process was unavailable for the same reason.
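As a rough illustration of how this kind of split brain can deadlock a quorum-based design, the sketch below models a simplified majority check. The decision rules and reachability values are assumptions for the example, not our actual failover implementation: the point is that when London's edge stayed reachable while the servers behind it did not, the observers disagreed and no failover was triggered.

```python
# Simplified quorum-style failover sketch; the rules and the observed
# reachability below are illustrative assumptions only.

def failover_decision(reachability):
    """reachability maps each location to the set of locations it can reach."""
    votes_down = sum(1 for site, seen in reachability.items()
                     if site != "london" and "london" not in seen)
    votes_up = sum(1 for site, seen in reachability.items()
                   if site != "london" and "london" in seen)
    if votes_down > votes_up:
        return "fail over to manchester"
    if votes_up > votes_down:
        return "stay on london"
    return "no quorum: do nothing"   # the split-brain case

# 15 January (assumed view): London's edge stayed up, but the servers
# behind it were unreachable, so the two observers disagreed.
observed = {
    "london": set(),                      # monitoring servers cut off
    "manchester": {"monitor"},            # cannot reach the London servers
    "monitor": {"london", "manchester"},  # still sees London's edge
}
print(failover_decision(observed))        # -> "no quorum: do nothing"
```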
Mitigations
We will be making changes to the power delivery of the core switches before the end of the month. We have purchased equipment that will allow the ‘A’ side to receive power from both PDUs. To avoid this new equipment becoming a single point of failure itself, the ‘B’ side will remain connected only to the ‘B’ PDU.
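As a rough sketch of the intended end state, the example below assumes the new equipment behaves like a dual-input transfer switch feeding the ‘A’ side core switch; labels are illustrative, as before.

```python
# Planned power arrangement, assuming the new equipment gives the 'A' side
# core switch a feed from both PDUs. Names are illustrative only.

planned = {
    "router-a": {"pdu-a"}, "firewall-a": {"pdu-a"},
    "core-switch-a": {"pdu-a", "pdu-b"},   # dual feed via the new equipment
    "router-b": {"pdu-b"}, "firewall-b": {"pdu-b"},
    "core-switch-b": {"pdu-b"},            # deliberately left on 'B' only
}

def surviving_sides(wiring, failed_pdu):
    """Sides that keep a full router -> firewall -> core switch path."""
    ok = []
    for side in ("a", "b"):
        chain = (f"router-{side}", f"firewall-{side}", f"core-switch-{side}")
        if all(wiring[dev] - {failed_pdu} for dev in chain):
            ok.append(side)
    return ok

for pdu in ("pdu-a", "pdu-b"):
    print(pdu, "fails ->", surviving_sides(planned, pdu))
# pdu-a fails -> ['b'];  pdu-b fails -> ['a']  (no single PDU takes out both)
```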
Adjustments are being made to the failover arrangements over the next few weeks, and we expect this work to be complete by the end of February.
Detailed event overview
3 December 2019 18:56
We reported an issue to our London provider regarding the ‘A’ side PDU management not reporting the current power output of the unit.
4 December 2019 12:45
The PDU's management interface was rebooted and functionality was restored.
14 January 2020 22:20
We noticed the same issue had recurred and reported the issue again to the provider.
14 January 2020 22:40
Our provider advised they would report the issue to the DC, and we discussed timings for potentially replacing the PDU, given the rarity of such a fault occurring twice. We replied that replacing it straight away might be problematic and that we needed to check our documentation before proceeding, as we suspected a wiring issue, although in the networking equipment rather than the power.
15 January 2020 10:30 (approx)
The DC rebooted the management interface of the PDU, which had an unexpected effect on the functionality of the PDU itself.
15 January 2020 10:38
Our external monitoring alerts us to a loss of connectivity to services in our London DC. The automatic failover functionality fails due to a split-brain scenario, but we retain out-of-band access to the rack. Shortly afterwards, we receive communications from our provider advising they have seen the loss of one BGP link, as would be expected if power were lost.
15 January 2020 10:50
We finish our investigations and conclude the ‘A’ side has lost power. This is reported to our provider and the DC. We identify a previously unknown wiring issue with our core switching on the ‘B’ side, which was unexpectedly powered from the ‘A’ side, meaning there was no through path for any connectivity on the ‘B’ side while the ‘A’ side was powered off.
15 January 2020 11:08
Power to the ‘A’ side is restored and services come back online.