This is the first in a small series of posts about the work behind the scenes to move the entirety of the Realtime Trains infrastructure. This post concentrates on choosing the new datacentres and the initial work to get everything moving.
Realtime Trains is almost entirely hosted on our own equipment (servers, firewalls, routers, switches, etc.) housed in space rented within datacentres, an arrangement commonly known as co-location. We do this predominantly because it remains the most cost-effective way we have found of operating with a fairly constant base load.
We started the year with our equipment in two datacentres, known by us as York and Gatwick, which were reaching their contract end dates this year. For various reasons, including growth and disagreements with one provider, we elected to move both sites. York was our primary site and its contract was ending soonest, so we concentrated on replacing it first. We worked on the moves with Ingenio IT, who have provided infrastructure assistance to us for the last four years.
We managed the entire move with nearly zero downtime to Realtime Trains and an hour of scheduled maintenance to RailMiles. The ‘nearly zero’ downtime to Realtime Trains comes from an unfortunate set of DNS server issues we experienced across the entire fleet of three on 8th June, predominantly caused by change controls not being implemented properly. We’ve now added protection against that happening again.
Most datacentres are pretty much the same, but there are a few items of interest we look at:
- Redundant provision for electricity and network connectivity
- Information about air conditioning and air cooling (servers don’t like getting too hot)
- General tidiness around the floor, ease of access and associated security
- Good public transport links (apparently, this one is very unusual)
- Outside London, but this one didn’t stand for long
Identifying the new primary DC
The new primary DC was always going to be in the southern UK: we’ve had a couple of hardware issues over the last few years resulting in some very short notice trips north, which is something we’d prefer to avoid. Starting in February, we looked at around ten different DCs across London and the South East. We initially chose one supplier approximately 40 miles outside London but unfortunately ran into contractual issues that could not be resolved in the required timescales.
Back to the drawing board, and we eventually found ConnetU who provide services across a number of DCs in London and surrounding areas. Following two visits to one facility near London Bridge, we settled on a full rack here and started moving equipment in mid-April.
Identifying the new secondary DC
Once we had identified and completed contracts with the primary DC in April, we started hunting for a secondary DC. One additional requirement for our secondary DC was sufficient geographic distance between the two sites. A fair few providers we met in our primary search were keen to offer their other nearby DCs (think between 15 and 30 miles away) but our view is that it must be at least 100 miles away - hence previously York and Gatwick. There are a number of internet exchange points in the north, including Leeds, Manchester and Edinburgh, and our searches were largely in these vicinities.
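The separation rule above is easy to check mechanically. The following is a minimal sketch, assuming approximate city-centre coordinates rather than the locations of any actual facility; the `SITES` table and the function names are illustrative and not part of our tooling.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

# Approximate city-centre coordinates (latitude, longitude).
# Assumptions for illustration only, not the positions of real facilities.
SITES = {
    "York": (53.96, -1.08),
    "Gatwick": (51.15, -0.18),
    "London Bridge": (51.505, -0.087),
    "Manchester": (53.48, -2.24),
}

MIN_SEPARATION_MILES = 100  # the threshold quoted in the text


def distance_miles(a, b):
    """Great-circle distance between two (lat, lon) points (haversine formula)."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))


def far_enough(primary, secondary):
    """True if the two named sites are at least MIN_SEPARATION_MILES apart."""
    return distance_miles(SITES[primary], SITES[secondary]) >= MIN_SEPARATION_MILES
```

With these rough coordinates, both the old pairing (York and Gatwick) and a London and Manchester pairing pass the check, while a second site only a few tens of miles from the primary would not.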
We eventually narrowed the search to the Manchester and Leeds areas and viewed five DCs, settling on Teledata in June, predominantly on grounds of public transport accessibility. We are using IX Reach for primary connectivity at this site following a number of recommendations.
Setting up networking
We are a RIPE LIR. This means that we effectively own our own IPv4 and IPv6 space on the public internet - so when we move we don’t have to change everything. Our IP space was previously announced by our DCs, who also supplied our connectivity, and as part of this move, for greater networking flexibility, we obtained our own AS Number (AS209082).
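To illustrate why holding our own address space makes a move less painful, here is a small sketch using Python’s standard `ipaddress` module. The prefixes are placeholders taken from the ranges reserved for documentation, since our real allocations are not listed in this post, and `is_our_address` is a hypothetical helper.

```python
import ipaddress

ORIGIN_ASN = 209082  # the AS number mentioned above

# Placeholder prefixes from the documentation ranges (RFC 5737 and RFC 3849).
# Assumption: these stand in for the real allocations.
OUR_PREFIXES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_our_address(addr):
    """True if addr falls inside one of the prefixes originated by ORIGIN_ASN."""
    ip = ipaddress.ip_address(addr)
    return any(ip.version == net.version and ip in net for net in OUR_PREFIXES)
```

A service address inside one of these prefixes stays the same whichever datacentre or upstream carries the traffic; only where the prefix is announced from changes.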
Over the last few years, we’ve investigated how to use BGP[1] with our SonicWall firewalls and understood that they supported it directly. Once the network in London was set up, we subjected it to significant destructive testing and found that the firewalls were unable to work well in a high availability configuration, resulting in outages of sometimes up to 90 seconds. This was unacceptable in our eyes: RTT has had less than an hour of outages directly resulting from connectivity issues since 2013.
We’ve had a considerable amount of support from Charles Lyons at ConnetU, our London provider, with our networking setup and associated problems. With his assistance for configuration, we now have a pair of Ubiquiti ER-PRO routers in each site for high availability and are now able to reconverge on BGP within a few seconds for any networking issues.
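Failover times like these can be measured by probing a service at a fixed interval while a link or router is deliberately failed. The sketch below is one hedged way to summarise such a probe log; the `(timestamp, succeeded)` format and the `longest_outage` name are assumptions for illustration, not the tooling we actually used.

```python
def longest_outage(probes):
    """Return the longest outage, in seconds, seen in a probe log.

    probes: iterable of (timestamp_seconds, succeeded) tuples in time order.
    An outage runs from the first failed probe to the next successful one.
    An outage still in progress at the end is measured up to the last probe.
    """
    longest = 0.0
    outage_start = None
    last_timestamp = None
    for timestamp, succeeded in probes:
        last_timestamp = timestamp
        if not succeeded and outage_start is None:
            outage_start = timestamp  # first failure of a new outage
        elif succeeded and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None  # service is back
    if outage_start is not None:
        longest = max(longest, last_timestamp - outage_start)
    return longest


# Example: probes once a second, with three failed probes during a failover.
log = [(0, True), (1, False), (2, False), (3, False), (4, True), (5, True)]
print(longest_outage(log))
```

The resolution is limited by the probe interval, so a one-second interval is only good enough to tell a few seconds apart from 90.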
In the last few weeks, we’ve completed the network setup with a backhaul connection between both sites. This will primarily increase our redundancy against server failure but will also help increase redundancy of streaming data ingress for a number of services we operate.
A huge amount of technical debt has built up over the years of operating RTT. As a direct result, we essentially couldn’t copy any of the existing fleet, and a full rebuild was required. We’ll cover that in the next post.
[1] BGP is required when operating with Autonomous Systems. We’d previously had our IP space statically routed to our firewalls.