In the last post, I explained how we began the move of Realtime Trains: the search for a new pair of datacentres and the networking setup. This post describes the internal server infrastructure and how it’s set up. I’m not going to directly mention any of the technical debt that has built up over the years, but it should be clear that there was a lot. As an advance warning, this post is quite technically detailed.
From 2013 until late 2015, RTT ran on a geographically separated master/standby pair of Dell R210 II servers, along with a few rented servers in Europe. In late 2015, when the servers moved into York, I acquired several HP DL360 Gen9 machines, which remain the mainstay of the fleet. The Gatwick DC was introduced in August 2016 with a single DL360.
As part of the move to London and Manchester, several new DL360 servers have been acquired. The R210s are now used for development, having served for 6 years in production. VMware ESXi is now used on all servers; previously, a few machines ran KVM.
To put it politely, it was an absolute mess. Configuration management was, for a long time, non-existent. A large number of VMs did gain it, through Puppet, towards the tail end of 2018, but a reasonable proportion did not. A considerable number of nodes were also coded on an individual basis.
RTT v2 [1] has several distinct systems. The real-time and timetable processing components are two distinct monolithic binaries, written in Java. The website is written in PHP, but is generated and deployed with a complicated combination of multiple shell scripts linked with multiple runs of PHP. These shell scripts largely broke in 2015 and it has been challenging to do anything to the site since then. [2]
A central tenet of the new setup, as one would reasonably expect in this day and age, is a high level of automation and management. I need not carry on about its virtues, but it has already paid dividends. We continue to use Puppet predominantly, as it has served us well, with some infrequent work in Ansible. For infrastructure, we are in the early stages of working with Terraform.
A comprehensive rewrite of the Puppet code to the roles and profiles pattern has led to much greater control. All source code is now in an internal GitLab installation, using its CI feature, with a separate new in-house product used for artifact management and rapid deployment. [3] Honestly, I underestimated the work involved in a lot of this, as I thought more was already centralised: even things like the public API weren’t entirely under source control, with revisions made locally on the server.
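For anyone unfamiliar with the roles and profiles pattern, here is a minimal sketch of the idea, modelled in Python rather than Puppet purely for illustration. Every name in it (the profiles, roles, node and resources) is hypothetical and not taken from our manifests. The point is the layering: a node gets exactly one role, a role only lists profiles, and a profile wraps the actual pieces of technology.

```python
from typing import Dict, List

# Profiles each wrap one piece of technology (all names are hypothetical).
PROFILES: Dict[str, List[str]] = {
    "profile::base": ["ntp", "ssh", "monitoring_agent"],
    "profile::webserver": ["nginx", "php_fpm"],
    "profile::tls": ["certificate_sync"],
}

# Roles describe what a machine is for and only ever list profiles.
ROLES: Dict[str, List[str]] = {
    "role::website": ["profile::base", "profile::webserver", "profile::tls"],
    "role::build_agent": ["profile::base"],
}

# Each node is assigned exactly one role.
NODES: Dict[str, str] = {"web1.example": "role::website"}


def resources_for(node: str) -> List[str]:
    """Resolve a node to the ordered, de-duplicated resources it should manage."""
    resources: List[str] = []
    for profile in ROLES[NODES[node]]:
        for resource in PROFILES[profile]:
            if resource not in resources:
                resources.append(resource)
    return resources


if __name__ == "__main__":
    print(resources_for("web1.example"))
```

The benefit is that changing what a website node looks like means editing one role, rather than every individually coded node.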
My personal milestone for the move was the introduction of TLS encryption to the RTT website. We had a few services running HTTPS, predominantly the API and the fundraising pages, but each renewal required manual administrative effort. Let’s Encrypt automates certificate issuance and all our services now use it, with certificates issued through certbot using the DNS-01 challenge. Because of the way the new internal setup provides high availability, certificates are centrally managed and distributed.
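To give a feel for what the DNS-01 challenge involves, here is a minimal sketch of how the TXT record is derived, following the definition in RFC 8555. In practice certbot and its DNS plugin do all of this for you; the function names, domain, token and thumbprint below are placeholders for illustration, not anything from our setup.

```python
import base64
import hashlib
from typing import Tuple


def b64url(data: bytes) -> str:
    """Base64url-encode without '=' padding, as ACME requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def dns01_record(domain: str, token: str, account_thumbprint: str) -> Tuple[str, str]:
    """Return the (name, value) of the TXT record answering a DNS-01 challenge.

    token comes from the ACME server; account_thumbprint is the base64url
    JWK thumbprint of the ACME account key. Both are placeholders here.
    """
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    # A wildcard name is validated against its base domain.
    base = domain[2:] if domain.startswith("*.") else domain
    return f"_acme-challenge.{base}", b64url(digest)


if __name__ == "__main__":
    name, value = dns01_record("example.com", "placeholder-token", "placeholder-thumbprint")
    print(name, "TXT", value)
```

Because validation happens entirely through DNS, the machine requesting the certificate does not need to be the one serving the website, which is what makes central issuance and distribution practical.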
A lot of the work to date has involved simply copying and transferring source code and binaries to their new appropriate locations. As time has moved on, so has technology, and thus a not insignificant proportion of code required updating.
The RTT website, as mentioned previously, is written in PHP. When first released, it ran on PHP 5.6. PHP is now on version 7.3. RTT uses MongoDB as one of four database engines, and the library the code initially used has been deprecated. The replacement driver operates at a lower level, with a higher-level client abstraction provided by a Composer package, and hence some additional development was required.
The public API [4] ran on Django 1.8 LTS, on Python 2.7, and hadn’t been updated since its initial release in 2016. Python 2.7 reaches the end of its support at the end of this year, and Django 1.8 went out of support a year ago. Needless to say, it’s now all up to date and running on Python 3.7.
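For a flavour of what a Python 2.7 to 3.7 port typically involves, here is a small sketch of three common changes. These are generic examples with hypothetical function names, not code from the API itself.

```python
def to_text(value, encoding="utf-8"):
    """Return str for bytes or str input.

    Python 2 mixed byte strings and unicode implicitly; Python 3 does not,
    so bytes have to be decoded explicitly.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


def page_count(total_items, page_size):
    """Number of pages needed to show total_items.

    Uses // because / on two integers returns a float in Python 3.
    """
    return (total_items + page_size - 1) // page_size


def summarise(counts):
    """Format a dict of counts; items() replaces Python 2's iteritems()."""
    return ", ".join(f"{name}={n}" for name, n in sorted(counts.items()))


if __name__ == "__main__":
    # print is a function in Python 3, not a statement.
    print(to_text(b"York"), page_count(101, 25), summarise({"b": 2, "a": 1}))
```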
In the final post in the series, I’ll cover the lengths taken to ensure as little downtime as possible, where it went wrong (those pesky DNS servers) and what’s next for Realtime Trains.
1. The currently live version, released in September 2013.
2. TLS encryption being a prime example: the URLs across most of the RTT website were hardcoded to HTTP. TLS encryption was enabled in August 2019 as part of the move to London.
3. A new version of RailMiles, another product we run and will occasionally mention on the blog, can now be released in around a second once the artifact is generated by GitLab CI. The RTT website will make a new release in 5 seconds across both sites.
4. There is a second version of the API, provided to commercial clients with some SLA guarantees.