Replication Delays 2nd November
On 2nd November there were delays replicating the main database due to index rebuilding on the master server Katla. This lead to some confusion where mappers were successfully saving their changes, but not seeing the results when making subsequent map calls. The problem resolved itself a few hours later. The blocking of replication during certain stages of the index rebuilding was unexpected.
It is anticipated that index rebuilds will not take as long in the future, as there will be less to rebuild. Also, the move to NVMe-based servers will increase the I/O capacity, meaning that this sort of maintenance will be less disruptive.
Added disks in services server
Site outage 4th November
In order to increase the capacity of the services server, ironbelly, two extra disks were added. After being added, the disks must be incorporated into the RAID array to be useful. The server supports an “online” rebuild with no downtime, so that was used. However, the machine became overloaded. Services that depend on ironbelly, notably the web frontends and rails backends for NFS mounts, were affected. Service was recovered by suspending non-critical tasks on ironbelly and slowing the rate of RAID rebuiding.
OWG has been discussing migrating the Rails storage function off the services server and onto a distributed filesystem. This work is not ready yet, but it is anticipated that something like it will reduce dependence on the services server.
Failed RAID battery in forum server
Failed disk in dev server
The development and testing server, errol, had a disk which indicated failure. An attempt was made to replace it, but then it hung on boot. In case the issue was with the disk, the bad one was returned, but the error didn’t clear. It turned out that the error was the file system requiring an extraordinarily long time to replay the journal log. This required some downtime while waiting for the file system to come back online.
In the meantime, the failure alerts on the disk disappeared. Given the difficulties in replacing this disk, it was decided that we would not attempt another replacement unless there was a new alert. Issue.
Recover overwritten old minutely diffs
mmd-osm noticed that some of the oldest minutely diffs had been overwritten, perhaps due to file system corruption issues resetting the config for
osmosis replication. Thankfully,
mmd-osm was also able to provide archived copies of those diffs and they have now been restored. Issue.
Extra disks for ridley, urmel
In addition, existing 72GB disks in ridley were replaced with 146GB disks. This means that all the disks in ridley are the same size, and are interchangeable. The 146GB disks are also newer, which should mean that they are less likely to fail in the near future. Issue.
Replaced disks in replica database server
Two disks in the replica database server, ramoth, were flagged by SMART health checks as predicted failures. This usually happens with old disks as they approach the end of the “bathtub curve” and the risk of failure increases. These were replaced. Issue.
Replace RAID battery with FBWC in munin & piwik servers
As part of a general move towards FBWC (Flash-Based Write Cache) from battery-based caches, the caches in the munin server, urmel, and web site statistics server, eustace, were replaced. The FBWC is more reliable over longer periods than the battery version, as the capacitor which provides emergency power to flush the RAM cache to flash does not chemically degrade and bloat in the same way batteries tend to. Issues for urmel and eustace.
Refresh disks in web frontends and backends
To pre-emptively avoid disk failures, all the disks in the web frontends (spike-NN) and backends (thorn-NN) were replaced. The disk arrays were expanded where necessary to be able to run RAID6. In addition, the battery caches in the frontends were replaced with FBWC. Both of these measures were preventative, as the newer disks and FBWC (as mentioned above) are expected to have lower maintenance requirements in the future. Expanding to RAID6 ensures that the machine can tolerate two disk failures before a third would cause it to fail. Issue.
Replace faulty RAM in spike-01
One of the web frontends, spike-01, developed a fault in one of its RAM modules. This is not a serious failure, as the fault was detected and that module was disabled. However, it did diminish the amount of RAM available on that machine. Since compatible RAM was available spare from previously recycled machines, the RAM was replaced to bring the machine back up to a full complement. Issue.
There was an API and website outage between 18:30-19:00 UTC, which we think is related to issue 1354 as it had similar errors. Although we have not been able to reliably reproduce this error, it does appear to be related to connections not correctly being cleaned up before being returned to the thread pool. It is not clear whether this is a bug in the website code, or upstream in Rails itself. The suggested fix is to not raise exceptions which might corrupt the connection state, in which case this might have been fixed by changes to how we’re using