August 2016

OWG Meeting

OWG held a meeting on 2016-08-01 from 18:00 to 19:13 UTC to discuss the items below.

Present: Tom, Andy, Sarah, Matt. Apologies: Grant.

Adopt board Circular Resolutions

Making decisions is hard, and even more so without any procedure for doing so. To make decision-making easier, the OWG agreed to adopt the board’s circular resolutions procedure, with the single change that any such resolutions will be minuted in the OWG log rather than in the minutes of the next meeting. We will revisit how well this is working in approximately six months’ time to see whether further tweaks are necessary.

Database: replication, failover, new hardware

The upgrade from PostgreSQL 9.1 to 9.5 took 25 seconds, and it looks like we can complete the upgrade and roll out karm within a one-hour, mostly read-only period, with 5-10 minutes of actual downtime. The disk on karm would be full if we retained the 1.7 TB of WAL archives that we currently keep on katla, but we could store them elsewhere or expand the disks in karm (NVMe or SAS/SATA).
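
The minutes don’t record the exact upgrade procedure. As a minimal sketch, assuming a link-mode pg_upgrade run (link mode hard-links data files into the new cluster rather than copying them, which is what makes a 25-second upgrade of a large database plausible); the paths below are illustrative, not karm’s actual layout:

```python
# Illustrative only: link-mode pg_upgrade from 9.1 to 9.5.
# Run as the postgres user with both clusters stopped; paths are assumptions.
import subprocess

OLD_BIN = "/usr/lib/postgresql/9.1/bin"
NEW_BIN = "/usr/lib/postgresql/9.5/bin"
OLD_DATA = "/var/lib/postgresql/9.1/main"
NEW_DATA = "/var/lib/postgresql/9.5/main"

subprocess.run(
    [NEW_BIN + "/pg_upgrade",
     "-b", OLD_BIN,   # old bindir
     "-B", NEW_BIN,   # new bindir
     "-d", OLD_DATA,  # old data directory
     "-D", NEW_DATA,  # new data directory
     "-k"],           # --link: hard-link files instead of copying
    check=True,
)
```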

As ramoth, the replica database, has had several failures over the past months, the best option would seem to be to replace it with a machine almost identical to karm. OWG will aim to keep at least 3 database servers in operation at any one time.

Being multi-site: Slough, getting other sites

To spread the risk of single-site outages, OWG agrees that it would be good to add at least one other site to our network. To start with, OWG members will try reaching out through personal networks. Another option is reaching out to Exonetric to see if they’d be happy to take a currently-unpowered server at UCL. OWG could also reach out to commercial providers, and consider paying commercial rates. Given that the existing 3 sites are all in the UK, OWG would prefer options in the EU.

Virtualisation

The potential benefit is that images / machines can be moved around more easily. The downsides are that moving storage is not easy, and there’s a large amount of initial work and a steep learning curve. It should be noted that container-based systems aren’t currently advised for security isolation - that would need a full VM. We need to consider carefully what the benefits are over plain Chef.

Un- or under-used machines

We are running a Gosmore backend, but is it getting any traffic? We had a Forum machine, but the Forum was never moved to it, and it’s now shut down.

Backups: where, when, wal-e?

Discussed tarsnap as a potential backup option; we need to look at the cost of this and of Glacier-based backup. Getting the encryption right is important. For online systems, there are always “root” credentials which can wipe everything, so a secondary concern is where and how to keep backups accessible but not deletable. For “offline” systems, where physical access is limited, the risk is that we may not be able to get physical access when we need it.
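
For the “accessible but not deletable” point, one option - a sketch only, with an assumed vault name and file path, not an agreed design - is to push locally-encrypted archives to Glacier using credentials that are allowed to upload but not to delete:

```python
# Sketch: upload an already-encrypted backup to Amazon Glacier with boto3.
# The IAM user running this would be granted glacier:UploadArchive but not
# glacier:DeleteArchive or glacier:DeleteVault, so a compromised host can add
# archives but cannot wipe existing ones. Vault name and path are invented.
import boto3

VAULT = "osm-db-backups"                        # assumed vault name
ARCHIVE = "/backups/katla-2016-08-01.tar.gpg"   # encrypted before upload

client = boto3.client("glacier")
with open(ARCHIVE, "rb") as f:
    resp = client.upload_archive(
        vaultName=VAULT,
        archiveDescription="weekly base backup",
        body=f,
    )
print("archive id:", resp["archiveId"])
```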

Aerial imagery: we need a policy.

Aerial imagery can be a very useful source of information for mappers, yet no one seems to be running a service which can host it and make it available to them.

Problems: the existing solution is pretty ad hoc, and we host a mixed bag of not-well-known sources. It’s not advertised, and the existence of the layers is not well known. It might snowball into a very big thing, and keeping it running would divert resources and effort from core services. Possible policies to try out initially:

Question: Should we aim to redistribute the source imagery, rather than the tile service?

Question: Is there a distinction between sources and service? Can we ask Amazon if they would host the raw imagery?

OpenStreetMap-Carto 2.42.0

New release. More details.

Outage 2016-08-02

At around 18:30 UTC, the replica database, ramoth, spontaneously rebooted, causing intermittent outages until about 19:50 UTC as load was moved back to the master database server.

Outage 2016-08-03

The services and planet server, ironbelly, experienced an unknown technical issue that caused it to hang. It was rebooted at 08:30 UTC, but this caused a temporary outage because the application servers mount the services server using NFS, and the NFS protocol isn’t tolerant of failures.

Pokemon Go

At least two Pokemon Go “finder” maps started heavily using OSM tiles in violation of the Acceptable Usage Policy. The increased load was enough to push the current infrastructure over the edge, and it started to perform noticeably poorly, including dropping many render requests. The sites were blocked, but one deploy had a regular expression syntax mistake which resulted in tiles being blocked for every browser, effectively a 30-minute outage of the web map. Block for pmap.kuku.lu, block with syntax error, block for fastpokemap.com.
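
The actual blocks live in the tile delivery configuration (linked above), and the minutes don’t say exactly what the mistake was. Purely as an illustration of the failure mode, with made-up patterns, a single stray character in an alternation is enough to turn a targeted referer block into a block on every request:

```python
# Illustration only, not the real tileserver config: a stray trailing '|'
# adds an empty alternative to the pattern, and an empty alternative matches
# every referer - so "block two sites" becomes "block everything".
import re

intended = re.compile(r"pmap\.kuku\.lu|fastpokemap\.com")
broken = re.compile(r"pmap\.kuku\.lu|fastpokemap\.com|")  # note trailing '|'

for referer in ("https://pmap.kuku.lu/", "https://www.openstreetmap.org/"):
    print(referer,
          "intended:", bool(intended.search(referer)),
          "broken:", bool(broken.search(referer)))
```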

Improving tile rendering

A discussion started around how to improve the efficiency of tile rendering across multiple machines. The number of tile requests to the OSM tile servers has got to the point where each server finds it difficult to keep up with the minutely diffs and requests from mappers. This is clearly undesirable, but simply adding more servers may not help unless they are able to share the load between themselves. More details.
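
One way to share the load - a sketch of the general idea only, not a design that was agreed or implemented - is to partition metatiles deterministically across the render servers, so that each metatile, and the database queries it triggers, is handled by exactly one machine:

```python
# Sketch: hash metatile coordinates to pick a single responsible render
# server. Server names are placeholders; mod_tile's standard 8x8 metatile
# size is assumed.
import hashlib

RENDER_SERVERS = ["render1", "render2", "render3"]  # placeholder names

def owner(z, x, y, metatile=8):
    """Return the server that should render the metatile containing z/x/y."""
    mx, my = x // metatile, y // metatile
    key = "{}/{}/{}".format(z, mx, my).encode()
    digest = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")
    return RENDER_SERVERS[digest % len(RENDER_SERVERS)]

print(owner(15, 16370, 10900))
```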

To help figure out which queries were most in need of attention and possible optimisation, a log of rendering-related PostgreSQL queries was released. Issue.
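
Assuming the log uses the standard “duration: N ms  statement: …” lines that log_min_duration_statement produces (the released log’s exact format may differ), a rough way to rank query shapes by total time spent:

```python
# Sketch: rank query shapes in a PostgreSQL duration log by total time.
# Assumes standard "duration: N ms  statement: ..." lines; filename is made up.
import re
from collections import defaultdict

LINE = re.compile(r"duration: ([\d.]+) ms\s+(?:statement|execute [^:]*): (.*)")

totals = defaultdict(float)
with open("rendering-queries.log") as log:
    for line in log:
        m = LINE.search(line)
        if not m:
            continue
        ms, sql = float(m.group(1)), m.group(2)
        # Crude normalisation: replace literals so similar queries group together.
        shape = re.sub(r"\b\d+\b|'[^']*'", "?", sql)[:120]
        totals[shape] += ms

for shape, ms in sorted(totals.items(), key=lambda kv: -kv[1])[:10]:
    print("{:12.1f} ms  {}".format(ms, shape))
```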

New rendering server

A new rendering server, called vial, was generously donated by Komяpa, a long-time community member.

Tests!

A suite of tests is being created for the various Chef recipes. This will make it easier to understand what the recipes are doing, to maintain them, to replicate their effects on other systems, and to make changes with confidence that they’re Doing the Right Thing.