Replace failed RAID battery in Noquiklos
Noquiklos, the GPS tile server, had a failed RAID array battery, which was degrading performance. The battery was replaced with a flash-backed write cache. Issue.
Upgrade Clifford from G5 to G6
Clifford, the soon-to-be forum server, previously ran on “Generation 5” HP hardware. The “Generation 6” hardware is both lower-power and more reliable, so the forum should be much improved, along with the power usage at UCL. Issue.
Replace failed fans & disk in Yevaud
Internal fans had failed in yevaud, one of the tile rendering servers. Although the warranty expired long, long ago, the supplier (now known as Zeta Systems) provided us with replacement parts free of charge, which was very nice of them. Issue.
The failing disk was finally replaced in yevaud, after much drama. Issue.
Replace RAM in Ridley
One stick of RAM in ridley, the web server for OSM Foundation-related sites, had a reported ECC error, which is often an indicator of impending failure. As we had suitable RAM rescued from other systems, it seemed like a good opportunity to replace it all and double the memory available in that machine. Issue.
Base backup on karm
A base backup of the main database was taken on karm, which should allow it to start being used as a replica database server. This will mean we’re no longer reliant on ramoth, which is great as it has not always been reliable and is particularly ungraceful when coming back from a cold boot. Issue.
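For context, a base backup of this kind is normally taken with PostgreSQL’s pg_basebackup tool. The sketch below is only an illustration of the idea, wrapped in a small Python helper; the host, user and target directory are placeholders, not the production settings.

```python
# Illustrative sketch only: take a PostgreSQL base backup with pg_basebackup.
# Host, user and target directory are placeholders, not OSM's real settings.
import subprocess

def take_base_backup(host: str, user: str, target_dir: str) -> None:
    # -D: where to write the backup, -P: show progress,
    # -X stream: stream the WAL needed to make the backup consistent.
    subprocess.run(
        [
            "pg_basebackup",
            "-h", host,
            "-U", user,
            "-D", target_dir,
            "-X", "stream",
            "-P",
        ],
        check=True,
    )

if __name__ == "__main__":
    take_base_backup("master.example.org", "replication", "/srv/postgres/base")
```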
CGImap 0.5.5 deployed
This version of CGImap allows editors to avoid problems with per-IP address rate limiting by using OAuth to sign all API requests. This makes CGImap perform rate-limiting on a per-user rather than per-IP address basis. Issue, mailing list announcement.
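As a rough sketch of the idea only (CGImap itself is written in C++, and the limits and names below are invented for illustration), the effect is that the rate-limit bucket is keyed by the authenticated user when a request carries a valid OAuth signature, and by the client IP otherwise:

```python
# Sketch of per-user vs per-IP rate limiting; not CGImap's actual implementation.
import time
from collections import defaultdict
from typing import Optional

WINDOW_SECONDS = 60
MAX_REQUESTS = 300  # invented limit, for illustration only

_history = defaultdict(list)  # rate-limit key -> recent request timestamps

def rate_limit_key(client_ip: str, oauth_user_id: Optional[int]) -> str:
    # With a valid OAuth signature the user is known, so mappers sharing an
    # IP address (e.g. at a mapping party) no longer share a single bucket.
    if oauth_user_id is not None:
        return f"user:{oauth_user_id}"
    return f"ip:{client_ip}"

def allow_request(client_ip: str, oauth_user_id: Optional[int]) -> bool:
    key = rate_limit_key(client_ip, oauth_user_id)
    now = time.monotonic()
    recent = [t for t in _history[key] if now - t < WINDOW_SECONDS]
    recent.append(now)
    _history[key] = recent
    return len(recent) <= MAX_REQUESTS
```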
Better logging for export failures
Errors are now reported when an export fails due to resource exhaustion. Commit.
Planet web server instability
The planet web server has been having a series of short outages. These are possibly related to known issues with Apache’s “event” MPM, and have hopefully been fixed by moving to the “worker” MPM.
Carto stylesheet v2.44.1 deployed
The latest version of the carto stylesheet, v2.44.1, was deployed.
A new mod_tile package was deployed to help take advantage of Google’s “Noto” fonts and improve multi-lingual text rendering.
New tile cache in Spain
A new tile cache, called culebre, was generously donated by the University of Zaragoza.
Replication outage 2016-10-14
The main storage disk on ironbelly, the planet file server, filled up, which caused replication to fail. To mitigate this, larger disks are being ordered, and the clean-up of old files will be reviewed to see whether unnecessary files can be removed regularly. Monitoring is in place to warn when disks are approaching capacity.
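The warning itself is handled by our normal monitoring, but purely as an illustration of the kind of check involved (the threshold and mount points below are made up), a disk-capacity alert boils down to something like this:

```python
# Illustrative disk-capacity check; threshold and mount points are examples only.
import shutil

WARN_PERCENT = 90  # warn when a filesystem is more than 90% full

def check_disk(path: str) -> None:
    usage = shutil.disk_usage(path)
    used_percent = 100 * usage.used / usage.total
    if used_percent >= WARN_PERCENT:
        print(f"WARNING: {path} is {used_percent:.1f}% full")
    else:
        print(f"OK: {path} is {used_percent:.1f}% full")

if __name__ == "__main__":
    for mount in ("/", "/store"):  # example mount points
        check_disk(mount)
```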
New database server in production
The new database server, karm, is in production as of 2016-10-23 and being used as the replica server for read-only requests. This “soft launch” is intended to give a month to discover any potential issues with the new hardware before it becomes the master and is used to upgrade to PostgreSQL 9.5.
This is the first “all NVMe” database server that OSM has operated, although dulcy, a Nominatim server, has been running an “all NVMe” configuration stably for a while. NVMe is a storage protocol designed for SSDs which achieves very low latency and high throughput. Initial indications are that karm handles the same workload as the previous replica database server while using less than 10% of its capacity.
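To make the “replica for read-only requests” point concrete, here is a minimal sketch of the general pattern of sending reads to a replica and writes to the master. The connection strings are placeholders and the code is purely illustrative; it is not how the website actually talks to the database.

```python
# Illustrative read/write split between a master and a read-only replica.
# The connection strings are placeholders, not the production configuration.
import psycopg2

MASTER_DSN = "host=master.example.org dbname=osm user=osm"    # writes
REPLICA_DSN = "host=replica.example.org dbname=osm user=osm"  # read-only queries

def _run(dsn, sql, params=()):
    conn = psycopg2.connect(dsn)
    try:
        with conn:  # commits (or rolls back) the transaction on exit
            with conn.cursor() as cur:
                cur.execute(sql, params)
                return cur.fetchall() if cur.description else None
    finally:
        conn.close()

def read(sql, params=()):
    # Read-only requests can be served by the replica, taking load off the master.
    return _run(REPLICA_DSN, sql, params)

def write(sql, params=()):
    # Anything that modifies data must still go to the master.
    return _run(MASTER_DSN, sql, params)

if __name__ == "__main__":
    print(read("SELECT count(*) FROM current_nodes"))
```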
Failed disk in mailing list and help server
A disk had failed in shenron, the mailing list and help server. The failure was reported to Bytemark, who replaced the disk. Issue.
Meeting
OWG met on 2016-10-20 and discussed:
Planning
OSM will continue to grow over the next few years and indications are that it is gaining some mainstream support. This means that reliability and scaling are more important than ever. OWG discussed what we see as the priorities for growth in the near future. We aim to produce a range of costed plans to support growth over the coming years.
The following were discussed, in rough priority order:
1. More database servers
Now that karm has been deployed as a replica, the intention is to promote it to master. This puts us in a bad situation, as the new replica (the current master, katla) will not have the capacity to handle the same load as karm. In an ideal world, we would aim for three similarly-capable servers in three separate sites.
The next step is to see how well the transition to using karm as database primary goes. This will determine what the best choice for database hardware would be.
2. More sites
In order to be robust to site outages, we would like to have at least three “core” sites at all times. A “core” site is one capable of hosting the whole OSM API, and so needs a database, web frontends and backends, and a “services” / planet server. Ideally, it would include rendering and Nominatim as well.
Our current situation is that we have “core” sites at Imperial College and Bytemark, and hope to bring UCL up to “core” site status over the coming months. Unfortunately, the Imperial College data centre will be shut down at some point in the next few years, so we need to find at least one extra “core” site. There was general agreement that we should focus efforts outside of the UK to increase resilience.
The next step is to figure out what the minimum and preferred specs are, and start looking for partners willing to provide that - either gratis or under some commercial terms. This can be used to provide a costed plan with options for both gratis and commercial hosting over the coming years.
3. Storage cluster
Currently, there are two “services” / planet servers (ironbelly and grisu), each of which stores data for the website (user profile images and GPX traces) as well as planet and replication information. When switching between sites, synchronising this data is onerous, and can cause unexpected behaviour (e.g. skipped or duplicate replication files).
In order to make this more reliable, we intend to implement a storage cluster, probably based on Ceph and CephFS.
The next step is to put hardware into “core” sites to support this. We have some hardware “rescued” from Imperial College which may be suitable. One open question is the reliability of such hardware, and we may want a plan B in case it proves unreliable. The costs of both plan A and plan B should be set out in the overall plan.
4. Internal cloud
Many individual services are currently run on their own machines or in small “batches”. For example, ridley runs the Foundation website, CiviCRM, the OSMF blog and the State of the Map sites. This presents a problem when ridley needs to be taken down for maintenance or when the site hosting it goes down.
The widely-used solution to this is to abstract away the underlying hardware, and its location, by running a “private cloud”. This has additional advantages: abstracting the hardware makes it easier to run services, and may make it easier to scale and administer them.
The next step is to dedicate some resources to the host machines and put up a test instance of something, for example OpenShift Origin. This will help us understand the hosting requirements and we can put associated costs into the plan.
5. Central logging and elasticsearch
Ingesting logs into a central Elasticsearch / Logstash stack would allow a better understanding of the current state of the systems and provide information to make better provisioning decisions.
The next step is to figure out what kind of resources are required to support this.
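As a very rough sketch of what log ingestion could look like (assuming a Logstash “tcp” input with a JSON-lines codec, which would be a configuration choice rather than anything already in place; the host and port are placeholders), a log shipper only needs to send one JSON object per line:

```python
# Minimal log-shipping sketch: send JSON events to a Logstash "tcp" input
# configured with a json_lines codec. Host and port are placeholders.
import json
import socket
from datetime import datetime, timezone

LOGSTASH_HOST = "logs.example.org"
LOGSTASH_PORT = 5000

def ship_event(message: str, source: str) -> None:
    event = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "message": message,
        "source": source,
    }
    with socket.create_connection((LOGSTASH_HOST, LOGSTASH_PORT)) as sock:
        sock.sendall((json.dumps(event) + "\n").encode("utf-8"))

if __name__ == "__main__":
    ship_event("renderd queue length high", "yevaud")
```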
Additional items
In no particular order:
- Routing: The website uses the public OSRM instance. Ideally, we would use one dedicated to the OSM website so we neither rely on nor use up the resources of the public instance.
- Overpass: The website uses the public Overpass instance. Ideally, we would use one dedicated to the OSM website so we neither rely on nor use up the resources of the public instance.
- Testable Chef cookbooks: Currently, most of the Chef cookbooks are not tested, nor are they testable. This makes it very difficult to have confidence in any changes, which in turn makes it difficult for external contributors. We’d like to make it easier for others to contribute.
- CI / CD pipeline: Many OSMF services do have external continuous integration (CI) via TravisCI or other open-source-friendly services, which run the tests for a piece of code when it has been updated. This helps to keep the code in a “good” state and avoids some bugs creeping in. Continuous delivery (CD) is a technique to automate the deployment of code which passes the tests (and meets other criteria), making it easier and faster to deploy changes.
New members
Getting help for all the work that needs doing is hard, and OWG has no policy on how someone can help out or join. This leads to the perception that no one can join, which is off-putting and means we lose out on help.
OWG discussed:
- Making it easier to have casual contributions by making small, independent tasks public and tagging them in some way that indicates that we’re looking for help and this task can be taken on by anyone.
- Having a guideline which explains how someone can help, and the difference between “sysadmin” roles and “OWG” roles.
- Having a policy and guideline for on-boarding new members to OWG and the sysadmins group, setting out what the requirements and responsibilities are.
The next step will be to create drafts of the policies and guidelines above and make sure they’re suitable.
Policy for granting whitelists
In many cases, people ask for whitelist status for their IPs to exempt them from rate limits on tiles or the OSM API. OWG doesn’t have a policy or mechanism for them to make that request formally, and this leads to inconsistent results. In turn, this makes OWG appear not to be transparent and perhaps a little cliquey, which is harmful to the project.
OWG decided that old whitelists on tiles should be removed, and that whitelisting should be reserved for exceptional cases. We are opening a discussion about what the appropriate policy should be, in the hope that we can get something that helps solve problems that mapping party organisers experience while still trying to keep a good level of service for all the other individual mappers.
In addition, we are opening a discussion about changes to the tile acceptable use policy which could help curb use of OSMF resources by apps and sites which, in aggregate, cause most of the load on the tile rendering machines.
OWG will also follow up with mapping party organisers to find out what their experiences have been so that we make sure to solve the right problem.
Face to face meeting
OWG decided to investigate having a “face to face” meeting. It was felt that this would be more productive for discussions than an online video-conferencing system, and that by blocking off a whole weekend for the meeting, more progress could be made.
The next step for this will be to investigate possible locations (outside of the UK would be preferred) and draw up budgets as appropriate.