The OpenStreetMap Standard Tile Layer experienced degraded service from 2022-07-18T07:30:00Z to 2022-07-18T10:00:00Z, an incident lasting 2 hours and 30 minutes. The cause was an Apache bug.

Service background

The standard tile layer is the default map on openstreetmap.org, and is also used by other sites and apps. As of June 2022, traffic was 72 billion requests/month, with a daytime peak of 50,000 requests/second. Traffic is highest on weekdays between 10:00 and 16:00 local time in each region; worldwide, usage is highest between 06:00 UTC and 21:00 UTC. Under normal operation the CDN hit ratio is 89%.
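A rough back-of-the-envelope calculation (a sketch only, assuming a 30-day month and that the 89% hit ratio also holds at peak) relates these figures to the load that actually reaches the render servers:

```python
# Rough sanity check of the traffic figures above.
# Assumes a 30-day month and that the 89% hit ratio also applies at peak.
MONTHLY_REQUESTS = 72e9   # requests/month (June 2022)
PEAK_RPS = 50_000         # daytime peak at the CDN edge
CDN_HIT_RATIO = 0.89      # fraction served from Fastly cache

seconds_per_month = 30 * 24 * 3600
average_rps = MONTHLY_REQUESTS / seconds_per_month

# Only cache misses reach the render servers.
peak_origin_rps = PEAK_RPS * (1 - CDN_HIT_RATIO)

print(f"average edge traffic: {average_rps:,.0f} req/s")    # ~27,800 req/s
print(f"peak origin traffic : {peak_origin_rps:,.0f} req/s")  # ~5,500 req/s
```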

Graph showing weekly traffic cycle

The architecture consists of a content delivery network (CDN) hosted by Fastly, backed by 7 rendering servers. The OpenStreetMap Foundation runs 4 render servers in Europe, 2 in Australia, and 1 in the USA. Fastly distributes traffic by geographic location to the nearest set of servers, with traffic from the US East coast and Africa going to Europe for load reasons.

Within Europe, Fastly additionally distributes traffic by map location, with “odd” metatiles going to one set of servers, and “even” metatiles going to another.

Server    Relative capacity  Location                         Traffic served
odin      1                  Amsterdam, Netherlands           Europe “even” traffic
nidhogg   3                  Umeå, Sweden                     Europe “even” traffic
ysera     1                  Slough, United Kingdom           Europe “odd” traffic
culebre   3                  Dublin, Ireland                  Europe “odd” traffic
balerion  1                  Carlton, Victoria, Australia     Australia and Asia
bowser    1                  Carlton, Victoria, Australia     Australia and Asia
pyrene    0.5                Portland, Oregon, United States  South America and West Coast of North America
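As an illustration only, the two-level routing described above can be sketched as: pick a region by client location, then, within Europe, split by whether the requested metatile is “odd” or “even”. The real logic lives in Fastly configuration, which is not reproduced here; the function names and the exact parity rule below are assumptions for illustration.

```python
# Illustrative sketch of the two-level backend selection described above.
# The real routing is implemented in Fastly; names and the parity rule
# used here are assumptions, not the actual configuration.

METATILE_SIZE = 8  # a metatile is an 8x8 block of tiles in the OSM rendering stack


def metatile_parity(x: int, y: int) -> str:
    """Classify a tile request as 'odd' or 'even' by its metatile coordinates."""
    mx, my = x // METATILE_SIZE, y // METATILE_SIZE
    return "even" if (mx + my) % 2 == 0 else "odd"


def pick_backend(region: str, x: int, y: int) -> str:
    """Choose a backend pool for a tile request (illustrative only)."""
    if region in ("australia", "asia"):
        return "balerion/bowser"
    if region in ("south-america", "north-america-west"):
        return "pyrene"
    # Everything else (including the US East coast and Africa) goes to Europe,
    # where traffic is further split by metatile parity.
    if metatile_parity(x, y) == "even":
        return "odin/nidhogg"
    return "ysera/culebre"


print(pick_backend("europe", x=2176, y=1421))  # -> one of the two European pools
```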

Outage summary

A bug in Apache caused all render servers to lose capacity, resulting in a complete outage of the US render server and rolling failures of the CDN-level health checks of the European render servers, which shifted traffic back and forth between backends. Partial recovery occurred when the health check settings were changed, and complete recovery when Apache was restarted.

Total customer impact was a 4% failure rate lasting for 2.5 hours.

Graph showing error rate

Timeline

What went wrong

A bug in Apache caused all render servers to lose connection capacity each day when Apache was reloaded for log rotation. The US server was upgraded first, and its capacity hit zero on 2022-07-17, causing its traffic to fail over to the European servers. This caused no immediate problems at the low weekend traffic levels, and no alarms were triggered. The US traffic had a disproportionate effect on the European servers because it was not split into “even” and “odd” requests, meaning half the requests required a metatile that didn’t exist on the server that received them.
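The effect of the bug can be pictured as a counter that shrinks with every daily log-rotation reload; a server that has been running the buggy version longest runs out first. The slot counts and per-reload loss below are invented for illustration, as the report does not state the actual numbers.

```python
# Illustrative model of the Apache bug: each daily graceful reload for log
# rotation leaves the server able to accept fewer connections than before.
# The starting capacity and per-reload loss are invented for illustration.

def days_until_exhausted(initial_slots: int, lost_per_reload: int) -> int:
    """Number of daily reloads until the server can accept no connections."""
    days, slots = 0, initial_slots
    while slots > 0:
        slots -= lost_per_reload  # capacity lost at the daily log-rotation reload
        days += 1
    return days

# A server upgraded earlier has been through more reloads, so it runs out
# first -- as the US server did on 2022-07-17.
print(days_until_exhausted(initial_slots=800, lost_per_reload=100))  # -> 8
```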

The same bug also reduced the capacity of all the other servers, with ysera the most affected. As traffic increased at the start of the European day on 2022-07-18, ysera reached its capacity and began failing its health checks, which shifted its traffic to the other servers. This freed up capacity on ysera, so it passed the health checks again and received full traffic once more.

When the other servers reached capacity, the same behaviour started on them. With all four European servers alternately failing and passing health checks, traffic bounced between servers, overloading each in turn. Under this load the servers sometimes failed to respond to requests, or failed to render new tiles.
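A toy simulation sketches the dynamic: a backend whose offered load exceeds its (reduced) capacity fails its health check and is drained, its traffic moves to the remaining backends, they overload in turn, and the drained backend recovers and rejoins. The capacities and load values below are arbitrary illustrative numbers, not measurements from the incident.

```python
# Toy simulation of the health-check oscillation: overloaded backends fail
# their checks, traffic shifts to the rest, they overload in turn, and the
# drained backends recover and rejoin. All numbers are arbitrary.

capacities = {"odin": 1.0, "nidhogg": 2.5, "ysera": 0.4, "culebre": 2.5}  # reduced by the bug
offered_load = 6.0  # total European traffic, in the same arbitrary units
healthy = {name: True for name in capacities}

for step in range(6):
    in_rotation = [n for n, ok in healthy.items() if ok] or list(capacities)
    share = offered_load / len(in_rotation)  # load spread over healthy backends
    for name in capacities:
        if healthy[name]:
            # Overloaded backends start failing their health checks...
            healthy[name] = share <= capacities[name]
        else:
            # ...while drained backends shed load, recover, and rejoin.
            healthy[name] = True
    print(step, {n: ("up" if ok else "down") for n, ok in healthy.items()})
```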

While this was happening, the servers were responding when manually health checked, so Operations did not realise that they were failing the health checks done by Fastly.

Operations pushed Fastly config #305, which contained an error that misdirected all traffic to ysera; it was reverted within 2 minutes 40 seconds. Operations then pushed a fixed config that would normally have removed all traffic from ysera, but because of the multiple failures, it did not. Metrics showed traffic still reaching ysera, which led to the incorrect hypothesis that someone was bypassing the CDN, as has happened before.

Operations then pushed Fastly config #310, relaxing the health check so that only 1 of the last 5 checks needed to succeed. This reduced the oscillating load (traffic swinging between a maximum and close to zero) but did not eliminate it. Eventually, Operations identified the Apache bug as the root cause, and Apache was restarted across all the servers, freeing up connections and restoring normal service.
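The effect of that change can be illustrated with a simple sliding-window check: a backend counts as healthy when at least a threshold number of its last few probes succeeded, so requiring only 1 of 5 makes a flapping backend much harder to mark as down. The parameter names and the stricter 3-of-5 comparison below are generic illustrations, not Fastly’s exact configuration keys or the previous setting.

```python
# Generic sliding-window health check: a backend counts as healthy when at
# least `threshold` of the last `window` probes succeeded. Parameter names
# are generic, not Fastly's exact configuration keys.
from collections import deque

def is_healthy(recent_results: deque, threshold: int) -> bool:
    return sum(recent_results) >= threshold

window = deque(maxlen=5)
probes = [True, False, False, True, False, False, False, True]  # a flapping backend

for ok in probes:
    window.append(ok)
    strict = is_healthy(window, threshold=3)   # stricter check, for comparison (illustrative)
    relaxed = is_healthy(window, threshold=1)  # config #310: only 1 of 5 required
    print(f"probe={'pass' if ok else 'fail'}  3/5-healthy={strict}  1/5-healthy={relaxed}")
```

With the relaxed threshold the backend stays in rotation through most of the flapping, which is why the oscillation improved but, with servers still short of connections, did not disappear.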

Corrective measures