Post-mortem - Planet replication diff outage, 25 June 2025 21:33 UTC to 26 June 2025 13:22 UTC (approximately 16 hours)

Summary

From 21:33 UTC on 25 June 2025 until 13:22 UTC on 26 June 2025 (approximately 16 hours), OpenStreetMap.org’s system for generating replication diffs stopped working unexpectedly. No new minutely, hourly, or daily diffs were produced automatically during this period. The operations team resolved the issue manually.

Replication diffs are used by other OpenStreetMap.org services and by third-party services to stay up to date with OpenStreetMap.org map edits.

Timeline of Events

Impact

The OpenStreetMap.org planet replication diffs could not be generated or published. Downstream services (tile.openstreetmap.org, Nominatim, Overpass) and third-party services were unable to stay up to date with OpenStreetMap.org map edits.

During the OpenStreetMap.org maintenance window (~25 minutes) the OpenStreetMap.org website and mapping API were read-only. The map edit API did not allow any map changes during this period.

Root Cause

An unidentified condition in PostgreSQL logical replication caused the SQL query used by osmdbt-get-log to fail repeatedly with "ERROR: invalid memory alloc request size 1243650064" (see #Appendices). The failure occurred while osmdbt-get-log was generating the log data required by osmdbt-create-diff.
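
For reference, the failing statement visible in the PostgreSQL logs (see #Appendices) is the logical decoding "peek" that osmdbt-get-log issues against the replication slot. A literal form of that parameterised query is sketched below, with the slot name "osmdbt" filled in from the log output:

-- Peek at the pending changes on the "osmdbt" logical replication slot
-- without consuming them. This is the statement that repeatedly failed with
-- "invalid memory alloc request size 1243650064".
SELECT * FROM pg_logical_slot_peek_changes('osmdbt', NULL, NULL);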

Detection

Alertmanager sent multiple alerts related to the event, but the event did not trigger any pager- or SMS-based alerts.

Response

Ops responded to the issue early on 26 June 2025. An attempt to work around the issue by modifying the number of changesets dumped per replication diff did not resolve it. Ops raised the issue with Jochen (the osmdbt developer), and a path to fixing the issue was worked out.

Resolution

The operations team manually used osmdbt’s fake-log tool (osmdbt-fake-log) to skip past the PostgreSQL replication condition that triggered the failure.
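
A minimal sketch of the kind of check that can confirm the replication slot's position before and after skipping ahead; the slot name is taken from the logs, and the exact commands run by the operations team are not recorded here:

-- Inspect the current position of the "osmdbt" logical replication slot.
SELECT slot_name, restart_lsn, confirmed_flush_lsn
  FROM pg_replication_slots
 WHERE slot_name = 'osmdbt';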

Follow-up Actions / Remediations

Lessons Learned

Appendices

Documented osmdbt recovery process

https://github.com/openstreetmap/osmdbt/blob/master/man/osmdbt.md#recovery-procedure
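
The recovery ultimately advances the replication slot past the point that could not be decoded. As the log excerpt below shows, osmdbt does this via pg_replication_slot_advance; an illustrative form of that call is sketched here, using an LSN copied from the log excerpt rather than the value used during the actual recovery:

-- Advance the "osmdbt" slot to the given LSN, discarding the skipped changes.
-- The LSN below is illustrative (taken from the log excerpt), not the value
-- used during this incident's recovery.
SELECT * FROM pg_replication_slot_advance('osmdbt', CAST ('143DA/2EE23DF0' AS pg_lsn));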

PostgreSQL patches

It is possible that the recent changes in PostgreSQL 15.13 caused or contributed to the unexpected PostgreSQL error.
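
To correlate against the 15.13 release notes, the minor version actually running on the replication database can be confirmed with a standard check (the only assumption here is that the server is on the 15.x series):

-- Report the running PostgreSQL server version, e.g. "15.13".
SHOW server_version;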

PostgreSQL logs at point of initial failure

...
2025-06-25 21:31:56 GMT LOG:  starting logical decoding for slot "osmdbt"
2025-06-25 21:31:56 GMT STATEMENT:  SELECT * FROM pg_logical_slot_peek_changes($1, NULL, NULL);
2025-06-25 21:31:56 GMT LOG:  logical decoding found consistent point at 143DA/2D6FC7D0
2025-06-25 21:31:56 GMT STATEMENT:  SELECT * FROM pg_logical_slot_peek_changes($1, NULL, NULL);
2025-06-25 21:32:14 GMT LOG:  checkpoint complete: wrote 41420 buffers (0.2%); 0 WAL file(s) added, 0 removed, 12 recycled; write=239.841 s, sync=0.117 s, total=239.979 s; sync files=3679, longest=0.016 s, average=0.001 s; distance=279264 kB, estimate=304989 kB
2025-06-25 21:32:24 GMT LOG:  starting logical decoding for slot "osmdbt"
2025-06-25 21:32:24 GMT STATEMENT:  SELECT * FROM pg_replication_slot_advance($1, CAST ($2 AS pg_lsn));
2025-06-25 21:32:24 GMT LOG:  logical decoding found consistent point at 143DA/2DA8E0C8
2025-06-25 21:32:24 GMT STATEMENT:  SELECT * FROM pg_replication_slot_advance($1, CAST ($2 AS pg_lsn));
2025-06-25 21:33:00 GMT LOG:  starting logical decoding for slot "osmdbt"
2025-06-25 21:33:00 GMT STATEMENT:  SELECT * FROM pg_logical_slot_peek_changes($1, NULL, NULL);
2025-06-25 21:33:00 GMT LOG:  logical decoding found consistent point at 143DA/2EE23DF0
2025-06-25 21:33:00 GMT STATEMENT:  SELECT * FROM pg_logical_slot_peek_changes($1, NULL, NULL);
2025-06-25 21:33:14 GMT LOG:  checkpoint starting: time
2025-06-25 21:33:25 GMT ERROR:  invalid memory alloc request size 1243650064
...