Data delays in reporting
Incident Report for Xandr
Postmortem

Summary

From approximately 18:05 UTC on August 14, 2019, to 09:07 UTC on August 15, 2019, two of our providers' circuits between our U.S. and Europe data centers went down, saturating the remaining circuit and resulting in stale reporting data.

Scope of Impact

During the incident window, reporting data was out of date by up to 10 hours.

Timeline (UTC)

  • 2019-08-14 18:05: Incident start: circuits dropped between our New York City (NYM2) and Amsterdam (AMS1) data centers. Network team alerted immediately.
  • 2019-08-14 18:49: Issue escalated to a formal incident.
  • 2019-08-14 19:01: Some ad traffic shifted from the AMS1 data center to the NYM2 data center to relieve congestion.
  • 2019-08-14 19:23: Traffic shift reverted.
  • 2019-08-14 20:28: Data transportation system in FRA1 turned off to improve bandwidth.
  • 2019-08-14 20:40: Data transportation system in AMS1 turned off to improve bandwidth.
  • 2019-08-14 22:03: Data transportation systems in AMS1 and FRA1 turned back on to prevent disks from filling up.
  • 2019-08-14 23:11: One circuit recovered.
  • 2019-08-15 04:00: Reporting delayed by up to 10 hours; backlog begins to catch up.
  • 2019-08-15 07:12: Both circuits up, but traffic still congested.
  • 2019-08-15 08:09: Data transportation system in AMS1 turned off to improve bandwidth.
  • 2019-08-15 08:38: Congestion improves; data transportation system in AMS1 turned back on.
  • 2019-08-15 09:07: Incident resolved. Reporting delays back to normal levels.

Cause Analysis

One provider's circuit failed when its fiber duct was breached during the installation of wooden posts. The second provider's circuit failed due to a manhole fire.

Resolution Steps

While circuit providers worked to fix the downed circuits, our engineers employed various traffic and data transport mitigation strategies to prioritize critical logs and avoid excessive back pressure to other systems.

Next Steps

  • Provision a fourth circuit between Europe and U.S. to increase capacity and failure tolerance.
  • Improve internal notifications to all impacted teams for similar future incidents.
  • Address data job dependencies to accelerate data catch-up speed.
Posted Aug 28, 2019 - 21:03 UTC

Resolved

The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

  • All external supply was restored as of 01:15 UTC.
  • Reporting delays dropped to normal levels as of 07:00 UTC.

Posted Aug 15, 2019 - 10:32 UTC
Investigating

We are currently investigating the following issue:

  • Component(s): Reporting
  • Impact(s):
    • Stale reporting data
    • Limited delivery on external sellers in Europe
  • Severity: Major Outage
  • Datacenter(s): Global

We will provide an update as soon as more information is available. Thank you for your patience.

Posted Aug 14, 2019 - 22:06 UTC