Data delays in reporting
Incident Report for Xandr
Postmortem

Incident Summary

From approximately 06:30 UTC to 08:50 UTC on August 9, 2019, two of the three circuits between our U.S. and Europe data centers were down because of a manhole fire in NYC. The resulting congestion on the remaining circuit caused delays in reporting.

Scope of Impact

During the incident window, Console reporting was delayed by up to 9 hours, compared to the typical 2-4 hours.

Timeline (UTC)

2019-08-09 06:30: Incident started: two circuits simultaneously down.

2019-08-09 07:16: Engineering notified of data congestion.

2019-08-09 07:25: Source of congestion identified on remaining circuit.

2019-08-09 07:28: Incident ticket created.

2019-08-09 08:50: One downed circuit recovered.

2019-08-09 09:40: Impression bus traffic shifted from Amsterdam data center (AMS1) to New York data center (NYM2) to relieve data congestion.

2019-08-09 12:07: Impression bus traffic shifted back from NYM2 to AMS1.

2019-08-10 02:30: Reporting delay falls back within the 6-hour SLA.

2019-08-10 04:46: Incident resolved: Reporting back to normal.
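The elapsed time between milestones can be derived directly from the timestamps above. The following is a minimal sketch for readers who want to reproduce that arithmetic. The event labels and the helper name `minutes_since_start` are illustrative choices, not part of any Xandr tooling; only the timestamps come from the timeline.

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline above; the keys are illustrative labels.
FMT = "%Y-%m-%d %H:%M"
events = {
    "incident_started": "2019-08-09 06:30",
    "engineering_notified": "2019-08-09 07:16",
    "ticket_created": "2019-08-09 07:28",
    "first_circuit_recovered": "2019-08-09 08:50",
    "within_sla": "2019-08-10 02:30",
    "resolved": "2019-08-10 04:46",
}
parsed = {name: datetime.strptime(ts, FMT) for name, ts in events.items()}


def minutes_since_start(name: str) -> int:
    """Return whole minutes elapsed between the incident start and the named event."""
    delta = parsed[name] - parsed["incident_started"]
    return int(delta.total_seconds() // 60)


for name in events:
    print(f"{name}: +{minutes_since_start(name)} min")
```

By this calculation, engineering was notified 46 minutes after the circuits went down, and reporting returned to normal 22 hours 16 minutes after the incident started.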

Cause Analysis

Failure of two of the three circuits between Europe and the U.S. saturated the remaining circuit.

Resolution Steps

While the circuit provider worked to fix the downed circuit(s), our engineering team (1) shifted ad traffic from AMS1 to NYM2 to help relieve data congestion and (2) ensured business-critical data was prioritized over less important data.
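As a rough illustration of step (2), prioritizing transfers over a constrained link can be modeled as a priority queue in which lower-numbered classes are sent first. This is a generic sketch, not a description of how our pipeline is implemented. The class `PriorityTransferQueue`, the priority values, and the feed names are all hypothetical.

```python
import heapq
from itertools import count


class PriorityTransferQueue:
    """Hypothetical queue that releases the most critical data feeds first."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker: keeps FIFO order within one priority

    def submit(self, priority: int, name: str) -> None:
        # Lower number means more critical (0 = business-critical).
        heapq.heappush(self._heap, (priority, next(self._order), name))

    def drain(self):
        # Yield feed names from most to least critical.
        while self._heap:
            _, _, name = heapq.heappop(self._heap)
            yield name


# Feed names below are invented for illustration only.
queue = PriorityTransferQueue()
queue.submit(2, "debug_logs")
queue.submit(0, "billing_events")
queue.submit(1, "reporting_aggregates")
queue.submit(0, "impression_logs")
order = list(queue.drain())
print(order)
```

In this sketch the two priority-0 feeds are drained first, in the order they were submitted, so less important data waits until the critical feeds have been sent.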

Next Steps

  • Provision an additional circuit between AMS and NYM data centers to increase capacity.
  • Improve notifications to all relevant data pipeline teams for similar future incidents.
  • Address data job dependencies and improve data resolver to accelerate data catch-up speed.

Posted Aug 21, 2019 - 04:19 UTC

Resolved

The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Aug 10, 2019 - 04:45 UTC

Investigating

We are currently investigating the following issue:

  • Component(s): Reporting
  • Impact(s):
    • Stale reporting data
    • Data delays
  • Severity: Major Outage
  • Datacenter(s): Global

We will provide an update as soon as more information is available. Thank you for your patience.

Posted Aug 09, 2019 - 13:41 UTC