Batch Segment Service is taking longer than usual to process.
Incident Report for Xandr
Postmortem

Incident Summary
On Wednesday, February 27th, we deployed a change to the processing pipeline for the Batch Segment Service (BSS). This change introduced a performance regression that caused major delays in job processing. By Monday, March 4th, jobs were delayed by more than 48 hours. Processing for new jobs (those submitted after the 4th) was disabled at that time to avoid losing data. As we caught up on the queued work, processing for new jobs was re-enabled on Wednesday, March 6th.

Scope of Impact
During the incident window, some customers may have noticed delays in Batch Segment jobs in geographies closest to our Los Angeles (LAX1) and New York (NYM2) datacenters.

Timeline (in UTC)
* 08:00:00 Thursday 28 February 2019 UTC: Incident Started: BSS exceeds 24h SLA in NYM2/LAX1
* 14:00:00 Thursday 28 February 2019 UTC: IM Ticket Created
* 16:32:00 Monday 4 March 2019 UTC: Processing for new jobs is disabled for BSS.
* 21:57:00 Monday 4 March 2019 UTC: Performance regression identified from change.
* 13:44:00 Wednesday 6 March 2019 UTC: Processing for new jobs is re-enabled for BSS.
* 16:24:00 Wednesday 6 March 2019 UTC: Additional capacity is launched to aid in processing jobs.
* 12:00:00 Friday 8 March 2019 UTC: Incident Resolved: Queue of Batch Segment Upload jobs is completed.
* 12:32:00 Friday 8 March 2019 UTC: IM Ticket Closed

Cause Analysis
The delays stem from the change to the BSS processing pipeline deployed on February 27th, which introduced a performance regression that slowed job processing.

Resolution Steps
Our engineering team resolved the issue by reverting the change to the processing pipeline and launching additional capacity to help process the queued jobs.

Next Steps
Improve monitoring for performance regressions in BSS.
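
As a rough illustration of what such monitoring could check, the sketch below classifies the queue by the age of its oldest unprocessed job against the 24h SLA cited in the timeline. It is not the BSS implementation: the function name `queue_status`, the `WARN_FRACTION` early-warning threshold, and the idea of alerting on oldest-job age are all assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

SLA = timedelta(hours=24)   # the 24h SLA cited in the timeline
WARN_FRACTION = 0.5         # hypothetical: warn once half the SLA has elapsed


def queue_status(oldest_submitted_at, now, sla=SLA, warn_fraction=WARN_FRACTION):
    """Classify a queue by the age of its oldest unprocessed job.

    Returns "breach" if the job is older than the SLA, "warn" if it has
    used at least `warn_fraction` of the SLA, and "ok" otherwise.
    """
    age = now - oldest_submitted_at
    if age > sla:
        return "breach"
    if age >= sla * warn_fraction:
        return "warn"
    return "ok"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical job submitted 13 hours ago: past the warning threshold.
    print(queue_status(now - timedelta(hours=13), now))
```

Warning well before the SLA is breached is the design choice that matters here: a regression that slows processing would show up as a steadily growing oldest-job age before any job actually exceeds 24 hours.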

Posted Mar 14, 2019 - 20:20 UTC

Resolved

The following incident has been fully resolved, and we will post a post-mortem as soon as we have completed one:

  • Component(s): Batch Segments
  • Impact(s):
    • Delays in Batch Segment jobs
  • Severity: Minor Outage
  • Datacenter(s): LAX1, NYM2

We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Mar 08, 2019 - 13:19 UTC