Increased span query load led to backpressure on span data writes. This backpressure led to resource exhaustion in one fleet of span ingestion pods, which caused a partial failure to ingest span data. The issue was resolved by scaling up the ingest pool.
2:45AM: Expensive queries led to a spike in CPU utilization for a database pool. The spike triggered load shedding for writes, causing span ingestion errors.
3:14AM: Scaled up the CPU for the affected pool.
3:33AM: Scaled up the span ingestion pods.
4:15AM: All systems are healthy again.
We have compiled a list of action items, including prioritizing ingest traffic over query traffic and improving client retry handling, to handle increased load more gracefully.
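One common shape for the retry-handling improvement is exponential backoff with jitter, so that clients back off when the ingest pool sheds load rather than amplifying the write pressure. The sketch below is illustrative, not our actual client; `send` and `IngestOverloadedError` are hypothetical stand-ins for the real write call and load-shedding error.

```python
import random
import time


class IngestOverloadedError(Exception):
    """Hypothetical error raised when the ingest service sheds load."""


def send_with_backoff(send, payload, max_attempts=5, base_delay=0.1):
    """Retry a span write with exponential backoff plus jitter.

    Jitter spreads retries out so that many clients recovering at once
    do not synchronize into a thundering herd against the ingest pool.
    """
    for attempt in range(max_attempts):
        try:
            return send(payload)
        except IngestOverloadedError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.5))
```

A client using this wrapper would tolerate brief load-shedding windows like the one in this incident, at the cost of delayed (rather than dropped) span delivery.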