Failing to ingest some span data
Incident Report for ServiceNow Cloud Observability
Postmortem

Summary

Increased span query load led to backpressure on span data writes. This backpressure led to resource exhaustion in one fleet of span ingestion pods, which caused a partial failure to ingest span data. The issue was resolved by scaling up the ingest pool.

Timeline (Pacific Time)

2:45AM: Expensive queries caused a spike in CPU utilization for a database pool. The spike triggered load shedding for writes, which caused span ingestion errors.

3:14AM: Scaled up the CPU for the affected pool.

3:33AM: Scaled up the span ingestion pods.

4:15AM: All systems returned to a healthy state.
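
To illustrate the failure mode in the 2:45AM entry, where CPU pressure from queries led to writes being shed, here is a minimal sketch of a shedder that drops queries before writes. It is an illustration under stated assumptions, not the production implementation: the `LoadShedder` class, the request kinds, and both thresholds are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical request classes; the names are illustrative only.
INGEST = "ingest"
QUERY = "query"


@dataclass
class LoadShedder:
    # Assumed thresholds: queries are shed early, writes only as a last resort.
    query_shed_threshold: float = 0.75
    ingest_shed_threshold: float = 0.95

    def admit(self, kind: str, cpu_utilization: float) -> bool:
        """Return True if a request of this kind should be accepted
        at the given CPU utilization (0.0 to 1.0)."""
        if kind == INGEST:
            return cpu_utilization < self.ingest_shed_threshold
        # Anything that is not ingest is treated as lower priority.
        return cpu_utilization < self.query_shed_threshold
```

With these assumed thresholds, a pool at 85% CPU would reject new queries while still accepting span writes, so a query spike degrades query latency before it causes ingestion errors.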

Action items

We have compiled a list of action items to handle increased load more gracefully, including prioritizing ingest over queries and improving retry handling.
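
As one possible shape for improved retry handling, the sketch below retries a shed write with capped exponential backoff and full jitter, so that retries from many clients do not arrive in lockstep and add to the load. The `RetryableError` type, the `send` callable, and all parameter defaults are assumptions made for illustration; the report does not describe the actual retry design.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical error meaning the write was shed and may be retried."""


def send_with_retry(send, batch, max_retries=4, base=0.1, cap=5.0,
                    sleep=time.sleep, rng=random.random):
    """Call send(batch), retrying on RetryableError.

    The delay before retry n is drawn uniformly from
    [0, min(cap, base * 2**n)) seconds (full jitter).
    """
    for attempt in range(max_retries + 1):
        try:
            return send(batch)
        except RetryableError:
            if attempt == max_retries:
                raise  # retry budget exhausted; surface the error
            sleep(rng() * min(cap, base * 2 ** attempt))
```

The `sleep` and `rng` parameters are injectable only so the behavior can be checked deterministically; a caller would normally leave them at their defaults. Capping both the delay and the number of retries bounds how long a batch is held in memory, which matters when the sender itself is at risk of resource exhaustion.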

Posted Jul 29, 2022 - 11:38 PDT

Resolved
This incident has been resolved.
Posted Jul 21, 2022 - 04:41 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 21, 2022 - 03:49 PDT
Investigating
We have identified the issue and are monitoring a fix.
Posted Jul 21, 2022 - 03:15 PDT
This incident affected: Observability - Data Ingestion (Trace Assembly).