Increased span query load led to backpressure on span data writes. This backpressure led to resource exhaustion in one fleet of span ingestion pods, which caused a partial failure to ingest span data. The issue was resolved by scaling up the ingest pool.
2:45AM: Expensive queries led to a spike in CPU utilization for a database pool. The spike triggered load shedding for writes, causing span ingestion errors.
3:14AM: Scaled up the CPU for the affected pool.
3:33AM: Scaled up the span ingestion pods.
4:15AM: All systems are healthy again.
We have compiled a list of action items, including prioritizing ingest traffic over query traffic and improving client retry handling, to handle increased load more gracefully.
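One common shape for the retry-handling improvement is exponential backoff with jitter, so that clients back off when the ingest pool sheds load rather than amplifying the write pressure. The sketch below is illustrative, not our actual client; `send` and `IngestOverloadedError` are hypothetical stand-ins for the real write call and load-shedding error.

```python
import random
import time


class IngestOverloadedError(Exception):
    """Hypothetical error raised when the ingest service sheds load."""


def send_with_backoff(send, payload, max_attempts=5, base_delay=0.1):
    """Retry a span write with exponential backoff plus jitter.

    Jitter spreads retries out so that many clients recovering at once
    do not synchronize into a thundering herd against the ingest pool.
    """
    for attempt in range(max_attempts):
        try:
            return send(payload)
        except IngestOverloadedError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.5))
```

A client using this wrapper would tolerate brief load-shedding windows like the one in this incident, at the cost of delayed (rather than dropped) span delivery.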