Lightstep Webapp Down
Incident Report for ServiceNow Cloud Observability
Postmortem

Summary

The web UI was inaccessible for 48 minutes (9:49 AM - 10:37 AM PT), and the streams page remained inaccessible for a further 48 minutes (until 11:25 AM PT).

Timeline (12-hour Pacific Time)

09:49 AM PT: 100% of API requests for the webapp begin failing

09:52 AM PT: Database hits 100% resource utilization

10:31 AM PT: All traffic is diverted away from the webapp so the database can recover

10:37 AM PT: All traffic is allowed again except requests to the operation/get endpoint, so the webapp is back up

11:25 AM PT: operation/get traffic is allowed again and the incident is resolved

Root Cause

Database errors led to transient unavailability. Aggressive retries then saturated the database, creating a self-reinforcing feedback loop in which failures triggered still more requests.
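
As an illustration of how retries can be kept from saturating a struggling database, here is a minimal sketch of bounded retries with capped exponential backoff and jitter. This is not the code involved in the incident; the names TransientDatabaseError, backoff_delays and call_with_backoff and all default values are hypothetical.

```python
import random
import time


class TransientDatabaseError(Exception):
    """Hypothetical error type standing in for a retryable database failure."""


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield the wait before each retry: capped exponential backoff with full jitter."""
    # max_attempts calls in total means at most max_attempts - 1 retries.
    for attempt in range(max_attempts - 1):
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick a random wait below the ceiling so that
        # many clients do not retry at the same instant.
        yield rng() * ceiling


def call_with_backoff(operation, max_attempts=4, sleep=time.sleep, **backoff_kwargs):
    """Run operation, retrying transient failures a bounded number of times."""
    delays = backoff_delays(max_attempts, **backoff_kwargs)
    while True:
        try:
            return operation()
        except TransientDatabaseError:
            delay = next(delays, None)
            if delay is None:
                # Retry budget exhausted: surface the error instead of piling on.
                raise
            sleep(delay)
```

The two properties that matter here are the hard cap on attempts, which bounds how much extra load one failing request can generate, and the randomized, growing delay, which spreads retries out over time.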

Action items

We have updated our database configuration to limit the blast radius of failing or slow database calls. We have also implemented additional rate limiting to prevent the self-reinforcing feedback loop in which many concurrent requests cause failures, which in turn cause more requests.
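
One common way to limit concurrent load on a database is to cap the number of in-flight calls and reject the excess immediately rather than queueing it. The sketch below shows that general technique only; it is not a description of the rate limiting that was actually deployed, and ConcurrencyLimiter, Overloaded and the slot() method are hypothetical names.

```python
import threading
from contextlib import contextmanager


class Overloaded(Exception):
    """Raised when a request is shed rather than queued (hypothetical name)."""


class ConcurrencyLimiter:
    """Allow at most max_in_flight callers inside slot() at any one time."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: when every slot is busy, fail fast
        # instead of letting waiting requests build up.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("too many in-flight database calls")
        try:
            yield
        finally:
            # Always free the slot, even if the wrapped call raised.
            self._slots.release()
```

A caller would wrap each database call in "with limiter.slot():" and treat Overloaded as a signal to return an error quickly. Failing fast keeps slow calls from holding resources that healthy requests need.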

Posted Feb 01, 2022 - 15:16 PST

Resolved
This incident has been resolved.
Posted Jan 20, 2022 - 13:21 PST
Update
We are continuing to monitor for any further issues.
Posted Jan 20, 2022 - 13:21 PST
Monitoring
A fix for the remaining unavailability on the streams page is rolling out, and is expected to complete by 1pm PST.
Posted Jan 20, 2022 - 12:08 PST
Update
All pages are operational except for the streams page. We are continuing to investigate.
Posted Jan 20, 2022 - 10:41 PST
Investigating
We are currently investigating this issue.
Posted Jan 20, 2022 - 09:58 PST
This incident affected: Observability - Web UI (Service Directory, Dashboards, Change Intelligence, Explorer).