Ingestion and web UI outage
Incident Report for Lightstep
Postmortem

Summary

The Lightstep UI and Data Ingest experienced an outage on July 21, 2022, triggered by a planned database change. The change caused one service to fail, which led to cascading failures in dependent services. The failed service was first restored in a degraded mode, which recovered the Lightstep UI. A subsequent rollback of the database change restored all systems and services.

Timeline

12:22 PM: A database change was run, causing a service to become unavailable.

12:25 PM: Cascading failures impact Lightstep UI and ingestion. Ingested data loss begins.

12:35 PM: Incident declared on the status page.

12:49 PM: Root cause identified, service brought back in a degraded mode, recovery begins.

1:01 PM: Several systems stabilize, Lightstep UI recovers.

1:17 PM: Database change rolled back. Remaining affected services begin recovering.

1:30 PM: Ingestion recovers. Incident resolved, status page updated to “mitigated and monitoring”.

2:02 PM: Status page updated to fully operational.

Action Items

  • Adding automated checks on code changes that include database changes, to help prevent a recurrence of the root cause.
  • Enabling the directly affected service to continue serving traffic in a degraded state, rather than failing entirely.
  • Making architectural changes to Lightstep’s Data Ingest path so that it continues to accept and buffer incoming traffic in these situations, preventing data loss.
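
The buffering idea in the last action item can be illustrated with a minimal sketch. This is not Lightstep's implementation: the names BufferingIngest and FlakySink, the capacity parameter, and the drop-oldest policy are all assumptions made for illustration. The sketch keeps accepting records while the downstream store is unavailable and delivers the backlog in order once it recovers.

```python
from collections import deque


class BufferingIngest:
    """Hypothetical ingest front end that buffers while downstream is down."""

    def __init__(self, sink, capacity=10_000):
        self._sink = sink          # assumed: callable that stores one record
        self._pending = deque()    # records awaiting delivery, oldest first
        self._capacity = capacity  # assumed bound on buffered records
        self.dropped = 0           # records discarded because the buffer was full

    @property
    def backlog(self):
        return len(self._pending)

    def accept(self, record):
        """Always accept the record, then try to drain the buffer."""
        if len(self._pending) >= self._capacity:
            # Assumed policy: discard the oldest record to make room.
            self._pending.popleft()
            self.dropped += 1
        self._pending.append(record)
        self.flush()

    def flush(self):
        """Deliver buffered records in order; stop at the first failure."""
        delivered = 0
        while self._pending:
            try:
                self._sink(self._pending[0])
            except ConnectionError:
                break              # downstream still unavailable; keep buffering
            self._pending.popleft()
            delivered += 1
        return delivered


class FlakySink:
    """Hypothetical stand-in for a downstream store that can go offline."""

    def __init__(self):
        self.available = True
        self.stored = []

    def __call__(self, record):
        if not self.available:
            raise ConnectionError("downstream unavailable")
        self.stored.append(record)
```

In this sketch, data is lost only when an outage outlasts the buffer's capacity, and the dropped counter makes that loss visible. A production design would also need durable storage and backpressure, which are outside the scope of this illustration.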
Posted Jul 28, 2022 - 17:22 PDT

Resolved
All components are operational.
Posted Jul 21, 2022 - 14:02 PDT
Monitoring
We have resolved the issues with trace ingestion, assembly and alerting. We believe the incident is resolved, but we are continuing to monitor.
Posted Jul 21, 2022 - 13:30 PDT
Identified
We have mitigated issues with metric ingestion. Trace ingestion, assembly and alerting are still experiencing degraded performance for some customers.
Posted Jul 21, 2022 - 13:20 PDT
Update
We have now identified the issue and are actively implementing mitigations.
Posted Jul 21, 2022 - 13:05 PDT
Update
We are still actively investigating this issue, and will provide another update shortly.
Posted Jul 21, 2022 - 12:48 PDT
Update
We are continuing to investigate this issue.
Posted Jul 21, 2022 - 12:35 PDT
Investigating
We are currently investigating this issue.
Posted Jul 21, 2022 - 12:35 PDT
This incident affected: Observability - Alerting (Metric alerting, Trace alerting), Observability - Data Ingestion (Trace Assembly, Trace Statistics, Metrics), Observability - Web UI (Service Directory, Dashboards, Notebooks, Change Intelligence, Explorer), Observability - Third-Party Notification Methods (PagerDuty Events API, Slack Apps/Integrations, Slack Connections, Slack Messaging), and Observability - API.