A sharp increase in traffic overloaded one of our backend services responsible for communicating with satellites. The unavailability of the service triggered a reconnect storm: satellites repeatedly attempted to reconnect to the failing service, amplifying the load in a feedback loop. During this time Lightstep experienced a partial outage of trace assembly and of ingestion of statistics for streams.
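A standard mitigation for this kind of reconnect storm is exponential backoff with jitter on the client side, so that a fleet of satellites does not retry in lockstep against an already-overloaded service. The sketch below is illustrative only (`reconnect_delay` is a hypothetical helper, not Lightstep's satellite code):

```python
import random

def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter.

    Each failed reconnect attempt waits a random delay drawn from
    [0, min(cap, base * 2**attempt)]. Randomizing the delay spreads
    clients out in time instead of letting them retry simultaneously.
    """
    return random.uniform(0.0, min(cap, base * 2 ** attempt))

# Average delay grows with each failed attempt, capped at 60 seconds.
delays = [reconnect_delay(a) for a in range(8)]
```

With full jitter, even thousands of satellites that lose their connection at the same moment will smear their retries across the backoff window rather than hammering the service in synchronized waves.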
~10:00a.m. Traffic to our SaaS sharply increases, overloading one of our backend services.
10:30a.m. The backend service restarts, kicking off a cascading failure and starting the outage.
10:45a.m. Lightstep deploys more resources to an adjacent service, which alleviates the symptoms of the outage and gets data flowing back into the system.
11:20a.m. The rollout is complete and the cascading failure has subsided.
12:07p.m. Lightstep marks the original incident as resolved, as the symptoms have disappeared.
12:24p.m. Lightstep increases resources for the root-cause service to reduce the likelihood of future outages. The change is deployed incorrectly and the service’s resources are instead decreased, which resumes the outage.
12:28p.m. Lightstep updates the status page to reflect the new ongoing incident.
12:51p.m. Lightstep identifies the error in the configuration, corrects it, and deploys the service with increased resources.
12:58p.m. The service has finished rolling out and the outage is over.
We have outlined 10 action items to prevent outages like this one. Examples include allocating more resources to the failing service, improving its autoscaling and alerting, improving autoscaling on adjacent services so they can absorb a failure of the root-cause service, and preventing OOMs in customer satellites when they cannot reach our SaaS. We’re also updating our incident response workflows to make it harder to deploy bad configurations.
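On the last point, the usual way to keep a client from running out of memory while its upstream is unreachable is to bound the send buffer and shed the oldest data instead of growing without limit. The sketch below is a hypothetical illustration of that idea (`BoundedSpanBuffer` is not Lightstep's actual satellite implementation):

```python
from collections import deque

class BoundedSpanBuffer:
    """Bounded send buffer: while the SaaS is unreachable, spans queue
    up to max_items; beyond that, the oldest spans are dropped rather
    than letting memory grow without bound (the cause of the OOMs)."""

    def __init__(self, max_items: int = 10_000):
        # deque with maxlen evicts from the left automatically when full.
        self.buf = deque(maxlen=max_items)
        self.dropped = 0  # count of spans shed under backpressure

    def enqueue(self, span) -> None:
        if len(self.buf) == self.buf.maxlen:
            self.dropped += 1  # the deque is about to evict its oldest span
        self.buf.append(span)

    def drain(self) -> list:
        """Return all buffered spans (e.g. once connectivity returns)."""
        items = list(self.buf)
        self.buf.clear()
        return items
```

Dropping old data is a deliberate trade-off: losing some telemetry during an outage is preferable to the satellite itself being OOM-killed, which would take reporting down entirely.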