On Roblox’s Outage

Roblox is one of the world’s biggest game platforms. With over fifty million daily users, it is a wildly popular place to build and play games.

In October last year, Roblox had an outage in which the entire platform was down for over 72 hours. It was all over the news at the time.

Today, Roblox published a postmortem of the incident. It is fascinating reading for anyone interested in distributed systems, DevOps, and engineering (link below). I will write up a more detailed note in a couple of days.

Summary
– The outage was caused by an issue in their service discovery infrastructure, which is built on Consul
– Roblox is deployed on-premises(!!) on 18,000 servers, which run 170,000 service instances
– These services rely on Consul (from HashiCorp) for service discovery and configuration
– An upgrade to Consul, and the resulting change in the way services interact with Consul, led to a cascading set of failures that resulted in the outage
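
For readers new to the term, here is a toy sketch of what "service discovery" means, written in Python. It is not Consul's API and not how Roblox implements anything: the ToyRegistry class, the "matchmaking" service name, and the addresses are all made up for illustration. The idea is only that services register their instances in a shared registry, and callers ask the registry for a healthy instance instead of hard-coding addresses.

```python
import random
from collections import defaultdict


class ToyRegistry:
    """In-memory stand-in for a service discovery backend.

    Hypothetical illustration only; this is not Consul's API.
    """

    def __init__(self):
        # service name -> {instance address: is_healthy}
        self._instances = defaultdict(dict)

    def register(self, service, address):
        # A newly registered instance is assumed healthy.
        self._instances[service][address] = True

    def mark_unhealthy(self, service, address):
        # A real system would learn this from health checks.
        self._instances[service][address] = False

    def healthy_instances(self, service):
        return sorted(
            addr for addr, ok in self._instances[service].items() if ok
        )

    def resolve(self, service):
        # Pick any healthy instance; fail loudly if there is none.
        candidates = self.healthy_instances(service)
        if not candidates:
            raise LookupError(f"no healthy instance of {service!r}")
        return random.choice(candidates)


registry = ToyRegistry()
registry.register("matchmaking", "10.0.0.1:8080")
registry.register("matchmaking", "10.0.0.2:8080")
registry.mark_unhealthy("matchmaking", "10.0.0.1:8080")

# Only one healthy instance remains, so this prints 10.0.0.2:8080
print(registry.resolve("matchmaking"))
```

In this sketch every caller depends on the registry to find anything at all, so if the registry is slow or unavailable, every lookup fails with it. That is the sense in which a problem in the discovery layer can cascade across a whole platform.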

Some Initial Thoughts
– Distributed systems are hard, and service-oriented architectures come with the costs of coordination and service discovery
– Microservice architectures do not reduce complexity; they just move it up a layer of abstraction
– The complexity of the modern software stack comes not just from your code, but also from your dependencies
– Leader election is one of the hardest problems in Computer Science 🙂