The truth is that complex systems eventually fail, particularly as they become more distributed and take on more dependencies. Strategies like load testing and chaos engineering can predict how systems will react to usage spikes and dependency failures, but these scenarios are often difficult to replicate in non-production environments, and failure will still happen at some point. To address scalability challenges, it’s crucial to identify the upper bounds and limitations of systems early. Zak Islam, Head of Product Engineering at Atlassian, shares how to build resilient and scalable systems and explains what strategies teams should implement when architecting systems that will scale to support millions of transactions.
Distributed Systems Can Fail Under Unexpected Usage Spikes
Let’s begin with an anecdote to help illustrate how distributed systems can fail as they scale.
To set the scene: one night, the on-call engineers began investigating an anomaly within a service that supported tens of millions of requests per second. They noticed that at the top of each hour, the service responded with internal server errors for five to ten minutes and then auto-recovered.
This was not expected behavior. The on-call engineers could not determine the cause of the disruption, prompting them to escalate the investigation and dig into why a service that had run without issues for several years was suddenly behaving this way.
Prior to this outage, a neighboring team had released a new product that took a dependency on the impacted service for some core functionality. When the new product saw great success at launch, usage spiked across its dependencies, especially the service that was now experiencing reliability issues.
Since its launch several years prior, the impacted service had not shown any reliability issues. It regularly processed millions of transactions per second, which caused the team to overlook how the sharp uptick in usage could impact it. The service had ‘Auto Scaling’ enabled, so when usage ramped up, the team was confident that automation would kick in and scale out the fleet to support the increased traffic. That ‘Auto Scaling’ functionality had been exercised daily without issue as usage of the service ramped up and down over several years.
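As a rough illustration of what that kind of automation does, the sketch below shows a hypothetical target-utilization scaling calculation in Python. The capacity figures, target utilization, and names are assumptions for illustration only, not details from the incident or from any specific ‘Auto Scaling’ product.

```python
# Hypothetical sketch of a threshold/target-based scale-out calculation: when
# observed traffic pushes average utilization above a target, more hosts are
# added. All numbers and names here are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Fleet:
    hosts: int
    capacity_per_host: float  # requests/sec a single host can comfortably serve


def desired_hosts(fleet: Fleet, observed_rps: float, target_utilization: float = 0.6) -> int:
    """Return a fleet size that brings average utilization back to the target."""
    needed = observed_rps / (fleet.capacity_per_host * target_utilization)
    # Only scale out in this sketch; scale-in policies are omitted for brevity.
    return max(fleet.hosts, int(needed) + 1)


if __name__ == "__main__":
    fleet = Fleet(hosts=400, capacity_per_host=5_000)
    # A launch-driven traffic spike drives the fleet from 400 to ~1,000 hosts.
    print(desired_hosts(fleet, observed_rps=3_000_000))
```

The important property for this story is simply that the fleet size grows with load: the more successful the launch, the larger the fleet that downstream mechanisms must cope with.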
Distributed systems do not scale linearly as load increases
From this incident, the team learned that distributed systems do not scale linearly as load increases. In this case, as usage ramped up, the automated infrastructure scaling systems added more and more hosts to the fleet, as designed. This was effective until an hourly cache synchronization mechanism, which synchronized fleet-level metadata (e.g. host names and IP addresses), could no longer keep up with the number of hosts it had to synchronize data across.
The fleet had grown to the point where fleet-level metadata could no longer be collected fast enough. When the cache was flushed at the top of each hour (as designed), the ‘synchronizer’ could not repopulate it quickly enough for a fleet of that size. This resulted in a stream of cascading failures in other parts of the system.
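To make the failure mode concrete, here is a minimal Python sketch, under assumed numbers, of an hourly flush-and-rebuild synchronizer whose rebuild time grows with fleet size. The per-host fetch cost and all function names are hypothetical and not taken from the actual system.

```python
# Hypothetical sketch (not the actual implementation) of an hourly fleet-metadata
# synchronizer. The rebuild walks every host, so its runtime grows with fleet
# size; past some size the rebuild no longer finishes quickly after the hourly
# flush, and anything depending on the cache fails until it completes.
import time

PER_HOST_FETCH_SECONDS = 0.05  # assumed cost of fetching one host's metadata


def fetch_host_metadata(host: str) -> dict:
    """Stand-in for an RPC that returns a host's name, IP address, etc."""
    time.sleep(PER_HOST_FETCH_SECONDS)
    return {"host": host, "ip": "10.0.0.1"}


def rebuild_cache(hosts: list[str]) -> dict[str, dict]:
    """Flush-and-rebuild, as designed: the whole fleet is walked every hour."""
    cache: dict[str, dict] = {}
    for host in hosts:  # O(fleet size): this loop is the bottleneck
        cache[host] = fetch_host_metadata(host)
    return cache


if __name__ == "__main__":
    # Estimate rebuild times instead of actually sleeping through them.
    for fleet_size in (100, 1_000, 10_000):
        estimated = fleet_size * PER_HOST_FETCH_SECONDS
        print(f"{fleet_size:>6} hosts -> ~{estimated:,.0f}s to rebuild the cache")
    # Under these assumed numbers, 10,000 hosts take ~500s to rebuild, which is
    # roughly consistent with a 5-10 minute window of errors at the top of the hour.
```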
This failure helped identify an upper bound of the system. The team fixed the problem permanently by splitting the fleet into discrete units of capacity, in this case multiple clusters of 100 hosts each. This pattern enabled the service to keep scaling out horizontally by adding clusters, without running into the scaling limits of a single, very large and complex fleet.
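The sketch below illustrates the units-of-capacity idea under simple assumptions: hosts are grouped into clusters capped at 100, and each request key is hashed to pick a cluster. The 100-host cap comes from the article; the hashing scheme and all names are illustrative, not the team’s actual design.

```python
# Hypothetical sketch of "units of capacity": cap each cluster at 100 hosts so
# per-cluster work (like metadata synchronization) stays bounded, and route
# each request to one cluster by hashing a key. Names and sizes are assumptions.
import hashlib

CLUSTER_SIZE = 100  # upper bound per unit of capacity


def partition_fleet(hosts: list[str]) -> list[list[str]]:
    """Split the fleet into fixed-size clusters; each runs its own synchronizer."""
    return [hosts[i:i + CLUSTER_SIZE] for i in range(0, len(hosts), CLUSTER_SIZE)]


def cluster_for(key: str, clusters: list[list[str]]) -> list[str]:
    """Deterministically map a request key (e.g. a tenant ID) to one cluster."""
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return clusters[digest % len(clusters)]


if __name__ == "__main__":
    fleet = [f"host-{n}" for n in range(1_050)]
    clusters = partition_fleet(fleet)
    print(len(clusters), "clusters;", max(len(c) for c in clusters), "hosts max per cluster")
    print("tenant-42 routes to cluster", clusters.index(cluster_for("tenant-42", clusters)))
```

A production system would more likely use consistent hashing or a placement service so keys do not move when clusters are added, but the key design point is the bounded cluster size: growth adds more clusters rather than making any single fleet, and its synchronization work, larger.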