The truth is that complex systems eventually fail, particularly as they become more distributed and take on more dependencies. Strategies like load testing and chaos engineering can predict how systems will react to usage spikes and dependency failures, but these scenarios are often difficult to replicate in non-production environments, and failure will still happen at some point. To address scalability challenges, it’s crucial to identify the upper bounds and limitations of systems early. Zak Islam, Head of Product Engineering at Atlassian, shares how to build resilient and scalable systems and explains what strategies teams should implement when architecting systems that will scale to support millions of transactions.
Distributed Systems Can Fail Under Unexpected Usage Spikes
Let’s begin with an anecdote to help illustrate how distributed systems can fail as they scale.
To set the scene: one night, the on-call engineers began investigating an anomaly within a service that supported tens of millions of requests per second. They noticed that at the top of each hour, the service responded with internal server errors for five to ten minutes and then auto-recovered.
This was not expected behavior. The on-call engineers could not determine the cause of the disruption, prompting them to escalate the investigation and dig into why a service that had run without issues for several years was suddenly behaving this way.
Prior to this outage, a neighboring team had released a new product that took a dependency on the impacted service for some core functionality. When the new product saw great success at launch, usage spiked across its dependencies, especially the service that was now experiencing reliability issues.
Since its launch several years prior, the impacted service had not shown any reliability issues. It regularly processed millions of transactions per second, which caused the team to overlook how the sharp uptick in usage could impact it. The service had ‘Auto Scaling’ enabled, so when usage ramped up, the team was confident that automation would kick in and scale out the fleet to support the increased traffic. That ‘Auto Scaling’ functionality had been exercised daily without issue as usage of the service ramped up and down over several years.
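As a rough illustration of what that kind of automation does, the sketch below shows a hypothetical target-utilization scaling calculation in Python. The capacity figures, target utilization, and names are assumptions for illustration only, not details from the incident or from any specific ‘Auto Scaling’ product.

```python
# Hypothetical sketch of a threshold/target-based scale-out calculation: when
# observed traffic pushes average utilization above a target, more hosts are
# added. All numbers and names here are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Fleet:
    hosts: int
    capacity_per_host: float  # requests/sec a single host can comfortably serve


def desired_hosts(fleet: Fleet, observed_rps: float, target_utilization: float = 0.6) -> int:
    """Return a fleet size that brings average utilization back to the target."""
    needed = observed_rps / (fleet.capacity_per_host * target_utilization)
    # Only scale out in this sketch; scale-in policies are omitted for brevity.
    return max(fleet.hosts, int(needed) + 1)


if __name__ == "__main__":
    fleet = Fleet(hosts=400, capacity_per_host=5_000)
    # A launch-driven traffic spike drives the fleet from 400 to ~1,000 hosts.
    print(desired_hosts(fleet, observed_rps=3_000_000))
```

The important property for this story is simply that the fleet size grows with load: the more successful the launch, the larger the fleet that downstream mechanisms must cope with.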
Distributed systems do not scale linearly as load increases
From this incident, the team learned that distributed systems do not scale linearly as load increases. In this case, as usage ramped up, the automated infrastructure scaling systems added more and more hosts to the fleet, as designed. This was effective until an hourly cache synchronization mechanism, which synchronized fleet-level metadata (e.g. host names and IP addresses), could no longer keep up with the number of hosts it had to synchronize data across.
The fleet had grown to the point where fleet-level metadata could no longer be collected fast enough. When the cache was flushed at the top of each hour (as designed), the ‘synchronizer’ could not repopulate it quickly enough for a fleet of that size. This resulted in a stream of cascading failures in other parts of the system.
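To make the failure mode concrete, here is a minimal Python sketch, under assumed numbers, of an hourly flush-and-rebuild synchronizer whose rebuild time grows with fleet size. The per-host fetch cost and all function names are hypothetical and not taken from the actual system.

```python
# Hypothetical sketch (not the actual implementation) of an hourly fleet-metadata
# synchronizer. The rebuild walks every host, so its runtime grows with fleet
# size; past some size the rebuild no longer finishes quickly after the hourly
# flush, and anything depending on the cache fails until it completes.
import time

PER_HOST_FETCH_SECONDS = 0.05  # assumed cost of fetching one host's metadata


def fetch_host_metadata(host: str) -> dict:
    """Stand-in for an RPC that returns a host's name, IP address, etc."""
    time.sleep(PER_HOST_FETCH_SECONDS)
    return {"host": host, "ip": "10.0.0.1"}


def rebuild_cache(hosts: list[str]) -> dict[str, dict]:
    """Flush-and-rebuild, as designed: the whole fleet is walked every hour."""
    cache: dict[str, dict] = {}
    for host in hosts:  # O(fleet size): this loop is the bottleneck
        cache[host] = fetch_host_metadata(host)
    return cache


if __name__ == "__main__":
    # Estimate rebuild times instead of actually sleeping through them.
    for fleet_size in (100, 1_000, 10_000):
        estimated = fleet_size * PER_HOST_FETCH_SECONDS
        print(f"{fleet_size:>6} hosts -> ~{estimated:,.0f}s to rebuild the cache")
    # Under these assumed numbers, 10,000 hosts take ~500s to rebuild, which is
    # roughly consistent with a 5-10 minute window of errors at the top of the hour.
```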
This failure helped identify an upper bound of the system. The team fixed the problem permanently by splitting the fleet into discrete units of capacity, in this case multiple clusters of 100 hosts each. This pattern enabled the service to keep scaling out horizontally by adding clusters, without running into the scaling limits of a single, very large and complex fleet.
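The sketch below illustrates the units-of-capacity idea under simple assumptions: hosts are grouped into clusters capped at 100, and each request key is hashed to pick a cluster. The 100-host cap comes from the article; the hashing scheme and all names are illustrative, not the team’s actual design.

```python
# Hypothetical sketch of "units of capacity": cap each cluster at 100 hosts so
# per-cluster work (like metadata synchronization) stays bounded, and route
# each request to one cluster by hashing a key. Names and sizes are assumptions.
import hashlib

CLUSTER_SIZE = 100  # upper bound per unit of capacity


def partition_fleet(hosts: list[str]) -> list[list[str]]:
    """Split the fleet into fixed-size clusters; each runs its own synchronizer."""
    return [hosts[i:i + CLUSTER_SIZE] for i in range(0, len(hosts), CLUSTER_SIZE)]


def cluster_for(key: str, clusters: list[list[str]]) -> list[str]:
    """Deterministically map a request key (e.g. a tenant ID) to one cluster."""
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return clusters[digest % len(clusters)]


if __name__ == "__main__":
    fleet = [f"host-{n}" for n in range(1_050)]
    clusters = partition_fleet(fleet)
    print(len(clusters), "clusters;", max(len(c) for c in clusters), "hosts max per cluster")
    print("tenant-42 routes to cluster", clusters.index(cluster_for("tenant-42", clusters)))
```

A production system would more likely use consistent hashing or a placement service so keys do not move when clusters are added, but the key design point is the bounded cluster size: growth adds more clusters rather than making any single fleet, and its synchronization work, larger.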