As Werner Vogels, Amazon CTO, said: “Everything fails all the time.” AWS reminded us of that again this week with a major outage in us-east-1 that took down parts of DynamoDB, Lambda, and EC2 and cascaded into disruptions across more than a hundred AWS services. Chances are you noticed something acting up in one of your favorite apps or websites that day. It wasn’t the first time the “cloud” has failed, and it certainly won’t be the last.
Every incident is a new opportunity to validate what we’ve built, to see where our assumptions hold, and where we can still do better. This one proved the robustness of the heart of our platform: the search experience, the most critical part of what we deliver to our customers and their end users, stayed fully operational. People searching on our customers’ sites continued to get fast, relevant results, completely unaware that a large portion of the internet was having a very bad day — or maybe they noticed, and just assumed that the search on that website couldn’t possibly be running on AWS.
It’s not like our team had a quiet day because of this. The incident started around 3 a.m. EST and lasted until the early evening. Our internal alerting picked up on the disturbance and paged many team members as soon as the issue started. Our SRE and platform teams, along with many others, worked on this the entire day, ensuring our platform stayed up and performant while the cloud was having one of its worst days in recent memory.
During much of that time, many of the AWS services we rely on for platform operations were either unavailable or degraded. We couldn’t launch new EC2 instances. Lambda functions were timing out. DynamoDB was unreachable. And yet our traffic was ramping up fast, since most of our customers’ end-user traffic picks up in the morning. That’s normally when our infrastructure scales out to handle the load, but this time we couldn’t rely on autoscaling and had to make do with the compute capacity we already had. One of the first things we did was disable autoscaling: it was the safest way to keep our platform stable and hold on to that capacity.
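For illustration, a minimal sketch of that kind of capacity freeze with boto3 is below; the group name and the exact set of suspended processes are assumptions for the example, not a description of our production setup.

```python
# Sketch: suspend Auto Scaling activity so a group neither terminates nor
# replaces the instances we already have. The group name is hypothetical.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

def freeze_capacity(group_names):
    """Hold on to existing instances by pausing the processes that could remove them."""
    for name in group_names:
        autoscaling.suspend_processes(
            AutoScalingGroupName=name,
            ScalingProcesses=[
                "Terminate",         # no scale-in
                "AZRebalance",       # no terminations to rebalance across AZs
                "ReplaceUnhealthy",  # don't churn instances on flapping health checks
            ],
        )
        print(f"Suspended scaling activity on {name}")

freeze_capacity(["search-fleet-asg"])  # hypothetical Auto Scaling group name
```

Resuming once the region is healthy again is the mirror image, via resume_processes on the same groups.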
We operate an active-active setup across multiple availability zones within each AWS region, and across regions as well. That design makes it easy for us to redirect traffic if one region becomes unstable. The challenge this time was that other AWS regions were feeling the ripple effects of the outage, so we didn’t want to rely solely on that option.
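To give a concrete feel for what that redirection can look like, here is a sketch assuming DNS-based routing with Route 53 weighted records; the hosted zone ID, record names, targets, and weights are purely illustrative.

```python
# Sketch: shift weighted DNS traffic away from a degraded region.
# Zone ID, record names, targets, and weights are illustrative assumptions.
import boto3

route53 = boto3.client("route53")

def set_region_weight(zone_id, record_name, region_id, target, weight):
    """Upsert one region's weighted record; lowering its weight shifts traffic away."""
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            "Comment": f"Set {region_id} weight to {weight}",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "SetIdentifier": region_id,
                    "Weight": weight,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": target}],
                },
            }],
        },
    )

# Example: drain most traffic from us-east-1 toward us-west-2.
set_region_weight("Z0000000000", "search.example.com.", "us-east-1",
                  "search-us-east-1.example.com.", weight=10)
set_region_weight("Z0000000000", "search.example.com.", "us-west-2",
                  "search-us-west-2.example.com.", weight=90)
```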
To stay on the safe side, we planned additional contingency measures. We were ready to reallocate capacity and apply active load shedding: temporarily shutting down non-critical systems and features to keep search running smoothly if things had worsened. Fortunately, our existing capacity held up, and we managed to launch additional instances from AWS in time, keeping those contingency plans safely on standby.
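For readers wondering what active load shedding means in practice, the sketch below shows the general idea; the endpoint tiers, CPU thresholds, and hysteresis values are made up for the example.

```python
# Sketch: shed non-critical traffic when capacity headroom gets tight, so the
# critical search path keeps its resources. Paths and thresholds are hypothetical.
CRITICAL_PATHS = {"/search", "/suggest"}  # always served

class LoadShedder:
    def __init__(self, start_shedding_at=0.85, stop_shedding_at=0.70):
        self.start_shedding_at = start_shedding_at
        self.stop_shedding_at = stop_shedding_at
        self.shedding = False

    def observe(self, cpu_utilization):
        # Hysteresis: start shedding above one threshold, only stop well below it,
        # so the decision doesn't flap on noisy metrics.
        if cpu_utilization >= self.start_shedding_at:
            self.shedding = True
        elif cpu_utilization <= self.stop_shedding_at:
            self.shedding = False

    def allow(self, path):
        """Critical requests always pass; non-critical ones are rejected while shedding."""
        return path in CRITICAL_PATHS or not self.shedding

# Example: at 90% CPU, analytics beacons would get turned away while /search keeps flowing.
shedder = LoadShedder()
shedder.observe(0.90)
assert shedder.allow("/search")
assert not shedder.allow("/analytics/beacon")
```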
As you can imagine, when an infrastructure relies this heavily on AWS and such a widespread outage occurs, not every part of the platform comes out unscathed. Some of our services that depend more directly on the affected AWS components could not keep operating normally during the incident:
- Document ingestion was unavailable for about 2.5 hours early in the incident. Data was queued safely and processed once services recovered.
- Reporting and analytics slowed down as Firehose and Snowflake became unstable.
- Real-time features were impacted and certain machine learning model rebuilds were delayed, but this only affected model freshness. Models already in production continued to operate and serve requests as usual.
Unsurprisingly, some third-party services we rely on were also caught in the ripple effects of the outage. Having been through situations like this ourselves, we know how hard it is to keep everything running smoothly when the services you depend on are having a bad day. Our status page mechanism was affected, so we couldn’t even update our customers on what was going on. Our feature flagging system was also down just as AWS was recovering and we were finally able to spin up new EC2 instances, which meant freshly launched services couldn’t fetch their configuration flags as usual. The team quickly found a workaround to make sure everything booted with the correct settings. A good reminder that resilience extends beyond AWS itself, and a lesson that will be integrated into our future simulations.
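The workaround boiled down to a familiar fallback pattern: if the flag service can’t be reached at boot, start from a last-known-good snapshot shipped alongside the service. The sketch below assumes a hypothetical flag client and file path rather than our actual tooling.

```python
# Sketch: fall back to a locally stored snapshot of flag values when the flag
# service is unreachable at startup. Client, path, and call names are hypothetical.
import json
import logging

FALLBACK_SNAPSHOT = "/etc/search-service/flags-last-known-good.json"  # hypothetical path

def load_flags(flag_client, timeout_seconds=2.0):
    """Prefer live flags; fall back to the baked-in snapshot if the service is down."""
    try:
        return flag_client.get_all_flags(timeout=timeout_seconds)  # hypothetical client call
    except Exception as exc:  # timeouts, connection errors, 5xx responses
        logging.warning("Flag service unreachable (%s); booting from local snapshot", exc)
        with open(FALLBACK_SNAPSHOT) as snapshot:
            return json.load(snapshot)
```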
By late afternoon, as AWS gradually restored functionality, those workarounds paid off. Systems caught up, pending workloads resolved, ingestion resumed, and our dashboards slowly returned to green.
What stood out most wasn’t just that our search stayed up; it’s why it stayed up. It’s the result of years of deliberate design choices: isolating the query path from everything else, building regional redundancy, and investing heavily in observability so we could react with precision, not panic.
Now that the incident is behind us and everything is back to green, our work isn’t done. Our infrastructure is resilient because we take the time to inspect, dig, discuss, improve, and learn from every incident. This one will be no different. We’ll have plenty to review: what went well, where we can simplify, and how we can make our systems even more resilient the next time something fails.
Resilience is a journey, not a checkbox. This incident proved that again, and reminded us that the best architecture is invisible when everything else is failing.
After all, as Werner reminds us: “Everything fails all the time.”

