fbpx Skip to content

Knowledge Byte: Designing the Cloud to Expect Failure


Paulo Guimarães


Designing software for failure is an extra barrier to overcome but isn’t too hard, and it certainly pays off.

Largely, it boils down to make sure that operations do not leave the system in an unstable state if they are aborted partway through for some reason. This is mainly a challenge for the frameworks and infrastructure upon which applications are built; with the infrastructure for retrying failed operations built into the system, application developers only really need to worry about the areas where the system can’t automate recovery from failure (such as operations that trigger real-world actions).

Design Tips:

● Each application component must be deployed across redundant cloud components, ideally with minimal or no common points of failure. The best practice is deployment into multiple availability zones.

● Each application component must make no assumptions about the underlying infrastructure, it must be able to adapt to changes in the infrastructure without downtime.

● Each application component should be partition tolerant, it should be able to survive network latency (or loss of communication) among the nodes that support that component.

● Automation tools must be in place to orchestrate application responses to failures or other changes in the infrastructure.

The use of Chaos Monkey—the best way to avoid failure is to fail constantly. From an early stage, Netflix used a Chaos Monkey—a piece of software that can randomly kill off different services/ features in Netflix, with the intention of assessing how well the recovery works. Initially, this was used in the test, but now is being used randomly in production. Quoting from the blog—“If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most—in the event of an unexpected outage.”

Related products to help you upskill

Never miss an interesting article

Get our latest news, tutorials, guides, tips & deals delivered to your inbox.

Please enter your name.
Please enter a valid email address.
Please check the required field.
Something went wrong. Please check your entries and try again.

Keep learning

A Massive Influx Into Remote Work Creates an Opportunity for Hackers

A Massive Influx Into Remote Work Creates an Opportunity for Hackers

While the coronavirus pandemic has infected millions of people worldwide, sending people back to work and study from home, these new habits could benefit cybercriminals....
jurian article

ITIL® 4, Why Should You? What’s New?

By 2019, when ITIL® 4 was finally launched, ITIL had been the leading guidance for IT Service Management for the past three decades. Millions of...
Challenges in Becoming a More Digitally Enabled Organization

Challenges in Becoming a More Digitally Enabled Organization

Our Global Digital Skills Survey 2019 report identifies three critical and eight key findings. The analysis covers cultural, individual, and organizational readiness for the changes...
Scroll To Top