Knowledge Byte: Designing the Cloud to Expect Failure


Cloud Credential Council (CCC)


Designing software for failure is an extra hurdle to clear, but it is not especially hard, and it pays off.

Largely, it boils down to making sure that operations do not leave the system in an unstable state if they are aborted partway through. This is mainly a challenge for the frameworks and infrastructure on which applications are built. Once the infrastructure for retrying failed operations is built into the system, application developers only need to worry about the areas where the system cannot automate recovery from failure (such as operations that trigger real-world actions).
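As a rough illustration of retrying an operation without leaving the system in an unstable state, here is a minimal Python sketch. Every name in it (`TransientError`, `retry`, `Ledger`, the request ID) is hypothetical and chosen for the example; it is not taken from any particular framework. The idea is that the operation is idempotent: a retry after a lost response does not apply the change twice.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for failures that are safe to retry."""


def retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Waits base_delay, then 2x, 4x, ... between attempts (exponential backoff).
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))


class Ledger:
    """Toy store that applies each request ID at most once."""

    def __init__(self):
        self.balance = 0
        self.applied = set()

    def credit(self, request_id, amount):
        # A repeated request ID is a retry of work already done: skip it.
        if request_id in self.applied:
            return self.balance
        self.balance += amount
        self.applied.add(request_id)
        return self.balance
```

With this shape, a caller can wrap `ledger.credit("req-1", 50)` in `retry(...)` and the credit is applied once even if the first attempts fail after the write. Operations that trigger real-world actions usually cannot be made safe this way, which is why they still need the developer's attention.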

Design Tips:

● Each application component must be deployed across redundant cloud components, ideally with minimal or no common points of failure. The best practice is deployment into multiple availability zones.

● Each application component must make no assumptions about the underlying infrastructure; it must be able to adapt to changes in the infrastructure without downtime.

● Each application component should be partition tolerant: it should be able to survive network latency (or loss of communication) among the nodes that support that component.

● Automation tools must be in place to orchestrate application responses to failures or other changes in the infrastructure.
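One way to picture the first and third tips together is a caller that tries redundant replicas in turn and moves on when one is unreachable. The Python sketch below is only an illustration under stated assumptions: the zone names, `ZoneUnavailable`, and `call_with_failover` are all made up for the example, and real deployments typically rely on a load balancer or service mesh rather than hand-written failover.

```python
class ZoneUnavailable(Exception):
    """Hypothetical error raised when a replica's zone cannot be reached."""


def call_with_failover(replicas, request):
    """Try each (zone, handler) pair in order and return the first success.

    Returns a (zone, response) tuple so the caller can see which replica
    answered. Raises RuntimeError only if every replica fails.
    """
    failed = []
    for zone, handler in replicas:
        try:
            return zone, handler(request)
        except ZoneUnavailable:
            failed.append(zone)  # remember the failure and try the next zone
    raise RuntimeError(f"all zones failed: {failed}")
```

The caller holds no assumption about which zone is healthy; it only needs at least one replica to respond.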

Using Chaos Monkey: the best way to avoid failure is to fail constantly. From an early stage, Netflix has used Chaos Monkey, a piece of software that randomly kills off services and features in Netflix's systems in order to assess how well recovery works. Initially it was used only in test, but it is now run randomly in production. As the blog puts it, “If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most,” that is, in the event of an unexpected outage.
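To make the idea concrete, here is a toy Python sketch of random termination followed by automated recovery. It is not Netflix's implementation and does not use any real Chaos Monkey API; `ServicePool`, `unleash_monkey`, and the instance names are hypothetical stand-ins for a pool of service instances and the automation that restores them.

```python
import random


class ServicePool:
    """Toy pool that tries to keep `desired` instances running."""

    def __init__(self, desired):
        self.desired = desired
        self.running = {f"instance-{i}" for i in range(desired)}
        self._next_id = desired

    def kill(self, name):
        self.running.discard(name)

    def reconcile(self):
        """Start replacements until the desired count is met.

        Returns the number of instances started.
        """
        started = 0
        while len(self.running) < self.desired:
            self.running.add(f"instance-{self._next_id}")
            self._next_id += 1
            started += 1
        return started


def unleash_monkey(pool, rng, kill_probability=0.5):
    """Kill each running instance with the given probability.

    Returns the list of instance names that were killed.
    """
    victims = [n for n in sorted(pool.running) if rng.random() < kill_probability]
    for name in victims:
        pool.kill(name)
    return victims
```

A test run would call `unleash_monkey(pool, random.Random())`, then `pool.reconcile()`, and check that the pool is back at full strength. If recovery only works when nobody is killing instances, the exercise exposes that early.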
