
Knowledge Byte: Designing the Cloud to Expect Failure


Cloud Credential Council (CCC)


Designing software to expect failure takes extra effort, but it isn’t too hard, and it certainly pays off.

Largely, it boils down to making sure that operations do not leave the system in an unstable state if they are aborted partway through. This is mainly a challenge for the frameworks and infrastructure upon which applications are built. Once the infrastructure for retrying failed operations is built into the system, application developers only need to worry about the areas where the system can’t automate recovery from failure (such as operations that trigger real-world actions).
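As a minimal sketch of this idea in Python: the operation below stages its changes on a copy and commits them in a single step, so an abort partway through leaves the state untouched, and it records an operation ID so that a retry never applies the same change twice. The `Ledger` class, the `TransientError` type, and the `retry` helper are hypothetical names invented for illustration, and the in-memory dictionary stands in for whatever durable store a real system would use.

```python
class TransientError(Exception):
    """Hypothetical error type standing in for a recoverable failure."""


class Ledger:
    """Hypothetical in-memory store; a real system would use a database."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self.applied = set()  # IDs of operations that have already committed

    def transfer(self, op_id, src, dst, amount, fail_midway=False):
        # Idempotency: retrying an operation that already committed is a no-op.
        if op_id in self.applied:
            return
        # Stage changes on a copy so an abort never exposes a half-done state.
        staged = dict(self.balances)
        staged[src] -= amount
        if fail_midway:
            # Simulated crash after the debit but before the credit.
            raise TransientError("aborted partway through")
        staged[dst] += amount
        # Commit: swap in the fully updated state and record the operation.
        self.balances = staged
        self.applied.add(op_id)


def retry(operation, attempts=3):
    """Call operation(attempt_number) until it succeeds or attempts run out."""
    last_error = None
    for attempt in range(attempts):
        try:
            return operation(attempt)
        except TransientError as exc:
            last_error = exc
    raise last_error


ledger = Ledger({"a": 100, "b": 0})
# The first attempt aborts midway; the retry succeeds against unchanged state.
retry(lambda attempt: ledger.transfer("op-1", "a", "b", 30,
                                      fail_midway=(attempt == 0)))
print(ledger.balances)  # {'a': 70, 'b': 30}
```

Operations that trigger real-world actions, such as sending a payment or an email, cannot be staged and discarded this way, which is why they remain the application developer's concern.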

Design Tips:

● Each application component must be deployed across redundant cloud components, ideally with minimal or no common points of failure. The best practice is deployment into multiple availability zones.

● Each application component must make no assumptions about the underlying infrastructure; it must be able to adapt to changes in the infrastructure without downtime.

● Each application component should be partition tolerant; it should be able to survive network latency (or loss of communication) among the nodes that support that component.

● Automation tools must be in place to orchestrate application responses to failures or other changes in the infrastructure.
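The first and third tips can be illustrated with a small Python sketch of client-side failover across redundant replicas. It assumes one replica of a component per availability zone. The `Replica` class, the `ZoneUnavailable` error, the `call_with_failover` function, and the zone names are all hypothetical; in practice a load balancer or service mesh would usually perform this role.

```python
class ZoneUnavailable(Exception):
    """Hypothetical error raised when a replica cannot be reached."""


class Replica:
    """Hypothetical stand-in for one copy of a component in one zone."""

    def __init__(self, zone, healthy=True):
        self.zone = zone
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ZoneUnavailable(self.zone)
        return f"{request} served from {self.zone}"


def call_with_failover(replicas, request):
    """Try each replica in turn; fail only if every zone is unreachable."""
    failed = []
    for replica in replicas:
        try:
            return replica.handle(request)
        except ZoneUnavailable:
            failed.append(replica.zone)  # remember the zone and move on
    raise ZoneUnavailable(f"all zones failed: {failed}")


replicas = [
    Replica("zone-a", healthy=False),  # simulated zone outage
    Replica("zone-b"),
    Replica("zone-c"),
]
print(call_with_failover(replicas, "GET /status"))
# GET /status served from zone-b
```

The caller never assumes that a particular zone is up; the request fails only when every redundant copy is unreachable.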

Use a Chaos Monkey: the best way to avoid failure is to fail constantly. From an early stage, Netflix used Chaos Monkey, a piece of software that randomly kills off different services and features in Netflix in order to assess how well recovery works. Initially it was used only in testing, but it is now run randomly in production. Quoting from the Netflix blog: “If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.”
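The principle can be sketched as a toy simulation in Python. This is not Netflix's implementation; the `Service` class, the `recover` function, and the `chaos_round` function are hypothetical, and "killing" an instance here only decrements a counter. Each round terminates one instance of a randomly chosen service, checks whether every service is still available, and then lets a stand-in for automated recovery restore the desired instance counts.

```python
import random


class Service:
    """Hypothetical service made up of interchangeable instances."""

    def __init__(self, name, desired):
        self.name = name
        self.desired = desired  # instance count that recovery restores
        self.running = desired

    def kill_one(self):
        if self.running > 0:
            self.running -= 1

    def is_available(self):
        return self.running > 0


def recover(services):
    """Stand-in for automated recovery: restore the desired instance counts."""
    for service in services:
        service.running = service.desired


def chaos_round(services, rng):
    """Kill one instance of a random service; report whether all stayed up."""
    victim = rng.choice(services)
    victim.kill_one()
    survived = all(s.is_available() for s in services)
    recover(services)
    return victim.name, survived


rng = random.Random(42)  # seeded so that runs are repeatable
services = [
    Service("api", desired=3),
    Service("search", desired=2),
    Service("billing", desired=1),  # no redundancy: one kill is an outage
]
for _ in range(5):
    name, survived = chaos_round(services, rng)
    print(f"killed one instance of {name}: {'ok' if survived else 'OUTAGE'}")
```

Any round that picks the single-instance service reports an outage, which is the kind of weakness such an exercise is meant to expose before a real failure does.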
