
That Which Does Not Kill You Makes You Stronger ~ CHAOS MONKEY


Stashed in: DevOps, FAIL, Kaizen

From the article:

Chaos Monkey randomly kills instances and services within Netflix's AWS infrastructure to help developers make sure each individual component returns something even when system dependencies aren't responding.

For example, if the recommendation system is down Netflix will display popular titles instead of personalized picks. The quality of the response is degraded, but at least there is a response. Ciancutti explains it this way: "If we aren't constantly testing our ability to succeed despite failure, then it isn't likely to work when it matters most - in the event of an unexpected outage."
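
To make the fallback idea concrete, here is a minimal Python sketch of it. This is not Netflix's code or the real Chaos Monkey. Every name in it (`POPULAR_TITLES`, `personalized_picks`, `with_chaos`, `titles_for`) is made up for illustration, and the "monkey" is just a wrapper that randomly raises an error instead of killing a real AWS instance.

```python
import random
from typing import Callable, List

# Hypothetical fallback data; the article does not say how popular titles are stored.
POPULAR_TITLES = ["Popular Title A", "Popular Title B", "Popular Title C"]


def personalized_picks(user_id: str) -> List[str]:
    """Stand-in for a call to a recommendation service."""
    return [f"Personalized pick for {user_id}"]


def with_chaos(call: Callable[[str], List[str]], kill_rate: float,
               rng: random.Random) -> Callable[[str], List[str]]:
    """Wrap a dependency so it fails at random, imitating a killed instance."""
    def flaky(user_id: str) -> List[str]:
        if rng.random() < kill_rate:
            raise ConnectionError("dependency killed by the monkey")
        return call(user_id)
    return flaky


def titles_for(user_id: str, fetch: Callable[[str], List[str]]) -> List[str]:
    """Return personalized picks, degrading to popular titles on failure."""
    try:
        picks = fetch(user_id)
    except Exception:
        # Degraded response, but still a response.
        return list(POPULAR_TITLES)
    # An empty answer is treated like a failure and gets the same fallback.
    return picks or list(POPULAR_TITLES)


if __name__ == "__main__":
    flaky_fetch = with_chaos(personalized_picks, kill_rate=0.5, rng=random.Random(7))
    for _ in range(5):
        print(titles_for("user-42", flaky_fetch))
```

The point of the wrapper is that the caller never has to care whether the dependency survived: `titles_for` returns a non-empty list either way, and running it against the flaky version is a cheap way to keep checking that.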

Strength and reliability through self-destruction. Sounds kinda insane, right?

Wrong!

At least, it's not insane according to Jeff Atwood, who had this to say about his experience with the Simian of Anarchy:

Every week that went by, we made our system a tiny bit more redundant, because we had to. Despite the ongoing pain, it became clear that Chaos Monkey was actually doing us a big favor by forcing us to become extremely resilient. Not tomorrow, not someday, not at some indeterminate "we'll get to it eventually" point in the future, but right now where it hurts.

Now, none of this is new news; our problem is long since solved, and the Netflix Tech Blog article I'm referring to was posted last year. I've been meaning to write about it, but I've been a little busy. Maybe the timing is prophetic; AWS had a huge multi-day outage last week, which took several major websites down, along with a constellation of smaller sites. Notably absent from that list of affected AWS sites? Netflix.

When you work with the Chaos Monkey, you quickly learn that everything happens for a reason. Except for those things which happen completely randomly. And that's why, even though it sounds crazy, the best way to avoid failure is to fail constantly.

"The best way to avoid failure is to fail constantly."

Technically, that's not avoiding failure. It's embracing it.
