The Beds are Burning - When You Fail to Manage the Risks in Your Environment

risk aws management

When Systems Fail Terribly

Eight Sleep make a “cloud powered” sleep system that keeps your bed at the desired temperature for a better night’s sleep. They recently suffered a critical outage caused by an AWS outage. Their systems failed in such a way that customers found their beds heating up uncontrollably or cooling down as far as possible, and those with the higher-tier offering had their beds moving uncontrollably.

Of course the CEO apologised, but you have to ask: how could they possibly design a system so badly that an outage would cause such a serious, and potentially deadly, malfunction? It also flags the challenge (or folly) of designing products that are excessively dependent on the cloud and lack basic functionality that runs locally.

Novel Approaches

Netflix invented a system called “Chaos Monkey” to improve the resilience of their systems. It randomly takes down different parts of their system to ensure that things fail over properly. Their philosophy is that failures should be survivable, and that it’s better for things to fail during business hours when they have full staffing than after hours with a skeleton crew.

We definitely wouldn’t recommend this approach to testing the resilience of your OT systems. It is worth, however, considering each component in your systems and what the impact of its failure would be.

Our Approach

One thing we’ve always pushed back against is having production environments hosted anywhere but on-site. While the WAN links to site have become a lot faster and more reliable since I first started in mining, they are still a key failure point, and that isn’t an acceptable risk for a platform that has to operate 24/7. In my opinion, the benefits of having servers in hi-tech data centres in capital cities don’t outweigh the upsides of having the server on-site.

Fortunately, we don’t see outages very often. The most common cause of LineupBoard going down is a power outage at site where either the generators don’t kick in properly or key routers don’t stay online.

To further improve the resilience of our systems, our next-gen platform, devfu.one, takes a modular approach so that key features operate in isolation, minimising the impact of any one component malfunctioning. Once we’ve moved to this new platform, we’ll be able to work with clients who want redundant instances for High Availability (HA).
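
As a rough illustration of what “operating in isolation” can mean, here is a minimal Python sketch. It is not how devfu.one is implemented; the function `run_isolated` and the module names `lineup_display` and `reporting` are made up for the example. The idea is simply that a fault in one feature is caught at that feature’s boundary and reported, rather than taking everything else down with it.

```python
from typing import Callable, Dict


def run_isolated(modules: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Run every module, containing any failure to the module that raised it."""
    status: Dict[str, str] = {}
    for name, run in modules.items():
        try:
            status[name] = run()
        except Exception as exc:
            # Contain the fault and record it; the remaining modules still run.
            status[name] = f"FAILED: {exc}"
    return status


# Hypothetical modules, for illustration only.
def lineup_display() -> str:
    return "ok"


def reporting() -> str:
    raise RuntimeError("database unreachable")


if __name__ == "__main__":
    print(run_isolated({"lineup_display": lineup_display, "reporting": reporting}))
```

In a real platform the boundary would more likely be a separate process, container, or service than a `try`/`except`, but the principle is the same: decide in advance where a failure stops.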

What We Can Learn

Think about the components of your systems that might fail and consider what the impact of each failure would be. Will your systems cook an operator? Will they fail to safe? How quickly can you get the system back up and operating? How can you avoid a catastrophic failure like the one Eight Sleep experienced? Our systems often have lives at stake, so we must be confident they fail safe.
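
To make “fail to safe” concrete, here is a minimal Python sketch of a local controller that only trusts a remote command while it is fresh, and drops to a known safe output when the link goes quiet. Everything in it is an assumption for illustration: the class name `FailSafeController`, the 30-second timeout, the output limits, and the choice of “0.0 means actuator off” as the safe state. It says nothing about how Eight Sleep’s hardware actually works.

```python
from typing import Optional


class FailSafeController:
    """Follow remote commands only while they are fresh; otherwise fail safe."""

    def __init__(self, timeout_s: float = 30.0, safe_output: float = 0.0,
                 min_out: float = -10.0, max_out: float = 10.0) -> None:
        self.timeout_s = timeout_s      # hypothetical staleness limit
        self.safe_output = safe_output  # hypothetical safe state: actuator off
        self.min_out = min_out          # hard local limits, enforced on-device
        self.max_out = max_out
        self._value: Optional[float] = None
        self._last_seen: Optional[float] = None

    def on_remote_command(self, value: float, now: float) -> None:
        # Clamp locally so a bad remote value can never exceed device limits.
        self._value = max(self.min_out, min(self.max_out, value))
        self._last_seen = now

    def output(self, now: float) -> float:
        # No command yet, or the last command is too old: return the safe state.
        if self._value is None or self._last_seen is None:
            return self.safe_output
        if now - self._last_seen > self.timeout_s:
            return self.safe_output
        return self._value
```

In practice `now` would come from a monotonic clock such as `time.monotonic()`; it is passed in explicitly here so the behaviour is easy to test. The design choice worth noting is that both the limits and the timeout live on the device, so they keep working when the cloud does not.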

It’s unrealistic to have redundant everything, but it’s worth considering where redundancy or secondary systems are justified. Sometimes just having a spare device or cable on a shelf that you know you’ll use eventually is well worth the cost. There’s no one-size-fits-all solution; you’re the expert and need to make the call.

The simplest system, and one we usually manage well, is our power supply. We usually have UPSes to handle minor power outages and generators for when the outage continues. We have processes in place to make sure generators are refuelled until power can be restored. But power is just one part of the chain that can fail.
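
To put a rough number on the “chain” idea: when a system needs every link to be working, and the links fail independently, the overall availability is the product of the individual availabilities. The Python sketch below shows that arithmetic. The function names and the three availability figures are invented for illustration and are not measurements from any real site.

```python
from math import prod
from typing import Iterable


def chain_availability(availabilities: Iterable[float]) -> float:
    """Availability of a series chain, assuming independent failures."""
    return prod(availabilities)


def downtime_hours_per_year(availability: float,
                            hours_per_year: float = 8760.0) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * hours_per_year


if __name__ == "__main__":
    # Made-up figures: power, site router, WAN link.
    links = [0.999, 0.999, 0.99]
    overall = chain_availability(links)
    print(f"overall availability: {overall:.4f}")
    print(f"expected downtime: {downtime_hours_per_year(overall):.0f} h/year")
```

With these made-up figures the chain comes out at roughly 0.988, or about 105 hours of downtime a year, which is worse than any single link. The weakest link dominates, which is why it pays to look beyond the power supply.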

What assumptions are you making in your systems?

And if you were wondering, yes, the AWS failure was DNS.