With me never shutting up about it, who could forget my Amazon Web Services disaster? If you’ll recall, I was testing AWS to see if I might find it useful and was hit with a $6700 infestation of Chinese bitcoin miners almost immediately. This was thanks to Amazon creating a pricing system that is pretty stupid for large spenders and evil for regular folks. It also showed us that Amazon can’t be bothered to send a confirmation email when a customer fires up the maximum allowed number of VMs from China hours after the account is created in California. If you think no email is warranted there, then you must accept that the maximum itself is stupid - either way too high or way too low. It also showed us that, because of Amazon’s exceptional value to criminals, search engines can no longer be trusted to do, well, anything.

Those were all good lessons and I’m glad I learned them in a setting where I was educating myself with a controlled test. Last night, however, I learned an extremely valuable AWS lesson in a production situation. Before getting to that, I need to get a bit defensive for a moment. I feel like with my Amazon problems the answer is pretty clear to most people: this kind of stuff happens to idiots; if it happened to you, you’re an idiot. But this is not always so. For example, although I was catastrophically burned by my experience with AWS when I first tried it, I consider it a success to have found such a failure while evaluating the service for exactly that kind of problem. In the same spirit, although I suffered a catastrophic loss last night with AWS, it wasn’t a completely catastrophic loss. By design. There is no way I would put all my eggs in the AWS basket and that is today’s lesson.

I have a client who has a lot of sensors out in the world and the sensors report their sensings which are very valuable. We primarily have the sensors report back to a local system that we physically control. We thought, wouldn’t it be good if we could have a completely redundant system where the sensors could also report? (I’m told that when this was presented to a group of climate scientists as "in case extreme weather takes out all of Southern California" no one considered that possibility hyperbole.)

That seemed like a good job for AWS. And, in fact, it really has been. For almost no cost I have single-handedly implemented a complete replication of our ability to receive all of this data. If the primary receiver dies, this can be an invaluable fallback.

Moving forward in the story, on Christmas Eve (good timing!) I got an email from Amazon saying that within a fortnight the EC2 VM I was using would be rebooted, after enjoying nearly a year’s worth of continuous operation reliably receiving millions of sensor readings. This explains the likely cause, though truly competent VM managers should not have this problem. But ok. Fine.
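(Incidentally, that email isn’t the only way to see this coming. If you have the AWS command line tools installed and configured, scheduled maintenance events show up there too. This is only a sketch; the instance ID is a placeholder.)

    # List any scheduled events (e.g. a system-reboot) pending for an instance.
    aws ec2 describe-instance-status \
        --instance-ids i-0123456789abcdef0 \
        --query 'InstanceStatuses[].Events[]'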

When the time came, I kept an eye on things so I could restart the listening server by hand, because I had never actually prepared for an automatic restart. When I first set it up, it was a pilot test, and then it worked so well that we never touched it. So sure, properly establishing things was overdue. What was odd, though, was that the reboot time came and went and the VM did not actually reboot. Ok. I figured that maybe they had warned me in case they couldn’t avoid a reboot, but apparently it was unnecessary. Lesson one: when AWS tells you they’re going to reboot your VMs, they may, in fact, not reboot them.

Months later, I got another email warning that they were going to reboot my VM. This time they did! The VM came back up fine, uptime reset. That was ok, but it meant I needed to set things up so that my server software started automatically on boot. Last night I was navigating that hornet’s nest: I had created scripts that could start and stop the service, and I felt that if the machine were rebooted, it would come back and resume service without missing a beat. What did I do next? Relax and congratulate myself on a job well done? Of course not. The last thing left to do, I hoped, was to test it.
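For concreteness, the autostart piece amounts to something like the following. This is only a sketch assuming a systemd-based distribution; the unit name, the user, and the path to the listener script are made-up placeholders, and on older upstart/sysvinit images an init script or an @reboot cron entry does the same job.

    # Sketch: register a (hypothetical) sensor listener so it starts at boot.
    sudo tee /etc/systemd/system/sensor-listener.service > /dev/null <<'EOF'
    [Unit]
    Description=Sensor data listener
    After=network.target

    [Service]
    ExecStart=/opt/sensor/listen.sh
    Restart=on-failure
    User=ec2-user

    [Install]
    WantedBy=multi-user.target
    EOF
    sudo systemctl daemon-reload
    sudo systemctl enable sensor-listener.service
    sudo systemctl start sensor-listener.service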

I typed sudo shutdown -r now, which is the proper Linux way to reboot. I was logged out of my session. I waited a couple of minutes and tried to log back in… And I couldn’t. Waited a bit more. Nope. Getting nervous, I fired up the web console. The console showed I had no instances running. None even existed. My security group and key pair were gone. Everything about that VM had just disappeared. I then spent the next hour or so recreating this VM from backups.

[Image: nuked.jpg]

So rebooting was stupid of me, right? Well, maybe not. This morning I fired up a test VM, logged in, and did the exact same thing with shutdown -r. Within a minute it was back up and accessible. I even did shutdown -h, and the machine still appeared in my web console’s instance list, ready to be started again (though in that case it came back with a new IP number).
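(If you’d rather not trust your eyes on the web console, the same state is visible from the command line. Again, the instance ID is a placeholder.)

    # Reports pending / running / stopping / stopped / terminated.
    aws ec2 describe-instances \
        --instance-ids i-0123456789abcdef0 \
        --query 'Reservations[].Instances[].State.Name' \
        --output text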

Looking into it, I did find some AWS documentation that says this.

We recommend that you use Amazon EC2 to reboot your instance instead of running the operating system reboot command from your instance. If you use Amazon EC2 to reboot your instance, we perform a hard reboot if the instance does not cleanly shut down within four minutes.

Ah ha. They conveniently fail to mention that if you do not follow this suggestion, your entire setup may get nuked. Lesson two: even if the shutdown command sometimes does work, you should never use it. I hesitate to recommend removing it outright, since who knows whether the AWS automagical system uses it itself? I’ll probably just make an alias called shutdown to prevent it from ever being run accidentally.
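Something like the following is what I have in mind. It’s a sketch, not gospel: the instance ID is a placeholder, and note that sudo normally ignores your aliases, which is why the sudo alias with a trailing space is there (that trick makes bash expand the word after sudo as an alias too).

    # In ~/.bashrc: neuter the reflex.
    alias sudo='sudo '
    alias shutdown='echo "Not here. Reboot this instance through EC2 instead."'

    # The AWS-sanctioned way, run from any machine with the CLI configured:
    aws ec2 reboot-instances --instance-ids i-0123456789abcdef0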

Finally, we come to the important point of this story. Lesson three: Assume Weak Stability. But that’s ok. At the top of my RAID notes, I say this.

The uncertainty is not if your hard drives will fail. The question is when will they fail? The answer is, at the absolute most inconvenient moment.

And so it is with AWS. The trick is to make the most inconvenient moment not that inconvenient.

Like the hard drive manufacturers, everybody is trying very hard to make things always work, but eventually luck always runs out. This is why you can never completely rely on a single hard drive or a single service from AWS. As with RAID, you must diversify solutions to your critical requirements. For AWS specifically, this means being able to deploy a new copy of the VM very quickly and very easily. Some people believe that if you have to log into the VM to fuss with it, you have failed to make it easy enough to recreate. Obviously all data needs to be backed up aggressively. A subtler point, though not at all a surprising one, is that if you rely on a stable AWS IP number, that will burn you. Although it is not what DNS was designed for, DNS can help provide resiliency to this problem.
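To make "quickly and very easily" concrete, here is a heavily hedged sketch of a redeploy using nothing but the command line. It assumes you have been saving an up-to-date AMI of the VM, it assumes Route 53 is hosting the name (any DNS provider works), and every ID, name, and address in it is a placeholder.

    # 1. Launch a fresh copy of the VM from a previously saved image (AMI).
    aws ec2 run-instances \
        --image-id ami-0123456789abcdef0 \
        --instance-type t2.micro \
        --key-name my-key \
        --security-groups sensor-listener-sg

    # 2. Point a DNS name at the new instance's public IP so the sensors
    #    never have to care that the underlying address changed.
    aws route53 change-resource-record-sets \
        --hosted-zone-id Z0123456789ABC \
        --change-batch '{"Changes": [{"Action": "UPSERT", "ResourceRecordSet":
            {"Name": "sensors.example.com", "Type": "A", "TTL": 60,
             "ResourceRecords": [{"Value": "203.0.113.10"}]}}]}'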

While it was extremely unnerving to see my cloud VM disappear in a cloud of smoke, I forgive AWS for that. I understand that clouds sometimes come with storms. This problem is more severe for small users who cannot easily afford to engineer elaborate mechanisms of redundancy.

Despite letting this go as a learning experience, I’m still not a fan of AWS. The predatory billing arrangements and weak security are what still really bother me. I will start recommending AWS when they solve the following two problems.

  1. I want to be certain that I will not be liable for any more than my credit card limit.

  2. I want to get notified by email if malicious adversaries log in from China to consume maximum resources.

Note that Amazon retail shopping intelligently does both. It’s not like Amazon doesn’t know how to do this properly.