29 Apr 2018 c.e.
On Outages, or Taking Time Off

I recently joined a large software engineering organization, working on a server-side team for their fast-growing consumer product. There are a couple of services that make up the backend; my immediate team works on a set of side-feature services that are very important but not quite on the critical path.

We had our first major outage this week. I'm lucky to work with some very competent engineers who did most of the heavy lifting, but it still took about 72 hours for us to get back to normal.

To help with getting us back up, we ended up enlisting the networking team. The original catalyst for the issue was some long-running connections to an outside service getting shut down by our load balancer, so we needed their help switching to a different web proxy, one that would let us connect directly to our external services without going through the temperamental load balancer. The networking team has been doing a lot of work lately helping our 'front end' server team break their monolithic service into sharded pieces.

One of the engineers made the comment that doing the sharding changes while trying to keep the service up was like 'trying to change the tires on a car going 60 MPH down the highway'.

Let's talk about this for a minute. Our service struggles with our traffic load. On our high-volume days, it's not uncommon for us to 'load shed' 10-30% of traffic so that our servers don't go down completely. This means that during high-traffic periods, 10-30% of the people trying to use our service simply can't.
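To make 'load shedding' concrete, here's a minimal, entirely hypothetical sketch (made-up names and thresholds, nothing like our actual code): when the measured load crosses a threshold, the server turns away a fraction of requests so the rest can still be served.

```python
import random

SHED_FRACTION = 0.2        # hypothetical: turn away ~20% of traffic when overloaded
OVERLOAD_THRESHOLD = 0.85  # hypothetical: fraction of capacity in use before shedding


def should_shed(current_load: float) -> bool:
    """Randomly drop a slice of requests once load crosses the threshold."""
    return current_load > OVERLOAD_THRESHOLD and random.random() < SHED_FRACTION


def handle(path: str, current_load: float) -> tuple[int, str]:
    if should_shed(current_load):
        # 503 tells the client we're up, just not taking this request right now.
        return 503, "shedding load, please retry later"
    return 200, f"handled {path}"
```

The painful part is that which 10-30% gets turned away is effectively a coin flip, which is exactly the 'some people just can't use the service today' experience above.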

Once we've switched to a more robust, sharded architecture, we'll hopefully be much better able to handle the traffic coming in. But switching over to this sharded architecture is really hard to do while the servers are still running, because you have to worry about data updates and insertions happening while you're making these huge infrastructure changes.
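For a flavor of why that's hard, here's a hedged, toy sketch (the textbook dual-write pattern, not a description of our actual migration): while old records are backfilled into the new sharded store, every new write goes to both stores, and reads fall back to the old store for anything the backfill hasn't reached yet.

```python
import zlib

NUM_SHARDS = 4

old_store: dict = {}                                       # the monolithic store
new_shards: list[dict] = [{} for _ in range(NUM_SHARDS)]   # the sharded store


def shard_for(key: str) -> dict:
    # Stable hash so a key always lands on the same shard.
    return new_shards[zlib.crc32(key.encode()) % NUM_SHARDS]


def write(key: str, value: str) -> None:
    # Dual-write: the old store stays authoritative until cutover,
    # but the new shards see every write from now on.
    old_store[key] = value
    shard_for(key)[key] = value


def read(key: str) -> str | None:
    # Prefer the shards; fall back to the old store for rows the
    # backfill hasn't copied over yet.
    shard = shard_for(key)
    return shard[key] if key in shard else old_store.get(key)


def backfill() -> None:
    # Copy historical rows into the shards without clobbering anything
    # written since the dual-write started.
    for key, value in old_store.items():
        shard_for(key).setdefault(key, value)
```

Even in this toy version you can see where it gets hairy: the backfill, the dual-writes, and the eventual cutover all have to agree on ordering, and in a real system that's happening across machines while millions of requests are in flight.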

So... why do we do it? If it's merely a matter of switching over to a new service, why don't we shut the service down for a few minutes, do the switch, and then come back up again? While we're expanding into international arenas, our userbase is still 98% based in the USA. There are slow periods. Surely a planned, communicated, 100% downtime, scheduled for when the majority of our users are asleep, is better than 10-30% of our customers being unable to get things done during our peak hours.

People sleep. Subways shut down. Gas station stores close. Factories have downtime for retooling and repair. What is it about software that its makers have decided we need to perpetuate this illusion of always-on availability? Imagine if GMail went offline every Sunday. Like, that's it, no GMail for anyone on Sunday. I'd personally be pretty screwed because I store so much of my life, my to-dos, and my personal correspondence in GMail. But on the other hand, maybe it's not such a bad thing to have something I can't get to on a Sunday. I'd have to plan ahead. I'd have to figure out something else to look at on Sunday evenings.

Imagine Twitter turning off 'after hours'. You can tweet and read other people's posts from 6am to 10pm every day, but after that you're on your own. Would I miss things? Sure. But I'd also argue that, personally, my life wouldn't be any worse for not being able to tweet asinine comments at 1am.

Internet services have pervaded our lives because they make themselves available, at all hours, in all forms.

So why don't we plan our outages?

I think there's something to be said about the fact that we don't 'plan' our outages. For my team, outages happen more or less every big day. Could you call them planned? No, but they're hella predictable. There's always that slight chance (one growing larger as we continue to improve our load capacity) that nothing will go wrong, that we won't have to shed load or turn off important backend pipelines just so that requests can get served.

Here's the thing though: unplanned outages are 'blameless'. Customers can complain to us about being down, about being unpredictable, but they also just point to the weather gods and go "damn, I guess it's raining today". If, on the other hand, you plan an outage, that gives customers the opportunity to excoriate you. You're weak, and you've admitted it publicly. You're unable to keep up with traffic, and that's a signal to others that all is not well in your ecosystem. So by publishing downtime, you're, in a way, handing competitors and investors information that you'd probably do better keeping to yourself.

So we just keep turning off the service for a random 10-30% of users every big day, and trying to do this Herculean task of changing the tires on a moving automobile.


I started this post off talking about the outage on my team, and veered into a long digression about planned downtime. In case you're wondering what happened to my particular team, we managed to get the service back up, days after the initial web connection problems caused us to more or less shut down. We haven't done a post mortem yet, but one interesting thing about our particular brand of outage is truly how systemic and 'blameless' it was. There wasn't a single point of failure, but rather a systemic series of problems that we had to figure out a solution to. It took us a while, because deploying code is slow, and because running through millions of records to get back on track is even slower. Unlike the problems that plague our front end server team, this wasn't one that we would have seen coming without some planned gamedays. And gamedays are always a nice-to-have until you're in the middle of a multi-day outage.

I feel really lucky that my co-workers are so experienced, and have written enough of the service to have a fairly deep understanding of the underlying structures of the application.

Finally, I'm sure there's a lot more nuance to downtime and outages than I've captured here, like how it takes hours to run an upgrade and sometimes you don't have hours to take a service down. But also, I find it fascinating that software systems, which in some ways are more complicated than metro systems, try really hard never to take time off, whereas the train that runs through my neighborhood is on vacation for repairs every other goddamn weekend.

#server-eng #outages #downtime #how-many-nines