Uptime is primetime

At PipelineDeals, we are product engineers. We work on a 12-year-old codebase that powers a SaaS product that tens of thousands of our customers spend a large portion of their day using, and rely upon to drive and maintain their relationships with their own customers. We have all sorts of metrics to track how we're doing as an engineering team. Reigning supreme above all others is uptime.

What’s my uptime?

In the early days of the web, uptime was pretty straightforward. Everyone was publishing monolithic apps that either let you log in and do your thing or didn't. The URL either resolved correctly and presented a working application, or it didn't. If it failed, it would show a 500 page or a browser DNS error page, or it would spin indefinitely and never do anything.

These days, large and complex apps are typically delivered as a set of networked but independent services, each performing their task and communicating with others. At PipelineDeals, we began with a monolithic application from which we extracted several services, each with their own persistent storage and responsibilities.

From a systems engineering perspective, this approach gives us new tools to help us provide a higher guarantee of availability than is possible under a monolithic delivery model.



Circuit Breakers

The boundaries between services, where one part of the application is depending on data or functionality provided by another, can be hardened against cascading failure using a circuit breaker. This is a piece of logic that will be invoked if there is a network error, internal server error, etc. that prevents a service from responding in a timely manner. This code will return a default value to the client service, allowing execution of the user’s web request to continue (albeit with a degraded or eventually-consistent state).

For example, one of the core services that powers PipelineDeals is responsible for billing and subscriptions. When the main app needs to know which features a particular user or account has access to, it must make an internal API call to this billing service. What if the database server this billing service talks to has a hardware failure? 

Well, since we have a circuit breaker installed in the code that handles this communication, the main app will detect that requests to the billing API are currently timing out or failing. Noticing this, it will substitute a standard set of features instead of the realtime set that would be returned from the billing service, and continue with serving the user’s request.

Without the circuit breaker, a failure in the billing service would have cascaded into failures in all services dependent on it. In this naive case, the app’s uptime is only as good as the uptime of the least reliable service in the request’s critical path through the system. 
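
Here's a minimal sketch of the idea in Ruby. The class name, the billing host, and the default feature set are illustrative rather than our actual implementation; the point is the shape: a wrapper around the API call that counts failures, opens after a threshold, and serves a safe default until a cooldown has passed.

  require 'net/http'
  require 'json'

  # Minimal circuit-breaker sketch (illustrative names throughout).
  # After `threshold` consecutive failures the breaker opens and we serve a
  # default feature set until `cooldown` seconds have passed.
  class BillingFeatures
    DEFAULT_FEATURES = { 'deals' => true, 'advanced_reporting' => false }.freeze

    def initialize(threshold: 3, cooldown: 60)
      @threshold = threshold
      @cooldown  = cooldown
      @failures  = 0
      @opened_at = nil
    end

    def features_for(account_id)
      return DEFAULT_FEATURES if open?

      response = Net::HTTP.start('billing.internal', 80, open_timeout: 1, read_timeout: 1) do |http|
        http.get("/accounts/#{account_id}/features")
      end
      @failures = 0
      JSON.parse(response.body)
    rescue StandardError
      record_failure
      DEFAULT_FEATURES
    end

    private

    def open?
      return false unless @opened_at
      return true if Time.now - @opened_at < @cooldown

      @opened_at = nil  # cooldown elapsed: let the next call probe the service
      false
    end

    def record_failure
      @failures += 1
      @opened_at = Time.now if @failures >= @threshold
    end
  end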



Early Warning System

Service-based apps have some failure modes that you don't see in monolithic apps. Chief among these is the cascading failure. One component will start receiving more load than it can handle, causing requests to it from other components to fail or hang around waiting for a response. These delays or failed requests will pile up in these client components, causing retries and degraded service to their clients, and so forth. Before you know it, the whole web of interdependent services will seize up and start refusing or failing requests from the web app and customer API calls, causing a downtime event.

There’s an opportunity here, as well. Well-designed services have well-defined boundaries, and it’s at these boundaries that we can look for trouble (any good monitoring service will make this easy to set up) and alert early. Data points that we’ve found helpful to monitor include queue job count, error rates and API request 95th percentile response times.
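
To make that concrete, here's a rough Ruby sketch of the kind of boundary check we're describing. The threshold values are invented for illustration, and in practice most of these metrics come from your monitoring service rather than being computed in-process; the Sidekiq queue depth is the one number that's easy to read directly.

  require 'sidekiq/api'

  # Illustrative thresholds only; real alerting belongs in the monitoring service.
  QUEUE_DEPTH_LIMIT = 5_000   # jobs waiting
  ERROR_RATE_LIMIT  = 0.02    # 2% of requests
  P95_LATENCY_LIMIT = 1.5     # seconds

  def p95(samples)
    return 0.0 if samples.empty?
    sorted = samples.sort
    sorted[(sorted.length * 0.95).ceil - 1]
  end

  def boundary_alerts(error_rate:, response_times:)
    alerts = []
    alerts << 'queue backing up'    if Sidekiq::Queue.new('default').size > QUEUE_DEPTH_LIMIT
    alerts << 'elevated error rate' if error_rate > ERROR_RATE_LIMIT
    alerts << 'slow responses'      if p95(response_times) > P95_LATENCY_LIMIT
    alerts
  end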

Responding to these alerts early enough will allow a team to avoid a full-blown downtime by, at worst, shutting down services not critical to the core product during an emergency. Happy days!



Defence in Depth

We can also look at the monitoring and alerting infrastructure in a top-down way. Failures can occur not only across components but also at different layers within a single component. Imagine you suddenly start seeing elevated 5xx responses to user-facing web transactions. What's the cause? Depending on your infrastructure and setup, this could point to a problem with your DNS setup, load balancers, application host, web server, routing layer, database, in-memory object storage, etc.

In an emergency situation, the less probing in the dark we have to do, the better. Setting up monitoring at every level of the stack is an excellent way to cut down this search space. For example, if our ping tests report that everything is fine but application error monitoring is showing elevated error responses, it's reasonable to conclude that everything upstream of your app servers is functioning correctly and that the likely culprit is a recent code change.
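
As a sketch of what "every level of the stack" means in practice, the probes below walk from the outside in and report the first layer that fails. The hostname, port, and health endpoint are hypothetical, and real probes would live in your monitoring service rather than a script.

  require 'net/http'
  require 'socket'
  require 'uri'

  # Hypothetical top-down probes, ordered from the outermost layer inwards.
  LAYER_PROBES = {
    'DNS'           => -> { Addrinfo.getaddrinfo('app.pipelinedeals.com', 443).any? },
    'load balancer' => -> { TCPSocket.new('app.pipelinedeals.com', 443).close || true },
    'app server'    => -> { Net::HTTP.get_response(URI('http://localhost:8080/health')).is_a?(Net::HTTPSuccess) },
  }.freeze

  def first_broken_layer
    LAYER_PROBES.each do |layer, probe|
      healthy = begin
        probe.call
      rescue StandardError
        false
      end
      return layer unless healthy
    end
    nil # everything upstream looks fine; suspect the application itself
  end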



Cost / Benefit

So, what are all these handy dashboards and fancy infrastructure going to cost you? It’s a sliding scale. There are engineering teams working at all points on the spectrum between a single monolithic app and a constellation of microservices. Having said that, there is no question that there are additional fixed and marginal costs of delivering your app as a set of networked services.

Most easily measurable are the direct costs of extra servers and hosting infrastructure. We usually want redundancy at the level of the hardware that the service is running on, so most production setups will deploy one service per server instance. The extra database hardware to isolate storage per service is another significant cost, as is the data transferred between service APIs.

The cost incurred for your engineering team’s time is much harder to quantify. Even assuming everyone on the team has the DevOps chops to isolate and debug production issues, the resolution of those issues will necessarily be more complex as they involve more moving parts than under the original monolithic approach.

Do I need it?

Distributed application architectures are not a silver bullet for all that ails your application. In fact, shoehorned into a deployment where they don't make sense, they will multiply your problems. Best practice is to begin your app with a simple, monolithic deployment model that's quick to iterate on and develop new features for. As your app acquires paying customers, you're building a business case for investing the time and money required to improve the resiliency of the infrastructure the business relies on. Even at this point, that may not mean a fully distributed service model as we've discussed above – it's an art as much as it is a science.

The bus that couldn’t slow down

Consider a solitary gold miner. 99% of the time spent mining for gold is at the face, making incremental progress. A vein followed here, a dead end routed around there. Then there are those rare moments when the only way to make further progress is to make a lot at once – blow the face away to discover what lies beyond the blocked shaft.

Too tortured a metaphor? Perhaps. But keeping a software product going is a lot like this. Most of the time you’ll make progress in bits and pieces, and once in a while you’ll take a bigger jump. The following is a discussion of how the happy band of hackers at PipelineDeals took one of these jumps recently, and how our infrastructure and deployment setup made that possible without any customers noticing.

The road to here

A year and a bit ago, we were in a very different place. PipelineDeals was running Ruby 1.9.3 and Rails 2.3, and we were using Jammit for asset compilation. Being on so ancient a version of Rails was the most pressing pain point – it locked us into older versions of the gems we depend on, as well as stopping us from ditching Jammit for the Rails asset pipeline. What we had worked well enough, but we knew we were living on borrowed time.

We had one goal:

Set up and use a repeatable process for major upgrades of pieces of our software stack, such that customers don’t even notice we did it.

That means no 'log out and log in again', no 'clear your cache, please!', and certainly no downtime (scheduled or otherwise). We decided that upgrading to Rails 3.0 would be our first big bite out of the technical debt sandwich, and it's that instance I'll be covering here.

Deployments: A new hope

We’ve blogged about our deployment strategy before – we love it and it gives us a ton of flexibility. Turns out that this flexibility is crucial to accomplishing what we set out to do. Because our infrastructure and deployment logic is just code, this repeatable process starts out with a pull request against our Ansible playbook repository.

Step 0: two lanes

We run two app servers behind our load balancer in production, pointing at app.pipelinedeals.com.

Our infrastructure PR changes our deployment process so that, alongside those two production app servers, each build also stands up an app server running the rails3 upgrade branch, reachable at rails3.pipelinedeals.com.

We're now at a place where we can test rails3 in production by using rails3.pipelinedeals.com instead of app.pipelinedeals.com, noticing and fixing errors that occur exclusively under rails3 (using New Relic), while regular deployments are not affected. Step 0 achieved. This feedback loop, where we dogfood our upgrade and make it available to ourselves and the rest of the company to harden before a real customer sees it, is the heart of this process.

Step 1: get a job

Our app does a lot of background processing. Imports, exports, sending emails, bulk actions, etc. – there's a lot going on outside of a user's request-response cycle. Alongside our 2 app servers, our deployments also stand up 2 queue servers running sidekiq, which share a single redis instance to allow jobs to be executed on any machine. To be confident in our upgrade, we'll need to send some jobs to a rails3 queue server to see what breaks. We add to our PR above so that a build also includes a queue server running the rails3 upgrade branch.
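
The relevant configuration is small. A sketch (the environment variable name is made up, and the real initializer has more in it):

  # config/initializers/sidekiq.rb
  # Every queue server and app server points at the same Redis, so a job
  # enqueued anywhere can be picked up by whichever sidekiq process is free,
  # including, during the upgrade, the one running the rails3 branch.
  Sidekiq.configure_server do |config|
    config.redis = { url: ENV.fetch('SHARED_REDIS_URL') }
  end

  Sidekiq.configure_client do |config|
    config.redis = { url: ENV.fetch('SHARED_REDIS_URL') }
  end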

Step 2: (partial) showtime

We’ve de-risked the deploy to customers as much as we can, and now it’s time to go live. We do this with a small change: on deploy, we move the rails3 app server to the production load balancer alongside our other 2 app servers.

From now on, customers will have a 1 in 3 chance of their next request being served by a rails3 server. This is where the rubber meets the road, and we find out how good a job we did weeding out the bugs. Since every step in our deployment process is just a method call (invoking an Ansible playbook under the hood), it only takes a few seconds to yank the rails3 app server out of rotation if things go very wrong.

Step 3: Nothing to see here

By this time, we’ve had a few days to observe our upgrade under live traffic. This is where we’ll notice any lingering errors that occur infrequently or in our cron jobs. Once we’ve fixed all we’ve found and the error rates have fallen off, it’s time to party!

The cleanup is uneventful – we merge our branch into master and revert our ansible PR, taking us back to our single deployment path. A little automation goes a long way, and in our case gave us the flexibility to bite off only as much as we could chew.

Red/Black Deployments at PipelineDeals

Martin Fowler’s post on BlueGreenDeployment gives a name to a deployment practice that is used by many different organizations. Our deployment practice is quite similar to the process that Martin describes, with a few distinct differences.

In his post, he describes using two identical stacks, one of which is the hot stack, servicing production requests. The other is the warm stack, which has the newest build and can be quickly switched to.

Our deployment process differs slightly from what Martin describes in that we don't keep unused instances up for staging purposes. For us, this isn't a fantastic use of capital or resources. Because we are firm believers in immutable infrastructure, we instead fire up the new instances we need on demand, and retire the old stack after the deployment is complete.

This practice is shared by many other teams, and I'm going to call it Red/Black deployment, as a tribute to Netflix's deployment setup. The way we execute this is by using a combination of tools centered around AWS and the Ansible configuration management tool.

On top of Ansible, we have a tool called Deployer, which listens to Hubot commands run in our Operations Slack channel, and responds accordingly.
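
To give a flavour of how small this glue can be, here's a hypothetical sketch of the Deployer's Hubot-facing side as a Sinatra app: Hubot posts the command it heard in Slack, and Deployer maps it onto an Ansible playbook run. The route, payload shape, and playbook names are all invented for illustration.

  require 'sinatra'
  require 'json'

  # Hypothetical mapping of chat commands onto playbooks.
  PLAYBOOKS = {
    'build'    => 'build.yml',
    'deploy'   => 'deploy.yml',
    'cleanup'  => 'cleanup.yml',
    'rollback' => 'rollback.yml',
  }.freeze

  post '/commands' do
    command  = JSON.parse(request.body.read).fetch('command')
    playbook = PLAYBOOKS.fetch(command) { halt 422, "unknown command: #{command}" }

    # Kick off the playbook in the background so Hubot gets an immediate ack.
    Thread.new { system('ansible-playbook', playbook) }
    "running #{command}"
  end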

The deployment process is a state machine; at any given time the system is in one of three states.

The Hubot commands are the transitions that move the machine from one state to the next.
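
The machine is simple enough to sketch in a few lines of Ruby. The state names here are inferred from the rest of this post (cruise, built, deployed); treat them as illustrative rather than the literal identifiers inside Deployer.

  # Sketch of the deploy state machine. `build` stands up the new generation,
  # `deploy` puts it live, `cleanup` terminates the old servers, and
  # `rollback` swaps the old servers back in.
  TRANSITIONS = {
    cruise:   { build:    :built },
    built:    { deploy:   :deployed },
    deployed: { cleanup:  :cruise,
                rollback: :built },
  }.freeze

  def next_state(current, command)
    TRANSITIONS.fetch(current).fetch(command) do
      raise ArgumentError, "cannot run #{command} while in #{current}"
    end
  end

  next_state(:cruise, :build)      # => :built
  next_state(:deployed, :rollback) # => :built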

Why you should do this

There are a few reasons to consider utilizing this type of deployment. The first is that it gives users a less jarring experience than more traditional approaches. Routing requests to separate infrastructure that is ready to receive them is generally smoother than bouncing application servers and having requests queue up.

Another reason is the rollback option. If things go wrong, you can roll back to literally the exact same hardware you were just previously on.

Third, a Red/Black deployment exercises your configuration management every time you deploy. That keeps the recipes from going stale and potentially breaking over time through lack of use.

We’re not the only engineering team pushing this practice forward. Betterment recently released slides that describe their deployment process, which is very similar to ours. Netflix essentially does the same process with two autoscaling groups.

Drawbacks

One of the biggest drawbacks of a Red/Black deployment strategy is the build time, which currently takes about 8 minutes for us. This limits the number of deployments we can do daily, and eventually will not scale with our pace of development.

Another drawback is the complexity of the deployment machine. It takes a long time to groom your Ansible recipes to the point where they are completely autonomous and reliable.

So if you’re ready, we’ll go through how PipelineDeals implements Red/Black deploys below.

Cruise

The first state is where we live most of the time: cruising along, servicing requests, with no active deploys going on.

Build

To start the deployment process, new servers need to be provisioned.

It all starts with one simple Hubot command.

hubot deploy pld:build

This will do the following:

  1. Hubot will send an API command to our deployer app which is responsible for running the actual Ansible commands.
  2. The deployer app will spin up the instances that make up PipelineDeals: 2 app instances, 2 sidekiq instances, and 2 API instances.
  3. The instances undergo a health check.
  4. If all the instances pass the health check, they will be tagged as new. For example, the healthy app servers that get spun up will be tagged as new-app-server.

Afterwards, the app instances are attached to a test load balancer, and we can run any final sanity checks or tests that absolutely require the production environment (we do our best to minimize this case, but it happens).
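
In aws-sdk terms, the health check, tagging, and test load balancer steps look roughly like this. The instance ids, tag scheme, and load balancer name are placeholders, and credentials and region are assumed to come from the environment.

  require 'aws-sdk'  # aws-sdk v2; region and credentials come from the environment

  ec2 = Aws::EC2::Client.new
  elb = Aws::ElasticLoadBalancing::Client.new

  new_ids = ['i-0123abcd', 'i-0456efgh']  # placeholder instance ids

  # Wait for the freshly built instances to pass their status checks.
  loop do
    statuses = ec2.describe_instance_status(instance_ids: new_ids).instance_statuses
    break if statuses.length == new_ids.length &&
             statuses.all? { |s| s.instance_status.status == 'ok' }
    sleep 15
  end

  # Tag them as the `new` generation and park them on the test load balancer.
  ec2.create_tags(resources: new_ids,
                  tags: [{ key: 'Name', value: 'new-app-server' }])
  elb.register_instances_with_load_balancer(
    load_balancer_name: 'pld-test',
    instances: new_ids.map { |id| { instance_id: id } }
  )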

This is the longest step of the process, taking about 8 minutes to complete.

Deploy

After the build checks have run and any manual verification has been completed, the deploy is ready to go.

The deploy command is very quick, and does the following:

  1. The new servers get attached to the production load balancer.
  2. The old servers immediately get removed from the production load balancer.
  3. All servers get re-tagged: new-app-servers become hot-app-servers, and hot-app-servers become old-app-servers.
  4. The developers responsible will check New Relic and other sources for any anomalies in error rates or response times.
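
The re-tagging in step 3 is just a couple of aws-sdk calls. This sketch assumes the generation lives in a simple Name tag, which is an illustration of the idea rather than our exact tag scheme.

  require 'aws-sdk'  # aws-sdk v2

  ec2 = Aws::EC2::Client.new

  # Look up instance ids by their current generation tag (illustrative scheme).
  def tagged_instances(ec2, value)
    ec2.describe_instances(filters: [{ name: 'tag:Name', values: [value] }])
       .reservations.flat_map { |r| r.instances.map(&:instance_id) }
  end

  def retag(ec2, ids, value)
    return if ids.empty?
    ec2.create_tags(resources: ids, tags: [{ key: 'Name', value: value }])
  end

  hot_ids = tagged_instances(ec2, 'hot-app-server')
  new_ids = tagged_instances(ec2, 'new-app-server')

  # The hot generation becomes old, and the new generation becomes hot.
  retag(ec2, hot_ids, 'old-app-server')
  retag(ec2, new_ids, 'hot-app-server')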

If everything looks good, then the next command, cleanup, is run.

Cleanup

Cleanup is another very fast command. It brings the deployment state back to Cruise by terminating the old app servers.

Whoops! Rollback!

Ruh roh. On the rare occasion where we detect a problem after deploying, we execute the rollback command. This will immediately back out the deploy and put us back into the built state, where we can do further analysis into what happened.

Going forward

If you're already using configuration management and practicing immutable infrastructure, this deployment strategy might help. Not only will it keep your recipes in shape and well tested, it also helps ensure deployments are smooth.

The future is looking good for this deployment method. As tools like Docker mature, they should allow us to reduce the build time to a matter of seconds rather than minutes.

Super-fast deploys using AWS and ELBs

At PipelineDeals, we deploy code frequently, usually 2-3 times per week, and sometimes even more often. As all web application developers know, deploying is sort of a nervous process. I mean, sure, 99.99% of the time everything will go perfectly smoothly. All your tests pass, your deploy to staging went perfectly, all the tickets have been verified. There is no reason to fear hitting the button. And, the vast majority of the time, this is true.

But all web application developers also know that sometimes, there is a snag. Sometimes the fates are against you, and for whatever reason, something goes bust. Perhaps you have a series of sequenced events that must occur to deploy, and one of the events silently failed because the volume that /tmp is mounted on temporarily had a 100% full disk. Perhaps that upload of the new assets to S3 did not work. Perhaps you did not deploy to ALL the servers you needed to deploy to.

And then, the worst happens. For a short period while you are scrambling to revert, your customers see your mistake. They start questioning the reliability of your system. Your mistake (and it is yours, even if some bug in some server caused the problem) is clearly visible to your customers, your bosses, and your peers.

Taking advantage of the tools we have

PipelineDeals runs on Amazon AWS. We utilize EC2 for our server instances, ELB for our load balancing, and ElastiCache for our memcached storage. We are also major proponents of Opscode's Chef, and use it to spin up and configure any type of instance that makes up our stack.

Since we have all these fantastic tools, we decided to use them in a way that makes deploying seamless and easy. We wrote a simple Rakefile called Deployer that orchestrates a seamless app server deploy.

Using the Deployer script

The first thing it does is create new app servers that have the new code on them. Once the app servers have completed their configuration, the Deployer rakefile will register those new app servers with a test ELB load balancer.

Phase 1: new app servers are brought up and registered with the test LB.

From there, we can do a final walkthrough of what exactly is going into production, and confirm that the app servers are indeed up, awake, and ready to receive requests.

After that final validation, we simply run rake deploy, which adds the new app servers to the current load balancers, verifies their health, then removes the old app servers from the production LB. This all runs in about 3 seconds, so the transition is smooth and seamless.
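
A sketch of what that task might look like in the Deployer Rakefile, using the classic ELB API from aws-sdk. The load balancer name and instance ids are placeholders; the real task looks the ids up rather than hard-coding them.

  require 'aws-sdk'  # aws-sdk v2

  desc 'Swap the new app servers into the production ELB'
  task :deploy do
    elb = Aws::ElasticLoadBalancing::Client.new
    new_instances = [{ instance_id: 'i-0123abcd' }, { instance_id: 'i-0456efgh' }]
    old_instances = [{ instance_id: 'i-0999wxyz' }]

    # 1. Add the new app servers to the production load balancer.
    elb.register_instances_with_load_balancer(
      load_balancer_name: 'pld-production', instances: new_instances)

    # 2. Wait until the ELB reports them as InService.
    sleep 1 until elb.describe_instance_health(
      load_balancer_name: 'pld-production', instances: new_instances
    ).instance_states.all? { |s| s.state == 'InService' }

    # 3. Remove the old app servers from rotation.
    elb.deregister_instances_from_load_balancer(
      load_balancer_name: 'pld-production', instances: old_instances)
  end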

During the deploy, new app servers are added to the Prod ELB, then the old app servers are moved out.

If anything was indeed wrong with our code, or we found it was generating an error we did not expect, we can simply run rake rollback, which does the opposite.

Or, if we are completely satisfied that everything looks ok, we can run rake cleanup, which will tag the new app servers as the current production servers and terminate the old app servers.

Takeaways

Originally we designed the Deployer for when we launch large projects, or risky chunks of code. But I have found that we have started using the Deployer for nearly every deploy, because it is so easy.

If your company utilizes Chef, EC2, and ELB, check out the deployer. It might work great for your deployment workflow!