Uptime is primetime

At PipelineDeals, we are product engineers. We work on a 12-year-old codebase that powers a SaaS product that tens of thousands of our customers spend a large portion of their day using, and rely upon to drive and maintain their relationships with their own customers. We have all sorts of metrics to track how we’re doing as an engineering team. Reigning supreme above all others is uptime.

What’s my uptime?

In the early days of the web, uptime was pretty straightforward. Everyone was publishing monolithic apps, and an app either let you log in and do your thing or it didn’t. The URL either resolved correctly and presented a working application, or it didn’t. If it failed, you would see a 500 page or a browser DNS error page, or the page would spin indefinitely and never load.

These days, large and complex apps are typically delivered as a set of networked but independent services, each performing its own task and communicating with the others. At PipelineDeals, we began with a monolithic application from which we extracted several services, each with its own persistent storage and responsibilities.

From a systems engineering perspective, this approach gives us new tools to help us provide a higher guarantee of availability than is possible under a monolithic delivery model.



Circuit Breakers

The boundaries between services, where one part of the application depends on data or functionality provided by another, can be hardened against cascading failure using a circuit breaker. This is a piece of logic that is invoked when a network error, an internal server error, or a similar fault prevents a service from responding in a timely manner. It returns a default value to the client service, allowing execution of the user’s web request to continue (albeit with a degraded or eventually-consistent state).

For example, one of the core services that powers PipelineDeals is responsible for billing and subscriptions. When the main app needs to know which features a particular user or account has access to, it must make an internal API call to this billing service. What if the database server this billing service talks to has a hardware failure? 

Well, since we have a circuit breaker installed in the code that handles this communication, the main app will detect that requests to the billing API are currently timing out or failing. Noticing this, it will substitute a standard set of features instead of the realtime set that would be returned from the billing service, and continue with serving the user’s request.
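To make the idea concrete, here is a minimal sketch of a circuit breaker in Python. It is an illustration under our own assumptions, not the code we run in production: the class, the `fetch_features_from_billing` function, the feature names and the thresholds are all hypothetical.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: after repeated failures, skip the call and
    return a fallback until a cool-down period has passed."""

    def __init__(self, fallback, max_failures=3, reset_after=30.0,
                 clock=time.monotonic):
        self.fallback = fallback          # value returned while the service is unhealthy
        self.max_failures = max_failures  # consecutive failures before the circuit opens
        self.reset_after = reset_after    # seconds to wait before trying the service again
        self.clock = clock                # injectable so the cool-down can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed (healthy)

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return self.fallback      # open: do not touch the failing service
            self.opened_at = None         # cool-down over: allow one trial call
        try:
            result = func(*args, **kwargs)
        except (TimeoutError, ConnectionError):
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return self.fallback          # degrade instead of failing the request
        self.failures = 0                 # a success resets the failure count
        return result


# Hypothetical standard feature set served when billing is unreachable.
DEFAULT_FEATURES = frozenset({"contacts", "deals"})

billing_breaker = CircuitBreaker(fallback=DEFAULT_FEATURES)


def fetch_features_from_billing(account_id):
    """Stand-in for the internal billing API call; here it always times out."""
    raise TimeoutError("billing service did not respond")


# The user's request carries on with the default features.
features = billing_breaker.call(fetch_features_from_billing, 42)
```

Once the failure threshold is reached, the breaker stops calling the billing service altogether for the cool-down period, which also gives the struggling service room to recover.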

Without the circuit breaker, a failure in the billing service would have cascaded into failures in all services dependent on it. In this naive case, the app’s uptime is at best as good as the uptime of the least reliable service in the request’s critical path through the system.
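A quick back-of-the-envelope calculation shows how this plays out. If we add the assumption that services fail independently of each other, the availability of a request path is the product of the availabilities of every service on it, so the least reliable service sets an upper bound rather than the exact figure. The numbers below are made up for illustration.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a request path that needs every listed service to be up,
    assuming the services fail independently of one another."""
    return prod(availabilities)


# Hypothetical availabilities for three services on one request's critical path.
path = [0.999, 0.995, 0.999]
combined = serial_availability(path)

print(f"weakest link: {min(path):.3%}, whole path: {combined:.3%}")
```

With these figures the whole path comes out at roughly 99.3%, lower than any single service on it.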


Early Warning System

Service-based apps have some failure modes that you don’t see in monolithic apps. Chief among these is the cascading failure. One component will start receiving more load than it can handle, causing requests to it from other components to fail or hang around waiting for a response. These delays or failed requests will pile up in these client components, causing retries and degraded service to their clients, and so forth. Before you know it, the whole web of interdependent services will seize up and start refusing or failing requests from the web app and customer API calls and cause a downtime event.

There’s an opportunity here, as well. Well-designed services have well-defined boundaries, and it’s at these boundaries that we can look for trouble (any good monitoring service will make this easy to set up) and alert early. Data points that we’ve found helpful to monitor include queue job count, error rates and API request 95th percentile response times.
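As a rough sketch of what such checks look like at a single service boundary, the Python below evaluates the three data points mentioned above. In practice a monitoring service does this for you; the function names and every threshold here are assumptions chosen for illustration, not recommendations.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it. Expects a non-empty list and an integer pct."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def boundary_alerts(response_ms, error_count, request_count, queued_jobs,
                    max_p95_ms=800, max_error_rate=0.02, max_queued_jobs=1000):
    """Return a list of alert messages for one service boundary.
    All thresholds are illustrative defaults."""
    alerts = []
    if response_ms:
        p95 = percentile(response_ms, 95)
        if p95 > max_p95_ms:
            alerts.append(f"p95 response time {p95} ms exceeds {max_p95_ms} ms")
    error_rate = error_count / request_count if request_count else 0.0
    if error_rate > max_error_rate:
        alerts.append(f"error rate {error_rate:.1%} exceeds {max_error_rate:.1%}")
    if queued_jobs > max_queued_jobs:
        alerts.append(f"{queued_jobs} queued jobs exceeds {max_queued_jobs}")
    return alerts


# Example: a couple of very slow requests push the p95 over the threshold.
alerts = boundary_alerts(response_ms=[120, 140, 95, 2300, 2500],
                         error_count=3, request_count=200, queued_jobs=40)
for message in alerts:
    print(message)
```

We track the 95th percentile rather than the average because a handful of slow requests is often the first visible sign that a component is starting to struggle, and an average hides them.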

Responding to these alerts early enough allows a team to avoid a full-blown outage; at worst, the team shuts down services that are not critical to the core product for the duration of the emergency. Happy days!


Defence in Depth

We can also look at the monitoring and alerting infrastructure in a top-down way. Failures can occur not only across components but also at different layers within a single component. Imagine you suddenly start seeing elevated 5xx responses to user-facing web transactions. What’s the cause? Depending on your infrastructure and setup, this could point to a problem with your DNS setup, load balancers, application host, web server, routing layer, database, in-memory object storage, etc.

In an emergency situation, the less probing in the dark we have to do, the better. Setting up monitoring at every level of the stack is an excellent way to cut down this search space. For example, if our ping tests report everything is fine but application error monitoring is showing elevated error responses, it’s reasonable to conclude that everything upstream of our app servers is functioning correctly and that the likely culprit is a recent code change.
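The reasoning in that example can be sketched in a few lines of Python. The layer names and their ordering are assumptions for illustration; a real stack will have its own layers and its own checks.

```python
# Layers ordered from the outermost (closest to the user) to the innermost.
# These names are illustrative, not a description of any particular stack.
LAYERS = ["dns", "load_balancer", "web_server", "application", "database"]


def first_failing_layer(check_results):
    """Given {layer: True if its monitor reports healthy}, return the
    outermost failing layer, or None if no monitored layer is failing.
    Layers with no monitor are skipped, which is exactly the blind spot
    that monitoring every level of the stack is meant to remove."""
    for layer in LAYERS:
        if check_results.get(layer) is False:
            return layer
    return None


# Ping tests pass all the way to the web server, but the app reports errors.
suspect = first_failing_layer({
    "dns": True,
    "load_balancer": True,
    "web_server": True,
    "application": False,
    "database": True,
})
print(suspect)
```

Everything outside the suspect layer is reporting healthy, so the search can start at the application itself, for instance with the most recent deploy.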



Cost / Benefit

So, what are all these handy dashboards and fancy infrastructure going to cost you? It’s a sliding scale. There are engineering teams working at all points on the spectrum between a single monolithic app and a constellation of microservices. Having said that, there is no question that there are additional fixed and marginal costs of delivering your app as a set of networked services.

Most easily measurable are the direct costs of extra servers and hosting infrastructure. We usually want redundancy at the level of the hardware that the service is running on, so most production setups will deploy one service per server instance. The extra database hardware to isolate storage per service is another significant cost, as is the data transferred between service APIs.

The cost incurred for your engineering team’s time is much harder to quantify. Even assuming everyone on the team has the DevOps chops to isolate and debug production issues, the resolution of those issues will necessarily be more complex as they involve more moving parts than under the original monolithic approach.

Do I need it?

Distributed application architectures are not a silver bullet for all that ails your application. In fact, shoehorned into a deployment where they don’t make sense, they will multiply your problems. Best practice is to begin your app with a simple, monolithic deployment model that’s quick to iterate on and develop new features for. As your app acquires paying customers, you’re building a business case for investing the time and money required to improve the resiliency of the infrastructure the business relies on. Even at this point, that may not mean a fully distributed service model as we’ve discussed above; it’s an art as much as it is a science.
