Uptime is primetime

At PipelineDeals, we are product engineers. We work on a 12-year-old codebase that powers a SaaS product that tens of thousands of our customers spend a large portion of their day using, and rely upon to drive and maintain their relationships with their customers. We have all sorts of metrics to track how we’re doing as an engineering team. Reigning supreme above all others is uptime.

What’s my uptime?

In the early days of the web, uptime was pretty straightforward. Everyone was publishing monolithic apps that either let you log in and do your thing or didn’t. The URL either resolved correctly and presented a working application, or it didn’t. If it failed, you would see a 500 page or a browser DNS error page, or the page would spin indefinitely and never do anything.

These days, large and complex apps are typically delivered as a set of networked but independent services, each performing its task and communicating with the others. At PipelineDeals, we began with a monolithic application from which we extracted several services, each with its own persistent storage and responsibilities.

From a systems engineering perspective, this approach gives us new tools to help us provide a higher guarantee of availability than is possible under a monolithic delivery model.



Circuit Breakers

The boundaries between services, where one part of the application is depending on data or functionality provided by another, can be hardened against cascading failure using a circuit breaker. This is a piece of logic that will be invoked if there is a network error, internal server error, etc. that prevents a service from responding in a timely manner. This code will return a default value to the client service, allowing execution of the user’s web request to continue (albeit with a degraded or eventually-consistent state).

For example, one of the core services that powers PipelineDeals is responsible for billing and subscriptions. When the main app needs to know which features a particular user or account has access to, it must make an internal API call to this billing service. What if the database server this billing service talks to has a hardware failure? 

Well, since we have a circuit breaker installed in the code that handles this communication, the main app will detect that requests to the billing API are currently timing out or failing. Noticing this, it will substitute a standard set of features instead of the realtime set that would be returned from the billing service, and continue with serving the user’s request.
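The behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the class, the thresholds, the `DEFAULT_FEATURES` set and the `fetch_features_from_billing` stand-in are all hypothetical names invented for the example.

```python
import time


class CircuitBreaker:
    """Wraps a remote call; serves a fallback while the dependency is unhealthy."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the remote call entirely until the timeout has elapsed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()
            # Timeout elapsed: fall through and allow one trial call ("half-open").
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()       # open (or re-open) the circuit
            return fallback()
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result


DEFAULT_FEATURES = {"deals", "contacts"}  # hypothetical standard feature set


def fetch_features_from_billing(account_id):
    # Stand-in for the internal billing API call; here it always times out.
    raise TimeoutError("billing service did not respond")


breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0)


def features_for(account_id):
    return breaker.call(
        lambda: fetch_features_from_billing(account_id),
        lambda: DEFAULT_FEATURES,
    )
```

Every call to `features_for` returns the default set while billing is down. After the third consecutive failure the breaker stops calling the billing service altogether for 30 seconds, which keeps slow timeouts from piling up in the main app.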

Without the circuit breaker, a failure in the billing service would have cascaded into failures in all services dependent on it. In this naive case, the app’s uptime is only as good as the uptime of the least reliable service in the request’s critical path through the system. 
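To put rough numbers on that, here is a small calculation. The availability figures are made up for illustration, and the arithmetic assumes the services fail independently of each other, which real systems only approximate.

```python
from math import prod

# Hypothetical monthly availability of each service in one request's critical path.
availabilities = {"web app": 0.999, "billing": 0.995, "search": 0.998}

# Without fallbacks, the request succeeds only if every service is up.
# Assuming independent failures, the probabilities multiply.
combined = prod(availabilities.values())

weakest = min(availabilities.values())
print(f"weakest service: {weakest:.3%}")
print(f"whole request path: {combined:.3%}")
```

With these figures the whole path comes out at about 99.2%, below the 99.5% of the weakest service. The least reliable service sets a ceiling, and every additional hard dependency lowers the result further.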



Early Warning System

Service-based apps have some failure modes that you don’t see in monolithic apps. Chief among these is the cascading failure. One component will start receiving more load than it can handle, causing requests to it from other components to fail or hang around waiting for a response. These delays or failed requests will pile up in these client components, causing retries and degraded service to their clients, and so forth. Before you know it, the whole web of interdependent services will seize up and start refusing or failing requests from the web app and customer API calls and cause a downtime event.

There’s an opportunity here, as well. Well-designed services have well-defined boundaries, and it’s at these boundaries that we can look for trouble (any good monitoring service will make this easy to set up) and alert early. Data points that we’ve found helpful to monitor include queue job count, error rates and API request 95th percentile response times.
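As a rough sketch of what such a check might look like if you wrote it by hand (a hosted monitoring service would normally do this for you), the function below evaluates the three data points for one service boundary. The function names and all threshold values are assumptions chosen for the example, not figures from our setup.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def check_boundary(response_times_ms, error_count, request_count, queue_depth,
                   p95_limit_ms=800, error_rate_limit=0.02, queue_limit=1000):
    """Return a list of alert messages for one service boundary (empty if healthy)."""
    alerts = []
    p95 = percentile(response_times_ms, 95)
    if p95 > p95_limit_ms:
        alerts.append(f"p95 response time {p95} ms exceeds {p95_limit_ms} ms")
    error_rate = error_count / request_count if request_count else 0.0
    if error_rate > error_rate_limit:
        alerts.append(f"error rate {error_rate:.1%} exceeds {error_rate_limit:.1%}")
    if queue_depth > queue_limit:
        alerts.append(f"queue depth {queue_depth} exceeds {queue_limit}")
    return alerts
```

The 95th percentile is used rather than the average because a handful of very slow requests can hide behind a healthy-looking mean.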

Responding to these alerts early enough will allow a team to avoid a full-blown downtime by, at worst, shutting down services not critical to the core product during an emergency. Happy days!



Defence in Depth

We can also look at the monitoring and alerting infrastructure in a top-down way. Failures can occur not only across components but also at different layers within a single component. Imagine you suddenly start seeing elevated 5xx responses to user-facing web transactions. What’s the cause? Depending on your infrastructure and setup, this could point to a problem with your DNS setup, load balancers, application host, web server, routing layer, database, in-memory object storage, etc.

In an emergency situation, the less probing in the dark we have to do, the better. Setting up monitoring at every level of the stack is an excellent way to cut down this search space. For example, if our ping tests report that everything is fine but application error monitoring is showing elevated error responses, it’s reasonable to conclude that everything upstream of our app servers is functioning correctly and that the likely culprit is a recent code change.
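The narrowing-down step can be expressed as a tiny routine. This is only a sketch under simplifying assumptions: the layer names and their ordering are illustrative, and each layer is reduced to a single healthy/unhealthy signal.

```python
# Layers ordered from the edge of the system inward; the names are illustrative.
LAYERS = ["dns", "load_balancer", "web_server", "application", "database"]


def narrow_down(check_results):
    """Split the stack into layers verified healthy and the first suspect layer.

    check_results maps a layer name to True (its monitor reports healthy) or
    False (failing). A layer with no monitoring data is treated as suspect,
    because it cannot be ruled out.
    """
    healthy = []
    for layer in LAYERS:
        if check_results.get(layer) is True:
            healthy.append(layer)
        else:
            return healthy, layer  # failing or unmonitored: start looking here
    return healthy, None


ruled_out, suspect = narrow_down({
    "dns": True,
    "load_balancer": True,
    "web_server": True,
    "application": False,
})
print("ruled out:", ruled_out)
print("start looking at:", suspect)
```

The routine does not name the root cause: the problem may sit at the suspect layer or deeper. What it does is rule out everything upstream, and a layer without monitoring is exactly where it stops being able to help.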



Cost / Benefit

So, what are all these handy dashboards and fancy infrastructure going to cost you? It’s a sliding scale. There are engineering teams working at all points on the spectrum between a single monolithic app and a constellation of microservices. Having said that, there is no question that there are additional fixed and marginal costs of delivering your app as a set of networked services.

Most easily measurable are the direct costs of extra servers and hosting infrastructure. We usually want redundancy at the level of the hardware that the service is running on, so most production setups will deploy one service per server instance. The extra database hardware to isolate storage per service is another significant cost, as is the data transferred between service APIs.

The cost incurred for your engineering team’s time is much harder to quantify. Even assuming everyone on the team has the DevOps chops to isolate and debug production issues, the resolution of those issues will necessarily be more complex as they involve more moving parts than under the original monolithic approach.

Do I need it?

Distributed application architectures are not a silver bullet for all that ails your application. In fact, shoehorned into a deployment that doesn’t make sense, they will multiply your problems. Best practice is to begin your app with a simple, monolithic deployment model that’s quick to iterate on and develop new features for. As your app acquires paying customers, you’re building a business case for investing the time and money required to improve the resiliency of the infrastructure the business relies on. Even at this point, that may not mean a fully distributed service model like the one discussed above. It’s an art as much as it is a science.

Virtual Offices at PipelineDeals: How PipelineDeals has Mastered Remote Working

COVID-19 has plunged hundreds of companies across the globe into adopting a mandatory work-from-home policy almost overnight. Many were not prepared to adapt on such a short timeline. PipelineDeals has been working in a partially remote model for over 14 years. In particular, the engineering team, which comprises roughly one third of the company, has always been 100% remote. We have software engineers in the US, Poland, the Dominican Republic, Canada, and more. Coincidentally, the rest of the company, based at our Seattle headquarters, was given a work-from-home option in January 2020. Given the abrupt changes companies and teams are making, I wanted to share the best practices we use to ensure high productivity, high morale, and success.

Remote work success boils down to a great set of tools, the accompanying processes, and a defined culture that together enable above-and-beyond communication. Let’s first talk about the tools.

The Tools

Collaboration

Video conferencing

A high-quality, reliable video conferencing product is essential. This does not have to be expensive. We have been through many different products, such as Google Hangouts and Skype, and now use Zoom. With the earlier products we spent far too much time troubleshooting: intermittent outages, video hanging, and trouble organizing meetings. Zoom, with its Brady Bunch-style view of everyone on the call, changed our lives. It was stable out of the gate and has always been reliable. Like…always. Our productivity in meetings went way up. We don’t think twice about video logistics, and it’s all integrated into our email and calendar clients. A reliable and easy tool is critical.

Chat and Conversations

The next critical piece is a real-time communications vehicle. We use Slack. Slack is not just chat for us; it is our platform for all company communications.

Productivity

Let’s start with productivity. For productivity we live and die by Slack. Here are some of the ways we put it to work:

  • Engineering team technical discussions
  • Product development discussions
  • Customer support help from engineering
  • Product planning discussions
  • Innovation discussions

#daily-updates

Multiple time zones make daily standups impossible, especially since ours span a 9-hour difference. So we moved the standups to Slack and made them asynchronous: people enter their daily updates on their own schedule when their workday is over. This keeps everyone informed on progress, who is working on what, and whether anyone needs help.

#support

Our Customer eXperience team is connected to the engineering team who can help them with difficult customer technical issues in real-time.  Our company is above and beyond when it comes to how we want to treat our customers and this channel is vital to the success of that mission.

DevOps

And then there is development and operations, otherwise known as DevOps. Here are some key channels:

#engineering

All of our design, architecture, and forward looking thinking usually enters through this channel.  Engineers talking tech.

#deployments

We deploy our code to production several times each day, and we use Slack to do it. “Hubot deploy app: [build | deploy | cleanup]” can be seen regularly by the engineers in Slack. Through this channel and “Hubot deploy:status”, everyone knows exactly what is going on.

#operations

All of our cloud operations are monitored, and alerts are delivered through Slack. If our site has an issue, we know first through Slack. Our team uses PagerDuty to be alerted to critical operational issues. While the engineer on call will get an alert through the PagerDuty app, the rest of the team will see the issue in this channel. We dedicate this channel to seeing and fixing issues with performance, security, stability, reliability, and more. This channel is also open to the rest of the company in case others want to see what is going on at an engineering level.

#circleci

Build status. Like any agile team, we practice continuous integration: after code is committed and pushed, automated tests run to ensure nothing has broken. The team watches this channel for the state of the build.

Why So Serious!

We share our music and passions for cooking through Slack

It’s not all work and productivity. Maintaining connection with each other is equally important to the success of the whole remote dynamic. Here are some of the channels that add to this dimension:

#music

We share our favorite bands, albums, songs, and videos

#trackoftheday

Someone picks one unique song from their music collection that they think people will enjoy

#whatscooking

Photos from the dinner tables of our amazing Pipeliner chefs

#holidaycheer

We get festive virtually!

The Rest of the Stack

There are other tools that are specific to our work and the type of company we are: things like product development kanban boards, code repositories, and of course email, calendars, and documentation. Every company has these, and they become even more important while remote.

The Remote Culture

We sat down as a company and wrote down what is important to us, no matter if we are in an office together or distributed across the globe.  This is what we came up with.

Remote culture is built on trust and integrity
  • Quality of Life
  • Positive team dynamic
  • Alignment
  • Collaboration
  • Productivity
  • Progress
  • Reliability
  • Engagement
  • Achieving our goals
  • Overall satisfaction

If we can achieve these things, we are achieving the company culture we desire. So we asked the engineering team, which has been doing this since the inception of the company, what has been important to them. This is what they said:

#1 CONTEXT 

Searchable, text-based record of all decisions and discussions (Slack). Secondly, no side conversations. 1:1 conversations can be started but are moved to #engineering or #operations, so that others can benefit from and refer to the information later.

#2 DAILY UPDATES

Posting simple and short daily update messages: what I did, what I’m working on, any blockers. Easy and searchable way for everyone to know what’s going on without synchronous communication (#daily-updates channel in Slack). 

#3 DISCIPLINE

When communication habits slip, communication simply doesn’t happen: there’s no running into others in the hallway and remembering what you need to tell them.

#4 TRUST

Being trusted to work reliably and professionally from home without supervision inspires people to never break that trust. The flexibility the company grants in organizing days outside of required meetings makes life easier, and in return employees have no issue with checking in after hours, doing the PagerDuty on-call rotation, etc. People have a sense of ownership over our platform, and it’s as much a personal mission as a professional one to make sure it NEVER GOES DOWN.

Wrapping it all up

There is no replacement for in-person, spontaneous conversation across a desk or in the work kitchen. The depth and richness of human connection is undeniable. But a similar world can in fact be achieved in a virtual work environment. PipelineDeals has spent 14 years providing superior customer service and product delivery using the remote model. It works, and the proof is in the product and our customers. Remote work keeps our employees out of their cars and off the buses, which has a huge impact on our environment, and it gives them more time with their families each day. Everyone has their personal list of benefits. If you have questions about being successful as a newly remote team or company, give me a shout!

 
