IT Environments – Top 5 Deployment Metrics
By Christian Meléndez.
Preamble: Following on from our previous blog on Enterprise Release Management bench-marking, Christian talks about the “sister disciple” of deployment management and more specifically the top 5 metrics one could use to better understand and optimize their operations and streamline application lead times and project delivery.
It’s time to talk about the top five deployment metrics that you could use to make wise decisions when improving the lead time of an application. Mapping your value stream is an excellent way to identify where you’re having problems shipping application changes on time. Deployment metrics will tell you when and how much time you need to spend improving the deployment pipeline’s quality and make it more efficient.
When talking about deployments, the key metrics are related to frequency and how much time deployments are taking. But quality is also an important metric. How many deployments fail? Ideally, none. How much time does it take to recover when there’s a failure? How many deployments are on schedule?
There are going to be a lot of reasons why these metrics could be wrong, like spending too much time doing manual tasks. But without all this information it’s hard to know where to focus your efforts.
The Number of Deployments
Just because the current trend is to make several deployments a day doesn’t mean that you must focus on deploying changes even if you don’t need to. The number of deployments you execute says a lot about how frequently you’re changing the application. The more frequently you deploy, the smaller the changes and the risk they carry are. It’s really different to do one big deployment to change the database engine as opposed to changing the database engine little by little—for example by using feature flags.
The best approach to how frequently you should deploy resonates with the saying if it hurts, do it more often. Martin Fowler wrote about this saying, explaining that frequency reduces the difficulty of activities. And we all know that deployments can be both difficult and risky. Most of the time when there’s a downtime it’s because a deployment just occurred. Fearing such downtimes, we decide to accumulate changes or set tight schedules to deploy.
Increasing the number of deployments you do helps you to improve deployments until they become dull and deterministic.
The Time for Each Deployment
If you want to deploy frequently, deployments shouldn’t take too much time to finish. The amount of time a deployment takes is a sign of how complex or simple the steps to deploy a change for the application are. Maybe you need to update a lot of servers, and you’re updating them one by one. Or maybe there are many dependencies that you need to take care of to finish the deployment. For example, you need to update the database or update the HTTP endpoint for the services that depend on the application you deploy.
Time is also a sign of how many manual processes still exist in your deployment pipeline. The more manual labor exist, the riskier a deployment becomes. And that risk only increases when there’s pressure to ship an application change. Everything that can go wrong will go wrong. This is especially true when someone is following a manual or trusting their ability to remember every step. Just one mistake in a deployment, and chances are high that you’ll need to start over again.
So is the answer to improving deployment time getting rid of manual labor with automation? Well, sometimes the effort is worth it. Other times, the problem might be in a tight architecture where every fix could generate a new issue. An application should be able to be deployed independently without having to coordinate too many things.
A deployment that takes a few minutes or less will also allow you to increase the frequency of deployments.
The Number of Failed Deployments
Now let’s say that you’ve been able to (or at least are able to) deploy more frequently and deployments are fast. How many of these deployments do you fail? Hopefully not many. But if you know that the failure rate has increased, you can decide to slow down a little bit and start focusing all your energy on improving the deployment’s quality.
Correlate the number of errors in the application with deployments. The next time a downtime occurs or the availability of the application is not good, you could start by asking if this starting to happen because a deployment just happened.
When the deployment failure rate increases, maybe you need to increase the code coverage, add more integration tests, add performance tests, or implement canary releases. But what I’ve seen happening most of the time is that the quality of deployments decreases because deployments are treated differently in production than other environments. Having differences in the way you deploy your application to all different environments will cause you problems. Deploy the same way everywhere, like using the same scripts, recipes, or playbooks.
Having a low failure rate will increase the confidence in the team and will let you know that you’re on the right path to improve the lead time of your deployments.
The Mean Time to Recover After a Downtime
Every time a deployment happens, there’s always the possibility of causing downtime in the application. Reasons behind a downtime may vary, like missing test cases or a type in the number of servers that have to be updated with the new code. Having 100 percent of availability for a service is expensive, and it isn’t worth it. Instead of trying to enforce how to avoid any downtime, try to focus on recovering as fast as you can when something goes wrong.
Start by setting the baseline to define how the system looks when it’s healthy. Then take action every time something gets outside the parameters of that “normal” state. For example, you might define a service level objective (SLO) to have a latency under 500ms. If, after a deployment, the latency of the system turns out to be above the SLO, you might want to roll back immediately.
Ideally, if you’ve been able to improve the deployment frequency and it doesn’t take too much time for the deployment to finish, the mean time to recover should be small. I mean, it’s just about doing another deployment, ordinary business by this point. It could be about changing the artifact version to a previous one. Remember, the time it takes to deploy a change will indicate how hard or easy it is to make a change in the application.
The best way to keep the mean time to recover small is by practicing the rollback mechanism. When everything is under version control, it gets easier—it’s just about going back to the previous version and deploying again.
The Number of Deployments on Schedule
It doesn’t matter if you’re able to ship application changes rapidly if they’re not on time, live in production. The number of on-schedule deployments is a good metric to motivate the team to improve the deployment pipeline. Also, anyone that’s interested in knowing more about the reasons why the need to introduce automation or infrastructure as code will find this metric helpful.
I’ve worked on projects where we needed to change many things in every deployment. A few of the changes were done manually. There were even times where unexpected problems arose in a testing environment. These problems had a significant impact on the testing phase of the application. We lacked heterogeneal environments. It was possible that something could happen when pushing the application changes to the next environment.
Having deployments that are being done on schedule is a sign that your deployment pipeline is in good condition. You frequently deploy, it takes just minutes, and more importantly, you have a steady quality.
How Are Your Deployment Metrics Doing?
Before you start using these metrics to identify waste and efficiently deployments, you need to know your current state. What are the current numbers for these type of metrics in your workloads? If you don’t know what the current state is, you’ll end up improving processes that don’t add value.
Using these deployment metrics will help you identify where there’s waste in your value stream mapping. You’ll have a clear view of where your efforts should be oriented to improve speed and quality when shipping new features.
Author Christian Meléndez.
Christian is a technologist that started as a software developer and has more recently become a cloud architect focused on implementing continuous delivery pipelines with applications in several flavors, including .NET, Node.js, and Java, often using Docker containers.
19 MARCH, 2020 by Michiel Mulders SRE vs DevOps: Friends or Foes? Nowadays, there’s a lack of clarity about the difference between site reliability engineering (SRE) and development and operations (DevOps). There’s definitely an overlap between the roles, even though...
06 MARCH, 2020 by Arnab Roy Chowdhury Top 10 SRE Practices Do you know what the key to a successful website is? Well, you’re probably going to say that it’s quality coding. However, today, there’s one more aspect that we should consider. That’s reliability. There are...
20 FEBRUARY, 2020 by Arnab Row Chowdhury Technically, the world today has advanced to a level we never could’ve imagined a few years ago. What do you think made it possible? We now understand complexities. And how do you think that became possible? Literacy! Since...
14 FEBRUARY, 2020 by Michiel Mulders A site reliability engineer loves optimizing inefficient processes but also needs coding skills. He or she must have a deep understanding of the software to optimize processes. Therefore, we can say an SRE contributes directly to...
07 February, 2020 by Arnab Roy Chowdhury Do you remember what Uncle Ben said to young Peter Parker? “With great power comes great responsibility.” The same applies to companies. At present, businesses hold a huge amount of data—not only the data of a company but also...
17 JANUARY, 2020 by Sylvia Fronczak Site reliability engineering (SRE) uses techniques and approaches from software engineering to tackle reliability problems with a team’s operations and a site’s infrastructure. Knowing the history of SRE and understanding which...