IT Environments – Top 5 Deployment Metrics
By Christian Meléndez.
Preamble: Following on from our previous blog on Enterprise Release Management bench-marking, Christian talks about the “sister disciple” of deployment management and more specifically the top 5 metrics one could use to better understand and optimize their operations and streamline application lead times and project delivery.
It’s time to talk about the top five deployment metrics that you could use to make wise decisions when improving the lead time of an application. Mapping your value stream is an excellent way to identify where you’re having problems shipping application changes on time. Deployment metrics will tell you when and how much time you need to spend improving the deployment pipeline’s quality and make it more efficient.
When talking about deployments, the key metrics are related to frequency and how much time deployments are taking. But quality is also an important metric. How many deployments fail? Ideally, none. How much time does it take to recover when there’s a failure? How many deployments are on schedule?
There are going to be a lot of reasons why these metrics could be wrong, like spending too much time doing manual tasks. But without all this information it’s hard to know where to focus your efforts.
The Number of Deployments
Just because the current trend is to make several deployments a day doesn’t mean that you must focus on deploying changes even if you don’t need to. The number of deployments you execute says a lot about how frequently you’re changing the application. The more frequently you deploy, the smaller the changes and the risk they carry are. It’s really different to do one big deployment to change the database engine as opposed to changing the database engine little by little—for example by using feature flags.
The best approach to how frequently you should deploy resonates with the saying if it hurts, do it more often. Martin Fowler wrote about this saying, explaining that frequency reduces the difficulty of activities. And we all know that deployments can be both difficult and risky. Most of the time when there’s a downtime it’s because a deployment just occurred. Fearing such downtimes, we decide to accumulate changes or set tight schedules to deploy.
Increasing the number of deployments you do helps you to improve deployments until they become dull and deterministic.
The Time for Each Deployment
If you want to deploy frequently, deployments shouldn’t take too much time to finish. The amount of time a deployment takes is a sign of how complex or simple the steps to deploy a change for the application are. Maybe you need to update a lot of servers, and you’re updating them one by one. Or maybe there are many dependencies that you need to take care of to finish the deployment. For example, you need to update the database or update the HTTP endpoint for the services that depend on the application you deploy.
Time is also a sign of how many manual processes still exist in your deployment pipeline. The more manual labor exist, the riskier a deployment becomes. And that risk only increases when there’s pressure to ship an application change. Everything that can go wrong will go wrong. This is especially true when someone is following a manual or trusting their ability to remember every step. Just one mistake in a deployment, and chances are high that you’ll need to start over again.
So is the answer to improving deployment time getting rid of manual labor with automation? Well, sometimes the effort is worth it. Other times, the problem might be in a tight architecture where every fix could generate a new issue. An application should be able to be deployed independently without having to coordinate too many things.
A deployment that takes a few minutes or less will also allow you to increase the frequency of deployments.
The Number of Failed Deployments
Now let’s say that you’ve been able to (or at least are able to) deploy more frequently and deployments are fast. How many of these deployments do you fail? Hopefully not many. But if you know that the failure rate has increased, you can decide to slow down a little bit and start focusing all your energy on improving the deployment’s quality.
Correlate the number of errors in the application with deployments. The next time a downtime occurs or the availability of the application is not good, you could start by asking if this starting to happen because a deployment just happened.
When the deployment failure rate increases, maybe you need to increase the code coverage, add more integration tests, add performance tests, or implement canary releases. But what I’ve seen happening most of the time is that the quality of deployments decreases because deployments are treated differently in production than other environments. Having differences in the way you deploy your application to all different environments will cause you problems. Deploy the same way everywhere, like using the same scripts, recipes, or playbooks.
Having a low failure rate will increase the confidence in the team and will let you know that you’re on the right path to improve the lead time of your deployments.
The Mean Time to Recover After a Downtime
Every time a deployment happens, there’s always the possibility of causing downtime in the application. Reasons behind a downtime may vary, like missing test cases or a type in the number of servers that have to be updated with the new code. Having 100 percent of availability for a service is expensive, and it isn’t worth it. Instead of trying to enforce how to avoid any downtime, try to focus on recovering as fast as you can when something goes wrong.
Start by setting the baseline to define how the system looks when it’s healthy. Then take action every time something gets outside the parameters of that “normal” state. For example, you might define a service level objective (SLO) to have a latency under 500ms. If, after a deployment, the latency of the system turns out to be above the SLO, you might want to roll back immediately.
Ideally, if you’ve been able to improve the deployment frequency and it doesn’t take too much time for the deployment to finish, the mean time to recover should be small. I mean, it’s just about doing another deployment, ordinary business by this point. It could be about changing the artifact version to a previous one. Remember, the time it takes to deploy a change will indicate how hard or easy it is to make a change in the application.
The best way to keep the mean time to recover small is by practicing the rollback mechanism. When everything is under version control, it gets easier—it’s just about going back to the previous version and deploying again.
The Number of Deployments on Schedule
It doesn’t matter if you’re able to ship application changes rapidly if they’re not on time, live in production. The number of on-schedule deployments is a good metric to motivate the team to improve the deployment pipeline. Also, anyone that’s interested in knowing more about the reasons why the need to introduce automation or infrastructure as code will find this metric helpful.
I’ve worked on projects where we needed to change many things in every deployment. A few of the changes were done manually. There were even times where unexpected problems arose in a testing environment. These problems had a significant impact on the testing phase of the application. We lacked heterogeneal environments. It was possible that something could happen when pushing the application changes to the next environment.
Having deployments that are being done on schedule is a sign that your deployment pipeline is in good condition. You frequently deploy, it takes just minutes, and more importantly, you have a steady quality.
How Are Your Deployment Metrics Doing?
Before you start using these metrics to identify waste and efficiently deployments, you need to know your current state. What are the current numbers for these type of metrics in your workloads? If you don’t know what the current state is, you’ll end up improving processes that don’t add value.
Using these deployment metrics will help you identify where there’s waste in your value stream mapping. You’ll have a clear view of where your efforts should be oriented to improve speed and quality when shipping new features.
Author Christian Meléndez.
Christian is a technologist that started as a software developer and has more recently become a cloud architect focused on implementing continuous delivery pipelines with applications in several flavors, including .NET, Node.js, and Java, often using Docker containers.
03 JULY, 2019 by Justin Reynolds Even since the agile manifesto was published in 2001, software development has never been the same. In a pre-agile world, software was released in monolithic packages every year or every two years. The agile approach to development...
20 MAY, 2019 by Mark Henke It’s a great step when teams deliberately manage their deployments instead of treating them as second-class citizens to writing code. But there are many pitfalls to managing deployments effectively. Many things lurk, waiting to trip us up....
08 MAY, 2019 by Mark Henke Taking enterprise release management seriously is a great step toward helping our organization flourish. Embracing release management will allow us to make the invisible visible. We’ll be able to effectively manage how work flows through our...
04 MAY, 2019 by Rodney Smith If you work in an organization that uses the scaled agile framework (SAFe), chances are it's not a small company. It's enterprise-y. It's probably gone through some growing pains, which is a good problem to have in the business sense. The...
29 APRIL, 2019 by Carlos "Kami" Maldonado "DevOps at scale" is what we call the process of implementing DevOps culture at big, structured companies. Although the DevOps term was coined almost 10 years ago, even in 2018 most organizations still haven't completely...
24 APRIL, 2019 by Mark Robinson It’s the normal case with software buzzwords that people focus so much on what something is that they forget what it is not. DevOps is no exception. To truly embrace DevOps and cherish what it is, it’s important to comprehend what it...