5 Red Flags Deployment Management Is Failing
by Mark Henke
It’s a great step when teams deliberately manage their deployments instead of treating them as second-class citizens to writing code. But there are many pitfalls to managing deployments effectively. Many things lurk, waiting to trip us up. We want to start with a guiding light: to automate our deployments completely from a push of code up to production. Every step we take toward automation makes deployment management easier. We’ll reduce human error and save loads of time. We’ll allow our customers to trust us. In the spirit of complete and utter automation, I give you five red flags that we can use to find and root out obstacles to our deployment management.
1. Manual Steps and Approvals
Our first red flag is a fairly obvious one to those familiar with continuous delivery: manual steps and approvals. Every one of these steps is a spot of darkness against the guiding light of automation.
Manual steps are clear obstacles to automated deployments. They continue to be a source of human error and slowdowns. It’s common when practicing deployment management to automate some steps. But it’s easy for a team to lose steam and give up on automating all of them. How hard it is to automate some of these steps can depend on the maturity of your organization. It can be especially tricky to deal with them when the steps require tooling or infrastructure that isn’t yet in place.
Many manual steps come in the form of supervisor or compliance approvals. The most insidious of these is when someone outside of the software team must approve deployments. Oftentimes the person approving has no grasp of what’s going to production. This makes such approvals only an illusion of safety. These can be tricky to root out because they’re outside the direct influence of the team.
Dealing With Manual and External Steps
With perseverance and data to make our case, we can drive out manual steps and external approvals. Management loves talking about money, and you can show how these approvals cost your organization. First off, they increase lead time significantly. The cost of an approval handoff is one of the largest costs across an enterprise. Additionally, human error is still in effect. If you show how defects can still escape into production with approvals, you weaken the reason for their existence. For manual steps, forecast configuration error reductions. Show how much time you can save per deployment by automating the manual. You can even strengthen your case by eliminating the next red flag we’ll discuss.
2. High Error Rates per Deployment
A high error rate in a team’s application is another red flag that many may consider obvious. This can include actual error responses in the application or defects that break an application’s service level agreements or objectives. High error rates indicate that a team needs to build more quality into its deployment pipeline. This could mean automating certain manual steps, as we discussed above. It often means adding more tests into the deployment pipeline.
One of the more counterintuitive ways to deal with this is to “slow down” per user story and bake in more testing. It could also mean putting in more resiliency-focused code or focusing more on edge cases. Practices like behavior-driven development really help bake this quality in to keep your error rates per deployment low.
Ensure you diligently measure this metric when managing deployments.
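As a minimal sketch of what measuring this could look like, the snippet below computes an error rate for each deployment and flags the ones above a threshold. The `DeploymentStats` shape, the 1% threshold, and the sample numbers are all assumptions for illustration; your monitoring tool will define its own fields and your team its own limits.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeploymentStats:
    """Request counts observed while one deployment was live (hypothetical shape)."""
    deployment_id: str
    total_requests: int
    failed_requests: int


def error_rate(stats: DeploymentStats) -> float:
    """Fraction of requests that failed; 0.0 when no traffic was seen."""
    if stats.total_requests == 0:
        return 0.0
    return stats.failed_requests / stats.total_requests


def deployments_over_threshold(history: List[DeploymentStats],
                               threshold: float = 0.01) -> List[str]:
    """Return the ids of deployments whose error rate exceeds the threshold."""
    return [s.deployment_id for s in history if error_rate(s) > threshold]


# Made-up sample data, purely for illustration.
history = [
    DeploymentStats("deploy-101", total_requests=20_000, failed_requests=40),
    DeploymentStats("deploy-102", total_requests=18_000, failed_requests=540),
    DeploymentStats("deploy-103", total_requests=0, failed_requests=0),
]
flagged = deployments_over_threshold(history)
```

Tracking the flagged list over time shows whether added tests and automation are actually lowering the error rate.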
3. Not Deploying on Certain Days
Now we come to a more subtle indication that you can improve your deployment management: not deploying on certain days. This red flag sends a signal that your deployments are unreliable. One common example of this is “Please don’t deploy anything on Fridays.” This is common because people assume a few things. First, they assume there’s a high enough chance that something will go wrong with your deployment. Second, they believe that your deployment pipeline can’t recover automatically. Third, the team members themselves may say this because they’ll have to support the deployment over the weekend. This last one may be an indication that the team doesn’t have the tooling or training to quickly diagnose, roll back, or fix defects.
There are a few ways to deal with this, beyond reducing the other red flags listed here.
Separating Deployment From Release
When teams are just starting to manage deployments, it can be easy to think that deploying software to production is the same thing as releasing it to customers. However, these concepts are separate. Releasing software exposes it to your customers. Deploying gets a new version of your application into production. Deployment tests things like “Am I connecting to the right databases or web services?” and “Do I have enough memory to run this service?” Releasing allows us to answer questions like “Will this feature make more money?” or “Do the customers like the new layout?” You can see that deployment answers “Am I building things right?” whereas release answers “Am I building the right thing?”
Separating these two will reduce the risk of causing problems for your customers on certain days. We can shift control over releasing changes to our business stakeholders while we continue to just manage deployment. The main way to do this is to build in release toggles: we deploy new code in an inert state and turn it on only when we’re ready to release. We can also evolve this into canary releases, which expose features to a subset of customers first. This lets us change how our system operates in a low-risk, controlled way.
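To make the toggle idea concrete, here is a small sketch of a release toggle that also supports a percentage-based canary. The `ReleaseToggle` class and the hashing scheme are illustrative assumptions, not a particular feature-flag product; real systems usually store toggle state outside the code so stakeholders can change it without a deployment.

```python
import hashlib


class ReleaseToggle:
    """A hypothetical release toggle with a percentage-based canary rollout."""

    def __init__(self, name: str, rollout_percent: int = 0):
        self.name = name
        # 0 keeps the deployed code inert; 100 releases it to everyone.
        self.rollout_percent = rollout_percent

    def is_enabled_for(self, customer_id: str) -> bool:
        # Hash the toggle name and customer id into a stable bucket from 0 to 99,
        # so the same customer always gets the same answer for this toggle.
        digest = hashlib.sha256(f"{self.name}:{customer_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.rollout_percent


new_layout = ReleaseToggle("new-layout", rollout_percent=10)

def render_home_page(customer_id: str) -> str:
    # The new code path ships to production but stays off for most customers.
    if new_layout.is_enabled_for(customer_id):
        return "new layout"
    return "old layout"
```

Because the bucket is derived from a hash rather than a random draw, raising `rollout_percent` only ever adds customers to the canary group; nobody who already sees the feature loses it.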
Another key way to deal with the resistance to deploying on certain days is to ensure your deployments cause no downtime for your customers. This is common practice at large software companies such as Amazon and Google. With no downtime, your customers will only see changes when a team chooses to release features or when something goes wrong during deployment. We can achieve zero downtime by practicing blue/green deployments. This pattern lets us deploy our new software alongside our old, then switch traffic to the new software once we validate that it’s ready.
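The blue/green switch can be sketched in a few lines. This is a toy in-memory model, assuming a hypothetical `BlueGreenRouter` and a caller-supplied `validate` function; in practice the switch happens at a load balancer or DNS layer rather than in application code.

```python
from typing import Callable, Dict, Optional


class BlueGreenRouter:
    """Toy model of two environments with traffic pointed at exactly one."""

    def __init__(self, live_version: str):
        self.environments: Dict[str, Optional[str]] = {"blue": live_version, "green": None}
        self.live = "blue"

    @property
    def idle(self) -> str:
        """Name of the environment that is not receiving traffic."""
        return "green" if self.live == "blue" else "blue"

    def live_version(self) -> Optional[str]:
        return self.environments[self.live]

    def deploy(self, version: str, validate: Callable[[str], bool]) -> bool:
        """Install on the idle environment and switch traffic only if validation passes."""
        target = self.idle
        self.environments[target] = version
        if not validate(version):
            # Customers never saw the new version; the old one keeps serving.
            return False
        self.live = target
        return True


router = BlueGreenRouter("v1")
switched = router.deploy("v2", validate=lambda version: True)
```

A failed validation leaves traffic where it was, which is what makes the pattern safe to run on any day of the week.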
4. A High Mean Time to Recovery
Despite our best efforts, things go wrong sometimes. It’s prudent for us to measure how quickly we can get back to a working state when that happens. If we don’t, then people start distrusting our deployments and will create pressure for us to deploy less frequently. A deployment’s mean time to recovery (MTTR) is a popular measurement: the average time between when an incident starts causing problems and when that incident is resolved. Measuring this alongside error rate per deployment will allow us to constantly review and improve our deployment management practices.
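As a rough illustration, MTTR is just the average of each incident’s duration. The sketch below assumes incidents are recorded as (started, resolved) timestamp pairs; the sample timestamps are invented for the example.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Incident = Tuple[datetime, datetime]  # (started causing problems, resolved)


def mean_time_to_recovery(incidents: List[Incident]) -> Optional[timedelta]:
    """Average time from incident start to resolution; None if there were no incidents."""
    if not incidents:
        return None
    durations = [resolved - started for started, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


# Made-up incidents: one lasting 30 minutes and one lasting 90 minutes.
incidents = [
    (datetime(2020, 3, 2, 10, 0), datetime(2020, 3, 2, 10, 30)),
    (datetime(2020, 3, 9, 14, 0), datetime(2020, 3, 9, 15, 30)),
]
mttr = mean_time_to_recovery(incidents)  # one hour for this sample data
```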
One way to deal with a high MTTR is to automate away any manual steps to rolling back a deployment. It’s common for teams to home in on the manual steps that exist for successful deployments. But then they easily overlook automating the steps the system needs to recover when those steps fail. I encourage every team to think about both the success and failure of every step and how to script away human interaction.
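One way to think about both the success and failure of every step is to pair each step with its own undo action. The sketch below is an assumption about how that might be structured, with a hypothetical `Step` type; it isn’t tied to any specific deployment tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """A deployment step paired with the action that reverses it (hypothetical shape)."""
    name: str
    apply: Callable[[], None]
    undo: Callable[[], None]


def run_deployment(steps: List[Step]) -> bool:
    """Apply steps in order; if one raises, undo the completed steps in reverse order."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.apply()
        except Exception:
            # Only steps that finished are undone; the failing step is assumed
            # to have left nothing behind that needs cleanup.
            for done in reversed(completed):
                done.undo()
            return False
        completed.append(step)
    return True
```

Writing the undo action at the same time as the step itself keeps recovery from becoming an afterthought.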
But in order to automate recovery from failures, we need a way to automatically know that something has failed. This brings us to our final red flag.
5. Unverified Deployments
A team can get so caught up in automating its deployments that it doesn’t bother to learn whether or not a deployment has actually succeeded. For many teams, the limit of checking a deployment is to hit the home page of their web application. We need to automate not only our deployment steps but also how we verify a deployment was successful. Every time a team promotes software to a new environment, it should check that the promotion succeeded. This will also help to deal with the red flag of approvals. Below are some strategies we can apply to verify deployments.
The simple health check can go a long way to verifying deployment success. Just like a person checking the home page, we can check a specific heartbeat URL or do a simple GET request on our service. This tells us at least the application started up and is running.
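A health check like this can be a single function in the pipeline. The sketch below uses only Python’s standard library; the heartbeat URL in the usage comment is a placeholder, and the three-second timeout is an arbitrary assumption.

```python
import http.client
import urllib.request


def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET request to the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and 4xx/5xx responses
        # (urllib raises HTTPError, a subclass of OSError, for those).
        return False


# Example usage with a placeholder URL:
# if not is_healthy("https://my-service.example.com/heartbeat"):
#     trigger a rollback or fail the pipeline stage
```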
Smoke tests are more comprehensive than health checks but more complex. They run through some of the system’s scenarios. This ensures that the system is not only running but also has the correct high-level functionality. Be careful as you want to ensure you don’t pollute your database with unwanted information. It’s good to have a way to set up and clean up these smoke tests.
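The setup-and-cleanup concern can be handled with a context manager that always removes what the smoke test created. In this sketch an in-memory SQLite database stands in for the real system, and the `orders` table and its columns are invented for illustration; a real smoke test would go through your application’s own API.

```python
import sqlite3
from contextlib import contextmanager


def make_demo_db() -> sqlite3.Connection:
    """Stand-in for the deployed system's database (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, is_smoke_test INTEGER)"
    )
    return conn


@contextmanager
def smoke_test_order(conn: sqlite3.Connection):
    """Create a clearly marked test order and delete it afterward, even on failure."""
    cursor = conn.execute(
        "INSERT INTO orders (customer, is_smoke_test) VALUES (?, 1)", ("smoke-test",)
    )
    order_id = cursor.lastrowid
    conn.commit()
    try:
        yield order_id
    finally:
        conn.execute("DELETE FROM orders WHERE id = ?", (order_id,))
        conn.commit()


def smoke_test_can_place_and_read_order(conn: sqlite3.Connection) -> bool:
    """One high-level scenario: an order we place can be read back."""
    with smoke_test_order(conn) as order_id:
        row = conn.execute(
            "SELECT customer FROM orders WHERE id = ?", (order_id,)
        ).fetchone()
        return row is not None and row[0] == "smoke-test"
```

Marking the record as smoke-test data also gives you a way to find and purge leftovers if a cleanup ever fails.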
Contract tests are like smoke tests for your downstream systems. We want to ensure that we’re connecting to our dependencies correctly. We also want to ensure that those dependencies are upholding their end of our contract. Running some simple tests against their systems in each environment allows us to verify the contract is intact and that we have configured our properties correctly.
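At its simplest, a contract check compares a dependency’s response against the fields and types we rely on. The contract and payloads below are hypothetical; dedicated contract-testing tools go much further, but the core idea fits in a short function.

```python
from typing import Any, Dict, List

# Fields we depend on from a hypothetical customer service, with expected types.
EXPECTED_CUSTOMER_CONTRACT: Dict[str, type] = {"id": int, "email": str, "active": bool}


def contract_violations(payload: Dict[str, Any], contract: Dict[str, type]) -> List[str]:
    """List every way the payload breaks the contract; an empty list means it holds."""
    problems: List[str] = []
    for field, expected_type in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            actual = type(payload[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return problems


# In a pipeline, the payload would come from a real call to the dependency
# in the environment we just deployed to; here it is hard-coded.
good = {"id": 42, "email": "a@example.com", "active": True, "extra": "ignored"}
bad = {"id": "42", "email": "a@example.com"}
```

Extra fields are deliberately ignored, so the dependency can add data without breaking us; only missing or retyped fields count as violations.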
There are many more ways to verify deployments, but these are some of the most frequently used.
The goal is clear to us: fully automated, self-recovering deployment pipelines that can deploy on every push. But with our day-in and day-out workload, it’s easy to become desensitized to the numerous obstacles to that goal. I recommend reviewing this list every few retrospectives to see whether any of these red flags remain in your deployment pipeline. If you find them, root them out. Be relentless in your pursuit of automation. The investment will definitely pay off, possibly faster than you think.
This post was written by Mark Henke. Mark has spent over 10 years architecting systems that talk to other systems, doing DevOps before it was cool, and matching software to its business function. Every developer is a leader of something on their team, and he wants to help them see that.