Top 5 Cloud Metrics

21 MARCH, 2019

by Mark Henke

Preamble

These days, deploying our services to the cloud just makes sense. Deploying to the cloud means we’re letting someone else handle low-level infrastructure costs, which gives us incredible flexibility. And with such flexibility comes a potentially overwhelming set of options for how we monitor our systems and take advantage of how easily we can scale our services.

But luckily, we can limit our options based on two key factors: what decisions we’ll make to better our system and how customer-focused they are. With that in mind, I’ll share what I think are the five most important cloud metrics.

 

Metrics Are for Decisions

Here, I want to revisit some of my points from my “Top 5 DevOps Metrics” post on the Enov8 blog. And I’ll start by reemphasizing what I said there: metrics are useless on their own. If someone comes to you asking for a dashboard or to track some data, it’s fair to ask, “How will you use that data to make decisions?” If data doesn’t help you decide what actions to take, it’s not data worth collecting; the information will simply be clutter. And we want to eliminate clutter from our minds so we can concentrate on the data that will guide our decisions.

I want to be clear about what I mean when I say “decision” here. I don’t mean you need to know exactly what specific code you will change or what report you will send out. What I mean is that you should have a decision-making system in place that you trigger when a metric reaches some threshold.

For example, let’s say you want to be able to notify a customer when your service is unstable. So you ask your team to create a process where they send an email to all customers. Then once the situation settles, they’ll send another email saying everything is A-OK. If the instability is extensive, you may decide to have phone calls with your preferred customers as well. Now you can build a metric that monitors when availability dips below 99 percent, and at that point, you’d send the email.
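As a minimal sketch of that trigger, assuming availability is reported as a percentage at regular intervals, the check below sends one "unstable" email when availability dips below 99 percent and one "recovered" email when it climbs back. The function name and the sample readings are hypothetical, and actually sending the email is left out.

```python
from typing import Optional

AVAILABILITY_THRESHOLD = 99.0  # percent, taken from the example above


def next_notification(availability_pct: float, currently_unstable: bool) -> Optional[str]:
    """Return which email to send for the latest reading, or None.

    "unstable" is returned once when availability drops below the threshold,
    "recovered" once when it comes back. No email is repeated in between.
    """
    below = availability_pct < AVAILABILITY_THRESHOLD
    if below and not currently_unstable:
        return "unstable"
    if not below and currently_unstable:
        return "recovered"
    return None


# Walk through a series of hypothetical availability readings.
unstable = False
sent = []
for reading in [99.9, 98.5, 97.0, 99.5]:
    email = next_notification(reading, unstable)
    if email == "unstable":
        unstable = True
    elif email == "recovered":
        unstable = False
    if email:
        sent.append(email)

print(sent)
```

Tracking the current state keeps customers from receiving a fresh email for every reading while the service stays below the threshold.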

Customers First, Then Everything Follows

Yes, we need to know what decisions our metrics will support. But we need more than that, too. For any metric we use, we should be able to point back to how it helps our customers. After all, our customers are our reason for existing!

Top Five Metrics

The top five customer-focused, decision-encouraging metrics for cloud systems are as follows:

  • Service availability
  • Reliability
  • Incident rate
  • Throughput
  • Service response time

Service Availability

Also known as uptime, service availability is the proportion of time the service is able to receive requests from users and/or consuming applications. This is usually measured in “9s.” For example, a service that is available 99 percent of the time in a year has two 9s of availability. The gold standard for this is five 9s, or 99.999-percent uptime per year. Achieving this, however, is usually very expensive.
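To make the 9s concrete, it can help to translate them into permitted downtime. The short sketch below does that arithmetic for a 365-day year; the function name is only illustrative. Two 9s leave room for roughly 3.65 days of downtime per year, while five 9s leave only about 5.26 minutes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes per year a service may be down at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {allowed_downtime_minutes(pct):.1f} minutes per year")
```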

What Decisions It Supports

The threshold at which you should trigger some decisions depends on the cost of downtime for your organization. If your system goes down, roughly how much money will the organization lose? Knowing that will help determine what we are willing to invest into availability in order to avoid those costs. It may help to have two thresholds: one that acts as a warning that something may be going on with a service, and a second one that signals critical instability.

With these two thresholds, you can set up some decisions. If you hit the warning threshold you may want the team to put a card at the top of their backlog to investigate the problem during this iteration or the next one. If you hit the critical threshold, you may want the team to stop what they are doing and swarm on investigating and resolving the problem.
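The two thresholds can be sketched as a small classification function. The threshold values and the action names here are assumptions chosen for illustration; your own values should come from your cost of downtime.

```python
WARNING_THRESHOLD = 99.9   # percent, hypothetical warning level
CRITICAL_THRESHOLD = 99.0  # percent, hypothetical critical level


def availability_action(availability_pct: float) -> str:
    """Map an availability reading to the decision it should trigger."""
    if availability_pct < CRITICAL_THRESHOLD:
        return "swarm"    # team stops current work and resolves the problem
    if availability_pct < WARNING_THRESHOLD:
        return "backlog"  # card goes to the top of the backlog
    return "none"


for reading in (99.95, 99.5, 98.0):
    print(reading, availability_action(reading))
```

Checking the critical threshold first matters, because a reading below it is also below the warning threshold.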

How It Relates to the Customer

The service availability metric has some clear connections to customer satisfaction. If customers cannot use your application, they will go elsewhere. If your application is flaky, they will not trust it enough to use it consistently. So having a highly available system builds warm customer rapport.

Reliability

Availability tells you that a service is up, but a service that is up may still have problems. When those problems occur, reliability measures how quickly we get the system back to a usable place. Reliability measures both short-term and long-term problem elimination.

Short-term reliability is measured by mean time to recovery. This is how long it takes for the team or system to overcome some problem. For example, if the cache fills up too much and causes product searches to take twice as long as normal, the mean time to recovery would be how long it takes to flush the cache.

Long-term reliability is measured by mean time to repair. This is how long it takes to permanently root out a recurring issue. So going back to our cache example, this would be how long it takes for the developers to implement a feature where the cache expires before it fills up.
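Both measures are averages over past incidents, so they are straightforward to compute once each incident carries three timestamps. The sketch below follows the definitions used in this post; the `Incident` record, its field names, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    reported_at: datetime
    recovered_at: datetime  # service usable again, e.g. the cache was flushed
    repaired_at: datetime   # permanent fix released, e.g. cache expiry added


def mean_duration(durations: list) -> timedelta:
    """Average a list of timedeltas."""
    return timedelta(seconds=mean(d.total_seconds() for d in durations))


def mean_time_to_recovery(incidents: list) -> timedelta:
    return mean_duration([i.recovered_at - i.reported_at for i in incidents])


def mean_time_to_repair(incidents: list) -> timedelta:
    return mean_duration([i.repaired_at - i.reported_at for i in incidents])


# Two made-up incidents for illustration.
incidents = [
    Incident(datetime(2019, 3, 1, 9, 0), datetime(2019, 3, 1, 9, 20), datetime(2019, 3, 3, 9, 0)),
    Incident(datetime(2019, 3, 10, 14, 0), datetime(2019, 3, 10, 14, 40), datetime(2019, 3, 14, 14, 0)),
]

print("Mean time to recovery:", mean_time_to_recovery(incidents))
print("Mean time to repair:", mean_time_to_repair(incidents))
```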

What Decisions It Supports

I strongly recommend mapping a value stream for handling production incidents. It’s good to look at how an incident moves from being reported by a user or the system to being resolved and finally being permanently repaired. Doing so will help you track and investigate the waste that inflates these metrics, just like you do with your development value stream.

How It Relates to the Customer

As with availability, customers trust a system that responds the way they expect it to. The fewer nasty surprises in your systems, the more customers will trust you. They’ll also be more likely to stick around and keep using your service in the future. Additionally, an unreliable system will cause bugs that may lose customer transactions, which loses you money. This is especially true if you have to compensate the customer for the inconvenience. This applies to internal customers too, since ultimately the consuming application is serving real customers somewhere.

Even for purely internal apps, such as a timesheet application, high reliability means higher employee morale and employees wasting less time trying to contact the help desk and figure out what’s going on.

Incident Rate

Reliability only gives a portion of the picture of customer trust. The other side of this is incident rates. This metric shows how frequently an incident occurs. You can measure this via your error tracker or even through a customer support tool. The incident rate plus reliability will give you a good picture of how often your system does what is expected.

What Decisions It Supports

The incidents that pop up in your system will vary widely, but you probably want a system in place similar to what we discussed with availability. With a good error tracking tool, you can monitor the severity of different errors and warnings. You can also measure severity based on how frequently an error or incident occurs. Medium-severity incidents may trigger an investigation for the next iteration. High-severity errors can trigger an immediate triage from one or more of your developers.
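Here is one rough way to express both ideas in code: an incident rate per day, and a severity assigned by how often each error occurs. The cut-off counts, the error names, and the sample data are assumptions for illustration; a real error tracker would supply its own.

```python
from collections import Counter


def incident_rate(incident_count: int, period_days: float) -> float:
    """Average number of incidents per day over the period."""
    return incident_count / period_days


def severity_by_frequency(error_names: list, medium_at: int = 5, high_at: int = 20) -> dict:
    """Classify each error as low, medium, or high based on its count."""
    severities = {}
    for name, count in Counter(error_names).items():
        if count >= high_at:
            severities[name] = "high"    # immediate triage
        elif count >= medium_at:
            severities[name] = "medium"  # investigate next iteration
        else:
            severities[name] = "low"
    return severities


# Hypothetical errors collected over a 30-day period.
errors = ["timeout"] * 25 + ["cache_full"] * 6 + ["bad_input"] * 2

print(incident_rate(len(errors), 30))
print(severity_by_frequency(errors))
```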

How It Relates to the Customer

The factors here are a lot like the factors for reliability and service availability. There’s not much more to add on this front. Use all three of these metrics to get a sense of how much a customer may trust your application when it is running in the cloud.

Throughput

We shift gears a bit with this next metric, moving away from avoiding problems toward providing maximum service to our customers. Throughput lets us look at how many customer requests we can handle in a given period of time. This is often measured in transactions per second. Slower applications can adjust the time unit as necessary.

What Decisions It Supports

Set up your throughput thresholds based upon current and anticipated customer demand. When your throughput goes below that demand, it makes sense to trigger some sort of investigation with the development team. A strong development team will dedicate a portion of their work per iteration to technical debt. It can make sense to put these investigations into the technical debt backlog.
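A minimal sketch of that comparison, assuming you can count completed transactions over a measurement window and have an estimate of demand in transactions per second. Both function names and the sample numbers are hypothetical.

```python
def throughput_tps(completed_transactions: int, window_seconds: float) -> float:
    """Transactions per second completed during the measurement window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return completed_transactions / window_seconds


def needs_investigation(measured_tps: float, demand_tps: float) -> bool:
    """True when measured throughput falls below customer demand."""
    return measured_tps < demand_tps


measured = throughput_tps(completed_transactions=1200, window_seconds=60)
print(measured, needs_investigation(measured, demand_tps=25.0))
```

For slower applications, the same functions work with a longer window and a demand figure expressed in the matching unit.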

How It Relates to the Customer

The throughput you need is directly connected to the number of customers you serve and the speed at which you serve them. The more customers likely to hit your service at a time, the more throughput you need to handle it. If you don’t anticipate and handle the rate of customer transactions, you run the risk of increasing incident rates as your system buckles under pressure.

Service Response Time

Our final metric is how responsive our system or service is in the cloud. How fast do customers or consumers receive responses to their requests? This can be measured as latency, in milliseconds per request. For web apps, this is very simple. For apps with asynchronous processing, it may take a bit more elbow grease to instrument the requests.

What Decisions It Supports

Based on your application, you can establish certain response service-level agreements (SLAs) for your customers. When latency hits that threshold, just like with throughput, you can toss an investigation or “fix it” card into the team’s technical debt backlog. I recommend ensuring these SLAs are actually made visible to the customer and known in advance. Otherwise, you may find yourself scrambling to respond to “false alarms,” that is, slower requests that don’t actually break any SLAs.

A more advanced decision-making system would proactively stop SLA breakages by having an early warning threshold. One way you can achieve this is by making visible the moments when your response times are 80 percent of the SLA time.
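The early warning can be sketched as a check against a fraction of the SLA time. This assumes latency and the SLA are both in milliseconds; the function name, the status labels, and the 1,000 ms SLA in the usage lines are illustrative only. Which latency figure you feed in (an average or a high percentile, for example) is up to you.

```python
def latency_status(latency_ms: float, sla_ms: float, warning_fraction: float = 0.8) -> str:
    """Classify a latency reading against the SLA and an early-warning level."""
    if latency_ms >= sla_ms:
        return "breach"   # SLA broken
    if latency_ms >= sla_ms * warning_fraction:
        return "warning"  # within the last 20 percent before the SLA
    return "ok"


for reading_ms in (500, 850, 1000):
    print(reading_ms, latency_status(reading_ms, sla_ms=1000))
```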

How It Relates to the Customer

Anything that takes more than a second on the web is noticeable to people. When customers notice slowness, they get frustrated. This affects the trust and loyalty you retain with them, just like reliability. You want to present a responsive, accommodating experience to your customers so they will keep coming back.

An Umbrella for a Rainy Day

When we run in the cloud, sometimes we get a downpour of scaling and production issues. Similar to the metrics in my “Top 5 DevOps Metrics” post, knowing the decisions we will make will give us an umbrella against this rain. Our umbrella will be strongest when we focus on our customers. Drawing from this, we can derive a set of strong metrics that ensure our services stay dry on rainy days.

Mark Henke

This post was written by Mark Henke. Mark has spent over 10 years architecting systems that talk to other systems, doing DevOps before it was cool, and matching software to its business function. Every developer is a leader of something on their team, and he wants to help them see that.
