Top Five Cloud Metrics for Value-Based Practices
Preamble
These days, deploying our services to the cloud just makes sense. Deploying to the cloud means we’re letting someone else handle low-level infrastructure costs, which gives us incredible flexibility. And with such flexibility comes a potentially overwhelming set of options for monitoring our systems and taking advantage of how easily we can scale our services. Luckily, we can narrow those options using two key factors: which decisions a metric will help us make to better our system, and how customer-focused that metric is. With that in mind, I’ll share what I think are the five most important cloud metrics.
Metrics Are for Decisions
Here, I want to revisit some of my points from my “Top 5 DevOps Metrics” post on the Enov8 blog. And I’ll start by reemphasizing what I said there: metrics are useless on their own. If someone comes to you asking for a dashboard or to track some data, it’s fair to ask, “How will you use that data to make decisions?” If data doesn’t help you decide what actions to take, it’s not data worth collecting; the information will simply be clutter. And we eliminate clutter from our minds so we can concentrate on the data that will guide our decisions.
I want to be clear about what I mean when I say “decision” here. I don’t mean you need to know exactly what specific code you will touch or what report you will send out. What I mean is that you should have a decision-making system in place that you trigger when a metric reaches some threshold. For example, let’s say you want to be able to notify customers when your service is unstable. So you ask your team to create a process where they send an email to all customers. Then once the situation settles, they’ll send another email saying everything is A-OK. If the instability is extensive, you may decide to have phone calls with your preferred customers as well. Now you can build a metric that monitors when availability dips below 99 percent, and at that point, you’d send the email.
Customers First, Then Everything Follows
Yes, we need to know what decisions our metrics will support. But we need more than that, too. For any metric we use, we should be able to point back to how it helps our customers. After all, our customers are our reason for existing!
Top Five Metrics
The top five customer-focused, decision-encouraging metrics for cloud systems are as follows:
- Service availability
- Reliability
- Incident rate
- Throughput
- Service response time
Service Availability
Also known as uptime, service availability is the proportion of time the service is able to receive requests from users and/or consuming applications. This is usually measured in “9s.” For example, a service that is available 99 percent of the time in a year has two 9s of availability. The gold standard for this is five 9s, or 99.999-percent uptime per year. Achieving this, however, is usually very expensive.
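To make the “9s” concrete, here is a minimal sketch that converts an availability target into a yearly downtime budget. The function name and the 365-day year are my own assumptions for illustration, not part of any particular monitoring tool.

```python
def allowed_downtime_minutes(availability_percent: float, days: float = 365.0) -> float:
    """Return the downtime budget, in minutes, for an availability target.

    Assumes a 365-day year by default (hypothetical helper for illustration).
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, two 9s leaves roughly 3.65 days of downtime per year, while five 9s leaves a little over five minutes, which hints at why the gold standard is so expensive.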
What Decisions It Supports
The threshold at which you should trigger some decisions depends on the cost of downtime for your organization. If your system goes down, roughly how much money will the organization lose? Knowing that will help determine what we are willing to invest in availability in order to avoid those costs. It may help to have two thresholds: one that acts as a warning that something may be going on with a service, and a second one that signals critical instability. With these two thresholds, you can set up some decisions. If you hit the warning threshold, you may want the team to put a card at the top of their backlog to investigate the problem during this iteration or the next one. If you hit the critical threshold, you may want the team to stop what they are doing and swarm on investigating and resolving the problem.
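The two-threshold idea could be sketched roughly like this. The threshold values (99.9 and 99.0 percent) and the names `Action` and `decide` are placeholders I made up; your own numbers should come from what downtime costs your organization.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action needed"
    BACKLOG_CARD = "put an investigation card at the top of the backlog"
    SWARM = "stop current work and swarm on the problem"


def decide(availability_percent: float,
           warning_threshold: float = 99.9,    # assumed value, tune to your cost of downtime
           critical_threshold: float = 99.0    # assumed value, tune to your cost of downtime
           ) -> Action:
    """Map a measured availability to the team decision it should trigger."""
    if critical_threshold > warning_threshold:
        raise ValueError("critical_threshold must not exceed warning_threshold")
    if availability_percent < critical_threshold:
        return Action.SWARM
    if availability_percent < warning_threshold:
        return Action.BACKLOG_CARD
    return Action.NONE


if __name__ == "__main__":
    for measured in (99.95, 99.5, 98.0):
        print(measured, "->", decide(measured).value)
```

The critical check runs first so that a badly degraded service never gets downgraded to a mere backlog card.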
How It Relates to the Customer
The service availability metric has some clear connections to customer satisfaction. If customers cannot use your application, they will go elsewhere. If your application is flaky, they will not trust you enough to use you consistently. So having a highly available system builds warm customer rapport.
Reliability
While availability tells you that a service is up, it may still have problems. When those problems occur, reliability measures how quickly we get the system back to a usable place. Reliability measures both short-term and long-term problem elimination.
Short-term reliability is measured by mean time to recovery. This is how long it takes for the team or system to overcome some problem. For example, if the cache fills up too much and causes product searches to take twice as long as normal, the mean time to recovery would be how long it takes to flush the cache.
Long-term reliability is measured by mean time to repair. This is how long it takes to permanently root out a recurring issue. So going back to our cache example, this would be how long it takes for the developers to implement a feature where the cache expires before it fills up.
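As a rough sketch of how these two numbers might be computed from incident records, consider the following. The `Incident` fields are hypothetical; in practice they would be mapped from whatever your incident tracker exports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional


@dataclass
class Incident:
    # Hypothetical record shape, assumed for illustration.
    reported_at: datetime
    recovered_at: datetime                  # service usable again (e.g. cache flushed)
    repaired_at: Optional[datetime] = None  # permanent fix shipped, if there is one yet


def mean_time_to_recovery(incidents: List[Incident]) -> Optional[timedelta]:
    """Average time from report to the service being usable again."""
    durations = [(i.recovered_at - i.reported_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(durations)) if durations else None


def mean_time_to_repair(incidents: List[Incident]) -> Optional[timedelta]:
    """Average time from report to permanent fix, ignoring unrepaired incidents."""
    durations = [
        (i.repaired_at - i.reported_at).total_seconds()
        for i in incidents
        if i.repaired_at is not None
    ]
    return timedelta(seconds=mean(durations)) if durations else None


if __name__ == "__main__":
    history = [
        Incident(datetime(2022, 1, 1, 10, 0), datetime(2022, 1, 1, 10, 30),
                 datetime(2022, 1, 3, 10, 0)),
        Incident(datetime(2022, 1, 2, 9, 0), datetime(2022, 1, 2, 9, 10)),
    ]
    print("Mean time to recovery:", mean_time_to_recovery(history))
    print("Mean time to repair:", mean_time_to_repair(history))
```

Incidents without a permanent fix are left out of the repair average here. That is a design choice: you may prefer to count them as still open so the number does not look better than reality.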
What Decisions It Supports
I strongly recommend mapping a value stream for handling production incidents. It’s good to look at how an incident moves from being reported by a user or the system to being resolved and finally being permanently repaired. Doing so will help you find and investigate the waste behind these metrics, just like you do with your development value stream.
How It Relates to the Customer
As with availability, customers trust a system that responds the way they expect it to. The fewer nasty surprises in your systems, the more customers will trust you. They’ll also be more likely to stick around and use you in the future. Additionally, an unreliable system will cause bugs that may lose customer transactions, which loses you money. This is especially true if you have to compensate the customer for the inconvenience. This applies to internal customers too, since ultimately the consuming application is serving real customers somewhere. Even for purely internal apps, such as a timesheet application, high reliability means higher employee morale and less time wasted by employees trying to contact the help desk and figure out what’s going on.
Incident Rate
Reliability only gives a portion of the picture of customer trust. The other side of this is the incident rate. This metric shows how frequently incidents occur. You can measure this via your error tracker or even through a customer support tool. The incident rate plus reliability will give you a good picture of how often your system does what is expected.
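Here is one simple way the incident rate might be calculated, assuming you can export a list of incident timestamps from your error tracker or support tool. The function name and the per-day unit are my own choices; pick whatever time unit fits how often your incidents occur.

```python
from datetime import datetime, timedelta
from typing import Iterable


def incident_rate_per_day(timestamps: Iterable[datetime],
                          window_start: datetime,
                          window_end: datetime) -> float:
    """Count incidents inside [window_start, window_end) and divide by the window length in days."""
    if window_end <= window_start:
        raise ValueError("window_end must be after window_start")
    in_window = sum(1 for t in timestamps if window_start <= t < window_end)
    days = (window_end - window_start) / timedelta(days=1)
    return in_window / days


if __name__ == "__main__":
    start = datetime(2022, 6, 1)
    end = start + timedelta(days=7)
    # Made-up sample data: one incident every 12 hours for a week.
    sample = [start + timedelta(hours=12 * k) for k in range(14)]
    print(f"{incident_rate_per_day(sample, start, end):.1f} incidents per day")
```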
What Decisions It Supports
The incidents that pop up in your system will vary widely, but you probably want a system in place similar to what we discussed with availability. With a good error tracking tool, you can monitor the severity of different errors and warnings. You can also measure severity based on how frequently an error or incident occurs. Medium-severity incidents may trigger an investigation for the next iteration. High-severity errors can trigger an immediate triage from one or more of your developers.
How It Relates to the Customer
The factors here are a lot like the factors for reliability and service availability. There’s not much more to add on this front. Use all three of these metrics to get a sense of how much a customer may trust your application when it is running in the cloud.
Throughput
We shift gears a bit with this next metric, moving away from avoiding problems toward providing maximum service to our customers. Throughput lets us look at how many customer requests we can handle at a time. This is often measured in transactions per second. Slower applications can adjust the time unit as necessary.
What Decisions It Supports
Set up your throughput thresholds based upon current and anticipated customer demand. When your throughput goes below that demand, it makes sense to trigger some sort of investigation with the development team. A strong development team will dedicate a portion of their work per iteration to technical debt. It can make sense to put these investigations into the technical debt backlog.
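Measuring throughput and checking it against demand can be sketched in a few lines. Both function names are hypothetical, and the demand figure is something you would estimate from your own traffic.

```python
def transactions_per_second(completed_transactions: int, window_seconds: float) -> float:
    """Throughput over a measurement window; change the unit for slower applications."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return completed_transactions / window_seconds


def needs_investigation(measured_tps: float, demand_tps: float) -> bool:
    """True when measured throughput falls below current or anticipated demand."""
    return measured_tps < demand_tps


if __name__ == "__main__":
    measured = transactions_per_second(completed_transactions=1200, window_seconds=60)
    anticipated_demand = 25.0  # assumed figure for illustration
    if needs_investigation(measured, anticipated_demand):
        print(f"{measured:.1f} TPS is below demand; add a card to the technical debt backlog")
    else:
        print(f"{measured:.1f} TPS meets demand")
```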
How It Relates to the Customer
The throughput you need is directly connected to the number of customers you serve and the speed at which you serve them. The more customers likely to hit your service at a time, the more throughput you need to handle them. If you don’t anticipate and handle the rate of customer transactions, you run the risk of increasing incident rates as your system buckles under pressure.
Service Response Time
Our final metric is how responsive our system or service is in the cloud. How fast do customers or consumers receive responses to their requests? This can be measured as latency, in milliseconds per request. For web apps, this is very simple. For apps with asynchronous processing, it may take a bit more elbow grease to instrument the requests.
What Decisions It Supports
Based on your application, you can establish certain response service-level agreements (SLAs) for your customers. When latency hits that threshold, just like with throughput, you can toss an investigation or “fix it” card into the team’s technical debt backlog. I recommend ensuring these SLAs are actually made visible to the customer and known in advance. Otherwise, you may find yourself scrambling over “false alarms”: slower requests that don’t actually break any SLAs. A more advanced decision-making system would proactively stop SLA breakages by having an early warning threshold. One way you can achieve this is by making visible the moments when your response times reach 80 percent of the SLA time.
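The early warning idea might look something like this minimal sketch. The function name, the status labels, and the choice to feed it a single latency figure (say, a percentile from your monitoring tool) are all assumptions on my part; only the 80 percent fraction comes from the suggestion above.

```python
def classify_latency(latency_ms: float, sla_ms: float, warning_fraction: float = 0.8) -> str:
    """Classify a measured latency against an SLA with an early warning band.

    Returns "breach" above the SLA, "warning" at or above warning_fraction
    of the SLA, and "ok" otherwise.
    """
    if sla_ms <= 0:
        raise ValueError("sla_ms must be positive")
    if not 0.0 < warning_fraction <= 1.0:
        raise ValueError("warning_fraction must be in (0, 1]")
    if latency_ms > sla_ms:
        return "breach"
    if latency_ms >= warning_fraction * sla_ms:
        return "warning"
    return "ok"


if __name__ == "__main__":
    sla = 1000.0  # assumed SLA of one second, for illustration only
    for measured in (500.0, 850.0, 1200.0):
        print(f"{measured:.0f} ms -> {classify_latency(measured, sla)}")
```

Making the "warning" state visible on a dashboard gives the team a chance to act before any customer-facing SLA is actually broken.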
How It Relates to the Customer
Anything that takes more than a second on the web is noticeable to people. When customers notice slowness, they get frustrated. This affects the trust and loyalty you retain with them, just like reliability. You want to present a responsive, accommodating experience to your customers so they will keep coming back.
An Umbrella for a Rainy Day
When we run in the cloud, sometimes we get a downpour of scaling and production issues. Similar to the metrics in my “Top 5 DevOps Metrics” post, knowing the decisions we will make will give us an umbrella against this rain. Our umbrella will be strongest when we focus on our customers. Drawing from this, we can derive a set of strong metrics that ensure our services stay dry on rainy days.