Top 5 Container Metrics
by Christian Meléndez
Even though containers are different from virtual machines (VMs), most of the metrics you get from a container are pretty similar to the ones you get from a VM or a physical server. What’s different is the meaning a metric has in a containerized system. Metrics are important because they give you a better understanding of what’s happening in the system and how that affects the continuous delivery pipeline.
Today’s post is about the top container metrics you’ll want to look at before going deeper when troubleshooting. Most of the time, problems are solved just by scaling out the containers. With the help of container orchestrators like Kubernetes, most applications will self-heal, removing the need for manual intervention before scaling out.
Ready? Let’s take a look at the top metrics when working with containers.
1. Memory Consumption
Memory is the most critical metric for Docker containers because when a container runs out of memory, it stops working. Java applications in particular are often terminated when heap sizes and container memory limits aren't configured consistently. When a container hits its memory limit, the kernel's out-of-memory (OOM) killer terminates the process inside the container that's responsible. By enforcing the limit and stopping the container, Docker protects the host, so there's no chance the container will affect others running on the same host.
What's good about containers is that they can be restarted in a matter of seconds after Docker stops them. Even so, the application will have some downtime, even if only for a short period. If you're using Docker Compose, you can configure the container to restart when it stops. And if you're using Kubernetes, you can declare the desired number of containers that should be running at all times; Kubernetes will then continuously reconcile toward that desired state, recreating containers as needed.
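As a minimal sketch, a Docker Compose service with a memory limit and a restart policy might look like this (the service name, image, and limit value are illustrative, not from the original post):

```yaml
# docker-compose.yml (illustrative)
services:
  web:
    image: nginx:alpine
    mem_limit: 512m         # the container is OOM-killed if it exceeds this
    restart: unless-stopped # start the container again whenever it stops
```

With `restart: unless-stopped`, Docker brings the container back automatically after an OOM kill, which keeps the downtime to the few seconds a restart takes.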
Memory matters not only for avoiding unexpected application behavior, but also for deciding when to scale out the containers. The memory metric provides the threshold for autoscaling rules. An autoscaling rule might say, "If memory consumption is higher than 80 percent, then start two more containers."
Knowing the current memory consumption is useful, but it's even better to act on it preventively, for example by scaling out containers before the application becomes unresponsive. Container orchestrators like Kubernetes enable you to do exactly that.
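In Kubernetes, the "80 percent" rule above can be expressed as a HorizontalPodAutoscaler. This is a sketch; the object names and replica counts are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80  # add replicas above 80% average memory use
```

The autoscaler compares observed usage against the memory *requests* declared on the pods, so those requests need to be set for utilization-based scaling to work.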
2. CPU Usage
Another critical metric is CPU usage. By default, Docker containers have full access to the host's CPU resources. To avoid one container affecting the others on the host, you can limit how much CPU Docker allows a container to use. For example, if a container has a CPU limit of 0.5, it can use at most 50 percent of one CPU core on the host.
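On the Docker CLI, that half-core limit can be set with the `--cpus` flag (the image and container name here are illustrative):

```shell
# Cap the container at half of one CPU core
docker run -d --name web --cpus="0.5" nginx:alpine

# Older equivalent, setting the scheduler period and quota directly
docker run -d --cpu-period=100000 --cpu-quota=50000 nginx:alpine
```

Both forms tell the kernel's CFS scheduler to throttle the container once it has consumed its share of CPU time in each period.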
Docker will not stop a container when it reaches its CPU limit; instead, the container is throttled, which negatively impacts the application's performance. With the CPU metric, you can quickly identify whether a container needs more CPU resources or not. Containers help you optimize resources, but only if you know where to tune. For example, you might be configuring far more CPU than the container ever uses, wasting capacity the host could give to other containers.
As with memory, the CPU metric works as a threshold for the containers' autoscaling rules. In Kubernetes or any other container orchestrator, you can define an autoscaling rule that adds more containers when CPU utilization crosses a limit. Application performance then stays steady because the orchestrator makes sure to provide as many containers as are needed.
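In Kubernetes, the CPU limit a container may use, and the request that utilization-based autoscaling is measured against, are both declared in the pod spec. The values below are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: web
      image: nginx:alpine
      resources:
        requests:
          cpu: "250m"  # 0.25 core; CPU utilization for autoscaling is relative to this
        limits:
          cpu: "500m"  # the container is throttled, not killed, above half a core
```

Keeping the request realistic matters: if it's far below actual usage, utilization percentages will look inflated and the autoscaler will scale out too aggressively.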
3. Disk Operations
Disk metrics such as I/O operations and disk usage can affect more than just a single container's performance. The disk is a resource that both the containers and the host depend on for good performance.
You can get I/O metrics with the "docker stats" command, and disk usage metrics with the "docker system df" command (add the "-v" flag for more detail). Controlling disk usage on the host is crucial to avoid problems when pulling new container images. Images can be huge, which is why you might need to run a clean-up process to remove old ones. Logs written to disk are another reason disk usage grows, so log rotation is a practice you still need to implement with containers.
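Put together, the commands above (plus a clean-up step) look like this on a host with Docker installed:

```shell
# Live per-container CPU, memory, block I/O, and network I/O
docker stats --no-stream

# Disk space used by images, containers, volumes, and build cache
docker system df

# The same report with per-object detail
docker system df -v

# Reclaim space from stopped containers and dangling images
docker system prune
```

Running `docker system prune` on a schedule is one simple way to implement the clean-up process mentioned above, though it should be used carefully on hosts where stopped containers are expected.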
Even though containers are stateless by default, some applications, such as databases, are stateful. For stateful applications, you'll use volumes. A volume maps a directory or drive inside the container to a path or drive on the host, and you can create volumes on the host before assigning them to a container.
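For example, creating a named volume ahead of time and mounting it into a database container looks like this (the volume name, container name, and image are illustrative):

```shell
# Create a named volume on the host ahead of time
docker volume create pgdata

# Mount it into a stateful container so data survives container restarts
docker run -d --name db -v pgdata:/var/lib/postgresql/data postgres:16
```

Because the data now lives on the host's disk, the volume's growth shows up in the host's disk usage metrics, which is exactly why the disk becomes a shared concern.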
Disks become a shared resource that could affect containers’ and the host’s performance.
4. Network Traffic
As you've seen with the previous metrics, the metrics that matter on a virtual or physical server are still crucial for containers, so network traffic can't be missed when discussing the top metrics for containers. You can get network traffic from the "docker stats" command to see how much data a container has sent and received.
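If you only care about network figures, `docker stats` supports a format template that narrows the output to the columns you need:

```shell
# Show only the container name and cumulative network I/O (received / sent)
docker stats --no-stream --format "table {{.Name}}\t{{.NetIO}}"
```

The `.NetIO` placeholder reports totals since the container started, so comparing two samples taken some seconds apart gives you an approximate throughput.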
Knowing how much traffic a container handles can tell you whether the host it's running on is appropriate. Maybe the container needs to run on a host with better network performance, or perhaps you need to scale out the containers. I've seen cases where, for specific applications, memory and CPU looked fine but the application's performance was still poor.
When I took a look at the networking metrics, I found some interesting information. Sometimes the network traffic was low because of a misconfiguration in the load balancer. Other times, the applications in the container couldn't get enough network bandwidth.
My point is that there were times when I couldn't spot a problem (caused, for example, by the host type or a bug in the application) with memory or CPU metrics, but I could with network metrics.
5. Number of Containers Running
To run containers at scale, you need a container orchestrator like Kubernetes or Swarm. There will be times when the orchestrator can't schedule more containers because no more resources are available. To fix this, you can use the number of running containers as the signal to scale out the hosts.
I noticed the importance of this metric when a friend used it to demonstrate that certain containers were having problems. It turned out that, because of a misconfiguration, the container would run for only a short period before dying. The orchestrator kept spinning up new containers, but the application was unstable.
You can get this metric by running the "docker container ls" command or by asking the orchestrator how many containers are running for an application. In Kubernetes, you can get this information by listing the running pods or by checking the state of a deployment object.
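Both views can be pulled from the command line (the label and deployment name in the Kubernetes examples are illustrative):

```shell
# Containers currently running on this host
docker container ls

# Just the count
docker container ls -q | wc -l

# In Kubernetes: the pods behind an application, and the
# deployment's ready-versus-desired replica counts
kubectl get pods -l app=web
kubectl get deployment web
```

A deployment whose READY count keeps lagging behind its desired count is exactly the crash-loop symptom described above.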
The number of containers running is not a metric of a container per se. Sometimes you need to zoom out to get a different perspective on a problem.
Don't Rely Only on Top Metrics
I haven't given you many details on how to collect these metrics because there are now plenty of tools for that job, available both as paid services and for free. Some examples of free tools are cAdvisor, InfluxDB, Grafana, and Prometheus. These tools will give you other metrics, not just for the containers but also for the host. So don't rely only on the top metrics I listed here. There will be times when these metrics look good while the host is running out of IOPS or swapping heavily; that's when additional metrics come in handy.
Knowing the top metrics for containers is just the beginning of improving your lead time. For example, you can also use these metrics when drawing a value stream map.
This post was written by Christian Meléndez. Christian is a technologist who started as a software developer and has more recently become a cloud architect focused on implementing continuous delivery pipelines with applications in several flavors, including .NET, Node.js, and Java, often using Docker containers.