
What Is SRE (Site Reliability Engineering)?



by Michiel Mulders

A site reliability engineer (SRE) loves optimizing inefficient processes, and doing that well takes coding skills and a deep understanding of the software. An SRE also contributes directly to the overall happiness of other employees. Why? Because an SRE automates repetitive tasks and thereby reduces the pressure on colleagues.

Often, we see SREs as the driving innovators of the company. They have a good overview of the system architecture, the software and its metrics, and operations. Through their extensive knowledge, they can drive innovation. For example, an effective SRE can advise you on which tasks to improve in order to reduce costs or improve efficiency.

This post will teach you what a site reliability engineer does and why. Also, we’ll cover aspects of site reliability engineering such as application performance monitoring, logging, the importance of metrics, and the use of a configuration management database.

First, let’s explore the definition of site reliability engineering.


What Is Site Reliability Engineering (SRE)?

There’s still a lack of clarity about the exact definition of SRE. It sits between software engineering and DevOps. In the narrow sense used here, DevOps is mainly concerned with deploying and monitoring builds. An SRE, by contrast, is a software engineer who takes care of deployments but also spends time writing code and improving processes.

Why a software engineer? He or she truly understands the software, which is valuable knowledge for debugging nasty issues such as steadily increasing memory usage.

The SRE movement is trying to shift the art of deployments more and more toward software engineers. The idea is that a software engineer should also know about deploying the code they write.

However, there’s much more to it. SRE is a cross-disciplinary role. A site reliability engineer is responsible for deploying and scaling builds. Besides that, he or she should also take care of application performance monitoring and implement logging.

A software engineer can move to the role of a site reliability engineer only if he or she has an internal motivation for improving processes. SRE is essential for a company, as the team is responsible for allowing the company to scale quickly and easily. An SRE spends a huge chunk of his or her time on finding inefficient processes and trying to improve them. Ideally, the goal is to automate processes so the organization can “forget” about them.

Importance of Application Performance Monitoring and Logging

Application performance monitoring (APM) and logging are crucial parts of an SRE’s job. He or she will use the data gathered from APM and logging for key metrics, such as the reliability of services and availability. These are two important metrics that indicate the health of your services and deployment strategy.

Four other important metrics to monitor include:

  • Database response time: Measure both the average and the min/max response time.
  • Requests per minute: This measures how many requests the application can handle per minute.
  • Error rate: How often does the end user experience an error?
  • Network latency: Often, people forget to measure the network latency, which can help them calculate the total request time.
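
To make these metrics concrete, here’s a minimal Python sketch that computes them from a list of request records. The RequestRecord fields and the summarize function are illustrative names, not part of any particular APM tool, and treating only 5xx status codes as errors is an assumption you may want to change.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    """One handled request; the field names are illustrative assumptions."""
    timestamp: float    # seconds since the start of the observation window
    db_time_ms: float   # time spent waiting on the database
    network_ms: float   # network latency measured for this request
    status: int         # HTTP status code returned to the user


def summarize(records, window_seconds):
    """Compute the four metrics from the list above for one time window."""
    if not records:
        raise ValueError("need at least one record")
    db_times = [r.db_time_ms for r in records]
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in records if r.status >= 500)
    return {
        "db_avg_ms": mean(db_times),
        "db_min_ms": min(db_times),
        "db_max_ms": max(db_times),
        "requests_per_minute": len(records) / (window_seconds / 60),
        "error_rate": errors / len(records),
        "avg_network_ms": mean(r.network_ms for r in records),
    }
```

In practice, an APM tool collects and aggregates this data for you. The sketch only shows what the numbers mean.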

How can your organization use those metrics?

For example, you might find out that network latency is particularly high for requests served to the United States. Because you’re monitoring network latency and also tracking geographical data, you can combine the two to investigate the issue. An SRE can use this data to quickly find out that you have only a few servers running in the United States. A possible solution would be to increase the server count for the U.S. region.

This example showcases how you can combine and use data that SRE tracks. Without SRE metrics, solving this problem would be much harder.
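
As a rough sketch of that investigation, the Python below groups latency samples by region and lists the regions above a threshold next to their server counts. The function names, the region codes, and the threshold are all hypothetical. A real setup would pull this data from your monitoring system.

```python
from collections import defaultdict
from statistics import mean


def latency_by_region(samples):
    """Average latency per region from (region, latency_ms) pairs."""
    grouped = defaultdict(list)
    for region, latency_ms in samples:
        grouped[region].append(latency_ms)
    return {region: mean(values) for region, values in grouped.items()}


def flag_slow_regions(avg_latency, servers_per_region, threshold_ms):
    """Return (region, avg latency, server count) rows above the threshold, slowest first."""
    slow = [
        (region, avg, servers_per_region.get(region, 0))
        for region, avg in avg_latency.items()
        if avg > threshold_ms
    ]
    return sorted(slow, key=lambda row: row[1], reverse=True)
```

Seeing a slow region next to a low server count is what points toward adding capacity there, rather than, say, hunting for a bug in the application code.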

Next, let’s learn about the relationship between development and SRE.

SRE Requires a Deep Understanding of Software

To be able to improve processes and quickly troubleshoot application crashes or issues, a site reliability engineer needs a thorough knowledge of the code. This is essential for the last aspect of SRE, which is scaling and optimization.

Without a thorough understanding of the software, it’s hard to optimize code or processes related to the software development cycle. This means an SRE spends up to 50% of his or her time writing code. However, not all code will end up in the final product. A lot of code is part of scripts that try to automate repetitive tasks for other developers to make their lives easier.

Next, let’s explore three vital elements of SRE.

3 Most Important Elements of SRE

This section will guide you through some common elements of an SRE’s daily routine. You’ll learn about the need for a configuration management database (CMDB), standardization, and automation.

1. Why a Configuration Management Database Matters

A CMDB helps with centralizing all sorts of configuration data. This can include SSH keys, roles, physical machine locations, access keys, and more.

A CMDB is a great solution because it allows you to store this information centrally. Also, a CMDB exposes endpoints for generating configs, retrieving configs to be used in third-party applications, or even updating configuration items automatically.

For example, let’s say a certain security key needs to be regenerated quarterly. A CMDB allows you to automatically update this key every three months. If this functionality isn’t included, an SRE can create a script that reads the key, generates a new one, and updates the key through the CMDB’s application programming interface (API).
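
Such a rotation script could look roughly like the Python sketch below. The InMemoryCMDB class is a stand-in so the example runs on its own. Its get_item and update_item methods are hypothetical and don’t match any specific CMDB product, so you’d swap in calls to your own CMDB’s API. The 90-day period is an approximation of one quarter.

```python
import secrets
from datetime import datetime, timedelta, timezone

ROTATION_PERIOD = timedelta(days=90)  # roughly one quarter (assumption)


class InMemoryCMDB:
    """Stand-in for a real CMDB API client; the method names are hypothetical."""

    def __init__(self):
        self._items = {}

    def get_item(self, name):
        return self._items[name]

    def update_item(self, name, value, updated_at):
        self._items[name] = {"value": value, "updated_at": updated_at}


def rotate_if_due(cmdb, name, now=None):
    """Replace the key stored under `name` once it is older than the rotation period."""
    now = now or datetime.now(timezone.utc)
    item = cmdb.get_item(name)  # read the current key and its age
    if now - item["updated_at"] < ROTATION_PERIOD:
        return False  # key is still fresh, nothing to do
    # Generate a new random key and write it back through the CMDB client.
    cmdb.update_item(name, secrets.token_hex(32), now)
    return True
```

Running rotate_if_due on a daily schedule keeps the key fresh without anyone having to remember the quarterly deadline.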

Some argue a CMDB is the most important tool every site reliability engineer should know about. You definitely don’t want to work with Excel sheets to track and store all configuration data. Besides, most cloud providers work well with CMDBs to automatically retrieve configuration for deployments.

2. Need for Standardization

Standardization is key for site reliability engineering. You can best optimize a process by setting standards and using templates. A template allows you to standardize the expected data input or data output, which saves you time and effort later.

In short, an SRE spends quite some time defining new standards to further optimize processes.
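
To illustrate what a template can look like in code, here’s a small Python sketch that checks a service configuration against a fixed set of expected fields. The SERVICE_TEMPLATE fields and the validate function are made up for this example. Your own standards will define different ones.

```python
# Hypothetical template: every service config must have exactly these fields.
SERVICE_TEMPLATE = {
    "name": str,
    "owner_team": str,
    "port": int,
    "replicas": int,
}


def validate(config, template=SERVICE_TEMPLATE):
    """Return a list of problems; an empty list means the config matches the template."""
    problems = []
    for field, expected_type in template.items():
        if field not in config:
            problems.append(f"missing field: {field}")
        elif not isinstance(config[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    for field in config:
        if field not in template:
            problems.append(f"unexpected field: {field}")
    return problems
```

Once every team submits configs in the same shape, downstream scripts no longer need special cases for each service.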

3. Automation as Part of SRE

Standardization is great for improving processes, and it often kickstarts automation.

When you can standardize a process, the chances are high you can also automate that process.

Therefore, the SRE is also responsible for finding processes that can be automated. This can be as simple as writing a small script that filters specific logs each day.

All in all, automation is a key element of SRE. It reduces the workload and works well with standardization.
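
As an example of such a small script, the Python sketch below keeps only the log lines for one day and one log level. It assumes each line starts with "YYYY-MM-DD HH:MM:SS LEVEL", which is a made-up format for illustration. Adjust the parsing to whatever your logs actually look like.

```python
from datetime import date


def filter_log_lines(lines, day, level="ERROR"):
    """Keep lines from `day` at `level`, assuming 'YYYY-MM-DD HH:MM:SS LEVEL message'."""
    prefix = day.isoformat()
    selected = []
    for line in lines:
        parts = line.split(maxsplit=3)
        # Skip lines that don't follow the assumed format.
        if len(parts) >= 3 and parts[0] == prefix and parts[2] == level:
            selected.append(line.rstrip("\n"))
    return selected
```

Scheduled once a day (for example, with cron) and pointed at the application log, a script like this hands developers a short list of errors instead of the full log.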


It’s safe to say a site reliability engineer is a creative growth hacker with engineering skills. An SRE should have a natural craving for spotting and improving inefficient processes. Therefore, he or she contributes directly to the overall happiness of other employees.

An SRE is capable of reducing the workload for other employees by automating repetitive tasks. In addition, an SRE drives innovation. He or she continuously improves existing processes and tries to automate them. Therefore, he or she is innovating the way people work—and saving the company a lot of money.

In the end, there’s no such thing as a standard set of tools an SRE uses. Each company can decide freely how to implement SRE, which leaves room for creativity. At its core, the SRE role is focused on development, deployment, and continuous optimization.


This post was written by Michiel Mulders. Michiel is a passionate blockchain developer who loves writing technical content. Besides that, he loves learning about marketing, UX psychology, and entrepreneurship. When he’s not writing, he’s probably enjoying a Belgian beer!
