What Is SRE (Site Reliability Engineering)?
by Michiel Mulders
A site reliability engineer (SRE) loves optimizing inefficient processes but also needs coding skills. To optimize a process, he or she must have a deep understanding of the software behind it. An SRE also contributes directly to the overall happiness of other employees. Why? Because an SRE helps automate repetitive tasks and thereby reduces the pressure on colleagues.
Often, we see SREs as the driving innovators of the company. They have a good overview of the system architecture, the software and its metrics, and operations. Through their extensive knowledge, they can drive innovation. For example, an effective SRE can advise you on which tasks to improve in order to reduce costs or improve efficiency.
This post will teach you what a site reliability engineer does and why. Also, we’ll cover aspects of site reliability engineering such as application performance monitoring, logging, the importance of metrics, and the use of a configuration management database.
First, let’s explore the definition of site reliability engineering.
What Is Site Reliability Engineering (SRE)?
There’s no universally agreed definition of SRE. It sits between software engineering and DevOps. DevOps, however, is often understood as being concerned mainly with deploying and monitoring builds. An SRE, by contrast, is a software engineer who takes care of deployments but also spends time writing code and improving processes.
Why a software engineer? He or she truly understands the software, which is valuable knowledge when debugging nasty issues such as unexpectedly high memory usage.
The SRE movement is trying to shift the art of deployments more and more toward software engineers. The idea is that a software engineer should also know about deploying the code they write.
However, there’s much more to it. SRE is a cross-disciplinary role. A site reliability engineer is responsible for deploying and scaling builds. Besides that, he or she should also take care of application performance monitoring and implement logging.
A software engineer can move into the role of a site reliability engineer only if he or she has an intrinsic motivation for improving processes. SRE is essential for a company because the SRE team is responsible for allowing the company to scale quickly and easily. An SRE spends a large part of his or her time finding inefficient processes and trying to improve them. Ideally, the goal is to automate processes so the organization can “forget” about them.
Importance of Application Performance Monitoring and Logging
Application performance monitoring (APM) and logging are crucial parts of an SRE’s job. He or she will use the data gathered from APM and logging to calculate key metrics, such as service reliability and availability. These two metrics indicate the health of your services and deployment strategy.
Four other important metrics to monitor include:
- Database response time: Measure both the average and the min/max response time.
- Requests per minute: This measures how many requests the application can handle per minute.
- Error rate: How often does the end user experience an error?
- Network latency: Often, people forget to measure the network latency, which can help them calculate the total request time.
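As a rough illustration, the four metrics above could be derived from per-request records like this. It is a minimal sketch, not a prescribed implementation: the `RequestRecord` fields, the `summarize` function, and the sample values are all hypothetical, and a real APM tool would collect this data for you.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    """One handled request; all field names are hypothetical."""
    timestamp_s: float  # seconds since the start of the observation window
    db_time_ms: float   # time spent waiting on the database
    network_ms: float   # network latency measured for this request
    is_error: bool      # True if the end user saw an error


def summarize(records: list[RequestRecord], window_s: float) -> dict[str, float]:
    """Compute the four metrics over one observation window."""
    db_times = [r.db_time_ms for r in records]
    return {
        "db_avg_ms": mean(db_times),
        "db_min_ms": min(db_times),
        "db_max_ms": max(db_times),
        "requests_per_minute": len(records) / (window_s / 60),
        "error_rate": sum(r.is_error for r in records) / len(records),
        "network_avg_ms": mean(r.network_ms for r in records),
    }


# Made-up sample data covering a two-minute window.
records = [
    RequestRecord(1.0, 12.0, 40.0, False),
    RequestRecord(20.0, 30.0, 55.0, True),
    RequestRecord(45.0, 18.0, 25.0, False),
    RequestRecord(100.0, 20.0, 40.0, False),
]
summary = summarize(records, window_s=120.0)
print(summary)
```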
How can your organization use those metrics?
For example, you might find out that network latency is particularly high for requests served to the United States. Because you monitor network latency and also track geographical data, you can combine the two to investigate this issue. An SRE can use this data to quickly find out that you have only a few servers running in the United States. A possible solution for this problem would be to increase the server count for the U.S. region.
This example showcases how you can combine and use data that SRE tracks. Without SRE metrics, solving this problem would be much harder.
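The investigation described above might look something like the following sketch. The latency samples, the server counts, and the 100 ms threshold are invented for illustration; only the idea of combining latency with geographical data comes from the example.

```python
from collections import defaultdict
from statistics import mean

# (region, network latency in ms) samples; the values are made up.
samples = [("US", 180.0), ("US", 220.0), ("EU", 40.0), ("EU", 60.0), ("ASIA", 90.0)]

# Hypothetical server counts per region.
servers_per_region = {"US": 2, "EU": 10, "ASIA": 6}


def average_latency_by_region(samples):
    """Group latency samples by region and average each group."""
    grouped = defaultdict(list)
    for region, latency_ms in samples:
        grouped[region].append(latency_ms)
    return {region: mean(values) for region, values in grouped.items()}


def flag_slow_regions(avg_latency, threshold_ms):
    """Return the regions whose average latency exceeds the threshold."""
    return sorted(r for r, ms in avg_latency.items() if ms > threshold_ms)


avg = average_latency_by_region(samples)
slow = flag_slow_regions(avg, threshold_ms=100.0)  # threshold is an assumption

# Show each slow region next to its server count to spot under-provisioning.
for region in slow:
    print(region, avg[region], servers_per_region[region])
```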
Next, let’s learn about the relationship between development and SRE.
SRE Requires a Deep Understanding of Software
To improve processes and quickly troubleshoot application crashes or other issues, a site reliability engineer needs a thorough knowledge of the code. This knowledge is also essential for another aspect of the SRE role: scaling and optimization.
Without a thorough understanding of the software, it’s hard to optimize code or processes related to the software development cycle. For that reason, an SRE spends up to 50% of his or her time writing code. However, not all of that code will end up in the final product. Much of it goes into scripts that automate repetitive tasks and make life easier for other developers.
Next, let’s explore three vital elements of SRE.
3 Most Important Elements of SRE
This section will guide you through some common elements of an SRE’s daily routine. You’ll learn about the need for a configuration management database (CMDB), standardization, and automation.
1. Why a Configuration Management Database Matters
A CMDB helps centralize all sorts of configuration data. This can include SSH keys, roles, physical machine locations, access keys, and more.
A CMDB is a great solution because it allows you to store this information centrally. Also, a CMDB exposes endpoints for generating configs, retrieving configs to be used in third-party applications, or even updating configuration items automatically.
For example, let’s say a certain security key needs to be regenerated quarterly. A CMDB allows you to automatically update this key every three months. If the CMDB doesn’t include this functionality, an SRE can create a script that reads the key, generates a new one, and updates the key through the CMDB’s application programming interface (API).
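A rotation script of that kind could be sketched as follows. Because every CMDB product has its own API, this example uses `InMemoryCMDB`, a hypothetical stand-in with made-up `get_item` and `update_item` methods; the 90-day interval and the item name `deploy-key` are assumptions as well.

```python
import secrets
from datetime import datetime, timedelta, timezone

ROTATION_INTERVAL = timedelta(days=90)  # roughly one quarter; an assumption


class InMemoryCMDB:
    """Hypothetical stand-in for a real CMDB's API client."""

    def __init__(self):
        self._items = {}

    def get_item(self, name):
        return self._items[name]

    def update_item(self, name, value, updated_at):
        self._items[name] = {"value": value, "updated_at": updated_at}


def rotate_if_due(cmdb, name, now):
    """Read the key, and replace it with a fresh one if it is too old."""
    item = cmdb.get_item(name)
    if now - item["updated_at"] < ROTATION_INTERVAL:
        return False  # key is still fresh, nothing to do
    cmdb.update_item(name, secrets.token_hex(32), now)
    return True


cmdb = InMemoryCMDB()
created = datetime(2020, 1, 1, tzinfo=timezone.utc)
cmdb.update_item("deploy-key", "old-secret", created)

rotated_early = rotate_if_due(cmdb, "deploy-key", created + timedelta(days=30))
rotated_late = rotate_if_due(cmdb, "deploy-key", created + timedelta(days=91))
```

Run from a daily scheduled job, a function like `rotate_if_due` only touches the key once the interval has passed, so running it more often than needed does no harm.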
Some argue a CMDB is the most important tool every site reliability engineer should know about. You definitely don’t want to work with Excel sheets to track and store all configuration data. Besides, most cloud providers work well with CMDBs to automatically retrieve configuration for deployments.
2. Need for Standardization
Standardization is key for site reliability engineering. You can best optimize a process by setting standards and using templates. A template allows you to standardize the expected data input or data output, which saves you time and effort later.
In short, an SRE spends quite some time defining new standards to further optimize processes.
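To make the idea of a template concrete, here is a minimal sketch that checks a deployment request against a standard set of fields. The template contents (`service`, `version`, `replicas`) and the `validate` function are hypothetical examples, not a standard that SRE prescribes.

```python
# Hypothetical template: the fields every deployment request must carry.
REQUIRED_FIELDS = {"service": str, "version": str, "replicas": int}


def validate(request: dict) -> list[str]:
    """Return a list of problems; an empty list means the request fits the template."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in request:
            problems.append(f"missing field: {field}")
        elif not isinstance(request[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


good = validate({"service": "api", "version": "1.2.0", "replicas": 3})
bad = validate({"service": "api", "replicas": "3"})
print(good)
print(bad)
```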
3. Automation as Part of SRE
Standardization is great for improving processes, and it often kickstarts automation. When you can standardize a process, chances are high that you can also automate it.
Therefore, the SRE is also responsible for finding processes that can be automated. This can be as simple as writing a small script that filters specific logs each day.
In short, automation is a key element of SRE. It helps reduce the workload and works well with standardization.
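A small log-filtering script of the kind mentioned above might look like this. It is only a sketch: the log line format (ISO date, time, level, message) and the `filter_logs` function are assumptions, and real logs would be read from files or a logging service.

```python
from datetime import date


def filter_logs(lines, day, level="ERROR"):
    """Keep the lines from the given day that carry the given log level."""
    prefix = day.isoformat()  # e.g. "2020-07-01"
    return [
        line for line in lines
        if line.startswith(prefix) and f" {level} " in line
    ]


# Made-up log lines in an assumed "date time LEVEL message" format.
log_lines = [
    "2020-07-01 10:00:01 INFO service started",
    "2020-07-01 10:05:42 ERROR database timeout",
    "2020-06-30 23:59:59 ERROR disk almost full",
    "2020-07-01 11:15:00 WARNING slow response",
]
errors_today = filter_logs(log_lines, date(2020, 7, 1))
print(errors_today)
```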
It’s safe to say a site reliability engineer is a creative growth hacker with engineering skills. An SRE should have a natural craving for spotting and improving inefficient processes. Therefore, he or she contributes directly to the overall happiness of other employees.
An SRE is capable of reducing the workload for other employees by automating repetitive tasks. In addition, an SRE drives innovation. He or she continuously improves existing processes and tries to automate them. Therefore, he or she is innovating the way people work—and saving the company a lot of money.
In the end, there’s no such thing as a standard set of tools an SRE uses. Each company is free to decide how it implements SRE, which leaves room for creativity. At its core, the SRE role is focused on development, deployment, and continuous optimization.
This post was written by Michiel Mulders. Michiel is a passionate blockchain developer who loves writing technical content. Besides that, he loves learning about marketing, UX psychology, and entrepreneurship. When he’s not writing, he’s probably enjoying a Belgian beer!