DevOps versus SRE – Friend or Foe
by Michiel Mulders
Nowadays, there’s a lack of clarity about the difference between site reliability engineering (SRE) and development and operations (DevOps). The roles definitely overlap, even though there are clear distinctions. Whereas DevOps focuses on automating deployments and tests, SRE focuses on post-deployment processes, such as measuring the application’s performance or aggregating logs to make them easily discoverable.
First of all, allow me to explain DevOps. DevOps proposes to merge the development and operations teams. It means an end to just writing your code and throwing it over the wall for the operations team to deploy and test.
No matter the size of the organization, DevOps tries to align processes and improve communication between both teams. When a feature is finished, the developer helps the operations team test the code, because both teams share responsibility for it.
Besides that, DevOps helps improve code quality. Every time a new feature is complete, the team uses automated tools to build and test it. This lets the team find bugs earlier and gives developers faster feedback.
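To make the build-and-test idea concrete, here is a minimal Python sketch of a fail-fast pipeline. The stage names and the `run_pipeline` helper are hypothetical; real CI tools define stages in their own configuration formats. The core behavior it illustrates is the same, though: run the stages in order and stop at the first failure, so developers hear about a broken build or failing test right away.

```python
from typing import Callable, List, Optional, Tuple

# A stage is a (name, step) pair; the step returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[List[str], Optional[str]]:
    """Run stages in order and stop at the first failure (fail fast).

    Returns the names of the stages that passed and the name of the
    failing stage, or None if every stage passed.
    """
    passed: List[str] = []
    for name, step in stages:
        if not step():
            return passed, name
        passed.append(name)
    return passed, None


if __name__ == "__main__":
    # Hypothetical stages: the test stage fails, so "package" never runs.
    stages: List[Stage] = [
        ("build", lambda: True),
        ("test", lambda: False),
        ("package", lambda: True),
    ]
    passed, failed = run_pipeline(stages)
    print("passed:", passed, "failed at:", failed)
```

Stopping early is a deliberate design choice: there is little point packaging or deploying an artifact whose tests have already failed.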
In short, the DevOps culture aligns all involved stakeholders around a shared goal: the delivery of high-quality, stable software.
Now that you understand the mission and purpose of DevOps, let’s move on to SRE.
Here’s how Ben Treynor, the person who developed the SRE role at Google, defines a site reliability engineer:
“Fundamentally, it’s what happens when you ask a software engineer to design an operations function…So SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor.”
To summarize, an SRE is a software engineer with a deep understanding of the code. SREs spend roughly half of their time writing code to improve processes and introduce automation. They’re a crucial part of the development team because they reduce the workload for many developers.
SREs spend the other half of their time monitoring the system’s health. An SRE observes the health of a system by measuring different metrics, such as availability and reliability. This makes the SRE’s work key to application monitoring and logging.
Differences Between DevOps and SRE
Let’s take a look at some key differences in these roles so you can understand how they complement each other and serve the larger organization.
Difference #1: Background
DevOps: A DevOps engineer has a solid understanding of different system architecture types and is usually fluent with Unix and Linux-based systems. These engineers understand the deployment process and all the components involved in depth.
SRE: A site reliability engineer is actually a software engineer who also has a solid understanding of deployed systems. This combination of software engineering and deployment knowledge makes them highly valued.
Next, let’s take a look at how DevOps and SRE differ in terms of measuring metrics.
Difference #2: Metrics
DevOps: Generally speaking, the mindset of DevOps isn’t much concerned with metrics. Instead, DevOps’ core goal is to automate development processes, including testing, deployments, and builds.
SRE: An SRE tracks metrics such as the availability and reliability of services. These metrics help an SRE understand the system’s health, and the SRE collects them through application monitoring tools. In addition, logs and log aggregation make up a big part of the data the SRE captures.
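As a rough illustration of the availability metric, the sketch below computes availability as the share of successful requests in a measurement window and compares it with a target. The 99.9% target and the function names are assumptions for illustration; each team chooses its own target and its own definition of a "successful" request.

```python
def availability(successful: int, total: int) -> float:
    """Fraction of requests that succeeded in a measurement window."""
    if not 0 <= successful <= total:
        raise ValueError("need 0 <= successful <= total")
    if total == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return successful / total


def meets_target(successful: int, total: int, target: float = 0.999) -> bool:
    """Check measured availability against a hypothetical 99.9% target."""
    return availability(successful, total) >= target
```

For example, 999 successful requests out of 1,000 gives an availability of 0.999, which just meets the assumed target, while 998 out of 1,000 falls short.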
Difference #3: Cultural Shift
DevOps: The origin of DevOps reveals a cultural shift. DevOps wants to merge the operations and development teams. In the DevOps view, development should be more than writing code and throwing the code to the operations team to deploy. The idea is to create a more streamlined process and support a more agile way of working.
SRE: A site reliability engineer is a software engineer who takes care of processes related to scaling and measuring the system’s health. SRE is more than a cultural shift: it actively improves processes to increase the efficiency of development and operations teams.
Difference #4: Automation
DevOps: DevOps provides faster feedback about code through a continuous integration (CI) pipeline. The pipeline automates testing and also builds the software for target platforms, producing artifacts. A CI tool can run tests in parallel, which is often much faster than running them one after another on an engineer’s laptop.
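The parallel-execution point can be sketched with Python’s standard `concurrent.futures` module. This is not how any particular CI tool works internally; the test functions and the `run_suite` helper are hypothetical. They only show why independent tests can finish sooner when they run side by side instead of one after another.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple


def run_test(item: Tuple[str, Callable[[], None]]) -> Tuple[str, bool]:
    """Run one test function; any exception counts as a failure."""
    name, test = item
    try:
        test()
        return name, True
    except Exception:
        return name, False


def run_suite(tests: Dict[str, Callable[[], None]], workers: int = 4) -> Dict[str, bool]:
    """Run independent tests concurrently and collect pass/fail results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run_test, tests.items()))


def slow_passing_test() -> None:
    time.sleep(0.1)  # stands in for I/O-bound work such as a network call


def failing_test() -> None:
    raise AssertionError("deliberately failing example")


if __name__ == "__main__":
    start = time.perf_counter()
    results = run_suite({
        "test_a": slow_passing_test,
        "test_b": slow_passing_test,
        "test_c": slow_passing_test,
        "test_d": failing_test,
    })
    elapsed = time.perf_counter() - start
    print(results, f"{elapsed:.2f}s")
```

Threads suit I/O-bound tests like these because waiting releases the interpreter lock; CPU-bound suites are more commonly split across separate processes or separate CI machines.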
In short, DevOps is concerned with process automation related to the development and deployment phase.
SRE: SRE professionals focus on automating processes related to scaling and measuring the system’s health, which means they mostly automate post-deployment processes. For example, an SRE might build a dashboard that aggregates information about server health, such as memory usage and CPU allocation. Also, as mentioned previously, application performance management tools help the SRE aggregate logs and make them more discoverable.
In brief, an SRE collects data about running services so that he or she can quickly solve issues when a system crashes. Automating this data collection is therefore key to improving these processes.
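A dashboard like this usually sits on top of a summary step. The sketch below assumes hypothetical `HealthSample` records (host, CPU percentage, memory percentage) that some monitoring agent has already collected, averages them per host, and flags hosts above made-up alert thresholds.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class HealthSample:
    host: str
    cpu_percent: float
    memory_percent: float


def summarize(samples: List[HealthSample],
              cpu_alert: float = 90.0,
              mem_alert: float = 85.0) -> Dict[str, dict]:
    """Average CPU and memory per host and flag hosts above the thresholds."""
    by_host: Dict[str, List[HealthSample]] = {}
    for sample in samples:
        by_host.setdefault(sample.host, []).append(sample)

    summary: Dict[str, dict] = {}
    for host, items in by_host.items():
        avg_cpu = mean(s.cpu_percent for s in items)
        avg_mem = mean(s.memory_percent for s in items)
        summary[host] = {
            "avg_cpu": avg_cpu,
            "avg_mem": avg_mem,
            "alert": avg_cpu >= cpu_alert or avg_mem >= mem_alert,
        }
    return summary
```

Averaging per host keeps the dashboard readable; a real setup would likely also keep the raw time series so short spikes are not hidden by the average.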
Difference #5: Dealing With Bugs and Application Failures
DevOps: A DevOps engineer typically lacks a deep understanding of the code. In case of an error, the DevOps engineer can roll back to a previous version to restore service. The DevOps engineer wants to minimize the impact of an issue and restore service as soon as possible. After that, the development team should investigate and fix the issue. (A separate article describes ways your DevOps may be failing.)
SRE: In contrast, a site reliability engineer has a deep understanding of the software. This allows him or her to solve nasty bugs on the spot using the data he or she tracks. Tracking relevant data greatly eases the debugging process. SREs can also be brought in to solve trickier bugs, such as performance issues and memory leaks.
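Rolling back to a previous version can be modeled very simply. The sketch below uses a hypothetical in-memory `ReleaseHistory` class; real deployment tools keep this history in their own systems, and rolling back there also means redeploying the older artifact.

```python
from typing import List, Optional


class ReleaseHistory:
    """Hypothetical record of deployed versions, newest last."""

    def __init__(self) -> None:
        self._versions: List[str] = []

    def deploy(self, version: str) -> None:
        self._versions.append(version)

    @property
    def current(self) -> Optional[str]:
        return self._versions[-1] if self._versions else None

    def rollback(self) -> str:
        """Drop the faulty current version and return the one now live."""
        if len(self._versions) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._versions.pop()
        return self._versions[-1]
```

For example, after deploying `1.4.0` and then a faulty `1.5.0`, calling `rollback()` makes `1.4.0` current again, restoring service while the development team investigates.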
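For memory leaks specifically, Python’s standard library ships `tracemalloc`, which can compare two memory snapshots and show which source lines grew. The leaking `handle_request` function and its never-cleared cache below are hypothetical stand-ins for real application code; the snapshot-diff technique is the point.

```python
import tracemalloc
from typing import List

# Hypothetical module-level cache that is never cleared: the "leak".
_cache: List[bytearray] = []


def handle_request(payload_size: int = 10_000) -> None:
    """Simulated request handler that keeps every payload alive forever."""
    _cache.append(bytearray(payload_size))


def top_growth(requests: int = 200, limit: int = 3) -> List[tracemalloc.StatisticDiff]:
    """Return the source lines whose allocations grew the most."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for _ in range(requests):
        handle_request()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to groups allocations by file and line, largest change first.
    return after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    for stat in top_growth():
        print(stat)
```

In this example the line that appends to `_cache` should dominate the output, which points straight at the leak.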
In case of a crash, an SRE can use the aggregated logs to play back the events that led up to the crash. Having this data readily available helps the SRE to solve those problems quickly.
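Replaying events from aggregated logs mostly comes down to merging several time-ordered streams and looking at the window before the crash. The sketch assumes each service’s log is already a list of `(timestamp, message)` pairs sorted by time; the service names, messages, and five-minute window are made up for illustration.

```python
import heapq
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

LogEntry = Tuple[datetime, str]      # (timestamp, message)
Event = Tuple[datetime, str, str]    # (timestamp, service, message)


def replay_before_crash(service_logs: Dict[str, List[LogEntry]],
                        crash_time: datetime,
                        window: timedelta = timedelta(minutes=5)) -> List[Event]:
    """Merge per-service logs by timestamp and keep events in the window before the crash."""
    streams = [
        [(ts, service, message) for ts, message in entries]
        for service, entries in service_logs.items()
    ]
    start = crash_time - window
    # heapq.merge expects each stream to be sorted already and yields one ordered stream.
    return [event for event in heapq.merge(*streams) if start <= event[0] <= crash_time]


if __name__ == "__main__":
    t0 = datetime(2021, 3, 1, 12, 0)
    logs = {
        "api": [(t0, "request spike"), (t0 + timedelta(minutes=3), "timeouts rising")],
        "db": [(t0 + timedelta(minutes=2), "connection pool exhausted")],
    }
    for ts, service, message in replay_before_crash(logs, t0 + timedelta(minutes=4)):
        print(ts.isoformat(), service, message)
```

Interleaving the services in time order is what turns separate log files into a single story of what led up to the crash.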
In short, a DevOps engineer doesn’t have the required knowledge of the software to solve the problem, whereas an SRE can use his or her data to solve the problem.
Conclusion: DevOps and SRE Are Both Necessary
To summarize, DevOps is concerned with deployment and test automation, whereas SRE focuses on tasks after deployment. SRE works on automating tasks related to availability and the system’s health.
DevOps and SRE have many common elements. For example, both fields have an interest in automation. The main difference is in which stage they are active. DevOps focuses on the development and deployment phase. DevOps empowers developers and provides them with fast feedback about their code. In contrast, SRE is concerned with post-deployment automation and innovating related processes. In the end, your organization needs to understand there’s a difference between the roles.
Although both terms are often used interchangeably, for large organizations, it makes sense to have both DevOps and SRE. SREs play a critical role in making sure a service is available because they can quickly debug problems based on their knowledge of the software.
In smaller organizations, the same person or team often executes SRE and DevOps. Although this double duty is sometimes necessary, it can put too much stress on your operations team. Often, startups like to use software engineers who can help with DevOps, thereby creating an indirect SRE role.
In essence, a site reliability engineer should try to automate as many processes as possible, almost going as far as automating themselves!
This post was written by Michiel Mulders. Michiel is a passionate blockchain developer who loves writing technical content. Besides that, he loves learning about marketing, UX psychology, and entrepreneurship. When he’s not writing, he’s probably enjoying a Belgian beer!