Site Reliability Engineering Top 10 Best Practice

MARCH, 2020

by Arnab Roy Chowdhury

Top 10 SRE Practices

Do you know what the key to a successful website is? Well, you’re probably going to say that it’s quality coding. However, today, there’s one more aspect that we should consider. That’s reliability. There are several SRE practices at play in making a site reliable. And if a site is reliable, it’s successful.

For instance, in case of an urgent task, like filing taxes or a bill payment, you’re likely to check internet connectivity first. And what’s most likely to be your first step for checking whether the internet is working? Trying a Google search! Why? Of course, because Google has such a high level of reliability.

The concept of site reliability engineering (SRE) has existed since 2003. In this post, we’re going to focus on the top 10 SRE practices. But before that, we’ll look at what site reliability engineering is and why it’s important.

What Is Site Reliability Engineering?

Google introduced the concept of site reliability engineering in 2003, just before DevOps became a trend. Initially, engineers practicing SRE made Google’s large-scale sites more efficient, scalable, and reliable, and in doing so they developed a set of practices. The concept later became popular with companies like Netflix and Amazon, which adopted the approach and developed practices of their own for making a site reliable.

Slowly, SRE became an independent domain. The goal was to develop automated solutions. The solutions were meant to help the operations team.

Does an SRE Write Code?

You’ll often see SREs who can write code. That’s hardly surprising, since many of them are originally software engineers who made a career switch. Despite this, and even though coding skills come in handy on the job, SREs aren’t generally expected to write code on a daily basis.

SRE handles tasks like capacity and performance planning, on-call monitoring, and disaster response. But why did SRE become so popular? The reason is that SRE works well with continuous delivery and other DevOps practices. Another focus of SRE is to build systems and processes that let teams learn from outages and errors.

Additionally, SREs set important measurements for their organizations, such as Service-Level Indicators (SLIs), Service-Level Objectives (SLOs), and Service-Level Agreements (SLAs). What are those?

What Are SLA, SLO, and SLI in SRE?

Let’s briefly cover these concepts:

SLI (Service-Level Indicator): a quantitative measure of some aspect of the service level that an organization provides to a customer, such as availability, latency, or error rate
SLO (Service-Level Objective): a specific target for an SLI that the provider and customer agree upon when it comes to the performance of the service
SLA (Service-Level Agreement): the agreement between the provider and customer that brings together all the agreed SLOs

To sum it up: SLIs are metrics that help define SLOs, which in turn are the basis of the overall SLA.
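To make the relationship concrete, here is a minimal sketch in Python of an availability SLI being checked against an SLO. The RequestCounts shape, the function names, and the request numbers are all made up for illustration; a real setup would pull these counts from a monitoring system.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    """Request totals for one measurement window (hypothetical data shape)."""
    total: int
    successful: int


def availability_sli(counts: RequestCounts) -> float:
    """SLI: the fraction of requests that succeeded in the window."""
    if counts.total == 0:
        return 1.0  # assumption: no traffic means nothing failed
    return counts.successful / counts.total


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO check: the measured SLI must be at or above the agreed target."""
    return sli >= slo_target


# Illustrative numbers only: 100,000 requests, 50 of which failed.
window = RequestCounts(total=100_000, successful=99_950)
sli = availability_sli(window)
print(f"SLI = {sli:.4%}, meets a 99.9% SLO: {meets_slo(sli, 0.999)}")
```

The SLI is what you measure, the SLO is the target you compare it to, and an SLA would collect one or more such targets into an agreement with the customer.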

Why Is SRE Important?

We know that most companies nowadays are adopting agile and DevOps, which eliminates silos to a great extent. Even so, development and operations teams often still have barriers between them. SRE is a practice that helps break down these barriers by making both teams aware of each other’s tasks. As a result, the development of new features goes more smoothly. At the same time, SRE ensures the smooth running of production systems. But how?

Building on that, let’s take a look at why SRE is important. Here are some striking benefits SRE offers to a firm.

1. Focus on Learning via a Data-Driven Approach

Are your decisions good or bad for business? Are you making wise choices? SRE is a data-driven approach that uses data to determine how your choices impact your business. By collecting large amounts of data, it helps teams assess system availability and reliability. And do you know what the best part is? This happens at an early stage of the development cycle, which reduces risk at a later stage.

2. Reduce System Toil via Automation

Most companies keep releasing new features to update an existing application that’s in production. But don’t you think that it can take a toll on the system’s health? For instance, let’s consider toil. When you run a production service, toil refers to the repetitive manual tasks involved in keeping it running. With the help of automation, SRE reduces toil.

Every site experiences downtime. When setting an error budget, teams determine how much downtime they can afford over a given period. However, there comes a point when a service runs out of error budget. SREs then pause deployments of new features. Once reliability is restored and toil is reduced, deployments start again, which keeps the system scalable and makes release management easier.
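The arithmetic behind an error budget is simple enough to sketch in a few lines of Python. The function names, the 30-day window, and the rule of pausing deployments once the budget hits zero are assumptions for illustration; each team defines its own window and policy.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime allowed in the window for a given availability target."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def remaining_budget_minutes(slo_target: float, downtime_minutes: float,
                             window_days: int = 30) -> float:
    """Budget left after subtracting the downtime already recorded."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


def can_deploy(slo_target: float, downtime_minutes: float,
               window_days: int = 30) -> bool:
    """Assumed policy: deployments continue only while budget remains."""
    return remaining_budget_minutes(slo_target, downtime_minutes, window_days) > 0


# A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
budget = error_budget_minutes(0.999)
print(f"Budget: {budget:.1f} min")
print("Deploy after 20 min of downtime?", can_deploy(0.999, 20))
print("Deploy after 50 min of downtime?", can_deploy(0.999, 50))
```

With these example numbers, 20 minutes of downtime leaves budget to spare, while 50 minutes exceeds it and would trigger the pause described above.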

3. Consider IT Operations a Value Center

We know that the operations team helps a company reduce downtime. They do so by promoting continuous testing and integration. By keeping as many services available as possible, they help increase revenue. Therefore, companies adopting SRE consider IT operations to be a value center. And do you know what this calls for? Skilled employees for managing SRE. We know what that means! With better skill sets, employees become more productive, making the team more effective.

Now that we know why SRE is important, let’s move on to the best practices you must follow while embracing the SRE culture.

Top 10 Site Reliability Engineering Practices

Google was the first company to develop and promote SRE (Site Reliability Engineering). But what works well for one company won’t necessarily work for another, so there isn’t a one-size-fits-all approach. One thing is for sure, though. Speed, performance, capacity planning, security, hardware and software upgrades, and availability are the main factors that make a system reliable, and SRE is concerned with all of them. Here are the top 10 SRE practices that help ensure system reliability. Let’s take a look.

1. Analyze Changes Keeping the Big Picture in Mind

Code changes have their own impact on the overall business. A successful software engineer understands that. But this awareness has to be even stronger when you’re implementing SRE. Every little change impacts the business. Therefore, analyze each change for the risk it carries. Consider the long-term impact of each change by looking at the big picture, not just how it affects the system today.

2. Move on If Something Seems Like a Dead End

You obviously have good intentions while taking any action for the business. But even though you aim for the best, not all your endeavors will end up successful. For example, suppose you develop something to prevent losses caused by changes. But what if the thing you developed ends up slowing your production cycle? There are two signs that something is a dead end: first, it doesn’t turn out the way you intended, and second, it causes more harm than good for the company. If either of these is the case, you should know when to move on.

3. Have Forward and Pragmatic Thinking

Silos do little good in the SRE culture. A siloed approach doesn’t take into account how a process will impact others in the system. As established earlier, the aim of SRE is to eliminate silos. Thus, a good SRE practice is to think about how a step will affect the rest of the team. If you’re finding solutions to a problem, consider the effects of the solution on others down the road.

4. Automate Wherever Possible

Quick delivery is one of the most important requirements of a company. But so is accuracy. Together, speed and accuracy make a system reliable. As a result, most firms strive to make their systems reliable without slowing down their processes. Identify the repetitive, time-consuming manual processes and stop doing them by hand. Take the time you spend on manual work and use it to automate instead. After all, what’s the first thing in the job description of an SRE? Automation!

5. Expand Skill Sets

We know that a firm needs skilled employees for handling SRE. But a predefined skill set isn’t the only requirement for successful SRE. A company needs to ensure that its employees are ready to expand their skill sets. Since SRE is a relatively new field, it draws people from both development and operations backgrounds. However, it’s important not to pigeonhole the SRE role into a single background. Encourage employees to step out of their comfort zone. Enable them to keep developing new skills and learning new things.

6. Keep Striving Toward Perfection Without Obsessing Over It

Nothing turns out perfect, even when that’s what you’re aiming for. The same goes for SRE. After all, you’re not trying to create an unbreakable system! You know that isn’t realistic. In short, do everything you can to ensure reliability. Leave no stone unturned. Automate, develop new skills, and look at the big picture. Strive to do better and get as close to perfection as you can.

7. Do Everything to Eliminate Toil

When a project begins, the setup is simple. You only have a few configuration files in JSON or INI format. But as the number of modules increases, managing the configuration becomes a difficult task. Suppose that in your project, you need to add a few lines to support multiple languages. In a large project, the task consumes a lot of time because you have to make the same change in many files. To reduce manual labor, use automation. Invest some time in developing a framework that performs these repetitive tasks. Do whatever you can to reduce the developer’s workload.
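As a rough illustration of the idea, the Python sketch below applies one change to every JSON config file under a directory instead of editing each file by hand. The "languages" key, the add_language helper, and the file layout are hypothetical; adapt them to whatever your project’s configuration actually looks like.

```python
import json
import tempfile
from pathlib import Path


def add_language(config_dir: Path, language: str) -> list[Path]:
    """Add `language` to the "languages" list of every JSON config found.

    Assumes each config file holds a JSON object; anything else is skipped.
    Returns the files that were actually changed.
    """
    changed = []
    for path in sorted(config_dir.rglob("*.json")):
        config = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(config, dict):
            continue  # not an object-style config, leave it alone
        languages = config.setdefault("languages", [])
        if language not in languages:
            languages.append(language)
            path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
            changed.append(path)
    return changed


# Demo on throwaway files so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "billing.json").write_text(json.dumps({"languages": ["en"]}))
    (root / "search.json").write_text(json.dumps({"name": "search"}))
    updated = add_language(root, "fr")
    print("Updated:", [p.name for p in updated])
```

Because the helper skips files that already contain the language, it is safe to run more than once, which is a useful property for any automation that replaces a manual chore.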

8. Persuade Management to Do What’s Important

Let’s discuss this with an example. What if, during the development phase, you come across a certain automation framework? You find out that the framework is a bit expensive, but in the long run, it saves a great deal of rework. If that’s the scenario, explain to management why you need this framework. Remember, the aim of SRE is to make systems reliable. If you’re convinced that the framework will help the team once the project is deployed in production, prepare a document that makes the case. Explain everything to management and get it for your team.

9. Blameless Postmortem

Whenever an outage or incident occurs, SRE experts carry out a postmortem. In this stage, they find the root cause of the issue and document the incident. A postmortem is a great learning opportunity for an SRE. While writing the report, engineers get a clear idea of how things in the back end work.

However, while writing the postmortem report, don’t blame anyone or any team for what went wrong. Focus on finding out the “what” instead of the “who.” Postmortems are a chance to fix the weaknesses in a system. A constructive report identifies the actions that led to the incident, so the teams can work to keep it from happening again. The blame game will only weaken the bond between teams, which contradicts the SRE culture.

10. Define Service-Level Objectives Like an End User

To manage service in the right way, it’s important to understand how the service will behave. While designing the service-level objectives (SLOs), think from a user’s point of view. Consider the factors that will matter to an end user.

For instance, Google started measuring the latency and error rates of Gmail on the client side rather than on the server side, as it had done before. The client-side error rates and latencies differed considerably from the server-side figures, and code changes were made to fix the issues this revealed. As a result, the availability of Gmail increased from 99% to 99.9% in just a few years.
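Why would the two measurements differ? The Python sketch below shows one possible reason: requests that fail before reaching the server never show up in server-side metrics. The Attempt record and all the numbers are invented for illustration and are not Google’s data.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One user request as seen from the client (hypothetical record shape)."""
    reached_server: bool  # False if it failed in DNS, the network, or a proxy
    server_ok: bool       # only meaningful when reached_server is True


def server_side_error_rate(attempts: list[Attempt]) -> float:
    """Error rate over the requests the server actually saw."""
    seen = [a for a in attempts if a.reached_server]
    if not seen:
        return 0.0
    return sum(not a.server_ok for a in seen) / len(seen)


def client_side_error_rate(attempts: list[Attempt]) -> float:
    """Error rate over every request the user made."""
    if not attempts:
        return 0.0
    failed = sum(not (a.reached_server and a.server_ok) for a in attempts)
    return failed / len(attempts)


# Made-up traffic: 980 successes, 5 server errors, 15 requests lost on the way.
attempts = (
    [Attempt(True, True)] * 980
    + [Attempt(True, False)] * 5
    + [Attempt(False, False)] * 15
)
print(f"Server-side error rate: {server_side_error_rate(attempts):.2%}")
print(f"Client-side error rate: {client_side_error_rate(attempts):.2%}")
```

In this toy data set the server sees only its own 5 failures, while the user experiences 20, which is why an SLO defined from the user’s point of view can tell a very different story.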

Wrapping It Up

To succeed, SRE requires special skills and a sense of trust among teams. Being responsible for SRE is largely about taking ownership of production operations. It’s a specific approach that focuses on IT operations. The SRE model helps establish a productive and healthy bond between operations and development teams, and the people responsible take the necessary steps to make the system reliable.

If you’re planning to adopt SRE culture in your project, train your team, follow the best practices, and trust the process. You won’t achieve 100% perfection. That’s a myth. But you will make things better and get as close to perfection as possible.

Arnab Roy Chowdhury

This post was written by Arnab Roy Chowdhury. Arnab is a UI developer by profession and a blogging enthusiast. He has strong expertise in the latest UI/UX trends, project methodologies, testing, and scripting.
