Site Reliability Engineering (SRE) Top 10 Best Practices

06 March, 2020

by Arnab Roy Chowdhury


Do you know what the key to a successful website is? Well, you’re probably going to say that it’s quality coding. However, today, there’s one more aspect that we should consider: reliability. Several SRE practices are at play in making a site reliable. And a reliable site is far more likely to be a successful one.

For instance, in case of an urgent task, like filing taxes or a bill payment, you’re likely to check internet connectivity first. And what’s most likely to be your first step for checking whether the internet is working? Trying a Google search! Why? Of course, because Google has such a high level of reliability.

The concept of site reliability engineering (SRE) has existed since 2003. In this post, we’re going to focus on the top 10 SRE practices. But before that, we’ll look at what site reliability engineering is and why it’s important.


What Is Site Reliability Engineering?

Google introduced the concept of site reliability engineering in 2003, just before DevOps became a trend. Initially, developers working with SRE made Google’s large-scale sites more efficient, scalable, and reliable. While doing so, they developed a set of practices. The concept of SRE then became very popular with companies like Netflix and Amazon. They adopted this new approach and developed new practices of their own for making a site reliable.

Slowly, SRE became an independent domain. The goal was to develop automated solutions that help the operations team. SRE handles tasks like capacity and performance planning, on-call monitoring, and disaster response. But why did SRE become so popular? The reason is that SRE works well with continuous delivery and other DevOps practices. Another focus of SRE is to learn from outages and errors and build that learning back into the system.

All clear about what SRE is? Let’s discuss why SRE is so important.

Why Is SRE Important?

We know that most companies nowadays are adopting agile and DevOps. Traditionally, development and operations teams have had barriers between them. SRE is a practice that helps break down these barriers and eliminates silos to a great extent. Both teams stay aware of each other’s tasks. As a result, the development of new features goes more smoothly. At the same time, SRE helps keep production systems running smoothly. But how?

Here are some striking benefits SRE offers to a firm.

1. Focus on Learning via a Data-Driven Approach

Are your decisions good or bad for business? Are you making wise choices? SRE is a data-driven approach. It uses data to determine how your choices impact your business. It also collects massive amounts of data, which helps teams assess system availability and reliability. And do you know what the best part is? This happens at an early stage of the development cycle, thereby reducing risk at a later stage.

2. Reduce System Toil via Automation

Most companies keep releasing new features to update an existing application that’s in production. But don’t you think that it can take a toll on the system’s health? For instance, let’s consider system toil. Toil refers to the repetitive manual tasks that come with running a production service. With the help of automation, SRE reduces toil.

Every site experiences downtime. When teams set an error budget, they decide how much downtime they can afford over a given period. However, there comes a point when a team runs out of error budget. Site reliability engineers then pause deployments and focus on stabilizing the system and reducing toil. Once that work is done, deployments start again, which keeps the system scalable and makes release management easier.
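To make the error budget idea concrete, here is a minimal Python sketch. It assumes the budget is expressed as allowed downtime over a rolling window for a given availability target. The function names (`error_budget`, `budget_remaining`, `can_deploy`) and the 30-day window are illustrative assumptions, not part of any specific tool.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Downtime allowed within `window` for an availability target `slo` (e.g. 0.999)."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Budget left after the downtime already observed; negative means overspent."""
    return error_budget(slo, window) - downtime


def can_deploy(slo: float, window: timedelta, downtime: timedelta) -> bool:
    """Hypothetical release gate: allow deployments only while budget remains."""
    return budget_remaining(slo, window, downtime) > timedelta(0)


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed measurement window
    print("Budget at 99.9%:", error_budget(0.999, window))
    print("Deploy after 50 min of downtime?",
          can_deploy(0.999, window, timedelta(minutes=50)))
```

With a 99.9% target over 30 days, the budget works out to roughly 43 minutes of downtime, so 50 minutes of observed downtime would pause releases in this sketch. A real gate would likely read downtime from monitoring data rather than take it as an argument.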

3. Consider IT Operations a Value Center

We know that the operations team helps a company reduce downtime. They do so by promoting continuous testing and integration. By keeping as many services available as possible, they help increase revenue. Therefore, companies adopting SRE consider IT operations to be a value center. And do you know what this calls for? Skilled employees for managing SRE. With better skill sets, employees become more productive, making the team more effective.

Now that we know why SRE is important, let’s move on to the best practices you must follow while embracing the SRE culture.

Top 10 SRE Practices

Google was the first company to develop and promote SRE. But what works well for one company won’t necessarily work for another, so there isn’t a one-size-fits-all approach. One thing is for sure, though. Speed, performance, capacity planning, security, hardware and software upgrades, and availability are the main factors that make a system reliable. SRE is concerned with all of them. Here are the top 10 SRE practices that help ensure system reliability. Let’s take a look.

1. Analyze Changes Keeping the Big Picture in Mind

Code changes have their own impact on the overall business. A successful software engineer understands that. But this awareness has to be even stronger when you’re implementing SRE. Every little change impacts the business. Therefore, analyze each change for the risk it carries. Consider the long-term impact of each change by seeing the big picture, not just how it affects the system today.

2. Move on If Something Seems Like a Dead End

You obviously have good intentions while taking any action for the business. But even though you aim for the best, not all your endeavors will end up successful. For example, suppose you develop something to prevent losses caused by changes. But what if the thing you developed ends up slowing your production cycle? There are two signs that something is a dead end: first, it doesn’t turn out the way you intended, and second, it does the company more harm than good. If either of these is the case, you should know when to move on.

3. Have Forward and Pragmatic Thinking

Silos do little good in the SRE culture. A siloed approach doesn’t take into account how a process will impact others in the system. As established earlier, the aim of SRE is to eliminate silos. Thus, a good SRE practice is to think about how a step will affect the rest of the team. If you’re finding solutions to a problem, consider the effects of the solution on others down the road.

4. Automate Wherever Possible

Quick delivery is one of the most important requirements of a company. But so is accuracy. So, together, speed and accuracy make a system reliable. As a result, most firms strive to make their systems reliable without slowing down different processes. Stop the time-consuming and repetitive processes that waste time. Take the time you spend in relentless manual work and instead use it to automate. After all, what’s the first thing in the job description of an SRE engineer? Automation!

5. Expand Skill Sets

We know that a firm needs skilled employees for handling SRE. But a predefined skill set isn’t the only thing for successful SRE. A company needs to ensure that the employees are ready to expand their skill sets. Since SRE is a relatively new field, we have people from both development and operations backgrounds. However, it’s important not to pigeonhole the SRE engineer’s role to a single background. Encourage employees to step out of their comfort zone. Enable them to keep developing new skills and learn new things.

6. Keep Striving Toward Perfection Without Obsessing Over It

Nothing turns out perfect, even though that’s what you’re aiming for. The same goes for SRE. After all, you’re not trying to create a break-proof system! You know that won’t always work. In short, do everything you can to ensure reliability. Leave no stone unturned on your end. Automate, develop new skills, and look at the big picture. Strive to do better to stay close to perfection.

7. Do Everything to Eliminate Toil

When a project begins, the setup is simple. You only have a few configuration files in JSON or INI format. But as the number of modules increases, managing the configuration becomes a difficult task. Suppose that in your project, you need to add a few lines of code for supporting multiple languages. In a large project, the task consumes a lot of time, as you have to make the same change in many files. To reduce manual labor, use automation. Invest some time in developing a framework that performs these repetitive tasks, and look for other ways to reduce the developer’s workload.
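As a small illustration of automating this kind of repetitive edit, the Python sketch below adds a missing key to every JSON file in a directory. It assumes each file holds a single top-level JSON object (for example, one translation file per language). The function name `add_key_to_configs` and the file layout are hypothetical.

```python
import json
from pathlib import Path


def add_key_to_configs(config_dir: Path, key: str, default: str) -> list[Path]:
    """Add `key` with value `default` to each *.json file that lacks it.

    Assumes every file contains one top-level JSON object.
    Returns the files that were changed, so the run can be reviewed.
    """
    changed = []
    for path in sorted(config_dir.glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        if key in data:
            continue  # already present: leave the existing value untouched
        data[key] = default
        path.write_text(
            json.dumps(data, indent=2, ensure_ascii=False) + "\n",
            encoding="utf-8",
        )
        changed.append(path)
    return changed
```

Calling `add_key_to_configs(Path("locales"), "farewell", "TODO")` would touch only the files missing the `farewell` key. Because existing values are skipped, the script can safely be run more than once.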

8. Persuade Management to Do What’s Important

Let’s discuss this with an example. What if, during the development phase, you come across a certain automation framework? You find out that the framework is a bit expensive, but in the long run, it saves a great deal of rework. If that’s the scenario, explain to management why you need this framework. Remember, the aim of SRE is to make systems reliable. If you’re convinced that the framework will help the team once the project is deployed in production, prepare a document. Lay out the case for management and get the framework for your team.

9. Blameless Postmortem

Whenever an outage or incident occurs, SRE experts carry out a postmortem. In this stage, they find the root cause of the issue and document the incident. A postmortem offers a great learning opportunity to an SRE engineer. While writing the report, engineers get a clear idea of how things in the back end work.

However, while writing the postmortem report, don’t blame anyone or any team for what went wrong. Focus on finding out the “what” instead of the “who.” Postmortems offer an opportunity to fix the weaknesses in a system. A constructive report identifies the actions that led to the incident, so the teams can work to make sure it doesn’t happen again. The blame game will only weaken the bond between teams, which contradicts the SRE culture.

10. Define Service-Level Objectives Like an End User

To manage service in the right way, it’s important to understand how the service will behave. While designing the service-level objectives, think from a user’s point of view. Consider the factors that will matter to an end user.

For instance, Google started measuring the latency and error rates of Gmail on the client side rather than on the server, as it had done before. The client-side error rates and latency figures differed a lot from the server-side ones. Code changes were made accordingly to fix the issues. The result was that the availability of Gmail increased from 99% to 99.9% in just a few years.
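To illustrate what checking user-facing objectives might look like, here is a minimal Python sketch. It assumes you already collect one record per request as seen by the client. The `ClientRequest` shape, the function names, and the choice of a 95th-percentile latency target are assumptions for illustration, not a description of how Google measures Gmail.

```python
import math
from dataclasses import dataclass


@dataclass
class ClientRequest:
    """One request as observed by the client (hypothetical record shape)."""
    latency_ms: float
    ok: bool


def availability(requests: list[ClientRequest]) -> float:
    """Fraction of requests that succeeded from the user's point of view."""
    if not requests:
        raise ValueError("no requests recorded")
    return sum(1 for r in requests if r.ok) / len(requests)


def latency_percentile(requests: list[ClientRequest], pct: int) -> float:
    """Nearest-rank percentile of client-observed latency, in milliseconds."""
    if not requests:
        raise ValueError("no requests recorded")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_slo(requests: list[ClientRequest],
              min_availability: float, max_p95_ms: float) -> bool:
    """True when both the availability and the p95 latency objectives hold."""
    return (availability(requests) >= min_availability
            and latency_percentile(requests, 95) <= max_p95_ms)
```

Feeding the same functions with server-side records and comparing the results is one simple way to see how far the two views diverge.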

Wrapping It Up

To succeed, SRE requires special skills. There should be a sense of trust among teams. Being responsible for SRE is more about taking ownership of operations related to production. It’s a specific approach that focuses on IT operations. The SRE model helps in establishing a productive and healthy bond between operations and development teams. The experts responsible take the necessary steps to make the system reliable.

If you’re planning to adopt SRE culture in your project, train your team, follow the best practices, and trust the process. You won’t achieve 100% perfection. That’s a myth. But you will make things better and get as close to perfection as possible.

This post was written by Arnab Roy Chowdhury. Arnab is a UI developer by profession and a blogging enthusiast. He has strong expertise in the latest UI/UX trends, project methodologies, testing, and scripting.
