Site Reliability Engineering (SRE): Top 10 Best Practices
by Arnab Roy Chowdhury
Do you know what the key to a successful website is? You’re probably going to say that it’s quality code. Today, however, there’s one more aspect to consider: reliability. Several SRE practices work together to make a site reliable, and a reliable site is far more likely to succeed.
For instance, in case of an urgent task, like filing taxes or a bill payment, you’re likely to check internet connectivity first. And what’s most likely to be your first step for checking whether the internet is working? Trying a Google search! Why? Of course, because Google has such a high level of reliability.
The concept of site reliability engineering (SRE) has existed since 2003. In this post, we’re going to focus on the top 10 SRE practices. But before that, we’ll look at what site reliability engineering is and why it matters.
What Is Site Reliability Engineering?
Google introduced the concept of site reliability engineering in 2003, just before DevOps became a trend. Initially, engineers practicing SRE made Google’s large-scale sites more efficient, scalable, and reliable. While doing so, they developed a set of practices. The concept became very popular with companies like Netflix and Amazon, which adopted the approach and developed new practices of their own for making a site reliable.
Slowly, SRE became an independent discipline. The goal was to develop automated solutions that help the operations team. SRE handles tasks like capacity and performance planning, on-call monitoring, and disaster response. But why did SRE become so popular? The reason is that SRE works well with continuous delivery and other DevOps practices. Another focus of SRE is to build systems that learn from outages and errors.
All clear about what SRE is? Let’s discuss why SRE is so important.
Why Is SRE Important?
We know that most companies nowadays are adopting agile and DevOps, which aim to eliminate silos. Even so, barriers often remain between development and operations teams. SRE is a practice that helps break down these barriers by making both teams aware of each other’s tasks. As a result, new features can be developed with less friction, while SRE keeps production systems running smoothly. But how?
Here are some striking benefits SRE offers to a firm.
1. Focus on Learning via a Data-Driven Approach
Are your decisions good or bad for business? Are you making wise choices? SRE is a data-driven approach. It uses data to determine how your choices impact your business. By collecting large amounts of data, it helps teams assess system availability and reliability. And do you know what the best part is? This happens at an early stage of the development cycle, thereby reducing risk at later stages.
2. Reduce System Toil via Automation
Most companies keep releasing new features to update an application that’s already in production. But don’t you think that can take a toll on the system’s health? Consider toil, for instance. In a production service, toil refers to repetitive manual work. SRE reduces toil with the help of automation.
Every site experiences downtime. When setting an error budget, teams determine how much downtime they can afford. However, there comes a point when a service runs out of error budget. Site reliability engineers then pause deployments of new features and focus on reliability work, such as automating away toil. Once the service is back within its budget, deployments resume. This keeps the system scalable and makes release management easier.
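The error budget idea can be sketched in a few lines of Python. This is a minimal illustration rather than a standard tool: the function names `error_budget` and `can_deploy`, the 99.9% availability target, and the 30-day window are all assumptions chosen for the example.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Downtime allowed by an availability target `slo` (e.g. 0.999) over `window`."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be greater than 0 and at most 1")
    return window * (1.0 - slo)


def can_deploy(slo: float, window: timedelta, downtime_so_far: timedelta) -> bool:
    """True while some error budget remains, so releases may continue."""
    return downtime_so_far < error_budget(slo, window)


# Hypothetical numbers: a 99.9% target over 30 days allows roughly 43 minutes of downtime.
window = timedelta(days=30)
print(error_budget(0.999, window))
print(can_deploy(0.999, window, timedelta(minutes=20)))  # budget remains
print(can_deploy(0.999, window, timedelta(minutes=50)))  # budget is spent
```

In practice, a team would feed `downtime_so_far` from its monitoring data and decide for itself what happens when the budget is spent.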
3. Consider IT Operations a Value Center
We know that the operations team helps a company reduce downtime. They do so by promoting continuous testing and integration. They keep as many services available as possible, which in turn protects and increases revenue. Therefore, companies adopting SRE consider IT operations to be a value center. And do you know what this calls for? Skilled employees for managing SRE. With better skill sets, employees become more productive, making the team more effective.
Now that we know why SRE is important, let’s move on to the best practices you must follow while embracing the SRE culture.
Top 10 SRE Practices
Google was the first company to develop and promote SRE. But what works well for one company won’t necessarily work for another, so there isn’t a one-size-fits-all approach. One thing is for sure, though. Speed, performance, capacity planning, security, hardware and software upgrades, and availability are the main factors that make a system reliable. SRE is concerned with all of them. Here are the top 10 SRE practices that support system reliability. Let’s take a look.
1. Analyze Changes Keeping the Big Picture in Mind
Code changes have their own impact on the overall business. A successful software engineer understands that, and this awareness has to be even sharper when you’re implementing SRE. Every little change impacts the business. Therefore, analyze each change for the risk it carries. Look at the big picture and consider the long-term impact of a change, not just how it affects the system today.
2. Move on If Something Seems Like a Dead End
You obviously have good intentions when taking any action for the business. But even though you aim for the best, not all your endeavors will succeed. For example, suppose you develop something to prevent losses caused by changes. But what if the thing you developed ends up slowing your production cycle? There are two signs that something is a dead end: first, it doesn’t turn out the way you intended, and second, it costs the company more than it gains. If either of these is the case, you should know when to move on.
3. Have Forward and Pragmatic Thinking
Silos do little good in the SRE culture. A siloed approach doesn’t take into account how a process will impact others in the system. As established earlier, the aim of SRE is to eliminate silos. Thus, a good SRE practice is to think about how a step will affect the rest of the team. If you’re finding solutions to a problem, consider the effects of the solution on others down the road.
4. Automate Wherever Possible
Quick delivery is one of the most important requirements of a company. But so is accuracy. Together, speed and accuracy make a system reliable. As a result, most firms strive to make their systems reliable without slowing down their processes. Identify the repetitive processes that waste time. Take the time you spend on manual work and use it to automate instead. After all, what’s the first thing in the job description of a site reliability engineer? Automation!
5. Expand Skill Sets
We know that a firm needs skilled employees for handling SRE. But a predefined skill set isn’t the only requirement for successful SRE. A company needs to ensure that its employees are ready to expand their skill sets. Since SRE is a relatively new field, it draws people from both development and operations backgrounds. However, it’s important not to pigeonhole the site reliability engineer’s role to a single background. Encourage employees to step out of their comfort zone. Enable them to keep developing new skills and learning new things.
6. Keep Striving Toward Perfection Without Obsessing Over It
Nothing turns out perfect, even if that’s what you’re aiming for. The same goes for SRE. After all, you’re not trying to create an unbreakable system! You know that isn’t realistic. In short, do everything you can to ensure reliability. Leave no stone unturned. Automate, develop new skills, and look at the big picture. Strive to do better and get close to perfection.
7. Do Everything to Eliminate Toil
When a project begins, the setup is simple. You only have a few configuration files in JSON or INI format. But as the number of modules increases, managing the configuration becomes a difficult task. Suppose that in your project, you need to add a few lines to support multiple languages. In a large project, the task consumes a lot of time because you have to make the change in many files. To reduce manual labor, use automation. Invest some time in developing a framework that performs repetitive tasks. Do what you can to reduce the developer’s workload.
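As one hedged illustration of this kind of automation, the sketch below adds a missing setting to every JSON configuration file under a directory instead of editing each file by hand. The function name `add_missing_key`, the directory layout, and the `supported_languages` key are hypothetical; a real project would adapt them to its own configuration format.

```python
import json
import tempfile
from pathlib import Path


def add_missing_key(config_dir, key, default):
    """Add `key` with `default` to each *.json file under `config_dir` that lacks it.

    Returns the list of files that were changed.
    """
    changed = []
    for path in sorted(Path(config_dir).rglob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        # Only touch files whose top level is an object and that lack the key.
        if isinstance(data, dict) and key not in data:
            data[key] = default
            path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
            changed.append(path)
    return changed


if __name__ == "__main__":
    # Demo in a throwaway directory with two made-up module configs.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "billing.json").write_text('{"timeout": 30}', encoding="utf-8")
        Path(tmp, "search.json").write_text(
            '{"supported_languages": ["en", "de"]}', encoding="utf-8"
        )
        for path in add_missing_key(tmp, "supported_languages", ["en"]):
            print("updated", path.name)
```

Files that already define the key are left alone, so the script can be run repeatedly without overwriting existing settings.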
8. Persuade Management to Do What’s Important
Let’s discuss this with an example. What if, during the development phase, you come across a certain automation framework? You find out that the framework is a bit expensive, but in the long run, it saves a great deal of rework. If that’s the scenario, explain to management why you need this framework. Remember, the aim of SRE is to make systems reliable. If you’re convinced that the framework will help the team once the project is deployed in production, prepare a document that lays out the costs and benefits. Present it to management and make the case for your team.
9. Blameless Postmortem
Whenever an outage or incident occurs, SRE experts carry out a postmortem. In this stage, they find the root cause of the issue and document the incident. A postmortem offers a great learning opportunity to a site reliability engineer. While writing the report, engineers get a clear idea of how things work in the back end.
However, while writing the postmortem report, don’t blame any person or team for what went wrong. Focus on finding out the “what” instead of the “who.” Postmortems offer a chance to fix the weaknesses in a system. A constructive report identifies the actions that led to the incident, so the teams can work to keep it from happening again. The blame game will only weaken the bond between teams, which contradicts the SRE culture.
10. Define Service-Level Objectives Like an End User
To manage a service in the right way, it’s important to understand how the service will behave. While designing the service-level objectives, think from a user’s point of view. Consider the factors that will matter to an end user.
For instance, Google started measuring the latency and error rates of Gmail on the client side rather than on the server, as it had done before. The client-side error rates and latencies differed a lot from the server-side figures. Code changes were made accordingly to fix the issues. The result was that the availability of Gmail increased from 99% to 99.9% in just a few years.
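To make the user-centric view concrete, here is a minimal sketch of how an availability indicator might be computed from client-side measurements and compared against an objective. The `ClientRequest` record, its field names, the 500 ms latency threshold, and the sample data are all assumptions made for illustration; they don’t describe how Gmail is actually measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientRequest:
    """One request as observed by the user's client (hypothetical record)."""
    ok: bool            # the client received a successful response
    latency_ms: float   # time the user waited, measured on the client


def availability_sli(requests, latency_threshold_ms):
    """Fraction of requests that succeeded AND were fast enough for the user."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for r in requests if r.ok and r.latency_ms <= latency_threshold_ms)
    return good / len(requests)


def meets_slo(requests, latency_threshold_ms, target):
    """Compare the measured indicator with the objective, e.g. target=0.999."""
    return availability_sli(requests, latency_threshold_ms) >= target


# Made-up sample: a slow response counts against the user experience
# even though the server would have logged it as a success.
sample = [
    ClientRequest(ok=True, latency_ms=120.0),
    ClientRequest(ok=True, latency_ms=900.0),
    ClientRequest(ok=False, latency_ms=80.0),
    ClientRequest(ok=True, latency_ms=200.0),
]
print(availability_sli(sample, latency_threshold_ms=500.0))
print(meets_slo(sample, latency_threshold_ms=500.0, target=0.999))
```

Counting slow responses as failures is a design choice: it ties the objective to what the end user experiences rather than to what the server reports.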
Wrapping It Up
To succeed, SRE requires special skills and a sense of trust among teams. Being responsible for SRE is about taking ownership of production operations. It’s a specific approach that focuses on IT operations. The SRE model helps establish a productive and healthy bond between operations and development teams, and the experts responsible take the necessary steps to make the system reliable.
If you’re planning to adopt SRE culture in your project, train your team, follow the best practices, and trust the process. You won’t achieve 100% perfection. That’s a myth. But you will make things better and get as close to perfection as possible.
This post was written by Arnab Roy Chowdhury. Arnab is a UI developer by profession and a blogging enthusiast. He has strong expertise in the latest UI/UX trends, project methodologies, testing, and scripting.