Select Page

Self-Healing Applications

March, 2019 by Sylvia Froncza

An IT and Test Environment Perspective

Traditionally, test environments have been difficult to manage. For one, data exists in unpredictable or unknown states. Additionally, various applications and services contain unknown versions or test code that may skew testing results. And then to top it all off, the infrastructure and configuration of each environment may be different. But why is that a problem? Well, although testing and test management play a crucial role in delivering software, they often get less attention than the software development process or production support. And without efficient, repeatable, and properly configured test environments, we can greatly delay the delivery of new features or allow devastating bugs into production. Fortunately, a solution exists for your test environment management (TEM) woes. Because with self-healing applications and environments, you gain access to standardized, repeatable, and automated processes for managing your environments. In this post, we’re going to discuss self-healing applications, separate the hype from reality, and get you started on the self-healing journey. And we’ll be considering all of this from the perspective of IT and TEM.  

Why Do We Need Self-Healing Apps and Environments?

Since you’re on Enov8’s blog, you may already be familiar with some of the challenges that exist with test environment management. But let’s briefly review some of them.

Limited Number of Test Environments

First, even fairly mature companies have a limited number of test environments. This might not seem like a big deal, but many of us have felt the crunch when multiple initiatives are tested at once. Initiative “A” locks down the system integration environment, while initiative “B” requires kicking everyone out of the end-to-end environment. Then, load testing requires the use of pre-production, while smaller projects and work streams scramble to find time slots for their own testing.

Unknown State of Environments

Later, once the environments are free for testing again, no one knows the current state of anything: data, versions, configuration, infrastructure, or patches. And it’s a manual process to get things back to where they need to be.

Not Able to Replicate Production

Additionally, test environments do not typically have as many resources available as the full-blown production environment. This is usually a cost-cutting measure, but it often makes load testing difficult. Therefore, we often have to extrapolate how the production environment will react under load. If we could easily scale our test environment up or down, we might have better data around load testing.

Painful Provisioning

Finally, with the increasingly distributed systems we rely on, it’s becoming more and more difficult to manually provision and later manage new test environments. And because many of the processes are manual, finding defects related to infrastructure setup and configuration becomes increasingly difficult. For example, if patches roll out manually to fix infrastructure bugs, QA personnel can’t always see easily what patches have been rolled out where. Now let’s look at what self-healing applications and environments are and how they can help.

What’s Self-Healing?

Self-healing implies the ability of applications, systems, or environments to detect and fix problems automatically. As we all know, perfect systems don’t exist. There are always bugs, limitations, and scaling issues. And the more we try to tighten everything up, the more brittle the application becomes. So what do we do? Embrace the possibility of failure. And automate systems to fix issues with minimal intervention. Now, please note I said minimal intervention. Though self-healing purports to eliminate the need for human intervention entirely, that’s not quite true. It reduces the need, but it doesn’t completely eliminate it. We’ll talk more about that later in this post. But first, let’s examine the two types of self-healing processes.

Reactive vs. Preventive

There are two types of automated healing we’ll discuss today: reactive and preventive. Reactive healing occurs in response to an error condition. For instance, if an application is down or not responding to external calls, we can react and automatically restart or redeploy the application. Or, within an application, reactive healing can include automated retry logic when calling external dependencies. Preventive healing, in contrast, monitors trends and acts upon the application or system based on that trend. For example, if memory or CPU usage climb at an unacceptable rate, we might scale the application vertically to increase available memory or CPU. Alternatively, if our metrics trend upward due to too much load, we can scale the application horizontally by adding additional instances before failure. Thorough self-healing necessitates both types of measures. However, when getting started it’s easier to add reactive healing. That’s because it’s typically easier to detect a complete failure or error condition than it is to detect a trend. And the number of possible fixes is typically smaller for reactive healing, too.

Self-Healing Applications

OK, so then what are self-healing applications? Well, they’re applications that either reactively or preventively correct or heal themselves internally. Instead of just logging an error, the application takes steps to either correct or avoid the error. For example, if calling a dependency fails, the application may contain automatic retry logic. Alternatively, the application could also go to a secondary source for the call. One common use of this involves payment processing. If calls to your primary payment processor fail after a few attempts, the application will then call a secondary payment processor.

Self-Healing Systems and Test Environments

Beyond an application, we encounter the system that contains it and possibly other applications that work together. Here, when we talk about self-healing systems or environments, we should consider generalized healing processes that can be applied regardless of what types of applications make up the core. For example, if an application in an environment is unreachable, then redeploying or restarting the application can react to the down state. Additionally, if latency or other metrics show service is degrading, scaling the number of instances can help. All these corrective measures should be generic enough that they can be automated. They apply to many different application types. Self-healing at an environment level incidentally provides self-managed environments as well. If scripts exist that scale or deploy applications in case of error, they can also automate provisioning environments for specialized and self-service test environments.

Getting Started

You can’t get to fully self-healing applications and environments overnight. And you’ll have to lay some solid groundwork first. Here’s how.


First, you’ll need to make some upfront investment in the following:
  1. Infrastructure as code. Infrastructure as code makes provisioning servers repeatable and automated using tools like Terraform or Chef. This will let you spin up and tear down test environments with ease.
  2. Automated tests. These tests shouldn’t just be tests that run as part of your integration pipeline. You’ll also want long-running automated tests that continually drive traffic to your services in your test environments. These tests will spot regression issues and degradation in performance.
  3. Logging. Next, logging will give your team the ability to determine root cause faster. It will also help identify the aspects of your environment to which you can apply self-healing processes.
  4. Monitoring and alerting. Finally, monitoring will let you see trends over time and alert you to issues that can’t be resolved through self-healing processes.


Once you have the basics in place, take stock of your environments and the pain points your QA team experiences. Then, draw a graph like the one shown below to chart the potential frustration and time commitment of self-healing automation against how easy automation would be. Once you’re done plotting your automation opportunities, start at the top right of the graph to implement the easiest automation process that offers the most benefit. Self Healing Another way to start involves identifying symptoms that require manual intervention as well as the possible automation that would resolve them. Let’s look at a few examples: Symptom: Service is unreachable. Automation: Restart or redeploy to a known good state. Symptom: Increase in errors reported. Automation: Alert appropriate parties; redeploy to a known good version. Symptom: Latency increases under load. Automation: Scale application and report result. However you decide which self-healing automation to add, it will require tweaking and monitoring over time to make sure you’re not masking issues with simple hacks.

Does This Mean We Don’t Need People?

Before we conclude, let’s talk about one misconception of self-healing applications. Often a purported benefit includes completely eliminating manual intervention. But does that mean we don’t need people anymore? Of course not. Because we still have to investigate why the applications or environments need to self-heal. So for every unique self-healing episode, we should look at the root cause. And we should consider what changes can be made to reduce the need for self-healing in the future. What self-healing applications and environments can do is reduce busy work. This, in turn, reduces the burden on support staff who must react immediately to every outage or problem. That frees them up to make the system more reliable as a whole. So, in addition to healing systems, take care to also put in proper monitoring and logging. Then the people involved in root cause analysis will have all the tools to investigate and won’t be bothered by repeating manual steps to get systems back online. All of this combines to make QA and development teams happier and more productive.

Healing Your Test Environments

Hopefully you’ve now gained a better idea of what self-healing can do for your organization. By looking at reactive and preventive manual actions, you can build automated processes that will improve efficiency and time to resolution for many failures. And with proper monitoring tools, you’ll feel confident that your processes work.  
Author Sylvia Froncza This post was written by Sylvia Fronczak. Sylvia is a software developer that has worked in various industries with various software methodologies. She’s currently focused on design practices that the whole team can own, understand, and evolve over time.

Relevant Articles

Test Environments – The Tracks for Agile Release Trains

25MAY, 2022 by Niall Crawford & Justin Reynolds. Modified by Eric Goebelbecker.So, you’ve decided to implement a Scaled Agile Framework (SAFe) and promote a continuous delivery pipeline by implementing “Agile Release Trains” (ART)*.  Definition: An Agile Release...

Test Environments: Why You Need One and How to Set It Up

24MAY, 2022 by Keshav MalikWith the rise of agile development methodologies, the need to quickly test new features is more critical than ever. This is especially true for websites and applications that rely on real-time data and interaction. The only way to ensure...

What is Data Masking?

20MAY, 2022 by Jane TemovMost organizations employ strong security measures to keep production data secure while being made available for day-to-day business activity. However, Data may be utilized for less secure activities like testing and training, or by third...

Data Compliance: A Detailed Guide for IT Leaders

15MAY, 2022 by Ukpai Ugochi & Arnab Roy Chowdhury. Modified by Eric Goebelbecker.As a DevOps manager or agile team leader, how do you ensure that users’ sensitive information is properly secured? Users are on the internet daily for communication, business, etc....


15May, 2022 by Carlos Schults & Justin Reynolds. Modified by Eric Goebelbecker.Organizations today are using more data than ever before. Indeed, data plays a critical role in decision-making for everything from sales and marketing to the production and development...

What is Release Management (an ERM & SAFe Perspective)

15MAY, 2022 by Jane TemovRelease Management, from an enterprise software definition, is the process Release Managers use for planning, executing, and monitoring a software release. It involves coordinating developers, testers, operations staff, and end-users to ensure...