Programmable Data Infrastructure

Automating SRE Toolchains with Delphix and AppDynamics

Read about how our integration solution with AppDynamics allows enterprise teams to reproduce data-related issues, perform root cause analysis, develop and test fixes, and drastically shorten the time to restore services.

Michelle Kim

Feb 24, 2021

Outages are still a terrifying threat to enterprises today. According to Gartner, the average cost of one minute of system downtime is $5,600, which extrapolates to well over $300,000 per hour.
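The hourly figure follows directly from the per-minute estimate. A quick check of the arithmetic, using only the Gartner number quoted above (the function name is just for illustration):

```python
# Per-minute downtime cost quoted above (Gartner estimate), in USD.
COST_PER_MINUTE_USD = 5_600


def downtime_cost(minutes: int, cost_per_minute: int = COST_PER_MINUTE_USD) -> int:
    """Return the estimated cost in USD of an outage lasting `minutes`."""
    return minutes * cost_per_minute


print(downtime_cost(60))  # 336000, i.e. well over $300,000 per hour
```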

Companies spend millions on plans every year without dealing with the underlying issue: an ineffective data infrastructure that exposes the business to downtime and potential data loss. Handling outages swiftly requires the ability to recreate a specific state of the application environment, including the underlying data. This is especially challenging for complex, integrated systems.

Businesses need a faster, more comprehensive root cause analysis (RCA) environment creation process that helps reduce MTTR for mission-critical applications.

That’s why we’ve partnered with AppDynamics, a full-stack, business-centric observability platform that helps organizations quickly address critical problems in production through its unique application insight and business analytics capabilities.

Once AppDynamics detects an issue, it can trigger Delphix to automatically provision the right databases for the affected application from the right point in time. With this new integrated solution, SRE teams can leverage Delphix data provisioning within CI/CD and testing environments to help reproduce issues, perform root cause analysis, develop and test fixes, and drastically shorten the time to restore services.
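As a rough illustration of that hand-off, the sketch below turns an alert into one provisioning request per affected database, each targeting a point in time shortly before the issue was detected. Everything in it is hypothetical: the `Alert` fields, `build_provision_requests`, the request keys, and the five-minute lookback are assumptions made for illustration, not the actual AppDynamics payload or the Delphix API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert shape; a real AppDynamics payload will differ."""
    application: str
    databases: tuple[str, ...]
    detected_at: datetime


def build_provision_requests(
    alert: Alert, lookback: timedelta = timedelta(minutes=5)
) -> list[dict]:
    """Build one request per affected database, targeting a point in
    time just before the issue was detected (assumed lookback)."""
    point_in_time = alert.detected_at - lookback
    return [
        {
            "source": db,
            "target": f"rca-{alert.application}-{db}",
            "timestamp": point_in_time.isoformat(),
        }
        for db in alert.databases
    ]


alert = Alert(
    application="checkout",
    databases=("orders", "inventory"),
    detected_at=datetime(2021, 2, 24, 12, 0, tzinfo=timezone.utc),
)
for request in build_provision_requests(alert):
    print(request)
```

In a real integration, each request would be passed to whatever provisioning interface the team already uses; the sketch stops short of that call on purpose.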

Here are the key benefits enterprise software teams can expect to see using Delphix with AppDynamics.

Automating Full Stack Forensics

When a production downtime event is detected, an operations engineer or site reliability engineer must initiate and complete a manual and lengthy process to provision the right data-ready environments to resolve the issue:

  • Capture complete data sets, both prior to and after an event

  • Copy the right data from the right moment to an RCA environment

  • Create an integrated environment to enable a forensics review

Legacy technologies often make some of these actions impossible in the first place, such as retrieving data from a specific moment in time. With Delphix, teams can automate all of the data elements of this process. The architecture diagram below shows how AppDynamics and Delphix work together to get to the root cause of an issue.

[Figure: AppDynamics and Delphix integration architecture]

AppDynamics can register an incident and open a ticket in the ticketing platform of choice. The SRE team can then review the tickets and decide when to generate one or more RCA environments.

Delphix then uses the AppDynamics application topology information and orchestration to simplify the creation of RCA environments, and in many cases it can reuse workflows and tools that are already in place. Certain events can be configured to create RCA environments automatically (without the need for approval), while others might require human control. Delphix optimizes the storage and provisioning of data and turns a time- and resource-consuming task into a fast, parallelized, and automated process.
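One way to picture the split between automatic and approval-gated provisioning is a small policy table keyed by event severity. The severity names, the `Action` values, and the fall-back-to-approval rule below are assumptions made for illustration; the integration's real configuration format is not described here.

```python
from enum import Enum


class Action(Enum):
    AUTO_PROVISION = "auto_provision"      # create the RCA environment immediately
    REQUIRE_APPROVAL = "require_approval"  # wait for an SRE to approve first


# Hypothetical policy: which event severities may skip human approval.
POLICY = {
    "critical": Action.AUTO_PROVISION,
    "warning": Action.REQUIRE_APPROVAL,
}


def decide(severity: str) -> Action:
    """Look up the action for a severity (case-insensitive).

    Unknown severities fall back to human approval, the safer default.
    """
    return POLICY.get(severity.lower(), Action.REQUIRE_APPROVAL)


print(decide("critical").value)
print(decide("info").value)
```

Defaulting unknown events to approval keeps unexpected alert types from provisioning environments unattended.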

Troubleshooting and Resolving Application Performance Issues

It’s never easy for teams to pinpoint the exact cause of a performance issue. SREs end up manually creating production clones to execute specific types of testing.

However, our integration solution with AppDynamics enables SRE teams to:

  1. Provision production data environments on-demand to isolate different types of testing

  2. Automatically scale data resources to read-only environments to maintain a level of service for users while potential issues are being investigated
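To make the second point concrete, here is a minimal sizing sketch for read-only environments: given the current read load and an assumed per-replica capacity, it computes how many replicas would keep each one under its limit. The function name, the parameters, and the capacity model are hypothetical and are not taken from either product.

```python
import math


def replicas_needed(read_qps: float, qps_per_replica: float, minimum: int = 1) -> int:
    """Return how many read-only replicas keep each one at or below
    its assumed capacity, never dropping below `minimum`."""
    if qps_per_replica <= 0:
        raise ValueError("qps_per_replica must be positive")
    return max(minimum, math.ceil(read_qps / qps_per_replica))


# 2,500 reads/s at an assumed 1,000 reads/s per replica needs 3 replicas.
print(replicas_needed(2500, 1000))  # 3
```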

Accelerate CI/CD Speed and Precision

Many companies already use tools like AppDynamics in their continuous integration and continuous delivery (CI/CD) toolchains to measure the before and after state when a new release is deployed. This helps operations and SRE teams understand the overall performance implication of new features.

But when a new feature is flagged as a potential source of service degradation, teams often spend an inordinate amount of time resolving the issue. Development teams spend days, weeks, and sometimes even months waiting for DBAs and compliance personnel to prepare fresh, secure data for testing.

While technology failure and system downtime are inevitable, companies need a better way to track back to the cause of issues and shorten the time to resolve them. Downtime is costly and leads to lost revenue, missed customer acquisition opportunities, and stalled production lines. Data errors, data loss, and data corruption can also generate the same negative outcomes.

With Delphix and AppDynamics, enterprise teams can:

  • Use production-like, masked data and provision it instantly to developers for investigation

  • Ensure the testing environment is accurate and ready by the time the developer is ready to test the fix code

  • Deliver all the data sources in the CI/CD process, even for complex integrated applications
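For readers unfamiliar with masking, the sketch below shows the general idea behind "production-like, masked data": sensitive values are replaced deterministically, so the same input always maps to the same masked value and relationships between records survive. This is a generic illustration using a salted SHA-256 hash, not Delphix's masking algorithm; the function name and the salt are assumptions.

```python
import hashlib


def mask_email(email: str, salt: str = "example-salt") -> str:
    """Replace the local part of an email with a salted hash.

    The mapping is deterministic, so joins on the masked column still
    line up, and the domain is kept so the data stays production-like.
    """
    local, _, domain = email.partition("@")
    digest = hashlib.sha256((salt + local).encode("utf-8")).hexdigest()[:12]
    return f"user_{digest}@{domain}"


masked = mask_email("alice@example.com")
print(masked)
print(masked == mask_email("alice@example.com"))  # True: same input, same output
```

Keeping the domain is a design choice for realism; a stricter policy would mask it as well.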

Final Thoughts

A 12-hour store outage cost Apple $25 million, Delta Airlines lost an estimated $150 million during a five-hour power outage that caused 2,000 cancelled flights, and Facebook lost an estimated $90 million in a 14-hour outage. These are industry leaders that can weather a one-day financial storm, but most enterprises do not have the capabilities to manage and recover from the consequences of downtime.

With Delphix’s open programmable data infrastructure, IT teams can deliver data at every point of the application lifecycle, from development to testing, analytics, and production reliability engineering. Our integration solution with AppDynamics allows enterprises to reproduce issues, perform root cause analysis, develop and test fixes, and drastically shorten the time to restore services, driving application downtime to virtually zero.

Watch this demo here to learn more.