Tech/Engineering, Innovation Series

Druva’s Data Anomalies platform: from coupled to de-coupled architecture

Neeraj Thakur, Principal Engineer

Druva’s backup solutions were architected from the beginning to be cloud-native and built on AWS. Druva has built many applications on top of these cloud-native backup solutions that deliver additional value to customers; among these are solutions that help ensure customer cyber resiliency and resistance to ransomware. One of Druva’s new and innovative features to predict and prevent malicious attacks on customer data is Data Anomalies, formerly known as "Unusual Data Activity (UDA) detection."

In a typical attack, a malicious user or software modifies data in a suspicious manner on a device. This modification is considered a data anomaly, and Druva’s SaaS-based platform for protection and management across endpoints and cloud applications leverages anomaly detection to provide reports which help identify suspicious activity, such as: 

  • Large number of files deleted or added
  • Unwarranted modification of files
  • Suspicious encryption of files 

The Data Anomalies feature was initially available primarily to customers of Druva's endpoint and SaaS apps protection because it had an extensive dependency on, and coupling with, the endpoint backup framework. In short, this coupling meant the following:

  • The coupling was in the form of REST APIs. The workflow of finishing a backup, detecting anomalies, and submitting the result back to the Master service ran synchronously, with two-way REST API communication between the Master service and the anomaly detection service.
  • The endpoint backup framework is made up of multiple smaller services that facilitate the execution of backup and recovery functions; these include the syncer service, master configuration server, backup manager service, API service, node service, storage service, and user portal service.

Current coupling and design

inSync UDA: coupled model


Druva's Data Anomalies module builds an Artificial Intelligence (AI) model based on the number of files added, updated, or deleted in a backup. The AI model learns the backup activity happening on the device, and a separate model is built for each endpoint device. Backup metadata is generated in an active backup session and is synchronously pushed to the master module, along with the data encryption key, to be added to the AI model. The backup information is then processed by the AI model within the master module.
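Druva has not published the internals of this model, but as a rough illustration, a minimal per-device sketch (all names and thresholds below are hypothetical) might keep a rolling baseline of file-change counts and flag backups that deviate sharply from it:

```python
from collections import defaultdict, deque
from statistics import mean, stdev

class DeviceAnomalyModel:
    """Illustrative per-device baseline of file-change counts.

    This is a hypothetical sketch; Druva's actual AI model is not public.
    A backup is flagged when its total change count deviates sharply from
    the device's recent history (a simple z-score heuristic).
    """

    def __init__(self, history=30, threshold=3.0):
        self.threshold = threshold                        # z-score cutoff for flagging
        self.counts = defaultdict(lambda: deque(maxlen=history))

    def observe(self, device_id, files_added, files_updated, files_deleted):
        """Score the latest backup against this device's baseline, then learn it."""
        total = files_added + files_updated + files_deleted
        past = self.counts[device_id]
        anomalous = False
        if len(past) >= 5:                                # wait for some history first
            mu, sigma = mean(past), stdev(past)
            if sigma > 0 and (total - mu) / sigma > self.threshold:
                anomalous = True
        past.append(total)
        return anomalous
```

For example, a call such as `model.observe("device-42", files_added=12000, files_updated=3, files_deleted=9500)` would be flagged once the device has established a much quieter baseline.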

If any device is found to have anomalies, the master makes a synchronous REST API call to notify the administrator and record the anomaly as an alert.

Design challenges

The REST API call is made within the active backup session. In this design, the inSync product depended on the Data Anomalies module to process the supplied backup metadata and respond before the session could complete. But given that backup is the core functionality and anomaly detection is an add-on feature, this coupling created challenges.

Developing a common layer: Data Anomalies

As Druva has multiple product lines spanning cloud workloads, end-user protection, and hybrid workloads, engineers wanted to build a common service layer that would host all value-added features and cover every product.

Layered architecture

Druva therefore developed and added a new module on top of its end-user and hybrid workload protection. This was not a simple task, as the flow of information needed to be handled correctly without complicating data management.

Dependency challenges

Data Anomalies previously worked on a push model, in which products pushed backup metadata, along with the data encryption key, via REST API calls. The module pushed anomaly results back to the products via REST API, creating a dependency on the product. In this model there is a cyclic dependency between the inSync product and the module, as seen in the first graphic above.
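To make the cycle concrete, the coupled push model can be sketched roughly as two services that each call the other's REST API; the endpoints and payload fields below are hypothetical, not Druva's actual interfaces:

```python
import requests

# Hypothetical endpoints for illustration only.
ANOMALY_SERVICE = "https://anomaly.internal/api/v1"
PRODUCT_SERVICE = "https://insync.internal/api/v1"

def product_push_backup_metadata(device_id, metadata, encryption_key):
    """Product side: pushes backup metadata (plus the key) to the module."""
    requests.post(
        f"{ANOMALY_SERVICE}/backups",
        json={"device_id": device_id, "metadata": metadata, "key": encryption_key},
        timeout=30,
    ).raise_for_status()

def module_push_anomaly_result(device_id, result):
    """Module side: pushes the result back into the product via REST,
    closing the cycle: each service depends on the other's API."""
    requests.post(
        f"{PRODUCT_SERVICE}/devices/{device_id}/anomaly-result",
        json=result,
        timeout=30,
    ).raise_for_status()
```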

Because the product and the module were directly coupled, upgrading one meant upgrading the other: engineers needed to push new versions of the software simultaneously for both modules.

The deployment of the production environment is automated. The Data Anomalies build pipeline is configured under a Continuous Integration (CI) tool. As soon as the engineering team pushes a branch to the internal GitLab server, the integrated code passes through the automated Lint – Build – Test pipeline, at which point the new build is ready to be qualified by the QA team. After the QA team certifies the artifacts, it hands the certified artifact index to the CloudOps team, which submits the artifact index to the Terraform suite. Finally, this suite updates the production service to the version given in the artifact index.

In addition, there is an issue called ‘module slowness propagation’: because the coupling is synchronous, the Data Anomalies module needs to work at the speed of the product and provision enough compute resources to avoid introducing any slowness or lag into backup execution.

New architecture with Amazon SNS/SQS

Layered architecture: decoupled design


The first task in decoupling was to remove the API calls fired from the Druva product to the Data Anomalies module, turning the push framework into a pull framework. In the new pull framework, the direction of the REST API calls is reversed: Data Anomalies calls into the Druva product to fetch data.

Pull framework design

With the new changes in place, products began maintaining backup metadata alongside the tracking of backup jobs. This solved the issue of dependency direction, leaving the question of which device’s information needed to be pulled for processing, and when. Engineers addressed this problem by adding an asynchronous message-passing layer: Amazon Simple Notification Service (SNS) and Simple Queue Service (SQS).

After adding SNS and SQS to the inSync product and the Data Anomalies module respectively, the product can publish backup-complete notifications as fire-and-forget events, and the module can process them asynchronously. Through this approach, Druva can abstract away which product or module is pulling backup metadata, solving the problem of direct coupling between the product and the module.
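A minimal sketch of the decoupled flow, assuming a hypothetical SNS topic, SQS queue, and product metadata endpoint (and raw message delivery enabled on the SNS-to-SQS subscription), might look like this:

```python
import json
import boto3
import requests

sns = boto3.client("sns")
sqs = boto3.client("sqs")

# Hypothetical names for illustration only.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:backup-complete"
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/data-anomalies"
PRODUCT_API = "https://insync.internal/api/v1"

def publish_backup_complete(device_id, backup_id):
    """Product side: fire-and-forget notification; no waiting on the module."""
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps({"device_id": device_id, "backup_id": backup_id}),
    )

def consume_and_process():
    """Module side: drain the queue, then pull the metadata it needs via REST."""
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        event = json.loads(msg["Body"])
        # Pull model: the module fetches backup metadata from the product.
        meta = requests.get(
            f"{PRODUCT_API}/devices/{event['device_id']}/backups/{event['backup_id']}/metadata",
            timeout=30,
        ).json()
        # ... feed `meta` into the per-device anomaly model here ...
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

Because the queue is the only contract between the two sides, either service can be deployed, scaled, or even replaced without the other side knowing.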

Decoupling solved the dependency issues, and the decoupled Data Anomalies module enabled engineers to independently develop, test, and release new versions of the module in the production environment.

Scaled architecture

After adding SNS/SQS to Data Anomalies, backup-complete notifications are processed quickly, efficiently, and at scale. The architecture functions as follows: the module begins by provisioning a minimum of compute resources (AWS ECS), scales up as the number of SNS events increases at peak times, and scales back down during idle periods to save compute resources.
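Druva has not detailed its scaling policy, but one common pattern for this shape of workload, sketched below with hypothetical names and tuning values, is to size the ECS service to the SQS backlog:

```python
import boto3

sqs = boto3.client("sqs")
ecs = boto3.client("ecs")

# Hypothetical names and tuning values for illustration only.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/data-anomalies"
CLUSTER, SERVICE = "data-anomalies", "anomaly-worker"
MSGS_PER_TASK = 500            # backlog one worker task can absorb
MIN_TASKS, MAX_TASKS = 1, 20

def rescale():
    """Size the ECS service to the current SQS backlog."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    desired = max(MIN_TASKS, min(MAX_TASKS, -(-backlog // MSGS_PER_TASK)))
    ecs.update_service(cluster=CLUSTER, service=SERVICE, desiredCount=desired)
```

In practice this logic would more likely live in an Application Auto Scaling target-tracking policy on a queue-depth metric than in a hand-rolled loop.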

Currently, the Data Anomalies service handles up to 5,000 backup events per minute and can scale automatically, reaching up to 15,000 events per minute under peak load. Druva estimates these load capabilities will continue to increase in 2021.

Key takeaways

How end users interact with the products has not changed; however, by delivering the enhanced capabilities of anomaly detection on this decoupled architecture, Druva has improved scalability, ease of data management, and simplicity of maintenance.

Explore the many ways Druva’s innovative solutions enable a range of next-generation cloud-based applications, including how they simplify the management of data pipelines, in the Tech/Engineering section of the blog archive.