Tech/Engineering

Efficient Concurrent Reads with Read-Through LRU Cache

Aditya Supugade, Sr. Staff Software Engineer

Introduction

The Druva Daintree storage SDK (Photon) provides a cached read API for targeted (point) reads of a file's data within a particular snapshot. This cached read API uses an LRU (Least Recently Used) cache optimized for efficient data retrieval.

As we set out to build cached read APIs, there were a few essential requirements to address:

  1. Low response times

  2. Minimize network requests

To meet these requirements, an LRU cache is used. The cache helps reduce latency, optimize performance for read-intensive tasks, and save costs.

Additionally, the cached read API can be called by any number of threads in parallel.

If you have used an LRU cache in a setting where parallel requests are frequent, you have likely run into the issue of multiple concurrent requests for the same data all missing the cache, leading to every thread making its own network call to retrieve it.

In a typical scenario using an LRU cache, data retrieval proceeds as follows:

[Figure: Cache Workflow]

When a cache get request is initiated and the requested data is present in the cache, it is returned immediately, which keeps response times low. If the data is not present in the cache, a function call is made to fetch it from the server. Let's look at how this is managed efficiently using inflight requests and careful orchestration of data flow.
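A minimal sketch of this read-through path is shown below. It assumes the community lru crate for the cache and a hypothetical fetch_from_server function standing in for the actual network call; neither is part of Photon's API.

    use std::num::NonZeroUsize;
    use lru::LruCache;

    // Placeholder for the actual network call made on a cache miss.
    fn fetch_from_server(key: &str) -> Result<Vec<u8>, String> {
        Ok(format!("data for {key}").into_bytes())
    }

    // Read-through get: serve from the cache on a hit; otherwise fetch from
    // the server, populate the cache, and return the data.
    fn cached_get(cache: &mut LruCache<String, Vec<u8>>, key: &str) -> Result<Vec<u8>, String> {
        if let Some(data) = cache.get(key) {
            return Ok(data.clone()); // cache hit: no network call
        }
        let data = fetch_from_server(key)?; // cache miss: go to the server
        cache.put(key.to_string(), data.clone());
        Ok(data)
    }

    fn main() {
        let mut cache = LruCache::new(NonZeroUsize::new(128).unwrap());
        let first = cached_get(&mut cache, "/snapshot-1/file.bin").unwrap();  // miss
        let second = cached_get(&mut cache, "/snapshot-1/file.bin").unwrap(); // hit
        assert_eq!(first, second);
    }

This naive version works for a single caller; the sections below describe what happens when many threads miss on the same key at once.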

Managing Inflight Requests

The inflight requests dictionary serves as a tracking mechanism. When a cache miss happens, this dictionary is consulted. If another worker is already fetching the requested data, the new request is not immediately routed to the server. Instead, the requesting worker is placed in a waiting state, expecting the data to arrive shortly.

This setup relies on Rust synchronization primitives: once the data is available, the fetching worker communicates it back to all waiting workers simultaneously.
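A minimal sketch of such an inflight map using standard-library primitives is shown below. The type names (Inflight, InflightMap) and the shape of the entries are illustrative, not Photon's actual types: each in-flight key gets a slot for the eventual result plus a Condvar so the fetching worker can wake every waiter at once.

    use std::collections::HashMap;
    use std::sync::{Arc, Condvar, Mutex};

    type Key = String;
    type Value = Vec<u8>;
    type FetchResult = Result<Value, String>;

    // One entry per key currently being fetched: a slot for the result plus a
    // Condvar that the fetching worker uses to wake all waiters at once.
    struct Inflight {
        slot: Mutex<Option<FetchResult>>,
        ready: Condvar,
    }

    // The inflight requests dictionary shared by all workers.
    type InflightMap = Mutex<HashMap<Key, Arc<Inflight>>>;

    // Consulted on a cache miss. Returns the entry for the key plus a flag that
    // is true if this caller just registered it (and so must do the fetch), or
    // false if another worker is already fetching and the caller should wait.
    fn join_or_register(map: &InflightMap, key: &Key) -> (Arc<Inflight>, bool) {
        let mut guard = map.lock().unwrap();
        if let Some(existing) = guard.get(key) {
            return (existing.clone(), false); // someone else is fetching; just wait
        }
        let entry = Arc::new(Inflight {
            slot: Mutex::new(None),
            ready: Condvar::new(),
        });
        guard.insert(key.clone(), entry.clone());
        (entry, true) // the caller became the fetcher
    }

    // What a waiting worker does: block until the fetcher publishes a result.
    fn wait_for(entry: &Inflight) -> FetchResult {
        let mut slot = entry.slot.lock().unwrap();
        while slot.is_none() {
            slot = entry.ready.wait(slot).unwrap();
        }
        (*slot).clone().unwrap()
    }

An async implementation could use a per-key channel instead of a Condvar; the idea is the same.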

Efficient Orchestration

Once an entry is made in the inflight requests dictionary, the fetching process begins. The first worker initiates a call to the server to retrieve the requested data. Meanwhile, any other workers requesting the same data remain blocked, waiting for the first worker to make it available.

Upon successful retrieval of the data, the first worker will (a sketch of this completion path follows the list):

  1. Populate the cache, ensuring that subsequent requests for the same data can be handled from the cache.

  2. Remove the entry from the inflight requests dictionary and publish the data so that all waiting workers receive it.

  3. Return the data to the request that initiated the fetch.
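Continuing the sketch above (same illustrative types, with fetch_from_server again standing in for the server call), the completion path of the fetching worker might look like this:

    // Path taken by the worker that registered the inflight entry.
    fn complete_fetch(
        cache: &Mutex<HashMap<Key, Value>>, // plain map as a stand-in for the LRU cache
        map: &InflightMap,
        entry: &Inflight,
        key: &Key,
    ) -> FetchResult {
        let result = fetch_from_server(key);

        // 1. On success, populate the cache so subsequent requests for the
        //    same data are served locally.
        if let Ok(data) = &result {
            cache.lock().unwrap().insert(key.clone(), data.clone());
        }

        // 2. Remove the inflight entry and publish the result (data or error)
        //    so that every waiting worker wakes up with it at once.
        map.lock().unwrap().remove(key);
        *entry.slot.lock().unwrap() = Some(result.clone());
        entry.ready.notify_all();

        // 3. Return the data to the request that initiated the fetch.
        result
    }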

Handling Failures

In the event of a failure during the data-fetching process, a fail-safe mechanism comes into action: if the function call to the server encounters an error, all waiting workers are promptly notified with the error information rather than being left blocked. This ensures that errors are handled gracefully.
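Putting the pieces of the sketch together, the whole miss path might look like this. Because the fetcher publishes an error through the same slot it would have used for the data, a failed server call unblocks every waiter promptly instead of leaving them hanging or triggering duplicate calls:

    // The miss path as seen by any worker: whichever worker registers the
    // inflight entry becomes the fetcher; everyone else waits on its result.
    fn read_on_miss(
        cache: &Mutex<HashMap<Key, Value>>,
        map: &InflightMap,
        key: &Key,
    ) -> FetchResult {
        let (entry, is_fetcher) = join_or_register(map, key);
        if is_fetcher {
            complete_fetch(cache, map, &entry, key) // fetch, populate, notify waiters
        } else {
            wait_for(&entry) // receives Ok(data) or Err(error), whichever was published
        }
    }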

[Figure: Cached Read Workflow]

File Stats Cache

File stats are required to determine the storage block size for a file. Since many requests come in for a single file, its stats are cached.

File Stats Cache Design

The key for the file stats cache is the file path, and the value is the stats for that file, which include the block size.
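A small sketch of the shape of this cache is shown below; the FileStats type is illustrative, since the real stats structure carries more than the block size.

    use std::num::NonZeroUsize;
    use lru::LruCache;

    // Illustrative value type: the actual file stats carry more fields.
    struct FileStats {
        block_size: u64,
    }

    // File stats cache: key is the file path, value is the stats for that file.
    fn new_file_stats_cache(capacity: usize) -> LruCache<String, FileStats> {
        LruCache::new(NonZeroUsize::new(capacity).expect("capacity must be non-zero"))
    }

    // Look up the block size for a path, if its stats are cached.
    fn cached_block_size(cache: &mut LruCache<String, FileStats>, path: &str) -> Option<u64> {
        cache.get(path).map(|stats| stats.block_size)
    }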

Meta Cache

Metadata can be cached to take advantage of the principle of locality. The request for metadata can be extended (balancing API response time against the cache benefit) to fetch the blocks of metadata that immediately follow. This avoids some API calls and improves API latency.

Meta Cache Design

Fetching metadata from the server is handled in batches.

The meta cache key is a combination of a path and an offset, where the offset is a multiple of the meta batch size. The value is the metadata of the chunks covering one meta batch starting at that offset.

When a read request is initiated for a particular offset and size and its metadata is not present in the cache, the system aligns the storage request offset down to the nearest multiple of the meta batch size and sets the request size to the meta batch size, ensuring efficient and streamlined metadata retrieval from the server.
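A sketch of this alignment is shown below; the batch size constant and the key shape are assumptions for illustration only.

    // Illustrative batch size: metadata covering META_BATCH_SIZE bytes of file
    // data is fetched from the server and cached as a single entry.
    const META_BATCH_SIZE: u64 = 8 * 1024 * 1024;

    // Meta cache key: the file path plus the read offset aligned down to the
    // nearest multiple of the batch size.
    fn meta_cache_key(path: &str, read_offset: u64) -> (String, u64) {
        let aligned_offset = read_offset - (read_offset % META_BATCH_SIZE);
        (path.to_string(), aligned_offset)
    }

    // On a miss, the server request covers the whole batch starting at the
    // aligned offset, so reads at neighboring offsets hit the cache afterwards.
    fn meta_fetch_range(read_offset: u64) -> (u64, u64) {
        let aligned_offset = read_offset - (read_offset % META_BATCH_SIZE);
        (aligned_offset, META_BATCH_SIZE)
    }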

Block Cache

Block caching is needed to handle the discrepancy in block sizes. Linux operates with 4 KB blocks, and IOs will be issued with this block size, while Druva storage operates with a larger block size (for example, 1 MB). It is beneficial to cache these large blocks as they are downloaded, since subsequent IO read requests (for 4 KB blocks) will potentially find a cache hit. Another use case for the block cache is read-ahead. When an application intends to consume multiple blocks serially, it can request read-ahead. These pre-fetched blocks stay in the cache, and requests for them are served quickly because the cache is populated up-front.

Block Cache Design

The block cache key is the block id, which is unique for each block, and the value is the block data.
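A sketch of how a small application read maps onto a larger cached storage block is shown below. The block size, the plain map standing in for the LRU cache, and the way the block id is derived from the offset are all illustrative; in practice the block id comes from the file's metadata.

    use std::collections::HashMap;

    // Illustrative storage block size: Druva storage uses a larger block (for
    // example, 1 MB) than the 4 KB blocks in which the OS issues IO.
    const STORAGE_BLOCK_SIZE: u64 = 1024 * 1024;

    type BlockId = u64;

    // Block cache: key is the block id, value is the block data.
    type BlockCache = HashMap<BlockId, Vec<u8>>;

    // Serve a small (e.g. 4 KB) read out of a cached storage block if present.
    // The read is assumed to fit within a single storage block.
    fn read_from_block_cache(cache: &BlockCache, file_offset: u64, len: usize) -> Option<Vec<u8>> {
        let block_id: BlockId = file_offset / STORAGE_BLOCK_SIZE;
        let within = (file_offset % STORAGE_BLOCK_SIZE) as usize;
        cache
            .get(&block_id)
            .map(|block| block[within..within + len].to_vec())
    }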


Use Cases

The cached read API is used in the following workloads in Druva:

  1. File level restore for Hybrid workloads (AHV, VMware)

  2. Data restore for stream workloads (SAP HANA)

Conclusion

The Druva Daintree storage SDK (Photon) leverages advanced caching techniques, including the Least Recently Used (LRU) cache, to optimize data retrieval processes effectively. By integrating robust mechanisms such as inflight request management and synchronized data handling, the system ensures high performance, reduced latency, and minimized network requests, even in high-concurrency scenarios.

The cached read API's architecture embodies Druva's commitment to innovative solutions that prioritize reliability and performance, making it a critical component for delivering seamless and cost-effective data management.