Broadly, my research interests lie in the area of computer and networked systems. I take a holistic approach towards solving real-world problems considering both theoretical and systems perspectives. I am interested in designing solutions rooted in fundamental theory as well as in building systems that employ these solutions and insights to advance the state-of-the-art.
In the recent past, my research has focussed on fault tolerance, scalability, load balancing, reducing latency, and resource efficiency in large-scale distributed data storage and caching systems. We designed coding theory based solutions that we showed are provably optimal. We also built systems and evaluated them on Facebook's data-analytics cluster and on Amazon EC2 showing significant benefits over the state-of-the-art. The solutions that we proposed are now a part of Apache Hadoop 3.0 and are also being considered by several companies such as NetApp and Cisco for use in their storage and analytics products. More details on these projects are below.
Here is a brief description of some of my previous projects. For a more complete list please see my publications list.
Today's large scale distributed storage systems comprise of thousands of nodes, storing hundreds of petabytes of data. In these systems, failures are common, and this makes it essential to store the data in a redundant fashion to ensure reliability and availability. The most common way of adding redundancy is replication. However, replication is highly inefficient in utilizing storage capacity. With the rapid increase in the volume of data needed to be stored, replication is quickly becoming unsustainable, and many distributed storage systems are now turning to erasure coding
which offers a storage-efficient alternative. While classical erasure codes are optimal in terms of storage capacity utilization, they come with many drawbacks when applied to the distributed storage setting. For instance, they result in siginificant increase in the usage of system resources such as network bandwidth and device I/O. Furthermore, in big-data systems, the usage of codes has largely been limited to achieving space-efficient fault tolerance in disk-based storage systems, that is, for storing "cold" (less-frequently accessed) data.
We have constructed new erasure codes (i.e., designing new encoding and decoding algorithms) that provably overcome the limitations of classical erasure codes for application into large-scale distributed storage systems, and designing and building systems that employ these new generation of storage codes in novel ways. We have also explored new avenues for employing erasure coding in big-data systems, in particular for "hot" (frequently accessed) data, beyond the realm of disk-based storage systems and for goals beyond fault tolerance.
In-memory object caching in data-intensive clusters routinely face the challenges of popularity skew, background load imbalance, and server failures, which result in severe load imbalance across servers and degraded I/O performance. Selective replication is a commonly used technique to tackle these challenges, where the number of cached replicas of an object is proportional to its popularity. EC-Cache is a load-balanced, low latency cluster cache that uses online erasure coding to overcome the limitations of selective replication. As compared to selective replication, EC-Cache improves load balancing by more than 3x and reduces the median and tail read latencies by more than 2x for typical parameters. The benefits offered by EC-Cache are further amplified in the presence of background network load imbalance and server failures.
An erasure-coded storage system that reduces both network traffic and device I/O by around 25-45% during recovery with no additional storage, the same fault tolerance, and arbitrary flexibility in the choice of parameters, as compared to Reed-Solomon based systems. We have implemented Hitchhiker on top of Facebook's Hadoop Distributed File System (HDFS) and evaluated various metrics on the data-warehouse cluster in production at Facebook with real-time traffic and workloads. The underlying erasure code employed in Hitchhiker is desgined by making use of our Piggybacking framework (see below). Hitchhiker has beeb incorporated into Apache Hadoop.
Piggybacking Code Design Framework:
Piggybacking is a framework for designing practical distributed storage codes that are efficient in terms of device I/O and network-bandwidth while not compromising on storage efficiency. Piggybacking framework enables construction of best known codes for multiple settings. The basic idea behind this framework is to take multiple instances of existing codes and add carefully designed functions of the data of one instance to the other. You can find more details about Piggybacking in our IEEE ISIT 2013 paper and the upcoming IEEE Transactions on Information Theory paper.
Regenerating Code Constuctions:
'Regenerating Codes' are a class of erasure codes for distributed storage systems which are optimal with respect to both storage space and network-bandwidth utilization. Regenerating codes come in two flavors: (i) Minimum Storage (MSR): which minimize bandwidth usage while storing an optimal amount of data, (ii) Minimum Bandwidth (MBR): which further minimize bandwidth usage by allowing for a slightly higher storage space.
Here is a description of some of our regenerating code constructions.
Product-Matrix codes are a general class of regenerating code constructions spanning both MBR and MSR. These codes allow the number of nodes in the system to scale irrespective of other system parameters. This attribute allows one to vary the number of nodes in the system on the fly, which is very useful, for example, in provisioning resources based on the demand. Further, these codes can be implemented using existing Reed-Solomon encoders and decoders as building blocks.
We have also optimized these codes with respect to other aspects of distributed storage systems such as security, error correction, scaling via ratelessness, etc.
These are MBR codes which perform repair-by-transfer: the data stored in any node can be recovered by mere transfer of data from other nodes, without requiring any computation. This minimizes the disk IO since each node reads only the data that it transfers, and also permits the use of "dumb" nodes. The animated video above presents the working of these codes.
- "Explicit Construction of Optimal Exact Regenerating Codes for Distributed Storage", K. V. Rashmi, N. B. Shah, P. V. Kumar and K. Ramchandran, Proc. Allerton Conf., Sep. 2009.
- "Distributed Storage Codes with Repair-by-Transfer and Non-achievability of Interior Points on the Storage-Bandwidth Tradeoff", N. B. Shah, K. V. Rashmi, P. V. Kumar and K. Ramchandran, IEEE Transactions on Information Theory, March 2012.