Distributed storage systems

Introduction: Single Parity
An erasure code is a way to introduce redundancy. The simplest form of erasure coding is the single parity: consider a data object to be stored of size $B$ bits. Split the object into $k$ blocks (each of size $B/k$ bits) and store them on $k$ nodes. No redundancy so far. In addition, store the bit-wise XOR of the blocks (also of size $B/k$ bits) on a $(k+1)$-th node. This can tolerate any single node failure, since the parity can be used to recover any lost data block. Also, if the parity itself is lost, it can easily be recomputed.
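The single-parity scheme above can be sketched in a few lines of Python (the block and node layout is as described; the function names are ours):

```python
def xor_blocks(blocks):
    """Bitwise XOR of equal-length byte blocks."""
    out = bytearray(blocks[0])
    for b in blocks[1:]:
        for i, v in enumerate(b):
            out[i] ^= v
    return bytes(out)

def encode(data, k):
    """Split `data` into k equal blocks and append one XOR parity block."""
    size = len(data) // k
    blocks = [data[i * size:(i + 1) * size] for i in range(k)]
    return blocks + [xor_blocks(blocks)]

def repair(blocks, lost):
    """Any single lost block is the XOR of the k surviving ones."""
    return xor_blocks([b for i, b in enumerate(blocks) if i != lost])

data = b"abcdefgh"                       # an 8-byte "object"
stored = encode(data, k=4)               # 4 data blocks + 1 parity = 5 nodes
assert repair(stored, 2) == stored[2]    # recover a lost data block
assert repair(stored, 4) == stored[4]    # the lost parity is just as easy
```

Note that repairing any one node requires reading all $k$ surviving blocks, a point revisited below.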

In general, when there are $n$ storage nodes and any $k$ of them suffice to recover the data (i.e., the system can tolerate any $n-k$ node failures), this is called the $(n,k)$ property or $k$-out-of-$n$ reliability. The single parity code is such a system with $n=k+1$.

A very simple example of a single parity code is shown here:



It was easy to construct single parity codes by simply XORing blocks. Tolerating more than one failure, however, requires more complicated constructions over larger finite fields. Since any $k$ nodes must suffice to recover the data, each node must store at least $B/k$ bits. This minimal storage is achieved by maximum distance separable (MDS) codes. Reed-Solomon codes form one of the most widely used constructions of MDS codes and exist for any values of $n$ and $k$.
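A minimal sketch of an $(n,k)$ MDS code in the Reed-Solomon style: encode the $k$ data symbols as the coefficients of a degree-$(k-1)$ polynomial and store one evaluation of it per node; any $k$ evaluations determine the polynomial, so any $k$ nodes suffice. For readability this sketch works over the prime field GF(257), whereas production codes typically use binary extension fields such as GF($2^8$):

```python
P = 257  # a prime; a prime field keeps the arithmetic readable

def encode(data_syms, n):
    """Node i stores the degree-(k-1) polynomial with coefficients
    `data_syms` evaluated at the point i+1 (a Vandermonde encoding)."""
    return [sum(d * pow(i + 1, j, P) for j, d in enumerate(data_syms)) % P
            for i in range(n)]

def solve_mod(A, b, p):
    """Solve A x = b mod p by Gaussian elimination (A square, invertible)."""
    n = len(A)
    M = [row[:] + [bv % p] for row, bv in zip(A, b)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col])
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], -1, p)          # modular inverse (Python 3.8+)
        M[col] = [v * inv % p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(v - f * M[col][i]) % p for i, v in enumerate(M[r])]
    return [M[r][n] for r in range(n)]

def decode(shares, k):
    """Recover the k data symbols from any k (node_index, symbol) pairs."""
    idxs, vals = zip(*shares)
    A = [[pow(i + 1, j, P) for j in range(k)] for i in idxs]
    return solve_mod(A, list(vals), P)

data = [104, 105, 33]                     # k = 3 data symbols, each < P
stored = encode(data, n=5)                # a (5,3) code: tolerates 2 failures
survivors = [(0, stored[0]), (2, stored[2]), (4, stored[4])]
assert decode(survivors, k=3) == data     # any 3 nodes suffice
```

The Vandermonde matrix built from distinct evaluation points is always invertible, which is exactly the $(n,k)$ MDS property.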

Repairing a node failure
As discussed, the single parity code can tolerate any single node failure. This means that even after one node fails, a data collector (shown as a laptop) can retrieve the information stored on the surviving nodes and reconstruct the file. This is shown here:



In the picture it is assumed that each block has size 1GB and the arrows show how much information is communicated. This is simply reconstruction of the whole data object. Clearly reconstruction costs 2GB of communication in this example.

The repair problem (also known as rebuild problem) is different from object reconstruction. For repair, a new node is no longer interested in reconstructing all the data, but rather only a single block. This is shown here:



Two different notions of repair are shown. Exact repair corresponds to rebuilding exactly the lost block. Functional repair corresponds to simply reconstructing a new block that combined with the existing ones still forms an (n,k) MDS code.

In linear algebra terms this means that the new block is in general position (maximally linearly independent) with respect to the existing blocks. Exact repair is a special case of functional repair and is strictly harder.

While dealing with repair, apart from the parameters $n,k$, a third parameter $d$ is used to specify the number of nodes a replacement node connects to during the repair process. In the following examples, the parameter $d$ is assumed to be $n-1$, i.e., the node replacing a failed node connects to all the remaining nodes for repair. For the case of a single parity code, repair also requires communication of 2GB: both surviving blocks have to be downloaded to recover the one that was lost.

Two or more parities
In single parity codes the whole data object must be reconstructed to repair one node failure. Contrary to what was widely believed until recently, this is not the case when the erasure code has more parities.

Consider for example the Evenodd code (by Blaum and Bruck), an MDS array code with parameters $(4,2)$ that can tolerate two failures:



One observation is that each node now stores two blocks. This sub-packetization is necessary to create binary codes with more than one parity. Assuming each block has size 1/2GB, so that each node stores 1GB, reconstructing the whole file would of course require 4 blocks, i.e., 2GB of communication. However, repairing a node failure can be done by communicating only 1.5GB. See [9].



Further, it can be shown that 1.5GB is the minimum amount of communication required to repair any $(4,2)$ MDS code that stores a 2GB data object. This is shown through a cut-set bound on the repair communication.
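Concretely, for an $(n,k)$ MDS code storing a $B$-bit object with each node holding $B/k$ bits, the cut-set bound of Dimakis et al. lower-bounds the repair bandwidth when the newcomer contacts $d$ survivors by

$$\gamma_{\min}(d) = \frac{B}{k}\cdot\frac{d}{d-k+1},$$

which for $B=2$GB, $k=2$ and $d=3$ evaluates to $\frac{2}{2}\cdot\frac{3}{3-2+1} = 1.5$GB, matching the Evenodd repair above.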

A second interesting case of repair is shown here:



Observe that in this case the two blocks at the second node are XORed and $c+d$ is communicated.
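This style of repair, where a surviving node XORs its two blocks before transmitting, can be checked directly. The $(4,2)$ array code below is an illustrative layout of this type from the repair literature (not necessarily the exact Evenodd layout, and the block names are ours), but it achieves the same 1.5GB repair bandwidth:

```python
import os

def x(*bs):
    """Bytewise XOR of equal-length byte blocks."""
    out = bytearray(len(bs[0]))
    for b in bs:
        for i, v in enumerate(b):
            out[i] ^= v
    return bytes(out)

# File of four half-blocks a1, a2, b1, b2 (1/2 GB each in the running example).
a1, a2, b1, b2 = (os.urandom(8) for _ in range(4))

# Node layout (any two nodes suffice to recover all four blocks):
#   node 0: a1,     a2
#   node 1: b1,     b2
#   node 2: a1^b1,  a2^b2
#   node 3: a2^b1,  a1^a2^b2
nodes = [(a1, a2), (b1, b2), (x(a1, b1), x(a2, b2)), (x(a2, b1), x(a1, a2, b2))]

# Repair node 0 with only 3 half-blocks (1.5 GB) instead of 4 (2 GB):
s1 = x(*nodes[1])   # node 1 XORs its blocks, sends b1^b2
s2 = x(*nodes[2])   # node 2 sends a1^a2^b1^b2
s3 = x(*nodes[3])   # node 3 sends a1^b1^b2
new_a1 = x(s3, s1)  # (a1^b1^b2) ^ (b1^b2)    = a1
new_a2 = x(s2, s3)  # (a1^a2^b1^b2) ^ (a1^b1^b2) = a2
assert (new_a1, new_a2) == nodes[0]
```

Each survivor sends a single half-block, so the newcomer downloads 3 half-blocks in total, i.e. 1.5GB, versus the 2GB a full reconstruction would cost.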

There are several metrics that can be optimized during repair: the total information read from existing disks, the total information communicated over the network (called the repair bandwidth), or the total number of disks involved in each repair. Currently, the best-understood metric is repair bandwidth, but there is ongoing work on designing erasure codes optimized for other metrics of importance in distributed storage systems. A list of recent code constructions and preprints can be found in the Coding for storage Wiki.