Polar codes

Motivation
Let $W$ be a binary-input memoryless output-symmetric (BMS) channel. Let $I(W)$ denote the capacity of this channel. Since the channel has binary input we have $0 \leq I(W) \leq 1$. Our aim is to construct a low-complexity and reliable coding scheme of rate $R$ as close as possible to $I(W)$.

There are two special cases where this task is simple, namely when $I(W)=0$ and when $I(W)=1$. When $I(W)=0$ the channel is completely useless: no information can be conveyed reliably over such a channel regardless of how sophisticated the coding scheme is, so there is no point in even trying. If, on the other hand, $I(W)=1$, then the channel is completely noiseless, i.e., if $X$ denotes the input to the channel, $X \in \{0, 1\}$, and $Y$ denotes the output of the channel, then we must have $Y=X$. In this case too, no coding is required; we can use the channel "as is." Let us think of these two scenarios as the extreme cases. The idea of polar coding is to start with a general channel and to "transform" it into these two extreme cases.

More precisely, we start with a set of, let us say, $2^n$ channels and, via a simple transform, we create a new set of $2^n$ channels, which are "polarized." That is, in the new set some of the channels are essentially noiseless, whereas the remaining channels are extremely noisy. The surprising fact of polar codes is that this can be done in such a way that the sum capacity is preserved. If we can manage to sufficiently polarize all our channels, then the final step is easy: use the nearly noiseless channels "as is," transmitting one uncoded information bit per channel use, and simply ignore the very noisy channels since they have essentially zero capacity anyway. Such a scheme achieves the capacity.

Channel combining and channel splitting
Let us now describe the two basic ingredients of polar codes. These two steps are called channel combining and channel splitting.

We start with channel combining. Consider two instances of the channel $W$. Let us use these two channel instances as shown below.

More precisely, let $(U_1, U_2)$ be a pair of iid Bernoulli$(\frac12)$ random variables. We think of $(U_1, U_2)$ as the information which we want to transmit over the channels. We first transform these according to $$ (X_1, X_2) = (U_1, U_2) \underbrace{\left( \begin{array}{cc} 1 & 0 \\ 1 & 1 \end{array} \right)}_{G_2}, $$ where all operations are over GF(2). Note that under this transform $(X_1, X_2)$ is again a pair of iid Bernoulli$(\frac12)$ random variables. We then let $(X_1, X_2)$ be the inputs to two independent copies of $W$. At the receiver we observe the outputs $(Y_1, Y_2)$.
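To make the combining step concrete, here is a minimal Python sketch (ours, not part of the original description); the names polar_combine and transmit_bec are illustrative choices, and the BEC is used only for concreteness.

\begin{verbatim}
import random

def polar_combine(u1, u2):
    # (x1, x2) = (u1, u2) G_2 over GF(2): x1 = u1 + u2 mod 2, x2 = u2
    return (u1 ^ u2, u2)

def transmit_bec(x, eps):
    # send one bit over a BEC(eps); None denotes an erasure
    return None if random.random() < eps else x

u1, u2 = random.randint(0, 1), random.randint(0, 1)
x1, x2 = polar_combine(u1, u2)
y1, y2 = transmit_bec(x1, 0.3), transmit_bec(x2, 0.3)
print((u1, u2), "->", (x1, x2), "->", (y1, y2))
\end{verbatim}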

By the data processing inequality we have \begin{equation*} I(U_1, U_2; Y_1, Y_2) \leq I(X_1, X_2; Y_1, Y_2), \end{equation*} but since the map $G_2$ is invertible it is clear that no information is lost. I.e., we have equality, \begin{equation} I(U_1, U_2; Y_1, Y_2) = I(X_1, X_2; Y_1, Y_2) = 2 I(W). \label{equ:sumcapacityispreserved} \end{equation} The last equality on the right follows since we assume the channel $W$ to be memoryless.

The second basic operation is channel splitting. Recall that we have \begin{align} X_1 & =U_1+U_2, \label{equ:x1} \\ X_2 & =U_2. \label{equ:x2} \end{align} Let us rewrite \eqref{equ:x1} as $U_1=X_1+U_2$, which using the second equality can be written as $U_1=X_1+X_2$. Now recall that $X_1$ and $X_2$ are each sent over an independent realization of the channel $W$ and that the corresponding observations are $Y_1$ and $Y_2$.

Let us define the channel $W^-$. It is the channel with input $U_1$ and output $(Y_1, Y_2)$. This channel has capacity $I^-:=I(W^-)=I(U_1; Y_1, Y_2)$. Assume that we use this channel with a reliable code. We can then assume that $U_1$ is known with high probability at the receiver.

Next, let us try to decode $U_2$. From \eqref{equ:x2} we have $U_2=X_2$. Rewriting \eqref{equ:x1}, we also have $U_2=X_1+U_1$, and $U_1$ is known at the receiver. The channel $W^+$ is defined as the channel with input $U_2$ and with output $(Y_1, Y_2)$, assuming that $U_1$ is available at the receiver. It has capacity $I^+:=I(W^+)=I(U_2; Y_1, Y_2 | U_1)$. It is easy to see that this is equal to $I(U_2; Y_1, Y_2, U_1)$ since $$ I(U_2; Y_1, Y_2, U_1) = I(U_2; U_1) + I(U_2; Y_1, Y_2 | U_1), $$ but $U_1$ and $U_2$ are independent so that $I(U_2; U_1)=0$. The channel $W^+$ is therefore equivalent to the channel with input $U_2$ and output $(Y_1, Y_2, U_1)$.

Now note that $$ I(W^-)+I(W^+) = I(U_1; Y_1, Y_2) + I(U_2; Y_1, Y_2 \mid U_1) = I(U_1, U_2; Y_1, Y_2)=2 I(W), $$ where the second equality is the chain rule and the last step follows from \eqref{equ:sumcapacityispreserved}.

Let us summarize. We have started with two independent realizations of the channel $W$. We have converted this into two channels, namely $W^-$ and $W^+$. We have just seen that the sum capacity is preserved, i.e., that $I(W^-)+I(W^+)=2 I(W)$. This is the basic transform underlying polar codes. We will now see that one channel is "better" and that one channel is "worse" than the original channel. Applying this transform repeatedly, we will be able to polarize all channels to either essentially perfect or essentially useless channels.

Basic polarization effect
We claim that \begin{align} I(W^-) \leq I(W) \leq I(W^+). \label{equ:polarization} \end{align} In words, the channel $W^-$ is worse than $W$, whereas $W^+$ is better than $W$.

In fact, inequality \eqref{equ:polarization} is strict unless $I(W)=0$ or $I(W)=1$. Before we formally prove this claim, note that it is not very surprising. Clearly, $W^+$ is better than $W$, since we can think of $W^+$ as providing two conditionally independent observations of its input. But since the sum capacity is preserved, this implies that $W^-$ must be worse than $W$.

Proof of basic polarization effect for the BEC
Consider the BEC with parameter $\epsilon$ and recall the relationship $U_1=X_1+X_2$. If $Y_1=?$ or $Y_2=?$ (or both), then $U_1$ remains completely undetermined, i.e., it is effectively erased; if both $Y_1$ and $Y_2$ are received unerased, then $U_1$ is known perfectly. We conclude that the channel $W^-$ is a BEC with parameter $1-(1-\epsilon)^2 = \epsilon (2-\epsilon) \geq \epsilon$. So, indeed, $I(W^-)=(1-\epsilon)^2 \leq 1-\epsilon = I(W)$. For $W^+$, recall that $U_2=X_2$ and, since $U_1$ is available at the receiver, also $U_2=X_1+U_1$. Hence $U_2$ is erased only if both $Y_1$ and $Y_2$ are erased, so $W^+$ is a BEC with parameter $\epsilon^2 \leq \epsilon$ and $I(W^+)=1-\epsilon^2 \geq 1-\epsilon = I(W)$.
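As a quick numerical sanity check (our example), take $\epsilon=\tfrac12$, so that $I(W)=\tfrac12$. Then
\begin{align*}
I(W^-) = (1-\tfrac12)^2 = \tfrac14, \qquad
I(W^+) = 1-(\tfrac12)^2 = \tfrac34, \qquad
I(W^-)+I(W^+) = 1 = 2\,I(W),
\end{align*}
so the sum capacity is preserved while the two new channels are strictly worse and strictly better than $W$.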

Tree channel and the polarization process
For $N=2^n$, the construction of the channels can be visualized in the following way. Consider an infinite binary tree. To each vertex of the tree we assign a channel in a way that the collection of all the channels that correspond to the vertices at depth $n$ equals $\{W_{2^n} ^{(i)}\}_{1 \leq i \leq 2^n}$. We do this by a recursive procedure. Assign to the root node the channel $W$ itself. To the left offspring of the root node assign $W^-$ and to the right one assign $W^+$. In general, if $Q$ is the channel that is assigned to vertex $v$, to the left offspring of $v$ assign  $Q ^-$ and to the right one assign $Q ^+$.

In this setting, the channel assigned to a vertex at level $n$ is obtained by starting from the original channel $W$ and applying a sequence of $+$ and $-$ operations to it. More precisely, label the vertices at level $n$ from left to right by $1$ to $2^n$. The channel which is assigned to the $i$-th vertex is $W_{2^n} ^{(i)}$. Let the binary representation of $i-1$ be $b_1 b_2 \cdots b_n$, where $b_1$ is the most significant bit. By the mapping $0 \to -$ and $1 \to +$, the binary sequence $b_1 b_2 \cdots b_n$ is converted to a sequence of $+$ and $-$, denoted by $c_1 c_2\cdots c_n$. Then we have \begin{equation*} W_{2^n} ^{(i)}=(((W^{c_1})^{c_2})^{\cdots})^{c_n}. \end{equation*} E.g., for $i=7$ we have $W_{8} ^{(7)} =((W^{+})^{+})^{-}$.
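This index-to-sign mapping is easy to implement; the following small Python sketch (the name index_to_signs is our choice) reproduces the example above.

\begin{verbatim}
def index_to_signs(i, n):
    # i-th vertex (1-based) at level n: take the n-bit binary expansion of i-1,
    # most significant bit first, and map 0 -> '-' and 1 -> '+'
    bits = format(i - 1, "0{}b".format(n))
    return "".join("+" if b == "1" else "-" for b in bits)

print(index_to_signs(7, 3))   # '++-', i.e., W_8^(7) = ((W^+)^+)^-
\end{verbatim}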

Let $B_1,B_2,\dotsc$ be a $\{-,+\}$-valued i.i.d.\ process with $\Pr[B_1=-]=\Pr[B_1=+]=1/2$. Also define the polarization process of the channel $W$ as $W_0=W$ and for $n \in \NN$, \begin{equation} W_{n}= \left\{ \begin{array}{lr} W_{n-1} ^{+} & \quad \text{if } B_n=+\\ W_{n-1} ^{-} & \quad \text{if } B_n=-. \end{array} \right. \end{equation} In words, the process starts from the root node of the infinite binary tree and in each step moves either to the left or the right offspring of the current node with probability $\frac 12$. So at time $n$, the process $W_n$ outputs one of the $2^n$ channels at level $n$ of the tree uniformly at random.

Since the process $W_n$ takes its values in the space of BMS channels, it may be difficult to work with directly. However, one can consider various functionals applied to $W_n$ and hence obtain real-valued processes which are considerably simpler to analyze. Important examples of such functionals include the capacity functional and the Bhattacharyya functional. For instance, the process $I_n=I(W_n)$ is a real-valued process that outputs the capacity of the channel obtained from the process $W_n$.

A simple proof for BEC
So far we have seen that if the channel $W$ is a BEC with erasure probability $\epsilon$, then the channels $W^-$ and $W^+$ are also BECs, with erasure probabilities $1-(1-\epsilon)^2$ and $\epsilon^2$, respectively. As a result, the process $I_n=I(W_n)$ has the following recursive description: assuming the underlying channel is BEC($\epsilon$), the process starts with $I_0=1-\epsilon$, and for $n \in \NN$, \begin{equation} \label{BEC} I_{n}= \left\{ \begin{array}{lr} (I_{n-1})^2 & ; \text{with probability $\frac 12$},\\ 1-(1-I_{n-1})^2 & ; \text{with probability $\frac 12$}. \end{array} \right. \end{equation} We now aim to show that as $n$ grows, the random variables $I_n$ converge almost surely to a random variable $I_\infty$ with the following simple distribution: \begin{equation} I_{\infty}= \left\{ \begin{array}{lr} 1 &  ; \text{with probability $1-\epsilon$},\\ 0 & ; \text{with probability $\epsilon$}. \end{array} \right. \end{equation} The proof proceeds in two steps. First, note that \begin{align*} \mathbb{E}[I_n] & = \frac 12 (\mathbb{E}[(I_{n-1})^2]+ \mathbb{E}[1-(1-I_{n-1})^2]) \\ &= \frac 12 \mathbb{E}[(I_{n-1})^2+ 1-(1-I_{n-1})^2] \\ & = \frac 12 \mathbb{E}[2I_{n-1}] \\ & = \mathbb{E}[I_{n-1}]. \end{align*} As a result, we have \begin{equation} \label{E} \mathbb{E}[I_n]=\mathbb{E}[I_{n-1}]=\cdots=\mathbb{E}[I_0]=\mathbb{E}[I_\infty]=1-\epsilon. \end{equation} Second, define the random variable $Q_n=\sqrt{I_n(1-I_n)}$. Using \eqref{BEC} we have \begin{equation*} Q_{n}= Q_{n-1} \cdot \left\{ \begin{array}{lr} \sqrt{I_{n-1}(1+I_{n-1})} & ; \text{with probability $\frac 12$},\\ \sqrt{(2-I_{n-1})(1-I_{n-1})} & ; \text{with probability $\frac 12$}. \end{array} \right. \end{equation*} As a result, \begin{align*} \mathbb{E}[Q_{n} \mid Q_{n-1}] & \leq \frac {Q_{n-1}}{2} \max_{z\in [0,1]} \Bigl\{ \sqrt{(2-z)(1-z)}+ \sqrt{z(1+z)} \Bigr\} \\ & \leq Q_{n-1} \frac{\sqrt{3}}{2}, \end{align*} where the maximum is attained at $z=\frac12$ and equals $\sqrt{3}$. Thus, noting that $\mathbb{E}[Q_0] \leq 1$, we get \begin{equation} \nonumber \mathbb{E}[{Q_n}] \leq \bigl (\tfrac {\sqrt{3}}{2} \bigr )^n. \end{equation} Let $\rho_n=\bigl (\frac {\sqrt{3}}{2} \bigr )^n$. By the Markov inequality it is easy to see that \begin{equation} \label{bound_E} \mathbb{P}(I_n \in [\rho_n, 1-\rho_n]) \leq \frac{ \mathbb{E}[Q_n]}{\sqrt{\rho_n(1-\rho_n)}} \leq 2\sqrt{\rho_n}. \end{equation} Now, since $\mathbb{E}[I_n]=1-\epsilon$, we obtain \begin{align*} \mathbb{P}(I_n \geq 1- \rho_n) \geq 1-\epsilon-3\sqrt{\rho_n}, \end{align*} and consequently, with the help of \eqref{bound_E}, we get \begin{align*} \mathbb{E}[\vert I_n - I_\infty \vert ] \leq 6 \sqrt{\rho_n} \to 0. \end{align*}
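Both ingredients of this proof are easy to observe numerically. The following Python sketch (ours; the parameter choices are arbitrary) simulates the recursion \eqref{BEC} and estimates the mean of $I_n$ and the fraction of paths that have essentially polarized.

\begin{verbatim}
import random

def simulate_In(eps, n, trials=10000):
    # simulate the process I_n for W = BEC(eps) and report
    # (empirical mean of I_n, fraction near 0, fraction near 1)
    mean = frac0 = frac1 = 0.0
    for _ in range(trials):
        I = 1.0 - eps                     # I_0 = I(W)
        for _ in range(n):
            if random.random() < 0.5:
                I = I * I                 # minus branch
            else:
                I = 1.0 - (1.0 - I) ** 2  # plus branch
        mean += I / trials
        frac0 += (I < 0.01) / trials
        frac1 += (I > 0.99) / trials
    return mean, frac0, frac1

print(simulate_In(eps=0.5, n=20))
# mean stays close to 1 - eps = 0.5; almost all paths end near 0 or near 1
\end{verbatim}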

Density evolution and quantization
Under successive cancellation decoding, there is a BMS channel associated to each bit $U_i$, given the observation vector $Y_0^{N-1}$ as well as the values of the previous bits $U_0^{i-1}$. This channel has a fairly simple description in terms of the underlying BMS channel $W$.\footnote{We note that in order to arrive at this description we crucially use the fact that $W$ is symmetric. This allows us to assume that $U_0^{i-1}$ is the all-zero vector.}

Tree channels of height $n$: Consider the following $N=2^n$ tree channels of height $n$. Let $\sigma_{1}\dots \sigma_n$ be the $n$-bit binary expansion of $i$. E.g., we have for $n=3$, $0=000$, $1=001$, \dots, $7=111$. Let $\sigma = \sigma_1\sigma_2\dots\sigma_{n}$. Note that for our purpose it is slightly more convenient to denote the least (most) significant bit as $\sigma_n$ ($\sigma_1$). Each tree channel consists of $n+1$ levels, namely $0,\dots,n$. It is a complete binary tree. The root is at level $n$. At level $j$ we have $2^{n-j}$ nodes. For $1 \leq j \leq n$, if $\sigma_{j} = 0$ then all nodes on level $j$ are check nodes; if $\sigma_{j} = 1$ then all nodes on level $j$ are variable nodes. All nodes at level $0$ correspond to independent observations of the output of the channel $W$, assuming that the input is $0$.

An example for $W^{011}$ (that is $n=3$ and $\sigma=011$) is shown in Figure~\ref{fig:tree}. \begin{figure}[!h] \begin{center} \input{ps/treenew1.tex} \end{center} \caption{ Tree representation of the channel $W^{011}$. The $3$-bit binary expansion of $3$ is $\sigma_1\sigma_2\sigma_3 = 011$.}\label{fig:tree} \end{figure}

Let us call $\sigma=\sigma_{1}\dots\sigma_n$ the {\em type} of the tree. We have $\sigma \in \{0, 1\}^n$. Let $W^{\sigma}$ be the channel associated to the tree of type $\sigma$.

Consider the channels $W_N^{(i)}$. The channel $W_N^{(i)}$ has input $U_i$ and output $(Y_0^{N-1}, U_0^{i-1})$. Without proof we note that $W_N^{(i)}$ is equivalent to the channel $W^{\sigma}$ introduced above if we let $\sigma$ be the $n$-bit binary expansion of $i$.

Given the description of $W^{\sigma}$ in terms of a tree channel, it is clear that we can use density evolution to compute the channel law of $W^{\sigma}$. When using density evolution it is convenient to represent the channel in the log-likelihood domain. The BMS channel $W$ is represented as a probability distribution over $\mathbb{R}\cup\{\pm\infty\}$, namely the distribution of the log-likelihood ratio $\log(\frac{W(Y\mid 0)}{W(Y\mid 1)})$, where $Y\sim W(y\mid 0)$.

Density evolution starts at the leaf nodes which are the channel observations and proceeds up the tree. We have two types of convolutions, namely the variable convolution (denoted by $\circledast$) and the check convolution (denoted by $\boxast$). All the densities corresponding to nodes which are at the same level are identical. Each node in the $j$-th level is connected to two nodes in the $(j-1)$-th level. Hence the convolution (depending on $\sigma_j$) of two identical densities in the $(j-1)$-th level yields the density in the $j$-th level. If $\sigma_j=0$, then we use a check convolution ($\boxast$), and if $\sigma_j=1$, then we use a variable convolution ($\circledast$).

As an example, consider the channel shown in Figure~\ref{fig:tree}. By some abuse of notation, let $W$ also denote the initial density corresponding to the channel $W$. Recall that $\sigma=011$. Then the density corresponding to $W^{011}$ (the root node) is given by \begin{align*} \Bigl((W^{\boxast 2})^{\circledast 2}\Bigr)^{\circledast 2} = (W^{\boxast 2})^{\circledast 4}. \end{align*}
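For the BEC every intermediate density is again a BEC, so density evolution collapses to tracking a single erasure probability per level. A minimal Python sketch (the function name bec_tree_channel is our choice):

\begin{verbatim}
def bec_tree_channel(eps, sigma):
    # erasure probability of W^sigma when W = BEC(eps);
    # sigma is a string over {'0','1'}: '0' = check convolution, '1' = variable
    e = eps
    for s in sigma:                  # process levels 1, ..., n
        if s == "0":
            e = 1 - (1 - e) ** 2     # check convolution of two BEC(e)
        else:
            e = e * e                # variable convolution of two BEC(e)
    return e

print(bec_tree_channel(0.5, "011"))  # (1 - 0.25)**4 = 0.31640625
\end{verbatim}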

Assuming that infinite-precision density evolution has unit cost, it can be shown that the total cost of computing all channel laws is linear in $N$. However, a crucial point to note is that, since for almost every channel at level $n$ the number of possible outputs is of order $2^N$, infinite-precision density evolution has complexity that is exponential in the block-length $N$. One way to overcome this complexity issue is to quantize the channels, i.e., to approximate the channels at each level of the tree by channels whose number of possible outputs is at most a fixed number $K$. In this way, the construction of the quantized channels can be done with complexity linear in $N$.
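To illustrate the idea (this is a rough sketch of ours, not the exact quantization rule of any particular construction algorithm), one can store a BMS channel as a list of (LLR, probability) pairs conditioned on the input being $0$; the two convolutions square the alphabet size, and a simple merge of the closest adjacent LLR values keeps at most $K$ symbols.

\begin{verbatim}
import math

def var_conv(d1, d2):   # variable-node convolution: LLRs add
    return [(l1 + l2, p1 * p2) for (l1, p1) in d1 for (l2, p2) in d2]

def chk_conv(d1, d2):   # check-node convolution: "tanh rule" for LLRs
    return [(2 * math.atanh(math.tanh(l1 / 2) * math.tanh(l2 / 2)), p1 * p2)
            for (l1, p1) in d1 for (l2, p2) in d2]

def merge_to(d, K):
    # crude quantization: repeatedly fuse the two closest (in LLR) adjacent
    # symbols, replacing them by their probability-weighted average
    d = sorted(d)
    while len(d) > K:
        i = min(range(len(d) - 1), key=lambda j: d[j + 1][0] - d[j][0])
        (l1, p1), (l2, p2) = d[i], d[i + 1]
        d[i:i + 2] = [((p1 * l1 + p2 * l2) / (p1 + p2), p1 + p2)]
    return d

def tree_density(W, sigma, K=64):
    d = list(W)
    for s in sigma:                  # levels 1, ..., n
        d = chk_conv(d, d) if s == "0" else var_conv(d, d)
        d = merge_to(d, K)
    return d

p = 0.11                             # start from a BSC(p), which has two outputs
W = [(math.log((1 - p) / p), 1 - p), (math.log(p / (1 - p)), p)]
root = tree_density(W, "011")
print(len(root), sum(pr for _, pr in root))  # at most K symbols; probs sum to 1
\end{verbatim}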

Efficient construction

 * density evolution and quantization

Performance

 * complexity

Error probability behavior
In order to analyze the error probability behavior of polar codes, we resort to another reliability parameter, $Z(W)$, defined as \begin{equation} Z(W)=\sum_y \sqrt{W(y\mid 0)W(y\mid 1)}, \end{equation} also known as the Bhattacharyya parameter. It is known that the error probability of (uncoded) MAP decision over a BMS channel is upper bounded by its Bhattacharyya parameter. It can also be shown that the block error probability of polar codes under successive cancellation decoding is upper bounded as \begin{equation} P_e\le \sum_{i\in\mathcal{A}} Z(W_N^{(i)}), \end{equation} where $\mathcal{A}$ is the set of information bits. This simple inequality leads to useful upper bounds on the block error probability. To obtain such bounds, it suffices to estimate the values of the Bhattacharyya parameters appearing on the right-hand side of the above inequality. This computation can be facilitated by defining a random process $Z(W_0),Z(W_1),\dotsc$ of Bhattacharyya parameters, where the realizations of $Z(W_n)$ correspond to the Bhattacharyya parameters of the bit channels at level $n$.

The relation between $Z(W)$, $Z(W^-)$, and $Z(W^+)$ is given by \begin{align} Z(W)\le Z(W^-)&\le 2Z(W)-Z(W)^2,\\ Z(W^+)&=Z(W)^2, \end{align} which immediately implies \begin{align} \label{eqn:z-minus} Z_n\le Z_{n+1}&\le 2Z_n-Z_n^2\qquad \text{if } B_{n+1}=-,\\ \label{eqn:z-plus} Z_{n+1}&=Z_n^2\qquad \text{if } B_{n+1}=+. \end{align} (Here $Z_n$ is shorthand for $Z(W_n)$.)
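One simple construction heuristic based on these relations (a sketch under the assumption that propagating the upper bound $2Z-Z^2$ on the minus branch is acceptable; it is exact for the BEC) first computes bounds on $Z(W_N^{(i)})$ for all $i$, then chooses the $k$ most reliable bit-channels as the information set $\mathcal{A}$ and sums their bounds to bound $P_e$.

\begin{verbatim}
def bhattacharyya_bounds(Z0, n):
    Z = [Z0]
    for _ in range(n):
        # each channel splits into a minus child (bounded) and a plus child (exact)
        Z = [z for zi in Z for z in (2 * zi - zi * zi, zi * zi)]
    return Z   # Z[i] bounds Z(W_N^(i)), 0-based i as in the tree-channel section

def construct(Z0, n, k):
    Z = bhattacharyya_bounds(Z0, n)
    A = sorted(range(len(Z)), key=lambda i: Z[i])[:k]   # k most reliable bit-channels
    return A, sum(Z[i] for i in A)                      # information set, bound on P_e

A, pe_bound = construct(Z0=0.5, n=10, k=256)            # e.g. a rate-1/4 code, N = 1024
print(pe_bound)
\end{verbatim}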

The following characterization of the asymptotic behavior of $Z_n$ yields sufficiently tight bounds on the error probability: \begin{align} \lim_{n\to\infty}\Pr[Z_n\le 2^{-N^\beta}]=I(W) \end{align} for all $\beta<1/2$. This asymptotic behavior can be explained roughly by considering the behavior of $-\log Z_n$. In particular, in going from time $n$ to time $n+1$, $-\log Z_{n}$ is either doubled (since $Z_n$ is squared) or decreased by at most $1$ (due to \eqref{eqn:z-minus}). Also observe that once $-\log Z_n$ becomes sufficiently large, subtracting $1$ from it has negligible effect compared with the doubling operation. Therefore, when $-\log Z_n\to\infty$ (which happens with probability $I(W)$), it does so roughly by doubling half of the time (by the law of large numbers) and remaining roughly the same otherwise. Thus, $-\log Z_n\approx -2^{n/2}\log Z_0$, i.e., $Z_n\approx Z_0^{\sqrt{N}}$.


 * scaling behavior
 * improved decoding algorithms

Universality
Consider two BMS channels $P$ and $Q$. We are interested in constructing a common polar code of rate $R$ (of arbitrarily large block length) which allows reliable transmission over both channels. Denote by $C_{\text{P, SC}}(P, Q)$ the maximum rate achievable in this way. Trivially, \begin{align} \label{equ:Ibound} C_{\text{P, SC}}(P, Q) & \leq \min\{I(P), I(Q)\}. \end{align} We will see shortly that, properly applied, this simple fact can be used to give tight bounds.

For the lower bound we claim that \begin{align} C_{\text{P, SC}}(P, Q) & \geq C_{\text{P, SC}}(\text{BEC}(Z(P)), \text{BEC}(Z(Q))) \nonumber \\ & = 1-\max\{Z(P), Z(Q)\}. \label{equ:Zbound} \end{align} To see this claim, we proceed as follows. Consider a particular computation tree of height $n$ with observations at its leaf nodes from a BMS channel with Bhattacharyya constant $Z$. What is the largest value that the Bhattacharyya constant of the root node can take on? From the extremes-of-information-combining framework we can deduce that we get the largest value if we take the BMS channel to be the BEC$(Z)$. This is true since at variable nodes the Bhattacharyya constant acts multiplicatively for any channel, and at check nodes, among all BMS channels with a given Bhattacharyya constant, the BEC is known to be the worst. Further, BEC densities are preserved under both convolutions.

The above considerations give rise to the following transmission scheme. We signal on those channels $W^{\sigma}$ which are reliable for the BEC$(\max\{Z(P), Z(Q)\})$. A fortiori these channels are also reliable for the actual channels $P$ and $Q$. In this way we can achieve reliable transmission at rate $1-\max\{Z(P), Z(Q)\}$.

As an example, let us apply the above mentioned bounds to $C_{\text{P, SC}}(P, Q)$, where $P=\text{BEC}(0.5)$ and $Q=\text{BSC}(0.11002)$. We have \begin{align*} I(P)=I(Q)& =0.5, \\ Z(\text{BEC}(0.5)) & =0.5, \\ Z(\text{BSC}(0.11002))& = 2 \sqrt{0.11002 (1-0.11002)} \approx 0.6258. \end{align*} The upper bound (\ref{equ:Ibound}) and the lower bound (\ref{equ:Zbound}) then translate to \begin{gather*} C_{\text{P, SC}}(P, Q) \leq \min\{0.5,0.5\} = 0.5, \\ C_{\text{P, SC}}(P, Q) \geq 1 -\max\{0.6258, 0.5\} = 0.3742. \end{gather*} Note that the upper bound is trivial, but the lower bound is not.
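These numbers are easy to verify; a small Python check (ours):

\begin{verbatim}
import math

def h2(p):                                       # binary entropy function (in bits)
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.11002
print(1 - h2(p))                                 # I(BSC(p)), approximately 0.5
print(2 * math.sqrt(p * (1 - p)))                # Z(BSC(p)), approximately 0.6258
print(1 - max(2 * math.sqrt(p * (1 - p)), 0.5))  # lower bound, approximately 0.3742
\end{verbatim}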

So far we have looked at seemingly trivial upper and lower bounds on the compound capacity of two channels. However, it is quite simple to considerably tighten the result by considering individual branches of the channel trees separately. In this regard, for any $ n \in \NN$ we have \begin{align*} C_{\text{P, SC}}(P, Q) & \leq \frac{1}{2^n} \sum_{i=1}^N  \min\{ I(P_N^{(i)}), I(Q_N^{(i)}) \},  \\ C_{\text{P, SC}}(P, Q) & \geq 1- \frac{1}{2^n} \sum_{i=1}^N  \max\{ Z(P_N^{(i)}), Z(Q_N^{(i)})\}. \end{align*}

Further, the upper as well as the lower bounds converge to the compound capacity as $n$ tends to infinity and the bounds are monotone with respect to $n$.

Generalizations

 * l x l matrices
 * non-binary
 * concatenations

Applications

 * lossless source coding (with and without memory)
 * lossy source coding
 * Wyner-Ziv problem
 * multiple-access channel
 * Wire-tap channel
 * non-symmetric channels
 * Gaussian channels
 * parallel channels
 * Slepian-Wolf

Open problems

 * finite-length improvements
 * exact scaling behavior

Historical account

 * convolutional codes and cut-off rate
 * Dumer, Reed-Muller, ...

References and other resources