Backpropagation Algorithm

In the gradient descent learning process, we adjust the weights and biases to decrease the cost function. The backpropagation algorithm is used to compute the gradient of the cost function.


1. Background

The basics of backpropagation were introduced in the 1970s, but it was not until 1986 that the algorithm was shown to be useful for neural networks with hidden layers. The method became popular in the 2010s, benefiting from cheap, powerful GPU-based computing systems.


2. Architecture

For an L-layer fully-connected feedforward neural network, we denote by w_{jk}^{l} the weight between neuron k in layer l-1 and neuron j in layer l, and by b_j^l the bias of neuron j in layer l. If we choose the activation function to be \sigma(z), the output of neuron j in layer l is thus

(1)   \begin{equation*}  a_j^l = \sigma\left(\sum_{k}w_{jk}^{l}a_k^{l-1}+b_j^l\right) \end{equation*}

We can also use z_j^l to denote the weighted input

(2)   \begin{equation*} z^l_j = \sum_{k}w_{jk}^{l}a_k^{l-1}+b_j^l \end{equation*}

In compact vectorized form, Eq.(1) can be written as

(3)   \begin{equation*} a^l = \sigma(w^la^{l-1}+b^l) \end{equation*}

[Figure: weight w_{jk}^{l} and bias b_j^l in the network]
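As a concrete illustration, the forward pass of Eq.(3) can be sketched in Python with NumPy; the 2-3-1 network shape and the random weight values below are made up for illustration only.

```python
import numpy as np

def sigmoid(z):
    # activation function sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    # weights[l] and biases[l] hold w^l and b^l for each layer
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)  # Eq.(3): a^l = sigma(w^l a^{l-1} + b^l)
    return a

# toy 2-3-1 network with random weights (illustrative only)
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(1)]
out = feedforward(np.array([0.5, -0.2]), weights, biases)
```

Since \sigma maps every weighted input into (0, 1), each component of the output lies in that interval.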

The cost function is assumed to be

(4)   \begin{equation*}  C=\frac{1}{M}\sum_{i=1}^{M}C_i \end{equation*}

where M is the number of training samples and C_i is the contribution from the ith sample (x_i, y_i), which is, in some cases,

(5)   \begin{equation*}  C_i = \frac{1}{2}|y_i-a^L(x_i)|^2 \end{equation*}
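For the quadratic cost of Eq.(5), a quick numeric check (NumPy assumed; the target and output vectors are made up):

```python
import numpy as np

def quadratic_cost(y, a):
    # C_i = (1/2) |y_i - a^L(x_i)|^2, Eq.(5)
    return 0.5 * np.sum((y - a) ** 2)

# (1/2)((1.0 - 0.8)^2 + (0.0 - 0.1)^2) = 0.025
c = quadratic_cost(np.array([1.0, 0.0]), np.array([0.8, 0.1]))
```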

The learning process tries to reduce C by adjusting the weights and biases via gradient descent

(6)   \begin{equation*} w_{jk}^{l} \to w_{jk}^{l}-\eta\frac{\partial C}{\partial w_{jk}^{l}} \end{equation*}

(7)   \begin{equation*} b_j^l \to b_j^l-\eta\frac{\partial C}{\partial b_j^l} \end{equation*}

Because of Eq.(4), the above updates become

(8)   \begin{equation*} w_{jk}^{l} \to w_{jk}^{l}-\frac{\eta}{M}\sum_i\frac{\partial C_i}{\partial w_{jk}^{l}} \end{equation*}

(9)   \begin{equation*} b_j^l \to b_j^l-\frac{\eta}{M}\sum_i\frac{\partial C_i}{\partial b_j^l} \end{equation*}
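The averaged updates of Eqs.(8) and (9) amount to the following sketch; the per-sample gradients are taken as given here (the made-up values are for illustration), since computing them is the subject of the next section.

```python
import numpy as np

def sgd_step(w, b, grads_w, grads_b, eta):
    # grads_w[i], grads_b[i] are the per-sample gradients dC_i/dw and dC_i/db
    M = len(grads_w)
    w = w - (eta / M) * sum(grads_w)  # Eq.(8)
    b = b - (eta / M) * sum(grads_b)  # Eq.(9)
    return w, b

# two samples (M = 2), made-up gradient values for illustration
w0, b0 = np.array([[1.0, 2.0]]), np.array([0.5])
gw = [np.array([[0.2, 0.0]]), np.array([[0.0, 0.4]])]
gb = [np.array([1.0]), np.array([3.0])]
w1, b1 = sgd_step(w0, b0, gw, gb, eta=0.1)
# w1 ≈ [[0.99, 1.98]], b1 ≈ [0.3]
```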

3. Gradients

The backpropagation algorithm is essentially an application of the chain rule to compute the gradient. Let us focus on a single training sample (x_i, y_i)

(10)   \begin{equation*} \frac{\partial C_i}{\partial w_{jk}^{l}} = \frac{\partial C_i}{\partial a_j^l} \frac{\partial a_j^l}{\partial z_j^l} \frac{\partial z_j^l}{\partial w_{jk}^{l}} \end{equation*}

(11)   \begin{equation*} \frac{\partial C_i}{\partial b_j^{l}} = \frac{\partial C_i}{\partial a_j^l} \frac{\partial a_j^l}{\partial z_j^l} \frac{\partial z_j^l}{\partial b_j^l} \end{equation*}

Eq.(1) indicates \frac{\partial a_j^l}{\partial z_j^l} =\sigma'(z_j^l), and Eq.(2) indicates \frac{\partial z_j^l}{\partial w_{jk}^{l}} = a_k^{l-1} and \frac{\partial z_j^l}{\partial b_j^l} = 1. Thus these gradients are

(12)   \begin{equation*} \frac{\partial C_i}{\partial w_{jk}^{l}} = \frac{\partial C_i}{\partial a_j^l} \sigma'(z_j^l) a_k^{l-1} \end{equation*}

(13)   \begin{equation*} \frac{\partial C_i}{\partial b_j^{l}} = \frac{\partial C_i}{\partial a_j^l} \sigma'(z_j^l) \end{equation*}

Using the notation

(14)   \begin{equation*} \delta^l_j \equiv \frac{\partial C_i}{\partial z^l_j} =\frac{\partial C_i}{\partial a_j^l} \sigma'(z_j^l) \end{equation*}

the gradients can be simplified as

(15)   \begin{equation*} \frac{\partial C_i}{\partial w_{jk}^{l}} = \delta^l_j a_k^{l-1} \end{equation*}

(16)   \begin{equation*} \frac{\partial C_i}{\partial b_j^{l}} = \delta^l_j \end{equation*}

We can easily calculate the output a_k^{l-1} and the weighted input z^l_j, but for \delta^l_j we need to distinguish two cases.


l is the output layer L
C_i is an explicit function of a_j^L, as shown in Eq.(5), so \frac{\partial C_i}{\partial a_j^L} can be calculated directly.

(17)   \begin{equation*} \delta^L_j =\frac{\partial C_i}{\partial a_j^L} \sigma'(z_j^L) \end{equation*}

l is an intermediate layer
We apply the chain rule again to get \frac{\partial C_i}{\partial a_j^l} for intermediate layers

(18)   \begin{align*} \frac{\partial C_i}{\partial a_j^l} &=\sum_{k}\frac{\partial C_i}{\partial z_k^{l+1}} \frac{\partial z_k^{l+1}}{\partial a_j^l} \\ &=\sum_{k}\delta^{l+1}_k w_{kj}^{l+1} \end{align*}


(19)   \begin{equation*} \delta^l_j = \sum_{k}\delta^{l+1}_k w_{kj}^{l+1} \sigma'(z_j^l) \end{equation*}

We first obtain \delta^L_j, then \delta^{L-1}_j from the above equation, then \delta^{L-2}_j, and so on; the error propagates backwards through the layers, which is why the algorithm is called backpropagation.
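In code, this backward pass over the \delta^l_j might look as follows (a sketch assuming NumPy, the sigmoid activation, and the quadratic cost of Eq.(5), so that \partial C_i/\partial a_j^L = a_j^L - y_j; the toy 2-3-1 network is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # derivative of the sigmoid: sigma'(z) = sigma(z)(1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# toy 2-3-1 network (illustrative values only)
rng = np.random.default_rng(1)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(1)]
x, y = np.array([0.3, 0.7]), np.array([1.0])

# forward pass, recording z^l and a^l for every layer
a, activations, zs = x, [x], []
for w, b in zip(weights, biases):
    z = w @ a + b
    zs.append(z)
    a = sigmoid(z)
    activations.append(a)

# output-layer delta, Eq.(17), with dC_i/da^L = a^L - y for the quadratic cost
delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
deltas = [delta]
# propagate backwards through the intermediate layers, Eq.(19)
for l in range(len(weights) - 1, 0, -1):
    delta = (weights[l].T @ delta) * sigmoid_prime(zs[l - 1])
    deltas.insert(0, delta)
```

Note that `deltas` ends up with one vector per layer, built from the output layer back to the first hidden layer.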


4. Summary

The four essential equations for the gradients are

(20)   \begin{equation*} \delta^L_j =\frac{\partial C_i}{\partial a_j^L} \sigma'(z_j^L) \end{equation*}

(21)   \begin{equation*} \delta^l_j = \sum_{k} \delta^{l+1}_k w_{kj}^{l+1} \sigma'(z_j^l) \end{equation*}

(22)   \begin{equation*} \frac{\partial C_i}{\partial w_{jk}^{l}} = \delta^l_j a_k^{l-1} \end{equation*}

(23)   \begin{equation*} \frac{\partial C_i}{\partial b_j^{l}} = \delta^l_j \end{equation*}

In vectorized form, they are

(24)   \begin{equation*} \delta^L =\frac{\partial C_i}{\partial a^L}\odot \sigma'(z^L) \end{equation*}

(25)   \begin{equation*} \delta^l = \left[(w^{l+1})^T \delta^{l+1}\right] \odot \sigma'(z^l) \end{equation*}

(26)   \begin{equation*} \frac{\partial C_i}{\partial w_{jk}^{l}} = \delta^l_j a_k^{l-1} \end{equation*}

(27)   \begin{equation*} \frac{\partial C_i}{\partial b_j^{l}} = \delta^l_j \end{equation*}

where \odot denotes the elementwise product of two vectors, known as the Hadamard or Schur product. Notice that \delta^l depends on the specific training sample (x_i, y_i).
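Putting the four equations together, here is a sketch of backpropagation for one training sample, checked against a finite-difference estimate of one partial derivative (NumPy, the sigmoid activation, and the quadratic cost assumed; the 2-3-1 network is a made-up example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(x, y, weights, biases):
    # C_i = (1/2)|y - a^L(x)|^2, Eq.(5)
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return 0.5 * np.sum((y - a) ** 2)

def backprop(x, y, weights, biases):
    # forward pass, storing z^l and a^l
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Eq.(24): delta^L = (a^L - y) ⊙ sigma'(z^L) for the quadratic cost
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_w = [np.outer(delta, activations[-2])]  # Eq.(26)
    grads_b = [delta]                             # Eq.(27)
    for l in range(len(weights) - 1, 0, -1):
        # Eq.(25): delta^l = [(w^{l+1})^T delta^{l+1}] ⊙ sigma'(z^l)
        delta = (weights[l].T @ delta) * sigmoid_prime(zs[l - 1])
        grads_w.insert(0, np.outer(delta, activations[l - 1]))
        grads_b.insert(0, delta)
    return grads_w, grads_b

rng = np.random.default_rng(2)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(1)]
x, y = np.array([0.2, -0.4]), np.array([0.9])
grads_w, grads_b = backprop(x, y, weights, biases)

# central-difference check of one entry of dC_i/dw^1
eps = 1e-6
weights[0][0, 0] += eps
c_plus = cost(x, y, weights, biases)
weights[0][0, 0] -= 2 * eps
c_minus = cost(x, y, weights, biases)
weights[0][0, 0] += eps
numeric = (c_plus - c_minus) / (2 * eps)
```

The analytic gradient from Eqs.(24)–(27) and the numerical estimate should agree to several decimal places.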



