In the gradient descent learning process, we adjust the weights and biases to decrease the cost function. The backpropagation algorithm is used to compute the gradient of the cost function.

#### 1. Background

The basics of backpropagation were introduced in the 1970s. It was not until 1986 that the algorithm was shown to be useful for neural networks with hidden layers. The method became popular in the 2010s, benefiting from cheap, powerful GPU-based computing systems.

#### 2. Architecture

For an $L$-layer fully connected feedforward neural network, we denote by $w_{jk}^l$ the weight between neuron $k$ in layer $l-1$ and neuron $j$ in layer $l$. We also denote by $b_j^l$ the bias of neuron $j$ in layer $l$. If we choose the activation function to be $\sigma$, the output of neuron $j$ in layer $l$ is thus

$$a_j^l = \sigma\Big(\sum_k w_{jk}^l\, a_k^{l-1} + b_j^l\Big) \tag{1}$$

We can also use $z_j^l$ to denote the weighted input

$$z_j^l = \sum_k w_{jk}^l\, a_k^{l-1} + b_j^l, \qquad a_j^l = \sigma(z_j^l) \tag{2}$$

In a compact vectorized form, Eq.(1) can be written as

$$a^l = \sigma(w^l a^{l-1} + b^l) \tag{3}$$

*Figure: the weight $w_{jk}^l$ and the bias $b_j^l$ (image: `weights.png`).*
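As a concrete sketch, here is the forward pass of Eqs.(1)–(3) in NumPy, assuming a hypothetical 3–4–2 network and the sigmoid as the activation $\sigma$ (both are illustrative choices, not fixed by the derivation):

```python
import numpy as np

def sigma(z):
    """Sigmoid activation, one common choice for the nonlinearity."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical network: 3 inputs -> 4 hidden neurons -> 2 outputs.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

def forward(x, weights, biases):
    """Return the weighted inputs z^l and activations a^l, per Eqs.(2)-(3)."""
    a = x
    zs, activations = [], [a]
    for w, b in zip(weights, biases):
        z = w @ a + b          # z^l = w^l a^{l-1} + b^l
        a = sigma(z)           # a^l = sigma(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations

x = rng.standard_normal((3, 1))
zs, activations = forward(x, weights, biases)
```

Storing both `zs` and `activations` during the forward pass is what later makes the backward pass cheap, since the backpropagation equations reuse $z^l$ and $a^{l-1}$.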

The cost function is assumed to be

$$C = \frac{1}{n}\sum_x C_x \tag{4}$$

where $n$ is the number of training samples and $C_x$ is the contribution from the $x$-th sample, which is, in some cases (e.g. the quadratic cost),

$$C_x = \frac{1}{2}\,\big\|y(x) - a^L(x)\big\|^2 \tag{5}$$

The learning process tries to reduce $C$ by adjusting the weights and biases

$$w_{jk}^l \to w_{jk}^l - \eta\, \frac{\partial C}{\partial w_{jk}^l} \tag{6}$$

$$b_j^l \to b_j^l - \eta\, \frac{\partial C}{\partial b_j^l} \tag{7}$$

where $\eta$ is the learning rate.

Because of Eq.(4), the gradients in the above equations are averages over per-sample gradients

$$\frac{\partial C}{\partial w_{jk}^l} = \frac{1}{n}\sum_x \frac{\partial C_x}{\partial w_{jk}^l} \tag{8}$$

$$\frac{\partial C}{\partial b_j^l} = \frac{1}{n}\sum_x \frac{\partial C_x}{\partial b_j^l} \tag{9}$$
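To make Eqs.(6) and (8) concrete, here is a minimal sketch of gradient descent on a hypothetical one-weight linear model (a toy example, not the full network), with the quadratic per-sample cost $C_x = (wx - y)^2/2$:

```python
import numpy as np

# Toy data generated by the "true" weight w = 2.
xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 4.0, 6.0])
w, eta = 0.0, 0.1   # initial weight and learning rate

for _ in range(100):
    # Eq.(8): the gradient of C is the average of the per-sample
    # gradients dC_x/dw = (w*x - y) * x.
    grad = np.mean((w * xs - ys) * xs)
    # Eq.(6): move the weight against the gradient.
    w = w - eta * grad
```

After enough steps `w` converges to the cost-minimizing value 2; backpropagation generalizes the per-sample gradient computation to every weight and bias of a multi-layer network.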

#### 3. Gradients

The backpropagation algorithm is basically just an application of the chain rule to obtain the gradient. Let us focus on a single training sample $x$

$$\frac{\partial C_x}{\partial w_{jk}^l} = \frac{\partial C_x}{\partial z_j^l}\,\frac{\partial z_j^l}{\partial w_{jk}^l} \tag{10}$$

$$\frac{\partial C_x}{\partial b_j^l} = \frac{\partial C_x}{\partial z_j^l}\,\frac{\partial z_j^l}{\partial b_j^l} \tag{11}$$

However, Eq.(2) indicates that $\partial z_j^l/\partial w_{jk}^l = a_k^{l-1}$ and $\partial z_j^l/\partial b_j^l = 1$. Thus these gradients are

$$\frac{\partial C_x}{\partial w_{jk}^l} = \frac{\partial C_x}{\partial z_j^l}\, a_k^{l-1} \tag{12}$$

$$\frac{\partial C_x}{\partial b_j^l} = \frac{\partial C_x}{\partial z_j^l} \tag{13}$$

Using the notation

$$\delta_j^l \equiv \frac{\partial C_x}{\partial z_j^l} \tag{14}$$

the gradients can be simplified as

$$\frac{\partial C_x}{\partial w_{jk}^l} = \delta_j^l\, a_k^{l-1} \tag{15}$$

$$\frac{\partial C_x}{\partial b_j^l} = \delta_j^l \tag{16}$$

We can easily calculate the weighted input $z_j^l$ and the activation $a_k^{l-1}$ in the forward pass, but for $\delta_j^l$ we need to distinguish two situations.

**$l$ is the output layer $L$**

$C_x$ is an explicit function of the output activations $a_j^L$, as shown in Eq.(5), so $\delta_j^L$ can be calculated directly

$$\delta_j^L = \frac{\partial C_x}{\partial a_j^L}\,\sigma'(z_j^L) \tag{17}$$
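For instance, with the quadratic cost of Eq.(5) we have $\partial C_x/\partial a_j^L = a_j^L - y_j$, so Eq.(17) becomes a one-liner. A sketch with hypothetical output-layer values and the sigmoid as $\sigma$:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    """Derivative of the sigmoid: sigma'(z) = sigma(z)(1 - sigma(z))."""
    s = sigma(z)
    return s * (1.0 - s)

# Hypothetical quantities for a 2-neuron output layer.
z_L = np.array([[0.5], [-1.0]])  # weighted inputs z^L
a_L = sigma(z_L)                 # activations a^L
y = np.array([[1.0], [0.0]])     # target output

# Eq.(17) for the quadratic cost: delta^L = (a^L - y) * sigma'(z^L)
delta_L = (a_L - y) * sigma_prime(z_L)
```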

**$l$ is an intermediate layer**

We can use the chain rule again to get $\delta_j^l$ for intermediate layers

$$\delta_j^l = \sum_k \frac{\partial C_x}{\partial z_k^{l+1}}\,\frac{\partial z_k^{l+1}}{\partial z_j^l} = \sum_k \delta_k^{l+1}\,\frac{\partial z_k^{l+1}}{\partial z_j^l} \tag{18}$$

Since Eq.(2) gives $z_k^{l+1} = \sum_j w_{kj}^{l+1}\,\sigma(z_j^l) + b_k^{l+1}$, we have $\partial z_k^{l+1}/\partial z_j^l = w_{kj}^{l+1}\,\sigma'(z_j^l)$, thus

$$\delta_j^l = \sum_k w_{kj}^{l+1}\,\delta_k^{l+1}\,\sigma'(z_j^l) \tag{19}$$

We first get $\delta^L$ at the output layer, then get $\delta^{L-1}$ from the above equation, then $\delta^{L-2}$, and so on; that is why the algorithm is called backpropagation.

#### 4. Summary

The four essential equations for computing the gradients are

$$\delta_j^L = \frac{\partial C_x}{\partial a_j^L}\,\sigma'(z_j^L) \tag{20}$$

$$\delta_j^l = \sum_k w_{kj}^{l+1}\,\delta_k^{l+1}\,\sigma'(z_j^l) \tag{21}$$

$$\frac{\partial C_x}{\partial b_j^l} = \delta_j^l \tag{22}$$

$$\frac{\partial C_x}{\partial w_{jk}^l} = \delta_j^l\, a_k^{l-1} \tag{23}$$

In vectorized form, they are

$$\delta^L = \nabla_a C_x \odot \sigma'(z^L) \tag{24}$$

$$\delta^l = \big((w^{l+1})^T \delta^{l+1}\big) \odot \sigma'(z^l) \tag{25}$$

$$\nabla_{b^l} C_x = \delta^l \tag{26}$$

$$\nabla_{w^l} C_x = \delta^l\,(a^{l-1})^T \tag{27}$$

where $\odot$ denotes the elementwise product of two vectors, known as the Hadamard product or Schur product. Notice that $\delta^l$ depends on the specific training sample $x$.
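The four vectorized equations translate almost line for line into NumPy. A minimal sketch, assuming a hypothetical 3–4–2 sigmoid network and the quadratic cost of Eq.(5):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Per-sample gradients of the quadratic cost via Eqs.(24)-(27)."""
    # Forward pass: store all z^l and a^l.
    a, zs, activations = x, [], [x]
    for w, b in zip(weights, biases):
        z = w @ a + b
        a = sigma(z)
        zs.append(z)
        activations.append(a)
    # Eq.(24): delta^L = grad_a C_x ⊙ sigma'(z^L); here grad_a C_x = a^L - y.
    delta = (activations[-1] - y) * sigma_prime(zs[-1])
    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    for l in range(len(weights) - 1, -1, -1):
        grad_b[l] = delta                      # Eq.(26)
        grad_w[l] = delta @ activations[l].T   # Eq.(27)
        if l > 0:
            # Eq.(25): delta^l = ((w^{l+1})^T delta^{l+1}) ⊙ sigma'(z^l)
            delta = (weights[l].T @ delta) * sigma_prime(zs[l - 1])
    return grad_w, grad_b

# Hypothetical 3-4-2 network and one training sample.
rng = np.random.default_rng(1)
sizes = [3, 4, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]
x = rng.standard_normal((3, 1))
y = rng.standard_normal((2, 1))
grad_w, grad_b = backprop(x, y, weights, biases)
```

A standard sanity check for such an implementation is to compare each returned gradient entry against a finite-difference estimate of $\partial C_x/\partial w_{jk}^l$; the two should agree to several decimal places.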
