Roberto Santana and Unai Garciarena
Department of Computer Science and Artificial Intelligence
University of the Basque Country
C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press. 2005.
R. Rojas. Neural networks: a systematic introduction. Springer Science & Business Media. Chapter 7. 2013.
where \( d_j \) is the desired output
H. Wang, B. Raj, and E. P. Xing. On the Origin of Deep Learning. arXiv preprint arXiv:1702.07800. 2017.
K. Kawaguchi. A multithreaded software model for backpropagation neural network applications. Ph. D. Thesis. 2000.
\[ \begin{align} h({\bf{x}}) =& g \left ( w_1 h_1({\bf{x}}) + w_2 h_2({\bf{x}}) + c \right ) \\ =& g \left ( w_1 g(\theta_1 x_1 + \theta_2 x_2 + b_1) + w_2 g(\theta_3 x_1 + \theta_4 x_2 + b_2) + c \right ) \end{align} \]
Q. V. Le. A Tutorial on Deep Learning. Part 1: Nonlinear Classifiers and The Backpropagation Algorithm. 2015.
Figure from: A. K. Jain, J. Mao, and K. M. Mohiuddin. Artificial neural networks: A tutorial. Computer. Vol. 29 No. 3. Pp. 31-44. 1996.
N. J. Guliyev and V. E. Ismailov. A single hidden layer feedforward network with only one neuron in the hidden layer can approximate any univariate function. 2016.
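As a minimal sketch of how this composition is evaluated (the logistic sigmoid for \( g \) and all parameter values are assumptions made for illustration, not taken from the slides):

```python
import math

def g(z):
    # Assumed activation function: logistic sigmoid.
    return 1.0 / (1.0 + math.exp(-z))

def h(x1, x2, theta, b1, b2, w1, w2, c):
    # Hidden units h1(x) and h2(x), each a function of both inputs.
    h1 = g(theta[0] * x1 + theta[1] * x2 + b1)
    h2 = g(theta[2] * x1 + theta[3] * x2 + b2)
    # Output unit combines the two hidden activations.
    return g(w1 * h1 + w2 * h2 + c)

# Made-up parameter values, just to show the nested evaluation.
print(h(1.0, 0.0, theta=[2.0, -1.0, 0.5, 1.5], b1=-0.5, b2=0.2, w1=1.0, w2=-2.0, c=0.3))
```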
Input \(x\) | True Model \(M(x)=2x\) | Output \(g(x)\) | Abs. Error \(\lvert g(x)-M(x)\rvert\) | Square Error \((g(x)-M(x))^2\) |
---|---|---|---|---|
0 | 0 | 0 | 0 | 0 |
1 | 2 | 3 | 1 | 1 |
2 | 4 | 6 | 2 | 4 |
3 | 6 | 9 | 3 | 9 |
4 | 8 | 12 | 4 | 16 |
5 | 10 | 15 | 5 | 25 |
All | | | 15 | 55 |
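These columns can be checked with a few lines of Python (the candidate model \( g(x)=3x \) is read off from the Output column):

```python
# True model M(x) = 2x, candidate model g(x) = 3x (the Output column).
inputs = [0, 1, 2, 3, 4, 5]
abs_errors = [abs(3 * x - 2 * x) for x in inputs]
sq_errors = [(3 * x - 2 * x) ** 2 for x in inputs]
print(abs_errors, sum(abs_errors))  # [0, 1, 2, 3, 4, 5] 15
print(sq_errors, sum(sq_errors))    # [0, 1, 4, 9, 16, 25] 55
```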
Input \(x\) | True Model \(M(x)=2x\) | Output (\(W=3\)) | SE (\(W=3\)) | SE (\(W=3.02\)) | SE (\(W=2.98\)) |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 |
1 | 2 | 3 | 1 | 1.04 | 0.96 |
2 | 4 | 6 | 4 | 4.16 | 3.84 |
3 | 6 | 9 | 9 | 9.36 | 8.64 |
4 | 8 | 12 | 16 | 16.64 | 15.36 |
5 | 10 | 15 | 25 | 26.01 | 24.01 |
All | | | 55 | 57.22 | 52.82 |
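A short sketch of the same sensitivity check: lowering \( W \) slightly decreases the total squared error while raising it increases the error, which is exactly the information a gradient-based method exploits (the helper function below is illustrative):

```python
def sse(w, inputs=(0, 1, 2, 3, 4, 5)):
    # Total squared error of g(x) = w*x against the true model M(x) = 2x.
    return sum((w * x - 2 * x) ** 2 for x in inputs)

for w in (3.0, 3.02, 2.98):
    print(w, round(sse(w), 2))  # 55.0, 57.22, 52.82
```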
The gradient of a function \( J(\theta_1,\dots,\theta_d) \) is a vector-valued function defined as:
\[
\nabla J(\theta_1,\dots,\theta_d) = \left\langle \frac{\partial J}{\partial \theta_1}(\theta_1,\dots,\theta_d), \dots, \frac{\partial J}{\partial \theta_d}(\theta_1,\dots,\theta_d) \right\rangle
\]
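For example (an illustrative function, not one from the slides), if \( J(\theta_1,\theta_2) = \theta_1^2 + 3\theta_1\theta_2 \), then
\[
\nabla J(\theta_1,\theta_2) = \left\langle \frac{\partial J}{\partial \theta_1}, \frac{\partial J}{\partial \theta_2} \right\rangle = \left\langle 2\theta_1 + 3\theta_2,\; 3\theta_1 \right\rangle
\]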
Gradient descent: a local minimization method based on updating the parameters of a function \( J(\theta_1,\dots,\theta_d) \) in the direction opposite to its gradient.
A parameter \( \mu \) indicates the learning rate of the algorithm, i.e., the size of the step taken toward a local optimum (in the update rules below the learning rate is written \( \epsilon \)).
S. Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747. 2016.
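A minimal gradient-descent sketch on the single-parameter example from the tables above, where \( J(W) = \sum_x (Wx - 2x)^2 \) and therefore \( \frac{dJ}{dW} = \sum_x 2x(Wx - 2x) \) (the learning rate and number of steps are arbitrary choices for illustration):

```python
# Gradient descent on J(W) = sum_x (W*x - 2*x)^2, which is minimized at W = 2.
inputs = [0, 1, 2, 3, 4, 5]
mu = 0.005   # learning rate
W = 3.0      # initial guess, as in the table

for step in range(100):
    grad = sum(2 * x * (W * x - 2 * x) for x in inputs)  # dJ/dW
    W = W - mu * grad                                     # move against the gradient
print(W)  # converges to (approximately) 2.0
```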
\[ \begin{align} h({\bf{x}}) =& g \left ( w_1 h_1({\bf{x}}) + w_2 h_2({\bf{x}}) + c \right ) \\ =& g(w_1 g(\theta_1 x_1 + \theta_2 x_2 + b_1) \\ +& w_2 g(\theta_3 x_1 + \theta_4 x_2 + b_2) + c) \end{align} \]
\( h(x) \): decision function
\( g \): activation function
\( \theta^l_{ij} \): weight of the connection from unit \(j\) in layer \(l\) to neuron \(i\) in layer \(l+1\)
\( b_i \): bias of neuron \( i \)
\( s_l \): number of neurons in layer \(l\)
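A sketch of a generic forward pass in this notation, under the assumption that the weights of layer \( l \) are stored as an \( s_{l+1} \times s_l \) matrix with entries \( \theta^l_{ij} \) and the biases as a vector; numpy and the sigmoid activation are choices made for the example:

```python
import numpy as np

def g(z):
    # Assumed activation function: logistic sigmoid, applied elementwise.
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas, biases):
    # thetas[l][i, j] plays the role of theta^l_{ij}: the weight from unit j
    # in layer l to neuron i in layer l+1; biases[l][i] is the bias b_i.
    a = x
    for Theta, b in zip(thetas, biases):
        a = g(Theta @ a + b)
    return a  # h(x): the decision function

# Made-up network: 2 inputs -> 2 hidden neurons -> 1 output (s_1=2, s_2=2, s_3=1).
rng = np.random.default_rng(0)
thetas = [rng.normal(size=(2, 2)), rng.normal(size=(1, 2))]
biases = [rng.normal(size=2), rng.normal(size=1)]
print(forward(np.array([1.0, 0.5]), thetas, biases))
```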
Optimization is involved in several aspects of machine learning algorithms.
Of all the optimization problems involved in deep learning, the most difficult is neural network training.
Optimization also largely determines the efficiency of the DNN learning algorithm.
We focus on the optimization problem of finding the parameters \( \Theta \) of a neural network that significantly reduce a (possibly regularized) loss function \( J(\Theta) \).
I. Goodfellow and Y. Bengio and A. Courville. Deep Learning. Chapter 8. Optimization for Training Deep Models. MIT Press. 2016.
Gradient descent algorithms can be grouped into three classes according to how much data is used to compute the gradient for each parameter update:
S. Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747. 2016.
Batch gradient descent: to perform one parameter update, the gradient of \(J\) is computed using all the points in the dataset as:
\[
\theta = \theta - \epsilon \nabla_{\theta} J(\theta)
\]
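A sketch of batch gradient descent under these definitions, using an illustrative one-parameter linear model and a mean-squared-error loss (the data, loss, and learning rate are assumptions for the example):

```python
import numpy as np

# Illustrative data: targets generated by y = 2x plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2 * X[:, 0] + 0.1 * rng.normal(size=100)

theta = np.zeros(1)
epsilon = 0.1  # learning rate

# Each update uses the gradient of J computed over the whole dataset.
for step in range(100):
    grad = 2 * X.T @ (X @ theta - y) / len(X)  # gradient of the mean squared error
    theta = theta - epsilon * grad
print(theta)  # close to 2.0
```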
Stochastic gradient descent (SGD): a parameter update is performed for each training point \(x^i\) and label \(y^i\) as:
\[
\theta = \theta - \epsilon \nabla_{\theta} J(\theta;x^i,y^i)
\]
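The corresponding stochastic version on the same illustrative setup: one update per training point \( (x^i, y^i) \):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2 * X[:, 0] + 0.1 * rng.normal(size=100)

theta = np.zeros(1)
epsilon = 0.05  # learning rate

# One pass over the shuffled data: one parameter update per single point.
for i in rng.permutation(len(X)):
    grad_i = 2 * X[i] * (X[i] @ theta - y[i])  # gradient of the loss at point i
    theta = theta - epsilon * grad_i
print(theta)  # noisy, but close to 2.0
```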
Mini-batch gradient descent: a parameter update is performed for each mini-batch of \(n\) points \( (x^i,\dots,x^{i+n})\) and labels \((y^i,\dots,y^{i+n})\) as:
\[
\theta = \theta - \epsilon \nabla_{\theta} J(\theta; (x^i,\dots,x^{i+n}),(y^i,\dots,y^{i+n}))
\]
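And a mini-batch version of the same illustrative setup, with one update per mini-batch of \( n \) points:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2 * X[:, 0] + 0.1 * rng.normal(size=100)

theta = np.zeros(1)
epsilon = 0.1  # learning rate
n = 10         # mini-batch size

# Each epoch: shuffle, then one parameter update per mini-batch of n points.
for epoch in range(5):
    order = rng.permutation(len(X))
    for start in range(0, len(X), n):
        batch = order[start:start + n]
        grad = 2 * X[batch].T @ (X[batch] @ theta - y[batch]) / n
        theta = theta - epsilon * grad
print(theta)  # close to 2.0
```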
I. Goodfellow and Y. Bengio and A. Courville. Deep Learning. Chapter 8. Optimization for Training Deep Models. MIT Press. 2016.
Images credit: Alec Radford.