Roberto Santana and Unai Garciarena
Department of Computer Science and Artificial Intelligence
University of the Basque Country
H. Wang, B. Raj, and E. P. Xing. On the Origin of Deep Learning. arXiv preprint arXiv:1702.07800. 2017.
\[ y = \begin{cases} 1, & \mbox{if } \sum_i w_i x_i \geq \theta \wedge z_j=0, \, \forall j \\ 0, & \mbox{otherwise} \end{cases} \]
\( y \): output.
\(x_i\): input signals
\( \theta \): threshold
\(w_i\): weights
\(z_j\): inhibitory inputs
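The threshold unit above can be sketched in a few lines of Python (a minimal illustration; the function name and argument order are ours):

```python
# McCulloch-Pitts unit: fire (1) iff the weighted sum reaches the
# threshold theta AND no inhibitory input z_j is active; otherwise 0.
def mcculloch_pitts(x, w, theta, z):
    if any(z_j != 0 for z_j in z):   # any active inhibitory input vetoes firing
        return 0
    return 1 if sum(w_i * x_i for w_i, x_i in zip(w, x)) >= theta else 0

# An AND gate with two excitatory inputs and no inhibition:
print(mcculloch_pitts([1, 1], [1, 1], theta=2, z=[]))   # fires: 1
print(mcculloch_pitts([1, 0], [1, 1], theta=2, z=[]))   # below threshold: 0
print(mcculloch_pitts([1, 1], [1, 1], theta=2, z=[1]))  # inhibited: 0
```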
W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, Vol. 5, Pp. 115-133. 1943.
A. K. Jain, J. Mao, and K. M. Mohiuddin. Figure. Artificial neural networks: A tutorial. Computer. Vol. 29 No. 3. Pp. 31-44. 1996.
Figure credit. Neural network zoo.
J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558. 1982.
S. Haykin. Neural Networks: A Comprehensive Foundation. Eastern Economy Edition, Second Edition. 2007.
The primary function of a content-addressable memory is to retrieve a pattern (item) stored in memory given an incomplete or noisy version of that pattern.
Given sufficient partial information, a content-addressable memory can retrieve the stored pattern.
A content-addressable memory is therefore error-correcting.
The Hopfield network works by defining an energy function whose fixed points correspond to the patterns to be stored.
S. Haykin. Neural Networks: A Comprehensive Foundation. Eastern Economy Edition, Second Edition. 2007.
\[ E = - \frac{1}{2} \sum_{i,j} s_is_jw_{i,j} - \sum_i s_i b_i \]
\(s_i\): state of unit \(i\) (initially, the input \(x_i\))
\(b_i\): bias applied externally to neuron \(i\)
\(i,j\): indices of the units
\(w_{i,j}\): bidirectional weight between neurons \(i\) and \(j\)
\( s^1 = (-1,-1,-1,-1) \)
\( s^2 = (-1,-1,-1,1) \)
\( \dots \)
\( s^{16} = (1,1,1,1) \)
\[ W = \begin{pmatrix} 0& 2& 2& 0\\ 2& 0& -2& -2\\ 2& -2& 0& 0\\ 0& -2& 0& 0 \end{pmatrix} \]
\( s^1 = (-1,1,1,1) \)
\( s^2 = (-1,-1,1,-1) \)
\[ \begin{align} E(s^1) &= -((-1)(1)(2) + (-1)(1)(2) \\ &\quad + (-1)(1)(0) + (1)(1)(-2) \\ &\quad + (1)(1)(-2) + (1)(1)(0)) \\ &= -(-8) = 8 \end{align} \]
\( E(s^2) = ? \)
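The worked example can be checked numerically. A minimal sketch (names are ours), omitting the bias term, which is zero in this example:

```python
# Hopfield energy E(s) = -1/2 * sum_{i,j} s_i s_j w_{i,j},
# evaluated for the example weight matrix above.
W = [[0, 2, 2, 0],
     [2, 0, -2, -2],
     [2, -2, 0, 0],
     [0, -2, 0, 0]]

def energy(s, W):
    return -0.5 * sum(s[i] * s[j] * W[i][j]
                      for i in range(len(s)) for j in range(len(s)))

print(energy((-1, 1, 1, 1), W))   # s^1: 8.0, as in the worked example
```

The same function answers the exercise for \( s^2 \).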
\[ E = - \frac{1}{2} \sum_{i,j} s_i s_j w_{i,j} \]
\[ W = \begin{pmatrix} 0& 2& 2& 0\\ 2& 0& -2& -2\\ 2& -2& 0& 0\\ 0& -2& 0& 0 \end{pmatrix} \]
\( s_i(t+1) = sign(\sum_j w_{ij} s_j(t) +b_i) \)
\[ sign(x) = \begin{cases} 1, & \mbox{if } x \geq 0 \\ -1, & \mbox{otherwise} \end{cases} \]
S. Haykin. Neural Networks: A Comprehensive Foundation. Eastern Economy Edition, Second Edition. 2007.
\( s_i(t+1) = sign(\sum_j w_{ij} s_j(t)) \)
\( s^1(t) = (-1,1,1,1). E(s^1(t))=8. \; i=3 \)
\(s^2_3 = sign( w_{3,1}*s^1_1 + w_{3,2}*s^1_2 + w_{3,4}*s^1_4)=-1\)
\( s^2(t) = (-1,1,-1,1). E(s^2(t))=0. \; i=1 \)
\(s^3_1 = sign( w_{1,2}*s^2_2 + w_{1,3}*s^2_3 + w_{1,4}*s^2_4)=1\)
\( s^3(t) = (1,1,-1,1). E(s^3(t))=0. \; i=4 \)
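The asynchronous update trace above can be replayed in code (a sketch; note that unit \(i=3\) in the slides is index 2 in 0-based Python):

```python
# Asynchronous Hopfield update of a single unit i (no biases).
W = [[0, 2, 2, 0],
     [2, 0, -2, -2],
     [2, -2, 0, 0],
     [0, -2, 0, 0]]

def update_unit(s, W, i):
    field = sum(W[i][j] * s[j] for j in range(len(s)) if j != i)
    s = list(s)
    s[i] = 1 if field >= 0 else -1   # sign(x), with sign(0) = 1
    return tuple(s)

s = (-1, 1, 1, 1)
s = update_unit(s, W, 2)   # update unit 3 -> (-1, 1, -1, 1)
s = update_unit(s, W, 0)   # update unit 1 -> (1, 1, -1, 1)
print(s)
```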
\[ W = \begin{pmatrix} 0& 2& 2& 0\\ 2& 0& -2& -2\\ 2& -2& 0& 0\\ 0& -2& 0& 0 \end{pmatrix} \]
\( w_{i,j} = \frac{1}{N} \sum_{k=1}^{N} s^k_i s^k_j, \quad i \neq j; \quad w_{i,i} = 0 \)
S. Haykin. Neural Networks: A Comprehensive Foundation. Eastern Economy Edition, Second Edition. 2007.
\( w_{i,j} = \frac{1}{N} \sum_{k=1}^{N} (s^k_i)(s^k_j) \)
\[ W = \begin{pmatrix} 0& 1& 1& 1& -1& -1& 1& 1& 1\\ 1& 0& 1& 1& -1& -1& 1& 1& 1\\ 1& 1& 0& 1& -1& -1& 1& 1& 1\\ 1& 1& 1& 0& -1& -1& 1& 1& 1\\ -1& -1& -1& -1& 0& 1& -1& -1& -1\\ -1& -1& -1& -1& 1& 0& -1& -1& -1\\ 1& 1& 1& 1& -1& -1& 0& 1& 1\\ 1& 1& 1& 1& -1& -1& 1& 0& 1\\ 1& 1& 1& 1& -1& -1& 1& 1& 0 \end{pmatrix} \]
\( s^1 = (1,1,1,1,-1,-1,1,1,1) \)
\[ W = \begin{pmatrix} 0& 1& 1& 1& -1& -1& 1& 1& 1\\ 1& 0& 1& 1& -1& -1& 1& 1& 1\\ 1& 1& 0& 1& -1& -1& 1& 1& 1\\ 1& 1& 1& 0& -1& -1& 1& 1& 1\\ -1& -1& -1& -1& 0& 1& -1& -1& -1\\ -1& -1& -1& -1& 1& 0& -1& -1& -1\\ 1& 1& 1& 1& -1& -1& 0& 1& 1\\ 1& 1& 1& 1& -1& -1& 1& 0& 1\\ 1& 1& 1& 1& -1& -1& 1& 1& 0 \end{pmatrix} \]
\( w_{i,j} = \frac{1}{N} \sum_{k=1}^{N} (s^k_i)(s^k_j) \)
\[ W = \begin{pmatrix} 0& 1& 1& 0& 0& -1& 0& 1& 0\\ 1& 0& 1& 0& 0& -1& 0& 1& 0\\ 1& 1& 0& 0& 0& -1& 0& 1& 0\\ 0& 0& 0& 0& -1& 0& 1& 0& 1\\ 0& 0& 0& -1& 0& 0& -1& 0& -1\\ -1& -1& -1& 0& 0& 0& 0& -1& 0\\ 0& 0& 0& 1& -1& 0& 0& 0& 1\\ 1& 1& 1& 0& 0& -1& 0& 0& 0\\ 0& 0& 0& 1& -1& 0& 1& 0& 0 \end{pmatrix} \]
\( s^1 = (1,1,1,1,-1,-1,1,1,1) \)
\( s^2 = (1,1,1,-1,1,-1,-1,1,-1) \)
\[ W = \begin{pmatrix} 0& 1& 1& 0& 0& -1& 0& 1& 0\\ 1& 0& 1& 0& 0& -1& 0& 1& 0\\ 1& 1& 0& 0& 0& -1& 0& 1& 0\\ 0& 0& 0& 0& -1& 0& 1& 0& 1\\ 0& 0& 0& -1& 0& 0& -1& 0& -1\\ -1& -1& -1& 0& 0& 0& 0& -1& 0\\ 0& 0& 0& 1& -1& 0& 0& 0& 1\\ 1& 1& 1& 0& 0& -1& 0& 0& 0\\ 0& 0& 0& 1& -1& 0& 1& 0& 0 \end{pmatrix} \]
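The Hebbian storage rule reproduces the \(9 \times 9\) weight matrix above from the two stored patterns. A minimal sketch:

```python
# Hebbian rule w_{i,j} = (1/N) sum_k s^k_i s^k_j, with zero diagonal.
patterns = [(1, 1, 1, 1, -1, -1, 1, 1, 1),
            (1, 1, 1, -1, 1, -1, -1, 1, -1)]

def hebbian_weights(patterns):
    n, N = len(patterns[0]), len(patterns)
    return [[0 if i == j else sum(p[i] * p[j] for p in patterns) / N
             for j in range(n)] for i in range(n)]

W = hebbian_weights(patterns)
print(W[0][1], W[3][4], W[0][3])   # 1.0 -1.0 0.0, matching the matrix above
```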
\[ E = -\frac{1}{2} \sum_{i,j} s_i s_j w_{i,j} \]
\[ W = \begin{pmatrix} 0& 1& 1& 0& 0& -1& 0& 1& 0\\ 1& 0& 1& 0& 0& -1& 0& 1& 0\\ 1& 1& 0& 0& 0& -1& 0& 1& 0\\ 0& 0& 0& 0& -1& 0& 1& 0& 1\\ 0& 0& 0& -1& 0& 0& -1& 0& -1\\ -1& -1& -1& 0& 0& 0& 0& -1& 0\\ 0& 0& 0& 1& -1& 0& 0& 0& 1\\ 1& 1& 1& 0& 0& -1& 0& 0& 0\\ 0& 0& 0& 1& -1& 0& 1& 0& 0 \end{pmatrix} \]
\( s^1 = (1,1,1,1,-1,-1,1,1,1). \; E(s^1) = -16 \)
\( r^1 = (-1,1,1,1,-1,-1,1,1,1). \; E(r^1) = -8 \)
\( r^2 = (1,-1,1,1,-1,-1,1,1,1). \; E(r^2) = -8 \)
\( r^3 = (1,1,-1,1,-1,-1,1,1,1). \; E(r^3) = -8 \)
\( r^4 = (1,1,1,-1,-1,-1,1,1,1). \; E(r^4) = -10 \)
\( r^5 = (1,1,1,1,1,-1,1,1,1). \; E(r^5) = -10 \)
\( r^6 = (1,1,1,1,-1,1,1,1,1). \; E(r^6) = -8\)
\( r^7 = (1,1,1,1,-1,-1,-1,1,1). \; E(r^7) = -10 \)
\( r^8 = (1,1,1,1,-1,-1,1,-1,1). \; E(r^8) = -8 \)
\( r^{9} = (1,1,1,1,-1,-1,1,1,-1). \; E(r^9) = -10 \)
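The enumeration above shows that \( s^1 \) is a local minimum of the energy: every single-bit flip \( r^1, \dots, r^9 \) increases \(E\). This can be verified exhaustively (a sketch; names are ours):

```python
# Verify that the stored pattern s^1 is a local minimum of the Hopfield energy.
patterns = [(1, 1, 1, 1, -1, -1, 1, 1, 1),
            (1, 1, 1, -1, 1, -1, -1, 1, -1)]
n = len(patterns[0])
W = [[0 if i == j else sum(p[i] * p[j] for p in patterns) / len(patterns)
      for j in range(n)] for i in range(n)]

def energy(s, W):
    return -0.5 * sum(s[i] * s[j] * W[i][j]
                      for i in range(len(s)) for j in range(len(s)))

s1 = patterns[0]
e0 = energy(s1, W)
neigh = []
for i in range(n):
    r = list(s1)
    r[i] = -r[i]                    # flip bit i to obtain r^{i+1}
    neigh.append(energy(tuple(r), W))
print(e0, neigh)   # -16.0, then the neighbor energies -8/-10 listed above
```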
Figure credit. Neural network zoo.
D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive science, 9(1):147–169, 1985.
E. H. L. Aarts and J. H. M. Korst. Boltzmann machines and their applications. International Conference on Parallel Architectures and Languages Europe. Springer, Berlin, Heidelberg, 1987.
\( s^1 = (-1,-1,-1,-1) \)
\( s^2 = (-1,-1,-1,1) \)
\( \dots \)
\( s^{16} = (1,1,1,1) \)
\[ W = \begin{pmatrix} 0& 2& 2& 0 & 1 & 1 \\ 2& 0& -2& -2 & -1& 2 \\ 2& -2& 0& 0 & 2 & 0 \\ 0& -2& 0& 0 & -1& 0 \\ 1& -1& 2& -1 & 0 & 1 \\ 1& 2& 0& 0 & 1 & 0 \end{pmatrix} \]
\[ E(v,h) = -\sum_i v_ib_i -\sum_k h_k d_k -\sum_{i,j} v_iv_jw_{i,j} -\sum_{i,k} v_ih_k w_{i,k} -\sum_{k,l} h_kh_lw_{k,l} \]
\( v \): visible units.
\( h \): hidden units.
\( w \): weights.
\( i,j,k,l \): indices of the units.
\( z_i(t) = b_i + \sum_j w_{ij} s_j(t) \)
\( p(s_i(t+1)=1) = \frac{1}{1+e^{-z_i}} \)
\( p(s_i(t+1)=-1) = 1-\frac{1}{1+e^{-z_i}} \)
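The stochastic update rule can be sketched directly from the two equations above (a minimal illustration; the tiny weight matrix is made up here):

```python
import math
import random

def p_on(s, W, b, i):
    """p(s_i(t+1) = 1) = 1 / (1 + exp(-z_i)), z_i = b_i + sum_j w_ij s_j."""
    z = b[i] + sum(W[i][j] * s[j] for j in range(len(s)) if j != i)
    return 1.0 / (1.0 + math.exp(-z))

def boltzmann_update(s, W, b, i, rng=random):
    # Sample the new +/-1 state of unit i with probability p_on.
    return 1 if rng.random() < p_on(s, W, b, i) else -1

W = [[0, 2], [2, 0]]
b = [0.0, 0.0]
print(p_on((1, 1), W, b, 0))   # z = 2, so sigma(2) ~ 0.88
```

Unlike the deterministic Hopfield rule, the unit switches on only with probability \( \sigma(z_i) \), so the network can occasionally move uphill in energy.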
\( s_i(t+1) = sign(\sum_j w_{ij} s_j(t) + b_i) \)
\[ sign(x) = \begin{cases} 1, & \mbox{if } x \geq 0 \\ -1, & \mbox{otherwise} \end{cases} \]
The probability distribution of any global state is computed as: \[ p({\bf{v}},{\bf{h}}) = \frac{e^{-E({\bf{v}},{\bf{h}})}}{Z}, \]
where \(Z\) is the normalizing constant: \[ Z = \sum_{{\bf{v}},{\bf{h}}} e^{-E({\bf{v}},{\bf{h}})} \]
D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive science, 9(1):147–169. 1985.
The probabilities of visible units are the sum of the probabilities of all the joint configurations that contain them: \[ p({\bf{v}}) = \sum_{{\bf{h}}} p({\bf{v}},{\bf{h}}) = \frac{\sum_{{\bf{h}}} e^{-E({\bf{v}},{\bf{h}})}}{Z}. \]
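For a network small enough to enumerate, \( p({\bf{v}},{\bf{h}}) \), \(Z\), and the marginal \( p({\bf{v}}) \) can be computed exactly by brute force. A sketch with made-up weights and biases (2 visible units plus 1 hidden unit):

```python
import math
from itertools import product

# Symmetric pairwise weights, each pair listed once; last unit is hidden.
W = {(0, 1): 1.0, (0, 2): 0.5, (1, 2): -0.5}
b = [0.1, -0.2, 0.0]

def energy(s):
    return (-sum(w * s[i] * s[j] for (i, j), w in W.items())
            - sum(b[i] * s[i] for i in range(3)))

states = list(product([-1, 1], repeat=3))
Z = sum(math.exp(-energy(s)) for s in states)          # partition function
p = {s: math.exp(-energy(s)) / Z for s in states}      # p(v, h)
p_v = {v: sum(p[v + (h,)] for h in (-1, 1))            # marginal p(v)
       for v in product([-1, 1], repeat=2)}
print(round(sum(p.values()), 6), round(sum(p_v.values()), 6))  # both 1.0
```

Enumeration is only feasible for toy networks; \(Z\) involves a sum over all \(2^n\) states, which is why sampling methods are used in practice.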
D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive science, 9(1):147–169. 1985.
Figure credit. Neural network zoo.
P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, DTIC Document, 1986.
\( s^1 = (-1,-1,-1,-1) \)
\( s^2 = (-1,-1,-1,1) \)
\( \dots \)
\( s^{16} = (1,1,1,1) \)
\[ W = \begin{pmatrix} 0& 0& 2& 0 & 1 & 1 \\ 0& 0& -2& -2 & -1& 2 \\ 2& -2& 0& 0 & 0 & 0 \\ 0& -2& 0& 0 & 0 & 0 \\ 1& -1& 0& 0 & 0 & 0 \\ 1& 2& 0& 0 & 0 & 0 \end{pmatrix} \]
\[ E(v,h) = -\sum_i v_ib_i -\sum_k h_k d_k -\sum_{i,k} v_ih_k w_{i,k} \]
\( v \): visible units.
\( h \): hidden units.
\( w \): bidirectional weights.
\( i,k \): indices of the units.
The visible nodes are conditionally independent among themselves given the hidden nodes.
The hidden nodes are conditionally independent among themselves given the visible nodes.
The probability distribution of any global state is computed as: \[ p({\bf{v}},{\bf{h}}) = \frac{e^{-E({\bf{v}},{\bf{h}})}}{Z}, \]
where \(Z\) is the normalizing constant: \[ Z = \sum_{{\bf{v}},{\bf{h}}} e^{-E({\bf{v}},{\bf{h}})} \]
The probabilities of visible units are the sum of the probabilities of all the joint configurations that contain them: \[ p({\bf{v}}) = \sum_{{\bf{h}}} p({\bf{v}},{\bf{h}}) = \frac{\sum_{{\bf{h}}} e^{-E({\bf{v}},{\bf{h}})}}{Z}. \]
Given the hidden states, the visible states are reconstructed according to: \[ p(v_i=1|{\bf{h}}) = \phi \left( \sum_j w_{i,j} h_j + b_i \right), \]
Weights and biases are updated as: \[ \begin{align} w_{i,j}^{\prime} &= w_{i,j} + \eta \left( \langle v_i h_j \rangle_O - \langle v_i h_j\rangle_S \right) \\ b_{i}^{\prime} &= b_{i} + \eta \left( \langle v_i \rangle_O - \langle v_i \rangle_S \right) \\ d_{j}^{\prime} &= d_{j} + \eta \left( \langle h_j \rangle_O - \langle h_j \rangle_S \right) \end{align} \]
where \(\eta\) is the learning rate, \(\langle \cdot \rangle_O\) denotes expectations over the original (data-clamped) states of the neurons, and \( \langle \cdot \rangle_S\) denotes expectations over the states of the neurons after the \( S \)-step reconstruction.
The inputs are clamped to the visible units. The hidden states are determined using Gibbs sampling according to: \[ p(h_j=1|{\bf{v}}) = \phi \left( \sum_i w_{i,j} v_i + d_j \right), \]
where \(\phi(x) \) is the logistic function.
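The whole training step can be sketched as one contrastive-divergence-style update with \(S=1\) reconstruction step, following the equations above. This is a toy illustration, not a reference implementation: sizes, data, and hyperparameters are made up, and \(\{0,1\}\) units are used, as is common for RBMs.

```python
import math
import random

random.seed(0)
n_v, n_h, eta = 3, 2, 0.1
W = [[0.0] * n_h for _ in range(n_v)]       # weights w_{i,j}
b = [0.0] * n_v                             # visible biases b_i
d = [0.0] * n_h                             # hidden biases d_j

phi = lambda x: 1.0 / (1.0 + math.exp(-x))  # logistic function

def sample_h(v):                            # p(h_j=1|v) = phi(sum_i w_ij v_i + d_j)
    return [1 if random.random() < phi(d[j] + sum(W[i][j] * v[i] for i in range(n_v)))
            else 0 for j in range(n_h)]

def sample_v(h):                            # p(v_i=1|h) = phi(sum_j w_ij h_j + b_i)
    return [1 if random.random() < phi(b[i] + sum(W[i][j] * h[j] for j in range(n_h)))
            else 0 for i in range(n_v)]

v0 = [1, 0, 1]                 # made-up data clamped on the visible units
h0 = sample_h(v0)              # <.>_O statistics
v1 = sample_v(h0)              # 1-step reconstruction
h1 = sample_h(v1)              # <.>_S statistics

# Parameter updates from the rules above.
for i in range(n_v):
    for j in range(n_h):
        W[i][j] += eta * (v0[i] * h0[j] - v1[i] * h1[j])
for i in range(n_v):
    b[i] += eta * (v0[i] - v1[i])
for j in range(n_h):
    d[j] += eta * (h0[j] - h1[j])
```

In practice this step is repeated over many data vectors, and the single-sample products are replaced by averages over a batch.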