Undercomplete representation forces the autoencoder to capture the most salient features of the training data.
The learning process minimizes a loss function \( L({\bf{x}},g(f({\bf{x}}))) \), where \(L\) penalizes \(g(f({\bf{x}}))\) for being dissimilar to \({\bf{x}}\).
A common choice of loss function is the mean squared error (see the sketch below).
The capacity of the encoding and decoding functions, whether linear or nonlinear, influences the quality of the reconstruction.
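A minimal sketch of such an undercomplete autoencoder trained with mean squared error, assuming PyTorch, a flattened 784-dimensional input, and a 32-dimensional code; the layer sizes and the dummy batch are illustrative choices, not prescribed by the notes.

```python
import torch
import torch.nn as nn

# Undercomplete autoencoder: the code h = f(x) has fewer dimensions than x,
# so the network must keep only the most salient features to reconstruct x.
class UndercompleteAE(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))           # f(x)
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))          # g(h)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = UndercompleteAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                      # L(x, g(f(x))) = mean squared error

x = torch.rand(64, 784)                     # dummy batch standing in for real data
reconstruction = model(x)
loss = loss_fn(reconstruction, x)           # penalizes g(f(x)) for being dissimilar to x
loss.backward()
optimizer.step()
```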
Sparse autoencoder (SAE): An autoencoder whose training criterion involves a sparsity penalty.
Denoising autoencoder (DAE): Noise is added to the input \( {\bf{x}} \) to create a noisy or corrupted version \( \tilde{{\bf{x}}} \). The goal of the denoising autoencoder is to learn to recover \( {\bf{x}} \) from \( \tilde{{\bf{x}}} \).
Contractive autoencoder (CAE): An explicit regularizer is introduced to encourage the derivatives of the encoding function \(f({\bf{x}})\) to be as small as possible.
Characteristics
Try to adjust the capacity of the encoder and decoder based on the complexity of the data to be modeled.
Use a loss function that encourages the model to have other properties (e.g., sparsity; see the sketch after this block) besides the ability to copy its input to its output.
A regularized autoencoder can be nonlinear and overcomplete but still learn something useful about the data distribution.
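As one concrete instance of such a regularized loss, the sketch below adds an L1 sparsity penalty \( \Omega({\bf{h}}) = \lambda \sum_{i} |h_i| \) on the code to the reconstruction loss; the L1 form, the weight \( \lambda \), and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Regularized (sparse) autoencoder loss: L(x, g(f(x))) + Omega(h),
# here with an L1 penalty on the code h to encourage sparsity.
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))
decoder = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784))

lam = 1e-3                                  # illustrative sparsity weight (lambda)
x = torch.rand(64, 784)                     # dummy batch
h = encoder(x)                              # code h = f(x)
x_hat = decoder(h)                          # reconstruction g(h)

reconstruction_loss = nn.functional.mse_loss(x_hat, x)
sparsity_penalty = lam * h.abs().sum(dim=1).mean()     # Omega(h) = lambda * sum_i |h_i|
loss = reconstruction_loss + sparsity_penalty
loss.backward()
```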
In principle, the goal of the denoising autoencoder is to reconstruct an input that has been corrupted by some sort of noise.
Other previous works have proposed using multilayer perceptrons for denoising data.
However, the denoising autoencoder is intended not merely to learn to denoise its input but to learn a good internal representation as a side effect of learning to denoise.
Sample a training example \( {\bf{x}} \) from the training data.
Sample a corrupted version \( {\bf{\tilde{x}}} \) from \( C({\bf{\tilde{x}}} \mid {\bf{x}}) \), where \( C \) represents a given corruption process.
Use \( ({\bf{x}},{\bf{\tilde{x}}}) \) as a training example for estimating the autoencoder distribution \( p_{reconstruct}({\bf{x}} \mid {\bf{\tilde{x}}}) = p_{decoder}({\bf{x}} \mid {\bf{h}}) \), with \( {\bf{h}} \) the output of the encoder \( f({\bf{\tilde{x}}}) \) and \( p_{decoder} \) typically defined by a decoder \( g({\bf{h}}) \).
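A minimal sketch of this training procedure, assuming additive Gaussian noise as the corruption process \( C \); the noise level and the architecture are illustrative.

```python
import torch
import torch.nn as nn

# Denoising autoencoder step: corrupt x, encode the corrupted version,
# but compute the reconstruction loss against the clean x.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, 784)                      # clean training examples (dummy batch)
x_tilde = x + 0.3 * torch.randn_like(x)      # corruption process C(x_tilde | x): additive Gaussian noise

h = encoder(x_tilde)                         # h = f(x_tilde)
x_hat = decoder(h)                           # reconstruction g(h)
loss = nn.functional.mse_loss(x_hat, x)      # compared with the clean x, not x_tilde
loss.backward()
optimizer.step()
```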
As in sparse autoencoders, a regularization penalty is added to the reconstruction loss:
\begin{equation}
L({\bf{x}},g(f({\bf{x}}))) + \Omega({\bf{h}},{\bf{x}})
\end{equation}
But with a different choice of \(\Omega\):
\begin{equation}
\Omega({\bf{h}},{\bf{x}}) = \lambda \sum_{i} ||\nabla_{{\bf{x}}} h_i||^2
\end{equation}
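A sketch of how this penalty can be computed with automatic differentiation, assuming a small single-layer encoder so that looping over the code units stays cheap; the dimensions and the weight \( \lambda \) are illustrative.

```python
import torch
import torch.nn as nn

# Contractive penalty: squared Frobenius norm of the Jacobian dh/dx,
# computed by differentiating each code unit h_i with respect to x.
encoder = nn.Sequential(nn.Linear(20, 8), nn.Sigmoid())   # small sizes keep the loop cheap
decoder = nn.Linear(8, 20)

lam = 0.1                                    # illustrative penalty weight (lambda)
x = torch.rand(16, 20, requires_grad=True)   # gradients with respect to x are needed
h = encoder(x)
x_hat = decoder(h)

penalty = 0.0
for i in range(h.shape[1]):                  # Omega = sum_i || grad_x h_i ||^2
    grad_i = torch.autograd.grad(h[:, i].sum(), x, create_graph=True, retain_graph=True)[0]
    penalty = penalty + (grad_i ** 2).sum()
penalty = penalty / x.shape[0]               # average over the batch

loss = nn.functional.mse_loss(x_hat, x) + lam * penalty
loss.backward()
```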
Characteristics
The model is forced to learn a function that does not change much when \({\bf{x}}\) changes slightly.
It has theoretical connections to denoising autoencoders.
The CAE is contractive only locally, i.e., all perturbations of a training point \({\bf{x}}\) are mapped near to \(f({\bf{x}}) \).
Generative Modeling with DNNs
Network generated faces
Generative modeling
Generative modeling deals with models of distributions \( p({\bf{x}}) \), defined over datapoints \( {\bf{x}} \) in some high-dimensional space \( \mathcal{X} \).
Since learning the exact distribution is usually impossible, the goal is to learn an approximate distribution that is as accurate as possible according to some metric.
For an image, the \( {\bf{x}} \) values that look like real images should be given a high probability, whereas images that look like random noise should be given a low probability.
Instead of computing the probabilities, usually the goal is to produce more examples that are like those already in a database, but not exactly the same.
More formally, let us suppose we have a dataset of examples \( {\bf{x}} \) distributed according to some unknown distribution \( p_{gt}({\bf{x}}) \).
The goal is to learn a model \( p({\bf{x}}) \) which we can sample from, such that \( p({\bf{x}}) \) is as similar as possible to \( p_{gt}({\bf{x}}) \).
Training this type of model has been a long-standing problem in the machine learning community.
Generative adversarial nets are trained by simultaneously updating the discriminative distribution (\(D\), blue, dashed line) so that it discriminates samples from the data-generating distribution (black, dotted line) \( p_{\bf{x}} \) from those of the generative distribution \( p_g \) (\(G\)).
One of the goals is to learn the generator's distribution \( p_g \) over data \( {\bf{x}}\).
To do this, a prior on input noise variables \( p_z({\bf{z}}) \) is defined.
A mapping to data space \({\bf{\tilde{x}}} = G({\bf{z}}, \theta_g)\) is also defined where \( G \) is a differentiable function represented by a multilayer perceptron with parameters \( \theta_g \).
A second multi-layer perceptron \(D({\bf{x}}, \theta_d)\) is also defined.
\(D({\bf{x}}, \theta_d)\) outputs a scalar representing the probability that \( {\bf{x}} \) came from the data rather than from \( p_g \).
\(D \) and \(G \) play the following two-player minimax game with value function \(V(D,G) \):
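\begin{equation}
\min_{G} \max_{D} V(D,G) = \mathbb{E}_{{\bf{x}} \sim p_{data}({\bf{x}})}[\log D({\bf{x}})] + \mathbb{E}_{{\bf{z}} \sim p_{z}({\bf{z}})}[\log (1 - D(G({\bf{z}})))]
\end{equation}
A minimal sketch of the corresponding alternating updates, using small MLPs for \( G \) and \( D \) and the commonly used non-saturating generator loss; the sizes, learning rates, and dummy data are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal alternating update for the minimax game between G and D.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784))                # generator G(z, theta_g)
D = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())   # discriminator D(x, theta_d)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

x_real = torch.rand(32, 784)                 # dummy batch standing in for real data
z = torch.randn(32, 16)                      # z ~ p_z(z)
x_fake = G(z)                                # x_tilde = G(z, theta_g)

# Discriminator step: push D(x_real) toward 1 and D(G(z)) toward 0.
d_loss = bce(D(x_real), torch.ones(32, 1)) + bce(D(x_fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: push D(G(z)) toward 1 (non-saturating variant of the minimax objective).
g_loss = bce(D(x_fake), torch.ones(32, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```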
Like traditional autoencoders, variational autoencoders (VAEs) can be split into two components: an encoder and a decoder.
They are probabilistic autoencoders, since their outputs are partly determined by chance (as opposed to DAEs, which use randomness only during training).
They can generate new instances that look as if they were sampled from the training set (generative autoencoders).
Similar to RBMs, but they are easier to train and the sampling process is much faster.
Learning a distribution of images that belong to the same class can be difficult.
In the example above, the second image (b) is a corrupted version of the first image (a). The third image (c) is identical to (a) but shifted by two pixels.
Thinking of a way to detect this type of similarity given a wide range of transformations is difficult.
In some highly complex input domains (e.g., image analysis), assuming a latent representation can help to model the problem.
Latent space: We assume that the distribution over the observed variables \( {\bf{x}} \) is the consequence of a distribution over some set of hidden variables \( {\bf{z}} \sim p({\bf{z}}) \).
Inference is the process of disentangling these rich real-world dependencies into simplified latent dependencies, by predicting \(p({\bf{z}}|{\bf{x}})\).
In the decoder phase, a sampled \( {\bf{z}} \) is passed to the decoder/generative network.
The decoder uses the learned conditional distribution over input space to reconstruct an input according to \(\tilde{\bf{x}} \sim p_{\theta}({\bf{x}}|{\bf{z}})\).
The actual coding is sampled from the Gaussian distribution with the learned parameters.
The key idea behind the variational autoencoder is to attempt to sample values of \({\bf{z}}\) that are likely to have produced \({\bf{x}}\), and compute \( p({\bf{x}})\) just from those.
The encoder maps observed inputs to (approximate) posterior distributions over latent space.
The decoder works as a generative network. It maps arbitrary latent coordinates back to distributions over the original data space.
The latent variable model can be seen as a probability distribution \( p({\bf{x}}|{\bf{z}}) \) describing the generative process (how \( {\bf{x}} \) is generated from \( {\bf{z}} \)).
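A minimal sketch of this encoder/decoder pass, with the encoder producing the mean and log-variance of the approximate posterior and \( {\bf{z}} \) sampled via the reparameterization trick; the architecture and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal VAE forward pass: the encoder outputs the parameters (mu, log-variance)
# of an approximate posterior q(z|x); z is sampled with the reparameterization trick;
# the decoder maps z back to a distribution over the input space.
class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)    # z ~ q(z|x), reparameterized
        return self.dec(z), mu, logvar                             # x_tilde ~ p_theta(x|z)

model = VAE()
x = torch.rand(64, 784)                      # dummy batch
x_hat, mu, logvar = model(x)

# Negative ELBO: reconstruction loss plus KL(q(z|x) || p(z)), with p(z) a standard Gaussian.
recon = nn.functional.binary_cross_entropy(x_hat, x, reduction='sum')
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
loss = recon + kl
loss.backward()
```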
In machine learning, a manifold can serve to capture, in a low-dimensional space, the characteristics of data that lives in a high-dimensional space.
In contrast to PCA and other methods, manifold learning can provide more powerful nonlinear dimensionality reduction by preserving the local structure of the input data.
The usability of manifolds can be measured in terms of their ability to capture variability across the data and in terms of their generative capability.
Manifold learning has mostly focused on unsupervised learning procedures that attempt to capture these manifolds.