Roberto Santana and Unai Garciarena
Department of Computer Science and Artificial Intelligence
University of the Basque Country
I. Goodfellow, Y. Bengio and A. Courville. Deep Learning. MIT Press. 2016.
P. Domingos. A few useful things to know about machine learning. Communications of the ACM, 55(10), 78-87. 2012.
Representation:
  Instances
    -- K-nearest neighbor
    -- Support vector machines
  Hyperplanes
    -- Naive Bayes
    -- Logistic regression
  Decision trees
  Set of rules
    -- Propositional rules
    -- Logic programs
  Neural networks
  Graphical models
    -- Bayesian networks

Evaluation:
  Accuracy/Error rate
  Precision and recall
  Squared error
  Likelihood
  Posterior probability
  Information gain
  KL divergence
  Cost/Utility
  Margin

Optimization:
  Combinatorial optimization
    -- Greedy search
    -- Beam search
    -- Branch-and-bound
  Continuous optimization
    - Unconstrained
      -- Gradient descent
      -- Conjugate gradient
      -- Quasi-Newton methods
    - Constrained
      -- Linear programming
      -- Quadratic programming
Let \( p({\bf{x}}) \) be a probability distribution defined on a discrete random variable (feature) \( {\bf{X}} \) taking values \( {\bf{x}} \). \( p({\bf{x}}) \) satisfies the following:
\[ P[{\bf{X}} = {\bf{x}}] = p({\bf{x}}) \]
\[ p({\bf{x}}) \geq 0 \; \; \forall {\bf{x}} \]
\[ \sum_{{\bf{x}}} p({\bf{x}}) = 1 \]
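A minimal sketch of how these properties can be checked numerically (the probability values below are made up for illustration):

\begin{verbatim}
# Illustrative discrete distribution p(x); the values are arbitrary.
p = {0: 0.2, 1: 0.5, 2: 0.3}

# p(x) >= 0 for all x
assert all(prob >= 0 for prob in p.values())

# sum_x p(x) = 1 (up to floating-point tolerance)
assert abs(sum(p.values()) - 1.0) < 1e-12
\end{verbatim}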
C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press. 1995.
A set of \(N\) tuples \( (x^1,y^1), \dots, (x^N,y^N) \) is given, where \(y\) is the target or dependent variable and \(x\) is the covariate, independent variable or predictor. The task is to predict \(y\) given \(x\).
General regression model: \[ y = f(x) + \epsilon \]
where \( \epsilon \) is the irreducible error, which does not depend on \(x\).
Linear regression model: \[ f(x) = \beta_1 x + \beta_0 \]
Linear regression estimate: \[ \hat{y} = \hat{\beta_1} x + \hat{\beta_0} \]
The residual error is the difference between the prediction and the true value. \[ e^i = y^i - \hat{y}^i \]
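For instance, with illustrative numbers: if \( \hat{\beta_1} = 2 \), \( \hat{\beta_0} = 1 \) and the \(i\)-th observation is \( (x^i, y^i) = (3, 8) \), then \( \hat{y}^i = 2 \cdot 3 + 1 = 7 \) and \( e^i = y^i - \hat{y}^i = 8 - 7 = 1 \).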
K. P. Murphy. Machine learning. A probabilistic perspective. MIT Press. 2012.
The mean squared error is usually used: \[ MSE = \frac{1}{N} \sum_{i=1}^N (y^i - \hat{y}^i)^2 \]
The parameters of the model that minimize this error are learned: \[ \arg \min_{\beta_0,\beta_1} \frac{1}{N} \sum_{i=1}^N (y^i - (\beta_1 x^i + \beta_0))^2 \]
Differentiating with respect to \( \beta_0 \) and \( \beta_1 \) and setting the derivatives to \(0\), we get: \[ \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}; \; \; \hat{\beta_1} = \frac{\sum_{i=1}^N x^i y^i - N \bar{x} \bar{y}} {\sum_{i=1}^N (x^i)^2 - N \bar{x}^2} \] where \(\bar{x}\) and \(\bar{y}\) are the means of \(x\) and \(y\) as computed from the data.
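A minimal numerical sketch of these estimators (the data below are synthetic, generated only for illustration):

\begin{verbatim}
import numpy as np

# Synthetic data: y is approximately 2x + 1 plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=100)

# Closed-form estimates from the formulas above.
N = len(x)
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = (np.sum(x * y) - N * x_bar * y_bar) / (np.sum(x ** 2) - N * x_bar ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Predictions, residuals and mean squared error.
y_hat = beta1_hat * x + beta0_hat
mse = np.mean((y - y_hat) ** 2)
print(beta0_hat, beta1_hat, mse)  # beta0_hat ~ 1, beta1_hat ~ 2
\end{verbatim}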
In multiple linear regression we have multiple covariates, represented as a vector \({\bf{x}} \). The model is linear in the covariates.
Multiple linear regression model: \[ y = f({\bf{x}}) + \epsilon = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon \]
Let \( {\bf{w}} = (\beta_0, \beta_1, \dots, \beta_n)^T \) be the model weight vector containing the \(\beta\) values, and augment the input with a constant component \( x_0 = 1 \), so that \( {\bf{w}}^T {\bf{x}} \) represents the inner or scalar product between the augmented input vector \( {\bf{x}} = (x_0, x_1, \dots, x_n)^T \) and the weight vector.
Then, the multiple linear regression model in matrix form is expressed as: \[ y({\bf{x}}) = {\bf{w}}^T {\bf{x}} + \epsilon = \sum_{j=0}^{n} w_j x_j + \epsilon \]
Estimates of \( {\bf{w}} \) are found by minimizing the MSE, in a way similar to the single-covariate case; this is how the parameters of the model are learned.
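A sketch of the multivariate case on synthetic data, using NumPy's least-squares solver (minimizing the MSE is a linear least-squares problem; a column of ones is appended so that the first weight plays the role of \( \beta_0 \)):

\begin{verbatim}
import numpy as np

# Synthetic data: 3 covariates with known weights plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = 3.0 + X @ true_w + rng.normal(0, 0.1, size=200)

# Augment X with a constant column x_0 = 1 to absorb the intercept beta_0.
X_aug = np.hstack([np.ones((200, 1)), X])

# Least-squares solution of the MSE minimization problem.
w_hat, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(w_hat)  # approximately [3.0, 1.0, -2.0, 0.5]
\end{verbatim}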