Roberto Santana and Unai Garciarena
Department of Computer Science and Artificial Intelligence
University of the Basque Country
I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press. 2016.
P. Domingos. A few useful things to know about machine learning. Communications of the ACM, 55(10), 78-87. 2012.
Representation:
  Instances
    K-nearest neighbor
    Support vector machines
  Hyperplanes
    Naive Bayes
    Logistic regression
  Decision trees
  Sets of rules
    Propositional rules
    Logic programs
  Neural networks
  Graphical models
    Bayesian networks

Evaluation:
  Accuracy/Error rate
  Precision and recall
  Squared error
  Likelihood
  Posterior probability
  Information gain
  KL divergence
  Cost/Utility
  Margin

Optimization:
  Combinatorial optimization
    Greedy search
    Beam search
    Branch-and-bound
  Continuous optimization
    Unconstrained
      Gradient descent
      Conjugate gradient
      Quasi-Newton methods
    Constrained
      Linear programming
      Quadratic programming
Let \( p({\bf{x}}) \) be a probability distribution defined on a discrete random variable \( {\bf{X}} \) taking values \( {\bf{x}} \). \( p({\bf{x}}) \) satisfies the following:
\[ p[{\bf{X}} = {\bf{x}}] = p({\bf{x}}) \]
\[ p({\bf{x}}) \geq 0 \; \; \forall {\bf{x}} \]
\[ \sum_{{\bf{x}}} p({\bf{x}}) = 1 \]
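As a quick illustration with hypothetical values, a discrete distribution can be stored as a mapping from values of \( {\bf{x}} \) to probabilities, and the two conditions above checked directly:

```python
# Hypothetical discrete distribution p(x) over three values of x.
p = {0: 0.2, 1: 0.5, 2: 0.3}

# Non-negativity: p(x) >= 0 for every x.
assert all(prob >= 0 for prob in p.values())

# Normalization: the probabilities sum to 1 (up to floating-point tolerance).
assert abs(sum(p.values()) - 1.0) < 1e-12
```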
C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press. 2005.
A set of \(N\) tuples \( (x^1,y^1), \dots, (x^N,y^N) \) is given, where \(y\) is the target (dependent) variable and \(x\) is the covariate (independent variable or predictor). The task is to predict \(y\) given \(x\).
General regression model: \[ y = f(x) + \epsilon \]
where \( \epsilon \) is the irreducible error, which does not depend on \(x\).
Linear regression model: \[ f(x) = \beta_1 x + \beta_0 \]
Linear regression estimate: \[ \hat{y} = \hat{\beta_1} x + \hat{\beta_0} \]
The residual error is the difference between the true value and the prediction: \[ e^i = y^i - \hat{y}^i \]
K. P. Murphy. Machine learning. A probabilistic perspective. MIT Press. 2012.
The mean squared error is usually used: \[ MSE = \frac{1}{N} \sum_{i=1}^N (y^i - \hat{y}^i)^2 \]
The parameters of the model that minimize this error are learned: \[ \arg \min_{\beta_0,\beta_1} \frac{1}{N} \sum_{i=1}^N (y^i - (\beta_1 x^i + \beta_0))^2 \]
After differentiating with respect to \( \beta_0,\beta_1 \) and setting the derivatives equal to \(0\), we get: \[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}; \; \; \hat{\beta}_1 = \frac{\sum_{i=1}^N x^i y^i - N \bar{x} \bar{y}} {\sum_{i=1}^N (x^i)^2 - N \bar{x}^2} \] where \(\bar{x}\) and \(\bar{y}\) are the means of \(x\) and \(y\) as computed from the data.
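The closed-form estimates above can be sketched directly in code. The function name and the toy data below are illustrative, not part of the original material; the check uses noise-free data generated from \( y = 2x + 1 \), so the estimates should recover the true parameters exactly.

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Closed-form least-squares estimates for y = b1*x + b0."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: (sum x*y - N*x_bar*y_bar) / (sum x^2 - N*x_bar^2)
    b1 = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x**2) - n * x_bar**2)
    # Intercept: y_bar - b1*x_bar
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Noise-free data from y = 2x + 1; the fit recovers b0 = 1, b1 = 2.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
b0, b1 = fit_simple_linear_regression(x, y)
# → b0 = 1.0, b1 = 2.0
```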
In multiple linear regression we have multiple covariates, represented as a vector \({\bf{x}} \). The model is linear in the covariates.
Multiple linear regression model: \[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon \]
Let \( {\bf{w}} \) be the model weight vector containing the \(\beta\) values; then \( {\bf{w}}^T {\bf{x}} \) represents the inner (scalar) product between the input vector \( {\bf{x}} \) and the weight vector.
Prepending a constant component \( x_0 = 1 \) to \( {\bf{x}} \), so that \( w_0 = \beta_0 \) absorbs the intercept, the multiple linear regression model in matrix form is expressed as: \[ y({\bf{x}}) = {\bf{w}}^T {\bf{x}} + \epsilon = \sum_{j=0}^{n} w_jx_j + \epsilon \]
Estimates of \(w\) are found by minimizing the MSE in a way similar to the case of a single covariate. That way the parameters of the model are learned.
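As a minimal sketch of this estimation step (the function name and toy data below are illustrative), the weights that minimize the MSE can be obtained with NumPy's least-squares solver after prepending a column of ones to absorb the intercept:

```python
import numpy as np

def fit_multiple_linear_regression(X, y):
    """Least-squares weights for multiple linear regression.

    A column of ones is prepended so that w[0] plays the role of the
    intercept beta_0 and w[1:] correspond to beta_1..beta_n.
    """
    X = np.asarray(X, dtype=float)
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # x_0 = 1 column
    w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)     # minimizes ||X_aug w - y||^2
    return w

# Noise-free data from y = 1 + 2*x1 + 3*x2; the fit recovers (1, 2, 3).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]
w = fit_multiple_linear_regression(X, y)
# w ≈ [1.0, 2.0, 3.0]
```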