Machine Learning and Neural Networks

Roberto Santana and Unai Garciarena

Department of Computer Science and Artificial Intelligence

University of the Basque Country

Imputation

| Criteria/Flat | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | C1 | C2 | C3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Price | high | low | med. | high | low | med. | med. | high | med. | high | low |
| Distance to University | far | far | close | close | close | close | close | close | far | far | close |
| Parking | no | no | no | no | no | yes | no | no | no | no | yes |
| Cool Roommates? | cool | cool | cool | no | no | cool | cool | cool | cool | cool | no |
| Flat owner | nice | nice | not nice | nice | not nice | not nice | not nice | ? | nice | ? | ? |
| Heating for winter | no | no | no | yes | yes | no | yes | yes | no | no | yes |
| Distance to Bus | close | close | close | far | close | close | far | far | far | close | close |
| Room space | med. | large | small | small | small | med. | small | small | med. | small | small |
| Noisy area | no | yes | yes | no | no | yes | yes | no | no | no | no |
| Mother advice | yes | ? | no | ? | no | yes | yes | no | yes | no | no |
| Cat | no | yes | no | no | yes | yes | no | yes | yes | no | no |
| Kitchen | small | small | large | med. | med. | small | small | med. | large | small | small |
| Distance to beach | far | far | close | close | far | far | far | far | far | far | far |
| Floor | 2 | 7 | 1 | 1 | 0 | 3 | 1 | 2 | 4 | 0 | 3 |
| Elevator | no | yes | no | no | no | no | no | no | no | yes | yes |
| Bars around | yes | yes | yes | yes | no | yes | no | no | no | yes | no |
| Did (Will) I like it? | no | yes | no | no | no | yes | yes | no | ? | ? | ? |

Imputation methods

  1. Missing data problem: The values of some features may be absent from some observations.
  2. Solution 1: Remove all observations with missing data.
  3. Solution 2: Fill in the missing data using some logic (both options are contrasted in the sketch below).
  4. Solution 1 can significantly reduce our training set and sacrifices information.
  5. Solution 2 is called Imputation; the methods that implement it preserve the information but add some noise.
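As a point of comparison with the imputation cells below, here is a minimal sketch of Solution 1 (simply dropping every incomplete observation); the toy array is illustrative and not part of the notebook's data.
In [ ]:
# Solution 1 (illustrative sketch): drop every observation
# that contains at least one missing value.
import numpy as np

toy_data = np.array([[np.nan,  7.0,    6.0],
                     [   5.0, 89.0,   13.0],
                     [   2.0, 87.0, np.nan]])
complete_rows = toy_data[~np.isnan(toy_data).any(axis=1)]
print(complete_rows)  # only the fully observed row remains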
In [1]:
# Imputers
import numpy as np
from sklearn.impute import SimpleImputer  # Imputer was replaced by SimpleImputer in recent scikit-learn versions
In [2]:
my_data = np.array([[np.nan,   7,     6],
                    [  5   ,  89,    13],
                    [ 23   ,  12,   213],
                    [  2   ,  87, np.nan],
                    [  8   , 101,    71],
                    [ 13   , np.nan,  20]])
In [3]:
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
mean_imputer.fit(my_data)
imputed_data = mean_imputer.transform(my_data)
print(imputed_data)

[[  10.2    7.     6. ]
 [   5.    89.    13. ]
 [  23.    12.   213. ]
 [   2.    87.    64.6]
 [   8.   101.    71. ]
 [  13.    59.2   20. ]]
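Besides the mean, the same imputer supports other strategies (e.g., the median or the most frequent value); a minimal sketch on the same data, not part of the original notebook:
In [ ]:
# Median imputation on the same data (illustrative sketch)
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
median_imputed_data = median_imputer.fit_transform(my_data)
print(median_imputed_data)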

Normalization and discretization

  1. Normalizing the data: Some ML algorithms require that all features have the same range of values.
  2. Other algorithms assume that the variance of the features is of the same order.
  3. Standardization: Works by removing the mean and scaling to unit variance.
  4. Discretization: Consists of transforming continuous data into discrete values (a sketch appears after the binarization example below).
  5. Binarization: Continuous values are transformed into binary values.
  6. Discretization can be a requirement for ML methods that only accept discrete values (e.g., some methods used for learning Bayesian networks).
In [4]:
# Normalization or scalers
from sklearn.preprocessing import StandardScaler
In [5]:
scaler = StandardScaler()
scaler.fit(imputed_data)
scaled_data = scaler.transform(imputed_data)
print(scaled_data)

[[ -2.64412211e-16  -1.39837036e+00  -8.26676029e-01]
 [ -7.74024378e-01   7.98303387e-01  -7.27926333e-01]
 [  1.90529078e+00  -1.26442684e+00   2.09349356e+00]
 [ -1.22057690e+00   7.44725979e-01  -2.00473941e-16]
 [ -3.27471852e-01   1.11976784e+00   9.02854367e-02]
 [  4.16782357e-01   1.90345192e-16  -6.29176637e-01]]
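As a quick check of item 3 above, each column of the standardized data should have (numerically) zero mean and unit variance; a small follow-up sketch, not in the original notebook:
In [ ]:
# Column-wise mean and standard deviation of the standardized data
print(scaled_data.mean(axis=0))
print(scaled_data.std(axis=0))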
In [7]:
from sklearn.preprocessing import binarize
binarized_data = binarize(scaled_data)
print(binarized_data)

[[ 0.  0.  0.]
 [ 0.  1.  0.]
 [ 1.  0.  1.]
 [ 0.  1.  0.]
 [ 0.  1.  1.]
 [ 1.  1.  0.]]
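Binarization is the two-valued special case of item 4. For discretization into more than two values, scikit-learn provides KBinsDiscretizer; the sketch below is illustrative (the number of bins and the binning strategy are arbitrary choices, not part of the original notebook):
In [ ]:
# Discretize each continuous feature into 3 ordinal bins (illustrative sketch)
from sklearn.preprocessing import KBinsDiscretizer

discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
discretized_data = discretizer.fit_transform(scaled_data)
print(discretized_data)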

Feature selection and feature extraction

  1. Feature selection: From the original set of features, select a subset of informative features.
  2. Feature extraction or feature engineering: From the original set of features, create a new set of informative features.
  3. Feature selection strategies usually include filter methods, wrapper methods, and embedded methods (a wrapper-style example is sketched after the PCA cell below).
  4. Feature extraction approaches include dimensionality reduction methods, neural networks, and clustering-based algorithms.

Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, Vol. 23, No. 19, pp. 2507-2517, 2007.

In [8]:
# Feature selection (filtering using the ANOVA F-test, f_classif)
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
In [9]:
feature_selection = SelectKBest(f_classif, k=2)
my_class = np.array([1,1,1,0,0,0])
new_features = feature_selection.fit_transform(scaled_data, my_class)
print(new_features)

[[ -2.64412211e-16  -1.39837036e+00]
 [ -7.74024378e-01   7.98303387e-01]
 [  1.90529078e+00  -1.26442684e+00]
 [ -1.22057690e+00   7.44725979e-01]
 [ -3.27471852e-01   1.11976784e+00]
 [  4.16782357e-01   1.90345192e-16]]
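To see which of the original columns the filter kept, the fitted selector can be inspected; a small follow-up sketch, not in the original notebook:
In [ ]:
# Indices of the retained columns and the ANOVA F-scores of all features
print(feature_selection.get_support(indices=True))
print(feature_selection.scores_)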
In [10]:
# Feature extraction (dimensionality reduction)
from sklearn.decomposition import PCA
In [11]:
pca = PCA(n_components=2)
pca.fit(scaled_data)
reduced_data = pca.transform(scaled_data)
print(reduced_data)

[[ 0.31130824  1.57076484]
 [-1.32366387 -0.04615306]
 [ 3.04183574 -0.59333748]
 [-1.19300064 -0.52826786]
 [-0.76851544 -0.85238305]
 [-0.06796403  0.44937662]]
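The cells above cover a filter method (SelectKBest) and a dimensionality reduction method (PCA). As a minimal sketch of a wrapper-style selector mentioned in the list above, recursive feature elimination (RFE) repeatedly fits an estimator and discards the weakest feature; wrapping LogisticRegression here is an illustrative choice, not part of the original notebook:
In [ ]:
# Wrapper-style feature selection with recursive feature elimination (illustrative sketch)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

wrapper_selection = RFE(LogisticRegression(), n_features_to_select=2)
wrapped_features = wrapper_selection.fit_transform(scaled_data, my_class)
print(wrapper_selection.support_)  # mask of the retained features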

Pipelines

  1. ML Pipeline: The sequence of ML procedures (e.g., imputers, feature selection, classifiers) that is applied from the initial data to the final classification.
  2. Benefits:
    • Allow the encapsulation of several required pre-processing steps in a single workflow.
    • Make it easier to jointly select the parameters of the procedures in the pipeline (see the grid-search sketch after the accuracy cell below).
    • Convenient for the automatic application of ML to complex real-world problems.
  3. Limitation: The selection of the pipeline steps requires human intervention.
In [16]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
my_pipeline = Pipeline([('imputer', mean_imputer), ('standardize', scaler), ('reduce_dim', pca), ('clf', lr)])
In [17]:
# Making predictions with our pipeline (3-fold cross-validation, one observation per class in each fold)
from sklearn.model_selection import cross_val_predict
predicted_class = cross_val_predict(my_pipeline, my_data, my_class, cv=3)
print(predicted_class)

[1 0 0 0 0 1]
In [19]:
# Evaluating the accuracy of the pipeline
from sklearn import metrics
pipeline_accuracy = metrics.accuracy_score(my_class, predicted_class)
print("The accuracy of the pipeline is: ", pipeline_accuracy)

The accuracy of the pipeline is:  0.5
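Because each step of the pipeline exposes its parameters under the step's name (e.g., reduce_dim__n_components), all of them can be tuned jointly with a single search; a minimal sketch with an illustrative parameter grid, not part of the original notebook:
In [ ]:
# Joint selection of pipeline parameters (illustrative sketch).
# Parameters are addressed as <step name>__<parameter name>.
from sklearn.model_selection import GridSearchCV

param_grid = {'reduce_dim__n_components': [1, 2],
              'clf__C': [0.1, 1.0, 10.0]}
grid_search = GridSearchCV(my_pipeline, param_grid, cv=3)
grid_search.fit(my_data, my_class)
print(grid_search.best_params_)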

Pipelines

  1. Automatic pipeline generation: The decision about which elements form the pipeline is made by another ML algorithm.
  2. A search procedure evaluates the quality of several pipelines and chooses the best one.
  3. The space of pipelines is defined by all possible legal combinations of transformers and learning algorithms.
  4. One of the ML methods used for this task is TPOT, a genetic programming implementation that searches for pipelines suited to complex real-world problems.
In [24]:
# TPOT: Automatically finding pipelines
from tpot import TPOTClassifier

my_tpot = TPOTClassifier(generations=5, population_size=10, verbosity=2, random_state=16)
my_tpot.fit(features=my_data, target=my_class)
print("The pipeline learned by tpot is:")
my_tpot.fitted_pipeline_.steps
The pipeline learned by tpot is:
[('robustscaler',
  RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
         with_scaling=True)),
 ('logisticregression',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
            intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
            penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
            verbose=0, warm_start=False))]
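TPOT can also write the best pipeline it found out as a standalone Python script; a minimal sketch (the file name is only illustrative):
In [ ]:
# Export the discovered pipeline as a runnable Python script
my_tpot.export('tpot_best_pipeline.py')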