Machine Learning and Neural Networks

Roberto Santana and Unai Garciarena

Department of Computer Science and Artificial Intelligence

University of the Basque Country

Imputation

| Criteria/Flat | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | C1 | C2 | C3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Price | high | low | med. | high | low | med. | med. | high | med. | high | low |
| Distance to University | far | far | close | close | close | close | close | close | far | far | close |
| Parking | no | no | no | no | no | yes | no | no | no | no | yes |
| Cool Roommates? | cool | cool | cool | no | no | cool | cool | cool | cool | cool | no |
| Flat owner | nice | nice | not nice | nice | not nice | not nice | not nice | ? | nice | ? | ? |
| Heating for winter | no | no | no | yes | yes | no | yes | yes | no | no | yes |
| Distance to Bus | close | close | close | far | close | close | far | far | far | close | close |
| Room space | med. | large | small | small | small | med. | small | small | med. | small | small |
| Noisy area | no | yes | yes | no | no | yes | yes | no | no | no | no |
| Mother advice | yes | ? | no | ? | no | yes | yes | no | yes | no | no |
| Cat | no | yes | no | no | yes | yes | no | yes | yes | no | no |
| Kitchen | small | small | large | med. | med. | small | small | med. | large | small | small |
| Distance to beach | far | far | close | close | far | far | far | far | far | far | far |
| Floor | 2 | 7 | 1 | 1 | 0 | 3 | 1 | 2 | 4 | 0 | 3 |
| Elevator | no | yes | no | no | no | no | no | no | no | yes | yes |
| Bars around | yes | yes | yes | yes | no | yes | no | no | no | yes | no |
| Did (Will) I like it? | no | yes | no | no | no | yes | yes | no | ? | ? | ? |

Imputation methods

  1. Missing data problem: The values of some features may be absent from some observations.
  2. Solution 1: Remove all observations with missing data.
  3. Solution 2: Fill in the missing data using some logic (both options are contrasted in the sketch below).
  4. Solution 1 can significantly reduce our training set and sacrifices information.
  5. Solution 2 is called Imputation; the methods that implement it preserve the information but add some noise.
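As a point of comparison with the imputation cells below, here is a minimal sketch of Solution 1 (simply dropping every incomplete observation); the toy array is illustrative and not part of the notebook's data.
In [ ]:
# Solution 1 (illustrative sketch): drop every observation
# that contains at least one missing value.
import numpy as np

toy_data = np.array([[np.nan,  7.0,    6.0],
                     [   5.0, 89.0,   13.0],
                     [   2.0, 87.0, np.nan]])
complete_rows = toy_data[~np.isnan(toy_data).any(axis=1)]
print(complete_rows)  # only the fully observed row remains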
In [1]:
# Imputers
import numpy as np
from sklearn.impute import SimpleImputer  # Imputer was replaced by SimpleImputer in recent scikit-learn versions
In [2]:
my_data = np.array([[np.nan,   7,     6],
                    [  5   ,  89,    13],
                    [ 23   ,  12,   213],
                    [  2   ,  87, np.nan],
                    [  8   , 101,    71],
                    [ 13   , np.nan,  20]])
In [3]:
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
mean_imputer.fit(my_data)
imputed_data = mean_imputer.transform(my_data)
print(imputed_data)

[[  10.2    7.     6. ]
 [   5.    89.    13. ]
 [  23.    12.   213. ]
 [   2.    87.    64.6]
 [   8.   101.    71. ]
 [  13.    59.2   20. ]]
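Besides the mean, the same imputer supports other strategies (e.g., the median or the most frequent value); a minimal sketch on the same data, not part of the original notebook:
In [ ]:
# Median imputation on the same data (illustrative sketch)
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
median_imputed_data = median_imputer.fit_transform(my_data)
print(median_imputed_data)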

Normalization and discretization

  1. Normalizing the data: Some ML algorithms require that all features have the same range of values.
  2. Other algorithms assume that the variance of the features is of the same order.
  3. Standardization: Works by removing the mean and scaling to unit variance.
  4. Discretization: Consists of transforming continuous data into discrete values (a sketch appears after the binarization example below).
  5. Binarization: Continuous values are transformed into binary values.
  6. Discretization can be a requirement for ML methods that only accept discrete values (e.g., some methods used for learning Bayesian networks).
In [4]:
# Normalization or scalers
from sklearn.preprocessing import StandardScaler
In [5]:
scaler = StandardScaler()
scaler.fit(imputed_data)
scaled_data = scaler.transform(imputed_data)
print(scaled_data)

[[ -2.64412211e-16  -1.39837036e+00  -8.26676029e-01]
 [ -7.74024378e-01   7.98303387e-01  -7.27926333e-01]
 [  1.90529078e+00  -1.26442684e+00   2.09349356e+00]
 [ -1.22057690e+00   7.44725979e-01  -2.00473941e-16]
 [ -3.27471852e-01   1.11976784e+00   9.02854367e-02]
 [  4.16782357e-01   1.90345192e-16  -6.29176637e-01]]
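As a quick check of item 3 above, each column of the standardized data should have (numerically) zero mean and unit variance; a small follow-up sketch, not in the original notebook:
In [ ]:
# Column-wise mean and standard deviation of the standardized data
print(scaled_data.mean(axis=0))
print(scaled_data.std(axis=0))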
In [7]:
from sklearn.preprocessing import binarize
binarized_data = binarize(scaled_data)
print(binarized_data)

[[ 0.  0.  0.]
 [ 0.  1.  0.]
 [ 1.  0.  1.]
 [ 0.  1.  0.]
 [ 0.  1.  1.]
 [ 1.  1.  0.]]
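Binarization is the two-valued special case of item 4. For discretization into more than two values, scikit-learn provides KBinsDiscretizer; the sketch below is illustrative (the number of bins and the binning strategy are arbitrary choices, not part of the original notebook):
In [ ]:
# Discretize each continuous feature into 3 ordinal bins (illustrative sketch)
from sklearn.preprocessing import KBinsDiscretizer

discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
discretized_data = discretizer.fit_transform(scaled_data)
print(discretized_data)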

Feature selection and feature extraction

  1. Feature selection: From the original set of features, select a subset of informative features.
  2. Feature extraction or feature engineering: From the original set of features, create a new set of informative features.
  3. Feature selection strategies usually include filter methods, wrapper methods, and embedded methods (a wrapper-style example is sketched after the PCA cell below).
  4. Feature extraction approaches include dimensionality reduction methods, neural networks, and clustering-based algorithms.

Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, Vol. 23, No. 19, pp. 2507-2517, 2007.

In [8]:
# Feature selection (filtering using the ANOVA F-test, f_classif)
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
In [9]:
feature_selection = SelectKBest(f_classif, k=2)
my_class = np.array([1,1,1,0,0,0])
new_features = feature_selection.fit_transform(scaled_data, my_class)
print(new_features)

[[ -2.64412211e-16  -1.39837036e+00]
 [ -7.74024378e-01   7.98303387e-01]
 [  1.90529078e+00  -1.26442684e+00]
 [ -1.22057690e+00   7.44725979e-01]
 [ -3.27471852e-01   1.11976784e+00]
 [  4.16782357e-01   1.90345192e-16]]
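To see which of the original columns the filter kept, the fitted selector can be inspected; a small follow-up sketch, not in the original notebook:
In [ ]:
# Indices of the retained columns and the ANOVA F-scores of all features
print(feature_selection.get_support(indices=True))
print(feature_selection.scores_)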
In [10]:
# Feature extraction (dimensionality reduction)
from sklearn.decomposition import PCA
In [11]:
pca = PCA(n_components=2)
pca.fit(scaled_data)
reduced_data = pca.transform(scaled_data)
print(reduced_data)

[[ 0.31130824  1.57076484]
 [-1.32366387 -0.04615306]
 [ 3.04183574 -0.59333748]
 [-1.19300064 -0.52826786]
 [-0.76851544 -0.85238305]
 [-0.06796403  0.44937662]]
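The cells above cover a filter method (SelectKBest) and a dimensionality reduction method (PCA). As a minimal sketch of a wrapper-style selector mentioned in the list above, recursive feature elimination (RFE) repeatedly fits an estimator and discards the weakest feature; wrapping LogisticRegression here is an illustrative choice, not part of the original notebook:
In [ ]:
# Wrapper-style feature selection with recursive feature elimination (illustrative sketch)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

wrapper_selection = RFE(LogisticRegression(), n_features_to_select=2)
wrapped_features = wrapper_selection.fit_transform(scaled_data, my_class)
print(wrapper_selection.support_)  # mask of the retained features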

Pipelines

  1. ML Pipeline: The sequence of ML procedures (e.g., imputers, feature selection, classifiers) that is applied from the initial data to the final classification.
  2. Benefits:
    • Allow the encapsulation of several required pre-processing steps in a single workflow.
    • Make it easier to jointly select the parameters of the procedures in the pipeline (see the grid-search sketch after the accuracy cell below).
    • Convenient for the automatic application of ML to complex real-world problems.
  3. Limitation: The selection of the pipeline steps requires human intervention.
In [16]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
my_pipeline = Pipeline([('imputer', mean_imputer), ('standardize', scaler), ('reduce_dim', pca), ('clf', lr)])
In [17]:
# Making predictions with our pipeline (3-fold cross-validation, one observation per class in each fold)
from sklearn.model_selection import cross_val_predict
predicted_class = cross_val_predict(my_pipeline, my_data, my_class, cv=3)
print(predicted_class)

[1 0 0 0 0 1]
In [19]:
# Evaluating the accuracy of the pipeline
from sklearn import metrics
pipeline_accuracy = metrics.accuracy_score(my_class, predicted_class)
print("The accuracy of the pipeline is: ", pipeline_accuracy)

The accuracy of the pipeline is:  0.5
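Because each step of the pipeline exposes its parameters under the step's name (e.g., reduce_dim__n_components), all of them can be tuned jointly with a single search; a minimal sketch with an illustrative parameter grid, not part of the original notebook:
In [ ]:
# Joint selection of pipeline parameters (illustrative sketch).
# Parameters are addressed as <step name>__<parameter name>.
from sklearn.model_selection import GridSearchCV

param_grid = {'reduce_dim__n_components': [1, 2],
              'clf__C': [0.1, 1.0, 10.0]}
grid_search = GridSearchCV(my_pipeline, param_grid, cv=3)
grid_search.fit(my_data, my_class)
print(grid_search.best_params_)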

Pipelines

  1. Automatic pipeline generation: The decision about which elements form the pipeline is made by another ML algorithm.
  2. A search procedure evaluates the quality of several pipelines and chooses the best one.
  3. The space of pipelines is defined by all possible legal combinations of transformers and learning algorithms.
  4. One of the ML methods used for this task is TPOT, a genetic programming implementation that searches for pipelines suited to complex real-world problems.
In [24]:
# TPOT: Automatically finding pipelines
from tpot import TPOTClassifier

my_tpot = TPOTClassifier(generations=5, population_size=10, verbosity=2, random_state=16)
my_tpot.fit(features=my_data, target=my_class)
print("The pipeline learned by tpot is:")
my_tpot.fitted_pipeline_.steps
The pipeline learned by tpot is:
[('robustscaler',
  RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
         with_scaling=True)),
 ('logisticregression',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
            intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
            penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
            verbose=0, warm_start=False))]
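TPOT can also write the best pipeline it found out as a standalone Python script; a minimal sketch (the file name is only illustrative):
In [ ]:
# Export the discovered pipeline as a runnable Python script
my_tpot.export('tpot_best_pipeline.py')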