Roberto Santana and Unai Garciarena
Department of Computer Science and Artificial Intelligence
University of the Basque Country
Criteria/Flat | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | C1 | C2 | C3 |
---|---|---|---|---|---|---|---|---|---|---|---|
Price | high | low | med. | high | low | med. | med. | high | med. | high | low |
Distance to University | far | far | close | close | close | close | close | close | far | far | close |
Parking | no | no | no | no | no | yes | no | no | no | no | yes |
Cool Roommates? | cool | cool | cool | no | no | cool | cool | cool | cool | cool | no |
Flat owner | nice | nice | not nice | nice | not nice | not nice | not nice | ? | nice | ? | ? |
Heating for winter | no | no | no | yes | yes | no | yes | yes | no | no | yes |
Distance to Bus | close | close | close | far | close | close | far | far | far | close | close |
Room space | med. | large | small | small | small | med. | small | small | med. | small | small |
Noisy area | no | yes | yes | no | no | yes | yes | no | no | no | no |
Mother advice | yes | ? | no | ? | no | yes | yes | no | yes | no | no |
Cat | no | yes | no | no | yes | yes | no | yes | yes | no | no |
Kitchen | small | small | large | med. | med. | small | small | med. | large | small | small |
Distance to beach | far | far | close | close | far | far | far | far | far | far | far |
Floor | 2 | 7 | 1 | 1 | 0 | 3 | 1 | 2 | 4 | 0 | 3 |
Elevator | no | yes | no | no | no | no | no | no | no | yes | yes |
Bars around | yes | yes | yes | yes | no | yes | no | no | no | yes | no |
Did (Will) I like it? | no | yes | no | no | no | yes | yes | no | ? | ? | ? |
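In the table above, `?` marks values we do not know (e.g. whether the owner of flat F8 is nice). Before feeding such data to scikit-learn, a common first step is to replace that sentinel with `np.nan`, which the imputers below recognize. A hypothetical mini-example based on the "Flat owner" row:

```python
import numpy as np

# Toy slice of the "Flat owner" row; '?' is the missing-value sentinel
owner = np.array(['nice', 'nice', 'not nice', '?'], dtype=object)

# Replace the sentinel with np.nan so downstream imputers can find the gaps
owner_clean = np.where(owner == '?', np.nan, owner)
print(owner_clean)
```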
# Imputers
import numpy as np
from sklearn.preprocessing import Imputer
my_data = np.array([[np.nan,  7,      6],
                    [5,       89,     13],
                    [23,      12,     213],
                    [2,       87,     np.nan],
                    [8,       101,    71],
                    [13,      np.nan, 20]])
mean_imputer = Imputer(missing_values='NaN', strategy="mean", axis=0)
mean_imputer.fit(my_data)
imputed_data = mean_imputer.transform(my_data)
[[ 10.2    7.     6. ]
 [  5.    89.    13. ]
 [ 23.    12.   213. ]
 [  2.    87.    64.6]
 [  8.   101.    71. ]
 [ 13.    59.2   20. ]]
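Note that `sklearn.preprocessing.Imputer` was removed in newer scikit-learn releases; the modern equivalent is `SimpleImputer` from `sklearn.impute`. A minimal sketch of the same column-mean imputation:

```python
import numpy as np
from sklearn.impute import SimpleImputer

my_data = np.array([[np.nan, 7, 6],
                    [5, 89, 13],
                    [23, 12, 213],
                    [2, 87, np.nan],
                    [8, 101, 71],
                    [13, np.nan, 20]], dtype=float)

# Replace each np.nan with the mean of its column
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = mean_imputer.fit_transform(my_data)
print(imputed[0, 0])  # mean of column 0: (5+23+2+8+13)/5 = 10.2
```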
# Normalization or scalers
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(imputed_data)
scaled_data = scaler.transform(imputed_data)
[[ -2.64412211e-16  -1.39837036e+00  -8.26676029e-01]
 [ -7.74024378e-01   7.98303387e-01  -7.27926333e-01]
 [  1.90529078e+00  -1.26442684e+00   2.09349356e+00]
 [ -1.22057690e+00   7.44725979e-01  -2.00473941e-16]
 [ -3.27471852e-01   1.11976784e+00   9.02854367e-02]
 [  4.16782357e-01   1.90345192e-16  -6.29176637e-01]]
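After fitting, a `StandardScaler` exposes the per-column statistics it learned (`mean_` and `scale_`), and the transformed columns have zero mean and unit variance. A small self-contained check on synthetic data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [3.0, 30.0],
              [5.0, 50.0]])
scaler = StandardScaler().fit(X)
print(scaler.mean_)    # per-column means learned during fit: [3., 30.]

Z = scaler.transform(X)
print(Z.mean(axis=0))  # ~0 for every column after scaling
print(Z.std(axis=0))   # ~1 for every column after scaling
```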
from sklearn.preprocessing import binarize
binarized_data = binarize(scaled_data)
[[ 0.  0.  0.]
 [ 0.  1.  0.]
 [ 1.  0.  1.]
 [ 0.  1.  0.]
 [ 0.  1.  1.]
 [ 1.  1.  0.]]
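`binarize` (and the equivalent `Binarizer` transformer) maps values strictly greater than a threshold, 0.0 by default, to 1 and the rest to 0; the threshold can be changed. A small illustrative example:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

data = np.array([[-1.5, 0.2],
                 [ 0.7, 2.3]])
# With threshold=0.5, only 0.7 and 2.3 exceed the cutoff
print(Binarizer(threshold=0.5).fit_transform(data))
```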
Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, Vol. 23, No. 19, pp. 2507-2517, 2007.
# Feature selection (filtering with the ANOVA F-test, f_classif)
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
feature_selection = SelectKBest(f_classif, k=2)
my_class = np.array([1,1,1,0,0,0])
new_features = feature_selection.fit_transform(scaled_data, my_class)
[[ -2.64412211e-16  -1.39837036e+00]
 [ -7.74024378e-01   7.98303387e-01]
 [  1.90529078e+00  -1.26442684e+00]
 [ -1.22057690e+00   7.44725979e-01]
 [ -3.27471852e-01   1.11976784e+00]
 [  4.16782357e-01   1.90345192e-16]]
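After fitting, `SelectKBest.get_support()` reveals which of the original columns survived the filter. A synthetic example where only the first column actually separates the two classes, so the F-test keeps exactly that one:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Column 0 differs sharply between classes; columns 1 and 2 do not
X = np.array([[0.0, 1.0, 5.0],
              [0.1, 0.9, 5.1],
              [0.2, 1.1, 4.9],
              [5.0, 1.0, 5.0],
              [5.1, 0.9, 5.1],
              [5.2, 1.1, 4.9]])
y = np.array([1, 1, 1, 0, 0, 0])

selector = SelectKBest(f_classif, k=1).fit(X, y)
print(selector.get_support())  # boolean mask over the original columns
```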
# Feature extraction (dimensionality reduction)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(scaled_data)
reduced_data = pca.transform(scaled_data)
[[ 0.31130824  1.57076484]
 [-1.32366387 -0.04615306]
 [ 3.04183574 -0.59333748]
 [-1.19300064 -0.52826786]
 [-0.76851544 -0.85238305]
 [-0.06796403  0.44937662]]
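A fitted `PCA` also reports, via `explained_variance_ratio_`, how much of the variance each retained component captures. When one column is a linear combination of another, two components recover essentially all the variance of three columns. A small illustration on synthetic data (not the flat dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 2.0 * X[:, 0]  # third column is redundant: data lies in a 2-D subspace

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_.sum())  # ~1.0: two components suffice
```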
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
lr = LogisticRegression()
my_pipeline = Pipeline([('imputer', mean_imputer), ('standardize', scaler),
                        ('reduce_dim', pca), ('clf', lr)])
# Making predictions with our pipeline
from sklearn.model_selection import cross_val_predict
predicted_class = cross_val_predict(my_pipeline, my_data, my_class)
[1 0 0 0 0 1]
# Evaluating the accuracy of the pipeline
from sklearn import metrics
pipeline_accuracy = metrics.accuracy_score(my_class,predicted_class)
The accuracy of the pipeline is: 0.5
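`accuracy_score` is simply the fraction of positions where the predicted and true labels agree. A quick sanity check with hypothetical labels:

```python
from sklearn import metrics

y_true = [0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1]
# Only positions 2 and 3 match, so accuracy is 2/6
print(metrics.accuracy_score(y_true, y_pred))
```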
# TPOT: Automatically finding pipelines
from tpot import TPOTClassifier
my_tpot = TPOTClassifier(generations=5, population_size=10, verbosity=2, random_state=16)
my_tpot.fit(features=my_data, target=my_class)
print("The pipeline learned by tpot is:")
my_tpot.fitted_pipeline_.steps
The pipeline learned by tpot is:
[('robustscaler',
RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
with_scaling=True)),
('logisticregression',
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False))]