How to use GridSearchCV and pipelines to cross-validate the best categorical encoding

@EanX wrote:

I would like to use GridSearchCV and pipelines in sklearn not only to select the best hyper-parameters for the chosen classifier, but also to select the best categorical encoding strategy. Taking the Titanic dataset (Kaggle Titanic) and using sklearn-pandas, I can define some DataFrameMappers to select and encode some features, then cross-validate a RandomForestClassifier() to search for its best hyper-parameters.

Consider the following code:

import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, LabelBinarizer, StandardScaler
from category_encoders import BinaryEncoder, LeaveOneOutEncoder

from sklearn_pandas import DataFrameMapper

df_train = pd.read_csv('train.csv', header = 0, index_col = 'PassengerId')
df_test = pd.read_csv('test.csv', header = 0, index_col = 'PassengerId')
df = pd.concat([df_train, df_test], keys=["train", "test"])

df['Title'] = df['Name'].apply(lambda c: c[c.index(',') + 2 : c.index('.')])

df['LastName'] = df['Name'].apply(lambda n: n[0:n.index(',')])

df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

df.loc[df['Embarked'].isnull(), 'Embarked'] = df['Embarked'].mode()[0]

df.loc[df['Fare'].isnull(), 'Fare'] = df['Fare'].mode()[0]

df['FamilyID'] = df['LastName'] + ':' + df['FamilySize'].apply(str)

df.loc[df['FamilySize'] <= 2, 'FamilyID'] = 'Small_Family'

df['AgeOriginallyNaN'] = df['Age'].isnull().astype(int)

medians_by_title = pd.DataFrame(df.groupby('Title')['Age'].median()).rename(columns = {'Age': 'AgeFilledMedianByTitle'})

df = df.merge(medians_by_title, left_on = 'Title', right_index = True).sort_index(level = 0).sort_index(level = 1)

df_train = df.loc['train']
df_test  = df.loc['test']

y_train = df_train['Survived']
X_train = df_train[df_train.columns.drop('Survived')]

mapper1 = DataFrameMapper([
    ('Embarked', BinaryEncoder()),
    (['AgeFilledMedianByTitle'], StandardScaler()),
    ('Pclass', LeaveOneOutEncoder())
])

mapper2 = DataFrameMapper([
    ('Embarked', LeaveOneOutEncoder()),
    (['AgeFilledMedianByTitle'], StandardScaler()),
    ('Pclass', LeaveOneOutEncoder())
])

pipe = Pipeline([('featurize', mapper1),
                 ('forest', RandomForestClassifier(n_estimators=10))])

param_grid = dict(forest__n_estimators = [2, 16, 32,64],
                  forest__criterion = ['gini', 'entropy'])

grid_search = GridSearchCV(pipe, param_grid=param_grid, scoring='accuracy')

best_pipeline = grid_search.fit(X_train, y_train).best_estimator_
best_pipeline.get_params()['forest']
grid_search.best_score_

Is it possible to use Pipeline in GridSearchCV to select the best of the two mappers (mapper1 or mapper2)? If so, how?
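
One approach that may fit here: a named Pipeline step can itself be listed in param_grid, so GridSearchCV swaps whole transformers in and out alongside the classifier settings. The sketch below is only an illustration under stated assumptions. It uses synthetic data in place of the Titanic CSVs, and sklearn's ColumnTransformer with OneHotEncoder and OrdinalEncoder as stand-ins for the DataFrameMapper objects and category_encoders classes above. The names X_demo, y_demo, make_encoder, encoder_a and encoder_b are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Synthetic stand-in for the columns used above (assumption: not real Titanic data).
rng = np.random.RandomState(0)
n = 200
X_demo = pd.DataFrame({
    'Embarked': rng.choice(['S', 'C', 'Q'], size=n),
    'Pclass': rng.choice([1, 2, 3], size=n),
    'AgeFilledMedianByTitle': rng.normal(30, 10, size=n),
})
y_demo = (X_demo['Pclass'].to_numpy() == 1).astype(int)


def make_encoder(categorical_encoder):
    """Build one candidate 'featurize' step around the given categorical encoder."""
    return ColumnTransformer([
        ('cat', categorical_encoder, ['Embarked', 'Pclass']),
        ('num', StandardScaler(), ['AgeFilledMedianByTitle']),
    ])


# Two candidate encoding strategies, playing the role of mapper1 and mapper2.
encoder_a = make_encoder(OneHotEncoder(handle_unknown='ignore'))
encoder_b = make_encoder(OrdinalEncoder())

pipe_demo = Pipeline([('featurize', encoder_a),
                      ('forest', RandomForestClassifier(random_state=0))])

# The step name 'featurize' is a searchable parameter: each listed object
# replaces the whole step for that candidate.
param_grid_demo = {
    'featurize': [encoder_a, encoder_b],
    'forest__n_estimators': [2, 16],
    'forest__criterion': ['gini', 'entropy'],
}

search = GridSearchCV(pipe_demo, param_grid=param_grid_demo,
                      scoring='accuracy', cv=3)
search.fit(X_demo, y_demo)

# The winning encoding strategy is reported like any other hyper-parameter.
chosen = search.best_params_['featurize']
print(chosen)
print(search.best_score_)
```

Applied to the code in the question, the equivalent change would presumably be adding featurize = [mapper1, mapper2] to the existing param_grid. That assumes DataFrameMapper and the category_encoders transformers can be cloned and refit by GridSearchCV like ordinary sklearn estimators, which has not been checked here.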
