@EanX wrote:
I would like to use GridSearchCV and pipelines in sklearn not only to select the best hyperparameters for the chosen classifier but also to select the best categorical encoding strategy. Taking the Titanic dataset from Kaggle and using sklearn-pandas, I can define some DataFrameMappers to select and encode some features, then cross-validate a RandomForestClassifier() to search for its best hyperparameters.
Consider the following code:
    from __future__ import division
    import csv as csv
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score, GridSearchCV
    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import LabelEncoder, LabelBinarizer, StandardScaler
    from category_encoders import BinaryEncoder, LeaveOneOutEncoder
    from sklearn_pandas import DataFrameMapper

    # Load both files and stack them so feature engineering is applied once.
    df_train = pd.read_csv('train.csv', header=0, index_col='PassengerId')
    df_test = pd.read_csv('test.csv', header=0, index_col='PassengerId')
    df = pd.concat([df_train, df_test], keys=["train", "test"])

    # Derived features: title, last name, family size and family ID.
    df['Title'] = df['Name'].apply(lambda c: c[c.index(',') + 2 : c.index('.')])
    df['LastName'] = df['Name'].apply(lambda n: n[0:n.index(',')])
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

    # Fill missing Embarked and Fare values with the most frequent value.
    df.loc[df['Embarked'].isnull(), 'Embarked'] = df['Embarked'].mode()[0]
    df.loc[df['Fare'].isnull(), 'Fare'] = df['Fare'].mode()[0]

    df['FamilyID'] = df['LastName'] + ':' + df['FamilySize'].apply(str)
    df.loc[df['FamilySize'] <= 2, 'FamilyID'] = 'Small_Family'

    # Flag missing ages and attach the median age of each title.
    df['AgeOriginallyNaN'] = df['Age'].isnull().astype(int)
    medians_by_title = pd.DataFrame(df.groupby('Title')['Age'].median()).rename(
        columns={'Age': 'AgeFilledMedianByTitle'})
    df = df.merge(medians_by_title, left_on='Title', right_index=True) \
           .sort_index(level=0).sort_index(level=1)

    # Split back into train and test parts.
    df_train = df.loc['train']
    df_test = df.loc['test']
    y_train = df_train['Survived']
    X_train = df_train[df_train.columns.drop('Survived')]

    # Two candidate encoding strategies; they differ only in how Embarked is encoded.
    mapper1 = DataFrameMapper([
        ('Embarked', BinaryEncoder()),
        (['AgeFilledMedianByTitle'], StandardScaler()),
        ('Pclass', LeaveOneOutEncoder())
    ])

    mapper2 = DataFrameMapper([
        ('Embarked', LeaveOneOutEncoder()),
        (['AgeFilledMedianByTitle'], StandardScaler()),
        ('Pclass', LeaveOneOutEncoder())
    ])

    # The pipeline currently hard-codes mapper1 as the featurizer.
    pipe = Pipeline([('featurize', mapper1),
                     ('forest', RandomForestClassifier(n_estimators=10))])

    # Grid search over the forest hyperparameters only.
    param_grid = dict(forest__n_estimators=[2, 16, 32, 64],
                      forest__criterion=['gini', 'entropy'])
    grid_search = GridSearchCV(pipe, param_grid=param_grid, scoring='accuracy')

    best_pipeline = grid_search.fit(X_train, y_train).best_estimator_
    best_pipeline.get_params()['forest']
    grid_search.best_score_
Is it possible to use Pipeline in GridSearchCV to also select the better of the two mappers (mapper1 or mapper2)? If so, how?
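To make the question concrete: scikit-learn allows a whole pipeline step to be replaced through set_params, so one possible approach is to list the candidate featurizers under the step name in param_grid. Below is a minimal sketch of that idea, not a tested solution for the code above. It makes several assumptions: ColumnTransformer with OneHotEncoder and OrdinalEncoder stands in for mapper1 and mapper2 (sklearn-pandas and category_encoders are not used), X_demo and y_demo are small synthetic data rather than the Titanic set, and make_featurizer is a hypothetical helper. Whether DataFrameMapper objects behave the same way when GridSearchCV clones them is not checked here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Synthetic stand-in for the Titanic frame (assumption: not the real data).
rng = np.random.RandomState(0)
n = 120
X_demo = pd.DataFrame({
    'Embarked': rng.choice(['S', 'C', 'Q'], size=n),
    'Pclass': rng.choice([1, 2, 3], size=n),
    'Age': rng.uniform(1, 80, size=n),
})
# Toy target: 1 for first-class passengers, 0 otherwise.
y_demo = (X_demo['Pclass'] == 1).astype(int).to_numpy()


def make_featurizer(encoder_cls):
    """Hypothetical helper: build one encoding strategy as a ColumnTransformer."""
    return ColumnTransformer([
        ('embarked', encoder_cls(), ['Embarked']),
        ('pclass', encoder_cls(), ['Pclass']),
        ('age', StandardScaler(), ['Age']),
    ])


# Two candidate strategies, playing the role of mapper1 and mapper2.
featurizer_a = make_featurizer(OneHotEncoder)
featurizer_b = make_featurizer(OrdinalEncoder)

pipe_demo = Pipeline([
    ('featurize', featurizer_a),  # placeholder, replaced during the search
    ('forest', RandomForestClassifier(random_state=0)),
])

# The step name itself is a parameter, so the candidates can be listed directly.
param_grid_demo = {
    'featurize': [featurizer_a, featurizer_b],
    'forest__n_estimators': [2, 16],
    'forest__criterion': ['gini', 'entropy'],
}

search = GridSearchCV(pipe_demo, param_grid=param_grid_demo,
                      scoring='accuracy', cv=3)
search.fit(X_demo, y_demo)

# Inspect which featurizer was selected and how many candidates were tried.
print(search.best_params_['featurize'])
print(len(search.cv_results_['params']))
```

After fitting, search.best_params_['featurize'] holds the selected featurizer object, and the grid here contains 2 x 2 x 2 = 8 candidates. Applied to the code above, the analogous grid entry would be featurize=[mapper1, mapper2] next to the forest__ parameters.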