Improper output of class Probabilities in Random Forest Classifier

@psnh wrote:

Hi! My dataset has 140k rows with 5 attributes and one target variable, Attrition, whose value is either 0 (customer does not churn) or 1 (customer churns). I split the dataset into 80% training and 20% testing. The dataset is heavily imbalanced: 84% of the rows have target 0 and only 16% have target 1.

The feature importance of my training dataset is as follows:

ColumnA = 28%, ColumnB = 27%, AnnualFee = 17%, ColumnD = 17%, and ColumnE = 11%

I initially wanted to do a very simple sanity check of my model. After creating a Random Forest Classifier, I tested the model on a dataset of just 5 rows, holding all variables constant except AnnualFee. Below is a snapshot of my test data:

 ColumnA    ColumnB    AnnualFee    ColumnD    ColumnE
 4500       3.9        5%           2.1        7
 4500       3.9        10%          2.1        7
 4500       3.9        15%          2.1        7
 4500       3.9        20%          2.1        7
 4500       3.9        25%          2.1        7

I expected that as AnnualFee increases, the predicted probability of churn would also increase, but the output of rf.predict_proba(X_test) seems to be all over the place. I am not sure why this is happening.

I tried two different code snippets, and the anomaly occurs with both:

Code 1:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=400, random_state=0,
                            min_samples_split=2, min_samples_leaf=5,
                            class_weight={0: .0001, 1: .9999})
rf.fit(X_train, Y_train)
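As context for what the output should look like, here is a minimal self-contained sketch (using synthetic data from `make_classification` in place of the real 140k-row dataset) of the same model settings and a `predict_proba` call. Each row of the result is `[P(class 0), P(class 1)]` and sums to 1; the probabilities are averaged per-tree votes, so there is no built-in guarantee they move monotonically with any single feature.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data standing in for the real dataset (84% / 16%)
X, y = make_classification(n_samples=2000, n_features=5,
                           weights=[0.84, 0.16], random_state=0)

rf = RandomForestClassifier(n_estimators=400, random_state=0,
                            min_samples_split=2, min_samples_leaf=5,
                            class_weight={0: .0001, 1: .9999})
rf.fit(X, y)

# Probabilities for 5 rows: shape (5, 2), each row sums to 1
proba = rf.predict_proba(X[:5])
print(proba.shape)        # (5, 2)
print(proba.sum(axis=1))  # each entry is 1.0
```

Note that such an extreme `class_weight` pushes almost all weight onto class 1, which by itself can distort the probability estimates.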

Code 2: Not My Code - Got it Online

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

clf_4 = RandomForestClassifier(class_weight={0: 1, 1: 5})
estimators_range = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25])
depth_range = np.array([11, 21, 35, 51, 75, 101, 151, 201, 251, 301, 401, 451, 501])
# random_state requires shuffle=True in recent scikit-learn versions
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model_grid = [{'max_depth': depth_range, 'n_estimators': estimators_range}]
grid = GridSearchCV(clf_4, model_grid, cv=skf, n_jobs=8, scoring='roc_auc')
grid.fit(X_train, Y_train)
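After the grid search finishes, the probabilities come from the refitted best model. Below is a runnable sketch of that step on synthetic data (with a deliberately tiny grid so it runs quickly; the parameter values are illustrative, not the ones above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data in place of X_train / Y_train
X, y = make_classification(n_samples=500, n_features=5,
                           weights=[0.84, 0.16], random_state=42)

clf = RandomForestClassifier(class_weight={0: 1, 1: 5}, random_state=42)
grid = GridSearchCV(clf,
                    [{'max_depth': [5, 11], 'n_estimators': [5, 10]}],
                    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
                    scoring='roc_auc')
grid.fit(X, y)

# GridSearchCV refits the best parameter combination on the full data;
# use best_estimator_ (or grid itself) for predictions.
best_rf = grid.best_estimator_
proba = best_rf.predict_proba(X[:5])
print(grid.best_params_)
print(proba.shape)  # (5, 2)
```

One thing worth checking in the original snippet: `roc_auc` scores ranking quality, not probability calibration, so the tuned model can rank churners well while its raw `predict_proba` values still look erratic row to row.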

I would really appreciate any help on this!

Posts: 2

Participants: 2
