@psnh wrote:
Hi! My dataset has 140k rows with 5 attributes and one target variable, Attrition (0 = customer does not churn, 1 = customer churns). I split the dataset into 80% training and 20% testing. The dataset is heavily imbalanced: 84% of the rows have target 0 and only 16% have target 1.
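(With a class split that skewed, it's worth making sure the 80/20 split preserves the 84/16 ratio. A minimal sketch on synthetic stand-in data, since the real columns aren't shown, using `train_test_split(stratify=...)`:)

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 140k-row dataset: ~84% class 0, ~16% class 1.
# (Feature values are illustrative, not the original data.)
rng = np.random.default_rng(0)
X = rng.normal(size=(140_000, 5))
y = (rng.random(140_000) < 0.16).astype(int)

# stratify=y keeps the 84/16 class ratio in both the 80% train split
# and the 20% test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(y_train.mean(), y_test.mean())  # both close to 0.16
```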
The feature importance of my training dataset is as follows:
ColumnA = 28%, ColumnB = 27%, AnnualFee = 17%, ColumnD = 17%, and ColumnE = 11%
I initially wanted to do a very simple sanity check of my model. After creating a Random Forest classifier, I tested the model on a dataset with just 5 rows, keeping all variables constant except AnnualFee. Below is a snapshot of my test data:
ColumnA  ColumnB  AnnualFee  ColumnD  ColumnE
4500     3.9      5%         2.1      7
4500     3.9      10%        2.1      7
4500     3.9      15%        2.1      7
4500     3.9      20%        2.1      7
4500     3.9      25%        2.1      7
I expected that as the annual fee increases, the probability of churn would also increase. But my rf.predict_proba(X_test) output seems to be all over the place, and I am not sure why this is happening.
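(One thing to check when reading predict_proba is which column corresponds to class 1; it follows rf.classes_, not a fixed order. A self-contained sketch of this kind of probe, on synthetic data where churn probability genuinely rises with the fee; all values and column names below are illustrative:)

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic training data: churn probability rises linearly with AnnualFee.
rng = np.random.default_rng(0)
n = 5000
X_train = pd.DataFrame({
    "ColumnA": rng.normal(4500, 500, n),
    "ColumnB": rng.normal(3.9, 0.5, n),
    "AnnualFee": rng.uniform(0.05, 0.25, n),  # 5%..25% as a fraction
    "ColumnD": rng.normal(2.1, 0.3, n),
    "ColumnE": rng.integers(1, 10, n),
})
y_train = (rng.random(n) < X_train["AnnualFee"] * 4).astype(int)

rf = RandomForestClassifier(n_estimators=200, min_samples_leaf=5, random_state=0)
rf.fit(X_train, y_train)

# Five probe rows: everything fixed, only AnnualFee varies (5%..25%).
probe = pd.DataFrame({
    "ColumnA": [4500] * 5,
    "ColumnB": [3.9] * 5,
    "AnnualFee": [0.05, 0.10, 0.15, 0.20, 0.25],
    "ColumnD": [2.1] * 5,
    "ColumnE": [7] * 5,
})

# predict_proba returns one column per class, ordered as rf.classes_;
# pick the column for class 1 (churn) explicitly instead of assuming [:, 1].
churn_col = list(rf.classes_).index(1)
p_churn = rf.predict_proba(probe)[:, churn_col]
print(p_churn)
```

On data like this, where the fee actually drives churn, p_churn rises with AnnualFee; if it doesn't on your data, the forest has likely not learned a monotonic relationship there.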
I tried two different code snippets, but the anomaly happens with both:
Code 1:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=400,
    min_samples_split=2,
    min_samples_leaf=5,
    class_weight={0: 0.0001, 1: 0.9999},
    random_state=0,
)
rf.fit(X_train, Y_train)
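(A side note on that class_weight: {0: .0001, 1: .9999} weights the minority class ~10,000x more than the majority, which effectively tells the forest to ignore class 0 and can distort the probabilities. A hedged alternative sketch, letting scikit-learn derive weights from the 84/16 class frequencies; the tiny synthetic data here is only to make the snippet runnable:)

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative stand-in data with roughly the post's 84/16 imbalance.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.16).astype(int)

# class_weight="balanced" sets weights inversely proportional to class
# frequencies (n_samples / (n_classes * n_samples_per_class)), a milder
# starting point than hand-picked near-degenerate weights.
rf = RandomForestClassifier(
    n_estimators=400,
    min_samples_split=2,
    min_samples_leaf=5,
    class_weight="balanced",
    random_state=0,
)
rf.fit(X, y)
print(rf.classes_)
```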
Code 2 (not my code; found online):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV

clf_4 = RandomForestClassifier(class_weight={0: 1, 1: 5})
estimators_range = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25])
depth_range = np.array([11, 21, 35, 51, 75, 101, 151, 201, 251, 301, 401, 451, 501])
# Note: random_state is only valid together with shuffle=True in recent
# scikit-learn versions; without shuffle=True this raises a ValueError.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model_grid = [{'max_depth': depth_range, 'n_estimators': estimators_range}]
grid = GridSearchCV(clf_4, model_grid, cv=skf, n_jobs=8, scoring='roc_auc')
grid.fit(X_train, Y_train)
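(A hand-built 5-row probe can also be misleading on its own: a forest's response to one feature at one fixed point of the others is noisy. scikit-learn's partial_dependence averages the predicted churn probability over the whole dataset while sweeping one feature, which is a more robust check. A sketch on synthetic data; the assumption that AnnualFee sits at column index 2 is mine:)

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

# Synthetic data where churn truly depends on the fee (column 2).
rng = np.random.default_rng(0)
n = 3000
X = np.column_stack([
    rng.normal(4500, 500, n),
    rng.normal(3.9, 0.5, n),
    rng.uniform(0.05, 0.25, n),   # stand-in for AnnualFee
    rng.normal(2.1, 0.3, n),
    rng.integers(1, 10, n),
])
y = (rng.random(n) < X[:, 2] * 4).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Partial dependence of the predicted churn probability on feature 2,
# averaged over all rows of X rather than a single hand-built row.
pd_result = partial_dependence(rf, X, features=[2], kind="average")
avg = pd_result["average"][0]
print(avg[0], avg[-1])  # mean churn probability at low vs. high fee
```

If the curve from partial_dependence is flat or erratic on the real data, the model simply hasn't learned a monotone fee-to-churn relationship, regardless of what any single 5-row probe shows.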
I would really appreciate any help on this!