Difference in model performance measures of train and test data sets

@manuvats1990 wrote:

I am using the CART classification technique, dividing a dataset into train and test sets. As model performance measures (MPMs) I have been using misclassification error, KS (by rank ordering), AUC and Gini. The problem I am facing is that the MPM values for the train and test sets are quite far apart.

Dataset: https://drive.google.com/open?id=1UwqmM_R3SAHGAn7b1sytdT6KlcWylxUW

Metadata: https://drive.google.com/open?id=1PkhvSA4fsuFtZsnaxegI4gBKFEc2Q839

I have tried minsplit values anywhere from 20 to 1400 and minbucket values from 5 to 100, but couldn't get the expected results. I have also tried oversampling/undersampling with the ROSE package, but that brought no improvement and the misclassification error increased a lot. The following code gave me the best values I could get, but they were still not good enough.
#Loading the rpart package (provides rpart, rpart.control and prune)
library(rpart)

#Reading Data
pdata = read.csv("PL_XSELL.csv", header = TRUE)

#Converting ACC_OP_DATE from type factor to date
pdata$ACC_OP_DATE<-as.Date(pdata$ACC_OP_DATE, format = "%d-%m-%Y")

#Partitioning the data into training and test datasets
set.seed(2000)
n=nrow(pdata)
split= sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.70, 0.30))
ptrain = pdata[split, ]
ptest = pdata[!split,]

#CART Model
#Taking the minsplit, minbucket values as low as possible, so that pruning 
#can be done later. Higher values didn't allow any scope for pruning
r.ctrl = rpart.control(minsplit=20, minbucket = 5,  cp = 0, xval = 10)

#Calling the rpart function to build the tree
cartModel <- rpart(formula = TARGET ~ ., 
        data = ptrain[,-1], method = "class", 
        control = r.ctrl)

#Pruning the tree at the chosen complexity parameter
cartModel <- prune(cartModel, cp = 0.00225)

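#Predicting class and scores
#type="prob" returns one column per class; column 2 is taken as the
#probability of the positive class (assumes TARGET is coded 0/1)
ptrain$predict.class <- predict(cartModel, ptrain, type="class")
ptrain$predict.score <- predict(cartModel, ptrain, type="prob")[, 2]

#Same predictions on the test set
ptest$predict.class <- predict(cartModel, ptest, type="class")
ptest$predict.score <- predict(cartModel, ptest, type="prob")[, 2]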
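
The cp value used for pruning above is hard-coded. One common way to pick it instead is from the cross-validation table that rpart stores in the fitted model. The following is only a sketch under assumptions: it uses rpart's built-in kyphosis data as a stand-in for PL_XSELL.csv, the names ctrl, fit, cp_tab and best_cp are illustrative, and taking the cp with the lowest cross-validated error is just one possible rule (the one-standard-error rule is another).

```r
library(rpart)

# Grow a deliberately large tree (cp = 0) on rpart's built-in kyphosis data,
# used here only as a stand-in dataset
set.seed(2000)
ctrl <- rpart.control(minsplit = 5, minbucket = 2, cp = 0, xval = 10)
fit  <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
              method = "class", control = ctrl)

# cptable has the columns CP, nsplit, rel error, xerror and xstd;
# xerror is the cross-validated error relative to the root node
cp_tab  <- fit$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]

# Prune back to the subtree with the lowest cross-validated error
pruned <- prune(fit, cp = best_cp)
printcp(pruned)
```

Applied to the model in this post, the same lookup would be done on cartModel$cptable before calling prune.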

Results that I got from my model:

Train data: misclassification error 0.103, AUC 0.679, KS 0.259, Gini 0.313

Test data: misclassification error 0.113, AUC 0.664, KS 0.226, Gini 0.307
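
For reference, the four measures can be computed from a vector of true 0/1 labels and a vector of predicted scores roughly as follows. This is a minimal base-R sketch, not the code that produced the numbers in this post. The helper name mpm and the toy vectors are hypothetical, the positive class is assumed to be coded 1, KS is taken as the largest gap between the cumulative score distributions of the two classes (rather than read off a decile rank-ordering table), and Gini is taken as 2*AUC - 1. A Gini computed some other way, for example as an inequality coefficient of the scores, will not match this definition.

```r
# Hypothetical helper: performance measures for a binary classifier.
# actual : true labels coded 0/1 (1 = positive class, an assumption)
# score  : predicted probability of class 1
# cutoff : threshold used to turn scores into predicted classes
mpm <- function(actual, score, cutoff = 0.5) {
  pred_class <- as.integer(score >= cutoff)
  misclass   <- mean(pred_class != actual)

  # AUC via the rank-sum (Mann-Whitney) formulation; ties get average ranks
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  r     <- rank(score)
  auc   <- (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

  # KS: largest gap between the empirical CDFs of the two classes' scores
  cdf_pos <- ecdf(score[actual == 1])
  cdf_neg <- ecdf(score[actual == 0])
  grid    <- sort(unique(score))
  ks      <- max(abs(cdf_pos(grid) - cdf_neg(grid)))

  c(misclass = misclass, auc = auc, ks = ks, gini = 2 * auc - 1)
}

# Toy data, made up for illustration only
actual <- c(0, 0, 1, 0, 1, 1, 0, 1)
score  <- c(0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.20, 0.90)
res <- mpm(actual, score)
print(round(res, 3))
```

Calling the helper once with the TARGET values and class-1 probabilities of ptrain and once with those of ptest gives the two sets of measures side by side, computed in exactly the same way.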

Is this due to the dataset, or am I doing something wrong? I am new to data analytics. This is part of my academic project, so I need to use the CART technique only. I will post separate questions for Random Forest and Neural Networks. Kindly help.
