Channel: Data Science, Analytics and Big Data discussions - Latest topics

Kaggle Titanic: Logistic Regression - higher cross-validation scores result in lower accuracy on submission


@Aarshay wrote:

Hi,

I'm working on the Titanic problem at Kaggle, focusing on getting a reasonably good solution using logistic regression. I've created some features, and the training and test sets I'm using are:
test_modified.csv (50.4 KB)
train_modified.csv (109.0 KB)

I'm facing a peculiar issue. When I train the model using only "Sex" as the predictor, I get a training accuracy of 78.67% and a 10-fold cross-validation mean score of 78.67% with a standard deviation of 0.04. This gives 0.76555 on submission.
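For reference, the baseline scoring looks roughly like this (a minimal sketch assuming scikit-learn; a tiny synthetic stand-in is used here instead of train_modified.csv, since only the encoded Sex column matters for this model):

```python
# Sketch of the Sex-only baseline: logistic regression + 10-fold CV.
# The data below is a synthetic stand-in for train_modified.csv.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400
sex = rng.integers(0, 2, n)  # 0 = male, 1 = female (already label-encoded)
# Survival correlates strongly with sex, as on the real Titanic data.
survived = (rng.random(n) < np.where(sex == 1, 0.75, 0.20)).astype(int)

X = sex.reshape(-1, 1)       # single-feature design matrix
model = LogisticRegression()
scores = cross_val_score(model, X, survived, cv=10)  # 10-fold CV accuracy
print(scores.mean(), scores.std())
```

On the real training set, the mean and standard deviation printed here are the 78.67% / 0.04 figures quoted above.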

Now I've run RFECV (recursive feature elimination with cross validation) on my dataset and I get the following graph:

Here, the top 10 variables and their coefficients are:

These give a training accuracy of 82.72%. The 10-fold cross-validation mean accuracy is 81.59% with a standard deviation of 0.0255. But when I submit this solution on Kaggle, I get a score of 0.75598, about 0.01 lower than the model with only Sex as a predictor.
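The feature-selection step is roughly the following (a sketch only; the feature matrix here is synthetic, and on the real data X would hold the engineered columns from train_modified.csv):

```python
# Sketch of recursive feature elimination with cross-validation (RFECV)
# around a logistic regression, as used above. Synthetic data stands in
# for the engineered Titanic features.
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 12))  # 12 candidate features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

selector = RFECV(LogisticRegression(), step=1,
                 cv=StratifiedKFold(5), scoring="accuracy")
selector.fit(X, y)
print(selector.n_features_)  # number of features RFECV retains
print(selector.support_)     # boolean mask over the candidate columns
```

The graph above is the RFECV cross-validation score plotted against the number of features retained, and the top-10 coefficients come from the fitted estimator on the selected columns.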

This is really strange. If the model is overfitting, why am I getting a higher cross-validation score? Is there something I'm missing here? What other metrics could I use to diagnose the model?

Please help.

Thanks,
Aarshay

Posts: 1

Participants: 1


