Channel: Data Science, Analytics and Big Data discussions - Latest topics

Mismatch in levels of categorical variable in train and test data

@syed.danish wrote:

Hi,
I have a data set split into two parts, train and test. I want to know how to handle levels of a categorical variable that are present in one part but missing from the other. Three cases are possible:

Case 1:
In train, the variable “Education” has 4 levels: U.G., P.G., 12, PHD.
In test, the variable “Education” has 3 levels: U.G., P.G., PHD.
Level 12 is present in train but not in test.

Case 2:
In train, the variable “Education” has 3 levels: U.G., P.G., PHD.
In test, the variable “Education” has 4 levels: U.G., P.G., 12, PHD.
Level 12 is present in test but not in train.

Case 3:
In train, the variable “Education” has 4 levels: U.G., P.G., 12, PHD.
In test, the variable “Education” has 4 levels: U.G., P.G., 12, Diploma.
Level PHD is present in train but not in test, and level Diploma is present in test but not in train.
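
To make the three cases concrete, here is a minimal sketch of how the mismatched levels could be detected before modelling. The function name `compare_levels` and the two example lists are hypothetical, and the lists hold only the distinct levels from Case 3, not real data.

```python
def compare_levels(train_values, test_values):
    """Return (levels only in train, levels only in test), each sorted."""
    train_levels = set(train_values)
    test_levels = set(test_values)
    only_train = sorted(train_levels - test_levels)
    only_test = sorted(test_levels - train_levels)
    return only_train, only_test


# Hypothetical values reproducing Case 3 from the question.
train_education = ["U.G.", "P.G.", "12", "PHD"]
test_education = ["U.G.", "P.G.", "12", "Diploma"]

only_train, only_test = compare_levels(train_education, test_education)
print("Only in train:", only_train)  # Only in train: ['PHD']
print("Only in test:", only_test)    # Only in test: ['Diploma']
```

A non-empty first list corresponds to Case 1, a non-empty second list to Case 2, and both being non-empty to Case 3.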

Which of these operations would be suitable in each case?
1. Apply one hot encoding only to the levels that are present in both train and test.
2. Drop the observations in train/test whose level is absent from the other part (test/train).
3. Build the prediction model without doing anything about the extra levels.
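
As an illustration of one further possibility, the sketch below fixes the set of levels from the train data and encodes any level not seen in train as an all-zero row. This is only one option under stated assumptions, not a recommended answer for every case; the helper names `fit_levels` and `one_hot` are hypothetical, and the values reproduce Case 2. scikit-learn's `OneHotEncoder` with `handle_unknown="ignore"` behaves similarly for unseen categories.

```python
def fit_levels(train_values):
    """Collect the distinct levels seen in train, in a fixed sorted order."""
    return sorted(set(train_values))


def one_hot(values, levels):
    """One-hot encode values against the train levels.

    A value that is not among the train levels gets an all-zero row,
    so train and test always end up with the same columns.
    """
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for value in values:
        row = [0] * len(levels)
        if value in index:
            row[index[value]] = 1
        rows.append(row)
    return rows


# Hypothetical values reproducing Case 2: level "12" appears only in test.
train_education = ["U.G.", "P.G.", "PHD"]
test_education = ["U.G.", "12", "PHD"]

levels = fit_levels(train_education)
print(levels)                           # ['P.G.', 'PHD', 'U.G.']
print(one_hot(test_education, levels))  # [[0, 0, 1], [0, 0, 0], [0, 1, 0]]
```

With this approach, a level present only in train (Case 1) simply produces a column that is all zeros in test, so the column layout stays the same for both parts.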

Please suggest if there is any other way to handle this problem.

Thanks in advance.

Posts: 3

Participants: 3
