@syed.danish wrote:
Hi,
I have a data set divided into two parts, train and test. I want to know how one should handle extra levels present in the test and/or train data. Three cases are possible:
Case 1:
In train there is a variable Education with 4 levels: U.G., P.G., 12, PHD.
In test data the “Education” variable has 3 levels: U.G., P.G., PHD.
Level 12 is present in train but not in test.
Case 2:
In train there is a variable Education with 3 levels: U.G., P.G., PHD.
In test data the “Education” variable has 4 levels: U.G., P.G., 12, PHD.
Level 12 is present in test but not in train.
Case 3:
In train there is a variable Education with 4 levels: U.G., P.G., 12, PHD.
In test data the “Education” variable has 4 levels: U.G., P.G., 12, Diploma.
Level PHD is present in train but not in test, and Diploma is present in test but not in train.
What would be the suitable operation in each case?
1. Apply one-hot encoding only to the levels that are present in both test and train.
2. Drop the observations in test/train that have levels absent from the corresponding train/test set.
3. Build a prediction model without doing anything about the extra levels.
Please suggest if there is any other way to handle this problem.
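To make the cases concrete, here is a minimal sketch of Case 3. It assumes Python with pandas, which the question does not specify, and the tiny `train`/`test` frames and the helper `encode` are made up for illustration. It lists the levels unique to each split, then one-hot encodes both splits against the levels seen in train, so a level that appears only in test (Diploma) becomes an all-zero row, and a level that appears only in train (PHD) still gets a column in test.

```python
import pandas as pd

# Hypothetical toy data mirroring Case 3 from the question.
train = pd.DataFrame({"Education": ["U.G.", "P.G.", "12", "PHD"]})
test = pd.DataFrame({"Education": ["U.G.", "P.G.", "12", "Diploma"]})

# Levels seen in train define the encoding; sorted for a stable column order.
train_levels = sorted(train["Education"].unique())

# Which levels are exclusive to each split?
only_in_train = set(train_levels) - set(test["Education"])
only_in_test = set(test["Education"]) - set(train_levels)


def encode(frame, levels):
    """One-hot encode 'Education' against a fixed list of levels.

    Values outside `levels` become missing in the categorical and
    therefore end up as all-zero rows in the dummy matrix.
    """
    cat = pd.Series(
        pd.Categorical(frame["Education"], categories=levels),
        index=frame.index,
    )
    return pd.get_dummies(cat, prefix="Education", dtype=int)


X_train = encode(train, train_levels)
X_test = encode(test, train_levels)

print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
print(X_test)
```

Because both splits are encoded against the same level list, `X_train` and `X_test` have identical columns, which most models require. Whether an all-zero row is an acceptable stand-in for an unseen level depends on the model; scikit-learn's `OneHotEncoder(handle_unknown="ignore")` follows the same idea if that library is in use.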
Thanks in advance.