Channel: Data Science, Analytics and Big Data discussions - Latest topics

Imputing missing values in the Smart recruits hackathon


@B.Rabbit wrote:

Hello People,

In the Smart recruits hackathon there are about 1600 rows where all the values of the Manager variables are missing.
My idea is to create a unique ID for each manager (each manager has a unique combination of values across the manager variables) and build a model with that ID as the target variable and all the applicant variables as the independent variables.
To deal with the missing rows, this is what I did:
1) Split the data set into 'missing' (at least one value in the row is missing) and 'non missing' (no value in the row is missing) data sets.
2) Extract the values of the Manager variables from the non missing data set.
3) Assign an ID to each unique manager.
4) Merge the IDs back so that each row of the non missing data set carries its manager's ID.
5) Create a model where the target variable is the ID and the independent variables are all the applicant variables.
6) Predict the ID for each row in the 'missing' data set.
The code:
    library(dplyr)

    # Placeholder target so that test has the same columns as train for rbind
    test$Business_Sourced = 2
    combo = rbind(train, test)

    # Rows with no missing values at all, and the remaining rows
    filled = na.omit(combo)
    notfilled = subset(combo, !(ID %in% filled$ID))

    # Manager columns; each distinct combination is treated as one manager
    y = filled %>% select(Office_PIN, Manager_DOJ:Manager_Num_Products2)

    manager = unique(y)
    manager$newvar <- seq_len(nrow(manager))  # 6641 unique managers in this data

    # Attach the manager ID (newvar) to every complete row
    z = merge(manager, filled,
              by = c("Office_PIN", "Manager_DOJ", "Manager_Joining_Designation",
                     "Manager_Current_Designation", "Manager_Grade", "Manager_Status",
                     "Manager_Gender", "Manager_DoB", "Manager_Num_Application",
                     "Manager_Num_Coded", "Manager_Business", "Manager_Num_Products",
                     "Manager_Business2", "Manager_Num_Products2"))
    Applicant_with_managerID = z %>% select(Office_PIN, newvar:Applicant_Qualification, Same_Locality:Applicant_Age)
    Applicant_with_managerID$newvar = as.factor(Applicant_with_managerID$newvar)
    str(Applicant_with_managerID)
    # as.character first, so that a factor PIN keeps its value instead of its level code
    Applicant_with_managerID$Applicant_City_PIN = as.numeric(as.character(Applicant_with_managerID$Applicant_City_PIN))
    Applicant_with_managerID$Office_PIN = as.numeric(as.character(Applicant_with_managerID$Office_PIN))

    library(randomForest)
    # Classification with one class per manager; ID is excluded from the predictors
    model1 = randomForest(newvar ~ . - ID, data = Applicant_with_managerID, ntree = 50)
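
Steps 3 and 4 (assigning an ID to each unique manager and attaching it to every row) can also be done without a merge. The following is only a minimal sketch in base R on made-up toy data: the data frame `toy`, its values, and the names `manager_cols`, `key` and `manager_id` are assumptions for illustration, and only two manager columns are used to keep it short.

```r
# Toy data standing in for the complete rows (all values are made up).
toy <- data.frame(
  Office_PIN    = c(842001, 842001, 800001, 842001, 800001),
  Manager_Grade = c(3, 3, 2, 4, 2),
  Applicant_Age = c(25, 31, 40, 28, 35)
)

# Columns whose combination identifies a manager (hypothetical subset).
manager_cols <- c("Office_PIN", "Manager_Grade")

# Build one key string per row from the manager columns, then number the
# distinct keys in order of first appearance.
key <- do.call(paste, c(toy[manager_cols], sep = "|"))
toy$manager_id <- match(key, unique(key))

# Number of distinct managers, computed rather than typed in by hand.
n_managers <- length(unique(key))

print(toy)
```

Because the ID is computed row by row, the number of managers never has to be hard-coded, and the row order of the original data is preserved.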

Since there are more than 6000 classes to predict (6641 unique managers), I'm unable to build the model.
What should I do? Is my approach right? Should I change my model?
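
One thing that may be worth checking before changing the model is how many rows each manager ID actually has, since a class with only one or two rows gives a classifier very little to learn from. A rough sketch in base R follows; the vector `manager_id` holds made-up values standing in for `newvar`, and the other names are hypothetical.

```r
# Made-up manager IDs for ten complete rows; in the post this would be newvar.
manager_id <- c(1, 1, 1, 2, 3, 3, 4, 5, 6, 6)

# Number of rows available for each manager.
rows_per_manager <- table(manager_id)

# How many managers have exactly 1, 2, 3, ... rows.
size_distribution <- table(rows_per_manager)
print(size_distribution)

# Share of managers that appear in a single row only.
share_singletons <- mean(rows_per_manager == 1)
print(share_singletons)
```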

Regards

Posts: 3

Participants: 2
