Channel: Data Science, Analytics and Big Data discussions - Latest topics

Variable Importance in Inputs that caused a shift in the output variable


@sharathdhamodaran wrote:

I am working on a problem where the output variable shows a measurement shift, and I want to find out which input variables in the dataset are associated with that shift. I am primarily using R for this task.
(Here is my dataset as a CSV: https://drive.google.com/open?id=0B7UROHet3IQwTURnSHlCcEt2Ykk)

df_sample <- read.csv("Sample.csv")

Here is the code that plots the output variable over time:

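library(ggplot2)
library(scales)  # provides date_breaks() and date_format()

# scale_x_datetime() needs DATETIME as POSIXct; convert it first if read.csv() loaded it as text.
ggplot(df_sample, aes(x = DATETIME, y = OutVar)) +
  geom_point(size = 2) +
  geom_line() +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14, face = "bold"),
        plot.title = element_text(face = "bold", size = 20)) +
  xlab("Timestamp") + ylab("Output_Var") +
  scale_x_datetime(breaks = date_breaks("5 days"), labels = date_format("%m/%d"))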

As the plot shows, there is a clear shift in the output variable. I want to find the cause of this shift, that is, to check whether one or more of the input variables in the dataset made it happen.
To start, I have tried some basic techniques and a few ML algorithms:
1) Pairwise plots and correlations, using "library(qgraph)"
2) Random forest for variable importance. The model error I am getting here is high.
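
For reference, the two attempts above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data frame `df_demo` and the inputs `In1`, `In2`, `In3` are synthetic stand-ins (only `OutVar` is a name from the real data), and the relationship between `In2` and the output is made up so that the ranking has something to find. The random forest part runs only if the `randomForest` package is installed.

```r
# Synthetic stand-in for the real data: 173 rows, three hypothetical inputs.
set.seed(42)
n <- 173
df_demo <- data.frame(
  In1 = rnorm(n),
  In2 = rnorm(n),
  In3 = rnorm(n)
)
# Assumption for illustration only: In2 drives the output.
df_demo$OutVar <- 2 * df_demo$In2 + rnorm(n, sd = 0.5)

inputs <- setdiff(names(df_demo), "OutVar")

# Spearman correlation of each input with the output. It is rank-based,
# so it tolerates outliers and monotone non-linear relationships.
rho <- sapply(inputs, function(v) {
  cor(df_demo[[v]], df_demo$OutVar, method = "spearman")
})

# Rank inputs by the absolute strength of their association.
ranked <- sort(abs(rho), decreasing = TRUE)
print(ranked)

# Optional: permutation importance from a random forest.
if (requireNamespace("randomForest", quietly = TRUE)) {
  rf <- randomForest::randomForest(OutVar ~ ., data = df_demo,
                                   ntree = 500, importance = TRUE)
  print(randomForest::importance(rf, type = 1))  # %IncMSE per input
  print(tail(rf$rsq, 1))                         # out-of-bag pseudo R-squared
}
```

The out-of-bag pseudo R-squared is one way to put a number on "the error is high": if it is close to zero, the importance scores from that forest carry little information.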

I am also wondering whether I should treat this as a classification problem: label the normal points "Good" and the abnormal points "Bad", then use classification algorithms for prediction. The problem is that the dataset is very small, with just 173 rows, and only a few of those are "Bad" points. Can we get a good classification out of this?
Please suggest other techniques or directions for solving this problem.
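
One way to test the Good/Bad framing on a dataset this small is leave-one-out cross-validation, which uses every row for both training and evaluation. The sketch below is a minimal illustration, not a recommendation of a specific model: the data are synthetic, the inputs `In1` and `In2` and the `cutoff` value are hypothetical, and plain logistic regression via `glm()` stands in for whatever classifier is eventually chosen.

```r
# Synthetic stand-in for the real data: 173 rows, two hypothetical inputs.
set.seed(1)
n <- 173
df_demo <- data.frame(In1 = rnorm(n), In2 = rnorm(n))
# Assumption for illustration only: the output rises with In1.
df_demo$OutVar <- 10 + 1.5 * df_demo$In1 + rnorm(n, sd = 1)

# Hypothetical cutoff separating "Bad" (1) from "Good" (0) points.
cutoff <- 12
df_demo$Bad <- as.integer(df_demo$OutVar > cutoff)
print(table(df_demo$Bad))  # check how imbalanced the classes are

# Leave-one-out cross-validation: fit on n - 1 rows, predict the held-out row.
pred <- numeric(n)
for (i in seq_len(n)) {
  fit <- glm(Bad ~ In1 + In2, family = binomial, data = df_demo[-i, ])
  pred[i] <- predict(fit, newdata = df_demo[i, ], type = "response")
}

# Confusion table at a 0.5 threshold; fixed factor levels keep it 2 x 2.
pred_class <- as.integer(pred > 0.5)
conf <- table(actual = factor(df_demo$Bad, levels = 0:1),
              predicted = factor(pred_class, levels = 0:1))
print(conf)

# With few "Bad" points, recall on the "Bad" class says more than accuracy.
recall_bad <- conf["1", "1"] / sum(conf["1", ])
print(recall_bad)
```

With so few "Bad" rows, overall accuracy can look high even when the model never predicts "Bad", which is why the sketch reports recall on that class instead.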

Posts: 3

Participants: 2
