Simpson's Paradox in Regression

@manoj09990 wrote:

Hi Everyone,

I am working on a linear regression model where my output variable is the ‘Salary’ of individuals, based on two input variables: 1) Department and 2) Job_Level.

When I fit a simple linear regression model to predict “Salary” using “Department” alone, it gives coefficients that make sense with the data. However, when I also add “Job_Level” to the model, it produces coefficients that look incorrect. Since I cannot share that data set here, I have used the Big Mart data (the train data, after removing all rows with missing values) to reproduce the problem.

Below is the R code of the model which I built:

# First model: simple linear regression, predicting sales from outlet size only
model1 <- lm(Item_Outlet_Sales ~ Outlet_Size, data = Big_mart)
coef(model1)

(Intercept)          2298.99526
Outlet_SizeMedium    -126.87866
Outlet_SizeSmall       59.34781

So I would interpret the coefficients as follows: if Outlet_Size is Medium, average sales are about 126 lower than for the reference category (Outlet_Size High); similarly, if Outlet_Size is Small, average sales are about 59 higher than for Outlet_Size High. This makes sense, because the mean sales by Outlet_Size match the coefficients (highest for Small, lowest for Medium).
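As a quick check of that interpretation, here is a minimal sketch on synthetic data (the `toy` data frame and its values are made up for illustration; this is not the Big Mart set). With a single categorical predictor, the intercept should equal the mean of the reference level and each other coefficient should equal the difference between that level's mean and the reference mean.

```r
# Synthetic stand-in data: hypothetical values, NOT the Big Mart data
set.seed(1)
toy <- data.frame(
  size  = factor(rep(c("High", "Medium", "Small"), each = 50)),
  sales = c(rnorm(50, 2300, 100), rnorm(50, 2170, 100), rnorm(50, 2360, 100))
)

# One-factor model; "High" is the reference level (first alphabetically)
fit <- lm(sales ~ size, data = toy)

# Plain group means, as a named numeric vector
m <- sapply(split(toy$sales, toy$size), mean)

# Row 1: coefficients from lm(); row 2: the same quantities from group means
rbind(
  from_lm    = coef(fit),
  from_means = c(m[["High"]], m[c("Medium", "Small")] - m[["High"]])
)
```

The two rows should agree up to floating-point error, which is why the one-variable coefficients line up with the raw means by Outlet_Size.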

Next I added one more variable, Outlet_Location_Type, and rebuilt the model:

# Second model: outlet size plus outlet location type
model2 <- lm(Item_Outlet_Sales ~ Outlet_Size + Outlet_Location_Type, data = Big_mart)
coef(model2)

(Intercept)                   2651.8512
Outlet_SizeMedium             -303.4965
Outlet_SizeSmall              -374.0069
Outlet_Location_TypeTier 2     160.9976
Outlet_Location_TypeTier 3    -352.8559

Now the problem is this: in the simple model, where only Outlet_Size was used to predict sales, the coefficient for Outlet_SizeSmall was positive (59.35). After adding Outlet_Location_Type, it flipped sign and became negative (-374.01), which doesn’t make sense if we manually compare the coefficients with the raw data.
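To make the flip reproducible on data that can be shared, here is a small sketch with synthetic data (all names and numbers are hypothetical, chosen only to force the effect). It assumes that small outlets sell less than high outlets within each tier, but that small outlets sit mostly in the better-selling tier. Under that assumption the one-variable coefficient for Small should come out positive, while the coefficient in the two-variable model, which compares sizes within the same tier, should come out negative.

```r
# Synthetic data: hypothetical cell counts and effects, NOT the Big Mart data
set.seed(42)
cells <- data.frame(
  size = c("High", "High", "Small", "Small"),
  tier = c("Tier 1", "Tier 2", "Tier 1", "Tier 2"),
  n    = c(90, 10, 10, 90)   # Small outlets are concentrated in Tier 2
)

# Expand the cell counts into one row per outlet
toy <- cells[rep(seq_len(nrow(cells)), cells$n), c("size", "tier")]
toy$size <- factor(toy$size)
toy$tier <- factor(toy$tier)

# Within a tier, Small sells 300 less; Tier 2 sells 1000 more than Tier 1
toy$sales <- 2000 - 300 * (toy$size == "Small") +
  1000 * (toy$tier == "Tier 2") + rnorm(nrow(toy), sd = 50)

marginal <- lm(sales ~ size, data = toy)          # size only
adjusted <- lm(sales ~ size + tier, data = toy)   # size, holding tier fixed

coef(marginal)["sizeSmall"]   # overall Small vs High difference
coef(adjusted)["sizeSmall"]   # Small vs High difference within a tier

# The imbalance that drives the flip
table(toy$size, toy$tier)
```

The cross-tabulation at the end is the part worth running on the real data too: a sign flip like this usually goes together with very uneven counts across the combinations of the two predictors.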

The same thing happens when I include the Job_Level variable along with the Department variable to predict Salary.

After doing some research on Google, I learned that this phenomenon is known as Simpson’s Paradox. So I now know the cause of the problem, but my question is: how can I resolve it and fit a regression model whose coefficients have signs (+ or -) that match the data used to train the model? I also need to share the results with the business owners, so I will have to report the coefficients to them.

If you have a solution to this, please share your inputs.

Thanks.

