@shashankvarshney wrote:
Hi Folks,
I am new to machine learning and was trying to fit a logistic regression on the Boston data set available in the MASS library of R. I tried forward selection: I started with one predictor and kept adding predictor variables one by one.
I am using the first half of the data set for training and the second half as the test set.
I introduced a new response variable "crim_class", which is 1 if crim >= median(crim) and 0 if crim < median(crim).
With the following model I got the lowest prediction error on the test set, but some of the predictors are statistically insignificant according to the p-value significance codes:
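For reference, the setup described above can be sketched as follows (the object names `train`, `Boston.test`, and `y.test` are assumptions chosen to match the code later in this post):

```r
library(MASS)  # provides the Boston data set

# Response: 1 if per-capita crime rate is at or above the median, else 0
Boston$crim_class <- as.integer(Boston$crim >= median(Boston$crim))

# First half of the rows for training, second half for testing
n <- nrow(Boston)
train <- 1:(n / 2)
Boston.test <- Boston[-train, ]
y.test <- Boston$crim_class[-train]
```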
glm.fit = glm(crim_class~zn+indus+chas+nox+rm+age, data = Boston, family = binomial, subset = train)
summary(glm.fit)
Call:
glm(formula = crim_class ~ zn + indus + chas + nox + rm + age,
family = binomial, data = Boston, subset = train)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.1280 -0.4903 -0.0315 0.4073 3.7807

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -30.344933 5.569194 -5.449 5.07e-08 ***
zn -0.108906 0.066974 -1.626 0.10393
indus -0.186455 0.061034 -3.055 0.00225 **
chas 1.563326 0.725541 2.155 0.03118 *
nox 53.472313 10.465525 5.109 3.23e-07 ***
rm 0.620567 0.306502 2.025 0.04290 *
age -0.004282 0.011314 -0.378 0.70511
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 329.37 on 252 degrees of freedom
Residual deviance: 162.08 on 246 degrees of freedom
AIC: 176.08

Number of Fisher Scoring iterations: 8
glm.pred = rep(0, length(y.test))
glm.prob = predict(glm.fit, Boston.test, type = 'response')
glm.pred[glm.prob > 0.5] = 1
mean(glm.pred != y.test)
[1] 0.1106719

With the following model the error increased a little, but all of the predictors are statistically significant:
glm.fit = glm(crim_class~.-crim-chas-tax, data = Boston, family = binomial, subset = train)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.fit)

Call:
glm(formula = crim_class ~ . - crim - chas - tax, family = binomial,
data = Boston, subset = train)

Deviance Residuals:
Min 1Q Median 3Q Max
-3.07309 -0.06280 0.00000 0.04518 2.52250

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -95.004862 20.048738 -4.739 2.15e-06 ***
zn -0.850179 0.184521 -4.607 4.08e-06 ***
indus 0.413284 0.157156 2.630 0.008544 **
nox 89.142048 19.552114 4.559 5.13e-06 ***
rm -4.631311 1.678561 -2.759 0.005796 **
age 0.050660 0.023105 2.193 0.028337 *
dis 4.513311 0.954809 4.727 2.28e-06 ***
rad 2.968052 0.677653 4.380 1.19e-05 ***
ptratio 1.483118 0.369531 4.014 5.98e-05 ***
black -0.016615 0.006504 -2.554 0.010636 *
lstat 0.209340 0.086666 2.415 0.015714 *
medv 0.631766 0.183235 3.448 0.000565 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 329.367 on 252 degrees of freedom
Residual deviance: 70.529 on 241 degrees of freedom
AIC: 94.529

Number of Fisher Scoring iterations: 10
glm.pred = rep(0, length(y.test))
glm.prob = predict(glm.fit, Boston.test, type = 'response')
glm.pred[glm.prob > 0.5] = 1
mean(glm.pred != y.test)
[1] 0.1857708

My query is: in this scenario, which model should I choose?
--> The model with higher accuracy but some statistically insignificant predictors?
or
--> The model with lower accuracy but with all predictors statistically significant?
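One hedged sketch of how the two candidates could be compared more robustly than by a single test-set split is 10-fold cross-validation on the training rows via `cv.glm()` from the boot package (the refitted model objects `fit1`/`fit2` and the custom misclassification cost are assumptions, not part of the original code):

```r
library(MASS)   # Boston data set
library(boot)   # cv.glm for cross-validation

# Rebuild the setup described in the post
Boston$crim_class <- as.integer(Boston$crim >= median(Boston$crim))
train <- 1:(nrow(Boston) / 2)

# The two candidate models, refitted on the training half only
fit1 <- glm(crim_class ~ zn + indus + chas + nox + rm + age,
            data = Boston[train, ], family = binomial)
fit2 <- glm(crim_class ~ . - crim - chas - tax,
            data = Boston[train, ], family = binomial)

# Misclassification rate at a 0.5 probability cutoff
cost <- function(y, p) mean((p > 0.5) != y)

# 10-fold cross-validated error estimates for each model
cv.glm(Boston[train, ], fit1, cost, K = 10)$delta[1]
cv.glm(Boston[train, ], fit2, cost, K = 10)$delta[1]
```

The model with the lower cross-validated error would then be preferred regardless of individual coefficient p-values, since the goal here is prediction rather than inference.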
Posts: 1
Participants: 1