One-Step Execution of Backward Elimination in R Using p-values

In [2]:
library(dplyr)
library(dummies)
In [3]:
# Author: Endale B Altaye
# Dec 2014
# BGSU, For Regression Class

backwardElimination=function(data,responceVariable,alphaToRemove) 
{
dataname=data.frame(data) 
response=responceVariable
varname=names(dataname)
predVarname=varname[varname!=response]
# computing MLR containing all variables    
mymodel=lm(as.formula(paste(response,paste(predVarname,collapse="+"),sep="~")),data=dataname)
# Extract the p-value of each predictor, excluding the intercept (row 1)
pvalue=summary(mymodel)$coeff[-1,4] 
i=1
# alphaToRemove is the function argument; the original referenced an undefined alphaToStay
message(sprintf("Variable removed at each step with alpha to stay value = %s:",alphaToRemove))
newpredvarname=predVarname
repeat
{
# compare the largest p-value (the least significant predictor) with alphaToRemove
if(max(pvalue)>alphaToRemove)
{
# which.max returns a single index even when two p-values tie
mostinSig=which.max(pvalue)
removedvar=newpredvarname[mostinSig]
newpredvarname=newpredvarname[-mostinSig]
print(sprintf("step %s removed variable=%s, the p-value was %s",i,removedvar,max(pvalue)))
if(length(newpredvarname)==0)
{
# every predictor has been dropped
Finalresult=print("No Significant variable Remains")
break
}
i=i+1
mymodel=lm(as.formula(paste(response,paste(newpredvarname,collapse="+"),sep="~")),data=dataname)
pvalue=summary(mymodel)$coeff[-1,4]
}
else {
Finalresult= summary(mymodel)
break }
}
if(length(predVarname)==length(newpredvarname)) {message("No Variable Removed")}
message("Summary of Finally Selected Model is:")
Finalresult
}
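
To make the loop body concrete, here is a minimal sketch of a single elimination step written out by hand. It assumes the built-in `mtcars` data and an arbitrary set of four predictors chosen only for illustration; the name `alpha_to_remove` is likewise hypothetical and simply mirrors the function argument above.

```r
# One backward-elimination step on the built-in mtcars data (illustrative only).
predictors <- c("wt", "hp", "qsec", "drat")
full <- lm(reformulate(predictors, response = "mpg"), data = mtcars)

# Column 4 of the coefficient table holds Pr(>|t|); row 1 is the intercept.
pvals <- summary(full)$coefficients[-1, 4]

# which.max returns a single index even if two p-values tie.
worst <- names(pvals)[which.max(pvals)]

# Drop the least significant predictor only if it exceeds the threshold.
alpha_to_remove <- 0.2
if (pvals[worst] > alpha_to_remove) {
  predictors <- setdiff(predictors, worst)
}

# Refit with whatever predictors remain; the full procedure repeats from here.
reduced <- lm(reformulate(predictors, response = "mpg"), data = mtcars)
print(worst)
print(predictors)
```

Selecting the variable by name rather than by position keeps the step correct even if the order of the coefficient table ever differs from the order of the predictor vector.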

Example

In [4]:
housingprice <- read.delim(".../housingprice.txt")
In [6]:
str(housingprice)
'data.frame':	108 obs. of  11 variables:
 $ SQFT    : num  6 37 11 13 5 11 42 38 44 10 ...
 $ BEDS    : Factor w/ 4 levels "3","4","5","6": 1 1 2 1 1 1 1 2 2 1 ...
 $ BATHS   : Factor w/ 3 levels "2","3","4": 1 1 1 1 1 1 1 2 1 1 ...
 $ HEAT    : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
 $ STYLE   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ GARAGE  : Factor w/ 3 levels "1","2","3": 1 2 2 2 1 2 2 2 2 2 ...
 $ BASEMENT: Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
 $ FIRE    : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 1 1 1 2 ...
 $ AGE     : num  4 5 9 3 19 9 13 4 10 5 ...
 $ PRICE   : num  35 38 39 39 40 42 43 45 47 48 ...
 $ SCHOOL  : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
In [7]:
housingpriceWithDummy=dummy.data.frame(housingprice,verbose=TRUE)
  BEDS : 4 dummy variables created
  BATHS : 3 dummy variables created
  HEAT : 2 dummy variables created
  STYLE : 3 dummy variables created
  GARAGE : 3 dummy variables created
  BASEMENT : 2 dummy variables created
  FIRE : 2 dummy variables created
  SCHOOL : 2 dummy variables created
In [8]:
str(housingpriceWithDummy)
'data.frame':	108 obs. of  24 variables:
 $ SQFT     : num  6 37 11 13 5 11 42 38 44 10 ...
 $ BEDS3    : int  1 1 0 1 1 1 1 0 0 1 ...
 $ BEDS4    : int  0 0 1 0 0 0 0 1 1 0 ...
 $ BEDS5    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ BEDS6    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ BATHS2   : int  1 1 1 1 1 1 1 0 1 1 ...
 $ BATHS3   : int  0 0 0 0 0 0 0 1 0 0 ...
 $ BATHS4   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ HEAT0    : int  1 0 1 1 1 1 1 1 1 1 ...
 $ HEAT1    : int  0 1 0 0 0 0 0 0 0 0 ...
 $ STYLE0   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ STYLE1   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ STYLE2   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ GARAGE1  : int  1 0 0 0 1 0 0 0 0 0 ...
 $ GARAGE2  : int  0 1 1 1 0 1 1 1 1 1 ...
 $ GARAGE3  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ BASEMENT0: int  0 1 0 0 1 0 0 0 0 1 ...
 $ BASEMENT1: int  1 0 1 1 0 1 1 1 1 0 ...
 $ FIRE0    : int  0 0 1 0 0 0 1 1 1 0 ...
 $ FIRE1    : int  1 1 0 1 1 1 0 0 0 1 ...
 $ AGE      : num  4 5 9 3 19 9 13 4 10 5 ...
 $ PRICE    : num  35 38 39 39 40 42 43 45 47 48 ...
 $ SCHOOL0  : int  0 1 1 1 1 1 1 1 1 1 ...
 $ SCHOOL1  : int  1 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "dummies")=List of 8
  ..$ BEDS    : int  2 3 4 5
  ..$ BATHS   : int  6 7 8
  ..$ HEAT    : int  9 10
  ..$ STYLE   : int  11 12 13
  ..$ GARAGE  : int  14 15 16
  ..$ BASEMENT: int  17 18
  ..$ FIRE    : int  19 20
  ..$ SCHOOL  : int  23 24

Dropping the reference column (the first level) of each categorical variable

In [10]:
housingpriceWithDummy=housingpriceWithDummy %>% select(-c(BEDS3,BATHS2,HEAT0,STYLE0,GARAGE1,BASEMENT0,FIRE0,SCHOOL0))
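
One column per factor has to go because the full set of indicators for a factor sums to 1 in every row, which makes them collinear with the intercept. As a point of comparison, base R applies the same reference-level coding automatically when a factor appears in a formula. The sketch below uses a made-up toy data frame (the names `toy`, `y`, and `garage` are hypothetical) and assumes R's default treatment contrasts.

```r
# Hypothetical toy data; the column names are made up for illustration.
toy <- data.frame(
  y      = c(10, 12, 15, 11, 14, 18),
  garage = factor(c("1", "2", "3", "1", "2", "3"))
)

# With default treatment contrasts, a k-level factor becomes k - 1 indicator
# columns, and the first level serves as the reference.
X <- model.matrix(y ~ garage, data = toy)
print(colnames(X))

# lm() performs the same expansion internally.
fit <- lm(y ~ garage, data = toy)
print(names(coef(fit)))
```

Manual dummy coding, as done here, is still needed when each level should be eligible for removal on its own, since a formula term treats the factor as a single unit.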

Running the algorithm

In [11]:
backwardElimination(housingpriceWithDummy,"PRICE",0.2) # variables stay in the model only if their p-value is at most 0.2
Variable removed at each step with alpha to stay value = 0.2:
[1] "step 1 removed variable=SCHOOL1, the p-value was 0.982195817776971"
[1] "step 2 removed variable=BATHS4, the p-value was 0.875456044721044"
[1] "step 3 removed variable=BATHS3, the p-value was 0.887656543262546"
[1] "step 4 removed variable=GARAGE3, the p-value was 0.649062642381856"
[1] "step 5 removed variable=BEDS4, the p-value was 0.649565011594501"
[1] "step 6 removed variable=HEAT1, the p-value was 0.599513115307359"
[1] "step 7 removed variable=FIRE1, the p-value was 0.609049549046296"
[1] "step 8 removed variable=STYLE2, the p-value was 0.649761078035323"
[1] "step 9 removed variable=AGE, the p-value was 0.565047744637062"
[1] "step 10 removed variable=BEDS6, the p-value was 0.407515878300097"
[1] "step 11 removed variable=BASEMENT1, the p-value was 0.236681455326157"
[1] "step 12 removed variable=STYLE1, the p-value was 0.301005424473968"
Summary of Finally Selected Model is:
Call:
lm(formula = as.formula(paste(response, paste(newpredvarname, 
    collapse = "+"), sep = "~")), data = dataname)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.372 -13.487  -1.253  13.096  51.677 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  55.46303    7.77601   7.133 1.36e-10 ***
SQFT         -0.40162    0.07735  -5.192 1.04e-06 ***
BEDS5       -16.42947    7.66845  -2.142   0.0345 *  
GARAGE2      10.98170    6.80355   1.614   0.1095    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.99 on 104 degrees of freedom
Multiple R-squared:  0.2704,	Adjusted R-squared:  0.2494 
F-statistic: 12.85 on 3 and 104 DF,  p-value: 3.31e-07
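
The `Call:` line in the summary shows the unevaluated `paste` expression rather than the selected formula. One possible way to get a readable call, sketched here on the built-in `mtcars` data with arbitrarily chosen predictors, is to build the formula with `reformulate` and substitute it into the call with `bquote` before evaluating. This is an optional variation, not part of the function above.

```r
# Build the formula object from character vectors (illustrative predictors).
f <- reformulate(c("wt", "hp"), response = "mpg")

# bquote() splices the formula itself into the call, so the stored call
# shows the actual model formula instead of the code that built it.
fit <- eval(bquote(lm(.(f), data = mtcars)))
print(fit$call)
```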

Finally, residual diagnostics should follow to check the model assumptions.
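
A minimal sketch of such diagnostics follows, using a model fitted to the built-in `mtcars` data as a stand-in for the selected model. The choice of checks is an assumption for illustration; which diagnostics matter depends on the analysis.

```r
# Stand-in model; replace with the model selected by backward elimination.
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# Shapiro-Wilk test of residual normality; a small p-value suggests
# the residuals are not normally distributed.
sw <- shapiro.test(res)
print(sw$p.value)

# The four standard lm diagnostic plots (residuals vs fitted, normal Q-Q,
# scale-location, residuals vs leverage), written to a temporary PDF so the
# code also runs in a non-interactive session.
pdf(file.path(tempdir(), "diagnostics.pdf"))
par(mfrow = c(2, 2))
plot(fit)
dev.off()
```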