ONE STEP Execution of Stepwise Variable Selection in R using p-value

In [2]:
library(dplyr)
library(dummies)

Stepwise Variable Selection Algorithm

In [3]:
# Author: Endale B Altaye
# Dec 2014
# BGSU, For regression class

StepwiseAlgorithm=function(data,responceVariable,alphaToEnter,alphaToRemove)
{
dataname=data.frame(data)
response=responceVariable
varname=names(dataname)
predVarname=varname[varname!=response]
# Simple linear regression list for forward case
SLRList=lapply(predVarname, function(x){fml = as.formula(sprintf("%s ~ %s", response, x))
                                summary(lm(fml, data = dataname))$coeff})
# Extracting the p-values    
pvalueVector=sapply(1:length(predVarname), function(x) pvalue=c(SLRList[[x]][2,4]))
     # 2 in above line represents the second row which is the variable since the first row is intercept
    # 4 for the p-value column

#Picking the most probable significant variable, min p-value
m=which(pvalueVector==min(pvalueVector))
    
#Comparing the p-value of the most probable sig. with the alphaToEnter value
if(pvalueVector[m]<=alphaToEnter)
{
sigvariable=c(predVarname[m])
Xreduc=predVarname[-m]
Xnew=sigvariable
message (sprintf("Variable entered/removed per each step with alpha to enter = %s and alpha to stay =%s :",alphaToEnter,alphaToRemove))
print(sprintf("step 1 Selected variable=%s, the p-value was %s",Xnew,min(pvalueVector)))
}
else
{stop("No Significant Variable to include.") # stops the algorithm returning no sig variable to include
}
k=1
repeat
    {
i=0
k=k+1
# checking if there exist variable to include into the model for forward alg. ifnot stop algorithm
stopifnot(length(Xreduc)>0)
# checking if forward alg results more than one pred variable or not 
sigvarcomb=ifelse(length(Xnew)>1, paste(Xnew, collapse=" + ") , Xnew[1])
# running forward step
SLRList=lapply(Xreduc, function(x){fml = as.formula(sprintf("%s ~ %s+%s",response,x,sigvarcomb))
summary(lm(fml, data = dataname))$coeff})
pvalueVector=sapply(1:length(Xreduc), function(x) pvalue=c(SLRList[[x]][2,4]))
    
m=which(pvalueVector==min(pvalueVector))
if(pvalueVector[m]<=alphaToEnter)
{ Xnew=c(Xnew,Xreduc[m])
print(sprintf("step %s Selected variable=%s, the p-value was %s",k,Xreduc[m],min(pvalueVector)))
Xreduc=Xreduc[-m]
lengthXnew=length(Xnew)
j=1
while(j<=lengthXnew)
{
i=i+1
# running backward elimination
mymodel=lm(as.formula(paste(response,paste(Xnew,collapse="+"),sep="~")), data = dataname)
pvalue=summary(mymodel)$coeff[-1,4]
if(max(pvalue)>alphaToRemove)
{
maxpvalue=which(pvalue==max(pvalue))
print(sprintf("step %s.%s Removed variable=%s, the p-value was %s",k,i,Xnew[maxpvalue],max(pvalue)))
Xnew=Xnew[-maxpvalue]
j=j+1
}
else {
j=lengthXnew+1
Xreduc=setdiff(predVarname,Xnew)
}
}
 }
else { break }
}
                    
finalmodel=lm(as.formula(paste(response,paste(Xnew,collapse="+"),sep="~")), data = dataname)
message("Summary of Finally Selected Model is:")
summary(finalmodel)
}
In [5]:
housingprice <- read.delim(".../housingprice.txt")
#colnames(housingprice)
In [7]:
str(housingprice)
'data.frame':	108 obs. of  11 variables:
 $ SQFT    : num  6 37 11 13 5 11 42 38 44 10 ...
 $ BEDS    : Factor w/ 4 levels "3","4","5","6": 1 1 2 1 1 1 1 2 2 1 ...
 $ BATHS   : Factor w/ 3 levels "2","3","4": 1 1 1 1 1 1 1 2 1 1 ...
 $ HEAT    : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
 $ STYLE   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ GARAGE  : Factor w/ 3 levels "1","2","3": 1 2 2 2 1 2 2 2 2 2 ...
 $ BASEMENT: Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
 $ FIRE    : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 1 1 1 2 ...
 $ AGE     : num  4 5 9 3 19 9 13 4 10 5 ...
 $ PRICE   : num  35 38 39 39 40 42 43 45 47 48 ...
 $ SCHOOL  : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
In [8]:
housingpriceWithDummy=dummy.data.frame(housingprice,verbose=TRUE)
  BEDS : 4 dummy variables created
  BATHS : 3 dummy variables created
  HEAT : 2 dummy variables created
  STYLE : 3 dummy variables created
  GARAGE : 3 dummy variables created
  BASEMENT : 2 dummy variables created
  FIRE : 2 dummy variables created
  SCHOOL : 2 dummy variables created
In [9]:
str(housingpriceWithDummy)
'data.frame':	108 obs. of  24 variables:
 $ SQFT     : num  6 37 11 13 5 11 42 38 44 10 ...
 $ BEDS3    : int  1 1 0 1 1 1 1 0 0 1 ...
 $ BEDS4    : int  0 0 1 0 0 0 0 1 1 0 ...
 $ BEDS5    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ BEDS6    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ BATHS2   : int  1 1 1 1 1 1 1 0 1 1 ...
 $ BATHS3   : int  0 0 0 0 0 0 0 1 0 0 ...
 $ BATHS4   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ HEAT0    : int  1 0 1 1 1 1 1 1 1 1 ...
 $ HEAT1    : int  0 1 0 0 0 0 0 0 0 0 ...
 $ STYLE0   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ STYLE1   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ STYLE2   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ GARAGE1  : int  1 0 0 0 1 0 0 0 0 0 ...
 $ GARAGE2  : int  0 1 1 1 0 1 1 1 1 1 ...
 $ GARAGE3  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ BASEMENT0: int  0 1 0 0 1 0 0 0 0 1 ...
 $ BASEMENT1: int  1 0 1 1 0 1 1 1 1 0 ...
 $ FIRE0    : int  0 0 1 0 0 0 1 1 1 0 ...
 $ FIRE1    : int  1 1 0 1 1 1 0 0 0 1 ...
 $ AGE      : num  4 5 9 3 19 9 13 4 10 5 ...
 $ PRICE    : num  35 38 39 39 40 42 43 45 47 48 ...
 $ SCHOOL0  : int  0 1 1 1 1 1 1 1 1 1 ...
 $ SCHOOL1  : int  1 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "dummies")=List of 8
  ..$ BEDS    : int  2 3 4 5
  ..$ BATHS   : int  6 7 8
  ..$ HEAT    : int  9 10
  ..$ STYLE   : int  11 12 13
  ..$ GARAGE  : int  14 15 16
  ..$ BASEMENT: int  17 18
  ..$ FIRE    : int  19 20
  ..$ SCHOOL  : int  23 24
In [10]:
housingpriceWithDummy=housingpriceWithDummy %>% select(-c(BEDS3,BATHS2,HEAT0,STYLE0,GARAGE1,BASEMENT0,FIRE0,SCHOOL0))
In [11]:
StepwiseAlgorithm(housingpriceWithDummy,"PRICE",0.15,0.25) # adding variable into the model with pvalue less than 15%
                                                          # Droping variable outof the model if its pvalue is more than 25 %
Variable entered/removed per each step with alpha to enter = 0.15 and alpha to stay =0.25 :
[1] "step 1 Selected variable=SQFT, the p-value was 2.98341451779726e-07"
[1] "step 2 Selected variable=BEDS5, the p-value was 0.036913372026464"
[1] "step 3 Selected variable=GARAGE2, the p-value was 0.109532891700503"
Summary of Finally Selected Model is:
Call:
lm(formula = as.formula(paste(response, paste(Xnew, collapse = "+"), 
    sep = "~")), data = dataname)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.372 -13.487  -1.253  13.096  51.677 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  55.46303    7.77601   7.133 1.36e-10 ***
SQFT         -0.40162    0.07735  -5.192 1.04e-06 ***
BEDS5       -16.42947    7.66845  -2.142   0.0345 *  
GARAGE2      10.98170    6.80355   1.614   0.1095    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.99 on 104 degrees of freedom
Multiple R-squared:  0.2704,	Adjusted R-squared:  0.2494 
F-statistic: 12.85 on 3 and 104 DF,  p-value: 3.31e-07

Finally, of course, residual diagnostics for the assumptions should follow