One-Step Execution of Forward Variable Selection in R Using p-values

In [2]:
library(dplyr)
library(dummies) # converts categorical variables into dummy (indicator) columns

Forward Variable Selection Algorithm

In [3]:
# Author: Endale B Altaye
# Dec 2014
# BGSU, for regression class
forwardSelection=function(data,responceVariable,alphaToEnter)
{
  dataname=data.frame(data)
  response=responceVariable
  varname=names(dataname)
  predVarname=varname[varname!=response]
  # Fit one simple linear regression per candidate predictor
  SLRList=lapply(predVarname, function(x){SLR = as.formula(sprintf("%s ~ %s", response, x))
                                          summary(lm(SLR, data = dataname))$coeff})
  # Extract the p-values: row 2 is the candidate variable (row 1 is the intercept),
  # column 4 is the p-value column
  pvalueVector=sapply(seq_along(predVarname), function(x) SLRList[[x]][2,4])

  # Pick the candidate with the smallest p-value (which.min returns one index even with ties)
  m=which.min(pvalueVector)

  # Compare that p-value with alphaToEnter
  if(pvalueVector[m]<=alphaToEnter)
  {
    selectedvar=predVarname[m]
    predVarname=predVarname[-m]
    message(sprintf("Variable entered per each step with alpha to enter = %s:",alphaToEnter))
    print(sprintf("step 1 Selected variable=%s, the p-value was %s",selectedvar,min(pvalueVector)))
    k=1
    # Keep trying to add variables while candidates remain
    while(length(predVarname)>0)
    {
      sigvarcomb=paste(selectedvar, collapse=" + ")
      # The candidate x comes first in the formula, so it is still row 2 of the coefficient table
      SLRList=lapply(predVarname, function(x){SLR = as.formula(sprintf("%s ~ %s+%s",response,x,sigvarcomb))
                                              summary(lm(SLR, data = dataname))$coeff})
      pvalueVector=sapply(seq_along(predVarname), function(x) SLRList[[x]][2,4])
      m=which.min(pvalueVector)
      # Stop when the best remaining candidate is not significant; the rest are insignificant too
      if(pvalueVector[m]>alphaToEnter) break
      k=k+1
      addedvar=predVarname[m]
      selectedvar=c(selectedvar,addedvar)
      predVarname=predVarname[-m]
      print(sprintf("step %s Selected variable=%s, the p-value was %s",k,addedvar,min(pvalueVector)))
    }
    # Every candidate was significant and has been included
    if(length(predVarname)==0) message("All Variables Selected")
    # Refit the final model on all selected variables and return its coefficient table
    sigvarcomb=paste(selectedvar, collapse=" + ")
    SLRList=summary(lm(as.formula(sprintf("%s ~ %s",response,sigvarcomb)),data = dataname))$coeff
    message("Summary of Finally Selected Model is:")
    SLRList
  } else {print("No significant variable at the given alpha")}
}
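
A single entry step can also be checked against base R. The sketch below is not part of the function above: it uses `add1()` from the `stats` package on the built-in `mtcars` data (a stand-in data set, with `mpg` as an assumed response and `wt`, `hp`, `qsec` as assumed candidates). For a candidate that adds a single coefficient, the F-test p-value reported by `add1()` equals the t-test p-value read from the coefficient table.

```r
# Stand-in data: built-in mtcars, response mpg, three candidate predictors
null_fit <- lm(mpg ~ 1, data = mtcars)

# One forward step: F-test for adding each candidate to the current model
step1 <- add1(null_fit, scope = ~ wt + hp + qsec, test = "F")
print(step1)

# p-values of the candidates (the first row, "<none>", is the current model)
pvals <- step1[-1, "Pr(>F)"]
names(pvals) <- rownames(step1)[-1]
best <- names(which.min(pvals))

# Same p-value taken from the coefficient table of the simple regression
p_t <- summary(lm(mpg ~ wt, data = mtcars))$coefficients["wt", 4]
all.equal(unname(pvals["wt"]), p_t)
```

Comparing `pvals[best]` with the alpha-to-enter value then decides whether `best` enters the model.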

Example

In [4]:
housingprice <- read.delim(".../housingprice.txt")
#colnames(housingprice)
In [6]:
str(housingprice)
'data.frame':	108 obs. of  11 variables:
 $ SQFT    : num  6 37 11 13 5 11 42 38 44 10 ...
 $ BEDS    : Factor w/ 4 levels "3","4","5","6": 1 1 2 1 1 1 1 2 2 1 ...
 $ BATHS   : Factor w/ 3 levels "2","3","4": 1 1 1 1 1 1 1 2 1 1 ...
 $ HEAT    : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
 $ STYLE   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ GARAGE  : Factor w/ 3 levels "1","2","3": 1 2 2 2 1 2 2 2 2 2 ...
 $ BASEMENT: Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
 $ FIRE    : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 1 1 1 2 ...
 $ AGE     : num  4 5 9 3 19 9 13 4 10 5 ...
 $ PRICE   : num  35 38 39 39 40 42 43 45 47 48 ...
 $ SCHOOL  : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
In [8]:
housingpriceWithDummy=dummy.data.frame(housingprice,verbose=TRUE)
  BEDS : 4 dummy varibles created
  BATHS : 3 dummy variables created
  HEAT : 2 dummy variables created
  STYLE : 3 dummy variables created
  GARAGE : 3 dummy variables created
  BASEMENT : 2 dummy variables created
  FIRE : 2 dummy variables created
  SCHOOL : 2 dummy variables created
In [9]:
str(housingpriceWithDummy)
'data.frame':	108 obs. of  24 variables:
 $ SQFT     : num  6 37 11 13 5 11 42 38 44 10 ...
 $ BEDS3    : int  1 1 0 1 1 1 1 0 0 1 ...
 $ BEDS4    : int  0 0 1 0 0 0 0 1 1 0 ...
 $ BEDS5    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ BEDS6    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ BATHS2   : int  1 1 1 1 1 1 1 0 1 1 ...
 $ BATHS3   : int  0 0 0 0 0 0 0 1 0 0 ...
 $ BATHS4   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ HEAT0    : int  1 0 1 1 1 1 1 1 1 1 ...
 $ HEAT1    : int  0 1 0 0 0 0 0 0 0 0 ...
 $ STYLE0   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ STYLE1   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ STYLE2   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ GARAGE1  : int  1 0 0 0 1 0 0 0 0 0 ...
 $ GARAGE2  : int  0 1 1 1 0 1 1 1 1 1 ...
 $ GARAGE3  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ BASEMENT0: int  0 1 0 0 1 0 0 0 0 1 ...
 $ BASEMENT1: int  1 0 1 1 0 1 1 1 1 0 ...
 $ FIRE0    : int  0 0 1 0 0 0 1 1 1 0 ...
 $ FIRE1    : int  1 1 0 1 1 1 0 0 0 1 ...
 $ AGE      : num  4 5 9 3 19 9 13 4 10 5 ...
 $ PRICE    : num  35 38 39 39 40 42 43 45 47 48 ...
 $ SCHOOL0  : int  0 1 1 1 1 1 1 1 1 1 ...
 $ SCHOOL1  : int  1 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "dummies")=List of 8
  ..$ BEDS    : int  2 3 4 5
  ..$ BATHS   : int  6 7 8
  ..$ HEAT    : int  9 10
  ..$ STYLE   : int  11 12 13
  ..$ GARAGE  : int  14 15 16
  ..$ BASEMENT: int  17 18
  ..$ FIRE    : int  19 20
  ..$ SCHOOL  : int  23 24

Dropping one reference column for each categorical variable

In [10]:
housingpriceWithDummy=housingpriceWithDummy %>% select(-c(BEDS3,BATHS2,HEAT0,STYLE0,GARAGE1,BASEMENT0,FIRE0,SCHOOL0))
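
One level per factor is dropped because, in a model with an intercept, the full set of dummy columns for a factor sums to one and is perfectly collinear with the intercept. As a hedged alternative to listing the reference columns by hand, base R's `model.matrix()` omits the first level of each factor under the default treatment contrasts. The sketch uses a small made-up data frame (`toy`, hypothetical values), not the housing data.

```r
# Toy data frame (hypothetical values, not the housing data)
toy <- data.frame(
  PRICE  = c(35, 38, 39, 45, 47, 52),
  SQFT   = c(6, 37, 11, 38, 44, 10),
  GARAGE = factor(c(1, 2, 2, 3, 2, 1))
)

# model.matrix() expands factors with treatment contrasts (the R default),
# which leave out the first level (GARAGE1 here) as the reference category
X <- model.matrix(PRICE ~ ., data = toy)

# Drop the intercept column and rebuild a data frame that includes the response
toyWithDummy <- data.frame(PRICE = toy$PRICE, X[, -1, drop = FALSE])
names(toyWithDummy)
```

The resulting data frame has the same shape as `housingpriceWithDummy` after the `select()` call: numeric columns plus one 0/1 column per non-reference level.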

Running the algorithm

In [12]:
forwardSelection(housingpriceWithDummy,"PRICE",0.2) # a variable enters the model only if its p-value is at most 0.2
Variable entered per each step with alpha to enter = 0.2:
[1] "step 1 Selected variable=SQFT, the p-value was 2.98341451779726e-07"
[1] "step 2 Selected variable=BEDS5, the p-value was 0.036913372026464"
[1] "step 3 Selected variable=GARAGE2, the p-value was 0.109532891700503"
Summary of Finally Selected Model is:
                 Estimate    Std. Error       t value      Pr(>|t|)
(Intercept)  5.546303e+01  7.776012e+00  7.132581e+00  1.358321e-10
SQFT        -4.016215e-01  7.735378e-02 -5.192008e+00  1.037673e-06
BEDS5        -16.42947398    7.66845054   -2.14247636    0.03448839
GARAGE2        10.9816968     6.8035458     1.6141138     0.1095329

Finally, residual diagnostics should follow to check the model assumptions.
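
A minimal sketch of such checks, assuming a fitted `lm` object is available. Because `forwardSelection` returns only a coefficient table, the final model would have to be refit with `lm()` first; here a stand-in model on the built-in `mtcars` data is used instead of the housing model.

```r
# Stand-in model on a built-in data set (mtcars), not the housing model
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# Normality of the residuals: Shapiro-Wilk test
shapiro.test(res)

# Rough constant-variance check: rank correlation between |residuals| and fitted values
cor.test(abs(res), fitted(fit), method = "spearman", exact = FALSE)
```

For the visual version, `plot(fit, which = 1:2)` draws the residuals-versus-fitted plot and the normal Q-Q plot.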