Algorithms I Developed for Forward, Backward, and Stepwise Variable Selection in Multiple Linear Regression Using Only P-values
I was working on a multiple linear regression project in R with a large number of continuous and categorical variables when I found myself wasting time on manual stepwise variable selection: searching for a significant variable to add to the model, then checking whether any variable should be dropped, and so on at every step. It was very tiring. I could have used the existing step() function from the stats package or regsubsets() from the leaps package. However, these functions compare models with criteria such as AIC rather than p-values. AIC is one of the accepted methods for model comparison and variable selection, but it is considerably harder to interpret, and harder still to explain to a non-technical person. In addition, we pick the model with the smaller AIC, but how much smaller is small enough? That is why I decided to write these algorithms, which evaluate only the p-value to add or drop a variable in forward, backward, and stepwise variable selection.
Advantages of using p-values for variable selection
- P-values are straightforward to interpret and to explain to a non-technical person.
- Different research areas apply different values of alpha, the level of significance. Because each p-value is compared with alpha, we can run variable selection at whatever alpha value we, the researchers, choose.
- The alpha value for adding a variable (alpha-to-enter) is often not equal to the alpha value for keeping a variable (alpha-to-stay). The researcher should decide both based on the application area and then do the variable selection using the resulting p-values.
- By watching the magnitude of the p-values against these thresholds, we can tune the complexity of the model.
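To make the comparison between p-values and alpha concrete, here is a minimal sketch of how the p-values can be read from a fitted lm() object and checked against the two thresholds. The built-in mtcars data, the chosen predictors, and the threshold values are assumptions made only for illustration.

```r
# Illustrative model on the built-in mtcars data (an assumption, not from this post)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# The "Pr(>|t|)" column of the coefficient table holds each term's p-value;
# the first row is the intercept, which is not a candidate for selection
pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]

alphaToEnter <- 0.05   # hypothetical threshold for adding a variable
alphaToStay  <- 0.10   # hypothetical threshold for keeping a variable

# Terms that would stay in the model at alphaToStay
keep <- names(pvals)[pvals <= alphaToStay]

# The weakest term, which is the first candidate for removal
worst <- names(pvals)[which.max(pvals)]

print(round(pvals, 4))
print(keep)
print(worst)
```

Raising or lowering the two thresholds changes which terms end up in keep, which is how the alpha values control model size.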
Thus, there is a real benefit to using the algorithms I prepared, which do variable selection using p-values. Each variable selection method (backward, forward, and stepwise) is written separately. It would be best to have a single package for all three, but that does not exist yet. I am working on it.
Instructions for using the algorithms
- Every categorical variable (factor) should be converted into dummy variables. You can use the dummy.data.frame() function from the dummies package, which creates one column for each value. The reference-level column of each categorical variable should then be dropped from the data frame, since keeping it would make the dummy columns perfectly collinear with the intercept.
- Decide the thresholds in advance and pass them as input: alpha-to-enter for forward selection, alpha-to-stay for backward elimination, and both for stepwise selection.
- After saving the algorithm, call forwardSelection(data,"responceVariable",alphaToEnter), backwardElimination(data,"responceVariable",alphaToRemove), or StepwiseAlgorithm(data,"responceVariable",alphaToEnter,alphaToRemove), according to the researcher's preference. Here alphaToRemove is the alpha-to-stay threshold.
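The data preparation step above can be sketched as follows. This is one possible approach using base R's model.matrix() instead of dummy.data.frame(); the toy data frame and its column names (y, x1, group) are hypothetical.

```r
# Toy data with one numeric predictor and one factor (hypothetical names)
df <- data.frame(
  y     = c(10, 12, 9, 15, 14, 11),
  x1    = c(1.0, 2.1, 0.5, 3.2, 2.8, 1.7),
  group = factor(c("a", "b", "c", "a", "b", "c"))
)

# With an intercept in the formula, model.matrix() expands each factor into
# 0/1 columns and leaves out the first level ("a") as the reference level
X <- model.matrix(y ~ ., data = df)

# Drop the intercept column; lm() adds its own intercept later
X <- X[, colnames(X) != "(Intercept)", drop = FALSE]

# Response plus dummy-coded predictors, ready for the selection functions
prepared <- data.frame(y = df$y, X)
print(names(prepared))
```

The resulting data frame contains only numeric columns, with the reference level of group already removed, which is the input shape the instructions above describe.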
Algorithms
Backward Elimination algorithm using backwardElimination(data,"responceVariable",alphaToRemove)
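A minimal sketch of backward elimination by p-value, matching the call signature above. The name backwardEliminationSketch is hypothetical, and this is not necessarily how backwardElimination() itself is implemented. It assumes that all predictors are numeric (dummy-coded as described in the instructions), have syntactic column names, and are not perfectly collinear.

```r
# Sketch only: repeatedly drop the predictor with the largest p-value
# until every remaining predictor has p-value <= alphaToRemove
backwardEliminationSketch <- function(data, response, alphaToRemove) {
  predictors <- setdiff(names(data), response)

  while (length(predictors) > 0) {
    fit   <- lm(reformulate(predictors, response = response), data = data)
    pvals <- summary(fit)$coefficients[, "Pr(>|t|)"]
    pvals <- pvals[names(pvals) != "(Intercept)"]

    # Stop when the weakest predictor is still significant
    if (max(pvals) <= alphaToRemove) break

    # Otherwise remove the weakest predictor and refit
    predictors <- setdiff(predictors, names(pvals)[which.max(pvals)])
  }

  # Fall back to an intercept-only model if nothing survived
  if (length(predictors) == 0) predictors <- "1"
  lm(reformulate(predictors, response = response), data = data)
}

# Illustrative call on a subset of the built-in mtcars data
final <- backwardEliminationSketch(
  mtcars[, c("mpg", "wt", "hp", "qsec", "drat")], "mpg", 0.05
)
print(summary(final)$coefficients)
```

Only one variable is removed per iteration, because dropping a variable changes the p-values of all the others.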
Stepwise Variable Selection algorithm using StepwiseAlgorithm(data,"responceVariable",alphaToEnter,alphaToRemove)
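A minimal sketch of stepwise selection by p-value, matching the call signature above. The names stepwiseSketch and termPValues and the maxSteps argument are hypothetical, and this is not necessarily how StepwiseAlgorithm() itself is implemented. It makes the same assumptions as before: numeric (dummy-coded) predictors with syntactic names and no perfect collinearity.

```r
# p-values of the non-intercept terms for a model with the given predictors
termPValues <- function(data, response, predictors) {
  fit <- lm(reformulate(predictors, response = response), data = data)
  p   <- summary(fit)$coefficients[, "Pr(>|t|)"]
  p[names(p) != "(Intercept)"]
}

# Sketch only: alternate a forward step and a backward step until nothing changes
stepwiseSketch <- function(data, response, alphaToEnter, alphaToRemove,
                           maxSteps = 100) {
  # Usual safeguard against adding and dropping the same variable in a loop
  stopifnot(alphaToEnter <= alphaToRemove)
  candidates <- setdiff(names(data), response)
  selected   <- character(0)

  for (step in seq_len(maxSteps)) {   # maxSteps is an extra safety cap
    changed <- FALSE

    # Forward step: add the unused variable with the smallest p-value, if significant
    remaining <- setdiff(candidates, selected)
    if (length(remaining) > 0) {
      enterP <- vapply(remaining, function(v) {
        termPValues(data, response, c(selected, v))[[v]]
      }, numeric(1))
      if (min(enterP) < alphaToEnter) {
        selected <- c(selected, remaining[which.min(enterP)])
        changed  <- TRUE
      }
    }

    # Backward step: drop the weakest selected variable if it no longer qualifies
    if (length(selected) > 0) {
      stayP <- termPValues(data, response, selected)
      if (max(stayP) > alphaToRemove) {
        selected <- setdiff(selected, names(stayP)[which.max(stayP)])
        changed  <- TRUE
      }
    }

    if (!changed) break
  }
  selected
}

# Illustrative call on a subset of the built-in mtcars data
sel <- stepwiseSketch(mtcars[, c("mpg", "wt", "hp", "qsec", "drat")],
                      "mpg", 0.05, 0.10)
print(sel)
```

The sketch returns the names of the selected variables rather than a fitted model, so the final model can be refit with lm() on just those columns.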
More to come...