Bowling Green State University
(419) 378 - 9131
ebaltay@bgsu.edu
Multivariate multiple linear regression: several y’s and several x’s. Multivariate refers to the dependent variables and multiple pertains to the independent variables.
Steps:
* Overall significance test. Testing the existence of at least one predictor variable to explain at least one of the
response variable.
* Variable selection using Backward elimination
* Model improvement
+ Interaction included
+ R-square and S used for assessment
* Multivariate Normality assessed
* Residual assessment
libs <- c("car","xlsx","psych","dplyr","ggplot2","knitr","gridExtra")
lapply(libs,require, character.only= TRUE)
library(repr)
options(repr.plot.width=7,repr.plot.height=4)
multdata=read.xlsx("C:/Users/selamina/...data.xlsx",1)
#message("H0: B = O vs H1: B != O, (We want to know whether at least one Bjk != 0,
#to explain at least one of the response variable.")
multdataModel="cbind(y1,y2,y3)~x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x13+x14+x15+x16"
multdataFullModel<- lm(multdataModel, data=multdata)
c=diag(15)[2:15,] # first column is for the const term
(OverallFit <- linearHypothesis(model = multdataFullModel, hypothesis.matrix = c)) #Spits H, E and Sign for over all
#message("From all the four multivariate test statistics, we can conclude as there exist at least one potential predictor variable
#to explain at least one of the response variable.")
Sum of squares and products for the hypothesis:
y1 y2 y3
y1 138.56339 43.42221 56.09689
y2 43.42221 14.13711 17.47444
y3 56.09689 17.47444 32.55243
Sum of squares and products for error:
y1 y2 y3
y1 49.322325 10.376339 8.714629
y2 10.376339 2.649746 2.302865
y3 8.714629 2.302865 14.258286
Multivariate Tests:
Df test stat approx F num Df den Df Pr(>F)
Pillai 14 1.648181 1.741760 42 60.00000 0.0240225 *
Wilks 14 0.046435 2.339980 42 54.16198 0.0016965 **
Hotelling-Lawley 14 8.133309 3.227504 42 50.00000 4.7430e-05 ***
Roy 14 6.725995 9.608564 14 20.00000 4.9976e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see in all four multivariate test statistics, we can conclude that there exist at least one potential predictor variable to explain at least one of the response variable at 5 percent level of significance.
I chose to implement the Backward elimination, but could be tried Stepwise or forward selection as well. The concept for backward elimination is to compute $X_j$ given $X_{-j}$ , using Sum of Square of Type III; then drop the one with most insignificant one. Note that $X_{-j}$ means all the predictor variables except $j^{th}$.
#message("The concept is to compute Xj-given-all_X, using Sum of Square of Type III;
#then drop the one with most insignificant one or the least contributer.")
Manova(multdataFullModel,test.statistic="Wilks")
#message("The P-value per variable has been assessed and the one with largest p-value is dropped, which is X10 with p-value=0.89213")
Type II MANOVA Tests: Wilks test statistic
Df test stat approx F num Df den Df Pr(>F)
x3 1 0.94416 0.3549 3 18 0.7861932
x4 1 0.92825 0.4638 3 18 0.7110744
x5 1 0.73803 2.1297 3 18 0.1320678
x6 1 0.90414 0.6362 3 18 0.6013723
x7 1 0.95622 0.2747 3 18 0.8428341
x8 1 0.95665 0.2719 3 18 0.8448361
x9 1 0.87286 0.8740 3 18 0.4728694
x10 1 0.96709 0.2042 3 18 0.8921308
x11 1 0.91977 0.5234 3 18 0.6717012
x12 1 0.82645 1.2599 3 18 0.3178384
x13 1 0.86040 0.9735 3 18 0.4269728
x14 1 0.36264 10.5453 3 18 0.0003136 ***
x15 1 0.92527 0.4846 3 18 0.6971627
x16 1 0.75528 1.9441 3 18 0.1586741
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value for each variable has been assessed and the one with largest p-value is dropped, which is X10 with p-value=0.89213. Then X10 is dropped to move to next step.
reduced1=update(multdataFullModel,.~.-x10)
Manova(reduced1,test.statistic="Wilks")
Type II MANOVA Tests: Wilks test statistic
Df test stat approx F num Df den Df Pr(>F)
x3 1 0.94549 0.3651 3 19 0.7789325
x4 1 0.91062 0.6217 3 19 0.6096140
x5 1 0.74060 2.2183 3 19 0.1192034
x6 1 0.90616 0.6559 3 19 0.5891723
x7 1 0.96521 0.2283 3 19 0.8755587
x8 1 0.95985 0.2649 3 19 0.8498111
x9 1 0.87434 0.9102 3 19 0.4546124
x11 1 0.90829 0.6395 3 19 0.5989107
x12 1 0.76392 1.9573 3 19 0.1546839
x13 1 0.78192 1.7664 3 19 0.1876760
x14 1 0.36310 11.1092 3 19 0.0001957 ***
x15 1 0.92316 0.5272 3 19 0.6689373
x16 1 0.73759 2.2532 3 19 0.1151654
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Steps deleted here
reduced6=update(reduced5,.~.-x4)
Manova(reduced6,test.statistic="Wilks")
Type II MANOVA Tests: Wilks test statistic
Df test stat approx F num Df den Df Pr(>F)
x5 1 0.74266 2.7721 3 24 0.063404 .
x6 1 0.81382 1.8302 3 24 0.168603
x9 1 0.85135 1.3969 3 24 0.267960
x11 1 0.87631 1.1292 3 24 0.357064
x12 1 0.79453 2.0689 3 24 0.131006
x13 1 0.84582 1.4583 3 24 0.250879
x14 1 0.32212 16.8351 3 24 4.216e-06 ***
x16 1 0.56075 6.2665 3 24 0.002703 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
reduced7=update(reduced6,.~.-x11)
Manova(reduced7,test.statistic="Wilks")
#summary(Manova(reduced7))
Type II MANOVA Tests: Wilks test statistic
Df test stat approx F num Df den Df Pr(>F)
x5 1 0.73487 3.0066 3 25 0.04923 *
x6 1 0.79237 2.1837 3 25 0.11504
x9 1 0.82799 1.7313 3 25 0.18622
x12 1 0.78596 2.2694 3 25 0.10512
x13 1 0.85703 1.3902 3 25 0.26889
x14 1 0.32114 17.6156 3 25 2.348e-06 ***
x16 1 0.35388 15.2150 3 25 7.728e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the biggest p-value is 0.26889, its acceptable not to drop any variable afterwards, just to be on the safe side.
#1. R-Square For the final model selected using BackWard elimination
message("#1.R-Square")
Intercept=c(rep(1,35))
matX=multdata %>% select(x5 , x6, x9 , x12, x13, x14 , x16)
matX$x16=ifelse(matX$x16=="Sec",1,0) #Sec=0, else 1
matX=as.matrix(cbind(Intercept,matX))
yY=multdata %>% select(y1,y2,y3)
meanY=as.matrix(apply(yY,2,mean))
yY=as.matrix(yY)
bhat=solve(t(matX)%*%matX)%*%t(matX)%*%yY # Pf_H. behaves strange
rSquare=solve(t(yY)%*%yY-35*meanY%*%t(meanY))%*%(t(bhat)%*%t(matX)%*%yY-35*meanY%*%t(meanY))
tr(rSquare)/length(meanY)
#rSquare
#2.S
message("S-Pooled")
Sp=t(yY)%*%yY-t(bhat)%*%t(matX)%*%yY
Sp
message("trace of S-Pooled")
(tr(Sp))
#1.R-Square
S-Pooled
| y1 | y2 | y3 | |
|---|---|---|---|
| y1 | 55.57168 | 11.48428 | 10.59718 |
| y2 | 11.484276 | 2.880592 | 2.846709 |
| y3 | 10.597183 | 2.846709 | 17.801518 |
trace of S-Pooled
The backward elimination algorithm returns a model of seven predictor variables. According to this model, 47.43% of the variability in the response variables could be explained by the variables in the model.
Picking the most significant once in the selected model above and try to improve model by including interaction among these variables which is between $x_5$,$x_{14}$ and $x_{16}$.
modelInteraction=lm(cbind(y1,y2,y3)~x5+x6+x9+x12+x13+x14+x16,data=multdata)
#Manova(modelInteraction,statistics="Wilks")
#message("Picked the most significant once and try to improve model by including interaction among these x5,x14 ,x16")
modelInteraction1=lm(cbind(y1,y2,y3)~x5+x6+x9+x12+x13+x14+x16+x5*x14+x5*x16+x14*x16,data=multdata)
Manova(modelInteraction1,test.statistics="Wilks")
Type II MANOVA Tests: Pillai test statistic
Df test stat approx F num Df den Df Pr(>F)
x5 1 0.31804 3.4200 3 22 0.03506 *
x6 1 0.22687 2.1519 3 22 0.12259
x9 1 0.19014 1.7218 3 22 0.19175
x12 1 0.20915 1.9394 3 22 0.15273
x13 1 0.18063 1.6166 3 22 0.21419
x14 1 0.67970 15.5619 3 22 1.183e-05 ***
x16 1 0.69125 16.4182 3 22 7.960e-06 ***
x5:x14 1 0.07747 0.6158 3 22 0.61206
x5:x16 1 0.16556 1.4550 3 22 0.25408
x14:x16 1 0.14759 1.2697 3 22 0.30923
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(modelInteraction, modelInteraction1, test = "Wilks") #
| Res.Df | Df | Gen.var. | Wilks | approx F | num Df | den Df | Pr(>F) | |
|---|---|---|---|---|---|---|---|---|
| 1 | 27 | NA | 0.277564 | NA | NA | NA | NA | NA |
| 2 | 24 | -3 | 0.2624401 | 0.593668 | 1.42546 | 9 | 53.69282 | 0.2007421 |
At 20% level of significance, it looks significant to include interaction to the model. However, the model could be improved by droping the most insignificant interactions.
modelInteraction2=lm(cbind(y1,y2,y3)~x5+x6+x9+x12+x13+x14+x16+x5*x16+x14*x16,data=multdata)
#Manova(modelInteraction2,statistics="Wilks")
message("## R-Square for updated model")
Intercept=c(rep(1,35))
matX=multdata %>% select(x5 , x6, x9 ,x12, x13, x14 , x16) %>% mutate(x16=ifelse(x16=="Sec",1,0),x5x16=x5*x16,x14x16=x14*x16)
matX=as.matrix(cbind(Intercept,matX))
yY=multdata %>% select(y1,y2,y3)
meanY=as.matrix(apply(yY,2,mean))
yY=as.matrix(yY)
bhat=solve(t(matX)%*%matX)%*%t(matX)%*%yY # Pf_H. behaves strange
rSquare=solve(t(yY)%*%yY-35*meanY%*%t(meanY))%*% (t(bhat)%*%t(matX)%*%yY-35*meanY%*%t(meanY))
tr(rSquare)/length(meanY)
#2. S
message("## 2. S")
Sp=t(yY)%*%yY-t(bhat)%*%t(matX)%*%yY
message("S-Pooled")
Sp
message("Trace of S-Pooled")
(tr(Sp))
## R-Square for updated model
## 2. S S-Pooled
| y1 | y2 | y3 | |
|---|---|---|---|
| y1 | 42.042047 | 8.741411 | 8.894307 |
| y2 | 8.741411 | 2.260064 | 2.286782 |
| y3 | 8.894307 | 2.286782 | 16.872124 |
Trace of S-Pooled
After including interaction, we manage R-square to go from 47.43% to 54.62% and S to go from 76.2537 to 61.17 which is good improvement to our preselected model.
Manova(modelInteraction2)
message("Final model Coefficients:")
kable(modelInteraction2$coefficients, digits=4)
Type II MANOVA Tests: Pillai test statistic
Df test stat approx F num Df den Df Pr(>F)
x5 1 0.30652 3.3887 3 23 0.03522 *
x6 1 0.19379 1.8428 3 23 0.16758
x9 1 0.17842 1.6650 3 23 0.20225
x12 1 0.20414 1.9665 3 23 0.14716
x13 1 0.16708 1.5379 3 23 0.23150
x14 1 0.67663 16.0423 3 23 7.609e-06 ***
x16 1 0.68519 16.6864 3 23 5.622e-06 ***
x5:x16 1 0.18551 1.7462 3 23 0.18557
x14:x16 1 0.23005 2.2907 3 23 0.10507
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Final model Coefficients:
| | y1| y2| y3| |:-----------|-------:|-------:|-------:| |(Intercept) | -2.1764| -0.3662| 0.2662| |x5 | 0.0002| 0.0001| -0.0001| |x6 | -0.4936| -0.1484| -0.0320| |x9 | 0.0194| 0.0068| 0.0116| |x12 | -0.7872| 0.1041| -0.3873| |x13 | 1.8790| 0.2089| 1.0588| |x14 | -5.4780| -1.4244| -0.1523| |x16Sec | -1.1260| -0.6571| -2.1130| |x5:x16Sec | 0.0009| 0.0002| 0.0003| |x14:x16Sec | 3.9416| 0.6106| -0.1317|
The predictor variables $x_5$, $x_9$,$x_{13}$ and interaction $x_5:x_{16}$ and $x_{14}:x_{16}$ have a positive linear relationship with $Y_1$ which means that increasing one of these variables will increase the value of $y_1$ and vies versa. Whereas, the other variables have a negative linear relationship with $y_1$. Note also that $x_5$ and $x_5:x_{16}$ have a relationship close to zero towards $y_1$. It could be interpreted in a similar fashion for the other response variables.
normmultdata=data.frame(multdata %>% select( y1,y2,y3,x5 , x6, x9 ,x12, x13, x14 , x16))
normmultdata$x16=ifelse(normmultdata$x16=="Sec",1,0) #Sec=0, else 1
normmultdata=as.matrix(normmultdata)
Di_square=c(rep(0,35))
meanNmultdata=apply(normmultdata,2,mean)
CovNmultdata=cov(normmultdata)
for(i in 1:35)
{
Di_square[i]=t(normmultdata[i,]-meanNmultdata)%*%solve(CovNmultdata)%*%(normmultdata[i,]-meanNmultdata)
}
Di_square_sort=sort(Di_square)
p=ncol(normmultdata)
n=nrow(normmultdata)
alpha1=(p-2)/(2*p)
beta1=(n-p-3)/(2*(n-p-1))
Vi=(c(1:35)-alpha1)/(35-alpha1-beta1+1)
Normdata=cbind.data.frame(Ui=Di_square_sort,Vi=Vi)
ggplot(Normdata, aes(x=Normdata[,1],y=Normdata[,2]))+geom_point(stat="identity")+labs(x="U(i)",y="Vi (Quantiles of Beta)")
maxDi=round(max(Di_square_sort),2)
tabvalue=paste("at Alpha=0.05 and p=10,n=35 we get D-value=15.75") #Table A.6.
multdataResid=data.frame(modelInteraction2$residuals)
p1=ggplot(multdataResid, aes(x=c(1:35),y=y1)) + geom_point()+labs(x=" ",y="y1 Residual")
p2=ggplot(multdataResid, aes(x=c(1:35),y=y2)) + geom_point()+labs(x=" ",y="y2 Residual")
p3=ggplot(multdataResid, aes(x=c(1:35),y=y3)) + geom_point()+labs(x=" ",y="y3 Residual")
grid.arrange(p1,p2,p3,ncol=2)
According to the Normality plot and the Residuals plot, we can conclude that the selected model is a very legit one.
[Alvin_C._Rencher] Methods of Multivariate Analysis, 2nd edition.