Multivariate Multiple Regression Analysis Using MANOVA

Multivariate multiple linear regression: several y’s and several x’s. Multivariate refers to the dependent variables and multiple pertains to the independent variables.

In [ ]:
Steps:
    * Overall significance test. Testing the existence of at least one predictor variable to explain at least one of the
       response variable. 
    * Variable selection using Backward elimination
    * Model improvement
        + Interaction included
        + R-square and S used for assessment
    * Multivariate Normality assessed
    * Residual assessment
In [1]:
libs <- c("car","xlsx","psych","dplyr","ggplot2","knitr","gridExtra")
lapply(libs,require, character.only= TRUE)
library(repr)
options(repr.plot.width=7,repr.plot.height=4)
In [3]:
multdata=read.xlsx("C:/Users/selamina/...data.xlsx",1)

Step 1. Overall Significance Test

$$ H_0: \beta=0 \hspace{12pt} vs \hspace{12pt} H_a:\beta \neq 0 \hspace{12pt} \text{for at least one}\hspace{8pt} \beta_{ij} $$
In [4]:
#message("H0: B = O vs H1: B != O, (We want to know whether at least one Bjk != 0, 
#to explain at least one of the response variable.")
multdataModel="cbind(y1,y2,y3)~x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x13+x14+x15+x16"
multdataFullModel<- lm(multdataModel, data=multdata) 
c=diag(15)[2:15,] # first column is for the const term
(OverallFit <- linearHypothesis(model = multdataFullModel, hypothesis.matrix = c)) #Spits H, E and Sign for over all
#message("From all the four multivariate test statistics, we can conclude as there exist at least one potential predictor variable
#to explain at least one of the response variable.")
Sum of squares and products for the hypothesis:
          y1       y2       y3
y1 138.56339 43.42221 56.09689
y2  43.42221 14.13711 17.47444
y3  56.09689 17.47444 32.55243

Sum of squares and products for error:
          y1        y2        y3
y1 49.322325 10.376339  8.714629
y2 10.376339  2.649746  2.302865
y3  8.714629  2.302865 14.258286

Multivariate Tests: 
                 Df test stat approx F num Df   den Df     Pr(>F)    
Pillai           14  1.648181 1.741760     42 60.00000  0.0240225 *  
Wilks            14  0.046435 2.339980     42 54.16198  0.0016965 ** 
Hotelling-Lawley 14  8.133309 3.227504     42 50.00000 4.7430e-05 ***
Roy              14  6.725995 9.608564     14 20.00000 4.9976e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see in all four multivariate test statistics, we can conclude that there exist at least one potential predictor variable to explain at least one of the response variable at 5 percent level of significance.

Step 2. Variable Selection

Step 2.1 Backward Elimination

I chose to implement the Backward elimination, but could be tried Stepwise or forward selection as well. The concept for backward elimination is to compute $X_j$ given $X_{-j}$ , using Sum of Square of Type III; then drop the one with most insignificant one. Note that $X_{-j}$ means all the predictor variables except $j^{th}$.

In [5]:
#message("The concept is to compute Xj-given-all_X, using Sum of Square of Type III; 
#then drop the one with most insignificant one or the least contributer.")
Manova(multdataFullModel,test.statistic="Wilks")
#message("The P-value per variable has been assessed and the one with largest p-value is dropped, which is X10 with p-value=0.89213")
Type II MANOVA Tests: Wilks test statistic
    Df test stat approx F num Df den Df    Pr(>F)    
x3   1   0.94416   0.3549      3     18 0.7861932    
x4   1   0.92825   0.4638      3     18 0.7110744    
x5   1   0.73803   2.1297      3     18 0.1320678    
x6   1   0.90414   0.6362      3     18 0.6013723    
x7   1   0.95622   0.2747      3     18 0.8428341    
x8   1   0.95665   0.2719      3     18 0.8448361    
x9   1   0.87286   0.8740      3     18 0.4728694    
x10  1   0.96709   0.2042      3     18 0.8921308    
x11  1   0.91977   0.5234      3     18 0.6717012    
x12  1   0.82645   1.2599      3     18 0.3178384    
x13  1   0.86040   0.9735      3     18 0.4269728    
x14  1   0.36264  10.5453      3     18 0.0003136 ***
x15  1   0.92527   0.4846      3     18 0.6971627    
x16  1   0.75528   1.9441      3     18 0.1586741    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value for each variable has been assessed and the one with largest p-value is dropped, which is X10 with p-value=0.89213. Then X10 is dropped to move to next step.

2.2. X10 dropped

In [6]:
reduced1=update(multdataFullModel,.~.-x10)
Manova(reduced1,test.statistic="Wilks")
Type II MANOVA Tests: Wilks test statistic
    Df test stat approx F num Df den Df    Pr(>F)    
x3   1   0.94549   0.3651      3     19 0.7789325    
x4   1   0.91062   0.6217      3     19 0.6096140    
x5   1   0.74060   2.2183      3     19 0.1192034    
x6   1   0.90616   0.6559      3     19 0.5891723    
x7   1   0.96521   0.2283      3     19 0.8755587    
x8   1   0.95985   0.2649      3     19 0.8498111    
x9   1   0.87434   0.9102      3     19 0.4546124    
x11  1   0.90829   0.6395      3     19 0.5989107    
x12  1   0.76392   1.9573      3     19 0.1546839    
x13  1   0.78192   1.7664      3     19 0.1876760    
x14  1   0.36310  11.1092      3     19 0.0001957 ***
x15  1   0.92316   0.5272      3     19 0.6689373    
x16  1   0.73759   2.2532      3     19 0.1151654    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Steps deleted here

2.7. x4 dropped

In [11]:
reduced6=update(reduced5,.~.-x4)
Manova(reduced6,test.statistic="Wilks")
Type II MANOVA Tests: Wilks test statistic
    Df test stat approx F num Df den Df    Pr(>F)    
x5   1   0.74266   2.7721      3     24  0.063404 .  
x6   1   0.81382   1.8302      3     24  0.168603    
x9   1   0.85135   1.3969      3     24  0.267960    
x11  1   0.87631   1.1292      3     24  0.357064    
x12  1   0.79453   2.0689      3     24  0.131006    
x13  1   0.84582   1.4583      3     24  0.250879    
x14  1   0.32212  16.8351      3     24 4.216e-06 ***
x16  1   0.56075   6.2665      3     24  0.002703 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.8. x11 dropped

In [12]:
reduced7=update(reduced6,.~.-x11)
Manova(reduced7,test.statistic="Wilks")
#summary(Manova(reduced7))
Type II MANOVA Tests: Wilks test statistic
    Df test stat approx F num Df den Df    Pr(>F)    
x5   1   0.73487   3.0066      3     25   0.04923 *  
x6   1   0.79237   2.1837      3     25   0.11504    
x9   1   0.82799   1.7313      3     25   0.18622    
x12  1   0.78596   2.2694      3     25   0.10512    
x13  1   0.85703   1.3902      3     25   0.26889    
x14  1   0.32114  17.6156      3     25 2.348e-06 ***
x16  1   0.35388  15.2150      3     25 7.728e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the biggest p-value is 0.26889, its acceptable not to drop any variable afterwards, just to be on the safe side.

R-Square and S Computation for model improvement assessment

In [30]:
#1. R-Square For the final model selected using BackWard elimination
message("#1.R-Square")
Intercept=c(rep(1,35))
matX=multdata %>% select(x5 , x6, x9 , x12, x13, x14 , x16)
matX$x16=ifelse(matX$x16=="Sec",1,0) #Sec=0, else 1
matX=as.matrix(cbind(Intercept,matX))
yY=multdata %>% select(y1,y2,y3)
meanY=as.matrix(apply(yY,2,mean))
yY=as.matrix(yY)
bhat=solve(t(matX)%*%matX)%*%t(matX)%*%yY # Pf_H. behaves strange 

rSquare=solve(t(yY)%*%yY-35*meanY%*%t(meanY))%*%(t(bhat)%*%t(matX)%*%yY-35*meanY%*%t(meanY))
tr(rSquare)/length(meanY)
#rSquare

#2.S
message("S-Pooled")
Sp=t(yY)%*%yY-t(bhat)%*%t(matX)%*%yY
Sp
message("trace of S-Pooled")
(tr(Sp))
#1.R-Square
0.474309629592164
S-Pooled
y1y2y3
y155.5716811.4842810.59718
y211.484276 2.880592 2.846709
y310.597183 2.84670917.801518
trace of S-Pooled
76.2537864017878

The backward elimination algorithm returns a model of seven predictor variables. According to this model, 47.43% of the variability in the response variables could be explained by the variables in the model.

Interaction Assessment

1. Final model from Main Effect (reduced7)

Picking the most significant once in the selected model above and try to improve model by including interaction among these variables which is between $x_5$,$x_{14}$ and $x_{16}$.

In [27]:
modelInteraction=lm(cbind(y1,y2,y3)~x5+x6+x9+x12+x13+x14+x16,data=multdata)
#Manova(modelInteraction,statistics="Wilks")
#message("Picked the most significant once and try to improve model by including interaction among these x5,x14 ,x16")

2. Second Order Interaction included

In [28]:
modelInteraction1=lm(cbind(y1,y2,y3)~x5+x6+x9+x12+x13+x14+x16+x5*x14+x5*x16+x14*x16,data=multdata)
Manova(modelInteraction1,test.statistics="Wilks")
Type II MANOVA Tests: Pillai test statistic
        Df test stat approx F num Df den Df    Pr(>F)    
x5       1   0.31804   3.4200      3     22   0.03506 *  
x6       1   0.22687   2.1519      3     22   0.12259    
x9       1   0.19014   1.7218      3     22   0.19175    
x12      1   0.20915   1.9394      3     22   0.15273    
x13      1   0.18063   1.6166      3     22   0.21419    
x14      1   0.67970  15.5619      3     22 1.183e-05 ***
x16      1   0.69125  16.4182      3     22 7.960e-06 ***
x5:x14   1   0.07747   0.6158      3     22   0.61206    
x5:x16   1   0.16556   1.4550      3     22   0.25408    
x14:x16  1   0.14759   1.2697      3     22   0.30923    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3. Model comparison between main effect model and second order interaction included model

In [20]:
anova(modelInteraction, modelInteraction1, test = "Wilks") #
Res.DfDfGen.var.Wilksapprox Fnum Dfden DfPr(>F)
127NA0.277564NANANANANA
224-30.26244010.5936681.42546953.692820.2007421

At 20% level of significance, it looks significant to include interaction to the model. However, the model could be improved by droping the most insignificant interactions.

4. Droping x5:x14 interaction

In [29]:
modelInteraction2=lm(cbind(y1,y2,y3)~x5+x6+x9+x12+x13+x14+x16+x5*x16+x14*x16,data=multdata)
#Manova(modelInteraction2,statistics="Wilks")
In [33]:
message("## R-Square for updated model")
Intercept=c(rep(1,35))
matX=multdata %>% select(x5 , x6, x9 ,x12, x13, x14 , x16) %>% mutate(x16=ifelse(x16=="Sec",1,0),x5x16=x5*x16,x14x16=x14*x16)
matX=as.matrix(cbind(Intercept,matX))
yY=multdata %>% select(y1,y2,y3)
meanY=as.matrix(apply(yY,2,mean))
yY=as.matrix(yY)
bhat=solve(t(matX)%*%matX)%*%t(matX)%*%yY # Pf_H. behaves strange 

rSquare=solve(t(yY)%*%yY-35*meanY%*%t(meanY))%*% (t(bhat)%*%t(matX)%*%yY-35*meanY%*%t(meanY))
tr(rSquare)/length(meanY)

#2. S
message("## 2. S")
Sp=t(yY)%*%yY-t(bhat)%*%t(matX)%*%yY
message("S-Pooled")
Sp
message("Trace of S-Pooled")
(tr(Sp))
## R-Square for updated model
0.54624709324098
## 2. S
S-Pooled
y1y2y3
y142.042047 8.741411 8.894307
y28.7414112.2600642.286782
y3 8.894307 2.28678216.872124
Trace of S-Pooled
61.1742353991862

After including interaction, we manage R-square to go from 47.43% to 54.62% and S to go from 76.2537 to 61.17 which is good improvement to our preselected model.

Final Model

In [34]:
Manova(modelInteraction2)
message("Final model Coefficients:")
kable(modelInteraction2$coefficients, digits=4)
Type II MANOVA Tests: Pillai test statistic
        Df test stat approx F num Df den Df    Pr(>F)    
x5       1   0.30652   3.3887      3     23   0.03522 *  
x6       1   0.19379   1.8428      3     23   0.16758    
x9       1   0.17842   1.6650      3     23   0.20225    
x12      1   0.20414   1.9665      3     23   0.14716    
x13      1   0.16708   1.5379      3     23   0.23150    
x14      1   0.67663  16.0423      3     23 7.609e-06 ***
x16      1   0.68519  16.6864      3     23 5.622e-06 ***
x5:x16   1   0.18551   1.7462      3     23   0.18557    
x14:x16  1   0.23005   2.2907      3     23   0.10507    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Final model Coefficients:

|            |      y1|      y2|      y3|
|:-----------|-------:|-------:|-------:|
|(Intercept) | -2.1764| -0.3662|  0.2662|
|x5          |  0.0002|  0.0001| -0.0001|
|x6          | -0.4936| -0.1484| -0.0320|
|x9          |  0.0194|  0.0068|  0.0116|
|x12         | -0.7872|  0.1041| -0.3873|
|x13         |  1.8790|  0.2089|  1.0588|
|x14         | -5.4780| -1.4244| -0.1523|
|x16Sec      | -1.1260| -0.6571| -2.1130|
|x5:x16Sec   |  0.0009|  0.0002|  0.0003|
|x14:x16Sec  |  3.9416|  0.6106| -0.1317|

The predictor variables $x_5$, $x_9$,$x_{13}$ and interaction $x_5:x_{16}$ and $x_{14}:x_{16}$ have a positive linear relationship with $Y_1$ which means that increasing one of these variables will increase the value of $y_1$ and vies versa. Whereas, the other variables have a negative linear relationship with $y_1$. Note also that $x_5$ and $x_5:x_{16}$ have a relationship close to zero towards $y_1$. It could be interpreted in a similar fashion for the other response variables.

Multivariate Normality Assumption Check for the Data

In [35]:
normmultdata=data.frame(multdata %>% select( y1,y2,y3,x5 , x6, x9 ,x12, x13, x14 , x16))
normmultdata$x16=ifelse(normmultdata$x16=="Sec",1,0) #Sec=0, else 1
normmultdata=as.matrix(normmultdata)

Di_square=c(rep(0,35))
meanNmultdata=apply(normmultdata,2,mean)
CovNmultdata=cov(normmultdata)

for(i in 1:35)
{
Di_square[i]=t(normmultdata[i,]-meanNmultdata)%*%solve(CovNmultdata)%*%(normmultdata[i,]-meanNmultdata)
}

Di_square_sort=sort(Di_square)
p=ncol(normmultdata)
n=nrow(normmultdata)
alpha1=(p-2)/(2*p)
beta1=(n-p-3)/(2*(n-p-1))
Vi=(c(1:35)-alpha1)/(35-alpha1-beta1+1)

Normdata=cbind.data.frame(Ui=Di_square_sort,Vi=Vi)

ggplot(Normdata, aes(x=Normdata[,1],y=Normdata[,2]))+geom_point(stat="identity")+labs(x="U(i)",y="Vi (Quantiles of Beta)")

maxDi=round(max(Di_square_sort),2)
tabvalue=paste("at Alpha=0.05 and p=10,n=35 we get D-value=15.75") #Table A.6. 

"From the graph we can conclude that the Multivariate Normality assumption is not violated. However,the numerical result indicates the Multivariate Normality assumption violated. The calculated value (22.12) is greater than the tabulated value (at Alpha=0.05 and p=10,n=35 we get D-value=15.75). However, if we extrapolate a bit, might be OK to accept Normality since we couldn't get the exact value from the table for our p size."

Residual plot

In [56]:
multdataResid=data.frame(modelInteraction2$residuals)
p1=ggplot(multdataResid, aes(x=c(1:35),y=y1)) + geom_point()+labs(x=" ",y="y1 Residual") 
p2=ggplot(multdataResid, aes(x=c(1:35),y=y2)) + geom_point()+labs(x=" ",y="y2 Residual") 
p3=ggplot(multdataResid, aes(x=c(1:35),y=y3)) + geom_point()+labs(x=" ",y="y3 Residual") 
  
grid.arrange(p1,p2,p3,ncol=2)

According to the Normality plot and the Residuals plot, we can conclude that the selected model is a very legit one.

Reference

[Alvin_C._Rencher] Methods of Multivariate Analysis, 2nd edition.

© 2016 Altaye