On this page, you will find my projects.
Hierarchical Cluster Analysis of US States Based on Their Revenue and Expenditure
This project is based on the State Government Finances: 2013 data provided at this data link. The finance data contains around 52 fields (attributes). For this study, I focused on the expenditure and revenue amounts to perform Hierarchical Clustering (HC) of the states. The amounts are in thousands. First, I clustered the states using both the expenditure and revenue amounts. I then repeated the clustering using only the expenditure amounts, and again using only the revenue amounts.
The results of the three analyses are summarised below.
| Cluster | By Expenditure Amount | By Revenue Amount | By Revenue and Expenditure Amounts |
|---|---|---|---|
| Cluster_1 | California | California | California |
| Cluster_2 | New York | New York | New York |
| Cluster_3 | Texas | Texas and Florida | Texas |
| Cluster_4 | Illinois, Massachusetts, Florida, Pennsylvania, Michigan and Ohio | Virginia, Georgia, North Carolina, Massachusetts, New Jersey, Illinois, Ohio, Michigan and Pennsylvania | Pennsylvania, Illinois, Ohio, Michigan and Florida |
| Cluster_5 | Other states | Other states | Other states |
As the three analyses show, California and New York each form their own cluster: they differ markedly from each other and from all other states in both revenue and expenditure amounts. Texas stands alone when clustering on expenditure only or on expenditure and revenue combined. When clustering on revenue only, however, Texas is grouped with Florida, which indicates that the two states have similar revenues. Pennsylvania, Illinois, Ohio and Michigan appear together in the fourth cluster in all three analyses, so they are similar in both revenue and expenditure amounts. Interestingly, a large number of states fall into cluster 5, which implies that their expenditure and revenue amounts do not differ from one another as much as those of the states in the first four clusters. Cluster 5 could be split further, but the resulting groups might not be meaningfully different. Note that, according to 2013 US Census population figures, most of the states in the first four clusters rank among the top 10 by population size. This is consistent with the expectation that population size has a substantial impact on a state's revenue and expenditure.
The detailed analysis, including the R code, is given here: Hierarchical Cluster Analysis detail result
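The clustering step described above can be sketched in a few lines. This is only an illustration in Python, not the R code from the linked report: the linkage rule (complete linkage) and the distance (Euclidean) are assumptions, since the text does not say which were used, and the entries "A" to "G" with their (revenue, expenditure) values are made-up placeholders rather than real state figures.

```python
from itertools import combinations
from math import dist


def complete_linkage(points, k):
    """Agglomerative clustering: merge the two closest clusters until k remain.

    `points` maps a name to a (revenue, expenditure) pair. The distance
    between two clusters is the largest Euclidean distance between any
    pair of their members (complete linkage, an assumed choice).
    """
    clusters = [[name] for name in points]

    def gap(a, b):
        return max(dist(points[p], points[q]) for p in a for q in b)

    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=lambda pair: gap(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return sorted(sorted(c) for c in clusters)


# Hypothetical (revenue, expenditure) pairs, invented for illustration only.
toy = {
    "A": (300.0, 290.0),
    "B": (200.0, 195.0),
    "C": (130.0, 125.0),
    "D": (80.0, 85.0),
    "E": (75.0, 78.0),
    "F": (20.0, 22.0),
    "G": (15.0, 14.0),
}

groups = complete_linkage(toy, 5)
print(groups)
```

With these toy values, the three largest entries each stay in a cluster of their own while the smaller ones pair up, which mirrors the pattern in the table: a few outliers stand alone and the remaining states are grouped together.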
Time Series Project
I used monthly Apple stock prices (opening price) for the period 2005-03-01 to 2014-06-02. I applied Box-Jenkins ARIMA(p,d,q) modelling and produced forecasts from the selected model. The analysis suggests that, on the logarithmic scale, the opening price for any given month is on average equal to the previous month's opening price. The analysis is provided here: Time Series Project
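The finding above (each month's log price equals, on average, the previous month's) is what a random walk on log prices would look like. The sketch below assumes that model, i.e. ARIMA(0,1,0) without drift on the log scale. The text does not state which orders were actually selected, so this is an illustration of the idea rather than the fitted model, and the prices are made-up placeholders.

```python
from math import exp, log, sqrt


def random_walk_forecast(prices, horizon, z=1.96):
    """Forecast prices assuming log prices follow a driftless random walk.

    Returns a list of (lower, point, upper) tuples, one per step ahead.
    The point forecast is the last observed price; the interval widens
    with the square root of the number of steps ahead.
    """
    logs = [log(p) for p in prices]
    steps = [b - a for a, b in zip(logs, logs[1:])]
    # Step standard deviation with the mean fixed at zero (no drift).
    sigma = sqrt(sum(s * s for s in steps) / len(steps))
    last = logs[-1]
    forecasts = []
    for h in range(1, horizon + 1):
        half_width = z * sigma * sqrt(h)
        forecasts.append((exp(last - half_width), exp(last), exp(last + half_width)))
    return forecasts


# Hypothetical monthly opening prices, invented for illustration only.
toy_prices = [10.0, 11.0, 10.5, 12.0]
forecast = random_walk_forecast(toy_prices, horizon=3)
for lower, point, upper in forecast:
    print(round(lower, 2), round(point, 2), round(upper, 2))
```

Under this assumption the point forecast stays flat at the last observed price, and only the uncertainty band grows with the forecast horizon.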
Text (Sentence) Classification
Transcripts of President Obama's 2009 inaugural speech and of a campaign speech by the 2012 presidential candidate Gov. Mitt Romney were used for sentence classification. After creating the corpus and applying several text-cleaning procedures, I passed it through three algorithms: Support Vector Machine (SVM), Random Forest (RF) and Decision Tree (Tree). Each was used to classify a random sentence as belonging to either President Obama's or Gov. Romney's speech. I used precision, recall and accuracy to select the model that classifies sentences best. RF was the winning model, with 75% accuracy and 85% precision.
Based on the results, I made the following interpretations. Of the sentences predicted to be from Pres. Obama’s speech, 85% actually are (precision). Similarly, 73% of the sentences predicted to be Gov. Romney’s actually are. In terms of recall, 95% of Gov. Romney’s sentences are predicted correctly, whereas only 45% of Pres. Obama’s sentences are. President Obama’s sentences are therefore harder to predict than Gov. Romney’s. Keep in mind that the two speeches did not share the same theme.
Report given here: Sentence Classification
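To make the three metrics above concrete, here is a small sketch of how precision, recall and accuracy are computed from a confusion matrix. The counts are invented for illustration and chosen only so that the resulting rates land near the reported ones; they are not the project's actual confusion matrix.

```python
# Hypothetical confusion matrix: confusion[actual][predicted] = sentence count.
# These counts are made up for illustration, not taken from the report.
confusion = {
    "Obama": {"Obama": 18, "Romney": 22},
    "Romney": {"Obama": 3, "Romney": 57},
}


def precision(confusion, label):
    """Share of sentences predicted as `label` that truly belong to `label`."""
    predicted = sum(row[label] for row in confusion.values())
    return confusion[label][label] / predicted


def recall(confusion, label):
    """Share of sentences truly belonging to `label` that were predicted as such."""
    return confusion[label][label] / sum(confusion[label].values())


def accuracy(confusion):
    """Share of all sentences whose predicted speaker matches the actual one."""
    correct = sum(confusion[label][label] for label in confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total


for speaker in confusion:
    print(speaker,
          "precision", round(precision(confusion, speaker), 2),
          "recall", round(recall(confusion, speaker), 2))
print("accuracy", round(accuracy(confusion), 2))
```

The toy matrix shows how a model can have high precision but low recall for one class: few sentences are labelled as Obama's, so most of those labels are right, yet many of his sentences are missed.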
Multivariate Multiple Regression Analysis
Multivariate multiple linear regression is used when a regression setting has several y’s (response variables) and several x’s (predictor variables). "Multivariate" refers to the dependent variables and "multiple" to the independent variables. The detailed analysis and important assumption checks are provided here: Multivariate Multiple Regression Analysis
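In matrix form, the model described above is commonly written as follows. This is the standard textbook formulation; the symbols n (observations), q (predictors) and m (responses) are introduced here for illustration and are not taken from the linked report.

```latex
% Multivariate multiple linear regression with n observations,
% q predictors (plus an intercept column) and m responses.
Y = X B + E,
\qquad
Y \in \mathbb{R}^{n \times m},\;
X \in \mathbb{R}^{n \times (q+1)},\;
B \in \mathbb{R}^{(q+1) \times m},\;
E \in \mathbb{R}^{n \times m}

% Least-squares estimate of the coefficient matrix,
% assuming X^\top X is invertible.
\hat{B} = (X^\top X)^{-1} X^\top Y
```

Each column of B holds the coefficients for one response variable, so the least-squares estimates match those from fitting a separate multiple regression per response. The multivariate treatment matters for joint tests and for the error covariance across responses.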
Sentiment Analysis of the Four Retail Stores (Amazon, Best Buy, Walmart and Ebay) using Twitter data