On this page, you will find my projects.

Hierarchical Cluster Analysis of US States Based on Their Revenue and Expenditure

This project is based on the State Government Finances: 2013 data provided at the data link. The finance data contain around 52 fields (attributes). For this study, however, I chose to focus on expenditure amounts and revenue amounts to perform hierarchical clustering (HC) of the states. The amounts are in thousands. I first clustered the states using both the expenditure and revenue amounts. I then repeated the clustering using only the expenditure amounts, and again using only the revenue amounts. The results of the three analyses are summarised below.


| Cluster | By Expenditure Amount | By Revenue Amount | By Revenue and Expenditure Amount |
|---|---|---|---|
| Cluster_1 | California State | California State | California State |
| Cluster_2 | New York State | New York State | New York State |
| Cluster_3 | Texas State | Texas State and Florida | Texas State |
| Cluster_4 | Illinois, Massachusetts, Florida, Pennsylvania, Michigan and Ohio | Virginia, Georgia, North Carolina, Massachusetts, New Jersey, Illinois, Ohio, Michigan and Pennsylvania | Pennsylvania, Illinois, Ohio, Michigan and Florida |
| Cluster_5 | Other States | Other States | Other States |

As the three analyses show, California and New York differ markedly from each other and from every other state in both revenue and expenditure amounts. Texas stands alone when we consider only expenditure, or expenditure and revenue combined. When we cluster on revenue alone, however, Texas's revenue is similar to Florida's. The fourth cluster shows that Pennsylvania, Illinois, Ohio and Michigan are similar in both revenue and expenditure amounts. Interestingly, a large number of states fall in cluster 5, which implies that their expenditure and revenue amounts do not differ from one another as much as those of the states listed in the first four clusters. We could of course split cluster 5 further, but the resulting groups might not differ much. Note that, according to the 2013 US population census figures, most of the states in the first four clusters rank among the top 10 by population. This is consistent with population size having a significant impact on a state's revenue and expenditure.

A detailed analysis, including the R code, is given here: Hierarchical Cluster Analysis detail result
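As a rough illustration of the clustering step (not the R code from the linked report), here is a minimal agglomerative clustering sketch in plain Python. It assumes Euclidean distance and complete linkage, neither of which is stated above, and it runs on made-up (revenue, expenditure) pairs for hypothetical states A to F rather than the 2013 finance data.

```python
from itertools import combinations
from math import dist

# Made-up (revenue, expenditure) pairs in arbitrary units.
# These are NOT the 2013 State Government Finances figures.
toy_finances = {
    "A": (100.0, 98.0),
    "B": (60.0, 62.0),
    "C": (40.0, 41.0),
    "D": (12.0, 11.0),
    "E": (10.0, 10.5),
    "F": (9.0, 9.5),
}


def complete_linkage(c1, c2, points):
    """Cluster distance = largest pairwise Euclidean distance (assumed linkage)."""
    return max(dist(points[a], points[b]) for a in c1 for b in c2)


def agglomerate(points, k):
    """Repeatedly merge the two closest clusters until only k clusters remain."""
    clusters = [frozenset([name]) for name in points]
    while len(clusters) > k:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: complete_linkage(pair[0], pair[1], points),
        )
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return sorted(sorted(c) for c in clusters)


clusters = agglomerate(toy_finances, k=4)
for members in clusters:
    print(members)
```

With these toy numbers the three largest states each end up alone and the three small ones merge into a single group, which mirrors the pattern of a few outliers plus a large "other states" cluster.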


Time Series Project

I used monthly Apple stock prices (opening price) for the period from 2005-03-01 to 2014-06-02, applied Box-Jenkins ARIMA(p,d,q) modelling, and produced forecasts with the selected model. The analysis suggests that, on the logarithmic scale, the opening price for any given month is on average equal to the previous month's opening price. The analysis is provided here: Time Series Project
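One way to read that finding is that the log price behaves like a random walk without drift, i.e. ARIMA(0,1,0) on the log scale. The selected orders are not stated above, so that reading is an assumption. Under it, the point forecast for every future month is simply the last observed log price. The sketch below shows this on hypothetical prices, not the Apple data.

```python
from math import exp, log

# Hypothetical monthly opening prices; NOT the Apple series used in the project.
prices = [10.0, 11.0, 10.5, 12.0, 12.6]

log_prices = [log(p) for p in prices]

# First differences of the log series (the d = 1 step in ARIMA(p, d, q)).
log_returns = [b - a for a, b in zip(log_prices, log_prices[1:])]


def random_walk_forecast(log_series, horizon):
    """Point forecasts for a driftless random walk: repeat the last value."""
    return [log_series[-1]] * horizon


forecast_log = random_walk_forecast(log_prices, horizon=3)

# Back-transform to the price scale.
forecast_price = [exp(x) for x in forecast_log]
print(forecast_price)
```

A fitted model would also give forecast intervals, which widen with the horizon for a random walk; this sketch only shows the point forecast.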


Text (Sentence) Classification

President Obama's 2009 inaugural speech and a campaign speech by the 2012 presidential candidate, Governor Mitt Romney, both in transcript form, were used for sentence classification. After creating the corpus and applying several text-cleaning procedures, I passed it through three algorithms: Support Vector Machine (SVM), Random Forest (RF) and Decision Tree (Tree). Each was used to classify a random sentence as belonging to either President Obama's or Governor Romney's speech. I used precision, recall and accuracy to select the model that does the best job of classifying sentences correctly. RF came out as the winning model, with 75% accuracy and 85% precision.

Based on the results, I made the following interpretations. Of the sentences predicted to be from President Obama’s speech, 85% are actually his. Likewise, 73% of the sentences predicted to be Governor Romney’s are actually his. In terms of recall, 95% of Governor Romney’s sentences are predicted correctly, whereas only 45% of President Obama’s sentences are. This makes President Obama’s sentences harder to predict than Governor Romney’s. It should be remembered that the two speeches did not share the same theme.

Report given here: Sentence Classification
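For readers unfamiliar with the three metrics, here is a small Python sketch of how precision, recall and accuracy are computed from a two-class confusion table. The counts are invented for illustration and are not taken from the report; the function names are likewise just for this example.

```python
# Made-up confusion counts for illustration; NOT the counts from the report.
# Keys are (actual speaker, predicted speaker).
counts = {
    ("Obama", "Obama"): 9,
    ("Obama", "Romney"): 11,
    ("Romney", "Obama"): 2,
    ("Romney", "Romney"): 38,
}


def precision(label):
    """Share of sentences predicted as `label` that really belong to `label`."""
    predicted = sum(n for (_, pred), n in counts.items() if pred == label)
    return counts[(label, label)] / predicted


def recall(label):
    """Share of sentences that belong to `label` and were predicted as `label`."""
    actual = sum(n for (act, _), n in counts.items() if act == label)
    return counts[(label, label)] / actual


def accuracy():
    """Share of all sentences assigned to the correct speaker."""
    correct = sum(n for (act, pred), n in counts.items() if act == pred)
    return correct / sum(counts.values())


for speaker in ("Obama", "Romney"):
    print(speaker, round(precision(speaker), 2), round(recall(speaker), 2))
print("accuracy", round(accuracy(), 2))
```

Precision answers "when the model says Obama, how often is it right?", while recall answers "of Obama's sentences, how many did the model find?". A class can score high on one and low on the other.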


Multivariate Multiple Regression Analysis

Multivariate multiple linear regression applies when a regression setting has several y’s (response variables) and several x’s (predictor variables). "Multivariate" refers to the dependent variables and "multiple" to the independent variables. A detailed analysis, with checks of the important assumptions, is provided here: Multivariate Multiple Regression Analysis
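In matrix form, the model can be sketched as follows. The symbols are notation chosen for this illustration and do not come from the linked analysis: n observations, m response variables, p predictors plus an intercept column. The second expression is the usual least-squares estimate, assuming the matrix X-transpose-X is invertible.

```latex
\mathbf{Y}_{n \times m}
  = \mathbf{X}_{n \times (p+1)} \, \mathbf{B}_{(p+1) \times m}
  + \mathbf{E}_{n \times m},
\qquad
\hat{\mathbf{B}}
  = (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{Y}
```

Each column of B holds the coefficients for one response variable, so the point estimates match those from m separate multiple regressions. The multivariate setting matters for joint tests, which take the correlation between the responses into account.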

Sentiment Analysis of Four Retail Stores (Amazon, Best Buy, Walmart and eBay) Using Twitter Data


More to come...