Sentiment Analysis of the Four Retail Stores (Amazon, Best Buy, Walmart and Ebay) using Twitter data
im a 6 but in walmart im a 10
The sentiment analysis was done by extracting recent tweets related to each retail store. For each store, 1000 tweets were extracted and then filtered to make sure that each tweet contains the name of the searched retail store.
libs <- c("tm","ROAuth",
"twitteR","httr" ,"RCurl","stringr","plyr","dplyr","wordcloud", "lattice","scales","ggplot2")
lapply(libs,require, character.only= TRUE)
consumer_key="deleted"
consumer_secret="deleted"
access_token="deleted"
access_secret= "deleted"
setup_twitter_oauth(consumer_key,consumer_secret,access_token,access_secret )
pos=scan('positive-words.txt',what='character', comment.char=';')
neg=scan('negative-words.txt',what='character', comment.char=';')
Sentiment Analysis Function
The sentiment analysis algorithm used here is based on score values. Each word in a tweet is categorized as either
positive, negative or unknown by comparing it with the positive and negative word lists provided
on: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html.
The score value for each tweet is calculated as: Score = TotalPositiveWords - TotalNegativeWords
in that tweet. Tweets with a score greater than or equal to 2 (this cutoff is a setting that needs to be chosen) are categorized as
tweets with a strong positive sentiment. On the other hand, tweets with a score less than or equal to -2 are categorized
as tweets with a strong negative sentiment. Note that some people use a trained naive Bayes classifier rather than the scoring method
to perform sentiment analysis. Further reading can be found
on http://www.slideshare.net/DevSahu2/sentiment-analysis-using-naive-bayes-classifier-39784368.
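The scoring rule above can be sketched on a toy example. The word lists, the helper names `score_one` and `label_score`, and the sample tweets below are all hypothetical stand-ins chosen for illustration; they are not the real lexicon or the function used in this analysis.

```r
# Hypothetical mini-lexicons (illustration only, not the real word lists)
pos_demo <- c("love", "awesome", "fun")
neg_demo <- c("bad", "annoying", "sucks")

# Score = number of positive words - number of negative words
score_one <- function(text, pos_words, neg_words) {
  words <- unlist(strsplit(tolower(gsub("[[:punct:]]", " ", text)), "\\s+"))
  sum(words %in% pos_words) - sum(words %in% neg_words)
}

# Map a score to the three categories, using the cutoff of 2 described above
label_score <- function(score, cutoff = 2) {
  ifelse(score >= cutoff, "strong positive",
         ifelse(score <= -cutoff, "strong negative", "neutral"))
}

demo_tweets <- c("Awesome store, I love it",
                 "Bad service, annoying staff, checkout sucks",
                 "Fun trip but bad parking")
demo_scores <- sapply(demo_tweets, score_one, pos_demo, neg_demo, USE.NAMES = FALSE)
demo_labels <- label_score(demo_scores)
data.frame(text = demo_tweets, score = demo_scores, label = demo_labels)
```

With these toy lists, the first tweet scores 2, the second -3 and the third 0, so they fall into the strong positive, strong negative and neutral categories respectively.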
Sentiment Algorithm
score_sentiment = function(sentences, pos_words, neg_words, .progress='none') # .progress is passed on to laply
{
  scores = laply(sentences,
                 function(sentence, pos_words, neg_words)
                 {
                   sentence = gsub("http\\S+", " ", sentence) #- remove links first, while "://" and "/" are still present
                   sentence = gsub("@\\w+", " ", sentence) #- remove @mentions before the "@" is stripped as punctuation
                   sentence = gsub("[[:punct:]]", " ", sentence) #- remove punctuation
                   sentence = gsub("[[:cntrl:]]", " ", sentence) #- remove control characters (tabs, newlines, ...)
                   sentence = gsub("[^[:graph:]]", " ", sentence) #- replace non-graphic (non-printing) characters with spaces
                   sentence = gsub('\\d+', '', sentence) #- remove digits
                   sentence = tolower(sentence)
                   word_list = str_split(sentence, "\\s+") #- split sentence into words with str_split (stringr package)
                   words = unlist(word_list)
                   #- compare words to the dictionaries of positive & negative terms
                   pos_matches = match(words, pos_words)
                   neg_matches = match(words, neg_words)
                   pos_matches = !is.na(pos_matches)
                   neg_matches = !is.na(neg_matches)
                   #- final score
                   score = sum(pos_matches) - sum(neg_matches)
                   return(score)
                 }, pos_words, neg_words, .progress=.progress) # extra arguments handed to the inner function by laply
  #- data frame with the score for each tweet
  scores_df = data.frame(text=sentences, score=scores)
  return(scores_df)
}
Algorithm sanity check
#Let's quickly test our score_sentiment() function and word lists with some sample sentences:
sample = c("Sentiment Analysis is Awesome, I Love it",
"Job search sucks, but plenty of time to do what I love doing, its fun.",
"BG is getting windy, its annoying and uncomfortable. #disaster #bad very bad ")
result = score_sentiment(sample, pos, neg)
result
Check: the scores in result are (2, 1, -5).
Extracting Tweets
The information extracted from each tweet is:
- Tweet text, using getText()
- Whether it is a retweet or not, using the isRetweet field
- Number of retweets, using the retweetCount field
- Number of favourites, using the favoriteCount field
Amazontweets = searchTwitter("Amazon", n=1000,resultType="recent", lang="en")
walmarttweets = searchTwitter("Walmart", n=1000,resultType="recent", lang="en")
ebaytweets = searchTwitter("Ebay", n=1000, resultType="recent",lang="en")
bestbuytweets = searchTwitter("Best Buy", n=1000, resultType="recent",lang="en")
######## The code below is deliberately written out explicitly, store by store ########
############################ Extracting the text from each tweet ##########
Amazon_txt = sapply(Amazontweets, function(x) x$getText())
walmart_txt = sapply(walmarttweets, function(x) x$getText())
ebay_txt = sapply(ebaytweets, function(x) x$getText())
bestbuy_txt = sapply(bestbuytweets, function(x) x$getText())
########################### Extracting whether it is a retweet or not ##########
Amazon_isRetweet = sapply(Amazontweets, function(x) x$isRetweet)
walmart_isRetweet = sapply(walmarttweets, function(x) x$isRetweet)
ebay_isRetweet= sapply(ebaytweets, function(x) x$isRetweet)
bestbuy_isRetweet = sapply(bestbuytweets, function(x) x$isRetweet)
##########################
########################### Extracting retweetCount ########################
Amazon_retweetCount = sapply(Amazontweets, function(x) x$retweetCount)
walmart_retweetCount = sapply(walmarttweets, function(x) x$retweetCount)
ebay_retweetCount= sapply(ebaytweets, function(x) x$retweetCount)
bestbuy_retweetCount = sapply(bestbuytweets, function(x) x$retweetCount)
##########################
########################### Extracting favoriteCount #####################
Amazon_favoriteCount = sapply(Amazontweets, function(x) x$favoriteCount)
walmart_favoriteCount = sapply(walmarttweets, function(x) x$favoriteCount)
ebay_favoriteCount= sapply(ebaytweets, function(x) x$favoriteCount)
bestbuy_favoriteCount = sapply(bestbuytweets, function(x) x$favoriteCount)
##########################
#- number of tweets retrieved per store
nd = c(length(Amazon_txt), length(walmart_txt), length(ebay_txt),length(bestbuy_txt))
nd
#inspect(Amazon_txt[1:5])
#- join texts
retailStore = c(Amazon_txt, walmart_txt, ebay_txt, bestbuy_txt)
isRetweet=c(Amazon_isRetweet,walmart_isRetweet,ebay_isRetweet,bestbuy_isRetweet)
retweetCount=c(Amazon_retweetCount,walmart_retweetCount,ebay_retweetCount,bestbuy_retweetCount)
favoriteCount=c(Amazon_favoriteCount,walmart_favoriteCount,ebay_favoriteCount,bestbuy_favoriteCount)
scores = score_sentiment(retailStore, pos, neg, .progress='text')
#- add variables to data frame
scores$retailStore = as.factor(rep(c("Amazon", "Walmart", "Ebay", "Bestbuy"), nd))
scores$isRetweet=isRetweet
scores$retweetCount=retweetCount
scores$favoriteCount=favoriteCount
scores$very_pos = as.numeric(scores$score >= 2)
scores$very_neg = as.numeric(scores$score <=-2)
scores$neutral = as.numeric(scores$score >-2 & scores$score <2)
#View(scores)
Filtering tweets
When we extract the tweets, some of them may not contain the searched word (Walmart, Ebay, Best Buy or Amazon). This could bias the results, so it is necessary to check for such tweets and drop them. scoresfiltered is the filtered version of the scores data frame. Note that the filtering is done here after computing the scores; it could be done before scoring, which would be the better way.
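The filtering idea can be sketched in base R on a small stand-in data frame. The rows, the name `demo` and the variable `store_pattern` are hypothetical and only illustrate a case-insensitive match on the store names.

```r
# Toy data standing in for the scores data frame (hypothetical rows)
demo <- data.frame(
  text = c("Amazon delivery was fast",
           "Shopping at WALMART today",
           "Nothing about any store here",
           "Best Buy has a sale",
           "Selling my phone on ebay"),
  score = c(1, 0, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Keep only the rows whose text mentions one of the store names, ignoring case
store_pattern <- "amazon|walmart|ebay|best buy"
keep <- grepl(store_pattern, demo$text, ignore.case = TRUE)
demo_filtered <- demo[keep, ]
demo_filtered
```

Here the third row is dropped because it names none of the stores, and the other four rows are kept.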
scoresfiltered=dplyr::filter(scores, grepl('Amazon|Walmart|Ebay|Best buy', text,ignore.case = TRUE))
#write.csv(scores,file="scores.csv")
#write.csv(scoresfiltered,file="scoresfiltered.csv")
# Extracted tweets are saved and reloaded just for convenience
scores=read.csv("C:/Users/selamina/Desktop/break/R/TextMiningTraining/TextTraining/scores.csv", comment.char="#")
scoresfiltered=read.csv("C:/Users/selamina/Desktop/break/R/TextMiningTraining/TextTraining/scoresfiltered.csv", comment.char="#")
ggplot(scoresfiltered, aes(factor(retailStore), score))+ geom_boxplot(outlier.colour = "green")+ labs(x="Retail Stores")
ggplot(scoresfiltered, aes(score)) +
geom_bar(aes(y = 100*(..count..)/tapply(..count..,..PANEL..,sum)[..PANEL..])) +
facet_grid(~retailStore)+labs(y="Percent",x="Sentiment Score", title="Sentiment Analysis of 4 retail Stores")
As the four box plots and bar charts show, Best Buy turns out to be the retail store with a more positive sentiment than the others. Amazon and Ebay have very similar, neutral sentiment, slightly skewed toward the positive side in both cases. Walmart turns out to be the retail store with a very symmetric, neutral sentiment.
#- how many very positives and very negatives
numpos = sum(scoresfiltered$very_pos)
numneg = sum(scoresfiltered$very_neg)
global_score = round( 100 * numpos / (numpos + numneg) )
global_score
[1] 77
ggplot(scoresfiltered, aes(score)) +
geom_bar(aes(y = 100*(..count..)/tapply(..count..,..PANEL..,sum)[..PANEL..])) +
labs(y="Percent",x="Sentiment Score",title="Sentiment Analysis of the 4 retail Stores, combined")
As the global score shows, once the neutral tweets (-1 <= score <= 1) are excluded, strong positive sentiment accounts for 77% of the remaining tweets across all retail stores, leaving the other 23% for strong negative sentiment. In addition, the bar chart shows that the overall sentiment is neutral, slightly skewed toward the positive side.
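The arithmetic behind the global score can be checked on a small made-up vector of scores. The values in `demo_scores` are hypothetical; the point is only that the strong positive and strong negative shares are complements once neutral tweets are left out.

```r
# Hypothetical tweet scores (illustration only)
demo_scores <- c(3, 2, 2, 1, 0, 0, -1, -2, 4, -3)

# Same cutoffs as in the analysis: >= 2 is very positive, <= -2 is very negative
very_pos <- demo_scores >= 2
very_neg <- demo_scores <= -2

# Share of strong positives among the non-neutral tweets, in percent
pct_pos <- round(100 * sum(very_pos) / (sum(very_pos) + sum(very_neg)))
# The strong negative share is whatever is left of 100%
pct_neg <- 100 - pct_pos
c(strong_positive = pct_pos, strong_negative = pct_neg)
```

In this toy vector there are 4 strong positives and 2 strong negatives, which gives 67% and 33%.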
The most retweeted tweet related to each retail store
The most retweeted tweet related to Walmart was "RT @vinnycrack: im a 6 but in walmart im a 10", retweeted 14243 times.
The most retweeted tweet related to Best Buy was "RT @MyNintendoNews: Link Takes Over Best Buy's Homepage https://t.co/4lAkNS8f0a https://t.co/kBhnLTHdD0", retweeted 90 times.
The most retweeted tweet related to Ebay was "RT @TonyJohnson1278: Check out LifeProof Fre Case (Water Proof) for Galaxy S5 Teal Color -Brand New! Unopened! https://t.co/vyHZbkBJXu @eBay", retweeted 157 times. Just one for Amazon.
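One possible way to look up the most retweeted tweet per store is sketched below in base R. The data frame `demo`, its rows and the store labels "A" and "B" are hypothetical placeholders, not the collected tweets.

```r
# Toy data standing in for scoresfiltered (hypothetical rows)
demo <- data.frame(
  retailStore = c("A", "A", "B", "B", "B"),
  text = c("a1", "a2", "b1", "b2", "b3"),
  retweetCount = c(5, 12, 3, 40, 7),
  stringsAsFactors = FALSE
)

# Split by store, keep the row with the largest retweetCount in each piece,
# then stack the winning rows back into one data frame
top_rows <- do.call(rbind, lapply(split(demo, demo$retailStore),
                                  function(d) d[which.max(d$retweetCount), ]))
top_rows
```

Note that `which.max()` returns only the first maximum, so ties within a store would be resolved in favour of the earlier row.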
fevcount=scoresfiltered %>% dplyr::group_by(retailStore) %>% dplyr::arrange(-favoriteCount) %>% dplyr::select(favoriteCount,score,retweetCount)
slice(fevcount,1:3)
Source: local data frame [12 x 4]
Groups: retailStore [4]

   retailStore favoriteCount score retweetCount
        (fctr)         (int) (int)        (int)
1       Amazon             2     0            1
2       Amazon             2     2            0
3       Amazon             1     2            0
4      Bestbuy             7    -1            2
5      Bestbuy             4     2            4
6      Bestbuy             4     0            0
7         Ebay             2     0            1
8         Ebay             0     0          157
9         Ebay             0    -1            0
10     Walmart            26    -2            4
11     Walmart            12     0            1
12     Walmart            10     0            3
retweetcount=scoresfiltered %>% dplyr::group_by(retailStore) %>% dplyr::arrange(-retweetCount) %>% dplyr::select(favoriteCount,score,retweetCount)
slice(retweetcount,1:3)
Source: local data frame [12 x 4]
Groups: retailStore [4]

   retailStore favoriteCount score retweetCount
        (fctr)         (int) (int)        (int)
1       Amazon             0     2         6738
2       Amazon             0     1         6619
3       Amazon             0     1         6538
4      Bestbuy             0     1           90
5      Bestbuy             0     2           37
6      Bestbuy             0     2           37
7         Ebay             0     0          157
8         Ebay             0     0           75
9         Ebay             0     0           51
10     Walmart             0     0        14243
11     Walmart             0     0        12359
12     Walmart             0     0        12359
#Retweeted Percentage
scoresfiltered$isRetweet=as.factor(scoresfiltered$isRetweet)
PerRetweet=round(prop.table(table(scoresfiltered$retailStore,scoresfiltered$isRetweet),1)*100,2)
colnames(PerRetweet)=c("Not Retweeted (%)","Is Retweeted (%)")
PerRetweet
          Not Retweeted (%)  Is Retweeted (%)
Amazon                70.26             29.74
Bestbuy               82.50             17.50
Ebay                  92.52              7.48
Walmart               56.28             43.72
As the table above shows, around 44% of the Walmart-related tweets are retweets, whereas only about 7% of the Ebay-related tweets are retweets.
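The row-wise percentages in such a table come from `prop.table()` with margin 1. A minimal sketch on made-up data follows; the stores "A" and "B" and the retweet flags are hypothetical.

```r
# Toy data: store label and retweet flag per tweet (hypothetical values)
demo <- data.frame(
  retailStore = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
  isRetweet = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Cross-tabulate store by retweet flag, then convert each row to percentages
counts <- table(demo$retailStore, demo$isRetweet)
pct <- round(prop.table(counts, 1) * 100, 2)
pct
```

Each row sums to 100, so the "TRUE" column reads directly as the share of a store's tweets that are retweets: 25% for A and 60% for B in this toy data.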
Summary
I extracted 1000 recent tweets related to each of Amazon, Ebay, Walmart and Best Buy for the sentiment analysis. According to the study, Best Buy attained a more positive sentiment than the other retail stores. Amazon and Ebay have very similar sentiment, slightly skewed to the positive side. Walmart achieved a very neutral sentiment. Interestingly, the most retweeted tweet related to Walmart is "im a 6 but in walmart im a 10". In addition, a higher percentage of the Walmart-related tweets are retweets than for the other stores.