Transcript version of:
The two data matrices are concatenated to create the corpus in data-frame format (columns: Sentence and Speaker).
Text cleaning: removePunctuation(), removeNumbers(), removeSparseTerms(), etc.
Randomly shuffle the observations, split them into a training set and a test set, and pass them through different algorithms.
Predict the speaker for each sentence in the test data.
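The cleaning and shuffle/split steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code used for the slides, which name R functions from the tm package. The helper names (clean_sentence, shuffle_split), the placeholder sentences, the speaker labels "A" and "B", the 30% test fraction, and the seed are all assumptions made for the example. The sparse-term filtering done by removeSparseTerms() is not reproduced here.

```python
import random
import string


def clean_sentence(text):
    """Drop ASCII punctuation and digits, then collapse repeated whitespace."""
    drop = str.maketrans("", "", string.punctuation + string.digits)
    return " ".join(text.translate(drop).split())


def shuffle_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and return (train, test)."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = rows[:]             # leave the caller's list untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical placeholder corpus with the two columns named on the slide.
words = ["alpha", "beta", "gamma", "delta", "epsilon",
         "zeta", "eta", "theta", "iota", "kappa"]
corpus = [
    {"Sentence": clean_sentence(f"Sentence {i}: {w}!"),
     "Speaker": "A" if i % 2 == 0 else "B"}
    for i, w in enumerate(words)
]

train, test = shuffle_split(corpus, test_fraction=0.3, seed=42)
print(len(train), "training rows;", len(test), "test rows")
```

Shuffling before the split matters here because the corpus is built by concatenating the two speakers' data, so an unshuffled split could put most of one speaker's sentences in the test set.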
Precision: Of the sentences predicted to be Pres. Obama’s, 85% are actually his. By contrast, 73% of the sentences predicted to be Gov. Romney’s are actually his.
Recall: 95% of Gov. Romney’s sentences are predicted correctly, whereas only 45% of Pres. Obama’s sentences are predicted correctly, which is below a random guess (50% for two speakers).
Accuracy: The selected model, Random Forest, had 75% accuracy. This means that a randomly selected test sentence has a 75% chance of being labelled with the correct speaker.
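To make the three metrics concrete, here is a small sketch of how they can be computed from true and predicted speaker labels. The label vectors below are toy values invented for illustration. They are not the data behind the reported figures, and the function names are hypothetical.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one speaker label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)   # sentences predicted as label
    actual = sum(1 for t in y_true if t == label)      # sentences truly from label
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


def accuracy(y_true, y_pred):
    """Share of sentences assigned to the correct speaker."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy labels (assumed, not the source's test set): 4 Obama, 6 Romney sentences.
y_true = ["Obama"] * 4 + ["Romney"] * 6
y_pred = ["Obama", "Obama", "Romney", "Romney",
          "Romney", "Romney", "Romney", "Romney", "Romney", "Obama"]

for speaker in ("Obama", "Romney"):
    p, r = precision_recall(y_true, y_pred, speaker)
    print(f"{speaker}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy(y_true, y_pred):.2f}")
```

Precision divides the correct predictions for a speaker by everything predicted as that speaker, while recall divides them by everything that speaker actually said. That is why a model can have high precision but low recall for the same speaker, as in the Obama figures above.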