Machine Learning Projects
A list of machine learning projects that I have carried out in the course of my education.
To Be or Not To Be: Using Machine Learning to Recognize Shakespeare
Using a dataset of the complete texts of 33 of Shakespeare's plays from the First Folio, along with 22 plays by his contemporaries, we performed a lexical analysis to train a Naive Bayes Classifier. All texts were acquired from Project Gutenberg.
Process
We first attempted an analysis using TF-IDF features with an 80-20 train-test split, which ran into persistent misclassification problems. Accuracy sometimes reached 90%, but inspection of the confusion matrix showed that every text was being classified as Shakespeare, as seen in figure 1. Switching to a 75-25 split gave the same results.
We next preprocessed the data to perform Term Frequency (TF) and Relative Term Frequency (RTF) analyses. The RTF features occasionally classified a few of the non-Shakespeare plays correctly, but still labeled most or all plays as Shakespeare under both the 80-20 and 75-25 splits. The TF analysis finally broke this pattern: while its 75-25 split generally had better accuracy, precision, and recall than either the TF-IDF or RTF analyses, the TF 80-20 split gave us the best results. It still misclassified some texts, as figure 2 shows, but further analysis showed that only 11 of the 55 texts were ever misclassified. A brief sketch of this pipeline appears after the figure captions below.
Given more time, we would have liked to break the plays down into scenes to see whether certain scenes, especially in the 11 misclassified plays, skewed the training. We also would have liked to perform a semantic analysis of the plays, particularly of hendiadys, a construction Shakespeare used in a distinctive way.
Figure 1: TF-IDF 80-20 split Confusion Matrix
Figure 2: TF 80-20 split Confusion Matrix
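The following is a minimal sketch of the TF-based pipeline described above, not the original code: it assumes the play texts have already been loaded into a list of strings called plays with a parallel list of author labels called labels (both names are placeholders), and uses scikit-learn's CountVectorizer and MultinomialNB.

    # Sketch of the TF + Naive Bayes pipeline with an 80-20 split.
    # `plays` (list of full play texts) and `labels` ("Shakespeare" /
    # "Contemporary") are assumed to be loaded already.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score, confusion_matrix

    X_train, X_test, y_train, y_test = train_test_split(
        plays, labels, test_size=0.20, stratify=labels, random_state=0)

    # Raw term counts give the TF features; swapping in TfidfVectorizer
    # reproduces the TF-IDF variant discussed above.
    vectorizer = CountVectorizer()
    X_train_tf = vectorizer.fit_transform(X_train)
    X_test_tf = vectorizer.transform(X_test)

    clf = MultinomialNB()
    clf.fit(X_train_tf, y_train)
    pred = clf.predict(X_test_tf)

    print(accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))

Changing test_size to 0.25 gives the 75-25 split used in the comparison above.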
Comparing KNN and SVC
Using a dataset of 70,000 handwritten digits from 0-9, I compared the runtimes, accuracies, precisions, and recalls of a K-Nearest Neighbors Classifier and a Support Vector Classifier at recognizing handwritten digits. 60,000 of the digits were used for training and 10,000 for testing. All images were acquired from the MNIST database, which is drawn from NIST's handwritten digit data.
Process
After reading each 28x28 pixel image into a 2D array, I flattened it into a 1D array whose entries record how "bright" each pixel is; this representation can be seen in figure 3. After preprocessing the images into this format, I ran several iterations of the KNN, incrementally increasing the number of neighbors until I found a promising range. I then looped through that range to find the overall optimal number of neighbors, based on accuracy, as seen in figure 4, and used this value to train the final KNN. Next, I trained an SVC with no optimizations. I compared the confusion matrices, figures 5 and 6, as well as the overall runtimes, accuracies, precisions, and recalls shown in figure 7 (a code sketch of this process follows the figure captions below). The result was that the SVC has better accuracy, precision, and recall, but takes much more time than the KNN.
Figure 3: Representation of a handwritten 5
Figure 4: Accuracy of the KNN for each number of neighbors N between 5 and 15
Figure 5: Optimal KNN Confusion Matrix
Figure 6: SVC Confusion Matrix
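Below is a sketch of the comparison under some assumptions: it loads MNIST through scikit-learn's fetch_openml helper rather than the raw MNIST files, scales the pixel values, and searches neighbor counts 5 through 15; the variable names are placeholders rather than the original code.

    # Sketch of the KNN-vs-SVC comparison on MNIST (assumptions noted above).
    import time
    from sklearn.datasets import fetch_openml
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    # fetch_openml returns each image already flattened to a 784-value row.
    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    X = X / 255.0  # scale brightness values to [0, 1]; an added step, not from the write-up
    X_train, X_test = X[:60000], X[60000:]
    y_train, y_test = y[:60000], y[60000:]

    # Search the neighbor range for the best test accuracy (range is illustrative).
    best_k, best_acc = None, 0.0
    for k in range(5, 16):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        acc = knn.score(X_test, y_test)
        if acc > best_acc:
            best_k, best_acc = k, acc

    # Time the optimal KNN against an untuned SVC.
    start = time.time()
    knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
    knn_pred = knn.predict(X_test)
    knn_time = time.time() - start

    start = time.time()
    svc = SVC().fit(X_train, y_train)
    svc_pred = svc.predict(X_test)
    svc_time = time.time() - start

    print("KNN time:", knn_time, "accuracy:", accuracy_score(y_test, knn_pred))
    print("SVC time:", svc_time, "accuracy:", accuracy_score(y_test, svc_pred))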
KNN time:     163.99
SVC time:     408.90
KNN accuracy: 0.9694
SVC accuracy: 0.9792

Digit   KNN Precision   SVC Precision   KNN Recall   SVC Recall
0       0.9731          0.9745          0.9663       0.9851
1       0.9831          0.9758          0.9574       0.9748
2       0.9513          0.9886          0.9982       0.9921
3       0.9860          0.9714          0.9405       0.9754
4       0.9762          0.9826          0.9623       0.9786
5       0.9833          0.9854          0.9854       0.9854
6       0.9583          0.9755          0.9621       0.9689
7       0.9663          0.9799          0.9939       0.9929
8       0.9563          0.9719          0.9544       0.9613
9       0.9654          0.9864          0.9709       0.9765
Figure 7: Runtime and accuracy of the models, as well as precision and recall of each digit for KNN and SVC
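The per-digit precision and recall values in figure 7 could be reproduced with scikit-learn's per-class metrics. A brief sketch, reusing y_test, knn_pred, and svc_pred from the previous block:

    # Per-digit precision/recall and the confusion matrices of figures 5 and 6.
    from sklearn.metrics import confusion_matrix, precision_score, recall_score

    digits = sorted(set(y_test))
    knn_p = precision_score(y_test, knn_pred, labels=digits, average=None)
    svc_p = precision_score(y_test, svc_pred, labels=digits, average=None)
    knn_r = recall_score(y_test, knn_pred, labels=digits, average=None)
    svc_r = recall_score(y_test, svc_pred, labels=digits, average=None)

    print(confusion_matrix(y_test, knn_pred))  # figure 5
    print(confusion_matrix(y_test, svc_pred))  # figure 6
    for d, kp, sp, kr, sr in zip(digits, knn_p, svc_p, knn_r, svc_r):
        print(d, "KNN P:", kp, "SVC P:", sp, "KNN R:", kr, "SVC R:", sr)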