Machine Learning Projects
A list of machine learning projects that I have carried out in the course of my education.
To Be or Not To Be: Using Machine Learning to Recognize Shakespeare
Using a dataset of the complete texts of 33 of Shakespeare's plays from the First Folio, along with 22 plays by his contemporaries, we performed a lexical analysis to train a Naive Bayes Classifier. All texts were acquired from Project Gutenberg.
Process
We first attempted an analysis using TF-IDF features with an 80-20 train-test split, which ran into persistent misclassification problems. Accuracy sometimes reached 90%, but inspection of the confusion matrix showed that every text was being classified as Shakespeare, as seen in figure 1. Switching to a 75-25 split gave the same results.
We next preprocessed the data to perform Term Frequency (TF) and Relative Term Frequency (RTF) analyses. The RTF features occasionally classified a few of the non-Shakespeare plays correctly, but still labeled most or all plays as Shakespeare under both the 80-20 and 75-25 splits. The TF analysis finally broke this pattern: while its 75-25 split generally had better accuracy, precision, and recall than either the TF-IDF or RTF analyses, the TF 80-20 split gave us the best results. It still misclassified some texts, as figure 2 shows, but further analysis showed that only 11 of the 55 texts were ever misclassified. A brief sketch of this pipeline appears after the figure captions below.
Given more time, we would have liked to break the plays down into scenes to see whether certain scenes, especially in the 11 misclassified plays, skewed the training. We also would have liked to perform a semantic analysis of the plays, particularly of hendiadys, a construction Shakespeare used in a distinctive way.
Figure 1: TF-IDF 80-20 split Confusion Matrix
Figure 2: TF 80-20 split Confusion Matrix
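The following is a minimal sketch of the TF-based pipeline described above, not the original code: it assumes the play texts have already been loaded into a list of strings called plays with a parallel list of author labels called labels (both names are placeholders), and uses scikit-learn's CountVectorizer and MultinomialNB.

    # Sketch of the TF + Naive Bayes pipeline with an 80-20 split.
    # `plays` (list of full play texts) and `labels` ("Shakespeare" /
    # "Contemporary") are assumed to be loaded already.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score, confusion_matrix

    X_train, X_test, y_train, y_test = train_test_split(
        plays, labels, test_size=0.20, stratify=labels, random_state=0)

    # Raw term counts give the TF features; swapping in TfidfVectorizer
    # reproduces the TF-IDF variant discussed above.
    vectorizer = CountVectorizer()
    X_train_tf = vectorizer.fit_transform(X_train)
    X_test_tf = vectorizer.transform(X_test)

    clf = MultinomialNB()
    clf.fit(X_train_tf, y_train)
    pred = clf.predict(X_test_tf)

    print(accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))

Changing test_size to 0.25 gives the 75-25 split used in the comparison above.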
Comparing KNN and SVC
Using a dataset of 70,000 handwritten digits from 0-9, I compared the runtimes, accuracies, precisions, and recalls of a K-Nearest Neighbors Classifier and a Support Vector Classifier at recognizing handwritten digits. 60,000 of the digits were used for training and 10,000 for testing. All images were acquired from the MNIST database, which is drawn from NIST's handwritten digit data.
Process
After reading each 28x28 pixel image into a 2D array, I flattened it into a 1D array whose entries record how "bright" each pixel is; this representation can be seen in figure 3. After preprocessing the images into this format, I ran several iterations of the KNN, incrementally increasing the number of neighbors until I found a promising range. I then looped through that range to find the overall optimal number of neighbors, based on accuracy, as seen in figure 4, and used this value to train the final KNN. Next, I trained an SVC with no optimizations. I compared the confusion matrices, figures 5 and 6, as well as the overall runtimes, accuracies, precisions, and recalls shown in figure 7 (a code sketch of this process follows the figure captions below). The result was that the SVC has better accuracy, precision, and recall, but takes much more time than the KNN.
Figure 3: Representation of a handwritten 5
Figure 4: Accuracy of the KNN for each number of neighbors N between 5 and 15
Figure 5: Optimal KNN Confusion Matrix
Figure 6: SVC Confusion Matrix
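Below is a sketch of the comparison under some assumptions: it loads MNIST through scikit-learn's fetch_openml helper rather than the raw MNIST files, scales the pixel values, and searches neighbor counts 5 through 15; the variable names are placeholders rather than the original code.

    # Sketch of the KNN-vs-SVC comparison on MNIST (assumptions noted above).
    import time
    from sklearn.datasets import fetch_openml
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    # fetch_openml returns each image already flattened to a 784-value row.
    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    X = X / 255.0  # scale brightness values to [0, 1]; an added step, not from the write-up
    X_train, X_test = X[:60000], X[60000:]
    y_train, y_test = y[:60000], y[60000:]

    # Search the neighbor range for the best test accuracy (range is illustrative).
    best_k, best_acc = None, 0.0
    for k in range(5, 16):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        acc = knn.score(X_test, y_test)
        if acc > best_acc:
            best_k, best_acc = k, acc

    # Time the optimal KNN against an untuned SVC.
    start = time.time()
    knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
    knn_pred = knn.predict(X_test)
    knn_time = time.time() - start

    start = time.time()
    svc = SVC().fit(X_train, y_train)
    svc_pred = svc.predict(X_test)
    svc_time = time.time() - start

    print("KNN time:", knn_time, "accuracy:", accuracy_score(y_test, knn_pred))
    print("SVC time:", svc_time, "accuracy:", accuracy_score(y_test, svc_pred))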
KNN time:     163.99
SVC time:     408.90
KNN accuracy: 0.9694
SVC accuracy: 0.9792

Digit   KNN Precision   SVC Precision   KNN Recall   SVC Recall
0       0.9731          0.9745          0.9663       0.9851
1       0.9831          0.9758          0.9574       0.9748
2       0.9513          0.9886          0.9982       0.9921
3       0.9860          0.9714          0.9405       0.9754
4       0.9762          0.9826          0.9623       0.9786
5       0.9833          0.9854          0.9854       0.9854
6       0.9583          0.9755          0.9621       0.9689
7       0.9663          0.9799          0.9939       0.9929
8       0.9563          0.9719          0.9544       0.9613
9       0.9654          0.9864          0.9709       0.9765
Figure 7: Runtime and accuracy of the models, as well as precision and recall of each digit for KNN and SVC
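The per-digit precision and recall values in figure 7 could be reproduced with scikit-learn's per-class metrics. A brief sketch, reusing y_test, knn_pred, and svc_pred from the previous block:

    # Per-digit precision/recall and the confusion matrices of figures 5 and 6.
    from sklearn.metrics import confusion_matrix, precision_score, recall_score

    digits = sorted(set(y_test))
    knn_p = precision_score(y_test, knn_pred, labels=digits, average=None)
    svc_p = precision_score(y_test, svc_pred, labels=digits, average=None)
    knn_r = recall_score(y_test, knn_pred, labels=digits, average=None)
    svc_r = recall_score(y_test, svc_pred, labels=digits, average=None)

    print(confusion_matrix(y_test, knn_pred))  # figure 5
    print(confusion_matrix(y_test, svc_pred))  # figure 6
    for d, kp, sp, kr, sr in zip(digits, knn_p, svc_p, knn_r, svc_r):
        print(d, "KNN P:", kp, "SVC P:", sp, "KNN R:", kr, "SVC R:", sr)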