Clement Ow
Online Hate Speech Prediction
Disclaimer: The dataset may contain offensive content. It is discussed and used only for research and academic purposes, and I do not condone such behaviour. Let's try our best to eradicate hate speech! #techforgood
Abstract
In the digital age, online hate speech on social media networks can incite hate-motivated violence or even crimes against particular groups of people. According to statistics released by the FBI, hate-related attacks targeting specific groups of people are at a 16-year high in the United States of America [1]. With many social media comments posted every second, there is an immense need to eradicate as much hate speech as possible through automatic detection and prediction.
Multiple modelling approaches are explored, ranging from classical machine learning models to state-of-the-art deep learning models. F1 score and recall are the metrics prioritised in model comparison. Where two models tie on both, the actual numbers of false negatives and false positives are compared.
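The comparison rule above can be sketched in plain Python. This is a minimal illustration, not the project's actual evaluation code: the names `Scores`, `score` and `ranking_key`, the toy labels, the rounding to two decimals before checking for ties, and the choice to look at false negatives before false positives are all assumptions made for the example.

```python
from typing import NamedTuple, Sequence, Tuple


class Scores(NamedTuple):
    f1: float
    recall: float
    precision: float
    false_negatives: int
    false_positives: int


def score(y_true: Sequence[int], y_pred: Sequence[int]) -> Scores:
    """Score binary predictions, where 1 marks hate speech and 0 anything else."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return Scores(f1, recall, precision, fn, fp)


def ranking_key(s: Scores) -> Tuple[float, float, int, int]:
    """Sort key: higher F1 first, then higher recall, then fewer FN, then fewer FP.

    Rounding to two decimals (an assumption) decides when two scores count as "the same".
    """
    return (-round(s.f1, 2), -round(s.recall, 2), s.false_negatives, s.false_positives)


# Hypothetical toy labels and predictions, not taken from the project's dataset.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = {
    "model_a": [1, 1, 1, 0, 1, 0, 0, 0],
    "model_b": [1, 1, 0, 0, 0, 0, 0, 0],
}

results = {name: score(y_true, y_pred) for name, y_pred in predictions.items()}
best = min(results, key=lambda name: ranking_key(results[name]))
print(best, results[best])
```

Here `model_b` makes no false positives but misses half of the hateful comments, so it loses on both F1 and recall, which is the kind of trade-off the prioritised metrics are meant to surface.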
Summary of Findings
I developed machine learning and deep learning models to help predict and flag offensive comments. I present current approaches to this classification task and explore different techniques, including deep learning models and state-of-the-art NLP models such as BERT.
Machine Learning models used:
- Logistic Regression
- Naive Bayes Classifier
- Ensemble models:
  - Random Forest Classifier
  - Extra Trees Classifier
  - AdaBoost Classifier
  - Gradient Boosting Classifier
Deep Learning models used:
- LSTM
- CNN & LSTM
- BERT model [2]
The best model in terms of test F1 score was the Logistic Regression model, with an F1 score of 87.91% and a ROC AUC score of 0.91. The state-of-the-art BERT model came a close second with an F1 score of 87%.
I do not want to sacrifice precision, and since the F1 score is the harmonic mean of precision and recall, the Logistic Regression model is preferred even though BERT's recall was higher. Furthermore, Logistic Regression is a far more interpretable model than BERT.
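To make the harmonic-mean argument concrete, the definition of F1 and a small worked comparison are given below. The precision and recall values are hypothetical, chosen only to illustrate the effect; they are not the scores of the models in this project.

```latex
% F1 as the harmonic mean of precision P and recall R
F_1 = \frac{2PR}{P + R}

% Hypothetical model X: balanced precision and recall
P = 0.90,\quad R = 0.80 \;\Rightarrow\; F_1 = \frac{2(0.90)(0.80)}{0.90 + 0.80} = \frac{1.44}{1.70} \approx 0.847

% Hypothetical model Y: higher recall, lower precision
P = 0.70,\quad R = 0.95 \;\Rightarrow\; F_1 = \frac{2(0.70)(0.95)}{0.70 + 0.95} = \frac{1.33}{1.65} \approx 0.806
```

Because the harmonic mean is pulled towards the smaller of its two inputs, the model with the higher recall can still end up with the lower F1 when its precision drops.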
For more details on this project, you can visit my GitHub repo. I'm always trying to learn as much as I can in this NLP space, so feel free to connect with me on LinkedIn and I'd love to discuss more!