Measuring hate speech

Outcome phenomena are typically measured at the binary level: a comment is toxic or not, an image has sexual content or it doesn’t, a patient is healthy or deceased. But the real world isn’t that simple: most target variables are inherently continuous in nature. Physical quantities such as temperature and weight can be measured as interval variables where magnitudes are meaningful. How can we achieve that same interval measurement for arbitrary outcomes, creating a continuous scale with meaningful magnitudes? For example, what is the cardiovascular health of a patient on a scale of -3 to +3?

We propose a method for measuring phenomena as continuous, interval variables by unifying deep learning with the Constructing Measures approach to Rasch item response theory (IRT). The crux of our method is decomposing the target construct into multiple constituent components measured as ordinal survey items, which are then transformed via an IRT non-linear activation into a continuous measure of unprecedented quality. In particular, we estimate labeler bias and eliminate its influence on the final construct when creating a training dataset, which renders obsolete the notion of inter-rater reliability as a quality metric. To our knowledge this IRT bias adjustment has never before been implemented in machine learning but is critical for algorithmic fairness. We further estimate the response quality of each individual labeler, allowing responses from low-quality labelers to be removed.
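As a rough illustration of the bias adjustment described above, the sketch below uses a dichotomous Rasch model with an added rater-severity term. It is a deliberate simplification, not the authors' implementation: the method above uses ordinal items, whereas this example uses yes/no answers, and the names `response_prob`, `estimate_theta`, and `rater_severity`, the item difficulties, and the grid-search estimator are all assumptions made for illustration.

```python
import math

def response_prob(theta, item_difficulty, rater_severity=0.0):
    """Probability that a rater endorses an item (dichotomous Rasch with a rater term).

    theta is the comment's position on the latent scale. A positive
    rater_severity (hypothetical sign convention) makes endorsement less likely.
    """
    return 1.0 / (1.0 + math.exp(-(theta - item_difficulty - rater_severity)))

def log_likelihood(theta, responses, item_difficulties, rater_severity):
    """Log-likelihood of one rater's yes/no answers for a candidate theta."""
    total = 0.0
    for answer, difficulty in zip(responses, item_difficulties):
        p = response_prob(theta, difficulty, rater_severity)
        total += math.log(p) if answer == 1 else math.log(1.0 - p)
    return total

def estimate_theta(responses, item_difficulties, rater_severity=0.0):
    """Grid-search maximum-likelihood estimate of theta on [-4, 4] in steps of 0.01."""
    grid = [i / 100.0 for i in range(-400, 401)]
    return max(
        grid,
        key=lambda t: log_likelihood(t, responses, item_difficulties, rater_severity),
    )

# Made-up inputs: three items of increasing difficulty and one rater's answers.
item_difficulties = [-1.0, 0.0, 1.0]
responses = [1, 1, 0]

# The same answers imply a higher theta when they come from a more severe rater.
theta_neutral = estimate_theta(responses, item_difficulties, rater_severity=0.0)
theta_severe = estimate_theta(responses, item_difficulties, rater_severity=1.0)
```

In this toy setup the severity term simply shifts the estimate: identical raw answers from a rater assumed to be one unit more severe place the comment one unit higher on the scale. This is the sense in which a rater's bias can be separated from the construct being measured.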

Our IRT scaling procedure fits naturally into multi-output, weight-sharing deep learning architectures in which our theorized components of the target outcome are used as supervised, ordinal latent variables for the neural networks’ internal representation learning, improving sample efficiency and promoting generalizability. Built-in explainability is an inherent advantage of our method, because the final numeric prediction can be directly explained by the predictions on the constituent components.
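The shape of such a multi-output, weight-sharing architecture can be sketched in a few lines of plain Python. This is only a structural illustration under stated assumptions: the weights are random and untrained, the component names, layer sizes, and number of response levels are placeholders, and each head uses a plain softmax over levels, which ignores their ordering. The models evaluated in this work are much larger text encoders.

```python
import math
import random

random.seed(0)

N_FEATURES = 4   # size of a hypothetical text embedding
N_HIDDEN = 3     # size of the shared internal representation
N_LEVELS = 5     # response levels per survey item (assumed)
COMPONENTS = ["component_a", "component_b", "component_c"]  # placeholder names

def random_matrix(rows, cols):
    """Untrained weights, for illustration only."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# One shared encoder feeds every head: this is the weight sharing.
shared_weights = random_matrix(N_HIDDEN, N_FEATURES)
# One small output head per theorized component of the construct.
head_weights = {name: random_matrix(N_LEVELS, N_HIDDEN) for name in COMPONENTS}

def predict(embedding):
    """Return, for each component, a probability distribution over response levels."""
    hidden = [math.tanh(h) for h in matvec(shared_weights, embedding)]
    return {name: softmax(matvec(w, hidden)) for name, w in head_weights.items()}

predictions = predict([0.2, -0.5, 0.1, 0.9])
```

Because every component has its own named output, a final score derived from these outputs can be traced back to the per-component predictions, which is the kind of built-in explainability described above.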

We demonstrate our method on a new dataset of 50,000 online comments sourced from YouTube, Twitter, and Reddit and labeled to measure a spectrum from hate speech to counterspeech. We evaluate Universal Sentence Encoders, RoBERTa, XLNet, and ULMFiT as contextual representation models for the comment text, and benchmark our predictive accuracy against Google Jigsaw’s Perspective API models.

Preprint to be posted in October!

Chris Kennedy
Consulting data scientist, biostatistics PhD student


Applied machine learning workshop, talk on machine learning for human rights, and talk on hate speech measurement

Machine learning introduction interwoven with preliminary results from our hate speech project.