Recent Publications

E-cigarette use is increasing dramatically among adolescents as social media marketing portrays “vaping” products as healthier …

Background: Although epidemiologic studies suggest that early immune stimulation is protective against childhood leukemia, evidence for …

Background: It is estimated that vaccinating 50-70% of school-aged children for influenza can produce population-wide indirect effects. …

Chapter in Targeted Learning in Data Science (2018) covering the varimpact variable importance algorithm.


Measuring Hate Speech

Integrating item response theory with deep NLP to improve the measurement of hate speech.

Targeted Exposure Mixtures

Analysis of exposure mixtures as data-adaptive target parameters based on cross-validated targeted learning (CV-TMLE).

Chest Pain Risk Score

Development of a risk score for chest pain at Kaiser Permanente using machine learning, generalized low rank models, variable importance, and accumulated local effect plots.

Varimpact: causally motivated variable importance

Ranking the importance of variables based on their estimated treatment effect on an outcome.

Traumatic Brain Injury Prediction

Variable importance and prediction of traumatic brain injury in urgent care settings.

Childhood Leukemia

Discovering the causes of childhood leukemia so that it can be prevented.

Caselaw Historical Semantics

Estimating historical trends in legal semantics using aligned word embeddings.

Instagram Vaping

Application of deep learning to measure vaping marketing on Instagram.

Recent & Upcoming Talks

Applied machine learning workshop, talk on machine learning for human rights, and talk on hate speech measurement

Machine learning introduction interwoven with preliminary results from our hate speech project.


Short courses

Supervised learning in R (6-8 hours): Preprocessing, cross-validation, lasso, decision trees, random forest, XGBoost, and SuperLearner ensembles.

Deep learning in R (6-8 hours): Deep learning with Keras - building & training deep networks, image classification, transfer learning, text analysis, and visualization.

Unsupervised learning in R (6-8 hours): Clustering (HDBSCAN, LCA, HOPACH), dimensionality reduction (GLRM, UMAP), and anomaly detection (isolation forests).

Guide to SuperLearner (4-6 hours): Basic ensembles, hyperparameter tuning, nested cross-validation, parallelization, diagnostics, feature selection, and loss customization.

Recent Posts

Date: April 2015. The following is a simple guide to installing Stata on Windows using Amazon EC2, which I created to help out fellow UC …

Date: April 2015. As the next round of admitted political science PhD students starts thinking about what they should be doing to …