Recent Publications

E-cigarette use is increasing dramatically among adolescents as social media marketing portrays “vaping” products as healthier …

Background: Although epidemiologic studies suggest that early immune stimulation is protective against childhood leukemia, evidence for …

Background: It is estimated that vaccinating 50-70% of school-aged children for influenza can produce population-wide indirect effects. …

Chapter in Targeted Learning in Data Science (2018) covering the varimpact variable importance algorithm.


Measuring hate speech

Integrating item response theory with deep NLP to improve the measurement of hate speech.

Targeted Exposure Mixtures

Analysis of exposure mixtures as data-adaptive target parameters based on cross-validated targeted learning (CV-TMLE).

Chest Pain Risk Score

Development of a risk score for chest pain at Kaiser Permanente using machine learning, generalized low rank models, variable importance, and accumulated local effect plots.

Varimpact: causally motivated variable importance

Ranking the importance of variables based on their estimated treatment effect on an outcome.

Traumatic Brain Injury Prediction

Variable importance and prediction of traumatic brain injury in urgent care settings.

Childhood Leukemia

Discovering the causes of childhood leukemia so that it can be prevented.

Caselaw Historical Semantics

Estimating historical trends in legal semantics using aligned word embeddings.

Instagram Vaping

Application of deep learning to measure vaping marketing on Instagram.


Short courses

Supervised learning in R (6-8 hours): Preprocessing, cross-validation, lasso, decision trees, random forest, xgboost, and SuperLearner ensembles.

Deep learning in R (6-8 hours): Deep learning with Keras, including building and training deep networks, image classification, transfer learning, text analysis, and visualization.

Unsupervised learning in R (6-8 hours): Clustering (HDBSCAN, LCA, HOPACH), dimensionality reduction (GLRM, UMAP), and anomaly detection (isolation forests).

Guide to SuperLearner (4-6 hours): Basic ensembles, hyperparameter tuning, nested cross-validation, parallelization, diagnostics, feature selection, and loss customization.

Causal inference with targeted learning (6-8 hours): Causal diagrams, regression with SuperLearner, inverse probability of treatment weighting, targeted maximum likelihood estimation, effect modification, causal variable importance, and exposure mixture modeling.

Feature selection in R (6-8 hours): Permutation importance, adaptive elastic net, Relief family (ReliefF, STIR, MultiSURF), joint mutual information, and knockoffs.

Please feel free to contact me to discuss training for your institution.

Recent Posts

Date: April 2015. The following is a simple guide to installing Stata on Windows using Amazon EC2, which I created to help out fellow UC …

Date: April 2015. As the next round of admitted political science PhD students starts thinking about what they should be doing to …