Selected Publications

We develop a method to create debiased, continuous, interval-valued latent variables from human-labeled data by combining faceted Rasch …

An evaluation of Oakland’s Shoo the Flu program, published in PLOS Medicine.

Causal variable importance to create a new risk score for complications during pregnancy.

Examining patient characteristics associated with telephone or video telemedicine visits, rather than in-person care.

Computer vision and natural language processing to track how influencers promote vaping on Instagram.

Latent class analysis clustering, causal variable importance, and logistic regression to understand childhood leukemia.

Chapter in Targeted Learning in Data Science (2018) covering the varimpact variable importance algorithm.


Measuring hate speech

Integration of item response theory with deep NLP to enable major innovations in the measurement of hate speech.

Targeted Exposure Mixtures

Analysis of exposure mixtures as data-adaptive target parameters based on cross-validated targeted learning (CV-TMLE).

Chest Pain Risk Score

Development of a risk score for chest pain at Kaiser Permanente using machine learning, generalized low rank models, variable importance, and accumulated local effects plots.

Varimpact: causally motivated variable importance

Ranking the importance of variables based on their estimated treatment effect on an outcome.

Instagram Vaping

Application of deep learning to measure vaping marketing on Instagram.


Short courses

Supervised learning in R (6-8 hours): Preprocessing, cross-validation, lasso, decision trees, random forest, XGBoost, and SuperLearner ensembles.

Deep learning in R (6-8 hours): Building and training deep networks with Keras, image classification, transfer learning, text analysis, and visualization.

Unsupervised learning in R (6-8 hours): Clustering (HDBSCAN, LCA, HOPACH), dimensionality reduction (GLRM, UMAP), and anomaly detection (isolation forests).

Guide to SuperLearner (4-6 hours): Basic ensembles, hyperparameter tuning, nested cross-validation, parallelization, diagnostics, feature selection, and loss customization.

Causal inference with targeted learning (6-8 hours): Causal diagrams, regression with SuperLearner, inverse probability of treatment weighting, targeted maximum likelihood estimation, effect modification, causal variable importance, and exposure mixture modeling.

Feature selection in R (6-8 hours): Permutation importance, adaptive elastic net, Relief family (ReliefF, STIR, MultiSURF), joint mutual information, and knockoffs.

Please feel free to contact me to discuss training for your institution.

Recent Posts

April 2015: The following is a simple guide to installing Stata on Windows using Amazon EC2, which I created to help out fellow UC …

April 2015: As the next round of admitted political science PhD students starts thinking about what they should be doing to …