Working Title: “Discovering toxic exposure mixtures by combining cross-validated causal inference with machine learning”

Work with Catherine Metayer, Todd Whitehead, and Alan Hubbard

**Abstract** (work in progress)

Scientists typically seek to understand the causal effect of toxic chemicals by examining changes in one chemical exposure at a time, for example via the estimated beta coefficients of a logistic regression. This independent-effect approach may poorly estimate the toxicity of chemicals that act in combination to influence disease, whether through synergistic or antagonistic relationships. Joint effects can be estimated through multidimensional grids, but that becomes infeasible as the number of exposure variables grows, due to the curse of dimensionality. As a result, recent years have seen a shift toward examining joint relationships of chemical exposures as summarized through mixture modeling methods such as weighted quantile sum regression and Bayesian kernel machine regression.
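To make the grid argument concrete, the short sketch below counts the cells in a full joint-exposure grid. The choice of four levels per exposure (quartiles) and a study of 1,000 subjects are illustrative assumptions, not figures from our data.

```python
def grid_cells(n_exposures: int, levels_per_exposure: int) -> int:
    """Number of cells in a full joint grid over all exposures."""
    return levels_per_exposure ** n_exposures


n_subjects = 1000  # hypothetical study size, for illustration only

for n_exposures in (2, 5, 10, 20):
    cells = grid_cells(n_exposures, levels_per_exposure=4)
    per_cell = n_subjects / cells
    print(f"{n_exposures:>2} exposures: {cells:,} cells, "
          f"{per_cell:.3g} subjects per cell on average")
```

With ten exposures cut at quartiles the grid already has more than a million cells, so almost every cell is empty at realistic sample sizes.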

We develop a method that combines multiple exposures into an aggregate mixture, defined as a data-adaptive target parameter within the targeted learning causal inference framework. Our method aggregates groups of correlated exposures into exposure indices that best capture the impact of exposure mixtures on an outcome, such as mortality or case-control status. As a proof of concept, we use partial least squares to data-adaptively estimate a convex weighting of the available exposures, although our method supports any mixture estimation algorithm.
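As a rough illustration of a convex exposure weighting, the sketch below uses the fact that, for a single outcome, the first partial least squares weight vector is proportional to the covariance between each standardized exposure and the outcome. Taking absolute values and rescaling so the weights sum to one is an assumption made here for illustration; it is not necessarily how **tlmixture** constructs its weights. The function names and the simulated data are hypothetical.

```python
import random
from statistics import mean, pstdev


def standardize(column):
    """Center a column and scale it to unit standard deviation."""
    m, s = mean(column), pstdev(column)
    return [(v - m) / s for v in column]


def convex_pls_weights(exposures, outcome):
    """Convex weights from the first PLS direction (single outcome).

    exposures: list of columns, one list of values per exposure.
    Assumption: absolute loadings are rescaled to sum to one.
    """
    y_bar = mean(outcome)
    y_centered = [y - y_bar for y in outcome]
    # First PLS weight vector is proportional to X'y on standardized X.
    loadings = [
        sum(x * y for x, y in zip(standardize(col), y_centered))
        for col in exposures
    ]
    magnitudes = [abs(value) for value in loadings]
    total = sum(magnitudes)
    return [m / total for m in magnitudes]


def mixture_index(exposures, weights):
    """Weighted sum of standardized exposures, one value per subject."""
    columns = [standardize(col) for col in exposures]
    n = len(columns[0])
    return [sum(w * col[i] for w, col in zip(weights, columns)) for i in range(n)]


# Simulated data: three correlated exposures; only the first two affect the outcome.
rng = random.Random(0)
n = 500
shared = [rng.gauss(0, 1) for _ in range(n)]
exposures = [[0.5 * s + rng.gauss(0, 1) for s in shared] for _ in range(3)]
outcome = [2.0 * a + 1.0 * b + rng.gauss(0, 1)
           for a, b in zip(exposures[0], exposures[1])]

weights = convex_pls_weights(exposures, outcome)
index = mixture_index(exposures, weights)
print([round(w, 3) for w in weights])
```

Any other mixture estimator could be swapped in for `convex_pls_weights`, as long as it returns a function of the exposures that can be applied to new observations.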

Integrated with the exposure mixture estimation, we use cross-validated targeted maximum likelihood estimation (CV-TMLE) to estimate the exposure-specific counterfactual means for quantiles of the mixture using out-of-sample data, that is, validation data held out from estimating the mixture function (the exposure weights). This sample splitting yields an effect estimate that is not biased by overfitting to the training data, and cross-validation allows us to rotate this procedure over the full dataset, maximizing our power. The integration into targeted learning reduces statistical assumptions, makes efficient use of small samples, minimizes estimator bias, and leverages machine learning to capture complex relationships.
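The fold rotation described above can be sketched as follows. This is a deliberately simplified illustration: the mixture learner is a covariance-based stand-in, and the final step is a plain mean of held-out outcomes within each mixture quantile, with no confounder adjustment and no targeting step, so it is not CV-TMLE itself. All function names and the simulated data are hypothetical.

```python
import random
from statistics import mean, quantiles


def fit_weights(exposures, outcome):
    """Stand-in mixture learner: normalized absolute covariance with the outcome."""
    y_bar = mean(outcome)
    covs = []
    for col in exposures:
        x_bar = mean(col)
        covs.append(abs(sum((x - x_bar) * (y - y_bar) for x, y in zip(col, outcome))))
    total = sum(covs)
    return [c / total for c in covs]


def mixture(exposures, weights, rows):
    """Weighted exposure index for the requested row indices."""
    return [sum(w * col[i] for w, col in zip(weights, exposures)) for i in rows]


def cross_validated_quantile_means(exposures, outcome, n_folds=5, n_quantiles=4, seed=0):
    """Mean held-out outcome per mixture quantile, rotating over folds."""
    order = list(range(len(outcome)))
    random.Random(seed).shuffle(order)
    folds = [order[k::n_folds] for k in range(n_folds)]
    held_out = [[] for _ in range(n_quantiles)]
    for validation in folds:
        held = set(validation)
        training = [i for i in order if i not in held]
        # Weights and quantile cutpoints are learned on training rows only.
        train_x = [[col[i] for i in training] for col in exposures]
        train_y = [outcome[i] for i in training]
        weights = fit_weights(train_x, train_y)
        cutpoints = quantiles(mixture(train_x, weights, range(len(training))),
                              n=n_quantiles)
        # The fitted mixture is then applied to the held-out rows.
        for i, value in zip(validation, mixture(exposures, weights, validation)):
            level = sum(value > c for c in cutpoints)
            held_out[level].append(outcome[i])
    return [mean(values) for values in held_out]


# Simulated data: three correlated exposures; only the first two affect the outcome.
rng = random.Random(1)
n = 500
shared = [rng.gauss(0, 1) for _ in range(n)]
exposures = [[0.5 * s + rng.gauss(0, 1) for s in shared] for _ in range(3)]
outcome = [2.0 * a + 1.0 * b + rng.gauss(0, 1)
           for a, b in zip(exposures[0], exposures[1])]

quantile_means = cross_validated_quantile_means(exposures, outcome)
print([round(m, 2) for m in quantile_means])
```

Every observation serves as validation data exactly once, so the per-quantile summaries use the full sample while never evaluating a mixture on the data that produced its weights.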

We apply our method to chemical exposures measured in residential dust to estimate the impact of chemical mixtures on the risk of childhood leukemia. We also evaluate it on simulated data. Our method is provided as an open-source software package, **tlmixture**.