Working Title: “**Discovering toxic exposure mixtures via cross-validated causal inference and ensemble machine learning**”

Work with Catherine Metayer, Todd Whitehead, Alan Hubbard, and Mark van der Laan

**Abstract** (work in progress)

**Background**: There is increasing interest in evaluating the combined impact of groups of exposures, termed “mixtures” or “co-exposures”, in observational data. Compared to standard methods such as main-effects logistic regression, mixture-based analyses can better detect joint or interactive impacts between exposures, namely antagonistic or synergistic effects. Such effect modification is known to occur in biological pathways but will be missed with single-exposure analyses.

**Objectives**: We sought to create a method for mixture estimation and risk evaluation with benefits over existing approaches. Limitations of current methods can include dependence on exploratory data analyses that are difficult to automate or replicate, loss of power due to a single train/test split, biologically unrealistic statistical assumptions such as restrictive linear functional forms, inability to capture interactions between exposures, or risky reliance on a single prediction algorithm. We also aimed to support mixture estimation on subsets of exposures, which would allow ranking exposure sets, with statistical inference, by the impact of their corresponding mixture.

**Methods**: We framed mixture estimation as a data-adaptive target parameter in which an aggregate exposure mixture of interest is estimated on the training data. Our method estimates mixtures using backfit orthogonalized ensemble regression, which posits a nonparametric functional form for the mixture and adjusts for confounding nonparametrically. The mixture and the confounding adjustment functions are estimated using ensemble machine learning, and a bias-correction term is added to provide double robustness. The adjusted mean of the outcome is then estimated at quantiles of the mixture exposure using cross-validated targeted maximum likelihood estimation (CV-TMLE). When exposures are grouped into subsets (such as chemical classes), we support estimating separate mixtures for each subset and ranking the importance of exposure groups by the statistical significance of a counterfactual intervention setting all observations to the high vs. low mixture quantile.
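
To make the sample-splitting idea concrete, here is a minimal Python sketch. It is not the *tlmixture* implementation, and it makes strong simplifying assumptions: the mixture is a linear combination of exposures fit by ordinary least squares rather than by ensemble machine learning, and the adjusted means are cross-fitted plug-in (g-computation) estimates with no TMLE targeting step or bias correction. The function name `cv_mixture_contrast`, its parameters, and the simulated data are all hypothetical.

```python
import numpy as np

def cv_mixture_contrast(A, W, Y, n_folds=5, n_bins=4, seed=0):
    """Cross-fitted plug-in estimate of the adjusted mean outcome when all
    observations are set to the highest vs. lowest mixture quantile bin.
    Simplified stand-in: linear models only, no targeting step."""
    n = len(Y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds  # random, near-equal fold labels
    high, low = [], []
    for k in range(n_folds):
        train, valid = folds != k, folds == k
        # Step 1 (training folds only): learn mixture weights for the
        # exposures from a linear regression of Y on exposures + confounders.
        X_train = np.column_stack([np.ones(train.sum()), A[train], W[train]])
        coef, *_ = np.linalg.lstsq(X_train, Y[train], rcond=None)
        weights = coef[1:1 + A.shape[1]]
        mix_train = A[train] @ weights
        # Step 2: cut the training mixture into quantile bins 0..n_bins-1.
        cuts = np.quantile(mix_train, np.linspace(0, 1, n_bins + 1)[1:-1])
        bins_train = np.digitize(mix_train, cuts)
        # Step 3: outcome regression on bin indicators plus confounders.
        D_train = np.column_stack([np.eye(n_bins)[bins_train], W[train]])
        theta, *_ = np.linalg.lstsq(D_train, Y[train], rcond=None)
        # Step 4 (held-out fold): set every observation to one bin and
        # average the predictions over the held-out confounders.
        W_valid = W[valid]

        def adjusted_mean(b):
            indicators = np.tile(np.eye(n_bins)[b], (W_valid.shape[0], 1))
            return float(np.mean(np.column_stack([indicators, W_valid]) @ theta))

        high.append(adjusted_mean(n_bins - 1))
        low.append(adjusted_mean(0))
    # Average the fold-specific estimates.
    return float(np.mean(high)), float(np.mean(low))

# Simulated data (hypothetical): two confounded exposures with positive effects.
rng = np.random.default_rng(1)
n = 2000
W = rng.normal(size=(n, 2))
A = 0.5 * W + rng.normal(size=(n, 2))
Y = 1.0 * A[:, 0] + 0.5 * A[:, 1] + W[:, 0] + rng.normal(size=n)

high, low = cv_mixture_contrast(A, W, Y)
print(f"high bin: {high:.2f}; low bin: {low:.2f}; difference: {high - low:.2f}")
```

The mixture weights and quantile cutpoints are learned only on the training folds, and the adjusted means are evaluated on the held-out fold. That separation is what lets the mixture be treated as a data-adaptive target parameter without reusing the same observations for discovery and estimation.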

**Data analysis**: We applied our method to an existing dataset of chemical levels measured in household dust to estimate their influence on childhood leukemia risk, and to rank the importance of chemical classes in terms of toxicity. We also evaluated the method’s performance on pre-existing NIEHS simulated datasets.

**Conclusions**: We created and validated a novel targeted learning methodology for exposure mixture estimation, which generalizes variable importance measures to sets of variables. Our method is provided as an open-source R software package, *tlmixture*.