MACHINE LEARNING ALGORITHMS FOR EXPERIMENT DESIGN IN HIGH DIMENSIONAL LONGITUDINAL COHORT STUDIES: IMPLICATIONS FOR CLINICAL TRIALS

      Background

      We revisit the classical problem of selecting a subset of participants from a larger cohort such that a statistical model estimated on the subset alone yields parameter estimates similar to those of a model fit to all participants. This setup is important in Alzheimer's disease (AD) studies: budget or logistical constraints may make it necessary to select a subset of “representative” participants (using baseline imaging, clinical, and cognitive data) from a larger cohort for longitudinal measurements. We present the first known algorithm for the regime where the baseline predictors are high-dimensional (e.g., imaging or genetic data) and a specific subset of individuals must be selected, under a pre-specified sample size restriction, to maximize power for estimating the parameters of a sparse linear model.

      Methods

      The selection procedure has access only to data available at baseline. It must select a statistically diverse set of participants that best represents the underlying distribution of the cohort from which recruitment is performed, but the number of candidate subsets grows combinatorially, so exhaustive enumeration is infeasible. Based on geometric observations related to D-optimality in statistics, we give an algorithm for subject selection. We evaluate it on ADNI2 data, where the dependent variables are longitudinal cognitive outcomes and the predictors are image-based regions of interest (ROIs) available at baseline.
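
      To make the D-optimality idea concrete, the sketch below shows a standard greedy heuristic that picks participants one at a time so as to maximize the log-determinant of a ridge-regularized information matrix. It is not the algorithm of this abstract, whose details are not given here. The function names, the ridge term (added so the criterion stays finite when predictors outnumber selected participants), and the use of a dense p-by-p inverse (practical only for moderate p) are all assumptions made for illustration.

```python
import numpy as np


def log_det_information(X, idx, ridge=1e-3):
    """log det(X_S^T X_S + ridge * I) for the rows of X listed in idx."""
    X_s = X[np.asarray(idx, dtype=int)]
    p = X.shape[1]
    _, logdet = np.linalg.slogdet(X_s.T @ X_s + ridge * np.eye(p))
    return logdet


def greedy_d_optimal(X, budget, ridge=1e-3):
    """Greedy heuristic (illustrative, hypothetical helper): choose `budget`
    rows of the baseline matrix X (participants x predictors) that
    approximately maximize log det(X_S^T X_S + ridge * I)."""
    n, p = X.shape
    if not 0 < budget <= n:
        raise ValueError("budget must be between 1 and the cohort size")

    # Inverse of the current information matrix; starts at (ridge * I)^-1.
    A_inv = np.eye(p) / ridge
    available = np.ones(n, dtype=bool)
    selected = []

    for _ in range(budget):
        # Matrix determinant lemma: adding row x multiplies the determinant
        # by (1 + x^T A^-1 x), so the best next row maximizes x^T A^-1 x.
        gains = np.einsum("ij,jk,ik->i", X, A_inv, X)
        gains[~available] = -np.inf
        i = int(np.argmax(gains))
        selected.append(i)
        available[i] = False

        # Sherman-Morrison rank-one update of the inverse.
        Ax = A_inv @ X[i]
        A_inv -= np.outer(Ax, Ax) / (1.0 + X[i] @ Ax)

    return selected
```

      Calling greedy_d_optimal(X, budget) on a baseline feature matrix returns the row indices of the chosen participants, and log_det_information can then be used to compare that choice against, say, a random subset of the same size.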

      Results

      Using cognitive scores and risk factors for decline, we demonstrate that the subset selected by our algorithm indeed approximates the full cohort. Figure 1 shows the error between the covariates picked by the full-cohort model and those picked by the selected-subset model. Errors decrease gradually as the budget or the number of allowed predictors (in the sparse model) increases. Figures 2-3 show how well the selected subset performs in data fitting (e.g., using a linear model). Figures 4-5 show the ratios of the 1st through 4th moments of the ADAS and CDR samples generated from the full cohort and the subset, indicating that the subset approximates the full cohort well.
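
      As an illustration of the moment comparison, the sketch below computes the ratio of the 1st through 4th sample moments of a subset to those of the full cohort; ratios near 1 indicate that the subset reproduces the cohort's distribution of a score. The abstract does not say whether raw or central moments were used or which way the ratio was taken, so raw moments and subset-over-full are assumptions here, and the function name is hypothetical.

```python
import numpy as np


def moment_ratios(full_scores, subset_scores, max_order=4):
    """Ratio (subset / full cohort) of raw sample moments of order
    1..max_order. Assumes the full-cohort moments are nonzero."""
    full = np.asarray(full_scores, dtype=float)
    subset = np.asarray(subset_scores, dtype=float)
    orders = np.arange(1, max_order + 1)
    # k-th raw moment: mean of the k-th powers of the scores.
    full_moments = np.array([np.mean(full ** k) for k in orders])
    subset_moments = np.array([np.mean(subset ** k) for k in orders])
    return subset_moments / full_moments
```

      For example, moment_ratios(scores, scores[selected]) would give four ratios for a single cognitive score, one per moment order, where scores and selected stand for a NumPy array of outcomes and the indices of the chosen participants.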

      Conclusions

      We proposed machine learning algorithms for conducting AD-focused longitudinal neuroimaging studies on a budget. Our experiments show that an optimized selection can maintain power while reducing costs in longitudinal studies, including clinical trials.