Core D: Computational Biology

Summary

The support provided under this Core reflects a growing shift in studies of environmental exposure, from traditional epidemiological studies and simple experimental designs to high-dimensional biology, with its emphasis on ‘omic’ technologies and on complicated questions about the possible interaction of environmental exposures with high-dimensional measures of the genome, proteome, etc. These high-dimensional data sets are characterized by many (thousands of) measurements made on only a few independent units (e.g., people). The Core thus reflects a parallel evolution in the field of biostatistics towards methodologies that can both find patterns in high-dimensional data sets and provide proper statistical inference for those patterns. Besides offering consulting on traditional epidemiological experimental design and analysis questions, the Core focuses its efforts on providing the most relevant and rigorous statistical techniques to the Program. With new ‘omic’ technologies, biology has entered a new, more empirical phase in which the goals of the research are ambitious (e.g., discovery of regulatory gene networks affected by particular environmental toxicants) but the sample sizes are relatively small (biological replicates numbering in the tens). With these technologies has also come a proliferation of proposed methods for finding biologically meaningful patterns, typically with little theory to guide their relative worth. The goal of this Core is to provide the project researchers with the best techniques available, software to help implement them, a computational environment that can handle computer-intensive methods on large data sets and, most importantly, rigorous statistical inference for the parameters estimated by these procedures.
The developments related to the proliferation of high-dimensional biological/epidemiological data that are particularly relevant to this Core are 1) multiple testing, 2) machine learning and loss-based estimation, 3) grouping algorithms, 4) causal inference and 5) biological metadata and systems biology. In addition, the Core provides access to a computational environment suited to the computationally intensive methods developed for data mining and re-sampling based inference.

Core Update

The methodological work of Drs. Mark van der Laan and Alan Hubbard has focused on biomarkers of exposure via ‘omic technologies, and their recent work (Birkner, et al. 2005) has been to augment their original proposal for a new Gene Set Enrichment Analysis (GSEA) using machine learning algorithms to include the so-called SuperLearner (van der Laan, Polley and Hubbard, 2007). The SuperLearner creates a predictor that combines several different learners (e.g., neural nets, stepwise regression, POLYCLASS [Kooperberg, et al., 1999], etc.), which means that the space of models it considers is huge. After the fit is completed, the researchers construct a test statistic, such as a likelihood ratio statistic, and compare it to its permutation distribution, obtained by rerunning the entire algorithm on a large set of permutations of the original data. A recent set of simulations suggests this procedure is very powerful for finding sets of genes that have complicated relationships with some outcome (e.g., disease status). The theoretical strength of the test is that it is data-adaptive: as sample size increases, the algorithm automatically searches more aggressively for patterns while still providing proper type I error control. This procedure is widely applicable to the data and questions produced by other projects in the Program as a whole.
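The permutation logic described above can be illustrated with a small sketch. It is not the researchers' implementation: as a simplifying assumption it uses a two-learner library and picks the single learner with the lowest cross-validated risk (the SuperLearner instead combines learners), and it uses cross-validated risk reduction rather than a likelihood ratio as the test statistic. All function names and the simulated gene-set data are hypothetical.

```python
import random
import statistics

def fit_mean(xs, ys):
    """Null learner: ignore the gene set and predict the training mean."""
    mean = statistics.fmean(ys)
    return lambda x: mean

def fit_centroid(xs, ys):
    """Nearest-centroid classifier on the gene-set expression profile."""
    rows_by_class = {}
    for x, y in zip(xs, ys):
        rows_by_class.setdefault(y, []).append(x)
    centroids = {y: [statistics.fmean(col) for col in zip(*rows)]
                 for y, rows in rows_by_class.items()}
    def predict(x):
        return min(centroids,
                   key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))
    return predict

# Hypothetical learner library; a real analysis would use many more learners.
LIBRARY = [fit_mean, fit_centroid]

def cv_risk(fit, xs, ys, k=5):
    """k-fold cross-validated mean squared error of one learner."""
    n = len(ys)
    total = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum((ys[i] - predict(xs[i])) ** 2 for i in test)
    return total / n

def statistic(xs, ys):
    """Risk reduction of the best library learner over the null learner."""
    null_risk = cv_risk(fit_mean, xs, ys)
    best_risk = min(cv_risk(fit, xs, ys) for fit in LIBRARY)
    return null_risk - best_risk

def permutation_p_value(xs, ys, n_perm=200, seed=0):
    """Rerun the whole selection procedure on permuted outcomes."""
    rng = random.Random(seed)
    observed = statistic(xs, ys)
    exceed = 0
    for _ in range(n_perm):
        shuffled = list(ys)
        rng.shuffle(shuffled)  # breaks any gene-set/outcome association
        if statistic(xs, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)

def simulate(n=40, n_genes=5, shift=3.0, seed=1):
    """Simulated data: the first two genes are shifted in the 'diseased' group."""
    rng = random.Random(seed)
    ys = [i % 2 for i in range(n)]
    xs = [[rng.gauss(shift * y if j < 2 else 0.0, 1.0) for j in range(n_genes)]
          for y in ys]
    return xs, ys

if __name__ == "__main__":
    xs, ys = simulate()
    print("statistic:", statistic(xs, ys))
    print("permutation p-value:", permutation_p_value(xs, ys))
```

Because the learner selection is repeated inside every permutation, the null distribution accounts for the data-adaptive search, which is what allows an aggressive search to keep its type I error control.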

The researchers have continued to use variable importance procedures to augment their prediction-based algorithms, producing a list of biomarkers that reflects the importance of each variable (say, a gene expression) to the variability of exposure (say, arsenic). These parameters are an alternative to other proposals for measuring variable importance (the relative importance of a variable in predicting an outcome), and their estimates have the additional virtue that, under certain assumptions on the data-generating distribution, they have a causal interpretation. The researchers have recently augmented these procedures by using the relationship of one variable (one gene expression) to the association of another variable with exposure. The technique produces an interesting distance matrix that is a function not just of the relationships of the variables (gene expressions) to one another, but also of their relationships as biomarkers (predictors) of exposure. The result has been new potential networks of biomarkers, defined by their relationship to exposure.
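As a rough illustration of what a ranked biomarker list looks like, the sketch below scores each gene by the absolute Pearson correlation between its expression and exposure. This is an assumption made for illustration only: it is a simple unadjusted measure of the kind the researchers' parameters are an alternative to, and it carries no causal interpretation. The gene names and values are made up.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def rank_biomarkers(expression, exposure):
    """Return (gene, importance) pairs sorted from most to least important.

    expression: dict mapping a gene name to its per-subject expression values.
    exposure:   list of per-subject exposure measurements.
    """
    scores = {gene: abs(pearson(values, exposure))
              for gene, values in expression.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical data for six subjects.
exposure = [1, 2, 3, 4, 5, 6]
expression = {
    "gene_a": [2.1, 3.9, 6.2, 8.1, 9.8, 12.0],  # rises with exposure
    "gene_b": [5, 3, 6, 2, 7, 4],               # little relation to exposure
    "gene_c": [6.5, 4.8, 4.1, 3.3, 1.9, 1.2],   # falls with exposure
}

if __name__ == "__main__":
    for gene, importance in rank_biomarkers(expression, exposure):
        print(gene, round(importance, 3))
```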

Publications

  • Hubbard, Alan E. and Mark J. van der Laan. 2008. Population intervention models in causal inference. Biometrika. 95(1):35-47. doi:10.1093/biomet/asm097 (http://dx.doi.org/10.1093/biomet/asm097)
  • Johnson, David R., E.L. Brodie, Alan E. Hubbard, Gary L. Andersen, Steven H. Zinder, and Lisa Alvarez-Cohen. 2008. Temporal transcriptomic microarray analysis of “Dehalococcoides ethenogenes” strain 195 during the transition into stationary phase. Applied and Environmental Microbiology. 74(9):2864-72. (http://aem.asm.org/)
  • McHale, Cliona M., Luoping Zhang, Qing Lan, Guilan Li, Alan E. Hubbard, Matthew S. Forrest, Roel Vermeulen, Jinsong Chen, Min Shen, Stephen M. Rappaport, Songnian Yin, Martyn T. Smith, and Nathaniel Rothman. 2008. Changes in the peripheral blood transcriptome associated with occupational benzene exposure identified by cross-comparison on two microarray platforms. Genomics. 93(4):343-349. doi:10.1016/j.ygeno.2008.12.006 (http://dx.doi.org/10.1016/j.ygeno.2008.12.006)
  • Skibola, Christine F., Alexandra Nieters, P.M. Bracci, John D. Curry, Luz Agana, Danica R. Skibola, Alan E. Hubbard, Nikolaus Becker, Martyn T. Smith, and E.A. Holly. 2008. A functional TNFRSF5 gene variant is associated with risk of lymphoma. Blood. 111(8):4348-4354. doi:10.1182/blood-2007-09-112144 (http://dx.doi.org/10.1182/blood-2007-09-112144)