Core D: Computational Biology
The support provided under this Core reflects a growing shift in studies of environmental exposure, from traditional epidemiological studies and simple experimental designs toward high-dimensional biology, with its emphasis on ‘omic’ technologies and on complex questions about the possible interaction of environmental exposures with high-dimensional measures of the genome, proteome, etc. These high-dimensional data sets are characterized by many (thousands of) measurements made on only a few independent units (e.g., people). The Core therefore reflects a parallel evolution in the field of biostatistics towards methodologies that can both find patterns in high-dimensional data sets and provide proper statistical inference for these patterns. Besides offering consulting on traditional epidemiological experimental design and analysis questions, the Core focuses its efforts on providing the most relevant and rigorous statistical techniques to the Program.
With new ‘omic’ technologies, biology has entered a new, more empirical phase in which the goals of the research are ambitious (e.g., discovery of regulatory gene networks affected by particular environmental toxicants) but the sample sizes are relatively small (biological replicates numbering in the tens). With these technologies has also come a proliferation of proposed methods for finding biologically meaningful patterns, typically with little theory to guide the choice among them. The goal of this Core is to provide the project researchers with the best techniques available, software to help implement them, a computational environment that can handle computer-intensive methods on large data sets and, most importantly, rigorous statistical inference for the parameters estimated by these procedures.
The developments related to the proliferation of high-dimensional biological and epidemiological data that are particularly relevant to this Core include 1) multiple testing, 2) machine learning and loss-based estimation, 3) grouping algorithms, 4) causal inference and 5) biological metadata and systems biology. In addition, the Core provides access to a computational environment suited to the computationally intensive methods developed for data mining and resampling-based inference.
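To make the multiple testing problem concrete: when thousands of genes are each tested for association with an exposure, some small p-values arise by chance alone, so the p-values need adjusting before any are called significant. The sketch below uses the Benjamini-Hochberg false discovery rate procedure purely as a familiar illustration; the text above does not say which procedures the Core applies, and the function names are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Illustrative sketch only; not necessarily the procedure used by the Core.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


def discoveries(p_values, fdr=0.05):
    """Indices of tests whose adjusted p-value is at or below the FDR level."""
    adjusted = benjamini_hochberg(p_values)
    return [i for i, q in enumerate(adjusted) if q <= fdr]
```

For example, `discoveries([0.01, 0.04, 0.03, 0.005], fdr=0.03)` keeps only the first and last tests, because the two larger p-values are adjusted upward to about 0.04.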
Core D’s leaders are developing innovative methods in a variety of contexts to improve the use, analysis, and interpretation of data. They have made important contributions to the development of methods essential to the use of the great volumes of data produced by new methods in genomics and other “omics” areas. Newer methods are contributing to important new results in Superfund Research at Berkeley.
The overall goal of this Core is to develop better statistical methods suitable for the kinds of data emerging from the more sophisticated “omics” research methods now integral to the Superfund Research Program. These methods are already producing new insights not achievable with older approaches.
- Developing more powerful statistical methods that are adapted to the data set under review.
- Devising means to integrate higher order knowledge with datasets resulting from high throughput methods.
Accomplishments for the last year
Our collaborative work this year concentrated on bioinformatic methods that could incorporate higher order information into the analysis of the data that is produced from high throughput “omics” assays. This means developing methods to add consideration of information that we already have about biological pathways related to disease, exposure, biological processes, etc. into our analyses of high throughput omic data and exposures.
This has been facilitated by the newest member of Core D, Reuben Thomas, a post-doctoral researcher from NIEHS. He has helped develop statistical methods for examining the correspondence between gene expression changes associated with exposure and existing hypothesized pathways for the relevant biological processes.
Investigators continue to refine their “semiparametric” approach, that is, methods that adapt to the data actually produced rather than relying solely on an assumed model of how the data should behave. Investigators are using this approach to look for associations between sets of characteristics (such as gene expression, proteomics, and methylation) and exposures to environmental contaminants, estimating the independent association of each with exposure in the presence of other confounding variables.
With a relatively satisfactory set of methodologies now in place, along with corresponding code for quick implementation, the focus has shifted towards refining methods to superimpose these results onto the accumulating knowledge base regarding biological pathways. Specifically, in the context of our study of occupational benzene exposure and genomics, we used a method known as “structurally enhanced pathway enrichment analysis” (Thomas et al. 2009). This method incorporates information about the genome and biological pathways. It uses manually drawn pathway maps representing current knowledge of the molecular interaction and reaction networks involved in cellular processes such as metabolism and the cell cycle.
This procedure revealed highly significant (p < 0.001) impacts of relatively high occupational benzene exposure on several pathways. (These involve the expression of genes related to the toll-like receptor signaling pathway, oxidative phosphorylation, the B cell receptor signaling pathway, apoptosis, acute myeloid leukemia, and T cell receptor signaling.)
Perhaps even more important, different combinations of statistical methodology for selecting differentially expressed genes and statistical tests for highlighting “significant” pathways identified the same pathways.
We also examined dose-specific pathways and found that some are impacted only among very highly exposed workers. (These include, for instance, expression in the nucleosome assembly and ABC transporter pathways.)
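The general idea of pathway enrichment described above can be sketched with a plain over-representation test: given a set of differentially expressed genes, ask whether a pathway contains more of them than chance would predict. The example below is a deliberately simplified stand-in using the hypergeometric upper tail. It is not the structurally enhanced method of Thomas et al. (2009), which also uses the structure of the pathway maps, and all function names and gene identifiers are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe, n_pathway, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(overlap >= n_overlap).

    n_universe: number of genes measured
    n_pathway:  number of measured genes annotated to the pathway
    n_selected: number of genes called differentially expressed
    n_overlap:  number of selected genes that fall in the pathway
    """
    max_overlap = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


def pathway_p_values(universe, selected, pathways):
    """Map each pathway name to its over-representation p-value."""
    universe = set(universe)
    selected = set(selected) & universe
    results = {}
    for name, members in pathways.items():
        # Only genes that were actually measured can contribute.
        members = set(members) & universe
        overlap = len(members & selected)
        results[name] = enrichment_p_value(
            len(universe), len(members), len(selected), overlap
        )
    return results
```

In practice the resulting p-values, one per pathway, would themselves need a multiple testing adjustment before any pathway is declared significantly affected.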
What we plan to do next
The bottom line for the computational core is that we now have in place an analysis stream that both finds individual culprits via rigorous statistical estimation and inference and finds higher-level patterns via methodology designed to identify significantly affected biological pathways, including disease pathways.
- Rappaport SM, Kim S, Thomas R, Johnson BA, Bois FY, Kupper LL (2013) Low-dose metabolism of benzene in humans: science and obfuscation. Carcinogenesis. Jan;34(1):2-9. PMID: 23222815. (PMC Journal – In Process). [PDF]
- Zhang L, Lan Q, Ji Z, Li G, Shen M, Vermeulen R, Guo W, Hubbard A, McHale CM, Rappaport SM, Hayes RB, Linet B, Yin S, Fraumeni JF, Rothman N, Smith MT (2012). Leukemia-related chromosomal loss detected in hematopoietic progenitor cells of benzene-exposed workers. Leukemia. Dec;26(12):2494-2498. PMCID: PMC3472034. [PDF]
- Phuong J, Kim S, Thomas R, Zhang L (2012) Predicted toxicity of the biofuel candidate 2,5-dimethylfuran in environmental and biological systems. Environ Mol Mutagen. Jul;53(6):478-87. PMID: 22730190. (PMC Journal – In Process). [PDF]
- Thomas R, Phuong J, McHale CM, Zhang L (2012) Using bioinformatic approaches to identify pathways targeted by human leukemogens. Int J Environ Res Public Health. Jul;9(7):2479-503. PMCID: PMC3407916. [PDF]
- Godderis L, Thomas R, Hubbard AE, Tabish AM, Hoet P, Zhang L, Smith MT, Veulemans H, McHale CM (2012) Effect of Chemical Mutagens and Carcinogens on Gene Expression Profiles in Human TK6 Cells. PLoS One. 7(6):e39205. PMCID: PMC3377624. [PDF]
- Zhang L, Lan Q, Guo W, Hubbard AE, Li G, Rappaport SM, McHale CM, Shen M, Ji Z, Vermeulen R, Yin S, Rothman N, Smith MT (2011) Chromosome-Wide Aneuploidy Study (CWAS) in Workers Exposed to an Established Leukemogen, Benzene. Carcinogenesis. Apr;32(4):605-12. PMCID: PMC3066415. [PDF]
- Zhang L, Tang X, Rothman N, Vermeulen R, Ji Z, Shen M, Qiu C, Guo W, Liu S, Reiss B, Freeman LB, Ge Y, Hubbard AE, Hua M, Blair A, Galvan N, Ruan X, Alter BP, Xin KX, Li S, Moore LE, Kim S, Xie Y, Hayes RB, Azuma M, Hauptmann M, Xiong J, Stewart P, Li L, Rappaport SM, Huang H, Fraumeni JF Jr, Smith MT, Lan Q (2010) Occupational Exposure to Formaldehyde, Hematotoxicity, and Leukemia-Specific Chromosome Changes in Cultured Myeloid Progenitor Cells. Cancer Epidemiol Biomarkers Prev. Jan;19(1):80-88. PMID: 20056626. PMC Journal – In Process. [PDF]
- Zhang L, McHale CM, Rothman N, Li G, Ji Z, Vermeulen R, Hubbard AE, Ren X, Shen M, Rappaport SM, North M, Skibola CF, Yin S, Vulpe C, Chanock SJ, Smith MT, Lan Q (2010) Systems biology of human benzene exposure. Chem Biol Interact. Mar 19;184(1-2):86-93. PMID: 20026094. PMCID: PMC2846187. [PDF]
- McHale CM, Zhang L, Lan Q, Li G, Hubbard AE, Forrest MS, Vermeulen R, Chen J, Shen M, Rappaport SM, Yin S, Smith MT, Rothman N (2009) Changes in the peripheral blood transcriptome associated with occupational benzene exposure identified by cross-comparison on two microarray platforms. Genomics. Apr; 93(4):343-9. PMID: 19162166. PMCID: PMC2693268. [PDF]
- Johnson DR, Brodie EL, Hubbard AE, Andersen GL, Zinder SH, Alvarez-Cohen L (2008) Temporal transcriptomic microarray analysis of “Dehalococcoides ethenogenes” strain 195 during the transition into stationary phase. Appl Environ Microbiol. May; 74(9):2864-72. PMID: 18310438. PMCID: PMC2394897. [PDF]
- Hubbard AE, Laan MJ (2008) Population intervention models in causal inference. Biometrika. 95(1):35-47. PMID: 18629347. PMCID: PMC2464276. [PDF]
- McHale CM, Zhang L, Hubbard AE, Zhao X, Baccarelli A, Pesatori AC, Smith MT, Landi MT (2007) Microarray analysis of gene expression in peripheral blood mononuclear cells from dioxin-exposed human subjects. Toxicology. Jan 5;229(1-2):101-13. PMID: 17101203. [PDF]
- Escobar PA, Smith MT, Vasishta A, Hubbard AE, Zhang L (2007) Leukaemia-specific chromosome damage detected by comet with fluorescence in situ hybridization (comet-FISH). Mutagenesis. Sep; 22(5):321-7. PMID: 17575318. [PDF]
- Zhang L, Rothman N, Li G, Guo W, Yang W, Hubbard AE, Hayes RB, Yin S, Lu W, Smith MT (2007) Aberrations in chromosomes associated with lymphoma and therapy-related leukemia in benzene-exposed workers. Environ Mol Mutagen. Jul; 48(6):467-74. PMID: 17584886. [PDF]
- Chen J, van der Laan MJ, Smith MT, Hubbard AE (2007) A comparison of methods to control type I errors in microarray studies. Stat Appl Genet Mol Biol. 6:Article28. PMID: 18052911. [PDF]
- Birkner MD, Hubbard AE, van der Laan MJ, Skibola CF, Hegedus CM, Smith MT (2006) Issues of processing and multiple testing of SELDI-TOF MS proteomic data. Stat Appl Genet Mol Biol. 5:Article11. PMID: 16646865. [PDF]
- van der Laan MJ, Hubbard AE (2006) Quantile-function based null distribution in resampling based multiple testing. Stat Appl Genet Mol Biol. 5:Article14. PMID: 17049025. [PDF]
- Zhang L, Lan Q, Guo W, Li G, Yang W, Hubbard AE, Vermeulen R, Rappaport SM, Yin S, Rothman N, Smith MT (2005) Use of OctoChrome fluorescence in situ hybridization to detect specific aneuploidy among all 24 chromosomes in benzene-exposed workers. Chem Biol Interact. May 30; 153-154:117-22. PMID: 15935807. [PDF]
- Zhang L, Yang W, Hubbard AE, Smith MT (2005) Nonrandom aneuploidy of chromosomes 1, 5, 6, 7, 8, 9, 11, 12, and 21 induced by the benzene metabolites hydroquinone and benzenetriol. Environ Mol Mutagen. May; 45(4):388-96. PMID: 15662717. [PDF]
- Forrest MS, Lan Q, Hubbard AE, Zhang L, Vermeulen R, Zhao X, Li G, Wu YY, Shen M, Yin S, Chanock SJ, Rothman N, Smith MT (2005) Discovery of novel biomarkers by microarray analysis of peripheral blood mononuclear cell gene expression in benzene-exposed workers. Environ Health Perspect. Jun; 113(6):801-7. PMID: 15929907. [PDF]