Core D: Quantitative Biology: Biostatistics, Bioinformatics and Computation
The purpose of the Quantitative Biology Core is to provide investigators with consultative support in biostatistics/computational biology and bioinformatics, and to support web-based dissemination of bioinformatic solutions and database access. Most specific aims within the projects produce high-dimensional biological and exposure data, and often involve complicated questions about the possible interaction of environmental exposures with high-dimensional measures of the genome, proteome, and other readouts of high-throughput technologies. These high-dimensional data sets are characterized by many thousands of measurements made on each unit (e.g., person, yeast culture, soil community). Core D reflects an evolution in the field of biostatistics and bioinformatics towards developing methodologies that can both find patterns in high-dimensional data sets and provide proper statistical inference for these patterns. A consensus among our project researchers and the methodological experts has formed around a set of core principles regarding optimal estimation and inference in the context of complicated questions and high-dimensional data. Specifically, the consensus favors using (when possible) semi-parametric locally efficient estimation with robust inference, and developing optimal methods for integrating the statistical results with existing metadata to suggest relevant biological pathways and networks. Applying this approach will enable analyses to incorporate diverse data to query similar patterns/pathways in both related toxins and possibly related diseases, thus substantially leveraging data generated by the Program. To implement this methodology, the Quantitative Biology Core will provide access to a computational environment suited to the computationally intensive methods developed for data mining and re-sampling based inference.
Because of the scale of the data collection, as well as the desirability of converging on a general methodology, our Program requires a more centralized system that can archive data, share it with this Core, provide guidance on accessing metadata/annotation, and supply routines for leveraging such data to overlay our results on existing hypothesized regulatory networks. The Core will also develop tools to find and compare pathways, and will create and maintain a web-based system that allows efficient sharing of our methodological expertise with the project researchers and ultimately serves as a tool for outreach to the general scientific community.
This is relevant because, despite improvements in technology, the lack of statistical rigor among the proliferating methods used to discover disease etiology and develop effective interventions is producing large numbers of false positive claims. However, by using methods that optimally balance the complexity of models with the need to provide inferences consistent with the amount of data available, one can avoid the wild goose chases engendered by false discoveries.
Mark van der Laan, PhD
Professor of Biostatistics and Statistics
Biostatistics, School of Public Health
University of California, Berkeley
Alan Hubbard, PhD
Associate Professor, Biostatistics
Biostatistics, School of Public Health
University of California, Berkeley
As part of the goal of creating procedures that can find robust associations in the context of limited observations and many competing causes, we (Hubbard and trainees Nima Hejazi and Wilson Cai) proposed and implemented a novel data-adaptive target parameter that combines the power of data-mining algorithms with valid statistical inference (Hubbard et al., 2016; Hubbard and van der Laan, 2016). Based on the general data-adaptive parameter framework, we developed an R package, data.adapt.multi.test (Cai et al., 2017), that adaptively reduces the number of tests to those that appear “interesting” according to some mining procedure and then uses a cross-validation technique to derive trustworthy inference for this reduced set. We applied the method to a differential expression analysis of the effect of benzene on microRNA (miRNA). We succeeded in narrowing the high-dimensional dataset (~5000 probes, 85 subjects) down to a handful of key miRNA targets. Existing methods all failed to find any signal in these data.
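The idea of mining on one part of the data and testing on another can be illustrated with a minimal Python sketch. This is not the algorithm implemented in data.adapt.multi.test; it is a simplified illustration under stated assumptions: a binary exposure indicator, a difference in means as the effect measure, a normal approximation for p-values, and a Bonferroni correction over the reduced set. All function names (mean_diff_and_se, stratified_folds, data_adaptive_test) and parameters are hypothetical.

```python
import math
import numpy as np


def mean_diff_and_se(x, g):
    """Per-column difference in means (g == 1 minus g == 0) and its Welch standard error."""
    a, b = x[g == 1], x[g == 0]
    diff = a.mean(axis=0) - b.mean(axis=0)
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return diff, se


def stratified_folds(g, n_folds, rng):
    """Assign fold labels so that each fold holds a similar share of both groups."""
    folds = np.empty(len(g), dtype=int)
    for level in (0, 1):
        idx = rng.permutation(np.flatnonzero(g == level))
        folds[idx] = np.arange(len(idx)) % n_folds
    return folds


def data_adaptive_test(x, g, n_keep=10, n_folds=4, seed=0):
    """Illustrative cross-validated test of the top-ranked probes.

    x: (subjects, probes) array; g: 0/1 exposure array.
    Returns one estimate and one adjusted p-value per rank slot (1st, 2nd, ...).
    """
    rng = np.random.default_rng(seed)
    folds = stratified_folds(g, n_folds, rng)
    est = np.zeros((n_folds, n_keep))
    var = np.zeros((n_folds, n_keep))
    for v in range(n_folds):
        train, test = folds != v, folds == v
        # Mining step: rank probes using the training subjects only.
        d_tr, se_tr = mean_diff_and_se(x[train], g[train])
        top = np.argsort(-np.abs(d_tr / se_tr))[:n_keep]
        # Estimation step: re-estimate the selected probes on held-out subjects,
        # oriented by the sign seen in training so folds can be averaged.
        d_te, se_te = mean_diff_and_se(x[test][:, top], g[test])
        est[v] = np.sign(d_tr[top]) * d_te
        var[v] = se_te ** 2
    psi = est.mean(axis=0)                    # average over folds, per rank slot
    se = np.sqrt(var.sum(axis=0)) / n_folds   # folds use disjoint held-out subjects
    z = psi / se
    p = np.array([math.erfc(abs(s) / math.sqrt(2)) for s in z])  # two-sided normal p-value
    adj_p = np.minimum(1.0, p * n_keep)       # Bonferroni over n_keep tests, not all probes
    return psi, adj_p
```

Because the probes are chosen without looking at the held-out subjects, the multiplicity correction in this sketch is paid only over the n_keep rank slots rather than over every probe measured. Each fold needs at least two subjects per exposure group for the standard errors to be defined.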
Members of Core D have had a productive period, engaging in: 1) applied analysis of complex data to find the impacts of exposure to toxicants on biomarkers of disease, 2) methods development for obtaining more robust statistical results (fewer false positive and false negative associations between exposure and such biomarkers), and 3) software development to disseminate the methods both to other Program members and to the general scientific community. Our Program, like similar research efforts, faces analytical challenges due to very complex data, with thousands (sometimes millions) of variables measured on relatively few subjects. Standard methods, even ones still commonly used for such data, are known in theory to fail in such circumstances. We have taken two complementary approaches. First, we have derived theoretical results that give some indication of how bad things can get (how erroneous interpretations of the data can be when based on standard methods). Second, we have developed novel adaptive ways of narrowing down the list of variables, so that the number of variables does not overwhelm the data. Both of these tracks are complicated and computationally difficult to implement. Thus, we have used some of our funded time to develop software packages, so that expertise in computing is not required to implement these methods. This general approach creates models of reproducible research that try to gain as much information from the data as possible while still providing accurate measures of uncertainty, so that there is context in which to interpret the results.
Core Leader Alan Hubbard presented at the following two workshops:
- 2016 Workshop on Targeted Learning. 2-day workshop sponsored by Instituto Mexicano del Seguro Social, Mexico City, Mexico.
- 2016 Workshop on Targeted Learning. 3-day workshop for development economists sponsored by Inter-American Development Bank. Berkeley, CA.
Hubbard was also an invited speaker in 2016, presenting “Statistical Inference for Data Adaptive Target Parameters” at Statistical Causal Inference and its Applications to Genetics, Montreal, Canada.
Hubbard AE, Kherad-Pajouh S, van der Laan MJ (2016) Statistical Inference for Data Adaptive Target Parameters. Int J Biostat. May 1;12(1):3-19. doi:10.1515/ijb-2015-0013. PMID: 27227715. [PDF]
Perttula K, Edmands WM, Grigoryan H, Cai X, Iavarone AT, Gunter MJ, Naccarati A, Polidoro S, Hubbard A, Vineis P, Rappaport SM (2016) Evaluating Ultra-long Chain Fatty Acids as Biomarkers of Colorectal Cancer Risk. Cancer Epidemiol Biomarkers Prev. Aug;25(8):1216-23. doi:10.1158/1055-9965.EPI-16-0204. PMID: 27257090. [PDF]
Lan Q, Smith MT, Tang X, Guo W, Vermeulen R, Ji Z, Hu W, Hubbard AE, … et al. (2015) Chromosome-wide aneuploidy study of cultured circulating myeloid progenitor cells from workers occupationally exposed to formaldehyde. Carcinogenesis. Jan;36(1):160-7. PMCID: PMC4291049. [PDF]
Thomas R, Hubbard AE, McHale CM, Zhang L, Rappaport SM, Lan Q, et al. (2014) Characterization of changes in gene expression and biochemical pathways at low levels of benzene exposure. PLoS One. 9(5):e91828. PMCID: PMC4006721. [PDF]
McHale CM, Zhang L, Lan Q, Vermeulen R, Li G, Hubbard AE, et al. (2011) Global gene expression profiling of a population exposed to a range of benzene levels. Environ Health Perspect. 119(5):628-34. PMCID: PMC3074412. [PDF]
Thomas R, McHale CM, Lan Q, Hubbard AE, Zhang L, Vermeulen R, et al. (2013) Global gene expression response of a population exposed to benzene: a pilot study exploring the use of RNA-sequencing technology. Environ Mol Mutagen. 54(7):566-73. PMCID: PMC4353497. [PDF]
Rappaport SM, Kim S, Thomas R*, Johnson BA, Bois FY, Kupper LL (2013) Low-dose Metabolism of Benzene in Humans: Science and Obfuscation. Carcinogenesis. Jan;34(1):2-9. PMCID: PMC3584950.
Rappaport SM, Johnson BA, Bois FY, Kupper LL, Kim S, Thomas R (2013) Ignoring and adding errors do not improve the science. Carcinogenesis. Jul;34(7):1689-91. PMCID: PMC3697890. [PDF]
Zhang L, Lan Q, Ji Z, Li G, Shen M, Vermeulen R, Guo W, Hubbard A, McHale CM, Rappaport SM, Hayes RB, Linet B, Yin S, Fraumeni JF, Rothman N, Smith MT (2012) Leukemia-related chromosomal loss detected in hematopoietic progenitor cells of benzene-exposed workers. Leukemia. Dec;26(12):2494-2498. PMCID: PMC3472034. [PDF]
Phuong J, Kim S, Thomas R, Zhang L (2012) Predicted toxicity of the biofuel candidate 2,5-dimethylfuran in environmental and biological systems. Environ Mol Mutagen. Jul;53(6):478-87. PMID: 22730190. (PMC Journal – In Process). [PDF]
Thomas R, Phuong J, McHale CM, Zhang L (2012) Using bioinformatic approaches to identify pathways targeted by human leukemogens. Int J Environ Res Public Health. Jul;9(7):2479-503. PMCID: PMC3407916. [PDF]
Godderis L, Thomas R, Hubbard AE, Tabish AM, Hoet P, Zhang L, Smith MT, Veulemans H, McHale CM (2012) Effect of Chemical Mutagens and Carcinogens on Gene Expression Profiles in Human TK6 Cells. PLoS One. 7(6):e39205. PMCID: PMC3377624. [PDF]
Zhang L, Lan Q, Guo W, Hubbard AE, Li G, Rappaport SM, McHale CM, Shen M, Ji Z, Vermeulen R, Yin S, Rothman N, Smith MT (2011) Chromosome-Wide Aneuploidy Study (CWAS) in Workers Exposed to an Established Leukemogen, Benzene. Carcinogenesis. Apr;32(4):605-12. PMCID: PMC3066415. [PDF]
Zhang L, Tang X, Rothman N, Vermeulen R, Ji Z, Shen M, Qiu C, Guo W, Liu S, Reiss B, Freeman LB, Ge Y, Hubbard AE, Hua M, Blair A, Galvan N, Ruan X, Alter BP, Xin KX, Li S, Moore LE, Kim S, Xie Y, Hayes RB, Azuma M, Hauptmann M, Xiong J, Stewart P, Li L, Rappaport SM, Huang H, Fraumeni JF Jr, Smith MT, Lan Q (2010) Occupational Exposure to Formaldehyde, Hematotoxicity, and Leukemia-Specific Chromosome Changes in Cultured Myeloid Progenitor Cells. Cancer Epidemiol Biomarkers Prev. Jan;19(1):80-88. PMID: 20056626. PMC Journal – In Process. [PDF]
Zhang L, McHale CM, Rothman N, Li G, Ji Z, Vermeulen R, Hubbard AE, Ren X, Shen M, Rappaport SM, North M, Skibola CF, Yin S, Vulpe C, Chanock SJ, Smith MT, Lan Q (2010) Systems biology of human benzene exposure. Chem Biol Interact. Mar 19;184(1-2):86-93. PMID: 20026094. PMCID: PMC2846187. [PDF]
McHale CM, Zhang L, Lan Q, Li G, Hubbard AE, Forrest MS, Vermeulen R, Chen J, Shen M, Rappaport SM, Yin S, Smith MT, Rothman N (2009) Changes in the peripheral blood transcriptome associated with occupational benzene exposure identified by cross-comparison on two microarray platforms. Genomics. Apr; 93(4):343-9. PMID: 19162166. PMCID: PMC2693268. [PDF]
Johnson DR, Brodie EL, Hubbard AE, Andersen GL, Zinder SH, Alvarez-Cohen L (2008) Temporal transcriptomic microarray analysis of “Dehalococcoides ethenogenes” strain 195 during the transition into stationary phase. Appl Environ Microbiol. May; 74(9):2864-72. PMID: 18310438. PMCID: PMC2394897. [PDF]
Hubbard AE, van der Laan MJ (2008) Population intervention models in causal inference. Biometrika. 95(1):35-47. PMID: 18629347. PMCID: PMC2464276. [PDF]
McHale CM, Zhang L, Hubbard AE, Zhao X, Baccarelli A, Pesatori AC, Smith MT, Landi MT (2007) Microarray analysis of gene expression in peripheral blood mononuclear cells from dioxin-exposed human subjects. Toxicology. Jan 5;229(1-2):101-13. PMID: 17101203. [PDF]
Escobar PA, Smith MT, Vasishta A, Hubbard AE, Zhang L (2007) Leukaemia-specific chromosome damage detected by comet with fluorescence in situ hybridization (comet-FISH). Mutagenesis. Sep; 22(5):321-7. PMID: 17575318. [PDF]
Zhang L, Rothman N, Li G, Guo W, Yang W, Hubbard AE, Hayes RB, Yin S, Lu W, Smith MT (2007) Aberrations in chromosomes associated with lymphoma and therapy-related leukemia in benzene-exposed workers. Environ Mol Mutagen. Jul; 48(6):467-74. PMID: 17584886. [PDF]
Chen J, van der Laan MJ, Smith MT, Hubbard AE (2007) A comparison of methods to control type I errors in microarray studies. Stat Appl Genet Mol Biol. 6:Article28. PMID: 18052911. [PDF]
Birkner MD, Hubbard AE, van der Laan MJ, Skibola CF, Hegedus CM, Smith MT (2006) Issues of processing and multiple testing of SELDI-TOF MS proteomic data. Stat Appl Genet Mol Biol. 5:Article11. PMID: 16646865. [PDF]
van der Laan MJ, Hubbard AE (2006) Quantile-function based null distribution in resampling based multiple testing. Stat Appl Genet Mol Biol. 5:Article14. PMID: 17049025. [PDF]
Zhang L, Lan Q, Guo W, Li G, Yang W, Hubbard AE, Vermeulen R, Rappaport SM, Yin S, Rothman N, Smith MT (2005) Use of OctoChrome fluorescence in situ hybridization to detect specific aneuploidy among all 24 chromosomes in benzene-exposed workers. Chem Biol Interact. May 30;153-154:117-22. PMID: 15935807. [PDF]
Zhang L, Yang W, Hubbard AE, Smith MT (2005) Nonrandom aneuploidy of chromosomes 1, 5, 6, 7, 8, 9, 11, 12, and 21 induced by the benzene metabolites hydroquinone and benzenetriol. Environ Mol Mutagen. May; 45(4):388-96. PMID: 15662717. [PDF]
Forrest MS, Lan Q, Hubbard AE, Zhang L, Vermeulen R, Zhao X, Li G, Wu YY, Shen M, Yin S, Chanock SJ, Rothman N, Smith MT (2005) Discovery of novel biomarkers by microarray analysis of peripheral blood mononuclear cell gene expression in benzene-exposed workers. Environ Health Perspect. Jun; 113(6):801-7. PMID: 15929907. [PDF]