Core D Update Archive


We both support the computational needs of the projects and develop and refine methods for anticipated research.  We develop software tools and augment our computational infrastructure to handle increasingly complicated, large data sets using a powerful multiprocessor machine.  We implement methods that use parallel processing to reduce the time needed to process and analyze data, increasing our computing capacity.  We develop statistical and bioinformatics methods for addressing how environmental exposures affect certain biological pathways, and which of these pathways possibly lead to disease.  We develop and refine data mining tools that highlight “robust” relationships in our analyses, supporting trustworthy inference.  We applied a combination of targeted study design and hierarchical models to estimate the contribution of particular sources of variance (nuisance and otherwise) in miR expression.  We used toxicokinetic models, adjusting for gender, smoking, and country, to study gender and smoking effects in two-path versus one-path chemical kinetic models of benzene metabolism.  We developed a general theory for data-adaptive target parameters that uses aggressive methods for pattern finding while still deriving valid inference even when the parameter is not pre-specified.



Because we have the advantage of being at one of the epicenters of new research in causal inference and machine learning, we have been able to translate advances in semiparametric causal inference and machine learning to solve practical problems in high-dimensional biology among the projects and for the computational biology community.  We have been well suited to take theorems developed for optimal, data-adaptive estimation and guide their implementation, where we can make a real difference in promoting methodologies that can distinguish real signal from over-fitting noise.

In characterizing changes in gene expression and biochemical pathways at low levels of benzene exposure, we were able to demonstrate dose-dependent effects of benzene exposure on gene expression and biochemical pathways in 83 workers exposed across four airborne concentration ranges (from < 1 ppm to > 10 ppm) compared with 42 subjects with non-workplace ambient exposure levels.  We further characterized these dose-dependent effects with continuous benzene exposure in all 125 study subjects.  We estimated air benzene exposure levels in the 42 environmentally exposed subjects from their unmetabolized urinary benzene levels.  We used a non-parametric, data-adaptive model selection method to estimate the change with dose in the expression of each gene.  We describe novel semi-parametric, causal inference approaches to model pathway responses and used these to estimate the dose responses of the AML pathway and 4 other pathways of interest.  The response patterns of the majority of genes, as captured by mean estimates of the first and second principal components of the dose-response for the five pathways, and the profiles of 6 AML pathway response-representative genes (identified by clustering) exhibited similar supra-linear responses.  These data show that benzene alters disease-relevant pathways and genes in a dose-dependent manner, with effects apparent at doses as low as 100 ppb in air.  Further studies with extensive exposure assessment of subjects exposed in the low-dose range between 10 ppb and 1 ppm are needed to confirm these findings.
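The per-gene step above can be sketched in miniature.  The snippet below is a hypothetical illustration (not the Core's actual code): for each gene, competing dose-response shapes, including a supra-linear one, are scored by leave-one-out cross-validated error, and the best-scoring shape is kept.  The candidate shapes, simulated doses, and noise level are all made up for the example.

```python
import math, random

def fit_univariate(x, y):
    # ordinary least squares for y = a + b*x; returns (a, b)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def loo_cv_error(dose, expr, transform):
    # leave-one-out squared prediction error for a given dose transform
    err = 0.0
    for i in range(len(dose)):
        xs = [transform(d) for j, d in enumerate(dose) if j != i]
        ys = [e for j, e in enumerate(expr) if j != i]
        a, b = fit_univariate(xs, ys)
        err += (expr[i] - (a + b * transform(dose[i]))) ** 2
    return err / len(dose)

def select_dose_response(dose, expr):
    # candidate dose-response shapes: linear, supra-linear, log-like
    candidates = {
        "linear": lambda d: d,
        "supra-linear": lambda d: math.sqrt(d),
        "log": lambda d: math.log1p(d),
    }
    scores = {name: loo_cv_error(dose, expr, f) for name, f in candidates.items()}
    return min(scores, key=scores.get)

random.seed(0)
dose = [0.1, 0.3, 1.0, 3.0, 10.0] * 5            # 25 simulated workers
expr = [math.sqrt(d) + random.gauss(0, 0.05) for d in dose]  # supra-linear truth
best_shape = select_dose_response(dose, expr)
```

In this toy run the selector recovers the supra-linear shape because that candidate fits the simulated truth with residuals of pure noise.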

We applied bioinformatic approaches to identify pathways common to chemical leukemogens and to determine whether leukemogens could be distinguished from leukemia-negative carcinogens.  From all known and probable carcinogens classified by IARC and NTP, we identified 35 carcinogens that were associated with leukemia in human studies and 16 leukemia-negative carcinogens.  Using data on gene/protein targets available in the Comparative Toxicogenomics Database (CTD) for 29 of the leukemogens and 11 of the leukemia-negative carcinogens, we tested for enrichment of all 250 human biochemical pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database.  The top pathways targeted by the leukemogens included metabolism of xenobiotics by cytochrome P450, glutathione metabolism, neurotrophin signaling, apoptosis, MAPK signaling, Toll-like receptor signaling and various cancer pathways.  The 29 leukemogens formed 18 distinct clusters comprising 1 to 3 chemicals that did not correlate with structural similarity as determined by 2D Tanimoto coefficients in the PubChem database.  Unsupervised clustering and one-class support vector machines, based on the pathway data, were unable to distinguish the 29 leukemogens from the 11 leukemia-negative known and probable IARC carcinogens.  However, using more aggressive machine learning (two-class random forests), we were able to construct an accurate predictor of leukemogen vs non-leukemogen status.
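As a rough illustration of the two-class idea (not the actual analysis, which used a full random forest on CTD pathway annotations), the sketch below grows an ensemble of bootstrapped decision stumps over a toy binary chemical-by-pathway matrix and takes a majority vote; all data are invented.

```python
import random

def majority(labels):
    # majority vote, defaulting to 0 on ties or empty groups
    if not labels:
        return 0
    return 1 if 2 * sum(labels) > len(labels) else 0

def stump_fit(X, y, feat):
    # predicted label when the binary pathway feature is hit (1) vs not (0)
    on = [yi for xi, yi in zip(X, y) if xi[feat] == 1]
    off = [yi for xi, yi in zip(X, y) if xi[feat] == 0]
    return majority(on), majority(off)

def forest_fit(X, y, n_trees=51, seed=1):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]      # bootstrap resample
        feat = rng.randrange(len(X[0]))               # random feature per stump
        trees.append((feat, stump_fit([X[i] for i in idx],
                                      [y[i] for i in idx], feat)))
    return trees

def forest_predict(trees, x):
    votes = [on if x[feat] == 1 else off for feat, (on, off) in trees]
    return 1 if 2 * sum(votes) > len(votes) else 0

# toy pathway-target matrix: rows = chemicals, columns = pathway hits
X = [[1, 1, 0, 0]] * 5 + [[1, 0, 0, 1]] + [[0, 0, 1, 1]] * 6
y = [1] * 6 + [0] * 6        # 1 = leukemogen, 0 = non-leukemogen
trees = forest_fit(X, y)
```

Bootstrapping plus random feature choice is the random-forest recipe in miniature; a real analysis would use deeper trees and out-of-bag error estimates.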

Using mRNA-sequencing (RNA-seq) technology, we examined the effect of benzene exposure on gene expression in peripheral blood mononuclear cells obtained from 10 workers occupationally exposed to high levels of benzene (≥ 5 ppm) in the air and 10 matched unexposed control workers, chosen from a prior larger study (involving 125 subjects) in which gene expression was measured using microarrays.  RNA-seq technology is known to be more sensitive and to have a wider dynamic range for the quantification of gene expression.  Further, it has the ability to detect novel transcripts and additional modes of regulation of gene expression.  The main conclusions from analyses of the RNA-seq data derived from the 20 workers are as follows: 1) The Pearson correlation between the two technical replicates for the RNA-seq experiments is 0.98.  2) There is a correlation of around 0.6 between the signals for the RNA-seq experiments and corresponding ones from the microarray experiments.  3) 60% of the transcripts with detected reads from the RNA-seq experiments did not have corresponding probes on the microarrays.  4) 53% of the transcripts detected from the RNA-seq data were protein-coding, while 99% of transcripts with probes on the microarray used were protein-coding.  5) There is a significant overlap (P < 0.05) of transcripts declared differentially expressed due to benzene exposure using the two technologies.  6) About 20% of the transcripts declared differentially expressed using the RNA-seq data were non-coding transcripts.  7) Six transcripts were determined (FDR < 0.05) to be alternatively spliced as a result of benzene exposure.  Overall, we demonstrated that RNA-seq can complement the information obtained by microarray in the analysis of changes in gene expression from chemical exposures.
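The overlap significance claim in point 5) rests on a standard hypergeometric tail probability, which can be written directly.  The counts below are illustrative only, not the study's actual numbers.

```python
from math import comb

def overlap_pvalue(n_universe, n_a, n_b, n_overlap):
    # P(overlap >= n_overlap) when lists of sizes n_a and n_b are drawn
    # at random from n_universe measured transcripts (hypergeometric tail)
    total = comb(n_universe, n_b)
    return sum(comb(n_a, k) * comb(n_universe - n_a, n_b - k)
               for k in range(n_overlap, min(n_a, n_b) + 1)) / total

# illustrative numbers: 10,000 measured transcripts, 300 and 250 called
# differentially expressed on the two platforms, 30 in common
p_overlap = overlap_pvalue(10000, 300, 250, 30)
```

Under independence the expected overlap here would be about 7.5 transcripts, so an overlap of 30 is far out in the tail.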

Gender and smoking effects of two-path versus one-path chemical kinetic models of benzene metabolism.  We had previously observed that the rate of production of benzene metabolites per part-per-million of air benzene at levels below 1 ppm was higher than predicted.  These predictions were based on extrapolation of toxicokinetic models fit to benzene exposure data in the higher dose range.  To explain this increased rate of production of metabolites, we hypothesized the presence of a second enzymatic pathway (in addition to the CYP2E1-based one) that metabolizes benzene below 1 ppm.  We demonstrated evidence for this hypothesis not just for total benzene metabolites but also for individual metabolites such as phenol and muconic acid.  This was done by fitting data from non-smoking women in the China study to Michaelis-Menten-type kinetic models.  In this analysis, the exposure levels of the subjects at the lower range were extrapolated using a calibration model fit to urinary unmetabolized benzene levels and measured air benzene levels in subjects exposed in the higher dose range.  We have now done an analysis in which we evaluated our two-pathway hypothesis using muconic acid levels from male and female, smoking and non-smoking subjects.  Data from a set of Italian subjects who were exposed at the lower range were also used, and data from subjects whose exposures were previously predicted using the calibration model were excluded.  Gender (male vs female), smoking (non-smoking vs smoking) and study (China vs Italy) were included as fixed effects in the toxicokinetic model.  We demonstrated that the two-pathway model is a better fit than the one-pathway model for the production of the benzene metabolite muconic acid in non-smokers, and found reasonable evidence that the one-pathway model better fits the data for smokers.
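The model comparison described above can be sketched as follows, with entirely simulated data and a crude grid search standing in for the actual nonlinear fitting: a one-pathway Michaelis-Menten model is compared with a two-pathway (low- plus high-affinity) model by AIC, so the extra parameters of the richer model are penalized.

```python
import math, itertools, random

def mm(conc, vmax, km):
    # Michaelis-Menten velocity at concentration conc
    return vmax * conc / (km + conc)

def rss(model, data):
    return sum((v - model(c)) ** 2 for c, v in data)

def fit_one_path(data):
    # coarse grid search over (Vmax, Km); returns (RSS, n parameters)
    best = min(itertools.product([5, 10, 20, 40], [1, 2, 5, 10, 20]),
               key=lambda p: rss(lambda c: mm(c, *p), data))
    return rss(lambda c: mm(c, *best), data), 2

def fit_two_path(data):
    # adds a second, high-affinity (small Km) pathway
    grid = itertools.product([5, 10, 20, 40], [1, 2, 5, 10, 20],
                             [1, 2, 5], [0.05, 0.1, 0.3])
    def model(p):
        return lambda c: mm(c, p[0], p[1]) + mm(c, p[2], p[3])
    best = min(grid, key=lambda p: rss(model(p), data))
    return rss(model(best), data), 4

def aic(r, k, n):
    # Gaussian-error AIC up to a constant
    return n * math.log(r / n) + 2 * k

random.seed(2)
truth = lambda c: mm(c, 20, 5) + mm(c, 2, 0.1)   # second path dominates low dose
data = [(c, truth(c) * (1 + random.gauss(0, 0.02)))
        for c in [0.05, 0.1, 0.5, 1, 2, 5, 10, 20]]
r1, k1 = fit_one_path(data)
r2, k2 = fit_two_path(data)
```

With the second pathway present in the simulated truth, the two-pathway fit wins on AIC despite its extra parameters; on data generated from a single pathway the penalty would push the comparison the other way.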

Our collaboration with Imperial College, London seeks to measure a diverse set of molecular measurements in the blood of study subjects identified primarily by their cardiovascular disease risk and their ethnicities.  In this project, a broad framework for the design of experiments for the study was proposed.  This involved an appropriate choice of the number of technical replicates (designed to capture ‘nuisance’ experimental variation) at the different steps in making the final molecular measurements.  These concepts are currently being used for the design of miRNA experiments and the measurement of estrogen receptor activity level in the study subjects.
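The replicate-design reasoning rests on separating biological from nuisance technical variance.  A minimal sketch, assuming a balanced one-way design with technical replicates nested in subjects (the simulated variances are arbitrary):

```python
import random

def variance_components(groups):
    # groups: one list of technical replicates per subject (balanced design);
    # classical one-way ANOVA method-of-moments estimators
    k, n = len(groups), len(groups[0])
    grand = sum(sum(g) for g in groups) / (k * n)
    means = [sum(g) / n for g in groups]
    ms_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g) / (k * (n - 1))
    ms_between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    tech_var = ms_within                              # nuisance technical variance
    bio_var = max((ms_between - ms_within) / n, 0.0)  # between-subject variance
    return bio_var, tech_var

random.seed(3)
subjects = [random.gauss(0, 2) for _ in range(50)]        # biological sd = 2
groups = [[s + random.gauss(0, 0.5) for _ in range(3)]    # technical sd = 0.5
          for s in subjects]
bio_var, tech_var = variance_components(groups)
```

Once the two components are estimated, the number of replicates per step can be chosen so that the technical contribution to the variance of a subject-level mean is acceptably small.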

We developed automated statistical methodology that defines clusters of genes/proteins that are enriched not just with the targets of a single chemical but with those of possibly any subset of chemicals.  We have experimented with the yeast protein-protein interaction network from the STRING database.  The methodology defines a cluster of genes as significantly targeted by a chemical if there is at least one target gene with a “very small” probability of being as close (in the network) to each of the other target genes in the cluster by chance.  The distance between two genes is the number of interactions separating them in the protein-protein interaction network.  A group of gene clusters defined by a subset of chemicals is termed statistically significant by similar measures.
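A toy version of the network-distance ingredient can be written directly: gene-to-gene distance is the number of interactions along a shortest path (breadth-first search), and a permutation null asks whether a chemical's targets are unusually close.  The graph, target set, and permutation count below are hypothetical.

```python
from collections import deque
import random

def bfs_distance(graph, a, b):
    # number of interactions separating genes a and b (shortest path)
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for nb in graph.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return float("inf")

def mean_pairwise_distance(graph, genes):
    pairs = [(a, b) for i, a in enumerate(genes) for b in genes[i + 1:]]
    return sum(bfs_distance(graph, a, b) for a, b in pairs) / len(pairs)

def closeness_pvalue(graph, targets, n_perm=200, seed=4):
    # permutation null: are the targets closer together than random gene sets?
    rng = random.Random(seed)
    observed = mean_pairwise_distance(graph, targets)
    nodes = list(graph)
    hits = sum(mean_pairwise_distance(graph, rng.sample(nodes, len(targets)))
               <= observed for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)

# toy network: a 4-gene clique (the chemical's targets) attached to a 10-gene chain
graph = {f"g{i}": [f"g{j}" for j in range(4) if j != i] for i in range(4)}
chain = [f"h{i}" for i in range(10)]
graph["g0"].append("h0")
for i, h in enumerate(chain):
    graph[h] = ([chain[i - 1]] if i > 0 else ["g0"]) + \
               ([chain[i + 1]] if i < len(chain) - 1 else [])
p_close = closeness_pvalue(graph, ["g0", "g1", "g2", "g3"])
```

In the real methodology the null is over the STRING network and the test is per target gene; this sketch only shows the distance and permutation machinery.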

Generalized LIMMA for Semiparametric Causal Inference in Large-Scale Omic Studies
Typical biological hypotheses evaluated with high-throughput data involve determining the relative “causal” impact among many competing causes, and though (semiparametric) causal inference methods have not generally been applied in this setting, such targeted questions are a perfect match for this general methodology.  For instance, we have been working on generalizing empirical Bayes approaches to provide inference for adaptive estimation of semiparametric variable importance measures.  A recently submitted paper, written in collaboration with post-doctoral researcher Sara Kherad, presented an empirical Bayesian (LIMMA) methodology for deriving more robust inference for asymptotically linear estimators beyond simple means and regression coefficients.  This means that our group can develop variable importance measures based on machine-learning tools that aggressively fit the data, while still providing trustworthy joint inference across potentially thousands of comparisons.  This kind of work becomes particularly important as the goals move toward the even more ambitious cross-platform studies planned for next year.
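A simplified rendering of the LIMMA-style moderation idea (the submitted methodology extends this to general asymptotically linear estimators): per-feature variances are shrunk toward a pooled prior before forming t-like statistics, so a feature with an implausibly tiny sample variance no longer dominates.  The pooled prior and fixed prior degrees of freedom below are assumptions for the sketch; LIMMA estimates both from the data.

```python
import math

def moderated_stats(effects, variances, n, prior_df=4.0):
    # effects: per-gene estimates; variances: per-gene sample variances
    # based on n observations each
    s0 = sum(variances) / len(variances)   # crude pooled prior variance
    df = n - 1
    out = []
    for est, s2 in zip(effects, variances):
        # shrink the sample variance toward the prior, weighted by df
        s2_mod = (prior_df * s0 + df * s2) / (prior_df + df)
        out.append(est / math.sqrt(s2_mod / n))
    return out

# four genes with the same effect; the first has a suspiciously tiny variance
tstats = moderated_stats([0.5, 0.5, 0.5, 0.5], [0.001, 1.0, 1.2, 0.8], n=5)
```

Without moderation the first gene's t-statistic would exceed 35 purely because its sample variance happens to be near zero; after shrinkage it is on the same scale as its neighbors.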


We were able not only to find a biomarker of lymphoma in human serum, but also a potential metabolic driver of disease, which could lead to better risk diagnosis and treatment.  Our work has provided key insights into the mechanisms linking arsenic exposure and the development of cancer, and a more complete understanding of the toxicities associated with environmental chemicals.

Future Plans

Next steps in our Core will be to make the techniques developed more widely available and accessible via the creation of software packages and web-based implementations.  Environmental informatics promises an explosion in the dimension of data being collected at all scales (from the large-scale environment to individual molecules); our goal is to develop techniques that produce robust findings from such data.



Currently under review.



Currently under review.



Core D’s leaders are developing innovative methods in a variety of contexts to improve the use, analysis, and interpretation of data.  They have made important contributions to the development of methods essential to the use of the great volumes of data produced by new methods in genomics and other “omics” areas.   Newer methods are contributing to important new results in Superfund Research at Berkeley.

Overall goals
The overall goal of this Core is to develop better statistical methods suitable for the kinds of data emerging from the more sophisticated “omics” research methods now integral to the Superfund Research Program.  These methods are already producing new insights not achievable with older approaches.

Important advances

  1. Developing more powerful statistical methods that are adapted to the data set under review.
  2. Devising means to integrate higher order knowledge with datasets resulting from high throughput methods.

Accomplishments for the last year
Our collaborative work this year concentrated on bioinformatic methods that could incorporate higher order information into the analysis of data produced by high throughput “omics” assays.  This means developing methods to add the information we already have about biological pathways related to disease, exposure, biological processes, etc. into our analyses of high throughput omic data and exposures.

This has been facilitated by the newest member of Core D, Reuben Thomas, a post-doctoral researcher from NIEHS.  He has helped develop statistical methods for examining the correspondence between gene expression data, exposure, and existing hypothesized biological pathways.

Investigators continue to refine their “semiparametric” approach.  This means methods that adapt to the data that are actually produced, rather than being based solely on a model of how the investigators think the data should behave.  Investigators are using this to look for associations between sets of characteristics such as gene expression, proteomics, and methylation and their independent associations with exposures to environmental contaminants, in the presence of other confounding variables.

Having reached a relatively satisfactory set of methodologies, along with corresponding code for quick implementation, focus has shifted toward refining the methods to superimpose these results onto the accumulating knowledge base regarding biological pathways.  Specifically, in the context of our study of occupational benzene exposure and genomics, we used a method known as “structurally enhanced pathway enrichment analysis” (Thomas et al. 2009).  This incorporates information about the genome and biological pathways.  It uses manually drawn pathway maps representing current knowledge of the molecular interaction and reaction networks involved in cellular processes such as metabolism and the cell cycle.

This procedure revealed highly significant (p < 0.001) impacts of relatively high occupational benzene exposure on several pathways.  (These include genes related to the toll-like receptor signaling pathway, oxidative phosphorylation, the B cell receptor signaling pathway, apoptosis, acute myeloid leukemia, and T cell receptor signaling.)

Perhaps even more important, the combination of statistical methodology for selecting differentially expressed genes and the use of these statistical tests for highlighting “significant” pathways found the same pathways.
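A generic sketch of pathway-level testing in this spirit (not the structurally enhanced method itself): score a pathway by the mean absolute gene-level statistic of its members and calibrate the score against random gene sets of the same size.  All numbers are simulated.

```python
import random

def pathway_pvalue(gene_stats, member_idx, n_perm=1000, seed=5):
    # compare the pathway's mean |statistic| to that of random gene sets
    rng = random.Random(seed)
    score = sum(abs(gene_stats[i]) for i in member_idx) / len(member_idx)
    hits = 0
    for _ in range(n_perm):
        perm = rng.sample(range(len(gene_stats)), len(member_idx))
        if sum(abs(gene_stats[i]) for i in perm) / len(member_idx) >= score:
            hits += 1
    return (hits + 1) / (n_perm + 1)

random.seed(6)
gene_stats = [random.gauss(0, 1) for _ in range(2000)]  # null genes
pathway = list(range(40))
for i in pathway:              # shift the pathway's genes: truly affected
    gene_stats[i] += 2.0
p_path = pathway_pvalue(gene_stats, pathway)
```

The random-set null answers a "competitive" enrichment question; the structurally enhanced method additionally exploits the topology of the pathway map.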

We also examined dose-specific pathways and found that some are uniquely impacted only among very highly exposed workers.  (These include, for instance, the nucleosome assembly and ABC transporter pathways.)

What we plan to do next
The bottom line for the computational core is that we now have in place an analysis stream that not only finds individual culprits via rigorous statistical estimation and inference, but also finds higher-level patterns via methodology designed to identify significantly affected biological pathways, including disease pathways.



Our methodological work has continued to focus on biomarkers of exposure via ‘omic technologies, and we have introduced several new techniques this year to improve our search for these biomarkers.  In this process, we have kept the same basic estimation philosophy, as well as the methods used to derive inference: arbitrary assumptions lead to misleading results and erroneous findings.  For instance, last year we discussed our proposed Pathway Test, which uses a machine-learning approach to derive the overall association between sets of genes (for instance) and exposure.  We consider this a large improvement over the typical approach (so-called Gene Set Enrichment Analysis techniques), which typically relies on arbitrarily chosen simple models that are undoubtedly wrong, leading to inefficient tests, erroneous inference (e.g., false positives), and poor estimates of the association between things we care about (environmental exposure) and biomarkers for health outcomes (proteins).  Incredible progress has been made in the field of data-adaptive/machine learning techniques, which are nearly infinitely flexible, yet these have rarely been leveraged to estimate the associations that are often the goals of bioinformatic techniques.  Our larger goal is to change this, because we want to contribute to changing the (hard to debate) assertion of Ioannidis (2005): “Why Most Published Research Findings Are False”.  A large contributor to this problem is theoretically unjustifiable modeling assumptions and non-robust estimates of uncertainty.

In this light, we have been applying new semi-parametric methods for defining biomarkers of exposure.  The methods rely on estimating models of predicted exposure versus (for instance) gene expression.  In one case, we have occupational benzene exposure as well as 20,000+ gene expression measures on around 144 samples from workers in China.  In addition, we also have other demographic and behavioral factors.  Our goals are to 1) find the best possible predictor of benzene exposure from the list of expression values and 2) find which of these gene expressions contribute the most to accurate prediction of benzene exposure.  As opposed to assuming an arbitrary model, our team has combined machine learning algorithms (specifically, the SuperLearner; van der Laan, Polley and Hubbard, 2008) and the robust estimation methods of targeted maximum likelihood (van der Laan and Rubin, 2007) to accomplish both goals.  The SuperLearner is a combination of many other machine learning algorithms and is thus extremely flexible, though capable of producing a very simple model if the data suggest that is the best fit.  Our results suggest that exposure can be characterized with surprising accuracy by expression biomarkers, and the particular biomarkers involved suggest important biological consequences of benzene exposure.  In addition, these results do not arise when conventional procedures (e.g., simple linear regression) are used.  These types of estimation methods will have important consequences for how we examine the data from the various projects in the Program.
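The SuperLearner's key ingredient, cross-validated selection over a library of candidate learners, can be sketched in a few lines.  This is a discrete version (the full method learns a weighted combination of the library) with a deliberately tiny library and simulated data; the learner names are invented for the example.

```python
import random

def cv_risk(learner, X, y, folds=5):
    # V-fold cross-validated mean squared error of a learner
    n, risk = len(X), 0.0
    for v in range(folds):
        train = [i for i in range(n) if i % folds != v]
        predict = learner([X[i] for i in train], [y[i] for i in train])
        risk += sum((y[i] - predict(X[i])) ** 2 for i in range(v, n, folds))
    return risk / n

def mean_learner(X, y):
    # the simplest candidate: predict the overall mean
    m = sum(y) / len(y)
    return lambda x: m

def ols_learner(feature):
    # univariate least squares on one expression feature
    def fit(X, y):
        xs = [row[feature] for row in X]
        mx, my = sum(xs) / len(xs), sum(y) / len(y)
        sxx = sum((v - mx) ** 2 for v in xs) or 1.0
        b = sum((v - mx) * (w - my) for v, w in zip(xs, y)) / sxx
        return lambda x: my + b * (x[feature] - mx)
    return fit

def discrete_super_learner(X, y, library):
    # the cross-validation selector: keep the library member with lowest risk
    risks = {name: cv_risk(fit, X, y) for name, fit in library.items()}
    best = min(risks, key=risks.get)
    return best, library[best](X, y)

random.seed(7)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
y = [2.0 * x0 + random.gauss(0, 0.3) for x0, _ in X]   # exposure driven by gene 0
library = {"mean": mean_learner, "gene0": ols_learner(0), "gene1": ols_learner(1)}
best_name, best_fit = discrete_super_learner(X, y, library)
```

The oracle-style guarantees of the Super Learner come from exactly this honest cross-validated comparison, which is what lets the library contain very aggressive learners without over-fitting the final choice.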



The methodological work of Drs. Mark van der Laan and Alan Hubbard has focused on biomarkers of exposure via ‘omic technologies, and their recent work (Birkner, et al. 2005) has augmented their original proposal for a new Gene Set Enrichment Analysis (GSEA) using machine learning algorithms to include the so-called SuperLearner (van der Laan, Polley and Hubbard, 2007).  The SuperLearner creates a predictor that combines several different learners (e.g., neural nets, stepwise regression, POLYCLASS, Kooperberg, et al., 1999), which means that the space of models it considers is huge.  After the fit is completed, the researchers construct a test statistic, such as a likelihood ratio statistic.  This is compared to the permutation distribution (the entire algorithm is run on a large set of permutations of the original data).  A recent set of simulations suggests this procedure is very powerful for finding sets of genes that have complicated relationships with some outcome (e.g., disease status).  The theoretical power of the test is that it is data-adaptive, so that as sample size increases, the algorithm automatically searches more aggressively for patterns while still providing proper type I error control.  This procedure has wide applicability to many of the data and questions produced by other projects in the Program as a whole.

The researchers have continued their use of variable importance procedures to augment their prediction-based algorithms, producing a list of biomarkers that reflects the importance of each variable (say, gene expression) to the variability of exposure (say, arsenic).  These parameters are an alternative to other proposals for measuring variable importance (the relative importance of a variable in predicting an outcome), and the parameter estimates have the additional virtue that, if one makes certain assumptions on the data-generating distribution, they have a causal interpretation.  They have recently augmented these procedures by using the relationship of one variable (one gene expression) to the association of another variable with exposure; the technique produces an interesting distance matrix that is a function not just of the relationships of the variables (gene expressions) to one another, but of their relationships as biomarkers (predictors) of exposure.  The result has been to produce new potential networks of biomarkers in their relationship to exposure.



For Dr. Mark van der Laan and Dr. Alan Hubbard, interaction with colleagues performing research within the overall center has motivated the creation of a new multiple testing procedure, the quantile-function based multiple testing procedure (van der Laan and Hubbard, 2006), that remedies some of the efficiency issues in the original re-sampling procedures (Dudoit, et al., 2004).  A recent methods paper, motivated by the project's applied work on biomarkers of benzene exposure (authored by a post-doctoral student with the Computational Biology Core), compared different methods for controlling family-wise error rate (FWER) and found that the researchers' recently proposed technique performed well relative to other techniques, such as the permutation test, in a variety of data-generating distribution settings (Chen, et al., 2007).
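The resampling flavor of FWER control referenced here can be illustrated with a max-T style permutation procedure (a generic sketch, not the quantile-function method itself); permuting subjects jointly across genes preserves the between-gene correlation that marginal methods ignore.  The data below are simulated.

```python
import random

def maxt_adjusted_pvalues(data, labels, n_perm=500, seed=10):
    # data: one row per subject, one column per gene; labels: 1 = exposed
    n_genes = len(data[0])
    def stats(labs):
        diffs = []
        for g in range(n_genes):
            a = [row[g] for row, l in zip(data, labs) if l == 1]
            b = [row[g] for row, l in zip(data, labs) if l == 0]
            diffs.append(abs(sum(a) / len(a) - sum(b) / len(b)))
        return diffs
    observed = stats(labels)
    rng = random.Random(seed)
    exceed = [0] * n_genes
    for _ in range(n_perm):
        labs = labels[:]
        rng.shuffle(labs)          # permute subjects jointly across all genes
        m = max(stats(labs))       # max statistic controls FWER over genes
        for g in range(n_genes):
            if m >= observed[g]:
                exceed[g] += 1
    return [(e + 1) / (n_perm + 1) for e in exceed]

random.seed(12)
labels = [1] * 10 + [0] * 10
data = [[(3.0 if l == 1 and g == 0 else 0.0) + random.gauss(0, 1)
         for g in range(10)] for l in labels]    # only gene 0 truly shifted
adj_p = maxt_adjusted_pvalues(data, labels)
```

Only the truly shifted gene survives the multiplicity adjustment in this toy run; the quantile-function procedure aims to recover power that the max-T null can leave on the table.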

Van der Laan and Hubbard have also concentrated on methods that test, in a data-adaptive fashion, not just single genes or SNPs at a time, but the simultaneous association of groups of variables, such as all genes belonging to a particular gene ontology, a so-called Gene Set Enrichment (GSE) test; many GSE procedures have been proposed.  The investigators have applied a new version of this test to the SNP/non-Hodgkin’s lymphoma data collected by Dr. Skibola.  The project leaders have a general algorithm that applies machine learning algorithms (such as POLYCLASS, Kooperberg, et al., 1999) to find the best predictor of phenotypic outcome from a set of genetic variables.  Then, they use the final fit to construct a test statistic relevant for testing the global independence of the set (such as a likelihood ratio statistic).  The power of the test is that it is data-adaptive, so that as sample size increases, the algorithm automatically searches more aggressively for patterns while still providing proper type I error control.  This procedure has wide applicability to many of the data and questions produced by other projects within the larger center grant.

Van der Laan (2005) proposed a parameter that measures the independent impact of one variable (e.g., exposure) on an outcome (specific gene expression) in the presence of other factors (other genes), and he has begun applying these methods for biomarker discovery in our microarray experiments.  These parameters are an alternative to other proposals for measuring variable importance (the relative importance of a variable in predicting an outcome), and the parameter estimates have the additional virtue that, if one makes certain assumptions on the data-generating distribution, they have a causal interpretation.  The use of these methods alongside traditional techniques based on marginal associations (gene by gene) of exposure and expression paints a more complete picture of how exposure to potential toxins is related to changes in the expression of related genes.  Dr. Hubbard presented this work at the recent TIES conference (Raleigh, NC), and a GSR within the Computational Biology Core (Kristin Porter) presented a poster of this work at the SBRP conference (Durham, NC) in December.



In the Core’s study of the effects of occupational benzene exposure on gene expression, Core researchers have encountered a statistical issue involving highlighting genes that might change expression due to toxic exposures.  Existing methods that attempt to control the number of false findings either rely on marginal p-values and are typically very conservative, or rely on permutation methods that are only valid under strong assumptions.  These data have motivated the creation of a new multiple testing procedure, the quantile-function based multiple testing procedure (van der Laan and Hubbard, 2006), that remedies some of the efficiency issues in the original re-sampling procedures (Dudoit, et al., 2004).  Core researchers have begun to make it the standard for controlling the rate of false positives, and it has been used with great success.  This is an important development because it can be applied to almost any data where many variables are collected on the same sample/subject and many comparisons are made based on these variables.

The Computational Biology Core has developed a new technique for analyzing single nucleotide polymorphisms (SNPs) and cancer.  The Core researchers have been using an E-M algorithm approach (Zhao, et al., 2002; 2003) to estimate haplotype frequencies in the population and the probability of each individual carrying a particular haplotype.  However, the researchers have refined how their models are parameterized based on these imputed haplotypes so that they can distinguish between the associations of haplotype and genotype with cancer.
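The E-M idea can be illustrated on the smallest interesting case, two biallelic SNPs, where only the double heterozygote is phase-ambiguous between the haplotype pairs (AB, ab) and (Ab, aB); the cited approach handles general multi-locus genotypes, so this is a toy with made-up counts.

```python
def em_haplotypes(known, n_double_het, n_iter=100):
    # known: phase-certain haplotype counts for two biallelic SNPs
    n_hap = sum(known.values()) + 2 * n_double_het
    # start from lightly smoothed frequencies of the unambiguous haplotypes
    p = {h: (known[h] + 1.0) / (sum(known.values()) + 4) for h in known}
    for _ in range(n_iter):
        # E-step: expected split of the ambiguous double heterozygotes
        w = p["AB"] * p["ab"] / (p["AB"] * p["ab"] + p["Ab"] * p["aB"])
        expected = dict(known)
        for h, share in (("AB", w), ("ab", w), ("Ab", 1 - w), ("aB", 1 - w)):
            expected[h] += share * n_double_het
        # M-step: re-estimate haplotype frequencies from expected counts
        p = {h: c / n_hap for h, c in expected.items()}
    return p

# invented counts: AB/ab strongly favored among phase-certain chromosomes
freqs = em_haplotypes({"AB": 50, "ab": 50, "Ab": 5, "aB": 5}, n_double_het=20)
```

Because AB and ab dominate the unambiguous counts, the algorithm assigns nearly all of the ambiguous genotypes to the (AB, ab) resolution at convergence.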

The Core has refined the techniques used for analyzing mass spectrometry proteomic data by using a combination of quantile regression for baseline drift correction in conjunction with smoothing (using a rectangular kernel) and bandwidth selection via cross-validation, with the purpose of processing the signal versus mass to remove noise and drift.
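The processing chain can be sketched with simplifications: a running low quantile stands in for the quantile regression baseline estimate, a rectangular kernel smooths the corrected signal, and the kernel half-width is chosen by cross-validation on held-out points.  The simulated spectrum (one peak, linear drift, Gaussian noise) is invented for the example.

```python
import math, random

def running_quantile(y, window, q=0.1):
    # crude baseline: a running low quantile of the local window
    half = window // 2
    out = []
    for i in range(len(y)):
        seg = sorted(y[max(0, i - half): i + half + 1])
        out.append(seg[int(q * (len(seg) - 1))])
    return out

def rect_smooth(y, half_width):
    # rectangular-kernel (moving average) smoother
    out = []
    for i in range(len(y)):
        seg = y[max(0, i - half_width): i + half_width + 1]
        out.append(sum(seg) / len(seg))
    return out

def cv_bandwidth(y, candidates):
    # pick the half-width whose neighbor average best predicts held-out points
    def cv_err(h):
        return sum((y[i] - sum(y[i - h: i] + y[i + 1: i + h + 1]) / (2 * h)) ** 2
                   for i in range(h, len(y) - h, 5))
    return min(candidates, key=cv_err)

random.seed(11)
raw = [5 * math.exp(-(m - 200) ** 2 / 50.0)   # one true peak at index 200
       + 0.01 * m                             # linear baseline drift
       + random.gauss(0, 0.3)                 # detector noise
       for m in range(400)]
baseline = running_quantile(raw, 80)
corrected = [r - b for r, b in zip(raw, baseline)]
h = cv_bandwidth(corrected, [1, 2, 4, 8])
smoothed = rect_smooth(corrected, h)
peak_index = max(range(len(smoothed)), key=smoothed.__getitem__)
```

After baseline removal and cross-validated smoothing, the peak location is recovered despite the drift and noise; the production pipeline replaces each of these simplified pieces with its more careful counterpart.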