Core Update Archive
Core D: Computational Biology
Our methodological work has continued to focus on biomarkers of exposure via ‘omic technologies, and we have introduced several new techniques this year to improve our search for these biomarkers. Throughout, we have kept the same basic estimation philosophy and the same approach to deriving inference: arbitrary assumptions lead to misleading results and erroneous findings. For instance, last year we discussed our proposed Pathway Test, which uses a machine-learning approach to estimate the overall association between sets of genes (for instance) and exposure. We consider this a large improvement over the typical approach (so-called Gene Set Enrichment Analysis techniques), which relies on arbitrarily chosen simple models that are undoubtedly wrong and which leads to inefficient tests, erroneous inference (e.g., false positives), and poor estimates of the association between the quantities we care about (environmental exposures) and biomarkers for health outcomes (proteins). Incredible progress has been made in the field of data-adaptive/machine-learning techniques, which are nearly infinitely flexible, yet these have rarely been leveraged to estimate the associations that are often the goals of bioinformatic analyses. Our larger goal is to change this, because we want to contribute to overturning the (hard to debate) assertion of Ioannidis (2005): “Why Most Published Research Findings Are False”. A large contributor to this problem is theoretically unjustifiable modeling assumptions and non-robust estimates of uncertainty.
In this light, we have been applying new semi-parametric methods for defining biomarkers of exposure. The methods rely on estimating models that predict exposure from (for instance) gene expression. In one case, we have occupational benzene exposure as well as 20,000+ gene expression measures on around 144 samples from workers in China, along with other demographic and behavioral factors. Our goals are to 1) find the best possible predictor of benzene exposure from the list of expression values and 2) find which of these gene expressions contribute the most to accurate prediction of benzene exposure. Rather than assuming an arbitrary parametric model, our team has combined machine learning algorithms (specifically, the SuperLearner; van der Laan, Polley and Hubbard, 2008) with the robust estimation methods of targeted maximum likelihood (van der Laan and Rubin, 2007) to accomplish both goals. The SuperLearner is a combination of many other machine learning algorithms and is thus extremely flexible, yet it is capable of producing a very simple model if the data suggest that is the best fit. Our results suggest that exposure can be characterized with surprising accuracy by expression biomarkers, and the particular biomarkers involved suggest important biological consequences of benzene exposure. In addition, these results do not arise when conventional techniques (e.g., simple linear regression) are used. These types of estimation methods will have important consequences for how we examine the data from the various projects in the Program.
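The Core's analyses use the SuperLearner R package; the minimal Python sketch below, on simulated data, only illustrates the underlying idea of combining a library of candidate learners by cross-validation to predict exposure from expression. The learners, dimensions, and data here are hypothetical stand-ins, not the Core's actual specification.

```python
# Illustrative sketch only: a scikit-learn stacking ensemble stands in for the
# super learner idea of combining candidate learners via cross-validation.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 144, 500                      # ~144 workers, many expression probes (simulated)
X = rng.normal(size=(n, p))          # hypothetical expression matrix
beta = np.zeros(p); beta[:5] = 1.0   # a few truly predictive probes
y = X @ beta + rng.normal(size=n)    # hypothetical benzene exposure measure

candidates = [                       # small library of candidate learners
    ("lasso", LassoCV(cv=5)),
    ("forest", RandomForestRegressor(n_estimators=200, random_state=0)),
]
super_learner = StackingRegressor(estimators=candidates,
                                  final_estimator=LinearRegression(), cv=5)

# Honest (cross-validated) assessment of how well expression predicts exposure
r2 = cross_val_score(super_learner, X, y, cv=5, scoring="r2")
print("cross-validated R^2:", r2.mean().round(2))
```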
The methodological work of Drs. Mark van der Laan and Alan Hubbard has focused on biomarkers of exposure via ‘omic technologies, and their recent work (Birkner, et al. 2005) has been to augment their original proposal for a new Gene Set Enrichment Analysis (GSEA) using machine learning algorithms to include the so-called SuperLearner (van der Laan, Polley and Hubbard, 2007). The SuperLearner creates a predictor that combines several different learners (e.g., neural nets, stepwise regression, POLYCLASS, Kooperberg, et al., 1999), which means that the space of models it considers is huge. After the fit is completed, the researchers construct a test statistic, such as a likelihood ratio statistic, and compare it to the permutation distribution (the entire algorithm is re-run on a large set of permutations of the original data). A recent set of simulations suggests this procedure is very powerful for finding sets of genes that have complicated relationships with an outcome (e.g., disease status). The strength of the test is that it is data-adaptive, so that, as sample size increases, the algorithm will automatically search more aggressively for patterns while still providing proper type I error control. This procedure has wide applicability to many of the data and questions produced by other projects in the Program as a whole.
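To make the permutation logic concrete, a rough sketch follows; the Core's test uses the SuperLearner and a likelihood-ratio-type statistic, whereas here a single flexible learner and a cross-validated R^2 stand in as the test statistic, and the function names are illustrative.

```python
# Sketch of a data-adaptive gene-set test: fit a flexible learner to the set,
# summarize its predictive fit, and compare to the same statistic recomputed on
# permuted outcomes (so the whole fitting procedure is inside the permutation).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def set_statistic(X_set, y):
    """Test statistic: cross-validated predictive fit of the gene set for y."""
    learner = GradientBoostingRegressor(random_state=0)
    return cross_val_score(learner, X_set, y, cv=5, scoring="r2").mean()

def gene_set_permutation_test(X_set, y, n_perm=500, seed=1):
    """Permutation p-value for the global association of the set with y."""
    rng = np.random.default_rng(seed)
    observed = set_statistic(X_set, y)
    null = np.array([set_statistic(X_set, rng.permutation(y))
                     for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)
```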
The researchers have continued their use of variable importance procedures to augment their prediction-based algorithms, producing a list of biomarkers that reflects the importance of each variable (say, a gene expression) in explaining variability in exposure (say, arsenic). These parameters are an alternative to other proposals for measuring variable importance (the relative importance of a variable in predicting an outcome), and the estimates have the additional virtue that, if one makes certain assumptions on the data-generating distribution, they have a causal interpretation. They have recently augmented these procedures by relating one variable (one gene expression) to the association of another variable with exposure – the technique produces an interesting distance matrix that is a function not just of the relationships of variables (gene expressions) to one another, but also of their relationships as biomarkers (predictors) of exposure. The result has been to produce new potential networks of biomarkers in their relationship to exposure.
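The exact construction of this distance matrix is not reproduced here; the following hypothetical sketch only illustrates one simple way to combine between-gene relationships with each gene's importance as a predictor of exposure (the random-forest importance measure and the equal weighting are assumptions for illustration, not the Core's method).

```python
# Hypothetical illustration of a distance that mixes (a) how genes relate to
# one another with (b) how similarly important they are for predicting exposure.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def biomarker_distance(X, exposure):
    """X: (n, p) expression matrix; exposure: (n,) measured exposure."""
    corr_dist = 1.0 - np.abs(np.corrcoef(X, rowvar=False))        # p x p, gene-gene part
    imp = RandomForestRegressor(n_estimators=200, random_state=0) \
              .fit(X, exposure).feature_importances_              # importance for exposure
    imp_dist = np.abs(imp[:, None] - imp[None, :])                # p x p, importance part
    if imp_dist.max() > 0:
        imp_dist = imp_dist / imp_dist.max()
    return 0.5 * corr_dist + 0.5 * imp_dist                       # arbitrary equal weights
```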
For Dr. Mark van der Laan and Dr. Alan Hubbard, interaction with colleagues performing research within the overall center has motivated the creation of a new multiple testing procedure, the quantile-function based multiple testing procedure (van der Laan and Hubbard, 2006), which remedies some of the efficiency issues in the original re-sampling procedures (Dudoit, et al., 2004). A recent methods paper motivated by the project's applied work on biomarkers of benzene exposure (authored by a post-doctoral researcher with the Computational Biology Core) compared different methods for controlling the family-wise error rate (FWER) and found that the researchers' recently proposed technique performed well relative to other techniques, such as the permutation test, in a variety of data-generating distributions (Chen, et al., 2007).
Van der Laan and Hubbard have also concentrated on methods that test, in a data-adaptive fashion, not just single genes or SNPs at a time, but the simultaneous association of groups of variables, such as all genes belonging to a particular gene ontology – a so-called Gene Set Enrichment (GSE) test, of which many versions have been proposed. The investigators have applied a new version of this test to the SNP/non-Hodgkin’s lymphoma data collected by Dr. Skibola. The project leaders have a general algorithm that applies machine learning algorithms (such as POLYCLASS, Kooperberg, et al., 1999) to find the best predictor of the phenotypic outcome from a set of genetic variables. They then use the final fit to construct a test statistic relevant for testing the global independence of the set (such as a likelihood ratio statistic), as sketched below. The strength of the test is that it is data-adaptive, so that, as sample size increases, the algorithm will automatically search more aggressively for patterns while still providing proper type I error control. This procedure has wide applicability to many of the data and questions produced by other projects within the larger center grant.
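The sketch below shows the likelihood-ratio flavor of the test for a binary phenotype such as case/control status; POLYCLASS-style adaptive fitting is replaced by an ordinary (regularized) logistic regression for simplicity, and all names and settings are illustrative rather than the investigators' code.

```python
# Sketch: fit a model to the SNP set, form a likelihood-ratio-type statistic
# against the intercept-only model, and calibrate it by permuting the phenotype.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def likelihood_ratio_stat(snps, disease):
    """2 * (log-likelihood of fitted model - log-likelihood of null model)."""
    fit = LogisticRegression(max_iter=1000).fit(snps, disease)
    ll_fit = -log_loss(disease, fit.predict_proba(snps)[:, 1], normalize=False)
    p0 = np.full(len(disease), disease.mean())        # intercept-only model
    ll_null = -log_loss(disease, p0, normalize=False)
    return 2.0 * (ll_fit - ll_null)

def gse_permutation_test(snps, disease, n_perm=1000, seed=2):
    """Permutation p-value for global independence of the SNP set and disease."""
    rng = np.random.default_rng(seed)
    obs = likelihood_ratio_stat(snps, disease)
    null = np.array([likelihood_ratio_stat(snps, rng.permutation(disease))
                     for _ in range(n_perm)])
    return (1 + np.sum(null >= obs)) / (1 + n_perm)
```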
Van der Laan (2005) proposed a parameter that measures the independent impact of one variable (e.g., exposure) on an outcome (a specific gene expression) in the presence of other factors (other genes), and he has begun applying these methods for biomarker discovery in our microarray experiments. These parameters are an alternative to other proposals for measuring variable importance (the relative importance of a variable in predicting an outcome), and the estimates have the additional virtue that, if one makes certain assumptions on the data-generating distribution, they have a causal interpretation. The use of these methods alongside traditional techniques based on marginal (gene-by-gene) associations of exposure and expression paints a more complete picture of how exposure to potential toxins is related to changes in the expression of related genes. Dr. Hubbard presented this work at the recent TIES conference (Raleigh, NC) and a GSR within the Computational Biology Core (Kristin Porter) presented a poster of this work at the SBRP conference (Durham, NC) in December.
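As a simplified illustration of the kind of adjusted parameter involved, consider a binary exposure A, other covariates W, and outcome Y, with the target psi = E_W[E(Y | A=1, W)] - E_W[E(Y | A=0, W)]. The sketch below estimates it by plain substitution (G-computation) with a machine-learning fit; the Core's actual estimators are the more sophisticated targeted maximum likelihood versions, so this is a stand-in, not the proposed method.

```python
# Minimal G-computation sketch of an adjusted variable-importance parameter:
# psi = E_W[ E(Y | A=1, W) ] - E_W[ E(Y | A=0, W) ]
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def adjusted_importance(A, W, Y):
    """A: binary exposure (n,); W: other covariates (n, k); Y: outcome (n,)."""
    model = RandomForestRegressor(n_estimators=300, random_state=0)
    model.fit(np.column_stack([A, W]), Y)                           # fit E(Y | A, W)
    pred1 = model.predict(np.column_stack([np.ones_like(A), W]))    # set A = 1
    pred0 = model.predict(np.column_stack([np.zeros_like(A), W]))   # set A = 0
    return (pred1 - pred0).mean()                                   # average over observed W
```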
In the Core’s study of the effects of occupational benzene exposure on gene expression, Core researchers have encountered a statistical issue that arises when highlighting genes whose expression might change due to toxic exposures. Existing methods that attempt to control the number of false findings either rely on marginal p-values and are typically very conservative, or rely on permutation methods that are only valid under strong assumptions. These data have motivated the creation of a new multiple testing procedure, the quantile-function based multiple testing procedure (van der Laan and Hubbard, 2006), which remedies some of the efficiency issues in the original re-sampling procedures (Dudoit, et al., 2004). Core researchers have begun to make it the standard for controlling the rate of false positives, and it has been used with great success. This is an important development because it can be applied to almost any data set in which many variables are collected on the same sample/subject and many comparisons are made based on these variables.
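For orientation, the sketch below illustrates only the general resampling idea behind such joint multiple testing procedures (a bootstrap, null-centered, max-statistic construction in the spirit of the re-sampling procedures of Dudoit et al.); the quantile-function based refinement of van der Laan and Hubbard (2006) itself is not reproduced, and the statistic and settings are illustrative.

```python
# Sketch of resampling-based FWER control: bootstrap the vector of per-gene
# statistics, center each to mimic the null, and use the maximum over genes
# to obtain a single FWER-controlling cutoff.
import numpy as np

def resampling_fwer(X, y, n_boot=2000, alpha=0.05, seed=3):
    """X: (n, p) expression matrix; y: (n,) exposure.  Two-sided t-like
    statistics for the marginal correlation of each gene with exposure."""
    rng = np.random.default_rng(seed)
    n, p = X.shape

    def t_stats(Xb, yb):
        xc = Xb - Xb.mean(0); yc = yb - yb.mean()
        r = (xc * yc[:, None]).sum(0) / np.sqrt((xc**2).sum(0) * (yc**2).sum())
        return r * np.sqrt((len(yb) - 2) / (1 - r**2))

    observed = t_stats(X, y)
    boot = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, n)                  # nonparametric bootstrap
        boot[b] = t_stats(X[idx], y[idx])
    null = boot - boot.mean(0)                       # center to mimic the null
    cutoff = np.quantile(np.abs(null).max(axis=1), 1 - alpha)
    rejected = np.where(np.abs(observed) > cutoff)[0]
    return cutoff, rejected
```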
The Computational Biology Core has developed a new technique for analyzing single nucleotide polymorphisms (SNPs) and cancer. The Core researchers have been using an E-M algorithm approach (Zhao, et al., 2002; 2003) to estimate haplotype frequencies in the population and the probability of each individual carrying a particular haplotype. The researchers have, however, refined how their models are parameterized based on these imputed haplotypes so that they can distinguish between the associations of haplotype and genotype with cancer.
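The core of the E-M step can be illustrated with the simplest two-SNP case, where only double heterozygotes have ambiguous haplotype phase; the minimal sketch below assumes Hardy-Weinberg equilibrium and unrelated individuals and is not the Zhao et al. implementation.

```python
# Illustrative two-SNP haplotype-frequency EM under Hardy-Weinberg equilibrium.
# Genotypes are coded 0/1/2 copies of the "1" allele at each SNP; the only
# phase-ambiguous case is the double heterozygote (1, 1).
import numpy as np
from itertools import product

HAPLOTYPES = list(product((0, 1), repeat=2))        # (0,0), (0,1), (1,0), (1,1)

def compatible_pairs(g1, g2):
    """All unordered haplotype pairs consistent with genotype (g1, g2)."""
    pairs = []
    for i, h in enumerate(HAPLOTYPES):
        for j in range(i, len(HAPLOTYPES)):
            k = HAPLOTYPES[j]
            if h[0] + k[0] == g1 and h[1] + k[1] == g2:
                pairs.append((i, j))
    return pairs

def em_haplotype_freqs(genotypes, n_iter=100):
    """genotypes: iterable of (g1, g2) per individual; returns the 4 frequencies."""
    freqs = np.full(4, 0.25)
    for _ in range(n_iter):
        counts = np.zeros(4)
        for g1, g2 in genotypes:
            pairs = compatible_pairs(g1, g2)
            w = np.array([(1 if i == j else 2) * freqs[i] * freqs[j]
                          for i, j in pairs])
            w /= w.sum()                            # E-step: pair probabilities
            for (i, j), wij in zip(pairs, w):       # expected haplotype counts
                counts[i] += wij
                counts[j] += wij
        freqs = counts / counts.sum()               # M-step: re-estimate frequencies
    return dict(zip(HAPLOTYPES, freqs))
```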
The Core has refined techniques for analyzing mass spectrometry proteomic data by using quantile regression for baseline drift correction in conjunction with smoothing (using a rectangular kernel), with the bandwidth selected by cross-validation; the purpose is to process the signal-versus-mass spectrum to remove noise and drift.
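A rough sketch of these processing steps appears below. A running low quantile stands in for the Core's quantile-regression baseline, the rectangular kernel is a simple moving average, and the leave-every-other-point-out cross-validation, window sizes, and percentile are illustrative choices rather than the Core's settings.

```python
# Sketch: baseline removal, rectangular-kernel smoothing, and cross-validated
# bandwidth selection for a spectrum of intensity versus mass.
import numpy as np

def running_quantile(y, window, q=0.10):
    """Baseline estimate: qth quantile of intensities in a sliding window."""
    half = window // 2
    return np.array([np.quantile(y[max(0, i - half):i + half + 1], q)
                     for i in range(len(y))])

def rect_smooth(y, bandwidth):
    """Rectangular-kernel smoother (simple moving average)."""
    kernel = np.ones(bandwidth) / bandwidth
    return np.convolve(y, kernel, mode="same")

def cv_bandwidth(y, candidates=(3, 5, 9, 15, 25)):
    """Pick the bandwidth minimizing error on held-out (odd-index) points."""
    train, test = y[::2], y[1::2]
    errors = []
    for bw in candidates:
        fit = rect_smooth(train, bw)
        errors.append(np.mean((fit[:len(test)] - test) ** 2))
    return candidates[int(np.argmin(errors))]

def process_spectrum(intensity, baseline_window=101):
    corrected = intensity - running_quantile(intensity, baseline_window)
    return rect_smooth(corrected, cv_bandwidth(corrected))
```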