Provable Recovery of Locally Important Signed Features and Interactions from Random Forest

K. Vuk, N. Ihlo, M. Behr

Feature and Interaction Importance (FII) methods are essential in supervised learning for assessing the relevance of input variables and their interactions in complex prediction models. In many domains, such as personalized medicine, local interpretations for individual predictions are often required, rather than global scores summarizing overall feature importance. Random Forests (RFs) are widely used in …

Decorrelated feature importance from local sample weighting

B. Fröhlich, A. Durst, M. Behr

Feature importance (FI) statistics provide a prominent and valuable method of insight into the decision process of machine learning (ML) models, but their effectiveness has well-known limitations when correlation is present among the features in the training data. In this case, the FI often tends to be distributed among all features which are in correlation …

Epistasis regulates genetic control of cardiac hypertrophy

Qianru Wang, Tiffany M. Tang, Nathan Youlton, Chad S. Weldy, Ana M. Kenney, Omer Ronen, J. Weston Hughes, Elizabeth T. Chin, Shirley C. Sutton, Abhineet Agarwal, Xiao Li, Merle Behr, Karl Kumbier, Christine S. Moravec, W. H. Wilson Tang, Kenneth B. Margulies, Thomas P. Cappola, Atul J. Butte, Rima Arnaout, James B. Brown, James R. Priest, Victoria N. Parikh, Bin Yu, Euan A. Ashley

The combinatorial effect of genetic variants is often assumed to be additive. Although genetic variation can clearly interact non-additively, methods to uncover epistatic relationships remain in their infancy. We develop low-signal signed iterative random forests to elucidate the complex genetic architecture of cardiac hypertrophy. We derive deep learning-based estimates of left ventricular mass from the …

Learning epistatic polygenic phenotypes with Boolean interactions

Merle Behr, Karl Kumbier, Aldo Cordova-Palomera, Matthew Aguirre, Euan Ashley, Atul J. Butte, Rima Arnaout, Ben Brown, James Priest, Bin Yu

Detecting epistatic drivers of human phenotypes remains a challenge. Traditional approaches use regression to sequentially test multiplicative interaction terms involving single pairs of genetic variants. For higher-order interactions and genome-wide large-scale data, this strategy is computationally intractable. Moreover, multiplicative terms used in regression modeling may not capture the form of biological interactions. Building on the …

Opportunities and Challenges for AI-Based Analysis of RWD in Pharmaceutical R&D: A Practical Perspective

M. Behr, R. Burghaus, C. Diedrich, J. Lippert

Real world data (RWD) has become an important tool in pharmaceutical research and development. Generated every time patients interact with the healthcare system when diagnoses are developed and medical interventions are selected, RWD are massive and in many regards typical big data. The use of artificial intelligence (AI) to analyze RWD seems an obvious choice. …

Minimax estimation in linear models with unknown design over finite alphabets

Behr, M., Munk, A.

We provide a minimax optimal estimation procedure for F and W in matrix valued linear models Y = F W + Z where the parameter matrix W and the design matrix F are unknown but the latter takes values in a known finite set. The proposed finite alphabet linear model is justified in a variety …

Provable Boolean Interaction Recovery from Tree Ensemble obtained via Random Forests

Merle Behr, Yu Wang, Xiao Li, and Bin Yu

Random Forests (RF) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative Random Forests (iRF) use a tree ensemble from iteratively modified RF to obtain predictive and stable non-linear high-order Boolean interactions of features. They have shown great promise for high-order biological interaction discovery that is …

Multiple Haplotype Reconstruction from Allele Frequency Data

Marta Pelizzola, Merle Behr, Housen Li, Axel Munk, Andreas Futschik

We propose a new method that is able to accurately infer major haplotypes and their frequencies just from multiple samples of allele frequency data. Our approach seems to be the first that is able to estimate more than one haplotype given such data. Even the accuracy of experimentally obtained allele frequencies can be improved by …

Multiscale quantile segmentation

Vanegas, L.J., Behr, M., Munk, A.

We introduce a new methodology for analyzing serial data by quantile regression assuming that the underlying quantile function consists of constant segments. The procedure does not rely on any distributional assumption besides serial independence. It is based on a multiscale statistic, which allows to control the (finite sample) probability for selecting the correct number …

Testing for dependence on tree structures

Behr, M., Ansari, M. A., Munk, A., Holmes, C.

Tree structures, showing hierarchical relationships and the latent structures between samples, are ubiquitous in genomic and biomedical sciences. A common question in many studies is whether there is an association between a response variable measured on each sample and the latent group structure represented by some given tree. Currently, this is addressed on an ad …