Random Forests (RF) are at the cutting edge of supervised machine learning in
terms of prediction performance, especially in genomics. Iterative Random
Forests (iRF) use a tree ensemble from iteratively modified RF to obtain
predictive and stable nonlinear high-order Boolean interactions of features.
They have shown great promise for high-order biological interaction discovery
that is central to advancing functional genomics and precision medicine.
However, theoretical studies into how tree-based methods discover high-order
feature interactions are missing. In this paper, to enable such theoretical
studies, we first introduce a novel discontinuous nonlinear regression model,
called the Locally Spiky Sparse (LSS) model, which is inspired by the thresholding
behavior in many biological processes. Specifically, the LSS model assumes that the
regression function is a linear combination of piece-wise constant Boolean
interaction terms. We define a quantity called depth-weighted prevalence (DWP)
for a set of signed features S and a given RF tree ensemble. We prove that,
with high probability under the LSS model, DWP of S attains a universal upper
bound that does not involve any model coefficients, if and only if S
corresponds to a union of Boolean interactions in the LSS model. As a
consequence, we show that RF yields consistent interaction discovery under the
LSS model. Simulation results show that DWP can recover the interactions under
the LSS model even when some assumptions such as the uniformity assumption are
violated.
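
To make the LSS regression function concrete, the following is a minimal sketch of how data from such a model might be simulated. It assumes each Boolean interaction is a product of threshold indicators on signed features, with features drawn uniformly on [0, 1]. The function names, thresholds, coefficients, and Gaussian noise level are all hypothetical choices made for illustration, not values taken from the paper.

```python
import random


def lss_regression_function(x, interactions, coefs, intercept=0.0):
    """Evaluate a hypothetical LSS-style regression function at a point x.

    interactions: list of Boolean interactions; each interaction is a list of
        (feature_index, threshold, sign) triples. sign = +1 requires
        x[k] > threshold, sign = -1 requires x[k] <= threshold.
    coefs: one linear coefficient per interaction.
    """
    total = intercept
    for beta, terms in zip(coefs, interactions):
        # An interaction is "on" only if every signed threshold condition holds,
        # which makes each term piecewise constant in x.
        active = all(
            (x[k] > t) if sign > 0 else (x[k] <= t)
            for k, t, sign in terms
        )
        total += beta * (1.0 if active else 0.0)
    return total


def simulate_lss(n, p, interactions, coefs, noise_sd=0.1, seed=0):
    """Draw n samples with p uniform features and noisy LSS-style responses."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(p)] for _ in range(n)]
    y = [
        lss_regression_function(x, interactions, coefs) + rng.gauss(0.0, noise_sd)
        for x in X
    ]
    return X, y


# Illustrative setup: two disjoint interactions among p = 10 features.
example_interactions = [
    [(0, 0.5, +1), (1, 0.5, +1)],  # x0 > 0.5 AND x1 > 0.5
    [(2, 0.3, -1)],                # x2 <= 0.3
]
example_coefs = [2.0, -1.0]
X, y = simulate_lss(200, 10, example_interactions, example_coefs)
```

Data generated this way could be passed to any RF implementation to examine which signed feature sets co-occur along decision paths. The sketch does not compute DWP, whose precise definition is not stated in this abstract.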
To appear in the
Proceedings of the National Academy of Sciences (PNAS).
arXiv:2102.11800