This vignette illustrates the
basic usage of the knockoff
package with Model-X knockoffs.
In this scenario we assume that the distribution of the predictors is
known (or that it can be well approximated), but we make no assumptions
on the conditional distribution of the response. For simplicity, we will
use synthetic data constructed from a linear model such that the
response only depends on a small fraction of the variables.
# Problem parameters
n = 200 # number of observations
p = 200 # number of variables
k = 60 # number of variables with nonzero coefficients
amplitude = 4.5 # signal amplitude (for noise level = 1)
# Generate the variables from a multivariate normal distribution
mu = rep(0,p)
rho = 0.25
Sigma = toeplitz(rho^(0:(p-1)))
X = matrix(rnorm(n*p),n) %*% chol(Sigma)
# Generate the response from a linear model
nonzero = sample(p, k)
beta = amplitude * (1:p %in% nonzero) / sqrt(n)
y.sample = function(X) X %*% beta + rnorm(n)
y = y.sample(X)
To begin, we call knockoff.filter
with all the default
settings.
We can display the results with
## Call:
## knockoff.filter(X = X, y = y)
##
## Selected variables:
## [1] 8 11 15 27 45 50 51 60 66 68 71 81 87 88 99 101 111 112 114
## [20] 134 135 146 150 152 153 158 160 161 162 164 166 172 177 179 181
The default value for the target false discovery rate is 0.1. In this experiment the false discovery proportion is
## [1] 0.02857143
By default, the knockoff filter creates model-X second-order Gaussian knockoffs. This construction estimates from the data the mean μ and the covariance Σ of the rows of X, instead of using the true parameters (μ, Σ) from which the variables were sampled.
The knockoff package also includes other knockoff construction
methods, all of which have names prefixed
withknockoff.create
. In the next snippet, we generate
knockoffs using the true model parameters.
gaussian_knockoffs = function(X) create.gaussian(X, mu, Sigma)
result = knockoff.filter(X, y, knockoffs=gaussian_knockoffs)
print(result)
## Call:
## knockoff.filter(X = X, y = y, knockoffs = gaussian_knockoffs)
##
## Selected variables:
## [1] 11 15 27 50 60 66 82 83 94 99 114 134 135 141 146 150 153 160 161
## [20] 162 165 166 172 179 181
Now the false discovery proportion is
## [1] 0
By default, the knockoff filter uses a test statistic based on the
lasso. Specifically, it uses the statistic
stat.glmnet_coefdiff
, which computes Wj = |Zj| − |Z̃j|
where Zj
and Z̃j are
the lasso coefficient estimates for the jth variable and its knockoff,
respectively. The value of the regularization parameter λ is selected by cross-validation
and computed with glmnet
.
Several other built-in statistics are available, all of which have
names prefixed with stat
. For example, we can use
statistics based on random forests. In addition to choosing different
statistics, we can also vary the target FDR level (e.g. we now increase
it to 0.2).
result = knockoff.filter(X, y, knockoffs = gaussian_knockoffs, statistic = stat.random_forest, fdr=0.2)
print(result)
## Call:
## knockoff.filter(X = X, y = y, knockoffs = gaussian_knockoffs,
## statistic = stat.random_forest, fdr = 0.2)
##
## Selected variables:
## [1] 68 87 114 152 158 161
## [1] 0
In addition to using the predefined test statistics, it is also possible to use your own custom test statistics. To illustrate this functionality, we implement one of the simplest test statistics from the original knockoff filter paper, namely Wj = |Xj⊤ ⋅ y| − |X̃j⊤ ⋅ y|.
my_knockoff_stat = function(X, X_k, y) {
abs(t(X) %*% y) - abs(t(X_k) %*% y)
}
result = knockoff.filter(X, y, knockoffs = gaussian_knockoffs, statistic = my_knockoff_stat)
print(result)
## Call:
## knockoff.filter(X = X, y = y, knockoffs = gaussian_knockoffs,
## statistic = my_knockoff_stat)
##
## Selected variables:
## [1] 11 12 50 54 60 66 68 83 85 87 88 94 99 114 134 146 158 160 161
## [20] 166 177 179 181
## [1] 0.08695652
As another example, we show how to customize the grid of λ’s used to compute the lasso path in the default test statistic.
my_lasso_stat = function(...) stat.glmnet_coefdiff(..., nlambda=100)
result = knockoff.filter(X, y, knockoffs = gaussian_knockoffs, statistic = my_lasso_stat)
print(result)
## Call:
## knockoff.filter(X = X, y = y, knockoffs = gaussian_knockoffs,
## statistic = my_lasso_stat)
##
## Selected variables:
## [1] 15 27 60 66 67 71 83 99 111 114 134 135 141 153 158 160 161 165 172
## [20] 177 179 186 193
## [1] 0
The nlambda
parameter is passed by
stat.glmnet_coefdiff
to the glmnet
, which is
used to compute the lasso path. For more information about this and
other parameters, see the documentation for
stat.glmnet_coefdiff
or glmnet.glmnet
.
In addition to using the predefined procedures for construction knockoff variables, it is also possible to create your own knockoffs. To illustrate this functionality, we implement a simple wrapper for the construction of second-order Model-X knockoffs.
create_knockoffs = function(X) {
create.second_order(X, shrink=T)
}
result = knockoff.filter(X, y, knockoffs=create_knockoffs)
print(result)
## Call:
## knockoff.filter(X = X, y = y, knockoffs = create_knockoffs)
##
## Selected variables:
## [1] 11 12 27 41 50 51 56 60 66 68 71 80 83 87 88 94 99 101 111
## [20] 112 114 132 134 135 140 141 142 146 150 152 153 156 158 160 161 164 165 166
## [39] 167 168 172 177 179 181 182 187 193
## [1] 0.06382979
The knockoff package supports two main styles of knockoff variables, semidefinite programming (SDP) knockoffs (the default) and equi-correlated knockoffs. Though more computationally expensive, the SDP knockoffs are statistically superior by having higher power. To create SDP knockoffs, this package relies on the R library [Rdsdp][Rdsdp] to efficiently solve the semidefinite program. In high-dimensional settings, this program becomes computationally intractable. A solution is then offered by approximate SDP (ASDP) knockoffs, which address this issue by solving a simpler relaxed problem based on a block-diagonal approximation of the covariance matrix. By default, the knockoff filter uses SDP knockoffs if p < 500 and ASDP knockoffs otherwise.
In this example we generate second-order Gaussian knockoffs using the estimated model parameters and the full SDP construction. Then, we run the knockoff filter as usual.
gaussian_knockoffs = function(X) create.second_order(X, method='sdp', shrink=T)
result = knockoff.filter(X, y, knockoffs = gaussian_knockoffs)
print(result)
## Call:
## knockoff.filter(X = X, y = y, knockoffs = gaussian_knockoffs)
##
## Selected variables:
## [1] 8 11 12 15 27 48 50 51 60 66 68 80 83 87 88 94 99 101 111
## [20] 112 114 132 134 135 140 141 146 150 152 153 158 160 161 164 165 166 167 168
## [39] 172 177 179 181 193
## [1] 0.04651163
If you want to look inside the knockoff filter, see the advanced vignette. If you want to see how to use knockoffs for Fixed-X variables, see the Fixed-X vignette.