--- title: "Advanced Usage of the Knockoff Filter for R" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Advanced Usage of the Knockoff Filter for R} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- The function `knockoff.filter` is a wrapper around several simpler functions that 1. Construct knockoff variables (various functions with prefix `create`) 2. Compute the test statistic $W$ (various functions with prefix `stat`) 3. Compute the threshold for variable selection (`knockoff.threshold`) These functions may be called directly if desired. The purpose of this vignette is to illustrate the flexibility of this package with some examples. ```{r, results='hide', message=FALSE, warning=FALSE} set.seed(1234) library(knockoff) ``` Creating an artificial problem ------------------------------ Let us begin by creating some synthetic data. For simplicity, we will use synthetic data constructed from a generalized linear model such that the response only depends on a small fraction of the variables. ```{r} # Problem parameters n = 200 # number of observations p = 200 # number of variables k = 60 # number of variables with nonzero coefficients amplitude = 7.5 # signal amplitude (for noise level = 1) # Generate the variables from a multivariate normal distribution mu = rep(0,p) rho = 0.10 Sigma = toeplitz(rho^(0:(p-1))) X = matrix(rnorm(n*p),n) %*% chol(Sigma) # Generate the response from a logistic model and encode it as a factor. nonzero = sample(p, k) beta = amplitude * (1:p %in% nonzero) / sqrt(n) invlogit = function(x) exp(x) / (1+exp(x)) y.sample = function(x) rbinom(n, prob=invlogit(x %*% beta), size=1) y = factor(y.sample(X), levels=c(0,1), labels=c("A","B")) ``` Looking inside the knockoff filter ---------------------------------- Instead of using `knockoff.filter` directly, we can run the filter manually by calling its main components one by one. The first step is to generate the knockoff variables for the true Gaussian distribution of the variables. ```{r} X_k = create.gaussian(X, mu, Sigma) ``` Then, we compute the knockoff statistics using 10-fold cross-validated lasso ```{r, results='hide', message=FALSE, warning=FALSE} W = stat.glmnet_coefdiff(X, X_k, y, nfolds=10, family="binomial") ``` Now we can compute the rejection threshold ```{r} thres = knockoff.threshold(W, fdr=0.2, offset=1) ``` The final step is to select the variables ```{r} selected = which(W >= thres) print(selected) ``` The false discovery proportion is ```{r} fdp = function(selected) sum(beta[selected] == 0) / max(1, length(selected)) fdp(selected) ``` Performing numerical simulations -------------------------------- We show how to manually run the knockoff filter multiple times and compute average quantities. This is particularly useful to estimate the FDR (or the power) for a particular configuration of the knockoff filter on artificial problems. ```{r} # Optimize the parameters needed for generating Gaussian knockoffs, # by solving an SDP to minimize correlations with the original variables. # This calculation requires only the model parameters mu and Sigma, # not the observed variables X. Therefore, there is no reason to perform it # more than once for our simulation. diag_s = create.solve_asdp(Sigma) # Compute the fdp over 20 iterations nIterations = 20 fdp_list = sapply(1:nIterations, function(it) { # Run the knockoff filter manually, using the pre-computed value of diag_s X_k = create.gaussian(X, mu, Sigma, diag_s=diag_s) W = stat.glmnet_lambdasmax(X, X_k, y, family="binomial") t = knockoff.threshold(W, fdr=0.2, offset=1) selected = which(W >= t) # Compute and store the fdp fdp(selected) }) # Estimate the FDR mean(fdp_list) ``` See also -------- If you want to see some basic usage of the knockoff filter, see the [introductory vignette](knockoff.html). If you want to see how to use knockoffs for Fixed-X variables, see the [Fixed-X vignette](fixed.html).