Title: | Knockoffs for Hidden Markov Models and Genetic Data |
---|---|
Description: | Generate knockoffs for genetic data and hidden Markov models. For more information, see the website below and the accompanying papers: "Gene hunting with hidden Markov model knockoffs", Sesia et al., Biometrika, 2019, (<doi:10.1093/biomet/asy033>). "Multi-resolution localization of causal variants across the genome", Sesia et al., bioRxiv, 2019, (<doi:10.1101/631390>). |
Authors: | Matteo Sesia [aut, cre] |
Maintainer: | Matteo Sesia <[email protected]> |
License: | GPL-3 |
Version: | 0.8.4 |
Built: | 2024-11-01 06:10:59 UTC |
Source: | https://github.com/msesia/snpknock |
This function samples the latent Markov chain of a hidden Markov model given the observed values, given an HMM for unphased genotypes.
fwGenotypes(X, r, alpha, theta, seed = 123, cluster = NULL, display_progress = FALSE)
fwGenotypes(X, r, alpha, theta, seed = 123, cluster = NULL, display_progress = FALSE)
X |
a 0,1,2 matrix of size n-by-p containing the original variables. |
r |
a vector of length p containing the "r" parameters estimated by fastPHASE. |
alpha |
a matrix of size p-by-K containing the "alpha" parameters estimated by fastPHASE. |
theta |
a matrix of size p-by-K containing the "theta" parameters estimated by fastPHASE. |
seed |
an integer random seed (default: 123). |
cluster |
a computing cluster object created by makeCluster (default: NULL). |
display_progress |
whether to show progress bar (default: FALSE). |
A 0,1,2 matrix of size n-by-p containing the knockoff variables.
Sesia M, Katsevich E, Bates S, Candès E, Sabatti C (2019). “Multi-resolution localization of causal variants across the genome.” bioRxiv. doi:10.1101/631390.
Other hmm: fwHMM
, fwHaplotypes
# Problem size p = 10 n = 100 # Load HMM to generate data r_file = system.file("extdata", "haplotypes_rhat.txt", package = "SNPknock") alpha_file = system.file("extdata", "haplotypes_alphahat.txt", package = "SNPknock") theta_file = system.file("extdata", "haplotypes_thetahat.txt", package = "SNPknock") char_file = system.file("extdata", "haplotypes_origchars", package = "SNPknock") hmm.data = loadHMM(r_file, alpha_file, theta_file, char_file, compact=FALSE, phased=FALSE) hmm.data$Q = hmm.data$Q[1:(p-1),,] hmm.data$pEmit = hmm.data$pEmit[1:p,,] # Sample X from this HMM X = sampleHMM(hmm.data$pInit, hmm.data$Q, hmm.data$pEmit, n=n) # Load HMM to generate knockoffs hmm = loadHMM(r_file, alpha_file, theta_file, char_file) hmm$r = hmm$r[1:p] hmm$alpha = hmm$alpha[1:p,] hmm$theta = hmm$theta[1:p,] # Forward-backward sampling H = fwGenotypes(X, hmm$r, hmm$alpha, hmm$theta)
# Problem size p = 10 n = 100 # Load HMM to generate data r_file = system.file("extdata", "haplotypes_rhat.txt", package = "SNPknock") alpha_file = system.file("extdata", "haplotypes_alphahat.txt", package = "SNPknock") theta_file = system.file("extdata", "haplotypes_thetahat.txt", package = "SNPknock") char_file = system.file("extdata", "haplotypes_origchars", package = "SNPknock") hmm.data = loadHMM(r_file, alpha_file, theta_file, char_file, compact=FALSE, phased=FALSE) hmm.data$Q = hmm.data$Q[1:(p-1),,] hmm.data$pEmit = hmm.data$pEmit[1:p,,] # Sample X from this HMM X = sampleHMM(hmm.data$pInit, hmm.data$Q, hmm.data$pEmit, n=n) # Load HMM to generate knockoffs hmm = loadHMM(r_file, alpha_file, theta_file, char_file) hmm$r = hmm$r[1:p] hmm$alpha = hmm$alpha[1:p,] hmm$theta = hmm$theta[1:p,] # Forward-backward sampling H = fwGenotypes(X, hmm$r, hmm$alpha, hmm$theta)
This function samples the latent Markov chain of a hidden Markov model given the observed values, given an HMM for phased haplotypes.
fwHaplotypes(X, r, alpha, theta, seed = 123, cluster = NULL, display_progress = FALSE)
fwHaplotypes(X, r, alpha, theta, seed = 123, cluster = NULL, display_progress = FALSE)
X |
a binary matrix of size n-by-p containing the original variables. |
r |
a vector of length p containing the "r" parameters estimated by fastPHASE. |
alpha |
a matrix of size p-by-K containing the "alpha" parameters estimated by fastPHASE. |
theta |
a matrix of size p-by-K containing the "theta" parameters estimated by fastPHASE. |
seed |
an integer random seed (default: 123). |
cluster |
a computing cluster object created by makeCluster (default: NULL). |
display_progress |
whether to show progress bar (default: FALSE). |
A binary matrix of size n-by-p containing the knockoff variables.
Sesia M, Katsevich E, Bates S, Candès E, Sabatti C (2019). “Multi-resolution localization of causal variants across the genome.” bioRxiv. doi:10.1101/631390.
Other hmm: fwGenotypes
, fwHMM
# Problem size p = 10 n = 100 # Load HMM to generate data r_file = system.file("extdata", "haplotypes_rhat.txt", package = "SNPknock") alpha_file = system.file("extdata", "haplotypes_alphahat.txt", package = "SNPknock") theta_file = system.file("extdata", "haplotypes_thetahat.txt", package = "SNPknock") char_file = system.file("extdata", "haplotypes_origchars", package = "SNPknock") hmm.data = loadHMM(r_file, alpha_file, theta_file, char_file, compact=FALSE, phased=TRUE) hmm.data$Q = hmm.data$Q[1:(p-1),,] hmm.data$pEmit = hmm.data$pEmit[1:p,,] # Sample X from this HMM X = sampleHMM(hmm.data$pInit, hmm.data$Q, hmm.data$pEmit, n=n) # Load HMM to generate knockoffs hmm = loadHMM(r_file, alpha_file, theta_file, char_file) hmm$r = hmm$r[1:p] hmm$alpha = hmm$alpha[1:p,] hmm$theta = hmm$theta[1:p,] # Forward-backward sampling H = fwHaplotypes(X, hmm$r, hmm$alpha, hmm$theta)
# Problem size p = 10 n = 100 # Load HMM to generate data r_file = system.file("extdata", "haplotypes_rhat.txt", package = "SNPknock") alpha_file = system.file("extdata", "haplotypes_alphahat.txt", package = "SNPknock") theta_file = system.file("extdata", "haplotypes_thetahat.txt", package = "SNPknock") char_file = system.file("extdata", "haplotypes_origchars", package = "SNPknock") hmm.data = loadHMM(r_file, alpha_file, theta_file, char_file, compact=FALSE, phased=TRUE) hmm.data$Q = hmm.data$Q[1:(p-1),,] hmm.data$pEmit = hmm.data$pEmit[1:p,,] # Sample X from this HMM X = sampleHMM(hmm.data$pInit, hmm.data$Q, hmm.data$pEmit, n=n) # Load HMM to generate knockoffs hmm = loadHMM(r_file, alpha_file, theta_file, char_file) hmm$r = hmm$r[1:p] hmm$alpha = hmm$alpha[1:p,] hmm$theta = hmm$theta[1:p,] # Forward-backward sampling H = fwHaplotypes(X, hmm$r, hmm$alpha, hmm$theta)
This function samples the latent Markov chain of a hidden Markov model given the observed values.
fwHMM(X, pInit, Q, pEmit, seed = 123, cluster = NULL, display_progress = FALSE)
fwHMM(X, pInit, Q, pEmit, seed = 123, cluster = NULL, display_progress = FALSE)
X |
an integer matrix of size n-by-p containing the original variables. |
pInit |
an array of length K, containing the marginal distribution of the states for the first variable. |
Q |
an array of size (p-1,K,K), containing a list of p-1 transition matrices between the K states of the Markov chain. |
pEmit |
an array of size (p,M,K), containing the emission probabilities for each of the M possible emission states, from each of the K hidden states and the p variables. |
seed |
an integer random seed (default: 123). |
cluster |
a computing cluster object created by makeCluster (default: NULL). |
display_progress |
whether to show progress bar (default: FALSE). |
Each element of the matrix X should be an integer value between 0 and M-1.
The transition matrices contained in Q are defined with the same convention as in knockoffDMC.
The emission propability matrices contained in pEmit are defined such that ,
where
is the latent variable associated to
.
An integer matrix of size n-by-p containing the knockoff variables.
Sesia M, Sabatti C, Candès EJ (2019). “Gene hunting with hidden Markov model knockoffs.” Biometrika, 106, 1–18. doi:10.1093/biomet/asy033.
Other hmm: fwGenotypes
,
fwHaplotypes
# Generate data p=10; K=5; M=3; pInit = rep(1/K,K) Q = array(stats::runif((p-1)*K*K),c(p-1,K,K)) for(j in 1:(p-1)) { Q[j,,] = Q[j,,] / rowSums(Q[j,,]) } pEmit = array(stats::runif(p*M*K),c(p,M,K)) for(j in 1:p) { pEmit[j,,] = pEmit[j,,] / rowSums(pEmit[j,,]) } X = sampleHMM(pInit, Q, pEmit, n=20) # Forward-backward sampling H = fwHMM(X, pInit, Q, pEmit)
# Generate data p=10; K=5; M=3; pInit = rep(1/K,K) Q = array(stats::runif((p-1)*K*K),c(p-1,K,K)) for(j in 1:(p-1)) { Q[j,,] = Q[j,,] / rowSums(Q[j,,]) } pEmit = array(stats::runif(p*M*K),c(p,M,K)) for(j in 1:p) { pEmit[j,,] = pEmit[j,,] / rowSums(pEmit[j,,]) } X = sampleHMM(pInit, Q, pEmit, n=20) # Forward-backward sampling H = fwHMM(X, pInit, Q, pEmit)
This function constructs knockoffs of variables distributed as a discrete Markov chain.
knockoffDMC(X, pInit, Q, groups = NULL, seed = 123, cluster = NULL, display_progress = FALSE)
knockoffDMC(X, pInit, Q, groups = NULL, seed = 123, cluster = NULL, display_progress = FALSE)
X |
an integer matrix of size n-by-p containing the original variables. |
pInit |
an array of length K, containing the marginal distribution of the states for the first variable. |
Q |
an array of size (p-1,K,K), containing a list of p-1 transition matrices between the K states of the Markov chain. |
groups |
an array of length p, describing the group membership of each variable (default: NULL). |
seed |
an integer random seed (default: 123). |
cluster |
a computing cluster object created by makeCluster (default: NULL). |
display_progress |
whether to show progress bar (default: FALSE). |
Each element of the matrix X should be an integer value between 0 and K-1.
The transition matrices contained in Q are defined such that .
An integer matrix of size n-by-p containing the knockoff variables.
Sesia M, Sabatti C, Candès EJ (2019). “Gene hunting with hidden Markov model knockoffs.” Biometrika, 106, 1–18. doi:10.1093/biomet/asy033. Sesia M, Katsevich E, Bates S, Candès E, Sabatti C (2019). “Multi-resolution localization of causal variants across the genome.” bioRxiv. doi:10.1101/631390.
Other knockoffs: knockoffGenotypes
,
knockoffHMM
,
knockoffHaplotypes
# Generate data p = 10; K = 5; pInit = rep(1/K,K) Q = array(stats::runif((p-1)*K*K),c(p-1,K,K)) for(j in 1:(p-1)) { Q[j,,] = Q[j,,] / rowSums(Q[j,,]) } X = sampleDMC(pInit, Q, n=20) # Generate knockoffs Xk = knockoffDMC(X, pInit, Q) # Generate group-knockoffs for groups of size 3 groups = rep(seq(p), each=3, length.out=p) Xk = knockoffDMC(X, pInit, Q, groups=groups)
# Generate data p = 10; K = 5; pInit = rep(1/K,K) Q = array(stats::runif((p-1)*K*K),c(p-1,K,K)) for(j in 1:(p-1)) { Q[j,,] = Q[j,,] / rowSums(Q[j,,]) } X = sampleDMC(pInit, Q, n=20) # Generate knockoffs Xk = knockoffDMC(X, pInit, Q) # Generate group-knockoffs for groups of size 3 groups = rep(seq(p), each=3, length.out=p) Xk = knockoffDMC(X, pInit, Q, groups=groups)
This function efficiently constructs group-knockoffs of 0,1,2 variables distributed according to the fastPHASE model for unphased genotypes.
knockoffGenotypes(X, r, alpha, theta, groups = NULL, seed = 123, cluster = NULL, display_progress = FALSE)
knockoffGenotypes(X, r, alpha, theta, groups = NULL, seed = 123, cluster = NULL, display_progress = FALSE)
X |
a 0,1,2 matrix of size n-by-p containing the original variables. |
r |
a vector of length p containing the "r" parameters estimated by fastPHASE. |
alpha |
a matrix of size p-by-K containing the "alpha" parameters estimated by fastPHASE. |
theta |
a matrix of size p-by-K containing the "theta" parameters estimated by fastPHASE. |
groups |
a vector of length p containing group memberships for each variable. Indices are assumed to be monotone increasing, starting from 1 (default: NULL). |
seed |
an integer random seed (default: 123). |
cluster |
a computing cluster object created by makeCluster (default: NULL). |
display_progress |
whether to show progress bar (default: FALSE). |
Generate group-knockoffs of unphased genotypes according to the fastPHASE HMM. The required model parameters can be obtained through fastPHASE and loaded with loadHMM. This function is more efficient than knockoffHMM for haplotype data.
A 0,1,2 matrix of size n-by-p containing the knockoff variables.
Sesia M, Katsevich E, Bates S, Candès E, Sabatti C (2019). “Multi-resolution localization of causal variants across the genome.” bioRxiv. doi:10.1101/631390.
Other knockoffs: knockoffDMC
,
knockoffHMM
,
knockoffHaplotypes
# Problem size p = 10 n = 100 # Load HMM to generate data r_file = system.file("extdata", "haplotypes_rhat.txt", package = "SNPknock") alpha_file = system.file("extdata", "haplotypes_alphahat.txt", package = "SNPknock") theta_file = system.file("extdata", "haplotypes_thetahat.txt", package = "SNPknock") char_file = system.file("extdata", "haplotypes_origchars", package = "SNPknock") hmm.data = loadHMM(r_file, alpha_file, theta_file, char_file, compact=FALSE, phased=FALSE) hmm.data$Q = hmm.data$Q[1:(p-1),,] hmm.data$pEmit = hmm.data$pEmit[1:p,,] # Sample X from this HMM X = sampleHMM(hmm.data$pInit, hmm.data$Q, hmm.data$pEmit, n=n) # Load HMM to generate knockoffs hmm = loadHMM(r_file, alpha_file, theta_file, char_file) hmm$r = hmm$r[1:p] hmm$alpha = hmm$alpha[1:p,] hmm$theta = hmm$theta[1:p,] # Generate knockoffs Xk = knockoffGenotypes(X, hmm$r, hmm$alpha, hmm$theta) # Generate group-knockoffs for groups of size 3 groups = rep(seq(p), each=3, length.out=p) Xk = knockoffGenotypes(X, hmm$r, hmm$alpha, hmm$theta, groups=groups)
# Problem size p = 10 n = 100 # Load HMM to generate data r_file = system.file("extdata", "haplotypes_rhat.txt", package = "SNPknock") alpha_file = system.file("extdata", "haplotypes_alphahat.txt", package = "SNPknock") theta_file = system.file("extdata", "haplotypes_thetahat.txt", package = "SNPknock") char_file = system.file("extdata", "haplotypes_origchars", package = "SNPknock") hmm.data = loadHMM(r_file, alpha_file, theta_file, char_file, compact=FALSE, phased=FALSE) hmm.data$Q = hmm.data$Q[1:(p-1),,] hmm.data$pEmit = hmm.data$pEmit[1:p,,] # Sample X from this HMM X = sampleHMM(hmm.data$pInit, hmm.data$Q, hmm.data$pEmit, n=n) # Load HMM to generate knockoffs hmm = loadHMM(r_file, alpha_file, theta_file, char_file) hmm$r = hmm$r[1:p] hmm$alpha = hmm$alpha[1:p,] hmm$theta = hmm$theta[1:p,] # Generate knockoffs Xk = knockoffGenotypes(X, hmm$r, hmm$alpha, hmm$theta) # Generate group-knockoffs for groups of size 3 groups = rep(seq(p), each=3, length.out=p) Xk = knockoffGenotypes(X, hmm$r, hmm$alpha, hmm$theta, groups=groups)
This function efficiently constructs group-knockoffs of binary variables distributed according to the fastPHASE model for phased haplotypes.
knockoffHaplotypes(X, r, alpha, theta, groups = NULL, seed = 123, cluster = NULL, display_progress = FALSE)
knockoffHaplotypes(X, r, alpha, theta, groups = NULL, seed = 123, cluster = NULL, display_progress = FALSE)
X |
a binary matrix of size n-by-p containing the original variables. |
r |
a vector of length p containing the "r" parameters estimated by fastPHASE. |
alpha |
a matrix of size p-by-K containing the "alpha" parameters estimated by fastPHASE. |
theta |
a matrix of size p-by-K containing the "theta" parameters estimated by fastPHASE. |
groups |
a vector of length p containing group memberships for each variable. Indices are assumed to be monotone increasing, starting from 1 (default: NULL). |
seed |
an integer random seed (default: 123). |
cluster |
a computing cluster object created by makeCluster (default: NULL). |
display_progress |
whether to show progress bar (default: FALSE). |
Generate group-knockoffs of phased haplotypes according to the fastPHASE HMM. The required model parameters can be obtained through fastPHASE and loaded with loadHMM. This function is more efficient than knockoffHMM for haplotype data.
A binary matrix of size n-by-p containing the knockoff variables.
Sesia M, Katsevich E, Bates S, Candès E, Sabatti C (2019). “Multi-resolution localization of causal variants across the genome.” bioRxiv. doi:10.1101/631390.
Other knockoffs: knockoffDMC
,
knockoffGenotypes
,
knockoffHMM
# Problem size p = 10 n = 100 # Load HMM to generate data r_file = system.file("extdata", "haplotypes_rhat.txt", package = "SNPknock") alpha_file = system.file("extdata", "haplotypes_alphahat.txt", package = "SNPknock") theta_file = system.file("extdata", "haplotypes_thetahat.txt", package = "SNPknock") char_file = system.file("extdata", "haplotypes_origchars", package = "SNPknock") hmm.data = loadHMM(r_file, alpha_file, theta_file, char_file, compact=FALSE, phased=TRUE) hmm.data$Q = hmm.data$Q[1:(p-1),,] hmm.data$pEmit = hmm.data$pEmit[1:p,,] # Sample X from this HMM X = sampleHMM(hmm.data$pInit, hmm.data$Q, hmm.data$pEmit, n=n) # Load HMM to generate knockoffs hmm = loadHMM(r_file, alpha_file, theta_file, char_file) hmm$r = hmm$r[1:p] hmm$alpha = hmm$alpha[1:p,] hmm$theta = hmm$theta[1:p,] # Generate knockoffs Xk = knockoffHaplotypes(X, hmm$r, hmm$alpha, hmm$theta) # Generate group-knockoffs for groups of size 3 groups = rep(seq(p), each=3, length.out=p) Xk = knockoffHaplotypes(X, hmm$r, hmm$alpha, hmm$theta, groups=groups)
# Problem size p = 10 n = 100 # Load HMM to generate data r_file = system.file("extdata", "haplotypes_rhat.txt", package = "SNPknock") alpha_file = system.file("extdata", "haplotypes_alphahat.txt", package = "SNPknock") theta_file = system.file("extdata", "haplotypes_thetahat.txt", package = "SNPknock") char_file = system.file("extdata", "haplotypes_origchars", package = "SNPknock") hmm.data = loadHMM(r_file, alpha_file, theta_file, char_file, compact=FALSE, phased=TRUE) hmm.data$Q = hmm.data$Q[1:(p-1),,] hmm.data$pEmit = hmm.data$pEmit[1:p,,] # Sample X from this HMM X = sampleHMM(hmm.data$pInit, hmm.data$Q, hmm.data$pEmit, n=n) # Load HMM to generate knockoffs hmm = loadHMM(r_file, alpha_file, theta_file, char_file) hmm$r = hmm$r[1:p] hmm$alpha = hmm$alpha[1:p,] hmm$theta = hmm$theta[1:p,] # Generate knockoffs Xk = knockoffHaplotypes(X, hmm$r, hmm$alpha, hmm$theta) # Generate group-knockoffs for groups of size 3 groups = rep(seq(p), each=3, length.out=p) Xk = knockoffHaplotypes(X, hmm$r, hmm$alpha, hmm$theta, groups=groups)
This function constructs knockoffs of variables distributed as a hidden Markov model.
knockoffHMM(X, pInit, Q, pEmit, groups = NULL, seed = 123, cluster = NULL, display_progress = FALSE)
knockoffHMM(X, pInit, Q, pEmit, groups = NULL, seed = 123, cluster = NULL, display_progress = FALSE)
X |
an integer matrix of size n-by-p containing the original variables. |
pInit |
an array of length K, containing the marginal distribution of the states for the first variable. |
Q |
an array of size (p-1,K,K), containing a list of p-1 transition matrices between the K states of the Markov chain. |
pEmit |
an array of size (p,M,K), containing the emission probabilities for each of the M possible emission states, from each of the K hidden states and the p variables. |
groups |
an array of length p, describing the group membership of each variable (default: NULL). |
seed |
an integer random seed (default: 123). |
cluster |
a computing cluster object created by makeCluster (default: NULL). |
display_progress |
whether to show progress bar (default: FALSE). |
Each element of the matrix X should be an integer value between 0 and M-1.
The transition matrices contained in Q are defined with the same convention as in knockoffDMC.
The emission propability matrices contained in pEmit are defined such that ,
where
is the latent variable associated to
.
An integer matrix of size n-by-p containing the knockoff variables.
Sesia M, Sabatti C, Candès EJ (2019). “Gene hunting with hidden Markov model knockoffs.” Biometrika, 106, 1–18. doi:10.1093/biomet/asy033. Sesia M, Katsevich E, Bates S, Candès E, Sabatti C (2019). “Multi-resolution localization of causal variants across the genome.” bioRxiv. doi:10.1101/631390.
Other knockoffs: knockoffDMC
,
knockoffGenotypes
,
knockoffHaplotypes
# Generate data p=10; K=5; M=3; pInit = rep(1/K,K) Q = array(stats::runif((p-1)*K*K),c(p-1,K,K)) for(j in 1:(p-1)) { Q[j,,] = Q[j,,] / rowSums(Q[j,,]) } pEmit = array(stats::runif(p*M*K),c(p,M,K)) for(j in 1:p) { pEmit[j,,] = pEmit[j,,] / rowSums(pEmit[j,,]) } X = sampleHMM(pInit, Q, pEmit, n=20) # Generate knockoffs Xk = knockoffHMM(X, pInit, Q, pEmit) # Generate group-knockoffs for groups of size 3 groups = rep(seq(p), each=3, length.out=p) Xk = knockoffHMM(X, pInit, Q, pEmit, groups=groups)
# Generate data p=10; K=5; M=3; pInit = rep(1/K,K) Q = array(stats::runif((p-1)*K*K),c(p-1,K,K)) for(j in 1:(p-1)) { Q[j,,] = Q[j,,] / rowSums(Q[j,,]) } pEmit = array(stats::runif(p*M*K),c(p,M,K)) for(j in 1:p) { pEmit[j,,] = pEmit[j,,] / rowSums(pEmit[j,,]) } X = sampleHMM(pInit, Q, pEmit, n=20) # Generate knockoffs Xk = knockoffHMM(X, pInit, Q, pEmit) # Generate group-knockoffs for groups of size 3 groups = rep(seq(p), each=3, length.out=p) Xk = knockoffHMM(X, pInit, Q, pEmit, groups=groups)
This function loads the parameter estimates obtained by fastPHASE (see runFastPhase) and assembles the Li and Stephens HMM, in the format required by the knockoff generation functions knockoffHaplotypes and knockoffGenotypes.
loadHMM(r_file, alpha_file, theta_file, char_file, compact = TRUE, phased = FALSE)
loadHMM(r_file, alpha_file, theta_file, char_file, compact = TRUE, phased = FALSE)
r_file |
a string with the path of the "_rhat.txt" file produced by fastPHASE. |
alpha_file |
a string with the path of the "_alphahat.txt" file produced by fastPHASE. |
theta_file |
a string with the path of the "_thetahat.txt" file produced by fastPHASE. |
char_file |
a string with the path of the "_origchars" file produced by fastPHASE. |
compact |
whether to assemble the explicit transition and emission matrices for the HMM (default: FALSE). |
phased |
whether to assemble a model for phased haplotypes, if compact==FALSE (default: FALSE). |
This function by default returns a structure with three fields:
"r": a numerical array of length p.
"alpha": a numerical array of size (p,K).
"theta": a numerical array of size (p,K).
If the parameter compact is FALSE, this function assembles the HMM model for the genotype data (either unphased or phased), in the format required by the knockoff generation function knockoffHMM.
A structure containing the parameters from the Li and Stephens HMM for phased haplotypes.
Scheet P, Stephens M (2006). “A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.” Am. J. Hum. Genet., 78, 629–644. doi:10.1086/502802.
Other fastPHASE: runFastPhase
,
writeXtoInp
# Specify the location of the fastPHASE output files containing the parameter estimates. # Example files can be found in the package installation directory. r_file = system.file("extdata", "genotypes_rhat.txt", package = "SNPknock") alpha_file = system.file("extdata", "genotypes_alphahat.txt", package = "SNPknock") theta_file = system.file("extdata", "genotypes_thetahat.txt", package = "SNPknock") char_file = system.file("extdata", "genotypes_origchars", package = "SNPknock") # Read the parameter files and load the HMM hmm = loadHMM(r_file, alpha_file, theta_file, char_file) # Read the parameter files and load the HMM hmm.large = loadHMM(r_file, alpha_file, theta_file, char_file, compact=FALSE)
# Specify the location of the fastPHASE output files containing the parameter estimates. # Example files can be found in the package installation directory. r_file = system.file("extdata", "genotypes_rhat.txt", package = "SNPknock") alpha_file = system.file("extdata", "genotypes_alphahat.txt", package = "SNPknock") theta_file = system.file("extdata", "genotypes_thetahat.txt", package = "SNPknock") char_file = system.file("extdata", "genotypes_origchars", package = "SNPknock") # Read the parameter files and load the HMM hmm = loadHMM(r_file, alpha_file, theta_file, char_file) # Read the parameter files and load the HMM hmm.large = loadHMM(r_file, alpha_file, theta_file, char_file, compact=FALSE)
This function provides a wrapper for the fastPHASE executable in order to fit an HMM to either unphased genotype data or phased haplotype data. The software fastPHASE will fit the HMM to the genotype data and write the corresponding parameter estimates in four separate files. Since fastPHASE is not an R package, this executable must be downloaded separately by the user. Visit http://scheet.org/software.html for more information on how to obtain fastPHASE.
runFastPhase(fp_path, X_file, out_path = NULL, K = 12, numit = 25, phased = FALSE, seed = 1)
runFastPhase(fp_path, X_file, out_path = NULL, K = 12, numit = 25, phased = FALSE, seed = 1)
fp_path |
a string with the path to the directory with the fastPHASE executable. |
X_file |
a string with the path of the genotype input file containing X in fastPHASE format (as created by writeXtoInp). |
out_path |
a string with the path of the directory in which the parameter estimates will be saved (default: NULL). If this is equal to NULL, a temporary file in the R temporary directory will be used. |
K |
the number of hidden states for each haplotype sequence (default: 12). |
numit |
the number of EM iterations (default: 25). |
phased |
whether the data are already phased (default: FALSE). |
seed |
the random seed for the EM algorithm (default: 1). |
The software fastPHASE saves the parameter estimates in four separate files whose names begin with the string contained in 'out_path' and end with:
"_rhat.txt"
"_alphahat.txt"
"_thetahat.txt"
"_origchars"
The HMM for the genotype data can then be loaded from these files by calling loadHMM.
A string containing the path of the directory in which the parameter estimates were saved. This is useful to find the data when the default option for 'out_path' is used and the output is written in an R temporary directory.
Scheet P, Stephens M (2006). “A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.” Am. J. Hum. Genet., 78, 629–644. doi:10.1086/502802.
Other fastPHASE: loadHMM
,
writeXtoInp
fp_path = "~/bin/fastPHASE" # Path to the fastPHASE executable # Run fastPHASE on unphased genotypes # Specify the path to the genotype input file in ".inp" format. # An example file containing unphased genotypes can be found in the package installation folder. X_file = system.file("extdata", "genotypes.inp", package = "SNPknock") fp_outPath = runFastPhase(fp_path, X_file) # Run fastPHASE on phased haplotypes # An example file containing phased haplotypes can be found in the package installation folder. H_file = system.file("extdata", "haplotypes.inp", package = "SNPknock") fp_outPath = runFastPhase(fp_path, H_file, phased=TRUE)
fp_path = "~/bin/fastPHASE" # Path to the fastPHASE executable # Run fastPHASE on unphased genotypes # Specify the path to the genotype input file in ".inp" format. # An example file containing unphased genotypes can be found in the package installation folder. X_file = system.file("extdata", "genotypes.inp", package = "SNPknock") fp_outPath = runFastPhase(fp_path, X_file) # Run fastPHASE on phased haplotypes # An example file containing phased haplotypes can be found in the package installation folder. H_file = system.file("extdata", "haplotypes.inp", package = "SNPknock") fp_outPath = runFastPhase(fp_path, H_file, phased=TRUE)
This function draws independent random samples of a discrete Markov chain.
sampleDMC(pInit, Q, n = 1)
sampleDMC(pInit, Q, n = 1)
pInit |
an array of length K, containing the marginal distribution of the states for the first variable. |
Q |
an array of size (p-1,K,K), containing a list of p-1 transition matrices between the K states of the Markov chain. |
n |
the number of independent samples to be drawn (default: 1). |
Each element of the output matrix is an integer value between 0 and K-1.
The transition matrices contained in Q are defined such that .
A matrix of size n-by-p containing the n observed Markov chains of length p.
Sesia M, Sabatti C, Candès EJ (2019). “Gene hunting with hidden Markov model knockoffs.” Biometrika, 106, 1–18. doi:10.1093/biomet/asy033.
Other models: sampleHMM
p=10; K=5; pInit = rep(1/K,K) Q = array(stats::runif((p-1)*K*K),c(p-1,K,K)) for(j in 1:(p-1)) { Q[j,,] = Q[j,,] / rowSums(Q[j,,]) } X = sampleDMC(pInit, Q, n=20)
p=10; K=5; pInit = rep(1/K,K) Q = array(stats::runif((p-1)*K*K),c(p-1,K,K)) for(j in 1:(p-1)) { Q[j,,] = Q[j,,] / rowSums(Q[j,,]) } X = sampleDMC(pInit, Q, n=20)
This function draws independent random samples of an hidden Markov model.
sampleHMM(pInit, Q, pEmit, n = 1)
sampleHMM(pInit, Q, pEmit, n = 1)
pInit |
an array of length K, containing the marginal distribution of the states for the first variable. |
Q |
an array of size (p-1,K,K), containing a list of p-1 transition matrices between the K states of the Markov chain. |
pEmit |
an array of size (p,M,K), containing the emission probabilities for each of the M possible emission states, from each of the K hidden states and the p variables. |
n |
the number of independent samples to be drawn (default: 1). |
Each element of the output matrix is an integer value between 0 and K-1.
The transition matrices contained in Q are defined with the same convention as in sampleDMC.
The emission propability matrices contained in pEmit are defined such that ,
where
is the latent variable associated to
.
A matrix of size n-by-p containing the n observed Markov chains of length p.
Sesia M, Sabatti C, Candès EJ (2019). “Gene hunting with hidden Markov model knockoffs.” Biometrika, 106, 1–18. doi:10.1093/biomet/asy033.
Other models: sampleDMC
p=10; K=5; M=3; pInit = rep(1/K,K) Q = array(stats::runif((p-1)*K*K),c(p-1,K,K)) for(j in 1:(p-1)) { Q[j,,] = Q[j,,] / rowSums(Q[j,,]) } pEmit = array(stats::runif(p*M*K),c(p,M,K)) for(j in 1:p) { pEmit[j,,] = pEmit[j,,] / rowSums(pEmit[j,,]) } X = sampleHMM(pInit, Q, pEmit, n=20)
p=10; K=5; M=3; pInit = rep(1/K,K) Q = array(stats::runif((p-1)*K*K),c(p-1,K,K)) for(j in 1:(p-1)) { Q[j,,] = Q[j,,] / rowSums(Q[j,,]) } pEmit = array(stats::runif(p*M*K),c(p,M,K)) for(j in 1:p) { pEmit[j,,] = pEmit[j,,] / rowSums(pEmit[j,,]) } X = sampleHMM(pInit, Q, pEmit, n=20)
This function converts a genetic matrix X into the fastPHASE input format and saves it to a user-specified file. Then, an HMM can be fitted by calling fastPHASE with runFastPhase.
writeXtoInp(X, phased = FALSE, out_file = NULL)
writeXtoInp(X, phased = FALSE, out_file = NULL)
X |
either a matrix of size n-by-p containing unphased genotypes for n individuals, or a matrix of size 2n-by-p containing phased haplotypes for n individuals. |
phased |
whether the data are phased (default: FALSE). If this is equal to TRUE, each pair of consecutive rows will be assumed to correspond to phased haplotypes from the same individual. |
out_file |
a string containing the path of the output file onto which X will be written (default: NULL). If this is equal to NULL, a temporary file in the R temporary directory will be used. |
A string containing the path of the output file onto which X was written. This is useful to find the data when the default option for 'out_file' is used and X is written onto a temporary file in the R temporary directory.
Scheet P, Stephens M (2006). “A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.” Am. J. Hum. Genet., 78, 629–644. doi:10.1086/502802.
Other fastPHASE: loadHMM
,
runFastPhase
# Convert unphased genotypes # Load an example data matrix X from the package installation directory. X_file = system.file("extdata", "genotypes.RData", package = "SNPknock") load(X_file) # Write X in a temporary file Xinp_file = writeXtoInp(X) # Convert phased haplotypes # Load an example data matrix H from the package installation directory. H_file = system.file("extdata", "haplotypes.RData", package = "SNPknock") load(H_file) # Write H in a temporary file Hinp_file = writeXtoInp(H, phased=TRUE)
# Convert unphased genotypes # Load an example data matrix X from the package installation directory. X_file = system.file("extdata", "genotypes.RData", package = "SNPknock") load(X_file) # Write X in a temporary file Xinp_file = writeXtoInp(X) # Convert phased haplotypes # Load an example data matrix H from the package installation directory. H_file = system.file("extdata", "haplotypes.RData", package = "SNPknock") load(H_file) # Write H in a temporary file Hinp_file = writeXtoInp(H, phased=TRUE)