New-Credible-Set.Rmd
This vignette will show users how the corrcoverage
R package can be used to obtain a new credible set of variants that contains the true causal variant with some specified desired coverage value whilst containing as few variants as possible.
As in the corrected coverage vignette, let’s begin by simulating some GWAS data using the simGWAS
package.
set.seed(18)
library(corrcoverage)
library(simGWAS)
# Simulate reference haplotypes
nsnps <- 200
nhaps <- 1000
lag <- 30 # genotypes are correlated between neighbouring variants
maf <- runif(nsnps + lag, 0.05, 0.5) # common SNPs
laghaps <- do.call("cbind", lapply(maf, function(f) rbinom(nhaps, 1, f)))
haps <- laghaps[, 1:nsnps]
for (j in 1:lag) haps <- haps + laghaps[, (1:nsnps) + j]
haps <- round(haps/matrix(apply(haps, 2, max), nhaps, nsnps, byrow = TRUE))
snps <- colnames(haps) <- paste0("s", 1:nsnps)
freq <- as.data.frame(haps + 1)
freq$Probability <- 1/nrow(freq)
sum(freq$Probability)
## [1] 1
MAF <- colMeans(freq[, snps] - 1) # minor allele frequencies
CV <- sample(snps[which(colMeans(haps) > 0.1)], 1)
iCV <- sub("s", "", CV) # index of cv
LD <- cor2(haps) # correlation between SNPs
OR <- 1.1 # odds ratios
N0 <- 10000 # number of controls
N1 <- 10000 # number of cases
z0 <- simulated_z_score(N0 = N0, # number of controls
N1 = N1, # number of cases
snps = snps, # column names in freq
W = CV, # causal variants, subset of snps
gamma.W = log(OR), # log odds ratios
freq = freq) # reference haplotypes
To calculate \(V\), the prior variance for the estimated effect size, we use Var.data.cc
.
varbeta <- Var.data.cc(f = MAF, N = N1+N0, s = N1/(N0+N1)) # variance of
# estimated effect size
We can then use the ppfunc
function to calculate the posterior probabilities of causality for each variant.
postprobs <- ppfunc(z = z0, V = varbeta)
We use the est_mu
function to obtain an estimate of the true effect at the causal variant.
## [1] 4.970273
The corrected_cov
function is used to find the corrected coverage of a credible set with specified threshold, say 0.9.
Note that this function is similar to using corrcov
as explained in the “Corrected Coverage” vignette; which would require \(Z\)-scores, minor allele frequencies and sample sizes. Here, we have already calculated some of the intermediaries calculated in the corrcov
function (muhat, varbeta etc.) so we can use corrected_cov
instead.
thr = 0.9
corrcov <- corrected_cov(pp0 = postprobs, mu = muhat, V = varbeta,
Sigma = LD, thr = thr, nrep = 1000)
cs <- credset(pp = postprobs, thr = thr)
data.frame(claimed.cov = cs$claimed.cov, corr.cov = corrcov, nvar = cs$nvar)
## claimed.cov corr.cov nvar
## 1 0.9307395 0.967282 9
Using the Bayesian approach for statistical fine-mapping we obtain a 90% credible set consisting of 9 variants. The claimed coverage of this credible set is ~0.93, yet the corrected coverage estimate is ~0.97, suggesting that we can afford to be ‘more confident’ that we have captured the causal variant in our credible set.
Again, for the purpose of this vignette we can investigate how accurate this estimate is by simulating many credible sets from the same system, and finding the proportion of these that contain the true causal variant.
z0.tmp <- simulated_z_score(N0 = N0, # number of controls
N1 = N1, # number of cases
snps = snps, # column names in freq
W = CV, # causal variants, subset of snps
gamma.W = log(OR), # log odds ratios
freq = freq, # reference haplotypes
nrep = 5000)
pps <- ppfunc.mat(zstar = z0.tmp, V = varbeta) # find pps
cs.cov <- apply(pps, 1, function(x) credset(x, CV = iCV, thr = thr)$cov)
true.cov.est <- mean(cs.cov)
data.frame(claimed.cov = cs$claimed.cov, corr.cov = corrcov,
true.cov = true.cov.est, nvar = cs$nvar)
## claimed.cov corr.cov true.cov nvar
## 1 0.9307395 0.967282 0.9696 9
We find that our corrected coverage value is very close to the empirical coverage of the credible set.
Our results suggest that we may be able to remove some variants from the credible set, whilst still achieving the desired coverage of 90%.
The corrected_cs
function uses GWAS summary statistics and some user-defined parameters to find the smallest credible set such that the coverage estimate is within some accuracy of the desired coverage.
The function requires the following parameters to be specified:
z
(vector of marginal \(Z\)-scores)f
(vector of minor allele frequencies)N0
, N1
(number of controls and cases respectively)Sigma
(SNP correlation matrix)desired.cov
(desired coverage of the credible set)In addition, there are a number of optional parameters:
lower
(the lower value to try for the threshold)upper
(the upper value to try for the threshold)acc
(the accuracy required for the coverage of the resultant credible set)max.iter
(maximum number of iterations)The function uses the bisection root finding method to converge to the smallest threshold such that the corrected coverage is larger than the desired coverage. The lower
and upper
parameters define the boundaries of the threshold values for the root finding method, with default values of 0 and 1 respectively. The acc
parameter has a default value set to 0.005, meaning that the algorithm will keep attempting to find a corrected credible set (until the number of iterations reaches max.iter
) that has coverage within 0.005 of the desired coverage value (e.g. between 0.895 and 0.905 for a 90% credible set).
The function reports the threshold values tested and their corresponding corrected coverage value at each iteration. The maximum number of iterations for the bisecting root finding algorithm is an optional parameter, with default value 20. The functions stops when either the number of iterations reaches the maximum, or the corrected coverage is within some accuracy of the desired coverage.
res <- corrected_cs(z = z0, f = MAF, N0, N1,
Sigma = LD, lower = 0.5, upper = 1, desired.cov = 0.9)
## [1] "thr: 0.75 , cov: 0.917720814110773"
## [1] "thr: 0.625 , cov: 0.878894070908612"
## [1] "thr: 0.6875 , cov: 0.89679762147096"
## [1] "thr: 0.71875 , cov: 0.907034158324742"
## [1] "thr: 0.703125 , cov: 0.901702620558738"
## $credset
## [1] "s54" "s58" "s57" "s67"
##
## $req.thr
## [1] 0.703125
##
## $corr.cov
## [1] 0.9017026
##
## $size
## [1] 0.7343958
The first threshold value tested is the midpoint of the lower
(0.5) and upper
(1) parameter values, which here is 0.75. The coverage of this credible set is too high (\(>0.9\)) and so a smaller threshold value is tested, the midpoint of the lower
parameter value (0.5) and the previously tried threshold value (0.75). This credible set has a coverage that is too low, and so a higher threshold value is tested, and so on.
In this example we see that a much smaller threshold value is required to obtain a credible set with 90% corrected coverage of the causal variant, containing only 4 variants. In the standard Bayesian approach, the threshold value of 90% leads to over-coverage.
Finally, we can compare this coverage estimate of the corrected credible set to an empirical estimate of the true coverage.
new.cs.sims <- apply(pps, 1, function(x) credset(x, CV = iCV, thr = res$req.thr)$cov)
true.cov.est2 <- mean(new.cs.sims)
Original 90% credible set:
df1 <- data.frame(claimed.cov = round(cs$claimed.cov, 3), corr.cov = round(corrcov, 3), true.cov = round(true.cov.est, 3), nvar = cs$nvar)
print(df1, row.names = FALSE)
## claimed.cov corr.cov true.cov nvar
## 0.931 0.967 0.97 9
New 90% credible set:
df2 <- data.frame(claimed.cov = round(res$size, 3), corr.cov = round(res$corr.cov, 3), true.cov = round(true.cov.est2, 3), nvar = length(res$credset))
print(df2, row.names = FALSE)
## claimed.cov corr.cov true.cov nvar
## 0.734 0.902 0.897 4
This vignette has shown how the corrcoverage
R package can be used to improve the resolution of a credible set from Bayesian genetic fine-mapping, without the use of any additional data.