Try simulating the errors just once, ERR ~ MVN(0, Sigma) (nrep draws), resulting in an nrep x nsnp matrix. Let mexp.zm be a matrix with exp.zm replicated in each row (nrep times). Then zstar = mexp.zm + ERR, and we do not have to call mvtnorm::rmvnorm for every SNP considered causal. The corresponding code from the corrcov function is:
ERR = mvtnorm::rmvnorm(nrep, rep(0, ncol(Sigma)), Sigma) # simulate the errors once
pp_ERR = function(Zj) { # nrep, Sigma, ERR and r are taken from the enclosing environment
  exp.zm = Zj %*% Sigma # expected Z-scores when this SNP is considered causal
  mexp.zm = matrix(exp.zm, nrep, length(Zj), byrow = TRUE) # matrix of exp.zm replicated in each row
  zstar = mexp.zm + ERR # simulated Z-score systems
  bf = 0.5 * (log(1 - r) + (r * zstar^2)) # log approximate Bayes factors
  denom = coloc:::logsum(bf) # stable log-sum-exp so the sum is not Inf
  pp.tmp = exp(bf - denom) # convert back from log scale
  pp.tmp/rowSums(pp.tmp) # normalise each simulated system
}
This method possibly makes the results dependent between different SNPs (only one ERR matrix is used, so they share the same errors), but this should be fine for large numbers of simulations.
Check that the original method, zstar = rmvnorm(nrep, exp.zm, Sigma), is equivalent to this new method, zstar = mexp.zm + ERR, where ERR ~ MVN(0, Sigma).
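A quick sanity check with a toy example (made-up two-SNP Sigma and means, purely illustrative):

library(mvtnorm)
set.seed(1)
nrep <- 100000
Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2) # toy correlation matrix
exp.zm <- c(2, 1) # toy expected Z-scores
zstar1 <- rmvnorm(nrep, exp.zm, Sigma) # original method
ERR <- rmvnorm(nrep, rep(0, ncol(Sigma)), Sigma)
zstar2 <- sweep(ERR, 2, exp.zm, "+") # new method: shift the errors by the means
round(colMeans(zstar1) - colMeans(zstar2), 2) # ~0 up to Monte Carlo error
round(cov(zstar1) - cov(zstar2), 2) # ~0 up to Monte Carlo error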
Yep, I’ve convinced myself that they’re the same.
BUT, I was looking at some old code and we tried this at some stage:
ERR = rmvnorm(nrep, rep(0, ncol(Sigma)), Sigma) # simulate the errors once
simzstar <- function(Zj) { # nrep, Sigma and ERR are taken from the enclosing environment
  mZj = matrix(Zj, nrep, length(Zj), byrow = TRUE) # matrix of Zj replicated in each row
  ii = which(Zj != 0) # currently considered causal SNP
  mZj[, ii] = mZj[, ii] - ERR[, ii] # subtract the error at the causal SNP so it cancels when ERR is added back
  exp.zm = mZj %*% Sigma
  exp.zm + ERR
}
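A quick toy check of what this construction achieves (made-up three-SNP Sigma and \(\mu\), purely illustrative):

library(mvtnorm)
set.seed(1)
nrep <- 5
mu <- 4
Sigma <- matrix(0.6, 3, 3); diag(Sigma) <- 1 # toy correlation matrix
ERR <- rmvnorm(nrep, rep(0, 3), Sigma)
Zj <- c(0, mu, 0) # SNP 2 considered causal
mZj <- matrix(Zj, nrep, 3, byrow = TRUE)
ii <- which(Zj != 0)
mZj[, ii] <- mZj[, ii] - ERR[, ii]
zstar <- mZj %*% Sigma + ERR
zstar[, ii] # exactly mu = 4 in every replicate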
This version cancels the simulated error at the causal variant (Zj is zero except at the causal index and Sigma has unit diagonal, so the subtracted error reappears and cancels exactly there) and so maintains the value of \(\mu\) at the causal variant in the zstar vectors. Which one should I be using?

corrcov and corrcov_bhat functions

Check that the results using corrcov (\(Z\)-scores and MAFs) and corrcov_bhat (bhat and \(V\)) are accurate.
Should we be concerned by the variability of these estimates, considering that the only variability between them comes from simulating the errors using rmvnorm? (10,000 simulations)
Aim: Find the smallest threshold needed such that the corrected coverage is above the desired coverage.
Last week, I tried a method whereby I systematically added/removed variants from the set. But the corrected coverage is only indirectly a function of the SNPs in the set; it is more directly a function of the threshold used to define that set. Therefore, I now consider a method that varies the threshold.
The following function uses the bracketing bisection method to find the ‘root’ of an equation (here, corrected coverage minus desired coverage) and therefore the smallest threshold such that we exceed the desired coverage.
#' corrected_cs
#'
#' @param z Z-scores
#' @param f Minor allele frequencies
#' @param N0 Number of controls
#' @param N1 Number of cases
#' @param Sigma Correlation matrix of SNPs
#' @param lower Lower bound of the threshold search interval
#' @param upper Upper bound of the threshold search interval
#' @param desired.cov The desired coverage of the causal variant in the credible set
#' @param acc Accuracy of corrected coverage to desired coverage (default = 0.005)
#' @param max.iter Maximum iterations (default = 20)
#'
#' @return list of variants in credible set, required threshold, the corrected coverage and the size of the credible set
#' @export
corrected_cs <- function(z, f, N0, N1, Sigma, lower, upper, desired.cov, acc = 0.005, max.iter = 20){
  # lower <- 2*desired.cov - 1
  # upper <- min(1, desired.cov + 0.05)
  s = N1/(N0 + N1) # proportion of cases
  V = 1/(2 * (N0 + N1) * f * (1 - f) * s * (1 - s)) # variance of estimated effect size
  W = 0.2 # prior standard deviation of effect size
  r <- W^2/(W^2 + V)
  pp <- ppfunc(z, V = V) # pp of system in question
  muhat = est_mu(z, f, N0, N1)
  nsnps = length(pp)
  temp = diag(x = muhat, nrow = nsnps, ncol = nsnps)
  zj = lapply(seq_len(nrow(temp)), function(i) temp[i, ]) # nsnp zj vectors, one for each snp considered causal
  nrep = 1000
  # simulate the ERR matrix once
  ERR = mvtnorm::rmvnorm(nrep, rep(0, ncol(Sigma)), Sigma)
  pp_ERR = function(Zj){
    exp.zm = Zj %*% Sigma
    mexp.zm = matrix(exp.zm, nrep, length(Zj), byrow = TRUE) # matrix of exp.zm replicated in each row
    zstar = mexp.zm + ERR
    bf = 0.5 * (log(1 - r) + (r * zstar^2))
    denom = coloc:::logsum(bf) # logsum(x) = max(x) + log(sum(exp(x - max(x)))) so sum is not Inf
    pp.tmp = exp(bf - denom) # convert back from log scale
    pp.tmp / rowSums(pp.tmp)
  }
  # simulate pp systems
  pps <- mapply(pp_ERR, zj, SIMPLIFY = FALSE)
  n_pps <- length(pps)
  args <- seq_along(pp)
  covdiff <- function(thr){ # difference between corrected coverage and desired coverage at threshold thr
    d5 <- lapply(seq_len(n_pps), function(x) {
      credsetC(pps[[x]], CV = rep(args[x], dim(pps[[x]])[1]), thr = thr)
    })
    prop.cov <- unlist(lapply(d5, prop_cov))
    sum(prop.cov * pp) - desired.cov
  }
  # initialise the bracketing bisection
  N = 1
  fa = covdiff(lower)
  fb = covdiff(upper)
  if (fa * fb > 0) {
    stop("No root in range, increase window")
  } else {
    fc = min(fa, fb) # negative when the root is strictly bracketed, so the loop runs
    while (N <= max.iter && !dplyr::between(fc, 0, acc)) {
      mid = lower + (upper - lower)/2 # midpoint of the current bracket
      fc = covdiff(mid)
      print(paste("thr: ", mid, ", cov: ", desired.cov + fc))
      if (fa * fc < 0) { # root lies in [lower, mid]
        upper = mid
        fb = fc
      } else if (fb * fc < 0) { # root lies in [mid, upper]
        lower = mid
        fa = fc
      }
      N = N + 1
    }
    # df <- data.frame(req.thr = mid, corr.cov = desired.cov + fc)
  }
  o <- order(pp, decreasing = TRUE) # order index for true pp
  cumpp <- cumsum(pp[o]) # cumulative sums of ordered pps
  wh <- which(cumpp > mid)[1] # how many variants are needed to exceed the corrected threshold
  list(credset = names(pp)[o[1:wh]], req.thr = mid, corr.cov = desired.cov + fc, size = cumpp[wh])
}
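A hypothetical usage sketch with toy LD and Z-scores; it assumes the helper functions used above (ppfunc, est_mu, credsetC and prop_cov) are available from the corrcoverage package:

library(mvtnorm)
set.seed(2)
nsnps <- 50
Sigma <- outer(1:nsnps, 1:nsnps, function(i, j) 0.9^abs(i - j)) # toy AR(1)-style LD
maf <- runif(nsnps, 0.05, 0.5) # toy minor allele frequencies
z <- as.vector(rmvnorm(1, rep(0, nsnps), Sigma))
z[25] <- 6 # inject a strong signal at SNP 25 (illustrative only)
names(z) <- paste0("snp", 1:nsnps)
res <- corrected_cs(z, f = maf, N0 = 5000, N1 = 5000, Sigma = Sigma,
                    lower = 0.6, upper = 0.95, desired.cov = 0.8)
res$req.thr # smallest threshold found by the bisection
res$size # sum of the pps of the variants in the new credible set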
I ran some simulations using this corrected_cs
function. The first plot shows the true empirical coverage of the simulated credible sets grouped by true \(\mu\) value (approx. 500-1000 simulations in each group).
The second plot shows the coverage of the new credible sets obtained using the corrected_cs function (which finds the required threshold and uses the standard Bayesian approach with this new threshold). The function works well at providing a new credible set with coverage nearer to the desired value, however I am wary of how many are below 0.8 and of the variability of the true coverage estimates. These are artefacts of our correction (since all the corrected coverage values for the points in the right plot lie between 0.8 and 0.8005).
Side note: Why is there a ‘dip’ in the left plot?
The following plots visualise the relative error of the corrected and claimed coverage estimates of the original credible sets and of the new ones obtained using the corrected_cs
function.
Firstly, for the original credible sets:
The table below shows the summary statistics for the difference between the number of variants in the original 80% credible set and the new 80% credible set (using the corrected_cs
function). I.e. how many variants have been removed.
Obviously fewer variants are removed in high power scenarios as the original credible set often only has 1 variant in it.
In fact, where we would use fine-mapping there is often only one variant in the credible set anyway, so most of the time the credible set will stay the same? I think that our correction method will work best in high linkage disequilibrium cases, where there are many variants in the set.
## bin mean median low.error high.error
## 1: (0,4] 11.4607724 8 1 22
## 2: (4,4.75] 4.5208597 1 1 3
## 3: (4.75,6] 2.2918862 1 1 1
## 4: (6,8] 0.8317152 1 0 1
## 5: (8,10] 0.6227390 1 0 1
## 6: (10,13] 0.6385542 1 0 1
## 7: (13,Inf] 0.5674419 1 0 1
Single-SNP logistic regression for T1D versus additive genotype score (0, 1 or 2 risk alleles), adjusted for sex and geographical region within the UK.
Used forward stepwise analyses to investigate secondary independent associations in each region; i.e. fit a new logistic regression model with the most significantly associated SNP as a covariate, then repeat, building up the covariates in the model until no more SNPs are significant.
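A rough sketch of one forward step (all object names here, dat, top_snp and other_snps, are hypothetical):

# dat: hypothetical data frame with columns t1d (0/1), sex, region and SNP dosages
base <- glm(t1d ~ sex + region + top_snp, data = dat, family = binomial)
pvals <- sapply(other_snps, function(s) { # other_snps: candidate SNP column names
  fit <- update(base, as.formula(paste(". ~ . +", s)))
  coef(summary(fit))[s, "Pr(>|z|)"]
})
best <- names(which.min(pvals)) # add as a covariate and refit until none are significant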
Create 99% credible sets of SNPs by ordering the posterior probabilities and cumulatively summing until the sum exceeds 0.99.
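For instance, a minimal sketch of this construction (pp assumed to be a named vector of posterior probabilities):

credset99 <- function(pp, thr = 0.99) {
  o <- order(pp, decreasing = TRUE) # order variants by posterior probability
  wh <- which(cumsum(pp[o]) > thr)[1] # number of variants needed to exceed the threshold
  names(pp)[o[1:wh]]
}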
Problems:
My calculated posterior probabilities do not match those in the paper. Perhaps this is due to cutting out SNPs not in the LD matrix, rounding, or the method used to calculate the posterior probabilities in the paper. Why are they looking at the ABF comparing the index SNP causal versus each of the others, rather than the ABF comparing each SNP causal versus the null?
Getting this error: In mvtnorm:::rmvnorm(nrep, rep(0, ncol(Sigma)), Sigma) : sigma is numerically not positive semidefinite.
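One possible workaround (an assumption on my part, not something verified here) is to project Sigma to the nearest positive definite correlation matrix before simulating:

library(Matrix)
Sigma.pd <- as.matrix(nearPD(Sigma, corr = TRUE)$mat) # nearest positive definite correlation matrix
# alternatively, add a small ridge to the diagonal:
# Sigma.ridge <- Sigma + diag(1e-8, ncol(Sigma))
ERR <- mvtnorm::rmvnorm(nrep, rep(0, ncol(Sigma.pd)), Sigma.pd)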
Anyhow, I have run an array job that finds the corrected coverage of the 99% credible sets from the T1D data and (if possible) finds a new credible set with corrected coverage >= 0.99 using the corrcoverage
package.
Results using the corrected_cs function:
## index_SNP their_cs_nvar our_cs_nvar our_cs_corrcov new_cs_nvar new_corrcov
## 1 rs10277986 26 11 0.9732795 17 0.9903420
## 2 rs11203202 5 4 0.9960134 4 0.9904442
## 3 rs11954020 136 83 0.9941234 74 0.9904929
## 4 rs12416116 7 6 0.9902935 5 0.9900865
## 5 rs12453507 112 65 0.9921685 63 0.9904078
## 6 rs12927355 41 20 0.9693932 24 0.9904093
## 7 rs13415583 139 92 0.9889096 97 0.9900088
## 8 rs1456988 40 26 0.9915964 26 0.9904314
## 9 rs151234 0 10 0.9848938 20 0.9904777
## 10 rs1538171 154 87 0.9701159 101 0.9904821
## 11 rs1615504 40 26 0.9920699 25 0.9904854
## 12 rs1893217 35 11 0.9478510 32 0.9900666
## 13 rs193778 102 23 0.9932024 24 0.9903892
## 14 rs2111485 4 4 0.9822731 4 0.9901479
## 15 rs229533 15 8 0.9937547 5 0.9902300
## 16 rs2476601 2 2 0.7986403 2 0.8968245
## 17 rs3024505 9 7 0.9965116 5 0.9901786
## 18 rs3087243 30 21 0.9523659 44 0.9902696
## 19 rs34593439 3 2 0.9934904 2 0.9903248
## 20 rs402072 25 14 0.9948048 12 0.9904155
## 21 rs4820830 117 59 0.9812119 67 0.9900985
## 22 rs516246 36 12 0.9863866 12 0.9903131
## 23 rs56994090 11 8 0.9974704 8 0.9902434
## 24 rs6043409 39 16 0.9956430 10 0.9901013
## 25 rs61839660 5 4 0.9512246 7 0.9900678
## 26 rs62447205 56 40 0.9824468 55 0.9901434
## 27 rs6476839 21 18 0.9947390 17 0.9903600
## 28 rs6518350 68 41 0.9975402 26 0.9900527
## 29 rs653178 5 2 0.8339680 5 0.9805488
## 30 rs705705 22 9 0.9297544 19 0.9903935
## 31 rs72727394 32 19 0.9946555 19 0.9902028
## 32 rs72928038 56 36 0.9934878 34 0.9904087
## 33 rs757411 82 54 0.9904337 53 0.9900654
## 34 rs75793288 34 22 0.9695837 35 0.9901022
## 35 rs8056814 22 8 0.9671638 14 0.9902707
## 36 rs917911 135 133 0.9950358 48 0.9902130
## 37 rs9585056 7 3 0.9969532 1 0.9898862
I worry that we are giving users a sort of ultimatum: “use the claimed coverage or use our corrected coverage”. It would be good if we could provide them with an indication of how good or bad the claimed and corrected coverage estimates are. Of course, they could look at our relative error plots and see which facet most closely matches their sample size / OR / threshold, but I think it would be better to provide something more concrete. Perhaps a confidence interval for each estimate, or some sort of ‘confidence score’ based on muhat / sample sizes etc.
Our package should also report an error or warning if users shouldn’t be using fine-mapping or my correction method. I think the best way to implement this is to set a cut-off on the minimum muhat value required.
Investigate how we can reduce the variability in corrected coverage estimates. The method is extremely fast now, so perhaps we could calculate many corrected coverage estimates and take the average of these?
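For example, a hypothetical wrapper (corrcov_avg is a made-up name, and I assume corrcov takes the arguments used earlier plus a threshold):

corrcov_avg <- function(z, f, N0, N1, Sigma, thr, B = 10) {
  ests <- replicate(B, corrcov(z, f, N0, N1, Sigma, thr)) # B independent corrected coverage estimates
  c(mean = mean(ests), sd = sd(ests)) # report the average and its spread
}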
Slightly confused about \(OR\) in our simulations. We fix it at a value, yet in the real data each SNP has an OR value associated with it (allowing us to calculate bhat).
Talk through Rob’s suggestions.