Aim: Obtain a required threshold (\(x\)) from a desired coverage (\(y\)).
Idea 1: Obtain the predicted threshold from each GAM (logit(covered) ~ logit(cumsum)), normalise each with pp and sum.
Idea 2: Predict \(y\) for \(x \in (0,1)\) from each GAM, average these predictions and use them to predict the required threshold.
Idea 3: As above, but normalise each prediction with pp (a sketch is given below).
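A minimal sketch of Idea 3, assuming `gams` is a list of fitted mgcv GAMs (one per SNP considered causal, each of the form logit(covered) ~ s(logit(cumsum)) with the predictor named logit.cumsum) and `pp` is the vector of posterior probabilities used as weights; these names are placeholders rather than the objects used above:
library(mgcv)
idea3_threshold <- function(gams, pp, desired.cov = 0.8) {
  grid <- seq(0.01, 0.99, by = 0.01)                 # candidate thresholds (x)
  newdat <- data.frame(logit.cumsum = qlogis(grid))  # predict on the logit scale
  # predicted coverage curve (y) from each GAM, back-transformed to (0,1)
  curves <- sapply(gams, function(g) plogis(predict(g, newdata = newdat)))
  ave <- as.vector(curves %*% (pp / sum(pp)))        # pp-weighted average curve
  grid[which(ave >= desired.cov)[1]]                 # first threshold reaching the desired coverage
}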
The plots above show that these methods do bring the coverage of the new credible sets closer to 0.8. Idea 2 can be ruled out as it results in too many credible sets with coverage below the desired value.
Idea 3 only:
Apply correction factor repeatedly with different thresholds to find the required threshold to obtain the desired corrected coverage.
The slow step is simulating posterior probabilities. This does not need to be done every time we call the function. In fact, we now simulate the pps FIRST and then input them (a list of matrices) into the quick_corrcov function, along with the threshold and the original posterior probabilities. This function uses the C++ credsetmat function to quickly form credible sets and finds the proportion of these containing the CV, weighted by the posterior probabilities. This is VERY quick (~4 secs for 1000 simulations for each SNP considered causal = 2e+05 credible sets).
The quick_corrcov function takes approximately 4 seconds to report the corrected coverage estimate of the credible set, using a specified threshold.
## "if you use this threshold, the resultant credible set has this corrected coverage"
quick_corrcov <- function(thr, simulated.pps, pp){
n_pps <- length(simulated.pps)
args <- 1:nsnps
d5 <- lapply(1:n_pps, function(x) {
credset3(simulated.pps[[x]], CV = rep(args[x], dim(simulated.pps[[x]])[1]), thr = thr)
})
prop_cov <- lapply(d5, pred_na) %>% unlist()
sum(prop_cov * pp)
}
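For example, with the simulated pps computed once up front (the pps list of matrices used later in these notes) and the original posterior probabilities pp0, different thresholds can then be compared cheaply:
quick_corrcov(0.8, pps, pp0)
quick_corrcov(0.9, pps, pp0)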
The quick_corrcov_cs function takes approximately 4 seconds to report a list of the corrected coverage estimate, the new credible set, the threshold used and the size.
## "if you use this threshold, the resultant credible set is this and has this corrected coverage"
quick_corrcov_cs <- function(thr, simulated.pps, pp){
n_pps <- length(simulated.pps)
args <- 1:nsnps
d5 <- lapply(1:n_pps, function(x) {
credset3(simulated.pps[[x]], CV = rep(args[x], dim(simulated.pps[[x]])[1]), thr = thr)
})
prop_cov <- lapply(d5, pred_na) %>% unlist()
corr_cov <- sum(prop_cov * pp)
o <- order(pp, decreasing = TRUE)
cumpp <- cumsum(pp[o])
wh <- which(cumpp > thr)[1]
list(credset = names(pp)[o[1:wh]], corr.cov = corr_cov, thr = thr, size = cumpp[wh])
}
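For example, quick_corrcov_cs(0.9, pps, pp0) returns the credible set formed at threshold 0.9 together with its corrected coverage, the threshold used and its size (again assuming the pps and pp0 objects used in these notes).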
BUT the aim of all of these 'ideas' is to provide the user with the threshold they should be using to obtain a credible set with the desired coverage. Of course, they could keep running quick_corrcov_cs above until the corrected coverage is where they want it to be, but we need to automate this.
We could use the following function to obtain a credible set with corrected coverage within some accuracy of the desired coverage. This uses C++ code to generate the credible sets from the simulated pps; do we really need to transfer the whole thing to C? We also need to decide when to break out of the function, i.e. if it is impossible to get within this accuracy because one CV holds all the pp. A counter is included for this; note that it must be incremented before each next, otherwise the loop never stops (this was why it wasn't stopping after 4 goes).
new_cs <- function(desired.cov, acc = 0.05, simulated.pps, pp){
  corr.cov <- quick_corrcov(desired.cov, simulated.pps, pp)
  is_within_tolerance <- FALSE
  thr1 <- desired.cov
  counter <- 1
  while (!is_within_tolerance && counter < 4) { # give up after 4 attempts
    if(dplyr::between(corr.cov, desired.cov, desired.cov + acc)){
      is_within_tolerance <- TRUE
    }
    # if corr.cov is too high
    if(corr.cov > desired.cov + acc){ # then run again with lower threshold (remove variants)
      thr1 <- thr1 - (corr.cov - desired.cov)
      corr.cov <- quick_corrcov(thr1, simulated.pps, pp)
      counter <- counter + 1 # increment before next, otherwise the loop never stops
      next
    }
    # if corr.cov is too low
    if(corr.cov < desired.cov){ # then run again with higher threshold (add variants)
      thr1 <- thr1 + (desired.cov - corr.cov)
      corr.cov <- quick_corrcov(thr1, simulated.pps, pp)
      counter <- counter + 1
      next
    }
  }
  o <- order(pp, decreasing = TRUE) # order index for true pp
  cumpp <- cumsum(pp[o])            # cumulative sums of the ordered pps
  wh <- which(cumpp > thr1)[1]      # how many variants needed to exceed thr
  list(credset = o[1:wh], corr.cov = corr.cov, thr = thr1, counter = counter)
}
new_cs(0.8, 0.05, pps, pp0)
A better (more robust) approach is the method below, which systematically removes/adds variants from the credible set until we have a new credible set with corrected coverage within some accuracy of the desired coverage.
Odd scenarios I had to consider are handled by the warnings in the function: a credible set containing only one variant (which cannot be made any smaller), and the case where even the smallest credible set with corrected coverage above the desired coverage overshoots the accuracy band.
desiredcov_cs <- function(desired.cov, acc = 0.05, simulated.pps, pp){
  n_pps <- length(simulated.pps)
  args <- 1:length(pp)
  thr <- desired.cov
  d5 <- lapply(1:n_pps, function(x) {
    credset3(simulated.pps[[x]], CV = rep(args[x], dim(simulated.pps[[x]])[1]), thr = thr)
  })
  prop_cov <- lapply(d5, pred_na) %>% unlist()
  corr_cov <- sum(prop_cov * pp)
  o <- order(pp, decreasing = TRUE)
  cumpp <- cumsum(pp[o])
  wh <- which(cumpp > thr)[1] # number of variants in the initial credible set
  is_within_tolerance <- FALSE
  while (!is_within_tolerance) {
    if(dplyr::between(corr_cov, desired.cov, desired.cov + acc)){
      is_within_tolerance <- TRUE
    }
    if(wh == 1){
      warning("Only 1 variant in the credible set, so can't make it any smaller")
      is_within_tolerance <- TRUE
    }
    # if corr_cov is too high
    if(wh != 1 & corr_cov > desired.cov + acc){
      wh <- wh - 1 # remove a variant from the credible set
      # new thr: midway between the cumulative pp of wh-1 and wh variants (taking the cumulative pp of 0 variants as 0),
      # so that the new credible set contains exactly wh variants
      thr <- (c(0, cumpp)[wh] + cumpp[wh])/2
      corr_cov <- quick_corrcov(thr, simulated.pps, pp)
      next
    }
    # if corr_cov is too low
    if(wh != 1 & corr_cov < desired.cov){
      wh <- wh + 1 # add a variant to the credible set
      thr <- (cumpp[wh - 1] + cumpp[wh])/2 # new thr: midway between the cumulative pp of wh-1 and wh variants
      corr_cov <- quick_corrcov(thr, simulated.pps, pp)
      if(corr_cov > desired.cov + acc){
        warning("This is the smallest credible set with corrected coverage above the desired coverage")
        is_within_tolerance <- TRUE
      }
      next
    }
  }
  list(credset = names(pp)[o[1:wh]], corr_cov = corr_cov, thr = thr, size = cumpp[wh], nvar = wh)
}
I've tried this with several datasets and it looks really promising: very quick and accurate.
Note that this function can be used to find the smallest possible credible set with corrected coverage greater than the desired coverage, by setting acc to a very small value.
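For example (a sketch, again assuming the pps and pp0 objects used above):
desiredcov_cs(0.8, acc = 1e-4, pps, pp0) # smallest credible set with corrected coverage above 0.8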
It appears that there isn't much difference between the choices of nrep. I further investigate a region where there doesn't seem to be a smooth gradient (e.g. OR = 1.1, thr = 0.99, N0 = N1 = 5000) by running many more simulations from this region (this is in just_ave/check_nrep/ on the HPC) and find that there is a smooth gradient and that nrep = 1000 is best. I use nrep = 1000 going forward as it doesn't take long and seems the best.
The following C++ function reports the cumsum of a vector of pps:
Rcpp::cppFunction('List f2(NumericVector pp) {
  int n = pp.size();
  NumericVector out(n);
  List ret(1);
  out[0] = pp[0];
  // cumulative sum of the posterior probabilities
  for(int i = 1; i < n; ++i) {
    out[i] = out[i - 1] + pp[i];
  }
  ret[0] = out;
  return(ret);
}')
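As a quick sanity check (hypothetical input), f2(c(0.1, 0.2, 0.3))[[1]] should match cumsum(c(0.1, 0.2, 0.3)).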
The corrcov function and its helper corrected_cov:
corrected_cov <- function(mu, V, Sigma, pp0, thresh, size, nrep = 1000) {
  nsnps <- length(pp0)
  # form joint z-score vectors
  temp <- diag(x = mu, nrow = nsnps, ncol = nsnps)
  zj <- lapply(seq_len(nrow(temp)), function(i) temp[i, ]) # nsnps zj vectors, one for each snp considered causal
  # simulate pp systems
  pps <- mapply(zj_pp, zj, V, MoreArgs = list(nrep = nrep, Sigma = Sigma), SIMPLIFY = FALSE)
  # consider a different CV as causal in each list
  n_pps <- length(pps)
  args <- 1:nsnps
  # obtain the credible set for each simulation and the proportion containing the CV
  prop_cov <- lapply(1:n_pps, function(x) {
    tmp = credset3(pps[[x]], CV = rep(args[x], nrep), thr = thresh)
    sum(tmp[[2]])/length(tmp[[2]])
  }) %>% unlist()
  # final corrected coverage value
  sum(prop_cov * pp0)
}
corrcov <- function(z, f, N0, N1, Sigma, thr, W = 0.2) {
  ph0.tmp <- z0_pp(z, f, type = "cc", N = N0 + N1, s = N1/(N0 + N1), W = W)
  ph0 <- ph0.tmp[1]      # prob of the null
  pp0dash <- ph0.tmp[-1] # pps of the variants
  varbeta <- coloc:::Var.data.cc(f, N0 + N1, N1/(N0 + N1)) # variance of the estimated effect size
  pp0 <- ppfunc(z, V = varbeta, W = W) # posterior probabilities of the system
  muhat <- est_mu(z, f, N0, N1)
  corrected_cov(mu = muhat, V = varbeta, Sigma = Sigma, pp0 = pp0, thresh = thr)
}
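A hypothetical call, assuming z (marginal Z scores), f (minor allele frequencies) and LD (the SNP correlation matrix) are already in the workspace, with sample sizes and threshold as in the example region above:
corrcov(z, f, N0 = 5000, N1 = 5000, Sigma = LD, thr = 0.99)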
On speeding up the simulation step that feeds quick_corrcov: .rmvnorm already uses C source code by default, so I'm not sure converting this to C would help too much (profiling shows that this is the slow stage). It takes ~13 secs for 1000 simulations for each of the 200 SNPs considered causal. The zj_pp function used to simulate the pps is:
zj_pp <- function(Zj, V, nrep = 5000, W = 0.2, Sigma) {
  exp.zm = Zj %*% Sigma # find E(Z_m)
  zstar = mvtnorm::rmvnorm(nrep, exp.zm, Sigma) # nrep rows, nsnps cols
  r <- W^2/(W^2 + V)
  bf = 0.5 * (log(1 - r) + (r * zstar^2)) # log ABFs
  denom <- coloc:::logsum(bf) # logsum(x) = max(x) + log(sum(exp(x - max(x)))) so the sum is not infinite
  pp.tmp <- exp(bf - denom) # convert back from the log scale
  pp.tmp/rowSums(pp.tmp) # normalise each row (simulation) to sum to 1
}
pps <- mapply(zj_pp, zj, V, MoreArgs = list(nrep = nrep, Sigma = LD), SIMPLIFY = FALSE)
At the moment, corrcov uses the following parameters: z, f, N0, N1, Sigma, thr, W = 0.2. In terms of replacing the \(Z\) scores and minor allele frequencies with \(\hat\beta\) and \(V\), I don't think this will work since the mafs are needed to find \(V\) (Var.data.cc) and the ABFs (approx.bf.p) within the function body.
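For reference, the case-control approximation used by Var.data.cc is roughly \(V \approx 1/\{2Nf(1-f)s(1-s)\}\), where \(N = N_0 + N_1\) and \(s = N_1/(N_0+N_1)\), so \(V\) cannot be computed without the maf \(f\).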
Can I get going with the real T1D data?
In terms of finding the required threshold, the desiredcov_cs function above is working very well. I need to consider other scenarios where it may not work.
Any ideas where I can get my poster printed at A0 - do I have to pay for this?
Tips for poster presentation? (I am going to print off A4 versions to hand out at the conference if anyone is interested.)
Where can I get the data from the T1D paper - how shall I get going with this?
Is my desiredcov_cs function good enough to put into my package? I'll include a vignette on using this too.