We found a couple of bugs in the code. I have corrected these and re-run the simulations.
ppfunc <- function(z, V, W = 0.2) {
stopifnot(is.numeric(z))
r = W^2/(W^2 + V)
bf = 0.5 * (log(1 - r) + (r * z^2)) # log BFs
denom = logsum(bf) # log of the normalising denominator (the BFs are on the log scale, so we cannot simply sum them)
exp(bf - denom) # convert back from the log scale
}
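ppfunc calls a logsum helper that is not shown here. A minimal sketch, assuming it is the standard numerically stable log-sum-exp (computing log(sum(exp(x))) after factoring out the maximum):

```r
# Hypothetical minimal logsum helper (assumption: this matches the helper
# that ppfunc uses). Factoring out max(x) avoids overflow when the log BFs
# are large.
logsum <- function(x) {
  mx <- max(x)
  mx + log(sum(exp(x - mx)))
}

logsum(log(c(1, 2, 3)))  # equals log(6), computed stably
```

Without the max trick, `log(sum(exp(x)))` overflows to `Inf` for large log BFs (e.g. `x = c(1000, 1001)`), whereas this version stays finite.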
but it was incorrect when z is a matrix. How? Our (log) BFs were wrong because we were multiplying r by z column-wise rather than row-wise. I was also calculating the logsum of the whole matrix of BFs (obtaining one value) rather than for each row separately to obtain one value per simulation (although this didn't matter, as we normalised the PPs afterwards).
BF <- 0.5 * (log(1 - r) + (r * zstar^2))
where zstar is an nsim*nsnp matrix of Z-scores and r is a vector of length nsnp of the shrinkage factors. So r is recycled into a matrix to do the multiplications, but it is recycled column-wise rather than row-wise! This was identified because Chris tried to match my results by applying ppfunc over each row of zstar and we were getting different results. To fix this, I have created a .zj_pp utility function that is used within all the corrected coverage functions (note the difference in calculating the BFs, which now recycles r row-wise):

.zj_pp = function(Zj, int.Sigma, int.nrep, int.ERR, int.r) {
stopifnot(is.numeric(Zj))
stopifnot(is.matrix(int.ERR))
stopifnot(is.numeric(int.r))
exp.zm = Zj %*% int.Sigma # expected Z-scores
mexp.zm = matrix(exp.zm, int.nrep, length(Zj), byrow = TRUE) # expected Z-scores replicated in each row
zstar = mexp.zm + int.ERR # MATRIX of simulated Z-scores
bf = 0.5 * t(log(1 - int.r) + (int.r * t(zstar^2))) # log BFs, recycling int.r row-wise
denom = apply(bf, 1, logsum) # a separate denominator for each replicate
exp(bf - denom) # convert back from the log scale
}
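The recycling issue can be seen with a toy example (hypothetical values below). R recycles a vector down the columns of a matrix, so multiplying per-SNP shrinkage factors into an nsim x nsnp matrix of Z-scores needs the double transpose (or a sweep):

```r
r     <- c(0.1, 0.2, 0.3)          # hypothetical per-SNP shrinkage factors (length nsnp)
zstar <- matrix(1:6, nrow = 2)     # toy 2 x 3 matrix: nsim rows, nsnp columns

wrong <- r * zstar^2               # r is recycled down the columns: SNPs get mixed up
right <- t(r * t(zstar^2))         # transpose so recycling runs along SNPs, then transpose back

# sweep() states the intent directly: multiply column j by r[j]
all.equal(right, sweep(zstar^2, 2, r, `*`))  # TRUE
```

`sweep(zstar^2, 2, r, `*`)` is perhaps the clearer way to write the row-wise multiplication, but the double-transpose form used in `.zj_pp` is equivalent.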
The code for estimating \(\mu\) was incorrect and has now been amended. We find that our four standard estimates of \(\mu\) are no longer accurate, where:
A = max(abs(z))
B = sum(abs(z) * ppdash)
C = (1 - ppdash[h0]) * max(abs(z))
D = mean(c(B, C))
Let’s consider some alternative estimators:
muhatA = max(abs(z))
muhatE = sum(abs(z0) * pp0)
muhatF = (muhatA+muhatE)/2
muhatG = muhatE * pph0 + muhatA * (1-pph0)
It seems that either E (smallest absolute error) or F (avoiding negative bias) is best. Having investigated how these perform in our correction, we decide to use muhatE = sum(abs(z0) * pp0), as this never estimates the coverage to be too high (better to be conservative).
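Since muhatE is a weighted average of the absolute Z-scores with weights summing to 1, it can never exceed muhatA = max(abs(z)), which is why it errs on the conservative side. A toy check, with entirely hypothetical values for z0, pp0 and pph0:

```r
z0   <- c(1.2, 4.8, 2.1)           # hypothetical Z-scores
pp0  <- c(0.05, 0.85, 0.10)        # hypothetical posterior probabilities (sum to 1)
pph0 <- 0.02                       # hypothetical PP of the null model

muhatA <- max(abs(z0))
muhatE <- sum(abs(z0) * pp0)
muhatF <- (muhatA + muhatE) / 2
muhatG <- muhatE * pph0 + muhatA * (1 - pph0)

muhatE <= muhatA  # always TRUE: a convex combination cannot exceed the maximum
```

By the same argument, muhatF and muhatG always lie between muhatE and muhatA, so all three are bounded above by the (upwardly biased) maximum.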
How much better is our corrected coverage estimate than the claimed coverage at genome-wide significance?
What happens when we use the 1000 Genomes reference panel for our UK10K simulations? It seems that our correction is now slightly more likely to overestimate the coverage. However, note that up to 15 SNPs are removed in these simulations as they cannot be matched. Moreover, the SNPs removed are not random - they are more likely to be those with small MAFs etc.
Perhaps it would be worth only using the simulations where at most 5 SNPs were dropped from the analysis, i.e. data <- data[data$nsnps > 195, ] (note only 1594/4000 simulations are left).
It is useful to see how the number of variants in the credible set changes pre- and post-correction. Below is a plot of the difference in nvar in the credible sets, with the median value labelled. Negative values show where variants have been removed from the credible set.
Fairly often, variants are removed from the credible set (improving the resolution), but as we move to extremely high-powered simulations, the credible sets contain a single variant and stay the same.
Black - corrected credible set with fewer variants
Blue - corrected credible set stayed the same
Grey - credible set not corrected (either because its coverage was accurate or because a smaller credible set could not be found - see below)
rs34536443: Not corrected because the 95% credible set contains only the index SNP with PP = 0.9985 (credible set cannot be made smaller).
rs72928038: Not corrected because the 95% credible set contains only the index SNP with PP = 0.9810 (credible set cannot be made smaller).
Note: The plots for all the regions are available here.
How long does the corrcov function take to run?
For my own interest, I investigated the relationship between \(\mu\) and the posterior probability of the null model of no genetic effect.
Note that the PP for the null model becomes negligible for \(\mu>6\).
Simulations to plot:
To discuss with Chris: Paper and figure plan.