There is still a problem with coverage, but particularly in regions with relatively low maximum z-scores - i.e. those SNPs falling below the threshold on a Manhattan plot.
The problem still exists in those simulations representing realistic fine-mapping regions (\(Z>5\)), but it is less severe than we originally thought.
In low OR (~1.05) we see undercoverage, where the true coverage is less than the claimed coverage (saying you are 80% sure when in actual fact it’s nearer 50%). This seems to be exaggurated in more associated regions (plot 1 square (3,1)).
In higher OR (1.1-1.3) and lower association regions (\(maxz0<4\)), we see consistent overcoverage due to the ordering step in the Bayesian approach. This is when the true coverage is greater than the claimed coverage.
In medium associated regions (z approx 5) we see slight overcoverage.
In highly associated regions (\(maxz0>6%\)) we see undercoverage again with the Bayesian method, whereby the true coverage is lower than that claimed.
Originally, we were simulating about the maximum z-score. This means that in low effect sizes, we were simulating about something that is not 0, when in actual fact we should be simulating about 0.
The maxz0 method works when OR is large and when maxz0 is a good estimate of the true \(\mu\).
We consider two Bayesian corrections on \(\mu\). These can be thought of as shrinkage estimates of \(\mu\) given \(iCV=i\).
Obtain a seperate \(\mu\) for each SNP.
At SNP \(i\),
\[H_0: \textrm{i not associated}\] \[H_1: \mu \sim N(0,W)\]
\[E(\mu_i)=E(\mu_i|H_0)P(H_0|zm)+E(\mu_i|H_1)P(H_1|zm)\]
Where
\(E(\mu_i|H_0)=0\): Expected value of \(\mu\) under the null.
\(E(\mu_i|H_1)=\hat\mu\): Expected value of \(\mu\) under the alternative. Recall, under the alternative,
\[Prior: \mu \sim N(0,W)\] \[Posterior: \mu \sim N(\hat\mu,\hat V)\]
Where \(\hat\mu= \frac{1}{\hat V}(\frac{0}{W}+\frac{zm}{1})\) and \(\hat V=(1/1+1/W)^{-1}\).
maf <- maf
n1 <- 5000
n0 <- 5000
n <- n1+n0
w <- 3 # this is the prior variance (omega) - could also be set to omega/V
zm <- z0
# post of mu|H1 ~ N(muhat,vhat)
vhat <- (1+1/w)^(-1)
muhat <- vhat*zm # mu-hat = v-hat (0/w + zm[1]/1)
zmh1 <- dnorm(zm,mean=0,sd=sqrt(1+w)) # p(zm|h1)
zmh0 <- dnorm(zm) # p(zm|h0)
eps <- 10^-4 # choose some small epsilon
ph1 <- eps # p(h1)
ph0 <- 1-eps # p(h0)
h1weight <- (zmh1*ph1)/((zmh1*ph1)+(zmh0*ph0)) # p(h1|data)
mu1 <- muhat*h1weight
\[H_0: \textrm{no association}\] \[H_i: \textrm{SNP i causal}\]
\[E(\mu)=E(\mu|H_0)P(H_0|zm)+\sum_i E(\mu_i|H_i)P(H_i|zm)\]
zm <- z0
pvals <- pnorm(abs(z0),lower.tail=FALSE)*2 # convert z scores to p values
tmp <- coloc:::approx.bf.p(p = pvals,
f=maf, type = "cc", N=n1+n1, s=n1/(n1+n1)) # find ABF
w <- 0.2^2/tmp$V # OR w = 3 (W fixed method)
vhat <- (1+1/w)^(-1)
muhat <- vhat*zm # mu-hat = v-hat (0/w + zm[1]/1)
tmp <- tmp[,"lABF"]
p1 <- 1e-04 # hard coded - bad code - can fix later
nsnps <- length(tmp)
prior <- c(1-nsnps*p1,rep(p1,nsnps))
tmp <- c(1,tmp) # add on extra for null model
my.denom <- coloc:::logsum(tmp + prior)
tmp1 <- exp(tmp - my.denom)
posthi <- tmp1/sum(tmp1)
posthi <- posthi[-1] # drop null again - not sum(posthi)<1 since we have a null model as well
mu2 <- sum(muhat*posthi)
The first Bayesian correction method, single SNP \(\mu\), did not work.
W varied: \(W=omega/V=0.2^2/V\)
W fixed: \(W=3\)
We compare each of these methods against the original \(\mu=max(abs(z0))\) method.
We were hoping that the Bayesian method would provide us with a value for \(\mu\) close to the true effect at the CV (approximated by muz0CV=mean(zsim.tmp[,iCV])
, it can be found exactly using expected_z_sim
).
The above plot indicates that the values selected using the bayesian method are not as good as the maxz0 value.
Seems to.
Potential problem: Get lots of instances where the corrected coverage cannot be found and results in rows with NA, which I just remove. This occurs about 40% of the time. Perhaps there is some bias in when the method works and when it does not?
## thr N
## 1: 0.95 405
## 2: 0.80 549
## 3: 0.60 614
## 4: 0.90 521
The method works when we know the true effect at the causal variant, true \(\mu\). We need to find a \(\hat\mu\) that is a good estimate of the true \(\mu\).
So far, \(\hat\mu=max(abs(z0))\) seems to be the best, but this fails to work in low OR.
We have assumed that z0[iCV] is a good estimate if iCV is known, and it is. /PhD/bayesian.distribution.for.mu/check using likelihood/ shows that the maximum of the likelihood is at z0[iCV], in both low and high OR simulations.
This also shows that \(z0[iCV]=\mu\) s.t. \(likelihood(mu,iCV)\) is maximised. I.e. the maximum of the likelihood is at z0[iCV] in our simulations.
Have also shown that when we take any SNP, i, and allow this to be the non-zero element of Zj, then \(z0[i]=\mu\) s.t \(likelihood(mu,i)\) is maximised.