Method:
Use the posterior probability vector (pp) to define a Dirichlet distribution, and take many samples from it: \[p^* \sim \text{Dirichlet}(pp)\]
Sort each sample into descending order.
Cumulatively sum the elements in each sample until the pre-defined threshold is reached.
Estimate \(p(\text{causal in set})\) by \(\sum_i PP_i \times I(i \in \text{set}^*)\); i.e. go back to the original ordered posterior probabilities and sum the first \(w\) elements, where \(w\) is the number of elements that were needed to reach the threshold in the sample.
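For reference, these steps could be sketched as follows (illustrative code only, not the implementation used later: it assumes the Dirichlet is parameterised directly by pp, uses a 0.5 threshold and rdirichlet from the gtools package; est.true.cov is just a placeholder name):
library(gtools) # for rdirichlet
est.true.cov <- function(pp, nsamples=1000, thr=0.5) {
  pstar <- rdirichlet(nsamples, pp) # samples p* ~ Dirichlet(pp), one per row
  pp.ord <- sort(pp, decreasing=TRUE) # original pp in descending order
  w <- apply(pstar, 1, function(p) which(cumsum(sort(p, decreasing=TRUE)) > thr)[1]) # variants needed per sample
  mean(sapply(w, function(k) sum(pp.ord[1:k]))) # sum the first w original pp, averaged over samples
}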
Problem:
PP samples taken from the Dirichlet distribution are too variable.
Even in the OR=1 scenario, some samples had a single SNP holding >0.9 of the posterior probability. For these samples only one variant (\(w=1\)) is in the credible set, so when we go back to the original system and sum the posterior probabilities of the first \(w\) variants to estimate coverage, we get very low coverage.
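This behaviour is easy to reproduce: because pp sums to 1, the Dirichlet concentration is tiny and many samples pile their mass onto one SNP (an illustrative check assuming 100 SNPs with a flat pp, roughly the OR=1 setting):
library(gtools)
pp.flat <- rep(1/100, 100) # near-flat posterior over 100 SNPs
pstar <- rdirichlet(1000, pp.flat) # Dirichlet samples with alpha = pp
summary(apply(pstar, 1, max)) # many samples have one SNP holding most of the probability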
A function, bigsim, was written to simulate posterior probability samples from the same distribution.
# function to generate pp samples
library(mvtnorm)
library(bindata)
library(simGWAS)
library(magrittr) # for the %>% pipe used below
bigsim <- function(OR=1.2,n0=10000,n1=10000,nsnps=100,nhaps=1000,LD1=0.3,LD2=0.25) {
maf <- runif(nsnps,0.05,0.4) # doesn't run if maf > 0.4
R <- diag(1,nsnps) # banded LD matrix: adjacent SNPs correlated at LD1, next-but-one at LD2
R[cbind(1:(nsnps-1),2:nsnps)] <- LD1
R[cbind(2:nsnps,1:(nsnps-1))] <- LD1
R[cbind(1:(nsnps-2),3:nsnps)] <- LD2
R[cbind(3:nsnps,1:(nsnps-2))] <- LD2
haps <- rmvbin(nhaps, maf, bincorr=R) # simulate correlated reference haplotypes
snps <- colnames(haps) <- paste0("s",1:nsnps)
CV <- sample(snps[which(colMeans(haps)>0.1)],1) # pick one common SNP as the causal variant
freq <- as.data.frame(haps+1) # recode haplotypes as 1/2 for simGWAS
snps <- colnames(freq) <- paste0("s",1:nsnps)
freq$Probability <- 1/nrow(freq) # equal weight on each reference haplotype
g1 <- OR
zsim <- simulated_z_score(N0=n0, # number of controls
N1=n1, # number of cases
snps=snps, # column names in freq of SNPs for which Z scores should be generated
W=CV, # causal variants, subset of snps
gamma.W=log(g1), # log odds ratios
freq=freq, # reference haplotypes
nrep=1000)
psim <- pnorm(abs(zsim),lower.tail=FALSE)*2 # two-sided p-values from the simulated Z scores
ppsim <- lapply(1:nrow(zsim), function(i) {
tmp <- coloc:::approx.bf.p(p = psim[i,],
f=colMeans(haps), type = "cc", N=n0+n1, s=n1/(n0+n1))[,"lABF"] # log ABFs per SNP
p1 <- 1e-04 # prior probability that a SNP is causal (hard coded for now - can fix later)
nsnps <- length(tmp)
prior <- c(1-nsnps*p1,rep(p1,nsnps)) # prior weights for the null model and each SNP
tmp <- c(1,tmp) # add on extra term for the null model
my.denom <- coloc:::logsum(tmp + prior)
exp(tmp - my.denom)[-1] # per-SNP posterior probabilities, dropping the null model
}) %>% do.call("rbind",.) # simulated posterior probabilities (nrep x nsnps)
return(list(ppsim=ppsim,zsim=zsim,psim=psim,CV=as.numeric(sub("s","",CV))))
}
res <- bigsim()
ppsim <- res$ppsim # 1000*nsnps
zsim <- res$zsim
CV <- res$CV # causal variant
NOTE: I am getting this error; I need to check the results are still OK.
The general method to obtain a credible set is given below. The estimate of coverage (‘reported coverage’) is given by the sum of the posterior probabilities of the variants in the credible set.
pp <- lowest[[1]]$PP # choose a vector of posterior probabilities (lowest is defined elsewhere)
or <- order(pp, decreasing=TRUE) # find the indices from ordering
pp.ord <- pp[or] # vector of ordered posterior probabilities
nvar <- which(cumsum(pp.ord)>0.5)[1] # how many variants are needed to reach the threshold
sum(pp.ord[1:nvar]) # sum the first nvar variants in the ordered set = reported coverage
This standard estimate of coverage is called ‘reported coverage’ from now on.
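For convenience, these steps can be wrapped up in a small helper (credset is just an illustrative name, not part of the existing code):
credset <- function(pp, thr=0.5) {
  o <- order(pp, decreasing=TRUE) # order variants by posterior probability
  w <- which(cumsum(pp[o]) > thr)[1] # number of variants needed to pass the threshold
  list(variants=o[1:w], claimed=sum(pp[o[1:w]])) # indices in the set and the reported coverage
}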
We hope to use PP samples to get a more accurate estimate of coverage.
Chris’s idea was to form credible sets from the samples, note how many variants are in each set (\(w\)), and then go back to the original system and sum the first \(w\) variants. Do this for each sample and average.
The following function, h, generates a credible set for each sample. The output is a list of the number of variants in, and the size of, each credible set.
h <- function(i) {
p <- ppsim[i,] # choose a sample (row) from ppsim
o <- order(p,decreasing=TRUE) # indices from ordering the sample
nvar <- which(cumsum(p[o])>0.5)[1] # number of variants needed to reach the threshold
size <- sum(p[o[1:nvar]]) # reported size of this sample's credible set
result <- list(nvar=nvar,size=size)
return(result)
}
cs <- as.data.frame(t(sapply(1:nrow(ppsim), h)))
cs$nvar <- unlist(cs$nvar)
cs$size <- unlist(cs$size)
head(cs)
We can already see the variability in the samples: the first sample is relatively spread out (a fair few variants are needed to reach the threshold), whereas samples 4-6 are very concentrated, with one variant holding >80% of the posterior probability. Note that this was for an OR=1.1 scenario.
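A quick way to see this spread across all of the samples (an extra check, not in the original code):
table(cs$nvar) # number of variants in each sample's credible set
summary(cs$size) # reported size (summed pp) of each sample's credible set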
The bigsim function is used to generate many samples (n). Code:
## Reported coverage
res <- bigsim() # specify OR in here
ppsim <- res$ppsim
pp <- colMeans(ppsim) # true pp
or <- order(pp, decreasing=TRUE)
pp.ord <- pp[or]
nvar <- which(cumsum(pp[or])>0.5)[1]
sum(pp[or[1:nvar]]) # reported coverage using normal method
## Using simulations to estimate true coverage
f <- function(i) {
p <- ppsim[i,] # choose a sample (row) from ppsim
o <- order(p,decreasing=TRUE) # indices from ordering the sample
w <- which(cumsum(p[o])>0.5)[1] # how many variants needed to reach the threshold
sum(pp[o[1:w]]) # sum the true pp of the variants in this sample's credible set
}
mean(sapply(1:nrow(ppsim), f)) # average over samples = estimated true coverage
The ‘cumsum’ column gives the reported coverage (the sum of the posterior probabilities of the variants in the credible set for the true pp, which here is the colMeans of ppsim).
The ‘using simulations’ column uses the simulations and the function f that Chris wrote (above) to provide an estimate of the true coverage.
The bias column is \(\text{using simulations} - \text{cumsum}\), i.e. the simulation estimate of true coverage minus the reported coverage. A negative bias implies the reported coverage (from the standard method) is too high, and a positive bias implies that it is too low.
From Hope’s research, we expect negative bias values for low OR (undercoverage), approximately 0 bias values for medium OR and positive bias values for high OR (overcoverage).
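For completeness, here is a sketch of how comparison tables like those below could be put together (illustrative only; it reuses the reported-coverage and f steps above and computes bias as the simulation estimate minus the reported coverage):
compare <- function(OR) {
  res <- bigsim(OR=OR)
  ppsim <- res$ppsim
  pp <- colMeans(ppsim) # true pp
  o <- order(pp, decreasing=TRUE)
  reported <- sum(pp[o][1:which(cumsum(pp[o]) > 0.5)[1]]) # reported coverage
  simulated <- mean(sapply(1:nrow(ppsim), function(i) {
    p <- ppsim[i,]
    os <- order(p, decreasing=TRUE)
    w <- which(cumsum(p[os]) > 0.5)[1]
    sum(pp[os[1:w]])
  })) # simulation-based estimate of true coverage
  c(cumsum=reported, using.simulations=simulated, bias=simulated-reported)
}
t(sapply(rep(c(1,1.05,1.1,1.15,1.2), each=3), compare)) # three runs per OR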
| | Cumsum | Using simulations | Bias |
| --- | --- | --- | --- |
1) OR=1 | 0.502 | 0.216 | -0.289 |
2) OR=1 | 0.507 | 0.208 | -0.299 |
3) OR=1 | 0.504 | 0.208 | -0.296 |
4) OR=1.05 | 0.504 | 0.228 | -0.276 |
5) OR=1.05 | 0.501 | 0.246 | -0.255 |
6) OR=1.05 | 0.507 | 0.204 | -0.303 |
7) OR=1.1 | 0.648 | 0.630 | -0.018 |
8) OR=1.1 | 0.502 | 0.389 | -0.113 |
9) OR=1.1 | 0.627 | 0.618 | -0.009 |
10) OR=1.15 | 0.993 | 0.992 | -0.001 |
11) OR=1.15 | 0.916 | 0.912 | -0.004 |
12) OR=1.15 | 0.998 | 0.998 | 0 |
13) OR=1.2 | 0.999 | 0.999 | 0 |
14) OR=1.2 | 0.999 | 0.999 | 0 |
15) OR=1.2 | 0.999 | 0.999 | 0 |
| | Cumsum | Using simulations | Bias |
| --- | --- | --- | --- |
1) OR=1 | 0.502 | 0.296 | -0.206 |
2) OR=1 | 0.505 | 0.285 | -0.220 |
3) OR=1 | 0.503 | 0.298 | -0.205 |
4) OR=1.05 | 0.914 | 0.909 | -0.005 |
5) OR=1.05 | 0.880 | 0.876 | -0.004 |
6) OR=1.05 | 0.872 | 0.864 | -0.006 |
7) OR=1.1 | 0.999 | 0.999 | 0 |
8) OR=1.1 | 0.999 | 0.999 | 0 |
9) OR=1.1 | 0.999 | 0.999 | 0 |
10) OR=1.15 | 0.999 | 0.999 | 0 |
11) OR=1.15 | 0.999 | 0.999 | 0 |
12) OR=1.15 | 0.999 | 0.999 | 0 |
13) OR=1.2 | 0.999 | 0.999 | 0 |
14) OR=1.2 | 0.999 | 0.999 | 0 |
15) OR=1.2 | 0.999 | 0.999 | 0 |
| | Cumsum | Using simulations | Bias |
| --- | --- | --- | --- |
1) OR=1 | 0.502 | 0.208 | -0.294 |
2) OR=1 | 0.502 | 0.209 | -0.293 |
3) OR=1 | 0.501 | 0.208 | -0.293 |
4) OR=1.05 | 0.502 | 0.207 | -0.295 |
5) OR=1.05 | 0.505 | 0.205 | -0.3 |
6) OR=1.05 | 0.502 | 0.217 | -0.285 |
7) OR=1.1 | 0.739 | 0.726 | -0.013 |
8) OR=1.1 | 0.806 | 0.794 | -0.012 |
9) OR=1.1 | 0.501 | 0.400 | -0.101 |
10) OR=1.15 | 0.999 | 0.999 | 0 |
11) OR=1.15 | 0.829 | 0.817 | -0.012 |
12) OR=1.15 | 0.978 | 0.977 | -0.001 |
13) OR=1.2 | 0.997 | 0.997 | 0 |
14) OR=1.2 | 0.997 | 0.996 | -0.001 |
15) OR=1.2 | 0.999 | 0.999 | 0 |
| | Cumsum | Using simulations | Bias |
| --- | --- | --- | --- |
1) OR=1 | 0.502 | 0.206 | -0.296 |
2) OR=1 | 0.500 | 0.205 | -0.295 |
3) OR=1 | 0.505 | 0.208 | -0.297 |
4) OR=1.05 | 0.502 | 0.235 | -0.267 |
5) OR=1.05 | 0.505 | 0.207 | -0.298 |
6) OR=1.05 | 0.502 | 0.208 | -0.294 |
7) OR=1.1 | 0.542 | 0.533 | -0.009 |
8) OR=1.1 | 0.544 | 0.537 | -0.007 |
9) OR=1.1 | 0.850 | 0.840 | -0.010 |
10) OR=1.15 | 0.979 | 0.977 | -0.002 |
11) OR=1.15 | 0.998 | 0.997 | -0.001 |
12) OR=1.15 | 0.994 | 0.993 | -0.001 |
13) OR=1.2 | 0.999 | 0.999 | 0 |
14) OR=1.2 | 0.999 | 0.999 | 0 |
15) OR=1.2 | 0.999 | 0.999 | 0 |
The results show consistent undercoverage for low OR: the reported coverage is close to the threshold (~0.5) but our estimate of the true coverage is closer to 20%. This supports Hope’s findings.
The reported coverage is approximately correct for medium power systems (OR=1.1).
Hope’s research (and intuition) implies that there would be overcoverage for high OR (the reported coverage is lower than the true coverage). We do not see this in our simulations; why?
Error message when using the bigsim function.
Does it matter that the pp samples in the simulation do not sum to 1? Should the function be amended with the line below? (I have tried it with this and get similar results.)
exp(tmp - my.denom)[-1]/sum(exp(tmp - my.denom)[-1])
We need a way to increase LD so that our posterior probability systems do not just consist of the CV with high posterior probability and the rest hovering around zero.
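One simple option (a suggestion only, not something run for these results) would be to raise the LD1 and LD2 arguments of bigsim, although rmvbin can fail if the requested maf/correlation combination is infeasible:
res.ld <- bigsim(OR=1.2, LD1=0.6, LD2=0.5) # stronger LD between neighbouring SNPs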
Although when using your original bigsim function with the following parameters, I STILL couldn't identify overcoverage using this method.
(Figures: Block 2, Block 8, Block 13, Block 18, Block 21, Block 23.)