Estimating coverage using samples from the same distribution


1. Dirichlet method:

Method:

  1. Use the posterior probability vector (\(PP\)) to define a Dirichlet distribution and take many samples from it: \[p^* \sim \text{Dirichlet}(PP)\]

  2. Sort each sample into descending order.

  3. Cumulatively sum the elements in each sample until the pre-defined threshold is reached.

  4. Estimate \(P(\text{causal variant} \in \text{set})\) by \(\sum_i PP_i \times I(i \in set^*)\), i.e. go back to the original ordered posterior probabilities and sum the first \(w\) elements, where \(w\) is the number of elements that were needed to reach the threshold in the sample.
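A minimal sketch of this procedure is given below, assuming the rdirichlet function from the gtools package; the function name dirichlet.coverage and its arguments are illustrative, not part of the original code. Step 4 is implemented here in the \(\sum_i PP_i \times I(i \in set^*)\) form, summing the original posterior probabilities over the variants in each sampled credible set.

library(gtools) # rdirichlet

# pp: original posterior probability vector; thr: credible set threshold
dirichlet.coverage <- function(pp, nsamples=1000, thr=0.5) {
  samples <- rdirichlet(nsamples, pp) # step 1: p* ~ Dirichlet(pp)
  cov.est <- apply(samples, 1, function(p) {
    o <- order(p, decreasing=TRUE) # step 2: sort the sample into descending order
    w <- which(cumsum(p[o])>thr)[1] # step 3: cumulatively sum until the threshold is reached
    sum(pp[o[1:w]]) # step 4: sum the original PPs of these w variants
  })
  mean(cov.est) # average over samples
}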


Problem:

  • PP samples taken from the Dirichlet distribution are too variable.

  • Even for the OR=1 scenario, some samples had one SNP holding >0.9 of the posterior probability. For these samples only 1 variant (\(w=1\)) enters the credible set, so when we go back to the original system and sum the posterior probabilities of the first \(w\) variants to estimate coverage, we get a very low coverage estimate (illustrated below).
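For intuition on this variability, a quick illustrative check (again assuming gtools::rdirichlet; this is not one of the simulations reported later in this document):

library(gtools)

set.seed(1)
pp.flat <- rep(1/100, 100) # a flat 100-SNP posterior probability vector, as in the OR=1 scenario
sims <- rdirichlet(1000, pp.flat) # p* ~ Dirichlet(pp); the concentration parameters sum to 1
mean(apply(sims, 1, max) > 0.9) # proportion of samples in which a single SNP holds >0.9 of the PP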


2. Investigate variability in PP samples

A function, bigsim, was written to simulate posterior probability samples from the same distribution.

# function to generate pp samples
library(mvtnorm)
library(bindata)  # rmvbin, to simulate correlated binary haplotypes
library(simGWAS)  # simulated_z_score
library(magrittr) # for the %>% pipe used below

bigsim <- function(OR=1.2,n0=10000,n1=10000,nsnps=100,nhaps=1000,LD1=0.3,LD2=0.25) {
  
  maf <- runif(nsnps,0.05,0.4) # doesn't run if MAF > 0.4
  
  R<-diag(1,nsnps)
  R[cbind(1:(nsnps-1),2:nsnps)] <- LD1
  R[cbind(2:nsnps,1:(nsnps-1))] <- LD1
  R[cbind(1:(nsnps-2),3:nsnps)] <- LD2
  R[cbind(3:nsnps,1:(nsnps-2))] <- LD2
  haps=rmvbin(nhaps, maf, bincorr=R) 
  snps <- colnames(haps) <- paste0("s",1:nsnps)
  
  CV=sample(snps[which(colMeans(haps)>0.1)],1) # randomly choose one common SNP (frequency > 0.1) as the causal variant
  
  freq=haps+1
  freq=as.data.frame(freq)
  snps <- colnames(freq) <- paste0("s",1:nsnps)
  freq$Probability <- 1/nrow(freq)
  
  g1 <- OR
  
  zsim <- simulated_z_score(N0=n0, # number of controls
                            N1=n1, # number of cases
                            snps=snps, # column names in freq of SNPs for which Z scores should be generated
                            W=CV, # causal variants, subset of snps
                            gamma.W=log(g1), # log odds ratios
                            freq=freq, # reference haplotypes
                            nrep=1000)
  psim <- pnorm(abs(zsim),lower.tail=FALSE)*2
  
  ppsim <- lapply(1:nrow(zsim), function(i) {
    tmp <- coloc:::approx.bf.p(p = psim[i,],
                               f=colMeans(haps), type = "cc", N=n0+n1, s=n1/(n0+n1))[,"lABF"]
    p1 <- 1e-04 # hard coded - bad code - can fix later
    nsnps <- length(tmp)
    prior <- c(1-nsnps*p1,rep(p1,nsnps))
    tmp <- c(1,tmp) # add on extra for null model
    my.denom <- coloc:::logsum(tmp + prior)
    exp(tmp - my.denom)[-1]
  })  %>% do.call("rbind",.) # simulated posterior probabilities
  return(list(ppsim=ppsim,zsim=zsim,psim=psim,CV=as.numeric(sub("s","",CV))))
}

res <- bigsim()    
ppsim <- res$ppsim # 1000*nsnps
zsim <- res$zsim
CV <- res$CV # causal variant

NOTE: I am getting an error here; need to check that the results are still OK.


3. Obtaining credible sets


The general method to obtain a credible set is given below. The estimate of coverage (‘reported coverage’) is given by the sum of the posterior probabilities of the variants in the credible set.

pp <- lowest[[1]]$PP # choose a vector of posterior probabilities
or <- order(pp, decreasing=TRUE) # indices that order the PPs in decreasing order
pp.ord <- pp[or] # vector of ordered posterior probabilities
nvar <- which(cumsum(pp.ord)>0.5)[1] # how many variants are needed to reach the threshold (0.5)
sum(pp.ord[1:nvar]) # reported coverage: the sum of the first nvar ordered PPs

This standard estimate of coverage is referred to as ‘reported coverage’ from now on.

We hope to use PP samples to get a more accurate estimate of coverage.

Chris’s idea was to form a credible set from each sample, note how many variants it contains (\(w\)), and then go back to the original system and sum the posterior probabilities of these \(w\) variants. Do this for each sample and average the results.


The following function, h, generates a credible set for each sample. The output is a list containing the number of variants in each credible set (nvar) and its size (size), i.e. the sum of the posterior probabilities of the variants it contains.

h=function(i) { 
  p=ppsim[i,] # choose a sample (row) from ppsim
  o=order(p,decreasing=TRUE) # indices from ordering sample
  nvar=which(cumsum(p[o])>0.5)[1] 
  size=sum(p[o[1:nvar]])
  result <- list(nvar=nvar,size=size)
  return(result)
}
cs <- as.data.frame(t(sapply(1:nrow(ppsim), h)))
cs$nvar <- unlist(cs$nvar)
cs$size <- unlist(cs$size)
head(cs)

We can already see the variability in the samples: the first sample is relatively flat (a fair few variants are needed to reach the threshold), whereas in samples 4-6 the probability is highly concentrated, with one variant holding >80% of the posterior probability. Note that this was for an OR=1.1 scenario.
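The spread across all samples can be summarised directly; a small illustrative check on the cs data frame built above:

summary(cs$nvar) # distribution of the number of variants needed to reach the threshold
mean(cs$nvar == 1) # proportion of samples in which a single variant already exceeds the threshold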


Method

  1. Use the bigsim function to generate many samples (n).
  2. For each of these, form a credible set using the standard method and note how many variants are in each credible set (\(w_1,...,w_n\)).
  3. For each of \(w_1,...,w_n\), go back to the original system and sum the posterior probabilities of these variants.
  4. Average the results to get an estimate of true coverage.

Code:

## Reported coverage
res <- bigsim() # specify OR in here
ppsim <- res$ppsim
pp <- colMeans(ppsim) # true pp
or <- order(pp, decreasing=TRUE) 
pp.ord <- pp[or] 
nvar <- which(cumsum(pp[or])>0.5)[1] 
sum(pp[or[1:nvar]]) # reported coverage using normal method


## Using simulations to estimate true coverage
f=function(i) { 
  p=ppsim[i,] # choose a sample (row) from ppsim
  o=order(p,decreasing=TRUE) # indices from ordering sample
  w=which(cumsum(p[o])>0.5)[1]  # how many variants needed to reach thr
  sum(pp[o[1:w]]) 
}
mean(sapply(1:nrow(ppsim), f))
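
For reference, below is a hedged sketch of how each row of the tables in the next section could be assembled. The one.run wrapper is illustrative (not part of the original code), nrep is fixed at 1000 inside bigsim, and Bias is the simulation-based estimate minus the reported coverage.

# illustrative wrapper: compute one row of the results tables for a given OR
one.run <- function(OR=1, n0=10000, n1=10000) {
  res <- bigsim(OR=OR, n0=n0, n1=n1)
  ppsim <- res$ppsim
  pp <- colMeans(ppsim) # 'true' pp
  o <- order(pp, decreasing=TRUE)
  nvar <- which(cumsum(pp[o])>0.5)[1]
  reported <- sum(pp[o[1:nvar]]) # 'Cumsum' column (reported coverage)
  simulated <- mean(sapply(1:nrow(ppsim), function(i) { # 'Using simulations' column
    p <- ppsim[i,]
    oi <- order(p, decreasing=TRUE)
    w <- which(cumsum(p[oi])>0.5)[1]
    sum(pp[oi[1:w]])
  }))
  c(Cumsum=reported, Simulations=simulated, Bias=simulated-reported)
}

# e.g. three repeats at OR=1.1:
# t(replicate(3, one.run(OR=1.1)))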

Results



1. N0=N1=10000, 1000 simulations

                Cumsum   Using simulations   Bias
 1)  OR=1       0.502    0.216               -0.289
 2)  OR=1       0.507    0.208               -0.299
 3)  OR=1       0.504    0.208               -0.296
 4)  OR=1.05    0.504    0.228               -0.276
 5)  OR=1.05    0.501    0.246               -0.255
 6)  OR=1.05    0.507    0.204               -0.303
 7)  OR=1.1     0.648    0.630               -0.018
 8)  OR=1.1     0.502    0.389               -0.113
 9)  OR=1.1     0.627    0.618               -0.009
10)  OR=1.15    0.993    0.992               -0.001
11)  OR=1.15    0.916    0.912               -0.004
12)  OR=1.15    0.998    0.998                0
13)  OR=1.2     0.999    0.999                0
14)  OR=1.2     0.999    0.999                0
15)  OR=1.2     0.999    0.999                0

2. N0=N1=50000 (to try and cut down variability), 1000 simulations

                Cumsum   Using simulations   Bias
 1)  OR=1       0.502    0.296               -0.206
 2)  OR=1       0.505    0.285               -0.220
 3)  OR=1       0.503    0.298               -0.205
 4)  OR=1.05    0.914    0.909               -0.005
 5)  OR=1.05    0.880    0.876               -0.004
 6)  OR=1.05    0.872    0.864               -0.006
 7)  OR=1.1     0.999    0.999                0
 8)  OR=1.1     0.999    0.999                0
 9)  OR=1.1     0.999    0.999                0
10)  OR=1.15    0.999    0.999                0
11)  OR=1.15    0.999    0.999                0
12)  OR=1.15    0.999    0.999                0
13)  OR=1.2     0.999    0.999                0
14)  OR=1.2     0.999    0.999                0
15)  OR=1.2     0.999    0.999                0
  • Notice that, because the sample size is so large, the power is high: even a small increase in effect size (e.g. OR=1.05) produces highly concentrated posterior probability systems, with one SNP holding most of the probability.

3. N0=N1=10000, 5000 simulations

                Cumsum   Using simulations   Bias
 1)  OR=1       0.502    0.208               -0.294
 2)  OR=1       0.502    0.209               -0.293
 3)  OR=1       0.501    0.208               -0.293
 4)  OR=1.05    0.502    0.207               -0.295
 5)  OR=1.05    0.505    0.205               -0.300
 6)  OR=1.05    0.502