Motivation

We have found systematic bias in the coverage estimates of credible sets comprising putative causal variants in single causal variant Bayesian genetic fine-mapping. The procedure for forming these credible sets is outlined below, with asterisks marking possible causes of, or contributors to, this bias.

We think it is unlikely that the prior variance has been mis-specified or that the ABFs are inaccurate. We therefore start with the third asterisk and investigate whether the posterior probabilities (PPs) themselves are biased. To do this, we simulate some data and plot the estimated probability of causality against the claimed posterior probability. The following plot shows the prediction from a \(logit(y)\sim logit(x)\) model fitted to 10000 simulations, where \(y\) is a binary indicator of SNP causality and \(x\) is the claimed posterior probability. We find that the PPs themselves are well calibrated empirically, suggesting that it is the method of forming the credible sets (sort and sum) that is causing the bias.
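
A minimal sketch of how such a calibration curve could be produced, assuming a hypothetical data frame sims with a column pp of claimed posterior probabilities and a binary column causal (both names are illustrative, not the actual simulation objects):

## hypothetical data frame: one row per simulated SNP
## sims$pp     - claimed posterior probability of causality
## sims$causal - 1 if the SNP is the simulated causal variant, 0 otherwise

## logistic regression of causality on logit(PP)
fit <- glm(causal ~ qlogis(pp), data = sims, family = binomial)

## predicted probability of causality across the range of claimed PPs
grid <- data.frame(pp = seq(0.01, 0.99, by = 0.01))
pred <- predict(fit, newdata = grid, type = "response")

## perfect calibration corresponds to the y = x line
plot(grid$pp, pred, type = "l", xlab = "Claimed PP", ylab = "Estimated P(causal)")
abline(0, 1, col = 2)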

We hypothesize that the problem could be related to the flat prior routinely used in the method, whereby the prior probability of causality is the same for every SNP. Whilst this is appropriate for calculating the per-SNP posterior probabilities of causality, it seems unrealistic to maintain this “flat belief” once the SNPs have been sorted into descending order of posterior probability. This motivates the idea of a “credible set prior” that depends on the ordering of the SNPs.


The Dirichlet distribution

The flat prior probability of causality across all SNPs is the expectation of a Dirichlet distribution with parameter \(\boldsymbol{\alpha}=\mathbf{1}\). We consider taking samples from this distribution.

Input to Dirichlet: a vector of \(k\) positive concentration parameters.

Output of Dirichlet: a vector of length \(k\) that defines a probability distribution (non-negative values summing to 1).

##' Function to generate random deviates from the Dirichlet distribution.
##'
##' @title Random deviates from the Dirichlet distribution
##' @param n number of random vectors to generate
##' @param a vector of concentration parameters
##' @return an n x length(a) matrix whose rows are probability vectors summing to 1
rdirichlet <- function(n, a){
  l <- length(a)
  x <- matrix(rgamma(l * n, a), ncol = l, byrow = TRUE) # l*n Gamma(a_i, 1) draws, one row per sample
  sm <- x %*% rep(1, l) # row sums
  x / as.vector(sm) # normalise each row to sum to one
}

nsnps <- 100
x <- rdirichlet(10000, rep(1, nsnps))
plot(colMeans(x), ylim = c(0, 0.02)) # will tend to 1/nsnps
abline(h = 0.01, col = 2)

We consider sampling a single “variant” from a realisation of the Dirichlet distribution (pi). Analogous to the fine-mapping method, where variants are sorted into descending order of their posterior probabilities, we sort the pi vector into descending order and record the position of the sampled variant in this ordering. We expect it to fall near the front of the ordering, as these positions have higher associated probabilities.

##' Draw one realisation of pi from the Dirichlet distribution, sample a single
##' SNP with probabilities pi, and return a 1 at the position of the sampled SNP
##' in the descending ordering of pi.
##'
##' @title Position of a sampled SNP in the pi ordering
##' @param nsnps number of SNPs
##' @param hyper common Dirichlet concentration parameter
##' @return binary vector of length nsnps with a single 1
draw_order <- function(nsnps, hyper = 1){
  pi <- rdirichlet(1, rep(hyper, nsnps))
  draws <- sample(nsnps, 1, prob = pi)
  sapply(order(pi, decreasing = TRUE), function(x) sum(x == draws)) # 1 where the sampled SNP falls in the ordering
}

draw_order(nsnps)
##   [1] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

We can repeat this process many times, calculate the proportion of times each position in the ordering was sampled, and use these proportions as our “reweighted prior”. The reweighted prior captures our belief that variants near the front of the SNP ordering are more likely to be causal than those at the back.

pi_new <- rowMeans(sapply(1:50000, function(i) draw_order(nsnps, 1))) 
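
As a quick check, the reweighted prior should sum to 1 and decay with position in the ordering, in contrast to the flat prior:

sum(pi_new) # should be 1
plot(pi_new, xlab = "SNP position in PP ordering", ylab = "Reweighted prior")
abline(h = 1/nsnps, col = 2) # the flat prior, for comparison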


Results

I use both the original flat prior and the new reweighted prior to calculate PPs for a variety of simulated genetic association summary statistics from a low LD region and a high LD region. To see whether the top SNP (the SNP with the largest PP) is well empirically calibrated, I store the maximum PP value and a binary indicator of whether this SNP is the causal SNP (CV).

The following results are for 25000 low LD region simulations and 25000 high LD region simulations, where each data point is the average of ~1500 simulations in a specified PP bin (\(x\) is the mean PP value in the bin and \(y\) is the mean binary is.CV value).
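
A sketch of this binning step, assuming a hypothetical data frame res with one row per simulation, a column maxpp (the maximum PP in the region) and a binary column is_cv (whether the top SNP is the CV); the object and column names are illustrative:

## equal-count PP bins (unique() guards against duplicated quantile breaks)
brks <- unique(quantile(res$maxpp, probs = seq(0, 1, length.out = 18)))
res$bin <- cut(res$maxpp, breaks = brks, include.lowest = TRUE)
x <- tapply(res$maxpp, res$bin, mean) # mean claimed PP in each bin
y <- tapply(res$is_cv, res$bin, mean) # proportion of simulations where the top SNP is the CV
plot(x, y, xlab = "Mean max PP", ylab = "Proportion top SNP is CV")
abline(0, 1, col = 2) # perfect calibration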

Using the reweighted prior, we become “too sure” that the top SNP is the CV, i.e. we upweight the PP for this SNP too much.

Interpretation: Using the flat prior, if a SNP has \(PP=0.5\) then there is a ~50% chance that it is the CV. Using the reweighted prior, if a SNP has \(PP=0.5\) then there is a ~40% chance that it is the CV.



To investigate the impact of the power of the study on the empirical calibration of the posterior probabilities calculated using both the original flat prior and the reweighted prior, I visualise the above results differently. Below, I have binned the data by the minimum \(P\) value in the region and calculated \(mean(is.CV-PP)\) for each of these bins (where PP is the max PP value).
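
A sketch of this calculation, again using the hypothetical res data frame from above, now with an additional (illustrative) column minp holding the minimum P value in the region:

## hypothetical column res$minp: minimum P value in the region
res$pbin <- cut(-log10(res$minp), breaks = 10)
bias <- tapply(res$is_cv - res$maxpp, res$pbin, mean) # mean(is.CV - PP) per bin
plot(seq_along(bias), bias, xlab = "Bin of -log10(min P) in region", ylab = "mean(is.CV - PP)")
abline(h = 0, col = 2) # no bias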

When using the original flat prior, the results match our previous findings - that the PP values are a lower bound for the probability of causality in moderately powered scenarios and that these become unbiased in higher powered scenarios (see Fig. 1 in manuscript).

I find that the results are even more pronounced in lower powered simulations, where we are almost always “too sure” that the SNP at the front of the ordering is the CV. Note that in lower powered scenarios the distribution of \(Z\)-scores tends to be “flatter” than in higher powered scenarios, where one (or a few) SNPs may have much greater \(Z\)-scores than the others; the lower the power, the flatter this distribution and the greater the entropy of the system.

Interpretation: In low powered scenarios, if the maximum PP value is 0.5, then there may only be a 40% chance that the corresponding SNP is the CV.


Idea: Is there a way to vary the “steepness” of the reweighted prior depending on the power of the study?


Varying Dirichlet concentration parameters

The parameters of the Dirichlet distribution (\(\alpha_i\)) are called “concentration parameters”, with \(\alpha_0=\sum_{i=1}^K \alpha_i\). So far in our analysis we have assumed that all the \(\alpha_i\) are equal, \(\alpha_i=\alpha=1\), so that \(E[X_i]=\frac{\alpha_i}{\alpha_0}=\frac{1}{nsnps}\) for all \(i\). I investigate the effect that varying \(\alpha\) has on the pdf (taking \(nsnps=3\) to aid visualisation).

library(Compositional)
library(MCMCpack)

## alpha = 1e-2
draws <- rdirichlet(200, c(0.01,0.01,0.01) )
bivt.contour(draws)

## alpha = 0.1
draws <- rdirichlet(200, c(.1,.1,.1) )
bivt.contour(draws)

## alpha = 1
draws <- rdirichlet(200, c(1,1,1) )
bivt.contour(draws)

## alpha = 10
draws <- rdirichlet(200, c(10,10,10) )
bivt.contour(draws)

## alpha = 100
draws <- rdirichlet(200, c(100,100,100) )
bivt.contour(draws)

As \(\alpha\) increases, samples concentrate near the centre of the simplex, representing “flatter” probability distributions (whereas for \(\alpha<1\) the mass is pushed towards the vertices). This has the expected effect on the reweighted prior.
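
For example, using draw_order from above with different concentration parameters (a rough sketch; the number of repetitions is kept small here for speed):

## reweighted priors under different concentration parameters
pi_01 <- rowMeans(sapply(1:10000, function(i) draw_order(nsnps, 0.1)))
pi_1  <- rowMeans(sapply(1:10000, function(i) draw_order(nsnps, 1)))
pi_10 <- rowMeans(sapply(1:10000, function(i) draw_order(nsnps, 10)))

## larger alpha gives a flatter ("less steep") reweighted prior
matplot(cbind(pi_01, pi_1, pi_10), type = "l", lty = 1, col = 1:3,
        xlab = "SNP position in ordering", ylab = "Reweighted prior")
legend("topright", legend = c("alpha = 0.1", "alpha = 1", "alpha = 10"), lty = 1, col = 1:3)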



The covariance of the Dirichlet distribution is,

\[Cov[X_i, X_j]=\frac{-\alpha_i \alpha_j}{\alpha_0^2(\alpha_0+1)},\]

and we assume that \(\alpha_i=\alpha\) so that,

\[Cov[X_i, X_j]=\frac{-\alpha^2}{\alpha_0^2(\alpha_0+1)}.\]

Let’s see what happens to the covariance as a function of \(\alpha\) for a 100 SNP region.
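
A short sketch of this calculation, substituting \(\alpha_0 = nsnps \times \alpha\) into the formula above:

## covariance between any two SNP probabilities as a function of alpha (nsnps = 100)
nsnps <- 100
alpha <- seq(0.01, 10, length.out = 500)
alpha0 <- nsnps * alpha
cov_ij <- -alpha^2 / (alpha0^2 * (alpha0 + 1))
plot(alpha, cov_ij, type = "l", xlab = expression(alpha), ylab = expression(Cov(X[i], X[j])))
abline(h = 0, col = 2)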



Varying \(\alpha_i\) across SNPs

Suppose now that the \(\alpha_i\)s can vary across SNPs. I visualise some examples for \(nsnps=3\).

draws <- rdirichlet(200, c(10,.1,.1) )
bivt.contour(draws)

draws <- rdirichlet(200, c(1,5,1) )
bivt.contour(draws)

The higher the value of \(\alpha_i\), the greater the “weight” of \(X_i\) and the greater the share of the total “mass” assigned to it. However, I don’t think we can use this for our method, as the expectation would no longer be \(\frac{1}{nsnps}\) for all SNPs.
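
For example, with \(\alpha=(10, 0.1, 0.1)\) we have \(E[X_1]=10/10.2\approx 0.98\) rather than \(1/3\), which can be checked by simulation:

a <- c(10, 0.1, 0.1)
a / sum(a)                      # theoretical expectations alpha_i / alpha_0
colMeans(rdirichlet(100000, a)) # empirical means, approximately equal to the above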


Final comments