11th September

1. Assessing empirical coverage

I investigate whether the top PP has accurate empirical coverage when calculated using the standard method (original PPs) and the reweighting method (reweighted PPs). To do this, I store the maximum PP value and a binary indicator of whether this SNP is the causal SNP. I repeat this many times using a variety of parameters to vary the power of the simulations.

Note, the prior for the reweighting method was generated by:

prior <- rowMeans(sapply(1:1000000, function(i) draw_order(nsnps = 100, 1)))

The following results are for 25,000 low LD region simulations and 25,000 high LD region simulations, whereby each data point is the average of ~1500 simulations for a specified PP bin (\(x\) is the mean PP value in the interval and \(y\) is the mean binary is.CV value).

We find that the original PPs have fairly accurate empirical coverage, in that on average the value at the top PP is approximately the probability that that SNP is causal (if a SNP has \(PP=0.5\) then there is approximately a 50% chance that this SNP is the causal SNP (~52%)).

Using the reweighted PPs, if a SNP has \(PP=0.5\) then there is approximately a 40% chance that this SNP is the causal SNP. When using the reweighting prior method, the top SNP’s PP is “too high” in that it does not accurately estimate the probability that that SNP is the causal SNP.

Does this make sense?

This seems to make sense because the prior reweighting method upweights the PPs of the SNPs near the front of the ordering whilst the binary indicator of causality remains the same (that is, the top SNP is the same SNP when using the original PPs and the reweighted PPs). So that, if the top PPs using the original derivation have accurate empirical coverage, then those calculated using the reweighting method will be bias as the top SNP will always have a bigger PP due to it’s bigger prior probability.

Conclusion: Using the prior reweighting method we become “too sure” that the SNP with the biggest PP is the CV.

Side note: when limiting my simulations to those where the problem is most pronounced (\(1e-6<P_{min}<1e-4\)), the results look similar to before.

Impact of power

I’m wary that these results are averaged over many simulations of varying power. Instead, I bin the data into bins of minimum \(P\) value in the region and calculate \(mean(is.CV-PP)\) for each of these bins (where PP is the max PP value).

These results support the earlier conclusion that the reweighting method becomes “too sure” that the SNP with the biggest PP is the CV, especially in lower powered simulations.

But what is happening with the original PP method? For \(10^{-2}>P_{min}>10^{-7}\), the top PP underestimates the probability that that SNP is the CV (e.g. if top PP = 0.5 then there may actually be a 55% chance that this is the CV). This reflects what we see in figure 1 of the manuscript where we describe that the claimed coverage can be used as a lower bound for the true coverage.

Conclusion: Using the prior reweighting method we become “too sure” that the SNP with the biggest PP is the CV. This is especially pronounced in low power simulations, where our prior indicates that our SNP ordering is very informative, when in actual fact the PPs of the SNPs may be very similar and the ordering may be less informative than our prior implies. It doesn’t seem sensible to upweight and downweight SNPs using the same prior in low power scenarioes and high power scenarioes (when e.g. there may be little difference between the SNP PPs or a big difference between the SNP PPs, respectively).

Idea: Vary the prior depending on the power of the simulation. As the power decreases, the prior could become flatter?

2. Accuracy of claimed coverage

I now investigate whether the claimed coverage using these PPs are accurate by summing to a threshold (0.9) and storing the claimed coverage and a binary indicator of whether the CV was contained within the set. We hope that using the reweighted PPs will make the claimed coverage estimates accurate in the standard single variant fine-mapping method.

The following plot shows mean(covered-claimed) for various minimum P value bins. The results are similar to what we observe in Figure 1 of the manuscript (claimed is too low and then becomes unbiased). The reweighting method does not seem to help.

Limit simulations to those where we know that the standard claimed coverage estimates are inaccurate

I now limit the simulations to those in the power range where we know that the claimed coverage under-estimates the true coverage (\(1e-6<P_{min}<1e-4\)), each data point is now the average of ~300 simulations that fall in this \(P\)-value range.

3. Investigating results

Next, I compare the claimed coverage values when using the original and the reweighting method. In most instances, the claimed coverage of the credible sets using the reweighted PPs is higher than that using the original PPs. This could be because we’re more likely to “leap-frog” over the threshold when using the reweighting method.

Notice that there does seem to be some structure in the relationship between the claimed coverage estimates using the different methods.

I now investigate the relationship between the power of the system (minimum \(P\) value) and the probability that the CV is contained within the set, when using both the original and reweighting method.

The credible sets formed using the reweighted priors have lower probability of containing the CV because if it is not one of the earlier SNPs in the ordering then it’s PP is down weighted and it is less likely to be included in the set.

These results imply that in really low powered simulations, the reweighted PP method gives credible sets that are very unlikely to contain the CV. This may be due to the fact that the prior is too informative in low powered scenarioes, again supporting the idea of choosing the “steepness” of the prior to reflect the power of the study.

RED IS USING THE REWEIGHTED PPS.

m_low <- glm(cov~P_min, data = data_low, family = "binomial")
m_low_w <- glm(cov_w~P_min, data = data_low, family = "binomial")

pred <- predict(m_low, data.frame(P_min = seq(min(data_low$P_min), max(data_low$P_min), 0.001)), type="resp")

pred_w <- predict(m_low_w, data.frame(P_min = seq(min(data_low$P_min), max(data_low$P_min), 0.001)), type="resp")

How does where the CV appears in the ordering affect results? Note that the ordering of PPs is the same for the original and reweighted method.

I find that when using the reweighted method, the probability that the CV is in the set decreases very rapidly as it appears further down the ordering (as expected).

Summary

The reweighting method becomes “too sure” that the top SNP is the CV, especially in lower powered scenarios.
The probability that the CV is in the set decreases very rapidly as the power decreases when using the reweighting method.
If the CV is not extremely close to the front of the SNP ordering when using the reweighting method, then it becomes extremely unlikely to be included in the set.
It doesn’t look like the prior reweighting method is going to work.

Questions and next steps

Is it worth investigating varying the prior to reflect the power of the study, i.e. make it less steep in low power? How would I go about this?
Upgrading presentation possible additions:
- More detail on deriving Z-scores and/or PPs
- More detail on our analysis of why the problem exists
- Slide on future plans (going from CV to putative gene and mechanism using chromatin conformation methods)