Before Fixing

This report shows my investigation of what could be causing the problem in our corrected coverage estimates.

However, we’ve now identified the problem (see https://annahutch.github.io/PhD/31july.html) so these findings are redundant.

PPs are Calibrated

We’ve seen that the PPs are well calibrated in the UK10K simulations, lets check whether they are well calibrated in the low and high 1000 Genomes simulations.

Yep - they still are!

Correlation of UK10K Regions

What does the LD look like in the UK10K regions?

The geth() function samples a small (ldd$dist <1200000) LD block on chromosome 22, there are 6 of these to choose from (with 2121, 3927, 3788, 3516, 2960 or 5475 snps). 2 random starting points are sampled and the 100 adjacent snps from each of these are selected.

An example of a 200 SNP region is shown below:

Relationship of Empirical Coverage Proportions and PPs

I have found that the empirical coverage proportions are inaccurate, further narrowing down the problem of our correction to these proportions.

Fixing Empirical Coverage Proportions

In our method, we are calculating $E(Z_M)[iCV]=\mu$ (since $E(Z_M)=Z_J\times \Sigma$). We then simulate $Z*\sim N(E(Z_M),\Sigma)$ by calculating $Z*=E(Z_M)+ERR$ where $ERR\sim N(0,\Sigma)$. Thus, we are sampling $Z_M[iCV]$ from some distribution of $\mu$.

If we consider the 1000 simulated credible sets where the CV is assumed causal then we can obtain a binary vector of length 1000 (whether the CV was in the credible set - props_iCV) and also the corresponding value of $Z_M[iCV]$ used in that simulation (ZmiCV).

The values of ZmiCV will fall in some range centered at $\mu$. However, the proportions are not symmetric as they cannot increase as much as they are able to decrease and thus when we take the average of these (to get prop_cov), this is consistently too low.

Idea: Plot props_iCV against the simulated Z score at the CV (ZmiCV), fit a curve to this (logistic (hopefully) or GAM) and predict the coverage by reading off the props_iCV value for the true $\mu$. If this works, then we can then investigate whether the system is robust to noise in $\mu$ by predicting coverage from $\hat\mu$.

Want to recover the true coverage as a function of the $Z_m$.

Another Approach

Throw away simulations where $Z_M[iCV]$ if far from $\mu$.
E.g. could use ABC approach.
But not great to simulate data and then throw away the ones we don’t want… and how do we decide “how close” $Z_M[iCV]$ must be to $\mu$?