This report shows my investigation of what could be causing the problem in our corrected coverage estimates.
However, we’ve now identified the problem (see https://annahutch.github.io/PhD/31july.html) so these findings are redundant.
We’ve seen that the PPs are well calibrated in the UK10K simulations, lets check whether they are well calibrated in the low and high 1000 Genomes simulations.
Yep - they still are!
What does the LD look like in the UK10K regions?
The geth()
function samples a small (ldd$dist <1200000
) LD block on chromosome 22, there are 6 of these to choose from (with 2121, 3927, 3788, 3516, 2960 or 5475 snps). 2 random starting points are sampled and the 100 adjacent snps from each of these are selected.
An example of a 200 SNP region is shown below:
I have found that the empirical coverage proportions are inaccurate, further narrowing down the problem of our correction to these proportions.
In our method, we are calculating \(E(Z_M)[iCV]=\mu\) (since \(E(Z_M)=Z_J\times \Sigma\)). We then simulate \(Z*\sim N(E(Z_M),\Sigma)\) by calculating \(Z*=E(Z_M)+ERR\) where \(ERR\sim N(0,\Sigma)\). Thus, we are sampling \(Z_M[iCV]\) from some distribution of \(\mu\).
If we consider the 1000 simulated credible sets where the CV is assumed causal then we can obtain a binary vector of length 1000 (whether the CV was in the credible set - props_iCV
) and also the corresponding value of \(Z_M[iCV]\) used in that simulation (ZmiCV
).
The values of ZmiCV
will fall in some range centered at \(\mu\). However, the proportions are not symmetric as they cannot increase as much as they are able to decrease and thus when we take the average of these (to get prop_cov
), this is consistently too low.
Idea: Plot props_iCV
against the simulated Z score at the CV (ZmiCV
), fit a curve to this (logistic (hopefully) or GAM) and predict the coverage by reading off the props_iCV
value for the true \(\mu\). If this works, then we can then investigate whether the system is robust to noise in \(\mu\) by predicting coverage from \(\hat\mu\).
Want to recover the true coverage as a function of the \(Z_m\).
Throw away simulations where \(Z_M[iCV]\) if far from \(\mu\).
E.g. could use ABC approach.
But not great to simulate data and then throw away the ones we don’t want… and how do we decide “how close” \(Z_M[iCV]\) must be to \(\mu\)?