Need to go over Chris’ corrections for the paper.
I have a basic plan for my first year report, but I'm struggling to work on it alongside the paper. I'd like to finish the paper and then write the first year report - hopefully this works timing-wise, as I can lift the figures and maths from the paper.
Discuss what level of maths is needed in the first year report - is the level from the paper OK, or do I need to explain the maths of the Bayesian method more fully?
Would like to consider how to describe the problem more generally (and statistically) for a paragraph or two in my paper and first year report.
I.e. why it is a general (statistical) problem that if you sort things and then choose, in that sorted order, what to place into your credible set - something about no longer being able to write down the probability distribution of the resulting set.
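A toy simulation may help make this concrete. This is my own sketch in Python, not the paper's actual simulation: the Dirichlet prior, the multiplicative noise model, and all parameters are invented for illustration. The idea is that even if posterior probabilities (pps) are noisy but roughly unbiased estimates, sorting on them before summing selects the upwardly-noisy ones, so the claimed coverage of the sorted credible set overstates the empirical coverage.

```python
# Toy illustration (assumed model, not the paper's): sorting noisy pps before
# building a credible set inflates claimed coverage vs empirical coverage.
import numpy as np

rng = np.random.default_rng(1)
n_snps, n_sims, target = 100, 10000, 0.95

claimed, hits = [], 0
for _ in range(n_sims):
    pi = rng.dirichlet(np.full(n_snps, 0.2))   # "true" causal probabilities
    cv = rng.choice(n_snps, p=pi)              # the causal variant (CV)
    # Estimated pps: true values perturbed by multiplicative log-normal noise.
    noisy = pi * np.exp(rng.normal(0, 0.5, n_snps))
    pp = noisy / noisy.sum()
    # Sorted credible set: add variants in decreasing pp until the sum >= 0.95.
    order = np.argsort(pp)[::-1]
    csum = np.cumsum(pp[order])
    k = int(np.searchsorted(csum, target)) + 1
    claimed.append(csum[k - 1])                # claimed coverage = sum of included pps
    hits += int(cv in order[:k])               # did the set actually contain the CV?

print(f"mean claimed coverage: {np.mean(claimed):.3f}")
print(f"empirical coverage:    {hits / n_sims:.3f}")
```

Under this toy model the mean claimed coverage sits at or above 0.95 by construction, while the empirical coverage falls short of it - the selection effect of the sorting step.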
I have decided to focus on the 95% credible sets in the paper/first year report, as there is more margin for error to show the utility of our correction.
The Excel spreadsheets containing all the information for the 95% and 99% credible sets (the latter included for comparison with the original paper) will be available as .csv files. These are a work in progress, but the draft versions are available here and here.
To get some idea of how accurate these results are, I reran the simulations using sample sizes matching those of the study (N0 = 12,262 and N1 = 6670).
“We omit 6 of the 44 regions from our analyses due to lack of position information (rs6691977, rs4849135, rs2611215 and rs1052553) and lack of utility of our method (rs34536443 region contains SNPs with posterior probabilities of causality of 0.98857969 and 0.01141317 and rs689 is extremely well documented, but not genotyped well on the ImmunoChip).”
## [1] 4.121487 4.303493 4.605187 4.708329 4.731193 4.762107 5.068904 5.108095 5.273287 5.289199 5.380735 5.413262 5.449691 5.733063 5.733596 5.746440 5.794417 5.946042 6.153421
## [20] 6.161197 6.192332 6.449380 6.658150 6.681179 6.864939 7.145365 7.303684 7.401054 7.648895 8.168146 8.496126 8.599211 8.653968 9.333459 11.919212 12.990950 20.777874
But now it seems that the unsorted sum is not unbiased? This is questionable because we found that the posterior probabilities themselves are accurate (by fitting logit(binary_cv)~logit(pp) and showing it was basically y=x, where binary_cv is 0 if the variant is not the CV and 1 if it is - see pps.calib.plot on hpc).
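For reference, the standard form of this calibration check is a logistic regression of binary_cv on logit(pp): if the pps are calibrated, the fitted intercept is ~0 and the slope is ~1 (i.e. basically y=x on the logit scale). Below is a hedged Python sketch of that check on simulated data - the actual analysis lives in pps.calib.plot on hpc (in R), and the data here are synthetic by construction (binary_cv drawn as Bernoulli(pp), so calibration holds exactly).

```python
# Sketch of the calibration check: logistic regression of binary_cv on
# logit(pp), fitted by Newton-Raphson. Synthetic data, calibrated by design.
import numpy as np

rng = np.random.default_rng(0)
pp = rng.uniform(0.01, 0.99, 50000)                 # simulated posterior probs
binary_cv = (rng.random(pp.size) < pp).astype(float)  # 1 if variant is the CV

x = np.log(pp / (1 - pp))                           # logit(pp)
X = np.column_stack([np.ones_like(x), x])           # intercept + slope design
beta = np.zeros(2)
for _ in range(25):                                 # Newton-Raphson iterations
    mu = 1 / (1 + np.exp(-(X @ beta)))              # fitted probabilities
    W = mu * (1 - mu)                               # IRLS weights
    grad = X.T @ (binary_cv - mu)
    H = X.T @ (X * W[:, None])
    beta += np.linalg.solve(H, grad)

print(f"intercept ~ {beta[0]:.3f}, slope ~ {beta[1]:.3f}")  # expect ~0 and ~1
```

If the pps were systematically over- or under-confident, the slope would drift away from 1 (or the intercept from 0), which is exactly what the plot is checking for.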
Can check the posterior probability frequentist coverage by binning the x axis and calculating means on the y. I do this for the claimed coverage from the no-ordering method.
Below is a plot showing the mean empirical unsorted coverage for equal bins of claimed unsorted coverage (i.e. unsorted vs unsorted). The pink point marks the mid-point of each bin.
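The binning step behind this plot can be sketched as follows. This is a generic Python illustration, not the plotting code itself: `claimed` and `covered` are hypothetical stand-ins for the per-credible-set claimed coverage and the 0/1 indicator of whether the set contained the CV, and the bin count and range are arbitrary. The data here are generated so that coverage is calibrated, so the binned means land close to the bin midpoints.

```python
# Bin the x axis (claimed coverage) and take the mean of y (covered) per bin.
# Synthetic stand-in data: one row per simulated credible set.
import numpy as np

rng = np.random.default_rng(2)
claimed = rng.uniform(0.5, 1.0, 10000)                 # claimed coverage per set
covered = (rng.random(10000) < claimed).astype(float)  # 1 if set contained the CV

edges = np.linspace(0.5, 1.0, 11)                      # 10 equal-width bins
idx = np.clip(np.digitize(claimed, edges) - 1, 0, len(edges) - 2)
midpoints = (edges[:-1] + edges[1:]) / 2
empirical = np.array([covered[idx == b].mean() for b in range(len(edges) - 1)])

for m, e in zip(midpoints, empirical):
    print(f"bin midpoint {m:.3f}: mean empirical coverage {e:.3f}")
```

Plotting `empirical` against `midpoints` gives the kind of calibration plot described: points on the y=x line indicate that the claimed coverage matches the frequentist coverage.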