Need to go over Chris’ corrections for the paper.
I have a basic plan for my first year report, but I'm struggling to work on it alongside the paper. I'd like to finish the paper and then write the first year report - hopefully this works timing-wise, as I can lift the figures and maths from the paper.
Discuss what level of maths is needed in the first year report - is the level from the paper OK, or do I need to explain the maths of the Bayesian method more fully?
Would like to consider how to describe the problem more generally (and statistically) for a paragraph or two in my paper and first year report.
I.e. why it is a general (statistical) problem that if you sort things and then choose, in that sorted order, what to place into your credible set - something about no longer being able to write down the probability distribution of the resulting set.
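A toy simulation may help make this concrete. This is my own sketch in Python, not the paper's actual simulation: the Dirichlet prior, the multiplicative noise model, and all parameters are invented for illustration. The idea is that even if posterior probabilities (pps) are noisy but roughly unbiased estimates, sorting on them before summing selects the upwardly-noisy ones, so the claimed coverage of the sorted credible set overstates the empirical coverage.

```python
# Toy illustration (assumed model, not the paper's): sorting noisy pps before
# building a credible set inflates claimed coverage vs empirical coverage.
import numpy as np

rng = np.random.default_rng(1)
n_snps, n_sims, target = 100, 10000, 0.95

claimed, hits = [], 0
for _ in range(n_sims):
    pi = rng.dirichlet(np.full(n_snps, 0.2))   # "true" causal probabilities
    cv = rng.choice(n_snps, p=pi)              # the causal variant (CV)
    # Estimated pps: true values perturbed by multiplicative log-normal noise.
    noisy = pi * np.exp(rng.normal(0, 0.5, n_snps))
    pp = noisy / noisy.sum()
    # Sorted credible set: add variants in decreasing pp until the sum >= 0.95.
    order = np.argsort(pp)[::-1]
    csum = np.cumsum(pp[order])
    k = int(np.searchsorted(csum, target)) + 1
    claimed.append(csum[k - 1])                # claimed coverage = sum of included pps
    hits += int(cv in order[:k])               # did the set actually contain the CV?

print(f"mean claimed coverage: {np.mean(claimed):.3f}")
print(f"empirical coverage:    {hits / n_sims:.3f}")
```

Under this toy model the mean claimed coverage sits at or above 0.95 by construction, while the empirical coverage falls short of it - the selection effect of the sorting step.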
I have decided to focus on the 95% credible sets in the paper/first year report, as there is more margin for error to show the utility of our correction.
The Excel spreadsheets containing all the information for the 95% and 99% credible sets (the latter included for comparison with the original paper) will be available as .csv files. These are a work in progress, but the draft versions are available here and here.
To get some idea of how accurate these results are, I reran the simulations using sample sizes matching those of the study (N0 = 12,262 and N1 = 6670).
“We omit 6 of the 44 regions from our analyses due to lack of position information (rs6691977, rs4849135, rs2611215 and rs1052553) and lack of utility of our method (rs34536443 region contains SNPs with posterior probabilities of causality of 0.98857969 and 0.01141317 and rs689 is extremely well documented, but not genotyped well on the ImmunoChip).”
## [1] 4.121487 4.303493 4.605187 4.708329 4.731193 4.762107 5.068904 5.108095 5.273287 5.289199 5.380735 5.413262 5.449691 5.733063 5.733596 5.746440 5.794417 5.946042 6.153421
## [20] 6.161197 6.192332 6.449380 6.658150 6.681179 6.864939 7.145365 7.303684 7.401054 7.648895 8.168146 8.496126 8.599211 8.653968 9.333459 11.919212 12.990950 20.777874
But now it seems that the unsorted sum is not unbiased? This is questionable because we found that the posterior probabilities themselves are accurate (by fitting logit(binary_cv)~logit(pp) and showing it was basically y=x, where binary_cv is 0 if the variant is not the CV and 1 if it is - see pps.calib.plot on hpc).
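For reference, the standard form of this calibration check is a logistic regression of binary_cv on logit(pp): if the pps are calibrated, the fitted intercept is ~0 and the slope is ~1 (i.e. basically y=x on the logit scale). Below is a hedged Python sketch of that check on simulated data - the actual analysis lives in pps.calib.plot on hpc (in R), and the data here are synthetic by construction (binary_cv drawn as Bernoulli(pp), so calibration holds exactly).

```python
# Sketch of the calibration check: logistic regression of binary_cv on
# logit(pp), fitted by Newton-Raphson. Synthetic data, calibrated by design.
import numpy as np

rng = np.random.default_rng(0)
pp = rng.uniform(0.01, 0.99, 50000)                 # simulated posterior probs
binary_cv = (rng.random(pp.size) < pp).astype(float)  # 1 if variant is the CV

x = np.log(pp / (1 - pp))                           # logit(pp)
X = np.column_stack([np.ones_like(x), x])           # intercept + slope design
beta = np.zeros(2)
for _ in range(25):                                 # Newton-Raphson iterations
    mu = 1 / (1 + np.exp(-(X @ beta)))              # fitted probabilities
    W = mu * (1 - mu)                               # IRLS weights
    grad = X.T @ (binary_cv - mu)
    H = X.T @ (X * W[:, None])
    beta += np.linalg.solve(H, grad)

print(f"intercept ~ {beta[0]:.3f}, slope ~ {beta[1]:.3f}")  # expect ~0 and ~1
```

If the pps were systematically over- or under-confident, the slope would drift away from 1 (or the intercept from 0), which is exactly what the plot is checking for.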
Can check the posterior probability frequentist coverage by binning the x axis and calculating means on the y. I do this for the claimed coverage from the no-ordering method.
Below is a plot showing the mean empirical unsorted coverage for equal bins of claimed unsorted coverage (i.e. unsorted vs unsorted). The pink point marks the mid-point of each bin.
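The binning step behind this plot can be sketched as follows. This is a generic Python illustration, not the plotting code itself: `claimed` and `covered` are hypothetical stand-ins for the per-credible-set claimed coverage and the 0/1 indicator of whether the set contained the CV, and the bin count and range are arbitrary. The data here are generated so that coverage is calibrated, so the binned means land close to the bin midpoints.

```python
# Bin the x axis (claimed coverage) and take the mean of y (covered) per bin.
# Synthetic stand-in data: one row per simulated credible set.
import numpy as np

rng = np.random.default_rng(2)
claimed = rng.uniform(0.5, 1.0, 10000)                 # claimed coverage per set
covered = (rng.random(10000) < claimed).astype(float)  # 1 if set contained the CV

edges = np.linspace(0.5, 1.0, 11)                      # 10 equal-width bins
idx = np.clip(np.digitize(claimed, edges) - 1, 0, len(edges) - 2)
midpoints = (edges[:-1] + edges[1:]) / 2
empirical = np.array([covered[idx == b].mean() for b in range(len(edges) - 1)])

for m, e in zip(midpoints, empirical):
    print(f"bin midpoint {m:.3f}: mean empirical coverage {e:.3f}")
```

Plotting `empirical` against `midpoints` gives the kind of calibration plot described: points on the y=x line indicate that the claimed coverage matches the frequentist coverage.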