PTPN22 Reference Data Analysis

Data file: ic_t1d_onengut_allresults.csv.

1. Build 37 method

  • Get the corresponding region from immunochip-regions.txt -> chr1:113863087-114527968.
  • Get the SNPs in this region from the ic_t1d_onengut_allresults.csv file (694 SNPs).
  • Use aligned9.build37.RData to obtain the LD matrix for these SNPs, dropping the 55 SNPs which do not appear in this file (639 SNPs remain). Sigma <- ld(Xsub, Xsub, stats = "R")
  • Proceed as normal to obtain the 99% credible set, claimed coverage and corrected coverage.
## $Credset
##     name.rs        pp
## 1 rs2476601 0.8640309
## 2 rs6679677 0.1359691
## 
## $Claimed
## [1] 1
## 
## $Corrected
## [1] 0.8035921
## 
## $nSNPs_in_region
## [1] 639
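The "proceed as normal" step - ranking the posterior probabilities and cumulatively summing until the threshold is exceeded - can be sketched in a few lines of R (the pp values below are illustrative, not the full 639-SNP vector):

```r
# Minimal sketch of forming a 99% credible set from per-SNP posterior
# probabilities (illustrative pp values, not the real 639-SNP vector)
credset <- function(pp, thr = 0.99) {
  ord  <- order(pp, decreasing = TRUE)      # rank SNPs by pp
  keep <- which(cumsum(pp[ord]) >= thr)[1]  # smallest set whose pp sum exceeds thr
  data.frame(name.rs = names(pp)[ord][1:keep], pp = pp[ord][1:keep])
}

pp <- c(rs2476601 = 0.8640, rs6679677 = 0.1360, other = 0.0000)
credset(pp)  # the two PTPN22 SNPs, as in the output above
```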

2. Build 36 method

  • Get genotype data for the corresponding region from share/Projects/abc/input_files/refdatasets/datafilesD/ -> chr1:113619999-114460000 and convert the SNP names to rs IDs. Remove variants whose names cannot be matched (583 SNPs remain).
  • Get the SNPs in this region from the ic_t1d_onengut_allresults.csv file (640 SNPs).
  • Remove the columns of \(X\) representing SNPs which do not appear in both of the above (520 SNPs remain). The dimensions of \(X\) are 9670 × 520.
  • Get the LD matrix. LD <- cor2(x)
  • Proceed as normal to obtain the 99% credible set, claimed coverage and corrected coverage.
## $Credset
##     name.rs        pp
## 1 rs2476601 0.8640309
## 2 rs6679677 0.1359691
## 
## $Claimed
## [1] 1
## 
## $Corrected
## [1] 0.889136
## 
## $nSNPs_in_region
## [1] 520


Compare Build Methods

In order to investigate whether the very low corrected coverage value is due to the observed data (good) or the reference data (bad), I proceed as follows:

  • Find the common SNPs in the above methods (495 SNPs present in both).

  • Use the differing LD data to perform the correction.

  • Apply the corrected credible set method. Do they both get stuck in a loop? Yes.


## $build37_claim
## [1] 1
## 
## $build37_corr
## [1] 0.882864
## 
## $build37_reqthr
## [1] 1
## 
## $build37_newCS_claim
## [1] 1
## 
## $build37_newCS_corr
## [1] 0.989864
## 
## $build36_claim
## [1] 1
## 
## $build36_corr
## [1] 0.8977281
## 
## $build36_reqthr
## [1] 1
## 
## $build36_newCS_claim
## [1] 1
## 
## $build36_newCS_corr
## [1] 0.989864

Notes

  • This shows that the loop is caused by the method and the observed data, not due to the reference data.

  • Using this set of 495 SNPs increases the corrected coverage of the original 99% credible set from 80% to ~90%.

  • Recall that last week, I found that this corrected coverage estimate was so low because, fairly often, the other high-pp SNP was given the bulk of the posterior probability. This led to simulated systems with nvar = 1, claimed coverage = 1 and covered = 0.

  • Consequently, we consider limiting the simulations used in the correction to those with nvar=2, which would reflect the standard 99% credible set containing 2 variants. This is explored in the next section.
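The nvar-matching idea can be sketched in R, assuming a hypothetical data frame of simulation results with one row per simulated credible set (its size nvar and a covered indicator):

```r
# Sketch of the matching-nvar correction: estimate coverage using only the
# simulated credible sets whose size matches the observed set (nvar = 2 here).
# `sims` is a hypothetical stand-in for the real simulation output.
set.seed(1)
sims <- data.frame(nvar    = sample(1:4, 1000, replace = TRUE),
                   covered = rbinom(1000, 1, 0.9))
corr_cov       <- mean(sims$covered)            # standard corrected coverage
match          <- sims$nvar == 2                # simulations matching nvar
corr_cov_match <- mean(sims$covered[match])     # matching-nvar estimate
c(corr_cov = corr_cov, corr_cov_match = corr_cov_match,
  nsims_match = sum(match))
```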


Matching nvar Method

##       index_SNP     claim nvar  corr_cov corr_cov_match                 nsims_match          diff
##  1:  rs10277986 0.9960149   11 0.9754023      0.9932061        4, 5, 7, 2,11,11,... -0.0178037582
##  2:  rs11203202 0.9993207    4 0.9955823      0.9878488 409,248,248,785,372,196,...  0.0077334519
##  3: rs113010081 0.9901508  120 0.9941965      0.6856924             0,0,4,0,0,0,...  0.3085041361
##  4:  rs11954020 0.9901063   83 0.9921446      0.9770868             0,2,0,1,4,2,...  0.0150578593
##  5:  rs12416116 0.9915243    6 0.9908959      0.9982534        1,11, 0, 0,71,40,... -0.0073575682
##  6:  rs12453507 0.9904096   65 0.9915284       0.998754             1,1,3,2,2,0,... -0.0072256020
##  7:  rs12927355 0.9918667   20 0.9705063      0.9773273             0,0,0,0,0,0,... -0.0068210222
##  8:  rs13415583 0.9901788   92 0.9901757      0.9685234             0,0,1,2,0,5,...  0.0216523560
##  9:   rs1456988 0.9953685   26 0.9836469      0.9967821        3,50,12,12,11,12,... -0.0131351791
## 10:    rs151234   0.99019   10 0.9859141      0.9884305             2,0,0,1,1,0,... -0.0025164851
## 11:   rs1538171 0.9903378   87 0.9735812      0.9659006             1,0,0,0,0,0,...  0.0076805895
## 12:   rs1615504 0.9905884   26 0.9907459      0.9961016        4,10,19,18,19,12,... -0.0053557155
## 13:   rs1893217 0.9962419   11 0.9463931      0.9996313             0,0,1,1,0,0,... -0.0532381813
## 14:    rs193778  0.990299   23 0.9929865      0.9987876             5,2,2,0,9,4,... -0.0058011249
## 15:   rs2111485 0.9999631    4 0.9795897      0.9959734       15,33, 0, 4, 3, 0,... -0.0163836654
## 16:    rs229533 0.9902057    8 0.9961789      0.9998258        6, 3, 2, 1,18,29,... -0.0036469689
## 17:   rs2476601         1    2 0.8042806              1  13,181,567,567,577,579,... -0.1957193821
## 18:   rs3024505 0.9919685    7 0.9908633      0.9993566        7, 4, 1,23,24, 6,... -0.0084933366
## 19:   rs3087243 0.9914763   21 0.9457756       0.982225             0,2,2,2,2,2,... -0.0364494018
## 20:  rs34593439 0.9913195    2 0.9927623      0.9952842  50, 37, 37,712,717,353,... -0.0025219006
## 21:    rs402072 0.9909357   14 0.9913847      0.9999948             3,8,5,2,5,2,... -0.0086101698
## 22:   rs4820830 0.9904556   59 0.9853841      0.9947404             0,0,1,1,3,0,... -0.0093562809
## 23:    rs516246 0.9965928   12 0.9876407      0.9953469             2,2,3,0,3,0,... -0.0077062257
## 24:  rs56994090 0.9934261    8 0.9976826      0.9999007        3, 1, 5,51, 7, 9,... -0.0022181060
## 25:   rs6043409 0.9901574   16  0.996303      0.9991871             0,4,0,0,0,0,... -0.0028841086
## 26:  rs61839660 0.9985256    4 0.9469757       0.969185             0,0,3,0,0,1,... -0.0222092458
## 27:  rs62447205 0.9901583   40 0.9797742      0.9949124             1,2,0,0,1,0,... -0.0151381790
## 28:   rs6476839 0.9947365   18 0.9931858      0.9994083             1,2,1,2,2,0,... -0.0062225049
## 29:   rs6518350 0.9902603   41 0.9966605       0.996814             3,1,3,3,2,3,... -0.0001535152
## 30:    rs653178 0.9995797    2 0.8324161      0.8645506   0,374,372,517,369,668,... -0.0321345206
## 31:    rs705705 0.9932401    9 0.9266029       0.954514  99,100, 31,  0,  0, 31,... -0.0279110792
## 32:  rs72727394 0.9904069   19 0.9953029      0.9974931        0, 0, 1, 1, 5,12,... -0.0021902282
## 33:  rs72928038 0.9900544   36 0.9923231      0.9846845        1, 3, 3, 0,10, 0,...  0.0076385507
## 34:    rs757411  0.990144   54 0.9874525      0.9923193             0,2,4,2,2,3,... -0.0048667555
## 35:  rs75793288 0.9940996   22 0.9761336      0.9843889             5,0,0,0,0,0,... -0.0082553002
## 36:   rs8056814 0.9920644    8 0.9649084      0.9955263             0,1,0,0,0,0,... -0.0306178333
## 37:    rs917911  0.990001  133 0.9908669       0.940162             0,1,0,0,0,0,...  0.0507049444
## 38:   rs9585056 0.9907895    3 0.9989341      0.9967159 115,  0, 90, 59, 43,  0,...  0.0022182763
##       index_SNP     claim nvar  corr_cov corr_cov_match                 nsims_match          diff

Simulations

  • To investigate the wider utility of this method, I rerun my standard simulations to include the corr_cov_match estimate.

  • It seems that the new method is not better than our original correction.

  • One would predict that this method would be better for very small credible sets (with only a couple of variants). Hence, the next plot is for only those simulations where nvar <= 3.

  • Note that some of these relative error bars are only based on a very small number of data points. For example, the thr = 0.99, OR = 1.1, N0 = 5000 facet (where corrected_match looks very good) is only based on 3 data points.

  • This method does not seem reliable enough to replace our standard corrected coverage method. Perhaps it could be used as an alternative for problematic corrected coverage estimates (e.g. for rs2476601).

  • Not sure how to incorporate this into the R package, as it would be at the user’s discretion to use this method, and I wouldn’t recommend it.


‘Confidence Interval’ for Corrected Coverage Estimates

rs917911 region

  • I run my code 100 times, varying the ERR and saving the corrected coverage estimate. Note that I am correcting the coverage of the standard 99% credible set of 133 variants with claimed coverage 0.990001.

  • In my original simulations, the corrected coverage of this credible set was 0.9893961, yet it seems that none of the 100 replicates are this low (?)

  • The 50% ‘confidence interval’ is

##       25%       75% 
## 0.9932352 0.9959700
  • The 95% ‘confidence interval’ is
##        5%       95% 
## 0.9916436 0.9974147
  • Since 0.99 does not fall in this range (in fact the minimum corrected coverage was 0.9903), we would correct this and build a new credible set, which we would hope contains fewer variants.
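These ‘confidence intervals’ are just empirical quantiles of the replicate estimates; a sketch, with simulated stand-ins for the 100 corrected coverage values:

```r
# Sketch: empirical 'confidence intervals' from repeated corrected-coverage
# estimates (reps is a hypothetical stand-in for the 100 replicates)
set.seed(2)
reps <- rnorm(100, mean = 0.9946, sd = 0.0015)
quantile(reps, c(0.25, 0.75))  # 50% interval
quantile(reps, c(0.05, 0.95))  # interval between the 5% and 95% quantiles
min(reps)                      # check whether 0.99 falls below all replicates
```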

All regions

  • I obtain the 95% and 50% ‘confidence regions’ for each of the non-problematic regions, except for the rs113010081 region, as this requires too much memory (1321 SNPs in the region).
##      index_snp   corrcov                CI95                CI50
##  1: rs10277986  0.978764 0.9721906,0.9802149 0.9741556,0.9780174
##  2: rs11203202  0.995303 0.9933562,0.9979646 0.9949507,0.9969793
##  3: rs11954020  0.995122 0.9920448,0.9958591 0.9934685,0.9950042
##  4: rs12416116 0.9886482 0.9887499,0.9959496 0.9907356,0.9940735
##  5: rs12453507  0.992248 0.9889613,0.9930303 0.9908088,0.9921349
##  6: rs12927355 0.9696758 0.9668940,0.9721975 0.9684114,0.9707650
##  7: rs13415583 0.9899163 0.9874252,0.9913187 0.9884464,0.9900265
##  8:  rs1456988 0.9887643 0.9880320,0.9922099 0.9888669,0.9909024
##  9:   rs151234 0.9865615 0.9770208,0.9872931 0.9819391,0.9849174
## 10:  rs1538171 0.9695779 0.9679499,0.9740948   0.969232,0.972303
## 11:  rs1615504 0.9911848 0.9901151,0.9939660 0.9912258,0.9931430
## 12:  rs1893217 0.9445883 0.9427223,0.9496506   0.944452,0.947910
## 13:   rs193778 0.9908382 0.9902298,0.9942834 0.9913912,0.9932437
## 14:  rs2111485 0.9792549 0.9788722,0.9847620 0.9807153,0.9830364
## 15:   rs229533 0.9981773 0.9949370,0.9987852 0.9962214,0.9978486
## 16:  rs2476601 0.8005921 0.7954517,0.8097113 0.7998783,0.8058585
## 17:  rs3024505 0.9892239 0.9902502,0.9955760   0.991483,0.994281
## 18:  rs3087243 0.9486954 0.9468846,0.9538603 0.9489227,0.9516272
## 19: rs34593439 0.9912887 0.9910036,0.9959825 0.9925432,0.9950101
## 20:   rs402072 0.9930691 0.9909713,0.9954966 0.9925727,0.9944596
## 21:  rs4820830 0.9838468 0.9795138,0.9857821 0.9812997,0.9839691
## 22:   rs516246 0.9867699 0.9827057,0.9884294 0.9848526,0.9869850
## 23: rs56994090  0.996886 0.9954066,0.9982024 0.9963256,0.9976138
## 24:  rs6043409 0.9964087 0.9946749,0.9980954 0.9960422,0.9973196
## 25: rs61839660 0.9521729 0.9453772,0.9578613 0.9485344,0.9542905
## 26: rs62447205 0.9807707 0.9780220,0.9819977   0.979047,0.981009
## 27:  rs6476839 0.9947377 0.9917515,0.9953835 0.9927836,0.9943124
## 28:  rs6518350 0.9983014 0.9951457,0.9988502 0.9964435,0.9979501
## 29:   rs653178 0.8325501 0.8302166,0.8430690 0.8340166,0.8390162
## 30:   rs705705 0.9285993 0.9237517,0.9316649 0.9270735,0.9297125
## 31: rs72727394 0.9942578 0.9923124,0.9966216 0.9937476,0.9955544
## 32: rs72928038 0.9922407 0.9884362,0.9947106 0.9907878,0.9932962
## 33:   rs757411  0.993689 0.9886599,0.9937992 0.9899442,0.9922903
## 34: rs75793288 0.9728269 0.9681660,0.9778688 0.9708571,0.9756968
## 35:  rs8056814  0.973799 0.9591791,0.9704493 0.9625117,0.9674740
## 36:   rs917911 0.9951829 0.9915600,0.9970556 0.9930566,0.9959115
## 37:  rs9585056 0.9989179 0.9959032,0.9999277 0.9969796,0.9989419
##      index_snp   corrcov                CI95                CI50
  • The following table shows the correction for the regions which this method implies should be corrected - i.e. 0.99 does not fall into the 95% confidence region.

  • We see that these are all well behaved, with more variants being added if the corrected coverage is less than 0.99 and variants being removed if it is greater than 0.99 (except rs917911 which we are aware of).

  • In particular, notice index-snp rs9585056 (row 36) - our correction has reduced the credible set size from 3 variants to just 1!

##      index_SNP their_cs_nvar our_cs_nvar our_cs_corrcov new_cs_nvar new_corrcov nvar.diff
## 1   rs10277986            26          11      0.9763712          14   0.9900244         3
## 2   rs11203202             5           4      0.9956508           4   0.9899901         0
## 3   rs11954020           136          83      0.9937890          75   0.9900860        -8
## 6   rs12927355            41          20      0.9688299          24   0.9900287         4
## 9     rs151234            NA          10      0.9831869          18   0.9898650         8
## 10   rs1538171           154          87      0.9708851         100   0.9900226        13
## 11   rs1615504            40          26      0.9930152          24   0.9900069        -2
## 12   rs1893217            35          11      0.9453810          29   0.9900648        18
## 13    rs193778           102          23      0.9949010          20   0.9898605        -3
## 14   rs2111485             4           4      0.9822650           4   0.9900062         0
## 15    rs229533            15           8      0.9948442           5   0.9901647        -3
## 16   rs3024505             9           7      0.9926321           6   0.9899121        -1
## 17   rs3087243            30          21      0.9519229          38   0.9900073        17
## 18  rs34593439             3           2      0.9940004           2   0.9896408         0
## 19    rs402072            25          14      0.9959448          11   0.9900473        -3
## 20   rs4820830           117          59      0.9840735          64   0.9900965         5
## 21    rs516246            36          12      0.9865726          12   0.9900208         0
## 22  rs56994090            11           8      0.9983908           7   0.9900007        -1
## 23   rs6043409            39          16      0.9959000           9   0.9900867        -7
## 24  rs61839660             5           4      0.9507457           7   0.9901005         3
## 25  rs62447205            56          40      0.9799506          57   0.9900367        17
## 26   rs6476839            21          18      0.9941413          17   0.9900715        -1
## 27   rs6518350            68          41      0.9990129          15   0.9900782       -26
## 28    rs653178             5           2      0.8479623           5   0.9901032         3
## 29    rs705705            22           9      0.9254648          19   0.9900180        10
## 30  rs72727394            32          19      0.9952338          19   0.9900978         0
## 33  rs75793288            34          22      0.9770497          25   0.9900484         3
## 34   rs8056814            22           8      0.9700932          13   0.9906368         5
## 35    rs917911           135         133      0.9893961          54   0.9898731       -79
## 36   rs9585056             7           3      0.9979568           1   0.9907311        -2
## 161  rs2476601             2           2      0.8034561           2   0.9480000         0

Investigating discrepancy

  • To further investigate the discrepancy between what we see in our simulations and what happens in the real data, I investigate the top SNP (the SNP with the largest absolute Z-score) from each T1D region. This will help us to understand how the true CVs may differ, in terms of MAF and LD, from the random SNPs selected in our simulations.

  • Note that there are instances where the index SNP is not the top SNP. In these regions, I report the absolute Z-score for both the index SNP and the top SNP. The index_snp_pos column shows that the index SNP has the \(x\)’th highest absolute Z-score.
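Picking the top SNP and the index SNP’s position in the |Z| ordering can be sketched as follows (the Z-scores here are hypothetical):

```r
# Sketch: identify the top SNP (largest |Z|) and the index SNP's rank in
# the |Z| ordering (Z-scores are hypothetical stand-ins)
z <- c(rs1538171 = 3.1, rs2326451 = -5.6, rsX = 0.4)
top_snp       <- names(z)[which.max(abs(z))]  # the top SNP
index_snp_pos <- rank(-abs(z))["rs1538171"]   # index SNP's position by |Z|
top_snp
index_snp_pos
```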

##       index_snp    topSNP       topmaf    
##  [1,] "rs1538171"  "rs2326451"  0.4478586 
##  [2,] "rs62447205" "rs12719030" 0.4842805 
##  [3,] "rs10277986" "rs7780389"  0.04811343
##  [4,] "rs6476839"  "rs10974435" 0.4966785 
##  [5,] "rs705705"   "rs877636"   0.3487061 
##  [6,] "rs653178"   "rs3184504"  0.4862989 
##  [7,] "rs3024505"  "rs3024493"  0.1608732 
##  [8,] "rs1456988"  "rs4900384"  0.2775208 
##  [9,] "rs12927355" "rs741172"   0.3193143 
## [10,] "rs193778"   "rs6498177"  0.2447503 
## [11,] "rs12453507" "rs12150079" 0.3266698 
## [12,] "rs13415583" "rs2309837"  0.3919801 
## [13,] "rs402072"   "rs425105"   0.1660339 
## [14,] "rs6043409"  "rs2250261"  0.1570547 
## [15,] "rs11203202" "rs9981624"  0.3355872 
## [16,] "rs6518350"  "rs56132007" 0.1847568 
## [17,] "rs4820830"  "rs1468176"  0.3890339 
## [18,] "rs75793288" "rs6827756"  0.3625148
  • For each T1D region, I plot a histogram of the MAFs and mark on where the top SNP’s MAF falls (red line) and the average MAF (dashed line).

  • The title of each plot is the index SNP in the region, the claimed coverage and the corrected coverage of the 99% credible set.

  • I then obtain a set of SNPs with similar MAFs to that of the top SNP (+/- 0.05). From these, I see how many are in high LD with the top SNP (\(LD>0.8\)), medium LD (\(0.5<LD<0.8\)), lower LD (\(0.2<LD<0.5\)) and lowest LD (\(0<LD<0.2\)). The table below shows these results.
##      Index_SNP High_LD Mid_LD Low_LD Lowest_LD Total_Similar_MAF
## 1    rs2476601       2      2      9       164               177
## 2    rs1538171     103      9     79       101               292
## 3   rs62447205      13      6      5        35                59
## 4   rs10277986      29     12     10       118               169
## 5    rs6476839       7      0      2         7                16
## 6   rs61839660      16     12      0        34                62
## 7   rs12416116       6      2     11        21                40
## 8     rs917911       4      0      1        76                81
## 9     rs705705      13      0      0         0                13
## 10    rs653178       3      0      0         2                 5
## 11   rs9585056       1      2      8        12                23
## 12   rs3024505       4      2      2        58                66
## 13   rs1456988      23      1     12        12                48
## 14  rs56994090       3      0      0         9                12
## 15  rs72727394      17     12      1        23                53
## 16  rs34593439       4      0      1       103               108
## 17    rs151234      10      1      1         7                19
## 18  rs12927355      26      1     12        41                80
## 19    rs193778       5     30      5        67               107
## 20   rs8056814      23      6      0        72               101
## 21  rs12453507       3     13      8        53                77
## 22    rs757411      12     17      0         5                34
## 23  rs13415583       3      2     30       103               138
## 24   rs1893217      11     25      2        59                97
## 25   rs1615504      26      3      2         1                32
## 26    rs402072      11      3      0        12                26
## 27    rs516246      17      4      1         8                30
## 28   rs6043409       1      3     17        59                80
## 29  rs11203202       5      7      4        12                28
## 30   rs6518350       9      0      6        14                29
## 31   rs4820830      62     13      0        18                93
## 32    rs229533       5      0      3        30                38
## 33   rs2111485       2      2     42         1                47
## 34   rs3087243      39      0      0        46                85
## 35 rs113010081      52      5      0       377               434
## 36  rs75793288      22     18      0        50                90
## 37  rs11954020      13      0     24        32                69
## 38  rs72928038       2      3      0        10                15
## 39   rs2476601       2      2      9       164               177
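The MAF-window and LD-binning step above can be sketched as follows (the maf and LD-with-top-SNP vectors are simulated stand-ins; the cut points mirror the bins in the table):

```r
# Sketch: among SNPs with MAF within +/- 0.05 of the top SNP's MAF, bin by
# LD with the top SNP (maf and ld_top are hypothetical stand-ins)
set.seed(3)
maf    <- runif(200, 0.01, 0.5)          # per-SNP MAFs in the region
ld_top <- runif(200, 0, 1)               # LD of each SNP with the top SNP
topmaf <- 0.25
similar <- abs(maf - topmaf) <= 0.05     # the 'similar MAF' set
bins <- cut(ld_top[similar], breaks = c(0, 0.2, 0.5, 0.8, 1),
            labels = c("Lowest_LD", "Low_LD", "Mid_LD", "High_LD"))
table(bins)                              # counts as in the table above
```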

Final (ish) T1D Results

  • Combining what we have discovered for the T1D data, below is a final data frame showing information for each of the 38 regions reported in table 1 of the original paper (6 omitted).

  • NAs are reported when the correction is not needed.

  • For rs2476601 I would suggest using the matching nvar method (corrcov would now be 1).

  • rs113010081 is missing as this requires too much memory (~1300 SNPs).

##     index_SNP theirCS_nvar ourCS_nvar ourCS_claim ourCS_corr     95%CI_corr newCS_nvar newCS_corr diff_nvar
## 1  rs10277986           26         11       0.996  0.9763712 0.9722, 0.9802         14  0.9900244         3
## 2  rs11203202            5          4      0.9993  0.9956508 0.9934, 0.9980          4  0.9899901         0
## 3  rs11954020          136         83      0.9901  0.9937890 0.9920, 0.9959         75  0.9900860        -8
## 4  rs12927355           41         20      0.9919  0.9688299 0.9669, 0.9722         24  0.9900287         4
## 5    rs151234           NA         10      0.9902  0.9831869 0.9770, 0.9873         18  0.9898650         8
## 6   rs1538171          154         87      0.9903  0.9708851 0.9679, 0.9741        100  0.9900226        13
## 7   rs1615504           40         26      0.9906  0.9930152 0.9901, 0.9940         24  0.9900069        -2
## 8   rs1893217           35         11      0.9962  0.9453810 0.9427, 0.9497         29  0.9900648        18
## 9    rs193778          102         23      0.9903  0.9949010 0.9902, 0.9943         20  0.9898605        -3
## 10  rs2111485            4          4           1  0.9822650 0.9789, 0.9848          4  0.9900062         0
## 11   rs229533           15          8      0.9902  0.9948442 0.9949, 0.9988          5  0.9901647        -3
## 12  rs3024505            9          7       0.992  0.9926321 0.9903, 0.9956          6  0.9899121        -1
## 13  rs3087243           30         21      0.9915  0.9519229 0.9469, 0.9539         38  0.9900073        17
## 14 rs34593439            3          2      0.9913  0.9940004   0.991, 0.996          2  0.9896408         0
## 15   rs402072           25         14      0.9909  0.9959448 0.9910, 0.9955         11  0.9900473        -3
## 16  rs4820830          117         59      0.9905  0.9840735 0.9795, 0.9858         64  0.9900965         5
## 17   rs516246           36         12      0.9966  0.9865726 0.9827, 0.9884         12  0.9900208         0
## 18 rs56994090           11          8      0.9934  0.9983908 0.9954, 0.9982          7  0.9900007        -1
## 19  rs6043409           39         16      0.9902  0.9959000 0.9947, 0.9981          9  0.9900867        -7
## 20 rs61839660            5          4      0.9985  0.9507457 0.9454, 0.9579          7  0.9901005         3
## 21 rs62447205           56         40      0.9902  0.9799506   0.978, 0.982         57  0.9900367        17
## 22  rs6476839           21         18      0.9947  0.9941413 0.9918, 0.9954         17  0.9900715        -1
## 23  rs6518350           68         41      0.9903  0.9990129 0.9951, 0.9989         15  0.9900782       -26
## 24   rs653178            5          2      0.9996  0.8479623 0.8302, 0.8431          5  0.9901032         3
## 25   rs705705           22          9      0.9932  0.9254648 0.9238, 0.9317         19  0.9900180        10
## 26 rs72727394           32         19      0.9904  0.9952338 0.9923, 0.9966         19  0.9900978         0
## 27 rs75793288           34         22      0.9941  0.9770497 0.9682, 0.9779         25  0.9900484         3
## 28  rs8056814           22          8      0.9921  0.9700932 0.9592, 0.9704         13  0.9906368         5
## 29   rs917911          135        133        0.99  0.9893961 0.9916, 0.9971         54  0.9898731       -79
## 30  rs9585056            7          3      0.9908  0.9979568 0.9959, 0.9999          1  0.9907311        -2
## 31  rs2476601            2          2           1  0.8034561 0.7955, 0.8097          2  0.9480000         0
## 32 rs12416116            7          6      0.9915  0.9886270 0.9887, 0.9959         NA         NA        NA
## 33 rs12453507          112         65      0.9904  0.9901308   0.989, 0.993         NA         NA        NA
## 34 rs13415583          139         92      0.9902  0.9897138 0.9874, 0.9913         NA         NA        NA
## 35  rs1456988           40         26      0.9954  0.9907794 0.9880, 0.9922         NA         NA        NA
## 36 rs72928038           56         36      0.9901  0.9951392 0.9884, 0.9947         NA         NA        NA
## 37   rs757411           82         54      0.9901  0.9903273 0.9887, 0.9938         NA         NA        NA

Standard simulations for bigger OR

  • Whilst the median value of our corrected coverage estimate may not be as good as the claimed coverage, the range of values mostly falls within that of the claimed coverage.

  • Binning by true mu value (or muhat value or max Z) looks even better!


Next Steps

  • Decide what to do with the MAF and LD comparison section.

  • Finalise T1D results and make into .CSV file, reporting which method was used.


Paper summary

“The good, the bad and the ugly: Bayesian model selection produces spurious posterior probabilities for phylogenetic trees”

Star-tree paradox

  • A particular binary phylogenetic tree (a binary tree has exactly 2 descendants from each interior node) gets a very high posterior probability, even though a star tree was used to generate the data.

  • For example, if we generate some large dataset from a star tree and calculate the posterior probabilities that the data came from each of 3 binary trees, then we would expect each posterior probability to be 1/3, as the data contains no information favouring any of these trees. But in fact, the pps tended to be spurious and sometimes very highly supportive of one binary tree model.

True Model Included

  • If the true model is included in the comparison, then the posterior probabilities of the models are well behaved:
    • As \(n \longrightarrow \infty\) the true model dominates and its pp tends to 1.
    • If there are a few equally right models then the model with the fewest parameters will dominate.
  • But in the real world there is never a completely true model (models are simplified representations of the physical world). Hence, the focus of the paper is on the asymptotics of pps when the true model is not included in the comparison.

Set-up

  • Sample some sets of data from the true model.

  • Consider two models to compare, \(H_1\) and \(H_2\) (neither being the true model and with equal dimensions - i.e. same number of free parameters).

  • Use a uniform prior for each model with value \(1/2\).

  • \(P_K\) is the posterior model probability for model \(K\) (i.e. \(P_1=Pr(H_1|x)\)).

  • For a measure of how good/bad the model is, they consider the Kullback-Leibler (KL) distance between each model and the true model: \(D_1\) for model 1 and \(D_2\) for model 2.
    • \(D=0\): Right model
    • \(D>0\): Wrong model
    • The closer to 0, the better the model.

Types of Asymptotic Behaviour

1. ‘The Good’

  • \(P_1\) converges to a single reasonable value (not 0 or 1).

  • Then we can say something like “for every large dataset, the posterior probability of this model is approximately this value”.

  • This occurs when we compare two identical models or overlapping models where the best fitting parameter values lie in the region of overlap.

  • Whether the models are right or wrong does not matter.

  • Example of overlapping models: the truth is \(p=1/2\), with \(H_1:0.4<p<0.6\) and \(H_2:0<p<1\). We assign a uniform prior on \(p\) in each model; then as \(n \longrightarrow \infty\), \(P_1 \longrightarrow \frac{1}{1+0.2}=\frac{5}{6}\), i.e. the more informative model \(H_1\) is favoured.

  • The rationale is that one would like a sure answer given an infinite amount of data.

  • Not really relevant, as you wouldn’t normally compare identical or overlapping models…
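The overlapping-models limit \(P_1 \longrightarrow 5/6\) can be checked numerically; a sketch (rescaling the likelihood by its maximum so the integrals do not underflow):

```r
# Numerical check of the overlapping-models example: truth p = 1/2,
# H1: 0.4 < p < 0.6 (prior density 5 on p), H2: 0 < p < 1 (prior density 1)
n <- 2000; k <- n / 2                   # a large dataset with half successes
f <- function(p) exp(k * log(2 * p) + (n - k) * log(2 * (1 - p)))  # L(p)/L(1/2)
M1 <- 5 * integrate(f, 0.4, 0.6)$value  # (rescaled) marginal likelihood, H1
M2 <- 1 * integrate(f, 0, 1)$value      # (rescaled) marginal likelihood, H2
M1 / (M1 + M2)                          # close to 1/(1 + 0.2) = 5/6
```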

2. ‘The Bad’

  • \(P_1\) converges to a ‘nondegenerate’ (non-constant) distribution, so that \(P_1\) fluctuates among datasets like a random number (strong support may be attached to a certain model in some datasets).

  • So that if we compare two equally right or equally wrong models (for data from the same true model), \(P_1\) varies among datasets according to a nondegenerate distribution.

  • Two scenarios:
    1. Both models are right (\(D_1=D_2=0\)).
    2. Both models are equally wrong (\(D_1=D_2>0\)) but indistinct. (Indistinct means that ‘in infinite data, the two models make essentially the same predictions about the data and are unidentifiable’ -> at the best-fitting parameter values, the models are unidentifiable.)

3. ‘The Ugly’

  • \(P_1\) has a degenerate two-point distribution at values 0 and 1 (a step function?), so that in some datasets we will favour model 1 with full confidence and in others, model 2 with full confidence.

  • 0 half the time and 1 the other half.

  • Occurs when the models are equally wrong but also distinct.

  • E.g. true \(p\) is 0.5 (i.e. coin tossing).
    • Compare \(H_1:p=0.4\) and \(H_2:p=0.6\) (the two models are equally wrong).
    • Assign a uniform prior to the two models (1/2 each) and calculate the model pp.
    • Find that in large datasets, moderate posterior probabilities are rare and that either \(H_1\) or \(H_2\) will be favoured with \(pp>0.99\)!
    • If \(H_1\) is changed slightly to \(H_1:p=0.42\), then model 1 is slightly less wrong than model 2. In large datasets, the less wrong model will eventually win, but the more wrong model can actually receive some high support. The method becomes overconfident before it becomes reliable (it supports the wrong model with high posterior too often).
    • Note that we only see this phenomenon when the two models oppose each other.
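‘The ugly’ is easy to reproduce by simulation: generate data from \(p=0.5\) and compare \(H_1:p=0.4\) against \(H_2:p=0.6\) with equal prior weight:

```r
# Simulating 'the ugly': coin tosses with true p = 0.5, comparing the equally
# wrong, distinct models H1: p = 0.4 and H2: p = 0.6 (equal prior weight)
set.seed(4)
n  <- 10000
P1 <- replicate(500, {
  k  <- rbinom(1, n, 0.5)                  # heads generated under the truth
  l1 <- k * log(0.4) + (n - k) * log(0.6)  # log-likelihood under H1
  l2 <- k * log(0.6) + (n - k) * log(0.4)  # log-likelihood under H2
  1 / (1 + exp(l2 - l1))                   # posterior probability of H1
})
mean(P1 > 0.99 | P1 < 0.01)  # most datasets give near-certain support either way
```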

Concluding remarks

  • The asymptotic behaviour is determined by whether or not the compared models are distinct, and not by whether they are both right or both wrong(!!!), or by whether they have unknown parameters.

  • This is especially problematic in phylogenetics because here the phylogeny (the model) is of primary interest, not the branch lengths (the parameters). Thus it may be less of a problem in other statistical analyses where the model is not of primary interest.


SOMX and MASAMB talk plan

  • Section: Intro

    • Slide: Explain that this is a more detailed version of my BSU together talk, with a bit of real world application added.
    • (Perhaps a slide explaining my terminology would be helpful? Claimed coverage/size/corrected etc?)
  • Section: Bayesian approach for fine-mapping:

    • Slide: 1 CV assumption - how this enables us to obtain the BFs for each SNP being causal in a region and conveniently convert these to pps, using only the marginal summary statistics for each variant in turn. Z-scores -> ABF -> pps.
    • Slide: The method -> order the pps and cumulatively sum them until some threshold is exceeded. These SNPs form your ‘99% credible set’. Include inserts from papers using this statement.
  • Section: The problem

    • Slide: Through simulations we identified that these coverage estimates show systematic bias.
    • Slide: More plots.
  • Section: My fix

    • Slide: I have developed a method to obtain an accurate corrected coverage estimate that hopefully people can report in their studies instead of the claimed coverage (size of cred set). Steps (use flow chart thing to explain):
      • Use the joint Z-scores to obtain the marginal Z-scores. (includes estimating the true effect at the CV which I’ll come back to on the next slide).
      • Simulate more Z-score systems from the MVN distribution.
      • Form credible sets from each of these simulated systems and report whether the CV was contained in the set. Find the proportion of simulations containing the CV.
      • Multiply this proportion by the CV’s pp (talk about how this downweights the estimate from unlikely CVs considered causal).
      • Obviously we don’t know the CV so repeat this for each SNP considered as the CV. Sum all of these to get the final corrected coverage estimate.
    • Slide: Estimating mu - started by using the maximum Z score in the system, but this is a very bad estimate in low power scenarios because we are using the absolute Z-scores (is this correct? Ask Chris). Show nice plots.
    • Slide: Relative error plots showing how amazing my correction is.
    • Slide: R package. Explain the key things it can do (provide a corrected coverage estimate and provide a new credible set - the smallest set of variants with the desired coverage of the true CV). Talk about the input parameters it needs (only summary stats - Z and MAFs, or beta and V).
    • Slide: Real data - briefly discuss its use on the T1D data - perhaps use the rs9585056 example, where I have narrowed the credible set down from 3 variants to 1.
  • Section: Closing remarks

    • Slide: Importance - improves the method of identifying sets of putative CVs to hand over to the biologists. Saves time, money and resources in the wet-lab follow-up of the putative CVs. Can inform researchers if their credible set is ‘over-’/‘under-confident’ of containing the CV. A biologist may be less likely to commit lots of time to the follow-up if you’re actually only 60% sure you’ve captured the CV, as opposed to 90%.
    • Slide: What next? Finalise T1D application (there are some problematic regions), write a paper, later in my PhD I’d like to apply this to other biological omics datasets (such as capture Hi-C).

Should I include a slide showing what happens when the 1 CV assumption is dropped? Could always include this as a later slide and show it if anyone asks.

Need to perhaps have an idea of how I could apply this to CHi-C…

Plots to redo:

  • Line graphs showing systematic bias - include higher power plots.
  • Rel error of my correction - decide which one to use.