Data file: ic_t1d_onengut_allresults.csv

- Use immunochip-regions.txt to restrict the ic_t1d_onengut_allresults.csv file to chr1:113863087-114527968 (694 SNPs).
- Use the aligned9.build37.RData file to get the LD matrix for these SNPs, dropping the 55 SNPs which do not appear in this file (639 SNPs remain): Sigma <- ld(Xsub, Xsub, stat="R")
## $Credset
## name.rs pp
## 1 rs2476601 0.8640309
## 2 rs6679677 0.1359691
##
## $Claimed
## [1] 1
##
## $Corrected
## [1] 0.8035921
##
## $nSNPs_in_region
## [1] 639
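The SNP-matching and LD step above can be sketched as follows. The objects `summ` and `X` are simulated stand-ins for the summary statistics and the reference genotype matrix, and base-R `cor()` stands in for `snpStats::ld(..., stat = "R")`, so this illustrates the matching logic rather than the actual pipeline.

```r
# Simulated stand-ins: 694 SNPs in the summary stats, 639 in the reference
set.seed(1)
rsids <- paste0("rs", 1:694)
summ  <- data.frame(rsid = rsids, z = rnorm(694))
X     <- matrix(rbinom(200 * 639, 2, 0.3), nrow = 200,
                dimnames = list(NULL, rsids[1:639]))  # 55 SNPs missing

keep  <- summ$rsid %in% colnames(X)    # drop the 55 unmatched SNPs
summ  <- summ[keep, ]
Xsub  <- X[, summ$rsid, drop = FALSE]  # align column order to summ
Sigma <- cor(Xsub)                     # correlation (R) LD matrix
```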
- Use the reference data in share/Projects/abc/input_files/refdatasets/datafilesD/ restricted to chr1:113619999-114460000 and convert the SNP names to rs IDs. Remove variants whose names cannot be matched (583 SNPs left).
- Match these to the ic_t1d_onengut_allresults.csv file (640 SNPs): LD <- cor2(x)
## $Credset
## name.rs pp
## 1 rs2476601 0.8640309
## 2 rs6679677 0.1359691
##
## $Claimed
## [1] 1
##
## $Corrected
## [1] 0.889136
##
## $nSNPs_in_region
## [1] 520
In order to investigate whether the very low corrected coverage value is due to the observed data (good) or the reference data (bad), I proceed as follows:
1. Find the SNPs common to the two methods above (495 SNPs present in both).
2. Use each of the two (differing) sets of LD data to perform the correction.
3. Apply the corrected credible set method. Do both get stuck in a loop? Yes.
Results for the Build 37 and Build 36 methods:
## $build37_claim
## [1] 1
##
## $build37_corr
## [1] 0.882864
##
## $build37_reqthr
## [1] 1
##
## $build37_newCS_claim
## [1] 1
##
## $build37_newCS_corr
## [1] 0.989864
##
## $build36_claim
## [1] 1
##
## $build36_corr
## [1] 0.8977281
##
## $build36_reqthr
## [1] 1
##
## $build36_newCS_claim
## [1] 1
##
## $build36_newCS_corr
## [1] 0.989864
This shows that the loop is caused by the method and the observed data, not by the reference data.
Using this set of 495 SNPs increases the corrected coverage of the original 99% credible set from 80% to ~90%.
Recall that last week I found that this corrected coverage estimate was so low because, fairly often, the other high-pp SNP was given the bulk of the posterior probability. This led to simulated systems with claimed coverage = nvar = 1 and covered = 0.
Consequently, we consider limiting the simulations used in the correction to those with nvar = 2, which would reflect the standard 99% credible set containing 2 variants. This is explored in the next section.
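As a toy sketch of the nvar-matching idea (the `sims` data frame here is simulated, not the real T1D simulation output): the standard corrected coverage averages the covered indicator over all simulated credible sets, whereas the matching estimate averages only over simulations whose credible set contains nvar = 2 variants.

```r
# Simulated stand-in for the correction's simulation output: each row is one
# simulated credible set, with its size (nvar) and whether it covered the CV
set.seed(42)
sims <- data.frame(nvar    = sample(1:3, 5000, replace = TRUE,
                                    prob = c(0.45, 0.45, 0.10)),
                   covered = rbinom(5000, 1, 0.9))

corr_cov       <- mean(sims$covered)                  # standard estimate
corr_cov_match <- mean(sims$covered[sims$nvar == 2])  # nvar-matched estimate
nsims_match    <- sum(sims$nvar == 2)                 # how many sims matched
```

Note that `nsims_match` can be very small for large credible sets, which is exactly the instability seen for the rs113010081 region below.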
This section investigates how the corrected coverage estimate performs when only simulations with the correct number of variants in the credible set are considered.
For the rs2476601 region, when I limit my correction to only use those simulations with 2 variants in the credible set, the corrected coverage increases from 80% to 1, which is much better. We can think of this as only using the simulations that reflect what happened in 'real life'.
I calculate this corr_cov_match estimate for each of the T1D region credible sets. I also report the difference between the corr_cov estimate and the corr_cov_match estimate. Positive values indicate that the matching method has decreased the corrected coverage estimate, whilst negative values indicate that it has increased the estimate.
Notice that the index SNP rs113010081 region (row 3) looks particularly bad. This is probably because the credible set contains 120 variants and very few simulated credible sets contain exactly this number of variants.
## index_SNP claim nvar corr_cov corr_cov_match nsims_match diff
## 1: rs10277986 0.9960149 11 0.9754023 0.9932061 4, 5, 7, 2,11,11,... -0.0178037582
## 2: rs11203202 0.9993207 4 0.9955823 0.9878488 409,248,248,785,372,196,... 0.0077334519
## 3: rs113010081 0.9901508 120 0.9941965 0.6856924 0,0,4,0,0,0,... 0.3085041361
## 4: rs11954020 0.9901063 83 0.9921446 0.9770868 0,2,0,1,4,2,... 0.0150578593
## 5: rs12416116 0.9915243 6 0.9908959 0.9982534 1,11, 0, 0,71,40,... -0.0073575682
## 6: rs12453507 0.9904096 65 0.9915284 0.998754 1,1,3,2,2,0,... -0.0072256020
## 7: rs12927355 0.9918667 20 0.9705063 0.9773273 0,0,0,0,0,0,... -0.0068210222
## 8: rs13415583 0.9901788 92 0.9901757 0.9685234 0,0,1,2,0,5,... 0.0216523560
## 9: rs1456988 0.9953685 26 0.9836469 0.9967821 3,50,12,12,11,12,... -0.0131351791
## 10: rs151234 0.99019 10 0.9859141 0.9884305 2,0,0,1,1,0,... -0.0025164851
## 11: rs1538171 0.9903378 87 0.9735812 0.9659006 1,0,0,0,0,0,... 0.0076805895
## 12: rs1615504 0.9905884 26 0.9907459 0.9961016 4,10,19,18,19,12,... -0.0053557155
## 13: rs1893217 0.9962419 11 0.9463931 0.9996313 0,0,1,1,0,0,... -0.0532381813
## 14: rs193778 0.990299 23 0.9929865 0.9987876 5,2,2,0,9,4,... -0.0058011249
## 15: rs2111485 0.9999631 4 0.9795897 0.9959734 15,33, 0, 4, 3, 0,... -0.0163836654
## 16: rs229533 0.9902057 8 0.9961789 0.9998258 6, 3, 2, 1,18,29,... -0.0036469689
## 17: rs2476601 1 2 0.8042806 1 13,181,567,567,577,579,... -0.1957193821
## 18: rs3024505 0.9919685 7 0.9908633 0.9993566 7, 4, 1,23,24, 6,... -0.0084933366
## 19: rs3087243 0.9914763 21 0.9457756 0.982225 0,2,2,2,2,2,... -0.0364494018
## 20: rs34593439 0.9913195 2 0.9927623 0.9952842 50, 37, 37,712,717,353,... -0.0025219006
## 21: rs402072 0.9909357 14 0.9913847 0.9999948 3,8,5,2,5,2,... -0.0086101698
## 22: rs4820830 0.9904556 59 0.9853841 0.9947404 0,0,1,1,3,0,... -0.0093562809
## 23: rs516246 0.9965928 12 0.9876407 0.9953469 2,2,3,0,3,0,... -0.0077062257
## 24: rs56994090 0.9934261 8 0.9976826 0.9999007 3, 1, 5,51, 7, 9,... -0.0022181060
## 25: rs6043409 0.9901574 16 0.996303 0.9991871 0,4,0,0,0,0,... -0.0028841086
## 26: rs61839660 0.9985256 4 0.9469757 0.969185 0,0,3,0,0,1,... -0.0222092458
## 27: rs62447205 0.9901583 40 0.9797742 0.9949124 1,2,0,0,1,0,... -0.0151381790
## 28: rs6476839 0.9947365 18 0.9931858 0.9994083 1,2,1,2,2,0,... -0.0062225049
## 29: rs6518350 0.9902603 41 0.9966605 0.996814 3,1,3,3,2,3,... -0.0001535152
## 30: rs653178 0.9995797 2 0.8324161 0.8645506 0,374,372,517,369,668,... -0.0321345206
## 31: rs705705 0.9932401 9 0.9266029 0.954514 99,100, 31, 0, 0, 31,... -0.0279110792
## 32: rs72727394 0.9904069 19 0.9953029 0.9974931 0, 0, 1, 1, 5,12,... -0.0021902282
## 33: rs72928038 0.9900544 36 0.9923231 0.9846845 1, 3, 3, 0,10, 0,... 0.0076385507
## 34: rs757411 0.990144 54 0.9874525 0.9923193 0,2,4,2,2,3,... -0.0048667555
## 35: rs75793288 0.9940996 22 0.9761336 0.9843889 5,0,0,0,0,0,... -0.0082553002
## 36: rs8056814 0.9920644 8 0.9649084 0.9955263 0,1,0,0,0,0,... -0.0306178333
## 37: rs917911 0.990001 133 0.9908669 0.940162 0,1,0,0,0,0,... 0.0507049444
## 38: rs9585056 0.9907895 3 0.9989341 0.9967159 115, 0, 90, 59, 43, 0,... 0.0022182763
## index_SNP claim nvar corr_cov corr_cov_match nsims_match diff
To investigate the wider utility of this method, I rerun my standard simulations to include the corr_cov_match estimate.
It seems that the new method is not better than our original correction (results shown for nvar <= 3). Note that some of these relative error bars are based on only a very small number of data points. For example, the thr = 0.99, OR = 1.1, N0 = 5000 facet (where corrected_match looks very good) is based on only 3 data points.
This method does not seem reliable enough to replace our standard corrected coverage method. Perhaps it could be used as an alternative for problematic corrected coverage estimates (e.g. for rs2476601).
I am not sure how to incorporate this into the R package, as it would be at the user's discretion to use this method and I wouldn't recommend it.
The idea of this section is to decide when we should use our corrected credible set method. For example, if the corrected coverage of the standard 99% credible set is ~99% then there is no point in providing a new credible set.
For each T1D region, I simulate 100 corrected coverage estimates using different ERR matrices and use these to form 95% (or 50%) 'confidence regions'. If the threshold/desired coverage falls within this region, then there is no need to form a new credible set.
This method should make our results more robust.
I run my code 100 times, varying the ERR and saving the corrected coverage estimate. Note that I am correcting the coverage of the standard 99% credible set of 133 variants with claimed coverage 0.990001.
In my original simulations, the corrected coverage of this credible set was 0.9893961, yet it seems that none of the 100 replicates are this low (?)
The 50% 'confidence interval' (25%–75% quantiles) and the 95% 'confidence interval' (5%–95% quantiles) are:
## 25% 75%
## 0.9932352 0.9959700
## 5% 95%
## 0.9916436 0.9974147
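The decision rule can be sketched as follows; `reps` is a simulated stand-in for the 100 ERR-varying corrected coverage estimates (the real values come from rerunning the correction, not from `rnorm`), and the quantile pairs mirror the 25%/75% and 5%/95% output above.

```r
# Stand-in for 100 replicate corrected coverage estimates of one region
set.seed(7)
reps <- rnorm(100, mean = 0.9945, sd = 0.0015)

ci50 <- quantile(reps, c(0.25, 0.75))  # 50% 'confidence interval'
ci95 <- quantile(reps, c(0.05, 0.95))  # 95% 'confidence region'

thr <- 0.99  # threshold / desired coverage
# only form a new credible set if thr falls outside the region
needs_new_cs <- thr < ci95[1] || thr > ci95[2]
```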
## index_snp corrcov CI95 CI50
## 1: rs10277986 0.978764 0.9721906,0.9802149 0.9741556,0.9780174
## 2: rs11203202 0.995303 0.9933562,0.9979646 0.9949507,0.9969793
## 3: rs11954020 0.995122 0.9920448,0.9958591 0.9934685,0.9950042
## 4: rs12416116 0.9886482 0.9887499,0.9959496 0.9907356,0.9940735
## 5: rs12453507 0.992248 0.9889613,0.9930303 0.9908088,0.9921349
## 6: rs12927355 0.9696758 0.9668940,0.9721975 0.9684114,0.9707650
## 7: rs13415583 0.9899163 0.9874252,0.9913187 0.9884464,0.9900265
## 8: rs1456988 0.9887643 0.9880320,0.9922099 0.9888669,0.9909024
## 9: rs151234 0.9865615 0.9770208,0.9872931 0.9819391,0.9849174
## 10: rs1538171 0.9695779 0.9679499,0.9740948 0.969232,0.972303
## 11: rs1615504 0.9911848 0.9901151,0.9939660 0.9912258,0.9931430
## 12: rs1893217 0.9445883 0.9427223,0.9496506 0.944452,0.947910
## 13: rs193778 0.9908382 0.9902298,0.9942834 0.9913912,0.9932437
## 14: rs2111485 0.9792549 0.9788722,0.9847620 0.9807153,0.9830364
## 15: rs229533 0.9981773 0.9949370,0.9987852 0.9962214,0.9978486
## 16: rs2476601 0.8005921 0.7954517,0.8097113 0.7998783,0.8058585
## 17: rs3024505 0.9892239 0.9902502,0.9955760 0.991483,0.994281
## 18: rs3087243 0.9486954 0.9468846,0.9538603 0.9489227,0.9516272
## 19: rs34593439 0.9912887 0.9910036,0.9959825 0.9925432,0.9950101
## 20: rs402072 0.9930691 0.9909713,0.9954966 0.9925727,0.9944596
## 21: rs4820830 0.9838468 0.9795138,0.9857821 0.9812997,0.9839691
## 22: rs516246 0.9867699 0.9827057,0.9884294 0.9848526,0.9869850
## 23: rs56994090 0.996886 0.9954066,0.9982024 0.9963256,0.9976138
## 24: rs6043409 0.9964087 0.9946749,0.9980954 0.9960422,0.9973196
## 25: rs61839660 0.9521729 0.9453772,0.9578613 0.9485344,0.9542905
## 26: rs62447205 0.9807707 0.9780220,0.9819977 0.979047,0.981009
## 27: rs6476839 0.9947377 0.9917515,0.9953835 0.9927836,0.9943124
## 28: rs6518350 0.9983014 0.9951457,0.9988502 0.9964435,0.9979501
## 29: rs653178 0.8325501 0.8302166,0.8430690 0.8340166,0.8390162
## 30: rs705705 0.9285993 0.9237517,0.9316649 0.9270735,0.9297125
## 31: rs72727394 0.9942578 0.9923124,0.9966216 0.9937476,0.9955544
## 32: rs72928038 0.9922407 0.9884362,0.9947106 0.9907878,0.9932962
## 33: rs757411 0.993689 0.9886599,0.9937992 0.9899442,0.9922903
## 34: rs75793288 0.9728269 0.9681660,0.9778688 0.9708571,0.9756968
## 35: rs8056814 0.973799 0.9591791,0.9704493 0.9625117,0.9674740
## 36: rs917911 0.9951829 0.9915600,0.9970556 0.9930566,0.9959115
## 37: rs9585056 0.9989179 0.9959032,0.9999277 0.9969796,0.9989419
## index_snp corrcov CI95 CI50
The following table shows the correction for the regions that this method implies should be corrected, i.e. those where 0.99 does not fall into the 95% confidence region.
We see that these are all well behaved, with more variants being added if the corrected coverage is less than 0.99 and variants being removed if it is greater than 0.99 (except rs917911 which we are aware of).
In particular, notice index SNP rs9585056 (row 36): our correction has reduced the credible set size from 3 variants to just 1!
## index_SNP their_cs_nvar our_cs_nvar our_cs_corrcov new_cs_nvar new_corrcov nvar.diff
## 1 rs10277986 26 11 0.9763712 14 0.9900244 3
## 2 rs11203202 5 4 0.9956508 4 0.9899901 0
## 3 rs11954020 136 83 0.9937890 75 0.9900860 -8
## 6 rs12927355 41 20 0.9688299 24 0.9900287 4
## 9 rs151234 NA 10 0.9831869 18 0.9898650 8
## 10 rs1538171 154 87 0.9708851 100 0.9900226 13
## 11 rs1615504 40 26 0.9930152 24 0.9900069 -2
## 12 rs1893217 35 11 0.9453810 29 0.9900648 18
## 13 rs193778 102 23 0.9949010 20 0.9898605 -3
## 14 rs2111485 4 4 0.9822650 4 0.9900062 0
## 15 rs229533 15 8 0.9948442 5 0.9901647 -3
## 16 rs3024505 9 7 0.9926321 6 0.9899121 -1
## 17 rs3087243 30 21 0.9519229 38 0.9900073 17
## 18 rs34593439 3 2 0.9940004 2 0.9896408 0
## 19 rs402072 25 14 0.9959448 11 0.9900473 -3
## 20 rs4820830 117 59 0.9840735 64 0.9900965 5
## 21 rs516246 36 12 0.9865726 12 0.9900208 0
## 22 rs56994090 11 8 0.9983908 7 0.9900007 -1
## 23 rs6043409 39 16 0.9959000 9 0.9900867 -7
## 24 rs61839660 5 4 0.9507457 7 0.9901005 3
## 25 rs62447205 56 40 0.9799506 57 0.9900367 17
## 26 rs6476839 21 18 0.9941413 17 0.9900715 -1
## 27 rs6518350 68 41 0.9990129 15 0.9900782 -26
## 28 rs653178 5 2 0.8479623 5 0.9901032 3
## 29 rs705705 22 9 0.9254648 19 0.9900180 10
## 30 rs72727394 32 19 0.9952338 19 0.9900978 0
## 33 rs75793288 34 22 0.9770497 25 0.9900484 3
## 34 rs8056814 22 8 0.9700932 13 0.9906368 5
## 35 rs917911 135 133 0.9893961 54 0.9898731 -79
## 36 rs9585056 7 3 0.9979568 1 0.9907311 -2
## 161 rs2476601 2 2 0.8034561 2 0.9480000 0
To further investigate the discrepancy between what we see in our simulations and what happens in the real data, I investigate the top SNP (the SNP with the largest absolute Z-score) from each T1D region. This will help us to understand how the true CVs may differ from the random SNPs selected in our simulations in terms of MAF and LD.
Note that there are instances when the index SNP is not the top SNP. In these regions, I report the absolute Z score for both the index SNP and the top SNP. The index_snp_pos column shows that the index SNP has the \(x\)'th highest absolute Z score.
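The top-SNP lookup described above can be sketched for one region as follows; `region` and `index_snp` are made up for illustration, with hypothetical column names.

```r
# Made-up region of 100 SNPs with Z-scores
set.seed(3)
region <- data.frame(rsid = paste0("rs", 1:100), z = rnorm(100, sd = 3))
region$absz <- abs(region$z)

top_snp       <- region$rsid[which.max(region$absz)]  # largest |Z|
index_snp     <- "rs42"                               # hypothetical index SNP
# rank of the index SNP's |Z| (1 = top SNP)
index_snp_pos <- rank(-region$absz)[region$rsid == index_snp]
```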
## index_snp topSNP topmaf
## [1,] "rs1538171" "rs2326451" 0.4478586
## [2,] "rs62447205" "rs12719030" 0.4842805
## [3,] "rs10277986" "rs7780389" 0.04811343
## [4,] "rs6476839" "rs10974435" 0.4966785
## [5,] "rs705705" "rs877636" 0.3487061
## [6,] "rs653178" "rs3184504" 0.4862989
## [7,] "rs3024505" "rs3024493" 0.1608732
## [8,] "rs1456988" "rs4900384" 0.2775208
## [9,] "rs12927355" "rs741172" 0.3193143
## [10,] "rs193778" "rs6498177" 0.2447503
## [11,] "rs12453507" "rs12150079" 0.3266698
## [12,] "rs13415583" "rs2309837" 0.3919801
## [13,] "rs402072" "rs425105" 0.1660339
## [14,] "rs6043409" "rs2250261" 0.1570547
## [15,] "rs11203202" "rs9981624" 0.3355872
## [16,] "rs6518350" "rs56132007" 0.1847568
## [17,] "rs4820830" "rs1468176" 0.3890339
## [18,] "rs75793288" "rs6827756" 0.3625148
For each T1D region, I plot a histogram of the MAFs and mark on where the top SNP’s MAF falls (red line) and the average MAF (dashed line).
The title of each plot is the index SNP in the region, the claimed coverage and the corrected coverage of the 99% credible set.
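A minimal sketch of one such plot, with `maf` and `top_maf` as simulated placeholders for the region's MAFs and the top SNP's MAF (the plot title would carry the real index SNP and coverage values):

```r
# Simulated MAFs for one region
set.seed(5)
maf     <- runif(500, 0.01, 0.5)
top_maf <- 0.45  # placeholder for the top SNP's MAF

h <- hist(maf, breaks = 30, xlab = "MAF",
          main = "index SNP: claimed / corrected coverage")
abline(v = top_maf, col = "red")  # top SNP's MAF
abline(v = mean(maf), lty = 2)    # average MAF
```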
## Index_SNP High_LD Mid_LD Low_LD Lowest_LD Total_Similar_MAF
## 1 rs2476601 2 2 9 164 177
## 2 rs1538171 103 9 79 101 292
## 3 rs62447205 13 6 5 35 59
## 4 rs10277986 29 12 10 118 169
## 5 rs6476839 7 0 2 7 16
## 6 rs61839660 16 12 0 34 62
## 7 rs12416116 6 2 11 21 40
## 8 rs917911 4 0 1 76 81
## 9 rs705705 13 0 0 0 13
## 10 rs653178 3 0 0 2 5
## 11 rs9585056 1 2 8 12 23
## 12 rs3024505 4 2 2 58 66
## 13 rs1456988 23 1 12 12 48
## 14 rs56994090 3 0 0 9 12
## 15 rs72727394 17 12 1 23 53
## 16 rs34593439 4 0 1 103 108
## 17 rs151234 10 1 1 7 19
## 18 rs12927355 26 1 12 41 80
## 19 rs193778 5 30 5 67 107
## 20 rs8056814 23 6 0 72 101
## 21 rs12453507 3 13 8 53 77
## 22 rs757411 12 17 0 5 34
## 23 rs13415583 3 2 30 103 138
## 24 rs1893217 11 25 2 59 97
## 25 rs1615504 26 3 2 1 32
## 26 rs402072 11 3 0 12 26
## 27 rs516246 17 4 1 8 30
## 28 rs6043409 1 3 17 59 80
## 29 rs11203202 5 7 4 12 28
## 30 rs6518350 9 0 6 14 29
## 31 rs4820830 62 13 0 18 93
## 32 rs229533 5 0 3 30 38
## 33 rs2111485 2 2 42 1 47
## 34 rs3087243 39 0 0 46 85
## 35 rs113010081 52 5 0 377 434
## 36 rs75793288 22 18 0 50 90
## 37 rs11954020 13 0 24 32 69
## 38 rs72928038 2 3 0 10 15
## 39 rs2476601 2 2 9 164 177
Combining what we have discovered for the T1D data, below is a final data frame showing information for each of the 38 regions reported in table 1 of the original paper (6 omitted).
NAs are reported when the correction is not needed.
For rs2476601 I would suggest using the matching nvar method (corrcov would now be 1).
rs113010081 is missing as this requires too much memory (~1300 SNPs).
## index_SNP theirCS_nvar ourCS_nvar ourCS_claim ourCS_corr 95%CI_corr newCS_nvar newCS_corr diff_nvar
## 1 rs10277986 26 11 0.996 0.9763712 0.9722, 0.9802 14 0.9900244 3
## 2 rs11203202 5 4 0.9993 0.9956508 0.9934, 0.9980 4 0.9899901 0
## 3 rs11954020 136 83 0.9901 0.9937890 0.9920, 0.9959 75 0.9900860 -8
## 4 rs12927355 41 20 0.9919 0.9688299 0.9669, 0.9722 24 0.9900287 4
## 5 rs151234 NA 10 0.9902 0.9831869 0.9770, 0.9873 18 0.9898650 8
## 6 rs1538171 154 87 0.9903 0.9708851 0.9679, 0.9741 100 0.9900226 13
## 7 rs1615504 40 26 0.9906 0.9930152 0.9901, 0.9940 24 0.9900069 -2
## 8 rs1893217 35 11 0.9962 0.9453810 0.9427, 0.9497 29 0.9900648 18
## 9 rs193778 102 23 0.9903 0.9949010 0.9902, 0.9943 20 0.9898605 -3
## 10 rs2111485 4 4 1 0.9822650 0.9789, 0.9848 4 0.9900062 0
## 11 rs229533 15 8 0.9902 0.9948442 0.9949, 0.9988 5 0.9901647 -3
## 12 rs3024505 9 7 0.992 0.9926321 0.9903, 0.9956 6 0.9899121 -1
## 13 rs3087243 30 21 0.9915 0.9519229 0.9469, 0.9539 38 0.9900073 17
## 14 rs34593439 3 2 0.9913 0.9940004 0.991, 0.996 2 0.9896408 0
## 15 rs402072 25 14 0.9909 0.9959448 0.9910, 0.9955 11 0.9900473 -3
## 16 rs4820830 117 59 0.9905 0.9840735 0.9795, 0.9858 64 0.9900965 5
## 17 rs516246 36 12 0.9966 0.9865726 0.9827, 0.9884 12 0.9900208 0
## 18 rs56994090 11 8 0.9934 0.9983908 0.9954, 0.9982 7 0.9900007 -1
## 19 rs6043409 39 16 0.9902 0.9959000 0.9947, 0.9981 9 0.9900867 -7
## 20 rs61839660 5 4 0.9985 0.9507457 0.9454, 0.9579 7 0.9901005 3
## 21 rs62447205 56 40 0.9902 0.9799506 0.978, 0.982 57 0.9900367 17
## 22 rs6476839 21 18 0.9947 0.9941413 0.9918, 0.9954 17 0.9900715 -1
## 23 rs6518350 68 41 0.9903 0.9990129 0.9951, 0.9989 15 0.9900782 -26
## 24 rs653178 5 2 0.9996 0.8479623 0.8302, 0.8431 5 0.9901032 3
## 25 rs705705 22 9 0.9932 0.9254648 0.9238, 0.9317 19 0.9900180 10
## 26 rs72727394 32 19 0.9904 0.9952338 0.9923, 0.9966 19 0.9900978 0
## 27 rs75793288 34 22 0.9941 0.9770497 0.9682, 0.9779 25 0.9900484 3
## 28 rs8056814 22 8 0.9921 0.9700932 0.9592, 0.9704 13 0.9906368 5
## 29 rs917911 135 133 0.99 0.9893961 0.9916, 0.9971 54 0.9898731 -79
## 30 rs9585056 7 3 0.9908 0.9979568 0.9959, 0.9999 1 0.9907311 -2
## 31 rs2476601 2 2 1 0.8034561 0.7955, 0.8097 2 0.9480000 0
## 32 rs12416116 7 6 0.9915 0.9886270 0.9887, 0.9959 NA NA NA
## 33 rs12453507 112 65 0.9904 0.9901308 0.989, 0.993 NA NA NA
## 34 rs13415583 139 92 0.9902 0.9897138 0.9874, 0.9913 NA NA NA
## 35 rs1456988 40 26 0.9954 0.9907794 0.9880, 0.9922 NA NA NA
## 36 rs72928038 56 36 0.9901 0.9951392 0.9884, 0.9947 NA NA NA
## 37 rs757411 82 54 0.9901 0.9903273 0.9887, 0.9938 NA NA NA
Decide what to do with the MAF and LD comparison section.
Finalise T1D results and make into .CSV file, reporting which method was used.
“The good, the bad and the ugly: Bayesian model selection produces spurious posterior probabilities for phylogenetic trees”
A particular binary phylogenetic tree (a binary tree has exactly 2 descendants from each interior node) gets a very high posterior probability, even though a star tree was used to generate the data.
For example, if we generate some large dataset from a star tree and calculate the posterior probabilities that the data came from each of 3 binary trees, then we would expect each posterior probability to be 1/3, as the data contain no information favouring any of these trees. But in fact, the posterior probabilities tended to be spurious and were sometimes very highly supportive of one binary tree model.
Sample some sets of data from the true model.
Consider two models to compare, \(H_1\) and \(H_2\) (neither being the true model and with equal dimensions - i.e. same number of free parameters).
Use a uniform prior for each model with value \(1/2\).
\(P_K\) is the posterior model probability for model \(K\) (i.e. \(P_1=Pr(H_1|x)\)).
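With the equal model priors above, the posterior model probability is just the normalised marginal likelihood (writing \(m_K(x)\) for the marginal likelihood of model \(K\); this notation is mine, not the paper's):

\[
P_1 = Pr(H_1 \mid x)
    = \frac{\tfrac{1}{2}\, m_1(x)}{\tfrac{1}{2}\, m_1(x) + \tfrac{1}{2}\, m_2(x)}
    = \frac{m_1(x)}{m_1(x) + m_2(x)},
\qquad
m_K(x) = \int L(\theta_K \mid x)\, \pi_K(\theta_K)\, d\theta_K .
\]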
\(P_1\) converges to a single reasonable value (not 0 or 1).
Then we can say something like “for every large dataset, the posterior probability of this model is approximately this value”.
This occurs when we compare two identical models or overlapping models where the best fitting parameter values lie in the region of overlap.
Whether the models are right or wrong does not matter.
Example of overlapping models: The truth is \(p=1/2\) and \(H_1:0.4<p<0.6\) and \(H_2=0<p<1\). We assign a uniform prior on \(p\) in each model, then as \(n \longrightarrow \infty\), \(P_1 \longrightarrow \frac{1}{1+0.2}=\frac{5}{6}\). I.e. the more-informative model \(H_1\) is favoured.
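The 5/6 limit can be seen from the marginal likelihoods: as \(n \longrightarrow \infty\) the likelihood concentrates at the truth \(p = 1/2\), so each model's marginal likelihood behaves like the likelihood there times the prior density at \(p = 1/2\), and the narrower model wins by the ratio of those densities (a heuristic Laplace-type argument, sketched here rather than the paper's full derivation):

\[
\frac{P_1}{P_2} \;\longrightarrow\; \frac{\pi_1(1/2)}{\pi_2(1/2)}
  = \frac{1/0.2}{1/1} = 5,
\qquad\text{so}\qquad
P_1 \;\longrightarrow\; \frac{5}{5+1} = \frac{1}{1+0.2} = \frac{5}{6}.
\]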
The rationale is that one would like a sure answer given an infinite amount of data.
This case is not really relevant, as you wouldn't normally compare identical or overlapping models…
\(P_1\) converges to a ‘nondegenerate’ (non-constant) distribution, so that \(P_1\) fluctuates among datasets like a random number (strong support may be attached to a certain model in some datasets).
So that if we compare two equally right or equally wrong models (for data from the same true model), \(P_1\) varies among datasets according to a nondegenerate distribution.
\(P_1\) has a degenerate two-point distribution at values 0 and 1 (a step function?), so that in some datasets we will favour model 1 with full confidence and in others, model 2 with full confidence.
0 half the time and 1 the other half.
Occurs when the models are equally wrong but also distinct.
The asymptotic behaviour is determined by whether or not the compared models are distinct, and not by whether they are both right or both wrong(!!!), or by whether they have unknown parameters.
This problem is especially acute in phylogenetics because here the phylogeny (the model) is of primary interest, not the branch lengths (the parameters). Thus it may be less of a problem in other statistical analyses where the model is not of primary interest.
Section: Intro
Section: Bayesian approach for fine-mapping:
Section: The problem
Section: My fix
Section: Closing remarks
Should I include a slide showing what happens when the 1 CV assumption is dropped? Could always include this as a later slide and show it if anyone asks.
Need to perhaps have an idea of how I could apply this to CHi-C…
Plots to redo: