Exclude 6 regions (I report the index SNP):
No position information for rs6691977, rs4849135, rs2611215 or rs1052553.
rs34536443 is in the data file as rs34536443b. Here, two SNPs with pps 0.98857969 and 0.01141317 (rs74956615 and rs34725611) would obviously be in the 99% credible set. This has a corrected coverage estimate of 0.9999999.
Ignore rs689 (in the insulin gene - well studied but not genotyped well on ImmunoChip).
## index_SNP their_cs_nvar our_cs_nvar our_cs_corrcov new_cs_nvar new_corrcov nvar.diff
## 1 rs10277986 26 11 0.9763712 14 0.9900244 3
## 2 rs11203202 5 4 0.9956508 4 0.9899901 0
## 3 rs11954020 136 83 0.9937890 75 0.9900860 -8
## 4 rs12416116 7 6 0.9886270 6 0.9902008 0
## 5 rs12453507 112 65 0.9901308 59 0.9900471 -6
## 6 rs12927355 41 20 0.9688299 24 0.9900287 4
## 7 rs13415583 139 92 0.9897138 94 0.9900704 2
## 8 rs1456988 40 26 0.9907794 26 0.9900568 0
## 9 rs151234 NA 10 0.9831869 18 0.9898650 8
## 10 rs1538171 154 87 0.9708851 100 0.9900226 13
## 11 rs1615504 40 26 0.9930152 24 0.9900069 -2
## 12 rs1893217 35 11 0.9453810 29 0.9900648 18
## 13 rs193778 102 23 0.9949010 20 0.9898605 -3
## 14 rs2111485 4 4 0.9822650 4 0.9900062 0
## 15 rs229533 15 8 0.9948442 5 0.9901647 -3
## 16 rs3024505 9 7 0.9926321 6 0.9899121 -1
## 17 rs3087243 30 21 0.9519229 38 0.9900073 17
## 18 rs34593439 3 2 0.9940004 2 0.9896408 0
## 19 rs402072 25 14 0.9959448 11 0.9900473 -3
## 20 rs4820830 117 59 0.9840735 64 0.9900965 5
## 21 rs516246 36 12 0.9865726 12 0.9900208 0
## 22 rs56994090 11 8 0.9983908 7 0.9900007 -1
## 23 rs6043409 39 16 0.9959000 9 0.9900867 -7
## 24 rs61839660 5 4 0.9507457 7 0.9901005 3
## 25 rs62447205 56 40 0.9799506 57 0.9900367 17
## 26 rs6476839 21 18 0.9941413 17 0.9900715 -1
## 27 rs6518350 68 41 0.9990129 15 0.9900782 -26
## 28 rs653178 5 2 0.8479623 5 0.9901032 3
## 29 rs705705 22 9 0.9254648 19 0.9900180 10
## 30 rs72727394 32 19 0.9952338 19 0.9900978 0
## 31 rs72928038 56 36 0.9951392 35 0.9900415 -1
## 32 rs757411 82 54 0.9903273 54 0.9901913 0
## 33 rs75793288 34 22 0.9770497 25 0.9900484 3
## 34 rs8056814 22 8 0.9700932 13 0.9906368 5
## 35 rs917911 135 133 0.9893961 54 0.9898731 -79
## 36 rs9585056 7 3 0.9979568 1 0.9907311 -2
## 161 rs2476601 2 2 0.8034561 2 0.9480000 0
## 37 rs113010081 187 120 0.9903131 98 0.9900402 -22
There does not seem to be any systematic bias for whether the corrected coverage of the standard credible set (our_cs_corrcov) is above or below the threshold.
Consistent with this, there is no systematic difference in the number of variants between the credible set obtained using the standard Bayesian approach (‘Old Credible Set’) and the one obtained using our correction approach (‘New Credible Set’).
The problematic index SNPs are rs917911 and rs2476601. I walk through the correction for each of these.
Question: Why does our correction method remove (so many) variants?
The original 99% credible set consists of 133 variants and has claimed coverage (size) 0.990001. Its corrected coverage is 0.9893961, implying that we need to add some variants to the set to achieve the desired coverage. Instead, our correction actually removes 79 variants!
Note, the muhat value in this region is 4.605187.
## index_SNP their_cs_nvar our_cs_nvar our_cs_corrcov new_cs_nvar new_corrcov nvar.diff
## 35 rs917911 135 133 0.9893961 54 0.9898731 -79
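For reference, the standard Bayesian credible set is built by ranking variants by posterior probability and accumulating until the claimed coverage ("size") reaches the threshold. A minimal sketch, illustrative only and not the corrcoverage implementation:

```python
# Build a credible set by descending posterior probability.
# pps: per-variant posterior probabilities; thr: coverage threshold.
def credible_set(pps, thr=0.99):
    order = sorted(range(len(pps)), key=lambda i: -pps[i])
    cs, total = [], 0.0
    for i in order:
        cs.append(i)
        total += pps[i]
        if total >= thr:
            break
    return cs, total

# Toy region where one variant dominates (made-up pps).
pps = [0.85, 0.10, 0.03, 0.01, 0.01]
cs, size = credible_set(pps)  # indices in the set, and its claimed coverage
```

Removing variants from such a set (as the correction does here) simply corresponds to lowering the threshold `thr`.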
There are 639 variants in the region (after trimming for LD). The LD matrix looks like this:
The algorithm gets stuck oscillating between two thresholds: thr: 0.977667841612856 (with corrcov: 0.98970859720129) and thr: 0.977667841614675 (with corrcov: 0.990175538886801). Ideally, the algorithm would choose the option that gives corrected coverage > 0.99, but it just returns whichever one it ends on after the maximum number of iterations. I need to change the function so that, if it gets stuck, it chooses the threshold with corrcov > 0.99. I run the correction again with a slightly lower accuracy (0.001 instead of 0.0001) and find a new credible set with corrected coverage of 0.9904272, which contains 62 variants and uses a threshold of 0.9804688.
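A sketch of how that fix could look: track candidates with corrected coverage above the target during the bisection, and if the search fails to converge on the target, fall back to the best such candidate. The function names and the toy corrcov function below are hypothetical, not the actual corrcoverage code:

```python
# Threshold search that breaks oscillation by preferring the candidate
# whose corrected coverage is at or above the target.
def find_threshold(corrcov, target=0.99, lo=0.0, hi=1.0,
                   acc=1e-4, max_iter=20):
    """corrcov: function mapping a threshold to a corrected coverage estimate."""
    best_above = None  # smallest threshold seen with corrcov >= target
    thr = None
    for _ in range(max_iter):
        thr = (lo + hi) / 2
        cc = corrcov(thr)
        if cc >= target:
            if best_above is None or thr < best_above[0]:
                best_above = (thr, cc)
            hi = thr
        else:
            lo = thr
        if abs(cc - target) < acc:
            break
    # If we stopped on a threshold that undershoots the target (e.g. the
    # search oscillated), fall back to the best candidate above the target.
    if corrcov(thr) < target and best_above is not None:
        thr = best_above[0]
    return thr

# Toy monotone corrcov for illustration only.
thr = find_threshold(lambda t: 0.5 + 0.5 * t, target=0.99)
```

The key design point is that the fallback guarantees the returned threshold has corrected coverage at or above the target whenever any visited threshold achieved it.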
Below is a plot of the lower posterior probabilities in this region, with a red line at 0.0003125, the difference between corrected coverage estimates for the standard Bayesian method and our correction.
## Warning: Removed 15 rows containing missing values (geom_point).
Answer: I think that these discrepancies are due to the variability in our corrected coverage estimate. In the original table, the difference between the corrected coverage of the new and the old credible set is 0.0003125 (see red line above), and we know that the variability of our correction is larger than this. I am not sure how to go about fixing this.
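Since the corrected coverage estimate is itself stochastic, its variability could be quantified by simply repeating the estimate and reporting the spread. A sketch, where the estimator below is a stand-in toy rather than the real corrected coverage function:

```python
import random
import statistics

def corrcov_spread(estimate_fn, n_rep=20):
    """Repeat a stochastic coverage estimator and report mean and SD.
    If the SD exceeds the gap between two candidate credible sets'
    corrected coverages, that gap is not meaningful."""
    reps = [estimate_fn() for _ in range(n_rep)]
    return statistics.mean(reps), statistics.stdev(reps)

# Toy stochastic estimator: true value 0.99 with noise of SD 0.002,
# i.e. noise larger than the 0.0003125 gap seen in this region.
random.seed(2)
m, s = corrcov_spread(lambda: 0.99 + random.gauss(0, 0.002))
```

Under this toy, the estimated SD comes out an order of magnitude larger than the observed 0.0003125 difference, matching the concern above.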
Side notes:
muhat in this region is 20.777874.
One of these two SNPs is protein-changing, so it is assumed to be the CV.
Question: Why does the obvious credible set only have 80% corrected coverage?
## index_SNP their_cs_nvar our_cs_nvar our_cs_corrcov new_cs_nvar new_corrcov
## 16 rs2476601 2 2 0.8034561 2 0.948
It seems that our correction algorithm suggests that we need to use a threshold of 1 to reach 99% corrected coverage (the maximum we can get with a lower threshold is ~0.97). This credible set would contain all the variants in the region (639). Perhaps if we let the algorithm try more values between 0.999999999 and 1, the correction method would work?
I have stepped through the code to get the corrected coverage of the 99% credible set. Indeed, the corrected coverage is identical to simulating credible sets from just these two and proceeding as normal (as all other pps are basically 0 so these simulations are discounted when we normalise by pps at the end).
When assuming these two SNPs are causal in turn, the proportion of credible sets containing the assumed CV is 0.79 and 0.795 for rs6679677 and rs2476601 assumed causal respectively. We would expect this to be a lot higher, especially for rs2476601 which has a pp of 0.8640309.
Note, the LD between these SNPs is -0.6153853.
## claimed.cov covered nvar
## 1 0.9990098 0 1
## 2 1.0000000 1 2
## 3 1.0000000 1 1
## 4 0.9924901 0 1
## 5 1.0000000 1 2
## 6 1.0000000 0 1
## 7 1.0000000 1 2
## 8 1.0000000 1 2
## 9 1.0000000 1 1
## 10 1.0000000 1 1
## 11 1.0000000 1 1
## 12 0.9999994 1 1
## 13 1.0000000 1 2
## 14 1.0000000 1 2
## 15 1.0000000 1 2
## 16 1.0000000 1 2
## 17 1.0000000 1 2
## 18 1.0000000 1 2
## 19 1.0000000 1 2
## 20 0.9983981 0 1
The corrected coverage of the obvious credible set is lower than we would expect because fairly often the other SNP (not considered causal) gets the bulk of the posterior probability, so is contained in the credible set instead of (or as well as) the assumed CV.
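This mechanism can be demonstrated with a toy calculation: form 99% credible sets from crude posterior probabilities for two SNPs and note that an unlucky draw hands the bulk of the posterior probability to the non-causal SNP, excluding the assumed CV. The Z-score values below are chosen purely to expose the mechanism (they are not matched to this region's effect size), and the pp calculation is a crude log-ABF-like softmax rather than our actual one:

```python
import numpy as np

def pp_from_z(z):
    """Crude per-SNP posterior probabilities: softmax of z^2/2
    (a log-ABF-like quantity, numerically stabilised)."""
    w = z**2 / 2
    p = np.exp(w - w.max())
    return p / p.sum()

def credible_set(pp, thr=0.99):
    """Indices of SNPs in the credible set, by descending pp."""
    order = np.argsort(-pp)
    cum = np.cumsum(pp[order])
    return set(order[: np.searchsorted(cum, thr) + 1].tolist())

# CV is SNP 0; SNP 1 is in negative LD with it.
# Typical draw: the CV's Z-score dominates, so the set is just {0}.
cs_typical = credible_set(pp_from_z(np.array([13.8, -8.5])))
# Unlucky draw: the other SNP gets the larger Z-score, so the 99%
# credible set is {1} and the assumed CV is excluded entirely.
cs_unlucky = credible_set(pp_from_z(np.array([1.5, -3.6])))
```

Averaged over many simulated draws, such exclusions pull the proportion of credible sets containing the assumed CV below the nominal level, which is the pattern in the table above.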
## CV true.mu mu.est true.cov claimed.cov corrcov nvar req.thr new.corrcov new.claim0 new.true.cov new.nvar
## NULL rs2476601 13.76096 14.55924 1 1 1 2 0.875 0.9904891 1 0.9812 2
## CV true.mu mu.est true.cov claimed.cov corrcov nvar req.thr new.corrcov new.claim0 new.true.cov new.nvar
## rs6679677 rs6679677 13.7647 12.87692 0.9996 1 1 2 0.825 0.9908652 0.955062 0.9672 1
Left: The claimed coverage of credible sets is wrong: the size is smaller than the true coverage for lower \(\mu\), implying that you need to increase the target size to achieve the coverage you want (?).
Middle: We can provide an accurate coverage estimate of these credible sets, with zero error relative to the true coverage and smaller variability than the claimed coverage.
Right: We can obtain a new credible set that has the desired coverage (estimated coverage of the new credible set).
T1D results: Corrected coverage of original credible sets is sometimes above and sometimes below 0.99.
Simulations: Corrected coverage of credible sets is mostly above 0.99.
Possible reasons:
High OR values: My simulations vary the OR from 1.05 to 1.2, and I haven’t investigated what happens at higher ORs. Perhaps the T1D results with corrected coverage < 0.99 come from regions with very high ORs, which I haven’t yet investigated?
It does seem that those coverage estimates below 0.99 come from regions where the index SNP has a higher OR value.
I have run our standard simulations for larger OR values (OR = 1.25, 1.3, 1.5, 1.8). Findings:
But, we still see way more cases of corrected coverage > 0.99.
For this big simulation dataset, I investigate how good the corrected and claimed coverage estimates are for true.mu bins.
Note: these plots do not change much when I bin by muhat instead.
So it seems that when we compare corrected and claimed coverage by bins of muhat, the corrected coverage clearly looks better for all thresholds. However, when we compare relative error plots faceted by threshold, sample size and odds ratio, it doesn’t look much better. This uses the same data, so why could this be?
The index SNPs for credible sets with corrected coverage < 0.99 do tend to have larger ORs, but when I run more standard simulations with \(ORs>1.2\), we still see the pattern of corrected coverage >> 0.99.
Value of mu: Perhaps it is something to do with the value of mu?
Our simulations so far use LD matrices from African populations, which typically exhibit less LD.
We need to see how our method works when we use LD matrices from T1D regions.
The ImmunoChip region data I have is from build 37, but the freq matrices Mary has made are for build 36. I use https://genome.ucsc.edu/cgi-bin/hgLiftOver to match the positions between builds.
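UCSC liftOver takes positions in BED format (0-based, half-open intervals). A minimal sketch of preparing the input file; the SNP names and positions below are illustrative only:

```python
# Write SNP positions as a BED file for UCSC liftOver.
# BED is 0-based half-open, so a SNP at 1-based position p spans [p-1, p).
def to_bed_line(name, chrom, pos_1based):
    return f"{chrom}\t{pos_1based - 1}\t{pos_1based}\t{name}"

# Illustrative (name, chrom, 1-based build-37 position) tuples.
snps = [("rsA", "chr1", 114377568),
        ("rsB", "chr12", 9905851)]

with open("snps_b37.bed", "w") as f:
    for name, chrom, pos in snps:
        f.write(to_bed_line(name, chrom, pos) + "\n")
```

The resulting file can be uploaded to the liftOver web tool (hg19 to hg18 in this case), and the SNP name in column 4 is carried through so lifted positions can be matched back.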
To investigate the problematic T1D regions (e.g. rs2476601, where the 99% credible set only has 80% corrected coverage), I need to match ichip IDs to rs IDs using this file: /home/cew54/share/Data/GWAS/IChip/final_chip/immunochip-final-annotation.txt.gz.
For now, I use these build 36 T1D regions to run my standard simulations.
The relative error plots show that the corrected coverage estimate is still better than claimed coverage (in most cases??)
The plot below shows the relative error of the coverage estimate of the new credible set obtained using our correction.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Also, we still see the discrepancy.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
And yes, we do still see the discrepancy.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Last week I found that the corrected coverage method seems to work fine when the 2 CVs are not in LD, but it fails when the 2 CVs are in high LD.
Luckily, Chris has some results to show that most of the time the 2 CVs are not in LD.
To confirm these findings I run two simulations and compare these to the 1 CV simulation results:
2 CVs in the same LD square
2 CVs in different LD squares
These plots confirm that when the two CVs are in LD, our method works better than claimed coverage for low power and tends to perform equally to claimed coverage in high power simulations.
When the CVs are not in LD, our method outperforms claimed coverage in all scenarios, and seems to perform just as well as in the 1 CV simulations.
Our correction method still works when we limit our simulations to CVs with MAF greater than 0.05 (representing common SNPs).
For the rs917911 T1D region (79 variants removed from credible set), I think that the weird corrected coverage estimate is due to the variability in our corrected coverage value. The difference between the corrected coverage of the new and the old credible set is 0.0003125, and there must be lots of SNPs with posterior probability in this range.
For the rs2476601 T1D region with low corrected coverage of the obvious credible set, the simulated credible sets for the two SNPs with high pp considered causal have lower coverage than expected because the other SNP is sometimes given the bulk of the posterior probability. These SNPs have -0.6 LD.
Discrepancy between simulations and T1D results: our simulations imply that the corrected coverage of the credible set is too high and needs to be pulled down, whilst the T1D results show that the corrected coverage is sometimes above and sometimes below the threshold. This discrepancy seems to be due to the freq data from which we are simulating versus the T1D freq data.
Showed through simulations that our method works well when there are 2 CVs not in LD.
Showed through simulations that our method is about as good as claimed coverage when there are 2 CVs in LD (is better than claimed coverage in low power scenarios).
Why do 4 of the SNPs not fall into Immunochip regions?
How can I correct the rs917911 T1D region?
‘Estimating max Z’ Slack comment.