Exclude 6 regions (I report the index SNP):
No position information for rs6691977, rs4849135, rs2611215 or rs1052553.
rs34536443 is in the data file as rs34536443b. Here, two SNPs with pps 0.98857969 and 0.01141317 (rs74956615 and rs34725611) would obviously be in the 99% credible set. This has a corrected coverage estimate of 0.9999999.
Ignore rs689 (in the insulin gene - well studied but not genotyped well on ImmunoChip).
## index_SNP their_cs_nvar our_cs_nvar our_cs_corrcov new_cs_nvar new_corrcov nvar.diff
## 1 rs10277986 26 11 0.9763712 14 0.9900244 3
## 2 rs11203202 5 4 0.9956508 4 0.9899901 0
## 3 rs11954020 136 83 0.9937890 75 0.9900860 -8
## 4 rs12416116 7 6 0.9886270 6 0.9902008 0
## 5 rs12453507 112 65 0.9901308 59 0.9900471 -6
## 6 rs12927355 41 20 0.9688299 24 0.9900287 4
## 7 rs13415583 139 92 0.9897138 94 0.9900704 2
## 8 rs1456988 40 26 0.9907794 26 0.9900568 0
## 9 rs151234 NA 10 0.9831869 18 0.9898650 8
## 10 rs1538171 154 87 0.9708851 100 0.9900226 13
## 11 rs1615504 40 26 0.9930152 24 0.9900069 -2
## 12 rs1893217 35 11 0.9453810 29 0.9900648 18
## 13 rs193778 102 23 0.9949010 20 0.9898605 -3
## 14 rs2111485 4 4 0.9822650 4 0.9900062 0
## 15 rs229533 15 8 0.9948442 5 0.9901647 -3
## 16 rs3024505 9 7 0.9926321 6 0.9899121 -1
## 17 rs3087243 30 21 0.9519229 38 0.9900073 17
## 18 rs34593439 3 2 0.9940004 2 0.9896408 0
## 19 rs402072 25 14 0.9959448 11 0.9900473 -3
## 20 rs4820830 117 59 0.9840735 64 0.9900965 5
## 21 rs516246 36 12 0.9865726 12 0.9900208 0
## 22 rs56994090 11 8 0.9983908 7 0.9900007 -1
## 23 rs6043409 39 16 0.9959000 9 0.9900867 -7
## 24 rs61839660 5 4 0.9507457 7 0.9901005 3
## 25 rs62447205 56 40 0.9799506 57 0.9900367 17
## 26 rs6476839 21 18 0.9941413 17 0.9900715 -1
## 27 rs6518350 68 41 0.9990129 15 0.9900782 -26
## 28 rs653178 5 2 0.8479623 5 0.9901032 3
## 29 rs705705 22 9 0.9254648 19 0.9900180 10
## 30 rs72727394 32 19 0.9952338 19 0.9900978 0
## 31 rs72928038 56 36 0.9951392 35 0.9900415 -1
## 32 rs757411 82 54 0.9903273 54 0.9901913 0
## 33 rs75793288 34 22 0.9770497 25 0.9900484 3
## 34 rs8056814 22 8 0.9700932 13 0.9906368 5
## 35 rs917911 135 133 0.9893961 54 0.9898731 -79
## 36 rs9585056 7 3 0.9979568 1 0.9907311 -2
## 161 rs2476601 2 2 0.8034561 2 0.9480000 0
## 37 rs113010081 187 120 0.9903131 98 0.9900402 -22
There does not seem to be any systematic bias for whether the corrected coverage of the standard credible set (our_cs_corrcov) is above or below the threshold.
Consistent with this, there is no systematic difference in the number of variants between the credible set obtained using the standard Bayesian approach (‘Old Credible Set’) and the one obtained using our correction approach (‘New Credible Set’).
The problematic index SNPs are rs917911 and rs2476601. I walk through the correction for each of these.
Question: Why does our correction method remove (so many) variants?
The original 99% credible set consists of 133 variants and has claimed coverage (size) 0.990001. Its corrected coverage is 0.9893961, implying that we need to add some variants to the set to achieve the desired coverage. Instead, our correction actually removes 79 variants!
Note, the muhat value in this region is 4.605187.
## index_SNP their_cs_nvar our_cs_nvar our_cs_corrcov new_cs_nvar new_corrcov nvar.diff
## 35 rs917911 135 133 0.9893961 54 0.9898731 -79
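For reference, the standard Bayesian credible set is built by ranking variants by posterior probability and accumulating until the claimed coverage ("size") reaches the threshold. A minimal sketch, illustrative only and not the corrcoverage implementation:

```python
# Build a credible set by descending posterior probability.
# pps: per-variant posterior probabilities; thr: coverage threshold.
def credible_set(pps, thr=0.99):
    order = sorted(range(len(pps)), key=lambda i: -pps[i])
    cs, total = [], 0.0
    for i in order:
        cs.append(i)
        total += pps[i]
        if total >= thr:
            break
    return cs, total

# Toy region where one variant dominates (made-up pps).
pps = [0.85, 0.10, 0.03, 0.01, 0.01]
cs, size = credible_set(pps)  # indices in the set, and its claimed coverage
```

Removing variants from such a set (as the correction does here) simply corresponds to lowering the threshold `thr`.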
There are 639 variants in the region (after trimming for LD). The LD matrix looks like this:
The algorithm gets stuck oscillating between two thresholds: thr: 0.977667841612856 (with corrcov: 0.98970859720129) and thr: 0.977667841614675 (with corrcov: 0.990175538886801). Ideally, the algorithm would choose the option that gives corrected coverage > 0.99, but it just returns whichever one it ends on after the maximum number of iterations. I need to change the function so that, if it gets stuck, it chooses the threshold with corrcov > 0.99. I run the correction again with a slightly lower accuracy (0.001 instead of 0.0001) and find a new credible set with corrected coverage of 0.9904272, which contains 62 variants and uses a threshold of 0.9804688.
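A sketch of how that fix could look: track candidates with corrected coverage above the target during the bisection, and if the search fails to converge on the target, fall back to the best such candidate. The function names and the toy corrcov function below are hypothetical, not the actual corrcoverage code:

```python
# Threshold search that breaks oscillation by preferring the candidate
# whose corrected coverage is at or above the target.
def find_threshold(corrcov, target=0.99, lo=0.0, hi=1.0,
                   acc=1e-4, max_iter=20):
    """corrcov: function mapping a threshold to a corrected coverage estimate."""
    best_above = None  # smallest threshold seen with corrcov >= target
    thr = None
    for _ in range(max_iter):
        thr = (lo + hi) / 2
        cc = corrcov(thr)
        if cc >= target:
            if best_above is None or thr < best_above[0]:
                best_above = (thr, cc)
            hi = thr
        else:
            lo = thr
        if abs(cc - target) < acc:
            break
    # If we stopped on a threshold that undershoots the target (e.g. the
    # search oscillated), fall back to the best candidate above the target.
    if corrcov(thr) < target and best_above is not None:
        thr = best_above[0]
    return thr

# Toy monotone corrcov for illustration only.
thr = find_threshold(lambda t: 0.5 + 0.5 * t, target=0.99)
```

The key design point is that the fallback guarantees the returned threshold has corrected coverage at or above the target whenever any visited threshold achieved it.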
Below is a plot of the lower posterior probabilities in this region, with a red line at 0.0003125, the difference between corrected coverage estimates for the standard Bayesian method and our correction.
## Warning: Removed 15 rows containing missing values (geom_point).
Answer: I think that these discrepancies are due to the variability in our corrected coverage estimate. In the original table, the difference between the corrected coverage of the new and the old credible set is 0.0003125 (see red line above), and we know that the variability of our correction is larger than this. I am not sure how to go about fixing this.
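Since the corrected coverage estimate is itself stochastic, its variability could be quantified by simply repeating the estimate and reporting the spread. A sketch, where the estimator below is a stand-in toy rather than the real corrected coverage function:

```python
import random
import statistics

def corrcov_spread(estimate_fn, n_rep=20):
    """Repeat a stochastic coverage estimator and report mean and SD.
    If the SD exceeds the gap between two candidate credible sets'
    corrected coverages, that gap is not meaningful."""
    reps = [estimate_fn() for _ in range(n_rep)]
    return statistics.mean(reps), statistics.stdev(reps)

# Toy stochastic estimator: true value 0.99 with noise of SD 0.002,
# i.e. noise larger than the 0.0003125 gap seen in this region.
random.seed(2)
m, s = corrcov_spread(lambda: 0.99 + random.gauss(0, 0.002))
```

Under this toy, the estimated SD comes out an order of magnitude larger than the observed 0.0003125 difference, matching the concern above.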
Side notes:
muhat in this region is 20.777874.
One of these two SNPs is protein-changing, so it is assumed to be the CV.
Question: Why does the obvious credible set only have 80% corrected coverage?
## index_SNP their_cs_nvar our_cs_nvar our_cs_corrcov new_cs_nvar new_corrcov
## 16 rs2476601 2 2 0.8034561 2 0.948
It seems that our correction algorithm suggests that we need to use a threshold of 1 to reach 99% corrected coverage (the maximum we can get with a lower threshold is ~0.97). This credible set would contain all the variants in the region (639). Perhaps if we let the algorithm try more values between 0.999999999 and 1, the correction method would work?
I have stepped through the code to get the corrected coverage of the 99% credible set. Indeed, the corrected coverage is identical to simulating credible sets from just these two and proceeding as normal (as all other pps are basically 0 so these simulations are discounted when we normalise by pps at the end).
When assuming these two SNPs are causal in turn, the proportion of credible sets containing the assumed CV is 0.79 and 0.795 for rs6679677 and rs2476601 assumed causal respectively. We would expect this to be a lot higher, especially for rs2476601 which has a pp of 0.8640309.
Note, the LD between these SNPs is -0.6153853.
## claimed.cov covered nvar
## 1 0.9990098 0 1
## 2 1.0000000 1 2
## 3 1.0000000 1 1
## 4 0.9924901 0 1
## 5 1.0000000 1 2
## 6 1.0000000 0 1
## 7 1.0000000 1 2
## 8 1.0000000 1 2
## 9 1.0000000 1 1
## 10 1.0000000 1 1
## 11 1.0000000 1 1
## 12 0.9999994 1 1
## 13 1.0000000 1 2
## 14 1.0000000 1 2
## 15 1.0000000 1 2
## 16 1.0000000 1 2
## 17 1.0000000 1 2
## 18 1.0000000 1 2
## 19 1.0000000 1 2
## 20 0.9983981 0 1
The corrected coverage of the obvious credible set is lower than we would expect because fairly often the other SNP (not considered causal) gets the bulk of the posterior probability, so is contained in the credible set instead of (or as well as) the assumed CV.
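This mechanism can be demonstrated with a toy calculation: form 99% credible sets from crude posterior probabilities for two SNPs and note that an unlucky draw hands the bulk of the posterior probability to the non-causal SNP, excluding the assumed CV. The Z-score values below are chosen purely to expose the mechanism (they are not matched to this region's effect size), and the pp calculation is a crude log-ABF-like softmax rather than our actual one:

```python
import numpy as np

def pp_from_z(z):
    """Crude per-SNP posterior probabilities: softmax of z^2/2
    (a log-ABF-like quantity, numerically stabilised)."""
    w = z**2 / 2
    p = np.exp(w - w.max())
    return p / p.sum()

def credible_set(pp, thr=0.99):
    """Indices of SNPs in the credible set, by descending pp."""
    order = np.argsort(-pp)
    cum = np.cumsum(pp[order])
    return set(order[: np.searchsorted(cum, thr) + 1].tolist())

# CV is SNP 0; SNP 1 is in negative LD with it.
# Typical draw: the CV's Z-score dominates, so the set is just {0}.
cs_typical = credible_set(pp_from_z(np.array([13.8, -8.5])))
# Unlucky draw: the other SNP gets the larger Z-score, so the 99%
# credible set is {1} and the assumed CV is excluded entirely.
cs_unlucky = credible_set(pp_from_z(np.array([1.5, -3.6])))
```

Averaged over many simulated draws, such exclusions pull the proportion of credible sets containing the assumed CV below the nominal level, which is the pattern in the table above.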
## CV true.mu mu.est true.cov claimed.cov corrcov nvar req.thr new.corrcov new.claim0 new.true.cov new.nvar
## NULL rs2476601 13.76096 14.55924 1 1 1 2 0.875 0.9904891 1 0.9812 2
## CV true.mu mu.est true.cov claimed.cov corrcov nvar req.thr new.corrcov new.claim0 new.true.cov new.nvar
## rs6679677 rs6679677 13.7647 12.87692 0.9996 1 1 2 0.825 0.9908652 0.955062 0.9672 1
Left: The claimed coverage of credible sets is wrong: the size is smaller than the true coverage for lower \(\mu\), implying that you need to increase the target size to achieve the coverage you want (?).
Middle: We can provide an accurate coverage estimate of these credible sets, with zero error relative to the true coverage and smaller variability than the claimed coverage.
Right: We can obtain a new credible set that has the desired coverage (estimated coverage of the new credible set).
T1D results: Corrected coverage of original credible sets is sometimes above and sometimes below 0.99.
Simulations: Corrected coverage of credible sets is mostly above 0.99.
Possible reasons:
High OR values: My simulations vary the OR from 1.05 to 1.2, and I haven’t investigated what happens at higher ORs. Perhaps the T1D results with corrected coverage < 0.99 come from regions with very high ORs, which I haven’t yet investigated?
It does seem that those coverage estimates below 0.99 come from regions where the index SNP has a higher OR value.
I have run our standard simulations for larger OR values (OR = 1.25, 1.3, 1.5, 1.8). Findings:
But, we still see way more cases of corrected coverage > 0.99.
For this big simulation dataset, I investigate how good the corrected and claimed coverage estimates are for true.mu bins.
Note: these plots do not change much when I bin by muhat instead.
So it seems that when we compare corrected and claimed coverage by bins of muhat, the corrected coverage clearly looks better for all thresholds. However, when we compare relative error plots faceted by threshold, sample size and odds ratio, it doesn’t look much better. This uses the same data, so why could this be?
The index SNPs for credible sets with corrected coverage < 0.99 do tend to have larger ORs, but when I run more standard simulations with \(ORs>1.2\), we still see the pattern of corrected coverage >> 0.99.
Value of mu: Perhaps it is something to do with the value of mu?
Our simulations so far use LD matrices from African populations, which typically exhibit less LD.
We need to see how our method works when we use LD matrices from T1D regions.
The ImmunoChip region data I have is from build 37, but the freq matrices Mary has made are for build 36. I use https://genome.ucsc.edu/cgi-bin/hgLiftOver to match the positions between builds.
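UCSC liftOver takes positions in BED format (0-based, half-open intervals). A minimal sketch of preparing the input file; the SNP names and positions below are illustrative only:

```python
# Write SNP positions as a BED file for UCSC liftOver.
# BED is 0-based half-open, so a SNP at 1-based position p spans [p-1, p).
def to_bed_line(name, chrom, pos_1based):
    return f"{chrom}\t{pos_1based - 1}\t{pos_1based}\t{name}"

# Illustrative (name, chrom, 1-based build-37 position) tuples.
snps = [("rsA", "chr1", 114377568),
        ("rsB", "chr12", 9905851)]

with open("snps_b37.bed", "w") as f:
    for name, chrom, pos in snps:
        f.write(to_bed_line(name, chrom, pos) + "\n")
```

The resulting file can be uploaded to the liftOver web tool (hg19 to hg18 in this case), and the SNP name in column 4 is carried through so lifted positions can be matched back.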
To investigate the problematic T1D regions (e.g. rs2476601, where the 99% credible set only has 80% corrected coverage), I need to match ichip IDs to rs IDs using this file: /home/cew54/share/Data/GWAS/IChip/final_chip/immunochip-final-annotation.txt.gz.
For now, I use these build 36 T1D regions to run my standard simulations.
The relative error plots show that the corrected coverage estimate is still better than claimed coverage (in most cases??)
The plot below shows the relative error of the coverage estimate of the new credible set obtained using our correction.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Also, we still see the discrepancy.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
And yes, we do still see the discrepancy.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Last week I found that the corrected coverage method seems to work fine when the 2 CVs are not in LD, but it fails when the 2 CVs are in high LD.
Luckily, Chris has some results to show that most of the time the 2 CVs are not in LD.
To confirm these findings I run two simulations and compare these to the 1 CV simulation results:
2 CVs in the same LD square
2 CVs in different LD squares
These plots confirm that when the two CVs are in LD, our method works better than claimed coverage for low power and tends to perform equally to claimed coverage in high power simulations.
When the CVs are not in LD, our method outperforms claimed coverage in all scenarios, and seems to perform just as well as in the 1 CV simulations.
Our correction method still works when we limit our simulations to CVs with MAF greater than 0.05 (representing common SNPs).
For the rs917911 T1D region (79 variants removed from credible set), I think that the weird corrected coverage estimate is due to the variability in our corrected coverage value. The difference between the corrected coverage of the new and the old credible set is 0.0003125, and there must be lots of SNPs with posterior probability in this range.
For the rs2476601 T1D region with low corrected coverage of the obvious credible set, the simulated credible sets for the two SNPs with high pp considered causal have lower coverage than expected because the other SNP is sometimes given the bulk of the posterior probability. These SNPs have -0.6 LD.
Discrepancy between simulations and T1D results: our simulations imply that the corrected coverage of the credible set is too high and needs to be pulled down, whilst the T1D results show that the corrected coverage is sometimes above and sometimes below the threshold. This discrepancy seems to be due to the freq data from which we are simulating versus the T1D freq data.
Showed through simulations that our method works well when there are 2 CVs not in LD.
Showed through simulations that our method is about as good as claimed coverage when there are 2 CVs in LD (is better than claimed coverage in low power scenarios).
Why do 4 of the SNPs not fall into Immunochip regions?
How can I correct the rs917911 T1D region?
‘Estimating max Z’ Slack comment.