3rd July

Effect of LD
Observed True Effect
Re-Analysis of T1D Data
Realistic Fine-Mapping (using reference panel)
Comments and To Do

1. Effect of LD

So far, all of our simulations are based on ldd regions on chromosome 22 using data from UK10K, we did not quantify the LD of these regions.
To investigate the effect of LD on our method, we apply our method to various regions (low, medium or high LD) on chromosome 10 using 1000 Genomes data.
Each of the regions contain approximately 700 SNPs (corrcov function takes ~ 2 minutes).

Inferences

Our method seems to work best relative to the claimed coverage in the \(2.5<\mu<6\) range (\(0.01<P<2e-09\)).
It becomes very variable in high powered scenarioes for high LD regions. Here, the corrected coverage is often too low. What would these regions look like? Let’s look at a high powered example with the CV coloured in red:

##    true_mu   mu_est
## 1 19.88103 18.65872

## $credset
##  [1] 362 307 438 440 466 286 285 309 445 308 315 446 447 450 434 449 373
## [18] 326 369 476
## 
## $claimed.cov
## [1] 0.9574063
## 
## $nvar
## [1] 20

Variants in the 95% credible set coloured in blue:

The coverage estimates reflect what is shown in the relative error plots.

claimed <- size
true.est <- data$true.cov.est
corrected <- corrcov(data$z0, data$maf, data$NN, data$NN, data$LD)

data.frame(true = true.est, claimed = claimed, corrected = corrected)

##     true   claimed corrected
## 1 0.9812 0.9574063 0.9134529

Why is our corrected coverage estimate too low? Lets break down the corrcov function.

W = 0.2
varbeta = data$v
z = data$z0
Sigma = data$LD
nrep = 1000 # note - get similar result when increasing this
thr = 0.95
iCV = data$CV

r = W^2/(W^2 + varbeta)
bf = 0.5 * (log(1 - r) + (r * z^2))

par(mfrow=c(1,2))
plot(bf, ylab = "ABF", main = "ABF")
points(cs$credset, bf[cs$credset], col = "blue")
points(iCV, bf[iCV], col = "red")
plot(pp, ylab = "PP", main = "PP")
points(cs$credset, pp[cs$credset], col = "blue")
points(iCV, pp[iCV], col = "red")

The following plot is the proportion of the nrep = 1000 simulated credible sets that contain the corresponding variant that is assumed causal. The variants in the credible set are coloured in blue and the CV is coloured in red.

The following plot is the weighted proportions. The final corrected coverage value is the sum of these.

So it appears that when the variants with the largest PPs in the original system (those in the credible set) are assumed to be causal and more Z score results are simulated from this distribution, the assumed CV does not appear in the credible set as often as expected. This is likely due to the high LD in the region allocating the biggest PPs to variants that are highly correlated with the CV (but not the CV, so that it doesn’t appear in the resultant credible set). Let’s investigate this for the simulations where the CV is assumed to be causal.

iCV_sims <- pps[[iCV]]
max <- apply(iCV_sims, 1, which.max)

# what proportion of the 1000 simulations give the biggest PP to the CV?
length(which(max==iCV))/1000

## [1] 0.372

table(max)

## max
## 285 308 315 326 328 336 339 340 341 342 343 347 362 369 373 384 385 434 
##   2   1   2   4   1   2   1   1   4   3   3   3   6  11   3   7   2   3 
## 438 440 445 446 447 449 450 464 466 473 476 489 
## 399 372 125   3   4   3   5  11   7   4   5   3

So the CV is only given the largest of the PP in 39% of simulations, and in another 39% of the simulations the adjacent variant (438) is given the largest PP, kind of illustrating why our method is not as accurate in regions of high LD.
This is intuitive because our correction method works by simulating more GWAS results from “the same system” and finding the proportion of the resultant credible sets that contain the variant that is assumed to be causal. In regions with very high LD, when we simulate more GWAS results, the PP is shared amongst the highly correlated variants and in some cases, the actual assumed CV won’t get a substantial proportion of the allocated PP and therefore will not appear in the credible set.
Perhaps we should therefore incorporate the expected LD in the region (as well as the value for \(\hat\mu\)) in the “recommended use” section.

2. Bin by \(z0[iCV]\) rather than \(\mu=ze[iCV]\)

Facted by LD.

It is apparant that the higher the LD, the more our method struggles.
The number of simulations for the above plot (binned by muhat) is….

##           
##            low medium medium2 high
##   (0,1]      0      0       0    0
##   (1,2]    152    172     165  189
##   (2,2.5]  307    258     245  201
##   (2.5,3]  203    160     192  159
##   (3,4]    249    242     236  256
##   (4,4.75] 144    162     167  167
##   (4.75,6] 237    266     266  237
##   (6,8]    234    235     258  257
##   (8,13]   304    305     271  323
##   (13,16]   47     41      47   42
##   (16,Inf] 113    150     147  161

3. Re-Analysis of T1D results

The T1D results I have so far use data which has been quality checked and therefore SNPs have been removed. Chris has created files containing all the data.
The code for this analysis is on the HPC AJHG/T1D_3/.

Problematic Regions

1. rs12453507 and rs113010081 (row 27 and 39)

rs12453507: There are NAs in the LD matrix (Sigma[194, 48]) between variants imm_17_34695499 (48) and imm_17_34950738 (194).
Chris says that “I think this is a bug in snpStats when two SNPs have no subjects in one of the double homozygous cells” and has given me a function to fix this.

2. rs72928038 and rs34536443 (rows 8 and 38)

Cannot make credible set smaller error in corrected_cs function (because the credible set already only has one variant).
Need to add a tryCatch to my code so the whole job doesn’t fail.
Have ammended code to deal with this.

3. rs9585056, rs6043409 and rs917911 (row 18, 33 and 15)

No root in range error.
I’ve been using lower=0.9 parameter value in corrected_cs_bhat function.
Have changed this to lower=0 for these.

Having fixed these problematic regions, I re-run the T1D analysis where:

The index SNP is called by it’s position rather than name (rsIDs are variable).
corrcov_bhat(bhat = bhat, V = varbeta, N0 = 12262, N1 = 6670, Sigma = Sigma, thr = thresh, W = 0.2, nrep = 1000) is used to find the corrected coverage estimate.
If the threshold (0.95) is not contained within the 95% confidence interval for the corrected coverage estimate, then the corrected_cs_bhat(bhat = bhat, V = varbeta, N0 = 12262, N1 = 6670, Sigma = Sigma, lower = 0.9, upper = 1, desired.cov = 0.95, acc = 0.0001, max.iter = 50) function is used to find a corrected credible set.

In the following plot, the corrected coverage for the regions using the original (trimmed) data are shown in pale pink for comparison.

In the paragraph below, the numbers in brackets are the corresponding results from my original T1D analysis (using the trimmed data set).
I found that a corrected credible set could be found in 25 (24) out of the 39 regions. Specifically, 8 (13) credible sets could be reduced in size and 17 (11) credible sets required the inclusion of additional variants. The median/ mean corrected coverage for the 8 (13) credible sets which could be reduced in size was 0.971/ 0.972 (0.9694/ 0.9696) and the median/ mean number of variants removed from the set was 1/ 4 (2/ 5.077). The median/ mean corrected coverage for the 17 (11) credible sets which required more variants to be added to the set was 0.866/ 0.873 (0.9111/ 0.8939) and the median/ mean number of variants removed from the set was 6/ 9 (3/ 4.364).
These results differ because there are a different number of SNPs in the corresponding regions between my original and my new results:

##          snps cut full
## 1  rs10277986 362  389
## 2  rs11203202 201  222
## 3  rs11954020 397  425
## 4  rs12416116 300  326
## 5  rs12453507 790  866
## 6  rs12927355 482  564
## 7  rs13415583 747  849
## 8   rs1456988 382  412
## 9    rs151234 248  291
## 10  rs1538171 702  805
## 11  rs1615504 173  198
## 12  rs1893217 262  284
## 13   rs193778 293  391
## 14  rs2111485 320  354
## 15   rs229533 208  231
## 16  rs2476601 639  691
## 17  rs3024505 337  390
## 18  rs3087243 512  557
## 19 rs34593439 296  335
## 20   rs402072 151  169
## 21  rs4820830 778  844
## 22   rs516246 183  201
## 23 rs56994090  79   87
## 24  rs6043409 339  371
## 25 rs61839660 313  343
## 26 rs62447205 641  692
## 27  rs6476839 248  280
## 28  rs6518350 160  191
## 29   rs653178 468  538
## 30   rs705705  64   70
## 31 rs72727394 200  219
## 32 rs72928038 268  292
## 33   rs757411 160  193
## 34 rs75793288 624  705
## 35  rs8056814 512  607
## 36   rs917911 523  563
## 37  rs9585056 154  165

index_snp	muhat	nsnps	claim	corrcov	CI95	req_thr	new_corrcov	new.CI95	orig.nvar	new.nvar	diff
rs10277986	6.3486	389	0.9908	0.8949	(0.8858,0.898)	0.9872070	0.9501	(0.9534,0.9602)	12	12	0
rs11203202	7.5674	222	0.9994	0.9631	(0.9583,0.9659)	0.9204353	0.9499	(0.9478,0.958)	4	4	0
rs113010081	4.8744	1342	0.9539	0.9502	(0.949,0.9573)	NA	NA	NA	51	1	-50
rs11954020	3.9973	425	0.9515	0.9701	(0.9655,0.9735)	0.9209473	0.9501	(0.9437,0.953)	74	67	-7
rs12416116	6.7886	326	0.9557	0.9567	(0.9489,0.9636)	NA	NA	NA	2	1	-1
rs12453507	5.1121	866	0.9541	0.9492	(0.9438,0.9524)	NA	NA	NA	45	1	-44
rs12927355	8.7231	564	0.9565	0.8662	(0.8645,0.8749)	0.9977097	0.9500	(0.9482,0.9526)	17	28	11
rs13415583	5.1904	849	0.9522	0.9022	(0.8953,0.9039)	0.9800293	0.9500	(0.9444,0.9502)	58	71	13
rs1456988	5.2630	412	0.9661	0.9122	(0.9055,0.9146)	0.9790527	0.9500	(0.9461,0.9526)	23	24	1
rs151234	6.7041	291	0.9861	0.9295	(0.9287,0.948)	0.9537201	0.9499	(0.9313,0.9498)	2	2	0
rs1538171	6.5230	805	0.9517	0.8421	(0.8406,0.8499)	0.9950684	0.9501	(0.9473,0.953)	41	77	36
rs1615504	5.0927	198	0.9621	0.9414	(0.9335,0.9423)	0.9603516	0.9501	(0.9451,0.9532)	21	21	0
rs1893217	8.3153	284	0.9545	0.8601	(0.8525,0.8672)	0.9952523	0.9503	(0.9526,0.9595)	12	25	13
rs193778	5.6491	391	0.9608	0.9609	(0.9548,0.9656)	0.9388672	0.9501	(0.9461,0.9576)	10	9	-1
rs2111485	7.9084	354	0.9719	0.8484	(0.8473,0.8615)	0.9951172	0.9500	(0.9455,0.9571)	3	4	1
rs229533	5.3313	231	0.9528	0.9749	(0.9703,0.9806)	0.9140625	0.9500	(0.9438,0.9586)	6	5	-1
rs2476601	21.8305	691	1.0000	0.7489	(0.7348,0.7528)	1.0000000	0.9497	(0.9481,0.9555)	2	2	0
rs3024505	6.6024	390	0.9985	0.8773	(0.8825,0.8984)	0.9870750	0.9498	(0.9433,0.9559)	3	3	0
rs3087243	8.4387	557	0.9507	0.8498	(0.8431,0.8573)	0.9969482	0.9500	(0.9433,0.9509)	16	25	9
rs34536443	7.8423	247	0.9985	0.9950	(0.989,0.998)	NA	NA	NA	1	1	0
rs34593439	7.5158	335	0.9949	0.9694	(0.9579,0.972)	0.9328125	0.9501	(0.9521,0.9661)	2	2	0
rs402072	5.4127	169	0.9657	0.9634	(0.96,0.9701)	0.9220703	0.9501	(0.9379,0.9509)	11	10	-1
rs4820830	6.1016	844	0.9515	0.9156	(0.9129,0.9235)	0.9734375	0.9501	(0.9457,0.9543)	42	48	6
rs516246	6.4900	201	0.9611	0.9225	(0.9222,0.934)	0.9674941	0.9502	(0.941,0.952)	11	12	1
rs56994090	6.0550	87	0.9732	0.9709	(0.958,0.9694)	0.9205566	0.9501	(0.9422,0.9542)	6	5	-1
rs6043409	5.5195	371	0.9661	0.9696	(0.9709,0.9785)	0.9072266	0.9500	(0.9468,0.9589)	9	7	-2
rs61839660	11.8881	343	0.9752	0.8257	(0.8176,0.8312)	0.9999023	0.9501	(0.9484,0.9552)	4	8	4
rs62447205	5.7966	692	0.9531	0.8975	(0.8909,0.8971)	0.9836426	0.9500	(0.9498,0.955)	27	38	11
rs6476839	5.9203	280	0.9540	0.9307	(0.9304,0.9435)	0.9625000	0.9501	(0.9459,0.9547)	14	15	1
rs6518350	4.4576	191	0.9542	0.9750	(0.9664,0.976)	0.9153320	0.9500	(0.9408,0.9533)	9	9	0
rs653178	13.5525	538	0.9999	0.7477	(0.7395,0.7542)	1.0000000	0.9506	(0.9498,0.9605)	2	5	3
rs705705	9.9452	70	0.9921	0.8373	(0.8333,0.8422)	0.9990356	0.9500	(0.9442,0.9517)	8	12	4
rs72727394	5.1817	219	0.9648	0.9406	(0.9342,0.9443)	0.9621094	0.9501	(0.947,0.9566)	17	17	0
rs72928038	7.5404	292	0.9810	0.9820	(0.9693,0.985)	NA	NA	NA	1	1	0
rs757411	4.9777	193	0.9518	0.9361	(0.9332,0.9461)	0.9563058	0.9500	(0.9387,0.9529)	37	39	2
rs75793288	7.2588	705	0.9517	0.8940	(0.884,0.9028)	0.9931827	0.9503	(0.946,0.9605)	16	37	21
rs8056814	8.5139	607	0.9514	0.8484	(0.8445,0.8625)	0.9996133	0.9503	(0.9454,0.9551)	4	21	17
rs917911	4.6019	563	0.9502	0.9715	(0.966,0.9793)	0.8942952	0.9502	(0.9385,0.9544)	33	15	-18
rs9585056	5.6000	165	0.9590	0.9918	(0.9832,0.9941)	0.6846541	0.9499	(0.9323,0.961)	2	1	-1

Note that for the BACH2 region (rsid rs72928038), the standard 95% credible set now picks out just this index SNP (it has a PP of 0.981) and although this has a high corrected coverage value, the credible set cannot be made any smaller.

There also appears to be some correlation between the value for muhat and the corrected coverage estimate (I noticed this because the regions with very low corrected coverage all seemed to have high muhats where our method has been shown to say that the corrected coverage is so low). Indeed, these results match what we see in our simulations (see first section of this report). Perhaps we need an upper \(\hat\mu\) value for where our method should not be used.

T1D Results - Plots

These are available to view on github: https://github.com/annahutch/PhD/tree/master/plots
NB, if the “new credible set” plot is missing then the credible set did not need to be corrected.

4. Realistic Fine-Mapping (Reference LD info)

Often researchers do not have access to the full genotype data and they therefore use reference panels to obtain LD information.
A realistic fine-mapping scenario would be that a researcher has the \(Z\) scores, \(MAF\)s and sample sizes. They would then match the SNPs to the 1000 genomes resource to get an estimate of the LD. I want to run an analysis that reflects this realistic scenario to see how our correction performs.
In my simulations, I have the positions of the SNPs - i.e. Chr22: 42861613 (pos42861613) and the corresponding UK10K data. I want to match the SNPs (by their position) to 1000 Genomes data (1000GP_Phase3_chr22.legend.gz) to generate independent reference data.
I can then investigate descrepencies between analysing a large dataset and smaller reference or vice versa (where large/small refer to the total number of chromosomes observed in each dataset).

5. Comments and To Do

Finalise R Package with Kevin:
- .gitignore and .Rbuildignore more stuff to make repository neater.
- Figure out continuous deployment stuff - every time the repository is pushed the gh-pages branch should be deployed and the pkgdown::build_site command should be evoked so that the webpage is updated. I have got this half working but the vignettes are returning 404 errors.
- Possibility: Check out algolia for a proper search option on the website (full repository search).
- Possibility: Look at Zenodo (each time the package is published, a version is backed up here and can be cited etc).
Do a walk-through “typical GWAS example” where we have the Z scores and MAFs (and sample sizes) and we estimate the SNP correlations from a reference panel. How do I go about using the 1000 genomes data for the LD estimate?
Ask Chris about code for simulating haplotypes (as this is used in my R package examples):

#' # simulate fake haplotypes to obtain MAFs and LD matrix

#' nhaps <- 1000
#' lag <- 7
#' maf.tmp <- runif(nsnps+lag, 0.05, 0.5) # common SNPs
#' laghaps <- do.call("cbind", lapply(maf.tmp, function(f) rbinom(nhaps,1,f)))
#' haps <- laghaps[,1:nsnps]

# specifically the following 3 lines??

#' for(j in 1:lag)
#'    haps <- haps + laghaps[,(1:nsnps)+j]
#' haps <- round(haps/matrix(apply(haps,2,max),nhaps,nsnps,byrow=TRUE))


#' maf <- colMeans(haps)
#' LD <- cor2(haps)

Reminder: Need to write a careful paragraph in the discussion of the paper about missing data/ drop-out rate/ gold standard stuff.
Do a paper draft for PLOS genetics to give to Chris whilst I’m at APTS.
Prepare for first year upgrading:
- Look into DAP-G fine-mapping.
- Explaining Robs idea - the reweighting of the prior “..and so a potential alternative approach would be to examine the effect of sampling from, and re-weighting the prior.”
- Methods for specifying W in peaky.