Contents

  1. Effect of LD
  2. Observed True Effect
  3. Re-Analysis of T1D Data
  4. Realistic Fine-Mapping (using reference panel)
  5. Comments and To Do

1. Effect of LD




Inferences

  • Our method seems to work best relative to the claimed coverage in the \(2.5<\mu<6\) range (\(0.01<P<2e-09\)).

  • It becomes very variable in high powered scenarioes for high LD regions. Here, the corrected coverage is often too low. What would these regions look like? Let’s look at a high powered example with the CV coloured in red:

##    true_mu   mu_est
## 1 19.88103 18.65872
## $credset
##  [1] 362 307 438 440 466 286 285 309 445 308 315 446 447 450 434 449 373
## [18] 326 369 476
## 
## $claimed.cov
## [1] 0.9574063
## 
## $nvar
## [1] 20
  • Variants in the 95% credible set coloured in blue:

  • The coverage estimates reflect what is shown in the relative error plots.
claimed <- size
true.est <- data$true.cov.est
corrected <- corrcov(data$z0, data$maf, data$NN, data$NN, data$LD)

data.frame(true = true.est, claimed = claimed, corrected = corrected)
##     true   claimed corrected
## 1 0.9812 0.9574063 0.9134529
  • Why is our corrected coverage estimate too low? Lets break down the corrcov function.
W = 0.2
varbeta = data$v
z = data$z0
Sigma = data$LD
nrep = 1000 # note - get similar result when increasing this
thr = 0.95
iCV = data$CV

r = W^2/(W^2 + varbeta)
bf = 0.5 * (log(1 - r) + (r * z^2))

par(mfrow=c(1,2))
plot(bf, ylab = "ABF", main = "ABF")
points(cs$credset, bf[cs$credset], col = "blue")
points(iCV, bf[iCV], col = "red")
plot(pp, ylab = "PP", main = "PP")
points(cs$credset, pp[cs$credset], col = "blue")
points(iCV, pp[iCV], col = "red")

  • The following plot is the proportion of the nrep = 1000 simulated credible sets that contain the corresponding variant that is assumed causal. The variants in the credible set are coloured in blue and the CV is coloured in red.

  • The following plot is the weighted proportions. The final corrected coverage value is the sum of these.

  • So it appears that when the variants with the largest PPs in the original system (those in the credible set) are assumed to be causal and more Z score results are simulated from this distribution, the assumed CV does not appear in the credible set as often as expected. This is likely due to the high LD in the region allocating the biggest PPs to variants that are highly correlated with the CV (but not the CV, so that it doesn’t appear in the resultant credible set). Let’s investigate this for the simulations where the CV is assumed to be causal.
iCV_sims <- pps[[iCV]]
max <- apply(iCV_sims, 1, which.max)

# what proportion of the 1000 simulations give the biggest PP to the CV?
length(which(max==iCV))/1000
## [1] 0.372
table(max)
## max
## 285 308 315 326 328 336 339 340 341 342 343 347 362 369 373 384 385 434 
##   2   1   2   4   1   2   1   1   4   3   3   3   6  11   3   7   2   3 
## 438 440 445 446 447 449 450 464 466 473 476 489 
## 399 372 125   3   4   3   5  11   7   4   5   3
  • So the CV is only given the largest of the PP in 39% of simulations, and in another 39% of the simulations the adjacent variant (438) is given the largest PP, kind of illustrating why our method is not as accurate in regions of high LD.

  • This is intuitive because our correction method works by simulating more GWAS results from “the same system” and finding the proportion of the resultant credible sets that contain the variant that is assumed to be causal. In regions with very high LD, when we simulate more GWAS results, the PP is shared amongst the highly correlated variants and in some cases, the actual assumed CV won’t get a substantial proportion of the allocated PP and therefore will not appear in the credible set.

  • Perhaps we should therefore incorporate the expected LD in the region (as well as the value for \(\hat\mu\)) in the “recommended use” section.


2. Bin by \(z0[iCV]\) rather than \(\mu=ze[iCV]\)




##           
##            low medium medium2 high
##   (0,1]      0      0       0    0
##   (1,2]    152    172     165  189
##   (2,2.5]  307    258     245  201
##   (2.5,3]  203    160     192  159
##   (3,4]    249    242     236  256
##   (4,4.75] 144    162     167  167
##   (4.75,6] 237    266     266  237
##   (6,8]    234    235     258  257
##   (8,13]   304    305     271  323
##   (13,16]   47     41      47   42
##   (16,Inf] 113    150     147  161

3. Re-Analysis of T1D results



Problematic Regions


1. rs12453507 and rs113010081 (row 27 and 39)

  • rs12453507: There are NAs in the LD matrix (Sigma[194, 48]) between variants imm_17_34695499 (48) and imm_17_34950738 (194).

  • Chris says that “I think this is a bug in snpStats when two SNPs have no subjects in one of the double homozygous cells” and has given me a function to fix this.


2. rs72928038 and rs34536443 (rows 8 and 38)

  • Cannot make credible set smaller error in corrected_cs function (because the credible set already only has one variant).

  • Need to add a tryCatch to my code so the whole job doesn’t fail.

  • Have ammended code to deal with this.


3. rs9585056, rs6043409 and rs917911 (row 18, 33 and 15)

  • No root in range error.

  • I’ve been using lower=0.9 parameter value in corrected_cs_bhat function.

  • Have changed this to lower=0 for these.


Having fixed these problematic regions, I re-run the T1D analysis where:

  1. The index SNP is called by it’s position rather than name (rsIDs are variable).

  2. corrcov_bhat(bhat = bhat, V = varbeta, N0 = 12262, N1 = 6670, Sigma = Sigma, thr = thresh, W = 0.2, nrep = 1000) is used to find the corrected coverage estimate.

  3. If the threshold (0.95) is not contained within the 95% confidence interval for the corrected coverage estimate, then the corrected_cs_bhat(bhat = bhat, V = varbeta, N0 = 12262, N1 = 6670, Sigma = Sigma, lower = 0.9, upper = 1, desired.cov = 0.95, acc = 0.0001, max.iter = 50) function is used to find a corrected credible set.

In the following plot, the corrected coverage for the regions using the original (trimmed) data are shown in pale pink for comparison.



  • In the paragraph below, the numbers in brackets are the corresponding results from my original T1D analysis (using the trimmed data set).

  • I found that a corrected credible set could be found in 25 (24) out of the 39 regions. Specifically, 8 (13) credible sets could be reduced in size and 17 (11) credible sets required the inclusion of additional variants. The median/ mean corrected coverage for the 8 (13) credible sets which could be reduced in size was 0.971/ 0.972 (0.9694/ 0.9696) and the median/ mean number of variants removed from the set was 1/ 4 (2/ 5.077). The median/ mean corrected coverage for the 17 (11) credible sets which required more variants to be added to the set was 0.866/ 0.873 (0.9111/ 0.8939) and the median/ mean number of variants removed from the set was 6/ 9 (3/ 4.364).

  • These results differ because there are a different number of SNPs in the corresponding regions between my original and my new results:

##          snps cut full
## 1  rs10277986 362  389
## 2  rs11203202 201  222
## 3  rs11954020 397  425
## 4  rs12416116 300  326
## 5  rs12453507 790  866
## 6  rs12927355 482  564
## 7  rs13415583 747  849
## 8   rs1456988 382  412
## 9    rs151234 248  291
## 10  rs1538171 702  805
## 11  rs1615504 173  198
## 12  rs1893217 262  284
## 13   rs193778 293  391
## 14  rs2111485 320  354
## 15   rs229533 208  231
## 16  rs2476601 639  691
## 17  rs3024505 337  390
## 18  rs3087243 512  557
## 19 rs34593439 296  335
## 20   rs402072 151  169
## 21  rs4820830 778  844
## 22   rs516246 183  201
## 23 rs56994090  79   87
## 24  rs6043409 339  371
## 25 rs61839660 313  343
## 26 rs62447205 641  692
## 27  rs6476839 248  280
## 28  rs6518350 160  191
## 29   rs653178 468  538
## 30   rs705705  64   70
## 31 rs72727394 200  219
## 32 rs72928038 268  292
## 33   rs757411 160  193
## 34 rs75793288 624  705
## 35  rs8056814 512  607
## 36   rs917911 523  563
## 37  rs9585056 154  165
index_snp muhat nsnps claim corrcov CI95 req_thr new_corrcov new.CI95 orig.nvar new.nvar diff
rs10277986 6.3486 389 0.9908 0.8949 (0.8858,0.898) 0.9872070 0.9501 (0.9534,0.9602) 12 12 0
rs11203202 7.5674 222 0.9994 0.9631 (0.9583,0.9659) 0.9204353 0.9499 (0.9478,0.958) 4 4 0
rs113010081 4.8744 1342 0.9539 0.9502 (0.949,0.9573) NA NA NA 51 1 -50
rs11954020 3.9973 425 0.9515 0.9701 (0.9655,0.9735) 0.9209473 0.9501 (0.9437,0.953) 74 67 -7
rs12416116 6.7886 326 0.9557 0.9567 (0.9489,0.9636) NA NA NA 2 1 -1
rs12453507 5.1121 866 0.9541 0.9492 (0.9438,0.9524) NA NA NA 45 1 -44
rs12927355 8.7231 564 0.9565 0.8662 (0.8645,0.8749) 0.9977097 0.9500 (0.9482,0.9526) 17 28 11
rs13415583 5.1904 849 0.9522 0.9022 (0.8953,0.9039) 0.9800293 0.9500 (0.9444,0.9502) 58 71 13
rs1456988 5.2630 412 0.9661 0.9122 (0.9055,0.9146) 0.9790527 0.9500 (0.9461,0.9526) 23 24 1
rs151234 6.7041 291 0.9861 0.9295 (0.9287,0.948) 0.9537201 0.9499 (0.9313,0.9498) 2 2 0
rs1538171 6.5230 805 0.9517 0.8421 (0.8406,0.8499) 0.9950684 0.9501 (0.9473,0.953) 41 77 36
rs1615504 5.0927 198 0.9621 0.9414 (0.9335,0.9423) 0.9603516 0.9501 (0.9451,0.9532) 21 21 0
rs1893217 8.3153 284 0.9545 0.8601 (0.8525,0.8672) 0.9952523 0.9503 (0.9526,0.9595) 12 25 13
rs193778 5.6491 391 0.9608 0.9609 (0.9548,0.9656) 0.9388672 0.9501 (0.9461,0.9576) 10 9 -1
rs2111485 7.9084 354 0.9719 0.8484 (0.8473,0.8615) 0.9951172 0.9500 (0.9455,0.9571) 3 4 1
rs229533 5.3313 231 0.9528 0.9749 (0.9703,0.9806) 0.9140625 0.9500 (0.9438,0.9586) 6 5 -1
rs2476601 21.8305 691 1.0000 0.7489 (0.7348,0.7528) 1.0000000 0.9497 (0.9481,0.9555) 2 2 0
rs3024505 6.6024 390 0.9985 0.8773 (0.8825,0.8984) 0.9870750 0.9498 (0.9433,0.9559) 3 3 0
rs3087243 8.4387 557 0.9507 0.8498 (0.8431,0.8573) 0.9969482 0.9500 (0.9433,0.9509) 16 25 9
rs34536443 7.8423 247 0.9985 0.9950 (0.989,0.998) NA NA NA 1 1 0
rs34593439 7.5158 335 0.9949 0.9694 (0.9579,0.972) 0.9328125 0.9501 (0.9521,0.9661) 2 2 0
rs402072 5.4127 169 0.9657 0.9634 (0.96,0.9701) 0.9220703 0.9501 (0.9379,0.9509) 11 10 -1
rs4820830 6.1016 844 0.9515 0.9156 (0.9129,0.9235) 0.9734375 0.9501 (0.9457,0.9543) 42 48 6
rs516246 6.4900 201 0.9611 0.9225 (0.9222,0.934) 0.9674941 0.9502 (0.941,0.952) 11 12 1
rs56994090 6.0550 87 0.9732 0.9709 (0.958,0.9694) 0.9205566 0.9501 (0.9422,0.9542) 6 5 -1
rs6043409 5.5195 371 0.9661 0.9696 (0.9709,0.9785) 0.9072266 0.9500 (0.9468,0.9589) 9 7 -2
rs61839660 11.8881 343 0.9752 0.8257 (0.8176,0.8312) 0.9999023 0.9501 (0.9484,0.9552) 4 8 4
rs62447205 5.7966 692 0.9531 0.8975 (0.8909,0.8971) 0.9836426 0.9500 (0.9498,0.955) 27 38 11
rs6476839 5.9203 280 0.9540 0.9307 (0.9304,0.9435) 0.9625000 0.9501 (0.9459,0.9547) 14 15 1
rs6518350 4.4576 191 0.9542 0.9750 (0.9664,0.976) 0.9153320 0.9500 (0.9408,0.9533) 9 9 0
rs653178 13.5525 538 0.9999 0.7477 (0.7395,0.7542) 1.0000000 0.9506 (0.9498,0.9605) 2 5 3
rs705705 9.9452 70 0.9921 0.8373 (0.8333,0.8422) 0.9990356 0.9500 (0.9442,0.9517) 8 12 4
rs72727394 5.1817 219 0.9648 0.9406 (0.9342,0.9443) 0.9621094 0.9501 (0.947,0.9566) 17 17 0
rs72928038 7.5404 292 0.9810 0.9820 (0.9693,0.985) NA NA NA 1 1 0
rs757411 4.9777 193 0.9518 0.9361 (0.9332,0.9461) 0.9563058 0.9500 (0.9387,0.9529) 37 39 2
rs75793288 7.2588 705 0.9517 0.8940 (0.884,0.9028) 0.9931827 0.9503 (0.946,0.9605) 16 37 21
rs8056814 8.5139 607 0.9514 0.8484 (0.8445,0.8625) 0.9996133 0.9503 (0.9454,0.9551) 4 21 17
rs917911 4.6019 563 0.9502 0.9715 (0.966,0.9793) 0.8942952 0.9502 (0.9385,0.9544) 33 15 -18
rs9585056 5.6000 165 0.9590 0.9918 (0.9832,0.9941) 0.6846541 0.9499 (0.9323,0.961) 2 1 -1
  • Note that for the BACH2 region (rsid rs72928038), the standard 95% credible set now picks out just this index SNP (it has a PP of 0.981) and although this has a high corrected coverage value, the credible set cannot be made any smaller.

  • There also appears to be some correlation between the value for muhat and the corrected coverage estimate (I noticed this because the regions with very low corrected coverage all seemed to have high muhats where our method has been shown to say that the corrected coverage is so low). Indeed, these results match what we see in our simulations (see first section of this report). Perhaps we need an upper \(\hat\mu\) value for where our method should not be used.


T1D Results - Plots



4. Realistic Fine-Mapping (Reference LD info)



5. Comments and To Do

#' # simulate fake haplotypes to obtain MAFs and LD matrix

#' nhaps <- 1000
#' lag <- 7
#' maf.tmp <- runif(nsnps+lag, 0.05, 0.5) # common SNPs
#' laghaps <- do.call("cbind", lapply(maf.tmp, function(f) rbinom(nhaps,1,f)))
#' haps <- laghaps[,1:nsnps]

# specifically the following 3 lines??

#' for(j in 1:lag)
#'    haps <- haps + laghaps[,(1:nsnps)+j]
#' haps <- round(haps/matrix(apply(haps,2,max),nhaps,nsnps,byrow=TRUE))


#' maf <- colMeans(haps)
#' LD <- cor2(haps)