So far, all of our simulations are based on ldd regions on chromosome 22 using data from UK10K, we did not quantify the LD of these regions.
To investigate the effect of LD on our method, we apply our method to various regions (low, medium or high LD) on chromosome 10 using 1000 Genomes data.
Each of the regions contain approximately 700 SNPs (corrcov
function takes ~ 2 minutes).
Our method seems to work best relative to the claimed coverage in the \(2.5<\mu<6\) range (\(0.01<P<2e-09\)).
It becomes very variable in high powered scenarioes for high LD regions. Here, the corrected coverage is often too low. What would these regions look like? Let’s look at a high powered example with the CV coloured in red:
## true_mu mu_est
## 1 19.88103 18.65872
## $credset
## [1] 362 307 438 440 466 286 285 309 445 308 315 446 447 450 434 449 373
## [18] 326 369 476
##
## $claimed.cov
## [1] 0.9574063
##
## $nvar
## [1] 20
claimed <- size
true.est <- data$true.cov.est
corrected <- corrcov(data$z0, data$maf, data$NN, data$NN, data$LD)
data.frame(true = true.est, claimed = claimed, corrected = corrected)
## true claimed corrected
## 1 0.9812 0.9574063 0.9134529
corrcov
function.W = 0.2
varbeta = data$v
z = data$z0
Sigma = data$LD
nrep = 1000 # note - get similar result when increasing this
thr = 0.95
iCV = data$CV
r = W^2/(W^2 + varbeta)
bf = 0.5 * (log(1 - r) + (r * z^2))
par(mfrow=c(1,2))
plot(bf, ylab = "ABF", main = "ABF")
points(cs$credset, bf[cs$credset], col = "blue")
points(iCV, bf[iCV], col = "red")
plot(pp, ylab = "PP", main = "PP")
points(cs$credset, pp[cs$credset], col = "blue")
points(iCV, pp[iCV], col = "red")
nrep = 1000
simulated credible sets that contain the corresponding variant that is assumed causal. The variants in the credible set are coloured in blue and the CV is coloured in red.iCV_sims <- pps[[iCV]]
max <- apply(iCV_sims, 1, which.max)
# what proportion of the 1000 simulations give the biggest PP to the CV?
length(which(max==iCV))/1000
## [1] 0.372
table(max)
## max
## 285 308 315 326 328 336 339 340 341 342 343 347 362 369 373 384 385 434
## 2 1 2 4 1 2 1 1 4 3 3 3 6 11 3 7 2 3
## 438 440 445 446 447 449 450 464 466 473 476 489
## 399 372 125 3 4 3 5 11 7 4 5 3
So the CV is only given the largest of the PP in 39% of simulations, and in another 39% of the simulations the adjacent variant (438) is given the largest PP, kind of illustrating why our method is not as accurate in regions of high LD.
This is intuitive because our correction method works by simulating more GWAS results from “the same system” and finding the proportion of the resultant credible sets that contain the variant that is assumed to be causal. In regions with very high LD, when we simulate more GWAS results, the PP is shared amongst the highly correlated variants and in some cases, the actual assumed CV won’t get a substantial proportion of the allocated PP and therefore will not appear in the credible set.
Perhaps we should therefore incorporate the expected LD in the region (as well as the value for \(\hat\mu\)) in the “recommended use” section.
It is apparant that the higher the LD, the more our method struggles.
The number of simulations for the above plot (binned by muhat) is….
##
## low medium medium2 high
## (0,1] 0 0 0 0
## (1,2] 152 172 165 189
## (2,2.5] 307 258 245 201
## (2.5,3] 203 160 192 159
## (3,4] 249 242 236 256
## (4,4.75] 144 162 167 167
## (4.75,6] 237 266 266 237
## (6,8] 234 235 258 257
## (8,13] 304 305 271 323
## (13,16] 47 41 47 42
## (16,Inf] 113 150 147 161
The T1D results I have so far use data which has been quality checked and therefore SNPs have been removed. Chris has created files containing all the data.
The code for this analysis is on the HPC AJHG/T1D_3/
.
rs12453507: There are NAs in the LD matrix (Sigma[194, 48]
) between variants imm_17_34695499 (48) and imm_17_34950738 (194).
Chris says that “I think this is a bug in snpStats when two SNPs have no subjects in one of the double homozygous cells” and has given me a function to fix this.
Cannot make credible set smaller error in corrected_cs
function (because the credible set already only has one variant).
Need to add a tryCatch
to my code so the whole job doesn’t fail.
Have ammended code to deal with this.
No root in range error.
I’ve been using lower=0.9
parameter value in corrected_cs_bhat
function.
Have changed this to lower=0
for these.
Having fixed these problematic regions, I re-run the T1D analysis where:
The index SNP is called by it’s position rather than name (rsIDs are variable).
corrcov_bhat(bhat = bhat, V = varbeta, N0 = 12262, N1 = 6670, Sigma = Sigma, thr = thresh, W = 0.2, nrep = 1000)
is used to find the corrected coverage estimate.
If the threshold (0.95) is not contained within the 95% confidence interval for the corrected coverage estimate, then the corrected_cs_bhat(bhat = bhat, V = varbeta, N0 = 12262, N1 = 6670, Sigma = Sigma, lower = 0.9, upper = 1, desired.cov = 0.95, acc = 0.0001, max.iter = 50)
function is used to find a corrected credible set.
In the following plot, the corrected coverage for the regions using the original (trimmed) data are shown in pale pink for comparison.
In the paragraph below, the numbers in brackets are the corresponding results from my original T1D analysis (using the trimmed data set).
I found that a corrected credible set could be found in 25 (24) out of the 39 regions. Specifically, 8 (13) credible sets could be reduced in size and 17 (11) credible sets required the inclusion of additional variants. The median/ mean corrected coverage for the 8 (13) credible sets which could be reduced in size was 0.971/ 0.972 (0.9694/ 0.9696) and the median/ mean number of variants removed from the set was 1/ 4 (2/ 5.077). The median/ mean corrected coverage for the 17 (11) credible sets which required more variants to be added to the set was 0.866/ 0.873 (0.9111/ 0.8939) and the median/ mean number of variants removed from the set was 6/ 9 (3/ 4.364).
These results differ because there are a different number of SNPs in the corresponding regions between my original and my new results:
## snps cut full
## 1 rs10277986 362 389
## 2 rs11203202 201 222
## 3 rs11954020 397 425
## 4 rs12416116 300 326
## 5 rs12453507 790 866
## 6 rs12927355 482 564
## 7 rs13415583 747 849
## 8 rs1456988 382 412
## 9 rs151234 248 291
## 10 rs1538171 702 805
## 11 rs1615504 173 198
## 12 rs1893217 262 284
## 13 rs193778 293 391
## 14 rs2111485 320 354
## 15 rs229533 208 231
## 16 rs2476601 639 691
## 17 rs3024505 337 390
## 18 rs3087243 512 557
## 19 rs34593439 296 335
## 20 rs402072 151 169
## 21 rs4820830 778 844
## 22 rs516246 183 201
## 23 rs56994090 79 87
## 24 rs6043409 339 371
## 25 rs61839660 313 343
## 26 rs62447205 641 692
## 27 rs6476839 248 280
## 28 rs6518350 160 191
## 29 rs653178 468 538
## 30 rs705705 64 70
## 31 rs72727394 200 219
## 32 rs72928038 268 292
## 33 rs757411 160 193
## 34 rs75793288 624 705
## 35 rs8056814 512 607
## 36 rs917911 523 563
## 37 rs9585056 154 165
index_snp | muhat | nsnps | claim | corrcov | CI95 | req_thr | new_corrcov | new.CI95 | orig.nvar | new.nvar | diff |
---|---|---|---|---|---|---|---|---|---|---|---|
rs10277986 | 6.3486 | 389 | 0.9908 | 0.8949 | (0.8858,0.898) | 0.9872070 | 0.9501 | (0.9534,0.9602) | 12 | 12 | 0 |
rs11203202 | 7.5674 | 222 | 0.9994 | 0.9631 | (0.9583,0.9659) | 0.9204353 | 0.9499 | (0.9478,0.958) | 4 | 4 | 0 |
rs113010081 | 4.8744 | 1342 | 0.9539 | 0.9502 | (0.949,0.9573) | NA | NA | NA | 51 | 1 | -50 |
rs11954020 | 3.9973 | 425 | 0.9515 | 0.9701 | (0.9655,0.9735) | 0.9209473 | 0.9501 | (0.9437,0.953) | 74 | 67 | -7 |
rs12416116 | 6.7886 | 326 | 0.9557 | 0.9567 | (0.9489,0.9636) | NA | NA | NA | 2 | 1 | -1 |
rs12453507 | 5.1121 | 866 | 0.9541 | 0.9492 | (0.9438,0.9524) | NA | NA | NA | 45 | 1 | -44 |
rs12927355 | 8.7231 | 564 | 0.9565 | 0.8662 | (0.8645,0.8749) | 0.9977097 | 0.9500 | (0.9482,0.9526) | 17 | 28 | 11 |
rs13415583 | 5.1904 | 849 | 0.9522 | 0.9022 | (0.8953,0.9039) | 0.9800293 | 0.9500 | (0.9444,0.9502) | 58 | 71 | 13 |
rs1456988 | 5.2630 | 412 | 0.9661 | 0.9122 | (0.9055,0.9146) | 0.9790527 | 0.9500 | (0.9461,0.9526) | 23 | 24 | 1 |
rs151234 | 6.7041 | 291 | 0.9861 | 0.9295 | (0.9287,0.948) | 0.9537201 | 0.9499 | (0.9313,0.9498) | 2 | 2 | 0 |
rs1538171 | 6.5230 | 805 | 0.9517 | 0.8421 | (0.8406,0.8499) | 0.9950684 | 0.9501 | (0.9473,0.953) | 41 | 77 | 36 |
rs1615504 | 5.0927 | 198 | 0.9621 | 0.9414 | (0.9335,0.9423) | 0.9603516 | 0.9501 | (0.9451,0.9532) | 21 | 21 | 0 |
rs1893217 | 8.3153 | 284 | 0.9545 | 0.8601 | (0.8525,0.8672) | 0.9952523 | 0.9503 | (0.9526,0.9595) | 12 | 25 | 13 |
rs193778 | 5.6491 | 391 | 0.9608 | 0.9609 | (0.9548,0.9656) | 0.9388672 | 0.9501 | (0.9461,0.9576) | 10 | 9 | -1 |
rs2111485 | 7.9084 | 354 | 0.9719 | 0.8484 | (0.8473,0.8615) | 0.9951172 | 0.9500 | (0.9455,0.9571) | 3 | 4 | 1 |
rs229533 | 5.3313 | 231 | 0.9528 | 0.9749 | (0.9703,0.9806) | 0.9140625 | 0.9500 | (0.9438,0.9586) | 6 | 5 | -1 |
rs2476601 | 21.8305 | 691 | 1.0000 | 0.7489 | (0.7348,0.7528) | 1.0000000 | 0.9497 | (0.9481,0.9555) | 2 | 2 | 0 |
rs3024505 | 6.6024 | 390 | 0.9985 | 0.8773 | (0.8825,0.8984) | 0.9870750 | 0.9498 | (0.9433,0.9559) | 3 | 3 | 0 |
rs3087243 | 8.4387 | 557 | 0.9507 | 0.8498 | (0.8431,0.8573) | 0.9969482 | 0.9500 | (0.9433,0.9509) | 16 | 25 | 9 |
rs34536443 | 7.8423 | 247 | 0.9985 | 0.9950 | (0.989,0.998) | NA | NA | NA | 1 | 1 | 0 |
rs34593439 | 7.5158 | 335 | 0.9949 | 0.9694 | (0.9579,0.972) | 0.9328125 | 0.9501 | (0.9521,0.9661) | 2 | 2 | 0 |
rs402072 | 5.4127 | 169 | 0.9657 | 0.9634 | (0.96,0.9701) | 0.9220703 | 0.9501 | (0.9379,0.9509) | 11 | 10 | -1 |
rs4820830 | 6.1016 | 844 | 0.9515 | 0.9156 | (0.9129,0.9235) | 0.9734375 | 0.9501 | (0.9457,0.9543) | 42 | 48 | 6 |
rs516246 | 6.4900 | 201 | 0.9611 | 0.9225 | (0.9222,0.934) | 0.9674941 | 0.9502 | (0.941,0.952) | 11 | 12 | 1 |
rs56994090 | 6.0550 | 87 | 0.9732 | 0.9709 | (0.958,0.9694) | 0.9205566 | 0.9501 | (0.9422,0.9542) | 6 | 5 | -1 |
rs6043409 | 5.5195 | 371 | 0.9661 | 0.9696 | (0.9709,0.9785) | 0.9072266 | 0.9500 | (0.9468,0.9589) | 9 | 7 | -2 |
rs61839660 | 11.8881 | 343 | 0.9752 | 0.8257 | (0.8176,0.8312) | 0.9999023 | 0.9501 | (0.9484,0.9552) | 4 | 8 | 4 |
rs62447205 | 5.7966 | 692 | 0.9531 | 0.8975 | (0.8909,0.8971) | 0.9836426 | 0.9500 | (0.9498,0.955) | 27 | 38 | 11 |
rs6476839 | 5.9203 | 280 | 0.9540 | 0.9307 | (0.9304,0.9435) | 0.9625000 | 0.9501 | (0.9459,0.9547) | 14 | 15 | 1 |
rs6518350 | 4.4576 | 191 | 0.9542 | 0.9750 | (0.9664,0.976) | 0.9153320 | 0.9500 | (0.9408,0.9533) | 9 | 9 | 0 |
rs653178 | 13.5525 | 538 | 0.9999 | 0.7477 | (0.7395,0.7542) | 1.0000000 | 0.9506 | (0.9498,0.9605) | 2 | 5 | 3 |
rs705705 | 9.9452 | 70 | 0.9921 | 0.8373 | (0.8333,0.8422) | 0.9990356 | 0.9500 | (0.9442,0.9517) | 8 | 12 | 4 |
rs72727394 | 5.1817 | 219 | 0.9648 | 0.9406 | (0.9342,0.9443) | 0.9621094 | 0.9501 | (0.947,0.9566) | 17 | 17 | 0 |
rs72928038 | 7.5404 | 292 | 0.9810 | 0.9820 | (0.9693,0.985) | NA | NA | NA | 1 | 1 | 0 |
rs757411 | 4.9777 | 193 | 0.9518 | 0.9361 | (0.9332,0.9461) | 0.9563058 | 0.9500 | (0.9387,0.9529) | 37 | 39 | 2 |
rs75793288 | 7.2588 | 705 | 0.9517 | 0.8940 | (0.884,0.9028) | 0.9931827 | 0.9503 | (0.946,0.9605) | 16 | 37 | 21 |
rs8056814 | 8.5139 | 607 | 0.9514 | 0.8484 | (0.8445,0.8625) | 0.9996133 | 0.9503 | (0.9454,0.9551) | 4 | 21 | 17 |
rs917911 | 4.6019 | 563 | 0.9502 | 0.9715 | (0.966,0.9793) | 0.8942952 | 0.9502 | (0.9385,0.9544) | 33 | 15 | -18 |
rs9585056 | 5.6000 | 165 | 0.9590 | 0.9918 | (0.9832,0.9941) | 0.6846541 | 0.9499 | (0.9323,0.961) | 2 | 1 | -1 |
These are available to view on github: https://github.com/annahutch/PhD/tree/master/plots
NB, if the “new credible set” plot is missing then the credible set did not need to be corrected.
Often researchers do not have access to the full genotype data and they therefore use reference panels to obtain LD information.
A realistic fine-mapping scenario would be that a researcher has the \(Z\) scores, \(MAF\)s and sample sizes. They would then match the SNPs to the 1000 genomes resource to get an estimate of the LD. I want to run an analysis that reflects this realistic scenario to see how our correction performs.
In my simulations, I have the positions of the SNPs - i.e. Chr22: 42861613 (pos42861613) and the corresponding UK10K data. I want to match the SNPs (by their position) to 1000 Genomes data (1000GP_Phase3_chr22.legend.gz
) to generate independent reference data.
I can then investigate descrepencies between analysing a large dataset and smaller reference or vice versa (where large/small refer to the total number of chromosomes observed in each dataset).
pkgdown::build_site
command should be evoked so that the webpage is updated. I have got this half working but the vignettes are returning 404 errors.Do a walk-through “typical GWAS example” where we have the Z scores and MAFs (and sample sizes) and we estimate the SNP correlations from a reference panel. How do I go about using the 1000 genomes data for the LD estimate?
Ask Chris about code for simulating haplotypes (as this is used in my R package examples):
#' # simulate fake haplotypes to obtain MAFs and LD matrix
#' nhaps <- 1000
#' lag <- 7
#' maf.tmp <- runif(nsnps+lag, 0.05, 0.5) # common SNPs
#' laghaps <- do.call("cbind", lapply(maf.tmp, function(f) rbinom(nhaps,1,f)))
#' haps <- laghaps[,1:nsnps]
# specifically the following 3 lines??
#' for(j in 1:lag)
#' haps <- haps + laghaps[,(1:nsnps)+j]
#' haps <- round(haps/matrix(apply(haps,2,max),nhaps,nsnps,byrow=TRUE))
#' maf <- colMeans(haps)
#' LD <- cor2(haps)
Reminder: Need to write a careful paragraph in the discussion of the paper about missing data/ drop-out rate/ gold standard stuff.
Do a paper draft for PLOS genetics to give to Chris whilst I’m at APTS.
Prepare for first year upgrading: