There are two reasons why we need to LD prune our data:
To remove correlated predictors in the \(\chi^2\) regression to select our \(q\) vectors.
To remove correlated samples for the KDE estimation (so our bandwidth is not biased) [though the KDE may be less accurate when estimated from fewer data points].
I do this using the LDAK software. LDAK weights tend to be lower for SNPs in regions of high LD and higher for SNPs in regions of low LD (so that in the full heritability model, heritability is higher for SNPs in regions of lower LD/higher MAF). If a SNP has a weight of 0, this means that its signal is (almost) perfectly captured by neighbouring SNPs.
Method:
Generate European-only 1000 Genomes phase 3 haplotype files for each chromosome (see ldprune/make_eurhap.R).
Cut down each haplotype file to contain only asthma SNPs, matched by hg19 coordinates (see ldprune/make_asthmahaps.R). Note that a few cannot be matched - hopefully these are the same ones not matched in the baseline LD annotation step, i.e. those asthma SNPs not in the 1000 Genomes phase 3 dataset. At this stage, 7042 SNPs are lost.
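The coordinate-matching step can be sketched as follows. The actual scripts are in R; this is a Python sketch of the logic, and the row layout and (chromosome, position) key are illustrative rather than the exact columns of the .leg files.

```python
# Keep only haplotype rows whose hg19 position also appears in the asthma
# GWAS SNP list; report the positions that cannot be matched.
# (Illustrative layout: each legend row is a dict with "chr" and "bp".)

def match_by_position(asthma_snps, legend_rows):
    """asthma_snps: iterable of (chr, bp) tuples; legend_rows: list of dicts.
    Returns (matched rows, set of unmatched (chr, bp) positions)."""
    wanted = set(asthma_snps)
    matched = [row for row in legend_rows if (row["chr"], row["bp"]) in wanted]
    found = {(row["chr"], row["bp"]) for row in matched}
    return matched, wanted - found
```

The unmatched set corresponds to the asthma SNPs absent from the 1000 Genomes phase 3 legend files.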
Use write.plink to make PLINK files.
Use LDAK (see the run-ldak.sh script) to generate weights.
Error when using --cut-weights: “Reading annotations for 128549 predictors from tes/ldak.bim Error, basepair for rs6854799 (112205778.00) is lower than that for rs6854800 (112205779.00)”
See http://dougspeed.com/advanced-options/
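One possible workaround (not verified against LDAK's exact requirements) is to rewrite the .bim file sorted by chromosome and then basepair before rerunning --cut-weights. A Python sketch of that sorting step:

```python
# Rewrite .bim rows sorted by chromosome, then basepair, so that predictors
# arrive in increasing position order. This may or may not be the full fix
# LDAK needs - see the advanced-options page linked above.

def sort_bim_lines(lines):
    """lines: .bim rows as whitespace-separated strings
    (chr, id, genetic-distance, bp, allele1, allele2)."""
    rows = [ln.split() for ln in lines if ln.strip()]
    rows.sort(key=lambda r: (int(r[0]), int(r[3])))
    return ["\t".join(r) for r in rows]
```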
The proportion of SNPs given a weight of 0 are shown below:
## $chr1
## [1] 0.7389829
##
## $chr10
## [1] 0.7498808
##
## $chr11
## [1] 0.7527958
##
## $chr12
## [1] 0.738965
##
## $chr13
## [1] 0.7560465
##
## $chr14
## [1] 0.731472
##
## $chr15
## [1] 0.7067432
##
## $chr16
## [1] 0.6810886
##
## $chr17
## [1] 0.673031
##
## $chr18
## [1] 0.7306687
##
## $chr19
## [1] 0.6371283
##
## $chr20
## [1] 0.7177763
##
## $chr21
## [1] 0.7111854
##
## $chr22
## [1] 0.6755821
##
## $chr3
## [1] 0.7442185
##
## $chr5
## [1] 0.7475649
##
## $chr6
## [1] 0.7678233
##
## $chr7
## [1] 0.7427875
##
## $chr8
## [1] 0.7619172
##
## $chr9
## [1] 0.7363088
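The per-chromosome proportions above were tallied from the LDAK weights output. A minimal Python sketch of the tally, assuming a simple two-column "predictor weight" text format (an assumption about the weights file layout):

```python
# Fraction of predictors given a zero LDAK weight, i.e. SNPs whose signal
# is (almost) perfectly captured by their neighbours.
# Assumes each line is "predictor weight"; adjust for the real file format.

def prop_zero_weight(weight_lines):
    weights = [float(ln.split()[1]) for ln in weight_lines if ln.strip()]
    return sum(w == 0.0 for w in weights) / len(weights)
```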
My preliminary results look OK when using only the SNPs with non-zero weights for the KDE estimation, and this seems to solve some of the very small \(q\) problems.
I use two sources to obtain auxiliary functional data:
Annotations from the baseline v2.2 LD model: I match 2,001,256 asthma GWAS SNPs to their annotations either by SNP ID or hg19 BP (see /home/ah2011/rds/hpc-work/asthma/baselineLD_annots).
Cell-type-specific Blueprint segmentation data; the emission probabilities are shown below (with my predicted meanings):
My final annotation dataset is a data.frame of 130 annotations at 1,978,358 SNPs.
full_annots <- readRDS("full_annotations.RDS")
I then perform a mixture of PCA (for the quantitative annotations) and MCA (for the qualitative annotations) for dimensionality reduction, using PCAmixdata.
There will be disease-irrelevant variability in the data set, yet we only want to pick out the disease-relevant variability for our method. To do this, we regress the \(\chi^2\) statistics from the asthma GWAS data (using only the independent SNPs) against each dimension obtained from the dimensionality reduction, and choose the dimensions that come out as significant as our \(q\) vectors.
##
## Call:
## lm(formula = chisq ~ ., data = ind_x_pred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.491 -0.928 -0.581 0.296 104.818
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0588692 0.0028431 372.433 < 2e-16 ***
## dim.1 0.0006224 0.0005271 1.181 0.237647
## dim.2 0.0004942 0.0005390 0.917 0.359134
## dim.3 0.0003452 0.0006642 0.520 0.603289
## dim.4 -0.0026354 0.0006939 -3.798 0.000146 ***
## dim.5 0.0021696 0.0007456 2.910 0.003615 **
## dim.6 -0.0011157 0.0008139 -1.371 0.170446
## dim.7 -0.0006136 0.0008499 -0.722 0.470303
## dim.8 0.0002293 0.0009023 0.254 0.799396
## dim.9 0.0134115 0.0008526 15.730 < 2e-16 ***
## dim.10 0.0010536 0.0010507 1.003 0.316005
## dim.11 -0.0002311 0.0011597 -0.199 0.842044
## dim.12 -0.0002268 0.0012789 -0.177 0.859263
## dim.13 0.0005794 0.0011994 0.483 0.629052
## dim.14 0.0020241 0.0013447 1.505 0.132256
## dim.15 -0.0013026 0.0014988 -0.869 0.384812
## dim.16 -0.0009163 0.0014209 -0.645 0.518987
## dim.17 -0.0008743 0.0015096 -0.579 0.562463
## dim.18 0.0145264 0.0013914 10.440 < 2e-16 ***
## dim.19 0.0008206 0.0016116 0.509 0.610629
## dim.20 0.0026149 0.0016321 1.602 0.109132
## dim.21 -0.0131723 0.0016263 -8.100 5.53e-16 ***
## dim.22 -0.0026284 0.0016857 -1.559 0.118945
## dim.23 0.0006325 0.0017574 0.360 0.718927
## dim.24 -0.0024074 0.0017878 -1.347 0.178123
## dim.25 -0.0014359 0.0017927 -0.801 0.423150
## dim.26 0.0007662 0.0016462 0.465 0.641610
## dim.27 0.0140075 0.0016329 8.578 < 2e-16 ***
## dim.28 0.0029525 0.0018529 1.593 0.111056
## dim.29 0.0004786 0.0018658 0.257 0.797558
## dim.30 0.0002545 0.0018933 0.134 0.893050
## dim.31 0.0093636 0.0019175 4.883 1.04e-06 ***
## dim.32 0.0021812 0.0019396 1.125 0.260785
## dim.33 0.0021487 0.0019383 1.109 0.267628
## dim.34 -0.0100960 0.0017972 -5.618 1.94e-08 ***
## dim.35 -0.0020731 0.0020084 -1.032 0.301969
## dim.36 -0.0048509 0.0020033 -2.421 0.015459 *
## dim.37 0.0204051 0.0019916 10.246 < 2e-16 ***
## dim.38 0.0059072 0.0020605 2.867 0.004145 **
## dim.39 0.0157483 0.0019992 7.877 3.35e-15 ***
## dim.40 -0.0059882 0.0020720 -2.890 0.003852 **
## dim.41 -0.0027968 0.0020650 -1.354 0.175607
## dim.42 0.0029864 0.0021265 1.404 0.160209
## dim.43 -0.0003507 0.0020617 -0.170 0.864915
## dim.44 -0.0095762 0.0018978 -5.046 4.51e-07 ***
## dim.45 -0.0023827 0.0021194 -1.124 0.260928
## dim.46 -0.0001481 0.0021212 -0.070 0.944336
## dim.47 -0.0068358 0.0019273 -3.547 0.000390 ***
## dim.48 0.0031135 0.0021627 1.440 0.149964
## dim.49 -0.0048615 0.0021790 -2.231 0.025677 *
## dim.50 0.0051218 0.0021750 2.355 0.018533 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.692 on 439041 degrees of freedom
## (4613 observations deleted due to missingness)
## Multiple R-squared: 0.001928, Adjusted R-squared: 0.001814
## F-statistic: 16.96 on 50 and 439041 DF, p-value: < 2.2e-16
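A minimal Python sketch of the selection rule applied to this output: keep the dimensions whose coefficient p-value clears a multiple-testing threshold and use them as \(q\) vectors. The Bonferroni cut-off of 0.05/50 is my assumption here; the text above only requires significance.

```python
# Select q vectors: keep dimensions whose coefficient p-value passes a
# Bonferroni-corrected threshold (0.05/50 - an assumed cut-off).

def select_q_dims(pvals, n_dims=50, alpha=0.05):
    """pvals: dict of dimension name -> coefficient p-value."""
    cutoff = alpha / n_dims
    return sorted(d for d, p in pvals.items() if p < cutoff)
```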
What do the loadings plots look like for these significant dimensions? And are they monotonic in \(p\)? Not always…
How about if I apply an orthogonal rotation? No, that doesn't seem to help.
## Warning: Removed 341 rows containing non-finite values (stat_boxplot).
Recall that the method worked well when only using the 96 LD score annotations, perhaps I should just revert back to these.
To speed up our method, we bin similar \((p,q)\) pairs and calculate contours, and therefore \(v\) values, for each bin (see cFDR_final/check_hexbinning).
These bins need to be extremely small (see https://www.rdocumentation.org/packages/hexbin/versions/1.29.0/topics/hexbin). The number of bins is specified by the xbins parameter, which gives the number of bins partitioning the x (\(p\)) direction (not equally spaced).
| Method | Time | # bins | Biggest bin |
|---|---|---|---|
| No binning | 4hr 36 | 1,999,262 | 1 |
| nxbin=5000 | 3hr 58 | 1,779,293 | 47 |
| nxbin=2000 | 2hr 41 | 1,129,918 | 81 |
| nxbin=1000 | 1hr 14 | 497,643 | 105 |
But binning by p values seems to introduce some tailing off:
This is due to the lack of bins in the low-\(p\), high-\(q\) range of points. E.g. there are 24 data points with low \(p\) (\(p<1e-15\)) and high \(q\) (\(q>0\)) in our asthma data set. When using xbin=1000, there is one bin for all these points (\((p,q)\) centre of mass (2.7e-19, 3.78)). When using xbin=5000, there are 5 bins for these points (\((p,q)\) centres of mass (4.7e-18, 0.49), (2.8e-17, 0.57), (3.3e-17, 1.1), (2.6e-19, 3.7), (2.8e-19, 3.8)).
If I instead bin by \(z\) scores (rather than \(p\) values), since we are particularly interested in low \(p\), then there is a single bin for each of these 24 points (even when using only 600 bins) and the tailing off goes away.
| Method | Time | # bins | Biggest bin |
|---|---|---|---|
| No binning | 4hr 36 | 1,999,262 | 1 |
| nxbin=1000 | 37 min | 171,688 | 285 |
| nxbin=500 | 21 min | 60,380 | 629 |
We need to consider what to set xbin to (the number of bins in the \(z_p\) direction). This parameter mainly depends on the number of data points. For my preliminary analysis with 2 million data points, xbin=500 suffices (0.025%).
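A Python sketch of the z-scale binning idea (the analysis itself uses hexbin in R): transform each p value to \(z = -\Phi^{-1}(p/2)\), then assign equal-width bins along \(z\) and \(q\). The equal-width grid is a simplification of hexbin's hexagonal lattice, and all ranges and bin counts below are illustrative.

```python
from statistics import NormalDist

# Bin (p, q) pairs on the z scale: z = -qnorm(p/2) grows as p shrinks, so
# equal-width z bins give low-p points bins of their own. Ranges (z_max,
# q_min, q_max) and bin counts are illustrative placeholders.

def zbin_index(p, q, xbins=500, z_max=10.0, q_min=-1.0, q_max=5.0, qbins=100):
    z = min(-NormalDist().inv_cdf(p / 2), z_max)  # two-sided p -> |z|, capped
    i = min(int(z / z_max * xbins), xbins - 1)    # bin along z
    j = max(0, min(int((q - q_min) / (q_max - q_min) * qbins), qbins - 1))
    return i, j
```

Working on the z scale avoids the float underflow that makes many distinct tiny p values land in one p-scale bin.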
If the problematic \(v\) values for extremely low \(q\) persist after the above efforts, then I need to think of a method for mapping these points back to a reasonable value. This could be done by fitting a spline through the points; I'd need to sensibly extend the spline for small \(q\) and use some distance threshold to decide when points are corrected.
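A Python sketch of this fallback, with a least-squares line standing in for the spline; the q cut-off and distance threshold are placeholders:

```python
# Snap stray v values at very low q back onto a trend fitted over the
# well-behaved range (q > q_cut), only when they deviate by more than tol.
# A least-squares line stands in for the spline; q_cut and tol are
# placeholder thresholds.

def correct_small_q(qs, vs, q_cut=-2.0, tol=0.5):
    fit = [(q, v) for q, v in zip(qs, vs) if q > q_cut]
    mq = sum(q for q, _ in fit) / len(fit)
    mv = sum(v for _, v in fit) / len(fit)
    slope = (sum((q - mq) * (v - mv) for q, v in fit)
             / sum((q - mq) ** 2 for q, _ in fit))
    icept = mv - slope * mq
    return [icept + slope * q
            if (q <= q_cut and abs(v - (icept + slope * q)) > tol) else v
            for q, v in zip(qs, vs)]
```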
Comments and questions
When I have everything sorted, the method will be:
Generate auxiliary functional data (using annotation files from LDSC (22,000 SNPs lost here) and cell-type-specific Blueprint segmentation data) using PCAmix.
Generate a set of independent SNPs using the LDAK weighting procedure on 1000 Genomes data (7,000 SNPs lost here so far).
Regress the \(\chi^2\) statistics of the independent SNPs on the dimensions from PCAmix and use the significant dimensions as our \(q\) vectors to iterate over.
Use functional_cFDR_v3 to get iterative \(v\) values (incorporates binning and independent SNPs for KDE).
In the LDAK weighting step, I match asthma SNPs to 1000 Genomes phase 3 haplotype data (by position in the .leg files); 7,000 SNPs (so far - I haven't done chromosomes 2 or 4) cannot be matched. In the baseline LD model step, I match asthma SNPs to their annotations in the baseline LD model files (which supposedly contain annotation data for all 1000 Genomes phase 3 SNPs) by SNP name or position; 22,767 SNPs cannot be matched. Surely the same number of SNPs should be missing from both data sets? Maybe LDSC threw away some SNPs…
Need to get LDAK weights for chr 2, 4.
Need some way to instruct users on values for the xbin, res_p and res_q parameters [note that I need to increase res_q in my analysis as the range is now a lot bigger].
Is there any way to check my LDAK weights are correct?