1. LD pruning


There are two reasons why we need to LD prune our data:

  1. To remove correlated predictors in the \(\chi^2\) regression to select our \(q\) vectors.

  2. To remove correlated data points for the KDE estimation (so that our bandwidth estimate is not biased) [though the KDE may be less accurate when using fewer data points].

I do this using the LDAK software. LDAK weights tend to be lower for SNPs in regions of high LD and vice versa (so that, in the full heritability model, heritability is higher for SNPs in regions of lower LD/higher MAF). If a SNP has a weight of 0, then its signal is (almost) perfectly captured by neighbouring SNPs.


Method:

  1. Generate european only 1000 Genomes phase 3 haplotype files for each chromosome (see ldprune/make_eurhap.R).

  2. Cut down each haplotype file to contain only asthma SNPs, matched by hg19 coordinates (see ldprune/make_asthmahaps.R). Note that a few SNPs cannot be matched; hopefully these are the same ones not matched in the baseline LD annotation step, i.e. those asthma SNPs not in the 1000 Genomes phase 3 dataset. At this stage, 7,042 SNPs are lost.

  3. Use write.plink to make PLINK files (see the sketch after this list).

  4. Use LDAK (look at run-ldak.sh script) to generate weights.
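A minimal sketch of step 3, assuming write.plink from the snpStats package, and assuming the earlier steps have produced a SnpMatrix geno of European genotypes at the asthma SNPs and a data.frame snp_info of their hg19 positions and alleles (object and file names are hypothetical):

library(snpStats)

# write .bed/.bim/.fam files for one chromosome, ready to be passed to LDAK
write.plink(file.base  = "asthma_chr22",
            snps       = geno,
            chromosome = snp_info$chr,
            position   = snp_info$bp,
            allele.1   = snp_info$a0,
            allele.2   = snp_info$a1)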


To do:

  • Get results for chr 2 and 4:

Error when using --cut-weights: “Reading annotations for 128549 predictors from tes/ldak.bim Error, basepair for rs6854799 (112205778.00) is lower than that for rs6854800 (112205779.00)”

See http://dougspeed.com/advanced-options/
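A quick hedged diagnostic for this (reading the .bim file named in the error message; the column names below are just the standard PLINK .bim layout), checking for out-of-order or duplicated basepair positions:

# check the .bim passed to --cut-weights for unsorted or duplicated positions
bim <- read.table("tes/ldak.bim",
                  col.names = c("chr", "snp", "cm", "bp", "a1", "a2"))
any(diff(bim$bp) < 0)    # TRUE if the basepairs are not in increasing order
sum(duplicated(bim$bp))  # number of duplicated basepair positions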



The proportion of SNPs given a weight of 0 is shown below for each chromosome:

## $chr1
## [1] 0.7389829
## 
## $chr10
## [1] 0.7498808
## 
## $chr11
## [1] 0.7527958
## 
## $chr12
## [1] 0.738965
## 
## $chr13
## [1] 0.7560465
## 
## $chr14
## [1] 0.731472
## 
## $chr15
## [1] 0.7067432
## 
## $chr16
## [1] 0.6810886
## 
## $chr17
## [1] 0.673031
## 
## $chr18
## [1] 0.7306687
## 
## $chr19
## [1] 0.6371283
## 
## $chr20
## [1] 0.7177763
## 
## $chr21
## [1] 0.7111854
## 
## $chr22
## [1] 0.6755821
## 
## $chr3
## [1] 0.7442185
## 
## $chr5
## [1] 0.7475649
## 
## $chr6
## [1] 0.7678233
## 
## $chr7
## [1] 0.7427875
## 
## $chr8
## [1] 0.7619172
## 
## $chr9
## [1] 0.7363088
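These proportions were computed along the lines of the following sketch (the directory layout and the assumption that each weights file has a header and the weight in its second column are mine):

# hedged sketch: proportion of zero LDAK weights per chromosome
weight_files <- list.files("ldak_weights", pattern = "weights", full.names = TRUE)
names(weight_files) <- sub(".*(chr[0-9]+).*", "\\1", weight_files)
lapply(weight_files, function(f) {
  w <- read.table(f, header = TRUE)  # assume the weight is in column 2
  mean(w[[2]] == 0)
})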

My preliminary results look OK when using only the SNPs with non-zero weights for the KDE estimation, and this seems to solve some of the very small \(q\) problems.


2. Generation of auxiliary functional data


I use two sources to obtain auxiliary functional data:

  1. Match 2,001,256 asthma GWAS SNPs to their annotations in the baseline v2.2 LD model either by SNPID or hg19 BP.

    • The baseline model contains annotation data for 1000 Genomes phase 3 SNPs.
    • There were 22,771 Asthma SNPs not found in the baseline LD model.
    • The SNPs lost in the LDAK weights step are a subset of those lost here.
    • Of these, only 27 have \(p < 1e-05\).
    • (see /home/ah2011/rds/hpc-work/asthma/baselineLD_annots).


  2. Match 2,001,256 asthma GWAS SNPs to their ChromHMM genome segmentation annotation from BLUEPRINT ChIP-Seq data in 34 blood cell types (http://dcc.blueprint-epigenome.eu/#/md/secondary_analysis/Segmentation_of_ChIP-Seq_data_20140811). SNPs are matched by hg19 BP and, if a SNP falls on a segment boundary (approx. 200 SNPs in each cell type), it is randomly assigned one of the adjacent segments' states (see the sketch below).
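A minimal sketch of this matching for one cell type, using GenomicRanges (not necessarily the code actually used) and hypothetical data.frames snps (hg19 positions) and segs (segment coordinates and ChromHMM state):

library(GenomicRanges)

# assign each SNP the state of the segment containing it; a SNP lying exactly
# on a boundary overlaps two segments and gets one of them at random
snp_gr <- GRanges(snps$chr, IRanges(snps$bp, snps$bp))
seg_gr <- GRanges(segs$chr, IRanges(segs$start, segs$end), state = segs$state)

hits <- findOverlaps(snp_gr, seg_gr)
pick <- tapply(subjectHits(hits), queryHits(hits),
               function(j) if (length(j) == 1) j else sample(j, 1))
snps$state <- NA
snps$state[as.integer(names(pick))] <- seg_gr$state[unlist(pick)]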

The emission probabilities are shown below (with my predicted meanings):


My final annotation dataset is a data.frame of 130 annotations at 1,978,358 SNPs.

full_annots <- readRDS("full_annotations.RDS")

I then perform a combination of PCA (for the continuous annotations) and MCA (for the categorical annotations) for dimensionality reduction using PCAmixdata.
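A minimal sketch of this step, assuming the PCAmixdata functions splitmix and PCAmix (50 dimensions are retained here purely to match the regression below):

library(PCAmixdata)

# split the annotations into continuous and categorical parts and run PCAmix
split_annots <- splitmix(full_annots)
pcamix_fit   <- PCAmix(X.quanti = split_annots$X.quanti,
                       X.quali  = split_annots$X.quali,
                       ndim     = 50,
                       rename.level = TRUE,
                       graph    = FALSE)
head(pcamix_fit$ind$coord)  # SNP scores on the retained dimensions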


There will be disease-irrelevant variability in the data set, yet we only want to pick out the disease-relevant variability for our method. To do this, we regress the \(\chi^2\) statistics from the asthma GWAS data (using only the independent SNPs) against each dimension obtained from the dimensionality reduction, and choose the dimensions that come out as significant as our \(q\) vectors.

## 
## Call:
## lm(formula = chisq ~ ., data = ind_x_pred)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -1.491  -0.928  -0.581   0.296 104.818 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.0588692  0.0028431 372.433  < 2e-16 ***
## dim.1        0.0006224  0.0005271   1.181 0.237647    
## dim.2        0.0004942  0.0005390   0.917 0.359134    
## dim.3        0.0003452  0.0006642   0.520 0.603289    
## dim.4       -0.0026354  0.0006939  -3.798 0.000146 ***
## dim.5        0.0021696  0.0007456   2.910 0.003615 ** 
## dim.6       -0.0011157  0.0008139  -1.371 0.170446    
## dim.7       -0.0006136  0.0008499  -0.722 0.470303    
## dim.8        0.0002293  0.0009023   0.254 0.799396    
## dim.9        0.0134115  0.0008526  15.730  < 2e-16 ***
## dim.10       0.0010536  0.0010507   1.003 0.316005    
## dim.11      -0.0002311  0.0011597  -0.199 0.842044    
## dim.12      -0.0002268  0.0012789  -0.177 0.859263    
## dim.13       0.0005794  0.0011994   0.483 0.629052    
## dim.14       0.0020241  0.0013447   1.505 0.132256    
## dim.15      -0.0013026  0.0014988  -0.869 0.384812    
## dim.16      -0.0009163  0.0014209  -0.645 0.518987    
## dim.17      -0.0008743  0.0015096  -0.579 0.562463    
## dim.18       0.0145264  0.0013914  10.440  < 2e-16 ***
## dim.19       0.0008206  0.0016116   0.509 0.610629    
## dim.20       0.0026149  0.0016321   1.602 0.109132    
## dim.21      -0.0131723  0.0016263  -8.100 5.53e-16 ***
## dim.22      -0.0026284  0.0016857  -1.559 0.118945    
## dim.23       0.0006325  0.0017574   0.360 0.718927    
## dim.24      -0.0024074  0.0017878  -1.347 0.178123    
## dim.25      -0.0014359  0.0017927  -0.801 0.423150    
## dim.26       0.0007662  0.0016462   0.465 0.641610    
## dim.27       0.0140075  0.0016329   8.578  < 2e-16 ***
## dim.28       0.0029525  0.0018529   1.593 0.111056    
## dim.29       0.0004786  0.0018658   0.257 0.797558    
## dim.30       0.0002545  0.0018933   0.134 0.893050    
## dim.31       0.0093636  0.0019175   4.883 1.04e-06 ***
## dim.32       0.0021812  0.0019396   1.125 0.260785    
## dim.33       0.0021487  0.0019383   1.109 0.267628    
## dim.34      -0.0100960  0.0017972  -5.618 1.94e-08 ***
## dim.35      -0.0020731  0.0020084  -1.032 0.301969    
## dim.36      -0.0048509  0.0020033  -2.421 0.015459 *  
## dim.37       0.0204051  0.0019916  10.246  < 2e-16 ***
## dim.38       0.0059072  0.0020605   2.867 0.004145 ** 
## dim.39       0.0157483  0.0019992   7.877 3.35e-15 ***
## dim.40      -0.0059882  0.0020720  -2.890 0.003852 ** 
## dim.41      -0.0027968  0.0020650  -1.354 0.175607    
## dim.42       0.0029864  0.0021265   1.404 0.160209    
## dim.43      -0.0003507  0.0020617  -0.170 0.864915    
## dim.44      -0.0095762  0.0018978  -5.046 4.51e-07 ***
## dim.45      -0.0023827  0.0021194  -1.124 0.260928    
## dim.46      -0.0001481  0.0021212  -0.070 0.944336    
## dim.47      -0.0068358  0.0019273  -3.547 0.000390 ***
## dim.48       0.0031135  0.0021627   1.440 0.149964    
## dim.49      -0.0048615  0.0021790  -2.231 0.025677 *  
## dim.50       0.0051218  0.0021750   2.355 0.018533 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.692 on 439041 degrees of freedom
##   (4613 observations deleted due to missingness)
## Multiple R-squared:  0.001928,   Adjusted R-squared:  0.001814 
## F-statistic: 16.96 on 50 and 439041 DF,  p-value: < 2.2e-16
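The significant dimensions can then be pulled out along these lines (the 0.05 cutoff is just illustrative; a multiple-testing correction may be preferable):

# regress chi-squared on the PCAmix dimensions at the independent SNPs
# and keep the significant dimensions as the q vectors
fit   <- lm(chisq ~ ., data = ind_x_pred)
coefs <- summary(fit)$coefficients
sig_dims <- setdiff(rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05], "(Intercept)")
sig_dims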

What do the loadings plots look like for these significant dimensions? And are these monotonic in \(p\)? Not always…


How about if I apply the orthogonal rotation? Nope, it doesn't seem to help.

## Warning: Removed 341 rows containing non-finite values (stat_boxplot).


Recall that the method worked well when using only the 96 LD score annotations; perhaps I should just revert to these.


3. Binning (p,q) data


To speed up our method, we bin similar \((p,q)\) pairs and calculate contours and therefore \(v\) values for each bin (see cFDR_final/check_hexbinning).

These bins need to be extremely small (see https://www.rdocumentation.org/packages/hexbin/versions/1.29.0/topics/hexbin). The number of bins is specified by the xbins parameter, which is the number of bins partitioning the x (\(p\)) direction (not equally spaced).
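A minimal sketch of the binning, assuming the hexbin package and hypothetical vectors p and q holding the \((p,q)\) pairs:

library(hexbin)

# hexagonally bin the (p, q) pairs; IDs = TRUE records which cell each pair falls in
hb <- hexbin(p, q, xbins = 1000, IDs = TRUE)
centres <- hcell2xy(hb)             # (x, y) centre of each occupied cell
bin_of  <- match(hb@cID, hb@cell)   # bin index for every (p, q) pair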


Binning by p values


Method       Time         # bins       Biggest bin
No binning   4 hr 36 min  (1,999,262)  1
nxbin=5000   3 hr 58 min  1,779,293    47
nxbin=2000   2 hr 41 min  1,129,918    81
nxbin=1000   1 hr 14 min  497,643      105

But binning by p values seems to introduce some tailing off:



This is due to the lack of bins used in the low \(p\), high \(q\) range of points. For example, there are 24 data points with low \(p\) (\(p<1e-15\)) and high \(q\) (\(q>0\)) in our asthma data set. When using xbin=1000, there is a single bin for all of these points (with \((p,q)\) centre of mass (2.7e-19, 3.78)). When using xbin=5000, there are 5 bins for these points (with \((p,q)\) centres of mass (4.7e-18, 0.49), (2.8e-17, 0.57), (3.3e-17, 1.1), (2.6e-19, 3.7), (2.8e-19, 3.8)).


Z scores


If I instead bin by \(z\) scores (rather than \(p\) values), since we are particularly interested in low \(p\), then there is a single bin for each of these 24 points (even when using only 600 bins) and the tailing off goes away.
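A hedged sketch of the conversion (assuming two-sided p values):

# bin on z scores rather than p values so that resolution is concentrated at small p
zp <- qnorm(p / 2, lower.tail = FALSE)  # |z| corresponding to a two-sided p value
hb <- hexbin(zp, q, xbins = 500, IDs = TRUE)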

Method       Time         # bins       Biggest bin
No binning   4 hr 36 min  (1,999,262)  1
nxbin=1000   37 min       171,688      285
nxbin=500    21 min       60,380       629

We need to consider what to set xbin to (the number of bins in the \(z_p\) direction). This parameter mainly depends on the number of data points. For my preliminary analysis of roughly 2 million data points, xbin=500 suffices (0.025% of the number of data points).


Mapping problematic points


If the problematic \(v\) values for extremely low \(q\) persist after the above efforts, then I need a method for mapping these points back to reasonable values. This could be done by fitting a spline through the points; I'd need to reasonably extend the spline for small \(q\) and use some distance threshold to decide when points are corrected.
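A speculative sketch of such a correction (nothing here is implemented yet; q_threshold and dist_threshold are hypothetical):

# fit a spline of v against q over the well-behaved points and use its
# prediction for very small q points whose v lies too far from the spline
ok   <- q > q_threshold
sfit <- smooth.spline(q[ok], v[ok])
pred <- predict(sfit, q)$y
fix  <- !ok & abs(v - pred) > dist_threshold
v[fix] <- pred[fix]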


Comments and questions


When I have everything sorted, the method will be:

  1. Generate auxiliary functional data (using annotation files from LDSC (22,000 SNPs lost here) and cell-type-specific BLUEPRINT segmentation data) using PCAmix.

  2. Generate a set of independent SNPs using the LDAK weighting procedure on 1000 Genomes data (7,000 SNPs lost here so far).

  3. Regress the \(\chi^2\) statistics of the independent SNPs on the dimensions from PCAmix and use significant dimensions as our \(q\) vectors to iterate over.

  4. Use functional_cFDR_v3 to get iterative \(v\) values (incorporates binning and independent SNPs for KDE).


  • In the LDAK weighting step, I match asthma SNPs to 1000 Genomes phase 3 haplotype data (by position in the .leg files); 7,000 SNPs (so far - I haven't done chr 2 or 4) cannot be matched. In the baseline LD model step, I match asthma SNPs to their annotations in the baseline LD model files (which supposedly contain annotation data for all 1000 Genomes phase 3 SNPs) by SNP name or position; 22,767 SNPs cannot be matched. Surely the same number of SNPs should be missing from both data sets? Maybe LDSC threw away some SNPs.

  • Need to get LDAK weights for chr 2, 4.

  • Need some way to instruct users on values for xbin, res_p and res_q parameters [note that I need to increase res_q in my analysis as the range is now a lot bigger].

  • Is there any way to check my LDAK weights are correct?