1. LD pruning


There are two reasons why we need to LD prune our data:

  1. To remove correlated predictors from the \(\chi^2\) regression used to select our \(q\) vectors.

  2. To remove correlated samples before kernel density estimation (KDE), so that the bandwidth (BW) is not biased. The trade-off is that the KDE may be less accurate when it is fitted to fewer data points.

I do this using the LDAK software. LDAK weights tend to be lower for SNPs in regions of high LD and higher for SNPs in regions of low LD, so that in the full heritability model the expected heritability is higher for SNPs in regions of lower LD or with higher MAF. A weight of 0 means that the SNP's signal is (almost) perfectly captured by neighbouring SNPs.
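As a hedged aside, the relationship described above can be written using the LDAK heritability model as I understand it from the LDAK documentation. The symbols \(w_j\), \(f_j\) and \(\alpha\) are introduced here for illustration and are not used elsewhere in these notes:

\[
\mathrm{E}\left[h_j^2\right] \propto w_j \left[f_j (1 - f_j)\right]^{1 + \alpha}
\]

Here \(h_j^2\) is the heritability contributed by SNP \(j\), \(w_j\) is its LDAK weight, \(f_j\) is its MAF and \(\alpha\) is a scaling parameter (to my knowledge the suggested value is \(\alpha = -0.25\)). When \(1 + \alpha > 0\), expected heritability increases with both the weight and the MAF, and a SNP with \(w_j = 0\) contributes nothing.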


Method:

  1. Generate European-only 1000 Genomes phase 3 haplotype files for each chromosome (see ldprune/make_eurhap.R).

  2. Cut down each haplotype file to contain only asthma SNPs, matched by hg19 coordinates (see ldprune/make_asthmahaps.R). A few SNPs cannot be matched; hopefully these are the same ones that go unmatched in the baseline LD annotation step, i.e. asthma SNPs that are absent from the 1000 Genomes phase 3 dataset. At this stage, 7042 SNPs are lost.

  3. Use write.plink to make PLINK files.

  4. Use LDAK to generate weights (see the run-ldak.sh script).
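Step 2 above (restricting the haplotypes to asthma SNPs by position) can be sketched as follows. This is a minimal illustration in Python, not the code in ldprune/make_asthmahaps.R; the function name, the argument layout (one list of positions parallel to one list of haplotype rows) and the toy values are all assumptions.

```python
def subset_haplotypes(positions, hap_rows, keep_positions):
    """Keep haplotype rows whose hg19 position is in keep_positions.

    positions      : list of base-pair positions, one per variant (hypothetical layout)
    hap_rows       : list of haplotype rows, parallel to positions
    keep_positions : positions of the GWAS SNPs we want to retain
    Returns (kept_positions, kept_rows, n_lost), where n_lost counts the
    requested positions that were not present in the haplotype data.
    """
    keep = set(keep_positions)
    kept_positions, kept_rows = [], []
    for pos, row in zip(positions, hap_rows):
        if pos in keep:
            kept_positions.append(pos)
            kept_rows.append(row)
    n_lost = len(keep - set(kept_positions))
    return kept_positions, kept_rows, n_lost


# Toy data (made up): four reference variants, three requested positions.
ref_positions = [100, 150, 200, 300]
ref_haps = [[0, 1], [1, 1], [0, 0], [1, 0]]
kept_pos, kept_haps, n_lost = subset_haplotypes(ref_positions, ref_haps, [150, 300, 999])
print(kept_pos, kept_haps, n_lost)
```

Counting the unmatched positions at this point gives the number of SNPs lost at this stage.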


To do:

  • Get results for chr 2 and 4. Running LDAK with --cut-weights currently gives this error:

    “Reading annotations for 128549 predictors from tes/ldak.bim Error, basepair for rs6854799 (112205778.00) is lower than that for rs6854800 (112205779.00)”

See http://dougspeed.com/advanced-options/
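The error message suggests that LDAK expects base-pair positions to be non-decreasing within the .bim file. A minimal sketch for locating such rows before running LDAK is given below. It assumes the standard six-column PLINK .bim layout (chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2); the function name and the toy rows are hypothetical.

```python
def find_order_violations(bim_lines):
    """Return (previous_id, current_id) pairs where the base-pair position
    decreases between consecutive rows on the same chromosome."""
    violations = []
    prev = None  # (chrom, snp_id, bp) of the previous valid row
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, snp_id, bp = fields[0], fields[1], float(fields[3])
        if prev is not None and prev[0] == chrom and bp < prev[2]:
            violations.append((prev[1], snp_id))
        prev = (chrom, snp_id, bp)
    return violations


# Toy .bim rows (IDs, positions and alleles are made up).
example_bim = [
    "4 snpA 0 1000 A G",
    "4 snpB 0 1002 C T",
    "4 snpC 0 1001 A G",  # out of order relative to snpB
    "4 snpD 0 1500 G T",
]
print(find_order_violations(example_bim))
```

If violations are found, one option is to sort the variants by chromosome and position before writing the PLINK files, so that the .bim, .bed and any downstream files stay in the same order.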



The proportion of SNPs given a weight of 0 on each chromosome is shown below (chr 2 and 4 are missing; see the to-do item above):

## $chr1
## [1] 0.7389829
## 
## $chr10
## [1] 0.7498808
## 
## $chr11
## [1] 0.7527958
## 
## $chr12
## [1] 0.738965
## 
## $chr13
## [1] 0.7560465
## 
## $chr14
## [1] 0.731472
## 
## $chr15
## [1] 0.7067432
## 
## $chr16
## [1] 0.6810886
## 
## $chr17
## [1] 0.673031
## 
## $chr18
## [1] 0.7306687
## 
## $chr19
## [1] 0.6371283
## 
## $chr20
## [1] 0.7177763
## 
## $chr21
## [1] 0.7111854
## 
## $chr22
## [1] 0.6755821
## 
## $chr3
## [1] 0.7442185
## 
## $chr5
## [1] 0.7475649
## 
## $chr6
## [1] 0.7678233
## 
## $chr7
## [1] 0.7427875
## 
## $chr8
## [1] 0.7619172
## 
## $chr9
## [1] 0.7363088

My preliminary results look OK when only the SNPs with non-zero weights are used for the KDE, and this seems to solve some of the problems with very small \(q\) values.
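The per-chromosome proportions and the non-zero subset used for the KDE can be computed along these lines. This is a sketch only: it assumes the weights have already been read into a dictionary mapping SNP ID to weight (the actual LDAK output format is not reproduced here), and the function names and toy values are hypothetical.

```python
def zero_weight_proportion(weights):
    """Fraction of SNPs whose LDAK weight is exactly 0."""
    if not weights:
        raise ValueError("no weights supplied")
    n_zero = sum(1 for w in weights.values() if w == 0)
    return n_zero / len(weights)


def nonzero_snps(weights):
    """IDs of SNPs with a positive weight, in input order."""
    return [snp for snp, w in weights.items() if w > 0]


# Toy weights for one chromosome (made up).
toy_weights = {"snp1": 0.0, "snp2": 1.25, "snp3": 0.0, "snp4": 0.4}
print(zero_weight_proportion(toy_weights))  # 0.5
print(nonzero_snps(toy_weights))            # ['snp2', 'snp4']
```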


2. Generation of auxiliary functional data


I use two sources to obtain auxiliary functional data:

  1. Match 2,001,256 asthma GWAS SNPs to their annotations in the baseline v2.2 LD model, either by SNP ID or by hg19 BP.

    • The baseline model contains annotation data for 1000 Genomes phase 3 SNPs.
    • There were 22,771 asthma SNPs not found in the baseline LD model.
    • Of these 22,771 unmatched SNPs, only 27 have \(p < 10^{-5}\).
    • The SNPs lost in the LDAK weights step are a subset of those lost here.
    • (see /home/ah2011/rds/hpc-work/asthma/baselineLD_annots).
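The matching in source 1 can be sketched as a two-stage lookup: try the SNP ID first, then fall back to the hg19 position. The order of the two stages, the dictionary keys ("id", "chrom", "bp") and the function name are assumptions made for illustration; the real matching code lives in the directory referenced above.

```python
def match_annotations(gwas_snps, annot_rows):
    """Match GWAS SNPs to annotation rows by ID, then by (chrom, bp).

    Both inputs are lists of dicts with hypothetical keys "id", "chrom", "bp".
    Returns (matched, unmatched): a dict from GWAS SNP ID to annotation row,
    and a list of GWAS SNP IDs with no match.
    """
    by_id = {row["id"]: row for row in annot_rows}
    by_pos = {(row["chrom"], row["bp"]): row for row in annot_rows}
    matched, unmatched = {}, []
    for snp in gwas_snps:
        row = by_id.get(snp["id"])
        if row is None:
            # Fall back to position when the ID is absent from the annotations.
            row = by_pos.get((snp["chrom"], snp["bp"]))
        if row is None:
            unmatched.append(snp["id"])
        else:
            matched[snp["id"]] = row
    return matched, unmatched


# Toy data (made up).
annots = [
    {"id": "snp1", "chrom": "1", "bp": 100, "coding": 0},
    {"id": "snpX", "chrom": "1", "bp": 200, "coding": 1},
]
gwas = [
    {"id": "snp1", "chrom": "1", "bp": 100},  # matched by ID
    {"id": "snp2", "chrom": "1", "bp": 200},  # matched by position
    {"id": "snp3", "chrom": "1", "bp": 300},  # unmatched
]
matched, unmatched = match_annotations(gwas, annots)
print(sorted(matched), unmatched)
```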


  2. Match 2,001,256 asthma GWAS SNPs to their ChromHMM genome segmentation annotation from BLUEPRINT ChIP-Seq data in 34 blood cell types (http://dcc.blueprint-epigenome.eu/#/md/secondary_analysis/Segmentation_of_ChIP-Seq_data_20140811). SNPs are matched by hg19 BP; if a SNP falls on a segment boundary (approx. 200 SNPs in each cell type), it is randomly assigned one of the candidate annotations.
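The boundary rule can be illustrated with a small sketch. It assumes segments are stored as (start, end, state) tuples and treated as closed intervals, so that a position equal to the end of one segment and the start of the next counts as "on a boundary"; this convention, the function name and the state labels are assumptions, not a description of the actual segmentation files.

```python
import random


def annotate_position(bp, segments, rng=random):
    """Return the segmentation state for a base-pair position.

    segments : list of (start, end, state), treated as closed intervals
    rng      : object with a choice() method (the random module by default)
    If the position lies in more than one segment (i.e. on a shared
    boundary), one of the candidate states is chosen at random.
    Returns None when no segment covers the position.
    """
    hits = [state for start, end, state in segments if start <= bp <= end]
    if not hits:
        return None
    return rng.choice(hits)


# Toy segments (made up): the two segments share the boundary at 200.
toy_segments = [(100, 200, "E1"), (200, 300, "E2")]
print(annotate_position(150, toy_segments))   # E1
print(annotate_position(1000, toy_segments))  # None
print(annotate_position(200, toy_segments, random.Random(0)))  # E1 or E2
```

A linear scan is used here for clarity. With roughly two million SNPs per cell type, a binary search over sorted segment starts would be a more practical choice.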

The emission probabilities are shown below (with my predicted meanings):