1. Fine-mapping

• But there is still uncertainty about the PPs: whilst the division of SNPs into two sets,
1. $$PP \approx 0$$
2. $$PP > 0$$,
is robust, the numerical values within the $$PP > 0$$ group are quite noisy.
• This motivates the use of credible sets of putative causal variants.
• We found that credible sets derived from the dominant fine-mapping method are systematically biased and have developed a method to correct for this.
• Our method can be used to improve the resolution of the fine-mapping experiment without the use of any additional data.
• We have made our method accessible to the scientific community via a CRAN R package and a webpage including several vignettes.
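The credible set construction that our correction targets can be sketched as follows. This is the standard (uncorrected) approach of Maller et al.: rank SNPs by posterior probability and take the smallest set whose cumulative PP reaches the target coverage. The PP values below are illustrative only; this is not our corrected method.

```python
def credible_set(pp, coverage=0.99):
    """Return indices of the smallest set of SNPs whose summed PP >= coverage."""
    order = sorted(range(len(pp)), key=lambda i: pp[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i)
        total += pp[i]
        if total >= coverage:
            break
    return chosen

pps = [0.55, 0.30, 0.10, 0.04, 0.01]     # toy posterior probabilities
print(credible_set(pps, coverage=0.90))  # -> [0, 1, 2]
```

It is exactly these nominal coverage values (90%, 99%) that are systematically biased in the standard approach, which is what motivates the correction.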

Fine-mapping is limited in that it only pinpoints putative causal variants and does not elucidate the mechanisms by which the causal variants operate to cause disease. Fine-mapping is a necessary preliminary step to ensure efficient allocation of resources (e.g. to a handful of SNPs within the 99% credible sets) but functional genomic techniques should be exploited to dissect the underlying biology. For example, our method was used as a proof-of-principle to show that the one variant in the region which had a functional effect (measured using MPRA) was contained within the corrected 99% credible set, whilst the other 2 variants contained in the set showed no functional effect.

2. Functional Genomics

• Functional genomics can be used at the fine-mapping stage, e.g. CAVIAR and PAINTOR can incorporate functional annotations as priors in their Bayesian framework.
• But this has its complications: (i) converting functional data to a probability of causality (as required for use as a prior) is difficult and subjective; (ii) it becomes harder to detect novel causal variants which, for example, do not have any functional data; (iii) errors in the functional data will lead to spurious results that may be hard to detect.
• It may therefore be preferable to use functional data after fine-mapping to (i) validate results (as in our T1D analysis, where we found that the SNPs in the single-SNP credible sets had functional effects) or (ii) extend inferences, for example finding the target genes in the relevant cell types.
• But what is functional data (details of ATAC-seq/ChIP-seq/ENCODE database), and how are the results analysed (details of ChromHMM/Segway)?

• Current methods for SNP enrichment (see existing methods section):
• GARFIELD: Non-parametric enrichment analysis of GWAS variants (exceeding a chosen association threshold) for annotations in various cell types, accounting for LD, MAF and local gene density.
• CHEERS: Statistical method that accounts for subtle changes in chromatin landscape (e.g. different peak characteristics) to identify SNP enrichment across cell states. Specifically designed to quantify SNP enrichment for various immune cell states.
• fGWAS: Goal is to build a hierarchical model to identify the shared characteristics of SNPs that causally influence a trait (estimates patterns of enrichment across the whole genome). Notes that the derived priors can be used to reweight GWAS PPs in Maller et al.'s approach.
• GoShifter: Recognises that the non-random distribution of genomic annotations and LD need to be accounted for and uses a circularised permutation method to estimate the null enrichment statistics.
• GREGOR: Statistical method to quantify SNP enrichment in annotations. The expected overlap between SNPs and annotations is modelled by taking a sample of matched SNPs. Ultimately attempts to select the locus with the best chance of demonstrating a functional variant, but not the causal variant within it.
• GPA: Primarily used to prioritise GWAS variants using multiple GWAS data sets (pleiotropy - diseases are related and often share underlying genetic variants) and functional annotations, but an intermediate result is the enrichment of functional annotations. Does not account for LD between variants.
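The shared idea behind several of these methods (GoShifter, GREGOR) can be sketched with a simple permutation test: compare the observed overlap of trait SNPs with an annotation to the overlap of randomly drawn SNP sets of the same size. This is a generic toy version, not a reimplementation of any one method — the real tools additionally match on LD, MAF and gene density, which this naive null does not.

```python
import random

def enrichment_p(trait_snps, all_snps, annotated, n_perm=1000, seed=1):
    """Permutation p-value for overlap of trait SNPs with an annotation.

    Draws random SNP sets of the same size as the trait set and counts
    how often their overlap with the annotation matches or exceeds the
    observed overlap. Does NOT match on LD/MAF/gene density.
    """
    rng = random.Random(seed)
    observed = sum(s in annotated for s in trait_snps)
    null_ge = 0
    for _ in range(n_perm):
        draw = rng.sample(all_snps, len(trait_snps))
        if sum(s in annotated for s in draw) >= observed:
            null_ge += 1
    return observed, (null_ge + 1) / (n_perm + 1)

all_snps = list(range(1000))
annotated = set(range(100))            # 10% of SNPs fall in the annotation
trait = [1, 5, 20, 50, 90, 95, 300]    # 6 of 7 trait SNPs are annotated
obs, p = enrichment_p(trait, all_snps, annotated)
print(obs, round(p, 3))
```

The point the methods above all make is that this naive null is wrong: clumped annotations and LD between SNPs inflate the null overlap, which is why GoShifter circularises permutations and GREGOR samples matched SNPs.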

My current work is focussed on investigating the relationships between functional data and association/causality statistics. To do this, I have downloaded genomic annotations (100 bp resolution) for 19 human cell types from the Segway encyclopedia and overlaid SNPs from the T1D GWAS. For each of the 123,130 SNPs, I have the functional annotation in the 19 human cell types and their P value for association with T1D (I need to extend this so I have PPs too - currently I only have PPs for $$\approx 16,000$$ SNPs from my previous analysis). Some questions I have explored using regression (logistic/penalised/quantile) include “which annotation in which cell type is most significant for P value?” and “falling in an active region in which cell type is most significant for P value?”. However, this analysis does not account for LD or the non-random distribution of genomic annotations; that is, I have not specified a null test statistic distribution that accounts for these confounders.
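A stripped-down version of the kind of question asked above, for a single cell type: how strongly does falling in an "active" annotation associate with GWAS significance? Here it is reduced to a simple log odds ratio on fabricated counts — the real analysis fits logistic/penalised/quantile regressions across 19 cell types, and (as noted) neither version corrects for LD.

```python
import math

def log_odds_ratio(sig_active, sig_inactive, nonsig_active, nonsig_inactive):
    """Log odds ratio from a 2x2 table of (significant?) x (active annotation?),
    with Haldane's +0.5 correction to guard against zero cells."""
    a, b, c, d = (x + 0.5 for x in (sig_active, sig_inactive,
                                    nonsig_active, nonsig_inactive))
    return math.log((a * d) / (b * c))

# Hypothetical counts: significant SNPs enriched in active chromatin.
lor = log_odds_ratio(sig_active=40, sig_inactive=10,
                     nonsig_active=300, nonsig_inactive=650)
print(round(lor, 2))  # -> 2.12, i.e. strong positive association
```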

I am also investigating techniques to incorporate functional data to reweight association/causality statistics. Namely, I am hoping to use the cFDR method to reweight GWAS P values using a binary indicator of active/inactive chromatin. I would then like to extend the cFDR method to PPs (rather than P values) and implement the PP-reweighting method from the prostate cancer paper.
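With a binary conditioning variable, the basic empirical cFDR estimator reduces to counting: $$\widehat{\mathrm{cFDR}}(p \mid \text{active}) \approx p \cdot \#\{\text{active}\} / \#\{P_i \le p,\ \text{active}\}$$. The sketch below shows only this counting estimator on toy data — not the full method I intend to use, which involves further refinements (e.g. monotonicity adjustments and leave-out schemes to avoid overfitting).

```python
def cfdr_binary(p, pvals, active):
    """Empirical cFDR at threshold p, conditioned on a binary annotation.

    Estimates p * Pr(active) / Pr(P <= p, active) by counting, clipped at 1.
    `active` is a 0/1 indicator per SNP (active chromatin or not).
    """
    n_active = sum(active)
    n_joint = sum(1 for pi, ai in zip(pvals, active) if ai and pi <= p)
    return min(1.0, p * n_active / max(n_joint, 1))

# Toy data: small p-values cluster in active chromatin.
pvals  = [1e-6, 1e-4, 0.01, 0.02, 0.3, 0.5, 0.7, 0.9]
active = [1,    1,    1,    0,    1,   0,   0,   1]
print(round(cfdr_binary(0.01, pvals, active), 4))  # -> 0.0167
```

Because 3 of the 5 active SNPs already have $$P \le 0.01$$, the conditional estimate (0.0167) is above the raw threshold here; the reweighting becomes favourable when the annotation is enriched for small P values relative to its overall frequency.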

Note that my functional data only measures the activity of genomic regions (e.g. whether it is a transcribed region) and does not model the potential for biochemical specificity that could allow certain regions to regulate only specific genes. It may be fruitful to incorporate contact information, such as that from CHi-C. Indeed, the CRISPRi-FlowFISH paper suggests that the percentage contribution of an enhancer to a gene is proportional to the mean of the ATAC-seq and H3K27ac ChIP-seq peaks at the enhancer and the KR-normalised Hi-C contact frequency between the enhancer and the gene. Due to resolution problems, Peaky could be used to further this analysis.
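The activity-times-contact idea from the CRISPRi-FlowFISH (ABC model) paper can be sketched as follows: each candidate enhancer's score is its activity (here the geometric mean of ATAC-seq and H3K27ac signal, as in the ABC model) multiplied by its Hi-C contact with the promoter, normalised over all candidate elements for that gene. All signal values below are made up for illustration.

```python
import math

def abc_scores(elements):
    """Activity-by-Contact style scores.

    elements: list of (atac, h3k27ac, hic_contact) per candidate enhancer.
    Returns each element's fractional contribution to the gene.
    """
    raw = [math.sqrt(a * h) * c for a, h, c in elements]  # activity x contact
    total = sum(raw)
    return [r / total for r in raw]

candidates = [(4.0, 9.0, 0.5),   # strong, proximal element
              (1.0, 1.0, 0.2),   # weak element
              (16.0, 4.0, 0.1)]  # active but distal element
scores = abc_scores(candidates)
print([round(s, 3) for s in scores])  # -> [0.75, 0.05, 0.2]
```

Note how the normalisation makes the distal element's high activity count for little once its low contact frequency is factored in — which is exactly where the resolution of the contact data (and hence Peaky) matters.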

Idea: Extend current functional data (measuring activity at that SNP) using Peaky data (but then would I need to specify genes…).