GWAS to function

Identifying a disease-associated genomic locus is only a small step towards understanding the biology of disease. Fine-mapping can then be used to identify the specific causal variant within the associated locus, after which the target gene and the mechanism (often tissue-specific) still need to be identified. This is difficult partly because >90% of disease-associated variants lie in non-protein-coding regions of the genome, and many are far from the nearest known gene ¹ ².

General Results ³

Complex traits are highly polygenic (many genes are involved)
- Each individual carries many alleles, some of which increase and some of which decrease the trait value or disease risk.
- Each individual variant therefore only explains a small proportion of the variance.
- Motivates the use of polygenic risk scores.
Pleiotropy is common
- Pleiotropy: One variant influences many traits.
- E.g. Mendelian mutations (which occur in one gene and give rise to a disease) are associated with multiple phenotypes in an affected individual. For example, cystic fibrosis is caused by mutations in a single gene, CFTR, which give rise to disease phenotypes in the lungs, liver, pancreas and intestines.
- Linked to co-localisation.
Causal variants (CVs) are enriched in non-coding regions
- This implies that they affect gene products through regulation.
- These variants are unlikely to affect protein levels without also influencing mRNA levels. They are therefore more likely to affect protein abundance than protein function (which would typically require an amino-acid-changing variant in the gene).
Disease-associated variants are enriched in cis-regulatory elements (CREs)
- These are typically defined by chromatin accessibility.
- This implies that they affect disease risk by altering the regulation of one or more target genes.
- It is still very hard to link these to target genes.
- CREs containing disease-associated variants tend to be active in disease-relevant cell types. E.g. the PICs paper found that CVs for autoimmune diseases were enriched in predicted B and T cell enhancers. This implies that CVs influence disease risk by altering the function of cell type-specific regulatory elements.
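The polygenic risk score mentioned above can be illustrated with a minimal sketch: a weighted sum of effect-allele counts, with per-variant effect sizes as weights. The variant names, effect sizes and genotypes below are hypothetical, and real scores need additional steps (e.g. accounting for LD and choosing which variants to include) that are not shown here.

```python
from typing import Dict


def polygenic_risk_score(dosages: Dict[str, float],
                         effect_sizes: Dict[str, float]) -> float:
    """Weighted sum of effect-allele dosages over variants present in both inputs."""
    shared = dosages.keys() & effect_sizes.keys()
    return sum(dosages[v] * effect_sizes[v] for v in shared)


# Hypothetical per-variant effect sizes (e.g. log odds ratios from GWAS summary statistics).
effects = {"snpA": 0.10, "snpB": -0.05, "snpC": 0.02}

# Hypothetical individual: number of effect alleles carried at each variant (0, 1 or 2).
individual = {"snpA": 2, "snpB": 1, "snpC": 0}

score = polygenic_risk_score(individual, effects)
print(round(score, 3))
```

Each variant contributes only a small amount, and risk-increasing and risk-decreasing alleles partly cancel, which is why the score is computed over many variants at once.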

Testing the function of a regulatory variant ⁴

Suppose that we have a cis-regulatory causal variant and we want to test its function, that is, whether it e.g. alters gene expression, affects a binding site or disrupts protein structure.

We could use in silico analysis to determine whether a particular variant is predicted to disrupt a TF binding motif, but not all motifs are known and many CVs do not reside in known TF binding motifs.
We could test the functionality of the variant by assaying the risk and protective alleles one by one in cell culture. But this must be done under the correct experimental conditions (namely, cell type and signals). This approach has now been extended to Massively Parallel Reporter Assays (MPRAs):
- Perturb individual SNPs.
- Reporter assay: A promoter and a reporter gene (e.g. encoding luciferase) are cloned into a vector to measure the promoter activity. More luciferase = stronger promoter, less luciferase = weaker promoter.
- MPRA was developed in 2009 to help find meaning from mutations. It can be used to systematically test each variant in an associated genomic region to find the one that is likely causal (aka fine-mapping) - “directly test the molecular effects of a large number of variants using highly quantitative assays”.
- Basics: Put a DNA barcode in the 3’ UTR of the reporter gene. The mRNA that is expressed can then be sequenced and the barcodes counted, and each barcode count can be attributed to the specific sequence it tags. I.e. see if a specific DNA sequence (or even variant) leads to increased/decreased gene expression.
Genome editing, e.g. CRISPR, has been used to determine the allele-specific functions of distal CREs. For example, use genome editing tools to delete the causal variant and observe the effect on the phenotype (e.g. does it increase expression of a gene?).
Note that most people only look at the downstream effect of a variant (e.g. which gene it affects), but it is also important to look upstream of the variant to see what acts on it (which may help determine function). E.g. HNF1B encodes a protein that binds to specific regions of the DNA to regulate the activity of other genes, so it’s possible that HNF1B → CV → target gene.
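The barcode-counting idea behind MPRA can be sketched as follows: pool the RNA and DNA (plasmid) counts over the barcodes tagging each allele, take log2(RNA/DNA) as that allele's activity, and compare alleles. All barcodes and counts here are hypothetical, and the sketch skips the library-size normalisation and replicate-based statistics that a real analysis would need.

```python
import math
from collections import defaultdict


def allele_activity(barcode_to_allele, dna_counts, rna_counts, pseudocount=1.0):
    """Return log2(RNA/DNA) per allele, pooling counts over that allele's barcodes."""
    dna = defaultdict(float)
    rna = defaultdict(float)
    for barcode, allele in barcode_to_allele.items():
        dna[allele] += dna_counts.get(barcode, 0)
        rna[allele] += rna_counts.get(barcode, 0)
    # The pseudocount avoids division by zero for barcodes that were never observed.
    return {a: math.log2((rna[a] + pseudocount) / (dna[a] + pseudocount)) for a in dna}


# Hypothetical design: two barcodes tag the reference allele, two tag the alternative allele.
barcodes = {"AAC": "ref", "GTT": "ref", "CCA": "alt", "TGG": "alt"}
dna = {"AAC": 99, "GTT": 100, "CCA": 100, "TGG": 99}    # plasmid library counts
rna = {"AAC": 199, "GTT": 200, "CCA": 400, "TGG": 399}  # expressed barcode counts

activity = allele_activity(barcodes, dna, rna)
allelic_effect = activity["alt"] - activity["ref"]  # log2 fold change, alt vs ref
print(activity, allelic_effect)
```

A positive allelic effect would suggest that the alternative allele drives more reporter expression than the reference allele in the tested cell type.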

Linking to target genes

This is particularly difficult because most disease-associated variants reside in non-coding regions and do not necessarily affect the closest gene in linear distance. This motivates the use of chromosome conformation capture techniques to investigate the 3D folding of the genome (which is cell-specific).
Transcription of a gene is initiated at the promoter, but enhancers and other distal regulatory elements may also affect gene transcription by physically interacting with their target promoters (or even with each other).
To investigate these physical contacts between promoters and enhancers (and other distal regulatory elements) we could e.g. use the CV-containing CRE as the bait and perform Hi-C (in practice a capture-based variant) to identify interactions. Conversely, we could use the gene promoter as the bait and ask whether any of the interacting regions contain CVs.
Example: The FTO locus is associated with obesity (BMI) and there are many significant variants that lie within the gene. However, it has been found that none of these change FTO gene function or expression. A study ⁵ used chromatin conformation methods to find that this locus, which is enriched for enhancer-associated histone marks, interacts with the IRX3 promoter (which is located far away). They confirmed that IRX3, not FTO, is the likely target gene of the FTO enhancer region. The CV in this locus has also now been identified.

Three main lines of evidence for linking variants to their target genes:

Physical contact (e.g. Hi-C).
Functional (activity correlation across the genome, e.g. using chromatin summary tracks).
Genetic (eQTL analysis - link genetic variants to the expression of particular genes).
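The genetic line of evidence can be illustrated with a minimal eQTL sketch: regress a gene's expression on genotype dosage (0, 1 or 2 copies of the alternative allele) and read off the slope. The data are made up, and a real eQTL analysis would also include covariates (e.g. ancestry, batch) and a significance test, neither of which is shown.

```python
def eqtl_fit(genotypes, expression):
    """Ordinary least-squares fit of expression on genotype dosage.

    Returns (slope, intercept); the slope is the estimated change in
    expression per additional copy of the alternative allele.
    """
    n = len(genotypes)
    if n != len(expression) or n < 2:
        raise ValueError("need matching genotype and expression vectors of length >= 2")
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    cov = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    var = sum((g - mean_g) ** 2 for g in genotypes)
    if var == 0:
        raise ValueError("genotype has no variation in this sample")
    slope = cov / var
    return slope, mean_e - slope * mean_g


# Hypothetical data: six individuals, dosage of the alternative allele and
# normalised expression of one candidate target gene.
dosage = [0, 0, 1, 1, 2, 2]
expr = [1.0, 1.2, 1.6, 1.8, 2.1, 2.3]

slope, intercept = eqtl_fit(dosage, expr)
print(slope, intercept)
```

Repeating the fit for each nearby gene and comparing the slopes is one simple way to ask which gene's expression tracks the variant.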

Determining molecular function ⁶

Suppose that we have found a CV and know its function (e.g. it is an eQTL that affects expression levels), and that we also know the gene that is affected (e.g. it is an eQTL for gene X). We still don’t know the molecular function of the variant, i.e. how the variant affects expression levels (e.g. is it by changing the ability of a trans-acting factor to bind, and if so, which trans-acting factor?).
These effects may be direct (e.g. directly affecting binding of TFs) or indirect (e.g. affecting DNA methylation).
The effect of a variant on TF binding can be confirmed by ChIP-qPCR.
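Before any experimental confirmation, a direct effect on TF binding is often predicted in silico by scoring both alleles against a position weight matrix (PWM). The sketch below uses a made-up four-base motif and a uniform background; the motif, the sequences and the function names are all hypothetical, and real motifs would come from a motif database.

```python
import math

BASES = "ACGT"


def to_log_odds(prob_columns, background=0.25):
    """Convert per-position base probabilities into log2 odds against a uniform background."""
    return [{b: math.log2(col[b] / background) for b in BASES} for col in prob_columns]


def pwm_score(sequence, log_odds):
    """Sum the log-odds of the observed base at each motif position."""
    if len(sequence) != len(log_odds):
        raise ValueError("sequence and motif must have the same length")
    return sum(col[base] for base, col in zip(sequence, log_odds))


# Hypothetical 4-bp motif with consensus TGAC (each column sums to 1).
motif = [
    {"A": 0.05, "C": 0.05, "G": 0.10, "T": 0.80},
    {"A": 0.10, "C": 0.05, "G": 0.80, "T": 0.05},
    {"A": 0.80, "C": 0.05, "G": 0.10, "T": 0.05},
    {"A": 0.05, "C": 0.80, "G": 0.05, "T": 0.10},
]
log_odds = to_log_odds(motif)

ref_score = pwm_score("TGAC", log_odds)  # reference allele A at the third position
alt_score = pwm_score("TGCC", log_odds)  # alternative allele C at the third position
delta = alt_score - ref_score            # negative: alt allele weakens the motif match
print(ref_score, alt_score, delta)
```

A strongly negative delta only flags a candidate; as noted earlier, many CVs do not fall in known motifs, so the absence of a predicted disruption is not evidence of no effect.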

Summary

We now have vast amounts of GWAS data linking genomic loci to complex diseases. The focus should now shift to finding meaning in these associations. We don’t just want to find disease-associated genetic variation; we also need to consider the intermediate steps in this process. For example:

Identifying the specific causal variants
Identifying the relevant tissues/cell types
Identifying the molecular functions of the causal variants (e.g. does the variant act through a trans-acting TF, and does it change expression levels?)
Identifying intermediate phenotypes
Identifying the target genes
Understanding how changes in the function or regulation of the causal genes lead to altered disease risk

Note that once we move out of the genetic space, effects can be bi-directional. E.g. the disease could be affecting gene expression elsewhere in the genome, rather than the genetic basis of the disease driving that change in gene expression, or the two could simply be correlated.

“We thus suggest that an increased emphasis on the downstream functional dissection of already-identified GWAS loci, rather than a search for ever more GWAS loci, might be most likely to benefit knowledge of pathophysiology” ⁷.

Methods

E.g. GRAM

A generalized model to predict the molecular effect of a non-coding variant in a cell-type specific manner.

GRAM is a generalised model to predict the expression-modulating effect of a non-coding variant in a cell-specific manner. I.e. estimate the expression consequence of a non-coding variant.

This new method has been applied to fine-mapping the causal variants in 5 LD blocks that are associated with prostate cancer. It requires gene expression and SELEX DeepBind scores (https://www.nature.com/articles/nmeth.3559). 561 eQTL SNPs from the 5 LD blocks were identified and “GRAMMAR” was used to get the prediction score for each allele in each patient.

E.g. FUMA

Functionally annotates GWAS findings and prioritises the most likely causal SNPs and genes using information from 18 biological data repositories and tools.

SNP2GENE process:

The input is GWAS summary statistics. From these, 1000 Genomes LD structures are used to find independent significant SNP associations (\(P<5\times10^{-8}\), independent of each other at \(r^2<0.6\)). For each of these independent significant SNPs, all other SNPs with \(r^2\geq0.6\) are included in the list of “candidate SNPs”.
The candidate SNPs are then annotated for functional consequences on gene functions (using ANNOVAR), deleteriousness (CADD score), potential regulatory function, effects on gene expression and 3D structure (Hi-C data).
Functionally annotated SNPs are mapped to genes by (i) physical position on the genome (positional mapping), (ii) eQTL associations and (iii) 3D chromatin interactions. At the end of this step, the user has a set of prioritised genes.
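The first SNP2GENE step can be sketched as a simple greedy selection. This is an illustration of the thresholds described above, not FUMA's actual code: the function name, the pairwise LD dictionary and all SNP names and values are hypothetical, and unlisted SNP pairs are assumed to have \(r^2=0\).

```python
def select_candidates(p_values, r2, p_threshold=5e-8, r2_threshold=0.6):
    """Pick independent significant SNPs, then collect candidate SNPs in LD with each."""

    def ld(a, b):
        # Pairwise r^2 lookup in either order; pairs not listed are treated as r^2 = 0.
        return r2.get((a, b), r2.get((b, a), 0.0))

    # Genome-wide significant SNPs, most significant first.
    significant = sorted((s for s, p in p_values.items() if p < p_threshold),
                         key=p_values.get)

    # Keep a SNP only if it is not in LD with an already selected SNP.
    independent = []
    for snp in significant:
        if all(ld(snp, lead) < r2_threshold for lead in independent):
            independent.append(snp)

    # Candidate SNPs: any other SNP with r^2 >= threshold to an independent SNP.
    candidates = {
        lead: sorted(s for s in p_values if s != lead and ld(s, lead) >= r2_threshold)
        for lead in independent
    }
    return independent, candidates


# Hypothetical summary statistics and pairwise LD values.
p_values = {"snpA": 1e-12, "snpB": 1e-9, "snpC": 3e-8, "snpD": 1e-4, "snpE": 0.2}
r2 = {("snpA", "snpB"): 0.9, ("snpA", "snpD"): 0.7,
      ("snpA", "snpC"): 0.1, ("snpC", "snpE"): 0.3}

independent, candidates = select_candidates(p_values, r2)
print(independent, candidates)
```

Note that candidate SNPs need not be genome-wide significant themselves (snpD here), since they are included on the basis of LD with an independent significant SNP.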

GENE2FUNC process:

Biological information for each prioritised gene is provided. E.g. Tissue specific expression patterns based on GTEx v6 RNA-seq data for each gene are visualized as an interactive heatmap.

References

Eric Lander’s talk

Broad institute talk

Anna Hutchinson

23/08/2019