Identifying a disease-associated genomic locus is only a small step towards understanding the biology of disease. Fine-mapping may then be used to identify the specific causal variant within the associated locus, after which the target gene and mechanisms (which are often tissue-specific) need to be identified. This is difficult partly because >90% of disease-associated variants are located in non-protein-coding regions of the genome, and many are far away from the nearest known gene 12.
Suppose that we have a cis-regulatory causal variant (CV) and we want to test its function, that is, whether it e.g. alters gene expression, affects a binding site or disrupts the protein structure.
Could use in silico analysis to determine whether a particular variant is predicted to disrupt a TF binding motif, but not all motifs are known and many CVs do not reside in known TF binding motifs.
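A minimal sketch of this kind of in silico check: score the reference and alternative alleles against a position frequency matrix and compare the best log-odds scores. The matrix values, the uniform background and the function names are all hypothetical, made up for illustration; a real analysis would take motifs from a curated database and also scan the reverse strand.

```python
import math

# Hypothetical position frequency matrix for a 4-bp motif (one dict per position).
# The numbers are invented for illustration, not taken from any motif database.
PFM = [
    {"A": 0.80, "C": 0.10, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.10, "C": 0.70, "G": 0.10, "T": 0.10},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
]
BACKGROUND = 0.25  # assumption: uniform base composition


def window_score(window):
    """Log-odds score of one motif-length window against the background."""
    return sum(math.log2(PFM[i][base] / BACKGROUND) for i, base in enumerate(window))


def best_score(seq):
    """Best log-odds score over all windows of seq (forward strand only, A/C/G/T only)."""
    k = len(PFM)
    return max(window_score(seq[i:i + k]) for i in range(len(seq) - k + 1))


def allele_effect(flank5, ref, alt, flank3):
    """Return (ref score, alt score, alt minus ref) for a single-base variant."""
    ref_score = best_score(flank5 + ref + flank3)
    alt_score = best_score(flank5 + alt + flank3)
    return ref_score, alt_score, alt_score - ref_score


# A G>T change at the second motif position lowers the best score.
ref_score, alt_score, delta = allele_effect("TTA", "G", "T", "CTAA")
print(f"ref={ref_score:.2f} alt={alt_score:.2f} delta={delta:.2f}")
```

A strongly negative delta suggests the alternative allele weakens the predicted site, but as noted above, a variant with no score change may still be functional through a motif that is not yet known.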
Could test the functionality of the variant by using both the risk and protective alleles one-by-one in cell culture. But this must be done in the correct experimental conditions (namely, cell type and signals). This has now been extended to Massively Parallel Reporter Assays (MPRAs):
Perturb individual SNPs.
Reporter assay: Promoter and reporter gene (e.g. encoding luciferase) cloned into a vector to measure the promoter activity. More luciferase = stronger promoter, less luciferase = weaker promoter.
MPRA was developed in 2009 to help find meaning from mutations. It can be used to systematically test each variant in an associated genomic region to find the one that is likely causal (aka fine-mapping) - “directly test the molecular effects of a large number of variants using highly quantitative assays”.
Basics: Put a DNA barcode in the 3’ UTR of the reporter gene. We can then sequence the mRNA that is transcribed from the reporter and count the barcodes, which can then be attributed to the specific sequences they tag. I.e. see if a specific DNA sequence (or even variant) leads to increased/decreased gene expression.
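The barcode counting idea can be sketched as follows: sum barcode counts per allele in the RNA and in the plasmid DNA, and use log2(RNA/DNA) as the activity. The barcode-to-allele map, the `rs_demo` identifier and the pseudocount are assumptions for illustration; real MPRA pipelines also normalise for sequencing depth and model replicate variation.

```python
import math
from collections import defaultdict

# Hypothetical barcode-to-allele map, as would be learned by sequencing the plasmid library.
BARCODE_TO_ALLELE = {
    "AAAC": "rs_demo:ref",
    "AAGT": "rs_demo:ref",
    "CCTA": "rs_demo:alt",
    "CGGA": "rs_demo:alt",
}


def allele_activity(dna_counts, rna_counts, pseudocount=1.0):
    """Return log2(RNA/DNA) per allele, summing counts over that allele's barcodes."""
    dna = defaultdict(float)
    rna = defaultdict(float)
    for barcode, allele in BARCODE_TO_ALLELE.items():
        dna[allele] += dna_counts.get(barcode, 0)
        rna[allele] += rna_counts.get(barcode, 0)
    # The pseudocount avoids division by zero for barcodes that were never observed.
    return {a: math.log2((rna[a] + pseudocount) / (dna[a] + pseudocount)) for a in dna}


def allelic_skew(activity, variant):
    """Difference in activity between the alt and ref alleles of one variant."""
    return activity[f"{variant}:alt"] - activity[f"{variant}:ref"]


dna = {"AAAC": 49, "AAGT": 50, "CCTA": 59, "CGGA": 40}
rna = {"AAAC": 100, "AAGT": 99, "CCTA": 199, "CGGA": 200}
activity = allele_activity(dna, rna)
print(activity, allelic_skew(activity, "rs_demo"))
```

Dividing by the DNA counts matters because barcodes are not equally abundant in the library; without it, a well-represented barcode would look like a strong regulatory sequence.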
Genome editing, e.g. CRISPR, has been used to determine the allele-specific functions of distal cis-regulatory elements (CREs). For example, use genome editing tools to delete the causal variant and observe the effect on the phenotype (e.g. does it increase expression of a gene?).
Note that most people only look at the downstream effect of a variant (e.g. what gene it is affecting), but it is also important to look upstream of the variant to see what is affecting it (which may help determine function). E.g. HNF1B makes a protein that binds to specific regions of the DNA to indirectly regulate the activity of other genes, so it’s possible that HNF1B –> CV –> target gene.
Identifying the target gene is particularly difficult because most disease-associated variants reside in non-coding regions and do not affect the closest gene in linear distance. This motivates the use of chromosome conformation capture techniques to investigate the 3D folding of the genome (which is cell-specific).
The transcription of a gene is initiated at the promoter, but enhancers and other distal regulatory elements may also affect gene transcription by physically interacting with their target promoters (or even with each other).
To investigate these physical contacts between promoters and enhancers (and other distal regulatory elements) we could e.g. use the CV-containing CRE as the bait and perform Capture Hi-C to identify interactions (or conversely, use the gene promoter as the bait and use Capture Hi-C to identify interactions: do any of these regions contain CVs?).
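Once interactions have been called, the question "do any of these regions contain CVs?" reduces to an interval overlap. A minimal sketch, assuming interactions are already available as bait/other-end records with a score; the record layout, the names and the score cut-off of 5.0 are hypothetical and would depend on the interaction caller used.

```python
from typing import NamedTuple


class Interaction(NamedTuple):
    bait: str      # name of the baited fragment, e.g. a gene promoter
    chrom: str     # chromosome of the other-end fragment
    start: int     # other-end fragment start (0-based, half-open)
    end: int       # other-end fragment end
    score: float   # interaction score from whichever caller was used


def variants_in_contacts(interactions, variants, min_score=5.0):
    """Map each variant ID to the baits whose interacting fragments contain it."""
    hits = {}
    for vid, (chrom, pos) in variants.items():
        for it in interactions:
            if it.score >= min_score and it.chrom == chrom and it.start <= pos < it.end:
                hits.setdefault(vid, []).append(it.bait)
    return hits


interactions = [
    Interaction("geneA_promoter", "chr1", 1000, 2000, 8.2),
    Interaction("geneA_promoter", "chr1", 5000, 6000, 2.1),
    Interaction("geneB_promoter", "chr1", 1500, 2500, 6.0),
]
variants = {"cv1": ("chr1", 1600), "cv2": ("chr1", 5500)}
print(variants_in_contacts(interactions, variants))
```

Because 3D folding is cell-specific, the interaction list should come from the disease-relevant cell type; an overlap found in an unrelated cell type is much weaker evidence.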
Example: The FTO locus is associated with obesity (BMI) and there are many significant variants that lie within the gene. However, none of these have been found to change FTO gene function or expression. A study 5 used chromatin conformation methods to find that this locus interacts with the Irx3 promoter (which is located far away) and is enriched for enhancer-associated histone marks. They confirmed that IRX3 is the likely target gene of the FTO enhancer region, not the FTO gene. The CV in this locus has also now been identified.
3 main lines of evidence for linking variants to their target genes:
So, say that we have found a CV and we know its function (e.g. it is an eQTL that affects expression levels). Suppose that we also know the gene that is affected (e.g. it is an eQTL that affects expression levels of gene X). We still don’t know the molecular function of the variant, i.e. how the variant affects expression levels (e.g. is it through affecting the ability of a trans-acting factor to bind, and if so, which trans-acting factor?).
These effects may be direct (e.g. directly affecting binding of TFs) or indirect (affecting DNA methylation).
The effect of a variant on TF binding can be confirmed by ChIP-qPCR.
We now have vast amounts of GWAS data linking genomic loci to complex diseases. Focus should now shift to finding meaning from these associations. We don’t just want to find disease-associated genetic variation; we also need to consider the intermediaries in this process. For example:
Identifying the specific causal variants
Identifying the relevant tissues/cell types
Identifying the molecular functions of the causal variants (e.g. acting through a trans-acting TF, does it change expression levels?)
Identifying intermediate phenotypes
Identifying the target genes
Understanding how changes in the function or regulation of the causal genes lead to altered disease risk
Note that once we move out of the genetic space, effects can be bi-directional: e.g. the disease could be altering gene expression elsewhere in the genome, rather than the genetic basis of the disease driving that change in expression, or the two may simply be correlated.
“We thus suggest that an increased emphasis on the downstream functional dissection of already-identified GWAS loci, rather than a search for ever more GWAS loci, might be most likely to benefit knowledge of pathophysiology” 7.
A generalized model to predict the molecular effect of a non-coding variant in a cell-type specific manner.
GRAM is a generalised model to predict the expression-modulating effect of a non-coding variant in a cell-specific manner. I.e. estimate the expression consequence of a non-coding variant.
This new method has been applied to fine-mapping the causal variants in 5 LD blocks that are associated with prostate cancer. It requires gene expression and SELEX DeepBind scores (https://www.nature.com/articles/nmeth.3559). 561 eQTL SNPs from the 5 LD blocks were identified and “GRAMMAR” was used to get the prediction score for each allele in each patient.
Functionally annotates GWAS findings and prioritises the most likely causal SNPs and genes using information from 18 biological data repositories and tools.
SNP2GENE process:
Input is GWAS summary statistics. From these, 1000 Genomes LD structures are used to find independent significant SNP associations (\(P<5\times10^{-8}\) and \(r^2<0.6\)). For each of these independent significant SNPs, all other SNPs with \(r^2\geq0.6\) are included in the list of “candidate SNPs”.
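The selection described above can be sketched as a greedy pass over the significant SNPs in order of P-value. This is a simplified illustration with hypothetical function and variable names and a toy pairwise \(r^2\) lookup, not the tool's actual implementation; a real run would take LD from a reference panel such as 1000 Genomes.

```python
def select_snps(pvalues, r2, p_threshold=5e-8, r2_threshold=0.6):
    """Pick independent significant SNPs, then collect candidate SNPs in LD with each.

    pvalues: dict of SNP ID to GWAS P-value.
    r2: dict of (SNP, SNP) pairs to r^2; missing pairs are assumed to be unlinked.
    """
    def ld(a, b):
        return r2.get((a, b), r2.get((b, a), 0.0))

    # Walk genome-wide significant SNPs from the smallest P-value upwards.
    significant = sorted((s for s, p in pvalues.items() if p < p_threshold), key=pvalues.get)
    independent = []
    for snp in significant:
        # Keep the SNP only if it is not in LD with any SNP already kept.
        if all(ld(snp, kept) < r2_threshold for kept in independent):
            independent.append(snp)

    # Candidate SNPs: every other SNP with r^2 >= threshold to an independent SNP.
    candidates = {
        lead: sorted(s for s in pvalues if s != lead and ld(s, lead) >= r2_threshold)
        for lead in independent
    }
    return independent, candidates


pvalues = {"rs1": 1e-10, "rs2": 1e-9, "rs3": 2e-8, "rs4": 1e-3, "rs5": 1e-4}
r2 = {("rs1", "rs2"): 0.9, ("rs1", "rs3"): 0.2, ("rs2", "rs3"): 0.1,
      ("rs1", "rs4"): 0.7, ("rs3", "rs5"): 0.3}
print(select_snps(pvalues, r2))
```

Note that candidate SNPs do not need to be genome-wide significant themselves: here the toy SNP `rs4` is included only because of its LD with `rs1`.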
The candidate SNPs are then annotated for functional consequences on gene functions (using ANNOVAR), deleteriousness score (CADD score), potential regulatory function, effects on gene expression and 3D structure (Hi-C data).
Functionally annotated SNPs are mapped to genes by (i) physical position on the genome (positional mapping), (ii) eQTL associations and (iii) 3D chromatin interactions. At the end of this step, the user has a set of prioritised genes.
GENE2FUNC process:
Biological information for each prioritised gene is provided. E.g. tissue-specific expression patterns based on GTEx v6 RNA-seq data for each gene are visualised as an interactive heatmap.