GARFIELD


Aim: Identify disease-relevant cell type specific genomic features. E.g. DHS enriched in certain blood cell types in Crohn’s disease.

Input: GWAS P values, MAFs, distance to nearest TSS, list of variants with \(R^2\geq 0.01\), list of variants with \(R^2\geq 0.8\) (both within 1 Mb window), binary annotation string for each variant (length \(k\) for \(k\) different cell type specific annotations). Or will use UK10K data to estimate some of these (only for European samples).

Output: OR enrichment of annotations in SNPs exceeding a specified P value threshold (and a final model containing multiple annotations).


Garfield uses GWAS results and functional annotation data to finds features that are relevant to a disease or trait of interest. Some of GARFIELD’s advantages are:

  1. It accounts for various confounding (LD, local gene density)
  2. Uses a P value threshold method, meaning that it can be utilised in lower powered studies where variants may not reach genome wide significance.
  3. Their greedy LD pruning and tagging annotation steps supposedly retain a larger proportion of potentially causal variants (or tags of such SNPs) compared with p-value-independent pruning methods.

Method

1. LD pruning step

Calculate \(r^2\) with the lead-SNP for all variants in each 1-Mb window. Throw away all SNPs with \(r^2>0.1\) to obtain an independent set of SNPs. In their analysis, this step retained an average of 2.2% of genome-wide variants in the independent SNP set.

2. LD tagging annotation

Go through each annotation for each variant in the independent SNP set and allocate a 1 if that variant (or a variant with \(r^2\geq 0.8\) in 500 kb of that variant) overlaps the annotation (the latter to help with lower powered studies).

3. Model to quantify enrichment using feature matching

For each annotation, the following (GLM) model is fitted for various threshold values (\(T\)),

\[\begin{equation} logit E(y) = 1\alpha + X_{TSS}\beta_{TSS} + X_{TAGS}\beta_{TAGS} + X_{Aj}\beta_{Aj} \end{equation}\]

where,

  • \(y_i\) is 1 if SNP i has P value \(<T\) and 0 otherwise.
  • \(X_{Aj}\) binary indicator of annotation \(j\) for each SNP \(i\).
  • \(X_{TSS}\) is a categorical covariant denoting which quantile bin of distance to nearest TSS a variant falls (typically use 5 quantiles).
  • \(X_{TAGS}\) is a categorical covariant denoting which quantile bin of the number of variants with \(r^2\geq0.8\) (number of LD proxies) a variant falls (typically use 15 quantiles).

I.e. the model quantifies which annotations harbour more GWAS variants at a given threshold, \(T\). The odds ratio statistic is calculated via: \(\beta_{Aj}=log(OR_{Aj}).\)

4. Model selection

The above model is for one annotation only, but we want to find a model which contains multiple annotations (i.e. to identify conditionally independent sets of regulatory annotations underpinning the enrichment signals).

  1. Fit the above model for each annotation in turn and order these based on statistical significance.

  2. Iteratively try to add an annotation to the original model if this significantly improves the model fit given all annotations in the model (analysis of deviance using chi-square test).

This gives the final model:

\[\begin{equation} logit E(y) = 1\alpha + X_{TSS}\beta_{TSS} + X_{TAGS}\beta_{TAGS} + X_{A1}\beta_{A1} + ... + X_{Aj}\beta_{Aj} \end{equation}\]

5. Account for multiple testing

Multiple testing may be a problem because typically lots of annotations will be tested. Bonferroni and BH cannot be applied off the bat because these assume independence between tests (and here we may have e.g. biological replicates). Instead, they first estimate the effective number of independent tests performed (as in Galway).

  1. Find the correlation matrix of the binary annotation matrix (where rows are variants in the independent set and columns are annotations).

  2. Calculate the eigenvalues of this matrix, \(\lambda_i\).

  3. Define the effective number of features as \(M_{eff}=(\sum_{i=1}^M\sqrt{\lambda_i})^2/(\sum_{i=1}^M \lambda_i)\) where \(M\) is the total number of annotations.

  4. Apply Bonferroni correction on \(M_{eff}\) at the 95% significance level.

This procedure takes into account the fact that more closely related cell types and tissues will be more similar to each other in their annotations than those not closely related.


Other info

  • They compared this method with LDSC, fgwas, GoShifter, GREGOR and GPA. They estimated FPR (the proportion of cell types showing significant enrichment for a given trait under the null) to quantify methods by simulating fake annotations (which should therefore not show enrichment).

  • Their feature matching method controls for biases in enrichment analysis by significantly decreasing the number of observed enrichments.

  • Already known that LD and gene density can confound enrichment analysis Trynka et al., but they found that LD is the largest confounder.

  • When comparing with other methods; “GREGOR yielded the highest number of enrichment (median, 24, maximum, 398), followed by GARFIELD (median, 10, maximum, 364). fgwas and LDSC yielded intermediate levels of enrichment (median, 5, maximum, 327; median, 5, maximum, 144, respectively), whereas GoShifter was very conservative (median, 0, maximum, 5).” and that “fgwas, LDSC and GoShifter showed much lower between-method concordance rates”.

  • They generate very nice plots (below) which shows enrichment of DNAse hotspots in various cell types for Crohn’s disease.