GARFIELD


Aim: Identify disease-relevant cell type specific genomic features.

Input: GWAS P values, MAFs, distance to nearest TSS, list of variants with \(R^2\geq 0.01\), list of variants with \(R^2\geq 0.8\) (both within 1 Mb window), binary annotation string for each variant (length \(k\) for \(k\) different cell type specific annotations).

Output: OR enrichment of annotations in SNPs exceeding a specified P value threshold (and a final model containing multiple annotations).


Method

1. LD pruning step

Calculate \(r^2\) for all variants in each 1-Mb window and denote \(r^2<0.01\) as independent. Create an independent set of SNPs and also retain the next most significant variant independent of all other variants in the independence set.

2. LD tagging annotation

Go through each annotation for each variant in the independent SNP set and allocate a 1 if that variant (or a variant with \(r^2\geq 0.8\) in 500 kb of that variant) overlaps the annotation (the latter to help with lower powered studies).

3. Model to quantify enrichment

For each annotation, the following model is fitted for various threshold values (\(T\)),

\[\begin{equation} logit E(y) = 1\alpha + X_{TSS}\beta_{TSS} + X_{TAGS}\beta_{TAGS} + X_{Aj}\beta_{Aj} \end{equation}\]

where,

  • \(y_i\) is 1 if SNP i has P value \(<T\) and 0 otherwise.
  • \(X_{Aj}\) binary indicator of annotation \(j\) for each SNP \(i\).
  • \(X_{TSS}\) is a categorical covariant denoting which quantile bin of distance to nearest TSS a variant falls (typically use 5 quantiles).
  • \(X_{TAGS}\) is a categorical covariant denoting which quantile bin of the number of variants with \(r^2\geq0.8\) (number of LD proxies) a variant falls (typically use 15 quantiles).

I.e. the model quantifies which annotations harbour more GWAS variants at a given threshold, \(T\).

4. Model selection

The above model is for one annotation only, but we want to find a model which contains multiple annotations (i.e. to indifify conditionally independent sets of regulatory annotations underpinning the enrichment signals).

Fit the above model for each annotation in turn and order these based on statistical significance.

Iteratively try to add an annotation to the original model if this significantly improves the model fit given all annotations in the model.

5. Account for multiple testing

Multiple testing may be a problem because typically lots of annotations will be tested. Bonferroni and BH cannot be applied off the bat because these assume independence between tests (and here we may have e.g. biological replicates). Instead, they first estimate the effective number of independent tests performed (as in Galway).

  1. Find the correlation matrix of the binary annotation matrix (where rows are variants in the independent set and columns are annotations).

  2. Calculate the eigenvalues of this matrix, \(\lambda_i\).

  3. Define the effective number of features as \(M_{eff}=(\sum_{i=1}^M\sqrt{\lambda_i})^2/(\sum_{i=1}^M \lambda_i)\) where \(M\) is the total number of annotations.

  4. Apply Bonferroni correction on \(M_{eff}\) at the 95% significance level.

This procedure takes into account the fact that more closely related cell types and tissues will be more similar to eachother in their annotations than those not closely related.


Other info

  • The compared this method with LDSC, fgwas, GoShifter, GREGOR and GPA. They used estimtated FPR (the proportion of cell types showing significant enrichment for a given trait) to quantify methods by simulating fake annotations (which should therefore not show enrichment).

  • They generate very nice plots