Existing methods (fgwas)

fgwas was developed by Joe Pickrell as a means of incorporating functional annotation data with GWAS data. The method ultimately finds the set of annotations, both at the region and per-SNP level (and their effect size estimates), that causally influence a trait by specifying two key prior probabilities:

\(\Pi_k\): The prior probability that region \(k\) contains a causal SNP associated with the trait.
\(\pi_{ik}\): The prior probability that SNP \(i\) is the causal SNP, given that region \(k\) contains a causal SNP.

These prior probabilities are estimated on the basis of the patterns of enrichment across the whole genome.

These priors can also be combined with the Bayes factors and posterior probabilities derived by Maller et al., to reweight SNP association statistics. For example, if SNP X and SNP Y have equal \(p\)-values but SNP X lies in a coding region of the genome whilst SNP Y lies in a region of repressed chromatin, then this method could be use to identify SNP X as more likely to be causal for the trait than SNP Y.

The full method can be broken down into 5 steps:

Define the hierarchical model
Write down the likelihood
Define a penalised log-likelihood function
Selection of the “best” model
Conditional analysis to find interchangeable annotations

1. Define the hierarchical model

“The goal is to build a model to identify the shared characteristics of SNPs that causally influence a trait”.

The genome is partitioned into contiguous blocks of size \(K\) SNPs (typically \(K=5,000\)). Consider \(M\) SNPs have been genotyped in \(N\) individuals in a GWAS, then there are \(M/K\) blocks. The block size is an optional parameter in the software.

Let \(\Pi_k\) be the prior probability that block \(k\) contains a causal SNP associated with the trait, the probability of the data is

\[\begin{equation} \tag{1} P(\mathbf{y})=\Pi_{k=1}^{M/K} (1-\Pi_k)P_k^0 + \Pi_k P_k^1 \end{equation}\]

where \(P_k^0\) is the probability of the data in block \(k\) under the null model of no SNPs being associated witth the trait in the block and \(P_k^1\) is the probability of the data in block \(k\) under the model where there is one SNP associated with the trait in the block. This set up means that there are two options for each block in the genome; (i) it contains no SNPs associated with the trait (ii) it contains one SNP associated with the trait. Also notice that the blocks are defined by the number of SNPs, not the number of bases, which directly implies the assumption that the probability that a genomic region contains a SNP associated with a given phenotype depends on the SNP density rather than physical size (a short region with many SNPs is a priori as likely to have an association as a long region with few SNPs).

Further,

\[\begin{equation} \tag{2} P_k^1=\sum_{i \in S_k} \pi_{ik} P^1_{ik} \end{equation}\]

where \(S_k\) is the set if SNPs in block \(k\), \(\pi_{ik}\) is the prior probability conditional on there being an association in block \(k\) that SNP \(i\) is the causal SNP in the region and \(P_{ik}^1\) is the probaility of the data under the model where this SNP is associated with the trait.

The regional prior is modelled as,

\[\begin{equation} \tag{3} ln\left(\dfrac{\Pi_k}{1-\Pi_k}\right)=\kappa + \sum_{l=1}^{L_1} \gamma_l I_{kl}, \end{equation}\]

where \(L_1\) is the number of region-level annotations in the model, \(\gamma_l\) is the effect associated with annotation \(l\) (e.g. the effect of high gene density) and \(I_{kl}\) takes the value 1 if region \(k\) is annotated with annotation \(l\) and 0 otherwise.

The SNP prior is modelled as,

\[\begin{equation} \tag{4} \pi_{ik}=\dfrac{e^{x_i}}{\sum_{j \in S_k} e^{x_j}} \end{equation}\]

where logistic regression is used to relate \(\pi_{ik}\) to the annotation indicators,

\[\begin{equation} \tag{5} x_i=\sum_{l=1}^{L_2} \lambda_l I_{il}, \end{equation}\]

where \(L_2\) is the number of SNP-level annotations in the model, \(\lambda_l\) is the effect of SNP annotation \(l\) (e.g. the effect of a nonsynonomous SNP) and \(I_{il}\) takes the vaue 1 if SNP \(i\) falls in annotation \(l\) and 0 otherwise.

2. Write down the likelihood

The likelihood of the data is therefore

\[\begin{equation} \tag{6} L(\textbf{y}|\theta)= \Pi_{k=1}^{M/K} (1-\Pi_k)P_k^0 + \Pi_k \sum_{i=1}^K \pi_{ij} P_{ik}^1 \end{equation}\]

and we can use \(BF_i=\frac{P_1}{P_0}\) to write,

\[\begin{equation} \tag{7} L(\textbf{y}|\theta)= \Pi_{k=1}^{M/K} P_k^0\left[(1-\Pi_k) + \Pi_k \sum_{i=1}^K \pi_{ij} BF_i \right] \end{equation}\]

3. Define a penalised log-likelihood function

Standard methods such as MLE can be used to estimate the parameters in the model, but this may lead to over-fitting. Instead, ridge regression is used to shrink the effect esimates of insignificant predictors towards 0. The penalised log-likelihood function is defined as:

\[\begin{equation} \tag{8} l*(\textbf{y}|\theta)=ln(L(\textbf{y}|\theta))-{\color{red}p}(\sum_{l=1}^{L_1}\gamma_l^2+\sum_{l=1}^{L_2}\lambda_l^2) \end{equation}\]

where \(p\) is the tuning (shrinkage) parameter (the larger the value for \(p\) the more coefficients will be shrunk to 0).

Cross-validation is used to tune the penalty \(p\). The chromosomal segments are split into 10 folds. Let \(\theta_{-f}^p\) be the parameters of the model estimated without the data from fold \(f\) and under penalty \(p\), and let \(l*_f(\theta_{-f}^p)\) be the penalised log-likelihood of the data in fold \(f\) under the model optimised without fold \(f\). Then,

\[\begin{equation} \tag{9} l'(\theta^p)=\dfrac{1}{10} \sum_{i=1}^{10} l*_i(\theta_{-f}^p). \end{equation}\]

4. Model selection

Our aim is to choose a relatively sparse model that fits the data. Start with forward selection: for each of the \(L\) functional annotations of SNPs, fit a model including (i) a region-level parameter for regions in the top third of the distirbution of gene density (ii) a region-level parameter for regions in the bottom third of the distirbution of gene density (iii) a SNP-level parameter for SNPs from 0-5 kb from a TSS (iv) a SNP-level parameter for SNPs from 5-10 kb from a TSS (v) a SNP-level parameter for the annotation in question.

We now have \(L\) models and calculate the likeihood (equation 6) for each.

Add the annotation that most significantly improves the likelihood to the model –> model 1a.
Test model 1a plus each annotation identified as having a significant marginal effect.
Go back to step 1 if any annotation remains significant.

I.e. we add the SNP annotation that significantly improves the likelihood at each round.

Using the model found above, tune the penalty parameter \(p\) by finding the value of \(p\) that maximises the cross-validation likelihood (equation 9). Cross-validation is then also used to see if any terms can be removed from the model.

Drop each annotation from the model (including (i)-(iv)) in turn and evaluate the cross-validation likelihood.
Drop the annotation from the model and return to step 1 if a simpler model has a higher cross-validation likelihood han the full model.
Report the model that has the highest cross-validation likelihood. This is the (nearly) final model.

I.e. we keep removing terms one-by-one until the cross-validation likelihood doesn’t improve anymore.

5. Conditional analysis to find interchangeable annotations

Many genomic annotations are correlated and it is possible that these correlated annotations will appear in the model together. It is therefore useful to know which annotations are related and can be used interchangably in the model. After the (nearly) final model is found by following the steps above, for each of the annotations in this model, the other annotations are taken in turn and tested whether the included annotation was significantly more informative than the nonincluded annotation (by fixing the estimates at their maximum-likelihood value and calculating the MLE of the ttested parameter conditional on the fixed values - if the CI does not overlap 0, then this is evidence that the second annotation adds more information to the model than does the first annotation).

Reweighting GWAS

Once the model has been fitted, we have empirical estimates of

\(\Pi_k\): The prior probability that region \(k\) contains a causal SNP associated with the trait.
\(\pi_{ik}\): The prior probability that SNP \(i\) is the causal SNP, given that region \(k\) contains a causal SNP.

The model tells us which annotations are enriched with genetic variants influencing the trait. “A convenient side effect of fitting an explicit statistical model relating properties of SNPs to the probability of association is that the functional informattion can be used for reweighting the GWAS.”

Similarly to Maller et al., we define a region level BF as the weighted sum of the per-SNP BFs. The BF for region \(k\), that measures the evidence of association in the region is,

\[\begin{equation} \tag{10} BF_k^R=\sum_{i \in S_k} \hat\pi_{ik}BF_i. \end{equation}\]

The PP that region \(k\) contains an association is then,

\[\begin{equation} \tag{11} PPA_k^R = \dfrac{\hat\Pi_k BF_k^R/(1-\hat\Pi_k)}{1 + \hat\Pi_k BF_k^R/(1-\hat\Pi_k)}. \end{equation}\]

We can also find the PP that any given SNP \(i\) in region \(k\) is the causal one by using Maller et al’s general framework but also allowing the prior probabilities to vary across SNPs.

\[\begin{equation} \tag{12} PPA_{ik} = \dfrac{\hat\pi_{ik} BF_i}{\sum_{j \in S_k\hat\pi_{jk} BF_j}}. \end{equation}\]

Comments

Pickrell has deposited the functional annotation data for 1000 genomes variants on github. For example, this is a dataframe of dimension 3007196 (positions/SNPs) *454 (annotations).

x <- fread("chr1.annot.wdist.wcoding.gz")
x[1:5,1:5]

“The block size is chosen to be much larger than the extrend of LD in the population”. Why is this when we are only assuming a single CV?
\(\kappa\) in equation (3).
Articles that cite fgwas state that it is mainly useful for high power associations and reweighting GWAS.