Deriving PPs

1. \(Z\)-scores

GWAS analysis typically proceeds by fitting single-SNP logistic regression models.

For each SNP \(i\) typed in the study, the following model is fitted: \[\begin{equation} \text{logit}(P(Y=1|X_i=x_i))=\beta_0+x_i\beta_{i}\,, \end{equation}\] where \(Y\) is a binary indicator of disease (0 = no disease, 1 = diseased), \(X_{i}\) is the genotype at SNP \(i\) (0, 1 or 2 copies of the risk allele at that position) and \(\beta_{i}\) is the regression coefficient, the log odds ratio (OR) per copy of the risk allele, which quantifies the association between SNP \(i\) and the disease.
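A minimal sketch of fitting this model for a single SNP by Newton-Raphson maximum likelihood, using only the Python standard library. The function names, the toy genotype counts and the choice of optimiser are illustrative assumptions, not the software used for the analysis described here; in practice a GWAS tool or statistics package would perform this step.

```python
import math

def _score_and_information(b0, b1, genotypes, disease):
    """Gradient of the log-likelihood and entries of the 2x2 Fisher information."""
    g0 = g1 = i00 = i01 = i11 = 0.0
    for x, y in zip(genotypes, disease):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # fitted P(Y=1 | x)
        w = p * (1.0 - p)
        g0 += y - p
        g1 += (y - p) * x
        i00 += w
        i01 += w * x
        i11 += w * x * x
    return g0, g1, i00, i01, i11

def fit_single_snp(genotypes, disease, max_iter=50, tol=1e-10):
    """Return (beta0_hat, beta_hat, V) for logit P(Y=1|x) = beta0 + beta*x.

    V is the estimated variance of beta_hat, taken from the inverse
    Fisher information evaluated at the maximum likelihood estimate.
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0, g1, i00, i01, i11 = _score_and_information(b0, b1, genotypes, disease)
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2x2 system (information) * step = score
        d0 = (i11 * g0 - i01 * g1) / det
        d1 = (i00 * g1 - i01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    _, _, i00, i01, i11 = _score_and_information(b0, b1, genotypes, disease)
    var_beta = i00 / (i00 * i11 - i01 * i01)
    return b0, b1, var_beta

# Hypothetical toy data: (genotype, number of cases, number of controls)
counts = [(0, 20, 40), (1, 30, 30), (2, 15, 8)]
genotypes, disease = [], []
for g, cases, controls in counts:
    genotypes += [g] * (cases + controls)
    disease += [1] * cases + [0] * controls

beta0_hat, beta_hat, V = fit_single_snp(genotypes, disease)
print(beta_hat, V)
```

The returned pair `(beta_hat, V)` is all that the later steps need from the model fit for each SNP.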

The marginal \(Z\) score for each SNP is derived by dividing the estimated regression coefficient by its standard error, \[\begin{equation} Z_i=\dfrac{\hat\beta_i}{\sqrt{V_i}}\,, \end{equation}\] where \(V_i=\text{var}(\hat\beta_i)\).
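As a small illustration, the Z scores can be computed directly from GWAS summary statistics. The function name and the three `(beta_hat, V)` pairs below are hypothetical values chosen only to show the arithmetic.

```python
import math

def z_score(beta_hat, var_beta):
    """Marginal Z score: estimated coefficient divided by its standard error."""
    return beta_hat / math.sqrt(var_beta)

# Hypothetical summary statistics for three SNPs: (beta_hat, V)
summary_stats = [(0.25, 0.0025), (-0.10, 0.0100), (0.02, 0.0004)]
z_scores = [z_score(b, v) for b, v in summary_stats]
print(z_scores)
```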

2. Posterior probabilities of causality

The posterior probabilities of causality (PP) for each SNP \(i\) in an associated genomic region with \(k\) SNPs can be calculated, \[\begin{equation} PP_i=P(\beta_i \sim N(0,W),\text{ }i \text{ causal}|D)\,, \quad i \in \{1,...,k\} \end{equation}\]

where \(D\) is the genotype data (0, 1 or 2 copies of the minor allele) for the entire genomic region and \(W\) is the prior variance of the log odds ratio, chosen to reflect the researcher’s prior belief about the variability of the true OR. We set the prior standard deviation to \(\sqrt{W}=0.2\) (that is, \(W=0.04\)) in our method, reflecting a belief that 95% of ORs range from \(\exp(-1.96\times 0.2)=0.68\) to \(\exp(1.96\times 0.2)=1.48\).
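The quoted 95% range can be checked with a few lines of arithmetic. This sketch reads 0.2 as the prior standard deviation of the log OR, which is what the calculation in the text uses; the variable names are illustrative.

```python
import math

prior_sd = 0.2  # assumption: 0.2 is read as the prior SD of the log OR
z_975 = 1.96    # approximate 97.5% quantile of the standard normal

lower_or = math.exp(-z_975 * prior_sd)
upper_or = math.exp(z_975 * prior_sd)
print(round(lower_or, 2), round(upper_or, 2))  # prints 0.68 1.48
```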

Bayes’ theorem can be used to rewrite this in terms of the likelihood and the prior, \[\begin{equation} \begin{aligned} PP_i=P(\beta_i \sim N(0,W),\text{ }i \text{ causal}|D)\propto P(D|\beta_i\sim N(0,W),\text{ }i\text{ causal})\times P(\beta_i \sim N(0,W),\text{ }i\text{ causal}). \end{aligned} \end{equation}\]

The prior term, \(P(\beta_i \sim N(0,W),\text{ }i\text{ causal})\), is straightforward: each SNP is assumed to be equally likely to be causal a priori, that is, \(P(\beta_i \sim N(0,W),\text{ }i\text{ causal})=\frac{1}{k}\).

The likelihood requires more thought. Assuming that there is only one causal variant (CV) per region and that it is typed in the study, then if SNP \(i\) is causal, \(\beta_i\neq 0\) and \(\beta_j\) (for \(j\neq i\)) is non-zero only through linkage disequilibrium (LD) between SNPs \(i\) and \(j\), so that

\[\begin{equation} \begin{aligned} P(D|\beta_i\sim N(0,W),\text{ }i\text{ causal}) &= P(D_i|\beta_i\sim N(0,W),\text{ }i\text{ causal}) \times P(D_{-i}|D_i,\text{ }\beta_i\sim N(0,W),\text{ }i\text{ causal}) \\ &= P(D_i |\beta_i\sim N(0,W),\text{ }i\text{ causal}) \times P(D_{-i}|D_i,\text{ }i\text{ causal})\,, \end{aligned} \end{equation}\]

since \(D_{-i}\) is independent of \(\beta_i\) given \(D_i\) (\(D_i\) and \(D_{-i}\) are the genotype data at SNP \(i\) and at the remaining SNPs in the genomic region, respectively).

We can substitute this form of the likelihood into the equation for the PPs,

\[\begin{equation} PP_i\propto P(D_i|\beta_i \sim N(0,W),\text{ }i \text{ causal})\,. \end{equation}\]

We divide by the probability of the data under the null hypothesis of no genetic effect to find that,

\[\begin{equation} PP_i\propto \frac{P(D_i|\beta_i \sim N(0,W),\text{ }i \text{ causal})}{P(D_i|H_0)}= BF_i\,, \end{equation}\]

where \(BF_i\) is the Bayes factor for SNP \(i\), measuring the ratio of the probabilities of the data at SNP \(i\) given the alternative (SNP \(i\) is causal) and the null (no genetic effect) models.

This means that the PPs are proportional to the per-SNP BFs, and conveniently we can use Wakefield’s asymptotic approach to derive these.

Given that \(\hat\beta_i\sim N(\beta_i,V_i)\) and \(\beta_i\sim N(0,W)\),

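\[\begin{equation} ABF_i=\sqrt{\frac{V_i}{V_i+W}}\exp\left(\frac{Z_i^2}{2}\frac{W}{V_i+W}\right)\,, \end{equation}\] where \(ABF_i\) is the asymptotic Bayes factor for SNP \(i\) and \(Z_i^2=\dfrac{\hat\beta_i^2}{V_i}\) is the squared marginal \(Z\) score for SNP \(i\). Since the PPs are proportional to the BFs and must sum to one over the \(k\) SNPs in the region, \(PP_i=ABF_i/\sum_{j=1}^{k}ABF_j\).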
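A minimal sketch of turning per-SNP Z scores and variances into ABFs and then PPs, under the single-causal-variant assumption used above. The function names and input values are hypothetical, and the prior variance is set to \(0.2^2\) on the assumption that 0.2 is a prior standard deviation; pass whichever variance \(W\) is intended. The ABFs are handled on the log scale so that large \(Z\) scores do not overflow.

```python
import math

def log_abf(z, v, w):
    """Log of the ABF (alternative over null) for one SNP.

    z: marginal Z score, v: variance of beta_hat, w: prior variance W.
    """
    shrinkage = w / (v + w)
    return 0.5 * math.log(v / (v + w)) + 0.5 * z * z * shrinkage

def posterior_probs(z_scores, variances, w):
    """Normalise the per-SNP ABFs so that the PPs sum to one over the region."""
    logs = [log_abf(z, v, w) for z, v in zip(z_scores, variances)]
    largest = max(logs)
    # Subtract the largest log ABF before exponentiating (log-sum-exp trick)
    weights = [math.exp(value - largest) for value in logs]
    total = sum(weights)
    return [weight / total for weight in weights]

# Hypothetical Z scores and variances for a region with three SNPs
z_scores = [5.0, -1.0, 1.0]
variances = [0.0025, 0.0100, 0.0004]
prior_variance = 0.2 ** 2  # assumption: 0.2 is a prior SD, so W = 0.04

pps = posterior_probs(z_scores, variances, prior_variance)
print(pps)
```

The equal prior of \(1/k\) cancels in the normalisation, which is why it does not appear in the code.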


Anna Hutchinson

12/09/2019
