LD score regression

References

Raymond Walters lectures: https://www.youtube.com/watch?v=dVrF0l9jMgE and https://www.youtube.com/watch?v=QVPNouAbXsY
Original LDSC paper: LD Score regression distinguishes confounding from polygenicity in genome-wide association studies
Stratified LDSC paper: Partitioning heritability by functional annotation using genome-wide association summary statistics
An atlas of genetic correlations across human diseases and traits
Heidi Marika Hautakangas’ thesis

Intuition behind LDSC

In genetics, the standard additive model is

\[\begin{equation} \tilde{y_i}=\sum_{j=1}^J \beta_jx_{ij} +\epsilon_i \end{equation}\]

where \(\tilde{y_i}\) is the phenotype of interest for individual \(i\), \(x_{ij}\) is that individual's genotype at SNP \(j\), \(\beta_j\) is the effect size of SNP \(j\) on the phenotype and \(\epsilon_i\) is the residual term.

The data are typically standardised so that \(var(\tilde{y_i})=1\) and \(var(x_{ij})=1\) for every SNP. This implicitly assumes a relationship between \(\beta_j\) and minor allele frequency (MAF): rarer variants (smaller MAF) must have larger per-allele effect sizes to compensate. There are two extremes: (i) full standardisation, where assuming a constant variance for the standardised \(\beta_j\) means each SNP explains the same amount of variance on average (so rarer variants have larger per-allele effects), and (ii) no standardisation, where the distribution of per-allele effect sizes is the same for all SNPs and does not depend on allele frequency. Realistically, the truth lies somewhere between these extremes and is trait specific.
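To make the link between standardisation and MAF concrete, the sketch below converts a standardised effect into a per-allele effect. It assumes Hardy-Weinberg equilibrium and additive 0/1/2 genotype coding, so that the genotype variance is \(2p(1-p)\) for allele frequency \(p\). The function name `per_allele_effect` and the numbers are illustrative only.

```python
import numpy as np

def per_allele_effect(beta_std, maf):
    """Convert a standardised effect into a per-allele effect.

    Assumes Hardy-Weinberg equilibrium, so var(genotype) = 2 * maf * (1 - maf).
    """
    genotype_sd = np.sqrt(2.0 * maf * (1.0 - maf))
    return beta_std / genotype_sd

# The same standardised effect (same variance explained) at a common and a rare SNP.
common = per_allele_effect(0.05, maf=0.5)
rare = per_allele_effect(0.05, maf=0.01)
```

Under extreme (i), `rare` is several times larger than `common` even though both SNPs explain the same variance.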

In the genome, SNPs are correlated with one another, so a GWAS estimates marginal effects rather than the joint effects \(\beta_j\),

\[\begin{equation} \hat{\beta_j}^{GWAS}=s_j+\sum_{k=1}^J \beta_k r_{x_{i,j},x_{i,k}}+\epsilon_j \end{equation}\]

where \(s_j\) is some bias from confounders (e.g. population stratification or relatedness), \(r_{x_{i,j},x_{i,k}}\) is the correlation between SNPs \(j\) and \(k\), and \(\epsilon_j\) is sampling noise.
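As a small illustration of the marginal-effect equation, the sketch below ignores confounding and noise (\(s_j=0\), \(\epsilon_j=0\)) and uses a made-up three-SNP correlation matrix. It shows a non-causal SNP picking up signal from a causal neighbour purely through LD.

```python
import numpy as np

# Hypothetical LD (correlation) matrix: SNPs 0 and 1 are in LD (r = 0.8),
# SNP 2 is independent of both.
r = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# Only SNP 0 is causal.
beta = np.array([0.1, 0.0, 0.0])

# Marginal (GWAS) effects with no confounding and no noise: sum_k beta_k * r_jk.
marginal = r @ beta
```

SNP 1 has no causal effect of its own but receives a marginal effect of 0.8 times that of SNP 0, while SNP 2 stays at zero.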

The LD score of SNP \(j\) is defined as

\[\begin{equation} l_j=\sum_{k=1}^J r^2_{x_{i,j}, x_{i,k}}. \end{equation}\]
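The definition can be computed directly from a genotype matrix. The following is a minimal sketch on toy data: it sums \(r^2\) over all SNPs (including the SNP itself) and does not restrict the sum to a window or adjust \(r^2\) for sampling bias, both of which a practical implementation would normally need. The function name `ld_scores` is illustrative.

```python
import numpy as np

def ld_scores(genotypes):
    """Return l_j = sum_k r_jk^2 for an (individuals x SNPs) genotype matrix."""
    g = np.asarray(genotypes, dtype=float)
    # Standardise each SNP column to mean 0 and variance 1.
    x = (g - g.mean(axis=0)) / g.std(axis=0)
    r = x.T @ x / x.shape[0]      # SNP-by-SNP correlation matrix
    return (r ** 2).sum(axis=1)   # sum of squared correlations per SNP

# Toy data: SNPs 0 and 1 are identical (perfect LD); SNP 2 is uncorrelated with them.
g = np.array([[0, 0, 2],
              [1, 1, 0],
              [2, 2, 1],
              [0, 0, 1],
              [1, 1, 0],
              [2, 2, 2]])
scores = ld_scores(g)
```

Here SNPs 0 and 1 each get an LD score of 2 (themselves plus one perfect LD friend) and SNP 2 gets an LD score of 1.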

The LDSC model is then a simple linear regression of \(\chi^2\) statistics on LD scores. Under a polygenic model, the expected \(\chi^2\) statistic for SNP \(j\) is:

\[\begin{equation} E(\chi^2_j)=1+N\alpha+\dfrac{Nh^2_{SNP}}{M} l_j \end{equation}\]

where \(N\) is the sample size, \(\alpha\) is a measure of confounding, \(M\) is the number of SNPs and \(h^2_{SNP}=\sum_j\beta_j^2\) is the SNP-heritability. This relationship between the \(\chi^2\) value and the LD score is intuitive: the more variants a SNP tags (and the more strongly it tags them), the more likely it is to tag a causal variant (CV). More formally, “assuming a uniform prior, we see SNPs with more LD friends showing more association”.

Uses of LDSC

If we regress our \(\chi^2\) values from the GWAS on \(Nl_j\) for each SNP \(j\), we get:

Intercept: estimate of \(1+N\alpha\). Testing for deviation from 1 gives an index of stratification/confounding, and the estimate can be used to correct for it; a value \(>1\) implies confounding, similar to genomic control.
Slope: estimate of \(\frac{h^2_{SNP}}{M}\) (with known M, can convert to an estimate of \(h^2_{SNP}\)), i.e. how much it tracks with changes in LD.
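The regression just described can be sketched as an ordinary least-squares fit. This is a simplified illustration rather than the LDSC software's procedure: it is unweighted and gives no standard errors, and the simulated inputs (`ld`, `n`, `m`, `alpha`) are made-up values. The helper name `ldsc_fit` is hypothetical.

```python
import numpy as np

def ldsc_fit(chi2, ld_scores, n, m):
    """Regress chi2 on N * l_j; return (intercept, h2_snp).

    The intercept estimates 1 + N * alpha; the slope estimates h2_snp / M.
    """
    x = n * np.asarray(ld_scores, dtype=float)
    design = np.column_stack([np.ones_like(x), x])
    (intercept, slope), *_ = np.linalg.lstsq(
        design, np.asarray(chi2, dtype=float), rcond=None
    )
    return intercept, slope * m

# Simulated example with known parameters (all values are assumptions).
rng = np.random.default_rng(0)
m, n, h2_true, alpha = 1000, 10_000, 0.4, 1e-6
ld = rng.uniform(1, 100, size=m)

# Expected chi2 under the model, and a noisy version scaled by null chi2(1) draws.
expected_chi2 = 1 + n * alpha + n * h2_true / m * ld
noisy_chi2 = expected_chi2 * rng.chisquare(1, size=m)

intercept_hat, h2_hat = ldsc_fit(noisy_chi2, ld, n, m)
```

With noise-free input (`expected_chi2`) the fit recovers the intercept \(1+N\alpha\) and \(h^2_{SNP}\) exactly; with `noisy_chi2` the estimates scatter around those values.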

This method was first used to distinguish population stratification (which produces no relationship between LD score and the \(\chi^2\) association statistic) from genuine polygenic effects (which produce a positive relationship) by examining the LDSC intercept. The intercept was compared with \(\lambda_{GC}\) (the factor by which the observed \(\chi^2\) values are divided in the genomic-control method) to show that genomic control is unnecessarily conservative (LD score intercept \(<\lambda_{GC}\)). Moreover, unlike LDSC, genomic control does not distinguish true polygenicity from confounding bias.
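To see why genomic control can be conservative, the sketch below computes \(\lambda_{GC}\) as the median observed \(\chi^2\) divided by the median of a null \(\chi^2_1\) distribution (about 0.455). It applies this to simulated statistics that are polygenic but have no confounding at all (\(\alpha=0\)). The simulation settings are arbitrary assumptions.

```python
import numpy as np

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def lambda_gc(chi2):
    """Genomic-control inflation factor: median(chi2) / null median."""
    return float(np.median(chi2) / CHI2_1DF_MEDIAN)

rng = np.random.default_rng(1)
m, n, h2 = 5000, 10_000, 0.4
ld = rng.uniform(1, 100, size=m)
null_chi2 = rng.standard_normal(m) ** 2

# Polygenic signal without confounding: inflation grows with LD score.
polygenic_chi2 = (1 + n * h2 / m * ld) * null_chi2

lam_null = lambda_gc(null_chi2)
lam_polygenic = lambda_gc(polygenic_chi2)
```

In this setup the LDSC intercept is 1 by construction, yet `lam_polygenic` is well above 1, so dividing by \(\lambda_{GC}\) would remove real polygenic signal.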

Heritability describes the proportion of the phenotypic variation that can be explained by genetic factors. Traditionally, twin studies were used to estimate heritability but now linear mixed models (LMM) are typically used by partitioning phenotypic variance into variance components. However, LMM typically use REML for parameter estimation and therefore require individual genotype data. LDSC offers an alternative method for estimating heritability without requiring individual genotypes.

Note that precomputed LD scores (\(l\)) for European and East Asian populations can be downloaded directly from GitHub, so LDSC can be performed easily using only GWAS summary statistics (and a reference LD population).

Key points

LDSC was developed as a tool to distinguish confounding from polygenicity in GWAS using only summary statistics and a reference LD panel.
Its development was based on the fact that \(\chi^2\) values for true associations are positively correlated with LD scores, whereas \(\chi^2\) values for false positives (e.g. due to population stratification/drift) are not correlated with LD scores.
The intercept of the \(\chi^2 \sim LD score\) regression estimates confounding (\(=1\) if no confounding), similarly to (but arguably better than) \(\lambda_{GC}\).
An extension of LDSC is stratified LDSC, which aims to partition heritability by functional annotation.

Stratified LD score regression (S-LDSC)

Trynka et al. (2013) have shown that the contribution to heritability for quantitative traits and complex diseases is not uniform across the genome and that regions with certain functional annotations contribute more (or less) to the overall heritability. S-LDSC can be used to partition the heritability by functional annotations.

We have previously assumed that

\[\begin{equation} Var(\beta_j)=\dfrac{h^2_{SNP}}{M} \end{equation}\]

i.e. that heritability from each SNP is on average the same genome wide. But what if we want to evaluate whether there are regions of the genome with stronger effects (i.e. higher \(Var(\beta_j)\))?

To do this, we allow the variance to vary between functional categories (\(C\)),

\[\begin{equation} Var(\beta_j)=\sum_{c:j\in C_c}\tau_c \end{equation}\]

If the categories are disjoint, then

\[\begin{equation} h^2_{SNP}(C_c)=\sum_{j\in C_c}\beta_j^2=\tau_c\times M(C_c) \end{equation}\]

where \(M(C_c)\) is the number of SNPs in category \(C_c\). If the categories overlap, we are assuming that they act additively on the variance of \(\beta_j\).

The category specific LD score for SNP \(j\) and category \(c\) is defined by

\[\begin{equation} l_{j,c}=\sum_{k \in C_c} r^2_{jk} \end{equation}\]

i.e. the sum of LD tagging of SNP \(j\) with all SNPs in the functional category.

The stratified LD score model now looks like,

\[\begin{equation} E(\chi^2_j)=1+N\alpha+N\sum_c \tau_c l_{j,c} \end{equation}\]

so that rather than summing over all LD friends, we are now summing over the LD friends in each functional category \(c\), weighted by \(\tau_c\). We can estimate \(\tau_c\) via multiple regression, with \(l_{j,c}\) computed from reference data for a choice of annotations, where \(\tau_c\) is the per SNP contribution to heritability of category \(c\).
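The multiple regression can be sketched in the same way as the single-category case. This is an unweighted least-squares illustration on synthetic, noise-free inputs, not the S-LDSC software's estimator. The category LD scores, the values in `tau_true` and the helper name `sldsc_fit` are all assumptions made for the example.

```python
import numpy as np

def sldsc_fit(chi2, cat_ld_scores, n):
    """Regress chi2 on N * l_{j,c}; return (intercept, tau) with one tau per category."""
    x = n * np.asarray(cat_ld_scores, dtype=float)      # shape (SNPs, categories)
    design = np.column_stack([np.ones(x.shape[0]), x])
    coef, *_ = np.linalg.lstsq(design, np.asarray(chi2, dtype=float), rcond=None)
    return coef[0], coef[1:]

rng = np.random.default_rng(2)
m, n, alpha = 2000, 10_000, 1e-6

# Column 0: LD score to all SNPs (a base category containing every SNP).
# Column 1: LD score to SNPs in one hypothetical annotation.
base_ld = rng.uniform(1, 100, size=m)
annot_ld = base_ld * rng.uniform(0, 0.3, size=m)
cat_ld = np.column_stack([base_ld, annot_ld])

tau_true = np.array([1e-4, 5e-4])
expected_chi2 = 1 + n * alpha + n * (cat_ld @ tau_true)

intercept_hat, tau_hat = sldsc_fit(expected_chi2, cat_ld, n)
```

Because the input is noise-free, `tau_hat` matches `tau_true`; with real summary statistics the estimates are noisy and need standard errors before any \(\tau_c\) is interpreted.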

There are two ways to evaluate partitioned heritability results:

Enrichment of effects in a single annotation.
- \(h^2_{SNP}(C_c)=\tau_c\times M(C_c)\)
- \(Enrichment=\dfrac{h^2_{SNP}(C_c)/M(C_c)}{h^2_{SNP}/M}\) (the per SNP heritability in annotation \(C_c\) divided by the genome wide average heritability per SNP).
Enrichment conditional on other annotations.
- I.e. whether \(\tau_c\) differs from 0 in multiple linear regression
- Important for highly overlapping annotations
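The enrichment calculation can be sketched from a binary annotation matrix and a vector of \(\tau_c\) values. For overlapping categories the sketch uses the additive model above, \(Var(\beta_j)=\sum_{c:j\in C_c}\tau_c\), and sums it over the SNPs in each category. For disjoint categories this reduces to \(\tau_c\times M(C_c)\). The annotation matrix and \(\tau\) values below are invented for illustration.

```python
import numpy as np

def enrichment(annot, tau):
    """Per-category enrichment: (h2(C_c) / M(C_c)) / (h2 / M).

    annot: (SNPs x categories) 0/1 matrix; tau: per-SNP contribution per category.
    """
    annot = np.asarray(annot, dtype=float)
    per_snp_var = annot @ np.asarray(tau, dtype=float)  # Var(beta_j) for each SNP
    h2_total = per_snp_var.sum()
    m_total = annot.shape[0]
    m_cat = annot.sum(axis=0)            # number of SNPs in each category
    h2_cat = annot.T @ per_snp_var       # heritability of SNPs in each category
    return (h2_cat / m_cat) / (h2_total / m_total)

# Four SNPs; category 0 contains every SNP, category 1 contains SNPs 0 and 1.
annot = np.array([[1, 1],
                  [1, 1],
                  [1, 0],
                  [1, 0]])
tau = np.array([0.01, 0.03])
enr = enrichment(annot, tau)
```

Here the all-SNP category has enrichment 1 by definition, and the second category has enrichment 1.6: its two SNPs carry 80% of the heritability while making up 50% of the SNPs.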

Full derivations can be found here.

Details on annotations

It is often useful to define buffer regions around annotations. For example, rather than a binary 0/1 for whether the SNP falls in an annotation, it may be important to know whether a SNP lies very close to the boundaries of these annotations. For this reason, additional annotations can be defined for SNPs falling in these buffer regions (e.g. all annotations plus all annotations + buffer region).
Can extend to continuous annotations (rather than 0/1 whether it is in the annotation or not; https://www.nature.com/articles/ng.3954).
Used to make statements like “variants for BMI are enriched in regions that suggest active marks in CNS cells”.
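The buffer-region idea can be sketched as two binary annotations per SNP: one for the annotation itself and one for the annotation extended by a fixed window on each side. The positions, interval and 100 bp buffer below are arbitrary illustrative values, and `annotate_with_buffer` is a hypothetical helper.

```python
import numpy as np

def annotate_with_buffer(positions, intervals, buffer_bp):
    """Return 0/1 flags for (in annotation, in annotation + buffer) per SNP."""
    pos = np.asarray(positions)
    in_annot = np.zeros(pos.shape, dtype=bool)
    in_buffered = np.zeros(pos.shape, dtype=bool)
    for start, end in intervals:
        # Closed intervals [start, end] in base-pair coordinates.
        in_annot |= (pos >= start) & (pos <= end)
        in_buffered |= (pos >= start - buffer_bp) & (pos <= end + buffer_bp)
    return in_annot.astype(int), in_buffered.astype(int)

snp_positions = [100, 450, 1000, 1600, 3000]
annot, buffered = annotate_with_buffer(snp_positions, [(500, 1500)], buffer_bp=100)
```

The SNPs at 450 and 1600 fall outside the annotation but inside the buffered version, so including both columns in the regression helps keep signal just outside the boundary from being attributed to the annotation itself.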

Advantages of stratified LD score regression

Only requires summary statistics.
Does not assume a single CV per region.
Does not only use SNPs either reaching genome-wide significance or falling in genome-wide significant regions.
Accounts for LD.
Computationally efficient.

Drawbacks of stratified LD score regression

Requires large data sets and/or large SNP heritability.
Trait analysed must be polygenic.
Requires an LD reference panel matched to the population studied.
Not applicable to studies using custom genotyping arrays (because LD scores are computed from 1000 Genomes data and need to generalise to the study SNPs).
Based on additive model and does not consider non-additive effects.