Intuition behind LDSC

In genetics, the standard additive model is

\[\begin{equation} \tilde{y_i}=\sum_{j=1}^J \beta_jx_{ij} +\epsilon_i \end{equation}\]

where \(\tilde{y_i}\) is the (standardised) phenotype of individual \(i\), \(x_{ij}\) is the genotype of individual \(i\) at SNP \(j\), \(\beta_j\) is the effect size of SNP \(j\) on the phenotype and \(\epsilon_i\) is the residual.

The data are typically standardised so that \(var(\tilde{y_i})=1\) and \(var(x_{ij})=1\) for every SNP, which implicitly assumes a relationship between \(\beta_j\) and minor allele frequency (MAF): rarer variants (smaller MAF) must have larger per-allele effect sizes to compensate. There are two extremes for this standardisation step: (i) full standardisation, where assuming a constant variance for the standardised \(\beta\) means that each SNP explains the same amount of variance in expectation, so rarer variants have larger per-allele effects; and (ii) no standardisation, where the distribution of per-allele effect sizes is the same for every SNP and does not depend on allele frequency. Realistically, the truth lies somewhere between these extremes and is trait specific.
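The standardisation step can be sketched in a few lines of Python. This is a minimal illustration, not code from any LDSC package: the function names `standardise_genotypes` and `per_allele_effect` are hypothetical, the genotypes are simulated, and the conversion assumes Hardy-Weinberg equilibrium so that the genotype variance is \(2p(1-p)\) for allele frequency \(p\).

```python
import numpy as np

def standardise_genotypes(G):
    """Centre and scale each SNP column to mean 0 and variance 1."""
    G = np.asarray(G, dtype=float)
    return (G - G.mean(axis=0)) / G.std(axis=0)

def per_allele_effect(beta_std, maf):
    """Convert a standardised effect size to a per-allele effect size.

    Assumes Hardy-Weinberg equilibrium, so var(genotype) = 2p(1-p).
    """
    maf = np.asarray(maf, dtype=float)
    return beta_std / np.sqrt(2.0 * maf * (1.0 - maf))

rng = np.random.default_rng(0)
maf = np.array([0.01, 0.10, 0.50])           # illustrative allele frequencies
G = rng.binomial(2, maf, size=(5000, 3))     # simulated 0/1/2 genotypes
X = standardise_genotypes(G)

print(X.var(axis=0))                  # each column now has variance 1
print(per_allele_effect(0.01, maf))   # same standardised effect, larger per-allele effect at low MAF
```

Under extreme (i), the same standardised effect corresponds to a larger per-allele effect for the rarest SNP, which is the compensation described above.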

SNPs in the genome are correlated with one another (linkage disequilibrium, LD), so a GWAS estimates marginal effects rather than causal effects:

\[\begin{equation} \hat{\beta_j}^{GWAS}=s_j+\sum_{k=1}^J \beta_k r_{x_{i,j},x_{i,k}}+\epsilon_j \end{equation}\]

where \(s_j\) is some bias from confounders (e.g. population stratification or relatedness) and \(r_{x_{i,j},x_{i,k}}\) is the correlation between SNPs \(x_j\) and \(x_k\).
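To make the marginal-effect expression concrete, here is a small sketch on simulated data. It assumes no confounding (\(s_j=0\)) and a noise-free phenotype, in which case the marginal estimates equal the correlation matrix times the causal effects exactly. The "genotypes" are continuous stand-ins rather than 0/1/2 counts, and all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, J = 2000, 5

# Correlated "genotypes": every column shares a common component.
Z = rng.standard_normal((n, J)) + rng.standard_normal((n, 1))
X = (Z - Z.mean(axis=0)) / Z.std(axis=0)   # standardised columns

R = X.T @ X / n                            # sample correlation matrix r_{jk}
beta = np.array([0.3, 0.0, 0.0, 0.2, 0.0]) # only SNPs 0 and 3 are causal
y = X @ beta                               # noise-free phenotype, no confounding

# Marginal (one-SNP-at-a-time) regression estimates.
beta_marginal = X.T @ y / n

print(np.allclose(beta_marginal, R @ beta))  # marginal effects are sum_k beta_k r_jk
print(beta_marginal)                         # non-causal SNPs pick up signal through LD
```

SNP 1 has no causal effect, yet its marginal estimate is clearly non-zero because it is correlated with the two causal SNPs.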

The LD score of SNP \(j\) is defined as

\[\begin{equation} l_j=\sum_{k=1}^J r^2_{x_{i,j}, x_{i,k}}. \end{equation}\]
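The definition translates directly into code. The sketch below computes LD scores from a small genotype matrix by summing squared correlations across all SNPs (including the SNP itself). The function name `ld_scores` is hypothetical, and the sketch omits what is done in practice, where LD scores are estimated from a reference panel within a window around each SNP.

```python
import numpy as np

def ld_scores(X):
    """LD score of each SNP: sum over all SNPs k of r^2_{jk}, including k = j."""
    R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return (R ** 2).sum(axis=1)

# Toy example: SNPs 0 and 1 are in perfect LD, SNP 2 is uncorrelated with both.
a = np.array([1.0, -1.0, 1.0, -1.0])
c = np.array([1.0, 1.0, -1.0, -1.0])
X = np.column_stack([a, a, c])

scores = ld_scores(X)
print(scores)  # approximately [2, 2, 1]
```

The two SNPs in perfect LD each tag themselves and one other SNP, so their LD score is 2, while the isolated SNP only tags itself.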

The LDSC model is then a simple linear regression of GWAS \(\chi^2\) statistics on LD scores. Under a polygenic model, the expected \(\chi^2\) statistic of SNP \(j\) is:

\[\begin{equation} E(\chi^2_j)=1+N\alpha+\dfrac{Nh^2_{SNP}}{M} l_j \end{equation}\]

where \(N\) is the sample size, \(\alpha\) is a measure of confounding, \(M\) is the number of SNPs and \(h^2_{SNP}=\sum_j\beta_j^2\) is the SNP-heritability. This relationship between the \(\chi^2\) statistic and the LD score is intuitive: the more variants a SNP tags (and the more strongly it tags them), the more likely it is to tag a causal variant (CV). Put another way, “assuming a uniform prior, we see SNPs with more LD friends showing more association”.
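As a rough worked example with purely illustrative numbers (not taken from any study), take \(N=50000\), \(M=1000000\), \(h^2_{SNP}=0.5\) and no confounding (\(\alpha=0\)). For a SNP with LD score \(l_j=100\),

\[\begin{equation} E(\chi^2_j)=1+0+\dfrac{50000\times 0.5}{1000000}\times 100=1+0.025\times 100=3.5 \end{equation}\]

whereas a SNP with \(l_j=10\) has \(E(\chi^2_j)=1+0.025\times 10=1.25\). The expected statistic grows linearly with the LD score, at a rate set by \(Nh^2_{SNP}/M\).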

Uses of LDSC

If we regress our \(\chi^2\) values from the GWAS on \(Nl_j\) for each SNP \(j\), we get:

  1. Intercept: an estimate of \(1+N\alpha\). Testing for deviation from 1 gives an index of stratification/confounding, and the intercept can be used to correct for it; a value \(>1\) implies confounding, similar to genomic control.

  2. Slope: an estimate of \(\frac{h^2_{SNP}}{M}\), i.e. how strongly the \(\chi^2\) statistic tracks changes in LD score. With known \(M\), this converts to an estimate of \(h^2_{SNP}\).
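The regression above can be sketched with ordinary least squares on simulated summary statistics. This is a simplified illustration only: the function name `ldsc_fit` is hypothetical, the fit is unweighted (whereas a real analysis would need to handle heteroscedasticity and correlated SNPs), and the simulation assumes each \(\chi^2_j\) is its expected value times a 1-df chi-square draw. All parameter values are made up.

```python
import numpy as np

def ldsc_fit(chi2, ld, N, M):
    """Unweighted OLS of chi2 on N * l_j; returns (intercept, h2_snp)."""
    x = N * np.asarray(ld, dtype=float)
    slope, intercept = np.polyfit(x, np.asarray(chi2, dtype=float), 1)
    return intercept, slope * M   # slope estimates h2_snp / M

rng = np.random.default_rng(2)
N, M = 50_000, 100_000            # illustrative sample size and SNP count
h2_true, alpha = 0.2, 1e-6        # illustrative heritability and confounding
ld = rng.uniform(1.0, 100.0, size=M)

# Expected chi-square under the LDSC model.
expected = 1.0 + N * alpha + N * h2_true / M * ld

# Noise-free check: the fit recovers the intercept and heritability.
print(ldsc_fit(expected, ld, N, M))

# Noisy statistics (assumed scaled 1-df chi-square around the expectation).
chi2 = expected * rng.chisquare(1, size=M)
print(ldsc_fit(chi2, ld, N, M))
```

The intercept estimates \(1+N\alpha\) (here 1.05) and the slope multiplied by \(M\) estimates \(h^2_{SNP}\); the noisy fit scatters around the same values.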

This method was first used to distinguish population stratification (which produces no relationship between LD score and the \(\chi^2\) association statistic) from genuinely interesting polygenic effects (which produce a positive relationship between LD score and the \(\chi^2\) association statistic) by examining the LDSC intercept. The intercept was compared with \(\lambda_{GC}\), the factor by which the observed \(\chi^2\) values are divided in the genomic-control method, to show that genomic control is unnecessarily conservative (LD score intercept \(<\lambda_{GC}\)). Moreover, unlike LDSC, genomic control does not distinguish true polygenicity from confounding bias.
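The point about genomic control can be illustrated with a short simulation. The sketch below computes \(\lambda_{GC}\) as the median observed \(\chi^2\) divided by the median of a 1-df chi-square distribution, and shows that it exceeds 1 for a polygenic trait even with no confounding at all. The function name `lambda_gc`, the parameter values and the simulation model (expected value times a 1-df chi-square draw) are assumptions made for illustration.

```python
import numpy as np

CHI2_1_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def lambda_gc(chi2):
    """Genomic-control inflation factor: median chi2 over its null median."""
    return np.median(chi2) / CHI2_1_MEDIAN

rng = np.random.default_rng(3)
n_snps = 20_000
N, M, h2 = 50_000, 1_000_000, 0.2     # illustrative values, alpha = 0
ld = rng.uniform(1.0, 100.0, size=n_snps)

null_chi2 = rng.chisquare(1, size=n_snps)                           # no signal at all
poly_chi2 = (1.0 + N * h2 / M * ld) * rng.chisquare(1, size=n_snps) # polygenic, unconfounded

print(lambda_gc(null_chi2))  # close to 1
print(lambda_gc(poly_chi2))  # above 1 despite zero confounding
```

Dividing every statistic by this \(\lambda_{GC}\) would shrink genuine polygenic signal, whereas the LDSC intercept for the same data is 1 by construction.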

Heritability describes the proportion of phenotypic variation that can be explained by genetic factors. Traditionally, twin studies were used to estimate heritability, but linear mixed models (LMMs), which partition the phenotypic variance into variance components, are now typically used. However, LMMs typically use restricted maximum likelihood (REML) for parameter estimation and therefore require individual-level genotype data. LDSC offers an alternative method for estimating heritability without requiring individual genotypes.

Note that precomputed LD scores (\(l\)) for European and East Asian populations can be downloaded directly from GitHub, so LDSC can be performed easily using only GWAS summary statistics (and a reference LD population).