Introduction

SNP enrichment methods aim to elucidate the functional annotations that trait-associated SNPs overlap more frequently than expected by chance. The most informative annotations likely relate to the cell-type specific biological mechanism of the trait, and consequently SNP enrichment is generally considered in a trait specific manner. For example, annotations representing enhancer marks in pancreatic tissue may be relevant for dissecting T2D associated loci, but less so for immune-mediated (type 1) diabetes, where enhancer marks in lymphoid tissues may be more relevant.

Trait-associated variants and genomic annotations are not randomly distributed throughout the genome, and their heterogeneous distribution implicates confounding features, such as genes which are enriched for trait-associated SNPs and genomic annotations such as DHS sites. In order to meaningfully assess the significance of the observed test statistic in SNP enrichment methods, a null distribution needs to be specified which preserves this confounding, such that the test statistics are estimated in the presence of counfounding.

One way to do this is to specify a null distribution by matching SNPs based on various confounding characteristics (``SNP matching methods’’). For binary SNP overlap with an annotation, GREGOR matches SNPs based on LD, gene proximity and MAF, and uses a permutation-based approach to calculate enrichment \(p\)-values, whilst GARFIELD matches SNPs based on LD and local gene density and uses a logistic regression framework to derive statistical significance .

However by pre-specifying confounding characteristics, hidden confounders which may bias the enrichment statistics can be missed . Moreover, whilst matching based on LD is crucial in SNP matching methods (since the more SNPs a particular SNP tags, the more likely it is that one of them will overlap the annotation, even by chance) , it is not always clear which other matching parameters should be used in a given analysis. Trynka et al. (2015) found that controlling for LD alone was sufficient to mitigate type 1 error in some instances, but many other matching parameters were required in other instances. Similarly, Iotchkova et al. (2019) found that controlling for LD meant that controlling for MAF was no longer required, which is likely due to rarer SNPs being more likely to have lower LD.

To account for the discrepancies arising from choosing matching parameters, the GoShifter (genomic annotation shifter) method defines a null distribution by circularising and ``shifting’’ the annotation, thereby preserving the local genomic architecture and preserving these to calculate the enrichment statistics. The GoShifter (genomic annotation shifter) method was primarily developed to ascertain the functional annotations which differentiate causal from trait-associated variants.

The GoShifter method requires (i) a list of index SNPs from the GWAS of interest and (ii) annotation data, such as the genomic positions where a chromatin or histone annotation is present, which can be converted to a binary vector of presence or absence of the annotation genome-wide. The method starts by defining each locus as the genomic region encompassing all variants in LD with each index SNP (for example \(r^2>0.8\) using a representative reference panel). These loci are then extended by twice the median size of the tested annotation to ensure that the loci also contained SNPs not linked to the GWAS index SNP. The proportion of loci containing a SNP which overlapped the annotation is calculated. Next, to calculate a null distribution which preserves the local structure of genetic variation, the annotation vector for each locus is circularised to preserve its spatial structure, and randomly shifted. After each random shift (typically a large number of iterations) the proportion of loci overlapping an annotation is calculated. A \(p\)-value is then calculated as the proportion of iterations where the number of overlapping loci was greater than or equal to that for the “true” SNP-annotation configuration.

However, GoShifter only leverages SNPs that have been shown to be associated with the trait of interest which may limit power. The GARFIELD method allows more SNPs to be included in the calculation of enrichment, using a GWAS \(p\)-value threshold for variant inclusion in the method. Interestingly, Iotchkova et al. (2019) found that some enrichment was identified only when using lower \(p\)-value threshold (e.g. \(p<1\times 10^{-5}\)), implying that as the statistical power to detect more trait-associated variant increases, more enrichment may be uncovered.


My GoShifter method

Rather than restricting my analysis to SNPs exceeding some significance threshold and potentially losing power by ignoring SNPs with low effects or low frequency that are associated with the trait but yet to be discovered , I use all common variants to test for enrichment of Ikaros-related annotations with T1D GWAS statistics.

My method is based on the GoShifter methodology, but I use a different test statistic of interest. Specifically, my test statistic compares the mean \(p\)-value for SNPs residing in the annotation and the mean \(p\)-value for SNPs not in the annotation. Firstly, I exclude the variants in the T1D data-set residing in the MHC region of the genome, which is known to have complex LD structure, and convert the annotation data into a binary SNP overlap vector (\(q\)). Next, I partition the genome into approximately independent LD blocks defined by the LD detect method . Within each LD block I firstly calculate the quantities required for the observed test statistic, the mean \(-log10(p)\) value for (i) the SNPs overlapping the annotation (i.e. \(q=1\)), \(p1\) and (ii) the SNPs not overlapping the annotation (i.e. \(q=0\)), \(p0\), and the number of SNPs overlapping and not overlapping the annotation, \(n0\) and \(n1\) respectively. If there are no SNPs in the block overlapping the annotation (i.e. \(n1=0\)) then I set \(p1=0\), and similarly, if all SNPs in the block overlap the annotation (i.e. \(n0=0\)) then I set \(p0=0\). The overall test statistic is then

\[\begin{equation} X=\dfrac{\frac{\sum_{i=1}^N p1_i \times n1_i}{\sum_{i=1}^N n1_i}}{\frac{\sum_{i=1}^N p0_i \times n0_i}{\sum_{i=1}^N n0_i}} \end{equation}\]

where \(N\) is the number of LD blocks. This is calculated for the observed SNP-annotation overlap to derive the observed test statistic.

Within each LD block, the annotation vector is then circularised and randomly permuted whereby the magnitude of the permutation is generated by randomly sampling an integer from a uniform distribution between 0 and the number of SNPs in the LD block. The key quantities, \(p1\) and \(p0\) and then calculated and stored (since we are preserving the structure of the annotation within the LD block, \(n0\) and \(n1\) stay constant). We repeat this 1000 times for each LD block. To generate an estimate of the overall test statistic under the null distribution, we take a random sample of a permutation for each LD block and use the values for \(p1\) and \(p0\) in the above equation. We do this 10,000 times to generate 10,000 test statistics computed under the null distribution.

As in GoShifter, I define the ``delta-overlap’’ as the difference between the observed test statistic and the mean of the test statistics under the null.


Results

The following results are for the aCD4 IKZF1 ChIP-seq annotation.

No LD partitioning

If we do not account for LD, then the delta overlap is \(do = 0.203542\), which suggests very strong enrichment. If we plot a histogram of the test statistics under the null then we see that the null distribution is centered on 1, since the mean \(p\)-values for SNPs in a randomly permuted annotation are similar to those SNPs not in the randomly permuted annotation. This drastic enrichment is because we have not accounted for confounders, such as gene density, MAF and LD (we know that trait-associated SNPs are enriched in these confounders, which may inflate the delta overlap statistic). Thus, the observed test statistic is calculated in the presence of confounding, whereas the test statistics under the null model are not calculated in the presence of this confounding - meaning that the observed and null test statistics are not directly comparable.


Partitioning by LD detect blocks

By partitioning the genome into LD blocks and circularising and permuting within these, we preserve some of the confounding when calculating the test statistics under the null distribution.

The null distribution is no longer centered on 1, which suggests that confounding is present and our null distribution is capturing it.


To show that there is indeed confounding, I plot the mean \(-log10(p)\) value against the proportion of SNPs with the annotation in each LD block. Indeed, the blocks with more annotations do on average have smaller p-values i.e. being in certain LD blocks makes you more likely to have smaller p-values and more annotations present, which suggests confounding.

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'


But what is causing the confounding and are we capturing enough of it in our null distribution?

From the literature, the likely confounders are gene proximity and/or MAF. I investigate this using regression models.

When regressing on both \(p\) and the binary annotation, both MAF and gene proximity are significant - suggesting that they are confounders.

But, the t-statistic is about 10 times larger for gene distance than MAF when regressing on annotation, suggesting that gene distance is the dominant confounder. Moreover, for gene distance, the coefficients are similar (annots is half of that for -log10(p)) whereas for MAF, the coefficients are very different (annots is 100 times smaller than -log10(p)).

Note that these observations should be taken with a grain of salt due to the highly correlated p-values used as the independent variable.

model_p_genedist <- lm(-log10(p) ~ abs(gene_distance), data = T1D)
summary(model_p_genedist)
## 
## Call:
## lm(formula = -log10(p) ~ abs(gene_distance), data = T1D)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.509 -0.367 -0.170  0.176 73.490 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.086e-01  2.346e-04 2168.23   <2e-16 ***
## abs(gene_distance) -2.098e-07  5.303e-09  -39.57   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6353 on 8557892 degrees of freedom
## Multiple R-squared:  0.0001829,  Adjusted R-squared:  0.0001828 
## F-statistic:  1565 on 1 and 8557892 DF,  p-value: < 2.2e-16
model_annot_genedist <- glm(annot ~ abs(gene_distance), data = T1D)
summary(model_annot_genedist)
## 
## Call:
## glm(formula = annot ~ abs(gene_distance), data = T1D)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.01401  -0.01401  -0.01401  -0.01250   1.02099  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.401e-02  4.058e-05   345.4   <2e-16 ***
## abs(gene_distance) -1.059e-07  9.174e-10  -115.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.01207509)
## 
##     Null deviance: 103498  on 8557893  degrees of freedom
## Residual deviance: 103337  on 8557892  degrees of freedom
## AIC: -13510632
## 
## Number of Fisher Scoring iterations: 2
model_p_maf <- lm(-log10(p) ~ MAF, data = T1D)
summary(model_p_maf)
## 
## Call:
## lm(formula = -log10(p) ~ MAF, data = T1D)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.545 -0.368 -0.170  0.174 73.505 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.4817732  0.0003699 1302.47   <2e-16 ***
## MAF         0.1248719  0.0015382   81.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6409 on 8075026 degrees of freedom
##   (482866 observations deleted due to missingness)
## Multiple R-squared:  0.0008154,  Adjusted R-squared:  0.0008153 
## F-statistic:  6590 on 1 and 8075026 DF,  p-value: < 2.2e-16
model_annot_maf <- glm(annot ~ MAF, data = T1D)
summary(model_annot_maf)
## 
## Call:
## glm(formula = annot ~ MAF, data = T1D)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.01267  -0.01252  -0.01221  -0.01174   0.98889  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.269e-02  6.314e-05  201.02   <2e-16 ***
## MAF         -3.021e-03  2.626e-04  -11.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.01196961)
## 
##     Null deviance: 96656  on 8075027  degrees of freedom
## Residual deviance: 96655  on 8075026  degrees of freedom
##   (482866 observations deleted due to missingness)
## AIC: -12819163
## 
## Number of Fisher Scoring iterations: 2

Further stratification

I investigate whether including the additional confounding parameters changes the distribution of test statistics calculated under the null. To do this, I further stratify SNPs by gene proximity and MAF.

Note that this approach is similar to the GoShifter stratified enrichment method, which they use to test for enrichment in the presence of a colocalising annotation. [which makes sense as we are testing for enrichment in the presence of additional confounders and they are testing for enrichment in the presence of additional annotations].


Firstly, I further stratify the LD blocks into 3 bins for gene distance. The null distribution move slightly towards the test statistic, suggesting that the LD blocks alone do not capture enough of the confounding due to gene distance (probably because they’re too big).


Next, I further stratify the LD blocks into 3 bins for gene distance and 3 bins for MAF (a total of 9 bins). The distribution gets slightly tighter but the mean of the null test statistics stays roughly the same.

Note that these results (MAF isn’t needed that much) are supported in the literature. Trynka et al. (2015) showed that “including MAF in the matching parameters has no effect” (see Fig S4) and Iotchkova et al. (2019) showed that MAF is dropped from the model after including LD information.


Overall, accounting for LD makes the biggest difference, whilst additionally accounting for gene distance helps a little and MAF a little more.

Note that stratifying by MAF shifts the null left slightly. This can be explained by the following regression model, whereby gene proximity and MAF have opposite signs.

model_p_MAFgenedist <- lm(-log10(p) ~ MAF + abs(gene_distance), data = T1D)
summary(model_p_MAFgenedist)
## 
## Call:
## lm(formula = -log10(p) ~ MAF + abs(gene_distance), data = T1D)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.549 -0.367 -0.170  0.175 73.501 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.856e-01  3.809e-04 1275.11   <2e-16 ***
## MAF                 1.252e-01  1.538e-03   81.40   <2e-16 ***
## abs(gene_distance) -2.337e-07  5.495e-09  -42.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6409 on 8075025 degrees of freedom
##   (482866 observations deleted due to missingness)
## Multiple R-squared:  0.001039,   Adjusted R-squared:  0.001039 
## F-statistic:  4200 on 2 and 8075025 DF,  p-value: < 2.2e-16

Final method

For my analysis, I use T1D GWAS data from Cooper et al. (2017). This data is a reanalysis of the data from Barrett et al. (2009), with extra imputation of unmeasured genotypes using IMPUTE2. Briefly, genotypes were derived from the Afffymetrix GeneChip Mapping 500K Set and Illumina 550K Infinium microarray platforms for 5,913 T1D cases and 8,828 controls passing QC (Barrett et al. 2009). The samples were obtained from a meta-analysis of three genome-wide association analyses of T1D: (i) T1DGC (ii) GoKinD/NIMD (Cooper et al. 2008) and (iii) WTCCC . In Cooper et al. (2017), the USA samples were excluded and genomes were imputed using the 1000 Genomes Project phase III cohort as a reference.

For my analysis, I excluded SNPs residing in the MHC region of the genome. I then ran my SNP enrichment method, which is based upon the GoShifter methodology:

  1. Stratify SNPs by LD block, gene proximity and MAF.

    • For LD block, the LD detect genome partitioning was used to stratify SNPs into approximately independent genomic regions.
    • For gene proximity, SNPs were either classified as within a gene (gene distance = 0), close to a gene (gene distance was below the median of distances for SNPs not within genes) or far from a gene (gene distance was above the median of distances for SNPs not within genes). Gene locations were downloaded from GENCODE (V15; GRCh38) in GTF format and converted to sorted BED files using the BEDOPS gtf2bed function. The closest-feature utility, also in BEDOPS, was used to extract the closest gene and it’s distance in base pairs for each of the SNPs in the T1D GWAS data set.
    • For MAF, SNPs were classified as low frequency (MAF was below the median MAF) or higher frequency (MAF was above the median MAF). The SNPs for which the MAF was not avaliable were randomly allocated an MAF bin. MAFs were collected using the control sample MAFs for SNPs analysed on the Illumina or Affymetrix technologies (depending on data availability)
  2. Run GoShifter.R as an array job for the number of LD blocks to generate intermediate statistics for 1000 permutations of the annotation vector.

  3. Run makeStats.R to generate the final test statistics (a list of the overall observed test statistic and a vector of the overall test statistics computed under the null (10,000 of these)).


Final results

I have ran my method to test for T1D SNP enrichment using five data sets:

  1. IKZF1 ChIP-seq peaks in activated CD4+ T cells
  2. IKZF1 ChIP-seq peaks in LCLs (ENCSR441VHN)
  3. IKZF1 ChIP-seq peaks in LCLs (ENCSR874AFU)
  4. IKZF1 predicted binding sites in LCLs
  5. IKZF1 ChIP-seq peaks in liver (negative control)

In each of these, I convert the data to a binary SNP-peak overlap. I find that there is greatest enrichment in IKZF1 ChIP peaks in aCD4 cells and also strong enrichment in LCLs. Accordingly, there is anti-enrichment for T1D SNPs in IKZF1 ChIP peaks in liver cells, which makes sense as this tissue is not implicated in T1D.

The plot for predicted IKZF1 binding sites is interesting, as the null distribution is extremely variable. I think that this is due to the annotation, where there are only 208 annotation-SNP overlaps in the whole data set (208/8557894; 0.002%).

As in the GoShifter manuscript, we can also consider the delta-overlap value (observed test stat - mean test stat under null), but this doesn’t account for variability in the null distribution and the p-values are probably our best bet.


Next steps