1. GWAS meta-analysis


I have written the equation for SNP correlations in a GWAS meta-analysis in terms of sample sizes and MAFs to investigate what impact these quantities have on the estimate.

Firstly, I first fix all minor allele frequencies to 0.5. In the plot below, we see that as the sample size for study 1 dominates, r.meta approaches r1 (0.5 in first facet, 0.9 in second facet).


To investigate the impact that MAF has, I instead fix sample size and vary MAFs - assuming for now that the MAFs for SNP A and SNP B are the same (MAF_A1=MAF_B1). We see that if the SNPs are more common in study 1, then r.meta approaches r1 again.

Putting this together (for r1=0.9), we see that if the SNPs are more common in study 1 than study 2, then r.meta approaches r1 quicker as the sample size of r1 increases.



2. ChIP-seq data


I have three datasets for comparison:

  1. Predicted IKZF1 binding sites from (Funk et al. 2020): (HINT20 score > 200 and for duplicates I’ve used the biggest score).

  2. Empirical IKZF1 ChIP-seq data in B-lymphocytes (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE105587): IDR peaks.

  3. Empirical IKZF1 ChIP data in activated CD4+ T cells analysed by me.


Encouragingly, the distribution of peaks across chromosomes (excluding X and Y) is very similar for the three data sets, especially for the two ChIP data sets.

I’ve also visualised the data in SeqMonk and the peak regions are similar.


Overlapping SNPs to ChIP peaks


I have two T1D GWAS data sets:

  1. Robertson et al. (2020):
    • Genotyped 140,000 ImmunoChip variants in 61,427 participants and then imputed up to 715,000 SNPs
    • Fine-mapped 52 ImmunoChip regions that reached genome-wide significance in the European case-control data and then did trans-ethnic fine-mapping on the regions with evidence of a single CV using PAINTOR.
    • Note that when defining regions to fine-map, they took the lead variant and examined all variants that were included with 1.5Mb of the lead variant (750kb either side) which usually consisted of the whole ImmunoChip region and any other proximal SNPs.
  2. Cooper et al. (2017):
    • Genotyped variants genome-wide in 6000 cases and 9000 controls then imputed.
    • Chris has fine-mapped the whole genome (defining regions by LD detect regions) with SuSiE.
    • I.e. have GWAS \(p\)-values (imputed) for 8,618,379 SNPs and GUESSFM fine-mapping results for 284,262 SNPs across 44 LD detect regions with evidence of an association (\(p<1e-06\)).

Only 288,572 SNPs are found in both data sets. I.e. 40% of the SNPs from Robertson et al. are also in the Cooper data set.


Overlapping Robertson et al. SNPs

I first overlap the SNPs in the Robertson et al. GWAS data set to (i) IKZF1 ChIP peaks in B-lymphocytes and (ii) IKZF1 ChIP peaks in activated CD4+ T cells.

Note that some SNPs overlap multiple peaks in the B-lymphocyte data (i.e. this data set consists of overlapping peaks - perhaps due to the IDR procedure?). I remove these for now.

Of the 715,023 SNPs, 17906 (2.5%) mapped to a peak in the B-lymphocytes data and 16168 (2.3%) mapped to a peak in the CD4 data.


Overlapping Cooper et al. SNPs

I now overlap the Cooper et al. SNPs. Note that this data is imputed genome-wide (compared to the Robertson data which is mainly focused on ImmunoChip regions).

Of the 8,618,379 SNPs, 83,002 (0.96%) mapped to a peak in the B-lymphocytes data and 74,657 (0.87%) mapped to a peak in the CD4 data


Finally, I use a Venn diagram to investigate the sharing of SNPs that overlap an IKZF1 ChIP peak between the various data sets.


Hmm, why are SNPs with small T1D \(p\)-values enriched in IKZF1 binding sites in B-lymphocytes but not in activated CD4+ T cells?

My plan was to robustly show that T1D SNPs are enriched in IKZF1 binding sites, but the boxplots and QQ plots show that this may not be the case for activated CD4+ T cells..?

Anyhow, I was planning on using GoShifter or GARFIELD but I’m moving away from these as they use the LD + friends method for “associated variants” and require LD info which I’d have to find, potentially loosing loads of SNPs when matching.

Given that we have credible sets of SNPs, I think it would be best to use these, only I can’t find a method that does this. Reading the Onengut paper (“enrichment analysis” section: https://www.nature.com/articles/ng.3245#Sec2), they stratified SNPs into (A) credible-set SNPs and (B) non credible set SNPs and formed 2*2 contingency tables for overlap with annotation, stratified by MAF. Followed by some trend test to test for association. I like the idea of doing something similar to this but am wary that this doesn’t account for other confounders such as gene-based confounders (e.g. distance to gene/ gene density) - but perhaps these will be incorporated in the MAF stratification?


Predicted IKZF1 binding sites

I also have the predicted IKZF1 binding sites in lymphoblast tissue from Funk et al. (2020). Note that these predicted sites are more specific than the ChIP data as they are exactly the IKZF1 DNA binding motifs (i.e. instances of the IKZF1 motif in DNase accessible regions).

50 Roberston SNPs and 222 Cooper SNPs (intersection is 21 SNPs) fall into a predicted IKZF1 binding site. As these SNPs fall exactly in the 7bp DNA binding motif for IKZF1 they could be functionally important.

Only 2 of the 50 Robertson SNPs (rs12922733 and rs34723737) are found in fine-mapped credible sets, but these have negligible PPs (6e-3 and 1e-2 respectively). Although, from searching the literature:

  • rs12922733 is an eQTL in blood for DEXI and SOCS1 (from eQTLGen). Whilst Bonilla and colleagues (2019) found that Dexi does not affect disease risk in the non-obese diabetic mouse model, SOCS1 is a suppressor of cytokine signaling important for immune cell homeostasis and regulation of inflammation.

  • rs34723737 does not reach GWS in the Robertson data and hasn’t been significantly associated to any other phenotypes. Although there is evidence that this SNP is an eQTL for UBASH3A (rs34723737 is an intron variant in UBASH3A) in blood and T cells (Blueprint). This gene promotes accumulation of activated target receptors, such as T-cell receptors. Ge and colleagues (2017) explain how UBASH3A Mediates Risk for T1D. Although they identify rs11203203 and rs80054410 as the risk alleles in this locus (rs80054410 is ~500 bp downstream of rs34723737 and rs11203203 is ~200 bp downstream of rs80054410).

Only 23 of the 222 Cooper SNPs are found in fine-mapped regions but all of these have negligible PPs (\(\leq 0.0002172675\)) and are not found in the credible sets.


Fine-mapping IKZF1


I compare the fine-mapping results for the genomic regions containing \(\textit{IKZF1}\) from the Robertson and Cooper GWAS data.

Firstly, Robertson et al. fine-map the ImmunoChip region containing \(\textit{IKZF1}\) (chr7:50221511-51009338) using GUESSFM. Recall that when defining regions to fine-map, they took the lead variant and examined all variants that were included with 1.5Mb of the lead variant (750kb either side) (which usually consisted of the whole ImmunoChip region and any other proximal SNPs).

The Manhattan plot of the ImmunoChip region containing IKZF1 (chr7:50221511-51009338) is shown below with the two credible sets identified from GUESSFM shown in red and blue. This region encompasses IKZF1 (left-hand side) and also GRB10 (glucose receptor; right-hand side). In the Robertson et al. paper, they use Ensembl and RefSeq to annotate \(IKZF1\) as the target gene for signal 1 (red SNPs) and \(GRB10\) as the target gene for signal 2 (blue SNPs).


Chris has fine-mapped the LD detect region that contains \(\textit{IKZF1}\) (chr7:49172682-51607626) using SuSiE. The fine-mapping results are very similar to those for the corresponding ImmunoChip region using GUESSFM.


The PP plots for the GUESSFM fine-mapping are shown below:

For signal 1, rs6944602 has the highest PP and is an eQTL for \(\textit{IKZF1}\) in relevant cell types (https://genetics.opentargets.org/variant/7_50406053_G_A). For example, it is an eQTL for \(\textit{IKZF1}\) in CD4 and CD8 T cells (CEDAR).

I’ve also labelled rs10272724 which Chris and colleagues found to be protective for T1D (https://diabetes.diabetesjournals.org/content/60/3/1041) in red. I wonder why it doesn’t have a large PP?


And the PP plots for the SuSiE fine-mapping are shown below. But rs6944602 is not present in either of the credible sets from SuSiE (pip=0.005273756). Again, I’ve labelled rs10272724 (protective for T1D)


Perhaps I could run a formal meta-analysis fine-mapping method for this region. Although I think this means I could only use the SNPs present in both data sets (or impute the SNPs?). E.g. if using the LD detect region containing \(\textit{IKZF1}\) as the fine-mapping region, then only 3508 SNPs are present in both data sets.


The fine-mapping analysis is important because I’d like to search for SNP-SNP interactions between the likely causal SNP in the gene body (rs6944602?) and SNPs which fall in IKZF1 ChIP peaks/ predicted binding sites. I.e looking for SNP interactions between SNPs in and those 250 SNPs in predicted IKZF1 binding sites. [Useful review article on SNP interactions: https://academic.oup.com/bfg/article/14/2/143/257445].

Not sure how to do this, think I need access to the genotypes for the following regression model:

\[\begin{equation} outcome \sim SNP1 + SNP2 + SNP1:SNP2 \end{equation}\]


Comments and questions

  • What is the citation for deriving V?

  • Your SuSiE results for James. Will there be new results for the Cooper T1D data too?

  • Good citation to show that people are thinking about statistical ways to use correlated vectors (https://arxiv.org/abs/2011.14996), e.g. for leveraging various external data in cFDR.