1. IKZF1 genomic region


The IKZF1 gene (chr7:50,303,453-50,405,101) spans two ImmunoChip regions: Region 1 - chr7:50,207,118-50,325,210 and Region 2 - chr7:50,327,041-50,626,687. I have two T1D GWAS results for these ImmunoChip regions: Robertson et al. (2020) and Onengut et al. (2015). The Manhattan plots are similar for the two GWAS’.

Directly comparing the \(p\)-values for SNPs present in both the studies show that the resultant \(p\)-values are similar.


Fine-mapping

  • In Robertson, fine-mapping is performed using GUESSFM on the lead-variant plus all variants within 1.5Mb of the lead variant (750kb either side). This genomic region is huge and “would usually consist of the entirety of the ImmunoChip region and also any other variants directly genotyped on the ImmunoChip that were proximal to the ImmunoChip region”. In this region (chr7:50,221,511 - 51,009,338) they found evidence of two independent associations (with PPsum>0.99). The SNPs in each credible set are coloured, but only the red signal links to \(\textit{IKZF1}\) [the blue signal is very far from IKZF1 and is linked to GRB10/COBL].

  • In Onengut, fine-mapping is performed using the standard Bayesian approach on the lead-variant (reported as rs62447205) plus all variants within a 50-kb window. The genomic region is much smaller and maps to the red signal above (shown by the dotted vertical lines). Note that this region overlaps the whole IKZF1 gene body and has an overall L2G score for \(\textit{IKZF1}\) in Open Targets of 0.71 (next best is FIGNL1 with 0.058).

There is little agreement in the fine-mapping results for this region. Onengut picks out rs11770117 (this SNP is present in the Robertson GWAS but not in any credible sets), whilst Robertson picks out rs6944602 (this SNP is in the Onengut credible set but has PP=0.0031). These SNPs are just 12 bp away!

In Open Targets:


A larger T1D GWAS has just come out (Chiou et al. (2021)) for \(N0 = 470,876\), \(N1 = 18,803\) (\(N\approx 500,000\)) [cf. Robertson \(N0 = 25,386\), \(N1 = 16,159\) plus 6000 trio families so \(N \approx 60,000\)]. For each independent signal, they defined “credible set variants” as those in at least low LD (\(r>0.1\)) with the index variant and within a 5Mb window. They then used the standard Bayesian method for fine-mapping.

For the IKZF1 region, this larger study finds two signals. One upstream of that which we’re investigating here (chr7:50218638-50282196) and one in the specific region we’re looking at. But the prioritised SNPs are different again! They are rs10262731 (\(PP=0.127\)), rs28625633 (\(PP=0.112\)) and rs10236879 (\(PP=0.110\)), which are all eQTLs for \(\textit{IKZF1}\) in various immune cell types on Open Targets (again decreasing expression with \(p=8.7e-172\)). rs6944602 is in the credible set, but has \(PP=0.005972722\).


Summary

  • No conclusive fine-mapping results for the IKZF1 gene region.

  • The SNPs prioritised by Robertson and Chiou look the most promising as these are listed as eQTLs for IKZF1 on Open Targets.

  • This paper shows how difficult fine-mapping this region may be: https://www.mdpi.com/1422-0067/21/21/8383 (although for SLE).

  • Visualising all prioritised SNPs mentioned here on one plot (a Manhattan plot of the Robertson p-values; Chiou haven’t made their summary stats available yet):


2. IKZF1 ChIP data


aCD4

FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc) was used to check the quality of the raw ChIP-seq reads. ChIP-seq data sets were then mapped with bowtie2 (version 2.4.2) using addition flags –no-mixed and –no-discordant. The mapping was performed against the reference genome GRCh38.p13, obtained from the GENCODE project. The mapped data was converted to BAM format using samtools. SeqMonk (version 1.47.1) was used for peak calling, quantitation and annotation. Upon data input to SeqMonk, read lengths were extended by 200 bp and reads were filtered to only include high quality reads (mapping quality > 20). The MACS (Model-Based Analysis for ChIP-Seq) algorithm within SeqMonk was used for peak calling with p-value 1.0E−05, a sonicated fragment size of 300 bp and input sample as control. Genomic regions with coverage outliers were excluded (defined as those regions in the input dataset with a stringency of 3.0 above the median using the BoxWhisker filter tools).


  • 105,859 out of the 8,618,379 Cooper SNPs (1.2%) overlap an IKZF1 ChIP peak in activated CD4 T cells (on average 4.4 SNPs in each peak).

  • 25,960 out of the 715,023 Robertson SNPs (3.6%) overlap an IKZF1 ChIP peak in activated CD4 T cells (on average 6.8 SNPs in each peak).

The SNPs in the Cooper dataset that overlapped a peak were enriched for smaller GWAS \(p\)-values, but the SNPs in the Robertson dataset that overlapped were not. Why could this be? Why are SNPs mainly in ImmunoChip regions not enriched but those genome-wide are?


Publically available data

We used the ReMap2020 database (https://academic.oup.com/nar/article/48/D1/D180/5608991) to search for existing IKZF1 ChIP-seq data. Of the 9 publicly available IKZF1 ChIP-seq datasets, three were conducted in GM12878 (human B cell derived cell line), one in Hep-G2 (human liver cancer cell line), two in K-562 (human myelogenous leukaemia cell line), one in HSPC (hematopoietic stem and progenitor cells) and two in BCR-ABL1 (whereby the BCR/ABL gene is introduced into the TF-1 leukemia cell line to generate TF-1 BCR/ABL cells; https://www.ebi.ac.uk/ols/ontologies/bto/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBTO_0004761; https://pubmed.ncbi.nlm.nih.gov/18829496/). As T1D is an immune-mediated disease, only ChIP-Seq experiments performed in the GM12878 cell line were included:

  1. ENCSR000EUJ: Single-end ChIP for 2 isogenic replicates (2010): https://www.encodeproject.org/experiments/ENCSR000EUJ/.

    • Relatively old ChIP-seq data set (2010) and has ENCODE flags indicating that the data quality is low. Specifically, this data has warnings relating to the control data used: “control insufficient read depth”, “inconsistent control read length” (indicating the control used has very low sequencing depth and read length) and “missing controlled_by” (meaning that the control files are not matched to corresponding experimental files). There are also warnings relating to the experimental data: “low read length” (indicating that the sequencing read length is below the given threshold) and “low read depth” (indicating that the read depth of the alignment files is below the given threshold).
  2. ENCSR874AFU: Paired-end ChIP for 2 isogenic replicates (2016): https://www.encodeproject.org/experiments/ENCSR874AFU/.

  3. ENCSR441VHN: Paired-end ChIP for 2 isogenic replicates (2017): https://www.encodeproject.org/experiments/ENCSR441VHN/.

    • ENCSR874AFU and ENCSR441VHN are newer data sets (2016 and 2017 respectively) and the ENCODE audit suggests are of much better quality. There are however warnings for “mild to moderate bottlenecking” and “antibody characterised with exemption”. PCR bottlenecking is measured by the PCR bottleneck coefficient (PBC), defined as the fraction of genomic locations with exactly one unique read versus those covered by at least one unique read. “Mild to moderate” is usually nothing to worry about (see here). The antibody warning means that it did not pass its primary characterisation test (a western blot), but it did pass the second.

I proceed with the ENCSR874AFU and ENCSR441VHN data. I use the optimal IDR peaks, which are the largest set of peaks derived from IDR analysis of biological replicates and pseudoreplicates (“a subsample of reads, chosen without replacement, from a single replicate used as a substitute for replication in the absence of true biological replicates”) and is the most sensitive set of peaks. “Peaks in the optimal set can be interpreted as high-confidence peaks, representing reproducible events and accounting for read sampling noise.”


The peaks identified for the two B-cell analyses are similar. Analysing them both separately for now, I find that T1D SNPs with smaller \(p\)-values are enriched in Ikaros binding sites in B-cells for both GWAS data sets.

  • In the first B-cell data set (ENCSR874AFU), 80,678 out of 8,618,379 Cooper SNPs (0.94%) overlap a ChIP peak whilst 16,349 out of 715,023 Robertson SNPs (2.3%) overlap a ChIP peak.

  • In the second B-cell data set (ENCSR441VHN), 82,649 out of 8,618,379 Cooper SNPs (0.96%) overlap a ChIP peak whilst 17,906 out of 715,023 Robertson SNPs (2.5%) overlap a ChIP peak.


For the SNPs that overlap a ChIP peak, there is more similarity between the B-cell data sets than the aCD4 data set, which is reassuring.


3. Predicted IKZF1 binding sites


The Funk et al. (2020) data contains predicted TF binding sites in lymphoblast tissues. They used 21 biosamples (mixture of FAIRE-seq and DNase-seq) in LCLs from B-lymphocytes to predict regions of the DNA that are bound by a protein (using the HINT algorithm; DNA cutting enzymes in FAIRE-seq and DNase-seq cannot cut where a protein is bound). They then used the Find Individual Motif Occurrences (FIMO) method to predict which TF was most likely to bind at each footprint.

I limit the predicted binding sites to those with a HINT20 score \(>200\) as recommended. For duplicates, I keep the row with the highest HINT20 score.


I reason that if a SNP both (1) overlaps a predicted IKZF1 binding site and (2) there is a IKZF1 ChIP-seq peak in this region then this is strong evidence that IKZF1 binds here and this SNP could be altering its binding ability. I.e. this analysis is concerned with finding trans-effects that IKZF1 may have.

Focussing on the Robertson SNPs for now (above I shown that IKZF1 may function through B-cells rather than aCD4 cells for diabetes pathogenesis - at least for the Robertson SNPs), 50 of these overlap an IKZF1 predicted binding site. The Venn diagram below shows the number of these which overlap a ChIP peak in my 3 ChIP-seq data sets.

I reason that since the predicted IKZF1 binding sites are for B-cells, I should be focussing on those SNPs which overlap a ChIP-peak in the B-cell ChIP data, specifically those overlapping a peak in both the B-cell data sets. Of these 23 SNPs, only one of these is genome-wide significant in the Robertson GWAS (rs2295665; \(p=1.704e-18\)). This SNP is in the MHC region of the genome and has lots of eQTL evidence for lots of genes on Open Targets (https://genetics.opentargets.org/variant/6_31664909_C_T).

SNP rs1633081 is just below GWS (\(2.087e-07\)) but overlaps a predicted IKZF1 binding site and falls in a IKZF1 ChIP peak in both the B-cell data sets (it is also in the MHC). This SNP has eQTL evidence for lots of SNPs (e.g. \(\textit{ZFP57, MOG, HLA-G, GABBR1, HLA-F, HCG9}\) etc.). Interestingly, there is eQTL evidence for rs1633081 and \(\textit{ZFP57}\) in various cell-types, including whole blood, whereby the SNP increases \(\textit{ZFP57}\) expression. Recessive mutations in \(\textit{ZFP57}\) account for \(\sim 5\%\) of transient neonatal diabetes mellitus 1 (TNDM1) cases (a crazy form a diabetes which is present at birth, then sorts itself out, then comes back intermittently). Possible mechanism: rs1633081 prevents IKZF1 binding in B-cells and increases \(\textit{ZFP57}\) expression (this gene is involved in genomic imprinting).

So I have information suggesting that:

  1. rs1633081 falls in a IKZF1 binding site in B-cells (to show that it actually disrupts IKZF1 binding I’d need some extra experimental data). This SNP has reference allele G and alternative allele T in the third position of the IKZF1 binding motif - looking on HOCOMOCO, this looks like it could have a big effect on binding (https://hocomoco10.autosome.ru/motif/IKZF1_HUMAN.H10MO.C).

  2. rs1633081 is an eQTL for various genes, including some which have been linked to diabetes. Should we look for physical interaction? What if it has a trans effect?

But it is not immediately apparent how these two separate bits of information come together. For example, would need to show that an IKZF1 knockout phenocopies rs1633081. Would also potentially have to show some physical interaction of rs1633081 and the target gene (e.g. CHi-C).


Final comments and queries

  • Note that the DEXI SNP I mentioned last week (rs12922733) is not in a ChIP peak in any of the data sets.

  • Alex Marson has lots of data on CRISPR in primary human T cells.

  • Using causalDB rs1633081 is only found in the FINEMAP 95% credible set for UC and bread intake… (http://mulinlab.org/causaldb/index.html?q=rs1633081&t=0).

  • Note that Ikaros can function a transcriptional repressor of its target genes through both HDAC-dependent and HDAC-independent mechanisms. “Ikaros is abundantly localized in pericentromeric heterochromatin in the nucleus. Experiments by several groups have shown that Ikaros regulates expression of its target genes by recruiting them to pericentromeric heterochromatin, resulting in their activation or repression”. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3243972/)