The IKZF1 gene (chr7:50,303,453-50,405,101) spans two ImmunoChip regions: Region 1 - chr7:50,207,118-50,325,210 and Region 2 - chr7:50,327,041-50,626,687. I have two T1D GWAS results for these ImmunoChip regions: Robertson et al. (2020) and Onengut et al. (2015). The Manhattan plots are similar for the two GWAS’.
Directly comparing the \(p\)-values for SNPs present in both the studies show that the resultant \(p\)-values are similar.
PPsum>0.99
). The SNPs in each credible set are coloured, but only the red signal links to \(\textit{IKZF1}\) [the blue signal is very far from IKZF1 and is linked to GRB10/COBL].There is little agreement in the fine-mapping results for this region. Onengut picks out rs11770117 (this SNP is present in the Robertson GWAS but not in any credible sets), whilst Robertson picks out rs6944602 (this SNP is in the Onengut credible set but has PP=0.0031). These SNPs are just 12 bp away!
In Open Targets:
rs11770117 (and rs12719030 and rs11764792 which also have high PPs in Onengut) have lots of eQTL evidence for FIGNL1 (see https://genetics.opentargets.org/variant/7_50406065_A_T).
rs6944602 (and rs62447208 which also has a high PP in Robertson) has lots of eQTL evidence for IKZF1 (see https://genetics.opentargets.org/variant/7_50406053_G_A). It decreases IKZF1 expression in whole blood (\(beta=-0.44\), \(p=3.3e-310\)) and immune related cell types (including B cells and T cells).
A larger T1D GWAS has just come out (Chiou et al. (2021)) for \(N0 = 470,876\), \(N1 = 18,803\) (\(N\approx 500,000\)) [cf. Robertson \(N0 = 25,386\), \(N1 = 16,159\) plus 6000 trio families so \(N \approx 60,000\)]. For each independent signal, they defined “credible set variants” as those in at least low LD (\(r>0.1\)) with the index variant and within a 5Mb window. They then used the standard Bayesian method for fine-mapping.
For the IKZF1 region, this larger study finds two signals. One upstream of that which we’re investigating here (chr7:50218638-50282196) and one in the specific region we’re looking at. But the prioritised SNPs are different again! They are rs10262731 (\(PP=0.127\)), rs28625633 (\(PP=0.112\)) and rs10236879 (\(PP=0.110\)), which are all eQTLs for \(\textit{IKZF1}\) in various immune cell types on Open Targets (again decreasing expression with \(p=8.7e-172\)). rs6944602 is in the credible set, but has \(PP=0.005972722\).
No conclusive fine-mapping results for the IKZF1 gene region.
The SNPs prioritised by Robertson and Chiou look the most promising as these are listed as eQTLs for IKZF1 on Open Targets.
This paper shows how difficult fine-mapping this region may be: https://www.mdpi.com/1422-0067/21/21/8383 (although for SLE).
Visualising all prioritised SNPs mentioned here on one plot (a Manhattan plot of the Robertson p-values; Chiou haven’t made their summary stats available yet):
FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc) was used to check the quality of the raw ChIP-seq reads. ChIP-seq data sets were then mapped with bowtie2 (version 2.4.2) using addition flags –no-mixed and –no-discordant. The mapping was performed against the reference genome GRCh38.p13, obtained from the GENCODE project. The mapped data was converted to BAM format using samtools. SeqMonk (version 1.47.1) was used for peak calling, quantitation and annotation. Upon data input to SeqMonk, read lengths were extended by 200 bp and reads were filtered to only include high quality reads (mapping quality > 20). The MACS (Model-Based Analysis for ChIP-Seq) algorithm within SeqMonk was used for peak calling with p-value 1.0E−05, a sonicated fragment size of 300 bp and input sample as control. Genomic regions with coverage outliers were excluded (defined as those regions in the input dataset with a stringency of 3.0 above the median using the BoxWhisker filter tools).
105,859 out of the 8,618,379 Cooper SNPs (1.2%) overlap an IKZF1 ChIP peak in activated CD4 T cells (on average 4.4 SNPs in each peak).
25,960 out of the 715,023 Robertson SNPs (3.6%) overlap an IKZF1 ChIP peak in activated CD4 T cells (on average 6.8 SNPs in each peak).
The SNPs in the Cooper dataset that overlapped a peak were enriched for smaller GWAS \(p\)-values, but the SNPs in the Robertson dataset that overlapped were not. Why could this be? Why are SNPs mainly in ImmunoChip regions not enriched but those genome-wide are?
We used the ReMap2020 database (https://academic.oup.com/nar/article/48/D1/D180/5608991) to search for existing IKZF1 ChIP-seq data. Of the 9 publicly available IKZF1 ChIP-seq datasets, three were conducted in GM12878 (human B cell derived cell line), one in Hep-G2 (human liver cancer cell line), two in K-562 (human myelogenous leukaemia cell line), one in HSPC (hematopoietic stem and progenitor cells) and two in BCR-ABL1 (whereby the BCR/ABL gene is introduced into the TF-1 leukemia cell line to generate TF-1 BCR/ABL cells; https://www.ebi.ac.uk/ols/ontologies/bto/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBTO_0004761; https://pubmed.ncbi.nlm.nih.gov/18829496/). As T1D is an immune-mediated disease, only ChIP-Seq experiments performed in the GM12878 cell line were included:
ENCSR000EUJ: Single-end ChIP for 2 isogenic replicates (2010): https://www.encodeproject.org/experiments/ENCSR000EUJ/.
ENCSR874AFU: Paired-end ChIP for 2 isogenic replicates (2016): https://www.encodeproject.org/experiments/ENCSR874AFU/.
ENCSR441VHN: Paired-end ChIP for 2 isogenic replicates (2017): https://www.encodeproject.org/experiments/ENCSR441VHN/.
I proceed with the ENCSR874AFU and ENCSR441VHN data. I use the optimal IDR peaks, which are the largest set of peaks derived from IDR analysis of biological replicates and pseudoreplicates (“a subsample of reads, chosen without replacement, from a single replicate used as a substitute for replication in the absence of true biological replicates”) and is the most sensitive set of peaks. “Peaks in the optimal set can be interpreted as high-confidence peaks, representing reproducible events and accounting for read sampling noise.”
The peaks identified for the two B-cell analyses are similar. Analysing them both separately for now, I find that T1D SNPs with smaller \(p\)-values are enriched in Ikaros binding sites in B-cells for both GWAS data sets.
For the SNPs that overlap a ChIP peak, there is more similarity between the B-cell data sets than the aCD4 data set, which is reassuring.
The Funk et al. (2020) data contains predicted TF binding sites in lymphoblast tissues. They used 21 biosamples (mixture of FAIRE-seq and DNase-seq) in LCLs from B-lymphocytes to predict regions of the DNA that are bound by a protein (using the HINT algorithm; DNA cutting enzymes in FAIRE-seq and DNase-seq cannot cut where a protein is bound). They then used the Find Individual Motif Occurrences (FIMO) method to predict which TF was most likely to bind at each footprint.
I limit the predicted binding sites to those with a HINT20 score \(>200\) as recommended. For duplicates, I keep the row with the highest HINT20 score.
I reason that if a SNP both (1) overlaps a predicted IKZF1 binding site and (2) there is a IKZF1 ChIP-seq peak in this region then this is strong evidence that IKZF1 binds here and this SNP could be altering its binding ability. I.e. this analysis is concerned with finding trans-effects that IKZF1 may have.
Focussing on the Robertson SNPs for now (above I shown that IKZF1 may function through B-cells rather than aCD4 cells for diabetes pathogenesis - at least for the Robertson SNPs), 50 of these overlap an IKZF1 predicted binding site. The Venn diagram below shows the number of these which overlap a ChIP peak in my 3 ChIP-seq data sets.
I reason that since the predicted IKZF1 binding sites are for B-cells, I should be focussing on those SNPs which overlap a ChIP-peak in the B-cell ChIP data, specifically those overlapping a peak in both the B-cell data sets. Of these 23 SNPs, only one of these is genome-wide significant in the Robertson GWAS (rs2295665; \(p=1.704e-18\)). This SNP is in the MHC region of the genome and has lots of eQTL evidence for lots of genes on Open Targets (https://genetics.opentargets.org/variant/6_31664909_C_T).
SNP rs1633081 is just below GWS (\(2.087e-07\)) but overlaps a predicted IKZF1 binding site and falls in a IKZF1 ChIP peak in both the B-cell data sets (it is also in the MHC). This SNP has eQTL evidence for lots of SNPs (e.g. \(\textit{ZFP57, MOG, HLA-G, GABBR1, HLA-F, HCG9}\) etc.). Interestingly, there is eQTL evidence for rs1633081 and \(\textit{ZFP57}\) in various cell-types, including whole blood, whereby the SNP increases \(\textit{ZFP57}\) expression. Recessive mutations in \(\textit{ZFP57}\) account for \(\sim 5\%\) of transient neonatal diabetes mellitus 1 (TNDM1) cases (a crazy form a diabetes which is present at birth, then sorts itself out, then comes back intermittently). Possible mechanism: rs1633081 prevents IKZF1 binding in B-cells and increases \(\textit{ZFP57}\) expression (this gene is involved in genomic imprinting).
So I have information suggesting that:
rs1633081 falls in a IKZF1 binding site in B-cells (to show that it actually disrupts IKZF1 binding I’d need some extra experimental data). This SNP has reference allele G and alternative allele T in the third position of the IKZF1 binding motif - looking on HOCOMOCO, this looks like it could have a big effect on binding (https://hocomoco10.autosome.ru/motif/IKZF1_HUMAN.H10MO.C).
rs1633081 is an eQTL for various genes, including some which have been linked to diabetes. Should we look for physical interaction? What if it has a trans effect?
But it is not immediately apparent how these two separate bits of information come together. For example, would need to show that an IKZF1 knockout phenocopies rs1633081. Would also potentially have to show some physical interaction of rs1633081 and the target gene (e.g. CHi-C).
Note that the DEXI SNP I mentioned last week (rs12922733) is not in a ChIP peak in any of the data sets.
Alex Marson has lots of data on CRISPR in primary human T cells.
Using causalDB rs1633081 is only found in the FINEMAP 95% credible set for UC and bread intake… (http://mulinlab.org/causaldb/index.html?q=rs1633081&t=0).
Note that Ikaros can function a transcriptional repressor of its target genes through both HDAC-dependent and HDAC-independent mechanisms. “Ikaros is abundantly localized in pericentromeric heterochromatin in the nucleus. Experiments by several groups have shown that Ikaros regulates expression of its target genes by recruiting them to pericentromeric heterochromatin, resulting in their activation or repression”. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3243972/)