18th Feb

Introduction

Position weight matrices (PWM) are a widely adopted approach to represent motifs in biological sequences (Wasserman and Sandelin (2004)) with rows for each symbol in the relevant alphabet (for example, 4 rows for A, T, C and G nucleotides) and columns for each position in the motif (for example a transcription factor binding motif) (Stormo et al. (1982)). They are generally constructed using SELEX experimental data, which identify the DNA sequences that bind the desired target . Using these, a position frequency matrix (PFM) is constructed by counting the occurrences of each nucleotide at each position of the DNA sequences. Each column of the PWM is then normalized to generate the position probability matrix (PPM), and the final PWM is then obtained by logarithmic transformations of the PPM divided by the nucleotides’ background probabilities. To avoid bias due to small sample sizes such as PPM entries having a value of 0 by chance (leading to negative infinities in the PWM), pseudocounts are generally added to each element of the PFM. Motif matrices are typically visualised as sequence logos (see IKZF1 sequence logo below).

PWMs are a useful tool to predict where a transcription factor may bind in the genome, but are based only on the frequency of nucleotides and therefore do not provide any information on the direction of effect of mutations within the motif (Weirauch et al. (2013)). There have been many methods developed to predict the effects of mutations within a motif (get some citations). For example, the Intragenomics Replicates (IGR) method scores SNPs in a TFBS based on their modulation in affinity using ChIP-chip or ChIP-seq data (Cowper-Sal.lari et al. (2012)). Briefly, \(k\)-mers containing the SNP are searched genome-wide and ChIP-based data is overlaid to investigate how that \(k\)-mer alters the binding of the transcription factor of interest. The method can be extended so that only genomic regions favorable to binding in the relevant cell type are searched (for example DNaseI hypersensitive regions).

The SEMpl method (Nishizaki et al. (2020)) is essentially a generalised version of the IGR method that can be used to quantify the magnitude and direct of effect of all possible SNPs within a TFBS. Firstly, the PWM for the transcription factor of interest is used to enumerate all \(k\)-mers for which the TF has an increased likelihood of binding using the TFM-PVALUE method with \(p<4^{-5}\) (Touzet and Varré (2007)). Then, all possible SNPs are simulated for each \(k\)-mer to create lists of mutated \(k\)-mers. Each of these mutated \(k\)-mers are then aligned to the reference human genome in region of open chromatin in the relevant cell type using Bowtie. A ``ChIP-seq score’’ is calculated for each mutated \(k\)-mer as the highest ChIP-seq signal value over the region 50 bp before and after the aligned site in the relevant cell type. Finally, the SEM score for each position is the \(log2\) of the average ChIP-seq signal to endogenous signal ratio for the mapped \(k\)-mers for each mutated \(k\)-mer list. This is repeated using different \(p\)-value thresholds in the TFM-PVALUE method until convergence using an EM-like method (this corrects for differences arising from unique starting \(k\)-mers).

The SEMpl method therefore requires (i) a PWM for the TF of interest (ii) DNase-seq data in the relevant cell type and (iii) ChIP-seq data for the TF of interest in the relevant cell type. The SEMpl method will help me quantify the effect of any mutations in the IKZF1 binding motif, for example indicating whether a SNP is likely to increase, decrease or totally ablate IKZF1 binding.

I applied the SEMpl approach to predict the effect of SNPs within the IKZF1 binding motif in LCLs. I chose LCLs as the relevant cell type due to the availability of IKZF1 ChIP data-seq and DNase-seq data from ENCODE in the formats required for SEMpl. I downloaded the DNase-seq data from ENCODE (accession number: ENCFF001UVC) and the ChIP-seq data also from ENCODE (accession number: ENCSR441VHN), using the bigWig files for fold change over control (ENCFF405CHV) and signal p-value (ENCFF176OVL) in two separate analyses as the software does not specify which should be used. I downloaded the PFM for IKZF1 from hocomoco (Kulakovskiy et al. (2018)) (model: IKZF1_HUMAN.H11MO.0.C) and used the `universalmotif’ R package to convert the motif from hocomoco to transfac format. I also converted the numerical values to integers, as required by the SEMpl software.

SEMpl results

The sequence logo for IKZF1:

For the SEM plots below:

Solid gray line represents endogenous binding.
Dashed gray line represents scrambled background.
They define anything above the solid gray line as predicted to increase binding on average, anything below the two lines as decreasing average binding and anything falling below the dashed gray line as ablating binding, on average.

Signal p-value ChIP data

When using the ChIP-seq data listing signal \(p\)-values, the signal to background t-test (-log10 p) is not significant (=1.57964, but they use 2.5 to call significant hits). This is the average of 100 t-tests from 1000 randomly chosen kmers from the SNP signal files versus 1000 randomly chosen kmers from the scrambled signal file. “This threshold was chosen by meta-analysis to reduce the prevalence of spurious prediction models arising from poor staring ChIP-seq or PWM data.”

Fold change over control ChIP data

When using the ChIP-seq data listing the fold change over control ChIP data, the signal to background t-test (-log10 p) is significant (=3.5). But the SEM plot looks very different to that above, showing that the SEMpl method is very sensitive to ChIP-seq data input (the same PWM and DNase-seq data was used).

These results don’t really make sense though, as it can’t be that G in position 4 ablates binding when this is the consensus nucleotide in this position…

Case-only test for interaction

I am going to run a case-only test for interaction on a pre-selected set of SNPs which look to be potentially functionally interesting. Specifically, I will be testing for interaction between the likely causal variant in the IKZF1 gene region and variants which overlap IKZF1 binding sites in relevant cell types. The idea is that if, say, less IKZF1 is being produced anyway (due to a mutation in the IKZF1 gene region) and it is more difficult for IKZF1 to bind (due to a mutation in an IKZF1 binding motif), then this may implicate T1D.

I therefore need to decide on (1) the causal SNP(s) in the IKZF1 gene region and (2) the potentially interacting SNPs.

Unfortunately for the former, I have found inconclusive fine-mapping results for the IKZF1 region from 5 studies. I originally thought that rs6944602 from Robertson et al. (2020) may be the best bet as this has the highest signal. This SNP decreases \(\textit{IKZF1}\) expression in immune relevant cell types (https://genetics.opentargets.org/variant/7_50406053_G_A). Gant et al. (2021) have characterised the IKZF1 locus and found that homozygous microdeletions in rs62445866, rs17133807 and rs6944602 significantly decreased IKZF1 mRNA levels in LCLs. They also found that rs17133807 maps to a conserved enhancer and interacts with the IKZF1 promoter in murine lymphoid cells. In Robertson, rs62445866 and rs17133807 have very small PPs. Alternatively, the MFM results prioritise rs62445866 and rs17133807, but not rs6944602 (see below). [rs62445866 and rs17133807 \(r^2=0.97\); rs6944602 and rs62445866 \(r^2=0.63\); rs6944602 and rs17133807 \(r^2=0.63\)].

For the potentially interacting SNPs, I consider variants overlapping IKZF1 ChIP peaks in aCD4 and/or LCLs and/or predicted IKZF1 binding sites in LCLs.

Of the 8 million SNPs in the Cooper et al. (2017) data set, 52 overlap a peak in all of the data sets. Only 3 of these have T1D \(p<1e-05\), two in the MHC and rs1534430 (chr2:12504610) which is strongly associated with hypothyroidism. These SNPs reside on the reverse DNA strand but it is not immediately clear how to interpret the SEMpl output for the reverse DNA strand. From reading, it seems that the binding of TFs is not strand-specific but perhaps I have to run SEMpl on the reverse complement of the IKZF1 sequence logo?

I decide to include all SNPs which overlap a predicted IKZF1 binding site and at least 1 of the ChIP data sets (160 of these) for my case-only test for interaction. I could also further stratify by maximum T1D p-value for inclusion (although using \(p<0.05\) only leaves 21 SNPs).

An example is SNP rs1633081 (chr6:29749124) which is just below GWS (\(2.087e−07\)) but overlaps a predicted IKZF1 binding site and falls in a IKZF1 ChIP peak in both the B-cell data sets. This SNP changes G>T at position 3 of the IKZF1 binding motif, which SEMpl suggests would ablate IKZF1 binding (in both SEMpl results). This SNP has eQTL evidence for loads of genes and so the prevention of IKZF1 binding could implicate all of these genes’ expression. Specifically, it has eQTL evidence for \(\textit{ZFP57}\) in whole blood (SNP increases expression) and \(\textit{ZFP57}\) is strongly implicated in transient neonatal diabetes (https://care.diabetesjournals.org/content/36/3/505) (note that IKZF1 can both positively and negatively regulate gene expression).

Method

\(H_0\): Genotypes at loci 1 and 2 are independent (i.e. there is no correlation between genotypes at the two loci). \(H_1\): Genotypes at loci 1 and 2 are not independent.

To test this hypothesis, I could use a \(\chi^2\) test for independence using genotype counts, however the genotype data is imputed and thus I only have posterior probabilities for each of the genotypes (0, 1 or 2). I could instead calculate the expected genotype for each sample, and run some statistical test for independence for the two vectors of expected genotypes at the two loci. Caveats to consider: may not have an equal number of samples at each locus and the expected genotypes may not be normally distributed (so can’t use e.g. a t-test). Could potentially use a Mann-Whitney U test?

There is also the complication with the SNPs being genotyped/imputed using different arrays (affymetric for WTCCC and illumina for T1DGC). This means that for each SNP I’ll have expected genotypes from either the affymetric or illumina chip, or both.

As a quick check, I’ve ran a Mann-whitney U test comparing expected genotypes at rs6944602 and rs1633081 in cases (genotyped on the illumina platform only for now) which gives \(p<2.2e-16\). But I’ve also checked for interaction in controls and this also gives \(p<2.2e-16\). (note correlation of expected genotypes in cases is -0.01559424 and in controls it is 0.008361651). Actually, I don’t think this sort of statistical test is correct… think I may need a regression approach.

Comments and notes

Note that there is now SEMplMe which includes the effect of DNA methylation, but I don’t think this is really needed here (https://www.biorxiv.org/content/10.1101/2020.08.13.250118v1 and https://github.com/Boyle-Lab/SEMplMe).
The IKZF1 motif differs on Jaspar and hocomoco. I’m not sure why this is (they don’t really make it clear what data they use to generate the motifs), but SEMpl is pretty robust to differences in the PWM (see Supp Fig 1) and the hocomoco motif I’m using seems to be the one generally used.
Where to go after the case-only test for interaction? (I presume this won’t find anything).
t1dgc-chr6.RData missing in rds/rds-cew54-wallace-share/Data/GWAS/t1d-barrett-imputed-1kg-may2018

References

Cowper-Sal.lari, Richard, Xiaoyang Zhang, Jason B. Wright, Swneke D. Bailey, Michael D. Cole, Jerome Eeckhoute, Jason H. Moore, and Mathieu Lupien. 2012. “Breast Cancer RiskAssociated SNPs Modulate the Affinity of Chromatin for FOXA1 and Alter Gene Expression.” Nature Genetics 44 (11): 1191–8. https://doi.org/10.1038/ng.2416.

Kulakovskiy, Ivan V., Ilya E. Vorontsov, Ivan S. Yevshin, Ruslan N. Sharipov, Alla D. Fedorova, Eugene I. Rumynskiy, Yulia A. Medvedeva, et al. 2018. “HOCOMOCO: Towards a Complete Collection of Transcription Factor Binding Models for Human and Mouse via Large-Scale ChIP-Seq Analysis.” Nucleic Acids Research 46 (D1): D252–D259. https://doi.org/10.1093/nar/gkx1106.

Nishizaki, Sierra S, Natalie Ng, Shengcheng Dong, Robert S Porter, Cody Morterud, Colten Williams, Courtney Asman, Jessica A Switzenberg, and Alan P Boyle. 2020. “Predicting the Effects of SNPs on Transcription Factor Binding Affinity.” Bioinformatics 36 (2): 364–72. https://doi.org/10.1093/bioinformatics/btz612.

Stormo, Gary D., Thomas D. Schneider, Larry Gold, and Andrzej Ehrenfeucht. 1982. “Use of the ‘Perceptron’ Algorithm to Distinguish Translational Initiation Sites in E. Coli.” Nucleic Acids Research 10 (9): 2997–3011. https://doi.org/10.1093/nar/10.9.2997.

Touzet, Hélène, and Jean-Stéphane Varré. 2007. “Efficient and Accurate P-Value Computation for Position Weight Matrices.” Algorithms for Molecular Biology 2 (1): 15. https://doi.org/10.1186/1748-7188-2-15.

Wasserman, Wyeth W., and Albin Sandelin. 2004. “Applied Bioinformatics for the Identification of Regulatory Elements.” Nature Reviews Genetics 5 (4): 276–87. https://doi.org/10.1038/nrg1315.

Weirauch, Matthew T., Atina Cote, Raquel Norel, Matti Annala, Yue Zhao, Todd R. Riley, Julio Saez-Rodriguez, et al. 2013. “Evaluation of Methods for Modeling Transcription Factor Sequence Specificity.” Nature Biotechnology 31 (2): 126–34. https://doi.org/10.1038/nbt.2486.