T1D GWAS data

Two options:

  1. Robertson et al. (2020):
    • Genotyped 140,000 ImmunoChip variants in 61,427 participants and then imputed up to 715,000 SNPs
    • Fine-mapped 52 ImmunoChip regions that reached genome-wide significance in the European case-control data and then did trans-ethnic fine-mapping on the regions with evidence of a single CV using PAINTOR.
    • Note that when defining regions to fine-map, they took the lead variant and examined all variants that were included with 1.5Mb of the lead variant (750kb either side) which usually consisted of the whole ImmunoChip region and any other proximal SNPs.
    • Do they have a different number of samples per SNP?
  2. Cooper et al. (2017):
    • (Why was this not published?)
    • Genotyped variants genome-wide in 6000 cases and 9000 controls then imputed.
    • Chris has fine-mapped the whole genome (defining regions using LD detect) with SuSiE.
    • I.e. have GWAS \(p\)-values (imputed) for 8,618,379 SNPs and GUESSFM fine-mapping results for 284,262 SNPs across 44 LD detect regions with evidence of an association (\(p<1e-06\)).

We could make a hybrid of the two, so that we have the dense genotyped SNPs measured in large partipant cohorts from Robertson et al. but also other SNPs directly genotyped/imputed in non-ImmunoChip regions from Cooper et al.


Reading on TFBS


https://mediaspace.itap.purdue.edu/media/Transcription+Factor+Binding+Site+Prediction+in+DNA/1_329lc2ft/82915941

library("MotifDb")
motifs <- query(MotifDb, "ikzf1")
sapply(motifs, consensusString)
##  Hsapiens-HOCOMOCOv10-IKZF1_HUMAN.H10MO.C 
##                                "?TGGGAG?" 
## Mmusculus-HOCOMOCOv10-IKZF1_MOUSE.H10MO.C 
##                                "?TGGGAG?" 
##  Hsapiens-SwissRegulon-IKZF1.SwissRegulon 
##                                "?TGGGAG?"
library(seqLogo)
seqLogo(motifs[[1]])  # Hsapiens-HOCOMOCOv10-IKZF1_HUMAN.H10MO.C

motifs_ikzf2 <- query(MotifDb, "ikzf2")
sapply(motifs_ikzf2, consensusString)
## Hsapiens-SwissRegulon-IKZF2.SwissRegulon 
##                            "A?AAGGAAAA?"
seqLogo(motifs_ikzf2[[1]])  # Hsapiens-SwissRegulon-IKZF2.SwissRegulon 


SEMpl method

Currently the effect of mutations in a TFBS is estimated using PWM. Nishizaki et al. 2020 developed a SNP effect matrix pipeline (the SEMpl method) to derive SNP effect matrices (SEM) which estimate the consequence of mutations in a TFBS. See how these differ from PWM below:

“PWM versus SEM of transcription factor GATA1. (A) The PWM can be read as likely nucleotides along a transcriptions factor’s motif. (B) Similarly, the SEM can be read as nucleotides along a motif, but with additional information about the effect any given SNP may have on transcription factor-binding affinity. The solid gray line represents endogenous binding, the dashed gray line represents a scrambled background. We define anything above the solid gray line as predicted to increase binding on average, anything between the two lines as decreasing average binding and anything falling below the dashed gray line as ablating binding on average”

“PWM versus SEM of transcription factor GATA1. (A) The PWM can be read as likely nucleotides along a transcriptions factor’s motif. (B) Similarly, the SEM can be read as nucleotides along a motif, but with additional information about the effect any given SNP may have on transcription factor-binding affinity. The solid gray line represents endogenous binding, the dashed gray line represents a scrambled background. We define anything above the solid gray line as predicted to increase binding on average, anything between the two lines as decreasing average binding and anything falling below the dashed gray line as ablating binding on average”


Brief overview of the method:

  1. For the transcription factor of interest, enumerate all valid kmers (“endogenous kmer list”).

  2. Introduce all possible SNPs to each kmer.

  3. Align all kmers to the genome and filter for regions of open chromatin by DNase-seq and calculate the “average ChIP-seq scores” for each alignment (which is the highest signal vallue over the region 50 bp before and after the aligned site).

  4. The SEM score for each position is computed as the log2 of the average ChIP-seq signal to endogenous signal ratio for the mapped kmers for each mutated kmer list.

Then repeat the process many times until convergence using an EM-like method (to correct for differences arising from unique starting kmers). Note that this is a generalisation of the intragenomic replicates (IGR) method (https://www.nature.com/articles/ng.2416)

Note that they have precomputed SEMpl scores for some TFs (including Ikaros in kidney - but this was then discarded due to the PWM and/or dataset not containing enough information about the binding of a TF and therefore does not produce any significant enrichments). To calculate the scores yourself, you need ChIP-seq data and PWMs.

Example of usage:

./iterativeSEM  -PWM examples/MA0114.1.pwm 
                -merge_file examples/wgEncodeOpenChromDnaseHepg2Pk.narrowPeak 
                -big_wig examples/wgEncodeHaibTfbsHepg2Hnf4asc8987V0416101RawRep1.bigWig 
                -TF_name HNF4A 
                -genome data/hg19 
                -output results/HNF4A

Extra readings


Liu and colleagues (2020) have designed a SNPs filtering workflow called 3 Dimensional Functional Annotation of Accessible Cell Type Specific SNPs (3DFAACTS-SNP) that utilises overlapping profiles of Treg-specific epigenomic data (ATAC-seq, Hi-C and FOXP3-CHIP) to identify regulatory elements potentially driving the effect of variants associated with T1D and the genes they control. They focus on Tregs cells (which are mediated by expression of FOXP3) as these have been shown to be implicated in T1D. Their hypothesis is that: “the genetic variation that specifically alters Treg function will reside in open chromatin in Treg cells that is bound by FOXP3 and the genes controlled by these regulatory regions can be identified by chromosome conformation capture”. For their analysis they used the Bradfield GWAS data and the Onengut fine-mapping data. For the 1228 fine-mapped T1D SNPs, they identified 36 (from 14 loci) that met their filtering criteria and call these “T1D 3DFAACTS SNPs”.

[Perhaps our hypothesis is similar: “(some of) the genetic variation that specifically alters CD4 + T cell function will reside in open chromatin in CD4 + T cells that is bound by Ikaros TFs and the genes controlled by these regulatory regions can be identifed by chromosome conformation capture”. Perhaps we can then expand this to memory/naive/stimulated CD4 + T cells.]


Gosia’s group have a new paper about costimulation in T cell activation. They defined naïve cells as CD45RA+ and memory cells as CD45RA-. They treated cells with abatacept (a drug used to treat some immune diseases) which blocks costimulation, and found that naive cells were more sensitive to abatacept inhibition than memory cells –> supporting evidence that memory T cells are less dependent on costimulaton than naïve T cells and will continue to proliferate even when CD28 engagement is blocked. They also went one step further with this hypothesis, framing it in a different way - that abatacept doesn’t completely block costimuation, so that memory cells can still respond to very low levels of costimulation and are therefore more sensitive to costimuation. This is supported by CRISPR studies which show that complete loss of costimulation caused substantial further inhibition, particularly in memory cells. Overall, they found that naïve cells are more sensitive to TCR stimulation. Perhaps because memory T cells have more effective CD28 costimulation, whereas naïve cells are more dependent on the TCR and therefore susceptible to its loss.

They also found that CD28- and TCR-sensitive genes (and their regulatory regions) were enriched for T1D variants (by investigating whether disease SNPs map within active or open chromatin regions near the -senstive genes).

They also tested to see if any SNPs disrupted a TFBS. But note that to define “disease SNPs” they used the lead-SNP and friends method.

Side note: is there a way to link this to Thomas et al. (2007) ’s finding that “CD4+ T cells with reduced Ikaros DNA-binding activity no longer require signals from the TCR or CD28 for histone acetylation at the IL2 promoter, and no longer require CD28 costimulation for expression of the IL2 gene”.


My analysis


I’ve pre-processed the two datasets into GRanges objects:

  1. Predicted IKZF1 binding sites in lymphoblast cells (Funk et al. 2020). Note that this contains all predicted binding sites. I limit my analysis to those regions with a HINT score \(>200\), which the authors found to be an optimal filtering threshold to control both false-positive and false-negative errors. There are also some duplicate regions with different HINT20 scores. For now, I just keep the first instance of the region in the data frame (need to think of a better way to deal with these duplicates later).
ikzf1_pred_full <- readRDS("ikzf1_pred_Granges.RDS")
ikzf1_pred <- ikzf1_pred_full[which(score(ikzf1_pred_full)>200)]

df <- data.frame(ikzf1_pred)
out <- df[!duplicated(paste0(df$seqnames, df$start, df$end)),]
rownames(out) <- c()
ikzf1_pred <- GRanges(out)
  1. Empirical IKZF1 binding sites from ChIP-Seq in B-lymphocytes (from a female donor) (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE105587). Note that this data is already the conservative idr thresholded peaks. Briefly, these are the peaks which can be replicated based on “true” replicates (not pool) (see https://biodatascience.github.io/compbio/test/IDR.html). Again, there are duplicates and I keep only the first instance of the region in the data frame (but there are still overlapping CHiP-seq peaks).
ikzf1_chip <- readRDS("ikzf1_chip_Granges.RDS")

df <- data.frame(ikzf1_chip)
out <- df[!duplicated(paste0(df$seqnames, df$start, df$end)),]
rownames(out) <- c()
ikzf1_chip <- GRanges(out)

Note that for the predicted binding site data set, HOCOMOCO motif info is used along with DNase data from ENCODE. I can’t gaurantee that the empirical ChIP-seq data (also from ENCODE) I’m using is not used in either of these specifications.


Altogether there are 13676 predicted binding sites after thresholding, and 60776 empirical CHiP-seq peaks. We saw last week that the distribution of peaks across chromosomes was similar for the predicted and empirical peaks. I now consider if there is much overlap between the empirical and the predicted peaks.

However, there is 1 region (chr16:90008631-90009267) that houses 5 predicted peaks. Below, I show a simple schematic of where the predicted binding sites reside on the CHiP-seq peak. But looking at this region in the UCSC genome browser, it doesn’t look very interesting. It seems to house the \(\textit{DBNDD1}\) gene, but when I match my Cooper SNPs - none of them fall in this genomic region.

There are also 89 regions that house 4 predicted binding sites, which I could look into in the future.


Using the predicted TFBS data set only, I find that only 222 of the 8,618,379 Cooper SNPs fall into these predicted TFBS. Some of these have very low \(p\)-values for T1D, which look promising

However, only 23 were fine-mapped and most of these have pip=0, with the largest pip being 0.0002.

This seems odd - what do the regions look like for the two SNPs with tiny \(p\)-values (\(p\approx 10^{-140}\): chr6:31497713 (in chr6_block24) and chr6:32894143 (in chr6_block26))?

Ah - its because they reside in the MHC region of the genome. The first region contains 7 independent signals identifed by SuSiE, whilst the second contains 8. I guess it makes sense for these SNPs not to be in the credible sets here.


Summary of where I’m up to with application project


Binary cFDR


Chris has sped up the binary cFDR code, such that it now calculates p0 and p1 for a sample of data points and uses linear interpolation for the rest. I’ve included a line of code that deals with NA values for SNPs with \(p\)-value greater than the maximum \(p\)-value used.

I’ve also extended the code such that it now takes more samples in regions where we care more about accuracy (\(p<p.thr\)) and less where we care less about accuracy (\(p>=p.thr\)). I include the \(p\) value threshold (p.thr) and the digits for rounding for both p<p.thr (digits_lowp) and p>=p.thr (digits_highp) as optional parameters in the function code.

Manipulating these parameter values results in a big save in time but not at the expense of a drop in accuracy. In the table below “Accuracy” is calculated as the correlation between the resultant \(v\) values from runing the function with the listed parameter values and from no interpolation (i.e setting digits=50).

##   p.thr digits_highp digits_lowp Accuracy           Time
## 1  0.01           50          50 1.000000 18.958078 secs
## 2  0.01            1           2 0.999948  9.839347 secs
## 3  0.01            1           3 0.999948  9.060714 secs
## 5   0.1            1           3 0.999948 10.482241 secs
## 6   0.1            1           3 0.999948 10.443753 secs

If I plot the resultant \(v\)-values from the “gold standard” implementation (no interpolation) against the most extreme case, where I set p.thr=0.01; digits_highp=1; digits_lowp=2 then we see the results are extremely similar.


Setting p.thr=0.01; digits_highp=1; digits_lowp=2 (default values?) in the T1D application (for the Bioinformatics application note) take 18 minutes for 114,641 SNPs (as opposed to 42 minutes for the “gold standard” with no interpolation). It takes 12 minutes for iteration 1 (leveraging RA \(p\)-values) and 4 minues for iteration 2 (leveraging H3K27ac counts).


Notes and comments

Aims before Christmas: