Two options:
We could make a hybrid of the two, so that we have the dense genotyped SNPs measured in large partipant cohorts from Robertson et al. but also other SNPs directly genotyped/imputed in non-ImmunoChip regions from Cooper et al.
TFBS make up 31% of GWAS SNPs, yet only comprise 8% of the genome (ENCODE), suggesting that they may play a role in disease pathogenesis.
Position weight matrices (PWM) are the most widely used approach to modelling TFBSs. In contrast to the consensus model which just gives the most common base(s) at each position in a binding motif, the PWM is a \(4*m\) matrix of scores (log transformed normalised frequencies that have been transofrmed using a background model) of each base in each position of the motif based on empirical evidence (e.g. ChIP-seq data). However, they assume statistical independence between positions in the pattern (the probabilities are calculated independently of other positions).
There are many PWM databases including JASPAR (multiple data sources incuding SELEX, individual promoter assays and ChIP-Seq), HOCOMOCO (ChIP-Seq based), UniPROBE and SwissRegulon. Whilst there is considerable redundancy between these databases, there are examples of extremely different motifs reported between databases for the same TF. There are also many tools to perform genome scanning for these PWM including HOMER, TFBSTools and LASAGNE.
Shannon and Richards developed a tool called MotifDB which includes the latest versions of all four databases. If I use this to query the IKZF1 motif, then we find that the same motif is reported in humans (from HOCOMOCO and SwissRegulon) and mouse (HOCOMOCO).
library("MotifDb")
motifs <- query(MotifDb, "ikzf1")
sapply(motifs, consensusString)
## Hsapiens-HOCOMOCOv10-IKZF1_HUMAN.H10MO.C
## "?TGGGAG?"
## Mmusculus-HOCOMOCOv10-IKZF1_MOUSE.H10MO.C
## "?TGGGAG?"
## Hsapiens-SwissRegulon-IKZF1.SwissRegulon
## "?TGGGAG?"
library(seqLogo)
seqLogo(motifs[[1]]) # Hsapiens-HOCOMOCOv10-IKZF1_HUMAN.H10MO.C
motifs_ikzf2 <- query(MotifDb, "ikzf2")
sapply(motifs_ikzf2, consensusString)
## Hsapiens-SwissRegulon-IKZF2.SwissRegulon
## "A?AAGGAAAA?"
seqLogo(motifs_ikzf2[[1]]) # Hsapiens-SwissRegulon-IKZF2.SwissRegulon
Hsapiens-HOCOMOCOv10-IKZF1_HUMAN.H10MO.C
motif for IKZF1 (note that the C on the end is the quality score; C is “normal” and “models down to C quality are suitable for quantitative analyses”). I can’t find any specific information on the CHiP-Seq datasets used to generate the motif, and so I can’t be sure that the predicted and empirical IKZF1 binding site data I’m using are independent.Currently the effect of mutations in a TFBS is estimated using PWM. Nishizaki et al. 2020 developed a SNP effect matrix pipeline (the SEMpl method) to derive SNP effect matrices (SEM) which estimate the consequence of mutations in a TFBS. See how these differ from PWM below:
Brief overview of the method:
For the transcription factor of interest, enumerate all valid kmers (“endogenous kmer list”).
Introduce all possible SNPs to each kmer.
Align all kmers to the genome and filter for regions of open chromatin by DNase-seq and calculate the “average ChIP-seq scores” for each alignment (which is the highest signal vallue over the region 50 bp before and after the aligned site).
The SEM score for each position is computed as the log2 of the average ChIP-seq signal to endogenous signal ratio for the mapped kmers for each mutated kmer list.
Then repeat the process many times until convergence using an EM-like method (to correct for differences arising from unique starting kmers). Note that this is a generalisation of the intragenomic replicates (IGR) method (https://www.nature.com/articles/ng.2416)
Note that they have precomputed SEMpl scores for some TFs (including Ikaros in kidney - but this was then discarded due to the PWM and/or dataset not containing enough information about the binding of a TF and therefore does not produce any significant enrichments). To calculate the scores yourself, you need ChIP-seq data and PWMs.
Example of usage:
./iterativeSEM -PWM examples/MA0114.1.pwm
-merge_file examples/wgEncodeOpenChromDnaseHepg2Pk.narrowPeak
-big_wig examples/wgEncodeHaibTfbsHepg2Hnf4asc8987V0416101RawRep1.bigWig
-TF_name HNF4A
-genome data/hg19
-output results/HNF4A
Liu and colleagues (2020) have designed a SNPs filtering workflow called 3 Dimensional Functional Annotation of Accessible Cell Type Specific SNPs (3DFAACTS-SNP) that utilises overlapping profiles of Treg-specific epigenomic data (ATAC-seq, Hi-C and FOXP3-CHIP) to identify regulatory elements potentially driving the effect of variants associated with T1D and the genes they control. They focus on Tregs cells (which are mediated by expression of FOXP3) as these have been shown to be implicated in T1D. Their hypothesis is that: “the genetic variation that specifically alters Treg function will reside in open chromatin in Treg cells that is bound by FOXP3 and the genes controlled by these regulatory regions can be identified by chromosome conformation capture”. For their analysis they used the Bradfield GWAS data and the Onengut fine-mapping data. For the 1228 fine-mapped T1D SNPs, they identified 36 (from 14 loci) that met their filtering criteria and call these “T1D 3DFAACTS SNPs”.
[Perhaps our hypothesis is similar: “(some of) the genetic variation that specifically alters CD4 + T cell function will reside in open chromatin in CD4 + T cells that is bound by Ikaros TFs and the genes controlled by these regulatory regions can be identifed by chromosome conformation capture”. Perhaps we can then expand this to memory/naive/stimulated CD4 + T cells.]
Gosia’s group have a new paper about costimulation in T cell activation. They defined naïve cells as CD45RA+ and memory cells as CD45RA-. They treated cells with abatacept (a drug used to treat some immune diseases) which blocks costimulation, and found that naive cells were more sensitive to abatacept inhibition than memory cells –> supporting evidence that memory T cells are less dependent on costimulaton than naïve T cells and will continue to proliferate even when CD28 engagement is blocked. They also went one step further with this hypothesis, framing it in a different way - that abatacept doesn’t completely block costimuation, so that memory cells can still respond to very low levels of costimulation and are therefore more sensitive to costimuation. This is supported by CRISPR studies which show that complete loss of costimulation caused substantial further inhibition, particularly in memory cells. Overall, they found that naïve cells are more sensitive to TCR stimulation. Perhaps because memory T cells have more effective CD28 costimulation, whereas naïve cells are more dependent on the TCR and therefore susceptible to its loss.
They also found that CD28- and TCR-sensitive genes (and their regulatory regions) were enriched for T1D variants (by investigating whether disease SNPs map within active or open chromatin regions near the -senstive genes).
They also tested to see if any SNPs disrupted a TFBS. But note that to define “disease SNPs” they used the lead-SNP and friends method.
Side note: is there a way to link this to Thomas et al. (2007) ’s finding that “CD4+ T cells with reduced Ikaros DNA-binding activity no longer require signals from the TCR or CD28 for histone acetylation at the IL2 promoter, and no longer require CD28 costimulation for expression of the IL2 gene”.
I’ve pre-processed the two datasets into GRanges objects:
ikzf1_pred_full <- readRDS("ikzf1_pred_Granges.RDS")
ikzf1_pred <- ikzf1_pred_full[which(score(ikzf1_pred_full)>200)]
df <- data.frame(ikzf1_pred)
out <- df[!duplicated(paste0(df$seqnames, df$start, df$end)),]
rownames(out) <- c()
ikzf1_pred <- GRanges(out)
ikzf1_chip <- readRDS("ikzf1_chip_Granges.RDS")
df <- data.frame(ikzf1_chip)
out <- df[!duplicated(paste0(df$seqnames, df$start, df$end)),]
rownames(out) <- c()
ikzf1_chip <- GRanges(out)
Note that for the predicted binding site data set, HOCOMOCO motif info is used along with DNase data from ENCODE. I can’t gaurantee that the empirical ChIP-seq data (also from ENCODE) I’m using is not used in either of these specifications.
Altogether there are 13676 predicted binding sites after thresholding, and 60776 empirical CHiP-seq peaks. We saw last week that the distribution of peaks across chromosomes was similar for the predicted and empirical peaks. I now consider if there is much overlap between the empirical and the predicted peaks.
However, there is 1 region (chr16:90008631-90009267) that houses 5 predicted peaks. Below, I show a simple schematic of where the predicted binding sites reside on the CHiP-seq peak. But looking at this region in the UCSC genome browser, it doesn’t look very interesting. It seems to house the \(\textit{DBNDD1}\) gene, but when I match my Cooper SNPs - none of them fall in this genomic region.
There are also 89 regions that house 4 predicted binding sites, which I could look into in the future.
Using the predicted TFBS data set only, I find that only 222 of the 8,618,379 Cooper SNPs fall into these predicted TFBS. Some of these have very low \(p\)-values for T1D, which look promising
However, only 23 were fine-mapped and most of these have pip=0
, with the largest pip
being 0.0002.
This seems odd - what do the regions look like for the two SNPs with tiny \(p\)-values (\(p\approx 10^{-140}\): chr6:31497713 (in chr6_block24) and chr6:32894143 (in chr6_block26))?
Ah - its because they reside in the MHC region of the genome. The first region contains 7 independent signals identifed by SuSiE, whilst the second contains 8. I guess it makes sense for these SNPs not to be in the credible sets here.
Investigated stimulation- and differentiation- associated ATAC-seq data measured in a variety of immune cells (Calderon et al., 2019), including mapping the T1D associated/causal SNPs to these peaks (see https://annahutch.github.io/PhD/12nov20.html).
Reading on Ikaros family of transcription factors (because we think these could be involved in T1D pathogenesis). I have some IKZF1 CHiP-seq data in CD4 + T cells which I’m yet to analyse. In the meantime, I’ve been using pre-analysed IKZF1 CHiP-seq data from a lymphoblastoid cell line. It seems that GWAS SNPs are enriched in these CHiP-seq peaks, which suggests that leveraging this data in cFDR could be useful (https://annahutch.github.io/PhD/18nov20.html). This has the potential to not only find new associations but also find the function through which the SNPs cause disease (e.g. by disrupting a IKZF1 binding site). [Note - I’m now in a position to run binary cFDR on the T1D GWAS results, leveraging a binary indicator for whether the SNP falls in a CHiP-seq peak].
A new paper has also come out on predicted TFBS, which may be useful for other Ikaros TFs for which we don’t have CHiP-Seq data. I’ve began investigating the similarity of these predicted peaks with those empirical peaks from the CHiP-Seq data, but I’m not sure whether the latter was used to predict the former. I’ve also checked that distance to gene is not confounding our analysis. See https://annahutch.github.io/PhD/26nov20.html.
Chris has sped up the binary cFDR code, such that it now calculates p0
and p1
for a sample of data points and uses linear interpolation for the rest. I’ve included a line of code that deals with NA values for SNPs with \(p\)-value greater than the maximum \(p\)-value used.
I’ve also extended the code such that it now takes more samples in regions where we care more about accuracy (\(p<p.thr\)) and less where we care less about accuracy (\(p>=p.thr\)). I include the \(p\) value threshold (p.thr
) and the digits for rounding for both p<p.thr (digits_lowp
) and p>=p.thr (digits_highp
) as optional parameters in the function code.
Manipulating these parameter values results in a big save in time but not at the expense of a drop in accuracy. In the table below “Accuracy” is calculated as the correlation between the resultant \(v\) values from runing the function with the listed parameter values and from no interpolation (i.e setting digits=50
).
## p.thr digits_highp digits_lowp Accuracy Time
## 1 0.01 50 50 1.000000 18.958078 secs
## 2 0.01 1 2 0.999948 9.839347 secs
## 3 0.01 1 3 0.999948 9.060714 secs
## 5 0.1 1 3 0.999948 10.482241 secs
## 6 0.1 1 3 0.999948 10.443753 secs
If I plot the resultant \(v\)-values from the “gold standard” implementation (no interpolation) against the most extreme case, where I set p.thr=0.01; digits_highp=1; digits_lowp=2
then we see the results are extremely similar.
Setting p.thr=0.01; digits_highp=1; digits_lowp=2
(default values?) in the T1D application (for the Bioinformatics application note) take 18 minutes for 114,641 SNPs (as opposed to 42 minutes for the “gold standard” with no interpolation). It takes 12 minutes for iteration 1 (leveraging RA \(p\)-values) and 4 minues for iteration 2 (leveraging H3K27ac counts).
Nice summary of local FDR: https://biodatascience.github.io/compbio/test/localfdr.html
When doing QQ plots, could also do for PP/PIP rather than -log10.
When modelling distance to gene confounding, do on -log10(p) scale.
I need to read up on how close to a motif a SNP needs to be to disrupt binding (i.e. how much to extend the 7bp predicted binding regions in my analysis).
Aims before Christmas:
Send cFDR to AJHG (need potential reviewers) - get online submission ready to go
Finish binary cFDR code and send paper to James to check
More thesis writing.