3rd December

T1D GWAS data

Two options:

Robertson et al. (2020):
- Genotyped 140,000 ImmunoChip variants in 61,427 participants and then imputed up to 715,000 SNPs
- Fine-mapped 52 ImmunoChip regions that reached genome-wide significance in the European case-control data and then did trans-ethnic fine-mapping on the regions with evidence of a single CV using PAINTOR.
- Note that when defining regions to fine-map, they took the lead variant and examined all variants that were included with 1.5Mb of the lead variant (750kb either side) which usually consisted of the whole ImmunoChip region and any other proximal SNPs.
- Do they have a different number of samples per SNP?
Cooper et al. (2017):
- (Why was this not published?)
- Genotyped variants genome-wide in 6000 cases and 9000 controls then imputed.
- Chris has fine-mapped the whole genome (defining regions using LD detect) with SuSiE.
- I.e. have GWAS \(p\)-values (imputed) for 8,618,379 SNPs and GUESSFM fine-mapping results for 284,262 SNPs across 44 LD detect regions with evidence of an association (\(p<1e-06\)).

We could make a hybrid of the two, so that we have the dense genotyped SNPs measured in large partipant cohorts from Robertson et al. but also other SNPs directly genotyped/imputed in non-ImmunoChip regions from Cooper et al.

Reading on TFBS

https://mediaspace.itap.purdue.edu/media/Transcription+Factor+Binding+Site+Prediction+in+DNA/1_329lc2ft/82915941

TFBS make up 31% of GWAS SNPs, yet only comprise 8% of the genome (ENCODE), suggesting that they may play a role in disease pathogenesis.
Position weight matrices (PWM) are the most widely used approach to modelling TFBSs. In contrast to the consensus model which just gives the most common base(s) at each position in a binding motif, the PWM is a \(4*m\) matrix of scores (log transformed normalised frequencies that have been transofrmed using a background model) of each base in each position of the motif based on empirical evidence (e.g. ChIP-seq data). However, they assume statistical independence between positions in the pattern (the probabilities are calculated independently of other positions).
There are many PWM databases including JASPAR (multiple data sources incuding SELEX, individual promoter assays and ChIP-Seq), HOCOMOCO (ChIP-Seq based), UniPROBE and SwissRegulon. Whilst there is considerable redundancy between these databases, there are examples of extremely different motifs reported between databases for the same TF. There are also many tools to perform genome scanning for these PWM including HOMER, TFBSTools and LASAGNE.
Shannon and Richards developed a tool called MotifDB which includes the latest versions of all four databases. If I use this to query the IKZF1 motif, then we find that the same motif is reported in humans (from HOCOMOCO and SwissRegulon) and mouse (HOCOMOCO).

library("MotifDb")
motifs <- query(MotifDb, "ikzf1")
sapply(motifs, consensusString)

##  Hsapiens-HOCOMOCOv10-IKZF1_HUMAN.H10MO.C 
##                                "?TGGGAG?" 
## Mmusculus-HOCOMOCOv10-IKZF1_MOUSE.H10MO.C 
##                                "?TGGGAG?" 
##  Hsapiens-SwissRegulon-IKZF1.SwissRegulon 
##                                "?TGGGAG?"

library(seqLogo)
seqLogo(motifs[[1]])  # Hsapiens-HOCOMOCOv10-IKZF1_HUMAN.H10MO.C

No motif information is found for IKZF3-5, but IKZF2 is listed in SwissRegulon.

motifs_ikzf2 <- query(MotifDb, "ikzf2")
sapply(motifs_ikzf2, consensusString)

## Hsapiens-SwissRegulon-IKZF2.SwissRegulon 
##                            "A?AAGGAAAA?"

seqLogo(motifs_ikzf2[[1]])  # Hsapiens-SwissRegulon-IKZF2.SwissRegulon

In the predicted TF binding site paper (Funk et al. 2020), it appears that they used the Hsapiens-HOCOMOCOv10-IKZF1_HUMAN.H10MO.C motif for IKZF1 (note that the C on the end is the quality score; C is “normal” and “models down to C quality are suitable for quantitative analyses”). I can’t find any specific information on the CHiP-Seq datasets used to generate the motif, and so I can’t be sure that the predicted and empirical IKZF1 binding site data I’m using are independent.

SEMpl method

Currently the effect of mutations in a TFBS is estimated using PWM. Nishizaki et al. 2020 developed a SNP effect matrix pipeline (the SEMpl method) to derive SNP effect matrices (SEM) which estimate the consequence of mutations in a TFBS. See how these differ from PWM below:

“PWM versus SEM of transcription factor GATA1. (A) The PWM can be read as likely nucleotides along a transcriptions factor’s motif. (B) Similarly, the SEM can be read as nucleotides along a motif, but with additional information about the effect any given SNP may have on transcription factor-binding affinity. The solid gray line represents endogenous binding, the dashed gray line represents a scrambled background. We define anything above the solid gray line as predicted to increase binding on average, anything between the two lines as decreasing average binding and anything falling below the dashed gray line as ablating binding on average”

Brief overview of the method:

For the transcription factor of interest, enumerate all valid kmers (“endogenous kmer list”).
Introduce all possible SNPs to each kmer.
Align all kmers to the genome and filter for regions of open chromatin by DNase-seq and calculate the “average ChIP-seq scores” for each alignment (which is the highest signal vallue over the region 50 bp before and after the aligned site).
The SEM score for each position is computed as the log2 of the average ChIP-seq signal to endogenous signal ratio for the mapped kmers for each mutated kmer list.

Then repeat the process many times until convergence using an EM-like method (to correct for differences arising from unique starting kmers). Note that this is a generalisation of the intragenomic replicates (IGR) method (https://www.nature.com/articles/ng.2416)

Note that they have precomputed SEMpl scores for some TFs (including Ikaros in kidney - but this was then discarded due to the PWM and/or dataset not containing enough information about the binding of a TF and therefore does not produce any significant enrichments). To calculate the scores yourself, you need ChIP-seq data and PWMs.

Example of usage:

./iterativeSEM  -PWM examples/MA0114.1.pwm 
                -merge_file examples/wgEncodeOpenChromDnaseHepg2Pk.narrowPeak 
                -big_wig examples/wgEncodeHaibTfbsHepg2Hnf4asc8987V0416101RawRep1.bigWig 
                -TF_name HNF4A 
                -genome data/hg19 
                -output results/HNF4A

Extra readings

Liu and colleagues (2020) have designed a SNPs filtering workflow called 3 Dimensional Functional Annotation of Accessible Cell Type Specific SNPs (3DFAACTS-SNP) that utilises overlapping profiles of Treg-specific epigenomic data (ATAC-seq, Hi-C and FOXP3-CHIP) to identify regulatory elements potentially driving the effect of variants associated with T1D and the genes they control. They focus on Tregs cells (which are mediated by expression of FOXP3) as these have been shown to be implicated in T1D. Their hypothesis is that: “the genetic variation that specifically alters Treg function will reside in open chromatin in Treg cells that is bound by FOXP3 and the genes controlled by these regulatory regions can be identified by chromosome conformation capture”. For their analysis they used the Bradfield GWAS data and the Onengut fine-mapping data. For the 1228 fine-mapped T1D SNPs, they identified 36 (from 14 loci) that met their filtering criteria and call these “T1D 3DFAACTS SNPs”.

[Perhaps our hypothesis is similar: “(some of) the genetic variation that specifically alters CD4 + T cell function will reside in open chromatin in CD4 + T cells that is bound by Ikaros TFs and the genes controlled by these regulatory regions can be identifed by chromosome conformation capture”. Perhaps we can then expand this to memory/naive/stimulated CD4 + T cells.]

Gosia’s group have a new paper about costimulation in T cell activation. They defined naïve cells as CD45RA+ and memory cells as CD45RA-. They treated cells with abatacept (a drug used to treat some immune diseases) which blocks costimulation, and found that naive cells were more sensitive to abatacept inhibition than memory cells –> supporting evidence that memory T cells are less dependent on costimulaton than naïve T cells and will continue to proliferate even when CD28 engagement is blocked. They also went one step further with this hypothesis, framing it in a different way - that abatacept doesn’t completely block costimuation, so that memory cells can still respond to very low levels of costimulation and are therefore more sensitive to costimuation. This is supported by CRISPR studies which show that complete loss of costimulation caused substantial further inhibition, particularly in memory cells. Overall, they found that naïve cells are more sensitive to TCR stimulation. Perhaps because memory T cells have more effective CD28 costimulation, whereas naïve cells are more dependent on the TCR and therefore susceptible to its loss.

They also found that CD28- and TCR-sensitive genes (and their regulatory regions) were enriched for T1D variants (by investigating whether disease SNPs map within active or open chromatin regions near the -senstive genes).

They also tested to see if any SNPs disrupted a TFBS. But note that to define “disease SNPs” they used the lead-SNP and friends method.

Side note: is there a way to link this to Thomas et al. (2007) ’s finding that “CD4+ T cells with reduced Ikaros DNA-binding activity no longer require signals from the TCR or CD28 for histone acetylation at the IL2 promoter, and no longer require CD28 costimulation for expression of the IL2 gene”.

My analysis

I’ve pre-processed the two datasets into GRanges objects:

Predicted IKZF1 binding sites in lymphoblast cells (Funk et al. 2020). Note that this contains all predicted binding sites. I limit my analysis to those regions with a HINT score \(>200\), which the authors found to be an optimal filtering threshold to control both false-positive and false-negative errors. There are also some duplicate regions with different HINT20 scores. For now, I just keep the first instance of the region in the data frame (need to think of a better way to deal with these duplicates later).

ikzf1_pred_full <- readRDS("ikzf1_pred_Granges.RDS")
ikzf1_pred <- ikzf1_pred_full[which(score(ikzf1_pred_full)>200)]

df <- data.frame(ikzf1_pred)
out <- df[!duplicated(paste0(df$seqnames, df$start, df$end)),]
rownames(out) <- c()
ikzf1_pred <- GRanges(out)

Empirical IKZF1 binding sites from ChIP-Seq in B-lymphocytes (from a female donor) (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE105587). Note that this data is already the conservative idr thresholded peaks. Briefly, these are the peaks which can be replicated based on “true” replicates (not pool) (see https://biodatascience.github.io/compbio/test/IDR.html). Again, there are duplicates and I keep only the first instance of the region in the data frame (but there are still overlapping CHiP-seq peaks).

ikzf1_chip <- readRDS("ikzf1_chip_Granges.RDS")

df <- data.frame(ikzf1_chip)
out <- df[!duplicated(paste0(df$seqnames, df$start, df$end)),]
rownames(out) <- c()
ikzf1_chip <- GRanges(out)

Note that for the predicted binding site data set, HOCOMOCO motif info is used along with DNase data from ENCODE. I can’t gaurantee that the empirical ChIP-seq data (also from ENCODE) I’m using is not used in either of these specifications.

Altogether there are 13676 predicted binding sites after thresholding, and 60776 empirical CHiP-seq peaks. We saw last week that the distribution of peaks across chromosomes was similar for the predicted and empirical peaks. I now consider if there is much overlap between the empirical and the predicted peaks.

54280/60776 (89%) of empirical CHiP-seq peaks do not house a predicted binding site.
For the remaining 11%, below we see that most house just one predicted peak.

However, there is 1 region (chr16:90008631-90009267) that houses 5 predicted peaks. Below, I show a simple schematic of where the predicted binding sites reside on the CHiP-seq peak. But looking at this region in the UCSC genome browser, it doesn’t look very interesting. It seems to house the \(\textit{DBNDD1}\) gene, but when I match my Cooper SNPs - none of them fall in this genomic region.

There are also 89 regions that house 4 predicted binding sites, which I could look into in the future.

Using the predicted TFBS data set only, I find that only 222 of the 8,618,379 Cooper SNPs fall into these predicted TFBS. Some of these have very low \(p\)-values for T1D, which look promising

However, only 23 were fine-mapped and most of these have pip=0, with the largest pip being 0.0002.

This seems odd - what do the regions look like for the two SNPs with tiny \(p\)-values (\(p\approx 10^{-140}\): chr6:31497713 (in chr6_block24) and chr6:32894143 (in chr6_block26))?

Ah - its because they reside in the MHC region of the genome. The first region contains 7 independent signals identifed by SuSiE, whilst the second contains 8. I guess it makes sense for these SNPs not to be in the credible sets here.

Summary of where I’m up to with application project

Investigated stimulation- and differentiation- associated ATAC-seq data measured in a variety of immune cells (Calderon et al., 2019), including mapping the T1D associated/causal SNPs to these peaks (see https://annahutch.github.io/PhD/12nov20.html).
Reading on Ikaros family of transcription factors (because we think these could be involved in T1D pathogenesis). I have some IKZF1 CHiP-seq data in CD4 + T cells which I’m yet to analyse. In the meantime, I’ve been using pre-analysed IKZF1 CHiP-seq data from a lymphoblastoid cell line. It seems that GWAS SNPs are enriched in these CHiP-seq peaks, which suggests that leveraging this data in cFDR could be useful (https://annahutch.github.io/PhD/18nov20.html). This has the potential to not only find new associations but also find the function through which the SNPs cause disease (e.g. by disrupting a IKZF1 binding site). [Note - I’m now in a position to run binary cFDR on the T1D GWAS results, leveraging a binary indicator for whether the SNP falls in a CHiP-seq peak].
A new paper has also come out on predicted TFBS, which may be useful for other Ikaros TFs for which we don’t have CHiP-Seq data. I’ve began investigating the similarity of these predicted peaks with those empirical peaks from the CHiP-Seq data, but I’m not sure whether the latter was used to predict the former. I’ve also checked that distance to gene is not confounding our analysis. See https://annahutch.github.io/PhD/26nov20.html.

Binary cFDR

Chris has sped up the binary cFDR code, such that it now calculates p0 and p1 for a sample of data points and uses linear interpolation for the rest. I’ve included a line of code that deals with NA values for SNPs with \(p\)-value greater than the maximum \(p\)-value used.

I’ve also extended the code such that it now takes more samples in regions where we care more about accuracy (\(p<p.thr\)) and less where we care less about accuracy (\(p>=p.thr\)). I include the \(p\) value threshold (p.thr) and the digits for rounding for both p<p.thr (digits_lowp) and p>=p.thr (digits_highp) as optional parameters in the function code.

Manipulating these parameter values results in a big save in time but not at the expense of a drop in accuracy. In the table below “Accuracy” is calculated as the correlation between the resultant \(v\) values from runing the function with the listed parameter values and from no interpolation (i.e setting digits=50).

##   p.thr digits_highp digits_lowp Accuracy           Time
## 1  0.01           50          50 1.000000 18.958078 secs
## 2  0.01            1           2 0.999948  9.839347 secs
## 3  0.01            1           3 0.999948  9.060714 secs
## 5   0.1            1           3 0.999948 10.482241 secs
## 6   0.1            1           3 0.999948 10.443753 secs

If I plot the resultant \(v\)-values from the “gold standard” implementation (no interpolation) against the most extreme case, where I set p.thr=0.01; digits_highp=1; digits_lowp=2 then we see the results are extremely similar.

Setting p.thr=0.01; digits_highp=1; digits_lowp=2 (default values?) in the T1D application (for the Bioinformatics application note) take 18 minutes for 114,641 SNPs (as opposed to 42 minutes for the “gold standard” with no interpolation). It takes 12 minutes for iteration 1 (leveraging RA \(p\)-values) and 4 minues for iteration 2 (leveraging H3K27ac counts).

Notes and comments

Nice summary of local FDR: https://biodatascience.github.io/compbio/test/localfdr.html
When doing QQ plots, could also do for PP/PIP rather than -log10.
When modelling distance to gene confounding, do on -log10(p) scale.
I need to read up on how close to a motif a SNP needs to be to disrupt binding (i.e. how much to extend the 7bp predicted binding regions in my analysis).

Aims before Christmas:

Send cFDR to AJHG (need potential reviewers) - get online submission ready to go
Finish binary cFDR code and send paper to James to check
More thesis writing.