T cell development

Source: https://www.youtube.com/watch?v=JeV-HuPq7CI; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5865843/


T cell activation + differentiation

Source: https://en.wikipedia.org/wiki/T_helper_cell


Ikaros

The Ikaros family of transcription factors consists of 5 members: Ikaros (encoded by the gene \(\textit{IKZF1}\)), Helios (\(\textit{IKZF2}\)), Aiolos (\(\textit{IKZF3}\)), Eos (\(\textit{IKZF4}\)) and Pegasus (\(\textit{IKZF5}\)). They can both positively and negatively regulate gene expression through direct DNA interaction and by forming transcriptional complexes with other proteins (Powell et al. 2019). The IKFL1-derived lineage consists of Ikaros and Aiolos, which been linked to many autoimmune disorders.

Vyse and Cunninghame Graham (2020) used trans-ancestral fine-mapping and functional genomic data to narrow down the sets of potentially causal variants for SLE in the IKZF1 and IKZF3 gene regions.

Powell et al. (2019) review the role of Ikaros transcription factors in the differentitation of effector CD4+ T helper cell subsets.

Lyon de Ana et al. (2019) created an Ikaros conditional knockout mouse and found that mature CD4 T cells could not attain the iTreg cell fate. Moreover, the knockout showed enhanced expression of some proinflammatory cytokines which results in a phenotype that is associated with autoimmunity and pathological inflammation, suggesting that Ikaros acts as a suppressor of this phenotype.


Cusanovich et al. (2014) knocked down 59 TFs (including IKZF3) in a lymphoblastoid cell line to see which genes were differentially expressed. For the IKZF3 knockdown experiment, the \(p\)-values for a likelihood-ratio test comparing the knockdown samples to the controls for each gene are shown below (sorted by \(p\)-value):

library(data.table)
x <- fread("TableS3.txt")

x[order(x$IKZF3_ENSG00000264198_DE_p),c("Symbol","IKZF3_ENSG00000264198_DE_p")]
##         Symbol IKZF3_ENSG00000264198_DE_p
##    1: LGALS3BP               1.162717e-10
##    2:    CXCL9               1.273708e-10
##    3:    FABP4               6.191764e-10
##    4:   PBXIP1               1.617614e-09
##    5:   IFITM3               1.650501e-09
##   ---                                    
## 8868:   MIR300                         NA
## 8869: VTRNA1-1                         NA
## 8870:  MIR1909                         NA
## 8871:   MIR342                         NA
## 8872:  MIR1224                         NA

The interpretation of this is that (e.g.) \(\textit{LGALS3BP}\) is significantly differentially expressed between controls and IKZF3 TF knockouts, suggesting that IKZF3 may be binding in such a way that significantly enhancer/decreases \(\textit{LGALS3BP}\) expression.

Burren et al. 2014 used this data in their VSEAMS method to find that the genes perturbed by IKZF3 (+/- 200kb around the TSS) in the knock-down experiment were enriched for association with T1D. This suggests that some T1D associated variants may act through differential IKZF3 TF binding.

This is a sort of “backwards” approach as it involves testing SNPs already associated with T1D. Now that we know that T1D SNPs are enriched in genomic regions perturbed by IKZF3, perhaps we could incorporate auxiliary data relating to IKZF3 to help us prioritise SNPs using a more “forward” approach (e.g. cFDR).


Fine-mapping results

\(\textit{IKZF1}\) resides in the 7p12.2 region (chr7:50221511-51009338). I download the GUESSFM fine-mapping results for this region from Robertson et al. (2020). It seems that there is evidence for 6 independent signals in this region (although I’m not sure what PPsum is - is it possible to form credible sets for each of these independent signals?)

library(readxl)

x <- data.frame(read_excel("robertson_supp.xlsx", sheet = "ST11"))

region_7p12.2 <- x[which(x$Region=="chr7:50221511-51009338"),]

out <- split(region_7p12.2, f = region_7p12.2$Index_Rsid)

names(out)
## [1] "rs10277986" "rs11575472" "rs11575500" "rs11763837" "rs6944602" 
## [6] "rs9649738"
#lapply(out, function(x){
#  cs <- credset(x$PP, thr = 0.95)
#  plot(x$Position, x$PP, xlab = "Position", ylab = "PP", pch = 16, main = paste0("Index SNP: ", x$Index_Rsid[1]))
#  points(x$Position[cs$credset], x$PP[cs$credset], col = "blue", pch = 16)
#})

Similarly, \(\textit{IKZF3}\) resides in the 17q12 region (chr17:39747478-40657572). I download the GUESSFM fine-mapping results for this region from Robertson et al. (2020). It looks like there’s evidence for 3 independent signals here. Interestingly, the \(\textit{IKZF3}\) gene regions has been shown to be differentially associated between T1D patients with different age-of-onsets (see Inshaw et al. (2020)).

region_17q12 <- x[which(x$Region=="chr17:39747478-40657572"),]
out <- split(region_17q12, f = region_17q12$Index_Rsid)
names(out)
## [1] "rs2941522"             "rs74252458;rs11464691" "rs870829"

Preliminary analysis

Aim: See if there is a relationship between T1D GWAS \(p\)-values and IKZF1/IKZF3 binding sites (CHiP-Seq peaks). If there is then using cFDR could be helpful to identify new associations by leveraging this auxiliary data.

I’ve found some processed IKZF1 CHiP-Seq data for a lymphoblastoid cell line (human GM12878; see https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE105587). I’m using the GSE105587_ENCFF018NNF_conservative_idr_thresholded_peaks_GRCh38.bed.gz file which contains the conservative IDR thresholded peaks. Note that to calculate the IDR, background counts are used along with data from the 2 replicates (see https://www.encodeproject.org/experiments/ENCSR441VHN/).

ikzf1_chip <- read.table("GSE105587_ENCFF018NNF_conservative_idr_thresholded_peaks_GRCh38.bed")

# https://genome.ucsc.edu/FAQ/FAQformat.html#format12
colnames(ikzf1_chip) <- c("chr","start","end","name","score","strand","signalValue","pValue","qValue","peak")
head(ikzf1_chip)
##    chr     start       end name score strand signalValue pValue  qValue peak
## 1 chr3  15667631  15668267    .   734      .    16.25951     -1 0.57814  318
## 2 chr6 156396171 156396807    .   548      .    16.25955     -1 0.57815  318
## 3 chr2  42975128  42975764    .   974      .    16.26102     -1 0.57819  318
## 4 chr7 121339144 121339780    .   750      .    16.27006     -1 0.57872  318
## 5 chr2  88691253  88691974    .  1000      .    16.27493     -1 0.57901  537
## 6 chr6  21208416  21209052    .   944      .    16.28223     -1 0.57955  318

The \(p\)-values for the peaks are not available (reported as -1), but the \(q\)-values are - for which I plot a histogram below. Note that these do not appear to have been thresholded. Should I threshold these peaks? E.g. to those with \(q\leq 0.01\).

hist(ikzf1_chip$qValue, xlab = "q values (-log10)", main = "Histogram of q-values")
abline(v = -log10(0.01), lty = "dashed", col = "red")


I’ve mapped the Robertson et al. (2020) GWAS SNPs to these peaks using bedtools. Of the 715,023 GWAS SNPs, 24,480 (3.4%) mapped to a peak. I save the GWAS \(p\)-values for those SNPs that reside in a peak and those that do not (note that some peaks overlap).

new_robertson_gwas <- readRDS("robertson2020_hg19.RDS")
res <- read.table("ikzf1_match")

peak_p <- res$V5

not <- which(!new_robertson_gwas$ID %in% res$V4)
no_peak_p <- new_robertson_gwas$pphaseIandII[not]

# replace those with 0 p-val to the min non-zero p-val in whole dataset
no_peak_p[which(no_peak_p==0)]<- min(new_robertson_gwas$pphaseIandII[-which(new_robertson_gwas$pphaseIandII==0)])

It seems that \(p\)-values are generally smaller for those SNPs residing in a peak.

boxplot(-log10(no_peak_p), -log10(peak_p), names = c("Not in peak", "In peak"), ylim = c(0,5), ylab = "-log10(P) (truncated)", main = "Boxplots comparing p-values")

It also seems that using information on the \(q\)-values may be useful:

sig_peaks_p <- res$V5[which(as.numeric(res$V14) > -log10(0.01))]
not_sig_peaks_p <- res$V5[-which(as.numeric(res$V14) > -log10(0.01))]

boxplot(-log10(no_peak_p), -log10(sig_peaks_p), -log10(not_sig_peaks_p), names = c("Not in peak", "In sig peak", "In non-sig peak"), ylim = c(0,5), ylab = "-log10(P) (truncated)", main = "Boxplots comparing p-values\nstratified by q-vals (FDR<0.01)")


I construct a QQ plot to compare the quantiles of the \(p\)-values for those SNPs that reside in a peak and those that do not reside in a peak. It seems that T1D SNPs falling in peaks are enriched for lower T1D GWAS \(p\)-values. Note that there were some SNPs with 0 \(p\)-values in the GWAS data set, I’ve replaced these \(p\)-values to be equal to the minimum non-zero \(p\)-value in whole dataset (\(4.940656e-324\); but note that this is smaller than .Machine$double.xmin=2.225074e-308).

qqplot(-log10(no_peak_p), -log10(peak_p), ylab = "SNPs in peak", xlab = "SNPs not in peak", pch = 16, main = "QQ plot for (-log10) p-vals")
abline(0, 1, lty = "dashed")

I also see whether the distribution of GWAS \(p\)-values differ between SNPs falling in significant or not-significant peaks:

qqplot(-log10(no_peak_p), -log10(sig_peaks_p), ylab = "SNPs in peak (stratified by q-value)", xlab = "SNPs not in peak", pch = 16, main = "QQ plot for (-log10) p-vals\nstratified by significance of peak")
points(qqplot(-log10(no_peak_p), -log10(not_sig_peaks_p), plot.it = FALSE), col = "red", pch = 16)
abline(0, 1, lty = "dashed")
legend(100,70, legend = c("SNPs in significant peak", "SNPs in non-significant peak"), col = c("black","red"), pch = 16)

qqplot(-log10(not_sig_peaks_p), -log10(sig_peaks_p), ylab = "SNPs in significant peak", xlab = "SNPs in non-significant peak", pch = 16, main = "QQ plot for (-log10) p-vals\nof SNPs residing in\nsignificant or not significant peaks")
abline(0, 1, lty = "dashed")


There is also information on the signalValue (“a measurement of overall (usually, average) enrichment for the region”). I investigate whether this provides relevant information about the GWAS \(p\)-values. I do this by slitting up the SNPs residing in peaks by quantile of signalValue.

res$group <- cut(res$V12, breaks = quantile(res$V12))
m <- melt(res[,c("V5","group")],"group")
unique(m$group)
## [1] (16.3,64]      (107,214]      (64,107]       (214,3.72e+03] <NA>          
## Levels: (16.3,64] (64,107] (107,214] (214,3.72e+03]
qqplot(-log10(no_peak_p), -log10(m$value[which(m$group==levels(m$group)[1])]), ylab = "SNPs in peak (stratifed by signal value)", xlab = "SNPs not in peak", pch = 16, main = "Stratified QQ plot for (-log10) p-vals")
abline(0, 1, lty = "dashed")
points(qqplot(-log10(no_peak_p), -log10(m$value[which(m$group==levels(m$group)[2])]), plot.it = FALSE), col = "red", pch = 16)
points(qqplot(-log10(no_peak_p), -log10(m$value[which(m$group==levels(m$group)[3])]), plot.it = FALSE), col = "blue", pch = 16)
points(qqplot(-log10(no_peak_p), -log10(m$value[which(m$group==levels(m$group)[4])]), plot.it = FALSE), col = "green", pch = 16)
legend(200,150, legend = c(levels(m$group)[1], levels(m$group)[2], levels(m$group)[3], levels(m$group)[4]), col = c("black","red","blue","green"), pch = 16, title = "Signal value")

There doesn’t seem to be a systematic difference when splitting up the SNPs in this way. But there is a systematic difference if I split the data up into 2 bins rather than 4, where we see that SNPs residing in peaks with higher signal values tend to have smaller \(p\)-values than those residing in peaks with lower signal values, which tend to have smaller \(p\)-values than those SNPs not residing in a peak.

res$group <- cut(res$V12, breaks = quantile(res$V12, probs = c(0, 0.5, 1)))
m <- melt(res[,c("V5","group")],"group")
unique(m$group)
## [1] (16.3,107]     (107,3.72e+03] <NA>          
## Levels: (16.3,107] (107,3.72e+03]
qqplot(-log10(no_peak_p), -log10(m$value[which(m$group==levels(m$group)[1])]), ylab = "SNPs in peak (stratifed by signal value)", xlab = "SNPs not in peak", pch = 16, main = "Stratified QQ plot for (-log10) p-vals")
abline(0, 1, lty = "dashed")
points(qqplot(-log10(no_peak_p), -log10(m$value[which(m$group==levels(m$group)[2])]), plot.it = FALSE), col = "red", pch = 16)
legend(200,150, legend = c(levels(m$group)[1], levels(m$group)[2]), col = c("black","red"), pch = 16, title = "Signal value")


This analysis (and results from Burren et al. 2014) suggest that leveraging information about IKZF1 binding sites with T1D GWAS \(p\)-values may be useful. Indeed, there is a fairly strong correlation between T1D GWAS \(p\)-values and (i) peak significance (FDR value) (ii) binary indicator of peak (iii) peak signal value.

df <- data.frame("p" = new_robertson_gwas$pphaseIandII, "peak_qval" = 0 ,"q_binary" = 0, "q_signalvalue" = 0)

df$peak_qval[which(new_robertson_gwas$ID %in% res$V4)] <- res$V14[match(new_robertson_gwas[new_robertson_gwas$ID %in% res$V4,"ID"], res$V4)]
cor(df$p, df$peak_qval)
## [1] -0.01282665
df$q_binary[which(new_robertson_gwas$ID %in% res$V4)] <- 1
cor(df$p, df$q_binary)
## [1] -0.01352838
df$q_signalvalue[which(new_robertson_gwas$ID %in% res$V4)] <- res$V12[match(new_robertson_gwas[new_robertson_gwas$ID %in% res$V4,"ID"], res$V4)]
cor(df$p, df$q_signalvalue)
## [1] -0.008700345

cFDR

Using the cFDR framework would enable us to re-weight T1D GWAS \(p\)-values using this auxiliary data.

Binary cFDR could be use to leverage a binary indicator of whether the SNP falls in a IKZF1 chip-seq peak. But the current software is too slow for ~715,000 SNPs.

library(fcfdr)

cfdr_binary_res <- binary_cfdr(p = df$p, q = df$q_binary, chr = new_robertson_gwas$chromosome)

Using flexible cFDR, I could leverage the log transformed signal values for each peak (0 if SNP does not fall in peak). But I need to define an independent subset of SNPs to fit the KDE. To do this I usually use LDAK weighting with European 1000 Genomes reference data, but I’m not sure that holds here as these GWAS \(p\)-values are from a meta-analysis? (Will have similar problem with defining loci, for which I use the same reference data with PLINK).


Good tool to find data: http://remap.univ-amu.fr/target_page/IKZF1:9606)