18th November

T cell development

Source: https://www.youtube.com/watch?v=JeV-HuPq7CI; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5865843/

T cell precursors are present in the bone marrow (“bone marrow-derived T cell progenitors”).
Chemokines from the thymus help these T cell precursors to migrate to the thymus.
Once in the thymus, thymic factors stimulate specific genes in the T cell:
1. Recombination-activating genes produce RAG1 and RAG2 proteins which shuffle the DNA to derive a specific TCR on the surface of the T cell.
2. Other subsets of genes produce cluster differentiation (CD) proteins on the surface of the T cell.
The T cell precursors then undergo positive selection whereby those that are able to recognise the self-MHC are kept whereas those that cannot are depleted (by apoptosis).
The T cell precursors then undergo negative selection whereby those that recognise and bind to self antigens presented by the MHC are depleted (by apoptosis) as these could destroy our own body cells and lead to autoimmune disorders. This is the process of the elimination of potentially auto-reactive T cells.
- Note that this is where the VNTRs in the $\textit{INS}$ gene region that are associated with T1D come in. The class I VNTR is the shortest and results in less insulin being produced in the thymus. This means that there are not enough self insulin antigens avaliable to be presented to the T cells in the negative selection stage. Meaning that auto-reacive T cells for insulin may not be sufficiently depleted in this step. Similarly, the class III VNTR is the longest and is protective for T1D because more insulin is being produced so more auto-reactive T cell negative selection can occur (and ultimately the CD8 T cells won’t identify the beta cells in the pancreas as self; see Pugliese 2005).
The T cells that have survived these selection processes travel to the secondary lymphoid organs (spleen and lymph nodes) and eventually become mature (but naive) CD4+ or CD8+ T cells. These naive cells reticulate between the blood and the lymph system.

T cell activation + differentiation

Source: https://en.wikipedia.org/wiki/T_helper_cell

During an immune response, an APC (typically a dendritic cell) engulfs the foreign antigen and travels from the infection site to the lymph nodes. Once at the lymph nodes, the APC presents the foreign antigen to the T cells (class II MHC activates CD4+ and class I MHC activates CD8+).
There are two signals required for activation:
1. TCR/MHC binding: The TCR-CD3 complex on the T cell binds with a processed antigenic peptide bound to class-I or class-II MHC molecules on the surface of the APC.
2. Co-stimulatory molecules ensure that the T cell is responding to a foreign antigen (i.e. not self-antigens). E.g. CD28 on the CD4+ T cell binds to CD80 or CD86 on the APC.
The T cells can now proliferate and differentiate into memory or effector cells.
Cytokines determine the differentiation of the activated T cells. Specifically, cytokines bind to their specific receptors and activate the JAK/STAT pathways which results in the translocation of STAT factors into the nucleus where they activate expression of “lineage-defining” transcription factors, which push cells to different fates - e.g. Treg cells.
This is where the T1D associations within the $\textit{IL2RA}$ gene region come in. $\textit{IL2RA}$ encodes a sub-unit of the IL-2 receptor. This receptor is critical for the development of Tregs (which in turn regulate the immune response) and accordingly, a mutation which results in higher $\textit{IL2RA}$ expression is protective against T1D (more/better regulation of the immune response).
I.e. 3 factors are reqipred for the differentiation of effector CD4+ T cells:
1. TCR/MHC binding
2. Co-stimulatory receptors
3. Cytokines.

Ikaros

The Ikaros family of transcription factors consists of 5 members: Ikaros (encoded by the gene $\textit{IKZF1}$), Helios ($\textit{IKZF2}$), Aiolos ($\textit{IKZF3}$), Eos ($\textit{IKZF4}$) and Pegasus ($\textit{IKZF5}$). They can both positively and negatively regulate gene expression through direct DNA interaction and by forming transcriptional complexes with other proteins (Powell et al. 2019). The IKFL1-derived lineage consists of Ikaros and Aiolos, which been linked to many autoimmune disorders.

$\textit{IKZF1}$: Widely expressed amongst blood cell types. Involved in the hematopoietic stem cell development (Yoshida et al. 2006) and in lymphoid development. $\textit{IKZF1}$ mice knockouts (without DNA-binding exons 3-5) do not have T, B, NK nor dendritic cells, but the myeloid cell types are unaffected (Schmitt et al. 2002). Thought to negatively regulate Th1 differentiation.
$\textit{IKZF3}$: Expression largely restricted to T and B cells. The $\textit{IKZF3}$ knockout mouse is characterised by the chronic activation of B cells with increased levels of autoantibodies and glomerulonephritis (inflammation of glomeruli in the kidney) (Wang et al. 1998; Vyse and Cunninghame Graham 2020).

Vyse and Cunninghame Graham (2020) used trans-ancestral fine-mapping and functional genomic data to narrow down the sets of potentially causal variants for SLE in the IKZF1 and IKZF3 gene regions.

Powell et al. (2019) review the role of Ikaros transcription factors in the differentitation of effector CD4+ T helper cell subsets.

Lyon de Ana et al. (2019) created an Ikaros conditional knockout mouse and found that mature CD4 T cells could not attain the iTreg cell fate. Moreover, the knockout showed enhanced expression of some proinflammatory cytokines which results in a phenotype that is associated with autoimmunity and pathological inflammation, suggesting that Ikaros acts as a suppressor of this phenotype.

Cusanovich et al. (2014) knocked down 59 TFs (including IKZF3) in a lymphoblastoid cell line to see which genes were differentially expressed. For the IKZF3 knockdown experiment, the $p$-values for a likelihood-ratio test comparing the knockdown samples to the controls for each gene are shown below (sorted by $p$-value):

library(data.table)
x <- fread("TableS3.txt")

x[order(x$IKZF3_ENSG00000264198_DE_p),c("Symbol","IKZF3_ENSG00000264198_DE_p")]

##         Symbol IKZF3_ENSG00000264198_DE_p
##    1: LGALS3BP               1.162717e-10
##    2:    CXCL9               1.273708e-10
##    3:    FABP4               6.191764e-10
##    4:   PBXIP1               1.617614e-09
##    5:   IFITM3               1.650501e-09
##   ---                                    
## 8868:   MIR300                         NA
## 8869: VTRNA1-1                         NA
## 8870:  MIR1909                         NA
## 8871:   MIR342                         NA
## 8872:  MIR1224                         NA

The interpretation of this is that (e.g.) $\textit{LGALS3BP}$ is significantly differentially expressed between controls and IKZF3 TF knockouts, suggesting that IKZF3 may be binding in such a way that significantly enhancer/decreases $\textit{LGALS3BP}$ expression.

Burren et al. 2014 used this data in their VSEAMS method to find that the genes perturbed by IKZF3 (+/- 200kb around the TSS) in the knock-down experiment were enriched for association with T1D. This suggests that some T1D associated variants may act through differential IKZF3 TF binding.

This is a sort of “backwards” approach as it involves testing SNPs already associated with T1D. Now that we know that T1D SNPs are enriched in genomic regions perturbed by IKZF3, perhaps we could incorporate auxiliary data relating to IKZF3 to help us prioritise SNPs using a more “forward” approach (e.g. cFDR).

Fine-mapping results

$\textit{IKZF1}$ resides in the 7p12.2 region (chr7:50221511-51009338). I download the GUESSFM fine-mapping results for this region from Robertson et al. (2020). It seems that there is evidence for 6 independent signals in this region (although I’m not sure what PPsum is - is it possible to form credible sets for each of these independent signals?)

library(readxl)

x <- data.frame(read_excel("robertson_supp.xlsx", sheet = "ST11"))

region_7p12.2 <- x[which(x$Region=="chr7:50221511-51009338"),]

out <- split(region_7p12.2, f = region_7p12.2$Index_Rsid)

names(out)

## [1] "rs10277986" "rs11575472" "rs11575500" "rs11763837" "rs6944602" 
## [6] "rs9649738"

#lapply(out, function(x){
#  cs <- credset(x$PP, thr = 0.95)
#  plot(x$Position, x$PP, xlab = "Position", ylab = "PP", pch = 16, main = paste0("Index SNP: ", x$Index_Rsid[1]))
#  points(x$Position[cs$credset], x$PP[cs$credset], col = "blue", pch = 16)
#})

Similarly, $\textit{IKZF3}$ resides in the 17q12 region (chr17:39747478-40657572). I download the GUESSFM fine-mapping results for this region from Robertson et al. (2020). It looks like there’s evidence for 3 independent signals here. Interestingly, the $\textit{IKZF3}$ gene regions has been shown to be differentially associated between T1D patients with different age-of-onsets (see Inshaw et al. (2020)).

region_17q12 <- x[which(x$Region=="chr17:39747478-40657572"),]
out <- split(region_17q12, f = region_17q12$Index_Rsid)
names(out)

## [1] "rs2941522"             "rs74252458;rs11464691" "rs870829"

Preliminary analysis

Aim: See if there is a relationship between T1D GWAS $p$-values and IKZF1/IKZF3 binding sites (CHiP-Seq peaks). If there is then using cFDR could be helpful to identify new associations by leveraging this auxiliary data.

I’ve found some processed IKZF1 CHiP-Seq data for a lymphoblastoid cell line (human GM12878; see https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE105587). I’m using the GSE105587_ENCFF018NNF_conservative_idr_thresholded_peaks_GRCh38.bed.gz file which contains the conservative IDR thresholded peaks. Note that to calculate the IDR, background counts are used along with data from the 2 replicates (see https://www.encodeproject.org/experiments/ENCSR441VHN/).

ikzf1_chip <- read.table("GSE105587_ENCFF018NNF_conservative_idr_thresholded_peaks_GRCh38.bed")

# https://genome.ucsc.edu/FAQ/FAQformat.html#format12
colnames(ikzf1_chip) <- c("chr","start","end","name","score","strand","signalValue","pValue","qValue","peak")
head(ikzf1_chip)

##    chr     start       end name score strand signalValue pValue  qValue peak
## 1 chr3  15667631  15668267    .   734      .    16.25951     -1 0.57814  318
## 2 chr6 156396171 156396807    .   548      .    16.25955     -1 0.57815  318
## 3 chr2  42975128  42975764    .   974      .    16.26102     -1 0.57819  318
## 4 chr7 121339144 121339780    .   750      .    16.27006     -1 0.57872  318
## 5 chr2  88691253  88691974    .  1000      .    16.27493     -1 0.57901  537
## 6 chr6  21208416  21209052    .   944      .    16.28223     -1 0.57955  318

The $p$-values for the peaks are not available (reported as -1), but the $q$-values are - for which I plot a histogram below. Note that these do not appear to have been thresholded. Should I threshold these peaks? E.g. to those with $q\leq 0.01$.

hist(ikzf1_chip$qValue, xlab = "q values (-log10)", main = "Histogram of q-values")
abline(v = -log10(0.01), lty = "dashed", col = "red")

I’ve mapped the Robertson et al. (2020) GWAS SNPs to these peaks using bedtools. Of the 715,023 GWAS SNPs, 24,480 (3.4%) mapped to a peak. I save the GWAS $p$-values for those SNPs that reside in a peak and those that do not (note that some peaks overlap).

new_robertson_gwas <- readRDS("robertson2020_hg19.RDS")
res <- read.table("ikzf1_match")

peak_p <- res$V5

not <- which(!new_robertson_gwas$ID %in% res$V4)
no_peak_p <- new_robertson_gwas$pphaseIandII[not]

# replace those with 0 p-val to the min non-zero p-val in whole dataset
no_peak_p[which(no_peak_p==0)]<- min(new_robertson_gwas$pphaseIandII[-which(new_robertson_gwas$pphaseIandII==0)])

It seems that $p$-values are generally smaller for those SNPs residing in a peak.

boxplot(-log10(no_peak_p), -log10(peak_p), names = c("Not in peak", "In peak"), ylim = c(0,5), ylab = "-log10(P) (truncated)", main = "Boxplots comparing p-values")

It also seems that using information on the $q$-values may be useful:

sig_peaks_p <- res$V5[which(as.numeric(res$V14) > -log10(0.01))]
not_sig_peaks_p <- res$V5[-which(as.numeric(res$V14) > -log10(0.01))]

boxplot(-log10(no_peak_p), -log10(sig_peaks_p), -log10(not_sig_peaks_p), names = c("Not in peak", "In sig peak", "In non-sig peak"), ylim = c(0,5), ylab = "-log10(P) (truncated)", main = "Boxplots comparing p-values\nstratified by q-vals (FDR<0.01)")

I construct a QQ plot to compare the quantiles of the $p$-values for those SNPs that reside in a peak and those that do not reside in a peak. It seems that T1D SNPs falling in peaks are enriched for lower T1D GWAS $p$-values. Note that there were some SNPs with 0 $p$-values in the GWAS data set, I’ve replaced these $p$-values to be equal to the minimum non-zero $p$-value in whole dataset ($4.940656e-324$; but note that this is smaller than .Machine$double.xmin=2.225074e-308).

qqplot(-log10(no_peak_p), -log10(peak_p), ylab = "SNPs in peak", xlab = "SNPs not in peak", pch = 16, main = "QQ plot for (-log10) p-vals")
abline(0, 1, lty = "dashed")

I also see whether the distribution of GWAS $p$-values differ between SNPs falling in significant or not-significant peaks:

qqplot(-log10(no_peak_p), -log10(sig_peaks_p), ylab = "SNPs in peak (stratified by q-value)", xlab = "SNPs not in peak", pch = 16, main = "QQ plot for (-log10) p-vals\nstratified by significance of peak")
points(qqplot(-log10(no_peak_p), -log10(not_sig_peaks_p), plot.it = FALSE), col = "red", pch = 16)
abline(0, 1, lty = "dashed")
legend(100,70, legend = c("SNPs in significant peak", "SNPs in non-significant peak"), col = c("black","red"), pch = 16)

qqplot(-log10(not_sig_peaks_p), -log10(sig_peaks_p), ylab = "SNPs in significant peak", xlab = "SNPs in non-significant peak", pch = 16, main = "QQ plot for (-log10) p-vals\nof SNPs residing in\nsignificant or not significant peaks")
abline(0, 1, lty = "dashed")

There is also information on the signalValue (“a measurement of overall (usually, average) enrichment for the region”). I investigate whether this provides relevant information about the GWAS $p$-values. I do this by slitting up the SNPs residing in peaks by quantile of signalValue.

res$group <- cut(res$V12, breaks = quantile(res$V12))
m <- melt(res[,c("V5","group")],"group")
unique(m$group)

## [1] (16.3,64]      (107,214]      (64,107]       (214,3.72e+03] <NA>          
## Levels: (16.3,64] (64,107] (107,214] (214,3.72e+03]

qqplot(-log10(no_peak_p), -log10(m$value[which(m$group==levels(m$group)[1])]), ylab = "SNPs in peak (stratifed by signal value)", xlab = "SNPs not in peak", pch = 16, main = "Stratified QQ plot for (-log10) p-vals")
abline(0, 1, lty = "dashed")
points(qqplot(-log10(no_peak_p), -log10(m$value[which(m$group==levels(m$group)[2])]), plot.it = FALSE), col = "red", pch = 16)
points(qqplot(-log10(no_peak_p), -log10(m$value[which(m$group==levels(m$group)[3])]), plot.it = FALSE), col = "blue", pch = 16)
points(qqplot(-log10(no_peak_p), -log10(m$value[which(m$group==levels(m$group)[4])]), plot.it = FALSE), col = "green", pch = 16)
legend(200,150, legend = c(levels(m$group)[1], levels(m$group)[2], levels(m$group)[3], levels(m$group)[4]), col = c("black","red","blue","green"), pch = 16, title = "Signal value")

There doesn’t seem to be a systematic difference when splitting up the SNPs in this way. But there is a systematic difference if I split the data up into 2 bins rather than 4, where we see that SNPs residing in peaks with higher signal values tend to have smaller $p$-values than those residing in peaks with lower signal values, which tend to have smaller $p$-values than those SNPs not residing in a peak.

res$group <- cut(res$V12, breaks = quantile(res$V12, probs = c(0, 0.5, 1)))
m <- melt(res[,c("V5","group")],"group")
unique(m$group)

## [1] (16.3,107]     (107,3.72e+03] <NA>          
## Levels: (16.3,107] (107,3.72e+03]

qqplot(-log10(no_peak_p), -log10(m$value[which(m$group==levels(m$group)[1])]), ylab = "SNPs in peak (stratifed by signal value)", xlab = "SNPs not in peak", pch = 16, main = "Stratified QQ plot for (-log10) p-vals")
abline(0, 1, lty = "dashed")
points(qqplot(-log10(no_peak_p), -log10(m$value[which(m$group==levels(m$group)[2])]), plot.it = FALSE), col = "red", pch = 16)
legend(200,150, legend = c(levels(m$group)[1], levels(m$group)[2]), col = c("black","red"), pch = 16, title = "Signal value")

This analysis (and results from Burren et al. 2014) suggest that leveraging information about IKZF1 binding sites with T1D GWAS $p$-values may be useful. Indeed, there is a fairly strong correlation between T1D GWAS $p$-values and (i) peak significance (FDR value) (ii) binary indicator of peak (iii) peak signal value.

df <- data.frame("p" = new_robertson_gwas$pphaseIandII, "peak_qval" = 0 ,"q_binary" = 0, "q_signalvalue" = 0)

df$peak_qval[which(new_robertson_gwas$ID %in% res$V4)] <- res$V14[match(new_robertson_gwas[new_robertson_gwas$ID %in% res$V4,"ID"], res$V4)]
cor(df$p, df$peak_qval)

## [1] -0.01282665

df$q_binary[which(new_robertson_gwas$ID %in% res$V4)] <- 1
cor(df$p, df$q_binary)

## [1] -0.01352838

df$q_signalvalue[which(new_robertson_gwas$ID %in% res$V4)] <- res$V12[match(new_robertson_gwas[new_robertson_gwas$ID %in% res$V4,"ID"], res$V4)]
cor(df$p, df$q_signalvalue)

## [1] -0.008700345

cFDR

Using the cFDR framework would enable us to re-weight T1D GWAS $p$-values using this auxiliary data.

Binary cFDR could be use to leverage a binary indicator of whether the SNP falls in a IKZF1 chip-seq peak. But the current software is too slow for ~715,000 SNPs.

library(fcfdr)

cfdr_binary_res <- binary_cfdr(p = df$p, q = df$q_binary, chr = new_robertson_gwas$chromosome)

Using flexible cFDR, I could leverage the log transformed signal values for each peak (0 if SNP does not fall in peak). But I need to define an independent subset of SNPs to fit the KDE. To do this I usually use LDAK weighting with European 1000 Genomes reference data, but I’m not sure that holds here as these GWAS $p$-values are from a meta-analysis? (Will have similar problem with defining loci, for which I use the same reference data with PLINK).

Good tool to find data: http://remap.univ-amu.fr/target_page/IKZF1:9606)