Flexible cFDR: Bioinformatics application note

T1D GWAS data

There are two options:

Cooper et al. 2017 GWAS. Guille has already processed this, but it contains 9 million SNPs, which would be far too slow for an example.
Onengut Immunochip GWAS. The summary statistics are available from the GWAS Catalog, which enables reproducibility. It contains about 120,000 SNPs, which is more manageable for an example.

I decided to use the Onengut Immunochip data.

Iteration 1 (binary auxiliary data)

To show off binary cFDR, I need to find some relevant binary annotation data. I’ve looked at the “Coding_UCSC”, “non-syn”, “DGF_ENCODE”, “DHS_Trynka” and “DHS_peaks_Trynka” annotations in the baseline LD model. [2860 out of 123,130 GWAS SNPs are not present in the baseline LD model].
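A minimal sketch of how the overlap between the GWAS SNPs and the annotation file could be counted. The rsIDs below are made up, `split_by_annotation` is a hypothetical helper, and matching on rsID (rather than chromosome and position) is an assumption.

```python
def split_by_annotation(gwas_ids, annotation_ids):
    """Split GWAS SNP IDs into those present in / absent from the annotation file."""
    annotation_ids = set(annotation_ids)  # set lookup keeps this fast for large files
    present = [s for s in gwas_ids if s in annotation_ids]
    absent = [s for s in gwas_ids if s not in annotation_ids]
    return present, absent

# Toy IDs (made up for illustration)
gwas_ids = ["rs1", "rs2", "rs3", "rs4"]
annotation_ids = ["rs2", "rs3", "rs9"]

present, absent = split_by_annotation(gwas_ids, annotation_ids)
print(len(absent), "of", len(gwas_ids), "GWAS SNPs lack an annotation")
```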

The coding annotation would have been my first choice, but only a tiny proportion of SNPs are coding (5323; 4%) and the correlation with \(p\) actually comes out positive (i.e. coding SNPs have higher \(p\)). [Note: In the big Cooper GWAS the correlation is fairly strongly negative, so the annotation is “correct”; I think the discrepancy comes down to the small Immunochip dataset].
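The sign check described above can be sketched as a Pearson correlation between the 0/1 annotation and \(p\), which for a binary variable is the point-biserial correlation. The toy values are made up and `pearson` is a hypothetical helper, not the code used for the numbers in these notes.

```python
import math

def pearson(x, y):
    """Pearson correlation; with a 0/1 variable this is the point-biserial correlation."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy data (made up): 1 = SNP has the annotation, 0 = it does not
coding = [1, 1, 0, 0, 0, 0]
p_values = [0.001, 0.02, 0.4, 0.7, 0.15, 0.9]

r = pearson(coding, p_values)
# A negative r means annotated SNPs tend to have smaller p, the direction wanted here
print(round(r, 3))
```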

Alternatively, 1440 (1.2%) are non-synonymous, 28294 (23%) are open in the DGF ENCODE annotation, 33282 (27%) are open in DHS Trynka and 22827 (19%) are in open peaks in the Trynka data. [For DHS_Trynka they added a 100bp window around the ChIP-seq peaks from DHS_peaks_Trynka].

I stick with the DGF ENCODE annotation:

“Digital genomic footprint annotations were obtained from ENCODE and post-processed by Gusev et al.”
“DNaseI digital genomic footprinting (DGF) regions were downloaded for 57 cell lines (see http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUwDgf/). All regions from the narrow-peak classification were then merged into a single DGF annotation.”

Iteration 1 results

Accordingly, SNPs in open chromatin regions (averaged across many cell types) are given slightly smaller \(v\)-values and those in closed regions are given slightly larger \(v\)-values.

Iteration 2 (continuous auxiliary data)

I consider 3 pieces of auxiliary data which seem relevant for T1D from the literature.

H3K27ac fold change values in primary T helper cells from peripheral blood (CD4+_CD25-_Th_Primary_Cells)
- “Credible sets of disease-associated variants are specifically enriched in immune cell accessible chromatin, particularly in CD4+ effector T cells” (https://www.biorxiv.org/content/10.1101/2020.06.19.158071v1.full.pdf).
DNase fold change values in primary T cells from peripheral blood (CD3_Primary_Cells_Peripheral_UW)
- I thought this would be interesting because a recent clinical trial found that an anti-CD3 antibody delays progression to T1D in high-risk participants (https://www.nejm.org/doi/full/10.1056/NEJMoa1902226).
- The CD3 protein complex is involved in activating cytotoxic T cells (CD8+) and effector T cells (CD4+), which are thought to be key players in the destruction of beta cells in T1D patients.
- https://www.youtube.com/watch?v=W0d5ZDu9dgE
DNase fold change values in primary natural killer cells from peripheral blood (CD56_Primary_Cells)
- “Type 1 Diabetes and Its Multi-Factorial Pathogenesis: The Putative Role of NK Cells” (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5877655/).

Side note: It may also be good to look at CD8+ T cells, as these have been shown to be the cell type most likely to attack the beta cells. But I'm not sure whether it's memory or naive CD8+ cells.

The data is long tailed and takes unique values, so I take logs and add some noise.
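A minimal sketch of this transformation, assuming the noise is small Gaussian jitter intended to separate tied values. The offset, the noise standard deviation and the toy fold-change values are all assumptions; the notes do not record what was actually used.

```python
import math
import random

def log_jitter(values, noise_sd=0.01, offset=1.0, seed=0):
    """Log-transform long-tailed values and add small Gaussian noise.

    offset (assumed) keeps log() defined at zero; noise_sd (assumed) controls
    how far the jitter moves each value. A fixed seed keeps it reproducible.
    """
    rng = random.Random(seed)
    return [math.log(v + offset) + rng.gauss(0.0, noise_sd) for v in values]

# Toy fold-change values (made up): repeated values and one long-tail value
fold_change = [0.0, 0.0, 0.5, 0.5, 2.0, 40.0]
q = log_jitter(fold_change)

print(len(set(q)) == len(q))  # True when the jitter has separated the repeated values
```

Keeping the jitter small relative to the spread of the log values means the ordering of distinct values is very unlikely to change.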

The relationship is nicely monotonic for all three of these annotations.
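One rough way to check monotonicity, assuming "the relationship" means mean \(-\log_{10}(p)\) as a function of the auxiliary value: bin the SNPs by the auxiliary value and see whether the bin means move in one direction. The data, bin count and helper names are hypothetical.

```python
import math

def binned_means(q, p, n_bins=4):
    """Mean -log10(p) within roughly equal-sized bins of q, ordered from low to high q."""
    pairs = sorted(zip(q, p))
    size = len(pairs) // n_bins
    means = []
    for i in range(n_bins):
        # The last bin absorbs any remainder
        end = (i + 1) * size if i < n_bins - 1 else len(pairs)
        chunk = pairs[i * size:end]
        means.append(sum(-math.log10(pv) for _, pv in chunk) / len(chunk))
    return means

def is_monotonic(xs):
    """True if xs is entirely non-decreasing or entirely non-increasing."""
    inc = all(a <= b for a, b in zip(xs, xs[1:]))
    dec = all(a >= b for a, b in zip(xs, xs[1:]))
    return inc or dec

# Toy data (made up): larger auxiliary values go with smaller p
q = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
p = [0.9, 0.8, 0.5, 0.4, 0.1, 0.05, 0.01, 0.001]

means = binned_means(q, p, n_bins=4)
print(is_monotonic(means))
```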

I think it would be nice to stick with chromatin accessibility data. I.e. state that iteration 1 adjusts based on general chromatin accessibility across many cell types and that iteration 2 focusses in on a specific cell type which is known to be relevant for T1D. For now, I choose CD3+ DNase data.

I am currently running LDAK to get an independent subset of T1D SNPs to use in the flexible cFDR method.

To consider

Application notes are approximately 1000 words + 1 figure.
Stated in Bioinformatics about application notes: “We consider statistical methodology only when there is significant bioinformatics content such as new algorithms or software. We do not consider software that implements methods recently published elsewhere by the same authors”.
Binary cFDR implements the leave-one-chromosome-out method, whereas flexible cFDR uses LD weights. Is this ok? Both are quite slow (especially binary cFDR).
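For reference, the general leave-one-chromosome-out idea can be sketched as follows: anything estimated from the data is fitted on SNPs from all other chromosomes and then applied to the held-out chromosome. This is a generic illustration with a hypothetical `loco_splits` helper and made-up chromosome labels, not the binary cFDR implementation.

```python
from collections import defaultdict

def loco_splits(chromosomes):
    """Yield (held_out_chromosome, fit_indices, apply_indices) for each chromosome."""
    by_chr = defaultdict(list)
    for idx, chrom in enumerate(chromosomes):
        by_chr[chrom].append(idx)
    for chrom, apply_idx in by_chr.items():
        # Fit on every SNP that is NOT on the held-out chromosome
        fit_idx = [i for i, c in enumerate(chromosomes) if c != chrom]
        yield chrom, fit_idx, apply_idx

# Toy chromosome labels (made up), one per SNP
chroms = [1, 1, 2, 2, 2, 6]
splits = list(loco_splits(chroms))

for chrom, fit_idx, apply_idx in splits:
    print(chrom, len(fit_idx), len(apply_idx))
```

Each SNP appears in exactly one held-out set, so the number of fits grows with the number of chromosomes, which is one reason an approach of this kind can be slow.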