4th March

Introduction

Aim: Test whether certain pre-selected SNP pairs are interacting to implicate T1D.

Most SNPs in my analysis are imputed, meaning that we don’t have exact categorical genotypes but instead have genotype probabilities (posterior probabilities from IMPUTE2). From these, I can calculate the expected genotype for each individual (although even at this initial step I am throwing away information about the posterior probability distribution).

The most natural method to test for interaction is to regress the expected genotypes at the two SNPs plus their interaction term on T1D case/ control status, and assess the significance of the interaction term. I could also include a covariate for study, so that a second step is not required for the meta-analysis between studies.

Alternatively, I could go down the case-only test for interaction route which potentially has greater power. In this method, we exploit the idea that if the genotypes at the two SNPs are not correlated in the general population, then a correlation between genotypes in cases may indicate SNP interaction. The simplest approach to test for association between SNP genotypes in cases is to use a \(\chi^2\) test. However, this requires the data to be nominal and therefore I must round the expected genotypes so that the only values they can take is 0, 1 or 2. This test has 4 degrees of freedom and may therefore be too low powered to detect interactions. Moreover, a second step is required to meta-analyse the \(\chi^2\) statistics from the two studies.

The Cochran-Mantel-Haenszel (CMH) test is essentially an extention of the \(\chi^2\) test for multiple contingency tables (one for each study). It tests for association between the genotypes at the two SNPs whilst controlling for the stratification by study. This method therefore incorporates the meta-analysis step but requires rounded genotypes and is still low powered, requiring 4 degrees of freedom.

Both the standard \(\chi^2\) test and the CMH test assume that the data is nominal. That is, it assumes a genotype model in that the levels of the genotype vector are not functions of eachother. If we instead assume an additive, dominant or recessive model then the genotypes are now ordinal variables. For example, assuming an additive model would mean that the effect of genotype 2 is the twice the effect of genotype 1, and this is what is generally assumed in GWAS.

The generalised CMH test (Landis et al. 1978) can be used when considering ordinal data types. Specifically, the CMH correlation statistic can be used when considering both factors as ordinal. This tests the null hypothesis that there is no correlation between genotypes at the two SNPs, assuming an additive model.

However, all the CMH tests require categorical data and thus the expected genotypes must be rounded to the nearest integer. This throws away valuable information on the genotype posterior probability distributions. A formal way to deal with the additional variability that exists due to our uncertain genotypes is multiple imputation. Briefly, multiple imputation methods involve sampling many data sets of genotypes that are compatible with the genotype posterior probability distributions and analysing each of these data sets seperately, as if they are what was observed. The method then jointly analyses the outputs and provides summary measures that account for the variability between samples. The R package, mice (https://cran.r-project.org/web/packages/mice/mice.pdf) provides a tool to run this type of analysis.

Another approach that avoids rounding is a meta-analysis on the correlations between expected genotypes using a fixed- or random- effects model. Briefly, a fixed-effect model would assume that there is a true underlying correlation between the genotypes, and that any deviations from this true value are due to estimation error. On the other hand, the random-effects model would assume that each study has its own true underlying correlation and thus aims to estimate both these study-specific true correlations and the “grand” correlation underlying all studies. Since we only have two studies, and estimating all the quantities in a random-effects model may be difficult, I decide to use a fixed-effect analysis. Note that these models obviously treat the data as ordinal (or continuous if using expected genotypes), assuming an additive model.

Source: https://www.meta-analysis.com/downloads/Intro_Models.pdf

Method

I have screened the 21 prioritised SNPs (those with T1D \(p<0.05\) and that overlap a predicted IKZF1 binding site and a ChIP-seq peak) based on the regional Manhattan plots and the functional genomic tracks. For each of these, I run a fixed effect meta-analysis of correlations between expected genotypes in cases and compare this to that for controls (using metacor). For now, I exclude SNPs residing in the MHC.

I used the observed genotypes for directly genotyped SNPs. For imputed SNPs, I recorded the quality of the imputation (“info” score) and searched for any proxy SNPs that were directly genotyped on the array. I used the LDproxy tool to download lists of SNPs within 500-kb and with \(r^2>0.5\) with the SNP of interest. I then searched through the genotype data to see if any of these SNPs were directly genotyped on the array. If a proxy SNP was directly genotyped on the array and had an \(r^2\) value with the SNP of interest greater than the imputed info score for that SNP, then I use the observe genotype for the proxy SNP in the analysis. If not, then I used the expected genotypes calculated from the imputed genotype posterior probabilities.

Note that using the expected genotypes does not properly account for variation in the genotype posterior probabilties. Therefore, if the above analysis gives promising results then I could formalise the method to use multiple imputation (sample data, use CMH test of non-zero correlation, convert \(\chi^2\) statistics using the Wilson-Hilferty transformation and then combine using the multiple imputation method - see Lu 2020).

Results

My initial results are shown below. Of the 21 SNPs, I only have results for 12 because 7 reside in the MHC, rs955704300 (1:40862492 CCT/C) is not a 1000G SNP and rs117701653 (2:203764072 A/C) is not genotyped or imputed and does not have a proxy SNP.

snp	chr	pos	a0	a1	T1D_p	HINT20	aCD4	LCL1	LCL2	t1dgc_type	t1dgc_info	t1dgc_certainty	t1dgc_proxy	t1dgc_proxy_r2	wtccc_info	wtccc_certainty	wtccc_proxy	wtccc_proxy_r2	controls_t1dgc_cor	controls_wtccc_cor	controls_p	cases_t1dgc_cor	cases_wtccc_cor	cases_p
rs10246505	7	127912071	C	G	0.0235307	309	0	1	0	0	0.999	1.000	rs3808081	1.0000	0.999	1.000	rs3808081	1.0000	0.0210157	0.0181839	0.0918578	-0.0080198	-0.0022414	0.6374529
rs10460586	2	85328154	G	A	0.0203409	668	1	1	1	0	NA	NA	rs4459734	0.9377	0.959	0.974	NA	NA	-0.0081690	0.0560797	0.0727807	0.0181256	-0.0259702	0.7789847
rs11582679	1	25030876	C	T	0.0455003	897	1	1	1	0	0.993	0.998	rs12403541	0.6907	0.995	0.999	rs12027713	0.9732	0.0067088	0.0024699	0.6824785	-0.0188124	-0.0156085	0.1725153
rs1243654	14	20609317	G	A	0.0141063	520	1	0	0	2	1.000	1.000	rs12436543	0.7529	0.694	0.864	rs12436592	0.5116	-0.0103899	-0.0171955	0.2499162	0.0064128	0.0303866	0.2749063
rs1335540	10	26438309	T	A	0.0340060	564	1	1	1	0	0.999	1.000	rs2992333	1.0000	0.991	0.994	rs1335552	0.9711	0.0286068	-0.0197590	0.5661457	0.0043078	-0.0274869	0.6422847
rs1534430	2	12504610	C	T	0.0000008	229	1	1	1	0	NA	NA	rs10929817	0.8480	0.857	0.913	NA	NA	-0.0069979	-0.0083184	0.5172122	0.0066353	-0.0154458	0.9626553
rs1989245	4	10584191	A	G	0.0081652	277	0	1	1	0	1.000	1.000	rs2014303	1.0000	0.995	0.998	rs17467609	1.0000	-0.0117500	-0.0057389	0.4407216	-0.0415417	-0.0339743	0.0026859
rs2241351	19	18322415	T	C	0.0017966	237	1	0	0	2	1.000	1.000	rs2241351	1.0000	0.803	0.940	NA	0.0000	0.0066006	0.0204019	0.2726678	-0.0181472	0.0140435	0.5566315
rs2722707	19	44304844	C	T	0.0401083	362	1	1	1	0	0.999	1.000	rs1233434	1.0000	0.999	1.000	rs1233462	0.9907	0.0169538	-0.0175288	0.9084431	-0.0100726	-0.0026312	0.5571686
rs34723737	21	42415456	G	A	0.0056756	437	1	1	1	0	0.981	0.996	rs3788014	0.9488	0.916	0.985	NA	0.0000	0.0062752	-0.0106596	0.9052837	0.0383607	-0.0236408	0.1633423
rs71607410	4	87220060	T	C	0.0170970	261	1	1	1	0	0.992	0.998	rs13145898	1.0000	0.978	0.994	rs13143182	0.7722	-0.0116476	-0.0073034	0.4084051	-0.0047054	0.0082385	0.9700619
rs74684534	15	73951273	C	T	0.0117355	374	0	0	1	0	0.889	0.986	NA	0.0000	0.707	0.972	NA	0.0000	0.0098736	-0.0257873	0.5923466	-0.0013154	0.0032367	0.9897262

Caption: Table showing the results for my case-only interaction analysis of 12 SNPs with rs6944602 (the likely CV in the IKZF1 gene region). T1D_p is the \(p\)-value from the T1D GWAS (Cooper et al. 2017) and HINT20 is the score using footprints derived from the HINT algorithm with seed length 20 for the predicted IKZF1 binding site housing the SNP of interest (the higher the score the more likely it is to represent a true binding site). For SNP prioritisation, I selected SNPs with T1D_p\(<0.05\) and with a HINT20 score \(>200\) (as in Funk et al. 2020). aCD4, LCL1 and LCL2 columns are binary indicators for whether the SNP resides in a ChIP-seq peak in activated CD4+ T cells or LCLs (encode accessions: LCL1: ENCSR441VHN and LCL2: ENCSR874AFU) respectively. *_type refers to whether the SNP was directly genotyped (=2) or not (=0). *_info and *_certainty are the imputation info and certainty statistics. *_proxy is the best proxy SNP which was directly genotyped in the study, with *_proxy_r2 showing the \(r^2\) value with the SNP of interest. *_cor is the Pearson’s correlation coefficient for case and control samples in each study, as specified and *_p is the corresponding \(p\)-value from the meta-analysis using a fixed-effect model.

rs1989245

The only SNP which looks to be potentially interesting is rs1989245 (4:10584191 A>/G). This SNP resides in a predicted IKZF1 binding site in LCLs and is also found in ChIP-seq peaks in seperate studies conducted in LCLs. Whilst not directly genotyped in T1DGC, this SNP was imputed to very high quality (info=1) and so the imputed genotypes are used. A proxy SNP which was directly genotyped and in perfect LD with rs1989245 was used in the WTCCC data set.

The correlation between control sample genotypes is relatively low (T1DGC \(r=-0.0117500\), WTCCC \(r=-0.0057389\)) and the meta-analysis \(p\)-value suggests no evidence to reject the null of no correlation in control samples (\(p=0.44\)). However, the correlation coefficients are much larger in case samples (T1DGC \(r=-0.0415417\), WTCCC \(r=-0.0339743\)) and the low \(p\)-value (\(p=0.0026859\)) suggests that there is substantial evidence to reject the null hypothesis of no correlation between genotypes in T1D cases. Note that this is using \(r\), if I use \(r^2\) then all \(p\)-values are \(>0.9\).

rs1989245 is an intron variant in \(\textit{CLNK}\) (Cytokine Dependent Hematopoietic Cell Linker) which plays a role in the regulation of immunoreceptor signaling.
rs1989245 has eQTL evidence for an adjacent gene, \(\textit{ZNF518B}\), in whole blood (\(\beta=-0.1\), \(p=1.4e-19\)) which has been associated with colorectal cancer (Gimena-Valiente et al. 2019). \(\textit{ZNF518B}\) encodes a zinc finger protein (Ikaros is also a zinc finger protein) that regulates histone methyltransferases (Maier et al. 2015) and has been associate with blood glucose in patients with gout (Zhang et al. 2015). Bacos and colleagues (https://www.nature.com/articles/ncomms11089) showed that DNA methylation at \(\textit{ZNF518B}\) associates with age in both blood and human islets (more methylation the older you get). The methylation level inversely correlates with \(\textit{ZNF518B}\) expression in islets (perhaps because activatory TFs are no longer able to bind). Moreover \(\textit{ZNF518B}\) knockdown models mimiced the age-associated decline in expression seen in human islets. “Of note, our functional experiments in β-cells suggest that the age-associated changes in methylation of FHL2, ZNF518B and GNPNAT1 in human islets may be protective and compensatory changes to cope with increased demands of insulin due to insulin resistance in peripheral tissues.”

Comments and notes

CMH papers: https://www.jstor.org/stable/1402373?origin=crossref&seq=1#metadata_info_tab_contents and https://www.annualreviews.org/doi/pdf/10.1146/annurev.pu.09.050188.001011