25th Feb

1. Introduction

I have identified the likely causal variants for T1D in the IKZF1 gene region. I have also shown that there is genome-wide enrichment for T1D association at SNPs in sites where IKZF1 binds in immune-relevant cell types. A follow-up question is whether there is interaction between the likely causal variants in the IKZF1 gene region and the SNPs in IKZF1 binding sites. Whilst epistasis is difficult to interpret biologically (Cordell 2009), the idea is that if less IKZF1 is being produced anyway due to the IKZF1 causal variants and it is harder for any IKZF1 to bind, then this may implicate T1D.

2. Testing strategy

CVs in IKZF1

I have previously prioritised SNPs rs6944602 (Robertson et al. 2020) and rs10264390 (Asimit et al. 2019) as the likely causal variants in the IKZF1 gene region ($r^2=0.63$).

SNP rs6944602 is directly genotyped on the illumina platform (T1DGC) but not on the affymetrix chip (WTCCC). The SNP is reasonably well imputed in WTCCC (info=0.865; certainty=0.841), but I add a proxy SNP to the analysis just to be safe. The proxy SNP I select, rs11978267, is the SNP with the highest $r^2$ with rs6944602 in EUR samples from LDproxy Tool ($r^2=0.6143$) that is directly genotyped on the illumina chip. (Note that the info score for the imputation of rs6944602 on affymetrix is greater than the correlation with the proxy SNP, so using the imputated genotype should be better).

The proxy SNP identified above is actually a near-perfect proxy ($r^2=0.968$) for the other potential CV I was going to consider (rs10264390 which is not genotyped directly on either platform) and they both reside in the blue credible set from MFM. Note that the proxy SNP is not directly genotyped in T1DGC but it is imputed to high quality (info=0.996; certainty=0.998).

I therefore continue my analysis using SNPs rs6944602 (https://genetics.opentargets.org/variant/7_50406053_G_A) and rs11978267 (https://genetics.opentargets.org/variant/7_50398606_A_G) as the potential CVs in IKZF1. Both of these SNPs significantly decrease $\textit{IKZF1}$ expression in blood cell types.

	rs6944602	rs11978267
T1DGC	Genotyped	Imputed (info = 0.996)
WTCCC	Imputed (info = 0.865)	Genotyped

Potentially interacting SNPs

To select which potentially interacting SNPs to test, I find those SNPs which overlap a predicted IKZF1 binding site in LCLs (Funk et al 2020) and an IKZF1 ChIP-seq peak in LCLs or activated CD4+ T cells. There are 160 of these.

I do not want to test too many hypotheses and loose power by correcting for multiple testing. Instead, I firstly stratify by T1D GWAS $p$-value. Specifically, I firstly look for evidence of interaction at the specific 21 SNPs with $p<0.05$ and that overlap the functional genomic data-sets.

After I have done that initial analysis, I will then look for general enrichment of signal across the 160 SNPs.

Potential problem:

See https://github.com/Boyle-Lab/SEMpl/issues/20. It is possible that the IKZF1 binding motif is wrong (it only has a hocomoco score of C anyway) which may invalidate the Funk et al. 2020 results I am using to specify potentially interacting SNPs…

Testing for association

The overall aim here is to see whether SNPs are interacting to implicate T1D. The most natural way to answer this question is to include an interaction term in the logistic regression framework and test whether the interaction coefficient is 0.

However, to increase power we can exploit the idea that if the genotypes at the two loci are not correlated in the general population, then a correlation between genotypes in cases may indicate SNP interaction, called a ``case-only analysis’’ (Cordell 2009). There are several ways I can test for an association between genotypes at the two loci:

Chi-squared test (but need to use rounded expected genotypes and this is relatively low powered due to 4 df).
Pearsons correlation coefficients and a corresponding correlation test to test for significance.

Walkthrough example (T1DGC)

As a walk-through example, I consider the interaction between rs6944602 and rs34723737 (21:42415456; overlaps peaks in all data sets but T1D $p=5.6e-03$), firstly in the T1DGC data. SNP rs34723737 is not directly genotyped in T1DGC but it has good imputation quality (info=0.981; certainty=0.996), so I proceed with my analysis.

The interaction term in the logistic regression framework is not significant:

Let’s consider a case-only analysis by first looking at the $3\times 3$ contingency tables of proportions:

The chi-squared test suggests that there is no evidence of association between genotypes at the two SNPs in controls ($p\_controls=0.1682$) but that there is evidence of association between the genotypes in cases ($p\_cases=0.0921$).

The Pearson’s correlation coefficient is larger in cases ($r=0.0384$) than controls ($r=0.0063$), suggesting that genotypes are more correlated in cases than controls. A formal test for significant correlation coefficients suggests that this correlation is significant in cases ($p=0.01564$) but not controls ($p=0.6918$).

Walkthrough (WTCCC)

Since rs6944602 is not directly genotyped in WTCCC, I consider two parallel analyses (1) using the imputed genotypes for rs6944602 and (2) using the observed genotype for the proxy SNP rs11978267, to test for interaction with rs34723737. We expect to trust results from (1) more as the imputation info>0.8 but the proxy in (2) only has $r^2\approx 0.6$. I use the imputed genotype for rs34723737 since this is high quality (info=0.916) and there is no obvious proxy SNP.

The results from the logistic regressions suggest that both the SNPs and the interactions are not significant for T1D case/control outcome:

The $3\times 3$ tables are shown below, first for (1; imputed genotype) and secondly for (2; proxy SNP):

Chi-squared results:

	p_controls	p_cases
rs6944602_imp	0.667	0.7189
rs6944602_proxy	0.7424	0.8892

Pearsons correlation coefficients:

	r_all	r_controls	r_cases
rs6944602_imp	-0.0154	-0.0107	-0.0236
rs6944602_proxy	-0.0035	-0.0051	-0.0007

Correlation test:

	p_controls	p_cases
rs6944602_imp	0.5401	0.3005
rs6944602_proxy	0.7692	0.9764

These results suggest little evidence of interaction between the two SNPs in the WTCCC data.

Joint analysis

I now consider ways to jointly analyze the genotype data from the two studies. Before doing this I needed to check that samples do not appear in both studies. From the experimental design, WTCCC controls were included as controls in T1DGC and WTCCC controls were ``topped up’’ using bipolar disorder controls (https://www.nature.com/articles/ng.381/tables/1). I’ve checked that the WTCCC controls appearing in the T1DGC data set are not in my WTCCC data set.

For my joint analysis I consider analysis (1) for the WTCCC data, since the imputation quality for rs6944602 is better than the correlation with the proxy SNP. (using https://influentialpoints.com/Training/multiple_2_by_2_tables.html)

Of course, I could crudely pool case genotypes from the two studies into one:

and do a $\chi^2$ test ($p_{pool}=0.4504$; cf $p_{t1dgc}=0.0921$ and $p_{wtccc}=0.7189$).. but this cannot be trusted due to unequal ratios of genotypes and it also throws away valuable information on the genotyping method (also see Simpson’s paradox whereby significant/non-significant results can be found when pooling samples but these do not translate across when looking at the original stratified data).

Instead, I should stick with the (rounded) full information:

The Cochran-Mantel-Haenszel chi-squared test can be used to test for association while taking into account the stratification. In my case, this is testing the null hypothesis that the genotypes are independent, after controlling for the genotyping platform (note how similar this is to the pooled $\chi^2$, I think because the sample sizes of the two studies are similar).

## 
##  Cochran-Mantel-Haenszel test
## 
## data:  rs6944602_geno and rs34723737_geno and study
## Cochran-Mantel-Haenszel M^2 = 3.6273, df = 4, p-value = 0.4588

An assumption of this test is that the contingency tables are homogenous (the proportions are similar). Usually, a Woolf test is used to check this - but I can only find this in the context of $2\times 2$ contingency tables using odds ratios. If the data are heterogenous between study, then the Cochran-Mantel-Haenszel test should not be used. Here, I assume that the data would be homogenous because the relationship between genotypes would still prevail regardless of the genotyping platform (I hope).

There are other methods such as combining the logarithms of the odds ratios but these seem specific to $2\times 2$ contingency tables. Of course, I could pool genotypes 1 and 2 (or 0 and 1) if we assume a dominant (or recessive) model to form $2\times 2$ contingency tables (could also pool genotypes 1 and 2 for rare SNPs).

The Cochran-Mantel-Haenszel chi-squared test above assumes that my data is nominal, when it can actually be thought of as ordinal (e.g. an additive model whereby the effect of genotype 2 is the twice the effect of genotype 1). The generalized Cochran-Mantel-Haenszel tests for association at two (possibly) ordered factors stratified by another factor and is more powerful test. The $ALL results are what we’re interested in (note that the general association test is the same as the result above using the standard Cochran-Mantel-Haenszel test - it treats the data as unordered).

my_array <- array(c(as.vector(table(merge_t1dgc_cases$rs6944602_geno, round(merge_t1dgc_cases$rs34723737_geno,0))), as.vector(table(round(merge_wtccc_cases$rs6944602_geno,0), round(merge_wtccc_cases$rs34723737_geno,0)))), dim = c(3,3,2))

CMHtest(my_array, overall = TRUE, type = c("cor","general"))

## $<NA>
## Cochran-Mantel-Haenszel Statistics NA 
## 
##               AltHypothesis  Chisq Df     Prob
## cor     Nonzero correlation 5.7262  1 0.016713
## general General association 7.9838  4 0.092172
## 
## 
## $<NA>
## Cochran-Mantel-Haenszel Statistics NA 
## 
##               AltHypothesis   Chisq Df    Prob
## cor     Nonzero correlation 0.75058  1 0.38629
## general General association 2.09067  4 0.71909
## 
## 
## $ALL
## Cochran-Mantel-Haenszel Statistics 
##  Overall tests, controlling for all strata 
## 
##               AltHypothesis  Chisq Df    Prob
## cor     Nonzero correlation 1.9673  1 0.16074
## general General association 3.6273  4 0.45879

The methods described so far require categorical data and thus I have had to round expected genotype values at imputed SNPs. This could be very problematic for SNPs with uncertain expected genotypes and with no directly genotyped proxy. It also throws away information.

Instead, I could consider a direct meta-analysis of correlation coefficients, which would not require any collapsing of genotypes or rounding of genotypes. (as with the generalized Cochran-Mantel-Haenszel test above, this assumes an additive model).

In contrast to a fixed-effects method which assumes that all studies come from a single homogeneous population, the random-effects method assumes that each study comes from their own “populations” which are related to the same underlying population. This allows us to incorporate both sampling error and a second source of error that arises due to the true effect sizes for each sample not equaling the the mean of the over-arching distribution of true effect sizes (see https://bookdown.org/MathiasHarrer/Doing_Meta_Analysis_in_R/random.html). We therefore have to estimate the variance of the distribution of true effect sizes (between-study variance), denoted by $\tau^2$. There are many ways to do this including the DerSimonian-Laird estimator (see Jackson et al. 2010).

The R package metacor can be used to perform a meta-analysis on correlation coefficients. It uses the DerSimonian-Laird method.

Note that there are other methods for meta-analyzing correlation coefficients (e.g. Hedges-Olkin and Hunter-Schmidt) but these are unsuitable for a small number of studies (usually require $>30$).

Comments and notes

Also see https://www.rdocumentation.org/packages/coin/versions/1.3-1/topics/ContingencyTests
Automated code in share/anna/t1d_genotypes and lists of proxy snps in ah2011/case-interaction
t1dgc_chr2.RData doesn’t open and t1dgc_chr6.RData doesn’t exist.
Six of my 21 SNPs reside in the MHC, should I exclue these?