## cFDR method for reweighting GWAS P values using functional data

### 1. Notation

$$H$$ denotes a hypothesis of association with T1D; $$H=1$$ indicates association, $$H=0$$ indicates no association (null hypothesis).

$\begin{equation} H \sim \text{Bernoulli}(\pi_1) \Longrightarrow P(H=1)=\pi_1 \end{equation}$

$\begin{equation} \begin{split} Q|H=0 \sim \text{Bernoulli}(q_0) & \Longrightarrow P(Q=1|H=0)=q_0 \\ & \Longrightarrow P(Q=0|H=0)=1-q_0 \end{split} \end{equation}$

$\begin{equation} P|H=0 \sim U(0,1) \end{equation}$

$\begin{equation} P \perp Q|H = 0 \end{equation}$

$\begin{equation} \begin{split} Q|H=1 \sim \text{Bernoulli}(q_1) & \Longrightarrow P(Q=1|H=1)=q_1 \\ & \Longrightarrow P(Q=0|H=1)=1-q_1 \end{split} \end{equation}$

$\begin{equation} f(P=p, Q=q|H_0)=f_0(p,q) \end{equation}$

$\begin{equation} f(P=p, Q=q|H_1)=f_1(p,q) \end{equation}$

$\begin{equation} f(P=p, Q=q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1=f(p,q) \end{equation}$

### 2. Method

1. Generate the set of possible rejection regions.

Since $$Q$$ can only be 0 or 1, we just need two P value thresholds, $$p_0$$ for $$Q=0$$ and $$p_1$$ for $$Q=1$$.

$\begin{equation} L(p_0, p_1)=(P\leq p_0, Q=0) \cup (P\leq p_1, Q=1). \end{equation}$

2. Pick the smallest rejection region that $$(p[i], q[i])$$ belongs.

Following on from our aim, we want the rejection region such that:

1. the probability of a null instance of $$P,Q$$ falling inside $$L(p_0, p_1)$$ is fixed at some value $$\alpha$$, that is

$\begin{equation} \int_{L(p_0, p_1)}df_0=\alpha \end{equation}$

1. the probability of a non-null instance of $$P,Q$$ falling inside $$L(p_0, p_1)$$ is maximal, that is

$\begin{equation} \int_{L(p_0, p_1)}df_1 \text{ is maximal} \end{equation}$

Previous works (e.g. appendix 8.1 in James’ paper) show that this optimal region is given by the set of points $$\{(p,q):f_0(p,q)/f_1(p,q)\leq k_{\alpha}\}$$ which means that in our case, the chosen $$p_0$$ and $$p_1$$ should satisfy

$\begin{equation} \dfrac{f_0(p_0,0)}{f(p_0,0)}=\dfrac{f_0(p_1,1)}{f(p_1,1)} \end{equation}$

using $$f(p,q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1$$.

To find $$p_0$$ and $$p_1$$ that satisfy this equation, we derive an estimator of

$\begin{equation} \begin{split} \dfrac{f_0(p,q)}{f(p,q)} &= \dfrac{f(P=p, Q=q|H_0)}{f(P=p, Q=q)}. \end{split} \end{equation}$

Let

$\begin{equation} \begin{split} \hat{h}(p,q) &= \dfrac{Pr(P\leq p, Q=q|H_0)}{Pr(P\leq p, Q=q)} \\[2ex] & = \dfrac{Pr(P\leq p| Q=q, H_0)Pr(Q=q|H_0)}{Pr(P\leq p|Q=q)Pr(Q=q)} \\[2ex] & \approx \dfrac{p\times Pr(Q=q|H_0)}{Pr(P\leq p|Q=q)Pr(Q=q)} \\[2ex] & \approx \dfrac{p\times \color{red}{A(q)}}{\color{blue}{B(p, q)}/N} \end{split} \end{equation}$

where $$\color{red}{A(q)=\dfrac{\# \{Q=q, p>1/2\}}{\# \{p>1/2\}}}$$ and $$\color{blue}{B(p, q)=\# \{P\leq p, Q = q\}}$$ (since $$\# \{Q = q\}$$ cancels in the denominator).

So we want to solve

$\begin{equation} \hat{h}(p_0,q = 0)=\hat{h}(p_1, q = 1) \end{equation}$

for

• $$p_1$$ if $$q=0$$ (set $$p_0=p_i$$)
• $$p_0$$ if $$q=1$$ (set $$p_1=p_i$$)

3. Integrate the null distribution over this region to get $$v$$.

$\begin{equation} \begin{split} \int_{L(p_0,p_1)}df_0 & = Pr(P,Q \in L(p_0, p_1)|H_0) \\ & = Pr((P\leq p_0, Q=0) \cup (P\leq p_1, Q=1) | H_0) \\ & = Pr(P\leq p_0, Q=0|H_0) + Pr(P\leq p_1, Q=1|H_1) \\ & = Pr(P\leq p_0|Q=0, H_0)Pr(Q=0|H_0) + Pr(P\leq p_1|Q=1, H_0)Pr(Q=1|H_0) \\ & = p_0\times (1-q_0) + p_1\times q_0 \end{split} \end{equation}$

where $$q_0=A(1)$$.

### 3. New data set

I have downloaded the case-control T1D GWAS results (Onengut-Gumuscu et al.) from the GWAS catalog. This consists of 123130 SNPs genotyped in 6670 T1D cases and 12262 controls.

I breifly compare this to the ic_t1d_onengut_allresults.csv.gz file on the HPC which I assume also includes the meta-analysis (124726 SNPs). For the SNPs that can be matched by position, I compare the reported P values in the two datasets. The T1D paper (cc_ichip.sumstats from GWAS Catalog) does not apply genomic control as we discussed (comparing chi-squared distributions) but does account for cyptic relatedness using KING (to estimate the kinship coefficient and remove problematic individuals) and population structure using the PCA method in KING. I assume this means that I shouldn’t genome-control the P values (again). Note: I checked, and this data set gives a lambda value of 0.93 suggesting that there is no population structure (although shouldn’t this equal 1 in a completely homogenous population?).

I decide to use the publically avaliable data downloaded directly from the GWAS catalog for my analysis.

### 4. Annotation data set

To derive a final annotation data set, for each of the 123130 SNPs in the data set, I record their reported genomic annotation in 19 different cell types (mostly immune related except brain cells to use as a control) using the segway encyclopedia. For each cell type, I also constuct a binary “active” column that is 1 if the SNP falls in enhancer, transcribed, promoter or reg permissive, 0 if the SNP falls in quiescent, constit het or facult het and NA if the SNP falls in bivalent, unclassified or low confidence.

snps <- readRDS("final_final_annots.RDS")
snps[1,]
##   chrom     pos       rsid other_allele effect_allele         p
## 1     1 1118275 rs61733845            C             T 0.8342689
##          beta         se     OR  OR_lower OR_upper
## 1 -0.01126319 0.05383169 0.9888 0.8897869 1.098831
##   BRAIN_INFERIOR_TEMPORAL_LOBE CD14_PRIMARY_CELLS
## 1               FacultativeHet     FacultativeHet
##   CD19_PRIMARY_CELLS_PERIPHERAL_UW CD3_PRIMARY_CELLS_CORD_BI
## 1                         Bivalent                  Bivalent
##   CD3_PRIMARY_CELLS_PERIPHERAL_UW CD4_MEMORY_PRIMARY_CELLS
## 1                  FacultativeHet                 Enhancer
##   CD4_NAIVE_PRIMARY_CELLS CD4+_CD25-_CD45RA+_NAIVE_PRIMARY_CELLS
## 1           RegPermissive                               Enhancer
##   CD4+_CD25-_CD45RO+_MEMORY_PRIMARY_CELLS
## 1                                Enhancer
##   CD4+_CD25-_IL17-_PMA-IONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLS
## 1                                                           FacultativeHet
##   CD4+_CD25-_IL17+_PMA-IONOMCYIN_STIMULATED_TH17_PRIMARY_CELLS
## 1                                                     Enhancer
##   CD4+_CD25-_TH_PRIMARY_CELLS CD4+_CD25+_CD127-_TREG_PRIMARY_CELLS
## 1                    Bivalent                       FacultativeHet
##   CD4+_CD25INT_CD127+_TMEM_PRIMARY_CELLS CD56_PRIMARY_CELLS
## 1                               Enhancer           Bivalent
##   CD8_MEMORY_PRIMARY_CELLS CD8_NAIVE_PRIMARY_CELLS   PANISLETS    THYMUS
## 1           FacultativeHet           RegPermissive Transcribed Quiescent
##   b_BRAIN_INFERIOR_TEMPORAL_LOBE b_CD14_PRIMARY_CELLS
## 1                              0                    0
##   b_CD19_PRIMARY_CELLS_PERIPHERAL_UW b_CD3_PRIMARY_CELLS_CORD_BI
## 1                                 NA                          NA
##   b_CD3_PRIMARY_CELLS_PERIPHERAL_UW b_CD4_MEMORY_PRIMARY_CELLS
## 1                                 0                          1
##   b_CD4_NAIVE_PRIMARY_CELLS b_CD4+_CD25-_CD45RA+_NAIVE_PRIMARY_CELLS
## 1                         1                                        1
##   b_CD4+_CD25-_CD45RO+_MEMORY_PRIMARY_CELLS
## 1                                         1
##   b_CD4+_CD25-_IL17-_PMA-IONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLS
## 1                                                                          0
##   b_CD4+_CD25-_IL17+_PMA-IONOMCYIN_STIMULATED_TH17_PRIMARY_CELLS
## 1                                                              1
##   b_CD4+_CD25-_TH_PRIMARY_CELLS b_CD4+_CD25+_CD127-_TREG_PRIMARY_CELLS
## 1                            NA                                      0
##   b_CD4+_CD25INT_CD127+_TMEM_PRIMARY_CELLS b_CD56_PRIMARY_CELLS
## 1                                        1                   NA
##   b_CD8_MEMORY_PRIMARY_CELLS b_CD8_NAIVE_PRIMARY_CELLS b_PANISLETS
## 1                          0                         1           1
##   b_THYMUS PP
## 1        0 NA

The baseline proportions of annotations across the ~123,000 ImmunoChip SNPs are shown below: