1. Introduction

Aim: Take a set of P values for association with a given disease (\(p_1,...,p_n \in (0,1]\)) and a set of binary covariates measuring overlap with an active mark in a given cell type (\(q_1,...,q_n\)) and transform the P values into values \(v_1,...,v_n\) such that for all \(\alpha\)

\[\begin{equation} Pr(v_i<\alpha| H_0)=\alpha \end{equation}\]

\[\begin{equation} Pr(v_i<\alpha| H_1)\text{ is maximal}. \end{equation}\]

Method: Choose a set of rejection regions, and for each \((p_i, q_i)\) pair find the smallest rejection region in which \((p_i, q_i)\) is contained. Estimate the distribution of P, Q under the null and integrate this over the selected rejection region to get \(v_i\).

2. Notation

\(H\) denotes a hypothesis of association with T1D; \(H=1\) indicates association, \(H=0\) indicates no association (null hypothesis).

\[\begin{equation} H \sim \text{Bernoulli}(\pi_1) \Longrightarrow P(H=1)=\pi_1 \end{equation}\]

\[\begin{equation} \begin{split} Q|H=0 \sim \text{Bernoulli}(q_1) & \Longrightarrow P(Q=1|H=0)=q_1 \\ & \Longrightarrow P(Q=0|H=0)=1-q_1=q_0 \end{split} \end{equation}\]

\[\begin{equation} P|H=0 \sim U(0,1) \end{equation}\]

\[\begin{equation} P \perp Q|H = 0 \end{equation}\]

\[\begin{equation} f(P=p, Q=q|H_0)=f_0(p,q) \end{equation}\]

\[\begin{equation} f(P=p, Q=q|H_1)=f_1(p,q) \end{equation}\]

\[\begin{equation} f(P=p, Q=q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1=f(p,q) \end{equation}\]

3. Detailed method

1. Generate the set of possible rejection regions.

Since \(Q\) can only be 0 or 1, we just need two P value thresholds, \(p_0\) for \(Q=0\) and \(p_1\) for \(Q=1\).

\[\begin{equation} L(p_0, p_1)=(P\leq p_0, Q=0) \cup (P\leq p_1, Q=1). \end{equation}\]

2. Pick the smallest rejection region that \((p[i], q[i])\) belongs.

Following on from our aim, we want the rejection region such that:

a.) the probability of a null instance of \(P,Q\) falling inside \(L(p_0, p_1)\) is fixed at some value \(\alpha\), that is

\[\begin{equation} \int_{L(p_0, p_1)}df_0=\alpha \end{equation}\]

b.) the probability of a non-null instance of \(P,Q\) falling inside \(L(p_0, p_1)\) is maximal, that is

\[\begin{equation} \int_{L(p_0, p_1)}df_1 \text{ is maximal} \end{equation}\]

Previous works (e.g. appendix 8.1 in James’ paper) show that this optimal region is given by the set of points \(\{(p,q):f_0(p,q)/f_1(p,q)\leq k_{\alpha}\}\) which means that in our case, the chosen \(p_0\) and \(p_1\) should satisfy

\[\begin{equation} \dfrac{f_0(p_0,0)}{f(p_0,0)}=\dfrac{f_0(p_1,1)}{f(p_1,1)} \end{equation}\]

(using \(f(p,q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1\)).

But how do we find \(p_0\) and \(p_1\) that satisfy this equation? James provides 2 options:

Option 1: Estimate \(f\) directly

Consider \[\begin{equation} \begin{split} g(p,q) & = \dfrac{f_0(p,q)}{f(p,q)} \\[2ex] & = \dfrac{f(P=p, Q=q|H_0)}{f(p,q)} \\[2ex] & = \dfrac{f(P=p|Q=q, H_0) f(Q=q|H_0)}{f(p,q)} \end{split} \end{equation}\]

then since the P values are distributed uniformly under the null, \(f(P=p|Q=q, H_0)=1\). We can also estimate \(f(Q=q|H_0)\) by \(q_0\) or \(q_1\) depending on the value of \(q\).

Side note: Looking at the denominator, we know

\[\begin{equation} \color{gray}{f(p,q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1.} \end{equation}\]

If we assume the event to be rare then \(\pi_1\approx 0\) which leaves

\[\begin{equation} \color{gray}{f(p,q)\approx f_0(p,q)} \end{equation}\]

which would cancel with the numerator in \(g\)????

James suggests that the denominator can be estimated by

\[\begin{equation} \hat{f}(p,q) = \hat{f}(p|q=0)q_0 + \hat{f}(p|q=1)q_1, \end{equation}\]

using KDE. So that,

\[\begin{equation} g(p,q)=\dfrac{f_0(p,q)}{f(p,q)} \approx \hat{g}(p,q) = \left\{\begin{array}{ll}{\dfrac{q_0}{\hat{f}(p,q)}} & {\text{ if }q = 0} \\ {\dfrac{q_1}{\hat{f}(p,q)}} & {\text{ if }q = 1}\end{array}\right. \end{equation}\]

I’m not entirely sure how KDE works but surely this gives

\[\begin{equation} \hat{g}(p,q) = \left\{\begin{array}{ll}{\dfrac{q_0}{\hat{f}(p,q)}=\dfrac{q_0}{\hat{f}(p|q=0)q_0}=\dfrac{1}{\hat{f}(p|q=0)}} & {\text{ if }q = 0} \\ {\dfrac{q_1}{\hat{f}(p,q)}=\dfrac{q_1}{\hat{f}(p|q=1)q_1}=\dfrac{1}{\hat{f}(p|q=1)}} & {\text{ if }q = 1}\end{array}\right. \end{equation}\]

He then suggests that if \(q[i]=1\) in our \((p[i],q[i])\) pair we are trying to find \(v_i\) for, then we should set \(p_1=p[i]\) and solve

\[\begin{equation} \hat{g}(p_0,0)=\hat{g}(p[i],1) \end{equation}\]

for \(p_0\). I.e. solve,

\[\begin{equation} \begin{aligned} \hat{f}(p_0|q=0)=\hat{f}(p[i]|q=1) \end{aligned} \end{equation}\]

Similarly, if \(q[i]=0\) he suggests setting \(p_0=p[i]\) and solving

\[\begin{equation} \hat{g}(p[i],0)=\hat{g}(p_1,1) \end{equation}\]

to find \(p_1\). And similarly,

\[\begin{equation} \begin{aligned} \hat{f}(p[i]|q=0)=\hat{f}(p_1|q=1) \end{aligned} \end{equation}\]

Option 2: cFDR (CDF) approach

Rather than estimating \(f\) directly, we can follow the cFDR approach. This approach is again concerned with choosing \(p_0\) and \(p_1\) that satisfy

\[\begin{equation} \dfrac{f_0(p_0,0)}{f(p_0,0)}=\dfrac{f_0(p_1,1)}{f(p_1,1)}. \end{equation}\]


\[\begin{equation} \begin{split} \dfrac{f_0(p,q)}{f(p,q)} &= \dfrac{f(P=p, Q=q|H_0)}{f(P=p, Q=q)} \\ & \approx \dfrac{Pr(P\leq p, Q=q|H_0)}{Pr(P\leq p, Q=q)} \\ & \approx \dfrac{Pr(P\leq p| Q=q, H_0)Pr(Q=q|H_0)}{Pr(P\leq p|Q=q)Pr(Q=q)} \\ & \approx \dfrac{p Pr(Q=q|H_0)}{Pr(P\leq p|Q=q)\times(Pr(Q=q|H_0)Pr(H_0)+Pr(Q=q|H_1)Pr(H_1))} \end{split} \end{equation}\]

Note that,

\[\begin{equation} Pr(Q=q|H_0) = \left\{\begin{array}{ll}{q_0} & {\text{ if }q = 0} \\ {q_1} & {\text{ if }q = 1}\end{array}\right. \end{equation}\]

so we can use our empirical estimates of these. Moreover, if we assume that the event is rare, then \(Pr(H_0)\approx 1\) and \(Pr(H_1)\approx 0\). Therefore, the denominator reduces to \(Pr(P\leq p|Q=q)\times q_0\) if \(q=0\) or \(Pr(P\leq p|Q=q)\times q_1\) if \(q=1\) where the first term can be empirically estimated by counting.

Set \(\hat{h}(p,q)\) equal to this approximation and note that

\[\begin{equation} \hat{h}(p,q) = \left\{\begin{array}{ll}{\dfrac{p}{\hat{Pr}(P\leq p|Q=0)}} & {\text{ if }q = 0} \\ {\dfrac{p}{\hat{Pr}(P\leq p|Q=1)}} & {\text{ if }q = 1}\end{array}\right. \end{equation}\]

Again, if \(q[i]=1\) then we set \(p_1=p[i]\) and solve \(\hat{h}(p_0,0)=\hat{h}(p_i,1)\) for \(p_0\), i.e.

\[\begin{equation} \dfrac{p_0}{Pr(P\leq p_0|Q=0)}=\dfrac{p[i]}{Pr(P\leq p[i]|Q=1)} \end{equation}\]

and if \(q[i]=0\) then we set \(p_0=p[i]\) and solve \(\hat{h}(p[i],0)=\hat{h}(p_1,1)\) for \(p_1\), i.e.

\[\begin{equation} \dfrac{p[i]}{Pr(P\leq p[i]|Q=0)}=\dfrac{p_1}{Pr(P\leq p_1|Q=1)} \end{equation}\]

but how would I solve this for \(p_0\) and \(p_1\)?

3. Integrate the null distribution over this region to get \(v\).

\[\begin{equation} \begin{split} \int_{L(p_0,p_1)}df_0 & = Pr(P,Q \in L(p_0, p_1)|H_0) \\ & = Pr((P\leq p_0, Q=0) \cup (P\leq p_1, Q=1) | H_0) \\ & = Pr(P\leq p_0, Q=0|H_0) + Pr(P\leq p_1, Q=1|H_1) \\ & = Pr(P\leq p_0|Q=0, H_0)Pr(Q=0|H_0) + Pr(P\leq p_1|Q=1, H_0)Pr(Q=1|H_0) \\ & = p_0\times q_0 + p_1\times (1-q_0) \end{split} \end{equation}\]

The last line follows from the fact that under the null, the P values are uniform and independent of Q (so that \(Pr(P\leq p_i|Q=0/1, H_0)=p_i\)). We set \(Pr(Q=0|H_0)=q_0\) and note that \(q_0\) can be estimated by the number of \(Q=0\) in the data. This means that we only need to find \(p_0\) and \(p_1\) and we can derive,

\[\begin{equation} v_i=p_0\hat{q_0}+p_1(1-\hat{q_0}). \end{equation}\]

4. Data

I have data for 16,000 SNPs (from 39 T1D associated regions) consisting of P values (and PPs) measuring strength of evidence of association with T1D and a binary indicator of whether that SNP overlaps and active chromatin annotation for 19 different cell types.

x <- readRDS("binary_annots.RDS")
x[1:5, 1:5]
##             snps         P           PP THYMUS PANISLETS
## 1 imm_7_50868985 0.7040051 3.710506e-10      0         0
## 2 imm_7_50869125 0.6381124 4.490570e-10      0         0
## 3 imm_7_50870347 0.2296590 1.466098e-09      0         0
## 4 imm_7_50870355 0.7335254 9.073471e-10      0         0
## 5 imm_7_50871208 0.9917130 3.460071e-10      0         0

I’ve already investigated using logistic and quantile regression approaches to explore the relationship between P values and active/ inactive marks in the 19 cell types. My results were weird though, with active marks in the brain indicative of smaller P values, and active marks in the thymus indicative of larger P values for T1D.