### 1. Introduction

Aim: Take a set of P values for association with a given disease ($$p_1,...,p_n \in (0,1]$$) and a set of binary covariates measuring overlap with an active mark in a given cell type ($$q_1,...,q_n$$) and transform the P values into values $$v_1,...,v_n$$ such that for all $$\alpha$$

$$$Pr(v_i<\alpha| H_0)=\alpha$$$

$$$Pr(v_i<\alpha| H_1)\text{ is maximal}.$$$

Method: Choose a set of rejection regions, and for each $$(p_i, q_i)$$ pair find the smallest rejection region in which $$(p_i, q_i)$$ is contained. Estimate the distribution of P, Q under the null and integrate this over the selected rejection region to get $$v_i$$.

### 2. Notation

$$H$$ denotes a hypothesis of association with T1D; $$H=1$$ indicates association, $$H=0$$ indicates no association (null hypothesis).

$$$H \sim \text{Bernoulli}(\pi_1) \Longrightarrow P(H=1)=\pi_1$$$

$$$\begin{split} Q|H=0 \sim \text{Bernoulli}(q_1) & \Longrightarrow P(Q=1|H=0)=q_1 \\ & \Longrightarrow P(Q=0|H=0)=1-q_1=q_0 \end{split}$$$

$$$P|H=0 \sim U(0,1)$$$

$$$P \perp Q|H = 0$$$

$$$f(P=p, Q=q|H_0)=f_0(p,q)$$$

$$$f(P=p, Q=q|H_1)=f_1(p,q)$$$

$$$f(P=p, Q=q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1=f(p,q)$$$

### 3. Detailed method

#### 1. Generate the set of possible rejection regions.

Since $$Q$$ can only be 0 or 1, we just need two P value thresholds, $$p_0$$ for $$Q=0$$ and $$p_1$$ for $$Q=1$$.

$$$L(p_0, p_1)=(P\leq p_0, Q=0) \cup (P\leq p_1, Q=1).$$$

#### 2. Pick the smallest rejection region that $$(p[i], q[i])$$ belongs.

Following on from our aim, we want the rejection region such that:

a.) the probability of a null instance of $$P,Q$$ falling inside $$L(p_0, p_1)$$ is fixed at some value $$\alpha$$, that is

$$$\int_{L(p_0, p_1)}df_0=\alpha$$$

b.) the probability of a non-null instance of $$P,Q$$ falling inside $$L(p_0, p_1)$$ is maximal, that is

$$$\int_{L(p_0, p_1)}df_1 \text{ is maximal}$$$

Previous works (e.g.Â appendix 8.1 in Jamesâ€™ paper) show that this optimal region is given by the set of points $$\{(p,q):f_0(p,q)/f_1(p,q)\leq k_{\alpha}\}$$ which means that in our case, the chosen $$p_0$$ and $$p_1$$ should satisfy

$$$\dfrac{f_0(p_0,0)}{f(p_0,0)}=\dfrac{f_0(p_1,1)}{f(p_1,1)}$$$

(using $$f(p,q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1$$).

But how do we find $$p_0$$ and $$p_1$$ that satisfy this equation? James provides 2 options:

Option 1: Estimate $$f$$ directly

Consider $$$\begin{split} g(p,q) & = \dfrac{f_0(p,q)}{f(p,q)} \\[2ex] & = \dfrac{f(P=p, Q=q|H_0)}{f(p,q)} \\[2ex] & = \dfrac{f(P=p|Q=q, H_0) f(Q=q|H_0)}{f(p,q)} \end{split}$$$

then since the P values are distributed uniformly under the null, $$f(P=p|Q=q, H_0)=1$$. We can also estimate $$f(Q=q|H_0)$$ by $$q_0$$ or $$q_1$$ depending on the value of $$q$$.

Side note: Looking at the denominator, we know

$$$\color{gray}{f(p,q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1.}$$$

If we assume the event to be rare then $$\pi_1\approx 0$$ which leaves

$$$\color{gray}{f(p,q)\approx f_0(p,q)}$$$

which would cancel with the numerator in $$g$$????

James suggests that the denominator can be estimated by

$$$\hat{f}(p,q) = \hat{f}(p|q=0)q_0 + \hat{f}(p|q=1)q_1,$$$

using KDE. So that,

$$$g(p,q)=\dfrac{f_0(p,q)}{f(p,q)} \approx \hat{g}(p,q) = \left\{\begin{array}{ll}{\dfrac{q_0}{\hat{f}(p,q)}} & {\text{ if }q = 0} \\ {\dfrac{q_1}{\hat{f}(p,q)}} & {\text{ if }q = 1}\end{array}\right.$$$

Iâ€™m not entirely sure how KDE works but surely this gives

$$$\hat{g}(p,q) = \left\{\begin{array}{ll}{\dfrac{q_0}{\hat{f}(p,q)}=\dfrac{q_0}{\hat{f}(p|q=0)q_0}=\dfrac{1}{\hat{f}(p|q=0)}} & {\text{ if }q = 0} \\ {\dfrac{q_1}{\hat{f}(p,q)}=\dfrac{q_1}{\hat{f}(p|q=1)q_1}=\dfrac{1}{\hat{f}(p|q=1)}} & {\text{ if }q = 1}\end{array}\right.$$$

He then suggests that if $$q[i]=1$$ in our $$(p[i],q[i])$$ pair we are trying to find $$v_i$$ for, then we should set $$p_1=p[i]$$ and solve

$$$\hat{g}(p_0,0)=\hat{g}(p[i],1)$$$

for $$p_0$$. I.e. solve,

\begin{aligned} \hat{f}(p_0|q=0)=\hat{f}(p[i]|q=1) \end{aligned}

Similarly, if $$q[i]=0$$ he suggests setting $$p_0=p[i]$$ and solving

$$$\hat{g}(p[i],0)=\hat{g}(p_1,1)$$$

to find $$p_1$$. And similarly,

\begin{aligned} \hat{f}(p[i]|q=0)=\hat{f}(p_1|q=1) \end{aligned}

Option 2: cFDR (CDF) approach

Rather than estimating $$f$$ directly, we can follow the cFDR approach. This approach is again concerned with choosing $$p_0$$ and $$p_1$$ that satisfy

$$$\dfrac{f_0(p_0,0)}{f(p_0,0)}=\dfrac{f_0(p_1,1)}{f(p_1,1)}.$$$

Approximate,

$$$\begin{split} \dfrac{f_0(p,q)}{f(p,q)} &= \dfrac{f(P=p, Q=q|H_0)}{f(P=p, Q=q)} \\ & \approx \dfrac{Pr(P\leq p, Q=q|H_0)}{Pr(P\leq p, Q=q)} \\ & \approx \dfrac{Pr(P\leq p| Q=q, H_0)Pr(Q=q|H_0)}{Pr(P\leq p|Q=q)Pr(Q=q)} \\ & \approx \dfrac{p Pr(Q=q|H_0)}{Pr(P\leq p|Q=q)\times(Pr(Q=q|H_0)Pr(H_0)+Pr(Q=q|H_1)Pr(H_1))} \end{split}$$$

Note that,

$$$Pr(Q=q|H_0) = \left\{\begin{array}{ll}{q_0} & {\text{ if }q = 0} \\ {q_1} & {\text{ if }q = 1}\end{array}\right.$$$

so we can use our empirical estimates of these. Moreover, if we assume that the event is rare, then $$Pr(H_0)\approx 1$$ and $$Pr(H_1)\approx 0$$. Therefore, the denominator reduces to $$Pr(P\leq p|Q=q)\times q_0$$ if $$q=0$$ or $$Pr(P\leq p|Q=q)\times q_1$$ if $$q=1$$ where the first term can be empirically estimated by counting.

Set $$\hat{h}(p,q)$$ equal to this approximation and note that

$$$\hat{h}(p,q) = \left\{\begin{array}{ll}{\dfrac{p}{\hat{Pr}(P\leq p|Q=0)}} & {\text{ if }q = 0} \\ {\dfrac{p}{\hat{Pr}(P\leq p|Q=1)}} & {\text{ if }q = 1}\end{array}\right.$$$

Again, if $$q[i]=1$$ then we set $$p_1=p[i]$$ and solve $$\hat{h}(p_0,0)=\hat{h}(p_i,1)$$ for $$p_0$$, i.e.

$$$\dfrac{p_0}{Pr(P\leq p_0|Q=0)}=\dfrac{p[i]}{Pr(P\leq p[i]|Q=1)}$$$

and if $$q[i]=0$$ then we set $$p_0=p[i]$$ and solve $$\hat{h}(p[i],0)=\hat{h}(p_1,1)$$ for $$p_1$$, i.e.

$$$\dfrac{p[i]}{Pr(P\leq p[i]|Q=0)}=\dfrac{p_1}{Pr(P\leq p_1|Q=1)}$$$

but how would I solve this for $$p_0$$ and $$p_1$$?

#### 3. Integrate the null distribution over this region to get $$v$$.

$$$\begin{split} \int_{L(p_0,p_1)}df_0 & = Pr(P,Q \in L(p_0, p_1)|H_0) \\ & = Pr((P\leq p_0, Q=0) \cup (P\leq p_1, Q=1) | H_0) \\ & = Pr(P\leq p_0, Q=0|H_0) + Pr(P\leq p_1, Q=1|H_1) \\ & = Pr(P\leq p_0|Q=0, H_0)Pr(Q=0|H_0) + Pr(P\leq p_1|Q=1, H_0)Pr(Q=1|H_0) \\ & = p_0\times q_0 + p_1\times (1-q_0) \end{split}$$$

The last line follows from the fact that under the null, the P values are uniform and independent of Q (so that $$Pr(P\leq p_i|Q=0/1, H_0)=p_i$$). We set $$Pr(Q=0|H_0)=q_0$$ and note that $$q_0$$ can be estimated by the number of $$Q=0$$ in the data. This means that we only need to find $$p_0$$ and $$p_1$$ and we can derive,

$$$v_i=p_0\hat{q_0}+p_1(1-\hat{q_0}).$$$

### 4. Data

I have data for 16,000 SNPs (from 39 T1D associated regions) consisting of P values (and PPs) measuring strength of evidence of association with T1D and a binary indicator of whether that SNP overlaps and active chromatin annotation for 19 different cell types.

x <- readRDS("binary_annots.RDS")
x[1:5, 1:5]
##             snps         P           PP THYMUS PANISLETS
## 1 imm_7_50868985 0.7040051 3.710506e-10      0         0
## 2 imm_7_50869125 0.6381124 4.490570e-10      0         0
## 3 imm_7_50870347 0.2296590 1.466098e-09      0         0
## 4 imm_7_50870355 0.7335254 9.073471e-10      0         0
## 5 imm_7_50871208 0.9917130 3.460071e-10      0         0

Iâ€™ve already investigated using logistic and quantile regression approaches to explore the relationship between P values and active/ inactive marks in the 19 cell types. My results were weird though, with active marks in the brain indicative of smaller P values, and active marks in the thymus indicative of larger P values for T1D.