### Notation

$$H$$ denotes a hypothesis of association with T1D; $$H=1$$ indicates association, $$H=0$$ indicates no association (null hypothesis).

$$$H \sim \text{Bernoulli}(\pi_1) \Longrightarrow P(H=1)=\pi_1$$$

$$$\begin{split} Q|H=0 \sim \text{Bernoulli}(q_0) & \Longrightarrow P(Q=1|H=0)=q_0 \\ & \Longrightarrow P(Q=0|H=0)=1-q_0 \end{split}$$$

$$$P|H=0 \sim U(0,1)$$$

$$$P \perp Q|H = 0$$$

$$$\begin{split} Q|H=1 \sim \text{Bernoulli}(q_1) & \Longrightarrow P(Q=1|H=1)=q_1 \\ & \Longrightarrow P(Q=0|H=1)=1-q_1 \end{split}$$$

$$$f(P=p, Q=q|H_0)=f_0(p,q)$$$

$$$f(P=p, Q=q|H_1)=f_1(p,q)$$$

$$$f(P=p, Q=q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1=f(p,q)$$$

### Method

#### 1. Generate the set of possible rejection regions.

Since $$Q$$ can only be 0 or 1, we just need two P value thresholds, $$p_0$$ for $$Q=0$$ and $$p_1$$ for $$Q=1$$.

$$$L(p_0, p_1)=(P\leq p_0, Q=0) \cup (P\leq p_1, Q=1).$$$

#### 2. Pick the smallest rejection region that $$(p[i], q[i])$$ belongs.

Following on from our aim, we want the rejection region such that:

1. the probability of a null instance of $$P,Q$$ falling inside $$L(p_0, p_1)$$ is fixed at some value $$\alpha$$, that is

$$$\int_{L(p_0, p_1)}df_0=\alpha$$$

1. the probability of a non-null instance of $$P,Q$$ falling inside $$L(p_0, p_1)$$ is maximal, that is

$$$\int_{L(p_0, p_1)}df_1 \text{ is maximal}$$$

Previous works (e.g. appendix 8.1 in James’ paper) show that this optimal region is given by the set of points $$\{(p,q):f_0(p,q)/f_1(p,q)\leq k_{\alpha}\}$$ which means that in our case, the chosen $$p_0$$ and $$p_1$$ should satisfy

$$$\dfrac{f_0(p_0,0)}{f(p_0,0)}=\dfrac{f_0(p_1,1)}{f(p_1,1)}$$$

using $$f(p,q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1$$.

To find $$p_0$$ and $$p_1$$ that satisfy this equation, we derive an estimator of

$$$\begin{split} \dfrac{f_0(p,q)}{f(p,q)} &= \dfrac{f(P=p, Q=q|H_0)}{f(P=p, Q=q)}. \end{split}$$$

Let

$$$\begin{split} \hat{h}(p,q) &= \dfrac{Pr(P\leq p, Q=q|H_0)}{Pr(P\leq p, Q=q)} \\[2ex] & = \dfrac{Pr(P\leq p| Q=q, H_0)Pr(Q=q|H_0)}{Pr(P\leq p|Q=q)Pr(Q=q)} \\[2ex] & \approx \dfrac{p\times Pr(Q=q|H_0)}{Pr(P\leq p|Q=q)Pr(Q=q)} \\[2ex] & \approx \dfrac{p\times \color{red}{A(q)}}{\color{blue}{B(p, q)}/N} \end{split}$$$

where $$\color{red}{A(q)=\dfrac{\# \{Q=q, p>1/2\}}{\# \{p>1/2\}}}$$ and $$\color{blue}{B(p, q)=\# \{P\leq p, Q = q\}}$$.

Thus we want to solve

$$$\hat{h}(p_0,q = 0)=\hat{h}(p_1, q = 1)$$$

for $$p_1$$ if $$q=0$$ (set $$p_0=p_i$$) and for $$p_0$$ if $$q=1$$ (set $$p_1=p_i$$).

#### 3. Integrate the null distribution over this region to get $$v$$.

$$$\begin{split} \int_{L(p_0,p_1)}df_0 & = Pr(P,Q \in L(p_0, p_1)|H_0) \\ & = Pr((P\leq p_0, Q=0) \cup (P\leq p_1, Q=1) | H_0) \\ & = Pr(P\leq p_0, Q=0|H_0) + Pr(P\leq p_1, Q=1|H_1) \\ & = Pr(P\leq p_0|Q=0, H_0)Pr(Q=0|H_0) + Pr(P\leq p_1|Q=1, H_0)Pr(Q=1|H_0) \\ & = p_0\times (1-q_0) + p_1\times q_0 \end{split}$$$

where $$q_0=A(1)$$.

### Implementation

Begin by making a data frame with p = P value for association with T1D and q = binary active annotation indicator in thymus cells.

##             snps chr         p q
## 1 imm_7_50868985   7 0.7040051 0
## 2 imm_7_50869125   7 0.6381124 0
## 3 imm_7_50870347   7 0.2296590 0
## 4 imm_7_50870355   7 0.7335254 0
## 5 imm_7_50871208   7 0.9917130 0
## 6 imm_7_50871428   7 0.2443444 0

I first attempt to find the corresponding V value for the first $$(p,q)$$ pair. Note that I use a leave-one-chromosome-out method to account for LD.

i <- 1

pi <- df$p[i] qi <- df$q[i]

df_loo <- df[-which(df$chr == df[i, "chr"]), ] # df leave-one-out A <- function(data, q){ # empirical estimate of P(Q = qi | H0) length(which(data$q == q & data$p > 0.5))/length(which(data$p > 0.5))
}

B <- function(data, p, q){ # numerator of empirical estimate of P(P < pi | Q = qi)
# as denominator cancels in h
length(which(data$p <= p & data$q == q)) # to deal with smallest P vals
}

Since $$qi=0$$, I set $$p_0=pi$$ and find $$p_1$$ by solving

$$$\begin{split} \dfrac{p_0 A(0)}{B(p_0, q = 0)} &= \dfrac{p_1 A(1)}{B(p_1, q = 1)} \\[2ex] \dfrac{p_0 A(0)}{A(1) B(p_0, q = 0)} &= \dfrac{p_1}{B(p_1, q = 1)} \end{split}$$$

q0 <- A(data = df_loo, q = 1)

p0 <- pi

LHS <- (p0 * A(df_loo, 0))/(A(df_loo, 1) * B(df_loo, p0, q = 0))

RHS <- function(data, p, q) p/length(which(data$p <= p & data$q == q))

x.test <- seq(0, 1, 0.01) # possible p1 values

RHS.test <- sapply(x.test, function(x) RHS(data = df_loo, p = x, q = 1))

plot(x.test, RHS.test, xlab = "p1", ylab = "RHS")
abline(h = LHS, col = 2, lty = 2)

So there is no solution to the equation… This happens for ~8% of $$(p,q)$$ pairs, all of which have large P values:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.6299  0.7804  0.8505  0.8455  0.9148  0.9999

The summary of the $$p_1$$ values that are $$>1$$ are shown below:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   1.057   1.110   1.110   1.159   1.274

I’ve checked and there are no instances of $$p_1$$ or $$p_0$$ $$<0$$ and also no instances of $$p_0>1$$ when $$qi=1$$.

For these data pairs, I use the extendInt parameter in the uniroot function and set the value for $$p_1$$ to 1.

f <- function(data, c, p1, q) (p1/length(which(data$p <= p1 & data$q == q))) - c

epsilon <- 1e-35
p1 <- uniroot(f, c(0+epsilon, 1), data = df_loo, c = LHS, q = 1, extendInt = "yes")$root p1 ## [1] 1.011278 p1 <- 1 v <- p0*(1-q0) + p1*q0 data.frame(p0, p1, q0, pi, qi, v) ## p0 p1 q0 pi qi v ## 1 0.7040051 1 0.4550499 0.7040051 0 0.8386976 ### Implementation on whole data set When I implement the method on the whole data set, I get errors for 78 rows. The distribution of P values for these 78 data pairs is shown below (i.e. the error occurs for very small P values): ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.000e+00 0.000e+00 0.000e+00 7.894e-37 3.000e-42 8.689e-36 I follow through an example where this error occurs and find that it is because $$\color{blue}{B(p, q)=\# \{P\leq p, Q = q\}=0}$$, i.e. there are no P values in the leave-one-chromosome-out data set that have a smaller P value. Thus, the denominator of a fraction is 0… not sure what to do about this… I remove these rows for now. error_index <- readRDS("error_index.RDS") error_index1 <- readRDS("error_index1.RDS") data <- readRDS("cFDR_data.RDS") # p = pvals, q = active in thymus df <- na.omit(data[,c("snps","chr","P","THYMUS")]) colnames(df)[c(3,4)] <- c("p","q") df <- df[-error_index,] v <- vector("numeric", dim(df)[1]) # prepare a container cFDR <- for(i in 1:nrow(df)){ tryCatch( expr = { pi <- df$p[i]
qi <- df$q[i] df_loo <- df[-which(df$chr == df[i, "chr"]), ] # df leave-one-out

q0 <- A(data = df_loo, 1)
epsilon <- 1e-35

if(qi == 0){

p0 <- pi

LHS <- (p0 * A(df_loo, 0))/(A(df_loo, 1) * B(df_loo, p0, q = 0))

f <- function(data, c, p1, q) (p1/length(which(data$p <= p1 & data$q == q))) - c

p1 <- uniroot(f, c(0+epsilon, 1), data = df_loo, c = LHS, q = 1, extendInt = "yes", maxiter = 5000)$root if(p1 > 1) p1 <- 1 } else { p1 <- pi LHS <- (p1 * A(df_loo, 1))/(A(df_loo, 0) * B(df_loo, p1, q = 1)) f <- function(data, c, p0, q) (p0/length(which(data$p <= p0 & data$q))) - c p0 <- uniroot(f, c(0 + epsilon, 1), data = df_loo, c = LHS, q = 0, extendInt = "yes", maxiter = 5000)$root

}

v[i] <- p0*(1-q0) + p1*q0

},
error = function(e){
message("* Caught an error on itertion ", i)
print(e)
}
)

}
FALSE <simpleError in uniroot(f, c(0 + epsilon, 1), data = df_loo, c = LHS, q = 0,     extendInt = "yes", maxiter = 5000): no sign change found in 5000 iterations>
FALSE <simpleError in uniroot(f, c(0 + epsilon, 1), data = df_loo, c = LHS, q = 0,     extendInt = "yes", maxiter = 5000): no sign change found in 5000 iterations>
FALSE <simpleError in uniroot(f, c(0 + epsilon, 1), data = df_loo, c = LHS, q = 1,     extendInt = "yes", maxiter = 5000): no sign change found in 5000 iterations>