\(H\) denotes a hypothesis of association with T1D; \(H=1\) indicates association, \(H=0\) indicates no association (null hypothesis).
\[\begin{equation} H \sim \text{Bernoulli}(\pi_1) \Longrightarrow P(H=1)=\pi_1 \end{equation}\]
\[\begin{equation} \begin{split} Q|H=0 \sim \text{Bernoulli}(q_0) & \Longrightarrow P(Q=1|H=0)=q_0 \\ & \Longrightarrow P(Q=0|H=0)=1-q_0 \end{split} \end{equation}\]
\[\begin{equation} P|H=0 \sim U(0,1) \end{equation}\]
\[\begin{equation} P \perp Q|H = 0 \end{equation}\]
\[\begin{equation} \begin{split} Q|H=1 \sim \text{Bernoulli}(q_1) & \Longrightarrow P(Q=1|H=1)=q_1 \\ & \Longrightarrow P(Q=0|H=1)=1-q_1 \end{split} \end{equation}\]
\[\begin{equation} f(P=p, Q=q|H_0)=f_0(p,q) \end{equation}\]
\[\begin{equation} f(P=p, Q=q|H_1)=f_1(p,q) \end{equation}\]
\[\begin{equation} f(P=p, Q=q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1=f(p,q) \end{equation}\]
Since \(Q\) can only be 0 or 1, we just need two P value thresholds, \(p_0\) for \(Q=0\) and \(p_1\) for \(Q=1\).
\[\begin{equation} L(p_0, p_1)=(P\leq p_0, Q=0) \cup (P\leq p_1, Q=1). \end{equation}\]
Following on from our aim, we want the rejection region such that:
\[\begin{equation} \int_{L(p_0, p_1)}df_0=\alpha \end{equation}\]
\[\begin{equation} \int_{L(p_0, p_1)}df_1 \text{ is maximal} \end{equation}\]
Previous works (e.g. appendix 8.1 in James’ paper) show that this optimal region is given by the set of points \(\{(p,q):f_0(p,q)/f_1(p,q)\leq k_{\alpha}\}\) which means that in our case, the chosen \(p_0\) and \(p_1\) should satisfy
\[\begin{equation} \dfrac{f_0(p_0,0)}{f(p_0,0)}=\dfrac{f_0(p_1,1)}{f(p_1,1)} \end{equation}\]
using \(f(p,q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1\).
To find \(p_0\) and \(p_1\) that satisfy this equation, we derive an estimator of
\[\begin{equation} \begin{split} \dfrac{f_0(p,q)}{f(p,q)} &= \dfrac{f(P=p, Q=q|H_0)}{f(P=p, Q=q)}. \end{split} \end{equation}\]
Let
\[\begin{equation} \begin{split} \hat{h}(p,q) &= \dfrac{Pr(P\leq p, Q=q|H_0)}{Pr(P\leq p, Q=q)} \\[2ex] & = \dfrac{Pr(P\leq p| Q=q, H_0)Pr(Q=q|H_0)}{Pr(P\leq p|Q=q)Pr(Q=q)} \\[2ex] & \approx \dfrac{p\times Pr(Q=q|H_0)}{Pr(P\leq p|Q=q)Pr(Q=q)} \\[2ex] & \approx \dfrac{p\times \color{red}{A(q)}}{\color{blue}{B(p, q)}/N} \end{split} \end{equation}\]
where \(\color{red}{A(q)=\dfrac{\# \{Q=q, p>1/2\}}{\# \{p>1/2\}}}\) and \(\color{blue}{B(p, q)=\# \{P\leq p, Q = q\}}\).
Thus we want to solve
\[\begin{equation} \hat{h}(p_0,q = 0)=\hat{h}(p_1, q = 1) \end{equation}\]
for \(p_1\) if \(q=0\) (set \(p_0=p_i\)) and for \(p_0\) if \(q=1\) (set \(p_1=p_i\)).
\[\begin{equation} \begin{split} \int_{L(p_0,p_1)}df_0 & = Pr(P,Q \in L(p_0, p_1)|H_0) \\ & = Pr((P\leq p_0, Q=0) \cup (P\leq p_1, Q=1) | H_0) \\ & = Pr(P\leq p_0, Q=0|H_0) + Pr(P\leq p_1, Q=1|H_1) \\ & = Pr(P\leq p_0|Q=0, H_0)Pr(Q=0|H_0) + Pr(P\leq p_1|Q=1, H_0)Pr(Q=1|H_0) \\ & = p_0\times (1-q_0) + p_1\times q_0 \end{split} \end{equation}\]
where \(q_0=A(1)\).
Begin by making a data frame with p = P value for association with T1D and q = binary active annotation indicator in thymus cells.
## snps chr p q
## 1 imm_7_50868985 7 0.7040051 0
## 2 imm_7_50869125 7 0.6381124 0
## 3 imm_7_50870347 7 0.2296590 0
## 4 imm_7_50870355 7 0.7335254 0
## 5 imm_7_50871208 7 0.9917130 0
## 6 imm_7_50871428 7 0.2443444 0
I first attempt to find the corresponding V value for the first \((p,q)\) pair. Note that I use a leave-one-chromosome-out method to account for LD.
i <- 1
pi <- df$p[i]
qi <- df$q[i]
df_loo <- df[-which(df$chr == df[i, "chr"]), ] # df leave-one-out
A <- function(data, q){ # empirical estimate of P(Q = qi | H0)
length(which(data$q == q & data$p > 0.5))/length(which(data$p > 0.5))
}
B <- function(data, p, q){ # numerator of empirical estimate of P(P < pi | Q = qi)
# as denominator cancels in h
length(which(data$p <= p & data$q == q)) # to deal with smallest P vals
}
Since \(qi=0\), I set \(p_0=pi\) and find \(p_1\) by solving
\[\begin{equation} \begin{split} \dfrac{p_0 A(0)}{B(p_0, q = 0)} &= \dfrac{p_1 A(1)}{B(p_1, q = 1)} \\[2ex] \dfrac{p_0 A(0)}{A(1) B(p_0, q = 0)} &= \dfrac{p_1}{B(p_1, q = 1)} \end{split} \end{equation}\]
q0 <- A(data = df_loo, q = 1)
p0 <- pi
LHS <- (p0 * A(df_loo, 0))/(A(df_loo, 1) * B(df_loo, p0, q = 0))
RHS <- function(data, p, q) p/length(which(data$p <= p & data$q == q))
x.test <- seq(0, 1, 0.01) # possible p1 values
RHS.test <- sapply(x.test, function(x) RHS(data = df_loo, p = x, q = 1))
plot(x.test, RHS.test, xlab = "p1", ylab = "RHS")
abline(h = LHS, col = 2, lty = 2)
So there is no solution to the equation… This happens for ~8% of \((p,q)\) pairs, all of which have large P values:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6299 0.7804 0.8505 0.8455 0.9148 0.9999
The summary of the \(p_1\) values that are \(>1\) are shown below:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.057 1.110 1.110 1.159 1.274
I’ve checked and there are no instances of \(p_1\) or \(p_0\) \(<0\) and also no instances of \(p_0>1\) when \(qi=1\).
For these data pairs, I use the extendInt
parameter in the uniroot
function and set the value for \(p_1\) to 1.
f <- function(data, c, p1, q) (p1/length(which(data$p <= p1 & data$q == q))) - c
epsilon <- 1e-35
p1 <- uniroot(f, c(0+epsilon, 1), data = df_loo, c = LHS, q = 1, extendInt = "yes")$root
p1
## [1] 1.011278
p1 <- 1
v <- p0*(1-q0) + p1*q0
data.frame(p0, p1, q0, pi, qi, v)
## p0 p1 q0 pi qi v
## 1 0.7040051 1 0.4550499 0.7040051 0 0.8386976
When I implement the method on the whole data set, I get errors for 78 rows. The distribution of P values for these 78 data pairs is shown below (i.e. the error occurs for very small P values):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 0.000e+00 0.000e+00 7.894e-37 3.000e-42 8.689e-36
I follow through an example where this error occurs and find that it is because \(\color{blue}{B(p, q)=\# \{P\leq p, Q = q\}=0}\), i.e. there are no P values in the leave-one-chromosome-out data set that have a smaller P value. Thus, the denominator of a fraction is 0… not sure what to do about this… I remove these rows for now.
error_index <- readRDS("error_index.RDS")
error_index1 <- readRDS("error_index1.RDS")
data <- readRDS("cFDR_data.RDS")
# p = pvals, q = active in thymus
df <- na.omit(data[,c("snps","chr","P","THYMUS")])
colnames(df)[c(3,4)] <- c("p","q")
df <- df[-error_index,]
v <- vector("numeric", dim(df)[1]) # prepare a container
cFDR <- for(i in 1:nrow(df)){
tryCatch(
expr = {
pi <- df$p[i]
qi <- df$q[i]
df_loo <- df[-which(df$chr == df[i, "chr"]), ] # df leave-one-out
q0 <- A(data = df_loo, 1)
epsilon <- 1e-35
if(qi == 0){
p0 <- pi
LHS <- (p0 * A(df_loo, 0))/(A(df_loo, 1) * B(df_loo, p0, q = 0))
f <- function(data, c, p1, q) (p1/length(which(data$p <= p1 & data$q == q))) - c
p1 <- uniroot(f, c(0+epsilon, 1), data = df_loo, c = LHS, q = 1, extendInt = "yes", maxiter = 5000)$root
if(p1 > 1) p1 <- 1
} else {
p1 <- pi
LHS <- (p1 * A(df_loo, 1))/(A(df_loo, 0) * B(df_loo, p1, q = 1))
f <- function(data, c, p0, q) (p0/length(which(data$p <= p0 & data$q))) - c
p0 <- uniroot(f, c(0 + epsilon, 1), data = df_loo, c = LHS, q = 0, extendInt = "yes", maxiter = 5000)$root
}
v[i] <- p0*(1-q0) + p1*q0
},
error = function(e){
message("* Caught an error on itertion ", i)
print(e)
}
)
}
FALSE <simpleError in uniroot(f, c(0 + epsilon, 1), data = df_loo, c = LHS, q = 0, extendInt = "yes", maxiter = 5000): no sign change found in 5000 iterations>
FALSE <simpleError in uniroot(f, c(0 + epsilon, 1), data = df_loo, c = LHS, q = 0, extendInt = "yes", maxiter = 5000): no sign change found in 5000 iterations>
FALSE <simpleError in uniroot(f, c(0 + epsilon, 1), data = df_loo, c = LHS, q = 1, extendInt = "yes", maxiter = 5000): no sign change found in 5000 iterations>
plot_data <- cbind(df, v)
plot_data <- plot_data[-which(plot_data$v==0), ]
p_BH <- p.adjust(plot_data$p, method = "BH")
v_BH <- p.adjust(plot_data$v, method = "BH")
par(mfrow=c(1,3))
plot(plot_data$p, p_BH, xlab = "Original P", ylab = "p.ajusted BH corrected P values")
abline(a = 0, b = 1)
plot(plot_data$v, v_BH, xlab = "Original V", ylab = "p.ajusted BH corrected V values")
abline(a = 0, b = 1)
plot(plot_data$v, p_BH, xlab = "Original V", ylab = "p.ajusted BH corrected P values")
abline(a=0, b=1)
###############
length(which(plot_data$p < 0.01))/dim(plot_data)[1] * 100 # 31.4% of original p values below threshold
## [1] 31.37723
length(which(p_BH < 0.01))/length(p_BH) * 100 # 26.6% of BH adjusted p values below threshold
## [1] 26.64614
length(which(plot_data$v < 0.01))/dim(plot_data)[1] * 100 # 31.2% of v values below threshold
## [1] 31.25785
length(which(v_BH < 0.01))/length(v_BH) * 100 # 26.4% of BH adjusted v values below threshold
## [1] 26.44509
###################
prop.table(table(plot_data[which(plot_data$p<0.01),"q"]))
##
## 0 1
## 0.4665599 0.5334401
prop.table(table(plot_data[which(p_BH<0.01),"q"]))
##
## 0 1
## 0.4595614 0.5404386
prop.table(table(plot_data[which(plot_data$v<0.01),"q"]))
##
## 0 1
## 0.4528643 0.5471357
prop.table(table(plot_data[which(v_BH<0.01),"q"]))
##
## 0 1
## 0.4490378 0.5509622
##################
prop.table(table(plot_data[which(plot_data$v<0.01 & plot_data$p<0.01),"q"]))
##
## 0 1
## 0.4582062 0.5417938
prop.table(table(plot_data[which(v_BH<0.01 & p_BH<0.01),"q"]))
##
## 0 1
## 0.4519369 0.5480631
Comments
Need to think of a better way to deal with no solution to the equation for large P values (rather than setting \(p1=1\)) as in the walkthrough example.
Need to consider what to do for the lowest P values in the data set (so that \(B(p, q)=\# \{P\leq p, Q = q\}=0\)). Could set this to e.g. 1 instead of 0 (so we are not dividing by 0 in a fraction) but this doesn’t help with the RHS of the equation we are trying to solve..
[0+epsilon, 1]
withepsilon <- 1e-35
as the interval in the uniroot equation because there are no solutions to B for 0 (no P values less than 0). Note that this is why there are some blue values with very small P value and fixed V value.