Notation


\(H\) denotes a hypothesis of association with T1D; \(H=1\) indicates association, \(H=0\) indicates no association (null hypothesis).

\[\begin{equation} H \sim \text{Bernoulli}(\pi_1) \Longrightarrow P(H=1)=\pi_1 \end{equation}\]


\[\begin{equation} \begin{split} Q|H=0 \sim \text{Bernoulli}(q_0) & \Longrightarrow P(Q=1|H=0)=q_0 \\ & \Longrightarrow P(Q=0|H=0)=1-q_0 \end{split} \end{equation}\]

\[\begin{equation} P|H=0 \sim U(0,1) \end{equation}\]

\[\begin{equation} P \perp Q|H = 0 \end{equation}\]


\[\begin{equation} \begin{split} Q|H=1 \sim \text{Bernoulli}(q_1) & \Longrightarrow P(Q=1|H=1)=q_1 \\ & \Longrightarrow P(Q=0|H=1)=1-q_1 \end{split} \end{equation}\]


\[\begin{equation} f(P=p, Q=q|H_0)=f_0(p,q) \end{equation}\]

\[\begin{equation} f(P=p, Q=q|H_1)=f_1(p,q) \end{equation}\]

\[\begin{equation} f(P=p, Q=q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1=f(p,q) \end{equation}\]
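For intuition, the full generative model above can be simulated directly; the parameter values and the Beta(0.5, 1) alternative for non-null P values below are hypothetical choices for illustration only, not estimates from the data.

```r
set.seed(1)
N   <- 1e5
pi1 <- 0.2  # hypothetical P(H = 1)
q0  <- 0.1  # hypothetical P(Q = 1 | H = 0)
q1  <- 0.6  # hypothetical P(Q = 1 | H = 1)

H <- rbinom(N, 1, pi1)                          # H ~ Bernoulli(pi1)
Q <- rbinom(N, 1, ifelse(H == 1, q1, q0))       # Q | H ~ Bernoulli(q_H)
P <- ifelse(H == 1, rbeta(N, 0.5, 1), runif(N)) # P | H = 0 ~ U(0,1), indep. of Q

mean(H)         # close to pi1
mean(Q[H == 0]) # close to q0
```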


Method


1. Generate the set of possible rejection regions.

Since \(Q\) can only be 0 or 1, we just need two P value thresholds, \(p_0\) for \(Q=0\) and \(p_1\) for \(Q=1\).

\[\begin{equation} L(p_0, p_1)=(P\leq p_0, Q=0) \cup (P\leq p_1, Q=1). \end{equation}\]
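In code, membership of a \((p,q)\) pair in \(L(p_0, p_1)\) is just a pair of comparisons; a minimal sketch (the name in_region is mine):

```r
# TRUE if (p, q) falls in the rejection region L(p0, p1)
in_region <- function(p, q, p0, p1) {
  (q == 0 & p <= p0) | (q == 1 & p <= p1)
}

in_region(p = 0.03, q = 0, p0 = 0.05, p1 = 0.20) # TRUE: q = 0 and p <= p0
in_region(p = 0.10, q = 0, p0 = 0.05, p1 = 0.20) # FALSE: q = 0 but p > p0
in_region(p = 0.10, q = 1, p0 = 0.05, p1 = 0.20) # TRUE: q = 1 and p <= p1
```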


2. Pick the smallest rejection region to which \((p[i], q[i])\) belongs.

Following on from our aim, we want the rejection region such that:

  1. the probability of a null instance of \(P,Q\) falling inside \(L(p_0, p_1)\) is fixed at some value \(\alpha\), that is

\[\begin{equation} \int_{L(p_0, p_1)}df_0=\alpha \end{equation}\]

  2. the probability of a non-null instance of \(P,Q\) falling inside \(L(p_0, p_1)\) is maximal, that is

\[\begin{equation} \int_{L(p_0, p_1)}df_1 \text{ is maximal} \end{equation}\]

Previous work (e.g. appendix 8.1 in James’ paper) shows that this optimal region is given by the set of points \(\{(p,q):f_0(p,q)/f_1(p,q)\leq k_{\alpha}\}\), which means that in our case the chosen \(p_0\) and \(p_1\) should satisfy

\[\begin{equation} \dfrac{f_0(p_0,0)}{f(p_0,0)}=\dfrac{f_0(p_1,1)}{f(p_1,1)} \end{equation}\]

using \(f(p,q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1\).
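The switch from the ratio \(f_0/f_1\) in the optimality result to the ratio \(f_0/f\) used here deserves a line of justification: for fixed \(\pi_1\in(0,1)\), one ratio is an increasing function of the other,

\[\begin{equation} \dfrac{f_0(p,q)}{f(p,q)}=\dfrac{f_0}{(1-\pi_1)f_0+\pi_1 f_1}=\dfrac{1}{(1-\pi_1)+\pi_1 f_1/f_0}, \end{equation}\]

so \(f_0/f_1\leq k_{\alpha}\) holds exactly when \(f_0/f\leq k'_{\alpha}\) for a corresponding threshold \(k'_{\alpha}\); the two ratios define the same family of rejection regions.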

To find \(p_0\) and \(p_1\) that satisfy this equation, we derive an estimator of

\[\begin{equation} \begin{split} \dfrac{f_0(p,q)}{f(p,q)} &= \dfrac{f(P=p, Q=q|H_0)}{f(P=p, Q=q)}. \end{split} \end{equation}\]

Let

\[\begin{equation} \begin{split} \hat{h}(p,q) &= \dfrac{Pr(P\leq p, Q=q|H_0)}{Pr(P\leq p, Q=q)} \\[2ex] & = \dfrac{Pr(P\leq p| Q=q, H_0)Pr(Q=q|H_0)}{Pr(P\leq p|Q=q)Pr(Q=q)} \\[2ex] & \approx \dfrac{p\times Pr(Q=q|H_0)}{Pr(P\leq p|Q=q)Pr(Q=q)} \\[2ex] & \approx \dfrac{p\times \color{red}{A(q)}}{\color{blue}{B(p, q)}/N} \end{split} \end{equation}\]

where \(\color{red}{A(q)=\dfrac{\# \{Q=q, p>1/2\}}{\# \{p>1/2\}}}\) and \(\color{blue}{B(p, q)=\# \{P\leq p, Q = q\}}\).

Thus we want to solve

\[\begin{equation} \hat{h}(p_0,q = 0)=\hat{h}(p_1, q = 1) \end{equation}\]

for \(p_1\) if \(q_i=0\) (setting \(p_0=p_i\)) and for \(p_0\) if \(q_i=1\) (setting \(p_1=p_i\)).


3. Integrate the null distribution over this region to get \(v\).

\[\begin{equation} \begin{split} \int_{L(p_0,p_1)}df_0 & = Pr((P,Q) \in L(p_0, p_1)|H_0) \\ & = Pr((P\leq p_0, Q=0) \cup (P\leq p_1, Q=1) | H_0) \\ & = Pr(P\leq p_0, Q=0|H_0) + Pr(P\leq p_1, Q=1|H_0) \\ & = Pr(P\leq p_0|Q=0, H_0)Pr(Q=0|H_0) + Pr(P\leq p_1|Q=1, H_0)Pr(Q=1|H_0) \\ & = p_0\times (1-q_0) + p_1\times q_0 \end{split} \end{equation}\]

(the third line uses that the two events are disjoint, since one requires \(Q=0\) and the other \(Q=1\); the last line uses \(P \perp Q|H=0\) and \(P|H=0 \sim U(0,1)\))

where \(q_0=A(1)\).
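As a quick sanity check on this closed form, it can be compared with a Monte Carlo estimate under the null; the values of \(q_0\), \(p_0\) and \(p_1\) below are hypothetical:

```r
set.seed(1)
N  <- 1e5
q0 <- 0.3; p0 <- 0.05; p1 <- 0.2 # hypothetical values

# under H = 0: P ~ U(0,1) independent of Q ~ Bernoulli(q0)
P <- runif(N)
Q <- rbinom(N, 1, q0)

empirical   <- mean((Q == 0 & P <= p0) | (Q == 1 & P <= p1))
closed_form <- p0 * (1 - q0) + p1 * q0 # = 0.095

c(empirical, closed_form)
```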


Implementation


Begin by making a data frame with p = P value for association with T1D and q = binary active annotation indicator in thymus cells.

##             snps chr         p q
## 1 imm_7_50868985   7 0.7040051 0
## 2 imm_7_50869125   7 0.6381124 0
## 3 imm_7_50870347   7 0.2296590 0
## 4 imm_7_50870355   7 0.7335254 0
## 5 imm_7_50871208   7 0.9917130 0
## 6 imm_7_50871428   7 0.2443444 0

I first attempt to find the corresponding V value for the first \((p,q)\) pair. Note that I use a leave-one-chromosome-out method to account for LD.

i <- 1

pi <- df$p[i]
qi <- df$q[i]

df_loo <- df[-which(df$chr == df[i, "chr"]), ] # leave-one-chromosome-out df

A <- function(data, q){ # empirical estimate of P(Q = q | H0)
  length(which(data$q == q & data$p > 0.5))/length(which(data$p > 0.5))
}

B <- function(data, p, q){ # numerator of empirical estimate of P(P <= p | Q = q);
                           # the denominator cancels in h-hat
  length(which(data$p <= p & data$q == q))
}
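As a quick check of the counting in A and B, a hypothetical six-row data frame gives hand-countable values (the functions are restated so the snippet is self-contained):

```r
A <- function(data, q){ # empirical estimate of P(Q = q | H0), as above
  length(which(data$q == q & data$p > 0.5))/length(which(data$p > 0.5))
}
B <- function(data, p, q){ # count of {P <= p, Q = q}, as above
  length(which(data$p <= p & data$q == q))
}

toy <- data.frame(p = c(0.9, 0.8, 0.3, 0.6, 0.1, 0.7),
                  q = c(0,   1,   0,   1,   1,   0))

A(toy, q = 1)          # 4 rows have p > 0.5, of which 2 have q = 1 -> 0.5
B(toy, p = 0.5, q = 1) # 1 row (p = 0.1) has p <= 0.5 and q = 1 -> 1
```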

Since \(q_i=0\), I set \(p_0=p_i\) and find \(p_1\) by solving

\[\begin{equation} \begin{split} \dfrac{p_0 A(0)}{B(p_0, q = 0)} &= \dfrac{p_1 A(1)}{B(p_1, q = 1)} \\[2ex] \dfrac{p_0 A(0)}{A(1) B(p_0, q = 0)} &= \dfrac{p_1}{B(p_1, q = 1)} \end{split} \end{equation}\]

q0 <- A(data = df_loo, q = 1)

p0 <- pi

LHS <- (p0 * A(df_loo, 0))/(A(df_loo, 1) * B(df_loo, p0, q = 0))

RHS <- function(data, p, q) p/length(which(data$p <= p & data$q == q))

x.test <- seq(0, 1, 0.01) # possible p1 values

RHS.test <- sapply(x.test, function(x) RHS(data = df_loo, p = x, q = 1))

plot(x.test, RHS.test, xlab = "p1", ylab = "RHS")
abline(h = LHS, col = 2, lty = 2)

So there is no solution to the equation… This happens for ~8% of \((p,q)\) pairs, all of which have large P values:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6299  0.7804  0.8505  0.8455  0.9148  0.9999

A summary of the \(p_1\) values that are \(>1\) is shown below:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.057   1.110   1.110   1.159   1.274

I’ve checked and there are no instances of \(p_1\) or \(p_0\) \(<0\), and also no instances of \(p_0>1\) when \(q_i=1\).

For these data pairs, I use the extendInt argument of uniroot to let the search interval extend beyond \([0,1]\), and then truncate the resulting \(p_1\) at 1.

f <- function(data, c, p1, q) (p1/length(which(data$p <= p1 & data$q == q))) - c # uniroot varies p1, the only argument not fixed by name below

epsilon <- 1e-35
p1 <- uniroot(f, c(0+epsilon, 1), data = df_loo, c = LHS, q = 1, extendInt = "yes")$root

p1
## [1] 1.011278
p1 <- 1

v <- p0*(1-q0) + p1*q0

data.frame(p0, p1, q0, pi, qi, v)
##          p0 p1        q0        pi qi         v
## 1 0.7040051  1 0.4550499 0.7040051  0 0.8386976

Implementation on whole data set


When I implement the method on the whole data set, I get errors for 78 rows. The distribution of P values for these 78 data pairs is shown below (i.e. the error occurs for very small P values):

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 0.000e+00 0.000e+00 7.894e-37 3.000e-42 8.689e-36

I follow through an example where this error occurs and find that it is because \(\color{blue}{B(p, q)=\# \{P\leq p, Q = q\}=0}\), i.e. no P value in the leave-one-chromosome-out data set is at least this small, so the denominator of \(\hat{h}\) is 0. I am not sure what to do about this yet, so I remove these rows for now.


error_index <- readRDS("error_index.RDS")
error_index1 <- readRDS("error_index1.RDS")

data <- readRDS("cFDR_data.RDS")

# p = pvals, q = active in thymus

df <- na.omit(data[,c("snps","chr","P","THYMUS")])

colnames(df)[c(3,4)] <- c("p","q")

df <- df[-error_index,]

v <- vector("numeric", dim(df)[1]) # prepare a container

for(i in 1:nrow(df)){
  tryCatch(
    expr = {
      pi <- df$p[i]
      qi <- df$q[i]
      
      df_loo <- df[-which(df$chr == df[i, "chr"]), ] # leave-one-chromosome-out df
      
      q0 <- A(data = df_loo, 1)
      epsilon <- 1e-35
      
      if(qi == 0){
        
        p0 <- pi
        
        LHS <- (p0 * A(df_loo, 0))/(A(df_loo, 1) * B(df_loo, p0, q = 0))
        
        f <- function(data, c, p1, q) (p1/length(which(data$p <= p1 & data$q == q))) - c 
        
        p1 <- uniroot(f, c(0+epsilon, 1), data = df_loo, c = LHS, q = 1, extendInt = "yes", maxiter = 5000)$root
        
        if(p1 > 1) p1 <- 1
        
      } else {
        
        p1 <- pi
        
        LHS <- (p1 * A(df_loo, 1))/(A(df_loo, 0) * B(df_loo, p1, q = 1)) 
        
        f <- function(data, c, p0, q) (p0/length(which(data$p <= p0 & data$q == q))) - c
        
        p0 <- uniroot(f, c(0 + epsilon, 1), data = df_loo, c = LHS, q = 0, extendInt = "yes", maxiter = 5000)$root
        
      } 
      
      v[i] <- p0*(1-q0) + p1*q0
      
    },
    error = function(e){
      message("* Caught an error on iteration ", i)
            print(e)
    }
  )
  
}
## <simpleError in uniroot(f, c(0 + epsilon, 1), data = df_loo, c = LHS, q = 0,     extendInt = "yes", maxiter = 5000): no sign change found in 5000 iterations>
## <simpleError in uniroot(f, c(0 + epsilon, 1), data = df_loo, c = LHS, q = 0,     extendInt = "yes", maxiter = 5000): no sign change found in 5000 iterations>
## <simpleError in uniroot(f, c(0 + epsilon, 1), data = df_loo, c = LHS, q = 1,     extendInt = "yes", maxiter = 5000): no sign change found in 5000 iterations>


Comments

  • Need to think of a better way to deal with there being no solution to the equation for large P values (rather than setting \(p_1=1\) as in the walkthrough example).

  • Need to consider what to do for the lowest P values in the data set (those for which \(B(p, q)=\# \{P\leq p, Q = q\}=0\)). Could set this to e.g. 1 instead of 0 (so we are not dividing by 0) but this doesn’t help with the RHS of the equation we are trying to solve.

# what if there are no P values smaller than p?

B <- function(data, p, q){ # empirical estimate of P(P < pi | Q = qi)
  max(length(which(data$p <= p & data$q == q)), 1) # to deal with smallest P vals
}
  • At the moment I’m setting [0+epsilon, 1] with epsilon <- 1e-35 as the search interval in uniroot, because \(B\) is 0 at \(p=0\) (no P values are \(\leq 0\)). Note that this is why there are some blue values with very small P value and fixed V value.
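The modified B above can be checked on a hypothetical data frame: a P value smaller than anything in the data now yields a denominator of 1 rather than 0 (B_safe is my name for the guarded version).

```r
# guarded count: returns at least 1, so the denominator of h-hat is never 0
B_safe <- function(data, p, q){
  max(length(which(data$p <= p & data$q == q)), 1)
}

toy <- data.frame(p = c(0.9, 0.8, 0.3), q = c(0, 1, 1))

B_safe(toy, p = 1e-40, q = 1) # no P value this small: returns 1, not 0
B_safe(toy, p = 0.9,   q = 1) # two rows with p <= 0.9 and q = 1: returns 2
```

As noted in the bullet above, this only fixes the division by zero; it does not by itself produce a solution on the RHS of the matching equation.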

Extra stuff before meeting

plot_data <- cbind(df, v)

plot_data <- plot_data[-which(plot_data$v == 0), ] # drop rows where the loop errored and v was left at 0

p_BH <- p.adjust(plot_data$p, method = "BH")

v_BH <- p.adjust(plot_data$v, method = "BH")

par(mfrow=c(1,3))

plot(plot_data$p, p_BH, xlab = "Original P", ylab = "BH-adjusted P values")
abline(a = 0, b = 1)

plot(plot_data$v, v_BH, xlab = "Original V", ylab = "BH-adjusted V values")
abline(a = 0, b = 1)

plot(plot_data$v, p_BH, xlab = "Original V", ylab = "BH-adjusted P values")
abline(a=0, b=1)

###############

length(which(plot_data$p < 0.01))/dim(plot_data)[1] * 100 # 31.4% of original p values below threshold
## [1] 31.37723
length(which(p_BH < 0.01))/length(p_BH) * 100 # 26.6% of BH adjusted p values below threshold
## [1] 26.64614
length(which(plot_data$v < 0.01))/dim(plot_data)[1] * 100 # 31.2% of v values below threshold
## [1] 31.25785
length(which(v_BH < 0.01))/length(v_BH) * 100 # 26.4% of BH adjusted v values below threshold
## [1] 26.44509
###################

prop.table(table(plot_data[which(plot_data$p<0.01),"q"]))
## 
##         0         1 
## 0.4665599 0.5334401
prop.table(table(plot_data[which(p_BH<0.01),"q"]))
## 
##         0         1 
## 0.4595614 0.5404386
prop.table(table(plot_data[which(plot_data$v<0.01),"q"]))
## 
##         0         1 
## 0.4528643 0.5471357
prop.table(table(plot_data[which(v_BH<0.01),"q"]))
## 
##         0         1 
## 0.4490378 0.5509622
##################

prop.table(table(plot_data[which(plot_data$v<0.01 & plot_data$p<0.01),"q"]))
## 
##         0         1 
## 0.4582062 0.5417938
prop.table(table(plot_data[which(v_BH<0.01 & p_BH<0.01),"q"]))
## 
##         0         1 
## 0.4519369 0.5480631