1. Introduction


Aim: Take a set of P values for association with a given disease (\(p_1,...,p_n \in (0,1]\)) and a set of binary covariates (\(q_1,...,q_n \in \{0,1\}\)) measuring overlap with an active chromatin mark in a given cell type, and transform the P values into values \(v_1,...,v_n\) such that for all \(\alpha\)

\[\begin{equation} Pr(v_i<\alpha| H_0)=\alpha \end{equation}\]

\[\begin{equation} Pr(v_i<\alpha| H_1)\text{ is maximal}. \end{equation}\]

Method: Choose a set of rejection regions, and for each \((p_i, q_i)\) pair find the smallest rejection region in which \((p_i, q_i)\) is contained. Estimate the distribution of P, Q under the null and integrate this over the selected rejection region to get \(v_i\).


2. Notation


\(H\) denotes a hypothesis of association with T1D; \(H=1\) indicates association, \(H=0\) indicates no association (null hypothesis).

\[\begin{equation} H \sim \text{Bernoulli}(\pi_1) \Longrightarrow P(H=1)=\pi_1 \end{equation}\]


\[\begin{equation} \begin{split} Q|H=0 \sim \text{Bernoulli}(q_1) & \Longrightarrow P(Q=1|H=0)=q_1 \\ & \Longrightarrow P(Q=0|H=0)=1-q_1=q_0 \end{split} \end{equation}\]


\[\begin{equation} P|H=0 \sim U(0,1) \end{equation}\]

\[\begin{equation} P \perp Q|H = 0 \end{equation}\]


\[\begin{equation} f(P=p, Q=q|H_0)=f_0(p,q) \end{equation}\]

\[\begin{equation} f(P=p, Q=q|H_1)=f_1(p,q) \end{equation}\]

\[\begin{equation} f(P=p, Q=q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1=f(p,q) \end{equation}\]
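To make the notation concrete, here is a quick simulation from this model (purely illustrative: the alternative distributions \(P|H=1 \sim \text{Beta}(0.1,1)\) and \(Pr(Q=1|H=1)=0.8\) are arbitrary choices of mine, since only the null is specified above).

# Quick simulation from the model above, purely to illustrate the notation.
# The alternative distributions (P|H=1 ~ Beta(0.1, 1) and Pr(Q=1|H=1) = 0.8)
# are arbitrary choices -- the notation section does not specify them.
set.seed(1)
n   <- 10000
pi1 <- 0.05                                        # Pr(H = 1)
q1  <- 0.3                                         # Pr(Q = 1 | H = 0)
H   <- rbinom(n, 1, pi1)
Q   <- ifelse(H == 0, rbinom(n, 1, q1), rbinom(n, 1, 0.8))
P   <- ifelse(H == 0, runif(n), rbeta(n, 0.1, 1))
mean(P[H == 0] <= 0.05)                            # should be close to 0.05: P uniform under the null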


3. Detailed method


1. Generate the set of possible rejection regions.

Since \(Q\) can only be 0 or 1, we just need two P value thresholds, \(p_0\) for \(Q=0\) and \(p_1\) for \(Q=1\).

\[\begin{equation} L(p_0, p_1)=(P\leq p_0, Q=0) \cup (P\leq p_1, Q=1). \end{equation}\]
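For concreteness, membership of a \((p,q)\) pair in \(L(p_0, p_1)\) is simple to express in code (a trivial sketch, not part of the method itself).

# TRUE if (p, q) falls inside the rejection region L(p0, p1)
in_L <- function(p, q, p0, p1) (q == 0 & p <= p0) | (q == 1 & p <= p1)

in_L(p = 0.03, q = 1, p0 = 0.01, p1 = 0.05)   # TRUE: the more lenient Q = 1 threshold applies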


2. Pick the smallest rejection region to which \((p[i], q[i])\) belongs.

Following on from our aim, we want the rejection region such that:

a.) the probability of a null instance of \(P,Q\) falling inside \(L(p_0, p_1)\) is fixed at some value \(\alpha\), that is

\[\begin{equation} \int_{L(p_0, p_1)}df_0=\alpha \end{equation}\]

b.) the probability of a non-null instance of \(P,Q\) falling inside \(L(p_0, p_1)\) is maximal, that is

\[\begin{equation} \int_{L(p_0, p_1)}df_1 \text{ is maximal} \end{equation}\]

Previous work (e.g. appendix 8.1 in James’ paper) shows that this optimal region is the set of points \(\{(p,q):f_0(p,q)/f_1(p,q)\leq k_{\alpha}\}\). Since \(f(p,q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1\), the ratio \(f_0/f\) is a monotone increasing function of \(f_0/f_1\), so the region is equivalently \(\{(p,q):f_0(p,q)/f(p,q)\leq k'_{\alpha}\}\) for some threshold \(k'_{\alpha}\). The boundary points \((p_0,0)\) and \((p_1,1)\) both sit at this threshold, which means that in our case the chosen \(p_0\) and \(p_1\) should satisfy

\[\begin{equation} \dfrac{f_0(p_0,0)}{f(p_0,0)}=\dfrac{f_0(p_1,1)}{f(p_1,1)}. \end{equation}\]


But how do we find \(p_0\) and \(p_1\) that satisfy this equation? James provides two options:

Option 1: Estimate \(f\) directly

Consider \[\begin{equation} \begin{split} g(p,q) & = \dfrac{f_0(p,q)}{f(p,q)} \\[2ex] & = \dfrac{f(P=p, Q=q|H_0)}{f(p,q)} \\[2ex] & = \dfrac{f(P=p|Q=q, H_0) f(Q=q|H_0)}{f(p,q)} \end{split} \end{equation}\]

then, since the P values are distributed uniformly under the null, \(f(P=p|Q=q, H_0)=1\), so the numerator reduces to \(f(Q=q|H_0)\), i.e. \(q_0\) if \(q=0\) or \(q_1\) if \(q=1\), which we can estimate empirically from the observed proportions of each annotation (assuming most hypotheses are null).


Side note: Looking at the denominator, we know

\[\begin{equation} \color{gray}{f(p,q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1.} \end{equation}\]

If we assume that association is rare then \(\pi_1\approx 0\), which leaves

\[\begin{equation} \color{gray}{f(p,q)\approx f_0(p,q)} \end{equation}\]

which would cancel with the numerator in \(g\) and give \(g\approx 1\) everywhere, so presumably this approximation cannot be applied to the denominator (perhaps because \(f_1(p,q)\) is very large for small \(p\), so \(\pi_1 f_1\) is not negligible exactly where the rejection regions sit)?


James suggests that the denominator can be estimated by fitting a kernel density estimate (KDE) to the P values within each \(Q\) group, giving

\[\begin{equation} \hat{f}(p,q) = \left\{\begin{array}{ll}{\hat{f}(p|q=0)\,q_0} & {\text{ if }q = 0} \\ {\hat{f}(p|q=1)\,q_1} & {\text{ if }q = 1}\end{array}\right. \end{equation}\]

So that,

\[\begin{equation} g(p,q)=\dfrac{f_0(p,q)}{f(p,q)} \approx \hat{g}(p,q) = \left\{\begin{array}{ll}{\dfrac{q_0}{\hat{f}(p,q)}} & {\text{ if }q = 0} \\ {\dfrac{q_1}{\hat{f}(p,q)}} & {\text{ if }q = 1}\end{array}\right. \end{equation}\]

I’m not entirely sure how KDE works but surely this gives

\[\begin{equation} \hat{g}(p,q) = \left\{\begin{array}{ll}{\dfrac{q_0}{\hat{f}(p,q)}=\dfrac{q_0}{\hat{f}(p|q=0)q_0}=\dfrac{1}{\hat{f}(p|q=0)}} & {\text{ if }q = 0} \\ {\dfrac{q_1}{\hat{f}(p,q)}=\dfrac{q_1}{\hat{f}(p|q=1)q_1}=\dfrac{1}{\hat{f}(p|q=1)}} & {\text{ if }q = 1}\end{array}\right. \end{equation}\]

He then suggests that if \(q[i]=1\) for the \((p[i],q[i])\) pair whose \(v_i\) we are trying to find, then we should set \(p_1=p[i]\) and solve

\[\begin{equation} \hat{g}(p_0,0)=\hat{g}(p[i],1) \end{equation}\]

for \(p_0\). I.e. solve,

\[\begin{equation} \begin{aligned} \hat{f}(p_0|q=0)=\hat{f}(p[i]|q=1) \end{aligned} \end{equation}\]

Similarly, if \(q[i]=0\) he suggests setting \(p_0=p[i]\) and solving

\[\begin{equation} \hat{g}(p[i],0)=\hat{g}(p_1,1) \end{equation}\]

to find \(p_1\). I.e. solve,

\[\begin{equation} \begin{aligned} \hat{f}(p[i]|q=0)=\hat{f}(p_1|q=1) \end{aligned} \end{equation}\]


Option 2: cFDR (CDF) approach

Rather than estimating \(f\) directly, we can follow the cFDR approach. This approach is again concerned with choosing \(p_0\) and \(p_1\) that satisfy

\[\begin{equation} \dfrac{f_0(p_0,0)}{f(p_0,0)}=\dfrac{f_0(p_1,1)}{f(p_1,1)}. \end{equation}\]

Approximate,

\[\begin{equation} \begin{split} \dfrac{f_0(p,q)}{f(p,q)} &= \dfrac{f(P=p, Q=q|H_0)}{f(P=p, Q=q)} \\ & \approx \dfrac{Pr(P\leq p, Q=q|H_0)}{Pr(P\leq p, Q=q)} \\ & \approx \dfrac{Pr(P\leq p| Q=q, H_0)Pr(Q=q|H_0)}{Pr(P\leq p|Q=q)Pr(Q=q)} \\ & \approx \dfrac{p Pr(Q=q|H_0)}{Pr(P\leq p|Q=q)\times(Pr(Q=q|H_0)Pr(H_0)+Pr(Q=q|H_1)Pr(H_1))} \end{split} \end{equation}\]

Note that,

\[\begin{equation} Pr(Q=q|H_0) = \left\{\begin{array}{ll}{q_0} & {\text{ if }q = 0} \\ {q_1} & {\text{ if }q = 1}\end{array}\right. \end{equation}\]

so we can use our empirical estimates of these. Moreover, if we assume that association is rare, then \(Pr(H_0)\approx 1\) and \(Pr(H_1)\approx 0\), so the denominator reduces to \(Pr(P\leq p|Q=0)\times q_0\) if \(q=0\), or \(Pr(P\leq p|Q=1)\times q_1\) if \(q=1\), where the conditional probability \(Pr(P\leq p|Q=q)\) can be estimated empirically by counting.

Set \(\hat{h}(p,q)\) equal to this approximation and note that

\[\begin{equation} \hat{h}(p,q) = \left\{\begin{array}{ll}{\dfrac{p}{\hat{Pr}(P\leq p|Q=0)}} & {\text{ if }q = 0} \\ {\dfrac{p}{\hat{Pr}(P\leq p|Q=1)}} & {\text{ if }q = 1}\end{array}\right. \end{equation}\]

Again, if \(q[i]=1\) then we set \(p_1=p[i]\) and solve \(\hat{h}(p_0,0)=\hat{h}(p[i],1)\) for \(p_0\), i.e.

\[\begin{equation} \dfrac{p_0}{\hat{Pr}(P\leq p_0|Q=0)}=\dfrac{p[i]}{\hat{Pr}(P\leq p[i]|Q=1)} \end{equation}\]

and if \(q[i]=0\) then we set \(p_0=p[i]\) and solve \(\hat{h}(p[i],0)=\hat{h}(p_1,1)\) for \(p_1\), i.e.

\[\begin{equation} \dfrac{p[i]}{\hat{Pr}(P\leq p[i]|Q=0)}=\dfrac{p_1}{\hat{Pr}(P\leq p_1|Q=1)} \end{equation}\]

but how would I solve this for \(p_0\) and \(p_1\)?
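One possibility, sketched below using the `data` object constructed in the Data section further down (and ignoring the leave-one-out detail for clarity), is a simple search: evaluate \(\hat{h}\) at the observed P values in the opposite \(Q\) group and take the candidate whose value is closest to the target. Whether this is how James intends it to be solved, and how to handle multiple near-solutions, I'm not sure.

# Sketch of a possible grid/search solution for h-hat(p0, 0) = h-hat(p[i], 1)

# empirical estimate of Pr(P <= t | Q = qval)
cdf_hat <- function(t, p, q, qval) mean(p[q == qval] <= t)

# h-hat(t, qval) = t / Pr-hat(P <= t | Q = qval)
h_hat <- function(t, p, q, qval) t / cdf_hat(t, p, q, qval)

# example: a SNP with q[i] = 1, so p1 = p[i] and we search for p0
i      <- which(data$q == 1)[1]
target <- h_hat(data$p[i], data$p, data$q, qval = 1)

cands  <- sort(data$p[data$q == 0])                            # candidate thresholds for p0
h_vals <- sapply(cands, h_hat, p = data$p, q = data$q, qval = 0)

p1 <- data$p[i]
p0 <- cands[which.min(abs(h_vals - target))]                   # candidate closest to the target ratio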


3. Integrate the null distribution over this region to get \(v\).

\[\begin{equation} \begin{split} \int_{L(p_0,p_1)}df_0 & = Pr(P,Q \in L(p_0, p_1)|H_0) \\ & = Pr((P\leq p_0, Q=0) \cup (P\leq p_1, Q=1) | H_0) \\ & = Pr(P\leq p_0, Q=0|H_0) + Pr(P\leq p_1, Q=1|H_0) \\ & = Pr(P\leq p_0|Q=0, H_0)Pr(Q=0|H_0) + Pr(P\leq p_1|Q=1, H_0)Pr(Q=1|H_0) \\ & = p_0\times q_0 + p_1\times (1-q_0) \end{split} \end{equation}\]

The last line follows from the fact that under the null the P values are uniform and independent of Q (so that \(Pr(P\leq p_0|Q=0, H_0)=p_0\) and \(Pr(P\leq p_1|Q=1, H_0)=p_1\)). We have \(Pr(Q=0|H_0)=q_0\), which can be estimated by the proportion of SNPs with \(Q=0\) in the data. This means that we only need to find \(p_0\) and \(p_1\), and we can then compute

\[\begin{equation} v_i=p_0\hat{q_0}+p_1(1-\hat{q_0}). \end{equation}\]


4. Data


I have data for 16,000 SNPs (from 39 T1D-associated regions) consisting of P values (and PPs) measuring the strength of evidence of association with T1D, and a binary indicator of whether each SNP overlaps an active chromatin annotation in each of 19 different cell types.

x <- readRDS("binary_annots.RDS")
x[1:5, 1:5]
##             snps         P           PP THYMUS PANISLETS
## 1 imm_7_50868985 0.7040051 3.710506e-10      0         0
## 2 imm_7_50869125 0.6381124 4.490570e-10      0         0
## 3 imm_7_50870347 0.2296590 1.466098e-09      0         0
## 4 imm_7_50870355 0.7335254 9.073471e-10      0         0
## 5 imm_7_50871208 0.9917130 3.460071e-10      0         0

I’ve already investigated using logistic and quantile regression approaches to explore the relationship between P values and active/inactive marks in the 19 cell types. My results were weird though, with active marks in the brain indicative of smaller P values, and active marks in the thymus indicative of larger P values for T1D.
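For reference, those exploratory fits looked roughly like the sketch below (not the exact analysis; one cell type shown, and the `quantreg` package is assumed for the quantile regression).

# Rough sketch of the exploratory models (one cell type; not the exact analysis)
library(quantreg)

x <- readRDS("binary_annots.RDS")

# logistic regression: does -log10(P) predict overlap with an active thymus mark?
fit_logit <- glm(THYMUS ~ I(-log10(P)), family = binomial, data = x)

# median (quantile) regression: does typical -log10(P) differ by annotation?
fit_rq <- rq(I(-log10(P)) ~ THYMUS, tau = 0.5, data = x)

summary(fit_logit)$coefficients
summary(fit_rq)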


For now, let’s just use the binary annotation data for thymus cells for our cFDR approach.

x <- readRDS("binary_annots.RDS")
data <- na.omit(x[,c(1,2,4)])
colnames(data) <- c("snp","p","q")
head(data)
##              snp         p q
## 1 imm_7_50868985 0.7040051 0
## 2 imm_7_50869125 0.6381124 0
## 3 imm_7_50870347 0.2296590 0
## 4 imm_7_50870355 0.7335254 0
## 5 imm_7_50871208 0.9917130 0
## 6 imm_7_50871428 0.2443444 0


5. Implementation (walkthrough example 1)


Aim: Find \(v_1\).

To estimate \(v_i\) we leave out \((p[i], q[i])\), so here I omit \((p[1], q[1])\). I then estimate \(q_0=Pr(Q=0|H_0)\) by the proportion of remaining SNPs with a non-active binary annotation and \(q_1=Pr(Q=1|H_0)\) by the proportion with an active binary annotation.

p <- data$p
q <- data$q
i = 1
data_i <- data[i,]
data_i
##              snp         p q
## 1 imm_7_50868985 0.7040051 0
data_cut <- data[-i,]
q0 <- length(which(data_cut$q==0))/length(data_cut$q)
q1 <- length(which(data_cut$q==1))/length(data_cut$q)
c(q0, q1)
## [1] 0.5080645 0.4919355

Next, I need to find \(p_0\) and \(p_1\). James suggests that since \(q[1]=0\) here, I should set \(p_0=p[1]\) and then solve

\[\begin{equation} \hat{g}(p[1],0)=\hat{g}(p_1,1) \end{equation}\]

to find \(p_1\) (I am using option 1 (pdfs) for now). I.e. solve the following for \(p_1\):

\[\begin{equation} \hat{f}(p[1]|q=0)=\hat{f}(p_1|q=1) \end{equation}\]

First, I visualise the KDEs and check whether the P values look normal; they do not…

par(mfrow=c(1,3))

# P|Q=0
dens_q0 = density(data_cut[which(data_cut$q==0),"p"])
plot(dens_q0, main = "PDF for P | Q = 0")

# P|Q=1
dens_q1 = density(data_cut[which(data_cut$q==1),"p"])
plot(dens_q1, main = "PDF for P | Q = 1")

qqnorm(data$p)
qqline(data$p, col = "steelblue", lwd = 2)

anyway..

p0 <- p[i]
f_pi_q0 <- approx(dens_q0$x, dens_q0$y, xout = p0)$y

# use interpolation to find p1
approxs <- approx(dens_q1$x, dens_q1$y, n = 1000)

# remove P<0

approxs_x <- approxs$x[which(approxs$x>0)]
approxs_y <- approxs$y[which(approxs$x>0)]

p1 <- approxs_x[which(abs(approxs_y-f_pi_q0)==min(abs(approxs_y-f_pi_q0)))]
par(mfrow=c(1,1))
plot(approxs_x, approxs_y, xlab = "P", ylab = "f(P|Q=1)", main = "Approximated PDF for P|Q=1") 
abline(h = f_pi_q0, col = "red")

But the horizontal line hits the curve loads of times? Which crossing should I use?

Finally, I find \(v_i\).

v1 <- (p0*q0) + p1*(1-q0)

data.frame(v = v1, p = p[i], q = q[i])
##           v         p q
## 1 0.5958209 0.7040051 0

6. Implementation (walkthrough example 2)


I do another walkthrough example for a \((p,q)\) pair with \(q=1\), to see whether the adjusted value \(v\) comes out smaller than the raw P value as expected… Hmm, it doesn’t. And note that the target density value (the red line) doesn’t even hit the \(Q=0\) density curve, so there isn’t really a solution!

p <- data$p
q <- data$q
i = 1989
data_i <- data[i,]
data_i
##                 snp            p q
## 2106 imm_5_35918933 0.0001164736 1
data_cut <- data[-i,]
q0 <- length(which(data_cut$q==0))/length(data_cut$q)
q1 <- length(which(data_cut$q==1))/length(data_cut$q)

dens_q0 = density(data_cut[which(data_cut$q==0),"p"])
dens_q1 = density(data_cut[which(data_cut$q==1),"p"])

p1 <- p[i]

f_pi_q1 <- approx(dens_q1$x, dens_q1$y, xout = p1)$y
f_pi_q1
## [1] 4.018477
# use interpolation to find p0
approxs <- approx(dens_q0$x, dens_q0$y, n = 1000)

# remove P<0

approxs_x <- approxs$x[which(approxs$x>0)]
approxs_y <- approxs$y[which(approxs$x>0)]

p0 <- approxs_x[which(abs(approxs_y-f_pi_q1)==min(abs(approxs_y-f_pi_q1)))]

par(mfrow=c(1,1))
plot(approxs_x, approxs_y, xlab = "P", ylab = "f(P|Q=0)", main = "Approximated PDF for P|Q=0", ylim = c(0,5)) 
abline(h = f_pi_q1, col = "red")

v1 <- (p0*q0) + p1*(1-q0)

data.frame(v = v1, p = p[i], q = q[i])
##             v            p q
## 1 0.007197406 0.0001164736 1

7. Full implementation


I now write code to perform this method for all \((p,q)\) value pairs in my data set (P values for T1D and binary annotation for active chromatin in thymus cells).

x <- readRDS("binary_annots.RDS")
data <- na.omit(x[,c(1,2,4)])
colnames(data) <- c("snp","p","q")

v <- vector("numeric", dim(data)[1]) # prepare a container

for(i in 1:nrow(data)){
  pi <- data$p[i]   # P value for SNP i (note: masks the constant pi inside the loop)
  qi <- data$q[i]   # binary annotation for SNP i
  
  # leave SNP i out when estimating the distributions
  data_cut <- data[-i,]
  
  # empirical estimates of Pr(Q=0|H0) and Pr(Q=1|H0)
  q0 <- length(which(data_cut$q==0))/length(data_cut$q)
  q1 <- length(which(data_cut$q==1))/length(data_cut$q)
  
  # KDEs of f(p | q = 0) and f(p | q = 1)
  dens_q0 <- density(data_cut[which(data_cut$q==0),"p"])
  dens_q1 <- density(data_cut[which(data_cut$q==1),"p"])
  
  if(qi == 0){
    p0 <- pi
    
    f_pi_q0 <- approx(dens_q0$x, dens_q0$y, xout = p0)$y
    
    # use interpolation to find p1
    approxs <- approx(dens_q1$x, dens_q1$y, n = 1000)
    
    # remove P < 0
    approxs_x <- approxs$x[which(approxs$x>0)]
    approxs_y <- approxs$y[which(approxs$x>0)]
    
    # grid point where f(p | q = 1) is closest to f(p[i] | q = 0)
    p1 <- approxs_x[which.min(abs(approxs_y-f_pi_q0))]
    
  } else{
    p1 <- pi
    
    f_pi_q1 <- approx(dens_q1$x, dens_q1$y, xout = p1)$y
    
    # use interpolation to find p0
    approxs <- approx(dens_q0$x, dens_q0$y, n = 1000)
    
    # remove P < 0
    approxs_x <- approxs$x[which(approxs$x>0)]
    approxs_y <- approxs$y[which(approxs$x>0)]
    
    # grid point where f(p | q = 0) is closest to f(p[i] | q = 1)
    p0 <- approxs_x[which.min(abs(approxs_y-f_pi_q1))]
  }
  v[i] <- (p0*q0) + p1*(1-q0)
}

Oh dear.. at least it’s pretty.
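A practical note: re-fitting both KDEs inside the loop for every one of ~16,000 SNPs is slow. Since dropping a single SNP out of several thousand barely changes a kernel density estimate, one option would be to fit the two KDEs once and interpolate them with approxfun. The sketch below does this; note it is an approximation to, not exactly, the leave-one-out scheme above.

# Approximate, faster variant: fit each conditional KDE once and reuse it for every i
q0hat <- mean(data$q == 0)

dens_q0 <- density(data$p[data$q == 0])
dens_q1 <- density(data$p[data$q == 1])

# interpolating functions for f(p | q = 0) and f(p | q = 1)
f_q0 <- approxfun(dens_q0$x, dens_q0$y)
f_q1 <- approxfun(dens_q1$x, dens_q1$y)

grid <- seq(1e-6, 1, length.out = 1000)   # candidate thresholds on (0, 1]

v_fast <- sapply(seq_len(nrow(data)), function(i) {
  if (data$q[i] == 0) {
    p0 <- data$p[i]
    p1 <- grid[which.min(abs(f_q1(grid) - f_q0(p0)))]
  } else {
    p1 <- data$p[i]
    p0 <- grid[which.min(abs(f_q0(grid) - f_q1(p1)))]
  }
  p0 * q0hat + p1 * (1 - q0hat)
})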