Aim: Write cFDR parametrically


\[\begin{equation} \begin{split} \hat{cFDR} &= P(H_0^p\text{ | }P\leq p, Q\leq q)\\[2ex] &= \dfrac{P(P\leq p\text{ | }H_0^p, Q\leq q) P(H_0^p\text{ | }Q\leq q)}{P(P\leq p \text{ | }Q\leq q)}\\[2ex] &= \dfrac{p \times P(H_0^p\text{ | }Q\leq q)}{P(P\leq p \text{ | }Q\leq q)}\\[2ex] &= \dfrac{p \times \frac{P(H_0^p) P(Q\leq q\text{ | }H_0^p)}{P(Q\leq q)}}{\frac{P(P\leq p, Q\leq q)}{P(Q\leq q)}}\\[2ex] &= \dfrac{p \times P(H_0^p) P(Q\leq q\text{ | }H_0^p)}{P(P\leq p, Q\leq q)} \end{split} \end{equation}\]

Here, the second line is Bayes' theorem; the third uses \(P(P\leq p\text{ | }H_0^p, Q\leq q)=p\) (P values are uniform under \(H_0^p\), independently of \(Q\)); the fourth applies Bayes' theorem again to \(P(H_0^p\text{ | }Q\leq q)\) and the definition of conditional probability to the denominator.


1. Modelling \(f(Q\text{ | }H_0^p)\)


I model \(f(Q\text{ | }H_0^p) \approx f(Q\text{ | }P>1/2)\) (SNPs with \(P>1/2\) are very unlikely to be associated, so they stand in for the null) and use this to approximate \(P(Q\leq q\text{ | }H_0^p)\) for Q1, Q2 and Q3.


Q1

I use the mclust R package to model \(f(Q1\text{ | }H_0^p) \approx f(Q1\text{ | }P>1/2)\) with a Gaussian mixture model. The selected model has 6 components, whose weights, means and variances are shown in the table below.
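The fitting call itself isn't shown in this write-up; a minimal sketch of how a summary like the one below could be produced (assuming p and q1 are the raw data vectors) is:

library(mclust)

# fit a univariate GMM to Q1 values at SNPs that look null for P (p > 1/2)
q1_mclust <- densityMclust(q1[which(p > 1/2)])
summary(q1_mclust, parameters = TRUE)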

## ------------------------------------------------------- 
## Density estimation via Gaussian finite mixture modeling 
## ------------------------------------------------------- 
## 
## Mclust V (univariate, unequal variance) model with 6 components: 
## 
##  log-likelihood     n df       BIC       ICL
##       -25788.06 49359 17 -51759.84 -79139.49
##       Weight       Mean     Variance
## 1 0.11624221 -0.7104803 0.0004452516
## 2 0.25382864 -0.6299951 0.0034543467
## 3 0.20929539 -0.4406951 0.0185984792
## 4 0.19171657  0.1426794 0.1336937072
## 5 0.08149797  0.8100512 0.0061517932
## 6 0.14741922  1.2848616 0.4350631819

This density estimation can then be used to approximate \(P(Q1\leq q\text{ | }H_0^p)\).

q1_null <- function(x) {
  # P(Q1 <= x | H_0^p): weighted sum of the fitted component Gaussian CDFs
  sum <- 0
  for(i in 1:length(q1_mclust$parameters$mean)){
    sum <- sum + q1_mclust$parameters$pro[i] * pnorm(x, mean = q1_mclust$parameters$mean[i], sd = sqrt(q1_mclust$parameters$variance$sigmasq[i]))
  }
  return(sum)
}

For example, \(P(Q1\leq 0.1\text{ | }H_0^p) \approx 0.672\).

q1_null(0.1)
## [1] 0.6716489

Q2

\(f(Q2\text{ | }H_0^p) \approx f(Q2\text{ | }P>1/2)\) is modelled by 8 Gaussians, with their weights, means and variances shown in the table below.

## ------------------------------------------------------- 
## Density estimation via Gaussian finite mixture modeling 
## ------------------------------------------------------- 
## 
## Mclust V (univariate, unequal variance) model with 8 components: 
## 
##  log-likelihood     n df       BIC    ICL
##       -22538.47 49359 23 -45325.49 -82751
##       Weight       Mean     Variance
## 1 0.03014613 -1.7862069 6.277092e-05
## 2 0.01623277 -1.6705778 4.426572e-04
## 3 0.04362900 -1.4327442 2.665499e-02
## 4 0.33556434  0.1584163 1.404298e-02
## 5 0.04483450  0.2047153 2.681079e-04
## 6 0.17355507  0.2645235 1.848158e-03
## 7 0.34488972  0.0687391 5.302849e-01
## 8 0.01114846  2.0400943 1.431901e-02

This density estimation can then be used to approximate \(P(Q2\leq q\text{ | }H_0^p)\).

q2_null <- function(x) {
  # P(Q2 <= x | H_0^p): weighted sum of the fitted component Gaussian CDFs
  sum <- 0
  for(i in 1:length(q2_mclust$parameters$mean)){
    sum <- sum + q2_mclust$parameters$pro[i] * pnorm(x, mean = q2_mclust$parameters$mean[i], sd = sqrt(q2_mclust$parameters$variance$sigmasq[i]))
  }
  return(sum)
}

For example, \(P(Q2\leq 0.1\text{ | }H_0^p) \approx 0.374\).

q2_null(0.1)
## [1] 0.3727371

Q3

\(f(Q3\text{ | }H_0^p) \approx f(Q3\text{ | }P>1/2)\) is modelled by 5 Gaussians, with their weights, means and variances shown in the table below.

## ------------------------------------------------------- 
## Density estimation via Gaussian finite mixture modeling 
## ------------------------------------------------------- 
## 
## Mclust V (univariate, unequal variance) model with 5 components: 
## 
##  log-likelihood     n df       BIC       ICL
##       -18809.26 49359 14 -37769.81 -82026.24
##       Weight        Mean    Variance
## 1 0.01596818 -1.95645661 0.240837277
## 2 0.32882455 -0.22453928 0.016075237
## 3 0.16677963 -0.04297176 0.005651741
## 4 0.14404266  0.06747826 0.014848945
## 5 0.34438497  0.26579276 0.285519094

This density estimation can then be used to approximate \(P(Q3\leq q\text{ | }H_0^p)\).

q3_null <- function(x) {
  # P(Q3 <= x | H_0^p): weighted sum of the fitted component Gaussian CDFs
  sum <- 0
  for(i in 1:length(q3_mclust$parameters$mean)){
    sum <- sum + q3_mclust$parameters$pro[i] * pnorm(x, mean = q3_mclust$parameters$mean[i], sd = sqrt(q3_mclust$parameters$variance$sigmasq[i]))
  }
  return(sum)
}

For example, \(P(Q3\leq 0.1\text{ | }H_0^p) \approx 0.722\).

q3_null(0.1)
## [1] 0.7224955

I’ve checked these against the empirical values (obtained by counting), and the probabilities look reasonable.
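For reference, a minimal sketch of that counting check (assuming p, q1, q2 and q3 are the raw data vectors):

# empirical P(Q <= 0.1 | H_0^p), taking p > 1/2 as approximately null
mean(q1[p > 1/2] <= 0.1) # compare with q1_null(0.1) ~ 0.672
mean(q2[p > 1/2] <= 0.1) # compare with q2_null(0.1) ~ 0.373
mean(q3[p > 1/2] <= 0.1) # compare with q3_null(0.1) ~ 0.722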


2. Modelling \(f(P, Q)\)


I use a bivariate mixture of normals to model \(f(Z_P, Q1)\) and then use this to approximate \(P(P\leq p, Q\leq q)\). Note that I convert the P values to Z scores before fitting: P values are (approximately) uniformly distributed, so it makes little sense to model them with a GMM on the P value scale, whereas the Z scale is well suited to it.

df_pq <- data.frame("z" = -qnorm(p/2), "q1" = q1)

# joint <- densityMclust(df_pq)

joint <- readRDS("zq_joint.RDS")
summary(joint, parameters = TRUE)
## ------------------------------------------------------- 
## Density estimation via Gaussian finite mixture modeling 
## ------------------------------------------------------- 
## 
## Mclust VVI (diagonal, varying volume and shape) model with 9 components: 
## 
##  log-likelihood      n df       BIC       ICL
##       -223412.4 121879 44 -447340.1 -526574.5
## 
## Mixing probabilities:
##          1          2          3          4          5          6 
## 0.05023858 0.09498104 0.05950001 0.12375738 0.18296299 0.12167108 
##          7          8          9 
## 0.04682377 0.14927248 0.17079268 
## 
## Means:
##         [,1]       [,2]     [,3]      [,4]      [,5]       [,6]      [,7]
## z  4.3189527  0.2274909 1.213810 0.3502670 1.4406614  1.9182004 0.9496216
## q1 0.5008692 -0.6315374 1.816454 0.5592739 0.3556564 -0.5668422 0.8231415
##          [,8]       [,9]
## z   0.6797708  0.9382327
## q1 -0.4162857 -0.6732045
## 
## Variances:
## [,,1]
##           z        q1
## z  6.638785 0.0000000
## q1 0.000000 0.6277607
## [,,2]
##             z          q1
## z  0.02130109 0.000000000
## q1 0.00000000 0.005886361
## [,,3]
##            z        q1
## z  0.5789952 0.0000000
## q1 0.0000000 0.2186955
## [,,4]
##             z        q1
## z  0.04845419 0.0000000
## q1 0.00000000 0.3760025
## [,,5]
##            z        q1
## z  0.4883255 0.0000000
## q1 0.0000000 0.2399573
## [,,6]
##            z        q1
## z  0.7846349 0.0000000
## q1 0.0000000 0.0119245
## [,,7]
##            z          q1
## z  0.3203132 0.000000000
## q1 0.0000000 0.002147366
## [,,8]
##            z        q1
## z  0.1450473 0.0000000
## q1 0.0000000 0.0263504
## [,,9]
##            z          q1
## z  0.2351207 0.000000000
## q1 0.0000000 0.002233113
# plot(joint, what = "density", data = df_pq,
#      drawlabels = FALSE, points.pch = 20)
plot(joint, what = "density", type = "hdr")

plot(joint, what = "density", type = "persp")

library(mvtnorm) # for pmvnorm

test_prob <- function(z0, q0) {
  # P(Z >= z0, Q1 <= q0) under the fitted bivariate GMM,
  # i.e. approximately P(P <= 2*pnorm(-z0), Q1 <= q0)
  sum <- 0
  for(i in 1:ncol(joint$parameters$mean)){
    sum <- sum + (joint$parameters$pro[i] * pmvnorm(lower = c(z0, -Inf), upper = c(Inf, q0), mean = joint$parameters$mean[,i], sigma = joint$parameters$variance$sigma[, , i]))[1]
  }
  return(sum)
}

test_prob(z0 = 3, q0 = 2)
## [1] 0.05015193
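Putting sections 1 and 2 together gives a parametric version of the last line of the derivation at the top. A minimal sketch of my own (assuming \(P(H_0^p)\approx 1\), a term I haven't estimated; the section 4 approximation below also omits it), using q1_null and test_prob from above:

cfdr_param <- function(p0, q0) {
  z0 <- -qnorm(p0/2) # convert P value to Z score
  # p * P(Q1 <= q | H_0^p) / P(P <= p, Q1 <= q), taking P(H_0^p) ~ 1
  p0 * q1_null(q0) / test_prob(z0, q0)
}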

3. Understanding James’ code


I begin by simulating some data and setting the parameter values.

set.seed(1)                                           # ensure reproducibility
n=10000; n1p=100; n1pq=100; n1q=100                   # parameters
zp=c(rnorm(n1p,sd=3), rnorm(n1q,sd=1),
     rnorm(n1pq,sd=3), rnorm(n-n1p-n1q-n1pq,sd=1))    # simulate z-scores corresponding to p
zq=c(rnorm(n1p,sd=1), rnorm(n1q,sd=3), 
     rnorm(n1pq,sd=3), rnorm(n-n1p-n1q-n1pq,sd=1))    # simulate z-scores corresponding to q

p=2*pnorm(-abs(zp)); q=2*pnorm(-abs(zq))              # convert to p-values

fold_id=(1:n) %% 3 # chr for each p,q pair

candidate_indices=which(p<0.01 | q< 0.001)

fold1=which(fold_id==1) # which p,q pair on chr 1

ind1=intersect(candidate_indices,fold1) # calculate coords for indices on chr 1

indices=ind1
mode=2 # specifies leave-one-out method
fold=fold1 # parameter fold specifies indices to leave out 
nt = 5000 # number of test points in x-direction (p)
nv = 1000 # resolution for constructing L-curves (y=q)
p_threshold = 0

adj = FALSE # no adjustment factor applied to ccut ('correct' = 1 throughout)

gx = 10^-5 # small gradient used as a tiebreak, making the cfdr curves strictly increasing

closed = TRUE

I then follow James’ code to obtain the cfdr values for each \((p_i,q_i)\) pair, i.e. ccut_i=cfdr(zp_i, zq_i). He does this by:

  1. Computing cfsub_i-gx*xtest + gx*mx (a strictly increasing version of the cfdr curve for fixed q).

  2. Reading off the corresponding cfsub_i-gx*xtest + gx*mx (y) value at zp=zp[indices[i]] by linear interpolation (approx). This value is named ccut.

He states: “Set ccut NOT equal to cfdr at the points; needs to be adjusted since an additional test point is used in determination of L”; I'm not sure what this means.

zp=-qnorm(p/2); zq=-qnorm(q/2) # convert to Z scores

if (any(!is.finite(zp+zq))) stop("P-values p,q must be in [1e-300,1]")

mx=max(c(10,abs(-qnorm(p/2)))) 
my=max(c(10,abs(-qnorm(q/2)))) # maximum limits for integration

if (is.null(indices)) {
  if (is.null(at)) stop("One of the parameters 'indices', 'at' must be set")
  ccut=at
  mode=0
} else ccut=rep(0,length(indices))

yval2=seq(0,my,length.out=nv+1)[1:nv] # equally spaced values

xval2=outer(rep(1,length(ccut)),yval2) # yval2 repeated for each ind1

pval2=2*pnorm(-yval2) # convert yval2 to p vals

xtest=seq(0,mx,length.out=nt) # Z value scale
ptest=2*pnorm(-xtest) # P value scale

ccut=rep(0,length(indices))

for (i in 1:length(indices)) {
        w=which(zq[-fold] >= zq[indices[i]])
        if (length(w) >= 1) {
          cfsub= (1+(1/length(w)))*ptest/(1+(1/length(w))-ecdf(zp[-fold][w])(xtest)); cfsub=cummin(cfsub)
          ccut[i]=approx(xtest,cfsub-gx*xtest + gx*mx,zp[indices[i]],rule=2)$y
        } else ccut[i]=p[indices[i]]
      }

An example:

w_ex=which(zq[-fold] >= zq[indices[50]])
cfsub_ex= (1+(1/length(w_ex)))*ptest/(1+(1/length(w_ex))-ecdf(zp[-fold][w_ex])(xtest))
cfsub_ex=cummin(cfsub_ex)
ccut_1=approx(xtest,cfsub_ex-gx*xtest + gx*mx,zp[indices[50]],rule=2)$y

par(mfrow = c(1,1))
plot(xtest,cfsub_ex-gx*xtest + gx*mx, xlab = "Z values (xtest)", main = "Example for (p[50],q[50]) pair")
abline(v = zp[indices[50]], lty = "dashed")
text(x = 5, y = 0.5, "x = zp[50] = 3.06")
abline(h = ccut_1, lty = "dashed")
text(x = 1, y = 0.25, "y = ccut[50] = 0.229")

Note that James’ cfsub function is slightly odd: it uses 1+(1/length(w)) (which is usually approximately 1) and approximates \(P(P\leq p\text{ | }Q\leq q)\) using an ecdf on the Z scale. Subtracting the ecdf values from 1 (via the \(1+1/|w|\) term in the denominator) gets back to the P value orientation. His approximation is therefore,

\[\begin{equation} \begin{split} \text{cfsub} &= \dfrac{\left(1+\dfrac{1}{|w|}\right)\times ptest}{\left(1+\dfrac{1}{|w|}\right)-ecdf_{zp\text{ | }zq\geq zq[1]}(xtest)}\\[2ex] &\approx \dfrac{ptest}{ecdf_{p\text{ | }q\leq q[1]}(ptest)}. \end{split} \end{equation}\]
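Making the approximation explicit: for large \(|w|\) the \(1/|w|\) terms vanish, and on the subset \(zq \geq zq[1]\) (equivalently \(q\leq q[1]\)),

\[\begin{equation} 1 - ecdf_{zp\text{ | }zq\geq zq[1]}(xtest) \approx P(zp > xtest\text{ | }q\leq q[1]) = P(p \leq ptest\text{ | }q\leq q[1]) \approx ecdf_{p\text{ | }q\leq q[1]}(ptest), \end{equation}\]

since \(zp > xtest\) if and only if \(p < 2\Phi(-xtest) = ptest\).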

ecdf_z <- ecdf(zp[-fold][w])(xtest) # P(zp <= x) on the subset
ecdf_p <- ecdf(p[-fold][w])(ptest)  # P(p <= ptest) on the subset

par(mfrow = c(1,2))
plot(ecdf_z, ecdf_p)
plot(1-ecdf_z, ecdf_p) # these should lie close to the identity line


Then, for each \((p_i,q_i)\) pair, he finds the xval2[j] value such that cfdr(xval2[j],yval2[j])=ccut_i.

ccut=ccut*(1+ 1e-6) # ccut + 1e-8 # prevent floating-point comparison errors

if (mode==2) {
    
    if (adj) {
      correct=cummin((1+ecdf(q[-fold][which(p[-fold]>0.5)])(pval2)*length(p[-fold]))/
          (1+ ecdf(q[-fold])(pval2)*length(p[-fold])))
      if (!is.null(indices)) correct_ccut=approx(pval2,correct,q[indices],rule=2)$y #cummin((1+ecdf(pc[which(p>0.5)])(pc)*length(p))/(1+ rank(pc)))
    } else {
      correct=rep(1,length(pval2)) # adjustment factor for pc[i]
      correct_ccut=rep(1,length(ccut))
    }
    if (!is.null(indices)) ccut=ccut*correct_ccut
    
    for (i in 1:length(yval2)) {
      w=which(zq[-fold] > yval2[i])
      if (length(w) >= 1 ) {
        cfsub= (1+(1/length(w)))*ptest/(1+(1/length(w))-ecdf(zp[-fold][w])(xtest)); cfsub=cummin(cfsub)
        xval2[,i]=approx((cfsub-gx*xtest + gx*mx)*correct[i],xtest,ccut,rule=2,f=1)$y 
      }  else xval2[,i]=-qnorm((ccut/correct_ccut)/2)
    }
    
  } 

An example:

test_xval <- rep(0, length(yval2))

for (i in 1:length(yval2)) {
      w=which(zq[-fold] > yval2[i])
      if (length(w) >= 1 ) {
        cfsub= (1+(1/length(w)))*ptest/(1+(1/length(w))-ecdf(zp[-fold][w])(xtest)); cfsub=cummin(cfsub)
        test_xval[i]=approx((cfsub-gx*xtest + gx*mx)*correct[i],xtest,ccut[50],rule=2,f=1)$y 
      }  else test_xval[i]=-qnorm((ccut[50]/correct_ccut[50])/2)
      }

plot((cfsub-gx*xtest + gx*mx)*correct[i],xtest, main = "Example for (p[50],q[50]) pair", ylab = "Z values (xtest)")
abline(v = ccut[50], lty = "dashed")
text(x = 0.4, y = 6, "x = ccut[50] = 0.229")
abline(h = test_xval[i], lty = "dashed")
text(x = 0.1, y = 3, "y = xval2[i] = 1.2")


Overall, it seems that James:

  1. Finds ccut_i=cfdr(p_i,q_i) for each \((p_i, q_i)\) pair.

  2. For each of the nv equally spaced yval2[j] values, finds the xval2[i,j] value such that cfdr(xval2[i,j], yval2[j]) = ccut_i.


4. Amending James’ function


My cfsub approximation is,

\[\begin{equation} \text{cfsub}_{\text{anna}} = \dfrac{ptest \times P(q\leq q[1]\text{ | }p>1/2)}{P(p\leq ptest,\ q\leq q[1])} \end{equation}\]

where \(P(q\leq q[1]\text{ | }p>1/2)\) is approximated using a univariate GMM and \(P(p\leq ptest,\ q\leq q[1])\) using a bivariate GMM.

# GMM for f(ZQ | H_0^p), fitted to zq at SNPs with p > 1/2
zq_h0 <- densityMclust(zq[-fold][which(p[-fold]>1/2)])

# evaluate P(ZQ > x | H_0^p), i.e. P(Q <= 2*pnorm(-x) | H_0^p)
zq_h0_func <- function(x) {
  sum <- 0
  for(i in 1:length(zq_h0$parameters$mean)){
    sum <- sum + zq_h0$parameters$pro[i] * pnorm(x, mean = zq_h0$parameters$mean[i], sd = sqrt(zq_h0$parameters$variance$sigmasq[i]), lower.tail = FALSE) # lower.tail = FALSE gives P(Z > z)
  }
  return(sum)
}

par(mfrow=c(1,1))
plot(zq_h0, what = "density", data = zq[-fold][which(p[-fold]>1/2)], breaks = 100, xlab = "ZQ | P>1/2")

x <- seq(0, 7, length=100)
colors <- c("red", "blue", "darkgreen", "gold", "black")

# overlay the weighted component densities on the fitted density plot
for (i in 1:length(zq_h0$parameters$mean)){
  lines(x, zq_h0$parameters$pro[i] * dnorm(x, mean = zq_h0$parameters$mean[i], sd = sqrt(zq_h0$parameters$variance$sigmasq[i])), lwd=2, col=colors[i])
}

zq_h0_func(0.8)
## [1] 0.4135718
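A quick empirical check of this (a sketch; not in the original):

# empirical counterpart of zq_h0_func(0.8)
mean(zq[-fold][which(p[-fold] > 1/2)] > 0.8)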
# Bivariate GMM for f(ZP, ZQ)

df_zpzq <- data.frame("zp" = zp[-fold], "zq" = zq[-fold])

joint <- densityMclust(df_zpzq)

# evaluate P(ZP > zp0, ZQ > zq0), i.e. approximately P(P <= p0, Q <= q0)
joint_func <- function(zp0, zq0) {
  sum <- 0
  for(i in 1:ncol(joint$parameters$mean)){
    sum <- sum + (joint$parameters$pro[i] * pmvnorm(lower = c(zp0, zq0), upper = c(Inf, Inf), mean = joint$parameters$mean[,i], sigma = joint$parameters$variance$sigma[, , i]))[1]
  }
  return(sum)
}
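As with the marginal fit, the joint GMM can be sanity-checked against empirical counts (a sketch):

joint_func(1, 1) # parametric estimate of P(ZP > 1, ZQ > 1)
mean(zp[-fold] > 1 & zq[-fold] > 1) # empirical counterpart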

ccut_me=rep(0,length(indices))

i = 50

cfsub_me_orig <- (ptest * zq_h0_func(zq[indices[i]]))/sapply(xtest, function(x) joint_func(x, zq[indices[i]]))

ccut_me=approx(xtest,cfsub_me_orig-gx*xtest + gx*mx,zp[indices[50]],rule=2)$y

# can rewrite the joint_func to take a vector as input later
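One way to do that (a sketch using base R's Vectorize; this only tidies the call, it doesn't speed it up):

joint_func_vec <- Vectorize(joint_func, vectorize.args = "zp0")
# cfsub_me_orig <- (ptest * zq_h0_func(zq[indices[i]])) / joint_func_vec(xtest, zq[indices[i]])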

I compare his results so far (black) to mine (red); note that I now work on the P value scale.

w=which(zq[-fold] >= zq[indices[50]])
cfsub_james_orig <- (1+(1/length(w)))*ptest/(1+(1/length(w))-ecdf(zp[-fold][w])(xtest))

par(mfrow=c(1,3))
plot(ptest, cfsub_james_orig, pch = 16, cex = 0.5, xlab = "P", ylab = "cfdr", main = "Original")
points(ptest, cfsub_me_orig, col = "red", pch = 16, cex = 0.5)

cfsub_james=cummin(cfsub_james_orig)
cfsub_me=cummin(cfsub_me_orig)

plot(ptest, cfsub_james, pch = 16, cex = 0.5, xlab = "P", ylab = "cfdr", main = "Non-decreasing")
points(ptest, cfsub_me, col = "red", pch = 16, cex = 0.5)

plot(ptest, cfsub_james-gx*ptest + gx*mx, pch = 16, cex = 0.5, xlab = "P", ylab = "cfdr", main = "Strictly increasing")
points(ptest, cfsub_me-gx*ptest + gx*mx, col = "red", pch = 16, cex = 0.5)


par(mfrow=c(1,3))
plot(cfsub_me_orig, cfsub_james_orig, main = "Original", xlab = "Me", ylab = "James")
abline(0,1)
plot(cfsub_me, cfsub_james, main = "Non-decreasing", xlab = "Me", ylab = "James")
abline(0,1)
plot(cfsub_me-gx*ptest + gx*mx, cfsub_james-gx*ptest + gx*mx, main = "Strictly increasing", xlab = "Me", ylab = "James")
abline(0,1)


The difference in ccut values for the 50th (p,q) pair between my method and James’ is shown below:


Hmm, something is going wrong…

Recall that his cfdr approximation is:

\[\begin{equation} \begin{split} \text{cfsub} &= \dfrac{\left(1+\dfrac{1}{|w|}\right)\times ptest}{\left(1+\dfrac{1}{|w|}\right)-ecdf_{zp\text{ | }zq\geq zq[1]}(xtest)}\\[2ex] &\approx \dfrac{ptest}{ecdf_{p\text{ | }q\leq q[1]}(ptest)}. \end{split} \end{equation}\]

and mine is:

\[\begin{equation} \text{cfsub}_{\text{anna}} = \dfrac{ptest \times P(q\leq q[1]\text{ | }p>1/2)}{P(p\leq ptest,\ q\leq q[1])} \end{equation}\]


Comparing the two: James’ cfsub amounts to \(ptest \times P(q\leq q[1])/P(p\leq ptest,\ q\leq q[1])\), whereas mine uses \(P(q\leq q[1]\text{ | }p>1/2)\) in the numerator, so the two should differ by roughly the ratio of these probabilities. I write functions to obtain ccut values using (i) James’ original method (on the Z scale), (ii) James’ method on the P value scale (just to check that these give the same results as on the Z scale; they do) and (iii) my method.

james_ccut_z <- function(p,q,adj=TRUE,indices=NULL,mode=0,fold=NULL,nt=5000, nv=1000, gx=10^-5){
  
  zp = -qnorm(p/2)
  zq = -qnorm(q/2)
  mx = max(c(10,abs(-qnorm(p/2))))
  my = max(c(10,abs(-qnorm(q/2)))) # maximum limits for integration
  
  #gx=0 #1/1000  # use for 'tiebreaks'- if a point is on a curve with nonzero area, force the L-curve through that point
  
  ccut = rep(0,length(indices))
  
  xtest = seq(0,mx,length.out=nt)
  ptest = 2*pnorm(-xtest)
  
  if (mode==2) {
    ccut=rep(0,length(indices))
    for (i in 1:length(indices)) {
      w=which(zq[-fold] >= zq[indices[i]])
      if (length(w) >= 1) {
        cfsub= (1+(1/length(w)))*ptest/(1+(1/length(w))-ecdf(zp[-fold][w])(xtest)); cfsub=cummin(cfsub)
        ccut[i]=approx(xtest,cfsub-gx*xtest + gx*mx,zp[indices[i]],rule=2)$y
      } else ccut[i]=p[indices[i]]
    }
  }
  return(ccut)
}

james_ccut_p <- function(p,q,adj=TRUE,indices=NULL,mode=0,fold=NULL,nt=5000, nv=1000, gx=10^-5){
  
  zp = -qnorm(p/2)
  zq = -qnorm(q/2)
  mx = max(c(10,abs(-qnorm(p/2))))
  my = max(c(10,abs(-qnorm(q/2)))) # maximum limits for integration
  
  #gx=0 #1/1000  # use for 'tiebreaks'- if a point is on a curve with nonzero area, force the L-curve through that point
  
  ccut = rep(0,length(indices))
  
  xtest = seq(0,mx,length.out=nt)
  ptest = 2*pnorm(-xtest)
  
  if (mode==2) {
    ccut=rep(0,length(indices))
    for (i in 1:length(indices)) {
      w=which(zq[-fold] >= zq[indices[i]])
      if (length(w) >= 1) {
        cfsub= (1+(1/length(w)))*ptest/(1+(1/length(w))-ecdf(zp[-fold][w])(xtest)); cfsub=cummin(cfsub)
        ccut[i]=approx(ptest,cfsub-gx*ptest + gx*mx,p[indices[i]],rule=2)$y
      } else ccut[i]=p[indices[i]]
    }
  }
  return(ccut)
}

james_ccut_z_vals <- james_ccut_z(p, q, adj = FALSE, indices = ind1, mode = 2, fold = fold1)
james_ccut_p_test <- james_ccut_p(p, q, adj = FALSE, indices = ind1, mode = 2, fold = fold1)

par(mfrow=c(1,1))
plot(james_ccut_z_vals, james_ccut_p_test, main = "Comparing ccut values on the P vs Z scale")
abline(0,1)

###########################

anna_ccut <- function(p,q,adj=TRUE,indices=NULL,mode=0,fold=NULL,nt=5000, nv=1000, gx=10^-5){
  
  zp = -qnorm(p/2)
  zq = -qnorm(q/2)
  mx = max(c(10,abs(-qnorm(p/2))))
  
  #gx=0 #1/1000  # use for 'tiebreaks'- if a point is on a curve with nonzero area, force the L-curve through that point
  
  ccut = rep(0,length(indices))
  
  xtest = seq(0,mx,length.out=nt)
  ptest = 2*pnorm(-xtest)
  
  # GMM for P(Q | H_0)
  zq_h0 <- densityMclust(zq[-fold][which(p[-fold]>1/2)])
  
  # equation to evaluate P(Q < q| H_0)
  zq_h0_func <- function(x) {
    sum <- 0
    for(i in 1:length(zq_h0$parameters$mean)){
      sum <- sum + zq_h0$parameters$pro[i] * pnorm(x,mean = zq_h0$parameters$mean[i], sd = sqrt(zq_h0$parameters$variance$sigmasq[i]), lower.tail = FALSE)
    }
    return(sum)
  }
  
  joint <- densityMclust(data.frame("zp" = zp[-fold], "zq" = zq[-fold]))
  
  joint_func <- function(zp0, zq0) {
    sum <- 0
    for(i in 1:ncol(joint$parameters$mean)){
      sum <- sum + (joint$parameters$pro[i] * pmvnorm(lower = c(zp0, zq0), upper = c(Inf, Inf), mean = joint$parameters$mean[,i], sigma = joint$parameters$variance$sigma[, , i]))[1]
    }
    return(sum)
  }
  
  if (!is.null(indices)) { # set ccut. NOT equal to cfdr at the points; needs to be adjusted since an additional test point is used in determination of L
    if (mode==2) {
      ccut=rep(0,length(indices))
      for (i in 1:length(indices)) {
        w=which(zq[-fold] >= zq[indices[i]])
        if (length(w) >= 1) {
          cfsub = (ptest * zq_h0_func(zq[indices[i]]))/sapply(xtest, function(x) joint_func(x, zq[indices[i]]))
          cfsub =cummin(cfsub)
          ccut[i] = approx(xtest,cfsub-gx*xtest + gx*mx,zp[indices[i]],rule=2)$y
        } else ccut[i] = p[indices[i]]
      }
    }
  }  
  return(ccut)
}

# anna_ccut is slow (one pmvnorm evaluation per (xtest, index) pair), so the result is cached:
# anna_ccut_vals <- anna_ccut(p, q, adj = FALSE, indices = ind1, mode = 2, fold = fold1)

anna_ccut_vals <- readRDS("anna_ccut_vals.RDS")

par(mfrow=c(1,1))
plot(anna_ccut_vals, james_ccut_z_vals, xlab = "anna's ccut values", ylab = "james' ccut values", main = "ccut values")
abline(0,1)