James’ vlyl()
function uses KDEs to approximate the cfdr and find L curves.
This is done in a two-step approach:
Find the ccut
values where ccut[i]=cfdr(p[i],q[i])
(fixing q to generate the cfdr curves)
Find the xval2
values where cfdr(xval2[i,j], yval2[i,j])=ccut[i]
(fixing yval2 to generate cfdr curves)
cfdr is estimated by cgrid
and this is interpolated to find ccut
and xval2
.
cgrid
estimates the adjusted cfdr by
\[\begin{equation} \begin{split} \hat{cFDR} &= P(H_0^p\text{ | }P\leq p, Q\leq q)\\[2ex] &= \dfrac{P(P\leq p\text{ | }H_0^p, Q\leq q) P(H_0^p\text{ | }Q\leq q)}{P(P\leq p \text{ | }Q\leq q)}\\[2ex] &= \dfrac{P(P\leq p\text{ | }H_0^p) \times \dfrac{P(Q\leq q\text{ | }H_0^p) P(H_0^p)}{P(Q\leq q)}}{\dfrac{P(P\leq p , Q\leq q)}{P(Q\leq q)}}\\[2ex] &\approx \dfrac{P(P\leq p\text{ | }H_0^p) \times P(Q\leq q\text{ | }H_0^p)}{P(P\leq p, Q\leq q)}\\[2ex] &\approx \dfrac{P(P\leq p, Q\leq q\text{ | }H_0^p)}{P(P\leq p, Q\leq q)}. \end{split} \end{equation}\]
We first estimate \(P(P\leq p, Q\leq q)\) (the denominator) using KDE:
# P,Q
kpq = kde2d(c(zp[sub],zp[sub],-zp[sub],-zp[sub]),c(zq[sub],-zq[sub],-zq[sub],zq[sub]), n=res, lims=lims, h=c(1,1))
kpq$z = apply(kpq$z, 2, function(x) rev(cummax(rev(x)))) # ensures non-increasing
The second line ensures that it is non-increasing - see below for an example:
We then use the independence assumption to estimate \(P(P\leq p, Q\leq q\text{ | }H_0^p)=P(P\leq p|H_0^p) \times P(Q\leq q|H_0^p)\) (the numerator):
# Q | H_0
kq = density(c(zq[sub][which(p[sub]>0.5)],-zq[sub][which(p[sub]>0.5)]), from=lims[1], to=lims[2], n=res)
# P | H_0
px = dnorm(seq(lims[1], lims[2], length.out=res))
# P,Q | H_0
kpq0 = kpq
kpq0$z = outer(px, kq$y) # P|H0 * Q|H0
kpq0$z = kpq0$z*(sum(kpq$z)/sum(kpq0$z)) # ensure sum(kpq$z)=sum(kpq0$z)
kpq$z[which(kpq$z==0)] = min(kpq$z[which(kpq$z>0)])
kpq0$z[which(kpq0$z==0)] = min(kpq0$z[which(kpq0$z>0)])
We then estimate cfdr with the ratio of these two followed by making the cfdr curves non-increasing:
cgrid = kpq
cgrid$z = pmin(kpq0$z/kpq$z,10) # don't need for super high Z
# scores as we know these are associated anyway
cgrid$z=apply(cgrid$z,2,function(x) cummin(x-rev(cummin(rev(x))))) # cfdr value (?)
I look at 3 examples for low ZQ (high Q), medium zQ and high ZQ (low Q). These look ok - we see that the cfdr curves are “lower” low Q values.
The ZQ values that give cfdr curves that exceed 1 are shown below. These are mainly for very high Q values (low ZQ values). Also note that for this to really be a problem (i.e. to give a cfdr value \(>1\)) the corresponding ZP values must be relatively small (i.e. big p values).
cgrid$x[apply(cgrid$z, 2, function(x) any(which(x>1)))]
## [1] 0.00000000 0.06030151 0.12060302 0.18090452 0.24120603
## [6] 0.30150754 0.36180905 0.42211055 0.48241206 0.54271357
## [11] 0.60301508 0.66331658 0.72361809 0.78391960 0.84422111
## [16] 0.90452261 0.96482412 1.02512563 1.20603015 1.26633166
## [21] 1.32663317 1.38693467 1.44723618 4.82412060 4.88442211
## [26] 4.94472362 5.00502513 5.06532663 5.12562814 5.18592965
## [31] 5.24623116 5.30653266 5.36683417 5.42713568 5.48743719
## [36] 5.54773869 6.33165829 6.75376884 6.81407035 6.87437186
## [41] 6.93467337 6.99497487 10.31155779
There is also one curve that extends beyond 2… This is the cfdr curve for ZQ = 10.31156. What is happening here?
cgrid$x[apply(cgrid$z, 2, function(x) any(which(x>2)))]
## [1] 10.31156
Note that when I run the code on p,q pairs with p < 0.005 | q < 0.001
(either or both - 55 of these), there is one that has a ccut=cfdr(zp,zq)
value \(>1\). This problematic p, q pair is for \((p = 0.00838, q = 5.86e-08)\).
# problematic pair
zp[indices][26] # zp
## [1] 2.636123
zq[indices][26] # zq
## [1] 5.42289
interp.surface(cgrid, cbind(zp[indices][26], zq[indices][26])) # ccut = cfdr(zp, zq)
## [1] 1.611308
The cfdr curve is shown below and ccut
is found by reading off from the red line (x = zp).
The resultant L curves when using vl()
and vlyl()
are shown below for all 55 indices with the problematic one shown in red below. Notice that it is “backwards to all the others” when using vlyl()
.
The final v-value for this example (calculated using vlyl()
and il()
) is \(>1\).
## [1] 1.00282
But when I use vl()
followed by il()
it is 1.178510e-04. This shows that it is a problem with the vlyl()
function. (problematic point is shown in red). Spoiler - this problem is “fixed” when I use vlyl + KDE for the integral later on…
Outstanding questions: Why do the cfdr values extend beyond 1? Is this problematic for the method? If so, how can I go about amending them?
My L curves are now defined on the Z scale (will help later when using continuous Q). I also can’t use the closed=TRUE
parameter because this gives infinite Z scores (see below):
if (closed) {
yval2 = c(Inf, 0, yval2, Inf, Inf)
xval2 = cbind(Inf, Inf, xval2, xval2[,nv], Inf)
}
This means that I need amend my usage of owin()
to define the closed polygons (L regions) for polyCub
to integrate over.
I need it to look more like that shown below. But I don’t know what the coordinates should be for the right hand side of the L region.
Infact, this is fine as I will convert the x axis to the P scale.
I need to approximate the joint null to integrate over the L curves.
Ideally, I’d use the same joint null as in the L curve functions. In the L curve functions, the joint null is approximated from the joint:
kpq = kde2d(c(zp[sub],zp[sub],-zp[sub],-zp[sub]),c(zq[sub],-zq[sub],-zq[sub],zq[sub]),n=res,lims=lims,h=c(1,1))
kpq$z = apply(kpq$z,2,function(x) rev(cummax(rev(x))))
# ZP | H0
px = dnorm(seq(lims[1],lims[2],length.out=res))
# ZQ | H0
zqs = zq[sub][which(p[sub]>0.5)]
kq = density(c(zqs,-zqs),from=lims[1],to=lims[2],n=res)
# ZP, ZQ | H0
kpq0 = kpq
kpq0$z = outer(px,kq$y)
kpq0$z = kpq0$z*(sum(kpq$z)/sum(kpq0$z))
However, this just gives 3D coordinates and I need a function to feed into polyCub
.
I consider using the following which is based on the above. Firstly, I find P | H0:
# ZP | H0
fzph0 <- function(x) dnorm(x)
Then Q | H0 (using the leave-one-chromosome-out method):
# ZQ | H0
sub = setdiff(1:length(p),fold1)
qs = zq[sub][which(p[sub] > 0.5)] # q|P>1/2
kq = density(qs, from = 0, to = 12, n = 200)
fzqh0 <- approxfun(kq$x, kq$y)
Then P, Q | H0 as the product of these:
fzpzqh0 <- function(s){
fzph0(s[,1])*fzqh0(s[,2])
}
The key differences between my code and that in the L curve functions is:
density(qs,...)
(integral between 0 and 12 is \(\approx 0.96\)) rather than density(c(qs, -qs),...)
(integral between 0 and 12 \(\approx 0.5\)) to estimate Q|H0. Why does James use c(qs, -qs)
? Note that ultimately I’ll use density(qs,...)
both in the generation of the L curves and the integration step since my Q will be continuous.kpq0$z = kpq0$z*(sum(kpq$z)/sum(kpq0$z))
.kpq0$z = kpq0$z*(sum(kpq$z)/sum(kpq0$z))
Instead, I consider using the following hack to ensure the integral equals 1.
##### hack to get integral to be 1
# ZQ | H0
qs = zq[sub][which(p[sub] > 0.5)] # q|P>1/2
kq = density(qs, from = 0, to = 12, n = 200)
fzqh0 <- approxfun(kq$x, kq$y)
integrate(fzqh0, 0, 12)
## 0.9659558 with absolute error < 4.7e-05
fzqh0_hack <- approxfun(kq$x, kq$y/integrate(fzqh0, 0, 12)$value)
integrate(fzqh0_hack, 0, 12)
## 1 with absolute error < 4.9e-05
# estimate joint density
fzpzqh0 <- function(s){
fzph0(s[,1])*fzqh0_hack(s[,2])
}
Do I want the 2D integral to equal 1 though? Maybe not due to signed Z score thing?
library(pracma)
##
## Attaching package: 'pracma'
## The following objects are masked from 'package:spatstat':
##
## integral, quad, rat
fzpzqh0_xy <- function(x,y){
fzph0(x)*fzqh0_hack(y)
}
integral2(fzpzqh0_xy, 0, 12, 0, 12)
## $Q
## [1] 0.4997699
##
## $error
## [1] 1.538202e-08
Outstanding questions: How best to estimate the joint null for the integral step?
I compare V values that are generated using 4 different approaches, just to check that my KDE estimation for the joint null code is on the right track.
vl()
then il()
vlyl()
then il()
vl()
then KDE approximation of null and polyCubvlyl()
then KDE approximation of null and polyCubI use the following KDE approximation for 3 and 4:
# ZP | H0
fzph0 <- function(x) dnorm(x)
sub = setdiff(1:length(p),fold1)
# ZQ | H0
qs = zq[sub][which(p[sub] > 0.5)] # q|P>1/2
kq = density(qs, from = 0, to = 12, n = res)
fzqh0 <- approxfun(kq$x, kq$y)
integrate(fzqh0, 0, 12)
fzqh0_hack <- approxfun(kq$x, kq$y/integrate(fzqh0, 0, 12)$value)
integrate(fzqh0_hack, 0, 12)
# estimate joint density
fzpzqh0 <- function(s){
fzph0(s[,1])*fzqh0_hack(s[,2])
}
v <- apply(L$x, 1, function(x) polyCub(owin(poly=list(x = x, y = rev(L$y))), fzpzqh0))
Note that the results look promising even though I am using the wrong L regions at the moment..
This looks ok for now but note that in option 3, there are 6 indices that give the following error message:
Error in owin(poly = list(x = L_ecdf$x[6, ], y = rev(L_ecdf$y))): Area of polygon is negative - maybe traversed in wrong direction?
I amend the vlyl()
function to take signed Z scores and continuous Q (functional annotation data).
set.seed(1) # ensure reproducibility
n = 10000
n1p = 100
n1pq = 100
n1q = 100 # parameters
zp = c(rnorm(n1p, sd = 3), rnorm(n1q, sd = 1), rnorm(n1pq, sd = 3), rnorm(n - n1p - n1q - n1pq, sd = 1))
p = 2*pnorm(-abs(zp))
Q <- readRDS("Q.RDS")
q <- sample(Q[,1], 10000)
par(mfrow = c(1,2))
hist(p, breaks = 100)
hist(q, breaks = 100)
fold_id = (1:n)%%3 # chr for each p,q pair
candidate_indices = which(p < 0.05)
fold = which(fold_id == 1) # which p,q pair on chr 1
indices = intersect(candidate_indices, fold)
mode = 2
at = NULL
fold = which(fold_id == 1)
p_threshold = 0
nt = 5000
nv = 1000
zp = sample(c(-1,1), length(zp), replace = TRUE) * zp # signed Z scores
res=200
maxx = max(c(10, zp))
minx = min(c(-10, zp)) # dont need to bother for abs(ZP)>10
zp[which(zp > maxx)] = maxx
zp[which(zp < minx)] = minx
maxy = max(q)
miny = min(q)
if (mode==2) sub=setdiff(1:length(zp),fold) else sub=1:length(zp)
# define limits by adding 10% either side
q_10 <- (max(q[sub]) - min(q[sub])) * 0.1
p_10 <- (max(p[sub]) - min(p[sub])) * 0.1
lims=c(minx-p_10, maxx+p_10, miny-q_10, maxy+q_10)
I find the KDE of ZP,Q:
# P,Q|H0
kpq = kde2d(zp[sub], q[sub], n=res, lims=lims, h=c(1,1))
I need to make this non-increasing between 0–>Inf and non-decreasing between -Inf–>0.
# all(apply(kpq$z[which(kpq$x>=0),], 2, function(x) all(diff(x) <= 0)))
# all(apply(kpq$z[which(kpq$x<0),], 2, function(x) all(diff(x) >= 0)))
# make non-increasing between 0-->Inf
kpq$z[which(kpq$x>=0),] = apply(kpq$z[which(kpq$x>=0),], 2, function(x) rev(cummax(rev(x))))
# make non-decrasing between 0-->Inf
kpq$z[which(kpq$x<0),] = apply(kpq$z[which(kpq$x<0),], 2, function(x) rev(cummin(rev(x))))
all(apply(kpq$z[which(kpq$x>=0),], 2, function(x) all(diff(x) <= 0))) # check
## [1] TRUE
all(apply(kpq$z[which(kpq$x<0),], 2, function(x) all(diff(x) >= 0))) # check
## [1] TRUE
I now find the KDE of ZP,Q|H0 by first finding the KDE of ZP|H0 and Q|H0:
# ZP|H0
px = dnorm(seq(lims[1],lims[2],length.out=res))
# Q|H0
p = 2*pnorm(-abs(zp))
qs = q[sub][which(p[sub]>0.5)]
kq = density(qs, from=lims[3], to=lims[4], n=res)
# ZP,Q|H0
kpq0 = kpq
kpq0$z = outer(px,kq$y)
kpq0$z = kpq0$z*(sum(kpq$z)/sum(kpq0$z))
kpq$z[which(kpq$z==0)] = min(kpq$z[which(kpq$z>0)])
kpq0$z[which(kpq0$z==0)] = min(kpq0$z[which(kpq0$z>0)])
I then find \(cfdr\approx \dfrac{f(P,Q)}{f(P,Q|H0)}\):
cgrid = kpq
cgrid$z = pmin(kpq0$z/kpq$z,10)
And make this non-increasing between 0–>Inf and non-decreasing between -Inf–>0.
# make non-increasing between 0 --> Inf
cgrid$z[which(cgrid$x>=0), ] = apply(cgrid$z[which(cgrid$x>=0), ], 2, function(x) cummin(x-rev(cummin(rev(x)))))
# make non-decreasing between -Inf --> 0
cgrid$z[which(cgrid$x<0), ] = apply(cgrid$z[which(cgrid$x<0), ], 2, function(x) rev(cummin((rev(x)-rev(cummin(x))))))
all(apply(cgrid$z[which(cgrid$x>=0),], 2, function(x) all(diff(x) <= 0))) # check
## [1] TRUE
all(apply(cgrid$z[which(cgrid$x<0),], 2, function(x) all(diff(x) >= 0))) # check
## [1] TRUE