17th April

cfdr for functional annotations

James’ vlyl() function uses KDEs to approximate the cfdr and find L curves.

This is done in a two-step approach:

Find the ccut values where ccut[i]=cfdr(p[i],q[i]) (fixing q to generate the cfdr curves)
Find the xval2 values where cfdr(xval2[i,j], yval2[i,j])=ccut[i] (fixing yval2 to generate cfdr curves)

cfdr is estimated by cgrid and this is interpolated to find ccut and xval2.

1. cfdr

cgrid estimates the adjusted cfdr by

\[\begin{equation} \begin{split} \hat{cFDR} &= P(H_0^p\text{ | }P\leq p, Q\leq q)\\[2ex] &= \dfrac{P(P\leq p\text{ | }H_0^p, Q\leq q) P(H_0^p\text{ | }Q\leq q)}{P(P\leq p \text{ | }Q\leq q)}\\[2ex] &= \dfrac{P(P\leq p\text{ | }H_0^p) \times \dfrac{P(Q\leq q\text{ | }H_0^p) P(H_0^p)}{P(Q\leq q)}}{\dfrac{P(P\leq p , Q\leq q)}{P(Q\leq q)}}\\[2ex] &\approx \dfrac{P(P\leq p\text{ | }H_0^p) \times P(Q\leq q\text{ | }H_0^p)}{P(P\leq p, Q\leq q)}\\[2ex] &\approx \dfrac{P(P\leq p, Q\leq q\text{ | }H_0^p)}{P(P\leq p, Q\leq q)}. \end{split} \end{equation}\]

Estimating $P(P\leq p, Q\leq q)$

We first estimate $P(P\leq p, Q\leq q)$ (the denominator) using KDE:

# P,Q
kpq = kde2d(c(zp[sub],zp[sub],-zp[sub],-zp[sub]),c(zq[sub],-zq[sub],-zq[sub],zq[sub]), n=res, lims=lims, h=c(1,1))
kpq$z = apply(kpq$z, 2, function(x) rev(cummax(rev(x)))) # ensures non-increasing

The second line ensures that it is non-increasing - see below for an example:

Estimating $P(P\leq p, Q\leq q\text{ | }H_0^p)$

We then use the independence assumption to estimate $P(P\leq p, Q\leq q\text{ | }H_0^p)=P(P\leq p|H_0^p) \times P(Q\leq q|H_0^p)$ (the numerator):

# Q | H_0
kq = density(c(zq[sub][which(p[sub]>0.5)],-zq[sub][which(p[sub]>0.5)]), from=lims[1], to=lims[2], n=res)

# P | H_0
px = dnorm(seq(lims[1], lims[2], length.out=res))

# P,Q | H_0
kpq0 = kpq 
kpq0$z = outer(px, kq$y) # P|H0 * Q|H0
kpq0$z = kpq0$z*(sum(kpq$z)/sum(kpq0$z)) # ensure sum(kpq$z)=sum(kpq0$z) 

kpq$z[which(kpq$z==0)] = min(kpq$z[which(kpq$z>0)])
kpq0$z[which(kpq0$z==0)] = min(kpq0$z[which(kpq0$z>0)])

Estimating cfdr

We then estimate cfdr with the ratio of these two followed by making the cfdr curves non-increasing:

cgrid = kpq 
cgrid$z = pmin(kpq0$z/kpq$z,10) # don't need for super high Z  
                                # scores as we know these are associated anyway
cgrid$z=apply(cgrid$z,2,function(x) cummin(x-rev(cummin(rev(x))))) # cfdr value (?)

I look at 3 examples for low ZQ (high Q), medium zQ and high ZQ (low Q). These look ok - we see that the cfdr curves are “lower” low Q values.

Why are some cfdr > 1?

The ZQ values that give cfdr curves that exceed 1 are shown below. These are mainly for very high Q values (low ZQ values). Also note that for this to really be a problem (i.e. to give a cfdr value $>1$) the corresponding ZP values must be relatively small (i.e. big p values).

cgrid$x[apply(cgrid$z, 2, function(x) any(which(x>1)))]

##  [1]  0.00000000  0.06030151  0.12060302  0.18090452  0.24120603
##  [6]  0.30150754  0.36180905  0.42211055  0.48241206  0.54271357
## [11]  0.60301508  0.66331658  0.72361809  0.78391960  0.84422111
## [16]  0.90452261  0.96482412  1.02512563  1.20603015  1.26633166
## [21]  1.32663317  1.38693467  1.44723618  4.82412060  4.88442211
## [26]  4.94472362  5.00502513  5.06532663  5.12562814  5.18592965
## [31]  5.24623116  5.30653266  5.36683417  5.42713568  5.48743719
## [36]  5.54773869  6.33165829  6.75376884  6.81407035  6.87437186
## [41]  6.93467337  6.99497487 10.31155779

There is also one curve that extends beyond 2… This is the cfdr curve for ZQ = 10.31156. What is happening here?

cgrid$x[apply(cgrid$z, 2, function(x) any(which(x>2)))]

## [1] 10.31156

Impact of cfdr > 1 (example)

Note that when I run the code on p,q pairs with p < 0.005 | q < 0.001 (either or both - 55 of these), there is one that has a ccut=cfdr(zp,zq) value $>1$. This problematic p, q pair is for $(p = 0.00838, q = 5.86e-08)$.

# problematic pair
zp[indices][26] # zp

## [1] 2.636123

zq[indices][26] # zq

## [1] 5.42289

interp.surface(cgrid, cbind(zp[indices][26], zq[indices][26])) # ccut = cfdr(zp, zq)

## [1] 1.611308

The cfdr curve is shown below and ccut is found by reading off from the red line (x = zp).

The resultant L curves when using vl() and vlyl() are shown below for all 55 indices with the problematic one shown in red below. Notice that it is “backwards to all the others” when using vlyl().

The final v-value for this example (calculated using vlyl() and il()) is $>1$.

## [1] 1.00282

But when I use vl() followed by il() it is 1.178510e-04. This shows that it is a problem with the vlyl() function. (problematic point is shown in red). Spoiler - this problem is “fixed” when I use vlyl + KDE for the integral later on…

Outstanding questions: Why do the cfdr values extend beyond 1? Is this problematic for the method? If so, how can I go about amending them?

2. Defining L regions

My L curves are now defined on the Z scale (will help later when using continuous Q). I also can’t use the closed=TRUE parameter because this gives infinite Z scores (see below):

if (closed) {
    yval2 = c(Inf, 0, yval2, Inf, Inf)
    xval2 = cbind(Inf, Inf, xval2, xval2[,nv], Inf)
  }

This means that I need amend my usage of owin() to define the closed polygons (L regions) for polyCub to integrate over.

I need it to look more like that shown below. But I don’t know what the coordinates should be for the right hand side of the L region.

Infact, this is fine as I will convert the x axis to the P scale.

3. KDE approximation of joint null

I need to approximate the joint null to integrate over the L curves.

Ideally, I’d use the same joint null as in the L curve functions. In the L curve functions, the joint null is approximated from the joint:

kpq = kde2d(c(zp[sub],zp[sub],-zp[sub],-zp[sub]),c(zq[sub],-zq[sub],-zq[sub],zq[sub]),n=res,lims=lims,h=c(1,1))

kpq$z = apply(kpq$z,2,function(x) rev(cummax(rev(x))))

# ZP | H0
px = dnorm(seq(lims[1],lims[2],length.out=res))

# ZQ | H0
zqs = zq[sub][which(p[sub]>0.5)]
kq = density(c(zqs,-zqs),from=lims[1],to=lims[2],n=res)

# ZP, ZQ | H0
kpq0 = kpq
kpq0$z = outer(px,kq$y)
kpq0$z = kpq0$z*(sum(kpq$z)/sum(kpq0$z))

However, this just gives 3D coordinates and I need a function to feed into polyCub.

I consider using the following which is based on the above. Firstly, I find P | H0:

# ZP | H0
fzph0 <- function(x) dnorm(x)

Then Q | H0 (using the leave-one-chromosome-out method):

# ZQ | H0
sub = setdiff(1:length(p),fold1)
qs = zq[sub][which(p[sub] > 0.5)]  # q|P>1/2
kq = density(qs, from = 0, to = 12, n = 200)

fzqh0 <- approxfun(kq$x, kq$y)

Then P, Q | H0 as the product of these:

fzpzqh0 <- function(s){
  fzph0(s[,1])*fzqh0(s[,2])
}

The key differences between my code and that in the L curve functions is:

I use density(qs,...) (integral between 0 and 12 is $\approx 0.96$) rather than density(c(qs, -qs),...) (integral between 0 and 12 $\approx 0.5$) to estimate Q|H0. Why does James use c(qs, -qs)? Note that ultimately I’ll use density(qs,...) both in the generation of the L curves and the integration step since my Q will be continuous.

I don’t have James’ normalisation step: kpq0$z = kpq0$z*(sum(kpq$z)/sum(kpq0$z)).

kpq0$z = kpq0$z*(sum(kpq$z)/sum(kpq0$z))

Instead, I consider using the following hack to ensure the integral equals 1.

##### hack to get integral to be 1

# ZQ | H0
qs = zq[sub][which(p[sub] > 0.5)]  # q|P>1/2
kq = density(qs, from = 0, to = 12, n = 200)

fzqh0 <- approxfun(kq$x, kq$y)

integrate(fzqh0, 0, 12)

## 0.9659558 with absolute error < 4.7e-05

fzqh0_hack <- approxfun(kq$x, kq$y/integrate(fzqh0, 0, 12)$value)

integrate(fzqh0_hack, 0, 12)

## 1 with absolute error < 4.9e-05

# estimate joint density
fzpzqh0 <- function(s){
  fzph0(s[,1])*fzqh0_hack(s[,2])
}

Do I want the 2D integral to equal 1 though? Maybe not due to signed Z score thing?

library(pracma)

## 
## Attaching package: 'pracma'

## The following objects are masked from 'package:spatstat':
## 
##     integral, quad, rat

fzpzqh0_xy <- function(x,y){
  fzph0(x)*fzqh0_hack(y)
}

integral2(fzpzqh0_xy, 0, 12, 0, 12)

## $Q
## [1] 0.4997699
## 
## $error
## [1] 1.538202e-08

Outstanding questions: How best to estimate the joint null for the integral step?

4. Comparing V values

I compare V values that are generated using 4 different approaches, just to check that my KDE estimation for the joint null code is on the right track.

vl() then il()
vlyl() then il()
vl() then KDE approximation of null and polyCub
vlyl() then KDE approximation of null and polyCub

I use the following KDE approximation for 3 and 4:

# ZP | H0
fzph0 <- function(x) dnorm(x)

sub = setdiff(1:length(p),fold1)

# ZQ | H0
qs = zq[sub][which(p[sub] > 0.5)]  # q|P>1/2
kq = density(qs, from = 0, to = 12, n = res)

fzqh0 <- approxfun(kq$x, kq$y)

integrate(fzqh0, 0, 12)

fzqh0_hack <- approxfun(kq$x, kq$y/integrate(fzqh0, 0, 12)$value)

integrate(fzqh0_hack, 0, 12)

# estimate joint density
fzpzqh0 <- function(s){
  fzph0(s[,1])*fzqh0_hack(s[,2])
}

v <- apply(L$x, 1, function(x) polyCub(owin(poly=list(x = x, y = rev(L$y))), fzpzqh0))

Note that the results look promising even though I am using the wrong L regions at the moment..

This looks ok for now but note that in option 3, there are 6 indices that give the following error message:

Error in owin(poly = list(x = L_ecdf$x[6, ], y = rev(L_ecdf$y))): Area of polygon is negative - maybe traversed in wrong direction?

5. New vlyl function

I amend the vlyl() function to take signed Z scores and continuous Q (functional annotation data).

set.seed(1)  # ensure reproducibility
n = 10000
n1p = 100
n1pq = 100
n1q = 100  # parameters
zp = c(rnorm(n1p, sd = 3), rnorm(n1q, sd = 1), rnorm(n1pq, sd = 3), rnorm(n - n1p - n1q - n1pq, sd = 1))

p = 2*pnorm(-abs(zp))

Q <- readRDS("Q.RDS")
q <- sample(Q[,1], 10000)

par(mfrow = c(1,2))
hist(p, breaks = 100)
hist(q, breaks = 100)

fold_id = (1:n)%%3  # chr for each p,q pair

candidate_indices = which(p < 0.05)

fold = which(fold_id == 1)  # which p,q pair on chr 1

indices = intersect(candidate_indices, fold)

mode = 2
at = NULL
fold = which(fold_id == 1)
p_threshold = 0

nt = 5000
nv = 1000

zp =  sample(c(-1,1), length(zp), replace = TRUE) * zp # signed Z scores

res=200

maxx = max(c(10, zp))
minx = min(c(-10, zp)) # dont need to bother for abs(ZP)>10

zp[which(zp > maxx)] = maxx
zp[which(zp < minx)] = minx

maxy = max(q)
miny = min(q)

if (mode==2) sub=setdiff(1:length(zp),fold) else sub=1:length(zp)

# define limits by adding 10% either side
q_10 <- (max(q[sub]) - min(q[sub])) * 0.1
p_10 <- (max(p[sub]) - min(p[sub])) * 0.1

lims=c(minx-p_10, maxx+p_10, miny-q_10, maxy+q_10)

I find the KDE of ZP,Q:

# P,Q|H0
kpq = kde2d(zp[sub], q[sub], n=res, lims=lims, h=c(1,1))

I need to make this non-increasing between 0–>Inf and non-decreasing between -Inf–>0.

# all(apply(kpq$z[which(kpq$x>=0),], 2, function(x) all(diff(x) <= 0)))
# all(apply(kpq$z[which(kpq$x<0),], 2, function(x) all(diff(x) >= 0)))

# make non-increasing between 0-->Inf
kpq$z[which(kpq$x>=0),] = apply(kpq$z[which(kpq$x>=0),], 2, function(x) rev(cummax(rev(x))))

# make non-decrasing between 0-->Inf
kpq$z[which(kpq$x<0),] = apply(kpq$z[which(kpq$x<0),], 2, function(x) rev(cummin(rev(x))))

all(apply(kpq$z[which(kpq$x>=0),], 2, function(x) all(diff(x) <= 0))) # check

## [1] TRUE

all(apply(kpq$z[which(kpq$x<0),], 2, function(x) all(diff(x) >= 0))) # check

## [1] TRUE

I now find the KDE of ZP,Q|H0 by first finding the KDE of ZP|H0 and Q|H0:

# ZP|H0
px = dnorm(seq(lims[1],lims[2],length.out=res))

# Q|H0
p = 2*pnorm(-abs(zp))
qs  = q[sub][which(p[sub]>0.5)]
kq = density(qs, from=lims[3], to=lims[4], n=res)

# ZP,Q|H0
kpq0  = kpq
kpq0$z = outer(px,kq$y)
kpq0$z = kpq0$z*(sum(kpq$z)/sum(kpq0$z))

kpq$z[which(kpq$z==0)] = min(kpq$z[which(kpq$z>0)])
kpq0$z[which(kpq0$z==0)] = min(kpq0$z[which(kpq0$z>0)])

I then find $cfdr\approx \dfrac{f(P,Q)}{f(P,Q|H0)}$:

cgrid = kpq
cgrid$z = pmin(kpq0$z/kpq$z,10)

And make this non-increasing between 0–>Inf and non-decreasing between -Inf–>0.

# make non-increasing between 0 --> Inf
cgrid$z[which(cgrid$x>=0), ] = apply(cgrid$z[which(cgrid$x>=0), ], 2, function(x) cummin(x-rev(cummin(rev(x)))))

# make non-decreasing between -Inf --> 0
cgrid$z[which(cgrid$x<0), ] = apply(cgrid$z[which(cgrid$x<0), ], 2, function(x) rev(cummin((rev(x)-rev(cummin(x))))))

all(apply(cgrid$z[which(cgrid$x>=0),], 2, function(x) all(diff(x) <= 0))) # check

## [1] TRUE

all(apply(cgrid$z[which(cgrid$x<0),], 2, function(x) all(diff(x) >= 0))) # check

## [1] TRUE