1. cFDR


Transform q values to normal distribution


A quick fix would be to convert the q values to a normal distribution and then directly use James’ functions.

The bestNormalize package (https://cran.r-project.org/web/packages/bestNormalize/vignettes/bestNormalize.html) allows comparison between different normalisation methods (where normalisation refers to making data normal, i.e. not just \([0,1]\)).

mca_df <- readRDS("res.mca.mjca.RDS")
q <- mca_df[,1]
hist(q, breaks = 100)

I use the package to find the recommended normalisation method. The package suggests the ordered quantile (ORQ) normalization via orderNorm, a technique based on mapping ranks to the normal distribution, which guarantees normally distributed transformed data (provided there are no ties).
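Conceptually, ORQ just pushes the ranks of the data through the standard normal quantile function. A rough sketch of the idea (not the package's exact implementation, which also handles ties and maps new, out-of-sample values via interpolation):

# Rough idea behind orderNorm: rank the data, rescale ranks into (0,1), map through qnorm
orq_rough <- function(x) qnorm((rank(x) - 0.5) / length(x))
hist(orq_rough(q), breaks = 100)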

library(bestNormalize)
(BNobject <- bestNormalize(q))
## Warning in bestNormalize(q): boxcox  did not work;  Error in estimate_boxcox_lambda(x, ...) : x must be positive
## Warning in orderNorm(standardize = TRUE, warn = TRUE, x = c(1.23253769877005, : Ties in data, Normal distribution not guaranteed
## Best Normalizing transformation with 121879 Observations
##  Estimated Normality Statistics (Pearson P / df, lower => more normal):
##  - No transform: 248.1234 
##  - Log_b(x+a): 62.6722 
##  - sqrt(x+a): 91.8964 
##  - exp(x): 2154.9042 
##  - arcsinh(x): 292.3957 
##  - Yeo-Johnson: 132.1966 
##  - orderNorm: 4.7114 
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##  
## Based off these, bestNormalize chose:
## orderNorm Transformation with 121879 nonmissing obs and ties
##  - 102874 unique values 
##  - Original quantiles:
##     0%    25%    50%    75%   100% 
## -1.344 -1.038 -0.620  0.997  4.357
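The chosen transformation can also be applied and inverted with predict() on the bestNormalize object, which should be handy if values need mapping back to the original q scale later. A sketch using the fitted object above:

# Apply the selected transformation (same values as BNobject$x.t) and invert it
q_norm <- predict(BNobject)
q_back <- predict(BNobject, newdata = q_norm, inverse = TRUE)
all.equal(q_back, q)  # should roughly recover the original q (up to ties/interpolation)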

The transformed annotation data is now approximately normally distributed (the ties mean exact normality is not guaranteed).

orderNorm_obj <- orderNorm(q)
## Warning in orderNorm(q): Ties in data, Normal distribution not guaranteed
hist(orderNorm_obj$x.t, main = "orderNorm transformation", breaks = 100)


Problem


James’ code does not allow p or q values \(<1e-300\), which is the case for roughly 50% of the dataset.

final_dataset <- readRDS("final_dataset.RDS")

q <- readRDS("transformed_q.RDS")

p <- final_dataset$p

source("cfdr_functions.R")

# p[which(p < 1e-300)] <- 1e-300 # to avoid errors in James' code
# q[which(q < 1e-300)] <- 1e-300

candidate_indices <- which(p < 1) # compute v-values at every SNP

chr <- final_dataset$chrom

est_q0_pars <- fit.2g(q[which(p>0.5)])$pars # estimate the null distribution of q from SNPs with p > 0.5 (two-Gaussian mixture fit)

# prepare containers
fold <- vector(mode = "list", length = length(unique(chr)))
ind <- vector(mode = "list", length = length(unique(chr)))
L <- vector(mode = "list", length = length(unique(chr)))
v <- vector(mode = "list", length = length(unique(chr)))

for(i in 1:length(unique(chr))){
  fold[[i]] <- which(chr == i)
  ind[[i]] <- intersect(candidate_indices, fold[[i]])
  L[[i]] <- vl(p, q, indices=ind[[i]], mode=2, fold=fold[[i]], gx = min(p[ind[[i]]]));  # compute L curves
  
  v[[i]] <- il(L[[i]], pi0_null = est_q0_pars[1], sigma_null = est_q0_pars[2], distx="norm", gx = min(p[ind[[i]]])) # integrate over L curves
}

save(p, q, v, file = "transform_res.RData")
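A possible follow-up step (not in the original script, and assuming the v-values can be treated like p-values, as in the cFDR papers) would be to match the v-values back to their SNP indices and apply a standard FDR correction:

# Hypothetical follow-up: collect v-values back into genome-wide order and apply BH
v_all <- rep(NA_real_, length(p))
for (i in 1:length(unique(chr))) v_all[ind[[i]]] <- v[[i]]
keep <- !is.na(v_all)
sig  <- which(keep)[p.adjust(v_all[keep], method = "BH") < 0.05]
length(sig)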

New method


We need to:

  • Fit a density estimate to the Q data.
  • Use that same density estimate in generating the L curves.

Density estimation

We need to approximate the density \(f_0(p,q)=f(P=p, Q=q|H_0^p)=f(Q=q|H_0^p)\); the second equality holds because, under \(H_0^p\), P is uniform on \([0,1]\) and independent of Q. To do this we consider the distribution of Q where \(P>1/2\), i.e. at SNPs that are unlikely to be associated with the principal trait.

final_dataset <- readRDS("final_dataset.RDS")
p <- final_dataset$p
ind <- which(p>1/2)

mca_df <- readRDS("res.mca.mjca.RDS")
q1 <- mca_df[,1][ind]
q2 <- mca_df[,2][ind]
q3 <- mca_df[,3][ind]

par(mfrow=c(1,3))
hist(q1, breaks = 1000)
hist(q2, breaks = 1000)
hist(q3, breaks = 1000)

Notice that there are spikes at the edges of the distributions (for q2 and q3), which may cause us problems. We need a density estimator that can extend beyond the data boundaries where necessary (e.g. for dimensions 2 and 3).

Idea: Try Gaussian mixture models for density estimation. To do this I’ll use the mclust R package.


Gaussian mixture models

  • Assume that all our data points come from one of \(k\) Gaussian distributions (with different means and variances).

  • Note that this is different to the \(K\)-means algorithm, where the clusters are modelled only by their mean.

  • EM algorithm assigns data to clusters with some probability and is used to infer the parameter values.

  • E-step (“Expectation”): Treat the means, covariances and weights as fixed. For each data point, calculate the probability that it belongs to each Gaussian and treat these as the relative “assignment probabilities” of that data point to each of the Gaussians.

  • M-step (“Maximisation”): Treat the assignment probabilities as fixed and update the parameters (means, covariances and weights) using these. E.g. the mean of cluster C is the mean of the data points weighted by their assignment probabilities to cluster C.

  • Repeat E-M steps until convergence (a minimal sketch of the updates follows below).
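As a minimal sketch (for intuition only; the mclust fits below do all of this, plus model selection over the number of components and the variance structure), the E and M updates for a univariate two-component mixture look roughly like this. em_gmm2 is my own illustrative helper, not part of mclust:

# Minimal EM sketch for a univariate two-component Gaussian mixture
em_gmm2 <- function(x, n_iter = 100) {
  n   <- length(x)
  mu  <- as.numeric(quantile(x, c(0.25, 0.75)))  # crude initialisation
  sdk <- rep(sd(x), 2)
  wk  <- c(0.5, 0.5)
  for (it in 1:n_iter) {
    # E-step: assignment probabilities (responsibilities) for each data point
    d <- cbind(wk[1] * dnorm(x, mu[1], sdk[1]),
               wk[2] * dnorm(x, mu[2], sdk[2]))
    r <- d / rowSums(d)
    # M-step: update weights, means and variances using the responsibilities
    nk  <- colSums(r)
    wk  <- nk / n
    mu  <- colSums(r * x) / nk
    sdk <- sqrt(colSums(r * (x - rep(mu, each = n))^2) / nk)
  }
  list(pro = wk, mean = mu, sd = sdk)
}
# em_gmm2(q1)  # compare with a 2-component "V" fit from mclust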


Q1 (Dimension 1)

library(mclust) # densityMclust lives in the mclust package

par(mfrow=c(2,2))
q1_mclust <- densityMclust(q1)
summary(q1_mclust)
## ------------------------------------------------------- 
## Density estimation via Gaussian finite mixture modeling 
## ------------------------------------------------------- 
## 
## Mclust V (univariate, unequal variance) model with 7 components: 
## 
##  log-likelihood     n df       BIC       ICL
##       -46180.02 49359 20 -92576.18 -116196.8
plot(q1_mclust, what = "BIC") # E is equal variance, V is variable variance
plot(q1_mclust, what = "density", data = q1, breaks = 100)
plot(q1_mclust, what = "diagnostic", type = "cdf")
plot(q1_mclust, what = "diagnostic", type = "qq")
## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to
## unique 'x' values
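One way to “use that same density estimate in generating the L curves” is to evaluate the fitted mixture directly from its stored parameters rather than interpolating the per-observation densities. A sketch (the parameter fields are those mclust stores for univariate models; gmm_density is my own helper, not part of mclust):

# Evaluate the fitted mixture density at arbitrary points from its parameters
gmm_density <- function(x, fit) {
  pro  <- fit$parameters$pro                     # mixing weights
  mu   <- fit$parameters$mean                    # component means
  sdev <- sqrt(fit$parameters$variance$sigmasq)  # component sds
  if (length(sdev) == 1) sdev <- rep(sdev, length(mu))  # "E" model: shared variance
  rowSums(sapply(seq_along(pro), function(k) pro[k] * dnorm(x, mu[k], sdev[k])))
}
xs <- seq(min(q1), max(q1), length.out = 200)
plot(xs, gmm_density(xs, q1_mclust), type = "l",
     main = "GMM density evaluated from parameters")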


Q2 (Dimension 2)

par(mfrow=c(2,2))
q2_mclust <- densityMclust(q2)
summary(q2_mclust)
## ------------------------------------------------------- 
## Density estimation via Gaussian finite mixture modeling 
## ------------------------------------------------------- 
## 
## Mclust V (univariate, unequal variance) model with 8 components: 
## 
##  log-likelihood     n df       BIC       ICL
##       -56804.66 49359 23 -113857.9 -154768.3
plot(q2_mclust, what = "BIC") # E is equal variance, V is variable variance
plot(q2_mclust, what = "density", data = q2, breaks = 100)
plot(q2_mclust, what = "diagnostic", type = "cdf")
plot(q2_mclust, what = "diagnostic", type = "qq")
## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to
## unique 'x' values


Q3 (Dimension 3)

par(mfrow=c(2,2))
q3_mclust <- densityMclust(q3)
summary(q3_mclust)
## ------------------------------------------------------- 
## Density estimation via Gaussian finite mixture modeling 
## ------------------------------------------------------- 
## 
## Mclust V (univariate, unequal variance) model with 5 components: 
## 
##  log-likelihood     n df       BIC       ICL
##       -102871.1 49359 14 -205893.5 -250528.2
plot(q3_mclust, what = "BIC") # E is equal variance, V is variable variance
plot(q3_mclust, what = "density", data = q3, breaks = 100)
plot(q3_mclust, what = "diagnostic", type = "cdf")
plot(q3_mclust, what = "diagnostic", type = "qq")
## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to
## unique 'x' values


Annotated L curve code


  • Idea: amend this so that it uses the density estimates above.

##' Return co-ordinates of L-regions for cFDR method, with or without estimation of Pr(H0|Pj<pj).
##' 
##' Parameter 'mode' defines the way in which L-curves are constructed. L-curves define a mapping 
##' from the unit square [0,1]^2 onto the unit interval [0,1], and we consider the mapped values 
##' of each point (p[i],q[i]). In this method (cFDR1) the mapping depends on the points used to 
##' generate L, and if these points coincide with the points we are using the map for, the 
##' behaviour of the map is very complicated. Parameter 'mode' governs the set of points used in 
##' generating L-curves, and can be used to ensure that the points used to define curves L (and 
##' hence the mapping) are distinct from the points the mapping is used on.
##' 
##' @title vl
##' @param p principal p-values
##' @param q conditional p-values
##' @param adj adjust cFDR values and hence curves L using estimate of Pr(H0|Pj<pj)
##' @param indices instead of at cfdr cutoffs, compute v(L) at indices of points. Overrides parameter at if set.
##' @param at cfdr cutoff/cutoffs. Defaults to null
##' @param mode determines set of variables to use for computing cFDR. Set to 0 to leave all of 'indices' in, 1 to remove each index one-by-one (only when computing the curve L for that value of (p,q)), 2 to remove all indices in variable 'fold' and compute curves L only using points not in 'fold'
##' @param fold If mode=2, only compute L-curves using points not in 'fold'. 
##' @param p_threshold if H0 is to be rejected automatically whenever p<p_threshold, include this in all regions L
##' @param nt number of test points in x-direction, default 5000
##' @param nv resolution for constructing L-curves, default 1000
##' @param scale return curves on the p- or z- plane. Y values are equally spaced on the z-plane.
##' @param closed determines whether curves are closed polygons encircling regions L (closed=T), or lines indicating the rightmost border of regions L
##' @param verbose print progress if mode=1
##' @export
##' @author James Liley
##' @return list containing elements x, y. Assuming n curves are calculated (where n=length(indices) or length(at)) and closed=T, x is a matrix of dimension n x (4+nv), y is a vector of length (4+nv).
vl = function(p, q, adj = TRUE, indices = NULL, at = NULL, 
              mode = 0, fold = NULL, nt = 5000, nv = 1000, p_threshold = 0, 
              scale = c("p", "z"), closed = TRUE, verbose = FALSE, 
              gx = 10^-5) {
  
  # easiest to operate on z scale, to space points more tightly where needed in small p
  zp = -qnorm(p/2)
  zq = -qnorm(q/2)
  if (any(!is.finite(zp + zq))) 
    stop("P values p,q must be in [1e-300, 1]")
  
  mx = max(c(10, abs(-qnorm(p/2))))
  my = max(c(10, abs(-qnorm(q/2))))  # maximum limits for integration
  
  # gx=0 #1/1000 # use for 'tiebreaks'- if a point is
  # on a curve with nonzero area, force the L-curve
  # through that point
  
  if (is.null(indices)) {
    if (is.null(at)) 
      stop("One of the parameters 'indices', 'at' must be set")
    ccut = at
    mode = 0
  } else ccut = rep(0, length(indices))
  
  # yval2 values are nv equally spaced points between [0,my]
  yval2 = seq(0, my, length.out = nv + 1)[1:nv]
  
  # get xval2 values in correct structure (number of indices * nv)
  xval2 = outer(rep(1, length(ccut)), yval2)
  
  pval2 = 2 * pnorm(-yval2)
  
  # xtest values are nt equally spaced points between [0,mx]
  xtest = seq(0, mx, length.out = nt)
  ptest = 2 * pnorm(-xtest)
  
  if (!is.null(indices)) {
    # set ccut. NOT equal to cfdr at the points; needs
    # to be adjusted since an additional test point is
    # used in determination of L
    # if (mode == 1) {
    #   ccut = rep(0, length(indices))
    #   for (i in 1:length(indices)) {
    #     w = which(zq[-indices[i]] >= zq[indices[i]])
    #     if (length(w) >= 1) {
    #       cfsub = (1 + (1/length(w))) * ptest/(1 + 
    #                                              (1/length(w)) - ecdf(zp[-indices[i]][w])(xtest))
    #       cfsub = cummin(cfsub)
    #       ccut[i] = approx(xtest, cfsub - gx * 
    #                          xtest + gx * mx, zp[indices[i]], 
    #                        rule = 2)$y
    #     } else ccut[i] = p[indices[i]] * 1/(1 + 
    #                                           length(which(zp[-indices[i]][w] >= 
    #                                                          zp[indices[i]])))
    #   }
    # }
    if (mode == 2) {
      ccut = rep(0, length(indices))
      for (i in 1:length(indices)) {
        w = which(zq[-fold] >= zq[indices[i]])
        if (length(w) >= 1) {
          
          ## prob P < p | Q > q
          # rough_cfdr=function(p,q) p*length(which(Q <= Q))/length(which(P <= p & Q <= q))
          cfsub = (1 + (1/length(w))) * ptest/(1 + 
                                                 (1/length(w)) - ecdf(zp[-fold][w])(xtest))
          
          # cummin gives the minimum for all values in x from the start
          # of the vector until the position of that value - i.e. make non-decreasing
          cfsub = cummin(cfsub)
          
          # linearly interpolate (xtest, cfsub) and find y value at x=z
          # note - add a very slowly increasing linear term (so it is non-decreasing and non-flat)
          ccut[i] = approx(xtest, cfsub - gx * 
                             xtest + gx * mx, zp[indices[i]], 
                           rule = 2)$y
        } else ccut[i] = p[indices[i]] # if smallest P value then set equal to original P
      }
      # ccut= p[indices]*sapply(indices,function(x)
      # (1+length(which(q[-fold] <=
      # q[x])))/(1+length(which(p[-fold] <= p[x] &
      # q[-fold] <= q[x]))))# set ccut
    }
    # if (mode == 0) {
    #   # yval2=seq(0,my,length.out=nv+1)[1:nv];
    #   # xval2=outer(rep(1,length(indices)),yval2);
    #   # pval2=2*pnorm(-yval2)
    #   # xtest=seq(0,my,length.out=nt);
    #   # ptest=2*pnorm(-xtest)
    #   
    #   ccut = rep(0, length(indices))
    #   for (i in 1:length(indices)) {
    #     w = which(zq >= zq[indices[i]])
    #     if (length(w) >= 2) {
    #       cfsub = (1 + (1/length(w))) * ptest/(1 + 
    #                                              (1/length(w)) - ecdf(zp[w])(xtest))
    #       cfsub = cummin(cfsub)
    #       ccut[i] = approx(xtest, cfsub - gx * 
    #                          xtest + gx * mx, zp[indices[i]], 
    #                        rule = 2)$y
    #     } else ccut[i] = p[indices[i]]
    #   }
    #   # ccut=p[indices]*sapply(indices,function(x)
    #   # (1+length(which(q <= q[x])))/(1+length(which(p <=
    #   # p[x] & q <= q[x]))))# set ccut
    # }
  }
  
  ccut = ccut * (1 + 1e-06)  # ccut + 1e-8 # prevent floating-point comparison errors
  
  # if (verbose & mode == 1) 
  #   print(paste0(length(ccut), " regions to calculate"))
  
  out = rep(0, length(ccut))
  
  # if (mode == 0) {
  #   
  #   if (adj) {
  #     correct = cummin((1 + ecdf(q[which(p > 
  #                                          0.5)])(pval2) * length(p))/(1 + ecdf(q)(pval2) * 
  #                                                                        length(p)))
  #     if (!is.null(indices)) 
  #       correct_ccut = approx(pval2, correct, 
  #                             q[indices], rule = 2)$y  #cummin((1+ecdf(pc[which(p>0.5)])(pc)*length(p))/(1+ rank(pc)))
  #   } else {
  #     correct = rep(1, length(pval2))  # adjustment factor for pc[i]
  #     correct_ccut = rep(1, length(ccut))
  #   }
  #   if (!is.null(indices)) 
  #     ccut = ccut * correct_ccut
  #   
  #   zp_ind = ceiling(zp * nt/mx)
  #   zp_ind = pmax(1, pmin(zp_ind, nt))
  #   zq_ind = ceiling(zq * nv/my)
  #   zq_ind = pmax(1, pmin(zq_ind, nv))
  #   zq[which(zq > my)] = my
  #   zp[which(zp > mx)] = mx
  #   p = 2 * pnorm(-abs(zp))
  #   q = 2 * pnorm(-abs(zq))
  #   
  #   for (i in 1:length(yval2)) {
  #     w = which(zq > yval2[i])
  #     if (length(w) >= 1) {
  #       cfsub = ptest/(1 + (1/length(w)) - 
  #                        ecdf(zp[w])(xtest))
  #       cfsub = cummin(cfsub)
  #       xval2[, i] = approx(cfsub * correct[i] - 
  #                             gx * xtest + gx * mx, xtest, ccut, 
  #                           rule = 2, method = "const", f = 1)$y
  #     } else xval2[, i] = -qnorm(ccut/2)
  #   }
  #   
  # }
  # if (mode == 1) {
  #   zp_ind = ceiling(zp * nt/mx)
  #   zp_ind = pmax(1, pmin(zp_ind, nt))
  #   zq_ind = ceiling(zq * nv/my)
  #   zq_ind = pmax(1, pmin(zq_ind, nv))
  #   # populate zp_ind[i],zq_ind[i] with the index of
  #   # xval2,yval2 closest (least above?) zp[i],zq[i].
  #   # Only need to do at 'sub'
  #   
  #   xmat = matrix(0, nt, nv)  # base
  #   
  #   pqmat = xmat
  #   qvec = rep(0, nv)
  #   
  #   for (i in 1:length(p)) {
  #     pqmat[1:zp_ind[i], 1:zq_ind[i]] = 1 + pqmat[1:zp_ind[i], 
  #                                                 1:zq_ind[i]]
  #     qvec[1:zq_ind[i]] = 1 + qvec[1:zq_ind[i]]
  #   }
  #   # matrix [i,j] is 1+ number of SNPs with
  #   # zp>xval2[i]; zq>yval2[i], only works for i in sub
  #   pqmat = 1 + pqmat
  #   qvec = 1 + qvec
  #   
  #   # qvec=1+ length(zq_ind)*(1-ecdf(zq_ind)(1:nv))
  #   # qvec[i] is 1+ number of SNPs with zq>yval[i],
  #   # only for i in sub
  #   
  #   cf_mat = outer(ptest, qvec)/pqmat  #pmat*qmat/pqmat
  #   # [i,j] has cFDR value at xtest[i], ytest[i]; only
  #   # valid for i in sub
  #   cf_mat = apply(cf_mat, 2, cummin)  # Smooth shape of L, see paragraph in methods
  #   
  #   l_new = rep(0, length(p))  # set to L in new method
  #   
  #   
  #   for (i in 1:length(indices)) {
  #     
  #     if (adj) {
  #       correctx = cummin((1 + ecdf(q[-indices[i]][which(p[-indices[i]] > 0.5)])(pval2) * length(p[-indices[i]])) /
  #                           (1 + ecdf(q[-indices[i]])(pval2) * length(p[-indices[i]])))
  #     } else correctx = rep(1, length(pval2))
  #     
  #     ccut[i] = ccut[i] * approx(yval2, correctx, zq[indices[i]], rule = 2)$y  #ff(rev(correctx),rev(yval2),zq[indices[i]]) #correctx[zq_ind[indices[i]]]
  #     
  #     pqnew = pqmat
  #     pqnew[1:zp_ind[indices[i]], 1:zq_ind[indices[i]]] = -1 + pqnew[1:zp_ind[indices[i]], 1:zq_ind[indices[i]]]  # remove contribution of current SNP
  #     qnew = qvec
  #     qnew[1:zq_ind[indices[i]]] = -1 + qnew[1:zq_ind[indices[i]]]
  #     cfx = apply(outer(ptest, qnew)/pqnew, 2, cummin)
  #     cfx = t(t(cfx - gx * xtest + gx * mx) * correctx)
  #     cxi = ccut[i]
  #     xv = suppressWarnings(apply(cfx, 2, function(x) ff(xtest, x, cxi)))  #approx(x,xtest,cxi,rule=2)$y))
  #     xv[which(xv < 0)] = 0
  #     xv[which(!is.finite(xv))] = 0
  #     xval2[i, ] = xv
  #     if (verbose) 
  #       print(i)
  #   }
  # }
  if (mode == 2) {
    
    if (adj) { # adj = TRUE: adjust cFDR values and hence curves L using estimate of Pr(H0|Pj<pj)
      correct = cummin((1 + ecdf(q[-fold][which(p[-fold] > 
                                                  0.5)])(pval2) * length(p[-fold]))/(1 + ecdf(q[-fold])(pval2) * length(p[-fold])))
      if (!is.null(indices)) 
        correct_ccut = approx(pval2, correct, 
                              q[indices], rule = 2)$y  #cummin((1+ecdf(pc[which(p>0.5)])(pc)*length(p))/(1+ rank(pc)))
    } else {
      correct = rep(1, length(pval2))  # adjustment factor for pc[i]
      correct_ccut = rep(1, length(ccut))
    }
    if (!is.null(indices)) 
      ccut = ccut * correct_ccut
    
    for (i in 1:length(yval2)) {
      w = which(zq[-fold] > yval2[i])
      if (length(w) >= 1) {
        cfsub = (1 + (1/length(w))) * ptest/(1 + 
                                               (1/length(w)) - ecdf(zp[-fold][w])(xtest))
        cfsub = cummin(cfsub)
        xval2[, i] = approx((cfsub - gx * xtest + 
                               gx * mx) * correct[i], xtest, ccut, 
                            rule = 2, f = 1)$y
      } else xval2[, i] = -qnorm((ccut/correct_ccut)/2)
    }
    
  }
  
  xval2[which(xval2 > -qnorm(p_threshold/2))] = -qnorm(p_threshold/2)
  if (closed) {
    yval2 = c(Inf, 0, yval2, Inf, Inf)
    xval2 = cbind(Inf, Inf, xval2, xval2[, nv], 
                  Inf)
  }
  
  if (scale[1] == "p") {
    X = 2 * pnorm(-abs(xval2))
    Y = 2 * pnorm(-abs(yval2))
  } else {
    X = xval2
    Y = yval2
  }
  
  return(list(x = X, y = Y))
}

vly() function

I’ve noticed that James has a vly function: “Return co-ordinates of L-regions for cFDR method using kernel density method”.


Comparing L curves

I compare the L curves generated from 4 different methods:

  1. Convert q to \([0,1]\) range and then use the vl() function (a sketch of one possible conversion is given below).
  2. Use the vl() function and then convert back to the original range.
  3. Use Chris’ vl2() function, which takes continuous q.
  4. Use James’ vly() function, which uses a kernel density estimate of P,Q.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.04688 0.26062 0.37302 0.72332 0.99354
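For method 1, one possible conversion (a sketch only; this may not be exactly what was used) is to fold the orderNorm-transformed scores onto the p-value scale, matching the zq = -qnorm(q/2) convention inside vl():

# Sketch of one way to do method 1: fold the normalised q scores into [0,1]
z_q <- orderNorm_obj$x.t       # normalised scores from the earlier transform
q01 <- 2 * pnorm(-abs(z_q))    # two-sided conversion onto the p-value scale
range(q01)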


Estimating joint distribution to integrate over

We found that using ecdf to estimate the joint distribution, \(f_{p,q}(p,q|H_0)\), struggled when Q had spikes at the extremes.

We found that using KDEs gave v values that never reached one…

I could try:

  1. Using the gaussian mixtures above.

  2. Using the KDE in James’ vly() function.

but I haven’t figured out how to extract the actual density functions yet… e.g. using approxfun() for the Gaussian mixture models above doesn’t look good…

final_dataset <- readRDS("final_dataset.RDS")
p <- final_dataset$p
ind <- which(p>1/2)

mca_df <- readRDS("res.mca.mjca.RDS")
q1 <- mca_df[,1][ind]

q1_mclust <- densityMclust(q1)

par(mfrow=c(1,3))

plot(q1_mclust, what = "density", data = q1, breaks = 100)
plot(q1, q1_mclust$density, main = "Density values from mclust")

approx_func <- approxfun(x = q1, y = q1_mclust$density)
## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to
## unique 'x' values
x.test <- seq(from = min(q1), to = max(q1), length.out = 10000)
y.test <- approx_func(x.test)

plot(x.test, y.test, main = "Extracted from approxfun")
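An alternative worth trying (assuming I am reading the mclust documentation correctly) is to evaluate the densityMclust fit at new points with its predict() method, rather than interpolating the per-observation densities with approxfun():

# Evaluate the fitted density at new points via mclust's predict method for
# densityMclust objects (returns density values by default)
y.pred <- predict(q1_mclust, newdata = x.test)
plot(x.test, y.pred, main = "Density from predict() on the mclust fit")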

How do we “use the GMM density estimate in generating the L curves”?