Non-monotonic q


I explore how the cFDR methods perform with a non-monotonic \(q\), that is, where both low \(q\) and high \(q\) are enriched for low \(p\) (and should both therefore generate smaller \(v\)-values).

As expected, the methods cannot simultaneously shrink \(p\)-values for low \(q\) and high \(q\). Also note that James’ \(v\)-values extend slightly above 1 (1.002956). This highlights that the cFDR methods require the relationship between \(p\) and \(q\) to be monotonic (in that low \(q\) are enriched for low \(p\)). The methods also work for independent \(q\) where noise is added (see below). I’ll come back to this later (for the Asthma data).
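
For reference, a minimal sketch of one way such a non-monotonic \(q\) could be simulated (an illustrative construction only, not necessarily the one used here): SNPs with low \(p\) draw \(q\) from a U-shaped distribution (mass at both ends), while the rest draw \(q\) from a distribution concentrated in the middle.

n <- length(p)
signal <- runif(n) < (1 - p)           # low-p SNPs are more likely to be flagged
q <- ifelse(signal,
            rbeta(n, 0.3, 0.3),        # U-shaped: both low and high q enriched for low p
            rbeta(n, 5, 5))            # concentrated around 0.5 otherwise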


Simulations


I simulate GWAS \(p\)-values using simGWAS:
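
A minimal sketch along the lines of the simGWAS vignette (the haplotypes, sample sizes and effect sizes below are illustrative placeholders rather than the values actually used, and no LD is built in):

library(simGWAS)

set.seed(1)
nsnps <- 100
nhaps <- 1000
maf <- runif(nsnps, 0.05, 0.5)
haps <- sapply(maf, function(f) rbinom(nhaps, 1, f))    # 0/1 haplotype matrix
snps <- colnames(haps) <- paste0("s", 1:nsnps)

freq <- as.data.frame(haps + 1)                         # haplotypes coded 1/2
freq$Probability <- 1 / nrow(freq)                      # equal haplotype weights

CV <- sample(snps, 2)                                   # causal variants
FP <- make_GenoProbList(snps = snps, W = CV, freq = freq)
z <- simulated_z_score(N0 = 10000, N1 = 10000, snps = snps, W = CV,
                       gamma.W = log(c(1.4, 1.2)), freq = freq,
                       GenoProbList = FP, nrep = 1)
p <- 2 * pnorm(-abs(z))                                 # GWAS p-values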

I begin with 2 types of simulations, so that I can compare results with empirical cFDR.

  1. Independent uniform \(q\) (+ compare with fold removal empirical cFDR) - no left censoring or removing of L-curve segments
x <- readRDS("simulated_GWAS.RDS")
p <- x$p
n <- length(p)

q <- runif(n, 0, 1)
  2. Dependent uniform \(q\) representing \(p\)-values for a related trait (+ compare with fold removal empirical cFDR) - no left censoring or removing of L-curve segments
n <- length(p)
prob <- runif(n)<p
q <- ifelse(prob == TRUE, exp(-rexp(n)), exp(-rexp(n, rate = 0.5)))

Independent uniform walkthrough


I step through the functional cFDR method for an independent uniform \(q\). For speed, I only run on the subset of independent SNPs.

First, I use kde2d (from the MASS package) to estimate the bivariate density of \(p\) and \(q\), and use this to derive an estimate of \(P(Q\leq q, P\leq p)\).
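
A rough sketch of this step (illustrative only - the transformation, grid size and variable names here are assumptions, not the exact implementation):

library(MASS)

zp <- -qnorm(p / 2)                          # p on the z-score scale (P <= p  <=>  ZP >= zp)
kde <- kde2d(zp, q, n = 100)                 # kde$z[i, j]: density at (kde$x[i], kde$y[j])
cell <- diff(kde$x[1:2]) * diff(kde$y[1:2])
mass <- kde$z * cell                         # approximate probability mass per grid cell

# cumulate downwards in zp (large zp = small p) and upwards in q
Fpq <- apply(mass, 2, function(col) rev(cumsum(rev(col))))   # over p, within each q column
Fpq <- t(apply(Fpq, 1, cumsum))                              # then over q, within each p row
# Fpq[i, j] now approximates P(P <= p, Q <= q) at grid point (i, j)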

The next step is to estimate \(P(Q\leq q|H0)\). For this, I use locfdr to estimate \(P(H0|P=p)\):
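
A sketch of this step (again illustrative: locfdr expects roughly symmetric z-scores, so one simple option is to mirror the folded z-scores; the actual implementation may handle this differently):

library(locfdr)

lfdr_fit <- locfdr(c(zp, -zp), nulltype = 1, plot = 0)   # mirrored folded z-scores
p_h0_given_p <- lfdr_fit$fdr[seq_along(zp)]              # ~ P(H0 | P = p[i]) for each SNP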

I then use this, along with the bivariate KDE shown above to estimate

\[\begin{aligned} P(P,Q,H0) &= P(H0|P,Q) \times P(P,Q)\\ &\approx P(H0|P) \times P(P,Q) \end{aligned}\]

and then sum up the rows (p) to estimate \(P(Q,H0)\) and sum up the rows and columns (p and q) to estimate \(P(H0)\). I use these to estimate \(P(Q|H0)=P(Q,H0)/P(H0)\).

Finally, I cumulatively sum over the columns (q) to estimate our desired quantity, \(P(Q\leq q|H0)\).
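
Continuing the sketch above (still illustrative), these steps amount to multiplying the bivariate mass by \(P(H0|P)\) evaluated on the grid, marginalising, and cumulatively summing over \(q\):

# P(H0 | P) interpolated onto the zp grid values
h0_grid <- approx(zp, p_h0_given_p, xout = kde$x, rule = 2, ties = mean)$y

joint_h0 <- mass * h0_grid                 # P(P, Q, H0) on the grid (h0_grid recycled down columns)
p_q_h0 <- colSums(joint_h0)                # sum over p:        P(Q, H0)
p_h0 <- sum(joint_h0)                      # sum over p and q:  P(H0)
p_q_given_h0 <- p_q_h0 / p_h0              # P(Q | H0)
p_qleq_given_h0 <- cumsum(p_q_given_h0)    # P(Q <= q | H0)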

kgrid is then \(P(Q\leq q, P\leq p)/P(Q\leq q|H0)\):

This is then used to generate the cFDR curves, \(P(H0|P\leq p, Q\leq q)\approx p/kgrid\).
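
In the notation of the sketch above, that is roughly:

kgrid <- sweep(Fpq, 2, p_qleq_given_h0, "/")   # P(Q <= q, P <= p) / P(Q <= q | H0)
pgrid <- 2 * pnorm(-kde$x)                     # p-value at each zp grid value
cgrid <- pgrid / kgrid                         # ~ P(H0 | P <= p, Q <= q) (cFDR surface)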


The L-curves are the contours of the cFDR curves at the observed cFDR values, and the bivariate null is integrated over the L-curves to generate the \(v\)-values:
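
A crude grid-based illustration of this idea (the actual method traces the L-curves explicitly and integrates the bivariate null over them; this sketch just approximates the L-region column-by-column on the grid, using the fact that \(P|H0\) is uniform and \(Q|H0\) has the distribution estimated above):

v <- numeric(length(p))
p_row <- sapply(zp, function(zi) which.min(abs(kde$x - zi)))   # nearest grid row for each SNP
q_col <- sapply(q,  function(qi) which.min(abs(kde$y - qi)))   # nearest grid column for each SNP

for (i in seq_along(p)) {
  thresh <- cgrid[p_row[i], q_col[i]]          # cFDR value at the observed (p, q)
  # for each q column, the largest p-value still inside the L-region
  p_star <- sapply(seq_along(kde$y), function(j) {
    inside <- which(cgrid[, j] <= thresh)
    if (length(inside) == 0) 0 else max(pgrid[inside])
  })
  v[i] <- sum(p_star * p_q_given_h0)           # integrate the null over the L-region
}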

These results are a “tighter” version of those from empirical cFDR:


Independent uniform results



Note that both of James’ methods give maximum \(v\)-values \(>1\) (~1.0029), so I set these equal to 1 at each iteration.
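
Assuming the \(v\)-values from an iteration are stored in a vector v, this cap is simply:

v <- pmin(v, 1)   # cap v-values at 1 before the next iteration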

I compare James’ empirical cFDR, James’ KDE cFDR and functional cFDR for one iteration. The colouring of the two KDE methods matches.


I drop the comparison with James’ KDE method (vly()) as it will not be included in the manuscript (it also tails off for low \(p\), and there is no gx parameter to fix this).

Note that the fold removal method (for LD blocks) was used in empirical cFDR (or we get horrible stripy results).


QQ plots for functional cFDR

Encouragingly, the QQ plots show that the distribution of the \(v\)-values is unchanged after iterating over independent uniform q.
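
A minimal sketch of the kind of QQ plot meant here, assuming the original \(p\)-values are in p and the \(v\)-values after iterating are in v:

qqplot(-log10(p), -log10(v),
       xlab = expression(-log[10](p)), ylab = expression(-log[10](v)))
abline(0, 1, col = "red")   # points on the diagonal => distribution unchanged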


QQ plots for empirical cFDR

I think this shows that empirical cFDR systematically decreases all points?


Dependent uniform results


I simulate auxiliary data that “looks like” \(p\)-values from a related trait. If a random uniform variable is less than \(p\), then \(q\) is sampled from exp(-rexp(n, rate = 1)); if it is greater than or equal to \(p\), then \(q\) is sampled from exp(-rexp(n, rate = 0.5)).

Note that the fold removal method (for LD blocks) was used in empirical cFDR.

n <- length(p)
prob <- runif(n)<p
q <- ifelse(prob == TRUE, exp(-rexp(n)), exp(-rexp(n, rate = 0.5)))
hist(q, main = paste0("Histogram of q: Corr = ", round(cor(p, q),5)))



QQ plots for functional cFDR


QQ plots for empirical cFDR


Other q (1)


I consider a crazy distribution for \(q\) based on the 3 use-cases we discussed (although the correlation is still only 0.02):


Other q (2)


I next consider a mixture of normal distributions; note that it needs to be this extreme to boost the correlation.

mixture_comp1 <- function(x) rnorm(x, sample(c(3,4,5), 1), 1)
mixture_comp2 <- function(x) rnorm(x, sample(c(-3,-4,-5), 1), 0.5)
weight <- 0.01

n <- length(p)
z <- runif(n)
  
# sample non-"causals" (defined as 3 use-cases) from 0.99*MC1 + 0.01*MC2
q <- ifelse(z>weight, mixture_comp1(n), mixture_comp2(n))
  
# sample "causals" (defined as 3 use-cases) from 0.01*MC1 + 0.99*MC2
causals <- which(data$open==1)
z_causals <- runif(length(causals))
q[causals] <- ifelse(z_causals<=weight, mixture_comp1(length(causals)), mixture_comp2(length(causals)))

Using gridp=50, the “functional SNPs” are all left-censored anyway, so here I use ngridp=10.



QQ plots for functional cFDR


Asthma results


We could have 2 sections within the Asthma GWAS results section showing how our method can be used in different ways:

  1. Comparing functional cFDR and FINDOR using baseline LD model annotations (i.e. dimensionality reduction and iterating) - a possible follow-on paper could cover how to actually do this, including folding etc.

  2. Comparing functional cFDR and GenoWAP using GenoCanyon scores. We originally ruled this one out because the correlation between p and the GenoCanyon scores was very low, but it is actually about the same as for the baseline LD annotations (albeit we’re not iterating). I think it is important to include an example of leveraging functional significance scores, as I predict this is how most people would use our method (i.e. if they have these “scores”, it is very easy to apply one iteration of our method and get some nice results). We need to think about how to compare the outputs (PPs and p-values) - see the figures here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5963360/pdf/btv610.pdf (https://annahutch.github.io/PhD/9july29.html)


Although... it doesn’t find anything new in the Asthma data.


We could also potentially compare with empirical cFDR, as the scores are between 0 and 1, but I’m having difficulties running vl() on the 2 million SNPs (it takes > 20 hours and > 64 GB of memory), probably due to how the L-curves are calculated (one coordinate at a time). Note that this is also without any fold removal.


Comments + Queries