I explore how the cFDR methods perform with a non-monotonic \(q\), that is, where both low \(q\) and high \(q\) are enriched for low \(p\) (and should both therefore generate smaller \(v\)-values).
As expected, the methods cannot simultaneously shrink \(p\)-values for low \(q\) and high \(q\). Also note that James’ \(v\)-values extend slightly above 1 (1.002956). This highlights that the cFDR methods require the relationship between \(p\) and \(q\) to be monotonic (in that low \(q\) are enriched for low \(p\)). The methods also work for independent \(q\) where noise is added (see below). I’ll come back to this later (for the Asthma data).
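For reference, a sketch of how such a non-monotonic \(q\) could be simulated (assuming a vector of GWAS \(p\)-values p, e.g. those simulated below; the tail distributions are illustrative):

n <- length(p)
prob <- runif(n) < p                        # TRUE mostly for larger p
q_low  <- exp(-rexp(n, rate = 0.5))         # concentrated near 0
q_high <- 1 - exp(-rexp(n, rate = 0.5))     # concentrated near 1
# SNPs with small p draw q from either tail at random; the rest get ~U(0, 1)
q <- ifelse(prob, exp(-rexp(n)),
            ifelse(runif(n) < 0.5, q_low, q_high))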
I simulate GWAS \(p\)-values using simGWAS:
I begin with 2 types of simulations, so that I can compare results with empirical cFDR.
x <- readRDS("simulated_GWAS.RDS")
p <- x$p
n <- length(p)

# simulation 1: q independent of p (uniform on (0, 1))
q <- runif(n, 0, 1)

# simulation 2: q dependent on p - SNPs with small p are more likely to
# draw q from the distribution concentrated near 0
prob <- runif(n) < p
q <- ifelse(prob, exp(-rexp(n)), exp(-rexp(n, rate = 0.5)))
I step through the functional cFDR method for an independent uniform \(q\). For speed, I only run on the subset of independent SNPs.
Firstly, we use kde2d to estimate the bivariate density and use this to derive an estimate for \(P(Q\leq q, P\leq p)\).
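A rough sketch of this step (the z-scale transformation, the grid size and the Riemann-sum approximation are my assumptions; bandwidths are left at kde2d's defaults):

library(MASS)

zp <- -qnorm(p/2)                  # p-values on the (absolute) z-scale
dens <- kde2d(zp, q, n = 201)      # rows of dens$z index zp, columns index q

# approximate P(P <= p, Q <= q) = P(Z >= z, Q <= q) by summing grid cells
cell   <- diff(dens$x[1:2]) * diff(dens$y[1:2])
cum_z  <- apply(dens$z, 2, function(col) rev(cumsum(rev(col))))  # sum over z' >= z
cum_pq <- t(apply(cum_z, 1, cumsum)) * cell                      # then over q' <= q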
The next step is to estimate \(P(Q\leq q|H0)\). For this, I use locfdr to estimate \(P(H0|P=p)\):
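A minimal sketch of this step (attaching random signs so that the z-scores are roughly symmetric about zero is my assumption, as are the locfdr arguments):

library(locfdr)

zz  <- -qnorm(p/2) * sample(c(-1, 1), length(p), replace = TRUE)
fit <- locfdr(zz, plot = 0)
p_h0_given_p <- fit$fdr            # estimated P(H0 | P = p) for each SNP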
I then use this, along with the bivariate KDE shown above, to estimate
\[\begin{equation} \begin{split} P(P,Q,H0)&=P(H0|P,Q) \times P(P,Q)\\ &\approx P(H0|P) \times P(P,Q) \end{split} \end{equation}\]
and then sum over the rows (p) to estimate \(P(Q,H0)\), and over both rows and columns (p and q) to estimate \(P(H0)\). I use these to estimate \(P(Q|H0)=P(Q,H0)/P(H0)\).
Finally, I cumulatively sum over the columns (q) to estimate the desired quantity, \(P(Q\leq q|H0)\).
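Continuing the sketch above (interpolating \(P(H0|P=p)\) onto the KDE grid is my assumption about how the pieces fit together):

# put P(H0 | P = p) onto the zp grid of the KDE
p_h0_grid <- approx(abs(zz), p_h0_given_p, xout = dens$x, rule = 2)$y

# P(P, Q, H0) ~ P(H0 | P) * P(P, Q) on the grid (rows index p/z, columns index q)
joint_h0 <- dens$z * p_h0_grid * cell

p_q_h0       <- colSums(joint_h0)          # sum over p: P(Q, H0)
p_h0         <- sum(joint_h0)              # sum over p and q: P(H0)
p_q_given_h0 <- p_q_h0 / p_h0              # P(Q | H0) on the q grid

p_qle_given_h0 <- cumsum(p_q_given_h0)     # cumulative sum over q: P(Q <= q | H0)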
kgrid is then \(P(Q\leq q, P\leq p)/P(Q\leq q|H0)\):
This is then used to generate the cFDR curves, \(P(H0|P\leq p, Q\leq q)\approx p/kgrid\).
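In the sketch's notation this would be (mapping the z grid back to \(p\)-values is an assumption):

kgrid <- sweep(cum_pq, 2, p_qle_given_h0, "/")   # P(P <= p, Q <= q) / P(Q <= q | H0)

p_grid    <- 2 * pnorm(-dens$x)                  # p-value at each point of the z grid
cfdr_grid <- p_grid / kgrid                      # cFDR ~ P(H0 | P <= p, Q <= q)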
The L-curves are the contours of the cFDR curves at the observed cFDR values, and the bivariate null is integrated over the regions bounded by the L-curves to generate the \(v\)-values:
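A very rough sketch of this step for a single SNP, assuming \(p\) is uniform and independent of \(q\) under \(H0\) (the real L-curve construction handles non-monotone contours more carefully; the function name is hypothetical):

# v-value for a single SNP with observed p0, q0
v_value <- function(p0, q0) {
  # cFDR at the observed point (nearest grid cell)
  i <- which.min(abs(p_grid - p0))
  j <- which.min(abs(dens$y - q0))
  level <- cfdr_grid[i, j]

  # L-curve: for each q on the grid, the largest p with cFDR <= level
  l_of_q <- apply(cfdr_grid, 2, function(col) max(c(0, p_grid[col <= level])))

  # integrate the bivariate null over the region enclosed by the L-curve:
  # v = sum_q l(q) * P(Q = q | H0)
  sum(l_of_q * p_q_given_h0)
}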
These results are a “tighter” version of those from empirical cFDR:
Note that both of James’ methods give maximum \(v\)-values \(>1\) (~1.0029), so I set these equal to 1 at each iteration.
I compare James’ empirical cFDR, James’ KDE cFDR and functional cFDR for one iteration. The colouring of the two KDE methods matches.
I drop the comparison with James’ KDE method (vly()) as this will not be included in the manuscript (it also tails off for low \(p\) and there is no gx parameter to fix this).
Note that the fold removal method (for LD blocks) was used in empirical cFDR (or we get horrible stripy results).
Encouragingly, the QQ plots show that the distribution of the \(v\)-values is unchanged after iterating over independent uniform q.
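A sketch of this check (v_before and v_after are placeholder names for the \(v\)-values before and after an iteration):

qqplot(v_before, v_after,
       xlab = "v-values before iterating", ylab = "v-values after iterating")
abline(0, 1, col = "red")    # points on the identity line => distribution unchanged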
I think this shows that empirical cFDR systematically decreases all points?
I simulate auxiliary data that “looks like” \(p\)-values from a related trait. If a random uniform variable is less than \(p\) then \(q\) is sampled from exp(-rexp(n, rate = 1)); if it is greater than or equal to \(p\) then \(q\) is sampled from exp(-rexp(n, rate = 0.5)).
Note that the fold removal method (for LD blocks) was used in empirical cFDR.
n <- length(p)
prob <- runif(n) < p                       # TRUE with probability p
q <- ifelse(prob, exp(-rexp(n)),           # ~U(0, 1) when prob is TRUE
            exp(-rexp(n, rate = 0.5)))     # concentrated near 0 otherwise
hist(q, main = paste0("Histogram of q: Corr = ", round(cor(p, q), 5)))
I consider a crazy distribution for \(q\) based on the 3 use-cases we discussed (although the correlation is still only 0.02):
I next consider a mixture of normal distributions; note that it needs to be this extreme to boost the correlation.
mixture_comp1 <- function(x) rnorm(x, sample(c(3,4,5), 1), 1)
mixture_comp2 <- function(x) rnorm(x, sample(c(-3,-4,-5), 1), 0.5)
weight <- 0.01
n <- length(p)
z <- runif(n)
# sample non-"causals" (defined as 3 use-cases) from 0.99*MC1 + 0.01*MC2
q <- ifelse(z>weight, mixture_comp1(n), mixture_comp2(n))
# sample "causals" (defined as 3 use-cases) from 0.01*MC1 + 0.99*MC2
causals <- which(data$open==1)
z_causals <- runif(length(causals))
q[causals] <- ifelse(z_causals<=weight, mixture_comp1(length(causals)), mixture_comp2(length(causals)))
Using gridp=50, the “functional SNPs” are all left-censored anyway. Here, I use ngridp=10.
We could have 2 sections within the Asthma GWAS results section showing how our method can be used in different ways:
Comparing functional cFDR and FINDOR using baseline LD model annotations (i.e. dimensionality reduction and iterating) - a possible follow-on paper on how to actually do this, including folding etc.
Comparing functional cFDR and GenoWAP using GenoCanyon scores. We originally ruled this one out as the correlation between p and GenoCanyon scores was very low, but actually it is about the same as for the baseline LD annotations (albeit we’re not iterating). I think it is important to include an example of leveraging functional significance scores as I predict this is how most people would use our method (i.e. if they have these “scores” it’s very easy to apply one iteration of our method and get some nice results). We need to think about how to compare the outputs (PPs and p-values) - see the figures here https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5963360/pdf/btv610.pdf (https://annahutch.github.io/PhD/9july29.html)
Although… it doesn’t find anything new in the Asthma data.
We could also potentially compare with empirical cFDR as the scores are between 0 and 1, but I’m having difficulties running vl() on the 2 million SNPs (it takes > 20 hours and > 64 GB), probably due to how the L-curves are calculated (one coordinate at a time). (Also, this is without any fold removal!)
Comments + Queries
The uniform \(q\) simulations do not use left-censoring or removal of problematic L-curve segments.
It would be nice to compare timings between functional cFDR and empirical cFDR with fold removal (as functional cFDR is faster, and empirical cFDR doesn’t really work at all for SNPs in the millions, even without fold removal), but this is difficult as there are several parameters: the number of SNPs, the number of independent SNPs (for functional cFDR) and the number of folds (for empirical cFDR). In these simulations, functional cFDR takes 10 minutes and empirical cFDR (with fold removal) takes 30 minutes.
Could the reason why empirical cFDR re-weights \(p\)-values more than functional cFDR be because the correlation between \(p\) and \(q\) tends to be lower in the independent subset we run functional cFDR on?