In this week's summary, I use iteration 2 as the running example throughout, where \(p=v_1\) comes from the first iteration and \(q_2\) is dimension 5 from the PCAmix results.
Dimension 5 picks out promoter regions:
However, even though dimension 5 is picked as most significant, it is not monotonic in \(p\).
To make \(q\) monotonic, we fit a spline to find the turning point (nadir) and fold the distribution at this point. Specifically, we separate \(q\) into pre- and post-nadir parts and take the difference. This means that we can scale each side separately if we wish.
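A minimal sketch of one way to read the folding step, not the code actually used. It assumes the spline is fitted to \(-\log_{10}(p)\) as a function of \(q\), that "taking the difference" means the distance from the nadir on each side, and that the function name `fold_at_nadir` and the per-side factors `left_scale` and `right_scale` are hypothetical.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline


def fold_at_nadir(q, y, left_scale=1.0, right_scale=1.0):
    """Fold q at the minimum of a cubic smoothing spline of y on q.

    y is a per-SNP summary of p (assumed here: -log10(p)).
    left_scale and right_scale are hypothetical per-side scaling factors.
    """
    order = np.argsort(q)  # the spline fit needs increasing x values
    spline = UnivariateSpline(q[order], y[order], k=3)  # default smoothing
    grid = np.linspace(q.min(), q.max(), 1001)
    nadir = grid[np.argmin(spline(grid))]  # grid point where the fit is lowest
    diff = q - nadir
    # Distance from the nadir, with each side scaled separately.
    folded = np.where(diff < 0, -left_scale * diff, right_scale * diff)
    return folded, nadir


# Synthetic check: a U-shaped relationship with its minimum at q = 0.5.
rng = np.random.default_rng(1)
q = rng.uniform(-3, 3, size=2000)
y = (q - 0.5) ** 2 + rng.normal(scale=0.5, size=q.size)
folded, nadir = fold_at_nadir(q, y)
```

The folded values are non-negative and increase with distance from the nadir in either direction, so a relationship that was U-shaped in the original \(q\) becomes roughly monotonic in the folded \(q\).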
When using this transformed \(q\) in our method, the results look reasonable, although it looks like some spline correction is required: some values are being shrunk too much and others too little.
I consider the spline correction method for this iteration. A spline with an arbitrarily chosen nknots=5 is shown in red. It would be nice to use the standard errors (SEs) of the fit to decide parameter values.
Set \(v_0\) equal to the original \(p\)-values for the principal trait and \(i=1\):
1. Regress \(\log(v_{i-1})\) against the coordinates from PCAmix (removing \(q_j\) for \(j<i\)).
2. Set \(q_i\) equal to the PCAmix coordinates from the dimension giving the largest absolute t-statistic (stop if nothing is significant).
3. Make \(q_i\) monotonic in \(v_{i-1}\).
4. Perform functional cFDR on \((v_{i-1}, q_i)\) to obtain \(v_i\). Set \(i=i+1\) and go to step 1.
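The regression and selection steps of the loop above can be sketched as follows. This is only an illustration under stated assumptions: ordinary least squares with an intercept, a hypothetical stopping threshold `t_crit=1.96` with no multiple-testing adjustment, and a hypothetical function name `pick_dimension`. The monotonic transformation and the functional cFDR step are not shown.

```python
import numpy as np


def pick_dimension(v_prev, coords, used=(), t_crit=1.96):
    """Regress log(v_prev) on the unused coordinate columns.

    Returns (index of the column with the largest |t|, dict of t-statistics),
    or (None, dict) if no |t| exceeds t_crit (hypothetical stopping rule).
    """
    used = set(used)
    keep = [j for j in range(coords.shape[1]) if j not in used]
    X = np.column_stack([np.ones(len(v_prev)), coords[:, keep]])
    y = np.log(v_prev)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof  # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    t = beta[1:] / se[1:]  # drop the intercept
    tstats = dict(zip(keep, t))
    best = keep[int(np.argmax(np.abs(t)))]
    if abs(tstats[best]) < t_crit:
        return None, tstats
    return best, tstats


# Synthetic check: p-values that depend only on column 3 of the coordinates.
rng = np.random.default_rng(0)
coords = rng.normal(size=(5000, 5))
v_prev = rng.uniform(size=5000) ** np.exp(0.5 * coords[:, 3])
best, tstats = pick_dimension(v_prev, coords)
```

On later iterations, passing the already-selected dimensions through `used` drops them from the regression, mirroring the removal of \(q_j\) for \(j<i\).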
Comments and queries
I’ve used exp(log(x)-log(y)) when dividing two potentially small quantities (for kgrid and cgrid (p/kgrid)). Nothing (visible) changed, but I’ll add it to version 5.
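A small illustration of the log-space division, with a hypothetical helper name `ratio_from_logs`. The values are chosen so that both quantities underflow to zero in double precision, which is where the trick matters: it helps when the quantities are carried on the log scale, whereas for two values that are already representable, dividing directly gives essentially the same answer (consistent with nothing visible changing).

```python
import math


def ratio_from_logs(log_x, log_y):
    """Return x / y given log(x) and log(y), without forming x or y."""
    return math.exp(log_x - log_y)


# Both quantities are far below the smallest positive double.
log_x, log_y = -800.0, -805.0
direct_x = math.exp(log_x)  # underflows to 0.0
direct_y = math.exp(log_y)  # underflows to 0.0, so direct_x / direct_y fails
ratio = ratio_from_logs(log_x, log_y)  # exp(5), computed safely
```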
Shared controls: James found that approximating \(P(P\leq p|H_0^p, Q\leq q)\) by \(p\) is biased when there are shared controls (and derived a new approximation that allows for shared controls). I assume I’m fine to use \(p\) here? (https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004926)
Leave-one-out: James’ new paper states that a leave-one-out method must be used for \(P(v\leq \alpha)\leq \alpha\) to hold, but I am not using a leave-one-out method? (bottom of page 7; https://www.biorxiv.org/content/10.1101/414318v3.full.pdf)
Difficulty writing up the L-curve section: James talks about adding test points, convergence, etc. (page 7; https://www.biorxiv.org/content/10.1101/414318v3.full.pdf), but am I OK to just say that the L-curves are the contours of our estimated cFDR curves?
What would be an example of the wider non-disease specific structure/variability picked out by non-significant dimensions?
Monotonicity: Am I correct to say that the cFDR framework requires a roughly monotonic relationship between \(p\) and \(q\)? Or is it that it requires an absence of non-monotonicity (because it works for independent \(p\) and \(q\))?
Manuscript: Am I OK to only compare our method to FINDOR (which its authors found to be better than S-FDR, GBH, IHW and GenoWAP)? Perhaps I could also compare to GenoWAP; their software looks easy-ish to use and I can use their GenoCanyon scores for my method too. (https://github.com/rlpowles/GenoWAP-V1.2)
Manuscript: I need a simulation section. Basic idea: simulate a GWAS using simGWAS (specifying CVs, effect sizes and sample sizes) with e.g. 1000 Genomes haplotype data. From the FINDOR paper: “To induce functional enrichment, we altered the prior probability that a SNP was selected to be causal, setting this to be proportional to \(Var(\beta_j)=\sum_C C(j) \tau_c\) where \(\tau_c\) is the effect size of annotation c.” (https://www.sciencedirect.com/science/article/pii/S0002929718304117)
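A rough sketch of the causal-variant selection described in the FINDOR quote, to pin down the idea before writing the section. It covers only the enrichment step; the GWAS simulation itself (simGWAS) is not shown. The annotation matrix, the \(\tau\) values and the function name `sample_causal` are all made up for illustration.

```python
import numpy as np


def sample_causal(annot, tau, n_causal, rng):
    """Sample causal SNP indices with probability proportional to annot @ tau.

    annot: (n_snps, n_annotations) matrix of annotation values per SNP.
    tau:   (n_annotations,) per-annotation effect sizes.
    """
    weights = annot @ tau  # per-SNP Var(beta_j) under the quoted model
    if np.any(weights < 0) or weights.sum() <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    prob = weights / weights.sum()
    causal = rng.choice(len(prob), size=n_causal, replace=False, p=prob)
    return causal, prob


# Made-up example: a baseline annotation plus one binary annotation
# covering the first 100 of 1000 SNPs.
rng = np.random.default_rng(42)
annot = np.ones((1000, 2))
annot[:, 1] = 0.0
annot[:100, 1] = 1.0
tau = np.array([1.0, 9.0])  # annotated SNPs get 10x the baseline weight
causal, prob = sample_causal(annot, tau, n_causal=50, rng=rng)
frac_annotated = np.mean(causal < 100)
```

With these made-up values, annotated SNPs make up 10% of the SNPs but carry just over half of the total prior weight, so the sampled CVs should be visibly enriched for the annotation.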