30th July

Simulations

I simulate GWAS results for 114,310 SNPs on chromosome 22, including 37 CVs.

I consider 3 use cases:

TFBS (causals +/- 5bp) –> 38 SNPs
ATAC/ChIP-seq peaks (causals +/- 250bp) –> 124 SNPs
Open chromatin regions (causals +/- 1000bp) –> 366 SNPs

[I have the code ready, but need to finalise the functional cFDR function first].

Independent uniform

I first consider simulations where $q$ is sampled from a uniform distribution, so that I can compare results to those using empirical cFDR.

Sample $q$ from w*exp(-rexp(x, rate = 1)) + (1-w)*exp(-rexp(x, rate = 1/2)) where w <- sample(seq(0.3, 0.7, 0.1), 1).

I need to ensure that the correlation between p and q is positive (for both the full dataset and the independent subset).

Independent mixture beta $q$

Sample $q$ from 0.99*rbeta(x, shape1 = 30, shape2 = 0.5) + 0.01*rbeta(x, shape1 = 0.5, shape2 = 30).

Dependent mixture beta $q$

Sample $q$ at null SNPs from 0.99*rbeta(x, shape1 = 30, shape2 = 0.5) + 0.01*rbeta(x, shape1 = 0.5, shape2 = 30) and $q$ at “causals” from 0.01*rbeta(x, shape1 = 30, shape2 = 0.5) + 0.99*rbeta(x, shape1 = 0.5, shape2 = 30) where “causals” are chosen as those SNPs defined in our 3 use cases above.

Comparing to James

Note that in the independent simulations, the correlation between p and q is close to 0. Infact, it may be (just) $>0$ in the whole dataset but (just) $<0$ in the independent subset of SNPs. This is what was responsible for the different colouring between mine and James’ method. I amend the cFDR function to return an error if the sign of the correlation is different between the whole dataset and the independent subset.

I compare the results from James’ KDE method (vly()), James’ empirical method (vl()) and my method for independent uniform $q$.

Since James’ KDE method is most similar to mine, I directly compare the results from functional cFDR to James’ KDE cFDR (vly()). For high $q$, James’ method increases the $p$ values more and for low $q$, James’ method decreases the $p$ values (a lot) more. Also need to investigate the gap in points..

I step through mine and James’ KDE code to see what may be causing this descrepancy.

Firstly, both methods estimate the bivariate density.

vly(): kde2d(zp[sub], zq[sub], n=200, lims=c(0,12,0,12))
cFDR(): kde2d(c(zp, -zp), c(q, q), n = c(300, 500), lims = lims)

I.e. we use a folding method so that the KDE doesn’t have to “come down” at p=1 (zp=0). [Note, results similar if I use n=200 and lims=c(0,12,0,12) in my method].

I then incorporate the adjustment directly (so that various terms cancel). Whereas James incorporates the adjustment seperately. My L curves are defined as the contours of the cFDR curves, whereas James’ are approximated using approx(cfx,xtest,ccut/correct[i],rule=2,method="const",f=1)$y where correct[i] incorporates the adjustment.

James’ L curves are defined on [0,1] whereas mine are defined on [lims[3], lims[4]].

Summarising annotation data

PCAmix: Current method, but possible non-monotonicity problem that orthogonal rotations (e.g. varimax) don’t seem to fix.
Non-negative matrix factorisation: but the following annotations have negative entries MAF_Adj_Predicted_Allele_Age; MAF_Adj_LLD_AFR; MAF_Adj_ASMC and I can’t see how I can easily make them non-negative.

To do

Finalise code.
Run simulations.
Run on Asthma data (once I’ve sorted out some dimensionality reduction for the annotation data).
Make R package.

30th July

Anna Hutchinson

Simulations

Independent uniform

Independent mixture beta \(q\)

Dependent mixture beta \(q\)

Comparing to James

Summarising annotation data

To do