Investigating correlations
When we decided to leverage log-transformed \(p\)-values in our related-trait simulations, we found that the correlation between the simulated traits was too high. Specifically, the Pearson correlation between log-transformed \(p\)-values for the related traits (\(q^k\)) jumped to \(\approx 0.2\) (from \(\approx 0.07\) when using raw \(p\)-values) and the FDR was no longer controlled.
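As a quick check, both correlations can be computed directly from the simulated \(p\)-value vectors. A minimal sketch, assuming `p` and `qk` hold the \(p\)-values for the two related traits in one simulation and taking \(-\log_{10}\) as the log transform (variable names are mine, not from the simulation code):

```r
# Pearson correlation on the raw p-values
cor(p, qk, method = "pearson")

# Pearson correlation on the log-transformed p-values;
# this is the correlation reported as ~0.2 above
cor(-log10(p), -log10(qk), method = "pearson")
```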
In my old simulations, \(p,q^k\) shared causal variants in 4 out of the 24 LD blocks, and \(q^k,q^l\) also shared causal variants in 4 out of the 24 LD blocks.
In my new simulations, \(p,q^k\) share causal variants in 2 out of the 24 LD blocks, and \(q^k,q^l\) share causal variants in 1 out of the 24 LD blocks.
This makes the correlations slightly more representative of those we see in real data (although the correlations in Tom’s PID data are tiny).
In the plots below, “real data” refers to data I collected from Guille’s realm for T1D (Cooper et al. 2017) and related traits: RA (Ishigaki et al. 2020), UC (de Lange et al. 2017), CD (de Lange et al. 2017), PSC (Ji et al. 2017), PBC (Cordell et al. 2015) and CEL (Trynka et al. 2011). I also extracted GWAS \(p\)-values for T2D (Mahajan et al. 2018) to use as a negative control (shown in blue). I removed chromosome 6 so that the extremely significant \(p\)-values in the MHC do not bias the results.
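A minimal sketch of this preprocessing, assuming the summary statistics for each trait have been read into data frames with `snp`, `chr` and `p` columns (column names are hypothetical, shown here for T1D and RA):

```r
# Drop chromosome 6 so that the MHC does not dominate the correlations
t1d <- t1d[t1d$chr != 6, ]
ra  <- ra[ra$chr != 6, ]

# Align the two traits on shared SNPs, then compute the same raw and
# log-scale Pearson correlations as in the sketch above
shared <- merge(t1d, ra, by = "snp", suffixes = c("_t1d", "_ra"))
cor(shared$p_t1d, shared$p_ra, method = "pearson")
cor(-log10(shared$p_t1d), -log10(shared$p_ra), method = "pearson")
```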
The first column of plots is for raw \(p\)-values, the second for log-transformed \(p\)-values and the third for Z-scores.
However, using my new, less correlated simulations, empirical cFDR performs very badly. [Note that the Z-score results are based on fewer simulations, hence the larger error bars, and in these simulations (bottom panel) empirical cFDR was only run on SNPs with \(p<0.01\).]
Why is this?
- I have re-run empirical cFDR using only those SNPs with \(p<0.01\), as recommended in the software, but this doesn’t help.
- Perhaps it’s because there are a few simulations where the correlation is negative and empirical cFDR doesn’t know what to do? Although this can’t be causing the pattern we see, as it holds over most simulations (a quick check is sketched below).
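One way to check the second point is to tabulate the sign of the \(p\)-\(q\) correlation across simulations. A minimal sketch, assuming `sims` is a list with one element per simulation, each containing the `p` and `q` vectors for that run (object and element names are mine):

```r
# Spearman correlation between p and q^k in each simulation
rho <- sapply(sims, function(s) cor(s$p, s$q, method = "spearman"))

# How many simulations have a negative p-q correlation,
# and how negative do they get?
table(rho < 0)
summary(rho)
```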
Comments and queries
Tom is re-running his results to see how many points get left censored in a real GWAS analysis (he has \(\approx 5\) million SNPs).
In Tom’s analysis he uses (i) a `check_indep_cor` flag to stop the check that the sign of the Spearman’s correlation between the whole data set and the independent subset is the same, and (ii) an `enforce_p_q_cor` flag to stop the flipping of \(q\) if the Spearman’s correlation is negative. I’m wary about these flags and not sure whether to use them in my simulations.
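For reference, my reading of the two behaviours those flags switch off is sketched below. This is a conceptual sketch of my understanding, not Tom’s code or the package implementation; the variable names (`p`, `q`, `indep`) are hypothetical.

```r
# (i) check that the sign of the Spearman correlation between p and q
#     agrees in the full data set and in the independent SNP subset
rho_all   <- cor(p, q, method = "spearman")
rho_indep <- cor(p[indep], q[indep], method = "spearman")
if (sign(rho_all) != sign(rho_indep)) {
  warning("p-q correlation changes sign in the independent subset")
}

# (ii) if p and q are negatively correlated, flip q so that small q
#      again indicates shared signal (my understanding of the q-flip)
if (rho_all < 0) {
  q <- 1 - q
}
```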