1. Left-censoring walkthrough


To deal with the difficulty of estimating the KDE for low \(q\), where the data are very sparse, we use left-censoring, whereby all \(q<q_{low}\) are set equal to \(q_{low}\). We can choose \(q_{low}\) by considering how many data points we need in a grid region to trust the estimated KDE (which is an intuitive parameter to expose in the software).
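The censoring rule itself is a one-liner. A minimal sketch, written here in Python for concreteness (the analysis software itself is in R); `left_censor` is a hypothetical helper name, not part of the package:

```python
import numpy as np

def left_censor(q, q_low):
    """Left-censor a vector of q values: anything below q_low is set to q_low."""
    return np.maximum(q, q_low)

q = np.array([-20.0, -13.5, -5.0, 0.0])
censored = left_censor(q, -13.5)  # the two lowest values are pulled up to -13.5
```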

The density of data points used for estimating the KDE in each grid region is shown below, with the region in the red triangle zoomed in beneath it. For example, if we say that approximately 50 data points are required, then we would censor for approximately \(q< -13.5\), which would mean censoring approximately \(2000\) data points (0.1% of the whole dataset).
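One simple way to turn "we need roughly 50 points per region" into a value of \(q_{low}\) is to histogram \(q\) and take the left edge of the first bin that reaches the threshold. A sketch in Python, under the assumption that the simple "first crossing" rule is acceptable (`choose_q_low` is a hypothetical name; the caveats about transient crossings are discussed below):

```python
import numpy as np

def choose_q_low(q, nbins=100, min_count=50):
    """Pick q_low as the left edge of the first histogram bin (scanning
    from low q upwards) whose count reaches min_count.
    NB: np.argmax returns 0 if no bin qualifies, so very small datasets
    need an explicit check before trusting the result."""
    counts, edges = np.histogram(q, bins=nbins)
    idx = np.argmax(counts >= min_count)  # index of first bin meeting the threshold
    return edges[idx]

rng = np.random.default_rng(1)
q = rng.normal(0, 2, 10000)          # toy q values for illustration only
q_low = choose_q_low(q, nbins=100, min_count=50)
```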


Using \(q_{low} = -13.5\) as an example, we can look at the cFDR curves. With left-censoring, the cFDR curves for \(q<q_{low}\) become redundant (since we have replaced the \(q\) values there and interpolate from the curve for \(q_{low}\) instead). In the plot below, the cFDR curves for these now-redundant low \(q\) are shown in red. These are indeed the problematic ones.

Instead, the cFDR curves for \(q\) near \(q_{low}\) are the ones interpolated for all of these censored points. These cFDR curves are shown in green below (the red curves have been removed).
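Mechanically, the lookup for a censored point is just interpolation along the tabulated curve near \(q_{low}\) rather than along its own (unreliable) low-\(q\) curve. A sketch in Python; the `p_grid` resolution and the curve itself are placeholders for illustration, not values from the software:

```python
import numpy as np

# Placeholder tabulated cFDR curve for q_low: values on a p-grid.
p_grid = np.linspace(1e-8, 1, 500)
cfdr_at_qlow = np.sqrt(p_grid)  # stand-in curve, for illustration only

def cfdr_for_censored(p, p_grid, curve):
    """Look up cFDR values for censored points by linear interpolation
    along the q_low curve, instead of their original low-q curves."""
    return np.interp(p, p_grid, curve)

val = cfdr_for_censored(0.25, p_grid, cfdr_at_qlow)  # approx sqrt(0.25) = 0.5
```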


Left-censoring parameter: nogridp


In the software, we can include a parameter for how many data points per grid region the user feels is acceptable to reliably estimate the KDE. Here, I have used nogridp=50, which seems to work well. But what should we use as a default value? Perhaps this should be a function of the total number of data points (nogridp=50 is 0.0025% of our total data points).
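If the default is tied to dataset size, one candidate rule is a fixed fraction of the total with a floor for small datasets. A sketch of this idea in Python; `default_nogridp`, the fraction, and the floor of 20 are all hypothetical choices for discussion, not decided values:

```python
def default_nogridp(n, frac=0.000025, floor=20):
    """Hypothetical default: nogridp as a fixed fraction of the total number
    of data points (50 out of 2 million is 0.0025%, i.e. frac = 2.5e-5),
    with a floor so that small datasets still get a usable threshold."""
    return max(floor, round(frac * n))

# 2 million SNPs (as in the asthma analysis) recovers nogridp = 50
print(default_nogridp(2_000_000))
```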


Things to consider for left-censoring


  • Is it reasonable to assume that the histogram of \(q\) is roughly monotonically increasing? Is this enforced by our monotonicity requirement? Even in the walkthrough example here, it hits 50 and then drops down for the next couple of bins. Should we wait until the value where it hits 50 and then stays consistently above it?

  • We’ve only considered sparsity in the \(q\) direction of the grid points. What if, for the \(q_{low}\) bin, the corresponding \(p\) values are clustered so that the data are sparse in the \(p\) direction? The KDE will still struggle.

  • Default/recommended value for nogridp? Could this be some percentile of \(q\)? (Here the corresponding percentile is 0.0025%…)

  • Should we output in the method how many data points have been left-censored or spline corrected? Will we need to show some results for this in the paper?
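The "hits 50 and then stays consistently above it" rule from the first bullet can be made precise: take the first bin from which all later bins also meet the threshold, rather than the first transient crossing. A sketch in Python; `first_stable_crossing` is a hypothetical name for discussion:

```python
import numpy as np

def first_stable_crossing(counts, edges, min_count=50):
    """Return the left edge of the first bin at which counts reach min_count
    AND stay at or above it for all later bins, rather than the first
    transient crossing. NB: returns edges[0] if no such bin exists."""
    ok = counts >= min_count
    # suffix_all[i] is True iff every bin from i onwards meets the threshold
    suffix_all = np.logical_and.accumulate(ok[::-1])[::-1]
    idx = int(np.argmax(suffix_all))
    return edges[idx]

# Toy example: bin 1 crosses 50 transiently, but the stable crossing is bin 3.
counts = np.array([10, 55, 40, 60, 70, 80])
edges = np.arange(7.0)
stable_edge = first_stable_crossing(counts, edges)  # left edge of bin 3
```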


2. Simulations


Method

(in rds/hpc-work/cFDR_sims/)

  1. Split the BCF file (/UK10K/BCF/chr22.bcf.gz) into lddetect regions defined in /lddetect/EUR/fourier_ls-chr22.bed [24 regions ranging from 621,793 to 2,444,967 bp] and make the haplotype matrices.

  2. Only keep SNPs with \(maf>0.01\) [each region contains between ~2000-6000 SNPs] and make freq matrices.

  3. Generate LDAK weights for these SNPs by converting haplotypes to genotypes to use in write.plink and then running the run-ldak.sh script.

  4. Use the freq matrices in simGWAS, specifying 0, 1 or 2 CVs in each region (should I make the number of CVs dependent on the number of SNPs in the region?) with \(OR\sim N(0,0.2^2)\) and \(N0=N1=10000\).
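The effect-size step of the pipeline above can be sketched as follows, in Python for concreteness (the pipeline itself uses R and simGWAS). I read \(OR\sim N(0,0.2^2)\) as effects on the (log) odds-ratio scale; the function name and region-wise structure are illustrative, not the actual scripts:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_region_effects(n_cv, sd=0.2):
    """Draw per-CV effect sizes for one region, with effects ~ N(0, 0.2^2)
    as specified in the simulation setup."""
    return rng.normal(0, sd, size=n_cv)

# e.g. 24 lddetect regions, each assigned 0, 1 or 2 causal variants
n_cvs = rng.integers(0, 3, size=24)
effects = [simulate_region_effects(k) for k in n_cvs]
```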


To check:

  • Are \(OR\sim N(0,0.2^2)\) and \(N0=N1=10000\) reasonable? I’m planning on sticking with \(N0=N1=10000\) for all simulations and not splitting results up by the parameters used to simulate the GWAS, as this isn’t really the point here.

  • Number of CVs to reflect number of SNPs in region?

  • \(maf>0.01\) ok?

  • Should I do this on a bigger chromosome? [At the moment I only have 30 CVs.]

  • Does 114,000 SNPs on chromosome 22 in the UK10K data sound about right?


Independent uniform q


  • These results are for 114,310 SNPs including 30 CVs and 27,000 independent SNPs.

  • Approximately 120 SNPs are left-censored and 80 are spline corrected in each iteration.

  • I use the default parameter values nxbin = 1000, res_p = 300, res_q = 500, nogridp = 50, dist_thr = 0.5 in functional cFDR. These are the same values I was using in my 2 million SNP asthma analysis. Perhaps res_p = 300 and res_q = 500 are too big here? Is there anything wrong with using too many grid points for KDE estimation?

  • For each iteration, I generate an independent uniform q (q <- runif(n, 0, 1)). But when checking whether \(p\) and \(q\) are indeed independent, I find that there are some instances of non-monotonicity, which will obviously mess up the results. I also print the Spearman correlation.
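The independence check in the last bullet can be sketched as below, in Python for concreteness (the actual check is in R via runif and cor). The binning into q-quantiles is an assumed version of the monotonicity check, not the exact code used:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 10000
p = rng.uniform(size=n)  # stand-in for the p (or v) values in an iteration
q = rng.uniform(size=n)  # independent uniform q, as in q <- runif(n, 0, 1)

# Global check: rank correlation should be near zero under independence
rho, pval = spearmanr(p, q)

# Binned check for (non-)monotonicity: mean p within q-quantile bins should
# be flat; a trend or a non-monotone pattern here is what messes things up.
bins = np.quantile(q, np.linspace(0, 1, 11))
idx = np.digitize(q, bins[1:-1])
binned_means = [p[idx == k].mean() for k in range(10)]
```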


Why does this relationship become non-monotonic? It obviously has nothing to do with \(q\), since this is just generated from runif(n, 0, 1) in each iteration.

It must therefore be due to the \(p\) used in that iteration (the \(v\) values from the previous iteration)?


When looking at the results, things seem to mess up on the iterations where there is a non-monotonic relationship between \(p\) and \(q\).


Is cFDR finding structure when it shouldn’t be (when \(q\) is independent)? Note: I’m comparing it with the results from my asthma analysis (i.e. where p and q are dependent, as q is chosen from the PCAmix output). Although the magnitude of the change is smaller, there is still definitely a relationship that I don’t think should be there.


If we don’t bin, then the results look even weirder.

Unless we’re hoping that when we iterate over other independent \(q\), this difference will be “mopped up”?


Asthma p-vals with independent uniform q


To check whether these weird results are due to my simulated \(p\) values, I rerun functional cFDR on the asthma data, this time using independent uniform \(q\).

Things look dodgy…

Again, cFDR seems to be finding some structure…?


To do