Investigating deinflation


Below, the histogram shows my data, the red line shows the KDE and the blue points show the back-transformed y points from the L curve.


The solution may be to use a bounded density estimator or an ecdf (though the latter may overfit the data). I investigate different R packages that provide bounded density estimation. The right-hand column looks bad because the estimates don’t reach 0 at the limits (so the integral under them between the \(y\) coordinates is probably \(<1\)). Below, I show the bounded KDEs and the area under each KDE between the maximum and minimum \(Y\) coordinates from the L curves, transformed back to \((-\infty,\infty)\).
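The mass shortfall is easy to demonstrate in base R: an unbounded KDE fitted to bounded data spreads kernel mass past the data limits, so integrating only between those limits gives less than 1. A minimal sketch (illustrative data, not my actual q values):

```r
# Sketch: why an unbounded KDE loses mass at the boundaries.
set.seed(1)
y <- runif(500)                               # data bounded on [0, 1]
kde <- density(y)                             # standard (unbounded) KDE
f <- approxfun(kde$x, kde$y, yleft = 0, yright = 0)
inside <- integrate(f, min(y), max(y))$value  # mass between the data limits
inside                                        # < 1: the kernels leak past 0 and 1
```

A boundary-corrected estimator pushes this leaked mass back inside the support, which is exactly what the bde estimators below attempt.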

It seems that bde::Muller94BoundaryKernel is best; I check how this performs on the other components.

“This class deals with Kernel estimators for bounded densities using boundary kernel described in Muller and Wang (1994). The kernel estimator is computed using the provided data samples. Using this kernel estimator, the methods implemented in the class can be used to compute densities, values of the distribution function, quantiles, sample the distribution and obtain graphical representations. Note that this kernel estimator is not normalized and therefore it is not a probability distribution (the cumulative density function may return values greater than 1).”

Hmm, it seems that things go dodgy for dimension 2 (it doesn’t get back down to 0 at the boundaries).


KDE estimation

Aim: Estimate \(f_q(q|H_0)\) using a bounded KDE rather than the original KDE I was using, which extended beyond our range (meaning that the integral between the \(y\) coordinates of the L curve was \(<1\)).

Below, I’ve plotted the KDE (red) over the histogram (grey). I’ve calculated the integral of the KDE between the maximum and minimum back-transformed y value (from the L curves).


However, extracting the \(x\) and \(y\) coordinates from the muller94BoundaryKernel object, for use in approxfun, is not straightforward. After obtaining the muller94BoundaryKernel object, I have to:

  1. Use density to obtain the coordinates of the red line.
fqh0_unnorm <- muller94BoundaryKernel(q[which(p>1/2)], lower.limit = min(q[which(p>1/2)]), upper.limit = max(q[which(p>1/2)]))
fqh0_density <- density(fqh0_unnorm, seq(min(q[which(p>1/2)]), max(q[which(p>1/2)]), length.out = 500))

plot(seq(min(q[which(p>1/2)]), max(q[which(p>1/2)]), length.out = 500), fqh0_density, ylim = c(0,2), type = "l", col = 2, main = "KDE using density() of\nmuller94BoundaryKernel")

hist(q[which(p>1/2)], xlim = c(-2, 5), ylim = c(0,2), prob = TRUE, breaks = 100, border = "gray", col = "gray", main = paste0('Plot from\nmuller94BoundaryKernel function'))
lines(fqh0_unnorm, col = 2)
points(x = rangeinf(as.vector(L[[1]]$y), q), y = rep(0, times = length(L[[1]]$y)), col = "blue")

  2. Use approxfun to get a functional representation of this line.
fqh0 <- approxfun(seq(min(q[which(p>1/2)]), max(q[which(p>1/2)]), length.out = 500), fqh0_density)
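The same two-step recipe can be sanity-checked in base R without the bde object (simulated data standing in for q[which(p>1/2)]; names are illustrative):

```r
# Step 1: obtain the (x, y) coordinates of the density estimate;
# step 2: wrap them in approxfun to get a callable density function.
set.seed(42)
qs <- rbeta(1000, 2, 5)                        # stand-in for q[which(p > 1/2)]
grid <- seq(min(qs), max(qs), length.out = 500)
dens <- density(qs, from = min(qs), to = max(qs), n = 500)$y   # step 1
fqh0 <- approxfun(grid, dens, yleft = 0, yright = 0)           # step 2
# fqh0 can now be evaluated anywhere, e.g. inside integrate():
total <- integrate(fqh0, min(qs), max(qs))$value
```

Setting yleft and yright to 0 makes the functional representation return zero density outside the support, matching the bounded-KDE behaviour I want at the extremes.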

I compare this to my original unbounded KDE. That is, for a given set of test q values, I check the output from the bounded KDE function (black) against the original KDE function (red).

I now compare the output from the original bivariate distribution (using original KDE for q) and the new bivariate distribution (using bounded KDE for q) where \(x\) is the \(x\) coordinates from the L curve and \(y\) is the back transformed \(y\) coordinates of the L curve.

  • At the extreme values of the q distribution, the original KDE estimates a positive density and bounded KDE estimates 0.


Results

  • This doesn’t seem to be working…


New method


Rewriting vl() function


  • Another solution is to convert James’ vl() function to take a continuous q. Chris has written some code to do this.

If we compare the L curves for:

  1. My current method (q converted to \([0,1]\))

  2. My current method (original q)

  3. Chris’ new code (vl() function to take continuous q)

Without any transformation the L curves are much more spiky, and using Chris’ new code they are much more spiky for high q.


Now that we have our L curves, we need to consider how to estimate the joint distribution (\(f_{p,q}(p,q|H_0)\)) to integrate. We can estimate this using KDE (see above) or ecdf. Consider 3 methods to generate V values:

  1. V values using my current method (using (non-bounded) KDE over transformed L curves): “my method”

  2. V values obtained by integrating ecdf over my L curves: “combo method”

  3. V values obtained by integrating ecdf over Chris’ L curves: “Chris’ method”

Notice here that the V values do not reach 1 when integrating using the KDE, but they do when using the ecdf (for both mine and Chris’ L curves) → the ecdf is the answer!
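This is easy to verify directly: the ecdf places all of the sample mass at or below the sample maximum, whereas an unbounded KDE always leaks some mass past the extremes. A minimal sketch (illustrative data):

```r
set.seed(3)
qs <- rnorm(1000)
# ecdf: total mass up to the sample maximum is exactly 1
Fq <- ecdf(qs)
ecdf_mass <- Fq(max(qs))
# unbounded KDE: mass between the sample extremes falls short of 1
kde <- density(qs)
f <- approxfun(kde$x, kde$y, yleft = 0, yright = 0)
kde_mass <- integrate(f, min(qs), max(qs))$value
```

So integrating against the ecdf guarantees the V values span the full \([0,1]\) range, which the KDE-based integral cannot.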


Results


  • Integrating over new curves adds noise.

  • Integrating over ecdf rather than KDE extends the range of V values, meaning small V gets smaller and V near 1 gets nearer to 1 (regardless of annotation). Other than this, the results are very similar when integrating my L curves over the KDE or the ecdf.


Comparing V values for the 3 methods



Full results for 121,000 SNPs



Iterating


We need to iterate over many dimensions because (even though dimensions 1 and 2 explain the vast majority of the variance) the majority of the variance between marks and cell types is not the same as that most relevant to disease causation. It’s hard to believe that whether a SNP was influential in altering T1D risk would be amongst the first or second most defining genome features.

I remind myself of the results from MCA:


Results






Notes

MCA notes:

  • file:///Users/anna/Downloads/v20i03.pdf

  • For the default lambda = "adjusted" option I’m using in mjca, the eigenvalues are given only for those dimensions \(k\) where the singular values from the Burt matrix \(\lambda_k\) (i.e. the principal inertias of the indicator matrix) satisfy the condition \(\lambda_k>1/Q\).

  • Results are often visualised with “symmetric maps” where the row and column coordinates on each axis are scaled to have inertias equal to the principal inertia along that axis (the “principal row and column coordinates”).
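The \(\lambda_k>1/Q\) condition above can be made concrete with Greenacre’s adjustment, which I believe is what the adjusted option applies (the numbers below are made up for illustration):

```r
# Greenacre's adjusted inertias: keep dimensions with lambda_k > 1/Q and
# rescale each by (Q/(Q-1))^2 * (lambda_k - 1/Q)^2.
Q <- 4                                  # number of categorical variables (illustrative)
lambda <- c(0.45, 0.32, 0.26, 0.18)     # hypothetical indicator-matrix inertias
keep <- lambda > 1/Q                    # 1/Q = 0.25, so only dimensions 1-3 survive
adjusted <- (Q / (Q - 1))^2 * (lambda[keep] - 1/Q)^2
```

This is why mjca reports fewer dimensions than the indicator matrix has: dimensions at or below \(1/Q\) are dropped outright rather than assigned a small inertia.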