I recap the results of iterating our functional cFDR method for various \(q\) (horizontally scrollable).
Here, we see that there is a problem where \(v\) values for data points with high \(p\) and very low \(q\) are shrunk to 0. I am not too worried about this because it is unlikely that a high \(p\) value with one of the lowest \(q\) values will occur in our real data analysis.
Of note, I think that this occurs because \(P(P\leq p, Q\leq q)/P(Q\leq q|H_0)\) is not necessarily greater than 1 if low \(q\) is not enriched for low \(p\), which causes the cFDR curves (\(p/(P(P\leq p, Q\leq q)/P(Q\leq q|H_0))\)) to extend above 1. This can be fixed by increasing the bandwidth used in the estimation of the density of \(Q\), so that more data points contribute to the estimate in regions of sparse data (e.g. low \(q\)).
The default BW used in my method is:
bandwidth.nrd(c(q, q))
## [1] 0.9563462
I investigate using a larger BW.
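As a minimal sketch of what "a larger BW" means here (assuming the zp, q, res_p, res_q and lims objects used elsewhere in this post), the default rule-of-thumb bandwidths can simply be inflated before being passed to kde2d:

```r
library(MASS)

# rule-of-thumb bandwidths for each margin (the default used by kde2d)
bw_default <- c(bandwidth.nrd(c(zp, -zp)), bandwidth.nrd(c(q, q)))

# inflate by a factor (e.g. 2) to smooth over regions of sparse q
kpq <- kde2d(c(zp, -zp), c(q, q), h = 2 * bw_default,
             n = c(res_p, res_q), lims = lims)
```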
If we then look at the fit to the CDF of \(q\), we see that increasing the BW increases the minimum value of the estimated CDFs (which may potentially decrease \(P(P\leq p, Q\leq q)/P(Q\leq q|H_0)\), as desired).
The cFDR values look better when using a larger BW (all those with ccut>1 were the problematic points shrunk to 0).
This improvement is reflected in the final results.
I check how functional cFDR performs when using dependent auxiliary data from a mixture normal distribution. I think that the results look ok… note that the correlations between \(p\) and \(q\) are much larger than we’d expect in our real data (Spearman’s correlation is approximately 0.5 between \(p\) and \(q1\)). However, I can’t iterate past dimension 6 because some \(v\) values are shrunk to 0.
Below, I show the enrichment of low \(p\) for low \(q\) at each iteration:
If we look at the \(-log10\) results, then things are fine until the 7th iteration (note that there are no low \(p\), high \(q\) points, since the lower mixture normal is always picked because it is unlikely that runif<p here).
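For clarity, a sketch of how I understand the \(q\) generation described above (the component choice uses the runif<p comparison mentioned; the means and standard deviations are illustrative placeholders, not the values actually used):

```r
# hypothetical sketch of the mixture-normal q: the component is chosen
# by comparing runif to p, so small p almost always takes the lower
# component (giving the enrichment of low q for low p)
simulate_q <- function(p, mean_low = -2, mean_high = 2, sd = 1) {
  n <- length(p)
  upper <- runif(n) < p          # rarely TRUE when p is small
  ifelse(upper,
         rnorm(n, mean_high, sd),
         rnorm(n, mean_low, sd))
}
```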
What happens after the 6th iteration?
Sources: Non-parametric KDE book
Currently, we are using a rule of thumb to choose the fixed bandwidth of our 2-dimensional KDE: 1.06 times the minimum of the standard deviation and the interquartile range divided by 1.34, multiplied by the sample size raised to the power \(-1/5\) (Scott 1992; bandwidth.nrd). Using a factor of 0.9 rather than 1.06 would give Silverman’s rule of thumb (Silverman 1986; bandwidth.nrd0).
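For reference, this rule as implemented in MASS::bandwidth.nrd is essentially:

```r
# MASS::bandwidth.nrd (the result is 4x Scott's bandwidth, because
# kde2d internally divides the supplied bandwidths by 4)
bandwidth.nrd <- function(x) {
  r <- quantile(x, c(0.25, 0.75))
  h <- (r[2L] - r[1L]) / 1.34
  4 * 1.06 * min(sqrt(var(x)), h) * length(x)^(-1/5)
}
```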
Instead, we would like to use an adaptive bandwidth, which varies with the density of the data: in areas of lower density, we want to use a larger bandwidth to reduce the effect of outliers on the overall KDE.
There are several options:
Balloon density estimate (ks::kde.balloon
) uses bandwidths that vary at each estimation point (i.e. res_p*res_q
bandwidths). The estimate of the KDE at a point \(x\) is calculated as an average of identically scaled kernels centered at each data point \(X_i\). Source: Loftsgaarden and Quesenberry 1964
Sample point density estimate (ks::kde.sp
) uses bandwidths that vary for each data point (i.e. nsnps
bandwidths). The estimate of the KDE at every \(x\) is therefore an average of differently scaled kernels centered at each data point \(X_i\). Source: Abramson, 1982
The variable bandwidth KDE of \(f\) is
\[\begin{equation} \hat{f}(x)=\dfrac{1}{n}\sum_{i=1}^n \dfrac{1}{h_i}K(\dfrac{x-x_i}{h_i}) \end{equation}\]
where \(h_i=\lambda_i h\) is the variable bandwidth (\(h\) is the fixed bandwidth from a pilot estimate) and
\[\begin{equation} \lambda_i=\{\dfrac{\hat{f}(x_i)}{g}\}^{-\alpha} \end{equation}\]
where \(g\) is the geometric mean of the pilot estimates \(\hat{f}(x_i)\) and \(\alpha\) is a sensitivity parameter (set to \(1/2\) in Abramson, 1982).
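A minimal one-dimensional sketch of these sample-point bandwidths (my own illustration, not the ks implementation), taking \(g\) as the geometric mean of the pilot density values and \(\alpha = 1/2\):

```r
abramson_bw <- function(x, alpha = 1/2) {
  # pilot KDE with a fixed rule-of-thumb bandwidth
  pilot <- density(x)
  f_hat <- approx(pilot$x, pilot$y, xout = x)$y

  # g: geometric mean of the pilot density at the data points
  g <- exp(mean(log(f_hat)))

  # lambda_i and the variable bandwidths h_i = lambda_i * h
  lambda <- (f_hat / g)^(-alpha)
  pilot$bw * lambda
}
```

The variable-bandwidth estimate \(\hat{f}(x)\) then averages the kernels \(K((x-x_i)/h_i)/h_i\) over the data points, as in the equation above.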
There are 3 main methods in R for bivariate adaptive KDE.
ks::kde.balloon: balloon estimation. The default BW is Hns(,deriv.order=2) and the subsequent BWs are derived via a minimal MSE formula.
ks::kde.sp: sample point estimation. The default BW is Hns(,deriv.order=4) and the subsequent BWs are derived via the Abramson formula (above).
sparr::bivariate.density: has many more options and is potentially faster, e.g. it implements edge-correction factors and offers a partitioning approximation (davies.baddeley). But I’m not sure whether the h0 or the hp parameter defines the default BW.
Currently, we are using a fixed BW for our bivariate KDE (using a folded normal approach).
# kde2d (non-adaptive)
kpq <- kde2d(c(zp, -zp), c(q, q), n=c(res_p, res_q), lims = lims)
I investigate using balloon and sample point KDE. Note that for the latter, I am unable to use the folded normal method due to an error message (Error in svd(A) : infinite or missing values in ‘x’).
library(ks)
kde_data <- as.matrix(cbind(c(zp, -zp), c(q, q)))
kpq_balloon <- kde.balloon(x = kde_data, xmin = c(lims[1], lims[3]), xmax = c(lims[2], lims[4]))
kde_data <- as.matrix(cbind(zp, q))
kpq_sp <- kde.sp(x = kde_data, xmin = c(lims[1], lims[3]), xmax = c(lims[2], lims[4]))
I also investigate using the sparr library (but I need to figure out all of the extra options):
library(sparr)
library(spatstat)
pp_data <- ppp(c(-zp,zp), c(q,q), xrange = c(lims[1], lims[2]), yrange = c(lims[3], lims[4]))
kpq_sparr <- bivariate.density(pp_data, h0 = 1, adapt = TRUE, resolution = 300, edge = "none", trim = Inf)
We see that not being able to do the folded normal method for the latter two methods is going to be problematic.
Looking at the estimate of \(Q|H_0\):
Looking at the estimate of \(Q\leq q|H_0\):
The default kde2d (red line) seems to fit the data best.
And looking at the cFDR curves:
Finally, looking at the \(v\) values (scroll):
I don’t think adaptive BW is the way to go for the following reasons:
The red curve (the original kde2d estimate) fits the empirical data better (both \(Q|H_0\) and \(Q\leq q|H_0\)).
The adaptive BW methods find different BWs for each data point (or resolution) and are therefore extremely slow.
New idea for generating \(q\):
Perform MCA on some functional data
Regress the \(\chi^2\) statistics against the co-ordinates from MCA
Select the significant dimensions for \(q\) in our analysis
This idea works because, although the first few dimensions pick up the majority of the variability in the original dataset, this is likely variability that is not specifically due to our phenotype. We hope that this method will allow us to pick out the dimensions which actually capture the disease-specific variability.
Sources: https://www.stata.com/manuals13/u20.pdf#u20.21Obtainingrobustvarianceestimates; http://cameron.econ.ucdavis.edu/research/Cameron_Miller_JHR_2015_February.pdf; https://www.youtube.com/watch?v=cqTBQTXF4Mo
Estimates of variance refer to the estimated standard errors of our coefficients (or, in a more complete sense, the estimated variance-covariance matrix, whose diagonal elements’ square roots are the standard errors).
As in ldsc, we have the problem of correlated dependent variables (our \(\chi^2\) statistics are correlated amongst SNPs in LD) and heteroscedasticity, which may lead to misleadingly small standard errors (and therefore possibly predictors incorrectly called as significant).
To deal with this, we need a “robust estimate of variance”, and specifically the “cluster robust estimator of variance”, which deals with correlated observations (because our SNPs are in LD). This allows observations to be correlated within a cluster but assumes they are uncorrelated across clusters.
White suggested doing this by estimating the regression model with limited or no control for within-cluster error correlation, and then post-estimation obtain “cluster-robust” standard errors.
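Concretely (following Cameron and Miller, 2015), for OLS with clusters \(g = 1, \dots, G\), the cluster-robust variance estimator has the sandwich form

\[\begin{equation} \widehat{V}(\hat{\beta})=(X'X)^{-1}\left(\sum_{g=1}^G X_g'\hat{u}_g \hat{u}_g' X_g\right)(X'X)^{-1} \end{equation}\]

where \(X_g\) and \(\hat{u}_g\) are the regressor matrix and OLS residuals for cluster \(g\); error correlation is allowed within each cluster but assumed zero across clusters.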
I have matched the 2 million asthma SNPs to their annotation value for the 96 annotations in the baseline LD v2.2 model [Note - not cell type specific].
I have then run PCAmix on the data to obtain a lower-dimensional representation (number of dimensions capped at 50 for speed).
I then regress the \(\chi^2\) statistics for each SNP against the lower-dimensional co-ordinates and pick out those dimensions which are significant (this needs re-running using cluster robust variance estimation when I have the data - or not, if we are pruning the SNPs for the whole method).
library(rms) # for ols() and robcov()
regression_df <- readRDS("asthma_big_regression_df.RDS")
f <- ols(chisq_vals ~ ., data = regression_df, x = TRUE, y = TRUE)
f
## Linear Regression Model
##
## ols(formula = chisq_vals ~ ., data = regression_df, x = TRUE,
## y = TRUE)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 1999262 LR chi2 8938.22 R2 0.004
## sigma 1.9028 d.f. 50 R2 adj 0.004
## d.f. 1999211 Pr(> chi2) 0.0000 g 0.140
##
## Residuals
##
## Min 1Q Median 3Q Max
## -1.9269 -0.9543 -0.6065 0.2915 122.8350
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 1.0903 0.0013 810.20 <0.0001
## dim.1 0.0216 0.0005 45.01 <0.0001
## dim.2 -0.0014 0.0007 -2.09 0.0363
## dim.3 0.0047 0.0007 6.47 <0.0001
## dim.4 0.0067 0.0008 8.49 <0.0001
## dim.5 0.0222 0.0008 27.87 <0.0001
## dim.6 -0.0192 0.0008 -22.60 <0.0001
## dim.7 0.0066 0.0009 7.13 <0.0001
## dim.8 0.0175 0.0010 18.43 <0.0001
## dim.9 0.0193 0.0010 19.58 <0.0001
## dim.10 -0.0091 0.0010 -8.86 <0.0001
## dim.11 0.0460 0.0010 43.85 <0.0001
## dim.12 -0.0090 0.0011 -8.23 <0.0001
## dim.13 -0.0112 0.0011 -10.07 <0.0001
## dim.14 0.0425 0.0012 36.21 <0.0001
## dim.15 0.0244 0.0012 20.56 <0.0001
## dim.16 0.0052 0.0012 4.27 <0.0001
## dim.17 -0.0039 0.0012 -3.20 0.0014
## dim.18 0.0014 0.0012 1.12 0.2639
## dim.19 -0.0051 0.0013 -4.01 <0.0001
## dim.20 0.0051 0.0013 4.01 <0.0001
## dim.21 -0.0075 0.0013 -5.86 <0.0001
## dim.22 0.0009 0.0013 0.75 0.4559
## dim.23 0.0038 0.0013 2.94 0.0032
## dim.24 0.0067 0.0013 5.26 <0.0001
## dim.25 -0.0003 0.0013 -0.21 0.8347
## dim.26 0.0042 0.0013 3.26 0.0011
## dim.27 -0.0022 0.0013 -1.68 0.0920
## dim.28 0.0062 0.0013 4.75 <0.0001
## dim.29 0.0025 0.0013 1.95 0.0514
## dim.30 -0.0035 0.0013 -2.67 0.0075
## dim.31 -0.0176 0.0013 -13.44 <0.0001
## dim.32 -0.0195 0.0013 -14.87 <0.0001
## dim.33 0.0139 0.0013 10.52 <0.0001
## dim.34 -0.0045 0.0013 -3.44 0.0006
## dim.35 -0.0089 0.0013 -6.66 <0.0001
## dim.36 -0.0043 0.0013 -3.23 0.0012
## dim.37 0.0008 0.0013 0.60 0.5510
## dim.38 0.0005 0.0013 0.38 0.7075
## dim.39 -0.0019 0.0014 -1.39 0.1653
## dim.40 0.0001 0.0014 0.06 0.9555
## dim.41 0.0041 0.0014 3.00 0.0027
## dim.42 0.0005 0.0014 0.39 0.6944
## dim.43 -0.0036 0.0014 -2.58 0.0098
## dim.44 -0.0012 0.0014 -0.84 0.3996
## dim.45 -0.0021 0.0014 -1.52 0.1288
## dim.46 -0.0089 0.0014 -6.33 <0.0001
## dim.47 0.0018 0.0014 1.25 0.2106
## dim.48 0.0003 0.0014 0.24 0.8082
## dim.49 0.0045 0.0014 3.11 0.0019
## dim.50 0.0058 0.0014 4.01 <0.0001
##
robcov(f)
## Linear Regression Model
##
## ols(formula = chisq_vals ~ ., data = regression_df, x = TRUE,
## y = TRUE)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 1999262 LR chi2 8938.22 R2 0.004
## sigma 1.9028 d.f. 50 R2 adj 0.004
## d.f. 1999211 Pr(> chi2) 0.0000 g 0.140
##
## Residuals
##
## Min 1Q Median 3Q Max
## -1.9269 -0.9543 -0.6065 0.2915 122.8350
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 1.0903 0.0013 810.21 <0.0001
## dim.1 0.0216 0.0007 32.96 <0.0001
## dim.2 -0.0014 0.0008 -1.66 0.0972
## dim.3 0.0047 0.0009 5.43 <0.0001
## dim.4 0.0067 0.0009 7.31 <0.0001
## dim.5 0.0222 0.0012 18.20 <0.0001
## dim.6 -0.0192 0.0009 -21.91 <0.0001
## dim.7 0.0066 0.0014 4.60 <0.0001
## dim.8 0.0175 0.0012 14.48 <0.0001
## dim.9 0.0193 0.0014 13.78 <0.0001
## dim.10 -0.0091 0.0013 -7.05 <0.0001
## dim.11 0.0460 0.0014 33.98 <0.0001
## dim.12 -0.0090 0.0015 -5.99 <0.0001
## dim.13 -0.0112 0.0015 -7.53 <0.0001
## dim.14 0.0425 0.0015 29.22 <0.0001
## dim.15 0.0244 0.0017 14.68 <0.0001
## dim.16 0.0052 0.0015 3.45 0.0006
## dim.17 -0.0039 0.0014 -2.82 0.0048
## dim.18 0.0014 0.0016 0.86 0.3890
## dim.19 -0.0051 0.0014 -3.66 0.0003
## dim.20 0.0051 0.0014 3.63 0.0003
## dim.21 -0.0075 0.0015 -4.87 <0.0001
## dim.22 0.0009 0.0016 0.61 0.5406
## dim.23 0.0038 0.0014 2.72 0.0065
## dim.24 0.0067 0.0014 4.83 <0.0001
## dim.25 -0.0003 0.0012 -0.22 0.8287
## dim.26 0.0042 0.0012 3.45 0.0006
## dim.27 -0.0022 0.0012 -1.77 0.0762
## dim.28 0.0062 0.0016 3.75 0.0002
## dim.29 0.0025 0.0012 2.09 0.0369
## dim.30 -0.0035 0.0015 -2.29 0.0221
## dim.31 -0.0176 0.0018 -9.92 <0.0001
## dim.32 -0.0195 0.0017 -11.41 <0.0001
## dim.33 0.0139 0.0016 8.77 <0.0001
## dim.34 -0.0045 0.0018 -2.51 0.0120
## dim.35 -0.0089 0.0019 -4.75 <0.0001
## dim.36 -0.0043 0.0016 -2.76 0.0058
## dim.37 0.0008 0.0019 0.42 0.6719
## dim.38 0.0005 0.0018 0.29 0.7753
## dim.39 -0.0019 0.0017 -1.09 0.2749
## dim.40 0.0001 0.0022 0.03 0.9727
## dim.41 0.0041 0.0021 1.99 0.0470
## dim.42 0.0005 0.0021 0.26 0.7970
## dim.43 -0.0036 0.0018 -1.96 0.0498
## dim.44 -0.0012 0.0020 -0.59 0.5571
## dim.45 -0.0021 0.0017 -1.27 0.2028
## dim.46 -0.0089 0.0019 -4.80 <0.0001
## dim.47 0.0018 0.0019 0.94 0.3485
## dim.48 0.0003 0.0021 0.17 0.8668
## dim.49 0.0045 0.0018 2.46 0.0141
## dim.50 0.0058 0.0019 3.04 0.0024
##
So are these significant dimensions really monotonic in \(p\)? I investigate the first few. Yes, but the increase is very small (on the \(-log10\) scale).
x <- readRDS("matched_df.RDS")
gg_qqboxplot(x$european_ancestry_pval_rand, regression_df$dim.1) + coord_cartesian(ylim = c(0,0.75)) + ggtitle("Dimension 1")
gg_qqboxplot(x$european_ancestry_pval_rand, regression_df$dim.3) + coord_cartesian(ylim = c(0,0.75)) + ggtitle("Dimension 3")
gg_qqboxplot(x$european_ancestry_pval_rand, regression_df$dim.4) + coord_cartesian(ylim = c(0,0.75)) + ggtitle("Dimension 4")
I then run functional cFDR on the asthma \(p\) values and set \(q\) equal to the coordinates from dimension 1. The results look ok but I need to check the anomaly points.
Comments and next steps
Need to check anomaly results:
Shrinking to \(v \approx 0\) for high \(p\), low \(q\) for the independent mixture normal. Note that this is partially fixed by using a larger BW, so a new BW selected using only independent SNPs may be better.
Iteration 6 [\((v5,q6)=v6\)] for dependent mixture normal.
Asthma results (high p, low q and low -log10(p), high -log10(q)). Again, may be fixed using a different BW.
I think most of these problems stem from the KDE for low \(q\) where the data is sparse.
LD pruning:
If I LD prune SNPs at the start of the analysis then I don’t think I’ll need to worry about robust variance estimation in the regression to select \(q\) dimensions (since our dependent variable, the \(\chi^2\) statistics, will no longer be correlated across SNPs).
I think the BW selector methods assume independent samples, in which case we should LD prune our SNPs. This will also speed things up. Annotation-wise we should be ok because most annotations include a supplementary 500kb flanking annotation.
I think that PLINK and LDAK require individual genotypes.
To LD prune, I think my method will be:
Download the European LD detect regions here. These were generated from 1000 Genomes phase 1 data (I am using phase 3?) and the genome build is hg19.
Convert these co-ordinates to hg38 using liftOver.
Match my asthma SNPs to each region.
Only keep the SNP with the smallest P value in each region.
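A minimal sketch of steps 3-4, assuming a data frame snps with columns chr, pos and p (hypothetical names), the lifted-over regions with columns chr, start and end, and GenomicRanges for the matching:

```r
library(GenomicRanges)

# one range per SNP and one per LD-detect region
snp_gr <- GRanges(snps$chr, IRanges(snps$pos, snps$pos))
reg_gr <- GRanges(regions$chr, IRanges(regions$start, regions$end))

# assign each SNP to the LD region it falls in
hits <- findOverlaps(snp_gr, reg_gr)
snps$region <- NA_integer_
snps$region[queryHits(hits)] <- subjectHits(hits)

# keep only the SNP with the smallest p value in each region
pruned <- do.call(rbind, lapply(split(snps, snps$region),
                                function(d) d[which.min(d$p), ]))
```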
But does this method throw away independent signals?
Cell type specific annotations:
At the moment, I’m not including any cell type specific annotations in my application. The only ones I can think of are the genome segmentations from the Segway encyclopedia. But which of these should I choose? (https://noble.gs.washington.edu/proj/encyclopedia/interpreted/)
BC_LUNG_01-11002.bed.gz (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM999444)
All immune cell types (CD3 to CD56)
MONOCYTES-CD14+_RO01746.bed.gz