I recap the results of iterating our functional cFDR method for various \(q\) (horizontally scrollable).
Here, we see that there is a problem where \(v\) values for data points with high \(p\) and very low \(q\) are shrunk to 0. I am not too worried about this because it is unlikely that a high \(p\) value with one of the lowest \(q\) values will occur in our real data analysis.
Of note, I think that this occurs because \(P(P\leq p, Q\leq q)/P(Q\leq q|H_0)\) is not necessarily greater than 1 if low \(q\) is not enriched for low \(p\), which causes the cFDR curves (\(p/(P(P\leq p, Q\leq q)/P(Q\leq q|H_0))\)) to extend above 1. This can be fixed by increasing the bandwidth used in the estimation of the density of \(Q\), so that more data points contribute to the estimate in regions of sparse data (e.g. low \(q\)).
The default BW used in my method is:
bandwidth.nrd(c(q, q))
## [1] 0.9563462
I investigate using a larger BW.
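As a minimal sketch of what "a larger BW" means here (assuming the zp, q, res_p, res_q and lims objects used elsewhere in this post), the default rule-of-thumb bandwidths can simply be inflated before being passed to kde2d:

```r
library(MASS)

# rule-of-thumb bandwidths for each margin (the default used by kde2d)
bw_default <- c(bandwidth.nrd(c(zp, -zp)), bandwidth.nrd(c(q, q)))

# inflate by a factor (e.g. 2) to smooth over regions of sparse q
kpq <- kde2d(c(zp, -zp), c(q, q), h = 2 * bw_default,
             n = c(res_p, res_q), lims = lims)
```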
If we then look at the fit to the CDF of \(q\), we see that increasing the BW increases the minimum value of the estimated CDFs (which may potentially decrease \(P(P\leq p, Q\leq q)/P(Q\leq q|H_0)\), as desired).
The cFDR values look better when using a larger BW (all those with ccut>1 were the problematic points shrunk to 0).
This improvement is reflected in the final results.
I check how functional cFDR performs when using dependent auxiliary data from a mixture normal distribution. I think that the results look ok… note that the correlations between \(p\) and \(q\) are much larger than we’d expect in our real data (Spearman’s correlation is approximately 0.5 between \(p\) and \(q1\)). However, I can’t iterate past dimension 6 because some \(v\) values are shrunk to 0.
Below, I show the enrichment of low \(p\) for low \(q\) at each iteration:
If we look at the \(-log10\) results, then things are fine until the 7th iteration (note that there are no low \(p\), high \(q\) points, since the lower mixture normal is always picked because it is unlikely that runif<p here).
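For clarity, a sketch of how I understand the \(q\) generation described above (the component choice uses the runif<p comparison mentioned; the means and standard deviations are illustrative placeholders, not the values actually used):

```r
# hypothetical sketch of the mixture-normal q: the component is chosen
# by comparing runif to p, so small p almost always takes the lower
# component (giving the enrichment of low q for low p)
simulate_q <- function(p, mean_low = -2, mean_high = 2, sd = 1) {
  n <- length(p)
  upper <- runif(n) < p          # rarely TRUE when p is small
  ifelse(upper,
         rnorm(n, mean_high, sd),
         rnorm(n, mean_low, sd))
}
```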
What happens after the 6th iteration?
Sources: Non-parametric KDE book
Currently, we are using a rule of thumb to choose the fixed bandwidth of our 2-dimensional KDE: 1.06 times the minimum of the standard deviation and the interquartile range divided by 1.34, multiplied by the sample size raised to the power \(-1/5\) (Scott 1992; bandwidth.nrd). Using a factor of 0.9 rather than 1.06 would give Silverman’s rule of thumb (Silverman 1986; bandwidth.nrd0).
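For reference, this rule as implemented in MASS::bandwidth.nrd is essentially:

```r
# MASS::bandwidth.nrd (the result is 4x Scott's bandwidth, because
# kde2d internally divides the supplied bandwidths by 4)
bandwidth.nrd <- function(x) {
  r <- quantile(x, c(0.25, 0.75))
  h <- (r[2L] - r[1L]) / 1.34
  4 * 1.06 * min(sqrt(var(x)), h) * length(x)^(-1/5)
}
```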
Instead, we would like to use an adaptive bandwidth, which varies with the density of the data: in areas of lower density, we want to use a larger bandwidth to reduce the effect of outliers on the overall KDE.
There are several options:
Balloon density estimate (ks::kde.balloon
) uses bandwidths that vary at each estimation point (i.e. res_p*res_q
bandwidths). The estimate of the KDE at a point \(x\) is calculated as an average of identically scaled kernels centered at each data point \(X_i\). Source: Loftsgaarden and Quesenberry 1964
Sample point density estimate (ks::kde.sp
) uses bandwidths that vary for each data point (i.e. nsnps
bandwidths). The estimate of the KDE at every \(x\) is therefore an average of differently scaled kernels centered at each data point \(X_i\). Source: Abramson, 1982
The variable bandwidth KDE of \(f\) is
\[\begin{equation} \hat{f}(x)=\dfrac{1}{n}\sum_{i=1}^n \dfrac{1}{h_i}K(\dfrac{x-x_i}{h_i}) \end{equation}\]
where \(h_i=\lambda_i h\) is the variable bandwidth (\(h\) is the fixed bandwidth from a pilot estimate) and
\[\begin{equation} \lambda_i=\{\dfrac{\hat{f}(x_i)}{g}\}^{-\alpha} \end{equation}\]
where \(g\) is the geometric mean of the pilot estimates \(\hat{f}(x_i)\) and \(\alpha\) is a sensitivity parameter (set to \(1/2\) in Abramson, 1982).
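A minimal one-dimensional sketch of these sample-point bandwidths (my own illustration, not the ks implementation), taking \(g\) as the geometric mean of the pilot density values and \(\alpha = 1/2\):

```r
abramson_bw <- function(x, alpha = 1/2) {
  # pilot KDE with a fixed rule-of-thumb bandwidth
  pilot <- density(x)
  f_hat <- approx(pilot$x, pilot$y, xout = x)$y

  # g: geometric mean of the pilot density at the data points
  g <- exp(mean(log(f_hat)))

  # lambda_i and the variable bandwidths h_i = lambda_i * h
  lambda <- (f_hat / g)^(-alpha)
  pilot$bw * lambda
}
```

The variable-bandwidth estimate \(\hat{f}(x)\) then averages the kernels \(K((x-x_i)/h_i)/h_i\) over the data points, as in the equation above.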
There are 3 main methods in R for bivariate adaptive KDE.
ks::kde.balloon: balloon estimation. The default BW is Hns(,deriv.order=2) and the subsequent BWs are derived via a minimal MSE formula.
ks::kde.sp: sample point estimation. The default BW is Hns(,deriv.order=4) and the subsequent BWs are derived via the Abramson formula (above).
sparr::bivariate.density: has many more options and is potentially faster, e.g. it implements edge-correction factors and offers a partitioning approximation (davies.baddeley). But I’m not sure whether the h0 or the hp parameter defines the default BW.
Currently, we are using a fixed BW for our bivariate KDE (using a folded normal approach).
# kde2d (non-adaptive)
kpq <- kde2d(c(zp, -zp), c(q, q), n=c(res_p, res_q), lims = lims)
I investigate using balloon and sample point KDE. Note that for the latter, I am unable to use the folded normal method due to an error message (Error in svd(A) : infinite or missing values in ‘x’).
library(ks)
kde_data <- as.matrix(cbind(c(zp, -zp), c(q, q)))
kpq_balloon <- kde.balloon(x = kde_data, xmin = c(lims[1], lims[3]), xmax = c(lims[2], lims[4]))
kde_data <- as.matrix(cbind(zp, q))
kpq_sp <- kde.sp(x = kde_data, xmin = c(lims[1], lims[3]), xmax = c(lims[2], lims[4]))
I also investigate using the sparr library (but I need to figure out all of the extra options):
library(sparr)
library(spatstat)
pp_data <- ppp(c(-zp,zp), c(q,q), xrange = c(lims[1], lims[2]), yrange = c(lims[3], lims[4]))
kpq_sparr <- bivariate.density(pp_data, h0 = 1, adapt = TRUE, resolution = 300, edge = "none", trim = Inf)
We see that not being able to do the folded normal method for the latter two methods is going to be problematic.
Looking at the estimate of \(Q|H_0\):
Looking at the estimate of \(Q\leq q|H_0\):
The default kde2d (red line) seems to fit the data best.
And looking at the cFDR curves:
Finally, looking at the \(v\) values (scroll):
I don’t think adaptive BW is the way to go for the following reasons:
The red curve (the original kde2d estimate) fits the empirical data better (both \(Q|H_0\) and \(Q\leq q|H_0\)).
The adaptive BW methods find different BWs for each data point (or resolution) and are therefore extremely slow.
New idea for generating \(q\):
Perform MCA on some functional data
Regress the \(\chi^2\) statistics against the co-ordinates from MCA
Select the significant dimensions for \(q\) in our analysis
This idea works because, although the first few dimensions pick up the majority of the variability in the original dataset, this is likely variability that is not specifically due to our phenotype. We hope that this method will allow us to pick out the dimensions which actually capture the disease-specific variability.
Sources: https://www.stata.com/manuals13/u20.pdf#u20.21Obtainingrobustvarianceestimates; http://cameron.econ.ucdavis.edu/research/Cameron_Miller_JHR_2015_February.pdf; https://www.youtube.com/watch?v=cqTBQTXF4Mo
Estimates of variance refer to the estimated standard errors of our coefficients (or, in a more complete sense, the estimated variance-covariance matrix, whose diagonal elements’ square roots are the standard errors).
As in ldsc, we have the problem of correlated dependent variables (our \(\chi^2\) statistics are correlated amongst SNPs in LD) and heteroscedasticity, which may lead to misleadingly small standard errors (and therefore possibly predictors incorrectly called as significant).
To deal with this, we need a “robust estimate of variance”, and specifically the “cluster robust estimator of variance”, which deals with correlated observations (because our SNPs are in LD). This allows observations to be correlated within a cluster but assumes they are uncorrelated across clusters.
White suggested doing this by estimating the regression model with limited or no control for within-cluster error correlation, and then post-estimation obtain “cluster-robust” standard errors.
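Concretely (following Cameron and Miller, 2015), for OLS with clusters \(g = 1, \dots, G\), the cluster-robust variance estimator has the sandwich form

\[\begin{equation} \widehat{V}(\hat{\beta})=(X'X)^{-1}\left(\sum_{g=1}^G X_g'\hat{u}_g \hat{u}_g' X_g\right)(X'X)^{-1} \end{equation}\]

where \(X_g\) and \(\hat{u}_g\) are the regressor matrix and OLS residuals for cluster \(g\); error correlation is allowed within each cluster but assumed zero across clusters.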
I have matched the 2 million asthma SNPs to their annotation value for the 96 annotations in the baseline LD v2.2 model [Note - not cell type specific].
I have then run PCAmix on the data to obtain a lower-dimensional representation (number of dimensions capped at 50 for speed).
I then regress the \(\chi^2\) statistics for each SNP against the lower-dimensional co-ordinates and pick out those dimensions which are significant (this needs re-running using cluster robust variance estimation when I have the data - or not, if we are pruning the SNPs for the whole method).
library(rms) # for ols() and robcov()
regression_df <- readRDS("asthma_big_regression_df.RDS")
f <- ols(chisq_vals ~ ., data = regression_df, x = TRUE, y = TRUE)
f
## Linear Regression Model
##
## ols(formula = chisq_vals ~ ., data = regression_df, x = TRUE,
## y = TRUE)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 1999262 LR chi2 8938.22 R2 0.004
## sigma 1.9028 d.f. 50 R2 adj 0.004
## d.f. 1999211 Pr(> chi2) 0.0000 g 0.140
##
## Residuals
##
## Min 1Q Median 3Q Max
## -1.9269 -0.9543 -0.6065 0.2915 122.8350
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 1.0903 0.0013 810.20 <0.0001
## dim.1 0.0216 0.0005 45.01 <0.0001
## dim.2 -0.0014 0.0007 -2.09 0.0363
## dim.3 0.0047 0.0007 6.47 <0.0001
## dim.4 0.0067 0.0008 8.49 <0.0001
## dim.5 0.0222 0.0008 27.87 <0.0001
## dim.6 -0.0192 0.0008 -22.60 <0.0001
## dim.7 0.0066 0.0009 7.13 <0.0001
## dim.8 0.0175 0.0010 18.43 <0.0001
## dim.9 0.0193 0.0010 19.58 <0.0001
## dim.10 -0.0091 0.0010 -8.86 <0.0001
## dim.11 0.0460 0.0010 43.85 <0.0001
## dim.12 -0.0090 0.0011 -8.23 <0.0001
## dim.13 -0.0112 0.0011 -10.07 <0.0001
## dim.14 0.0425 0.0012 36.21 <0.0001
## dim.15 0.0244 0.0012 20.56 <0.0001
## dim.16 0.0052 0.0012 4.27 <0.0001
## dim.17 -0.0039 0.0012 -3.20 0.0014
## dim.18 0.0014 0.0012 1.12 0.2639
## dim.19 -0.0051 0.0013 -4.01 <0.0001
## dim.20 0.0051 0.0013 4.01 <0.0001
## dim.21 -0.0075 0.0013 -5.86 <0.0001
## dim.22 0.0009 0.0013 0.75 0.4559
## dim.23 0.0038 0.0013 2.94 0.0032
## dim.24 0.0067 0.0013 5.26 <0.0001
## dim.25 -0.0003 0.0013 -0.21 0.8347
## dim.26 0.0042 0.0013 3.26 0.0011
## dim.27 -0.0022 0.0013 -1.68 0.0920
## dim.28 0.0062 0.0013 4.75 <0.0001
## dim.29 0.0025 0.0013 1.95 0.0514
## dim.30 -0.0035 0.0013 -2.67 0.0075
## dim.31 -0.0176 0.0013 -13.44 <0.0001
## dim.32 -0.0195 0.0013 -14.87 <0.0001
## dim.33 0.0139 0.0013 10.52 <0.0001
## dim.34 -0.0045 0.0013 -3.44 0.0006
## dim.35 -0.0089 0.0013 -6.66 <0.0001
## dim.36 -0.0043 0.0013 -3.23 0.0012
## dim.37 0.0008 0.0013 0.60 0.5510
## dim.38 0.0005 0.0013 0.38 0.7075
## dim.39 -0.0019 0.0014 -1.39 0.1653
## dim.40 0.0001 0.0014 0.06 0.9555
## dim.41 0.0041 0.0014 3.00 0.0027
## dim.42 0.0005 0.0014 0.39 0.6944
## dim.43 -0.0036 0.0014 -2.58 0.0098
## dim.44 -0.0012 0.0014 -0.84 0.3996
## dim.45 -0.0021 0.0014 -1.52 0.1288
## dim.46 -0.0089 0.0014 -6.33 <0.0001
## dim.47 0.0018 0.0014 1.25 0.2106
## dim.48 0.0003 0.0014 0.24 0.8082
## dim.49 0.0045 0.0014 3.11 0.0019
## dim.50 0.0058 0.0014 4.01 <0.0001
##
robcov(f)
## Linear Regression Model
##
## ols(formula = chisq_vals ~ ., data = regression_df, x = TRUE,
## y = TRUE)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 1999262 LR chi2 8938.22 R2 0.004
## sigma 1.9028 d.f. 50 R2 adj 0.004
## d.f. 1999211 Pr(> chi2) 0.0000 g 0.140
##
## Residuals
##
## Min 1Q Median 3Q Max
## -1.9269 -0.9543 -0.6065 0.2915 122.8350
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 1.0903 0.0013 810.21 <0.0001
## dim.1 0.0216 0.0007 32.96 <0.0001
## dim.2 -0.0014 0.0008 -1.66 0.0972
## dim.3 0.0047 0.0009 5.43 <0.0001
## dim.4 0.0067 0.0009 7.31 <0.0001
## dim.5 0.0222 0.0012 18.20 <0.0001
## dim.6 -0.0192 0.0009 -21.91 <0.0001
## dim.7 0.0066 0.0014 4.60 <0.0001
## dim.8 0.0175 0.0012 14.48 <0.0001
## dim.9 0.0193 0.0014 13.78 <0.0001
## dim.10 -0.0091 0.0013 -7.05 <0.0001
## dim.11 0.0460 0.0014 33.98 <0.0001
## dim.12 -0.0090 0.0015 -5.99 <0.0001
## dim.13 -0.0112 0.0015 -7.53 <0.0001
## dim.14 0.0425 0.0015 29.22 <0.0001
## dim.15 0.0244 0.0017 14.68 <0.0001
## dim.16 0.0052 0.0015 3.45 0.0006
## dim.17 -0.0039 0.0014 -2.82 0.0048
## dim.18 0.0014 0.0016 0.86 0.3890
## dim.19 -0.0051 0.0014 -3.66 0.0003
## dim.20 0.0051 0.0014 3.63 0.0003
## dim.21 -0.0075 0.0015 -4.87 <0.0001
## dim.22 0.0009 0.0016 0.61 0.5406
## dim.23 0.0038 0.0014 2.72 0.0065
## dim.24 0.0067 0.0014 4.83 <0.0001
## dim.25 -0.0003 0.0012 -0.22 0.8287
## dim.26 0.0042 0.0012 3.45 0.0006
## dim.27 -0.0022 0.0012 -1.77 0.0762
## dim.28 0.0062 0.0016 3.75 0.0002
## dim.29 0.0025 0.0012 2.09 0.0369
## dim.30 -0.0035 0.0015 -2.29 0.0221
## dim.31 -0.0176 0.0018 -9.92 <0.0001
## dim.32 -0.0195 0.0017 -11.41 <0.0001
## dim.33 0.0139 0.0016 8.77 <0.0001
## dim.34 -0.0045 0.0018 -2.51 0.0120
## dim.35 -0.0089 0.0019 -4.75 <0.0001
## dim.36 -0.0043 0.0016 -2.76 0.0058
## dim.37 0.0008 0.0019 0.42 0.6719
## dim.38 0.0005 0.0018 0.29 0.7753
## dim.39 -0.0019 0.0017 -1.09 0.2749
## dim.40 0.0001 0.0022 0.03 0.9727
## dim.41 0.0041 0.0021 1.99 0.0470
## dim.42 0.0005 0.0021 0.26 0.7970
## dim.43 -0.0036 0.0018 -1.96 0.0498
## dim.44 -0.0012 0.0020 -0.59 0.5571
## dim.45 -0.0021 0.0017 -1.27 0.2028
## dim.46 -0.0089 0.0019 -4.80 <0.0001
## dim.47 0.0018 0.0019 0.94 0.3485
## dim.48 0.0003 0.0021 0.17 0.8668
## dim.49 0.0045 0.0018 2.46 0.0141
## dim.50 0.0058 0.0019 3.04 0.0024
##
So are these significant dimensions really monotonic in \(p\)? I investigate the first few. Yes, but the increase is very small (on the \(-log10\) scale).
x <- readRDS("matched_df.RDS")
gg_qqboxplot(x$european_ancestry_pval_rand, regression_df$dim.1) + coord_cartesian(ylim = c(0,0.75)) + ggtitle("Dimension 1")
gg_qqboxplot(x$european_ancestry_pval_rand, regression_df$dim.3) + coord_cartesian(ylim = c(0,0.75)) + ggtitle("Dimension 3")
gg_qqboxplot(x$european_ancestry_pval_rand, regression_df$dim.4) + coord_cartesian(ylim = c(0,0.75)) + ggtitle("Dimension 4")
I then run functional cFDR on the asthma \(p\) values and set \(q\) equal to the coordinates from dimension 1. The results look ok but I need to check the anomaly points.
Comments and next steps
Need to check anomaly results:
Shrinking to \(v \approx 0\) for high \(p\), low \(q\) for the independent mixture normal. Note that this is partially fixed by using a larger BW, so a new BW selected using only independent SNPs may be better.
Iteration 6 [\((v5,q6)=v6\)] for dependent mixture normal.
Asthma results (high p, low q and low -log10(p), high -log10(q)). Again, may be fixed using a different BW.
I think most of these problems stem from the KDE for low \(q\) where the data is sparse.
LD pruning:
If I LD prune SNPs at the start of the analysis then I don’t think I’ll need to worry about robust variance estimation in the regression to select \(q\) dimensions (since our dependent variable, the \(\chi^2\) statistics, will no longer be correlated across SNPs).
I think the BW selector methods assume independent samples, in which case we should LD prune our SNPs. This will also speed things up. Annotation-wise we should be ok because most annotations include a supplementary 500kb flanking annotation.
I think that PLINK and LDAK require individual genotypes.
To LD prune, I think my method will be:
Download the European LD detect regions here. These were generated from 1000 Genomes phase 1 data (I am using phase 3?) and the genome build is hg19.
Convert these co-ordinates to hg38 using liftOver.
Match my asthma SNPs to each region.
Only keep the SNP with the smallest P value in each region.
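A minimal sketch of steps 3-4, assuming a data frame snps with columns chr, pos and p (hypothetical names), the lifted-over regions with columns chr, start and end, and GenomicRanges for the matching:

```r
library(GenomicRanges)

# one range per SNP and one per LD-detect region
snp_gr <- GRanges(snps$chr, IRanges(snps$pos, snps$pos))
reg_gr <- GRanges(regions$chr, IRanges(regions$start, regions$end))

# assign each SNP to the LD region it falls in
hits <- findOverlaps(snp_gr, reg_gr)
snps$region <- NA_integer_
snps$region[queryHits(hits)] <- subjectHits(hits)

# keep only the SNP with the smallest p value in each region
pruned <- do.call(rbind, lapply(split(snps, snps$region),
                                function(d) d[which.min(d$p), ]))
```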
But does this method throw away independent signals?
Cell type specific annotations:
At the moment, I’m not including any cell type specific annotations in my application. The only ones I can think of are the genome segmentations from the Segway encyclopedia. But which of these should I choose? (https://noble.gs.washington.edu/proj/encyclopedia/interpreted/)
BC_LUNG_01-11002.bed.gz (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM999444)
All immune cell types (CD3 to CD56)
MONOCYTES-CD14+_RO01746.bed.gz