Selecting q (iteration 1)


Using only the subset of independent SNPs (~500,000 out of ~2 million, i.e. 25%), I regress the \(\chi^2\) statistics against the coordinates of each dimension of the PCAmix results.
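
The setup can be sketched as follows on toy data, assuming the \(\chi^2\) statistics are recovered from the two-sided GWAS p-values (in the real analysis the data frame holds all 50 PCAmix dimensions for the independent SNPs):

```r
set.seed(1)
# toy stand-ins for the real inputs: p-values for the independent SNPs and
# a few simulated PCAmix dimensions (the real data has 50)
n <- 1e4
p <- runif(n)
dims <- as.data.frame(matrix(rnorm(n * 3), ncol = 3,
                             dimnames = list(NULL, paste0("dim.", 1:3))))

# chi-squared (1 df) statistic corresponding to each two-sided GWAS p-value
chisq <- qchisq(p, df = 1, lower.tail = FALSE)

regression_df <- data.frame(chisq = chisq, dims)
lm_out <- lm(chisq ~ ., data = regression_df)
summary(lm_out)$coefficients
```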

## 
## Call:
## lm(formula = chisq ~ ., data = regression_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -1.583  -0.920  -0.577   0.297 104.827 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.069e+00  2.764e-03 386.568  < 2e-16 ***
## dim.1        1.267e-02  7.947e-04  15.948  < 2e-16 ***
## dim.2        1.038e-04  1.108e-03   0.094 0.925402    
## dim.3        2.820e-03  1.243e-03   2.268 0.023331 *  
## dim.4        3.368e-05  1.318e-03   0.026 0.979615    
## dim.5        1.406e-02  1.297e-03  10.840  < 2e-16 ***
## dim.6       -1.439e-02  1.496e-03  -9.617  < 2e-16 ***
## dim.7       -7.999e-04  1.529e-03  -0.523 0.600924    
## dim.8        1.587e-02  1.532e-03  10.360  < 2e-16 ***
## dim.9        1.287e-02  1.810e-03   7.111 1.15e-12 ***
## dim.10      -9.382e-03  1.664e-03  -5.638 1.73e-08 ***
## dim.11       2.722e-02  1.798e-03  15.140  < 2e-16 ***
## dim.12      -8.858e-03  1.774e-03  -4.994 5.91e-07 ***
## dim.13      -5.232e-03  1.802e-03  -2.903 0.003694 ** 
## dim.14       2.361e-02  1.825e-03  12.939  < 2e-16 ***
## dim.15       1.427e-02  1.915e-03   7.448 9.50e-14 ***
## dim.16       1.813e-03  2.030e-03   0.893 0.371637    
## dim.17      -1.593e-03  2.067e-03  -0.771 0.440920    
## dim.18       1.970e-04  2.033e-03   0.097 0.922773    
## dim.19      -1.184e-03  2.145e-03  -0.552 0.581091    
## dim.20       2.323e-03  2.527e-03   0.919 0.358016    
## dim.21      -5.577e-03  2.452e-03  -2.274 0.022968 *  
## dim.22      -3.404e-03  2.343e-03  -1.453 0.146240    
## dim.23       1.404e-04  2.394e-03   0.059 0.953226    
## dim.24       3.613e-03  2.309e-03   1.565 0.117613    
## dim.25       1.106e-03  2.238e-03   0.494 0.621382    
## dim.26       2.810e-03  2.172e-03   1.294 0.195706    
## dim.27      -1.134e-03  2.091e-03  -0.543 0.587459    
## dim.28       1.128e-02  2.030e-03   5.555 2.78e-08 ***
## dim.29       9.297e-04  2.009e-03   0.463 0.643466    
## dim.30      -5.843e-03  2.162e-03  -2.702 0.006884 ** 
## dim.31      -9.956e-03  2.138e-03  -4.656 3.23e-06 ***
## dim.32      -8.050e-03  2.148e-03  -3.748 0.000179 ***
## dim.33       8.989e-03  2.153e-03   4.175 2.98e-05 ***
## dim.34      -1.527e-03  2.147e-03  -0.711 0.476890    
## dim.35      -6.074e-03  2.162e-03  -2.810 0.004961 ** 
## dim.36       1.241e-03  2.181e-03   0.569 0.569274    
## dim.37      -8.856e-04  2.155e-03  -0.411 0.681075    
## dim.38      -4.526e-03  2.151e-03  -2.104 0.035400 *  
## dim.39       1.579e-03  2.132e-03   0.741 0.458898    
## dim.40       6.740e-03  2.122e-03   3.176 0.001493 ** 
## dim.41      -2.626e-03  2.136e-03  -1.229 0.218939    
## dim.42      -7.717e-04  2.217e-03  -0.348 0.727783    
## dim.43       5.616e-03  2.277e-03   2.466 0.013645 *  
## dim.44      -2.551e-03  2.302e-03  -1.108 0.267773    
## dim.45      -3.993e-03  2.288e-03  -1.745 0.080954 .  
## dim.46       1.702e-03  2.317e-03   0.734 0.462759    
## dim.47      -2.686e-03  2.336e-03  -1.150 0.250239    
## dim.48       1.314e-03  2.329e-03   0.564 0.572754    
## dim.49       2.781e-03  2.338e-03   1.190 0.234192    
## dim.50       1.377e-02  2.265e-03   6.078 1.21e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.678 on 512477 degrees of freedom
##   (5930 observations deleted due to missingness)
## Multiple R-squared:  0.00253,    Adjusted R-squared:  0.002433 
## F-statistic:    26 on 50 and 512477 DF,  p-value: < 2.2e-16

I then rank the dimensions by the absolute value of their t-statistics to decide which to iterate over.
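
A minimal sketch of this ranking step on simulated data (the real `lm_out` is the fit above; here I also drop the intercept before picking the top dimension):

```r
set.seed(1)
d <- data.frame(y = rnorm(100), dim.1 = rnorm(100), dim.2 = rnorm(100))
d$y <- d$y + 0.5 * d$dim.1          # make dim.1 the strongest predictor
lm_out <- lm(y ~ ., data = d)

# rank predictors by the absolute t-statistic, dropping the intercept
t_values <- summary(lm_out)$coefficients[, "t value"]
ordered_dims <- sort(abs(t_values[names(t_values) != "(Intercept)"]),
                     decreasing = TRUE)
names(ordered_dims)[1]   # dimension to use as q
```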

##  (Intercept)        dim.1       dim.11       dim.14        dim.5        dim.8 
## 386.56780315  15.94846255  15.14036283  12.93865664  10.84006800  10.36001754 
##        dim.6       dim.15        dim.9       dim.50       dim.10       dim.28 
##   9.61702473   7.44791817   7.11147365   6.07841542   5.63753951   5.55499607 
##       dim.12       dim.31       dim.33       dim.32       dim.40       dim.13 
##   4.99426904   4.65555664   4.17521508   3.74766142   3.17598369   2.90322393 
##       dim.35       dim.30       dim.43       dim.21        dim.3       dim.38 
##   2.80958011   2.70243932   2.46648739   2.27396723   2.26798054   2.10375315 
##       dim.45       dim.24       dim.22       dim.26       dim.41       dim.49 
##   1.74518227   1.56487544   1.45294324   1.29388343   1.22935554   1.18963075 
##       dim.47       dim.44       dim.20       dim.16       dim.17       dim.39 
##   1.14977056   1.10820573   0.91915268   0.89341195   0.77064192   0.74066254 
##       dim.46       dim.34       dim.36       dim.48       dim.19       dim.27 
##   0.73431187   0.71131455   0.56912093   0.56400112   0.55179195   0.54252279 
##        dim.7       dim.25       dim.29       dim.37       dim.42       dim.18 
##   0.52307229   0.49389356   0.46285884   0.41099676   0.34807639   0.09694169 
##        dim.2       dim.23        dim.4 
##   0.09363085   0.05865578   0.02555196

E.g. here, dimension 1 is the most significant. I check that the relationship between \(p\) and this dimension is monotonic, and it seems to be.
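
`monotonic_plot` (my helper, not shown here) does this visually; a rough stand-in for the check, on toy data, is to bin \(q\) into deciles and look at the mean of \(-\log_{10}(p)\) per bin:

```r
set.seed(1)
n <- 1e4
q <- rnorm(n)
# toy p-values enriched at large q: test statistics shifted by max(q, 0)
z <- rnorm(n, mean = pmax(q, 0))
p <- 2 * pnorm(-abs(z))

# mean -log10(p) within deciles of q; roughly monotonic if it moves one way
bins <- cut(q, breaks = quantile(q, probs = seq(0, 1, 0.1)),
            include.lowest = TRUE)
bin_means <- tapply(-log10(p), bins, mean)
bin_means
```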


Results


I check the results using various parameter options, again using only the ~500,000 independent SNPs for the KDE.


  1. res_p=500; res_q=500; nxbin=500 (time ~ 25 mins)


  2. res_p=500; res_q=500; nxbin=1000 (time ~ 40 mins)


  3. res_p=500; res_q=500; nxbin=2000 (time ~ 1.4 hrs)


Spline correction


At each iteration, we need to correct for stray \(v\) values (usually occurring at low \(q\), where the KDE struggles). To do this, I fit a spline to the data and map a point back onto the spline if its distance from the spline is too large.

The blue points are those mapped back onto the spline (red line). For example, using a dist_thr value of 0.5 corrects 116 of the ~2 million points.

df <- readRDS("first_iter_res_xbin1000.RDS")
df$q <- -df$q

plot(x = df$q, y = log10(df$v)-log10(df$p))

library(bigsplines)
## Loading required package: quadprog
bigspline_fit <- bigspline(x = df$q, y = log10(df$v/df$p), nknots = 5, rparm = NA)
lines(bigspline_fit$x, bigspline_fit$fitted, col  = "red")

dist <- abs(log10(df$v/df$p)-bigspline_fit$fitted.values)

dist_thr <- 0.5

points(bigspline_fit$x[which(dist>dist_thr)], log10(df$v/df$p)[which(dist>dist_thr)], col = "blue")

# map the flagged points back onto the spline (capping v at 1)
df$v_new <- df$v
df$v_new[which(dist>dist_thr)] <- pmin(10^bigspline_fit$fitted[which(dist>dist_thr)]*df$p[which(dist>dist_thr)],1)
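
If bigsplines is unavailable, a similar correction can be sketched with base R's smooth.spline (toy data; `ratio` stands in for log10(v/p) and `q` for the conditional values):

```r
set.seed(1)
n <- 2000
q <- rnorm(n)
ratio <- sin(q) + rnorm(n, sd = 0.1)
ratio[1:5] <- ratio[1:5] + 3            # a few stray points to correct

fit <- smooth.spline(q, ratio, df = 5)  # df plays a role similar to nknots
fitted_vals <- predict(fit, q)$y

dist <- abs(ratio - fitted_vals)
dist_thr <- 0.5
corrected <- which(dist > dist_thr)     # points mapped back onto the spline
ratio_new <- ratio
ratio_new[corrected] <- fitted_vals[corrected]
length(corrected)
```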


Parameters


Optional parameters are:

res_p: Number of grid points for \(zp\) in KDE and cFDR curve approximation

res_q: Number of grid points for \(q\) in KDE and cFDR curve approximation

nxbin: Number of bins in \(zp\) direction for hexbinning

nknots: Number of knots for spline in spline correction

dist_thr: Map \(v\) values back onto the spline for points whose distance from the spline is \(\geq\) dist_thr

In my example, \(lims_{zp}=(0, 12.26)\) and \(lims_{q}=(-6.66, 20.15)\) (with 10% added on to the range of values), and these are split into grids of size res_p and res_q respectively. The user doesn’t see these limits, so I want a selection process for res_p and res_q that depends on the range of \(zp\) and \(q\), which is easier for the user. E.g. “we recommend res_p ~= range(zp)/0.04 and res_q ~= range(q)/0.04, so that the ranges of \(zp\) and \(q\) are split into segments of approximately 0.04”.
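
The suggested rule could be implemented as a small helper (hypothetical; `suggest_res` is not part of the package, and the 10% buffer matches the description above):

```r
# hypothetical helper: choose a grid resolution so that the range (with the
# 10% buffer added) is split into segments of ~0.04
suggest_res <- function(x, spacing = 0.04, buffer = 0.1) {
  ceiling(diff(range(x, na.rm = TRUE)) * (1 + buffer) / spacing)
}

zp <- c(0, 11.15)   # toy values whose buffered range is ~(0, 12.26)
suggest_res(zp)     # 307
```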

nxbin depends more on the number of data points. I found that nxbin=1000 worked well for the asthma data with 1,978,000 SNPs. I think nxbin=1000 is a good default for nsnps < 2,000,000, but it should be increased for larger analyses.


Final results for iteration 1


For the first dimension, I think the following parameter values are best. This takes ~25 mins. See cFDR_final/asthma_v4/ on HPC. Note that I’ve decided to use dist_thr=0.2 (so 1749 SNPs are spline corrected, rather than 116 when using dist_thr=0.5).

res <- functional_cFDR_v4(p, q, ind_snps, nxbin = 1000, res_p = 300, res_q = 500,  dist_thr = 0.2, nknot = 5)


Problem


For each iteration we need monotonicity between \(p\) (\(v\) value from previous iteration) and \(q\). We could therefore use the following procedure:

  1. Regress log(p) on PCAmix dimensions
  2. Stop if nothing significant
  3. Identify the strongest dimension and use this as \(q\)
  4. Perform functional cFDR
  5. Go back to 1.
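
The steps above can be sketched as a toy loop (`run_cfdr` is a hypothetical stand-in for the real functional cFDR step, which would return updated \(v\) values):

```r
# toy sketch of the proposed iterative procedure
run_cfdr <- function(p, q) p   # placeholder: would be functional_cFDR_v4

set.seed(1)
n <- 5000
coords <- data.frame(dim.1 = rnorm(n), dim.2 = rnorm(n), dim.3 = rnorm(n))
p <- pmin(exp(rnorm(n, mean = -1 - coords$dim.1)), 1)  # only dim.1 matters

p_cur <- p
dims_left <- names(coords)
used <- character(0)
t_thr <- 4   # 2. stop when no coefficient reaches this |t|

repeat {
  fit <- lm(log(p_cur) ~ ., data = coords[dims_left])  # 1. regress log(p)
  tv  <- abs(summary(fit)$coefficients[-1, "t value"])
  if (max(tv) < t_thr) break                           # 2. nothing significant
  best  <- dims_left[which.max(tv)]                    # 3. strongest dimension
  p_cur <- run_cfdr(p_cur, coords[[best]])             # 4. functional cFDR
  used  <- c(used, best)
  dims_left <- setdiff(dims_left, best)                # 5. go back to 1
}
used
```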

I check whether this method works for iteration 2 (removing dimension 1, which was used as \(q\) in the first iteration). It does not: the most significant dimension (dimension 5) is not monotonic.

df <- readRDS("1iter_res_dist02.RDS")
ind_coords <- readRDS("final/data/LDSC_annots_PCAmix_indcoords.RDS")

p_ind <- df[match(ind_coords$SNP, df$SNP),"v"]

# use v value from previous iteration and remove dimensions used previously
regression_df <- data.frame(chisq = log(p_ind), ind_coords[,2:50])

lm_out <- lm(chisq ~ ., data = regression_df)
summary(lm_out)
## 
## Call:
## lm(formula = chisq ~ ., data = regression_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.998  -0.374   0.312   0.724   1.594 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -1.0524614  0.0018387 -572.398  < 2e-16 ***
## dim.2       -0.0039299  0.0007427   -5.291 1.22e-07 ***
## dim.3        0.0196189  0.0008306   23.620  < 2e-16 ***
## dim.4        0.0125409  0.0008830   14.202  < 2e-16 ***
## dim.5       -0.0342125  0.0008671  -39.456  < 2e-16 ***
## dim.6       -0.0027541  0.0010022   -2.748 0.005994 ** 
## dim.7       -0.0080815  0.0010234   -7.897 2.87e-15 ***
## dim.8       -0.0258269  0.0010262  -25.167  < 2e-16 ***
## dim.9       -0.0219276  0.0012119  -18.094  < 2e-16 ***
## dim.10      -0.0139272  0.0011138  -12.504  < 2e-16 ***
## dim.11      -0.0080571  0.0012046   -6.688 2.26e-11 ***
## dim.12       0.0007173  0.0011885    0.604 0.546159    
## dim.13       0.0062826  0.0012073    5.204 1.95e-07 ***
## dim.14      -0.0321967  0.0012212  -26.365  < 2e-16 ***
## dim.15      -0.0188370  0.0012824  -14.689  < 2e-16 ***
## dim.16      -0.0076126  0.0013601   -5.597 2.18e-08 ***
## dim.17       0.0023767  0.0013849    1.716 0.086137 .  
## dim.18       0.0025553  0.0013621    1.876 0.060657 .  
## dim.19       0.0049649  0.0014371    3.455 0.000551 ***
## dim.20       0.0007614  0.0016937    0.450 0.653027    
## dim.21       0.0053968  0.0016437    3.283 0.001026 ** 
## dim.22       0.0063083  0.0015699    4.018 5.86e-05 ***
## dim.23       0.0003987  0.0016044    0.248 0.803751    
## dim.24      -0.0008819  0.0015474   -0.570 0.568729    
## dim.25      -0.0015179  0.0015004   -1.012 0.311691    
## dim.26      -0.0025854  0.0014557   -1.776 0.075736 .  
## dim.27      -0.0022792  0.0014015   -1.626 0.103890    
## dim.28      -0.0057994  0.0013604   -4.263 2.02e-05 ***
## dim.29      -0.0024318  0.0013462   -1.807 0.070840 .  
## dim.30       0.0087657  0.0014486    6.051 1.44e-09 ***
## dim.31       0.0069020  0.0014325    4.818 1.45e-06 ***
## dim.32       0.0001607  0.0014391    0.112 0.911065    
## dim.33      -0.0107898  0.0014425   -7.480 7.45e-14 ***
## dim.34       0.0098543  0.0014378    6.854 7.21e-12 ***
## dim.35       0.0026756  0.0014484    1.847 0.064699 .  
## dim.36      -0.0047257  0.0014612   -3.234 0.001220 ** 
## dim.37      -0.0033004  0.0014437   -2.286 0.022247 *  
## dim.38       0.0039173  0.0014415    2.717 0.006579 ** 
## dim.39      -0.0086003  0.0014268   -6.028 1.67e-09 ***
## dim.40      -0.0016387  0.0014219   -1.152 0.249127    
## dim.41       0.0035353  0.0014300    2.472 0.013428 *  
## dim.42      -0.0010432  0.0014853   -0.702 0.482473    
## dim.43      -0.0107018  0.0015245   -7.020 2.22e-12 ***
## dim.44      -0.0134190  0.0015416   -8.705  < 2e-16 ***
## dim.45       0.0127146  0.0015323    8.298  < 2e-16 ***
## dim.46      -0.0032753  0.0015523   -2.110 0.034861 *  
## dim.47       0.0035129  0.0015649    2.245 0.024785 *  
## dim.48       0.0022691  0.0015604    1.454 0.145899    
## dim.49      -0.0028781  0.0015668   -1.837 0.066225 .  
## dim.50      -0.0124647  0.0015175   -8.214  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.125 on 513062 degrees of freedom
## Multiple R-squared:  0.009776,   Adjusted R-squared:  0.009681 
## F-statistic: 103.4 on 49 and 513062 DF,  p-value: < 2.2e-16
t_values <- summary(lm_out)$coefficients[,3]

ordered_dims  <- sort(abs(t_values), decreasing = TRUE)

ordered_dims
## (Intercept)       dim.5      dim.14       dim.8       dim.3       dim.9 
## 572.3977901  39.4560184  26.3645066  25.1673200  23.6198880  18.0941186 
##      dim.15       dim.4      dim.10      dim.44      dim.45      dim.50 
##  14.6890460  14.2021711  12.5039404   8.7045150   8.2975158   8.2138314 
##       dim.7      dim.33      dim.43      dim.34      dim.11      dim.30 
##   7.8968800   7.4798500   7.0199238   6.8535636   6.6884220   6.0510221 
##      dim.39      dim.16       dim.2      dim.13      dim.31      dim.28 
##   6.0275183   5.5969995   5.2910663   5.2040785   4.8183345   4.2628332 
##      dim.22      dim.19      dim.21      dim.36       dim.6      dim.38 
##   4.0183412   3.4547218   3.2833741   3.2341923   2.7481376   2.7174303 
##      dim.41      dim.37      dim.47      dim.46      dim.18      dim.35 
##   2.4722156   2.2861387   2.2447502   2.1099801   1.8759915   1.8473393 
##      dim.49      dim.29      dim.26      dim.17      dim.27      dim.48 
##   1.8369023   1.8065047   1.7759835   1.7161425   1.6262811   1.4541740 
##      dim.40      dim.25      dim.42      dim.12      dim.24      dim.20 
##   1.1524738   1.0116810   0.7023317   0.6035262   0.5699246   0.4495613 
##      dim.23      dim.32 
##   0.2484954   0.1116953
#all_coords <- readRDS("final/data/LDSC_annots_PCAmix_coords.RDS")

all_coords <- coords

monotonic_plot(df$v, all_coords$dim.5, title = "Iteration 2 (p = v1, q = dim.5)")


Possible solution 1


A possible solution, which we looked at earlier, is to iterate twice over problematic dimensions, i.e. \((v1, q2) = v2_{tmp}\) -> \((v2_{tmp}, -q2) = v2\). I check the results of doing this where q2 is dimension 5, shown above to be the most significant. This doesn’t seem to do the trick (I was hoping it would give smaller v2 values for both high and low \(q\)). But why doesn’t the second step shrink the \(v\) values at small \(q\)?

For completeness I also show v2 against v1 plots.


Possible solution 2


Another solution would be to adjust for data pairs with q<q' and q>q' separately, where q' is chosen as the point where the relationship between p and q changes.

E.g. for iteration 2 (\(p=v1\) and \(q=dim.5\)), the inflection point is at \(q \approx 1.66\).

df <- readRDS("1iter_res_dist02.RDS")
# all_coords <- readRDS("final/data/LDSC_annots_PCAmix_coords.RDS")

df_tmp <- data.frame(p = df$v, q = all_coords$dim.5)

p <- df_tmp$p
q <- df_tmp$q
q2 <- q^2

quadratic.model <- lm(-log10(p) ~ q + q2)
summary(quadratic.model)
## 
## Call:
## lm(formula = -log10(p) ~ q + q2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2128 -0.3145 -0.1498  0.1397 27.9679 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.157e-01  3.834e-04 1084.28   <2e-16 ***
## q           -2.636e-02  2.695e-04  -97.82   <2e-16 ***
## q2           7.925e-03  3.326e-05  238.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5225 on 1978014 degrees of freedom
## Multiple R-squared:  0.02903,    Adjusted R-squared:  0.02903 
## F-statistic: 2.957e+04 on 2 and 1978014 DF,  p-value: < 2.2e-16
q_vals <- seq(range(q)[1],range(q)[2],0.005)
predp <- predict(quadratic.model, list(q = q_vals, q2 = q_vals^2))
plot(q_vals, predp, pch = 16, xlab = "q", ylab = "-log10(p)")

infl <- c(FALSE, diff(diff(predp)>0)!=0, FALSE)  # where the slope changes sign
abline(v = q_vals[infl], col = "blue")
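
Since the fitted curve is a quadratic \(a + bq + cq^2\), the point where its slope changes sign can also be read off analytically as \(-b/(2c)\), as a check on the numeric search above (coefficients copied from the model summary):

```r
b  <- -2.636e-02   # coefficient on q
c2 <-  7.925e-03   # coefficient on q^2
-b / (2 * c2)      # ~1.66, matching the vertical line in the plot
```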

I split the dataset into those with \(q<1.66\) and those with \(q\geq 1.66\) and perform functional cFDR on each separately (flipping the sign of \(q\) for the latter so that small \(p\) is enriched for small \(q\)).

df_low <- df_tmp[which(df_tmp$q<1.66),]
dim(df_low)
## [1] 1881448       2
df_high <- df_tmp[-which(df_tmp$q<1.66),]
dim(df_high)
## [1] 96569     2
monotonic_plot(df_low$p, df_low$q, title = "Data points with q<1.66")

monotonic_plot(df_high$p, df_high$q, title = "Data points with q>=1.66")

The results seem to do roughly the right thing (\(v\) values are shrunk for both high and low \(q\)). The weird stripes appear around the inflection point (1.66).


Comparing with UKBB


I compare the results above to those from the UK Biobank data (after 2 iterations).

Of the 1,978,017 SNPs, 1,971,252 are found in the UKBB results.

Of these, when using \(5\times10^{-6}\) as our significance threshold, we see that our method prioritises more SNPs that are also found to be significant in the larger UKBB sample after the first iteration, but not after the second…

Correlations:


Comments and notes