Using only the subset of independent SNPs (~500,000 of the ~2 million, i.e. ~25%), I regress the \(\chi^2\) statistics against the coordinates of each dimension of the PCAmix results.
##
## Call:
## lm(formula = chisq ~ ., data = regression_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.583 -0.920 -0.577 0.297 104.827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.069e+00 2.764e-03 386.568 < 2e-16 ***
## dim.1 1.267e-02 7.947e-04 15.948 < 2e-16 ***
## dim.2 1.038e-04 1.108e-03 0.094 0.925402
## dim.3 2.820e-03 1.243e-03 2.268 0.023331 *
## dim.4 3.368e-05 1.318e-03 0.026 0.979615
## dim.5 1.406e-02 1.297e-03 10.840 < 2e-16 ***
## dim.6 -1.439e-02 1.496e-03 -9.617 < 2e-16 ***
## dim.7 -7.999e-04 1.529e-03 -0.523 0.600924
## dim.8 1.587e-02 1.532e-03 10.360 < 2e-16 ***
## dim.9 1.287e-02 1.810e-03 7.111 1.15e-12 ***
## dim.10 -9.382e-03 1.664e-03 -5.638 1.73e-08 ***
## dim.11 2.722e-02 1.798e-03 15.140 < 2e-16 ***
## dim.12 -8.858e-03 1.774e-03 -4.994 5.91e-07 ***
## dim.13 -5.232e-03 1.802e-03 -2.903 0.003694 **
## dim.14 2.361e-02 1.825e-03 12.939 < 2e-16 ***
## dim.15 1.427e-02 1.915e-03 7.448 9.50e-14 ***
## dim.16 1.813e-03 2.030e-03 0.893 0.371637
## dim.17 -1.593e-03 2.067e-03 -0.771 0.440920
## dim.18 1.970e-04 2.033e-03 0.097 0.922773
## dim.19 -1.184e-03 2.145e-03 -0.552 0.581091
## dim.20 2.323e-03 2.527e-03 0.919 0.358016
## dim.21 -5.577e-03 2.452e-03 -2.274 0.022968 *
## dim.22 -3.404e-03 2.343e-03 -1.453 0.146240
## dim.23 1.404e-04 2.394e-03 0.059 0.953226
## dim.24 3.613e-03 2.309e-03 1.565 0.117613
## dim.25 1.106e-03 2.238e-03 0.494 0.621382
## dim.26 2.810e-03 2.172e-03 1.294 0.195706
## dim.27 -1.134e-03 2.091e-03 -0.543 0.587459
## dim.28 1.128e-02 2.030e-03 5.555 2.78e-08 ***
## dim.29 9.297e-04 2.009e-03 0.463 0.643466
## dim.30 -5.843e-03 2.162e-03 -2.702 0.006884 **
## dim.31 -9.956e-03 2.138e-03 -4.656 3.23e-06 ***
## dim.32 -8.050e-03 2.148e-03 -3.748 0.000179 ***
## dim.33 8.989e-03 2.153e-03 4.175 2.98e-05 ***
## dim.34 -1.527e-03 2.147e-03 -0.711 0.476890
## dim.35 -6.074e-03 2.162e-03 -2.810 0.004961 **
## dim.36 1.241e-03 2.181e-03 0.569 0.569274
## dim.37 -8.856e-04 2.155e-03 -0.411 0.681075
## dim.38 -4.526e-03 2.151e-03 -2.104 0.035400 *
## dim.39 1.579e-03 2.132e-03 0.741 0.458898
## dim.40 6.740e-03 2.122e-03 3.176 0.001493 **
## dim.41 -2.626e-03 2.136e-03 -1.229 0.218939
## dim.42 -7.717e-04 2.217e-03 -0.348 0.727783
## dim.43 5.616e-03 2.277e-03 2.466 0.013645 *
## dim.44 -2.551e-03 2.302e-03 -1.108 0.267773
## dim.45 -3.993e-03 2.288e-03 -1.745 0.080954 .
## dim.46 1.702e-03 2.317e-03 0.734 0.462759
## dim.47 -2.686e-03 2.336e-03 -1.150 0.250239
## dim.48 1.314e-03 2.329e-03 0.564 0.572754
## dim.49 2.781e-03 2.338e-03 1.190 0.234192
## dim.50 1.377e-02 2.265e-03 6.078 1.21e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.678 on 512477 degrees of freedom
## (5930 observations deleted due to missingness)
## Multiple R-squared: 0.00253, Adjusted R-squared: 0.002433
## F-statistic: 26 on 50 and 512477 DF, p-value: < 2.2e-16
I then rank the dimensions by their absolute t-statistic to decide which to iterate over.
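This ranking step can be sketched as follows (a toy data frame is used here as a stand-in, since the full `regression_df` isn't reproduced; the same pattern is used again for iteration 2):

```r
# Toy illustration of ranking regression dimensions by |t|.
# `regression_df` is a stand-in for the real chisq ~ dims data frame.
set.seed(1)
regression_df <- data.frame(chisq = rchisq(1000, df = 1),
                            dim.1 = rnorm(1000),
                            dim.2 = rnorm(1000))
lm_out <- lm(chisq ~ ., data = regression_df)
t_values <- summary(lm_out)$coefficients[, 3]        # "t value" column
ordered_dims <- sort(abs(t_values), decreasing = TRUE)
names(ordered_dims)                                  # most significant first
```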
## (Intercept) dim.1 dim.11 dim.14 dim.5 dim.8
## 386.56780315 15.94846255 15.14036283 12.93865664 10.84006800 10.36001754
## dim.6 dim.15 dim.9 dim.50 dim.10 dim.28
## 9.61702473 7.44791817 7.11147365 6.07841542 5.63753951 5.55499607
## dim.12 dim.31 dim.33 dim.32 dim.40 dim.13
## 4.99426904 4.65555664 4.17521508 3.74766142 3.17598369 2.90322393
## dim.35 dim.30 dim.43 dim.21 dim.3 dim.38
## 2.80958011 2.70243932 2.46648739 2.27396723 2.26798054 2.10375315
## dim.45 dim.24 dim.22 dim.26 dim.41 dim.49
## 1.74518227 1.56487544 1.45294324 1.29388343 1.22935554 1.18963075
## dim.47 dim.44 dim.20 dim.16 dim.17 dim.39
## 1.14977056 1.10820573 0.91915268 0.89341195 0.77064192 0.74066254
## dim.46 dim.34 dim.36 dim.48 dim.19 dim.27
## 0.73431187 0.71131455 0.56912093 0.56400112 0.55179195 0.54252279
## dim.7 dim.25 dim.29 dim.37 dim.42 dim.18
## 0.52307229 0.49389356 0.46285884 0.41099676 0.34807639 0.09694169
## dim.2 dim.23 dim.4
## 0.09363085 0.05865578 0.02555196
Here, dimension 1 is the most significant. I check that the relationship with \(p\) is monotonic. Yes, it seems to be.
I check the results using various parameter options, again using only the 500,000 independent SNPs for the KDE:

- `res_p=500; res_q=500; nxbin=500` (time ~25 mins)
- `res_p=500; res_q=500; nxbin=1000` (time ~40 mins)
- `res_p=500; res_q=500; nxbin=2000` (time ~1.4 hrs)
At each iteration, we need to correct for awol \(v\) values (usually occurring at low \(q\), where the KDE struggles). To do this, I fit a spline to the data and map a point back to the spline if its distance from the spline is too large. In the plot, the blue points are those mapped back to the spline (red line); e.g. using a dist_thr
value of 0.5 corrects 116 of the ~2 million points.
df <- readRDS("first_iter_res_xbin1000.RDS")
df$q <- -df$q
plot(x = df$q, y = log10(df$v) - log10(df$p))
library(bigsplines)
## Loading required package: quadprog
# fit a smoothing spline to log10(v/p) against q
bigspline_fit <- bigspline(x = df$q, y = log10(df$v/df$p), nknots = 5, rparm = NA)
lines(bigspline_fit$x, bigspline_fit$fitted.values, col = "red")
# distance of each point from the fitted spline
dist <- abs(log10(df$v/df$p) - bigspline_fit$fitted.values)
dist_thr <- 0.5
# highlight the points that will be mapped back to the spline
points(bigspline_fit$x[which(dist > dist_thr)], log10(df$v/df$p)[which(dist > dist_thr)], col = "blue")
# replace awol v values with their spline fit (capping v at 1)
df$v_new <- df$v
df$v_new[which(dist > dist_thr)] <- pmin(10^bigspline_fit$fitted.values[which(dist > dist_thr)]*df$p[which(dist > dist_thr)], 1)
Optional parameters are:
res_p
: Number of grid points for \(zp\) in KDE and cFDR curve approximation
res_q
: Number of grid points for \(q\) in KDE and cFDR curve approximation
xbin
: Number of bins in \(zp\) direction for hexbinning
nknots
: Number of knots for spline in spline correction
dist_thr
: Map \(v\) values back to the spline for points whose distance to the spline is \(\geq\) dist_thr
In my example, \(lims_{zp}=(0, 12.26)\) and \(lims_{q}=(-6.66, 20.15)\) (10% added on to the range of values), which are split into grids of size res_p and res_q. The user doesn’t see these limits, so I want a res_p and res_q selection process that depends only on the range of \(zp\) and \(q\), which is easier for the user. E.g. “we recommend selecting res_p ~= range(zp)/0.04 and res_q ~= range(q)/0.04, i.e. values such that the ranges of \(zp\) and \(q\) are split into segments of approximately 0.04”.
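A minimal sketch of that rule of thumb (`choose_res` is a hypothetical helper, not part of the package):

```r
# Split the observed range into segments of width ~0.04 (suggested heuristic)
choose_res <- function(x, seg_width = 0.04) {
  ceiling(diff(range(x)) / seg_width)
}

choose_res(c(0, 12.26))   # e.g. the zp limits above -> 307 grid points
```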
xbin depends more on the number of data points. I found that xbin=1000 was good for the asthma data (1,978,000 SNPs). I think xbin=1000 is a good suggestion for nsnps < 2,000,000, but it should be increased for a bigger analysis.
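One possible way to encode this suggestion (the linear scaling above 2 million SNPs is my assumption, not something tested here):

```r
# Hypothetical helper: keep xbin = 1000 up to 2 million SNPs,
# then scale linearly with the number of SNPs (assumed scaling).
choose_xbin <- function(nsnps) {
  if (nsnps < 2e6) 1000 else ceiling(1000 * nsnps / 2e6)
}

choose_xbin(1978000)   # asthma data -> 1000
```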
For the first dimension, I think the following parameter values are best. This takes ~25 mins; see cFDR_final/asthma_v4/ on HPC. Note that I’ve decided to use dist_thr=0.2 (so 1749 SNPs are spline corrected, rather than 116 when using dist_thr=0.5).
res <- functional_cFDR_v4(p, q, ind_snps, nxbin = 1000, res_p = 300, res_q = 500, dist_thr = 0.2, nknot = 5)
For each iteration we need monotonicity between \(p\) (the \(v\) value from the previous iteration) and \(q\). We could therefore use the following procedure:
I check if this method works for iteration 2 (removing dimension 1, which was used as \(q\) in the first iteration). No: the most significant dimension (dimension 5) is not monotonic.
df <- readRDS("1iter_res_dist02.RDS")
ind_coords <- readRDS("final/data/LDSC_annots_PCAmix_indcoords.RDS")
p_ind <- df[match(ind_coords$SNP, df$SNP),"v"]
# use v value from previous iteration and remove dimensions used previously
regression_df <- data.frame(chisq = log(p_ind), ind_coords[,2:50])
lm_out <- lm(chisq ~ ., data = regression_df)
summary(lm_out)
##
## Call:
## lm(formula = chisq ~ ., data = regression_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.998 -0.374 0.312 0.724 1.594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0524614 0.0018387 -572.398 < 2e-16 ***
## dim.2 -0.0039299 0.0007427 -5.291 1.22e-07 ***
## dim.3 0.0196189 0.0008306 23.620 < 2e-16 ***
## dim.4 0.0125409 0.0008830 14.202 < 2e-16 ***
## dim.5 -0.0342125 0.0008671 -39.456 < 2e-16 ***
## dim.6 -0.0027541 0.0010022 -2.748 0.005994 **
## dim.7 -0.0080815 0.0010234 -7.897 2.87e-15 ***
## dim.8 -0.0258269 0.0010262 -25.167 < 2e-16 ***
## dim.9 -0.0219276 0.0012119 -18.094 < 2e-16 ***
## dim.10 -0.0139272 0.0011138 -12.504 < 2e-16 ***
## dim.11 -0.0080571 0.0012046 -6.688 2.26e-11 ***
## dim.12 0.0007173 0.0011885 0.604 0.546159
## dim.13 0.0062826 0.0012073 5.204 1.95e-07 ***
## dim.14 -0.0321967 0.0012212 -26.365 < 2e-16 ***
## dim.15 -0.0188370 0.0012824 -14.689 < 2e-16 ***
## dim.16 -0.0076126 0.0013601 -5.597 2.18e-08 ***
## dim.17 0.0023767 0.0013849 1.716 0.086137 .
## dim.18 0.0025553 0.0013621 1.876 0.060657 .
## dim.19 0.0049649 0.0014371 3.455 0.000551 ***
## dim.20 0.0007614 0.0016937 0.450 0.653027
## dim.21 0.0053968 0.0016437 3.283 0.001026 **
## dim.22 0.0063083 0.0015699 4.018 5.86e-05 ***
## dim.23 0.0003987 0.0016044 0.248 0.803751
## dim.24 -0.0008819 0.0015474 -0.570 0.568729
## dim.25 -0.0015179 0.0015004 -1.012 0.311691
## dim.26 -0.0025854 0.0014557 -1.776 0.075736 .
## dim.27 -0.0022792 0.0014015 -1.626 0.103890
## dim.28 -0.0057994 0.0013604 -4.263 2.02e-05 ***
## dim.29 -0.0024318 0.0013462 -1.807 0.070840 .
## dim.30 0.0087657 0.0014486 6.051 1.44e-09 ***
## dim.31 0.0069020 0.0014325 4.818 1.45e-06 ***
## dim.32 0.0001607 0.0014391 0.112 0.911065
## dim.33 -0.0107898 0.0014425 -7.480 7.45e-14 ***
## dim.34 0.0098543 0.0014378 6.854 7.21e-12 ***
## dim.35 0.0026756 0.0014484 1.847 0.064699 .
## dim.36 -0.0047257 0.0014612 -3.234 0.001220 **
## dim.37 -0.0033004 0.0014437 -2.286 0.022247 *
## dim.38 0.0039173 0.0014415 2.717 0.006579 **
## dim.39 -0.0086003 0.0014268 -6.028 1.67e-09 ***
## dim.40 -0.0016387 0.0014219 -1.152 0.249127
## dim.41 0.0035353 0.0014300 2.472 0.013428 *
## dim.42 -0.0010432 0.0014853 -0.702 0.482473
## dim.43 -0.0107018 0.0015245 -7.020 2.22e-12 ***
## dim.44 -0.0134190 0.0015416 -8.705 < 2e-16 ***
## dim.45 0.0127146 0.0015323 8.298 < 2e-16 ***
## dim.46 -0.0032753 0.0015523 -2.110 0.034861 *
## dim.47 0.0035129 0.0015649 2.245 0.024785 *
## dim.48 0.0022691 0.0015604 1.454 0.145899
## dim.49 -0.0028781 0.0015668 -1.837 0.066225 .
## dim.50 -0.0124647 0.0015175 -8.214 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.125 on 513062 degrees of freedom
## Multiple R-squared: 0.009776, Adjusted R-squared: 0.009681
## F-statistic: 103.4 on 49 and 513062 DF, p-value: < 2.2e-16
t_values <- summary(lm_out)$coefficients[,3]
ordered_dims <- sort(abs(t_values), decreasing = TRUE)
ordered_dims
## (Intercept) dim.5 dim.14 dim.8 dim.3 dim.9
## 572.3977901 39.4560184 26.3645066 25.1673200 23.6198880 18.0941186
## dim.15 dim.4 dim.10 dim.44 dim.45 dim.50
## 14.6890460 14.2021711 12.5039404 8.7045150 8.2975158 8.2138314
## dim.7 dim.33 dim.43 dim.34 dim.11 dim.30
## 7.8968800 7.4798500 7.0199238 6.8535636 6.6884220 6.0510221
## dim.39 dim.16 dim.2 dim.13 dim.31 dim.28
## 6.0275183 5.5969995 5.2910663 5.2040785 4.8183345 4.2628332
## dim.22 dim.19 dim.21 dim.36 dim.6 dim.38
## 4.0183412 3.4547218 3.2833741 3.2341923 2.7481376 2.7174303
## dim.41 dim.37 dim.47 dim.46 dim.18 dim.35
## 2.4722156 2.2861387 2.2447502 2.1099801 1.8759915 1.8473393
## dim.49 dim.29 dim.26 dim.17 dim.27 dim.48
## 1.8369023 1.8065047 1.7759835 1.7161425 1.6262811 1.4541740
## dim.40 dim.25 dim.42 dim.12 dim.24 dim.20
## 1.1524738 1.0116810 0.7023317 0.6035262 0.5699246 0.4495613
## dim.23 dim.32
## 0.2484954 0.1116953
#all_coords <- readRDS("final/data/LDSC_annots_PCAmix_coords.RDS")
all_coords <- coords
monotonic_plot(df$v, all_coords$dim.5, title = "Iteration 2 (p = v1, q = dim.5)")
A possible solution, which we looked at earlier, is to iterate twice over problematic dimensions, i.e. (v1, q2) = v2_tmp, then (v2_tmp, -q2) = v2. I check the results of doing this where q2 is dimension 5, shown to be the most significant above. This doesn’t seem to do the trick (I was hoping it would give smaller v2 values for both high and low \(q\)). But why doesn’t it shrink the \(v\) values for small \(q\) in the second step?
—
For completeness I also show v2 against v1 plots.
Another solution would be to adjust the data pairs with q < q' and q > q' separately, where q' is chosen as the point where the relationship between \(p\) and \(q\) changes direction. E.g. for iteration 2 (\(p=v1\) and \(q=dim.5\)), the inflection point is at \(q \approx 1.66\).
df <- readRDS("1iter_res_dist02.RDS")
# all_coords <- readRDS("final/data/LDSC_annots_PCAmix_coords.RDS")
df_tmp <- data.frame(p = df$v, q = all_coords$dim.5)
p <- df_tmp$p
q <- df_tmp$q
q2 <- q^2
quadratic.model <- lm(-log10(p) ~ q + q2)
summary(quadratic.model)
##
## Call:
## lm(formula = -log10(p) ~ q + q2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2128 -0.3145 -0.1498 0.1397 27.9679
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.157e-01 3.834e-04 1084.28 <2e-16 ***
## q -2.636e-02 2.695e-04 -97.82 <2e-16 ***
## q2 7.925e-03 3.326e-05 238.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5225 on 1978014 degrees of freedom
## Multiple R-squared: 0.02903, Adjusted R-squared: 0.02903
## F-statistic: 2.957e+04 on 2 and 1978014 DF, p-value: < 2.2e-16
q_vals <- seq(range(q)[1],range(q)[2],0.005)
predp <- predict(quadratic.model, list(q = q_vals, q2 = q_vals^2))
plot(q_vals, predp, pch = 16, xlab = "q", ylab = "-log10(p)")
infl <- c(FALSE, diff(diff(predp)>0)!=0)
abline(v = q_vals[infl], col = "blue")
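As a sanity check, the turning point of the fitted quadratic \(a + bq + cq^2\) can also be computed analytically as \(q = -b/(2c)\), using the coefficients from the summary above:

```r
# Coefficients from the quadratic fit above
b  <- -2.636e-02   # linear term (q)
c2 <-  7.925e-03   # quadratic term (q2)
-b / (2 * c2)      # turning point, ~1.66
```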
I split the dataset into points with \(q<1.66\) and points with \(q\geq 1.66\) and perform functional cFDR on each separately (flipping the sign of \(q\) for the latter so that small \(p\) is enriched for small \(q\)).
df_low <- df_tmp[which(df_tmp$q<1.66),]
dim(df_low)
## [1] 1881448 2
df_high <- df_tmp[-which(df_tmp$q<1.66),]
dim(df_high)
## [1] 96569 2
monotonic_plot(df_low$p, df_low$q, title = "Data points with q<1.66")
monotonic_plot(df_high$p, df_high$q, title = "Data points with q>=1.66")
The results seem to do the right thing (ish): \(v\) values are shrunk for both high and low \(q\). The weird stripes occur around the inflection point (1.66).
I compare the results above to those from the UK biobank data (after 2 iterations).
Of the 1,978,017 SNPs, 1,971,252 are found in the UKBB results.
Of these, when using \(5\times 10^{-6}\) as our significance threshold, we see that our method prioritises more SNPs that are also found to be significant in a larger sample (UKBB) after the first iteration, but not after the second…
Correlations:
Comments and notes
Potentially useful: https://www.e-aair.org/search.php?where=aview&id=10.4168/aair.2020.12.e65&code=9999AAIR&vmode=FULL
I’ve run FINDOR on the asthma SNPs, but only 1,034,758 remain after intersection with the LD scores (presumably because some asthma SNPs are not found in the 1000 Genomes phase 3 dataset of LD scores).
I’ve looked at using the rotated dimensions from PCArot, but this doesn’t solve the problem either.
Should I try running functional cFDR for the T1D data? (E.g. to put in my thesis, or is one application enough? I’m thinking asthma could be the non-cell-type-specific example and T1D could be the cell-type-specific example (it uses cell-type Segway data), but that’s quite a big task (LD pruning etc.).)
Need to fix the spline error that occurs for zero \(v\) values.
There are some “monotonicity check” methods that could search for any monotonic dimensions (though I don’t think any are monotonic). https://projecteuclid.org/download/pdf_1/euclid.aos/1016120363; https://www.rdocumentation.org/packages/sm/versions/2.2-5.4/topics/sm.monotonicity