New estimate of \(P(Q\leq q|H0)\)


I consider a new estimate for \(P(Q\leq q|H0)\).

Rather than assuming that all SNPs with \(P>1/2\) are null (which causes problems when estimating the PDF for low \(q\), since all of the SNPs used for the estimation then have large \(q\) values), I calculate the local FDR values, \(P(H0|P=p)\), and use these as weights.

The parameter bre allows me to specify the breakpoints; I need to choose these so that the midpoints (where the local FDR is evaluated) are equal to the grid values used in the KDE (i.e. kpq$x).

library(locfdr)

# midpoints of the KDE grid (kpq$x), plus one extra point half a grid-step beyond the last
mids <- c(kpq$x[-length(kpq$x)] + diff(kpq$x)/2,
          kpq$x[length(kpq$x)] + diff(kpq$x)[length(diff(kpq$x))]/2)

# local FDR on the symmetrised z-scores, with breakpoints at the positive and negative midpoints
lfdr <- locfdr(c(zp_kde, -zp_kde), bre = c(mids, -mids))

For example, the plot below shows the local FDR values for iteration 1 of the analysis, with a vertical line at \(p=1/2\). We see that many more points will be used for the approximation with this new method (those with non-zero weights) than with our original method, which involved hard thresholding at \(p=1/2\).

I then use \(P(H0|P)\) above and \(P(P,Q)\) (bivariate KDE) to estimate

\[\begin{equation} \begin{split} P(P,Q,H0)&=P(H0|P,Q) \times P(P,Q)\\ &\approx P(H0|P) \times P(P,Q) \end{split} \end{equation}\]

and then sum up the rows (p) to estimate \(P(Q,H0)\) and sum up the rows and columns (p and q) to estimate \(P(H0)\). I use these to estimate \(P(Q|H0)=P(Q,H0)/P(H0)\).

Finally, I take the cumulative sum over the columns (q) to estimate our desired quantity: \(P(Q\leq q|H0)\).
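As a minimal sketch of these steps (the object names are assumptions: pq_kde is the bivariate KDE evaluated on the grid, with rows indexed by p and columns by q, and lfdr_p holds the local FDR values \(P(H0|P=p)\) at the row grid points):

# P(P, Q, H0) ~= P(H0|P) * P(P, Q): weight each row of the joint density by the local FDR
pqh0 <- pq_kde * lfdr_p        # lfdr_p recycles down the rows (the p index)

qh0  <- colSums(pqh0)          # sum over p            -> P(Q, H0)
ph0  <- sum(pqh0)              # sum over p and q      -> P(H0)
q_h0 <- qh0 / ph0              # P(Q|H0) on the q grid

cdf_q_h0 <- cumsum(q_h0)       # cumulative sum over q -> P(Q <= q | H0)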


Iteration 1 walkthrough


# read in original p values and PCAmix coords
p_df <- readRDS("orig_p.RDS")
coords <- readRDS("LDSC_annots_PCAmix_coords.RDS")

# only perform regression on subset of independent SNPs
ind_snps <- readRDS("ind_snps.RDS")
ind <- na.omit(match(ind_snps, p_df$SNPID))
# response is log(p) (stored in a column named chisq); predictors are the first 50 PCAmix dimensions
regression_df <- data.frame(chisq = log(p_df$european_ancestry_pval_rand)[ind], coords[ind,1:50])

lm_out <- lm(chisq ~ ., data = regression_df)
summary(lm_out)
## 
## Call:
## lm(formula = chisq ~ ., data = regression_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.463  -0.372   0.326   0.733   1.322 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -1.041e+00  1.844e-03 -564.544  < 2e-16 ***
## dim.1       -7.927e-03  5.299e-04  -14.960  < 2e-16 ***
## dim.2        1.290e-04  7.392e-04    0.175 0.861448    
## dim.3       -2.216e-03  8.290e-04   -2.673 0.007511 ** 
## dim.4        9.927e-05  8.789e-04    0.113 0.910065    
## dim.5       -8.606e-03  8.646e-04   -9.954  < 2e-16 ***
## dim.6        9.369e-03  9.980e-04    9.388  < 2e-16 ***
## dim.7        9.483e-04  1.019e-03    0.930 0.352139    
## dim.8       -1.024e-02  1.021e-03  -10.023  < 2e-16 ***
## dim.9       -7.218e-03  1.207e-03   -5.979 2.24e-09 ***
## dim.10       6.982e-03  1.110e-03    6.292 3.13e-10 ***
## dim.11      -1.682e-02  1.199e-03  -14.031  < 2e-16 ***
## dim.12       5.467e-03  1.183e-03    4.622 3.80e-06 ***
## dim.13       3.195e-03  1.202e-03    2.659 0.007836 ** 
## dim.14      -1.444e-02  1.217e-03  -11.867  < 2e-16 ***
## dim.15      -8.744e-03  1.277e-03   -6.848 7.47e-12 ***
## dim.16      -9.360e-04  1.354e-03   -0.691 0.489256    
## dim.17       1.245e-03  1.378e-03    0.903 0.366280    
## dim.18      -4.661e-04  1.356e-03   -0.344 0.730982    
## dim.19       8.045e-04  1.430e-03    0.563 0.573774    
## dim.20      -1.064e-03  1.685e-03   -0.631 0.527771    
## dim.21       3.224e-03  1.636e-03    1.971 0.048708 *  
## dim.22       2.503e-03  1.562e-03    1.602 0.109124    
## dim.23       7.188e-06  1.597e-03    0.005 0.996407    
## dim.24      -2.258e-03  1.540e-03   -1.466 0.142555    
## dim.25      -2.710e-04  1.493e-03   -0.182 0.855951    
## dim.26      -1.455e-03  1.449e-03   -1.004 0.315244    
## dim.27       5.152e-04  1.395e-03    0.369 0.711793    
## dim.28      -7.076e-03  1.354e-03   -5.227 1.72e-07 ***
## dim.29      -5.331e-04  1.340e-03   -0.398 0.690640    
## dim.30       3.608e-03  1.442e-03    2.503 0.012323 *  
## dim.31       5.673e-03  1.425e-03    3.980 6.89e-05 ***
## dim.32       4.833e-03  1.432e-03    3.375 0.000739 ***
## dim.33      -5.546e-03  1.435e-03   -3.863 0.000112 ***
## dim.34       1.047e-03  1.431e-03    0.731 0.464492    
## dim.35       3.089e-03  1.441e-03    2.143 0.032106 *  
## dim.36      -4.905e-04  1.454e-03   -0.337 0.735847    
## dim.37       4.267e-04  1.437e-03    0.297 0.766443    
## dim.38       2.691e-03  1.434e-03    1.876 0.060692 .  
## dim.39      -8.609e-04  1.420e-03   -0.606 0.544350    
## dim.40      -3.962e-03  1.415e-03   -2.800 0.005104 ** 
## dim.41       1.717e-03  1.423e-03    1.207 0.227563    
## dim.42       4.741e-04  1.478e-03    0.321 0.748352    
## dim.43      -2.994e-03  1.517e-03   -1.974 0.048436 *  
## dim.44       1.459e-03  1.535e-03    0.950 0.341897    
## dim.45       3.016e-03  1.525e-03    1.978 0.047961 *  
## dim.46      -1.168e-03  1.545e-03   -0.756 0.449377    
## dim.47       2.013e-03  1.557e-03    1.293 0.196058    
## dim.48      -1.057e-03  1.553e-03   -0.681 0.496079    
## dim.49      -1.792e-03  1.559e-03   -1.150 0.250324    
## dim.50      -8.265e-03  1.510e-03   -5.473 4.43e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.12 on 513061 degrees of freedom
## Multiple R-squared:  0.002202,   Adjusted R-squared:  0.002105 
## F-statistic: 22.64 on 50 and 513061 DF,  p-value: < 2.2e-16
# pick dimension with biggest absolute t statistic
t_values <- summary(lm_out)$coefficients[,3]
ordered_dims  <- sort(abs(t_values), decreasing = TRUE)
ordered_dims
##  (Intercept)        dim.1       dim.11       dim.14        dim.8        dim.5 
## 5.645443e+02 1.495963e+01 1.403073e+01 1.186671e+01 1.002277e+01 9.954240e+00 
##        dim.6       dim.15       dim.10        dim.9       dim.50       dim.28 
## 9.388487e+00 6.848461e+00 6.292132e+00 5.979472e+00 5.473031e+00 5.227158e+00 
##       dim.12       dim.31       dim.33       dim.32       dim.40        dim.3 
## 4.622307e+00 3.980195e+00 3.863477e+00 3.374887e+00 2.800387e+00 2.673290e+00 
##       dim.13       dim.30       dim.35       dim.45       dim.43       dim.21 
## 2.659058e+00 2.502761e+00 2.143097e+00 1.977717e+00 1.973531e+00 1.971139e+00 
##       dim.38       dim.22       dim.24       dim.47       dim.41       dim.49 
## 1.875740e+00 1.602144e+00 1.466345e+00 1.292867e+00 1.206662e+00 1.149563e+00 
##       dim.26       dim.44        dim.7       dim.17       dim.46       dim.34 
## 1.004281e+00 9.504247e-01 9.304491e-01 9.034641e-01 7.564546e-01 7.314712e-01 
##       dim.16       dim.48       dim.20       dim.39       dim.19       dim.29 
## 6.914926e-01 6.806721e-01 6.314134e-01 6.062492e-01 5.625025e-01 3.979865e-01 
##       dim.27       dim.18       dim.36       dim.42       dim.37       dim.25 
## 3.694496e-01 3.438194e-01 3.373585e-01 3.208131e-01 2.970309e-01 1.815306e-01 
##        dim.2        dim.4       dim.23 
## 1.745319e-01 1.129572e-01 4.502561e-03
p <- p_df$european_ancestry_pval_rand
q <- coords$dim.1

monotonic_plot(p, q, title = "Iteration 1 (orig_p, dim1)")

No folding of the \(q\) distribution is required here as the relationship between \(q\) and \(p\) is already monotonic.


First, we estimate \(P(Q\leq q|H0)\) using our new method and a bivariate KDE fitted to the independent subset of SNPs. Next, we estimate \(P(Q\leq q, P\leq p)\) using the KDE. The plot below shows \(P(Q\leq q|H0)\) (red) overlaid on \(P(Q\leq q, P\leq p)\).
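As a rough sketch of how the bivariate KDE and the joint CDF on the grid could be obtained (p_ind and q_ind, the values for the independent SNPs, and the use of MASS::kde2d are assumptions; the actual analysis code may construct these differently, e.g. on a transformed p scale):

library(MASS)

kde    <- kde2d(p_ind, q_ind, n = 500)    # bivariate KDE over a 500 x 500 (p, q) grid
pdf_pq <- kde$z / sum(kde$z)              # normalise to a discrete joint density

# joint CDF P(P <= p, Q <= q): cumulate down the rows (p) and then along the columns (q)
cdf_pq <- t(apply(apply(pdf_pq, 2, cumsum), 1, cumsum))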

We then find kgrid, the ratio of the above two quantities (\(P(P\leq p, Q\leq q)\) divided by \(P(Q\leq q|H0)\)), and see that there is a problem for low \(q\).

The problem seems to occur for \(-18<q<-16\). If I look at this region, it appears that our estimate of \(P(Q\leq q, P\leq p)\) is to blame:


I think that the method struggles for low \(q\) because we only use the independent SNPs for the bivariate KDE, which have a minimum \(q\) value of -17.35, whereas the minimum \(q\) value in the whole dataset is \(-17.92\) and the lower limit for the KDE (10% below the lowest \(q\) in the full dataset) is -20.15. This means that for the first 53 out of 500 \(q\) values on the KDE grid, no data is used in the estimation of the KDE (this would be reduced to the first 42 if using the whole dataset). But note that the bandwidth used for \(q\) is larger when using the subset of independent SNPs (0.72 vs 0.5).


From kgrid, we then find cgrid, the cFDR values (p/kgrid). We see that the problem above is compounded at this stage.
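As a sketch of how these grids relate (cdf_pq and cdf_q_h0 are the quantities sketched above, and p_grid is an assumed vector of p values at the grid rows; the actual code may differ):

# kgrid: P(P <= p, Q <= q) / P(Q <= q | H0), dividing each column (q index) by the CDF under H0
kgrid <- sweep(cdf_pq, 2, cdf_q_h0, "/")

# cgrid: the cFDR values p / kgrid (p_grid recycles down the rows, which index p)
cgrid <- p_grid / kgrid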

This is reflected in the cFDR values:


Using this new estimate has definitely improved things: values are not re-weighted as much, and \(p\)-values close to 1 are no longer decreased en masse. But it looks like we’ll still have to implement the spline correction.


If I implement the spline correction using a distance threshold of 0.5 and nknots=5 then the points shown in blue are mapped back to the spline:


The results don’t look great:


GenoWAP


Method


Define \(Z_D\) as an indicator of disease-specific functionality [1 if this SNP or its surrounding region is involved in the disease pathway]. Define \(Z\) as an indicator of general functionality [1 if this SNP or its surrounding region is active in any genomic functional pathway].

The goal is to estimate the posterior probability of a SNP having a disease-specific function:

\[\begin{equation} P(Z_D=1|p)=\dfrac{f(p|Z_D=1)\times P(Z_D=1)}{f(p|Z_D=0)\times P(Z_D=0) + f(p|Z_D=1)\times P(Z_D=1)}. \end{equation}\]
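A direct transcription of this formula as an R sketch (the function and argument names are mine; f1 and f0 stand for \(f(p|Z_D=1)\) and \(f(p|Z_D=0)\), and prior for \(P(Z_D=1)\)):

# posterior P(Z_D = 1 | p) given the two densities and the prior
posterior_zd <- function(p, f1, f0, prior) {
  num <- f1(p) * prior
  num / (f0(p) * (1 - prior) + num)
}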

The authors state that \(\{Z_D=1\} \subseteq \{Z=1\}\) (“if a SNP is disease-specific functional, it has to be functional in the general sense as well”) and use this to write:

\[\begin{equation} P(Z_D=1|p)=\dfrac{f(p|Z_D=1)\times P(Z_D=1|Z=1)P(Z=1)}{f(p|Z_D=0)\times P(Z_D=0) + f(p|Z_D=1)\times P(Z_D=1|Z=1)P(Z=1)} \end{equation}\]

Therefore, the following quantities need to be estimated to derive the posterior probability:

  1. \(P(Z=1)\)

  2. \(f(p|Z_D=0)\)

  3. \(f(p|Z_D=1)\)

  4. \(P(Z_D=1|Z=1)\)


1: \(P(Z=1)\)

\(P(Z=1)\) is estimated by the mean GenoCanyon functional score of the SNP’s surrounding 10kb. [Note that below I use GenoCanyon10K scores that have already been smoothed over 10kb surrounding regions... should I use the original GenoCanyon scores in the GenoWAP software then? - I have emailed to ask].
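A minimal sketch of this smoothing for raw per-base GenoCanyon scores, in case they are needed instead of the pre-smoothed GenoCanyon10K scores (the names gc_df, with columns pos and score, and snp_pos are assumptions, positions on the same chromosome; I take the 10kb window as ±5kb around the SNP):

# mean GenoCanyon score in the 10kb window centred on each SNP, used as the estimate of P(Z = 1)
p_z1 <- sapply(snp_pos, function(x) {
  mean(gc_df$score[gc_df$pos >= x - 5000 & gc_df$pos <= x + 5000])
})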


2: \(f(p|Z_D=0)\)

SNPs are partitioned into functional (\(Z=1\)) or non-functional (\(Z=0\)) based on their GenoCanyon scores (typically using a threshold of 0.1 - they state the method isn’t sensitive to this cutoff choice due to the bimodal pattern of GenoCanyon scores). They assume that the p-values for markers not related to the disease behave just like the p-values for markers that are not functional at all (since the p-values come from a disease-specific study), so that the subsetted data can be used to estimate \(f(p|Z_D=0)=f(p|Z=0)\).

They do not assume a uniform distribution for \((p|Z_D=0)\) since the p-value for such a marker may be driven by a nearby disease-related marker due to LD. Instead, they estimate it empirically using a histogram with the number of bins chosen through cross-validation.
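A minimal sketch of this empirical estimate (p_null, the p-values for the non-functional subgroup, and the bin count are assumptions; in their method the number of bins would be chosen by cross-validation):

nbins <- 50
h  <- hist(p_null, breaks = seq(0, 1, length.out = nbins + 1), plot = FALSE)
f0 <- approxfun(h$mids, h$density, rule = 2)    # empirical density f(p | Z_D = 0) as a function of p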


3/4: \(f(p|Z_D=1)\) and \(P(Z_D=1|Z=1)\)

They partition the functional subgroup (\(Z=1\)) into finer subgroups. Using \(f(p|Z=1, Z_D=0)=f(p|Z_D=0)=f(p|Z=0)\), they derive the following mixture:

\[\begin{equation} \begin{split} f(p|Z=1)&=P(Z_D=1|Z=1)\times f(p|Z=1,Z_D=1) + P(Z_D=0|Z=1)\times f(p|Z=1,Z_D=0) \\ &= P(Z_D=1|Z=1)\times f(p|Z_D=1) + P(Z_D=0|Z=1)\times f(p|Z_D=0) \end{split} \end{equation}\]

and assume a parametric form for \(f(p|Z_D=1)\), namely \((p|Z_D=1) \sim Beta(\alpha, 1)\) with \(0<\alpha<1\), which guarantees that smaller p-values are more likely to occur than larger ones. They apply the EM algorithm to all the p-values in the functional subgroup to estimate both \(P(Z_D=1|Z=1)\) and \(f(p|Z_D=1)\) (i.e. \(\alpha\)).
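A minimal sketch of such an EM fit (my own illustration, not the GenoWAP code: p_fun are the p-values in the functional subgroup, f0 is the empirical null density from the previous step, and the \(\alpha\) update is the weighted MLE for a \(Beta(\alpha,1)\) density):

# EM for f(p | Z = 1) = pi1 * Beta(alpha, 1) + (1 - pi1) * f0(p)
fit_em <- function(p_fun, f0, alpha = 0.5, pi1 = 0.5, n_iter = 100) {
  for (i in seq_len(n_iter)) {
    # E-step: responsibility of the disease-specific (Beta) component for each p-value
    d1    <- alpha * p_fun^(alpha - 1)
    gamma <- pi1 * d1 / (pi1 * d1 + (1 - pi1) * f0(p_fun))
    # M-step: mixing proportion and weighted MLE for alpha, constrained to (0, 1)
    pi1   <- mean(gamma)
    alpha <- min(1 - 1e-6, -sum(gamma) / sum(gamma * log(p_fun)))
  }
  list(alpha = alpha, pi1 = pi1)    # estimates of alpha and P(Z_D = 1 | Z = 1)
}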


GenoWAP has now been extended to use additional functional data (i.e. GenoSkyline tissue-specific scores) here, by incorporating an extra indicator, \(Z_T\), which records tissue-relevant functionality, into the above model (where the relevant tissues are selected using LDSC).


Implementation


For my Asthma SNPs I obtain:

  1. GenoCanyon scores (union of functional elements in different cell types)

  2. GenoSkyline tissue-specific scores (lung tissue)

Note that GenoWAP requires you to put in a threshold for the GenoCanyon scores, such that anything above the threshold is called “functional” and anything below the threshold is called “non-functional”; they select 0.1 (as I use here, meaning that 40% of our SNPs are deemed “functional”).
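For example, the proportion of SNPs deemed functional at this threshold can be checked directly (gc_scores is an assumed vector of per-SNP GenoCanyon scores):

functional <- gc_scores >= 0.1    # functional / non-functional partition at threshold 0.1
mean(functional)                  # ~0.4 in my data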


I perform GenoWAP on my Asthma \(p\) values and the GenoCanyon scores (later, I can incorporate the cell type specific scores too). See manuscript here and GitHub here.


Functional cFDR with GenoCanyon scores

I then apply functional cFDR (version 5) to the original \(p\) values with the GenoCanyon scores as \(q\). There is a nice monotonic relationship between \(p\) and \(q\) (note that the sign of \(q\) will be flipped so that the monotonicity is in the correct direction), but it is almost flat, suggesting that the covariate is not very informative.

It will be interesting to see how functional cFDR performs here, where we have a bimodal distribution with the two modes at the extremes.


Functional cFDR seems to perform well, but doesn’t re-weight the \(p\) values much (the Spearman correlation between \(p\) and \(v\) after functional cFDR is 0.986, whilst the Spearman correlation between \(q\) and \(v\) is 0.16).
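These correlations can be computed directly (v is the assumed vector of v-values returned by functional cFDR):

cor(p, v, method = "spearman")    # 0.986: the v-values largely track the original p values
cor(q, v, method = "spearman")    # 0.16: the covariate has only a modest influence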

So it is apparent that the GenoWAP method uses the auxiliary data more than functional cFDR does. But note that the correlation between \(p\) and \(q\) here is only 0.017, so maybe GenoWAP is using the auxiliary data too much when there is little evidence of it being important (i.e. the prior is very informative), whereas our method decides how relevant the data is in its re-weighting?


Comments and queries


Method:

  • At the moment, we approximate \(P(H_0^p)\) by 1 (as standard) to derive:

\[\begin{equation}\label{cfdr} \begin{split} cFDR(p,q)&=P(H_0^p|P\leq p, Q\leq q) \\ &=\dfrac{P(P\leq p|H_0^p,Q\leq q) P(H_0^p|Q\leq q)}{P(P\leq p|Q\leq q)} \\ &\approx \dfrac{p P(Q\leq q|H_0^p) P(H_0^p)}{P(P\leq p,Q\leq q)} \\ &\approx \dfrac{p P(Q\leq q|H_0^p)}{P(P\leq p,Q\leq q)} \end{split} \end{equation}\]

but we then estimate \(P(H_0^p)\) explicitly (rather than approximating it by 1) when deriving our estimate for \(P(Q\leq q|H_0^p)\). Is this inconsistency ok?

  • Need to sort out the problems for low \(q\). We could just use the final spline correction (but we still need to think about choosing the nknots and dist_thr parameters), but it would be nice to fix this problem earlier on in the method.

Extras

  • Still need a simulation section…

    • Iterating over independent uniform \(q\) adds noise.

    • Iterating over independent mixture \(q\) (two peaks) adds noise.

    • Iterating over independent absolute mixture \(q\) (single peak) adds noise.

    • Iterating over dependent vectors becomes useful…

    • I’d like to use simGWAS to simulate my \(p\)-values but I think that this is a region-specific tool? E.g. I get an error when simulating multiple (>2) CVs: “Error in happrobs(freqmat, freqprob) : negative length vectors are not allowed”.

  • From now on, use the 1,978,016 SNP data set (removing the duplicated SNP).