cFDR


Introduction

  • The ‘conditional false discovery rate’ (cFDR) is defined as the probability that a SNP is not associated with the principal phenotype, given that its P values for the principal and conditional phenotypes are below some thresholds.

  • The Benjamini-Hochberg (BH) procedure is a method to adjust P values such that the (unconditional) FDR is controlled at some level, e.g. so that the expected proportion of false positives among the rejected hypotheses is at most 5% (see the short p.adjust sketch after this list).

  • Andreassen et al (https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1003455) first used the cFDR approach to increase the power to detect variants associated with schizophrenia and bipolar disorder. However, this approach requires distinct controls, does not guarantee that the expected FDR is below some threshold (as BH does) and is over-conservative.

  • James extended this model to allow for shared controls (https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004926#sec004) and developed a new method for accurate error control (https://www.biorxiv.org/content/10.1101/414318v3.full). It is this latter method that I am hoping to re-purpose and extend as a way to incorporate functional annotation data with GWAS summary statistics.

  • So far, I have extended the cFDR method to work on binary conditional data.
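
As a point of reference, a minimal sketch of BH adjustment in base R (illustrative P values only; not part of the cFDR code below):

p_example <- c(1e-6, 0.003, 0.02, 0.4, 0.9)  # hypothetical P values
p_bh <- p.adjust(p_example, method = "BH")   # Benjamini-Hochberg adjusted P values
which(p_bh < 0.05)                           # hypotheses rejected at FDR 5%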

####### Formalised (ish) code for cfdr with binary conditional data
####### leave one chromosome out method

#' A
#' Empirical estimate of P(Q = qi | H0^p)
#'
#' @param p vector of P values measuring association with principal trait
#' @param q binary vector of Q values
#' @param q1 q value to estimate probability of under H0^p (0 or 1)
#'
#' @return empirical estimate of P(Q = qi | H0^p)
#' @export
#'
A <- function(p, q, q1){ 
  length(which(q == q1 & p > 0.5))/length(which(p > 0.5))
}

#' B
#' numerator of empirical estimate of P(P < pi | Q = qi)
#' since denominator cancels in maths
#'
#' @param p vector of P values measuring association with principal trait
#' @param q binary vector of Q values
#' @param p1 value of P value to find those less than
#' @param q1 value of q (0 or 1)
#'
#' @return numerator of the empirical estimate of P(P <= pi | Q = qi)
#' @export
#'
B <- function(p, q, p1, q1){
  length(which(p <= p1 & q == q1)) 
}

#' cFDR_b
#' cFDR function extended for use with binary conditional data
#' requires length(p) == length(q) == length(chr)
#'
#' @param p vector of P values measuring association with principal trait
#' @param q binary vector of Q values
#' @param chr vector of the chromosome each SNP falls in
#'
#' @return v-values
#' @export
#'
cFDR_b <- function(p, q, chr){
  
  v <- vector("numeric", length(p)) # prepare a container
  
  for(i in 1:length(p)){
    tryCatch(
      expr = {
        
        pi <- p[i]
        qi <- q[i]
        
        p_loo <- p[-which(chr == chr[i])] # leave-one-chromosome-out: drop SNPs on the same chromosome as SNP i
        q_loo <- q[-which(chr == chr[i])] # leave-one-chromosome-out
        
        q0 <- A(p_loo, q_loo, q1 = 1)
        
        if(qi == 0){
          
          p0 <- pi
          
          epsilon <- min(p_loo[which(q_loo == 1)])
          
          LHS <- (p0 * A(p_loo, q_loo, q1 = 0))/(A(p_loo, q_loo, q1 = 1) * B(p_loo, q_loo, p1 = p0, q1 = 0))
          
          f <- function(p, q, p1, q1, c) (p1/length(which(p <= p1 & q == q1))) - c 
          
          if(pi < 1e-4){
            p1 <- uniroot(f, c(0+epsilon, 1), p = p_loo, q = q_loo, q1 = 1, c = LHS, extendInt = "yes", maxiter = 5000, tol = pi)$root
          } else {
            p1 <- uniroot(f, c(0+epsilon, 1), p = p_loo, q = q_loo, q1 = 1, c = LHS, extendInt = "yes", maxiter = 5000)$root
          }
          
          if(p1 > 1) p1 <- 1
          
        } else {
          
          p1 <- pi
          
          epsilon <- min(p_loo[which(q_loo == 0)])
          
          LHS <- (p1 * A(p_loo, q_loo, q1 = 1))/(A(p_loo, q_loo, q1 = 0) * B(p_loo, q_loo, p1 = p1, q1 = 1)) 
          
          f <- function(p, q, p0, q1, c) (p0/length(which(p <= p0 & q == q1))) - c
          
          if(pi < 1e-4){
            p0 <- uniroot(f, c(0+epsilon, 1), p = p_loo, q = q_loo, q1 = 0, c = LHS, extendInt = "yes", maxiter = 5000, tol = pi)$root
          } else {
            p0 <- uniroot(f, c(0+epsilon, 1), p = p_loo, q = q_loo, q1 = 0, c = LHS, extendInt = "yes", maxiter = 5000)$root
          }
          
          if(p0 > 1) p0 <- 1
          
        }
        
        v[i] <- p0*(1-q0) + p1*q0
        
      },
      error = function(e){
        message("* Caught an error on iteration ", i)
        print(e)
      }
    )
  }
  
  return(v)
}
  • Example 1: Find V values for T1D using information on a specific genomic annotation in a specific cell type.
# this code uses:
# P values measuring SNP association with T1D
# Q values as a binary indicator of whether that SNP falls in a constitutive heterochromatic region in brain cells
# to obtain V values

final_dataset <- readRDS("final_dataset.RDS")
p <- final_dataset$p
q <- final_dataset$brain_ConstitutiveHet
chr <- final_dataset$chrom

test <- cFDR_b(p, q, chr)
  • Example 2: Find V values for T1D using information on whether the SNP falls in an active or inactive region in a specific cell type.
# this code uses:
# P values measuring SNP association with T1D
# Q values as a binary indicator of whether that SNP falls in an active region in Thymus cells
# to obtain V values

data <- readRDS("cFDR_data.RDS")
p <- data$P
q <- data$THYMUS
chr <- data$chr

test <- cFDR_b(p, q, chr)
  • I am now working on developing an umbrella function to find v-values incorporating functional annotations (incorporating PCA to find an independent lower-dimensional representation of the annotation data that cFDR can be run over iteratively).

  • Eventually, I hope to extend this to fine-mapping posterior probabilities (PPs).


Annotation data set to condition on

The idea is that researchers provide some annotation data that they wish to condition on. In our case, this data is genome segment annotation for ~123,000 SNPs in 19 different cell types (18 immune-related + 1 brain control).

cFDR cannot be applied iteratively straight away because the annotations are correlated across cell types. We therefore apply PCA to obtain an independent, lower-dimensional representation of the data.


Standardisation

  • Centering

  • Scaling

    • Scaling means that each variable is scaled to have unit variance.
    • It should always be done for data that is on different scales to ensure one variable (on a larger scale) does not dominate the analysis (unless you are confident that e.g. it is twice as important as the other features).
    • For binary data it is a bit more complicated. A binary variable with high variance has a roughly even mix of 1s and 0s, whereas one with low variance is e.g. mostly 1s with few 0s (or nearly all 1s/0s). If we scale such data then the distance between 0 and 1 in the low-variance variables is stretched. We need to think about whether rarer features (lots of 0s, few 1s, e.g. promoter) matter as much as common features (more 1s, e.g. quiescent); see the toy sketch after this list.
    • PCs point in the directions of greatest variance, so if some features are more variable than others (e.g. quiescent is more variable than promoter) then the leading PCs will mostly pick up the quiescent regions.
    • I think that we actually want to investigate the structure of the annotations rather than their frequency, so I am leaning towards scaling. In the next section I investigate the impact of scaling and no scaling on PCA.
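
As a quick illustration of the point about binary variance (a toy sketch with simulated annotations, not the real data):

set.seed(1)
rare   <- rbinom(1000, 1, 0.05) # rare annotation, e.g. promoter-like (few 1s)
common <- rbinom(1000, 1, 0.5)  # common annotation, e.g. quiescent-like
var(rare); var(common)          # ~0.05 vs ~0.25: the common feature is far more variable
apply(scale(cbind(rare, common)), 2, var) # both ~1 after centering and scaling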

Comparing PCA on scaled vs non-scaled data

I center the data and perform PCA twice, once with scaling and once without (see the sketch below). Note the differences in the y-axis scale. Why are the variances/eigenvalues so much bigger on the scaled data? (Presumably because each scaled annotation has unit variance, so the total variance to be partitioned equals the number of annotation columns, whereas each unscaled binary annotation has variance at most 0.25.)
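
For concreteness, a sketch of the two PCA runs; annot_mat is an assumed name for the SNP-by-annotation matrix (the real object name may differ):

pca_noscale <- prcomp(annot_mat, center = TRUE, scale. = FALSE) # centred only
pca_scale   <- prcomp(annot_mat, center = TRUE, scale. = TRUE)  # centred and scaled

par(mfrow = c(1, 2))
screeplot(pca_noscale, main = "Unscaled") # note the much smaller eigenvalues
screeplot(pca_scale, main = "Scaled")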

There is strong correlation between the PCs using the scaled and unscaled data, which is promising. Notice that (for PC4 and PC7 especially) there are some annotations that sit around 0 in the unscaled PCA but have very variable loadings in the scaled PCA. Why could this be? Could these be rarer annotations that the scaled PCA approach is finding variability within (which unscaled PCA could not detect because more variable annotations dominate)?


I now investigate which cell-type-specific annotations are picked out by each PC. We had hoped that using scaled data would help increase the cell type specificity (previously we were seeing that the annotation mark is more important than the cell type, and that annotations were therefore correlated across cell types). It seems that using scaled data does not help much with this.


Interpretation of PCs (scaled)

  • PC1 seems to be comparing quiescent to transcribed. Since the cFDR method looks to threshold Q (below some value), we may need to condition not only on Q (quiescent SNPs) but also on 1-Q (transcribed SNPs). This is because we are interested in both extremes (picking out transcribed and quiescent), rather than just one (quiescent).

Implementation (on just PC1)


Data preparation

Extract PCA transformed annotation data.

pcs <- pca_scale$x

Convert to \([0,1]\) range (so we can use James’ original code in the cfdr package).

pcs_norm <- apply(pcs, 2, function(x) (x-min(x))/(max(x)-min(x)))
pcs_norm <- as.data.frame(pcs_norm)

# check
par(mfrow = c(1,2))
hist(pcs[,1], breaks = 10, xlab = "Original data", main = "")
hist(pcs_norm[,1], breaks = 10, xlab = "Normalised data", main = "")

p <- final_dataset$p
q <- pcs_norm$PC1

p[which(p<1e-300)] <- 1e-300 # to avoid errors in James' code
q[which(q<1e-300)] <- 1e-300

length(p)==length(q)
## [1] TRUE

cFDR

Brief summary:

  1. Find the set of possible nested rejection regions defined on the support of \(P, Q\).

  2. Find the smallest rejection region that each pair \((p_i, q_i)\) falls in (called an L-region, usually with the data point on its right-most border).

  3. Estimate the distribution of \(P, Q\) under the null.

  4. Integrate this over the L-region to get the v-value (the probability that each point would lie within its associated region under the null).


Following the cFDR vignette, we are not going to reject \(H_0^p\) for SNPs with very large P values against \(H_0^p\), so to save time we consider a set of indices for candidate SNPs rather than all of them.

candidate_indices <- which(p < 0.01) # 9944 of these

Estimate the distribution of \(Q|H_0^p\). The fit.2g function fits a two-component Gaussian mixture to the supplied values, here the Q values of SNPs with P > 0.5 (approximately null for the principal trait). The output is the MLE for pi0 and sigma.

est_q0_pars <- fit.2g(q[which(p>0.5)])$pars
est_q0_pars
## [1]  0.99999 36.69314

Compute L-curves passing through data points (using leave-one-chromosome-out method) and integrate over these. The L-curves are effectively contours of the cFDR estimator function.

chr <- final_dataset$chrom

# check all chromosomes are covered by candidate indices

unique(chr[candidate_indices])

# prepare containers
fold <- vector(mode = "list", length = length(unique(chr)))
ind <- vector(mode = "list", length = length(unique(chr)))
L <- vector(mode = "list", length = length(unique(chr)))
v <- vector(mode = "list", length = length(unique(chr)))

for(i in 1:length(unique(chr))){
  fold[[i]] <- which(chr == i)
  ind[[i]] <- intersect(candidate_indices, fold[[i]])
  L[[i]] <- vl(p, q, indices=ind[[i]], mode=2, fold=fold[[i]]);  # compute L curves
  v[[i]] <- il(L[[i]], pi0_null = est_q0_pars[1], sigma_null = est_q0_pars[2], distx="norm") # integrate over L curves
}
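
To use the results downstream, the per-chromosome v-values can be stitched back into a single vector aligned with the candidate SNPs (a sketch; it assumes il() returns one v-value per index in ind[[i]], in the same order):

v_all <- rep(NA_real_, length(p))
for(i in 1:length(unique(chr))){
  v_all[ind[[i]]] <- v[[i]] # v-values for candidate SNPs on chromosome i
}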

Also run on 1-Q

I also run the cFDR analysis on 1-Q (as we are interested in picking out both extremes (quiescent and transcribed) as opposed to just one of these).
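
A sketch of what this involves (the rest of the pipeline is exactly as above):

q_flip <- 1 - pcs_norm$PC1 # flip PC1 so the other extreme sits at small values
q_flip[which(q_flip < 1e-300)] <- 1e-300
# then recompute candidate_indices, est_q0_pars, and the vl()/il() loop with q = q_flip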


Visualisation of results

Since larger Q implies quiescent and smaller Q implies transcribed (by PC1), we see that the P values for those SNPs falling in transcribed regions are made smaller when running the standard analysis on Q = PC1; however, things go wrong for Q = 1-PC1.

This is because the functions I am using assume that the q values are transformed normals, which they most certainly are not.


Replacing Q by its rank

I try replacing q by its rank to deal with this problem.

q_rank <- rank(q)/(length(q)+1)

I find that this does solve the problem.

It also shows that the results are not identical when switching Q to 1-Q (e.g. V values change more when using the 1-q ranked annotation). There is also something odd happening for very small P values (roughly \(P < 10^{-8}\)).



Investigating tailing off of -log10(v)

It seems that something is going wrong for very small P values (V values are tailing off).

Ways I can generate the tailing off problem:

  • Increase SDs of associated Z scores (with principal trait)
  • Change random seed
  • Decrease sample sizes

The problem is not fixed when:

  • Using the true pi0_null and sigma_null values.
  • Changing candidate_indices=which(p<0.01 | q< 0.001) to candidate_indices=which(p<0.1).

I think the problem comes after the L curves, i.e. with the il() function.

The problem isn’t fixed when I use true_q0_pars in the il() function, but it is fixed if I set pi0_null=1 and/or sigma_null=1 in the il() function.

Even more strangely, I can fix the problem by setting sigma_null to anything \(\leq 1\) (even pi0_null = est_q0_pars[1], sigma_null = 0.1 works), but changing pi0_null does not fix it unless it is extremely close to 1 (even pi0_null = 0.999999, sigma_null = est_q0_pars[2] doesn’t quite fix it).

Re-writing integral function using empirical CDFs

I attempt to re-write the integral function using empirical CDFs. I.e. to get the v-values I need to integrate \(f_0(p,q)=f(P=p, Q=q| H_0^p)\) over the L region:

\[\begin{equation} \int_L f_0(p,q)dpdq. \end{equation}\]

We can estimate \(f_0(p,q)\) using \((Q|H_0^p)\sim (Q|P\geq 1/2)\), where James estimates the latter with a two-component Gaussian mixture: \(N(0,1)\) with probability pi0_null and \(N(0,\sigma^2)\), with \(\sigma\) = sigma_null, with probability 1-pi0_null.
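
A minimal sketch of the empirical alternative, assuming the L-region can be summarised by its right-hand boundary \(p_{max}(q)\) (i.e. \((p,q)\) lies in L iff \(p \leq p_{max}(q)\)). Under \(H_0^p\), P is uniform and independent of Q, so the integral reduces to the expectation of \(p_{max}(Q)\) over the null distribution of Q, which can be estimated empirically from SNPs with P > 1/2. Here p_max_fun is a hypothetical function returning the boundary:

# empirical estimate of one v-value (sketch only)
v_empirical <- function(p_max_fun, p, q){
  q_null <- q[p > 0.5]      # Q values for approximately-null SNPs
  mean(p_max_fun(q_null))   # estimates E[p_max(Q) | H0^p] = integral of f0 over L
}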

  • R’s cubature (numerical computation of a multiple integral) packages: “The R package polyCub implements cubature (numerical integration) over polygonal domains. It solves the problem of integrating a continuously differentiable function f(x,y) over simple closed polygons.”

Iterate over PCs

The idea is to iterate over the PCs to obtain a final set of V values that incorporate genome annotation data.

I.e.

  1. Use P = original p values, Q = PC1 to obtain V1
  2. Use P = V1, Q = PC2 to obtain V2
  3. And so on, until V10 (v-values incorporating the annotation data); see the sketch after this list.
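
A sketch of the intended loop, assuming a hypothetical wrapper get_v(p, q, chr) around the vl()/il() steps above that returns one v-value per SNP:

v_current <- final_dataset$p
for(j in 1:10){
  q_j <- rank(pcs_norm[, j])/(length(pcs_norm[, j]) + 1) # rank-transform PC j to (0,1)
  v_current <- get_v(v_current, q_j, chr)                # condition on PC j, feeding v-values back in as the new "p"
}
# v_current now holds V10: v-values incorporating the first 10 PCs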

I need to figure out how I iterate on 1-q for each PC as well (due to asymmetry in the above analysis).


Comments and next steps


  • Use gchromVAR to find cell-type specific GWAS enrichments from GWAS summary statistics and quantitative epigenomic data. This method weights chromatin features by variant posterior probabilities and computes the enrichment for each cell type versus an empirical background matched for GC content and feature intensity.

  • Figure out iteration strategy.


I’ve shown that annotations are correlated amongst cell types. It would be good to look at those SNPs whose annotation mark varies across cell types and at any correlation with P value (similar to James Lee’s suggestion from the CHEERS paper).

  • I’ve started this analysis by comparing the P values of SNPs with the same or different annotation in the brain cell type and each of the immune-related cell types.

  • To do this, for each immune cell type in turn I’ve run a logistic regression of same annotation (1) versus different annotation (0) relative to the brain cell type, against P value (a sketch follows this list).

  • E.g. looking at row 8, I find that the odds of having the same annotation in brain and CD4+ CD25- CD45RO+ memory T cells decrease by roughly 10% for a unit increase in P value (\(exp(-0.0983172277046655)=0.9063613\)).
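
A sketch of one such regression, using toy annotation vectors in place of the real per-cell-type segment annotations (annot_brain and annot_CD45RO are hypothetical names):

# toy illustration: categorical segment annotations for each SNP in two cell types
annot_brain  <- factor(sample(c("Quiescent", "Promoter", "Enhancer"), length(p), replace = TRUE))
annot_CD45RO <- factor(sample(c("Quiescent", "Promoter", "Enhancer"), length(p), replace = TRUE))

same_annot <- as.numeric(annot_brain == annot_CD45RO) # 1 = same annotation, 0 = different
fit <- glm(same_annot ~ p, family = binomial)
coef(fit)         # intercept and slope on the log-odds scale (as tabulated below)
exp(coef(fit)[2]) # odds ratio for a unit increase in P value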

##    intercept             slope                 cell_type
## 1  -0.38969564517402     0.0462203783098224   CD14_PRIMARY_CELLS
## 2  -0.481489520238904    0.00263220173733286  CD19_PRIMARY_CELLS_PERIPHERAL_UW
## 3  -0.601417756235528    0.0688286328404611   CD3_PRIMARY_CELLS_CORD_BI
## 4  -0.231139837387791    0.0980380126308234   CD3_PRIMARY_CELLS_PERIPHERAL_UW
## 5  -0.582839131417988   -0.143034101617728    CD4_MEMORY_PRIMARY_CELLS
## 6  -0.754740517526352   -0.0528445493582747   CD4_NAIVE_PRIMARY_CELLS
## 7  -0.655408777345804   -0.045066106220834    CD4+_CD25-_CD45RA+_NAIVE_PRIMARY_CELLS
## 8  -0.967939035753792   -0.0983172277046655   CD4+_CD25-_CD45RO+_MEMORY_PRIMARY_CELLS
## 9  -0.441436901748305    0.0311031886697509   CD4+_CD25-_IL17-_PMA-IONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLS
## 10 -0.329409388410416   -0.0256662502644529   CD4+_CD25-_IL17+_PMA-IONOMCYIN_STIMULATED_TH17_PRIMARY_CELLS
## 11 -0.536141333810621    0.0352092173420959   CD4+_CD25-_TH_PRIMARY_CELLS
## 12 -0.21844654130867     0.0107246995268383   CD4+_CD25+_CD127-_TREG_PRIMARY_CELLS
## 13 -0.558407300076902   -0.0815895700977407   CD4+_CD25INT_CD127+_TMEM_PRIMARY_CELLS
## 14 -0.626421690387551    0.0534302718299232   CD56_PRIMARY_CELLS
## 15 -0.809809801063977   -0.0402130022619539   CD8_MEMORY_PRIMARY_CELLS
## 16 -0.783220135531082   -0.0706866463614783   CD8_NAIVE_PRIMARY_CELLS
## 17 -0.312315860765758    0.100766028095696    PANISLETS
## 18 -0.278282968561817    0.0400430630040647   THYMUS