The ‘conditional false discovery rate’ (cFDR) is defined as the probability that a SNP is not associated with the principal phenotype, given that its P values for the principal and conditional phenotypes are below some thresholds.

The BH (Benjamini-Hochberg) procedure is a method to adjust P values such that the (unconditional) FDR is controlled at some level; e.g. at level 0.05, the expected proportion of false positives among the discoveries is at most 5%.
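
As a quick illustration, BH is available in base R via `p.adjust` (the P values below are made up):

```
# adjust P values with Benjamini-Hochberg, then declare discoveries
# at FDR level 0.05
pvals <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205)
p_bh <- p.adjust(pvals, method = "BH")
which(p_bh <= 0.05) # indices declared significant at FDR 5%
```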

Andreassen et al (https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1003455) first used the cFDR approach to increase the power to detect variants associated with schizophrenia and bipolar disorder. However, this approach requires distinct control samples, does not guarantee that the expected FDR is below some threshold (as BH does), and is over-conservative.

James extended this model to allow for shared controls (https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004926#sec004) and developed a new method for accurate error control (https://www.biorxiv.org/content/10.1101/414318v3.full). It is this latter method that I am hoping to re-purpose and extend as a way to incorporate functional annotation data with GWAS summary statistics.

So far, I have extended the cFDR method to work on binary conditional data.

```
####### Formalised (ish) code for cfdr with binary conditional data
####### leave one chromosome out method
#' A
#' Empirical estimate of P(Q = qi | H0^p)
#'
#' @param p vector of P values measuring association with principal trait
#' @param q binary vector of Q values
#' @param q1 q value to estimate probability of under H0^p (0 or 1)
#'
#' @return empirical estimate of P(Q = qi | H0^p)
#' @export
#'
A <- function(p, q, q1){
  length(which(q == q1 & p > 0.5))/length(which(p > 0.5))
}
#' B
#' Numerator of the empirical estimate of P(P <= p1 | Q = q1)
#' (the denominator cancels in the maths)
#'
#' @param p vector of P values measuring association with principal trait
#' @param q binary vector of Q values
#' @param p1 P value threshold; count P values less than or equal to this
#' @param q1 value of q (0 or 1)
#'
#' @return numerator of the empirical estimate of P(P <= p1 | Q = q1)
#' @export
#'
B <- function(p, q, p1, q1){
  length(which(p <= p1 & q == q1))
}
#' cFDR_b
#' cFDR function extended for use on binary conditional data
#' requires length(p) == length(q) == length(chr)
#'
#' @param p vector of P values measuring association with principal trait
#' @param q binary vector of Q values
#' @param chr vector of the chromosome each SNP falls in
#'
#' @return v-values
#' @export
#'
cFDR_b <- function(p, q, chr){
  v <- vector("numeric", length(p)) # prepare a container for the v-values
  for(i in 1:length(p)){
    tryCatch(
      expr = {
        pi <- p[i]
        qi <- q[i]
        p_loo <- p[-which(chr == chr[i])] # leave-one-chromosome-out
        q_loo <- q[-which(chr == chr[i])] # leave-one-chromosome-out
        q0 <- A(p_loo, q_loo, q1 = 1) # estimate of P(Q = 1 | H0^p)
        if(qi == 0){
          p0 <- pi
          # smallest P value with q = 1, so the count inside f is non-zero
          epsilon <- min(p_loo[which(q_loo == 1)])
          LHS <- (p0 * A(p_loo, q_loo, q1 = 0))/(A(p_loo, q_loo, q1 = 1) * B(p_loo, q_loo, p1 = p0, q1 = 0))
          # uniroot varies the first unmatched formal, here p1
          f <- function(p, q, p1, q1, c) (p1/length(which(p <= p1 & q == q1))) - c
          if(pi < 1e-4){
            p1 <- uniroot(f, c(epsilon, 1), p = p_loo, q = q_loo, q1 = 1, c = LHS, extendInt = "yes", maxiter = 5000, tol = pi)$root
          } else {
            p1 <- uniroot(f, c(epsilon, 1), p = p_loo, q = q_loo, q1 = 1, c = LHS, extendInt = "yes", maxiter = 5000)$root
          }
          if(p1 > 1) p1 <- 1 # extendInt can push the root past 1; cap it
        } else {
          p1 <- pi
          # smallest P value with q = 0, so the count inside f is non-zero
          epsilon <- min(p_loo[which(q_loo == 0)])
          LHS <- (p1 * A(p_loo, q_loo, q1 = 1))/(A(p_loo, q_loo, q1 = 0) * B(p_loo, q_loo, p1 = p1, q1 = 1))
          f <- function(p, q, p0, q1, c) (p0/length(which(p <= p0 & q == q1))) - c
          if(pi < 1e-4){
            p0 <- uniroot(f, c(epsilon, 1), p = p_loo, q = q_loo, q1 = 0, c = LHS, extendInt = "yes", maxiter = 5000, tol = pi)$root
          } else {
            p0 <- uniroot(f, c(epsilon, 1), p = p_loo, q = q_loo, q1 = 0, c = LHS, extendInt = "yes", maxiter = 5000)$root
          }
          if(p0 > 1) p0 <- 1 # cap as in the other branch
        }
        v[i] <- p0*(1 - q0) + p1*q0
      },
      error = function(e){
        message("* Caught an error on iteration ", i)
        print(e)
      }
    )
  }
  v # return the v-values
}
```

- Example 1: Find V values for T1D using information on a specific genomic annotation in a specific cell type.

```
# this code uses:
# P values measuring SNP association with T1D
# Q values as a binary indicator of whether that SNP falls in a constitutive heterochromatic region in brain cells
# to obtain V values
final_dataset <- readRDS("final_dataset.RDS")
p <- final_dataset$p
q <- final_dataset$brain_ConstitutiveHet
chr <- final_dataset$chrom
test <- cFDR_b(p, q, chr)
```

- Example 2: Find V values for T1D using information on whether the SNP falls in an active or inactive region in a specific cell type.

```
# this code uses:
# P values measuring SNP association with T1D
# Q values as a binary indicator of whether that SNP falls in an active region in Thymus cells
# to obtain V values
data <- readRDS("cFDR_data.RDS")
p <- data$P
q <- data$THYMUS
chr <- data$chr
test <- cFDR_b(p, q, chr)
```

I am now working on an umbrella function to find v-values incorporating functional annotations, using PCA to find an independent lower-dimensional representation of the annotation data that cFDR can be run over iteratively.

Eventually, I hope to extend this to fine-mapping PPs.

The idea is that researchers provide some annotation data that they wish to condition on. In our case, this data is genome segment annotation for ~123,000 SNPs in 19 different cell types (18 immune-related + 1 brain control).

cFDR cannot be applied iteratively off the bat because the annotation data is correlated. We therefore apply PCA to obtain an independent lower dimensional representation of the data.
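
A toy sketch of this idea with `prcomp` (simulated binary annotations; the variables are made up). Strictly, PCA yields uncorrelated rather than fully independent components for non-Gaussian data, but it removes the linear dependence that breaks iterative application:

```
# two correlated binary 'annotations'; their PC scores are uncorrelated,
# so cFDR can be applied to each component in turn
set.seed(1)
x1 <- rbinom(1000, 1, 0.5)
x2 <- ifelse(runif(1000) < 0.8, x1, 1 - x1) # agrees with x1 ~80% of the time
cor(x1, x2) # substantially correlated
pca <- prcomp(cbind(x1, x2), center = TRUE, scale. = TRUE)
cor(pca$x[, 1], pca$x[, 2]) # numerically zero
```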

**Standardisation**

Centering

- As a rule of thumb, data should always be centered (transformed so that each variable has mean 0).
- In regression analysis, this means that the data is shifted so that the intercept is 0.
- PCA is similar to linear regression without an intercept, and so the PCs always pass through the origin.
- https://stats.stackexchange.com/questions/22329/how-does-centering-the-data-get-rid-of-the-intercept-in-regression-and-pca
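
A minimal sketch using base R's `scale`; `prcomp(x, center = TRUE)` performs the same centering internally:

```
# centering subtracts each column mean, so the centered data has mean 0
m <- matrix(c(1, 0, 1, 1, 0, 0), ncol = 2)
m_centered <- scale(m, center = TRUE, scale = FALSE)
colMeans(m_centered) # both (numerically) zero
```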

Scaling

- Scaling means that the data is scaled to have unit variance.
- It should always be done for data that is on different scales to ensure one variable (on a larger scale) does not dominate the analysis (unless you are confident that e.g. it is twice as important as the other features).
- For binary data it is a bit more complicated. A binary variable with high variance has a roughly even mix of 1s and 0s, whereas one with low variance is mostly 1s or mostly 0s (or all 1s/0s). If we scale this data then the distance between 0s and 1s in the low-variance examples will be stretched. We need to think about whether rarer features (lots of 0s, few 1s - e.g. promoter) matter equally to common features (more 1s - e.g. quiescent).
- PCs explain the direction of the most variance, so if some features are more variable than others (e.g. quiescent is more variable than promoter) then the PCs will just target the quiescent regions.
- I think that we actually want to investigate the structure of the annotations rather than their frequency, so I am leaning towards scaling. In the next section I investigate the impact of scaling and no scaling on PCA.
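
The variance argument can be checked directly: a binary variable with frequency f has variance f(1 - f), so rare annotations contribute little variance unless we scale (a sketch with simulated features; the names are illustrative):

```
# unscaled, the common feature dominates the total variance;
# after scaling, both features have unit variance and PCA weighs them equally
set.seed(1)
promoter  <- rbinom(10000, 1, 0.05) # rare feature
quiescent <- rbinom(10000, 1, 0.5)  # common feature
c(var(promoter), var(quiescent))    # roughly 0.05*0.95 vs 0.25
apply(scale(cbind(promoter, quiescent)), 2, var) # both 1 after scaling
```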

I center the data and perform PCA twice, once with scaling and once without. Note the differences in the y-axis scale. *Why are the variances/eigenvalues so much bigger on the scaled data?*
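
A likely explanation (sketch): the eigenvalues sum to the total variance of the input, which after scaling is exactly the number of columns, whereas each unscaled binary column contributes at most f(1 - f) ≤ 0.25:

```
# eigenvalues (sdev^2 from prcomp) sum to the total variance of the input
set.seed(1)
ann <- sapply(c(0.05, 0.2, 0.5), function(f) rbinom(500, 1, f))
sum(prcomp(ann, center = TRUE)$sdev^2)                # small: at most 0.25 per column
sum(prcomp(ann, center = TRUE, scale. = TRUE)$sdev^2) # equals ncol(ann) = 3
```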