Extending cFDR for continuous annotations


1. Convert q’ back to q range


We consider three variations of the vector of annotations (\(q\)):

  1. \(q\) (original PC) \(\in (-Inf, Inf)\)
  2. \(q'\) (transformed PC) \(\in [0,1]\)
  3. \(q''\) (ranked PC) \(\in [0,1]\)

Last week we settled on using \(q''\) to deal with the fact that \(q\) is not a transformed normal, as James’ functions assume it to be. We seemed to get away with this because the method ultimately cares about \(Pr(P<p|Q<q)\), and the functional form of the distribution of Q is a matter of convenience.

Our L curves are therefore defined on \([0,1]\times [0,1]\) which works for James’ functions, but as I am attempting to rewrite the integration code, I need the L curves to be defined on \([0,1]\times (-Inf, Inf)\). As converting back from \(q''\) to \(q\) would be hard, I remove the ranking step, calculate the L curves using \(q'\) (still defined on \([0,1]\times [0,1]\)) and then convert the \(y\) co-ordinates of the L curve back to the \((-Inf, Inf)\) range.

#' range01
#' convert vector from (-inf, inf) range to [0,1] range
#' @param x vector 
#'
#' @return transformed vector
#' @export
range01 <- function(x){(x-min(x))/(max(x)-min(x))}

#' rangeinf
#' convert vector back from [0,1] range to (-inf, inf) range
#' @param x vector in [0,1] range
#' @param s original vector to get max and min from
#'
#' @return vector in (-inf, inf) range
rangeinf <- function(x, s){x*(max(s)-min(s))+min(s)}
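
A quick round-trip check on a toy vector (just to confirm that the two transformations invert one another):

x_test <- c(-3.2, 0.5, 4.1)
all.equal(rangeinf(range01(x_test), x_test), x_test) # TRUE
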
final_dataset <- readRDS("final_dataset.RDS")
pca_scale <- prcomp(final_dataset[,c(12:ncol(final_dataset))], center = TRUE, scale = TRUE)
pcs <- pca_scale$x

# user would input p in [0,1] and q in (-Inf, Inf)
p <- final_dataset$p
q <- pcs[,1]

# we would then transform q to [0,1] range to use vl()
q_trans <- range01(q)

p[which(p<1e-300)] <- 1e-300 # to avoid errors in James' code
q_trans[which(q_trans<1e-300)] <- 1e-300

candidate_indices <- which(p < 0.01) # 9944 of these

chr <- final_dataset$chrom

fold <- which(chr == 1)
ind <- intersect(candidate_indices, fold)
L <- vl(p, q_trans, indices=ind, mode=2, fold=fold)  # compute L curves

# convert y co-ords back
y_range01 <- as.vector(L$y)
y_rangeinf <- rangeinf(y_range01, q)

# consider a single example
p_example <- p[ind[1]]
q_example <- q[ind[1]]
q_trans_example <- q_trans[ind[1]]

par(mfrow=c(1,2))
plot(L$x[1,], L$y, type = "l", main = "Co-ords from vl() function", xlab = "p", ylab = "q")
points(p_example, q_trans_example, col = "blue", pch = 16)

plot(L$x[1,], y_rangeinf, type = 'l', main = "Transformed co-ords from vl() function", xlab = "p", ylab = "q")
points(p_example, q_example, col = "blue", pch = 16)


2. Function to integrate over


We want to integrate the joint distribution given the null. Given our independence assumption,

\[\begin{equation} f_{p,q'}(p,q|H_0^p)=f_p(p|H_0^p)\times f_{q'}(q|H_0^p) \end{equation}\]

where \(p\in[0,1]\) and \(q'\in[0,1]\). This is hard to do. However, now that we have transformed the q values back to the \((-Inf, Inf)\) range, we have

\[\begin{equation} f_{p,q}(p,q|H_0^p)=f_p(p|H_0^p)\times f_{q}(q|H_0^p) \end{equation}\]

where \(p\in[0,1]\) and \(q\in(-Inf, Inf)\).

Under the null, we assume that the P values are uniformly distributed, and we can use a KDE for the \(q\) part (since we are no longer restricted to the \([0,1]\) range). Note that James approximated \(Q|P>1/2\) using a mixture of Gaussians because a standard KDE is problematic on the bounded \([0,1]\) range.

I can then use the polyCub R package to integrate this continuous function over our polygon (the L region given by the L curve). (https://cran.r-project.org/web/packages/polyCub/vignettes/polyCub.html)


2a. KDE estimate of \(f(q|H_0^p)\)

I estimate \(f(q|H_0^p)\) empirically using those \(q\) values with corresponding \(p>1/2\) (hopefully representing null cases). For this reason, it is important that our method is applied to genomically controlled P values (to ensure they are uniformly distributed under the null).

  • Should this be approximated using the leave-one-chromosome-out method? For now I just use the full dataset to get an estimate.

I investigate using both the kde() function from the ks package and the base density() function for the KDE, and settle on density() as it is more widely used.

library(ks)

p <- final_dataset$p
q <- pcs[,1]

fqh0_density <- density(q[which(p>1/2)])
fqh0_ks <- kde(q[which(p>1/2)])

par(mfrow=c(1,2))
plot(fqh0_density, col=3, main = "KDE for Q|P>1/2 using density()")
plot(fqh0_ks, col=3, main = "KDE for Q|P>1/2 using kde()")

# extract function
fqh0 <- approxfun(fqh0_density$x, fqh0_density$y)
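
A quick sanity check (optional): the interpolated density should integrate to approximately 1 over the KDE grid (approxfun returns NA outside this range, so I integrate over the grid limits).

integrate(fqh0, min(fqh0_density$x), max(fqh0_density$x))$value # should be close to 1
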

2b. Estimate \(f(p|H_0^p)\)

We assume that this is the standard uniform distribution.

# estimate f_p(p|H_0) with standard uniform
fph0 <- function(x) dunif(x, 0, 1)

2c. Bivariate distribution to integrate

Both the functions are plotted below.

# estimate the product of these
# bivariate distribution: f(p,q | H0) = f_p(p | H0) x f_q(q | H0)
# from the polyCub documentation:
# a two-dimensional real-valued function to be integrated over polyregion. As
# its first argument it must take a coordinate matrix, i.e., a numeric matrix with
# two columns, and it must return a numeric vector of length the number of coordinates.
fpqh0 <- function(s){
  fph0(s[,1])*fqh0(s[,2])
}
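
A quick check of the product function using the objects defined above (assuming 0 lies well within the support of the fitted PC1 density):

fpqh0(cbind(0.5, 0)) # equals fqh0(0), since dunif(0.5, 0, 1) = 1
fqh0(0)
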

3. Integration using polyCub function


The polyCub R package I am using for the integration uses the fact that the integral of a continuously differentiable function \(f(x,y)\) over a domain \(W\subset {\rm I\!R}^2\) can be approximated using an \(n\)-point cubature rule of the form:

\[\begin{equation} \int \int_W f(x,y)dxdy \approx \sum_{i=1}^n w_i f(x_i, y_i), \end{equation}\]

i.e., a weighted sum of function values at an appropriate set of nodes. I am using the general-purpose product Gauss cubature (Sommariva & Vianello, 2007).

library(polyCub)

# make xylist of polygon co-ords
# need points to go anticlockwise to define the hole
new_order <- c(seq(which.max(L$x[1,]), 1, -1), seq(length(L$x[1,]), which.max(L$x[1,])+1, -1))

polygon <- list(
    list(x = L$x[1,][new_order],
         y = y_rangeinf[new_order])
)

# note the polygon is there, just squished on the left and 
# it doesn't let me stretch the x axis
plotpolyf(polygon, fpqh0, xlim = c(0,1), ylim = c(-10,10), cuts = 20)

# integrate the joint null distribution over the polygon to get my V values
v_me <- polyCub(polygon, fpqh0)
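
As a quick sanity check that the cubature behaves as expected (a minimal standalone example, not part of the pipeline): integrating the constant function 1 over the unit square should recover its area.

unit_square <- list(list(x = c(0, 1, 1, 0), y = c(0, 0, 1, 1))) # anticlockwise vertices
polyCub(unit_square, function(s) rep(1, nrow(s))) # should return ~1
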

4. Results


I compare the cFDR method using the three approaches:

A: James’ code with q=PC1 transformed to \([0,1]\) range.

B: James’ code with q=ranked PC1 transformed to \([0,1]\) range.

C: My code (detailed above).

PC1 compares quiescent (high q) with transcribed (low q) and so we see that the P values for those SNPs falling in transcribed regions are made smaller.



Tailing off

There is still this tailing off problem. This is at its worst for SNPs on chromosomes 10, 12, 16 and 18.

I have spoken to James about this and he says a quick fix is to make the gx parameter in the vl function smaller (although R then gives the following warning due to interpolating at very low gx: In regularize.values(x, y, ties, missing(ties)) : collapsing to unique ‘x’ values). But where does this gx come from? [The following is from James’ email].

It comes down to a trick he used to improve the robustness of the way L-curves are defined. Consider

\[\begin{equation} f(p,q)=\dfrac{p\times\#\{q_i<q\}}{\#\{p_i<p, q_i<q\}}=\text{traditional cfdr}. \end{equation}\]

The L region corresponding to a threshold \(\alpha\) is roughly the set of points with \(f(p,q)<\alpha\). The p co-ordinate of the L curve when it passes through the line q=qtrial should be the intersection of the curve Y=f(p,qtrial) with the line Y=alpha.

library(cfdr)

set.seed(1)                                           # ensure reproducibility

new_sd <- 5

n=1000; n1p=50; n1pq=50; n1q=50                   # parameters

par(mfrow = c(1,2))

zp=c(rnorm(n1p,sd=new_sd), rnorm(n1q,sd=1),
    rnorm(n1pq,sd=new_sd), rnorm(n-n1p-n1q-n1pq,sd=1))    # simulate z-scores corresponding to p
zq=c(rnorm(n1p,sd=1), rnorm(n1q,sd=3),
    rnorm(n1pq,sd=3), rnorm(n-n1p-n1q-n1pq,sd=1))    # simulate z-scores corresponding to q
P=2*pnorm(-abs(zp)); Q=2*pnorm(-abs(zq))              # convert to p-values

rough_cfdr=function(p,q) p*length(which(Q <= q))/length(which(P <= p & Q <= q))

qtrial=0.2; ptrial=seq(0,1,length.out=100); ctrial=0*ptrial
for (i in 1:length(ptrial)) ctrial[i]=rough_cfdr(ptrial[i],qtrial)

par(mfrow=c(1,2))
plot(ptrial,ctrial,ylab="rough cfdr",xlab="p",type="l")
 abline(h=0.5)
abline(h=2.55,col="red")

So for alpha=0.5 this is fine, but for e.g. alpha=2.55 the line hits the curve twice, since \(f(p,q)\) isn’t strictly increasing with \(p\). He therefore does something fancy: he redefines \(g(p,q)=\inf_{p'>p} f(p',q)\) to make the intersection unique. This \(g(p,q)\) is usually equal to \(f(p,q)\) but is non-decreasing with \(p\). It basically follows \(f\) and levels off in decreasing regions.

### g=function(p,q) inf_(p'>p) f(p',q)  # this function is complicated to implement

gval=rev(cummin(rev(ctrial))) # this will give values g(p,qtrial)

plot(ptrial,ctrial,ylab="rough cfdr",xlab="p",type="l")
lines(ptrial,gval,col="red")
abline(h=0.5)
abline(h=2.55,col="red")

But note that there are some flat bits of \(g(p,q)\) where there is no unique p value solution. He originally resolved this by picking the right-hand edge of the flat bit, but this created problems if the flat bit was very long. Instead he defines a new version of g:

\[\begin{equation} h(p,q)=g(p,q)+\text{gx}\times p \end{equation}\]

thereby adding a very slowly increasing linear term (so that the flat bits now increase very slowly). Thus \(h\) is strictly increasing in \(p\), whereas \(g\) was only non-decreasing. He had been using gx=1e-5, which was usually fine but messes things up for very small P values.
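
To make this concrete, here is a small sketch continuing the simulated example above (gx = 1e-5 is the value mentioned; the exact outcome of the first check depends on the simulation):

gx <- 1e-5 # the value mentioned above
hval <- gval + gx*ptrial # h(p, qtrial) = g(p, qtrial) + gx*p
any(diff(gval) == 0) # typically TRUE: g has flat stretches
all(diff(hval) > 0) # TRUE: h is strictly increasing, so the intersection with Y=alpha is unique
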


Brief summary of the problem

This gx parameter comes from how the P (x) co-ordinate of the L curve is picked when there are multiple P values to choose from. L regions are the set of points for which \(cfdr(p,q)\leq \alpha\). The P co-ordinate of the L curve for some specific q=qtrial is the intersection of the curve Y=f(p,qtrial) and the line Y=alpha. But since f(p,q) is not strictly increasing in p, there can be more than one intersection point. James therefore defined a new function g(p,q) that basically follows f(p,q) but becomes flat over any regions where f(p,q) is decreasing. But which p value do we pick on these flat bits? As a workaround, he defines a new function h(p,q)=g(p,q)+gx*p, which is strictly increasing, meaning that a single sensible P value is picked.


Final results (using gx = min(p))


In these first results, Q=PC1 which compares quiescent (large) with transcribed (small), so it seems that the results are doing the correct thing; P values downweighted in quiescent regions and upweighted in transcribed regions.


In these second results, I’ve used Q=-PC1 so that transcribed (large) is being compared with quiescent (small). Again it seems that the method is doing the right thing; P values downweighted in quiescent regions and upweighted in transcribed regions.


It seems that the results using James’ vl() function for ranked q (middle) are similar to those using my code (right). I investigate the difference in results between these two methods.

The resultant V values seem the same for small V values but differ for higher V values (my V values are smaller when using q and bigger when using -q). James’ vl() function is faster than my code (but mine uses a for loop and could be sped up).



Iterating

I investigate what happens when I iterate over the same PC (i.e. the case of maximal correlation between iterations):

  1. Using P = V values from James’ ranked q method (V) and following cfdr using James’ ranked q method to obtain V2.

  2. Using P = V values from my method and following cfdr using my method.


I also investigate what happens when I iterate over (i) q = PC1 and then (ii) q = -PC1. Note: I’m not sure of the colour coding of Q here as it switches…

  1. Using P = V values from James’ ranked q method and following cfdr using James’ ranked q method.

  2. Using P = V values from my method and following cfdr using my method.


5. Correspondence analysis (CA)


At the moment we are using PCA to obtain independent dimensions to summarise our annotation data, but we haven’t given this technique much thought. David suggested reading up on “correspondence analysis”, a tool similar to PCA that can be applied to categorical data.

CA and PCA are similar in that they both attempt to find a lower-dimensional representation of the data using the SVD algorithm. The fundamental difference is that PCA decomposes relations between columns only (by decomposing (SVD) their covariance matrix and treating the rows as “cases”), whereas CA decomposes columns and rows simultaneously, treating them symmetrically (SVD on the standardised residuals measuring deviation from independence of the rows and columns). For this reason, CA is often applied to contingency tables. Moreover, PCA uses the original values (and can thus be used to compare how much of something there is) whereas CA uses relative values (so can only be used to compare how much of something there is relative to the other stuff).

The simple method is:

  1. Compute the average of each row and column.
  2. Compute the expected values using the data in the table (the row average for that cell multiplied by the column average and divided by the overall average).
  3. Compute the residuals (observed - expected). These residuals show the associations between the row and column labels, i.e. the higher the residual for row \(i\) and column \(j\), the more associated they are with one another. Note that these residuals are calculated relative to the data we have, so, for example, if all the rows had the same value in a single column then the residual would be small for all of them (as none show a high association relative to the other rows). A small numerical sketch of these three steps is given below.
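
A minimal sketch of the three steps on a hypothetical 3x3 contingency table (toy counts, not real data):

tab <- matrix(c(10, 4, 6,
                2, 8, 5,
                7, 3, 9),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("row", 1:3), paste0("col", 1:3)))
row_avg <- rowMeans(tab); col_avg <- colMeans(tab); overall_avg <- mean(tab) # step 1
expected <- outer(row_avg, col_avg)/overall_avg # step 2
resid_tab <- tab - expected # step 3: observed - expected
resid_tab # large positive entries flag row/column pairs that are strongly associated
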

On the resultant plot, each dimension represents a proportion of variance explained (as in PCA) and the labels with similar residuals are plotted close together. The further the distance from the origin, the more that row/column label is associated with some of the column/row variables (i.e. the further it is from being independent of the columns/rows, i.e. the more discriminating that category is), and this relationship can be positive or negative. This is because the coordinates partition the Chi-square value used in testing independence (instead of e.g. the total variance in PCA). But this simplistic interpretation can only be used when comparing rows with rows or columns with columns.

To compare rows with columns (or vice versa), we need to draw a line from the origin to the row point and to the column point in question and measure the angle between these two lines. Angles near 90 degrees indicate no relationship, and angles near 180 degrees indicate negative associations. Thus, just because a row and a column appear close to each other, this doesn’t mean a high residual/high association; we instead need to check that they are both pointing in roughly the same direction, as in the small sketch below.
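
For example, given two-dimensional coordinates for one row point and one column point (hypothetical values here, e.g. taken from a CA output), the angle can be computed as:

row_pt <- c(0.8, 0.3) # hypothetical row coordinates on dimensions 1 and 2
col_pt <- c(0.6, -0.1) # hypothetical column coordinates
cos_angle <- sum(row_pt*col_pt)/sqrt(sum(row_pt^2)*sum(col_pt^2))
acos(cos_angle)*180/pi # near 0 = positive association, near 90 = none, near 180 = negative
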

Good sources: https://www.displayr.com/how-correspondence-analysis-works/ and https://www.displayr.com/math-correspondence-analysis/ and the original paper https://journals.sagepub.com/doi/pdf/10.1177/096228029200100106


Side note on SVD: SVD is a factorisation of an input matrix. The output is a vector \(d\) (the singular values; the eigenvalues are the squares of these) and two matrices \(u\) and \(v\) containing the left singular vectors (representing the rows of the input table) and the right singular vectors (representing the columns of the input table). We can use this to obtain a lower-dimensional representation of our data (the columns of \(u\) are orthogonal to each other, as are the columns of \(v\)). Source: https://www.displayr.com/singular-value-decomposition-in-r/
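
A small illustration using base R’s svd() on a toy matrix:

X <- matrix(rnorm(20), nrow = 5) # toy 5 x 4 input matrix
s <- svd(X)
s$d # singular values (their squares are the eigenvalues of t(X) %*% X)
# rank-2 approximation of X using the first two left and right singular vectors
X_rank2 <- s$u[,1:2] %*% diag(s$d[1:2]) %*% t(s$v[,1:2])
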


Multiple correspondence analysis (MCA)

This is an extension of CA to investigate relationships between several categorical dependent variables (i.e. nominal categorical data). It is basically the standard CA analysis applied to an indicator matrix (but with corrected percentages of explained variance and an altered interpretation of the plot). It is used for multiple nominal variables (e.g. segmentation annotation in CD4+ T cells) that have several levels (e.g. enhancer, promoter, transcribed, …). Note that there will only be one 1 per nominal variable (e.g. SNP 1 can only be one annotation in each cell type) and all rows should have the same total sum: “The complete data table is composed of binary columns with one and only one column taking the value ‘1’ per nominal variable”.

As with CA, the resultant plot only compares points from the same set (rows with rows or columns with columns) but plots a point for each row and for each column. In this analysis we’re typically interested in the proximity between levels of different nominal variables (e.g. the distribution of SNPs in the various segmentation annotations is similar in CD4+ and CD8+ cells). Note that since the levels of the same nominal variable cannot occur together, a different interpretation is needed.

Corrected percentages of variance explained: Since each nominal variable is represented by several binary columns rather than a single categorical column, the inertia (variance) of the solution space is artificially inflated, meaning that the percentage of inertia explained by the first dimension is severely underestimated. It has been shown mathematically that all the factors with an eigenvalue less than or equal to \(1/K\), where \(K\) is the number of nominal variables, simply code the additional dimensions. Benzécri (1979) and Greenacre (1993) offer corrections to this problem, with the Greenacre method being most popular (both are just plug-in formulas to get corrected eigenvalues).

MCA is performed by applying the CA algorithm to either an indicator matrix (\(Z\)) or a Burt table (\(Z^T\times Z\)), the latter being a more natural extension of conventional CA (a small toy example of both is sketched below).
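
A minimal sketch of the indicator matrix \(Z\) and the Burt table for two hypothetical nominal variables:

df_toy <- data.frame(cellA = factor(c("enh", "prom", "enh", "quies")),
                     cellB = factor(c("prom", "prom", "quies", "quies")))
# indicator matrix Z: one binary column per level, exactly one 1 per nominal variable per row
Z <- do.call(cbind, lapply(names(df_toy), function(v) {
  m <- sapply(levels(df_toy[[v]]), function(l) as.integer(df_toy[[v]] == l))
  colnames(m) <- paste(v, levels(df_toy[[v]]), sep = "_")
  m
}))
rowSums(Z) # all equal to the number of nominal variables (2)
burt <- t(Z) %*% Z # Burt table: cross-tabulates every level against every other level
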

Interpretation: As for CA, proximities are meaningful only between points from the same set (i.e. rows with rows or columns with columns). When two row points are close to each other, they tend to select the same levels of the nominal variables (e.g. SNP 1 and SNP 2 fall in similarly annotated segments across the cell types). Proximity between levels of different nominal variables (e.g. brain cells and CD4+ T cells) means that these levels tend to appear together in the observations.


Implementation

I run an MCA on my annotation data and compare the relative contributions of each annotation in each cell type with the results from my earlier PCA analysis. Note that, as in the PCA analysis, I have removed any SNPs with NAs (the resultant data frame contains 121879 SNPs rather than 123105).

Comparing scree plots for MCA and PCA:

library(FactoMineR) # for MCA()
library(factoextra) # for get_eigenvalue() and the fviz_* helpers

res.mca <- MCA(data.active, graph = FALSE)

# print(res.mca)

eig.val <- get_eigenvalue(res.mca)
# fviz_screeplot(res.mca, addlabels = TRUE, ylim = c(0, 45))

# Coordinates
#head(var$coord)
# Cos2: quality on the factore map
#head(var$cos2)
# Contributions to the principal components
#head(var$contrib)

fviz_screeplot(res.mca, addlabels = TRUE, ylim = c(0, 20), main = "Scree plot for MCA")

fviz_screeplot(pca_scale, addlabels = TRUE, ylim = c(0, 20), main = "Scree plot for PCA")

Visualise the correlation between variables and the MCA principal dimensions:

fviz_mca_var(res.mca, choice = "mca.cor", 
            repel = TRUE, # Avoid text overlapping (slow)
            ggtheme = theme_minimal())

The first two dimensions from PCA and MCA only explain ~20% of the total variance, whilst the first 10 only explain 37.5% of the total variance; this provides evidence for iterating over many PCs. Note that not all points are equally well displayed in each dimension. The quality of the representation is called the squared cosine (cos2), which measures the degree of association between variable categories and a particular axis. For the first two dimensions:

fviz_cos2(res.mca, choice = "var", axes = 1:2) # want this to be high or they're not well represented by the dimensions


I investigate which variables are associated with each dimension from both MCA and PCA. The results from MCA look more cell-type specific, which is what we wanted. Notice that there are two of the same coloured dots for some cell types… need to check this.


Comments and next steps


  • cFDR for binary annotation data:

    • At the moment I have this working for a single q (i.e. binary indicator of whether the SNP is in an active region in a given cell type).
    • Could extend this so that the code iterates over many q, but would need to check that these aren’t correlated, or think of a way to find an independent lower-dimensional representation (as with continuous cfdr).
    • Will the gx problem affect the binary extension? I.e. are there any instances where there is a non-unique P value for the L curve?
  • cFDR method for categorical/continuous annotation data:

    1. Find an independent lower-dimensional representation of the annotation data. MCA or PCA?
    2. Iterate the cfdr method over the dimensions, updating the V values each time.
    • Should I:
    1. use James’ method to find the L curves using the ranked q vector (\(q''\)), setting gx in vl() to the minimum P value in the dataset and using the il() function to perform the integration (Q is approximated as a mixture of normals), OR
    2. use my code to find the L curves using the transformed q vector (\(q'\)), setting gx in vl() to the minimum P value in the dataset and using KDE estimation to integrate \(p,q|H_0^p\) over the L region (that has been transformed to \([0,1]\times(-Inf,Inf)\))? [If the latter, I also need to decide whether to use the leave-one-chromosome-out method to approximate the KDE.]
  • What is the effect of taking negative q before the ranking/transforming to \([0,1]\)?

  • Make an R package for the binary and continuous annotation extensions.

  • Extend to PPs?