Useful info


1. Comparing PCA and MCA


My code for PCA and MCA in earlier summaries was slightly off (only looking at 14 cell types). I’ve corrected this to look at all 19 cell types.


PCA

The loadings plots below show the coefficients of the linear combination of variables making up each dimension. PC1 contrasts quiescent and heterochromatin (high) with transcribed and enhancer (low) regions, whereby SNPs lying in a quiescent region in CD3 peripheral cells will be given a large value, and those in transcribed regions in CD56 cells a small value, when mapped onto this first dimension.

Interestingly, PC2 contrasts transcribed with enhancer regions.
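
A minimal sketch of how such a loadings plot can be produced, assuming `pca_scale` is the `prcomp()` result used throughout:

```r
# The loadings are the columns of the rotation matrix returned by prcomp().
loadings <- pca_scale$rotation[, 1:2]

# Sort PC1 loadings so the contrast between annotation/cell-type
# combinations at the two extremes is easy to read off.
barplot(sort(loadings[, 1]), horiz = TRUE, las = 1,
        main = "PC1 loadings", cex.names = 0.5)
```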


MCA (FactoMineR::MCA)

MCA can be performed using the indicator matrix (\(Z\)) or the Burt matrix (\(B=Z^TZ\)). I investigate using both of these methods in FactoMineR::MCA.

df <- readRDS("categorical_annots.RDS")

res.mca.ind <- MCA(df, graph = FALSE, method = "Indicator")

res.mca.burt <- MCA(df, graph = FALSE, method = "Burt")

Note that in correspondence analysis, the term ‘loadings’ isn’t really used. Instead, ‘coordinates under principal normalisation by inertia’ are used. These are equal to the eigenvector elements multiplied by the square root of the respective eigenvalue (similar to loadings in PCA). Here, I plot the column coordinates.
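
This relationship can be sketched directly on the `res.mca.ind` object from above (field names as documented for FactoMineR):

```r
# Principal coordinates of the categories (columns)
coord <- res.mca.ind$var$coord

# Eigenvalues per dimension
eig <- res.mca.ind$eig[, 1]

# Dividing each dimension by sqrt(eigenvalue) recovers the
# standard coordinates (the eigenvector elements).
std <- sweep(coord, 2, sqrt(eig[seq_len(ncol(coord))]), "/")
```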

For the indicator method (the function’s default), PC1 contrasts promoter and enhancer (high) with quiescent and heterochromatin (low).


The Burt method gives coordinates that are proportional to those of the indicator method. This is because “The analysis of the indicator matrix Z and the Burt matrix B give identical standard coordinates of the category points, but slightly different scalings in the principal coordinates because the principal inertias of B are the squares of those of Z.” Indeed, both methods give the same coordinates (transformed values), but the variance explained differs substantially.

Since both MCA methods give the same transformation, I don’t need to worry about which one to use for now, only when I need to start iterating.
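
The proportionality can be checked numerically using the two result objects from above:

```r
# If the coordinates from the two methods are proportional, this ratio
# is constant (up to sign) within each dimension.
ratio <- res.mca.burt$var$coord[, 1] / res.mca.ind$var$coord[, 1]
range(ratio)  # a constant ratio confirms proportionality on dim 1
```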


I compare the range of transformed values when using PCA and MCA. Those from PCA are much larger than those from MCA.

# PCA
range(pca_scale$x[,1:5])
## [1] -12.60736  10.96371
# MCA
range(res.mca.ind$ind$coord)
## [1] -1.854589  2.792009

MCA (ca::mjca)

I am not sure if the FactoMineR::MCA function used so far adjusts the inertias or not:

“The so-called ‘percentage of inertia problem’ can be improved by using adjusted inertias procedure or eigenvalue correction. The adjusted inertias are calculated only for each singular value that satisfies the inequality >= 1/number of variables. They are expressed as a percentage of the average off-diagonal inertia, which can be calculated either by direct calculation on the off-diagonal tables in the Burt matrix. The adjusted solution not only does it considerably improve the measure of fit, but it also removes the inconsistency about the Burt matrix to analyse. This inconsistency is due to artificial dimensions added because one categorical variable is coded with several columns. As a consequence, the inertia (i.e., variance) of the solution space is artificially inflated and therefore the percentage of inertia explained by the first dimension is severely underestimated.” (https://uk.mathworks.com/matlabcentral/fileexchange/22558-multiple-correspondence-analysis-based-on-the-burt-matrix).

There seems to be only one R package that explicitly states whether or not it does this adjustment (ca::mjca). Here, the lambda option selects the scaling variant desired for reporting inertias:

  • lambda=“indicator” gives multiple correspondence analysis based on the correspondence analysis of the indicator matrix, with corresponding inertias (eigenvalues).

  • lambda=“Burt” gives the version of multiple correspondence analysis based on the correspondence analysis of the Burt matrix, the inertias of which are the squares of those for the indicator option.

  • lambda=“adjusted” is the default option, giving improved percentages of inertia based on fitting the off-diagonal submatrices of the Burt matrix by rescaling the multiple correspondence analysis solution. All these first three options give the same standard coordinates of the categories.

library(ca)

mca.mjca = mjca(df, lambda = "indicator", nd = 5)

mca.mjca.burt = mjca(df, lambda = "Burt", nd = 5)

mca.mjca.adj = mjca(df, lambda = "adjusted", nd = 5)

It seems that the FactoMineR::MCA function does not adjust to deal with this ‘percentage of inertia’ problem (the scree plots are the same as for the lambda = “indicator” and lambda = “Burt” options in the ca::mjca function). This means that if using the FactoMineR::MCA function then I can forget about any dimensions with inertia \(\leq 1/19\).
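
The scree-plot comparison can be sketched numerically; this assumes the objects fitted above, and that `inertia.e` (proportion of inertia explained, as documented for ca::mjca) and `eig` (FactoMineR) hold the relevant quantities:

```r
# Proportion of inertia explained per dimension under each variant.
# If FactoMineR matches the unadjusted variants, its column should
# track mjca_ind, not mjca_adj.
cbind(factominer = res.mca.ind$eig[1:5, 2] / 100,
      mjca_ind   = mca.mjca$inertia.e[1:5],
      mjca_adj   = mca.mjca.adj$inertia.e[1:5])
```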


I decide to stick with the ca::mjca function in my analysis as this provides the adjusted inertia option. This function also provides both “standard” and “principal” coordinates, which differ by a rescaling: the standard coordinates are a rescaling of the principal coordinates to unit inertia along each axis (see slides 17/18 here). Since I am transforming the Q values to the \([0,1]\) range anyway, I don’t have to worry about this.
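
A sketch of this \([0,1]\) transformation, assuming `mca.mjca.adj` from above (`rowpcoord` holds the principal coordinates of the individuals, as documented for ca::mjca):

```r
# Min-max scale the dimension-1 coordinates to [0, 1]; any overall
# rescaling of the coordinates (standard vs principal) cancels out here.
q <- mca.mjca.adj$rowpcoord[, 1]
q_trans <- (q - min(q)) / (max(q) - min(q))
range(q_trans)  # 0 1
```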


It is promising that this function gives coordinates for the annotations similar to those from the FactoMineR::MCA results (the first two dimensions are proportional; dimensions 3 and 4 are proportional but flipped in sign).


I also check the point cloud for the individuals, colouring by annotation in each cell type (note that the colours representing the annotations vary across cell types).


2. Results


I have my code running as fast as I can, by using the owin function in the spatstat package to define the L region. However, this function struggles to define the polygon when the range of \(x\) is very small (\(P\) values \(\leq 1e-20\)). I have spoken to the authors and they have two suggestions:

  1. Set check=FALSE and fix=FALSE in the function. The former retains the polygon coordinates as given, although suppressing this safety mechanism may lead to errors due to limited numerical precision. The latter ensures that the polygon data is not modified and invalid entries (such as repeated vertices) are not removed. Using the owin function with these parameters gives me exactly the same results as the original code I wrote to do the same job, so I’m not too worried.

  2. Rescale the \(x\) co-ordinates of the L curve so that they are on the same scale as \(y\) (i.e. \((-Inf, Inf)\)) and then rescale the function I am trying to integrate over this region. Because this problem only occurs for extremely small P values (which are going to be called significant anyway), I use the simple workaround offered by the authors.


My method and James’ method

My method is:

  1. Calculate L curves (on transformed Q values) using the vl() function with the gx=min(p) option (which solves the tailing-off problem).

  2. Convert L curve Y co-ordinates back to \((-Inf,Inf)\) range and use these co-ordinates to define the polygon (using the owin() function with check=FALSE, fix=FALSE parameters so that it works for very small P values).

  3. Use the polyCub function to integrate the joint bivariate distribution (under the null - estimated using KDE) over this polygon to obtain the \(V\) values.

for(i in 1:length(unique(chr))){
  fold[[i]] <- which(chr == i)                        # leave-one-chromosome-out fold
  ind[[i]] <- intersect(candidate_indices, fold[[i]])
  L[[i]] <- vl(p, q_trans, indices=ind[[i]], mode=2, fold=fold[[i]], gx = min(p[ind[[i]]]))  # compute L curves
  
  y_rangeinf <- rev(rangeinf(as.vector(L[[i]]$y), q)) # convert Y co-ordinates back to (-Inf, Inf)
  
  # integrate the null density fpqh0 over each L-curve polygon to obtain V values
  v[[i]] <- apply(L[[i]]$x, 1, function(x) polyCub(owin(poly=list(x = rev(x), y = y_rangeinf), check = FALSE, fix = FALSE), fpqh0))
}

James’ method is:

  1. Calculate the L curves (on ranked Q values) using the vl() function with the gx=min(p) option (which solves the tailing-off problem).

  2. Use the il() function to integrate over the L curves. (vl(p, q_rank, indices=ind[[i]], mode=2, fold=fold[[i]], gx = min(p[ind[[i]]])))

for(i in 1:length(unique(chr))){
  fold[[i]] <- which(chr == i)
  ind[[i]] <- intersect(candidate_indices, fold[[i]])
  L_rank[[i]] <- vl(p, q_rank, indices=ind[[i]], mode=2, fold=fold[[i]], gx = min(p[ind[[i]]]));  # compute L curves
  v_james_rank[[i]] <- il(L_rank[[i]], pi0_null = est_q0_pars[1], sigma_null = est_q0_pars[2], distx="norm") # integrate over L curves
}

James’ code takes ~ 27 seconds for 1148 SNPs, and mine takes ~ 78 seconds (about 3 times slower).

My code takes 1.5 hrs to run on all 121,000 SNPs.


q

I run the analysis using the results from the adjusted ca::mjca method (dim 1 coordinates = q). Recall that high Q indicates promoter/enhancer whilst low Q indicates quiescent/heterochromatin.

The results look good, P values for SNPs in promoters/enhancers are upweighted whilst those in quiescent/heterochromatin are downweighted.


QQ plots


I construct QQ plots to learn about the distribution of P and V values for various quantiles of Q. That is, I plot a QQ plot of -log10(P) (and -log10(V)) for SNPs across 5 quantiles of Q (0–0.2, 0.2–0.4, 0.4–0.6, 0.6–0.8, 0.8–1).

[Note that I have to compare with expected quantiles from a uniform distribution, not the classic normal distribution].
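
A sketch of the QQ-plot construction, assuming the vectors `p` (P values) and `q_trans` (Q values on \([0,1]\)) used above:

```r
# Stratify SNPs into quintiles of Q
grp <- cut(q_trans, breaks = seq(0, 1, 0.2), include.lowest = TRUE)

plot(NULL, xlim = c(0, 6), ylim = c(0, 10),
     xlab = "Expected -log10(P)", ylab = "Observed -log10(P)")
abline(0, 1, lty = 2)  # y = x reference line

for (g in seq_along(levels(grp))) {
  pg <- sort(p[grp == levels(grp)[g]])
  # Expected quantiles under the uniform null (not the normal)
  points(-log10(ppoints(length(pg))), -log10(pg),
         col = g, pch = 20, cex = 0.4)
}
```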

  • Lower q (black/red) predicts higher P values (which makes sense because lower q is heterochromatin/quiescent).

Does it make sense for these to look so similar? I guess so…


-q

Recall now that high Q indicates quiescent/heterochromatin and low Q indicates promoter/enhancer.

The correct things are again being upweighted and downweighted.


Comparing V values from mine and James’ methods

The V values look mostly the same for mine and James’ methods. However, something goes wrong with my method for P values near 1. I need to check this out (although it is not a huge problem, as we wouldn’t be running the method for these big P values anyway).


3. Iterating


What to iterate over

I investigate how similar or dissimilar the V values are when using \(q\) and when using \(-q\). Hmm, it seems that they are pretty different.

Perhaps this is because the Q values range over \([-1,4]\), and when we transform to \([0,1]\) there will be more values close to 1 for the q method and more close to 0 for the -q method?
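
A toy illustration of this point (hypothetical numbers, chosen only to mimic a skewed \([-1,4]\) range):

```r
# Min-max scaling maps -q to exactly 1 - minmax(q), so a skewed
# distribution of Q piles up near one end for q and near the
# opposite end for -q.
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
q <- c(-1, 0, 3.8, 3.9, 4)  # skewed toy values
minmax(q)    # 0.00 0.20 0.96 0.98 1.00  (mass near 1)
minmax(-q)   # 1.00 0.80 0.04 0.02 0.00  (mass near 0)
```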


Iterate over dim 1, dim 2 and dim 3

I check what the V values look like when I iterate over the first 3 dimensions (which explain 91.1% of the total variance, if we believe the adjusted scree plot):

  1. P = original P values, Q = MCA[,1] --> V
  2. P = V, Q = MCA[,2] --> V2
  3. P = V2, Q = MCA[,3] --> V3
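
The scheme above can be sketched as follows, where `cfdr_step()` is a hypothetical wrapper around steps 1–3 of my method (compute L curves, build the polygon, integrate), and `q1_trans`–`q3_trans` are the \([0,1]\)-transformed MCA dimensions:

```r
# Each iteration feeds the previous V values in as the "P values",
# paired with the next MCA dimension.
v1 <- cfdr_step(p,  q1_trans)  # P = original P values, Q = MCA[,1]
v2 <- cfdr_step(v1, q2_trans)  # P = V,  Q = MCA[,2]
v3 <- cfdr_step(v2, q3_trans)  # P = V2, Q = MCA[,3]
```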

Recall that Q = MCA[,1] contrasts promoter and enhancer (high) with quiescent and heterochromatin (low). Q2 = MCA[,2] contrasts promoter and enhancer (high) with transcribed (low). In the second iteration, the results therefore look to be doing the right sort of thing: those in transcribed regions are reduced the most, whereas those in promoter/enhancer regions are reduced slightly. However, I’d have hoped the cloud of points with medium Q values would lie closer to the \(y=x\) line.

Q3 = MCA[,3] contrasts reg permissive (high) with promoter (low). In the third iteration, there seem to be some weird patterns… however, the overall results look OK in that the P values for SNPs in promoters (low) are made smaller.


Single example

I consider a single example for SNP rs12755215, which has a P value of 2.692092e-06. It is in a transcribed region in the brain but in quiescent or heterochromatin regions in all the immune-related cell types. In the first dimension of the MCA analysis, it has a very small Q1 value relative to other SNPs (indicating quiescent/heterochromatin). In the second and third dimensions it has relatively mid-range Q2 and Q3 values (again indicating quiescent/heterochromatin).

In the first iteration, the resultant V value is bigger than the P value; however, in the second and third iterations it is made smaller, which I don’t yet understand.

##   chr       rsid            p     OR       brain      CD14      CD19
## 1   1 rs12755215 2.692092e-06 1.2227 Transcribed Quiescent Quiescent
##         CD3.cord CD3.peripheral   CD4.mem CD4.naive         CD45RA
## 1 FacultativeHet      Quiescent Quiescent Quiescent FacultativeHet
##           CD45R0 IL17.macs IL17.primary        Th CD127.Treg CD127.Tmem
## 1 FacultativeHet Quiescent    Quiescent Quiescent  Quiescent  Quiescent
##        CD56 CD8.memory CD8.naive panislets    thymus         q1      q2
## 1 Quiescent  Quiescent Quiescent Quiescent Quiescent -0.9471884 0.17166
##            q3   q1_trans  q2_trans q3_trans          v1           v2
## 1 -0.03568253 0.06957817 0.4702977 0.599473 4.01217e-06 2.963965e-06
##             v3
## 1 1.553544e-06

Perhaps dimension 3 is messing things up… the range of its coordinates is a lot larger than that of any of the other dimensions, so SNPs in promoter or reg permissive regions will change a lot whereas those in other regions will change very little. Also, dimension 3 only explains ~4% of the variability, compared with dimensions 1 and 2, which explain a lot more.

range(mca_df[,1])
## [1] -1.343831  4.356846
range(mca_df[,2])
## [1] -3.577356  4.394224
range(mca_df[,3])
## [1] -15.32941  10.18255

Dimension 3

I check what the results look like for \(P\) = original P values and \(Q\) = the 3rd dimension from the MCA. High indicates reg permissive and low indicates promoter, so it looks like the right sort of thing is happening. But what about the exclusive shrinking we see?


Comments and queries