MCA


Sources:



Comparing ca::mjca and PCAmixdata::PCAmix


Previously, I have used ca::mjca for MCA; however, the PCAmixdata package offers a varimax-style rotation method via its PCAmix function. Here I compare the standard MCA results (no rotation) from the two packages.

  • The ca::mjca function offers a choice of 3 different MCA approaches (standard MCA on the indicator matrix, CA of the Burt matrix, or “adjusted” MCA, which deals with the percentage-of-inertia-explained problem). PCAmixdata::PCAmix uses a different approach on the indicator matrix, called “the single PCA approach”.

  • The factor coordinates of the levels are the same (using principal coordinates in ca::mjca), but the factor coordinates of the observations are multiplied by \(\sqrt{p}=\sqrt{19}\) in PCAmixdata::PCAmix. The authors state that “This property has no impact since results are identical to within one multiplier coefficient.” This means that the range of the observation coordinates is \((-0.811, 2.628)\) for ca::mjca and \((-3.534, 11.457)\) for PCAmixdata::PCAmix.
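The scale relationship can be checked numerically: dividing the PCAmix observation coordinates by \(\sqrt{p}\) recovers the mjca scale. A minimal sketch (the two endpoint values are the ranges quoted above; the middle value is made up for illustration):

```python
import numpy as np

# The two packages' observation coordinates differ only by a factor of
# sqrt(p), where p = 19 variables. Dividing the PCAmix coordinates by
# sqrt(19) recovers the mjca scale.
pcamix_coords = np.array([-3.534, 0.12, 11.457])
mjca_scale = pcamix_coords / np.sqrt(19)
print(mjca_scale.round(3))  # endpoints match the mjca range (-0.811, 2.628)
```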

  • PCAmixdata::PCAmix does not provide the adjustment for the percentage-of-inertia-explained problem. This means that the percentage of inertia explained by each dimension when using PCAmixdata::PCAmix is underestimated, which will impact our stopping rule for the iterations.


MCA rotation


  • I consider orthogonal rotation methods to obtain simpler loadings (the rotation needs to be orthogonal rather than oblique because of our iteration procedure).

  • The varimax rotation (Kaiser 1958) is common in the PCA field. It maximises the sum of the variances of the squared loadings (the squared correlations between variables and factors) to obtain a simpler structure in the loadings matrix (that is, each factor picks out some variables with very high loadings and leaves others with negligible loadings).

  • In MCA, however, where the variables are qualitative, we cannot compute correlations between variables and factors to optimise in the varimax procedure. An alternative measure for qualitative variables was suggested by Kiers (1991): the “contribution of a component to the inertia of a variable that is accounted for”, or in other words, the squared correlation between an optimally quantified variable and a factor (i.e. the correlation ratio).

  • The R package PCAmixdata can perform PCA/MCA for mixed data with the relevant varimax procedure (optimising squared correlations for quantitative variables in PCA and correlation ratios for qualitative variables in MCA).
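To make the two quantities above concrete, here is a minimal sketch of (a) the standard SVD-based varimax rotation (a generic textbook formulation without Kaiser normalisation — not PCAmixdata's internal implementation) and (b) the correlation ratio for a qualitative variable; all names and inputs are illustrative:

```python
import numpy as np

def varimax(L, tol=1e-8, max_iter=500):
    """Orthogonally rotate a loadings matrix L (variables x factors) to
    maximise the varimax criterion: the sum over factors of the variance
    of the squared loadings."""
    n, k = L.shape
    R = np.eye(k)
    crit_old = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        # Gradient of the varimax criterion; the orthogonal Procrustes
        # solution (u @ vt) gives the updated rotation matrix.
        grad = L.T @ (Lr ** 3 - Lr * (np.sum(Lr ** 2, axis=0) / n))
        u, s, vt = np.linalg.svd(grad)
        R = u @ vt
        crit_new = np.sum(s)
        if crit_new - crit_old < tol:
            break
        crit_old = crit_new
    return L @ R, R

def correlation_ratio(categories, scores):
    """Correlation ratio (eta squared) between a qualitative variable and
    a factor: the share of the factor's variance explained by the
    category means. This is the quantity optimised for qualitative
    variables in the Kiers-style rotation."""
    scores = np.asarray(scores, dtype=float)
    categories = np.asarray(categories)
    grand_mean = scores.mean()
    between = sum(
        (categories == c).sum() * (scores[categories == c].mean() - grand_mean) ** 2
        for c in np.unique(categories)
    )
    total = np.sum((scores - grand_mean) ** 2)
    return between / total
```

For example, rotating a 4-variable, 2-factor loadings matrix with `varimax` never decreases the criterion, and `correlation_ratio` returns a value in \([0, 1]\) that approaches 1 when category membership alone determines the factor score.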


Unfortunately, this appears to help with monotonicity in some dimensions, but not in others (e.g. dimension 1).


Baseline annotations


But will a single score really capture disease-relevant information properly? Note also that the accuracy of the annotations may vary.


Checking robustness of iteration


I check how functional cFDR performs when iterating using independent functional data. First, for independent uniform data (although note that here cor(p, q1) = -0.041, which is about how correlated my actual p and q are in the true functional data analysis… perhaps explaining why the P values change comparably to those in my actual analysis…).
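A sketch of this null check, assuming an arbitrary sample size and seed (not those of the real analysis): even genuinely independent uniforms show a small nonzero sample correlation, of order \(1/\sqrt{n}\).

```python
import numpy as np

# Draw p-values and functional data independently; their sample
# correlation is small but nonzero (order 1/sqrt(n)).
rng = np.random.default_rng(1)
n = 1000  # illustrative number of SNPs
p = rng.uniform(size=n)
q1 = rng.uniform(size=n)
print(np.corrcoef(p, q1)[0, 1])
```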


Next, for independent data from various mixture normals (so there is >1 peak):
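A sketch of generating such data, assuming an illustrative two-component mixture (weights, means and standard deviations are made up, not those used in the analysis):

```python
import numpy as np

# Independent functional data from a two-component mixture of normals,
# so the density of q has more than one peak.
rng = np.random.default_rng(7)
n = 1000
comp = rng.random(n) < 0.4  # component membership (weight 0.4 / 0.6)
q = np.where(comp, rng.normal(-2.0, 0.5, n), rng.normal(1.5, 1.0, n))
```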


Summary


Option 1:

  1. Match SNPs to baseline annotations and cell type specific annotations to form functional data table.

  2. Perform PCA/MCA mix on this data.

  3. Use PCA/MCA rotation method and hope this ensures monotonicity.

  4. Use each dimension iteratively as \(q\) (but we need to decide when to stop iterating; the PCA/MCA mix method doesn’t adjust for the percentage-of-inertia-explained problem).

Option 2:

  1. Match SNPs to baseline annotations and cell type specific annotations to form functional data table.

  2. Perform PCA/MCA mix on this data.

  3. Use the resultant coordinates in S-LDSC to obtain \(T_{cj}=\tau_c \times l_j\) where \(\tau_c\) is the effect size estimate for dimension \(c\) and \(l_j\) is the LD score for SNP \(j\).

  4. \(T_{cj}\) should now be monotonic in \(p\) and \(T_c\) can be used as \(q\) in functional cFDR (still need to decide when to stop iterating).
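Step 3 above is just an outer product of per-dimension effect sizes and per-SNP LD scores; a minimal sketch with made-up numbers:

```python
import numpy as np

# Illustrative combination T_cj = tau_c * l_j (all values hypothetical).
tau = np.array([0.2, -0.05, 0.8])    # tau_c: effect size per dimension c
ld = np.array([1.0, 2.5, 0.7, 3.1])  # l_j: LD score per SNP j
T = np.outer(ld, tau)                # T[j, c] = tau_c * l_j
```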

Option 3:

  1. Match SNPs to baseline annotations and cell type specific annotations to form functional data table.

  2. Use S-LDSC to obtain \(T_{c}\) for each column and use this as \(q\) in functional cFDR. But this doesn’t include any dimensionality reduction and would result in far too many iterations.


\[\begin{equation} \begin{split} \widehat{cFDR_{variation}}(p,q) &= P(H_0^p\text{ | }P\leq p, Q=q)\\[2ex] &= \dfrac{P(P\leq p\text{ | }H_0^p, Q=q) P(H_0^p\text{ | }Q=q)}{P(P\leq p \text{ | }Q=q)}\\[2ex] &= \dfrac{P(P\leq p\text{ | }H_0^p) \dfrac{P(Q=q|H_0^p)P(H_0^p)}{P(Q=q)}}{\dfrac{P(P\leq p, Q=q)}{P(Q=q)}}\\[2ex] &\approx \dfrac{p P(Q=q|H_0^p)}{P(P\leq p, Q=q)} \end{split} \end{equation}\]

But this will give values \(\geq 1\) when \(p\,P(Q=q|H_0^p)\geq P(P\leq p, Q=q)\), e.g. where a specific \(q\) value is indicative of a null SNP. Not sure what other problems this will present…
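A toy numeric check of this pathology (all values hypothetical, chosen only to trigger the condition above):

```python
# The approximation p * P(Q=q|H0) / P(P<=p, Q=q) is not bounded by 1.
p = 0.5
dens_q_null = 1.2  # hypothetical density of Q at q under H0^p
joint = 0.4        # hypothetical P(P<=p, Q=q)
cfdr_var = p * dens_q_null / joint
assert cfdr_var > 1  # here 0.5 * 1.2 / 0.4 = 1.5
```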