Sources:

- https://link.springer.com/content/pdf/10.1007/s11634-012-0105-3.pdf
- http://www.math.u-bordeaux.fr/~mchave100p/wordpress/wp-content/uploads/2013/10/ACM-M2.pdf
- ca::mjca and PCAmixdata::PCAmix
Previously, I have been using ca::mjca for MCA; however, the PCAmixdata::PCAmix function offers a varimax-style rotation method. I compare the standard MCA results (no rotation) from the two packages.
In ca::mjca, we can choose between three different MCA approaches (standard MCA on the indicator matrix, CA of the Burt matrix, or “adjusted” MCA, which deals with the percentage of inertia explained problem). PCAmixdata::PCAmix uses a different approach on the indicator matrix, called “the single PCA approach”.
The factor coordinates of the levels are the same (using principal coordinates in ca::mjca), but the factor coordinates of the observations are multiplied by \(\sqrt{p}=\sqrt{19}\) in PCAmixdata::PCAmix. The authors state that “This property has no impact since results are identical to within one multiplier coefficient.” This means that the range of the observation coordinates is \((-0.811, 2.628)\) for ca::mjca and \((-3.534, 11.457)\) for PCAmixdata::PCAmix.
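The \(\sqrt{p}\) relationship between the two sets of observation coordinates can be checked numerically; a minimal sketch using the ranges reported above (with \(p = 19\) variables):

```python
import math

p = 19  # number of qualitative variables in the data set

# Observation-coordinate ranges reported above
mjca_range = (-0.811, 2.628)     # ca::mjca (principal coordinates)
pcamix_range = (-3.534, 11.457)  # PCAmixdata::PCAmix

# Dividing the PCAmix coordinates by sqrt(p) should recover the mjca scale
rescaled = tuple(x / math.sqrt(p) for x in pcamix_range)
print(rescaled)  # approximately (-0.811, 2.628)
```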
PCAmixdata::PCAmix does not provide the adjustment for the percentage of inertia explained problem. This means that the percentage of inertia explained by each dimension when using PCAmixdata::PCAmix is under-estimated, which will impact our stopping rule for the iterations.
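For reference, the classical fix is Benzécri's adjustment, which re-expresses each indicator-matrix eigenvalue above \(1/Q\) (where \(Q\) is the number of qualitative variables) before computing percentages of inertia. A minimal sketch with invented eigenvalues (this is the textbook formula, not something PCAmixdata provides):

```python
import numpy as np

def benzecri_adjust(eigvals, Q):
    """Benzecri-adjusted inertias for indicator-matrix MCA.

    eigvals : principal eigenvalues of the indicator-matrix analysis
    Q       : number of qualitative variables
    Only eigenvalues > 1/Q are adjusted; the rest are set to zero.
    """
    lam = np.asarray(eigvals, dtype=float)
    return np.where(lam > 1 / Q, ((Q / (Q - 1)) * (lam - 1 / Q)) ** 2, 0.0)

# Toy example with Q = 4 variables (eigenvalues are made up)
adj = benzecri_adjust([0.5, 0.3, 0.25, 0.1], Q=4)
pct = 100 * adj / adj.sum()  # adjusted percentages of inertia
```

The adjusted percentages concentrate the explained inertia in the leading dimensions, which is what a stopping rule for the iterations would act on.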
I consider orthogonal rotation methods to obtain simpler loadings (these need to be orthogonal rather than oblique due to our iteration procedure).
The varimax rotation (Kaiser 1958) is common in the PCA field. It maximises the sum of the variances of the squared loadings (the squared correlations between variables and factors) to obtain a simpler structure in the loadings matrix (that is, each factor picks out some variables with very high loadings and others with negligible loadings).
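As a reference point, the Kaiser varimax iteration can be sketched in a few lines; this is the generic SVD-based algorithm applied to a made-up loadings matrix, not the PCAmixdata implementation:

```python
import numpy as np

def varimax(L, gamma=1.0, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a p x k loadings matrix L."""
    p, k = L.shape
    R = np.eye(k)
    crit = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        # SVD of the gradient of the varimax criterion (Kaiser's normal equations)
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - (gamma / p) * Lr * (Lr ** 2).sum(axis=0))
        )
        R = u @ vt
        if s.sum() < crit * (1 + tol):  # stop once the criterion plateaus
            break
        crit = s.sum()
    return L @ R, R

# Toy loadings: two blocks of variables, each mostly on one factor
L = np.array([[0.8, 0.4], [0.7, 0.5], [0.2, 0.8], [0.3, 0.9]])
L_rot, R = varimax(L)
```

The rotation matrix R is orthogonal, so the fitted subspace is unchanged; only the axes move towards simple structure.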
In MCA, however, where the variables are qualitative, we cannot compute correlations between variables and factors to optimise in the varimax procedure. An alternative measure for qualitative variables was suggested by Kiers (1991): the “contribution of a component to the inertia of a variable that is accounted for”, or in other words, the squared correlation between an optimally quantified variable and a factor (i.e. the correlation ratio).
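The correlation ratio \(\eta^2\) between a qualitative variable and a set of factor scores is just the between-level share of the total variance; a minimal sketch (toy data):

```python
import numpy as np

def correlation_ratio(levels, scores):
    """eta^2: between-level sum of squares / total sum of squares."""
    levels = np.asarray(levels)
    scores = np.asarray(scores, dtype=float)
    grand = scores.mean()
    total = ((scores - grand) ** 2).sum()
    between = sum(
        scores[levels == g].size * (scores[levels == g].mean() - grand) ** 2
        for g in np.unique(levels)
    )
    return between / total

# A variable whose levels perfectly separate the scores gives eta^2 = 1
levels = np.array(["a", "a", "b", "b"])
eta2 = correlation_ratio(levels, [1.0, 1.0, 3.0, 3.0])
print(eta2)  # -> 1.0
```

It is these \(\eta^2\) values (rather than squared correlations) that enter the varimax criterion for the qualitative variables.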
The R package PCAmixdata is able to perform PCA/MCA for mixed data with the relevant varimax procedure (optimising squared correlations for quantitative variables in PCA and correlation ratios for qualitative variables in MCA).
Unfortunately, this appears to help with monotonicity in some dimensions, but not in others (e.g. dimension 1).
I’d like to have some baseline annotations to include in all implementations of our method. It seems that the “gold standard” is the 75-functional-annotation model by Gazal et al., which is an extension of the 53-annotation S-LDSC model. The data is available for download here.
I’ve looked at matching SNPs on chromosome 22 to their 75 functional annotations, and the relationship (at least for the first dimension) does seem monotonic. However, the variability is spread over lots of dimensions and the method is not robust enough to iterate over all of these.
It would be nice to use some pre-computed “scores” for each SNP. Some possible ideas:
Predicted \(\chi^2\) statistics from FINDOR. It looks like I’d have to calculate these myself using LDSC, so this may be limited to GWAS with large sample sizes: “As a rule of thumb, LD Score regression tends to yield very noisy results when applied to datasets with fewer than ~5k samples, even for univariate h2 estimation. One needs even larger sample sizes for asking more complicated questions, e.g., partitioned h2. If your sample size is so small that standard error is the limiting factor, but you happen to have individual genotype data, consider using REML (e.g., as implemented in GCTA).”
GenoCanyon scores, which measure the functional potential of genomic locations (instead of pathogenicity, like other methods, e.g. FS scores) using “a collection of the comparative genomic conservation scores and biochemical signals obtained from the ENCODE project” (but averaging over many cell types… the “general-functionality annotation tool”). This has been extended to GenoSkyline, which gives tissue-specific functional annotation. GenoCanyon scores are used in GenoWAP, a Bayesian functional annotation GWAS prioritisation method that outputs posterior probabilities.
GWAVA (genome wide annotation of variants) “a tool which aims to predict the functional impact of non-coding genetic variants based on a wide range of annotations of non-coding elements (largely from ENCODE/GENCODE), along with genome-wide properties such as evolutionary conservation and GC-content”.
deltaSVM scores, suggested by Karl. These use a classifier (gkm-SVM) that encodes cell type-specific regulatory sequence vocabularies; the induced change in the gkm-SVM score, deltaSVM, quantifies the effect of a variant. These are listed for 3,000 autoimmune-related SNPs in the paper, but I can’t find the scores for all SNPs.
But will a single score really capture disease-relevant information properly? Also, the accuracy of the annotations may vary.
I check how functional cFDR performs when iterating using independent functional data. First, for independent uniform data (although note that here \(cor(p, q_1) = -0.041\), which is about how correlated my actual p and q are in the true functional data analysis… perhaps explaining why the p-values change comparably to my actual analysis…).
Next, for independent data from various mixture normals (so there is >1 peak):
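A sketch of how this kind of check can be set up: uniform p independent of a two-peak mixture-normal q (the mixture weights, means and sample size are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

p = rng.uniform(size=n)  # null-like p-values, independent of q

# q from a two-component normal mixture, so its density has two peaks
comp = rng.random(n) < 0.5
q = np.where(comp, rng.normal(-2.0, 1.0, n), rng.normal(2.0, 1.0, n))

r = np.corrcoef(p, q)[0, 1]  # sample correlation; should be close to zero
```

Since p and q are generated independently, any observed correlation is pure sampling noise, which mimics the near-zero cor(p, q) seen in the real analysis.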
Our functional cFDR method is working but we now need to consider the auxiliary functional data.
We need some K-dimensional summary of the functional data such that each dimension \(k\) is (i) independent of the others, (ii) monotonic in \(p\), and (iii) \(K\) is as small as possible (to save on iterations).
Option 1:
Match SNPs to baseline annotations and cell type specific annotations to form functional data table.
Perform PCA/MCA mix on this data.
Use PCA/MCA rotation method and hope this ensures monotonicity.
Use each dimension iteratively as \(q\) (but need to decide when to stop iterating - the PCA/MCA mix method doesn’t adjust for the percentage of inertia explained problem).
Option 2:
Match SNPs to baseline annotations and cell type specific annotations to form functional data table.
Perform PCA/MCA mix on this data.
Use the resultant coordinates in S-LDSC to obtain \(T_{cj}=\tau_c \cdot l_j\), where \(\tau_c\) is the effect size estimate for dimension \(c\) and \(l_j\) is the LD score for SNP \(j\).
\(T_{cj}\) should now be monotonic in \(p\) and \(T_c\) can be used as \(q\) in functional cFDR (still need to decide when to stop iterating).
Option 3:
Match SNPs to baseline annotations and cell type specific annotations to form functional data table.
Use S-LDSC to obtain \(T_{c}\) for each column and use this as \(q\) in functional cFDR. But this doesn’t include any dimensionality reduction and would result in way too many iterations.
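For Option 2, the per-dimension scores are just an outer product once S-LDSC has been run; a minimal sketch in which tau and l are invented stand-ins for the S-LDSC outputs:

```python
import numpy as np

# Hypothetical S-LDSC outputs: tau[c] is the effect-size estimate for
# dimension c, and l[j] is the LD score for SNP j (as in Option 2 above)
tau = np.array([0.8, -0.3, 0.1])    # C = 3 dimensions
l = np.array([1.2, 0.9, 2.1, 1.5])  # J = 4 SNPs

# T[c, j] = tau_c * l_j ; each row T[c] is a candidate q for functional cFDR
T = np.outer(tau, l)
```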
My application will be asthma data (Demenais European data (N0 = 107,715 + N1 = 19,954 = 127,669) on 2 million SNPs). I can then compare this with UKBB data (N0 = 319,207 + N1 = 41,934 = 361,141). Note, the data has a P column and a european_ancestry_pval_rand column; I assume that the latter is from the random effects model (collating data from 56 European studies) and that this is the one I should use?
I would like to prune SNPs to apply our method on independent signals only, but I’m not sure how to do this. In FINDOR, “We conservatively define independent loci using PLINK’s LD-clumping algorithm with a 5 Mb window and an \(r^2\) threshold of 0.01.” But then I think this relies on the index SNP having all the interesting annotations?
One way to completely avoid the problematic monotonicity requirement is to use \(P(H_0|P<p, Q=q)\) instead:
\[\begin{equation} \begin{split} \widehat{cFDR_{variation}}(p,q) &= P(H_0^p\text{ | }P\leq p, Q=q)\\[2ex] &= \dfrac{P(P\leq p\text{ | }H_0^p, Q=q) P(H_0^p\text{ | }Q=q)}{P(P\leq p \text{ | }Q=q)}\\[2ex] &= \dfrac{P(P\leq p\text{ | }H_0^p) \dfrac{P(Q=q|H_0^p)P(H_0^p)}{P(Q=q)}}{\dfrac{P(P\leq p, Q=q)}{P(Q=q)}}\\[2ex] &\approx \dfrac{p P(Q=q|H_0^p)}{P(P\leq p, Q=q)} \end{split} \end{equation}\]
But this will give values \(\geq 1\) when \(P(Q=q|H_0^p)\geq P(P\leq p, Q=q)\), e.g. where a specific \(q\) value is indicative of a null SNP. Not sure what other problems this will present…
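A toy example with a discrete q makes the \(\geq 1\) problem concrete: if a given q level is common under the null but rarely co-occurs with small p, the estimator exceeds 1. A sketch with made-up probabilities:

```python
# Toy probabilities for one level of a discrete auxiliary covariate Q
# (all numbers invented for illustration)
p = 0.1             # p-value threshold
P_q_given_H0 = 0.9  # P(Q = q | H0): this q level is typical of null SNPs
P_joint = 0.05      # P(P <= p, Q = q): small p rarely co-occurs with this q

cfdr_variation = p * P_q_given_H0 / P_joint
print(cfdr_variation)  # > 1, so not a valid probability
```

Here \(P(Q=q\,|\,H_0^p) \geq P(P\leq p, Q=q)\), exactly the situation described above, and the estimate falls outside \([0, 1]\).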