From my MCA analysis:
This implies that we should be iterating over several dimensions.
Iteration 1: Functional cFDR on the original p values and -q1 (to ensure that low q is enriched for low p), to obtain v1.
Iteration 2: Functional cFDR in a two-step approach to deal with both big positive and big negative q2 being enriched for small p: (1) (v1, -q2)=v2_tmp; (2) (v2_tmp, q2)=v2. Note that cor(v2, q2)=0.00802974 here, so this method wouldn’t be selected automatically.
Iteration 3: Functional cFDR on v2 and -q3 (to ensure that low q is enriched for low p), to obtain v3.
Iteration 4: Functional cFDR on v3 and q4, to obtain v4.
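A minimal sketch of this iteration scheme, assuming a hypothetical wrapper `functional_cfdr(p, q)` that returns the v values (a stand-in for the actual functional cFDR routine; the sign flips follow the correlation checks reported further down):

```r
# functional_cfdr() is a hypothetical stand-in for the functional
# cFDR routine: it takes principal p values and auxiliary q values
# and returns v values.

# Iteration 1: flip q1 so that low q is enriched for low p
v1 <- functional_cfdr(res$p, -res$q1)

# Iteration 2: two-step approach because both big positive and big
# negative q2 are enriched for small p
v2_tmp <- functional_cfdr(v1, -res$q2)
v2     <- functional_cfdr(v2_tmp, res$q2)

# Iteration 3: flip q3, as for q1
v3 <- functional_cfdr(v2, -res$q3)

# Iteration 4: q4 is already in the right orientation
v4 <- functional_cfdr(v3, res$q4)
```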
Below, I plot V1,…,V4 against the original p values:
If I instead plot V1,…,V4 against the values used for the principal trait in that iteration:
And again on the -log10 scale (axes have been truncated).
These results suggest that something is going wrong for the 4th iteration…
In the 4th iteration [(v3, q4)=v4], the cFDR curves for low q have very low maximum values at \(Z_P=0\) (\(p=1\)).
This means that when we interpolate at each (p, q) pair, SNPs with very low \(q4\) values can never attain high `ccut` values. This may be ok, since we want to reduce the v values for SNPs with low q4, but do we really want to reduce them this much?
I now turn to investigating the skew towards smaller v at approximately \(p \lesssim 10^{-2.5}\). If we look at the L curves on the Z scale, then the spike outwards is responsible for the skew at \(p \lesssim 10^{-2.5}\) in the plot above (it corresponds to a spike inwards on the \(p\) value scale, hence a smaller area and a smaller v value).
This spike outwards occurs around \(q4 < -2.1\) (note min(q[indices]) = -2.130502), which is exactly where the cFDR curves do not reach very high (see below). Moreover, the slopes of the curves are less steep there, so that when we read off the `ccut` (cFDR) values for these (p, q) pairs with \(p \lesssim 10^{-2.5}\), we get big zp values (corresponding to small p) and hence the bulge in the L curve.
In iteration 2, we have a problem whereby several \((p, q)\) pairs are given the same \(v\) value, even though their \(p\) values may differ substantially. I call this the “slicing problem”. It occurs specifically when we iterate over \(q2\) for the second time (for both orders: (i) \(q2\) then \(-q2\), and (ii) \(-q2\) then \(q2\)).
Let’s consider a walkthrough example using two \((p,q)\) pairs:
\((p=0.88, q=0.22)\)
\((p=0.38, q=0.25)\)
Even though their \(q\) values are similar, their \(p\) values are very different so we would expect different \(v\) values.
If we look at the cFDR curves for \(q\) values similar to those of our indices (which will therefore be interpolated to find the cfdr (= `ccut`) values), we find that these curves level off very quickly. The vertical lines represent the \(p\) values for our indices, and thus both indices get very similar cfdr (= `ccut`) values.
```r
ccut
## [1] 1.243140 1.244626
```
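A toy illustration of why this happens, using a made-up, near-flat cFDR curve (none of these numbers come from the real analysis):

```r
# Made-up cfdr curve on the zp scale that levels off almost immediately
zp_grid   <- seq(0, 5, length.out = 100)
cfdr_flat <- 1.25 - 0.002 * exp(-zp_grid)

# zp values for our two indices, p = 0.88 and p = 0.38
zp <- -qnorm(c(0.88, 0.38) / 2)

# Interpolating a near-flat curve gives near-identical ccut values,
# despite the very different p values: the slicing problem.
approx(zp_grid, cfdr_flat, xout = zp)$y
```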
We then find the contours of the cFDR curves at these values to obtain our L curves.
The L curves have this bend inwards where our cFDR curves do not reach 1 at \(p=1\) (discussed in more detail in last week’s summary).
This problem isn’t fixed by increasing any of the resolution parameters.
To ensure that small q is enriched for small p, I check the correlations and flip the sign of the auxiliary variable if necessary. Using this method, the \(q\) values are flipped for dimensions 1 and 3. But should I be checking the correlation between the auxiliary variable and the original p values, or between the auxiliary variable and the values used for the principal trait in that iteration? [It doesn’t matter at the moment as they give the same results, but it may matter in the future].
I also need to think of a way to handle enrichment of small p at both strongly negative and strongly positive q (as in dimension 2); one possible automation is sketched below.
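A sketch of how both checks could be automated (the 0.01 threshold is an arbitrary placeholder, not a calibrated value):

```r
# Orient q so that small q is enriched for small p, and flag q's
# whose correlation with p is too weak to trust the sign: these are
# candidates for the two-step q/-q approach used in iteration 2.
orient_q <- function(p, q, thr = 0.01) {
  rho <- cor(p, q, method = "spearman")
  if (abs(rho) < thr) {
    list(q = q, two_step = TRUE, rho = rho)    # possibly non-monotonic
  } else if (rho < 0) {
    list(q = -q, two_step = FALSE, rho = rho)  # flip so low q ~ low p
  } else {
    list(q = q, two_step = FALSE, rho = rho)
  }
}
```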
```r
# against original p
cor(res$p, res$q1, method = "spearman")  # -0.08305887 so flip
cor(res$p, res$q2, method = "spearman")  #  0.00802974 so don't flip
cor(res$p, res$q3, method = "spearman")  # -0.04093093 so flip
cor(res$p, res$q4, method = "spearman")  #  0.001665016 so don't flip

# against p used in that iteration
cor(res$p, res$q1, method = "spearman")   # -0.08305887 so flip
cor(res$v1, res$q2, method = "spearman")  #  0.04610008 so don't flip
cor(res$v2, res$q3, method = "spearman")  # -0.1005287 so flip
cor(res$v3, res$q4, method = "spearman")  #  0.0623231 so don't flip
```
At the moment, I am including SNPs that fall within any annotated region (even if that annotation is “Low Confidence”).
Perhaps I should remove SNPs that fall in Low Confidence regions? If so, I would lose ~10% of the SNPs. Moreover, the distribution of the \(p\) values for the SNPs in low confidence regions more or less matches the general distribution (over all SNPs).
How do the MCA results vary if I exclude these SNPs? Not much.
If I compare the percentage of inertia explained by the two approaches (including versus excluding the low confidence SNPs), the results look similar.
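The sensitivity check is simple to run; a sketch, assuming a hypothetical `confidence` column recording each SNP’s annotation confidence:

```r
# Drop SNPs in "Low Confidence" regions (~10% of SNPs) and compare
# the p value distributions with and without them
res_hc <- res[res$confidence != "Low Confidence", ]

summary(res$p[res$confidence == "Low Confidence"])  # low-confidence SNPs
summary(res$p)                                      # all SNPs
```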
Side note: In my analysis I am using the default `lambda="adjusted"` option for the MCA. This helps to deal with the over-estimated total inertia problem, which arises due to the structure of the Burt matrix (the diagonal sub-matrices are cross-tabulations of each variable with itself). The adjustment rescales the resultant co-ordinates so that the off-diagonal sub-matrices are optimally fitted.
This method also returns results for 18 dimensions. But why 18? “the eigenvalues are given only for those dimensions \(k\), where the singular values from the Burt matrix \(\lambda_k\) (i.e. the principal inertias of the indicator matrix) satisfy \(\lambda_k>1/Q\).” In my analysis \(Q=19\).
See [http://www.econ.upf.edu/~michael/vienna/CARME5_BW.pdf]; [https://www.jstatsoft.org/article/view/v020i03].
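For reference, the call I am describing looks like this (assuming `mjca()` from the `ca` package cited above, where `lambda = "adjusted"` is the default; `annot_df` is a placeholder for the data frame of categorical annotations):

```r
library(ca)

# MCA on the annotation data with the adjusted-inertia correction
mca_res <- mjca(annot_df, lambda = "adjusted")

summary(mca_res)  # percentages of inertia per dimension; only the
                  # dimensions with lambda_k > 1/Q are reported
```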
Comments and questions
I should probably include some “baseline annotations” (e.g. synonymous/non-synonymous) in the analysis. This would help in e.g. the region.
Should I be checking the enrichment of small q for small p against the original p values, or against whatever is used as p in that iteration? At the moment they give the same sign so it doesn’t matter, but it may in the future.
Interesting that the sign is flipped for dimension 3… I assume it’s because cons heterochromatin creeps in with promoter.
What should we do for non-monotonic relationships between p and q? Perhaps have some cut-off for the correlation value, below which we perform the method on q and then -q (but then we need to decide on the order)… (cf. the `orient_q` sketch above).
Should I exclude low confidence annotations? Doing so means I lose ~10% of my SNPs.
In the second iteration, there are a few instances where the resultant \(v\) values are ever so slightly greater than 1 \((= 1 + 2.220446 \times 10^{-16}\), which is one machine epsilon). This occurs for values of \(q2\) where the cFDR curves extend the furthest above 1 (shown between the vertical dashed lines below).
This means that when we find the contours for the L curve, there is a slight bend inwards in this region. This is because the curves extend (a lot, relatively) above 1, so that the \(p\) values we read off are smaller than expected (because the curve does not reach 1 by \(p=1\)).
When integrating over this region, due to the integration errors in the `polyCub` function that we encountered last week (where we found `polyCub` was returning bigger values for smaller regions), the integral is bigger than that over the full region. This isn’t too much of a problem because I can just set these values to exactly 1, as below.
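The fix is a one-liner (2.220446e-16 is `.Machine$double.eps`, so these overshoots are at floating-point precision):

```r
# Clamp v values that overshoot 1 by machine epsilon back to exactly 1
res$v2 <- pmin(res$v2, 1)
```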