1. Comparing V values across cell types



Notice that the difference between the V and P values varies between the \(q=0\) and \(q=1\) lines for each cell type (e.g. V values for variants without the annotation in pancreatic islet cells are much closer to the P values than those for variants with the annotation). To explore this further, I calculate the distance to the \(y=x\) line for each \(q=0\) and \(q=1\) line in each cell type. This shows that V values for \(q=0\) SNPs in pancreatic islet cells are the most similar to the original P values, whereas V values for \(q=1\) SNPs in CD127 Treg cells are the least similar to the original P values (i.e. uprated the most).
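The distance calculation is just the perpendicular distance from each point \((p, v)\) to the identity line, \((v-p)/\sqrt{2}\). A minimal Python sketch (a stand-in for the R analysis; `p_vals` and `v_vals` are made-up toy arrays, not the real data):

```python
import numpy as np

def signed_dist_to_identity(p, v):
    """Signed perpendicular distance from points (p, v) to the line y = x.

    Negative when v < p (i.e. the V value has been uprated relative to P),
    positive when v > p.
    """
    return (v - p) / np.sqrt(2)

# toy example: V values sitting below their P values (uprated)
p_vals = np.array([0.01, 0.05, 0.20])
v_vals = np.array([0.005, 0.02, 0.15])
mean_dist = signed_dist_to_identity(p_vals, v_vals).mean()
```

Averaging the signed distances per \((q, \text{cell type})\) line gives one summary number per line, which is what the comparison above ranks.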


2. Iterating over cell types


Find V2 for thymus cells

  • I investigate what happens to the V values under the maximum possible correlation, by iterating over the annotations in thymus cells twice.

  • I find that:

    • The V2 values are upweighted/downweighted even more strongly according to the annotation.

    • For larger P values, the solution to \(\hat{h}(p_0,q=0)=\hat{h}(p_1,q=1)\) gives \(p_1>1\) (in which case my code forces \(p_1=1\)) far more often, seen as the change in slope of the red line beyond \(P\approx 0.1\).

  • Before iterating over all cell types (which are likely to be highly correlated), I investigate the structure of the annotations across the various cell types to see whether we can reduce the dimensionality of the space we iterate over.
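The clamping behaviour described above can be sketched as follows (a toy Python illustration; `toy_hhat` is a hypothetical stand-in for the estimated \(\hat{h}\), assumed monotone increasing in \(p\) for fixed \(q\)):

```python
import numpy as np

def solve_p1_clamped(hhat, p0, tol=1e-10):
    """Solve hhat(p1, q=1) = hhat(p0, q=0) for p1 by bisection on [0, 1].

    If the target exceeds hhat(1, q=1), the true solution would lie
    above 1, so p1 is clamped to 1 -- the behaviour described above
    for larger P values.
    """
    target = hhat(p0, 0)
    if hhat(1.0, 1) <= target:
        return 1.0  # clamp: no root in [0, 1]
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if hhat(mid, 1) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# toy hhat where the q=1 curve grows half as fast, so the (unclamped)
# solution would be p1 = 2 * p0, exceeding 1 whenever p0 > 0.5
toy_hhat = lambda p, q: 0.5 * p if q == 1 else p
```

With this toy \(\hat{h}\), `solve_p1_clamped(toy_hhat, 0.2)` returns \(0.4\), while any \(p_0 > 0.5\) hits the clamp and returns exactly 1, which is what produces the slope change in the plot.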

3. Investigating structure in binary annotation data (123000*19)



PCA

“Compose m features from the available feature space that gives us maximum variance. Note that we want to compose and not just select m features as it is”, using eigenvectors (the directions of variation) and eigenvalues (how much variation lies in each direction).

  • I opt to use the pcaMethods::pca function as this works with missing values (uses “PCA by non-linear iterative partial least squares”).

  • I specify cv = "q2" to find the cross-validated version of \(R^2\) (\(Q^2\)), which is interpreted as the ratio of variance that can be predicted independently by the PCA model (a low \(Q^2\) implies that the PCA model only describes noise and is unrelated to the true data structure).

  • I fit 10 PCs to begin with and look at the scree plot to decide how many PCs to use.
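The reason NIPALS copes with missing values is that each update step is a least-squares regression restricted to the observed entries only. A minimal one-component numpy sketch of the idea (not the pcaMethods implementation):

```python
import numpy as np

def nipals_pc1(X, max_iter=500, tol=1e-9):
    """First principal component via NIPALS, tolerating NaNs in X.

    Each update is a least-squares regression over observed entries
    only, which is how NIPALS-style PCA handles missing values.
    Returns (scores t, loadings p).
    """
    mask = ~np.isnan(X)
    Xf = np.where(mask, X, 0.0)          # zero-filled copy for the sums
    t = Xf[:, 0].copy()                  # initialise scores from a column
    for _ in range(max_iter):
        # loadings: regress each column of X on t over observed entries
        p = (Xf * t[:, None]).sum(axis=0) / (mask * t[:, None] ** 2).sum(axis=0)
        p /= np.linalg.norm(p)
        # scores: regress each row of X on p over observed entries
        t_new = (Xf * p[None, :]).sum(axis=1) / (mask * p[None, :] ** 2).sum(axis=1)
        if np.linalg.norm(t_new - t) < tol * max(np.linalg.norm(t), 1.0):
            t = t_new
            break
        t = t_new
    return t, p
```

On a rank-1 matrix with a missing entry, the component recovers the data exactly at the observed positions; subsequent components are extracted by deflating (subtracting \(t p^\top\)) and repeating.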


4. Investigating structure in full annotation data (123000*130)

Reducing the dimension

Ideas:

  1. Use the transformations from PCA

  2. Identify the most relevant cell subtype for each cell type

  3. Take the mean across similar cell types


5. Use only those SNPs with PPs (16000*130)


For now, I limit the analysis to those SNPs for which I have a PP (~16,000 of these). Note that I can now use the standard prcomp function as there are no missing values. Again, it seems that there is structure in the data.

I decide not to centre and scale the data because it is all on the same scale (all binary).

Focusing on the unstandardised results, the following plot shows what percent of variance has been explained for each number of PCs.
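The percent-variance curve comes directly from the singular values of the uncentred matrix. A Python sketch, using a made-up toy binary matrix in place of the real 16,000 x 130 annotation matrix (uncentred and unscaled, matching the decision above; the equivalent of prcomp(X, center = FALSE, scale. = FALSE)):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy stand-in for the 16000 x 130 binary annotation matrix
X = (rng.random((200, 12)) < 0.3).astype(float)

# uncentred PCA: squared singular values are proportional to the
# variance captured by each PC
s = np.linalg.svd(X, compute_uv=False)
var_explained = s**2 / (s**2).sum()
cum_var = np.cumsum(var_explained)

# smallest number of PCs explaining, say, 90% of the variance
n_pcs = int(np.searchsorted(cum_var, 0.90) + 1)
```

Note that with no centring the first PC largely tracks the column means, which is worth keeping in mind when reading the scree plot.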

To help decide how many PCs to use, I run a multivariate quantile regression of PP on the PCs and assess the model fit.

I compare the model AICs as the number of PCs increases. It seems that 5-7 components will suffice.


I compare the results of PCA using all 123,000 SNPs (see section 3) with those using just the SNPs from the T1D analysis (~16,000 of these, see section 5).


Comments and queries