I have used the cFDR approach to calculate V values using genomic segmentation annotations for each of the 19 cell types.
James’ research suggests that the V values should be similar to the P values for irrelevant cell types. To quantify this, I calculate the difference between the \(q=0\) and the \(q=1\) lines on the V vs P value plot. We’d expect this distance to be smallest for the brain cell type (which I am using as the control). Instead, it seems that the area between the lines is smallest for the pancreatic islet cell type, closely followed by brain. This analysis also suggests that active marks in the CD3_cord cell type (CD3_PRIMARY_CELLS_CORD_BI) are the most informative, as the distance between the CD3_cord lines is the biggest.
Below, ‘ABL’ stands for area between \(q=0\) and \(q=1\) lines by cell type.
Notice that the difference between V and P values varies between the \(q=0\) and \(q=1\) lines for each cell type (e.g. V values for variants without the annotation in pancreatic islet cells are much more similar to the P values than for those variants with the annotation). To explore this further, I calculate the distance to the \(y=x\) line for the \(q=0\) and \(q=1\) lines in each cell type. This shows that V values for \(q=0\) SNPs in pancreatic islet cells are the most similar to the original P values, whereas V values for \(q=1\) SNPs in CD127 Treg cells are the least similar to the original P values (i.e. uprated the most).
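As a rough sketch of how ABL and the distance to \(y=x\) could be computed with trapezoidal integration (p_grid, v0_line and v1_line are hypothetical objects holding a common grid of P values and the corresponding V values along the \(q=0\) and \(q=1\) lines; this is a sketch, not the exact code used):

```r
# trapezoidal rule for the area under |y| over grid x
trap <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

abl <- trap(p_grid, abs(v1_line - v0_line))  # ABL: area between q=1 and q=0 lines
d0  <- trap(p_grid, abs(v0_line - p_grid))   # distance of the q=0 line from y = x
d1  <- trap(p_grid, abs(v1_line - p_grid))   # distance of the q=1 line from y = x
```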
The idea was to iteratively calculate V across all cell types (James’ research shows that the order in which we iterate doesn’t matter) to ultimately find FDR-adjusted P values incorporating genome segmentation annotations across different cell types.
The problem is that our cell types are correlated and we’re unsure what will happen…
I investigate what happens to the V values under the strongest possible correlation (iterating over the annotations in thymus cells twice).
I find that:
The second-iteration V values (V2) are upweighted/downweighted even more strongly according to the annotation.
For larger P values, we far more often hit the case where the solution to \(\hat{h}(p_0,q=0)=\hat{h}(p_1,q=1)\) has \(p_1>1\) (and \(p_1\) is therefore forced to equal 1 in my code), seen as the change in slope of the red line for \(P \gtrsim 0.1\).
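For illustration, a minimal sketch of that clamping step, assuming hhat is a stand-in for the estimated cFDR curve, monotone increasing in p (hypothetical helper, not the real package internals):

```r
# solve hhat(p1, q = 1) = hhat(p0, q = 0) for p1, forcing p1 = 1 when the
# solution would lie above 1
solve_p1 <- function(p0, hhat) {
  target <- hhat(p0, 0)
  if (hhat(1, 1) < target) return(1)  # no solution in (0, 1]: clamp to 1
  uniroot(function(p1) hhat(p1, 1) - target,
          interval = c(.Machine$double.xmin, 1))$root
}
```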
I begin by looking for structure in the binary annotation data for each cell type (that is, a binary indicator of whether each SNP falls in an active or inactive genomic segment).
However, some of the SNPs are encoded as NA (if they were classified “Low Confidence” or “Unclassified” by Segway).
Note: do not recode NAs as another value, as this introduces bias to the means and to the estimates of the variance, and thus to the correlations.
I find that the binary annotations across cell types are highly correlated.
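One way to compute these correlations without recoding the NAs is to use pairwise-complete observations (ann_mat is a hypothetical SNP-by-cell-type binary annotation matrix; corrplot is one plotting option among several):

```r
# NAs are dropped pair-by-pair rather than recoded, avoiding the bias above
cor_mat <- cor(ann_mat, use = "pairwise.complete.obs")
corrplot::corrplot(cor_mat, order = "hclust")  # cluster similar cell types together
```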
“Compose m features from the available feature space that give us maximum variance. Note that we want to compose, and not just select, m features as they are”, using eigenvectors (directions of variation) and eigenvalues (how much variation in each direction).
I opt to use the pcaMethods::pca function as this works with missing values (it uses “PCA by non-linear iterative partial least squares”).
I specify cv = "q2" to find the cross-validated version of \(R^2\) (\(Q^2\)), which is interpreted as the ratio of variance that can be predicted independently by the PCA model (a low \(Q^2\) implies that the PCA model only describes noise and that the model is unrelated to the true data structure).
I fit 10 PCs to begin with and look at the scree plot to decide how many PCs to use.
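A sketch of the call (ann_mat is again the hypothetical binary annotation matrix with NAs):

```r
library(pcaMethods)

# NIPALS PCA tolerates missing values; cv = "q2" adds cross-validated Q2
res <- pca(ann_mat, method = "nipals", nPcs = 10, cv = "q2")
plot(res)  # scree-style plot of R^2 (and Q^2) per component
```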
Rather than converting the genome segment annotations to binary quantities of active/inactive region (and therefore losing resolution and having to deal with missing data), I use the original genome annotation that each SNP falls in, removing any SNPs not allocated to an annotation.
This means that my new binary data frame has 130 columns (one for each annotation in each cell type) and ~123,000 rows (one per SNP).
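A sketch of the one-hot encoding (seg is a hypothetical data frame with one factor column per cell type giving the segment annotation of each SNP):

```r
seg <- seg[complete.cases(seg), ]  # drop SNPs not allocated to an annotation

# full indicator (0/1) columns for every annotation level in every cell type
# (column names may need tidying afterwards)
bin <- do.call(cbind, lapply(seg, function(f) model.matrix(~ f - 1)))
dim(bin)  # ~123,000 rows x 130 columns
```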
The correlation plot below shows that there is substantial correlation among some variables.
Note that I can use the standard prcomp function here as there are no missing values.

Ideas:
Use the transformations from PCA
Identify the most relevant sub-cell type for each cell type
Take the mean across similar cell types
For now, I limit the analysis to those SNPs for which I have a PP (~16,000 of these). Note that I can again use the standard prcomp function as there are no missing values. Again, it seems that there is structure in the data.
I decide not to centre and scale the data because it is all on the same scale (all binary).
Focusing on the unstandardised results, the following plot shows the percentage of variance explained for each number of PCs.
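A sketch (bin_t1d is a hypothetical 0/1 matrix restricted to the ~16,000 T1D SNPs):

```r
# PCA without centring or scaling, since the data are all binary
pc <- prcomp(bin_t1d, center = FALSE, scale. = FALSE)

# proportion of variance explained by the first 10 PCs
summary(pc)$importance["Proportion of Variance", 1:10]
```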
To help decide which PCs to use, I run a multivariate quantile regression on the PPs (regressing PP on increasing numbers of PCs and assessing model fit). I compare the model AICs as the number of PCs increases; it seems that 5-7 components will suffice.
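A sketch of the model comparison using quantreg (pp and scores are hypothetical names for the vector of PPs and the matrix of PC scores for the same SNPs):

```r
library(quantreg)

# median regression of PP on the first k PCs, for increasing k
aics <- sapply(1:10, function(k) {
  AIC(rq(pp ~ scores[, 1:k, drop = FALSE], tau = 0.5))
})
plot(1:10, aics, type = "b", xlab = "number of PCs", ylab = "AIC")
```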
I compare the results of PCA using all 123,000 SNPs (see section 3) with those using just the SNPs from the T1D analysis (~16,000 of these; see section 4).
Comments and queries