Secret paper

Meeting with him on Friday: https://docs.google.com/document/d/1vIyvrJlLwQR49vzHn-a0XGlQ-4-vVXtfJI5EMSORwPM/edit


Nobel et al. data

My previous analysis of the Nobel et al. data of genome annotations for 167 cell types only considers SNPs on the immunochip that were in the 39 T1D-associated genomic regions that I analysed in my previous project (approximately 17,000 of these).

I've found that the distribution of genomic annotations across the whole genome differs from that across SNPs. Since my research focusses on SNPs, it would be useful to use the distribution of genomic annotations across all immunochip SNPs as the baseline (rather than the whole genome). However, there are approximately 200,000 SNPs on the immunochip, which would take too long to map to the relevant genomic annotation in many cell types. For this reason, I keep the T1D SNPs and add a sample of additional SNPs genotyped on the immunochip, although in the future I should run this for all immunochip SNPs.

Note that in the existing methods, this "empirical distribution of the enrichment under the null hypothesis" is estimated in various ways. For example, GARFIELD uses replication over variants matched by key metrics, whereas GoShifter estimates the null enrichment statistics with a circularised permutation method that accounts for the non-random distribution of genomic annotations with respect to each other and for the correlation between GWAS signals caused by LD.

My baseline SNP file contains annotation information of 19 cell types for 36,278 SNPs on the immunochip (including those in my original T1D analysis).


T1D credible set SNP enrichment

I investigate the enrichment of annotations amongst T1D 95% credible set variants. To do this, I obtain the proportions of annotations across only the T1D 95% credible set variants and divide these by the original proportions (across a sample of immunochip SNPs). The plot is fairly sparse because the annotation must be present in the 95% credible set T1D SNPs (only ~700 of these). Lines above 1 indicate positive enrichment of that annotation in that cell type, and lines below 1 indicate negative enrichment of that annotation in that cell type in 95% credible set T1D SNPs.
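
The ratio described above can be sketched as follows. This is a minimal illustration on made-up toy data, not the code behind the plot: the function name `enrichment_ratios` and the annotation labels are hypothetical, and each SNP is assumed to carry exactly one annotation per cell type.

```python
from collections import Counter

def enrichment_ratios(credible_annotations, baseline_annotations):
    """Proportion of each annotation among credible set SNPs divided by
    its proportion among the baseline SNPs (hypothetical helper)."""
    cred_counts = Counter(credible_annotations)
    base_counts = Counter(baseline_annotations)
    n_cred = len(credible_annotations)
    n_base = len(baseline_annotations)
    ratios = {}
    # Only annotations seen in the credible set get a ratio, which is
    # why a plot of these values can be sparse.
    for annotation, count in cred_counts.items():
        base_count = base_counts.get(annotation, 0)
        if base_count == 0:
            continue  # ratio is undefined without a baseline proportion
        ratios[annotation] = (count / n_cred) / (base_count / n_base)
    return ratios

# Toy data for one cell type (invented for illustration only).
baseline = (["enhancer"] * 10 + ["quiescent"] * 70
            + ["promoter"] * 5 + ["heterochromatin"] * 15)
credible = ["enhancer"] * 4 + ["quiescent"] * 5 + ["promoter"] * 1

ratios = enrichment_ratios(credible, baseline)
for annotation, ratio in sorted(ratios.items()):
    direction = "enriched" if ratio > 1 else "depleted"
    print(f"{annotation}: {ratio:.2f} ({direction})")
```

In the toy data, heterochromatin never appears among the credible set SNPs, so it gets no ratio at all rather than a ratio of zero.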

The results look sensible:

  • Constitutive heterochromatin is negatively enriched in credible set SNPs in most cell types (except pancreatic islets and CD14 and CD19 cells).

  • Facultative heterochromatin is negatively enriched in credible set SNPs in most cell types (except brain and CD14).

  • Enhancers are positively enriched in credible set SNPs in most cell types (except brain).

  • Promoters are positively enriched in credible set SNPs in most cell types for which there is data.

  • Quiescent is negatively enriched in credible set SNPs in most cell types.


This is similar to a section of the prostate cancer paper, which states: "For comparison to the conditional QR approach, we also used Fisher's exact test to examine the representation of individual annotation features across variants included in the 95% credible set of prospective PrCa causal variants relative to variants not selected. Independent tests were conducted for each annotation upon the set of 37,863 tag variants analysed by JAM, of which 343 tags represented the 95% credible set of 3700 SNPs and annotations for all proxy SNPs were inherited by the tag variant."

My approach to copy this:

  1. Form a SNP set consisting of a tag variant for each of the 39 credible sets.

  2. Add binary annotation columns for whether any of the SNPs in the credible set had that annotation (here annotations refer to active/inactive marks in different cell types).

  3. Plot these proportions against those for non credible set SNPs.
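
A minimal sketch of steps 1 to 3 on toy data. The record layout (`credible_set`, `active`) and the function names are hypothetical, chosen only for illustration; the real data will be structured differently.

```python
def collapse_to_tags(snps):
    """One tag row per credible set; a cell type is marked active for the
    tag if ANY SNP in that credible set is active there."""
    tags = {}
    for snp in snps:
        tag = tags.setdefault(snp["credible_set"], {})
        for cell_type, is_active in snp["active"].items():
            tag[cell_type] = tag.get(cell_type, False) or is_active
    return tags

def proportion_active(rows, cell_type):
    """Fraction of rows carrying an active mark in the given cell type."""
    rows = list(rows)
    return sum(row[cell_type] for row in rows) / len(rows)

# Toy credible set SNPs (invented): two credible sets, three SNPs.
credible_snps = [
    {"credible_set": "region1", "active": {"thymus": False, "brain": False}},
    {"credible_set": "region1", "active": {"thymus": True, "brain": False}},
    {"credible_set": "region2", "active": {"thymus": False, "brain": False}},
]
# Toy SNPs outside any credible set, one row per SNP.
other_snps = [
    {"thymus": True, "brain": False},
    {"thymus": False, "brain": False},
    {"thymus": False, "brain": True},
    {"thymus": False, "brain": False},
]

tags = collapse_to_tags(credible_snps)
tag_prop = proportion_active(tags.values(), "thymus")
other_prop = proportion_active(other_snps, "thymus")
print(f"thymus: tags {tag_prop:.2f} vs other SNPs {other_prop:.2f}")
```

Because a tag inherits a mark when any SNP in its credible set has it, the tag proportions are pushed upwards relative to the per-SNP proportions, so the two sets of bars are not compared on an equal footing.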

Note that I don't use Fisher's exact test, as I read that it is best suited to low counts. Some enrichment in credible set SNPs is also expected by construction: the red bars show whether any of the variants in a credible set had an active mark in that cell type, whereas the blue bars show the proportions over all SNPs not in the credible sets.


Logistic Regression

Rather than making inferences manually from the above figure, it would be good to formalise these findings using some statistical methods. To do this, I use penalised logistic regression (Lasso) to see which annotations in which cell types are enriched in 95% credible set SNPs. Note that the data is for the whole sample of immunochip SNPs (not just those in the T1D analysis) and I am using constitutive heterochromatin as the baseline.


Thymus cells

Interpretation: The odds of belonging to a 95% T1D credible set are about 2.2 times as high (\(\exp(0.796413) \approx 2.2\), i.e. roughly a 120% increase in the odds) for SNPs in a promoter region in thymus cells as for SNPs in constitutive heterochromatin.
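
The arithmetic behind this interpretation, as a short sketch. The coefficient is the one quoted above; the helper name `percent_change_in_odds` is made up for illustration. Since the lasso shrinks coefficients towards zero, the resulting odds ratio is probably best read as a conservative estimate.

```python
import math

def percent_change_in_odds(log_odds_coefficient):
    """Convert a logistic regression coefficient (log odds ratio relative
    to the baseline level) into a percentage change in the odds."""
    return (math.exp(log_odds_coefficient) - 1.0) * 100.0

coefficient = 0.796413  # promoter vs constitutive heterochromatin, thymus
odds_ratio = math.exp(coefficient)
percent = percent_change_in_odds(coefficient)
print(f"odds ratio {odds_ratio:.2f}, change in odds {percent:.0f}%")
```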


Pancreatic islet cells


CD8_NAIVE_PRIMARY_CELLS


CD8_MEMORY_PRIMARY_CELLS


CD56_PRIMARY_CELLS


CD4p_CD25p_CD127m_TREG_PRIMARY_CELLS


CD4p_CD25m_TH_PRIMARY_CELLS


CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLS


CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLS


CD4p_CD25m_CD45ROp_MEMORY_PRIMARY_CELLS


CD4p_CD25m_CD45RAp_NAIVE_PRIMARY_CELLS


CD4p_CD25INT_CD127p_TMEM_PRIMARY_CELLS


CD4_NAIVE_PRIMARY_CELLS


CD4_MEMORY_PRIMARY_CELLS


CD3_PRIMARY_CELLS_CORD_BI


CD19_PRIMARY_CELLS_PERIPHERAL_UW


CD14_PRIMARY_CELLS


CD3_PRIMARY_CELLS_PERIPHERAL_UW


BRAIN_INFERIOR_TEMPORAL_LOBE


All cell types


1. Credible set membership

  • I run a lasso regression of a binary indicator of credible set membership on the annotation marks in each cell type. From this, I can see which annotations in which cell types are most important for credible set membership.

  • The baseline factor is constitutive heterochromatin in thymus cells.


2. PP

  • I now do the same but using PP as the response to see which annotations in which cell types are the most helpful for PP. Note that the data set used for the analysis now only contains the T1D analysis SNPs (~17000 as opposed to ~35000 above).


3. -log10(P)


Collapse into active/ inactive

  • I seem to be getting very variable and spurious results. To increase power, I collapse the annotations into a binary active/inactive mark and regress a binary indicator of credible set membership on these marks, to see which cell types are most indicative of credible set membership. To do this, I have to remove the "unsure" annotations (bivalent, unclassified, low confidence).

  • Active: enhancer, transcribed, promoter, reg permissive.

  • Inactive: quiescent, constitutive heterochromatin, facultative heterochromatin.
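
A minimal sketch of this collapsing step. The label strings are assumptions based on the lists above, so the spelling in the actual annotation files may differ; anything outside the two lists (bivalent, unclassified, low confidence) is treated as "unsure" and dropped.

```python
ACTIVE = {"enhancer", "transcribed", "promoter", "reg permissive"}
INACTIVE = {"quiescent", "constitutive heterochromatin",
            "facultative heterochromatin"}

def collapse(annotation):
    """Return 1 for an active mark, 0 for an inactive mark, and None for
    an "unsure" annotation that should be removed."""
    label = annotation.strip().lower()
    if label in ACTIVE:
        return 1
    if label in INACTIVE:
        return 0
    return None

# Toy annotations for five SNPs in one cell type (invented).
annotations = ["Enhancer", "quiescent", "bivalent", "promoter",
               "low confidence"]
collapsed = [collapse(a) for a in annotations]
kept = [mark for mark in collapsed if mark is not None]
print(collapsed)
print(kept)
```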


1. Credible set membership

  • Interpretation: Active marks in thymus cells increase the odds of credible set membership the most.


2. PP

  • I do the same as above but now use PP as the response. Remember that this analysis is only for the SNPs for which I have T1D association statistics (~17000 rather than ~35000 above).


3. -log10(P)

  • Why do active marks in the brain increase -log10(P) the most and active marks in the thymus the least? This is the opposite of what we see for credible set membership.


Quantile regression

  • I run quantile regression (QR) at the 97th and 99th percentiles (approx PP=0.01 and PP=0.05) for just the 95% T1D credible set SNPs. However, this is only 657 SNPs, so only ~20 SNPs lie above the 0.97 quantile and only ~5 above the 0.99 quantile.

I also do this for all the SNPs I have the data for (~17,000) so the 0.97 quantile contains ~500 SNPs and the 0.99 quantile has ~170.

Note that this method literally just draws a line from the Xth PP quantile in the active=0 SNPs and the Xth PP quantile in the active=1 SNPs. Would be more involved to use a continuous predictor.
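
To make this concrete, here is a minimal sketch on toy data showing that, with a single binary predictor, the fitted intercept is a sample quantile of PP among the active=0 SNPs and the slope is the difference to the same quantile among the active=1 SNPs. The function names are made up, and the order-statistic quantile used here is only one valid minimiser of the check loss; quantile regression software may return a slightly different, equally optimal value.

```python
import math

def sample_quantile(values, tau):
    """Smallest order statistic with at least a fraction tau of the
    values at or below it (one minimiser of the check loss)."""
    ordered = sorted(values)
    k = max(math.ceil(tau * len(ordered)), 1)
    return ordered[k - 1]

def binary_quantile_fit(pp, active, tau):
    """Intercept and slope of PP ~ active at quantile tau."""
    q_inactive = sample_quantile(
        [p for p, a in zip(pp, active) if a == 0], tau)
    q_active = sample_quantile(
        [p for p, a in zip(pp, active) if a == 1], tau)
    return q_inactive, q_active - q_inactive

def check_loss(pp, active, intercept, slope, tau):
    """Quantile regression objective: tau-weighted absolute residuals."""
    total = 0.0
    for p, a in zip(pp, active):
        residual = p - (intercept + slope * a)
        total += residual * (tau - (residual < 0))
    return total

# Toy PPs (invented): five inactive SNPs followed by five active SNPs.
pp = [0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
active = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

intercept, slope = binary_quantile_fit(pp, active, tau=0.5)
loss = check_loss(pp, active, intercept, slope, tau=0.5)
print(f"intercept {intercept:.2f}, slope {slope:.2f}, loss {loss:.3f}")
```

Moving the intercept or slope away from these values increases the check loss in the toy example, which is what makes the "line between two quantiles" description accurate for a binary predictor.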


and also for all the SNPs in the 39 T1D-associated regions (note the change in y axis). I can't do this for all the immunochip SNPs as I don't have their association statistics for T1D.


P Values


and for all T1D SNPs from the 39 associated regions.


Summary


1. Logistic Regression

I use logistic regression to look at the relationship of credible set membership/PP/P values with annotations in many cell types. I use lasso to pick out the most relevant annotations.

  • Credible set membership ~ annotations for each cell type individually.

  • Credible set membership ~ annotations in all cell types.

  • PP ~ annotations in all cell types.

  • -log10(P) ~ annotations in all cell types.

To increase power, I also investigate collapsing annotations into binary active/inactive marks. In this analysis, lasso can be used to pick out the most relevant cell type (although note the most relevant cell types are not the same for the different responses).

  • Credible set membership ~ binary active/inactive mark in all cell types.

  • PP ~ binary active/inactive mark in all cell types.

  • -log10(P) ~ binary active/inactive mark in all cell types.


Quantile Regression

Quantile regression offers an alternative approach to investigating relationships between annotations and PPs/P values. I think that a continuous predictor would be better suited to this analysis. Need to go over the prostate cancer paper to figure out how this can be used to reweight PPs.

  • PPs ~ active/inactive chromatin mark -> Find which cell types are important (in terms of annotation increasing/decreasing PP).

  • P values ~ active/inactive chromatin mark -> Find which cell types are important (in terms of annotation increasing/decreasing P value).



Notes:

  • Is there a way to use information on the length of the fragment that the SNP falls in?

  • My logistic regression credible set membership method is similar to GARFIELD, which looks for annotation enrichment in SNPs with P<T (T is some threshold) using logistic regression (instead of P<T I'm using cumsum(PP[o])>T). However, GARFIELD accounts for confounding from LD and distance to nearest gene by including these as covariates.
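
For reference, a minimal sketch of a credible set membership indicator built from cumulative PPs. It assumes the usual construction (order the SNPs in a region by decreasing PP and keep adding SNPs until the cumulative PP first reaches the threshold), which may differ in detail from the indicator used in these notes; the function name and toy PPs are invented.

```python
def credible_set_indicator(pp, threshold=0.95):
    """Return a 0/1 list marking the SNPs in the smallest set, taken in
    order of decreasing PP, whose cumulative PP reaches the threshold."""
    order = sorted(range(len(pp)), key=lambda i: pp[i], reverse=True)
    in_set = [0] * len(pp)
    cumulative = 0.0
    for i in order:
        in_set[i] = 1
        cumulative += pp[i]
        if cumulative >= threshold:
            break
    return in_set

# Toy PPs for one region (invented).
pp = [0.02, 0.6, 0.01, 0.25, 0.12]
membership = credible_set_indicator(pp)
print(membership)
```

The resulting 0/1 vector plays the role that the P<T indicator plays in GARFIELD, i.e. it is the binary response in the logistic regression.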