Meeting with him on Friday: https://docs.google.com/document/d/1vIyvrJlLwQR49vzHn-a0XGlQ-4-vVXtfJI5EMSORwPM/edit
My previous analysis of the Nobel et al. data of genome annotations for 167 cell types only considers SNPs on the immunochip that were in the 39 T1D-associated genomic regions that I analysed in my previous project (approximately 17,000 of these).
I've found that the distribution of genomic annotations across the whole genome differs from that across SNPs. Since my research focusses on SNPs, it would be useful to use the distribution of genomic annotations across all immunochip SNPs as the baseline (rather than the whole genome). However, there are approximately 200,000 SNPs on the immunochip, which would take too long to map to the relevant genomic annotation in many cell types. For this reason, I keep the T1D SNPs and add additional SNPs genotyped on the immunochip, although in the future I should run this for all immunochip SNPs.
Note that in the existing methods, this "empirical distribution of the enrichment under the null hypothesis" is estimated in various ways. E.g. GARFIELD uses replication over variants matched by key metrics, and GoShifter uses a circularised permutation method, which accounts for the non-random distribution of genomic annotations with respect to each other and for the correlation between GWAS signals caused by LD, to estimate the null enrichment statistics.
My baseline SNP file contains annotation information of 19 cell types for 36,278 SNPs on the immunochip (including those in my original T1D analysis).
I investigate the enrichment of annotations amongst T1D 95% credible set variants. To do this, I obtain the proportions of annotations across only the T1D 95% credible set variants and divide these by the original proportions (across a sample of immunochip SNPs). The plot is fairly sparse because the annotation must be present in the 95% credible set T1D SNPs (only ~700 of these). Lines above 1 indicate positive enrichment of that annotation in that cell type, and lines below 1 indicate negative enrichment of that annotation in that cell type in 95% credible set T1D SNPs.
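The ratio described above can be sketched as follows for a single cell type. This is a minimal illustration only: the function names and the toy annotation labels and counts are made up, and it assumes each SNP carries exactly one annotation label per cell type.

```python
from collections import Counter


def annotation_proportions(annotations):
    """Return {annotation: proportion} for a list of per-SNP annotation labels."""
    counts = Counter(annotations)
    total = len(annotations)
    return {label: n / total for label, n in counts.items()}


def enrichment_ratios(credible_annotations, baseline_annotations):
    """Proportion among credible set SNPs divided by proportion among baseline SNPs.

    Annotations that never occur in the credible set SNPs are omitted,
    which is why the resulting plot can be sparse.
    """
    cred = annotation_proportions(credible_annotations)
    base = annotation_proportions(baseline_annotations)
    return {label: cred[label] / base[label] for label in cred if label in base}


# Toy data for one cell type (hypothetical counts, not real results).
baseline = (["enhancer"] * 10 + ["quiescent"] * 60
            + ["promoter"] * 5 + ["heterochromatin"] * 25)
credible = ["enhancer"] * 4 + ["quiescent"] * 4 + ["promoter"] * 2

ratios = enrichment_ratios(credible, baseline)
for label, ratio in sorted(ratios.items()):
    # A ratio above 1 suggests positive enrichment, below 1 negative enrichment.
    print(f"{label}: {ratio:.2f}")
```

Repeating this per cell type would give one ratio per annotation and cell type, which is what each line in the plot represents.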
The results look sensible:
Constitutive heterochromatin is negatively enriched in credible set SNPs in most cell types (except pancreatic islets and CD14 and CD19 cells).
Facultative heterochromatin is negatively enriched in credible set SNPs in most cell types (except brain and CD14).
Enhancers are positively enriched in credible set SNPs in most cell types (except brain).
Promoters are positively enriched in credible set SNPs in most cell types for which there is data.
Quiescent is negatively enriched in credible set SNPs in most cell types.
This is similar to a section of the prostate cancer paper, which states: "For comparison to the conditional QR approach, we also used Fisher's exact test to examine the representation of individual annotation features across variants included in the 95% credible set of prospective PrCa causal variants relative to variants not selected. Independent tests were conducted for each annotation upon the set of 37,863 tag variants analysed by JAM, of which 343 tags represented the 95% credible set of 3700 SNPs and annotations for all proxy SNPs were inherited by the tag variant."
My approach to copy this:
Form a SNP set consisting of a tag variant for each of the 39 credible sets.
Add binary annotation columns for whether any of the SNPs in the credible set had that annotation (here annotations refer to active/inactive marks in different cell types).
Plot these proportions against those for non credible set SNPs.
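The steps above can be sketched as follows. This is only an illustration under assumptions: the region names, column names (`active_thymus`, `active_CD14`) and values are hypothetical, and each SNP is represented as a dictionary of 0/1 marks.

```python
def tag_annotations(credible_sets, annotation_names):
    """One row per credible set: 1 if ANY SNP in the set has the annotation."""
    rows = {}
    for set_id, snps in credible_sets.items():
        rows[set_id] = {
            name: int(any(snp[name] for snp in snps)) for name in annotation_names
        }
    return rows


def proportion_with_mark(rows, name):
    """Proportion of rows (tag variants or SNPs) carrying the annotation."""
    rows = list(rows)
    return sum(row[name] for row in rows) / len(rows)


names = ["active_thymus", "active_CD14"]

# Hypothetical credible sets: each is a list of SNPs with binary marks.
credible_sets = {
    "region1": [
        {"active_thymus": 0, "active_CD14": 1},
        {"active_thymus": 0, "active_CD14": 0},
    ],
    "region2": [{"active_thymus": 1, "active_CD14": 0}],
}

# Hypothetical SNPs that are not in any credible set.
non_credible = [
    {"active_thymus": 0, "active_CD14": 0},
    {"active_thymus": 1, "active_CD14": 0},
    {"active_thymus": 0, "active_CD14": 0},
    {"active_thymus": 0, "active_CD14": 0},
]

tags = tag_annotations(credible_sets, names)
for name in names:
    p_tag = proportion_with_mark(tags.values(), name)
    p_other = proportion_with_mark(non_credible, name)
    print(f"{name}: tag variants {p_tag:.2f}, non credible set SNPs {p_other:.2f}")
```

Because a tag inherits a mark if any SNP in its credible set has it, the tag proportions will tend to be higher than the per-SNP proportions regardless of any real enrichment.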
Note that I don't use Fisher's exact test, as I read that it is best suited to low counts. We would also obviously see enrichment in credible set SNPs, because the red bars show whether any of the variants in the credible set had an active mark in that cell type, whereas the blue bars show the proportions over all SNPs not in the credible sets.
Rather than making inferences manually from the above figure, it would be good to formalise these findings using some statistical methods. To do this, I use penalised logistic regression (Lasso) to see which annotations in which cell types are enriched in 95% credible set SNPs. Note that the data is for the whole sample of immunochip SNPs (not just those in the T1D analysis) and I am using constitutive heterochromatin as the baseline.
Interpretation: The odds of a SNP being in a 95% T1D credible set are about 2.2 times higher (\(\exp(0.796413) \approx 2.2\), i.e. a 120% increase in the odds) if it lies in a promoter region in thymus cells than if it lies in constitutive heterochromatin.
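As a quick check of this arithmetic, a logistic regression coefficient can be converted to an odds ratio and a percentage change in the odds. The coefficient is the one quoted above; the variable names are only illustrative.

```python
import math

# Coefficient for "promoter in thymus cells" quoted in the text (log-odds scale).
coefficient = 0.796413

# Exponentiating a log-odds coefficient gives the odds ratio against the baseline.
odds_ratio = math.exp(coefficient)

# Percentage change in the odds relative to the baseline annotation.
percent_change = (odds_ratio - 1) * 100

print(f"odds ratio = {odds_ratio:.2f}, change in odds = {percent_change:.0f}%")
```

Since lasso shrinks coefficients towards zero, an odds ratio obtained this way is likely to be a conservative estimate of the effect.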
I run a lasso regression on a binary indicator of credible set membership and annotation marks in each cell type. From this, I can see the most important annotations in the most important cell types for credible set membership.
The baseline factor is constitutive heterochromatin in thymus cells.
I seem to be getting very variable and spurious results. To increase power, I collapse the annotations into a binary active/inactive mark and regress a binary indicator of credible set membership (contained) on these, to see which cell types are most indicative of credible set membership. To do this, I have to remove "unsure" annotations (bivalent, unclassified, low confidence).
Active: enhancer, transcribed, promoter, reg permissive.
Inactive: quiescent, constitutive heterochromatin, facultative heterochromatin.
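The collapsing step could look something like the sketch below. The exact label strings and the SNP identifiers are assumptions for illustration; the real annotation files may spell these states differently.

```python
# Grouping taken from the lists above; the exact label strings are assumed.
ACTIVE = {"enhancer", "transcribed", "promoter", "reg permissive"}
INACTIVE = {"quiescent", "constitutive heterochromatin", "facultative heterochromatin"}


def collapse(annotation):
    """Map an annotation to 1 (active), 0 (inactive) or None (unsure)."""
    if annotation in ACTIVE:
        return 1
    if annotation in INACTIVE:
        return 0
    return None  # e.g. bivalent, unclassified, low confidence


def collapse_snps(snp_annotations):
    """Collapse {snp: annotation} to {snp: 0/1}, dropping unsure annotations."""
    collapsed = {}
    for snp, label in snp_annotations.items():
        mark = collapse(label)
        if mark is not None:
            collapsed[snp] = mark
    return collapsed


# Hypothetical SNPs in one cell type.
thymus = {
    "snp1": "promoter",
    "snp2": "quiescent",
    "snp3": "bivalent",
    "snp4": "enhancer",
}
print(collapse_snps(thymus))
```

Note that dropping unsure annotations per cell type means a SNP can be missing in some cell types and present in others, which matters when regressing on all cell types at once.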
I also do this for all the SNPs I have the data for (~17,000), so ~500 SNPs lie above the 0.97 quantile and ~170 lie above the 0.99 quantile.
Note that, with a binary predictor, this method literally just draws a line from the Xth PP quantile in the active=0 SNPs to the Xth PP quantile in the active=1 SNPs. It would be more involved to use a continuous predictor.
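To make this point concrete, the sketch below computes the line directly from the two group quantiles, without fitting a quantile regression. The function names and toy PPs are made up, and the quantile uses linear interpolation, so it may differ slightly from what a quantile regression routine reports (those typically return an observed data value).

```python
def quantile(values, q):
    """Linearly interpolated q-th quantile (0 <= q <= 1) of a non-empty list."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def binary_quantile_line(pp, active, q):
    """Intercept and slope of the line joining the two group quantiles.

    intercept = q-th PP quantile among active=0 SNPs
    slope     = q-th PP quantile among active=1 SNPs minus the intercept
    """
    q0 = quantile([p for p, a in zip(pp, active) if a == 0], q)
    q1 = quantile([p for p, a in zip(pp, active) if a == 1], q)
    return q0, q1 - q0


# Toy posterior probabilities and binary marks (hypothetical values).
pp = [0.0, 0.1, 0.2, 0.3, 0.4, 0.2, 0.4, 0.6, 0.8, 1.0]
active = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

intercept, slope = binary_quantile_line(pp, active, q=0.5)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

A positive slope at a given quantile would suggest that active marks in that cell type are associated with higher PPs at that part of the distribution.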
and also for all the SNPs in the 39 T1D-associated regions (note the change in y axis) (I can't do this for all the immunochip SNPs as I don't have their association statistics for T1D).
and for all T1D SNPs from the 39 associated regions.
I use logistic regression to look at the relationship of credible set membership/PPs/P values with annotations in many cell types. I use lasso to pick out the most relevant annotations.
Credible set membership ~ annotations for each cell type individually.
Credible set membership ~ annotations in all cell types.
PP ~ annotations in all cell types.
-log10(P) ~ annotations in all cell types.
To increase power, I also investigate collapsing annotations into binary active/inactive marks. In this analysis, lasso can be used to pick out the most relevant cell type (although note the most relevant cell types are not the same for the different responses).
Credible set membership ~ binary active/inactive mark in all cell types.
PP ~ binary active/inactive mark in all cell types.
P ~ binary active/inactive mark in all cell types.
Quantile regression offers an alternative approach to investigating relationships between annotations and PPs/P values. I think that a continuous predictor would be better suited to this analysis. Need to go over the prostate cancer paper to figure out how this can be used to reweight PPs.
PPs ~ active/inactive chromatin mark -> Find which cell types are important (in terms of annotation increasing/decreasing PP).
P values ~ active/inactive chromatin mark -> Find which cell types are important (in terms of annotation increasing/decreasing P value).
Is there a way to use information on the length of the fragment that the SNP falls in?
My logistic regression method for credible set membership is similar to GARFIELD, which looks for annotation enrichment in SNPs with P<T (where T is some threshold) using logistic regression; instead of P<T, I'm using cumsum(PP[o])>T. However, GARFIELD accounts for confounding from LD and distance to the nearest gene by including these as covariates.
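For reference, a binary membership indicator based on the cumulative sum of ordered PPs could be built as in the sketch below. It assumes the usual credible set construction for a single region (sort SNPs by decreasing PP and keep adding them until the cumulative PP reaches the threshold); the function name and toy PPs are hypothetical and the exact rule used in my analysis may differ at the boundary.

```python
def credible_set_membership(pp, threshold=0.95):
    """Return a 0/1 list marking SNPs in the credible set for one region.

    SNPs are added in order of decreasing PP until the cumulative PP
    first reaches the threshold (assumed construction).
    """
    order = sorted(range(len(pp)), key=lambda i: pp[i], reverse=True)
    member = [0] * len(pp)
    cumulative = 0.0
    for i in order:
        member[i] = 1
        cumulative += pp[i]
        if cumulative >= threshold:
            break
    return member


# Toy PPs for one region (hypothetical; they sum to 1).
pp = [0.01, 0.5, 0.02, 0.4, 0.07]
contained = credible_set_membership(pp, threshold=0.95)
print(contained)
```

The resulting indicator plays the role that P<T plays in GARFIELD: it is the binary response that the annotations are regressed against.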