14 November

Enrichment of emission states and SNPs
Collapsing chromatin states
Distribution of chromatin states
New annotation data (Noble et al. 2019)
Secret manuscript

I’ve decided to go down the PP route (rather than the chromatin contact route) to start with as this makes more sense given my previous fine-mapping project. May revisit the MPPC/CHiCAGO work at a later date.

1. Enrichment of emission states and SNPs

Using the T1D GWAS data that I fine-mapped in my previous project (https://www.nature.com/articles/ng.3245#t1) and the chromHMM data for both activated and non-activated CD4+ T cells (peaky paper and https://www.sciencedirect.com/science/article/pii/S0092867416313228) I investigate ways of incorporating genomic annotations with GWAS SNPs.

Motivating question: No SNPs fall in E12 or E13 regions, are these just not common across the genome?

For activated and non-activated cells separately, I investigate SNP enrichment in emission states across (i) the whole genome, (ii) T1D ImmunoChip SNPs and (iii) 95% credible set T1D SNPs.

A big difference is that ~80% of the genome is allocated to state E14, whereas only ~40% of the GWAS SNPs are allocated to this state. This is likely because they are allocated to E2 instead, which also represents no chromatin marks.

I therefore merge E14 and E2 into a single chromatin state that is indicative of no chromatin modifications.

1a. Modelling this difference (activated cells)

I consider a model to quantify the enrichment of credible SNPs in emission states. I regress a binary indicator of contained within the credible set against the emission states, using E14/E2 as the baseline since this mark is not indicative of any histone modifications.

## 
## Call:
## glm(formula = contained ~ E, family = binomial, data = act_full)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.459  -0.273  -0.273  -0.273   2.786  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.27078    0.04959 -65.950  < 2e-16 ***
## EE1         -0.53184    0.22119  -2.404  0.01620 *  
## EE10         1.07355    1.05526   1.017  0.30899    
## EE11        -0.58995    0.71620  -0.824  0.41009    
## EE15        -0.15617    0.21760  -0.718  0.47294    
## EE3          0.69056    0.23717   2.912  0.00359 ** 
## EE4          0.54174    0.12069   4.489 7.16e-06 ***
## EE5          0.36420    0.25397   1.434  0.15156    
## EE6          0.22625    0.28818   0.785  0.43238    
## EE7          0.41640    0.16278   2.558  0.01052 *  
## EE8         -0.22573    0.50994  -0.443  0.65801    
## EE9         -0.34014    1.01463  -0.335  0.73745    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5538.9  on 16695  degrees of freedom
## Residual deviance: 5495.7  on 16684  degrees of freedom
## AIC: 5519.7
## 
## Number of Fisher Scoring iterations: 6

Interpretation:

The odds of being in the credible set over the odds of not being in the credible set given that you’re in E14/E2 is \(exp(-3.27078)=0.038\) (this is just the proportion of SNPs falling in E14/E2 that are in the credible sets).
There is a 40% decrease (\(exp(-0.53184)=0.587\)) of being in E1 if you’re in the credible set relative to if you’re in E14/E2
There is a 99% increase (\(exp(0.69056)=1.99\)) of being in E3 if you’re in the credible set relative to if you’re in E14/E2.
There is 72% increase (\(exp(0.54174)=1.72\)) of being in E4 if you’re in the credible set relative to if you’re in E14/E2.
There is a 52% increase (\(exp(0.41640)=1.52\)) of being in E7 if you’re in the credible set relative to if you’re in E14/E2.

Note that these results formalise what we see in the plot above.

1b. Modelling this difference (non-activated cells)

I do the same but for the non-activated cells.

## 
## Call:
## glm(formula = contained ~ E, family = binomial, data = non_full)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3990  -0.2779  -0.2779  -0.2779   2.6771  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.23499    0.04790 -67.537  < 2e-16 ***
## EE1          -0.27831    0.18051  -1.542  0.12312    
## EE10         -0.19090    0.51032  -0.374  0.70835    
## EE11          0.28222    0.16189   1.743  0.08129 .  
## EE15        -14.33108  246.30103  -0.058  0.95360    
## EE3           0.74415    0.28221   2.637  0.00837 ** 
## EE4           0.44690    0.26199   1.706  0.08805 .  
## EE5           0.35555    0.15892   2.237  0.02526 *  
## EE6           0.07413    0.22288   0.333  0.73943    
## EE7           0.54464    0.24186   2.252  0.02433 *  
## EE8           0.46240    0.46346   0.998  0.31841    
## EE9          -0.32036    1.01532  -0.316  0.75236    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5545.1  on 16693  degrees of freedom
## Residual deviance: 5501.2  on 16682  degrees of freedom
## AIC: 5525.2
## 
## Number of Fisher Scoring iterations: 16

Interpretation:

The probability of being in the credible set given that you’re in E14/E2 is \(exp(-3.23499)=0.039\) (this is just the proportion of SNPs falling in E14/E2 that are in the credible sets).
There is a 33% increase (\(exp(0.28222)=1.32\)) of being in E11 if you’re in the credible set relative to if you’re in E14/E2.
There is a 110% increase (\(exp(0.74415)=2.10\)) of being in E3 if you’re in the credible set relative to if you’re in E14/E2.
There is a 56% increase (\(exp(0.44690)=1.56\)) of being in E4 if you’re in the credible set relative to if you’re in E14/E2.
There is a 43% increase (\(exp(0.35555)=1.43\)) of being in E5 if you’re in the credible set relative to if you’re in E14/E2.
There is a 72% increase (\(exp(0.54464)=1.72\)) of being in E7 if you’re in the credible set relative to if you’re in E14/E2.

Notes:

E15 is completely depleted in credible set SNPs.
The AIC is lower for the activated cell type model.

1c. Credible set SNPs in activated or non-activated cells

I compare the enrichment across cell types for credible set SNPs only (i.e. comparing the blue columns in the first plot), we see that E4, E7 and E15 are more enriched in credible set SNPs in activated rather than non-activated cells.

To formalise this, I model this difference by regressing a binary indicator of activated/non-activated cell type against the emission states for the credible set SNPs.

## 
## Call:
## glm(formula = Act ~ E, family = binomial, data = combine)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9348  -1.1475  -0.2982   1.2077   2.5042  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.07089    0.06765  -1.048  0.29474    
## EE1          -0.36443    0.28186  -1.293  0.19602    
## EE10         -1.31541    1.12008  -1.174  0.24024    
## EE11         -3.02016    0.72616  -4.159  3.2e-05 ***
## EE15         15.63696  303.47137   0.052  0.95891    
## EE3           0.42756    0.35497   1.204  0.22840    
## EE4           1.77563    0.28007   6.340  2.3e-10 ***
## EE5          -0.92454    0.29179  -3.169  0.00153 ** 
## EE6          -0.45521    0.35631  -1.278  0.20140    
## EE7           0.91064    0.28273   3.221  0.00128 ** 
## EE8          -0.15226    0.67422  -0.226  0.82134    
## EE9           0.07089    1.41583   0.050  0.96007    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1823.0  on 1314  degrees of freedom
## Residual deviance: 1655.7  on 1303  degrees of freedom
## AIC: 1679.7
## 
## Number of Fisher Scoring iterations: 14

Interpretation:

There is a 95% decrease of being in E11 if the SNP is in an active cell type, compared with E14/E2. Unsure of the biology of this mark… see below

“Previous annotation efforts have reported regulatory elements with marginal strength characterized by H3K4me1 without H3K27ac, which they have typically described as “weak enhancers” [2, 23]. This terminology has caused confusion [30] because it suggests that these elements either are called with low confidence, or promote expression to a lesser degree than “strong enhancers”. In fact, it has not been verified that this pattern of activity (+H3K4me1, –H3K27ac) corresponds to either of these hypotheses. To avoid this confusion, we apply the term “permissive regulatory region” (RegPermissive) to these regions instead. Our RegPermissive annotations are mildly enriched in the vicinity of genes and are mildly enriched for conservation." [From https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1784-2#Sec16]

There is a 500% increase of being in E4 if the SNP is in an active cell type, compared with E14/E2. Makes sense biologically: E4 is high in only H3K27ac (enhancer mark)!
There is a 60% decrease of being in E5 if the SNP is in an active cell type, compared with E14/E2. Doesn’t make sense biologically: E5 is high in H3K27ac and H3K4me1 (marks of enhancers)???
There is a 150% increase of being in E7 if the SNP is in an active cell type, compared with E14/E2. Kind of makes sense biologically: E7 is high in H3K27ac and H3K4me3 (active marks of transcription - colocalise in areas around TSS and have been shown to have a positive correlation with gene expression), but ultimately they mark TSS???

2. Collapsing states

So there is clearly some movement between chromatin states for activated and non-activated cells. Infact, 20% of variants “switch state” between activated and non-activated cells, but are non-activate states switching between themselves and active states switching between themselves or are there some systematic differences? Note that this switching could be due to the nature of the Markov chain and the ChromHMM algorithm - e.g. E2 and E14 both represent low probability of any histone modifications so may be interchangeable. I also need to look at the transition probabilities that show e.g. you can go from 8 or 12 to get to 14. It’s going to be hard to show that these movements are actually anything biologically interesting or just an artifact of the Markov chain and the transition probabilities.

I want to investigate whether there are systematic changes in chromatin states for the GWAS variants between activated and non-activated cells. For example, whether SNPs are generally allocated to non-active chromatin marks in non-active cells and active marks in active cells. Because of the nature of switching between states that arises from the ChromHMM method, I collapse cell states into “active” (E4-11) and “inactive” (E1-3, E12-15), in the hope that this will help us to find more “real” switches in chromatin state.

All SNPs

In the activated cells, 17.1% of SNPs are in active regions.

## 
##      A      N 
## 17.104 82.896

In the non-activated cells, 11.9% of SNPs are in active regions.

## 
##        A        N 
## 11.86197 88.13803

6.4% of variants move from non-activated states in non-active cells to active states in active cells, whilst only 1.1% of variants move from non-activated states in active cells to active states in non-active cells.

table(data$act, data$non)/dim(data)[1] * 100

##    
##             A         N
##   A 10.747664  6.356338
##   N  1.114306 81.781692

Credible set SNPs only

In the activated cells, 24.7% of SNPs are in active regions.

## 
##        A        N 
## 24.65753 75.34247

In the non-activated cells, 15.5% SNPs are in active regions.

## 
##        A        N 
## 15.52511 84.47489

11.1% of variants move from non-activated regions in non-activated cells to active regions in activated cells, whilst only 2% of variants move from activated regions in non-activated cells to non-activated regions in active cells.

table(data$act, data$non)/dim(data)[1] * 100

##    
##             A         N
##   A 13.546423 11.111111
##   N  1.978691 73.363775

What about non-credible set vrs credible set

There is virtually no difference because the credible set SNPs only make up a small proportion of all the GWAS SNPs (657/16696 * 100 = 3.9%).

(table(data_cs$act, data_cs$non)/dim(data_cs)[1])*100

##    
##             A         N
##   A 13.546423 11.111111
##   N  1.978691 73.363775

(table(data_notcs$act, data_notcs$non)/dim(data_notcs)[1]) *100

##    
##             A         N
##   A 10.632990  6.161522
##   N  1.078890 82.126598

So I’ve found that there is more movement to active chromatin states for credible set SNPs than the standard SNP list (11.1% rather than 6.16%). I do a quick test to see if this change is significant, and find that at the 10% level the change from active state in activated cell types to non-active state in non-activated cell types is not significant.

prop.test(c(6.16,11.11), c(100,100), alternative = "less")

## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(6.16, 11.11) out of c(100, 100)
## X-squared = 0.98883, df = 1, p-value = 0.16
## alternative hypothesis: less
## 95 percent confidence interval:
##  -1.00000000  0.02558348
## sample estimates:
## prop 1 prop 2 
## 0.0616 0.1111

Summary of chromatin state movements between activated and non-activated cells

3. Distribution of chromatin states

Last week I found that the average fragment size was biggest in non-activated cells. I look at the difference in the distributions of the length of fragments allocated to each emission state for activated and non-activated cells.

The aim of this section was to see whether there are long continuous stretches of inactive regions in both activated and non-activated cells, but I’m not sure how this section helps us to learn about this…

4. New annotation data (Nobel et al. 2019)

“A unified encyclopedia of human functional DNA elements through fully automated annotation of 164 human cell types”

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1784-2

This paper has publically available whole genome annotations for 164 human cell types. They also use a machine learning approach to remove the “human interpretation” step that I’ve been struggling with in the ChromHMM dataset - directly labeling regions as “enhancers”/“promoters” etc.

4a. Summary

ChromHMM uses a hidden markov model to transition from observables (e.g. histone marks) to unobservables (e.g. chromatin states). However, HMMs handle missing data poorly, requiring interpolation and smoothing to process regions of missing data (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3340533/).
Segway uses a dynamic Bayesian network which is extention of Bayesian networks with the concept of time. This allows complex relationships among variables at the current or nearby positions in the genome to be modeled (e.g. “hidden” states can interact with each other). They also handle missing data better through switching parents in the DBN structure (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3340533/).
Current methods for genome annotation are based on semi-automated genome annotation (SAGA) algorithms (e.g. chromHMM and Segway), whereby there is a “human functional interpretation” step of the state labels –> this is the step I am struggling with.
The usual method to annotate multiple cell types is to train a shared model across all cell types, meaning that all cell types must have exactly the same assays available (“concatenated approach”).
The method described in the paper removes the human interpretation step and the requirement of identical assays across cell types by using an independent annotation approach to annotate 164 human cell types using 1615 data sets (the concatenated approach could use at most 570 data sets –> 114 cell types with 5 assay types each).

Method:

Use all available data (ChIP-seq, DNase-seq and Repli-seq (which defined replication timings) - note no measures of methylation) for a given cell type as input to Segway. The number of states for a given cell type was defined by \(10+2*\sqrt{\text{number of tracks}}\).
Segway produces an annotation with integer state labels at 100-bp resolution (note that at each of the maximum of 25 iterations, a random 1% of the genome was used as the training data - in chromHMM the whole genome is used).
A random forest classifier (trained using 16 features that represent the information typically used in manual interpretation) then assigns an interpretation to each sate (using features derived from all segments with that state in the genome) using a controlled vocabulary of ten such terms.

“For each annotation state, we defined features that encompass the information typically used to interpret annotation states. Specifically, we used the following 16 features: mean value of H3K27me3, H3K4me3, H3K36me3, H3K4me1, H3K4me3, and H3K9me3 (six features), and log enrichment of the state in the following positions relative to GENCODE genes: 1–10 kbp 5′ flanking, 1 bp–1 kbp 5’ flanking, initial exon, initial intron, internal exons, internal introns, terminal exon, 1 bp–1 kbp 3’ flanking, and 1–10 kbp 3′ flanking.” … “We assigned the term”Low-Confidence" to new states that the classifier predicted as Unclassified and to states not assigned with more than 25% confidence to any of the other terms."

–> the mean value of histone modifications is similar to my idea in the “Ideas for weighting” section of https://annahutch.github.io/PhD/31oct.html#probs.

Possible extension they mention: “It would be valuable to evaluate the effect on annotation quality of hyperparameters such as the number of labels or segment length priors. Unfortunately doing so would require training large numbers of annotations, each of which requires an expensive computation.”

4b. Implementation

I use their genome annotation data for 9 CD4 cell types to see where my T1D SNPs are enriched.

The genome annotation data files I have are:

[1] “./CD4_MEMORY_PRIMARY_CELLS.bed.gz”
[2] “./CD4_NAIVE_PRIMARY_CELLS.bed.gz”
[3] "./CD4+_CD25-_CD45RA+_NAIVE_PRIMARY_CELLS.bed.gz"
[4] "./CD4+_CD25-_CD45RO+_MEMORY_PRIMARY_CELLS.bed.gz"
[5] "./CD4+_CD25-_IL17-_PMA-IONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLS.bed.gz"
[6] "./CD4+_CD25-_IL17+_PMA-IONOMCYIN_STIMULATED_TH17_PRIMARY_CELLS.bed.gz"
[7] "./CD4+_CD25-_TH_PRIMARY_CELLS.bed.gz"
[8] "./CD4+_CD25+_CD127-_TREG_PRIMARY_CELLS.bed.gz"
[9] "./CD4+_CD25INT_CD127+_TMEM_PRIMARY_CELLS.bed.gz"

The numbers of the cell types match to the numbers in the legend for the plots below.

Most of the genome for the 9 CD4 cell types falls in quiescent, constitutive heterochromatin or facultative heterochromatic regions (all inactive regions).

In the plots below, I find that my T1D SNPs are enriched in enhancer regions of the genome. Among only the 95% credible set SNPs, this enrichment may be even higher.

To formalise these results, I run logistic regression models to see which annotations T1D credible set SNPs are enriched in, using constitutive heterochromatin as the baseline.

CD4 memory primary cells

## 
## Call:
## glm(formula = contained ~ CD4_MEMORY_PRIMARY_CELLS, family = "binomial", 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3052  -0.3052  -0.2796  -0.2695   2.5893  
## 
## Coefficients:
##                                        Estimate Std. Error z value
## (Intercept)                            -3.28240    0.09668 -33.951
## CD4_MEMORY_PRIMARY_CELLSEnhancer        0.23925    0.11687   2.047
## CD4_MEMORY_PRIMARY_CELLSFacultativeHet -0.01481    0.14883  -0.099
## CD4_MEMORY_PRIMARY_CELLSQuiescent       0.06000    0.13036   0.460
## CD4_MEMORY_PRIMARY_CELLSTranscribed    -0.03414    0.14676  -0.233
##                                        Pr(>|z|)    
## (Intercept)                              <2e-16 ***
## CD4_MEMORY_PRIMARY_CELLSEnhancer         0.0406 *  
## CD4_MEMORY_PRIMARY_CELLSFacultativeHet   0.9207    
## CD4_MEMORY_PRIMARY_CELLSQuiescent        0.6453    
## CD4_MEMORY_PRIMARY_CELLSTranscribed      0.8161    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5529.1  on 16653  degrees of freedom
## Residual deviance: 5521.0  on 16649  degrees of freedom
##   (42 observations deleted due to missingness)
## AIC: 5531
## 
## Number of Fisher Scoring iterations: 6

There is a 27% increase of a credible set SNP over a non-credible set SNP being in an enhancer region, compared with constitutive heterochromatin.

CD4 naive primary cells

## 
## Call:
## glm(formula = contained ~ CD4_NAIVE_PRIMARY_CELLS, family = "binomial", 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3238  -0.2901  -0.2835  -0.2624   2.6358  
## 
## Coefficients:
##                                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)                           -3.19382    0.10360 -30.830   <2e-16
## CD4_NAIVE_PRIMARY_CELLSEnhancer        0.27189    0.13343   2.038   0.0416
## CD4_NAIVE_PRIMARY_CELLSFacultativeHet  0.01942    0.13541   0.143   0.8859
## CD4_NAIVE_PRIMARY_CELLSQuiescent      -0.24849    0.15321  -1.622   0.1048
## CD4_NAIVE_PRIMARY_CELLSRegPermissive   0.04654    0.15028   0.310   0.7568
## CD4_NAIVE_PRIMARY_CELLSTranscribed    -0.15825    0.14349  -1.103   0.2701
##                                          
## (Intercept)                           ***
## CD4_NAIVE_PRIMARY_CELLSEnhancer       *  
## CD4_NAIVE_PRIMARY_CELLSFacultativeHet    
## CD4_NAIVE_PRIMARY_CELLSQuiescent         
## CD4_NAIVE_PRIMARY_CELLSRegPermissive     
## CD4_NAIVE_PRIMARY_CELLSTranscribed       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5534.8  on 16644  degrees of freedom
## Residual deviance: 5517.0  on 16639  degrees of freedom
##   (51 observations deleted due to missingness)
## AIC: 5529
## 
## Number of Fisher Scoring iterations: 6

There is a 31% increase of a credible set SNP over a non-credible set SNP being in an enhancer region, compared with constitutive heterochromatin.

CD4 CD25- CD45RA+ naive primary cells

## 
## Call:
## glm(formula = contained ~ CD4p_CD25m_CD45RAp_NAIVE_PRIMARY_CELLS, 
##     family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3234  -0.3151  -0.2648  -0.2515   2.6343  
## 
## Coefficients:
##                                                      Estimate Std. Error
## (Intercept)                                          -3.39696    0.12808
## CD4p_CD25m_CD45RAp_NAIVE_PRIMARY_CELLSBivalent        0.46196    0.23241
## CD4p_CD25m_CD45RAp_NAIVE_PRIMARY_CELLSEnhancer        0.41872    0.15347
## CD4p_CD25m_CD45RAp_NAIVE_PRIMARY_CELLSFacultativeHet  0.06390    0.16951
## CD4p_CD25m_CD45RAp_NAIVE_PRIMARY_CELLSQuiescent       0.17729    0.16340
## CD4p_CD25m_CD45RAp_NAIVE_PRIMARY_CELLSRegPermissive   0.47245    0.16066
## CD4p_CD25m_CD45RAp_NAIVE_PRIMARY_CELLSTranscribed    -0.04106    0.15813
##                                                      z value Pr(>|z|)    
## (Intercept)                                          -26.522  < 2e-16 ***
## CD4p_CD25m_CD45RAp_NAIVE_PRIMARY_CELLSBivalent         1.988  0.04685 *  
## CD4p_CD25m_CD45RAp_NAIVE_PRIMARY_CELLSEnhancer         2.728  0.00637 ** 
## CD4p_CD25m_CD45RAp_NAIVE_PRIMARY_CELLSFacultativeHet   0.377  0.70619    
## CD4p_CD25m_CD45RAp_NAIVE_PRIMARY_CELLSQuiescent        1.085  0.27794    
## CD4p_CD25m_CD45RAp_NAIVE_PRIMARY_CELLSRegPermissive    2.941  0.00328 ** 
## CD4p_CD25m_CD45RAp_NAIVE_PRIMARY_CELLSTranscribed     -0.260  0.79515    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5522.8  on 16654  degrees of freedom
## Residual deviance: 5496.1  on 16648  degrees of freedom
##   (41 observations deleted due to missingness)
## AIC: 5510.1
## 
## Number of Fisher Scoring iterations: 6

There is a 59% increase of a credible set SNP being in a bivalent region, a 52% increase of a credible set SNP being in an enhancer region and a 60% increase of a credible set SNP being in a reg permissive region over a non-credible set SNP, compared with constitutive heterochromatin.

CD4 CD25- CD45RA+ memory primary cells

## 
## Call:
## glm(formula = contained ~ CD4p_CD25m_CD45ROp_MEMORY_PRIMARY_CELLS, 
##     family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3226  -0.3053  -0.2702  -0.2531   2.7021  
## 
## Coefficients:
##                                                       Estimate Std. Error
## (Intercept)                                           -3.19489    0.09818
## CD4p_CD25m_CD45ROp_MEMORY_PRIMARY_CELLSEnhancer        0.26541    0.12689
## CD4p_CD25m_CD45ROp_MEMORY_PRIMARY_CELLSFacultativeHet -0.22980    0.13454
## CD4p_CD25m_CD45ROp_MEMORY_PRIMARY_CELLSPromoter       -0.42945    0.72316
## CD4p_CD25m_CD45ROp_MEMORY_PRIMARY_CELLSQuiescent      -0.12534    0.17671
## CD4p_CD25m_CD45ROp_MEMORY_PRIMARY_CELLSRegPermissive   0.15248    0.14588
## CD4p_CD25m_CD45ROp_MEMORY_PRIMARY_CELLSTranscribed    -0.09700    0.13469
##                                                       z value Pr(>|z|)    
## (Intercept)                                           -32.542   <2e-16 ***
## CD4p_CD25m_CD45ROp_MEMORY_PRIMARY_CELLSEnhancer         2.092   0.0365 *  
## CD4p_CD25m_CD45ROp_MEMORY_PRIMARY_CELLSFacultativeHet  -1.708   0.0876 .  
## CD4p_CD25m_CD45ROp_MEMORY_PRIMARY_CELLSPromoter        -0.594   0.5526    
## CD4p_CD25m_CD45ROp_MEMORY_PRIMARY_CELLSQuiescent       -0.709   0.4781    
## CD4p_CD25m_CD45ROp_MEMORY_PRIMARY_CELLSRegPermissive    1.045   0.2959    
## CD4p_CD25m_CD45ROp_MEMORY_PRIMARY_CELLSTranscribed     -0.720   0.4714    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5523.2  on 16658  degrees of freedom
## Residual deviance: 5502.1  on 16652  degrees of freedom
##   (37 observations deleted due to missingness)
## AIC: 5516.1
## 
## Number of Fisher Scoring iterations: 6

There is a 30% increase of a credible set SNP over a non-credible set SNP being in an enhancer region, compared with constitutive heterochromatin.

CD4 CD25- IL17- Th stimulated MACS primary cells

## 
## Call:
## glm(formula = contained ~ CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLS, 
##     family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3239  -0.3052  -0.2753  -0.2504   2.7825  
## 
## Coefficients:
##                                                                                        Estimate
## (Intercept)                                                                             -3.8501
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSEnhancer         0.9286
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSFacultativeHet   0.5028
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSQuiescent        0.5965
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSRegPermissive    0.8066
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSTranscribed      0.4032
##                                                                                        Std. Error
## (Intercept)                                                                                0.3368
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSEnhancer           0.3441
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSFacultativeHet     0.3512
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSQuiescent          0.3484
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSRegPermissive      0.3530
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSTranscribed        0.3513
##                                                                                        z value
## (Intercept)                                                                            -11.430
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSEnhancer         2.699
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSFacultativeHet   1.432
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSQuiescent        1.712
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSRegPermissive    2.285
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSTranscribed      1.148
##                                                                                        Pr(>|z|)
## (Intercept)                                                                             < 2e-16
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSEnhancer        0.00696
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSFacultativeHet  0.15225
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSQuiescent       0.08691
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSRegPermissive   0.02231
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSTranscribed     0.25101
##                                                                                           
## (Intercept)                                                                            ***
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSEnhancer       ** 
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSFacultativeHet    
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSQuiescent      .  
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSRegPermissive  *  
## CD4p_CD25m_IL17m_PMAmIONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLSTranscribed       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5536.7  on 16667  degrees of freedom
## Residual deviance: 5506.1  on 16662  degrees of freedom
##   (28 observations deleted due to missingness)
## AIC: 5518.1
## 
## Number of Fisher Scoring iterations: 6

There is a 153% increase of a credible set SNP being in an enhancer region, and a 124% increase of a credible set SNP over a non-credible set SNP being in a reg permissive region, compared with constitutive heterochromatin.

CD4 CD25- IL17+ Th17 stimulated primary cells

## 
## Call:
## glm(formula = contained ~ CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLS, 
##     family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3171  -0.2843  -0.2843  -0.2679   2.6650  
## 
## Coefficients:
##                                                                             Estimate
## (Intercept)                                                                -3.309215
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSEnhancer        0.344106
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSFacultativeHet -0.212755
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSQuiescent       0.120849
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSRegPermissive   0.045934
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSTranscribed     0.002217
##                                                                            Std. Error
## (Intercept)                                                                  0.116024
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSEnhancer         0.137933
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSFacultativeHet   0.188060
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSQuiescent        0.136165
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSRegPermissive    0.203771
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSTranscribed      0.154702
##                                                                            z value
## (Intercept)                                                                -28.522
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSEnhancer         2.495
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSFacultativeHet  -1.131
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSQuiescent        0.888
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSRegPermissive    0.225
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSTranscribed      0.014
##                                                                            Pr(>|z|)
## (Intercept)                                                                  <2e-16
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSEnhancer         0.0126
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSFacultativeHet   0.2579
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSQuiescent        0.3748
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSRegPermissive    0.8217
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSTranscribed      0.9886
##                                                                               
## (Intercept)                                                                ***
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSEnhancer       *  
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSFacultativeHet    
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSQuiescent         
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSRegPermissive     
## CD4p_CD25m_IL17p_PMAmIONOMCYIN_STIMULATED_TH17_PRIMARY_CELLSTranscribed       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5516.8  on 16658  degrees of freedom
## Residual deviance: 5500.1  on 16653  degrees of freedom
##   (37 observations deleted due to missingness)
## AIC: 5512.1
## 
## Number of Fisher Scoring iterations: 6

There is a 41% increase of a credible set SNP over a non-credible set SNP being in an enhancer region, compared with constitutive heterochromatin.

CD4 CD25- Th primary cells

## 
## Call:
## glm(formula = contained ~ CD4p_CD25m_TH_PRIMARY_CELLS, family = "binomial", 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3306  -0.3094  -0.2606  -0.2593   2.7715  
## 
## Coefficients:
##                                           Estimate Std. Error z value
## (Intercept)                                -3.8188     0.2260 -16.894
## CD4p_CD25m_TH_PRIMARY_CELLSBivalent         0.4631     0.4075   1.136
## CD4p_CD25m_TH_PRIMARY_CELLSEnhancer         0.9081     0.2424   3.747
## CD4p_CD25m_TH_PRIMARY_CELLSFacultativeHet   0.4428     0.2447   1.810
## CD4p_CD25m_TH_PRIMARY_CELLSPromoter         0.9393     0.2836   3.312
## CD4p_CD25m_TH_PRIMARY_CELLSQuiescent        0.7262     0.2416   3.006
## CD4p_CD25m_TH_PRIMARY_CELLSRegPermissive    0.8033     0.2649   3.032
## CD4p_CD25m_TH_PRIMARY_CELLSTranscribed      0.4530     0.2424   1.869
##                                           Pr(>|z|)    
## (Intercept)                                < 2e-16 ***
## CD4p_CD25m_TH_PRIMARY_CELLSBivalent       0.255832    
## CD4p_CD25m_TH_PRIMARY_CELLSEnhancer       0.000179 ***
## CD4p_CD25m_TH_PRIMARY_CELLSFacultativeHet 0.070326 .  
## CD4p_CD25m_TH_PRIMARY_CELLSPromoter       0.000926 ***
## CD4p_CD25m_TH_PRIMARY_CELLSQuiescent      0.002645 ** 
## CD4p_CD25m_TH_PRIMARY_CELLSRegPermissive  0.002426 ** 
## CD4p_CD25m_TH_PRIMARY_CELLSTranscribed    0.061620 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5522.9  on 16655  degrees of freedom
## Residual deviance: 5489.5  on 16648  degrees of freedom
##   (40 observations deleted due to missingness)
## AIC: 5505.5
## 
## Number of Fisher Scoring iterations: 6

There is a 148% increase of a credible set SNP being in an enhancer region, a 155% increase of a credible set SNP being in a promoter region, a 106% increase of a credible set SNP being in a quiescent region and a 123% increase of being in a reg permissive region over a non-credible set SNP, compared with constitutive heterochromatin.

CD4 CD25+ CD127- Treg primary cells

## 
## Call:
## glm(formula = contained ~ CD4p_CD25p_CD127m_TREG_PRIMARY_CELLS, 
##     family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3337  -0.2849  -0.2849  -0.2656   2.6061  
## 
## Coefficients:
##                                                    Estimate Std. Error
## (Intercept)                                        -3.32709    0.19234
## CD4p_CD25p_CD127m_TREG_PRIMARY_CELLSBivalent        0.46680    0.24595
## CD4p_CD25p_CD127m_TREG_PRIMARY_CELLSEnhancer        0.36542    0.21661
## CD4p_CD25p_CD127m_TREG_PRIMARY_CELLSFacultativeHet -0.03484    0.21428
## CD4p_CD25p_CD127m_TREG_PRIMARY_CELLSQuiescent       0.14317    0.20327
## CD4p_CD25p_CD127m_TREG_PRIMARY_CELLSTranscribed     0.04264    0.21347
##                                                    z value Pr(>|z|)    
## (Intercept)                                        -17.298   <2e-16 ***
## CD4p_CD25p_CD127m_TREG_PRIMARY_CELLSBivalent         1.898   0.0577 .  
## CD4p_CD25p_CD127m_TREG_PRIMARY_CELLSEnhancer         1.687   0.0916 .  
## CD4p_CD25p_CD127m_TREG_PRIMARY_CELLSFacultativeHet  -0.163   0.8708    
## CD4p_CD25p_CD127m_TREG_PRIMARY_CELLSQuiescent        0.704   0.4812    
## CD4p_CD25p_CD127m_TREG_PRIMARY_CELLSTranscribed      0.200   0.8417    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5536.5  on 16665  degrees of freedom
## Residual deviance: 5522.3  on 16660  degrees of freedom
##   (30 observations deleted due to missingness)
## AIC: 5534.3
## 
## Number of Fisher Scoring iterations: 6

There is a 60% increase of a credible set SNP being in an bivalent region and a 44% increase of a credible set SNP being in an enhancer region compared with constitutive heterochromatin.

CD4 CD25int CD127+ Tmem primary cells

## 
## Call:
## glm(formula = contained ~ CD4p_CD25INT_CD127p_TMEM_PRIMARY_CELLS, 
##     family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3329  -0.2808  -0.2690  -0.2677   2.6674  
## 
## Coefficients:
##                                                      Estimate Std. Error
## (Intercept)                                          -3.21344    0.10409
## CD4p_CD25INT_CD127p_TMEM_PRIMARY_CELLSEnhancer        0.34841    0.13355
## CD4p_CD25INT_CD127p_TMEM_PRIMARY_CELLSFacultativeHet -0.16155    0.16931
## CD4p_CD25INT_CD127p_TMEM_PRIMARY_CELLSLowConfidence  -0.31518    0.30003
## CD4p_CD25INT_CD127p_TMEM_PRIMARY_CELLSQuiescent      -0.08755    0.13587
## CD4p_CD25INT_CD127p_TMEM_PRIMARY_CELLSRegPermissive   0.25199    0.18357
## CD4p_CD25INT_CD127p_TMEM_PRIMARY_CELLSTranscribed    -0.09757    0.13237
##                                                      z value Pr(>|z|)    
## (Intercept)                                          -30.871  < 2e-16 ***
## CD4p_CD25INT_CD127p_TMEM_PRIMARY_CELLSEnhancer         2.609  0.00908 ** 
## CD4p_CD25INT_CD127p_TMEM_PRIMARY_CELLSFacultativeHet  -0.954  0.34001    
## CD4p_CD25INT_CD127p_TMEM_PRIMARY_CELLSLowConfidence   -1.050  0.29349    
## CD4p_CD25INT_CD127p_TMEM_PRIMARY_CELLSQuiescent       -0.644  0.51931    
## CD4p_CD25INT_CD127p_TMEM_PRIMARY_CELLSRegPermissive    1.373  0.16985    
## CD4p_CD25INT_CD127p_TMEM_PRIMARY_CELLSTranscribed     -0.737  0.46107    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5523.2  on 16658  degrees of freedom
## Residual deviance: 5499.7  on 16652  degrees of freedom
##   (37 observations deleted due to missingness)
## AIC: 5513.7
## 
## Number of Fisher Scoring iterations: 6

There is a 41% increase of a credible set SNP being in an enhancer region compared with constitutive heterochromatin.

So I have genomic annotations for 9 CD4 cell types. I have overlaid my 95% credible set T1D SNPs to see if they are enriched in specific genomic regions. My results suggest that T1D SNPs are enriched the most in enhancer regions in CD4 CD25- IL17- Th stimulated MACS primary cells and CD4 CD25- Th primary cells.

Now, for each cell type I use quantile regression (0.97 quantile ~= PP=0.01) to see if the genomic annotations can be used to predict PP. I do this just for the 95% credible set SNPs for now, and give one example for the naive CD4 cell showing that the results are largely what we expect - activation marks increase the PP whilst repressive marks tend to decrease the PP (although this did vary across cell types).

## 
## Call: rq(formula = pp ~ CD4_NAIVE_PRIMARY_CELLS, tau = 0.97, data = d)
## 
## tau: [1] 0.97
## 
## Coefficients:
##                                       coefficients lower bd upper bd
## (Intercept)                            0.38813      0.13276  0.54419
## CD4_NAIVE_PRIMARY_CELLSEnhancer        0.00148     -0.13812  0.34553
## CD4_NAIVE_PRIMARY_CELLSFacultativeHet -0.20116     -0.27327  0.30193
## CD4_NAIVE_PRIMARY_CELLSQuiescent      -0.10894     -0.27766  0.66284
## CD4_NAIVE_PRIMARY_CELLSRegPermissive   0.05261     -0.26129  0.67906
## CD4_NAIVE_PRIMARY_CELLSTranscribed     0.08637     -0.16314  0.69575

Perhaps, I can increase the power by collapsing annotations into active (enhancer, transcribed, promoter, reg permissive) and inactive (quiescent, constit het, facult het) [although to do this I will have to remove “unsure” annotations (bivalent, unclassified, low confidence)].

I do this for 97th and 99th QR (approx PP=0.01 and PP=0.05).

So we see that the effect of the annotation on the PPs is stronger in some cell types (e.g. (8) CD4 CD25+ CD127- Treg) than others (e.g. (5) CD4 CD25- IL17).

5. Secret manuscript

5a. Introduction to MPRAs

Massively parallel reporter assays (MPRA) are typically used to measure the regulatory function of thousands of DNA sequences in a single experiment. A synthetic DNA construct is introduced into cells with the aim of measuring the functional consequence of a DNA sequence. The construct usually consists of a minimal promoter region (needs a functional enhancer to drive reporter expression), a report gene and a 3’ UTR which contains the barcode for the specific DNA sequence (allowing multiple DNA sequences to be analysed together).

MPRA data is produced from two parallel procedures: (1) RNA-seq to measure the number of transcripts produced from each barcode (2) DNA-sequencing to measure the number of construct copies of each barcode. Thus, we need to aggregate these data sources to get an estimate of the transcription rate.

Aggregated ratio: \(\dfrac{\frac{1}{n}\sum_i^n RNA_i}{\frac{1}{m}\sum_j^m DNA_j}\), but this is often dominated by a minority of barcodes with high counts.
Mean ratio: \(\frac{1}{n}\sum_i^n \frac{DNA_i}{RNA_i}\), but this is highly sensitive to noise (https://doi.org/10.1101/196394).

MPRAnalyze (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1787-z) accounts for the uncertainty in both RNA and DNA libraries and is a more statistically robust method of estimating sequence activity and differential activity.

Another complication of MPRA is that cell lines are typically used, and so the results may not be generalisable to what we would see when using primary cells. This is further confounded by discrepancies between the transcription factors found in the transfected cells and the unchromatinised form of the DNA - thus, appropriate cell-types should be used, ideally from primary cells.

5b. Replication of results

I’d like to see if I can obtain more accurate credible sets than those reported in the manuscript. However, I can’t find reliable information on how the regions and the SNPs within these are defined. Possible ideas:

Extending 50kb either side of lead-SNP
Using the haplotype region specified in supplementary table 1
Using the immunochip-regions.txt file

Or perhaps just google the regions (e.g. 1p34 is chr1:34300000-46300000 but this is 12000 Kb, whereas the other method give ~100 Kb regions?).

The ImmunoChip data files I have access to (share/Data/GWAS-summary/hg19_gwas_ic) only contain info on the P values. I can convert these to Z scores using qnorm(1-PValue/2) but I assume I’ll need to use a reference panel for the MAFs and SNP correlations? I’ve been using the UK10K data as a reference panel but have still encountered problems with the data:

RA (1p34): Lead-SNP (rs883220) not found in UK10K data. Only 2 variants in region using (i) or 15 using (ii).

MS (3p24): Some extremely small P values (9.15e-18) that give Z scores of Inf (qnorm(1-PValue/2)).

ATD (2p24): Only the lead-SNP (rs1534422) found in the region when using (i) and (ii).

T1D (4q32): Only the lead-SNP (rs2611215) found using (i) and 4 when using (ii).

T1D (14q32): 42 variants found when using 14:98485011-98499051 (ii) or 412 when using 14:98368955 98530673 (iii).

AS (5p13):

I also don’t know how I would go about doing the regions associated with multiple diseases.

Aim: Use their MPRA results to update the prior for Bayesian fine-mapping.

Comments

I think I have some pretty convincing results that functional genomic data (e.g. genome annotations and MPRA results) can be useful in the biological interpretation of GWAS SNPs, perhaps helping to find the relevent cell type for that SNP.
An example of the sort of question this could help is “which SNP is most likely to be causal given its genomic annotation in various cell types”.
I’d now like to start thinking about how we can use this information. There are two main directions:

fGWAS to update the PPs using priors
QR as in the prostate cancer paper to update PPs

Note that I also have the 1000 genomes annotation data that Pickrell has on github (https://github.com/joepickrell/1000-genomes), perhaps useful to see what annotations my credible set SNPs have in this dataset.

14 November

Anna Hutchinson

Contents

1. Enrichment of emission states and SNPs

1a. Modelling this difference (activated cells)

1b. Modelling this difference (non-activated cells)

1c. Credible set SNPs in activated or non-activated cells

2. Collapsing states

All SNPs

Credible set SNPs only

What about non-credible set vrs credible set

Summary of chromatin state movements between activated and non-activated cells

3. Distribution of chromatin states

4. New annotation data (Nobel et al. 2019)

“A unified encyclopedia of human functional DNA elements through fully automated annotation of 164 human cell types”

4a. Summary

4b. Implementation

CD4 memory primary cells

CD4 naive primary cells

CD4 CD25- CD45RA+ naive primary cells

CD4 CD25- CD45RA+ memory primary cells

CD4 CD25- IL17- Th stimulated MACS primary cells

CD4 CD25- IL17+ Th17 stimulated primary cells

CD4 CD25- Th primary cells

CD4 CD25+ CD127- Treg primary cells

CD4 CD25int CD127+ Tmem primary cells

5. Secret manuscript

5a. Introduction to MPRAs

5b. Replication of results

Comments