Motivation

See summary document here.


Contents

  1. Big regression model
  2. Relevance of number of active bases in fragment
  3. CHiCAGO scores
  4. Improving the resolution of interactions
  5. GWAS Posterior Probabilities
  6. Emission state probabilities

To help decide which values of \(\tau\) to use in the QR analysis, I check the values at the quantiles of my data.

0.95 0.97 0.99 0.999
MPPC 0.021 0.035 0.070 0.167
CHiCAGO 1.610 2.533 5.270 15.875
PP 0.002 0.010 0.051 0.407

1. Big regression


I run a multiple quantile regression using all of the emission proportions for a subset of 1,00,000 rows of the activated cell data.

The bait emission proportions all seem to have a small negative effect on MPPC, although most of these are not significant (especially for 95% quantile regression). Note that this sort of makes sense because these baited regions were chosen as they were near a TSS so are likely to be open anyway, whereas the prey fragment can be from anywhere in the genome.

The prey emission proportions all have varying effects, suggesting that the inference should be collapsed over several emission states.


Findings


2. Number of active bases in fragment


Last week I showed that the number of active bases (especially in the prey fragments) were significant predictors for MPPC value (at least in the extremes of the distribution). See plot below:

I investigate what the predicted MPPC value looks like as the number of active bases in bait and prey fragments increases. I use the log number of active bases (+1) and add jitter (to help with design matrix singularity problems due to identical x values). The results look very similar to the analysis on the proportion of the fragment that was active.


I investigate whether the information on the number of active bases is still required when we include information on the proportion of the fragment that is active (this analysis is for a subset of 500,000 rows of the data).

As indicated last week, the number of bait active bases is less significant for predicting MPPC, but the number of active bases in the prey fragment is significant, although notice that the coefficient estimate is very small for both 95% and 99% quantile regression.

The correlation between the number of active bases and the proportion of the fragment that is active is 0.61 in baits and 0.66 in preys.

I re-run the analysis but include an interaction between the bait/prey proportions and the bait/prey number of active bases.

In 95% QR:

In 99% QR:


ANOVA

The above analysis used a subset of 500,000 data points. To address this more formally, I use a subset of 1 million data points and I do an ANOVA test comparing three models for \(\tau=0.95, 0.99\):

  1. mppc ~ prey_prop + bait_prop
  2. mppc ~ prey_prop + bait_prop + log(prey_number+1) + log(bait_number+1)
  3. mppc ~ prey_prop + bait_prop + log(prey_number+1) + log(bait_number+1) + prey_prop:log(prey_number+1) + bait_prop:log(bait_number+1)

It seems that the more complex model is favoured, indicating that we should include information on the number of active bases and the interaction between this and the proportion in our analysis.


Stepwise regression

Recall that AIC includes a measure of how well the model fits the data and how complex the model is (trade-off between goodness of fit and parsimony), \[AIC=-2ln(l)+2k\]

where \(l\) is the MLE and \(k\) is the number of degrees of freedom.

The AIC of the 3 models for each \(\tau\) I consider are shown in the table below,

\(\tau\) Model 1 Model 2 Model 3
95% -4009537 -4015256 -4015754 (lowest)
99% -2526193 -2537192 -2537396 (lowest)

Adding even more complex model

But last week I found that the interaction between the proportion of the prey and the proportion of the bait fragment that is active was very important. I use ANOVA and stepwise regression to see if this even more complex model is favoured.

Note that the correlation between the number of bait and the number of prey bases that are active is very low, 0.062.

I consider a 4th model:

  1. mppc ~ prey_prop + bait_prop + log(prey_number+1) + log(bait_number+1) + prey_prop:bait_prop + log(prey_number+1):log(bait_number+1) + prey_prop:log(prey_number+1) + bait_prop:log(bait_number+1)

Indeed, it seems that this even more complicated model is favoured!!


\(\tau\) Model 1 Model 2 Model 3 Model 4
95% -4009537 -4015256 -4015754 -4033457 (lowest)
99% -2526193 -2537192 -2537396 -2547632 (lowest)

Findings

  • The “best” model for predicting MPPC value, from those analysed, is the most complex model including the proportion of the fragment that is active, the number of bases that are active and the interaction terms.

3. CHiCAGO scores

So far my analysis has focussed on the MPPC values, but it would be interesting to look at the relationship between CHiCAGO scores and chromatin state since CHiCAGO scores are derived using a different model.

To begin with, and for my own interest, I briefly look at the relationship between MPPC value and CHiCAGO score. The correlation coefficient is 0.427.


Big regression

As I did for MPPC value, I firstly run a multiple quantile regression using all of the emission proportions for a subset of 1,000,000 rows of the activated cell data.

It seems that chromatin state information is not as useful for CHiCAGO scores as it was for MPPC values (see top of this document). At the 95% level, all the prey emission proportions are significant predictors for MPPC but not CHiCAGO. At the 99% level we see kind of similar results between MPPC and CHiCAGO.


Binary measure of active chromatin

I follow the standard protocol and define active chromatin as states E4-11 and give a binary score of 1 if any of the fragment is allocated an active chromatin state.

For the plots below, I have added a black dashed line at a chicago score of 5 (\(log(chicago)=1.609438\)). I find that there is a slight increase in chicago scores in cases where the bait and prey fragment are in active regions of the genome.

For comparison, the plot below shows the same analysis but for MPPC value.

Notice the increase in the general distribution of both MPPC and CHiCAGO value for the prey fragment being in an active region, compared with the bait fragment. Again, showing that the information on chromatin state at the prey fragment is more informative than that at the bait fragment.

For both CHiCAGO and MPPC values I compare various quantiles for either the prey or the bait and the prey fragment being in active regions.

MPPC Median 75th 95th 99th
Prey in active region 0.00095 0.00305 0.04135 0.10690
Bait and prey in active region 0.00105 0.00375 0.04748 0.11615
CHiCAGO Median 75th 95th 99th
Prey in active region 0 0 3.09 9.20
Bait and prey in active region 0 0.02 3.57 10.26

This seems to provide evidence that chromatin states at the prey matter a lot more than that at the bait when considering a binary measure of active chromatin, but my 3D plots from last week show that this is not the case when we consider a continuous measure of the proportion of the prey fragment that is active. I.e. to really reap the benefits from both the bait and the prey, we need to consider a continuous rather than a binary measure.


Continuous measure of proportion of chromatin that is active

To investigate how extreme chicago scores vary as the proportion of the fragment that is active increases, I need to investigate which quantile to use in the quantile regression framework. To do this, I investigate the distribution of chicago scores, with a red dashed line where \(chicago=5\). I’ve added the histogram for the logged MPPC values for comparison.

Below, I plot chicago scores against quantile and mark on the 95th and 99th percentile. It seems that the quantile between these two would work well (0.97), however I keep results for \(\tau=0.95, 0.97, 0.99\) so that I can begin to build up a picture of the predicted tail of the distribution. 99% seems that it will be the most interesting as this is for interactions that meet the standard chicago threshold of 5.

The plot below shows quantile regression of the chicago scores against the proportion of the fragment that is active for \(\tau=0.95, 0.97, 0.99\). There seems to be a significant trend for larger chicago scores as the proportion of the fragment that is “active” increases, especially in prey fragments. These results are similar to what we saw for the MPPC values; that is, that there is more of an effect at the prey fragments than the bait fragments. This makes sense because baits are chosen because they contain a TSS and the fact that they are baitable implies that the chromatin is open anyway. This means that by nature, baits may be slighly more active than the preys.

Note that this increase is actually pretty big as the y axis scale is large. Plotted are 95%, 97% and 99% QR lines.


Again, I look at a 3D plot, including the interaction between the proportions for baits and preys. It looks exactly the same as the MPPC 3D plot looked?


Multiple regression

As I’ve done above for MPPC values, I investigate other predictors of chicago score using ANOVA and AIC values.

  1. chicago ~ prey_prop + bait_prop
  2. chicago ~ prey_prop + bait_prop + log(prey_number+1) + log(bait_number+1)
  3. chicago ~ prey_prop + bait_prop + log(prey_number+1) + log(bait_number+1) + prey_prop:log(prey_number+1) + bait_prop:log(bait_number+1)
  4. chicago ~ prey_prop + bait_prop + log(prey_number+1) + log(bait_number+1) + prey_prop:bait_prop + log(prey_number+1):log(bait_number+1) + prey_prop:log(prey_number+1) + bait_prop:log(bait_number+1)

The table below is the AIC value for the various models.

\(\tau\) Model 1 Model 2 Model 3 Model 4
95% 4769342 4766842 4765380 4752551 (lowest)
99% 6395325 6390585 6387989 6374929 (lowest)

Findings

  • The “best” model for both MPPC and CHiCAGO value, from those analysed, is the most complex model including the proportion of the fragment that is active, the number of bases that are active and the interaction terms.

  • Lots of evidence that the chromatin state information at the prey is more informative than that at the bait, but this is not shown in my more formal analysis (shown in QR ggplots but not in formal model comparison methods).


4. Improving the resolution of interactions

So I’ve found that chromatin state information (summarised by ChromHMM) could be beneficial to improve or re-weight MPPC/CHiCAGO values.

How could I go about doing this?

I originally through that I could improve the resolution of the prey fragments by only taking those regions that are active, however when looking at examples it seems that many adjacent prey fragments all seem to be interacting, generating a peak consisting of many adjacent hindiii prey fragments.

For the example on the right, the whole genomic region chr1:1239427-1350443 seems to be interacting with the bait (the big peak), which consists of ~25 hindiii prey fragments.

E.g. the interacting region (interacting adjacent preys) may look like:

Where for each hindiii fragment we have an MPPC value and a CHiCAGO score. For this region, I look at how the CHiCAGO and MPPC values align up with the proportion of that hindiii fragment that is active, the number of active bases and the product of the two.

Perhaps we could use this information to improve the resolution of the interacting prey fragment. For example, using the QR method in the prostate cancer to re-weight MPPC values based on the number/proportion of the prey fragment that is active (quantified using ChromHMM output).


5. Integrating GWAS PPs with chromatin landscape


Rather than investigating how chromatin state relates to chromosome contact data (MPPC values and CHICAGO scores), I investigate how chromatin state relates to posterior probabilities of causality from fine-mapping.

Integrating chromatin landscape information with SNP data may help to answer some the standard follow-up questions after a causal SNP has been identified by fine-mapping. For example, whether the SNP falls in an active region of the genome in specific cell types (ATAC-seq), whether the SNP is associated with any transcription factors or histone marks in specific cell types (CHiP-seq) and what regions of the genome/ genes the SNP is physically interacting with in specific cell types (CHi-C).

I use the T1D-associated variants that I analysed in my previous project (immunochip variants, approximately 16600 of these). I store the PP for that variant along with the chromatin state of the genomic region the variant lies in (at the ~200bp scale) from the activated and non-actived CD4+ T cell data seperately. There were several variants that fell on the boundary of two adjacent regions that were allocated different states, so I dropped these variants from the analysis.


Emission state enrichment

  • None of the GWAS variants are allocated to states E12 or E13.

  • Most of the variants were allocated to E14 or E2, which (from the emission parameter plot) are not indicative of any histone marks.

  • 18.79% of the variants fell in any E4-E11 region in the activated cells whilst 18.66% of variants fall in active regions in the non-activated cells.


Comparison with CHEERS

The CHEERS paper found GWAS variants were enriched in active regions of the genome (defined using ATAC-seq and chip-seq H3K27ac data) in activated cell states rather than non-activated CD4+ T cells.

I follow their analysis to see if I see similar results with our data. To do this, I pick out the SNPs that are in the standard 95% credible sets (and those that were not in credible sets) and compare their chromatin state allocation between activated and non-activated cells.

  • States E4 and E7 seem enriched in activated cells, but not the other active marks.

  • Note that E4 is the only state that is relatively strongly indicative of H3K27ac (enhancer mark) and no other marks - perhaps this explains the enrichment for this mark amongst GWAS variants in activated CD4+ T cells.

  • E4 is more highly enriched in credible set variants.



Posterior probabilities of causality

There doesn’t seem to be much difference between variant’s PPs and emission states when looking at these independently across all emission states for activated and non-activated cells.


Multiple quantile regression

For the activated cells data, I fit a multiple quantile regression of PP against emission state, using E14 as the baseline since this is the state that most variants are allocated to, and it is not indicative of any chromatin marks (see emission parameters heatplot).

In order to decide what values of \(\tau\) to use, I see what the quantiles of PP are:

  • 95% is \(PP=0.002\)

  • 97% is \(PP=0.01\)

  • 99% is \(PP=0.051\)

I therefore run the analysis for \(\tau=0.97, 0.99\) as these correspond to “interesting” thresholds at \(PP=0.01\) and \(PP=0.05\), respectively.

The results imply that:

  • E1, E11, E8 and E5 are not significant predictors of PP (relative to E14).

  • The effects of E4-E11 (the “active states”) are not all equal (providing evidence to look at each of these individually and see how indicative each of them are of chromatin marks - see next section).

  • The inference varies with the value of \(\tau\) (possible link to prostate cancer paper later where we can average results over various \(\tau\)).

  • The effect sizes of the emission states are generally much bigger in the 99% analysis.


Binary active chromatin mark

I investigate aggregating E4-11 marks into “active chromatin”, but I am wary of the fact that only ~18% of variants were allocated these active chromatin marks.

There is a slight increase in the PPs for those variants falling in active chromatin regions.


QR with \(\tau=0.97, 0.99\) shows that there is a slight trend for higher PPs using a binary measure of chromatin activity.


Difference between activated and non-activated cells

I have data for activated CD4+ T cells (cultured for four hours with anti-CD3/CD28 beads) and non-activated CD4+ T cells. I know from the literature that T cell activation induces major chromatin remodeling. We also know that T1D-associated variants are enriched in CD4+ T cells. Above, I show the respective chromatin states that T1D-associated variants are allocated to in activated CD4+ T cells.

I now investigate the difference in the emission states that GWAS variants for T1D are allocated to for activated and non-activated CD4+ T cells.

Of the 16692 SNPs that could be matched in the anlysis, 5424 (32.5%) differed in their chromatin state allocation. These differences can be visualised by the table (non-activated on top) and plot below, where the size of the point corresponds to the frequency of the chromatin state in the respective cell types.

##      
##         E1   E2   E3   E4   E5   E6   E7   E8   E9  E10  E11  E14  E15
##   E1   807  178   16    0    0    0    0    0    0    0    6    1    0
##   E2   273 3652    5   39   14    0    1    2    0    4  287  614    0
##   E3    75   10  123   14   34    3    0    0    0    0   25    0    0
##   E4    18  422   21  183  375   23    1    1    0   10  352   28    0
##   E5     0    0    3    2  253   31    0    0    0    2   37    0    0
##   E6     0    0    1    0   83  166    2    2    0   18   14    0    0
##   E7     0   14   14   17   76  301  281   31    0   41   32    0    0
##   E8     0   14    0    4    2   14   13   37    0   40   11    1    0
##   E9     0    0    0    0    0    0    1    2   31    2    0    0    2
##   E10    0    0    0    0    1    1    0    0    0    5    3    0    0
##   E11    1    6    0    0   24    1    0    0    0    0   65    0    0
##   E14    1 1043    0   17    3    0    0   10    0    5   54 5459   50
##   E15    0   14    0    0    0    0    0    0    5    0    1  505  206


The following plot shows that there is a slight increase in the PPs of those SNPs that differed in the chromatin state that they were allocated to. The mean PP of variants that switched chromatin state is 0.003 whilst is 0.002 for the variants that do not switch state.


It should be noted that the chromatin allocation is not for the same fragments between the activated and non-actived cells. I.e. the genomic region a SNP falls in in the chromhmm results may be chr7:50869400-50924800 for activated cells but chr7:50861800-51004000 for non-activated cells.

Below, we see that the size of the fragments are slightly larger for non-activated cells, perhaps this shows that there are more identical consequtive marks in non-activated cells (so that these get grouped). I have also found that the size of the fragment is not related with whether the chromatin state switched between activated and non-activated cells.

summary(act_cut$size)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     200    4800   15000   32203   37200  343600
summary(non_cut$size)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     200    4400   15600   43865   44400  471400


Integrating information on chromatin contacts

For each of the GWAS variants, I find the relevant hindiii fragment and see if there is any useful information on these in the CHi-C results.

For the 16696 SNPs, 1611 could not be mapped to hindiii fragments and the remaining SNPs mapped to 2321 unique hindiii fragments (meaning that many SNPs are positioned in the same hindiii fragment). Of these, 2308 (99.4%) were prey fragments captured in the CHi-C experiment and 126 were bait fragments captured in the CHi-C experiment.

I consider a single example. The SNP with the largest PP is imm_19_10324118 (rs34536443) with \(PP = 0.998519\). I find that in the activated CHi-C dataset, this interacts with 64 bait fragments (chosen as they have a TSS in them - relate to genes) and that 34 of these are within 1Mb of the prey fragment. The CHiCAGO scores and MPPC values for these interactions are plotted below. In the non-activated, there were 68 interactions and 35 of these were within 1 Mb.

I’ve included information on the proportion of each of the bait fragments that is active and the number of bases that are active within the bait fragment. Perhaps we could use this information to help map GWAS variants to their target genes using CHi-C data and integrating chromatin state data.

HOWEVER: My analysis has shown so far that information on the chromatin state at the prey fragment is more helpful than that at the bait fragment, but here we’d be using information at the bait fragment (because the hindiii fragment containing this SNP is a prey fragment). Note that this is the opposite to conventional CHi-C analysis where we plot the preys interacting within a certain distance of the bait.

Activated:

Non-activated:


Findings

  • ChromHMM states seem to provide some relevant information for GWAS PPs.

  • Chromatin marks indicative of enhancers were kind of enriched in GWAS variants

  • There is not much difference between the relationship of chromatin state and PP between non-activated and activated CD4+ T cells.

  • I can see similarities with my results and those reported in CHEERS.

  • Could perhaps use chromatin contact data to link GWAS variants to their target genes, integrating information on chromatin state.


6. Emission state probabilities

I now have access to the data of the emission probabilities, i.e. the probability of observing that chromatin mark given that you’re in that state, and also the transition probabilities.

The top of the data frame is shown below, for each emission state, each chromatin mark is given a probability of it being present.

##    E c_mark_binary   c_mark present         prob
## 2  1             0 H3K27ac        1 5.246946e-03
## 4  1             1 H3K27me3       1 8.138037e-05
## 6  1             2 H3K36me3       1 4.245271e-01
## 8  1             3 H3K4me1        1 3.015497e-03
## 10 1             4 H3K4me3        1 1.868209e-04
## 12 1             5 H3K9me3        1 3.264689e-04

So it seems that e.g. E5 and E6 (very high emission probability for 2 activion marks) should be upweighted more than E11 (only moderately high for 1 activation mark). Also, E9 is indicative of H3K27me3, so perhaps we should be cautious of this state.


Ideas for weighting

Idea: Upweight/ downweight emission proportions based on some weighting of the emission states. This would incorportate information on how much of the fragment is active and also how indicative of an activity that mark is. BUT I am very wary of this approach because different combinations of marks mean different things!!! E.g. H3K4me1 and H3K27ac - indicative of enhancer, H3K27ac and H3K4me3 - indicative of TSS. I need some information about the annotated chromatin states from ChromHMM (I think this is spat out in the analysis rather than having to think about it manually).

My idea is:

  1. Take the mean emission probability over the three activating marks and use this to upweight those marks. For example, the “best” a fragment can be is 100% E6.

  2. Take the mean emission probability over the two (not sure about H3K36me3) repressive marks and use these to downweight those marks.

For example, E12 states would get upweighted a bit and then downweighted a lot.


So the weights would look something like this…

  • The best a fragment can be (most indicative of active chromatin) is 100% E6

  • Notice that this looks similar to the mean emission probability plot for active marks above, but e.g. E12 has dropped right down.


I do a test run and see what happens in the quantile regression framework for MPPC values:


and for CHiCAGO scores:


This idea obviously needs a lot of work, particularly thinking about the effects of combinations of histone modifications. Need to chat to Chris before I continue thinking about this.


Questions + comments

  • I should have some information on the state annotation for each of the E1-E15 emission states from ChromHMM (which I could use for my weighting).

  • Perhaps I don’t want the chromhmm data and actually want the peaks from MACS2. Can then overlay peak and bait regions rather than having to decode ChromHMM results…

  • Why is the mean size of the higher resolution ChromHMM regions a lot bigger than 200bp (in /mrc-bsu/scratch/cew54/peaky/annot/hind-position.RData)? Do they aggregrate adjacent regions with the same allocation?

  • Explanation of the binning step in Hi-C/MPPC etc.

  • Have got an email response from Blagoje..

  • SEGEG talk: Get rid of PTPN22 example?

  • Chris and Sylvia mentioned that all the P values from QR are wrong, because something about degrees of freedom, what does this mean?


  • Main areas I’m looking at:
  1. Integrating chromatin state information (ChIP-seq data piped through ChromHMM) with DNA contact data (CHi-C data).

    • If prey region is active/ open then it is more likely to interact with the bait.

    • Potential to improve the resolution of contact data (although Peaky already does this very well).

  2. Integrating chromatin state information with GWAS PPs.

    • Could we reweight PPs with information about the chromatin landscape at that SNP?
  3. Could we integrate chromatin state data, chromatin contact data and GWAS PPs?

    • If I can think of a way to upweight PPs based on chromatin landscape, then I could use contact information to link to target genes.

So I basically want to integrate 3 scales of data (note that the scale of chromatin landscape data can vary… surely this is on the per base scale if we don’t use ChromHMM though?):


Additional work before meeting

I run regression models to see if information on the length of the fragment is important. When we add information on the length of the fragment, information on the number of active bases becomes less important.

NOTE THAT THE COEFFICIENTS ARE 0 BECAUSE THE SIZES ARENT LOGGED!


MPPC

Standard linear regression model:

  • All estimates are tiny.

Quantile regression models:


CHiCAGO

Standard linear regression model:


Quantile regression models:



1. Quantiles to use

MPPC

MPPC values of interest are 0.01, 0.05 and 0.1.

Activated dataset:

  • 91st: 0.0111 (91.47th is 0.01)

  • 98th: 0.04695 (98.18 is 0.05)

  • 99.5th: 0.09595 (99.55 is 0.1)

Non-activated dataset:

  • 92nd: 0.011 (91.51th is 0.01)

  • 98th: 0.0475 (98.15th is 0.05)

  • 99.5th: 0.0982 (99.52 is 0.1)

CHiCAGO

CHiCAGO values of interest are 3 and 5 (the authors say that 3 is suggestive and 5 is significant).

Activated dataset:

  • 97.5th: 2.91 (97.60th is 3)

  • 99th: 5.27 (98.90th is 5)

Non-activated dataset:

  • 97.5th: 2.88 (97.63th is 3)

  • 99th: 5.17 (98.94th is 5)

PP

PP values of interest are 0.01 and 0.05.

  • 97th: 0.0095 (97.08th is 0.01)

  • 99th: 0.0508 (98.97th is 0.05)


2. Length of fragment or number of active bases?

It would be best to have the length of the fragment in our model rather than the number of active bases. I compare the AIC for models with both measures for 98% QR models using a subset of 3 million rows from the data.

The AICs are very similar but the model with the lowest AIC is that with the proportion of the fragment that is active and the number of active bases.

Model (0.98 QR) AIC
mppc~prey_prop_act+bait_prop_act+prey_length+bait_length -9252910
mppc~prey_prop_act+bait_prop_act+log(prey_length)+log(bait_length) -9251765
mppc~prey_prop_act+bait_prop_act+prey_no_act+prey_no_act -9251857
mppc~prey_prop_act+bait_prop_act+log(prey_no_act+1)+log(prey_no_act+1) -9276192 (lowest)