17th October

Notes:

Peaky improves the resolution of chromatin contact data and provides a measure of uncertainty of the location of the direct contact(s). The posterior probability that there is a contact in a given region can be calculated by summing the posterior probabilities (MPPCs) in that region.
Struggling to follow the maths of the quantile regression method used in the prostate cancer paper. Need to speak to Paul N about this when he returns.

ChromHMM
Data Exploration and Analysis
Quantile Regression
Quantile Regression for proportion of fragment that is active
Quantile Regression for number of bases in fragments that are active
Summary and next steps

1. ChromHMM

Hidden Markov model

Notes from https://www.youtube.com/watch?v=TPRoLreU9lA

ChromHMM is a software program to transition from chromatin marks (observables) to chromatin states (unobservables). The input to the algorithm is data on chromatin marks (e.g. Chip-seq data - specifically, the list of aligned reads for each chromatin mark which are automatically converted into presence or absence calls for each mark across he genome, based on a Poisson background distribution) which is then used in a hidden Markov model (HMM) to find chromatin (“emission”) states and emission probabilities, which measure the probability of observing that histone mark given that you’re in that state.

A multivariate HMM (each emission state (Z) can correspond to multiple observables (X)) is used which allows the probabilistic modelling of both:

The combinatorial presense/ absence of multiple marks (emission parameters)
Spatial constraints of how these mark combinations occur relative to each other across the genome (shown in the transition matrix).

The resolution of chip-seq data is ~200bp (which is also roughly the size of nucleosomes) and therefore ChromHMM assigns a chromatin (emission) state to each 200bp region of the genome. Within each of these 200bp regions, the emission distributtion is a product of independent Bernoulli random variables (0s and 1s). For each genomic interval, ChromHMM determines the presence or absence of each mark on the basis of the significance of the observed count of sequencing reads relative to a Poisson background distribution.

The results from ChromHMM for the data I’m using are shown below.

2. Data analysis

I have removed rows from the chromHMM data with any NA values and added a column for the size of the hindiii fragment. This data is in /peaky/data_new/cut_chromhmm.RDS and has dimension \(741,316\times20\) (the full one, without NAs omitted is chromhmm.data.RDS and has dimension \(837,157\times20\)).

I then integrate the data on chromatin state with the peaky results. I generate a data frame which has the peaky data and the emission state proportions for the relevant bait and prey. Missing values have been removed. This is saved in /peaky/data_new/full_res.RDS and has dimension \(14,936,678\times38\).

Using this data, I investigate the relationship between the proportion of fragment allocated to each emission state value and MPPC for baits and preys seperately (for a sample of 500,000 rows). These plots are avaliable on github (https://github.com/annahutch/PhD/tree/master/pngs) but note that the \(x\) axis label is incorrect, this should be “emission state proportion”.

I breifly check that each of these plots make sence biologically:

E1: Indicative of H3K36me3 - little correlation between MPPC value and proportion of fragment allocated E1 mark (perhaps positively correlated with MPPC in preys)
E2: Not indicative of any histone modification - wide spread of emission proportions amongst a wide spread of MPPCs, little correlation with MPPC in bait or prey
E3: Indicative of H3K36me3 and H3K27ac (activating) - little correlation with MPPC in bait or prey
E4: Indicative of H3K27ac (activating) - little correlation with MPPC in bait or prey
E5: Strongly indicative of H3K27ac and H3K4me1 (both activating) - Generally low emission state proportion and little correlation with MPPC in bait or prey
E6: Strongly indicative of H3K4me3, H3K27ac and H3K4me1 (all activating) - Positively correlated with MPPC in bait, not so much in prey. So that if the fragment is ~100% E6 then MPPC is higher (at least in baits)
E7: Strongly indicative of H3K4me3 and H3K27ac (both activating) - Slight positive correlation with MPPC in both bait and prey. So that if the fragment is ~100% E7 then MPPC is higher
E8: Strongly indicative of H3K4me3 (activating) and weakly indicative of H3K4me1 and H3K27ac (both activating) - little correlation with MPPC, perhaps slight positive correlation amongst preys and negative amongst baits
E9: Strongly indicative of H3K4me3 (activating) and weakly indicative of H3K27me3 (repressive) and H3K4me1 (activating) - little correlation with MPPC in bait or prey
E10: Strongly indicative of H3K4me3 (activating) and H3K4me1 (activating) - Generally has low emission proportions (\(<0.25\)), positively correlated with MPPC in preys
E11: Strongly indicative of H3K4me1 (activating) - slight positive correlation with MPPC, especially among baits
E12: Strongly indicative of H3K36me3, H3K9me3 (repressive), H3K27me3 (repressive), H3K4me1 (activating) - Generally has very low emission proportions (\(<0.1\)) slight negative correlation with MPPC.
E13: Very weakly indicative of H3K9me3 (repressive) - many more prey with E13 value of 1 relative to baits
E14: Not indicative of any marks - little correlation with MPPC in bait or prey
E15: Weakly indicative of H3K27me3 (repressive) - little correlation with MPPC in bait or prey

This analysis has shown that looking at the proportion of the fragment which occupies each emission state individually is not very informative of MPPC, and provides a good incentive to collapse results over multiple emission states, e.g. those that are indicative of activating or repressive marks. It also shows that modelling the mean is not very informative, and provides evidence to use quantile regression, which would help to better distinguish between the extremes of the distribution.

Which states to aggregate over?

We have ChIP-seq data for 6 histone modifications:

H3K36me3: Originally thought to represent actively transcribed regions but results suggest that it can be associated with heterochromatin¹ (unknown)
H3K9me3: Found in constitutive heterochromatic regions (repressive)
H3K27me3: Associated with the downregulation of nearby genes via the formation of heterochromatic regions (repressive)
H3K4me1: Regulates gene expression (activating)
H3K27ac: Associated with higher activation of transcription (active enhancer mark)
H3K4me3: Recruits the chromatin remodelling factors that open chromatin and prevent the binding of repressive complexes (activating)

Do the combination of histone modifications change their biological meaning²? Is there any data on this?

Quantifying “active chromatin”

The pink regions representing active chromatin in the peaky paper were derived if any of that region was allocated to emission states E4-11, which are indicative of activating marks, thus the chip-seq and chromHMM data was converted to binary format. I assume that these states were chosen to reflect active chromatin as these states are only indicative of activating marks (H3K4me1, H3K27ac and/or H3K4me3).

I investigate the relationship between MPPC value and active chromatin, where active chromatin is defined as in the peaky paper.

This method seems to throw away useful information on how indicative of the activating histone mark that emission state is (i.e. how dark the blue is in the emission parameter plot), what percentage of the hindiii fragment is that state and how many nucleotides that state occupies.

I now have access to the higher resolution chromHMM data (in /mrc-bsu/scratch/cew54/peaky/annot/Act_15_segments.bed) which allocates an emission state to each 200bp region of the genome. For each hindiii fragment (~ 4000bp) the proportion allocated to each emission state is derived to obtain the original chromHMM data I had.

For example, hind1 is chr1:2-16007 and is made up of E13 (2-10000), E12 (10000-10400), E2 (10400-14800) and E1 (14800-18200). The continuous numbers in the lower resolution chromHMM data refer to the proportion of the region that is occupied by that emission state.

This leads to obvious problems due to the varying size of hindiii fragments. Note that the bigger the hindiii fragment, the more resolution we loose. Need to look at spatial epidemiology (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1247193/) which combines data on different scales.

For example, hind1 is 25 times bigger than hind6 so the emission state allocations are difficult to compare (e.g. hind6 is 72% E14 but this is only a region of 445bp, whereas hind1 is only 7% E1 but this represents a region of 1200bp). This is probably why the data is converted to binary format. See image below:

Brain-storm

Aim: Think of a better way to use the chromHMM data that uses information on:

The proportion of the hindiii fragment allocated to that state/ which is active
How indicative of the chromatin mark that state is (this could be very hard and I don’t have the data used to generate the emission parameter plot). E.g. the probability of observing H3K4me1, H3K27ac or H3K4me3 is way higher in E6 than other states that we’re considering active (e.g. E4), so that if a bait/prey has a larger proportion allocated to E6 then this is more indicative of open chromatin than say a small proportion allocated to E4.

Ideas:

For states E4-11, the combined sum of emission proportions must exceed X% (so that X% of the fragment is active).
For states E4-11, the mean of the emission proportions must exceed Y.
Why are we even converting to binary in the first place? Is there a way to keep it continuous?

Difficulties:

Arbitrary choices - “what proportion of the fragment must be allocated an active state for me to call the whole fragment active”
Each ~200bp region is allocated a state, so the larger the fragment the more different states it will consist of - probably contributing some bias?
The emission state proportions are correlated (must sum to 1)

Proportion of fragment that is active

At the moment, a fragment is considered active if ANY of it is allocated an active emission state (E4-11), whether this be 1% or 100% of the fragment. Instead, I look at the relationship between MPPC value and what proportion of the fragment is allocated an “active chromatin state”.

In the plots below, fragments are active if >X% of the fragment is allocated to an active state and inactive otherwise.

This motivates the use of quantile regression to establish this relationship at the extremes of the distribution.

Let’s investigate the distribution of emission state proportions.

We see that the proportions vary substantially, e.g. E13-15 often have very high emission proportions (3rd quantile ~1) whereas other emission states generally have lower proportions (e.g. 3rd quantile for E6 is 0). Also, some emission states are allocated more often than others. Perhaps we could take a top quantile for each of the emission state proportions and see what happens to the MPPC when the proportions fall within this.

Although note that the distribution of proportions for E4-11 look pretty similar to each other and similarly for E1-3, E12-15. So, interestingly a fragment is more consist solely of any of the inactive states (e.g. E13-E14) than the active states (e.g. E8-E9). This is something to do transition probabilities between states.

But do we actually care about the distribution of the emission proportions?

3. Quantile Regression

Frequentist

Quantile regression will allow us to examine the relationship between chromatin state and the different parts of the MPPC distribution. For example, to see what affect chromatin state has on the MPPC for the 95% percentile of MPPC values. It is also more robust to outliers. The motivation for quantile regression is usually that there is some departure from the assumption of IID error terms.

The following is taken from the bayesQR paper, https://www.jstatsoft.org/article/view/v076i07.

Consider the standard linear model, \[y_i=x_i^T\beta+\epsilon_i\] The conditional mean model is obtained by assuming that \(E(\epsilon|x)=0\). The likelihood is then, \[l(x_i,y_i,\beta, \epsilon, \sigma)=\Pi_{i=1}^{n} \dfrac{1}{2\pi \sigma^2}exp\left(-\dfrac{(y_i-x_i^T\beta)^2}{\sigma^2}\right)\] and the log likelihood is, \[L(x_i,y_i,\beta, \epsilon, \sigma)\propto -\sum_{i=1}^n(y_i-x_i^T\beta)^2,\]

So we see that to maximise the log likelihood (a negative number), we can convert it to a minimisation problem - in which case the regression coefficients can be obtained by solving \[\hat\beta=argmin_{\beta}\sum_{i=1}^n (y_i-x_i^T\beta)^2.\]

Alternatively, if we assume that \(Median(\epsilon|x)=0\) then we get the conditional median model and the coefficients are obtained by solving, \[\hat\beta=argmin_{\beta}\sum_{i=1}^n |y_i-x_i^T\beta|.\]

Quantile regression is extended from this for all other quantiles of interest: \[\hat\beta_{\tau}=argmin_{\beta}\sum_{i=1}^n \rho_{\tau} (y_i-x_i^T\beta)\] where \(\rho_{\tau}(x)=x(\tau-I(x<0))\).

The quantreg R package uses frequentist methodology to estimate quantile regression coefficients. It offers 3 methods to solve the above minimisation problem:

Simplex method (default) using method=br: Uses the fact that solutions are focused on the vertices of the constraint set.
Frisch-Newton interior point method using method="fn" or method="pfn" for very large problems: Instead of focusing on the vertices, it traverses the interior of the feasible region.
Sparse regression quantile fitting using method="sfn" or rq.fit.sfn: Sparse version of the above, it is efficient when the design matrix has many zeros (e.g. if the predictors contain several factors).

Bayesian

Bayesian quantile regression uses an Asymmetric Laplace (AL) likelihood to estimate model parameters. Solving the minimisation problem above is equivalent to maximising a regression likelihood using ALD:

\[f(y | x_i^T\beta, \sigma, \tau)=\frac{\tau(1-\tau)}{\sigma} \exp \left(-\rho_{\tau}\left(\frac{y- x_i^T\beta}{\sigma}\right)\right).\]

Priors can then be placed on \(\beta\) and \(\sigma\) and the posterior distribution can be represented as,

\[\psi(\beta,\sigma|y,x,\tau)\propto \pi(\beta,\sigma)\Pi_{i=1}^n \text{ALD}(y_i | x_i^T\beta, \sigma, \tau).\]

Standard Bayesian methods can be used to approximate this (e.g. MCMC).

Implementation

We need to think about which quantile to use for our quantile regression. To investigate this I look at a histogram plot of the (log) MPPC values for a random sample of 500,000 points.

I also plot MPPC value against quantile and mark on the 0.95 and 0.99 quantiles. It looks like the 95th quantile will do, which is a value of \(MPPC=0.02095\) (the 99th is \(MPPC=0.0701\)).

quantreg

quantreg uses frequentist methodology to estimate quantile regressions.

I perform frequentist quantile regression (quantreg) for the 95th, 99th and 99.9th quantiles for BOTH bait and prey falling in active chromatin regions (standard binary conversion of E4-11) using the simplex method for minimisation.

Do the results differ if we use the Frisch-Newton interior point method or sparse regression quantile fitting for the minimisation? The results don’t differ between the simplex and Frisch-Newton method but they do when using the sparse regression method (this is because our design matrix isn’t sparse so we shouldn’t be using this anyway…).

bayesQR

I perform Bayesian quantile regression using bayesQR. This function places weak priors on \(\beta\) and \(\lambda\) and using a Gibbs sampler to estimate the model parameters.

I had problems with an error message saying that I had missing values in my data, it turns out that I needed to scale my response variable (meaning I have to unscale the coefficient estimates).

Comparing frequentist and Bayesian QR

The red dashed line is the OLS line, the blue is the median line, the grey is the frequentist 95% (and 99%) quantile line and the black is the Bayesian 95% (and 99%) quantile line. We see that there is litle difference between the frequentist and Bayesian estimates. The frequentist functions are much quicker than the Bayesian functions so I opt to use these.

Summary from QR work:

Motivation: When plotting the mean MPPC value against each emission state proportion (using geom_smooth) there was little relationship between the two, and the fitted lines were largely flat for both the bait and prey fragments [see plots here]. I found similar results even after aggregating emission state proportions into a binary measure of active chromatin in both the bait and prey, that is, when plotting MPPC against the standard binary indicator of active chromatin (that had to be present in both the bait and prey), the mean MPPC line was basically flat across increasingly active chromatin states. I then investigated the distribution of MPPC values across inactive and active regions and found that there were differences in the median and the higher extreme MPPC values.
Method: Investigate the distribution of MPPC values to decide which quantile to look at. Understand both frequentist and Bayesian quantile regression and their respective R packages. Implement frequentist and Bayesian quantile regression and whether they give different results.
Findings: Frequentist and Bayesian methods give very similar results, quantreg is a lot quicker, so I opt to use this. Even when looking at the 99th percentile, the MPPC only increases from 0.06225 to 0.1173 (increase of 0.055).

4. Quantile regression for proportion of fragment that is active

Rather than converting the continuous emission state proportions into a binary measure of active chromatin, I investigate whether we can use the continuous emission state proportions directly.

Firstly, I investigate the distribution of the proportion of the fragment that is allocated to an active chromatin mark (E4-11).

I consider the relationship between MPPC value and the proportion of the fragment that is allocated to an active chromatin state (E4-11). In the plot below, the blue line is that from geom_smooth which uses a GAM to model the mean MPPC values, and the red lines are the results from .5, .95, .99 and .999 quantile regression.

Multiple quantile regression

I also investigate the mutliple quantile regression (99th percentile), where the proportion of active marks for both baits and preys are used as predictors.

I am pretty sure that an interaction term between the bait and the prey is required, but to double check this I make a quick interaction plot. The non-parallel line shows that an interaction term is required.

The model I fit is shown below, along with the summary of the fitted model.

freq <- rq(mppc ~ prey_prop_act * bait_prop_act, tau=.99, data = res)

So if we know that the bait is in a largely active region (which can be checked as probes were developed to capture these regions in the experiment), then whether the prey is active or not will really vary the MPPC value, and vice versa. Alternatively, if one of the bait or prey is not in an active region (or in a region that is a small percent active), then the MPPC for it’s interaction will never increase by much (this makes sense as biologically we want the bait and prey to both be in largely active regions or it’s unlikely they’d be an actual interaction anyway).

5. Quantile regression for number of bases in fragments that are active

One way to incorporate the size of the fragments into the analysis is to use the actual number of bases that are allocated to active chromatin states. [I am wary of this method though…]

Below shows the distribution for the number of bases that are given an active chromatin state within each fragment. Note that these vary massively due to the variability in the size of the fragment and the percentage allocated to an active chromatin mark.

I investigate the relationship between MPPC and the number of bases in the bait/prey that are active. There is a lot more of a positive relationship between the number of active bases than the proportion of the fragment that is active and MPPC, why could this be? This relationship is larger amongst prey too, why?

Spatial epidemiology paper

“The reduced size of the population at risk leads to small numbers of events and unstable risk estimates” - proportions less meaningful in smaller hindiii fragments
“At the broader scale, purely local variations in data quality are likely to largely cancel out, whereas at the small-area scale, these variations could lead to serious biases if not detected.” - weird allocation of emission states matter less in larger hindiii fragments
“The modifiable areal unit problem (MAUP) is a source of statistical bias that can significantly impact the results of statistical hypothesis tests. MAUP affects results when point-based measures of spatial phenomena are aggregated into districts, for example, population density or illness rates.” - point-based measures in our case are the emission state allocation to 200bp regions and districts would then be the hindiii fragments.

6. Summary and next steps

The main questions I am trying to address are:

Is there a better way to use the ChromHMM data to characterize active chromatin marks?
- The current method converts the values to a binary indicator of whether ANY of the fragment is allocated an active chromatin state (whether this be 1% or 100% of the fragment).
- I have looked at using the proportion of the fragment that is allocated to an active chromatin state (E4-11).
- Need to bare in mind:
  - correlations between emission proportions (they sum to 1, obviously)
  - varying sized fragments (links to spatial epi stuff, that weird allocation of emission states matter less in larger hindiii fragments)
  - the distribution of proportions vary across emission states
How can we utilise data on chromatin state AND chromatin contacts?
- QR to update chromatin contact data using chromatin state data?
What is my main aim here?