1. Original peaky model


Peaky jointly models neighbouring prey fragments to fine-map CHi-C interaction data.

The peaky model for a contact between bait, \(b\), and prey, \(p\), is

\[\begin{equation} Y_{bp}\sim N(\mu_{bp}, 1), \\ \mu_{bp}=\sum_{q:\gamma_{bq}=1}\beta_{bq}\exp(-\omega d(p,q)). \end{equation}\]

where \(Y_{bp}\) is the adjusted read count (the negative binomial, NB, residual) between \(b\) and \(p\); \(\gamma_{bq}\) is a binary indicator of a direct contact between \(b\) and fragment \(q\); \(\beta_{bq}\) is the corresponding contact strength; \(\omega\) is a decay parameter; and \(d(p,q)\) is the genomic distance between fragments \(p\) and \(q\). So \(\mu_{bp}\) measures the expected NB residual.

Peaky uses RJMCMC to sample posterior distributions for the contact indicators, \(\gamma_{bq}\), and contact strengths, \(\beta_{bq}\). The primary statistic for inference is the marginal posterior probability of contact (MPPC) between the bait and the prey, which is the proportion of sampled models in which \(\beta_{bp}>0\).
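As a toy illustration of the mean function, here is a minimal sketch (with invented fragment positions and invented \(\beta\) and \(\omega\) values) of how a single direct contact spreads signal to neighbouring prey fragments under the exponential decay:

```python
import math

# Hypothetical fragment midpoints (kb); fragment 2 is the single direct
# contact (gamma = 1) with contact strength beta. All values are invented.
positions = [0.0, 10.0, 20.0, 30.0, 40.0]
contact_q = 2          # index of the directly contacting fragment
beta = 5.0             # contact strength beta_{bq}
omega = 0.05           # decay rate per kb (illustrative value)

def mu(p):
    """Expected NB residual at prey p: beta * exp(-omega * d(p, q))."""
    d = abs(positions[p] - positions[contact_q])
    return beta * math.exp(-omega * d)

expected = [round(mu(p), 3) for p in range(len(positions))]
print(expected)  # [1.839, 3.033, 5.0, 3.033, 1.839]
```

The signal peaks at the contact itself and decays symmetrically with distance, which is why neighbouring fragments carry correlated information.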


Priors

\(\gamma_{bq}\) is a binary indicator of a contact between bait \(b\) and fragment \(q\),

\[\begin{equation} \gamma_{bq}\sim Ber(\theta_b). \end{equation}\]

We expect few direct contacts and thus encourage sparsity through the specification of a hyperprior on \(\theta_b\), which is the expected proportion of prey fragments that contact \(b\). \(F_b\) is the number of neighbouring prey fragments, so

\[\begin{equation} \theta_b\sim Beta(1, F_b). \end{equation}\]

So, as the number of neighbouring prey fragments increases, the prior probability of a direct contact at any given fragment decreases.
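A quick numerical check of this shrinkage, using the standard fact that a \(Beta(1, F_b)\) distribution has mean \(1/(1+F_b)\) (the \(F_b\) values below are illustrative):

```python
# Prior mean of theta_b under Beta(1, F_b): E[theta_b] = 1 / (1 + F_b),
# so more neighbouring prey fragments => smaller prior contact probability.
def prior_mean(F_b):
    return 1.0 / (1.0 + F_b)

for F_b in (10, 100, 1000):
    print(F_b, round(prior_mean(F_b), 4))  # 0.0909, 0.0099, 0.001
```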

A prior is also placed on the contact strength parameter \(\beta_{bq}\),

\[\begin{equation} \beta_{bq} \sim N(0, \sigma_{\beta}^2),\\ \sigma_{\beta}\sim Unif(0.01, 2) \end{equation}\]

In this model, \(\gamma_{bp}\) and \(\beta_{bp}\) are inferred in each bait independently of all the other baits.


2. Peaky+ ideas


2a. Incorporate external data

We wish to develop the peaky model to incorporate external data on chromatin state into the analysis, so that if there is a chromatin mark indicative of (e.g.) open chromatin at \(q\), then the probability of a direct contact between bait \(b\) and prey \(q\) increases.

To do this, we could specify an additional parameter, \(\alpha\), in the model, so that \(\theta_{bp}\) is sampled conditional on the external data about chromatin state,

\[\begin{equation} \theta_{bp}=\left\{\begin{array}{l}{\theta_0\sim\operatorname{Beta}(1,F_b) \text{, if chromatin inactive}} \\ {\theta_1 \sim \operatorname{Beta}\left(1,\frac{F_b}{\alpha}\right)\text{, if chromatin active}}\end{array}\right. \end{equation}\]

where \(\alpha>1\) and is sampled from some distribution (perhaps a gamma), so that the distribution for \(\theta_{1}\) is more spiked toward 1 (as \(\alpha\) increases, the second shape parameter \(F_b/\alpha\) decreases, so the expected value of \(\theta_1\) increases). To gain enough information, \(\alpha\) must be estimated from multiple regions, perhaps even genome-wide.
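A quick check that dividing the second shape parameter by \(\alpha\) has the intended effect, again using the \(Beta(1, b)\) mean \(1/(1+b)\) (the \(F_b\) and \(\alpha\) values are illustrative):

```python
# E[theta] under Beta(1, F_b) (inactive) vs Beta(1, F_b / alpha) (active):
# dividing the second shape parameter by alpha > 1 pushes mass toward 1.
F_b = 100.0

def mean_inactive():
    return 1.0 / (1.0 + F_b)

def mean_active(alpha):
    return 1.0 / (1.0 + F_b / alpha)

print(round(mean_inactive(), 4))               # 0.0099 (baseline)
for alpha in (2.0, 10.0, 50.0):
    print(alpha, round(mean_active(alpha), 4)) # increases with alpha
```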

Note that for now we are assuming that this chromatin state data is binary, but in the future we may like to use continuous data.

2b. Vary \(\omega\) across baits

The idea above could be extended to sample the decay parameter, \(\omega\), from some distribution. Note that at the moment \(\omega\) is assumed to be fixed and symmetric, but this is not biologically sound. For example, if the bait is near the edge of an A/B compartment, then the decay rate would be greater on the side nearest the boundary, as interactions rarely cross A/B compartment boundaries. For the same reason, the decay would look more symmetric nearer the centre of an A/B compartment.

Whilst it is likely that \(\omega\) would also vary within each bait (although baits are much smaller than A/B compartments), this would be very hard to estimate and may be subject to over-fitting. Moreover, if we allowed \(\omega\) to vary within each bait then our prior on the number of peaks within a region becomes very important: for example, is the signal (i) 2 peaks with high decay rates or (ii) 1 peak with a lower decay rate? We also can’t use information on A/B compartments to estimate \(\omega\) because this information comes from the same data. It therefore seems sensible to only allow \(\omega\) to vary across baits and not within baits (at least for now).
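A toy sketch of what asymmetric decay would look like, with hypothetical \(\omega\) values on either side of a direct contact near a compartment boundary (all numbers invented):

```python
import math

# Different decay rates either side of the direct contact, e.g. for a bait
# near an A/B compartment boundary (omega values are illustrative).
beta = 5.0
omega_left, omega_right = 0.2, 0.05   # faster decay toward the boundary

def mu(d):
    """Expected signal at signed distance d (kb) from the contact."""
    omega = omega_left if d < 0 else omega_right
    return beta * math.exp(-omega * abs(d))

left = [round(mu(-d), 3) for d in (10, 20, 30)]
right = [round(mu(d), 3) for d in (10, 20, 30)]
print(left, right)   # signal dies off much faster on the boundary side
```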


3. Characterising active chromatin regions


Side note: Histone modifications associated with transcriptionally active genes and regulatory elements include H3K27ac, H3K4me1 and H3K4me3.

In order to incorporate external information about chromatin state into the analysis, I need to understand how chromatin states (e.g. active/inactive) are characterised.

ChromHMM is a tool to help understand the language of epigenetic marks. It uses a hidden Markov model to infer chromatin states from chromatin marks by integrating multiple tracks of ChIP-seq data. This is useful because we can jointly visualise the results from various ChIP-seq experiments (which also matters because histone modifications are not independent). For example, Burren et al. use ChromHMM to characterise active/inactive chromatin: they input ChIP-seq data for 6 chromatin marks (H3K36me3, H3K9me3, H3K27me3, H3K4me1, H3K27ac, H3K4me3) into a 15-state ChromHMM model. Each state was then manually annotated, and states indicating the presence of promoter or enhancer chromatin tags (E4-E11) were selected to profile regions of active and inactive chromatin. The peaky paper combines these 8 states into a binary measure of active or inactive chromatin, but how is this done?

I can’t find any information about how the continuous data across these 8 states (E4-E11) is converted to a binary value. For now, if any of the columns for states E4-E11 contains a non-zero entry then the binary indicator is 1, and 0 otherwise. It could also be something to do with the other states: do E1-E3 and E12-E15 reflect inactive chromatin?


Data generation

Using the ChromHMM data, I have generated two dataframes: final_chrom.RDS, with 5 columns (hindID, chr, start, end and a binary indicator of active or inactive chromatin within the HindIII fragment), and full_chrom.RDS, with 13 columns (as final_chrom but also including the raw emission state values):

  1. Load in chromhmm-act.RData

  2. Remove any rows where all the emission state values were NA

  3. Replace the remaining NAs with 0 (should think of a better way to use this information)

  4. Add additional “binary” column which takes a value 1 if any of the state data for states E4-E11 is non-zero and 0 otherwise

  5. Load hind-position.RData and use this to add chr, start and end columns representing the genomic positions of the HindIII fragments

The resultant final_chrom.RDS and full_chrom.RDS files have 836,800 rows (the original data has 837,157 rows, so data for 357 HindIII fragments are lost).
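Step 4 above can be sketched as follows. This is a pure-Python sketch with invented rows and NAs represented as `None`; the real pipeline operates on the emission-state columns of the chromhmm-act.RData dataframes:

```python
# Step 4 sketched: binary = 1 if any of states E4-E11 is non-zero.
# Rows here are invented; NAs are represented as None and, as in step 3,
# treated as 0 before the check.
ACTIVE_STATES = [f"E{i}" for i in range(4, 12)]   # E4..E11

def binary_indicator(row):
    values = [row.get(s) or 0 for s in ACTIVE_STATES]  # None/missing -> 0
    return int(any(v != 0 for v in values))

rows = [
    {"hindID": 1, "E4": 0.0, "E5": 0.7, "E11": 0.0},    # active (E5 != 0)
    {"hindID": 2, "E4": 0.0, "E5": 0.0, "E11": 0.0},    # inactive
    {"hindID": 3, "E4": None, "E5": None, "E11": 0.2},  # active (E11 != 0)
]
print([binary_indicator(r) for r in rows])  # [1, 0, 1]
```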

The above steps are repeated for the non-activated cells using chromhmm-non.RData instead, creating non_final_chrom.RDS and non_full_chrom.RDS files.


In terms of the peaky data, the baitIDs and preyIDs correspond to the hindIDs (checked by comparing the mid.bait column in the peaky data with the genomic positions of the hindIDs in final_chrom). The peaky analysis was run on four datasets generated from CD4+ T cells: aCD4pchic.RData (activated promoter capture set), aCD4val.RData (activated validation capture set), nCD4pchic.RData (non-activated promoter capture set) and nCD4val.RData (non-activated validation capture set). For now, I just focus on the promoter capture datasets.

“Note that the validation set are used here to provide a complementary dataset whose true contacts would be expected to have alternative biological characteristics than the promoter capture set, owing to the opposite fragment being captured.”

I generate dataframes, aCD4pchic_data.RDS and nCD4pchic_data.RDS, with the peaky data and binary columns representing whether the prey, bait, and/or both fall in active chromatin regions.


Data analysis

For now, I focus only on the activated T cell dataset.

Example: For bait 784839 the following plot shows the CHiCAGO scores and the MPPCs. We can see that peaky has adjusted the signals. It seems that there are perhaps 2 direct targets, which is not obvious from just looking at the CHiCAGO scores. Moreover, peaky expresses its uncertainty in where exactly these direct targets are (unlike, say, fine-mapping approaches that just pick out variants). This nice feature comes about because the raw read counts are correlated amongst neighbours and peaky jointly models the decay of the signal around these neighbours directly.


Aim: Explore the chromatin state and peaky output data. Start to look at whether the PPs from peaky are correlated with the chromatin state data: we would expect the MPPC to be bigger on average for HindIII fragments falling in regions of active chromatin; is this reflected in the data?

The boxplots shown below are of the log-transformed MPPC values for a random sample of 500,000 rows from the activated CD4+ T cell data. We see that when both the bait and prey fall in active chromatin regions, the MPPC values are slightly higher (promising).


4. Quantile Regression


Motivation

The prostate cancer paper successfully fine-maps 75 genomic regions using JAM. They use quantile regression (QR) to incorporate functional annotation data to prioritise SNPs, regressing \(PP_{JAM}\) against a linear combination of annotation features and then updating the posterior probabilities. The motivation for using quantile regression is that these functional annotations (the “prior information”) may help us to differentiate at the extremes of the distribution rather than at, say, the mean, which is what standard regression would capture.

Here, conditional quantile regression is used to compare statistical and functional evidence for causality. The statistical analysis is performed separately from annotation, as opposed to some fine-mapping frameworks (e.g. CAVIAR) that incorporate functional annotations during the statistical analysis. They justify this by saying that it reduces the potential for strong candidate variants to be downweighted due to experimental artifacts (the annotations come from whole-genome biological data sets, so their reliability may be questionable). The only caveat of this method is the re-use of the data, but this can be overcome by splitting the data into training and testing datasets, and there is enough data (SNPs in this example, or peaks in peaky) to do this.

“We believe this more clearly allows the most informative annotations, and the variants that are characterised by those annotations to be highlighted within the data set, whilst also reducing the potential for penalisation of strong candidate variants due to localised artifacts or cell line-specific effects within the whole-genome biological data sets used for annotation. Our conditional QR analysis resulted in adjustment of posterior probability for a small proportion of variants and may further assist prioritisation of the most likely functional variants among the credible set selected for each region.”

To minimise correlations between annotations, single data sets for each transcription factor, histone mark, DNaseI, conserved element and ChromHMM chromatin state category were selected for investigation.

Mathematical method

“At each quantile, we used the fitted model to calculate a predicted posterior probability given the SNP’s annotation features. A single expected posterior probability was then calculated from a weighted average of these quantile-specific expected posterior probabilities with the weight reflecting both the fit (i.e., a function of the likelihood) and variance of the predicted values from the quantile-specific model to the data.”

  1. Fit a conditional QR model regressing the PP obtained from JAM (\(PP_{JAM}\)) against the genomic annotation (\(\mathbf{Z}\)) for each quantile \(\tau=(99.2, 99.4, 99.6, 99.8, 99.95\%)\).

  2. Use each of these 5 models to calculate 5 expected PPs for each SNP given the annotation profile for that SNP (\(\hat {PP}_{QR: \tau=99.2}, \hat {PP}_{QR: \tau=99.4}, \hat {PP}_{QR: \tau=99.6}, \hat {PP}_{QR: \tau=99.8}, \hat {PP}_{QR: \tau=99.95}\)).

  3. Take a weighted average of these 5 expected PPs for each SNP to yield a final estimate, \(\hat{PP}_{QR}\).


See Yu et al. and the APTS notes here, which explain Bayesian quantile regression, whereby a likelihood function based on the asymmetric Laplace distribution (ALD) is used. Mean regression corresponds to a normal likelihood, whereas QR corresponds to the AL likelihood.

The Laplace distribution with density \(f(z)=\frac{1}{2\sigma} \exp\left(-\frac{|z-\mu|}{\sigma}\right)\) has the nice property that the MLE of \(\mu\) is the sample median. The asymmetric Laplace distribution is a generalisation of the Laplace distribution whereby the MLE of \(\mu\) is the \(\tau\)th sample quantile of \(z\).

The AL distribution is similar to the normal distribution but is “spikier” and is allowed to be asymmetric. Interestingly, it looks like what we may want our peaks to look like in peaky when we allow the decay parameter \(\omega\) to vary.
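The quantile-recovery property of the ALD can be checked empirically via the pinball (check) loss \(\rho_\tau(u) = u(\tau - \mathbf{1}\{u<0\})\) that appears in its exponent: minimising the summed loss over candidate locations recovers the sample quantile. A minimal sketch with invented data:

```python
# Empirical check: minimising the summed check loss over candidate
# locations m recovers the tau-th sample quantile (data invented).
def rho(u, tau):
    """Pinball/check loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

z = [0.3, 1.2, 2.5, 3.1, 4.8, 5.0, 6.7, 7.4, 8.9, 9.6]
tau = 0.85

# The optimum is always attained at a data point, so searching z suffices.
best = min(z, key=lambda m: sum(rho(zj - m, tau) for zj in z))
print(best)  # 8.9, i.e. around the 85% sample quantile of z
```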

Let \(\mathbf{Z}\) denote the annotations and \(PP\) denote the posterior probabilities from JAM, then

\[\begin{equation} PP \sim N(v, \sigma^2), \\ v\sim \prod_i\text{ALD}(\mathbf{Z}\theta, \lambda_i, \tau_i). \end{equation}\]

So \(v\) is informed by both \(PP\) and \(\mathbf{Z}\theta\), suggesting that a weighted average of \(PP\) and the fitted regression quantiles can approximate \(v\).

The density of the ALD with location parameter \(\mathbf{Z} \theta\), scale parameter \(\lambda>0\) and skewness parameter (quantile) \(\tau \in (0,1)\), is

\[\begin{equation} f(v | \mathbf{Z} \theta, \lambda, \tau)=\frac{\tau(1-\tau)}{\lambda} \exp \left(-\rho_{\tau}\left(\frac{v-\mathbf{Z} \theta}{\lambda}\right)\right) \end{equation}\]

For \(\lambda\), the likelihood is maximised at (this can be derived by computing the MLE)

\[\begin{equation} \lambda^{*}=\frac{\sum_{j} \rho_{\tau}(v_j-\mathbf{Z}_j \theta)}{N}. \end{equation}\]

Fix \(\lambda\) to be

\[\begin{equation} \hat{\lambda}=\frac{\sum_{j} \rho_{\tau}(v_j-\mathbf{Z}_j \widehat{\theta})}{N}, \end{equation}\]

where \(\widehat{\theta}\) are the coefficient estimates from classical conditional QR. With \(\lambda\) fixed, only the exponential parts of the Gaussian distribution and the ALD involve \(v\); these determine the weights we assign to \(P\) and \(\mathbf{Z}\theta\).

Classical QR yields a prediction \(\widehat P_i = \mathbf{Z}\hat\theta_i\) for each SNP at each quantile \(i\). The larger the penalties from the ALD (\(\rho_{\tau_{i}}\left(P_{i}-\widehat{P}_{i}\right)\)) and the Gaussian distribution (\(\left(P_{i}-\widehat{P}_{i}\right)^2\)), the less influence \(\widehat{P}_{i}\) should have on \(P_i\) (i.e. the bigger the difference between \(PP_{JAM}\) and the prediction, the less influence).

They chose to normalise the weight for \(P_i\) to be 1. For \(\widehat{P_i}\) they assign weight

\[\begin{equation} w_{i}=\exp \left(\left(-\frac{\rho_{\tau_{i}}\left(P_{i}-\widehat{P}_{i}\right)}{\hat{\lambda}_{i}}-\frac{\left(P_{i}-\hat{P}_{i}\right)^{2}}{2 \sigma^{2}}\right) / 4\right) \end{equation}\]

which is approximately the penalty at \(v=(P+\widehat P_i)/2\). Their approximate value for \(v\) is then

\[\begin{equation} \hat{v}=\frac{P+\sum_{i} w_{i} \widehat{P}_{i}}{1+\sum_{i} w_{i}}. \end{equation}\]
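Pulling the weighting steps together, here is a minimal numerical sketch for a single SNP. All values (\(P\), the quantile-specific predictions, \(\hat\lambda_i\) and \(\sigma\)) are invented; the real analysis would plug in the fitted QR quantities:

```python
import math

# Invented example values for one SNP: the JAM PP, five quantile-specific
# QR predictions, and the scale / skewness parameters for each quantile.
P = 0.40                                   # PP from JAM
P_hat = [0.35, 0.42, 0.48, 0.55, 0.60]     # QR predictions at each tau_i
lam = [0.05, 0.05, 0.04, 0.04, 0.03]       # lambda_hat_i (assumed)
taus = [0.992, 0.994, 0.996, 0.998, 0.9995]
sigma = 0.1                                # Gaussian scale (assumed)

def rho(u, tau):
    """Check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def weight(P, P_hat_i, lam_i, tau_i):
    """w_i from the ALD and Gaussian penalties at P - P_hat_i."""
    ald_pen = rho(P - P_hat_i, tau_i) / lam_i
    gauss_pen = (P - P_hat_i) ** 2 / (2 * sigma ** 2)
    return math.exp(-(ald_pen + gauss_pen) / 4)

w = [weight(P, ph, l, t) for ph, l, t in zip(P_hat, lam, taus)]
v_hat = (P + sum(wi * ph for wi, ph in zip(w, P_hat))) / (1 + sum(w))
print(round(v_hat, 4))   # updated PP, pulled toward the QR predictions
```

Because the penalties are non-negative, each \(w_i \le 1\), and \(\hat v\) is a convex combination of \(P\) and the \(\widehat P_i\): predictions far from \(P\) get downweighted.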


Application to Peaky

Firstly, I need to investigate the relationship between the PPs (MPPCs) and chromatin state using QR. Note that QR analyses all of the PPs at once and so we are achieving our goal of using multiple baits simultaneously. There are two main packages for QR in R:

  1. quantreg which is frequentist.

  2. bayesQR which is Bayesian (and so uses the ALD): you need to specify the number of MCMC draws, and there is an option to use adaptive lasso variable selection (note that this isn’t working on my data: it returns an error saying there are missing values when there are none).

Below, a subset of the aCD4pchic.RData data is used with frequentist QR to regress the MPPC against chromatin state (active or inactive). The red line is fitted to the mean, the blue to the median and the grey to various quantiles (0.05, 0.1, 0.25, 0.75, 0.90, 0.95). As expected, QR captures the differences at the extremes: there seems to be a significant trend for active chromatin state among the highest MPPC values. Notice the difference in both the intercept and the slope.
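As a sanity check of what QR does with a single binary covariate: the fit at quantile \(\tau\) reduces to the within-group \(\tau\)-quantiles, so the intercept is the inactive-group quantile and the slope is the active/inactive gap. A minimal sketch with synthetic, invented MPPC-like data (not the real aCD4pchic values):

```python
import random

# With one binary covariate (active vs inactive chromatin), minimising the
# check loss separates by group, so the QR "fit" is just the group-wise
# tau-quantiles. Synthetic MPPC-like data from invented Beta distributions.
random.seed(1)
inactive = [random.betavariate(0.5, 60) for _ in range(5000)]
active = [random.betavariate(0.5, 40) for _ in range(5000)]

def q(xs, tau):
    """Simple empirical tau-quantile."""
    s = sorted(xs)
    return s[int(tau * (len(s) - 1))]

for tau in (0.5, 0.9, 0.95):
    intercept = q(inactive, tau)           # inactive-group quantile
    slope = q(active, tau) - intercept     # active/inactive gap
    print(tau, round(intercept, 4), round(slope, 4))
```

With these synthetic distributions the slope grows at the upper quantiles, mimicking the pattern seen in the plot above.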


If we were to follow the updating PP idea from the prostate cancer paper then our method would look something like:

  1. Fit a QR model, \(MPPC \sim \mathbf{Z}\), for various quantiles.

For each bait and prey pair,

  1. Predict \(\widehat{MPPC}\) for each of the quantiles using the genomic annotations for the bait and prey pair

  2. Take some sort of weighted average of these predictions over the various quantiles to derive the new MPPC, \(MPPC_{QR}\).

Or we could think about how this method could give us a different prior on \(\gamma_{bp}\) that incorporates information on chromatin state, such that if the prey falls in active chromatin then the prior probability of contact is larger.


The obvious things to think about are:

  1. At the moment, the only chromatin state information I am using is a binary indicator of active/inactive chromatin; we would need to incorporate other measures of chromatin state to derive the predicted value \(\widehat{MPPC}\) (otherwise all of the predicted values would be the same for any bait and prey in active chromatin, or in inactive chromatin). Perhaps we could incorporate something about the size of the fragments and whether the chromatin is active along the whole fragment (Chris mentioned having some chromatin state data at the bp scale).

  2. What values of \(\tau\) to take a weighted average over (how did they choose theirs in the prostate cancer paper?).

  3. The best way to take the weighted average (and therefore derive the updated MPPC, \(MPPC_{QR}\)).


5. Other things discussed in the meeting


Questions


I re-run the analysis but change the binary score to 1 if any of E5-E7 is non-zero (rather than any of E4-E11), as these look (to me) more indicative of active chromatin. There now seems to be more of a relationship between MPPC and active chromatin state. I also re-run the QR analysis and find that at the extremes, when using E5-E7 (rather than E4-E11), there is a stronger trend for active chromatin state with higher MPPC values.


In Olly’s paper it states that H3K4me3 and H3K36me3 marks define active promoters and transcribed regions. I check whether the MPPC values are higher when the bait fragment has an H3K4me3 or H3K36me3 mark (non-zero entry in E6-E10 or E12).

It also states that H3K4me1 and H3K27ac marks indicate active enhancers. I check whether the MPPC values are higher when the prey fragment has an H3K4me1 or H3K27ac mark (non-zero entry in E3-E7 or E10-E12).

What about satisfying both of the above conditions, so that the bait must fall in an active promoter region (defined by H3K36me3 and H3K4me3) and the prey in an active enhancer region (defined by H3K4me1 and H3K27ac)?