https://www.youtube.com/watch?v=JMT6oRYgkTk
The total length of DNA contained in a human cell would be about 2 metres if completely stretched out, but it has to fit into the nucleus of every cell in our body. For this reason, DNA is highly organised into compact structures.
Nucleosome: A segment of DNA wound in sequence around 8 histones, i.e. a tight loop of DNA and protein.
Chromatin: Nucleosomes coiled together and stacked on top of each other. This may be highly condensed and not transcribed (heterochromatin) or open and regularly transcribed (euchromatin).
Chromosomes: Chromatin fibers that have been looped and further wound and packaged. Note that these only form when cells are dividing.
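The 2 metre figure for stretched-out DNA can be sanity-checked with simple arithmetic. The sketch below assumes a diploid genome of roughly 6.4 billion base pairs and a rise of about 0.34 nm per base pair for B-form DNA; both constants are approximations introduced here, not values from the notes above.

```python
# Rough check of the "2 metres of DNA per cell" figure.
# Assumption: ~3.2e9 bp per haploid genome, two copies per diploid cell.
BP_PER_DIPLOID_GENOME = 6.4e9
# Assumption: B-form DNA rises ~0.34 nm per base pair.
RISE_PER_BP_M = 0.34e-9


def stretched_length_m(n_bp, rise_m=RISE_PER_BP_M):
    """Contour length in metres of n_bp base pairs of linear DNA."""
    return n_bp * rise_m


length = stretched_length_m(BP_PER_DIPLOID_GENOME)
print(f"{length:.2f} m")
```

Under these assumptions the result is a little over 2 metres, consistent with the figure quoted.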
Epigenetic marks are not directly governed by the genetic code. They ensure that cells remain of the correct type after they divide (e.g. skin cells remain skin cells) by keeping the correct genes on or off. They do this by controlling the accessibility of chromatin through histone marks.
Histones have amino acid tails that protrude from the nucleosome structure. Enzymes can place marks on these tails, activating or repressing genes. The same type of chemical mark can indicate gene activation (e.g. H3K4me3) or gene repression. Acetylation of amino acids on histones generally correlates with gene activation. Methylation marks can be activating or repressing, depending on where they occur. Recall that each nucleosome consists of 8 histones, each of which can have various post-translational marks - therefore it is important to understand the effects of specific combinations of epigenetic marks.
Note that there are DNA modifications (e.g. methylated CpG islands) and histone modifications (e.g. methyl group added to tail of the histone).
Large scale genomic organisation: The genome is partitioned into active (“A”) and inactive (“B”) compartments.
Smaller scale genomic organisation: Topologically associating domains (TADs) emerge. These are regions characterised by high intradomain contact frequency and reduced interdomain contacts.
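To make "high intradomain, reduced interdomain contact frequency" concrete, here is a minimal sketch that compares the two on a toy contact matrix. The function name and the toy values are made up for illustration. Real TAD callers also account for the decay of contact frequency with genomic distance, which this sketch ignores.

```python
import numpy as np


def intra_inter_ratio(contacts, start, end):
    """Mean contact frequency within bins [start, end) divided by the mean
    contact frequency between those bins and all other bins."""
    inside = np.zeros(contacts.shape[0], dtype=bool)
    inside[start:end] = True
    intra = contacts[np.ix_(inside, inside)]
    inter = contacts[np.ix_(inside, ~inside)]
    # Ignore the diagonal (self-contacts) when averaging within the domain.
    off_diagonal = ~np.eye(intra.shape[0], dtype=bool)
    return intra[off_diagonal].mean() / inter.mean()


# Toy symmetric matrix (hypothetical counts): two 3-bin domains,
# dense within each domain and sparse between them.
toy = np.array([
    [0, 9, 8, 1, 1, 0],
    [9, 0, 9, 2, 1, 1],
    [8, 9, 0, 2, 1, 1],
    [1, 2, 2, 0, 9, 8],
    [1, 1, 1, 9, 0, 9],
    [0, 1, 1, 8, 9, 0],
], dtype=float)

print(intra_inter_ratio(toy, 0, 3))
```

A ratio well above 1 for a candidate region is what a TAD-like block looks like in this simplified view.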
Chromatin conformation is basically DNA wrapped around histones - some histone marks reflect regions of open DNA and some reflect regions of closed DNA. They also indicate the degree of transcription and therefore gene expression. Genes in open chromatin are more likely to be expressed than genes in closed chromatin, and TFs generally bind preferentially to accessible, nucleosome-depleted DNA - so by investigating chromatin accessibility we can investigate transcriptional regulation, e.g. by linking non-coding GWAS variants to their target genes.
ATAC-seq is used to find regions of open chromatin using a transposase enzyme (Tn5) that preferentially cuts and tags accessible DNA. Open chromatin appears as peaks in read coverage.
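As a toy illustration of "open chromatin as peaks", the sketch below reports contiguous runs of positions whose read coverage reaches a fixed threshold. This is a deliberate simplification. Dedicated peak callers such as MACS2 model the background statistically rather than using a hard cutoff, and the coverage values here are invented.

```python
def call_peaks(coverage, threshold):
    """Return half-open (start, end) intervals where coverage >= threshold."""
    peaks = []
    start = None
    for position, depth in enumerate(coverage):
        if depth >= threshold and start is None:
            start = position              # a peak opens here
        elif depth < threshold and start is not None:
            peaks.append((start, position))  # the peak closed just before here
            start = None
    if start is not None:                 # peak runs to the end of the region
        peaks.append((start, len(coverage)))
    return peaks


# Hypothetical per-base coverage along a short region.
coverage = [0, 1, 5, 7, 6, 1, 0, 0, 8, 9]
print(call_peaks(coverage, threshold=5))
```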
The NIH Roadmap Epigenomics consortium 1 focused on the mapping of DNA methylation, histone modification and chromatin accessibility using cell lines and primary human tissue, i.e. “characterising the epigenome”. GoShifter (Genomic Annotation Shifter), in contrast, tests whether trait-associated SNPs are enriched for overlap with genomic annotations of any type (“GoShifter is able to robustly identify informative annotations under a range of different scenarios”).
Chromatin conformation/accessibility techniques should be used at the single cell-type or single cell level to avoid averaging results of epigenetic marks over many different cell types/cells (i.e. in a tissue or organ that contains many different cell types and cells) - similar to how single-cell RNA-seq is taking off, rather than bulk RNA-seq.
2 provide a genome-wide map of the coordination between REs and describe how this serves as a backbone for the propagation of noncoding genetic effects in cis and trans onto gene expression.
Briefly, DNA is cross-linked and digested with restriction enzymes. The loose DNA fragment ends can then be re-ligated to form a hybrid DNA molecule made of two fragments of DNA which may be very far apart in linear distance. If two fragments of DNA are ligated using this method then it provides evidence that the fragments were interacting in the genome.
3C: One vs one
4C: One vs all
5C: Many vs many
Hi-C: All vs all
“The classical Hi-C technique involves restriction digestion of a formaldehyde cross-linked genome with sequence specific restriction enzymes, followed by fill in and repair of digested ends with the incorporation of biotin-linked nucleotides. The repaired ends are then re-ligated. Finally, the cross-linking is reversed and associated proteins are degraded. This produces the ligation products which are then non specifically sheared, generally by sonication, and enriched for sheared fragments containing the ligation junction, using a biotin pull-down strategy, and finally sequenced using paired-end sequencing (Belton et al. 2012). The enrichment step aims to select sonicated fragments containing the ligation junction, increasing the proportion of informative non-same fragment read pairs (mate pairs originated from different restriction fragments).” 4. Note that all of these steps up until the ligation are in fact performed in cell nuclei.
The resolution of Hi-C data is determined by the restriction enzyme used and the sequencing depth. A typical restriction enzyme is HindIII, which recognises and cuts a 6 bp long sequence, AAGCTT, but recently restriction enzymes recognising 4 bp long sequences have been adopted, resulting in smaller fragments. Note that the fragments obtained when using HindIII are typically 125 bp to 23,130 bp long (“HindIII produces fragments of median length 4kb”), so the resolution of interacting fragments is severely limited.
Capture Hi-C is when baits of interest are chosen a priori, for example those representing promoter sequences (PCHi-C).
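To see why a 4 bp cutter gives smaller fragments than a 6 bp cutter, here is a minimal in silico digest. It assumes HindIII cuts after the first base of its site (A^AGCTT) and, for the expected spacing, a random sequence with equal base frequencies. The function names and the toy sequence are illustrative only.

```python
import re


def digest(sequence, site="AAGCTT", cut_offset=1):
    """Return fragment lengths after cutting `sequence` at every `site`.

    cut_offset=1 places the cut after the first base of the site
    (A^AGCTT, as assumed here for HindIII).
    """
    cuts = [m.start() + cut_offset for m in re.finditer(site, sequence.upper())]
    bounds = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def expected_site_spacing(site_length):
    """Average distance between sites in a random sequence with equal
    base frequencies: one occurrence every 4**site_length bp."""
    return 4 ** site_length


toy_sequence = "GGAAGCTTCCCCAAGCTTGG"  # made-up sequence with two HindIII sites
print(digest(toy_sequence))
print(expected_site_spacing(6), expected_site_spacing(4))
```

Under the random-sequence assumption a 6 bp site occurs about every 4,096 bp and a 4 bp site about every 256 bp, which is in line with the quoted median of around 4 kb for HindIII.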
Data-generation:
FASTQ files of paired-end reads (reads from either end of the DNA fragment) are obtained and aligned to the reference genome. Since the two reads of a pair are expected to map to different, unrelated regions of the genome, they are aligned to the reference genome separately. Note that problems may arise if a read spans the ligation junction, so that two portions of the read match distinct genomic positions (chimeric reads).
The reads are filtered to remove spurious signals due to experimental artifacts.
The read counts are then binned into genomic bins. This allows more robust and less noisy signals for the estimation of contact frequencies, but means that the resolution is reduced. Strategies to find the optimal genomic bin size have been proposed.
Read counts are normalised.
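The data-generation steps above can be sketched on toy data. This is only an illustration under stated assumptions, not a reimplementation of any published pipeline: the junction sequence assumes HindIII ends that were filled in and blunt-ligated (AAGCT + AGCTT), the balancing routine is a generic damped matrix-balancing scheme standing in for whichever normalisation is actually used, and all function names and inputs are hypothetical.

```python
import numpy as np

# Assumption: filled-in HindIII ends ligated blunt give AAGCT + AGCTT.
HINDIII_JUNCTION = "AAGCTAGCTT"


def truncate_at_junction(read, junction=HINDIII_JUNCTION, keep=5):
    """Trim a chimeric read at the first ligation junction.

    The first `keep` junction bases still belong to the first fragment,
    so they are retained; everything after them is dropped.
    """
    pos = read.find(junction)
    return read if pos == -1 else read[:pos + keep]


def bin_contacts(pairs, chrom_length, bin_size):
    """Count read pairs per pair of genomic bins on one chromosome.

    `pairs` holds (pos1, pos2), the 0-based mapped positions of the mates.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    matrix = np.zeros((n_bins, n_bins), dtype=float)
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i, j] += 1
        if i != j:
            matrix[j, i] += 1  # keep the matrix symmetric
    return matrix


def balance(matrix, n_iter=50):
    """Rescale rows and columns so non-empty bins get near-equal totals.

    Returns (balanced, bias) with matrix == balanced * outer(bias, bias).
    Very sparse matrices may not be balanceable; low-coverage bins are
    usually filtered out beforehand.
    """
    balanced = matrix.astype(float).copy()
    bias = np.ones(balanced.shape[0])
    for _ in range(n_iter):
        totals = balanced.sum(axis=1)
        covered = totals > 0
        if not covered.any():
            break
        scale = np.ones_like(totals)
        # The square root damps each update to avoid oscillation.
        scale[covered] = np.sqrt(totals[covered] / totals[covered].mean())
        balanced /= np.outer(scale, scale)
        bias *= scale
    return balanced, bias


reads = ["ACGTAAGCTAGCTTGGCC", "TTTTGGGGCCCC"]  # made-up reads
print([truncate_at_junction(r) for r in reads])

# Made-up mapped positions of six read pairs on a 20 kb chromosome.
pairs = [(100, 500), (200, 15_000), (300, 16_000),
         (12_000, 13_000), (11_000, 19_000), (14_000, 18_000)]
raw = bin_contacts(pairs, chrom_length=20_000, bin_size=10_000)
balanced, bias = balance(raw)
print(raw)
print(balanced.sum(axis=1))
```

Larger bins pool more read pairs per cell of the matrix, which is the noise-versus-resolution trade-off described above.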
Note that the interactions found by standard methods (e.g. HiCCUPS and FastHiC) happen between genomic bins of several kb - what if we want to find higher resolution interactions e.g. between a GWAS SNP and a target gene promoter? For this, capture Hi-C is suggested but a different pipeline must be followed due to the asymmetry in capturing contacting fragments (many vs all rather than all vs all).
It may be possible to share information across baits because they are spatially dispersed in the genome.
For calling significant interactions from capture Hi-C data, CHiCAGO is generally recommended. Chicdiff takes as input CHiCAGO-processed data for each replicate and condition and uses the parameters learned by CHiCAGO in data normalisation.
Peaky may then be used to fine-map the long runs of contacting fragments identified by CHiCAGO (to see which bits of the prey are actually in direct contact).
5 used promoter capture Hi-C to find cell type specific promoter interactomes, with the aim of linking non-coding GWAS variants to their target genes (by seeing which genes’ promoters they interact with physically). They aimed to provide a comprehensive catalog of promoter-interacting regions (PIRs). PCHi-C is Hi-C whereby only interactions involving promoters are found (using sequence capture to pull down fragments of interest, i.e. those of promoters).
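The idea of linking non-coding variants to genes through promoter-interacting regions can be sketched as a simple interval lookup. The SNP names, gene names and coordinates below are invented for illustration, and the input layout is an assumption rather than the output format of any particular tool.

```python
from collections import defaultdict


def link_snps_to_genes(snps, interactions):
    """Map each SNP to the genes whose promoters contact a PIR containing it.

    snps:         dict of name -> (chrom, pos)
    interactions: list of (gene, chrom, pir_start, pir_end), half-open intervals
    """
    links = defaultdict(set)
    for snp, (chrom, pos) in snps.items():
        for gene, pir_chrom, start, end in interactions:
            if chrom == pir_chrom and start <= pos < end:
                links[snp].add(gene)
    return {snp: sorted(genes) for snp, genes in links.items()}


# Hypothetical inputs: two SNPs and three promoter-PIR interactions.
snps = {"snp_a": ("chr1", 1_050), "snp_b": ("chr2", 500)}
interactions = [
    ("GENE1", "chr1", 1_000, 2_000),
    ("GENE2", "chr1", 900, 1_100),
    ("GENE3", "chr2", 5_000, 6_000),
]
print(link_snps_to_genes(snps, interactions))
```

A SNP that falls in no PIR (here `snp_b`) is simply absent from the result, so it gets no candidate target gene from this evidence alone.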
An integrative and discriminative epigenome annotation system, for jointly characterizing epigenetic landscapes in many cell types and detecting differential regulatory regions.
Motivation: Need to investigate how epigenomic variation, both across the genome and across different cell types, relates to gene expression changes and phenotypic diversity. Current methods rely on genomic segmentation (mostly developed for a single genome), which has been extended through genomic concatenation and data stacking to analyse multiple cell lines.