CHEERS is a statistical method that accounts for subtle changes in the chromatin landscape to identify SNP enrichment across cell states. E.g. genome-wide significant SNPs for T1D are enriched in IFN\(\beta\).
Rather than converting annotations into a binary measure of present/absent at the SNP level (for which they found there were not many cell state specific changes in H3K27ac and ATAC-seq data for different T cell states), they model the peak properties.
Define a union of peak regions present across cell types and construct a matrix of normalised read counts.
Calculate and rank specificity scores (the higher the rank score the more specific that peak is to that cell type). The peak specificity score is the normalised read count for that peak in that cell state divided by the square root of the sum of the squared normalised read counts for that peak across all cell states.
Define SNP set as lead-SNPs and those with \(r^2>0.8\) and overlay these to the peaks.
For the peaks that overlap the associated variants, calculate the mean specificity ranks and derive a P value representing the probability that SNP-peak overlap is random in that cell type.
They use their method to quantify SNP enrichment in various T/B cell states for various diseases. E.g. that IFN\(\beta\) is enriched for Systemic lupus erythematosus variants.
I like this method because the binary conversion of SNP-peak overlap commonly used seems to throw away lots of useful information in my opinion. E.g. do the peaks look different in different cell types/states.
Perhaps throwing away useful information at each stage in the method.. E.g. where in the peak the SNP falls and converting read counts to discrete ranks/specificity scores.
But the output is a \(P\) value, how do we interpret this? They compare the peak rank with a theoretical distribution:
"Within each cell state, all ranks (1,2,…,N) can be observed with an equal probability and thus follow a discrete uniform distribution (Kolmogorov-Smirnov test, P=1) with mean:
\[\mu = \dfrac{1+N}{2}\]
and variance:
\[\sigma^2=\dfrac{(max-min+1)^2-1}{12}.\]
When we substitute max with number of peaks (N) and min with the minimal rank (1), we obtain the following formula:
\[\sigma^2=\dfrac{N^2-1}{12}\] which, under the CLT coverges to,
\[\bar{\sigma}^2=\dfrac{N^2-1}{12n}\] where \(\bar{\sigma}^2\) is the variance of the mean of the \(n\) peak ranks overlapping the GWAS SNPs, assuming that these overlap at random.
Finally, we calculate P values as:
\[P = 1-\Phi(\dfrac{x-\mu}{\sigma})."\]