\(H\) denotes a hypothesis of association with T1D; \(H=1\) indicates association, \(H=0\) indicates no association (null hypothesis).
\[\begin{equation} H \sim \text{Bernoulli}(\pi_1) \Longrightarrow P(H=1)=\pi_1 \end{equation}\]
\[\begin{equation} \begin{split} Q|H=0 \sim \text{Bernoulli}(q_0) & \Longrightarrow P(Q=1|H=0)=q_0 \\ & \Longrightarrow P(Q=0|H=0)=1-q_0 \end{split} \end{equation}\]
\[\begin{equation} P|H=0 \sim U(0,1) \end{equation}\]
\[\begin{equation} P \perp Q|H = 0 \end{equation}\]
\[\begin{equation} \begin{split} Q|H=1 \sim \text{Bernoulli}(q_1) & \Longrightarrow P(Q=1|H=1)=q_1 \\ & \Longrightarrow P(Q=0|H=1)=1-q_1 \end{split} \end{equation}\]
\[\begin{equation} f(P=p, Q=q|H_0)=f_0(p,q) \end{equation}\]
\[\begin{equation} f(P=p, Q=q|H_1)=f_1(p,q) \end{equation}\]
\[\begin{equation} f(P=p, Q=q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1=f(p,q) \end{equation}\]
1. Generate the set of possible rejection regions.
Since \(Q\) can only be 0 or 1, we just need two P value thresholds, \(p_0\) for \(Q=0\) and \(p_1\) for \(Q=1\).
\[\begin{equation} L(p_0, p_1)=(P\leq p_0, Q=0) \cup (P\leq p_1, Q=1). \end{equation}\]
2. Pick the smallest rejection region that \((p[i], q[i])\) belongs.
Following on from our aim, we want the rejection region such that:
\[\begin{equation} \int_{L(p_0, p_1)}df_0=\alpha \end{equation}\]
\[\begin{equation} \int_{L(p_0, p_1)}df_1 \text{ is maximal} \end{equation}\]
Previous works (e.g. appendix 8.1 in James’ paper) show that this optimal region is given by the set of points \(\{(p,q):f_0(p,q)/f_1(p,q)\leq k_{\alpha}\}\) which means that in our case, the chosen \(p_0\) and \(p_1\) should satisfy
\[\begin{equation} \dfrac{f_0(p_0,0)}{f(p_0,0)}=\dfrac{f_0(p_1,1)}{f(p_1,1)} \end{equation}\]
using \(f(p,q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1\).
To find \(p_0\) and \(p_1\) that satisfy this equation, we derive an estimator of
\[\begin{equation} \begin{split} \dfrac{f_0(p,q)}{f(p,q)} &= \dfrac{f(P=p, Q=q|H_0)}{f(P=p, Q=q)}. \end{split} \end{equation}\]
Let
\[\begin{equation} \begin{split} \hat{h}(p,q) &= \dfrac{Pr(P\leq p, Q=q|H_0)}{Pr(P\leq p, Q=q)} \\[2ex] & = \dfrac{Pr(P\leq p| Q=q, H_0)Pr(Q=q|H_0)}{Pr(P\leq p|Q=q)Pr(Q=q)} \\[2ex] & \approx \dfrac{p\times Pr(Q=q|H_0)}{Pr(P\leq p|Q=q)Pr(Q=q)} \\[2ex] & \approx \dfrac{p\times \color{red}{A(q)}}{\color{blue}{B(p, q)}/N} \end{split} \end{equation}\]
where \(\color{red}{A(q)=\dfrac{\# \{Q=q, p>1/2\}}{\# \{p>1/2\}}}\) and \(\color{blue}{B(p, q)=\# \{P\leq p, Q = q\}}\) (since \(\# \{Q = q\}\) cancels in the denominator).
So we want to solve
\[\begin{equation} \hat{h}(p_0,q = 0)=\hat{h}(p_1, q = 1) \end{equation}\]
for
3. Integrate the null distribution over this region to get \(v\).
\[\begin{equation} \begin{split} \int_{L(p_0,p_1)}df_0 & = Pr(P,Q \in L(p_0, p_1)|H_0) \\ & = Pr((P\leq p_0, Q=0) \cup (P\leq p_1, Q=1) | H_0) \\ & = Pr(P\leq p_0, Q=0|H_0) + Pr(P\leq p_1, Q=1|H_1) \\ & = Pr(P\leq p_0|Q=0, H_0)Pr(Q=0|H_0) + Pr(P\leq p_1|Q=1, H_0)Pr(Q=1|H_0) \\ & = p_0\times (1-q_0) + p_1\times q_0 \end{split} \end{equation}\]
where \(q_0=A(1)\).
I have downloaded the case-control T1D GWAS results (Onengut-Gumuscu et al.) from the GWAS catalog. This consists of 123130 SNPs genotyped in 6670 T1D cases and 12262 controls.
I breifly compare this to the ic_t1d_onengut_allresults.csv.gz
file on the HPC which I assume also includes the meta-analysis (124726 SNPs). For the SNPs that can be matched by position, I compare the reported P values in the two datasets.
The T1D paper (cc_ichip.sumstats
from GWAS Catalog) does not apply genomic control as we discussed (comparing chi-squared distributions) but does account for cyptic relatedness using KING (to estimate the kinship coefficient and remove problematic individuals) and population structure using the PCA method in KING. I assume this means that I shouldn’t genome-control the P values (again). Note: I checked, and this data set gives a lambda value of 0.93 suggesting that there is no population structure (although shouldn’t this equal 1 in a completely homogenous population?).
I decide to use the publically avaliable data downloaded directly from the GWAS catalog for my analysis.
To derive a final annotation data set, for each of the 123130 SNPs in the data set, I record their reported genomic annotation in 19 different cell types (mostly immune related except brain cells to use as a control) using the segway encyclopedia. For each cell type, I also constuct a binary “active” column that is 1 if the SNP falls in enhancer, transcribed, promoter or reg permissive, 0 if the SNP falls in quiescent, constit het or facult het and NA if the SNP falls in bivalent, unclassified or low confidence.
snps <- readRDS("final_final_annots.RDS")
snps[1,]
## chrom pos rsid other_allele effect_allele p
## 1 1 1118275 rs61733845 C T 0.8342689
## beta se OR OR_lower OR_upper
## 1 -0.01126319 0.05383169 0.9888 0.8897869 1.098831
## BRAIN_INFERIOR_TEMPORAL_LOBE CD14_PRIMARY_CELLS
## 1 FacultativeHet FacultativeHet
## CD19_PRIMARY_CELLS_PERIPHERAL_UW CD3_PRIMARY_CELLS_CORD_BI
## 1 Bivalent Bivalent
## CD3_PRIMARY_CELLS_PERIPHERAL_UW CD4_MEMORY_PRIMARY_CELLS
## 1 FacultativeHet Enhancer
## CD4_NAIVE_PRIMARY_CELLS CD4+_CD25-_CD45RA+_NAIVE_PRIMARY_CELLS
## 1 RegPermissive Enhancer
## CD4+_CD25-_CD45RO+_MEMORY_PRIMARY_CELLS
## 1 Enhancer
## CD4+_CD25-_IL17-_PMA-IONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLS
## 1 FacultativeHet
## CD4+_CD25-_IL17+_PMA-IONOMCYIN_STIMULATED_TH17_PRIMARY_CELLS
## 1 Enhancer
## CD4+_CD25-_TH_PRIMARY_CELLS CD4+_CD25+_CD127-_TREG_PRIMARY_CELLS
## 1 Bivalent FacultativeHet
## CD4+_CD25INT_CD127+_TMEM_PRIMARY_CELLS CD56_PRIMARY_CELLS
## 1 Enhancer Bivalent
## CD8_MEMORY_PRIMARY_CELLS CD8_NAIVE_PRIMARY_CELLS PANISLETS THYMUS
## 1 FacultativeHet RegPermissive Transcribed Quiescent
## b_BRAIN_INFERIOR_TEMPORAL_LOBE b_CD14_PRIMARY_CELLS
## 1 0 0
## b_CD19_PRIMARY_CELLS_PERIPHERAL_UW b_CD3_PRIMARY_CELLS_CORD_BI
## 1 NA NA
## b_CD3_PRIMARY_CELLS_PERIPHERAL_UW b_CD4_MEMORY_PRIMARY_CELLS
## 1 0 1
## b_CD4_NAIVE_PRIMARY_CELLS b_CD4+_CD25-_CD45RA+_NAIVE_PRIMARY_CELLS
## 1 1 1
## b_CD4+_CD25-_CD45RO+_MEMORY_PRIMARY_CELLS
## 1 1
## b_CD4+_CD25-_IL17-_PMA-IONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLS
## 1 0
## b_CD4+_CD25-_IL17+_PMA-IONOMCYIN_STIMULATED_TH17_PRIMARY_CELLS
## 1 1
## b_CD4+_CD25-_TH_PRIMARY_CELLS b_CD4+_CD25+_CD127-_TREG_PRIMARY_CELLS
## 1 NA 0
## b_CD4+_CD25INT_CD127+_TMEM_PRIMARY_CELLS b_CD56_PRIMARY_CELLS
## 1 1 NA
## b_CD8_MEMORY_PRIMARY_CELLS b_CD8_NAIVE_PRIMARY_CELLS b_PANISLETS
## 1 0 1 1
## b_THYMUS PP
## 1 0 NA
The baseline proportions of annotations across the ~123,000 ImmunoChip SNPs are shown below:
And the enrichment of annotations amongst genome-wide significant SNPs is shown below. E.g. we see less enrichment of constitutive heterochromatin amongst genome-wide significant SNPs.
Changes from last week:
Use new P values downloaded straight from the GWAS catalog (was using stats-t1d.csv
file before but not sure where these P values came from).
Fix error for big P values (\(P>0.6\)), where the solution to \(\hat{h}(p_0,q=0)=\hat{h}(p_1,q=1)\) for \(p_1\) is \(>1\) (but all \(<1.3\)).
extendInt
parameter in the uniroot
function and set \(p_1=1\) if the solution found by uniroot
is \(>1\). [This approach seems to work well - I use this approach in my analysis].Fix error for very small P values, where there are no P values smaller than this in the leave-one-chromosome-out dataset (and so \(B(p, q)=\# \{P\leq p, Q = q\}=0\)). In my results, this happens for the smallest 3 P values in the whole data set (P=4e-201, 3e-162, 6e-96) (snps=rs689 (q=0), rs3842727 (q=0), rs2476601 (q=1) respectively) so I just omit these for now as these are obviously significant. Note that rs689 and rs2476601 are in INS and PTPN22 respectively.
Use the data in the leave-one-chromosome-out dataset to select epsilon (the lower bound for the search space of the uniroot
function). Set this to the smallest P value in the leave-one-chromosome-out dataset for the relevant value of q.
I tried forcing SNPs with P values \(>1/2\) or \(<1e-20\) straight through the pipeline and setting their V value equal to their P value (which obviously avoided the error for very large and very small P values discussed above) but this gave strange histograms so I’ve decided to run the analysis for all P values.
Check that all the inequalities make sense (e.g. \(P(H_0|v<\alpha)\leq \alpha\) - We’re solving for \(P(H_0|v<\alpha)= \alpha\)). [TO DO]
The point in the bottom right (of the top right plot) is rs6679677 (row 5951 of data), i.e. the SNP in perfect LD with rs2476601 in the PTPN22 region. The P value is 4.031301e-95 whereas the V value is 2.147119e-05 (q=0). Note that the P value for rs2476601 is 6e-96 (q=1) but a V value could not be found because there were no SNPs in the leave-one-chromosome-out dataset with smaller P values.
Why is the V value so much smaller than the P value?
MyData <- readRDS("final_final_annots.RDS")
df <- na.omit(MyData[,c("rsid","chrom","p","b_THYMUS")])
colnames(df)[c(3,4)] <- c("p","q")
A <- function(data, q){ # empirical estimate of P(Q = qi | H0)
length(which(data$q == q & data$p > 0.5))/length(which(data$p > 0.5))
}
B <- function(data, p, q){ # numerator of empirical estimate of P(P < pi | Q = qi)
# as denominator cancels in h
length(which(data$p <= p & data$q == q)) # to deal with smallest P vals
}
i = 5951
pi <- df$p[i]
qi <- df$q[i]
df_loo <- df[-which(df$chrom == df[i, "chrom"]), ] # df leave-one-out
q0 <- A(data = df_loo, 1)
# qi = 0
p0 <- pi
epsilon <- min(df_loo[which(df_loo$q == 1),"p"])
LHS <- (p0 * A(df_loo, 0))/(A(df_loo, 1) * B(df_loo, p0, q = 0))
f <- function(data, c, p1, q) (p1/length(which(data$p <= p1 & data$q == q))) - c
p1 <- uniroot(f, c(0+epsilon, 1), data = df_loo, c = LHS, q = 1, extendInt = "yes", maxiter = 5000)$root
p1
## [1] 6.064557e-05
\(p_1\) should be the solution to the following equation
\[\begin{equation} \dfrac{p_0 A(0)}{A(1) B(p_0, q = 0)} = \dfrac{p_1}{B(p_1, q = 1)} \end{equation}\]
In this case, the LHS = 3.677574e-95 but my code is returning a solution of \(p_1=6.064557e-05\), where there doesn’t appear to be a solution… It seems that the tol
parameter in the uniroot
function causes problems, whereby the solution (\(p_1\)) just continues to get smaller as I increase the tolerance.
uniroot(f, c(0+epsilon, 1), data = df_loo, c = LHS, q = 1, extendInt = "yes", maxiter = 10000, tol = 1e-50)
## $root
## [1] -7.663399e-51
##
## $f.root
## [1] -Inf
##
## $iter
## [1] 160
##
## $init.it
## [1] 1
##
## $estim.prec
## [1] 7.663399e-51
uniroot(f, c(0+epsilon, 1), data = df_loo, c = LHS, q = 1, extendInt = "yes", maxiter = 10000, tol = 1e-200)
## $root
## [1] -9.364478e-201
##
## $f.root
## [1] -Inf
##
## $iter
## [1] 658
##
## $init.it
## [1] 1
##
## $estim.prec
## [1] 9.364478e-201
This tolerance problem is also causing something to go weird for the more significant SNPs.
Possible solution: For these SNPs with smaller P values (approx. \(P>1e-4\)), set the tolerance in the uniroot
function equal to pi.
I look at two walkthroughs of applying this hack, one for a SNP not overlapping an active annotation (q=0) and one for a SNP overlapping an active annotation (q=1).
# original method
i = 104103
pi <- df$p[i]
qi <- df$q[i]
df_loo <- df[-which(df$chrom == df[i, "chrom"]), ] # df leave-one-out
q0 <- A(data = df_loo, 1)
# qi = 0
p0 <- pi
epsilon <- min(df_loo[which(df_loo$q == 1),"p"])
LHS <- (p0 * A(df_loo, 0))/(A(df_loo, 1) * B(df_loo, p0, q = 0))
LHS
## [1] 4.089385e-15
# find the value of p1 s.t. the solution to p1/b(p1,q=1) is equal to LHS
f <- function(data, c, p1, q) (p1/length(which(data$p <= p1 & data$q == q))) - c
uniroot_output <- uniroot(f, c(0+epsilon, 1), data = df_loo, c = LHS, q = 1, extendInt = "yes", maxiter = 10000)
uniroot_output
## $root
## [1] 6.033583e-96
##
## $f.root
## [1] -4.089385e-15
##
## $iter
## [1] 1
##
## $init.it
## [1] NA
##
## $estim.prec
## [1] 6.103516e-05
p1 <- uniroot_output$root
p1
## [1] 6.033583e-96
v_orig <- p0*(1-q0) + p1*q0
Now try the new method, where I set the tolerance of the uniroot
function to the observed P value.
# new method (set tolarance for significant SNPs)
i = 104103
pi <- df$p[i]
qi <- df$q[i]
df_loo <- df[-which(df$chrom == df[i, "chrom"]), ] # df leave-one-out
q0 <- A(data = df_loo, 1)
# qi = 0
p0 <- pi
epsilon <- min(df_loo[which(df_loo$q == 1),"p"])
LHS <- (p0 * A(df_loo, 0))/(A(df_loo, 1) * B(df_loo, p0, q = 0))
f <- function(data, c, p1, q) (p1/length(which(data$p <= p1 & data$q == q))) - c
uniroot_output <- uniroot(f, c(0+epsilon, 1), data = df_loo, c = LHS, q = 1, extendInt = "yes", maxiter = 10000, tol = pi)
uniroot_output
## $root
## [1] 7.115529e-13
##
## $f.root
## [1] 0
##
## $iter
## [1] 3
##
## $init.it
## [1] NA
##
## $estim.prec
## [1] 7.115529e-13
Notice that the value of the function evaluated at the root solution is 0 (as we want, before it was -4.089385e-15).
p1 <- uniroot_output$root
p1
## [1] 7.115529e-13
v_new <- p0*(1-q0) + p1*q0
data.frame(pi, qi, v_orig, v_new)
## pi qi v_orig v_new
## 1 4.749689e-13 0 3.113425e-13 5.564719e-13
So this new tolerance thing looks promising, it gives \(V>P\) in this example where there is no annotation.
How about when there is an annotation? Yep, again looks promising. Gives \(V<P\) for \(q=1\) (active annotation).
i = 104103
pi <- df$p[i]
qi <- df$q[i]
df_loo <- df[-which(df$chrom == df[i, "chrom"]), ] # df leave-one-out
q0 <- A(data = df_loo, 1)
# qi = 1
p1 <- pi
epsilon <- min(df_loo[which(df_loo$q == 0),"p"])
LHS <- (p1 * A(df_loo, 1))/(A(df_loo, 0) * B(df_loo, p1, q = 1))
LHS
## [1] 1.468355e-15
# find the value of p0 s.t. the solution to p0/b(p0,q=0) is equal to LHS
f <- function(data, c, p0, q) (p0/length(which(data$p <= p0 & data$q == q))) - c
uniroot_output <- uniroot(f, c(0+epsilon, 1), data = df_loo, c = LHS, q = 0, extendInt = "yes", maxiter = 10000)
uniroot_output
## $root
## [1] 4.57e-201
##
## $f.root
## [1] -1.468355e-15
##
## $iter
## [1] 1
##
## $init.it
## [1] NA
##
## $estim.prec
## [1] 6.103516e-05
p1 <- uniroot_output$root
p1
## [1] 4.57e-201
v_orig <- p0*(1-q0) + p1*q0
##
uniroot_output <- uniroot(f, c(0+epsilon, 1), data = df_loo, c = LHS, q = 0, extendInt = "yes", maxiter = 10000, tol = pi)
uniroot_output
## $root
## [1] 4.595952e-13
##
## $f.root
## [1] 6.207138e-16
##
## $iter
## [1] 2
##
## $init.it
## [1] NA
##
## $estim.prec
## [1] 4.595952e-13
p1 <- uniroot_output$root
p1
## [1] 4.595952e-13
v_new <- p0*(1-q0) + p1*q0
data.frame(pi, qi, v_orig, v_new)
## pi qi v_orig v_new
## 1 4.749689e-13 0 3.113425e-13 4.696726e-13
I.e. because R thinks that e.g. \(5*10^-15\) is sufficiently close to 0, it stops iterating, whereas we need to increase the tolerance as we are dealing with some very small P values.
Of note, the curves that we are trying to set equal to the result of LHS are (different colours for different leave-one-chromosome-out datasets):
\[\begin{equation} \dfrac{p_0 A(0)}{A(1)B(p_0,0)}=\dfrac{p_1}{B(p_1,1)} \end{equation}\]
\[\begin{equation} \dfrac{p_1 A(1)}{A(0)B(p_1,1)}=\dfrac{p_0}{B(p_0,0)} \end{equation}\]
Do the above plots make sense? Why is the right hand plot shorter? Is it because if you don’t have the annotation you need a lower threshold to reject the null?
I re-run the analysis on the whole dataset, whereby if \(P<1e-4\) tol = pi
in the uniroot
function. The code for this is in cFDR/tolerance on the HPC.
The results now look more like we’d expect.
The P and the V values do look to be normally distributed with a spike near 0, as hoped.
Why does the BH corrected P value histogram look so weird???
p_BH <- p.adjust(plot_data$p, method = "BH")
hist(p_BH)
mean(p_BH<1e-4)
## [1] 0.01739563
mean(plot_data$v<1e-4)
## [1] 0.02867199
mean(p.adjust(plot_data$v, method = "BH")<1e-4)
## [1] 0.01729435
The analysis so far has only been using the binary active/inactive annotation in thymus cells to reweight P values. I have done the same analysis in all cell types, generating a data frame with the original P value for each SNP and it’s V value incorporating functional annotation data in each cell type. Not sure how best to confirm/visualise/compare these results.
## ..b_BRAIN_INFERIOR_TEMPORAL_LOBE ..b_CD14_PRIMARY_CELLS
## 1 0.88819951 0.89933898
## 2 0.37473898 0.56734821
## 3 0.11769435 0.12793981
## 4 0.37034319 0.39478131
## 5 0.05219356 0.04813996
## 6 0.67889099 0.67537966
## ..b_CD19_PRIMARY_CELLS_PERIPHERAL_UW ..b_CD3_PRIMARY_CELLS_CORD_BI
## 1 0.34363894 0.71649361
## 2 0.13011736 0.07225766
## 3 0.40217509 0.51005070
## 4 0.04501411 0.04111098
## 5 0.62097696 0.60515540
## 6 0.16013987 0.06067041
## ..b_CD3_PRIMARY_CELLS_PERIPHERAL_UW ..b_CD4_MEMORY_PRIMARY_CELLS
## 1 0.88321520 0.58555375
## 2 0.60740343 0.31875524
## 3 0.06562636 0.06908068
## 4 0.20562312 0.21986693
## 5 0.03791082 0.03947243
## 6 0.04838538 0.59695774
## ..b_CD4_NAIVE_PRIMARY_CELLS ..b_CD4._CD25._CD45RA._NAIVE_PRIMARY_CELLS
## 1 0.58873358 0.56597612
## 2 0.60321732 0.67258771
## 3 0.07093699 0.06864370
## 4 0.42701690 0.21632784
## 5 0.04047475 0.03956575
## 6 0.60015439 0.57764935
## ..b_CD4._CD25._CD45RO._MEMORY_PRIMARY_CELLS
## 1 0.60969363
## 2 0.61309660
## 3 0.07261938
## 4 0.22899426
## 5 0.04180664
## 6 0.62433886
## ..b_CD4._CD25._IL17._PMA.IONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLS
## 1 0.90266790
## 2 0.66193159
## 3 0.07054916
## 4 0.22215251
## 5 0.04057635
## 6 0.60084925
## ..b_CD4._CD25._IL17._PMA.IONOMCYIN_STIMULATED_TH17_PRIMARY_CELLS
## 1 0.56716837
## 2 0.60420304
## 3 0.06677750
## 4 0.21265731
## 5 0.03833354
## 6 0.57913690
## ..b_CD4._CD25._TH_PRIMARY_CELLS ..b_CD4._CD25._CD127._TREG_PRIMARY_CELLS
## 1 0.65697118 0.87323310
## 2 0.06812850 0.57740418
## 3 0.21167490 0.13513515
## 4 0.03925985 0.19422932
## 5 0.56442024 0.03586759
## 6 0.05241122 0.52936849
## ..b_CD4._CD25INT_CD127._TMEM_PRIMARY_CELLS ..b_CD56_PRIMARY_CELLS
## 1 0.59111703 0.3164252
## 2 0.62111675 0.0708507
## 3 0.14716130 0.2221298
## 4 0.22326269 0.0406311
## 5 0.04185371 0.5875023
## 6 0.60357166 0.1443224
## ..b_CD8_MEMORY_PRIMARY_CELLS ..b_CD8_NAIVE_PRIMARY_CELLS ..b_PANISLETS
## 1 0.8878839 0.59509320 0.6134992
## 2 0.3150166 0.58478595 0.4870544
## 3 0.1418907 0.13706140 0.1095966
## 4 0.4184013 0.41034841 0.3378930
## 5 0.0385241 0.04082018 0.0631748
## 6 0.5924170 0.60842959 0.8651093
## ..b_THYMUS snp p
## 1 0.89294493 rs61733845 0.83426886
## 2 0.58967528 rs1320571 0.46962338
## 3 0.07472900 rs9729550 0.10564986
## 4 0.41470953 rs1815606 0.32532208
## 5 0.04248477 rs7515488 0.06084803
## 6 0.62184809 rs11260562 0.85162718
Transform \(PPs \Longrightarrow 1-PPs\) (so I don’t have to change all the inequalities).
Approximate
\[\begin{equation} \dfrac{f_0(PP_i, q_i)}{f(PP_i, q_i)}=\dfrac{P(PP=PP_i|Q=q_i, H_0)P(Q=qi|H_0)}{P(PP=PP_i|Q=q_i)P(Q=qi)} \end{equation}\]
by
\[\begin{equation} \begin{split} \hat{h}(PP_i,q_i) &= \dfrac{P(PP\leq PP_|Q=q_i, H_0)P(Q=qi|H_0)}{P(PP\leq PP_|Q=q_i)P(Q=qi)} \\[3ex] &\approx \dfrac{P(PP\leq PP_i|H_0)\frac{\# \{Q=q_i, P>1/2\}}{\# \{P>1/2\}}}{\dfrac{\# \{PP\leq PP_i, Q = q_i\}}{\# \{Q=q_i\}} \times \dfrac{\# \{Q=q_i\}}{N}}. \end{split} \end{equation}\]
We can count all of these quantities but note that \(P(PP\leq PP_i|H_0)\) is harder because PPs, unlike P values, are not uniform under the null. We can use the functions from the corrcoverage
R package for this, matching parameters with those from the original study (e.g. sample sizes). The functions may have to be modified slightly to allow for the null model to be true, and it should be used for blocks independently (to obtain one fixed value). Also note that \(H_0\) is not measuring no causality (as in fine-mapping), but it is now concerned with no association.
Comments and questions
The T1D paper states that there were 135,870 SNPs after quality control, but the
ic_t1d_onengut_allresults.csv.gz
file only has 124,726 and the file I downloaded from GWAS catalog (and used in this analysis) only has approx. 123,000?The T1D paper does not apply genomic control as we discussed (comparing chi-squared distributions) but does account for cyptic relatedness using KING (to estimate the kinship coefficient and remove problematic individuals) and population structure using the PCA method in KING. I assume this means that I shouldn’t genome-control the reported P values (again)?
We had hoped that genomic-controlling the P values would help with our \(Pr(P\leq p_i|H_0)\approx pi)\) assumption, but I think the P values I am now using are already genomic-controlled.
There was talk of using this cFDR method iteratively over all cell types (i.e. use the V values from the previous cell type in the table as input (as P values) for the next cell type in the table). But this works on the assumption that all binary active annotation marks in the 19 cell types I’ve looked at are relevant? Surely I should be weighting by the relevance of each cell type or something? And surely the order of cell types I iterate over matters?
How would I get PPs for all 123,000 SNPs? At the moment, I have PPs for the SNPs in the 39 densely mapped regions on the ImmunoChip from my credible set analysis (16696).
Could we run this cFDR approach on the -log10 scale to help deal with the tolerance of the
uniroot
function problem? (because R thinks that e.g. \(5*10^{-15}\) is sufficiently close to 0, it stops iterating, whereas we need to increase the tolerance as we are dealing with some very small P values).How to compare V values for annotations in different cell types.