9th Jan

cFDR method for reweighting GWAS P values using functional data

1. Notation

\(H\) denotes a hypothesis of association with T1D; \(H=1\) indicates association, \(H=0\) indicates no association (null hypothesis).

\[\begin{equation} H \sim \text{Bernoulli}(\pi_1) \Longrightarrow P(H=1)=\pi_1 \end{equation}\]

\[\begin{equation} \begin{split} Q|H=0 \sim \text{Bernoulli}(q_0) & \Longrightarrow P(Q=1|H=0)=q_0 \\ & \Longrightarrow P(Q=0|H=0)=1-q_0 \end{split} \end{equation}\]

\[\begin{equation} P|H=0 \sim U(0,1) \end{equation}\]

\[\begin{equation} P \perp Q|H = 0 \end{equation}\]

\[\begin{equation} \begin{split} Q|H=1 \sim \text{Bernoulli}(q_1) & \Longrightarrow P(Q=1|H=1)=q_1 \\ & \Longrightarrow P(Q=0|H=1)=1-q_1 \end{split} \end{equation}\]

\[\begin{equation} f(P=p, Q=q|H_0)=f_0(p,q) \end{equation}\]

\[\begin{equation} f(P=p, Q=q|H_1)=f_1(p,q) \end{equation}\]

\[\begin{equation} f(P=p, Q=q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1=f(p,q) \end{equation}\]

2. Method

1. Generate the set of possible rejection regions.

Since \(Q\) can only be 0 or 1, we just need two P value thresholds, \(p_0\) for \(Q=0\) and \(p_1\) for \(Q=1\).

\[\begin{equation} L(p_0, p_1)=(P\leq p_0, Q=0) \cup (P\leq p_1, Q=1). \end{equation}\]

2. Pick the smallest rejection region that \((p[i], q[i])\) belongs.

Following on from our aim, we want the rejection region such that:

the probability of a null instance of \(P,Q\) falling inside \(L(p_0, p_1)\) is fixed at some value \(\alpha\), that is

\[\begin{equation} \int_{L(p_0, p_1)}df_0=\alpha \end{equation}\]

the probability of a non-null instance of \(P,Q\) falling inside \(L(p_0, p_1)\) is maximal, that is

\[\begin{equation} \int_{L(p_0, p_1)}df_1 \text{ is maximal} \end{equation}\]

Previous works (e.g. appendix 8.1 in James’ paper) show that this optimal region is given by the set of points \(\{(p,q):f_0(p,q)/f_1(p,q)\leq k_{\alpha}\}\) which means that in our case, the chosen \(p_0\) and \(p_1\) should satisfy

\[\begin{equation} \dfrac{f_0(p_0,0)}{f(p_0,0)}=\dfrac{f_0(p_1,1)}{f(p_1,1)} \end{equation}\]

using \(f(p,q)=f_0(p,q)(1-\pi_1)+f_1(p,q)\pi_1\).

To find \(p_0\) and \(p_1\) that satisfy this equation, we derive an estimator of

\[\begin{equation} \begin{split} \dfrac{f_0(p,q)}{f(p,q)} &= \dfrac{f(P=p, Q=q|H_0)}{f(P=p, Q=q)}. \end{split} \end{equation}\]

Let

\[\begin{equation} \begin{split} \hat{h}(p,q) &= \dfrac{Pr(P\leq p, Q=q|H_0)}{Pr(P\leq p, Q=q)} \\[2ex] & = \dfrac{Pr(P\leq p| Q=q, H_0)Pr(Q=q|H_0)}{Pr(P\leq p|Q=q)Pr(Q=q)} \\[2ex] & \approx \dfrac{p\times Pr(Q=q|H_0)}{Pr(P\leq p|Q=q)Pr(Q=q)} \\[2ex] & \approx \dfrac{p\times \color{red}{A(q)}}{\color{blue}{B(p, q)}/N} \end{split} \end{equation}\]

where \(\color{red}{A(q)=\dfrac{\# \{Q=q, p>1/2\}}{\# \{p>1/2\}}}\) and \(\color{blue}{B(p, q)=\# \{P\leq p, Q = q\}}\) (since \(\# \{Q = q\}\) cancels in the denominator).

So we want to solve

\[\begin{equation} \hat{h}(p_0,q = 0)=\hat{h}(p_1, q = 1) \end{equation}\]

for

\(p_1\) if \(q=0\) (set \(p_0=p_i\))
\(p_0\) if \(q=1\) (set \(p_1=p_i\))

3. Integrate the null distribution over this region to get \(v\).

\[\begin{equation} \begin{split} \int_{L(p_0,p_1)}df_0 & = Pr(P,Q \in L(p_0, p_1)|H_0) \\ & = Pr((P\leq p_0, Q=0) \cup (P\leq p_1, Q=1) | H_0) \\ & = Pr(P\leq p_0, Q=0|H_0) + Pr(P\leq p_1, Q=1|H_1) \\ & = Pr(P\leq p_0|Q=0, H_0)Pr(Q=0|H_0) + Pr(P\leq p_1|Q=1, H_0)Pr(Q=1|H_0) \\ & = p_0\times (1-q_0) + p_1\times q_0 \end{split} \end{equation}\]

where \(q_0=A(1)\).

3. New data set

I have downloaded the case-control T1D GWAS results (Onengut-Gumuscu et al.) from the GWAS catalog. This consists of 123130 SNPs genotyped in 6670 T1D cases and 12262 controls.

I breifly compare this to the ic_t1d_onengut_allresults.csv.gz file on the HPC which I assume also includes the meta-analysis (124726 SNPs). For the SNPs that can be matched by position, I compare the reported P values in the two datasets.

The T1D paper (cc_ichip.sumstats from GWAS Catalog) does not apply genomic control as we discussed (comparing chi-squared distributions) but does account for cyptic relatedness using KING (to estimate the kinship coefficient and remove problematic individuals) and population structure using the PCA method in KING. I assume this means that I shouldn’t genome-control the P values (again). Note: I checked, and this data set gives a lambda value of 0.93 suggesting that there is no population structure (although shouldn’t this equal 1 in a completely homogenous population?).

I decide to use the publically avaliable data downloaded directly from the GWAS catalog for my analysis.

4. Annotation data set

To derive a final annotation data set, for each of the 123130 SNPs in the data set, I record their reported genomic annotation in 19 different cell types (mostly immune related except brain cells to use as a control) using the segway encyclopedia. For each cell type, I also constuct a binary “active” column that is 1 if the SNP falls in enhancer, transcribed, promoter or reg permissive, 0 if the SNP falls in quiescent, constit het or facult het and NA if the SNP falls in bivalent, unclassified or low confidence.

snps <- readRDS("final_final_annots.RDS")
snps[1,]

##   chrom     pos       rsid other_allele effect_allele         p
## 1     1 1118275 rs61733845            C             T 0.8342689
##          beta         se     OR  OR_lower OR_upper
## 1 -0.01126319 0.05383169 0.9888 0.8897869 1.098831
##   BRAIN_INFERIOR_TEMPORAL_LOBE CD14_PRIMARY_CELLS
## 1               FacultativeHet     FacultativeHet
##   CD19_PRIMARY_CELLS_PERIPHERAL_UW CD3_PRIMARY_CELLS_CORD_BI
## 1                         Bivalent                  Bivalent
##   CD3_PRIMARY_CELLS_PERIPHERAL_UW CD4_MEMORY_PRIMARY_CELLS
## 1                  FacultativeHet                 Enhancer
##   CD4_NAIVE_PRIMARY_CELLS CD4+_CD25-_CD45RA+_NAIVE_PRIMARY_CELLS
## 1           RegPermissive                               Enhancer
##   CD4+_CD25-_CD45RO+_MEMORY_PRIMARY_CELLS
## 1                                Enhancer
##   CD4+_CD25-_IL17-_PMA-IONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLS
## 1                                                           FacultativeHet
##   CD4+_CD25-_IL17+_PMA-IONOMCYIN_STIMULATED_TH17_PRIMARY_CELLS
## 1                                                     Enhancer
##   CD4+_CD25-_TH_PRIMARY_CELLS CD4+_CD25+_CD127-_TREG_PRIMARY_CELLS
## 1                    Bivalent                       FacultativeHet
##   CD4+_CD25INT_CD127+_TMEM_PRIMARY_CELLS CD56_PRIMARY_CELLS
## 1                               Enhancer           Bivalent
##   CD8_MEMORY_PRIMARY_CELLS CD8_NAIVE_PRIMARY_CELLS   PANISLETS    THYMUS
## 1           FacultativeHet           RegPermissive Transcribed Quiescent
##   b_BRAIN_INFERIOR_TEMPORAL_LOBE b_CD14_PRIMARY_CELLS
## 1                              0                    0
##   b_CD19_PRIMARY_CELLS_PERIPHERAL_UW b_CD3_PRIMARY_CELLS_CORD_BI
## 1                                 NA                          NA
##   b_CD3_PRIMARY_CELLS_PERIPHERAL_UW b_CD4_MEMORY_PRIMARY_CELLS
## 1                                 0                          1
##   b_CD4_NAIVE_PRIMARY_CELLS b_CD4+_CD25-_CD45RA+_NAIVE_PRIMARY_CELLS
## 1                         1                                        1
##   b_CD4+_CD25-_CD45RO+_MEMORY_PRIMARY_CELLS
## 1                                         1
##   b_CD4+_CD25-_IL17-_PMA-IONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLS
## 1                                                                          0
##   b_CD4+_CD25-_IL17+_PMA-IONOMCYIN_STIMULATED_TH17_PRIMARY_CELLS
## 1                                                              1
##   b_CD4+_CD25-_TH_PRIMARY_CELLS b_CD4+_CD25+_CD127-_TREG_PRIMARY_CELLS
## 1                            NA                                      0
##   b_CD4+_CD25INT_CD127+_TMEM_PRIMARY_CELLS b_CD56_PRIMARY_CELLS
## 1                                        1                   NA
##   b_CD8_MEMORY_PRIMARY_CELLS b_CD8_NAIVE_PRIMARY_CELLS b_PANISLETS
## 1                          0                         1           1
##   b_THYMUS PP
## 1        0 NA

The baseline proportions of annotations across the ~123,000 ImmunoChip SNPs are shown below:

And the enrichment of annotations amongst genome-wide significant SNPs is shown below. E.g. we see less enrichment of constitutive heterochromatin amongst genome-wide significant SNPs.

5a. Implementing cFDR approach

Changes from last week:

Use new P values downloaded straight from the GWAS catalog (was using stats-t1d.csv file before but not sure where these P values came from).
Fix error for big P values (\(P>0.6\)), where the solution to \(\hat{h}(p_0,q=0)=\hat{h}(p_1,q=1)\) for \(p_1\) is \(>1\) (but all \(<1.3\)).
1. Use extendInt parameter in the uniroot function and set \(p_1=1\) if the solution found by uniroot is \(>1\). [This approach seems to work well - I use this approach in my analysis].
2. In these instances, where fixing \(p_0=p_i\) and solving for \(p_1\) does not give a solution in the range [0,1], try instead fixing \(p_1\) to 1 and solving for \(p_0\). This analysis is in cFDR/revaryfixedp on HPC. [This approach does not work - gives V values smaller than P values for \(q=0\)].
Fix error for very small P values, where there are no P values smaller than this in the leave-one-chromosome-out dataset (and so \(B(p, q)=\# \{P\leq p, Q = q\}=0\)). In my results, this happens for the smallest 3 P values in the whole data set (P=4e-201, 3e-162, 6e-96) (snps=rs689 (q=0), rs3842727 (q=0), rs2476601 (q=1) respectively) so I just omit these for now as these are obviously significant. Note that rs689 and rs2476601 are in INS and PTPN22 respectively.
Use the data in the leave-one-chromosome-out dataset to select epsilon (the lower bound for the search space of the uniroot function). Set this to the smallest P value in the leave-one-chromosome-out dataset for the relevant value of q.
I tried forcing SNPs with P values \(>1/2\) or \(<1e-20\) straight through the pipeline and setting their V value equal to their P value (which obviously avoided the error for very large and very small P values discussed above) but this gave strange histograms so I’ve decided to run the analysis for all P values.
Check that all the inequalities make sense (e.g. \(P(H_0|v<\alpha)\leq \alpha\) - We’re solving for \(P(H_0|v<\alpha)= \alpha\)). [TO DO]

Anomaly point

The point in the bottom right (of the top right plot) is rs6679677 (row 5951 of data), i.e. the SNP in perfect LD with rs2476601 in the PTPN22 region. The P value is 4.031301e-95 whereas the V value is 2.147119e-05 (q=0). Note that the P value for rs2476601 is 6e-96 (q=1) but a V value could not be found because there were no SNPs in the leave-one-chromosome-out dataset with smaller P values.

Why is the V value so much smaller than the P value?

\(p_1\) seems much too large…

MyData <- readRDS("final_final_annots.RDS")

df <- na.omit(MyData[,c("rsid","chrom","p","b_THYMUS")])

colnames(df)[c(3,4)] <- c("p","q")

A <- function(data, q){ # empirical estimate of P(Q = qi | H0)
  length(which(data$q == q & data$p > 0.5))/length(which(data$p > 0.5))
}

B <- function(data, p, q){ # numerator of empirical estimate of P(P < pi | Q = qi)
  # as denominator cancels in h
  length(which(data$p <= p & data$q == q)) # to deal with smallest P vals
}

i = 5951

pi <- df$p[i]
qi <- df$q[i]

df_loo <- df[-which(df$chrom == df[i, "chrom"]), ] # df leave-one-out

q0 <- A(data = df_loo, 1)

# qi = 0
  
p0 <- pi
  
epsilon <- min(df_loo[which(df_loo$q == 1),"p"])
  
LHS <- (p0 * A(df_loo, 0))/(A(df_loo, 1) * B(df_loo, p0, q = 0))

f <- function(data, c, p1, q) (p1/length(which(data$p <= p1 & data$q == q))) - c 

p1 <- uniroot(f, c(0+epsilon, 1), data = df_loo, c = LHS, q = 1, extendInt = "yes", maxiter = 5000)$root

p1

## [1] 6.064557e-05

\(p_1\) should be the solution to the following equation

\[\begin{equation} \dfrac{p_0 A(0)}{A(1) B(p_0, q = 0)} = \dfrac{p_1}{B(p_1, q = 1)} \end{equation}\]

In this case, the LHS = 3.677574e-95 but my code is returning a solution of \(p_1=6.064557e-05\), where there doesn’t appear to be a solution… It seems that the tol parameter in the uniroot function causes problems, whereby the solution (\(p_1\)) just continues to get smaller as I increase the tolerance.

uniroot(f, c(0+epsilon, 1), data = df_loo, c = LHS, q = 1, extendInt = "yes", maxiter = 10000, tol = 1e-50)

## $root
## [1] -7.663399e-51
## 
## $f.root
## [1] -Inf
## 
## $iter
## [1] 160
## 
## $init.it
## [1] 1
## 
## $estim.prec
## [1] 7.663399e-51

uniroot(f, c(0+epsilon, 1), data = df_loo, c = LHS, q = 1, extendInt = "yes", maxiter = 10000, tol = 1e-200)

## $root
## [1] -9.364478e-201
## 
## $f.root
## [1] -Inf
## 
## $iter
## [1] 658
## 
## $init.it
## [1] 1
## 
## $estim.prec
## [1] 9.364478e-201

This tolerance problem is also causing something to go weird for the more significant SNPs.

Possible solution: For these SNPs with smaller P values (approx. \(P>1e-4\)), set the tolerance in the uniroot function equal to pi.

I look at two walkthroughs of applying this hack, one for a SNP not overlapping an active annotation (q=0) and one for a SNP overlapping an active annotation (q=1).

SNP with q=0

# original method

i = 104103

pi <- df$p[i]
qi <- df$q[i]

df_loo <- df[-which(df$chrom == df[i, "chrom"]), ] # df leave-one-out

q0 <- A(data = df_loo, 1)

# qi = 0
  
p0 <- pi
  
epsilon <- min(df_loo[which(df_loo$q == 1),"p"])
  
LHS <- (p0 * A(df_loo, 0))/(A(df_loo, 1) * B(df_loo, p0, q = 0))

LHS

## [1] 4.089385e-15

# find the value of p1 s.t. the solution to p1/b(p1,q=1) is equal to LHS

f <- function(data, c, p1, q) (p1/length(which(data$p <= p1 & data$q == q))) - c 

uniroot_output <- uniroot(f, c(0+epsilon, 1), data = df_loo, c = LHS, q = 1, extendInt = "yes", maxiter = 10000)

uniroot_output

## $root
## [1] 6.033583e-96
## 
## $f.root
## [1] -4.089385e-15
## 
## $iter
## [1] 1
## 
## $init.it
## [1] NA
## 
## $estim.prec
## [1] 6.103516e-05

p1 <- uniroot_output$root
p1

## [1] 6.033583e-96

v_orig <- p0*(1-q0) + p1*q0

Now try the new method, where I set the tolerance of the uniroot function to the observed P value.

# new method (set tolarance for significant SNPs)

i = 104103

pi <- df$p[i]
qi <- df$q[i]

df_loo <- df[-which(df$chrom == df[i, "chrom"]), ] # df leave-one-out

q0 <- A(data = df_loo, 1)

# qi = 0
  
p0 <- pi
  
epsilon <- min(df_loo[which(df_loo$q == 1),"p"])
  
LHS <- (p0 * A(df_loo, 0))/(A(df_loo, 1) * B(df_loo, p0, q = 0))

f <- function(data, c, p1, q) (p1/length(which(data$p <= p1 & data$q == q))) - c 

uniroot_output <- uniroot(f, c(0+epsilon, 1), data = df_loo, c = LHS, q = 1, extendInt = "yes", maxiter = 10000, tol = pi)

uniroot_output

## $root
## [1] 7.115529e-13
## 
## $f.root
## [1] 0
## 
## $iter
## [1] 3
## 
## $init.it
## [1] NA
## 
## $estim.prec
## [1] 7.115529e-13

Notice that the value of the function evaluated at the root solution is 0 (as we want, before it was -4.089385e-15).

p1 <- uniroot_output$root
p1

## [1] 7.115529e-13

v_new <- p0*(1-q0) + p1*q0

data.frame(pi, qi, v_orig, v_new)

##             pi qi       v_orig        v_new
## 1 4.749689e-13  0 3.113425e-13 5.564719e-13

So this new tolerance thing looks promising, it gives \(V>P\) in this example where there is no annotation.

SNP with q=1

How about when there is an annotation? Yep, again looks promising. Gives \(V<P\) for \(q=1\) (active annotation).

i = 104103

pi <- df$p[i]
qi <- df$q[i]

df_loo <- df[-which(df$chrom == df[i, "chrom"]), ] # df leave-one-out

q0 <- A(data = df_loo, 1)

# qi = 1

p1 <- pi

epsilon <- min(df_loo[which(df_loo$q == 0),"p"])

LHS <- (p1 * A(df_loo, 1))/(A(df_loo, 0) * B(df_loo, p1, q = 1)) 

LHS

## [1] 1.468355e-15

# find the value of p0 s.t. the solution to p0/b(p0,q=0) is equal to LHS

f <- function(data, c, p0, q) (p0/length(which(data$p <= p0 & data$q == q))) - c

uniroot_output <- uniroot(f, c(0+epsilon, 1), data = df_loo, c = LHS, q = 0, extendInt = "yes", maxiter = 10000)

uniroot_output

## $root
## [1] 4.57e-201
## 
## $f.root
## [1] -1.468355e-15
## 
## $iter
## [1] 1
## 
## $init.it
## [1] NA
## 
## $estim.prec
## [1] 6.103516e-05

p1 <- uniroot_output$root
p1

## [1] 4.57e-201

v_orig <- p0*(1-q0) + p1*q0

##

uniroot_output <- uniroot(f, c(0+epsilon, 1), data = df_loo, c = LHS, q = 0, extendInt = "yes", maxiter = 10000, tol = pi)

uniroot_output

## $root
## [1] 4.595952e-13
## 
## $f.root
## [1] 6.207138e-16
## 
## $iter
## [1] 2
## 
## $init.it
## [1] NA
## 
## $estim.prec
## [1] 4.595952e-13

p1 <- uniroot_output$root
p1

## [1] 4.595952e-13

v_new <- p0*(1-q0) + p1*q0

data.frame(pi, qi, v_orig, v_new)

##             pi qi       v_orig        v_new
## 1 4.749689e-13  0 3.113425e-13 4.696726e-13

I.e. because R thinks that e.g. \(5*10^-15\) is sufficiently close to 0, it stops iterating, whereas we need to increase the tolerance as we are dealing with some very small P values.

Of note, the curves that we are trying to set equal to the result of LHS are (different colours for different leave-one-chromosome-out datasets):

Left hand plot: If \(qi=0\), set \(p_0=pi\) and solve the following for \(p_1\),

\[\begin{equation} \dfrac{p_0 A(0)}{A(1)B(p_0,0)}=\dfrac{p_1}{B(p_1,1)} \end{equation}\]

Right hand plot: If \(qi=1\), set \(p_1=pi\) and solve the following for \(p_0\),

\[\begin{equation} \dfrac{p_1 A(1)}{A(0)B(p_1,1)}=\dfrac{p_0}{B(p_0,0)} \end{equation}\]

Do the above plots make sense? Why is the right hand plot shorter? Is it because if you don’t have the annotation you need a lower threshold to reject the null?

5b. Implementing cFDR approach with tolerance parameter

I re-run the analysis on the whole dataset, whereby if \(P<1e-4\) tol = pi in the uniroot function. The code for this is in cFDR/tolerance on the HPC.

The results now look more like we’d expect.

6. Checking results

The P and the V values do look to be normally distributed with a spike near 0, as hoped.

Comparison with BH method

Why does the BH corrected P value histogram look so weird???

p_BH <- p.adjust(plot_data$p, method = "BH")
hist(p_BH)

mean(p_BH<1e-4)

## [1] 0.01739563

mean(plot_data$v<1e-4)

## [1] 0.02867199

mean(p.adjust(plot_data$v, method = "BH")<1e-4)

## [1] 0.01729435

7. Analysis for individual cell types

The analysis so far has only been using the binary active/inactive annotation in thymus cells to reweight P values. I have done the same analysis in all cell types, generating a data frame with the original P value for each SNP and it’s V value incorporating functional annotation data in each cell type. Not sure how best to confirm/visualise/compare these results.

##   ..b_BRAIN_INFERIOR_TEMPORAL_LOBE ..b_CD14_PRIMARY_CELLS
## 1                       0.88819951             0.89933898
## 2                       0.37473898             0.56734821
## 3                       0.11769435             0.12793981
## 4                       0.37034319             0.39478131
## 5                       0.05219356             0.04813996
## 6                       0.67889099             0.67537966
##   ..b_CD19_PRIMARY_CELLS_PERIPHERAL_UW ..b_CD3_PRIMARY_CELLS_CORD_BI
## 1                           0.34363894                    0.71649361
## 2                           0.13011736                    0.07225766
## 3                           0.40217509                    0.51005070
## 4                           0.04501411                    0.04111098
## 5                           0.62097696                    0.60515540
## 6                           0.16013987                    0.06067041
##   ..b_CD3_PRIMARY_CELLS_PERIPHERAL_UW ..b_CD4_MEMORY_PRIMARY_CELLS
## 1                          0.88321520                   0.58555375
## 2                          0.60740343                   0.31875524
## 3                          0.06562636                   0.06908068
## 4                          0.20562312                   0.21986693
## 5                          0.03791082                   0.03947243
## 6                          0.04838538                   0.59695774
##   ..b_CD4_NAIVE_PRIMARY_CELLS ..b_CD4._CD25._CD45RA._NAIVE_PRIMARY_CELLS
## 1                  0.58873358                                 0.56597612
## 2                  0.60321732                                 0.67258771
## 3                  0.07093699                                 0.06864370
## 4                  0.42701690                                 0.21632784
## 5                  0.04047475                                 0.03956575
## 6                  0.60015439                                 0.57764935
##   ..b_CD4._CD25._CD45RO._MEMORY_PRIMARY_CELLS
## 1                                  0.60969363
## 2                                  0.61309660
## 3                                  0.07261938
## 4                                  0.22899426
## 5                                  0.04180664
## 6                                  0.62433886
##   ..b_CD4._CD25._IL17._PMA.IONOMYCIN_STIMULATED_MACS_PURIFIED_TH_PRIMARY_CELLS
## 1                                                                   0.90266790
## 2                                                                   0.66193159
## 3                                                                   0.07054916
## 4                                                                   0.22215251
## 5                                                                   0.04057635
## 6                                                                   0.60084925
##   ..b_CD4._CD25._IL17._PMA.IONOMCYIN_STIMULATED_TH17_PRIMARY_CELLS
## 1                                                       0.56716837
## 2                                                       0.60420304
## 3                                                       0.06677750
## 4                                                       0.21265731
## 5                                                       0.03833354
## 6                                                       0.57913690
##   ..b_CD4._CD25._TH_PRIMARY_CELLS ..b_CD4._CD25._CD127._TREG_PRIMARY_CELLS
## 1                      0.65697118                               0.87323310
## 2                      0.06812850                               0.57740418
## 3                      0.21167490                               0.13513515
## 4                      0.03925985                               0.19422932
## 5                      0.56442024                               0.03586759
## 6                      0.05241122                               0.52936849
##   ..b_CD4._CD25INT_CD127._TMEM_PRIMARY_CELLS ..b_CD56_PRIMARY_CELLS
## 1                                 0.59111703              0.3164252
## 2                                 0.62111675              0.0708507
## 3                                 0.14716130              0.2221298
## 4                                 0.22326269              0.0406311
## 5                                 0.04185371              0.5875023
## 6                                 0.60357166              0.1443224
##   ..b_CD8_MEMORY_PRIMARY_CELLS ..b_CD8_NAIVE_PRIMARY_CELLS ..b_PANISLETS
## 1                    0.8878839                  0.59509320     0.6134992
## 2                    0.3150166                  0.58478595     0.4870544
## 3                    0.1418907                  0.13706140     0.1095966
## 4                    0.4184013                  0.41034841     0.3378930
## 5                    0.0385241                  0.04082018     0.0631748
## 6                    0.5924170                  0.60842959     0.8651093
##   ..b_THYMUS        snp          p
## 1 0.89294493 rs61733845 0.83426886
## 2 0.58967528  rs1320571 0.46962338
## 3 0.07472900  rs9729550 0.10564986
## 4 0.41470953  rs1815606 0.32532208
## 5 0.04248477  rs7515488 0.06084803
## 6 0.62184809 rs11260562 0.85162718

8. cFDR for PPs

Transform \(PPs \Longrightarrow 1-PPs\) (so I don’t have to change all the inequalities).
Approximate

\[\begin{equation} \dfrac{f_0(PP_i, q_i)}{f(PP_i, q_i)}=\dfrac{P(PP=PP_i|Q=q_i, H_0)P(Q=qi|H_0)}{P(PP=PP_i|Q=q_i)P(Q=qi)} \end{equation}\]

\[\begin{equation} \begin{split} \hat{h}(PP_i,q_i) &= \dfrac{P(PP\leq PP_|Q=q_i, H_0)P(Q=qi|H_0)}{P(PP\leq PP_|Q=q_i)P(Q=qi)} \\[3ex] &\approx \dfrac{P(PP\leq PP_i|H_0)\frac{\# \{Q=q_i, P>1/2\}}{\# \{P>1/2\}}}{\dfrac{\# \{PP\leq PP_i, Q = q_i\}}{\# \{Q=q_i\}} \times \dfrac{\# \{Q=q_i\}}{N}}. \end{split} \end{equation}\]

Proceed as in the P value approach.

We can count all of these quantities but note that \(P(PP\leq PP_i|H_0)\) is harder because PPs, unlike P values, are not uniform under the null. We can use the functions from the corrcoverage R package for this, matching parameters with those from the original study (e.g. sample sizes). The functions may have to be modified slightly to allow for the null model to be true, and it should be used for blocks independently (to obtain one fixed value). Also note that \(H_0\) is not measuring no causality (as in fine-mapping), but it is now concerned with no association.

Comments and questions

The T1D paper states that there were 135,870 SNPs after quality control, but the ic_t1d_onengut_allresults.csv.gz file only has 124,726 and the file I downloaded from GWAS catalog (and used in this analysis) only has approx. 123,000?
The T1D paper does not apply genomic control as we discussed (comparing chi-squared distributions) but does account for cyptic relatedness using KING (to estimate the kinship coefficient and remove problematic individuals) and population structure using the PCA method in KING. I assume this means that I shouldn’t genome-control the reported P values (again)?
We had hoped that genomic-controlling the P values would help with our \(Pr(P\leq p_i|H_0)\approx pi)\) assumption, but I think the P values I am now using are already genomic-controlled.
There was talk of using this cFDR method iteratively over all cell types (i.e. use the V values from the previous cell type in the table as input (as P values) for the next cell type in the table). But this works on the assumption that all binary active annotation marks in the 19 cell types I’ve looked at are relevant? Surely I should be weighting by the relevance of each cell type or something? And surely the order of cell types I iterate over matters?
How would I get PPs for all 123,000 SNPs? At the moment, I have PPs for the SNPs in the 39 densely mapped regions on the ImmunoChip from my credible set analysis (16696).
Could we run this cFDR approach on the -log10 scale to help deal with the tolerance of the uniroot function problem? (because R thinks that e.g. \(5*10^{-15}\) is sufficiently close to 0, it stops iterating, whereas we need to increase the tolerance as we are dealing with some very small P values).
How to compare V values for annotations in different cell types.