Genomic control


It is important that we use genomic control (or something similar) to ensure that the P values are uniformly distributed under the null, so that \(P(P<p)=p\). If \(P(P<p)>p\) then we may see inflation of the v values. Note, however, that genomic control is a brute-force tool and is not necessarily appropriate for targeted chips such as the ImmunoChip.

I am unsure whether the P values have already been genomic-controlled. I derive the \(\lambda\) value using various combinations of SNPs:

                                lambda   lambda_1000
Use all SNPs                    1.6      1.07
Use null reading SNPs           0.93     0.99
Use SNPs in non-dense regions   1.3      1.03
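
For reference, a minimal sketch of how these values can be computed (gc_lambda, n_cases and n_controls are placeholder names; the actual sample sizes are not shown here):

gc_lambda <- function(p, n_cases, n_controls) {
  chisq  <- qchisq(p, df = 1, lower.tail = FALSE)  # convert P values to 1-df chi-squared statistics
  lambda <- median(chisq) / qchisq(0.5, df = 1)    # genomic inflation factor
  # rescale to an equivalent study of 1000 cases and 1000 controls
  lambda_1000 <- 1 + (lambda - 1) * (1/n_cases + 1/n_controls) / (1/1000 + 1/1000)
  c(lambda = lambda, lambda_1000 = lambda_1000)
}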

The only mention of genomic control in the manuscript is the following (note that they use 135,870 SNPs that passed QC, whereas my dataset only has 124,000 SNPs):

“We observed inflation of test statistics across all SNPs that passed quality control (\(\lambda_{1,000} = 1.09\)), which was expected as the Immunochip was designed to target robustly defined immune-mediated disease susceptibility loci. Excluding SNPs from regions reported here, \(\lambda_{1,000}\) was reduced to 1.07; excluding all densely genotyped regions reduced \(\lambda_{1,000}\) to 1.03”

My values are comparable with theirs, so I decide not to perform genomic control on my P values.


cFDR


I am working on new cFDR code that automatically incorporates the adjustment step.

\[\begin{equation} \begin{split} cFDR(p,q) &= P(H_0^p|P\leq p, Q\leq q)\\[2ex] &= \dfrac{P(P\leq p|H_0^p, Q\leq q) P(H_0^p|Q\leq q)}{P(P\leq p |Q\leq q)}\\[2ex] &= \dfrac{P(P\leq p|H_0^p)}{\dfrac{P(P\leq p, Q\leq q)}{P(Q\leq q)}} P(H_0^p|Q\leq q)\\[2ex] &= \dfrac{p P(Q\leq q)}{P(P\leq p, Q\leq q)} \times \dfrac{P(Q\leq q|H_0^p)P(H_0^p)}{P(Q\leq q)}\\[2ex] &\leq \dfrac{p}{P(P\leq p, Q\leq q)} \times P(Q\leq q|H_0^p) \end{split} \end{equation}\]

where the final line follows because \(P(H_0^p)\leq 1\).
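
For intuition, the final line can be estimated empirically at a single point \((p_0, q_0)\) by approximating \(P(Q\leq q_0|H_0^p)\) with the ecdf of \(q\) among SNPs with \(p>1/2\) (the same approximation used for the KDEs below). A minimal sketch (cfdr_hat is an illustrative name, not part of the pipeline below):

cfdr_hat <- function(p, q, p0, q0) {
  joint <- mean(p <= p0 & q <= q0)   # estimates P(P <= p0, Q <= q0)
  q_h0  <- mean(q[p > 0.5] <= q0)    # estimates P(Q <= q0 | H0^p)
  p0 * q_h0 / joint
}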

library(MASS)           # kde2d, bandwidth.nrd
library(fields)         # interp.surface
library(polyCub)        # polyCub
library(spatstat.geom)  # owin (library(spatstat) in older versions)
library(magrittr)       # %>%

pq_full <- readRDS("realdata.RDS")

p <- pq_full$p
q <- pq_full$q

# ensure smaller q are enriched for smaller p
if(cor(p,q,method="spearman") < 0) q <- -q

chr1 = which(pq_full$chr == 1)

# problems for very small p values (inflation)
# and very big p values (de-inflation) so pick out these
candidate_indices = c(which(pq_full$p<0.0005 | pq_full$p>0.99), sample(1:nrow(pq_full), 1000))

indices = intersect(chr1, candidate_indices)
fold = chr1 # leave out these indices
mode = 2 # leave one-chromosome-out
adj = TRUE # new method does adjustment separately - will look at including later
nt = 5000 # resolution of L curves (p)
nv = 1000 # resolution of L curves (q)
res_p = 500 # resolution for KDE estimates
res_q = 200
gx = min(p[indices])
zp = -qnorm(p/2)
maxx = max(zp)
maxy = max(q)
miny = min(q)

if (mode==2) sub = setdiff(1:length(zp), fold) else sub = 1:length(zp)

# define limits of q by adding 10% either side
q_10 <- (max(q[sub]) - min(q[sub])) * 0.1

# define upper limit of zp by adding 10%
zp_10 <- (max(zp[sub]) - min(zp[sub])) * 0.1

lims <- c(0, maxx + zp_10, miny-q_10, maxy+q_10) # c(xl, xu, yl, yu)

# folded normal KDE
kpq <- kde2d(c(zp[sub], -zp[sub]), c(q[sub], q[sub]), n=c(res_p, res_q), lims = lims) #h  = c(bandwidth.nrd(c(zp[sub], -zp[sub])), bandwidth.nrd(c(q[sub], q[sub]))))

qs  = q[sub][which(p[sub]>0.5)]
kq = density(c(qs, qs), from=lims[3], to=lims[4], n=res_q) #bw = bandwidth.nrd(c(qs, qs)))

I will look more into this at a later date but note that kde2d and density have different default bandwidths.
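
For reference, as far as I can tell MASS::kde2d divides its bandwidth argument by 4 internally (so its default bandwidth.nrd corresponds to a kernel standard deviation of bandwidth.nrd/4), whereas density uses bw as the kernel standard deviation directly and defaults to bw.nrd0. One way to put the two estimates on a comparable footing might be the following sketch (qs_dup and kq_matched are illustrative names, not used below):

qs_dup     <- c(qs, qs)   # duplicated q sample, matching the folded zp sample above
kq_matched <- density(qs_dup, bw = bandwidth.nrd(qs_dup)/4,
                      from = lims[3], to = lims[4], n = res_q)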


I ensure that the univariate PDF (\(P(Q=q|H0)\)) integrates to 1:

integrate(approxfun(kq$x, kq$y), lims[3], lims[4])
## 1.00096 with absolute error < 5.5e-05
kq$y <- kq$y/integrate(approxfun(kq$x, kq$y), lims[3], lims[4])$value

I then make the bivariate PDF (\(P(P=p, Q=q)\)) integrate to 1 (I think this is the right thing to do?)

cell_size <- (diff(range(lims[1], lims[2])) / res_p) * (diff(range(lims[3], lims[4])) / res_q)
integ <- sum(kpq$z) * cell_size
integ
## [1] 0.5071539
kpq$z <- kpq$z/integ

I compare on a single plot the univariate PDF, \(P(Q=q|H0)\), and the bivariate PDF for a fixed P at \(P=1\):
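
A sketch of how this comparison can be drawn (slice and dq are illustrative names; rescaling the slice so it integrates to 1 is my own choice, so that the two curves are on the same scale):

dq    <- diff(kpq$y)[1]   # grid spacing in q
slice <- kpq$z[1, ]       # bivariate KDE at the first row of the z_p grid, i.e. P = 1
plot(kq$x, kq$y, type = "l", xlab = "q", ylab = "density")
lines(kpq$y, slice / sum(slice * dq), lty = 2)
legend("topright", c("P(Q = q | H0)", "bivariate KDE slice at P = 1 (rescaled)"), lty = 1:2)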


I then find the cdfs by summing across rows/columns. [Note: should \(P(P\leq 1, Q\leq lims[4])=1\) and \(P(Q\leq lims[4]|H0)=1\)? If so, another standardisation method could be to ensure that the top-right entry of each cdf matrix is 1.]

##### sum to get P(Q <= q | h0)
kqh0 <- t(outer(kq$y, rep(1,res_p)))
int_kqh0 <- t(apply(kqh0, 1, cumsum))

##### sum to get P(P <= p, Q <= q)
int_kpq <- t(apply(apply(kpq$z[res_p:1,], 2, cumsum), 1, cumsum))[res_p:1, ]
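
The alternative standardisation mentioned in the note above would look something like this (a sketch only; int_kqh0_alt and int_kpq_alt are illustrative names and are not used below):

int_kqh0_alt <- int_kqh0 / int_kqh0[1, res_q]  # forces P(Q <= lims[4] | H0) = 1
int_kpq_alt  <- int_kpq  / int_kpq[1, res_q]   # forces P(P <= 1, Q <= lims[4]) = 1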

But now we have an even bigger problem of scale differences…

I use a sneaky normalisation trick and define kgrid \((=P(P\leq p, Q\leq q)/P(Q\leq q|H0))\).

int_kqh0 <- int_kqh0 * (max(int_kpq)/max(int_kqh0)) # rescale so the two cdf matrices share the same maximum

int_kpq[which(int_kpq==0)] = min(int_kpq[which(int_kpq>0)]) # avoid dividing by 0
int_kqh0[which(int_kqh0==0)] = min(int_kqh0[which(int_kqh0>0)]) # avoid dividing by 0

kgrid <- kpq
kgrid$z <- int_kpq/int_kqh0


I then use this to define the adjusted cFDR estimator, ensuring it is strictly decreasing:

# int_p is P(P\leq p)
int_p = outer(2*pnorm(-kpq$x), rep(1,res_q))

cgrid = kgrid
cgrid$z = int_p/kgrid$z
cgrid$z = apply(cgrid$z, 2, cummin) # make non-increasing
cgrid$z = apply(cgrid$z, 2, function(y) y-gx*cgrid$x+gx*maxx) # make strictly decreasing


I then proceed as normal to find the L curves.

if (!is.null(indices)) ccut = interp.surface(cgrid, cbind(pmin(zp[indices], maxx), pmin(q[indices], maxy))) 

max(ccut)
## [1] 0.994664
# define support of L curves
yval2 = seq(lims[3], lims[4], length.out=nv)
xval2 = outer(rep(1,length(ccut)), yval2)
xtest = seq(0, lims[2], length.out=nt)
ptest = 2*pnorm(-xtest)

for (i in 1:length(yval2)) {
  xdenom = interp.surface(kgrid, cbind(xtest, rep(yval2[i], length(xtest))))
  cfx = cummin(ptest/xdenom)
  xval2[,i] = approx(cfx-gx*xtest  + gx*maxx, xtest, ccut, rule=2, method="const", f=1)$y
}

Note that something is going wrong for large P values…


I then do the final integration step to obtain the v values.

# P | H0 
fph0 = function(x) dunif(x)

# Q|H0
qs  = q[sub][which(p[sub]>0.5)]
kq = density(c(qs, qs), from=lims[3], to=lims[4], n=res_q) # define on same support as L curves

fqh0 <- approxfun(kq$x, kq$y)

fpqh0 <- function(s){
  fph0(s[,1])*fqh0(s[,2])
}

fullint <- polyCub(owin(poly=list(x = c(1, 0, 0, 1), y = c(lims[4], lims[4], lims[3], lims[3])), check = FALSE, fix = FALSE), fpqh0)

fullint
## [1] 1.006436
# divide by this so that the density integrates to 1
fpqh0 <- function(s){
  (fph0(s[,1])*fqh0(s[,2]))/fullint
}

# integrate the null density f(p, q | H0) over the region bounded by x = 0 and each L curve
v <- apply(L$x, 1, function(x) {
  return(tryCatch(polyCub(owin(poly=list(x = c(x[nv], 0,0, x), y = c(L$y[nv], max(L$y), min(L$y), L$y)), check = FALSE, fix  = FALSE), fpqh0), error = function(e) NA))
}) %>% unlist()
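
As a quick sanity check, the v values at the candidate SNPs can be plotted against the original P values (a sketch, assuming L$x has one row per entry of indices, as in the construction of xval2 above):

plot(-log10(p[indices]), -log10(v), xlab = "-log10(p)", ylab = "-log10(v)")
abline(0, 1, lty = 2)  # v = p reference line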

The results are now similar to those from the original code, which performed a separate adjustment step using ecdfs. But this all still needs a bit of work to deal with very large and very small P values.


Comparing with previous method (adjustment using ecdfs)


Note that in the previous method, kgrid approximated \(P(P\leq p, Q\leq q)/P(Q\leq q)\) (easy, as \(P(Q\leq q)\) can be taken from the first row of the matrix, i.e. \(P(Q\leq q, P\leq 1)\)), whereas in my new method kgrid approximates \(P(P\leq p, Q\leq q)/P(Q\leq q|H0)\), which is harder.
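
In terms of the matrices built above, the previous method's denominator corresponds to something like the following (kgrid_prev is an illustrative name; the actual previous code worked with ecdfs rather than these KDE-based matrices):

kgrid_prev   <- kpq
kgrid_prev$z <- int_kpq / outer(rep(1, res_p), int_kpq[1, ])  # divide by P(Q <= q), taken from the first row (P <= 1)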

The L curves using the previous code (separate adjustment using ecdfs) are very different:

but the v values are comparable:


Comments and queries