I have a problem whereby the cfdr values for very large \(p\)-values never reach 1. This occurs whether I estimate cfdr using the:

- "kgrid" method: \(cFdr=\dfrac{p}{kgrid}=\dfrac{p}{P(P\leq p, Q\leq q)/P(Q\leq q|H_0)}\)
- "bivariate" method: \(cFdr=\dfrac{P(P\leq p, Q\leq q|H_0)}{P(P\leq p, Q\leq q)}\)
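The two forms are algebraically equivalent wherever the same empirical quantities are plugged in, since \(P(P\leq p, Q\leq q|H_0) = p\,P(Q\leq q|H_0)\) when \(P\) is uniform and independent of \(Q\) under \(H_0\). A minimal simulated check (toy data, not the real pipeline; all names here are illustrative):

```r
# Toy check: under a complete null with independent p and q,
# both estimator forms give the same value (close to 1 here).
set.seed(1)
n  <- 1e5
p  <- runif(n); q <- runif(n)          # independent null p-values
p0 <- 0.9; q0 <- 0.5                   # evaluation point
joint <- mean(p <= p0 & q <= q0)       # empirical P(P<=p0, Q<=q0)
margq <- mean(q <= q0)                 # empirical P(Q<=q0) = P(Q<=q0|H0) here
kgrid_form     <- p0 / (joint / margq)
bivariate_form <- (p0 * margq) / joint
c(kgrid_form, bivariate_form)          # both close to 1 under the null
```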
Recall that we find the L curves in a two-step approach:

1. \(ccut = cfdr(p, q)\) (note that the `ccut` values are identical when using the kgrid and bivariate methods):

```r
ccut = interp.surface(cgrid, cbind(zp[indices], q[indices]))
```

2. Solve \(cfdr(xval2, yval2) = ccut\) for each value `yval2` takes, i.e. we build a cfdr curve for each value of `yval2` and find the x coordinate (`xval2`) where the cfdr curve equals `ccut`:

```r
# "kgrid" method
for (i in 1:length(yval2)) {
  xdenom = interp.surface(kgrid, cbind(xtest, rep(yval2[i], length(xtest))))
  cfx = cummin(ptest/xdenom)
  xval2[,i] = approx(cfx, xtest, ccut, rule=2, method="const", f=1)$y
}
```
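The `cummin` call matters here: the raw `ptest/xdenom` ratios can be non-monotone due to estimation noise, and `cummin` enforces a non-increasing cfdr curve so that reading off an x coordinate for a given `ccut` is well-defined. A toy illustration (made-up values):

```r
# Hypothetical raw ptest/xdenom values along one cfdr curve
raw <- c(1.0, 0.8, 0.85, 0.5, 0.6, 0.3)
cfx <- cummin(raw)        # running minimum flattens the wiggles
cfx                       # 1.0 0.8 0.8 0.5 0.5 0.3
all(diff(cfx) <= 0)       # TRUE: monotone, so approx() can invert it
```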
```r
# "bivariate" method (where cgrid is the ratio of two bivariates)
for (i in 1:length(yval2)) {
  xdenom = interp.surface(cgrid, cbind(xtest, rep(yval2[i], length(xtest))))
  cfx = xdenom
  xval2[,i] = approx(cfx, xtest, ccut, rule=2, method="const", f=1)$y
}
```
Note that different interpolation methods are used to read off the cfdr curves: `interp.surface` in the first step to get the `ccut` values, and constant interpolation using `approx` in the second step to generate the L curves. I stick with the kgrid method for now, as the results are the same when using the ratio of bivariates.

I investigate the differences between these two methods for generating the `ccut` values.
```r
# Method 1: interpolate directly from cgrid
ccut1 = interp.surface(cgrid, cbind(zp[indices], q[indices]))

# Method 2: read off the kgrid-based cfdr curves with approx
ccut2 = rep(1, length(p[indices]))
for (i in 1:length(p[indices])){
  xdenom = interp.surface(kgrid, cbind(kpq$x, rep(q[indices[i]], length(kpq$x))))
  cfx = cummin(2*pnorm(-kpq$x)/xdenom)
  ccut2[i] = approx(kpq$x, cfx, zp[indices[i]], rule=2, method="const", f=1)$y
}
```
```r
# Method 3: read off the cgrid-based cfdr curves with approx
ccut3 = rep(1, length(p[indices]))
for (i in 1:length(p[indices])){
  xdenom = interp.surface(cgrid, cbind(kpq$x, rep(q[indices[i]], length(kpq$x))))
  cfx = xdenom
  ccut3[i] = approx(kpq$x, cfx, zp[indices[i]], rule=2, method="const", f=1)$y
}
```
Findings:

- The `ccut` values are de-inflated when using the for-loop method rather than the `interp.surface` method. Perhaps this could be contributing to the de-inflation problem. Could I use the `interp.surface` method in the final step to prevent the de-inflation? This de-inflation is, however, fixed by increasing the resolution of the generated L curves (using `xtest` rather than `kpq$x`, which is what is actually used in the method).
- Separating out the kgrid step in the for loop doesn't change the results (`ccut2` and `ccut3` are basically the same).
- By switching `method="const"` to `method="linear"` in `approx`, I can get the same results as when using `interp.surface` (this makes sense, as both are then doing linear interpolation).
- Using left-continuous interpolation (`f=1`, s.t. the right-hand point is used) decreases the `ccut` values; using right-continuous interpolation (`f=0`, s.t. the left-hand point is used) increases the `ccut` values.
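The behaviour of these `approx` settings can be checked on a toy decreasing curve (illustrative values only): `f=0` reads off the left-hand node, `f=1` the right-hand node, and on a decreasing curve these bracket the linear interpolant.

```r
x   <- c(1, 2, 3)
cfx <- c(0.9, 0.6, 0.3)   # hypothetical decreasing cfdr values
lin   <- approx(x, cfx, xout = 2.5, method = "linear")$y        # 0.45
left  <- approx(x, cfx, xout = 2.5, method = "const", f = 1)$y  # 0.3 (right-hand node)
right <- approx(x, cfx, xout = 2.5, method = "const", f = 0)$y  # 0.6 (left-hand node)
right > lin & lin > left   # TRUE on a decreasing curve
```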
- If I increase the resolution s.t. the cfdr curve is defined on `xtest` (length 5000) rather than `kpq$x` (length 500), then the `ccut` values are increased:
```r
# kgrid method on the higher-resolution grid
ccut4 = rep(1, length(p[indices]))
for (i in 1:length(p[indices])){
  xdenom = interp.surface(kgrid, cbind(xtest, rep(q[indices[i]], length(xtest))))
  cfx = cummin(ptest/xdenom)
  ccut4[i] = approx(xtest, cfx, zp[indices[i]], rule=2, method="const", f=1)$y
}
```
```r
# bivariate method on the higher-resolution grid
ccut5 = rep(1, length(p[indices]))
for (i in 1:length(p[indices])){
  xdenom = interp.surface(cgrid, cbind(xtest, rep(q[indices[i]], length(xtest))))
  cfx = xdenom
  ccut5[i] = approx(xtest, cfx, zp[indices[i]], rule=2, method="const", f=1)$y
}
```
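This resolution effect is what you would expect from constant, left-continuous interpolation of a decreasing curve: the read-off value is the curve value at the next grid node to the right, and a coarser grid pushes that node further out. A self-contained sketch (toy curve and grid sizes, not the real data):

```r
g      <- function(x) exp(-x)      # stand-in for a decreasing cfdr curve
coarse <- seq(0, 2, by = 0.5)      # like kpq$x (coarse grid)
fine   <- seq(0, 2, by = 0.05)     # like xtest (fine grid)
x0 <- 0.57                         # query point between grid nodes
y_coarse <- approx(coarse, g(coarse), x0, method = "const", f = 1)$y  # g(1.0)
y_fine   <- approx(fine,   g(fine),   x0, method = "const", f = 1)$y  # g(0.6)
y_fine > y_coarse   # TRUE: the finer grid gives a larger read-off value
```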
I think that the same interpolation method should be used to obtain the `ccut` values and the L curve co-ordinates. Why does James choose constant interpolation with left-continuity?
Upon visual inspection of my cfdr curves, I notice that they do not reach 1, but I think this is correct.
On visual inspection of the L curves for large \(p\), they bend inwards for small \(q\), which decreases the area of the L region and may lead to de-inflation (I think it is likely this that is causing the de-inflation). Note that this isn't fixed by fiddling with the `approx` parameters.

Why do they bend inwards?
For each ccut value, the x co-ordinate of the L curve is found by reading off the Z score value where the cfdr curve (at the corresponding y value) equals ccut.
For example, ZP is plotted below against the cfdr curve for \(yval2=-0.4978084\), with a vertical line at a selected `ccut` value (0.96). The vertical line doesn't hit the cfdr curve, and `rule=2` forces it to take the closest value. The x coordinate of the L curve at this point would therefore be taken to be 0 in these instances (good: no bending inwards):

However, if I look at where each of the cfdr curves stops (beyond which the x co-ordinates are 0), then we see that the shape exactly reflects the bending of the L curves.
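This clamping behaviour is easy to reproduce in isolation; a sketch with a made-up cfdr curve whose maximum (0.9) sits below the queried `ccut` (0.96):

```r
xtest <- seq(0, 5, length.out = 100)   # Z-score grid
cfx   <- cummin(0.9 * exp(-xtest))     # hypothetical cfdr curve, never reaches 0.96
xval  <- approx(cfx, xtest, xout = 0.96, rule = 2, method = "const", f = 1)$y
xval   # 0: rule=2 clamps to the nearest endpoint, so the L curve snaps to Z = 0
```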
Can I stop them bending inwards? (for \(q < \min(q)\))
For now (as a sort of hack; a better approach is needed), I stop the extreme bending inwards for \(q<\min(q)\) by ensuring that cfdr curves are not used for the approximation for small \(q\) (since there is not enough data to accurately fit the cfdr curve to read off). Instead, I force the L curves to drop straight downwards.
```r
# Use the cfdr-curve read-off only where enough data points fall below
# this y value; otherwise carry the previous column (a vertical drop)
xval2_new <- outer(rep(1, length(ccut)), yval2)
for (i in 1:length(yval2)) {
  w = which(q[sub] <= rev(yval2)[i])
  if (length(w) > 1) {
    xdenom = interp.surface(kgrid, cbind(xtest, rep(rev(yval2)[i], length(xtest))))
    cfx = cummin(ptest/xdenom)
    xval2_new[,i] = approx(cfx, xtest, ccut, rule=2, method="const", f=1)$y
  } else if (i > 1) xval2_new[,i] = xval2_new[,i-1] else xval2_new[,i] = -qnorm(ccut/2)
}
```
But does this solve the problem? No.
```r
## [1] 1.006436
```
Tailing off of small \(p\) values doesn’t occur in my current kgrid method, but it does happen in the bivariate method. It also happens in the kgrid method if I interpolate directly from cgrid in the last step. I investigate why.
Considering a single index as an example.
If we look at the L curves on the Z-score x-axis scale, then we see that there is a bulge outwards at approximately `min(q[sub])` where there is no tailing off (separating out kgrid), as opposed to a bulge inwards where there is tailing off (bivariate method).

On the P-value scale this amounts to a nice-looking L region in the no-tailing case. In the tailing-off case, there is a bulge outwards for low \(q\), which increases the area of the L region, which increases the \(v\) values, causing the tailing off.
Why does the bulge appear?
So it seems to be related to the big-\(p\) problem whereby we get some unwanted bulging for low \(q\). If I plot the minimum value of the cfdr curves for both the kgrid (no tailing) and the bivariate (tailing) methods, then I find that the minimum is always 0 for the bivariate method (meaning that `rule=2` is never used).

I consider the cfdr curve for yval2=-2.4.. for both methods. Whilst it stops at 8.435309e-26 for the kgrid method (no tailing), it carries on going in the bivariate method.

So, using the `rule=2` trick for the cfdr curves from the kgrid method (which are not completely defined on \([0,1]\)) seems to solve the problem.
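The difference can be mimicked with two toy curves (illustrative only): one that "stops" partway (kgrid-like, undefined beyond some Z) and one that keeps decreasing towards 0 (bivariate-like). With `rule=2`, queries below the first curve's minimum are clamped to where it stops, while the second curve keeps returning ever-larger x coordinates:

```r
xtest    <- seq(0, 8, length.out = 200)
cf_kgrid <- ifelse(xtest <= 4, exp(-xtest), NA)  # "stops" at x = 4
cf_biv   <- exp(-xtest)                          # carries on decreasing
ccut <- 1e-3                                     # below min of cf_kgrid (~0.018)
ok <- !is.na(cf_kgrid)
x_kgrid <- approx(cf_kgrid[ok], xtest[ok], ccut, rule = 2)$y  # clamped near 4
x_biv   <- approx(cf_biv,       xtest,     ccut, rule = 2)$y  # ~ -log(1e-3) = 6.9
c(x_kgrid, x_biv)
```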
The full results for my method (using the hacky big p fix) are shown below.