I have a problem whereby the cfdr values for very large \(p\)-values never reach 1. This occurs whether I estimate cfdr using the:

- "kgrid" method: \(cFdr=\dfrac{p}{kgrid}=\dfrac{p}{P(P\leq p, Q\leq q)/P(Q\leq q|H_0)}\)
- "bivariate" method: \(cFdr=\dfrac{P(P\leq p, Q\leq q|H_0)}{P(P\leq p, Q\leq q)}\)
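The two forms are algebraically equivalent wherever the same empirical quantities are plugged in, since \(P(P\leq p, Q\leq q|H_0) = p\,P(Q\leq q|H_0)\) when \(P\) is uniform and independent of \(Q\) under \(H_0\). A minimal simulated check (toy data, not the real pipeline; all names here are illustrative):

```r
# Toy check: under a complete null with independent p and q,
# both estimator forms give the same value (close to 1 here).
set.seed(1)
n  <- 1e5
p  <- runif(n); q <- runif(n)          # independent null p-values
p0 <- 0.9; q0 <- 0.5                   # evaluation point
joint <- mean(p <= p0 & q <= q0)       # empirical P(P<=p0, Q<=q0)
margq <- mean(q <= q0)                 # empirical P(Q<=q0) = P(Q<=q0|H0) here
kgrid_form     <- p0 / (joint / margq)
bivariate_form <- (p0 * margq) / joint
c(kgrid_form, bivariate_form)          # both close to 1 under the null
```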
Recall that we find the L curves in a two-step approach:

1. \(ccut = cfdr(p, q)\) (note that the `ccut` values are identical when using the kgrid and bivariate methods):

```r
ccut = interp.surface(cgrid, cbind(zp[indices], q[indices]))
```

2. Solve \(cfdr(xval2, yval2) = ccut\) for each value `yval2` takes, i.e. we build a cfdr curve for each value of `yval2` and find the x coordinate (`xval2`) where the cfdr curve equals `ccut`:

```r
# "kgrid" method
for (i in 1:length(yval2)) {
  xdenom = interp.surface(kgrid, cbind(xtest, rep(yval2[i], length(xtest))))
  cfx = cummin(ptest/xdenom)
  xval2[,i] = approx(cfx, xtest, ccut, rule=2, method="const", f=1)$y
}
```
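The `cummin` call matters here: the raw `ptest/xdenom` ratios can be non-monotone due to estimation noise, and `cummin` enforces a non-increasing cfdr curve so that reading off an x coordinate for a given `ccut` is well-defined. A toy illustration (made-up values):

```r
# Hypothetical raw ptest/xdenom values along one cfdr curve
raw <- c(1.0, 0.8, 0.85, 0.5, 0.6, 0.3)
cfx <- cummin(raw)        # running minimum flattens the wiggles
cfx                       # 1.0 0.8 0.8 0.5 0.5 0.3
all(diff(cfx) <= 0)       # TRUE: monotone, so approx() can invert it
```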
```r
# "bivariate" method (where cgrid is the ratio of two bivariates)
for (i in 1:length(yval2)) {
  xdenom = interp.surface(cgrid, cbind(xtest, rep(yval2[i], length(xtest))))
  cfx = xdenom
  xval2[,i] = approx(cfx, xtest, ccut, rule=2, method="const", f=1)$y
}
```
Note that different interpolation methods are used to read off the cfdr curves: `interp.surface` in the first step to get the `ccut` values, and constant interpolation using `approx` in the second step to generate the L curves. I stick with the kgrid method for now, as the results are the same when using the ratio of bivariates.

I investigate the differences between these two methods for generating the `ccut` values.
```r
# Method 1: interpolate directly from cgrid
ccut1 = interp.surface(cgrid, cbind(zp[indices], q[indices]))

# Method 2: read off the kgrid-based cfdr curves with approx
ccut2 = rep(1, length(p[indices]))
for (i in 1:length(p[indices])){
  xdenom = interp.surface(kgrid, cbind(kpq$x, rep(q[indices[i]], length(kpq$x))))
  cfx = cummin(2*pnorm(-kpq$x)/xdenom)
  ccut2[i] = approx(kpq$x, cfx, zp[indices[i]], rule=2, method="const", f=1)$y
}
```
```r
# Method 3: read off the cgrid-based cfdr curves with approx
ccut3 = rep(1, length(p[indices]))
for (i in 1:length(p[indices])){
  xdenom = interp.surface(cgrid, cbind(kpq$x, rep(q[indices[i]], length(kpq$x))))
  cfx = xdenom
  ccut3[i] = approx(kpq$x, cfx, zp[indices[i]], rule=2, method="const", f=1)$y
}
```
Findings:

- The `ccut` values are de-inflated when using the for-loop method rather than the `interp.surface` method. Perhaps this could be contributing to the de-inflation problem. Could I use the `interp.surface` method in the final step to prevent the de-inflation? This de-inflation is, however, fixed by increasing the resolution of the generated L curves (using `xtest` rather than `kpq$x`, which is what is actually used in the method).
- Separating out the kgrid step in the for loop doesn't change the results (`ccut2` and `ccut3` are basically the same).
- By switching `method="const"` to `method="linear"` in `approx`, I can get the same results as when using `interp.surface` (this makes sense, as both are then doing linear interpolation).
- Using left-continuous interpolation (`f=1`, s.t. the right-hand point is used) decreases the `ccut` values; using right-continuous interpolation (`f=0`, s.t. the left-hand point is used) increases the `ccut` values.
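The behaviour of these `approx` settings can be checked on a toy decreasing curve (illustrative values only): `f=0` reads off the left-hand node, `f=1` the right-hand node, and on a decreasing curve these bracket the linear interpolant.

```r
x   <- c(1, 2, 3)
cfx <- c(0.9, 0.6, 0.3)   # hypothetical decreasing cfdr values
lin   <- approx(x, cfx, xout = 2.5, method = "linear")$y        # 0.45
left  <- approx(x, cfx, xout = 2.5, method = "const", f = 1)$y  # 0.3 (right-hand node)
right <- approx(x, cfx, xout = 2.5, method = "const", f = 0)$y  # 0.6 (left-hand node)
right > lin & lin > left   # TRUE on a decreasing curve
```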
- If I increase the resolution s.t. the cfdr curve is defined on `xtest` (length 5000) rather than `kpq$x` (length 500), then the `ccut` values are increased:
```r
# kgrid method on the higher-resolution grid
ccut4 = rep(1, length(p[indices]))
for (i in 1:length(p[indices])){
  xdenom = interp.surface(kgrid, cbind(xtest, rep(q[indices[i]], length(xtest))))
  cfx = cummin(ptest/xdenom)
  ccut4[i] = approx(xtest, cfx, zp[indices[i]], rule=2, method="const", f=1)$y
}
```
```r
# bivariate method on the higher-resolution grid
ccut5 = rep(1, length(p[indices]))
for (i in 1:length(p[indices])){
  xdenom = interp.surface(cgrid, cbind(xtest, rep(q[indices[i]], length(xtest))))
  cfx = xdenom
  ccut5[i] = approx(xtest, cfx, zp[indices[i]], rule=2, method="const", f=1)$y
}
```
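This resolution effect is what you would expect from constant, left-continuous interpolation of a decreasing curve: the read-off value is the curve value at the next grid node to the right, and a coarser grid pushes that node further out. A self-contained sketch (toy curve and grid sizes, not the real data):

```r
g      <- function(x) exp(-x)      # stand-in for a decreasing cfdr curve
coarse <- seq(0, 2, by = 0.5)      # like kpq$x (coarse grid)
fine   <- seq(0, 2, by = 0.05)     # like xtest (fine grid)
x0 <- 0.57                         # query point between grid nodes
y_coarse <- approx(coarse, g(coarse), x0, method = "const", f = 1)$y  # g(1.0)
y_fine   <- approx(fine,   g(fine),   x0, method = "const", f = 1)$y  # g(0.6)
y_fine > y_coarse   # TRUE: the finer grid gives a larger read-off value
```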
I think that the same interpolation method should be used to obtain the `ccut` values and the L curve co-ordinates. Why does James choose constant interpolation with left-continuity?
Upon visual inspection of my cfdr curves, I notice that they do not reach 1, but I think this is correct.
On visual inspection of the L curves for large \(p\), they bend inwards for small \(q\), which decreases the area of the L region and may lead to de-inflation (I think it is likely this that is causing the de-inflation). Note that this isn't fixed by fiddling with the `approx` parameters.

Why do they bend inwards?
For each ccut value, the x co-ordinate of the L curve is found by reading off the Z score value where the cfdr curve (at the corresponding y value) equals ccut.
For example, ZP is plotted below against the cfdr curve for \(yval2=-0.4978084\), with a vertical line at a selected `ccut` value (0.96). The vertical line doesn't hit the cfdr curve, and `rule=2` forces it to take the closest value. The x coordinate of the L curve at this point would therefore be taken to be 0 in these instances (good: no bending inwards):

However, if I look at where each of the cfdr curves stops (beyond which the x co-ordinates are 0), then we see that the shape exactly reflects the bending of the L curves.
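This clamping behaviour is easy to reproduce in isolation; a sketch with a made-up cfdr curve whose maximum (0.9) sits below the queried `ccut` (0.96):

```r
xtest <- seq(0, 5, length.out = 100)   # Z-score grid
cfx   <- cummin(0.9 * exp(-xtest))     # hypothetical cfdr curve, never reaches 0.96
xval  <- approx(cfx, xtest, xout = 0.96, rule = 2, method = "const", f = 1)$y
xval   # 0: rule=2 clamps to the nearest endpoint, so the L curve snaps to Z = 0
```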
Can I stop them bending inwards? (for \(q < \min(q)\))
For now (as a sort of hack; a better approach is needed), I stop the extreme bending inwards for \(q<\min(q)\) by ensuring that cfdr curves are not used for the approximation for small \(q\) (since there is not enough data to accurately fit the cfdr curve to read off). Instead, I force the L curves to drop straight downwards.
```r
# Use the cfdr-curve read-off only where enough data points fall below
# this y value; otherwise carry the previous column (a vertical drop)
xval2_new <- outer(rep(1, length(ccut)), yval2)
for (i in 1:length(yval2)) {
  w = which(q[sub] <= rev(yval2)[i])
  if (length(w) > 1) {
    xdenom = interp.surface(kgrid, cbind(xtest, rep(rev(yval2)[i], length(xtest))))
    cfx = cummin(ptest/xdenom)
    xval2_new[,i] = approx(cfx, xtest, ccut, rule=2, method="const", f=1)$y
  } else if (i > 1) xval2_new[,i] = xval2_new[,i-1] else xval2_new[,i] = -qnorm(ccut/2)
}
```
But does this solve the problem? No.
```r
## [1] 1.006436
```
Tailing off of small \(p\) values doesn’t occur in my current kgrid method, but it does happen in the bivariate method. It also happens in the kgrid method if I interpolate directly from cgrid in the last step. I investigate why.
Considering a single index as an example.
If we look at the L curves on the Z-score x-axis scale, then we see that there is a bulge outwards at approximately `min(q[sub])` where there is no tailing off (separating out kgrid), as opposed to a bulge inwards where there is tailing off (bivariate method).

On the P-value scale this amounts to a nice-looking L region in the no-tailing case. In the tailing-off case, there is a bulge outwards for low \(q\), which increases the area of the L region, which increases the \(v\) values, causing the tailing off.
Why does the bulge appear?
So it seems to be related to the big-\(p\) problem whereby we get some unwanted bulging for low \(q\). If I plot the minimum value of the cfdr curves for both the kgrid (no tailing) and the bivariate (tailing) methods, then I find that the minimum is always 0 for the bivariate method (meaning that `rule=2` is never used).

I consider the cfdr curve for yval2=-2.4.. for both methods. Whilst it stops at 8.435309e-26 for the kgrid method (no tailing), it carries on going in the bivariate method.

So, using the `rule=2` trick for the cfdr curves from the kgrid method (which are not completely defined on \([0,1]\)) seems to solve the problem.
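The difference can be mimicked with two toy curves (illustrative only): one that "stops" partway (kgrid-like, undefined beyond some Z) and one that keeps decreasing towards 0 (bivariate-like). With `rule=2`, queries below the first curve's minimum are clamped to where it stops, while the second curve keeps returning ever-larger x coordinates:

```r
xtest    <- seq(0, 8, length.out = 200)
cf_kgrid <- ifelse(xtest <= 4, exp(-xtest), NA)  # "stops" at x = 4
cf_biv   <- exp(-xtest)                          # carries on decreasing
ccut <- 1e-3                                     # below min of cf_kgrid (~0.018)
ok <- !is.na(cf_kgrid)
x_kgrid <- approx(cf_kgrid[ok], xtest[ok], ccut, rule = 2)$y  # clamped near 4
x_biv   <- approx(cf_biv,       xtest,     ccut, rule = 2)$y  # ~ -log(1e-3) = 6.9
c(x_kgrid, x_biv)
```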
The full results for my method (using the hacky big p fix) are shown below.