1st July

Long tails

As explained on slack, I can fix the disaster example by censoring q values.
It may be worth checking how the log and qnorm simulation results look when using this censoring, and whether it saves the FDR etc.

init_censor <- function(q, val = 10^-10){
  q[which(q<val)] <- val
  return(q)
}

Depending on these results we can decide whether we tell users to use raw q, log(q) or Z-scores.

In terms of arbitrary data having long tails, I don’t know of a statistic that quantifies the heavy-tailed-ness of a distribution (kurtosis doesn’t seem right). It may be that we do a quick check of their data based on something like percentiles and print a warning if it looks long tailed, suggesting censoring.

set.seed(5)
N <- 100000
q <- -rnbinom(N, 10, 0.7)
hist(q, main = "Histogram of example data\ngray line is 1st percentile\nblue line is half the data range")
abline(v = quantile(q, 0.01), col = "gray", lty = "dashed")
abline(v = range(q)[1]+(range(q)[2]-range(q)[1])*0.5, col = "blue", lty = "dashed")

quantile(q, 0.01) > range(q)[1]+(range(q)[2]-range(q)[1])*0.5

##    1% 
## FALSE

load("disaster.RData")
hist(q)
abline(v = quantile(q, 0.01), col = "gray", lty = "dashed")
abline(v = range(q)[1]+(range(q)[2]-range(q)[1])*0.5, col = "blue", lty = "dashed")

quantile(q, 0.01) > range(q)[1]+(range(q)[2]-range(q)[1])*0.5

##   1% 
## TRUE

if( quantile(q, 0.01) > range(q)[1]+(range(q)[2]-range(q)[1])*0.5 ) warning("Your data may have very long tails, try left censoring")

## Warning: Your data may have very long tails, try left censoring

Could also try something based on data quartiles:

set.seed(5)
N <- 100000
q <- -rnbinom(N, 10, 0.7)
quans <- quantile(q)
if((quans[3]-quans[2]) > 3 *(quans[4]-quans[3])) warning("Your data may have very long tails, try left censoring")

load("disaster.RData")
quans <- quantile(q)
if((quans[3]-quans[2]) > 3 *(quans[4]-quans[3])) warning("Your data may have very long tails, try left censoring")

## Warning: Your data may have very long tails, try left censoring

I’ve implemented this censoring on Chris’ simulations (pqonly) and it improves FDR control (gets to 0.06 on iteration 5 rather than 0.13) but FDR still isn’t controlled after ~ 4 iterations:

(Side Note: in the pqonly simulations, although p and q are more correlated (\(~0.08\)) the q vectors are also correlated (\(~0.04\)))

1st July

Anna Hutchinson

Long tails