A quick fix would be to convert the q values to a normal distribution and then directly use James’ functions.
The bestNormalize
package (https://cran.r-project.org/web/packages/bestNormalize/vignettes/bestNormalize.html) allows comparison between different normalisation methods (where normalisation refers to making data normal, i.e. not just \([0,1]\)).
mca_df <- readRDS("res.mca.mjca.RDS")
q <- mca_df[,1]
hist(q, breaks = 100)
I use the package to find the recommended method for normalisation. The package suggests the ordered quantile (ORQ) normalization via orderNorm
, a normalization technique based off of a rank mapping to the normal distribution, which guarantees normally distributed transformed data (if ties are not present).
library(bestNormalize)
(BNobject <- bestNormalize(q))
## Warning in bestNormalize(q): boxcox did not work; Error in estimate_boxcox_lambda(x, ...) : x must be positive
## Warning in orderNorm(standardize = TRUE, warn = TRUE, x = c(1.23253769877005, : Ties in data, Normal distribution not guaranteed
## Best Normalizing transformation with 121879 Observations
## Estimated Normality Statistics (Pearson P / df, lower => more normal):
## - No transform: 248.0488
## - Log_b(x+a): 62.4443
## - sqrt(x+a): 91.907
## - exp(x): 2179.1578
## - arcsinh(x): 291.9998
## - Yeo-Johnson: 131.9365
## - orderNorm: 4.6778
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##
## Based off these, bestNormalize chose:
## orderNorm Transformation with 121879 nonmissing obs and ties
## - 102874 unique values
## - Original quantiles:
## 0% 25% 50% 75% 100%
## -1.344 -1.038 -0.620 0.997 4.357
The annotation data is now normally distributed.
orderNorm_obj <- orderNorm(q)
## Warning in orderNorm(q): Ties in data, Normal distribution not guaranteed
hist(orderNorm_obj$x.t, main = "orderNorm transformation", breaks = 100)
James’ code does not allow p or q \(<1e-300\) which corresponds to 50% of the dataset.
final_dataset <- readRDS("final_dataset.RDS")
q <- readRDS("transformed_q.RDS")
p <- final_dataset$p
source("cfdr_functions.R")
# p[which(p<1e-300)] <- 1e-300 # to avoid errors in James' code
# q[which(qs<1e-300)] <- 1e-300
candidate_indices <- which(p < 1)
chr <- final_dataset$chrom
est_q0_pars <- fit.2g(q[which(p>0.5)])$pars
# prepare containers
fold <- vector(mode = "list", length = length(unique(chr)))
ind <- vector(mode = "list", length = length(unique(chr)))
L <- vector(mode = "list", length = length(unique(chr)))
v <- vector(mode = "list", length = length(unique(chr)))
for(i in 1:length(unique(chr))){
fold[[i]] <- which(chr == i)
ind[[i]] <- intersect(candidate_indices, fold[[i]])
L[[i]] <- vl(p, q, indices=ind[[i]], mode=2, fold=fold[[i]], gx = min(p[ind[[i]]])); # compute L curves
v[[i]] <- il(L[[i]], pi0_null = est_q0_pars[1], sigma_null = est_q0_pars[2], distx="norm", gx = min(p[ind[[i]]])) # integrate over L curves
}
save(p, q, v, file = "transform_res.RData")
review_PAINTORfig
) generate many simulated locus files:
rbinom(length(z), 1, pi)
with pi=0.1
.review_PAINTORfig/res/misc/
)make_inputfiles.R
):
pi
(annotation uninformative). Make input.files
and run PAINTOR for each of these 10. Save the PP at the CV in each (so have 10*100 PPs).0.2
(annotation informative - CVs more likely to have it). Make input.files
and run PAINTOR for each of these 10. Save the PP at the CV in each (so have 10*100 PPs).0.05
(annotation informative - CVs less likely to have it). Make input.files
and run PAINTOR for each of these 10. Save the PP at the CV in each (so have 10*100 PPs).res_pi <- readRDS("/Users/anna/PAINTOR_V3.0/misc/res_pi.RDS")
res_lower <- readRDS("/Users/anna/PAINTOR_V3.0/misc/res_lower.RDS")
res_higher <- readRDS("/Users/anna/PAINTOR_V3.0/misc/res_higher.RDS")
# remove problematic simulations where all SNPs are given PP=1 for some reason
res_pi <- res_pi[-which(res_pi$PPs==1 & res_pi$ranks!=1),]
res_lower <- res_lower[-which(res_lower$PPs==1 & res_lower$ranks!=1),]
res_higher <- res_higher[-which(res_higher$PPs==1 & res_higher$ranks!=1),]