A quick fix would be to convert the q values to a normal distribution and then directly use James’ functions.

The `bestNormalize`

package (https://cran.r-project.org/web/packages/bestNormalize/vignettes/bestNormalize.html) allows comparison between different normalisation methods (where normalisation refers to making data normal, i.e. not just \([0,1]\)).

```
mca_df <- readRDS("res.mca.mjca.RDS")
q <- mca_df[,1]
hist(q, breaks = 100)
```

I use the package to find the recommended method for normalisation. The package suggests the ordered quantile (ORQ) normalization via `orderNorm`

, a normalization technique based off of a rank mapping to the normal distribution, which guarantees normally distributed transformed data (if ties are not present).

```
library(bestNormalize)
(BNobject <- bestNormalize(q))
```

`## Warning in bestNormalize(q): boxcox did not work; Error in estimate_boxcox_lambda(x, ...) : x must be positive`

`## Warning in orderNorm(standardize = TRUE, warn = TRUE, x = c(1.23253769877005, : Ties in data, Normal distribution not guaranteed`

```
## Best Normalizing transformation with 121879 Observations
## Estimated Normality Statistics (Pearson P / df, lower => more normal):
## - No transform: 248.0488
## - Log_b(x+a): 62.4443
## - sqrt(x+a): 91.907
## - exp(x): 2179.1578
## - arcsinh(x): 291.9998
## - Yeo-Johnson: 131.9365
## - orderNorm: 4.6778
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##
## Based off these, bestNormalize chose:
## orderNorm Transformation with 121879 nonmissing obs and ties
## - 102874 unique values
## - Original quantiles:
## 0% 25% 50% 75% 100%
## -1.344 -1.038 -0.620 0.997 4.357
```

The annotation data is now normally distributed.

`orderNorm_obj <- orderNorm(q)`

`## Warning in orderNorm(q): Ties in data, Normal distribution not guaranteed`

`hist(orderNorm_obj$x.t, main = "orderNorm transformation", breaks = 100)`

James’ code does not allow p or q \(<1e-300\) which corresponds to 50% of the dataset.

```
final_dataset <- readRDS("final_dataset.RDS")
q <- readRDS("transformed_q.RDS")
p <- final_dataset$p
source("cfdr_functions.R")
# p[which(p<1e-300)] <- 1e-300 # to avoid errors in James' code
# q[which(qs<1e-300)] <- 1e-300
candidate_indices <- which(p < 1)
chr <- final_dataset$chrom
est_q0_pars <- fit.2g(q[which(p>0.5)])$pars
# prepare containers
fold <- vector(mode = "list", length = length(unique(chr)))
ind <- vector(mode = "list", length = length(unique(chr)))
L <- vector(mode = "list", length = length(unique(chr)))
v <- vector(mode = "list", length = length(unique(chr)))
for(i in 1:length(unique(chr))){
fold[[i]] <- which(chr == i)
ind[[i]] <- intersect(candidate_indices, fold[[i]])
L[[i]] <- vl(p, q, indices=ind[[i]], mode=2, fold=fold[[i]], gx = min(p[ind[[i]]])); # compute L curves
v[[i]] <- il(L[[i]], pi0_null = est_q0_pars[1], sigma_null = est_q0_pars[2], distx="norm", gx = min(p[ind[[i]]])) # integrate over L curves
}
save(p, q, v, file = "transform_res.RData")
```

- On the HPC (in
`review_PAINTORfig`

) generate many simulated locus files:- Fix OR=1.1, NN=10000 and randomly select a CV to generate Z scores for the “medium” LD region (714 SNPs).
- Generate corresponding random annotation vector using
`rbinom(length(z), 1, pi)`

with`pi=0.1`

. - Save locus file (consisting of Z scores), ld file (note that this will be the same for all simulations) and annotations file to be used as input into PAINTOR.
- Store the iCV and binary annotaion at the CV (in
`review_PAINTORfig/res/misc/`

) - REPEAT (i)-(iv) many times (600 times).

- Select subsets of datasets to be used as input into PAINTOR (see
`make_inputfiles.R`

):- Select 10 collections of 100 locus files whereby the proportion of CVs with annotation equal to 1 is
`pi`

(annotation uninformative). Make`input.files`

and run PAINTOR for each of these 10. Save the PP at the CV in each (so have 10*100 PPs). - Select 10 collections of 100 locus files whereby the proportion of CVs with annotation equal to 1 is
`0.2`

(annotation informative - CVs more likely to have it). Make`input.files`

and run PAINTOR for each of these 10. Save the PP at the CV in each (so have 10*100 PPs). - Select 10 collections of 100 locus files whereby the proportion of CVs with annotation equal to 1 is
`0.05`

(annotation informative - CVs less likely to have it). Make`input.files`

and run PAINTOR for each of these 10. Save the PP at the CV in each (so have 10*100 PPs).

- Select 10 collections of 100 locus files whereby the proportion of CVs with annotation equal to 1 is
- Compare PPs for CVs.

```
res_pi <- readRDS("/Users/anna/PAINTOR_V3.0/misc/res_pi.RDS")
res_lower <- readRDS("/Users/anna/PAINTOR_V3.0/misc/res_lower.RDS")
res_higher <- readRDS("/Users/anna/PAINTOR_V3.0/misc/res_higher.RDS")
# remove problematic simulations where all SNPs are given PP=1 for some reason
res_pi <- res_pi[-which(res_pi$PPs==1 & res_pi$ranks!=1),]
res_lower <- res_lower[-which(res_lower$PPs==1 & res_lower$ranks!=1),]
res_higher <- res_higher[-which(res_higher$PPs==1 & res_higher$ranks!=1),]
```