library(data.table)
setwd("/Users/anna/PhD")
cs <- read.table("cs for diff nsnps")
cs <- as.data.table(cs)
cs_ord <- cs[order=="TRUE"] # separate ordered and non-ordered simulations
cs_noord <- cs[order=="FALSE"]
cs_ord$covered <- as.numeric(cs_ord$covered) # has dim 200,000*9
cs_noord$covered <- as.numeric(cs_noord$covered) # has dim 200,000*9
cs_ord <- cs_ord[,-1] # remove order column
cs_noord <- cs_noord[,-1]
head(cs_ord)
## thr size nvar covered entropy N OR nsnps
## 1: 0.6 0.6094180 10 1 5.306601 5000 1.1 100
## 2: 0.6 0.6069801 32 0 4.859282 5000 1.1 100
## 3: 0.6 0.6598043 1 1 5.843130 5000 1.1 100
## 4: 0.6 0.7162386 1 1 6.000415 5000 1.1 100
## 5: 0.6 0.6133004 4 0 5.491237 5000 1.1 100
## 6: 0.6 0.6057816 10 1 5.310640 5000 1.1 100
head(cs_noord)
## thr size nvar covered entropy N OR nsnps
## 1: 0.6 0.6231312 38 1 5.306601 5000 1.1 100
## 2: 0.6 0.6068766 54 0 4.859282 5000 1.1 100
## 3: 0.6 0.8186720 46 1 5.843130 5000 1.1 100
## 4: 0.6 0.7827822 14 1 6.000415 5000 1.1 100
## 5: 0.6 0.7888346 59 1 5.491237 5000 1.1 100
## 6: 0.6 0.6027625 49 0 5.310640 5000 1.1 100
The ‘cs’ dataset contains information for 400,000 credible set simulations with varying combinations of OR, threshold, N and nsnps values.
Sampled OR values: 1, 1.05, 1.1, 1.2, 1.3.
Sampled thr values: 0.5, 0.6, 0.7, 0.8, 0.9.
Sampled N values: 1000, 2000, 3000, 4000, 5000.
Sampled nsnps values: 100, 300, 500, 1000.
Logistic models were fitted with each variable as the sole predictor. Fake data were then created for each variable, and the corresponding logistic model was used to predict coverage for the fake data.
xent <- glm(covered~entropy, data=cs_ord, family='binomial')
xthr <- glm(covered~thr, data=cs_ord, family='binomial')
xsize <- glm(covered~size, data=cs_ord, family='binomial')
xnsnps <- glm(covered~nsnps, data=cs_ord, family='binomial')
xN <- glm(covered~N, data=cs_ord, family='binomial')
xnvar <- glm(covered~nvar, data=cs_ord, family='binomial')
xOR <- glm(covered~OR, data=cs_ord, family='binomial')
fake_ent <- data.frame(entropy=seq(4,40,by=0.1))
fake_ent$pred <- predict(xent,newdata=fake_ent,type="response") # family is set in glm(), not predict()
fake_thr <- data.frame(thr=seq(0.5,1,by=0.01))
fake_thr$pred <- predict(xthr,newdata=fake_thr,type="response")
fake_size <- data.frame(size=seq(0.5,1,by=0.01))
fake_size$pred <- predict(xsize,newdata=fake_size,type="response")
fake_nsnps <- data.frame(nsnps=seq(100,1000,by=1))
fake_nsnps$pred <- predict(xnsnps,newdata=fake_nsnps,type="response")
fake_N <- data.frame(N=seq(1000,5000,by=1))
fake_N$pred <- predict(xN,newdata=fake_N,type="response")
fake_nvar <- data.frame(nvar=seq(1,843,by=1))
fake_nvar$pred <- predict(xnvar,newdata=fake_nvar,type="response")
fake_OR <- data.frame(OR=seq(1,1.5,by=0.01))
fake_OR$pred <- predict(xOR,newdata=fake_OR,type="response")
The same method was implemented for the non-ordered data for comparison.
yent <- glm(covered~entropy, data=cs_noord, family='binomial')
ythr <- glm(covered~thr, data=cs_noord, family='binomial')
ysize <- glm(covered~size, data=cs_noord, family='binomial')
ynsnps <- glm(covered~nsnps, data=cs_noord, family='binomial')
yN <- glm(covered~N, data=cs_noord, family='binomial')
ynvar <- glm(covered~nvar, data=cs_noord, family='binomial')
yOR <- glm(covered~OR, data=cs_noord, family='binomial')
fake_ent$pred1 <- predict(yent,newdata=fake_ent,type="response")
fake_thr$pred1 <- predict(ythr,newdata=fake_thr,type="response")
fake_size$pred1 <- predict(ysize,newdata=fake_size,type="response")
fake_nsnps$pred1 <- predict(ynsnps,newdata=fake_nsnps,type="response")
fake_N$pred1 <- predict(yN,newdata=fake_N,type="response")
fake_nvar$pred1 <- predict(ynvar,newdata=fake_nvar,type="response")
fake_OR$pred1 <- predict(yOR,newdata=fake_OR,type="response")
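The ordered and non-ordered prediction curves described below can be overlaid in a single base-R plot. A minimal sketch, using simulated stand-in curves in place of the real `pred`/`pred1` columns (which depend on the fitted models above):

```r
# Stand-in for the real fake_ent: two hypothetical logistic curves
fake_ent <- data.frame(entropy = seq(4, 40, by = 0.1))
fake_ent$pred  <- plogis(-2 + 0.2 * fake_ent$entropy)  # "ordered" stand-in
fake_ent$pred1 <- plogis(-1 + 0.2 * fake_ent$entropy)  # "non-ordered" stand-in

png(tmp <- tempfile(fileext = ".png"))  # write to a temporary file
plot(fake_ent$entropy, fake_ent$pred, type = "l", col = "blue",
     ylim = c(0, 1), xlab = "entropy", ylab = "predicted coverage")
lines(fake_ent$entropy, fake_ent$pred1, col = "red")
legend("bottomright", c("ordered", "non-ordered"),
       col = c("blue", "red"), lty = 1)
dev.off()
```

With the real columns, the same two-line overlay shows directly where the curves diverge and plateau.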
Both curves plateau as the entropy value reaches approximately 20. This value represents a highly disordered system, whereby one snp takes a huge proportion of the posterior probability. On analysis of the ordered data, there were many (~8% of) cases where \(size=1\), \(nvar=1\) and \(covered=TRUE\), representing a system where one snp (the CV) has all the posterior probability. Conversely, in the non-ordered system, there were many instances of \(size=1\), but few with \(nvar=1\) (as this would only be the case when the first snp has all the posterior probability).
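For reference, if the `entropy` column is the Shannon entropy of the posterior probability vector (an assumption; the simulation code defining it is not shown here), it can be computed directly. Note that under that standard definition a posterior concentrated on one snp gives a *low* value, while a uniform posterior gives the maximum:

```r
# Shannon entropy (in bits) of a posterior probability vector.
# Assumption: this matches the simulation's 'entropy' column -- unverified.
shannon_entropy <- function(p) {
  p <- p[p > 0]          # treat 0 * log(0) as 0
  -sum(p * log2(p))
}

uniform_post <- rep(1/100, 100)             # probability spread over 100 snps
peaked_post  <- c(0.99, rep(0.01/99, 99))   # one snp dominates

shannon_entropy(uniform_post)  # maximal: log2(100), about 6.64
shannon_entropy(peaked_post)   # much lower: concentration reduces entropy
```

If the simulated entropies behave the other way around (high entropy when one snp dominates), the column is presumably defined differently and that definition would be worth checking.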
For any entropy value, the coverage for non-ordered sets is higher than that for ordered sets. This does not seem right, as we know there are instances of over- and under-coverage for different values of OR.
We know that entropy depends on OR; see below for separate plots for each value of OR. Since they are all so different, it seems unwise to ‘squish’ them all into one plot. Idea: could also include OR (and an interaction term?) in the glm.
Ordered:
Non-ordered:
The graph shows that coverage is higher for any threshold value for non-ordered sets compared with ordered sets. As the threshold increases to 1, the difference between coverage for ordered and non-ordered sets decreases.
Similarly to coverage~thr (since size and thr are closely related), it seems that coverage is higher for any size value for non-ordered sets compared with ordered sets. As the size increases to 1, the difference between coverage for ordered and non-ordered sets decreases.
Note the difference in plots for thr and size for 0.5-0.8.
As the number of snps increases in the system, the coverage decreases slightly.
As N increases, coverage increases. This increase is faster for non-ordered as opposed to ordered sets.
This plot doesn’t make much sense; perhaps something is wrong with the code?
As nvar increases, coverage increases. This increase is steeper for ordered sets, and coverage for ordered sets overtakes that for non-ordered sets at nvar ≈ 625.
Coverage for ordered sets increases rapidly for OR=1-1.15 and plateaus at OR=1.4. For non-ordered sets, the increase in coverage is a lot slower. For OR=1 (where the odds of disease for people with and without the allele are equal), coverage is a lot higher for non-ordered sets (~0.65) than ordered sets (~0.38), presumably because the CV would not hold a substantial proportion of the posterior probability, and therefore would not necessarily appear towards the beginning of the ordering.
Is there a way to quantify how successful these predictions were? For example, the entropy-only model predicts an average coverage of 0.5177 for \(entropy=4\); is this correct?
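One standard way to score probabilistic predictions is the Brier score (mean squared difference between the predicted probability and the 0/1 outcome), together with a calibration check: within each band of predicted probability, does the observed coverage rate match the mean prediction? A sketch on simulated data (the real `covered` outcomes and model predictions would be substituted in):

```r
set.seed(1)
p_true  <- runif(1000)                      # hypothetical true coverage probs
covered <- rbinom(1000, 1, p_true)          # simulated 0/1 outcomes
# noisy predictions, clipped away from 0 and 1
pred <- pmin(pmax(p_true + rnorm(1000, 0, 0.05), 0.001), 0.999)

brier <- mean((pred - covered)^2)           # 0 = perfect; 0.25 ~ no skill

# calibration table: observed coverage rate per predicted-probability bin
bins  <- cut(pred, breaks = seq(0, 1, 0.1))
calib <- data.frame(mean_pred = tapply(pred, bins, mean),
                    obs_rate  = tapply(covered, bins, mean))
```

For the \(entropy=4\) question specifically, the observed mean of `covered` among simulations with entropy near 4 could be compared against the predicted 0.5177.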
The full model is presumably a better predictor of coverage. The cv.glmnet function has been used to run the lasso on the model. The fit for the ‘best’ lambda (lambda.1se: the largest value of lambda such that the cross-validation error is within 1 standard error of the minimum) is given below:
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-16
set.seed(3)
xfactors <- model.matrix(covered ~ ., data=cs_ord)[, -1]
x <- as.matrix(data.frame(xfactors))
cv.ord <- cv.glmnet(x, y=cs_ord$covered, alpha=1, family="binomial")
coef(cv.ord,s=cv.ord$lambda.1se)
## 8 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -1.980661e+01
## thr          3.208768e+00
## size         2.175730e+00
## nvar         1.223560e-03
## entropy      .
## N            9.136867e-05
## OR           1.521719e+01
## nsnps       -4.792834e-04
Note: Surely entropy should not have been dropped? Perhaps ridge regression should be used instead of the lasso, as it copes better with collinear variables?
Note 2: Can we run prediction on this? Would need to specify values for all the variables…
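On both notes: ridge is obtained from cv.glmnet with `alpha = 0`, and it shrinks correlated coefficients towards each other rather than setting some exactly to zero; prediction from a cv.glmnet fit works via `predict(..., newx, s = ...)` and does indeed need a value for every predictor column. A sketch on simulated stand-in data (cs_ord itself is not reproduced here):

```r
library(glmnet)
set.seed(3)
n <- 500
# two predictors plus a third deliberately collinear with 'thr'
x <- cbind(thr = runif(n, 0.5, 0.9), entropy = runif(n, 4, 20))
x <- cbind(x, size = x[, "thr"] + rnorm(n, 0, 0.05))
y <- rbinom(n, 1, plogis(-2 + 3 * x[, "thr"]))

cv.ridge <- cv.glmnet(x, y, alpha = 0, family = "binomial")
coef(cv.ridge, s = cv.ridge$lambda.1se)  # all coefficients kept, shrunken

# prediction requires one value per predictor column:
newx <- matrix(c(0.8, 10, 0.8), nrow = 1,
               dimnames = list(NULL, colnames(x)))
predict(cv.ridge, newx = newx, s = cv.ridge$lambda.1se, type = "response")
```

Elastic net (`alpha` between 0 and 1) is a middle ground if some variable selection is still wanted alongside robustness to collinearity.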
# GLM function
test <- sample(nrow(cs_ord), 100000) # randomly select rows to form the test set
test_set <- cs_ord[test, ]
training_set <- cs_ord[-test,]
fit_glm <- glm(covered ~ ., training_set, family=binomial(link="logit"))
glm_scores <- predict(fit_glm, test_set, type="response")
hist(glm_scores)
# round output to 0 if <0.5 and 1 if >0.5
rounded_scores <- round(glm_scores, digits = 0)
new <- cbind(test_set, rounded_scores)
length(which(new$covered==new$rounded_scores))/nrow(new) # ~81% accurate
## [1] 0.80978
a <- which(new$covered==1 & new$rounded_scores==0) # length 7415: false negative 38.6%
b <- which(new$covered==0 & new$rounded_scores==1) # length 11789: false positive 61.4%
# Hence, 'false positives' reported more often than 'false negatives'
head(new)
## thr size nvar covered entropy N OR nsnps rounded_scores
## 1: 0.6 0.6012678 175 1 6.426384 1000 1.0 500 0
## 2: 0.5 0.8190513 1 0 8.992368 5000 1.1 1000 1
## 3: 0.7 0.9999837 1 1 17.685040 4000 1.3 500 1
## 4: 0.6 0.6000352 117 1 6.621915 2000 1.2 500 1
## 5: 0.8 0.8010262 287 1 6.588828 5000 1.1 500 1
## 6: 0.7 0.8927730 1 1 8.521814 2000 1.1 300 1
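The accuracy and false-positive/false-negative counts above can be obtained in one step with `table()`. A sketch on toy vectors standing in for `new$covered` and `new$rounded_scores`:

```r
# toy observed outcomes and rounded predictions
covered        <- c(1, 0, 1, 1, 0, 1, 0, 0)
rounded_scores <- c(1, 1, 0, 1, 0, 1, 1, 0)

conf <- table(observed = covered, predicted = rounded_scores)
conf

accuracy  <- sum(diag(conf)) / sum(conf)  # 0.625 here
false_neg <- conf["1", "0"]  # covered but predicted uncovered
false_pos <- conf["0", "1"]  # uncovered but predicted covered
```

The same three lines applied to `new` reproduce the ~81% accuracy and the false-negative/false-positive split reported above.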
Seems abrupt to commit a value of 0.500001 to 1 and a value of 0.499999 to 0. It would be good to apply a correction factor to these values, to ‘pull’ them further away from 0.5, perhaps incorporating entropy (which the lasso dropped above).
Run this training/test-set idea with the single-variable glm models above to see how well or badly each performs.
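A sketch of that check, looping over single-predictor formulas and scoring held-out accuracy, on simulated stand-in data (the real cs_ord data.table and the full list of variables would be substituted in):

```r
set.seed(42)
# stand-in data: entropy carries signal, thr is noise
d <- data.frame(entropy = runif(2000, 4, 20),
                thr     = sample(seq(0.5, 0.9, 0.1), 2000, replace = TRUE))
d$covered <- rbinom(2000, 1, plogis(-3 + 0.3 * d$entropy))

test <- sample(nrow(d), 500)  # held-out rows
acc <- sapply(c("entropy", "thr"), function(v) {
  fit  <- glm(reformulate(v, "covered"), data = d[-test, ], family = binomial)
  pred <- round(predict(fit, d[test, ], type = "response"))
  mean(pred == d$covered[test])
})
acc  # named vector of per-variable held-out accuracy
```

Comparing each single-variable accuracy against the full model’s ~81% would show how much each predictor contributes on its own.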
Possible issue with dataset: In the ordered dataset, there are loads (~10% of) entries with \(size=nvar=covered=1\), meaning the credible set consists only of the CV which holds all the posterior probability. These are mostly for systems with OR=1.3 and very high entropy (>15). Should I remove these from the dataset as they do not represent real-world GWAS results we would like to study (as they would already have been investigated)?
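If these degenerate sets are to be excluded, a single data.table filter on the three conditions does it. Toy example (stand-in columns in place of cs_ord):

```r
library(data.table)
toy <- data.table(size    = c(1, 1, 0.7),
                  nvar    = c(1, 3, 5),
                  covered = c(1, 1, 0),
                  OR      = c(1.3, 1.1, 1.0))
# drop rows where the credible set is just the CV holding all the probability
toy_filtered <- toy[!(size == 1 & nvar == 1 & covered == 1)]
nrow(toy_filtered)  # 2: only the single-CV row is dropped
```

Whether to filter before or after model fitting matters: dropping ~10% of rows will shift the fitted coverage curves, so it may be worth reporting results both ways.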