library(data.table)
setwd("/Users/anna/PhD")
cs <- read.table("cs for diff nsnps")
cs <- as.data.table(cs)
cs_ord <- cs[order=="TRUE"] # separate ordered and non-ordered simulations
cs_noord <- cs[order=="FALSE"]
cs_ord$covered <- as.numeric(cs_ord$covered) # has dim 200,000*9
cs_noord$covered <- as.numeric(cs_noord$covered) # has dim 200,000*9
cs_ord <- cs_ord[,-1] # remove order column
cs_noord <- cs_noord[,-1]
head(cs_ord)
## thr size nvar covered entropy N OR nsnps
## 1: 0.6 0.6094180 10 1 5.306601 5000 1.1 100
## 2: 0.6 0.6069801 32 0 4.859282 5000 1.1 100
## 3: 0.6 0.6598043 1 1 5.843130 5000 1.1 100
## 4: 0.6 0.7162386 1 1 6.000415 5000 1.1 100
## 5: 0.6 0.6133004 4 0 5.491237 5000 1.1 100
## 6: 0.6 0.6057816 10 1 5.310640 5000 1.1 100
head(cs_noord)
## thr size nvar covered entropy N OR nsnps
## 1: 0.6 0.6231312 38 1 5.306601 5000 1.1 100
## 2: 0.6 0.6068766 54 0 4.859282 5000 1.1 100
## 3: 0.6 0.8186720 46 1 5.843130 5000 1.1 100
## 4: 0.6 0.7827822 14 1 6.000415 5000 1.1 100
## 5: 0.6 0.7888346 59 1 5.491237 5000 1.1 100
## 6: 0.6 0.6027625 49 0 5.310640 5000 1.1 100
The ‘cs’ dataset contains information for 400,000 credible set simulations with varying combinations of OR, threshold, N and nsnps values.
Sampled OR values: 1, 1.05, 1.1, 1.2, 1.3.
Sampled thr values: 0.5, 0.6, 0.7, 0.8, 0.9.
Sampled N values: 1000, 2000, 3000, 4000, 5000.
Sampled nsnps values: 100, 300, 500, 1000.
Logistic models were fitted with each variable as the sole predictor. Fake data were then created for each variable, and the corresponding logistic model was used to predict coverage for the fake data.
xent <- glm(covered~entropy, data=cs_ord, family='binomial')
xthr <- glm(covered~thr, data=cs_ord, family='binomial')
xsize <- glm(covered~size, data=cs_ord, family='binomial')
xnsnps <- glm(covered~nsnps, data=cs_ord, family='binomial')
xN <- glm(covered~N, data=cs_ord, family='binomial')
xnvar <- glm(covered~nvar, data=cs_ord, family='binomial')
xOR <- glm(covered~OR, data=cs_ord, family='binomial')
fake_ent <- data.frame(entropy=seq(4,40,by=0.1))
fake_ent$pred <- predict(xent,newdata=fake_ent,type="response") # family is set in glm(), not predict()
fake_thr <- data.frame(thr=seq(0.5,1,by=0.01))
fake_thr$pred <- predict(xthr,newdata=fake_thr,type="response")
fake_size <- data.frame(size=seq(0.5,1,by=0.01))
fake_size$pred <- predict(xsize,newdata=fake_size,type="response")
fake_nsnps <- data.frame(nsnps=seq(100,1000,by=1))
fake_nsnps$pred <- predict(xnsnps,newdata=fake_nsnps,type="response")
fake_N <- data.frame(N=seq(1000,5000,by=1))
fake_N$pred <- predict(xN,newdata=fake_N,type="response")
fake_nvar <- data.frame(nvar=seq(1,843,by=1))
fake_nvar$pred <- predict(xnvar,newdata=fake_nvar,type="response")
fake_OR <- data.frame(OR=seq(1,1.5,by=0.01))
fake_OR$pred <- predict(xOR,newdata=fake_OR,type="response")
The same method was implemented for the non-ordered data for comparison.
yent <- glm(covered~entropy, data=cs_noord, family='binomial')
ythr <- glm(covered~thr, data=cs_noord, family='binomial')
ysize <- glm(covered~size, data=cs_noord, family='binomial')
ynsnps <- glm(covered~nsnps, data=cs_noord, family='binomial')
yN <- glm(covered~N, data=cs_noord, family='binomial')
ynvar <- glm(covered~nvar, data=cs_noord, family='binomial')
yOR <- glm(covered~OR, data=cs_noord, family='binomial')
fake_ent$pred1 <- predict(yent,newdata=fake_ent,type="response")
fake_thr$pred1 <- predict(ythr,newdata=fake_thr,type="response")
fake_size$pred1 <- predict(ysize,newdata=fake_size,type="response")
fake_nsnps$pred1 <- predict(ynsnps,newdata=fake_nsnps,type="response")
fake_N$pred1 <- predict(yN,newdata=fake_N,type="response")
fake_nvar$pred1 <- predict(ynvar,newdata=fake_nvar,type="response")
fake_OR$pred1 <- predict(yOR,newdata=fake_OR,type="response")
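The ordered and non-ordered prediction curves described below can be overlaid in a single base-R plot. A minimal sketch, using simulated stand-in curves in place of the real `pred`/`pred1` columns (which depend on the fitted models above):

```r
# Stand-in for the real fake_ent: two hypothetical logistic curves
fake_ent <- data.frame(entropy = seq(4, 40, by = 0.1))
fake_ent$pred  <- plogis(-2 + 0.2 * fake_ent$entropy)  # "ordered" stand-in
fake_ent$pred1 <- plogis(-1 + 0.2 * fake_ent$entropy)  # "non-ordered" stand-in

png(tmp <- tempfile(fileext = ".png"))  # write to a temporary file
plot(fake_ent$entropy, fake_ent$pred, type = "l", col = "blue",
     ylim = c(0, 1), xlab = "entropy", ylab = "predicted coverage")
lines(fake_ent$entropy, fake_ent$pred1, col = "red")
legend("bottomright", c("ordered", "non-ordered"),
       col = c("blue", "red"), lty = 1)
dev.off()
```

With the real columns, the same two-line overlay shows directly where the curves diverge and plateau.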
Both curves plateau as the entropy value reaches approximately 20. This value represents a highly disordered system, whereby one snp takes a huge proportion of the posterior probability. On analysis of the ordered data, there were many (~8% of) cases where \(size=1\), \(nvar=1\) and \(covered=TRUE\), representing a system where one snp (the CV) has all the posterior probability. Conversely, in the non-ordered system, there were many instances of \(size=1\), but few with \(nvar=1\) (as this would only be the case when the first snp has all the posterior probability).
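For reference, if the `entropy` column is the Shannon entropy of the posterior probability vector (an assumption; the simulation code defining it is not shown here), it can be computed directly. Note that under that standard definition a posterior concentrated on one snp gives a *low* value, while a uniform posterior gives the maximum:

```r
# Shannon entropy (in bits) of a posterior probability vector.
# Assumption: this matches the simulation's 'entropy' column -- unverified.
shannon_entropy <- function(p) {
  p <- p[p > 0]          # treat 0 * log(0) as 0
  -sum(p * log2(p))
}

uniform_post <- rep(1/100, 100)             # probability spread over 100 snps
peaked_post  <- c(0.99, rep(0.01/99, 99))   # one snp dominates

shannon_entropy(uniform_post)  # maximal: log2(100), about 6.64
shannon_entropy(peaked_post)   # much lower: concentration reduces entropy
```

If the simulated entropies behave the other way around (high entropy when one snp dominates), the column is presumably defined differently and that definition would be worth checking.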
For any entropy value, the coverage for non-ordered sets is higher than that for ordered sets. This does not seem right, as we know there are instances of over- and under-coverage for different values of OR.
We know that entropy depends on OR; see below for separate plots for each value of OR. Since they are all so different, it seems unwise to ‘squish’ them all into one plot. Idea: could also include OR (and an interaction term?) in the glm.
Ordered:
Non-ordered:
The graph shows that coverage is higher for any threshold value for non-ordered sets compared with ordered sets. As the threshold increases to 1, the difference between coverage for ordered and non-ordered sets decreases.
Similarly to coverage~thr (since size and thr are closely related), it seems that coverage is higher for any size value for non-ordered sets compared with ordered sets. As the size increases to 1, the difference between coverage for ordered and non-ordered sets decreases.
Note the difference in plots for thr and size for 0.5-0.8.
As the number of snps increases in the system, the coverage decreases slightly.
As N increases, coverage increases. This increase is faster for non-ordered as opposed to ordered sets.
This plot doesn’t make much sense; perhaps something is wrong with the code?
As nvar increases, coverage increases. This increase is steeper for ordered sets, and coverage for ordered sets overtakes that for non-ordered sets at nvar ≈ 625.
Coverage for ordered sets increases rapidly for OR=1-1.15 and plateaus at OR=1.4. For non-ordered sets, the increase in coverage is a lot slower. For OR=1 (where the odds of disease for people with and without the allele are equal), coverage is a lot higher for non-ordered sets (~0.65) than ordered sets (~0.38), presumably because the CV would not hold a substantial proportion of the posterior probability, and therefore would not necessarily appear towards the beginning of the ordering.
Is there a way to quantify how successful these predictions were? For example, the entropy-only model predicts an average coverage of 0.5177 for \(entropy=4\); is this correct?
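One standard way to score probabilistic predictions is the Brier score (mean squared difference between the predicted probability and the 0/1 outcome), together with a calibration check: within each band of predicted probability, does the observed coverage rate match the mean prediction? A sketch on simulated data (the real `covered` outcomes and model predictions would be substituted in):

```r
set.seed(1)
p_true  <- runif(1000)                      # hypothetical true coverage probs
covered <- rbinom(1000, 1, p_true)          # simulated 0/1 outcomes
# noisy predictions, clipped away from 0 and 1
pred <- pmin(pmax(p_true + rnorm(1000, 0, 0.05), 0.001), 0.999)

brier <- mean((pred - covered)^2)           # 0 = perfect; 0.25 ~ no skill

# calibration table: observed coverage rate per predicted-probability bin
bins  <- cut(pred, breaks = seq(0, 1, 0.1))
calib <- data.frame(mean_pred = tapply(pred, bins, mean),
                    obs_rate  = tapply(covered, bins, mean))
```

For the \(entropy=4\) question specifically, the observed mean of `covered` among simulations with entropy near 4 could be compared against the predicted 0.5177.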
The full model is presumably a better predictor of coverage. The cv.glmnet function has been used to run the lasso on the model. The fit for the ‘best’ lambda (lambda.1se: the largest value of lambda such that the cross-validation error is within 1 standard error of the minimum) is given below:
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-16
set.seed(3)
xfactors <- model.matrix(covered ~ ., data=cs_ord)[, -1]
x <- as.matrix(data.frame(xfactors))
cv.ord <- cv.glmnet(x, y=cs_ord$covered, alpha=1, family="binomial")
coef(cv.ord,s=cv.ord$lambda.1se)
## 8 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -1.980661e+01
## thr          3.208768e+00
## size         2.175730e+00
## nvar         1.223560e-03
## entropy      .
## N            9.136867e-05
## OR           1.521719e+01
## nsnps       -4.792834e-04
Note: Surely entropy should not have been dropped? Perhaps ridge regression should be used instead of the lasso, as it copes better with collinear variables?
Note 2: Can we run prediction on this? Would need to specify values for all the variables…
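On both notes: ridge is obtained from cv.glmnet with `alpha = 0`, and it shrinks correlated coefficients towards each other rather than setting some exactly to zero; prediction from a cv.glmnet fit works via `predict(..., newx, s = ...)` and does indeed need a value for every predictor column. A sketch on simulated stand-in data (cs_ord itself is not reproduced here):

```r
library(glmnet)
set.seed(3)
n <- 500
# two predictors plus a third deliberately collinear with 'thr'
x <- cbind(thr = runif(n, 0.5, 0.9), entropy = runif(n, 4, 20))
x <- cbind(x, size = x[, "thr"] + rnorm(n, 0, 0.05))
y <- rbinom(n, 1, plogis(-2 + 3 * x[, "thr"]))

cv.ridge <- cv.glmnet(x, y, alpha = 0, family = "binomial")
coef(cv.ridge, s = cv.ridge$lambda.1se)  # all coefficients kept, shrunken

# prediction requires one value per predictor column:
newx <- matrix(c(0.8, 10, 0.8), nrow = 1,
               dimnames = list(NULL, colnames(x)))
predict(cv.ridge, newx = newx, s = cv.ridge$lambda.1se, type = "response")
```

Elastic net (`alpha` between 0 and 1) is a middle ground if some variable selection is still wanted alongside robustness to collinearity.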
# GLM function
test <- sample(nrow(cs_ord), 100000) # randomly select rows to form the test set
test_set <- cs_ord[test, ]
training_set <- cs_ord[-test,]
fit_glm <- glm(covered ~ ., training_set, family=binomial(link="logit"))
glm_scores <- predict(fit_glm, test_set, type="response")
hist(glm_scores)
# round output to 0 if <0.5 and 1 if >0.5
rounded_scores <- round(glm_scores, digits = 0)
new <- cbind(test_set, rounded_scores)
length(which(new$covered==new$rounded_scores))/nrow(new) # ~81% accurate
## [1] 0.80978
a <- which(new$covered==1 & new$rounded_scores==0) # length 7415: false negative 38.6%
b <- which(new$covered==0 & new$rounded_scores==1) # length 11789: false positive 61.4%
# Hence, 'false positives' reported more often than 'false negatives'
head(new)
## thr size nvar covered entropy N OR nsnps rounded_scores
## 1: 0.6 0.6012678 175 1 6.426384 1000 1.0 500 0
## 2: 0.5 0.8190513 1 0 8.992368 5000 1.1 1000 1
## 3: 0.7 0.9999837 1 1 17.685040 4000 1.3 500 1
## 4: 0.6 0.6000352 117 1 6.621915 2000 1.2 500 1
## 5: 0.8 0.8010262 287 1 6.588828 5000 1.1 500 1
## 6: 0.7 0.8927730 1 1 8.521814 2000 1.1 300 1
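The accuracy and false-positive/false-negative counts above can be obtained in one step with `table()`. A sketch on toy vectors standing in for `new$covered` and `new$rounded_scores`:

```r
# toy observed outcomes and rounded predictions
covered        <- c(1, 0, 1, 1, 0, 1, 0, 0)
rounded_scores <- c(1, 1, 0, 1, 0, 1, 1, 0)

conf <- table(observed = covered, predicted = rounded_scores)
conf

accuracy  <- sum(diag(conf)) / sum(conf)  # 0.625 here
false_neg <- conf["1", "0"]  # covered but predicted uncovered
false_pos <- conf["0", "1"]  # uncovered but predicted covered
```

The same three lines applied to `new` reproduce the ~81% accuracy and the false-negative/false-positive split reported above.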
Seems abrupt to commit a value of 0.500001 to 1 and a value of 0.499999 to 0. It would be good to apply a correction factor to these values, to ‘pull’ them further away from 0.5, perhaps incorporating entropy (which the lasso dropped above).
Run this training/test-set idea with the single-variable glm models above to see how well or badly each performs.
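A sketch of that check, looping over single-predictor formulas and scoring held-out accuracy, on simulated stand-in data (the real cs_ord data.table and the full list of variables would be substituted in):

```r
set.seed(42)
# stand-in data: entropy carries signal, thr is noise
d <- data.frame(entropy = runif(2000, 4, 20),
                thr     = sample(seq(0.5, 0.9, 0.1), 2000, replace = TRUE))
d$covered <- rbinom(2000, 1, plogis(-3 + 0.3 * d$entropy))

test <- sample(nrow(d), 500)  # held-out rows
acc <- sapply(c("entropy", "thr"), function(v) {
  fit  <- glm(reformulate(v, "covered"), data = d[-test, ], family = binomial)
  pred <- round(predict(fit, d[test, ], type = "response"))
  mean(pred == d$covered[test])
})
acc  # named vector of per-variable held-out accuracy
```

Comparing each single-variable accuracy against the full model’s ~81% would show how much each predictor contributes on its own.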
Possible issue with dataset: In the ordered dataset, there are loads (~10% of) entries with \(size=nvar=covered=1\), meaning the credible set consists only of the CV which holds all the posterior probability. These are mostly for systems with OR=1.3 and very high entropy (>15). Should I remove these from the dataset as they do not represent real-world GWAS results we would like to study (as they would already have been investigated)?
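If these degenerate sets are to be excluded, a single data.table filter on the three conditions does it. Toy example (stand-in columns in place of cs_ord):

```r
library(data.table)
toy <- data.table(size    = c(1, 1, 0.7),
                  nvar    = c(1, 3, 5),
                  covered = c(1, 1, 0),
                  OR      = c(1.3, 1.1, 1.0))
# drop rows where the credible set is just the CV holding all the probability
toy_filtered <- toy[!(size == 1 & nvar == 1 & covered == 1)]
nrow(toy_filtered)  # 2: only the single-CV row is dropped
```

Whether to filter before or after model fitting matters: dropping ~10% of rows will shift the fitted coverage curves, so it may be worth reporting results both ways.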