### Aim: Explore relationships between variables for ordered and non-ordered sets.

#### Method: Pairs plots, heatmap, correlation networks, logistic regression.

##### The authors claim that coverage ~ size in non-ordered sets. We hope that by incorporating information on both the size and the entropy of the system in ordered sets, we can better predict coverage. I.e., we hope that entropy describes the way set size changes for ordered sets, such that information on the OR (not known to experimenters) is not needed to predict coverage.

A new dataset has been created using reference haplotypes with $$MAF=0.5$$, $$LD1=0.2$$ and $$LD2=0.08$$.

$$OR=1.3$$ has been dropped from the dataset as this represents systems with very high entropy, with the CV holding a large proportion of the posterior probability. Moreover, these CVs are likely to have been identified by previous GWAS.

OR = 1, 1.05, 1.08, 1.1, 1.15, 1.2.

N = 1000, 2000, 3000, 4000, 5000.

thr = 0.5, 0.6, 0.7, 0.8, 0.9.

nsnps = 100, 200, 300, 500, 1000.
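
The parameter values above can be enumerated programmatically. The sketch below assumes the dataset is a full factorial design, with every combination of OR, N, thr and nsnps simulated. The source does not state this, and the variable names are illustrative.

```python
from itertools import product

# Parameter values as listed above (OR = 1.3 already dropped).
odds_ratios = [1, 1.05, 1.08, 1.1, 1.15, 1.2]
sample_sizes = [1000, 2000, 3000, 4000, 5000]
thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]
n_snps = [100, 200, 300, 500, 1000]

# Assumption: every combination is simulated (full factorial design).
grid = [
    {"OR": o, "N": n, "thr": t, "nsnps": s}
    for o, n, t, s in product(odds_ratios, sample_sizes, thresholds, n_snps)
]

print(len(grid))  # 6 * 5 * 5 * 5 = 750 parameter combinations
```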

## Inferences

• We expect that size and covered are highly correlated for unordered sets, and that this correlation is weaker for ordered sets. We hope to incorporate entropy in the model for ordered sets to account for some of the extra noise.

• Ordered: 0.38
• Unordered: 0.40

• Size and covered are only slightly more correlated in unordered sets (0.40 vs 0.38)

• We expect that OR and entropy are highly correlated. These variables are the same in the ordered and non-ordered datasets as they reflect information on the system, and the same systems were used to form ordered and non-ordered credible sets.

• Ordered/Unordered: 0.545

• OR and entropy are highly correlated

• We expect that entropy and covered are more correlated in ordered than non-ordered sets. We hope to include entropy as a predictor for coverage in ordered sets.

• Ordered: 0.202
• Unordered: 0.202

• The correlation is low and identical in both datasets, so entropy may not be a significant predictor of coverage in the logistic regression section that follows.

• We see that there is much higher correlation between OR and covered in ordered sets. We hope that by incorporating entropy as a predictor for coverage in ordered sets, we do not need to incorporate information on the OR, as this is not known to experimenters.

• Ordered: 0.447
• Unordered: 0.217

• OR and covered show high correlation in ordered sets

• We see that nvar and entropy are negatively correlated in ordered sets, but essentially uncorrelated in unordered sets.

• Ordered: -0.212
• Unordered: 0.023

• nvar and entropy show a negative correlation in ordered sets only

• Similarly, nvar and OR have stronger (negative) correlation in ordered sets.

• Ordered: -0.246
• Unordered: -0.064

• nvar and OR show higher negative correlation in ordered sets

• We see that nsnps and nvar are much more correlated in unordered sets. This intuitively makes sense.

• Ordered: 0.52
• Unordered: 0.875

• nsnps and nvar highly correlated in unordered sets

• We see that thr and nvar are more correlated in ordered sets: as the threshold increases, so does nvar. I would have expected this correlation to be higher in unordered sets: if there is a SNP with very high posterior probability, it will be included in the set sooner by the ordered method than by the non-ordered method, making the set size smaller? For non-ordered sets, by contrast, more SNPs have to be added to the set before ‘finding’ this high posterior probability SNP.

• Ordered: 0.452
• Unordered: 0.189

• thr and nvar more correlated in ordered sets
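
The pairwise comparisons above can be reproduced with a small helper. This is a minimal sketch, assuming the quoted figures are Pearson correlations (for the binary covered variable this coincides with the point-biserial correlation) and that each dataset is held as a dictionary of columns. The function names and the toy values are hypothetical and are not the simulation output.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return sxy / sqrt(sxx * syy)

def compare(ordered, unordered, var_a, var_b):
    """Correlation of var_a with var_b in each dataset (dict of columns)."""
    return {
        "ordered": pearson(ordered[var_a], ordered[var_b]),
        "unordered": pearson(unordered[var_a], unordered[var_b]),
    }

# Made-up rows purely to show the call; not the simulation results.
toy_ordered = {"thr": [0.5, 0.6, 0.7, 0.8, 0.9], "nvar": [1, 2, 2, 4, 7]}
toy_unordered = {"thr": [0.5, 0.6, 0.7, 0.8, 0.9], "nvar": [40, 55, 38, 60, 52]}
result = compare(toy_ordered, toy_unordered, "thr", "nvar")
print(result)
```

Calling `compare` once per variable pair (size and covered, OR and entropy, and so on) would give the ordered and unordered figures side by side.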

The next section will analyse the following claims:

Claim 1: $$\log\left(\frac{p}{1-p}\right)\sim \log\left(\frac{size}{1-size}\right)$$ works well for non-ordered sets but less well for ordered sets.

Claim 2: Can we improve the accuracy of the above model in ordered sets by incorporating entropy as a predictor?

Claim 3: We hope that adding OR to the $$\log\left(\frac{p}{1-p}\right)\sim \log\left(\frac{size}{1-size}\right)+entropy$$ model does not improve it much, i.e. that entropy has already absorbed the information carried by OR.

Claim 4: Entropy has a non-linear effect on coverage. Use the rcs (restricted cubic spline) function to analyse this effect.
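
As a rough illustration of the models named in these claims, the sketch below fits the Claim 1 baseline (covered regressed on logit(size)) by Newton-Raphson and builds the spline terms that Claim 4 refers to. It is written under several assumptions: size is taken to be a probability strictly between 0 and 1, `fit_logistic` and `rcs_basis` are hand-rolled hypothetical helpers rather than the functions used in the analysis, the spline uses the standard truncated-power form (its scaling has not been checked against rcs), and the toy values and knot positions are made up.

```python
from math import exp, log

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1 / (1 + exp(-z))
    ez = exp(z)
    return ez / (1 + ez)

def fit_logistic(x, y, n_iter=25):
    """Newton-Raphson fit of P(y = 1) = sigmoid(b0 + b1 * x); returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1 - p)
            g0 += yi - p          # score for the intercept
            g1 += (yi - p) * xi   # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0:
            break
        # Newton step: solve the 2x2 system (information matrix) * delta = score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def rcs_basis(x, knots):
    """Restricted cubic spline terms for scalar x: [x, s_1(x), ..., s_{k-2}(x)]."""
    t = sorted(knots)
    if len(t) < 3 or len(set(t)) != len(t):
        raise ValueError("need at least 3 distinct knots")

    def cube_plus(u):
        return u ** 3 if u > 0 else 0.0

    span = t[-1] - t[-2]
    scale = (t[-1] - t[0]) ** 2  # rescales the cubic terms to the units of x
    terms = [x]
    for tj in t[:-2]:
        s = (cube_plus(x - tj)
             - cube_plus(x - t[-2]) * (t[-1] - tj) / span
             + cube_plus(x - t[-1]) * (t[-2] - tj) / span)
        terms.append(s / scale)
    return terms

# Made-up values purely to exercise the functions; not the simulation results.
toy_size = [0.55, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.97, 0.99]
toy_covered = [0, 1, 0, 1, 1, 0, 1, 1, 1, 1]
xs = [logit(s) for s in toy_size]
b0, b1 = fit_logistic(xs, toy_covered)
print("intercept, slope:", b0, b1)

# Hypothetical knot positions for entropy; real knots would come from the data.
print(rcs_basis(1.5, knots=[0.5, 1.0, 2.0, 3.0]))
```

For Claims 2 to 4, entropy (or its spline terms) and OR would enter as additional columns, which needs a multi-predictor fit that this two-parameter sketch does not cover. The spline terms are zero below the first knot and linear beyond the last, which is the restriction that gives the method its name.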