Aim: Explore relationships between variables for ordered and non-ordered sets.


Method: Pairs plots, heatmap, correlation networks, logistic regression.


The authors claim that coverage ~ size in non-ordered sets. We hope that by incorporating information on both the size and the entropy of the system in ordered sets, we can better predict coverage. That is, we hope that entropy describes the way set size changes for ordered sets, so that information on the OR (which is not known to experimenters) is not needed to predict coverage.
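To make the distinction between ordered and non-ordered sets concrete, here is a minimal sketch. It assumes that a set is built by adding variants until their cumulative posterior probability reaches the threshold, that "size" is the sum of the posterior probabilities in the set, and that an ordered set adds variants in decreasing order of posterior probability. None of these definitions is stated in this section, and the function name `credible_set` and the toy values are hypothetical.

```python
def credible_set(pp, thr, order):
    """Add variants in the given order until their cumulative posterior
    probability reaches thr; return (indices in the set, size of the set)."""
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i)
        total += pp[i]
        if total >= thr:
            break
    return chosen, total


# Toy posterior probabilities (illustrative only, not from the dataset).
pp = [0.5, 0.3, 0.1, 0.1]
thr = 0.7

# Ordered set (assumed definition): variants sorted by decreasing posterior probability.
ordered = sorted(range(len(pp)), key=lambda i: pp[i], reverse=True)
ordered_set, ordered_size = credible_set(pp, thr, ordered)

# Non-ordered set: variants taken in an arbitrary order (here, reversed).
unordered_set, unordered_size = credible_set(pp, thr, [3, 2, 1, 0])

print(ordered_set, ordered_size)
print(unordered_set, unordered_size)
```

In this toy case the same threshold gives a smaller set when the variants are ordered, which illustrates why the relationship between size and coverage may differ between the two constructions.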

A new dataset has been created using reference haplotypes with \(MAF=0.5\), \(LD1=0.2\) and \(LD2=0.08\).

\(OR=1.3\) has been dropped from the dataset as this represents systems with very high entropy, with the CV holding a large proportion of the posterior probability. Moreover, these CVs are likely to have been identified by previous GWAS.

OR = 1, 1.05, 1.08, 1.1, 1.15, 1.2.

N = 1000, 2000, 3000, 4000, 5000.

thr = 0.5, 0.6, 0.7, 0.8, 0.9.

nsnps = 100, 200, 300, 500, 1000.
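The parameter lists above can be enumerated as a grid. This is a sketch only: it assumes that every combination of the four parameters is simulated (a full factorial design), which is not stated here, and the variable names are illustrative.

```python
from itertools import product

# Parameter values copied from the lists above.
odds_ratios = [1, 1.05, 1.08, 1.1, 1.15, 1.2]
sample_sizes = [1000, 2000, 3000, 4000, 5000]
thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]
n_snps = [100, 200, 300, 500, 1000]

# Assumption: every combination is simulated (full factorial design).
grid = [
    {"OR": o, "N": n, "thr": t, "nsnps": s}
    for o, n, t, s in product(odds_ratios, sample_sizes, thresholds, n_snps)
]

print(len(grid))  # 6 * 5 * 5 * 5 = 750 combinations
print(grid[0])
```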


Pairs plot

Figure: Pairs plot for ordered

Figure: Pairs forest plot for unordered

Correlation heat map

Correlation networks

Figure: Correlation networks
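A correlation network can be read as a graph whose edges join pairs of variables with a strong pairwise correlation. The following is a minimal sketch, assuming Pearson correlation and a hypothetical cutoff of 0.8; neither choice is specified here. The values in `toy` are made up for illustration and are not taken from the dataset.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_network(columns, cutoff):
    """Return edges (var_a, var_b, r) for every pair with |r| >= cutoff."""
    edges = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= cutoff:
            edges.append((a, b, r))
    return edges


# Made-up values for illustration only.
toy = {
    "size": [0.5, 0.6, 0.7, 0.8, 0.9],
    "coverage": [0.55, 0.62, 0.74, 0.81, 0.93],
    "noise": [0.3, 0.1, 0.4, 0.1, 0.5],
}

edges = correlation_network(toy, cutoff=0.8)
for a, b, r in edges:
    print(f"{a} -- {b}: r = {r:.3f}")
```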


Inferences



The next section will analyse the following claims:

Claim 1: \[\log\left(\frac{p}{1-p}\right)\sim \log\left(\frac{\text{size}}{1-\text{size}}\right)\] works well for non-ordered sets but less well for ordered sets.

Claim 2: The accuracy of the above model in ordered sets can be improved by incorporating entropy as a predictor.

Claim 3: Adding OR to the \(\log\left(\frac{p}{1-p}\right)\sim \log\left(\frac{\text{size}}{1-\text{size}}\right)+\text{entropy}\) model does not improve it substantially. We hope that entropy has already absorbed the information carried by OR.

Claim 4: Entropy has a non-linear effect on coverage. The rcs function (restricted cubic splines) is used to analyse this non-linear effect.
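Written out, the four claims correspond to a sequence of logistic regression models. This is a sketch of the notation only: the coefficients \(\beta_i\) and the spline term \(f\) are symbols introduced here, and \(p\) is taken to be the probability that the set contains the CV (the coverage), with \(\text{logit}(x)=\log\left(\frac{x}{1-x}\right)\).

\[
\begin{aligned}
\text{Claim 1:}\quad & \text{logit}(p) = \beta_0 + \beta_1\,\text{logit}(\text{size}) \\
\text{Claim 2:}\quad & \text{logit}(p) = \beta_0 + \beta_1\,\text{logit}(\text{size}) + \beta_2\,\text{entropy} \\
\text{Claim 3:}\quad & \text{logit}(p) = \beta_0 + \beta_1\,\text{logit}(\text{size}) + \beta_2\,\text{entropy} + \beta_3\,OR \\
\text{Claim 4:}\quad & \text{logit}(p) = \beta_0 + \beta_1\,\text{logit}(\text{size}) + f(\text{entropy})
\end{aligned}
\]

Here \(f\) stands for a restricted cubic spline in entropy. Claim 3 holds if the model with \(\beta_3\) fits little better than the Claim 2 model, and Claim 4 holds if replacing the linear entropy term with \(f\) improves the fit.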