We have so far performed an exploratory or descriptive analysis of the data. This is the first important step! Usually, in a second step, we want to test some statistical hypothesis: do women have lower cholesterol level compared to men? Are height and weight significantly correlated? Are there significantly more men from Buckingham compared to women? We do not want to given a statistics class, but we will see how to answer these 3 questions using R functions!
This is the most widely used class of statistical tests; we use them to test the value of the mean of a vector of numerical values. Let us test if men have significant different cholesterol values compared to women!
First, we have to exactly define the question we want to answer; we have three possible questions:
Notice the difference? The first 2 questions correspond to one-sided tests, as we indicate throught the question that we want to check for difference in a specific direction. The third question corresponds to a two-sided test, as we do not specify in which direction we expect the difference.
Important: Make sure to specify the question before you start doing your test!!
If we have no reason to expect the changes to be in one or the other direction, we go for a two-sided test.
We read again the data matrix:
dat = read.delim("https://tinyurl.com/87tpvd4c", stringsAsFactors = TRUE)
We first create a vector of cholesterol levels for men an women separately. We use here again the very usefull function split()
:
cholesterol = split(dat$chol, dat$gender)
Check the type of the cholesterol object!
We now perform a t-test comparing these 2 vectors:
t.test(cholesterol$female, cholesterol$male)
Welch Two Sample t-test
data: cholesterol$female and cholesterol$male
t = 0.85404, df = 314.56, p-value = 0.3937
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.249704 13.302582
sample estimates:
mean of x mean of y
209.1449 205.1184
Check the output of the test; do you understand the information given here? What is the meaning of the alternative hypothesis line? Does the test seem to be significant?
There are multiple pieces of information in this output:
How would we perform a one-sided t-test? Let's suppose we want to test if men have a higher cholesterol level compared to women:
The R command reads:
t.test(cholesterol$male, cholesterol$female, alternative = "greater")
Welch Two Sample t-test
data: cholesterol$male and cholesterol$female
t = -0.85404, df = 314.56, p-value = 0.8031
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-11.80418 Inf
sample estimates:
mean of x mean of y
205.1184 209.1449
Beware that the alternative='greater'
refers to the first sample indicated in the function! So here, the order matters!
Test if men have a significantly higher weight compared to women!
weights = split(dat$weight, dat$gender)
t.test(weights$male, weights$female, alternative = "greater")
Check again the corresponding violin plot!
We have previously computed the correlations between the numeric variables, and displayed them as heatmaps. Now, even if the correlation is non zero, maybe the correlation is not really significant. Hence, it does not indicate a strong relationship between the 2 variables... Let's see an example:
height
and waist
cor(dat$height, dat$waist)
[1] 0.04180787
Non zero, but not very large... But sometimes (if you have many values), a small correlation can be significant, despite being small.
We can check this with the function cor.test
:
cor.test(dat$height, dat$waist)
Pearson's product-moment correlation
data: dat$height and dat$waist
t = 0.79834, df = 364, p-value = 0.4252
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.06096349 0.14370208
sample estimates:
cor
0.04180787
Check again the different pieces of the output, and try to interpret them!
Well, nice try, but these 2 variables indeed do not seem to be significantly associated!
Try to find two variables with a significant correlation; maybe you can inspect the previous heatmap to guess which could be interesting to check!
As the name indicates, we use these types of tests to compute the relationship between categorical variables. In particular, we can test if the proportions in different categories are significantly different.
Let us see an example: are there significantly more men coming from Buckingham compared to women?
You might know that the starting point for these types of test are contingency matrices; here, that you be a 2x2 matrix with columns indicating men/women and rows indicating Buckinham/Louisa (or vice-versa). We can easily produce such a table with the .... table
function:
T = table(dat$gender, dat$location)
T
Buckingham Louisa
female 102 112
male 73 79
Now, there are two tests we can use for this kind of question:
Let us try both of them, to see if they agree!
fisher.test(T)
Fisher's Exact Test for Count Data
data: T
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.636320 1.527029
sample estimates:
odds ratio
0.9856036
chisq.test(T)
Pearson's Chi-squared test with Yates' continuity correction
data: T
X-squared = 0, df = 1, p-value = 1
Well, they do... Absolutely non significant!
Test if the proportion of small/medium/large value in the
frame
variable depends on the gender.... What type of plot could we use to visualize this?
## build the contigency table
T = table(dat$frame, dat$gender)
## perform test
chisq.test(T)
Pearson's Chi-squared test
data: T
X-squared = 19.199, df = 2, p-value = 6.777e-05
So there is a significant relations between gender and the frame variable! Hence the proportions are unequal!
We can try to do a barplot to visualize these differences, in particular a stacked barplot
barplot(T)
Nice, but we could choose nicer colors!
library(RColorBrewer)
col.frame = brewer.pal(3, "Set1")
barplot(T, col = col.frame)
By the way, can we add a legend??
barplot(T, col = col.frame)
legend("topright", legend = rownames(T), fill = col.frame)
Nice!!
compute the pairwise correlation between all patients in the
all.aml
datasetplot the heatmap; use the
annotation_row=...
argument to add additional informationselect a pair of positively/negatively correlated patients; determine if the correlation is significant
aml.cor = cor(all.aml,method='spearman')
Now make the heatmap:
Choose a nice color palette:
library(RColorBrewer)
## 10 colors from the PiYG palette
col.cor = brewer.pal(10, "PiYG")
## we can extrapolate to more colors shades
col.cor = colorRampPalette(col.cor)(100)
pheatmap(aml.cor,col=col.cor,annotation_row=all.aml.anno,annotation_col=all.aml.anno)
Testing the correlation between patient 34 and 66:
cor.test(all.aml$pat34,all.aml$pat66)
Pearson's product-moment correlation
data: all.aml$pat34 and all.aml$pat66
t = 26.702, df = 829, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6416567 0.7149375
sample estimates:
cor
0.6799916