# Here are the libraries I used
library(tidyverse) # standard
library(knitr)
library(readr)
library(ggpubr) # allows for stat_cor in ggplots
library(ggfortify) # Needed for autoplot to work on lm()
library(gridExtra) # allows me to organize the graphs in a grid
library(car) # need for some regression stuff like vif
library(GGally)
# library(mosaic)
12 Multiple Comparisons
Here are some code chunks that set up this document.
# This changes the default theme of ggplot
old.theme <- theme_get()
theme_set(theme_bw())
Suppose we did a single test between two group means:
We would have: \[ t = \frac{\overline{y_i}-\overline{y_j}}{SE_{\overline y_i - \overline y_j}} \] Then we get a p-value based on that t-distribution with some degrees of freedom defined by the method we were using for the comparison, e.g., equal or unequal variance two-sample t-test, ANOVA, and others…
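For a concrete (hypothetical) sketch, suppose an equal-variance two-sample t-test with 15 observations per group gave \(t = 2.1\); the two-sided p-value comes from the t-distribution with \(15 + 15 - 2 = 28\) degrees of freedom:

```r
# Hypothetical: t = 2.1 on 28 df from an equal-variance two-sample t-test
2 * pt(-abs(2.1), df = 28)   # two-sided p-value, roughly 0.045
```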
Null Hypothesis Significance Testing (NHST) Approach
In NHST we “Reject \(H_0\) if \(p \le \alpha\)”.
\(\alpha\) is the Type I error rate: “The probability of rejecting \(H_0\) when \(H_0\) is true.”
- This would be a “false positive” or a “false discovery”
P-value as a spectrum.
We would talk about the strength of evidence for declaring a difference \(\mu_i - \mu_j \ne 0\).
- This approach would not have a Type I error per se.
- The idea of a Type I error would fall on a spectrum as well.
12.1 Multiple testing problem
In the ANOVA setting, we either reject \(H_0\) (all means are equal) or declare that the strength of evidence is sufficient to be confident of a difference among the means.
The next step is to look and see which specific groups/means differ.
This involves pairwise comparisons.
\[ t = \frac{\overline{y_i}-\overline{y_j}}{\sqrt{MSE\left(\frac{1}{n_i}+\frac{1}{n_j}\right)}} \]
When 2 or more comparisons are done we have more than one chance of making a Type I error; we have a “family” of tests.
This is where we get the concept of the Family-Wise Error Rate (FWER).
- This is defined to be the probability of making at least one Type I error.
- If all the tests are “independent” then \(FWER = 1 - (1 - \alpha_c)^k\) where \(\alpha_c\) is the type I error rate for each individual comparison and \(k\) is the total number of comparisons in the family of tests.
As FWER increases, the reliability of the individual comparisons decreases.
The number of possible pairwise comparisons among \(t\) groups is \(k = \frac{t(t - 1)}{2}\)
If we did not control the FWER and used \(\alpha_c = 0.01\) for each comparison:
Groups | Comparisons | FWER |
---|---|---|
2 | 1 | 0.010 |
3 | 3 | 0.030 |
4 | 6 | 0.059 |
5 | 10 | 0.096 |
6 | 15 | 0.140 |
7 | 21 | 0.190 |
8 | 28 | 0.245 |
9 | 36 | 0.304 |
10 | 45 | 0.364 |
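This table can be reproduced directly from the two formulas above:

```r
# FWER for all pairwise comparisons among 2 to 10 groups, alpha_c = 0.01
alpha.c <- 0.01
groups  <- 2:10
k       <- groups * (groups - 1) / 2   # number of pairwise comparisons
fwer    <- 1 - (1 - alpha.c)^k         # FWER assuming independent tests
data.frame(Groups = groups, Comparisons = k, FWER = round(fwer, 3))
```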
There are a few common methods for controlling the FWER.
Notation.
- \(\alpha_F\) represents the desired FWER for the family of tests.
- \(\alpha_C\) is the type I error rate for each individual test/comparison.
12.2 The Bonferroni method
Consider \(k\) hypothesis tests with desired FWER \(\alpha_F = 0.01\): \(H_{0i}\) vs. \(H_{1i}\), \(i = 1, \ldots, k\).
Let \(p_1, \ldots, p_k\) be the \(p\)-values for these tests.
Suppose we reject \(H_{0i}\) when \(p_i \le \alpha_C\).
\[ FWER \le k\cdot\alpha_C \]
To guarantee \(FWER \le \alpha_F\), set \(k\cdot\alpha_C = \alpha_F\), which gives
\[ \alpha_C = \frac{\alpha_F}{k} \]
This is based on what is known as Boole’s theorem/inequality.
Bonferroni was the person who connected Boole’s inequality to hypothesis testing.
For the more probability-theory-inclined, you can see a proof at https://en.wikipedia.org/wiki/Bonferroni_correction
The Bonferroni rule for multiple comparisons is:
- Reject null hypothesis \(H_{0i}\) when \(p_i < \alpha_F/k\).
- Alternatively, you could get Bonferroni corrected p-values: \(p_i^* = p_i \cdot k\).
- Reject \(H_{0i}\) when \(p_i^* \le \alpha_F\).
- Use \(p_i^*\) if you are using the strength of evidence approach (a short sketch follows this list).
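As a quick sketch with some hypothetical p-values, `p.adjust()` carries out this correction (multiply by \(k\) and cap at 1):

```r
# Hypothetical unadjusted p-values from k = 3 comparisons
p <- c(0.004, 0.020, 0.450)
p.adjust(p, method = "bonferroni")   # equivalent to pmin(p * length(p), 1)
```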
12.2.1 Example 1, OASIS data
oasis <- read_csv(here::here('datasets', 'oasis.csv'))
In our OASIS data, we have three groups:
- Group: whether subject is demented, nondemented, or converted.
We are trying to see if nWBV differs among the groups.
- nWBV: Normalized Whole Brain Volume; brain volume is normalized by intracranial volume to put subjects of different sizes and gender on the same scale. To the best of my knowledge…
If we wanted to do all pairwise comparisons between the groups, we can do that via the `pairwise.t.test()` function.

- The form is `pairwise.t.test(x, g, p.adjust.method)`
  - `x` is the response vector.
  - `g` is the group vector.
  - `p.adjust.method` is the method used to adjust p-values.
    - There are many, but the two we will discuss are “none” and “bonferroni”.
- Say we had loaded a data set called `data`, then we would identify a vector within the dataframe using `data$Variable`.
  - `pairwise.t.test(data$responseVar, data$groupVar, "none")` would do all pairwise t-tests with no correction.
  - `pairwise.t.test(data$responseVar, data$groupVar, "bonferroni")` would do all pairwise t-tests with Bonferroni-corrected p-values.
There are 3 groups, so there are \(k = \frac{3(3-1)}{2} = 3\) comparisons.
Here are the unadjusted and adjusted p-values.
pairwise.t.test(oasis$nWBV, oasis$Group,
p.adjust.method = "none")
Pairwise comparisons using t tests with pooled SD
data: oasis$nWBV and oasis$Group
Converted Demented
Demented 0.19624 -
Nondemented 0.42698 0.00046
P value adjustment method: none
pairwise.t.test(oasis$nWBV, oasis$Group,
p.adjust.method = "bonferroni")
Pairwise comparisons using t tests with pooled SD
data: oasis$nWBV and oasis$Group
Converted Demented
Demented 0.5887 -
Nondemented 1.0000 0.0014
P value adjustment method: bonferroni
Notice the adjusted p-value is capped at 1.
12.2.2 Example, Genomics
In genomics, hundreds if not thousands of genes are compared in terms of “activation levels” (or something like that…).
- Suppose we were comparing 100 genes; then there would be 4950 comparisons.
- We would multiply each unadjusted p-value by 4950, which means an individual comparison would have to have an unadjusted p-value less than about 0.00001 to be rejected with \(\alpha_F = 0.05\) using the Bonferroni method (a quick check of this arithmetic follows the list).
- We would probably miss many results where there is a difference in mean activation levels.
- The Bonferroni method is overly conservative, i.e., it fails to reject the null hypothesis too often (it has low power).
- In general, the Bonferroni method should not be applied unless you absolutely have to (and I cannot think of a situation where you would, other than your “boss” telling you to).
- However, you are still expected to know it, so here it is.
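As a quick check of the arithmetic above:

```r
k <- choose(100, 2)   # number of pairwise comparisons among 100 genes: 4950
0.05 / k              # Bonferroni per-comparison cutoff, about 0.00001
```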
12.3 Tukey’s HSD (Tukey’s Honestly Significant Difference)
\[ MSE = \frac{SSE}{N-t} \] where \(t\) is the number of groups and \(N\) is the total number of observations.
\[ Q = \frac{|\max_i(\overline{y}_i)-\min_j(\overline{y}_j)|}{\sqrt{MSE/n}} \] where \(n\) is the (common) sample size per group.
Under the null hypothesis of ANOVA (all means equal), \(Q\) follows the “Studentized Range Distribution” \(q(t, N-t)\).
The Tukey method does not have to be used in ANOVA, but that is the main setting where it applies.
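For instance, the 5% critical value of the studentized range for \(t = 3\) groups and \(N - t = 147\) error degrees of freedom (the setting of the OASIS example below) can be found with `qtukey()`:

```r
# Critical value of Q at alpha = 0.05 for 3 group means and 147 error df
qtukey(0.95, nmeans = 3, df = 147)
```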
12.3.1 Tukey in R
Now a pain with R is that all of these different types of analysis methods, e.g., `lm`, `aov`, etc., may have different subsidiary functions for further analysis.
In base R, `TukeyHSD()` is a very convenient function for doing Tukey comparisons. However, it only works on `aov` objects.
fit.lm  <- lm(nWBV ~ Group, oasis)
fit.aov <- aov(nWBV ~ Group, oasis)
# TukeyHSD(fit.lm) # THIS GIVES AN ERROR
TukeyHSD(fit.aov)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = nWBV ~ Group, data = oasis)
$Group
diff lwr upr p adj
Demented-Converted -0.013502877 -0.038129366 0.01112361 0.3984496
Nondemented-Converted 0.008202274 -0.016177415 0.03258196 0.7057132
Nondemented-Demented 0.021705151 0.007366034 0.03604427 0.0013286
I personally think that’s stupid. `lm` and `aov` are two sides of the same analysis coin. All relevant calculations are the same, just presented in different ways.
The `modelbased` or `emmeans` packages can help resolve these problems.
Anyway…
library(emmeans)
emmeans(fit.lm,
        pairwise ~ Group,
        adjust = "tukey")
$emmeans
Group emmean SE df lower.CL upper.CL
Converted 0.738 0.00942 147 0.719 0.756
Demented 0.724 0.00441 147 0.716 0.733
Nondemented 0.746 0.00415 147 0.738 0.754
Confidence level used: 0.95
$contrasts
contrast estimate SE df t.ratio p.value
Converted - Demented 0.0135 0.01040 147 1.298 0.3984
Converted - Nondemented -0.0082 0.01030 147 -0.797 0.7057
Demented - Nondemented -0.0217 0.00606 147 -3.584 0.0013
P value adjustment: tukey method for comparing a family of 3 estimates
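As a sanity check on how the two approaches connect, the Tukey-adjusted p-value can be recovered from the studentized range distribution, assuming the usual relationship \(Q = \sqrt{2}\,|t|\) (a sketch using the Demented vs. Nondemented comparison above):

```r
# t = -3.584 on 147 df, family of 3 group means
t.ratio <- -3.584
ptukey(sqrt(2) * abs(t.ratio), nmeans = 3, df = 147, lower.tail = FALSE)
# about 0.0013, matching the adjusted p-value reported above
```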
12.4 FDR and the Benjamini-Hochberg procedure
There are four scenarios when it comes to the Reject/Fail to reject testing procedure.
Test Result | Alternative is true | Null hypothesis is true | Total |
---|---|---|---|
Test is declared significant | TP | FP | P |
Test is declared non-significant | FN | TN | N |
Total | \(m - m_0\) | \(m_0\) | \(m\) |
False Discovery Rate: the proportion of rejected null hypotheses (the “discoveries”) that are actually true nulls, i.e., the proportion of discoveries that are Type I errors.
\[FDR = \frac{FP}{FP + TP} = \frac{FP}{P}\]
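For example, if 50 tests are declared significant and 5 of those come from true null hypotheses, then \(FDR = 5/50 = 0.10\).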
12.4.1 Controlling the FDR: Benjamini-Hochberg Procedure
- Specify an acceptable maximum false discovery rate \(q\) (or \(\alpha\))
- Order the p-values from least to greatest: \(p_{(1)}, p_{(2)}, \dots, p_{(m)}\).
- Find \(k\), which is the largest value of \(i\) such that \(p_{(i)} \leq \frac{i}{m}q\), i.e., \(k = \max \{i:p_{(i)} \leq \frac{i}{m}q\}\)
- Reject all null hypotheses corresponding to the p-values \(p_{(1)}, \dots, p_{(k)}\). That is, reject the \(k\) smallest p-values (a short sketch of the procedure follows).
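Here is a minimal sketch of the procedure applied by hand to some hypothetical p-values; in practice `p.adjust(p, method = "BH")` produces adjusted p-values that lead to the same decisions.

```r
# Hypothetical unadjusted p-values
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205)
m <- length(p)
q <- 0.05                                  # acceptable FDR

p.sorted <- sort(p)                        # order the p-values
below <- which(p.sorted <= (1:m) / m * q)  # which p_(i) <= (i/m)*q
k <- if (length(below) > 0) max(below) else 0   # largest such i
p.sorted[seq_len(k)]                       # reject the k smallest p-values

# Same decisions from BH-adjusted p-values:
p.adjust(p, method = "BH") <= q
```

For the OASIS comparisons: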
pairwise.t.test(oasis$nWBV, oasis$Group, p.adjust.method = "BH")
Pairwise comparisons using t tests with pooled SD
data: oasis$nWBV and oasis$Group
Converted Demented
Demented 0.2944 -
Nondemented 0.4270 0.0014
P value adjustment method: BH
12.4.2 Other Procedures for Controlling FDR
There are a couple of methods for controlling the FDR in `pairwise.t.test()`:

- Benjamini-Hochberg: `p.adjust.method = "BH"` (you may also put `"fdr"` instead of `"BH"`)
- Benjamini-Yekutieli: `p.adjust.method = "BY"`