# Here are the libraries I used
library(tidyverse) # standard
library(knitr)
library(readr)
library(ggpubr) # allows for stat_cor in ggplots
library(ggfortify) # Needed for autoplot to work on lm()
library(gridExtra) # allows me to organize the graphs in a grid
library(car) # need for some regression stuff like vif
library(GGally)
# library(mosaic)
12 Multiple Comparisons
Here are some code chunks that set up this document.
# This changes the default theme of ggplot
old.theme <- theme_get()
theme_set(theme_bw())
Suppose we did a single test between two group means:
We would have: \[ t = \frac{\overline{y_i}-\overline{y_j}}{SE_{\overline y_i - \overline y_j}} \] Then we get a p-value based on that t-distribution with some degrees of freedom defined by the method we were using for the comparison, e.g., equal or unequal variance two-sample t-test, ANOVA, and others…
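For a concrete (hypothetical) sketch, suppose an equal-variance two-sample t-test with 15 observations per group gave \(t = 2.1\); the two-sided p-value comes from the t-distribution with \(15 + 15 - 2 = 28\) degrees of freedom:

```r
# Hypothetical: t = 2.1 on 28 df from an equal-variance two-sample t-test
2 * pt(-abs(2.1), df = 28)   # two-sided p-value, roughly 0.045
```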
Null Hypothesis Significance Testing (NHST) Approach
In NHST we “Reject \(H_0\) if \(p \le \alpha\)”.
\(\alpha\) is the Type I error rate: “The probability of rejecting \(H_0\) when \(H_0\) is true.”
- This would be a “false positive” or a “false discovery”
P-value as a spectrum.
We would talk about the strength of evidence for declaring a difference \(\mu_i - \mu_j \ne 0\).
- This approach would not have a Type I error per se.
- The idea of a Type I error would fall on a spectrum as well.
12.1 Multiple testing problem
In the ANOVA setting, we either reject \(H_0\) (all means are equal) or declare that the strength of evidence is sufficient to be confident of a difference among the means.
The next step is to look and see which specific groups/means differ.
This involves pairwise comparisons.
\[ t = \frac{\overline{y_i}-\overline{y_j}}{\sqrt{MSE\left(\frac{1}{n_i}+\frac{1}{n_j}\right)}} \]
When 2 or more comparisons are done we have more than one chance of making a Type I error; we have a “family” of tests.
This is where we get the concept of the Family-Wise Error Rate (FWER).
- This is defined to be the probability of making at least one Type I error.
- If all the tests are “independent” then \(FWER = 1 - (1 - \alpha_c)^k\) where \(\alpha_c\) is the type I error rate for each individual comparison and \(k\) is the total number of comparisons in the family of tests.
As FWER increases, the reliability of the individual comparisons decreases.
The number of possible pairwise comparisons among \(t\) groups is \(k = \frac{t(t - 1)}{2}\)
If we did not control the FWER and used \(\alpha_c = 0.01\) for each comparison:
Groups | Comparisons | FWER |
---|---|---|
2 | 1 | 0.010 |
3 | 3 | 0.030 |
4 | 6 | 0.059 |
5 | 10 | 0.096 |
6 | 15 | 0.140 |
7 | 21 | 0.190 |
8 | 28 | 0.245 |
9 | 36 | 0.304 |
10 | 45 | 0.364 |
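This table can be reproduced directly from the two formulas above:

```r
# FWER for all pairwise comparisons among 2 to 10 groups, alpha_c = 0.01
alpha.c <- 0.01
groups  <- 2:10
k       <- groups * (groups - 1) / 2   # number of pairwise comparisons
fwer    <- 1 - (1 - alpha.c)^k         # FWER assuming independent tests
data.frame(Groups = groups, Comparisons = k, FWER = round(fwer, 3))
```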
There are a few common methods for controlling the FWER.
Notation.
- \(\alpha_F\) represents the desired FWER for the family of tests.
- \(\alpha_C\) is the type I error rate for each individual test/comparison.
12.2 The Bonferroni method
Consider \(k\) hypothesis tests with desired FWER \(\alpha_F = 0.01\): \(H_{0i}\) vs. \(H_{1i}\), \(i = 1, \ldots, k\).
Let \(p_1, \ldots, p_k\) be the \(p\)-values for these tests.
Suppose we reject \(H_{0i}\) when \(p_i \le \alpha_C\).
\[ FWER \le k\cdot\alpha_C \]
To guarantee \(FWER \le \alpha_F\), set \(k\cdot\alpha_C = \alpha_F\), which gives
\[ \alpha_C = \frac{\alpha_F}{k} \]
This is based on what is known as Boole’s theorem/inequality.
Bonferroni was the person who connected Boole’s inequality to hypothesis testing.
For the more probability-theory-inclined, you can see a proof at https://en.wikipedia.org/wiki/Bonferroni_correction
The Bonferroni rule for multiple comparisons is:
- Reject null hypothesis \(H_{0i}\) when \(p_i < \alpha_F/k\).
- Alternatively, you could get Bonferroni corrected p-values: \(p_i^* = p_i \cdot k\).
- Reject \(H_{0i}\) when \(p_i^* \le \alpha_F\).
- Use \(p_i^*\) if you are using the strength of evidence approach (a short sketch follows this list).
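As a quick sketch with some hypothetical p-values, `p.adjust()` carries out this correction (multiply by \(k\) and cap at 1):

```r
# Hypothetical unadjusted p-values from k = 3 comparisons
p <- c(0.004, 0.020, 0.450)
p.adjust(p, method = "bonferroni")   # equivalent to pmin(p * length(p), 1)
```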
12.2.1 Example 1, OASIS data
oasis <- read_csv(here::here('datasets', 'oasis.csv'))
In our OASIS data, we have three groups:
- Group: whether subject is demented, nondemented, or converted.
We are trying to see if nWBV differs among the groups.
- nWBV: Normalized Whole Brain Volume; brain volume is normalized by intracranial volume to put subjects of different sizes and gender on the same scale. To the best of my knowledge…
If we wanted to do all pairwise comparisons between the groups, we can do that via the `pairwise.t.test()` function.

- The form is `pairwise.t.test(x, g, p.adjust.method)`
  - `x` is the response vector.
  - `g` is the group vector.
  - `p.adjust.method` is the method used to adjust p-values.
    - There are many, but the two we will discuss are “none” and “bonferroni”.
- Say we had loaded a data set called `data`, then we would identify a vector within the dataframe using `data$Variable`.
  - `pairwise.t.test(data$responseVar, data$groupVar, "none")` would do all pairwise t-tests with no correction.
  - `pairwise.t.test(data$responseVar, data$groupVar, "bonferroni")` would do all pairwise t-tests with Bonferroni-corrected p-values.
There are 3 groups, so there are \(k = \frac{3(3-1)}{2} = 3\) comparisons.
Here are the unadjusted and adjusted p-values.
pairwise.t.test(oasis$nWBV, oasis$Group,
p.adjust.method = "none")
Pairwise comparisons using t tests with pooled SD
data: oasis$nWBV and oasis$Group
Converted Demented
Demented 0.19624 -
Nondemented 0.42698 0.00046
P value adjustment method: none
pairwise.t.test(oasis$nWBV, oasis$Group,
p.adjust.method = "bonferroni")
Pairwise comparisons using t tests with pooled SD
data: oasis$nWBV and oasis$Group
Converted Demented
Demented 0.5887 -
Nondemented 1.0000 0.0014
P value adjustment method: bonferroni
Notice the adjusted p-value is capped at 1.
12.2.2 Example, Genomics
In genomics, hundreds if not thousands of genes are compared in terms of “activation levels” (or something like that…).
- Suppose we were comparing 100 genes; then there would be 4950 comparisons.
- We would multiply each unadjusted p-value by 4950, which means an individual comparison would have to have an unadjusted p-value less than about 0.00001 to be rejected with \(\alpha_F = 0.05\) using the Bonferroni method (a quick check of this arithmetic follows the list).
- We would probably miss many results where there is a difference in mean activation levels.
- The Bonferroni method is overly conservative, i.e., it fails to reject the null hypothesis too often (it has low power).
- In general, the Bonferroni method should not be applied unless you absolutely have to (and I cannot think of a situation where you would, other than your “boss” telling you to).
- However, you are still expected to know it, so here it is.
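As a quick check of the arithmetic above:

```r
k <- choose(100, 2)   # number of pairwise comparisons among 100 genes: 4950
0.05 / k              # Bonferroni per-comparison cutoff, about 0.00001
```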
12.3 Tukey’s HSD (Tukey’s Honestly Significant Difference)
\[ MSE = \frac{SSE}{N-t} \] where \(t\) is the number of groups and \(N\) is the total number of observations.
\[ Q = \frac{|\max_i(\overline{y}_i)-\min_j(\overline{y}_j)|}{\sqrt{MSE/n}} \] where \(n\) is the (common) sample size per group.
Under the null hypothesis of ANOVA (all means equal), \(Q\) follows the “Studentized Range Distribution” \(q(t, N-t)\).
The Tukey method does not have to be used in ANOVA, but that is the main setting where it applies.
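For instance, the 5% critical value of the studentized range for \(t = 3\) groups and \(N - t = 147\) error degrees of freedom (the setting of the OASIS example below) can be found with `qtukey()`:

```r
# Critical value of Q at alpha = 0.05 for 3 group means and 147 error df
qtukey(0.95, nmeans = 3, df = 147)
```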
12.3.1 Tukey in R
Now a pain with R is that all of these different types of analysis methods, e.g., `lm`, `aov`, etc., may have different subsidiary functions for further analysis.
In base R, `TukeyHSD()` is a very convenient function for doing Tukey comparisons. However, it only works on `aov` objects.
fit.lm  <- lm(nWBV ~ Group, oasis)
fit.aov <- aov(nWBV ~ Group, oasis)
# TukeyHSD(fit.lm) # THIS GIVES AN ERROR
TukeyHSD(fit.aov)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = nWBV ~ Group, data = oasis)
$Group
diff lwr upr p adj
Demented-Converted -0.013502877 -0.038129366 0.01112361 0.3984496
Nondemented-Converted 0.008202274 -0.016177415 0.03258196 0.7057132
Nondemented-Demented 0.021705151 0.007366034 0.03604427 0.0013286
I personally think that’s stupid. `lm` and `aov` are two sides of the same analysis coin. All relevant calculations are the same, just presented in different ways.
The `modelbased` or `emmeans` packages can help resolve these problems.
Anyway…
library(emmeans)
emmeans(fit.lm,
        pairwise ~ Group,
        adjust = "tukey")
$emmeans
Group emmean SE df lower.CL upper.CL
Converted 0.738 0.00942 147 0.719 0.756
Demented 0.724 0.00441 147 0.716 0.733
Nondemented 0.746 0.00415 147 0.738 0.754
Confidence level used: 0.95
$contrasts
contrast estimate SE df t.ratio p.value
Converted - Demented 0.0135 0.01040 147 1.298 0.3984
Converted - Nondemented -0.0082 0.01030 147 -0.797 0.7057
Demented - Nondemented -0.0217 0.00606 147 -3.584 0.0013
P value adjustment: tukey method for comparing a family of 3 estimates
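As a sanity check on how the two approaches connect, the Tukey-adjusted p-value can be recovered from the studentized range distribution, assuming the usual relationship \(Q = \sqrt{2}\,|t|\) (a sketch using the Demented vs. Nondemented comparison above):

```r
# t = -3.584 on 147 df, family of 3 group means
t.ratio <- -3.584
ptukey(sqrt(2) * abs(t.ratio), nmeans = 3, df = 147, lower.tail = FALSE)
# about 0.0013, matching the adjusted p-value reported above
```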
12.4 FDR and the Benjamini-Hochberg procedure
There are four scenarios when it comes to the Reject/Fail to reject testing procedure.
Test Result | Alternative is true | Null hypothesis is true | Total |
---|---|---|---|
Test is declared significant | TP | FP | P |
Test is declared non-significant | FN | TN | N |
Total | \(m - m_0\) | \(m_0\) | \(m\) |
False Discovery Rate: the proportion of rejected null hypotheses (the “discoveries”) that are actually true nulls, i.e., the proportion of discoveries that are Type I errors.
\[FDR = \frac{FP}{FP + TP} = \frac{FP}{P}\]
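For example, if 50 tests are declared significant and 5 of those come from true null hypotheses, then \(FDR = 5/50 = 0.10\).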
12.4.1 Controlling the FDR: Benjamini-Hochberg Procedure
- Specify an acceptable maximum false discovery rate \(q\) (or \(\alpha\))
- Order the p-values from least to greatest: \(p_{(1)}, p_{(2)}, \dots, p_{(m)}\).
- Find \(k\), which is the largest value of \(i\) such that \(p_{(i)} \leq \frac{i}{m}q\), i.e., \(k = \max \{i:p_{(i)} \leq \frac{i}{m}q\}\)
- Reject all null hypotheses corresponding to the p-values \(p_{(1)}, \dots, p_{(k)}\). That is, reject the \(k\) smallest p-values (a short sketch of the procedure follows).
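Here is a minimal sketch of the procedure applied by hand to some hypothetical p-values; in practice `p.adjust(p, method = "BH")` produces adjusted p-values that lead to the same decisions.

```r
# Hypothetical unadjusted p-values
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205)
m <- length(p)
q <- 0.05                                  # acceptable FDR

p.sorted <- sort(p)                        # order the p-values
below <- which(p.sorted <= (1:m) / m * q)  # which p_(i) <= (i/m)*q
k <- if (length(below) > 0) max(below) else 0   # largest such i
p.sorted[seq_len(k)]                       # reject the k smallest p-values

# Same decisions from BH-adjusted p-values:
p.adjust(p, method = "BH") <= q
```

For the OASIS comparisons: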
pairwise.t.test(oasis$nWBV, oasis$Group, p.adjust.method = "BH")
Pairwise comparisons using t tests with pooled SD
data: oasis$nWBV and oasis$Group
Converted Demented
Demented 0.2944 -
Nondemented 0.4270 0.0014
P value adjustment method: BH
12.4.2 Other Procedures for Controlling FDR
There are a couple of methods for controlling the FDR in `pairwise.t.test()`:

- Benjamini-Hochberg: `p.adjust.method = "BH"` (you may also put `"fdr"` instead of `"BH"`)
- Benjamini-Yekutieli: `p.adjust.method = "BY"`