Robust TOST Procedures
Aaron R. Caldwell
2025-03-28
robustTOST.Rmd
Introduction to Robust TOST Methods
The Two One-Sided Tests (TOST) procedure is a statistical approach used to test for equivalence between groups or conditions. Unlike traditional null hypothesis significance testing (NHST) which aims to detect differences, TOST is designed to statistically demonstrate similarity or equivalence within predefined bounds.
While the standard t_TOST
function in TOSTER provides a
parametric approach to equivalence testing, it relies on assumptions of
normality and homogeneity of variance. In real-world data analysis,
these assumptions are often violated, necessitating more robust
alternatives. This vignette introduces several robust TOST methods
available in the TOSTER package that maintain their validity under a
wider range of data conditions.
When to Use Robust TOST Methods
Consider using the robust alternatives to t_TOST
when:
- Your data shows notable departures from normality (e.g., skewed distributions, heavy tails)
- Potential outliers are a concern
- Sample sizes are small or unequal
- You have concerns about violating the assumptions of parametric tests
- You want to confirm parametric test results with complementary nonparametric approaches
The following table provides a quick overview of the robust methods covered in this vignette:
Method | Function | Key Characteristics | Best Used When |
---|---|---|---|
Wilcoxon TOST | wilcox_TOST() |
Rank-based, nonparametric | Data is ordinal or non-normal |
Brunner-Munzel |
brunner_munzel() with simple_htest()
|
Probability-based, robust to heteroscedasticity | Distribution shapes differ between groups |
Bootstrap TOST | boot_t_TOST() |
Resampling-based, requires fewer assumptions | Sample size is small or distribution is unknown |
Log-Transformed TOST | log_TOST() |
Ratio-based, for multiplicative comparisons | Comparing relative differences (e.g., bioequivalence) |
Wilcoxon TOST
The Wilcoxon group of tests (includes Mann-Whitney U-test) provide a
non-parametric test of differences between groups, or within samples,
based on ranks. This provides a test of location shift, which is a fancy
way of saying differences in the center of the distribution (i.e., in
parametric tests the location is mean). With TOST, there are two
separate tests of directional location shift to determine if the
location shift is within (equivalence) or outside (minimal effect). The
exact calculations can be explored via the documentation of the
wilcox.test
function.
Key Features and Usage
TOSTER’s version is the wilcox_TOST
function. Overall,
this function operates extremely similar to the t_TOST
function. However, the standardized mean difference (SMD) is
not calculated. Instead the rank-biserial correlation is
calculated for all types of comparisons (e.g., two sample, one
sample, and paired samples). Also, there is no plotting capability at
this time for the output of this function.
The wilcox_TOST
function is particularly useful
when:
- You’re working with ordinal data
- You’re concerned about outliers influencing your results
- You need a test that makes minimal assumptions about the underlying distributions
As an example, we can use the sleep data to make a non-parametric comparison of equivalence.
data('sleep')
library(TOSTER)
test1 = wilcox_TOST(formula = extra ~ group,
data = sleep,
paired = FALSE,
eqb = .5)
print(test1)
##
## Wilcoxon rank sum test with continuity correction
##
## The equivalence test was non-significant W = 20.000, p = 8.94e-01
## The null hypothesis test was non-significant W = 25.500, p = 6.93e-02
## NHST: don't reject null significance hypothesis that the effect is equal to zero
## TOST: don't reject null equivalence hypothesis
##
## TOST Results
## Test Statistic p.value
## NHST 25.5 0.069
## TOST Lower 34.0 0.894
## TOST Upper 20.0 0.013
##
## Effect Sizes
## Estimate C.I. Conf. Level
## Median of Differences -1.346 [-3.4, -0.1] 0.9
## Rank-Biserial Correlation -0.490 [-0.7493, -0.1005] 0.9
Interpreting the Results
When interpreting the output of wilcox_TOST
, pay
attention to:
- The p-values for both one-sided tests (
p1
andp2
) - The equivalence test p-value (
TOSTp
), which should be < alpha to claim equivalence - The rank-biserial correlation (
rb
) and its confidence interval
A statistically significant equivalence test (p < alpha) indicates that the observed effect is statistically within your specified equivalence bounds. The rank-biserial correlation provides a measure of effect size, with values ranging from -1 to 1:
- Values near 0 indicate negligible association
- Values around ±0.3 indicate a small effect
- Values around ±0.5 indicate a moderate effect
- Values around ±0.7 or greater indicate a large effect
Rank-Biserial Correlation
The standardized effect size reported for the
wilcox_TOST
procedure is the rank-biserial correlation.
This is a fairly intuitive measure of effect size which has the same
interpretation of the common language effect size (Kerby 2014). However, instead of assuming
normality and equal variances, the rank-biserial correlation calculates
the number of favorable (positive) and unfavorable (negative) pairs
based on their respective ranks.
For the two sample case, the correlation is calculated as the proportion of favorable pairs minus the unfavorable pairs.
\[ r_{biserial} = f_{pairs} - u_{pairs} \]
Where: - \(f_{pairs}\) is the proportion of favorable pairs - \(u_{pairs}\) is the proportion of unfavorable pairs
For the one sample or paired samples cases, the correlation is calculated with ties (values equal to zero) not being dropped. This provides a conservative estimate of the rank biserial correlation.
It is calculated in the following steps wherein \(z\) represents the values or difference between paired observations:
- Calculate signed ranks:
\[ r_j = -1 \cdot sign(z_j) \cdot rank(|z_j|) \]
Where: - \(r_j\) is the signed rank for observation \(j\) - \(sign(z_j)\) is the sign of observation \(z_j\) (+1 or -1) - \(rank(|z_j|)\) is the rank of the absolute value of observation \(z_j\)
- Calculate the positive and negative sums:
\[ R_{+} = \sum_{1\le i \le n, \space z_i > 0}r_j \]
\[ R_{-} = \sum_{1\le i \le n, \space z_i < 0}r_j \]
Where: - \(R_{+}\) is the sum of ranks for positive observations - \(R_{-}\) is the sum of ranks for negative observations
- Determine the smaller of the two rank sums:
\[ T = min(R_{+}, \space R_{-}) \]
\[ S = \begin{cases} -4 & R_{+} \ge R_{-} \\ 4 & R_{+} < R_{-} \end{cases} \]
Where: - \(T\) is the smaller of the two rank sums - \(S\) is a sign factor based on which rank sum is smaller
- Calculate rank-biserial correlation:
\[ r_{biserial} = S \cdot | \frac{\frac{T - \frac{(R_{+} + R_{-})}{2}}{n}}{n + 1} | \]
Where: - \(n\) is the number of observations (or pairs) - The final value ranges from -1 to 1
Confidence Intervals
The Fisher approximation is used to calculate the confidence intervals.
For paired samples, or one sample, the standard error is calculated as the following:
\[ SE_r = \sqrt{ \frac {(2 \cdot nd^3 + 3 \cdot nd^2 + nd) / 6} {(nd^2 + nd) / 2} } \]
wherein, nd represents the total number of observations (or pairs).
For independent samples, the standard error is calculated as the following:
\[ SE_r = \sqrt{\frac {(n1 + n2 + 1)} { (3 \cdot n1 \cdot n2)}} \]
Where:
- \(n1\) and \(n2\) are the sample sizes of the two groups
The confidence intervals can then be calculated by transforming the estimate.
\[ r_z = atanh(r_{biserial}) \]
Then the confidence interval can be calculated and back transformed.
\[ r_{CI} = tanh(r_z \pm Z_{(1 - \alpha / 2)} \cdot SE_r) \]
Where:
- \(Z_{(1 - \alpha / 2)}\) is the critical value from the standard normal distribution
- \(\alpha\) is the significance level (typically 0.05)
Conversion to other effect sizes
Two other effect sizes can be calculated for non-parametric tests. First, there is the concordance probability, which is also known as the c-statistic, c-index, or probability of superiority1. The c-statistic is converted from the correlation using the following formula:
\[ c = \frac{(r_{biserial} + 1)}{2} \]
The c-statistic can be interpreted as the probability that a randomly selected observation from one group will be greater than a randomly selected observation from another group. A value of 0.5 indicates no difference between groups, while values approaching 1 indicate perfect separation between groups.
The Wilcoxon-Mann-Whitney odds (O’Brien and Castelloe 2006), also known as the “Generalized Odds Ratio” (Agresti 1980), is calculated by converting the c-statistic using the following formula:
\[ WMW_{odds} = e^{logit(c)} \]
Where \(logit(c) = \ln\frac{c}{1-c}\)
The WMW odds can be interpreted similarly to a traditional odds ratio, representing the odds that an observation from one group is greater than an observation from another group.
Either effect size is available by simply modifying the
ses
argument for the wilcox_TOST
function.
# Rank biserial
wilcox_TOST(formula = extra ~ group,
data = sleep,
paired = FALSE,
ses = "r",
eqb = .5)
##
## Wilcoxon rank sum test with continuity correction
##
## The equivalence test was non-significant W = 20.000, p = 8.94e-01
## The null hypothesis test was non-significant W = 25.500, p = 6.93e-02
## NHST: don't reject null significance hypothesis that the effect is equal to zero
## TOST: don't reject null equivalence hypothesis
##
## TOST Results
## Test Statistic p.value
## NHST 25.5 0.069
## TOST Lower 34.0 0.894
## TOST Upper 20.0 0.013
##
## Effect Sizes
## Estimate C.I. Conf. Level
## Median of Differences -1.346 [-3.4, -0.1] 0.9
## Rank-Biserial Correlation -0.490 [-0.7493, -0.1005] 0.9
# Odds
wilcox_TOST(formula = extra ~ group,
data = sleep,
paired = FALSE,
ses = "o",
eqb = .5)
##
## Wilcoxon rank sum test with continuity correction
##
## The equivalence test was non-significant W = 20.000, p = 8.94e-01
## The null hypothesis test was non-significant W = 25.500, p = 6.93e-02
## NHST: don't reject null significance hypothesis that the effect is equal to zero
## TOST: don't reject null equivalence hypothesis
##
## TOST Results
## Test Statistic p.value
## NHST 25.5 0.069
## TOST Lower 34.0 0.894
## TOST Upper 20.0 0.013
##
## Effect Sizes
## Estimate C.I. Conf. Level
## Median of Differences -1.3464 [-3.4, -0.1] 0.9
## WMW Odds 0.3423 [0.1433, 0.8173] 0.9
# Concordance
wilcox_TOST(formula = extra ~ group,
data = sleep,
paired = FALSE,
ses = "c",
eqb = .5)
##
## Wilcoxon rank sum test with continuity correction
##
## The equivalence test was non-significant W = 20.000, p = 8.94e-01
## The null hypothesis test was non-significant W = 25.500, p = 6.93e-02
## NHST: don't reject null significance hypothesis that the effect is equal to zero
## TOST: don't reject null equivalence hypothesis
##
## TOST Results
## Test Statistic p.value
## NHST 25.5 0.069
## TOST Lower 34.0 0.894
## TOST Upper 20.0 0.013
##
## Effect Sizes
## Estimate C.I. Conf. Level
## Median of Differences -1.346 [-3.4, -0.1] 0.9
## Concordance 0.255 [0.1254, 0.4497] 0.9
Guidelines for Selecting Effect Size Measures
-
Rank-biserial correlation (
"r"
) is useful when you want a correlation-like measure that’s easily interpretable and comparable to other correlation coefficients. -
Concordance probability (
"c"
) is beneficial when you want to express the effect in terms of probability, making it accessible to non-statisticians. -
WMW odds (
"o"
) is helpful when you want to express the effect in terms familiar to those who work with odds ratios in logistic regression or epidemiology.
Brunner-Munzel
As Karch (2021) explained, there are some reasons to dislike the WMW family of tests as the non-parametric alternative to the t-test. Regardless of the underlying statistical arguments2, it can be argued that the interpretation of the WMW tests, especially when involving equivalence testing, is a tad difficult. Some may want a non-parametric alternative to the WMW test, and the Brunner-Munzel test(s) may be a useful option.
Overview and Advantages
The Brunner-Munzel test (Brunner and Munzel 2000; Neubert and Brunner 2007) offers several advantages over the Wilcoxon-Mann-Whitney tests:
- It does not assume equal distributions (shapes) between groups
- It is robust to heteroscedasticity (unequal variances)
- It provides a more interpretable effect size measure (“relative effect”)
- It maintains good statistical properties even with unequal sample sizes
The Brunner-Munzel test is based on calculating the “stochastic superiority” (Karch 2021, i.e., probability of superiority), which is usually referred to as the relative effect, based on the ranks of the two groups being compared (X and Y). A Brunner-Munzel type test is then a directional test of an effect, and answers a question akin to “what is the probability that a randomly sampled value of X will be greater than Y?”
\[ \hat p = P(X>Y) + 0.5 \cdot P(X=Y) \]
Where:
- \(P(X>Y)\) is the probability that a random value from X exceeds a random value from Y
- \(P(X=Y)\) is the probability that random values from X and Y are equal
- The 0.5 factor means ties contribute half their weight to the probability
The relative effect \(\hat p\) has an intuitive interpretation:
- \(\hat p = 0.5\) indicates no difference between groups
- \(\hat p > 0.5\) indicates values in X tend to be greater than values in Y
- \(\hat p < 0.5\) indicates values in X tend to be smaller than values in Y
Basics of the Calculative Approach
In this section, I will quickly detail the calculative approach that
underlies the Brunner-Munzel test in TOSTER
.
A studentized test statistic can be calculated as:
\[ t = \sqrt{N} \cdot \frac{\hat p -p_{null}}{s} \]
Where:
- \(N\) is the total sample size
- \(\hat p\) is the estimated relative effect
- \(p_{null}\) is the null hypothesis value (typically 0.5)
- \(s\) is the rank-based Brunner-Munzel standard error
The default null hypothesis \(p_{null}\) is typically 0.5 (50% probability of superiority is the default null), and \(s\) refers the rank-based Brunner-Munzel standard error. The null can be modified therefore allowing for equivalence testing directly based on the relative effect. Additionally, for paired samples the probability of superiority is based on a hypothesis of exchangability and is not based on the differences scores3.
For more details on the calculative approach, I suggest reading Karch (2021). At small sample sizes, it is
recommended that the permutation version of the test
(perm = TRUE
) be used rather than the basic test statistic
approach.
Example
The interface for the function is very similar to the
t.test
function. The brunner_munzel
function
itself does not allow for equivalence tests, but you can set an
alternative hypothesis for “two.sided”, “less”, or “greater”.
# studentized test
brunner_munzel(formula = extra ~ group,
data = sleep,
paired = FALSE)
## Sample size in at least one group is small. Permutation test (perm = TRUE) is highly recommended.
##
## two-sample Brunner-Munzel test
##
## data: extra by group
## t = -2.1447, df = 16.898, p-value = 0.04682
## alternative hypothesis: true relative effect is not equal to 0.5
## 95 percent confidence interval:
## 0.01387048 0.49612952
## sample estimates:
## p(X>Y) + .5*P(X=Y)
## 0.255
# permutation
brunner_munzel(formula = extra ~ group,
data = sleep,
paired = FALSE,
perm = TRUE)
## NOTE: Confidence intervals derived from permutation tests may differ from conclusions drawn from p-values. When discrepancies occur, consider additional diagnostics or alternative inference methods.
##
## two-sample Brunner-Munzel permutation test
##
## data: extra by group
## t-observed = -2.1447, df = 16.898, p-value = 0.0566
## alternative hypothesis: true relative effect is not equal to 0.5
## 95 percent confidence interval:
## 0.001755783 0.507227576
## sample estimates:
## p(X>Y) + .5*P(X=Y)
## 0.255
The simple_htest
function allows TOST tests using a
Brunner-Munzel test by setting the alternative to “equivalence” or
“minimal.effect”. The equivalence bounds, based on the relative effect,
can be set with the mu
argument.
# permutation based Brunner-Munzel test of equivalence
simple_htest(formula = extra ~ group,
test = "brunner",
data = sleep,
paired = FALSE,
alternative = "equ",
mu = .7,
perm = TRUE)
## NOTE: Confidence intervals derived from permutation tests may differ from conclusions drawn from p-values. When discrepancies occur, consider additional diagnostics or alternative inference methods.
## NOTE: Confidence intervals derived from permutation tests may differ from conclusions drawn from p-values. When discrepancies occur, consider additional diagnostics or alternative inference methods.
## NOTE: Confidence intervals derived from permutation tests may differ from conclusions drawn from p-values. When discrepancies occur, consider additional diagnostics or alternative inference methods.
##
## two-sample Brunner-Munzel permutation test
##
## data: extra by group
## t-observed = -3.8954, df = 16.898, p-value = 0.9977
## alternative hypothesis: equivalence
## null values:
## relative effect relative effect
## 0.3 0.7
## 90 percent confidence interval:
## 0.05427006 0.45029716
## sample estimates:
## p(X>Y) + .5*P(X=Y)
## 0.255
Interpreting Brunner-Munzel Results
When interpreting the Brunner-Munzel test results:
- The relative effect (p-hat) is the primary measure of interest
- For equivalence testing, you’re testing whether this relative effect falls within your specified bounds
- A significant equivalence test suggests that the probability of superiority is statistically confined to your specified range
When to Use Permutation
The permutation approach (perm = TRUE
) is recommended
when:
- Sample sizes are small (generally n < 30 per group)
- Data distributions are highly skewed or contain outliers
- Precision is more important than computational speed
Note that the permutation approach can be computationally intensive for large datasets, potentially increasing processing time significantly. Additionlly, with a permutation test you may observe situations where the confidence interval and the p-values would yield different conclusions.
Bootstrap TOST
The bootstrap is a simulation based technique, derived from
re-sampling with replacement, designed for statistical estimation and
inference. Bootstrapping techniques are very useful because they are
considered somewhat robust to the violations of assumptions for a simple
t-test. Therefore we added a bootstrap option, boot_t_TOST
to the package to provide another robust alternative to the
t_TOST
function.
Advantages of Bootstrapping
Bootstrap methods offer several advantages for equivalence testing:
- They make minimal assumptions about the underlying data distribution
- They are robust to deviations from normality
- They provide realistic confidence intervals even with small samples
- They can handle complex data structures and dependencies
- They often provide more accurate results when parametric assumptions are violated
In this function, we provide the percentile bootstrap solution
outlined by Efron and Tibshirani (1993)
(see chapter 16, page 220). The bootstrapped p-values are derived from
the “studentized” version of a test of mean differences outlined by
Efron and Tibshirani (1993). Overall, the
results should be similar to the results of t_TOST
.
Two Sample Algorithm
-
Form B bootstrap data sets from x* and y* wherein x* is sampled with replacement from \(\tilde x_1,\tilde x_2, ... \tilde x_n\) and y* is sampled with replacement from \(\tilde y_1,\tilde y_2, ... \tilde y_n\)
Where:
- B is the number of bootstrap replications (set using the
R
parameter) - \(\tilde x_i\) and \(\tilde y_i\) represent the original observations in each group
- B is the number of bootstrap replications (set using the
t is then evaluated on each sample, but the mean of each sample (y or x) and the overall average (z) are subtracted from each
\[ t(z^{*b}) = \frac {(\bar x^*-\bar x - \bar z) - (\bar y^*-\bar y - \bar z)}{\sqrt {sd_y^*/n_y + sd_x^*/n_x}} \]
Where:
- \(\bar x^*\) and \(\bar y^*\) are the means of the bootstrap samples
- \(\bar x\) and \(\bar y\) are the means of the original samples
- \(\bar z\) is the overall mean
- \(sd_x^*\) and \(sd_y^*\) are the standard deviations of the bootstrap samples
- \(n_x\) and \(n_y\) are the sample sizes
- An approximate p-value can then be calculated as the number of bootstrapped results greater than the observed t-statistic from the sample.
\[ p_{boot} = \frac {\#t(z^{*b}) \ge t_{sample}}{B} \]
Where: - \(\#t(z^{*b}) \ge t_{sample}\) is the count of bootstrap t-statistics that exceed the observed t-statistic - B is the total number of bootstrap replications
The same process is completed for the one sample case but with the one sample solution for the equation outlined by \(t(z^{*b})\). The paired sample case in this bootstrap procedure is equivalent to the one sample solution because the test is based on the difference scores.
Choosing the Number of Bootstrap Replications
When using bootstrap methods, the choice of replications (the
R
parameter) is important:
- R = 999 or 1999: Recommended for standard analyses
- R = 4999 or 9999: Recommended for publication-quality results or when precise p-values are needed
- R < 500: Acceptable only for exploratory analyses or when computational resources are limited
Larger values of R provide more stable results but increase computation time. For most purposes, 999 or 1999 replications strike a good balance between precision and computational efficiency.
Example
We can use the sleep data to see the bootstrapped results. Notice that the plots show how the re-sampling via bootstrapping indicates the instability of Hedges’s dz.
data('sleep')
test1 = boot_t_TOST(formula = extra ~ group,
data = sleep,
paired = TRUE,
eqb = .5,
R = 499)
print(test1)
##
## Bootstrapped Paired t-test
##
## The equivalence test was non-significant, t(9) = -2.777, p = 1e+00
## The null hypothesis test was significant, t(9) = -4.062, p = 0e+00
## NHST: reject null significance hypothesis that the effect is equal to zero
## TOST: don't reject null equivalence hypothesis
##
## TOST Results
## t df p.value
## t-test -4.062 9 < 0.001
## TOST Lower -2.777 9 1
## TOST Upper -5.348 9 < 0.001
##
## Effect Sizes
## Estimate SE C.I. Conf. Level
## Raw -1.580 0.3476 [-2.7183, -1.094] 0.9
## Hedges's g(z) -1.174 0.5235 [-1.4013, 0.6692] 0.9
## Note: studentized bootstrap ci method utilized.
plot(test1)
Interpreting Bootstrap TOST Results
When interpreting the results of boot_t_TOST
:
- The bootstrap p-values (
p1
andp2
) represent the empirical probability of observing the test statistic or more extreme values under repeated sampling - The confidence intervals are derived directly from the empirical distribution of bootstrap samples
- The distribution plots provide visual insight into the variability of the effect size estimate
For equivalence testing, examine whether both bootstrap p-values are significant (< alpha) and whether the confidence interval for the effect size falls entirely within the equivalence bounds.
Ratio of Difference (Log Transformed)
In many bioequivalence studies, the differences between drugs are compared on the log scale (He et al. 2022). The log scale allows researchers to compare the ratio of two means.
\[ log ( \frac{y}{x} ) = log(y) - log(x) \]
Where: - y and x are the means of the two groups being compared - The transformation converts multiplicative relationships into additive ones
Why Use The Natural Log Transformation?
The United States Food and Drug Administration (FDA)4 has stated a rationale for using the log transformed values:
Using logarithmic transformation, the general linear statistical model employed in the analysis of BE data allows inferences about the difference between the two means on the log scale, which can then be retransformed into inferences about the ratio of the two averages (means or medians) on the original scale. Logarithmic transformation thus achieves a general comparison based on the ratio rather than the differences.
Log transformation offers several advantages:
- It facilitates the analysis of relative rather than absolute differences
- It often makes right-skewed distributions more symmetric
- It stabilizes variance when variability increases with the mean
- It provides an easy-to-interpret interpretable effect size (ratio of means)
In addition, the FDA considers two drugs as bioequivalent when the ratio between x and y is less than 1.25 and greater than 0.8 (1/1.25), which is the default equivalence bound for the log functions.
Applications Beyond Bioequivalence
While log transformation is standard in bioequivalence studies, it’s useful in many other contexts:
- Economics: Comparing percentage changes in economic indicators
- Environmental science: Analyzing concentration ratios of pollutants
- Biology: Examining growth rates or concentration ratios
- Medicine: Comparing relative efficacy of treatments
- Psychology: Analyzing response time ratios
Consider using log transformation whenever your research question is about relative rather than absolute differences, particularly when the data follow a multiplicative rather than additive pattern.
log_TOST
For example, we could compare whether the cars of different
transmissions are “equivalent” with regards to gas mileage. We can use
the default equivalence bounds (eqb = 1.25
).
log_TOST(
mpg ~ am,
data = mtcars
)
##
## Log-transformed Welch Two Sample t-test
##
## The equivalence test was non-significant, t(23.96) = -1.363, p = 9.07e-01
## The null hypothesis test was significant, t(23.96) = -3.826, p = 8.19e-04
## NHST: reject null significance hypothesis that the effect is equal to one
## TOST: don't reject null equivalence hypothesis
##
## TOST Results
## t df p.value
## t-test -3.826 23.96 < 0.001
## TOST Lower -1.363 23.96 0.907
## TOST Upper -6.288 23.96 < 0.001
##
## Effect Sizes
## Estimate SE C.I. Conf. Level
## log(Means Ratio) -0.3466 0.09061 [-0.5017, -0.1916] 0.9
## Means Ratio 0.7071 NA [0.6055, 0.8256] 0.9
Note, that the function produces t-tests similar to the
t_TOST
function, but provides two effect sizes. The means
ratio on the log scale (the scale of the test statistics), and the means
ratio. The means ratio is missing standard error because the confidence
intervals and estimate are simply the log scale results
exponentiated.
Interpreting the Means Ratio
When interpreting the means ratio:
- A ratio of 1.0 indicates perfect equivalence (no difference)
- Ratios > 1.0 indicate that the first group has higher values than the second
- Ratios < 1.0 indicate that the first group has lower values than the second
For equivalence testing with the default bounds (0.8, 1.25): - Equivalence is established when the 90% confidence interval for the ratio falls entirely within (0.8, 1.25) - This range corresponds to a difference of ±20% on a relative scale
Bootstrap + Log
However, it has been noted in the statistics literature that t-tests
on the logarithmic scale can be biased, and it is recommended that
bootstrapped tests be utilized instead. Therefore, the
boot_log_TOST
function can be utilized to perform a more
precise test.
boot_log_TOST(
mpg ~ am,
data = mtcars,
R = 499
)
##
## Bootstrapped Log Welch Two Sample t-test
##
## The equivalence test was non-significant, t(23.96) = -1.363, p = 9.6e-01
## The null hypothesis test was significant, t(23.96) = -3.826, p = 0e+00
## NHST: reject null significance hypothesis that the effect is equal to 1
## TOST: don't reject null equivalence hypothesis
##
## TOST Results
## t df p.value
## t-test -3.826 23.96 < 0.001
## TOST Lower -1.363 23.96 0.96
## TOST Upper -6.288 23.96 < 0.001
##
## Effect Sizes
## Estimate SE C.I. Conf. Level
## log(Means Ratio) -0.3466 0.08997 [-0.5423, -0.1416] 0.9
## Means Ratio 0.7071 0.06427 [0.5814, 0.868] 0.9
## Note: studentized bootstrap ci method utilized.
The bootstrapped version is particularly recommended when:
- Sample sizes are small (n < 30 per group)
- Data show notable deviations from log-normality
- You want to ensure robust confidence intervals
Just Estimate an Effect Size
It was requested that a function be provided that only calculates a
robust effect size. Therefore, I created the ses_calc
and
boot_ses_calc
functions as robust effect size calculation5. The
interface is almost the same as wilcox_TOST
but you don’t
set an equivalence bound.
ses_calc(formula = extra ~ group,
data = sleep,
paired = TRUE,
ses = "r")
## estimate lower.ci upper.ci conf.level
## Rank-Biserial Correlation 0.9818182 0.928369 0.9954785 0.95
# Setting bootstrap replications low to
## reduce compiling time of vignette
boot_ses_calc(formula = extra ~ group,
data = sleep,
paired = TRUE,
R = 199,
boot_ci = "perc", # recommend percentile bootstrap for paired SES
ses = "r")
## Bootstrapped results contain extreme results (i.e., no overlap), caution advised interpreting the with confidence intervals.
## estimate bias SE lower.ci upper.ci conf.level boot_ci
## 1 0.9818182 0 0.03788636 0.8909091 1 0.95 perc
Choosing Between Different Bootstrap CI Methods
The boot_ses_calc
function offers several bootstrap
confidence interval methods through the boot_ci
parameter:
- “perc” (Percentile): Simple and intuitive, works well for symmetric distributions
- “basic”: Similar to percentile but adjusts for bias, more conservative
- “norm” (Normal approximation): Assumes normality of the bootstrap distribution
Summary Comparison of Robust TOST Methods
Method | Key Advantages | Limitations | Best Use Cases |
---|---|---|---|
Wilcoxon TOST | Simple, widely accepted, minimal assumptions | Less power than parametric tests with normal data | Ordinal data, non-normal distributions, presence of outliers |
Brunner-Munzel | Robust to unequal distributions, interpretable effect | Computationally intensive with permutation | Different distribution shapes between groups, heteroscedasticity |
Bootstrap TOST | Flexible, minimal assumptions, works with small samples | Computationally intensive, results vary slightly between runs | Small samples, complex data structures, when precise CIs are important |
Log-Transformed | Focuses on relative differences, often stabilizes variance | Requires positive data, can be biased with small samples | Bioequivalence studies, comparing ratios rather than absolute differences |
Conclusion
The robust TOST procedures provided in the TOSTER package offer reliable alternatives to standard parametric equivalence testing when data violate typical assumptions. By selecting the appropriate robust method for your specific data characteristics and research question, you can ensure more valid statistical inferences about equivalence or minimal effects.
Remember that no single method is universally superior - the choice depends on your data structure, sample size, and specific research question. When in doubt, running multiple approaches and comparing results can provide valuable insights into the robustness of your conclusions.