# Two-Sample Permutation Test of Difference in Means

By Charles Holbert

December 22, 2022

# Introduction

There are a large number of statistical tests that can be used to determine whether the difference between two groups of data is statistically significant. If the data in each group are normally distributed, a two-sample t-test can be used to compare the means of the two groups. If this assumption is severely violated, the nonparametric Wilcoxon Rank Sum (WRS) test, also known as the two-sample Mann-Whitney U test, may be considered instead. This is a nonparametric version of a two-sample t-test and calculates whether the medians of the two distributions are different or similar, assuming observations in each group are identically and independently distributed apart from location. When observations represent very different distributions, the method is a test of dominance between distributions. However, when the interest is in the differences of means, an alternative is a two-sample permutation test.

This post demonstrates how to perform a two-sample permutation test using various packages within the R environment for statistical computing and visualization.

# Permutation Tests

Permutation tests are used to test for differences between means for skewed, moderately sized datasets. They compute the p-value using computer-intensive methods rather than assuming data follow normal distributions. Permutation tests are also called resampling methods (Good 2001), randomization tests (Manly 2007), and observation randomization tests (Brown and Rothery 1993). In 1935, R. A. Fisher introduced the idea of a randomization test (Manly 2007; Efron and Tibshirani 1993). Because of the large computing resources needed to enumerate all possible outcomes, especially for large sample sizes, permutation tests did not become popular until around the late 1980s (Helsel et al. 2020).

The two-sample permutation test works by enumerating all possible permutations of group assignments, and for each permutation computing the difference between the measures of location for each group. The measure of location for a group could be the mean, median, or some other measure. Under the null hypothesis, the underlying distributions for each group are the same; therefore, it should make no difference which group an observation gets assigned to. To check whether the null hypothesis can be rejected, the data are randomly shuffled, again and again, and the difference between the two randomly created groups is recorded each time. At the end of the test, the actual difference observed between the two groups is compared with the thousands of simulations performed to see how likely/unlikely it would be under the null hypothesis.

For a one-sided upper alternative, the p-value is computed as the proportion of times that the differences of the means (or medians) in the permutation distribution are greater than or equal to the observed difference in means (or medians). For a one-sided lower alternative hypothesis, the p-value is computed as the proportion of times that the differences in the means (or medians) in the permutation distribution are less than or equal to the observed difference in the means (or medians). For a two-sided alternative hypothesis, the p-value is computed as the proportion of times the absolute values of the differences in the means (or medians) in the permutation distribution are greater than or equal to the absolute value of the observed difference in the means (or medians).

A two-sample permutation test for differences in means avoids the assumptions of the parametric t-test. The t-test requires that the data from each group follow a normal distribution and that the groups have the same variance. Violation of these assumptions leads to a loss of power, raising p-values, and failing to find differences between group means when they occur. The permutation test still assumes, however, that under the null hypothesis, the distributions of the observations from each group are exactly the same, and under the alternative hypothesis there is simply a shift in location (Millard 2013). Thus, the shape and spread of the two distributions should be the same whether the null hypothesis is true or not.

# Simulated Data

For this example, let’s generate 20 observations from a lognormal distribution with parameters mean = 10 and cv = 2, and 10 observations from a lognormal distribution with parameters mean = 5 and cv = 2. We will use these data to test the null hypothesis that the mean of group 1 is equal to or less than the mean for group 2 against the alternative that the mean for group 1 is greater than the mean for group 2. To generate the data, we will use the *rlnormAlt()* function from the
EnvStats package.

```
# Load libraries
library(dplyr)
library(EnvStats)
# Load functions
source('functions/sumstats.R')
# Set seed for reproducibility
set.seed(123)
# Create two data sets
dat1 <- data.frame(
Result = rlnormAlt(20, mean = 10, cv = 2),
Censored = FALSE,
Group = 'Group 1'
)
dat2 <- data.frame(
Result = rlnormAlt(10, mean = 5, cv = 2),
Censored = FALSE,
Group = 'Group 2'
)
# Combine data
dat <- rbind(dat1, dat2)
# Make ordered factor
myvars <- c('Group 1', 'Group 2')
dat$Group <- factor(
dat$Group,
ordered = TRUE,
levels = myvars
)
```

Descriptive statistics for each data set are shown below. The descriptive statistics include the total number of observations, detected observations, and frequency of detection; the minimum and maximum values for detected and non-detected observations; and the mean, median, standard deviation (SD), and coefficient of variation (CV) for each of the two groups. For this example, all observations are detects.

```
# Attach data
attach(dat)
# Get summary statistics using function
sumstats(dat, rv = 2, trans = TRUE)
## Group 1 Group 2
## Obs 20.00 10.00
## Dets 20.00 10.00
## DetFreq 100.00 100.00
## MinND NA NA
## MinDET 0.37 0.26
## MaxND NA NA
## MaxDet 43.16 10.97
## Mean 10.45 2.57
## Median 5.21 0.95
## SD 12.97 3.48
## CV 1.24 1.35
```

Box plots comparing the two groups of data are provided below.

```
# Boxplots
boxplot(Result ~ Group, data = dat)
```

# Wilcoxon Rank Sum Test

For comparison purposes, let’s test the null hypothesis that the mean of group 1 is equal to or less than the mean for group 2 against the alternative that the median for group 1 is greater than the median for group 2 using the WRS test.

```
wilcox.test(
x = Result[Group == 'Group 1'],
y = Result[Group == 'Group 2'],
alternative = 'greater',
mu = 0, paired = FALSE,
conf.int = TRUE, conf.level = 0.95
)
##
## Wilcoxon rank sum exact test
##
## data: Result[Group == "Group 1"] and Result[Group == "Group 2"]
## W = 158, p-value = 0.004808
## alternative hypothesis: true location shift is greater than 0
## 95 percent confidence interval:
## 1.309476 Inf
## sample estimates:
## difference in location
## 3.511922
```

The p-value for the WRS test is 0.005. Thus, at a typical significance level of 0.05, we reject the null hypothesis and accept the alternative that the median for group 1 is greater than the median for group 2.

# Permutation Test: coin Package

The *one-way()* function in the
coin package can be used to perform the Fisher-Pitman permutation test (Fisher 1936; Pitman 1937). The conditional null distribution of the test statistic is used to obtain p-values and an asymptotic approximation of the exact distribution is used by default (distribution = “symptotic”). Alternatively, the distribution can be approximated via Monte-Carlo resampling by setting the *distribution* variable to “approximate.”

```
# Load library
library(coin)
# Remove group ordering for coin package
dat2 <- dat %>%
mutate(Group = factor(Group, ordered = FALSE))
# Run asymptotic Fisher-Pitman test
oneway_test(
Result ~ Group, data = dat2,
alternative = 'greater'
)
##
## Asymptotic Two-Sample Fisher-Pitman Permutation Test
##
## data: Result by Group (Group 1, Group 2)
## Z = 1.796, p-value = 0.03624
## alternative hypothesis: true mu is greater than 0
# Run approximative (Monte Carlo) Fisher-Pitman test
set.seed(123)
oneway_test(
Result ~ Group, data = dat2,
distribution = approximate(nresample = 10000),
alternative = 'greater'
)
##
## Approximative Two-Sample Fisher-Pitman Permutation Test
##
## data: Result by Group (Group 1, Group 2)
## Z = 1.796, p-value = 0.0125
## alternative hypothesis: true mu is greater than 0
```

The p-values for the asymptotic approximation and the Monte-Carlo resampling are 0.036 and 0.013, respectively.

# Permutation Test: EnvStats Package

The *twoSamplePermutationTestLocation()* function in the
EnvStats package
can be used to perform a two-sample or paired-sample permutation test for location based on either means or medians.

```
# Run test
perm_out <- twoSamplePermutationTestLocation(
x = Result[Group == 'Group 1'],
y = Result[Group == 'Group 2'],
fcn = 'mean',
alternative = 'greater',
mu1.minus.mu2 = 0,
paired = FALSE,
exact = FALSE,
n.permutations = 10000,
seed = 123
)
print(perm_out)
##
## Results of Hypothesis Test
## --------------------------
##
## Null Hypothesis: mu.x-mu.y = 0
##
## Alternative Hypothesis: True mu.x-mu.y is greater than 0
##
## Test Name: Two-Sample Permutation Test
## Based on Differences in Means
## (Based on Sampling
## Permutation Distribution
## 10000 Times)
##
## Estimated Parameter(s): mean of x = 10.449733
## mean of y = 2.573186
##
## Data: x = Result[Group == "Group 1"]
## y = Result[Group == "Group 2"]
##
## Sample Sizes: nx = 20
## ny = 10
##
## Test Statistic: mean.x - mean.y = 7.876547
##
## P-value: 0.0132
```

The p-value for the permutation test using the *twoSamplePermutationTestLocation()* function is 0.013.

# Permutation Test: perm Package

The *permTS()* function in the
perm package can be used to perform a two-sample permutation test. There is a choice of four different methods to calculate the p-value:

*pclt*uses a permutational central limit theorem (Sen 1985)*exact.mc*uses Monte-Carlo simulation*exact.network*uses a network algorithm (Agresti et al. 1990)*exact.ce*uses complete enumeration for very small sample sizes

The variable *permControl* is used to create a list of arguments for controlling the permutation test. Let’s set the confidence level for the p-value estimate (*p.conf.level*) to 0.95, the number of Monte-Carlo replications (*nmc*) to 10,000, and the seed to 123.

```
# Load library
library(perm)
# Run test
permTS(
x = Result[Group == 'Group 1'],
y = Result[Group == 'Group 2'],
method = 'exact.mc',
alternative = 'greater',
control = permControl(p.conf.level = 0.95, nmc = 10000, seed = 123)
)
##
## Exact Permutation Test Estimated by Monte Carlo
##
## data: GROUP 1 and GROUP 2
## p-value = 0.011
## alternative hypothesis: true mean GROUP 1 - mean GROUP 2 is greater than 0
## sample estimates:
## mean GROUP 1 - mean GROUP 2
## 7.876547
##
## p-value estimated from 10000 Monte Carlo replications
## 95 percent confidence interval on p-value:
## 0.008958341 0.013133833
```

The p-value for the permutation test using the *permTS()* function is 0.011.

# Permutation Test: NADA2 Package

The *cenperm2()* function in the
NADA2 package can be used to perform a two-sample permutation test for data with and without censored results.

```
# Load library
library(NADA2)
# Run permutation test
set.seed(123)
cenperm2(
y1 = Result,
y2 = Censored,
grp = Group,
R = 10000,
alternative = 'greater',
printstat = TRUE
)
## Permutation test of mean CensData: Result by Factor: Group
## 10000 Permutations alternative = greater
## mean of Group 1 = 10.45 to 10.45 mean of Group 2 = 2.573 to 2.573
## Mean (Group 1 - Group 2) = 7.877 to 7.877 p = 0.0109989 to 0.0109989
##
```

The p-value for the permutation test using the *cenperm2()* function is 0.011.

# Permutation Test: USGS Function

The
perm2.R function included in Chapter 5 of *Statistical Methods in Water Resources: U.S. Geological Survey Techniques and Methods, Book 4, Chapter A3* (Helsel et al. 2020) can be used to perform a permutation test for difference in the means associated with two groups of data. If the *stacked* variable is set to FALSE, the *y1* and *y2* variables are the two columns of data. Alternatively, if *stacked* is set to TRUE, the *y1* variable represents the data and the *y2* variable represents the groups.

I modified the
perm2.R function, by adding a *cc* variable that let’s the user assign a reference factor for the two groups when setting *stacked* to TRUE. The default for *cc* is the first grouping factor.

Let’s first run the test using *stacked* set to FALSE.

```
# Load function
source('functions/perm2.R')
# Run test with stcaked set to FALSE
set.seed(123)
perm2(
y1 = Result[Group == 'Group 1'],
y2 = Result[Group == 'Group 2'],
R = 10000,
stacked = FALSE,
alternative = 'g',
conf = 0.95
)
##
##
## Permutation Test of Difference Between 2 Group Means
## Data: Result[Group == "Group 1"] and Result[Group == "Group 2"]
## Number of Possible Permutations is greater than 1000
##
## R = 10000 pvalue = 0.0112
## Alt Hyp: true difference in means is greater than 0
## sample estimates:
## mean of Result[Group == "Group 1"] = 10.44973 mean of Result[Group == "Group 2"] = 2.573186
## Diff of means = 7.876547
## 95 percent confidence interval
## 3.160836 Inf
```

Now, let’s run the test using *stacked* set to TRUE where *y2* is the grouping variable. Let’s assign “Group 1” as the reference factor.

```
# Run test with stacked set to TRUE
set.seed(123)
perm2(
y1 = Result,
y2 = factor(Group, ordered = FALSE),
cc = 'Group 1',
R = 10000,
stacked = TRUE,
alternative = 'g',
conf = 0.95
)
##
##
## Permutation Test of Difference Between 2 Group Means
## Data: Result by factor(Group, ordered = FALSE)
## Number of Possible Permutations is greater than 1000
##
## R = 10000 pvalue = 0.0112
## Alt Hyp: true difference in means is greater than 0
## sample estimates:
## mean of Group 1 = 10.44973 mean of Group 2 = 2.573186
## Diff of means = 7.876547
## 95 percent confidence interval
## 3.160836 Inf
```

The p-value for the permutation test using the *perm2.R* function is 0.011.

# Conclusions

Permutation tests are designed to be robust against departures from normality. Permutation tests compute p-values by randomly selecting several thousand outcomes from the many larger number of outcomes possible that represent the null hypothesis. The p-value is then the proportion of outcomes that are equal to, or more extreme than, the one obtained from the actual data. When comparing central values of two independent groups of data, the two-sample t-test, the WRS test, or a two-sample permutation test might be selected. Test procedures should be selected based on the characteristics of the data and the objectives of interest. Of importance is whether the data have a particular distribution (such as a normal distribution) and if the central value under consideration is better defined by the mean (center of mass) or median (center of frequency). The advantage of a two-sample permutation test over a t-test is that there is no concern that a nonsignificant result might be a result that the input data substantially depart from the distributional assumption of the test.

# References

Agresti, A., C.R. Mehta, and N.R. Patel. 1990. Exact Inference for Contingency Tables with Ordered Categories. *JASA* 85, 453-458.

Brown, D. and P. Rothery. 1993. *Models in Biology: Mathematics, Statistics, and Computing*. Wiley, New Jersey.

Good, P.I. 2005. *Resampling Methods: A Practical Guide to Data Analysis. Third Edition*. Birkhauser, Boston.

Helsel, D.R., R.M. Hirsch, K.R. Ryberg, S.A. Archfield, and E.J. Gilroy. 2020. *Statistical Methods in Water Resources: U.S. Geological Survey Techniques and Methods, Book 4, Chapter A3*. [Supersedes USGS Techniques of Water-Resources Investigations, Book 4, Chapter A3, Version 1.1.]. U.S. Geological Survey, Reston, Virginia.

Efron, B. and R.J. Tibshirani. 1993. *An Introduction to the Bootstrap*. Chapman and Hall, New York.

Fisher, R.A. 1936. The Coefficient of Racial Likeness and the Future of Craniometry. *Journal of the Royal Anthropological Institute*, 66, 57-63.

Manly, B.F.J. 2007. *Randomization, Bootstrap and Monte Carlo Methods in Biology. Third Edition*. Chapman & Hall, New York.

Millard, S.P. 2013. *EnvStats: An R Package for Environmental Statistics*. Springer-Verlag, NY.

Pitman, E.J.G. 1937. Significance Test Which May Be Applied to Samples From Any Populations. *Supplement to the Journal of the Royal Statistical Society*, 4, 119-130.

Sen, P.K. 1985. Permutational Central Limit Theorems. In S. Kotz and N.L. Johnson (eds.), *Encyclopedia of Statistics, Volume 6*. Wiley.

in Encyclopedia of Statistics, Vol 6.

- Posted on:
- December 22, 2022

- Length:
- 13 minute read, 2669 words

- See Also: