library(openintro)
library(palmerpenguins)
library(mosaic)
library(ggplot2)
Simulation-Based Test of Significance
Required Packages
theme_set(theme_bw())
theme_replace(panel.grid.minor = element_blank(),
panel.grid.major = element_blank())
Comparing two proportions
From the openintro
package we consider the sten30
dataset. An experiment that studies effectiveness of stents in treating patients at risk of stroke with some unexpected results, this dataset represent the results 30 days after stroke
<- openintro::stent30 stent30
#> Rows: 451
#> Columns: 2
#> $ group <fct> treatment, treatment, treatment, treatment, treatment, treatme…
#> $ outcome <fct> stroke, stroke, stroke, stroke, stroke, stroke, stroke, stroke…
We are interesting in testing if the probability of having a stroke is different for those in the control and treatment group. The null and alternative hypotheses are given by \[ \begin{align*} H_0: \pi_1 &= \pi_2 \\[5pt] H_a: \pi_1 &\neq \pi_2 \end{align*} \]
<- tally(outcome~group,data=stent30 ,format='proportion') observed_tbl
The observed statistic is the difference in proportion of individuals who did get a stroke under the treatment and control groups
<- observed_tbl['stroke','treatment'] - observed_tbl['stroke','control'] observed_diff
observed_diff
#> [1] 0.09005271
Below is a for loop that will simulate the null distribution for our test statistic we will use to test our hypothesis
set.seed(234)
<- 5000
iters <- numeric(iters)
diff_proportions
for(i in 1:iters){
<- sample(stent30$group)
group_shuffle_i <- tally(stent30$outcome~group_shuffle_i,format = 'proportion')
cond_props_i
<- cond_props_i['stroke','treatment'] - cond_props_i['stroke','control']
diff_proportions[i] }
Show Code
<- ggplot(data.frame(diff_proportions),
sampling_dist aes(diff_proportions))+
geom_histogram(fill = 'steelblue',alpha = 0.3,
color = 'black',bins=25)+
geom_vline(xintercept = observed_diff,linetype = 2,
color ='red')+
geom_vline(xintercept = -observed_diff,linetype = 2,
color ='red')+
labs(title = 'Sampling distribution',
subtitle = '',
x = 'Sample Difference of Proportions',y = '')
Using our null distribution, we can decide on the strength of evidence for or against the null hypothesis. The simulation-based \(p\)-value is then computed as
mean(abs(diff_proportions) >= observed_diff)
#> [1] 8e-04
Using a significance level of \(\alpha=0.05\), we have enough evidence in favor of the alternative hypothesis. That is, we reject the null hypothesis and suggest there is a difference in proportions of patients getting a stroke under the control and treatment group.
Comparing two means
Using the penguins
dataset from the package palmerpenguins
, in this portion of the tutorial we are interested in whether the average body mass (in grams) of Adelie type penguins is less than that of Chinstrap penguins
We will test this comparing the mean body mass between Adelie and Chinstrap penguins using a simulation-based approach
First, we remove missing data concerning the two variables of interest species
and body_mass_g
<- is.na(penguins$species)|is.na(penguins$body_mass_g)| penguins$species == 'Gentoo' subset_filter
penguins_dat
will be the dataset without missing values for the species
and body_mass_g
variables
<- subset(penguins,!subset_filter) penguins_dat
<- mean(body_mass_g ~ species, data = penguins_dat) observed_means
The observed statistic is the difference in means of Adelie type penguins and Chinstrap penguins
<- observed_means['Adelie']-observed_means['Chinstrap']
observed_means_diff observed_means_diff
#> Adelie
#> -32.42598
The null and alternative hypotheses are given by
\[ \begin{align*} H_0: \mu_{\text{adelie}} = \mu_{\text{Chinstrap}} \\ H_a: \mu_{\text{adelie}} < \mu_{\text{Chinstrap}} \end{align*} \]
Below is a for loop that will simulate the null distribution for our test statistic we will use to test our hypothesis
<- 5000
iters <- numeric(iters)
diff_means set.seed(123)
for (i in 1:iters){
<- sample(penguins_dat$species)
species_shuffle_i <- mean(body_mass_g ~ species_shuffle_i, data=penguins_dat)
means_i <- means_i[1] - means_i[2]
diff_means[i] }
Show Code
<- ggplot(data.frame(diff_means),
sampling_dist aes(diff_means))+
geom_histogram(fill = 'steelblue',alpha = 0.3,
color = 'black',bins=30)+
geom_vline(xintercept = observed_means_diff,linetype = 2,
color ='red')+
labs(title = 'Sampling distribution',
subtitle = '',
x = 'Sample Difference of Means',y = '')
Using our null distribution, we can decide on the strength of evidence for or against the null hypothesis. The simulation-based \(p\)-value is then computed as
mean(diff_means <= observed_means_diff)
#> [1] 0.3098
Using a significance level of \(\alpha=0.05\), we do not have enough evidence in favor of the alternative hypothesis. That is, we fail to reject the null hypothesis and cant suggest the average body mass of Adelie type penguins is less than that of a Chinstrap penguin