The Mad Hatter’s Guide to Data Viz and Stats in R

Permutation Tests for Two Means

Author

Arvind Venkatadri

Published

November 22, 2022

Modified

September 22, 2025

1 Setting up the Packages

library(ggplot2)
library(dplyr)
library(mosaic)

library(resampledata)

2 Case Study-1: Verizon

Do Verizon's repair times differ between its own ILEC (incumbent local exchange carrier) customers and the customers of CLECs (competing local exchange carriers)?

data("Verizon")
inspect(Verizon)

categorical variables:  
   name  class levels    n missing
1 Group factor      2 1687       0
                                   distribution
1 ILEC (98.6%), CLEC (1.4%)                    

quantitative variables:  
  name   class min   Q1 median   Q3   max     mean       sd    n missing
1 Time numeric   0 0.75   3.63 7.35 191.6 8.522009 14.78848 1687       0

Describe the Variables!
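
One way to start describing the variables is a group-wise summary of Time. This is a minimal sketch that assumes only the packages loaded above; the name `verizon_summary` is an illustrative choice and not part of the original.

```r
library(dplyr)
library(resampledata)

data("Verizon")

# Group-wise summary of repair times.
# `verizon_summary` is a hypothetical name used only for this sketch.
verizon_summary <- Verizon %>%
  group_by(Group) %>%
  summarise(
    n = n(),                     # number of repairs in each group
    mean_time = mean(Time),      # mean repair time
    median_time = median(Time),  # median repair time
    sd_time = sd(Time)           # spread of repair times
  )

verizon_summary
```

The inspect() output already shows that the groups are very unequal in size (CLEC is only 1.4% of the rows) and that Time is strongly right-skewed (median 3.63 against a maximum of 191.6). Both features are worth keeping in mind when choosing a test.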

2.1 Hypothesis Specification

Write the Null and Alternate hypotheses here.
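
One possible two-sided specification is sketched here. The subscripts are illustrative, and a one-sided alternative could be equally defensible depending on the question being asked.

\[ H_0: \mu_{\text{ILEC}} = \mu_{\text{CLEC}} \]

\[ H_a: \mu_{\text{ILEC}} \ne \mu_{\text{CLEC}} \]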

2.2 Null Distribution Computation
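
A permutation test builds the null distribution by shuffling the Group labels many times and recomputing the difference in mean Time after each shuffle. The sketch below uses `diffmean()`, `shuffle()` and `do()` from mosaic. The seed, the 999 permutations and the object names `obs_diff`, `null_dist` and `p_value` are assumptions made for illustration.

```r
library(ggplot2)
library(mosaic)
library(resampledata)

data("Verizon")
set.seed(2022)  # arbitrary seed, for reproducibility only

# Observed difference in mean repair time between the two groups
obs_diff <- diffmean(Time ~ Group, data = Verizon)

# Null distribution: shuffle the Group labels and recompute the difference
null_dist <- do(999) * diffmean(Time ~ shuffle(Group), data = Verizon)

# Two-sided p-value; the +1 counts the observed statistic as one permutation
p_value <- (sum(abs(null_dist$diffmean) >= abs(obs_diff)) + 1) /
  (nrow(null_dist) + 1)
p_value

# Null distribution with the observed difference marked in red
ggplot(null_dist, aes(x = diffmean)) +
  geom_histogram(bins = 40) +
  geom_vline(xintercept = obs_diff, colour = "red")
```

The p-value is the fraction of shuffled differences that are at least as extreme as the observed one. Because CLEC is such a small group, the null distribution need not be symmetric, so it is worth looking at the histogram before settling on a two-sided rule.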

2.3 Verizon Conclusion

3 Case Study-2: Recidivism

Do criminals released after a jail term commit crimes again? Does recidivism depend upon age?

data("Recidivism")
inspect(Recidivism)

categorical variables:  
     name  class levels     n missing
1  Gender factor      2 17019       3
2     Age factor      5 17019       3
3   Age25 factor      2 17019       3
4    Race factor     10 16988      34
5 Offense factor      2 17022       0
6   Recid factor      2 17022       0
7    Type factor      3 17022       0
                                   distribution
1 M (87.7%), F (12.3%)                         
2 25-34 (36.6%), 35-44 (23.7%) ...             
3 Over 25 (81.9%), Under 25 (18.1%)            
4 White-NonHispanic (67%) ...                  
5 Felony (80.6%), Misdemeanor (19.4%)          
6 No (68.4%), Yes (31.6%)                      
7 No Recidivism (68.4%), New (20.2%) ...       

quantitative variables:  
  name   class min  Q1 median  Q3  max     mean       sd    n missing
1 Days integer   0 241    418 687 1095 473.3275 283.1393 5386   11636

Describe the variables!

3.1 Hypothesis Specification

Let us see if the incidence of recidivism depends upon whether a person is aged under or over 25 years. Write the Null and Alternate hypotheses here.

\[ H_0: \mu_{\text{recid, under 25}} = \mu_{\text{recid, over 25}} \]

\[ H_a: \mu_{\text{recid, under 25}} \ne \mu_{\text{recid, over 25}} \]

Also, the variable Recid is a factor variable coded “Yes” or “No”. We ought to convert it to a numeric variable of 1’s and 0’s. Why?
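
One reason for the conversion: the mean of a 0/1 variable is the proportion of 1's, so a difference in means between the two age groups is exactly a difference in recidivism proportions. The sketch below shows one way to do this with dplyr. The names `recid_clean`, `recid_num` and `recid_props` are hypothetical, and dropping the rows with a missing Age25 is an assumption about how to handle them.

```r
library(dplyr)
library(resampledata)

data("Recidivism")

# Keep rows with a known age group and code Recid as 1 (Yes) or 0 (No).
# `recid_clean` and `recid_num` are hypothetical names for this sketch.
recid_clean <- Recidivism %>%
  filter(!is.na(Age25)) %>%
  mutate(recid_num = if_else(Recid == "Yes", 1, 0))

# The mean of the 0/1 variable is the proportion who reoffended in each group
recid_props <- recid_clean %>%
  group_by(Age25) %>%
  summarise(n = n(), prop_recid = mean(recid_num))

recid_props
```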

3.2 Null Distribution for Recidivism
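
To make the mechanics of the permutation explicit, here is a sketch in base R instead of mosaic. It shuffles the Age25 labels with `sample()` and recomputes the difference in recidivism proportions each time. The helper `diff_prop()`, the seed and the 999 permutations are illustrative assumptions.

```r
library(resampledata)

data("Recidivism")
set.seed(2022)  # arbitrary seed, for reproducibility only

# Drop rows with unknown age group; code Recid as 1 (Yes) or 0 (No)
recid <- subset(Recidivism, !is.na(Age25))
recid$recid_num <- as.numeric(recid$Recid == "Yes")

# Hypothetical helper: difference in group means (second level minus first)
diff_prop <- function(y, g) {
  means <- tapply(y, g, mean)
  unname(means[2] - means[1])
}

# Observed difference in recidivism proportions between the age groups
obs_diff <- diff_prop(recid$recid_num, recid$Age25)

# Null distribution: permute the age labels, keep the outcomes fixed
null_diffs <- replicate(999, diff_prop(recid$recid_num, sample(recid$Age25)))

# Two-sided p-value; the +1 counts the observed statistic as one permutation
p_value <- (sum(abs(null_diffs) >= abs(obs_diff)) + 1) / (length(null_diffs) + 1)
p_value
```

Shuffling only the labels keeps the group sizes and the overall recidivism rate fixed, which is what the null hypothesis of "no dependence on age group" implies.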

3.3 Recidivism Conclusion

3.4 Case Study-3: Flight Delays

LaGuardia Airport (LGA) is one of three major airports that serve the New York City metropolitan area. In 2008, over 23 million passengers and over 375,000 planes flew in or out of LGA. United Airlines and American Airlines are two major airlines that schedule services at LGA. The data set FlightDelays contains information on all 4029 departures of these two airlines from LGA during May and June 2009.

data("FlightDelays")
inspect(FlightDelays)

categorical variables:  
         name  class levels    n missing
1     Carrier factor      2 4029       0
2 Destination factor      7 4029       0
3  DepartTime factor      5 4029       0
4         Day factor      7 4029       0
5       Month factor      2 4029       0
6   Delayed30 factor      2 4029       0
                                   distribution
1 AA (72.1%), UA (27.9%)                       
2 ORD (44.3%), DFW (22.8%), MIA (15.1%) ...    
3 8-Noon (26.1%), Noon-4pm (26%) ...           
4 Fri (15.8%), Mon (15.6%), Tue (15.6%) ...    
5 June (50.4%), May (49.6%)                    
6 No (85.2%), Yes (14.8%)                      

quantitative variables:  
          name   class min   Q1 median   Q3  max      mean         sd    n
1           ID integer   1 1008   2015 3022 4029 2015.0000 1163.21645 4029
2     FlightNo integer  71  371    691  787 2255  827.1035  551.30939 4029
3 FlightLength integer  68  155    163  228  295  185.3011   41.78783 4029
4        Delay integer -19   -6     -3    5  693   11.7379   41.63050 4029
  missing
1       0
2       0
3       0
4       0

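The variables in the FlightDelays dataset, as listed in the inspect() output above, are:

  • ID, FlightNo: flight identifiers
  • Carrier: the airline, AA or UA
  • Destination: destination airport (7 levels, for example ORD, DFW, MIA)
  • DepartTime: scheduled departure window (5 levels, for example 8-Noon, Noon-4pm)
  • Day, Month: day of the week, and May or June
  • FlightLength: length of the flight, in minutes
  • Delay: departure delay in minutes (negative values indicate an early departure)
  • Delayed30: Yes or No, whether the departure was delayed by more than 30 minutes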

3.5 Hypothesis Specification

Let us compute the proportion of each carrier's flights that were delayed by more than 20 minutes. We will conduct a two-sided test to see if the difference in these proportions is statistically significant.
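
The dataset has a Delayed30 factor but no ready-made indicator for a 20-minute delay, so one has to be built from Delay. The sketch below does this and computes the per-carrier proportions. The names `flights`, `delayed20` and `delay_props` are hypothetical.

```r
library(dplyr)
library(resampledata)

data("FlightDelays")

# Indicator: 1 if the flight was delayed by more than 20 minutes, else 0.
# `flights` and `delayed20` are hypothetical names for this sketch.
flights <- FlightDelays %>%
  mutate(delayed20 = as.numeric(Delay > 20))

# Proportion of delayed flights for each carrier
delay_props <- flights %>%
  group_by(Carrier) %>%
  summarise(n = n(), prop_delayed20 = mean(delayed20))

delay_props
```

The test statistic for the two-sided test is then the difference between the two values in `prop_delayed20`.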

3.6 Null Distribution for FlightDelays
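
A sketch of the permutation test for the difference in proportions follows, using `diffmean()`, `shuffle()` and `do()` from mosaic on a 0/1 delay indicator. The indicator name `delayed20`, the seed and the 999 permutations are assumptions made for illustration.

```r
library(mosaic)
library(resampledata)

data("FlightDelays")
set.seed(2022)  # arbitrary seed, for reproducibility only

# 0/1 indicator for a delay of more than 20 minutes (hypothetical name)
flights <- FlightDelays
flights$delayed20 <- as.numeric(flights$Delay > 20)

# Observed difference in the proportion of delayed flights between carriers
obs_diff <- diffmean(delayed20 ~ Carrier, data = flights)

# Null distribution: shuffle the Carrier labels and recompute the difference
null_dist <- do(999) * diffmean(delayed20 ~ shuffle(Carrier), data = flights)

# Two-sided p-value; the +1 counts the observed statistic as one permutation
p_value <- (sum(abs(null_dist$diffmean) >= abs(obs_diff)) + 1) /
  (nrow(null_dist) + 1)
p_value
```

With 999 permutations the smallest p-value this sketch can report is 1/1000, so a more precise value would need more permutations.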

The resulting p-value is very small. Hence we reject the null hypothesis that the two carriers do not differ in the proportion of flights delayed by more than 20 minutes.


License: CC BY-SA 2.0
