library(systemfonts)
library(showtext)
library(ggrepel)
library(marquee)

## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()

showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
  family = "Alegreya",
  regular = "../../../../../../fonts/Alegreya-Regular.ttf",
  bold = "../../../../../../fonts/Alegreya-Bold.ttf",
  italic = "../../../../../../fonts/Alegreya-Italic.ttf",
  bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)
sysfonts::font_add(
  family = "Roboto Condensed",
  regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
  bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
  italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
  bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext

theme_custom <- function() {
  theme_bw(base_size = 10) +
    theme_sub_axis(
      title = element_text(family = "Roboto Condensed", size = 8),
      text = element_text(family = "Roboto Condensed", size = 6)
    ) +
    theme_sub_legend(
      text = element_text(family = "Roboto Condensed", size = 6),
      title = element_text(family = "Alegreya", size = 8)
    ) +
    theme_sub_plot(
      title = element_text(family = "Alegreya", size = 14, face = "bold"),
      title.position = "plot",
      subtitle = element_text(family = "Alegreya", size = 10),
      caption = element_text(family = "Alegreya", size = 6),
      caption.position = "plot"
    )
}

## Use available fonts in ggplot text geoms too!
ggplot2::update_geom_defaults(geom = "text", new = list(family = "Roboto Condensed", face = "plain", size = 3.5, color = "#2b2b2b"))
ggplot2::update_geom_defaults(geom = "label", new = list(family = "Roboto Condensed", face = "plain", size = 3.5, color = "#2b2b2b"))
ggplot2::update_geom_defaults(geom = "marquee", new = list(family = "Roboto Condensed", face = "plain", size = 3.5, color = "#2b2b2b"))
ggplot2::update_geom_defaults(geom = "text_repel", new = list(family = "Roboto Condensed", face = "plain", size = 3.5, color = "#2b2b2b"))
ggplot2::update_geom_defaults(geom = "label_repel", new = list(family = "Roboto Condensed", face = "plain", size = 3.5, color = "#2b2b2b"))

## Set the theme
ggplot2::theme_set(new = theme_custom())

## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)

## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")
1 Introduction
We saw from the diagram created by Allen Downey that there is only one test! We will now use this philosophy to develop a technique that lets us mechanize several statistical tests in that one way, with nearly identical code.

We will use two packages in R, mosaic and the relatively new infer package, to develop our intuition for what are called permutation-based statistical tests.
2 Hypothesis Testing using Permutation
From Reference #1:
Hypothesis testing can be thought of as a 4-step process:
State the null and alternative hypotheses.
Compute a test statistic.
Determine the p-value.
Draw a conclusion.
In a traditional introductory statistics course, once this general framework has been mastered, the main work is in applying the correct formula to compute the standard test statistics in step 2 and using a table or computer to determine the p-value based on the known (usually approximate) theoretical distribution of the test statistic under the null hypothesis.
In a simulation-based approach, steps 2 and 3 change. In Step 2, it is no longer required that the test statistic be normalized to conform with a known, named distribution. Instead, natural test statistics, like the difference between two sample means \(\bar{y}_1 - \bar{y}_2\), can be used.
In Step 3, we use randomization to approximate the sampling distribution of the test statistic. Our lady tasting tea example demonstrates how this can be done from first principles. More typically, we will use randomization to create new simulated data sets ("Parallel Worlds") that are like our original data in some ways, but make the null hypothesis true. For each simulated data set, we calculate our test statistic, just as we did for the original sample. Together, this collection of test statistics computed from the simulated samples constitutes our randomization distribution.
When creating a randomization distribution, we will attempt to satisfy 3 guiding principles.
Be consistent with the null hypothesis. We need to simulate a world in which the null hypothesis is true. If we don’t do this, we won’t be testing our null hypothesis.
Use the data in the original sample. The original data should shed light on some aspects of the distribution that are not determined by null hypothesis. For example, a null hypothesis about a mean doesn’t tell us about the shape of the population distribution, but the data give us some indication.
Reflect the way the original data were collected.
From Chihara and Hesterberg:
This is the core idea of statistical significance or classical hypothesis testing – to calculate how often pure random chance would give an effect as large as that observed in the data, in the absence of any real effect. If that probability is small enough, we conclude that the data provide convincing evidence of a real effect.
3 Permutation tests using mosaic::shuffle()
The mosaic package provides the shuffle() function as a synonym for sample(). When used without additional arguments, this will permute its first argument.
# library(mosaic)
shuffle(1:10)
[1] 3 7 2 6 8 1 4 10 9 5
Applying shuffle() to an explanatory variable in a model allows us to test the null hypothesis that the explanatory variable has, in fact, no explanatory power. This idea can be used to test
the equivalence of two or more means,
the equivalence of two or more proportions,
whether a regression parameter is 0 (i.e., whether two variables are correlated).
Coupled with mosaic::do(), we can repeat a shuffle many times, computing a desired statistic for each shuffle. The distribution of this computed statistic is a NULL distribution, which can then be compared with the observed statistic to carry out the hypothesis test and obtain a p-value.
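As a tiny sketch of this workflow (assuming mosaic is loaded): each call to shuffle() produces a fresh permutation, and do() collects repeated results row-wise into a data frame.

```r
library(mosaic)

# three independent permutations of 1:5, one per row
do(3) * shuffle(1:5)
```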
3.1 Case Study-1: Hot Wings Orders vs Gender
A student conducted a study of hot wings and beer consumption at a bar. She asked patrons to record their consumption of hot wings and beer over the course of several hours. She wanted to know if people who ate more hot wings would then drink more beer. In addition, she investigated whether or not gender had an impact on hot wings or beer consumption.
Beerwings <- resampledata3::Beerwings # From resampledata3 package
skimr::skim(Beerwings)
Data summary

Name: Beerwings
Number of rows: 30
Number of columns: 4
Column type frequency: factor 1, numeric 3
Group variables: None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
Gender         0          1              FALSE    2         F: 15, M: 15

Variable type: numeric

skim_variable  n_missing  complete_rate  mean   sd     p0  p25    p50   p75    p100  hist
ID             0          1              15.50  8.80   1   8.25   15.5  22.75  30    ▇▇▇▇▇
Hotwings       0          1              11.93  4.78   4   8.00   12.5  15.50  21    ▅▃▇▃▃
Beer           0          1              26.20  11.84  0   24.00  30.0  36.00  48    ▁▃▅▇▂
Let us first visualize, and then calculate, the observed difference in Hotwings consumption between males and females (Gender):
gf_boxplot(
  data = Beerwings,
  Hotwings ~ Gender,
  orientation = "x",
  fill = ~Gender,
  title = "Hotwings Consumption by Gender"
)
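The chunk that computes the observed statistic obs_diff_wings, referenced in the permutation code below, is not shown in the text; it would presumably be a one-liner with mosaic::diffmean() (a sketch):

```r
# observed difference in mean Hotwings consumption (second factor level minus first)
obs_diff_wings <- diffmean(Hotwings ~ Gender, data = Beerwings)
obs_diff_wings
```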
The observed difference in mean consumption of Hotwings between males and females is \(5.2\). Could this have occurred by chance? Here is our formulation of the hypotheses:

\(H_0\): Mean Hotwings consumption is the same for males and females.

\(H_a\): Mean Hotwings consumption is higher for males than for females.
null_dist_wings <- do(1000) * diffmean(Hotwings ~ shuffle(Gender), data = Beerwings)
null_dist_wings %>% head()
gf_histogram(
  data = null_dist_wings,
  ~diffmean,
  fill = ~ (diffmean >= obs_diff_wings)
) %>%
  gf_vline(xintercept = obs_diff_wings, colour = "red") %>%
  gf_refine(scale_fill_manual(values = c("lightseagreen", "tomato")))
prop1(~ diffmean >= obs_diff_wings, data = null_dist_wings)
prop_TRUE
0.002997003
The \(\color{red}{red\ line}\) shows the actual measured mean difference in Hotwings consumption. The probability that our permutation distribution equals or exceeds that number is \(0.002997003\), so we reject the Null Hypothesis that the means are identical.
3.2 Case Study-2: Verizon
Verizon was an Incumbent Local Exchange Carrier (ILEC), responsible for maintaining land-line phone service in certain areas. Verizon also sold long-distance service, as did a number of competitors, termed Competitive Local Exchange Carriers (CLEC). When something went wrong, Verizon was responsible for repairs, and was supposed to make repairs as quickly for CLEC long-distance customers as for their own.
The New York Public Utilities Commission (PUC) monitored fairness by comparing repair times for Verizon and different CLECs, for different classes of repairs and time periods. In each case a hypothesis test was performed at the 1% significance level, to determine whether repairs for CLEC’s customers were significantly slower than for Verizon’s customers. There were hundreds of such tests. If substantially more than 1% of the tests were significant, then Verizon would pay large penalties.
These tests were performed using t tests; Verizon proposed using permutation tests instead.
verizon <- resampledata3::Verizon # From resampledata3 package
verizon
skim(verizon)
Data summary

Name: verizon
Number of rows: 1687
Number of columns: 2
Column type frequency: factor 1, numeric 1
Group variables: None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
Group          0          1              FALSE    2         ILE: 1664, CLE: 23

Variable type: numeric

skim_variable  n_missing  complete_rate  mean  sd     p0  p25   p50   p75   p100   hist
Time           0          1              8.52  14.79  0   0.75  3.63  7.35  191.6  ▇▁▁▁▁
Let us make a boxplot of the repair times for Verizon and CLEC:
gf_boxplot(
  data = verizon,
  Group ~ Time,
  orientation = "y",
  fill = ~Group,
  title = "Repair Times by Operator"
)
obs_diff_verizon <- diffmean(Time ~ Group, data = verizon)
obs_diff_verizon
diffmean
-8.09752
null_dist_verizon <- do(1000) * diffmean(Time ~ shuffle(Group), data = verizon)
gf_histogram(
  data = null_dist_verizon,
  ~diffmean,
  # obs_diff_verizon is negative (ILEC minus CLEC), so "as extreme" means <=
  fill = ~ (diffmean <= obs_diff_verizon)
) %>%
  gf_vline(xintercept = obs_diff_verizon, colour = "red") %>%
  gf_refine(scale_fill_manual(values = c("lightseagreen", "tomato")))
prop1(~ diffmean <= obs_diff_verizon, data = null_dist_verizon)
prop_TRUE
0.004995005
It seems that the repair Time for CLEC customers is significantly higher than for the ILEC: there is only a very small chance (about \(0.5\%\) in this simulation) that a difference this extreme could have occurred by chance. So we reject the Null Hypothesis that the mean repair Times are identical.
3.3 Case Study-3: Recidivism
Do criminals released after a jail term commit crimes again?
recidivism <- resampledata3::Recidivism # From resampledata3 package
recidivism
skim(recidivism)
Data summary

Name: recidivism
Number of rows: 17022
Number of columns: 7
Column type frequency: factor 6, numeric 1
Group variables: None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
Gender         3          1              FALSE    2         M: 14918, F: 2101
Age            3          1              FALSE    5         25-: 6227, 35-: 4035, Und: 3077, 45-: 2872
Age25          3          1              FALSE    2         Ove: 13942, Und: 3077
Offense        0          1              FALSE    2         Fel: 13720, Mis: 3302
Recid          0          1              FALSE    2         No: 11636, Yes: 5386
Type           0          1              FALSE    3         No: 11636, New: 3437, Tec: 1949

Variable type: numeric

skim_variable  n_missing  complete_rate  mean    sd      p0  p25  p50  p75  p100  hist
Days           11636      0.32           473.33  283.14  0   241  418  687  1095  ▆▇▆▅▃
There are some missing values in the variable Age25. The complete.cases() function returns a logical vector that is TRUE for rows with no missing values. We create a new data frame omitting the rows where there is a missing value in the Age25 variable.
Also, the variable Recid is a factor variable coded “Yes” or “No”. We convert it to a numeric variable of 1’s and 0’s.
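The wrangling code for the first step is not shown in the text; a minimal sketch, assuming the name recidivism_na used in the next chunk:

```r
# keep only the rows where Age25 is not missing
recidivism_na <- recidivism[complete.cases(recidivism$Age25), ]
```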
recidivism_na <- recidivism_na %>%
  mutate(Recid2 = ifelse(Recid == "Yes", 1, 0))
obs_diff_recid <- diffmean(Recid2 ~ Age25, data = recidivism_na)
obs_diff_recid
diffmean
0.05919913
null_dist_recid <- do(1000) * diffmean(Recid2 ~ shuffle(Age25), data = recidivism_na)
gf_histogram(~diffmean, data = null_dist_recid) %>%
  gf_vline(xintercept = obs_diff_recid, colour = "red")
The probability that the difference in means of Recid2 between the two age groups is as large as the observed value is about \(0.15\), and hence we cannot reject the Null Hypothesis that the means are identical.
3.4 Case Study-4: Matched Pairs: Results from a diving championship.
Diving2017 <- resampledata3::Diving2017 # From resampledata3 package
head(Diving2017)
skim(Diving2017)
Data summary

Name: Diving2017
Number of rows: 12
Number of columns: 4
Column type frequency: factor 2, numeric 2
Group variables: None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
Name           0          1              FALSE    12        SI: 1, BEN: 1, CHA: 1, CHE: 1
Country        0          1              FALSE    8         Can: 2, Chi: 2, Mal: 2, Nor: 2

Variable type: numeric

skim_variable  n_missing  complete_rate  mean    sd     p0      p25     p50     p75     p100   hist
Semifinal      0          1              338.50  22.95  313.70  322.20  325.62  356.57  382.8  ▇▁▂▂▁
Final          0          1              350.48  40.02  283.35  318.59  358.92  387.15  397.5  ▃▃▂▆▇
The data consist of paired observations per diver. So we need to take the difference between the two scores for each diver and then randomly flip each difference to either polarity. Another way to look at this is to shuffle the two scores between Semifinal and Final on a per-diver basis.
obs_diff_swim <- mean(~ Final - Semifinal, data = Diving2017)
obs_diff_swim
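The permutation test itself is not shown in the text. Following the sign-flipping approach just described, and mirroring the resample(polarity) idiom used in Case Study #6 below, a sketch might be:

```r
# randomly flip the sign of each diver's (Final - Semifinal) difference;
# there are 12 divers, so the polarity vector is balanced
polarity <- c(rep(1, 6), rep(-1, 6))
null_dist_swim <- do(10000) * mean(
  ~ (Final - Semifinal) * resample(polarity, replace = TRUE),
  data = Diving2017
)

# two-sided p-value: how often is the permuted mean as extreme as observed?
prop1(~ abs(mean) >= abs(obs_diff_swim), data = null_dist_swim)
```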
It appears that there is no reason to accept the Alternative Hypothesis that the diving scores differ between the semifinal and final events.
3.5 Case Study #5: Flight Delays
LaGuardia Airport (LGA) is one of three major airports that serve the New York City metropolitan area. In 2008, over 23 million passengers and over 375,000 flights passed through LGA. United Airlines and American Airlines are two major airlines that schedule services at LGA. The data set FlightDelays contains information on all 4029 departures of these two airlines from LGA during May and June 2009.
flightDelays <- resampledata3::FlightDelays # From resampledata3 package
skim(flightDelays)
Data summary

Name: flightDelays
Number of rows: 4029
Number of columns: 10
Column type frequency: factor 6, numeric 4
Group variables: None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
Carrier        0          1              FALSE    2         AA: 2906, UA: 1123
Destination    0          1              FALSE    7         ORD: 1785, DFW: 918, MIA: 610, DEN: 264
DepartTime     0          1              FALSE    5         8-N: 1053, Noo: 1048, 4-8: 972, 4-8: 699
Day            0          1              FALSE    7         Fri: 637, Mon: 630, Tue: 628, Thu: 566
Month          0          1              FALSE    2         Jun: 2030, May: 1999
Delayed30      0          1              FALSE    2         No: 3432, Yes: 597

Variable type: numeric

skim_variable  n_missing  complete_rate  mean     sd       p0   p25   p50   p75   p100  hist
ID             0          1              2015.00  1163.22  1    1008  2015  3022  4029  ▇▇▇▇▇
FlightNo       0          1              827.10   551.31   71   371   691   787   2255  ▅▇▁▂▂
FlightLength   0          1              185.30   41.79    68   155   163   228   295   ▁▆▇▅▂
Delay          0          1              11.74    41.63    -19  -6    -3    5     693   ▇▁▁▁▁
The variables in the flightDelays dataset are:
flightDelays dataset variables

Variable      Description
Carrier       UA = United Airlines, AA = American Airlines
FlightNo      Flight number
Destination   Airport code
DepartTime    Scheduled departure time in 4 h intervals
Day           Day of the week
Month         May or June
Delay         Minutes flight delayed (negative indicates early departure)
Delayed30     Departure delayed more than 30 min? Yes or No
FlightLength  Length of time of flight (minutes)
Let us compute the proportion of times that each carrier's flights were delayed more than 20 min. We will conduct a two-sided test to see if the difference in these proportions is statistically significant.
We see that carrier AA has a 17.13% chance of a delay >= 20 min, while UA has a 22.26% chance; the difference is 5.13 percentage points. Is this statistically significant? We take the Delays for both carriers and perform a permutation test by shuffling the Carrier variable:
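The code for this test is not shown in the text; here is a sketch using mosaic::diffprop(). The Delayed20 variable is constructed here for illustration, since the dataset only ships with Delayed30:

```r
# flag flights delayed more than 20 minutes (hypothetical helper column)
flightDelays <- flightDelays %>%
  mutate(Delayed20 = ifelse(Delay > 20, "Yes", "No"))

# observed difference in proportions between the two carriers
obs_diff_prop <- diffprop(Delayed20 ~ Carrier, data = flightDelays)
obs_diff_prop

# null distribution by shuffling Carrier labels
null_dist_prop <- do(1000) * diffprop(Delayed20 ~ shuffle(Carrier), data = flightDelays)

# two-sided p-value
prop1(~ abs(diffprop) >= abs(obs_diff_prop), data = null_dist_prop)
```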
The resulting p-value is very small. Hence we reject the Null Hypothesis that there is no difference between the carriers on delay times.
Compute the variance in the flight delay lengths for each carrier. Conduct a test to see if the variance for United Airlines differs from that of American Airlines.
# There is no ready-made function in mosaic called `diffvar`... so we construct one
obs_diff_var <- diff(var(data = flightDelays, Delay ~ Carrier))
obs_diff_var
UA
431.0677
The difference in variances in Delay between the two carriers is \(431.0677\) (labelled UA in the output, since diff() subtracts the first group's variance from the second's). In our permutation test, we shuffle the Carrier variable:
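The permutation code for the variance difference is not shown in the text; a sketch follows (the column name UA is assumed to carry over from the named result of diff(), as in the observed statistic above):

```r
# null distribution of the variance difference under shuffled Carrier labels
null_dist_var <- do(1000) * diff(var(data = flightDelays, Delay ~ shuffle(Carrier)))

gf_histogram(~UA, data = null_dist_var) %>%
  gf_vline(xintercept = obs_diff_var, colour = "red")

# two-sided p-value for a difference in variances
prop1(~ abs(UA) >= abs(obs_diff_var), data = null_dist_var)
```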
Clearly there is no case for a significant difference in variances!
3.6 Case Study #6: Walmart vs Target
Is there a difference in the price of groceries sold by the two retailers Target and Walmart? The data set Groceries contains a sample of grocery items and their prices advertised on their respective web sites on one specific day.
Inspect the data set, then explain why this is an example of matched pairs data.
Compute summary statistics of the prices for each store.
Conduct a permutation test to determine whether or not there is a difference in the mean prices.
Create a bar chart of the difference in prices. What is unusual about Quaker Oats Life cereal?
Redo the hypothesis test without this observation. Do you reach the same conclusion?
groceries <- resampledata3::Groceries %>% # From resampledata3 package
  mutate(Product = stringr::str_squish(Product))
groceries
skim(groceries)
Data summary

Name: groceries
Number of rows: 30
Number of columns: 4
Column type frequency: character 1, factor 1, numeric 2
Group variables: None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
Product        0          1              8    56   0      30        0

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
Size           0          1              FALSE    24        18o: 3, 12o: 2, 15o: 2, 16o: 2

Variable type: numeric

skim_variable  n_missing  complete_rate  mean  sd    p0    p25   p50   p75   p100  hist
Target         0          1              2.76  1.58  0.99  1.83  2.54  3.14  7.99  ▇▇▁▁▁
Walmart        0          1              2.71  1.56  1.00  1.76  2.34  2.96  6.98  ▇▅▁▁▂
We see that the comparison is to be made between two prices for the same product, and hence this is one more example of paired data, as in Case Study #4. Let us plot the prices for the products:
gf_col(
  data = groceries,
  Target ~ Product,
  fill = "#0073C299", width = 0.5
) %>%
  gf_col(
    data = groceries,
    -Walmart ~ Product,
    fill = "#EFC00099", ylab = "Prices", width = 0.5
  ) %>%
  gf_col(
    data = groceries %>%
      filter(Product == "Quaker Oats Life Cereal Original"),
    -Walmart ~ Product,
    fill = "red", width = 0.5
  ) %>%
  gf_theme(theme_classic()) %>%
  gf_theme(ggplot2::theme(axis.text.x = element_text(
    size = 8, face = "bold", vjust = 0, hjust = 1
  ))) %>%
  gf_theme(ggplot2::coord_flip())
Figure 1: Walmart vs Target Prices
We see that the price difference between Walmart and Target prices is highest for the Product named Quaker Oats Life Cereal Original. Let us check the mean difference in prices:
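The computation behind this check is not shown in the text; a sketch, mirroring the sign-flipping idiom used below for the reduced data set:

```r
# observed mean price difference across the 30 matched products
obs_diff_price <- mean(~ Walmart - Target, data = groceries)
obs_diff_price

# randomly flip the sign of each paired difference (30 products, balanced polarity)
polarity <- c(rep(1, 15), rep(-1, 15))
null_dist_price <- do(10000) * mean(
  ~ (Walmart - Target) * resample(polarity, replace = TRUE),
  data = groceries
)

# two-sided p-value
prop1(~ abs(mean) >= abs(obs_diff_price), data = null_dist_price)
```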
There does not seem to be any significant difference in prices…
Suppose we knock off the Quaker Cereal data item…
which(groceries$Product == "Quaker Oats Life Cereal Original")
[1] 2
groceries_less <- groceries[-2, ]
groceries_less
obs_diff_price_less <- mean(~ Walmart - Target, data = groceries_less)
obs_diff_price_less
[1] -0.1558621
polarity_less <- c(rep(1, 15), rep(-1, 14)) # Due to resampling this small bias makes no difference
null_dist_price_less <- do(100000) * mean(
  data = groceries_less,
  ~ (Walmart - Target) * resample(polarity_less, replace = TRUE)
)
null_dist_price_less %>% head()
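A two-sided p-value for the reduced data set can then be read off the null distribution in the usual way (a sketch; the column name mean follows the do() conventions used above):

```r
# proportion of permuted means at least as extreme as the observed one
prop1(~ abs(mean) >= abs(obs_diff_price_less), data = null_dist_price_less)
```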
3.7 Case Study 7: Proportions between Categorical Variables
Let us try a dataset with qualitative / categorical data. This is a General Social Survey dataset, in which people with different levels of Education state their opinion on the Death Penalty. We want to know if these two categorical variables are associated, i.e., can the opinions in favour of the Death Penalty be explained by the Education level?
Since the data are categorical, we need to take counts in a table and then implement a chi-square test. In the test, we will permute the Education variable to see how significant its effect is.
GSS2002 <- resampledata::GSS2002 # From resampledata package
skim(GSS2002)
Data summary

Name: GSS2002
Number of rows: 2765
Number of columns: 21
Column type frequency: factor 20, numeric 1
Group variables: None

Variable type: factor (all unordered)

skim_variable  n_missing  complete_rate  n_unique  top_counts
Region         0          1.00           7         Nor: 684, Sou: 486, Sou: 471, Mid: 435
Gender         0          1.00           2         Fem: 1537, Mal: 1228
Race           0          1.00           3         Whi: 2188, Bla: 410, Oth: 167
Education      5          1.00           5         HS: 1485, Bac: 443, Lef: 400, Gra: 230
Marital        0          1.00           5         Mar: 1269, Nev: 708, Div: 445, Wid: 247
Religion       19         0.99           13        Pro: 1460, Cat: 673, Non: 379, Chr: 65
Happy          1396       0.50           3         Pre: 784, Ver: 415, Not: 170
Income         890        0.68           24        400: 170, 300: 166, 250: 140, 500: 136
PolParty       36         0.99           8         Ind: 528, Not: 515, Not: 449, Str: 408
Politics       1434       0.48           7         Mod: 522, Con: 210, Sli: 209, Sli: 159
Marijuana      1914       0.31           2         Not: 545, Leg: 306
DeathPenalty   1457       0.47           2         Fav: 899, Opp: 409
OwnGun         1841       0.33           3         No: 605, Yes: 310, Ref: 9
GunLaw         1849       0.33           2         Fav: 737, Opp: 179
SpendMilitary  1441       0.48           3         Abo: 615, Too: 414, Too: 295
SpendEduc      1422       0.49           3         Too: 992, Abo: 278, Too: 73
SpendEnv       1443       0.48           3         Too: 793, Abo: 439, Too: 90
SpendSci       1499       0.46           3         Abo: 629, Too: 461, Too: 176
Pres00         1016       0.63           5         Bus: 885, Gor: 781, Nad: 57, Oth: 16
Postlife       1554       0.44           2         Yes: 975, No: 236

Variable type: numeric

skim_variable  n_missing  complete_rate  mean  sd      p0  p25  p50   p75   p100  hist
ID             0          1              1383  798.33  1   692  1383  2074  2765  ▇▇▇▇▇
Note how all variables are categorical! Education has five levels:
GSS2002 %>% count(Education)
GSS2002 %>% count(DeathPenalty)
Let us drop the NA entries in Education and DeathPenalty, and set up a table for the chi-square test.
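The chunk creating the gss2002 object used below is not shown in the text; a sketch of what it presumably does:

```r
# keep only the two variables of interest and drop rows with NA in either
gss2002 <- GSS2002 %>%
  dplyr::select(Education, DeathPenalty) %>%
  tidyr::drop_na()
```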
gss_summary <- gss2002 %>%
  mutate(
    Education = factor(Education,
      levels = c("Bachelors", "Graduate", "Jr Col", "HS", "Left HS"),
      labels = c("Bachelors", "Graduate", "Jr Col", "HS", "Left HS")
    ),
    DeathPenalty = as.factor(DeathPenalty)
  ) %>%
  group_by(Education, DeathPenalty) %>%
  summarise(count = n()) %>% # This is good for a chisq test
  # Add two more columns to facilitate a mosaic / Marimekko plot
  mutate(
    edu_count = sum(count),
    edu_prop = count / sum(count)
  ) %>%
  ungroup()
gss_summary
# We can plot a heatmap-like `mosaic chart` for this table, using `ggplot`:
# https://stackoverflow.com/questions/19233365/how-to-create-a-marimekko-mosaic-plot-in-ggplot2
ggplot(data = gss_summary, aes(x = Education, y = edu_prop)) +
  geom_bar(aes(width = edu_count, fill = DeathPenalty),
    stat = "identity", position = "fill", colour = "black"
  ) +
  geom_text(aes(label = scales::percent(edu_prop)),
    position = position_stack(vjust = 0.5)
  ) + # if labels are desired
  facet_grid(~Education, scales = "free_x", space = "free_x") +
  scale_fill_brewer(palette = "RdYlGn") +
  # theme(panel.spacing.x = unit(0, "npc")) + # if no spacing preferred between bars
  theme_void()
Let us now perform the base chisq test: We need a table and then the chisq test:
gss_table <- tally(DeathPenalty ~ Education, data = gss2002)
gss_table
Education
DeathPenalty Left HS HS Jr Col Bachelors Graduate
Favor 117 511 71 135 64
Oppose 72 200 16 71 50
# Get the observed chi-square statistic
observedChi2 <- mosaic::chisq(tally(DeathPenalty ~ Education, data = gss2002))
observedChi2
X.squared
23.45093
# Actual chi-square test
stats::chisq.test(tally(DeathPenalty ~ Education, data = gss2002))
Pearson's Chi-squared test
data: tally(DeathPenalty ~ Education, data = gss2002)
X-squared = 23.451, df = 4, p-value = 0.0001029
We should now repeat the test with permutations on Education:
null_chisq <- do(4999) * chisq.test(tally(DeathPenalty ~ shuffle(Education), data = gss2002))
head(null_chisq)
gf_histogram(~X.squared, data = null_chisq) %>%
  gf_vline(xintercept = observedChi2, color = "red")
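The permutation p-value can be read off the null distribution in the usual way (a sketch; the X.squared column name comes from the do() output shown above):

```r
# proportion of permuted chi-square statistics at least as large as observed
prop1(~ X.squared >= observedChi2, data = null_chisq)
```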
So we would conclude that Education has a significant effect on DeathPenalty opinion!