

Tutorial: Multiple Linear Regression with Backward Selection

Tags: Linear Regression, Quantitative Predictor, Quantitative Response, Sum of Squares, Residuals

Author: Arvind V.
Published: May 3, 2023
Modified: September 30, 2025

Abstract: Using Multiple Regression to predict Quantitative Target Variables

1 Setting up R Packages

options(scipen = 1, digits = 3) # print to three significant digits

library(tidyverse)
library(ggformula)
library(mosaic)
library(infer)
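
The workflow below also leans on two packages that are used via namespace calls and hence not attached here: mlbench (source of the BostonHousing2 data) and broom (model tidying). A minimal guard, in case they are missing (an addition, not part of the original setup):

# Used later via data(..., package = "mlbench") and broom:: calls
if (!requireNamespace("mlbench", quietly = TRUE)) install.packages("mlbench")
if (!requireNamespace("broom", quietly = TRUE)) install.packages("broom")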

Plot Fonts and Theme

library(systemfonts)
library(showtext)
library(tidyverse)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
  family = "Alegreya",
  regular = "../../../../../../../fonts/Alegreya-Regular.ttf",
  bold = "../../../../../../../fonts/Alegreya-Bold.ttf",
  italic = "../../../../../../../fonts/Alegreya-Italic.ttf",
  bolditalic = "../../../../../../../fonts/Alegreya-BoldItalic.ttf"
)

sysfonts::font_add(
  family = "Roboto Condensed",
  regular = "../../../../../../../fonts/RobotoCondensed-Regular.ttf",
  bold = "../../../../../../../fonts/RobotoCondensed-Bold.ttf",
  italic = "../../../../../../../fonts/RobotoCondensed-Italic.ttf",
  bolditalic = "../../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
  font <- "Alegreya" # assign font family up front
  "%+replace%" <- ggplot2::"%+replace%" # nolint

  ggplot2::theme_classic(base_size = 14, base_family = font) %+replace% # replace elements we want to change

    theme(
      text = element_text(family = font), # set base font family

      # text elements
      plot.title = element_text( # title
        family = font, # set font family
        size = 24, # set font size
        face = "bold", # bold typeface
        hjust = 0, # left align
        margin = margin(t = 5, r = 0, b = 5, l = 0)
      ), # margin
      plot.title.position = "plot",
      plot.subtitle = element_text( # subtitle
        family = font, # font family
        size = 14, # font size
        hjust = 0, # left align
        margin = margin(t = 5, r = 0, b = 10, l = 0)
      ), # margin

      plot.caption = element_text( # caption
        family = font, # font family
        size = 9, # font size
        hjust = 1
      ), # right align

      plot.caption.position = "plot", # right align

      axis.title = element_text( # axis titles
        family = "Roboto Condensed", # font family
        size = 12
      ), # font size

      axis.text = element_text( # axis text
        family = "Roboto Condensed", # font family
        size = 9
      ), # font size

      axis.text.x = element_text( # margin for axis text
        margin = margin(5, b = 10)
      )

      # since the legend often requires manual tweaking
      # based on plot content, don't define it here
    )
}

## Use available fonts in ggplot text geoms too!
ggplot2::update_geom_defaults(geom = "text", new = list(
  family = "Roboto Condensed",
  face = "plain",
  size = 3.5,
  color = "#2b2b2b"
))

## Set the theme
ggplot2::theme_set(new = theme_custom())

In this tutorial, we will use the Boston housing dataset. Our research question is:

Note: Research Question

How do we predict the price of a house in Boston, based on other Quantitative parameters such as area, location, rooms, and crime-rate in the neighbourhood?

And how do we choose the “best” model, based on a tradeoff between Model Complexity and Model Accuracy?

2 Workflow: Read the Data

data("BostonHousing2", package = "mlbench")
housing <- BostonHousing2
inspect(housing)

categorical variables:  
  name  class levels   n missing                                  distribution
1 town factor     92 506       0 Cambridge (5.9%) ...                         
2 chas factor      2 506       0 0 (93.1%), 1 (6.9%)                          

quantitative variables:  
      name   class       min       Q1   median       Q3      max     mean
1    tract integer   1.00000 1303.250 3393.500 3739.750 5082.000 2700.356
2      lon numeric -71.28950  -71.093  -71.053  -71.020  -70.810  -71.056
3      lat numeric  42.03000   42.181   42.218   42.252   42.381   42.216
4     medv numeric   5.00000   17.025   21.200   25.000   50.000   22.533
5    cmedv numeric   5.00000   17.025   21.200   25.000   50.000   22.529
6     crim numeric   0.00632    0.082    0.257    3.677   88.976    3.614
7       zn numeric   0.00000    0.000    0.000   12.500  100.000   11.364
8    indus numeric   0.46000    5.190    9.690   18.100   27.740   11.137
9      nox numeric   0.38500    0.449    0.538    0.624    0.871    0.555
10      rm numeric   3.56100    5.886    6.208    6.623    8.780    6.285
11     age numeric   2.90000   45.025   77.500   94.075  100.000   68.575
12     dis numeric   1.12960    2.100    3.207    5.188   12.127    3.795
13     rad integer   1.00000    4.000    5.000   24.000   24.000    9.549
14     tax integer 187.00000  279.000  330.000  666.000  711.000  408.237
15 ptratio numeric  12.60000   17.400   19.050   20.200   22.000   18.456
16       b numeric   0.32000  375.377  391.440  396.225  396.900  356.674
17   lstat numeric   1.73000    6.950   11.360   16.955   37.970   12.653
          sd   n missing
1  1380.0368 506       0
2     0.0754 506       0
3     0.0618 506       0
4     9.1971 506       0
5     9.1822 506       0
6     8.6015 506       0
7    23.3225 506       0
8     6.8604 506       0
9     0.1159 506       0
10    0.7026 506       0
11   28.1489 506       0
12    2.1057 506       0
13    8.7073 506       0
14  168.5371 506       0
15    2.1649 506       0
16   91.2949 506       0
17    7.1411 506       0

The original data are 506 observations on 14 variables, medv being the target variable:

  • crim: per capita crime rate by town
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus: proportion of non-retail business acres per town
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • nox: nitric oxides concentration (parts per 10 million)
  • rm: average number of rooms per dwelling
  • age: proportion of owner-occupied units built prior to 1940
  • dis: weighted distances to five Boston employment centres
  • rad: index of accessibility to radial highways
  • tax: full-value property-tax rate per USD 10,000
  • ptratio: pupil-teacher ratio by town
  • b: \(1000(B - 0.63)^2\), where B is the proportion of Blacks by town
  • lstat: percentage of lower status of the population
  • medv: median value of owner-occupied homes in USD 1000s

The corrected data set has the following additional columns:

  • cmedv: corrected median value of owner-occupied homes in USD 1000s
  • town: name of town
  • tract: census tract
  • lon: longitude of census tract
  • lat: latitude of census tract

Our response variable is cmedv, the corrected median value of owner-occupied homes in USD 1000s. There are many Quantitative feature variables that we can use to predict cmedv, and there are two Qualitative features, chas and town.
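
Later we will compare the maximal model against a simple model based on rm alone. As a quick baseline, here is a minimal sketch of that single-predictor model (housing_simple is a name introduced here for illustration):

# Baseline: predict cmedv from the average number of rooms alone
housing_simple <- lm(cmedv ~ rm, data = housing)
broom::glance(housing_simple)$r.squared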

3 Workflow: Correlations

We can use purrr to evaluate all pairwise correlations in one shot:

ggplot2::theme_set(new = theme_custom())
##
all_corrs <- housing %>%
  select(where(is.numeric)) %>%
  # leave off cmedv/medv to get all the remaining ones
  select(-cmedv, -medv) %>%
  # perform a cor.test for all variables against cmedv
  purrr::map(
    .x = .,
    .f = \(x) cor.test(x, housing$cmedv)
  ) %>%
  # tidy up the cor.test outputs into a tidy data frame
  map_dfr(broom::tidy, .id = "predictor")

all_corrs
all_corrs %>%
  gf_hline(
    yintercept = 0,
    color = "grey",
    linewidth = 2
  ) %>%
  gf_errorbar(
    conf.high + conf.low ~ reorder(predictor, estimate),
    colour = ~estimate,
    width = 0.5,
    linewidth = ~ -log10(p.value),
    caption = "Significance = -log10(p.value)"
  ) %>%
  gf_point(estimate ~ reorder(predictor, estimate)) %>%
  gf_labs(
    title = "Correlation Scores with Target Variable",
    x = "Predictors", y = "Correlation with Median House Price"
  ) %>%
  gf_theme(theme(axis.text.x = element_text(angle = 45, hjust = 1))) %>%
  gf_refine(
    scale_colour_distiller("Correlation", type = "div", palette = "RdBu"),
    scale_linewidth_continuous("Significance", range = c(0.25, 3)),
    guides(linewidth = guide_legend(reverse = TRUE))
  )

The variables rm and lstat show the strongest correlations with cmedv, and both are statistically significant.
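
As a spot-check of the plot, we can compute those two correlations directly (a small sketch using plain cor() on the same data):

# The two strongest correlations flagged above
housing %>%
  summarise(
    r_rm = cor(rm, cmedv),
    r_lstat = cor(lstat, cmedv)
  )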

4 Workflow: Maximal Multiple Regression Model

We will create a regression model for cmedv using all the other numerical predictor features in the dataset.

housing_numeric <- housing %>% select(
  where(is.numeric),

  # remove medv
  # an older version of cmedv
  -c(medv)
)
names(housing_numeric) # 16 variables, one target, 15 predictors
 [1] "tract"   "lon"     "lat"     "cmedv"   "crim"    "zn"      "indus"  
 [8] "nox"     "rm"      "age"     "dis"     "rad"     "tax"     "ptratio"
[15] "b"       "lstat"  
housing_maximal <- lm(cmedv ~ ., data = housing_numeric)
summary(housing_maximal)

Call:
lm(formula = cmedv ~ ., data = housing_numeric)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.934  -2.752  -0.619   1.711  26.120 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.45e+02   3.23e+02   -1.07  0.28734    
tract       -7.52e-04   4.46e-04   -1.69  0.09231 .  
lon         -6.79e+00   3.44e+00   -1.98  0.04870 *  
lat         -2.35e+00   5.36e+00   -0.44  0.66074    
crim        -1.09e-01   3.28e-02   -3.32  0.00097 ***
zn           4.40e-02   1.39e-02    3.17  0.00164 ** 
indus        2.75e-02   6.20e-02    0.44  0.65692    
nox         -1.55e+01   4.03e+00   -3.85  0.00014 ***
rm           3.81e+00   4.20e-01    9.07  < 2e-16 ***
age          5.82e-03   1.34e-02    0.43  0.66416    
dis         -1.38e+00   2.10e-01   -6.59  1.1e-10 ***
rad          2.36e-01   8.47e-02    2.78  0.00558 ** 
tax         -1.48e-02   3.74e-03   -3.96  8.5e-05 ***
ptratio     -9.49e-01   1.41e-01   -6.73  4.7e-11 ***
b            9.55e-03   2.67e-03    3.57  0.00039 ***
lstat       -5.46e-01   5.06e-02  -10.80  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.73 on 490 degrees of freedom
Multiple R-squared:  0.743, Adjusted R-squared:  0.735 
F-statistic: 94.3 on 15 and 490 DF,  p-value: <2e-16

The maximal model has an R.squared of \(0.7426\), which is much better than what we obtained for a simple model based on rm alone. How much can we simplify this maximal model without losing too much R.squared?
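
broom::glance() reports the same fit metrics as a tidy one-row data frame, a form we will lean on heavily below (a small illustration):

# One-row summary of the maximal model's fit metrics
broom::glance(housing_maximal) %>%
  select(r.squared, adj.r.squared, sigma, AIC)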

5 Workflow: Model Reduction

We now proceed naively by removing one predictor after another. We will resort to what may amount to p-hacking: sorting the predictors by their p-value¹ in the maximal model and removing them in decreasing order of p-value.

We will also use some powerful features of the purrr package (also part of the tidyverse) to create all of these models at once. We can then plot their R.squared values together and decide where we wish to trade off Explainability vs Complexity for our model.

# No of Quant predictor variables in the dataset
n_vars <- housing %>%
  select(where(is.numeric), -c(cmedv, medv)) %>%
  length()

# Maximal Model, now tidied
housing_maximal_tidy <-
  housing_maximal %>%
  broom::tidy() %>%
  # Obviously remove "Intercept" ;-D
  filter(term != "(Intercept)") %>%
  # And horrors! Sort variables by p.value
  arrange(p.value)

housing_maximal_tidy

The variables at the bottom of the table are clearly statistically insignificant: age, lat, and indus have p-values above 0.65, tract is above 0.05, and lon is only marginally significant. A quick look at the bottom of the sorted table confirms this:
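
# The least significant terms, from the bottom of the sorted table
housing_maximal_tidy %>%
  slice_tail(n = 5) %>%
  select(term, p.value)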

And now we unleash the purrr package to create all the simplified models at once. We will construct a dataset containing three columns:

  • A list of all quantitative predictor variables
  • A sequence of numbers from 1 to the number of predictors
  • A “list” column containing the housing data frame itself

We will use the iteration capability of purrr to sequentially drop one variable at a time from the maximal (15-predictor) model, build a new reduced model each time, and compute its r.squared:

housing_model_set <- tibble(
  all_vars =
    list(housing_maximal_tidy$term), # p-hacked order!!
  keep_vars = seq(1, n_vars),
  data = list(housing_numeric)
)

# Unleash purrr in a series of mutates
housing_model_set <- housing_model_set %>%
  # list of predictor variables for each model
  mutate(
    mod_vars =
      pmap(
        .l = list(all_vars, keep_vars, data),
        .f = \(all_vars, keep_vars, data) all_vars[1:keep_vars]
      )
  ) %>%
  # build formulae with these for linear regression
  mutate(
    formula =
      map(
        .x = mod_vars,
        .f = \(mod_vars) as.formula(paste(
          "cmedv ~", paste(mod_vars, collapse = "+")
        ))
      )
  ) %>%
  # use the formulae to build multiple linear models
  mutate(
    models =
      pmap(
        .l = list(data, formula),
        .f = \(data, formula) lm(formula, data = data)
      )
  )

# Tidy up the models using broom to expose their metrics
housing_models_tidy <- housing_model_set %>%
  mutate(
    tidy_models =
      map(
        .x = models,
        .f = \(x) broom::glance(x) # glance() reports fit metrics; it takes no conf args
      )
  ) %>%
  # Remove unwanted columns, keep model and predictor count
  select(keep_vars, tidy_models) %>%
  unnest(tidy_models)
housing_models_tidy
ggplot2::theme_set(new = theme_custom())
##
housing_models_tidy %>%
  gf_line(
    r.squared ~ keep_vars,
    ylab = "R.Squared",
    xlab = "No. params in the Linear Model",
    title = "Model Explainability vs Complexity",
    subtitle = "Model r.squared vs No. of Predictors",
    data = .
  ) %>%
  # Plot r.squared vs predictor count
  gf_point(r.squared ~ keep_vars,
    size = 3.5, color = "grey"
  ) %>%
  # Show off the selected best model
  gf_point(
    r.squared ~ keep_vars,
    size = 3.5,
    color = "red",
    data = housing_models_tidy %>% filter(keep_vars == 4)
  ) %>%
  gf_hline(yintercept = 0.7, linetype = "dashed")
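
Before settling on a size, we can read the exact trade-off off the tidy metrics table computed above (a small check):

# Compare fit metrics for the 4-predictor and maximal (15-predictor) models
housing_models_tidy %>%
  filter(keep_vars %in% c(4, n_vars)) %>%
  select(keep_vars, r.squared, adj.r.squared)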

At the loss of some 5% in r.squared, we can drop our model complexity from 15 predictors to just 4! Filtering the set of models for the one with 4 predictors, we obtain the final model:

ggplot2::theme_set(new = theme_custom())
##
housing_model_final <-
  housing_model_set %>%
  # filter for best model, with 4 variables
  filter(keep_vars == 4) %>%
  # tidy up the model
  mutate(
    tidy_models =
      map(
        .x = models,
        .f = \(x) broom::tidy(x,
          conf.int = TRUE,
          conf.level = 0.95
        )
      )
  ) %>%
  # Remove unwanted columns, keep model and predictor count
  select(keep_vars, models, tidy_models) %>%
  unnest(tidy_models) %>%
  select(-c(1, 2))

housing_model_final

which we can state in equation form as:

\[ \widehat{cmedv} \sim 24.459 - 0.563 * dis - 0.673 * lstat - 0.957 * ptratio + 4.199 * rm \]
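
To use this reduced model for prediction, we can refit the same four-predictor formula and call predict(); the values in new_house below are invented purely for illustration:

# Refit the 4-predictor model and predict for a hypothetical house
final_lm <- lm(cmedv ~ dis + lstat + ptratio + rm, data = housing_numeric)
new_house <- tibble(dis = 4, lstat = 10, ptratio = 18, rm = 6.5)
predict(final_lm, newdata = new_house, interval = "prediction")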

And plotting this model, we have:

housing_model_final %>%
  # And plot the model
  # Remove the intercept term
  filter(term != "(Intercept)") %>%
  gf_col(estimate ~ term, fill = ~term, width = 0.25) %>%
  gf_hline(yintercept = 0) %>%
  gf_errorbar(conf.low + conf.high ~ term,
    width = 0.1,
    title = "Multiple Regression",
    subtitle = "Model Estimates with Confidence Intervals"
  )

6 Workflow: Diagnostics

Let us use broom::augment to calculate residuals and predictions to arrive at a quick set of diagnostic plots.

housing_model_final_augment <-
  housing_model_set %>%
  filter(keep_vars == 4) %>%
  # augment the model
  mutate(
    augment_models =
      map(
        .x = models,
        .f = \(x) broom::augment(x)
      )
  ) %>%
  unnest(augment_models) %>%
  select(cmedv:last_col())

housing_model_final_augment
ggplot2::theme_set(new = theme_custom())
##
housing_model_final_augment %>%
  gf_point(.resid ~ .fitted, title = "Residuals vs Fitted") %>%
  gf_smooth()
##
housing_model_final_augment %>%
  gf_qq(~.std.resid, title = "Q-Q Residuals") %>%
  gf_qqline()
##
housing_model_final_augment %>%
  gf_point(sqrt(abs(.std.resid)) ~ .fitted, # abs() avoids NaNs for negative residuals
    title = "Scale-Location Plot"
  ) %>%
  gf_smooth()
##
housing_model_final_augment %>%
  gf_point(.std.resid ~ .hat,
    title = "Residuals vs Leverage"
  ) %>%
  gf_smooth()

The residuals plot shows a curved trend, and certainly does not resemble stars in the night sky, so it is possible that we have left out some richness in our model-making: a “systematic inadequacy”.

The Q-Q plot of residuals also shows a J-shape, which indicates a right-skewed, non-normal distribution of residuals.
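
We can back this visual impression with a quick numerical check, a sketch using base R's Shapiro-Wilk test on the standardized residuals:

# Formal test of residual normality (n = 506, within shapiro.test's limit)
shapiro.test(housing_model_final_augment$.std.resid)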

These patterns could indicate that a more complex model (e.g. a linear model with interactions between variables, i.e. product terms) may be necessary.
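
As a sketch of such a model, here is one possible interaction between the two strongest predictors; the terms are chosen for illustration, not tuned:

# Illustrative only: interact rm with lstat, keep the other two main effects
housing_interact <- lm(cmedv ~ rm * lstat + ptratio + dis,
  data = housing_numeric
)
broom::glance(housing_interact) %>% select(r.squared, adj.r.squared)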

7 Conclusion

We have used a multiple-regression workflow that takes all predictor variables into account in a linear model, and then systematically simplified that model such that the performance was just adequate.

The models we chose were all linear, of course, but without interaction terms: each predictor was used only for its main effect. The diagnostic plots did reveal some shortcomings, which a more complex model could overcome. Such a model might include selected interactions, transformations of the target (\(cmedv^2\) or \(\sqrt{cmedv}\)), and transformations of some selected predictors.
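
For instance, here is a sketch of a sqrt-transformed target with the same four predictors; note that r.squared is now on the sqrt scale, so it is not directly comparable to the earlier models:

# Illustrative only: transform the target, keep the reduced predictor set
housing_sqrt <- lm(sqrt(cmedv) ~ dis + lstat + ptratio + rm,
  data = housing_numeric
)
broom::glance(housing_sqrt) %>% select(r.squared, adj.r.squared)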

8 References

James, Witten, Hastie, Tibshirani, An Introduction to Statistical Learning. Chapter 3. Linear Regression. https://hastie.su.domains/ISLR2/Labs/Rmarkdown_Notebooks/Ch3-linreg-lab.html

R Package Citations
Package    Version  Citation
corrgram   1.14     Wright (2021)
corrplot   0.95     Wei and Simko (2024)
GGally     2.4.0    Schloerke et al. (2025)
gt         1.1.0    Iannone et al. (2025)
infer      1.0.9    Couch et al. (2021)
ISLR       1.4      James et al. (2021)
janitor    2.2.1    Firke (2024)
reghelper  1.1.2    Hughes and Beiner (2023)
Couch, Simon P., Andrew P. Bray, Chester Ismay, Evgeni Chasnovski, Benjamin S. Baumer, and Mine Çetinkaya-Rundel. 2021. “infer: An R Package for Tidyverse-Friendly Statistical Inference.” Journal of Open Source Software 6 (65): 3661. https://doi.org/10.21105/joss.03661.
Firke, Sam. 2024. janitor: Simple Tools for Examining and Cleaning Dirty Data. https://doi.org/10.32614/CRAN.package.janitor.
Hughes, Jeffrey, and David Beiner. 2023. reghelper: Helper Functions for Regression Analysis. https://doi.org/10.32614/CRAN.package.reghelper.
Iannone, Richard, Joe Cheng, Barret Schloerke, Ellis Hughes, Alexandra Lauer, JooYoung Seo, Ken Brevoort, and Olivier Roy. 2025. gt: Easily Create Presentation-Ready Display Tables. https://doi.org/10.32614/CRAN.package.gt.
James, Gareth, Daniela Witten, Trevor Hastie, and Rob Tibshirani. 2021. ISLR: Data for an Introduction to Statistical Learning with Applications in r. https://doi.org/10.32614/CRAN.package.ISLR.
Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2025. GGally: Extension to “ggplot2”. https://doi.org/10.32614/CRAN.package.GGally.
Wei, Taiyun, and Viliam Simko. 2024. R Package “corrplot”: Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.
Wright, Kevin. 2021. corrgram: Plot a Correlogram. https://doi.org/10.32614/CRAN.package.corrgram.

Footnotes

  1. James, Witten, Hastie, Tibshirani, An Introduction to Statistical Learning. Chapter 3: Linear Regression. https://hastie.su.domains/ISLR2/Labs/Rmarkdown_Notebooks/Ch3-linreg-lab.html

Citation

BibTeX citation:
@online{v.2023,
  author = {V., Arvind},
  title = {Tutorial: {Multiple} {Linear} {Regression} with {Backward}
    {Selection}},
  date = {2023-05-03},
  url = {https://madhatterguide.netlify.app/content/courses/Analytics/30-Modelling/Modules/20-LinReg/files/backward-selection-1.html},
  langid = {en},
  abstract = {Using Multiple Regression to predict Quantitative Target
    Variables}
}
For attribution, please cite this work as:
V., Arvind. 2023. “Tutorial: Multiple Linear Regression with Backward Selection.” May 3, 2023. https://madhatterguide.netlify.app/content/courses/Analytics/30-Modelling/Modules/20-LinReg/files/backward-selection-1.html.

License: CC BY-SA 2.0
