2023-04-13
library(systemfonts)
library(showtext)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
family = "Alegreya",
regular = "../../../../../../fonts/Alegreya-Regular.ttf",
bold = "../../../../../../fonts/Alegreya-Bold.ttf",
italic = "../../../../../../fonts/Alegreya-Italic.ttf",
bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)
sysfonts::font_add(
family = "Roboto Condensed",
regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
theme_bw(base_size = 10) +
# theme(panel.widths = unit(11, "cm"),
# panel.heights = unit(6.79, "cm")) + # Golden Ratio
theme(
plot.margin = margin_auto(t = 1, r = 2, b = 1, l = 1, unit = "cm"),
plot.background = element_rect(
fill = "bisque",
colour = "black",
linewidth = 1
)
) +
theme_sub_axis(
title = element_text(
family = "Roboto Condensed",
size = 10
),
text = element_text(
family = "Roboto Condensed",
size = 8
)
) +
theme_sub_legend(
text = element_text(
family = "Roboto Condensed",
size = 6
),
title = element_text(
family = "Alegreya",
size = 8
)
) +
theme_sub_plot(
title = element_text(
family = "Alegreya",
size = 14, face = "bold"
),
title.position = "plot",
subtitle = element_text(
family = "Alegreya",
size = 10
),
caption = element_text(
family = "Alegreya",
size = 6
),
caption.position = "plot"
)
}
## Use available fonts in ggplot text geoms too!
ggplot2::update_geom_defaults(geom = "text", new = list(
family = "Roboto Condensed",
face = "plain",
size = 3.5,
color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label", new = list(
family = "Roboto Condensed",
face = "plain",
size = 3.5,
color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "marquee", new = list(
family = "Roboto Condensed",
face = "plain",
size = 3.5,
color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "text_repel", new = list(
family = "Roboto Condensed",
face = "plain",
size = 3.5,
color = "#2b2b2b"
))
ggplot2::update_geom_defaults(geom = "label_repel", new = list(
family = "Roboto Condensed",
face = "plain",
size = 3.5,
color = "#2b2b2b"
))
## Set the theme
ggplot2::theme_set(new = theme_custom())
## tinytable options
options("tinytable_tt_digits" = 2)
options("tinytable_format_num_fmt" = "significant_cell")
options(tinytable_html_mathjax = TRUE)
## Set defaults for flextable
flextable::set_flextable_defaults(font.family = "Roboto Condensed")
Sometimes the dependent variable is Qualitative: an either/or categorization. For example, the variable we want to predict might be won or lost the contest, has an ailment or not, voted or not in the last election, or graduated from college or not. There might even be more than two categories, such as voted for Congress, BJP, or Independent; or never smoker, former smoker, or current smoker.
We saw with the General Linear Model that it models the mean of a target Quantitative variable as a linear weighted sum of the predictor variables:
\[ \Large{y \sim N(x_i^T * \beta, ~~\sigma^2)} \tag{1}\]
This model is considered to be general because of the dependence on potentially more than one explanatory variable, in contrast to the simple linear model \(y = \beta_0 + \beta_1*x_1 + \epsilon\). The general linear model gives us model “shapes” that range from a simple straight line to a p-dimensional hyperplane.
Although the general linear model is a very useful framework, there are some situations where it is not appropriate:
How do we use the familiar linear model framework when the target/dependent variable is Categorical?
Recall that we spoke of dummy-encoded Qualitative **predictor** variables for our linear models, and how we would encode them using numerical values such as 0 and 1, or +1 and -1. Could we try the same approach for a target categorical variable?
\[ Y_i = \beta_0 + \beta_1*X_i + \epsilon_i \]
where
\[ \begin{align} Y_i &= 0 ~~ if ~~ "No"\\ &= 1 ~~ if ~~ "Yes" \end{align} \]
Sadly, this seems not to work for categorical dependent variables using a simple linear model as before. Consider the Credit Card Default data from the package ISLR.
We see that balance and income are quantitative predictors; student is a qualitative predictor, and default is a qualitative target variable. If we naively fit a linear model, say model = lm(default ~ balance, data = Default), and plot it, then…
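For concreteness, here is a minimal sketch of this naive fit (an assumption about how Figure 1 was produced, not the document’s own chunk; since lm() will not accept a factor response, default is dummy-coded to 0/1 first):
library(ISLR)
library(dplyr)
library(ggformula)
## Dummy-code the factor target so that lm() will accept it
Default_num <- Default %>%
  mutate(default_num = if_else(default == "Yes", 1, 0))
naive_model <- lm(default_num ~ balance, data = Default_num)
## Scatter plot with the naive straight-line fit overlaid
gf_point(default_num ~ balance, data = Default_num, alpha = 0.2) %>%
  gf_lm(colour = "red")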
It is pretty clear from Figure 1 that something is very odd (no pun intended! See below!). If the only possible values for default are \(No = 0\) and \(Yes = 1\), how could we interpret a predicted value of, say, \(Y_i = 0.25\), \(Y_i = 1.55\), or perhaps \(Y_i = -0.22\)? Anything other than Yes/No is hard to interpret!
Where do we go from here?
Let us state what we might desire of our model: despite this setback, we would still like our model to be as close as possible to the familiar linear model equation.
\[ Y_i = \beta_0 + \beta_1*X_i + \epsilon_i \]
where
\[ \begin{align} Y_i &= 0 ~~ if ~~ "No"\\ &= 1 ~~ if ~~ "Yes" \end{align} \tag{2}\]
We have quantitative predictors, so we still want to use a linear-weighted sum for the RHS (i.e. the predictor side) of the model equation. What can we do to make this work, especially for the LHS (i.e. the target side)? In dummy-encoding our target variable, we obtained a range of \([0,1]\), which is the same range as that of a probability! Could we use the probability of the outcome as our target, even though we are interested in binary outcomes? This would still leave us with a range of \([0,1]\) for the target variable, as before.
Binomially distributed target variable
If we map our Categorical/Qualitative target variable into a Quantitative probability, we immediately need to look at the LINE assumptions in linear regression.
In linear regression, we assume a normally distributed target variable, i.e. the residuals/errors around the predicted value are normally distributed. With a categorical target variable taking two levels \(0\) and \(1\), it would be impossible for the errors \(e_i = Y_i - \hat{Y_i}\) to have a normal distribution, as assumed for the statistical tests to be valid: the errors are bounded! One candidate for the error distribution in this case is the binomial distribution, whose mean and variance are \(np\) and \(np(1-p)\) respectively.
Note immediately that the binomial variance moves with the mean! The LINE assumptions of normality and equal variance are clearly violated. And, as the figure above shows, extreme probabilities (near 1 or 0) are more stable (i.e., have less error variance) than middle probabilities. So the model has “built-in” heteroscedasticity, which we need to counter with transformations such as the \(log()\) function. More on this very shortly!
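A quick sketch (not part of the original text) of how this variance moves with the mean, for the Bernoulli case \(n = 1\), where the variance is \(p(1-p)\):
library(ggformula)
## Bernoulli variance p(1 - p) as a function of the mean p:
## largest at p = 0.5, shrinking towards zero at the extremes
var_dat <- tibble::tibble(
  p = seq(0.01, 0.99, by = 0.01),
  variance = p * (1 - p)
)
gf_line(variance ~ p, data = var_dat) %>%
  gf_labs(x = "p (mean)", y = "Variance p(1 - p)")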
How would one “extend” the range of a target variable from \([0,1]\) to \([-\infty, \infty]\)? One step would be to try the odds of the outcome, instead of trying to predict the outcomes directly (Yes or No), or their probabilities in \([0,1]\).
Odds
The odds of an event with probability \(p\) of occurrence are defined as \(Odds = p/(1-p)\). As can be seen, the odds are the ratio of two probabilities: that of the event, and that of its complement. In the Default dataset just considered, the odds of default and the odds of non-default can be calculated as:
\[ \begin{align} p(Default) &= 333/(333 + 9667)\\ \nonumber &= 0.0333\\ \nonumber \end{align} \]
therefore:
\[ \begin{align} Odds~of~Default &=p(Default)/(1-p(Default))\\ \nonumber &= 0.0333/(1-0.0333)\\ \nonumber &= 0.034\\ \end{align} \]
and
\[ Odds~No~Default = 0.9667/(1-0.9667) = 29 \].
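We can verify this arithmetic directly in R (a small sketch using the counts quoted above):
## Odds from the Default counts: 333 defaulters out of 10000 customers
p_default <- 333 / (333 + 9667) # = 0.0333
p_default / (1 - p_default)     # odds of default, approx 0.034
(1 - p_default) / p_default     # odds of no default, approx 29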
Now, odds cover half of the real number line, i.e. \([0, \infty]\)! Clearly, when the probability \(p\) of an event is \(0\), the odds are \(0\)…and as it nears \(1\), the odds tend to \(\infty\). So we have transformed a simple probability lying in \([0,1]\) to odds lying in \([0, \infty]\). That is one step towards making a linear model possible; we have “removed” one of the limits on our linear model’s prediction range by using Odds as our target variable.
log()?
We need one more leap of faith: how do we convert a \([0, \infty]\) range into a \([-\infty, \infty]\) one? Can we try a log transformation?
\[ log([0, \infty]) ~ = ~ [-\infty, \infty] \]
This extends the range of our Qualitative target to the same as with a Quantitative target!
There is an additional benefit of this log() transformation: it helps with the error distributions when odds are the target. See the plot below. Odds are necessarily a nonlinear function of probability; the slope of Odds ~ probability also depends upon the probability itself, as we saw with the probability curve earlier.
To understand this issue intuitively, consider what happens with, say, a 20% change in the odds ratio near 1.0. If the odds ratio is \(1.0\), then the probabilities \(p\) and \(1-p\) are \(0.5\) and \(0.5\). A 20% increase in the odds ratio to \(1.20\) would correspond to probabilities of \(0.545\) and \(0.455\). However, if the original probabilities were \(0.9\) and \(0.1\), for an odds ratio of \(9\), then a 20% increase (in odds ratio) to \(10.8\) would correspond to probabilities of \(0.915\) and \(0.085\), a much smaller change in the probabilities. The basic curve is non-linear, and the log transformation flattens this out to provide a more linear relationship, which is what we desire.
So in our model, instead of modeling odds as the dependent variable, we will use \(log(odds)\), also known as the logit, defined as:
\[ \begin{align} log(odds_i) &= log\bigg[p_i/(1-p_i)\bigg]\\ \nonumber &= logit(p_i)\\ \end{align} \tag{3}\]
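In R, the logit and its inverse are available as qlogis() and plogis(); a quick sketch of how they map \([0,1]\) onto the whole real line and back:
p <- c(0.001, 0.25, 0.5, 0.75, 0.999)
qlogis(p)         # logit: log(p / (1 - p)), ranges over (-Inf, Inf)
plogis(qlogis(p)) # inverse logit recovers the original probabilities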
This is our Logistic Regression Model, which uses a Quantitative predictor variable to predict a Categorical target variable. We write the model (for the Default dataset) as:
\[ \Large{logit(default) = \beta_0 + \beta_1 * balance} \tag{4}\]
This means that:
\[ log(p(default)/(1-p(default))) = \beta_0+\beta_1 * balance \] and therefore:
\[ \begin{align} p(default) &= \frac{exp(\beta_0 + \beta_1 * balance)}{1 + exp(\beta_0 + \beta_1 * balance)}\\ &= \frac{1}{1 + e^{-(\beta_0 + \beta_1 * balance)}} \end{align} \tag{5}\]
From Equation 4 above, it should be clear that a unit increase in balance increases the log-odds of default by \(\beta_1\) units, i.e. it multiplies the odds of default by \(e^{\beta_1}\). The RHS of Equation 5 is a sigmoid function of the weighted sum of predictors, and is limited to the range \([0,1]\).
If we were to include income also as a predictor variable in the model, we might obtain something like:
\[ \begin{align} p(default) &= \frac{exp(\beta_0 + \beta_1 * balance + \beta_2 * income)}{1 + exp(\beta_0 + \beta_1 * balance + \beta_2 * income)}\\ &= \frac{1}{1 + e^{-(\beta_0 + \beta_1 * balance + \beta_2 * income)}} \end{align} \tag{6}\]
This model, Equation 6, is plotted a little differently, since it involves three variables. We will see this shortly, with code. The thing to note is that the formula inside the exp() is a linear combination of the predictors!
The parameters \(\beta_i\) now need to be estimated. How might we do that? The last problem is that, because we have made so many transformations to get to the logits that we want to model, the logic of minimizing the sum of squared errors (SSE) is no longer appropriate.
Infinite SSE!!
The probabilities for default are \(0\) and \(1\). At these values the log(odds) maps respectively to \(-\infty\) and \(\infty\) 🙀. So if we naively try to take residuals, we will find that they are all \(\infty\)!! Hence the Sum of Squared Errors (SSE) cannot be computed, and we need another way to assess the quality of our model.
Instead, we will have to use maximum likelihood estimation (MLE) to estimate the model. The maximum likelihood method maximizes the probability of obtaining the data at hand over every choice of model parameters \(\beta_i\). (We then use the \(\chi^2\) (“chi-squared”) test and statistic, instead of \(t\) and \(F\), to evaluate model comparisons.)
Let us proceed with the logistic regression workflow. We will use the well-known Wisconsin breast cancer dataset, readily available from Vincent Arel-Bundock’s website.
Rows: 569
Columns: 32
$ rownames <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
$ x_radius_mean <dbl> 13.540, 13.080, 9.504, 13.030, 8.196, 12.050, 13.4…
$ x_texture_mean <dbl> 14.36, 15.71, 12.44, 18.42, 16.84, 14.63, 22.30, 2…
$ x_perimeter_mean <dbl> 87.46, 85.63, 60.34, 82.61, 51.71, 78.04, 86.91, 7…
$ x_area_mean <dbl> 566.3, 520.0, 273.9, 523.8, 201.9, 449.3, 561.0, 4…
$ x_smoothness_mean <dbl> 0.09779, 0.10750, 0.10240, 0.08983, 0.08600, 0.103…
$ x_compactness_mean <dbl> 0.08129, 0.12700, 0.06492, 0.03766, 0.05943, 0.090…
$ x_concavity_mean <dbl> 0.066640, 0.045680, 0.029560, 0.025620, 0.015880, …
$ x_concave_pts_mean <dbl> 0.047810, 0.031100, 0.020760, 0.029230, 0.005917, …
$ x_symmetry_mean <dbl> 0.1885, 0.1967, 0.1815, 0.1467, 0.1769, 0.1675, 0.…
$ x_fractal_dim_mean <dbl> 0.05766, 0.06811, 0.06905, 0.05863, 0.06503, 0.060…
$ x_radius_se <dbl> 0.2699, 0.1852, 0.2773, 0.1839, 0.1563, 0.2636, 0.…
$ x_texture_se <dbl> 0.7886, 0.7477, 0.9768, 2.3420, 0.9567, 0.7294, 1.…
$ x_perimeter_se <dbl> 2.058, 1.383, 1.909, 1.170, 1.094, 1.848, 1.735, 2…
$ x_area_se <dbl> 23.560, 14.670, 15.700, 14.160, 8.205, 19.870, 20.…
$ x_smoothness_se <dbl> 0.008462, 0.004097, 0.009606, 0.004352, 0.008968, …
$ x_compactness_se <dbl> 0.014600, 0.018980, 0.014320, 0.004899, 0.016460, …
$ x_concavity_se <dbl> 0.023870, 0.016980, 0.019850, 0.013430, 0.015880, …
$ x_concave_pts_se <dbl> 0.013150, 0.006490, 0.014210, 0.011640, 0.005917, …
$ x_symmetry_se <dbl> 0.01980, 0.01678, 0.02027, 0.02671, 0.02574, 0.014…
$ x_fractal_dim_se <dbl> 0.002300, 0.002425, 0.002968, 0.001777, 0.002582, …
$ x_radius_worst <dbl> 15.110, 14.500, 10.230, 13.300, 8.964, 13.760, 15.…
$ x_texture_worst <dbl> 19.26, 20.49, 15.66, 22.81, 21.96, 20.70, 31.82, 2…
$ x_perimeter_worst <dbl> 99.70, 96.09, 65.13, 84.46, 57.26, 89.88, 99.00, 8…
$ x_area_worst <dbl> 711.2, 630.5, 314.9, 545.9, 242.2, 582.6, 698.8, 5…
$ x_smoothness_worst <dbl> 0.14400, 0.13120, 0.13240, 0.09701, 0.12970, 0.149…
$ x_compactness_worst <dbl> 0.17730, 0.27760, 0.11480, 0.04619, 0.13570, 0.215…
$ x_concavity_worst <dbl> 0.239000, 0.189000, 0.088670, 0.048330, 0.068800, …
$ x_concave_pts_worst <dbl> 0.12880, 0.07283, 0.06227, 0.05013, 0.02564, 0.065…
$ x_symmetry_worst <dbl> 0.2977, 0.3184, 0.2450, 0.1987, 0.3105, 0.2747, 0.…
$ x_fractal_dim_worst <dbl> 0.07259, 0.08183, 0.07773, 0.06169, 0.07409, 0.083…
$ y <chr> "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", …
Name | cancer |
Number of rows | 569 |
Number of columns | 32 |
Column type frequency: | |
character | 1 |
numeric | 31 |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
y | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
rownames | 0 | 1 | 285.00 | 164.40 | 1.00 | 143.00 | 285.00 | 427.00 | 569.00 | ▇▇▇▇▇ |
x_radius_mean | 0 | 1 | 14.13 | 3.52 | 6.98 | 11.70 | 13.37 | 15.78 | 28.11 | ▂▇▃▁▁ |
x_texture_mean | 0 | 1 | 19.29 | 4.30 | 9.71 | 16.17 | 18.84 | 21.80 | 39.28 | ▃▇▃▁▁ |
x_perimeter_mean | 0 | 1 | 91.97 | 24.30 | 43.79 | 75.17 | 86.24 | 104.10 | 188.50 | ▃▇▃▁▁ |
x_area_mean | 0 | 1 | 654.89 | 351.91 | 143.50 | 420.30 | 551.10 | 782.70 | 2501.00 | ▇▃▂▁▁ |
x_smoothness_mean | 0 | 1 | 0.10 | 0.01 | 0.05 | 0.09 | 0.10 | 0.11 | 0.16 | ▁▇▇▁▁ |
x_compactness_mean | 0 | 1 | 0.10 | 0.05 | 0.02 | 0.06 | 0.09 | 0.13 | 0.35 | ▇▇▂▁▁ |
x_concavity_mean | 0 | 1 | 0.09 | 0.08 | 0.00 | 0.03 | 0.06 | 0.13 | 0.43 | ▇▃▂▁▁ |
x_concave_pts_mean | 0 | 1 | 0.05 | 0.04 | 0.00 | 0.02 | 0.03 | 0.07 | 0.20 | ▇▃▂▁▁ |
x_symmetry_mean | 0 | 1 | 0.18 | 0.03 | 0.11 | 0.16 | 0.18 | 0.20 | 0.30 | ▁▇▅▁▁ |
x_fractal_dim_mean | 0 | 1 | 0.06 | 0.01 | 0.05 | 0.06 | 0.06 | 0.07 | 0.10 | ▆▇▂▁▁ |
x_radius_se | 0 | 1 | 0.41 | 0.28 | 0.11 | 0.23 | 0.32 | 0.48 | 2.87 | ▇▁▁▁▁ |
x_texture_se | 0 | 1 | 1.22 | 0.55 | 0.36 | 0.83 | 1.11 | 1.47 | 4.88 | ▇▅▁▁▁ |
x_perimeter_se | 0 | 1 | 2.87 | 2.02 | 0.76 | 1.61 | 2.29 | 3.36 | 21.98 | ▇▁▁▁▁ |
x_area_se | 0 | 1 | 40.34 | 45.49 | 6.80 | 17.85 | 24.53 | 45.19 | 542.20 | ▇▁▁▁▁ |
x_smoothness_se | 0 | 1 | 0.01 | 0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.03 | ▇▃▁▁▁ |
x_compactness_se | 0 | 1 | 0.03 | 0.02 | 0.00 | 0.01 | 0.02 | 0.03 | 0.14 | ▇▃▁▁▁ |
x_concavity_se | 0 | 1 | 0.03 | 0.03 | 0.00 | 0.02 | 0.03 | 0.04 | 0.40 | ▇▁▁▁▁ |
x_concave_pts_se | 0 | 1 | 0.01 | 0.01 | 0.00 | 0.01 | 0.01 | 0.01 | 0.05 | ▇▇▁▁▁ |
x_symmetry_se | 0 | 1 | 0.02 | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 | 0.08 | ▇▃▁▁▁ |
x_fractal_dim_se | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | ▇▁▁▁▁ |
x_radius_worst | 0 | 1 | 16.27 | 4.83 | 7.93 | 13.01 | 14.97 | 18.79 | 36.04 | ▆▇▃▁▁ |
x_texture_worst | 0 | 1 | 25.68 | 6.15 | 12.02 | 21.08 | 25.41 | 29.72 | 49.54 | ▃▇▆▁▁ |
x_perimeter_worst | 0 | 1 | 107.26 | 33.60 | 50.41 | 84.11 | 97.66 | 125.40 | 251.20 | ▇▇▃▁▁ |
x_area_worst | 0 | 1 | 880.58 | 569.36 | 185.20 | 515.30 | 686.50 | 1084.00 | 4254.00 | ▇▂▁▁▁ |
x_smoothness_worst | 0 | 1 | 0.13 | 0.02 | 0.07 | 0.12 | 0.13 | 0.15 | 0.22 | ▂▇▇▂▁ |
x_compactness_worst | 0 | 1 | 0.25 | 0.16 | 0.03 | 0.15 | 0.21 | 0.34 | 1.06 | ▇▅▁▁▁ |
x_concavity_worst | 0 | 1 | 0.27 | 0.21 | 0.00 | 0.11 | 0.23 | 0.38 | 1.25 | ▇▅▂▁▁ |
x_concave_pts_worst | 0 | 1 | 0.11 | 0.07 | 0.00 | 0.06 | 0.10 | 0.16 | 0.29 | ▅▇▅▃▁ |
x_symmetry_worst | 0 | 1 | 0.29 | 0.06 | 0.16 | 0.25 | 0.28 | 0.32 | 0.66 | ▅▇▁▁▁ |
x_fractal_dim_worst | 0 | 1 | 0.08 | 0.02 | 0.06 | 0.07 | 0.08 | 0.09 | 0.21 | ▇▃▁▁▁ |
We see that there are 31 Quantitative variables (30 measurement variables named x_***, plus a rownames index) and one Qualitative variable, y, which is a two-level target (B = Benign, M = Malignant). The dataset has 569 observations and no missing data.
Let us rename y as diagnosis, and take two other Quantitative variables as predictors, suitably renaming them too. We will also create a binary-valued variable called diagnosis_malignant (Malignant = 1, Benign = 0) for use as the target in our logistic regression model.
cancer_modified <- cancer %>%
rename(
"diagnosis" = y,
"radius_mean" = x_radius_mean,
"concave_points_mean" = x_concave_pts_mean
) %>%
## Convert diagnosis to factor
mutate(diagnosis = factor(
diagnosis,
levels = c("B", "M"),
labels = c("B", "M")
)) %>%
## New Variable
mutate(diagnosis_malignant = if_else(diagnosis == "M", 1, 0)) %>%
select(radius_mean, concave_points_mean, diagnosis, diagnosis_malignant)
cancer_modified
Research Question
How can we predict whether a cancerous tumour is Benign or Malignant, based on the variable radius_mean alone, and with both radius_mean and concave_points_mean?
Let us use GGally to plot a set of combo-plots for our modified dataset:
ggplot2::theme_set(new = theme_custom())
cancer_modified %>%
select(diagnosis, radius_mean, concave_points_mean) %>%
GGally::ggpairs(
mapping = aes(colour = diagnosis),
switch = "both",
# axis labels in more traditional locations(left and bottom)
progress = FALSE,
# no compute progress messages needed
# Choose the diagonal graphs (always single variable! Think!)
diag = list(continuous = wrap("densityDiag", alpha = 0.3)),
# choosing density plots on the diagonal
# Choose lower triangle graphs, two-variable graphs
lower = list(continuous = wrap("points", alpha = 0.3)),
title = "Cancer Pairs Plot #1"
) +
scale_color_brewer(
palette = "Set1",
aesthetics = c("color", "fill")
)
Business Insights from GGally::ggpairs
- radius_mean and concave_pts_mean appear to have well-separated box plot distributions for “B” and “M”.
- Given this clear separation in radius_mean and concave_pts_mean, we can believe that these will be good choices as predictors.
- radius_mean and concave_pts_mean are also mutually well-correlated, with \(\rho = 0.823\); we may wish (later) to choose (a pair of) predictor variables that are less strongly correlated.
Let us code two models, using one and then both of the predictor variables:
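The fitting code itself is not shown here; a minimal sketch with stats::glm(), using the object names cancer_fit_1 and cancer_fit_2 that appear in the plotting code below (an assumption about the hidden chunk, not a verbatim copy of it):
## Model 1: one predictor
cancer_fit_1 <- glm(diagnosis_malignant ~ radius_mean,
  family = binomial(link = "logit"),
  data = cancer_modified
)
## Model 2: two predictors
cancer_fit_2 <- glm(diagnosis_malignant ~ radius_mean + concave_points_mean,
  family = binomial(link = "logit"),
  data = cancer_modified
)
summary(cancer_fit_1) # coefficients are on the log-odds scale
summary(cancer_fit_2)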
The equation for the simple model is:
\[ \begin{aligned} \operatorname{diagnosis\_malignant} &\sim Bernoulli\left(\operatorname{prob}_{\operatorname{diagnosis\_malignant} = \operatorname{1}}= \hat{P}\right) \\ \log\left[ \frac { \hat{P} }{ 1 - \hat{P} } \right] &= -15.25 + 1.03(\operatorname{radius\_mean}) \end{aligned} \tag{7}\]
Increasing radius_mean by one unit changes the log-odds by \(\hat{\beta_1} = 1.033\), or equivalently, it multiplies the odds by \(exp(\hat{\beta_1}) = 2.809\). We can plot the model as shown below:
ggplot2::theme_set(new = theme_custom())
qthresh <- c(0.2, 0.5, 0.8)
beta01 <- coef(cancer_fit_1)[1]
beta11 <- coef(cancer_fit_1)[2]
decision_point <- (log(qthresh / (1 - qthresh)) - beta01) / beta11
##
cancer_modified %>%
gf_point(
diagnosis_malignant ~ radius_mean,
colour = ~diagnosis,
title = "diagnosis ~ radius_mean",
xlab = "Average radius",
ylab = "Diagnosis (1=malignant)", size = 3, show.legend = F
) %>%
# gf_fun(exp(1.033 * radius_mean - 15.25) / (1 + exp(1.033 * radius_mean - 15.25)) ~ radius_mean, xlim = c(1, 30), linewidth = 3, colour = "red") %>%
gf_smooth(
method = glm,
method.args = list(family = "binomial"),
se = FALSE,
color = "black"
) %>%
gf_vline(xintercept = decision_point, linetype = "dashed") %>%
gf_refine(annotate(
"text",
label = paste0("q = ", qthresh),
x = decision_point + 0.45,
y = 0.4,
angle = -90
), scale_color_brewer(palette = "Set1")) %>%
gf_hline(yintercept = 0.5) %>%
gf_theme(theme(plot.title.position = "plot")) %>%
gf_refine(xlim(5, 30))
The dashed vertical lines show how the model can be used to classify the data into two classes (“B” and “M”), depending upon the threshold probability \(q\).
Taking both predictor variables, we obtain the model:
The equation for the more complex model is:
\[ \begin{aligned} \operatorname{diagnosis\_malignant} &\sim Bernoulli\left(\operatorname{prob}_{\operatorname{diagnosis\_malignant} = \operatorname{1}}= \hat{P}\right) \\ \log\left[ \frac { \hat{P} }{ 1 - \hat{P} } \right] &= -13.7 + 0.64(\operatorname{radius\_mean}) + 84.22(\operatorname{concave\_points\_mean}) \end{aligned} \tag{8}\]
Increasing radius_mean by one unit changes the log-odds by \(\hat{\beta_1} = 0.6389\), or equivalently, it multiplies the odds by \(exp(\hat{\beta_1}) = 1.894\), provided concave_points_mean is held fixed.
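As noted earlier, nested logistic models are compared with a \(\chi^2\) (likelihood-ratio) test rather than \(t\)/\(F\); a small sketch, assuming the cancer_fit_1 and cancer_fit_2 objects from the glm() sketch above:
## Likelihood-ratio (chi-squared) comparison of the two nested models
anova(cancer_fit_1, cancer_fit_2, test = "Chisq")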
We can plot this model as shown below: we create a scatter plot of the two predictor variables, and superimpose diagonal lines corresponding to several constant values of the threshold probability \(q\).
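These diagonal lines come from setting the predicted probability equal to \(q\) in Equation 8 and solving for concave_points_mean, which is what decision_intercept and decision_slope compute in the code below:
\[ log\bigg(\frac{q}{1-q}\bigg) = \hat{\beta_0} + \hat{\beta_1}\,\operatorname{radius\_mean} + \hat{\beta_2}\,\operatorname{concave\_points\_mean} \]
\[ \Longrightarrow~~ \operatorname{concave\_points\_mean} = \frac{log\big(q/(1-q)\big) - \hat{\beta_0}}{\hat{\beta_2}} - \frac{\hat{\beta_1}}{\hat{\beta_2}}\,\operatorname{radius\_mean} \]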
ggplot2::theme_set(new = theme_custom())
beta02 <- coef(cancer_fit_2)[1]
beta12 <- coef(cancer_fit_2)[2]
beta22 <- coef(cancer_fit_2)[3]
##
decision_intercept <- 1 / beta22 * (log(qthresh / (1 - qthresh)) - beta02)
decision_slope <- -beta12 / beta22
##
cancer_modified %>%
gf_point(concave_points_mean ~ radius_mean,
color = ~diagnosis, shape = ~diagnosis,
size = 3, alpha = 0.5
) %>%
gf_labs(
x = "Average radius",
y = "Average concave\nportions of the\ncontours",
color = "Diagnosis",
shape = "Diagnosis",
title = "diagnosis ~ radius_mean + concave_points_mean"
) %>%
gf_abline(
slope = decision_slope, intercept = decision_intercept,
linetype = "dashed"
) %>%
gf_refine(
scale_color_brewer(palette = "Set1"),
annotate("text", label = paste0("q = ", qthresh), x = 10, y = c(0.08, 0.1, 0.115), angle = -17.155)
) %>%
gf_theme(theme(plot.title.position = "plot"))
To Be Written Up. Yeah, sure.
To Be Written Up. Hmph.
To Be Written Up. (This is becoming a joke.)
All that is very well, but what is happening under the hood of the glm command? Consider the diagnosis (target) variable and, say, the average_radius feature/predictor variable. What we do is:
- plot the data using gf_point(diagnosis ~ average_radius, data = cancer_modified)
- predict the probability of diagnosis for any given average_radius, using a sigmoid curve with the “best” (maximum likelihood, ML) parameters
How does one find out the “ML” parameters? There is clearly a two-step procedure:
Let us visualize the variations and computations from step (5). For the sake of clarity:
In Figure 8, we see three models: the “optimum” one in black, and two others in green and orange respectively.
We now project the actual data points onto the regression curve, to obtain the predicted probability for each point.
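In practice, this projection is simply the model’s fitted probability for each observation; a small sketch, assuming the cancer_fit_1 object from the earlier glm() sketch:
## Predicted probabilities ("projections" onto the sigmoid) for each observation
p_hat <- predict(cancer_fit_1, type = "response")
head(round(p_hat, 3))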
ggplot2::theme_set(new = theme_custom())
plot_dat <- cancer_modified_sample %>%
mutate(
curve1 = ilogit(radius_mean * (beta1_sample - 0.002) + beta0_sample - 1),
curve2 = ilogit(radius_mean * (beta1_sample + 0.0075) + beta0_sample + 1)
)
# plot_dat
## Now plot the two steps for each regression curve
## 1. Point projection
plot_dat %>%
gf_point(diagnosis_malignant ~ radius_mean,
colour = ~diagnosis, size = 4, title = "Data Point Projection"
) %>%
gf_smooth(
method = glm,
method.args = list(family = "binomial"),
se = FALSE,
color = "grey"
) %>%
gf_refine(scale_color_brewer(palette = "Set1")) %>%
## Overlaid regression curves
gf_fun(ilogit(x * (beta1_sample - 0.002) + beta0_sample - 1) ~ x, xlim = c(10, 20), color = "orange", linewidth = 2) %>%
gf_point(curve1 ~ radius_mean, size = 4) %>%
gf_segment(curve1 + diagnosis_malignant ~ radius_mean + radius_mean, arrow = arrow(ends = "first", angle = 15, type = "closed", length = unit(0.125, "inches")))
# %>%
# gf_segment(curve1 + curve1 ~ radius_mean + 0)
## 2. probability lookup
plot_dat %>%
gf_point(diagnosis_malignant ~ radius_mean,
colour = ~diagnosis, size = 4, alpha = 0.3,
title = "Predicted Probability Lookup"
) %>%
gf_smooth(
method = glm,
method.args = list(family = "binomial"),
se = FALSE,
color = "grey"
) %>%
gf_refine(scale_color_brewer(palette = "Set1")) %>%
gf_fun(ilogit(x * (beta1_sample - 0.002) + beta0_sample - 1) ~ x,
xlim = c(10, 20), color = "orange", linewidth = 2
) %>%
gf_point(curve1 ~ radius_mean, size = 4) %>%
# gf_segment(curve1 + diagnosis_malignant ~ radius_mean + radius_mean) %>%
gf_segment(curve1 + curve1 ~ radius_mean + 10,
arrow = arrow(
ends = "last", angle = 15,
length = unit(0.1, "inches")
)
)
### Curve 2
## Now plot the two steps for each regression curve
## 1. Point projection
plot_dat %>%
gf_point(diagnosis_malignant ~ radius_mean,
colour = ~diagnosis, size = 4, title = "Data Point Projection"
) %>%
gf_smooth(
method = glm,
method.args = list(family = "binomial"),
se = FALSE,
color = "grey"
) %>%
gf_refine(scale_color_brewer(palette = "Set1")) %>%
## Overlaid regression curves
gf_fun(ilogit(x * (beta1_sample + 0.0075) + beta0_sample + 1) ~ x, xlim = c(10, 20), color = "limegreen", linewidth = 2) %>%
gf_point(curve2 ~ radius_mean, size = 4) %>%
gf_segment(curve2 + diagnosis_malignant ~ radius_mean + radius_mean) %>%
gf_segment(curve2 + diagnosis_malignant ~ radius_mean + radius_mean, arrow = arrow(ends = "first", angle = 15, type = "closed", length = unit(0.125, "inches")))
## 2. probability lookup
plot_dat %>%
gf_point(diagnosis_malignant ~ radius_mean,
colour = ~diagnosis, size = 4, alpha = 0.3,
title = "Predicted Probability Lookup"
) %>%
gf_refine(scale_color_brewer(palette = "Set1")) %>%
gf_smooth(
method = glm,
method.args = list(family = "binomial"),
se = FALSE,
color = "grey"
) %>%
gf_fun(ilogit(x * (beta1_sample + 0.0075) + beta0_sample + 1) ~ x, xlim = c(10, 20), color = "limegreen", linewidth = 2) %>%
gf_point(curve2 ~ radius_mean, size = 4) %>%
# gf_segment(curve2 + diagnosis_malignant ~ radius_mean + radius_mean) %>%
gf_segment(curve2 + curve2 ~ radius_mean + 10, arrow = arrow(ends = "last", angle = 15, length = unit(0.1, "inches")))
The predicted probability \(p_i\) for each datum (radius_mean) is the probability of the tumour being Malignant. If the datum corresponds to a tumour that is Benign, we must take \(1-p_i\). Each data point is assumed to be independent, so we can calculate the likelihood of the data, given the model parameters, as a product of probabilities:
\[ \large{ \begin{equation} \begin{aligned} likelihood &= \prod_{Malignant}^{}p_i ~ \times ~ \prod_{Benign}^{}(1 - p_i)\\ &= \prod_{i}^{}(p_i)^{y_i} ~\times~ (1-p_i)^{1-y_i} ~~~~ since~the~labels~y_i~are~binary:~1~for~Malignant,~0~for~Benign \end{aligned} \end{equation} } \]
Lastly, since this is a product of small numbers, it can lead to inaccuracies, so we take the log of the whole thing to make it into an addition, obtaining the log-likelihood (LL):
\[ \large{ \begin{equation} \begin{aligned} log~likelihood ~~ ll(\beta_i) &= log\prod_{}^{}(p_i)^{y_i} * (1-p_i)^{1-y_i}\\ &= \sum_{}^{} y_i * log (p_i) + (1-y_i) * log(1 - p_i)\\ \end{aligned} \end{equation} } \tag{9}\]
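Equation 9 can be checked numerically against R’s own log-likelihood; a small sketch, assuming the cancer_fit_1 object from the earlier glm() sketch:
## Log-likelihood of the single-predictor model, computed "by hand"
beta <- coef(cancer_fit_1)
p_hat <- plogis(beta[1] + beta[2] * cancer_modified$radius_mean) # inverse logit
y <- cancer_modified$diagnosis_malignant
sum(y * log(p_hat) + (1 - y) * log(1 - p_hat))
logLik(cancer_fit_1) # should agree (up to printing precision)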
We now need to find the (global) maximum of this quantity and so determine the \(\beta_i\). Flipping this problem around, we find the maximum likelihood by finding where the slope/gradient of the LL is zero!! And, to find that zero of the slope, we use the Newton-Raphson method or an equivalent. Phew!
The Newton-Raphson Method
How do we calculate the next value of x using the tangent?
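The standard answer, stated here for completeness: the tangent to the function \(f\) (here, the gradient of the LL whose root we seek) at the current guess \(x_n\) is \(f(x_n) + f'(x_n)(x - x_n)\); setting this tangent to zero and solving for \(x\) gives the next iterate
\[ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \]
and we repeat until the change in \(x\) is negligibly small.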