Training models in R with tidymodels

Lecture 24

Dr. Benjamin Soltoff

Cornell University
INFO 3312/5312 - Spring 2024

May 2, 2023

Announcements

Announcements

  • Finish your projects

What is machine learning?

Examples of supervised learning

  • Will a user click on this ad?
  • Will a police officer engage in misconduct in the next six months?
  • Will the Supreme Court vote to overturn affirmative action in college admissions?
  • How many individuals will become infected with COVID-19 in the next week?

Two modes

Classification

Will this home sell in the next 30 days?

Regression

What will the sale price be for this home?

Two cultures

Statistics

  • model first
  • inference emphasis

Machine Learning

  • data first
  • prediction emphasis

Statistics

“Statisticians, like artists, have the bad habit of falling in love with their models.”

— George Box

tidymodels

tidymodels

Predictive modeling

library(tidymodels)

The Bechdel test

The Bechdel test

The Bechdel test

In order to pass the test, a movie must have

  1. At least two named women in it
  2. Who talk to each other
  3. About something besides a man

library(rcis)
data("bechdel")
glimpse(bechdel)
Rows: 1,394
Columns: 10
$ year          <dbl> 2013, 2013, 2013, 2013, 2013, 2013, …
$ title         <chr> "12 Years a Slave", "2 Guns", "42", …
$ test          <fct> Fail, Fail, Fail, Fail, Fail, Pass, …
$ budget_2013   <dbl> 2.00, 6.10, 4.00, 22.50, 9.20, 1.20,…
$ domgross_2013 <dbl> 5.310703, 7.561246, 9.502021, 3.8362…
$ intgross_2013 <dbl> 15.860703, 13.249301, 9.502021, 14.5…
$ rated         <chr> "R", "R", "PG-13", "PG-13", "R", "R"…
$ metascore     <dbl> 97, 55, 62, 29, 28, 55, 48, 33, 90, …
$ imdb_rating   <dbl> 8.3, 6.8, 7.6, 6.6, 5.4, 7.8, 5.7, 5…
$ genre         <chr> "Biography", "Action", "Biography", …

Bechdel test data

  • N = 1394
  • 1 categorical outcome: test
  • 8 predictors

What is the goal of machine learning?

Build models that

generate accurate predictions

for future, yet-to-be-seen data.

Machine learning

We’ll use this goal to drive learning of 3 core tidymodels packages:

  • parsnip
  • rsample
  • yardstick

🔨 Build models with parsnip

parsnip

To specify a model with parsnip

  1. Pick a model
  2. Set the engine
  3. Set the mode (if needed)

To specify a model with parsnip

logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")
Logistic Regression Model Specification (classification)

Computational engine: glm 

To specify a model with parsnip

decision_tree() |>
  set_engine("C5.0") |>
  set_mode("classification")
Decision Tree Model Specification (classification)

Computational engine: C5.0 

To specify a model with parsnip

nearest_neighbor() |>
  set_engine("kknn") |>
  set_mode("classification")
K-Nearest Neighbor Model Specification (classification)

Computational engine: kknn 

1. Pick a model

All available models are listed at

https://www.tidymodels.org/find/parsnip/

logistic_reg()

Specifies a model that uses logistic regression

logistic_reg(penalty = NULL, mixture = NULL)

logistic_reg()

Specifies a model that uses logistic regression

logistic_reg(
  mode = "classification", # "default" mode, if one exists
  penalty = NULL, # model hyper-parameter
  mixture = NULL # model hyper-parameter
)

set_engine()

Adds an engine to power or implement the model.

logistic_reg() |> set_engine(engine = "glm")

Alternatively, you can set the engine when you define the model type.

logistic_reg(engine = "glm")

set_mode()

Sets the class of problem the model will solve, which influences which output is collected. Not necessary if mode is set in Step 1.

logistic_reg() |> set_mode(mode = "classification")

lr_mod <- logistic_reg() |>
  set_engine(engine = "glm") |>
  set_mode("classification")
lr_mod
Logistic Regression Model Specification (classification)

Computational engine: glm 
tree_mod <- decision_tree() |>
  set_engine(engine = "C5.0") |>
  set_mode("classification")
tree_mod
Decision Tree Model Specification (classification)

Computational engine: C5.0 

Now we’ve built a model.

But, how do we use a model?

First - what does it mean to use a model?

Statistical models learn from the data.

Many learn model parameters, which can themselves be useful for inference and interpretation.

fit()

Trains a model by fitting it to data. Returns a parsnip model fit.

fit(tree_mod, test ~ metascore + imdb_rating, data = bechdel)

fit()

Trains a model by fitting it to data. Returns a parsnip model fit.

tree_mod |> # parsnip model
  fit(
    test ~ metascore + imdb_rating, # a formula
    data = bechdel # dataframe
  )

A fitted model

lr_mod |>
  fit(test ~ metascore + imdb_rating,
    data = bechdel
  ) |>
  broom::tidy()
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   2.70     0.436        6.20 5.64e-10
2 metascore     0.0202   0.00481      4.20 2.66e- 5
3 imdb_rating  -0.606    0.0889      -6.82 8.87e-12

“All models are wrong, but some are useful”

          Truth
Prediction Fail Pass
      Fail  613  421
      Pass  159  201

“All models are wrong, but some are useful”

          Truth
Prediction Fail Pass
      Fail  583  397
      Pass  189  225
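
The code behind these tables is not shown on the slides. As a sketch of how such a table can be produced (assuming, for illustration, the logistic regression specification and the full bechdel data; the slides do not say which model generated each table), yardstick's conf_mat() can be applied to augmented predictions:

lr_mod |>
  fit(test ~ metascore + imdb_rating, data = bechdel) |>
  # add .pred_class and class-probability columns to the data
  augment(new_data = bechdel) |>
  # cross-tabulate predicted vs. actual outcomes
  conf_mat(truth = test, estimate = .pred_class)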

Axiom

The best way to measure a model’s performance at predicting new data is to predict new data.

♻️ Resample models with rsample

rsample

The holdout method

initial_split()

“Splits” data randomly into a single testing and a single training set.

initial_split(data, prop = 3 / 4)

initial_split()

bechdel_split <- initial_split(data = bechdel, strata = test, prop = 3 / 4)
bechdel_split
<Training/Testing/Total>
<1045/349/1394>

training() and testing()

Extract training and testing sets from an rsplit

training(bechdel_split)
testing(bechdel_split)

training()

bechdel_train <- training(bechdel_split)
bechdel_train
# A tibble: 1,045 × 9
    year test  budget_2013 domgross_2013 intgross_2013 rated
   <dbl> <fct>       <dbl>         <dbl>         <dbl> <chr>
 1  2013 Fail         2             5.31         15.9  R    
 2  2013 Fail         6.1           7.56         13.2  R    
 3  2013 Fail         4             9.50          9.50 PG-13
 4  2013 Fail        22.5           3.84         14.6  PG-13
 5  2013 Fail         9.2           6.73         30.4  R    
 6  2013 Fail        13             6.05         24.4  PG-13
 7  2013 Fail         7.8          12.0          27.2  PG   
 8  2013 Fail         0.55          2.45          2.64 R    
 9  2013 Fail         7             2.52         10.4  R    
10  2013 Fail         6             4.60         10.4  R    
# ℹ 1,035 more rows
# ℹ 3 more variables: metascore <dbl>, imdb_rating <dbl>,
#   genre <chr>

Data Splitting

Take the average!

The testing set is precious…


we can only use it once!


How can we use the training set to compare, evaluate, and tune models?

Cross-validation

V-fold cross-validation

vfold_cv(data, v = 10, ...)

Guess

How many times does an observation/row appear in the assessment set?

Exactly once: each row belongs to a single fold, so it shows up in one assessment set and in the analysis set of every other fold.

Quiz

If we use 10 folds, which percent of our data will end up in the analysis set and which percent in the assessment set for each fold?

90% - analysis

10% - assessment

Bechdel folds

set.seed(100)
bechdel_folds <- vfold_cv(data = bechdel_train, v = 10, strata = test)
bechdel_folds
#  10-fold cross-validation using stratification 
# A tibble: 10 × 2
   splits            id    
   <list>            <chr> 
 1 <split [940/105]> Fold01
 2 <split [940/105]> Fold02
 3 <split [940/105]> Fold03
 4 <split [940/105]> Fold04
 5 <split [940/105]> Fold05
 6 <split [940/105]> Fold06
 7 <split [941/104]> Fold07
 8 <split [941/104]> Fold08
 9 <split [941/104]> Fold09
10 <split [942/103]> Fold10

fit_resamples()

Trains and tests a model across a set of resamples.

tree_mod |>
  fit_resamples(
    test ~ metascore + imdb_rating,
    resamples = bechdel_folds
  )
tree_mod |>
  fit_resamples(
    test ~ metascore + imdb_rating,
    resamples = bechdel_folds
  )
# Resampling results
# 10-fold cross-validation using stratification 
# A tibble: 10 × 4
   splits            id     .metrics         .notes  
   <list>            <chr>  <list>           <list>  
 1 <split [940/105]> Fold01 <tibble [2 × 4]> <tibble>
 2 <split [940/105]> Fold02 <tibble [2 × 4]> <tibble>
 3 <split [940/105]> Fold03 <tibble [2 × 4]> <tibble>
 4 <split [940/105]> Fold04 <tibble [2 × 4]> <tibble>
 5 <split [940/105]> Fold05 <tibble [2 × 4]> <tibble>
 6 <split [940/105]> Fold06 <tibble [2 × 4]> <tibble>
 7 <split [941/104]> Fold07 <tibble [2 × 4]> <tibble>
 8 <split [941/104]> Fold08 <tibble [2 × 4]> <tibble>
 9 <split [941/104]> Fold09 <tibble [2 × 4]> <tibble>
10 <split [942/103]> Fold10 <tibble [2 × 4]> <tibble>

collect_metrics()

Unnests the .metrics column from the results of fit_resamples().

results |> collect_metrics(summarize = TRUE)

tree_fit <- tree_mod |> 
  fit_resamples(
    test ~ metascore + imdb_rating, 
    resamples = bechdel_folds
  )
collect_metrics(tree_fit)
# A tibble: 2 × 6
  .metric  .estimator  mean     n std_err .config           
  <chr>    <chr>      <dbl> <int>   <dbl> <chr>             
1 accuracy binary     0.533    10  0.0127 Preprocessor1_Mod…
2 roc_auc  binary     0.506    10  0.0120 Preprocessor1_Mod…
collect_metrics(tree_fit, summarize = FALSE)
# A tibble: 20 × 5
   id     .metric  .estimator .estimate .config             
   <chr>  <chr>    <chr>          <dbl> <chr>               
 1 Fold01 accuracy binary         0.429 Preprocessor1_Model1
 2 Fold01 roc_auc  binary         0.444 Preprocessor1_Model1
 3 Fold02 accuracy binary         0.552 Preprocessor1_Model1
 4 Fold02 roc_auc  binary         0.5   Preprocessor1_Model1
 5 Fold03 accuracy binary         0.505 Preprocessor1_Model1
 6 Fold03 roc_auc  binary         0.479 Preprocessor1_Model1
 7 Fold04 accuracy binary         0.552 Preprocessor1_Model1
 8 Fold04 roc_auc  binary         0.5   Preprocessor1_Model1
 9 Fold05 accuracy binary         0.552 Preprocessor1_Model1
10 Fold05 roc_auc  binary         0.5   Preprocessor1_Model1
# ℹ 10 more rows

10-fold CV

  • 10 different analysis/assessment sets

  • 10 different models (trained on analysis sets)

  • 10 different sets of performance statistics (on assessment sets)

📏 Evaluate models with yardstick

yardstick

https://tidymodels.github.io/yardstick/articles/metric-types.html#metrics

roc_curve()

Takes predictions saved by fit_resamples() when run with control_resamples(save_pred = TRUE).

Returns a tibble of sensitivity and specificity at each probability threshold.

roc_curve(data, truth, ...)

truth = the column containing the actual outcome

... = the column of predicted class probabilities for the class of interest (here .pred_Fail)

tree_preds <- tree_mod |> 
  fit_resamples(
    test ~ metascore + imdb_rating, 
    resamples = bechdel_folds,
    control = control_resamples(save_pred = TRUE)
  )

tree_preds |> 
  collect_predictions() |> 
  roc_curve(truth = test, .pred_Fail)
# A tibble: 19 × 3
   .threshold specificity sensitivity
        <dbl>       <dbl>       <dbl>
 1   -Inf          0            1    
 2      0.372      0            1    
 3      0.439      0.0129       0.991
 4      0.440      0.0343       0.969
 5      0.462      0.0622       0.943
 6      0.467      0.101        0.889
 7      0.554      0.129        0.858
 8      0.554      0.326        0.658
 9      0.554      0.425        0.560
10      0.556      0.727        0.259
# ℹ 9 more rows

Area under the curve

  • AUC = 0.5: random guessing

  • AUC = 1: perfect classifier

  • In general, an AUC above 0.8 is considered “good”
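
The AUC itself can be computed from the same saved predictions with yardstick's roc_auc(). A minimal sketch using the tree_preds object created on the earlier fit_resamples() slide (this pools the out-of-fold predictions into a single estimate):

tree_preds |>
  collect_predictions() |>
  # single AUC computed from the pooled assessment-set predictions
  roc_auc(truth = test, .pred_Fail)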

autoplot()

tree_preds |> 
  collect_predictions() |> 
  roc_curve(truth = test, estimate = .pred_Fail) |> 
  autoplot()

👩🏼‍🍳 Build a better training set with recipes

Preprocessing options

  • Encode categorical predictors
  • Center and scale variables
  • Handle class imbalance
  • Impute missing data
  • Perform dimensionality reduction
  • A lot more!
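
Each of these options corresponds to one or more step_*() functions in recipes (handling class imbalance additionally requires the themis package). A sketch with some illustrative steps, which are not the steps used for the Bechdel recipe below:

recipe(test ~ ., data = bechdel) |>
  step_impute_median(all_numeric_predictors()) |>  # impute missing numeric data
  step_dummy(all_nominal_predictors()) |>          # encode categorical predictors
  step_normalize(all_numeric_predictors()) |>      # center and scale
  step_pca(all_numeric_predictors(), num_comp = 2) # dimensionality reduction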

To build a recipe

  1. Start the recipe()
  2. Define the variables involved
  3. Describe preprocessing step-by-step

kknn_rec <- recipe(test ~ ., data = bechdel) |>
  # collapse genres with fewer than 5% of movies into an "other" level
  step_other(genre, threshold = .05) |>
  # assign factor levels unseen at training time to a new level
  step_novel(all_nominal_predictors()) |>
  # convert nominal predictors to 0/1 dummy variables
  step_dummy(all_nominal_predictors()) |>
  # drop predictors with zero variance
  step_zv(all_predictors()) |>
  # center and scale all numeric variables
  step_normalize(all_numeric())
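
To see what this recipe produces, one option (a sketch, not part of the original slides) is to prep() it and bake() the retained training data. prep() estimates any required statistics from the data supplied to recipe():

kknn_rec |>
  prep() |>                # estimate means, SDs, dummy levels, etc.
  bake(new_data = NULL) |> # return the preprocessed training data
  glimpse()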

📜 Create boilerplate code using usemodels

library(usemodels)
use_kknn(test ~ ., data = bechdel, verbose = TRUE, tune = FALSE)
kknn_recipe <- 
  recipe(formula = test ~ ., data = bechdel) %>% 
  ## Since distance calculations are used, the predictor variables should 
  ## be on the same scale. Before centering and scaling the numeric 
  ## predictors, any predictors with a single unique value are filtered 
  ## out. 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) 

kknn_spec <- 
  nearest_neighbor() %>% 
  set_mode("classification") %>% 
  set_engine("kknn") 

kknn_workflow <- 
  workflow() %>% 
  add_recipe(kknn_recipe) %>% 
  add_model(kknn_spec) 

use_glmnet(test ~ ., data = bechdel, verbose = TRUE, tune = FALSE)
glmnet_recipe <- 
  recipe(formula = test ~ ., data = bechdel) %>% 
  ## Regularization methods sum up functions of the model slope 
  ## coefficients. Because of this, the predictor variables should be on 
  ## the same scale. Before centering and scaling the numeric predictors, 
  ## any predictors with a single unique value are filtered out. 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) 

glmnet_spec <- 
  logistic_reg() %>% 
  set_mode("classification") %>% 
  set_engine("glmnet") 

glmnet_workflow <- 
  workflow() %>% 
  add_recipe(glmnet_recipe) %>% 
  add_model(glmnet_spec) 

Note

usemodels creates boilerplate code using the magrittr pipe operator (%>%) rather than the native pipe (|>) used elsewhere in these slides.

🪢🪵 Bundling machine learning workflows with workflow()

Combine a recipe + model specification

kknn_mod <- nearest_neighbor() |>
  set_engine("kknn") |>
  set_mode("classification")

kknn_wf <- workflow() |> 
  add_recipe(kknn_rec) |> 
  add_model(kknn_mod)
kknn_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_other()
• step_novel()
• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
K-Nearest Neighbor Model Specification (classification)

Computational engine: kknn 
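
A workflow can be passed to fit_resamples() in place of a formula plus model. A minimal sketch using the folds created earlier (the recipe and model are estimated within each fold):

kknn_wf |>
  fit_resamples(resamples = bechdel_folds) |>
  collect_metrics()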

🎶 Fitting and tuning models with tune

tune()

A placeholder for hyper-parameters to be “tuned”

nearest_neighbor(neighbors = tune())

tune_grid()

A version of fit_resamples() that performs a grid search for the best combination of tuned hyper-parameters.

tune_grid(
  object,
  resamples,
  ...,
  grid = 10,
  metrics = NULL,
  control = control_grid()
)

The object argument is one of:

  • A parsnip model object

  • A workflow

Tune a nearest neighbors model

kknn_tuner <- nearest_neighbor(
    engine = "kknn",
    neighbors = tune()
  ) |>
  set_mode("classification")

kknn_wf <- workflow() |>
  add_recipe(kknn_rec) |>
  add_model(kknn_tuner)

set.seed(100) # Important!
kknn_results <- kknn_wf |>
  tune_grid(resamples = bechdel_folds,
            control = control_grid(save_workflow = TRUE))
kknn_results |>
  collect_metrics()
# A tibble: 20 × 7
   neighbors .metric  .estimator  mean     n std_err .config
       <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>  
 1         1 accuracy binary     0.569    10  0.0165 Prepro…
 2         1 roc_auc  binary     0.566    10  0.0170 Prepro…
 3         3 accuracy binary     0.569    10  0.0165 Prepro…
 4         3 roc_auc  binary     0.607    10  0.0184 Prepro…
 5         5 accuracy binary     0.596    10  0.0201 Prepro…
 6         5 roc_auc  binary     0.611    10  0.0189 Prepro…
 7         6 accuracy binary     0.598    10  0.0183 Prepro…
 8         6 roc_auc  binary     0.615    10  0.0186 Prepro…
 9         7 accuracy binary     0.598    10  0.0201 Prepro…
10         7 roc_auc  binary     0.620    10  0.0191 Prepro…
# ℹ 10 more rows
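
The slides stop at collect_metrics(). As a sketch of the usual next step with tune (not shown in the original deck), the best value of neighbors can be pulled out and plugged back into the workflow:

# show the top candidate values of neighbors, ranked by AUC
kknn_results |>
  show_best(metric = "roc_auc", n = 5)

# select the single best value and finalize the workflow with it
best_k <- select_best(kknn_results, metric = "roc_auc")
final_kknn_wf <- finalize_workflow(kknn_wf, best_k)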