Training models in R with tidymodels

Lecture 24

Dr. Benjamin Soltoff

Cornell University
INFO 3312/5312 - Spring 2024

May 2, 2023

Announcements

Announcements

  • Finish your projects

What is machine learning?

Examples of supervised learning

  • Will a user click on this ad?
  • Will a police officer engage in misconduct in the next six months?
  • Will the Supreme Court vote to overturn affirmative action in college admissions?
  • How many individuals will become infected with COVID-19 in the next week?

Two modes

Classification

Will this home sell in the next 30 days?

Regression

What will the sale price be for this home?

Two cultures

Statistics

  • model first
  • inference emphasis

Machine Learning

  • data first
  • prediction emphasis

Statistics

“Statisticians, like artists, have the bad habit of falling in love with their models.”

— George Box

tidymodels

tidymodels

Predictive modeling

library(tidymodels)

The Bechdel test

The Bechdel test

The Bechdel test

In order to pass the test, a movie must have

  1. At least two named women in it
  2. Who talk to each other
  3. About something besides a man

library(rcis)
data("bechdel")
glimpse(bechdel)
Rows: 1,394
Columns: 10
$ year          <dbl> 2013, 2013, 2013, 2013, 2013, 2013, …
$ title         <chr> "12 Years a Slave", "2 Guns", "42", …
$ test          <fct> Fail, Fail, Fail, Fail, Fail, Pass, …
$ budget_2013   <dbl> 2.00, 6.10, 4.00, 22.50, 9.20, 1.20,…
$ domgross_2013 <dbl> 5.310703, 7.561246, 9.502021, 3.8362…
$ intgross_2013 <dbl> 15.860703, 13.249301, 9.502021, 14.5…
$ rated         <chr> "R", "R", "PG-13", "PG-13", "R", "R"…
$ metascore     <dbl> 97, 55, 62, 29, 28, 55, 48, 33, 90, …
$ imdb_rating   <dbl> 8.3, 6.8, 7.6, 6.6, 5.4, 7.8, 5.7, 5…
$ genre         <chr> "Biography", "Action", "Biography", …

Bechdel test data

  • N = 1394
  • 1 categorical outcome: test
  • 8 predictors

What is the goal of machine learning?

Build models that

generate accurate predictions

for future, yet-to-be-seen data.

Machine learning

We’ll use this goal to drive learning of 3 core tidymodels packages:

  • parsnip
  • rsample
  • yardstick

🔨 Build models with parsnip

parsnip

To specify a model with parsnip

  1. Pick a model
  2. Set the engine
  3. Set the mode (if needed)

To specify a model with parsnip

logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")
Logistic Regression Model Specification (classification)

Computational engine: glm 

To specify a model with parsnip

decision_tree() |>
  set_engine("C5.0") |>
  set_mode("classification")
Decision Tree Model Specification (classification)

Computational engine: C5.0 

To specify a model with parsnip

nearest_neighbor() |>
  set_engine("kknn") |>
  set_mode("classification")
K-Nearest Neighbor Model Specification (classification)

Computational engine: kknn 

1. Pick a model

All available models are listed at

https://www.tidymodels.org/find/parsnip/

logistic_reg()

Specifies a model that uses logistic regression

logistic_reg(penalty = NULL, mixture = NULL)

logistic_reg()

Specifies a model that uses logistic regression

logistic_reg(
  mode = "classification", # "default" mode, if one exists
  penalty = NULL, # model hyper-parameter
  mixture = NULL # model hyper-parameter
)

set_engine()

Adds an engine to power or implement the model.

logistic_reg() |> set_engine(engine = "glm")

Alternatively, you can set the engine when you define the model type.

logistic_reg(engine = "glm")

set_mode()

Sets the class of problem the model will solve, which influences which output is collected. Not necessary if mode is set in Step 1.

logistic_reg() |> set_mode(mode = "classification")

lr_mod <- logistic_reg() |>
  set_engine(engine = "glm") |>
  set_mode("classification")
lr_mod
Logistic Regression Model Specification (classification)

Computational engine: glm 
tree_mod <- decision_tree() |>
  set_engine(engine = "C5.0") |>
  set_mode("classification")
tree_mod
Decision Tree Model Specification (classification)

Computational engine: C5.0 

Now we’ve built a model.

But, how do we use a model?

First - what does it mean to use a model?

Statistical models learn from the data.

Many learn model parameters, which can themselves be useful for inference and interpretation.

fit()

Trains a model by fitting it to data. Returns a parsnip model fit.

fit(tree_mod, test ~ metascore + imdb_rating, data = bechdel)

fit()

Trains a model by fitting it to data. Returns a parsnip model fit.

tree_mod |> # parsnip model
  fit(
    test ~ metascore + imdb_rating, # a formula
    data = bechdel # dataframe
  )

A fitted model

lr_mod |>
  fit(test ~ metascore + imdb_rating,
    data = bechdel
  ) |>
  broom::tidy()
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   2.70     0.436        6.20 5.64e-10
2 metascore     0.0202   0.00481      4.20 2.66e- 5
3 imdb_rating  -0.606    0.0889      -6.82 8.87e-12

“All models are wrong, but some are useful”

          Truth
Prediction Fail Pass
      Fail  613  421
      Pass  159  201

“All models are wrong, but some are useful”

          Truth
Prediction Fail Pass
      Fail  583  397
      Pass  189  225
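
The code behind these tables is not shown on the slides. As a sketch of how such a table can be produced (assuming, for illustration, the logistic regression specification and the full bechdel data; the slides do not say which model generated each table), yardstick's conf_mat() can be applied to augmented predictions:

lr_mod |>
  fit(test ~ metascore + imdb_rating, data = bechdel) |>
  # add .pred_class and class-probability columns to the data
  augment(new_data = bechdel) |>
  # cross-tabulate predicted vs. actual outcomes
  conf_mat(truth = test, estimate = .pred_class)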

Axiom

The best way to measure a model’s performance at predicting new data is to predict new data.

♻️ Resample models with rsample

rsample

The holdout method

initial_split()

“Splits” data randomly into a single testing and a single training set.

initial_split(data, prop = 3 / 4)

initial_split()

bechdel_split <- initial_split(data = bechdel, strata = test, prop = 3 / 4)
bechdel_split
<Training/Testing/Total>
<1045/349/1394>

training() and testing()

Extract training and testing sets from an rsplit

training(bechdel_split)
testing(bechdel_split)

training()

bechdel_train <- training(bechdel_split)
bechdel_train
# A tibble: 1,045 × 9
    year test  budget_2013 domgross_2013 intgross_2013 rated
   <dbl> <fct>       <dbl>         <dbl>         <dbl> <chr>
 1  2013 Fail         2             5.31         15.9  R    
 2  2013 Fail         6.1           7.56         13.2  R    
 3  2013 Fail         4             9.50          9.50 PG-13
 4  2013 Fail        22.5           3.84         14.6  PG-13
 5  2013 Fail         9.2           6.73         30.4  R    
 6  2013 Fail        13             6.05         24.4  PG-13
 7  2013 Fail         7.8          12.0          27.2  PG   
 8  2013 Fail         0.55          2.45          2.64 R    
 9  2013 Fail         7             2.52         10.4  R    
10  2013 Fail         6             4.60         10.4  R    
# ℹ 1,035 more rows
# ℹ 3 more variables: metascore <dbl>, imdb_rating <dbl>,
#   genre <chr>

Data Splitting

Take the average!

The testing set is precious…


we can only use it once!


How can we use the training set to compare, evaluate, and tune models?

Cross-validation

V-fold cross-validation

vfold_cv(data, v = 10, ...)

Guess

How many times does an observation/row appear in the assessment set?

Exactly once: each row belongs to a single fold, so it shows up in one assessment set and in the analysis set of every other fold.

Quiz

If we use 10 folds, which percent of our data will end up in the analysis set and which percent in the assessment set for each fold?

90% - analysis

10% - assessment

Bechdel folds

set.seed(100)
bechdel_folds <- vfold_cv(data = bechdel_train, v = 10, strata = test)
bechdel_folds
#  10-fold cross-validation using stratification 
# A tibble: 10 × 2
   splits            id    
   <list>            <chr> 
 1 <split [940/105]> Fold01
 2 <split [940/105]> Fold02
 3 <split [940/105]> Fold03
 4 <split [940/105]> Fold04
 5 <split [940/105]> Fold05
 6 <split [940/105]> Fold06
 7 <split [941/104]> Fold07
 8 <split [941/104]> Fold08
 9 <split [941/104]> Fold09
10 <split [942/103]> Fold10

fit_resamples()

Trains and tests a model across a set of resamples.

tree_mod |>
  fit_resamples(
    test ~ metascore + imdb_rating,
    resamples = bechdel_folds
  )
tree_mod |>
  fit_resamples(
    test ~ metascore + imdb_rating,
    resamples = bechdel_folds
  )
# Resampling results
# 10-fold cross-validation using stratification 
# A tibble: 10 × 4
   splits            id     .metrics         .notes  
   <list>            <chr>  <list>           <list>  
 1 <split [940/105]> Fold01 <tibble [2 × 4]> <tibble>
 2 <split [940/105]> Fold02 <tibble [2 × 4]> <tibble>
 3 <split [940/105]> Fold03 <tibble [2 × 4]> <tibble>
 4 <split [940/105]> Fold04 <tibble [2 × 4]> <tibble>
 5 <split [940/105]> Fold05 <tibble [2 × 4]> <tibble>
 6 <split [940/105]> Fold06 <tibble [2 × 4]> <tibble>
 7 <split [941/104]> Fold07 <tibble [2 × 4]> <tibble>
 8 <split [941/104]> Fold08 <tibble [2 × 4]> <tibble>
 9 <split [941/104]> Fold09 <tibble [2 × 4]> <tibble>
10 <split [942/103]> Fold10 <tibble [2 × 4]> <tibble>

collect_metrics()

Unnests the .metrics column from the results of fit_resamples().

results |> collect_metrics(summarize = TRUE)

tree_fit <- tree_mod |> 
  fit_resamples(
    test ~ metascore + imdb_rating, 
    resamples = bechdel_folds
  )
collect_metrics(tree_fit)
# A tibble: 2 × 6
  .metric  .estimator  mean     n std_err .config           
  <chr>    <chr>      <dbl> <int>   <dbl> <chr>             
1 accuracy binary     0.533    10  0.0127 Preprocessor1_Mod…
2 roc_auc  binary     0.506    10  0.0120 Preprocessor1_Mod…
collect_metrics(tree_fit, summarize = FALSE)
# A tibble: 20 × 5
   id     .metric  .estimator .estimate .config             
   <chr>  <chr>    <chr>          <dbl> <chr>               
 1 Fold01 accuracy binary         0.429 Preprocessor1_Model1
 2 Fold01 roc_auc  binary         0.444 Preprocessor1_Model1
 3 Fold02 accuracy binary         0.552 Preprocessor1_Model1
 4 Fold02 roc_auc  binary         0.5   Preprocessor1_Model1
 5 Fold03 accuracy binary         0.505 Preprocessor1_Model1
 6 Fold03 roc_auc  binary         0.479 Preprocessor1_Model1
 7 Fold04 accuracy binary         0.552 Preprocessor1_Model1
 8 Fold04 roc_auc  binary         0.5   Preprocessor1_Model1
 9 Fold05 accuracy binary         0.552 Preprocessor1_Model1
10 Fold05 roc_auc  binary         0.5   Preprocessor1_Model1
# ℹ 10 more rows

10-fold CV

  • 10 different analysis/assessment sets

  • 10 different models (trained on analysis sets)

  • 10 different sets of performance statistics (on assessment sets)

📏 Evaluate models with yardstick

yardstick

https://tidymodels.github.io/yardstick/articles/metric-types.html#metrics

roc_curve()

Takes predictions saved by fit_resamples() when run with control_resamples(save_pred = TRUE).

Returns a tibble of sensitivity and specificity at each probability threshold.

roc_curve(data, truth, ...)

truth = the column containing the actual outcome

... = the column of predicted class probabilities for the class of interest (here .pred_Fail)

tree_preds <- tree_mod |> 
  fit_resamples(
    test ~ metascore + imdb_rating, 
    resamples = bechdel_folds,
    control = control_resamples(save_pred = TRUE)
  )

tree_preds |> 
  collect_predictions() |> 
  roc_curve(truth = test, .pred_Fail)
# A tibble: 19 × 3
   .threshold specificity sensitivity
        <dbl>       <dbl>       <dbl>
 1   -Inf          0            1    
 2      0.372      0            1    
 3      0.439      0.0129       0.991
 4      0.440      0.0343       0.969
 5      0.462      0.0622       0.943
 6      0.467      0.101        0.889
 7      0.554      0.129        0.858
 8      0.554      0.326        0.658
 9      0.554      0.425        0.560
10      0.556      0.727        0.259
# ℹ 9 more rows

Area under the curve

  • AUC = 0.5: random guessing

  • AUC = 1: perfect classifier

  • In general, an AUC above 0.8 is considered “good”
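
The AUC itself can be computed from the same saved predictions with yardstick's roc_auc(). A minimal sketch using the tree_preds object created on the earlier fit_resamples() slide (this pools the out-of-fold predictions into a single estimate):

tree_preds |>
  collect_predictions() |>
  # single AUC computed from the pooled assessment-set predictions
  roc_auc(truth = test, .pred_Fail)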

autoplot()

tree_preds |> 
  collect_predictions() |> 
  roc_curve(truth = test, estimate = .pred_Fail) |> 
  autoplot()

👩🏼‍🍳 Build a better training set with recipes

Preprocessing options

  • Encode categorical predictors
  • Center and scale variables
  • Handle class imbalance
  • Impute missing data
  • Perform dimensionality reduction
  • A lot more!
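
Each of these options corresponds to one or more step_*() functions in recipes (handling class imbalance additionally requires the themis package). A sketch with some illustrative steps, which are not the steps used for the Bechdel recipe below:

recipe(test ~ ., data = bechdel) |>
  step_impute_median(all_numeric_predictors()) |>  # impute missing numeric data
  step_dummy(all_nominal_predictors()) |>          # encode categorical predictors
  step_normalize(all_numeric_predictors()) |>      # center and scale
  step_pca(all_numeric_predictors(), num_comp = 2) # dimensionality reduction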

To build a recipe

  1. Start the recipe()
  2. Define the variables involved
  3. Describe preprocessing step-by-step

kknn_rec <- recipe(test ~ ., data = bechdel) |>
  # collapse genres with fewer than 5% of movies into an "other" level
  step_other(genre, threshold = .05) |>
  # assign factor levels unseen at training time to a new level
  step_novel(all_nominal_predictors()) |>
  # convert nominal predictors to 0/1 dummy variables
  step_dummy(all_nominal_predictors()) |>
  # drop predictors with zero variance
  step_zv(all_predictors()) |>
  # center and scale all numeric variables
  step_normalize(all_numeric())
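
To see what this recipe produces, one option (a sketch, not part of the original slides) is to prep() it and bake() the retained training data. prep() estimates any required statistics from the data supplied to recipe():

kknn_rec |>
  prep() |>                # estimate means, SDs, dummy levels, etc.
  bake(new_data = NULL) |> # return the preprocessed training data
  glimpse()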

📜 Create boilerplate code using usemodels

library(usemodels)
use_kknn(test ~ ., data = bechdel, verbose = TRUE, tune = FALSE)
kknn_recipe <- 
  recipe(formula = test ~ ., data = bechdel) %>% 
  ## Since distance calculations are used, the predictor variables should 
  ## be on the same scale. Before centering and scaling the numeric 
  ## predictors, any predictors with a single unique value are filtered 
  ## out. 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) 

kknn_spec <- 
  nearest_neighbor() %>% 
  set_mode("classification") %>% 
  set_engine("kknn") 

kknn_workflow <- 
  workflow() %>% 
  add_recipe(kknn_recipe) %>% 
  add_model(kknn_spec) 

use_glmnet(test ~ ., data = bechdel, verbose = TRUE, tune = FALSE)
glmnet_recipe <- 
  recipe(formula = test ~ ., data = bechdel) %>% 
  ## Regularization methods sum up functions of the model slope 
  ## coefficients. Because of this, the predictor variables should be on 
  ## the same scale. Before centering and scaling the numeric predictors, 
  ## any predictors with a single unique value are filtered out. 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) 

glmnet_spec <- 
  logistic_reg() %>% 
  set_mode("classification") %>% 
  set_engine("glmnet") 

glmnet_workflow <- 
  workflow() %>% 
  add_recipe(glmnet_recipe) %>% 
  add_model(glmnet_spec) 

Note

usemodels creates boilerplate code using the magrittr pipe operator (%>%) rather than the native pipe (|>) used elsewhere in these slides.

🪢🪵 Bundling machine learning workflows with workflow()

Combine a recipe + model specification

kknn_mod <- nearest_neighbor() |>
  set_engine("kknn") |>
  set_mode("classification")

kknn_wf <- workflow() |> 
  add_recipe(kknn_rec) |> 
  add_model(kknn_mod)
kknn_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_other()
• step_novel()
• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
K-Nearest Neighbor Model Specification (classification)

Computational engine: kknn 
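
A workflow can be passed to fit_resamples() in place of a formula plus model. A minimal sketch using the folds created earlier (the recipe and model are estimated within each fold):

kknn_wf |>
  fit_resamples(resamples = bechdel_folds) |>
  collect_metrics()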

🎶 Fitting and tuning models with tune

tune()

A placeholder for hyper-parameters to be “tuned”

nearest_neighbor(neighbors = tune())

tune_grid()

A version of fit_resamples() that performs a grid search for the best combination of tuned hyper-parameters.

tune_grid(
  object,
  resamples,
  ...,
  grid = 10,
  metrics = NULL,
  control = control_grid()
)

The object argument is one of:

  • A parsnip model object

  • A workflow

Tune a nearest neighbors model

kknn_tuner <- nearest_neighbor(
    engine = "kknn",
    neighbors = tune()
  ) |>
  set_mode("classification")

kknn_wf <- workflow() |>
  add_recipe(kknn_rec) |>
  add_model(kknn_tuner)

set.seed(100) # Important!
kknn_results <- kknn_wf |>
  tune_grid(resamples = bechdel_folds,
            control = control_grid(save_workflow = TRUE))
kknn_results |>
  collect_metrics()
# A tibble: 20 × 7
   neighbors .metric  .estimator  mean     n std_err .config
       <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>  
 1         1 accuracy binary     0.569    10  0.0165 Prepro…
 2         1 roc_auc  binary     0.566    10  0.0170 Prepro…
 3         3 accuracy binary     0.569    10  0.0165 Prepro…
 4         3 roc_auc  binary     0.607    10  0.0184 Prepro…
 5         5 accuracy binary     0.596    10  0.0201 Prepro…
 6         5 roc_auc  binary     0.611    10  0.0189 Prepro…
 7         6 accuracy binary     0.598    10  0.0183 Prepro…
 8         6 roc_auc  binary     0.615    10  0.0186 Prepro…
 9         7 accuracy binary     0.598    10  0.0201 Prepro…
10         7 roc_auc  binary     0.620    10  0.0191 Prepro…
# ℹ 10 more rows
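
The slides stop at collect_metrics(). As a sketch of the usual next step with tune (not shown in the original deck), the best value of neighbors can be pulled out and plugged back into the workflow:

# show the top candidate values of neighbors, ranked by AUC
kknn_results |>
  show_best(metric = "roc_auc", n = 5)

# select the single best value and finalize the workflow with it
best_k <- select_best(kknn_results, metric = "roc_auc")
final_kknn_wf <- finalize_workflow(kknn_wf, best_k)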