tidymodels
Lecture 24
Cornell University
INFO 3312/5312 - Spring 2024
May 2, 2024
Will this home sell in the next 30 days?
What will the sale price be for this home?
“Statisticians, like artists, have the bad habit of falling in love with their models.”
— George Box
tidymodels
In order to pass the test, a movie must have:
1. at least two named women in it,
2. who talk to each other,
3. about something besides a man.
Rows: 1,394
Columns: 10
$ year <dbl> 2013, 2013, 2013, 2013, 2013, 2013, …
$ title <chr> "12 Years a Slave", "2 Guns", "42", …
$ test <fct> Fail, Fail, Fail, Fail, Fail, Pass, …
$ budget_2013 <dbl> 2.00, 6.10, 4.00, 22.50, 9.20, 1.20,…
$ domgross_2013 <dbl> 5.310703, 7.561246, 9.502021, 3.8362…
$ intgross_2013 <dbl> 15.860703, 13.249301, 9.502021, 14.5…
$ rated <chr> "R", "R", "PG-13", "PG-13", "R", "R"…
$ metascore <dbl> 97, 55, 62, 29, 28, 55, 48, 33, 90, …
$ imdb_rating <dbl> 8.3, 6.8, 7.6, 6.6, 5.4, 7.8, 5.7, 5…
$ genre <chr> "Biography", "Action", "Biography", …
Our outcome variable is test: did the movie pass the Bechdel test (Pass) or not (Fail)?
Build models that
generate accurate predictions
for future, yet-to-be-seen data.
We’ll use this goal to drive learning of 3 core tidymodels
packages:
parsnip
rsample
yardstick
parsnip
All available models are listed at
https://www.tidymodels.org/find/parsnip/
logistic_reg()
Specifies a model that uses logistic regression
set_engine()
Adds an engine to power or implement the model.
set_mode()
Sets the class of problem the model will solve, which influences which output is collected. Not necessary if the mode was already set when the model was specified.
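Putting the three pieces together gives the specification printed below. A minimal sketch (the object name lr_mod is illustrative):

lr_mod <- logistic_reg() |>    # 1. specify the model family
  set_engine("glm") |>         # 2. pick the computational engine
  set_mode("classification")   # 3. set the mode
lr_mod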
Logistic Regression Model Specification (classification)
Computational engine: glm
Now we’ve built a model.
But, how do we use a model?
First - what does it mean to use a model?
Statistical models learn from the data.
Many learn model parameters, which can be useful as values for inference and interpretation.
fit()
Train a model by fitting it to data. Returns a parsnip model fit.
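A sketch of fitting and inspecting predictions, assuming the lr_mod specification from above, the full bechdel data, and the formula used later in this lecture:

lr_fit <- lr_mod |>
  fit(test ~ metascore + imdb_rating, data = bechdel)

# add predicted classes alongside the observed outcome, then cross-tabulate
augment(lr_fit, new_data = bechdel) |>
  conf_mat(truth = test, estimate = .pred_class)

tidy(lr_fit) would show the fitted coefficients. Confusion matrices like the two below cross-tabulate predicted against actual outcomes: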
           Truth
Prediction  Fail Pass
      Fail   613  421
      Pass   159  201

           Truth
Prediction  Fail Pass
      Fail   583  397
      Pass   189  225
The best way to measure a model’s performance at predicting new data is to predict new data.
rsample
initial_split()
“Splits” data randomly into a single testing and a single training set.
training() and testing()
Extract the training and testing sets from an rsplit.
training()
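The output below is consistent with a split like this (a sketch; the seed, the strata argument, and dropping the title column beforehand are assumptions):

set.seed(123)
bechdel_split <- initial_split(bechdel, strata = test)  # 3/4 training by default
bechdel_train <- training(bechdel_split)
bechdel_test  <- testing(bechdel_split)
bechdel_train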
# A tibble: 1,045 × 9
year test budget_2013 domgross_2013 intgross_2013 rated
<dbl> <fct> <dbl> <dbl> <dbl> <chr>
1 2013 Fail 2 5.31 15.9 R
2 2013 Fail 6.1 7.56 13.2 R
3 2013 Fail 4 9.50 9.50 PG-13
4 2013 Fail 22.5 3.84 14.6 PG-13
5 2013 Fail 9.2 6.73 30.4 R
6 2013 Fail 13 6.05 24.4 PG-13
7 2013 Fail 7.8 12.0 27.2 PG
8 2013 Fail 0.55 2.45 2.64 R
9 2013 Fail 7 2.52 10.4 R
10 2013 Fail 6 4.60 10.4 R
# ℹ 1,035 more rows
# ℹ 3 more variables: metascore <dbl>, imdb_rating <dbl>,
# genre <chr>
Take the average!
(With resampling, we will fit one model per fold and average the performance metrics.)
The testing set is precious…
we can only use it once!
How can we use the training set to compare, evaluate, and tune models?
How many times does an observation/row appear in the assessment set?
Exactly once, across all folds.
If we use 10 folds, which percent of our data will end up in the analysis set and which percent in the assessment set for each fold?
90% - analysis
10% - assessment
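The stratified folds printed below could be created with vfold_cv() on the training set (a sketch; the seed is illustrative):

set.seed(123)
bechdel_folds <- vfold_cv(bechdel_train, v = 10, strata = test)
bechdel_folds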
# 10-fold cross-validation using stratification
# A tibble: 10 × 2
splits id
<list> <chr>
1 <split [940/105]> Fold01
2 <split [940/105]> Fold02
3 <split [940/105]> Fold03
4 <split [940/105]> Fold04
5 <split [940/105]> Fold05
6 <split [940/105]> Fold06
7 <split [941/104]> Fold07
8 <split [941/104]> Fold08
9 <split [941/104]> Fold09
10 <split [942/103]> Fold10
fit_resamples()
Trains and tests a resampled model.
# Resampling results
# 10-fold cross-validation using stratification
# A tibble: 10 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [940/105]> Fold01 <tibble [2 × 4]> <tibble>
2 <split [940/105]> Fold02 <tibble [2 × 4]> <tibble>
3 <split [940/105]> Fold03 <tibble [2 × 4]> <tibble>
4 <split [940/105]> Fold04 <tibble [2 × 4]> <tibble>
5 <split [940/105]> Fold05 <tibble [2 × 4]> <tibble>
6 <split [940/105]> Fold06 <tibble [2 × 4]> <tibble>
7 <split [941/104]> Fold07 <tibble [2 × 4]> <tibble>
8 <split [941/104]> Fold08 <tibble [2 × 4]> <tibble>
9 <split [941/104]> Fold09 <tibble [2 × 4]> <tibble>
10 <split [942/103]> Fold10 <tibble [2 × 4]> <tibble>
collect_metrics()
Unnest the .metrics column from a tidymodels fit_resamples() result.
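In the code below, tree_mod is assumed to be a decision tree specification defined earlier, for example:

tree_mod <- decision_tree() |>
  set_engine("rpart") |>
  set_mode("classification")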
tree_fit <- tree_mod |>
fit_resamples(
test ~ metascore + imdb_rating,
resamples = bechdel_folds
)
collect_metrics(tree_fit)
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.533 10 0.0127 Preprocessor1_Mod…
2 roc_auc binary 0.506 10 0.0120 Preprocessor1_Mod…
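The per-fold values shown below can be retrieved by turning off summarizing:

collect_metrics(tree_fit, summarize = FALSE)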
# A tibble: 20 × 5
id .metric .estimator .estimate .config
<chr> <chr> <chr> <dbl> <chr>
1 Fold01 accuracy binary 0.429 Preprocessor1_Model1
2 Fold01 roc_auc binary 0.444 Preprocessor1_Model1
3 Fold02 accuracy binary 0.552 Preprocessor1_Model1
4 Fold02 roc_auc binary 0.5 Preprocessor1_Model1
5 Fold03 accuracy binary 0.505 Preprocessor1_Model1
6 Fold03 roc_auc binary 0.479 Preprocessor1_Model1
7 Fold04 accuracy binary 0.552 Preprocessor1_Model1
8 Fold04 roc_auc binary 0.5 Preprocessor1_Model1
9 Fold05 accuracy binary 0.552 Preprocessor1_Model1
10 Fold05 roc_auc binary 0.5 Preprocessor1_Model1
# ℹ 10 more rows
10 different analysis/assessment sets
10 different models (trained on analysis sets)
10 different sets of performance statistics (on assessment sets)
yardstick
yardstick
https://tidymodels.github.io/yardstick/articles/metric-types.html#metrics
roc_curve()
Takes predictions from fit_resamples() run with save_pred = TRUE. Returns a tibble with one row per probability threshold.
truth = actual outcome
... = predicted probability of the outcome
tree_preds <- tree_mod |>
fit_resamples(
test ~ metascore + imdb_rating,
resamples = bechdel_folds,
control = control_resamples(save_pred = TRUE)
)
tree_preds |>
collect_predictions() |>
roc_curve(truth = test, .pred_Fail)
# A tibble: 19 × 3
.threshold specificity sensitivity
<dbl> <dbl> <dbl>
1 -Inf 0 1
2 0.372 0 1
3 0.439 0.0129 0.991
4 0.440 0.0343 0.969
5 0.462 0.0622 0.943
6 0.467 0.101 0.889
7 0.554 0.129 0.858
8 0.554 0.326 0.658
9 0.554 0.425 0.560
10 0.556 0.727 0.259
# ℹ 9 more rows
AUC = 0.5: random guessing
AUC = 1: perfect classifier
In general, an AUC above 0.8 is considered “good”
autoplot()
Plot the ROC curve directly from the output of roc_curve().
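For example, piping the curve into autoplot():

tree_preds |>
  collect_predictions() |>
  roc_curve(truth = test, .pred_Fail) |>
  autoplot()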
recipes
recipe()
Create a specification of preprocessing steps to apply to the data before fitting the model.
usemodels
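The two boilerplate blocks below can be generated with the package’s use_*() helpers; a sketch, assuming bechdel is loaded:

library(usemodels)
use_kknn(test ~ ., data = bechdel)
use_glmnet(test ~ ., data = bechdel)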
kknn_recipe <-
recipe(formula = test ~ ., data = bechdel) %>%
## Since distance calculations are used, the predictor variables should
## be on the same scale. Before centering and scaling the numeric
## predictors, any predictors with a single unique value are filtered
## out.
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors())
kknn_spec <-
nearest_neighbor() %>%
set_mode("classification") %>%
set_engine("kknn")
kknn_workflow <-
workflow() %>%
add_recipe(kknn_recipe) %>%
add_model(kknn_spec)
glmnet_recipe <-
recipe(formula = test ~ ., data = bechdel) %>%
## Regularization methods sum up functions of the model slope
## coefficients. Because of this, the predictor variables should be on
## the same scale. Before centering and scaling the numeric predictors,
## any predictors with a single unique value are filtered out.
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors())
glmnet_spec <-
logistic_reg() %>%
set_mode("classification") %>%
set_engine("glmnet")
glmnet_workflow <-
workflow() %>%
add_recipe(glmnet_recipe) %>%
add_model(glmnet_spec)
Note
usemodels creates boilerplate code using the older pipe operator %>%
workflow()
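The workflow below bundles a recipe with a model. kknn_rec is not defined on this slide; a recipe matching the five steps in the printed workflow might look like this (the step arguments are assumptions):

kknn_rec <- recipe(test ~ ., data = bechdel_train) |>
  step_other(all_nominal_predictors(), threshold = 0.05) |>  # pool rare factor levels
  step_novel(all_nominal_predictors()) |>                    # handle levels unseen in training
  step_dummy(all_nominal_predictors()) |>                    # convert factors to indicator columns
  step_zv(all_predictors()) |>                               # drop zero-variance predictors
  step_normalize(all_numeric_predictors())                   # center and scale numeric predictors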
kknn_mod <- nearest_neighbor() |>
set_engine("kknn") |>
set_mode("classification")
kknn_wf <- workflow() |>
add_recipe(kknn_rec) |>
add_model(kknn_mod)
kknn_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()
── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps
• step_other()
• step_novel()
• step_dummy()
• step_zv()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
K-Nearest Neighbor Model Specification (classification)
Computational engine: kknn
tune
tune()
A placeholder for hyperparameters to be “tuned”
tune_grid()
A version of fit_resamples() that performs a grid search for the best combination of tuned hyperparameters.
kknn_tuner <- nearest_neighbor(
engine = "kknn",
neighbors = tune()
) |>
set_mode("classification")
kknn_wf <- workflow() |>
add_recipe(kknn_rec) |>
add_model(kknn_tuner)
set.seed(100) # Important!
kknn_results <- kknn_wf |>
tune_grid(resamples = bechdel_folds,
control = control_grid(save_workflow = TRUE))
# A tibble: 20 × 7
neighbors .metric .estimator mean n std_err .config
<int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 1 accuracy binary 0.569 10 0.0165 Prepro…
2 1 roc_auc binary 0.566 10 0.0170 Prepro…
3 3 accuracy binary 0.569 10 0.0165 Prepro…
4 3 roc_auc binary 0.607 10 0.0184 Prepro…
5 5 accuracy binary 0.596 10 0.0201 Prepro…
6 5 roc_auc binary 0.611 10 0.0189 Prepro…
7 6 accuracy binary 0.598 10 0.0183 Prepro…
8 6 roc_auc binary 0.615 10 0.0186 Prepro…
9 7 accuracy binary 0.598 10 0.0201 Prepro…
10 7 roc_auc binary 0.620 10 0.0191 Prepro…
# ℹ 10 more rows
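Because save_workflow = TRUE was set, a typical next step (a sketch) is to inspect the best values of neighbors and finalize the workflow with the winner:

show_best(kknn_results, metric = "roc_auc")
best_k <- select_best(kknn_results, metric = "roc_auc")
final_wf <- finalize_workflow(kknn_wf, best_k)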