STA 210 - Spring 2022
# A tibble: 186 × 14
season episode episode_name imdb_rating total_votes air_date lines_jim
<dbl> <dbl> <chr> <dbl> <dbl> <date> <dbl>
1 1 1 Pilot 7.6 3706 2005-03-24 0.157
2 1 2 Diversity Day 8.3 3566 2005-03-29 0.123
3 1 3 Health Care 7.9 2983 2005-04-05 0.172
4 1 4 The Alliance 8.1 2886 2005-04-12 0.202
5 1 5 Basketball 8.4 3179 2005-04-19 0.0913
6 1 6 Hot Girl 7.8 2852 2005-04-26 0.159
7 2 1 The Dundies 8.7 3213 2005-09-20 0.125
8 2 2 Sexual Harassment 8.2 2736 2005-09-27 0.0565
9 2 3 Office Olympics 8.4 2742 2005-10-04 0.196
10 2 4 The Fire 8.4 2713 2005-10-11 0.160
# … with 176 more rows, and 7 more variables: lines_pam <dbl>,
# lines_michael <dbl>, lines_dwight <dbl>, halloween <chr>, valentine <chr>,
# christmas <chr>, michael <chr>
The recipe uses episode_name as an ID variable and doesn't use air_date as a predictor:

office_rec1 <- recipe(imdb_rating ~ ., data = office_train) %>%
update_role(episode_name, new_role = "id") %>%
step_rm(air_date) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors())
office_rec1
Recipe

Inputs:

      role #variables
        id          1
   outcome          1
 predictor         12

Operations:

Delete terms air_date
Dummy variables from all_nominal_predictors()
Zero variance filter on all_predictors()
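To check what this preprocessing produces, the recipe can be estimated and applied to the training data. This is a sketch using recipes' prep() and bake(); it isn't part of the original output:

```r
# prep() estimates the recipe steps from the training data;
# bake(new_data = NULL) returns the processed training set, with
# air_date removed, nominal predictors converted to dummy variables,
# and zero-variance columns dropped.
office_rec1 %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()
```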
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_rm()
• step_dummy()
• step_zv()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
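The workflow printed above can be assembled from the recipe and a linear regression specification. A sketch, assuming the object names office_spec and office_wflow1 (the original names aren't shown):

```r
# Model specification: ordinary linear regression via the lm engine
office_spec <- linear_reg() %>%
  set_engine("lm")

# Bundle the preprocessing recipe and the model into one workflow
office_wflow1 <- workflow() %>%
  add_recipe(office_rec1) %>%
  add_model(office_spec)
```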
Actually, not so fast!
Resampling is only conducted on the training set. The test set is not involved. For each iteration of resampling, the data are partitioned into two subsamples:
Source: Kuhn and Silge. Tidy modeling with R.
More specifically, we'll use v-fold cross-validation, a commonly used resampling technique. To illustrate, let's take v = 3.

Randomly split your training data into 3 partitions:
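In tidymodels, this partitioning is done with vfold_cv(), and the workflow is fit to each resample with fit_resamples(). A sketch, assuming the workflow is stored as office_wflow1 and using an arbitrary seed (both are assumptions):

```r
set.seed(345)  # assumed seed, for reproducibility

# Split the training data into 3 folds
office_folds <- vfold_cv(office_train, v = 3)

# Fit the workflow on each fold's analysis set and
# evaluate it on the corresponding assessment set
office_fit_rs <- office_wflow1 %>%
  fit_resamples(office_folds)

office_fit_rs
```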
# Resampling results
# 3-fold cross-validation
# A tibble: 3 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [92/47]> Fold1 <tibble [2 × 4]> <tibble [0 × 1]>
2 <split [93/46]> Fold2 <tibble [2 × 4]> <tibble [0 × 1]>
3 <split [93/46]> Fold3 <tibble [2 × 4]> <tibble [0 × 1]>
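The per-fold metrics below can be extracted with collect_metrics(); summarize = FALSE keeps one row per fold per metric instead of averaging across folds. A sketch, assuming the resampling results are stored as office_fit_rs:

```r
# One row per fold per metric (rmse and rsq for each of the 3 folds)
cv_metrics1 <- collect_metrics(office_fit_rs, summarize = FALSE)
cv_metrics1
```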
# A tibble: 6 × 5
id .metric .estimator .estimate .config
<chr> <chr> <chr> <dbl> <chr>
1 Fold1 rmse standard 0.356 Preprocessor1_Model1
2 Fold1 rsq standard 0.520 Preprocessor1_Model1
3 Fold2 rmse standard 0.367 Preprocessor1_Model1
4 Fold2 rsq standard 0.498 Preprocessor1_Model1
5 Fold3 rmse standard 0.330 Preprocessor1_Model1
6 Fold3 rsq standard 0.621 Preprocessor1_Model1
Cross-validation RMSE stats:
cv_metrics1 %>%
filter(.metric == "rmse") %>%
summarise(
min = min(.estimate),
max = max(.estimate),
mean = mean(.estimate),
sd = sd(.estimate)
)
# A tibble: 1 × 4
min max mean sd
<dbl> <dbl> <dbl> <dbl>
1 0.330 0.367 0.351 0.0192
Training data IMDB score stats:
To illustrate how CV works, we used v = 3. This was useful for illustrative purposes, but v = 3 is a poor choice in practice. Values of v are most often 5 or 10; we generally prefer 10-fold cross-validation as a default.
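Switching from the illustrative v = 3 to the preferred default only changes the v argument. A sketch, assuming hypothetical object names and that the workflow is stored as office_wflow1:

```r
set.seed(345)  # assumed seed

# 10-fold cross-validation on the same training data
office_folds_10 <- vfold_cv(office_train, v = 10)

office_fit_rs_10 <- office_wflow1 %>%
  fit_resamples(office_folds_10)

# rmse and rsq averaged across the 10 folds
collect_metrics(office_fit_rs_10)
```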