Feature engineering

STA 210 - Spring 2022

Dr. Mine Çetinkaya-Rundel

Welcome

Announcements

Check Sakai Gradebook to make sure all scores so far are accurate
Any questions on topic selection for projects?
Any feedback on time of my office hours?

Midterm evaluation summary

Live analysis…

Topics

Review: Training and testing splits
Feature engineering with recipes

Computational setup

# load packages
library(tidyverse)
library(tidymodels)
library(gghighlight)
library(knitr)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 20))

Introduction

The Office

Data

The data come from data.world, by way of TidyTuesday

office_ratings <- read_csv(here::here("slides", "data/office_ratings.csv"))
office_ratings

# A tibble: 188 × 6
   season episode title             imdb_rating total_votes air_date  
    <dbl>   <dbl> <chr>                   <dbl>       <dbl> <date>    
 1      1       1 Pilot                     7.6        3706 2005-03-24
 2      1       2 Diversity Day             8.3        3566 2005-03-29
 3      1       3 Health Care               7.9        2983 2005-04-05
 4      1       4 The Alliance              8.1        2886 2005-04-12
 5      1       5 Basketball                8.4        3179 2005-04-19
 6      1       6 Hot Girl                  7.8        2852 2005-04-26
 7      2       1 The Dundies               8.7        3213 2005-09-20
 8      2       2 Sexual Harassment         8.2        2736 2005-09-27
 9      2       3 Office Olympics           8.4        2742 2005-10-04
10      2       4 The Fire                  8.4        2713 2005-10-11
# … with 178 more rows

IMDB ratings

IMDB ratings vs. number of votes

Outliers

Aside…

If you like the Dinner Party episode, I highly recommend this “oral history” of the episode published in Rolling Stone magazine.

Rating vs. air date

IMDB ratings vs. seasons

Modeling

Train / test

Step 1: Create an initial split:

set.seed(123)
office_split <- initial_split(office_ratings) # prop = 3/4 by default

Step 2: Save training data

office_train <- training(office_split)
dim(office_train)

[1] 141   6

Step 3: Save testing data

office_test  <- testing(office_split)
dim(office_test)

[1] 47  6
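As a rough illustration of what the split is doing, here is a minimal base R sketch of a 3/4 random split. It is not how `initial_split()` is implemented and it will not select the same rows; the names `train_rows` and `test_rows` are made up for this example.

```r
# Conceptual sketch of a 3/4 random split, using base R only
n <- 188                                          # number of episodes in office_ratings
set.seed(123)                                     # make the random draw reproducible

train_rows <- sample(n, size = floor(0.75 * n))   # row numbers assigned to training
test_rows  <- setdiff(seq_len(n), train_rows)     # every remaining row goes to testing

length(train_rows)
length(test_rows)
```

The two lengths are 141 and 47, matching the dimensions of `office_train` and `office_test`, and no row appears in both sets.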

Training data

office_train

# A tibble: 141 × 6
   season episode title               imdb_rating total_votes air_date  
    <dbl>   <dbl> <chr>                     <dbl>       <dbl> <date>    
 1      8      18 Last Day in Florida         7.8        1429 2012-03-08
 2      9      14 Vandalism                   7.6        1402 2013-01-31
 3      2       8 Performance Review          8.2        2416 2005-11-15
 4      9       5 Here Comes Treble           7.1        1515 2012-10-25
 5      3      22 Beach Games                 9.1        2783 2007-05-10
 6      7       1 Nepotism                    8.4        1897 2010-09-23
 7      3      15 Phyllis' Wedding            8.3        2283 2007-02-08
 8      9      21 Livin' the Dream            8.9        2041 2013-05-02
 9      9      18 Promos                      8          1445 2013-04-04
10      8      12 Pool Party                  8          1612 2012-01-19
# … with 131 more rows

Feature engineering

We prefer simple models when possible, but parsimony does not mean sacrificing accuracy (or predictive performance) in the interest of simplicity
The variables that go into the model, and how they are represented, are just as critical to the success of the model
Feature engineering allows us to get creative with our predictors in an effort to make them more useful for our model (to increase its predictive performance)

Feature engineering with dplyr

office_train %>%
  mutate(
    season = as_factor(season),
    month = lubridate::month(air_date),
    wday = lubridate::wday(air_date)
  )

# A tibble: 141 × 8
  season episode title            imdb_rating total_votes air_date   month  wday
  <fct>    <dbl> <chr>                  <dbl>       <dbl> <date>     <dbl> <dbl>
1 8           18 Last Day in Flo…         7.8        1429 2012-03-08     3     5
2 9           14 Vandalism                7.6        1402 2013-01-31     1     5
3 2            8 Performance Rev…         8.2        2416 2005-11-15    11     3
4 9            5 Here Comes Treb…         7.1        1515 2012-10-25    10     5
5 3           22 Beach Games              9.1        2783 2007-05-10     5     5
6 7            1 Nepotism                 8.4        1897 2010-09-23     9     5
# … with 135 more rows

Can you identify any potential problems with this approach?

Modeling workflow, revisited

Create a recipe for feature engineering steps to be applied to the training data
Fit the model to the training data after these steps have been applied
Using the model estimates from the training data, predict outcomes for the test data
Evaluate the performance of the model on the test data
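The four steps above can be sketched end to end as follows. This is only one possible way to wire them together: it uses the built-in `mtcars` data as a stand-in for the Office data, and `workflow()`, `linear_reg()`, and `rmse()` are assumptions about how the fitting and evaluation might be done, since they do not appear elsewhere in these slides.

```r
library(tidymodels)

# Stand-in data: mtcars instead of office_ratings
set.seed(123)
car_split <- initial_split(mtcars)
car_train <- training(car_split)
car_test  <- testing(car_split)

# 1. Create a recipe for the feature engineering steps (training data only)
car_rec <- recipe(mpg ~ wt + hp, data = car_train) %>%
  step_zv(all_predictors())

# 2. Fit the model to the training data after the steps have been applied
car_fit <- workflow() %>%
  add_recipe(car_rec) %>%
  add_model(linear_reg()) %>%
  fit(data = car_train)

# 3. Predict outcomes for the test data using the training estimates
car_pred <- predict(car_fit, new_data = car_test) %>%
  bind_cols(car_test %>% select(mpg))

# 4. Evaluate the performance of the model on the test data
car_rmse <- rmse(car_pred, truth = mpg, estimate = .pred)
car_rmse
```

Bundling the recipe and the model in one object means the same feature engineering steps are applied to the test data automatically at prediction time.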

Building recipes

Initiate a recipe

office_rec <- recipe(
  imdb_rating ~ .,    # formula
  data = office_train # data for cataloguing names and types of variables
  )

office_rec

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          5

Step 1: Alter roles

title isn’t a predictor, but we might want to keep it around as an ID

office_rec <- office_rec %>%
  update_role(title, new_role = "ID")

office_rec

Recipe

Inputs:

      role #variables
        ID          1
   outcome          1
 predictor          4

Step 2: Add features

New features for day of week and month

office_rec <- office_rec %>%
  step_date(air_date, features = c("dow", "month"))

office_rec

Recipe

Inputs:

      role #variables
        ID          1
   outcome          1
 predictor          4

Operations:

Date features from air_date
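Printing the recipe only lists the planned operation. To see the columns that `step_date()` would create, one option is to run the step on a tiny example with `prep()` and `bake()`. The sketch below uses two air dates taken from the training data shown earlier; the object names `dates` and `date_features` are made up for illustration.

```r
library(tidymodels)

# Two air dates from the training data printed earlier
dates <- tibble(air_date = as.Date(c("2012-03-08", "2005-11-15")))

date_features <- recipe(~ air_date, data = dates) %>%
  step_date(air_date, features = c("dow", "month")) %>%
  prep() %>%                 # estimate the step on `dates`
  bake(new_data = NULL)      # return the processed version of `dates`

date_features
```

The result keeps `air_date` and adds `air_date_dow` and `air_date_month`, both stored as factors rather than numbers.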

Step 3: Add more features

Identify holidays in air_date, then remove air_date

office_rec <- office_rec %>%
  step_holiday(
    air_date, 
    holidays = c("USThanksgivingDay", "USChristmasDay", "USNewYearsDay", "USIndependenceDay"), 
    keep_original_cols = FALSE
  )

office_rec

Recipe

Inputs:

      role #variables
        ID          1
   outcome          1
 predictor          4

Operations:

Date features from air_date
Holiday features from air_date
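A small sketch of what the holiday step produces, assuming a single holiday for brevity. The first date is US Thanksgiving in 2005 and is included only to show a positive match; the second is an air date from the training data. The names `dates` and `holiday_features` are made up for this example.

```r
library(tidymodels)

# 2005-11-24 was US Thanksgiving; 2005-11-15 is an ordinary air date
dates <- tibble(air_date = as.Date(c("2005-11-24", "2005-11-15")))

holiday_features <- recipe(~ air_date, data = dates) %>%
  step_holiday(
    air_date,
    holidays = "USThanksgivingDay",
    keep_original_cols = FALSE   # drop air_date once the indicator exists
  ) %>%
  prep() %>%
  bake(new_data = NULL)

holiday_features
```

The only remaining column is the indicator `air_date_USThanksgivingDay`, equal to 1 for the holiday and 0 otherwise. Because `air_date` is dropped here, any step that needs it, such as `step_date()`, has to come earlier in the recipe.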

Step 4: Convert numbers to factors

Convert season to factor

office_rec <- office_rec %>%
  step_num2factor(season, levels = as.character(1:9))

office_rec

Recipe

Inputs:

      role #variables
        ID          1
   outcome          1
 predictor          4

Operations:

Date features from air_date
Holiday features from air_date
Factor variables from season
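To check what the conversion does, here is a minimal sketch on three season values. The names `seasons` and `season_factor` are made up for illustration.

```r
library(tidymodels)

# Three season numbers, stored as numeric like in office_train
seasons <- tibble(season = c(2, 8, 9))

season_factor <- recipe(~ season, data = seasons) %>%
  step_num2factor(season, levels = as.character(1:9)) %>%
  prep() %>%
  bake(new_data = NULL)

levels(season_factor$season)
```

The factor carries all nine levels even though only three seasons appear in this toy data, because the levels are supplied explicitly rather than learned from the values.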

Step 5: Make dummy variables

Convert all nominal (categorical) predictors to dummy (indicator) variables

office_rec <- office_rec %>%
  step_dummy(all_nominal_predictors())

office_rec

Recipe

Inputs:

      role #variables
        ID          1
   outcome          1
 predictor          4

Operations:

Date features from air_date
Holiday features from air_date
Factor variables from season
Dummy variables from all_nominal_predictors()
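A minimal sketch of dummy coding on a three-level factor. The values are illustrative only, and the names `days` and `day_dummies` are made up for this example.

```r
library(tidymodels)

# A factor with three levels; "Sun" is the first (reference) level
days <- tibble(
  dow = factor(c("Thu", "Tue", "Sun", "Thu"), levels = c("Sun", "Tue", "Thu"))
)

day_dummies <- recipe(~ dow, data = days) %>%
  step_dummy(all_nominal_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)

day_dummies
```

With the default settings, a factor with three levels becomes two indicator columns. The reference level has no column of its own and is represented by a row of zeros.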

Step 6: Remove zero variance predictors

Remove all predictors that contain only a single value

office_rec <- office_rec %>%
  step_zv(all_predictors())

office_rec

Recipe

Inputs:

      role #variables
        ID          1
   outcome          1
 predictor          4

Operations:

Date features from air_date
Holiday features from air_date
Factor variables from season
Dummy variables from all_nominal_predictors()
Zero variance filter on all_predictors()
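A small sketch of the filter in action. The vote counts are the first three from the training data; the all-zero holiday column imitates an indicator for a holiday on which no episode in the data aired. The names `toy` and `filtered` are made up for illustration.

```r
library(tidymodels)

toy <- tibble(
  total_votes             = c(1429, 1402, 2416),  # varies across rows
  air_date_USChristmasDay = c(0, 0, 0)            # same value in every row
)

filtered <- recipe(~ ., data = toy) %>%
  step_zv(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)

names(filtered)
```

Only `total_votes` survives. A column with a single value carries no information for the model, so dropping it after the dummy and holiday steps removes indicators that never occur in the training data.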

Putting it all together

office_rec <- recipe(imdb_rating ~ ., data = office_train) %>%
  # make title's role ID
  update_role(title, new_role = "ID") %>%
  # extract day of week and month of air_date
  step_date(air_date, features = c("dow", "month")) %>%
  # identify holidays and add indicators
  step_holiday(
    air_date, 
    holidays = c("USThanksgivingDay", "USChristmasDay", "USNewYearsDay", "USIndependenceDay"), 
    keep_original_cols = FALSE
  ) %>%
  # turn season into factor
  step_num2factor(season, levels = as.character(1:9)) %>%
  # make dummy variables
  step_dummy(all_nominal_predictors()) %>%
  # remove zero variance predictors
  step_zv(all_predictors())

Putting it all together

office_rec

Recipe

Inputs:

      role #variables
        ID          1
   outcome          1
 predictor          4

Operations:

Date features from air_date
Holiday features from air_date
Factor variables from season
Dummy variables from all_nominal_predictors()
Zero variance filter on all_predictors()
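The printed recipe is still only a plan. One way to inspect the engineered features is to `prep()` the recipe and `bake()` the data. So that it runs without the data file, the sketch below re-types the first six rows of `office_train` shown earlier into a small tibble; `office_small`, `office_rec_small`, and `office_baked` are names made up for this example, and with so few rows many more columns are dropped by the zero variance filter than would be on the full training data.

```r
library(tidymodels)

# First six rows of office_train as printed earlier, typed in by hand
office_small <- tribble(
  ~season, ~episode, ~title,                ~imdb_rating, ~total_votes, ~air_date,
  8,       18,       "Last Day in Florida", 7.8,          1429,         "2012-03-08",
  9,       14,       "Vandalism",           7.6,          1402,         "2013-01-31",
  2,       8,        "Performance Review",  8.2,          2416,         "2005-11-15",
  9,       5,        "Here Comes Treble",   7.1,          1515,         "2012-10-25",
  3,       22,       "Beach Games",         9.1,          2783,         "2007-05-10",
  7,       1,        "Nepotism",            8.4,          1897,         "2010-09-23"
) %>%
  mutate(air_date = as.Date(air_date))

# Same steps as the full recipe, applied to the small data
office_rec_small <- recipe(imdb_rating ~ ., data = office_small) %>%
  update_role(title, new_role = "ID") %>%
  step_date(air_date, features = c("dow", "month")) %>%
  step_holiday(
    air_date,
    holidays = c("USThanksgivingDay", "USChristmasDay", "USNewYearsDay", "USIndependenceDay"),
    keep_original_cols = FALSE
  ) %>%
  step_num2factor(season, levels = as.character(1:9)) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())

# Estimate the steps, then return the processed data
office_baked <- office_rec_small %>%
  prep() %>%
  bake(new_data = NULL)

names(office_baked)
```

The baked data keeps `title` as an ID column and `imdb_rating` as the outcome, no longer contains `air_date`, and holds only predictors that take more than one value.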

Recap

Review: Training and testing splits
Feature engineering with recipes