STA 210 - Spring 2022
Which variables help us predict the amount customers tip at a restaurant?
# A tibble: 169 × 4
Tip Party Meal Age
<dbl> <dbl> <chr> <chr>
1 2.99 1 Dinner Yadult
2 2 1 Dinner Yadult
3 5 1 Dinner SenCit
4 4 3 Dinner Middle
5 10.3 2 Dinner SenCit
6 4.85 2 Dinner Middle
7 5 4 Dinner Yadult
8 4 3 Dinner Middle
9 5 2 Dinner Middle
10 1.58 1 Dinner SenCit
# … with 159 more rows
Predictors:

Party
: Number of people in the party

Meal
: Time of day (Lunch, Dinner, Late Night)

Age
: Age category of person paying the bill (Yadult, Middle, SenCit)

Outcome:

Tip
: Amount of tip
# Fit a linear model for Tip with Party and Age as predictors
tip_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(Tip ~ Party + Age, data = tips)

# Coefficient estimates with 95% confidence intervals
tidy(tip_fit, conf.int = TRUE) %>%
  kable(digits = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -0.170 | 0.366 | -0.465 | 0.643 | -0.893 | 0.553 |
Party | 1.837 | 0.124 | 14.758 | 0.000 | 1.591 | 2.083 |
AgeMiddle | 1.009 | 0.408 | 2.475 | 0.014 | 0.204 | 1.813 |
AgeSenCit | 1.388 | 0.485 | 2.862 | 0.005 | 0.430 | 2.345 |
Is this the best model to explain variation in tips?
term | df | sumsq | meansq | statistic | p.value |
---|---|---|---|---|---|
Party | 1 | 1188.64 | 1188.64 | 285.71 | 0.00 |
Age | 2 | 38.03 | 19.01 | 4.57 | 0.01 |
Residuals | 165 | 686.44 | 4.16 | NA | NA |
the variation that can be explained by each of the variables in the model
the variation that can’t be explained by the model (left in the residuals)
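An ANOVA table like the one above can be produced directly from the fitted model. A minimal sketch, assuming the tip_fit object from earlier (tip_fit$fit is the underlying lm fit):

```r
# Decompose the variation in Tip into the part explained by each term
# (Party, Age) and the part left in the residuals
anova(tip_fit$fit) %>%
  tidy() %>%
  kable(digits = 2)
```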
term | df | sumsq | meansq | statistic | p.value |
---|---|---|---|---|---|
Party | 1 | 1188.64 | 1188.64 | 285.71 | 0.00 |
Age | 2 | 38.03 | 19.01 | 4.57 | 0.01 |
Residuals | 165 | 686.44 | 4.16 | | |
Total | 168 | 1913.11 | | | |
term | df | sumsq |
---|---|---|
Party | 1 | 1188.64 |
Age | 2 | 38.03 |
Residuals | 165 | 686.44 |
Total | 168 | 1913.11 |
Recall: \(R^2\) is the proportion of the variation in the response variable explained by the regression model.
\[ R^2 = \frac{SS_{Model}}{SS_{Total}} = 1 - \frac{SS_{Error}}{SS_{Total}} = 1 - \frac{686.44}{1913.11} = 0.641 \]
\[R^2 = \frac{SS_{Model}}{SS_{Total}} = 1 - \frac{SS_{Error}}{SS_{Total}}\]
\[R^2_{adj} = 1 - \frac{SS_{Error}/(n-p-1)}{SS_{Total}/(n-1)}\]
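As a quick check, both quantities can be computed from the sums of squares in the ANOVA table above; a minimal sketch, with n = 169 observations and p = 3 model terms:

```r
ss_error <- 686.44    # SS_Error (Residuals)
ss_total <- 1913.11   # SS_Total
n <- 169              # number of observations
p <- 3                # number of model terms (Party, AgeMiddle, AgeSenCit)

1 - ss_error / ss_total                              # R^2, approximately 0.641
1 - (ss_error / (n - p - 1)) / (ss_total / (n - 1))  # adjusted R^2, approximately 0.635
```

The same values should appear as r.squared and adj.r.squared in glance(tip_fit).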
Estimators of prediction error and relative quality of models:
Akaike’s Information Criterion (AIC): \[AIC = n\log(SS_\text{Error}) - n \log(n) + 2(p+1)\]
Schwarz’s Bayesian Information Criterion (BIC): \[BIC = n\log(SS_\text{Error}) - n\log(n) + \log(n)\times(p+1)\]
\[ \begin{aligned} & AIC = \color{blue}{n\log(SS_\text{Error})} - n \log(n) + 2(p+1) \\ & BIC = \color{blue}{n\log(SS_\text{Error})} - n\log(n) + \log(n)\times(p+1) \end{aligned} \]
First Term: Decreases as p increases
\[ \begin{aligned} & AIC = n\log(SS_\text{Error}) - \color{blue}{n \log(n)} + 2(p+1) \\ & BIC = n\log(SS_\text{Error}) - \color{blue}{n\log(n)} + \log(n)\times(p+1) \end{aligned} \]
Second Term: Fixed for a given sample size n
\[ \begin{aligned} & AIC = n\log(SS_\text{Error}) - n\log(n) + \color{blue}{2(p+1)} \\ & BIC = n\log(SS_\text{Error}) - n\log(n) + \color{blue}{\log(n)\times(p+1)} \end{aligned} \]
Third Term: Increases as p increases
\[ \begin{aligned} & AIC = n\log(SS_\text{Error}) - n \log(n) + \color{red}{2(p+1)} \\ & BIC = n\log(SS_\text{Error}) - n\log(n) + \color{red}{\log(n)\times(p+1)} \end{aligned} \]
Choose model with the smaller value of AIC or BIC
If \(n \geq 8\), the penalty for BIC is larger than that of AIC, so BIC tends to favor more parsimonious models (i.e. models with fewer terms)
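As a sketch of this comparison in practice, glance() reports AIC and BIC for each fitted model; the Party-only model below is illustrative:

```r
# An illustrative candidate model with Party as the only predictor
tip_fit_party <- linear_reg() %>%
  set_engine("lm") %>%
  fit(Tip ~ Party, data = tips)

# Smaller AIC / BIC indicates the preferred model
glance(tip_fit) %>% select(AIC, BIC)        # Party + Age
glance(tip_fit_party) %>% select(AIC, BIC)  # Party only
```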
The principle of parsimony is attributed to William of Occam (early 14th-century English nominalist philosopher), who insisted that, given a set of equally good explanations for a given phenomenon, the correct explanation is the simplest explanation1
Called Occam’s razor because he “shaved” his explanations down to the bare minimum
Parsimony in modeling:
Occam’s razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected
Model selection follows this principle
We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model
In other words, we prefer the simplest best model, i.e. parsimonious model
Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.
Radford Neal - Bayesian Learning for Neural Networks1
To avoid overfitting, we will:
split our data into testing and training sets (see the sketch after this list)
“train” the model on the training data and pick a few models we’re genuinely considering as potentially good models
test those models on the testing set
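A minimal sketch of the split with rsample (part of tidymodels); the seed and proportion below are illustrative:

```r
set.seed(210)  # illustrative seed, for reproducibility

tips_split <- initial_split(tips, prop = 0.75)  # 75% training, 25% testing
tips_train <- training(tips_split)
tips_test  <- testing(tips_split)
```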
ANOVA for Multiple Linear Regression and sum of squares
Comparing models with \(R^2\) vs. \(R^2_{adj}\)
Comparing models with AIC and BIC
Occam’s razor and parsimony
Overfitting and spending our data