STA 210 - Spring 2022

Dr. Mine Çetinkaya-Rundel

- Congratulations on finishing Exam 1!
- Grading of AEs
- Questions on feedback vs. regrades

Mean-centering quantitative predictors

Using indicator variables for categorical predictors

Using interaction terms

Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the `loan50` data frame in the **openintro** R package.

```
# A tibble: 50 × 4
annual_income debt_to_income verified_income interest_rate
<dbl> <dbl> <fct> <dbl>
1 59000 0.558 Not Verified 10.9
2 60000 1.31 Not Verified 9.92
3 75000 1.06 Verified 26.3
4 75000 0.574 Not Verified 9.92
5 254000 0.238 Not Verified 9.43
6 67000 1.08 Source Verified 9.92
7 28800 0.0997 Source Verified 17.1
8 80000 0.351 Not Verified 6.08
9 34000 0.698 Not Verified 7.97
10 80000 0.167 Source Verified 12.6
# … with 40 more rows
```

**Predictors**:

- `annual_income`: Annual income
- `debt_to_income`: Debt-to-income ratio, i.e. the percentage of a borrower’s total debt divided by their total income
- `verified_income`: Whether borrower’s income source and amount have been verified (`Not Verified`, `Source Verified`, `Verified`)

**Outcome**:

- `interest_rate`: Interest rate for the loan

`interest_rate`

min | median | max |
---|---|---|
5.31 | 9.93 | 26.3 |

term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 10.726 | 1.507 | 7.116 | 0.000 | 7.690 | 13.762 |
debt_to_income | 0.671 | 0.676 | 0.993 | 0.326 | -0.690 | 2.033 |
verified_incomeSource Verified | 2.211 | 1.399 | 1.581 | 0.121 | -0.606 | 5.028 |
verified_incomeVerified | 6.880 | 1.801 | 3.820 | 0.000 | 3.253 | 10.508 |
annual_income_th | -0.021 | 0.011 | -1.804 | 0.078 | -0.043 | 0.002 |

Describe the subset of borrowers who are expected to get an interest rate of 10.726% based on our model. Is this interpretation meaningful? Why or why not?

If we are interested in interpreting the intercept, we can **mean-center** the quantitative predictors in the model.

We can mean-center a quantitative predictor \(X_j\) using the following:

\[X_{j_{Cent}} = X_{j}- \bar{X}_{j}\]

If we mean-center all quantitative variables, then the intercept is interpreted as the expected value of the response variable when all quantitative variables are at their mean value.
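A minimal base R sketch of mean-centering, assuming only the ten rows printed above (so the estimates will not match the 50-loan model) and leaving out `verified_income` to keep it short. The names `loans10`, `fit_raw`, and `fit_cent` are illustrative, not from the slides.

```r
# First ten rows of loan50 as printed above (values rounded as displayed)
loans10 <- data.frame(
  annual_income  = c(59000, 60000, 75000, 75000, 254000, 67000, 28800, 80000, 34000, 80000),
  debt_to_income = c(0.558, 1.31, 1.06, 0.574, 0.238, 1.08, 0.0997, 0.351, 0.698, 0.167),
  interest_rate  = c(10.9, 9.92, 26.3, 9.92, 9.43, 9.92, 17.1, 6.08, 7.97, 12.6)
)

# Rescale income to thousands, then subtract each predictor's mean
loans10$annual_income_th      <- loans10$annual_income / 1000
loans10$debt_inc_cent         <- loans10$debt_to_income - mean(loans10$debt_to_income)
loans10$annual_income_th_cent <- loans10$annual_income_th - mean(loans10$annual_income_th)

fit_raw  <- lm(interest_rate ~ debt_to_income + annual_income_th, data = loans10)
fit_cent <- lm(interest_rate ~ debt_inc_cent + annual_income_th_cent, data = loans10)

# Slopes are unchanged by centering; only the intercept moves
slopes_raw  <- unname(coef(fit_raw)[-1])
slopes_cent <- unname(coef(fit_cent)[-1])

# With every predictor in this sketch centered, the intercept is the mean response
intercept_cent <- unname(coef(fit_cent)[1])
```

Because this sketch has no categorical predictor, the centered intercept is simply the mean interest rate in `loans10`. In a model that also includes `verified_income`, the intercept would instead describe the baseline category at the mean of the quantitative predictors.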

How do you expect the model to change if we use `debt_inc_cent` and `annual_income_th_cent` in the model?

```
# A tibble: 5 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 9.44 0.977 9.66 1.50e-12 7.48 11.4
2 debt_inc_cent 0.671 0.676 0.993 3.26e- 1 -0.690 2.03
3 verified_incomeSourc… 2.21 1.40 1.58 1.21e- 1 -0.606 5.03
4 verified_incomeVerif… 6.88 1.80 3.82 4.06e- 4 3.25 10.5
5 annual_income_th_cent -0.0205 0.0114 -1.80 7.79e- 2 -0.0434 0.00238
```

Original model

term | estimate |
---|---|
(Intercept) | 10.726 |
debt_to_income | 0.671 |
verified_incomeSource Verified | 2.211 |
verified_incomeVerified | 6.880 |
annual_income_th | -0.021 |

Mean-centered model

term | estimate |
---|---|
(Intercept) | 9.444 |
debt_inc_cent | 0.671 |
verified_incomeSource Verified | 2.211 |
verified_incomeVerified | 6.880 |
annual_income_th_cent | -0.021 |
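As a rough check on how the two intercepts relate, write \(\hat{\beta}_1\) and \(\hat{\beta}_2\) for the slopes of debt-to-income and annual income (in thousands); this notation is introduced here and is not from the slides. The sample means below are assumptions inferred from the first row of the raw and centered data printed in this deck: about 0.723 for `debt_to_income` and about 86.17 for `annual_income_th`. The income slope uses the less rounded value -0.0205 from the tibble output.

\[\hat{\beta}_{0_{Cent}} = \hat{\beta}_0 + \hat{\beta}_1\bar{X}_1 + \hat{\beta}_2\bar{X}_2 \approx 10.726 + 0.671(0.723) + (-0.0205)(86.17) \approx 9.44\]

The slopes are unchanged because subtracting a constant from a predictor shifts it without rescaling it.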

Suppose there is a categorical variable with \(K\) categories (levels)

We can make \(K\) indicator variables - one indicator for each category

An **indicator variable** takes values 1 or 0:

- 1 if the observation belongs to that category
- 0 if the observation does not belong to that category

`verified_income`

```
# A tibble: 3 × 4
verified_income not_verified source_verified verified
<fct> <dbl> <dbl> <dbl>
1 Not Verified 1 0 0
2 Verified 0 0 1
3 Source Verified 0 1 0
```

- We will use \(K-1\) of the indicator variables in the model.
- The **baseline** is the category that doesn’t have a term in the model.
- The coefficients of the indicator variables in the model are interpreted as the expected change in the response compared to the baseline, holding all other variables constant.
- This approach is also called **dummy coding**.

```
# A tibble: 3 × 3
verified_income source_verified verified
<fct> <dbl> <dbl>
1 Not Verified 0 0
2 Verified 0 1
3 Source Verified 1 0
```
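The two tables above can be reproduced with a short base R sketch. It assumes a three-value example vector with `Not Verified` set as the first factor level; the object names `all_indicators` and `dummy_coded` are illustrative.

```r
# Three example values of verified_income, with "Not Verified" as the first level
verified_income <- factor(
  c("Not Verified", "Verified", "Source Verified"),
  levels = c("Not Verified", "Source Verified", "Verified")
)

# All K = 3 indicators, built by hand
all_indicators <- data.frame(
  not_verified    = as.numeric(verified_income == "Not Verified"),
  source_verified = as.numeric(verified_income == "Source Verified"),
  verified        = as.numeric(verified_income == "Verified")
)

# model.matrix() applies dummy coding by default: an intercept plus K - 1
# indicators, dropping the first factor level ("Not Verified") as the baseline
dummy_coded <- model.matrix(~ verified_income)
```

Each row of `all_indicators` has exactly one 1, while the baseline row of `dummy_coded` has 0 in both indicator columns.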

`verified_income`

term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 9.444 | 0.977 | 9.663 | 0.000 | 7.476 | 11.413 |
debt_inc_cent | 0.671 | 0.676 | 0.993 | 0.326 | -0.690 | 2.033 |
verified_incomeSource Verified | 2.211 | 1.399 | 1.581 | 0.121 | -0.606 | 5.028 |
verified_incomeVerified | 6.880 | 1.801 | 3.820 | 0.000 | 3.253 | 10.508 |
annual_income_th_cent | -0.021 | 0.011 | -1.804 | 0.078 | -0.043 | 0.002 |

- The baseline category is `Not Verified`.
- People with source verified income are expected to take a loan with an interest rate that is 2.211 percentage points higher, on average, than the rate on loans to those whose income is not verified, holding all else constant.
- People with verified income are expected to take a loan with an interest rate that is 6.880 percentage points higher, on average, than the rate on loans to those whose income is not verified, holding all else constant.
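As a worked example using the mean-centered estimates, the expected interest rate for a borrower with average debt-to-income ratio and average annual income (so both centered predictors equal 0) would be roughly:

\[\begin{aligned}
\text{Not Verified:} &\quad 9.444 \\
\text{Source Verified:} &\quad 9.444 + 2.211 = 11.655 \\
\text{Verified:} &\quad 9.444 + 6.880 = 16.324
\end{aligned}\]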

- Sometimes the relationship between a predictor variable and the response depends on the value of another predictor variable.
- This is an **interaction effect**.
- To account for this, we can include **interaction terms** in the model.

The lines are not parallel, indicating there is an **interaction effect**. The slope of annual income differs based on the income verification.

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 9.484 | 0.989 | 9.586 | 0.000 |
debt_inc_cent | 0.691 | 0.685 | 1.009 | 0.319 |
annual_income_th_cent | -0.007 | 0.020 | -0.341 | 0.735 |
verified_incomeSource Verified | 2.157 | 1.418 | 1.522 | 0.135 |
verified_incomeVerified | 7.181 | 1.870 | 3.840 | 0.000 |
annual_income_th_cent:verified_incomeSource Verified | -0.016 | 0.026 | -0.643 | 0.523 |
annual_income_th_cent:verified_incomeVerified | -0.032 | 0.033 | -0.979 | 0.333 |

- What the interaction means: The effect of annual income on the interest rate differs by -0.016 when the income is source verified compared to when it is not verified, holding all else constant.
- Interpreting `annual_income` for source verified: If the income is source verified, we expect the interest rate to decrease by 0.023 percentage points (-0.007 + -0.016 = -0.023) for each additional thousand dollars in annual income, holding all else constant.
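The same arithmetic gives the estimated slope of annual income (per additional thousand dollars) for each verification level, using the rounded estimates from the table:

\[\begin{aligned}
\text{Not Verified:} &\quad -0.007 \\
\text{Source Verified:} &\quad -0.007 + (-0.016) = -0.023 \\
\text{Verified:} &\quad -0.007 + (-0.032) = -0.039
\end{aligned}\]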

Defining the interaction variable in the model formula as `verified_income * annual_income_th_cent` is an implicit data manipulation step as well.

```
Rows: 50
Columns: 9
$ `(Intercept)` <dbl> 1, 1, 1, 1, 1, …
$ debt_inc_cent <dbl> -0.16511719, 0.…
$ annual_income_th_cent <dbl> -27.17, -26.17,…
$ `verified_incomeNot Verified` <dbl> 1, 1, 0, 1, 1, …
$ `verified_incomeSource Verified` <dbl> 0, 0, 0, 0, 0, …
$ verified_incomeVerified <dbl> 0, 0, 1, 0, 0, …
$ `annual_income_th_cent:verified_incomeNot Verified` <dbl> -27.17, -26.17,…
$ `annual_income_th_cent:verified_incomeSource Verified` <dbl> 0.00, 0.00, 0.0…
$ `annual_income_th_cent:verified_incomeVerified` <dbl> 0.00, 0.00, -11…
```
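A minimal base R sketch of this implicit step, assuming only the ten rows of `loan50` printed earlier; `loans10`, `X`, and `by_hand` are illustrative names. Note that base R's default dummy coding drops the `Not Verified` columns, so this sketch produces fewer columns than the output above.

```r
# First ten rows of loan50 as printed earlier (illustrative subset)
loans10 <- data.frame(
  annual_income_th = c(59, 60, 75, 75, 254, 67, 28.8, 80, 34, 80),
  verified_income  = factor(
    c("Not Verified", "Not Verified", "Verified", "Not Verified", "Not Verified",
      "Source Verified", "Source Verified", "Not Verified", "Not Verified",
      "Source Verified"),
    levels = c("Not Verified", "Source Verified", "Verified")
  )
)
loans10$annual_income_th_cent <- loans10$annual_income_th - mean(loans10$annual_income_th)

# In a formula, a * b expands to a + b + a:b
X <- model.matrix(~ annual_income_th_cent * verified_income, data = loans10)

# Each interaction column is the centered income multiplied by an indicator
by_hand <- unname(loans10$annual_income_th_cent * X[, "verified_incomeVerified"])
```

Calling `colnames(X)` lists the intercept, the centered income, the two indicators, and the two interaction columns that the formula created.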

Mean-centering quantitative predictors

Using indicator variables for categorical predictors

Using interaction terms

Data manipulation, with **dplyr** (from **tidyverse**):

```
library(tidyverse)
library(openintro)  # provides the loan50 data frame

loan50 %>%
  select(interest_rate, annual_income, debt_to_income, verified_income) %>%
  mutate(
    # 1. rescale income
    annual_income_th = annual_income / 1000,
    # 2. mean-center quantitative predictors
    debt_inc_cent = debt_to_income - mean(debt_to_income),
    annual_income_th_cent = annual_income_th - mean(annual_income_th),
    # 3. create dummy variables for verified_income
    source_verified = if_else(verified_income == "Source Verified", 1, 0),
    verified = if_else(verified_income == "Verified", 1, 0),
    # 4. create interaction variables
    `annual_income_th_cent:verified_incomeSource Verified` = annual_income_th_cent * source_verified,
    `annual_income_th_cent:verified_incomeVerified` = annual_income_th_cent * verified
  )
```

**Feature engineering**, with **recipes** (from **tidymodels**):

```
library(tidymodels)
library(openintro)  # provides the loan50 data frame

loan_rec <- recipe( ~ ., data = loan50) %>%
  # 1. rescale income
  step_mutate(annual_income_th = annual_income / 1000) %>%
  # 2. mean-center quantitative predictors
  step_center(all_numeric_predictors()) %>%
  # 3. create dummy variables for verified_income
  step_dummy(verified_income) %>%
  # 4. create interaction variables
  step_interact(terms = ~ annual_income_th:verified_income)
```

```
Recipe
Inputs:
role #variables
predictor 24
Operations:
Variable mutation
Centering for all_numeric_predictors()
Dummy variables from verified_income
Interactions with annual_income_th:verified_income
```