STA 210 - Spring 2022
Schedule changes for the remainder of the semester
Thursday office hours in my office: 213 Old Chem
Any questions on project proposals?
Logistic regression for binary response variable
Relationship between odds and probabilities
Use logistic regression model to calculate predicted odds and probabilities
Quantitative outcome variable:
Categorical outcome variable:
Method | Outcomes | Example |
---|---|---|
Logistic regression | 2 outcomes | 1: Yes, 0: No |
Multinomial logistic regression | 3+ outcomes | 1: Democrat, 2: Republican, 3: Independent |
Students in grades 9 - 12 surveyed about health risk behaviors including whether they usually get 7 or more hours of sleep.
Sleep7
: 1: yes, 0: no
data(YouthRisk2009)
sleep <- YouthRisk2009 %>%
  as_tibble() %>%                       # convert to a tibble
  filter(!is.na(Age), !is.na(Sleep7))   # drop rows missing Age or Sleep7
sleep %>%
  relocate(Age, Sleep7)                 # move Age and Sleep7 to the front
# A tibble: 446 × 6
Age Sleep7 Sleep SmokeLife SmokeDaily MarijuaEver
<int> <int> <fct> <fct> <fct> <int>
1 16 1 8 hours Yes Yes 1
2 17 0 5 hours Yes Yes 1
3 18 0 5 hours Yes Yes 1
4 17 1 7 hours Yes No 1
5 15 0 4 or less hours No No 0
6 17 0 6 hours No No 0
7 17 1 7 hours No No 0
8 16 1 8 hours Yes No 0
9 16 1 8 hours No No 0
10 18 0 4 or less hours Yes Yes 1
# … with 436 more rows
Outcome: \(Y\) = 1: yes, 0: no
Outcome: Probability of getting 7+ hours of sleep
🛑 This model produces predictions outside of 0 and 1.
✅ This model (called a logistic regression model) only produces predictions between 0 and 1.
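A quick way to see why the logistic model is bounded: the inverse-logit function maps any real-valued linear predictor into the interval (0, 1). A minimal sketch in R, where `beta0`, `beta1`, and the predictor values are arbitrary illustrative numbers, not a fitted model:

```r
# The inverse logit, plogis(), maps any real number into (0, 1),
# so logistic regression predictions can never fall outside that range.
beta0 <- -5
beta1 <- 0.1
x <- c(-100, 0, 40, 200)           # even extreme predictor values...
p <- plogis(beta0 + beta1 * x)     # ...yield probabilities strictly in (0, 1)
p
```

By contrast, `beta0 + beta1 * x` itself (the linear model) is unbounded, which is the 🛑 problem above.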
Method | Outcome | Model |
---|---|---|
Linear regression | Quantitative | \(Y = \beta_0 + \beta_1~ X\) |
Linear regression (transform Y) | Quantitative | \(\log(Y) = \beta_0 + \beta_1~ X\) |
Logistic regression | Binary | \(\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1 ~ X\) |
Suppose there is a 70% chance it will rain tomorrow. What are the odds it will rain tomorrow?
# A tibble: 2 × 3
Sleep7 n p
<int> <int> <dbl>
1 0 150 0.336
2 1 296 0.664
\(P(\text{7+ hours of sleep}) = P(Y = 1) = p = 0.664\)
\(P(\text{< 7 hours of sleep}) = P(Y = 0) = 1 - p = 0.336\)
\(\text{odds}(\text{7+ hours of sleep}) = \frac{0.664}{0.336} = 1.976\)
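The same quantities can be reproduced directly from the counts in the table above (296 of the 446 students reported 7+ hours of sleep); a minimal sketch:

```r
# Counts from the summary table above
n_yes <- 296                      # Sleep7 == 1
n_no  <- 150                      # Sleep7 == 0
p     <- n_yes / (n_yes + n_no)   # P(7+ hours of sleep), about 0.664
odds  <- p / (1 - p)              # equivalently n_yes / n_no, about 1.973
c(p = p, odds = odds)
```

(The 1.976 on the slide comes from dividing the rounded proportions 0.664/0.336; the exact counts give 296/150 ≈ 1.973.)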
odds
\[\omega = \frac{\pi}{1-\pi}\]
probability
\[\pi = \frac{\omega}{1 + \omega}\]
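Applying these two conversions to the rain example (a 70% chance of rain tomorrow):

\[\omega = \frac{0.7}{1 - 0.7} = \frac{0.7}{0.3} \approx 2.33, \qquad \pi = \frac{2.33}{1 + 2.33} \approx 0.7\]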
\[\text{probability} = \pi = \frac{\exp\{\beta_0 + \beta_1~X\}}{1 + \exp\{\beta_0 + \beta_1~X\}}\]
Logit form: \[\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~X\]
Probability form:
\[ \pi = \frac{\exp\{\beta_0 + \beta_1~X\}}{1 + \exp\{\beta_0 + \beta_1~X\}} \]
This dataset is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to use age to predict whether a randomly selected adult is at high risk of having coronary heart disease in the next 10 years.
high_risk
: 1: High risk of coronary heart disease in the next 10 years, 0: Not high risk
age
: Age at exam time (in years)
heart_disease <- read_csv(here::here("slides", "data/framingham.csv")) %>%
  select(age, TenYearCHD) %>%                     # keep predictor and raw response
  drop_na() %>%                                   # drop rows with missing values
  mutate(high_risk = as.factor(TenYearCHD)) %>%   # recode response as a factor
  select(age, high_risk)
heart_disease
# A tibble: 4,240 × 2
age high_risk
<dbl> <fct>
1 39 0
2 46 0
3 48 0
4 61 1
5 46 0
6 43 0
7 63 1
8 45 0
9 52 0
10 43 0
# … with 4,230 more rows
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -5.561 | 0.284 | -19.599 | 0 |
age | 0.075 | 0.005 | 14.178 | 0 |
\[\log\Big(\frac{\hat{\pi}}{1-\hat{\pi}}\Big) = -5.561 + 0.075 \times \text{age}\] where \(\hat{\pi}\) is the predicted probability of being high risk
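As a check, the fitted equation can be evaluated for a given age; a minimal sketch using the rounded coefficients from the table (the output shown later, e.g. a `.fitted` value of -2.65 for the 39-year-old in observation 1, uses the unrounded coefficients, so small rounding differences are expected):

```r
# Rounded coefficients from the model output above
beta0 <- -5.561
beta1 <- 0.075

age <- 39
log_odds <- beta0 + beta1 * age   # predicted log-odds, about -2.64
prob     <- plogis(log_odds)      # predicted probability of being high risk
c(log_odds = log_odds, prob = prob)
```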
# A tibble: 4,240 × 8
high_risk age .fitted .resid .std.resid .hat .sigma .cooksd
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 39 -2.65 -0.370 -0.370 0.000466 0.895 0.0000165
2 0 46 -2.13 -0.475 -0.475 0.000322 0.895 0.0000192
3 0 48 -1.98 -0.509 -0.509 0.000288 0.895 0.0000199
4 1 61 -1.01 1.62 1.62 0.000706 0.895 0.000968
5 0 46 -2.13 -0.475 -0.475 0.000322 0.895 0.0000192
6 0 43 -2.35 -0.427 -0.427 0.000384 0.895 0.0000183
7 1 63 -0.858 1.56 1.56 0.000956 0.895 0.00113
8 0 45 -2.20 -0.458 -0.458 0.000342 0.895 0.0000189
9 0 52 -1.68 -0.585 -0.585 0.000262 0.895 0.0000244
10 0 43 -2.35 -0.427 -0.427 0.000384 0.895 0.0000183
# … with 4,230 more rows
For observation 1
\[\text{predicted odds} = \hat{\omega} = \frac{\hat{\pi}}{1-\hat{\pi}} = \exp\{-2.650\} = 0.071\]
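In R, this is just exponentiating the `.fitted` log-odds value shown in the augmented output above:

```r
log_odds <- -2.650        # .fitted value for observation 1
odds <- exp(log_odds)     # predicted odds, about 0.071
odds
```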
# A tibble: 4,240 × 2
.pred_0 .pred_1
<dbl> <dbl>
1 0.934 0.0660
2 0.894 0.106
3 0.878 0.122
4 0.733 0.267
5 0.894 0.106
6 0.913 0.0870
7 0.702 0.298
8 0.900 0.0996
9 0.843 0.157
10 0.913 0.0870
# … with 4,230 more rows
\[\text{predicted probabilities} = \hat{\pi} = \frac{\exp\{-2.650\}}{1 + \exp\{-2.650\}} = 0.066\]
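The same `.fitted` value gives this predicted probability, either through the probability form of the model or with R's built-in inverse logit `plogis()`:

```r
log_odds <- -2.650                        # .fitted value for observation 1
p1 <- exp(log_odds) / (1 + exp(log_odds)) # probability form of the model
p2 <- plogis(log_odds)                    # built-in inverse logit, same value
c(p1 = p1, p2 = p2)                       # both about 0.066
```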
For a logistic regression, the default prediction is the class.
What does the following table show?
Logistic regression for binary response variable
Relationship between odds and probabilities
Used logistic regression model to calculate predicted odds and probabilities