library(tidyverse)
library(tidymodels)
library(knitr)
library(palmerpenguins)
HW 3 - Logistic regression and log transformation
Due Friday, March 25, 5pm on Gradescope
Introduction
In this assignment, you’ll get to put into practice the logistic regression skills you’ve developed.
Learning goals
In this assignment, you will…
- Fit and interpret logistic regression models.
- Fit and interpret multiple linear regression models with log transformed outcomes.
- Reason around log transformations of various types.
- Continue developing a workflow for reproducible data analysis.
Getting started
Your repo for this assignment is at github.com/sta210-s22 and starts with the prefix hw-3. For more detailed instructions on getting started, see HW 1.
Packages
The following packages will be used in this assignment. You can add other packages as needed.
Part 1 - Palmer penguins
In this part we’ll go back to the Palmer penguins dataset from HW 2.
We will use the following variables:
variable | class | description |
---|---|---|
species | factor | Penguin species (Adelie, Gentoo, Chinstrap) |
island | factor | Island where recorded (Biscoe, Dream, Torgersen) |
flipper_length_mm | integer | Flipper length in mm |
The goal of this analysis is to use logistic regression to understand the relationship between flipper length, island, and whether a penguin is from the Adelie species. First, we need to create a new response variable to identify whether a penguin is from the Adelie species.
penguins <- penguins %>%
  mutate(adelie = factor(if_else(species == "Adelie", 1, 0)))
And let’s check to make sure the new variable looks right before we continue with the analysis.
penguins %>%
  count(adelie, species)
# A tibble: 3 × 3
adelie species n
<fct> <fct> <int>
1 0 Chinstrap 68
2 0 Gentoo 124
3 1 Adelie 152
Let’s start by looking at the relationship between island and whether a penguin is from the Adelie species.
What does the values_fill argument do in the following chunk? The documentation for the function will be helpful in answering this question.
penguins %>%
  count(island, adelie) %>%
  pivot_wider(names_from = adelie, values_from = n, values_fill = 0)
# A tibble: 3 × 3
island `0` `1`
<fct> <int> <int>
1 Biscoe 124 44
2 Dream 68 56
3 Torgersen 0 52
Calculate the odds ratio of a penguin being from the Adelie species for those recorded on Dream compared to those recorded on Biscoe.
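As a reminder of the mechanics, an odds ratio can be computed directly from a two-way table of counts. The sketch below uses made-up counts: the groups A and B and all numbers are hypothetical, not the penguin data.

```r
# Hypothetical counts: rows are groups, columns are the outcome (no / yes)
counts <- matrix(
  c(30, 10,   # group A: 30 "no", 10 "yes"
    20, 20),  # group B: 20 "no", 20 "yes"
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("A", "B"), outcome = c("no", "yes"))
)

# Odds of "yes" within each group = count of "yes" / count of "no"
odds <- counts[, "yes"] / counts[, "no"]

# Odds ratio of "yes" for group B compared to group A
odds_ratio <- odds[["B"]] / odds[["A"]]
odds_ratio
```

The same two steps (odds within each group, then their ratio) apply to any pair of rows in a table of counts.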
You want to fit a model using island to predict the odds of being from the Adelie species. Let \(\pi\) be the probability a penguin is from the Adelie species. The model has the following form. What do you expect the value of \(\hat{\beta}_1\), the estimated coefficient for Dream, to be? Explain your reasoning.
\[ \log\Big(\frac{\pi}{1-\pi}\Big) = \beta_0 + \beta_1 ~ Dream + \beta_2 ~ Torgersen \]
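To build intuition for a model of this form, here is a minimal sketch on made-up data, using base R's glm() rather than the tidymodels workflow. The data frame toy, its group levels A and B, and the counts are all hypothetical. It compares the fitted coefficient of an indicator variable with a quantity computed by hand from the counts.

```r
# Hypothetical data: 40 observations in group A and 40 in group B
toy <- data.frame(
  group = factor(rep(c("A", "B"), times = c(40, 40))),
  y     = c(rep(c(0, 1), times = c(30, 10)),   # group A: 30 zeros, 10 ones
            rep(c(0, 1), times = c(20, 20)))   # group B: 20 zeros, 20 ones
)

# Logistic regression with group A as the baseline level
toy_fit <- glm(y ~ group, data = toy, family = binomial)

# Log odds ratio of y = 1 for group B vs. group A, computed from the counts
odds_a <- 10 / 30
odds_b <- 20 / 20
log_or <- log(odds_b / odds_a)

# Compare the fitted coefficient for group B to the hand-computed value
c(slope = unname(coef(toy_fit)["groupB"]), log_or = log_or)
```

With a single categorical predictor, the two numbers should agree up to numerical tolerance.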
- Fit a model predicting adelie from island and display the model output. For the following exercise, use this model.
- Based on this model, what are the odds of a penguin being from the Adelie species if it was recorded on Biscoe island? on Dream island?
- Next, add flipper length to the model so that there are two predictors. Display the model output. For the following exercises, use this model.
- Write the regression equation for the model.
- Interpret the coefficient of flipper_length_mm in terms of the log-odds of being from the Adelie species.
- Interpret the coefficient of flipper_length_mm in terms of the odds of being from the Adelie species.
- Interpret the coefficient of Dream in terms of the odds of being from the Adelie species.
- How do you expect the log-odds of being from the Adelie species to change when going from a penguin with flipper length 185 mm to a penguin with flipper length 200 mm? Assume both penguins were recorded on the Dream island.
- How do you expect the odds of being from the Adelie species to change when going from a penguin with flipper length 185 mm to a penguin with flipper length 200 mm? Assume both penguins were recorded on the Dream island.
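The last two exercises rely on how a change in a numeric predictor carries through to the log-odds and odds scales. A minimal sketch follows, assuming a hypothetical slope beta_x and hypothetical predictor values, none of which come from the penguin model.

```r
# Hypothetical slope for a numeric predictor x (not an estimate from a real model)
beta_x <- -0.05

# Two values of x to compare, holding all other predictors fixed
x_from <- 10
x_to   <- 30

# The change in log-odds is the slope times the change in x
delta_log_odds <- beta_x * (x_to - x_from)

# The odds are multiplied by exp(change in log-odds)
odds_multiplier <- exp(delta_log_odds)

c(delta_log_odds = delta_log_odds, odds_multiplier = odds_multiplier)
```

Because the other predictors are held fixed, their terms cancel when the two log-odds are subtracted, which is why only the slope of x appears.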
Part 2 - GDP and Urban population
Data on countries’ Gross Domestic Product (GDP) and percentage of urban population was collected and made available by The World Bank in 2020. Descriptions of the variables, as defined by The World Bank, are provided below.
- GDP: “GDP per capita is gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in current U.S. dollars.”
- Urban Population (% of total): “Urban population refers to people living in urban areas as defined by national statistical offices. It is calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects.”
The data can be found in the data folder of your repository. Read the data and name it gdp_2020.
- Fit a model predicting GDP from urban population. Then make a plot of residuals vs. fitted for this model. Does the linear model seem appropriate for modeling this relationship? Explain your reasoning.
- Add a new column to the gdp_2020 dataset called gdp_log which is the (natural) log of gdp.
- Fit a new model, predicting the log of GDP from urban population. Then make a plot of residuals vs. fitted for this model. Does the model predicting logged GDP or original GDP appear to be a better fit? Explain your reasoning.
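The workflow in these exercises (fit, transform the outcome, refit, compare residual plots) can be sketched on simulated data. Everything below is hypothetical: the data frame sim, its columns x and y, and the generating coefficients are made up for illustration. Base R's lm() and plot() are used here instead of the tidymodels and ggplot2 tools from class.

```r
set.seed(210)

# Simulated data (hypothetical): y grows exponentially in x with multiplicative noise
sim <- data.frame(x = runif(200, min = 0, max = 100))
sim$y <- exp(1 + 0.04 * sim$x + rnorm(200, sd = 0.5))

# One model on the original scale, one on the log scale
fit_raw <- lm(y ~ x, data = sim)
fit_log <- lm(log(y) ~ x, data = sim)

# Side-by-side residuals vs. fitted plots
op <- par(mfrow = c(1, 2))
plot(fitted(fit_raw), resid(fit_raw),
     xlab = "Fitted", ylab = "Residual", main = "Outcome: y")
abline(h = 0, lty = 2)
plot(fitted(fit_log), resid(fit_log),
     xlab = "Fitted", ylab = "Residual", main = "Outcome: log(y)")
abline(h = 0, lty = 2)
par(op)
```

When comparing the two panels, look for curvature and for changes in the vertical spread of the residuals across the range of fitted values.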
The model output for predicting logged GDP is shown below.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 6.107 | 0.202 | 30.291 | 0 |
urban | 0.042 | 0.003 | 13.769 | 0 |
The linear model for predicting log of GDP can be expressed as follows:
\[ \widehat{\log(GDP)} = 6.11 + 0.042 \times urban \]
Therefore, the coefficient of urban (0.042) can be interpreted as the change in logged GDP associated with a 1 percentage point increase in urban population. The problem is, logged GDP is not a very informative value to talk about. So we need to undo the transformation we’ve done.
To do so, let’s do a quick review of some properties of logs.
- Subtraction and logs: \(\log(a) - \log(b) = \log\Big(\frac{a}{b}\Big)\)
- Natural logarithm: \(e^{\log(x)} = x\)
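Both properties are easy to confirm numerically. The values of a, b, and x below are arbitrary and chosen only for illustration.

```r
# Arbitrary positive values
a <- 50
b <- 20
x <- 7.3

# Subtraction and logs: log(a) - log(b) equals log(a / b)
all.equal(log(a) - log(b), log(a / b))

# Natural logarithm: exp() undoes log()
all.equal(exp(log(x)), x)
```

In R, log() is the natural logarithm by default, which matches the notation used here.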
Based on the interpretation of the slope above, the difference between the predicted values of logged GDP for a given value of urban and a value that is 1 percentage point higher is 0.042. Let’s write this out mathematically, and then use the properties we’ve listed above to work through the equation.
\[ \begin{aligned} \log(\text{GDP for urban } x + 1) - \log(\text{GDP for urban } x) &= 0.042 \\ \log\Big( \frac{\text{GDP for urban } x + 1}{\text{GDP for urban } x} \Big) &= 0.042 \\ e^{\log\Big( \frac{\text{GDP for urban } x + 1}{\text{GDP for urban } x} \Big)} &= e^{0.042}\\ \frac{\text{GDP for urban } x + 1}{\text{GDP for urban } x} &= e^{0.042} \end{aligned} \]
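The same chain of steps can be checked numerically for any model with a logged outcome. The sketch below uses hypothetical coefficients b0 and b1 and a hypothetical predictor value x. These are not the GDP estimates.

```r
# Hypothetical coefficients for a model of log(y) on x
b0 <- 2
b1 <- 0.1

# Predicted log(y) at x and at x + 1
x <- 5
log_y_at_x      <- b0 + b1 * x
log_y_at_x_plus <- b0 + b1 * (x + 1)

# Back-transform both predictions to the original scale
y_at_x      <- exp(log_y_at_x)
y_at_x_plus <- exp(log_y_at_x_plus)

# The ratio of the back-transformed predictions equals exp(slope)
all.equal(y_at_x_plus / y_at_x, exp(b1))
```

The ratio does not depend on the starting value of x, so changing x in the sketch leaves the comparison unchanged.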
Based on the derivation above, fill in the blanks in the following sentence for an alternative (and more useful) interpretation of the slope of urban.
For each additional percentage point the urban population is higher, the GDP of a country is expected to be ___, on average, by a factor of ___.
Submission
To submit your assignment:
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials ➡️ Duke NetID and log in using your NetID credentials.
- Click on your STA 210 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark the pages associated with each exercise. All of the pages of your assignment should be associated with at least one question (i.e., should be “checked”).
- Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.
Grading
Total points available: 50 points.
Component | Points |
---|---|
Ex 1 - 9 | 45 |
Workflow & formatting | 5¹ |
Footnotes
The “Workflow & formatting” grade is to assess the reproducible workflow. This includes having at least 3 informative commit messages and updating the name and date in the YAML.