HW 3 - Logistic regression and log transformation

Due Friday, March 25, 5pm on Gradescope

Introduction

In this assignment, you’ll get to put into practice the logistic regression skills you’ve developed.

Learning goals

In this assignment, you will…

Fit and interpret logistic regression models.
Fit and interpret multiple linear regression models with log transformed outcomes.
Reason about log transformations of various types.
Continue developing a workflow for reproducible data analysis.

Getting started

Your repo for this assignment is at github.com/sta210-s22 and starts with the prefix hw-3. For more detailed instructions on getting started, see HW 1.

Packages

The following packages will be used in this assignment. You can add other packages as needed.

library(tidyverse)
library(tidymodels)
library(knitr)
library(palmerpenguins)

Part 1 - Palmer penguins

In this part we’ll go back to the Palmer penguins dataset from HW 2.

We will use the following variables:

variable	class	description
species	factor	Penguin species (Adelie, Gentoo, Chinstrap)
island	factor	Island where recorded (Biscoe, Dream, Torgersen)
flipper_length_mm	integer	Flipper length in mm

The goal of this analysis is to use logistic regression to understand the relationship between flipper length, island, and whether a penguin is from the Adelie species. First, we need to create a new response variable to identify whether a penguin is from the Adelie species.

penguins <- penguins %>%
  mutate(adelie = factor(if_else(species == "Adelie", 1, 0)))

And let’s check to make sure the new variable looks right before we continue with the analysis.

penguins %>%
  count(adelie, species)

# A tibble: 3 × 3
  adelie species       n
  <fct>  <fct>     <int>
1 0      Chinstrap    68
2 0      Gentoo      124
3 1      Adelie      152

Let’s start by looking at the relationship between island and whether a penguin is from the Adelie species.

What does the values_fill argument do in the following chunk? The documentation for the function will be helpful in answering this question.

penguins %>%
  count(island, adelie) %>%
  pivot_wider(names_from = adelie, values_from = n, values_fill = 0)

# A tibble: 3 × 3
  island      `0`   `1`
  <fct>     <int> <int>
1 Biscoe      124    44
2 Dream        68    56
3 Torgersen     0    52

Calculate the odds ratio of a penguin being from the Adelie species for those recorded on Dream compared to those recorded on Biscoe.
You want to fit a model using island to predict the odds of being from the Adelie species. Let \(\pi\) be the probability a penguin is from the Adelie species. The model has the following form. What do you expect the value of \(\hat{\beta}_1\), the estimated coefficient for Dream, to be? Explain your reasoning.

\[ \log\Big(\frac{\pi}{1-\pi}\Big) = \beta_0 + \beta_1 ~ Dream + \beta_2 ~ Torgersen \]
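As a quick refresher on the scales that appear in the model form above, the following sketch converts a probability to odds and log-odds and back using base R. The probability 0.25 is a made-up value for illustration only; it is not a quantity from the penguins data.

```r
# Hypothetical probability, chosen only for illustration
p <- 0.25

# Odds: probability of the event divided by probability of no event
odds <- p / (1 - p)

# Log-odds (logit): the left-hand side of the logistic regression model
log_odds <- log(odds)

# qlogis() is base R's logit function, so it should agree with log_odds
logit_check <- qlogis(p)

# Going back: exponentiate the log-odds to get odds,
# then use odds / (1 + odds) to recover the probability
odds_back <- exp(log_odds)
p_back <- odds_back / (1 + odds_back)
```

Because the model is linear on the log-odds scale, coefficients are differences in log-odds, and exponentiating a coefficient moves it to the odds scale.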

Fit a model predicting adelie from island and display the model output. For the following exercise, use this model.
Based on this model, what are the odds of a penguin being from the Adelie species if it was recorded on Biscoe island? On Dream island?
Next, add flipper length to the model so that there are two predictors. Display the model output. For the following exercises, use this model.
Write the regression equation for the model.
Interpret the coefficient of flipper_length_mm in terms of the log-odds of being from the Adelie species.
Interpret the coefficient of flipper_length_mm in terms of the odds of being from the Adelie species.
Interpret the coefficient of Dream in terms of the odds of being from the Adelie species.
How do you expect the log-odds of being from the Adelie species to change when going from a penguin with flipper length 185 mm to a penguin with flipper length 200 mm? Assume both penguins were recorded on Dream island.
How do you expect the odds of being from the Adelie species to change when going from a penguin with flipper length 185 mm to a penguin with flipper length 200 mm? Assume both penguins were recorded on Dream island.

Part 2 - GDP and Urban population

Data on countries’ Gross Domestic Product (GDP) and percentage of urban population were collected and made available by The World Bank in 2020. Descriptions of the variables, as defined by The World Bank, are provided below.

GDP: “GDP per capita is gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in current U.S. dollars.”
Urban Population (% of total): “Urban population refers to people living in urban areas as defined by national statistical offices. It is calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects.”

The data can be found in the data folder of your repository. Read the data and name it gdp_2020.

Fit a model predicting GDP from urban population. Then make a plot of residuals vs. fitted for this model. Does the linear model seem appropriate for modeling this relationship? Explain your reasoning.
Add a new column to the gdp_2020 dataset called gdp_log which is the (natural) log of gdp.
Fit a new model, predicting the log of GDP from urban population. Then make a plot of residuals vs. fitted for this model. Does the model predicting logged GDP or original GDP appear to be a better fit? Explain your reasoning.

The model output for predicting logged GDP is shown below.

term	estimate	std.error	statistic	p.value
(Intercept)	6.107	0.202	30.291	0
urban	0.042	0.003	13.769	0

The linear model for predicting log of GDP can be expressed as follows:

\[ \widehat{\log(GDP)} = 6.11 + 0.042 \times urban \]

Therefore, the coefficient of urban (0.042) can be interpreted as the change in logged GDP associated with a 1 percentage point increase in urban population. The problem is, logged GDP is not a very informative value to talk about. So we need to undo the transformation we’ve done.

To do so, let’s do a quick review of some properties of logs.

Subtraction and logs: \(\log(a) - \log(b) = \log(\frac{a}{b})\)
Natural logarithm: \(e^{\log(x)} = x\)
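If it helps to see these two properties numerically, here is a minimal check in base R. The values of a, b, and x are arbitrary positive numbers chosen for illustration; note that log() in R is the natural logarithm by default.

```r
# Arbitrary positive values, chosen only for illustration
a <- 12
b <- 3

# Subtraction and logs: log(a) - log(b) equals log(a / b)
lhs <- log(a) - log(b)
rhs <- log(a / b)
all.equal(lhs, rhs)

# Natural logarithm: exp() undoes log()
x <- 7.5
all.equal(exp(log(x)), x)
```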

Based on the interpretation of the slope above, the difference between the predicted values of logged GDP for a given value of urban and a value that is 1 percentage point higher is 0.042. Let’s write this out mathematically, and then use the properties we’ve listed above to work through the equation.

\[ \begin{aligned} \log(\text{GDP for urban } x + 1) - \log(\text{GDP for urban } x) &= 0.042 \\ \log\Big( \frac{\text{GDP for urban } x + 1}{\text{GDP for urban } x} \Big) &= 0.042 \\ e^{\log\Big( \frac{\text{GDP for urban } x + 1}{\text{GDP for urban } x} \Big)} &= e^{0.042} \\ \frac{\text{GDP for urban } x + 1}{\text{GDP for urban } x} &= e^{0.042} \end{aligned} \]
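One way to convince yourself of the last line of the derivation is to check it numerically. The sketch below uses the rounded coefficients from the model output above; the helper predict_gdp() and the urban percentages 30 and 70 are hypothetical choices made only for this illustration.

```r
# Rounded coefficients from the model output above
b0 <- 6.107
b1 <- 0.042

# Hypothetical helper: predicted GDP on the original (dollar) scale,
# obtained by exponentiating the predicted log of GDP
predict_gdp <- function(urban) exp(b0 + b1 * urban)

# Ratio of predicted GDP for a 1 percentage point increase,
# at two different (arbitrary) starting values of urban
ratio_at_30 <- predict_gdp(31) / predict_gdp(30)
ratio_at_70 <- predict_gdp(71) / predict_gdp(70)

# The ratio does not depend on the starting value and equals exp(b1)
all.equal(ratio_at_30, ratio_at_70)
all.equal(ratio_at_30, exp(b1))
```

The point of the check is that the ratio is the same no matter where you start, which is why the slope of a log-transformed outcome is read as a multiplicative change rather than an additive one.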

Based on the derivation above, fill in the blanks in the following sentence for an alternative (and more useful) interpretation of the slope of urban.

For each additional percentage point the urban population is higher, the GDP of a country is expected to be ___, on average, by a factor of ___.

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit your assignment:

Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials ➡️ Duke NetID and log in using your NetID credentials.
Click on your STA 210 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your assignment should be associated with at least one question (i.e., should be “checked”).
Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.

Grading

Total points available: 50 points.

Component	Points
Ex 1 - 9	45
Workflow & formatting	5¹

Footnotes

¹ The “Workflow & formatting” grade is to assess the reproducible workflow. This includes having at least 3 informative commit messages and updating the name and date in the YAML.