library(tidyverse)
library(tidymodels)
library(knitr)
Lab 6 - Why Many Americans Don’t Vote
Introduction
In today’s lab you will analyze data from an online Ipsos survey that was conducted for the FiveThirtyEight article “Why Many Americans Don’t Vote”.
Learning goals
By the end of the lab you will be able to…
- Conduct exploratory data analysis for multinomial logistic regression.
- Fit and interpret coefficients of the multinomial logistic regression model.
- Use the multinomial logistic regression model for prediction.
Getting started
- A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.
- Go to the sta210-s22 organization on GitHub. Click on the repo with the prefix lab-6. It contains the starter documents you need to complete the lab.
- Each person on the team should clone the repository and open a new project in RStudio. Throughout the lab, each person should get a chance to make commits and push to the repo.
Packages
The following packages are used in the lab.
Data: Five Thirty Eight
The data for this assignment comes from an online Ipsos survey that was conducted for the FiveThirtyEight article “Why Many Americans Don’t Vote”. You can read more about the survey design and respondents in the README of the GitHub repo for the data.
Respondents were asked a variety of questions about their political beliefs, thoughts on multiple issues, and voting behavior. We will focus on using the demographic variables and someone’s party identification to understand whether a person is a probable voter.
The variables we’ll focus on are (definitions from the codebook in data set GitHub repo):
ppage
: Age of respondenteduc
: Highest educational attainment category.race
: Race of respondent, census categories. Note: all categories except Hispanic are non-Hispanic.gender
: Gender of respondentincome_cat
: Household income category of respondentQ30
: Response to the question “Generally speaking, do you think of yourself as a…”- 1: Republican
- 2: Democrat
- 3: Independent
- 4: Another party, please specify
- 5: No preference
- -1: No response
voter_category
: past voting behavior:- always: respondent voted in all or all-but-one of the elections they were eligible in
- sporadic: respondent voted in at least two, but fewer than all-but-one of the elections they were eligible in
- rarely/never: respondent voted in 0 or 1 of the elections they were eligible in
You can read in the data directly from the GitHub repo:
<- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/non-voters/nonvoters_data.csv") voters
Note that the authors use weighting to make the final sample more representative on the US population for their article. We will not use weighting in this assignment, so we should treat the sample as a convenience sample rather than a random sample of the population.
Exercises
Why do you think the authors chose to only include data from people who were eligible to vote for at least four election cycles?
Let’s prepare the data for analysis and modeling.
- The variable
Q30
contains the respondent’s political party identification. Make a new variable that simplifiesQ30
into four categories: “Democrat”, “Republican”, “Independent”, “Other” (“Other” also includes respondents who did not answer the question). - The variable
voter_category
identifies the respondent’s past voter behavior. Relevel the variable to make rarely/never the baseline level, followed by sporadic, then always.
- The variable
In the FiveThirtyEight article, the authors include visualizations of the relationship between the voter category and demographic variables such as race, age, education, etc. Select two demographic variables. For each variable, interpret the plot to describe its relationship with voter category.
Fit a model using mean-centered age, race, gender, income, and education to predict voter category. Show the code used to fit the model, but do not display the model output.
Next, we want to determine whether party identification be added to the model. In order to do this we need to compare two nested models.
The reduced model is the one we fit so far, including the predictors mean-centered age, race, gender, income, and education.
The full model is the one that includes, in addition to these predictors, party identification.
- Should party identification be added to the model? Use a drop-in-deviance test to determine if party identification should be added to the model. Include the hypotheses in mathematical notation, the output from the test, and the conclusion in the context of the data. Then, neatly display the model you selected.
Use the model you select for the remainder of the assignment.
Interpret the following coefficients in the context of the data in terms of the odds of voting sporadically versus rarely/never.
- Interpret the intercept in the context of the data. Use actual values in the interpretation.
- Interpret the effect of age in the context of the data.
- Interpret the effect of party ID in the context of the data. Include discussion about which level(s) differ from the baseline.
In the article, the authors write
“Nonvoters were more likely to have lower incomes; to be young; to have lower levels of education; and to say they don’t belong to either political party, which are all traits that square with what we know about people less likely to engage with the political system.”
Does your model support this statement? Briefly explain why or why not.
Let’s use the model to predict the voting categories. Obtain the predicted voter category for each observation.
- Create a table of the actual versus predicted voter categories and a visualization of the association between the two.
- How well did the model perform? Briefly assess the model performance using 2 - 3 observations from the table and/or visualization to support your response.
Submission
To submit your assignment:
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials ➡️ Duke NetID and log in using your NetID credentials.
- Click on your STA 210 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
- Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.
Grading
Total points available: 50 points.
Component | Points |
---|---|
Ex 1 - 10 | 45 |
Workflow & formatting | 51 |
Footnotes
The “Workflow & formatting” grade is to assess the reproducible workflow. This includes having at least 3 informative commit messages and updating the name and date in the YAML.↩︎