AE 4: Exam 1 Review

Important

Go to the course GitHub organization and locate the repo titled ae-4-exam-1-review-YOUR_GITHUB_USERNAME to get started.

Packages

library(tidyverse)
library(tidymodels)
library(ggfortify)
library(knitr)

Restaurant tips

What factors are associated with the amount customers tip at a restaurant? To answer this question, we will use data collected in 2011 by a student at St. Olaf who worked at a local restaurant. [1]

The variables we’ll focus on for this analysis are

  • Tip: amount of the tip
  • Party: number of people in the party

View the data set to see the remaining variables.

tips <- read_csv("data/tip-data.csv")

Exploratory analysis

  1. Visualize, summarize, and describe the relationship between Party and Tip.
# add your code here

Modeling

Let’s start by fitting a model using Party to predict the Tip at this restaurant.

  2. Write the statistical model.

  3. Fit the regression line and write the regression equation. Name the model tips_fit and display the results with kable() and a reasonable number of digits.

# add your code here
  4. Interpret the slope.

  5. Does it make sense to interpret the intercept? Explain your reasoning.
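
As a notation reminder only, here is the generic form of a simple linear regression with a response \(Y\) and a single predictor \(X\). The symbols are placeholders rather than the variables in this data set, and the normal-error statement is the assumption commonly made for theory-based inference, not something established by this activity. The first line is the statistical (population) model; the second is the fitted regression equation, which has hats on the estimates and no error term.

```latex
\[
Y = \beta_0 + \beta_1 X + \epsilon, \qquad \epsilon \sim N(0, \sigma_\epsilon^2)
\]
\[
\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X
\]
```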

Inference

Inference for the slope

  6. The following code can be used to create a bootstrap distribution for the slope (and the intercept, though we’ll focus primarily on the slope in our inference). Describe what each line of code does, supplemented by any visualizations that might help with your description.
set.seed(1234)

boot_dist <- tips %>%
  specify(Tip ~ Party) %>%
  generate(reps = 100, type = "bootstrap") %>%
  fit()
  7. Use the bootstrap distribution created in Exercise 6, boot_dist, to construct a 90% confidence interval for the slope with the percentile method, and interpret the interval in the context of the data.
# add your code here
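
The sketch below illustrates the general idea behind a bootstrap percentile interval using base R only, so it runs without the packages loaded above. It uses a small simulated data frame called toy (made up for illustration, not the restaurant data), and the helper boot_slope() and the choice of 1000 replicates are assumptions of this sketch rather than part of the activity.

```r
# Simulated stand-in data (NOT the restaurant data); column names mirror the worksheet
set.seed(1234)
n <- 60
toy <- data.frame(Party = sample(1:6, n, replace = TRUE))
toy$Tip <- 1 + 1.5 * toy$Party + rnorm(n, sd = 1.5)

# One bootstrap replicate: resample rows with replacement, refit, keep the slope
boot_slope <- function(data) {
  resampled <- data[sample(nrow(data), replace = TRUE), ]
  coef(lm(Tip ~ Party, data = resampled))[["Party"]]
}

# Repeat many times to build the bootstrap distribution of the slope
boot_slopes <- replicate(1000, boot_slope(toy))

# Percentile method: the middle 90% of the bootstrap slopes
ci_90 <- quantile(boot_slopes, probs = c(0.05, 0.95))
ci_90
```

With boot_dist from the activity, the same quantiles would be taken over the estimates for the slope term instead of over boot_slopes.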
  8. Conduct a permutation-based hypothesis test at the significance level equivalent to the 90% confidence level. State the hypotheses and the significance level you’re using explicitly. Also include a visualization of the null distribution of the slope with the observed slope marked as a vertical line.
# add your code here
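
For intuition about what a permutation test does, here is a minimal base R sketch on simulated data (the toy data frame and the perm_slope() helper are made up for illustration). It shuffles the response to mimic a null hypothesis of no linear relationship, refits the model each time, and compares the observed slope with the resulting null distribution. The two-sided p-value shown is one common definition (the proportion of permuted slopes at least as extreme in absolute value); other tools may compute it slightly differently.

```r
# Simulated stand-in data (NOT the restaurant data)
set.seed(1234)
n <- 60
toy <- data.frame(Party = sample(1:6, n, replace = TRUE))
toy$Tip <- 1 + 1.5 * toy$Party + rnorm(n, sd = 1.5)

# Observed slope from the original (unshuffled) data
obs_slope <- coef(lm(Tip ~ Party, data = toy))[["Party"]]

# Under H0 (slope = 0), shuffling Tip breaks any association with Party
perm_slope <- function(data) {
  shuffled <- data
  shuffled$Tip <- sample(shuffled$Tip)
  coef(lm(Tip ~ Party, data = shuffled))[["Party"]]
}
null_slopes <- replicate(1000, perm_slope(toy))

# Two-sided p-value: share of permuted slopes at least as extreme as observed
p_value <- mean(abs(null_slopes) >= abs(obs_slope))
p_value

# Null distribution with the observed slope marked as a vertical line
hist(null_slopes, main = "Null distribution of the slope", xlab = "Slope",
     xlim = range(c(null_slopes, obs_slope)))
abline(v = obs_slope, col = "red", lwd = 2)
```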
  9. Check the relevant conditions for Exercises 7 and 8. Are there any violations in conditions that make you reconsider your inferential findings?
# add your code here
  10. Now repeat Exercises 7 and 8 using approaches based on mathematical models.
# add your code here
  11. Check the relevant conditions for Exercise 10. Are there any violations in conditions that make you reconsider your inferential findings?
# add your code here

Inference for a prediction

  12. Based on your model, predict the tip for a party of 4.
# add your code here
  13. Suppose you’re asked to construct a confidence interval and a prediction interval for your finding in Exercise 12. Which one would you expect to be wider, and why? In your answer, clearly state the difference between these intervals.

  14. Now construct the intervals from Exercise 13 and comment on whether your guess is confirmed.

# add your code here
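
As a syntax reference, the base R sketch below requests both kinds of interval from predict() for a model fit with lm() on simulated data (toy and toy_fit are made up for illustration, and the 95% level is an arbitrary choice for this sketch). If your model was fit another way, the function call will differ.

```r
# Simulated stand-in data (NOT the restaurant data)
set.seed(1234)
n <- 60
toy <- data.frame(Party = sample(1:6, n, replace = TRUE))
toy$Tip <- 1 + 1.5 * toy$Party + rnorm(n, sd = 1.5)

toy_fit <- lm(Tip ~ Party, data = toy)

# New observation to predict for
new_party <- data.frame(Party = 4)

# Interval for the mean response at Party = 4
conf_int <- predict(toy_fit, newdata = new_party,
                    interval = "confidence", level = 0.95)

# Interval for an individual new response at Party = 4
pred_int <- predict(toy_fit, newdata = new_party,
                    interval = "prediction", level = 0.95)

# Each result is a one-row matrix with columns fit, lwr, upr
conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]
c(confidence = conf_width, prediction = pred_width)
```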

Model diagnostics

Leverage (outliers in x direction)

  15. What is the threshold used to identify observations with high leverage? Calculate the threshold and save the value as leverage_threshold.
# add your code here
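
One common rule of thumb flags an observation as high leverage when its leverage exceeds 2(p + 1)/n, where p is the number of predictors and n is the number of observations. Check the course notes for the exact rule expected here. The base R sketch below applies that rule to simulated data (toy, toy_fit, and toy_threshold are illustrative names, not objects from the activity).

```r
# Simulated stand-in data (NOT the restaurant data)
set.seed(1234)
n <- 60
toy <- data.frame(Party = sample(1:6, n, replace = TRUE))
toy$Tip <- 1 + 1.5 * toy$Party + rnorm(n, sd = 1.5)

toy_fit <- lm(Tip ~ Party, data = toy)

# p = number of predictors (coefficients minus the intercept)
p <- length(coef(toy_fit)) - 1

# Rule-of-thumb threshold: 2 * (p + 1) / n
toy_threshold <- 2 * (p + 1) / nrow(toy)

# Leverage (hat value) of each observation
leverage <- hatvalues(toy_fit)

# Rows whose leverage exceeds the threshold, if any
high_leverage <- toy[leverage > toy_threshold, ]
toy_threshold
high_leverage
```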
  16. Make a plot of the standardized residuals vs. leverage (you can do this with ggplot() or with autoplot(which = 5)). Use geom_vline() to add a vertical line to help identify points with high leverage.
# add your code here
  17. Let’s dig into the data further. Which observations have high leverage? Why do these points have high leverage?
# add your code here

Identifying outliers (outliers in y direction)

  18. Make a plot of the residuals vs. fitted values and a plot of the square root of the absolute value of standardized residuals vs. fitted values. (You can use autoplot(which = c(1, 3)) to display the plots side by side.)
  • How are the plots similar? How do they differ?
  • What is an advantage of using the plot of the residuals vs. fitted to check conditions and model diagnostics?
  • What is an advantage of using the plot of the \(\sqrt{|\text{standardized residuals}|}\) vs. fitted to check conditions and model diagnostics?
# add your code here
  19. Are there any observations that are outliers?
# add your code here

Cook’s distance

  20. Make a plot to check Cook’s distance (autoplot(which = 4)). Based on this plot, are there any points that have a strong influence on the model coefficients?
# add your code here
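
If you want the numeric values behind the plot, base R's cooks.distance() returns one value per observation. The sketch below uses simulated data (toy and toy_fit are made up for illustration), and the 0.5 cutoff is a frequently quoted rule of thumb assumed here, not a threshold set by this activity.

```r
# Simulated stand-in data (NOT the restaurant data)
set.seed(1234)
n <- 60
toy <- data.frame(Party = sample(1:6, n, replace = TRUE))
toy$Tip <- 1 + 1.5 * toy$Party + rnorm(n, sd = 1.5)

toy_fit <- lm(Tip ~ Party, data = toy)

# Cook's distance for every observation
cooks <- cooks.distance(toy_fit)

# Indices of observations above the assumed rule-of-thumb cutoff
influential <- which(cooks > 0.5)
influential

# Simple index plot with the cutoff drawn as a dashed line
plot(cooks, type = "h", xlab = "Observation", ylab = "Cook's distance",
     ylim = c(0, max(0.5, max(cooks))))
abline(h = 0.5, lty = 2)
```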

Adding another variable

  21. Add another variable, Alcohol, to your exploratory visualization. Describe any patterns that emerge.
# add your code here
  22. Fit a multiple linear regression model predicting Tip from Party and Alcohol. Display the results with kable() and a reasonable number of digits.
# add your code here
  23. Interpret each of the slopes.

  24. Does it make sense to interpret the intercept? Explain your reasoning.

  25. According to this model, is the rate of change in tip amount with party size the same regardless of alcohol consumption, or does it differ? Explain your reasoning.

Footnotes

  1. Dahlquist, Samantha, and Jin Dong. 2011. “The Effects of Credit Cards on Tipping.” Project for Statistics 212-Statistics for the Sciences, St. Olaf College.