# Lab 2 - College scorecard

## Introduction

In today’s lab, you’ll use simple linear regression to analyze the relationship between the admissions rate and total cost for colleges and universities in the United States.

### Learning goals

By the end of the lab you will…

- Be able to fit a simple linear regression model using R.
- Be able to interpret the slope and intercept for the model.
- Be able to use statistical inference to draw conclusions about the slope.
- Continue developing a workflow for reproducible data analysis.

## Getting started

Go to the sta210-s22 organization on GitHub. Click on the repo with the prefix **lab-2**. It contains the starter documents you need to complete the lab.

Clone the repo and start a new project in RStudio. See the Lab 1 instructions for details on cloning a repo, starting a new R project and configuring git.

## Packages

We will use the following packages in today’s lab.

```
library(tidyverse) # for data wrangling + visualization
library(tidymodels) # for modeling
library(knitr) # for pretty printing of tables
```

## Data: College scorecard

The data for this lab is from the `scorecard` data set in the **rcfss** R package. It includes information originally obtained from the U.S. Department of Education’s College Scorecard for 1753 colleges and universities during the 2018 - 2019 academic year.

The lab focuses on the following variables:

- `admrate`: Undergraduate admissions rate (from 0-100%)
- `cost`: The average annual total cost of attendance, including tuition and fees, books and supplies, and living expenses
- `type`: Type of college (Public; Private, nonprofit; Private, for-profit)

Click here to see a full list of variables and definitions.

Use the code below to load the data set.

`scorecard <- read_csv("data/scorecard.csv")`

## Exercises

### Exercise 1

Create a histogram to examine the distribution of `admrate` and calculate summary statistics for the center (mean and median) and the spread (standard deviation and IQR).
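As a rough illustration of the mechanics (not a solution), the sketch below draws a histogram and computes the four summary statistics on simulated data. The data frame `toy_scorecard` and its values are made up for this example, and storing `admrate` as a proportion is an assumption; swap in `scorecard` for your own work.

```r
library(tidyverse)

# Simulated toy data (NOT the real scorecard data); values are made up
set.seed(210)
toy_scorecard <- tibble(admrate = runif(200, min = 0.05, max = 1))

# Histogram of the admissions rate
admrate_hist <- ggplot(toy_scorecard, aes(x = admrate)) +
  geom_histogram(binwidth = 0.05) +
  labs(x = "Admissions rate", y = "Count")

admrate_hist

# Center (mean, median) and spread (standard deviation, IQR)
admrate_summary <- toy_scorecard %>%
  summarise(
    mean   = mean(admrate),
    median = median(admrate),
    sd     = sd(admrate),
    iqr    = IQR(admrate)
  )

admrate_summary
```

If a variable contains missing values, these functions return `NA` unless you add `na.rm = TRUE`.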

### Exercise 2

Use the results from the previous exercise to describe the distribution of `admrate`. Include the shape, center, spread, and if there are potential outliers.

### Exercise 3

Plot the distribution of `cost` and calculate the appropriate summary statistics. Describe the distribution of `cost` (shape, center, spread, and outliers) using the plot and appropriate summary statistics.

This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.

### Exercise 4

The goal of this analysis is to fit a regression model that can be used to understand the variability in the cost of college based on the admission rate. Before fitting the model, let’s look at the relationship between the two variables. Create a scatterplot to display the relationship between cost and admissions rate. Describe the relationship between the two variables based on the plot.

### Exercise 5

Does the relationship between cost and admissions rate differ by type of college? Modify the plot from the previous exercise to visualize the relationship by type of college.
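One common way to show a third, categorical variable on a scatterplot is to map it to color. The sketch below does this on simulated data; `toy_scorecard`, its values, and the group sizes are hypothetical, and faceting with `facet_wrap()` would be an equally reasonable choice.

```r
library(tidyverse)

# Simulated toy data (NOT the real scorecard data); values are made up
set.seed(210)
toy_scorecard <- tibble(
  type = rep(
    c("Public", "Private, nonprofit", "Private, for-profit"),
    times = c(80, 80, 40)
  ),
  admrate = runif(200, min = 0.05, max = 1)
) %>%
  mutate(cost = 40000 - 12000 * admrate + rnorm(200, sd = 8000))

# Map type to color so each type of college is visually distinct
cost_by_type_plot <- ggplot(toy_scorecard, aes(x = admrate, y = cost, color = type)) +
  geom_point(alpha = 0.6) +
  labs(x = "Admissions rate", y = "Total cost", color = "Type of college")

cost_by_type_plot
```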

### Exercise 6

Describe two new observations from the scatterplot in Exercise 5 that you didn’t see in the scatterplot from Exercise 4.

This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.

### Exercise 7

Fit the linear regression model. Use the `kable` function to neatly display the results with a reasonable number of decimals.
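For reference, a minimal sketch of the tidymodels workflow on simulated data: specify a linear regression, fit it with the `lm` engine, tidy the output, and pass it to `kable()`. The data frame `toy_scorecard` and the coefficients used to simulate it are made up, and `digits = 3` is just one reasonable choice.

```r
library(tidymodels)
library(knitr)

# Simulated toy data (NOT the real scorecard data); values are made up
set.seed(210)
toy_scorecard <- tibble(admrate = runif(200, min = 0.05, max = 1)) %>%
  mutate(cost = 45000 - 15000 * admrate + rnorm(200, sd = 8000))

# Specify and fit a simple linear regression of cost on admrate
toy_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(cost ~ admrate, data = toy_scorecard)

# Tidy the coefficient table and print it with 3 decimal places
toy_fit_tidy <- tidy(toy_fit)
kable(toy_fit_tidy, digits = 3)
```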

### Exercise 8

Consider the model from the previous exercise.

- Interpret the slope in the context of the problem.
- Does the intercept have a meaningful interpretation? If so, write the interpretation in the context of the problem. Otherwise, explain why the interpretation is not meaningful.

### Exercise 9

Construct a 95% confidence interval for the slope using bootstrapping. Follow these steps to accomplish this:

- First set a seed for simulating reproducibly.
- Then, simulate the bootstrap distribution of the slope using 1,000 bootstrap samples.
- Then, visually estimate the bounds of the bootstrap interval based on a histogram of the distribution of the bootstrapped slopes, using the percentile method.
- And then, use the `get_confidence_interval()` function to explicitly calculate the bounds of the confidence interval using the percentile method.
- Finally, interpret the confidence interval in the context of the data.
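The steps above can be sketched with the **infer** functions loaded by tidymodels. This is an illustration on simulated data, not a solution: `toy_scorecard`, the object names, and the seed values are all assumptions made for the example.

```r
library(tidymodels)

# Simulated toy data (NOT the real scorecard data); values are made up
set.seed(210)
toy_scorecard <- tibble(admrate = runif(200, min = 0.05, max = 1)) %>%
  mutate(cost = 45000 - 15000 * admrate + rnorm(200, sd = 8000))

# Observed fit, used as the point estimate for the interval
obs_fit <- toy_scorecard %>%
  specify(cost ~ admrate) %>%
  fit()

# Bootstrap distribution: refit the model on 1,000 resamples
set.seed(1234)
boot_fits <- toy_scorecard %>%
  specify(cost ~ admrate) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  fit()

# Histogram of the bootstrapped slopes for a visual estimate of the bounds
boot_slopes <- boot_fits %>%
  ungroup() %>%
  filter(term == "admrate")

ggplot(boot_slopes, aes(x = estimate)) +
  geom_histogram(bins = 30) +
  labs(x = "Bootstrapped slope", y = "Count")

# 95% percentile interval for each term
boot_ci <- get_confidence_interval(
  boot_fits,
  point_estimate = obs_fit,
  level = 0.95,
  type = "percentile"
)

boot_ci
```

With the percentile method, the bounds are the 2.5th and 97.5th percentiles of the bootstrapped estimates, so the interval should line up with the middle 95% of the histogram.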

This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.

### Exercise 10

Finally, we want to answer the question “Do the data provide sufficient evidence of a linear relationship between cost and admissions rate, i.e. \(\beta_1\) is different from 0?”

To answer this question we will use a hypothesis test. We can conduct a hypothesis test via simulation (what we’ll do in this lab) or using mathematical models (what we’ll do in the next class).

Before we can conduct the hypothesis test, let’s first set our hypotheses. Remember that the null hypothesis represents the status quo (nothing going on, i.e. there is no relationship) and the alternative hypothesis represents our research question (there is something going on, i.e. there is a relationship).

- \(H_0\): There is no linear relationship between the admissions rate and cost of colleges in the United States, \(\beta_1 = 0\)
- \(H_A\): There is a linear relationship between the admissions rate and cost of colleges in the United States, \(\beta_1 \ne 0\)

To test these hypotheses, we will use a permutation test, where we

- Simulate new samples from the original sample via permutation under the assumption that the null hypothesis is true
- Fit models to each of the samples and estimate the slope
- Use features of the distribution of the permuted slopes to calculate the p-value for the hypothesis test

The major difference between constructing a confidence interval and conducting a hypothesis test is that for the hypothesis test we assume that the null hypothesis is true. This requires a simulation scheme that will allow us to measure the natural variability in the data due to sampling but **not** due to cost and admission rate being correlated. We achieve this by permuting one variable to eliminate any existing relationship between the variables. To do so, we randomly assign each `admrate` value to the `cost` of a given university, i.e. `cost` and `admrate` are no longer matched for a given university.
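To make the idea of permuting concrete, here is a tiny sketch with five hypothetical universities (the names and values are made up). Shuffling `admrate` keeps the same set of values but breaks the pairing with `cost`.

```r
library(tidyverse)

# Five hypothetical universities; all values are made up for illustration
toy <- tibble(
  university = c("A", "B", "C", "D", "E"),
  admrate    = c(0.10, 0.35, 0.50, 0.75, 0.90),
  cost       = c(62000, 48000, 39000, 30000, 25000)
)

# Shuffle admrate across universities; cost stays where it is
set.seed(1234)
toy_permuted <- toy %>%
  mutate(admrate = sample(admrate))

toy_permuted
```

Each permuted data set has the same marginal distributions of `cost` and `admrate` as the original, so any slope we estimate from it reflects chance alone.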

In the following code chunk we

- First set a seed for simulating reproducibly.
- Then, we start with our data frame and specify our model as `cost` vs. `admrate`.
- Then, we set our null hypothesis (`cost` and `admrate` are independent).
- And then we generate 1000 replicates of our data where, for each replicate, we permute values of `admrate` to randomly assign them to values of `cost`.
- Finally, we fit our model to each of our 1000 permuted datasets.

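```
set.seed(1234) # make the permutations reproducible

perm_fits <- scorecard %>%
  specify(cost ~ admrate) %>%                 # model: cost vs. admrate
  hypothesize(null = "independence") %>%      # null: cost and admrate are independent
  generate(reps = 1000, type = "permute") %>% # 1000 permuted replicates
  fit()                                       # fit the model to each replicate
```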

The resulting dataset `perm_fits` has 2000 rows and 3 columns. The first column, `replicate`, indicates the replicate number of the dataset the models were fit to; the values in this column range between 1 and 1000. The second column, `term`, tells us which term (intercept of the model or slope of `admrate`) the `estimate` value in the third column is for.

`perm_fits`

```
# A tibble: 2,000 × 3
# Groups:   replicate [1,000]
   replicate term      estimate
       <int> <chr>        <dbl>
 1         1 intercept  36857.
 2         1 admrate     -781.
 3         2 intercept  35901.
 4         2 admrate      643.
 5         3 intercept  36608.
 6         3 admrate     -411.
 7         4 intercept  35831.
 8         4 admrate      746.
 9         5 intercept  36367.
10         5 admrate      -51.7
# … with 1,990 more rows
```

- Create a histogram of the slope estimates in `perm_fits`. (Hint: Filter the dataset for just the slope values, `term == "admrate"`.)
- Estimate the p-value of the hypothesis test based on this distribution.
- State your conclusion for the test in context.
- Indicate whether or not it is consistent with the confidence interval from the previous exercise. Briefly explain your response.
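As a hedged sketch of how a two-sided p-value can be read off a permutation distribution: it is the proportion of permuted slopes that are at least as extreme (in absolute value) as the observed slope. The example below runs on simulated data; `toy_scorecard`, `obs_slope`, `perm_slopes`, and the simulated coefficients are all made up for illustration.

```r
library(tidymodels)

# Simulated toy data (NOT the real scorecard data); values are made up
set.seed(210)
toy_scorecard <- tibble(admrate = runif(200, min = 0.05, max = 1)) %>%
  mutate(cost = 40000 - 3000 * admrate + rnorm(200, sd = 8000))

# Observed slope from the original (unpermuted) data
obs_slope <- toy_scorecard %>%
  specify(cost ~ admrate) %>%
  fit() %>%
  filter(term == "admrate") %>%
  pull(estimate)

# Slopes from 1,000 permuted data sets, generated under the null hypothesis
set.seed(1234)
perm_slopes <- toy_scorecard %>%
  specify(cost ~ admrate) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  fit() %>%
  ungroup() %>%
  filter(term == "admrate")

# Two-sided p-value: share of permuted slopes at least as extreme as observed
p_value <- mean(abs(perm_slopes$estimate) >= abs(obs_slope))
p_value
```

A small p-value means slopes as extreme as the observed one rarely arise when `cost` and `admrate` are unrelated.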

## Submission

To submit your assignment:

- Go to http://www.gradescope.com and click *Log in* in the top right corner.
- Click *School Credentials* ➡️ *Duke NetID* and log in using your NetID credentials.
- Click on your *STA 210* course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
- Select the first page of your PDF submission to be associated with the *“Workflow & formatting”* section.

## Grading

Total points available: 50 points.

Component | Points |
---|---|
Ex 1 - 10 | 45 |
Workflow & formatting | 5^{1} |

^{1} The “Workflow & formatting” grade is to assess the reproducible workflow. This includes having at least 3 informative commit messages and updating the name and date in the YAML.