STAT 19000: Project 13 — Fall 2020
Motivation: It’s important to be able to look up and understand the documentation of a new function. You may have looked up the documentation of functions like paste0 or sapply and noticed that, in the "usage" section, one of the arguments is an ellipsis (…). Unless you understand what the ellipsis does, it is hard to understand how to use such a function. In this project, we will experiment with the ellipsis and write our own function that uses one.
Context: We’ve learned about, used, and written functions in many projects this semester. In this project, we will use some of the lesser-known features of functions.
Scope: r, functions
Questions
Question 1
Read /class/datamine/data/beer/beers.csv into a data.frame named beers. Read /class/datamine/data/beer/breweries.csv into a data.frame named breweries. Read /class/datamine/data/beer/reviews.csv into a data.frame named reviews.
| Notice that  | 
| Do not forget to load the  | 
Below we show you an example of how fast the fread function is compared to read.csv.
microbenchmark(read.csv("/class/datamine/data/beer/reviews.csv", nrows=100000), data.frame(fread("/class/datamine/data/beer/reviews.csv", nrows=100000)), times=5)
Unit: milliseconds
                                                                       expr       min        lq      mean    median        uq       max neval
            read.csv("/class/datamine/data/beer/reviews.csv", nrows = 1e+05) 5948.6289 6482.3395 6746.8976 7040.5881 7086.6728 7176.2589     5
 data.frame(fread("/class/datamine/data/beer/reviews.csv", nrows = 1e+05))  120.7705  122.3812  127.9842  128.7794  133.7695  134.2205     5
| This video demonstrates how to read the  | 
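As a rough illustration of the comparison above, the sketch below times both readers on a small temporary file. The file, its columns (beer_id, score), and its size are made up for this example, and timings on such a small file say little about the real reviews.csv. It assumes the data.table package may or may not be installed, so the fread part is guarded.

```r
# Toy comparison of read.csv and data.table::fread on a temporary file.
# The file and its columns are hypothetical, for illustration only.
tmp <- tempfile(fileext = ".csv")
toy <- data.frame(beer_id = 1:1000, score = round(runif(1000, 1, 5), 2))
write.csv(toy, tmp, row.names = FALSE)

# Time the base R reader
base_time <- system.time(df_base <- read.csv(tmp))

if (requireNamespace("data.table", quietly = TRUE)) {
  # fread returns a data.table; data.frame() converts it to a plain data.frame
  fread_time <- system.time(df_fread <- data.frame(data.table::fread(tmp)))
  # Show elapsed seconds side by side
  print(c(read.csv = base_time[["elapsed"]], fread = fread_time[["elapsed"]]))
}
```

Wrapping fread in data.frame() mirrors the benchmark call shown above, so both readers return the same kind of object.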
- R code used to solve the problem.
Question 2
Take some time to explore the datasets. Like many datasets, our data is broken into 3 "tables". What columns connect each table? How many breweries in breweries don’t have an associated beer in beers? How many beers in beers don’t have an associated brewery in breweries?
| We compare lists of names using  | 
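One possible way to check which keys in one table are missing from another is with %in% or setdiff. The sketch below uses two toy tables; the column name brewery_id and all values are assumptions for illustration, not the actual columns of the beer data.

```r
# Toy tables; column names and values are hypothetical
breweries_toy <- data.frame(brewery_id = c(1, 2, 3, 4),
                            name = c("A", "B", "C", "D"))
beers_toy <- data.frame(beer_id = c(10, 11, 12, 13),
                        brewery_id = c(1, 1, 3, 9))

# Breweries whose id never appears in the beers table
no_beer <- !(breweries_toy$brewery_id %in% beers_toy$brewery_id)
sum(no_beer)      # 2 (breweries 2 and 4)

# Beers whose brewery id is missing from the breweries table
no_brewery <- !(beers_toy$brewery_id %in% breweries_toy$brewery_id)
sum(no_brewery)   # 1 (beer 13, which points at brewery 9)

# setdiff returns the unmatched ids themselves
setdiff(breweries_toy$brewery_id, beers_toy$brewery_id)   # 2 4
```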
- R code used to solve the problem.
- A description of columns which connect each of the files.
- How many breweries don’t have an associated beer in beers.
- How many beers don’t have an associated brewery in breweries.
Question 3
Run ?sapply and look at the usage section for sapply. If you look at the description for the … argument, you’ll see it is "optional arguments to FUN". What this means is that you can specify additional input for the function you are passing to sapply. One example would be passing T to na.rm in the mean function: sapply(dat, mean, na.rm=T). Use sapply and the strsplit function to separate the types of breweries (types) by commas. Use another sapply to loop through your results and count the number of types for each brewery. Be sure to name your final results n_types. What is the average number of services (n_types) that breweries in IN and MI offer (we are looking for a single average over IN and MI combined)? Does that surprise you?
| When you have one  | 
| We show, in this video, how to find the average number of parts in a midwesterner’s name. Perhaps surprisingly, this same technique will be useful in solving Question 3. | 
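To make the role of the … argument concrete, here is a minimal sketch on a toy data.frame. The state and types values are made up for illustration; only the pattern of forwarding split through sapply to strsplit is the point.

```r
# Toy data: one comma-separated "types" string per brewery (values are made up)
toy <- data.frame(
  state = c("IN", "MI", "OH", "IN"),
  types = c("bar, eatery", "bar", "bar, eatery, store",
            "eatery, store, bar, homebrew"),
  stringsAsFactors = FALSE
)

# split is not an argument of sapply; it travels through ... to strsplit
pieces <- sapply(toy$types, strsplit, split = ",")

# A second sapply counts the pieces for each brewery
n_types <- sapply(pieces, length)

# Average over the IN and MI rows combined: (2 + 1 + 4) / 3
avg_in_mi <- mean(n_types[toy$state %in% c("IN", "MI")])
avg_in_mi
```

Because the input is a character vector, sapply names each result after its input string; unname(n_types) drops those names if they get in the way.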
- R code used to solve the question.
- 1-2 sentences answering the average number of services breweries in Indiana and Michigan offer, and commenting on this answer.
Question 4
Write a function called compare_beers that accepts the reviews data.frame, a function that you will call FUN, and any number of vectors of beer ids. The function, compare_beers, should cycle through each vector/group of beer_ids, compute the function FUN on the corresponding subset of reviews, and print "Group X: some_score", where X is the group number (1, 2, and so on) and some_score is the result of applying FUN to that subset of the reviews data.
In the example below, the function FUN is the median function, and we have two vectors/groups of beer_ids passed, with c(271781) being group 1 and c(125646, 82352) being group 2. Note that even though our example only passes two vectors to our compare_beers function, we want to write the function in a way that lets us pass as many vectors as we want.
Example:
compare_beers(reviews, median, c(271781), c(125646, 82352))
This example gives the output:
Group 1: 4
Group 2: 4.56
For your solution to this question, find the behavior of compare_beers in this example:
compare_beers(reviews, median, c(88,92,7971), c(74986,1904), c(34,102,104,355))
| There are different approaches to this question. You can use for loops or  | 
| This first video shows how to use  | 
| This second video basically walks students through how to build this function. If you use this video to learn how to build this function, please be sure to acknowledge this in your project solutions. | 
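The general pattern of a function that takes any number of extra arguments can be sketched as follows. This is only an illustration under stated assumptions: summarize_groups is a hypothetical helper, and the toy data.frame with columns beer_id and score is made up, so the real reviews data may need a different column. A for loop is used here, which is one of several possible approaches.

```r
# Hypothetical helper: applies FUN to the scores of each group of ids in ...
# The columns beer_id and score are assumptions for this toy example.
summarize_groups <- function(dat, FUN, ...) {
  groups <- list(...)                    # each extra argument becomes one list element
  results <- numeric(length(groups))
  for (i in seq_along(groups)) {
    # Keep only the rows whose id belongs to the i-th group
    group_scores <- dat$score[dat$beer_id %in% groups[[i]]]
    results[i] <- FUN(group_scores)
    cat("Group ", i, ": ", results[i], "\n", sep = "")
  }
  invisible(results)                     # return the values without printing them again
}

toy_reviews <- data.frame(
  beer_id = c(1, 1, 2, 3, 3, 3),
  score   = c(4.0, 5.0, 3.0, 2.0, 4.0, 4.5)
)

# Two groups here, but any number of vectors can follow FUN
out <- summarize_groups(toy_reviews, median, c(1), c(2, 3))
# Group 1: 4.5
# Group 2: 3.5
```

Capturing the extra arguments with list(...) keeps each vector intact as its own group, whereas c(...) would flatten them into a single vector.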
- R code used to solve the problem.
- The result from running the provided example.
Question 5
Beer wars! IN and MI against AZ and CO. Use the function you wrote in Question 4 to compare the beer_ids from each group of states. Make a cool plot of some sort. Be sure to comment on your plot.
| Create a vector of  | 
| This video demonstrates an example of how to use the  | 
- R code used to solve the problem.
- The result from running your function.
- The resulting plot.
- 1-2 sentences commenting on your plot.