TDM 10100: R Project 1 — 2024

Motivation: The goal of this project is to get you comfortable with the basics of operating in Jupyter notebooks as hosted on Anvil, our computing cluster. If you don’t understand the code you’re running/writing at this point in the course, that’s okay! We are going to go into detail about how everything works in future projects.

Context: There’s no important prior context needed for this project! However, if you are interested in learning more there are plenty of online resources available that go into greater detail about the inner workings of Jupyter notebooks.

Scope: Anvil, Jupyter Labs, Jupyter Notebooks, R, bash

Learning Objectives:
  • Learn to create Jupyter notebooks

  • Gain proficiency manipulating Jupyter notebook contents

  • Learn how to upload/download files to/from Anvil

  • Write basic R code to read in data

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following dataset(s):

  • /anvil/projects/tdm/data/icecream/breyers/reviews.csv

  • /anvil/projects/tdm/data/icecream/bj/products.csv

Questions

Question 1 (2 pts)

First and foremost, welcome to The Data Mine! We hope that throughout your journey with us, you learn a lot, make new friends, and develop skills that will help you with your future career. Throughout your time with The Data Mine, you will have plenty of resources available should you need help. By coming to weekly seminar, posting on the class Piazza, and joining Dr. Ward and the TA team’s office hours, you can ensure that you always have the support you need to succeed in this course.

If you did not (yet) set up your 2-factor authentication credentials with Duo, you can set up the credentials here: the-examples-book.com/starter-guides/anvil/access-setup. If you’re still having issues with your ACCESS ID, please send an email containing as much information as possible about your issue to datamine-help@purdue.edu

Let’s start off by starting up our first Jupyter session on Anvil! First, visit this link and sign in using the username and password you picked when you set up your credentials.

These credentials may not necessarily be the same as your Purdue Career account.

In the upper-middle part of your screen, you should see a dropdown button labeled The Data Mine. Click on it, then select Jupyter Notebook from the options that appear. If you followed all the previous steps correctly, you should now be on a screen that looks like this:

OnDemand Jupyter Lab
Figure 1. OnDemand Jupyter Lab

If your screen doesn’t look like this, please try and select the correct dropdown option again or visit seminar for more assistance.

Jupyter Lab and Jupyter Notebook are technically different things, but in the context of seminar we will refer to them interchangeably to represent the tools you’ll be using to work on your projects.

There are a few key parts of this screen to note:

  • Allocation: this should always be cis220051 for The Data Mine

  • Queue: again, this should stay on the default option shared unless otherwise noted.

  • Time in Hours: The amount of time your Jupyter session will last. When this runs out, you’ll need to start a new session. It may be tempting to set it to the maximum, but our computing cluster is a shared resource. This means every hour you use is an hour someone else can’t use, so please only reserve it for 1-2 hours at a time.

  • CPU cores: CPU cores do the computation for the programs we run. All programs require memory to run as well. On Anvil, we will request one or more CPU cores to run our Jupyter Notebooks for Seminar. Each CPU core requested is also allocated 1896MB (1.9GB) of memory. Many programs are only able to use a single CPU core for computing, meaning requesting more than one CPU core would be wasteful. However, even if a program requires just a single CPU core for computing it may require, for example, 7GB of memory. This means we must request 4 CPU cores to get sufficient memory (1.9GB times 4 CPU cores gives us 7.6 GB of memory). We have determined in advance the number of CPU cores (and thus memory) will be required for each project and will tell you how many CPU cores you should use.

When your session ends, you will no longer be able to save/edit your work. Be sure to save on a regular basis so that even when your session ends, your work is safe. To put it more simply: a session ending does not delete your work, and anything you saved prior to the session ending will still be there when you start a new session.

With the key parts of this screen explained, go ahead and select 1 hour of time with 2 CPU cores and click Launch! After a bit of waiting, you should see something like below. Click connect to Jupyter and proceed to the next question!

Launch Jupyter Lab
Figure 2. Launch Jupyter Lab

You likely noticed a short wait before your Jupyter session launched. This happens while Anvil finds and allocates space for you to work. The more students are working on Anvil, the longer this will take, so it is our suggesting to start your projects early during the week to avoid any last-minute hiccups causing a missed deadline.

To cement the idea of Anvil being a large (but still limited) resource, please visit this website. Read through the information about Anvil (its short!) and pay special notice to the table at the bottom about Anvil’s sub-clusters. For this question, we want you to calculate how many nodes, cores, and total memory (in GB) Anvil has between sub-clusters A, B, and G. (Hint: 1TB = 1000GB). In the next question, we’ll walk you through where to write your answer, so for now just keep these numbers noted.

Deliverables
  • The total number of nodes, cores, and memory in Anvil sub-clusters A, B, and G combined.

Question 2 (2 pts)

Once you connect to Jupyter, you should be on a screen that looks similar to this (though yours may have a white background):

Jupyter Lab Homescreen
Figure 3. Jupyter Lab Homescreen

Before you jump into Jupyter, take a minute to read through this page that runs through most of the basics about Jupyter Labs. Additionally, take note of the 'Launcher' tab that is taking up most of the screen. The different options that you see (like Python 3, R 4.0.5, and Testing) are called kernels, and each kernel reads and runs code slightly differently. For Python, we’ll be using the seminar kernel, but you should just keep that in your back pocket for now.

Take a second to download our project template here (which can also be found on Anvil at /anvil/projects/tdm/etc/project_template.ipynb) Then upload the template to Jupyter and open it.

When you first open the template, you may get a pop-up asking you to select what kernel you’ll be using. Select seminar. You may have to scroll down to find it. If you do not get this pop-up, you can also select a kernel by clicking on the upper right part of your screen that likely says something similar to No Kernel, and then selecting the kernel you want to use.

A Jupyter notebook is made up of cells, which you can edit and then run. There are two types of cells we’ll work in for this class:

  • Markdown cells. These are where your writing, titles, sections, and paragraphs will go. Double clicking a markdown cell puts it in edit mode, and then clicking the play button near the top of the screen runs the cell, which puts it in its formatted form. More on this in a second. For now, just recognize that most markdown looks like regular text with extra characters like #, *, and - to specify bolding, indentation font, size, and more!

  • Code cells. These are where you will write and run all your code! Clicking the play button will run the code in that cell, and the programming language is specified by the language or languages known by the kernel that you chose.

For this question, you’re responsible for three main tasks:

  1. Fill in Question 1 with the information you found previously, in a markdown cell.

  2. In Question 2, copy and paste the code below this list into a code cell and then run it. You should see it output "[1] Hello and welcome to The Data Mine!", which is the result of running your code. Don’t worry about understanding the %%R part yet, as we will learn more about this in the next question.

  3. In the markdown cell for Question 2, please show three different examples of markdown elements. This cheatsheet is a good resource for some common markdown elements that you can see. An example you could do is a header, an ordered list, and some bold text. Be sure to run the cell after filling it in to see the results of your markdown!

%%R
print("Hello and Welcome to The Data Mine!")

Some common Jupyter notebooks shortcuts:

  • Instead of clicking the play button, you can press ctrl+enter (or cmd+enter on Mac) to run a cell.

  • If you want to run a cell and then move immediately to the next cell, you can use shift+enter. This is oftentimes more useful than ctrl+enter

  • If you want to run the current cell and then immediately create a new code cell below it, you can press alt+enter (or option+enter on Mac) to do so.

  • When a cell is selected (this means you clicked next to it, and it should show a blue bar to its left to signify this), pressing the d key twice will delete that cell.

  • When a cell is selected, pressing the a key will create a new code cell `a`bove the currently selected cell.

  • When a cell is selected, pressing the b key will create a new code cell `b`elow the selected cell

As this is our first real task of the semester, you’ll find a photo below of what your completed Question 2 may look like. Note that yours may differ slightly.

Question 2 Example Answer
Figure 4. Question 2 Example Answer
Deliverables
  • Your answers from Question 1, filled in.

  • The result of running the provided print() code.

  • Three examples of markdown elements in your markdown cell.

Question 3 (2 pts)

Let’s get more comfortable with code cells in Jupyter by learning how to run code in different languages! While most of the code you’ll run in this course will be in either Python or R, sometimes different languages like Bash, Perl, and more will provide more straightforward answers to a problem.

In Question 3, copy the following R code into a code cell and run it. This will read in some data to a dataframe called df, and then tell you how much space (in bytes) your dataframe is taking up!:

%%R
df <- read.csv("/anvil/projects/tdm/data/icecream/breyers/reviews.csv")
object.size(df)

Now let’s do the same thing but in Bash! Create a new code cell below the one you just ran (refer to the hint in the last question for a shortcut on how to do this), and copy in the below code:

%%bash

echo $(du /anvil/projects/tdm/data/icecream/breyers/reviews.csv --bytes | cut -f1) bytes

Running this should give you a smaller output than the R output. This is because in bash, we are checking the size of the stored data, while in Python we are reading the data into a dataframe that has a bit more memory associated with it to make it easier to work with.

Notice that %%R is capitalized, while %%bash and, similarly, %%python are not. Cell magic is Case-sensitive, so be sure to check your spelling!

As a side note, bash is an extremely important foundational tool to working with data and computers more generally. As a 'command line tool', bash is essentially a foundational programming language, and has a lot of fast, efficient, and useful tools that are useful no matter what project you’re working on. From navigating through file directories, to writing basic scripts, to locating and running programs, bash is hiding in the background of most everything your computer does.

Take note of the %%bash line in the cell you just ran. This is called cell magic, and it tells our kernel that we want it to run our code as a different language than the default. As an added example, writing %%python will allow us to run code in the Python programming language.

For more information on cell magic and how it works, please refer to this page.

To further cement your understanding of cell magic, we are going to translate one more bit of code from R to Bash. Let’s take the print() code from the last problem and convert it to its Bash equivalent! As a reminder, here is the R code to translate to Bash

%%R

print("Hello and Welcome to The Data Mine!")

Printing in Bash can be done using the echo command. For example, if I wanted to print "Dr. Ward is a robot" I could write echo Dr. Ward is a robot

Deliverables
  • The code and results of running the code for each language and task in the question.

  • The Bash equivalent of the print() statement from the last problem, and the results of running it.

Question 4 (2 pts)

Great work making it this far! At this point, you should now be pretty comfortable running code in Jupyter Notebooks.

In the next 2 questions we are going to introduce some new code that will allow us to read in large datasets and begin to work with them! If you don’t understand the specifics, that’s okay for now. For now, let’s just learn by doing. To start, run the following R code:

%%R

my_df <- read.csv("/anvil/projects/tdm/data/icecream/breyers/reviews.csv")
print(dim(my_df))
head(my_df)

The breakdown of this code is as follows:

  1. We use cell magic to make sure the seminar kernel knows to read our code in the R language

  2. We use the read.csv function to read the data from the given file into a dataframe we call my_df.

  3. We print the dimensions of the dataframe, my_df. You should see an output of "[1] 5007 8"

  4. We print the head of the dataframe, which is just the first 5 rows of our dataframe and the column headers (if they exist).

For the last part of this question, we want you to create a new code cell and write some R to print the names of the columns of our dataframe. If you do everything correctly, you should see the columns are named key, author, date, stars, title, helpful_yes, helpful_no, and text. If you’re struggling, take a look at the hint below:

R has two built-in functions, called colnames() and names(), respectively, that will both return the names of the columns of a dataframe. Here is a link to a documentation page on colnames() and some examples of it being used. An important part of data science and writing code is being able to read and learn from documentation, so we will try and provide relevant pages throughout the course. If you have any questions or are having trouble interpreting some documentation (often just called 'docs'), please reach out.

Deliverables
  • The result of runnning the provided code that reads in a dataframe and prints its shape.

  • A code cell that prints the names of the columns in my_df

Question 5 (2 pts)

Let’s take a second to reflect on everything you did and learned during this project. First, you learned how to Launch a Jupyter Notebook session on the Anvil supercomputing cluster. Next you learned about uploading files to Anvil, the general structure of Jupyter notebooks, and how to manipulate the contents of a notebook to fit your working style. Finally, you learned how to write and run some basic code in Jupyter notebooks, including how to read in data!

In this last question, we are going to try and put everything you learned today together. In the previous question, you read a file on Breyer’s ice cream reviews into a dataframe called my_df and printed the number of columns and rows in the dataframe. Finally, we had you write some code to print the names of the columns of my_df

In this question, we want you to read a file on Breyers’s ice cream products into a dataframe called BreyProd_df. The path to the file is /anvil/projects/tdm/data/icecream/bj/products.csv. Next, print the number of rows and columns in BreyProd_df, and then print the names of the columns in BreyProd_df.

The code needed to solve this problem is almost identical to that of the last problem. If you’re struggling, considering revisiting Question 4 and trying to better understand what is going on in that code, and feel free to copy the code from Question 4 into Question 5 and modify it directly.

One way you can validate that your code is working correctly is comparing the results of your code that outputs the number of rows/columns in the dataframe with the code that outputs the names of the columns in the dataframes. The number of columns in the dataframe should match the number of names printed.

Finally, make sure that your name is at the top of the project template. If you used outside resources (like Stack Overflow) or got help from TAs, make sure to note where you got assistance from, and on what part of the project they assisted you, in the appropriate sections at the top of the template.

Deliverables
  • Code that reads the products.csv file into a dataframe

  • Code that prints the shape of the resulting dataframe

  • Code that prints the names of the columns in the resulting dataframe

Submitting your Work

Congratulations! Assuming you’ve completed all the above questions, you’ve just finished your first project for TDM 10100! If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours. Prior to submitting, make sure you’ve run all of the code in your Jupyter notebook and the results of running that code is visible. More detailed instructions on how to ensure that your submission is formatted correctly can be found here. To download your completed project, you can right-click on the file in the file explorer and click 'download'.

Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points. At the bottom of each 101 project, you will find a comprehensive list of all the files that need to be submitted for that project. We hope your first project with us went well, and we look forward to continuing to learn with you on future projects!!

Items to submit
  • firstname_lastname_project1.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not. Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.