Objectives

Learn to use the statistical computing language R in RStudio.

Reading this document

Please read this document completely, and follow along in RStudio or Posit Cloud. There is a simple exercise that you must do and turn in located in the last section.

As you have already learned in a previous exercise, this is an R Markdown document, which mixes text with code. To start out, you should ‘knit’ the document before making any changes. It will be easier to read, with nice formatting.

You can read this document in 3 different ways, depending on how you like it. 1. View it as plain “Source” text. You will see all the formatting codes (which might be distracting and ‘ugly’ and numbering may seem wrong) but this is the rawest form, and how the document is actually written. Above and to the right of the code editor pane (top left pane), click the “source” button. 1. View it in the visual editor to make the formatting nice and avoid seeing all the formatting code. This is how I recommend you read the document. Click the “visual” button to the top left of the top left (code editor) pane. 1. Knit the document and read the HTML. This is the final form of the document but some things that you’ll need get hidden in this form, so I don’t recommend it. You can view it either in a separate window or in the “Plots and files” pane (lower right).

Try viewing the document in each of these forms to see what you like.

Getting started and running commands

Below, you will see a gray code ‘chunk’ that begins and ends with 3 backticks (under the tilde on your keyboard; however when you view the document in the Visual mode, as I suggest, you won’t see those backticks). All of the code you want to execute will need to go in those gray areas. To execute it, you can click the little right-facing green arrow at the top right of the gray box. Go ahead and experiment and make changes directly into the Rmd document. Do not copy commands into a separate script file.

The following code chunk is ready to run. Press the green arrow now and check the R console (usually in the bottom left portion of your screen) to make sure it ran without errors.

a <- 2

This is a simple command that we’ll learn about in a moment. You can run commands either one at a time by placing your cursor on the line you want to run and using cntrl-enter or you can run various combinations of commands using the run button at the top of the code editor pane. Try doing both methods now.

If you want to run a single command without recording it into your script, you can enter it into the R console (bottom left of your screen). This is useful for getting help or seeing your data (see below), but if you want to record this step, you should put it into a code chunk of an Rmd document in the top left pane of your screen.

An R Primer

You can do a lot with R but for now, we need to understand a few basics.

Try to think of R like a language. In any language, there are nouns and verbs. Nouns can take the form of a single value such as the number 10. We can use verbs to operate on those nouns, such as the plus sign in $10 + 5 = 15$. Try typing ‘$10 + 5$’ into the code chunk below. Run it. You should see that R returns the answer.

# Put your code (10 + 5) below this line. Don't include parentheses. Then, run the line with 'cntrl-enter' or by clicking the green arrow at the top right of this code chunk.

Variables with a single value

In order to make R more than a glorified calculator, we will save values into named variables. To do so, we will use the assignment operator (<-). This operator (a verb) assigns the value on the right into the variable on the left. So, when you ran the command

a <- 2

above, you assigned the value of 2 into the variable a. Now, if you want to operate on the variable a, you can do so. The following code multiplies whatever value(s) are in the variable a by 10, and assigns the resulting value to a variable called b. We can see the contents of a variable by typing its name alone on a line (and running it).

b <- a * 10
b

## [1] 20

Variables with multiple values (vectors)

Variables can also contain a group of values called a vector. To make a vector, we use the ‘combine’ c() operator. Most operators (verbs) are followed by parentheses, and are called ‘functions’. Inside the parentheses, we put variables or other information (collectively called arguments) that the function needs in order to do its job properly. The c() function combines all of its arguments. You can see the contents of a variable by typing it alone on a line (and running it).

my_variable <- c(a, b, 1, 2, 3)
my_variable

## [1]  2 20  1  2  3

Now that we have a variable with more than one value, we can do some more interesting things with it. For example, we can do mathematical operations on all the numbers at once just by referring to the vector:

(my_variable * 1000 + 50) / 10

## [1]  205 2005  105  205  305

We can also apply other functions (verbs) to the entire vector of numbers. For example, we can take the average using the mean() function and find the minimum and maximum values. There’s even a function to get several useful statistics called summary(). Try the following commands. (Remember, you can run them either by clicking the green arrow, or by ‘cntrl-enter’ on each line)

mean(my_variable)

## [1] 5.6

max(my_variable)

## [1] 20

min(my_variable)

## [1] 1

summary(my_variable)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     2.0     2.0     5.6     3.0    20.0

Data tables

In addition to holding a vector of values, a variable can hold a whole table of values. For this, we’ll use a small built-in dataset about cars called ‘mtcars’. Run the following line of code to get access to this dataset. Then you can look at the table by typing the name of the dataset (only the first few rows are shown here).

data(mtcars)
mtcars

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

You can also see the dataset in the “Workspace and History” panel (top right panel) under ‘Environment’. Try double-clicking the name of the dataset to see it in tabular form. You should see rows labeled for the type of car, and then characteristics of the car as a number of columns.

You specify which column you want using the dollar sign ($). First rows of mtcars dataset We can use this dollar sign notation to isolate one column of data. For example, to calculate the mean (average) fuel economy for these cars, you can do:

mtcars$mpg # show only the 'mpg' (miles per gallon) column

##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4

mean(mtcars$mpg) # get the mean (average) mpg for this dataset

## [1] 20.09062

mileage <- mtcars$mpg # Save the mpg data into a separate vector called 'mileage'.

This simply means that you are choosing the mpg column of the variable ‘mtcars’, and want to find out the mean value of that column. How would you get the average horsepower (‘hp’) for this dataset? Save it into a variable called ‘average.horsepower’.

# What code would you use to get the mean horsepower?

Getting help

R contains a help file for every function. In order to see it, type a question mark in front of the function name. For example, the ‘rnorm()’ function allows you to generate a group of normally-distributed random numbers:

?rnorm # get help on the rnorm() function

The help files will tell you what the function does, and what arguments you can give to the function. For the ‘rnorm()’ function, it shows that you can specify (as an ‘argument’) what you want the mean and standard deviation of those random numbers to be. At the bottom of most help files, there are examples which help you to figure out how you can use the function.

Try getting help on the ‘median()’ function.

# How do you get help on how to use the median() function?

Arguments

As mentioned above, most functions (verbs) take one or more arguments. These arguments tell R exactly what you want the verb to do, and on which data you want the verb to act. The function mean() that you used above takes an argument that we put in parentheses after the function, and it tells R to take the ‘mean’ of whatever data you give to the function. Some functions have long lists of arguments, and you should separate them with commas. Look again at the help for rnorm() using ?rnorm(). In that, you will see

rnorm(n, mean = 0, sd = 1)

This indicates that this function takes 3 arguments. You can simply use 3 unlabeled values, such as

rnorm(100, 5, 1)

which would generate 100 random values with a mean of 5 and standard deviation of 1. The order of these arguments is important, since they are unnamed – they need to appear in the same order as specified in the help file. A better way to do it is to name the arguments, such as:

rnorm(n = 100, mean = 5, sd = 1).

In this case, since each argument is named, we can put it in any order we want. It’s also good to write them out like this so that we avoid confusion. We can also leave out some arguments, and use the defaults that are specified in the help file. In this case we could just use rnorm(100) and accept the defaults of mean = 0, sd = 1.

Comments

Computer code is difficult to read and understand! To make it easier, you should add comments to your code to explain what you are doing. This is very important as you will often not remember what or why you did something later (even the next day). It will also help other people understand what your code does.

In order to add comments, you can use the ‘pound’ symbol (#; yeah, I guess you probably call it a ‘hashtag’). R ignores everything on a line after the pound sign, so you can either use it on a line of its own or at the end of a line with code:

# This is a comment that might be used to explain some
# long, complicated bit of code. The computer won't read
# it, but its useful for humans! In particular, the code
# that follows generates 100 random numbers with a mean of 10 and
# standard deviation of 2 and assigns them to a variable
random_nums <- rnorm(n = 100, mean = 10, sd = 2) # Random number generation
# Draw a plot of those random numbers. Note that each time you
# run this code chunk, the numbers will be different and so
# the plot will also be different.
plot(random_nums, xlab='Number', ylab='Value of random number')

A note on spelling and cases

Since R is a computer program, it is vary finicky about spelling. If you spell something wrong, it won’t know what you are talking about! For example, typing mtcars$mbg instead of mtcars$mpg will generate an error because there is no data column called mbg in the mtcars dataset.

Similarly, letters of different case are very different in R. For example, if you type Mtcars hoping to get the mtcars dataset, R will tell you that there is no such dataset.

Misspelling and mixing up cases is the single thing that trips up and causes frustration among Introductory Biology students! Be Warned!

Getting your data into R

R doesn’t have a spreadsheet program built in, so we will use Google Sheets. Before doing this, however, we need to ‘publish’ our data as a CSV (Comma-separated values) file. You will need to do this with each Google dataset that you use. Start by logging in to your Google account and creating a file with a few values in 2 columns (this example is from my Google Drive):

Google sheets file with 3 columns of data

Give your document a title, and make sure you know where it is in your Google Drive. Next, select the ‘File’ menu, then ‘Share’ and ‘Publish to the web’. Then choose to publish it as a CSV file.

Publishing to the Web

–>

You should then copy the URL supplied, and paste it into your code. Probably the easiest function to use is ‘read.csv()’ which will read that file and make it available for you to use in R. The code chunk below reads in the csv (using read.csv()) and then saves that table to an object called dat. As before, you can see what is in the table by simply putting the object name on its own line and running it.

Notice in the read.csv() command below that the command has parentheses, and the information that goes within the parentheses is the URL given to you by Google. You need to put that entire URL into quotes (““).

# The read.csv() function is an easy way to get data into R. Simply paste
# the URL provided by Google into the command, and place it between quotes.
# Note that files that are exported by Google have "...output=csv" at the
# end of the URL. If yours doesn't look like that, it wasn't exported properly.

# Don't change the following line -- you'll do the same thing for your own data
# below
dat <- read.csv(file = "https://docs.google.com/spreadsheets/d/e/2PACX-1vR1WD5pL0ueIrDigzuFBuMJhhs0_y5mLSNsfL1y9hxRJrA6qFJNcnQSEuREzvM3wLHaQjHSX8SY2c_D/pub?output=csv")
dat # don't change this either.

##    leaf length width
## 1     1    4.5   2.1
## 2     2    4.6   2.2
## 3     3    4.3   1.9
## 4     4    4.4   2.2
## 5     5    4.8   2.4
## 6     6    4.4   2.1
## 7     7    4.5   2.0
## 8     8    4.6   2.3
## 9     9    4.4   2.0
## 10   10    4.7   2.2

Finally, you can see whether there seems to be a relationship between the length and width of the leaves in my dataset by drawing a simple graph that includes the data, axis labels, and a main title. We can also do a linear regression of the data and plot that line on the graph (you’ll learn more about these later; for now, simply run the code, don’t change it):

# Don't change anything in this code chunk -- just run it.
plot(x = dat$length,
     y = dat$width,
     xlab = "Leaf length (cm)",
     ylab = "Leaf width (cm)",
     main = "Leaf Dimensions")
linear.regression <- lm(dat$width ~ dat$length) # don't change this
abline(linear.regression)                       # don't change this either

Debugging your code

Things won’t always work perfectly the first time you try them. Think of your code like a recipe that has to be followed from start to finish. If something doesn’t work, R will produce an Error message. Try to figure out what went wrong from that message, remembering that your code is like a recipe.

Consider the following recipe:

Add dry ingredients to bowl
Add wet ingredients to bowl
Stir to make thick batter
Place batter in 9x13in pan
Place pan in oven for 35minutes at 350F

Imagine you forget to add the wet ingredients. If you haven’t done the 2nd step successfully, you can’t do the third step and successfully make a batter. The computer will give you an error that says that you can’t make a batter with what is in the bowl. You therefore also can’t do any of the later steps (such as putting a non-existent batter in the oven!).

How should you solve this problem? Go back to the beginning and re-run the first step. If it runs successfully, run the second and so on. For each line, make sure the data look like what you expect (by typing the name of an object and seeing the output). It’s important to check to see that the data looks right because a problem might happen much earlier than the error. For example, if you get an error on step 3, then you might not know whether it was the dry ingredients (step 1) or the wet ingredients (step 2) that was the problem. Looking at the bowl after each step will help you to determine which.

Also, please remember what I mentioned above about spelling and upper/lowercase letters being different. Computer languages are picky and you have to be precise.

Whitespace

Within a code chunk, each command should start on a new line. However, a single command can take up more than one line, and spacing out your commands over several lines is a good way to make it easy to read (and debug). A command only ends when open parentheses are closed. For example, in the following plot command, I’ve put each argument on it’s own line to make the lines short and easy to read. The opening parentheses after plot stay open until the last line, at which point the command is finished. I encourage you to use this type of spacing in your own code, and RStudio will make it easy for you.

plot(x = dat$length,            # Note that this command is still 'open'
     y = dat$width,             # ...and still not finished ...
     xlab = "Leaf length (cm)", # ...
     ylab = "Leaf width (cm)",  # ...
     main = "Leaf Dimensions")  # ... until it finally finishes here!
# notice that the arguments are separated by commas and are labeled (eg. x, y, xlab, ylab, main).

Exercises

Before you continue, knit this document to make sure everything above this point works.

Demonstrate that you can read a file into R. I recommend using Google Sheets, as above, but you may also use another program such as Excel and save it as a .csv file. There are other ways, but for now, lets just keep it simple!

Replace my name on line 3 with your name.
Make a data file (I recommend Google Sheets).
- Make sure you have at least 2 columns and 10 rows of numeric data. Make up values.
- Name your columns without using spaces or other ‘funny’ characters. Capitals, underscores (_), and dots (.) can be used to make the column headers easier to read. But don’t use spaces!!!
- Make sure there aren’t letters in the columns (other than the first row). If there are, R will assume that the data are textual, rather than numeric, and it can’t do mathematical operations on text.
Write code in the empty code chunk below to read a data file (such as Google sheets).
- Save the data into a variable called “mydata” using the assignment operator (“<-”)
- Remember that we learned the read.csv() command above.
Make R print out the data by simply typing “mydata” alone on a line.

# Put your code into this chunk to read a google sheets file using 'read.csv()'
# and save it into a variable called mydata.

The line of code in the next chunk will calculate summary statistics for your dataset. However, it can’t do that until you have a dataset called “mydata”, which you should have created above. In order to not have an error, I have put “eval=FALSE” in the top line of the code chunk. This prevents knitr from evaluating something that will cause an error. However, by the time you get here, you should have this data, so you can ask knitr to evaluate it by changing “eval=FALSE” to “eval=TRUE”. This is a common thing you’ll have to do frequently in these lab assignments.

Change the “eval=FALSE” to “eval=TRUE” to allow knitr to have access to this code. You will need to look at the Rmd version of this document – it won’t show up in the knitted version.

summary(mydata)

Use the plot() command to generate a simple graph of your data. I have put in the most important parts of the command, but you will need to fill it in depending on the names in your dataset. You can see in the comments what I used when I was using the leaf data from my own dataset. You should use the same method, but your names will be different.
Note that you should have previously (in a step above) saved your data into a variable called “mydata”.
Put some text between the quotes to label the x-axis and the y-axis
Put some text between the quotes to label the entire graph (main)
Change the ‘eval=FALSE’ to ‘eval=TRUE’ so that this graph will be included in the knitted output.

plot(x = mydata     ,     # dat$length
     y = mydata     ,     # dat$width
     xlab="",             # text within the quotes to label the x-axis
     ylab="",             # text within the quotes to label the y-axis
     main="")             # text within the quotes to label the whole graph

What to turn in

Save your file. Please remember to do this periodically.
Remember to clean your environment and clear the knitr cache. If you forgot how to do this, look here. You don’t need to do this every time you knit, just the last time as you prepare your html file to turn in.
Knit the document to produce an html file.
Check that the html file has your graph near the bottom.
Save the html file.
Turn in your html file on canvas. I will be looking to make sure that you completed the exercises section. Specifically, you should have a simple graph using your own data.

Introduction to R

Senanu Spring-Pearson