Learn to use the statistical computing language R in
RStudio.
Please read this document completely, and follow along in RStudio or Posit Cloud. The last section contains a simple exercise that you must complete and turn in.
As you have already learned in a previous exercise, this is an R Markdown document, which mixes text with code. To start out, you should ‘knit’ the document before making any changes. It will be easier to read, with nice formatting.
You can read this document in 3 different ways, depending on what you prefer:
1. View it as plain “Source” text. You will see all the formatting codes (which might be distracting and ‘ugly’, and the numbering may seem wrong), but this is the rawest form, and how the document is actually written. Above and to the right of the code editor pane (top left pane), click the “source” button.
2. View it in the visual editor to make the formatting nice and avoid seeing all the formatting code. This is how I recommend you read the document. Click the “visual” button at the top left of the top left (code editor) pane.
3. Knit the document and read the HTML. This is the final form of the document, but some things that you’ll need get hidden in this form, so I don’t recommend it. You can view it either in a separate window or in the “Plots and files” pane (lower right).
Try viewing the document in each of these forms to see what you like.
Below, you will see a gray code ‘chunk’ that begins and ends with 3 backticks (under the tilde on your keyboard; however when you view the document in the Visual mode, as I suggest, you won’t see those backticks). All of the code you want to execute will need to go in those gray areas. To execute it, you can click the little right-facing green arrow at the top right of the gray box. Go ahead and experiment and make changes directly into the Rmd document. Do not copy commands into a separate script file.
The following code chunk is ready to run. Press the green arrow now and check the R console (usually in the bottom left portion of your screen) to make sure it ran without errors.
a <- 2
This is a simple command that we’ll learn about in a moment. You can
run commands one at a time by placing your cursor on the line you
want to run and pressing Ctrl+Enter, or you can run various
combinations of commands using the Run button at the top of
the code editor pane. Try both methods now.
If you want to run a single command without recording it into your script, you can enter it into the R console (bottom left of your screen). This is useful for getting help or seeing your data (see below), but if you want to record this step, you should put it into a code chunk of an Rmd document in the top left pane of your screen.
You can do a lot with R but for now, we need to
understand a few basics.
Try to think of R like a language. In any language,
there are nouns and verbs. Nouns can
take the form of a single value such as the number 10. We can use verbs
to operate on those nouns, such as the plus sign in 10 + 5 = 15. Try typing 10 + 5 into the code chunk below and run
it. You should see that R returns the answer.
# Put your code (10 + 5) below this line. Don't include the parentheses. Then run the line with Ctrl+Enter or by clicking the green arrow at the top right of this code chunk.
In order to make R more than a glorified calculator, we will save values into named variables. To do so, we will use the assignment operator (<-). This operator (a verb) assigns the value on the right into the variable on the left. So, when you ran the command
a <- 2
above, you assigned the value of 2 into the variable
a. Now, if you want to operate on the variable
a, you can do so. The following code multiplies whatever
value(s) are in the variable a by 10, and
assigns the resulting value to a variable called
b. We can see the contents of a variable by typing its name
alone on a line (and running it).
b <- a * 10
b
## [1] 20
Variables can also contain a group of values called a
vector. To make a vector, we use the ‘combine’
function, c(). Most verbs in R are followed by
parentheses, and are called ‘functions’. Inside the parentheses, we put
variables or other information (collectively called
arguments) that the function needs in order to do its
job properly. The c() function combines
all of its arguments into a single vector. As before, you can see the contents of the variable by typing
its name alone on a line (and running it).
my_variable <- c(a, b, 1, 2, 3)
my_variable
## [1] 2 20 1 2 3
Now that we have a variable with more than one value, we can do some more interesting things with it. For example, we can do mathematical operations on all the numbers at once just by referring to the vector:
(my_variable * 1000 + 50) / 10
## [1] 205 2005 105 205 305
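To see why the result looks the way it does, it can help to break the calculation into steps. The sketch below rebuilds the vector with the same values used above (2, 20, 1, 2, 3) so that it runs on its own; the names step1, step2, and step3 are made up for illustration.

```r
# Same values as my_variable above, typed in directly so this chunk runs by itself
my_variable <- c(2, 20, 1, 2, 3)

step1 <- my_variable * 1000   # every value is multiplied by 1000
step2 <- step1 + 50           # 50 is added to every value
step3 <- step2 / 10           # every value is divided by 10

step3                         # same result as the one-line version
## [1]  205 2005  105  205  305
```

R applies each operation to every value in the vector, which is why one short line can replace five separate calculations.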
We can also apply other functions (verbs) to the entire vector of
numbers. For example, we can take the average using the
mean() function and find the minimum and maximum values.
There’s even a function called summary() that reports several useful
statistics at once. Try the following commands. (Remember, you can
run them either by clicking the green arrow or by pressing Ctrl+Enter on each
line.)
mean(my_variable)
## [1] 5.6
max(my_variable)
## [1] 20
min(my_variable)
## [1] 1
summary(my_variable)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 2.0 2.0 5.6 3.0 20.0
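The mean reported above is just the sum of the values divided by how many values there are. Here is a quick check, assuming the same five values as before; sum() and length() are standard R functions that this document has not introduced yet.

```r
# Same values as my_variable above, typed in directly so this chunk runs by itself
my_variable <- c(2, 20, 1, 2, 3)

sum(my_variable)                        # adds up all the values: 28
length(my_variable)                     # counts the values: 5
sum(my_variable) / length(my_variable)  # 28 / 5, the same answer as mean(my_variable)
```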
In addition to holding a vector of values, a variable can hold a whole table of values. For this, we’ll use a small built-in dataset about cars called ‘mtcars’. Run the following line of code to get access to this dataset. Then you can look at the table by typing the name of the dataset (only the first few rows are shown here).
data(mtcars)
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
You can also see the dataset in the “Workspace and History” panel (top right panel) under ‘Environment’. Try double-clicking the name of the dataset to see it in tabular form. You should see rows labeled for the type of car, and then characteristics of the car as a number of columns.
You specify which column you want using the dollar sign
($). We can use this dollar sign
notation to isolate one column of data. For example, to calculate the
mean (average) fuel economy for these cars, you can do:
mtcars$mpg # show only the 'mpg' (miles per gallon) column
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
mean(mtcars$mpg) # get the mean (average) mpg for this dataset
## [1] 20.09062
mileage <- mtcars$mpg # Save the mpg data into a separate vector called 'mileage'.
This simply means that you are choosing the mpg column
of the variable ‘mtcars’, and want to find out the mean
value of that column. How would you get the average horsepower (‘hp’)
for this dataset? Save it into a variable called
‘average.horsepower’.
# What code would you use to get the mean horsepower?
R contains a help file for every function. In order to see it, type a
question mark in front of the function name. For example, the
‘rnorm()’ function allows you to generate a group of
normally-distributed random numbers:
?rnorm # get help on the rnorm() function
The help files will tell you what the function does, and what
arguments you can give to the function. For the ‘rnorm()’
function, it shows that you can specify (as an ‘argument’) what you want
the mean and standard deviation of those random numbers to be. At the
bottom of most help files, there are examples which help you to figure
out how you can use the function.
Try getting help on the ‘median()’ function.
# How do you get help on how to use the median() function?
As mentioned above, most functions (verbs) take one or more
arguments. These arguments tell R exactly what you want the verb to do,
and on which data you want the verb to act. The function
mean() that you used above takes an argument that we put in
parentheses after the function, and it tells R to take the
‘mean’ of whatever data you give to the function. Some functions have
long lists of arguments, which you separate with commas. Look
again at the help for rnorm() using ?rnorm.
In that, you will see
rnorm(n, mean = 0, sd = 1)
This indicates that this function takes 3 arguments. You can simply use 3 unlabeled values, such as
rnorm(100, 5, 1)
which would generate 100 random values with a mean of 5 and standard deviation of 1. The order of these arguments is important, since they are unnamed – they need to appear in the same order as specified in the help file. A better way to do it is to name the arguments, such as:
rnorm(n = 100, mean = 5, sd = 1).
In this case, since each argument is named, we can put it in any
order we want. It’s also good to write them out like this so that we
avoid confusion. We can also leave out some arguments, and use the
defaults that are specified in the help file. In this case we could just
use rnorm(100) and accept the defaults of
mean = 0, sd = 1.
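One way to convince yourself that named arguments can go in any order is to generate the same random numbers twice. The sketch below uses set.seed(), a standard R function not covered in this document, to make the random numbers repeatable; the seed value 42 and the names x1, x2, and x3 are arbitrary choices for illustration.

```r
set.seed(42)                          # fix the starting point of the random number generator
x1 <- rnorm(n = 5, mean = 5, sd = 1)  # named arguments, in the order from the help file

set.seed(42)                          # reset to the same starting point
x2 <- rnorm(sd = 1, n = 5, mean = 5)  # same named arguments, different order

identical(x1, x2)                     # TRUE: the order did not matter

set.seed(42)
x3 <- rnorm(5)                        # only n is given, so mean = 0 and sd = 1 are used
length(x3)                            # 5 values were generated
```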
Since R is a computer program, it is very finicky about spelling. If
you spell something wrong, it won’t know what you are talking about! For
example, typing mtcars$mbg instead of
mtcars$mpg will not give you the data you want: there is no data
column called mbg in the mtcars dataset, so R returns NULL instead of the fuel economy values.
Similarly, letters of different case are very different in R. For
example, if you type Mtcars hoping to get the
mtcars dataset, R will tell you that there is no such
dataset.
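What R does with a misspelled name depends on where the mistake is. The small sketch below shows both cases; tryCatch() and conditionMessage() are standard R functions not covered in this document, and are used here only so that the chunk keeps running and shows the error message as text.

```r
data(mtcars)

mtcars$mbg            # misspelled column name: R returns NULL instead of the data
is.null(mtcars$mbg)   # TRUE

# A misspelled (or wrongly capitalized) object name is an error.
tryCatch(
  Mtcars,                                   # capital M: no object has this name
  error = function(e) conditionMessage(e)   # return the error message as text
)
```

A NULL result is easy to miss, so check your output whenever a column seems to have no values in it.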
R doesn’t have a spreadsheet program built in, so we will use Google Sheets. Before doing this, however, we need to ‘publish’ our data as a CSV (Comma-separated values) file. You will need to do this with each Google dataset that you use. Start by logging in to your Google account and creating a file with a few values in 2 columns (this example is from my Google Drive):
Give your document a title, and make sure you know where it is in your Google Drive. Next, select the ‘File’ menu, then ‘Share’ and ‘Publish to the web’. Then choose to publish it as a CSV file.
You should then copy the URL supplied, and paste it into your code.
Probably the easiest function to use is ‘read.csv()’ which
will read that file and make it available for you to use in R. The code
chunk below reads in the csv (using read.csv()) and then
saves that table to an object called dat. As before, you
can see what is in the table by simply putting the object name on its
own line and running it.
Notice in the read.csv() command below that the command
has parentheses, and the information that goes within the parentheses is
the URL given to you by Google. You need to put that entire URL in
quotes ("").
# The read.csv() function is an easy way to get data into R. Simply paste
# the URL provided by Google into the command, and place it between quotes.
# Note that files that are exported by Google have "...output=csv" at the
# end of the URL. If yours doesn't look like that, it wasn't exported properly.
# Don't change the following line -- you'll do the same thing for your own data
# below
dat <- read.csv(file = "https://docs.google.com/spreadsheets/d/e/2PACX-1vR1WD5pL0ueIrDigzuFBuMJhhs0_y5mLSNsfL1y9hxRJrA6qFJNcnQSEuREzvM3wLHaQjHSX8SY2c_D/pub?output=csv")
dat # don't change this either.
## leaf length width
## 1 1 4.5 2.1
## 2 2 4.6 2.2
## 3 3 4.3 1.9
## 4 4 4.4 2.2
## 5 5 4.8 2.4
## 6 6 4.4 2.1
## 7 7 4.5 2.0
## 8 8 4.6 2.3
## 9 9 4.4 2.0
## 10 10 4.7 2.2
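If you want to practice read.csv() without a published sheet, the same function can read CSV text typed directly into your code through its text argument. This is only a stand-in for the Google Sheets URL: the names csv_text and practice are made up, and the three rows are copied from the first rows of the leaf table above.

```r
# CSV text: the first line holds the column names, and values are separated by commas
csv_text <- "leaf,length,width
1,4.5,2.1
2,4.6,2.2
3,4.3,1.9"

practice <- read.csv(text = csv_text)  # 'text' is used in place of a file or URL
practice                               # show the table
names(practice)                        # column names: "leaf" "length" "width"
nrow(practice)                         # number of rows: 3
```

Looking at CSV text this way also shows what Google is sending to R: plain text with one row per line.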
Finally, you can see whether there seems to be a relationship between the length and width of the leaves in my dataset by drawing a simple graph that includes the data, axis labels, and a main title. We can also do a linear regression of the data and plot that line on the graph (you’ll learn more about these later; for now, simply run the code, don’t change it):
# Don't change anything in this code chunk -- just run it.
plot(x = dat$length,
y = dat$width,
xlab = "Leaf length (cm)",
ylab = "Leaf width (cm)",
main = "Leaf Dimensions")
linear.regression <- lm(dat$width ~ dat$length) # don't change this
abline(linear.regression) # don't change this either
Things won’t always work perfectly the first time you try them. Think of your code like a recipe that has to be followed from start to finish. If something doesn’t work, R will produce an Error message. Try to figure out what went wrong from that message, remembering that your code is like a recipe.
Consider the following recipe:
Imagine you forget to add the wet ingredients. If you haven’t done the 2nd step successfully, you can’t do the third step and successfully make a batter. The computer will give you an error that says that you can’t make a batter with what is in the bowl. You therefore also can’t do any of the later steps (such as putting a non-existent batter in the oven!).
How should you solve this problem? Go back to the beginning and re-run the first step. If it runs successfully, run the second and so on. For each line, make sure the data look like what you expect (by typing the name of an object and seeing the output). It’s important to check to see that the data looks right because a problem might happen much earlier than the error. For example, if you get an error on step 3, then you might not know whether it was the dry ingredients (step 1) or the wet ingredients (step 2) that was the problem. Looking at the bowl after each step will help you to determine which.
Also, please remember what I mentioned above about spelling and upper/lowercase letters being different. Computer languages are picky and you have to be precise.
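In R terms, ‘looking at the bowl’ means printing an object after each step. The sketch below is a made-up three-step example: the names dry, wet, and batter and all of the values are hypothetical stand-ins for the recipe steps. exists() is a standard R function that reports whether an object has been created.

```r
dry <- c(flour = 2, sugar = 1)     # step 1: dry ingredients
dry                                # check: does this look right?

wet <- c(milk = 1.5, eggs = 2)     # step 2: wet ingredients
wet                                # check again before moving on

batter <- c(dry, wet)              # step 3: only works if steps 1 and 2 ran
batter
length(batter)                     # 4 ingredients in the bowl

exists("wet")                      # TRUE here; FALSE if step 2 had been skipped
```

If step 3 fails, printing dry and wet on their own lines tells you which earlier step needs to be fixed and re-run.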
Within a code chunk, each command should start on a new line.
However, a single command can take up more than one line, and spacing
out your commands over several lines is a good way to make it easy to
read (and debug). A command only ends when every open parenthesis has been closed.
For example, in the following plot command, I’ve put each argument on
its own line to make the lines short and easy to read. The opening
parenthesis after plot stays open until the last line, at
which point the command is finished. I encourage you to use this type of
spacing in your own code, and RStudio will make it easy for you.
plot(x = dat$length, # Note that this command is still 'open'
y = dat$width, # ...and still not finished ...
xlab = "Leaf length (cm)", # ...
ylab = "Leaf width (cm)", # ...
main = "Leaf Dimensions") # ... until it finally finishes here!
# Notice that the arguments are separated by commas and are labeled (e.g., x, y, xlab, ylab, main).
Before you continue, knit this document to make sure everything above this point works.
Demonstrate that you can read a file into R. I recommend using Google Sheets, as above, but you may also use another program such as Excel and save your data as a .csv file. There are other ways, but for now, let’s just keep it simple!
Read in your data using the read.csv() command, as above.
# Put your code into this chunk to read a google sheets file using 'read.csv()'
# and save it into a variable called mydata.
The line of code in the next chunk will calculate summary statistics for your dataset. However, it can’t do that until you have a dataset called “mydata”, which you should have created above. To avoid an error, I have put “eval=FALSE” in the top line of the code chunk. This prevents knitr from evaluating something that would cause an error. However, by the time you get here, you should have this data, so you can ask knitr to evaluate the chunk by changing “eval=FALSE” to “eval=TRUE”. You will need to do this frequently in these lab assignments.
summary(mydata)
Use the plot() command to generate a simple graph of
your data. I have put in the most important parts of the command, but
you will need to fill it in depending on the names in your dataset. You
can see in the comments what I used when I was using the leaf data from
my own dataset. You should use the same method, but your names will be
different.
plot(x = mydata , # dat$length
y = mydata , # dat$width
xlab="", # text within the quotes to label the x-axis
ylab="", # text within the quotes to label the y-axis
main="") # text within the quotes to label the whole graph
Comments
Computer code is difficult to read and understand! To make it easier, you should add comments to your code to explain what you are doing. This is very important as you will often not remember what or why you did something later (even the next day). It will also help other people understand what your code does.
In order to add comments, you can use the ‘pound’ symbol (#; yeah, I guess you probably call it a ‘hashtag’). R ignores everything on a line after the pound sign, so you can either use it on a line of its own or at the end of a line with code: