Note that you will not be able to knit this document yourself until there is data entered into the spreadsheet.

Purpose, and getting credit

This is designed to be a very easy assignment, but it deals with some complex statistical issues. Since this isn’t a statistics course, my goal here is just to expose you to the idea of hypothesis testing (see below), which we’ll revisit later in the course.

This exercise will also give you the chance to make some (hopefully) nicer graphs than we’ve previously done.

You will not have to change any code to complete this.

Finally, to get credit, you must read through this whole document and answer the few questions in the “Back to your data…” section at the bottom. Knit it to HTML to turn it in.

Install a graphics library

The code in this file uses a more advanced graphing function than is available in base R. You will need to download this library in order to use it, but fortunately R makes this very easy. You need to run the installation only once on your computer; afterwards, you can ignore it or delete it.
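The installation chunk does not appear in this rendered copy. A minimal sketch of what it might look like, assuming the package comes from CRAN via `install.packages()`. The `requireNamespace()` guard and the explicit `repos` mirror are additions here (not from the original file) so that the chunk does nothing when ggplot2 is already present and does not stop to ask for a mirror while knitting:

```r
# One-time setup: install ggplot2 only if it is not already available.
# The guard and the repos argument are assumptions added for convenience.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}
```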

After installing the ggplot2 package, you need to load it into the memory of your computer. You need to do this every time you start R (or RStudio). Once loaded, you don’t need to reload it, though running the code to do so is very quick, so it’s better to just leave the following code in.
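The loading chunk is also missing from this rendered copy. Presumably it is the standard `library()` call, sketched here:

```r
# Load ggplot2 so that ggplot(), aes(), geom_point(), etc. are available
library(ggplot2)
```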

Get data

You will enter your data into a Google spreadsheet in a ‘long’ format. In ‘long’ format, each observation is on its own line, with one or more columns to distinguish between them. You can enter your data here. Each lab group should enter their data in a different portion of the spreadsheet, and we will use the whole class’s data for our analysis.
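To make ‘long’ format concrete, here is a small sketch of what such a table might look like in R. The column names (`time`, `co2vol`, `carb`) are the ones used later in this document; the name `example.long` and all of the numbers are invented for illustration and are not real measurements.

```r
# Hypothetical long-format data: one row per measurement.
# Column names follow the ones used later in this document;
# the values are made up for illustration only.
example.long <- data.frame(
  time   = c(0, 10, 20, 0, 10, 20),
  co2vol = c(0.00, 0.09, 0.19, 0.00, 0.01, 0.01),
  carb   = c("glucose", "glucose", "glucose", "water", "water", "water")
)
example.long

# Each row is one observation; the 'carb' column tells the observations apart
table(example.long$carb)
```

Adding another carbohydrate or another time point means adding rows, not columns.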

Plot all data

We will start by simply plotting all the data, using a separate line for each carbohydrate. Notice that we don’t use the simple plot() function like we’ve done before, but we instead use ggplot() which we got from the ggplot2 package above.

ggplot2::ggplot(data = dat,
                aes(x = time, y = co2vol, color = carb, group = carb)) +
  geom_point() +
  geom_smooth(method=lm, se=TRUE) +
  xlab("Time (minutes)") +
  ylab(expression(Volume~of~CO[2])) +
  labs(color = "Carbohydrate")
## `geom_smooth()` using formula = 'y ~ x'

The graph above is a placeholder only and uses fake data. Your data will be much less regular.

Notice that the regression lines may all have different slopes and intercepts. You might be tempted to conclude that they are therefore very different, but as scientists, we need to run a statistical test in order to make that conclusion.

You may imagine that there is some underlying biological principle at work (there is!) but that we don’t really know what effect it will have on the volume of carbon dioxide produced. By measuring the volume of CO2 we are attempting to discover the truth about that biological principle. However, because of random experimental errors and the fact that we are just taking a ‘sample’ of the ‘truth’, our data won’t perfectly reflect this ‘truth’. We want to know whether it is likely that there is a difference between the lines on the graph (we’ll study probability and statistics in more depth later in the course).

Many basic statistical methods use what is called a ‘hypothesis test’. In a hypothesis test, we are interested in finding out whether it is likely that our data is different from what we might get from randomly sampling data in which there is no relationship between two variables.

Imagine you did an experiment similar to this but instead of using live yeast, you used something relatively inert like salt. You’d expect that there would be no relationship between time and the volume of CO2 produced. In other words, you’d expect the line to be flat, even though the error and randomness of your measurements resulted in non-zero data. You might have data like those shown below:

set.seed(9)
salt <- data.frame(time = rep(seq(0, 90, by=10), times = 10),
                   co2  = rnorm(100, mean=0, sd=1),
                   carb = c(rep("carbA", times = 50), rep("carbB", times = 50)))
ggplot2::ggplot(data = salt,
                aes(x = time, y = co2)) +
  geom_point() +
  geom_smooth(method=lm, se=TRUE) +
  xlab("Time (minutes)") +
  ylab(expression(Volume~of~CO[2])) +
  labs(color = "Carbohydrate")
## `geom_smooth()` using formula = 'y ~ x'

Is this line flat?

The line above isn’t quite level – the slope doesn’t seem to be quite 0. In statistics, we would state a “null hypothesis” (H0) that states that there is no relationship between time and the volume of CO2 produced. An alternate hypothesis (HA) would state that there is a relationship between the two:

H0: There is no relationship between the two variables. The slope is not different from 0.

HA: There is a relationship between the two variables. The slope is different from 0.

Usually, we hypothesize that there is a relationship between two variables, and so we want to reject the null hypothesis H0.

In the graph above, the line has a slight slope (i.e. the slope isn’t quite 0), but is that slight slope meaningful and indicative that there is some biological effect going on, or was it generated just by our random sampling of the data (plus some measurement errors)? Distinguishing between the null hypothesis H0 and an alternative hypothesis (HA) is what statistical tests are designed to do.
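One way to get a feel for how much slope random sampling alone can produce is to repeat the ‘salt’ experiment many times on the computer. The sketch below is an illustration added here, not part of the assignment: it reuses the same time points and the same noise settings as the fake salt data, and the names `n.repeats` and `null.slopes` are made up.

```r
# Simulate many experiments in which there is truly no relationship
# between time and CO2, and record the fitted slope each time.
set.seed(42)
time <- rep(seq(0, 90, by = 10), times = 10)
n.repeats <- 1000

null.slopes <- replicate(n.repeats, {
  co2 <- rnorm(length(time), mean = 0, sd = 1)  # pure noise, true slope is 0
  coef(lm(co2 ~ time))[["time"]]                # fitted slope for this run
})

# The slopes scatter around 0 but are almost never exactly 0
summary(null.slopes)

# Range that contains the middle 95% of the 'no relationship' slopes
quantile(null.slopes, c(0.025, 0.975))
```

A fitted slope that falls well outside that middle range would be hard to explain by chance alone, which is the intuition behind the test described next.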

Fortunately, R was originally designed as a statistical language and so has many statistical tests built in. You have already learned to do a regression analysis, so let’s do a regression of the volume of CO2 on Time using this fake salt data to see if the line is statistically flat or if there is something interesting causing it to be not flat.

First, we do the regression:

salt.regression <- lm(co2 ~ time, data=salt)
salt.regression
##
## Call:
## lm(formula = co2 ~ time, data = salt)
##
## Coefficients:
## (Intercept)         time
##   -0.189953     0.003032

Remember that this uses the formula for a straight line \[Y = mX + b\] where \(Y\) is CO2 and \(X\) is Time. \(m\) is therefore the slope and \(b\) is the intercept. So we could write this as \[Y = 0.003X - 0.19\]
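As a quick worked example of using that line, the sketch below plugs a time of 60 minutes into it. The coefficients are copied from the printed output above; the choice of 60 minutes and the variable names are arbitrary and only for illustration.

```r
# Coefficients copied from the regression output above
slope     <- 0.003032   # m
intercept <- -0.189953  # b

# Predicted CO2 at 60 minutes: Y = m * X + b
time.point <- 60
predicted  <- slope * time.point + intercept
predicted
```

The result is very close to 0, which is what you would expect if salt produces no CO2.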

Let’s use the summary() function to get more information about this linear regression:

summary(salt.regression)
##
## Call:
## lm(formula = co2 ~ time, data = salt)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.6400 -0.6680 -0.1230  0.4738  2.7203
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.189953   0.178367  -1.065    0.290
## time         0.003032   0.003341   0.907    0.366
##
## Residual standard error: 0.9597 on 98 degrees of freedom
## Multiple R-squared:  0.008333,   Adjusted R-squared:  -0.001786
## F-statistic: 0.8235 on 1 and 98 DF,  p-value: 0.3664

Find the ‘Coefficients’ table in the middle. In the first column, you can see the ‘Estimate’ of the Intercept and the slope (labeled ‘time’). At the far right of the table, you can see a column labeled Pr(>|t|). This is the p-value: the probability of getting an estimate at least this far from 0 if the null hypothesis were true. In this case, the p-value for the slope is 0.366, meaning that a slope this far from 0 would turn up about 37% of the time even if there were no relationship at all. If the estimate were more extreme, the p-value would be smaller. When the p-value drops below 0.05 (a result this extreme would occur less than 5% of the time under the null hypothesis), we say that the slope is ‘statistically different’ from 0, and we therefore reject H0 in favor of the alternate hypothesis HA.
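The Pr(>|t|) value for the slope can be reproduced from the other numbers in its row, which may help show where it comes from. The sketch below copies the estimate, standard error, and degrees of freedom from the output above (the variable names are made up) and uses the two-sided t-test that `summary()` reports for each coefficient.

```r
# Values copied from the 'time' row of the summary output above
estimate  <- 0.003032
std.error <- 0.003341
df        <- 98        # residual degrees of freedom

# The t value is the estimate measured in units of its standard error
t.value <- estimate / std.error

# Two-sided p-value: chance of a t value at least this far from 0
# (in either direction) when the true slope is 0
p.value <- 2 * pt(-abs(t.value), df = df)

round(t.value, 3)
round(p.value, 3)
```

Both numbers should land very close to the 0.907 and 0.366 shown in the table; small differences come from the rounding of the copied inputs.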

Back to your data…

Let’s first see whether there is a relationship between your time and the production of CO2 in your negative control. First, we will select only the ‘water’ data:

## We can select all rows matching 'water' with square brackets
## dat[r,c] means to select row 'r' and column 'c'
## You can remember the order by thinking RC as "Roman Catholic"
## So in this case, we select all rows in which dat$carb is
## equal to (note the double ==) 'water' and then we choose all
## columns (no filter after the comma)
water <- dat[dat$carb == 'water',]
water.reg <- lm(co2vol ~ time, data = water)
summary(water.reg)
##
## Call:
## lm(formula = co2vol ~ time, data = water)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.037606 -0.001583  0.000500  0.002583  0.039364
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0029091  0.0055417   0.525    0.606
## time        0.0119242  0.0001038 114.872   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01333 on 18 degrees of freedom
## Multiple R-squared:  0.9986, Adjusted R-squared:  0.9986
## F-statistic: 1.32e+04 on 1 and 18 DF,  p-value: < 2.2e-16

Is the p-value less than 0.05? If so, that tells us that there is a relationship between time and CO2. Given that this is your negative control, did you expect there to be a relationship? Why or why not?

Type your answer here. If you use 4 spaces at the beginning, it'll format nicely.

OK, now let’s look to see whether the slopes of the lines are different from each other. We need to use a more complex model here (don’t worry about details). Let’s specifically compare the slopes for glucose and sucrose:

##
## Call:
## lm(formula = co2vol ~ time * carb, data = gluc.suc)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.046970  0.000644  0.003864  0.006212  0.049848
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.364e-04  4.908e-03  -0.028    0.978
## time         9.420e-03  9.193e-05 102.464  < 2e-16 ***
## carb1       -4.818e-03  9.816e-03  -0.491    0.626
## time:carb1  -9.485e-04  1.839e-04  -5.159 9.24e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0167 on 36 degrees of freedom
## Multiple R-squared:  0.9966, Adjusted R-squared:  0.9963
## F-statistic:  3535 on 3 and 36 DF,  p-value: < 2.2e-16
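The code that produced the output above is not shown in this rendered copy. Below is a minimal sketch of how a model with the same formula, `co2vol ~ time * carb`, could be fit and how the interaction p-value could be pulled out. It runs on simulated stand-in data, so its numbers will not match the output above. The names `true.slope`, `slope.model`, `coef.table`, and `interaction.p` are made up, and the label of the last row depends on how R codes the `carb` factor (here it will not read `time:carb1`).

```r
# Simulated stand-in data; values are invented for illustration only.
set.seed(1)
gluc.suc <- data.frame(
  time = rep(seq(0, 90, by = 10), times = 2),
  carb = factor(rep(c("glucose", "sucrose"), each = 10))
)

# Give the two carbohydrates different true slopes, plus a little noise
true.slope <- ifelse(gluc.suc$carb == "glucose", 0.010, 0.008)
gluc.suc$co2vol <- true.slope * gluc.suc$time +
  rnorm(nrow(gluc.suc), mean = 0, sd = 0.01)

# time * carb is shorthand for time + carb + time:carb.
# The time:carb (interaction) term measures how much the slopes differ.
slope.model <- lm(co2vol ~ time * carb, data = gluc.suc)
coef.table  <- coef(summary(slope.model))
coef.table

# The interaction is the last row; its p-value is in the Pr(>|t|) column
interaction.p <- coef.table[nrow(coef.table), "Pr(>|t|)"]
interaction.p < 0.05
```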

For this exercise, we can focus in on the time:carb1 row and look at the p-value. If that value is less than 0.05, then we can conclude that the slopes of the two lines are statistically different.

What do your data show? Is the p-value in the last row of the Coefficients table less than 0.05 \((p<0.05)\)?

Type your answer here. If you use 4 spaces at the beginning, it'll format nicely.

Although we won’t individually test each of the lines, they almost certainly have different slopes. What does this tell you about the different enzymes that yeast has to break down these different molecules?

Type your answer here. If you use 4 spaces at the beginning, it'll format nicely.

Once you have answered the questions in this section, change the name at the top of this document to your own, knit it to HTML, and turn it in on Canvas.