IRDS: Lab Session 2: R

R is a powerful system for statistical computation. It has good graphing facilities, and a set of contributed packages that implement a wide variety of statistical methods.

Pre-lab Work

If you have not seen R before, you should spend a few minutes going through a tutorial on R to get you oriented. One good R tutorial is here — you don't need to go through the whole thing, probably sections 1-8 are most relevant to this lab.

As this lab is optional and is not marked, you may use any materials that you wish to help you. Here are a few that might help:

R Project web site
R for Beginners
R Quick Reference
R for Matlab users
R for Programmers
R for data mining
Rseek: For searching for documentation about "R". Useful if your search engine does not recognize "R" as a token.
Amusing: R Inferno. Abstract: "If you are using R and you think you're in hell, this is a map for you." Circle 8 about common mistakes is useful.

Also, here are a few useful functions to get you oriented:

The function ls() will tell you the list of variables that you have currently defined, called the "workspace" (NOT the filesystem).
You can manipulate the current working directory of R using the setwd and getwd functions.
To get help on a particular function, such as read.csv, you can do help(read.csv) or ?read.csv
To search help, do help.search("gaussian")
help.start() will open a window in the web browser showing the online help for R.
You can enter in R commands interactively, but often it's easier to enter them in a separate editor window. Then you can use Ctrl+R to run blocks of code directly from the editor.

Lab Work

In this lab, we will plot some time series data. The data are polls taken about a recent political issue (the Scottish independence referendum that was held in 2014), obtained from the What Scotland Thinks web site. You can download the data from here.

Use the read.csv function to read the data from a file. (If the data were separated by spaces or tabs instead of commas, you would use read.table instead.) Save the results in a variable called data. This will load the data into a data structure called a data frame. There are many things that you can do with this data frame:
- You can view the first few rows using head(data). Notice that the data have column names; these were taken from the first line of the file.
- You can access a particular row, e.g., row 63, using data[63,]. You can access a particular column using either data[,3] or the more descriptive data$No.
- You can find out how big the data set is by doing dim(data) or get summary statistics by doing summary(data).
You can get simple scatterplots by using the plot function, for example, try the command x = 1:10; y = 3*x + rnorm(10); plot(x,y).
If you want to customize plots for publications, R provides a bewildering number of options. If you want to change the axis labels, margins, fonts, tick labels, etc., there are three places that you need to look: First, optional argument to the plot function. To find out about these, check the documentation for methods of the plot function itself, especially plot.default. Second, the par command sets many global graphics parameters, so check its documentation. Finally, if you are writing a plot to a file, you call the pdf (or postscript, etc.) functions before you start plotting. Those functions have important optional arguments as well. There is a good table listing these in the Graphics chapter of R for Beginners.
plot is a polymorphic function that does different clever things depending on the types of its arguments. For example, a factor object is a R structure for representing categorical variables. Many R functions, such as plotting and modelling functions, know how to do special things with factors, e.g., to split the data by each "level" of the factor before doing analysis or plotting. R has already created some factor objects in your data frame based on the fields that it thought were character strings. To see one, look at data$Pollster. If you ever get confused about what type any R object has, you can get a compact summary by doing str(data$Pollster).

But back to plot. See what happens when you try to plot the factor data$Pollster.
Now let's try something more interesting. Try to plot the percentage of people who answered "yes" in the polls, using plot(data$Begin, data$Yes).
Yipes. I don't understand what happened either. But recall that plot does different things depending on the types of their arguments. So what type of object is data$Begin?
We need to convert the Begin column into Date objects. Do this using the as.Date function. The data are in the format "%d/%m/%Y". Then try the plot again.
If we thought that the "yes" vote was increasing linearly with time, we could perform a linear regression of the yes vote onto the date of the poll. R represents dates internally as the number of days since January 1, 1970 (with that date counting as 0) [1], which gives us a numeric input to the regression. Strictly speaking, this model is nonsense (why?), but it is good practice and could be useful for interpolation. Fit a linear model using the lm function and save the result of a function to a variable called mdl. The result will be an R object of class lm. This object has many fields you can inspect, such as mdl$coefficients and mdl$fitted.values. In fact, using the last two you can double-check that the lm function is doing what you expect, that is, that the reported fitted values actually match the reported linear function of the inputs.
Let's add the regression line to the current plot. The lines and points functions will add things to existing plots. For example, suppose that your Date objects are in a vector called dates. Then you can add a few lines to the plot using
```
lines(c(dates[1], dates[25], dates[25]), c(35, 35, 40), col="RED")
```
You should have enough tools now to be able to use the lines function to add the regression line to the plot.
This case is so important that there is actually a simpler way to do this: abline(mdl). abline is also convenient for adding horizontal, vertical, and diagonal lines to plots.
By typing plot(mdl) you have access to up to six diagnostic plots for a linear model. Refer to the documentation for more details.
Clearly a line does not fit the increasing trend at the end. Let's try a locally quadratic fit instead. You can fit this using the loess function. Do this and save the resulting model in another variable. NOTE: For some reason, loess does not like having Date objects as input. You will need to coerce the input to numbers first using as.numeric. For a simple example of this, see the footnote from earlier.
Now let's add the fit to the plot. We'll do this in three steps.
1. First, create a sequence of points that form a one dimensional grid on the x axis. You can use the seq function to do this. Use 1000 evenly spaced points.
2. Now, you can sue the predict function to obtain a vector containing the loess fit at each of the points in the sequence.
3. Finally you can use the lines function as before to draw a series of line segments that connect these points.
Again, adding a loess curve is a common enough operation to have a convenience function: scatter.smooth. Try smoothing the plot this way. Do you get the same fitted curve as from the previous question? If not, what could be different between the two, given that they are both using loess? (Hint: If this is too hard, just keep going with the lab. Maybe an idea will come to you later.)

Extra Credit

It is often believed that polls have "house effects", i.e., that polling firms tend to have a systematic bias. The polling firm is a categorical variable in the data, under the column data$Pollster. Before trying to model whether this effect is apparent in the data, we should visualize it. One of my favorite visualizations is simply to represent the categorical data by colour in the plot. There is a really fun trick for this, which I will explain in stages.
- Let's try adding a single colour to a plot. For example, many R plotting functions will understand a string like "BLUE" as a colour name. For another way of specifying colours in R, look at the output of the command rainbow(10). Redo the scatter plot, but change all the points to a different colour (choose whichever one you like).
- R likes to vectorize operations, i.e., extending operations on scalars to work on sequences and vectors. For example, see what happens when you do
```
c(1,2) + c(2,4,6,8)
```
  That's a fun example, but someday this may cause you horrible problems. For now, however, we'll notice that the sequence lookup operator [] is also vectorized. So you can do things like data$Begin[c(1,3,3,5)]. Check that this command does what your would expect, comparing it to data[1:5,].
- Combining what we've done before, plot the date versus the yes vote again, but colour the points by the pollster. For more help, see the footnote [2].
- For keeners: If you want to use a different pretty set of colours, see related functions to rainbow in the help. Or you can choose your own colours, using a site like Color Brewer. If you'd like to show off a bit, change the plotting character (pch=) argument as well, so that each pollster gets both its own colour and its own plotting character. (This is good if your graph might be viewed either in colour or black and white.) Finally, if you really want to show off, add a legend.
People who make their living by aggregating polls have very sophisticated ways of estimating house effects, including Bayesian approaches. For this lab, we will do something simple. For each row in the data frame, the loess fit gives an estimate of the true vote, so we can estimate the house effect by subtracting the poll value from the loess fit. Add a column to the data frame, e.g., data$fit, that contains this value for each poll.
Plot this new variable against the pollster column. Notice that R (this time helpfully) chose a style of plot for you. If you'd like to run some analyses separately by pollster, you can use split and lapply.
If you'd like to think about this domain further: This graph is not a fair way to assess the polling firm's bias, because there are other factors about the poll (some of which are collected in this data) that will also affect the bias. How could we take this into account in the analysis?

[1] To check this, I did

startdate <- "1970-01-02"; as.numeric(as.Date(startdate))

↩

[2] What you need to do is get yourself a list of colour names, like we did in the previous part of the question. Then look up the colour names using the Pollster column as the array index. ↩