Main | Lectures | Labs | Projects

R is a powerful system for statistical computation. It has good graphing facilities, and a set of contributed packages that implement a wide variety of statistical methods.

If you have not seen R before, you should spend a few minutes going through a tutorial on R to get you oriented. One good R tutorial is here — you don't need to go through the whole thing, probably sections 1-8 are most relevant to this lab.

As this lab is optional and is not marked, you may use any materials that you wish to help you. Here are a few that might help:

Rseek: For searching for documentation about "R". Useful if your search engine does not recognize "R" as a token.

Amusing: R Inferno. Abstract: "If you are using R and you think you're in hell, this is a map for you." Circle 8 about common mistakes is useful.

Also, here are a few useful functions to get you oriented:

- The function
`ls()`will tell you the list of variables that you have currently defined, called the "workspace" (NOT the filesystem). - You can manipulate the current working directory of R using the
`setwd`and`getwd`functions. - To get help on a particular function, such as
`read.csv`, you can do`help(read.csv)`or`?read.csv` - To search help, do
`help.search("gaussian")` `help.start()`will open a window in the web browser showing the online help for R.- You can enter in R commands interactively, but often it's easier to enter them in a separate editor window. Then you can use Ctrl+R to run blocks of code directly from the editor.

In this lab, we will plot some time series data. The data are polls taken about a recent political issue (the Scottish independence referendum that was held in 2014), obtained from the What Scotland Thinks web site. You can download the data from here.

Use the

`read.csv`function to read the data from a file. (If the data were separated by spaces or tabs instead of commas, you would use`read.table`instead.) Save the results in a variable called`data`. This will load the data into a data structure called a*data frame*. There are many things that you can do with this data frame:- You can view the first few rows using
`head(data)`. Notice that the data have column names; these were taken from the first line of the file. - You can access a particular row, e.g., row 63, using
`data[63,]`. You can access a particular column using either`data[,3]`or the more descriptive`data$No`. - You can find out how big the data set is by doing
`dim(data)`or get summary statistics by doing`summary(data)`.

- You can view the first few rows using
You can get simple scatterplots by using the

`plot`function, for example, try the command`x = 1:10; y = 3*x + rnorm(10); plot(x,y)`.If you want to customize plots for publications, R provides a bewildering number of options. If you want to change the axis labels, margins, fonts, tick labels, etc., there are three places that you need to look: First, optional argument to the

`plot`function. To find out about these, check the documentation for methods of the`plot`function itself, especially`plot.default`. Second, the`par`command sets many global graphics parameters, so check its documentation. Finally, if you are writing a plot to a file, you call the`pdf`(or`postscript`, etc.) functions before you start plotting. Those functions have important optional arguments as well. There is a good table listing these in the Graphics chapter of R for Beginners.`plot`is a polymorphic function that does different clever things depending on the types of its arguments. For example, a factor object is a R structure for representing categorical variables. Many R functions, such as plotting and modelling functions, know how to do special things with factors, e.g., to split the data by each "level" of the factor before doing analysis or plotting. R has already created some factor objects in your data frame based on the fields that it thought were character strings. To see one, look at`data$Pollster`. If you ever get confused about what type any R object has, you can get a compact summary by doing`str(data$Pollster)`.But back to

`plot`. See what happens when you try to plot the factor`data$Pollster`.- Now let's try something more interesting. Try to plot the percentage of people who answered "yes" in the polls, using
`plot(data$Begin, data$Yes)`. - Yipes. I don't understand what happened either. But recall that
`plot`does different things depending on the types of their arguments. So what type of object is`data$Begin`? - We need to convert the
`Begin`column into`Date`objects. Do this using the`as.Date`function. The data are in the format "%d/%m/%Y". Then try the plot again. If we thought that the "yes" vote was increasing linearly with time, we could perform a linear regression of the yes vote onto the date of the poll. R represents dates internally as the number of days since January 1, 1970 (with that date counting as 0) [1], which gives us a numeric input to the regression. Strictly speaking, this model is nonsense (why?), but it is good practice and could be useful for interpolation. Fit a linear model using the

`lm`function and save the result of a function to a variable called`mdl`. The result will be an R object of class`lm`. This object has many fields you can inspect, such as`mdl$coefficients`and`mdl$fitted.values`. In fact, using the last two you can double-check that the lm function is doing what you expect, that is, that the reported fitted values actually match the reported linear function of the inputs.- Let's add the regression line to the current plot. The
`lines`and`points`functions will add things to existing plots. For example, suppose that your`Date`objects are in a vector called`dates`. Then you can add a few lines to the plot usinglines(c(dates[1], dates[25], dates[25]), c(35, 35, 40), col="RED")

You should have enough tools now to be able to use the`lines`function to add the regression line to the plot. - This case is so important that there is actually a simpler way to do this:
`abline(mdl)`.`abline`is also convenient for adding horizontal, vertical, and diagonal lines to plots. - By typing
`plot(mdl)`you have access to up to six diagnostic plots for a linear model. Refer to the documentation for more details. - Clearly a line does not fit the increasing trend at the end. Let's try a locally quadratic fit instead. You can fit this using the
`loess`function. Do this and save the resulting model in another variable. NOTE: For some reason,`loess`does not like having`Date`objects as input. You will need to coerce the input to numbers first using`as.numeric`. For a simple example of this, see the footnote from earlier. - Now let's add the fit to the plot. We'll do this in three steps.
- First, create a sequence of points that form a one dimensional grid on the x axis. You can use the
`seq`function to do this. Use 1000 evenly spaced points. - Now, you can sue the
`predict`function to obtain a vector containing the loess fit at each of the points in the sequence. - Finally you can use the
`lines`function as before to draw a series of line segments that connect these points.

- First, create a sequence of points that form a one dimensional grid on the x axis. You can use the
- Again, adding a loess curve is a common enough operation to have a convenience function:
`scatter.smooth`. Try smoothing the plot this way. Do you get the same fitted curve as from the previous question? If not, what could be different between the two, given that they are both using loess? (*Hint*: If this is too hard, just keep going with the lab. Maybe an idea will come to you later.)

- It is often believed that polls have "house effects", i.e., that polling firms tend to have a systematic bias.
The polling firm is a categorical variable in the data, under the column
`data$Pollster`. Before trying to model whether this effect is apparent in the data, we should visualize it. One of my favorite visualizations is simply to represent the categorical data by colour in the plot. There is a really fun trick for this, which I will explain in stages.Let's try adding a single colour to a plot. For example, many R plotting functions will understand a string like "BLUE" as a colour name. For another way of specifying colours in R, look at the output of the command

`rainbow(10)`. Redo the scatter plot, but change all the points to a different colour (choose whichever one you like).R likes to vectorize operations, i.e., extending operations on scalars to work on sequences and vectors. For example, see what happens when you do

c(1,2) + c(2,4,6,8)

That's a fun example, but someday this may cause you horrible problems. For now, however, we'll notice that the sequence lookup operator

`[]`is also vectorized. So you can do things like`data$Begin[c(1,3,3,5)]`. Check that this command does what your would expect, comparing it to`data[1:5,]`.Combining what we've done before, plot the date versus the yes vote again, but colour the points by the pollster. For more help, see the footnote [2].

For keeners: If you want to use a different pretty set of colours, see related functions to

`rainbow`in the help. Or you can choose your own colours, using a site like Color Brewer. If you'd like to show off a bit, change the plotting character (`pch=`) argument as well, so that each pollster gets both its own colour and its own plotting character. (This is good if your graph might be viewed either in colour or black and white.) Finally, if you really want to show off, add a legend.

- People who make their living by aggregating polls have very sophisticated ways of estimating house effects, including Bayesian approaches.
For this lab, we will do something simple. For each row in the data frame, the loess fit gives an estimate of the true vote, so we can estimate
the house effect by subtracting the poll value from the loess fit. Add a column to the data frame, e.g.,
`data$fit`, that contains this value for each poll. - Plot this new variable against the pollster column. Notice that R (this time helpfully) chose a style of plot for you.
If you'd like to run some analyses separately by pollster, you can use
`split`and`lapply`. - If you'd like to think about this domain further: This graph is not a fair way to assess the polling firm's bias, because there are other factors about the poll (some of which are collected in this data) that will also affect the bias. How could we take this into account in the analysis?

Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland, UK
Tel: +44 131 651 5661, Fax: +44 131 651 1426, E-mail: school-office@inf.ed.ac.uk Please contact our webadmin with any comments or corrections. Logging and Cookies Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh |