R is a powerful system for statistical computation. It has good graphing facilities, and a set of contributed packages that implement a wide variety of statistical methods.
If you have not seen R before, you should spend a few minutes going through a tutorial on R to get oriented. One good R tutorial is here; you don't need to go through the whole thing, and sections 1-8 are probably the most relevant to this lab.
As this lab is optional and is not marked, you may use any materials that you wish to help you. Here are a few that might help:
Rseek: For searching for documentation about "R". Useful if your search engine does not recognize "R" as a token.
Amusing: R Inferno. Abstract: "If you are using R and you think you're in hell, this is a map for you." Circle 8 about common mistakes is useful.
Also, here are a few useful functions to get you oriented:
In this lab, we will plot some time series data. The data are polls taken about a recent political issue (the Scottish independence referendum that was held in 2014), obtained from the What Scotland Thinks web site. You can download the data from here.
Use the read.csv function to read the data from a file. (If the data were separated by spaces or tabs instead of commas, you would use read.table instead.) Save the results in a variable called data. This will load the data into a data structure called a data frame. There are many things that you can do with this data frame:
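As a rough sketch of what reading and inspecting a data frame looks like, the example below writes a tiny CSV to a temporary file and reads it back. The column names (Pollster, Begin, Yes, No) and all values are made up for illustration; the real file from What Scotland Thinks may be laid out differently.

```r
# Toy stand-in for the polling file: hypothetical columns, made-up numbers.
csv_text <- "Pollster,Begin,Yes,No
A,2014-01-21,37,44
B,2014-02-24,35,53
C,2014-03-06,39,48"

path <- tempfile(fileext = ".csv")
writeLines(csv_text, path)

# stringsAsFactors = TRUE turns text columns into factors
# (this was the default before R 4.0.0).
data <- read.csv(path, stringsAsFactors = TRUE)

dim(data)          # number of rows and columns
names(data)        # column names
head(data, 2)      # first two rows
summary(data$Yes)  # summary statistics for one column
data[data$Yes > 36, ]  # rows selected by a logical condition
```

Columns are accessed with `$`, and rows can be selected with a logical condition inside `[ , ]`.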
You can get simple scatterplots by using the plot function, for example, try the command x = 1:10; y = 3*x + rnorm(10); plot(x,y).
If you want to customize plots for publications, R provides a bewildering number of options. If you want to change the axis labels, margins, fonts, tick labels, etc., there are three places that you need to look. First, the optional arguments to the plot function. To find out about these, check the documentation for the methods of the plot function itself, especially plot.default. Second, the par command sets many global graphics parameters, so check its documentation. Finally, if you are writing a plot to a file, you call the pdf (or postscript, etc.) function before you start plotting. Those functions have important optional arguments as well. There is a good table listing these in the Graphics chapter of R for Beginners.
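The three places described above can be seen together in one small sketch. The output file name, the sizes, and the particular parameters chosen here are illustrative assumptions, not recommended settings.

```r
set.seed(1)                       # make the random noise reproducible
x <- 1:10
y <- 3 * x + rnorm(10)

out <- file.path(tempdir(), "scatter.pdf")  # hypothetical output file
pdf(out, width = 6, height = 4)   # 3) open the file device first; size in inches

old <- par(mar = c(4, 4, 2, 1),   # 2) global settings: margins (in lines of text)
           cex.axis = 0.8)        #    and tick-label size

plot(x, y,                        # 1) optional arguments to plot itself
     xlab = "x",
     ylab = "y = 3x + noise",
     main = "Customised scatterplot",
     pch = 19)                    # solid circles

par(old)                          # restore the previous settings
dev.off()                         # close the device so the file is written
```

The call to par comes after pdf because par changes the settings of the currently active device.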
plot is a polymorphic function that does different clever things depending on the types of its arguments. For example, a factor object is an R structure for representing categorical variables. Many R functions, such as plotting and modelling functions, know how to do special things with factors, e.g., to split the data by each "level" of the factor before doing analysis or plotting. R has already created some factor objects in your data frame based on the fields that it thought were character strings. (This is the default before R 4.0.0; in later versions, pass stringsAsFactors = TRUE to read.csv, or convert a column yourself with factor().) To see one, look at data$Pollster. If you ever get confused about what type any R object has, you can get a compact summary by doing str(data$Pollster).
But back to plot. See what happens when you try to plot the factor data$Pollster.
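For a feel of how factors behave outside the data frame, here is a minimal sketch using made-up pollster labels in place of data$Pollster.

```r
# Made-up labels standing in for data$Pollster.
pollster <- factor(c("A", "B", "A", "C", "B", "A"))

str(pollster)      # compact summary: a factor with its levels and integer codes
levels(pollster)   # the distinct categories
table(pollster)    # number of observations per level
plot(pollster)     # plot() dispatches on the factor type
```

Internally a factor stores an integer code per observation plus the vector of level names, which is why as.integer(pollster) returns small integers.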
If we thought that the "yes" vote was increasing linearly with time, we could perform a linear regression of the yes vote onto the date of the poll. R represents dates internally as the number of days since January 1, 1970 (with that date counting as 0) [1], which gives us a numeric input to the regression. Strictly speaking, this model is nonsense (why?), but it is good practice and could be useful for interpolation. Fit a linear model using the lm function and save the result in a variable called mdl. The result will be an R object of class lm. This object has many fields you can inspect, such as mdl$coefficients and mdl$fitted.values. In fact, using these two you can double-check that the lm function is doing what you expect, that is, that the reported fitted values actually match the reported linear function of the inputs.
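The double-check described above might look like the following sketch. The dates and percentages are made-up toy values, not the real polls, and the variable names dates, yes and days are assumptions chosen for the example.

```r
as.numeric(as.Date("1970-01-01"))   # 0: the origin of R's day count

# Made-up toy data, not the real polls.
dates <- as.Date(c("2014-01-15", "2014-03-01", "2014-05-10",
                   "2014-07-20", "2014-09-01"))
yes   <- c(33, 35, 34, 38, 41)

days <- as.numeric(dates)           # numeric regressor: days since 1970-01-01
mdl  <- lm(yes ~ days)              # regress the yes vote on the date

class(mdl)                          # "lm"
b <- mdl$coefficients               # intercept first, then slope
by_hand <- b[1] + b[2] * days       # evaluate the fitted line manually

# Compare with the values lm reports, ignoring names and rounding error.
all.equal(unname(by_hand), unname(mdl$fitted.values))
```

all.equal is used rather than == because the two computations can differ by floating-point rounding.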
lines(c(dates[1], dates[25], dates[25]), c(35, 35, 40), col="RED")
You should have enough tools now to be able to use the lines function to add the regression line to the plot.
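One possible way to put these pieces together is sketched below, again on made-up toy values rather than the real polls. lines joins the points it is given in order, so passing the fitted values against the dates draws the fitted line over an existing scatter plot.

```r
# Made-up toy data, not the real polls.
dates <- as.Date(c("2014-01-15", "2014-03-01", "2014-05-10",
                   "2014-07-20", "2014-09-01"))
yes   <- c(33, 35, 34, 38, 41)

mdl <- lm(yes ~ as.numeric(dates))     # same kind of model as in the text

plot(dates, yes, xlab = "Poll date", ylab = "Yes (%)")  # scatter plot first
lines(dates, mdl$fitted.values, col = "red")            # then overlay the fit
```

lines only adds to the current plot, so it must be called after plot.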
Let's try adding a single colour to a plot. For example, many R plotting functions will understand a string like "BLUE" as a colour name. For another way of specifying colours in R, look at the output of the command rainbow(10). Redo the scatter plot, but change all the points to a different colour (choose whichever one you like).
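A small sketch of the two ways of naming a colour mentioned above, using arbitrary toy points; the choice of the seventh rainbow colour is just an example.

```r
cols <- rainbow(10)   # ten colours spaced around the colour wheel
cols                  # each one is a hex string beginning with "#"

x <- 1:10
plot(x, x^2, col = "blue", pch = 19)   # a colour name applied to every point
plot(x, x^2, col = cols[7], pch = 19)  # a single hex colour picked from rainbow()
```

Because rainbow returns an ordinary character vector, one of its entries can be used anywhere a colour name is accepted.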
R likes to vectorize operations, i.e., extending operations on scalars to work on sequences and vectors. For example, see what happens when you do
c(1,2) + c(2,4,6,8)
That's a fun example, but someday this may cause you horrible problems. For now, however, we'll notice that the sequence lookup operator [] is also vectorized. So you can do things like data$Begin[c(1,3,3,5)]. Check that this command does what you would expect, comparing it to data[1:5,].
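The recycling rule and vectorized indexing can both be seen on plain numeric vectors. The vector v below is a made-up stand-in for a column such as data$Begin.

```r
# The shorter vector is recycled to the length of the longer one:
# (1+2, 2+4, 1+6, 2+8)
c(1, 2) + c(2, 4, 6, 8)   # 3 6 7 10

# Indexing with a vector of positions; positions may repeat.
v <- c(10, 20, 30, 40, 50)
v[c(1, 3, 3, 5)]          # 10 30 30 50
```

When the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is one way the problems mentioned above tend to surface.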
Combining what we've done before, plot the date versus the yes vote again, but colour the points by the pollster. For more help, see the footnote [2].
For keeners: If you want to use a different pretty set of colours, see related functions to rainbow in the help. Or you can choose your own colours, using a site like Color Brewer. If you'd like to show off a bit, change the plotting character (pch=) argument as well, so that each pollster gets both its own colour and its own plotting character. (This is good if your graph might be viewed either in colour or black and white.) Finally, if you really want to show off, add a legend.
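One possible approach, sketched on made-up toy data, combines the factor codes with vectorized indexing: the integer code of each pollster picks that pollster's colour and plotting character. The data frame polls and its columns are hypothetical stand-ins for the real data.

```r
# Made-up toy data, not the real polls.
polls <- data.frame(
  Pollster = factor(c("A", "B", "C", "A", "B", "C")),
  Begin    = as.Date(c("2014-01-15", "2014-02-10", "2014-03-05",
                       "2014-05-20", "2014-07-01", "2014-09-01")),
  Yes      = c(33, 36, 35, 37, 40, 42)
)

k    <- nlevels(polls$Pollster)    # number of distinct pollsters
cols <- rainbow(k)                 # one colour per pollster
idx  <- as.integer(polls$Pollster) # level code (1..k) for each poll

plot(polls$Begin, polls$Yes,
     col = cols[idx],              # look up each point's colour by its code
     pch = idx,                    # and use the code as plotting character too
     xlab = "Poll date", ylab = "Yes (%)")

legend("topleft",
       legend = levels(polls$Pollster),  # level names in code order
       col = cols, pch = seq_len(k))
```

The legend lists the levels in the same order as their integer codes, so its colours and symbols line up with the points.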
Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh.