---
title: "Computational Cognitive Science 2019-2020"
output:
  pdf_document: default
  word_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.width = 4, fig.height = 4)
```

# Tutorial 4: Model Comparison à la XKCD

Our data in this tutorial comes from a visual motion perception task (Karvelis et al., 2018): participants saw either moving dots or no stimuli for 3000 milliseconds, then reported the direction of motion, or an absence of stimuli if no dots were present. The independent variable `RISC` is a scale of schizotypal traits, measured on the healthy participants in the study. The dependent variable `false.detections` is a count of the number of trials on which participants falsely detected (i.e. hallucinated) stimuli --- that is, they reported that moving dots were present when no dots were shown on that trial. For more details, see the original paper.

Consider an initial sample of 62 randomly selected participants (perhaps the first batch of subjects tested). We'll load in their data for analysis.

```{r load_data}
data <- read.csv("sample1.csv")
```

Inspired by [Randall Munroe's XKCD](https://xkcd.com/2048/), we will compare the fit of various models to some real-world data. As before, we'll use ggplot to visualize.

```{r load_ggplot}
library(ggplot2)

base_plot <- ggplot(data) +       # pass data to ggplot
  aes(RISC, false.detections) +   # put RISC on the x axis and false.detections on the y axis
  geom_point() +                  # add scatterplot layer
  theme_minimal()                 # nicer-looking style

base_plot
```

If you would like to get the full XKCD feel, you can optionally install the [xkcd package](http://xkcd.r-forge.r-project.org/) for a bit more verisimilitude in plotting! If you'd rather not, no worries: all the later code will run just fine without it.

```{r install_xkcd, eval=TRUE}
# check for the xkcd package, install it if missing
xkcd_available <- require("xkcd")
if (!xkcd_available) {
  install.packages("xkcd", repos="http://cran.uk.r-project.org")
  xkcd_available <- require("xkcd")
}

# Note: xkcd_available will be FALSE if the installation didn't work,
# in which case we'll have to fall back to base_plot
if (xkcd_available) {
  # check for the xkcd font, install it if missing
  if (!('xkcd' %in% fonts())) {
    download.file("http://simonsoftware.se/other/xkcd.ttf", dest="xkcd.ttf", mode="wb")
    system("mkdir ~/.fonts")
    system("cp xkcd.ttf ~/.fonts")
    #system("cp xkcd.ttf /Library/Fonts") # uncomment if you're running on a Mac
    font_import(paths=c("~/.fonts", "/Library/Fonts"), pattern = "[X/x]kcd", prompt=FALSE)
    loadfonts()
  }

  # add xkcd layers to the plot we defined above
  base_plot <- base_plot +
    xkcdaxis(range(data$RISC), range(data$false.detections)) + # make the plot axes look xkcd-ish
    theme(text=element_text(family="xkcd", size=14))           # use the xkcd font
}

# Here it is again, in XKCD style
base_plot
```

## Nested models

First, we'll fit the classic linear model: $y = \beta_1 x + \beta_0$.

```{r lm_fit}
# for more details on R's linear model function:
# ?lm
lm_fit <- lm(false.detections ~ RISC, # formula: y ~ x
             data=data)
# an intercept term is assumed and doesn't need to be specified,
# so the actual formula is y ~ x + 1

summary(lm_fit)
```

Look over the summary output. What are the estimated values of the coefficients $\beta_0$ and $\beta_1$? What are the standard errors? Does this look like a reasonable fit to you?
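If you'd like to pull these numbers out of the fitted model programmatically rather than reading them off the summary, here is a quick optional sketch using standard base-R accessors (this chunk is an illustrative addition, not part of the original exercise):

```{r lm_coefs}
# estimated coefficients: beta_0 (intercept) and beta_1 (slope on RISC)
coef(lm_fit)

# full coefficient table: estimates, standard errors, t values, p values
summary(lm_fit)$coefficients

# 95% confidence intervals for the coefficients (confint defaults to level = 0.95)
confint(lm_fit)
```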
We can also define a helper function to plot the model's predictions.

```{r lm_plot}
plot_preds <- function(preds, title="", subtitle="") {
  (base_plot +
     geom_line(aes(y=preds), col="red") +
     ggtitle(title, subtitle))
}

plot_preds(fitted(lm_fit), # 'fitted' extracts the model's predictions for the training data
           "Linear",
           "Hey, I did a regression.")
```

As noted in the [original comic](https://xkcd.com/2048/), we can also fit just the intercept term, $\beta_0$, to get a linear model with no slope.

```{r intercept_fit, warning=FALSE}
intercept_fit <- lm(false.detections ~ 1, data=data)
summary(intercept_fit)

plot_preds(fitted(intercept_fit),
           "Linear, no slope",
           "I'm making a scatterplot but I don't want to.")
```

**Exercise**: Modify the model-fitting code above to fit a quadratic function, i.e. an order-2 polynomial: $y = \beta_2 x^2 + \beta_1 x + \beta_0$.

```{r quad_fit, eval=TRUE}
# a straightforward solution
# (note the I() wrapper, which stops ^ from being interpreted as formula syntax):
quad_fit <- lm(false.detections ~ I(RISC^2) + RISC, data=data)

# R also has a convenience function for polynomials up to order k:
quad_fit <- lm(false.detections ~ poly(RISC, 2), data=data)
summary(quad_fit)

plot_preds(fitted(quad_fit),
           "Quadratic",
           "I wanted a curved line,\nso I made one with math.")
```

As these are all nested models, we can compare them using the likelihood ratio test in R's `anova` function.

```{r anova}
anova(intercept_fit, lm_fit, quad_fit,
      test="LRT") # specify the likelihood ratio test
```

**Exercise**: Can you think of any reasons you might not want to use the likelihood ratio test? Are there other approaches you could use to compare nested models?

**Solution**: The likelihood ratio test can only be used on nested models, so it can't serve for broader model comparison, although that isn't an issue in this case. More generally, the LRT relies on null hypothesis significance testing (as shown in the anova result above: `Pr(>Chi)` is the p-value, here nonsignificant). This means the simplest model is favored by default, unless there is sufficient evidence to reject it in favor of the more complex model; however, we may wish to measure the strength of the evidence for *each* model given the data, rather than sticking with the simplest until proven otherwise. The Savage-Dickey density ratio is an alternative metric for Bayesian hypothesis testing with nested models, which avoids the null hypothesis assumption.
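To get a feel for what the likelihood ratio test is doing, here is an optional sketch of the intercept-only vs. linear comparison computed by hand (the chunk and variable names are illustrative additions; note that `anova.lm` estimates the residual variance separately, so its p-value won't match this one exactly, but the idea is the same):

```{r lrt_by_hand}
# likelihood ratio test statistic: twice the difference in log-likelihoods
lr_stat <- as.numeric(2 * (logLik(lm_fit) - logLik(intercept_fit)))

# degrees of freedom: difference in number of estimated parameters
df_diff <- attr(logLik(lm_fit), "df") - attr(logLik(intercept_fit), "df")

# p-value from the chi-squared distribution
pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
```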
## AIC & BIC

A more general approach, which does not require the models to be nested, is to use AIC or BIC.

```{r aic_bic}
models = list(intercept_fit,
              lm_fit,
              quad_fit)

model_AIC = AIC(intercept_fit,
                lm_fit,
                quad_fit)

cbind(model_AIC,
      BIC = sapply(models, BIC),
      log_lik = sapply(models, logLik))
```

**Exercise**: How do AIC and BIC estimate model complexity? What's the difference between the two? Are there cases when you can't use these metrics?

**Solution**: Both AIC and BIC use the number of parameters as a proxy for model complexity. They differ in how parameters are *penalized* --- AIC weights each parameter by a constant penalty factor of 2, while BIC penalizes each parameter by `log(N)`, where `N` is the number of observations. The result is that BIC penalizes model complexity more heavily, as we can see by comparing the model fits shown here. The quadratic model has the highest likelihood, and its AIC (500.6) is only slightly higher than that of the linear model (500.4); both are comfortably lower than the intercept-only model's AIC (504.3). However, the BIC of the quadratic model (509.1) is *higher* than both the intercept-only model's (508.5) and the linear model's (506.8), reflecting BIC's heavier penalty for additional parameters. If we removed the linear model from consideration, AIC and BIC would point us toward different model choices in this case.

AIC and BIC can compare non-nested models, but only models that are fit *to the same data*. One less obvious instance of this restriction involves *data transformations*: if you fit a linear model with a transformed response variable, such as `lm(log(false.detections) ~ RISC)`, an AIC/BIC comparison with the models above would not be valid.

## Held-out set

An *even better* approach to model comparison is to evaluate predictions directly --- since we're mainly interested in a model's ability to generalize to unseen data, evaluating on a sequestered set lets us look at prediction itself, without even needing to consider model complexity. Let's load in the data from the remaining 20 participants (perhaps the second batch tested), and see how well our models predict their behavior. For simplicity, we'll use mean squared error to evaluate performance here, but in most settings log predictive density would be preferred (Gelman et al., 2014).

```{r sample2}
sample2 <- read.csv("sample2.csv")
```

**Exercise**: Calculate the mean squared error (MSE) for each model's predictions on the new data. (Hint: you can use the function `predict`; for example, `preds <- predict(lm_fit, sample2)` gets the linear model's predictions for `sample2`, which you can then use to calculate MSE.) Which model has the lowest mean squared error on the new data?

**Solution**

```{r sample2_mse}
mse <- function(model, newdata=sample2) {
  preds <- predict(model, newdata)
  sqe <- (newdata[["false.detections"]] - preds)^2
  mean(sqe)
}

sapply(list(intercept_fit, lm_fit, quad_fit), mse)
```

It turns out that the intercept-only model has the lowest error on our held-out data!

## Cross-validation

We can use *cross-validation* to get a more robust comparison of different models' ability to generalize to unseen data: we take all the data available and systematically use different subsets of it to estimate test error.

```{r combine_data}
data <- rbind(data, sample2) # combine all our data
```

```{r cv_prep}
n_splits = 6 # number of splits to partition the data into

# assign each row to a split
# (this assumes the rows are randomly ordered)
splits = rep_len(1:n_splits, length.out=nrow(data))

# you can use `splits` to index your data frame
# for example, to test on split 1:
# training_data <- data[splits != 1,]
# test_data <- data[splits == 1,]
```
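Before tackling the full cross-validation loop in the exercise below, it may help to see what a single fold looks like. Here is a quick sketch that trains the linear model on everything except split 1 and evaluates it on split 1, reusing the `mse` helper defined above (this chunk is an illustrative addition):

```{r single_fold_sketch}
# train on all splits except split 1, test on split 1
train_1 <- data[splits != 1,]
test_1  <- data[splits == 1,]

fold_model <- lm(false.detections ~ RISC, data=train_1) # fit the linear model on the training portion
mse(fold_model, test_1)                                 # its MSE on the held-out split
```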
**Exercise**: Cross-validate the intercept, linear, and quadratic models using the splits defined above. Which model has the lowest average test error across splits? Do you get different results if you try different numbers of splits (e.g. by setting `n_splits` to another value)?

**Solution**

```{r cv_helpers}
# specify each model by its input formula
formulae <- paste0("false.detections ~ ", c("1", "RISC", "poly(RISC, 2)"))
formulae <- sapply(formulae, as.formula) # R can represent these as 'formula' objects

# function to obtain each model's MSE on a test split
get_test_error <- function(train, test, f_list=formulae) {
  sapply(f_list,            # for each formula (= type of model)
         function(f) {
           model <- lm(f, data=train)     # train the model
           test_error <- mse(model, test) # get its test error
           test_error
         })
}
```

```{r cv_eval}
# apply the function over splits: data not in the split = train, data in the split = test
test_error <- sapply(1:n_splits,
                     function(s) get_test_error(data[splits != s,],
                                                data[splits == s,]))

# average each model's test error over all splits
mean_test_error <- apply(test_error,
                         1,    # apply the function over rows
                         mean)

# which model has the lowest average error?
mean_test_error
```

With the 6 splits defined above, the linear model has the lowest average test error. The linear model comes out on top fairly consistently across different numbers of splits, although not always --- the intercept-only model sometimes has lower error, for example with `n_splits = 9` or `13`.

Setting the maximum possible number of splits, i.e. `n_splits = nrow(data)` or `n_splits = 82`, performs leave-one-out cross-validation, where each model is fit on all the data except one point and then evaluated by predicting that held-out point. In this case, the linear model again prevails, although the intercept-only model shows only slightly worse generalization error, suggesting that we wouldn't do too badly by just predicting the mean on future trials. The quadratic model is quite a bit worse than both and can clearly be eliminated from consideration.

## References

Karvelis, Povilas, Aaron R. Seitz, Stephen M. Lawrie, and Peggy Seriès (2018). "Autistic Traits, but Not Schizotypy, Predict Increased Weighting of Sensory Information in Bayesian Visual Integration." eLife 7. https://doi.org/10.7554/eLife.34115.

Gelman, Andrew, Jessica Hwang, and Aki Vehtari (2014). "Understanding Predictive Information Criteria for Bayesian Models." Statistics and Computing 24(6): 997–1016. https://doi.org/10.1007/s11222-013-9416-2.