---
title: "Computational Cognitive Science 2019-2020"
output:
  pdf_document: default
  word_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Tutorial 3: Bayesian Estimation, Concept Learning

## Question 1: Exercise in Bayesian Estimation

In an experiment on face recognition, subjects are presented with images of people they know and asked to identify them. The images are presented for a very short period of time, so subjects may not have time to see the details of the entire face, but are likely to get a general impression of things like hair color and style, overall shape, skin color, etc. In this question we will consider how to formulate the face recognition problem as a probabilistic inference model.

1. What is the hypothesis space in this problem? Is it continuous or discrete? Finite or infinite?

**Solution**: There are a few ways to define a hypothesis space here, but one sensible approach is to say it is the set of different people that the subject knows: a finite (though very large) discrete space.

2. What constitutes the observed data $y$ and what kinds of values can it take on?

**Solution**: The data is the image the subject sees, or more precisely, what the subject actually perceives. It is difficult to say precisely what that might be without knowing more about the low-level features used by the visual system, but it could be things like the color and intensity in different regions of the image. In this case the values of the observed data would be continuous. However, if we were actually going to model this problem, we might want to simplify by assuming that higher-level discrete features are directly observed, e.g. face shape, hair style, skin color. Note that not all of these higher-level features would necessarily be observed on every trial. (We could also make an intermediate assumption, using a discretized space of color/intensity features, or perhaps intermediate-level features like edges.)

3. Write down an equation that expresses the inference problem that the subjects must solve to identify each face. Describe what each term in the equation represents.

**Solution**:
$$P(H|y) = \frac{P(y|H)P(H)}{P(y)}$$
where $P(H)$ is the subject's prior belief that any particular person will appear in the photo, $P(H|y)$ is the subject's posterior belief that any particular person is in the image given what the subject perceives, $P(y|H)$ (the likelihood) is the probability that a particular set of features will be perceived given that a particular person is shown in the image, and $P(y)$ is the overall probability of perceiving a particular set of features.

4. What factors might influence the prior in this situation?

**Solution**: The prior could be influenced by factors such as the frequency or recency with which the subject has seen each of the people outside of the experimental situation, and the frequency with which a particular person's face appears within the experimental situation (if images are reused). One could imagine other possible factors, such as the emotional closeness of the subject to the person, the frequency with which the subject has seen photographs of the person (as opposed to the live person), or even the subject's beliefs about the experimenter; the subject would presumably be surprised if the experimenter presented them with an image of their childhood friend.
5. Suppose one group of subjects sees clear images, such as the one on the left below, and another group sees noisy images, such as the one on the right below. Which term(s) in your equation will be different for the noisy group compared to the clear group?

![Eric Idle, clear](../figures/eric_idle.jpg){#id .class width=25% height=25%} ![Eric Idle, noise](../figures/eric_idle_noise.jpg){#id .class width=25% height=25%}

**Solution**: The prior will be the same. The likelihood will be different, since the manipulation changes how the images look -- there will be a different distribution over observed features given the same person being shown. $P(y)$ will be different, since there is a different overall distribution over what the images look like. And since $P(y)$ and $P(y|H)$ are different, the posterior will also be different.

6. What does the model predict about subjects' performance with noisy images compared to clear images? Rather than working with the full scenario above, you can simplify by supposing the experiment has images of only two people, $h_1 =$ Eric Idle and $h_2 =$ John Cleese. How does image noisiness affect the model's probability of inferring that the image is of Eric Idle, rather than John Cleese? (Hint: since you are considering only two hypotheses, think about the *posterior odds*. How will this quantity differ between the case where the image is noisy and the case where it is clear?) Do you need to make any further assumptions in order to make predictions about subject behavior?

**Solution**: It helps if we are more specific about the changes we expect in the likelihood. Consider the distribution $P(y|h)$ for a specific hypothesis $h$ (say, Eric Idle). If the images are clear, then we would expect relatively little variation in the features we perceive when shown his image. That is, a relatively small number of possible values for $y$ will have high probability, and other possible values will have low probability. However, if the images are noisy, this effectively spreads out the probability mass over a larger number of possible values for $y$: we are more likely to see features that are further from those in the original image, and less likely to see features that are exactly those in the original image.

The above description assumes discrete values for $y$, but we can also get an intuition for what's going on by imagining that the different possible values for $y$ are continuous values along a 1-dimensional space, and then plotting $P(y|h)$ against $y$:

\centerline{\includegraphics[width=0.5\textwidth]{../figures/normals-crop}}

The mean of these curves represents the "average" data that would be seen when Eric's picture is shown. The curves for the noisy and clean cases have the same mean, but the distribution of observations in the noisy case is broader than that in the clean case. [I'm assuming that the noise itself is unbiased; otherwise the mean could change as well, but this would needlessly complicate the analysis.]

The model predicts that subjects will have a harder time discriminating Idle from Cleese in the noisy scenario. To see why, consider the posterior odds in each scenario. For a particular observation $y$, the posterior odds can be written as
\[\frac{P(H=h_1|y)}{P(H=h_2|y)} = \frac{P(y|H=h_1) P(H=h_1)}{P(y|H=h_2) P(H=h_2)} \]
and represent how much more favored $h_1$ is than $h_2$ given $y$. What can we say about the posterior odds in the noisy and clean situations?

Well, first of all, any differences between the two situations will come from the *likelihood ratio* $P(y|H=h_1) / P(y|H=h_2)$, because the priors (and prior odds $P(H=h_1) / P(H=h_2)$) are the same in both cases. So let's focus on the likelihood ratio, using our analysis of the likelihood from the previous question part. According to our graphs, $P(y|H=h_1)$ and $P(y|H=h_2)$ will have similar shapes, except that their means will be different. In the clean case, when we see an image of Idle, the actual observation $y$ is very likely to be close to the mean of $P(y|H=h_1)$, and not close to the mean of $P(y|H=h_2)$. Therefore, because $P(y|H=h_1)$ and $P(y|H=h_2)$ are highly peaked, $P(y|H=h_1)$ will be very high while $P(y|H=h_2)$ will be very low, and the likelihood ratio will strongly favor $h_1$. In the noisy case, on the other hand, the observation $y'$ is less likely to be near the mean of $P(y|H=h_1)$, and in addition observations that are far from the mean of $P(y|H=h_2)$ have higher probability than in the clean case. So overall the likelihood ratio will not favor $h_1$ as strongly. Since the likelihood ratio favors $h_1$ more strongly in the clean case, so do the posterior odds.

If we assume that subjects are more likely to give the name of the person corresponding to the higher-probability hypothesis, then they would be more accurate in the clean scenario. If we wanted to make more quantitative predictions, we would need to be more specific about the relationship between probability and subjects' responses (e.g. whether their responses exactly match the posterior distribution, whether there is some non-linear relationship between the two, etc.).
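To make this argument concrete, here is a small numerical sketch of the two-hypothesis case. The specific numbers (the means, standard deviations, and observed value) are illustrative assumptions, not part of the tutorial: we treat the perceived features as a single continuous value $y$, give Idle and Cleese Gaussian likelihoods with different means, and model the noisy condition by widening both likelihoods.

```{r noise_posterior_odds}
# Illustrative sketch: all values below are assumed for demonstration purposes.
mu_idle <- 0    # "average" features when an image of Idle is shown
mu_cleese <- 2  # "average" features when an image of Cleese is shown
y_obs <- 0.2    # an observation generated from an image of Idle

# posterior odds for Idle vs Cleese; with equal priors the prior odds are 1,
# so the posterior odds equal the likelihood ratio
posterior_odds <- function(y, sd) {
  dnorm(y, mean = mu_idle, sd = sd) / dnorm(y, mean = mu_cleese, sd = sd)
}

posterior_odds(y_obs, sd = 0.5)  # clear images: narrow likelihoods, odds strongly favor Idle
posterior_odds(y_obs, sd = 2)    # noisy images: broad likelihoods, odds only weakly favor Idle
```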
## Question 2: A Bayesian Model of Concept Learning

In this question we will consider a simplified version of the Tenenbaum (2000) model of generalization in which there are only four hypotheses under consideration, each of which has equal prior probability:

> $h_1 =$\{odd numbers\}, $h_2 =$\{even numbers\}, $h_3 =$\{multiples of 5\}, $h_4 =$\{multiples of 10\}

**Exercise**: Given the set of examples X = \{10, 40\} from the target concept, compute the posterior probability of each hypothesis under the model. The code block below has partly set up the problem and indicates where you should add your own code. Feel free to use more steps than shown below, e.g. defining intermediate variables or helper functions.

```{r X_setup}
# identify the set of numbers, out of 100, defined by each hypothesis
h_set <- list(odd = seq(1, 99, by = 2),
              even = seq(2, 100, by = 2),
              x5 = seq(5, 100, by = 5),
              x10 = seq(10, 100, by = 10))

# our example set:
X_set <- c(10, 40)
```

```{r X_calculate_posterior}
# prior:
p_h <- rep(1/length(h_set), length(h_set))  # equal probability for each hypothesis

# likelihood function:
likelihood <- function(X_set, h) if (all(X_set %in% h)) 1/(length(h)^length(X_set)) else 0  # Eq. 1

# calculate likelihood of the examples under each hypothesis:
p_X_given_h <- sapply(h_set, likelihood, X_set = X_set)

# calculate posterior:
# first get the unnormalized posterior probability for each hypothesis
p_h_given_X_unnorm <- p_X_given_h * p_h
# then normalize (necessary for a true probability): divide by p(X),
# i.e. the sum of p(X|h) * p(h) over all hypotheses
p_h_given_X <- p_h_given_X_unnorm / sum(p_h_given_X_unnorm)

print(p_h_given_X)
```

**Exercise**: Determine the model's predicted probability that each of the following new data points is also part of the same concept: 2, 3, 5, and 20.
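Following Tenenbaum (2000), the predicted probability that a new number $y$ also belongs to the concept $C$ is obtained by averaging over the hypotheses, weighted by their posterior probabilities:

$$P(y \in C | X) = \sum_{h} P(y \in C | h)\, P(h | X),$$

where $P(y \in C | h)$ is 1 if $y$ is in the extension of $h$ and 0 otherwise. The code below computes this sum in three equivalent ways.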
```{r y_generalization}
new_y <- c(2, 3, 5, 20)
#new_y <- 1:100  # to visualize entire range

# I give you *three* different ways to define gen_probs.

# version 1 (for loop version, not recommended):
# first combine the probabilities
pred_y_h <- function(y, h, p_h_x) if (y %in% unlist(h)) p_h_x else 0
# then loop over hypotheses + posterior probs
gen_probs <- function(y, all_h, all_p_h_x) {
  p_gen <- 0
  for (i in 1:length(all_h)) p_gen <- p_gen + pred_y_h(y, all_h[i], all_p_h_x[i])
  p_gen
}
# ^ this function also doesn't handle names properly, leading to some confusing outputs

# version 2 (replace the for loop with vectorized function application):
# still using the pred_y_h function defined above, apply over hypotheses + posterior probs;
# mapply loops over aligned inputs, in this case the paired hypotheses all_h
# and posterior probs all_p_h_x
gen_probs <- function(y, all_h, all_p_h_x) sum(mapply(pred_y_h, all_h, all_p_h_x, MoreArgs = list(y = y)))

# version 3 (my preferred version):
# first check whether y is a member of C, for all C
y_in_C <- function(y, C_set) sapply(C_set, `%in%`, x = y)
# then use the boolean vector to index and sum posterior probabilities
gen_probs <- function(y, all_h, all_p_h_x) sum(all_p_h_x[y_in_C(y, all_h)])

# calculate p(y|X) for each new input y
p_y_given_X <- sapply(X = new_y,        # input to apply over
                      FUN = gen_probs,  # function to apply
                      all_h = h_set, all_p_h_x = p_h_given_X)  # additional named args

print(p_y_given_X)
```

```{r tenenbaum_plot, fig.height=2}
library(ggplot2)

ggplot() +
  geom_vline(data = data.frame(x = X_set), aes(xintercept = x)) +
  geom_point(data = data.frame(new_y = new_y, p_y = p_y_given_X), aes(x = new_y, y = p_y)) +
  ggtitle("p(y|X) for new y [points] given example set X [lines] under h1-h4") +
  theme_bw()
```

**Exercise**: Note the plot above. Go back to the first cell and try to:

- add 50 to `X_set` and run everything again until the plot. What happens?
- add 30 to `X_set` and do it again.
- assign a new example set like so: `X_set <- c(15, 45)`, and run until the plot again. What's changed?
- add more examples to `new_y`. You can look at isolated numbers, or visualize the whole range.

**Solution**:

```{r more_examples, fig.height=2}
library(ggplot2)

# wrap the calculation and plot in a function so we can rerun it for different example sets
newYandX <- function(new_X_set, new_y) {
  # assume the same hypothesis space and prior as before
  p_X_given_h <- sapply(h_set, likelihood, X_set = new_X_set)
  p_h_given_X_unnorm <- p_X_given_h * p_h
  p_h_given_X <- p_h_given_X_unnorm / sum(p_h_given_X_unnorm)
  p_y_given_X <- sapply(X = new_y,        # input to apply over
                        FUN = gen_probs,  # function to apply
                        all_h = h_set, all_p_h_x = p_h_given_X)  # additional named args
  ggplot() +
    geom_vline(data = data.frame(x = new_X_set), aes(xintercept = x)) +
    geom_point(data = data.frame(number = new_y, probability = p_y_given_X),
               aes(x = number, y = probability)) +
    ggtitle("p(y|X) for new y [points] given example set X [lines] under h1-h4") +
    theme_bw()
}

# add 50 to X_set and run everything again until the plot. What happens?
newYandX(c(10, 40, 50), new_y)
# add 30 as well
newYandX(c(10, 40, 50, 30), new_y)
# a new example set: 15 and 45
newYandX(c(15, 45), new_y)
```
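The plots show how the generalization pattern changes, but it can also help to inspect the posterior over hypotheses directly. The helper `posterior_for` below is not defined elsewhere in this tutorial; it simply repeats the posterior calculation from the first exercise, reusing `h_set`, `likelihood`, and `p_h`.

```{r posterior_check}
# posterior over the four hypotheses for a given example set
posterior_for <- function(X) {
  p_X_given_h <- sapply(h_set, likelihood, X_set = X)
  unnorm <- p_X_given_h * p_h
  unnorm / sum(unnorm)
}

rbind(`X = {10, 40}`     = posterior_for(c(10, 40)),
      `X = {10, 40, 50}` = posterior_for(c(10, 40, 50)),
      `X = {15, 45}`     = posterior_for(c(15, 45)))
```

Adding 50 (another multiple of 10) concentrates the posterior further on the smallest consistent hypothesis, multiples of 10, while the set \{15, 45\} rules out the even-number and multiples-of-10 hypotheses entirely, leaving most of the probability on multiples of 5.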
Figure 1b from Tenenbaum (2000), reproduced below, shows model predictions for the example sets listed there (e.g. `X = 16`, `X = 16 8 2 64`, etc.). (Note: an earlier version of this tutorial gave the second example set as 16, 8, 2, and 24; the set used in the figure is 16, 8, 2, 64.)

![Tenenbaum predictions](../figures/tenenbaum_2000_1b.png){#id .class}

**Exercise**: Run your code above using the example sets shown in Figure 1b (e.g. `X_set <- c(16)`). Do the plotted results look different from the figure? If so, why?

**Solution**: One big difference between Tenenbaum's model and the simplified version here is that we only consider hypotheses consistent with generation by a rule, so the model does not assign any probability to values that could not have been produced by any of our candidate hypotheses. By contrast, Tenenbaum's model reserves some probability for all consecutive intervals within 1-100 (e.g. 1-10, 1-20, 2-20, 5-15, 5-30, etc.). This means that, for example, in the case of `X_set <- c(16)`, Tenenbaum's model assigns some probability to values around 16, on the hypothesis that 16 is a point in some consecutive interval. This is impossible in our simple version, which does not consider interval hypotheses. This same lack of hypotheses is why one of the plots below is empty: no rule in our small hypothesis space contains all of 16, 23, 19, and 20, so every likelihood is zero and the posterior is undefined.

Tenenbaum's model also considers a wider range of rules, e.g. $h_5 =$\{powers of 2\}, which covers `X_set <- c(16, 8, 2, 64)`. Our model only considers a very small set of rules as hypotheses, and `X_set <- c(16)` is consistent with only one rule in our model, namely 'even numbers' --- so the model only needs to see 16 to become extremely confident that the examples must have been generated from this rule alone. These differences highlight how important it is to formulate a hypothesis space with sufficiently broad coverage --- a model that is highly confident about some outcome may simply be one in which plausible alternative hypotheses were never specified.

```{r tenenbaum_examples, fig.height=2}
# We'll use the whole range of examples now.
new_y <- 1:100

newYandX(c(16), new_y)
newYandX(c(16, 8, 2, 64), new_y)
newYandX(c(16, 23, 19, 20), new_y)
newYandX(c(60), new_y)
newYandX(c(60, 80, 10, 30), new_y)
```

**Exercise**: Explain, with reference to the terms in the model, why the model's predictions tend to become more 'rule-like' as more examples are seen.

**Solution**: The model's behavior becomes more 'rule-like' as more examples are seen because of the form of the likelihood term, $P(X | h_i) = \frac{1}{|h_i|^n}$, where $n$ is the number of examples seen. In particular, the model will increasingly favor the smallest hypothesis consistent with all of the examples seen --- the one that follows the same 'rule'. In cases where the different hypotheses have quite different sizes (which will be the case when they are the 'mathematical rule' type of hypothesis), the likelihood of larger hypotheses decreases much more with every additional example than the likelihood of smaller hypotheses (because of the $|h_i|$ in the denominator). Eventually the smallest consistent hypothesis dominates all others.

## References

Tenenbaum, Joshua B. (2000). "Rules and Similarity in Concept Learning." In *Advances in Neural Information Processing Systems*, 59–65.