The scientific method is founded on making and testing hypotheses
Evaluation is just another name for testing
Our hypotheses are often about differences
So, we often look at a number of trials
And ask whether the results are different or not
But what counts as different?
So we're going to be measuring things
And comparing (sets of) measurements
Each task we look at will have its own appropriate measurements
And thus each comparison will be in its own terms
But one issue will be present throughout: are the differences we find significant?
Here's a simple example
Answer, in fact: 'No'
Percentages can be the wrong basis for comparing outcomes if the populations are of different sizes
Significance testing can be complex to understand, but the basic idea is simple:
Standard deviation is essentially a measure of how representative the mean is
Definitions:
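(The standard textbook definitions, filled in here for reference, since the slide content itself is not in these notes)
For measurements x1, ..., xn:
  mean:  x̄ = (x1 + ... + xn) / n
  standard deviation:  s = sqrt( ((x1 − x̄)² + ... + (xn − x̄)²) / n )   (divide by n − 1 for the sample version)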
Different standard deviations mean the means are not equally representative, or reliable
In the blue case, some items are a long way from the mean
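A small sketch of this in code, with made-up numbers (my own illustration, not the slide's data): two sets of measurements with the same mean but very different spreads.

```python
import statistics

# two hypothetical sets of measurements with the same mean but different spreads
tight  = [49, 50, 50, 51, 50]
spread = [10, 90, 30, 70, 50]

for name, data in [("tight", tight), ("spread", spread)]:
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)   # population standard deviation
    print(f"{name:6s}: mean = {mean}, sd = {sd:.1f}")

# tight : mean = 50, sd ≈ 0.6   -> the mean is very representative
# spread: mean = 50, sd ≈ 28.3  -> the same mean hides a lot of variation
```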
If we look at the distribution of outcomes over many coin-toss trials, it looks like this:
That's a classic normal distribution
The peak is at 50% heads, but there are lots of other plausible outcomes.
It looks like most of the results are within 2 standard deviations.
This is true for any normal distribution: roughly 95% of outcomes fall within 2 standard deviations of the mean
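A sketch of that claim in code (my own simulation, not the graph on the slide): simulate many trials of 100 tosses and count how often a trial lands within 2 standard deviations of the mean.

```python
import random
import statistics

# simulate 10,000 trials of 100 coin tosses each, counting heads per trial
trials = [sum(random.random() < 0.5 for _ in range(100)) for _ in range(10_000)]

mean = statistics.mean(trials)
sd = statistics.pstdev(trials)
within = sum(abs(t - mean) <= 2 * sd for t in trials) / len(trials)

print(f"mean = {mean:.1f}, sd = {sd:.1f}")   # roughly 50 and 5
print(f"within 2 sd: {within:.3f}")          # roughly 0.95-0.96 (the binomial is discrete,
                                             # so it isn't exactly 95%)
```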
Consider my earlier claim that 13 heads is probably a sign of an unfair coin
The graph tells us that 13 is outside the 2 standard deviation boundary
So if the coin is fair, there's only about a 4% chance of seeing a result as extreme as 13 heads
"about 4%" doesn't seem a crisp way to report on an experiment
By convention, we say a result is significant if the chance of getting it by accident is 5% or less
So 2 standard deviations is not quite the right boundary: the exact 5% cutoff is 1.96 standard deviations
A result outside the 1.96 boundary will still come up about once in 20 trials, even if the coin is fair
So we say that a result outside that boundary is significant at p < .05
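Where the 4% and 5% figures come from, as a quick sketch (assumes scipy is available; not part of the original notes):

```python
from scipy.stats import norm

for k in (2.0, 1.96):
    outside = 2 * norm.sf(k)   # two-sided area beyond k standard deviations
    print(f"chance of landing more than {k} sd from the mean: {outside:.3f}")

# beyond 2.00 sd: about 0.046  ("about 4%")
# beyond 1.96 sd: 0.050        (the conventional p < .05 boundary)
```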
We can look at the hospital data again
Now they don't look so different
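A sketch of the same kind of comparison in code. The actual hospital figures aren't in these notes, so the counts below are purely illustrative; the point is that the test works on the raw counts, not the bare percentages.

```python
from scipy.stats import fisher_exact

# hypothetical: hospital A, 3 deaths in 20 operations (15%)
#               hospital B, 60 deaths in 600 operations (10%)
table = [[3, 17], [60, 540]]
_, p = fisher_exact(table)
print(f"p = {p:.2f}")   # well above .05: no significant difference

# The raw percentages (15% vs 10%) look different, but with only 20 cases
# at hospital A a gap that size could easily arise by chance.
```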
We looked at the Rensink et al. (1997) change blindness paper
One of the independent variables in this flicker-paradigm experiment was whether the changed item was of central interest (CI) or marginal interest (MI) in the scene
The authors compared the number of image alternations (in effect, the reaction time) participants needed to detect three types of change (presence/absence, colour, location) for CI versus MI items.
They report:
"...within each type of change, perception of MI changes took significantly longer than perception of CI changes (p < .001 for presence vs. absence; p < .05 for colour; p < .001 for location), even though MI changes were on average more than 20% larger in area."
Based on a standard alpha-level of 0.05, all three comparisons are statistically significant.
So what they're saying, for example with respect to the difference in detection times for colour changes, is that if there were really no difference, a gap that large would turn up by chance in fewer than 1 in 20 experiments
The authors then repeated the experiment without the blank "masking" slides between the original and changed images, to confirm that participants could physically see the changes.
Now it required an average of just less than 1 second to see the changes!
The authors report:
"A completely different pattern of results emerged...No significant differences were found between MIs and CIs for any type of change, and no significant differences were found between types of change (p > .3 for all comparisons)."
Here, the no-blank version of the experiment is represented as a horizontal line.
What they're saying here is that, without the mask, none of the observed differences is large enough to rule out chance: every comparison comes out at p > .3
Different kinds of measurements require different significance tests.
Broadly speaking, there are two classes of significance tests: parametric and non-parametric
Tests such as the t-test or z-test are the classic parametric tests: they assume the data follow a particular distribution (usually the normal)
For data that don't fit that assumption, for example many kinds of token frequency data, we can't use them; we need a non-parametric test instead (see the sketch below)
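A sketch contrasting the two classes of tests on made-up data (the numbers below are mine, not the course's):

```python
from scipy.stats import ttest_ind, mannwhitneyu

# hypothetical reaction times, roughly normal -> a parametric t-test is reasonable
a = [12.1, 11.8, 13.0, 12.5, 11.9, 12.7]
b = [13.4, 13.1, 14.0, 13.6, 12.9, 13.8]
print(ttest_ind(a, b))

# hypothetical token-frequency counts, heavily skewed -> prefer a
# non-parametric test such as Mann-Whitney U
x = [1, 0, 2, 0, 1, 57, 0, 3]
y = [4, 2, 6, 3, 5, 2, 90, 4]
print(mannwhitneyu(x, y))
```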
Another XKCD about significance. For this one you need the original, for the hovertext