The scientific method is founded on making and testing hypotheses
Evaluation is just another name for testing
Sometimes our hypotheses are about existing linguistic objects:
Not the details of how to evaluate a particular system
But the concepts, methods and materials which are drawn on to do this
Sometimes they are about the output of our systems:
Sometimes they are about human beings:
Our hypotheses are often about differences
We typically look at a number of trials (repetitions) in each condition
And ask whether the resulting distributions (population of values) are different or not
But what counts as different?
In many cases we have a record of 'the truth'
That is, the best human judgement as to what the correct segmentation/tag/parse/reading is, or what the right documents are in response to a query.
Gold standards can be used both for training and for evaluation
But reliable testing must be done on unseen data:
Don't use your training data for (reportable) testing!
Crucially, evaluation isn't just for public review:
[Read section 4.8 of J&M (3rd edition; 5.7 in 2nd edition) for a good review of all this]
Always using the same, say, 75% for training, 15% for development testing and 10% for 'real' testing isn't the best possible use of your data
The answer is cross-validation (also sometimes referred to as jackknife)
If tuning the system via cross-validation, still important to have a separate blind test set
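As an illustration of this splitting discipline, here is a minimal sketch using scikit-learn; the data, the classifier, the 5 folds and the 10% blind split are placeholder assumptions, not details from these notes:

# Sketch: 5-fold cross-validation for tuning, plus a separate blind test set
from sklearn.model_selection import KFold, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import numpy as np

X = np.random.randint(0, 5, (1000, 20))   # made-up feature vectors
y = np.random.randint(0, 2, 1000)         # made-up binary labels

# Set the blind test set aside first, and never touch it while tuning
X_dev, X_blind, y_dev, y_blind = train_test_split(X, y, test_size=0.1, random_state=0)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_dev):
    model = MultinomialNB().fit(X_dev[train_idx], y_dev[train_idx])
    scores.append(accuracy_score(y_dev[test_idx], model.predict(X_dev[test_idx])))

print("mean cross-validation accuracy:", np.mean(scores))
# Only after all tuning is finished: evaluate once on X_blind / y_blind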
Gold standards can be very expensive
Because they involve lots of trained human annotators
For example, as near as I can estimate, the Penn treebank
Virtually all the large gold standards available today have been paid for by government agencies.
It is no longer always necessary to employ more-or-less experts as annotators
Amazon's Mechanical Turk is a social machine which has been widely used for the creation of gold standards for NLP tasks
Named after an 18th century fake chess-playing automaton
A typical task, or HIT (Human Intelligence Task)
Amazon acts as a marketplace, managing the connection between task owners and 'turkers'
I used the Mechanical Turk to evaluate the results of a so-called "semantic search engine" competition in 2010.
The competition task was to provide semantic-web-sourced descriptions of people and places
Query-result pairs packaged into batches of 12 HITs
Each HIT done by 3 workers (3 'assignments' per HIT)
2 minutes time allotted and $0.20 per HIT
Execution monitored online
64 turkers in total, 4 bad apples
Total cost of
So we're going to be measuring things
And comparing (distributions of) measurements
Each task we look at will have its own appropriate measurements
And thus each comparison will be in its own terms
But one issue will be present throughout: are the differences we find significant?
>>> from nltk.book import *
>>> # count ASCII characters across the nine NLTK book texts
>>> f = FreqDist(x.lower() for z in [text1,text2,text3,text4,text5,text6,text7,text8,text9]
...              for y in z for x in y if ord(x) < 128)
>>> f.items()[4:6]   # the 5th and 6th most frequent characters (NLTK 2 / Python 2 output)
[(u'n', 220713), (u'i', 218590)]
>>> f['n'] - f['i']
2123
In general, if we can only sample from a distribution
Over a period of 25 days, genders of newborns were tabulated at two hospitals
Is there something to worry about here?
Answer, in fact: 'No'
Percentages can be the wrong basis for comparing outcomes if the populations are of different sizes
Significance measurement can be complex to understand, but the basic idea is simple (for normal distributions):
Standard deviation is essentially a measure of how representative the mean is
Definitions:
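The standard definitions, for a set of values x₁ … xₙ:
mean x̄ = (1/n) Σᵢ xᵢ
standard deviation σ = √( (1/n) Σᵢ (xᵢ − x̄)² )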
Different standard deviation means different representativeness or reliability for the means
In the blue case, some items are a long way from the mean
If we look at the distribution of outcomes over many coin-toss trials, it looks like this:
That's a classic normal distribution
The peak is at 50% heads, but there are lots of other plausible outcomes.
It looks like most of the results are within 2 standard deviations.
This is true for any true normal distribution: about 95% of the values lie within 2 standard deviations of the mean
Consider my earlier claim that 13 heads is probably a sign of an unfair coin
The graph tells us that 13 is outside the 2 standard deviation boundary
So a fair coin would give 13 heads (a result that far from the mean) only about 4% of the time
"about 4%" doesn't seem a crisp way to report on an experiment
By convention, we say a result is significant if the chance is 5% or less
So 2 standard deviations is not quite the right boundary: the 5% cutoff falls at 1.96 standard deviations
A result outside the 1.96 boundary will come up about once in 20 experiments, even if the coin is fair
So we say that a result outside that boundary is significant: "p < .05"
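As a quick sanity check on those figures (not part of the original notes), the tail mass of a normal distribution can be computed directly with scipy:

# Probability mass outside k standard deviations of a normal distribution
from scipy.stats import norm

for k in (1.96, 2.0):
    outside = 2 * norm.sf(k)        # two-tailed: both ends of the curve
    print("outside %.2f sd: %.3f" % (k, outside))

# outside 1.96 sd: 0.050  -> the conventional p < .05 boundary
# outside 2.00 sd: 0.046  -> the 'about 4%' chance for a fair coin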
We can look at the hospital data again
Now they don't look so different
Here's a tabulation of the top 6 character counts from the Project Gutenberg Sense and Sensibility (approx. 500,000 characters):
Is this ranking correct?
We can do an empirical version of the "p < .05" test
By looking at 20 samples of characters from a larger corpus (Reuters newswire)
So we can be pretty sure that in the underlying distribution 't' really is more frequent than 'a', but we really don't have a big enough sample to be sure about 'n' versus 'i'.
It's also worth noting that the distribution for Austen and for the Reuters data is probably not the same. . .
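Here is a sketch of how that empirical check might be run with NLTK's Reuters corpus; the chunking into 20 equal samples and the use of raw counts are my assumptions rather than the exact original procedure:

# In how many of 20 Reuters samples does each character ranking hold?
from collections import Counter
from nltk.corpus import reuters    # may require nltk.download('reuters') first

chars = [c.lower() for w in reuters.words() for c in w if ord(c) < 128]

n_samples = 20
size = len(chars) // n_samples
samples = [Counter(chars[i * size:(i + 1) * size]) for i in range(n_samples)]

for hi, lo in [('t', 'a'), ('n', 'i')]:
    wins = sum(s[hi] > s[lo] for s in samples)
    print("'%s' > '%s' in %d of %d samples" % (hi, lo, wins, n_samples))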
What's wrong with this statement:
There was no change in the control group's average blood pressure by the end of the trial. The intervention group showed a small improvement, but it was not statistically significant (p > .5)
The use of the word 'improvement' in the second sentence implies a particular direction of the change in average over the trial.
The temptation is to say something like this
Our theory predicted a speed-up in response time. Although the measured change was not significant, it was in the right direction.
Don't do that!
Different kinds of measurements require different significance tests.
Broadly speaking, there are two classes of significance tests:
Measures such as the t-test or z-test are the classic parametric measures of significance.
But for data that don't meet their assumptions, for example many kinds of token frequency data, we can't use them, and non-parametric tests are needed instead
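For illustration only (neither the data nor the specific tests come from these notes), here is how the two families look in scipy, run on the same pair of made-up samples:

# Parametric vs non-parametric: same two samples, two different tests
from scipy.stats import ttest_ind, mannwhitneyu

system_a = [0.71, 0.69, 0.74, 0.70, 0.72, 0.68, 0.73, 0.71]   # invented scores
system_b = [0.66, 0.70, 0.65, 0.69, 0.67, 0.64, 0.68, 0.66]

t_stat, t_p = ttest_ind(system_a, system_b)      # parametric: assumes roughly normal data
u_stat, u_p = mannwhitneyu(system_a, system_b)   # non-parametric: rank-based, no such assumption

print("t-test:       p = %.4f" % t_p)
print("Mann-Whitney: p = %.4f" % u_p)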
Slides 29–34 of Lecture 7 introduced these two measures of classification success
Thinking still about binary classification, we can use a contingency table to help understand these
Figure 4.4 from https://web.stanford.edu/~jurafsky/slp3/4.pdf
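To make the table concrete, here is a tiny sketch of the standard definitions; the counts are invented for illustration:

# Precision, recall, F1 and accuracy from a binary contingency table
tp, fp, fn, tn = 45, 5, 15, 935   # invented counts

precision = tp / (tp + fp)        # of the items labelled positive, how many were right?
recall    = tp / (tp + fn)        # of the truly positive items, how many were found?
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (tp + tn) / (tp + fp + fn + tn)

print("precision %.2f  recall %.2f  F1 %.2f  accuracy %.2f"
      % (precision, recall, f1, accuracy))
# Note how accuracy looks excellent simply because true negatives dominate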
Why do we need all these measures?
Here's a real contingency table, in this case better named a confusion matrix:
A B C D E F G H ...
A 168 1 0 2 5 5 1 3 ...
B 0 136 1 0 3 2 0 4 ...
C 1 6 111 5 11 6 36 5 ...
D 1 17 4 157 6 11 0 5 ...
E 2 10 0 1 98 27 1 5 ...
F 1 0 0 1 9 73 0 6 ...
G 1 3 32 1 5 3 127 3 ...
H 2 0 0 0 3 3 0 4 ...
... ... ... ... ... ... ... ... ... ...
We can still extract single-class precision and recall:
Figure 4.5 from https://web.stanford.edu/~jurafsky/slp3/4.pdf
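Here is a sketch of how those single-class figures fall out of such a matrix, assuming rows are the gold class and columns the system's label (swap the two sums if the matrix above uses the opposite convention):

# Per-class precision and recall from a multi-class confusion matrix
import numpy as np

cm = np.array([[168,   1,   0],   # a 3x3 excerpt of the matrix above, for illustration
               [  0, 136,   1],
               [  1,   6, 111]])

for i, label in enumerate("ABC"):
    precision = cm[i, i] / cm[:, i].sum()   # column sum: everything the system called this label
    recall    = cm[i, i] / cm[i, :].sum()   # row sum: everything that really was this label
    print("%s: precision %.2f  recall %.2f" % (label, precision, recall))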
Like so many other things we've looked at
Ever boarded an airplane, and the captain announces there's a technical problem? Ever wondered what's behind that?
Now consider hiring trainee pilots
Another XKCD about significance. For this one you need the original, for the hovertext
The truth about significance tests is that p-values measure laboratory budgets -- Ron Kaplan