FNLP Course Revision Guide
This page lists concepts you should be familiar with and questions you should be able to answer if you are thoroughly familiar with the material in the course. It is safe to assume that if you have a good grasp of everything listed here, you will do well on the exam. However, we cannot guarantee that only the topics mentioned here will appear on the exam.
Past papers and other materials
As noted in the final lecture (see the slides if you haven't), the course was taught by new lecturers last year, and both topics and emphasis were changed somewhat. So, although past papers from before last year can give you an idea of what we might ask, please don't overfit your revision to past papers. The final lecture's slides list which topics have changed.
It is strongly recommended that you work through the lab materials and make sure you understand the answers and reasons for them, as these will give you better intuitions about many of the concepts, models and formalisms covered in class.
Generative probabilistic models
We have discussed the following generative probabilistic models:
- N-Gram Language Model
- Simple noise model for spelling correction
- Hidden Markov Model
- Probabilistic Context-Free Grammar
- Naive Bayes classifier
For each of these, you should be able to
- describe the imagined process by which data is generated, and say what independence assumptions are made.
- write down the associated formula for the joint probability of latent and observed variables (or just the observed variables if there are no latent variables).
- compute the probability of (say) a tag-word sequence, parse tree, or whatever the model describes (assuming you know the model's parameters).
- for the models with latent variables, compute the most probable [tag sequence/parse tree/class] for a particular input, hand-simulating any algorithms that might be needed (again assuming you know the model parameters).
- explain how the model is trained.
- give examples of tasks the model could be applied to, and how it would be applied.
- say what the model can and cannot capture about natural language, ideally giving examples of its failure modes.
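As a concrete check on the second and third bullets, here is a sketch of hand-computing the joint probability of a tag-word sequence under a bigram HMM. All the probabilities below are invented toy values, not estimates from any real corpus:

```python
# Joint probability under a bigram HMM:
# P(tags, words) = product over i of P(t_i | t_{i-1}) * P(w_i | t_i).
# The transition and emission probabilities are made up for illustration.

TRANSITION = {("<s>", "DT"): 0.6, ("DT", "NN"): 0.8, ("NN", "VB"): 0.3}
EMISSION = {("DT", "the"): 0.4, ("NN", "dog"): 0.01, ("VB", "barks"): 0.005}

def hmm_joint_prob(tags, words, start="<s>"):
    """Multiply one transition and one emission probability per position."""
    p, prev = 1.0, start
    for t, w in zip(tags, words):
        p *= TRANSITION[(prev, t)] * EMISSION[(t, w)]
        prev = t
    return p

# 0.6*0.4 * 0.8*0.01 * 0.3*0.005 = 2.88e-06 (approximately, in floating point)
print(hmm_joint_prob(["DT", "NN", "VB"], ["the", "dog", "barks"]))
```

In the exam you would do exactly this multiplication by hand, so make sure you can identify which transition and which emission probability each position contributes.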
Logistic Regression/MaxEnt model
For this model, you should be able to
- understand the formula for computing the conditional probability of the hidden class given the observations/features, and be able to apply that formula if you are given an example problem with features and weights. You do not need to memorize the formula.
- give examples of tasks the model could be applied to, and how it would apply (e.g., what features might be useful).
- explain at a high level what training the model aims to achieve, and how it differs from training a generative model.
- explain the role of regularization.
- discuss the pros and cons of MaxEnt vs Naive Bayes.
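To make the first bullet concrete, here is a sketch of applying the MaxEnt formula, P(c | x) proportional to exp(w · f(x, c)), once you have already computed the dot product of weights and features for each class. The class names and scores are invented for illustration:

```python
import math

def maxent_probs(scores_per_class):
    """Normalize per-class dot products w.f(x,c) into a distribution:
    P(c | x) = exp(score_c) / sum over c' of exp(score_c')."""
    exps = {c: math.exp(s) for c, s in scores_per_class.items()}
    z = sum(exps.values())  # the normalization constant Z
    return {c: e / z for c, e in exps.items()}

# e.g., two classes whose weighted feature sums come out to 1.0 and -0.5
probs = maxent_probs({"pos": 1.0, "neg": -0.5})
print(probs)  # the two probabilities sum to 1
```

Note that only the *differences* between the scores matter: adding a constant to every score leaves the distribution unchanged, which is worth convincing yourself of from the formula.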
In addition to the equations for the generative models listed above, you should know the formulas for the following concepts, what they may be used for, and be able to apply them appropriately. Where relevant you should be able to discuss strengths and weaknesses of the associated method, and alternatives.
- Bayes' Rule (also: the definition of Conditional Probability, the law of Total Probability aka Sum Rule, and all other relevant formulas in the Basic Probability Theory reading)
- Noisy channel model
- Add-One / Add-Alpha Smoothing
- Interpolation (for smoothing)
- Dot product
- Cosine similarity
- Pointwise mutual information
- Precision and recall
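Several of these formulas are simple enough to check yourself against. The sketch below applies add-alpha smoothing, cosine similarity, pointwise mutual information, and precision/recall; all the counts and vectors are invented toy values:

```python
import math

def add_alpha(count, total, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate: (c + a) / (N + a*V).
    With alpha=1 this is add-one (Laplace) smoothing."""
    return (count + alpha) / (total + alpha * vocab_size)

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information: log2 of P(x,y) / (P(x) * P(y))."""
    return math.log2(p_xy / (p_x * p_y))

def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# An unseen word still gets probability mass after smoothing:
print(add_alpha(0, total=100, vocab_size=10))        # 1/110
# Parallel vectors have cosine 1, orthogonal ones 0:
print(cosine([1.0, 2.0], [2.0, 4.0]), cosine([1.0, 0.0], [0.0, 1.0]))
# Words co-occurring twice as often as chance predicts have PMI = 1 bit:
print(pmi(p_xy=0.1, p_x=0.2, p_y=0.25))
```

Being able to plug toy numbers into these and sanity-check the result (e.g., cosine is 1 for parallel vectors, PMI is 0 under independence) is exactly the kind of fluency the exam assumes.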
Algorithms and computational methods
For each of the following algorithms, you should be able to explain what it computes (its input and output) and what it is used for, and be able to hand-simulate it. How does the algorithm solve the efficiency problems that a more naive algorithm would face?
- Minimum string edit distance algorithm
- Viterbi Algorithm for Hidden Markov Models
- CKY parsing
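As a worked reference for the Viterbi algorithm, here is a minimal decoder for a bigram HMM of the kind you would hand-simulate on a trellis. The toy probabilities are invented for illustration ("dog" is made ambiguous between NN and VB so there is something to decide):

```python
# Viterbi: find the most probable tag sequence in O(n * |T|^2) time,
# rather than enumerating all |T|^n sequences, by keeping only the best
# path into each (position, tag) trellis cell.

TRANS = {  # P(tag | previous tag); "<s>" is the start-of-sentence state
    ("<s>", "DT"): 0.6, ("DT", "NN"): 0.8, ("DT", "VB"): 0.1,
    ("NN", "VB"): 0.3, ("NN", "NN"): 0.2, ("VB", "NN"): 0.3,
}
EMIT = {  # P(word | tag); "dog" is ambiguous between NN and VB
    ("DT", "the"): 0.4, ("NN", "dog"): 0.02, ("VB", "dog"): 0.005,
    ("VB", "barks"): 0.01, ("NN", "barks"): 0.001,
}

def viterbi(words, tags, trans, emit, start="<s>"):
    """Return the most probable tag sequence for `words` under the HMM."""
    # best[i][t] = (probability of best path ending in tag t, backpointer)
    best = [{t: (trans.get((start, t), 0.0) * emit.get((t, words[0]), 0.0),
                 None) for t in tags}]
    for i in range(1, len(words)):
        col = {}
        for t in tags:
            # Max over previous tags: extend the best path into tag t.
            col[t] = max((best[-1][tp][0] * trans.get((tp, t), 0.0)
                          * emit.get((t, words[i]), 0.0), tp) for tp in tags)
        best.append(col)
    # Trace backpointers from the highest-probability final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

print(viterbi(["the", "dog", "barks"], ["DT", "NN", "VB"], TRANS, EMIT))
```

Filling in the trellis column by column with these numbers, by hand, is good practice for the hand-simulation questions; the code is just a way to check your arithmetic.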
For each of the following methods, you should be able to explain what it computes (its input and output), what it is used for, and be able to describe how it works at a high level.
- Expectation-Maximization (in the context of spelling correction or POS tagging)
Additional Mathematical and Computational Concepts
- Dynamic programming: What characterizes the tasks this is applied to, and the way that DP solves them? What are examples of DP algorithms?
- Zipf's Law and sparse data: What is Zipf's law and what are its implications? What does "sparse data" refer to? Be able to discuss these with respect to specific tasks and models.
- Probability estimation and smoothing: What are the different methods for estimating probabilities from corpus data, what are the pros and cons of each, and what are their characteristic errors? Under what circumstances might you find simpler methods acceptable, or unacceptable? You should be familiar at a high level at least with:
- Maximum Likelihood Estimation
- Add-One / Add-Alpha Smoothing
Except as noted under "Formulas" above, you do not need to memorize the formulas, but you should understand the conceptual differences and motivation behind each method, and be able to use the formulas if they are given to you.
- Training, development, and test sets: How are these used and for what reason? Be able to explain their application to particular problems.
In addition, for the following concepts you should be able to explain each one, give one or two examples where appropriate, and be able to identify examples if given to you. You should be able to say what NLP tasks these are relevant to and why.
- Noisy channel model
- Supervised vs. unsupervised learning
- Search space in parsing: What is one searching for and what is one searching through?
- Breadth-first search, depth-first search, and differences between them
- Top-Down parsing vs. Bottom-Up parsing
- Well-formed Substring Tables
- Difference between recognition and parsing
- Pointwise mutual information
- Context vector
- Vector representation of words
- Vector-based and string-based similarity measures
- Alignment (for string edit distance or machine translation)
Linguistic and Representational Concepts
You should be able to explain each of these concepts, give one or two examples where appropriate, and be able to identify examples if given to you. You should be able to say what NLP tasks these are relevant to and why.
- Ambiguity (of many varieties, wrt all tasks we've discussed)
- Open-class Words, Closed-class Words
- Context-Free Grammar
- Terminal and non-terminal (phrasal) categories
- Dependency grammar
- The distributional hypothesis
- Projective vs. nonprojective dependency parse
- Head Words (in syntax)
- Word Senses and relations between them (synonym, hypernym, hyponym, similarity)
- Semantic Roles
Also, you should be able to give an analysis of a phrase or sentence using the following formalisms. Assume that the example will be very simple and/or that some grammar or set of labels is provided for you to use (i.e., you should know some standard categories for English, but you don't need to memorize the details of specific tagsets).
- label parts of speech
- parse using context-free grammar
- parse using dependency grammar
- label semantic roles
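For the context-free grammar case, it helps to connect the analysis back to the CKY algorithm listed earlier. Here is a toy CKY recognizer over an invented CNF grammar; filling in the same chart by hand for a short sentence is a good exam exercise:

```python
from itertools import product

# Toy CNF grammar and lexicon, invented for illustration.
GRAMMAR = {  # (B, C) -> set of parents A, for binary rules A -> B C
    ("NP", "VP"): {"S"},
    ("DT", "NN"): {"NP"},
    ("VB", "NP"): {"VP"},
}
LEXICON = {  # word -> set of preterminal categories
    "the": {"DT"}, "dog": {"NN"}, "cat": {"NN"}, "chased": {"VB"},
}

def cky_recognize(words, start="S"):
    """Return True iff `words` is derivable from `start` (recognition,
    not parsing: the chart stores categories, not trees)."""
    n = len(words)
    # chart[i][j] = set of categories that span words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(LEXICON.get(w, set()))
    for span in range(2, n + 1):          # shorter spans before longer ones
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):     # every split point of the span
                for b, c in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= GRAMMAR.get((b, c), set())
    return start in chart[0][n]

print(cky_recognize("the dog chased the cat".split()))  # True
```

This is a well-formed substring table in action: each chart cell is computed once and reused by every larger span, which is what makes CKY polynomial rather than exponential. Storing backpointers alongside the categories would turn the recognizer into a parser.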
Tasks
You should be able to explain each of these tasks, give one or two examples where appropriate, and discuss cases of ambiguity or what makes the task difficult. In most cases you should be able to say what algorithm(s) or general method(s) can be used to solve the task, and what evaluation method(s) are typically used.
- Spelling correction
- Language modelling
- Text categorization
- Syntactic parsing
- Word sense disambiguation/supersense tagging
- Semantic role labelling
- Question answering
- Sentiment analysis
Corpora, Resources, and Evaluation
You should be able to describe what linguistic information is captured in each of the following resources, and how it might be used in an NLP system.
- Penn Treebank
For each of the following evaluation measures, you should be able to explain what it measures, what tasks it would be appropriate for, and why.
- Precision and recall (for parsing and for other tasks)
- Intrinsic vs. extrinsic evaluation: be able to explain the difference and give examples of each for particular tasks.
- Inter-annotator agreement: what is it and what is it used for?
- Gold standard: what is it and what is it used for?
- Differences between corpora/domains/genres: what are some of the differences you might see, what are some ways you could start to analyze them, and what issues might they cause in running an NLP system on a new corpus?