NLG assignment 2


This assignment is due at 16.00 on Thursday 28 March. Submit your answers in a single document in plain text or PDF (you don't need to submit any of the files you created while running the software, but you may include some sample output in your document where you think it is relevant).

The submit command for this assignment is:
submit nlg cw2 <your document>

Part 1

From inside the subdirectory of your home directory where you keep all your code for the NLG course, type the following command into a terminal window:

      svn checkout ass-2

This should create a subdirectory called ass-2 containing the materials which you will need for this assignment.

In the ass-2 directory you should see:

In the data subdirectory you will find:

The generate-texts software is written in Java and should run on any platform; however, it has only been tested on DICE and on a MacBook. If you use a Windows machine, you do so at your own risk. The ngram-count software has been compiled on DICE and will only run on DICE or a similar 64-bit Linux platform.

Running the Software

Examine the two content plans. First we are going to generate texts from one of them, using the parameters.xml parameter file. This file specifies two things: (a) the content plan to be used as input to the generator; and (b) the language model to be used to rank the output texts. For the moment, we are going to use the content plan called compare-giovanni-ti_amo.xml and no language model. Take a look at the contents of the parameters.xml file to see how these things are encoded in the XML.

From inside the assignment directory, type the following command:

  ./generate-texts parameters.xml

The generation software will output some information to the screen while it is running. When it has finished, you should find that you now have three new subdirectories: (a) data/text-plans; (b) data/logical-forms; (c) output.

First, look at the text plan files in the text-plans subdirectory. There should be three of these, in a format similar to that of SPaRKy. They form a subset of the possible text plans which could be created from the content plan - propositions within an infer relation can occur in any order.

Next, look at the logical form files in the logical-forms directory. There should be one logical form file for each text plan, with each pair of files sharing the same number. The logical forms can be used as input to the OpenCCG surface realiser. You will notice that they are presented in a more concise format than the hybrid logic format you worked with before; this is for convenience and readability, since they are quite long. All of the same information is present.

Question 1.1 Text Plan Output (3 marks)

Compare the various text plans with the input content plan.

Describe what changes have been made in the process of converting the content plan into the text plans.

Question 1.2 Text Realisation Output and ranking using ngram language models (8 marks)

Now, look at the file in the output directory. You should see a file called compare-giovanni-ti_amo.output that contains a list of 108 sentences generated from the various logical forms - note that all the realisations have been collected into this single file. At the moment the outputs are not ranked.

In order to rank the realised output texts, we have provided you with an SRILM toolkit ngram model trained on part of the Wall Street Journal newswire corpus. You can find the model, named wsj.lm, in the ngram-models directory. Have a look at the language model file to see what it contains.

OpenCCG allows you to load one or more ngram models and use them to rank the outputs. Open up the parameters.xml file and uncomment the ngram-models element. Now run the generation command again.
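Once uncommented, the element should look something like the following. This is only a sketch, based on the model syntax used for the two-model setup later in this assignment; check it against what is actually in your parameters.xml (the weight value here is an illustrative assumption):

```xml
<ngram-models>
  <model file="wsj.lm" weight="100"/>
</ngram-models>
```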

Look at the output file again. You should notice that the sentences are now ranked, from best to worst.

Do you think that the ngram model ranks them correctly? If not, suggest why the ngram model is performing poorly in this case.

Question 1.3 Better ngram models by hand (12 marks)

In this question, you are going to create an ngram model by hand, to try and get a better ranking of the output texts than the Wall Street Journal language model can give us.

First, copy the output sentences file from question 1.2, and save the copy, called restaurants-1.txt, inside the ngram-models directory. Open this copy in a text editor and simply delete those texts which you think are bad, keeping the ones which are acceptable. You will also need to remove the ranks from the beginning of each text. This file will be our little training corpus for this question.
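One way to strip the ranks is with sed. The pattern below is only a sketch, and strip_ranks is a hypothetical convenience function: it assumes each line begins with a numeric rank such as "1. " or a bracketed score like "[-12.34] ", so inspect your file first and adjust the pattern to whatever format you actually see.

```shell
# Hypothetical helper: strip a leading rank token from each line on stdin.
# ASSUMPTION: ranks look like "1. " or "[-12.34] "; adjust the regular
# expression to match the actual format in your output file.
strip_ranks() {
  sed -E 's/^ *(\[-?[0-9]+(\.[0-9]+)?\]|[0-9]+\.?) +//'
}
```

Typical usage, from inside the ngram-models directory:

  strip_ranks < restaurants-1.txt > tmp.txt && mv tmp.txt restaurants-1.txt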

In the ngram-models directory, type the following command to train an ngram model from the corpus:

  ./ngram-count -unk -text restaurants-1.txt -write restaurants-1.count -lm restaurants-1.lm

This will create a trigram language model (ngram-count's default order is 3), called restaurants-1.lm, based on your little corpus; the -unk option builds an open-vocabulary model, so that words which do not occur in your small corpus can still be scored. You can ignore all of the output to the screen when you run ngram-count.

Now we will run the generator again using the same content plan, but adding your new ngram model. To do this, edit the parameters.xml file so that the ngram-models element contains the following two models, weighted equally:

    <model file="wsj.lm" weight="50"/>
    <model file="restaurants-1.lm" weight="50"/>

Run the generation software again using your updated parameter file and examine the output texts.

Is the ordering better than using the WSJ language model on its own? If so, why? If not, why not?

Question 1.4 Better ngram models from the web (10 marks)

In this question, we are going to create a language model, called restaurants-2.lm, trained on relevant texts found on the web.

Using your favourite search engine, type in a few relevant keywords (e.g. "Italian restaurant review") in order to find web pages which might contain text relevant to our domain. Copy and paste all the paragraphs of text which you find on each page into a file in the ngram-models directory called restaurants-2.txt. Do not select individual sentences from within the text. Try and avoid copying in non-text. Do as many different queries as you like, until your corpus has 10,000 words.
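To keep track of how close you are to the 10,000-word target, a rough whitespace-token count is good enough. count_words below is just a hypothetical convenience wrapper around wc:

```shell
# Rough word count of a corpus file: counts whitespace-separated tokens.
# tr strips the leading padding some wc implementations print.
count_words() { wc -w < "$1" | tr -d ' '; }
# e.g. count_words restaurants-2.txt
```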

Now create an ngram model from your web corpus, called restaurants-2.lm.
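This mirrors the command from question 1.3, with only the file names changed (run it from inside the ngram-models directory on DICE, since ngram-count only runs there):

```shell
./ngram-count -unk -text restaurants-2.txt -write restaurants-2.count -lm restaurants-2.lm
```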

Use this language model, in place of the one you used in question 1.3, in combination with the generation software.

Discuss the results, comparing them with what you got in questions 1.2 and 1.3. Adjust the relative weighting of the various language models to see if this has any effect.
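For example, to weight your restaurant model more heavily than the WSJ model, the ngram-models element could be changed to something like the following (the 20/80 split is just an illustrative assumption; try several splits):

```xml
<model file="wsj.lm" weight="20"/>
<model file="restaurants-2.lm" weight="80"/>
```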

Question 1.5 Crowd-sourcing a corpus (8 marks)

Email your 10,000-word corpus to , and it will be added to a crowd-sourced corpus of everyone's individual corpora, plus the corpus created by the last two years of students, to get more data. This corpus is accessible from . From March 11th it will be updated once a day, as and when new contributions arrive.

Create a language model, restaurants-3.lm, from the crowd-sourced corpus, in the same way as before. Run the generation software using this model instead of the previous one.

Discuss what results you get compared to before. Does having more data lead to better results? What happens when you drop the WSJ language model?

Question 1.6 Using your language models on the other content plan (10 marks)

You now have two useful language models, one hand-made (restaurants-1.lm) and one crowd-sourced from the internet (restaurants-3.lm). Run the generation software on the other content plan (recommend-giovanni.xml), using various combinations of these and wsj.lm if you want, with any weights you choose.

Describe what you did, and what results you got. Can you learn anything from these results?

Question 1.7 The length effect (5 marks)

You may have noticed that the language models generally prefer shorter sentences.

Why do you think this is the case? Can you suggest a solution to this problem?

Question 1.8 Using Google (8 marks)

The results below were obtained by running a script to get search results from Google. Explain, in as much detail as you can, how we can use this information to improve our generation software.


Part 2

Question 2.1 (8 marks)

Your NLG company has been commissioned to adapt the MATCH multimodal restaurant recommendation system to work for the Edinburgh Festival, initially covering only performance art-forms (such as theatre, comedy and dance).

The system generates the following texts, which compare two theatre shows (productions):

  1. “Smalltown” and “Pandas” are the best shows for you. “Smalltown” opens soon, while “Pandas” opens later, and “Smalltown” and “Pandas” are from very strong companies. “Smalltown” is a teen drama while “Pandas” is a comedy-thriller.
  2. “Smalltown” and “Pandas” are the best shows for you. “Smalltown”, which is a teen drama, opens soon, and is from a strong company. “Pandas”, which is a comedy-thriller, is from a strong company, and opens later.

Suppose you show these texts to two different test-users, and get them to rate them. Both raters prefer text (1) to text (2).

Briefly write out the text plans underlying the two cases, and indicate why (1) might be preferred.

Question 2.2 (8 marks)

The system also generates the following texts, which recommend the same dance show:

  1. “Letters from America” is the best show for you since it is a modern dance piece, from a strong company, tickets are £14, and it got four stars in “The List”.
  2. “Letters from America” is a modern dance piece, with four stars in “The List”. It is from a strong company. Tickets are £14. It is the best show for you.

Suppose you show these texts to two different test-users, and get them to rate them. User X gives (1) a rating of 1 (low), while Y gives it 4 (high). User X gives (2) a rating of 4 (high), while Y gives it 2 (low).

How would you re-design your generation system, so as to satisfy both X and Y?

Question 2.3 (12 marks)

To generate all these texts, your company must have adapted the kind of multi-attribute value theoretic user model used in MATCH. Some assumptions have been made about which attributes and values of shows should be included. You personally may or may not agree with these assumptions; now is your chance to fix them.

Indicate what attributes and values of shows you think would be most important in your new user model, and give an example user model with up to six factors relevant to choosing among shows.

Describe how you would go about attaching weights to the factors to construct a specific user model.

Question 2.4 (8 marks)

Suppose your improved system is now embedded within a spoken dialogue system, and you want to demonstrate that the complete system can align with its users.

Give an example showing the complete system aligning with a user requesting a recommendation, and show how this behaviour might differ from that of a system which fails to align.


If you have any problems, please email (note that I will be out of town from March 2-10 inclusive). If you have a software problem, you can move on to Part 2, which doesn't require any software, and come back to Part 1 later.