ANLP 2015


Lecture 28: Discourse, coherence, cohesion

Henry S. Thompson
With input from Johanna Moore and Bonnie Webber
23 November 2015
Creative Commons Attribution-ShareAlike

1. "If we do not hang together

then surely we must hang separately" (Benjamin Franklin)

Not just any collection of sentences makes a discourse.

The difference?

Cohesion
The (linguistic) clues that sentences belong to the same discourse
Coherence
The underlying (semantic) way in which it makes sense that they belong together

2. Linking together

Cohesive discourse often uses lexical chains

Longer texts usually contain several discourse segments

Intuition: When the topic shifts, different words will be used

But, the presence of cohesion does not guarantee coherence

  • John found some firm ripe apples and dropped them in a wooden bucket filled with water.
  • Newton is said to have discovered gravity when hit on the head by an apple that dropped from a tree.
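
To make the point concrete, here is a minimal sketch that counts the content words the two example sentences share. The stop-word list and the crude plural stripping are assumptions made for illustration, not part of the lecture. The sentences share lexical material (cohesion), yet nothing in that overlap makes them a sensible pair (coherence).

```python
import re

# Illustrative stop-word list (an assumption, not taken from the lecture).
STOP_WORDS = {"a", "an", "the", "and", "in", "on", "by", "to", "is",
              "that", "from", "with", "when", "them", "some", "have"}

def content_words(sentence):
    """Lower-case, tokenise, drop stop words, and strip a plural 's'."""
    stems = set()
    for tok in re.findall(r"[a-z]+", sentence.lower()):
        if tok in STOP_WORDS:
            continue
        # Crude stemming: 'apples' -> 'apple'. A real system would use a stemmer.
        stems.add(tok[:-1] if tok.endswith("s") and len(tok) > 3 else tok)
    return stems

s1 = ("John found some firm ripe apples and dropped them "
      "in a wooden bucket filled with water")
s2 = ("Newton is said to have discovered gravity when hit on the head "
      "by an apple that dropped from a tree.")

shared = content_words(s1) & content_words(s2)
print(sorted(shared))  # ['apple', 'dropped']
```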

3. Identifying sub-topics/segmenting discourse

The goal is to delimit coherent sub-sequences of sentences

By division

By (generative) modelling

Relevant for

4. Finding discontinuities: TextTiling

An unsupervised approach based on lexical chains

Three steps:

  1. Preprocess: tokenise, filter and partition
  2. Score: pairwise cohesion
  3. Locate: threshold discontinuities

5. TextTiling: Preprocessing

In order to focus on what is assumed to matter

Moderately aggressive preprocessing is done:
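
One plausible reading of "tokenise, filter and partition" is sketched below. The details are assumptions: the stop-word list is illustrative, the pseudo-sentence length `w` is a hypothetical parameter name, and filtering before partitioning is just one possible ordering.

```python
import re

# Illustrative stop-word list; the lecture does not specify one.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "it", "that"}

def preprocess(text, w=20):
    """Tokenise, filter, and partition text into pseudo-sentences of w tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenise
    kept = [t for t in tokens if t not in STOP_WORDS]     # filter
    # Partition into fixed-length token sequences, ignoring real sentence ends.
    return [kept[i:i + w] for i in range(0, len(kept), w)]

pseudo = preprocess(
    "The cat sat on the mat. It is a fine mat, and the cat likes it.", w=4)
print(pseudo)
# [['cat', 'sat', 'on', 'mat'], ['fine', 'mat', 'cat', 'likes']]
```

Using fixed-length pseudo-sentences rather than real sentences means every block compared later contains a similar number of tokens.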

6. TextTiling: Scoring

Compute a score for the gap between each adjacent pair of token sequences, as follows:

  1. Reduce blocks of k pseudo-sentences on either side of the gap to a bag of words
    • That is, a vector of counts
    • With one position for every 'word' in the whole text
  2. Compute the normalised dot product of the two vectors
    • The cosine similarity
  3. Smooth the resulting score sequence by averaging the scores in a symmetrical window of width s around each gap
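
The three scoring steps can be sketched roughly as follows. This is an illustration under assumptions rather than a reference implementation: sparse `Counter` objects stand in for the full count vectors, blocks are truncated at the ends of the text, and `half_width` is a hypothetical parameter giving the number of neighbours averaged on each side.

```python
import math
from collections import Counter

def cosine(a, b):
    """Normalised dot product of two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def gap_scores(pseudo_sentences, k=2):
    """Score each gap by comparing blocks of up to k pseudo-sentences."""
    scores = []
    for gap in range(1, len(pseudo_sentences)):
        left = Counter(t for ps in pseudo_sentences[max(0, gap - k):gap] for t in ps)
        right = Counter(t for ps in pseudo_sentences[gap:gap + k] for t in ps)
        scores.append(cosine(left, right))
    return scores

def smooth(scores, half_width=1):
    """Average each score with up to half_width neighbours on either side."""
    out = []
    for i in range(len(scores)):
        window = scores[max(0, i - half_width):i + half_width + 1]
        out.append(sum(window) / len(window))
    return out

pseudo = [["cat", "mat"], ["cat", "sat"], ["dog", "bone"], ["dog", "park"]]
scores = gap_scores(pseudo, k=1)
# The middle gap (cat/sat vs dog/bone) shares no words, so its score is 0.0;
# the outer gaps each share one word out of two.
print(scores)
```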

7. TextTiling: Locate

We're looking for discontinuities

That is, something like this:

[Figure: fragment of the score graph, showing a valley around $y_i$]

The depth score at each gap is then given by $(y_{i-1} - y_i) + (y_{i+1} - y_i)$

Larger depth scores correspond to deeper 'valleys'

Depth scores larger than some threshold are taken to mark topic boundaries

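Liberal
$\bar{s} - \sigma$
Conservative
$\bar{s} - \sigma/2$

where $\bar{s}$ is the mean of the depth scores and $\sigma$ is their standard deviation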
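
A minimal sketch of the locate step, using the neighbour-based depth score (y[i-1] - y[i]) + (y[i+1] - y[i]). Two choices here are assumptions added for the illustration: only gaps that are local minima are treated as candidates, and the mean and standard deviation are taken over the depth scores of those candidates.

```python
import statistics

def valley_depths(y):
    """Map each local-minimum gap i to (y[i-1] - y[i]) + (y[i+1] - y[i])."""
    return {i: (y[i - 1] - y[i]) + (y[i + 1] - y[i])
            for i in range(1, len(y) - 1)
            if y[i] < y[i - 1] and y[i] < y[i + 1]}

def boundaries(y, conservative=False):
    """Gaps whose depth exceeds mean - sd (liberal) or mean - sd/2 (conservative)."""
    depths = valley_depths(y)
    if not depths:
        return []
    values = list(depths.values())
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    threshold = mean - (sd / 2 if conservative else sd)
    return [i for i, d in depths.items() if d > threshold]

y = [0.6, 0.5, 0.6, 0.7, 0.1, 0.7, 0.6, 0.55, 0.6]
print(boundaries(y))                     # [1, 4, 7]
print(boundaries(y, conservative=True))  # [4]
```

The liberal threshold accepts all three valleys, while the conservative one keeps only the deep valley at gap 4.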

8. Evaluating segmentation

How well does TextTiling work?

Just classifying every possible boundary as correct (Y+Y or N+N) vs. incorrect (Y+N or N+Y) doesn't work

Counting just Y+Y seems too strict
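
A small made-up example shows both problems. Boundaries are encoded as 0/1 per gap (an assumed encoding), and the sequences are invented for illustration. Per-gap agreement rewards a hypothesis with no boundaries at all more than one that misses by a single gap, while counting only Y+Y gives the near miss no credit whatsoever.

```python
def gap_accuracy(ref, hyp):
    """Fraction of gaps where ref and hyp agree (Y+Y or N+N)."""
    return sum(r == h for r, h in zip(ref, hyp)) / len(ref)

def exact_hits(ref, hyp):
    """Number of gaps marked as boundaries in both (Y+Y)."""
    return sum(1 for r, h in zip(ref, hyp) if r and h)

ref       = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
near_miss = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]  # boundary placed one gap early
no_bounds = [0] * 10                        # never proposes a boundary

print(gap_accuracy(ref, near_miss))  # 0.8
print(gap_accuracy(ref, no_bounds))  # 0.9
print(exact_hits(ref, near_miss))    # 0
```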

9. Evaluation, cont'd

The WindowDiff metric, which counts only misses (Y+N or N+Y) within a window, attempts to address this

Specifically, to compare boundaries in a gold standard reference (Ref) with those in a hypothesis (Hyp):

0 is the best result

1 is the worst
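
As a rough sketch, WindowDiff as it is usually defined slides a window of k gaps over Ref and Hyp and counts the windows in which the two contain a different number of boundaries. The 0/1-per-gap encoding and the example sequences are assumptions made for illustration; k is commonly set to about half the average segment length in Ref.

```python
def window_diff(ref, hyp, k):
    """Fraction of k-gap windows where ref and hyp disagree on boundary count."""
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must cover the same gaps")
    if not 0 < k <= len(ref):
        raise ValueError("k must be between 1 and the number of gaps")
    windows = len(ref) - k + 1
    misses = sum(
        sum(ref[i:i + k]) != sum(hyp[i:i + k])
        for i in range(windows)
    )
    return misses / windows

ref       = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
near_miss = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
no_bounds = [0] * 10

print(window_diff(ref, ref, k=3))        # 0.0
print(window_diff(ref, near_miss, k=3))  # 0.25
print(window_diff(ref, no_bounds, k=3))  # 0.375
```

Unlike per-gap agreement, the near miss now scores better than proposing no boundaries at all.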

10. Machine learning?

More recently, (semi-)supervised machine learning approaches to uncovering topic structure have been explored

Over-simplifying, you can think of the problem as similar to POS-tagging

So you can even use Hidden Markov Models to learn and label:

But now the distribution governs the whole space of (substantive) lexical choice within a topic
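
To illustrate the analogy, here is a toy Viterbi decoder that labels each sentence with a topic, where each topic emits words from its own unigram distribution. Everything specific in it is an assumption: the topic names, the hand-set transition probabilities, the word counts and the add-one smoothing are invented for the example, and a real system would learn these parameters from data.

```python
import math

def viterbi(sentences, topics, trans, emit, vocab_size):
    """Most likely topic label per sentence under a toy HMM.

    trans[a][b] is P(next topic b | topic a); emit[t] holds word counts for
    topic t, turned into add-one-smoothed unigram probabilities.
    """
    def log_emit(topic, words):
        total = sum(emit[topic].values()) + vocab_size
        return sum(math.log((emit[topic].get(w, 0) + 1) / total) for w in words)

    # Uniform start distribution (an assumption).
    best = {t: -math.log(len(topics)) + log_emit(t, sentences[0]) for t in topics}
    back = []
    for words in sentences[1:]:
        prev, best, pointers = best, {}, {}
        for t in topics:
            p = max(topics, key=lambda q: prev[q] + math.log(trans[q][t]))
            best[t] = prev[p] + math.log(trans[p][t]) + log_emit(t, words)
            pointers[t] = p
        back.append(pointers)
    # Trace back from the best final state.
    state = max(topics, key=lambda t: best[t])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

topics = ["fruit", "physics"]
trans = {"fruit": {"fruit": 0.8, "physics": 0.2},
         "physics": {"fruit": 0.2, "physics": 0.8}}
emit = {"fruit": {"apple": 3, "ripe": 2, "bucket": 1},
        "physics": {"gravity": 3, "force": 2, "mass": 1}}
sentences = [["ripe", "apple"], ["apple", "bucket"],
             ["gravity", "force"], ["mass", "gravity"]]

path = viterbi(sentences, topics, trans, emit, vocab_size=6)
print(path)  # ['fruit', 'fruit', 'physics', 'physics']
```

A topic boundary is then read off wherever the label changes, here between the second and third sentences.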

See Purver, M. 2011, "Topic Segmentation", in Tur, G. and de Mori, R. (eds.), Spoken Language Understanding, for a more detailed introduction

11. Topic is not the only divider

Topic/sub-topic is not the only structuring principle we find in discourse

Some common patterns, by genre

Expository
Topic/sub-topic
Task-oriented
Function/precondition
Narrative
Cause/effect, sequence/sub-sequence, state/event

But note that some of this is not necessarily universal

12. Richer structure

Discourse structure is not (always) just ODTAA (one damn thing after another)

And sometimes detecting this structure really matters

  • Welcome to word processing_i
    • That’s using a computer to type letters and reports
    • Make a typo_j?
      • No problem
      • Just back up, type over the mistake_j, and it_j’s gone
      • And, *it_j eliminates retyping
    • And, it_i eliminates retyping

13. Topic is not the only dimension of discourse change

Cohesion sometimes manifests itself differently for different genres

14. Functional Segmentation

Texts within a given genre

generally share a similar structure, independent of topic

That is, their structure

15. Example: news stories

The conventional structure is so 'obvious' that you hardly notice it

In decreasing order of importance

16. Example: Scientific journal papers

In particular, experimental reports

Highly conventionalised

Front matter
Title, Abstract
Body
(or, mnemonically, IMRAD)
  • Introduction (or Objective), including background
  • Methods
  • Results
  • Discussion
Back matter
Acknowledgements, References

Although the major divisions (IMRAD) will usually be typographically distinct and often explicitly labelled

17. Theories of discourse structure

Early discourse resources were task-oriented

And the structure of task-oriented discourse often mirrored the structure of the task

Pre-computational theories had focussed on narrative structures

These gave way to structurally rich generative models

Both were expressed in terms of coherence relations

Still depending on observable phenomena (cohesion) to detect/identify them