
Introduction to Computational Linguistics

Lab 3 — Regular Expressions

  1. Set s to be the string
    """IBM rose 1.2% to $81.49
       thanks to an upgrade to 'buy' from 'hold' from Citigroup,
       but Cisco fell 0.2% to $100.02.""" 
    Using re_show from nltk_lite.utilities, specify patterns that will pick out the following substrings from s (a sketch showing how re_show is called appears after this list):
    • all numerical amounts (e.g. '81.49'; don't worry about the '$' or '%')
    • just dollar amounts
    • just percentage amounts
    • words in quotes
    • all words which are not numerical amounts
  2. Using from nltk_lite import tokenize, and with s assigned the same string as above, try out list(tokenize.regexp(s, '\w+')). Essentially, this selects strings of alphanumeric characters; but why, for example, does it break numerical expressions into substrings? Now try to develop a new pattern in place of '\w+' that will tokenize strings like '$81.49' and '1.2%' as single tokens (a starting-point sketch appears after this list). When you have done this, try out the same tokenization rule on a longer piece of text, such as this extract from the Wall Street Journal (WSJ). (Read this in as a string, using t = open("wsj_0012").read().) In particular, specialize your regular expression so that it now captures all the new numerical expressions as tokens.
  3. Still using the WSJ text and list(tokenize.regexp(s, pat)), try to develop a pattern pat that will tokenize the sentences in the text (one crude starting point is sketched after this list).
  4. Using just Python's re module, try to transform the WSJ text so that there is exactly one newline between consecutive sentences (a minimal sketch follows the list).
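
A minimal sketch for exercise 1. It assumes re_show is called with the pattern first and the string second, and that it prints s with the matches marked; this call form is an assumption based on later NLTK releases. The patterns below cover only the first, second and fourth bullets; the remaining bullets follow the same idea and are left as part of the exercise.

    from nltk_lite.utilities import re_show

    s = """IBM rose 1.2% to $81.49
       thanks to an upgrade to 'buy' from 'hold' from Citigroup,
       but Cisco fell 0.2% to $100.02."""

    # All numerical amounts: one or more digits with an optional decimal part.
    re_show(r'\d+(?:\.\d+)?', s)

    # Just the dollar amounts: the same pattern preceded by a literal '$'.
    re_show(r'\$\d+(?:\.\d+)?', s)

    # Words in single quotes.
    re_show(r"'\w+'", s)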
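
For exercise 2, a starting-point sketch using the call form given in the exercise; the token pattern is only a first attempt to refine, and it assumes the file wsj_0012 is in the current directory.

    from nltk_lite import tokenize

    s = """IBM rose 1.2% to $81.49
       thanks to an upgrade to 'buy' from 'hold' from Citigroup,
       but Cisco fell 0.2% to $100.02."""

    # '\w+' splits '81.49' into '81' and '49' because '.' is not an
    # alphanumeric character.  A first attempt at keeping '$81.49' and
    # '1.2%' whole allows an optional leading '$', an optional decimal
    # part and an optional trailing '%', and otherwise falls back to '\w+'.
    pat = r'\$?\d+(?:\.\d+)?%?|\w+'
    print(list(tokenize.regexp(s, pat)))

    # The same pattern, tried on the longer WSJ extract.
    t = open("wsj_0012").read()
    print(list(tokenize.regexp(t, pat)))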
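
For exercise 3, one deliberately crude sentence pattern: any run of characters up to the next '.', '!' or '?' counts as a sentence. Abbreviations such as 'Mr.' will cut sentences short, so the pattern needs refining against the WSJ text.

    from nltk_lite import tokenize

    t = open("wsj_0012").read()

    # Everything up to and including the next sentence-final punctuation
    # mark is treated as one token.
    pat = r'[^.!?]+[.!?]'
    print(list(tokenize.regexp(t, pat)))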
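
For exercise 4, a minimal sketch using only the re module; it makes the same crude assumption about sentence boundaries, so it will also break lines after abbreviations.

    import re

    t = open("wsj_0012").read()

    # Collapse all whitespace runs (including the original newlines) to
    # single spaces, then put exactly one newline after each sentence-final
    # punctuation mark that is followed by a space.
    flat = re.sub(r'\s+', ' ', t)
    print(re.sub(r'([.!?]) ', r'\1\n', flat))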

