Set s to be the string
"""IBM rose 1.2% to $81.49 thanks to an upgrade to 'buy' from 'hold' from Citigroup, but Cisco fell 0.2% to $100.02."""
Using re_show from nltk_lite.utilities, specify patterns that will pick out the following substrings from s:
Run from nltk_lite import tokenize and, with s assigned the same string as above, try out list(tokenize.regexp(s, '\w+')).
Essentially, this selects strings of alphanumeric characters; but why, for example, does it break numerical expressions into substrings?
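To see the effect concretely, here is a minimal sketch that uses re.findall from Python's standard re module as a stand-in for tokenize.regexp. This assumes the two behave alike for a simple pattern such as '\w+'; that equivalence is an assumption, not something the lab sheet states.

```python
import re

s = """IBM rose 1.2% to $81.49 thanks to an upgrade to 'buy' from 'hold' from Citigroup, but Cisco fell 0.2% to $100.02."""

# \w matches only letters, digits and the underscore, so '.', '%' and '$'
# act as separators and every number is cut at its decimal point.
tokens = re.findall(r'\w+', s)
print(tokens)
```

In the printed list, '1.2%' shows up as the two tokens '1' and '2', and '$81.49' as '81' and '49'.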
Now try to develop a new pattern in place of '\w+' that will tokenize strings like '$81.49' and '1.2%' as single tokens. When you have done this, try out the same tokenization rule on a longer piece of text, such as this extract from the Wall Street Journal (WSJ). (Read this in as a string, using t = open("wsj_0012").read().)
In particular, specialize your regular expression so that it now captures all the new numerical expressions as tokens.
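One possible shape for such a pattern is sketched below, again with re.findall standing in for tokenize.regexp (an assumption). The names pat, wide_pat and sample are illustrative only, and the sample sentence is made up rather than taken from the WSJ extract; the numerical forms that actually occur in that text may call for further alternatives.

```python
import re

s = """IBM rose 1.2% to $81.49 thanks to an upgrade to 'buy' from 'hold' from Citigroup, but Cisco fell 0.2% to $100.02."""

# Try the money/percentage alternative first, then fall back to plain words.
# The groups are non-capturing, (?:...), so re.findall returns whole matches.
pat = r'\$?\d+(?:\.\d+)?%?|\w+'
tokens = re.findall(pat, s)
print(tokens)

# Hypothetical extension: also allow thousands separators such as 1,250,000.
wide_pat = r'\$?\d+(?:,\d{3})*(?:\.\d+)?%?|\w+'
sample = "Sales rose 12% to $1,250,000.50 from 980,000."  # made-up sentence
wide_tokens = re.findall(wide_pat, sample)
print(wide_tokens)
```

The order of the alternatives matters: if '\w+' came first, it would consume the leading digits of a number before the numerical alternative had a chance to match.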
Using list(tokenize.regexp(s, pat)), try to develop a pat which will tokenize the sentences in the text.
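As a starting point, the sketch below treats a sentence as a run of text ending in '.', '!' or '?' that is followed either by whitespace and a capital letter or by the end of the text. It uses the standard re module as a stand-in for tokenize.regexp (an assumption), the sample text is made up, and the heuristic will wrongly split after abbreviations such as "Mr." when a capitalized word follows.

```python
import re

text = ("IBM rose 1.2% to $81.49. Cisco fell 0.2% to $100.02. "
        "Analysts were surprised!")

# \S         start each sentence at a non-space character
# .*?        take as little text as possible ...
# [.!?]      ... up to a sentence-final punctuation mark
# (?=...)    that is followed by whitespace + capital letter, or end of text
sent_pat = re.compile(r'\S.*?[.!?](?=\s+[A-Z]|\s*$)', re.S)
sentences = sent_pat.findall(text)
for sentence in sentences:
    print(sentence)
```

The lookahead is what keeps the decimal points in '1.2%' and '$81.49' from being read as sentence boundaries, since they are followed by a digit rather than by whitespace.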
Using the re library, try to transform the WSJ text so that there is exactly one newline between every sentence.
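A minimal sketch of one way to do this: first collapse all existing whitespace, including line breaks inside sentences, to single spaces, then replace the whitespace at each assumed sentence boundary with a newline. The boundary test (final punctuation followed by a capital letter) is a rough heuristic, and the sample text is made up rather than taken from the WSJ extract.

```python
import re

text = ("IBM rose 1.2%\nto $81.49.  Cisco fell\n0.2% to $100.02.\n\n\n"
        "Analysts were surprised.")

# Step 1: collapse every run of whitespace (spaces, tabs, newlines) to one space.
flat = ' '.join(text.split())

# Step 2: whitespace that follows '.', '!' or '?' and precedes a capital
# letter is taken to be a sentence boundary and becomes a single newline.
one_per_line = re.sub(r'(?<=[.!?])\s+(?=[A-Z])', '\n', flat)
print(one_per_line)
```

Because the lookbehind and lookahead consume no characters, only the whitespace itself is replaced and the sentences are left unchanged.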