In this question, you will carry out some basic text analysis. Two sample files are provided for you to test your code on.
We will create a class Tokenizer which reads in a text file and breaks it into individual words. This exercise will introduce you to two concepts:
- Tokenization — the process of turning a single text string into a list of simpler strings, or tokens, by splitting the input at certain places.
- Associative arrays — data structures that map from a set of unique keys to a corresponding set of values. An example would be a telephone directory that maps from person names to phone numbers. Associative arrays are implemented in Java by the type HashMap.
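To make the associative-array idea concrete, here is a minimal sketch of the telephone directory example as a Java HashMap (the names and numbers are invented purely for illustration):

```java
import java.util.HashMap;

public class PhoneBook {
    public static void main(String[] args) {
        // Map person names (keys) to phone numbers (values)
        HashMap<String, String> directory = new HashMap<String, String>();
        directory.put("Ada Lovelace", "0131 555 0101");
        directory.put("Alan Turing", "0131 555 0202");

        // Look up a value by its key
        System.out.println(directory.get("Ada Lovelace")); // prints 0131 555 0101
    }
}
```

Note that looking up a key that is not in the map returns null rather than raising an error; this detail becomes important later in the exercise.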
Create a class Tokenizer with the following API:
- public Tokenizer()
- No-argument class constructor. A tokenizer created with this constructor should have an empty list of tokens as data.
- public Tokenizer(String fname)
- Class constructor. This calls tokensFromFile() on the constructor’s argument.
- public void tokensFromFile(String fname)
- Uses the In library from stdlib to read the contents of the file fname into a string via a call to In's readAll() method. Alternatively, you can use the readString() method from java.nio.file.Files. The method tokenize() is then called on this string.
- public void tokenize(String str)
- Splits str into an array of tokens of type String[] and stores it in the object.
- public String[] getTokens()
- Return the tokens.
- public int getNumberTokens()
- Return the number of tokens.
Notice that this API supports three approaches to tokenizing some text:
// Use the constructor to read in a file
Tokenizer t0 = new Tokenizer("melville-moby_dick.txt");
String[] tokens0 = t0.getTokens();

// Call tokensFromFile() to read in a file
Tokenizer t1 = new Tokenizer();
t1.tokensFromFile("melville-moby_dick.txt");
String[] tokens1 = t1.getTokens();

// Call tokenize() on a string
Tokenizer t2 = new Tokenizer();
String sent = "Together we can change the world.";
t2.tokenize(sent);
String[] tokens2 = t2.getTokens();
The tokenize() method takes a String argument, and returns an array of Strings, which contains the original string split into words. Use the split() method of the String class. You may recall that this takes one argument, which is the delimiter used for splitting. Try first using a single space as the argument: .split(" "). This is not too bad, but preserves newlines and tabs in the input; you can test this by running the method on a string like "number of wheels\t:2\ncost:\t$500".
In order to filter out newlines and tabs, we’ll use a regular expression special symbol \W which splits on any so-called ‘non-word character’, namely whitespace or punctuation. (See the Java Pattern API for more details.) In order to include \W in a string, you will need to escape the backslash with another backslash, like this: "\\W".
As you may recall from the Informatics 1 - Logic and Computation course, the Kleene star * is a repeat operator. Thus "\\W*" means zero or more non-word characters. A closely related operator is Kleene plus, where "\\W+" means one or more non-word characters. Decide whether you should have an operator as part of your regular expression, and if so, which one. As mentioned in Computation and Logic, you can find more information in the Wikipedia entry on regular expressions.
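One possible sketch of the tokenize() part of the class along these lines is shown below. The private field name tokens is an assumption for illustration; your own implementation may organise its data differently, and you should decide on the regular expression yourself rather than copying this one unexamined:

```java
public class Tokenizer {
    private String[] tokens = new String[0]; // empty until tokenize() is called

    public void tokenize(String str) {
        // Split on runs of one or more non-word characters
        // (whitespace or punctuation)
        tokens = str.split("\\W+");
    }

    public String[] getTokens() {
        return tokens;
    }

    public int getNumberTokens() {
        return tokens.length;
    }

    public static void main(String[] args) {
        Tokenizer t = new Tokenizer();
        t.tokenize("Together we can change the world.");
        System.out.println(t.getNumberTokens()); // prints 6
    }
}
```

One caveat worth testing for yourself: if the input string begins with a non-word character, split() produces a leading empty string in the result.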
An automated test has been created for this exercise: TokenizerTest.java.
One of the things that we might be interested in is the distribution of words of different length in a given text. For example, we might expect that in a typical English text, relatively short words are most frequent. But are words of length 4, say, more or less frequent than those of length 3 or 5? Would the distribution of word lengths be different in a technical article than in a piece of popular journalism? And what would you expect to be the length of the longest words in a text? 15, 20, 25?
We can answer these questions by building a HashMap whose keys are the lengths of words in a text and whose values are the number of word occurrences that have that length. So we need to map from integers to integers. As you may recall, only reference types can appear as type parameters of a HashMap, and consequently we have to use a wrapper class, namely Integer. Apart from the name change, you can think of this as being just like an int. Our HashMap will therefore have the following type:
HashMap<Integer, Integer>
We can use this as the return value of a method in the same way as any other type. For example:
public HashMap<Integer, Integer> myFunc() {
    HashMap<Integer, Integer> result = new HashMap<Integer, Integer>();
    // ...
    return result;
}
We will be using the HashMap to hold a frequency distribution. This records the number of times a particular event occurs, for example the event that a token with length \( n \) appears in a list. The logic of this is as follows: given a frequency distribution freq and an event e, check how many times e has already been recorded in freq. If the answer is never, then set the value to 1. If the answer is \( k \gt 0 \), then set the value to \( k + 1 \).
Warning
If a key k is not already mapped in a HashMap m, then m.get(k) will return null.
In order to use HashMap, make sure you include the following line as the first line in your file:
import java.util.HashMap;
In this exercise, you should implement a WordCounter. This will hold data consisting of a frequency distribution, as described above. It should meet the following API:
- public WordCounter(String[] tokens)
- Class constructor. When a WordCounter is created, it calls wordLengthFreq() on the input tokens.
- public void wordLengthFreq(String[] tokens)
- Replace the object’s frequency distribution with information about the lengths of the strings in tokens.
- public HashMap<Integer, Integer> getFreqDist()
- Returns the frequency distribution as a HashMap.
- public int maxVal()
- Returns the highest value in the frequency distribution.
- public double[] map2array()
- Convert the frequency distribution to a normalized array of doubles. Each (integer) key of the frequency distribution corresponds to an index into the array, and the value for that key corresponds to the element at the index. Convert the values into percentages (where the maximum value of the distribution is 100%).
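As a sketch of the conversion that map2array() performs, here is a self-contained static version that takes the frequency distribution as an argument (in your WordCounter it would instead read the object's own distribution; the class and method names below are illustrative):

```java
import java.util.HashMap;

public class MapToArrayDemo {
    // Convert a frequency distribution to a normalized array of percentages,
    // where the largest count in the distribution maps to 100.0
    static double[] map2array(HashMap<Integer, Integer> freq) {
        int maxKey = 0;
        int maxVal = 0;
        for (int key : freq.keySet()) {
            if (key > maxKey) maxKey = key;
            if (freq.get(key) > maxVal) maxVal = freq.get(key);
        }
        double[] result = new double[maxKey + 1]; // unused lengths stay 0.0
        for (int key : freq.keySet()) {
            result[key] = 100.0 * freq.get(key) / maxVal;
        }
        return result;
    }

    public static void main(String[] args) {
        HashMap<Integer, Integer> freq = new HashMap<Integer, Integer>();
        freq.put(2, 3);
        freq.put(3, 1);
        double[] points = map2array(freq);
        System.out.println(points[2]); // prints 100.0
    }
}
```

Dividing by the maximum count (from maxVal()) rather than the total is what makes the tallest bar of the histogram reach exactly 100%.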
Here is an example of client code for this class.
Tokenizer tokenizer = new Tokenizer("melville-moby_dick.txt");
String[] tokens = tokenizer.getTokens();
WordCounter wordCounter = new WordCounter(tokens);
System.out.println(wordCounter.getFreqDist());

double[] points = wordCounter.map2array();
int n = points.length;
StdDraw.clear();
StdDraw.setXscale(0, n - 1);
StdDraw.setYscale(0, 100);
StdDraw.setPenRadius(0.5 / n);
for (int i = 0; i < n; i++) {
    StdDraw.line(i, 0, i, points[i]);
}
The last section of this code uses StdDraw from the stdlib library to build a histogram corresponding to the frequency distribution.
An automated test has been created for this exercise: WordCounterTest.java.