|Author:||Sharon Goldwater, Ida Szubert, Henry S. Thompson|
|Date:||2014-09-01, updated 2015-09-15, 2016-09-15, 2017-09-15, 2018-08-20, 2018-09-18|
|Copyright:||This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License: You may re-use, redistribute, or modify this work for non-commercial purposes provided you retain attribution to any previous author(s).|
This course has a biweekly 1.5-hour lab session where you will work through the instructions and questions either from the web page or the PDF version, which we will hand out in the lab. You should work with a partner, sharing a single keyboard and screen. Each person should ensure that both they and their partner understand what is going on. Pay attention to what your partner is doing and discuss what's going on. That means:
You should be able to work through the main set of questions in 1.5 hours. However, we encourage you to go beyond the basic set of questions to further improve your practical skills and understanding, using the "Going Further" section at the end of each lab.
NOTE: This week's lab has a lot to read. Future labs will have less reading and more thinking/coding, but you need to read through some basics first.
Please do the following before arriving at the first lab session.
To set up the workspace, log into your DICE account and open a shell (also called a terminal) following the instructions in the Introduction to DICE document. You should see a command prompt of the form:
Now, create a directory (folder) called anlp that you will use for all your ANLP work. You can do this by entering the following command at the prompt. (Remember, mkdir stands for 'make directory')
Now, create a subdirectory for your labs, and then another one for this lab. (You cannot create a subdirectory before creating the directory it belongs inside.)
mkdir anlp/labs mkdir anlp/labs/lab1
We've created an environment for this course that's preloaded with various software packages that we will use. Enter the following command to activate the environment for this course and check that it's working for you:
conda activate anlp
You should see the prompt change to:
If it doesn't, and you're actually doing this in the lab, please enter the following command:
If you see module-anlp in the resulting list, try this command:
where [hostname] is the name of the computer you are sitting in front of (should be showing up as part of the prompt at the beginning of each command line), and then:
conda activate anlp
Whether this works or not, just do:
and continue on with the rest of the lab---I'll ask later for people to let me know what did or didn't work.
If you didn't see module-anlp, then either you are not (yet) registered on this course, or that happened recently and hasn't propagated through to your online role list yet. Don't worry, you don't actually need this to work for the rest of today's lab: just carry on with the instructions.
Each time you start work for ANLP for coursework or labs after this today, you will need to activate the environment:
conda activate anlp
If this still doesn't work a day or two after you have been enrolled, get in touch by email to firstname.lastname@example.org and describe what's wrong, including your roles list and any error messages you get when trying to activate.
You can deactivate this environment by calling:
It's a good idea to do this when you're finished doing ANLP stuff, but if you're about to log off you don't need to bother.
(conda is what's called a package manager. We'll say a bit more about how it's used in a later lab).
This lab aims to:
What is UNIX?
This operating system is commonly used by computing researchers and programmers; in Informatics we use a version called Scientific Linux, which is installed on all of the student lab machines and which you will be using for the practical labs in this course. (The UNIX tools described here are also available in Apple's OS X, and in Windows by installing additional software such as Cygwin.)
UNIX commands can be a bit confusing at first but are powerful tools for manipulating files and data, including exploratory data analysis. This lab will introduce some basic UNIX commands for exploring and processing files. You can do a lot of useful stuff with just a few commands, and we hope you'll keep using the skills from this lab for the rest of the semester and in your other courses.
But I already know UNIX!
This course is taken by students with many different backgrounds. If you are already familiar with UNIX, please skim through anyway to make sure you know all the commands we discuss and can do the tasks at the end of the lab. You may also want to take a look at the Going Further section for more practice, and you can also help out your classmates who have less experience than you.
What data will we use and what will we do with it?
We will explore the data from a corpus of parent-child interactions, building up to computing the child's mean length of utterance (MLU) in one or two files. MLU measures a child's language development: it is just the average number of words (or, sometimes, morphemes) in each of the child's utterances (spoken sentences).
In this lab we will be working with data from CHILDES, the Child Language Data Exchange System. CHILDES is a collection of many different corpora in many different languages, all contributed by researchers of child language development. Different researchers have different interests, so the corpora contain different types of data (transcriptions, audio, and/or video) and different kinds of annotations (ranging from detailed phonetic transcripts to morphological and syntactic annotations).
We will use the Providence corpus for this lab. To download the corpus, click on the following URL or paste it into your web browser: http://homepages.inf.ed.ac.uk/sgwater/teaching/Providence.zip. You should get a dialog box asking what to do with the file; choose Open with Archive Manager (default). In the next window, click Extract and then navigate to your lab1 directory by double-clicking in the right-hand pane on the anlp folder, then labs, then lab1. Then click Extract.
Go back to your terminal window and cd (meaning change directory) into the lab1 directory:
If you ever get confused or forget which directory you are in, use the pwd (print working directory) command. Try it now:
Now let's list the contents of the current directory using ls:
You should now see a subdirectory called Providence, which contains the data you just downloaded. What do you see in it? What about in the further subdirectories?
ls Providence ls Providence/Ethan
To get an idea of what is in the files you just downloaded, type:
cd Providence/Ethan less eth01.cha
What information is in the metadata at the top of each file? (Hint: child language researchers use the format y;m.d to indicate a child's age in years;months.days)
(less may seem like a funny name for this command which shows you more of the file; it's because there was an earlier similar command called more, so when this newer version was developed the developers decided to give it a cute name. UNIX developers like cute names.)
When you are looking at a file using less, you can scroll up and down using the arrow keys or <PageUp>/<PageDown>. <space> also acts like <PageDown> and <Enter> acts like <down-arrow>.
What do you see in the rest of the file? (Hint: the string of numbers at the end of each line is a code that links to a time point in the audio recording of this data. The audio isn't included here but can be obtained from the CHILDES database.)
One final useful thing you can do with less is search for a particular string in the file by typing / followed by the string. Try it: type
You should see all of the instances of the being highlighted.
To stop viewing the file and go back to the terminal command line, type q.
Actually, the eth01.cha file maybe is not so interesting. Look now at eth50.cha. What are some of the main differences between the data in these two files? Can you think of an explanation for those differences? (If you want to look at both files simultaneously, you could start a new terminal and open one file in each terminal.)
Sometimes we just need a quick peek at part of a file. What does this command do?
What about this one?
These commands can be useful, for example, if we want to look over the range of ages in the files we have. Try:
What does the * do?
We don't really need to look at the first 10 lines of each file to see the ages, we only need the first five lines. If we could print only the first five lines, we could see the information we want more compactly. Most UNIX commands, including head, have many possible options to change their behavior. To see what options are available, look at the manual for the command:
You should see:
HEAD(1) User Commands HEAD(1) NAME head - output the first part of files SYNOPSIS head [OPTION]... [FILE]... DESCRIPTION Print the first 10 lines of each FILE to standard output. With more than one FILE, precede each with a header giving the file name. With no FILE, or when FILE is -, read standard input. Mandatory arguments to long options are mandatory for short options too. -c, --bytes=[-]K print the first K bytes of each file; with the leading `-', print all but the last K bytes of each file -n, --lines=[-]K print the first K lines instead of the first 10; with the lead- ing `-', print all but the last K lines of each file -q, --quiet, --silent never print headers giving file names -v, --verbose always print headers giving file names ...
Note: you should be able to move up and down in the man pages using the same keys you used for less [I hope; the default for the student account might not allow you to move upwards...], and you can also return to the command line using q.
The man page starts with the name of the command and a brief synopsis of how to use it. In this case, it says that the head command can be followed by zero or more options (sometimes called flags) and then zero or more files as arguments (things to act on). The square brackets indicate that these options and files need not be included at all, and the ... indicate that you can include more than one of each.
The description after the synopsis says what will happen if you include zero or multiple files. (Standard input refers to the text you input in the terminal. Try entering head with no filename to see how this works. You will need to enter some more text after that! To quit, type Ctrl-c.)
The next part of the description tells you what the possible options are and if they require arguments themselves. Many (in this case all) options have both a short and a long form, which are equivalent. For example, to print just the first 5 lines of each file, we could either use:
head -n5 eth01.cha
head --lines=5 eth01.cha
(To try this yourself, you will need to either use another terminal or quit the man page first by typing q.)
The K specified in the description is a variable indicating a required argument to the -n option. Here we use the value 5 for K. Notice that this option also has a non-required argument [-] so if we wanted to print all but the last 5 lines we could type:
head -n-5 eth01.cha
As noted above, you can specify multiple options at once. Compare the output of the following commands:
head -n5 *.cha head -n5 -q *.cha
Now that you know how to read a man page, you may want to look at the man pages for some of the other commands we've seen to see what options are available for them.
There is also a lot of help about these commands on the Internet. You can find more examples or maybe a better explanation than the man page has. If you know the name of the command you want to use, like grep, try searching for UNIX command grep. If you know what you want to do but not the name of the command, searching is trickier, but try searches like UNIX search regular expression or even UNIX find string in file. Get in the habit of looking for help--you will start to recognize some of the main help forums and which ones are most useful to you.
As you may have noticed, most UNIX commands are very short to avoid lots of typing. You can avoid even more typing by using tab completion and history.
Type the following into your terminal. Instead of pressing <enter> at the end, press the <tab> key:
You should find that the filename has been automatically extended to eth. If you type 01 and then <tab> again, you will have the complete filename. In this case the tabbing didn't save many keystrokes, but for long filenames it can save a lot. You can use tab completion for nearly anything in UNIX (commands, filenames, directories). The operating system will complete as much as it can but will stop where there are multiple possible extensions (like here, where all files in the current directory start with eth but we then had to type in 01 to disambiguate before tabbing again.) In DICE, tab completion is set up so that if you hit <tab> twice when there are multiple completions, the system will print out all possible completions for you to see.
Your terminal keeps track of all the commands you have entered. To access the previous command, just type the up-arrow key. If you hit <up> twice, you will get the command before that, and so forth. This can save a lot of typing.
Try this command:
What are the numbers that you got back? What do you think wc stands for? (Hint: look at the man page!)
Using a single command, can you figure out which of the first nine Ethan files has the most lines in it?
Sometimes we want some more specific information. The number of lines in a file roughly indicates how much language data is in there, but what if we want to know how many utterances are spoken by a particular person? Try this command:
grep 'MOT' eth01.cha
What did this command do?
Using the man page for grep, figure out what option you can use with grep to compute the number of utterances spoken by the mother in this file.
In case you are wondering, grep stands for Globally search for a Regular Expression and Print. We won't discuss regular expressions (REs) in detail here, but they are covered in Section 2.1 of the Jurafsky and Martin textbook, and also in the Computer Programming for Speech and Language Processing course. (The grep man page also explains the syntax that grep uses for REs.) REs are a powerful way to match patterns in text and we hope you will learn to love them. For now, we will just show you a few simple REs to show some of what you can do.
First, look at this command but don't run it yet:
grep 'MOT.*the' eth01.cha
What do you think it will do? (Hint: . matches any character, and * means "zero or more repetitions of the previous character")
To see if you were right, run the command and look at the first few lines of output. Even better, let's only print out the first few lines of output. We can do that by combining two commands we already know, grep and head:
grep 'MOT.*the' eth01.cha | head
What just happened?
The | (vertical bar on the keyboard, called pipe in UNIX-world) tells the operating system to use the output of the first command as the input to the second command. (Another way to think of this is that it turns the output of the first command into "standard input", which as you will recall, is what is used as input to head if no filename is given.
Can you think of a way to use pipe to combine two commands we already know to count the number of lines that include the string MOT without using the -c option for grep?
Instead of sending the output of one command to the input of another, we can also just send the output of a command into a file. This can be useful if we want to save the output for later inspection, and we do it using the > (output redirect) character:
grep 'MOT.*the' eth01.cha > my_output.txt less my_output.txt
Coming back to regular expressions, let's make sure you understand what the RE 'MOT.*the' is matching in the grep command. What is the difference between the following two commands?
grep 'MOT.*the' eth01.cha grep 'MOT.*\bthe\b' eth01.cha
You may want to look at just the start of the output by piping it to head. Or try using the --color=auto option for grep (unfortunately the color is lost if you pipe the output).
In general, backslashes are used in regular expressions to indicate something special (like a word boundary or any whitespace character) and are also used to make characters that are usually special (like *) be interpreted literally. So, for example, if we want to match only the lines spoken by the mother, we should use '^\*MOT' rather than just 'MOT', because the first version only matches the exact string *MOT at the beginning of a line (indicated by the ^). The second version also matches lines that are not spoken by the mother but happen to contain the string MOT. (Are there any?)
As you can start to see, regular expressions can be very powerful but proper use takes time and practice, and we will not go any further here.
Believe it or not, we now have all the commands we need to be able to compute the child's mean length of utterance in any given file. Remember, MLU is the average number of words spoken by the child in each of the utterances in the file. What is Ethan's MLU in the file eth50.cha? Assume for this question that a word is any whitespace-delimited string of characters (including punctuation) in the transcription. So, for example, you should count fill him with pom+poms xxx . as six words.
Hint: you can't get the answer entirely automatically. Use the commands you know to get a few key numbers and then perform some basic arithmetic on them yourself. This works fine if we only care about the MLU for one or two files. If we wanted to compute MLU for all the files, we would probably want to write a short program to do it.
If you want, try your hand at one of the following, in whatever order you like: