Extreme Computing: GNU Parallel

Kenneth Heafield, Rafael Karampatsis, and Matt Pugh 2015

Part 1: Preliminaries

Non-DICE

If you are not using a DICE machine, log into a DICE machine. If you are in the lab, you can skip this step.

ssh sXXXXXXX@student.ssh.inf.ed.ac.uk

You can now continue with the DICE steps.

DICE

Pick a machine from the server list at random. This time you will be running tasks locally, so it's in your interest to pick a free machine.

ssh YYYYY

Disable citation message

Cite the tools you use in a research paper, like the GNU parallel paper. Now that we've gotten that out of the way, we can turn off the nagging message that appears every time it runs.

mkdir -p ~/.parallel && touch ~/.parallel/will-cite

Files are on local disk

The data and some code are in /disk/scratch/exc. To make things shorter, we'll define

e=/disk/scratch/exc

and use $e to refer to this directory.
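
A quick sanity check that the variable is set and the files are readable:

echo $e
ls $e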

Tokenizing Text

In text processing, it's often useful to split punctuation off from words. For example, we want to convert "As the economy improves, rates can only rise." to "As the economy improves , rates can only rise ." by splitting off the comma and period. Don't worry about this for Assignment 1; we already did it for you.
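
To see the effect on that single sentence, you can pipe it through the tokenizer script used below. The exact spacing depends on the tokenizer, but you should see the comma and full stop split off as separate tokens:

echo "As the economy improves, rates can only rise." |$e/tokenizer.perl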

Look at the news data

head $e/news

and look at the tokenized version

head $e/news |$e/tokenizer.perl

You can try to tokenize the whole corpus, but this will be slow, so hit Ctrl+C to stop the command when you've seen enough:

pv $e/news |$e/tokenizer.perl >/dev/null

A brief note on an awesome feature of pv: if you forget to add it to your pipeline, you can still attach it to a running process. Run this command, but read on while it's running.

$e/tokenizer.perl <$e/news >/dev/null

Open another terminal and log into the same machine using ssh YYYYY (going through student.ssh first if you're outside Informatics). Run this command to find the process ID of the tokenizer:

pgrep -u $USER perl

That should print exactly one integer, assuming the tokenizer is the only perl process you are running. This is the process ID of the tokenizer process. Tell pv to attach to that process (substitute the correct ID):

/disk/scratch/exc/pv -d PROCESS_ID_GOES_HERE

This runs a newer version of pv. Now you have a progress bar for the files the process has open. The tokenizer will take a while to finish, so when you get bored stop both it and pv with Ctrl+C. To make things faster, we'll try GNU parallel. First, a sanity check that GNU parallel produces the output we expect.

head $e/news |parallel --pipe $e/tokenizer.perl

It does (on larger data it might reorder some of the batches, but we can pass -k to strictly preserve order). Now try it on the whole corpus

pv $e/news |parallel --pipe $e/tokenizer.perl >/dev/null

It should be faster, though by how much depends on the number of cores (which varies from machine to machine) and on whether other students are running on the same machine. GNU parallel also lets you specify a block size, which is the amount of text given to each tokenizer process. This is only approximate, since parallel splits at line boundaries. The default is 1 MB. Let's raise it to 5 MB.

pv $e/news |parallel --block 5M --pipe $e/tokenizer.perl >/dev/null

Notice how pv updates the bar less often and the amounts it displays (on the left) are generally multiples of 5 MB. To make things more exciting, we'll run the tokenizer on more machines. Visit the servers page and pick another machine. Substitute ZZZZZ with the name of the machine you picked.

pv $e/news |parallel --pipe --sshlogin ZZZZZ $e/tokenizer.perl >/dev/null

That command typically runs slower because it's only running on the remote machine (unless the remote machine is much faster). Remember to tell parallel that it can also run locally. You do this by specifying : as one of the hosts.

pv $e/news |parallel --pipe --sshlogin ZZZZZ,: $e/tokenizer.perl >/dev/null
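
How much of a speedup you see depends on how many cores GNU parallel detects on each host, since by default it starts one job per core. Two quick ways to check the local machine (nproc is part of GNU coreutils, and --number-of-cores is a GNU parallel option):

nproc
parallel --number-of-cores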

Benchmarking Memory

In the lectures, we saw that machines have different levels of cache. I wrote a benchmarking program for you to test this. Those of you who read C++ can look at $e/benchmark.cc. It has been compiled to $e/benchmark. The program benchmarks the scenario where we have an array and need to read from it at random offsets. If the offsets are sorted first, then access will be more sequential. Let's start by creating an array of size 10 and randomly reading from it 10000 times.

$e/benchmark 10 10000 random

The program prints one line: the arguments you gave it and the average cost of each read (in seconds). It can also sort the offsets before running the benchmark (note: sorting is not included in the time calculation, though that might be interesting too!).

$e/benchmark 10 10000 sort

Use GNU parallel to sweep over array sizes. As we saw in the demo lecture, {} is a stand-in for an argument, while ::: says what arguments to try.

parallel $e/benchmark {} 10000 random ::: 1 10 100 1000 10000 100000 1000000 10000000 100000000

Try it a few times and you might notice that the order of the lines is non-deterministic. Run with -k to keep output lines in order.

parallel -k $e/benchmark {} 10000 random ::: 1 10 100 1000 10000 100000 1000000 10000000 100000000

Try a bunch of sizes, but keep in mind that I hard-coded a cap of 2 GB to prevent you from thrashing the machine too much. We can also have GNU parallel sweep over several parameters.

parallel -k $e/benchmark {} ::: 1 10 100 1000 10000 100000 1000000 10000000 100000000 ::: 10000 1000000 ::: sort random

There's a lot of noise in these measurements, not least because other students may be running jobs on the same machine. Let's run each experiment three times. But the experiment number is not an argument to the benchmark program. Fortunately, GNU parallel lets us specify which of the swept arguments to substitute where: for example, {2} refers to the second set of parameters.

parallel -k $e/benchmark {2} 10000 random ::: 1 2 3 ::: 1 10 100 1000 10000 100000 1000000 10000000 100000000

You might notice that I put the experiment number as the outermost loop. That way, there's less chance that a random fluctuation will affect all of the samples for a given condition. Write a program in your favorite programming language to postprocess the output and take the minimum time across the runs for each condition.
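
If awk counts as a favorite language, here is a minimal sketch. It assumes you redirect one of the sweeps above into a file (results.txt and mins.txt are hypothetical names) and that each output line holds the three benchmark arguments followed by the time as its last field:

parallel -k $e/benchmark {2} 10000 random ::: 1 2 3 ::: 1000 1000000 100000000 > results.txt
awk '{k = $1 " " $2 " " $3; if (!(k in best) || $NF + 0 < best[k] + 0) best[k] = $NF} END {for (k in best) print k, best[k]}' results.txt | sort -n > mins.txt

For every distinct argument triple, awk keeps the smallest time it has seen; sort -n then orders the rows by array size.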

Let's do a sweep from 1 MB to 64 MB. The seq program generates a numeric sequence; the arguments are the starting value, the increment, and the last value.

seq 1000000 1000000 64000000

You're also free to use multiples of 1048576 (1 MiB) if that suits you.
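
For instance, this generates steps of 1 MiB up to 64 MiB (67108864 is 64 × 1048576):

seq 1048576 1048576 67108864

Either way, we can tell GNU parallel to sweep over the resulting sequence.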

parallel -k $e/benchmark {2} 10000 random ::: 1 2 3 ::: $(seq 1000000 1000000 64000000)

Take the data and pass it through your program that takes the minimum time under each condition. Plot the relationship between size and average time in your favorite plotting program.
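
If gnuplot happens to be installed on your machine, you can get a quick look without leaving the terminal. This sketch assumes the postprocessed results are in the hypothetical mins.txt from the earlier awk example, with the array size in column 1 and the minimum time in column 4:

gnuplot -e "set terminal dumb; set logscale x; plot 'mins.txt' using 1:4 with linespoints notitle"

The plot may still be noisy because other students are using the same machine. Do you find that access time increases at some point? Ask the kernel about the CPUs.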

cat /proc/cpuinfo

Does the cache size: field correspond to the bump in your graph?
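
To pull out just that field, you can filter with grep:

grep 'cache size' /proc/cpuinfo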