Supplementary Material - Using HDFS
EXC 2019: Antonios Katsarakis, Chris Vasiladiotis, Dmitrii Ustiugov, Volker Seeker, Pramod Bhatotia
Based on previous material from Matt Pugh, Artemiy Margaritov, Michail Basios, Sasa Petrovic, Stratis Viglas & Kenneth Heafield
Getting Started
The purpose of working through this document is to make sure you understand how to use HDFS. Most of the commands that follow must be entered in a terminal application of your choice.
Non-DICE
First, if you are working through this material on a machine that isn’t DICE, you’ll need to ssh into a DICE machine from either a UNIX terminal or a Windows ssh application such as PuTTY. If you’re already using a DICE computer, you can skip this step.
ssh sXXXXXXX@student.ssh.inf.ed.ac.uk
where sXXXXXXX is your matriculation number.
You can now continue with the DICE steps.
DICE
We’re going to connect to the resourceManager node in the cluster running on machine scutter02. From within a DICE terminal, run the following:
ssh hadoop.exc
You can now test Hadoop by running the following command:
hdfs dfs -ls /
If you get a listing of directories on HDFS, you’ve successfully configured everything. If not, make sure you do ALL of the described steps EXACTLY as they appear in this document. Do not continue until you have completed this section. If the hdfs command isn’t available, try the following section.
Environment Variables
Hadoop requires some environment variables to run, which should be configured automatically. If you can run hdfs dfs -ls /, you should skip this section. If not, you can add these variables to your environment by running:
. /opt/hadoop/inithadoop
It’s annoying to have to run that every time you log in. We recommend that you configure it to run automatically by running:
echo '[ -f /opt/hadoop/inithadoop ] && . /opt/hadoop/inithadoop' >>.benv
You’ll now have the hadoop command available every time you log in. If you’re still having trouble, please contact the teaching assistants.
HDFS
In order to let you copy-paste commands, we’ll use $USER, which the shell will turn into your user name (i.e. sXXXXXXX).
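If you want to confirm what $USER expands to, run:
echo $USER
It should print your user name.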
Here are a number of small pointers you should work through to familiarise yourself with navigating around HDFS:
Make sure that your home directory exists:
hdfs dfs -ls /user/$USER
To create a directory called /user/$USER/data in Hadoop:
hdfs dfs -mkdir /user/$USER/data
Create the following directories in a similar way (these directories will NOT have been created for you, so you need to create them yourself; example commands are sketched just after this list):
/user/$USER/data/input
/user/$USER/data/output
/user/$USER/source
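For example, each of these can be created with its own mkdir command, mirroring the command above:
hdfs dfs -mkdir /user/$USER/data/input
hdfs dfs -mkdir /user/$USER/data/output
hdfs dfs -mkdir /user/$USER/source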
Confirm that you’ve done the right thing by typing:
hdfs dfs -ls /user/$USER
For example, if your matriculation number is s0123456, you should see something like:
Found 2 items
drwxr-xr-x   - s0123456 supergroup          0 2011-10-19 09:55 /user/s0123456/data
drwxr-xr-x   - s0123456 supergroup          0 2011-10-19 09:54 /user/s0123456/source
Copy the file title.basics.tsv to /user/$USER/data/output by typing:
hdfs dfs -cp /data/supplementary/title.basics.tsv /user/$USER/data/output
It might warn you about DFSInputStream. Just ignore that.
Obviously, title.basics.tsv doesn’t belong there. Move it from /user/$USER/data/output to /user/$USER/data/input where it belongs and delete the /user/$USER/data/output directory:
hdfs dfs -mv /user/$USER/data/output/title.basics.tsv /user/$USER/data/input/
hdfs dfs -rm -r /user/$USER/data/output/
Examine the contents of title.basics.tsv using cat and then tail:
hdfs dfs -cat /user/$USER/data/input/title.basics.tsv
hdfs dfs -tail /user/$USER/data/input/title.basics.tsv
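If the file turns out to be large, you can pipe the output of cat through the local head command to look at only the first few lines, for example:
hdfs dfs -cat /user/$USER/data/input/title.basics.tsv | head -n 5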
Create an empty file named example1 in /user/$USER/data/input. Use test to check if it exists and that it is indeed zero length:
hdfs dfs -touchz /user/$USER/data/input/example1
hdfs dfs -test -z /user/$USER/data/input/example1
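Note that test prints nothing; it reports its result through its exit status, which you can inspect with echo $? (see the command list below). For example:
hdfs dfs -test -z /user/$USER/data/input/example1
echo $?
A printed 0 means the file is indeed zero length.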
Remove the file example1:
hdfs dfs -rm /user/$USER/data/input/example1
List of HDFS Commands
What follows is a list of useful HDFS shell commands.
cat – copy files to stdout, similar to the UNIX cat command:
hdfs dfs -cat /user/$USER/data/input/title.basics.tsv
copyFromLocal – copy a single source, or multiple sources, from the local file system to the destination filesystem. The source must be a local file reference:
hdfs dfs -copyFromLocal <localfile> /user/$USER/file1
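For example, assuming you have a local file called notes.txt in your current working directory (a hypothetical name used purely for illustration), you could copy it into your input directory with:
hdfs dfs -copyFromLocal notes.txt /user/$USER/data/input/notes.txt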
copyToLocal – copy files to the local file system. The destination must be a local file reference:
hdfs dfs -copyToLocal /user/$USER/file1 <localfile>
Options:
-ignoreCrc – files that fail the CRC check will be copied.
-crc – files and CRCs will be copied.
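For example, assuming you completed the steps above, you could pull title.basics.tsv back from HDFS into your home directory with:
hdfs dfs -copyToLocal /user/$USER/data/input/title.basics.tsv ~/title.basics.tsv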
cp – copy files from source to destination. This command also allows multiple sources, in which case the destination must be a directory. Similar to the UNIX cp command:
hdfs dfs -cp /user/$USER/file1 /user/$USER/file2
getmerge – take a source directory and a destination file as input and concatenate the files in the source directory into the destination local file. Optionally, addnl can be set to add a newline character at the end of each file:
hdfs dfs -getmerge /data/supplementary ~/result_file
ls – for a file, returns stat on the file with the format:
filename num_replicas size modification_date modification_time permissions userid groupid
For a directory, it returns a list of its direct children as in UNIX, with the format:
dirname <dir> modification_date modification_time permissions userid groupid
hdfs dfs -ls /user/$USER
You can also pass -R for a recursive listing.
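For example, to list everything under your home directory recursively:
hdfs dfs -ls -R /user/$USER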
mkdir – create a directory:
hdfs dfs -mkdir /user/$USER/deleteme
You can pass -p to make all directories along a path:
hdfs dfs -mkdir -p /user/$USER/deleteme/and/this
mv – move files from source to destination, similar to the UNIX mv command. This command also allows multiple sources, in which case the destination needs to be a directory. Moving files across filesystems is not permitted:
hdfs dfs -mv /user/$USER/file1 /user/$USER/file2
rm – delete files, similar to the UNIX rm command. Only deletes empty directories and files:
hdfs dfs -rm /user/$USER/file1
Also supports -r to recursively delete files, like rm -r on UNIX.
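For example, to delete the deleteme directory created above along with everything inside it:
hdfs dfs -rm -r /user/$USER/deleteme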
tail – display the last kilobyte of the file to stdout. Similar to the UNIX tail command:
hdfs dfs -tail /user/$USER/file1
Options:
-f – output appended data as the file grows (follow).
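For example, assuming /user/$USER/file1 exists (press Ctrl-C to stop following):
hdfs dfs -tail -f /user/$USER/file1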
test – perform various tests:
hdfs dfs -test -e /user/$USER/file1
Options:
-e – check whether the file exists. Returns 0 if true.
-z – check whether the file is zero length. Returns 0 if true.
-d – check whether the path is a directory. Returns 0 if true.
-test returns the value of its test (0 or 1) as its exit status, which the shell stores in $?. To view its value, enter the following into your terminal:
echo $?
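For example, to check that the data directory you created earlier really is a directory:
hdfs dfs -test -d /user/$USER/data
echo $?
You should see 0 printed.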
touchz – create a file of zero length. Similar to the UNIX touch command:
hdfs dfs -touchz /user/$USER/file1
Conclusion
In this lab, we studied how to use HDFS. In the next lab – Designing a Solution using Hadoop Streaming – you will learn how to design a MapReduce program for the Hadoop Streaming infrastructure.