Supplementary Material - Using HDFS
EXC 2019: Antonios Katsarakis, Chris Vasiladiotis, Dmitrii Ustiugov, Volker Seeker, Pramod Bhatotia
Based on previous material from Matt Pugh, Artemiy Margaritov, Michail Basios, Sasa Petrovic, Stratis Viglas & Kenneth Heafield
Getting Started
The purpose of working through this document is to make sure you understand how to use HDFS. Most of the commands that follow must be entered in a terminal application of your choice.
Non-DICE
First, if you are working through this material on a machine that isn’t DICE, you’ll need to ssh into a DICE machine from either a UNIX terminal or a Windows ssh application such as PuTTY. If you’re already using a DICE computer, you can skip this step.
ssh sXXXXXXX@student.ssh.inf.ed.ac.uk
where sXXXXXXX is your matriculation number.
You can now continue with the DICE steps.
DICE
We’re going to connect to the resourceManager node in the cluster running on machine scutter02. From within a DICE terminal, run the following:
ssh hadoop.exc
You can now test Hadoop by running the following command:
hdfs dfs -ls /
If you get a listing of directories on HDFS, you’ve successfully configured everything. If not, make sure you do ALL of the described steps EXACTLY as they appear in this document. Do not continue until you have completed this section. If the hdfs command isn’t available, try the following section.
Environment Variables
Hadoop requires some environment variables to run, which should be configured automatically. If you can run hdfs dfs -ls /, you should skip this section. If not, you can add these variables to your environment by running:
. /opt/hadoop/inithadoop
It’s annoying to have to run that every time you log in. We recommend that you configure it to run automatically by running:
echo '[ -f /opt/hadoop/inithadoop ] && . /opt/hadoop/inithadoop' >>.benv
You’ll now have the hadoop command available every time you log in. If you’re still having trouble, please contact the teaching assistants.
HDFS
In order to let you copy-paste commands, we’ll use $USER, which the shell will turn into your user name (i.e. sXXXXXXX).
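If you want to confirm what $USER expands to, run:
echo $USER
It should print your user name.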
Here are a number of small pointers you should work through to familiarise yourself with navigating around HDFS:
Make sure that your home directory exists:
hdfs dfs -ls /user/$USER
To create a directory called /user/$USER/data in Hadoop:
hdfs dfs -mkdir /user/$USER/data
Create the following directories in a similar way (these directories will NOT have been created for you, so you need to create them yourself; example commands are sketched just after this list):
/user/$USER/data/input
/user/$USER/data/output
/user/$USER/source
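For example, each of these can be created with its own mkdir command, mirroring the command above:
hdfs dfs -mkdir /user/$USER/data/input
hdfs dfs -mkdir /user/$USER/data/output
hdfs dfs -mkdir /user/$USER/source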
Confirm that you’ve done the right thing by typing:
hdfs dfs -ls /user/$USER
For example, if your matriculation number is s0123456, you should see something like:
Found 2 items
drwxr-xr-x   - s0123456 supergroup          0 2011-10-19 09:55 /user/s0123456/data
drwxr-xr-x   - s0123456 supergroup          0 2011-10-19 09:54 /user/s0123456/source
Copy the file title.basics.tsv to /user/$USER/data/output by typing:
hdfs dfs -cp /data/supplementary/title.basics.tsv /user/$USER/data/output
It might warn you about DFSInputStream. Just ignore that.
Obviously, title.basics.tsv doesn’t belong there. Move it from /user/$USER/data/output to /user/$USER/data/input where it belongs and delete the /user/$USER/data/output directory:
hdfs dfs -mv /user/$USER/data/output/title.basics.tsv /user/$USER/data/input/
hdfs dfs -rm -r /user/$USER/data/output/
Examine the contents of title.basics.tsv using cat and then tail:
hdfs dfs -cat /user/$USER/data/input/title.basics.tsv
hdfs dfs -tail /user/$USER/data/input/title.basics.tsv
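If the file turns out to be large, you can pipe the output of cat through the local head command to look at only the first few lines, for example:
hdfs dfs -cat /user/$USER/data/input/title.basics.tsv | head -n 5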
Create an empty file named example1 in /user/$USER/data/input. Use test to check if it exists and that it is indeed zero length:
hdfs dfs -touchz /user/$USER/data/input/example1
hdfs dfs -test -z /user/$USER/data/input/example1
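Note that test prints nothing; it reports its result through its exit status, which you can inspect with echo $? (see the command list below). For example:
hdfs dfs -test -z /user/$USER/data/input/example1
echo $?
A printed 0 means the file is indeed zero length.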
Remove the file example1:
hdfs dfs -rm /user/$USER/data/input/example1
List of HDFS Commands
What follows is a list of useful HDFS shell commands.
cat – copy files to stdout, similar to the UNIX cat command:
hdfs dfs -cat /user/$USER/data/input/title.basics.tsv
copyFromLocal – copy a single source, or multiple sources, from the local file system to the destination filesystem. The source must be a local file reference:
hdfs dfs -copyFromLocal <localfile> /user/$USER/file1
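For example, assuming you have a local file called notes.txt in your current working directory (a hypothetical name used purely for illustration), you could copy it into your input directory with:
hdfs dfs -copyFromLocal notes.txt /user/$USER/data/input/notes.txt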
copyToLocal – copy files to the local file system. The destination must be a local file reference:
hdfs dfs -copyToLocal /user/$USER/file1 <localfile>
Options:
-ignoreCrc – files that fail the CRC check will be copied.
-crc – files and CRCs will be copied.
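For example, assuming you completed the steps above, you could pull title.basics.tsv back from HDFS into your home directory with:
hdfs dfs -copyToLocal /user/$USER/data/input/title.basics.tsv ~/title.basics.tsv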
cp – copy files from source to destination. This command also allows multiple sources, in which case the destination must be a directory. Similar to the UNIX cp command:
hdfs dfs -cp /user/$USER/file1 /user/$USER/file2
getmerge – take a source directory and a destination file as input and concatenate the files in the source directory into the destination local file. Optionally, addnl can be set to add a newline character at the end of each file:
hdfs dfs -getmerge /data/supplementary ~/result_file
ls – for a file, returns stat on the file with the format:
filename num_replicas size modification_date modification_time permissions userid groupid
For a directory, it returns a list of its direct children as in UNIX, with the format:
dirname <dir> modification_date modification_time permissions userid groupid
hdfs dfs -ls /user/$USER
You can also pass -R for a recursive listing.
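For example, to list everything under your home directory recursively:
hdfs dfs -ls -R /user/$USER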
mkdir – create a directory:
hdfs dfs -mkdir /user/$USER/deleteme
You can pass -p to make all directories along a path:
hdfs dfs -mkdir -p /user/$USER/deleteme/and/this
mv – move files from source to destination, similar to the UNIX mv command. This command also allows multiple sources, in which case the destination needs to be a directory. Moving files across filesystems is not permitted:
hdfs dfs -mv /user/$USER/file1 /user/$USER/file2
rm – delete files, similar to the UNIX rm command. Only deletes empty directories and files:
hdfs dfs -rm /user/$USER/file1
Also supports -r to recursively delete files, like rm -r on UNIX.
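For example, to delete the deleteme directory created above along with everything inside it:
hdfs dfs -rm -r /user/$USER/deleteme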
tail – display the last kilobyte of the file to stdout. Similar to the UNIX tail command:
hdfs dfs -tail /user/$USER/file1
Options:
-f – output appended data as the file grows (follow).
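For example, assuming /user/$USER/file1 exists (press Ctrl-C to stop following):
hdfs dfs -tail -f /user/$USER/file1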
test – perform various tests:
hdfs dfs -test -e /user/$USER/file1
Options:
-e – check whether the file exists. Returns 0 if true.
-z – check whether the file is zero length. Returns 0 if true.
-d – check whether the path is a directory. Returns 0 if true.
-test returns the value of its test (0 or 1) as its exit status, which the shell stores in $?. To view its value, enter the following into your terminal:
echo $?
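For example, to check that the data directory you created earlier really is a directory:
hdfs dfs -test -d /user/$USER/data
echo $?
You should see 0 printed.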
touchz – create a file of zero length. Similar to the UNIX touch command:
hdfs dfs -touchz /user/$USER/file1
Conclusion
In this lab, we studied how to use HDFS. In the next lab – Designing a Solution using Hadoop Streaming – you will learn how to design a MapReduce program for the Hadoop Streaming infrastructure.