Supplementary Material - Using HDFS

EXC 2019: Antonios Katsarakis, Chris Vasiladiotis, Dmitrii Ustiugov, Volker Seeker, Pramod Bhatotia

Based on previous material from Matt Pugh, Artemiy Margaritov, Michail Basios, Sasa Petrovic, Stratis Viglas & Kenneth Heafield

Getting Started

The purpose of working through this document is to make sure you understand how to use HDFS. The majority of the commands following must be entered in a terminal application of your choice.

Non-DICE

First, if you are working through this material on a machine that isn’t DICE, you’ll need to ssh into a DICE machine from either a UNIX terminal or a Windows ssh client such as PuTTY. In the command below, replace sXXXXXXX with your matriculation number. If you’re already using a DICE computer, you can skip this step.

ssh sXXXXXXX@student.ssh.inf.ed.ac.uk

You can now continue with the DICE steps.
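If you connect often, you can save some typing with an entry in your ssh configuration. This is an optional sketch, not part of the handout: the host alias dice is an arbitrary name of your choosing, and you should substitute your own matriculation number for sXXXXXXX.

```
# Hypothetical ~/.ssh/config entry; "dice" is an arbitrary alias.
Host dice
    HostName student.ssh.inf.ed.ac.uk
    User sXXXXXXX
```

With this in place, the login command shortens to ssh dice.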

DICE

We’re going to connect to the resourceManager node in the cluster running on machine scutter02. From within a DICE terminal, run the following:

ssh hadoop.exc

You can now test Hadoop by running the following command:

hdfs dfs -ls /

If you get a listing of directories on HDFS, everything is configured correctly. If not, make sure you have done ALL of the described steps EXACTLY as they appear in this document; do not continue until this section works. If the hdfs command isn’t available, try the following section.

Environment Variables

Hadoop requires some environment variables, which should have been configured automatically. If you can run hdfs dfs -ls /, you should skip this section. If not, you can add these to your environment by running:

. /opt/hadoop/inithadoop

It’s annoying to have to run that every time you log in. We recommend that you configure it to run automatically by running:

echo '[ -f /opt/hadoop/inithadoop ] && . /opt/hadoop/inithadoop' >> ~/.benv

You’ll now have the hdfs command available every time you log in. If you’re still having trouble, please contact the teaching assistants.
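If you are unsure whether the setup worked, a quick way to check is to ask the shell whether (and where) it finds the hdfs binary. This is a generic POSIX sketch, not something DICE-specific:

```shell
# Report whether the hdfs client is on the current PATH.
if command -v hdfs >/dev/null 2>&1; then
    echo "hdfs found at: $(command -v hdfs)"
else
    echo "hdfs not on PATH - source /opt/hadoop/inithadoop and try again"
fi
```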

HDFS

In order to let you copy-paste commands, we’ll use $USER, which the shell will expand into your username (i.e. your matriculation number, sXXXXXXX).
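The shell performs this substitution before the command runs, so you can preview any of the paths below with echo:

```shell
# The shell expands $USER before hdfs ever sees the command line.
echo "My HDFS home directory is /user/$USER"
```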

Here are a number of short exercises you should work through to familiarise yourself with navigating around HDFS.

  1. Make sure that your home directory exists:

    hdfs dfs -ls /user/$USER

    To create a directory called /user/$USER/data in Hadoop:

    hdfs dfs -mkdir /user/$USER/data

    Create the following directories in a similar way (these directories will NOT have been created for you, so you need to create them yourself):

    • /user/$USER/data/input
    • /user/$USER/data/output
    • /user/$USER/source

    Confirm that you’ve done the right thing by typing

    hdfs dfs -ls /user/$USER
    

    For example, if your matriculation number is s0123456, you should see something like:

    Found 2 items
    drwxr-xr-x   - s0123456 supergroup          0 2011-10-19 09:55 /user/s0123456/data
    drwxr-xr-x   - s0123456 supergroup          0 2011-10-19 09:54 /user/s0123456/source
    
  2. Copy the file title.basics.tsv to /user/$USER/data/output by typing:

    hdfs dfs -cp /data/supplementary/title.basics.tsv /user/$USER/data/output  
    

    It might warn you about DFSInputStream. Just ignore that.

  3. Obviously, title.basics.tsv doesn’t belong there. Move it from /user/$USER/data/output to /user/$USER/data/input where it belongs and delete the /user/$USER/data/output directory:

      hdfs dfs -mv /user/$USER/data/output/title.basics.tsv /user/$USER/data/input/
      hdfs dfs -rm -r /user/$USER/data/output/
    
  4. Examine the contents of title.basics.tsv using cat and then tail. Note that cat streams the entire file to your terminal, while tail displays only its last kilobyte:

      hdfs dfs -cat /user/$USER/data/input/title.basics.tsv
      hdfs dfs -tail /user/$USER/data/input/title.basics.tsv
    
  5. Create an empty file named example1 in /user/$USER/data/input. Use test to check that it exists and is indeed zero length. Note that test prints nothing; it reports its result through the exit status, which you can inspect with echo $?:

      hdfs dfs -touchz /user/$USER/data/input/example1
      hdfs dfs -test -z /user/$USER/data/input/example1
      echo $?
    
  6. Remove the file example1:

      hdfs dfs -rm /user/$USER/data/input/example1
    

List of HDFS Commands

What follows is a list of useful HDFS shell commands. You can also print the full list yourself at any time with hdfs dfs -help, or get a one-line summary of a single command with hdfs dfs -usage followed by the command name.

Conclusion

In this lab, we studied how to use HDFS. In the next lab, Designing a Solution using Hadoop Streaming, you will learn how to design a MapReduce program for the Hadoop Streaming infrastructure.