Introduction to Hadoop (Revised)

CS 300, Parallel and Distributed Computing (PDC)
Due Wednesday, January 21, 2015

NEW:

______

Introduction to Hadoop

Preliminary material

Laboratory exercises

Note: The local reference for running Hadoop jobs is the Beowiki page Hadoop Reference, January 2011.

  1. Create directories
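    The specific names are up to you; as one hedged example that matches the conventions used later in this lab, you might create a lab10 working directory in both places:

    % mkdir ~/PCS/lab10        (on a CS Link machine, for your local copies)
    % mkdir ~/lab10            (on your cluster admin node, for files you will feed to Hadoop)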

  2. Create a short file try.txt on the Hadoop DFS by creating a file on your admin node's file system and then uploading it to the DFS. Also, create a directory bin in your DFS directory for holding compiled Hadoop code (example commands below).

    Notes:
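    For concreteness, one possible command sequence on the admin node (hedged; it assumes the hadoop command lives at /opt/hadoop/bin/hadoop, as in the ssh example later in this lab) is:

    % emacs try.txt                                     (create a short text file locally)
    % /opt/hadoop/bin/hadoop fs -put try.txt try.txt    (upload it to your DFS home directory)
    % /opt/hadoop/bin/hadoop fs -mkdir bin              (DFS directory for compiled Hadoop code)
    % /opt/hadoop/bin/hadoop fs -ls                     (verify that both now appear in the DFS)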

  3. Hadoop example in C++ using hadoop pipes.

    Note: As of Spring 2013, the C++ interface Hadoop Pipes is available only on Helios. The Java Hadoop interface is available on both Helios and Mist.

    Our example C++ Hadoop program consists of two files wordcount.cpp and wordcount.xml. We will first discuss each of these files, then describe how to compile and run the Hadoop pipes job.

    wordcount.cpp

    wordcount.xml

    DO THIS: Enter and run the C++ program wordcount.cpp using Hadoop pipes.

    Notes: (Reference: Beowiki page Hadoop Reference, January 2011)
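    The exact compile and run commands for our clusters are on that Beowiki page. As a rough sketch only, once wordcount.cpp has been compiled into an executable named wordcount, a Pipes run (the output directory name here is just an example) looks like:

    % /opt/hadoop/bin/hadoop fs -put wordcount bin/wordcount
    % /opt/hadoop/bin/hadoop pipes -conf wordcount.xml -input try.txt -output wcout -program bin/wordcount
    % /opt/hadoop/bin/hadoop fs -cat wcout/part-00000

    The first command copies the executable into the DFS bin directory created earlier; the pipes subcommand then runs it over the DFS input and writes its results under the DFS directory wcout.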

  4. Hadoop example in Java using hadoop jar.

    Here we provide a list of steps for running a Hadoop job using the standard Java interface; you can follow these steps even if you don't know Java. Only a single source file is required for this Java exercise. As before, we will start with a discussion of the source file.

    WordCount.java
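    The WordCount.java listing itself is not reproduced in this text. As a rough, hedged sketch only (not the course's listing), a single-file word count written against the old org.apache.hadoop.mapred API might look like the code below. Since the expected output later in this lab shows the count in the first column, this sketch has the reducer emit the count as its output key and the word as its output value.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {

      // Mapper: for each word on a line of input, emit the pair (word, 1).
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          StringTokenizer tokens = new StringTokenizer(value.toString());
          while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, one);
          }
        }
      }

      // Reducer: sum the 1's for each word, and emit (count, word) so that
      // the count appears in the first column of the output.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, IntWritable, Text> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<IntWritable, Text> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(new IntWritable(sum), key);
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        // The map output types differ from the job's final output types here.
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);
        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // DFS input directory
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // DFS output directory
        JobClient.runJob(conf);
      }
    }

    Compiling against the Hadoop jars, packaging the class files with jar, and launching with hadoop jar are the steps that the Makefile described under Multiple passes is intended to manage for you.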

    Notes:

  5. Follow the instructions of the Beowiki page Passwordless_SSH_setup. This will enable you to ssh into a cluster admin node from a CS lab machine without having to provide a password.

    Note. Being able to log in without a password is a great convenience, but it is obviously less secure: if someone can gain access to your St. Olaf account (e.g., if you leave yourself logged in somewhere by mistake), the intruder can automatically access your Beowulf cluster accounts. This is a security concern for both your account and the cluster.

    However, we will be using ssh and related utilities such as scp so often in our work that supplying all of those passwords would be unbearably frustrating without passwordless ssh. The mechanism for passwordless ssh itself is rather secure. Just be sure to protect your account...
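    If you would like a preview of what that setup involves, the usual mechanism (hedged; the Beowiki page is authoritative if it differs) is to generate an ssh key pair on a CS Link machine and install the public key on each admin node:

    % ssh-keygen -t rsa         (accept the defaults; an empty passphrase is what makes logins password-free)
    % ssh-copy-id mist.public.stolaf.edu

    Repeat the ssh-copy-id step for the other cluster's admin node if you use both.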

  6. Accessing the DFS from a CS lab machine. Create a file on a CS Link machine, and use scp and a script hadoop.mist or hadoop.helios to upload it to a cluster's DFS.

    1. First, create the file myfile.txt in your Link-machine lab10 directory ~/PCS/lab10, using emacs or another editor.

      % emacs myfile.txt
      ...
      
      myfile.txt may contain any text of your choice, to serve as data for wordcount.

      Note: Be sure to create myfile.txt in your Link-machine shell, not your admin node shell.

    2. Copy myfile.txt to your admin node's lab10 directory, using scp. For example,

      % scp myfile.txt mist.public.stolaf.edu:lab10
      (No password should be required!)

    3. You can use a convenience script hadoop.mist or hadoop.helios to perform a remote hadoop invocation from a CS Link machine. For example,

      % hadoop.mist fs -ls
      
      is equivalent to the ssh command
      % ssh mist.public.stolaf.edu /opt/hadoop/bin/hadoop fs -ls
      
      which connects to mist and executes hadoop with the arguments fs -ls.

      Use one of these convenience scripts hadoop.name to upload myfile.txt to the DFS on admin node name. For example,

      % hadoop.mist fs -put lab10/myfile.txt mice
      
      Note: This operation puts the admin node's copy of myfile.txt into the DFS's directory mice.
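      You can confirm that the upload worked with another remote listing:

      % hadoop.mist fs -ls mice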

  7. Using Hadoop pipes from a CS Link machine. Use a convenience script hadoop.name to perform a remote hadoop pipes job on a cluster.
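    For example, assuming the compiled wordcount executable from step 3 is already in the DFS bin directory and wordcount.xml is in your admin-node lab10 directory (hedged; adjust the names, the input directory, and the choice of hadoop.mist vs. hadoop.helios to match your own setup and the cluster that supports Pipes):

    % hadoop.helios pipes -conf lab10/wordcount.xml -input mice -output wcout2 -program bin/wordcount
    % hadoop.helios fs -cat wcout2/part-00000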

  8. Using Hadoop from within Eclipse (Java, optional)

    Note: Please do not run Eclipse on admin nodes, since it consumes so many resources. Instead, enter the eclipse command in a Link-machine shell.

Multiple passes

  1. Make a copy CountWordsA of your CountWords program (either countwords.cpp or CountWords.java) in your lab10 directory, and test it to ensure it is working properly. The expected output from CountWordsA using the "Three Blind Mice" data should begin something like this:

    2       a
    1       after
    1       all
    1       as
    3       blind
    1       carving
    1       cut
    1       did
    1       ever
    1       farmer's
    ...
    

  2. Makefile for Hadoop jobs. Test a locally developed Makefile for managing Hadoop jobs.
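    The Makefile's details are not reproduced here; assuming it follows the conventions shown in the steps below (a hadoop target with a DATA variable naming the DFS input directory), a first test might be:

    $ make hadoop DATA=mice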

  3. Modify your CountWordsA program to convert all words to lower case; for example, the input words "The" and "the" are both counted as "the."
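    For the Java version, one hedged way to make this change is inside the mapper's tokenizing loop (the variable names here assume a StringTokenizer-style mapper like the WordCount sketch earlier); in C++, the analogous change is to apply tolower to each character of the word before emitting it.

    // Hypothetical fragment of the Java mapper's inner loop: lower-case each
    // token before emitting it, so that "The" and "the" count as the same word.
    word.set(tokens.nextToken().toLowerCase());
    output.collect(word, one);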

  4. Write a Hadoop program CountWordsB in either C++ or Java that operates on the output of CountWordsA to produce a summary list of word frequencies. For the "Three Blind Mice" data, your program should produce a 3-line summary:

    1       off ran run! run, she such tails the their thing wife, with you your mice? after all as carving cut did ever farmer's in knife. life mice! mice, 
    2       a how 
    3       blind three they see 
    

    1. Instead of your default input directory on the DFS (e.g., mice), we want to use the output from the last run of CountWordsA as input for this program.

      As explained above, you can enter the job name for the last run of CountWordsA as the value of DATA:

      $ make hadoop DATA=countwordsA.NNNN        # C++
      $ make hadoop DATA=CountWordsA.NNNN        # Java
      
      Our Makefile can ordinarily deduce the job number for you (assuming you last ran CountWordsA in the current directory):
      $ make hadoop DATA=countwordsA                      # C++
      $ make hadoop DATA=CountWordsA                      # Java
      

    2. The default InputFormat class is TextInputFormat, which provides a numerical offset as the key and a line of input as the value for each call to the mapper. But we want to read a count number as the key and a word as the value. You can accomplish this by using KeyValueTextInputFormat as the InputFormat class.

      • For C++: add a -inputformat command argument to your hadoop command. Here's a way to do this with the Makefile:

        make hadoop DATA=countwordsA OPT='-inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat'
        

      • For Java: Add or modify a call to conf.setInputFormat() in your main():

        conf.setInputFormat(KeyValueTextInputFormat.class);
        
        Instead of WritableComparable or LongWritable as the input key type for the mapper, use Text as both the input key and input value types with this input format.

      • There is no work for the mapper to do in this program, so an identity mapper should be used. You can either program an identity mapper yourself, or use the predefined IdentityMapper class. (Note: Not clear how to use the predefined class in C++ yet...)

    3. In the reducer, separate the values (words) with blanks (instead of the ^ character used for the concordance problem), as in the hedged sketch below.
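    To tie these hints together, here is a hedged sketch (not the official solution) of what a Java CountWordsB might look like, using KeyValueTextInputFormat, the predefined IdentityMapper, and a reducer that joins the words for each count with blanks:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.IdentityMapper;

    public class CountWordsB {

      // Reducer: the key is a count (as Text) and the values are the words
      // that occurred that many times; join them into one blank-separated line.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {
          StringBuilder words = new StringBuilder();
          while (values.hasNext()) {
            words.append(values.next().toString());
            words.append(" ");                       // a trailing blank is harmless
          }
          output.collect(key, new Text(words.toString()));
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CountWordsB.class);
        conf.setJobName("countwordsB");

        conf.setInputFormat(KeyValueTextInputFormat.class);  // key = count, value = word
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        conf.setMapperClass(IdentityMapper.class);           // no map-side work to do
        conf.setReducerClass(Reduce.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // output of CountWordsA
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // new DFS output directory
        JobClient.runJob(conf);
      }
    }

    As before, input and output paths are taken from the command line in this sketch; runs driven by the Makefile's DATA variable may pass them differently.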

Deliverables

Submit your work using the following command on your cluster admin node:

$ cd ~
$ labsubmit.cs300 lab10 

This lab is due by Wednesday, January 21, 2015.