Introduction to Hadoop (Revised)
CS 300, Parallel and Distributed Computing (PDC)
Due Wednesday, January 21, 2015
Introduction to Hadoop
C++ and Java versions
Word count
Index
Map-reduce model
Hadoop implementation
HDFS, the Hadoop distributed file system
Mappers, Reducers, and types
Input formats; KeyValueTextInputFormat
Job configuration with JobConf; with command-line options
Identity mappers and identity reducers
Note: Local reference for running Hadoop jobs is the
Hadoop Reference, January 2011 page on the Beowiki.
Create directories
Create shell windows on a lab machine and on one of the cluster admin nodes.
On an admin node, create ~/lab10. Change directory to lab10.
Create a short file try.txt on the Hadoop DFS, by creating a file on
your admin node's file system then uploading it to the DFS. Also,
create a directory bin in your DFS directory for holding
compiled Hadoop code.
Notes:
Don't forget that you can run emacs on a Link
machine but edit files on an admin machine by finding the filename
/admin.public.stolaf.edu:filename,
e.g., /mist.public.stolaf.edu:try.txt
Your home directory on the DFS is /user/username, where username is your St. Olaf user name (e.g., rab).
The command on your admin node
admin$ hadoop fs -put local-file dfs-dir
will cause the file local-file to be copied to the file /user/username/dfs-dir/local-file. If you want the file to have a different filename fname on the DFS, enter
admin$ hadoop fs -put local-file dfs-dir/fname
To create the directory /user/username/bin, enter the command
admin$ hadoop fs -mkdir bin
Alternatively, you could have specified
admin$ hadoop fs -mkdir /user/username/bin
Other options for hadoop fs operations may be viewed by giving the command
$ hadoop fs -help
or by consulting the HDFS documentation (look under shell commands).
Hadoop example in C++ using hadoop pipes.
Note: as of Spring 2013, the C++ interface, Hadoop Pipes, is only available on Helios. The Java Hadoop interface is available on both Helios and Mist.
Our example C++ Hadoop program consists of two files
wordcount.cpp and
wordcount.xml. We will first discuss each of these
files, then describe how to compile and run the Hadoop
pipes job.
Observe
that Hadoop conventionally uses the extension .cpp
instead of .C for C++ implementation modules, and
.hh instead of .H or .h for C++
interface modules or header files.
The source code file wordcount.cpp
contains two
brief class definitions:
WordCountMap, a subclass of the Hadoop pipes Mapper class, containing the mapper method map().
WordCountReduce, a subclass of the Hadoop pipes Reducer class, containing the reducer method reduce().
The arguments for the methods map() and
reduce() are Hadoop contexts, which provide
I/O access to the source data, intermediate list of key-value pairs,
and result data used
in the map-reduce model. The Hadoop job tracker task coordinates
multiple copies of the mapper and reducer to work on different parts
of the data, shuffle the intermediate list, and collect the results
from the reducers to create the output files.
Some types used in wordcount.cpp
are parametrized
template classes. For example,
std::vector<std::string> creates a vector
(like a resizable array) of string objects.
Note that these classes are part of C++'s STL (Standard Template
Library). Ordinarily, a programmer must enter #include
commands for <string> and
<vector> in order to access those classes, but the Hadoop
pipes header files (e.g.,
"hadoop/StringUtils.hh") already #include
the headers necessary to use those classes.
The main() program simply calls the runTask() function of HadoopPipes, whose argument is a factory object created from a template parameterized by the mapper and reducer classes.
The mapper algorithm in wordcount.cpp splits an
input line (obtained from the call
context.getInputValue()) into "words", using a space as a
delimiter. This means that the strings "Hello!",
"Hello" and "hello," will count as three
distinct words for the purposes of this program.
For each word it finds, the mapper produces one pair in the intermediate key-value list using the call context.emit(...).
Each invocation of the reducer receives the key-value pairs for a single key. It steps through the values using context.nextValue(), adds them to a running sum, and emits the key together with that sum.
In pipes, the values in these contexts must be strings, so integer values are converted to and from strings in the reducer, and the count 1 in the mapper is expressed as the string "1".
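For orientation, here is a minimal sketch of the structure just described. It follows the standard Hadoop pipes word-count pattern; treat it only as a sketch, since the wordcount.cpp file provided for the lab (copied below via scp) is the authoritative version.
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

// Mapper: split each input line on spaces and emit (word, "1") pairs.
class WordCountMap : public HadoopPipes::Mapper {
public:
  WordCountMap(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    std::vector<std::string> words =
      HadoopUtils::splitString(context.getInputValue(), " ");
    for (unsigned int i = 0; i < words.size(); i++)
      context.emit(words[i], "1");
  }
};

// Reducer: sum the string-valued counts for each key and emit the total.
class WordCountReduce : public HadoopPipes::Reducer {
public:
  WordCountReduce(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    int sum = 0;
    while (context.nextValue())
      sum += HadoopUtils::toInt(context.getInputValue());
    context.emit(context.getInputKey(), HadoopUtils::toString(sum));
  }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<WordCountMap, WordCountReduce>());
}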
The wordcount.xml file specifies
configuration parameters for the word-count example.
This .xml file specifies three properties
or configuration parameters for
running the Hadoop job. In particular, it specifies that the property
hadoop.pipes.executable has the value
/user/rab/bin/wordcount, which informs hadoop
pipes of the file on the DFS that should be
executed.
DO THIS: Enter and run the C++ program wordcount.cpp using Hadoop pipes.
Notes: (Reference: Beowiki page Hadoop Reference, January 2011)
You can create the files wordcount.cpp and wordcount.xml in your lab10 directory on your admin machine either by copying and pasting with an editor or via scp:
admin$ scp shelob.cs.stolaf.edu:~cs300/hadoop/wordcount.{cpp,xml} .
(Don't forget the final dot '.' .)
To compile your C++ file wordcount.cpp, you can use the command
admin$ g++ -I/usr/lib/hadoop/c++/Linux-amd64-64/include -Wall -c wordcount.cpp
The -I flag indicates where to find #include
files, and -Wall prints all warnings that may arise
during compilation.
To link the resulting object file, you can use the command
admin$ libtool --mode=link --tag=CXX \
g++ -L/usr/lib/hadoop/c++/Linux-amd64-64/lib -lhadooppipes -lhadooputils -lpthread -Wall \
wordcount.o -o wordcount
The -L flag indicates where to find object files and library archive files. This list of locations is called the library path. (In this case, there is only one location in the list; others could be added with additional -L flags.)
The libtool wrapper handles the details of linking against the Hadoop shared libraries so that the resulting executable can locate them at run time.
The \ character at the end of a
line causes the following line to be appended to that line.
Now, you must upload the executable to the DFS.
admin$ hadoop fs -rm bin/wordcount
admin$ hadoop fs -put wordcount bin/wordcount
The hadoop fs -rm command deletes any old version of the file; of course, it is not needed the first time you upload the executable.
The same effect results from using bin instead of
bin/wordcount as the final argument,
since hadoop fs -put reuses the local filename if the
destination is a directory name.
Create a directory of data on the DFS for your Hadoop word-count job.
We will assume that this data is located in a file named mice.txt in your lab10 directory on your admin node, containing the following text:
Three blind mice, three blind mice! See how they run, see how they run! They all ran after the farmer's wife, She cut off their tails with a carving knife. Did you ever see such a thing in your life As three blind mice?
Upload this file to a directory (we'll call it
mice) in your area of the DFS.
admin$ hadoop fs -mkdir mice
admin$ hadoop fs -put mice.txt mice
Note: All files in this input directory will be processed by the Hadoop job, so it's best to have a dedicated directory for this purpose.
Finally, you're ready to run your Hadoop pipes
job.
admin$ hadoop pipes -conf wordcount.xml -input input-dir/ -output output-dir/
The name input-dir should be replaced by
your HDFS input directory name. This can be an absolute path (starting
with /) on the DFS or a relative path (not starting with
/), which is then relative to your DFS directory
/user/username.
In our example, input-dir should be
mice. The trailing / is optional.
output-dir should be a DFS directory name to receive the output from the job. This directory must not already exist; Hadoop creates it when the job runs.
After the job completes, look in the
output-dir for the results.
admin$ hadoop fs -ls output-dir
Files produced by the reducers have the names part-00000, part-00001, etc.; there is one such file per reduce task, and together they make up the job's output. You can examine an individual file without -getting it by using the -cat option for hadoop fs:
admin$ hadoop fs -cat output-dir/part-00000
This should print lines such as
a 2
after 1
all 1
...
Hadoop example in Java using hadoop
jar.
Here we provide a list of steps for running a Hadoop job using the standard Java interface; you can follow these steps even if you don't know Java. Only a single source file is required for this Java exercise. As before, we will start with a discussion of the source file.
Examining that source code, the package command in
Java provides a naming context for the classes defined in the file
WordCount.java. Next, several
import commands provide names of classes used later in
the file. Some of these import commands draw from
standard Java code libraries, and some are specific to Hadoop.
The remainder of the file defines a single class WordCount with a method main() and two nested classes Map and Reduce. As with C++, one can define one class inside of another class; a nested class is only visible within the scope of its containing class.
The method main() corresponds to the
main() function of a C++ program. (Incidentally, in Java,
there are no
standalone functions, but only methods of classes.) The
WordCount class's main() sets up
configuration parameters for a Hadoop job, then calls
JobClient.runJob() to launch that Hadoop job.
Only one file WordCount.java is required. This
contains a single Java class WordCount that contains a
method main() and two inner classes
Map and Reduce.
Instead of contexts, the Java implementation uses other classes
for accessing data, intermediate list, and output, such as the
templated OutputCollector class.
The Map.map() method and the
Reduce.reduce() method use Java methods to carry out
their work, rather than methods and functions involving C++ classes
such as std::vector<std::string>.
Follow the instructions of the Beowiki page Passwordless_SSH_setup.
This will enable you to ssh into a cluster admin node
from a CS lab machine without having to provide a password.
Note. Being able to log in without a password is a great convenience, but it is obviously less secure: if someone can gain access to your St. Olaf account (e.g., if you leave yourself logged in somewhere by mistake), the intruder can automatically access your Beowulf cluster accounts. This is a security concern for both your account and the cluster.
However, we will be using ssh and related utilities
such as scp so often in our work that supplying all of
those passwords would be unbearably frustrating without passwordless
ssh. The mechanism for passwordless ssh
itself is rather secure. Just be sure to protect your account...
Accessing the DFS from a CS lab machine. Create a
file on a CS Link machine, and use
scp and a
script hadoop.mist or hadoop.helios to
upload it to a cluster's DFS.
First, create a file myfile.txt in your Link-machine lab10 directory ~/PCS/lab10, using emacs or another editor.
% emacs myfile.txt ...
myfile.txt may have your choice of contents as data for wordcount.
Note: Be sure to create myfile.txt in your
Link-machine shell, not your admin node shell.
Copy myfile.txt to your admin node's
lab10 directory, using scp. For example,
% scp myfile.txt mist.public.stolaf.edu:lab10
(No password should be required!)
You can use a convenience script hadoop.mist or
hadoop.helios to perform a remote hadoop
invocation from a CS Link machine. For example,
% hadoop.mist fs -ls
is equivalent to the ssh command
% ssh mist.public.stolaf.edu /opt/hadoop/bin/hadoop fs -ls
which connects to mist and executes hadoop with the arguments fs -ls.
Use one of these convenience scripts
hadoop.name to
upload myfile.txt to the DFS on admin node
name. For example,
% hadoop.mist fs -put lab10/myfile.txt mice
Note: This operation puts the admin node's copy of myfile.txt into the DFS directory mice.
Using Hadoop pipes from a CS Link machine. Use a convenience script hadoop.name to perform a remote hadoop pipes job on a cluster.
Perform steps as before, except using
hadoop.name on a Link machine shell instead of
using hadoop on an admin node shell. Not all of the
prior steps are necessary, since the executable wordcount
already exists on the DFS, and the configuration file
wordcount.xml has already been set up. That leaves
something like the following (if your admin node is
mist):
% hadoop.mist pipes -conf lab10/wordcount.xml -input mice -output pipes2
% hadoop.mist fs -cat pipes2/part-00000
Note: The absolute directory path prefix /user/username/ is not necessary for the input directory mice or the output directory pipes2, because these DFS names are relative to your DFS home directory /user/username by default. Also, you should get different results from this run because of the second input file myfile.txt.
Using Hadoop from within Eclipse (Java, optional)
Note: Please do not run Eclipse on admin nodes, since it consumes so many resources. Instead, enter the eclipse command in a Link-machine shell.
Make a copy CountWordsA of your CountWords program (either countwords.cpp or CountWords.java) in your lab10 directory, and test to ensure it is working properly. The expected output from CountWordsA using the "Three Blind Mice" data should begin something like this:
2 a
1 after
1 all
1 as
3 blind
1 carving
1 cut
1 did
1 ever
1 farmer's
...
Makefile for Hadoop jobs. Test a locally developed
Makefile for managing Hadoop jobs.
Delete all files in your lab10 directory except
your source file countwordsA.cpp or
CountWordsA.java for purposes of this test.
Use scp to copy the files
~cs300/hadoop/Makefile and
~cs300/hadoop/proto.xml into your
lab10 directory.
It's easiest to do this from a CS Link machine, e.g.,
% scp ~cs300/hadoop/Makefile ~cs300/hadoop/proto.xml mist.public.stolaf.edu:lab10/
If you program in C++, this Makefile uses the file
proto.xml to automatically generate .xml
files for your job.
This Makefile uses a file targets to
determine the default targets.
If you are using C++, the C++
executable countwords is the default target.
If you are using
Java, the .jar file CountWords.jar is the
default target.
Create a file targets containing this target. (You
can add more targets later in the lab; if you use this Makefile for
other assignments in other directories, those directories can have
their own targets file.)
If you are using C++, issue the command
$ make countwordsA.o
This step of making a .o file is only needed for new .cpp files. We won't need it again for countwordsA, even if we make changes to that file.
Now, issue the command
$ make
This command should make your default target. (You will see a one-time warning that a file .lasttarget doesn't exist, which you can safely ignore.) Check to ensure that the default target was in fact created.
This Makefile keeps track of the target you most recently made. If you issue the command
$ make hadoop
the Makefile will recall that you most recently made the target for CountWordsA and will start a Hadoop job for that target.
If you program in C++, it will replace your prior executable on the DFS with the newly generated one.
It will create a new DFS directory for your output, named
countwordsA.NNNN for C++ and named
CountWordsA.NNNN for Java, where
NNNN is a unique identifying "job
number."
It will run the Hadoop job for you.
If you have more than one program in your directory, you can
run a Hadoop job with another program prog by
entering
$ make prog.hadoop
Note: The input data directory in the DFS is named
mice by default. You can change this in your copy of the
Makefile. Alternatively, you can override it at the command line:
$ make hadoop DATA=cheese
If your Hadoop job succeeds, you can print your results conveniently as follows:
$ make print
(or make prog.print).
Modify your CountWordsA program to lower the case
of all words; for example, the
input words "The" and "the" are both counted
as "the."
For C++: use a function such as the following toLower() to lower the case of each word.
// assumes <string> and <cctype> are available (the wordcount headers already pull in <string>)
string toLower(string str) {
  for (size_t i = 0; i < str.length(); i++)
    str[i] = tolower(str[i]);
  return str;
}
For Java: Look for a String method to
lower the case of a word.
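For C++, the toLower() function above might then be applied at the point where the mapper emits each word. This is a one-line sketch, assuming (as in the earlier word-count sketch) that the split words are stored in a vector named words:
context.emit(toLower(words[i]), "1");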
Write a Hadoop program CountWordsB in either C++
or Java that operates on the output of CountWordsA to
produce a summary list of word frequencies. For the "Three Blind
Mice" data, your program should produce a 3-line summary:
1 off ran run! run, she such tails the their thing wife, with you your mice? after all as carving cut did ever farmer's in knife. life mice! mice,
2 a how
3 blind three they see
Instead of your default input directory on the DFS (e.g.,
mice), we want to use the output from the last run of
CountWordsA as input for this program.
As explained above, you can enter the job name for the last run of
CountWordsA as the value of DATA
$ make hadoop DATA=countwordsA.NNNN     # C++
$ make hadoop DATA=CountWordsA.NNNN     # Java
Our Makefile can ordinarily deduce the job number for you (assuming you last ran CountWordsA in the current directory):
$ make hadoop DATA=countwordsA     # C++
$ make hadoop DATA=CountWordsA     # Java
The default InputFormat class is TextInputFormat, which provides a numerical offset as the key and a line of input as the value for the mapper. But we want to read a count number as the key and a word as the value. You can accomplish this by using KeyValueTextInputFormat for the InputFormat class.
For C++: add a -inputformat command
argument to your hadoop command. Here's a way to do this
with the Makefile:
make hadoop DATA=countwordsA OPT='-inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat'
For Java: Add or modify a call to
conf.setInputFormat() in your main():
conf.setInputFormat(KeyValueTextInputFormat.class);
Instead of WritableComparable or LongWritable as the input key type for the mapper, use Text as both the input key and input value types with this input format.
There is no work for the mapper to do in this program, so an identity mapper should be used. You can either program an identity mapper yourself, or use the predefined IdentityMapper class. (Note: it is not yet clear how to use the predefined class in C++; a hand-written version is sketched below.)
In the reducer, separate the values (words) with blanks (instead of the ^ character used for the concordance problem).
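If you program the identity mapper yourself in C++, it can be a very small pipes Mapper subclass along these lines. This is a sketch; the class name IdentityMap is illustrative, and it assumes the Hadoop pipes API used in wordcount.cpp.
#include "hadoop/Pipes.hh"

// Identity mapper: pass each (key, value) input pair through unchanged.
class IdentityMap : public HadoopPipes::Mapper {
public:
  IdentityMap(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    context.emit(context.getInputKey(), context.getInputValue());
  }
};

You would then register it in main() together with your own reducer, for example via HadoopPipes::TemplateFactory<IdentityMap, CountWordsBReduce>, where CountWordsBReduce stands for whatever you name your CountWordsB reducer class.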
Submit your work using the following command on your cluster admin node:
$ cd ~
$ labsubmit.cs300 lab10
This lab is due by Wednesday, January 21, 2015.