Introduction to Hadoop (Revised)
CS 300, Parallel and Distributed Computing (PDC)
Due Wednesday, January 21, 2015
Introduction to Hadoop
C++ and Java versions
Word count
Index
Map-reduce model
Hadoop implementation
HDFS, the Hadoop distributed file system
Mappers, Reducers, and types
Input formats; KeyValueTextInputFormat
Job configuration with JobConf; with command-line options
Identity mappers and identity reducers
Note: Local reference for running Hadoop jobs is the
Hadoop Reference, January 2011 page on the Beowiki.
Create directories
Create shell windows on a lab machine and on one of the cluster admin nodes.
On an admin node, create ~/lab10. Change directory to lab10.
Create a short file try.txt on the Hadoop DFS, by creating a file on
your admin node's file system then uploading it to the DFS. Also,
create a directory bin in your DFS directory for holding
compiled Hadoop code.
Notes:
Don't forget that you can run emacs on a Link
machine but edit files on an admin machine by finding the filename
/admin.public.stolaf.edu:filename,
e.g., /mist.public.stolaf.edu:try.txt
Your home directory on the DFS is /user/username, where username is your St. Olaf user name (e.g., rab).
The command on your admin node
admin$ hadoop fs -put local-file dfs-dir
will cause the file local-file to be copied to the file /user/username/dfs-dir/local-file. If you want the file to have a different filename fname on the DFS, enter
admin$ hadoop fs -put local-file dfs-dir/fname
To create the directory /user/username/bin, enter the command
admin$ hadoop fs -mkdir bin
Alternatively, you could have specified
admin$ hadoop fs -mkdir /user/username/bin
Other options for hadoop fs operations may be viewed by giving the command
$ hadoop fs -help
or by consulting the HDFS documentation (look under shell commands).
Hadoop example in C++ using hadoop pipes.
Note: as of Spring 2013, the C++ interface, Hadoop Pipes, is only available on Helios. The Java Hadoop interface is available on both Helios and Mist.
Our example C++ Hadoop program consists of two files
wordcount.cpp and
wordcount.xml. We will first discuss each of these
files, then describe how to compile and run the Hadoop
pipes job.
Observe
that Hadoop conventionally uses the extension .cpp
instead of .C for C++ implementation modules, and
.hh instead of .H or .h for C++
interface modules or header files.
The source code file wordcount.cpp
contains two
brief class definitions:
WordCountMap, a subclass of the Hadoop pipes Mapper class, containing the mapper method map().
WordCountReduce, a subclass of the Hadoop pipes Reducer class, containing the reducer method reduce().
The arguments for the methods map() and
reduce() are Hadoop contexts, which provide
I/O access to the source data, intermediate list of key-value pairs,
and result data used
in the map-reduce model. The Hadoop job tracker task coordinates
multiple copies of the mapper and reducer to work on different parts
of the data, shuffle the intermediate list, and collect the results
from the reducers to create the output files.
Some types used in wordcount.cpp
are parametrized
template classes. For example,
std::vector<std::string> creates a vector
(like a resizable array) of string objects.
Note that these classes are part of C++'s STL (Standard Template
Library). Ordinarily, a programmer must enter #include
commands for <string> and
<vector> in order to access those classes, but the Hadoop
pipes header files (e.g.,
"hadoop/StringUtils.hh") already #include
the headers necessary to use those classes.
The main() program simply calls the runTask() function of HadoopPipes, whose argument is a factory object created from a template parameterized by the mapper and reducer classes.
The mapper algorithm in wordcount.cpp splits an
input line (obtained from the call
context.getInputValue()) into "words", using a space as a
delimiter. This means that the strings "Hello!",
"Hello" and "hello," will count as three
distinct words for the purposes of this program.
For each word it finds, the mapper produces one pair in the intermediate key-value list using the call context.emit(...).
Each invocation of the reducer receives the key-value pairs for a single key. It steps through the values using context.nextValue(), adds them to a running sum, and emits the key together with that sum.
In pipes, the values in these contexts must be strings, so integer values are converted to and from strings in the reducer, and the count 1 in the mapper is expressed as the string "1".
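For orientation, here is a minimal sketch of the structure just described. It follows the standard Hadoop pipes word-count pattern; treat it only as a sketch, since the wordcount.cpp file provided for the lab (copied below via scp) is the authoritative version.
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

// Mapper: split each input line on spaces and emit (word, "1") pairs.
class WordCountMap : public HadoopPipes::Mapper {
public:
  WordCountMap(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    std::vector<std::string> words =
      HadoopUtils::splitString(context.getInputValue(), " ");
    for (unsigned int i = 0; i < words.size(); i++)
      context.emit(words[i], "1");
  }
};

// Reducer: sum the string-valued counts for each key and emit the total.
class WordCountReduce : public HadoopPipes::Reducer {
public:
  WordCountReduce(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    int sum = 0;
    while (context.nextValue())
      sum += HadoopUtils::toInt(context.getInputValue());
    context.emit(context.getInputKey(), HadoopUtils::toString(sum));
  }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<WordCountMap, WordCountReduce>());
}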
The wordcount.xml file specifies
configuration parameters for the word-count example.
This .xml file specifies three properties
or configuration parameters for
running the Hadoop job. In particular, it specifies that the property
hadoop.pipes.executable has the value
/user/rab/bin/wordcount, which informs hadoop
pipes of the file on the DFS that should be
executed.
DO THIS: Enter and run the C++ program wordcount.cpp using Hadoop pipes.
Notes: (Reference: Beowiki page Hadoop Reference, January 2011)
You can create the files wordcount.cpp and wordcount.xml in your lab10 directory on your admin machine either by copying and pasting with an editor or via scp:
admin$ scp shelob.cs.stolaf.edu:~cs300/hadoop/wordcount.{cpp,xml} .
(Don't forget the final dot '.' .)
To compile your C++ file wordcount.cpp, you can use the command
admin$ g++ -I/usr/lib/hadoop/c++/Linux-amd64-64/include -Wall -c wordcount.cpp
The -I flag indicates where to find #include
files, and -Wall prints all warnings that may arise
during compilation.
To link the resulting object file, you can use the command
admin$ libtool --mode=link --tag=CXX \
g++ -L/usr/lib/hadoop/c++/Linux-amd64-64/lib -lhadooppipes -lhadooputils -lpthread -Wall \
wordcount.o -o wordcount
The -L flag indicates where to find object files and library archive files. This list of locations is called the library path. (In this case, there is only one location in the list; others could be added with additional -L flags.)
The libtool wrapper handles the details of linking against the Hadoop shared libraries so that the resulting executable can locate them at run time.
The \ character at the end of a
line causes the following line to be appended to that line.
Now, you must upload the executable to the DFS.
admin$ hadoop fs -rm bin/wordcount
admin$ hadoop fs -put wordcount bin/wordcount
The hadoop fs -rm command deletes any old version of the file; of course, it is not needed the first time you upload the executable.
The same effect results from using bin instead of
bin/wordcount as the final argument,
since hadoop fs -put reuses the local filename if the
destination is a directory name.
Create a directory of data on the DFS for your Hadoop word-count job.
We will assume that this data is located in a file named mice.txt in your lab10 directory on your admin node, containing the following text:
Three blind mice, three blind mice! See how they run, see how they run! They all ran after the farmer's wife, She cut off their tails with a carving knife. Did you ever see such a thing in your life As three blind mice?
Upload this file to a directory (we'll call it
mice) in your area of the DFS.
admin$ hadoop fs -mkdir mice
admin$ hadoop fs -put mice.txt mice
Note: All files in this input directory will be processed by the Hadoop job, so it's best to have a dedicated directory for this purpose.
Finally, you're ready to run your Hadoop pipes
job.
admin$ hadoop pipes -conf wordcount.xml -input input-dir/ -output output-dir/
The name input-dir should be replaced by
your HDFS input directory name. This can be an absolute path (starting
with /) on the DFS or a relative path (not starting with
/), which is then relative to your DFS directory
/user/username.
In our example, input-dir should be
mice. The trailing / is optional.
output-dir should be a DFS directory name to receive the output from the job. This directory must not already exist; Hadoop creates it when the job runs.
After the job completes, look in the
output-dir for the results.
admin$ hadoop fs -ls output-dir
Files produced by the reducers have the names part-00000, part-00001, etc.; there is one such file per reduce task, and together they make up the job's output. You can examine an individual file without -getting it by using the -cat option for hadoop fs:
admin$ hadoop fs -cat output-dir/part-00000
This should print lines such as
a 2
after 1
all 1
...
Hadoop example in Java using hadoop
jar.
Here we provide a list of steps for running a Hadoop job using the standard Java interface; you can follow these steps even if you don't know Java. Only a single source file is required for this Java exercise. As before, we will start with a discussion of the source file.
Examining that source code, the package command in
Java provides a naming context for the classes defined in the file
WordCount.java. Next, several
import commands provide names of classes used later in
the file. Some of these import commands draw from
standard Java code libraries, and some are specific to Hadoop.
The remainder of the file defines a single class WordCount with a method main() and two nested classes Map and Reduce. As with C++, one can define one class inside of another class; a nested class is only visible within the scope of its containing class.
The method main() corresponds to the
main() function of a C++ program. (Incidentally, in Java,
there are no
standalone functions, but only methods of classes.) The
WordCount class's main() sets up
configuration parameters for a Hadoop job, then calls
JobClient.runJob() to launch that Hadoop job.
Only one file WordCount.java is required. This
contains a single Java class WordCount that contains a
method main() and two inner classes
Map and Reduce.
Instead of contexts, the Java implementation uses other classes
for accessing data, intermediate list, and output, such as the
templated OutputCollector class.
The Map.map() method and the
Reduce.reduce() method use Java methods to carry out
their work, rather than methods and functions involving C++ classes
such as std::vector<std::string>.
Follow the instructions of the Beowiki page Passwordless_SSH_setup.
This will enable you to ssh into a cluster admin node
from a CS lab machine without having to provide a password.
Note. Being able to log in without a password is a great convenience, but it is obviously less secure: if someone can gain access to your St. Olaf account (e.g., if you leave yourself logged in somewhere by mistake), the intruder can automatically access your Beowulf cluster accounts. This is a security concern for both your account and the cluster.
However, we will be using ssh and related utilities
such as scp so often in our work that supplying all of
those passwords would be unbearably frustrating without passwordless
ssh. The mechanism for passwordless ssh
itself is rather secure. Just be sure to protect your account...
Accessing the DFS from a CS lab machine. Create a
file on a CS Link machine, and use
scp and a
script hadoop.mist or hadoop.helios to
upload it to a cluster's DFS.
First, create a file myfile.txt in your Link-machine lab10 directory ~/PCS/lab10, using emacs or another editor.
% emacs myfile.txt ...
myfile.txt may have your choice of contents as data for wordcount.
Note: Be sure to create myfile.txt in your
Link-machine shell, not your admin node shell.
Copy myfile.txt to your admin node's
lab10 directory, using scp. For example,
% scp myfile.txt mist.public.stolaf.edu:lab10
(No password should be required!)
You can use a convenience script hadoop.mist or
hadoop.helios to perform a remote hadoop
invocation from a CS Link machine. For example,
% hadoop.mist fs -ls
is equivalent to the ssh command
% ssh mist.public.stolaf.edu /opt/hadoop/bin/hadoop fs -ls
which connects to mist and executes hadoop with the arguments fs -ls.
Use one of these convenience scripts
hadoop.name to
upload myfile.txt to the DFS on admin node
name. For example,
% hadoop.mist fs -put lab10/myfile.txt mice
Note: This operation puts the admin node's copy of myfile.txt into the DFS directory mice.
Using Hadoop pipes from a CS Link machine. Use a convenience script hadoop.name to perform a remote hadoop pipes job on a cluster.
Perform steps as before, except using
hadoop.name on a Link machine shell instead of
using hadoop on an admin node shell. Not all of the
prior steps are necessary, since the executable wordcount
already exists on the DFS, and the configuration file
wordcount.xml has already been set up. That leaves
something like the following (if your admin node is
mist):
% hadoop.mist pipes -conf lab10/wordcount.xml -input mice -output pipes2
% hadoop.mist fs -cat pipes2/part-00000
Note: The absolute directory path prefix /user/username/ is not necessary for the input directory mice or the output directory pipes2, because these DFS names are relative to your DFS home directory /user/username by default. Also, you should get different results from this run because of the second input file myfile.txt.
Using Hadoop from within Eclipse (Java, optional)
Note: Please do not run Eclipse on admin nodes, since it consumes so many resources. Instead, enter the eclipse command in a Link-machine shell.
Make a copy CountWordsA of your CountWords program (either countwords.cpp or CountWords.java) in your lab10 directory, and test to ensure it is working properly. The expected output from CountWordsA using the "Three Blind Mice" data should begin something like this:
2 a
1 after
1 all
1 as
3 blind
1 carving
1 cut
1 did
1 ever
1 farmer's
...
Makefile for Hadoop jobs. Test a locally developed
Makefile for managing Hadoop jobs.
Delete all files in your lab10 directory except
your source file countwordsA.cpp or
CountWordsA.java for purposes of this test.
Use scp to copy the files
~cs300/hadoop/Makefile and
~cs300/hadoop/proto.xml into your
lab10 directory.
It's easiest to do this from a CS Link machine, e.g.,
% scp ~cs300/hadoop/Makefile ~cs300/hadoop/proto.xml mist.public.stolaf.edu:lab10/
If you program in C++, this Makefile uses the file
proto.xml to automatically generate .xml
files for your job.
This Makefile uses a file targets to
determine the default targets.
If you are using C++, the C++
executable countwords is the default target.
If you are using
Java, the .jar file CountWords.jar is the
default target.
Create a file targets containing this target. (You
can add more targets later in the lab; if you use this Makefile for
other assignments in other directories, those directories can have
their own targets file.)
If you are using C++, issue the command
$ make countwordsA.o
This step of making a .o file is only needed for new .cpp files. We won't need it again for countwordsA, even if we make changes to that file.
Now, issue the command
$ make
This command should make your default target. (You will see a one-time warning that a file .lasttarget doesn't exist, which you can safely ignore.) Check to ensure that the default target was in fact created.
This Makefile keeps track of the target you most recently made. If you issue the command
$ make hadoop
the Makefile will recall that you most recently made the target for CountWordsA and will start a Hadoop job for that target.
If you program in C++, it will replace your prior executable on the DFS with the newly generated one.
It will create a new DFS directory for your output, named
countwordsA.NNNN for C++ and named
CountWordsA.NNNN for Java, where
NNNN is a unique identifying "job
number."
It will run the Hadoop job for you.
If you have more than one program in your directory, you can
run a Hadoop job with another program prog by
entering
$ make prog.hadoop
Note: The input data directory in the DFS is named
mice by default. You can change this in your copy of the
Makefile. Alternatively, you can override it at the command line:
$ make hadoop DATA=cheese
If your Hadoop job succeeds, you can print your results conveniently as follows:
$ make print
(or make prog.print).
Modify your CountWordsA program to lower the case
of all words; for example, the
input words "The" and "the" are both counted
as "the."
For C++: use a function such as the following toLower() to lower the case of each word.
// assumes <string> and <cctype> are available (the wordcount headers already pull in <string>)
string toLower(string str) {
  for (size_t i = 0; i < str.length(); i++)
    str[i] = tolower(str[i]);
  return str;
}
For Java: Look for a String method to
lower the case of a word.
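For C++, the toLower() function above might then be applied at the point where the mapper emits each word. This is a one-line sketch, assuming (as in the earlier word-count sketch) that the split words are stored in a vector named words:
context.emit(toLower(words[i]), "1");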
Write a Hadoop program CountWordsB in either C++
or Java that operates on the output of CountWordsA to
produce a summary list of word frequencies. For the "Three Blind
Mice" data, your program should produce a 3-line summary:
1 off ran run! run, she such tails the their thing wife, with you your mice? after all as carving cut did ever farmer's in knife. life mice! mice,
2 a how
3 blind three they see
Instead of your default input directory on the DFS (e.g.,
mice), we want to use the output from the last run of
CountWordsA as input for this program.
As explained above, you can enter the job name for the last run of
CountWordsA as the value of DATA
$ make hadoop DATA=countwordsA.NNNN     # C++
$ make hadoop DATA=CountWordsA.NNNN     # Java
Our Makefile can ordinarily deduce the job number for you (assuming you last ran CountWordsA in the current directory):
$ make hadoop DATA=countwordsA     # C++
$ make hadoop DATA=CountWordsA     # Java
The default InputFormat class is TextInputFormat, which provides a numerical offset as the key and a line of input as the value for the mapper. But we want to read a count number as the key and a word as the value. You can accomplish this by using KeyValueTextInputFormat for the InputFormat class.
For C++: add a -inputformat command
argument to your hadoop command. Here's a way to do this
with the Makefile:
make hadoop DATA=countwordsA OPT='-inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat'
For Java: Add or modify a call to
conf.setInputFormat() in your main():
conf.setInputFormat(KeyValueTextInputFormat.class);
Instead of WritableComparable or LongWritable as the input key type for the mapper, use Text as both the input key and input value types with this input format.
There is no work for the mapper to do in this program, so an identity mapper should be used. You can either program an identity mapper yourself, or use the predefined IdentityMapper class. (Note: it is not yet clear how to use the predefined class in C++; a hand-written version is sketched below.)
In the reducer, separate the values (words) with blanks (instead of the ^ character used for the concordance problem).
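If you program the identity mapper yourself in C++, it can be a very small pipes Mapper subclass along these lines. This is a sketch; the class name IdentityMap is illustrative, and it assumes the Hadoop pipes API used in wordcount.cpp.
#include "hadoop/Pipes.hh"

// Identity mapper: pass each (key, value) input pair through unchanged.
class IdentityMap : public HadoopPipes::Mapper {
public:
  IdentityMap(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    context.emit(context.getInputKey(), context.getInputValue());
  }
};

You would then register it in main() together with your own reducer, for example via HadoopPipes::TemplateFactory<IdentityMap, CountWordsBReduce>, where CountWordsBReduce stands for whatever you name your CountWordsB reducer class.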
Submit your work using the following command on your cluster admin node:
$ cd ~
$ labsubmit.cs300 lab10
This lab is due by Wednesday, January 21, 2015.