Introduction to map-reduce computing with WebMapReduce
CS 300, Parallel and Distributed Computing (PDC)
Due Tuesday, January 20, 2015
______
Map-reduce concepts
WebMapReduce (WMR) orientation
WMR User Guide, which includes specifications for the user interface and sample code.
On a link computer, create a
lab9 subdirectory for work on this
lab, and change to that directory. Note: In many
cases, the only deliverable for a step will be a README
entry recording a job number and any observations.
Enter and run the word-count code presented in class for C++ and at least one other programming language of your choice. Use the same interactively entered data for all runs. In each case, apply the test feature first, then turn off testing for actual Hadoop computation.
Record your test data, language choices, WMR job numbers
for those language choices, and any observations/feedback
in lab9/README.
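As a reference point, here is a minimal Python sketch of the word-count logic. Inside WMR a mapper or reducer would call Wmr.emit(key, value); in this sketch an emit function is passed in, and a tiny driver simulates Hadoop's shuffle, so the logic can be run and checked outside WMR.

```python
def mapper(key, value, emit):
    # value holds one line of input text; split it into words on whitespace
    for word in value.split():
        emit(word, '1')

def reducer(key, values, emit):
    # values is an iterator over all the counts emitted for this word
    total = sum(int(v) for v in values)
    emit(key, str(total))

# Tiny driver that simulates Hadoop's shuffle: group values by key,
# then call the reducer once per distinct key.
def run(lines):
    pairs = []
    for line in lines:
        mapper(None, line, lambda k, v: pairs.append((k, v)))
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)
    results = {}
    for k, vs in groups.items():
        reducer(k, iter(vs), lambda k, v: results.update({k: v}))
    return results

print(run(["the cat in the hat", "the cat"]))
```

Running this on the two sample lines counts the word the three times; the same mapper and reducer bodies carry over to WMR with Wmr.emit in place of the emit parameter.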
HDFS data sets. Complete data from Project Gutenberg
has been downloaded from the web and loaded onto the HDFS for WMR (April
2013). These data are in subdirectories of
/data/gutenberg on the HDFS for Hadoop on
helios. Note: this data does not
reside on the helios file system, but on the HDFS for
Hadoop on helios. Thus, for example,
helios$ ls /data/gutenberg
will not show these files (in fact, the directory
/data does not exist on the helios file system,
only on the HDFS).
That directory contains some books/complete works and some directories that contain the Project Gutenberg data.
Run the word count computation on one of the following files:
/data/gutenberg/AnnaKarenina.txt (complete Tolstoy
novel)
/data/gutenberg/WarAndPeace.txt (complete Tolstoy novel)
/data/gutenberg/CompleteShakespeare.txt
Note: Do not attempt these on the testing interface, which is designed for test runs and only has the capacity for small files.
Record your WMR job numbers and observations in your
README file.
Also try your word-count job on a directory containing a subset of Project Gutenberg data. Subsets are available under each of the following directories (the group subdirectories and their file counts are listed below):
/data/gutenberg/all -- preserves Project Gutenberg's
original line structure
/data/gutenberg/all_nonl -- one document per line, with the
original newlines within documents replaced by spaces
/data/gutenberg/all_n -- like all,
except each line is preceded by a line number within that file,
followed by a TAB character. In a mapper, the line number is assigned
to the first argument (key), and the line content is
assigned to the second argument (value)
group1 2163
group2 735
group3 593
group4 630
group5 819
group6 830
group7 667
group8 541
group9 691
group10 2018
group11 293
Note: Having one document (e.g., a whole book) per line
as in /data/gutenberg/all_nonl
makes it possible for a single WMR mapper call to operate on an entire document
instead of only one line from a document.
Make a test run, and record the job number and your observations in
README. Be sure to label your README entry
well, so you can reuse the output from this data in future runs.
The word-count code produces a list that is sorted by words. Use the output from this program for a novel (or complete works) file as input for a second map-reduce cycle, in order to cause the list to be sorted by frequency instead of by word.
Hints: WMR has a feature for using the output from one
run as input for a next run; or, you can use a saved data-set name
from a prior step. The mapper for this second map-reduce cycle will
receive key-value pairs (the values will be non-empty, this time).
Your mapper code should
emit a new pair that interchanges the role of the key and the value,
e.g., if the input is key=the, val=15, then
you should emit key=15, val=the. The
reducer should simply emit the same key-value pairs it receives.
Note: The reducer for this program is called an identity reducer, since the key-value pairs are unchanged. But you will have to iterate over the value stream passed to the reducer in order to regenerate those pairs in your identity reducer. One could also have an identity mapper, which is useful in other contexts.
Test your code with small data using the testing interface, then
run it on a novel or the complete works of Shakespeare.
Note: You can enter key-value pairs by hand using a tab to
separate key from value.
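The second-pass code can be sketched in Python as follows. As before, the emit function is passed in so the sketch runs outside WMR (inside WMR you would call Wmr.emit instead).

```python
def mapper(key, value, emit):
    # key is a word and value is its count from the first pass;
    # swap them so the shuffle phase groups and sorts by frequency
    emit(value, key)

def reducer(key, values, emit):
    # identity reducer: iterate the value stream and re-emit each pair
    for v in values:
        emit(key, v)

pairs = []
mapper('the', '15', lambda k, v: pairs.append((k, v)))
reducer('15', iter(['the', 'a']), lambda k, v: pairs.append((k, v)))
print(pairs)
```

Note that the reducer body must actually loop over the value stream; emitting the key alone would lose all but one word sharing a given frequency.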
Make a copy of your mapper and reducer in your lab9 directory. Use the names map_2pass.ext and
red_2pass.ext for these files. (Replace
ext by an extension that indicates your
programming language, e.g., cpp, java,
py.) Record labelled
job number for your run and observations
in README.
The given word-count mappers and reducers break a line up into
words using whitespace characters as delimiters, but they make no
attempt to remove punctuation or to count capitalized words the same as
lower-case words. Thus the, The, and
"The, are all counted as different words.
Modify a word-count mapper to lower the case of all letters in a word and to strip punctuation from both ends of a word, before emitting that word's key-value pair. Thus, a short line
"Whose book is this?" "Fred's," I said.
should emit the key-value pairs
whose	1
book	1
is	1
this	1
fred's	1
i	1
said	1
Notes:
Python and Java have convenient methods for these operations -- look them up if you use those languages.
C++ has a tolower() library function for lowering
the case of a single character. You can use a WMR library function for
removing non-letters from either end of a word, or program such a
function yourself, using loops with character-by-character
operations.
You're welcome to apply other C++ approaches you may be aware of...
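In Python, one possible version of the cleaned-up mapper uses the standard library's string.punctuation with str.strip, which removes characters from both ends only, leaving interior apostrophes (as in fred's) alone:

```python
import string

def mapper(key, value, emit):
    for word in value.split():
        # lower-case, and strip punctuation from both ends only
        cleaned = word.strip(string.punctuation).lower()
        if cleaned:              # skip tokens that were all punctuation
            emit(cleaned, '1')

pairs = []
mapper(None, '"Whose book is this?" "Fred\'s," I said.',
       lambda k, v: pairs.append((k, v)))
print(pairs)
```

Run on the sample line above, this emits exactly the pairs whose, book, is, this, fred's, i, said, each with count 1.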
Test your code with small data using the testing interface, then
run it on a novel or the complete works of Shakespeare.
Make a copy of your mapper in your lab9 directory, using the name map_strip.ext for this
file. Record the labelled job number for your run and any observations
in README.
WMR does not currently support combiners. Instead, implement a combiner within your mapper, as described in Section 3.1 of Lin and Dyer's book.
Notes:
Python and Java support associative arrays/hash tables for storing "subtotals" for particular words.
C++ STL provides a templated data structure map
that can be used for the associative array.
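A Python sketch of this idea: the mapper subtotals counts in an associative array (a dictionary) and emits one pair per distinct word on the line, rather than one pair per word occurrence.

```python
def mapper(key, value, emit):
    # in-mapper combining: accumulate per-word subtotals first
    counts = {}
    for word in value.split():
        counts[word] = counts.get(word, 0) + 1
    # then emit a single pair per distinct word
    for word, n in counts.items():
        emit(word, str(n))

pairs = []
mapper(None, "the cat in the hat", lambda k, v: pairs.append((k, v)))
print(pairs)
```

On the line "the cat in the hat" this emits four pairs instead of five, since both occurrences of the are combined into one pair.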
Test your program with small data and the testing interface. Then
run on a novel or other larger data set, and record job number and
observations in README. Also, store your mapper using
the name map_comb.ext in your lab directory.
(Note: Similar exercises to this could be carried out
using MovieLens data (/data/MovieLens2) instead of
Netflix data.)
The Netflix Prize data was assembled by Netflix for a contest to encourage programmers to come up with better algorithms for predicting movie ratings. (A $1 million prize was offered for a sufficiently large improvement in prediction accuracy.) The data is stored in the following directories on the HDFS:
/data/netflix/test contains all the ratings
of 999 movies
/data/netflix/all contains 100,000,000
movie ratings, involving about 18,000 movies rated by 480,000 users
Each line of these files contains one movie rating in the following format:
movieid,userid,rating,date
For example:
1629,1488844,3,2005-11-16
The ratings are on a scale of 1 to 5.
Write two WMR applications for this part:
A map-reduce application to sort userid values by number of ratings. (This will require a pass that counts the ratings per userid appearing on the input lines, plus a second pass for sorting.)
A map-reduce application to determine mean movie ratings and print a list of movieids sorted by rating
Notes for (b):
Think of what to emit for the key and what for the value in a key-value pair proceeding from one line of the Netflix data.
Your reducer must keep both a sum of ratings for a particular movie and a count of those ratings, in order to compute the mean (average).
Use a second map-reduce pass to sort the data numerically by rating.
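The first pass of part (b) can be sketched as follows, assuming the movieid,userid,rating,date line format above (emit is again passed in so the sketch runs outside WMR):

```python
def mapper(key, value, emit):
    # one input line: movieid,userid,rating,date
    movieid, userid, rating, date = value.split(',')
    emit(movieid, rating)

def reducer(key, values, emit):
    # keep both a running sum and a count to compute the mean
    total = 0.0
    count = 0
    for v in values:
        total += float(v)
        count += 1
    emit(key, str(total / count))

pairs = []
for line in ["1629,1488844,3,2005-11-16", "1629,2,5,2005-11-17"]:
    mapper(None, line, lambda k, v: pairs.append((k, v)))
out = []
reducer('1629', iter(v for k, v in pairs), lambda k, v: out.append((k, v)))
print(out)
```

The second (sorting) pass can then swap key and value, as in the word-frequency exercise earlier in this lab.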
Debug using the testing interface, entering sample data by hand. Then test on one of the Netflix data sets. Record your job ids and data set locations for all WMR passes in README, together with any observations.
Optionally try one or more of the following additional exercises.
Collocations of words. If two words occur near each other, they are described as collocated words. Computational linguists may have an interest in various types of collocation in a corpus or body of text, for example, words that appear immediately next to each other, or words that appear in the same syntactic unit (e.g., sentence).
Write a map-reduce application that prints consecutive word pairs, ordered by their frequencies. For example, if the input is
the cat in the cat hat
then the output might be
1	cat|in
1	in|the
1	cat|hat
2	the|cat
Use the pairs pattern described in the text for this part. Thus, a mapper could emit a key-value pair
(the|cat, 1)
for the first word pair in the example above.
Notes:
This will require two map-reduce passes.
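The first-pass mapper for the pairs pattern can be sketched in Python; the second pass is the same frequency sort used earlier in the lab (emit is passed in so the sketch runs outside WMR):

```python
def mapper(key, value, emit):
    words = value.split()
    # emit one pair per pair of consecutive words on this line
    for first, second in zip(words, words[1:]):
        emit(first + '|' + second, '1')

pairs = []
mapper(None, "the cat in the cat hat", lambda k, v: pairs.append((k, v)))
print(pairs)
```

The existing word-count reducer then totals the counts for each word pair unchanged, since the pair key behaves just like a word.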
Debug using the testing interface, then test on a large novel or complete works of Shakespeare. Record your job id and data set location in README, together with any observations.
Carry out the word-pair exercise, except using the stripes pattern described in the text. Thus, if the input is
the cat and the dog fought over the dog bone
a mapper might emit a key-value pair
(the, {cat:1, dog:2}), in which the
value indicates that the word cat occurred once
(immediately) after the word the, and the word
dog occurred twice after the.
Notes:
Use associative arrays to accumulate the information. For
example, an associative array A for the word
the could increment the element A[dog]
when the collocated words the dog are encountered.
Associative arrays are built into Python (dictionaries), and are
available as a type Map<String, Integer> in Java, or
as an STL templated class map in C++.
You can store the associative arrays of collocation counts in an "outer"
associative array W, having one element for each first word
among collocated pairs. Thus W[the] would hold the
associative array for the.
After processing a line of input, a mapper will emit the
"stripe" pairs it has found (such as (the, {cat:1,
dog:2})). In order to do that, the mapper must have a way of
printing out the values in that associative array for the word
the. Look for features for output of an
associative array in your language; for example, Python dictionaries
can simply be printed, and Java Map objects have a
toString() method.
Also, a reducer must be able to parse a value representing an
associative array (such as {cat:1, dog:2} for the word
the), process
the information it contains, then produce key-value pairs with
combined frequency counts for each collocation (such as
(the|cat, n), where n is
the count of all occurrences (from all original input lines) of the
word the followed by the word cat).
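A Python sketch of the stripes pattern follows. The mapper serializes each stripe with str(); parsing it back with the standard library's ast.literal_eval in the reducer is an assumption about the encoding, not part of WMR itself, and other serializations would work as well.

```python
import ast

def mapper(key, value, emit):
    words = value.split()
    stripes = {}   # outer array W: first word -> inner count array
    for first, second in zip(words, words[1:]):
        inner = stripes.setdefault(first, {})
        inner[second] = inner.get(second, 0) + 1
    for first, inner in stripes.items():
        emit(first, str(inner))       # e.g. "{'cat': 1, 'dog': 2}"

def reducer(key, values, emit):
    combined = {}
    for v in values:
        # parse the serialized stripe back into a dictionary
        for word, n in ast.literal_eval(v).items():
            combined[word] = combined.get(word, 0) + n
    for word, n in combined.items():
        emit(key + '|' + word, str(n))

pairs = []
mapper(None, "the cat and the dog fought over the dog bone",
       lambda k, v: pairs.append((k, v)))
out = []
reducer('the', iter(v for k, v in pairs if k == 'the'),
        lambda k, v: out.append((k, v)))
print(out)
```

On the sample line, the stripe emitted for the is {'cat': 1, 'dog': 2}, and the reducer expands it back into the pairs-style keys the|cat and the|dog with combined counts.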
Debug using the testing interface, then test on a large novel or complete works of Shakespeare. Record your job id and data set location in README, together with any observations.
Modify your Netflix processing program (part (b)) to track how movie ratings change over time, by producing a list of movies and their average ratings per day, sorted first by date then by average rating on that date. Example output:
2005-11-16	4.8, 1629
2005-11-16	4.78, 1643
2005-11-16	4.78, 11
2005-11-16	4.76, 8001
...
2005-11-16	1.1, 384
2005-11-17	5.0, 2288
...
Notes:
To accomplish this, add the date to the key for the
first pass of the mean rating computation, so that mean rating will be
computed per movie per day. Thus, the key for the first
mapper will consist of date and movie, e.g.,
2005-11-16|1629, and the value will consist of a
rating.
In the reducer for the first pass, compute average rating for a
date|movieid, then emit a pair whose key is
date|rating and whose value is a
movieid. Sort alphabetically in a second pass; since the
dates have uniform
lengths and the ratings all have a floating-point format that begins
with x. for some digit x, Hadoop's shuffle phase
will sort the data as desired.
The reducer in the second (sorting) pass can reformat each key and value to produce the desired output format.
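The first pass with the date added to the key can be sketched as follows. Formatting the mean to two decimal places is one arbitrary choice for getting uniform-width rating strings; it is not required by the lab.

```python
def mapper(key, value, emit):
    # one input line: movieid,userid,rating,date
    movieid, userid, rating, date = value.split(',')
    emit(date + '|' + movieid, rating)

def reducer(key, values, emit):
    total = 0.0
    count = 0
    for v in values:
        total += float(v)
        count += 1
    date, movieid = key.split('|')
    # re-key by date|rating so the second pass sorts by date, then rating
    emit(date + '|' + ('%.2f' % (total / count)), movieid)

out = []
reducer('2005-11-16|1629', iter(['3', '5', '4']),
        lambda k, v: out.append((k, v)))
print(out)
```

The second pass then uses an identity mapper and a reducer that reformats each date|rating key and movieid value into the output layout shown above.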
Alternate solutions to date-sorted ratings... Associative arrays, alternate approaches to secondary sort.
Submit your work using the following command on your link machine.
$ cd ~
$ labsubmit.cs300 lab9
This lab is due by Tuesday, January 20, 2015.