Introduction to map-reduce computing with WebMapReduce
CS 300, Parallel and Distributed Computing (PDC)
Due Tuesday, January 20, 2015
______
Map-reduce concepts
WebMapReduce (WMR) orientation
WMR User Guide, which includes specifications for the user interface and sample code.
On a link computer, create a
lab9 subdirectory for work on this
lab, and change to that directory. Note: In many
cases, the only deliverable for a step will be a README
entry recording a job number and any observations.
Enter and run the word-count code presented in class for C++ and at least one other programming language of your choice. Use the same interactively entered data for all runs. In each case, apply the test feature first, then turn off testing for actual Hadoop computation.
Record your test data, language choices, WMR job numbers
for those language choices, and any observations/feedback
in lab9/README.
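As a reference point, here is a minimal Python sketch of the word-count logic. Inside WMR a mapper or reducer would call Wmr.emit(key, value); in this sketch an emit function is passed in, and a tiny driver simulates Hadoop's shuffle, so the logic can be run and checked outside WMR.

```python
def mapper(key, value, emit):
    # value holds one line of input text; split it into words on whitespace
    for word in value.split():
        emit(word, '1')

def reducer(key, values, emit):
    # values is an iterator over all the counts emitted for this word
    total = sum(int(v) for v in values)
    emit(key, str(total))

# Tiny driver that simulates Hadoop's shuffle: group values by key,
# then call the reducer once per distinct key.
def run(lines):
    pairs = []
    for line in lines:
        mapper(None, line, lambda k, v: pairs.append((k, v)))
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)
    results = {}
    for k, vs in groups.items():
        reducer(k, iter(vs), lambda k, v: results.update({k: v}))
    return results

print(run(["the cat in the hat", "the cat"]))
```

Running this on the two sample lines counts the word the three times; the same mapper and reducer bodies carry over to WMR with Wmr.emit in place of the emit parameter.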
HDFS data sets. Complete data from Project Gutenberg
has been downloaded from the web and loaded onto the HDFS for WMR (April
2013). These data are in subdirectories of
/data/gutenberg on the HDFS for Hadoop on
helios. Note: this data does not
reside on the helios file system, but on the HDFS for
Hadoop on helios. Thus, for example,
helios$ ls /data/gutenberg
will not show these files (in fact, the directory
/data does not exist on the helios file system,
only on the HDFS).
That directory contains some books/complete works and some directories that contain the Project Gutenberg data.
Run the word count computation on one of the following files:
/data/gutenberg/AnnaKarenina.txt (complete Tolstoy
novel)
/data/gutenberg/WarAndPeace.txt (complete Tolstoy novel)
/data/gutenberg/CompleteShakespeare.txt
Note: Do not attempt these on the testing interface, which is designed for test runs and only has the capacity for small files.
Record your WMR job numbers and observations in your
README file.
Also try your word-count job on a directory containing a subset of Project Gutenberg data. Subsets are available under each of the following directories (the group subdirectories and their file counts are listed below):
/data/gutenberg/all -- preserves Project Gutenberg's
original line structure
/data/gutenberg/all_nonl -- one document per line, with the
original newlines within documents replaced by spaces
/data/gutenberg/all_n -- like all,
except each line is preceded by a line number within that file,
followed by a TAB character. In a mapper, the line number is assigned
to the first argument (key), and the line content is
assigned to the second argument (value)
group1 2163
group2 735
group3 593
group4 630
group5 819
group6 830
group7 667
group8 541
group9 691
group10 2018
group11 293
Note: Having one document (e.g., a whole book) per line
as in /data/gutenberg/all_nonl
makes it possible for a single WMR mapper call to operate on an entire document
instead of only one line from a document.
Make a test run, and record the job number and your observations in
README. Be sure to label your README entry
well, so you can reuse the output from this data in future runs.
The word-count code produces a list that is sorted by words. Use the output from this program for a novel (or complete works) file as input for a second map-reduce cycle, in order to cause the list to be sorted by frequency instead of by word.
Hints: WMR has a feature for using the output from one
run as input for a next run; or, you can use a saved data-set name
from a prior step. The mapper for this second map-reduce cycle will
receive key-value pairs (the values will be non-empty, this time).
Your mapper code should
emit a new pair that interchanges the role of the key and the value,
e.g., if the input is key=the, val=15, then
you should emit key=15, val=the. The
reducer should simply emit the same key-value pairs it receives.
Note: The reducer for this program is called an identity reducer, since the key-value pairs are unchanged. But you will have to iterate over the value stream passed to the reducer in order to regenerate those pairs in your identity reducer. One could also have an identity mapper, which is useful in other contexts.
Test your code with small data using the testing interface, then
run it on a novel or the complete works of Shakespeare.
Note: You can enter key-value pairs by hand using a tab to
separate key from value.
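The second-pass code can be sketched in Python as follows. As before, the emit function is passed in so the sketch runs outside WMR (inside WMR you would call Wmr.emit instead).

```python
def mapper(key, value, emit):
    # key is a word and value is its count from the first pass;
    # swap them so the shuffle phase groups and sorts by frequency
    emit(value, key)

def reducer(key, values, emit):
    # identity reducer: iterate the value stream and re-emit each pair
    for v in values:
        emit(key, v)

pairs = []
mapper('the', '15', lambda k, v: pairs.append((k, v)))
reducer('15', iter(['the', 'a']), lambda k, v: pairs.append((k, v)))
print(pairs)
```

Note that the reducer body must actually loop over the value stream; emitting the key alone would lose all but one word sharing a given frequency.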
Make a copy of your mapper and reducer in your lab9 directory. Use the names map_2pass.ext and
red_2pass.ext for these files. (Replace
ext by an extension that indicates your
programming language, e.g., cpp, java,
py.) Record labelled
job number for your run and observations
in README.
The given word-count mappers and reducers break a line up into
words using whitespace characters as delimiters, but they make no
attempt to remove punctuation or to count capitalized words the same as
lower-case words. Thus the, The, and
"The, are all counted as different words.
Modify a word-count mapper to lower the case of all letters in a word and to strip punctuation from both ends of a word, before emitting that word's key-value pair. Thus, a short line
"Whose book is this?" "Fred's," I said.
should emit the key-value pairs
whose	1
book	1
is	1
this	1
fred's	1
i	1
said	1
Notes:
Python and Java have convenient methods for these operations -- look them up if you use those languages.
C++ has a tolower() library function for lowering
the case of a single character. You can use a WMR library function for
removing non-letters from either end of a word, or program such a
function yourself, using loops with character-by-character
operations.
You're welcome to apply other C++ approaches you may be aware of...
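In Python, one possible version of the cleaned-up mapper uses the standard library's string.punctuation with str.strip, which removes characters from both ends only, leaving interior apostrophes (as in fred's) alone:

```python
import string

def mapper(key, value, emit):
    for word in value.split():
        # lower-case, and strip punctuation from both ends only
        cleaned = word.strip(string.punctuation).lower()
        if cleaned:              # skip tokens that were all punctuation
            emit(cleaned, '1')

pairs = []
mapper(None, '"Whose book is this?" "Fred\'s," I said.',
       lambda k, v: pairs.append((k, v)))
print(pairs)
```

Run on the sample line above, this emits exactly the pairs whose, book, is, this, fred's, i, said, each with count 1.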
Test your code with small data using the testing interface, then
run it on a novel or the complete works of Shakespeare.
Make a copy of your mapper in your lab9 directory, using the name map_strip.ext for this
file. Record the labelled job number for your run and any observations
in README.
WMR does not currently support combiners. Instead, implement a combiner within your mapper, as described in Section 3.1 of Lin and Dyer's book.
Notes:
Python and Java support associative arrays/hash tables for storing "subtotals" for particular words.
C++ STL provides a templated data structure map
that can be used for the associative array.
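A Python sketch of this idea: the mapper subtotals counts in an associative array (a dictionary) and emits one pair per distinct word on the line, rather than one pair per word occurrence.

```python
def mapper(key, value, emit):
    # in-mapper combining: accumulate per-word subtotals first
    counts = {}
    for word in value.split():
        counts[word] = counts.get(word, 0) + 1
    # then emit a single pair per distinct word
    for word, n in counts.items():
        emit(word, str(n))

pairs = []
mapper(None, "the cat in the hat", lambda k, v: pairs.append((k, v)))
print(pairs)
```

On the line "the cat in the hat" this emits four pairs instead of five, since both occurrences of the are combined into one pair.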
Test your program with small data and the testing interface. Then
run on a novel or other larger data set, and record job number and
observations in README. Also, store your mapper using
the name map_comb.ext in your lab directory.
(Note: Similar exercises to this could be carried out
using MovieLens data (/data/MovieLens2) instead of
Netflix data.)
The Netflix Prize data was assembled by Netflix for a contest to encourage programmers to come up with better algorithms for predicting movie ratings. (A $1 million prize was offered for a sufficiently large improvement in prediction accuracy.) The data is stored in the following directories on the HDFS:
/data/netflix/test contains all the ratings
of 999 movies
/data/netflix/all contains 100,000,000
movie ratings, involving about 18,000 movies rated by 480,000 users
Each line of these files contains one movie rating in the following format:
movieid,userid,rating,date
For example:
1629,1488844,3,2005-11-16
The ratings are on a scale of 1 to 5.
Write two WMR applications for this part:
A map-reduce application to sort userid values by number of ratings. (This will require a pass that counts the ratings per userid appearing on the input lines, plus a second pass for sorting.)
A map-reduce application to determine mean movie ratings and print a list of movieids sorted by rating
Notes for (b):
Think of what to emit for the key and what for the value in a key-value pair proceeding from one line of the Netflix data.
Your reducer must keep both a sum of ratings for a particular movie and a count of those ratings, in order to compute the mean (average).
Use a second map-reduce pass to sort the data numerically by rating.
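The first pass of part (b) can be sketched as follows, assuming the movieid,userid,rating,date line format above (emit is again passed in so the sketch runs outside WMR):

```python
def mapper(key, value, emit):
    # one input line: movieid,userid,rating,date
    movieid, userid, rating, date = value.split(',')
    emit(movieid, rating)

def reducer(key, values, emit):
    # keep both a running sum and a count to compute the mean
    total = 0.0
    count = 0
    for v in values:
        total += float(v)
        count += 1
    emit(key, str(total / count))

pairs = []
for line in ["1629,1488844,3,2005-11-16", "1629,2,5,2005-11-17"]:
    mapper(None, line, lambda k, v: pairs.append((k, v)))
out = []
reducer('1629', iter(v for k, v in pairs), lambda k, v: out.append((k, v)))
print(out)
```

The second (sorting) pass can then swap key and value, as in the word-frequency exercise earlier in this lab.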
Debug using the testing interface, entering sample data by hand. Then test on one of the Netflix data sets. Record your job ids and data set locations for all WMR passes in README, together with any observations.
Optionally try one or more of the following additional exercises.
Collocations of words. If two words occur near each other, they are described as collocated words. Computational linguists may have an interest in various types of collocation in a corpus or body of text, for example, words that appear immediately next to each other, or words that appear in the same syntactic unit (e.g., sentence).
Write a map-reduce application that prints consecutive word pairs, ordered by their frequencies. For example, if the input is
the cat in the cat hat
then the output might be
1	cat|in
1	in|the
1	cat|hat
2	the|cat
Use the pairs pattern described in the text for this part. Thus, a mapper could emit a key-value pair
(the|cat, 1)
for the first word pair in the example above.
Notes:
This will require two map-reduce passes.
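The first-pass mapper for the pairs pattern can be sketched in Python; the second pass is the same frequency sort used earlier in the lab (emit is passed in so the sketch runs outside WMR):

```python
def mapper(key, value, emit):
    words = value.split()
    # emit one pair per pair of consecutive words on this line
    for first, second in zip(words, words[1:]):
        emit(first + '|' + second, '1')

pairs = []
mapper(None, "the cat in the cat hat", lambda k, v: pairs.append((k, v)))
print(pairs)
```

The existing word-count reducer then totals the counts for each word pair unchanged, since the pair key behaves just like a word.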
Debug using the testing interface, then test on a large novel or complete works of Shakespeare. Record your job id and data set location in README, together with any observations.
Carry out the word-pair exercise, except using the stripes pattern described in the text. Thus, if the input is
the cat and the dog fought over the dog bone
a mapper might emit a key-value pair
(the, {cat:1, dog:2}), in which the
value indicates that the word cat occurred once
(immediately) after the word the, and the word
dog occurred twice after the.
Notes:
Use associative arrays to accumulate the information. For
example, an associative array A for the word
the could increment the element A[dog]
when the collocated words the dog are encountered.
Associative arrays are built into Python (dictionaries), and are
available as a type Map<String, Integer> in Java, or
as an STL templated class map in C++.
You can store the associative arrays of collocation counts in an "outer"
associative array W, having one element for each first word
among collocated pairs. Thus W[the] would hold the
associative array for the.
After processing a line of input, a mapper will emit the
"stripe" pairs it has found (such as (the, {cat:1,
dog:2})). In order to do that, the mapper must have a way of
printing out the values in that associative array for the word
the. Look for features for output of an
associative array in your language; for example, Python dictionaries
can simply be printed, and Java Map objects have a
toString() method.
Also, a reducer must be able to parse a value representing an
associative array (such as {cat:1, dog:2} for the word
the), process
the information it contains, then produce key-value pairs with
combined frequency counts for each collocation (such as
(the|cat, n), where n is
the count of all occurrences (from all original input lines) of the
word the followed by the word cat).
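A Python sketch of the stripes pattern follows. The mapper serializes each stripe with str(); parsing it back with the standard library's ast.literal_eval in the reducer is an assumption about the encoding, not part of WMR itself, and other serializations would work as well.

```python
import ast

def mapper(key, value, emit):
    words = value.split()
    stripes = {}   # outer array W: first word -> inner count array
    for first, second in zip(words, words[1:]):
        inner = stripes.setdefault(first, {})
        inner[second] = inner.get(second, 0) + 1
    for first, inner in stripes.items():
        emit(first, str(inner))       # e.g. "{'cat': 1, 'dog': 2}"

def reducer(key, values, emit):
    combined = {}
    for v in values:
        # parse the serialized stripe back into a dictionary
        for word, n in ast.literal_eval(v).items():
            combined[word] = combined.get(word, 0) + n
    for word, n in combined.items():
        emit(key + '|' + word, str(n))

pairs = []
mapper(None, "the cat and the dog fought over the dog bone",
       lambda k, v: pairs.append((k, v)))
out = []
reducer('the', iter(v for k, v in pairs if k == 'the'),
        lambda k, v: out.append((k, v)))
print(out)
```

On the sample line, the stripe emitted for the is {'cat': 1, 'dog': 2}, and the reducer expands it back into the pairs-style keys the|cat and the|dog with combined counts.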
Debug using the testing interface, then test on a large novel or complete works of Shakespeare. Record your job id and data set location in README, together with any observations.
Modify your Netflix processing program (part (b)) to track how movie ratings change over time, by producing a list of movies and their average ratings per day, sorted first by date then by average rating on that date. Example output:
2005-11-16	4.8, 1629
2005-11-16	4.78, 1643
2005-11-16	4.78, 11
2005-11-16	4.76, 8001
...
2005-11-16	1.1, 384
2005-11-17	5.0, 2288
...
Notes:
To accomplish this, add the date to the key for the
first pass of the mean rating computation, so that mean rating will be
computed per movie per day. Thus, the key for the first
mapper will consist of date and movie, e.g.,
2005-11-16|1629, and the value will consist of a
rating.
In the reducer for the first pass, compute average rating for a
date|movieid, then emit a pair whose key is
date|rating and whose value is a
movieid. Sort alphabetically in a second pass; since the
dates have uniform
lengths and the ratings all have a floating-point format that begins
with x. for some digit x, Hadoop's shuffle phase
will sort the data as desired.
The reducer in the second (sorting) pass can reformat each key and value to produce the desired output format.
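The first pass with the date added to the key can be sketched as follows. Formatting the mean to two decimal places is one arbitrary choice for getting uniform-width rating strings; it is not required by the lab.

```python
def mapper(key, value, emit):
    # one input line: movieid,userid,rating,date
    movieid, userid, rating, date = value.split(',')
    emit(date + '|' + movieid, rating)

def reducer(key, values, emit):
    total = 0.0
    count = 0
    for v in values:
        total += float(v)
        count += 1
    date, movieid = key.split('|')
    # re-key by date|rating so the second pass sorts by date, then rating
    emit(date + '|' + ('%.2f' % (total / count)), movieid)

out = []
reducer('2005-11-16|1629', iter(['3', '5', '4']),
        lambda k, v: out.append((k, v)))
print(out)
```

The second pass then uses an identity mapper and a reducer that reformats each date|rating key and movieid value into the output layout shown above.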
Alternate solutions to date-sorted ratings... Associative arrays, alternate approaches to secondary sort.
Submit your work using the following command on your link machine.
$ cd ~
$ labsubmit.cs300 lab9
This lab is due by Tuesday, January 20, 2015.