Introduction to map-reduce computing with WebMapReduce

CS 300, Parallel and Distributed Computing (PDC)
Due Saturday, November 12, 2016


Preliminary material

Laboratory exercises

  1. On a Link computer, create a ~/PDC/lab22 subdirectory for work on this lab, and change your working directory to it. Put a copy of any code you use in your lab22 subdirectory. Since WMR can upload mapper and reducer files from your local system, it may be easiest to edit the files in your lab22 subdirectory with an ordinary editor and upload them to WMR. In practice, you will probably need to change that code as you test it in WMR before it works correctly. In any case, be sure that the final version of each program file ends up in your lab22 directory, copying and pasting from within WMR if necessary.

    Note: In some steps, the only deliverable for a step may be a README recording a job number and any observations.

  2. Enter and run the word-count code presented in class for Python 3, C++, and at least one other programming language of your choice. (These are available in the introductory WMR module.) Store these files in your lab22 directory, using the names wc_mapper.py and wc_reducer.py for the Python 3 versions, wc_mapper.cpp and wc_reducer.cpp for the C++ versions, and similar wc_* names for other languages. Use the same interactively entered data for all runs (which could also be kept in a small file in your lab22 directory). For each run below, apply the test feature first and check the intermediate key-value pairs and output, then run an actual Hadoop computation.
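
    For reference, here is a minimal Python 3 sketch along the lines of the classroom code. It assumes WMR's Python 3 conventions as presented in the introductory module -- a mapper(key, value) function, a reducer(key, values) function, and Wmr.emit for output -- and that a plain-text input line (no TAB) arrives as the key with an empty value; check that module for the exact interface.

    # wc_mapper.py -- emit (word, 1) for each whitespace-delimited word in the line
    def mapper(key, value):
        for word in key.split():
            Wmr.emit(word, '1')

    # wc_reducer.py -- sum the counts received for one word
    def reducer(key, values):
        total = 0
        for count in values:
            total += int(count)
        Wmr.emit(key, str(total))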

    Record your test data, language choices, WMR job numbers for those language choices, and any observations/feedback in lab22/README. Then save your work in a commit. Be sure to add your small test data file and any other source files besides the .cpp and .py files.

    link%  git add README wc_{mapper,reducer}.{cpp,py}
    link%  git commit -m "lab22: First tests of WMR"
    

  3. HDFS data sets. Complete data from Project Gutenberg (as of April 2013) has been downloaded from the web and loaded onto the HDFS for WMR. These data are in subdirectories of /shared/gutenberg on the HDFS for Hadoop on cumulus. Note that the data does not reside on the cumulus file system itself, but on the HDFS. Thus, for example,

    cumulus $ ls /shared/gutenberg
    
    will not show these files (in fact, the directory /shared does not exist on the cumulus file system, only on the HDFS).

    That HDFS directory contains some individual books and complete works, as well as subdirectories that contain the larger Project Gutenberg data sets.

    1. Run the word count computation on one of the following files:

      • /shared/gutenberg/AnnaKarenina.txt (complete Tolstoy novel)

      • /shared/gutenberg/WarAndPeace.txt (complete Tolstoy novel)

      • /shared/gutenberg/CompleteShakespeare.txt

      Note: The testing interface will truncate these large files. It is designed for test runs and only has the capacity for small files.

      Record your WMR job numbers and observations in your README file.

    2. Also try your word-count job on a directory containing a subset of the Project Gutenberg data. Three forms of the data are available, under the following directories:

      • /shared/gutenberg/all -- preserves Project Gutenberg's original line structure

      • /shared/gutenberg/all_nonl -- one document per line, with the original newlines within documents replaced by spaces. Having one document (e.g., a whole book) per line as in /shared/gutenberg/all_nonl makes it possible for a single WMR mapper call to operate on an entire document instead of only one line from a document.

      • /shared/gutenberg/all_n -- like all, except each line is preceded by a line number within that file, followed by a TAB character. In a mapper, the line number is assigned to the first argument (key), and the line content is assigned to the second argument (value).

      Each of these three directories contains the following subdirectories; the number after each name indicates how many files (one book per file) that subdirectory holds.

      group1    2163 
      group2     735  
      group3     593  
      group4     630  
      group5     819  
      group6     830  
      group7     667  
      group8     541  
      group9     691  
      group10    2018 
      group11     293  
      

      Note for Fall 2016: Since the cumulus cluster is small, the larger data sets may not work well. It's safer to use a group such as group8 or group11 than a larger group such as group1 or group10.

      Make a test run (for example, choose the HDFS directory /shared/gutenberg/all/group11 as input, with your word-count mapper and reducer), and record the job number and your observations in README. Be sure to label your README entry well, so you can reuse the output from this data in future runs.

  4. The word-count code produces a list that is sorted by words. Use the output from this program for a novel (or complete works) file as input for a second map-reduce cycle, in order to cause the list to be sorted by frequency instead of by word. Use the filename swap_mapper.ext for your mapper and id_reducer.ext for your reducer, where ext is a file extension appropriate to your language (e.g., cpp for C++ code or py for Python 3 code).

    Notes:

    • WMR has a feature for using the output from one run as input for a next run; or, you can use a saved data-set name from a prior step.

    • The mapper for this second map-reduce cycle will receive key-value pairs (the values will be non-empty, this time). Your mapper code should emit a new pair that interchanges the role of the key and the value, e.g., if the input is key=the, val=15, then you should emit key=15, val=the. The shuffler will sort these by key, and the reducer can simply emit the same key-value pairs it receives.

    • In the second cycle, use numerical sorting in the shuffle stage (selected at the top of the WMR job submission page).

    • The reducer for this program is called an identity reducer, since the key-value pairs are unchanged. Note that your identity reducer will have to iterate over the reducer's stream of values in order to regenerate those pairs (a sketch of both files follows these notes).

    • One could also have an identity mapper that emits the same key-value pair it receives as arguments, which is useful in other contexts. In particular, you can examine an HDFS data set within WMR by using a map-reduce cycle with an identity mapper and an identity reducer.
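
    A minimal Python 3 sketch of this second cycle, under the same assumed WMR conventions as the word-count sketch above (here the word arrives as the key and its count as the value):

    # swap_mapper.py -- interchange key and value so the shuffler sorts by count
    def mapper(key, value):
        Wmr.emit(value, key)

    # id_reducer.py -- identity reducer: re-emit each incoming key-value pair
    # (an identity mapper, mentioned in the last note above, would similarly
    # just call Wmr.emit(key, value))
    def reducer(key, values):
        for value in values:
            Wmr.emit(key, value)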

    Test your code with small data using the testing interface, then run it on a novel or the complete works of Shakespeare. Note: You can enter key-value pairs by hand, using a tab to separate key from value. Be sure to make a copy of the final versions of your mapper and reducer in your lab22 directory. Record a labelled job number for your run and your observations in README.

    Then save your work in a commit. Note: Use your chosen extension such as cpp or py for ext.

    link%  git add README swap_mapper.ext id_reducer.ext
    link%  git commit -m "lab22: Frequency order"
    

  5. The given word-count mappers and reducers break a line up into words using whitespace characters as delimiters, but they make no attempt to remove punctuation or to count capitalized words the same as lower-case words. Thus the, The, and "The, are all counted as different words.

    Modify a word-count mapper to lower the case of all letters in a word and to strip punctuation from both ends of a word, before emitting that word's key-value pair. Thus, a short line

    "Whose book is this?" "Fred's," I said.
    
    should emit key-value pairs
    whose   1
    book    1
    is      1
    this    1
    fred's  1
    i       1
    said    1
    
    Use the name wclower_mapper.ext for your mapper code, and be sure to save a final version of that code in your lab22 directory.

    Notes:

    • Python and Java have convenient methods for these operations -- look them up if you use those languages (a Python sketch follows these notes).

    • C++ has a tolower() library function for lowering the case of a single character. You can use a WMR library function for removing non-letters from either end of a word, or program such a function yourself, using loops with character-by-character operations.

      You're welcome to apply other C++ approaches you may be aware of...

    • Only the mapper needs to be changed, in order to produce key-value pairs without capitalization or punctuation. The same reducer wc_reducer.ext as before will correctly tally frequencies of those "normalized" words.
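
    For example, a Python 3 sketch of such a mapper, under the same assumed WMR conventions as above; it also assumes standard-library imports work in WMR and uses string.punctuation as one possible definition of "punctuation":

    # wclower_mapper.py -- lower-case each word and strip punctuation from both ends
    import string

    def mapper(key, value):
        for word in key.split():
            word = word.lower().strip(string.punctuation)
            if word:                      # skip tokens that were entirely punctuation
                Wmr.emit(word, '1')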

    Test your code with small data using the testing interface, then run it on a novel or the complete works of Shakespeare. Record labelled job number for your run and observations in README.

    Then save your work in a commit.

    link%  git add README wclower_mapper.ext
    link%  git commit -m "lab22: Lower case and remove punctuation in mapper"
    


  6. Our algorithm for counting words emits every word in a line of input as key, and 1 as value. If a word appears more than once in a line, a separate key-value pair is emitted for each occurrence. An alternative would be to count the number of occurrences of each word in an input line, then emit only one key-value pair per word, with a value that may be greater than 1. With our sample data, where most lines have few repeated words, there is not much computation saved by doing this subtotalling, but with other data (e.g., the data sets /shared/gutenberg/all_nonl/groupn which have one book per line), pre-tallying before the shuffle stage may save a substantial amount of data movement and relatively costly I/O operations. Therefore, Hadoop provides a way to apply a reduce operation as part of the mapping stage. Such pre-tallying operations are called combiners.

    WMR does not currently support combiners. Instead, implement a combiner within your mapper, as described in Section 3.1 of Lin and Dyer's book (a sketch follows the notes below).

    Notes:

    • Python and Java support associative arrays/hash tables for storing "subtotals" for particular words.

    • The C++ STL provides a templated map container that can be used as the associative array.
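
    A minimal Python 3 sketch of such a mapper, under the same assumed WMR conventions as above, using a dictionary as the associative array of per-line subtotals:

    # wccombiner_mapper.py -- tally words within one line, then emit one pair per word
    def mapper(key, value):
        counts = {}
        for word in key.split():
            counts[word] = counts.get(word, 0) + 1
        for word, count in counts.items():
            Wmr.emit(word, str(count))

    The unmodified wc_reducer.ext will still total these per-line subtotals correctly, since it simply sums whatever values arrive for each word.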

    Test your program with small data and the testing interface. Then run on a novel or other larger data set, and record job number and observations in README. Also, store your mapper using the name wccombiner_mapper.ext in your lab22 directory.

    Then save your work in a commit.

    link%  git add README wccombiner_mapper.ext
    link%  git commit -m "lab22: Combiner in the mapper to subtotal words"
    

  7. (Note: Similar exercises to this could be carried out using MovieLens data (/shared/MovieLens2) instead of Netflix data.)

    The Netflix Prize data was assembled by Netflix for a contest that encouraged programmers to come up with better algorithms for predicting movie ratings. (A $1 million prize was offered and won for a significantly improved solution.) The data is stored in the following directories on the HDFS:

    • /shared/netflix/test contains all the ratings of 999 movies

    • /shared/netflix/all contains 100,000,000 movie ratings, involving about 18,000 movies rated by 480,000 users

    Each line of these files contains one movie rating in the following format:

    movieid,userid,rating,date
    
    Example:
    1629,1488844,3,2005-11-16
    
    The ratings are on a scale of 1 to 5.

    Write two WMR applications for this part:

    1. A map-reduce application to sort userid values by number of ratings. Use the file names nf_ucount_mapper.ext and nf_ucount_reducer.ext for saving your new code.

      This will require counting just the user ids that appear on data lines, plus a second map-reduce cycle for sorting. The file names above are for the first map-reduce cycle. You can reuse swap_mapper.ext and id_reducer.ext for the second map-reduce cycle -- be sure to select numerical sorting for the shuffle stage.

    2. A map-reduce application to determine mean movie ratings and print a list of movieids sorted by rating. Use the names nf_movierate_mapper.ext and nf_movierate_reducer.ext for saving your new code.

    Notes for (b):

    • Think of what to emit for the key and what for the value in a key-value pair proceeding from one line of the Netflix data.

    • Your reducer must calculate both the sum of the ratings for a particular movie and the count of those ratings, in order to compute the mean (average); a sketch follows these notes.

    • Use a second map-reduce cycle to sort the data numerically by rating.
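
    Putting these notes together, here is a Python 3 sketch of the first cycle for part (b), under the same assumed WMR conventions as above (each comma-separated data line arrives as the key with an empty value); the second, sorting cycle can reuse swap_mapper.ext and id_reducer.ext with numerical sorting selected:

    # nf_movierate_mapper.py -- emit (movieid, rating) for each data line
    def mapper(key, value):
        movieid, userid, rating, date = key.split(',')
        Wmr.emit(movieid, rating)

    # nf_movierate_reducer.py -- compute the mean rating for one movie
    def reducer(key, values):
        total = 0.0
        count = 0
        for rating in values:
            total += float(rating)
            count += 1
        Wmr.emit(key, str(total / count))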

    Debug using the testing interface, entering sample data by hand. Then test on one of the Netflix data sets. Record your job ids and data set locations for all WMR passes in README, together with any observations.

    Save your work in a commit.

    link%  git add README nf_{ucount,movierate}_{mapper,reducer}.ext
    link%  git commit -m "lab22: Netflix data userid counts and movie mean ratings"
    

  8. Collocations of words. If two words occur near each other, they are described as collocated words. Computational linguists may have an interest in various types of collocation in a corpus or body of text, for example, words that appear immediately next to each other, or words that appear in the same syntactic unit (e.g., sentence).

    Write a map-reduce application that prints consecutive word pairs, ordered by their frequencies. For example, if the input is

    the cat in the cat hat
    
    then the output might be
    1       cat|in
    1       in|the
    1       cat|hat
    2       the|cat
    
    Use the pairs pattern described in the text for this part. Thus, a mapper could emit a key-value pair (the|cat, 1) for the first word pair in the example above.

    Notes:

    • This will require two map-reduce passes (i.e., cycles). Use the names colloc1_mapper.ext and colloc1_reducer.ext for your first-pass code, and use colloc2_mapper.ext and colloc2_reducer.ext for your second pass.

    • Add documentation to your code files indicating what each accomplishes, and referring to the other code files, so someone would know all the files needed for this operation if they saw the documentation in one of the files. Also identify the four code files in README.
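
    A Python 3 sketch of a first-pass mapper in the pairs pattern, under the same assumed WMR conventions as above; the first-pass reducer can total counts exactly as wc_reducer does, and the second pass can reuse the swap/identity approach from an earlier step, with numerical sorting selected:

    # colloc1_mapper.py -- emit (pair, 1) for each consecutive word pair in the line
    def mapper(key, value):
        words = key.split()
        for first, second in zip(words, words[1:]):
            Wmr.emit(first + '|' + second, '1')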

    Debug using the testing interface, then test on a large novel or the complete works of Shakespeare. Record your code file names, job ids, and data set location in README, together with any observations.

    Then save your work in a commit.

    link%  git add README colloc[12]_{mapper,reducer}.ext
    link%  git commit -m "lab22: Frequency of collocated words"
    
    Note that the shell expands the notation [12] the same way it would expand {1,2}.

Further exercises (Optional)

Optionally try one or more of the following additional exercises.

  1. Carry out the word-pair exercise, except using the stripes pattern described in the text. Thus, if the input is

      
    the cat and the dog fought over the dog bone
    
    a mapper might emit a key-value pair (the, {cat:1, dog:2}), in which the value indicates that the word cat occurred once (immediately) after the word the, and the word dog occurred twice after the.

    Use the name colloc_stripe_mapper.ext for the new mapper (a replacement for colloc1_mapper.ext).

    Notes:

    • Include documentation in colloc_stripe_mapper.ext that refers to the other code files in the computation, as for the file colloc1_mapper.ext.

    • Use associative arrays to accumulate the information. For example, an associative array A for the word the could increment the element A[dog] when the collocated words the dog are encountered.

      Associative arrays are built into Python (dictionaries), and are available as a type Map<String, String> in Java, or as an STL templated class map in C++.

    • You can store the associative arrays of collocation counts in an "outer" associative array W, having one element for each first word among collocated pairs. Thus W[the] would hold the associative array for the.

    • After processing a line of input, a mapper will emit the "stripe" pairs it has found (such as (the, {cat:1, dog:2})). In order to do that, the mapper must have a way of printing out the values in the associative array for the word the. Look for features for output of an associative array in your language; for example, Python dictionaries can simply be printed, and Java Map objects have a toString() method.

    • Also, a reducer must be able to parse a value representing an associative array (such as {cat:1, dog:2} for the word the), process the information it contains, then produce key-value pairs with combined frequency counts for each collocation (such as (the|cat, n), where n is the count of all occurrences, from all original input lines, of the word the followed by the word cat).
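
    One possible Python 3 shape for such a mapper, under the same assumed WMR conventions as above; the stripe is emitted here as a Python dictionary literal, so the matching reducer would need to parse that string back (for example, with ast.literal_eval):

    # colloc_stripe_mapper.py -- emit one stripe per distinct first word in the line
    def mapper(key, value):
        words = key.split()
        stripes = {}                          # "outer" associative array W
        for first, second in zip(words, words[1:]):
            stripe = stripes.setdefault(first, {})
            stripe[second] = stripe.get(second, 0) + 1
        for first, stripe in stripes.items():
            Wmr.emit(first, str(stripe))      # e.g. ('the', "{'cat': 1, 'dog': 2}")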

    Debug using the testing interface, then test on a large novel or complete works of Shakespeare. Record your job id and data set location in README, together with any observations.

    Then save your work in a commit.

    link%  git add README colloc_stripe_mapper.ext
    link%  git commit -m "lab22: Frequency of collocated words"
    
  2. Modify your Netflix processing program (part (b)) to track how movie ratings change over time, by producing a list of movies and their average ratings per day, sorted first by date then by average rating on that date. Example output:

    2005-11-16  4.8, 1629
    2005-11-16  4.78, 1643
    2005-11-16  4.78, 11
    2005-11-16  4.76, 8001
    ...
    2005-11-16  1.1, 384
    2005-11-17  5.0, 2288
    ...
    
    Use the name nf_movierate_perday1_mapper.ext for the first-pass mapper in this computation, etc.

    Notes:

    • To accomplish this, add the date to the key for the first pass of the mean rating computation, so that mean rating will be computed per movie per day. Thus, the key for the first mapper will consist of date and movie, e.g., 2005-11-16|1629, and the value will consist of a rating.

    • In the reducer for the first pass, compute average rating for a date|movieid, then emit a pair whose key is date|rating and whose value is a movieid. Sort alphabetically in a second pass; since the dates have uniform lengths and the ratings all have a floating-point format that begins with x. for some digit x, Hadoop's shuffle phase will sort the data as desired.

    • The reducer in the second (sorting) pass can reformat each key and value to produce the desired output format.
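
    Putting these notes together, here is a Python 3 sketch of the first pass, under the same assumed WMR conventions as above. The reducer name follows the nf_movierate_perday1_* pattern, and the two-decimal formatting of the average is only one possible choice:

    # nf_movierate_perday1_mapper.py -- the key combines date and movie id
    def mapper(key, value):
        movieid, userid, rating, date = key.split(',')
        Wmr.emit(date + '|' + movieid, rating)

    # nf_movierate_perday1_reducer.py -- average the ratings for one (date, movie) pair,
    # then emit key date|rating with the movie id as the value
    def reducer(key, values):
        total = 0.0
        count = 0
        for rating in values:
            total += float(rating)
            count += 1
        date, movieid = key.split('|')
        Wmr.emit(date + '|' + ('%.2f' % (total / count)), movieid)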

    Debug using the testing interface, entering sample data by hand. Then test on one of the Netflix data sets. Record your job ids and data set locations for all WMR passes in README, together with any observations.

    Save your work in a commit.

    link%  git add README nf_movierate_perday*_{mapper,reducer}.ext
    link%  git commit -m "lab22: Netflix movie mean ratings per day"
    

  3. Explore alternate solutions to the date-sorted ratings problem, for example using associative arrays or alternate approaches to secondary sort.

Deliverables

All of your code for this assignment should already be contained in commits. Modify the most recent commit message to indicate that you are submitting the completed assignment.

link%  git commit --amend

Add the following to the latest commit message.

    lab22: complete

If your assignment is not complete, indicate your progress instead, e.g.,

    lab22: items 1-5 complete, item 6 partial

You can make later commits to submit updates.

Finally, pull/push your commits in the usual way.

link%  git pull origin master
link%  git push origin master


Files: README wc_mapper.py wc_reducer.py wc_mapper.cpp wc_reducer.cpp swap_mapper.ext id_reducer.ext wclower_mapper.ext wccombiner_mapper.ext nf_ucount_mapper.ext nf_ucount_reducer.ext nf_movierate_mapper.ext nf_movierate_reducer.ext colloc1_mapper.ext colloc1_reducer.ext colloc2_mapper.ext colloc2_reducer.ext

Optional: colloc_stripe_mapper.ext nf_movierate_perday*_{mapper,reducer}.ext

Submit your work on this lab using git. Assuming that you did all your work on a Link computer, include the word Link in your commit message.

%  cd ~/PDC
%  git pull origin master
%  git add -A
%  git commit -m "Lab 22 submit (Link)"
%  git push origin master

Likewise, if you worked on both Link machines and another computer, you can rename your lab22 folders and submit each, with commit messages indicating the computer on which each commit was created.

Also, fill out this form to report on your work.

This lab is due by Saturday, November 12, 2016.