Introduction to map-reduce computing with WebMapReduce

CS 300, Parallel and Distributed Computing (PDC)
Due Tuesday, January 20, 2015

______

Preliminary material

Laboratory exercises

  1. On a link computer, create a lab9 subdirectory for work on this lab, and change directory to that directory. Note: In many cases, the only deliverable for a step will be a README recording a job number and any observations.

  2. Enter and run the word-count code presented in class for C++ and at least one other programming language of your choice. Use the same interactively entered data for all runs. In each case, apply the test feature first, then turn off testing for actual Hadoop computation.

    Record your test data, language choices, WMR job numbers for those language choices, and any observations/feedback in lab9/README.
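    As a reference point, the shape of the word-count computation can be sketched in plain Python. The names below are illustrative, not the WMR API: WMR supplies its own emit call and performs the shuffle itself, so only the bodies of the mapper and reducer carry over; `run_job` is a local stand-in for one Hadoop cycle.

```python
from collections import defaultdict

def mapper(key, value, emit):
    # key: a line identifier (unused); value: one line of input text.
    # Emit (word, "1") for every whitespace-delimited word.
    for word in value.split():
        emit(word, "1")

def reducer(key, values, emit):
    # values: an iterator over all counts emitted for one word.
    emit(key, str(sum(int(v) for v in values)))

def run_job(lines, mapper, reducer):
    # Minimal local simulation of one map-reduce cycle: map each
    # line, group emitted values by key, then reduce each group in
    # sorted key order (Hadoop sorts keys before the reduce phase).
    shuffled = defaultdict(list)
    for i, line in enumerate(lines):
        mapper(str(i), line, lambda k, v: shuffled[k].append(v))
    results = []
    for k in sorted(shuffled):
        reducer(k, iter(shuffled[k]), lambda k2, v: results.append((k2, v)))
    return results

print(run_job(["the cat in", "the hat"], mapper, reducer))
# [('cat', '1'), ('hat', '1'), ('in', '1'), ('the', '2')]
```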

  3. HDFS data sets. Complete data from Project Gutenberg has been downloaded from the web and loaded onto the HDFS for WMR (April 2013). These data are in subdirectories of /data/gutenberg on the HDFS for Hadoop on helios. Note that this data does not reside on the helios file system itself, but only on the HDFS for Hadoop on helios. Thus, for example,

       helios $ ls /data/gutenberg
    
    will not show these files (in fact, the directory /data does not exist on the helios file system, only on the HDFS).

    That directory contains some individual books/complete works as well as directories containing the larger Project Gutenberg collection.

    1. Run the word count computation on one of the following files:

      • /data/gutenberg/AnnaKarenina.txt (complete Tolstoy novel)

      • /data/gutenberg/WarAndPeace.txt (complete Tolstoy novel)

      • /data/gutenberg/CompleteShakespeare.txt

      Note: Do not attempt these on the testing interface, which is designed for test runs and only has the capacity for small files.

      Record your WMR job numbers and observations in your README file.

    2. Also try your word-count job on a directory containing a subset of the Project Gutenberg data. Three variants of the data are available, under the following directories:

      • /data/gutenberg/all -- preserves Project Gutenberg's original line structure

      • /data/gutenberg/all_nonl -- one document per line, with the original newlines within documents replaced by spaces

      • /data/gutenberg/all_n -- like all, except each line is preceded by a line number within that file, followed by a TAB character. In a mapper, the line number is assigned to the first argument (key), and the line content is assigned to the second argument (value)

      The following list shows the subdirectories available under each of these three directories, together with the number of files in each subdirectory (one book per file):

          group1    2163 
          group2     735  
          group3     593  
          group4     630  
          group5     819  
          group6     830  
          group7     667  
          group8     541  
          group9     691  
          group10    2018 
          group11     293  
      

      Note: Having one document (e.g., a whole book) per line as in /data/gutenberg/all_nonl makes it possible for a single WMR mapper call to operate on an entire document instead of only one line from a document.

      Make a test run, and record job number and observation in README. Be sure to label your README entry well, so you can reuse the output from this data in future runs.

  4. The word-count code produces a list that is sorted by words. Use the output from this program for a novel (or complete works) file as input for a second map-reduce cycle, in order to cause the list to be sorted by frequency instead of by word.

    Hints: WMR has a feature for using the output from one run as input for a next run; or, you can use a saved data-set name from a prior step. The mapper for this second map-reduce cycle will receive key-value pairs (the values will be non-empty, this time). Your mapper code should emit a new pair that interchanges the role of the key and the value, e.g., if the input is key=the, val=15, then you should emit key=15, val=the. The reducer should simply emit the same key-value pairs it receives.

    Note: The reducer for this program is called an identity reducer, since it emits key-value pairs unchanged. You will still have to iterate over the reducer's value stream to regenerate those pairs, since several words may share the same count. One could also have an identity mapper, which is useful in other contexts.

    Test your code with small data using the testing interface, then run it on a novel or the complete works of Shakespeare. Note: You can enter key-value pairs by hand using a tab to separate key from value. Make a copy of your mapper and reducer in your lab9 directory. Use the names map_2pass.ext and red_2pass.ext for these files. (Replace ext by an extension that indicates your programming language, e.g., cpp, java, py.) Record labelled job number for your run and observations in README.
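    The second pass can be sketched in plain Python as follows (hypothetical names; the `emit` callback stands in for WMR's own emit mechanism):

```python
def swap_mapper(key, value, emit):
    # Input pairs come from the word-count output: key is a word,
    # value is its count. Swapping them makes Hadoop sort by count.
    emit(value, key)

def identity_reducer(key, values, emit):
    # Re-emit every pair unchanged. The value stream must still be
    # iterated, because several words can share the same count.
    for v in values:
        emit(key, v)

out = []
swap_mapper("the", "15", lambda k, v: out.append((k, v)))
identity_reducer("1", iter(["cat", "hat"]), lambda k, v: out.append((k, v)))
print(out)  # [('15', 'the'), ('1', 'cat'), ('1', 'hat')]
```

    One caveat: the counts sort as strings, so for example "9" sorts after "15"; zero-padding the count in the mapper is one way to get a true numeric order.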

  5. The given word-count mappers and reducers break a line up into words using whitespace characters as delimiters, but they make no attempt to remove punctuation or to count capitalized words the same as lower-case words. Thus, the three tokens the, The, and "The, are all counted as different words.

    Modify a word-count mapper to lower the case of all letters in a word and to strip punctuation from both ends of a word, before emitting that word's key-value pair. Thus, a short line

    "Whose book is this?" "Fred's," I said.
    
    should emit key-value pairs
    whose	1
    book	1
    is	1
    this	1
    fred's	1
    i	1
    said	1
    

    Notes:

    Test your code with small data using the testing interface, then run it on a novel or the complete works of Shakespeare. Make a copy of your mapper in your lab9 directory. Use the name map_strip.ext for this file. Record labelled job number for your run and observations in README.
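    One way to sketch the cleaning logic in Python (the `emit` callback is a stand-in for WMR's emit mechanism) uses `str.lower` together with `str.strip` and the standard `string.punctuation` set:

```python
import string

def strip_mapper(key, value, emit):
    # Lower-case each word and strip punctuation from both ends;
    # interior punctuation (as in "fred's") is preserved.
    for word in value.split():
        cleaned = word.lower().strip(string.punctuation)
        if cleaned:  # drop tokens that were nothing but punctuation
            emit(cleaned, "1")

out = []
strip_mapper("", '"Whose book is this?" "Fred\'s," I said.',
             lambda k, v: out.append(k))
print(out)  # ['whose', 'book', 'is', 'this', "fred's", 'i', 'said']
```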

  6. WMR does not currently support combiners. Instead, implement a combiner within your mapper, as described in Section 3.1 of Lin and Dyer's book.

    Notes:

    Test your program with small data and the testing interface. Then run on a novel or other larger data set, and record job number and observations in README. Also, store your mapper using the name map_comb.ext in your lab directory.
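    A minimal sketch of the idea in Python (illustrative names; the `emit` callback stands in for WMR's emit mechanism):

```python
from collections import Counter

def combining_mapper(key, value, emit):
    # In-mapper combining (Lin & Dyer, Section 3.1): tally counts in
    # a local associative array and emit one pair per distinct word,
    # reducing the intermediate data shuffled to the reducers.
    counts = Counter()
    for word in value.split():
        counts[word] += 1
    for word, n in sorted(counts.items()):
        emit(word, str(n))

out = []
combining_mapper("", "the cat the hat the", lambda k, v: out.append((k, v)))
print(out)  # [('cat', '1'), ('hat', '1'), ('the', '3')]
```

    Note that full in-mapper combining accumulates state across all calls handled by one mapper task; since WMR invokes the mapper once per line, per-call combining pays off most on data with one whole document per line, such as /data/gutenberg/all_nonl.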

  7. (Note: Similar exercises to this could be carried out using MovieLens data (/data/MovieLens2) instead of Netflix data.)

    The Netflix Prize data was assembled by Netflix for a contest to encourage programmers to come up with more accurate programs for predicting movie ratings. (A $1 million prize was offered for a significantly more accurate recommendation algorithm.) The data is stored in the following directories on the HDFS:

    Each line of these files contains one movie rating in the following format:

      movieid,userid,rating,date
    
    Example
      1629,1488844,3,2005-11-16
    
    The ratings are on a scale of 1 to 5.

    Write two WMR applications for this part:

    1. A map-reduce application to sort userid values by number of ratings. (This will require counting how many times each userid appears, plus a second map-reduce pass for sorting by those counts.)

    2. A map-reduce application to determine the mean rating of each movie and print a list of movieids sorted by mean rating.

    Notes for (b):

    Debug using the testing interface, entering sample data by hand. Then test on one of the Netflix data sets. Record your job ids and data set locations for all WMR passes in README, together with any observations.
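    The first pass of part (b) can be sketched in plain Python (hypothetical names; the `emit` callback stands in for WMR's emit mechanism):

```python
def rating_mapper(key, value, emit):
    # Each line has the form movieid,userid,rating,date;
    # emit (movieid, rating) so one reducer sees all of a
    # movie's ratings.
    movieid, userid, rating, date = value.split(",")
    emit(movieid, rating)

def mean_reducer(key, values, emit):
    # Average all ratings received for one movieid.
    total = count = 0
    for v in values:
        total += int(v)
        count += 1
    emit(key, "%.2f" % (total / count))

out = []
rating_mapper("", "1629,1488844,3,2005-11-16", lambda k, v: out.append((k, v)))
mean_reducer("1629", iter(["3", "4", "5"]), lambda k, v: out.append((k, v)))
print(out)  # [('1629', '3'), ('1629', '4.00')]
```

    A second pass analogous to part 4 above (swap key and value, identity reducer) then yields the movieids sorted by mean rating.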

Further exercises (Optional)

Optionally try one or more of the following additional exercises.

  1. Collocations of words. If two words occur near each other, they are described as collocated words. Computational linguists may have an interest in various types of collocation in a corpus or body of text, for example, words that appear immediately next to each other, or words that appear in the same syntactic unit (e.g., sentence).

    Write a map-reduce application that prints consecutive word pairs, ordered by their frequencies. For example, if the input is

       the cat in the cat hat
    
    then the output might be
    1	cat|in
    1	in|the
    1	cat|hat
    2	the|cat
    
    Use the pairs pattern described in the text for this part. Thus, a mapper could emit a key-value pair (the|cat, 1) for the first word pair in the example above.

    Notes:

    Debug using the testing interface, then test on a large novel or complete works of Shakespeare. Record your job id and data set location in README, together with any observations.
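    The pairs-pattern mapper can be sketched in plain Python (illustrative names; the `emit` callback stands in for WMR's emit mechanism):

```python
def pairs_mapper(key, value, emit):
    # Pairs pattern: emit one (w1|w2, 1) pair for every pair of
    # consecutive words in the line.
    words = value.split()
    for w1, w2 in zip(words, words[1:]):
        emit(w1 + "|" + w2, "1")

out = []
pairs_mapper("", "the cat in the cat hat", lambda k, v: out.append(k))
print(out)  # ['the|cat', 'cat|in', 'in|the', 'the|cat', 'cat|hat']
```

    The ordinary word-count reducer then sums the counts for each pair, and a swap pass as in part 4 of the main exercises orders the pairs by frequency.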

  2. Carry out the word-pair exercise, except using the stripes pattern described in the text. Thus, if the input is

      
       the cat and the dog fought over the dog bone
    
    a mapper might emit a key-value pair (the, {cat:1, dog:2}), in which the value indicates that the word cat occurred once (immediately) after the word the, and the word dog occurred twice after the.

    Notes:

    Debug using the testing interface, then test on a large novel or complete works of Shakespeare. Record your job id and data set location in README, together with any observations.
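    The stripes pattern can be sketched in plain Python as follows (illustrative names; the `emit` callback stands in for WMR's emit mechanism, and the dict-valued stripes are for local illustration only; in a real WMR run the stripe would need to be serialized to a string, e.g. "cat:1,dog:2", in the mapper and parsed back in the reducer, since WMR values are text):

```python
from collections import Counter

def stripes_mapper(key, value, emit):
    # Stripes pattern: build, for each left-hand word, an associative
    # array counting the words that follow it, and emit one
    # (word, stripe) pair per distinct left-hand word.
    stripes = {}
    words = value.split()
    for w1, w2 in zip(words, words[1:]):
        stripes.setdefault(w1, Counter())[w2] += 1
    for w, stripe in stripes.items():
        emit(w, dict(stripe))

def stripes_reducer(key, values, emit):
    # Element-wise sum of all stripes received for one word.
    total = Counter()
    for stripe in values:
        total.update(stripe)
    emit(key, dict(total))

out = {}
stripes_mapper("", "the cat and the dog fought over the dog bone",
               lambda k, v: out.update({k: v}))
print(out["the"])  # {'cat': 1, 'dog': 2}
```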

  3. Modify your Netflix processing program (part (b)) to track how movie ratings change over time, by producing a list of movies and their average ratings per day, sorted first by date then by average rating on that date. Example output:

       2005-11-16  4.8, 1629
       2005-11-16  4.78, 1643
       2005-11-16  4.78, 11
       2005-11-16  4.76, 8001
       ...
       2005-11-16  1.1, 384
       2005-11-17  5.0, 2288
       ...
    

    Notes:

  4. Alternate solutions to date-sorted ratings, e.g., using associative arrays, or alternate approaches to secondary sort.

Deliverables

Submit your work using the following command on your link machine.

$ cd ~
$ labsubmit.cs300 lab9 

This lab is due by Tuesday, January 20, 2015.