>>     < >




In-class notes for 04/02/2014

CS 121B (CS1), Spring 2014

Submitted questions on assignments and technology

  • String formatting examples (omit the -->)

    def moInterest(princ, rate):
        return princ*rate/(12*100)
    
    moInterest(842.57, 7.8)
    
    print("The monthly interest due on a loan with remaining principal")
    print("${:4.2f} at an annual interest rate of {:3.1f}% is ${:4.2f}".format( 
        842.57, 7.8, moInterest(842.57, 7.8)))
    

  • oneGene()

Upcoming

Submitted questions on readings

Intro to WebMapReduce

  • Motivation: e.g., photos, Facebook data

  • Units of quantity:

    • K (kilo-), about 1000
    • M (mega-), about 1,000,000 (million)
    • G (giga-), about 1,000,000,000 (billion)
    • ...
  • Other examples: Google search, email, maps

  • How can computations be performed with such large data?

  • WebMapReduce (WMR) is a general-purpose framework for very large-scale computing.

  • WMR is a simplified interface for software called Hadoop, used by Facebook and other companies for their Data-Intensive Scalable Computing

Map-reduce computing

  • Basic idea of map-reduce computing:

    1. Break data up into labelled pieces (mapper())

    2. Gather and combine pieces with the same label (reducer())

    Example: For computing word frequencies, we

    1. break up each line of text into words, and emit a labelled piece (key = word, value = '1') for each word found, then
    2. gather all the '1's for each word, and add up those '1's as integers to determine the frequency for that word.
  • Stages of computation:

    • mapper stage: start with one line of the original data; produce intermediate key-value pairs.

    • shuffle stage: organize key-value pairs according to key; transmit across network to a convenient location for next stage.

    • reducer stage: start with all key-value pairs for a given fixed key; produce result key-value pairs of interest.

    A map-reduce implementation (such as Hadoop, which we use) takes care of details such as splitting the data and shuffling, so a map-reduce programmer needs only provide mapper and reducer algorithms.

  • Teaching materials

Word counting with WebMapReduce

Exercises

  • Exercise 1: Use WMR to compute the highest reviewer id in a netflix dataset.

    Hints
    • A sample data set:

      1,1596531,5,2004-01-23
      3,2318004,4,2004-02-05
      6,110641,5,2003-12-15
      8,1447639,4,2005-03-27
      8,2557899,5,2005-04-16
      6,52076,4,2004-10-05
      1,13651,3,2004-06-16
      1,1374216,2,2005-09-23
      

    • Spec for mapper:

      # IN format
      #    key is a netflix record   value is empty string
      #    NOTE: netflix record format is   movieID,reviewerID,rating,date
      # OUT format
      #    key is empty  value is a reviewerID
      

    • Spec for reducer:

      # IN format
      #    key is empty   value are reviewerID
      #    NOTE:  the values are packed into an interator  iter
      # OUT format
      #    key is maximum reviewer id  value is empty
      

    • Hints: (a) use split() with an argument; (b) use an accumulator in reducer to determine the maximum value (since reviewerIDs are non-negative, you can initialize accumulator at -1); (c) use the Test interface to debug.

  • Exercise 2: Compute the average ratings for all movies

    • You can use the same data set as above. For that data set, the results should be:

              1	3.3333
              3	4
              6	4.5
              8	4.5
      

    • mapper() specs

      # IN format
      #    key is a netflix record   value is empty string
      #    NOTE: netflix record format is   movieID,reviewerID,rating,date
      # OUT format
      #    key is a movieID  value is a rating
      

    • reducer() specs

      # IN format
      #    key is a movieID  values are ratings
      # OUT format
      #    key is a movieID  value is the mean movie rating for that movie
      

    • Hints: (a) Use two accumulators in reducer, one for sum and one for count, in order to compute the mean

  • Large netflix data sets on WMR system (cluster paths):

    /shared/netflix/test -- all ratings on 100 movies
    /shared/netflix/all -- all ratings on all movies (don't use in class!)
    

    Note: Don't use Test computation with large data sets. The Test system will cut off the data after a certain number of characters, and it may lead to an incorrectly formatted record at the point of the cut.




< >