In-class notes for 04/02/2014 (CS 121B (CS1), Spring 2014)

In-class notes for 04/02/2014

CS 121B (CS1), Spring 2014

Homework assignment
Quiz today

Submitted questions on assignments and technology

String formatting examples (omit the -->)

def moInterest(princ, rate):
    return princ*rate/(12*100)

moInterest(842.57, 7.8)

print("The monthly interest due on a loan with remaining principal")
print("${:4.2f} at an annual interest rate of {:3.1f}% is ${:4.2f}".format( 
    842.57, 7.8, moInterest(842.57, 7.8)))

oneGene()

Upcoming

Homework 20

Submitted questions on readings

Intro to WebMapReduce

Motivation: e.g., photos, Facebook data
Units of quantity:
- K (kilo-), about 1000
- M (mega-), about 1,000,000 (million)
- G (giga-), about 1,000,000,000 (billion)
- ...
Other examples: Google search, email, maps
How can computations be performed with such large data?
WebMapReduce (WMR) is a general-purpose framework for very large-scale computing.
WMR is a simplified interface for software called Hadoop, used by Facebook and other companies for their Data-Intensive Scalable Computing

Map-reduce computing

Basic idea of map-reduce computing:
1. Break data up into labelled pieces (mapper())
2. Gather and combine pieces with the same label (reducer())
Example: For computing word frequencies, we
1. break up each line of text into words, and emit a labelled piece (key = word, value = '1') for each word found, then
2. gather all the '1's for each word, and add up those '1's as integers to determine the frequency for that word.
Stages of computation:
- mapper stage: start with one line of the original data; produce intermediate key-value pairs.
- shuffle stage: organize key-value pairs according to key; transmit across network to a convenient location for next stage.
- reducer stage: start with all key-value pairs for a given fixed key; produce result key-value pairs of interest.
A map-reduce implementation (such as Hadoop, which we use) takes care of details such as splitting the data and shuffling, so a map-reduce programmer needs only provide mapper and reducer algorithms.
Teaching materials

Word counting with WebMapReduce

Exercises

Exercise 1: Use WMR to compute the highest reviewer id in a netflix dataset.

Hints

A sample data set:

1,1596531,5,2004-01-23
3,2318004,4,2004-02-05
6,110641,5,2003-12-15
8,1447639,4,2005-03-27
8,2557899,5,2005-04-16
6,52076,4,2004-10-05
1,13651,3,2004-06-16
1,1374216,2,2005-09-23

Spec for mapper:

# IN format
#    key is a netflix record   value is empty string
#    NOTE: netflix record format is   movieID,reviewerID,rating,date
# OUT format
#    key is empty  value is a reviewerID

Spec for reducer:

# IN format
#    key is empty   value are reviewerID
#    NOTE:  the values are packed into an interator  iter
# OUT format
#    key is maximum reviewer id  value is empty

Hints: (a) use split() with an argument; (b) use an accumulator in reducer to determine the maximum value (since reviewerIDs are non-negative, you can initialize accumulator at -1); (c) use the Test interface to debug.

Exercise 2: Compute the average ratings for all movies

You can use the same data set as above. For that data set, the results should be:
```
        1	3.3333
        3	4
        6	4.5
        8	4.5
```

mapper() specs

# IN format
#    key is a netflix record   value is empty string
#    NOTE: netflix record format is   movieID,reviewerID,rating,date
# OUT format
#    key is a movieID  value is a rating

reducer() specs

# IN format
#    key is a movieID  values are ratings
# OUT format
#    key is a movieID  value is the mean movie rating for that movie

Hints: (a) Use two accumulators in reducer, one for sum and one for count, in order to compute the mean

Large netflix data sets on WMR system (cluster paths):
```
/shared/netflix/test -- all ratings on 100 movies
/shared/netflix/all -- all ratings on all movies (don't use in class!)
```
Note: Don't use Test computation with large data sets. The Test system will cut off the data after a certain number of characters, and it may lead to an incorrectly formatted record at the point of the cut.

< >