Hadoop programming
CS 300 (PDC)
Recall the three stages of a map-reduce computation.
Hadoop implements each of the three stages in substeps, and several of the substeps may be customized.
Map stage
Shuffle stage
Partitioner, which determines which reducers get which intermediate key-value pairs.
Moving the intermediate data across the network (shuffling)
Sorting by key (performed simultaneously with shuffling)
Reducer stage
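To make the three stages concrete, here is a minimal sketch in plain Java of a word count that passes through a map stage, a shuffle stage (partition, then sort by key), and a reduce stage. It runs in memory and uses no Hadoop classes: the class name StagesSketch and its methods are made up for illustration, and the hash-based partitioning is only an assumption about how pairs might be assigned to reducers.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; hypothetical names, no Hadoop classes.
final class StagesSketch {

    // Map stage: emit one (word, 1) pair for every word of every input line.
    static List<Map.Entry<String, Integer>> mapStage(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle stage: pick a reducer for each pair (partitioning), then group
    // the values by key. A TreeMap keeps each reducer's keys in sorted order.
    static List<TreeMap<String, List<Integer>>> shuffleStage(
            List<Map.Entry<String, Integer>> pairs, int numReducers) {
        List<TreeMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> pair : pairs) {
            int reducer = (pair.getKey().hashCode() & Integer.MAX_VALUE) % numReducers;
            partitions.get(reducer)
                      .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
        }
        return partitions;
    }

    // Reduce stage: for one reducer's partition, sum the values of each key.
    static TreeMap<String, Integer> reduceStage(TreeMap<String, List<Integer>> partition) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : partition.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Run all three stages and merge the reducers' outputs for display.
    static TreeMap<String, Integer> run(List<String> lines, int numReducers) {
        TreeMap<String, Integer> all = new TreeMap<>();
        for (TreeMap<String, List<Integer>> partition
                : shuffleStage(mapStage(lines), numReducers)) {
            all.putAll(reduceStage(partition));
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c"), 2));
    }
}
```

In a real Hadoop job the three stages run on different machines and the shuffle moves data over the network; the sketch only mirrors the order of the steps.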
______
Input format (InputFormat)
Some examples: TextInputFormat (default), KeyValueTextInputFormat
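The difference between these two input formats lies in how each line of input becomes a key-value pair for the mapper. The sketch below imitates that behavior in plain Java, with no Hadoop classes: with TextInputFormat the key is the byte offset of the line and the value is the line itself, while KeyValueTextInputFormat splits each line at the first tab (its default separator). The class and method names are hypothetical, and the sketch assumes UTF-8 text with "\n" line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java imitation of two input formats; hypothetical names, no Hadoop classes.
final class InputFormatSketch {

    // Like TextInputFormat: key = byte offset where the line starts,
    // value = the text of the line.
    static List<Map.Entry<Long, String>> textRecords(String contents) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : contents.split("\n")) {
            records.add(new SimpleEntry<>(Long.valueOf(offset), line));
            // Advance past the line's bytes plus one byte for the '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    // Like KeyValueTextInputFormat: split the line at the first tab.
    // If there is no tab, the whole line is the key and the value is empty.
    static Map.Entry<String, String> keyValueRecord(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        System.out.println(textRecords("ab\ncde\n"));
        System.out.println(keyValueRecord("k\tv"));
    }
}
```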
Mapper step (Mapper)
It's helpful to think of the mapper step in terms of its four interface types Kin, Vin, Kout, Vout, where K represents the key, V represents the value, and the subscripts in and out distinguish data coming into the map() method from data going out of it. Some common Java types:
______
______
______
______
______
______
______
______
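The four type parameters of the mapper step can be illustrated with a small generic interface. This is a plain-Java sketch, not Hadoop's own Mapper class: SimpleMapper and WordCountMapper are hypothetical names, and ordinary Java types (Long, String, Integer) stand in for the Writable wrappers such as LongWritable, Text, and IntWritable that an actual Hadoop mapper would typically use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Plain-Java illustration of Kin, Vin, Kout, Vout; hypothetical names only.
final class MapperTypesSketch {

    // Stand-in for a mapper, parameterized by the four interface types.
    interface SimpleMapper<KIN, VIN, KOUT, VOUT> {
        // Receives one incoming (key, value) pair and may emit any number
        // of outgoing pairs through the emit callback.
        void map(KIN key, VIN value, BiConsumer<KOUT, VOUT> emit);
    }

    // Word count: Kin = Long (line offset), Vin = String (line text),
    // Kout = String (word), Vout = Integer (count of 1 per occurrence).
    static final class WordCountMapper
            implements SimpleMapper<Long, String, String, Integer> {
        @Override
        public void map(Long offset, String line, BiConsumer<String, Integer> emit) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    emit.accept(word, 1);
                }
            }
        }
    }

    // Run the mapper on one line and collect the emitted pairs as strings.
    static List<String> demo(String line) {
        List<String> out = new ArrayList<>();
        new WordCountMapper().map(0L, line, (k, v) -> out.add(k + "=" + v));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(demo("to be to"));
    }
}
```

Note that the incoming key (the offset) is ignored here, which is common for mappers that read plain text lines.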
Partitioner (Partitioner)
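A partitioner maps each intermediate key to a reducer number between 0 and the number of reducers minus 1. The sketch below shows the hash-based rule in plain Java; it follows the formula used by Hadoop's HashPartitioner, but the class PartitionerSketch itself is hypothetical. Because Hadoop key types such as Text compute hashCode() differently from java.lang.String, the partition numbers in a real job would differ from the ones this sketch produces.

```java
// Plain-Java illustration of hash partitioning; hypothetical class name.
final class PartitionerSketch {

    // Clear the sign bit so the result is never negative, then take the
    // remainder modulo the number of reducers.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> reducer " + getPartition(key, 4));
        }
    }
}
```

The important property is that equal keys always go to the same reducer, so a single reduce() call sees all the values for its key.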
Reducer (Reducer)
OutputFormat (OutputFormat)
______
Hadoop types
DDL Type     C++ Type      Java Type
boolean      bool          boolean
byte         int8_t        byte
int          int32_t       int
long         int64_t       long
float        float         float
double       double        double
ustring      std::string   java.lang.String
buffer       std::string   org.apache.hadoop.record.Buffer
class type   class type    class type
vector       std::vector   java.util.ArrayList
map          std::map      java.util.TreeMap
Input file format: set with JobConf.setInputFormat()
Default mapper and reducer classes:
______
______
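As a rough illustration of what default classes do, the sketch below assumes the defaults are identity classes (in the older org.apache.hadoop.mapred API these would be IdentityMapper and IdentityReducer), which pass their input through unchanged. It is plain Java with hypothetical names, not the Hadoop classes themselves.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of identity map and reduce; hypothetical names.
final class IdentitySketch {

    // Identity map: emit the incoming pair unchanged.
    static <K, V> List<Map.Entry<K, V>> identityMap(K key, V value) {
        List<Map.Entry<K, V>> out = new ArrayList<>();
        out.add(new SimpleEntry<>(key, value));
        return out;
    }

    // Identity reduce: emit one (key, value) pair per incoming value.
    static <K, V> List<Map.Entry<K, V>> identityReduce(K key, List<V> values) {
        List<Map.Entry<K, V>> out = new ArrayList<>();
        for (V value : values) {
            out.add(new SimpleEntry<>(key, value));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(identityMap(0L, "a line of text"));
        System.out.println(identityReduce("word", Arrays.asList(1, 1, 1)));
    }
}
```

Under that assumption, a job that sets neither a mapper nor a reducer would still partition and sort its input by key during the shuffle, even though the records themselves are unchanged.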