Hadoop programming (CS 300 (PDC))

Recall the three stages of a map-reduce computation.

Hadoop implements each of the three stages in substeps, and several of the substeps may be customized.
1. Map stage
2. Shuffle stage
  1. Partitioner, which determines which reducers get which intermediate key-value pairs.
  2. Moving the intermediate data across the network (shuffling)
  3. Sorting by key (performed simultaneously with shuffling)
3. Reduce stage
  1. Reducer
  2. Output format
  3. ______
______
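The three stages can be sketched in plain Java, without any Hadoop classes. This is only a single-machine simulation for a word count: the class name ThreeStageSketch and its method names are hypothetical, and a sorted in-memory map stands in for the network shuffle and sort.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of map, shuffle, and reduce; not Hadoop code.
class ThreeStageSketch {

    // Map stage: emit one (word, 1) pair for every word of every line.
    static List<Map.Entry<String, Integer>> mapStage(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle stage: group the values by key; the TreeMap keeps keys sorted.
    static TreeMap<String, List<Integer>> shuffleStage(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce stage: sum the list of values for each key.
    static TreeMap<String, Integer> reduceStage(TreeMap<String, List<Integer>> grouped) {
        TreeMap<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            totals.put(entry.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b a", "b c");
        System.out.println(reduceStage(shuffleStage(mapStage(lines))));
    }
}
```

In a real Hadoop job each stage runs in parallel on many nodes; the simulation only shows how the data changes shape from lines, to key-value pairs, to grouped values, to totals.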
Input format (InputFormat)
Some examples: TextInputFormat (default), KeyValueTextInputFormat
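To illustrate how the two input formats differ, here is a plain-Java sketch of the records each would hand to the mapper. It assumes the usual behavior: TextInputFormat uses the byte offset of each line as the key and the line text as the value, while KeyValueTextInputFormat splits each line at its first tab character. The class and method names are hypothetical and no Hadoop classes are involved.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java imitation of the records two input formats would produce.
class InputFormatSketch {

    // TextInputFormat style: key = byte offset of the line, value = the line.
    // Assumes lines end in a single '\n'.
    static List<Map.Entry<Long, String>> textRecords(String fileContents) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            records.add(new AbstractMap.SimpleEntry<Long, String>(offset, line));
            // Advance past the line's bytes plus one byte for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    // KeyValueTextInputFormat style: split at the first tab.
    // A line without a tab becomes the key, with an empty value.
    static Map.Entry<String, String> keyValueRecord(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new AbstractMap.SimpleEntry<String, String>(line, "");
        }
        return new AbstractMap.SimpleEntry<String, String>(
                line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        System.out.println(textRecords("hello\nworld\n"));
        System.out.println(keyValueRecord("apple\t3"));
    }
}
```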
Mapper step (Mapper)
- It's helpful to think of the mapper step in terms of its four interface types K_in, V_in, K_out, V_out, where K represents the key, V represents the value, and the subscripts in and out distinguish the data coming into the map() method from the data it emits. Some common Java types (Hadoop Writable classes):
  - Text
  - IntWritable
  - LongWritable
  - ______
  - ______
  - ______
  - ______
  - ______
  - ______
  - ______
  - ______
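The four interface types can be made concrete with a small generic interface in plain Java. This is a sketch, not Hadoop's Mapper API: SimpleMapper and Emitter are hypothetical stand-ins, and the ordinary Java types Long, String, and Integer take the place of LongWritable, Text, and IntWritable.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of a mapper described by K_in, V_in, K_out, V_out.
class MapperTypesSketch {

    // Hypothetical stand-in for the object a mapper writes its output to.
    interface Emitter<KOut, VOut> {
        void emit(KOut key, VOut value);
    }

    // Hypothetical mapper interface parameterized by the four types.
    interface SimpleMapper<KIn, VIn, KOut, VOut> {
        void map(KIn key, VIn value, Emitter<KOut, VOut> out);
    }

    // Word count: K_in = Long (line offset), V_in = String (line),
    // K_out = String (word), V_out = Integer (count of 1).
    static class WordCountMapper implements SimpleMapper<Long, String, String, Integer> {
        @Override
        public void map(Long offset, String line, Emitter<String, Integer> out) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.emit(word, 1);
                }
            }
        }
    }

    // Run the mapper on one line and collect what it emits as "key=value".
    static List<String> run(String line) {
        List<String> emitted = new ArrayList<>();
        new WordCountMapper().map(0L, line, (k, v) -> emitted.add(k + "=" + v));
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(run("to be to"));
    }
}
```

Note that the input key (the offset) is ignored here; the mapper is free to use only the parts of the incoming pair it needs.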
Partitioner (Partitioner)
Reducer (Reducer)
Output format (OutputFormat)
______
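The partitioner step listed above decides which reducer receives each intermediate key. A minimal plain-Java sketch of hash partitioning follows; the formula is, to my understanding, the one used by Hadoop's default HashPartitioner, but the class and method names here are hypothetical.

```java
// Plain-Java sketch of hash partitioning; not Hadoop code.
class HashPartitionSketch {

    // Clear the sign bit so the result is non-negative, then take the
    // remainder to pick a reducer index in the range 0..numReducers-1.
    static int partitionFor(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"apple", "banana", "apple"}) {
            System.out.println(key + " -> reducer " + partitionFor(key, 4));
        }
    }
}
```

Because the result depends only on the key, every pair with the same key lands on the same reducer, which is what allows a reducer to see all the values for its keys.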

Hadoop types


DDL Type        C++ Type            Java Type 

boolean         bool                boolean
byte            int8_t              byte
int             int32_t             int
long            int64_t             long
float           float               float
double          double              double
ustring         std::string         java.lang.String
buffer          std::string         org.apache.hadoop.record.Buffer
class type      class type          class type
vector          std::vector         java.util.ArrayList
map             std::map            java.util.TreeMap

Input file format: set with JobConf.setInputFormat()
Default mapper and reducer classes:

IdentityMapper

IdentityReducer
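To see what a job does when both defaults are left in place, the following plain-Java sketch imitates an identity mapper and an identity reducer. It is a simulation under the assumption that the identity classes pass every key-value pair through unchanged; IdentityJobSketch is a hypothetical name and no Hadoop classes are used.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of a job that uses identity map and reduce steps.
class IdentityJobSketch {

    static List<Map.Entry<Long, String>> run(List<Map.Entry<Long, String>> input) {
        // Identity map: each input pair is emitted unchanged.
        // Shuffle: group values by key, with keys kept in sorted order.
        TreeMap<Long, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<Long, String> pair : input) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        // Identity reduce: emit (key, value) once for every value of the key.
        List<Map.Entry<Long, String>> output = new ArrayList<>();
        for (Map.Entry<Long, List<String>> group : grouped.entrySet()) {
            for (String value : group.getValue()) {
                output.add(new AbstractMap.SimpleEntry<Long, String>(group.getKey(), value));
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Map.Entry<Long, String>> input = Arrays.asList(
                new AbstractMap.SimpleEntry<Long, String>(6L, "world"),
                new AbstractMap.SimpleEntry<Long, String>(0L, "hello"));
        System.out.println(run(input));
    }
}
```

The only visible effect of such a job is the one contributed by the shuffle stage: the same pairs come out, ordered by key.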
Default configuration parameters
______
______