Intel Threading Building Blocks
CS 300 (PDC)
C++ template library for multicore programming. No compiler modifications are necessary; TBB can be used together with OpenMP, etc.
Provides algorithms and data structures at a higher (more abstract) level than threads programming, designed for scalability and performance.
Parallel operations are treated as "tasks," which are allocated to cores dynamically at runtime, with automatic, efficient use of cache.
Example: trap_tbb.cpp, compared to trap_omp.cpp
Big picture: parallel_for() is a TBB template. It requires two arguments: a range object, representing all index values for the loop; and a user-defined body object, expressing the algorithm for that parallel loop.
Think of the range as an object that expresses a (parallel) for loop's (init; guard; progress).
Think of the body object as a "package" for the (parallel) algorithm and data structures needed for this loop.
The body type for a parallel_for call is a class that includes the following:
an operator(), for defining the operations to perform on each iteration of the loop; and
a constructor, for initializing any state variables for the parallel computation.
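For concreteness, here is a minimal sketch of what such a body class might look like, loosely in the spirit of trap_tbb.cpp; the integrand f(), the class name SumHeights, and the one-output-slot-per-index scheme are illustrative assumptions, not the actual file's contents.

    #include "tbb/blocked_range.h"
    #include "tbb/parallel_for.h"
    #include <cmath>
    #include <cstddef>

    // Hypothetical integrand; the actual f() in trap_tbb.cpp may differ.
    double f(double x) { return std::sin(x); }

    class SumHeights {
        const double a_;   // left endpoint of the interval
        const double h_;   // trapezoid width
        double* const y_;  // one output slot per index, so threads never collide
    public:
        SumHeights(double a, double h, double* y) : a_(a), h_(h), y_(y) {}

        // operator() performs the loop's work for one chunk of index values;
        // parallel_for requires it to be const.
        void operator()(const tbb::blocked_range<std::size_t>& r) const {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                y_[i] = f(a_ + i * h_);
        }
    };

    // Usage:
    //   tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
    //                     SumHeights(a, h, y));

TBB splits the blocked_range into subranges and hands each subrange to a task, so one operator() call handles many consecutive iterations.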
Note: dividing by h at the end gives a more accurate result -- why?
Module exercise, section 3 only (perform on a thing). What did you find when you ran that code?
Second example: trap_tbb2.cpp
Big picture: parallel_reduce() is a TBB template for parallelizing a loop and performing a reduction of the parallel results. It requires a range and a user-defined body object, similar to parallel_for().
The body type for a parallel_reduce call is a class that includes the following:
an operator(), for defining the operations to perform on each iteration of the loop;
a constructor, for initializing state variables for a parallel computation;
a splitting constructor, for assigning the identity value to state variables that are part of the reduce operation and copying state variables that are not part of the reduce operation; and
a join() method, specifying how to combine the partial results computed by the reduce operation on each thread.
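Here is a minimal sketch of such a body class, loosely in the spirit of trap_tbb2.cpp; f(), the class name SumHeights2, and the usage details are illustrative assumptions, not quoted from that file.

    #include "tbb/blocked_range.h"
    #include "tbb/parallel_reduce.h"
    #include <cmath>
    #include <cstddef>

    double f(double x) { return std::sin(x); }  // hypothetical integrand

    class SumHeights2 {
        const double a_, h_;  // not part of the reduction; copied on split
    public:
        double sum;           // the reduction variable

        SumHeights2(double a, double h) : a_(a), h_(h), sum(0.0) {}

        // Splitting constructor: copies a_ and h_, assigns the identity
        // value 0.0 to sum.  tbb::split is a dummy type that merely
        // distinguishes this constructor from a copy constructor.
        SumHeights2(SumHeights2& other, tbb::split)
            : a_(other.a_), h_(other.h_), sum(0.0) {}

        // operator(): accumulate this subrange's heights into sum.
        void operator()(const tbb::blocked_range<std::size_t>& r) {
            double local = sum;  // local copy -- see the note on registers below
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                local += f(a_ + i * h_);
            sum = local;
        }

        // join(): fold a completed subtask's partial sum into this one.
        void join(const SumHeights2& other) { sum += other.sum; }
    };

    // Usage:
    //   SumHeights2 body(a, h);
    //   tbb::parallel_reduce(tbb::blocked_range<std::size_t>(0, n), body);
    //   // body.sum now holds the combined result.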
OpenMP's reduction() clause is simpler to use, but TBB's parallel_reduce body object supports arbitrary reduction operations.
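For comparison, the OpenMP version of the same sum needs only a single clause, presumably similar in spirit to trap_omp.cpp (the names f, a, h, and n are carried over from the sketches above, not quoted from that file):

    double sum = 0.0;
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < n; i++)
        sum += f(a + i * h);  // OpenMP gives each thread a private sum
                              // and combines the copies with + at the end

The price of that brevity: the combining operation must come from a small fixed set of operators (such as +, *, min, max), whereas a join() method can compute anything.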
Note: The name join() comes from "fork/join", a parallel pattern.
Note to Java programmers: Compare requiring particular member functions in a C++ body class to Java interfaces.
Observe that operator() for a parallel_reduce makes local copies of the state variables (not done for parallel_for). These local copies are not necessary for the logical correctness of the code. However, they function as "hints for the compiler": since the compiler can realize that these values will not be changed by other threads, it can deduce that those variables can be implemented using CPU (core) registers instead of memory locations, for greater efficiency (even faster than using the cache).
Templated calls:
parallel_for
parallel_reduce
parallel_do
parallel_scan
pipeline
parallel_sort
(a parallelized iterative Quicksort)
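As a quick illustration of one of these calls, a minimal parallel_sort sketch (the vector and its contents are made up for the example):

    #include "tbb/parallel_sort.h"
    #include <cstddef>
    #include <vector>

    int main() {
        std::vector<double> v(1000000);
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i] = double(v.size() - i);          // descending, so there is work to do
        tbb::parallel_sort(v.begin(), v.end());   // sorts ascending by default
        return 0;
    }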
Containers
concurrent_vector
concurrent_queue
concurrent_bounded_queue
concurrent_hash_map
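A minimal sketch of one of these containers in use; the CollectEvens class and the task of gathering even numbers are illustrative, not from the course examples. concurrent_vector allows many threads to append at once without corrupting the container:

    #include "tbb/blocked_range.h"
    #include "tbb/concurrent_vector.h"
    #include "tbb/parallel_for.h"

    tbb::concurrent_vector<int> evens;

    class CollectEvens {
    public:
        void operator()(const tbb::blocked_range<int>& r) const {
            for (int i = r.begin(); i != r.end(); ++i)
                if (i % 2 == 0)
                    evens.push_back(i);  // thread-safe growth; final order unspecified
        }
    };

    // Usage:
    //   tbb::parallel_for(tbb::blocked_range<int>(0, 100), CollectEvens());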
Exceptions and cancellation
Mutual exclusion; atomic operations (see the sketch after this list)
Customizability, e.g., scheduling, partitioning, etc.
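To illustrate the mutual exclusion and atomic operations just mentioned, a minimal sketch using classic TBB's tbb::atomic and tbb::spin_mutex (the record() function is hypothetical; newer oneTBB releases drop tbb::atomic in favor of the C++ standard library's std::atomic):

    #include "tbb/atomic.h"
    #include "tbb/spin_mutex.h"

    tbb::atomic<long> hits;     // ++ below is a single atomic step, no lock needed
    tbb::spin_mutex stats_lock; // guards shared updates that take several steps

    void record(long value) {
        ++hits;  // atomic increment
        {
            // scoped_lock acquires the mutex in its constructor and
            // releases it in its destructor, even if an exception is thrown
            tbb::spin_mutex::scoped_lock lock(stats_lock);
            // ... add value to a shared structure that needs more than one step ...
        }
    }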