Intel Threading Building Blocks
CS 300 (PDC)
C++ template library for multicore programming. No compiler modifications are necessary; TBB can be used together with OpenMP, etc.
Provides algorithms and data structures at a higher (more abstract) level than threads programming, designed for scalability and performance.
Parallel operations are treated as "tasks," which are allocated to cores dynamically at runtime, with automatic, efficient use of cache.
Example: trap_tbb.cpp, compared to trap_omp.cpp
Big picture: parallel_for() is a TBB template. It requires two arguments: a range object, representing all index values for the loop; and a user-defined body object, expressing the algorithm for that parallel loop.
Think of the range as an object that expresses a (parallel) for loop's (init; guard; progress).
Think of the body object as a "package" for the (parallel) algorithm and data structures needed for this loop.
The body type for a parallel_for call is a class that includes the following:
an operator(), for defining the operations to perform on each iteration of the loop; and
a constructor, for initializing any state variables for the parallel computation.
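For concreteness, here is a minimal sketch of what such a body class might look like, loosely in the spirit of trap_tbb.cpp; the integrand f(), the class name SumHeights, and the one-output-slot-per-index scheme are illustrative assumptions, not the actual file's contents.

    #include "tbb/blocked_range.h"
    #include "tbb/parallel_for.h"
    #include <cmath>
    #include <cstddef>

    // Hypothetical integrand; the actual f() in trap_tbb.cpp may differ.
    double f(double x) { return std::sin(x); }

    class SumHeights {
        const double a_;   // left endpoint of the interval
        const double h_;   // trapezoid width
        double* const y_;  // one output slot per index, so threads never collide
    public:
        SumHeights(double a, double h, double* y) : a_(a), h_(h), y_(y) {}

        // operator() performs the loop's work for one chunk of index values;
        // parallel_for requires it to be const.
        void operator()(const tbb::blocked_range<std::size_t>& r) const {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                y_[i] = f(a_ + i * h_);
        }
    };

    // Usage:
    //   tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
    //                     SumHeights(a, h, y));

TBB splits the blocked_range into subranges and hands each subrange to a task, so one operator() call handles many consecutive iterations.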
Note: dividing by h at the end gives a more accurate result -- why?
Module exercise, section 3 only (perform on a thing). What did you find when you ran that code?
Second example: trap_tbb2.cpp
Big picture: parallel_reduce() is a TBB template for parallelizing a loop and performing a reduction of the parallel results. It requires a range and a user-defined body object, similar to parallel_for().
The body type for a parallel_reduce call is a class that includes the following:
an operator(), for defining the operations to perform on each iteration of the loop;
a constructor, for initializing state variables for a parallel computation;
a splitting constructor, for assigning the identity value to state variables that are part of the reduce operation and copying state variables that are not part of the reduce operation; and
a join() method, specifying how to combine the partial results computed by the reduce operation on each thread.
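Here is a minimal sketch of such a body class, loosely in the spirit of trap_tbb2.cpp; f(), the class name SumHeights2, and the usage details are illustrative assumptions, not quoted from that file.

    #include "tbb/blocked_range.h"
    #include "tbb/parallel_reduce.h"
    #include <cmath>
    #include <cstddef>

    double f(double x) { return std::sin(x); }  // hypothetical integrand

    class SumHeights2 {
        const double a_, h_;  // not part of the reduction; copied on split
    public:
        double sum;           // the reduction variable

        SumHeights2(double a, double h) : a_(a), h_(h), sum(0.0) {}

        // Splitting constructor: copies a_ and h_, assigns the identity
        // value 0.0 to sum.  tbb::split is a dummy type that merely
        // distinguishes this constructor from a copy constructor.
        SumHeights2(SumHeights2& other, tbb::split)
            : a_(other.a_), h_(other.h_), sum(0.0) {}

        // operator(): accumulate this subrange's heights into sum.
        void operator()(const tbb::blocked_range<std::size_t>& r) {
            double local = sum;  // local copy -- see the note on registers below
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                local += f(a_ + i * h_);
            sum = local;
        }

        // join(): fold a completed subtask's partial sum into this one.
        void join(const SumHeights2& other) { sum += other.sum; }
    };

    // Usage:
    //   SumHeights2 body(a, h);
    //   tbb::parallel_reduce(tbb::blocked_range<std::size_t>(0, n), body);
    //   // body.sum now holds the combined result.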
OpenMP's reduction() clause is simpler to use, but TBB's parallel_reduce body object supports arbitrary reduction operations.
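For comparison, the OpenMP version of the same sum needs only a single clause, presumably similar in spirit to trap_omp.cpp (the names f, a, h, and n are carried over from the sketches above, not quoted from that file):

    double sum = 0.0;
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < n; i++)
        sum += f(a + i * h);  // OpenMP gives each thread a private sum
                              // and combines the copies with + at the end

The price of that brevity: the combining operation must come from a small fixed set of operators (such as +, *, min, max), whereas a join() method can compute anything.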
Note: The name join() comes from "fork/join", a parallel pattern.
Note to Java programmers: Compare requiring particular member functions in a C++ body class to Java interfaces.
Observe that operator() for a parallel_reduce makes local copies of the state variables (not done for parallel_for). These local copies are not necessary for the logical correctness of the code. However, they function as "hints for the compiler": since the compiler can realize that these values will not be changed by other threads, it can deduce that those variables can be implemented using CPU (core) registers instead of memory locations, for greater efficiency (even faster than using the cache).
Templated calls:
parallel_for
parallel_reduce
parallel_do
parallel_scan
pipeline
parallel_sort
(a parallelized iterative Quicksort)
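As a quick illustration of one of these calls, a minimal parallel_sort sketch (the vector and its contents are made up for the example):

    #include "tbb/parallel_sort.h"
    #include <cstddef>
    #include <vector>

    int main() {
        std::vector<double> v(1000000);
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i] = double(v.size() - i);          // descending, so there is work to do
        tbb::parallel_sort(v.begin(), v.end());   // sorts ascending by default
        return 0;
    }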
Containers
concurrent_vector
concurrent_queue
concurrent_bounded_queue
concurrent_hash_map
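A minimal sketch of one of these containers in use; the CollectEvens class and the task of gathering even numbers are illustrative, not from the course examples. concurrent_vector allows many threads to append at once without corrupting the container:

    #include "tbb/blocked_range.h"
    #include "tbb/concurrent_vector.h"
    #include "tbb/parallel_for.h"

    tbb::concurrent_vector<int> evens;

    class CollectEvens {
    public:
        void operator()(const tbb::blocked_range<int>& r) const {
            for (int i = r.begin(); i != r.end(); ++i)
                if (i % 2 == 0)
                    evens.push_back(i);  // thread-safe growth; final order unspecified
        }
    };

    // Usage:
    //   tbb::parallel_for(tbb::blocked_range<int>(0, 100), CollectEvens());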
Exceptions and cancellation
Mutual exclusion; atomic operations (see the sketch after this list)
Customizability, e.g., scheduling, partitioning, etc.
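To illustrate the mutual exclusion and atomic operations just mentioned, a minimal sketch using classic TBB's tbb::atomic and tbb::spin_mutex (the record() function is hypothetical; newer oneTBB releases drop tbb::atomic in favor of the C++ standard library's std::atomic):

    #include "tbb/atomic.h"
    #include "tbb/spin_mutex.h"

    tbb::atomic<long> hits;     // ++ below is a single atomic step, no lock needed
    tbb::spin_mutex stats_lock; // guards shared updates that take several steps

    void record(long value) {
        ++hits;  // atomic increment
        {
            // scoped_lock acquires the mutex in its constructor and
            // releases it in its destructor, even if an exception is thrown
            tbb::spin_mutex::scoped_lock lock(stats_lock);
            // ... add value to a shared structure that needs more than one step ...
        }
    }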