Introduction to TBB programming
CS 300, Parallel and Distributed Computing (PDC)
Due Thursday, January 15, 2015
______
On a manycore computer (either a thing or the MTL), create a PDC/lab7 subdirectory of your personal directory for work on the lab, and change directory to that directory.
Try TBB computation as follows. Copy the trap_tbb.cpp program to your lab7 directory on the manycore computer, for example,

% scp ~cs300/tbb/trap_tbb.cpp thingn.cs.stolaf.edu:PDC/lab7

If you are on MTL, you must source the appropriate tbb.sh file.
Now, compile the code and run it, for example,

thing$ g++ -o trap_tbb trap_tbb.cpp -ltbb
thing$ ./trap_tbb

Does trap_tbb.cpp compute a correct answer? If not, why not?

The given code trap_tbb.cpp uses a class SumHeights that defines operator() in order to define the second argument for TBB's parallel_for(). With C++11, there is a more compact way to specify that argument, using lambda expressions. To try this, copy trap_tbb.cpp to trap_tbbL.cpp, and replace the call
parallel_for(blocked_range<size_t>(1, n), SumHeights(a, h, integral));

in the new copy trap_tbbL.cpp by

parallel_for(blocked_range<size_t>(1, n),
             [=, &integral] (const blocked_range<size_t> r) {
                 for (size_t i = r.begin(); i != r.end(); i++)
                     integral += f(a + i*h);
             });

You can also remove the definition of the class SumHeights, which is no longer being used. Instead, the lambda expression

[=, &integral] (const blocked_range<size_t> r) {
    for (size_t i = r.begin(); i != r.end(); i++)
        integral += f(a + i*h);
}

behaves like an anonymous function with one argument, a blocked_range<size_t> named r.
The syntax [=, &integral] starts that lambda expression. That syntax describes how to construct a closure for executing that lambda expression. A closure contains copies of, or references to, local variables defined outside of that lambda expression.
The solitary = indicates that, by default, any of main()'s local variables that the lambda expression uses are copied to private variables within that lambda expression's closure. In this case, private copies are made of the variables a and h, which that lambda expression could change without affecting main()'s variables a and h.
The notation &integral indicates that main()'s variable integral is available by reference in that lambda expression's closure. Thus, when a thread executing that lambda expression changes integral, that change affects main()'s variable integral.
The elements = and &integral are called captures for that lambda expression's closure.
The syntax (const blocked_range<size_t> r)
provides the type and name for the first argument in that lambda
expression. Lambda expressions may have zero or more arguments,
specified with types and argument names, as for functions.
The for loop between curly brackets is the body of that lambda expression, comparable to the body of a function definition.
A lambda expression may be called, and will perform binding of arguments and evaluation of the body expression, as with other functions.
Here is an example of calling a lambda expression directly.

[] (int i) { return i+1; } (6)

Since this lambda expression doesn't use any variables other than the argument i, we don't need to include any captures. The final parenthesized (6) provides the argument value. That value 6 is bound to the name i, then the body return i+1; is evaluated, producing 7.

To compile with lambda expressions on a thing, include the flag -std=c++0x. For example,

thingn$ g++ -std=c++0x -o trap_tbbL trap_tbbL.cpp -ltbb

______
Begin with a copy of the sequential program omprimes1.cpp and its Makefile. Copy from a link machine to your 32-core machine, using the name sieve.cpp for the copy of omprimes1.cpp on that 32-core machine. For example, for thingn,

% scp ~cs300/lab6/omprimes1.cpp thingn.cs.stolaf.edu:PDC/lab7/sieve.cpp
% scp ~cs300/lab6/Makefile thingn.cs.stolaf.edu:PDC/lab7

Also copy the Makefile for compiling, and verify that the sequential sieve program compiles and produces correct answers on your 32-core machine.
Once the program is running correctly, perform benchmark tests, and record observations in a file README. Also, make a copy of this version of sieve.cpp, called tbbprimes1.cpp.
Modify your program to use a TBB concurrent_vector instead of an STL vector for pset.
You may create a fresh subdirectory for developing and testing the new version of this code, as in the previous lab, if desired.
If you are using MTL, you must source the appropriate tbb.sh file.
The preprocessor directive

#include <tbb/concurrent_vector.h>

will inform the compiler how to use the templated type tbb::concurrent_vector<>.

Modify Makefile to add -ltbb to compilation and linking.
Compile and run, then perform benchmark tests. Compare to the STL sequential code's performance, and record observations in README. Then, make a copy tbbprimes2.cpp of this version of sieve.cpp.
Add a loop for precomputing primes up to sqrt(prime_max) and storing those primes in another concurrent_vector named pset0. Instead of controlling the outer loop by p, control it by an index i into pset0. This will require you to determine the number of elements in pset0. Define p as p = pset0[i], where i is the loop-control variable for the outer loop.
Compile, run, benchmark, observe in README, and copy to tbbprimes3.cpp.
Shared data structure patterns in Introduction to TBB programming.
Shared Array
______
Shared Queue
______
Task Queue
______
Change pset0 to be a concurrent_queue instead of a concurrent_vector, and use that work queue to control the outer loop for determining all primes. concurrent_queue is a C++ template, so don't forget to specify the type for elements of your queue using angle brackets, e.g., <int>.
The method push() for a concurrent_queue adds its argument value to the end of the queue. The method try_pop() attempts to remove a value from the front of a concurrent_queue and store it in its (reference) argument. It returns true if this succeeds, and returns false (with no state change) if the queue is empty. Observe that a call pset0.try_pop(p) could be used to control the outer loop, e.g., as a guard for a while loop.
Compile, run, benchmark, observe in README, and copy to tbbprimes4.cpp.
Try one (or optionally both) of these parallelization extensions of the basic lab assignment above.
Try adding TBB parallelism to the inner for loop... tbbprimes5.cpp. Record benchmark results and comments from experience in README.
Starting with tbbprimes4.cpp, try adding OpenMP parallelism... tbbprimes6.cpp. Record benchmark results and comments from experience in README.
Notes.
Assuming that you are using a work queue to control your outer loop, and that your outer loop is a while loop, you can parallelize that outer loop so that each element popped off of the queue gets its own thread, using OpenMP 3.0's task construct.
Here is an example showing how to use the task construct for a different problem, namely, applying a function f() to each element of an array of integers up to the first element that contains 0:

int arr[] = {9, 9, 2, 1, 0};  // defines/initializes array of 5 integers
int *ptr;                     // for pointing to one of the elements of the array arr

#pragma omp parallel default(shared)
{
    #pragma omp single private(ptr)
    {
        ptr = arr;
        while (*ptr) {  // while current array element does not hold 0
            #pragma omp task
            f(*ptr);
            ptr++;      // advance to next array element
        }
    }
}

Notes:
We are stepping through the array arr[] using a pointer ptr. If ptr "points to" (holds the address of) an element in the array arr[], then adding 1 to ptr causes it to hold the address of the next element in that array. We initialize ptr using the expression arr (no square brackets), which evaluates to the address of the first element in the array arr[].
The OpenMP single construct in this example guarantees that only one thread will execute the initialization, guard, and progress operations for the while loop. (This avoids race conditions on the variable ptr.)
As always, the OpenMP parallel construct assembles a team of threads, but no parallel computation actually occurs until work is distributed among those threads by another construct, such as a for or sections construct.
The OpenMP 3.0 task construct is another way to distribute work: it allocates threads from the team (as available) to carry out a statement. In this example, different threads can perform the call f(*ptr) in parallel for different values of *ptr. (Of course, f() needs to be a thread-safe function, or there will be race conditions from these parallel calls of f().)
One subtle point: private variables (such as ptr) in a construct that encloses a task become firstprivate variables in that task construct, by default. This means that those variables become private to the thread executing that task, but are also guaranteed to be initialized with the value they had just before that task began.
In this case, ptr holds a different value for each iteration of the while loop. When a thread is assigned to carry out the task directive for one of those iterations, it receives its own copy of ptr initialized with ptr's value for that loop's iteration. While that thread executes its task, ptr may be incremented and the next iteration can begin; if another thread is available, the OpenMP scheduler can assign that thread to the task for that next iteration; and so on.
The OpenMP firstprivate feature is sometimes needed in other contexts besides tasks. It can be specified explicitly (for constructs that support it) using a firstprivate() clause.
Algorithm patterns in Introduction to TBB programming.
Loop Parallel, trap_tbb* and tbbprimes4
______
Master-Worker
______
Use one of the git strategies in lab1 Deliverables to submit your work in this lab. For example, if you worked exclusively on thing3, include the name thing3 in your commit message.

thing3$ cd ~/PDC
thing3$ git pull origin master
thing3$ git add -A
thing3$ git commit -m "Lab 7 submit (thing3)"
thing3$ git push origin master

Likewise, if you worked on both link machines and a thing, you can rename your lab7 folders and submit each, with commit messages indicating the computer on which that commit was created.

Also, fill out this form to report on your work.
See the lab1 Deliverables section if you need to set up the cluster you used for git.
If you did your work on two different clusters, submit work from both of them using one of the strategies in lab1 Deliverables.
This lab is due by Thursday, January 15, 2015.