Introduction to TBB programming
CS 300, Parallel and Distributed Computing (PDC)
Due Thursday, January 15, 2015
______
On a manycore computer (either a thing or the MTL), create a
PDC/lab7 subdirectory of your personal
directory for work on this lab,
and change to that directory.
Try TBB computation as follows.
Copy the trap_tbb.cpp program to your
lab7 directory on the manycore
computer, for example,
% scp ~cs300/tbb/trap_tbb.cpp thingn.cs.stolaf.edu:PDC/lab7
If you are on MTL, you must source the
appropriate tbb.sh file.
Now, compile the code and run it, for example,
thing$ g++ -o trap_tbb trap_tbb.cpp -ltbb
thing$ ./trap_tbb
Does trap_tbb.cpp compute a correct answer? If not, why not?
The given code trap_tbb.cpp uses a class
SumHeights that defines operator() in order
to define the second argument for TBB's parallel_for(). With
C++11, there is a more compact way to specify that argument, using
lambda expressions. To try this, copy
trap_tbb.cpp to trap_tbbL.cpp, and in the new copy
trap_tbbL.cpp replace the call
parallel_for(blocked_range<size_t>(1, n), SumHeights(a, h, integral));
with
parallel_for(blocked_range<size_t>(1, n),
    [=, &integral] (const blocked_range<size_t> r) {
        for (size_t i = r.begin(); i != r.end(); i++)
            integral += f(a+i*h);
    } );
You can also remove the definition of the class
SumHeights, which is no longer used. Instead, the
lambda expression
[=, &integral] (const blocked_range<size_t> r) {
    for (size_t i = r.begin(); i != r.end(); i++)
        integral += f(a+i*h);
}
behaves like an anonymous function with one argument, a
blocked_range<size_t> named r.
The syntax [=, &integral] starts that lambda
expression. That syntax describes how to construct a
closure for executing that lambda expression. A closure
contains copies of/references to local variables defined
outside of that lambda expression.
The solitary = indicates that, by default, any of
main()'s local variables that the lambda expression uses
are copied into private variables within that lambda expression's
closure. In this case, private copies are
made of the variables a and h, so uses of
a and h inside the lambda expression cannot affect
main()'s variables a and h.
The notation &integral indicates that
main()'s variable integral is available
by reference in that lambda expression's closure. Thus, when
a thread
executing that lambda expression changes integral, that
change affects main()'s variable integral.
The elements = and &integral are called
captures for that lambda expression's closure.
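As a standalone illustration of these two capture modes, here is a minimal sketch; the variable names total and accumulate are invented for this example and are not part of the lab code.
#include <iostream>

int main() {
    int a = 10;     // will be captured by value, via the = default
    int total = 0;  // will be captured by reference, via &total

    auto accumulate = [=, &total] (int i) {
        // a is a read-only private copy; total refers to main()'s total
        total += a + i;
    };

    accumulate(5);
    std::cout << total << std::endl;  // prints 15
    return 0;
}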
The syntax (const blocked_range<size_t> r)
provides the type and name for the first argument in that lambda
expression. Lambda expressions may have zero or more arguments,
specified with types and argument names, as for functions.
The for loop between curly brackets is the body of that lambda expression, comparable to the body of a function definition.
A lambda expression may be called, and will perform binding of arguments and evaluation of the body expression, as with other functions.
Here is an example of calling a lambda expression directly.
[] (int i) { return i+1; } (6)
The lambda expression is everything before the final parenthesized
(6). Since it doesn't use any
variables other than the argument i, we don't need to
include any captures. The final parenthesized (6)
provides the argument value. That value 6 is bound to the name
i, then the body return i+1; is
evaluated, producing 7.
To compile with lambda expressions, on a
thing, include the
flag -std=c++0x. For example,
thingn$ g++ -std=c++0x -o trap_tbbL trap_tbbL.cpp -ltbb
______
Begin with a copy of the sequential program
omprimes1.cpp and its Makefile. Copy from a
link machine to your 32-core machine, using the name
sieve.cpp for the copy of omprimes1.cpp on
that 32-core machine. For example, for
thingn,
% scp ~cs300/lab6/omprimes1.cpp thingn.cs.stolaf.edu:PDC/lab7/sieve.cpp
% scp ~cs300/lab6/Makefile thingn.cs.stolaf.edu:PDC/lab7
Also copy the Makefile for compiling, and verify that the
sequential sieve program compiles and produces correct answers on your
32-core machine.
Once the program is running correctly, perform benchmark tests, and
record observations in a file README. Also,
make a copy of this version of sieve.cpp, called tbbprimes1.cpp.
Modify your program to use a TBB concurrent_vector
instead of an STL vector for pset.
You may create a fresh subdirectory for developing and testing the new version of this code, as in the previous lab, if desired.
If you are using MTL, you must source the
appropriate tbb.sh file.
The preprocessor directive
#include <tbb/concurrent_vector.h>
will inform the compiler how to use the templated type
tbb::concurrent_vector<>.
Modify Makefile to add -ltbb to the compilation and linking steps.
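The declaration change itself is small. Assuming your sieve.cpp currently declares pset as an STL vector of ints (a guess at the exact element type), the switch looks roughly like this:
#include <tbb/concurrent_vector.h>

// Before: std::vector<int> pset;
tbb::concurrent_vector<int> pset;  // supports thread-safe push_back()

// Appending a newly found prime works the same way as with std::vector:
pset.push_back(p);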
Compile and run, then perform benchmark tests. Compare to the STL
sequential code's performance, and record observations in
README. Then, make a copy tbbprimes2.cpp of
this version of sieve.cpp.
Add a loop for precomputing primes up to
sqrt(prime_max) and storing those primes in another
concurrent_vector named pset0.
Instead of controlling the outer loop by p,
control it by an index i into pset0. This will require you
to determine the number of elements in pset0.
Define p as p = pset0[i], where
i is the loop-control variable for the outer loop.
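Here is one possible shape for this restructuring; treat it as a sketch, since prime_max and the trial-division test are assumptions about how your sieve.cpp is organized:
#include <cmath>
#include <tbb/concurrent_vector.h>

tbb::concurrent_vector<int> pset0;

// Precompute the primes up to sqrt(prime_max) by trial division.
int root = (int) sqrt((double) prime_max);
for (int p = 2; p <= root; p++) {
    bool is_prime = true;
    for (int d = 2; d * d <= p; d++)
        if (p % d == 0) { is_prime = false; break; }
    if (is_prime)
        pset0.push_back(p);
}

// The outer loop is now controlled by an index i into pset0.
for (size_t i = 0; i < pset0.size(); i++) {
    int p = pset0[i];
    // ... inner loop over multiples of p, as before ...
}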
Compile, run, benchmark, observe in README, and copy
to tbbprimes3.cpp.
Shared data structure patterns in Introduction to TBB programming.
Shared Array______
Shared Queue______
Task Queue______
Change pset0 to be a concurrent_queue
instead of a concurrent_vector, and use that work queue
to control the outer loop for determining all primes.
concurrent_queue is a C++ template, so don't
forget to specify the type for elements of your queue using
angle brackets, e.g., <int>.
The method push() for a
concurrent_queue adds its argument value to the end of the queue.
The method try_pop() attempts to remove a value
from the front of a concurrent_queue and store it in its
(reference) argument. It returns true if this succeeds, and returns
false (leaving its argument unchanged) if the queue is empty.
Observe that a call pset0.try_pop(p) could be used
to control the outer loop, e.g., as a guard for a while loop.
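In sketch form, assuming p is an int and pset0 has been filled by the precomputation loop using push():
#include <tbb/concurrent_queue.h>

tbb::concurrent_queue<int> pset0;  // filled by the precomputation loop via pset0.push(p)

int p;
while (pset0.try_pop(p)) {   // guard: loop ends once the queue is empty
    // ... inner loop over multiples of p, as before ...
}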
Compile, run, benchmark, observe in README, and copy
to tbbprimes4.cpp.
Try one (or optionally both) of these parallelization extensions of the basic lab assignment above.
Try adding TBB parallelism to the inner for
loop... tbbprimes5.cpp. Record benchmark results
and comments from experience in README.
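One possible shape for that inner-loop parallelism is sketched below; it assumes the sieve marks multiples of p in a plain bool array named composite, which may differ from how your sieve.cpp represents candidates:
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
using namespace tbb;

// Inside the outer loop, where p is the current prime:
// mark the multiples 2*p, 3*p, ..., up to prime_max, in parallel.
// Each iteration writes a distinct element of composite, so the
// iterations are independent of one another.
parallel_for(blocked_range<size_t>(2, prime_max / p + 1),
    [=] (const blocked_range<size_t> r) {
        for (size_t k = r.begin(); k != r.end(); k++)
            composite[k * p] = true;   // mark each multiple of p
    } );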
Starting with tbbprimes4.cpp, try adding OpenMP
parallelism... tbbprimes6.cpp. Record benchmark results
and comments from experience in README.
Notes.
Assuming that you are using a work queue to control your
outer loop, and that your outer loop is a while loop, you can
parallelize that outer loop so that each element popped off the
queue gets its own task using OpenMP 3.0's task
construct.
Here is an example showing how to use the task
construct for a different problem, namely, applying a function
f() to each element of
an array of integers up to the first element that contains 0:
int arr[] = {9, 9, 2, 1, 0}; // defines/initializes array of 5 integers
int *ptr;                    // for pointing to one of the elements of arr

#pragma omp parallel default(shared)
{
    #pragma omp single private(ptr)
    {
        ptr = arr;
        while (*ptr) {           // while current array element does not hold 0
            #pragma omp task
            f(*ptr);
            ptr++;               // advance to next array element
        }
    }
}
Notes:
We are stepping through the array arr[] using a pointer
ptr. If ptr "points to" (holds the address
of) an element in the array arr[], then adding 1 to
ptr causes it to hold the address of the next
element in that array. We initialize ptr using the expression
arr (no square brackets), which evaluates to the address of the
first element in the array arr[].
The OpenMP single construct in this example
guarantees that only
one thread will execute the initialization, guard, and progress operations
for the while loop. (This avoids race conditions on the
variable ptr.)
As always, the OpenMP parallel construct assembles a team of
threads. In this example, however, the loop itself is executed by a
single thread, and the parallelism comes from the task
construct.
The OpenMP 3.0 task construct is a tasking construct
(distinct from worksharing constructs such as for or
sections) that hands a statement to a thread from the team
(as available) to carry out. In this example, different threads can
perform the call f(*ptr) in parallel for different values of
*ptr.
(Of course, f() needs to be a thread-safe
function, or there will be race conditions from these parallel calls of f().)
One subtle point: private variables (such as
ptr) in a
construct that encloses a task become
firstprivate variables in that task construct,
by default. This means that those variables become
private to the thread executing that task,
but are also guaranteed to be initialized with the value they had
at the moment the task was created.
In this case, ptr holds a different value for each
iteration of the while loop. When a thread is assigned
to carry out the task directive for one of those
iterations, it receives its own copy of ptr, initialized
with ptr's value for that loop's iteration. While
that thread executes its task, ptr may be
incremented and the next iteration can begin; if another
thread is available, the OpenMP scheduler can assign that thread to
the task for that next iteration; and so on.
The OpenMP firstprivate feature is sometimes needed in other
contexts besides tasks. It can be specified explicitly
(for constructs that support it)
using a firstprivate() clause.
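Applied to the sieve's work queue, the same pattern might look like the following sketch, where process_prime() is a hypothetical stand-in for your (thread-safe) inner-loop code:
#include <tbb/concurrent_queue.h>

tbb::concurrent_queue<int> pset0;   // filled during the precomputation phase

void process_prime(int p);          // hypothetical: crosses out multiples of p

void find_primes() {
    int p;
    #pragma omp parallel default(shared)
    {
        #pragma omp single private(p)
        {
            // p is private in the enclosing single construct, so each task
            // receives it as firstprivate: a copy of p's value at task creation.
            while (pset0.try_pop(p)) {
                #pragma omp task
                process_prime(p);
            }
        }
    }
}
Since this mixes OpenMP with TBB's concurrent_queue, compile with both -fopenmp and -ltbb.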
Algorithm patterns in Introduction to TBB programming.
______
Loop Parallel, trap_tbb* and tbbprimes4
Master-Worker______
Use one of the
git strategies in lab1 Deliverables to submit
your work in this lab. For example, if you worked exclusively on
thing3, include the name thing3 in your
commit message.
thing3$ cd ~/PDC
thing3$ git pull origin master
thing3$ git add -A
thing3$ git commit -m "Lab 7 submit (thing3)"
thing3$ git push origin master
Likewise, if you worked on both link machines and a
thing, you can rename your lab7
folders and submit each, with commit messages indicating the computer
on which that commit was created.
Also, fill out this form to report on your work.
See the lab1 Deliverables section if you need to set up git on the
cluster you used.
If you did your work on two different clusters, submit work from both of them using one of the strategies in lab1 Deliverables.
This lab is due by Thursday, January 15, 2015.