Introduction to TBB programming
CS 300, Parallel and Distributed Computing (PDC)
Due Saturday, October 29, 2016
______
Preliminary material
work queues
Basic laboratory exercises: TBB containers
Log into thing4 and create a lab20 subdirectory of your PDC working directory for work on this lab, then change directory to that new directory. Also create a second terminal window that is logged into a Link computer. We will use this to copy files from the Link file system to thing4. Note: The CS program permits logins and scp file transfers from IT-managed computers to CS-managed computers, but St. Olaf does not permit these from CS-managed computers to IT-managed computers, for security reasons.
Try a TBB computation as follows.
Copy the trap_tbb.cpp program to your lab20 directory on thing4. For example, in a Link terminal window,
link% scp ~cs300/tbb/trap_tbb.cpp thing4.cs.stolaf.edu:PDC/lab20
Now, compile the code and run it on thing4, for example,
thing4$ g++ -o trap_tbb trap_tbb.cpp -ltbb
thing4$ ./trap_tbb
Does trap_tbb.cpp compute a correct answer? If not, why not?
The given code trap_tbb.cpp uses a class SumHeights that defines operator() in order to define the second argument for TBB's parallel_for(). With C++11, there is a more compact way to specify that argument, using lambda expressions. To try this, copy trap_tbb.cpp to trap_tbbL.cpp, and replace the call
parallel_for(blocked_range<size_t>(1, n), SumHeights(a, h, integral));
in the new copy trap_tbbL.cpp by
parallel_for(blocked_range<size_t>(1, n),
             [=, &integral] (const blocked_range<size_t> r) {
               for (size_t i = r.begin(); i != r.end(); i++)
                 integral += f(a+i*h);
             });
You can also remove the definition of the class SumHeights, which is no longer being used. Instead, the lambda expression
[=, &integral] (const blocked_range<size_t> r) {
  for (size_t i = r.begin(); i != r.end(); i++)
    integral += f(a+i*h);
}
behaves like an anonymous function with one argument, a blocked_range<size_t> named r.
The syntax [=, &integral] starts that lambda expression. That syntax describes how to construct a closure for executing that lambda expression. A closure contains copies of, or references to, local variables defined outside of that lambda expression.
The solitary = indicates that, by default, any of main()'s local variables that the lambda expression uses are copied to private variables within that lambda expression's closure. In this case, private copies are made of the variables a and h, which that lambda expression could change without affecting main()'s variables a and h.
The notation &integral indicates that main()'s variable integral is available by reference in that lambda expression's closure. Thus, when a thread executing that lambda expression changes integral, that change affects main()'s variable integral.
The elements = and &integral are called captures for that lambda expression's closure.
The syntax (const blocked_range<size_t> r) provides the type and name for the first argument in that lambda expression. Lambda expressions may have zero or more arguments, specified with types and argument names, as for functions. The for loop between curly brackets is the body of that lambda expression, comparable to the body of a function definition.
A lambda expression may be called, and will perform binding of arguments and evaluation of the body expression, as with other functions. Here is an example of calling a lambda expression directly.
[] (int i) { return i+1; } (6)
Everything before the final parenthesized (6) is the lambda expression. Since it doesn't use any variables other than the argument i, we don't need to include any captures. The final (6) provides the argument value. That value 6 is bound to the name i, then the body return i+1; is evaluated, producing 7.
To compile with lambda expressions on a thing, include the flag -std=c++11. For example,
thingn$ g++ -std=c++11 -o trap_tbbL trap_tbbL.cpp -ltbb
Record this work in a commit.
thing4$ git add trap_tbb.cpp trap_tbbL.cpp
thing4$ git commit -m "lab20: Buggy TBB implementations of trap program, using a class and a lambda"
To try a TBB container, first copy the sequential program omprimes1.cpp and its Makefile to your lab20 directory on thing4, and test that program. For example, on a Link computer,
link% scp ~cs300/lab6/{omprimes1.cpp,Makefile} thing4.cs.stolaf.edu:PDC/lab20
Then, in the directory lab20 on thing4,
thing4$ cp omprimes1.cpp sieve.cpp
thing4$ make
thing4$ ./sieve 100 4
Once the program is running correctly, perform benchmark tests, and record observations in a file lab20/README.
Record this work in a commit.
thing4$ git add Makefile omprimes1.cpp sieve.cpp
thing4$ git commit -m "lab20: omprimes1 computation of primes"
Modify your program to use a TBB concurrent_vector instead of an STL vector for pset. Note: Since the program has not yet been parallelized, we don't yet need a thread-safe container. This change introduces concurrent_vector and prepares for the next lab, when we will parallelize the code using TBB.
Create a subdirectory lab20/tbb for developing and benchmarking, and copy Makefile and sieve.cpp to that subdirectory. Then, replace the STL vector<> template with concurrent_vector<>.
The preprocessor directive
#include <tbb/concurrent_vector.h>
will inform the compiler how to use the templated type tbb::concurrent_vector<>.
Modify Makefile in lab20/tbb to add -ltbb to compilation and linking.
Perform a simple test of the resulting sieve.cpp to ensure that it compiles and runs correctly.
Record this work in a commit.
thing4$ cd ~/PDC/lab20/tbb
thing4$ git add Makefile sieve.cpp
thing4$ git commit -m "lab20: substitute TBB concurrent_vector for vector in sieve.cpp"
Now, perform benchmark tests to compare the performance of the STL vector with that of TBB's concurrent_vector in this application's sequential code. Copy your benchmark script from the previous lab, and move the original sieve code to a new subdirectory stl.
thing4$ git pull origin master
thing4$ cd ~/PDC/lab20
thing4$ cp ../lab/sievetest.sh .
thing4$ mkdir stl
thing4$ git mv Makefile omprimes1.cpp sieve.cpp stl
Here, we pull in order to make sure that the thing4 working directory is up to date before copying the script sievetest.sh from the previous lab's subdirectory. Also, the git mv operation prepares to rename the indicated files in the (sto)git repository, from lab20/* to lab20/stl/*. In particular, the git mv operation above is equivalent to the following:
thing4$ mv Makefile omprimes1.cpp sieve.cpp stl
thing4$ git rm Makefile omprimes1.cpp sieve.cpp
thing4$ git add stl/Makefile stl/omprimes1.cpp stl/sieve.cpp
Run the benchmark script on stl and enter any observations in a file lab20/README.
Run the benchmark script on tbb and enter any observations in a file lab20/README. Be sure to identify the sievetest.*_* file you are analyzing in your README observation.
Record this work in a commit.
thing4$ cp tbb/sieve.cpp tbb/tbbprimes1.cpp
thing4$ git add sievetest.sh README sievetest.*_* tbb/tbbprimes1.cpp
thing4$ git commit -m "lab20: Relocate sieve.cpp to stl/; benchmark stl and tbb using sievetest.sh"
Note: The copy tbbprimes1.cpp makes it easy for graders to retrieve this stage of your work.
Revise tbb/sieve.cpp to add a loop for precomputing primes up to sqrt(prime_max) and storing those primes in another concurrent_vector named pset0.
Instead of controlling the outer loop by p, control it by an index i into pset0. This will require you to determine the number of elements in pset0.
Define p as p = pset0[i], where i is the loop-control variable for the outer loop.
Compile, run, benchmark, observe in README, and copy the resulting code tbb/sieve.cpp to a file tbb/tbbprimes2.cpp. Then record this work in a commit.
thing4$ git add README sievetest.*_* tbb/{sieve.cpp,tbbprimes2.cpp}
thing4$ git commit -m "lab20: Add loop to precompute primes; benchmark. tbbprimes2.cpp"
-
Shared data structure patterns in Introduction to TBB programming.
TBB's concurrent_vector<> implements a Shared Array pattern, enhanced by its ability to grow dynamically through push_back() and other methods. The concurrent_vector class also offers a grow_to_at_least() method for expanding the size of the underlying array in advance. The growing operations are costly in vector classes, although using vectors once they have the correct size can be as efficient as using ordinary arrays. If you can estimate in advance the size a vector needs to be, you can avoid most of the extra computational cost over arrays while retaining benefits such as growing when necessary.
Likewise, the templated class concurrent_queue<> implements a Shared Queue. Besides the convenience of not having to build these concurrent data structures ourselves, these TBB containers are designed for scalability, so they will perform well even for large data.
In tbb/sieve.cpp, change pset0 to be a concurrent_queue instead of a concurrent_vector, and use that work queue to control the outer loop for determining all primes. concurrent_queue is a C++ template, so don't forget to specify the type for elements of your queue using angle brackets, e.g., <int>.
The method push() for a concurrent_queue adds its argument value to the end of the queue.
The method try_pop() attempts to remove a value from the front of a concurrent_queue and store it in its (reference) argument. It returns true if this succeeds, and returns false (with no state change) if the queue is empty.
Observe that a call pset0.try_pop(p) could be used to control the outer loop, e.g., as a guard for a while loop.
Compile, run, benchmark, observe in README, and copy the resulting code tbb/sieve.cpp to a file tbb/tbbprimes3.cpp. Then record this work in a commit.
thing4$ git add README sievetest.*_* tbb/{sieve,tbbprimes3}.cpp
thing4$ git commit -m "lab20: Substitute concurrent_queue for precomputed primes; benchmark. tbbprimes3.cpp"
Parallel extensions
Try one (or optionally both) of these parallelization extensions of the basic lab assignment above.
Add TBB parallelism for the inner for loop in tbb/sieve.cpp. Replace that for loop by a parallel_for() call that uses either a "functor" class that defines an appropriate operator(), or a C++ lambda.
Test and benchmark your code, and copy the tested version of sieve.cpp to tbb/tbbprimes4.cpp. Then record this work in a commit.
thing4$ git add README sievetest.*_* tbb/{sieve,tbbprimes4}.cpp
thing4$ git commit -m "lab20: Parallelize using parallel_for(). tbbprimes4.cpp"
Try adding OpenMP parallelism to sieve.cpp. Call the resulting program tbbprimes5.cpp. Record benchmark results and comments from experience in README.
Notes.
Assuming that you are using a work queue to control your outer loop, and that your outer loop is a while loop, you can parallelize that outer loop so that each element popped off of the queue gets its own thread, using OpenMP 3.0's task construct.
Here is an example showing how to use the task construct for a different problem, namely, applying a function f() to each element of an array of integers up to the first element that contains 0:

int arr[] = {9, 9, 2, 1, 0}; // defines/initializes array of 5 integers
int *ptr;                    // for pointing to one of the elements of the array arr

#pragma omp parallel default(shared)
{
  #pragma omp single private(ptr)
  {
    ptr = arr;
    while (*ptr) {           // while current array element does not hold 0
      #pragma omp task
      f(*ptr);
      ptr++;                 // advance to next array element
    }
  }
}
Notes: We are stepping through the array arr[] using a pointer ptr. If ptr "points to" (holds the address of) an element in the array arr[], then adding 1 to ptr causes it to hold the address of the next element in that array. We initialize ptr using the expression arr (no square brackets), which evaluates to the address of the first element in the array arr[].
The OpenMP single construct in this example guarantees that only one thread will execute the initialization, guard, and progress operations for the while loop. (This avoids race conditions on the variable ptr.)
As always, the OpenMP parallel construct assembles a team of threads, but no work is divided among those threads until a construct such as for or sections shares it out.
The OpenMP 3.0 task construct is another construct that allocates threads from the team (as available) to carry out a statement. In this example, different threads can perform the call f(*ptr) in parallel for different values of *ptr.
(Of course, f() needs to be a thread-safe function, or there will be race conditions from these parallel calls of f().)
One subtle point: private variables (such as ptr) in a construct that encloses a task become firstprivate variables in that task construct, by default. This means that those variables become private to the thread executing that task, but are also guaranteed to be initialized with the value they had just before that task began.
In this case, ptr holds a different value for each iteration of the while loop. When a thread is assigned to carry out the task directive for one of those iterations, it receives its own copy of ptr initialized with ptr's value for that loop iteration. While that thread executes its task, ptr may be incremented and the next iteration can begin; if another thread is available, the OpenMP scheduler can assign that thread to the task for that next iteration; and so on.
The OpenMP firstprivate feature is sometimes needed in other contexts besides tasks. It can be specified explicitly (for constructs that support it) using a firstprivate() clause.
Test and benchmark your code, and copy the tested version of sieve.cpp to tbb/tbbprimes5.cpp. Then record this work in a commit.
thing4$ git add README sievetest.*_* tbb/{sieve,tbbprimes5}.cpp
thing4$ git commit -m "lab20: Parallelize using openmp. tbbprimes5.cpp"
-
Algorithm patterns in Introduction to TBB programming.
The OpenMP parallelization of tbbprimes5.cpp and the TBB parallelization trap_tbb.cpp both exhibit the familiar Loop Parallel pattern. The OpenMP code in tbbprimes5.cpp also illustrates Master-Worker.
Our use of the Shared Queue pattern in tbbprimes5.cpp constitutes a new mid-level parallel programming pattern. Computation in a Task Queue pattern is organized as a series of tasks for distribution among threads of execution. The tasks often have variable or unpredictable computational sizes, so some threads may carry out multiple tasks while another thread continues with a lengthy task. This behavior means that Task Queue offers some natural load balancing among threads, which makes good use of those threads' execution time.
Task Queue is closely related to Shared Queue (which one often needs in order to manage a task queue correctly in parallel) and Master-Worker (since a Task Queue algorithm is ordinarily a special case of Master-Worker). However, Task Queue is worth treating as a pattern in its own right, especially because of its natural load balancing property.
An essential difference between Task Queue and Shared Queue is that Task Queue concerns structuring a program, whereas Shared Queue pertains to structuring data.
This suggests categories for mid-level patterns.
Program structure patterns: Loop Parallel, SPMD, Fork-Join, Master-Worker, Task Queue
Data structure patterns: Shared Array, Shared Queue
Likewise, we can identify categories for low-level patterns.
Execution progress patterns: Thread Pool
Coordination patterns: Message Passing, Collective Communication, Mutual Exclusion
We choose the category name "Execution progress" because Thread Pool has to do with providing resources for parallel execution of machine instructions, not coordination of those resources.
Deliverables
All of your code for this assignment should already be contained in commits. Modify the most recent commit message to indicate that you are submitting the completed assignment.
thing4$ git commit --amend
Add the following to the latest commit message.
lab20: complete
If your assignment is not complete, indicate your progress instead, e.g.,
lab20: items 1-5 complete, item 6 partial
You can make later commits to submit updates.
Finally, pull/push your commits in the usual way.
thing4$ git pull origin master
thing4$ git push origin master
Files: trap_tbb.cpp trap_tbbL.cpp tbb/Makefile tbb/sieve.cpp sievetest.sh README sievetest.*_* stl/Makefile
stl/sieve.cpp stl/omprimes1.cpp tbb/tbbprimes1.cpp tbb/tbbprimes2.cpp tbb/tbbprimes3.cpp tbb/tbbprimes4.cpp tbb/tbbprimes5.cpp
Use one of the git strategies in lab1 Deliverables to submit your work in this lab. For example, if you worked exclusively on thing3, include the name thing3 in your commit message.
thing3$ cd ~/PDC
thing3$ git pull origin master
thing3$ git add -A
thing3$ git commit -m "Lab 20 submit (thing3)"
thing3$ git push origin master
Likewise, if you worked on both Link machines and a thing, you can rename your lab20 folders and submit each, with commit messages indicating the computer on which that commit was created. Also, fill out this form to report on your work.
See the lab1 Deliverables section if you need to set up the cluster you used for git.
If you did your work on two different clusters, submit work from both of them using one of the strategies in lab1 Deliverables.
This lab is due by Saturday, October 29, 2016.