Introduction to TBB programming
CS 300, Parallel and Distributed Computing (PDC)
Due Saturday, October 29, 2016
______
Preliminary material
work queues
Basic laboratory exercises: TBB containers
Log into thing4 and create a lab20 subdirectory of your PDC working directory for work on this lab, then change directory to that directory. Also create a second terminal window that is logged into a Link computer. We will use this window to copy files from the Link file system to thing4. Note: The CS program permits logins and scp file transfers from IT-managed computers to CS-managed computers, but St. Olaf does not permit these from CS-managed computers to IT-managed computers, for security reasons.

Try a TBB computation as follows.
Copy the trap_tbb.cpp program to your lab20 directory on thing4. For example, in a Link terminal window,

    link% scp ~cs300/tbb/trap_tbb.cpp thing4.cs.stolaf.edu:PDC/lab20
Now, compile the code and run it on thing4. For example,

    thing4$ g++ -o trap_tbb trap_tbb.cpp -ltbb
    thing4$ ./trap_tbb
Does trap_tbb.cpp compute a correct answer? If not, why not?

The given code trap_tbb.cpp uses a class SumHeights that defines operator() in order to define the second argument for TBB's parallel_for(). With C++11, there is a more compact way to specify that argument, using lambda expressions. To try this, copy trap_tbb.cpp to trap_tbbL.cpp, and replace the call

    parallel_for(blocked_range<size_t>(1, n), SumHeights(a, h, integral));

in the new copy trap_tbbL.cpp by

    parallel_for(blocked_range<size_t>(1, n),
                 [=, &integral] (const blocked_range<size_t> r) {
                     for (size_t i = r.begin(); i != r.end(); i++)
                         integral += f(a + i*h);
                 });

You can also remove the definition of the class SumHeights, which is no longer being used. Instead, the lambda expression

    [=, &integral] (const blocked_range<size_t> r) {
        for (size_t i = r.begin(); i != r.end(); i++)
            integral += f(a + i*h);
    }

behaves like an anonymous function with one argument, a blocked_range<size_t> named r.
The syntax [=, &integral] starts that lambda expression. That syntax describes how to construct a closure for executing that lambda expression. A closure contains copies of, or references to, local variables defined outside of that lambda expression.

The solitary = indicates that, by default, any of main()'s local variables that the lambda expression uses are copied to private variables within that lambda expression's closure. In this case, private copies are made of the variables a and h, which the lambda expression could change without affecting main()'s variables a and h.

The notation &integral indicates that main()'s variable integral is available by reference in that lambda expression's closure. Thus, when a thread executing that lambda expression changes integral, that change affects main()'s variable integral.
The elements = and &integral are called captures for that lambda expression's closure.
The syntax (const blocked_range<size_t> r) provides the type and name for the first argument of that lambda expression. Lambda expressions may have zero or more arguments, specified with types and argument names, as for functions.

The for loop between curly brackets is the body of that lambda expression, comparable to the body of a function definition.

A lambda expression may be called, and will perform binding of arguments and evaluation of the body, as with other functions.
Here is an example of calling a lambda expression directly.

    [] (int i) { return i+1; } (6)

Since the lambda expression doesn't use any variables other than its argument i, we don't need to include any captures, so the capture list [] is empty. The final parenthesized (6) provides the argument value. That value 6 is bound to the name i, then the body return i+1; is evaluated, producing 7.
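To see both kinds of captures in action, here is a minimal, self-contained sketch (the variable names are illustrative, not from trap_tbbL.cpp); it compiles with g++ -std=c++11:

    #include <iostream>

    int main() {
        int a = 10;     // captured by copy, via the = default
        int total = 0;  // captured by reference, via &total

        auto addTo = [=, &total] (int i) { total += a + i; };

        addTo(1);   // total becomes 0 + 10 + 1 = 11
        addTo(2);   // total becomes 11 + 10 + 2 = 23
        std::cout << total << std::endl;  // prints 23: main()'s total was changed
        return 0;
    }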
To compile with lambda expressions on a thing, include the flag -std=c++11. For example,

    thingn$ g++ -std=c++11 -o trap_tbbL trap_tbbL.cpp -ltbb
Record this work in a commit.

    thing4$ git add trap_tbb.cpp trap_tbbL.cpp
    thing4$ git commit -m "lab20: Buggy TBB implementations of trap program, using a class and a lambda"
To try a TBB container, first copy the sequential program omprimes1.cpp and its Makefile to your lab20 directory on thing4, and test that program. For example, on a Link computer,

    link% scp ~cs300/lab6/{omprimes1.cpp,Makefile} thing4.cs.stolaf.edu:PDC/lab20

Then, in the directory lab20 on thing4,

    thing4$ cp omprimes1.cpp sieve.cpp
    thing4$ make
    thing4$ ./sieve 100 4
Once the program is running correctly, perform benchmark tests, and record observations in a file lab20/README.

Record this work in a commit.

    thing4$ git add Makefile omprimes1.cpp sieve.cpp
    thing4$ git commit -m "lab20: omprimes1 computation of primes"
Modify your program to use a TBB concurrent_vector instead of an STL vector for pset. Note: Since the program has not yet been parallelized, we don't strictly need a thread-safe container yet. This change introduces concurrent_vector and prepares for the next lab, when we will parallelize the code using TBB.

Create a subdirectory lab20/tbb for developing and benchmarking, and copy Makefile and sieve.cpp to that subdirectory. Then, replace the STL vector<> template with concurrent_vector<>, as sketched below. The preprocessor directive
    #include <tbb/concurrent_vector.h>
will inform the compiler how to use the templated type tbb::concurrent_vector<>.
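For example, with that #include in place, the substitution in sieve.cpp might look like the following sketch (here we assume pset was declared as an STL vector<int>; adjust the element type and names to match your code):

    // before:  std::vector<int> pset;
    tbb::concurrent_vector<int> pset;  // thread-safe, dynamically growing array

    pset.push_back(2);  // push_back() works as it does for an STL vector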
Modify the Makefile in lab20/tbb to add -ltbb to compilation and linking.

Perform a simple test of the resulting sieve.cpp to ensure that it compiles and runs correctly.

Record this work in a commit.

    thing4$ cd ~/PDC/lab20/tbb
    thing4$ git add Makefile sieve.cpp
    thing4$ git commit -m "lab20: substitute TBB concurrent_vector for vector in sieve.cpp"
Now, perform benchmark tests to compare the sequential code's performance using the STL vector with its performance using TBB's concurrent_vector in this application.

Copy your benchmark script from the previous lab, and move the original sieve code to a new subdirectory stl.

    thing4$ git pull origin master
    thing4$ cd ~/PDC/lab20
    thing4$ cp ../lab/sievetest.sh .
    thing4$ mkdir stl
    thing4$ git mv Makefile omprimes1.cpp sieve.cpp stl

Here, we pull in order to make sure that the thing4 working directory is up to date before copying the script sievetest.sh from the previous lab's subdirectory. Also, the git mv operation prepares to rename the indicated files in the (stogit) repository, from lab20/* to lab20/stl/*. In particular, the git mv operation above is equivalent to the following:

    thing4$ mv Makefile omprimes1.cpp sieve.cpp stl
    thing4$ git rm Makefile omprimes1.cpp sieve.cpp
    thing4$ git add stl/Makefile stl/omprimes1.cpp stl/sieve.cpp
Run the benchmark script on stl and enter any observations in a file lab20/README.

Run the benchmark script on tbb and enter any observations in a file lab20/README. Be sure to identify the sievetest.*_* file you are analyzing in your README observation.

Record this work in a commit.

    thing4$ cp tbb/sieve.cpp tbb/tbbprimes1.cpp
    thing4$ git add sievetest.sh README sievetest.*_* tbb/tbbprimes1.cpp
    thing4$ git commit -m "lab20: Relocate sieve.cpp to stl/; benchmark stl and tbb using sievetest.sh"

Note: The copy tbbprimes1.cpp makes it easy for graders to retrieve this stage of your work.
Revise tbb/sieve.cpp to add a loop for precomputing primes up to sqrt(prime_max) and storing those primes in another concurrent_vector named pset0.

Instead of controlling the outer loop by p, control it by an index i into pset0. This will require you to determine the number of elements in pset0.

Define p as p = pset0[i], where i is the loop-control variable for the outer loop. (A sketch of this loop structure appears below.)
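Here is a minimal sketch of the revised outer-loop structure (the marking work in the loop body is elided, and the names follow the steps above; your code will differ in the details):

    // pset0 holds the precomputed primes up to sqrt(prime_max)
    for (size_t i = 0; i < pset0.size(); i++) {
        int p = pset0[i];  // outer loop is now controlled by the index i
        // ... mark off multiples of p, as in your existing inner loop ...
    }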
Compile, run, benchmark, record observations in README, and copy the resulting code tbb/sieve.cpp to a file tbb/tbbprimes2.cpp. Then record this work in a commit.

    thing4$ git add README sievetest.*_* tbb/{sieve.cpp,tbbprimes2.cpp}
    thing4$ git commit -m "lab20: Add loop to precompute primes; benchmark. tbbprimes2.cpp"
Shared data structure patterns in Introduction to TBB programming.
TBB's concurrent_vector<> implements a Shared Array pattern, enhanced by its ability to grow dynamically through push_back() and other methods. The concurrent_vector class also offers a grow_to_at_least() method for expanding the size of the underlying array in advance. The growing operations are costly, but once a vector has the correct size, using it can be as efficient as using an ordinary array. If you can estimate in advance the size a vector needs to be, you can avoid most of the extra computational cost over arrays while retaining benefits such as growing when necessary.
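For instance, if you can estimate a bound on the number of elements to be stored, a pre-sizing sketch might look like this (the bound 1000 is an arbitrary illustration):

    tbb::concurrent_vector<int> v;
    v.grow_to_at_least(1000);  // size() is now at least 1000; new elements are default-constructed
    v[0] = 2;                  // elements can then be assigned by index, as with an ordinary array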
Likewise, the templated class concurrent_queue<> implements a Shared Queue. Besides the convenience of not having to build these concurrent data structures ourselves, these TBB containers are designed for scalability, so they will perform well even for large data.

In tbb/sieve.cpp, change pset0 to be a concurrent_queue instead of a concurrent_vector, and use that work queue to control the outer loop for determining all primes. concurrent_queue is a C++ template, so don't forget to specify the type for elements of your queue using angle brackets, e.g., <int>.
The method push() for a concurrent_queue adds its argument value to the end of the queue.

The method try_pop() attempts to remove a value from the front of a concurrent_queue and store it in its (reference) argument. It returns true if this succeeds, and returns false (with no state change) if the queue is empty.

Observe that a call pset0.try_pop(p) could be used to control the outer loop, e.g., as a guard for a while loop, as in the sketch below.
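Here is a minimal sketch of that loop structure, reusing the names p and pset0 from the steps above (the marking work is elided):

    #include <tbb/concurrent_queue.h>

    tbb::concurrent_queue<int> pset0;  // work queue of precomputed primes

    int p;
    while (pset0.try_pop(p)) {   // loop until the queue is empty
        // ... mark off multiples of p, as in your existing inner loop ...
    }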
Compile, run, benchmark, record observations in README, and copy the resulting code tbb/sieve.cpp to a file tbb/tbbprimes3.cpp. Then record this work in a commit.

    thing4$ git add README sievetest.*_* tbb/{sieve,tbbprimes3}.cpp
    thing4$ git commit -m "lab20: Substitute concurrent_queue for precomputed primes; benchmark. tbbprimes3.cpp"
Parallel extensions
Try one (or optionally both) of these parallelization extensions of the basic lab assignment above.
Add TBB parallelism for the inner for loop in tbb/sieve.cpp.

Replace that for loop by a parallel_for() call that uses either a "functor" class that defines an appropriate operator() or a C++ lambda, as in the sketch below.
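For example, if your inner loop marks multiples of p in a shared array, a lambda version might look like this sketch (the names mark and prime_max, the bounds, and the marking step are assumptions; adapt them to your code; note that capturing a pointer by copy still refers to the shared array):

    parallel_for(blocked_range<size_t>(2, prime_max/p + 1),
                 [=] (const blocked_range<size_t> r) {
                     for (size_t k = r.begin(); k != r.end(); k++)
                         mark[k*p] = true;  // mark k*p as composite
                 });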
Test and benchmark your code, and copy the tested version of sieve.cpp to tbb/tbbprimes4.cpp. Then record this work in a commit.

    thing4$ git add README sievetest.*_* tbb/{sieve,tbbprimes4}.cpp
    thing4$ git commit -m "lab20: Parallelize using parallel_for(). tbbprimes4.cpp"
Try adding OpenMP parallelism to sieve.cpp. Call the resulting program tbbprimes5.cpp. Record benchmark results and comments from your experience in README.

Notes.
Assuming that you are using a work queue to control your outer loop, and that your outer loop is a while loop, you can parallelize that outer loop so that each element popped off of the queue gets its own thread, using OpenMP 3.0's task construct.

Here is an example showing how to use the task construct for a different problem, namely, applying a function f() to each element of an array of integers up to the first element that contains 0:

    int arr[] = {9, 9, 2, 1, 0};  // defines/initializes array of 5 integers
    int *ptr;                     // for pointing to one of the elements of the array arr
    #pragma omp parallel default(shared)
    {
        #pragma omp single private(ptr)
        {
            ptr = arr;
            while (*ptr) {        // while current array element does not hold 0
                #pragma omp task
                f(*ptr);
                ptr++;            // advance to next array element
            }
        }
    }

Notes:

We are stepping through the array arr[] using a pointer ptr. If ptr "points to" (holds the address of) an element in the array arr[], then adding 1 to ptr causes it to hold the address of the next element in that array. We initialize ptr using the expression arr (no square brackets), which evaluates to the address of the first element in the array arr[].
The OpenMP single construct in this example guarantees that only one thread will execute the initialization, guard, and progress operations for the while loop. (This avoids race conditions on the variable ptr.)

As always, the OpenMP parallel construct assembles a team of threads, but no parallel computation actually occurs until a worksharing construct is used, such as a for or sections construct.

The OpenMP 3.0 task construct is another worksharing construct that allocates threads from the team (as available) to carry out a statement. In this example, different threads can perform the call f(*ptr) in parallel for different values of *ptr.

(Of course, f() needs to be a thread-safe function, or there will be race conditions from these parallel calls of f().)

One subtle point: private variables (such as ptr) in a construct that encloses a task become firstprivate variables in that task construct, by default. This means that those variables become private to the thread executing that task, but are also guaranteed to be initialized with the value they had just before that task began.

In this case, ptr holds a different value for each iteration of the while loop. When a thread is assigned to carry out the task directive for one of those iterations, it receives its own copy of ptr initialized with ptr's value for that loop's iteration. While that thread executes its task, ptr may be incremented and the next iteration can begin; if another thread is available, the OpenMP scheduler can assign that thread to the task for that next iteration; and so on.

The OpenMP firstprivate feature is sometimes needed in other contexts besides tasks. It can be specified explicitly (for constructs that support it) using a firstprivate() clause.
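Putting these pieces together, a sketch of applying the task construct to the sieve's work-queue loop might look like the following (the names p and pset0 carry over from the earlier steps, and the marking work is elided; the explicit firstprivate(p) clause just makes the default behavior visible):

    int p;
    #pragma omp parallel default(shared)
    {
        #pragma omp single private(p)
        {
            while (pset0.try_pop(p)) {  // one thread pops; the tasks run in parallel
                #pragma omp task firstprivate(p)
                {
                    // ... mark off multiples of p, as in your inner loop ...
                }
            }
        }
    }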
Test and benchmark your code, and copy the tested version of sieve.cpp to tbb/tbbprimes5.cpp. Then record this work in a commit.

    thing4$ git add README sievetest.*_* tbb/{sieve,tbbprimes5}.cpp
    thing4$ git commit -m "lab20: Parallelize using openmp. tbbprimes5.cpp"
Algorithm patterns in Introduction to TBB programming.
The OpenMP parallelization of tbbprimes5.cpp and the TBB parallelization trap_tbb.cpp both exhibit the familiar Loop Parallel pattern. The OpenMP code in tbbprimes5.cpp also illustrates Master-Worker.

Our use of the Shared Queue pattern in tbbprimes5.cpp constitutes a new mid-level parallel programming pattern.

Computation in a Task Queue pattern is organized as a series of tasks for distribution among threads of execution. The tasks often have variable or unpredictable computational sizes, so some threads may carry out multiple tasks while another thread continues with a lengthy task. This behavior means that Task Queue offers some natural load balancing among threads, which makes good use of those threads' execution time.
Task Queue is closely related to Shared Queue (which one often needs in order to manage a task queue correctly in parallel) and Master-Worker (since a Task Queue algorithm is ordinarily a special case of Master-Worker). However, Task Queue is worth treating as a pattern in its own right, especially because of its natural load balancing property.
An essential difference between Task Queue and Shared Queue is that Task Queue concerns structuring a program, whereas Shared Queue pertains to structuring data.
This suggests categories for mid-level patterns.
Program structure patterns: Loop Parallel, SPMD, Fork-Join, Master-Worker, Task Queue
Data structure patterns: Shared Array, Shared Queue
Likewise, we can identify categories for low-level patterns.
Execution progress patterns: Thread Pool
Coordination patterns: Message Passing, Collective Communication, Mutual Exclusion
We choose the category name "Execution progress" because Thread Pool has to do with providing resources for parallel execution of machine instructions, not coordination of those resources.
Deliverables
All of your code for this assignment should already be
contained in commits. Modify the most recent
commit message to indicate that you are submitting
the completed assignment.
    thing4$ git commit --amend

Add the following to the latest commit message.

    lab20: complete

If your assignment is not complete, indicate your progress instead, e.g.,

    lab20: items 1-5 complete, item 6 partial

You can make later commits to submit updates.
Finally, pull/push your commits in
the usual way.
    thing4$ git pull origin master
    thing4$ git push origin master
Files: trap_tbb.cpp trap_tbbL.cpp tbb/Makefile tbb/sieve.cpp sievetest.sh README sievetest.*_* stl/Makefile
stl/sieve.cpp stl/omprimes1.cpp tbb/tbbprimes1.cpp tbb/tbbprimes2.cpp tbb/tbbprimes3.cpp tbb/tbbprimes4.cpp tbb/tbbprimes5.cpp
Use one of the git strategies in lab1 Deliverables to submit your work in this lab. For example, if you worked exclusively on thing3, include the name thing3 in your commit message.

    thing3$ cd ~/PDC
    thing3$ git pull origin master
    thing3$ git add -A
    thing3$ git commit -m "Lab 20 submit (thing3)"
    thing3$ git push origin master

Likewise, if you worked on both link machines and a
thing, you can rename your lab20 folders and submit each, with commit messages indicating the computer on which that commit was created.

Also, fill out this form to report on your work.
See lab1 Deliverables
section if you need to set up the cluster you used for
git.
If you did your work on two different clusters, submit work from both of them using one of the strategies in lab1 Deliverables.
This lab is due by Saturday, October 29, 2016.