Introduction to TBB, continued
CS 300, Parallel and Distributed Computing (PDC)
Due Wednesday, November 7, 2018
______
Preliminary material
______
Laboratory exercises
On a manycore computer such as thing4.cs.stolaf.edu, create a ~/PDC/lab8 subdirectory for work on the lab, and change directory to that directory.
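For example:

thing4$ mkdir -p ~/PDC/lab8
thing4$ cd ~/PDC/lab8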
First, note that trap_tbb.cpp produces incorrect answers because of a race condition, since multiple threads modify the shared variable integral that is allocated in main() and passed by reference to the object SumHeights(a, h, integral) that parallel_for() uses for all its parallel threads. During execution, TBB's parallel_for() has its threads call operator() for that object, and the race condition can occur when those parallel threads modify my_int (i.e., integral) during their calls of operator().
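For reference, the heart of such a body class might look like the following sketch (an illustration, not the actual trap_tbb.cpp source; the integrand f() is assumed to be defined elsewhere in the program):

    #include <tbb/tbb.h>
    using namespace tbb;

    double f(double x);   // the integrand (assumed; defined elsewhere)

    class SumHeights {
      const double my_a, my_h;
      double &my_int;     // reference to main()'s integral -- shared by all threads
    public:
      SumHeights(double a, double h, double &integral)
        : my_a(a), my_h(h), my_int(integral) {}

      void operator()(const blocked_range<size_t> &r) const {
        for (size_t i = r.begin(); i != r.end(); i++)
          my_int += f(my_a + i*my_h);   // unsynchronized read-modify-write: the race
      }
    };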
The program trap_tbb2.cpp avoids that race condition using TBB's parallel_reduce() call:

    SumHeights2 sh2(a, h, integral);
    parallel_reduce(blocked_range<size_t>(1, n), sh2);
The second argument of that parallel_reduce() call is an object sh2 of a class SumHeights2 that differs from the prior class SumHeights in several respects.

Whereas the state variable my_int in trap_tbb.cpp is a reference to main()'s variable integral, in trap_tbb2.cpp the variable my_int is a public state variable. This makes the value of a SumHeights2 object's my_int visible to (and modifiable by) multiple threads.

The revised class SumHeights2 contains a second constructor, called a split constructor, which TBB can call in order to create a new SumHeights2 object when it wants to subdivide a blocked_range for computation by different threads. This way, TBB threads can perform their computations on different objects, instead of each of them sharing a single object (as in the parallel_for() algorithm of trap_tbb.cpp). Note: This split constructor initializes the new object's my_int variable to 0, which is the identity value for our reduce operation +.
SumHeights2 also provides a join() method for combining the my_int values from SumHeights2 objects, which hold the results of threads' computations. The method join() performs the reduce operation, which is addition + for this problem. Note: This method join() (together with the use of the identity 0 in the split constructor) is comparable to MPI_SUM in an MPI reduce, or reduction(+: ...) in an OpenMP reduction clause.

The parallel_reduce() algorithm can use join() method calls to perform a reduce operation in efficient ways. For example, suppose that an array of 40 SumHeights2 objects sh[0], sh[1], sh[2], ... sh[39] have computed subtotal sums of heights of trapezoids for forty different blocked_range()s. Then parallel_reduce() might consolidate all forty subtotals directly within one SumHeights2 object, e.g.,

    sh[0].join(sh[1]); sh[0].join(sh[2]); ... sh[0].join(sh[39]);

(which might be accomplished by a loop

    for (int i = 1; i < 40; i++)
        sh[0].join(sh[i]);

). This strategy can't be parallelized without dealing with a potential race condition (involving the public state variable sh[0].my_int). Alternatively, parallel_reduce() might add ten values at a time using the following strategy:

    sh[0].join(sh[1]);   sh[0].join(sh[2]);   ... sh[0].join(sh[9]);
    sh[10].join(sh[11]); sh[10].join(sh[12]); ... sh[10].join(sh[19]);
    sh[20].join(sh[21]); sh[20].join(sh[22]); ... sh[20].join(sh[29]);
    sh[30].join(sh[31]); sh[30].join(sh[32]); ... sh[30].join(sh[39]);

then join those partial reduce values:

    sh[0].join(sh[10]); sh[0].join(sh[20]); sh[0].join(sh[30]);

(Sample code:

    for (int d = 0; d < 4; d++)
        for (int i = 1; i < 10; i++)
            sh[d*10].join(sh[d*10+i]);
    for (int d = 1; d < 4; d++)
        sh[0].join(sh[d*10]);

.) This approach can be parallelized, since the four groups of joins are independent of each other, and it could improve performance if we had far more than 40 SumHeights2 objects or if the join() operation involved a time-consuming computation.
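Putting these pieces together, a SumHeights2-style body class might look like the following sketch (member names follow the discussion above; the actual class in trap_tbb2.cpp may differ in details):

    #include <tbb/tbb.h>
    using namespace tbb;

    double f(double x);   // the integrand (assumed; defined elsewhere)

    class SumHeights2 {
      const double my_a, my_h;
    public:
      double my_int;      // public state variable: this object's subtotal

      SumHeights2(double a, double h, double integral)
        : my_a(a), my_h(h), my_int(integral) {}

      // split constructor: called by TBB when it subdivides a blocked_range;
      // my_int starts at 0, the identity for the reduce operation +
      SumHeights2(SumHeights2 &other, split)
        : my_a(other.my_a), my_h(other.my_h), my_int(0) {}

      // each thread accumulates into its own object's my_int -- no race
      void operator()(const blocked_range<size_t> &r) {
        for (size_t i = r.begin(); i != r.end(); i++)
          my_int += f(my_a + i*my_h);
      }

      // the reduce operation: fold another object's subtotal into this one
      void join(const SumHeights2 &other) { my_int += other.my_int; }
    };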
TBB's parallel_reduce() uses a split constructor and a join() method to do its work, dividing up the trapezoid computations among multiple objects used by multiple threads, then adding up the results.

DO THIS: Copy trap_tbb2.cpp to ~/PDC/lab8 on thing4.cs.stolaf.edu, then compile and run the program. Verify that it produces the correct value of 2 for the integral. Then use time to measure the performance of several runs of the program. Record the data and make observations in a file ~/PDC/lab8/README. Finally, record this work in a commit.
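For example (the exact compile command may vary with your system's TBB installation):

thing4$ g++ -std=c++11 -O2 -o trap_tbb2 trap_tbb2.cpp -ltbb
thing4$ time ./trap_tbb2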
thing4$ git add README trap_tbb2.cpp
thing4$ git commit -m "lab8: Corrected trap parallelization using parallel_reduce(). trap_tbb2.cpp"
Make a copy of trap_tbb2.cpp named trap_tbb3.cpp, and modify trap_tbb3.cpp to use lambda expressions instead of a body class SumHeights2 to specify the computation for a TBB parallel_reduce() call. (For more on lambda expressions in parallel_reduce(), see this link.)

This requires a four-argument form of the parallel_reduce() function:

    template<typename Range, typename Value, typename Func, typename Reduction>
    Value parallel_reduce(const Range& range, const Value& identity,
                          const Func& func, const Reduction& reduction);

Here is a description of the four arguments.

range is a blocked_range<> object.

identity is the (left) identity element of the type Value for the computation in func. For our computation of adding the heights of rectangles, the type Value is double, and the identity for addition is 0.0 .

func is a function for performing a thread's computation (e.g., adding its range of trapezoid heights). This function func corresponds to operator() for an object like SumHeights2, but func has an extra argument of type Value and returns a Value, as described below.

reduction is the reduction function, which takes two Value arguments and returns a Value. reduction corresponds to the join() method in SumHeights2. In our case, reduction should add two double arguments and return the double result.
We will use lambda expressions for func and reduction. The two lambda expressions use return values to transmit their results, unlike the methods operator() and join() of an object like SumHeights2.

func has two arguments, a blocked_range<> and a Value argument, and func returns a Value. In contrast, operator() for a class like SumHeights2 has only one blocked_range argument and no return value. func's second argument and return value are used instead of a (public) state variable to accumulate the answer. In our case, func should sum the heights of rectangles for the blocked_range argument, add that sum to the second argument (a double), and return the result (another double). The extra argument and return value take the place of the public state variable my_int in SumHeights2.

reduction's two arguments and return value contrast with join()'s one argument and no return value. Again, the extra argument and return value are used to perform the reduction without a state variable like my_int.
Capturing. For our computation, the lambda expression func requires the values of a and h. Both of these variables can be captured by copy (it should also work to capture them by reference, since func shouldn't change them). Note that the use of the double argument in func eliminates the need for func to maintain a copy of integral. Effectively, the double argument and return value represent before-and-after values of the integral computation for a given blocked range.

The lambda expression reduction doesn't need to capture any values for adding its two double arguments.

Unlike the two-argument version of parallel_reduce(), this four-argument version returns a value in order to deliver its results. Thus, we will replace

    SumHeights2 sh2(a, h, integral);
    parallel_reduce(blocked_range<size_t>(1, n), sh2);
    integral += sh2.my_int;

by the functional-style call

    integral += parallel_reduce(blocked_range<size_t>(1, n), 0.0, ...);

where ... represents your two lambda expressions func and reduction.

Don't forget to delete the unneeded class SumHeights2 in your code.
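As an illustration of how the pieces fit together, here is a minimal sketch of a complete program in this style. The integrand sin(x) on [0, pi] is an assumption (consistent with the expected integral value of 2 mentioned above); keep whatever setup your trap_tbb2.cpp already uses.

    #include <cmath>
    #include <cstdio>
    #include <tbb/tbb.h>
    using namespace tbb;

    double f(double x) { return sin(x); }   // assumed integrand

    int main() {
      const size_t n = 1000000;
      const double a = 0.0, b = M_PI;
      const double h = (b - a) / n;
      double integral = (f(a) + f(b)) / 2;   // endpoint terms of the trapezoidal rule

      integral += parallel_reduce(
          blocked_range<size_t>(1, n),
          0.0,                                // identity for +
          // func: add this range's heights to the incoming partial sum
          [=](const blocked_range<size_t> &r, double partial) -> double {
            for (size_t i = r.begin(); i != r.end(); i++)
              partial += f(a + i*h);
            return partial;
          },
          // reduction: combine two partial sums
          [](double x, double y) { return x + y; });

      integral *= h;
      printf("integral = %lf\n", integral);
      return 0;
    }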
Test your program trap_tbb3.cpp with lambda expressions, benchmark it using time as before, and record data and observations in the file lab8/README. Then record this work in a commit.

thing4$ git add README trap_tbb3.cpp
thing4$ git commit -m "lab8: parallel_reduce() with lambda. trap_tbb3.cpp"
For one of the programs tbbprimesn.cpp from the previous lab with n ≥ 4, make a copy of that program in a new subdirectory lab8/sieve, naming the copy sieve.cpp. Record your initial sieve.cpp in a commit.

thing4$ git add README sieve/sieve.cpp
thing4$ git commit -m "lab8: initial sieve code sieve/sieve.cpp"
Add a TBB parallel_for to parallelize the for loop that initializes your "premarked[]" array in sieve/sieve.cpp, i.e., the "marked[]" array of char that you use for precomputing an initial subset of primes.

Notes:

This requires replacing that loop by a call to parallel_for; see the main() in trap_tbb.cpp for an example.

The range object for this call to parallel_for needs to be constructed with the correct arguments. For example, if your loop were

    for (int i = 7; i <= N; i++) { ... }

then the range object can be constructed with the call

    blocked_range<size_t>(7, N+1);
The body object for this call to parallel_for can be an object of a class named, for example, InitPremarked. That class is responsible for performing initialization on a range of indices of the array premarked[].

Therefore, the class InitPremarked will need a (private) state variable for holding a copy of the (address of the) array premarked, perhaps named my_premarked. Also, the constructor must initialize my_premarked from its argument. The state variable and constructor argument can have type char*, but they should not be const char*, because the purpose of InitPremarked is to make changes in the (characters pointed to by that) array.
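For example, such a class might look like this sketch (the initialization value 0 below is an assumption; use whatever value your sieve treats as "unmarked"):

    #include <tbb/tbb.h>
    using namespace tbb;

    class InitPremarked {
      char *my_premarked;   // copy of the address of main()'s premarked[] array
    public:
      InitPremarked(char *premarked) : my_premarked(premarked) {}

      // initialize this range of indices of premarked[]
      void operator()(const blocked_range<size_t> &r) const {
        for (size_t i = r.begin(); i != r.end(); i++)
          my_premarked[i] = 0;   // assumed "unmarked" value
      }
    };

    // usage: parallel_for(blocked_range<size_t>(7, N+1), InitPremarked(premarked));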
Compile, run, benchmark, and make observations in the file lab8/README. Then make a copy sieve/tbbprimes10.cpp of sieve/sieve.cpp and record this work in a commit.

thing4$ cp sieve/sieve.cpp sieve/tbbprimes10.cpp
thing4$ git add README sievetest.*_* sieve/{sieve,tbbprimes10}.cpp
thing4$ git commit -m "lab8: Parallelize premarked loop using parallel_for/object. tbbprimes10.cpp"
Now modify sieve/sieve.cpp to use a lambda expression instead of the class InitPremarked.

Instead of InitPremarked's private array state variable my_premarked, use a by-copy capture of main()'s array variable premarked. This is the only local variable from main() needed in the lambda expression, so you can indicate this capture using the notation [=].

This lambda expression requires only a blocked_range<size_t> as an argument. Its body initializes the (captured) array premarked for that range.

Delete the now-unneeded class InitPremarked.
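The resulting call might look like this sketch (again assuming 0 as the "unmarked" value and the same range as above):

    parallel_for(blocked_range<size_t>(7, N+1),
                 [=](const blocked_range<size_t> &r) {
                   for (size_t i = r.begin(); i != r.end(); i++)
                     premarked[i] = 0;   // assumed "unmarked" value
                 });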
Compile, run, benchmark, and make observations in the file lab8/README. Then copy the current version of your program to sieve/tbbprimes11.cpp and record this work in a commit.

thing4$ cp sieve/{sieve,tbbprimes11}.cpp
thing4$ git add README sievetest.*_* sieve/{sieve,tbbprimes11}.cpp
thing4$ git commit -m "lab8: Parallelize premarked loop using parallel_for/lambda. tbbprimes11.cpp"

Note: In the cp command above, the shell notation sieve/{sieve,tbbprimes11}.cpp expands to that cp command's two arguments sieve/sieve.cpp sieve/tbbprimes11.cpp .
Next, use TBB's parallel_for to parallelize the loop for marking multiples of a given prime p in the array premarked[].

For this loop, create a class named SievePrime; an object of that class will be used as the second argument of this call of parallel_for.

The iteration for this loop needs to increment by p instead of incrementing by 1. You can accomplish this by constructing a blocked_range<size_t> object that generates integer values that are 1/p of the index values you want to mark in the array premarked[]. For example, if your loop was

    for (int k = p*p; k < maximum; k += p) { ... }

then you can generate such a range object using the call

    blocked_range<size_t>(p, maximum/p);

then multiply index values from that range by p, e.g., premarked[p*i]. Be sure to adjust both the starting and ending values for this range.

The body class for this loop will need two state variables, premarked and p, both passed in the constructor, as in the sketch below.
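Here is one possible sketch of such a body class (the marking value 1 is an assumption; use whatever value your sieve uses to mark composites):

    #include <tbb/tbb.h>
    using namespace tbb;

    class SievePrime {
      char *my_premarked;   // address of main()'s premarked[] array
      size_t my_p;          // the prime whose multiples are being marked
    public:
      SievePrime(char *premarked, size_t p)
        : my_premarked(premarked), my_p(p) {}

      // each index i in the range stands for the multiple p*i
      void operator()(const blocked_range<size_t> &r) const {
        for (size_t i = r.begin(); i != r.end(); i++)
          my_premarked[my_p * i] = 1;   // assumed "marked" value
      }
    };

    // usage: parallel_for(blocked_range<size_t>(p, maximum/p), SievePrime(premarked, p));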
Compile, run, benchmark, and make observations in the file lab8/README. Then copy the current version of your program to sieve/tbbprimes12.cpp and record this work in a commit.

thing4$ cp sieve/{sieve,tbbprimes12}.cpp
thing4$ git add README sievetest.*_* sieve/{sieve,tbbprimes12}.cpp
thing4$ git commit -m "lab8: Parallelize marking loop using parallel_for/object. tbbprimes12.cpp"
Modify your sieve/sieve.cpp program to use a lambda expression instead of the class SievePrime.

Both of the variables of main() to be captured by this lambda expression (namely, premarked and p) should be captured by copy. Therefore, [=] can again be used to start the lambda expression.
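For example (same assumptions as in the SievePrime sketch above):

    parallel_for(blocked_range<size_t>(p, maximum/p),
                 [=](const blocked_range<size_t> &r) {
                   for (size_t i = r.begin(); i != r.end(); i++)
                     premarked[p*i] = 1;   // assumed "marked" value
                 });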
Compile, run, benchmark, and make observations in the file lab8/README. Then copy the current version of your program to sieve/tbbprimes13.cpp and record this work in a commit.

thing4$ cp sieve/{sieve,tbbprimes13}.cpp
thing4$ git add README sievetest.*_* sieve/{sieve,tbbprimes13}.cpp
thing4$ git commit -m "lab8: Parallelize marking loop using parallel_for/lambda. tbbprimes13.cpp"
Next, use TBB's parallel_for with a lambda expression to parallelize the loop that assigns the starter set of primes to your (thread-safe!) concurrent_queue variable.

This time, the lambda expression must capture both the array premarked and your queue variable containing known primes. The pointer premarked can be captured by copy, but the queue should be captured by reference, so that the lambda expression can add a prime to main()'s state variable if it finds one.
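For example, the call might look like this sketch; the queue name primes, its element type int, the bound smallLimit, and the test premarked[i] == 0 for "i is prime" are all assumptions to adjust to your program:

    #include <tbb/concurrent_queue.h>
    // assuming: tbb::concurrent_queue<int> primes;  declared in main()

    parallel_for(blocked_range<size_t>(2, smallLimit),
                 [=, &primes](const blocked_range<size_t> &r) {
                   for (size_t i = r.begin(); i != r.end(); i++)
                     if (premarked[i] == 0)    // assumed: unmarked means prime
                       primes.push((int) i);   // concurrent_queue::push is thread-safe
                 });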
Compile, run, benchmark, and make observations in the file lab8/README. Then copy the current version of your program to sieve/tbbprimes14.cpp and record this work in a commit.

thing4$ cp sieve/{sieve,tbbprimes14}.cpp
thing4$ git add README sievetest.*_* sieve/{sieve,tbbprimes14}.cpp
thing4$ git commit -m "lab8: Parallel_for, lambda, capture by reference. tbbprimes14.cpp"

Parallelize at least one more loop in the program, aiming to improve performance the most when compared to tbbprimes12.cpp.

Compile, run, benchmark, and make observations in the file lab8/README. Then copy the current version of your program to sieve/tbbprimes15.cpp and record this work in a commit.

thing4$ cp sieve/{sieve,tbbprimes15}.cpp
thing4$ git add README sievetest.*_* sieve/{sieve,tbbprimes15}.cpp
thing4$ git commit -m "lab8: Further parallelization. tbbprimes15.cpp"

Note: provide a more specific commit message if you prefer.
Patterns in Introduction to TBB, continued
The dominant parallel programming patterns in this lab are now quite familiar, since we have seen them in three different technologies: Loop Parallel and the reduce operation, which we have counted under the low-level pattern Collective Communication. However, the implementation of these patterns feels at a higher level in TBB than in MPI or even OpenMP.
With TBB's parallel_for(), we use templated classes and create callable objects (by defining operator()) or write lambda expressions to specify the computational work to be done according to a Loop Parallel pattern. Once one becomes familiar with this sophisticated use of the C++ language, this approach to implementation feels at a higher level -- closer to an application and generally further from hardware details -- than the explicit range splitting and many-parameter message-passing or collective communication functions of MPI, or even the reduction clause for an OpenMP #pragma compiler directive.

The same implementation elements give TBB's parallel_reduce() the appearance of a much higher level of programming than its counterparts in MPI and OpenMP. Especially with the four-argument version suitable for lambda expressions, we can implement a parallelization with parallel_reduce(), focusing primarily on the essential computations needed for an application, by simply providing the computation needed for each chunk of data (func) and the operation needed to combine two such chunks (reduction).
The term programming paradigm refers to overall abstract patterns in the uses of programming languages. The three most common programming paradigms are:
imperative programming, in which one carries out programming tasks using sequences of instructions (for ordering), loop constructs (for repetition), and state changes such as memory assignment and input/output (for data management);
functional programming, which employs nested function calls (for ordering), recursion (for repetition), and function return values (for data movement); and
object-oriented programming, which uses "objects" with state variables and methods to group data and operations together, and which includes notions such as inheritance, communication through message passing, etc.
TBB with objects as arguments is consistent with solving problems using object-oriented programming, whereas TBB with lambda expressions deploys functional programming. Note: Many experienced professionals anticipate that functional programming will play a key role in the parallel programming systems of the future. Using TBB's parallel_reduce() with lambda expressions suggests why they may think so: with the functional paradigm, it is relatively easy to think at a high level and parallelize correctly, avoiding low-level difficulties such as race conditions.
Patterns in the implementation of TBB
As with OpenMP, the way TBB carries out its work illustrates some interesting parallel programming patterns.
Maintaining a Thread Pool reduces the overhead of creating and destroying threads, as with OpenMP.
Through the blocked_range<> data structure, the range of a parallelized computation can be subdivided, and those smaller ranges subdivided, etc., in order to find a computation size, or granularity, that matches the available computational resources. This implementation strategy reflects the Recursive Splitting pattern.

These same recursive splits lead to an efficient way to compute a reduction: perform the reduce operation first on small subdivisions (producing a partial reduction for the range that was split to produce those small subdivisions), then continue recursively until the reduction of the entire tree of subdivisions has been calculated. Many implementations of MPI and OpenMP perform efficient reductions using the same Recursive Splitting strategy, although TBB's explicit ranges make the approach easier to visualize.
Note: Recursive Splitting is considered to be a high-level pattern for finding the parallelism in a programming problem, at the same level as Data Decomposition and Task Decomposition.
TBB assigns a queue of computations to each of its threads, illustrating the Task Queue pattern. If some threads finish the work in their queues while other threads still have long queues of work to complete, TBB enables the otherwise idle threads to carry out tasks from the busy threads' queues, in an automated way. This strategy is called work stealing. Some implementations of MPI also support work stealing, but this strategy is a design goal for TBB.
Deliverables
All of your code for this assignment should already be
contained in commits. Modify the most recent
commit message to indicate that you are submitting
the completed assignment.
thing4$ git commit --amend

Add the following to the latest commit message.

lab8: complete

If your assignment is not complete, indicate your progress instead, e.g.,

lab8: items 1-5 complete, item 6 partial

You can make later commits to submit updates.
Finally, pull/push your commits in
the usual way.
thing4$ git pull origin master
thing4$ git push origin master
Files: README trap_tbb2.cpp trap_tbb3.cpp sieve/sieve.cpp sievetest.*_* sieve/tbbprimes10.cpp sieve/tbbprimes11.cpp sieve/tbbprimes12.cpp sieve/tbbprimes13.cpp sieve/tbbprimes14.cpp sieve/tbbprimes15.cpp
Use one of the git strategies in lab1 Deliverables to submit your work in this lab. For example, if you worked exclusively on thing3, include the name thing3 in your commit message.

thing3$ cd ~/PDC
thing3$ git pull origin master
thing3$ git add -A
thing3$ git commit -m "lab8 submit (thing3)"
thing3$ git push origin master

Likewise, if you worked on both link machines and a thing, you can rename your lab8 folders and submit each, with commit messages indicating the computer on which that commit was created.

Also, fill out this form to report on your work.
See the lab1 Deliverables section if you need to set up git on the cluster you used.
If you did your work on two different clusters, submit work from both of them using one of the strategies in lab1 Deliverables.
This lab is due by Wednesday, November 7, 2018.