Introduction to TBB, continued
CS 300, Parallel and Distributed Computing (PDC)
Due Wednesday, November 7, 2018
______
Preliminary material
______
Laboratory exercises
On a manycore computer such as thing4.cs.stolaf.edu, create a ~/PDC/lab8 subdirectory for work on the lab, and change directory to that directory.
First, note that trap_tbb.cpp produces incorrect answers because of a race condition: multiple threads modify the shared variable integral, which is allocated in main() and passed by reference to the object SumHeights(a, h, integral) that parallel_for() uses for all its parallel threads. During execution, TBB's parallel_for() has its threads call operator() for that object, and the race condition can occur when those parallel threads modify my_int (i.e., integral) during their calls of operator().

The program trap_tbb2.cpp avoids that race condition using TBB's parallel_reduce() call:

SumHeights2 sh2(a, h, integral);
parallel_reduce(blocked_range<size_t>(1, n), sh2);
The second argument of that parallel_reduce() call is an object sh2 of a class SumHeights2 that differs from the prior class SumHeights in several respects.

Whereas the state variable my_int in trap_tbb.cpp is a reference to main()'s variable integral, in trap_tbb2.cpp the variable my_int is a public state variable. This makes the value of a SumHeights2 object's my_int visible (and modifiable) by multiple threads.

The revised class SumHeights2 contains a second constructor, called a split constructor, which TBB can call in order to create a new SumHeights2 object when it wants to subdivide a blocked_range for computation by different threads. This way, TBB threads can perform their computations on different objects, instead of all sharing a single object (as in the parallel_for() algorithm of trap_tbb.cpp). Note: this split constructor initializes the new object's my_int variable to 0, which is the identity value for our reduce operation, +.
The parallel_reduce() algorithm can use join() method calls to perform a reduce operation in efficient ways. For example, suppose that an array of 40 SumHeights2 objects sh[0], sh[1], sh[2], ..., sh[39] have computed subtotal sums of heights of trapezoids for forty different blocked_range()s. Then parallel_reduce() might consolidate all forty subtotals directly within one SumHeights2 object, e.g.,

sh[0].join(sh[1]); sh[0].join(sh[2]); ... sh[0].join(sh[39]);

(which might be accomplished by a loop

for (int i = 1; i < 40; i++) sh[0].join(sh[i]);

). This strategy can't be parallelized without dealing with a potential race condition (involving the public state variable sh[0].my_int). Alternatively, parallel_reduce() might add ten values at a time using the following strategy:

sh[0].join(sh[1]); sh[0].join(sh[2]); ... sh[0].join(sh[9]);
sh[10].join(sh[11]); sh[10].join(sh[12]); ... sh[10].join(sh[19]);
sh[20].join(sh[21]); sh[20].join(sh[22]); ... sh[20].join(sh[29]);
sh[30].join(sh[31]); sh[30].join(sh[32]); ... sh[30].join(sh[39]);

then join those partial reduce values:

sh[0].join(sh[10]); sh[0].join(sh[20]); sh[0].join(sh[30]);

(Sample code:

for (int d = 0; d < 4; d++)
    for (int i = 1; i < 10; i++)
        sh[d*10].join(sh[d*10+i]);
for (int d = 1; d < 4; d++)
    sh[0].join(sh[d*10]);

). This approach can be parallelized, and could improve performance if we had far more than 40 SumHeights2 objects or if the join() operation involved a time-consuming computation.

SumHeights2 also provides a join() method for combining the my_int values from SumHeights2 objects, which hold the results of each thread's computation. The method join() performs the reduce operation, which is addition (+) for this problem. Note: this method join() (together with the use of the identity 0 in the split constructor) is comparable to MPI_SUM in an MPI reduce, or reduction(+: ...) in an OpenMP reduction clause.
TBB's parallel_reduce() uses a split constructor and a join() method to do its work, dividing up the trapezoid computations among multiple objects used by multiple threads, then adding up the results.

DO THIS: Copy trap_tbb2.cpp to ~/PDC/lab8 on thing4.cs.stolaf.edu, then compile and run the program. Verify that it produces the correct value of 2 for the integral. Then, use time to measure the performance of several runs of the program. Record the data and your observations in a file ~/PDC/lab8/README. Finally, record this work in a commit.

thing4$ git add README trap_tbb2.cpp
thing4$ git commit -m "lab8: Corrected trap parallelization using parallel_reduce(). trap_tbb2.cpp"
Make a copy of trap_tbb2.cpp named trap_tbb3.cpp, and modify trap_tbb3.cpp to use lambda expressions instead of a body class SumHeights2 to specify the computation for a TBB parallel_reduce() call. (For more on lambda expressions in parallel_reduce(), see this link.) This requires the four-argument form of the parallel_reduce() function:

template<typename Range, typename Value, typename Func, typename Reduction>
Value parallel_reduce(const Range& range, const Value& identity,
                      const Func& func, const Reduction& reduction);
Here is a description of the four arguments:

- range is a blocked_range<> object.

- identity is the (left) identity element of the type Value for the computation in func. For our computation, the type Value is double, and we are adding heights of trapezoids, so identity should be 0.0.

- func is a function for performing a thread's computation (e.g., adding its range of trapezoid heights). This function func corresponds to operator() for an object like SumHeights2, but func has an extra argument of type Value and returns a Value, as described below.

- reduction is the reduction function, which takes two Value arguments and returns a Value. reduction corresponds to the join() method in SumHeights2. In our case, reduction should add two double arguments and return the double result.
We will use lambda expressions for func and reduction. The two lambda expressions use return values to transmit their results, unlike the methods operator() and join of an object like SumHeights2.

func has two arguments, a blocked_range<> and a Value, and func returns a Value. In contrast, operator() for a class like SumHeights2 has only one blocked_range argument and no return value. func's second argument and return value are used instead of a (public) state variable to accumulate the answer. In our case, func should sum the heights of trapezoids for its blocked_range argument, add that sum to its second argument (a double), and return the result (another double). The extra argument and return value take the place of the public state variable my_int in SumHeights2.

Likewise, reduction's two arguments and return value contrast with join's one argument and no return value. Again, the extra argument and return value are used to perform the reduction without a state variable like my_int.
Capturing. For our computation, the lambda expression func requires the values of a and h. Both of these variables can be captured by copy (it should also work to capture them by reference, since func shouldn't change them). Note that the use of the double argument in func eliminates the need for func to maintain a copy of integral. Effectively, the double argument and return value represent the before and after values of the integral computation for a given blocked range. The lambda expression reduction doesn't need to capture any values for adding its two double arguments.

Unlike the two-argument version of parallel_reduce(), this four-argument version returns a value in order to deliver its results. Thus, we will replace

SumHeights2 sh2(a, h, integral);
parallel_reduce(blocked_range<size_t>(1, n), sh2);
integral += sh2.my_int;

by the functional-style call

integral += parallel_reduce(blocked_range<size_t>(1, n), 0.0, ...);

where ... represents your two lambda expressions func and reduction.

Don't forget to delete the unneeded class SumHeights2 in your code.
Test your program trap_tbb3.cpp with lambda expressions, benchmark it using time as before, and record data and observations in the file lab8/README. Then, record this work in a commit.

thing4$ git add README trap_tbb3.cpp
thing4$ git commit -m "lab8: parallel_reduce() with lambda. trap_tbb3.cpp"
For one of the programs tbbprimesn.cpp from the previous lab with n ≥ 4, make a copy of that program in a new subdirectory lab8/sieve, naming the copy sieve.cpp. Record your initial sieve.cpp in a commit.

thing4$ git add README sieve/sieve.cpp
thing4$ git commit -m "lab8: initial sieve code sieve/sieve.cpp"
Add a TBB parallel_for to parallelize the for loop that initializes your "premarked[]" array in sieve/sieve.cpp, i.e., the "marked[]" array of char that you use for precomputing an initial subset of primes. Notes:

- This requires replacing that loop by a call to parallel_for; see the main() in trap_tbb.cpp for an example.

- The range object for this call to parallel_for needs to be constructed with the correct arguments. For example, if your loop were

  for (int i = 7; i <= N; i++) { ... }

  then the range object can be constructed with the call

  blocked_range<size_t>(7, N+1)

- The body object for this call to parallel_for can be an object of a class named, for example, InitPremarked. That class is responsible for performing initialization on a range of indices for the array premarked[].

- Therefore, the class InitPremarked will need a (private) state variable for holding a copy of the (address of the) array premarked, perhaps named my_premarked. Also, the constructor must initialize my_premarked from its argument. The state variable and constructor argument can have type char*, but they should not be const char*, because the purpose of InitPremarked is to make changes in the (characters pointed to by that) array.
Compile, run, benchmark, and make observations in the file lab8/README. Then make a copy sieve/tbbprimes10.cpp of sieve/sieve.cpp and record this work in a commit.

thing4$ cp sieve/{sieve,tbbprimes10}.cpp
thing4$ git add README sievetest.*_* sieve/{sieve,tbbprimes10}.cpp
thing4$ git commit -m "lab8: Parallelize premarked loop using parallel_for/object. tbbprimes10.cpp"
Now modify sieve/sieve.cpp to use a lambda expression instead of a class InitPremarked.

- Instead of InitPremarked's private array state variable my_premarked, use a by-copy capture of main()'s array variable premarked. This is the only local variable from main() needed in the lambda expression, so you can indicate this capture using the notation [=].

- This lambda expression requires only a blocked_range<size_t> as an argument. Its body initializes the (captured) array premarked for that range.

- Delete the now-unneeded class InitPremarked.
Compile, run, benchmark, and make observations in the file lab8/README. Then copy the current version of your program to sieve/tbbprimes11.cpp and record this work in a commit.

thing4$ cp sieve/{sieve,tbbprimes11}.cpp
thing4$ git add README sievetest.*_* sieve/{sieve,tbbprimes11}.cpp
thing4$ git commit -m "lab8: Parallelize premarked loop using parallel_for/lambda. tbbprimes11.cpp"

Note: In the cp command above, the shell notation sieve/{sieve,tbbprimes11}.cpp expands to that cp command's two arguments sieve/sieve.cpp sieve/tbbprimes11.cpp.
Next, use TBB's parallel_for to parallelize the loop for marking multiples of a given prime p in the array premarked[].

- For this loop, create a class named SievePrime; an object of that class will be used as the second argument of this call of parallel_for.

- The iteration for this loop needs to increment by p instead of incrementing by 1. You can accomplish this by constructing a blocked_range<size_t> object that generates integer values that are 1/p of the index values you want to mark in the array premarked[]. For example, if your loop were

  for (int k = p*p; k < maximum; k += p) { ... }

  then you can generate such a range object using the call

  blocked_range<size_t>(p, maximum/p)

  then multiply index values from that range by p, e.g., premarked[p*i]. Be sure to adjust both the starting and ending values for this range.

- The body class for this loop will need two state variables, premarked and p, both passed in the constructor.

Compile, run, benchmark, and make observations in the file lab8/README. Then copy the current version of your program to sieve/tbbprimes12.cpp and record this work in a commit.

thing4$ cp sieve/{sieve,tbbprimes12}.cpp
thing4$ git add README sievetest.*_* sieve/{sieve,tbbprimes12}.cpp
thing4$ git commit -m "lab8: Parallelize loop for marking multiples using parallel_for/object. tbbprimes12.cpp"
Modify your sieve/sieve.cpp program to use a lambda expression instead of the class SievePrime. Both of the variables of main() to be captured by this lambda expression (namely, premarked and p) should be captured by copy. Therefore, [=] can again be used to start the lambda expression.

Compile, run, benchmark, and make observations in the file lab8/README. Then copy the current version of your program to sieve/tbbprimes13.cpp and record this work in a commit.

thing4$ cp sieve/{sieve,tbbprimes13}.cpp
thing4$ git add README sievetest.*_* sieve/{sieve,tbbprimes13}.cpp
thing4$ git commit -m "lab8: Parallelize loop for marking multiples using parallel_for/lambda. tbbprimes13.cpp"
Next, use TBB's parallel_for with a lambda expression to parallelize the loop that assigns the starter set of primes to your (thread-safe!) concurrent_queue variable. This time, the lambda expression must capture both the array premarked and your queue variable containing known primes. The pointer premarked can be captured by copy, but the queue should be captured by reference, so the lambda expression can add a prime to main()'s state variable if it finds one.
Compile, run, benchmark, and make observations in the file lab8/README. Then copy the current version of your program to sieve/tbbprimes14.cpp and record this work in a commit.

thing4$ cp sieve/{sieve,tbbprimes14}.cpp
thing4$ git add README sievetest.*_* sieve/{sieve,tbbprimes14}.cpp
thing4$ git commit -m "lab8: Parallel_for, lambda, capture by reference. tbbprimes14.cpp"
Parallelize at least one more loop in the program, aiming to improve performance the most when compared to tbbprimes12.cpp. Compile, run, benchmark, and make observations in the file lab8/README. Then copy the current version of your program to sieve/tbbprimes15.cpp and record this work in a commit.

thing4$ cp sieve/{sieve,tbbprimes15}.cpp
thing4$ git add README sievetest.*_* sieve/{sieve,tbbprimes15}.cpp
thing4$ git commit -m "lab8: Further parallelization. tbbprimes15.cpp"

Note: provide a more specific commit message if you prefer.
Patterns in Introduction to TBB, continued
The dominant parallel programming patterns in this lab are now quite familiar, since we have seen them in three different technologies: Loop Parallel and the reduce operation, which we have counted under the low-level pattern Collective Communication. However, the implementation of these patterns feels at a higher level in TBB than in MPI or even OpenMP.
With TBB's parallel_for(), we use templated classes, and create callable objects (by defining operator()) or write lambda expressions to specify the computational work to be done according to a Loop Parallel pattern. Once one becomes familiar with this sophisticated use of the C++ language, this approach to implementation feels at a higher level -- closer to an application and generally further from hardware details -- than the explicit range splitting and many-parametered message-passing or collective-communication functions of MPI, or even the reduction clause for an OpenMP #pragma compiler directive.

The same implementation elements give TBB's parallel_reduce() the appearance of a much higher level of programming than its counterparts in MPI and OpenMP. Especially with the four-argument version suitable for lambda expressions, we can implement a parallelization with parallel_reduce(), focusing primarily on the essential computations needed for an application, by simply providing the computation needed for each chunk of data (func) and the operation needed to combine two such chunks (reduction).
The term programming paradigm refers to overall abstract patterns in the uses of programming languages. The three most common programming paradigms are:
imperative programming, in which one carries out programming tasks using sequences of instructions (for order), loop constructs (for repetition), and state changes such as memory assignment and input/output (for data management);
functional programming, which employs nested function calls (for ordering), recursion (for repetition), and function return values (for data movement); and
object-oriented programming, which uses "objects" with state variables and methods to group data and operations together, and which includes notions such as inheritance, communication through message passing, etc.
TBB with objects as arguments is consistent with solving problems with object-oriented programming, whereas TBB with lambda expressions deploys functional programming. Note: Many experienced professionals anticipate that functional programming will play a key role in the parallel programming systems of the future. Using TBB's parallel_reduce() with lambda expressions suggests why they may think so, since the functional paradigm makes it relatively easy to think at a high level and parallelize correctly, avoiding low-level difficulties like race conditions.
Patterns in the implementation of TBB
As with OpenMP, the way TBB carries out its work illustrates some interesting parallel programming patterns.
Maintaining a Thread Pool reduces the overhead of creating and destroying threads, as with OpenMP.
Through the blocked_range<> data structure, the range of a parallelized computation can be subdivided, and those smaller ranges subdivided, etc., in order to find a computation size or granularity that matches the available computational resources. This implementation strategy reflects the Recursive Splitting pattern.

These same recursive splits lead to an efficient way to compute a reduction: perform the reduce operation first on small subdivisions (producing a partial reduction for the range that was split to produce those small subdivisions), then continue recursively until the reduction of the entire tree of subdivisions has been calculated. Many implementations of MPI and OpenMP perform efficient reductions using the same Recursive Splitting strategy, although TBB's explicit ranges make the approach easier to visualize.
______
Recursive Splitting is considered to be a high-level pattern for finding the parallelism in a programming problem, at the same level as Data Decomposition and Task Decomposition.
TBB assigns a queue of computations to each of its threads, illustrating the Task Queue pattern. If some threads finish the work in their queues while other threads still have long queues of work to complete, TBB enables the otherwise idle threads to carry out tasks from the busy threads' queues, in an automated way. This strategy is called work stealing. Some implementations of MPI also support work stealing, but this strategy is a design goal for TBB.
Deliverables
All of your code for this assignment should already be contained in commits. Modify the most recent commit message to indicate that you are submitting the completed assignment.

thing4$ git commit --amend

Add the following to the latest commit message.

lab8: complete

If your assignment is not complete, indicate your progress instead, e.g.,

lab8: items 1-5 complete, item 6 partial

You can make later commits to submit updates. Finally, pull/push your commits in the usual way.

thing4$ git pull origin master
thing4$ git push origin master
Files: README trap_tbb2.cpp trap_tbb3.cpp sieve/sieve.cpp sievetest.*_* sieve/tbbprimes10.cpp sieve/tbbprimes11.cpp sieve/tbbprimes12.cpp sieve/tbbprimes13.cpp sieve/tbbprimes14.cpp sieve/tbbprimes15.cpp
Use one of the git strategies in lab1 Deliverables to submit your work in this lab. For example, if you worked exclusively on thing3, include the name thing3 in your commit message.

thing3$ cd ~/PDC
thing3$ git pull origin master
thing3$ git add -A
thing3$ git commit -m "Lab 8 submit (thing3)"
thing3$ git push origin master

Likewise, if you worked on both link machines and a thing, you can rename your lab8 folders and submit each, with commit messages indicating the computer on which that commit was created. Also, fill out this form to report on your work.
See the lab1 Deliverables section if you need to set up the cluster you used for git.
If you did your work on two different clusters, submit work from both of them using one of the strategies in lab1 Deliverables.
This lab is due by Wednesday, November 7, 2018.