Introduction to TBB, continued (CS 300, Parallel and Distributed Computing (PDC)<br/><span style='font-size: 65%'>Due Saturday, November 5, 2016</span>)

Introduction to TBB, continued

CS 300, Parallel and Distributed Computing (PDC)
Due Wednesday, November 7, 2018

______

when not to use queues, TBB tutorial 6.3.2

Preliminary material

______

Laboratory exercises

On a manycore computer such as thing4.cs.stolaf.edu, create a ~/PDC/lab8 subdirectory for work on the lab, and change directory to that directory.
First, note that trap_tbb.cpp produces incorrect answers because of a race condition, since multiple threads modify the shared variable integral that is allocated in main() and passed by reference to the object
```
SumHeights(a, h, integral)
```
that parallel_for() uses for all its parallel threads. During execution, TBB's parallel_for() has its threads call operator() for that object, and the race condition can occur when those parallel threads modify my_int (i.e., integral) during their calls of operator().

The program trap_tbb2.cpp avoids that operation using TBB's parallel_reduce() call.
```
SumHeights2 sh2(a, h, integral);
parallel_reduce(blocked_range<size_t>(1, n), sh2);
```
The second argument of that parallel_reduce() call is an object sh2 of a class SumHeights2 that differs from the prior class SumHeights in several respects.
- Whereas the state variable my_int in trap_tbb.cpp is a reference to main()'s variable integral, in trap_tbb2.cpp the variable my_int is a public state variable. This makes the value of a SumHeights2 object's my_int visible (and modifyable) by multiple threads.
- The revised class SumHeights2 contains a second constructor, called a split constructor, which TBB can call in order to create a new SumHeights2 object when it wants to subdivide a blocked_range for computation by different threads. This way, TBB threads can perform their computations on different objects, instead of each of them sharing a single object (as in the parallel_for() algorithm of trap_tbb.cpp). Note: This split constructor initializes the new object's my_int variable at 0, which is the identity value for our reduce operation +.
  
  The parallel_reduce() algorithm can use join() method calls to perform a reduce operation in efficient ways. For example, suppose that an array of 40 SumHeights2 objects sh[0], sh[1], sh[2], ... sh[39] have computed subtotal sums of heights of trapezoids for forty different blocked_range()s. Then parallel_reduce() might consolidate all forty subtotals directly within one SumHeights2 object, e.g,
```
sh[0].join(sh[1]);  sh[0].join(sh[2]);  ...  sh[0].join(sh[39]);  
```
  (which might be accomplished by a loop
```
for (int i = 1; i < 40;  i++)
  sh[0].join(sh[i]);
```
  .) This strategy can't be parallelized without dealing with a potential race condition (involving the public state variable sh[0].my_int). Alternatively, parallel_reduce() might add ten values at a time using the following strategy:
```
sh[0].join(sh[1]);  sh[0].join(sh[2]);  ...  sh[0].join(sh[9]);  
sh[10].join(sh[11]);  sh[10].join(sh[12]);  ...  sh[10].join(sh[19]);  
sh[20].join(sh[21]);  sh[20].join(sh[22]);  ...  sh[20].join(sh[29]);  
sh[30].join(sh[31]);  sh[30].join(sh[32]);  ...  sh[30].join(sh[39]);  
```
  then join those partial reduce values
```
sh[0].join(sh[10]);  sh[0].join(sh[20]);  sh[0].join(sh[30]);  
```
  (Sample code:
```
for (int d = 0;  d < 4;  d++)
  for (int i = 1;  i < 10;  i++)
    sh[d*10].join(sh[d*10+i]);
for (int d = 1;  d < 4;  d++)
  sh[0] = sh[d*10];
```
  .) This approach can be parallelized, and could improve performance if we had far more than 40 SumHeights2 objects or if the join() operation involved a time-consuming computation.
- SumHeights2 also provides a join() method for combining the my_int values from SumHeights2 objects, which would be the results of a thread's computation. The method join() performs the reduce operations, which is addition + for this problem. Note: This method join() (together with the use of the identity 0 in the split constructor) is comparable to MPI_SUM in an MPI reduce, or reduction(+: ...) in an OpenMP reduction clause.
TBB's parallel_reduce() uses a split constructor and a join() method to do its work, dividing up the trapezoid computations among multiple objects used by multiple threads, then adding up the results.

DO THIS: Copy trap_tbb2.cpp to ~/PDC/lab8 on thing4.cs.stolaf.edu, then compile and run the program. Verify that it produces a correct value of 2 for the integral.

Then, use time to measure performance of several runs of the program. Record the data and make observations in a file ~/PDC/lab8/README.

Finally, record this work in a commit.
```
thing4$  git add README trap_tbb2.cpp
thing4$  git commit -m "lab8: Corrected trap parallelization using parallel_reduce(). trap_tbb2.cpp"
```
Make a copy of trap_tbb2.cpp named trap_tbb3.cpp, and modify trap_tbb3.cpp to use lambda expressions instead of a body class SumHeights2 to specify the computation for a TBB parallel_reduce() call.

(For more on lambda expressions in parallel_reduce(), see this link.
- This requires a four-argument form of the parallel_reduce() function:
```
template<typename Range, typename Value,
         typename Func, typename Reduction>
Value parallel_reduce( const Range& range, const Value& identity,
                       const Func& func, const Reduction& reduction);
```
  Here is a description of the four arguments.
  - range is a blocked_range<> object.
  - identity is the (left) identity element of the type Value for the computation in func. For our computation to add the heights of rectangles, the type Value is double, and we are adding heights of rectangles, so identity should be 0.0 .
  - func is a function for performing a thread's computation (e.g., adding its range of trapezoid heights). This function func corresponds to operator() for an object like SumHeights2, but func has an extra argument of type Value and returns a Value, as described below.
  - reduction is the reduction function, which takes two Value arguments and returns a Value. reduction corresponds to the join() method in SumHeights2. In our case, reduction should add two double arguments and return the double result.
  We will use lambda expressions for func and reduction.
- The two lambda expressions use return values to transmit their results, unlike the methods operator() and join of an object like SumHeight2.
  - func has two arguments, a blocked_range<> and a Value argument, and func returns a Value. In contrast, operator() for a class like SumHeight2 has only one blocked_range object and no return value. func's second argument and return value are used instead of a (public) state variable to accumulate the answer.
    In our case, func should sum the heights of rectangles for the blocked_range argument, add that sum to the second argument (a double), and return the result (another double). The extra argument and return value take the place of the public state variable my_int in SumHeights2.
  - reduce's two arguments and return value contrast with join's one argument and no return value. Again, the extra argument and return value are used to perform reduction without a state variable like my_int.
- Capturing. For our computation, the lambda expression func requires the values of a and h. Both of these variables can be captured by copy (it should also work to capture them by reference, since func shouldn't change them). Note that the use of the double argument in func eliminates the need for func to maintain a copy of integral. Effectively, the double argument and return value represent before and after values of the integral computation for a given blocked range.
  
  The lambda expression reduction doesn't need to capture any values for adding its two double arguments.
- Unlike the two-argument version of parallel_reduce(), this four-argument version returns a value in order to deliver its results. Thus, we will replace
```
SumHeights2 sh2(a, h, integral);
parallel_reduce(blocked_range(1, n), sh2);
integral += sh2.my_int;
```
  by the functional-style call
```
integral += parallel_reduce(blocked_range(1, n), 0.0, ...);
```
  where ... represents your two lambda expressions func and reduce.
- Don't forget to delete the unneeded class SumHeight2 in your code
Test your program trap_tbb3.cpp with lambda expressions, benchmark it using time as before, and record data and observations in a file lab8/README. Then, record this work in a commit.
```
thing4$  git add README trap_tbb3.cpp
thing4$  git commit -m "lab8: parallel_reduce() with lambda. trap_tbb3.cpp"
```
For one of the programs tbbprimesn.cpp from the previous lab with n ≥ 4, make a copy of that program in a new subdirectory lab8/sieve, naming the copy sieve.cpp.

Record your initial sieve.cpp in a commit.
```
thing4$  git add README sieve/sieve.cpp
thing4$  git commit -m "lab8: initial sieve code sieve/sieve.cpp"
```
Add a TBB parallel_for to parallelize the for loop for initializing your "premarked[]" array in sieve/sieve.cpp, i.e., the "marked[]" array of char that you use for precomputing an initial subset of primes.

Notes:
- This requires replacing that loop by a call to parallel_for; see the main() in trap_tbb.cpp for an example.
- The range object for this call to parallel_for needs to be constructed with the correct arguments. For example, if your loop were
```
for (int i = 7;  i <= N;  i++) { 
  ... 
}
```
  then the range object can be constructed with the call
```
blocked_range<size_t>(7, N+1);
```
- The body object for this call to parallel_for can be a member of a class named, for example, InitPremarked. That class is responsible for performing initialization on a range of indices for the array premarked[].
  
  Therefore, the class InitPremarked will need a (private) state variable for holding a copy of the (address of the) array premarked, perhaps named my_premarked. Also, the constructor must initialize my_premarked from its argument. The state variable and constructor argument can have type char*, but they should not be const char*, because the purpose of InitPremarked is to make changes in the (characters pointed to by that) array.
Compile, run, benchmark, make observations in a file lab8/README. Then make a copy sieve/tbbprimes10.cpp of sieve/sieve.cpp and record this work in a commit.
```
thing4$  cp sieve/sieve.cpp sieve/tbbprimes10.tbb
```
```
thing4$  git add README sievetest.*_* sieve/{sieve,tbbprimes10}.cpp
thing4$  git commit -m "lab8: Parallelize premarked loop using parallel_for/object. tbbprimes10.cpp"
```
Now modify sieve/sieve.cpp to use a lambda expression instead of a class InitPremarked.
- Instead of InitPremarked's private array state variable my_premarked, use a by-copy capture of main()'s array variable premarked. This is the only local variable from main() needed in the lambda expression, so you can indicate this capture using the notation [=] .
- This lambda expression requires only a blocked_range<size_t> as an argument. It's body initializes the (captured) array premarked for that range.
- Delete the now-unneeded class InitPremarked.
Compile, run, benchmark, make observations in a file lab8/README. Then copy the current version of your program to sieve/tbbprimes11.cpp and record this work in a commit.
```
thing4$  cp sieve/{sieve,tbbprimes11}.cpp
```
```
thing4$  git add README sievetest.*_* sieve/{sieve,tbbprimes11}.cpp
thing4$  git commit -m "lab8: Parallelize premarked loop using parallel_for/lambda. tbbprimes11.cpp"
```
Note: In the cp command above, the shell notation
```
sieve/{sieve,tbbprimes11}.cpp
```
expands to that cp command's two arguments
```
sieve/sieve.cpp sieve/tbbprimes11.cpp
```
Next, use TBB's parallel_for to parallelize the loop for marking multiples of a given prime p in the array premarked[].
- For this loop, create a class named SievePrime; an object of that class will be used as the second argument of this call of parallel_for.
  
  The iteration for this loop needs to increment by p instead of incrementing by 1. You can accomplish this by constructing blocked_range<size_t> object for generating integer values that are 1/p of the index values you want to mark in the array premarked[]. For example, if your loop was
```
for (int k = p*p;  k < maximum;  k += p) {
  ...
}
```
  then you can generate such a range object using the call
```
blocked_range<size_t>(p, maximum/p);
```
  then multiplying index values from that range by p, e.g, premarked[p*i]. Be sure to adjust both the starting and ending values for this range.
- The body class for this loop will need two state variables, premarked and p, both passed in the constructor.
  
  Compile, run, benchmark, make observations in a file lab8/README. Then copy the current version of your program to sieve/tbbprimes12.cpp and record this work in a commit.
```
thing4$  cp sieve/{sieve,tbbprimes12}.cpp
```
```
thing4$  git add README sievetest.*_* sieve/{sieve,tbbprimes12}.cpp
thing4$  git commit -m "lab8: Parallelize premarked loop using parallel_for/object. tbbprimes12.cpp"
```
Modify your sieve/sieve.cpp program to use a lambda expression instead of the class SievePrime.
- Both of the variables of main() to be captured by this lambda expression (namely, premarked and p) should be captured by copy. Therefore, [=] can again be used to start the lambda expression.
  
  Compile, run, benchmark, make observations in a file lab8/README. Then copy the current version of your program to sieve/tbbprimes13.cpp and record this work in a commit.
```
thing4$  cp sieve/{sieve,tbbprimes13}.cpp
```
```
thing4$  git add README sievetest.*_* sieve/{sieve,tbbprimes13}.cpp
thing4$  git commit -m "lab8: Parallelize loop for marking primes using parallel_for/object. tbbprimes13.cpp"
```
Next, use TBB's parallel_for with a lambda expression to parallelize the loop that assigns the starter set of primes to your (thread-safe!) concurrent_queue variable.
- This time, the lambda expression must capture both the array premarked and your queue variable containing known primes. The pointer premarked can be captured by copy, but the queue should be captured by reference, so this lambda expression can add a prime to main()'s state variable if it finds one.
Compile, run, benchmark, make observations in a file lab8/README. Then copy the current version of your program to sieve/tbbprimes14.cpp and record this work in a commit.
```
thing4$  cp sieve/{sieve,tbbprimes14}.cpp
```
```
thing4$  git add README sievetest.*_* sieve/{sieve,tbbprimes14}.cpp
thing4$  git commit -m "lab8: Parallel_for, lambda, capture by reference. tbbprimes14.cpp"
```
Parallelize at least one more loop in the program, aiming to improve performance the most when compared to tbbprimes12.cpp.

Compile, run, benchmark, make observations in a file lab8/README. Then copy the current version of your program to sieve/tbbprimes15.cpp and record this work in a commit.
```
thing4$  cp sieve/{sieve,tbbprimes15}.cpp
```
```
thing4$  git add README sievetest.*_* sieve/{sieve,tbbprimes15}.cpp
thing4$  git commit -m "lab8: Further parallelization. tbbprimes15.cpp"
```
Note: provide a more specific commit message if you prefer.

Patterns in Introduction to TBB, continued .

The dominant parallel programming patterns in this lab are now quite familiar, since we have seen them in three different technologies: Loop Parallel and the reduce operation, which we have counted under the low-level pattern Collective Communication. However, the implementation of these patterns feels at a higher level in TBB than in MPI or even OpenMP.

With TBB's parallel_for(), we use templated classes, and create callable objects (by defining operator()) or write lambda expressions to specify the computational work to be done according to a Loop Parallel pattern. Once one becomes familiar with this sophisticated use of C++ language, this approach to implementation feels at a higher level -- closer to an application and generally further from hardware details -- than the explicit range splitting and many parametered message-passing or collective communication functions of MPI, or even the reduction clause for an OpenMP #pragma compiler directive.
The same implementation elements give TBB's parallel_reduce() the appearance of a much higher level of programming than the counterparts in MPI and OpenMP. Especially with the four-argument version suitable for lambda expressions, we can implement a parallelization with parallel_reduce(), focusing primarily on the essential computations needed for an application, by simply providing the computation needed for each chunk of data (func1) and the operation needed to combine two such chunks (func2).

The term programming paradigm refers to overall abstract patterns in the uses of programming languages. The three most common programming paradigms are:

imperative programming, in which one carries out programming tasks using sequences of instructions (for order), loop constructs (for repetition), and state changes such as memory assignment and input/output (for data managment);
functional programming, which employs nested function calls (for ordering), recursion (for repetition), and function return values (for data movement); and
object-oriented programming, which uses "objects" with state variables and methods to group data and operations together, and which includes notions such as inheritance, communication through message passing, etc.

TBB with objects as arguments is consistent with solving problems with object-oriented programming, whereas TBB using lambda expressions deploys functional programming, as represented by the use of lambda expressions. Note: Many experienced professionals anticipate that functional programming will play a key role in the parallel programming systems of the future. Using TBB's parallel_reduce() with lambda expressions indicates why they may think so, since it's relatively easy to think on a high level and parallelize correctly, avoiding low-level difficulties like race conditions, using the functional paradigm.

Patterns in the implementation of TBB

As with OpenMP, the way TBB carries out its work illustrates some interesting parallel programming patterns.

Maintaining a Thread Pool reduces the overhead of creating and destroying threads, as with OpenMP.
Through the blocked_range<> data structure, the range of a parallelized computation can be subdivided, and those smaller ranges subdivided, etc., in order to find a computation size or granularity that matches the available computational resources. This implementation strategy reflects the Recursive Splitting pattern.
These same recursive splits lead to an efficient way to compute a reduction: perform the reduce operation first on small subdivisions (producing a partial reduction for the range that was split to produce those small subdivisions), then continue recursively until the reduction of the entire tree of subdivisions has been calculated. Many implementations of MPI and OpenMP perform efficient reductions using the same Recursive Splitting strategy, although TBB's explicit ranges make the approach easier to visualize.
______ Recursive Splitting is considered to be a high level pattern for finding the parallelism in a programming problem, the same level as Data Decomposition and Task Decomposition.
TBB assigns a queue of computations to each of its threads, illustrating the Task Queue pattern. If some threads finish the work in their queues while other threads still have long queues of work to complete, TBB enables the otherwise idle threads to carry out tasks from the busy threads' queues, in an automated way. This strategy is called work stealing. Some implementations of MPI also support work stealing, but this strategy is a design goal for TBB.

Deliverables

All of your code for this assignment should already be contained in commits. Modify the most recent commit message to indicate that you are submitting the completed assignment.

thing4$  git commit --amend

Add the following to the latest commit message.

    lab8: complete

If your assignment is not complete, indicate your progress instead, e.g.,
    lab8: items 1-5 complete, item 6 partial
You can make later commits to submit updates.

Finally, pull/push your commits in the usual way.

thing4$  git pull origin master
thing4$  git push origin master

Files: README trap_tbb2.cpp trap_tbb3.cpp sieve/sieve.cpp sievetest.*_* sieve/tbbprimes10.cpp sieve/tbbprimes11.cpp sieve/tbbprimes12.cpp sieve/tbbprimes13.cpp sieve/tbbprimes14.cpp sieve/tbbprimes15.cpp

Use one of the git stratgies in lab1 Deliverables to submit your work in this lab. For example, if you worked exclusively on thing3, include the name thing3 in your commit message.

thing3$  cd ~/PDC
thing3$  git pull origin master
thing3$  git add -A
thing3$  git commit -m "Lab 21 submit (thing3)"
thing3$  git push origin master

Likewise, if you worked on both link machines and a thing, you can rename your lab8 folders and submit each, with commit messages indicating the computer on which that commit was created.

Also, fill out this form to report on your work.

See lab1 Deliverables section if you need to set up the cluster you used for git.

If you did your work on two different clusters, submit work from both of them using one of the strategies in lab1 Deliverables.

This lab is due by Saturday, November 5, 2016.