Introduction to TBB programming

CS 300, Parallel and Distributed Computing (PDC)
Due Saturday, October 29, 2016


Preliminary material

Basic laboratory exercises: TBB containers

  1. Log into thing4 and create a lab20 subdirectory of your PDC working directory for work on this lab, and change directory to that directory.

    Also create a second terminal window that is logged into a Link computer. We will use this to copy files from the Link file system to thing4. Note: The CS program permits logins and scp file transfers from IT-managed computers to CS-managed computers, but for security reasons St. Olaf does not permit them from CS-managed computers to IT-managed computers.

  2. Try TBB computation as follows.

    • Copy the trap_tbb.cpp program to your lab20 directory on thing4. For example, on a Link terminal window,

      link%  scp  ~cs300/tbb/trap_tbb.cpp  thing4.cs.stolaf.edu:PDC/lab20 
      

    • Now, compile the code and run it on thing4, for example,

      thing4$  g++ -o trap_tbb trap_tbb.cpp -ltbb
      thing4$  ./trap_tbb
      
      Does trap_tbb.cpp compute a correct answer? If not, why not?

    • The given code trap_tbb.cpp uses a class SumHeights that defines operator() to provide the second argument for TBB's parallel_for(). With C++11, there is a more compact way to specify that argument, using lambda expressions. To try this, copy trap_tbb.cpp to trap_tbbL.cpp, and replace the call

      parallel_for(blocked_range<size_t>(1, n), SumHeights(a, h, integral));
      
      in the new copy trap_tbbL.cpp by

      parallel_for(blocked_range<size_t>(1, n),
                   [=, &integral] (const blocked_range<size_t> r) {
                     for(size_t i = r.begin(); i != r.end(); i++)
                       integral += f(a+i*h);
                   } );
      
      You can also remove the definition of the class SumHeights, which is no longer being used. Instead, the lambda expression
      	[=, &integral] (const blocked_range<size_t> r) {
      	  for(size_t i = r.begin(); i != r.end(); i++)
      	    integral += f(a+i*h);
      	}
      
      behaves like an anonymous function with one argument, a blocked_range<size_t> named r.

      • The syntax [=, &integral] starts that lambda expression. That syntax describes how to construct a closure for executing that lambda expression. A closure contains copies of/references to local variables defined outside of that lambda expression.

        • The solitary = indicates that by default, any of main()'s local variables that the lambda expression uses are copied to private variables within that lambda expression's closure. In this case, private copies are made of the variables a and h, which that lambda expression could change without affecting main()'s variables a and h.

        • The notation &integral indicates that main()'s variable integral is available by reference in that lambda expression's closure. Thus, when a thread executing that lambda expression changes integral, that change affects main()'s variable integral.

        The elements = and &integral are called captures for that lambda expression's closure.

    • The syntax (const blocked_range<size_t> r) provides the type and name for the first argument in that lambda expression. Lambda expressions may have zero or more arguments, specified with types and argument names, as for functions.

    • The for loop between curly brackets is the body of that lambda expression, comparable to the body of a function definition.

    • A lambda expression may be called, and will perform binding of arguments and evaluation of the body expression, as with other functions.

    • Here is an example of calling a lambda expression directly.

      [] (int i) { return i+1; } (6)
      
      The lambda expression is [] (int i) { return i+1; }. Since it doesn't use any variables other than the argument i, we don't need to include any captures. The final parenthesized (6) provides the argument value. That value 6 is bound to the name i, then the body return i+1; is evaluated, producing 7.
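
      Here is another small, self-contained example (not from the lab code; the variable names are made up for illustration) contrasting the copy capture = with a reference capture:

      #include <iostream>

      int main() {
        int a = 1, total = 0;
        // a is captured by copy (via =); total is captured by reference
        auto accum = [=, &total] (int x) { total += a + x; };
        accum(5);                          // adds a + 5 == 6 to total
        std::cout << total << std::endl;   // prints 6
        return 0;
      }

      Changing total inside the lambda changes main()'s total, while the private copy of a shields main()'s a from any modification inside the lambda.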

    To compile with lambda expressions, on a thing, include the flag -std=c++11. For example,

    thingn$  g++ -std=c++11 -o trap_tbbL trap_tbbL.cpp -ltbb
    

    Record this work in a commit.

    thing4$  git add trap_tbb.cpp trap_tbbL.cpp
    thing4$  git commit -m "lab20: Buggy TBB implementations of trap program, using a class and a lambda"
    


  3. To try a TBB container, first copy the sequential program omprimes1.cpp and its Makefile to your lab20 directory on thing4, and test that program. For example, on a Link computer,

    link%  scp  ~cs300/lab6/{omprimes1.cpp,Makefile}  thing4.cs.stolaf.edu:PDC/lab20

    Then, in the directory lab20 on thing4,

    thing4$  cp omprimes1.cpp sieve.cpp
    thing4$  make
    thing4$  ./sieve 100 4
    

    Once the program is running correctly, perform benchmark tests, and record observations in a file lab20/README.

    Record this work in a commit.

    thing4$  git add Makefile omprimes1.cpp sieve.cpp
    thing4$  git commit -m "lab20: omprimes1 computation of primes"
    

  4. Modify your program to use a TBB concurrent_vector instead of an STL vector for pset. Note: Since the program has not yet been parallelized, we don't strictly need a thread-safe container yet. This change introduces concurrent_vector and prepares for the next lab, when we will parallelize the code using TBB.

    1. Create a subdirectory lab20/tbb for developing and benchmarking, and copy Makefile and sieve.cpp to that subdirectory. Then, replace the STL vector<> template with concurrent_vector<>.

    2. The preprocessor directive

      #include <tbb/concurrent_vector.h>
      
      will inform the compiler how to use the templated type tbb::concurrent_vector<>.
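
      As a point of reference, here is a minimal sketch of the substitution (the element type int is an assumption; match whatever type your sieve code stores in pset):

      #include <tbb/concurrent_vector.h>

      // before: std::vector<int> pset;
      tbb::concurrent_vector<int> pset;   // thread-safe replacement with a similar interface

      int main() {
        pset.push_back(2);   // push_back() is safe to call from multiple threads at once
        pset.push_back(3);
        return pset.size() == 2 ? 0 : 1;
      }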

    3. Modify Makefile in lab20/tbb to add -ltbb to compilation and linking.

    4. Perform a simple test of the resulting sieve.cpp to ensure that it compiles and runs correctly.

    5. Record this work in a commit.

      thing4$  cd ~/PDC/lab20/tbb
      
      thing4$  git add Makefile sieve.cpp
      thing4$  git commit -m "lab20: substitute TBB concurrent_vector for vector in sieve.cpp"
      

  5. Now, perform benchmark tests to compare the sequential code's performance with the STL vector against TBB's concurrent_vector in this application.

    1. Copy your benchmark script from the previous lab, and move the original sieve code to a new subdirectory stl.

      thing4$  git pull origin master 
      thing4$  cd ~/PDC/lab20 
      thing4$  cp  ../lab/sievetest.sh  .
      thing4$  mkdir stl
      thing4$  git mv Makefile omprimes1.cpp sieve.cpp stl
      
      Here, we pull in order to make sure that the thing4 working directory is up to date, before copying the script sievetest.sh from the previous lab's subdirectory. Also, the git mv operation prepares to rename indicated files in the (sto)git repository, from lab20/* to lab20/stl/*. In particular, the git mv operation above is equivalent to the following:
      
      thing4$  mv Makefile omprimes1.cpp sieve.cpp stl
      thing4$  git rm Makefile omprimes1.cpp sieve.cpp
      thing4$  git add stl/Makefile stl/omprimes1.cpp stl/sieve.cpp
      

    2. Run the benchmark script on stl and enter any observations in a file lab20/README.

    3. Run the benchmark script on tbb and enter any observations in a file lab20/README. Be sure to identify the sievetest.*_* file you are analyzing in your README observation.

    4. Record this work in a commit.

      thing4$  cp tbb/sieve.cpp tbb/tbbprimes1.cpp
      
      thing4$  git add sievetest.sh README sievetest.*_* tbb/tbbprimes1.cpp
      thing4$  git commit -m "lab20: Relocate sieve.cpp to stl/;  benchmark stl and tbb using sievetest.sh"
      
      Note: The copy tbbprimes1.cpp makes it easy for graders to retrieve this stage of your work.

  6. Revise tbb/sieve.cpp to add a loop for precomputing primes up to sqrt(prime_max) and storing those primes in another concurrent_vector named pset0.

    • Instead of controlling the outer loop by p, control it by an index i into pset0. This will require you to determine the number of elements in pset0.

    • Define p as p = pset0[i], where i is the loop-control variable for the outer loop.
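
    Combining these two points, the revised outer loop might take this shape (a sketch assuming pset0 holds int primes; the marking of multiples is elided):

    tbb::concurrent_vector<int> pset0;   // primes up to sqrt(prime_max), precomputed

    for (size_t i = 0; i < pset0.size(); i++) {
      int p = pset0[i];
      // ... mark multiples of p, as in your existing outer-loop body ...
    }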

    Compile, run, benchmark, observe in README, and copy the resulting code tbb/sieve.cpp to a file tbb/tbbprimes2.cpp. Then record this work in a commit.

    thing4$  git add README sievetest.*_* tbb/{sieve.cpp,tbbprimes2.cpp}
    thing4$  git commit -m "lab20: Add loop to precompute primes; benchmark. tbbprimes2.cpp"
    

  7. Shared data structure patterns in Introduction to TBB programming.

    TBB's concurrent_vector<> implements a Shared Array pattern, enhanced by its ability to grow dynamically through push_back() and other methods. The concurrent_vector class also offers a grow_to_at_least() method for expanding the size of the underlying array in advance. Growing operations are costly in vector classes, but once a vector has the correct size, using it can be as efficient as using an ordinary array. If you can estimate in advance the size a vector needs to be, you can avoid most of the extra computational cost over arrays while retaining benefits such as growing when necessary.
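
    For example, a minimal sketch of pre-sizing (the size 1000 is an arbitrary placeholder):

    #include <tbb/concurrent_vector.h>

    int main() {
      tbb::concurrent_vector<int> v;
      v.grow_to_at_least(1000);   // v now holds at least 1000 elements
      v[999] = 7;                 // accesses below that size incur no further growth cost
      return 0;
    }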

    Likewise, the templated class concurrent_queue<> implements a Shared Queue. Besides the convenience of not having to build these concurrent data structures ourselves, these TBB containers are designed for scalability, so they will perform well even for large data.

    In tbb/sieve.cpp, change pset0 to be a concurrent_queue instead of a concurrent_vector, and use that work queue to control the outer loop for determining all primes.

    • concurrent_queue is a C++ template, so don't forget to specify the type for elements of your queue using angle brackets, e.g., <int>.

    • The method push() for a concurrent_queue adds its argument value to the end of the queue.

    • The method try_pop() attempts to remove a value from the front of a concurrent_queue and store it in its (reference) argument. It returns true if this succeeds, and returns false (with no state change) if the queue is empty.

    • Observe that a call pset0.try_pop(p) could be used to control the outer loop, e.g., as a guard for a while loop, as in the sketch below.
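
    Here is a hedged sketch of that while-loop control (assuming int elements; the loop body is elided):

    #include <tbb/concurrent_queue.h>

    tbb::concurrent_queue<int> pset0;
    // ... push() the precomputed primes onto pset0 ...

    int p;
    while (pset0.try_pop(p)) {   // repeats until the queue is empty
      // ... use p to mark multiples, as in the existing outer-loop body ...
    }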

    Compile, run, benchmark, observe in README, and copy the resulting code tbb/sieve.cpp to a file tbb/tbbprimes3.cpp. Then record this work in a commit.

    thing4$  git add README sievetest.*_* tbb/{sieve,tbbprimes3}.cpp
    thing4$  git commit -m "lab20: Substitute concurrent_queue for precomputed primes;
        benchmark. tbbprimes3.cpp"
    


Parallel extensions

Try one (or optionally both) of these parallelization extensions of the basic lab assignment above.

  1. Add TBB parallelism for the inner for loop in tbb/sieve.cpp.

    • Replace that for loop by a parallel_for() call that uses either a "functor" class that defines an appropriate operator() or a C++ lambda (one possible shape is sketched below).

    • Test and benchmark your code, and copy the tested version of sieve.cpp to tbb/tbbprimes4.cpp. Then record this work in a commit.

      thing4$  git add README sievetest.*_* tbb/{sieve,tbbprimes4}.cpp
      thing4$  git commit -m "lab20: Parallelize using parallel_for(). tbbprimes4.cpp"
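
    As promised above, here is one possible shape for that replacement, using a lambda. This is a sketch only: marked and prime_max are hypothetical names standing in for whatever your sieve code uses, and marked must support safe concurrent writes to distinct elements.

      parallel_for(blocked_range<size_t>(2, prime_max/p + 1),
                   [=, &marked] (const blocked_range<size_t> r) {
                     for (size_t k = r.begin(); k != r.end(); k++)
                       marked[k*p] = true;   // mark each multiple k*p of p
                   } );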
      

  2. Try adding OpenMP parallelism to sieve.cpp. Call the resulting program tbbprimes5.cpp. Record benchmark results and comments from experience in README.

    Notes.

    • Assuming that you are using a work queue to control your outer loop, and that your outer loop is a while loop, you can parallelize that outer loop so that each element popped off of the queue gets its own thread using OpenMP 3.0's task construct.

      Here is an example showing how to use the task construct for a different problem, namely, applying a function f() to each element of an array of integers up to the first element that contains 0:

      int arr[] = {9, 9, 2, 1, 0};  // defines/initializes array of 5 integers
      int *ptr;  // for pointing to one of the elements of the array arr

      #pragma omp parallel default(shared)
      {
        #pragma omp single private(ptr)
        {
          ptr = arr;
          while (*ptr) {  // while current array element does not hold 0
            #pragma omp task
            f(*ptr);
            ptr++;  // advance to next array element
          }
        }
      }
      
      
      Notes:
      • We are stepping through the array arr[] using a pointer ptr. If ptr "points to" (holds the address of) an element in the array arr[], then adding 1 to ptr causes it to hold the address of the next element in that array. We initialize ptr using the expression arr (no square brackets), which evaluates to the address of the first element in the array arr[].

      • The OpenMP single construct in this example guarantees that only one thread will execute the initialization, guard, and progress operations for the while loop. (This avoids race conditions on the variable ptr.)

      • As always, the OpenMP parallel construct assembles a team of threads, but no parallel computation actually occurs until a worksharing construct is used, such as a for or sections construct.

        The OpenMP 3.0 task construct similarly allocates threads from the team (as available) to carry out a statement. In this example, different threads can perform the call f(*ptr) in parallel for different values of *ptr.

        (Of course, f() needs to be a thread-safe function, or there will be race conditions from these parallel calls of f().)

      • One subtle point: private variables (such as ptr) in a construct that encloses a task become firstprivate variables in that task construct, by default. This means that those variables become private to the thread executing that task, but are also guaranteed to be initialized with the value they had just before that task began.

        In this case, ptr holds a different value for each iteration of the while loop. When a thread is assigned to carry out the task directive for one of those iterations, it receives its own copy of ptr initialized with ptr's value for that iteration. While that thread executes its task, ptr may be incremented and the next iteration can begin; if another thread is available, the OpenMP scheduler can assign that thread to the task for that next iteration; and so on.

        The OpenMP firstprivate feature is sometimes needed in other contexts besides tasks. It can be specified explicitly (for constructs that support it) using a firstprivate() clause.
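
        For instance, here is a small self-contained example of an explicit firstprivate() clause (the variable names are made up for illustration):

        #include <cstdio>

        int main() {
          int results[100];
          int offset = 10;    // initialized before the parallel region
          #pragma omp parallel for firstprivate(offset)
          for (int i = 0; i < 100; i++) {
            // each thread's private copy of offset starts out as 10
            results[i] = i + offset;
          }
          printf("%d\n", results[99]);  // prints 109
          return 0;
        }

        Compile with the OpenMP flag, e.g., g++ -fopenmp.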

    Test and benchmark your code, and copy the tested version of sieve.cpp to tbb/tbbprimes5.cpp. Then record this work in a commit.

    thing4$  git add README sievetest.*_* tbb/{sieve,tbbprimes5}.cpp
    thing4$  git commit -m "lab20: Parallelize using openmp. tbbprimes5.cpp"
    

  3. Algorithm patterns in Introduction to TBB programming.

    The OpenMP parallelization of tbbprimes5.cpp and the TBB parallelization in trap_tbb.cpp both exhibit the familiar Loop Parallel pattern. The OpenMP code in tbbprimes5.cpp also illustrates Master-Worker.

    Our use of the Shared Queue pattern in tbbprimes5.cpp constitutes a new mid-level parallel programming pattern.

    • Computation in a Task Queue pattern is organized as a series of tasks for distribution among threads of execution. The tasks often have variable or unpredictable computational sizes, so some threads may carry out multiple tasks while another thread continues with a lengthy task. This behavior means that Task Queue offers some natural load balancing among threads, which makes good use of those threads' execution time.

    Task Queue is closely related to Shared Queue (which one often needs in order to manage a task queue correctly in parallel) and Master-Worker (since a Task Queue algorithm is ordinarily a special case of Master-Worker). However, Task Queue is worth treating as a pattern in its own right, especially because of its natural load balancing property.

    An essential difference between Task Queue and Shared Queue is that Task Queue concerns structuring a program, whereas Shared Queue pertains to structuring data.

    This suggests categories for mid-level patterns.

    • Program structure patterns: Loop Parallel, SPMD, Fork-Join, Master-Worker, Task Queue

    • Data structure patterns: Shared Array, Shared Queue

    Likewise, we can identify categories for low-level patterns.

    • Execution progress patterns: Thread Pool

    • Coordination patterns: Message Passing, Collective Communication, Mutual Exclusion

    We choose the category name "Execution progress" because Thread Pool has to do with providing resources for parallel execution of machine instructions, not coordination of those resources.

Deliverables

All of your code for this assignment should already be contained in commits. Modify the most recent commit message to indicate that you are submitting the completed assignment.

thing4$  git commit --amend

Add the following to the latest commit message.

    lab20: complete

If your assignment is not complete, indicate your progress instead, e.g.,

    lab20: items 1-5 complete, item 6 partial

You can make later commits to submit updates.

Finally, pull/push your commits in the usual way.

thing4$  git pull origin master
thing4$  git push origin master


Files: trap_tbb.cpp trap_tbbL.cpp tbb/Makefile tbb/sieve.cpp sievetest.sh README sievetest.*_* stl/Makefile stl/sieve.cpp stl/omprimes1.cpp tbb/tbbprimes1.cpp tbb/tbbprimes2.cpp tbb/tbbprimes3.cpp tbb/tbbprimes4.cpp tbb/tbbprimes5.cpp

Use one of the git strategies in lab1 Deliverables to submit your work in this lab. For example, if you worked exclusively on thing3, include the name thing3 in your commit message.

thing3$  cd ~/PDC
thing3$  git pull origin master
thing3$  git add -A
thing3$  git commit -m "Lab 20 submit (thing3)"
thing3$  git push origin master

Likewise, if you worked on both link machines and a thing, you can rename your lab20 folders and submit each, with commit messages indicating the computer on which that commit was created.

Also, fill out this form to report on your work.

See lab1 Deliverables section if you need to set up the cluster you used for git.

If you did your work on two different clusters, submit work from both of them using one of the strategies in lab1 Deliverables.

This lab is due by Saturday, October 29, 2016.