Introduction to TBB programming
CS 300, Parallel and Distributed Computing (PDC)
Due Thursday, January 15, 2015
______
On a manycore computer (either a thing or the MTL), create a PDC/lab7 subdirectory of your personal directory for work on the lab, and change directory to that directory.
Try TBB computation as follows. Copy the trap_tbb.cpp program to your lab7 directory on the manycore computer, for example,

% scp ~cs300/tbb/trap_tbb.cpp thingn.cs.stolaf.edu:PDC/lab7

If you are on MTL, you must source the appropriate tbb.sh file.
Now, compile the code and run it, for example,

thing$ g++ -o trap_tbb trap_tbb.cpp -ltbb
thing$ ./trap_tbb

Does trap_tbb.cpp compute a correct answer? If not, why not?

The given code trap_tbb.cpp uses a class SumHeights that defines operator() in order to define the second argument for TBB's parallel_for(). With C++11, there is a more compact way to specify that argument, using lambda expressions. To try this, copy trap_tbb.cpp to trap_tbbL.cpp, and replace the call
parallel_for(blocked_range<size_t>(1, n), SumHeights(a, h, integral));

in the new copy trap_tbbL.cpp by

parallel_for(blocked_range<size_t>(1, n),
             [=, &integral] (const blocked_range<size_t> r) {
                 for (size_t i = r.begin(); i != r.end(); i++)
                     integral += f(a + i*h);
             });

You can also remove the definition of the class SumHeights, which is no longer being used. Instead, the lambda expression

[=, &integral] (const blocked_range<size_t> r) {
    for (size_t i = r.begin(); i != r.end(); i++)
        integral += f(a + i*h);
}

behaves like an anonymous function with one argument, a blocked_range<size_t> named r.
The syntax [=, &integral] starts that lambda expression. That syntax describes how to construct a closure for executing that lambda expression. A closure contains copies of, or references to, local variables defined outside of that lambda expression.
The solitary = indicates that, by default, any of main()'s local variables that the lambda expression uses are copied to private variables within that lambda expression's closure. In this case, private copies are made of the variables a and h, which that lambda expression could change without affecting main()'s variables a and h.
The notation &integral indicates that main()'s variable integral is available by reference in that lambda expression's closure. Thus, when a thread executing that lambda expression changes integral, that change affects main()'s variable integral.
The elements = and &integral are called captures for that lambda expression's closure.
The syntax (const blocked_range<size_t> r)
provides the type and name for the first argument in that lambda
expression. Lambda expressions may have zero or more arguments,
specified with types and argument names, as for functions.
The for loop between curly brackets is the body of that lambda expression, comparable to the body of a function definition.
A lambda expression may be called, and will perform binding of arguments and evaluation of the body expression, as with other functions.
Here is an example of calling a lambda expression directly.

[] (int i) { return i+1; } (6)

Since this lambda expression doesn't use any variables other than the argument i, we don't need to include any captures. The final parenthesized (6) provides the argument value. That value 6 is bound to the name i, then the body return i+1; is evaluated, producing 7.

To compile with lambda expressions on a thing, include the flag -std=c++0x. For example,

thingn$ g++ -std=c++0x -o trap_tbbL trap_tbbL.cpp -ltbb

______
Begin with a copy of the sequential program omprimes1.cpp and its Makefile. Copy from a link machine to your 32-core machine, using the name sieve.cpp for the copy of omprimes1.cpp on that 32-core machine. For example, for thingn,

% scp ~cs300/lab6/omprimes1.cpp thingn.cs.stolaf.edu:PDC/lab7/sieve.cpp
% scp ~cs300/lab6/Makefile thingn.cs.stolaf.edu:PDC/lab7

Also copy the Makefile for compiling, and verify that the sequential sieve program compiles and produces correct answers on your 32-core machine.
Once the program is running correctly, perform benchmark tests, and record observations in a file README. Also, make a copy of this version of sieve.cpp, called tbbprimes1.cpp.
Modify your program to use a TBB concurrent_vector instead of an STL vector for pset.
You may create a fresh subdirectory for developing and testing the new version of this code, as in the previous lab, if desired.
If you are using MTL, you must source the appropriate tbb.sh file.
The preprocessor directive

#include <tbb/concurrent_vector.h>

will inform the compiler how to use the templated type tbb::concurrent_vector<>.

Modify Makefile to add -ltbb to compilation and linking.
Compile and run, then perform benchmark tests. Compare to the STL sequential code's performance, and record observations in README. Then, make a copy tbbprimes2.cpp of this version of sieve.cpp.
Add a loop for precomputing primes up to sqrt(prime_max) and storing those primes in another concurrent_vector named pset0. Instead of controlling the outer loop by p, control it by an index i into pset0. This will require you to determine the number of elements in pset0. Define p as p = pset0[i], where i is the loop-control variable for the outer loop.
Compile, run, benchmark, observe in README, and copy to tbbprimes3.cpp.
Shared data structure patterns in Introduction to TBB programming.
Shared Array
______
Shared Queue
______
Task Queue
______
Change pset0 to be a concurrent_queue instead of a concurrent_vector, and use that work queue to control the outer loop for determining all primes. concurrent_queue is a C++ template, so don't forget to specify the type for elements of your queue using angle brackets, e.g., <int>.
The method push() for a concurrent_queue adds its argument value to the end of the queue. The method try_pop() attempts to remove a value from the front of a concurrent_queue and store it in its (reference) argument. It returns true if this succeeds, and returns false (with no state change) if the queue is empty. Observe that a call pset0.try_pop(p) could be used to control the outer loop, e.g., as a guard for a while loop.
Compile, run, benchmark, observe in README, and copy to tbbprimes4.cpp.
Try one (or optionally both) of these parallelization extensions of the basic lab assignment above.
Try adding TBB parallelism to the inner for loop... tbbprimes5.cpp. Record benchmark results and comments from experience in README.
Starting with tbbprimes4.cpp, try adding OpenMP parallelism... tbbprimes6.cpp. Record benchmark results and comments from experience in README.
Notes.
Assuming that you are using a work queue to control your outer loop, and that your outer loop is a while loop, you can parallelize that outer loop so that each element popped off of the queue gets its own thread, using OpenMP 3.0's task construct.
Here is an example showing how to use the task construct for a different problem, namely, applying a function f() to each element of an array of integers up to the first element that contains 0:

int arr[] = {9, 9, 2, 1, 0};  // defines/initializes array of 5 integers
int *ptr;                     // for pointing to one of the elements of the array arr

#pragma omp parallel default(shared)
{
    #pragma omp single private(ptr)
    {
        ptr = arr;
        while (*ptr) {  // while current array element does not hold 0
            #pragma omp task
            f(*ptr);
            ptr++;      // advance to next array element
        }
    }
}

Notes:
We are stepping through the array arr[] using a pointer ptr. If ptr "points to" (holds the address of) an element in the array arr[], then adding 1 to ptr causes it to hold the address of the next element in that array. We initialize ptr using the expression arr (no square brackets), which evaluates to the address of the first element in the array arr[].
The OpenMP single construct in this example guarantees that only one thread will execute the initialization, guard, and progress operations for the while loop. (This avoids race conditions on the variable ptr.)
As always, the OpenMP parallel construct assembles a team of threads, but no parallel computation actually occurs until work is distributed among those threads by another construct, such as a for or sections construct.
The OpenMP 3.0 task construct is another way to distribute work: it allocates threads from the team (as available) to carry out a statement. In this example, different threads can perform the call f(*ptr) in parallel for different values of *ptr. (Of course, f() needs to be a thread-safe function, or there will be race conditions from these parallel calls of f().)
One subtle point: private variables (such as ptr) in a construct that encloses a task become firstprivate variables in that task construct, by default. This means that those variables become private to the thread executing that task, but are also guaranteed to be initialized with the value they had just before that task began.
In this case, ptr holds a different value for each iteration of the while loop. When a thread is assigned to carry out the task directive for one of those iterations, it receives its own copy of ptr initialized with ptr's value for that loop's iteration. While that thread executes its task, ptr may be incremented and the next iteration can begin; if another thread is available, the OpenMP scheduler can assign that thread to the task for that next iteration; and so on.
The OpenMP firstprivate feature is sometimes needed in other contexts besides tasks. It can be specified explicitly (for constructs that support it) using a firstprivate() clause.
Algorithm patterns in Introduction to TBB programming.
Loop Parallel, trap_tbb* and tbbprimes4
______
Master-Worker
______
Use one of the git strategies in lab1 Deliverables to submit your work in this lab. For example, if you worked exclusively on thing3, include the name thing3 in your commit message.

thing3$ cd ~/PDC
thing3$ git pull origin master
thing3$ git add -A
thing3$ git commit -m "Lab 7 submit (thing3)"
thing3$ git push origin master

Likewise, if you worked on both link machines and a thing, you can rename your lab7 folders and submit each, with commit messages indicating the computer on which that commit was created.

Also, fill out this form to report on your work.
See the lab1 Deliverables section if you need to set up the cluster you used for git.
If you did your work on two different clusters, submit work from both of them using one of the strategies in lab1 Deliverables.
This lab is due by Thursday, January 15, 2015.