Introduction to OpenMP programming


CS 300, Parallel and Distributed Computing (PDC)
Due Thursday, October 18, 2018

Overview
  • run on >= 3 environments

  • g++ -fopenmp -o trap_omp trap_omp.cpp
    

  • experiment with trap_omp

  • fix sections

  • sections with function calls

  • Other constructs?

Preliminary material

  • Shared-memory parallel computing

  • OpenMP concepts

  • parallel for and parallel sections constructs

  • Potential for race conditions...

Laboratory exercises

  1. On a Link computer, create a ~/PDC/lab5 subdirectory for work on the lab, and change directory to that directory.

  2. On that Link computer, copy ~rab/pdc/trap_omp.cpp to your lab5 directory, then compile and link this OpenMP version of the trapezoid computation as follows:

    link%  g++ -fopenmp -o trap_omp trap_omp.cpp
    

    The -fopenmp flag requests compiling and linking support for OpenMP. Note that one could also compile and link in separate steps, in which case -fopenmp should be used in both commands.
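
    For example, compiling and linking in separate steps would look like this:

    link%  g++ -fopenmp -c trap_omp.cpp
    link%  g++ -fopenmp -o trap_omp trap_omp.o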

  3. Try running the resulting program trap_omp without command-line arguments, then with a single positive command-line argument to request different thread counts. Observe how the output varies.
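
    For example (the single argument requests a thread count, as described above):

    link%  ./trap_omp
    link%  ./trap_omp 2
    link%  ./trap_omp 8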

    Make a commit of this code:

    link%  git add trap_omp.cpp 
    link%  git commit -m "lab5: Trapezoidal approximation with OpenMP"
    

  4. Patterns in trap_omp.cpp.

    Examining the code of trap_omp.cpp, we see Data Decomposition at work as before, this time splitting the work of adding the areas of trapezoids among multiple threads within a single process instead of among multiple processes spread out on a cluster. However, the higher-level OpenMP #pragma omp directive mostly conceals the details of Data Decomposition, since #pragma omp for divides the range of trapezoids among the threads automatically.

    The #pragma's reduction(+:integral) clause specifies that a reduce operation should take place to add the partial sums of trapezoids. We recognize this as the Collective Communication pattern. However, unlike MPI, OpenMP does not need network communication for this reduce operation; instead, OpenMP can share values among threads using memory locations in order to accomplish the reduction. All of the details are hidden in that reduction() clause except the essentials: what reduction operation to perform and what values to reduce, namely the value of each thread's variable integral.

    The trap_omp.cpp code also exhibits the Loop Parallel pattern, which focuses on computationally intensive loops as opportunities for parallelism. In the case of trap_omp.cpp, as with the MPI program trap.cpp, the number of iterations constitutes the main factor in computational intensity; other loops may instead do more computation within each iteration.
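
    To make these patterns concrete, here is a minimal sketch of a trapezoidal sum using parallel for with a reduction clause. This is illustrative only, not the actual trap_omp.cpp source: the integrand f(x) = x*x, the interval [0,1], and the names a, b, n, h, and integral are all assumptions.

    #include <cstdio>
    #include <cstdlib>
    #include <omp.h>

    // Illustrative integrand -- trap_omp.cpp may integrate a different f(x).
    static double f(double x) { return x * x; }

    int main(int argc, char** argv) {
        if (argc > 1)
            omp_set_num_threads(atoi(argv[1]));  // thread count from the command line

        double a = 0.0, b = 1.0;                 // interval of integration (illustrative)
        long n = 100000000;                      // number of trapezoids
        double h = (b - a) / n;                  // width of each trapezoid
        double integral = (f(a) + f(b)) / 2.0;   // endpoints counted half each

        // OpenMP divides the interior points among the threads (Data
        // Decomposition) and adds the per-thread partial sums together
        // (the Collective Communication hidden in reduction()).
        #pragma omp parallel for reduction(+:integral)
        for (long i = 1; i < n; i++)
            integral += f(a + i * h);

        printf("integral = %.10f\n", integral * h);
        return 0;
    }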

    Threads in OpenMP

    Behind the scenes, OpenMP's parallel for feature divides the work of the loop among multiple threads, which carry out separate paths of execution within a process. Threads are sometimes called lightweight processes: they execute their own sequences of instructions much as if they were independent processes, but there is less overhead computation to switch between threads within a process than to switch between different processes. This is because a process's threads share computational resources such as memory with that process. Those shared resources don't need to be switched, only thread-specific resources such as those that control the execution path (e.g., the program counter, as discussed in Hardware Design). As we have seen with OpenMP clauses such as shared() and private(), a programmer can control whether certain resources are shared among threads or dedicated to individual threads when solving a parallel programming problem.

    Of course, there is computational overhead for a process to create threads and to destroy them when they are no longer needed. Here are two patterns relevant to managing threads.

    • The Fork-Join pattern involves (1) allocating one or more threads ("forking," as if creating a fork in the road for two separate execution paths), (2) carrying out computations in those threads in parallel, then (3) waiting for all of those threads to finish their work ("joining" the separate execution paths back into a single path) before proceeding with sequential execution.

    • A Thread Pool is a collection of threads that have been created but not yet destroyed, which a programmer can reuse for segments of parallel computation as needed.

    OpenMP uses Fork-Join and Thread Pool implicitly in its work, so an OpenMP programmer never forks threads or interacts with OpenMP's thread pool directly. The omp parallel for directive splits up the range of values of a loop's control variable (Data Decomposition), then carries out those parcels of work using Fork-Join. Also, OpenMP typically creates a thread pool once at the outset of a program, then reuses that pool for all of the omp parallel regions throughout the program's code, in order to save the computational overhead of repeatedly creating and destroying threads.
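
    The following small program (a sketch, separate from the lab's code) makes the implicit Fork-Join visible: each parallel region forks a team of threads, runs its block on every thread, and joins before sequential execution continues; the second region typically reuses the pool created by the first.

    #include <cstdio>
    #include <omp.h>

    int main() {
        // Fork: OpenMP creates (or draws from) its thread pool.
        #pragma omp parallel
        {
            printf("region 1: thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }   // Join: all threads finish before execution continues.

        printf("--- sequential again ---\n");

        // A second region typically reuses the same thread pool.
        #pragma omp parallel
        {
            printf("region 2: thread %d\n", omp_get_thread_num());
        }
        return 0;
    }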

    A programmer doesn't need to use OpenMP to program with threads. We will see several other ways to use threads implicitly through other libraries or other languages. A programmer can also create and use threads directly using various thread packages, such as C++11 threads and POSIX threads (pthreads, also available for the C language).
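
    For comparison, here is a minimal sketch of explicit threading with C++11 threads (compile with g++ -std=c++11 -pthread); the programmer forks and joins by hand instead of relying on OpenMP's directives:

    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<std::thread> threads;

        // Explicit fork: create four threads by hand.
        for (int id = 0; id < 4; id++)
            threads.push_back(std::thread([id] {
                printf("hello from thread %d\n", id);
            }));

        // Explicit join: wait for each thread to finish.
        for (auto& t : threads)
            t.join();
        return 0;
    }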

  5. The time feature of the shell provides running time information for a program. Use

    link%  time ./trap_omp n
    
    to see how the running time varies depending on the number of threads used. (Note that for a multithreaded run, the reported user time sums CPU time across all threads, so it can exceed the real, i.e., wall-clock, time.)

    Try several powers of 2. Also try numbers that are near but not exactly powers of 2 (both above and below), and look for interesting patterns, as well as some arbitrary values that are not near powers of two. Try multiple runs with the same number of threads for a few thread counts: are the results always the same?

    Record your observations and results in a file README in your lab5 directory.

    Create a commit containing your README results.

    link%  git add README 
    link%  git commit -m "lab5: Link performance testing recorded in README"
    


  6. Compare the performance of trap_omp on that Link computer with the 64-core computer thing2.cs.stolaf.edu, as follows. Report on your observations in README.

    To accomplish this step,

    1. Pull/push your work on the Link computer to your stogit repository.

      link%  git pull origin master
      link%  git push origin master
      

    2. Copy your Link public SSH key to thing2.cs.stolaf.edu. Note: Using passwordless SSH is more secure than sending a password over the network. The following command adds your Link public SSH key to your account's ~/.ssh/authorized_keys file on thing2.cs.stolaf.edu.

      link%  ssh-copy-id username@thing2.cs.stolaf.edu
      
      You can expect to be asked whether you trust thing2.cs.stolaf.edu, and for your CS-managed password for connecting to thing2 this first time.

      • To test this step:

        Enter

        link%  ssh thing2.cs.stolaf.edu
        
        You should be able to log in successfully without a password.

    3. Log into thing2.cs.stolaf.edu (no password should be required). Prepare your thing2 account for git.

      thing2$  git config --global user.name "Your Name"
      thing2$  git config --global user.email username@stolaf.edu
      thing2$  git config --global core.editor emacs
      

    4. Also create an SSH key on thing2.

      thing2$  ssh-keygen -t rsa
      
      As before, you can use default responses for all three prompts from the ssh-keygen command.

      • To test this step:

        Copy your new public SSH key to another CS-managed computer on which you have an account, then SSH to that computer. For example,

        thing2$  ssh-copy-id username@cumulus.cs.stolaf.edu
        thing2$  ssh username@cumulus.cs.stolaf.edu
        
        Use your own username for username. The ssh-copy-id command should prompt for your CS-managed password, but the ssh command should succeed in logging you into cumulus without a password.

      Then, manually copy your thing2 public SSH key to stogit, by printing that public key file in your terminal window:

      thing2$  cat ~/.ssh/id_rsa.pub
      
      then browsing to stogit.cs.stolaf.edu and logging in (with your CS-managed password), navigating to add an SSH key, and copy/pasting the public key file's contents, as described in Getting started with Stogit.

    5. Now clone your stogit repository on thing2:

      thing2$  cd ~
      thing2$  git clone git@stogit.cs.stolaf.edu:pdc-f16/username.git PDC
      
      This should create a new subdirectory ~/PDC that contains all of your PDC repository.

    6. Change to your ~/PDC/lab5 subdirectory and compile trap_omp.cpp using the same compilation command as on the Link machine.

    7. Proceed to use time to test the performance of the resulting executable trap_omp. Record your results by adding to README, and add observations in README about how performance and speedup differ on the two systems.

      Commit your changes to README:

      thing2$  git add README 
      thing2$  git commit -m "lab5: thing2 performance testing in README"
      


  7. Create a program trap2.cpp that is the same as trap_omp.cpp, except removing the reduction clause from the OpenMP construct and adding integral as a shared variable. Try running this with various numbers of threads, including 1 (the default). How does this change the output? Can you explain this behavior? Write your observations and conclusions in README.
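
    For reference, the change amounts to swapping the directive's clauses, roughly like this sketch (the names f, a, h, n, and integral are illustrative; follow the actual code in trap_omp.cpp):

    // trap_omp.cpp: per-thread partial sums, combined by the reduction clause
    #pragma omp parallel for reduction(+:integral)
    for (long i = 1; i < n; i++)
        integral += f(a + i * h);

    // trap2.cpp: every thread updates the single shared variable directly
    #pragma omp parallel for shared(integral)
    for (long i = 1; i < n; i++)
        integral += f(a + i * h);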

    thing2$  git add trap2.cpp README
    thing2$  git commit -m "lab5: trade reduction for shared variable"
    

  8. The program sections.cpp was presented in class. Copy this program to your directory on a Link computer, compile it, and observe the behavior of the program over multiple runs.

    Notes:

    • To avoid a merge commit, first pull/push your work on thing2.

      thing2$  git pull origin master
      thing2$  git push origin master
      

    • Now, log into a Link computer and copy sections.cpp to your lab5 directory on that Link computer. Compile using -fopenmp as you did with the other OpenMP programs.

    • Report on your runs in your README file. Can you explain what went wrong for two or more of your sample runs? Include that analysis in README.

      Note: Please create only one README file, and use it to describe what you find on both computing systems, rather than making multiple READMEs.

    • Create a commit to record your progress. (Copying sections.cpp to sections1.cpp preserves this buggy version, since you will fix sections.cpp itself in the next exercise.)

      link%  cp sections.cpp sections1.cpp
      link%  git add sections1.cpp README
      link%  git commit -m "lab5: buggy runs of sections.cpp"
      

    • Since you have been committing on multiple machines, let's do a pull/push to double-check that the repository is up to date with your most recent changes on the Link machine.

      link%  git pull origin master
      link%  git push origin master
      

      Note: If you did the steps above slightly differently than the instructions, you may find that pulling causes a merge commit, and potentially a merge conflict. If you are not yet comfortable with merge commits and merge conflicts, see this video.


  9. Use OpenMP clauses, constructs, and/or other strategies to correct the behavior of sections.cpp.

    Notes:

    • First, decide what correct behavior means. Does it mean that computation produces the same answer each time? That each C++ statement occurs completely without interruption? That each section is computed once and only once? Write your definition of correct in your README file.

    • The sections construct supports the following OpenMP clauses (and others): private, reduction, num_threads, shared, default (either none or shared in C/C++). Other constructs may help, too.
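
    For reference, a parallel sections construct has the following general shape (a sketch, not the sections.cpp source); clauses such as num_threads attach to the construct as shown:

    #include <cstdio>
    #include <omp.h>

    int main() {
        // Each section is a unit of work that some thread in the team
        // executes exactly once.
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            { printf("section A: thread %d\n", omp_get_thread_num()); }

            #pragma omp section
            { printf("section B: thread %d\n", omp_get_thread_num()); }
        }   // implicit join at the end of the construct
        return 0;
    }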

    Keep notes on your efforts to correct sections.cpp in README.

    When you're ready, create a commit

    link%  git add sections.cpp README
    link%  git commit -m "lab5: Fixed sections.cpp bugs"
    


Deliverables

First, perform a pull/push on thing2 to double-check that your changes on thing2 were sent to your repository, and to update the working directory on thing2.

  • To do this, first SSH into thing2 and cd to your lab5 subdirectory.

  • Perform the following.

    thing2$  git pull origin master
    thing2$  git push origin master
    
    If you encounter a merge commit or merge conflict, this video may help.

  • Log back into your Link computer and cd to your lab5 directory for the following steps.

All of your code for this assignment should already be contained in commits. Modify the most recent commit message to indicate that you are submitting the completed assignment.

link%  git commit --amend

Add the following to the latest commit message:

    lab5: complete

If your assignment is not complete, indicate your progress instead, e.g.,

    lab5: items 1-5 complete, item 6 partial

You can make later commits to submit updates.

Finally, pull/push your commits in the usual way.

link%  git pull origin master
link%  git push origin master

Use one of the git strategies in lab1 Deliverables to submit your work for this lab. For example, rename your Link lab5 directory to lab5-link, and rename your thing2 lab5 directory to lab5-thing2 on thing2. Include the cluster name in your commit message, e.g.,

thing2$  cd ~/PDC
thing2$  git pull origin master
thing2$  git add -A
thing2$  git commit -m "lab5: submit (thing2)"
thing2$  git push origin master
Also, fill out this form to report on your work.

See the lab1 Deliverables section if you need to set up git on the cluster you used.

If you did your work on two different clusters, submit work from both of them using one of the strategies in lab1 Deliverables.

This lab is due by Thursday, October 18, 2018.


Files: trap_omp.cpp README trap2.cpp sections1.cpp sections.cpp