Performance improvement with OpenMP (CS 300, Parallel and Distributed Computing (PDC)<br/><span style='font-size: 65%'>Due Wednesday, January 14, 2015</span>)

Performance improvement with OpenMP

CS 300, Parallel and Distributed Computing (PDC)
Due Wednesday, January 14, 2015

Sequential alg
Parallelize
Opt 1: delete all evens
Opt 2: precompute, then remove barriers
Opt 3: cache

Preliminary material

OpenMP features
Given sequential algorithm for sieve computation of primes
Makefile for this lab (to be used in a subdirectory per lab segment.
______

Laboratory exercises

On the link computers, create a lab6 subdirectory for work on the lab, and change directory to that directory.
Create a subdirectory omprimes1 of your lab6 directory. Copy the Makefile and sequential version omprimes1.cpp for computing primes to your omprimes1 subdirectory, then change directory to that subdirectory. Rename omprimes1.cpp to sieve.cpp (or create a symbolic link).
Issue the make command to compile the given program, then test it using
```
    % bin/sieve 100 4
```
Verify that it correctly prints the first 20 primes, and that it correctly determines the number of primes less than 100 (you can start counting at the 21st prime by using the output provided).

Note: The first command-line argument for sieve.cpp is an upper bound on the primes found by this program: the largest prime it will find is 97 (maximum prime less than or equal to 100). The second argument is the number of threads used in OpenMP parallel computation. Since this is a sequential program (no parallel constructs), the second command-line argument is actually ignored for the computation in this version of the program, although sieve.cpp requires a user to provide both arguments anyway.
Test the code to see how many primes can be computed, e.g., one million or 100 million primes. Record your observations in a file lab6/README .
Change back to your lab6 subdirectory, and create a new subdirectory omprimes2 for a second version of primes computation. Copy the sequential version omprimes1/sieve.cpp and the Makefile omprimes1/Makefile to your new omprimes2 subdirectory. (This can be done conveniently using the command
```
    % cp omprimes1/{sieve.cpp,Makefile} omprimes2
```
). Then make omprimes2 your current directory.
Modify the program sieve.cpp in lab6/omprimes2 to operate in parallel, using OpenMP. Look for any opportunities for parallelism, decide whether parallelism is wise, and double-check your work to insure that the parallelism is correct (no race conditions or other indeterminacy.

Compile and test your code, first deterimining 100 primes, then testing with large numbers. Compare results with the sequential version -- do they agree?
Use time to compare performance for omprimes1/bin/sieve and omprimes2/bin/sieve for various prime ranges and thread counts. Also compare these programs' performance on the 32-core computers and potentially other systems. Record your results, observations, and conclusions in lab6/README.

Further performance improvements

Create a shell script for uniformly and conveniently running your benchmark tests for various versions of the sieve.cpp program. Here is an example:
```
    #!/bin/sh
    
    TIME="/usr/bin/time -p"
    OUT=../sievetest.`date +%m%d_%H:%M:%S`
    for DIR in $*
    do  cd $DIR
        for MAX in 1000000 100000000
        do  for NTHREADS in 1 4 16 32
            do  echo ========== $DIR $MAX $NTHREADS ========== >> $OUT
                for COUNT in `seq 1 3`
                do  $TIME bin/sieve $MAX $NTHREADS >/dev/null 2>>$OUT
                done
                echo "" >> $OUT
            done
        done
        cd ..
    done
```
This is a file of commands for the (Bourne or bash) shell. You can create a file sievetest.sh in your lab6 subdirectory containing that code, make it executable
```
    % chmod +x sievetest.sh
```
then run
```
    % ./sievetest.sh omprimes1 omprimes2
```
as a program to run your bin/sieve programs in your omprimes1 and omprimes2 directories, three times at each of several choices of maximum prime and thread count. The results will be stored in a file whose name includes timestamp information. Note: Keep in mind that everyone will be doing this -- be selective about how many tests you run! Also, the tests other people will be running on the same machine may affect your results (which is a reason to make multiple runs for each choice of command-line arguments, but not to make too many replicate runs).

Try the benchmarking script sievetest.sh with the code in your omprimes1 and omprimes2 directories on a lab machine.

Also, copy your code directories and sievetest.sh to a thingn machine, and run the benchmarks there to compare performance on those computers. Here is a way to perform the copy and tests:
```
    % cd ~/lab6 
    % tar cf primes.tar omprimes1 omprimes2 sievetest.sh 
    % scp primes.tar helios.public.stolaf.edu:
    % ssh helios
    helios $ scp primes.tar thingn:
    helios $ ssh thingn
    thingn $ tar xvf primes.tar
    thingn $ cd omprimes1 ; make clean;  make
    thingn $ cd ../omprimes2 ; make clean;  make
    thingn $ cd ..
    thingn $ sievetest.sh omprimes1 omprimes2
```
The tar program creates archives of files and directories, stored in a single file (primes.tar in this case). The clean target for the Makefile deletes any object files and executables, so these can be remade on the new machine. (The executables created on lab machines will probably work on thingn, but this tests with code generated by the compiler on that thingn.
The program cs300/lab6/omprimes3.cpp adds OpenMP constructs to parallelize the sieve computation.

Note: The OpenMP single construct insures that only one thread updates the value of the shared variable p.

Create a subdirectory lab6/omprimes3 and copy omprimes3.cpp and the Makefile into this new omprimes3 directory.

Use ./sievetest.sh omprimes3 to perform benchmarks on this version. How does it compare with the sequential version? With your parallelization in omprimes2? Record your observations in README.
In order to improve performance of omprimes3, observe that a lot of time is spent performing initialization and sieve operations on a large array. Half of the time, multiples of a prime p will be even numbers, which we already know are non-primes (except p=2).

Use this observation to modify your program to store only odd numbers in the array marked[]. Carry out this modification in a subdirectory omprimes4.
- Use the memory location marked[i] to represent the number 2*i+1; for example, marked[0] represents 1, marked[1] represents 3, etc.
- Start count at 1 instead of zero, in order to count the only even prime 2.
- Observe that marking every p elements of this compressed marked[] array will correctly sieve the odd multiples of p from the prime computation.
- Be sure to adjust the starting and ending indices for loops. The index for a number N will be (N-1)/2; for example, the index for p*p will be (p*p-1)/2.
- Test your program manually with maxprimes = 100 and numthreads = 4 to check that your program is computing primes directly, before proceeding to benchmark tests.
Use ./sievetest.sh omprimes4 to perform benchmarks on your debugged code. How does it compare to earlier versions? Record observations in README.
Patterns in Performance improvement with OpenMP.

Loop Parallel; Mutual Exclusion single; Collective Communication Barrier
______
______
The OpenMP single construct used in omprimes3.cpp causes a barrier: all threads must wait for that new prime value p to be determined before they can mark multiples of that prime in their segments of the array marked[]. To eliminate this barrier, we will precompute all the primes we will need (up to the square root of prime_max), then let each thread proceed to move through those primes at its own rate.

To implement this idea, make a copy of our omprimes4/sieve.cpp in a new directory omprimes5, plus the Makefile, and add a sequential version of the prime sieve to compute the primes up to ceil(sqrt((double)prime_max)).

Suggestion: Compute these values in another array (perhaps named pre_marked[]), and store them in a vector variable kset for convenient, quick access by all threads during the parallel computation.

Verify that this program is running correctly, then compare to other versions. Store observations in README.

Deliverables

Use one of the git stratgies in lab1 Deliverables to submit your work in this lab. For example, if you worked exclusively on thing3, include the name thing3 in your commit message.

    thing3$  cd ~/PDC
    thing3$  git pull origin master
    thing3$  git add -A
    thing3$  git commit -m "Lab 6 submit (thing3)"
    thing3$  git push origin master

Likewise, if you worked on both link machines and a thing, you can rename your lab6 folders and submit each, with commit messages indicating the computer on which that commit was created.

Also, fill out this form to report on your work.

See lab1 Deliverables section if you need to set up the cluster you used for git.

If you did your work on two different clusters, submit work from both of them using one of the strategies in lab1 Deliverables.

This lab is due by Wednesday, January 14, 2015.