Performance improvement with OpenMP
CS 300, Parallel and Distributed Computing (PDC)
Due Wednesday, January 14, 2015
Sequential alg
Parallelize
Opt 1: delete all evens
Opt 2: precompute, then remove barriers
Opt 3: cache
OpenMP features
Given sequential algorithm for sieve computation of primes
Makefile for this lab (to be used in a subdirectory per lab segment.
______
On the link computers, create a
lab6 subdirectory for work on the lab, and change
directory to that directory.
Create a subdirectory omprimes1 of your
lab6 directory. Copy the Makefile and
sequential version omprimes1.cpp for computing primes to
your omprimes1 subdirectory, then change directory to
that subdirectory. Rename omprimes1.cpp to
sieve.cpp (or create a symbolic link).
Issue the make command to compile the given
program, then test it using
% bin/sieve 100 4
Verify that it correctly prints the first 20 primes, and that it correctly determines the number of primes less than 100 (you can start counting at the 21st prime by using the output provided).
Note: The first command-line argument for
sieve.cpp is an upper bound on the primes found by this
program: the largest prime it will find is 97 (maximum prime less
than or equal to 100). The second argument is the number of threads
used in OpenMP parallel computation. Since this is a sequential
program (no parallel constructs), the second command-line
argument is actually ignored for
the computation in this version of the program, although
sieve.cpp requires a user to provide both arguments
anyway.
Test the code to see how many primes can be computed, e.g., one
million or 100 million primes. Record your observations in a file
lab6/README .
Change back to your lab6
subdirectory, and create a new subdirectory omprimes2
for a second version of primes computation. Copy the sequential
version omprimes1/sieve.cpp and the Makefile
omprimes1/Makefile to your new omprimes2
subdirectory. (This can be done conveniently using the command
% cp omprimes1/{sieve.cpp,Makefile} omprimes2
). Then make omprimes2 your current directory.Modify the program sieve.cpp in lab6/omprimes2 to operate in parallel, using
OpenMP. Look for any opportunities for parallelism,
decide whether parallelism is wise, and double-check your work to
insure that the parallelism is correct (no race conditions or other
indeterminacy.
Compile and test your code, first deterimining 100 primes, then testing with large numbers. Compare results with the sequential version -- do they agree?
Use time to compare performance for
omprimes1/bin/sieve and
omprimes2/bin/sieve for various prime ranges and thread
counts. Also compare these programs' performance on the 32-core
computers and potentially other systems. Record your results,
observations, and conclusions in lab6/README.
Create a shell script for uniformly and conveniently running
your benchmark tests for various versions of the
sieve.cpp program. Here is an example:
#!/bin/sh
TIME="/usr/bin/time -p"
OUT=../sievetest.`date +%m%d_%H:%M:%S`
for DIR in $*
do cd $DIR
for MAX in 1000000 100000000
do for NTHREADS in 1 4 16 32
do echo ========== $DIR $MAX $NTHREADS ========== >> $OUT
for COUNT in `seq 1 3`
do $TIME bin/sieve $MAX $NTHREADS >/dev/null 2>>$OUT
done
echo "" >> $OUT
done
done
cd ..
done
This is a file of commands for the (Bourne or bash)
shell. You can create a file sievetest.sh in your
lab6 subdirectory containing that
code, make it executable
% chmod +x sievetest.sh
then run
% ./sievetest.sh omprimes1 omprimes2
as a program to run your bin/sieve programs in your
omprimes1 and omprimes2 directories,
three times at each of several choices of maximum prime and thread
count. The results will be stored in a file whose name includes
timestamp information. Note: Keep in mind that everyone
will be doing
this -- be selective about how many tests you run! Also, the
tests other people will be running on the same machine may affect
your results (which
is a reason to make multiple runs for each choice of
command-line arguments, but not to make too many
replicate runs).
Try the benchmarking script sievetest.sh with the code
in your omprimes1 and omprimes2
directories on a lab machine.
Also, copy your code directories and sievetest.sh to a
thingn machine, and run the benchmarks there
to compare performance on those computers. Here is a way to
perform the copy and tests:
% cd ~/lab6
% tar cf primes.tar omprimes1 omprimes2 sievetest.sh
% scp primes.tar helios.public.stolaf.edu:
% ssh helios
helios $ scp primes.tar thingn:
helios $ ssh thingn
thingn $ tar xvf primes.tar
thingn $ cd omprimes1 ; make clean; make
thingn $ cd ../omprimes2 ; make clean; make
thingn $ cd ..
thingn $ sievetest.sh omprimes1 omprimes2
The tar program creates archives of files and
directories, stored in a single file (primes.tar in
this case). The clean target for the
Makefile deletes any object files and executables, so
these can be remade on the new machine.
(The executables created on lab machines will probably work on
thingn, but this tests with code generated
by the compiler on that thingn.
The program cs300/lab6/omprimes3.cpp adds OpenMP constructs to parallelize
the sieve computation.
Note: The OpenMP single construct insures
that only one thread updates the value of the shared variable p.
Create a subdirectory lab6/omprimes3 and copy omprimes3.cpp and the
Makefile into this new omprimes3 directory.
Use ./sievetest.sh omprimes3 to perform benchmarks on
this version. How does it compare with the sequential version? With
your parallelization in omprimes2? Record your
observations in README.
In order to improve performance of omprimes3,
observe that a lot of time is spent performing initialization and
sieve operations on a large array. Half of the time, multiples of a
prime p will be even numbers, which we already know are
non-primes (except p=2).
Use this observation to modify your program to store only odd
numbers in the array marked[]. Carry out this
modification in a subdirectory omprimes4.
Use the memory location marked[i] to represent the
number 2*i+1; for example, marked[0]
represents 1, marked[1] represents 3, etc.
Start count at 1 instead of zero, in order to
count the only even prime 2.
Observe that marking every p elements of this compressed
marked[] array will correctly sieve the odd multiples of
p from the prime computation.
Be sure to adjust the starting and ending indices for loops.
The index for a number N will be (N-1)/2;
for example, the index for p*p will be
(p*p-1)/2.
Test your program manually with maxprimes = 100
and numthreads = 4 to check that your program is
computing primes directly, before proceeding to benchmark tests.
Use ./sievetest.sh omprimes4 to perform benchmarks on
your debugged code. How does it compare to earlier versions? Record
observations in README.
Patterns in Performance improvement with OpenMP.
Loop Parallel;
Mutual Exclusion single;
Collective Communication Barrier
The OpenMP single construct used in
omprimes3.cpp causes a barrier: all threads
must wait for that new prime value p to be determined
before they can mark multiples of that prime in their segments of the
array marked[]. To eliminate this barrier, we will
precompute all the primes we will need (up to the square root of
prime_max), then let each thread proceed to move through
those primes at its own rate.
To implement this idea, make a copy of our
omprimes4/sieve.cpp in a new directory
omprimes5, plus the Makefile, and add a
sequential version of the prime sieve to compute the primes up to
ceil(sqrt((double)prime_max)).
Suggestion: Compute these values in another array
(perhaps named pre_marked[]), and store them in a
vector variable kset for convenient, quick
access by all threads during the parallel computation.
Verify that this program is running correctly, then compare to
other versions. Store observations in README.
Use one of the
git stratgies in lab1 Deliverables to submit
your work in this lab. For example, if you worked exclusively on
thing3, include the name thing3 in your
commit message.
thing3$ cd ~/PDC
thing3$ git pull origin master
thing3$ git add -A
thing3$ git commit -m "Lab 6 submit (thing3)"
thing3$ git push origin master
Likewise, if you worked on both link machines and a
thing, you can rename your lab6
folders and submit each, with commit messages indicating the computer
on which that commit was created.Also, fill out this form to report on your work.
See lab1 Deliverables
section if you need to set up the cluster you used for
git.
If you did your work on two different clusters, submit work from both of them using one of the strategies in lab1 Deliverables.
This lab is due by Wednesday, January 14, 2015.