Introduction to OpenMP programming
CS 300, Parallel and Distributed Computing (PDC)
Due Tuesday, January 13, 2015
run on >= 3 environments
g++ -fopenmp -o trap_omp trap_omp.cpp
experiment with trap_omp
fix sections
sections with function calls
Other constructs?
Shared Memory Parallel computing
OpenMP concepts: parallel for and parallel sections constructs
Potential for race conditions...
On a Link computer, create a ~/PDC/lab5 subdirectory for work on the lab, and change directory to that directory. Copy ~cs300/omp/trap_omp.cpp to your lab5 directory. On a Link computer, compile this OpenMP version of trap_omp.cpp as follows:
g++ -fopenmp -o trap_omp trap_omp.cpp
The -fopenmp flag requests compiling and linking support for OpenMP. Note that one could also compile and link in separate steps, in which case -fopenmp should be used in both commands.
Try running the resulting program trap_omp without command-line arguments, then with a single positive command-line argument to request different thread counts. Observe how the output varies.
Patterns in trap_omp.cpp.
Examining the code of trap_omp.cpp, we see Data Decomposition at work as before, this time splitting the work of adding the areas of trapezoids among multiple threads within a single process instead of among multiple processes spread out on a cluster. However, the higher-level OpenMP #pragma omp directive mostly conceals the details of Data Decomposition, since OpenMP divides the interval of trapezoids among the threads automatically with #pragma omp for.
The #pragma's reduction(+:integral) clause specifies that a reduce operation should take place to add the partial sums of trapezoids. We have known this as a Collective Communication pattern. However, unlike MPI, OpenMP does not need to use network communication for this reduce operation. Instead, OpenMP can share values among threads using memory locations in order to accomplish the reduction. All of the details are hidden in that reduction() clause, except the essentials of what reduction operation to perform and what values to reduce, namely the value of each thread's variable integral.
The trap_omp.cpp code represents a new pattern that is quite common in practice. The Loop Parallel pattern focuses on computationally intensive loops as opportunities for parallelism. In the case of trap_omp.cpp, the number of iterations constitutes the main factor in computational intensity. Other loops may include more computation within each iteration of a loop.
Behind the scenes, OpenMP's parallel for feature divides the work of the loop among multiple threads, which carry out separate paths of execution within a process. Threads are sometimes called lightweight processes: they execute their own sequences of instructions much as if they were independent processes, but there is less overhead computation to switch between threads within a process than to switch between different processes. This is because a process's threads share computational resources such as memory with that process. Those shared resources don't need to be switched, but only thread-specific resources such as those that control the execution path (e.g., the program counter, as discussed in Hardware Design). As we have seen with OpenMP clauses such as shared() and private(), a programmer can control whether certain resources are shared among threads or dedicated to individual threads when solving a parallel programming problem.
Of course, there is computational overhead for a process to create threads and to destroy them when they are no longer needed. Here are two patterns relevant to managing threads.
The Fork-Join pattern involves (1) allocating one or more threads ("forking," as if creating a fork in the road for two separate execution paths), (2) carrying out computations in those threads in parallel, then (3) waiting for all of those threads to finish their work ("joining" the separate execution paths back into a single path) before proceeding with sequential execution.
A Thread Pool is a collection of threads that have been created and not yet destroyed that a programmer can reuse for segments of parallel computations as needed.
OpenMP uses Fork-Join and Thread Pool implicitly in its work, so an OpenMP programmer never forks threads or interacts with OpenMP's thread pool directly. The omp parallel for directive splits up the range of values of a loop's control variable (Data Decomposition), then carries out those parcels of work using Fork-Join. Also, OpenMP typically creates a thread pool once at the outset of a program, then reuses that thread pool for all of the omp parallel regions throughout a program's code, in order to save the computational overhead of repeatedly creating and destroying threads.
A programmer doesn't need to use OpenMP to program with threads. We will see several other ways to use threads implicitly via libraries (such as OpenMP) or other languages. A programmer can also create and use threads directly using various thread packages, such as C++11 threads and POSIX threads (pthreads, also available for the C language).
The time feature of the shell provides running time information for a program. Use

% time trap_omp n

to see how the running time varies depending on the number of threads n. Try several powers of 2. Also try numbers that are near but not exactly powers of 2 (both above and below), and look for interesting patterns, as well as some arbitrary values that are not near powers of two. Try multiple runs with the same number of threads for a few thread counts: are the results always the same?
Record your observations and results in a file README in your lab5 directory.
Copy trap_omp.cpp to thing3 or thing1 as follows, then retest a selection of thread counts on that 32- or 64-core computer. Report on your observations in README.
To accomplish this step:
Log into thingn (n = either 3 or 1) and create ~/PDC and ~/PDC/lab5 subdirectories.
While you are logged into thingn, set up your ~/PDC directory for use with git as described in step 1 (only) of lab1 Deliverables.
Use scp (on a Link machine) to copy the file from your Link account to the ~/PDC/lab5 subdirectory of thingn.
Log into thingn. Compile trap_omp.cpp on thingn, and proceed with tests.
Create a program trap2.cpp that is the same as trap_omp.cpp, except removing the reduction clause in the OpenMP construct and adding integral as a shared variable. Try running this with various numbers of threads, including 1 (the default). How does this change the output? Can you explain this behavior? Write your observations and conclusions in README.
The program ~cs300/omp/sections.cpp was presented in class. Copy this program to your lab5 directory, compile it, and observe the behavior of the program over multiple runs. Report on your runs in your README file. Can you explain what went wrong for two or more of your sample runs? Include that analysis in README.
Note: Please create only one README file, and use it to describe what you find on both computing systems, rather than making multiple READMEs.
Use OpenMP clauses, constructs, and/or other strategies to correct the behavior of sections.cpp. Notes:
First, decide what correct behavior means. Does it mean that the computation produces the same answer each time? That each C++ statement executes completely without interruption? That each section is computed once and only once? Write your definition of correct in your README file.
The sections construct supports the following OpenMP clauses (and others): private, reduction, num_threads, shared, and default (either none or shared in C/C++). Other constructs may help, too.
Keep notes on your efforts to correct sections.cpp in README.
Use one of the git strategies in lab1 Deliverables to submit your work in this lab. For example, rename your Link lab5 directory to lab5-link, and rename your thing3 lab5 directory to lab5-thing3 on thing3. Include the cluster name in your commit message, e.g.,

thing3$ cd ~/PDC
thing3$ git pull origin master
thing3$ git add -A
thing3$ git commit -m "Lab 5 submit (thing3)"
thing3$ git push origin master

Also, fill out this form to report on your work.
See the lab1 Deliverables section if you need to set up the cluster you used for git.
If you did your work on two different clusters, submit work from both of them using one of the strategies in lab1 Deliverables.
This lab is due by Tuesday, January 13, 2015.