Introduction to OpenMP programming

CS 300, Parallel and Distributed Computing (PDC)
Due Tuesday, January 13, 2015

Preliminary material

Laboratory exercises

  1. On a link computer, create a ~/PDC/lab5 subdirectory for your work on this lab, and change to that directory.

  2. Copy ~cs300/omp/trap_omp.cpp to your lab5 directory. On a link computer, compile this OpenMP version of trap_omp.cpp as follows:

    g++ -fopenmp -o trap_omp trap_omp.cpp
    

    The -fopenmp flag enables OpenMP support for both compiling and linking. Note that one could also compile and link in separate steps, in which case -fopenmp should be used in both commands.
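
    For example, a separate compile-and-link sequence could look like this:

      g++ -fopenmp -c trap_omp.cpp
      g++ -fopenmp -o trap_omp trap_omp.o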

  3. Try running the resulting program trap_omp without command-line arguments, then with a single positive command-line argument to request different thread counts. Observe how the output varies.

    Patterns in trap_omp.cpp.

    Examining the code of trap_omp.cpp, we see Data Decomposition at work as before, this time splitting the work of adding the areas of trapezoids among multiple threads within a single process instead of among multiple processes spread across a cluster. However, the higher-level OpenMP #pragma omp directive mostly conceals the details of Data Decomposition, since OpenMP divides the range of trapezoids among the threads automatically with #pragma omp for.
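
    To see this automatic division concretely, here is a small stand-alone illustration (not part of trap_omp.cpp) that prints which thread handles each iteration of a loop; compile it with g++ -fopenmp.

      #include <iostream>
      #include <omp.h>

      int main() {
          // OpenMP assigns each thread its own subset of the 16 iterations
          #pragma omp parallel for
          for (int i = 0; i < 16; i++) {
              // critical keeps the output of different threads from interleaving
              #pragma omp critical
              std::cout << "iteration " << i << " handled by thread "
                        << omp_get_thread_num() << std::endl;
          }
          return 0;
      }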

    The #pragma's reduction(+:integral) clause specifies that a reduce operation should take place to add the partial sums of trapezoids. We have known this as a Collective Communication pattern. However, unlike MPI, OpenMP does not need network communication for this reduce operation. Instead, OpenMP can share values among threads using memory locations in order to accomplish the reduction. All of the details are hidden in that reduction() clause, except the essentials: what reduction operation to perform and what values to reduce, namely the value of each thread's variable integral.
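
    For concreteness, here is a minimal sketch of a trapezoidal sum written with such a reduction clause. The interval, the placeholder function f(x) = sin(x), and the variable names are assumptions for illustration only; the actual trap_omp.cpp may differ in detail.

      #include <cmath>
      #include <iostream>

      int main() {
          const double a = 0.0, b = 1.0;   // assumed interval of integration
          const long n = 1000000;          // number of trapezoids
          const double h = (b - a) / n;    // width of each trapezoid

          double integral = 0.0;
          // each thread accumulates its own private copy of integral;
          // the reduction clause adds those partial sums at the end
          #pragma omp parallel for reduction(+:integral)
          for (long i = 1; i < n; i++)
              integral += std::sin(a + i * h);

          integral = (integral + (std::sin(a) + std::sin(b)) / 2.0) * h;
          std::cout << "estimate: " << integral << std::endl;
          return 0;
      }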

    The trap_omp.cpp code represents a new pattern that is quite common in practice.

    • The Loop Parallel pattern focuses on computationally intensive loops as opportunities for parallelism. In the case of trap_omp.cpp, the large number of iterations is the main source of computational intensity; other loops may instead include more computation within each iteration (a sketch appears below).
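
    As a hypothetical example of that second kind of loop, the sketch below has only a few iterations, but each one carries substantial work, so a Loop Parallel approach still pays off.

      #include <cmath>
      #include <iostream>
      #include <vector>

      int main() {
          const int n = 8;                           // few iterations ...
          std::vector<double> result(n);

          #pragma omp parallel for
          for (int i = 0; i < n; i++) {
              double sum = 0.0;
              for (long k = 0; k < 10000000L; k++)   // ... but heavy work inside each one
                  sum += std::sqrt(static_cast<double>(k + i));
              result[i] = sum;
          }

          std::cout << "result[0] = " << result[0] << std::endl;
          return 0;
      }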

    Threads in OpenMP

    Behind the scenes, OpenMP's parallel for feature divides the work of the loop among multiple threads, which carry out separate paths of execution within a process. Threads are sometimes called lightweight processes: they execute their own sequences of instructions much as if they were independent processes, but there is less overhead to switch between threads within a process than to switch between different processes. This is because a process's threads share computational resources such as memory with that process. Those shared resources don't need to be switched, only thread-specific resources such as those that control the execution path (e.g., the program counter, as discussed in Hardware Design). As we have seen with OpenMP clauses such as shared() and private(), a programmer can control whether certain resources are shared among threads or dedicated to individual threads when solving a parallel programming problem.
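
    Here is a small stand-alone illustration (not from the lab's code) of that difference: shared_count is a single copy visible to every thread, while my_id, declared inside the parallel region, is automatically private to each thread.

      #include <iostream>
      #include <omp.h>

      int main() {
          int shared_count = 0;                     // one copy, shared by all threads
          #pragma omp parallel num_threads(4) shared(shared_count)
          {
              int my_id = omp_get_thread_num();     // declared inside the region: private
              #pragma omp critical
              {
                  shared_count++;
                  std::cout << "hello from thread " << my_id << std::endl;
              }
          }
          std::cout << shared_count << " threads ran" << std::endl;
          return 0;
      }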

    Of course, there is computational overhead for a process to create threads and to destroy them when they are no longer needed. Here are two patterns relevant to managing threads.

    • The Fork-Join pattern involves (1) allocating one or more threads ("forking," as if creating a fork in the road for two separate execution paths), (2) carrying out computations in those threads in parallel, then (3) waiting for all of those threads to finish their work ("joining" the separate execution paths back into a single path) before proceeding with sequential execution.

    • A Thread Pool is a collection of threads that have been created but not yet destroyed, which a programmer can reuse for segments of parallel computation as needed.

    OpenMP uses Fork-Join and Thread Pool implicitly in its work, so an OpenMP programmer never forks threads or interacts with OpenMP's thread pool directly. The omp parallel for directive splits up the range of values of a loop's control variable (Data Decomposition), then carries out those parcels of work using Fork-Join. Also, OpenMP typically creates a thread pool once at the outset of a program, then reuses that thread pool for all of the omp parallel regions throughout a program's code, in order to save the computational overhead of repeatedly creating and destroying threads.
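
    The Fork-Join behavior is easy to observe directly with a bare parallel region: the team of threads forks where the region begins and joins (at an implicit barrier) where it ends.

      #include <iostream>
      #include <omp.h>

      int main() {
          std::cout << "before the region: one thread" << std::endl;

          #pragma omp parallel                      // fork: a team of threads begins here
          {
              #pragma omp critical
              std::cout << "hello from thread " << omp_get_thread_num() << std::endl;
          }                                         // implicit join: all threads finish before continuing

          std::cout << "after the region: back to one thread" << std::endl;
          return 0;
      }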

    A programmer doesn't need to use OpenMP to program with threads. We will see several other ways to use threads implicitly through libraries (OpenMP being one such library) or through other languages. A programmer can also create and use threads directly using various thread packages, such as C++11 threads and POSIX threads (pthreads, also available for the C language).
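
    For contrast, here is a minimal sketch of creating threads directly with C++11 std::thread instead of OpenMP (compile with g++ -std=c++11 -pthread). Note that nothing coordinates the threads, so their output lines may interleave.

      #include <iostream>
      #include <thread>
      #include <vector>

      int main() {
          std::vector<std::thread> workers;
          for (int i = 0; i < 4; i++)
              workers.push_back(std::thread([i]() {
                  // each thread runs this lambda independently
                  std::cout << "worker " << i << " running\n";
              }));
          for (auto &t : workers)
              t.join();                             // wait for every thread to finish (the "join" of Fork-Join)
          return 0;
      }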

  4. The time command of the shell reports the running time of a program. Use

      %  time trap_omp n
    
    to see how the running time varies depending on the number of threads used.

    Try several powers of 2. Also try numbers that are near but not exactly powers of 2 (both above and below), and look for interesting patterns, as well as some arbitrary values that are not near powers of two. Try multiple runs with the same number of threads for a few thread counts: are the results always the same?

    Record your observations and results in a file README in your lab5 directory.

  5. Copy trap_omp.cpp to thing3 or thing1 as described below, then retest a selection of thread counts on that 32- or 64-core computer. Report on your observations in README.

    To accomplish this step, copy the file to that machine, recompile it there with the same g++ -fopenmp command, and rerun your timing tests.
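
    For example, one possible sequence is sketched below; the hostnames and paths are assumptions, so adjust them for your own account.

      link$  ssh thing3 mkdir -p PDC/lab5
      link$  scp trap_omp.cpp thing3:PDC/lab5/
      link$  ssh thing3
      thing3$  cd ~/PDC/lab5
      thing3$  g++ -fopenmp -o trap_omp trap_omp.cpp
      thing3$  time trap_omp 16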

  6. Create a program trap2.cpp that is the same as trap_omp.cpp, except that it removes the reduction clause from the OpenMP construct and adds integral as a shared variable. Try running this with various numbers of threads, including 1 (the default). How does this change the output? Can you explain this behavior? Write your observations and conclusions in README.
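
    As a hypothetical stand-alone illustration of this kind of change (your trap2.cpp should be based on the actual trap_omp.cpp, whose details may differ), compare a sum written with the accumulator shared and no reduction clause:

      #include <iostream>

      int main() {
          const long n = 1000000;
          double total = 0.0;

          // no reduction clause: every thread updates the one shared copy of total
          #pragma omp parallel for shared(total)
          for (long i = 0; i < n; i++)
              total += 1.0;

          std::cout << "total = " << total << " (compare with n = " << n << ")" << std::endl;
          return 0;
      }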

  7. The program ~cs300/omp/sections.cpp was presented in class. Copy this program to your directory, compile it, and observe the behavior of the program over multiple runs.
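
    If you want a reminder of the construct itself, here is a generic illustration of OpenMP sections (this is not the course's sections.cpp, whose contents may differ): each section is executed once, by whichever thread picks it up.

      #include <iostream>
      #include <omp.h>

      int main() {
          #pragma omp parallel sections
          {
              #pragma omp section
              std::cout << "section A run by thread " << omp_get_thread_num() << std::endl;

              #pragma omp section
              std::cout << "section B run by thread " << omp_get_thread_num() << std::endl;
          }
          return 0;
      }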

    Report on your runs in your README file. Can you explain what went wrong for two or more of your sample runs? Include that analysis in README.

    Note: Please create only one README file, and use it to describe what you find on both computing systems, rather than making multiple READMEs.

  8. Use OpenMP clauses, constructs, and/or other strategies to correct the behavior of sections.cpp. Notes:

    Keep notes on your efforts to correct sections.cpp in README.
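
    For reference, one construct that often helps when threads' actions interleave unpredictably is critical, which allows only one thread at a time to execute a block; whether it is the right fix here is an assumption on our part, and sections.cpp may call for a different strategy.

      #include <iostream>
      #include <omp.h>

      int main() {
          #pragma omp parallel
          {
              // only one thread at a time may execute this statement
              #pragma omp critical
              std::cout << "thread " << omp_get_thread_num()
                        << " prints without being interrupted" << std::endl;
          }
          return 0;
      }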

Deliverables

Use one of the git strategies in lab1 Deliverables to submit your work for this lab. For example, rename your Link lab5 directory to lab5-link, and rename your thing3 lab5 directory to lab5-thing3 on thing3. Include the cluster name in your commit message, e.g.,

    thing3$  cd ~/PDC
    thing3$  git pull origin master
    thing3$  git add -A
    thing3$  git commit -m "Lab 5 submit (thing3)"
    thing3$  git push origin master
Also, fill out this form to report on your work.

See the lab1 Deliverables section if you need to set up git on the cluster you used.

If you did your work on two different clusters, submit work from both of them using one of the strategies in lab1 Deliverables.

This lab is due by Tuesday, January 13, 2015.