
CS 300, Parallel and Distributed Computing (PDC)

Homework 3   Due 10/10/2016
  1. MPI review

    1. [C]

      Write an MPI program that exchanges messages between all pairs of processes in MPI_COMM_WORLD using point-to-point communication, then broadcasts one message using collective communication.

      A process with rank r should receive a message from each process with rank less than r, then send a message to each process with rank greater than r. This can be accomplished by having every process execute two loops, according to the following pseudocode:

      for (int i = 0;  i < r;  i++) {
          receive a message from any process
          print the message received and the rank r
      }
      for (int i = r+1;  i < nprocs;  i++) 
          send a message containing r to process with rank i
      
      Here, nprocs is the number of processes in MPI_COMM_WORLD.
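
      A minimal sketch of how those two loops might look in C++ using the MPI C interface is shown below. The message tag 0 and the printing format are illustrative choices (see the Notes below), not requirements.

      #include <mpi.h>
      #include <cstdio>

      int main(int argc, char** argv) {
          MPI_Init(&argc, &argv);
          int r, nprocs;
          MPI_Comm_rank(MPI_COMM_WORLD, &r);
          MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

          // Receive one message from each lower-ranked process, in whatever
          // order those messages happen to arrive.
          for (int i = 0;  i < r;  i++) {
              int v;
              MPI_Recv(&v, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              // two digits (sender's rank, then this rank); %02d pads a
              // leading zero when the sender is rank 0
              printf("%02d ", v*10 + r);
          }
          // Send this process's rank to each higher-ranked process.
          for (int i = r+1;  i < nprocs;  i++)
              MPI_Send(&r, 1, MPI_INT, i, 0, MPI_COMM_WORLD);

          // ... broadcast step goes here (see the Notes below) ...

          MPI_Finalize();
          return 0;
      }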

      Notes:

      • Lab 1 has an example of point-to-point send and receive calls in MPI, as well as MPI setup and termination, finding one's rank, etc.

        The synchronization-process code for the first Dining Philosophers lab gives an example of receiving from any process, using MPI_ANY_SOURCE.

      • If each message body is an MPI_INT value v equal to the sender's rank, then the printed value of v*10 + r should appear as a sequence of two uninterrupted digits indicating that message's source and destination. For example, if process 2 sends a message with integer body 2 to process 3, then the number 23 would be printed, then a space.

        By printing a single integer value rather than a succession of two values, we avoid the risk of one digit sequence occurring inside another.

      • After completing its second loop (above), each process should participate in an MPI broadcast collective call. The syntax for each process's call is exactly the same, but one process will provide a value that the other processes will receive via the broadcast.

        You may choose the process to serve as broadcast sender, perhaps randomly.

      • The MPI primes lab included a broadcast collective call.

      • To show how the broadcast proceeds, have the broadcast sender print a newline before the broadcast and send its own rank during the broadcast; then have each non-sender print the received rank r together with its own rank n, perhaps as the single integer r*10 + n (a sketch appears after these notes).

      • Here is sample output from an MPI program with five processes constructed as above:

        01031413040223123424
        31343230
        

        Note: This is one of many possible output orderings. The use of MPI_ANY_SOURCE for receives increases the potential variation in orderings. In this example, the process with rank 3 is the broadcast sender.
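
      Continuing the sketch above, the broadcast step might look as follows. The choice of broadcast sender (root) here is illustrative only; as noted above, any process may serve.

      // Every process calls MPI_Bcast with identical arguments; only the
      // process whose rank equals root supplies the value that is broadcast.
      int root = nprocs - 1;           // illustrative choice of sender
      int msg = r;                     // sender's buffer holds its own rank
      if (r == root)
          printf("\n");                // newline before the broadcast
      MPI_Bcast(&msg, 1, MPI_INT, root, MPI_COMM_WORLD);
      if (r != root)
          printf("%02d ", msg*10 + r); // sender's rank, then this rank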

  2. Cache performance in OpenMP primes

    1. [C]

      After finishing the lab on OpenMP primes, carry out one more performance-improvement strategy: Interchange the order of nesting of the loops that mark elements of the array marked[].

      Explanation: In the given algorithm, the outer loop iterates once per prime number p, and the inner loop iterates once per index i of marked[]. When an OpenMP for construct is applied to that inner loop, the OpenMP system divides the range of that loop into subranges and assigns those subranges to different threads for parallel computation. Unfortunately, the marking algorithm accesses array elements in "leaps": for example, if the prime is p=37, the inner loop marks every 37th element, starting at 37² = 1369 and proceeding along the entire length of marked[]. Some portion of this "leaping" traversal of marked[] will be performed by each thread, for that thread's subrange of indices. Because of this "leaping" and the fact that the subrange is long, values for memory locations in marked[] will be loaded into the cache but rarely (if ever) used more than once before they must be replaced by another memory value that is needed.
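
      For concreteness, the marking loops in the given nesting look roughly like the sketch below (assuming, for illustration only, that marked[i] stands for the integer i; the details in your sieve.cpp may differ). Each thread's chunk of the inner loop still leaps through memory in strides of p.

      #include <vector>

      // Given nesting: outer loop over the primes in pset, inner loop
      // "leaping" through marked[] in strides of p.
      void mark_composites_leaping(std::vector<char>& marked,
                                   const std::vector<int>& pset) {
          int size = (int) marked.size();
          for (int p : pset) {
              #pragma omp parallel for
              for (int i = p * p;  i < size;  i += p)
                  marked[i] = 1;   // each thread touches only every p-th
                                   // element of its chunk: poor cache reuse
          }
      }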

      In other words, it is better for cache performance to access a few memory locations repeatedly over a short time than to access many memory locations occasionally over a lengthy time. This is a matter of working with the cache to help it give us better performance.

      By using the array index i to control the outer loop of the computation, and using the prime p to control the inner loop, each element in the array marked[] will be used many times over a short period of time, making for good cache performance.

      DO THIS: Create a directory hw/omprimes6 on a Link computer, and copy sieve.cpp and Makefile from your lab omprimes5 directory to the new subdirectory omprimes6. Modify the program as described above, test that it runs correctly, then run the benchmarking script to compare performance with other versions of the sieve program. Write down your observations/results in a file hw/README.

      Notes:

      • The outer loop will range from i=1 to the size of the array marked[], and the inner loop will range over all primes p (drawn from wherever you stored them, perhaps in a vector named pset).

      • To determine whether to mark a location marked[i], use the remainder operator % : if the integer corresponding to the index i is divisible by p, assign 1 to marked[i].

      • Once a location marked[i] is assigned the value 1, no further primes need to be checked for that location; a rough sketch of this interchanged loop structure appears after these notes.
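
      The sketch below assumes, purely for illustration, that marked[i] stands for the integer i itself and that the primes are already stored in a vector named pset (as suggested above); adapt the starting index and the index-to-value mapping to your own sieve.cpp.

      #include <vector>

      // Interchanged nesting: outer loop over the indices of marked[],
      // inner loop over the primes in pset.
      void mark_composites(std::vector<char>& marked,
                           const std::vector<int>& pset) {
          int size = (int) marked.size();
          #pragma omp parallel for
          for (int i = 2;  i < size;  i++) {    // marked[i] stands for i
              for (int p : pset) {
                  if (i % p == 0 && i != p) {   // p divides i and p != i,
                      marked[i] = 1;            // so i is composite
                      break;                    // no further primes needed
                  }
              }
          }
      }

      Because consecutive values of i refer to adjacent elements, each thread's accesses are packed into a compact region of marked[] (and the small pset stays resident in cache), which is the cache-friendly access pattern described above.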

    2. [H]

      In the program omprimes6, is it better to parallelize the outer loop controlled by i, or the inner loop controlled by p? Write reasons for your answer.

    3. [H]

      The program omprimes3.cpp uses OpenMP constructs to cause each thread to check all primes against different subranges of the array marked[]. In another approach to the problem, one might cause each thread to check a subrange of primes against the entire array marked[]. What are the pros and cons of the latter approach?