Homework
CS 300, Parallel and Distributed Computing (PDC)
MPI review
[C]
Write an MPI program that exchanges messages between all pairs of nodes
in MPI_COMM_WORLD using point-to-point communication, then
broadcasts one message using collective communication.
A process with rank r should receive
a message from each process with rank less than r, then
send a message to each process with rank greater than r.
This can be accomplished by having every process execute two loops,
according to the following pseudocode:
for (int i = 0; i < r; i++) {
    receive a message from any process
    print the message received and the rank r
}
for (int i = r+1; i < nprocs; i++) {
    send a message containing r to process with rank i
}
Here, nprocs is the number of processes in
MPI_COMM_WORLD.
Notes:
Lab 1 has an example of point-to-point send and receive calls in MPI, as well as MPI setup and termination, finding one's rank, etc.
The synchronization process code for the first Dining Philosophers
lab gives an example of receiving from any process, using
MPI_ANY_SOURCE.
If each message body is an MPI_INT value
v equal to
the sender's rank, then the printed value of v*10 + r
should appear as a sequence of two uninterrupted digits indicating
that message's
source and destination. For example, if process 3 sends a message
with integer body 3 to process 2, then the number 32 would be
printed, then a space.
By printing a single integer value rather than a succession of two values, we avoid the risk of one digit sequence occurring inside another.
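For concreteness, here is a minimal sketch of one way the pseudocode above might be realized in actual MPI calls, assuming blocking MPI_Send/MPI_Recv, an int message body, and an arbitrarily chosen tag of 0; the details of your program may differ:

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int r, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &r);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Receive one message from each lower-ranked process, in any order.
    for (int i = 0; i < r; i++) {
        int v;
        MPI_Recv(&v, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("%d ", v*10 + r);  // two digits: source rank, then this rank
    }
    // Send this process's rank to each higher-ranked process.
    for (int i = r+1; i < nprocs; i++)
        MPI_Send(&r, 1, MPI_INT, i, 0, MPI_COMM_WORLD);

    // ... broadcast step goes here (see below) ...

    MPI_Finalize();
    return 0;
}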
After completing its second loop (above), each process should participate in an MPI broadcast collective call. The syntax for each process's call is exactly the same, but one process will provide a value that the other processes will receive via the broadcast.
You may choose the process to serve as broadcast sender, perhaps randomly.
The MPI primes lab included a broadcast collective call.
To show how the broadcast proceeds, have the broadcast sender
print a newline before the broadcast and send its rank during
the broadcast, then have each non-sender in the broadcast print the
message r together with its own rank n,
perhaps as the single integer r*10 + n.
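As a sketch, the broadcast step might look like the following fragment, placed where the comment appears in the sketch above; the choice of rank 3 as sender is arbitrary, and the variable names are assumptions for illustration:

const int SENDER = 3;           // arbitrary choice of broadcast sender
int msg = r;                    // value to broadcast: the sender's rank
if (r == SENDER) {
    printf("\n");               // newline before the broadcast
    fflush(stdout);
}
MPI_Bcast(&msg, 1, MPI_INT, SENDER, MPI_COMM_WORLD);
if (r != SENDER)
    printf("%d ", msg*10 + r);  // sender's rank, then this rank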
Here is expected sample output from an MPI program with five processes constructed as above:
01031413040223123424
31343230
Note: This is one of many possible output orderings.
The use of MPI_ANY_SOURCE for receives increases the
potential variation in orderings.
In this example, the process with rank 3 is the broadcast sender.
Cache performance in OpenMP primes
[C]
After finishing the lab on OpenMP primes, carry out one more
performance improvement strategy: interchange the order of nesting of
the loops that mark elements of the array marked[].
Explanation: In the given algorithm, the outer loop
iterates once per prime number p, and the inner loop
iterates once per
index i of marked[]. When an OpenMP for
construct is applied to that inner loop, the OpenMP system divides the
range of that loop into subranges and assigns those subranges to
different threads for parallel computation.
Unfortunately, the marking algorithm involves accessing array elements
in "leaps": for example, if the prime is p=37, the inner
loop marks every 37th element, starting at 37² = 1369 and proceeding
along the entire length of marked[].
Some portion of this "leaping" traversal of marked[] will
be performed by each thread, for that thread's subrange of indices.
Because of this "leaping" and the fact that each thread's subrange is
long, values from marked[] will be loaded into the cache but rarely
(if ever) used more than once before they must be evicted to make room
for another memory value that is needed.
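For concreteness, here is a minimal sketch of the nesting just described, with the OpenMP for construct applied to the inner loop; it assumes marked[i] represents the integer i itself, and the names pset, marked, and n are assumptions for illustration (the lab's actual code may differ):

for (long p : pset) {               // outer loop: one pass per prime p
    #pragma omp parallel for
    for (long i = p*p; i <= n; i += p)
        marked[i] = 1;              // leaps through marked[] in strides of p
}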
In other words, it is better for cache performance to access a few memory locations repeatedly over a short time than to access many memory locations occasionally over a lengthy time. This is a matter of working with the cache to help it give us better performance.
By using the array index i to control the
outer loop of the computation, and using the prime
p to control the inner loop, each element in
the array marked[] will be used many times over a short
period of time, making for good cache performance.
DO THIS: Create a directory
hw3/omprimes6 on a Link computer, and copy
sieve.cpp and Makefile from your lab
omprimes5 directory to the new subdirectory
omprimes6. Modify the program as described above, test
that it runs correctly, then run the benchmarking script to compare
performance with other versions of the sieve program.
Write down your observations/results in a file hw3/README.
Notes:
The
outer loop will range from i=1 to the size of the array
marked[], and the inner loop will range over all primes
p (drawn from wherever you stored them, perhaps a
vector named pset).
To determine whether to mark a location marked[i],
use the remainder operator %: if the integer
corresponding to the index i is divisible by p,
assign 1 to marked[i], as in the sketch following these notes.
Once a location marked[i] is assigned the value 1,
no further primes need to be checked for that location.
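Putting these notes together, the interchanged marking loops might look like the following sketch, again assuming marked[i] represents the integer i itself and that pset holds the primes; the placement of the OpenMP construct is deliberately left open (see the next question):

for (long i = 2; i <= n; i++) {     // outer loop: one array element per pass
    for (long p : pset) {           // inner loop: check each prime
        if (i != p && i % p == 0) { // i is a multiple of a smaller prime
            marked[i] = 1;          // mark i as composite
            break;                  // no further primes needed for this i
        }
    }
}

The guard i != p is one way to avoid marking a prime as its own multiple; your lab's index-to-integer mapping may make it unnecessary.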
[H]
In the program omprimes6, is it better to parallelize
the outer loop controlled by i,
or the inner loop controlled by p? Write reasons for
your answer.
[H]
The program
omprimes3.cpp
uses OpenMP constructs to cause each thread to check all primes
against different
subranges of the array marked[]. In another approach to the
problem, one might cause each thread to check a subrange of primes
against the entire array marked[]. What are the pros and
cons of the latter approach?