Homework
CS 300, Parallel and Distributed Computing (PDC)
Homework 3, due 10/10/2016

MPI review
[C]
Write an MPI program that exchanges messages between all pairs of nodes in MPI_COMM_WORLD using point-to-point communication, then broadcasts one message using collective communication. A process with rank r should receive a message from each process with rank less than r, then send a message to each process with rank greater than r. This can be accomplished by having every process execute two loops, according to the following pseudocode:

    for (int i = 0; i < r; i++) {
        receive a message from any process
        print the message received and the rank r
    }
    for (int i = r+1; i < nprocs; i++)
        send a message containing r to process with rank i

Here, nprocs is the number of processes in MPI_COMM_WORLD.

Notes:
- Lab 1 has an example of point-to-point send and receive calls in MPI, as well as MPI setup and termination, finding one's rank, etc. The synchronization code for the first Dining Philosophers lab gives an example of receiving from any process, using MPI_ANY_SOURCE.

- If each message body is an MPI_INT value v equal to the sender's rank, then the printed value of v*10 + r should appear as a sequence of two uninterrupted digits indicating that message's source and destination. For example, if process 3 sends a message with integer body 3 to process 2, then the number 32 would be printed, followed by a space. By printing a single integer value rather than a succession of two values, we avoid the risk of one digit sequence occurring inside another.

- After completing its second loop (above), each process should participate in an MPI broadcast collective call. The syntax for each process's call is exactly the same, but one process will provide a value that the other processes will receive via the broadcast. You may choose the process to serve as broadcast sender, perhaps randomly.

- The MPI primes lab included a broadcast collective call.

- To show how the broadcast proceeds, have the broadcast sender print a newline before the broadcast and send its rank during the broadcast, then have each non-sender in the broadcast print the received message r together with its own rank n, perhaps as the single integer r*10 + n.

- Here is expected sample output from an MPI program with five processes constructed as above:

    01031413040223123424 31343230

  Note: This is one of many possible output orderings. The use of MPI_ANY_SOURCE for receives increases the potential variation in orderings. In this example, the process with rank 3 is the broadcast sender.

Cache performance in OpenMP primes
[C]
After finishing the lab on OpenMP primes, carry out one more performance improvement strategy: interchange the order of nesting of the loops that mark elements of the array marked[].

Explanation: In the given algorithm, the outer loop iterates once per prime number p, and the inner loop iterates once per index i of marked[]. When an OpenMP for construct is applied to that inner loop, the OpenMP system divides the range of that for loop into subranges, and assigns each of those subranges to various threads for parallel computation. Unfortunately, the marking algorithm involves accessing array elements in "leaps": for example, if the prime is p = 37, the inner loop marks every 37th element, starting at 37^2 = 1369 and proceeding along the entire length of marked[]. Some portion of this "leaping" traversal of marked[] will be performed by each thread, for that thread's subrange of indices. Because of this "leaping" and the fact that the subrange is long, values for memory locations in marked[] will be loaded into the cache but rarely (if ever) used more than once before they must be replaced by another memory value that is needed.

In other words, it is better for cache performance to access a few memory locations repeatedly over a short time than to access many memory locations occasionally over a lengthy time. This is a matter of working with the cache to help it give us better performance. By using the array index i to control the outer loop of the computation, and using the prime p to control the inner loop, each element of the array marked[] will be used many times over a short period of time, making for good cache performance.

DO THIS: Create a directory hw/omprimes6 on a Link computer, and copy sieve.cpp and Makefile from your lab omprimes5 directory to the new subdirectory omprimes6. Modify the program as described above, test that it runs correctly, then run the benchmarking script to compare performance with other versions of the sieve program. Write down your observations/results in a file hw/README.

Notes:
- The outer loop will range from i = 1 to the size of the array marked[], and the inner loop will range over all primes p (drawn from wherever you stored them, perhaps located in a vector named pset).

- To determine whether to mark a location marked[i], use the remainder operator %: if the integer corresponding to the index i is divisible by p, assign 1 to marked[i].

- Once a location marked[i] is assigned the value 1, no further primes need to be checked for that location.
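The interchanged loops might look something like the following sketch. The names marked and pset follow the conventions described above; the function boundaries, the early exit at p*p > i (which also keeps primes themselves unmarked), and the bare parallel for are assumptions, and your sieve.cpp will differ in detail.

```cpp
#include <vector>

// Sketch: mark composites with the loop order interchanged, so the
// OpenMP-parallelized outer loop walks the array index i while the
// inner loop walks the primes p.  Assumes pset holds all primes up
// to sqrt(n), in increasing order.
std::vector<char> mark_interchanged(int n, const std::vector<int>& pset) {
    std::vector<char> marked(n + 1, 0);   // marked[i] == 1 means i is composite
    #pragma omp parallel for
    for (int i = 2; i <= n; i++) {        // outer loop: array index
        for (int p : pset) {              // inner loop: primes
            if (p * p > i)                // no prime factor needed beyond sqrt(i)
                break;
            if (i % p == 0) {             // i is divisible by p: composite
                marked[i] = 1;
                break;                    // once marked, no further primes needed
            }
        }
    }
    return marked;
}
```

Each marked[i] is now the focus of all of its inner-loop iterations in succession, which is exactly the locality improvement the explanation above describes.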
[H]
In the program omprimes6, is it better to parallelize the outer loop controlled by i, or the inner loop controlled by p? Write reasons for your answer.

[H]
The program omprimes3.cpp uses OpenMP constructs to cause each thread to check all primes against different subranges of the array marked[]. In another approach to the problem, one might cause each thread to check a subrange of primes against the entire array marked[]. What are the pros and cons of the latter approach?