Finding large primes with MPI
CS 300, Parallel and Distributed Computing (PDC)
First submission due Saturday, September 22, 2018;
complete lab due Thursday, September 27, 2018
Finding large primes with MPI
compute primes up to 10^k (k = 9?), standalone
rand(); generate 18-digit random numbers and test them for primality, reporting the lowest divisor if composite
with n=10 processes, find an 18-digit prime by testing random digit strings
improve efficiency?
Preliminary material
HW1, computing primes less than N
Pseudo-random number generation with
rand() (and srand())
Public-key encryption (e.g., RSA), using a product of large primes.
Protocol
Laboratory exercises
Login to a cluster admin node. Create a subdirectory
~/PDC/lab2, then change your default directory to your new lab2 directory.
Added instructions for Fall 2016:
Perform the following steps in order to enable multi-node MPI jobs on
cumulus.
Log into cumulus and create an SSH key (unless you have already done so).
cumulus$ ssh-keygen -t rsa
Press the ENTER key in response to all prompts.
Add your public key to your authorized_keys file.
cumulus$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
The >> above causes the shell to append to a file, creating that file if necessary; so the command above should add your public key to the end of your authorized_keys file whether that file already exists or not.
Why we need this step: Since your home directory is shared between cumulus and all the cumulusNN cluster nodes (cumulus01, cumulus02, etc.), this will enable you to log in between the cumulus head node and the cumulusNN cluster nodes without supplying a password. mpiexec will require this passwordless-SSH capability in order to start processes on other nodes.
Check the success of this step by attempting to SSH between nodes, e.g.,
cumulus$ ssh cumulus02
You should successfully log into cumulus02 without having to enter a password. If you are prompted for a password, check that the contents of your file ~/.ssh/id_rsa.pub were copied correctly to the end of your file ~/.ssh/authorized_keys.
Note: Only some cumulusNN nodes may be in service. For a list of working cumulusNN nodes, enter the following command on cumulus:
cumulus$ cat /usr/mpihosts.new
Also do this:
SSH into each of the hosts in
/usr/mpihosts.new once before proceeding.
Answer "yes" if SSH asks if you want to continue logging into an unknown computer. This insures that each of the cluster nodes gets added to your
~/.ssh/known_hostsfile, whichmpiexecwill need for passwordless SSH connections.Now try running an MPI job that works from the first lab, as follows:
cumulus$ cp ~/PDC/lab1/trap.cpp .
cumulus$ mpiCC -o trap trap.cpp
cumulus$ mpiexec -np 4 -hostfile /usr/mpihosts.new --map-by node trap
The --map-by node command-line argument causes the MPI processes to run on multiple cluster nodes in a round-robin schedule: first one process per node for each node in /usr/mpihosts.new, then a second process per node (running on second cores), etc., until all processes are assigned.
Check the success of this step by looking for normal output from a run of your program (e.g.,
trap).
Verify that MPI is actually using multiple nodes by adding the following to your trap.cpp program (or other test program). We will print the rank and host name (computer name) from each process in the MPI computation. Note that standard output from each process is forwarded to the standard output of the
mpiexec command, so this will enable us to check that multiple nodes are being used.
The system call gethostname() enables you to obtain a printable name of the computer running your program. Google "man gethostname" to find a manual page that describes gethostname(). You can add the following lines to your trap.cpp program, perhaps at the point in the code just after defining and initializing integral.
const int MAXHOSTNAME = 25;
char hostname[MAXHOSTNAME];
gethostname(hostname, MAXHOSTNAME);
Here, we define and use the constant MAXHOSTNAME for readability and maintainability (it is easy to change that value in one place and have all its uses updated throughout the code). Also add this include file near the top of trap.cpp:
#include <unistd.h>
System calls. An (operating) system call is an operation that is carried out by the operating system kernel, i.e., the code that is loaded on your computer when you boot it, then runs continually until you power your system down (or suspend it).
gethostname() is a Linux system call; Windows would use a differently named system call for the Windows kernel.
We would like to use a C++ statement such as the following to print out process information:
cout << "Rank " << my_rank << " (" << hostname << ")" << endl;However, the output from multiple processes might get garbled, because each operator<<is actually a function call in C++, and one process series of<<calls might not finish before another process's<<calls start. (This is an example of a race condition, in which the correct behavior of a computation depends on timing. We will study race conditions later.)To avoid overlapping garbled output, we will pack our desired output into a string, then use
cout to print that string in a single call to <<. Add this code to your program trap.cpp just after your gethostname() call:
std::stringstream ss;
ss << "Rank " << my_rank << " (" << hostname << ")" << endl;
cout << ss.str();
Also add the following #include near the beginning of trap.cpp:
#include <sstream>
See the cplusplus.com reference (used for Software Design) for specs of the standard stringstream class.
Recompile and test your program, using the mpiexec arguments above. You should see the usual trap output, plus four lines such as
Rank 1 (cumulus04)
Note: Those lines printed by MPI processes may occur in any order, and ordinarily will change order each time you run the MPI program, because there's no enforced scheduling order among these MPI processes.
(Followup on Lab1.) The C++ library for MPI is now deprecated, meaning that it is no longer supported in recent versions of the MPI software. Even though it is still part of the version we are using, we would rather invest our efforts in technologies that are likely to remain relevant. Also, you may already have noticed that the documentation for the C MPI library is better than the documentation for the C++ MPI library. Note that the C-language bindings can be used in C++ programs.
Copy
trap.cpp (or a later version of the C++ trap code) from your lab1 directory to your lab2 directory, and modify it to use the C-language MPI library (as found in trap.c) instead of the C++ MPI library. You can still use mpiCC to compile your C++ code. Test the new version using mpiexec with SLURM, and verify that it works.
Note: We will use the C-language MPI library in this and future labs.
More features of SLURM batch scripts. Create a simple MPI program
try.cpp that prints out a one-line message including the MPI rank. Compile your program using mpiCC, and create a script try.sh containing the following lines to run your program.
#!/bin/sh
#SBATCH -o try.out -e try.err
#SBATCH -n 4
mpiexec -x LD_LIBRARY_PATH=/usr/lib/openmpi/lib try
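For reference, try.cpp itself can be very short. Here is a minimal sketch (the exact message wording is up to you):
#include <iostream>
#include <sstream>
#include <mpi.h>
using namespace std;

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int my_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  // pack the output into one string so lines from different processes don't interleave
  stringstream ss;
  ss << "Hello from rank " << my_rank << endl;
  cout << ss.str();
  MPI_Finalize();
  return 0;
}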
You can submit this script using the command
$ sbatch try.sh
This script demonstrates additional features of SLURM job scripts:
Shell-script comments that begin with
#SBATCH can be used to provide command-line arguments for the SLURM job submission. In particular, the two #SBATCH lines in this script specify three command-line arguments with their values.
You can override script parameters at the
sbatch command line. For example,
$ sbatch -n 6 try.sh
causes the program to run with six processes instead of four.
On a cluster admin node (e.g., cumulus), copy your homework program primes.cpp into your subdirectory ~/PDC/lab2. For this part, have primes.cpp produce a count of primes between 2 and N=1000.
Then, compile
primes.cpp and run it using mpiexec, in order to produce answers from a 1-processor run of primes.cpp for "3-digit" primes (up to 1000).
Note. This program has no MPI calls, so it can be compiled using
g++. You can use mpiexec even though the program primes.cpp does not actually need to use MPI. We could use mpiexec with other executables to make multiple simultaneous runs of a program conveniently. However, in this case, we are only using one node instead of many.
After testing your code in this way, create a
commit containing it.
cumulus$ git add primes.cpp
cumulus$ git commit -m "lab2: primes.cpp, copied from HW1"
How many primes can we compute this way? Try computing 4-digit primes (up to
N=10000), 5-digit primes (up to N=100000), etc., to find the largest number k≤9 such that a cluster node can compute k-digit primes.
Carry out this computation in parallel using MPI, as follows.
Modify
primes.cpp to call MPI_Init() at the beginning of main() and MPI_Finalize() at the end. Use 10^(r+4) for the size of the sieve array, where r is the MPI rank. You can compute the number 10^(r+4) using a loop, perhaps located in a helper function ipow(base, exp) (a sketch of such a helper appears below). (This will require the mpi.h header for the MPI calls.) Test this version by using mpiexec with -np 1 to request one processor (note that this test also performs the check for 4-digit primes). Be sure to use mpiCC to compile your program, and use the -o and -e arguments for sbatch to capture standard output and standard error.
Now change the array size to be 10^(r+5) and use
mpiexec -np 5 to request five processors, thus checking for 5-, 6-, ..., and 9-digit primes in parallel.
Observe that the output from all the processors is merged in the output file. (If you are only seeing the output for 5-digit primes, try printing less output, e.g., just a count of how many primes were found.)
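For the helper function ipow(base, exp) mentioned above, something along these lines would work (a minimal sketch; the name and loop structure are only suggestions):
// compute base^exp for small non-negative integer exponents
long long ipow(long long base, int exp) {
  long long result = 1;
  for (int i = 0; i < exp; i++)
    result *= base;
  return result;
}
The sieve size for MPI rank r could then be computed as ipow(10, r + 4).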
Also, any error output from the processes will be merged into your error file. If any of your processes generates an error (probably from trying to allocate too much memory for the array), the entire job (all processes) will be terminated immediately. As a result, you may see correct output from some processes, error output from one process, and nothing from the other processes.
If an error arises, reduce the number of processes and resubmit the job, until you can determine the largest value of k≤9 for which array allocation succeeds.
Optional: Our sieve array uses a whole byte to represent what is essentially one bit of information (either 0 or 1). Can you achieve a greater value of k if you use a different strategy for representing the sieve array?
Now that you have determined the largest integer k≤9 such that k-digit primes can be computed using our sieve method, create a file
/home/user/PDC/lab2/prime.list consisting of the primes between 2 and 10^k, one per line.
Note: Do not
commit prime.list to git -- we want source code, not large data files, in your stogit repository.
If you want to prevent prime.list from appearing as an "untracked file" in git status commands, add the line
lab2/prime.list
to a file ~/PDC/.gitignore.
Use an
ofstream object to create your file. Here's an example program that prints a string to a file:
#include <iostream>
#include <fstream>
using namespace std;

int main () {
  ofstream outfile ("/home/user/PDC/lab2/hello.out");
  outfile << "Hello, world!" << endl;
  outfile.close();
}
Note: It's possible for a cluster node to create a file in your lab2 directory because your admin-node's home directory is shared with all the cluster nodes on that cluster (via NFS).
End of first submission stage.
Submit your work up to this step (or later) by Saturday, September 22, 2018 as follows:
Use
git commit --amend
to update your most recent commit message to add the string
submit lab2: first submit
to the existing message. Modify that added string if you have any clarifications about this submission (e.g., submit lab2: items 1-6 only).
Pull/push that commit.
cumulus$ git pull origin master
cumulus$ git push origin master
Write a new program
big-primes.cpp that randomly generates 2k-digit numbers and checks whether they are prime, using your file prime.list.
This first version of
big-primes.cpp does not need MPI. (You can compile with g++.) However, we can still run the program using mpiexec in order potentially to execute multiple copies at once. Invoke mpiexec with -np 1 for one processor, for now.
Note: Our general strategy with 1-processor MPI jobs is to make sure our jobs run in the simplified situation of a single processor, before adding in multiple processors. This incremental approach simplifies debugging, and distributed programs are notoriously difficult to debug.
Your program should begin by reading the primes in
prime.list and storing them in an array of long. Open that file for input as an ifstream object. You can use the operator >> to read integers from that file stream. You may hard-code the number of primes in prime.list for the array size (the first column in the output of the shell command $ wc prime.list).
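A minimal sketch of this input step, assuming a hypothetical count NPRIMES taken from wc and the file path used above:
#include <fstream>
#include <iostream>
using namespace std;

// hypothetical value -- substitute the first column reported by: wc prime.list
const int NPRIMES = 1000000;

int main() {
  long* primes = new long[NPRIMES];     // allocate on the heap; the list may be large
  ifstream infile("/home/user/PDC/lab2/prime.list");
  for (int i = 0; i < NPRIMES; i++)
    infile >> primes[i];
  infile.close();
  cout << "last prime read: " << primes[NPRIMES-1] << endl;
  delete [] primes;
  return 0;
}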
Use the function rand() in the C++ library <cstdlib> to generate pseudo-random numbers. For example, the following short program prints five random numbers between 0 and 999.
#include <iostream>
#include <cstdlib>
using namespace std;

static const int N = 1000;

int main() {
  for (int i = 0; i < 5; i++)
    cout << (rand() % N) << endl;
}
Note: The C++ operator
% returns the remainder from integer division of its operands. For example, 103 % 5 returns the value 3.
Also note that the keyword static for a global variable limits the scope of that global to the current file big-primes.cpp.
The function
rand() does not generate arbitrarily large pseudo-random numbers. The maximum is the value of RAND_MAX (defined in <cstdlib>), which you can determine for your cluster nodes by writing a short program to print that value.
Create and run a simple MPI test program (with one processor) to determine the value of RAND_MAX, so you will get RAND_MAX's value for a cluster node instead of the cluster's admin machine.
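Such a test program can be very short. A sketch (submit it with sbatch/mpiexec so that it runs on a cluster node):
#include <iostream>
#include <cstdlib>
using namespace std;

int main() {
  cout << "RAND_MAX = " << RAND_MAX << endl;
  return 0;
}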
Since the number of digits in RAND_MAX is probably less than 2k, you will have to fabricate long random numbers from multiple return values from rand(). For example, if RAND_MAX exceeds 10^6 and 18-digit random numbers are desired, you can create one using three calls to rand() as follows.
const long long M = 1000000;   // one million
long long bigrand = (rand() % M);
bigrand += (rand() % M) * M;
bigrand += (rand() % M) * M * M;
// bigrand is between 0 and M^3
Here, we use the type
long long for the variable bigrand, to ensure that we have a 64-bit (8-byte) integer, in order to contain an 18-digit positive integer. (The size of a data type for a particular version of a C++ compiler can be checked using the sizeof() operator. You can check for yourself whether the type long has 64-bit integers.)
Note: We limited k to be no greater than 9 so that an integer with 2k digits can be contained in a long long variable. long long variables can hold values between -2^63 and +2^63-1, where 2^63 ≈ 9.2×10^18.
Note 2: The
gcc/g++ compiler supports a nonstandard type identifier __int128 for 128-bit integers on computers that support them. If you find that k = 9 succeeds on the cluster you're using, you can check whether __int128 is available there by using a test program that simply prints the value of sizeof(__int128). Submit that test program using mpiexec, in order to see the results for a worker node on that cluster, not the head node.
To determine whether a large number
bigrand is a prime, we will repeatedly check known primes that are less than sqrt(bigrand) to see if they divide bigrand. This means computing bigrand % p for all primes p in your array of primes. If bigrand % p returns 0 for some prime p, then bigrand is not a prime. If bigrand % p is non-zero for all primes p in your list (which contains all primes of up to 9 digits), then bigrand must be a prime, because the square root of an 18-digit number contains at most 9 digits.
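One possible shape for this divisibility test, as a sketch (the function name and parameters are only suggestions, and assume the primes array read earlier):
// Return 0 if bigrand is prime; otherwise return its smallest prime divisor.
// Assumes primes[] holds all primes up to 10^k in increasing order.
long long smallestDivisor(long long bigrand, long* primes, int nprimes) {
  for (int i = 0; i < nprimes; i++) {
    long long p = primes[i];
    if (p * p > bigrand)       // no divisor found at or below sqrt(bigrand)
      break;
    if (bigrand % p == 0)
      return p;                // composite: p is the lowest prime divisor
  }
  return 0;                    // no prime divisor found, so bigrand is prime
}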
Have your program print a message indicating the value of bigrand and whether it was a prime; if it was not prime, also print the prime divisor you found.
Run your program to check one random number with (up to) 2k digits for primality, using one cluster node.
When you know you have working code, create a
commit containing it.
cumulus$ git add big-primes.cpp
cumulus$ git commit -m "lab2: big-primes.cpp, randomly generate numbers and test them for primality"
Make a quick run of your current version of
big-primes.cpp, except substituting the value 19 (a prime) for bigrand just before checking for primality, to ensure that your program correctly determines primes. Run this on one processor.
If you need to make corrections in
big-primes.cpp, fix the program, thencommitwith an appropriate message.Now, adapt
big-primes.cppto check 2k-digit random numbers until it finds at least one prime, using 8 processes in parallel.Add calls to MPI library functions
MPI_Init(), MPI_Comm_rank(), and MPI_Finalize() so each copy of the big-primes executable can determine its rank.
In order to ensure that each process checks different random numbers, include a call to
srand(r); before your program makes any calls to rand(), where r is a process's rank. The function srand() assigns a random number seed, leading to different sequences of pseudo-random numbers for different ranks.
Now add a loop for repeatedly generating a value for
bigrand, checking it for primality, printing results to standard output, then communicating the results with other processes, until some process finds a prime.
Use point-to-point communication to share with the processes whether a prime has been found. Here is a strategy for the processes to communicate results with each other and determine when to stop (a code sketch of this exchange appears after the description). Just after each process checks a value for
bigrand for primality and prints that result to standard output, insert the following communication code.
The head node (rank 0) sends messages to every non-head node (rank>0) containing one
MPI_INT value, either a 0 (for false) or a 1 (for true), indicating whether that head node found a prime. (This will require a loop of MPI_Send() calls.)
Each non-head node NH should receive that
MPI_INT value from the head node, and should terminate immediately if the received value is 1. Otherwise, that non-head node NH should send an MPI_INT value 0 or 1 to the head node, indicating whether NH found a prime, then NH should continue to the next iteration of its loop (find the next bigrand, test that bigrand, etc.).
If the head node did not find a prime on its own, then the head node should now receive from every non-head node (in a loop).
If the head node receives a 1 from one of the non-head nodes, the head node should send a 1 to all non-head nodes, then the head node should terminate.
If instead the head node receives a 0 from all of the non-head nodes, then the head node should continue with the next iteration of its loop (make another bigrand, test that bigrand, etc.).
You can debug your program by generating, say, 6-digit random numbers, then switch back to 2k-digit random numbers once you are convinced it works.
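Here is a sketch of one round of the exchange described above, to be placed at the end of each loop iteration. It assumes my_rank and comm_sz (from MPI_Comm_rank() and MPI_Comm_size()) and an int foundPrime that is 1 if this process's most recent bigrand was prime; none of these names are required, and the details are only one possibility:
int stop = 0;                       // becomes 1 when this process should quit looping
if (my_rank == 0) {
  // head: tell every worker whether the head just found a prime
  for (int dest = 1; dest < comm_sz; dest++)
    MPI_Send(&foundPrime, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
  if (foundPrime) {
    stop = 1;
  } else {
    // head did not find one; see whether any worker did
    int anyFound = 0, workerFound;
    for (int src = 1; src < comm_sz; src++) {
      MPI_Recv(&workerFound, 1, MPI_INT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      if (workerFound) anyFound = 1;
    }
    if (anyFound) {
      // inform every worker (they will see this at the top of their next round), then quit
      for (int dest = 1; dest < comm_sz; dest++)
        MPI_Send(&anyFound, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
      stop = 1;
    }
  }
} else {
  // worker: first learn whether the head already found a prime
  int headFound;
  MPI_Recv(&headFound, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  if (headFound)
    stop = 1;
  else
    // report this worker's own result back to the head
    MPI_Send(&foundPrime, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
}
// the enclosing loop exits when stop == 1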
Note: One way to accomplish this is to encapsulate your code for generating a large random number
bigrand in a function getBigrand() with return type long long int, and to write a second function getSmallrand() that also has return type long long int, but which returns a shorter random number (e.g., returns rand()%M). This makes it convenient to switch the initialization of bigrand between long and short random numbers.
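For instance, a sketch of the two helpers (the names come from the note above; M is the constant from the earlier example, and <cstdlib> must be included for rand()):
long long getBigrand() {
  // build an 18-digit-scale value from three smaller pieces
  long long bigrand = (rand() % M);
  bigrand += (rand() % M) * M;
  bigrand += (rand() % M) * M * M;
  return bigrand;
}

long long getSmallrand() {
  return rand() % M;               // a shorter value, handy for debugging
}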
Note: This protocol for communicating results among the processors is correct because:
Every process terminates as soon as the head node informs all the processes that a prime has been found. Otherwise, every process continues to look for new primes.
If any non-head process finds a prime, it will report it to the head node, and the head node will inform all processes.
If the head node finds a prime, then it will inform all processes.
When you know you have working code, create a
commit containing it.
cumulus$ git add big-primes.cpp
cumulus$ git commit -m "lab2: big-primes.cpp, find a 2k-digit prime with point-to-point communication"
Patterns in Finding large primes with MPI.
This lab exhibits some of the same patterns as we saw in the first lab.
Data Decomposition is present in the MPI version of
big-primes.cpp, where generated random numbers constitute the data being split up among processors. We also see Data Decomposition when determining k, since we tested array lengths for five nodes in parallel. In that case, we used parallelism more for convenience than for speedup.
We used the Loop Parallel pattern when we used MPI to parallelize the loop in
big-primes.cpp. Unlike the loop in the trapezoid lab, we are not dividing up the work of computing over a known range of values. Instead, we are using MPI parallelism to check multiple random values at once, in order to shorten the search time for a random large prime.
Message Passing is present in the MPI version of
big-primes.cpp, implemented in calls to MPI_Send() and MPI_Recv().
Each of these MPI labs (including the first lab) also serves as an example of the following pattern.
In the SPMD (Single Program, Multiple Data) pattern, a single program is executed in parallel with multiple data values or data sets, making the computation easy to manage. In these MPI programs, literally the same program is executed on multiple computers, and even though the head process executes a variant set of instructions in order to assemble answers, SPMD is well represented in these two labs.
In the first lab, all nodes carry out the core computation of adding areas of trapezoids.
In this lab, when determining k, all nodes attempt prime computations at various scales (numbers of digits).
In the MPI version of
big-primes.cpp, all nodes check random digit strings for primality.
The program
big-primes.cpp also indicates some patterns we haven't encountered before.
In the Master-Worker pattern, one thread of computation (the master) assigns work to other threads (the workers). In
big-primes, the head node represents the master, and sending a message containing 0 to a non-head node constitutes assigning work to that node (which represents a worker), namely, to test another random candidate. Often, as in this case, the master also collects results from the workers.
The Scatter-Gather pattern is comparable to Master-Worker, but specifically refers to a master sending values to workers (scatter) and then receiving results from them all (gather). Scatter-Gather would more specifically describe the protocol in
big-primes.cpp than Master-Worker, except that the master in the Scatter-Gather pattern ordinarily sends different values to each worker, whereas this master sends the same value to each worker.
SPMD and Master-Worker are "mid-level" patterns, along with Loop Parallel. As a variant of Master-Worker, the Scatter-Gather pattern would also be at a middle level.
Create a second version
big-primes2.cpp of your program that uses collective communication to share whether a large prime has been found. Here is a protocol:
After every process checks one value, a reduce operation is performed to deliver the number of primes found to the head process. Each process transmits 1 if it found a prime and 0 otherwise.
The head uses a broadcast to inform all processes whether a prime was found, so they can know whether to try again or terminate.
Note:
Look up the arguments for a
MPI_Bcast using online documentation. You could use a Google search for "MPI_Bcast" to find documentation for that library call. You may see references to particular implementations of MPI, such as OpenMPI (which we use on our main clusters) or MPICH (used on Raspberry Pi clusters and at some other sites' main clusters). We would expect the arguments for MPI calls to be identical across standard implementations of MPI.
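A sketch of one round of this collective version, assuming my_rank and an int foundPrime as in the earlier sketch (MPI_Reduce() and MPI_Bcast() are the standard collective calls):
int totalFound = 0;
// sum every process's flag; the total arrives in totalFound at rank 0 only
MPI_Reduce(&foundPrime, &totalFound, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

// the head then tells every process whether at least one prime was found
int done = (my_rank == 0 && totalFound > 0) ? 1 : 0;
MPI_Bcast(&done, 1, MPI_INT, 0, MPI_COMM_WORLD);
// every process, including the head, exits its loop when done == 1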
When you know you have working code, create a
commit containing it.
cumulus$ git add big-primes2.cpp
cumulus$ git commit -m "lab2: big-primes2.cpp, find a 2k-digit prime with reduce and broadcast"
Some communication options.
MPI includes calls
MPI_Scatter() and MPI_Gather() that support the Scatter-Gather pattern. However, MPI_Bcast() and MPI_Reduce() (both considered to be forms of Collective Communication) are a better fit for this implementation, since a single value is sent to all workers, and there is no need to distinguish between the individual responses received from those workers.
If the primes found by the workers (non-head nodes) were being sent back to the master (head node) instead of just a number 0 or 1, then
MPI_Gather() would be a better fit, although we might still use MPI_Bcast() to inform workers that they should continue the search.
If in addition the head node were to distribute candidate digit strings to the workers instead of those workers generating digit strings on their own, then the computation would fully represent a complete Scatter-Gather pattern.
We count Scatter-Gather as a mid-level pattern because the two-stage implementation pattern of scattering followed by gathering is further removed from the hardware and more closely related to a (high-level) Data Decomposition strategy for a particular application (e.g., testing multiple candidate digit sequences for primality in parallel) than low-level patterns such as Collective Communication, which more closely reflect the execution of operations in hardware (e.g., sending all nodes a message via broadcast, or combining values from each node through reduction).
Create a version
big-primes3.cpp of your program that uses the collective call MPI_Gather() to communicate the actual prime(s) found, if any. Each process participating in the MPI_Gather() operation either provides the prime it found or the integer 0, indicating no prime was found.
Notes:
Look up the use of
MPI_Gather() in online resources.
The first argument
sendbuf in a call of MPI_Gather() is the address of a value to transmit to the head node.
A later argument
recvbuf is the address of an array of locations, one per process, each capable of holding one value from a node. The MPI_Gather() operation fills that array with values for the head node only. The index of a value in that array corresponds to the rank of the process providing that value.
For our application, we want
MPI_Gather() to transmit 64-bit long long values, namely, bigrand values that turn out to be prime. Investigate how to transmit 64-bit values using MPI data types.
MPI data types include
MPI_INT, MPI_LONG, MPI_FLOAT, etc. Some implementations of MPI include MPI_LONG_LONG_INT. A web search can be used to find a more complete list of options.
We want to communicate an 8-byte integer. You can use the library function
MPI_Type_size() to determine the number of bytes used for various MPI types. MPI_Type_size()'s second argument is the address of an integer variable (be sure to allocate it) that receives the number of bytes in its first argument. The type MPI_LONG_LONG_INT (if it's supported in the current MPI implementation) might be a good one to try first.
Create and run a simple MPI test program for this work (a sketch appears below), so you will get values on cluster nodes instead of the admin machine (such as
cumulus).
If no predefined MPI data type has length 8 bytes, then you will have to transmit 8-byte large prime values using two 4-byte integers. In that case, you can use an array of two 4-byte integers for the first argument
sendbuf of the MPI_Gather() call, the value 2 for the second argument sendcount, and a 4-byte MPI integer type (MPI_LONG?) for the third argument sendtype. You will have to break your 8-byte values into the two 4-byte array elements, perhaps using the % and >> operators.
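The test program mentioned above can be quite short. A sketch, assuming MPI_LONG_LONG_INT is defined by the MPI implementation you are using:
#include <iostream>
#include <mpi.h>
using namespace std;

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int size;                                   // receives the number of bytes in the MPI type
  MPI_Type_size(MPI_LONG_LONG_INT, &size);
  cout << "MPI_LONG_LONG_INT uses " << size << " bytes" << endl;
  MPI_Finalize();
  return 0;
}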
Use a copy of
big-primes2.cpp as the starting point for big-primes3.cpp. In this version, we will retain the MPI_Bcast() strategy for the head process to send requests to the other processes, and replace MPI_Reduce() with an MPI_Gather() approach for receiving results from processes.
In the head process, you can use a loop to determine whether any non-zero values were delivered by the head process itself or other processes. (This loop, which includes the head process, constitutes a simpler approach than the protocol used before for determining whether the head or other processes had found a prime.)
In most MPI implementations, the array argument
recvbuf that gathers results only needs to be allocated in the head process, which is the only process that can access gathered values in that array. Use dynamic allocation for that array, since its length depends on the number of MPI processes, known only at runtime.
If you had to use an array of two integer types for the
sendbuf value, then include that factor 2 when computing the number of bytes to allocate dynamically for recvbuf.
Unlike previous implementations of
big-primes, the head process will now know the value of the first large prime(s) found. Have big-primes3.cpp print a report of all large primes found in the first MPI_Gather() call that has nonzero values in recvbuf.
As before, you can use shorter random numbers to debug your program, before applying it with 2k-digit random numbers.
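One possible shape for the gather step, as a sketch (assumes my_rank, comm_sz, an 8-byte type such as MPI_LONG_LONG_INT, and a long long myPrime that holds the prime this process found or 0):
long long* gathered = NULL;
if (my_rank == 0)
  gathered = new long long[comm_sz];     // recvbuf is needed only at the head

MPI_Gather(&myPrime, 1, MPI_LONG_LONG_INT,
           gathered, 1, MPI_LONG_LONG_INT, 0, MPI_COMM_WORLD);

int done = 0;
if (my_rank == 0) {
  // index i in the gathered array is the rank that supplied that value
  for (int i = 0; i < comm_sz; i++)
    if (gathered[i] != 0) {
      cout << "Rank " << i << " found prime " << gathered[i] << endl;
      done = 1;
    }
  delete [] gathered;
}
// as in big-primes2, broadcast done so every process knows whether to stop
MPI_Bcast(&done, 1, MPI_INT, 0, MPI_COMM_WORLD);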
When you know you have working code, create a
commit containing it.
cumulus$ git add big-primes3.cpp
cumulus$ git commit -m "lab2: big-primes3.cpp, with broadcast and gather"
Create a version
big-primes4.cpp of your program that uses the collective call MPI_Scatter() to distribute candidate 2k-digit random numbers to test for primality.
Notes:
Look up the use of
MPI_Scatter() in online resources.
For
MPI_Scatter(), the first argument sendbuf is the array of values to send out; only the root process's value for this argument will have an effect.
Unlike
MPI_Gather(), for this call MPI_Scatter() the argument recvbuf points to a single value (when the receive count is 1), and all processes use that argument recvbuf to receive the values that the root process supplied through sendbuf.
Use a copy of
big-primes3.cpp as a starting point for big-primes4.cpp.
Instead of each process generating its own random numbers, in
big-primes4.cpp only the head process generates random numbers to test for primality. Have the head process generate enough new random numbers to fill the array argument sendbuf for its call to MPI_Scatter().
Also, remove the random-number generation from the code area for non-head processes, and have them obtain their random numbers from the
recvbuf argument in their calls to MPI_Scatter().
In the head process, use the
MPI_Scatter() call instead of MPI_Bcast() to initiate new rounds of primality testing.
Use shorter random numbers to debug your program, then run it with 2k-digit random numbers. Note: We expect to discover different prime(s) than before, since all random numbers are generated from the random-number seed for the head process only, instead of using multiple random-number seeds for all processes.
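A sketch of the scatter step (assumes my_rank, comm_sz, the hypothetical getBigrand() helper from earlier, and that only the head fills the send array):
long long* candidates = NULL;
if (my_rank == 0) {
  candidates = new long long[comm_sz];
  for (int i = 0; i < comm_sz; i++)
    candidates[i] = getBigrand();        // only the head calls rand()
}

long long bigrand;                       // each process receives its own candidate here
MPI_Scatter(candidates, 1, MPI_LONG_LONG_INT,
            &bigrand, 1, MPI_LONG_LONG_INT, 0, MPI_COMM_WORLD);
if (my_rank == 0)
  delete [] candidates;

// every process (head included) now tests its bigrand for primality, and the
// MPI_Gather() step from big-primes3 collects the results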
When you know you have working code, create a
commit containing it.
cumulus$ git add big-primes4.cpp
cumulus$ git commit -m "lab2: big-primes4.cpp, with scatter and gather"
[Optional.] Explore other variants of the
big-primes code, such as:
Write a version that produces at least P big primes. You could accumulate discovered primes in the head process until P of them are collected.
Does it make a difference if more than one random number is sent to processes at a time? Observe that an array of ct random numbers could be received by each process using the
recvbuf argument for MPI_Scatter(), and an array of ct results (primes or zeroes) could be sent from each process using the sendbuf argument for MPI_Gather(). In that approach, make sure your head-only arrays (sendbuf for MPI_Scatter(), and recvbuf for MPI_Gather()) are expanded by a factor of ct (i.e., can hold ct*N large integer values).
Do you have an idea of your own?
If you try an optional idea:
create a file
README in your lab2 directory that lists the filenames of your optional source code, and describes what they seek to do; and
commit your README and your optional source files, using the commit messages to indicate progress towards your goals for each program.
Choosing which loops to parallelize, parallelization overhead, and cache usage.
The final version of the
big-primes program applies the Loop Parallel pattern at the head node, when iterating over code that represents the Scatter-Gather pattern. We did not parallelize other loops in that code.
The loop in which the master generates multiple candidate random digit sequences for a worker is not worth parallelizing, because the overhead of sending the small task of generating one of these random digit strings to another processor (even another core on the same computer) is far greater than simply carrying out that task on the head itself.
The loop within a worker for checking whether any of that worker's candidate random digit sequences is divisible by a particular known prime does not merit parallelization, also because the parallelization overhead far exceeds the cost of the division operation that might be carried out in parallel.
The loop within a worker that iterates through all primes might conceivably benefit from parallelization. However, small primes are far more likely to divide a random large number than large primes (e.g., 2 divides half of large random numbers, whereas the prime 99999989 divides only 1/99999989 of all large random numbers). It is unclear that the savings from this parallelization for those rare instances when small primes do not eliminate a random large number would exceed the overhead of parallelization.
On the other hand, the computation time required to find a dozen large primes is much less than twelve times the computation required to find a single large prime, in part because of cache usage.
Deliverables
All of your code for this assignment should already be
contained in commits. Modify the most recent
commit message to indicate that you are submitting
the completed assignment.
cumulus$ git commit --amend
Add the following to the latest
commit message.
lab2: complete
If your assignment is not complete, indicate your progress instead, e.g., lab2: items 1-5 complete, item 6 partial. You can make later commits to submit updates.
Finally, pull/push your commits in
the usual way.
cumulus$ git pull origin master
cumulus$ git push origin master
Submit your work using
git on the cluster head
node you used. Include the cluster name in your commit message, e.g.,
mist$ cd ~/PDC
mist$ git pull origin master
mist$ git add -A
mist$ git commit -m "Lab 2 submit (mist)"
mist$ git push origin master
Also, fill out this form to report on your work.
See lab1 Deliverables
section if you need to set up the cluster you used for
git.
If you did your work on two different clusters, submit work from both of them using one of the strategies in lab1 Deliverables.
This lab is due by Sunday, September 25, 2016.
Files: primes.cpp big-primes.cpp big-primes2.cpp big-primes3.cpp big-primes4.cpp