CS 300, Parallel and Distributed Computing (PDC)
First submission due Saturday September 22, 2018;
complete lab due Thursday September 27, 2018

Finding large primes with MPI

  • compute primes up to 10^k (k = 9?), standalone

  • rand(); generate 18-digit random numbers and test them for primality, reporting the lowest divisor if composite

  • with n=10 processes, find an 18-digit prime by testing random digit strings

  • improve efficiency?

Preliminary material

  • HW1, computing primes less than N

  • Pseudo-random number generation with rand() (srand())

  • Public-key encryption (e.g., RSA), using a product of large primes.


  • Protocol

Laboratory exercises

  1. Log in to a cluster admin node. Create a subdirectory ~/PDC/lab2, then change your default directory to your new lab2 directory.

    Added instructions for Fall 2016:

    Perform the following steps in order to enable multi-node MPI jobs on cumulus

    1. Log into cumulus and create an SSH key (unless you have already done so).

      cumulus$  ssh-keygen -t rsa
      
      Press ENTER key in response to all prompts.

    2. Add your public key to your authorized_keys file.

      cumulus$  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
      
      The >> above causes the shell to append to a file, creating that file if necessary; so the command above should add your public key to the end of your authorized_keys file whether that file already exists or not.

      Why we need this step: Since your home directory is shared between cumulus and all the cumulusNN cluster nodes (cumulus01, cumulus02, etc.), this will enable you to log in between the cumulus head node and the cumulusNN cluster nodes without supplying a password. mpiexec will require this passwordless-SSH capability in order to start processes on other nodes.

      Check the success of this step by attempting to SSH between nodes, e.g.,

      cumulus$  ssh cumulus02
      
      You should successfully log into cumulus02 without having to enter a password. If you are prompted for a password, check that the contents of your file ~/.ssh/id_rsa.pub were copied correctly to the end of your file ~/.ssh/authorized_keys.

      Note: Only some cumulusNN nodes may be in service. For a list of working cumulusNN nodes, enter the following command on cumulus

      cumulus$  cat /usr/mpihosts.new
      

      Also do this:

      • SSH into each of the hosts in /usr/mpihosts.new once before proceeding.

      Answer "yes" if SSH asks if you want to continue logging into an unknown computer. This insures that each of the cluster nodes gets added to your ~/.ssh/known_hosts file, which mpiexec will need for passwordless SSH connections.

    3. Now try running an MPI job that works from the first lab, as follows:

      cumulus$  cp  ~/PDC/lab1/trap.cpp  .
      cumulus$  mpiCC -o trap trap.cpp  
      cumulus$  mpiexec -np 4 -hostfile /usr/mpihosts.new --map-by node trap
      
      The   --map-by node   command-line argument causes the MPI processes to run on multiple cluster nodes in a round-robin schedule: first one process per node for each node in /usr/mpihosts.new, then a second process per node (running on a second core), etc., until all processes are assigned.

      Check the success of this step by looking for normal output from a run of your program (e.g., trap).

    4. Verify that MPI is actually using multiple nodes by adding the following to your trap.cpp program (or other test program):

      • We will print the rank and host name (computer name) from each process in the MPI computation. Note that standard output from each process is forwarded to the standard out of the mpiexec command, so this will enable us to check that multiple nodes are being used.

      • The system call gethostname() enables you to obtain a printable name of the computer running your program. Google man gethostname to find a manual page that describes gethostname(). You can add the following lines to your trap.cpp program, perhaps at the point in the code just after defining and initializing integral.

          const int MAXHOSTNAME = 25;
          char hostname[MAXHOSTNAME];
          gethostname(hostname, MAXHOSTNAME);
        
        Here, we define and use the constant MAXHOSTNAME for readability and maintainability (it is easy to change that value in one place and have all its uses updated throughout the code). Also add this include file near the top of trap.cpp:
        #include <unistd.h>
        

        System calls. An (operating) system call is an operation that is carried out by the operating system kernel, i.e., by the code that is loaded on your computer when you boot it and then runs continually until you power the system down (or suspend it). gethostname() is a Linux system call; Windows provides a system call with a different name for the same purpose.

      • We would like to use a C++ statement such as the following to print out process information:

          cout << "Rank " << my_rank << " (" << hostname << ")" << endl;
        
        However, the output from multiple processes might get garbled, because each operator << is actually a function call in C++, and one process's series of << calls might not finish before another process's << calls start. (This is an example of a race condition, in which the correct behavior of a computation depends on timing. We will study race conditions later.)

        To avoid overlapping garbled output, we will pack our desired output into a string, then use cout to print that string in a single call to <<. Add this code to your program trap.cpp just after your gethostname() call:

          std::stringstream ss;
          ss << "Rank " << my_rank << " (" << hostname << ")" << endl;
          cout << ss.str();
        
        Also add the following #include near the beginning of trap.cpp
        #include <sstream>
        
        See the cplusplus.com reference (used for Software Design) for specs of the standard stringstream class.

      • Recompile and test your program, using the mpiexec arguments above. You should see the usual trap output, plus four lines such as

        Rank 1 (cumulus04)
        
        Note: Those lines printed by MPI processes may occur in any order, and ordinarily will change order each time you run the MPI program, because there's no enforced scheduling order among these MPI processes.

  2. (Followup on Lab1.) The C++ library for MPI is now deprecated, meaning that it is no longer supported in recent versions of the MPI software. Even though it is still part of the version we are using, we would rather invest our efforts in technologies that are likely to remain relevant. Also, you may already have noticed that the documentation for the C MPI library is better than the documentation for the C++ MPI library. Note that the C-language bindings can be used in C++ programs.

    Copy trap.cpp (or a later version of the C++ trap code) from your lab1 directory to your lab2 directory, and modify it to use the C language MPI library (as found in trap.c) instead of the C++ MPI library; a sketch of the C-binding calls appears at the end of this item. You can still use mpiCC to compile your C++ code. Test the new version using mpiexec with SLURM, and verify that it works.

    Note: We will use the C-language MPI library in this and future labs.
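
    For reference, here is a minimal sketch of the C-binding calls that take the place of the deprecated C++ bindings; the trapezoid computation itself is unchanged, and the variable names below are only illustrative (your lab1 names may differ).

      #include <mpi.h>

      int main(int argc, char** argv) {
          int my_rank, comm_sz;
          MPI_Init(&argc, &argv);                   // replaces MPI::Init(argc, argv)
          MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);  // replaces MPI::COMM_WORLD.Get_rank()
          MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);  // replaces MPI::COMM_WORLD.Get_size()

          // ... trapezoid work here, communicating with MPI_Send()/MPI_Recv()
          //     or MPI_Reduce(), each of which takes MPI_COMM_WORLD as its
          //     communicator argument ...

          MPI_Finalize();                           // replaces MPI::Finalize()
          return 0;
      }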

  3. More features of SLURM batch scripts. Create a simple MPI program try.cpp that prints out a one-line message including the MPI rank (a minimal sketch of such a program appears at the end of this item). Compile your program using mpiCC, and create a script try.sh containing the following lines to run your program.

    #!/bin/sh
    #SBATCH -o try.out -e try.err
    #SBATCH -n 4
    
    mpiexec -x LD_LIBRARY_PATH=/usr/lib/openmpi/lib try
    

    You can submit this script using the command

    $ sbatch try.sh
    

    This script demonstrates additional features of SLURM job scripts:

    • Shell-script comments that begin with #SBATCH can be used to provide command-line arguments for the SLURM job submission. In particular, the two #SBATCH commands in this script specify three command-line arguments with their values.

    You can override script parameters at the sbatch command line. For example,

    $  sbatch -n 6 try.sh
    
    causes the program to run with six processes instead of four.
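
    For reference, a minimal try.cpp might look like the following sketch; the message text is arbitrary.

      #include <mpi.h>
      #include <iostream>
      using namespace std;

      int main(int argc, char** argv) {
          MPI_Init(&argc, &argv);
          int my_rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
          cout << "Hello from rank " << my_rank << endl;
          MPI_Finalize();
          return 0;
      }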

  4. On a cluster admin node (e.g., cumulus), copy your homework program primes.cpp into your subdirectory ~/PDC/lab2. For this part, have primes.cpp produce a count of primes between 2 and N=1000.

    Then, compile primes.cpp and run it using mpiexec, in order to produce answers from a 1-processor run of primes.cpp for "3-digit" primes (up to 1000).

    Note. This program has no MPI calls, so it can be compiled using g++. You can use mpiexec even though that program primes.cpp does not actually need to use MPI. We could use mpiexec with other executables to make multiple simultaneous runs of a program conveniently. However, in this case, we are only using one node instead of many.

    After testing your code in this way, create a commit containing it.

    cumulus$  git add primes.cpp
    cumulus$  git commit -m "lab2: primes.cpp, copied from HW1"
    

  5. How many primes can we compute this way? Try computing 4-digit primes (up to N=10000), 5-digit primes (up to N=100000), etc., to find the largest number k≤9 such that a cluster node can compute k-digit primes.

    Carry out this computation in parallel using MPI, as follows.

    1. Modify primes.cpp to call MPI_Init() at the beginning of main() and MPI_Finalize() at the end. Use 10^(r+4) for the size of the sieve array, where r is the MPI rank. You can compute the number 10^(r+4) using a loop, perhaps located in a helper function ipow(base, exp); a sketch of this sizing appears after this list. (This will require including the MPI header mpi.h.) Test this version by using mpiexec with -np 1 to request one processor (note that this test also performs the check for 4-digit primes). Be sure to use mpiCC to compile your program, and use the -o and -e arguments for sbatch to capture standard output and standard error.

    2. Now change the array size to be 10^(r+5) and use mpiexec -np 5 to request five processors, thus checking for 5-, 6-, ..., and 9-digit primes in parallel.

    3. Observe that the output from all the processors is merged in the output file. (If you are only seeing the output for 5-digit primes, try printing less output, e.g., just a count of how many primes were found.)

      Also, any error output from the processes will be merged into your error file. If any of your processes generates an error (probably for trying to allocate too much memory for the array), the entire job (all processes) will be terminated immediately. As a result, you may see correct output from some processes, error output from one process, and see nothing from any other processes.

      If an error arises, reduce the number of processes and resubmit the job, until you can determine the largest value of k≤9 for which array allocation succeeds.

    4. Optional: Our sieve array uses a whole byte to represent what is essentially one bit of information (either 0 or 1). Can you achieve a greater value of k if you use a different strategy for representing the sieve array?
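
    For step 1 above, the following sketch shows one way to compute the sieve bound 10^(r+4) from the MPI rank; the helper name ipow and the variable names are only suggestions, and the sieve itself is your existing primes.cpp code.

      #include <mpi.h>
      #include <iostream>

      // Compute base^exp with a simple loop (adequate for small exponents).
      long long ipow(long long base, int exp) {
          long long result = 1;
          for (int i = 0; i < exp; i++)
              result *= base;
          return result;
      }

      int main(int argc, char** argv) {
          MPI_Init(&argc, &argv);
          int my_rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

          long long N = ipow(10, my_rank + 4);   // rank 0 sieves to 10^4, rank 1 to 10^5, ...
          std::cout << "Rank " << my_rank << " sieving up to " << N << std::endl;
          // ... allocate the sieve array of size N and count primes as in primes.cpp ...

          MPI_Finalize();
          return 0;
      }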


  6. Now that you have determined the largest integer k≤9 such that k-digit primes can be computed using our sieve method, create a file /home/user/PDC/lab2/prime.list consisting of the primes between 2 and 10^k, one per line.

    Note: Do not commit prime.list to git -- we want source code, not large data files, in your stogit repository.

    If you want to prevent prime.list from appearing as an "untracked file" in git status commands, add the line

    lab2/prime.list
    
    to a file ~/PDC/.gitignore .

    Use an ofstream object to create your file. Here's an example program that prints a string to a file:

        #include <iostream>
        #include <fstream>
        using namespace std;
        
        int main () {
          ofstream outfile ("/home/user/PDC/lab2/hello.out");
          outfile << "Hello, world!" << endl;
          outfile.close();
        }
    

    Note: It's possible for a cluster node to create a file in your lab2 directory because your admin-node's home directory is shared with all the cluster nodes on that cluster (via NFS).

    End of first submission stage.

    Submit your work up to this step (or later) by Saturday, September 22, 2018 as follows:

    • Use

      git commit --amend
      
      to update your most recent commit message to add the string
      submit lab2: first submit 
      
      to the existing message. Modify that added string if you have any clarifications about this submission (e.g., submit lab2: items 1-6 only ).

    • Pull/push that commit.

      cumulus$  git pull origin master
      cumulus$  git push origin master
      


  7. Write a new program big-primes.cpp that randomly generates 2k-digit numbers and checks whether they are prime, using your file prime.list.

    1. This first version of big-primes.cpp does not need MPI. (You can compile with g++.) However, we can still run the program using mpiexec in order potentially to execute multiple copies at once. Invoke mpiexec with -np 1 for one processor, for now.

      Note: Our general strategy with 1-processor MPI jobs is to make sure our programs run as MPI jobs in the simplified situation of a single processor before adding in multiple processors. This incremental approach simplifies debugging, and distributed programs are notoriously difficult to debug.

    2. Your program should begin by reading the primes in prime.list and storing them in an array of long. Open that file for input as an ifstream object. You can use the operator >> to read integers from that file stream. You may hard-code the number of primes in prime.list for the array size (first column in output from shell command $ wc prime.list).

    3. Use the function rand() in the C++ library <cstdlib> to generate pseudo-random numbers. For example, the following short program prints five random numbers between 0 and 999.

      #include <iostream>
      #include <cstdlib>
      using namespace std;
      
      static const int N = 1000;
      
      int main() {
        for (int i = 0;  i < 5;  i++) 
          cout <<  (rand() % N)  << endl;
      }
      

      Note: The C++ operator % returns the remainder from integer division of its operands. For example, 103%5 returns the value 3.

      Also note that the keyword static for a global variable limits the scope of that global to the current file big-primes.cpp.

    4. The function rand() does not generate arbitrarily large pseudo-random numbers. The maximum is the value of RAND_MAX (defined in <cstdlib>), which you can determine for your cluster nodes by writing a short program to print that value.

      Create and run a simple MPI test program (with one processor) to determine the value of RAND_MAX, so you will get RAND_MAX's value for a cluster node instead of the cluster's admin machine.

      Since the number of digits in RAND_MAX is probably less than 2k, you will have to fabricate large random numbers from multiple return values from rand(). For example, if RAND_MAX exceeds 10^6 and 18-digit random numbers are desired, you can create one using three calls to rand() as follows.

      const long long M = 1000000;  // million
      long long bigrand = (rand() % M);
      bigrand += (rand()%M)*M;
      bigrand += (rand()%M)*M*M;
      // bigrand is between 0 and M^3 - 1
      

      Here, we use the type long long for the variable bigrand, to ensure that we have a 64-bit (8-byte) integer, large enough to contain an 18-digit positive integer. (The size of a data type for a particular version of a C++ compiler can be checked using the sizeof() operator. You can check for yourself whether the type long is a 64-bit integer on your system.)

      Note: We limited k to be no greater than 9 so that an integer with 2k digits can be contained in a long long variable. long long variables can hold values between -2^63 and +2^63 - 1, where 2^63 ≈ 9.2×10^18.

      Note 2: The gcc/g++ compiler supports a nonstandard type identifier __int128 for 128-bit integers on computers that support them. If you find that k = 9 succeeds on the cluster you're using, you can check whether __int128 is available there with a test program that simply prints the value of sizeof(__int128). Run that test program using mpiexec (e.g., via sbatch), in order to see the result for a worker node on that cluster, not the head node.

    5. To determine whether a large number bigrand is prime, we will check the known primes p ≤ sqrt(bigrand) to see whether any of them divides bigrand. This means computing bigrand % p for those primes p in your array of primes. If bigrand % p returns 0 for some such prime p, then bigrand is not prime. If bigrand % p is non-zero for all such primes p in your list (which contains all primes of up to k digits), then bigrand must be prime, because the square root of a 2k-digit number has at most k digits. A sketch of this check appears after this list.

    6. Have your program print a message indicating the value of bigrand and whether it was a prime; if it was not prime, also print the prime divisor you found.

    7. Run your program to check one random number with (up to) 2k digits for primality, using one cluster node.
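
    For reference, here is a sketch of the pieces described in steps 2 through 6 above, assuming k = 9; the constant NPRIMES is a placeholder for the line count reported by wc prime.list, and the function and variable names are only suggestions.

      #include <iostream>
      #include <fstream>
      #include <cstdlib>
      using namespace std;

      const int NPRIMES = 50847534;        // placeholder: line count from `wc prime.list`
      const long long M = 1000000;         // one million

      // Build an (at most) 18-digit random number from three calls to rand().
      long long getBigrand() {
          long long big = rand() % M;
          big += (rand() % M) * M;
          big += (rand() % M) * M * M;
          return big;
      }

      int main() {
          long* primes = new long[NPRIMES];
          ifstream infile("/home/user/PDC/lab2/prime.list");
          for (int i = 0; i < NPRIMES; i++)
              infile >> primes[i];
          infile.close();

          long long bigrand = getBigrand();
          long long divisor = 0;
          for (int i = 0; i < NPRIMES; i++) {
              long long p = primes[i];
              if (p * p > bigrand)
                  break;                   // no prime <= sqrt(bigrand) divides bigrand
              if (bigrand % p == 0) {
                  divisor = p;
                  break;
              }
          }
          if (divisor == 0)
              cout << bigrand << " is prime" << endl;
          else
              cout << bigrand << " is composite; divisible by " << divisor << endl;

          delete [] primes;
          return 0;
      }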

    When you know you have working code, create a commit containing it.

    cumulus$  git add big-primes.cpp
    cumulus$  git commit -m "lab2: big-primes.cpp, randomly generate numbers and test them for primality"
    

  8. Make a quick run of your current version of big-primes.cpp, except substituting the value 19 (a prime) for bigrand just before checking for primality, to ensure that your program correctly determines primes. Run this on one processor.

    If you need to make corrections in big-primes.cpp, fix the program, then commit with an appropriate message.


  9. Now, adapt big-primes.cpp to check 2k-digit random numbers until it finds at least one prime, using 8 processes in parallel.

    1. Add calls to MPI library functions MPI_Init(), MPI_Comm_rank(), and MPI_Finalize() so each copy of the big-primes executable can determine its rank.

    2. In order to ensure that each process checks different candidates, include a call to srand(r); before your program makes any calls to rand(), where r is the process's rank. The function srand() assigns a random-number seed, leading to different sequences of pseudo-random numbers for different ranks.

    3. Now add a loop for repeatedly generating a value for bigrand, checking it for primality, printing results to standard output, then communicating the results with other processes, until some process finds a prime.

      Use point-to-point communication to share among the processes whether a prime has been found. Here is a strategy for the processes to communicate results with each other and determine when to stop (a sketch of this communication phase appears after the notes below): Just after each process checks a value of bigrand for primality and prints that result to standard output, insert the following communication code.

      • The head node (rank 0) sends a message to every non-head node (rank>0) containing one MPI_INT value, either a 0 (for false) or a 1 (for true), indicating whether the head node found a prime. (This will require a loop of MPI_Send() calls.)

      • Each non-head node NH should receive that MPI_INT value from the head node, and should terminate immediately if the received value is 1. Otherwise, that non-head node NH should send an MPI_INT value 0 or 1 to the head node, indicating whether NH found a prime, then NH should continue to the next iteration of its loop (find the next bigrand, test that bigrand, etc.).

      • If the head node did not find a prime on its own, then the head node should now receive from every non-head node (in a loop).

      • If the head node receives a 1 from one of the non-head nodes, the head node should send a 1 to all non-head nodes, then the head node should terminate.

        If instead the head node receives a 0 from all of the non-head nodes, then the head node should continue with the next iteration of its loop (make another bigrand, test that bigrand, etc.).

      • You can debug your program by generating, say, 6-digit random numbers, then switch back to 2k-digit random numbers once you are convinced it works.

        Note: One way to accomplish this is to encapsulate your code for generating a large random number bigrand in a function getBigrand() with return type long long int, and to write a second function getSmallrand() that also has return type long long int, but which returns a shorter random number (e.g., returns rand()%M ). This makes it convenient to switch the initialization of bigrand between long and short random numbers.

    Note: This protocol for communicating results among the processors is correct because:

    • Every process terminates as soon as the head node informs all the processes that a prime has been found. Otherwise, every process continues to look for new primes.

    • If any non-head process finds a prime, it will report it to the head node, and the head node will inform all processes.

    • If the head node finds a prime, then it will inform all processes.
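
    For reference, here is a sketch of the communication phase described above, written as a helper function; it assumes comm_sz (the number of processes) and foundPrime (1 if this process's latest bigrand was prime, else 0) are available in the caller, and the names are only suggestions.

      #include <mpi.h>

      // Returns 1 when the enclosing search loop should stop, else 0.
      int shareResults(int my_rank, int comm_sz, int foundPrime) {
          int done = 0;
          if (my_rank == 0) {
              // Head: tell every worker whether the head itself found a prime.
              for (int dest = 1; dest < comm_sz; dest++)
                  MPI_Send(&foundPrime, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
              if (foundPrime) {
                  done = 1;
              } else {
                  // Collect one 0/1 result from each worker.
                  for (int src = 1; src < comm_sz; src++) {
                      int workerFound;
                      MPI_Recv(&workerFound, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                               MPI_STATUS_IGNORE);
                      if (workerFound)
                          done = 1;
                  }
                  if (done)    // a worker succeeded: tell everyone once more, then stop
                      for (int dest = 1; dest < comm_sz; dest++)
                          MPI_Send(&done, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
              }
          } else {
              // Worker: hear from the head, then report own result if the search continues.
              int headFound;
              MPI_Recv(&headFound, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              if (headFound)
                  done = 1;
              else
                  MPI_Send(&foundPrime, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
          }
          return done;
      }

    The enclosing loop in each process would then be: generate bigrand, test it, print the result, call shareResults(), and exit the loop when shareResults() returns 1.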

    When you know you have working code, create a commit containing it.

    cumulus$  git add big-primes.cpp
    cumulus$  git commit -m "lab2: big-primes.cpp, find a 2k-digit prime with point-to-point send and receive"
    


  10. Patterns in Finding large primes with MPI.

    This lab exhibits some of the same patterns as we saw in the first lab.

    • Data Decomposition is present in the MPI version of big-primes.cpp, where generated random numbers constitute the data being split up among processors. We also see Data Decomposition when determining k, since we tested array lengths for five nodes in parallel. In that case, we used parallelism more for convenience than for speedup.

    • We used the Loop Parallel pattern when we used MPI to parallelize the loop in big-primes.cpp. Unlike the loop in the trapezoid lab, we are not dividing the work of computing with a known range of values. Instead, we are using MPI parallelism to check multiple random values at once, in order to shorten the search time for a random large prime.

    • Message Passing is present in the MPI version of big-primes.cpp, implemented in calls to MPI_Send() and MPI_Recv().

    Each of these MPI labs (including the first lab) also serves as an example of the following pattern.

    • In the SPMD (Single Program, Multiple Data) pattern, a single program is executed in parallel with multiple data values or data sets, making the computation easy to manage. In these MPI programs, literally the same program is executed on multiple computers, and even though the head process executes a variant set of instructions in order to assemble answers, SPMD is well represented in these two labs.

      • In the first lab, all nodes carry out the core computation of adding areas of trapezoids.

      • In this lab, when determining k, all nodes attempt prime computations at various scales (numbers of digits).

      • In the MPI version of big-primes.cpp, all nodes check random digit strings for primality.

    The program big-primes.cpp also indicates some patterns we haven't encountered before.

    • In the Master-Worker pattern, one thread of computation (the master) assigns work to other threads (the workers). In big-primes, the head node represents the master, and sending a message containing 0 to a non-head node constitutes assigning work to that node (which represents a worker), namely, to try another random prime. Often, as in this case, the master also collects results from the workers.

    • The Scatter-Gather pattern is comparable to Master-Worker, but specifically refers to a master sending values to workers (scatter) then receiving results from them all (gather). Scatter-Gather would more specifically describe the protocol in big-primes.cpp than Master-Worker, except that the master in the Scatter-Gather pattern ordinarily sends different values to each worker, whereas this master sends the same value to each worker.

    SPMD and Master-Worker are "mid-level" patterns, along with Loop Parallel. As a variant of Master-Worker, the Scatter-Gather pattern would also be at a middle level.

    Create a second version big-primes2.cpp of your program that uses collective communication to share whether a large prime has been found (a sketch appears after the note below). Here is a protocol:

    • After every process checks one value, a reduce operation is performed to deliver the number of primes found to the head process. Each process transmits 1 if it found a prime and 0 otherwise.

    • The head uses a broadcast to inform all processes whether a prime was found, so they can know whether to try again or terminate.

    Note:

    • Look up the arguments for MPI_Bcast() using online documentation.

      You could use a Google search for "MPI_Bcast" to find documentation for that library call. You may see references to particular implementations of MPI, such as OpenMPI (which we use on our main clusters) or MPICH (used on Raspberry Pi clusters and at some other sites' main clusters). We would expect the arguments for MPI calls to be identical for standard implementations of MPI.
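
    For reference, here is a sketch of the collective version of the communication phase; as before, foundPrime is assumed to be 1 if this process's latest bigrand was prime and 0 otherwise, and the function name is only a suggestion.

      #include <mpi.h>

      // Returns nonzero when some process found a prime, so every caller can stop.
      int shareResultsCollective(int foundPrime) {
          int total = 0;
          // Sum everyone's 0/1 result at the head (rank 0).
          MPI_Reduce(&foundPrime, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
          // The head then broadcasts the count back to every process.
          MPI_Bcast(&total, 1, MPI_INT, 0, MPI_COMM_WORLD);
          return total;          // 0 means keep searching; nonzero means stop
      }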

    When you know you have working code, create a commit containing it.

    cumulus$  git add big-primes2.cpp
    cumulus$  git commit -m "lab2: big-primes2.cpp, find a 2k-digit prime with broadcast and reduce"
    


  11. Some communication options.

    MPI includes the calls MPI_Scatter() and MPI_Gather() that support the Scatter-Gather pattern. However, MPI_Bcast() and MPI_Reduce() (both considered forms of Collective Communication) are a better fit for this implementation, since a single value is sent to all workers, and there is no need to distinguish between the individual responses received from those workers.

    • If the primes found by the workers (non-head nodes) were being sent back to the master (head node) instead of just a number 0 or 1, then MPI_Gather() would be a better fit, although we might still use MPI_Bcast() to inform workers that they should continue the search.

    • If in addition the head node were to distribute candidate digit strings to the workers instead of those workers generating digit strings on their own, then the computation would finally represent a complete Scatter-Gather pattern.

    We count Scatter-Gather as a mid-level pattern because the two-stage implementation pattern of scattering followed by gathering is further removed from the hardware and more closely related to a (high-level) Data Decomposition strategy for a particular application (e.g., testing multiple candidate digit sequences for primality in parallel) than low-level patterns such as Collective Communication, which more closely reflect the execution of operations in hardware (e.g., sending all nodes a message via broadcast, or combining values from each node through reduction).

    Create a version big-primes3.cpp of your program that uses the collective call MPI_Gather() to communicate the actual prime(s) found, if any (a sketch of the gather phase appears after the notes below). Each process participating in the MPI_Gather() operation either provides the prime it found or the integer 0, indicating that no prime was found.

    Notes:

    1. Look up the use of MPI_Gather() in online resources.

      • The first argument sendbuf in a call of MPI_Gather() is the address of a value to transmit to the head node.

      • A later argument recvbuf is the address of an array of locations, one per process, each capable of holding one value from a node. The MPI_Gather() operation fills that array with values for the head node only. The index of a value in that array corresponds to the rank of the process providing that value.

    2. For our application, we want MPI_Gather() to transmit 64-bit long long values, namely, bigrand values from nodes that turn out to be prime. Investigate how to transmit 64-bit values using MPI data types.

      • MPI data types include MPI_INT, MPI_LONG, MPI_FLOAT, etc. Some implementations of MPI include MPI_LONG_LONG_INT. A web search can be used to find a more complete list of options.

      • We want to communicate an 8-byte integer. You can use the library function MPI_Type_size() to determine the number of bytes used for various MPI types. MPI_Type_size()'s second argument is the address of an integer variable (be sure to allocate it) that receives the number of bytes in its first argument. The type MPI_LONG_LONG_INT (if it's supported in the current MPI implementation) might be a good one to try first.

        Create and run a simple MPI test program for this work, so you will get values on cluster nodes instead of the admin machine (such as cumulus).

      • If no predefined MPI data type has length 8 bytes, then you will have to transmit 8-byte large prime values using two 4-byte integers. In that case, you can use an array of two 4-byte integers for the first argument sendbuf of MPI_Gather() call, the value 2 for the second argument sendcount, and a 4-byte MPI integer type (MPI_LONG?) for the third argument sendtype. You will have to break your 8-byte values into the two 4-byte array elements, perhaps using % and >> operators.

    3. Use a copy of big-primes2.cpp as the starting point for big-primes3.cpp. In this version, we will retain the MPI_Bcast() strategy for the head process to send requests to the other processes, and replace MPI_Reduce() with an MPI_Gather() approach for receiving results from processes.

      • In the head process, you can use a loop to determine whether any non-zero values were delivered by the head process itself or other processes. (This loop, which includes the head process's own value, is a simpler approach than the protocol used before for determining whether the head or other processes had found a prime.)

    4. In most MPI implementations, the array argument recvbuf that gathers results only needs to be allocated in the head process, which is the only process that can access gathered values in that array. Use dynamic allocation for that array, since its length depends on the number of MPI processes, known only at runtime.

      • If you had to use an array of two integer types for the sendbuf value, then include that factor 2 when computing the number of bytes to allocate dynamically for recvbuf.

    5. Unlike previous implementations of big-primes, the head process will now know the value of the first large prime(s) found. Have big-primes3.cpp print a report of all large primes found in the first MPI_Gather() call that has nonzero values in recvbuf.

    6. As before, you can use shorter random numbers to debug your program, before applying it with 2k-digit random numbers.

    7. When you know you have working code, create a commit containing it.

      cumulus$  git add big-primes3.cpp
      cumulus$  git commit -m "lab2: big-primes3.cpp, with broadcast and gather"
      

  12. Create a version big-primes4.cpp of your program that uses the collective call MPI_Scatter() to distribute candidate 2k-digit random numbers to test for primality (a sketch of the scatter phase appears after the notes below).

    Notes:

    1. Look up the use of MPI_Scatter() in online resources.

      • For MPI_Scatter(), the first argument sendbuf is the array of values to send out; only the root process's value for this argument will have an effect.

      • Unlike MPI_Gather(), in a call to MPI_Scatter() the argument recvbuf points to a single value (or small per-process buffer), and every process uses recvbuf to receive the value(s) that the root process supplied through sendbuf.

    2. Use a copy of big-primes3.cpp as a starting point for big-primes4.cpp.

      • Instead of each process generating its own random numbers, in big-primes4.cpp only the head process generates random numbers to test for primality. Have the head process generate enough new random numbers to fill the array argument sendbuf for its call to MPI_Scatter().

      • Also, remove the random-number generation from the code area for non-head processes, and have them obtain their random numbers from the recvbuf argument in their calls to MPI_Scatter().

      • In the head process, use the MPI_Scatter() call instead of MPI_Bcast() to initiate new rounds of primality testing.

      • Use shorter random numbers to debug your program, then run it with 2k-digit random numbers. Note: We expect to discover different prime(s) than before, since all random numbers are generated from the random-number seed for the head process only, instead of using multiple random-number seeds for all processes.

      • When you know you have working code, create a commit containing it.

        cumulus$  git add big-primes4.cpp
        cumulus$  git commit -m "lab2: big-primes4.cpp, with scatter and gather"
        

  13. [Optional.] Explore other variants of the big-primes code, such as:

    • Write a version that produces at least P big primes. You could accumulate discovered primes in the head process until P of them are collected.

    • Does it make a difference if more than one random number is sent to processes at a time? Observe that an array of ct random numbers could be received by each process using the recvbuf argument for MPI_Scatter(), and an array of ct results (primes or zeroes) could be sent from each process using the sendbuf argument for MPI_Gather(). In that approach, make sure your head-only arrays (sendbuf for MPI_Scatter(), and recvbuf for MPI_Gather()) are expanded by a factor of ct (i.e., can hold ct*N large integer values, where N is the number of processes).

    • Do you have an idea of your own?

    If you try an optional idea:

    • create a file README in your lab2 directory that lists the filenames of your optional source code, and describes what they seek to do; and

    • commit your README and your optional source files, using the commit messages to indicate progress towards your goals for each program.

    Choosing which loops to parallelize, parallelization overhead, and cache usage.

    The final version of the big-primes program applies the Loop Parallel pattern at the head node, when iterating over code that represents the Scatter-Gather pattern. We did not parallelize other loops in that code.

    • The loop in which the master generates multiple candidate random digit sequences for a worker is not worth parallelizing, because the overhead of sending the small task of generating one of these random digit strings to another processor (even another core on the same computer) is far greater than the cost of simply carrying out that task on the head itself.

    • The loop within a worker for checking whether any of that worker's candidate random digit sequences is divisible by a particular known prime does not merit parallelization, also because the parallelization overhead far exceeds the cost of the division operation that might be carried out in parallel.

    • The loop within a worker that iterates through all primes might conceivably benefit from parallelization. However, small primes are far more likely to divide a random large number than large primes (e.g., 2 divides half of large random numbers, whereas the prime 99999989 divides only 1/99999989 of all large random numbers). It is unclear that the savings from this parallelization for those rare instances when small primes do not eliminate a random large number would exceed the overhead of parallelization.

    On the other hand, the computation time required to find a dozen large primes is much less than twelve times the computation required to find a single large prime, because of cache usage...______


Deliverables

All of your code for this assignment should already be contained in commits. Modify the most recent commit message to indicate that you are submitting the completed assignment.

cumulus$  git commit --amend
Add the following to the latest commit message.
    lab2: complete

If your assignment is not complete, indicate your progress instead, e.g.,
    lab2: items 1-5 complete, item 6 partial
You can make later commits to submit updates.

Finally, pull/push your commits in the usual way.

cumulus$  git pull origin master
cumulus$  git push origin master

Submit your work using git on the cluster head node you used. Include the cluster name in your commit message, e.g.,

mist$  cd ~/PDC
mist$  git pull origin master
mist$  git add -A
mist$  git commit -m "Lab 2 submit (mist)"
mist$  git push origin master
Also, fill out this form to report on your work.


See lab1 Deliverables section if you need to set up the cluster you used for git.

If you did your work on two different clusters, submit work from both of them using one of the strategies in lab1 Deliverables.

This lab is due by Thursday, September 27, 2018.


Files: primes.cpp big-primes.cpp big-primes2.cpp big-primes3.cpp big-primes4.cpp