Finding large primes with MPI
CS 300, Parallel and Distributed Computing (PDC)
Due Thursday, January 8, 2015
Goals:
- Compute primes up to 10^k (k = 9?), standalone.
- Use rand() to generate 18-digit random numbers and test them for primality, reporting the lowest divisor if composite.
- With n=10 processes, find an 18-digit prime by testing random digit strings.
- Improve efficiency?

Background:
- HW1, computing primes less than N.
- Pseudo-random number generation with rand() (and srand()).
- Public-key encryption (e.g., RSA), using a product of large primes.
Protocol
Log in to a cluster admin node. Create two subdirectories ~/PDC/lab2 and ~/PDC/lab2/job, then change your default directory to your new lab2 directory.
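Assuming a standard shell on the admin node, the setup might look like this:

```shell
# Create the lab2 directory and its job subdirectory in one step
# (-p also creates the intermediate PDC directory if needed),
# then make lab2 the working directory.
mkdir -p ~/PDC/lab2/job
cd ~/PDC/lab2
```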
(Followup on Lab1.) The C++ library for MPI is now deprecated, meaning that it is no longer supported in recent versions of the MPI software. Even though it is still part of the version we are using, we would rather invest our efforts in technologies that are likely to remain relevant. Also, you may already have noticed that the documentation for the C MPI library is better than the documentation for the C++ MPI library. Note that the C-language bindings can be used in C++ programs.
Copy trap.cpp (or a later version of the C++ trap code) from your lab1 directory to your lab2 directory, and modify it to use the C-language MPI library (as found in trap.c) instead of the C++ MPI library. You can still use mpiCC to compile your C++ code. Test the new version using mpirun with SLURM, and verify that it works.
Note: We will use the C-language MPI library in this and future labs.
More features of SLURM batch scripts. Create a simple MPI program try.cpp that prints out a one-line message including the MPI rank. Compile your program using mpiCC, and create a script try.sh containing the following lines to run your program.
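A minimal try.cpp along these lines might look like the following sketch (the exact message text is up to you):

```cpp
#include <iostream>
#include <mpi.h>
using namespace std;

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's rank
  cout << "Hello from rank " << rank << endl;
  MPI_Finalize();
  return 0;
}
```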
#!/bin/sh
#SBATCH -o try.out -e try.err
#SBATCH -n 4
mpirun -x LD_LIBRARY_PATH=/usr/lib/openmpi/lib try
You can submit this script using the command
$ sbatch try.sh
This script demonstrates additional features of SLURM job scripts:
Shell-script comments that begin with #SBATCH can be used to provide command-line arguments for the SLURM job submission. In particular, the two #SBATCH commands in this script specify three command-line arguments with their values.
You can override script parameters at the sbatch command line. For example,
$ sbatch -n 6 try.sh
causes the program to run with six processes instead of four.
Copy your homework program primes.C to a cluster admin node, into a subdirectory ~/PDC/lab2. For this part, have primes.C produce a count of primes between 2 and N=1000. Also, create a job script primes.sh that runs an executable lab2/primes with 1 processor as an mpirun job.
Then, compile primes.C and use Slurm to submit your script primes.sh, in order to produce answers from a 1-processor run of primes.C for "3-digit" primes (up to 1000).
Note. This program has no MPI calls, so it can be compiled using g++. Even though the job script primes.sh causes the program to run with mpirun, that program primes.C does not actually need to use MPI. We could use mpirun with other executables to make multiple simultaneous runs of a program conveniently. However, in this case, we are only using one node instead of many.
How many primes can we compute this way? Try computing 4-digit primes (up to N=10000), 5-digit primes (up to N=100000), etc., to find the largest number k≤9 such that a cluster node can compute k-digit primes.
Carry out this computation in parallel using MPI, as follows.
Modify primes.C to call MPI_Init() at the beginning of main() and MPI_Finalize() at the end. Use round(pow(10, r+4)), i.e., 10^(r+4), for the size of the sieve array, where r is the MPI rank. You can compute the number 10^(r+4) using a loop, perhaps located in a helper function ipow(base, exp). (This will require the <cmath> header file in order to use round() and pow(), and another header for the MPI calls.) Test this version by submitting primes.sh with one processor (note that this test also performs the check for 4-digit primes). Be sure to use mpiCC to compile your program, and use the -o and -e arguments for sbatch to capture standard output and standard error.
Now change the array size to be round(pow(10, r+5)), i.e., 10^(r+5), and modify primes.sh to request five processors, thus checking for 5-, 6-, ..., and 9-digit primes in parallel.
Observe that the output from all the processors is merged in the output file. (If you are only seeing the output for 5-digit primes, try printing less output, e.g., just a count of how many primes were found.)
Also, any error output from the processes will be merged into your error file. If any of your processes generates an error (probably for trying to allocate too much memory for the array), the entire job (all processes) will be terminated immediately. As a result, you may see correct output from some processes, error output from one process, and see nothing from any other processes.
If an error arises, reduce the number of processes and resubmit the job, until you can determine the largest value of k≤9 for which array allocation succeeds.
Optional: Our sieve array uses a whole byte to represent what is essentially one bit of information (either 0 or 1). Can you achieve a greater value of k if you use a different strategy for representing the sieve array?
Now that you have determined the largest integer k≤9 such that k-digit primes can be computed using our sieve method, create a file /home/user/PDC/lab2/prime.list consisting of the primes between 2 and 10^k, one per line.
Use an ofstream
object to create your file. Here's an example program that prints a
string to a file:
#include <iostream>
#include <fstream>
using namespace std;

int main () {
  ofstream outfile ("/home/user/PDC/lab2/hello.out");
  outfile << "Hello, world!" << endl;
  outfile.close();
}
Note: It's possible for a cluster node to create a file in your
lab2
directory because your admin-node's home directory
is shared with all the cluster nodes on that cluster (via NFS).
Write a new program big-primes.cpp
that randomly
generates 2k-digit numbers and checks whether they are
primes, using your file prime.list
.
This first version of big-primes.cpp
does not need
MPI. (You can compile with g++
.) However, use a job
script big-primes.sh
in order to run the program on a
cluster node via Slurm --- with one processor, for now.
Note: Our general strategy with 1-processor MPI jobs is to make sure our programs run as MPI jobs in the simplified situation of a single processor before adding multiple processors. This incremental approach simplifies debugging, since distributed programs are notoriously difficult to debug.
Your program should begin by reading the primes in prime.list and storing them in an array of long. Open that file for input as an ifstream object. You can use the operator >> to read integers from that file stream. You may hard-code the number of primes in prime.list for the array size (first column in output from the shell command $ wc prime.list).
Use the function rand()
in the C++ library
<cstdlib>
to generate pseudo-random numbers.
For example, the following short program prints five random numbers
between 0 and 1000.
#include <iostream>
#include <cstdlib>
using namespace std;

#define N 1000

int main() {
  for (int i = 0; i < 5; i++)
    cout << (rand() % N) << endl;
}
Note: The C++ operator %
returns the
remainder from integer division of its operands. For example,
103%5
returns the value 3.
The function rand()
does not generate arbitrarily
large pseudo-random numbers. The maximum is the value of
RAND_MAX
(defined in <cstdlib>
), which
you can determine for your cluster nodes by writing a short program to
print that value. Be sure to submit a Slurm job (with one processor),
so you will get RAND_MAX
's value for a cluster node
instead of the cluster's admin machine.
Since the number of digits in RAND_MAX is probably less than 2k, you will have to fabricate long random numbers from multiple return values from rand(). For example, if RAND_MAX exceeds 10^6 and 18-digit random numbers are desired, you can create one using three calls to rand() as follows.
const long long M = 1000000;      // million
long long bigrand = (rand() % M);
bigrand += (rand() % M) * M;
bigrand += (rand() % M) * M * M;  // bigrand is between 0 and M^3
Here, we use the type long long for the variable bigrand, to ensure that we have a 64-bit (8-byte) integer, in order to contain an 18-digit positive integer. (The size of a data type for a particular version of a C++ compiler can be checked using the sizeof() operator. You can check for yourself whether the type long has 64-bit integers.)
Note: We limited k to be no greater than 9 so that an integer with 2k digits can be contained in a long long variable. long long variables can hold values between -2^63 and +2^63-1, where 2^63 ~ 8×10^18.
Note 2: The gcc/g++ compiler supports a nonstandard type identifier __int128 for 128-bit integers on computers that support them. If you find that k = 9 succeeds on the cluster you're using, you can check whether that cluster supports __int128 with a test program that simply prints the value of sizeof(__int128). Submit that test program using mpirun, in order to see the results for a worker node on that cluster, not the head node.
To determine whether a large number bigrand is prime, we will repeatedly check known primes that are less than sqrt(bigrand) to see if they divide bigrand. This means computing bigrand % p for all primes p in your array of primes. If bigrand % p returns 0 for some prime p, then bigrand is not a prime. If bigrand % p is non-zero for all primes p in your list (which contains all primes up to 9 digits long), then bigrand must be a prime, because the square root of an 18-digit number contains at most 9 digits.
Have your program print a message indicating the value of
bigrand
and whether it was a prime; if it was not prime,
also print the prime divisor you found.
Run your program to check one random number with (up to) 2k digits for primality, using one cluster node.
Make a quick run of your current version of big-primes.cpp, except substituting a known prime (e.g., 19) for bigrand just before checking for primality, to ensure that your program correctly identifies primes. Run this on one processor.
Now, adapt big-primes.cpp
to check 2k-digit
random numbers until it finds at least one prime, using 8 cluster
nodes in parallel.
Add calls to MPI library functions MPI_Init()
,
MPI_Comm_rank()
, and MPI_Finalize()
so each
copy of the
big-primes
executable can determine its rank.
In order to ensure that each process checks different random numbers, include a call to srand(r); before your program makes any calls to rand(), where r is the process's rank. The function srand() assigns a random-number seed, leading to different sequences of pseudo-random numbers for different ranks.
Now add a loop for repeatedly generating a value for
bigrand
, checking it for primality, printing results to
standard output, then communicating
the results with other processes, until some process finds a prime.
Use point-to-point communication to share with the processes
whether a prime has been found. Here is a strategy for the processes
to communicate results with
each other and determine when to stop:
Just after each process checks a value for
bigrand
for primality and prints that result to standard
output, insert the following communication code.
The head node (rank 0) sends messages to every non-head node (rank > 0) containing one MPI_INT value, either a 0 (for false) or a 1 (for true), indicating whether the head node found a prime. (This will require a loop of MPI_Send() calls.)
Each non-head node NH should receive that
MPI_INT
value from the head node, and should terminate immediately if the
received value is 1. Otherwise, that
non-head node NH should send an MPI_INT
value 0
or 1 to the
head node, indicating whether NH found a prime, then
NH should continue to the next iteration of its loop (find
the next bigrand
, test that bigrand
, etc.).
If the head node did not find a prime on its own, then the head node should now receive from every non-head node (in a loop).
If the head node receives a 1 from one of the non-head nodes, the head node should send a 1 to all non-head nodes, then the head node should terminate.
If instead the head node receives a 0 from all of the
non-head nodes, then the head node should continue with the next
iteration of its loop (make another bigrand
, test that
bigrand
, etc.).
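Assembled into a sketch, the loop and protocol above might look as follows; makeBigrand() and isPrime() are hypothetical stand-ins for code you have already written, and TAG is an arbitrary tag value:

```cpp
#include <cstdlib>
#include <mpi.h>

// Hypothetical helpers standing in for code developed earlier:
long long makeBigrand();   // assemble one random 2k-digit candidate
int isPrime(long long n);  // 1 if prime by trial division, else 0

const int TAG = 0;         // arbitrary message tag

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  srand(rank);             // a different random sequence per rank

  while (true) {
    long long bigrand = makeBigrand();
    int found = isPrime(bigrand);   // also print the result here

    if (rank == 0) {
      // Head tells every worker whether it found a prime itself.
      for (int dest = 1; dest < nprocs; dest++)
        MPI_Send(&found, 1, MPI_INT, dest, TAG, MPI_COMM_WORLD);
      if (found) break;

      // Head collects one result from each worker.
      int any = 0;
      for (int src = 1; src < nprocs; src++) {
        int workerFound;
        MPI_Recv(&workerFound, 1, MPI_INT, src, TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (workerFound) any = 1;
      }
      if (any) {
        // Some worker succeeded: send 1 to all workers, then stop.
        // (Workers receive this at the top of their next iteration.)
        int one = 1;
        for (int dest = 1; dest < nprocs; dest++)
          MPI_Send(&one, 1, MPI_INT, dest, TAG, MPI_COMM_WORLD);
        break;
      }
    } else {
      // Worker hears from the head; terminates if a prime was found.
      int headFound;
      MPI_Recv(&headFound, 1, MPI_INT, 0, TAG,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      if (headFound) break;
      // Otherwise report our own result, then loop again.
      MPI_Send(&found, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD);
    }
  }
  MPI_Finalize();
  return 0;
}
```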
Note: This protocol for communicating results among the processors is correct because:
Every process terminates as soon as the head node informs all the processes that a prime has been found. Otherwise, every process continues to look for new primes.
If any non-head process finds a prime, it will report it to the head node, and the head node will inform all processes.
If the head node finds a prime, then it will inform all processes.
Patterns in Finding large primes with MPI.
This lab exhibits some of the same patterns as we saw in the first lab.
Data Decomposition is present in the MPI
version of big-primes.cpp
, where generated random
numbers constitute the data being split up among processors. We also
see Data Decomposition when determining k,
since we tested array lengths for five nodes in parallel. In that
case, we used parallelism more for convenience than for speedup.
Message Passing is present in the MPI version
of big-primes.cpp
, implemented in calls to
MPI_Send()
and MPI_Recv()
.
The program big-primes.cpp
also indicates some
patterns we haven't encountered before.
In the Master-Worker pattern, one thread of
computation (the master) assigns work to other threads (the
workers). In big-primes
, the head node
represents the master, and sending a message containing 0 to a non-head node
constitutes assigning work to that node (which represents a worker),
namely, to try another random prime. Often, as in this case, the
master also collects results from the workers.
The Scatter-Gather pattern is comparable to
Master-Worker, but specifically refers to a master
sending values to workers (scatter) then receiving results
from them all (gather). Scatter-Gather
would more specifically describe the protocol in
big-primes.cpp
than Master-Worker,
except that the master in the Scatter-Gather pattern
ordinarily sends different values to each worker, whereas
this master sends the same value to each worker.
Master-Worker might be described as a "mid-level" pattern, not as high (i.e., not as close to the application) as Data Decomposition, but not as low (close to the hardware) as Message Passing. As a variant of Master-Worker, the Scatter-Gather pattern would also be at a middle level.
Create a second version big-primes2.cpp
of your
program that uses collective communication to share whether a large
prime has been found. Here is a protocol:
After every process checks one value, a reduce operation is performed to deliver the number of primes found to the head process. Each process transmits 1 if it found a prime and 0 otherwise.
The head uses a broadcast to inform all processes whether a prime was found, so they can know whether to try again or terminate.
Notes:
Look up the arguments for MPI_Bcast using the online documentation.
MPI includes Collective Communication calls
MPI_Scatter()
and MPI_Gather()
to support
the Scatter-Gather pattern. However,
MPI_Bcast()
and MPI_Reduce()
are a better
fit for this problem, since a single value is sent to all workers, and
there is no need to distinguish between the individual responses
received from those workers. If the primes found by the workers
(non-head nodes) were
being sent back to the master (head node) instead of just a number 0
or 1, then MPI_Gather()
would be a better fit (even
though we might still use MPI_Bcast()
for "scattering").
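The collective version of the communication step might be sketched like this, replacing all of the point-to-point traffic of the first version; the function name is illustrative:

```cpp
#include <mpi.h>

// One iteration's communication step for big-primes2.cpp.
// 'found' is this process's result (1 if its bigrand was prime).
// Returns 1 if any process found a prime this round, so that
// every process can decide locally whether to terminate.
int primeFoundAnywhere(int found) {
  int total = 0;
  // Reduce: the head (rank 0) learns how many processes found a prime.
  MPI_Reduce(&found, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  // Broadcast: the head shares that count with all processes.
  MPI_Bcast(&total, 1, MPI_INT, 0, MPI_COMM_WORLD);
  return total > 0;
}
```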
Submit your work using
git
on the cluster head
node you used. Include the cluster name in your commit message, e.g.,
mist$ cd ~/PDC
mist$ git pull origin master
mist$ git add -A
mist$ git commit -m "Lab 2 submit (mist)"
mist$ git push origin master
Also, fill out this form to report on your work.
See lab1 Deliverables
section if you need to set up the cluster you used for
git
.
If you did your work on two different clusters, submit work from both of them using one of the strategies in lab1 Deliverables.
This lab is due by Thursday, January 8, 2015.