Finding large primes with MPI
CS 300, Parallel and Distributed Computing (PDC)
Due Thursday, January 8, 2015
Goals:
- Compute primes up to 10^k (k = 9?), standalone.
- Use rand() to generate 18-digit random numbers and test them for primality, reporting the lowest divisor if composite.
- With n=10 processes, find an 18-digit prime by testing random digit strings.
- Improve efficiency?

Background:
- HW1, computing primes less than N.
- Pseudo-random number generation with rand() (and srand()).
- Public-key encryption (e.g., RSA), using a product of large primes.
Protocol
Log in to a cluster admin node. Create two subdirectories ~/PDC/lab2 and ~/PDC/lab2/job, then change your default directory to your new lab2 directory.
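Assuming a standard shell on the admin node, the setup might look like this:

```shell
# Create the lab2 directory and its job subdirectory in one step
# (-p also creates the intermediate PDC directory if needed),
# then make lab2 the working directory.
mkdir -p ~/PDC/lab2/job
cd ~/PDC/lab2
```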
(Followup on Lab1.) The C++ library for MPI is now deprecated, meaning that it is no longer supported in recent versions of the MPI software. Even though it is still part of the version we are using, we would rather invest our efforts in technologies that are likely to remain relevant. Also, you may already have noticed that the documentation for the C MPI library is better than the documentation for the C++ MPI library. Note that the C-language bindings can be used in C++ programs.
Copy trap.cpp (or a later version of the C++ trap code) from your lab1 directory to your lab2 directory, and modify it to use the C-language MPI library (as found in trap.c) instead of the C++ MPI library. You can still use mpiCC to compile your C++ code. Test the new version using mpirun with SLURM, and verify that it works.
Note: We will use the C-language MPI library in this and future labs.
More features of SLURM batch scripts. Create a simple MPI program try.cpp that prints out a one-line message including the MPI rank. Compile your program using mpiCC, and create a script try.sh containing the following lines to run your program.
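A minimal try.cpp along these lines might look like the following sketch (the exact message text is up to you):

```cpp
#include <iostream>
#include <mpi.h>
using namespace std;

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's rank
  cout << "Hello from rank " << rank << endl;
  MPI_Finalize();
  return 0;
}
```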
#!/bin/sh
#SBATCH -o try.out -e try.err
#SBATCH -n 4
mpirun -x LD_LIBRARY_PATH=/usr/lib/openmpi/lib try
You can submit this script using the command
$ sbatch try.sh
This script demonstrates additional features of SLURM job scripts:
Shell-script comments that begin with #SBATCH can be used to provide command-line arguments for the SLURM job submission. In particular, the two #SBATCH commands in this script specify three command-line arguments with their values.
You can override script parameters at the sbatch command line. For example,
$ sbatch -n 6 try.sh
causes the program to run with six processes instead of four.
Copy your homework program primes.C to a cluster admin node, into a subdirectory ~/PDC/lab2. For this part, have primes.C produce a count of primes between 2 and N=1000. Also, create a job script primes.sh that runs an executable lab2/primes with 1 processor as an mpirun job.
Then, compile primes.C and use Slurm to submit your script primes.sh, in order to produce answers from a 1-processor run of primes.C for "3-digit" primes (up to 1000).
Note. This program has no MPI calls, so it can be compiled using g++. Even though the job script primes.sh causes the program to run with mpirun, that program primes.C does not actually need to use MPI. We could use mpirun with other executables to make multiple simultaneous runs of a program conveniently. However, in this case, we are only using one node instead of many.
How many primes can we compute this way? Try computing 4-digit primes (up to N=10000), 5-digit primes (up to N=100000), etc., to find the largest number k≤9 such that a cluster node can compute k-digit primes.
Carry out this computation in parallel using MPI, as follows.
Modify primes.C to call MPI_Init() at the beginning of main() and MPI_Finalize() at the end. Use round(pow(10, r+4)), i.e., 10^(r+4), for the size of the sieve array, where r is the MPI rank. You can compute the number 10^(r+4) using a loop, perhaps located in a helper function ipow(base, exp). (This will require the <cmath> header file in order to use round() and pow(), and another header for the MPI calls.) Test this version by submitting primes.sh with one processor (note that this test also performs the check for 4-digit primes). Be sure to use mpiCC to compile your program, and use the -o and -e arguments for sbatch to capture standard output and standard error.
Now change the array size to be round(pow(10, r+5)), i.e., 10^(r+5), and modify primes.sh to request five processors, thus checking for 5-, 6-, ..., and 9-digit primes in parallel.
Observe that the output from all the processors is merged in the output file. (If you are only seeing the output for 5-digit primes, try printing less output, e.g., just a count of how many primes were found.)
Also, any error output from the processes will be merged into your error file. If any of your processes generates an error (probably for trying to allocate too much memory for the array), the entire job (all processes) will be terminated immediately. As a result, you may see correct output from some processes, error output from one process, and see nothing from any other processes.
If an error arises, reduce the number of processes and resubmit the job, until you can determine the largest value of k≤9 for which array allocation succeeds.
Optional: Our sieve array uses a whole byte to represent what is essentially one bit of information (either 0 or 1). Can you achieve a greater value of k if you use a different strategy for representing the sieve array?
Now that you have determined the largest integer k≤9 such that k-digit primes can be computed using our sieve method, create a file /home/user/PDC/lab2/prime.list consisting of the primes between 2 and 10^k, one per line.
Use an ofstream
object to create your file. Here's an example program that prints a
string to a file:
#include <iostream>
#include <fstream>
using namespace std;

int main () {
  ofstream outfile ("/home/user/PDC/lab2/hello.out");
  outfile << "Hello, world!" << endl;
  outfile.close();
}
Note: It's possible for a cluster node to create a file in your
lab2
directory because your admin-node's home directory
is shared with all the cluster nodes on that cluster (via NFS).
Write a new program big-primes.cpp
that randomly
generates 2k-digit numbers and checks whether they are
primes, using your file prime.list
.
This first version of big-primes.cpp
does not need
MPI. (You can compile with g++
.) However, use a job
script big-primes.sh
in order to run the program on a
cluster node via Slurm --- with one processor, for now.
Note: Our general strategy with 1-processor MPI jobs is to make sure our programs run as MPI jobs in the simplified situation of a single processor before adding multiple processors. This incremental approach simplifies debugging, since distributed programs are notoriously difficult to debug.
Your program should begin by reading the primes in prime.list and storing them in an array of long. Open that file for input as an ifstream object. You can use the operator >> to read integers from that file stream. You may hard-code the number of primes in prime.list for the array size (first column in output from the shell command $ wc prime.list).
Use the function rand()
in the C++ library
<cstdlib>
to generate pseudo-random numbers.
For example, the following short program prints five random numbers
between 0 and 1000.
#include <iostream>
#include <cstdlib>
using namespace std;

#define N 1000

int main() {
  for (int i = 0; i < 5; i++)
    cout << (rand() % N) << endl;
}
Note: The C++ operator %
returns the
remainder from integer division of its operands. For example,
103%5
returns the value 3.
The function rand()
does not generate arbitrarily
large pseudo-random numbers. The maximum is the value of
RAND_MAX
(defined in <cstdlib>
), which
you can determine for your cluster nodes by writing a short program to
print that value. Be sure to submit a Slurm job (with one processor),
so you will get RAND_MAX
's value for a cluster node
instead of the cluster's admin machine.
Since the number of digits in RAND_MAX is probably less than 2k, you will have to fabricate long random numbers from multiple return values from rand(). For example, if RAND_MAX exceeds 10^6 and 18-digit random numbers are desired, you can create one using three calls to rand() as follows.
const long long M = 1000000;      // million
long long bigrand = (rand() % M);
bigrand += (rand() % M) * M;
bigrand += (rand() % M) * M * M;  // bigrand is between 0 and M^3
Here, we use the type long long for the variable bigrand, to ensure that we have a 64-bit (8-byte) integer, in order to contain an 18-digit positive integer. (The size of a data type for a particular version of a C++ compiler can be checked using the sizeof() operator. You can check for yourself whether the type long has 64-bit integers.)
Note: We limited k to be no greater than 9 so that an integer with 2k digits can be contained in a long long variable. long long variables can hold values between -2^63 and +2^63-1, where 2^63 ~ 8×10^18.
Note 2: The gcc/g++ compiler supports a nonstandard type identifier __int128 for 128-bit integers on computers that support them. If you find that k = 9 succeeds on the cluster you're using, you can check whether that cluster supports __int128 with a test program that simply prints the value of sizeof(__int128). Submit that test program using mpirun, in order to see the results for a worker node on that cluster, not the head node.
To determine whether a large number bigrand is prime, we will repeatedly check known primes that are less than sqrt(bigrand) to see if they divide bigrand. This means computing bigrand % p for all primes p in your array of primes. If bigrand % p returns 0 for some prime p, then bigrand is not a prime. If bigrand % p is non-zero for all primes p in your list (which contains all primes up to 9 digits long), then bigrand must be a prime, because the square root of an 18-digit number contains at most 9 digits.
Have your program print a message indicating the value of
bigrand
and whether it was a prime; if it was not prime,
also print the prime divisor you found.
Run your program to check one random number with (up to) 2k digits for primality, using one cluster node.
Make a quick run of your current version of big-primes.cpp, except substituting a known prime (e.g., 19) for bigrand just before checking for primality, to ensure that your program correctly identifies primes. Run this on one processor.
Now, adapt big-primes.cpp
to check 2k-digit
random numbers until it finds at least one prime, using 8 cluster
nodes in parallel.
Add calls to MPI library functions MPI_Init()
,
MPI_Comm_rank()
, and MPI_Finalize()
so each
copy of the
big-primes
executable can determine its rank.
In order to ensure that each process checks different random numbers, include a call to srand(r); before your program makes any calls to rand(), where r is the process's rank. The function srand() assigns a random-number seed, leading to different sequences of pseudo-random numbers for different ranks.
Now add a loop for repeatedly generating a value for
bigrand
, checking it for primality, printing results to
standard output, then communicating
the results with other processes, until some process finds a prime.
Use point-to-point communication to share with the processes
whether a prime has been found. Here is a strategy for the processes
to communicate results with
each other and determine when to stop:
Just after each process checks a value for
bigrand
for primality and prints that result to standard
output, insert the following communication code.
The head node (rank 0) sends messages to every non-head node (rank > 0) containing one MPI_INT value, either a 0 (for false) or a 1 (for true), indicating whether the head node found a prime. (This will require a loop of MPI_Send() calls.)
Each non-head node NH should receive that
MPI_INT
value from the head node, and should terminate immediately if the
received value is 1. Otherwise, that
non-head node NH should send an MPI_INT
value 0
or 1 to the
head node, indicating whether NH found a prime, then
NH should continue to the next iteration of its loop (find
the next bigrand
, test that bigrand
, etc.).
If the head node did not find a prime on its own, then the head node should now receive from every non-head node (in a loop).
If the head node receives a 1 from one of the non-head nodes, the head node should send a 1 to all non-head nodes, then the head node should terminate.
If instead the head node receives a 0 from all of the
non-head nodes, then the head node should continue with the next
iteration of its loop (make another bigrand
, test that
bigrand
, etc.).
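Assembled into a sketch, the loop and protocol above might look as follows; makeBigrand() and isPrime() are hypothetical stand-ins for code you have already written, and TAG is an arbitrary tag value:

```cpp
#include <cstdlib>
#include <mpi.h>

// Hypothetical helpers standing in for code developed earlier:
long long makeBigrand();   // assemble one random 2k-digit candidate
int isPrime(long long n);  // 1 if prime by trial division, else 0

const int TAG = 0;         // arbitrary message tag

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  srand(rank);             // a different random sequence per rank

  while (true) {
    long long bigrand = makeBigrand();
    int found = isPrime(bigrand);   // also print the result here

    if (rank == 0) {
      // Head tells every worker whether it found a prime itself.
      for (int dest = 1; dest < nprocs; dest++)
        MPI_Send(&found, 1, MPI_INT, dest, TAG, MPI_COMM_WORLD);
      if (found) break;

      // Head collects one result from each worker.
      int any = 0;
      for (int src = 1; src < nprocs; src++) {
        int workerFound;
        MPI_Recv(&workerFound, 1, MPI_INT, src, TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (workerFound) any = 1;
      }
      if (any) {
        // Some worker succeeded: send 1 to all workers, then stop.
        // (Workers receive this at the top of their next iteration.)
        int one = 1;
        for (int dest = 1; dest < nprocs; dest++)
          MPI_Send(&one, 1, MPI_INT, dest, TAG, MPI_COMM_WORLD);
        break;
      }
    } else {
      // Worker hears from the head; terminates if a prime was found.
      int headFound;
      MPI_Recv(&headFound, 1, MPI_INT, 0, TAG,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      if (headFound) break;
      // Otherwise report our own result, then loop again.
      MPI_Send(&found, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD);
    }
  }
  MPI_Finalize();
  return 0;
}
```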
Note: This protocol for communicating results among the processors is correct because:
Every process terminates as soon as the head node informs all the processes that a prime has been found. Otherwise, every process continues to look for new primes.
If any non-head process finds a prime, it will report it to the head node, and the head node will inform all processes.
If the head node finds a prime, then it will inform all processes.
Patterns in Finding large primes with MPI.
This lab exhibits some of the same patterns as we saw in the first lab.
Data Decomposition is present in the MPI
version of big-primes.cpp
, where generated random
numbers constitute the data being split up among processors. We also
see Data Decomposition when determining k,
since we tested array lengths for five nodes in parallel. In that
case, we used parallelism more for convenience than for speedup.
Message Passing is present in the MPI version
of big-primes.cpp
, implemented in calls to
MPI_Send()
and MPI_Recv()
.
The program big-primes.cpp
also indicates some
patterns we haven't encountered before.
In the Master-Worker pattern, one thread of
computation (the master) assigns work to other threads (the
workers). In big-primes
, the head node
represents the master, and sending a message containing 0 to a non-head node
constitutes assigning work to that node (which represents a worker),
namely, to try another random prime. Often, as in this case, the
master also collects results from the workers.
The Scatter-Gather pattern is comparable to
Master-Worker, but specifically refers to a master
sending values to workers (scatter) then receiving results
from them all (gather). Scatter-Gather
would more specifically describe the protocol in
big-primes.cpp
than Master-Worker,
except that the master in the Scatter-Gather pattern
ordinarily sends different values to each worker, whereas
this master sends the same value to each worker.
Master-Worker might be described as a "mid-level" pattern, not as high (i.e., not as close to the application) as Data Decomposition, but not as low (close to the hardware) as Message Passing. As a variant of Master-Worker, the Scatter-Gather pattern would also be at a middle level.
Create a second version big-primes2.cpp
of your
program that uses collective communication to share whether a large
prime has been found. Here is a protocol:
After every process checks one value, a reduce operation is performed to deliver the number of primes found to the head process. Each process transmits 1 if it found a prime and 0 otherwise.
The head uses a broadcast to inform all processes whether a prime was found, so they can know whether to try again or terminate.
Notes:
Look up the arguments for MPI_Bcast using the online documentation.
MPI includes Collective Communication calls
MPI_Scatter()
and MPI_Gather()
to support
the Scatter-Gather pattern. However,
MPI_Bcast()
and MPI_Reduce()
are a better
fit for this problem, since a single value is sent to all workers, and
there is no need to distinguish between the individual responses
received from those workers. If the primes found by the workers
(non-head nodes) were
being sent back to the master (head node) instead of just a number 0
or 1, then MPI_Gather()
would be a better fit (even
though we might still use MPI_Bcast()
for "scattering").
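The collective version of the communication step might be sketched like this, replacing all of the point-to-point traffic of the first version; the function name is illustrative:

```cpp
#include <mpi.h>

// One iteration's communication step for big-primes2.cpp.
// 'found' is this process's result (1 if its bigrand was prime).
// Returns 1 if any process found a prime this round, so that
// every process can decide locally whether to terminate.
int primeFoundAnywhere(int found) {
  int total = 0;
  // Reduce: the head (rank 0) learns how many processes found a prime.
  MPI_Reduce(&found, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  // Broadcast: the head shares that count with all processes.
  MPI_Bcast(&total, 1, MPI_INT, 0, MPI_COMM_WORLD);
  return total > 0;
}
```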
Submit your work using
git
on the cluster head
node you used. Include the cluster name in your commit message, e.g.,
mist$ cd ~/PDC
mist$ git pull origin master
mist$ git add -A
mist$ git commit -m "Lab 2 submit (mist)"
mist$ git push origin master
Also, fill out this form to report on your work.
See lab1 Deliverables
section if you need to set up the cluster you used for
git
.
If you did your work on two different clusters, submit work from both of them using one of the strategies in lab1 Deliverables.
This lab is due by Thursday, January 8, 2015.