Introduction to parallel algorithms
CS 300 (PDC)
Data parallelism: computation in which the same program or operation is applied to different subsets of the data in parallel.
Task parallelism: computation in which different algorithms are run in parallel.
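To make the distinction concrete, here is a minimal C++ sketch (an illustration, not an example from this course; the function names and data are made up): the first phase applies the same operation to different slices of one array (data parallelism), and the second phase runs two unrelated functions at the same time (task parallelism).

    #include <cmath>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Data parallelism: every thread applies the same operation (square root)
    // to its own slice of the array.
    void sqrt_slice(std::vector<double>& a, std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            a[i] = std::sqrt(a[i]);
    }

    // Task parallelism: two different (hypothetical) computations run concurrently.
    void compute_statistics() { /* e.g., summarize one data set */ }
    void render_plot()        { /* e.g., draw a figure from another data set */ }

    int main() {
        std::vector<double> a(1000000, 2.0);

        // Data-parallel phase: two threads, two halves of the same array.
        std::thread t1(sqrt_slice, std::ref(a), std::size_t(0), a.size() / 2);
        std::thread t2(sqrt_slice, std::ref(a), a.size() / 2, a.size());
        t1.join();
        t2.join();

        // Task-parallel phase: two different functions in parallel.
        std::thread t3(compute_statistics);
        std::thread t4(render_plot);
        t3.join();
        t4.join();

        std::printf("a[0] = %f\n", a[0]);
        return 0;
    }

(This compiles with any C++11 or later compiler, e.g., g++ -std=c++11 -pthread.)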
A computational problem is called embarrassingly parallel if it requires little or no effort to divide that problem into subproblems that can be computed independently on separate computing resources.
For example, BLAST searches call for comparing a given genomic sequence against each member of a collection of known sequences, one-to-one. If ten processors have equal access to all the sequences, then they can collectively carry out ten such genomic comparisons in the same time that one such comparison would require on a single processor.
Here are some other examples of embarrassing parallelism:
Graphics computations in which each pixel's value can be computed independently of other pixels, such as computation of the Mandelbrot set (fractal geometry); a brief code sketch of this kind of pixel-level parallelism appears after this list of examples.
Computation of parameterized scientific models for multiple independent parameter sets. For example, a computational model of how a plant processes nitrogen over time could be computed for thousands or millions of different parameter choices. Each parameter set would indicate a different (theoretical) plant, and the computed results could be analyzed, for instance, to discover properties of nitrogen absorption among plants with certain properties. [Schade, Waldschmidt '08]
Likewise, the motion of planets in our solar system could be simulated by computing solutions to the differential equations that represent the n-body problem in physics, using high-accuracy computations and observed initial conditions. Here, one could approximate each planet's motion over short time increments, using the collected results as initial conditions for the next iteration. [Anderson '06]
Here, the computation over a new time increment is embarrassingly parallel, although the collection and distribution of each iteration's results requires a step of coordinated communication that is not entirely parallel.
SETI@home also represents embarrassing parallelism.
In a computation for St. Olaf's radar team, which gathers radar readings that indicate the ice structure of permafrost on Antarctica, those readings are analyzed using certain routines from the open-source Seismic UNIX software package. It turned out that this computation spent most of its time processing the rows of a certain matrix. Each row's computation is independent of the other rows, making that matrix processing embarrassingly parallel. [Frederick '09]
Running the computation on a faster computer sped up the computation by a factor of four (from four hours to one hour). Modifying Seismic UNIX to parallelize that matrix-processing portion of the algorithm on 34 processors led to an additional 15-fold speedup, for a 60-fold overall speedup (four hours to four minutes). Combined with the greater computational capacity of cluster computers, this made it possible to avoid splitting up the original data set and to automate the entire radar computation, leading to a streamlined analysis phase that could be completed in one week of summer research instead of six weeks, even though the data set was four times as large.
The map and reduce stages of map-reduce computations are each embarrassingly parallel.
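For instance, here is a minimal C++ sketch of the Mandelbrot example mentioned above (illustrative code, with made-up image dimensions): each thread fills its own band of rows, and no pixel's value depends on any other thread's work, so no communication is needed.

    #include <complex>
    #include <thread>
    #include <vector>

    const int WIDTH = 800, HEIGHT = 600, MAX_ITER = 255;

    // Escape-time count for one pixel; it depends only on that pixel's coordinates.
    int mandel(double re, double im) {
        std::complex<double> c(re, im), z(0.0, 0.0);
        int n = 0;
        while (std::abs(z) <= 2.0 && n < MAX_ITER) { z = z * z + c; ++n; }
        return n;
    }

    // Each thread fills one band of rows; no thread touches another thread's pixels.
    void fill_rows(std::vector<int>& img, int row_begin, int row_end) {
        for (int y = row_begin; y < row_end; ++y)
            for (int x = 0; x < WIDTH; ++x)
                img[y * WIDTH + x] =
                    mandel(-2.0 + 3.0 * x / WIDTH, -1.0 + 2.0 * y / HEIGHT);
    }

    int main() {
        std::vector<int> img(WIDTH * HEIGHT);
        const int nthreads = 4, band = HEIGHT / nthreads;
        std::vector<std::thread> workers;
        for (int t = 0; t < nthreads; ++t)
            workers.emplace_back(fill_rows, std::ref(img), t * band,
                                 (t == nthreads - 1) ? HEIGHT : (t + 1) * band);
        for (auto& w : workers) w.join();
        return 0;   // img now holds the iteration counts for every pixel
    }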
These examples show that embarrassing parallelism is quite useful, not something to be avoided.
The speedup of a parallel algorithm over a corresponding sequential algorithm is the ratio of the compute time for the sequential algorithm to the time for the parallel algorithm. If the speedup factor is n, then we say we have n-fold speedup.
For example, if a sequential algorithm requires 10 min of compute time and a corresponding parallel algorithm requires 2 min, we say that there is 5-fold speedup.
The observed speedup depends on many implementation factors. For example, using more processors often leads to greater speedup; on the other hand, if other programs are running on the processors at the same time as a program implementing a parallel algorithm, those other programs may reduce the speedup.
Even if a problem is embarrassingly parallel, one seldom actually obtains $n$-fold speedup when using $n$ processors.
There is often overhead involved in a computation. For example, in the solar system computation, results need to be copied across the network upon every iteration. This communication is essential to the algorithm, yet the time spent on it does not directly compute more solutions to the n-body problem.
In general, communication costs are frequent contributors to overhead. The processing time to schedule and dispatch processes also leads to overhead.
Synchronization delays occur when a process must wait for another process to deliver a needed resource or result. For example, after each computer in the solar system computation delivers the results of its iteration, it must wait to receive the updated values for the other planets before beginning its next iteration.
Some parts of a computation may be inherently sequential. In the polar ice example, only the matrix computation was parallelized; the other parts gained in performance only because they were performed on faster hardware and software (on a single computer).
Rarely, using n processors may lead to more than an n-fold speedup. For example, suppose a computation involves a large data set that does not fit into the main memory of a single computer, but does fit into the collective main memories of n computers, and suppose an embarrassingly parallel implementation requires each computer to access only its proportional portion of the data. Then the parallel computation involving n computers may run more than n times as fast, because disk accesses can be replaced by main-memory accesses.
Replacing main-memory accesses by cache accesses could have a similar effect. Also, parallel pruning in a backtracking algorithm could make it possible for one process to avoid an unnecessary computation because of the prior work of another process.
Amdahl's Law is a formula for estimating the maximum speedup from an algorithm that is part sequential and part parallel.
The search for 2k-digit primes illustrates this kind of problem: First, we create a list of all k-digit primes, using a sequential sieve strategy; then we check 2k-digit random numbers in parallel until we find a prime.
The Amdahl's Law formula is

    Speedup ≤ 1 / ((1 − Pσ) + Pσ/n)

where
Pσ is the proportion of the sequential program's execution time that can be parallelized. (Here, the Greek letter σ indicates the sequential program.)
n is the number of processors used to parallelize the algorithm.
For example, suppose that we use our strategy to search for primes using 4 processors, and that 90% of the running time is spent checking 2k-digit random numbers for primality (after an initial 10% of the running time computing a list of k-digit primes). Then Pσ = .90 and n = 4. According to Amdahl's Law,

    Speedup ≤ 1 / ((1 − .90) + .90/4) = 1 / (.10 + .225) = 1 / .325 ≈ 3.08

This estimates that we will obtain about 3-fold speedup by using 4-fold parallelism.
Note: Amdahl's Law computes the overall speedup, taking into account that the sequential portion of the algorithm has no speedup, but the parallel portion of the algorithm has speedup n.
It may seem surprising that we obtain only 3-fold overall speedup when 90% of the algorithm achieves 4-fold speedup. This is a lesson of Amdahl's Law: the non-parallelizable portion of the algorithm has a disproportionate effect on the overall speedup.
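To see how the predicted speedup behaves, the following small C++ sketch (illustrative only) evaluates the Amdahl formula for the prime-search example, Pσ = .90, at several processor counts.

    #include <cstdio>

    // Amdahl's Law: predicted maximum speedup when a fraction p of the
    // sequential running time is parallelized across n processors.
    double amdahl(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main() {
        const double p = 0.90;   // 90% of the running time is parallelizable
        for (int n : {1, 2, 4, 16, 64}) {
            std::printf("n = %2d   predicted speedup <= %.2f\n", n, amdahl(p, n));
        }
        return 0;
    }

With n = 4, this reproduces the roughly 3-fold estimate above; as n grows, the predictions approach 1/(1 − Pσ) = 10.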
A non-computational example may help explain this effect. Suppose that a team of four students is producing a report, together with an executive summary, where the main body of the report requires 8 hours to write and the executive summary requires one hour to write and must have a single author (representing a sequential task). If one person wrote the entire report, it would require 9 hours. But if the four students each write 1/4 of the body of the report (2 hours, in 4-fold parallelism), and then one student writes the summary, the elapsed time would be 3 hours, for a 3-fold overall speedup. The sequential portion of the task has a disproportionate effect because the other three students have nothing to do during that portion of the task.
A short computation shows why Amdahl's Law is true.
The total execution time Tσ required for a sequential program σ arises from two parts:
let Sσ be the proportion of Tσ due to non-parallelizable (sequential) computation; and
let Pσ be the proportion of Tσ due to parallelizable computation.
These two proportions can be thought of as percentages of Tσ; together they add up to 100%, i.e., Sσ + Pσ = 1.
If the parallelizable portion is divided among n processors, the best we can hope for is that it completes in time (Pσ/n)·Tσ, while the sequential portion still requires Sσ·Tσ. The speedup due to parallelism is therefore

    Speedup = Tσ / (parallel execution time) ≤ Tσ / (Sσ·Tσ + (Pσ/n)·Tσ) = 1 / (Sσ + Pσ/n)

which is the Amdahl's Law formula, since Sσ = 1 − Pσ.
We use an inequality ≤ instead of equality to account for the fact that the parallelized portion of the computation may require more than (Pσ/n)·Tσ, due to the overhead of starting up parallelism, etc.
Scalability is the ability of a program or system to retain its desirable properties as the size of a computation increases.
What does Amdahl's law predict as the number of processes increases?
As the number of processors n grows, the term Pσ/n (in the denominator of Amdahl's formula) becomes closer and closer to 0. This implies that the maximum possible overall speedup is 1/(1-Pσ). For example, if Pσ=75% of a sequential computation is parallelizable, Amdahl's Law indicates that the maximum possible speedup is 1/(1-.75) = 4, no matter how many processes there are.
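For instance, with Pσ = .75, the formula gives 1/(.25 + .075) ≈ 3.08 for n = 10 and 1/(.25 + .0075) ≈ 3.88 for n = 100; the predictions creep toward, but never reach, 4.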
However, Amdahl's Law assumes that a computation's scale remains the same when more processors are available. Gustafson observed that in practice, people tend to perform larger scale computations when larger scale parallel computing systems become available.
Return to the example with Pσ = 75% at a certain scale, and assume that the sequential implementation of the computation requires 80 min to run. Then the sequential portion of the computation requires 20 min, and the parallelizable portion requires 60 min. With n = 6 processors, we could hope to compute the parallelizable portion in 10 min, for 30 min in all (a speedup factor of 80/30 = 2.67). But if 60 processors became available, we could process up to 10 times as much data in the parallelizable section of code in that same 10 minutes -- this is changing the scale of the problem to match the scale of the computing system. Unless the running time for the sequential section changes, this means we can process 10 times as much data in that same 30 minutes when 10 times as many processors are available.
Changing the scale of the problem to match the scale of the parallel computing resource means changing the value of Pσ in Amdahl's Law, because we are expanding the parallelizable portion of the computation. Continuing the example, the tenfold increase in parallelizable computation means that 600 minutes' worth of parallelizable computation would be carried out in the expanded problem, for a total sequential running time of 620 minutes; then Pσ = 600/620 = 96.77% instead of 75%. Accomplishing all this computation in 30 minutes means a speedup of 620/30 = 20.67, which is the same estimate that Amdahl's Law produces using Pσ = 96.77% and n = 60. Note that this is far more than the maximum speedup of 4 that applies if the size of the problem cannot expand.
Gustafson and Barsis encapsulated this behavior in a formula for scaled speedup (speedup under the assumption that problem size grows with the number of processors). Let Sπ be the proportion (percentage) of a parallelized program's running time taken up by sequential computing, and Pπ the proportion of that running time taken up by parallel computing; note that Sπ + Pπ = 1. Then a single processor would need time Sπ + n·Pπ, relative to the parallel program's running time of 1, to carry out the same computation. This results in Gustafson-Barsis's Law:

    Scaled speedup = Sπ + n·Pπ
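As a check, apply this law to the scaled example above: within the 30-minute parallel run, Sπ = 20/30 and Pπ = 10/30, so with n = 60 the scaled speedup is 20/30 + 60·(10/30) ≈ 20.67, matching the 620/30 figure computed directly.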
A major goal of this course is to explore non-embarrassing parallelism. How can we construct efficient and correct parallel algorithms that solve problems that can't be parallelized in obvious ways?
Communication between processes is often a major source of complication and performance challenge for an algorithm. This is indicated in some of the examples above, e.g., the solar system computation and map-reduce's second stage.
Amdahl's law...
Some patterns of computation and communication that are not embarrassingly parallel [View from Berkeley '06]:
Linear algebra computations. For example, inverting or finding the determinant of a matrix in parallel involves operations on the entire collection of entries, likely including communication of partial results; a small sketch of such communication of partial results follows this list.
Dynamic programming involves computing a solution to a large problem by solving simpler overlapping subproblems. A parallelized dynamic programming approach to a problem will require communication between subproblems.
______
______
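As a small illustration of communication of partial results (a sketch under simplifying assumptions, not taken from any particular linear-algebra package), the following C++/MPI program has each process compute a partial dot product independently and then combine the pieces with MPI_Reduce; that combining step is the communication that keeps such computations from being embarrassingly parallel.

    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        // Each process owns one block of two (made-up) vectors and computes
        // a partial dot product with no communication (the parallel part).
        const int local_n = 1000;
        std::vector<double> x(local_n, 1.0), y(local_n, 2.0);
        double partial = 0.0;
        for (int i = 0; i < local_n; ++i) partial += x[i] * y[i];

        // Combining the partial results requires communication among processes.
        double total = 0.0;
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            std::printf("dot product over %d processes = %f\n", nprocs, total);
        MPI_Finalize();
        return 0;
    }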
A few terms before discussing things that can go wrong:
Computing resources are hardware or software entities that can be allocated to a particular process. Examples: files, communication messages, shared data structures, etc.
A blocked process is one that is unable to continue execution until a certain event occurs. For example, when a process calls MPI::Recv(), it must block until a message arrives.
Starvation: A process is perpetually denied resources it needs.
Deadlock: A situation in which there exists a set of processes, each of which is blocked awaiting a resource that is already allocated to another process in that set.
The following four conditions must all hold for a deadlock to exist:
Mutual exclusion
Hold and wait
No preemption
Circular wait
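For example, here is a minimal C++/MPI sketch of a deadlock (illustrative only, and it assumes exactly two processes): both processes post a blocking receive before sending, so each waits forever for a message the other has not yet sent.

    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int send_val = rank, recv_val = -1;
        int other = 1 - rank;   // assumes exactly 2 processes

        // Both processes block in MPI_Recv first: each holds back the message
        // the other needs while waiting for the other's message, giving
        // hold-and-wait and circular wait: a deadlock.
        MPI_Recv(&recv_val, 1, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&send_val, 1, MPI_INT, other, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }

Reversing the send/receive order on one of the two processes breaks the circular wait and removes the deadlock.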
Race condition: A situation in which the correct behavior of an algorithm depends on timing, e.g., on the order in which concurrent processes happen to access a shared resource.
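Here is a minimal C++ sketch of a race condition (illustrative only): two threads increment a shared counter without synchronization, so the final value depends on how their reads and writes happen to interleave.

    #include <cstdio>
    #include <thread>

    long counter = 0;   // shared and unprotected

    void bump() {
        for (int i = 0; i < 1000000; ++i)
            ++counter;  // read-modify-write; interleaved updates can be lost
    }

    int main() {
        std::thread t1(bump), t2(bump);
        t1.join();
        t2.join();
        // A correct run would print 2000000; an unsynchronized run often prints less.
        std::printf("counter = %ld\n", counter);
        return 0;
    }

Guarding the increment with a mutex (or making the counter a std::atomic<long>) eliminates the race.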