Introduction
First, the paper itself: The Landscape of Parallel Computing Research: The View From Berkeley
Conclusions of interest to software engineers:
- The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems.
- To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications.
- To be successful, programming models should be independent of the number of processors.
- Traditional operating systems will be deconstructed and operating system functionality will be orchestrated using libraries and virtual machines.
The thesis of the paper is that real world applications and hardware are naturally parallel. Therefore we need a programming model, system software and supporting architectures that are naturally parallel. This will require a fundamental re-design.
The paper lists some items of conventional wisdom that have been, or will be, superceded soon. Here are a few of interest to software engineers:
-
Old: Multiply is slow, but load and store are fast.
New: Load and store are slow, but multiply is faster (by as much as two orders of magnitude). -
Old: Don't bother parallelizing your application, as you can just wait a little while and run it on a much faster sequential computer.
New: It will be a very long wait for a faster sequential computer (perhaps five years).
The rest of the paper is organized around four themes: applications, programming models, hardware and evaluation.
Applications
The paper lists 13 "dwarfs", which are patterns of computation (within a processor) and communication (between processors) that characterize certain classes of applications. The first seven are numerical methods from scientific computing:
- Dense Linear Algebra
- Sparse Linear Algebra
- Spectral Methods
- N-body Methods
- Structured Grids
- Unstructured Grids
- Monte Carlo (generalized to "MapReduce" in paper and considered "embarrassingly parallel")
In addition to those seven, the papers authors add these six:
- Combinational Logic (CRC codes, Hashing, Encryption)
- Graph Traversal (Quicksort, Decision Trees)
- Dynamic Programming
- Backtrack and Branch-and-Bound
- Construct Graphical Models (Bayesian Networks, Hidden Markov Models)
- Finite State Machine (may be "embarrassingly sequential")
There is an interesting note at the end of the section on dwarves:
In the era of multicore and manycore. Popular algorithms from the sequential computing era may fade in popularity. For example, if Huffman decoding proves to be embarrassingly sequential, perhaps we should use a different compression algorithm that is amenable to parallelism.
The paper goes on to describe two ways to compose the dwarfs. The first is via "temporal distribution" which involves using a common set of processors for the composed dwarfs with some sort of time sharing. The other is via "spatial distribution" in which the different dwarfs run on separate (sets of) processors. The paper makes it clear that optimizing the composition of dwarfs can be incredibly difficult in pratice due to issues with loose coupling (to promote code reuse), data structure translation (the need for one dwarf to use the same data with a different structure) and the ability to predict the format of the data at the start of the computation.
The authors cite an Intel Study that lists three categories of computing that are expected to increase in demand:
- Recognition: machine learning approaches in which data is examined and models are constructed.
- Mining: searches the web to find instances of the models.
- Synthesis: creation of new models as in graphics.
The paper contains a very interesting graphic (on page 16) that maps some common tasks from various disciplines to a set of basic mathematic primitives (which are, presumably, what the hardware manufacturers are targetting).
Hardware
The bulk of the hardware discussion is about MPU based systems.
Section 4.1.3 has an interesting discussion about heterogeneous collections of processors in the same system. The idea is that the sequential parts of a program can be the bottleneck on a parallel architecture. An interesting solution for that problem is to have a few more powerful processors in the system that are dedicated to sequential jobs. They refer to Amdahl's Law to illustrate the point. The question of how to identify and schedule these sequential elements is left open, but they admit that the overhead of managing a heterogenous collection of processors could be prohibitive.
Section 4.3 discusses the interconnections between processors in a manycore system. The make a distinction between point-to-point communication for two processors and global communication that involves all the processors. They claim that the latter is composed of smaller messages that are latency-bound. The former are somewhat larger and are bandwidth-bound as a result. They posit using a different set of interconnects for each of these use cases.
The bandwidth-bound communication can take place using internal packet switch-like networks on the chip. Then the topology of the interconnect can be modified on demand to meet the needs of a particular application. For latency-bound communication they recommend something like what BlueGene/L does: a tree network for collective communcation (BlueGene/L also uses a torus interconnect for point-to-point messages).
Section 4 discusses communication primitives for MPU systems, which they are now calling "chip-scale multiprocessors" or CMPs (no wikipedia link on that name, yet). They claim that CMPs have higher inter-core bandwidth and lower latency. They also claim that they can offer new lightweight coherency primitives that operate between cores. They use these claims to motivate the notion that programming a CMP is fundamentally different than programming an SMP.
The first stop is to note that cache coherency schemes will need an overhaul to handle the number of processors in a CMP. They don't give any clear examples of what they mean by this, except to note that on-chip caches with "private or shared configurations" could work. They go on to discuss software transactional memory as an approach to synchronization between processors. It isn't clear, at this point, what the memory model for CMP systems looks like. I.e., it isn't obvious how much of the memory is shared (as it is in a traditional SMP system) and how much of it is private to each processor. The note on message passing in section 4.4.5 seems to indicate that the memory is shared, and that it is more efficient in CMP than in SMP. This seems to be because the CMP systems are implemented on a single chip, whereas SMP systems use multiple chips plugged into the same memory subsystem.
Programming
The section starts with a well-written piece on why the tradeoff between implementation efficiency and programmer productivity is so difficult. If the level of abstraction level is too high, it becomes impossible to optimize the performance. If the abstraction level is too low all the time is spent dealing with the underlying details. In both cases, productivity suffers, and IT productivity is already low enough thanks to reddit and digg.
The paper separates programming languages into three groups: hardware-centric (IXP-C), application-centric (MatLab) and formalism-centric (Actor-based). They describe these as related to efficiency, productivity and correctness (in that order). They decry the lack of effor in devising languages that have proven effects on the psychology of the people writing programs in them. They give, as an example, the notion that software transactional memory is easier to reason about than other traditional concurrency primitives. They promote the use of actual user studies to validate the efficacy (or lack thereof) of languges. This section is far too light on details, and misses major issues with leaky abstractions (i.e., even with "simple" models there are cases where a real knowlege of the underlying complexity is crucial). It is difficult to shield programmers from the reality of the hardware they are working with, although god knows Java is doing its level best here.
There is a brief section on making the programming model independent of the number of processors. They describe MPI as the dominant paradigm in parallel programming for scientific computing, and how it requires the programmer to explicity map tasks to processors. The paper doesn't have a lot to offer here, other than the fact that this is still a very open research area, and it would be nice if it wasn't.
As an aside, it is interesting that they don't mention languages like Erlang, as these are the hot new thing in distributed/parallel computation (at least, for "alpha geeks", as Tim O'Reilly loves to call them). They do briefly mention the Actor model, which has roots back into the seventies. Erlang is a member of this family, which they refer to as a "formalism-centric" approach, even though it is also very application-centric (at least, it was for Ericsson).
The next few sections deal with compilers and operating systems. They discuss "auto-tuners" which can automatically tune a kernel for use on a system. It does this by mutating the optimization parameters and testing until it finds an ideal (or near ideal) setting for the target hardware. This seems very promising, although it isn't clear how widely applicable it is. They go on to discuss the need for a more flexible OS that allows applications to only use the capabilities they really need. They point to virtual machines as an indication that the time has come for this sort of optimization.
This is the section I was most looking forward to, and I found it disappointing. It felt like there was more hand-waving than real science here. I agree, in principal, that programming language development should be the subject of user testing, but it isn't clear (at all) how one would go about such a monumental project, or that the result would be any better than our current languages. In the end, the humans that program computers are significantly more flexible than the computers themselves.
The real wins, in that area, have been tools like compilers and auto-tuners, as they point out. These are the things that allow us to work at a much higher level of abstraction. Perhaps it is the case that the growing power of computers will continue that trend, so that close optimization won't be as important as productivity. The success of languages like Ruby seems to indicate that this is true, as it's performance is rather awful. In that case, however, we're making an explicit trade off between programmer productivity and implementation efficiency, with little regard for the latter. The paper seems to want to take a more nuanced position, but doesn't back it up with enough examples. The example of software transactional memory (which is one of the few they use) is good, but limited. It isn't clear that you can wholly replace existing models of concurrency with an STM model (or that you would want to if you could).