
Dataflow programming

From Wikipedia, the free encyclopedia

In computer programming, dataflow languages are programming languages that model a program, conceptually if not physically, as a directed graph of the data flowing between operations, with nodes representing the operations performed on that data. In most programming languages the operations themselves are the "first-class citizens", and the program as a whole is conceptually seen as a series of operations on data that remains essentially invisible. This distinction may seem minor, but the paradigm shift is fairly dramatic, and it allows dataflow programs to be spread across multiprocessor systems essentially for free. Dataflow languages share some features with functional languages, and were generally developed in order to bring functional concepts to a language more suitable for numeric processing.

One of the key concepts in computer programming is the idea of "state", essentially a snapshot of the various conditions in the system at a given moment. A light switch that is on or off is a simple example of state, and the positions of all of the switches in your house at the current moment form a state as well. Most programming languages require a considerable amount of state information in order to operate properly, information which is generally hidden from the programmer.

To illustrate the concept, consider this simple command, a = a + b;, which adds two numbers and stores the result back in a. In this case there is no additional state: all of the data needed to run the operation is explicitly defined in the code. But now consider an equally valid operation, doSomething, which has no inputs defined. This does not imply that doSomething uses no data, only that the data it needs is implicit in the current state of the program. The data, in this case, is essentially invisible.
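The contrast can be made concrete with a small, hypothetical C fragment (the names here are illustrative only, not taken from any particular program): the first routine names every piece of data it touches in its parameter list, while the second quietly depends on a global variable, that is, on hidden state.

 /* All of the data this operation needs is explicit in its parameters. */
 int add(int a, int b) {
     return a + b;
 }

 /* Hypothetical doSomething: the data it uses is implicit in hidden state. */
 int total;                    /* global state, invisible at the call site */
 void doSomething(void) {
     total = total + 1;        /* reads and writes state the caller never names */
 }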

In fact, the state is often hidden from the computer as well, which has no way of knowing that one piece of information encodes long-lived state while another is temporary and will soon be discarded. This is a serious problem on parallel processing machines, where the state information needs to be shared across multiple processors; without knowing which state is important and which is not, most languages force the programmer to add a considerable amount of extra code indicating which data and which parts of the code matter in this respect.

This code tends to be expensive in terms of performance, difficult to debug, and often downright ugly; most programmers simply ignore the problem. Those who cannot must pay a heavy performance cost, a price paid even in the most common case, when the program runs on a single processor. Explicit parallelism is one of the primary reasons for the poor performance of the Enterprise JavaBeans system, for instance.

Dataflow languages promote the data to be the main concept behind any program. It may seem odd that this is not always the case, as programs generally take in data, process it, and feed it back out -- a model that was especially true of older programs and is well represented in the Unix operating system. Programs in a dataflow language start with an input, perhaps the command-line parameters, and describe how that data is used and modified. The data is now explicit, often shown physically on the screen as a line or pipe indicating where the information flows.

Operations consist of "black boxes" with inputs and outputs, all of which are always explicitly defined. They run as soon as all of their inputs become valid, as opposed to when the program encounters them. Whereas a traditional program essentially consists of a series of statements saying "do this, now do this", a dataflow program is more like a series of workers on an assembly line, who will do their assigned task as soon as the materials arrive. And this is why dataflow languages are inherently parallel; the operations have no hidden state to keep track of, and the operations are all "ready" at the same time.
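A minimal sketch of this firing rule in C (the node and its wiring are invented for illustration, not taken from any particular dataflow system): the node stores each incoming value and fires as soon as both of its inputs are valid, regardless of the order in which they arrive.

 #include <stdbool.h>
 #include <stdio.h>

 /* A two-input "black box": it stores incoming values and fires
    as soon as both inputs are valid, not when program text reaches it. */
 struct add_node {
     int  a, b;
     bool a_valid, b_valid;
 };

 static void try_fire(struct add_node *n) {
     if (n->a_valid && n->b_valid) {
         printf("node fired: %d\n", n->a + n->b);
         n->a_valid = n->b_valid = false;   /* consume the inputs */
     }
 }

 static void send_a(struct add_node *n, int v) { n->a = v; n->a_valid = true; try_fire(n); }
 static void send_b(struct add_node *n, int v) { n->b = v; n->b_valid = true; try_fire(n); }

 int main(void) {
     struct add_node n = {0};
     send_b(&n, 2);     /* arrival order does not matter */
     send_a(&n, 40);    /* second input arrives: the node fires */
     return 0;
 }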

Dataflow programs are generally represented very differently inside the computer as well. A traditional program is just what it seems: a series of instructions that run one after the other. A dataflow program is instead constructed as a big hash table, with uniquely identified inputs as the keys and pointers to the code as the data. When any operation completes, the program scans down the list of operations until it finds the first one whose inputs are all currently valid, and runs it. When that operation finishes it will typically write data into one or more outputs, thereby making some other operation become valid.
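A much simplified sketch of such a runtime loop in C, with a plain array standing in for the hash table (a toy illustration, not the scheme of any specific implementation): each entry records how many of its inputs are still missing, and the scheduler repeatedly runs whichever entry currently has all of its inputs valid.

 #include <stdio.h>

 #define NODES 3

 /* Each entry pairs a count of still-missing inputs with the code to run. */
 struct entry {
     int  missing;            /* inputs not yet valid */
     void (*run)(void);       /* pointer to the operation's code */
     int  done;
 };

 static void op_a(void) { puts("A ran, producing an input for C"); }
 static void op_b(void) { puts("B ran, producing an input for C"); }
 static void op_c(void) { puts("C ran"); }

 int main(void) {
     struct entry table[NODES] = {
         { 0, op_a, 0 },      /* A and B need no inputs, so they are ready at once */
         { 0, op_b, 0 },
         { 2, op_c, 0 },      /* C waits for outputs from both A and B */
     };

     int remaining = NODES;
     while (remaining > 0) {
         for (int i = 0; i < NODES; i++) {
             if (!table[i].done && table[i].missing == 0) {
                 table[i].run();
                 table[i].done = 1;
                 remaining--;
                 if (i < 2)               /* toy wiring: A and B each feed C */
                     table[2].missing--;
             }
         }
     }
     return 0;
 }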

For parallel operation only the list needs to be shared; the list itself is the state of the entire program. Thus the task of maintaining state is removed from the programmer and given to the language's runtime instead. All dataflow programs will run perfectly on multiprocessor machines without any changes. Better yet, the overhead of parallel execution, the need to explicitly share the state between processors, can be removed completely on machines with a single processor simply by using a different runtime.

As might be expected, dataflow languages were originally developed in order to make parallel programming easier. Thus the languages were typically developed at the large supercomputer laboratories.

One of the most popular was SISAL, developed at Lawrence Livermore National Laboratory. SISAL looks like most statement-driven languages, but demands that every variable be assigned only once. This allows the compiler to easily identify the inputs and outputs. A number of offshoots of SISAL have been developed, including SAC, Single Assignment C, which tries to remain as close as possible to the popular C programming language.
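The single-assignment restriction can be illustrated in ordinary C (a hand-written example for illustration; this is not SISAL or SAC syntax): once each name is bound exactly once, the producer of every value, and therefore the dataflow graph, can be read directly from the code.

 /* Conventional C: 'x' is reassigned, so its value depends on execution order. */
 int conventional(int a, int b) {
     int x = a + b;
     x = x * 2;
     x = x - a;
     return x;
 }

 /* Single-assignment style: each name is defined exactly once, so the
    compiler can identify every value's inputs and outputs directly. */
 int single_assignment(int a, int b) {
     int x1 = a + b;
     int x2 = x1 * 2;
     int x3 = x2 - a;
     return x3;
 }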