Jump to content

One-pass compiler

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Murray Langton (talk | contribs) at 11:38, 28 January 2025 (Difficulties: emphasise declaration before use). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In computer programming, a one-pass compiler is a compiler that processes each compilation unit only once, sequentially translating each source statement or declaration into something close to its final machine code. This is in contrast to a multi-pass compiler which converts the program into one or more intermediate representations in steps between source code and machine code, and which reprocesses the entire compilation unit in each sequential pass.

A one-pass compiler initially has to leave addresses for forward jumps unresolved. These can be handled in various ways (e.g. tables of forward jumps and targets) without needing to have another complete pass. Some nominally one-pass compilers effectively 'pass the buck' by generating assembly language and letting the assembler sort out the forward references, but this requires one or more passes in the assembler.

One-pass compilers are smaller and faster than multi-pass compilers.[1]

One-pass compilers are unable to generate as efficient programs as multi-pass compilers due to the limited scope of available information. Many effective compiler optimizations require multiple passes over a basic block, loop (especially nested loops), subroutine, or entire module. Some require passes over an entire program.

Difficulties

A core requirement for any programming language intended for one-pass compilation is that all user-defined identifiers, except possibly labels, must be declared before use:

  • Pascal contains a forward specification to allow routines to be mutually recursive.
  • PL/I allows data declarations to be placed anywhere within a program, specifically, after some references to the not-yet-declared items, so one pass is needed to resolve data declarations, followed by one or more passes to generate code.

Any language which allows preprocessing such as PL/1, C, or C++, must have at least two passes, one pass for preprocessing and one or more passes for translation.

Some computer hardware (e.g. x86) may have short and long branch instructions: short if the destination is within about 127 bytes, and long otherwise. A one-pass compiler must assume that all jumps are long, whereas a multi-pass compiler can check on jump distance and generate possibly shorter code.

Languages which rely on context for statement identification rather than using reserved or stropped keywords may also cause problems. The following examples are from Fortran:

 IF (B) l1,l2      ! two-way branch, where B is a boolean/logical expression  
 IF (N) l1,l2,l3   ! three-way branch, where N is a numeric expression
 IF (B) THEN       ! start conditional block
 IF (B) THEN = 3.1 ! conditional assignment to variable THEN
 IF (B) X = 10     ! single conditional statement
 IF (B) GOTO l4    ! conditional jump
 IF (N) = 2        ! assignment to a subscripted variable named IF
 DO 12 I = 1,15    ! start of a count-controlled loop
 DO 12 I = 1.15    ! assignment of the value 1.15 to a variable called DO12I

The entire statement must be scanned to determine what sort of statement it is; only then can translation take place. Hence any potentially ambiguous statements must be processed at least twice.

Declaration before usage

When generating code for the various expressions, the compiler needs to know the nature of the operands. For example, a statement such as A:=B; could produce rather different code depending on whether A and B are integers or floating-point variables (and what size: single, double or quadruple precision) or complex numbers, arrays, strings, programmer-defined types, etc. In this case, a simple approach would be to transfer a suitable number of words of storage, but, for strings this could be unsuitable as the recipient may be smaller than the supplier and in any case, only a part of the string may be used - perhaps it has space for a thousand characters, but currently contains ten. Then there are more complex constructions, as offered by COBOL and pl/i, such as A:=B by name; In this case, A and B are aggregates (or structures) with A having for example parts A.x, A.y and A.other while B has parts B.y, B.c and B.x, and in that order. The "by name" feature means the equivalent of A.y:=B.y; A.x:=B.x; But because B.c has no counterpart in A, and A.other has no counterpart in B, they are not involved.

All of this can be handled by the requirement that items are declared before they are used. Some languages do not require explicit declarations, generating an implicit declaration on first encountering a new name. Should a fortran compiler encounter a previously-unknown name whose first letter is one of I,J,...,N, then the variable will be an integer, otherwise a floating-point variable. Thus a name DO12I would be a floating-point variable. This is a convenience, but after a few experiences with mistyped names, most programmers agree that the compiler option "implicit none" should be used.

Other systems use the nature of the first encounter to decide the type, such as a string, or an array, and so forth. Interpreted languages can be particularly flexible, with the decision being made at run time, somewhat as follows

if condition then pi:="3.14" else pi:=3.14 fi;
print pi;

Should there be a compiler for such a language, it would have to create a complex entity to represent the variable pi, containing an indication as to just what its current type is and associated storage to represent such a type. This is certainly flexible, but may not be helpful for intensive computation as in solving A.x = b where A is a matrix of order a hundred, and suddenly, any one of its elements may be of a different type.

Procedures and functions

Declaration before use is likewise an easy requirement to meet for procedures and functions, and this applies also to the nesting of procedures within procedures. As with ALGOL, Pascal, PL/I and many others, MATLAB and (since 1995) Fortran allow a function (or procedure) to contain the definition of another function (or procedure), visible only within the containing function, but these systems require that they be defined after the end of the containing procedure.

But when recursion is allowed, a problem arises. Two procedures, each invoking the other, cannot both be declared before usage. One must be first in the source file. This need not matter if, as on the encounter with an unknown variable, sufficient can be deduced from the encounter that the compiler could generate suitable code for the invocation of the unknown procedure, with of course the "fixup" apparatus in place to come back and fill in the correct address for the destination when the procedure's definition is encountered. This would be the case for a procedure with no parameters, for example. The returned result from a function invocation may be of a type discernable from the invocation, but this may not always be correct: a function could return a floating-point result but have its value assigned to an integer.

Pascal solves this problem by requiring "predeclaration." One of the procedure or function declarations must be given first, but, instead of the body of the procedure or function, the keyword forward is given. Then the other procedure or function can be declared and its body defined. At some point the "forward" procedure or function is redeclared, along with the body of the function.

For the invocation of a procedure (or function) with parameters, their type will be known (they being declared before use) but their usage in the procedure invocation may not be. Fortran for example passes all parameters by reference (i.e. by address) so there is no immediate difficulty with generating the code (as always, with actual addresses to be fixed up later), but Pascal and other languages allow parameters to be passed by different methods at the programmer's choice (by reference, or by value, or even perhaps by "name") and this is signified only in the definition of the procedure, which is unknown before the definition has been encountered. Specifically for Pascal, in the specification of parameters a prefix "Var" signifies that it must be received by reference, its absence signifies by value. In the first case the compiler must generate code that passes the address of the parameter, while in the second it must generate different code that passes a copy of the value, usually via a stack. As always, a "fixup" mechanism could be invoked to deal with this, but it would be very messy. Multi-pass compilers can of course collate all the required information as they shuttle back and forth, but single-pass compilers cannot. Code generation could be paused while the scan advances (and its results be held in internal storage) until such time as the needed entity is encountered, and this might not be regarded as resulting in a second pass through the source because the code generation stage will soon catch up, it was merely halting for a while. But this would be complex. Instead a special construction is introduced, whereby the procedure's definition of parameter usage is declared "forward" of its later full definition so that the compiler may know it before use, as it requires.

From First Fortran (1957) onwards, separate compilation of portions of a program has been possible, supporting the creation of libraries of procedures and functions. A procedure in the source file being compiled that invokes a function from such an outside collection must know the type of result returned by the unknown function, if only to generate code that looks in the right place to find the result. Originally, when there were only integers and floating-point variables, the choice could be left to the rules for implicit declaration, but with the proliferation of sizes and also types the invoking procedure will need a type declaration for the function. This is not special, having the same form as for a variable declared inside the procedure.

The requirement to be met is that at the current point in a single-pass compilation, information on an entity is needed so that the correct code for it can be produced now, if with address fixups later. Whether the required information will be encountered later on in the source file or is to be found in some separately-compiled code file, the information is provided by some protocol here.

Whether or not all invocations of a procedure (or function) are checked for compatibility with each other and their definitions is a separate matter. In languages descended from Algol-like inspiration, this checking is usually rigorous, but other systems can be indifferent. Leaving aside systems that allow a procedure to have optional parameters, mistakes in the number and type of parameters will normally cause a program to crash. Systems that allow separate compilation of parts of a complete programme that later are "linked" together should also check for the correct type and number of parameters and results as mistakes are even easier to make, but often do not. Some languages (such as Algol) have a formal notion of "upgrading" or "widening" or "promotion", whereby a procedure that expects say a double-precision parameter may be invoked with it as a single precision variable, and in this case the compiler generates code that stores the single precision variable into a temporary double-precision variable which becomes the actual parameter. This however changes the parameter passing mechanism to copy-in, copy-out which may lead to subtle differences in behaviour. Far less subtle are the consequences when a procedure receives the address of a single precision variable when it expects a double precision parameter, or other size variations. When within the procedure the parameter's value is read, more storage will be read than that of its given parameter and the resulting value is unlikely to be an improvement. Far worse is when the procedure changes the value of its parameter: something is sure to be damaged. Much patience can be expended in finding and correcting these oversights.

Pascal example

An example of such a construct is the forward declaration in Pascal. Pascal requires that procedures be declared or fully defined before use. This helps a one-pass compiler with its type checking: calling a procedure that hasn't been declared anywhere is a clear error. Forward declarations help mutually recursive procedures call each other directly, despite the declare-before-use rule:

function odd(n : integer) : boolean;
begin
    if n = 0 then
        odd := false
    else if n < 0 then
        odd := even(n + 1) { Compiler error: 'even' is not defined }
    else
        odd := even(n - 1)
end;

function even(n : integer) : boolean;
begin
    if n = 0 then
        even := true
    else if n < 0 then
        even := odd(n + 1)
    else
        even := odd(n - 1)
end;

By adding a forward declaration for the function even before the function odd, the one-pass compiler is told that there will be a definition of even later on in the program.

function even(n : integer) : boolean; forward;

function odd(n : integer) : boolean;
  { Et cetera }

When the actual declaration of the body of the function is made, either the parameters are omitted or must be absolutely identical to the original forward declaration, or an error will be flagged.

Pre-processor recursion

When declaring complex data aggregates, a possible usage of functions Odd and Even could arise. Perhaps if a data aggregate X has a storage size that is an odd number of bytes, a single byte item might be added to it under the control of a test upon Odd(ByteSize(X)) so as to make an even number. Given the equivalent declarations of Odd and Even as above, a "forward" declaration probably wouldn't be needed because the usage of the parameters is known to the pre-processor which is unlikely to present opportunities to choose between by reference and by value. However, there could be no invocations of these functions in the source code (outside their definitions) until after their actual definition, because the result of the invocation is required to be known. Unless of course the pre-processor engaged in multiple passes of its source file.

Forward declarations considered harmful

Anyone who has attempted to maintain coherence amongst the declarations and usages of procedures in a large program and its usage of libraries of routines, especially one undergoing changes, will have struggled over the usage of forward or similar added declarations for procedures invoked but not defined in the current compilation. Maintaining synchrony between widely-separated locations especially across different source files requires diligence. Those declarations using the reserved word are easy to find, but if the helpful declarations are not distinguished from ordinary declarations, the task becomes troublesome. The gain of supposedly swifter compilation may seem insufficient when simply abandoning the goal of one-pass compilation would remove this imposition.

See also

References

  1. ^ "Single pass, Two pass, and Multi pass Compilers". GeeksforGeeks. 2019-03-13. Retrieved 2023-05-15.