Jump to content

User:Code-Analysis/sandbox

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Code-Analysis (talk | contribs) at 10:19, 23 October 2012 (Core Grammar). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

The grammar of any programming language can be considered either in wide terms that include exact specification of everything what is allowed and what is not allowed in the language or in narrow terms that describe only the formal grammar that is sutable for automatic creation of LR parsers. This article focuses ont the formal grammar. Details of the C++ grammar are described in the main article on C++.

Formal grammar describes 'context free gramamar' of the language. It lacks various restrictions like requirement for all variables to be defined; formal grammar cannot distinguish between the name of the variable and the name of the types. All identifiers for LR parser are simply identifiers. Information about identifiers is stored in the name tables. Name tables are not part of the formal grammar. Neverthless sometimes LR parser has to make decision on the nature of the identifier. This decision shows up as resolution of the grammar conflict.



C++ 2003 Grammar

The formal grammar of the language is presented in the Annex A of the standard. It consists of 3 major parts.

Lexical conventions

This part of the grammar describes what is an identifier, number, string, etc. Some of the rules are vague and contain human language like `each non-white-space character that cannot be ...` or `any member of the source character set except ...`. Other rules contain lengthy enumerations that mention all letters of the English alphabet or names of all possible operations. This is why the table below contains 2 separate lines for the number of rules. The first line counts lengthy enumeration as one rule. The second line counts all rules of the section.

Non terminals 42
Grammar rules (significantly different) 94
Grammar rules all 276

Preprocessing Directives

This part of the grammar describes the C++ preprocessor. It is important to note that structure of the conditional compilation expressions are not defined in this section. This section refers to the 'constant-expression' from the core grammar. The requirement for the 'constant-expression' to be numeric is not expressed in the formal grammar in any form. One grammar rule contains human language `the left-parenthesis character without preceding white space`.

Non terminals 14
Grammar rules 28

Core Grammar

This part of the grammar describes the C++ language itself. It defines features like classes, functions, statements, expressions, etc. If one carefuly types this section of the grammar into a file and tries to compile it, this results in several syntax errors because the text of the standard contains typos. Although after fixing those typos the grammar compilers and works as expected. In particular this section of the grammar contans:

Non terminals 142
Grammar rules 576
Parsing states (LR) 14235
Grammar conflicts 9337

It makes sense to mention that the grammar as it is presented in the standard is organized in the form that is good for presenting it in the document. This form of the grammar is not perfect for using it in the parser. In particular it contains several similar non terminals that are scattered across the grammar. By condensing these non terminals and applying other conversions that do not change the input language it is possible to reduce the number of grammar rules to about 500 and the number of grammar conflicts to about 900.

Links to C++ grammars that can be used in compiler generation tools.

  • [1] – The C++ 2003 standard grammar listing
  • [2] – The C++ 2003 grammar with optimized structure of the grammar conflicts.