User:Code-Analysis/sandbox

The grammar of a programming language can be considered in broad terms, covering the exact specification of everything that is and is not allowed in the language, or in narrow terms, covering only the formal grammar that is suitable for the automatic generation of LR parsers. This article focuses on the formal grammar. A general description of the C++ language can be found in the main article.

The formal grammar describes the context-free grammar of the language. It omits restrictions such as the requirement that every variable be defined, and it cannot distinguish the name of a variable from the name of a type. To an LR parser, all identifiers are simply identifiers. Information about identifiers is stored in name tables, which are not part of the formal grammar. Nevertheless, the LR parser sometimes has to decide on the nature of an identifier. This decision shows up as the resolution of a grammar conflict.
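A small illustration of why the nature of an identifier matters is sketched below. The token sequence `T * p` is the same in both namespaces, but it is a multiplication when `T` names a variable and the start of a pointer declaration when `T` names a type. The names `value_reading`, `type_reading`, `T`, `p` and `run` are hypothetical and chosen only for this example.

```cpp
// Sketch only: all names here are hypothetical, not taken from the standard.

namespace value_reading {
    int T = 6;
    int p = 7;

    int run() {
        return T * p;   // T names a variable: "T * p" is a multiplication
    }
}

namespace type_reading {
    typedef int T;

    int run() {
        int v = 42;
        T * p = &v;     // T names a type: "T * p" declares a pointer p
        return *p;
    }
}
```

A parser that sees only the tokens `identifier * identifier` cannot choose between the two readings without consulting a name table.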

C++ 2003 Grammar

The formal grammar of the language is presented in Annex A of the standard. It consists of three major parts.

Lexical conventions

This part of the grammar describes what an identifier, a number, a string, etc. are. Some of the rules are vague and contain human language such as `each non-white-space character that cannot be ...` or `any member of the source character set except ...`. Other rules contain lengthy enumerations that list all letters of the English alphabet or the names of all possible operators. This is why the table below contains two separate lines for the number of rules. The first line counts each lengthy enumeration as one rule. The second line counts all rules of the section.

Nonterminals	42
Grammar rules (significantly different)	94
Grammar rules all	276
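To show how an enumeration rule collapses into a single check, here is a minimal sketch of an identifier recognizer. It assumes the usual shape of the identifier rule (a nondigit followed by any number of nondigits and digits), ignores universal-character-names, assumes an ASCII-compatible character set, and does not exclude keywords. The function names are hypothetical.

```cpp
#include <string>

// nondigit: underscore or a letter of the English alphabet.
// Assumption: ASCII-compatible encoding, so letter ranges are contiguous.
bool is_nondigit(char c) {
    return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// digit: one of 0 1 2 3 4 5 6 7 8 9
bool is_digit(char c) {
    return c >= '0' && c <= '9';
}

// identifier: a nondigit followed by zero or more nondigits or digits.
// Only the lexical shape is checked; keywords are not rejected.
bool is_identifier(const std::string& s) {
    if (s.empty() || !is_nondigit(s[0])) {
        return false;
    }
    for (std::string::size_type i = 1; i < s.size(); ++i) {
        if (!is_nondigit(s[i]) && !is_digit(s[i])) {
            return false;
        }
    }
    return true;
}
```

Each range test in `is_nondigit` stands for dozens of alternatives in the written grammar, which is why the two rule counts in the table differ so much.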

Using C++ grammar for parsing source files

The result of parsing a C++ source file with a grammar-based parser is a parse tree. The leaf nodes of this tree are terminal symbols of the grammar. The intermediate nodes are nonterminals. The root of the tree is the axiom of the grammar. The children of any nonterminal in the tree are the right-hand-side symbols of the rule that was used to create this nonterminal. For example the rule:

will appear in the parse tree as

Below is a sample C++ file. It is intentionally short and does nothing useful. Bigger files generate bigger trees.

struct Base1
{
    enum E2 { a, b, c };
};

class Derived1 : public Base1
{
protected:
    int m_data1, m_data2;

public:
    int Func1(E2 x)
    {
        m_data1 += x;

        if (x == b)
            return m_data1;

        return m_data2 >> 3;
    }
};

The result of parsing this source file is a parse tree. It shows which rules were used during parsing and in what order. Yellow circles represent terminal symbols. Green circles represent nonterminal symbols. Small red circles mark the places in the tree where the grammar-based parser resolved grammar conflicts.
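The tree structure described above can be sketched as a simple data type. The example below is only an illustration: the names `Node`, `collect_terminals` and `sample_shift_expression` are hypothetical, and the tree for the expression `m_data2 >> 3` is heavily simplified, with the long chains of intermediate nonterminals between the expression and its operands collapsed. A node without children is treated as a terminal.

```cpp
#include <string>
#include <vector>

// One node of a parse tree. The children are the right-hand-side symbols
// of the rule that produced this node. In this sketch a node with no
// children is taken to be a terminal symbol.
struct Node {
    std::string symbol;
    std::vector<Node> children;
};

// Walks the tree from left to right and collects the terminal symbols,
// which reproduces the original token sequence.
void collect_terminals(const Node& node, std::vector<std::string>& out) {
    if (node.children.empty()) {
        out.push_back(node.symbol);
        return;
    }
    for (const Node& child : node.children) {
        collect_terminals(child, out);
    }
}

// Simplified tree for "m_data2 >> 3", built from the rule
//   shift-expression: shift-expression >> additive-expression
// Intermediate nonterminals are collapsed for brevity.
Node sample_shift_expression() {
    return Node{"shift-expression", {
        Node{"shift-expression", { Node{"m_data2", {}} }},
        Node{">>", {}},
        Node{"additive-expression", { Node{"3", {}} }}
    }};
}
```

Calling `collect_terminals` on the sample tree yields the tokens `m_data2`, `>>` and `3` in source order, which is the sequence of leaf nodes read from left to right.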

External links

Links to C++ grammars that can be used in compiler generation tools.

[1] – The C++ 2003 standard grammar listing
[2] – The C++ 2003 grammar with an optimized structure of grammar conflicts

User:Code-Analysis/sandbox/conflicts resolution