A parser accepts or rejects a string of terminal symbols in a language by following the rules of the BNF/attribute grammar defining the language. Parsing can occur top-down, where the root of the parse tree is constructed first, or bottom-up, where the leaf nodes of the parse tree are constructed first. Programming languages are context-free languages and thus require parsing techniques modeled on context-free grammars. The context-sensitive elements in programming languages are taken care of through the use of attribute values that are trickled up or down the parse tree and/or a symbol table.
Top-down parsing starts with the starting symbol of the grammar and applies productions, working left to right, to construct the parse tree. The leaf nodes of the parse tree are the terminals of the language. Each internal node of the parse tree is labeled with the LHS (left hand side) nonterminal of some production rule, and the children of an internal node together form the RHS (right hand side) of that production. The parser stops when all leaf nodes on the tree are terminal symbols.
While parsing you often have a choice of which nonterminal to expand next. Expanding the leftmost nonterminal first is called a leftmost derivation; expanding the rightmost nonterminal first is a rightmost derivation. A leftmost derivation is the default method and is called the leftmost canonical form.
Recursive Descent Parsing is a top-down parsing technique in which every production involves a function call. An alternative to recursive descent is to use a parsing function in the form of a parsing table.
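As a sketch of the "one function call per production" idea (the grammar here is a made-up toy, S -> ( S ) | x, not one from these notes), each nonterminal gets its own parsing function:

```python
# Minimal recursive-descent sketch for the hypothetical grammar
#   S -> ( S ) | x
# One function (parse_S) per nonterminal; match() consumes terminals.

class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else None

    def match(self, ch):
        if self.peek() == ch:
            self.pos += 1
        else:
            raise SyntaxError(f"expected {ch!r} at position {self.pos}")

    def parse_S(self):                # S -> ( S ) | x
        if self.peek() == '(':
            self.match('(')
            self.parse_S()
            self.match(')')
        else:
            self.match('x')

    def accepts(self):
        try:
            self.parse_S()
            return self.pos == len(self.text)   # all input consumed
        except SyntaxError:
            return False

print(Parser("((x))").accepts())  # True
print(Parser("((x)").accepts())   # False
```

Note that the lookahead '(' is enough to choose between the two alternatives here, which previews the predictive parsing idea below.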
Predictive parsing (covered in later sections of ch 4) is a form of recursive descent in which the lookahead symbol unambiguously determines the flow from LHS to RHS of the BNF rules. Without predictive parsing, a top-down algorithm may involve backtracking if an incorrect rule is selected.
LL(1) is a class of context-free grammars to which predictive parsing can be applied. The first 'L' denotes a left-to-right scan. The second 'L' denotes a leftmost derivation. The '1' denotes one lookahead symbol: only one input symbol of lookahead is used at each step to make a parsing decision. LL(1) grammars cannot be left recursive: if the leftmost symbol of a rule's RHS is the same nonterminal as its LHS, the parser would expand it forever without consuming any input, resulting in infinite recursion. There are techniques to remove left recursion.
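To make the single lookahead symbol concrete, here is a sketch of a table-driven predictive parser; the table layout is my own, and the grammar is the left-recursion-free expression grammar used later in these notes:

```python
# Table-driven LL(1) parser sketch for:
#   E  -> T E'
#   E' -> + T E' | e
#   T  -> a | b | c
# The table is indexed by (nonterminal, lookahead); [] encodes epsilon.

TABLE = {
    ('E',  'a'): ['T', "E'"], ('E', 'b'): ['T', "E'"], ('E', 'c'): ['T', "E'"],
    ("E'", '+'): ['+', 'T', "E'"], ("E'", '$'): [],
    ('T',  'a'): ['a'], ('T', 'b'): ['b'], ('T', 'c'): ['c'],
}
NONTERMINALS = {'E', "E'", 'T'}

def ll1_accepts(tokens):
    stack = ['$', 'E']                 # start symbol on top
    tokens = list(tokens) + ['$']      # $ marks end of input
    i = 0
    while stack:
        top = stack.pop()
        look = tokens[i]               # the single lookahead symbol
        if top in NONTERMINALS:
            rhs = TABLE.get((top, look))
            if rhs is None:
                return False           # no table entry: syntax error
            stack.extend(reversed(rhs))
        elif top == look:
            i += 1                     # matched a terminal (or $)
        else:
            return False
    return i == len(tokens)

print(ll1_accepts("a+b+c"))  # True
print(ll1_accepts("a++b"))   # False
```

Each parsing decision consults exactly one table cell, so no backtracking ever occurs.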
Bottom-up parsers that use a left-to-right scan and a rightmost derivation (constructed in reverse) can parse LR(1) grammars. 'L' denotes the left-to-right scan and 'R' denotes the rightmost derivation. Because an LR(1) parser builds the parse tree from the bottom up, it allows production rules that are left recursive.
{
    int a = 4.5;   // valid code
    9stu += sum;   // invalid ident causes parser to discard entire statement
    a = 99.2;      // parser restarts parsing here
}

2. Phrase-Level Recovery is an error recovery strategy that attempts some correction on the input after the error; e.g.,
{
    int a = 4.5    // parser inserts the missing ; and continues
    sum += sum;
}

This strategy is not implemented in compilers for high-level languages. It is implemented by some scripting language interpreters and HTML parsers.
3. Error Productions is an error recovery strategy that defines common errors in the language in advance; i.e., define the error as a BNF rule and trap for it. You will implement this technique in your parser for certain error conditions.
4. Global Correction is a costly error recovery strategy that is of theoretical interest only. We will not cover this strategy.
You can implement error recovery in your parser by using any combination of the above strategies.
A production rule in a regular grammar must have a single nonterminal on the LHS and, on the RHS, either a single terminal or a single terminal followed or prefaced by one nonterminal. Every regular grammar can be expressed as a regular expression. Example:
REGULAR GRAMMAR

S -> aS | a | e

L(G) = {e, a, aa, aaa, aaaa, ...}     regex: a*

Production rules in a CFG are more complex and thus can describe languages with richer syntax. A rule must have a single nonterminal on the LHS but can have one or more nonterminals or terminals on the RHS. The RHS does not have to begin with a terminal.
CONTEXT FREE GRAMMAR

S -> aS | Sb | b

L(G) = {b, ab, abb, aabbb, ...}

Every language that can be described by a regular grammar can also be described by a CFG.
Example. Let Σ = {a,b}. Design a regex that matches all possible strings over Σ, including the empty string, followed by a single 'a'.

L = { a, aa, ba, aaa, aba, baa, bba, ... }

The corresponding regex is (a|b)*a

Design a CFG for the same language:

CFG: S -> Ta
     T -> aT | bT | a | b | e

Test Parse: babaa

        S
       / \
      T   a
     / \
    b   T
       / \
      a   T
         / \
        b   T
            |
            a

On the other hand, not every context-free language can be described by a regex. An easy example is a^nb^n, n >= 1. This language can easily be described with a CFG
S -> aSb | ab

L(G) = {ab, aabb, aaabbb, ... }

but it is not possible with a regex, since by the time scanning reaches the list of 'b's, the scanner does not *remember* how many 'a's have already been scanned.
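A regex cannot count, but a recursive function that mirrors the rule S -> aSb | ab can; a small sketch (function name is my own):

```python
# Recognizer for a^n b^n, n >= 1, mirroring the CFG  S -> aSb | ab.
# Base case S -> ab; recursive case strips a leading 'a' and trailing 'b'.

def is_anbn(s):
    if s == "ab":                      # S -> ab
        return True
    return (len(s) >= 4 and s[0] == 'a' and s[-1] == 'b'
            and is_anbn(s[1:-1]))      # S -> aSb

print(is_anbn("aaabbb"))  # True
print(is_anbn("aabbb"))   # False
```

The recursion depth is exactly the n that a finite-state scanner cannot remember.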
A yet more restrictive classification of grammars is context-sensitive grammars. The classic example is:
Language L is defined as a^nb^nc^n, for n >= 1.

L = {abc, aabbcc, aaabbbccc, aaaabbbbcccc, ...}

It is possible to define this language with a context-sensitive grammar but NOT possible with a CFG. Since production rules in a context-sensitive grammar are complicated, a more intuitive approach is to construct a parser based on a CFG for a language that is complete (accepts all valid strings) but not accurate (also accepts some invalid strings). The invalid strings are identified through the use of constraints and attribute values that trickle around the parse tree. Attribute values provide a way of validating context-sensitive predicates during parsing.
In the language a^nb^nc^n the parser needs to keep a running total of the number of 'a's, the number of 'b's and the number of 'c's in the input. During parsing these counts are percolated around the parse tree and used to enforce the semantic (context-sensitive) information in the grammar.
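The two-phase idea can be sketched in code: a complete-but-inaccurate syntactic check (shown here as a regex over a+b+c+ for brevity, standing in for the CFG parse), followed by the attribute-style count comparison:

```python
# Sketch of "complete but not accurate" parsing plus attribute checks
# for a^n b^n c^n.  Phase 1 over-accepts (any a+b+c+); phase 2 enforces
# the context-sensitive constraint with the percolated counts.
import re

def accepts_anbncn(s):
    if not re.fullmatch(r"a+b+c+", s):     # phase 1: shape of the string
        return False
    # phase 2: attribute values (the three counts) must agree
    return s.count('a') == s.count('b') == s.count('c')

print(accepts_anbncn("aabbcc"))   # True
print(accepts_anbncn("aabbbcc"))  # False
```

In a real attribute grammar the counts would be synthesized up the parse tree rather than recomputed over the string, but the constraint being checked is the same.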
<list> -> <list>, <list> | <item>
<item> -> i

Leftmost derivation for
i, i, i

<list> -> <list>, <list>
       -> <list>, <list>, <list>
       -> <item>, <list>, <list>
       -> i, <list>, <list>
       -> i, <item>, <list>
       -> i, i, <list>
       -> i, i, <item>
       -> i, i, i

Leftmost Parse Tree:

            list
          /  |   \
      list   ,    list
     /  |  \        |
  list  ,  list     i
    |        |
    i        i

Rightmost derivation for
i, i, i

<list> -> <list>, <list>
       -> <list>, <list>, <list>
       -> <list>, <list>, <item>
       -> <list>, <list>, i
       -> <list>, <item>, i
       -> <list>, i, i
       -> <item>, i, i
       -> i, i, i

Rightmost Parse Tree:

      list
     /  |  \
 list   ,   list
   |       /  |  \
   i    list  ,  list
          |        |
          i        i

Obviously an ambiguous grammar.

**********
EXAMPLE 2.
**********

Below are two grammars, G and G', each of which generates strings of correctly balanced parentheses. Both are recursive but only one is ambiguous. The letter 'e' denotes epsilon, the empty string.

----------
Grammar G
----------

S -> SS | (S) | e

Derive the string () () ()

Leftmost derivation and tree
-----------------------------

S -> SS -> SSS -> (S)SS -> (e)SS -> (e)(S)S -> (e)(e)S -> (e)(e)(S) -> (e)(e)(e)

          S
         / \
        S   S
       / \   \
      S   S  (S)
      |   |   |
     (S) (S)  e
      |   |
      e   e

Rightmost derivation and tree
-----------------------------

S -> SS -> S(S) -> S(e) -> SS(e) -> S(S)(e) -> S(e)(e) -> (S)(e)(e) -> (e)(e)(e)

          S
         / \
        S   S
        |  / \
       (S) S  (S)
        |  |   |
        e (S)  e
           |
           e

You can remove ambiguity from Grammar G by creating a grammar that enforces the order of evaluation; e.g., in Grammar G' below, parentheses associate to the right.
------------
Grammar G'
------------

S -> (S)S | e

Regardless of a leftmost or rightmost derivation, you only produce one tree:

Input string: ()()()

Leftmost parse:          Rightmost parse:

      S                        S
     / \                      / \
   (S)  S                   (S)  S
    |  / \                   |  / \
    e (S) S                  e (S) S
       | / \                    | / \
       e (S) S                  e (S) S
          |  |                     |  |
          e  e                     e  e

Grammar G' is no longer ambiguous because the two alternatives begin differently: S goes to either (S)S or epsilon, never both at the same time.
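Grammar G' is also easy to parse deterministically: a lookahead of '(' always selects S -> (S)S, and anything else selects S -> e. A recursive-descent sketch (my own code, not from the notes):

```python
# Recursive-descent recognizer for Grammar G':  S -> (S)S | e
# parse_S returns the index just past the S it consumed, or -1 on error.

def parse_S(s, i=0):
    if i < len(s) and s[i] == '(':     # lookahead '(' selects S -> (S)S
        j = parse_S(s, i + 1)          # inner S
        if j == -1 or j >= len(s) or s[j] != ')':
            return -1
        return parse_S(s, j + 1)       # trailing S
    return i                           # S -> e consumes nothing

def balanced(s):
    return parse_S(s) == len(s)

print(balanced("()()()"))  # True
print(balanced("(()"))     # False
```

Because each choice is forced by one symbol of lookahead, only one tree can ever be built, matching the claim that G' is unambiguous.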
Problem 1. Consider the following specification of expressions of this form:
9
7 - a
9 + 8 - c
a - b + 9

<expr> -> <element> | <expr> <weak-op> <expr>
<element> -> 0..9 | a..z
<weak-op> -> + | -

Two derivation trees for the expression a - b - c demonstrate that the grammar is ambiguous.
Tree 1. Rightmost derivation.

                       <expr>
                    /     |     \
             <expr>   <weak-op>   <expr>
           /    |    \     |         |
      <expr> <weak-op> <expr>  -  <element>
         |       |        |
     <element>   -    <element>

Tree 2. Leftmost derivation (swap left and right subtrees under the root)
                  <expr>
               /     |     \
        <expr>   <weak-op>   <expr>
           |         |     /    |    \
      <element>      -  <expr> <weak-op> <expr>
                           |       |        |
                      <element>    -    <element>

To remove ambiguity replace the ambiguous rule
<expr> -> <expr> <weak-op> <expr>

with a rule that does not duplicate the nonterminal in the RHS:
<expr> -> <expr> <weak-op> <term>
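With the duplicated nonterminal replaced by <term>, the operator associates to the left. A small sketch (names are my own; implemented with a loop, since direct left recursion cannot be coded as a recursive call) shows the grouping this rule forces for a - b - c:

```python
# Group a token stream the way <expr> -> <expr> <weak-op> <term> does:
# each step wraps the tree built so far as the LEFT child of a new node.

def group_left(tokens):
    tree = tokens[0]                      # first <term>
    i = 1
    while i < len(tokens):
        op, term = tokens[i], tokens[i + 1]
        tree = f"({tree} {op} {term})"    # old tree becomes the left child
        i += 2
    return tree

print(group_left(["a", "-", "b", "-", "c"]))  # ((a - b) - c)
```

There is now exactly one grouping, so the ambiguity is gone.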
<expr> -> <thing> | <thing> * <expr>
<object> -> <element> | <element> - <object>
<thing> -> <object> | <thing> + <object>
<element> -> a | b | c | d | (<object>)

The order of precedence among the three operations, lowest to highest, is

* (multiplication)
+ (plus)
- (minus)

An operator defined nearer the start symbol <expr> is applied last and so has the lowest precedence; an operator defined nearer <element> binds tighter.
The associativity of operations is:
* (multiplication) is executed right-to-left (right associative)
- (minus) is executed right-to-left (right associative)
+ (plus) is executed left-to-right (left associative)

Since the parentheses operation is defined in the lowest rule, the operation inside the parentheses is performed first. Parentheses override the default precedence and associativity.
Left recursion (direct or indirect) in a grammar is a problem for top-down parsing. Direct means the recursion occurs within 1 production rule. Indirect means the recursion occurs across 2 or more production rules.

      +
Ex. S -> Sa     (the plus means: derives in 1 or more steps)

Ex. S -> Aa | b
    A -> Ac | Sd | e,   where e is epsilon

A right recursive grammar involves rules of this form:
      +
    S -> aS

Example. Parse: a + b + c

Old.                          New.
Rule 1. E -> E + T            Replace Rule 1 & 2 with:
Rule 2. E -> T                E  -> TE'
Rule 3. T -> a | b | c        E' -> +TE' | e
                              T  -> a | b | c

Old grammar tree:             New grammar tree:

        E                          E
      / | \                       / \
     E  +  T                     T   E'
   / | \   |                     | / | \
  E  +  T  c                     a + T  E'
  |     |                            | / | \
  T     b                            b + T  E'
  |                                      |  |
  a                                      c  e

Observations.
The parse tree in the old grammar grows from the bottom up; e.g., parsing
a + b + c + a
adds an E -> E + T production to the top of the existing tree.
The new grammar grows from the top down; e.g., parsing
a + b + c + a
replaces the E' -> ε
in the rightmost lowest production with an E' -> +TE'
production.
This pattern can be applied to remove left recursion from any
grammar. You simply remove the recursive symbol and any terminals that
follow it. Replace that chunk (in this case E+) with a new nonterminal
symbol, call it E', that derives +TE' (essentially flipping the
recursion from left to right).
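The transformed grammar maps directly onto recursive-descent functions; this sketch (function and variable names are my own) parses strings like a+b+c:

```python
# Recursive descent for the left-recursion-free grammar:
#   E  -> T E'        E' -> + T E' | e        T -> a | b | c

def parse(s):
    pos = 0
    def E():
        T(); Eprime()
    def Eprime():
        nonlocal pos
        if pos < len(s) and s[pos] == '+':   # E' -> + T E'
            pos += 1
            T(); Eprime()
        # otherwise E' -> e: consume nothing and return
    def T():
        nonlocal pos
        if pos < len(s) and s[pos] in 'abc':
            pos += 1
        else:
            raise SyntaxError(f"expected a, b, or c at {pos}")
    try:
        E()
        return pos == len(s)
    except SyntaxError:
        return False

print(parse("a+b+c"))  # True
print(parse("a+"))     # False
```

Each call to Eprime() corresponds to one replacement of E' -> e by E' -> +TE' at the bottom right of the growing tree, exactly as described above.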
---------------------------
|  General Algorithm 4.19 |
---------------------------

                                 +
Input: Grammar G with no cycles (A -> A) or e-productions.

Output: Equivalent grammar with no left recursion.

Notation: A is a single nonterminal; a and B are strings of terminals and
nonterminals, and B does not begin with A.

Step 1. Order all nonterminals A_1, A_2, A_3, ... A_n.

Step 2. Execute the following:

    for (each i from 1 to n) {
        for (each j from 1 to i - 1) {
            replace each production of the form A_i -> A_jB
            by the productions A_i -> a_1B | a_2B | ... | a_kB,
            where A_j -> a_1 | a_2 | ... | a_k are all current A_j-productions
        }
        eliminate immediate left recursion among the A_i-productions
    }

Example 1.

    S -> Aa | b
    A -> Ac | Sd | e

Order the nonterminals S, A. S has no left recursion in outer loop i = 1.
For i = 2 substitute for S in A -> Sd:

    S -> Aa | b
    A -> Ac | Aad | bd | e

We are done with S. Now take care of the left recursion in A in the new rules:

    A -> Ac | Aad | bd | e

We need a new nonterminal A'.

    A  -> bdA' | A'
    A' -> cA' | adA' | e

Example 2.

    A -> Aa | Ba | c
    B -> Bb | Ab | d

Order the nonterminals A, B. After the outer loop executes for A:

    A  -> BaA' | cA'
    A' -> aA' | e
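The "eliminate immediate left recursion" step can be sketched in code; the rule representation (lists of symbol lists, with [] for epsilon) and the helper name are my own assumptions:

```python
# One step of left-recursion removal:
#   A -> A a1 | A a2 | b1 | b2   becomes
#   A -> b1 A' | b2 A'    and    A' -> a1 A' | a2 A' | e
# Productions are lists of symbols; [] encodes epsilon.

def remove_immediate_left_recursion(A, rhss):
    recursive = [r[1:] for r in rhss if r and r[0] == A]   # the alphas
    other     = [r for r in rhss if not r or r[0] != A]    # the betas
    if not recursive:
        return {A: rhss}               # nothing to do
    Aprime = A + "'"
    return {
        A:      [beta + [Aprime] for beta in other],
        Aprime: [alpha + [Aprime] for alpha in recursive] + [[]],
    }

# Example 1 after substitution:  A -> Ac | Aad | bd | e
new = remove_immediate_left_recursion(
    'A', [['A', 'c'], ['A', 'a', 'd'], ['b', 'd'], []])
print(new)
# {'A': [['b', 'd', "A'"], ["A'"]], "A'": [['c', "A'"], ['a', 'd', "A'"], []]}
```

The printed result reads back as A -> bdA' | A' and A' -> cA' | adA' | e, matching Example 1.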
#1. Add a rule outside the grammar to specify which parse tree is correct.
    (C does this to solve the dangling else problem)

#2. Change the grammar to force the construction of the correct parse tree.

Left factoring is an example of approach #2. When given a choice between
A-productions, rewrite the productions to defer the decision until more
input comes in.

Ex. <stmt> -> if <expr> then <stmt> else <stmt>
            | if <expr> then <stmt>

Change the grammar so you know whether to use the first or the second rule.

Method. For each nonterminal A, find the longest prefix alpha that is
common to two or more alternatives. If there is a non-trivial common
prefix (alpha != epsilon), replace all A-productions

A -> alpha Beta_1 | alpha Beta_2 | ... | alpha Beta_n | gamma,

where gamma represents all alternatives that do not begin with alpha, with

A  -> alpha A' | gamma
A' -> Beta_1 | Beta_2 | ... | Beta_n

Example. The IF THEN ELSE, where i is IF, t is THEN, e is ELSE.

S -> iEtS | iEtSeS | a
E -> b

Step 1. Factor out iEtS:  iEtS(ε | eS)

Step 2. Create a new production S' to handle the rules inside parentheses:

S' -> ε | eS

Step 3. Insert the S'-production back into the original grammar:

S  -> iEtSS' | a
S' -> eS | ε
E  -> b
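The left-factoring method can likewise be sketched; the helper names and the rule representation (lists of symbol lists, [] for epsilon) are hypothetical:

```python
# One left-factoring step: among alternatives that share a first symbol,
# pull out the longest common prefix alpha into a new nonterminal A'.
from collections import defaultdict

def common_prefix(rules):
    # element-wise longest common prefix of a list of token lists
    prefix = []
    for symbols in zip(*rules):
        if all(s == symbols[0] for s in symbols):
            prefix.append(symbols[0])
        else:
            break
    return prefix

def left_factor(A, rhss):
    groups = defaultdict(list)
    for r in rhss:
        groups[r[0] if r else ''].append(r)
    for first, alts in groups.items():
        if first and len(alts) > 1:            # two or more share a prefix
            alpha = common_prefix(alts)
            Aprime = A + "'"
            gamma = [r for r in rhss if r not in alts]
            return {
                A:      [alpha + [Aprime]] + gamma,
                Aprime: [r[len(alpha):] for r in alts],   # [] is epsilon
            }
    return {A: rhss}

# The lecture's grammar:  S -> iEtS | iEtSeS | a
print(left_factor('S', [['i','E','t','S'], ['i','E','t','S','e','S'], ['a']]))
```

The result reads back as S -> iEtSS' | a and S' -> ε | eS, matching Steps 1-3 above.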