Language: a finite or infinite set of sentences. A language has a lexicon, syntax rules, and semantics. A grammar is a formal definition of a language.
Lexicon: contains all the lexemes of the language; i.e., predefined names, symbols and user defined identifiers (see C's lexicon)
Syntax: the form or structure of units in the language whether sentences, expressions, statements, or program units.
Semantics: the meaning of the expressions, statements, and program units in a language. For a programming language, semantics most often describe the runtime behavior of a program. A syntactically correct statement may be semantically meaningless.
Lexeme: the lowest level syntactic unit of a language; i.e., lexemes are the terminal symbols in the language. The lexemes in the English lexicon include words and punctuation symbols. An example of a lexeme is any C keyword.
A Sentence in language L is a valid string of lexemes over the terminal set of L. In English, the lexemes are words and a sentence is a string of words (plus punctuation). A word can also have a grammar, defined as the set of arrangements over the terminal set we call the alphabet. If the language is the set of identifiers, then the terminal set is called the character set of L. The C/C++ character set is ASCII (excluding non-printable characters). The Java character set is Unicode.
Token: a category of lexemes (e.g., identifier, keyword, literal, separator, operator in a programming language or noun, verb, adverb, ... in human language). A token is a classification of the terminal symbols of a language. The lexemes are the actual terminals in the language. For example, a Keyword is a token classification in most languages. The actual keyword 'const' is a lexeme belonging to that classification.
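The lexeme/token distinction can be sketched with a tiny classifier. This is only an illustration, not a real C lexer: the keyword set is a small assumed subset, and the names KEYWORDS, TOKEN_PATTERNS and classify are hypothetical.

```python
import re

# Assumed, deliberately tiny subset of C keywords (illustration only).
KEYWORDS = {"const", "int", "return", "if", "else", "while"}

# Each token class (category) is paired with a pattern for its lexemes.
TOKEN_PATTERNS = [
    ("literal", r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("operator", r"[=+\-*/]"),
    ("separator", r"[;,(){}]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))

def classify(source):
    """Return (token, lexeme) pairs for each lexeme found in source."""
    pairs = []
    for match in MASTER.finditer(source):
        token, lexeme = match.lastgroup, match.group()
        if token == "identifier" and lexeme in KEYWORDS:
            token = "keyword"  # keywords look like identifiers but form their own class
        pairs.append((token, lexeme))
    return pairs

print(classify("const int x = 42;"))
```

Here 'const' and 'int' are two different lexemes that share the token class keyword, while 'x' is a lexeme of class identifier.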
Two types of grammars:
Generational models "generate" all possible sentences and no others in the language. Context-free and BNF grammars are generational models. A device that generates sentences of a language (a derivation or a parse tree) can determine if the syntax of a particular sentence is correct by comparing it to the structure of the generator.
Recognitional models verify whether a sentence is in the language or not. A recognition device reads input strings of the language and decides whether the input strings belong to the language. Example: The syntax analysis part of a compiler is a recognitional model.
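The two views can be contrasted on a small language. The sketch below assumes the language { a^n b^n : n >= 1 } purely as an example; the function names generate and recognize are hypothetical.

```python
from itertools import count, islice

def generate():
    """Generational view: yield every sentence of { a^n b^n : n >= 1 }, shortest first."""
    for n in count(1):
        yield "a" * n + "b" * n

def recognize(s):
    """Recognitional view: decide whether s belongs to the same language."""
    n = len(s) // 2
    return n >= 1 and s == "a" * n + "b" * n

print(list(islice(generate(), 3)))           # ['ab', 'aabb', 'aaabbb']
print(recognize("aabb"), recognize("abab"))  # True False
```

The generator never finishes on an infinite language, so it is sampled with islice; the recognizer always answers for a single input string.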
A grammar consists of four parts:
+ A finite set Σ of terminal symbols, i.e., the terminal alphabet of the language, whose symbols combine to form sentences.
+ A finite set N of nonterminal symbols or syntactic categories, each of which represents some collection of subphrases of the sentences. (Nonterminals are denoted in some way.)
+ A finite set P of productions or rules that describe how each nonterminal is defined in terms of terminal symbols and nonterminals. A production has a left-hand side (LHS) and right-hand side (RHS). Productions use ::= or -> , read as "is defined as". The "|" symbol is OR.
aBb ::= bC
S ::= 0B | 1A
+ A nonterminal S, the start symbol, that specifies the uppermost category of the language (e.g. a sentence).
UNRESTRICTED, requires only that at least one nonterminal appear on the LHS.
aBb ::= bC
CONTEXT SENSITIVE, requires at least one nonterminal on the LHS and that the RHS contain no fewer symbols than the LHS. Rules are of the form:
vAw ::= vzw
where A is a nonterminal, v and w are (possibly empty) strings of symbols, and z is a nonempty string of symbols.
Example: The language of strings with equal numbers of a's, b's, and c's, in that order, i.e. { abc, aabbcc, aaabbbccc, ... }, is context-sensitive. A grammar for it:
Σ = {a,b,c}
N = {S,A,C,Q,X}
S = S
P = {
  S -> abC | aSQ
  bQC -> bbCC
  CQ -> CX
  CX -> QX
  QX -> QC
  C -> c
}
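One way to convince yourself that these productions generate abc, aabbcc, ... is to search the sentential forms mechanically. The sketch below is a brute-force breadth-first search, not an efficient parser; it relies on the fact that no rule shrinks a sentential form, and the names RULES and derives are hypothetical.

```python
from collections import deque

# Productions of the context-sensitive grammar above, as (LHS, RHS) string pairs.
RULES = [
    ("S", "abC"), ("S", "aSQ"),
    ("bQC", "bbCC"),
    ("CQ", "CX"), ("CX", "QX"), ("QX", "QC"),
    ("C", "c"),
]

def derives(target, start="S"):
    """Breadth-first search over sentential forms no longer than target."""
    seen, queue = {start}, deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        for lhs, rhs in RULES:
            pos = form.find(lhs)
            while pos != -1:
                new = form[:pos] + rhs + form[pos + len(lhs):]
                # No rule shrinks a form, so longer forms can be pruned.
                if len(new) <= len(target) and new not in seen:
                    seen.add(new)
                    queue.append(new)
                pos = form.find(lhs, pos + 1)
    return False

print(derives("aabbcc"), derives("aabbc"))  # True False
```

For aabbcc the search finds, among others, S -> aSQ -> aabCQ -> aabCX -> aabQX -> aabQC -> aabbCC -> aabbcC -> aabbcc.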
CONTEXT FREE, requires that all rules have a single nonterminal on the LHS and any string of nonterminals and terminals on the RHS (the empty string is written λ or ε). Examples:
CFG 1. Σ = {(,)}  N = {S}  S = S
       P = { S ::= () | (S) | SS }
CFG 2. Σ = {a,b}  N = {S}  S = S
       P = { S ::= aSa | bSb | a | b | λ }
CFG 3. Σ = {a,b}  N = {S}  S = S
       P = { S ::= aSb | ε }
       L = { ε, ab, aabb, aaabbb, aaaabbbb, ... }
CFG 4. Σ = {a,b}  N = {S,A,B}  S = S
       P = { S ::= aABb
             A ::= Ab | b
             B ::= aB | a }
CFG 5. Σ = {0,1}  N = {S,A,B}  S = S
       P = { S ::= 1A | 0B
             A ::= 0 | 0S | 1AA
             B ::= 1 | 1S | 0BB }
Are these strings in the language defined by CFG 5?
0101 011 11110011 00011000110001110
Before deriving the strings let's see if we can uncover some patterns in the language. This language is complicated but there is one pattern that is easy to discern. The strings below are in L(5) following the S->1A; A->0 | 0S rules:
10, 1010, 101010, 10101010, ...
Following the S->0B; B->1 | 1S rules we have
01, 0101, 010101, 01010101, ...
It looks like every string made of '01' repeated one or more times, and every string made of '10' repeated one or more times, is in L(5).
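Patterns like these can be checked mechanically. The sketch below is a brute-force membership test that follows the productions of CFG 5 directly, with memoization; it assumes input over {0,1}, and the names derives and in_cfg5 are hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def derives(symbol, s):
    """True if nonterminal `symbol` (S, A or B) of CFG 5 derives the string s."""
    if not s:
        return False              # no production of CFG 5 yields the empty string
    head, rest = s[0], s[1:]
    if symbol == "S":             # S ::= 1A | 0B
        if head == "1":
            return derives("A", rest)
        if head == "0":
            return derives("B", rest)
        return False
    # A ::= 0 | 0S | 1AA  and  B ::= 1 | 1S | 0BB  mirror each other.
    own, other = ("0", "1") if symbol == "A" else ("1", "0")
    if head == own:
        return rest == "" or derives("S", rest)
    if head == other:             # split the rest between two copies of `symbol`
        return any(derives(symbol, rest[:i]) and derives(symbol, rest[i:])
                   for i in range(1, len(rest)))
    return False

def in_cfg5(s):
    return derives("S", s)

print(all(in_cfg5("01" * n) and in_cfg5("10" * n) for n in range(1, 6)))  # True
```

The same function can be applied to each of the four candidate strings to check a hand derivation.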
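The next rules to be considered are complicated! But if a derivation follows S -> 1A; A -> 1AA, the string begins with 11, and each of the two A's must then derive a string with one more 0 than 1s. The shortest such string is 1100. The same rule can also apply later in a derivation, as in 101100.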
Based on these patterns it appears that '11110011' is not in the language. But let's try to derive it:
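S -> 1A -> 11AA -> 111AAA -> 1111AAAA -> 11110AAA -> 111100AA -> 1111001AAA STUCK
Only one symbol of the target string remains, but three A's are left and each must derive at least one terminal. More generally, every string derived from S in CFG 5 has equal numbers of 0s and 1s, so a string with six 1s and two 0s cannot be in L(5).
REGULAR is the most restrictive grammar. Rules must have a single nonterminal on the LHS and, on the RHS, either one terminal or one terminal followed or prefaced by one nonterminal. The location of the nonterminal must be consistent. A terminal prefaced by a nonterminal is left regular and the reverse is right regular. Every regular grammar is also context-free but not all context-free grammars are regular. The following grammar is right regular.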
<binary number> ::= 0
<binary number> ::= 1
<binary number> ::= 1 <binary number>
<binary number> ::= 0 <binary number>
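Because every rule emits one terminal and at most one trailing nonterminal, a right regular grammar can be checked with a single left-to-right scan. The sketch below does this for <binary number> and compares the result with the regular expression [01]+, which describes the same set; the name is_binary_number is hypothetical.

```python
import re

BINARY = re.compile(r"[01]+")   # regular expression for the same language

def is_binary_number(s):
    """Scan left to right, emitting one terminal per rule application."""
    if not s:
        return False                 # no rule produces the empty string
    return all(ch in "01" for ch in s)

for s in ["0", "1011", "", "102"]:
    assert is_binary_number(s) == bool(BINARY.fullmatch(s))
```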
Examples of BNF rules:
<ident_list> -> identifier | identifier, <ident_list>
<if_stmt> -> if <logic_expr> then <stmt>
<ident_list> -> ident | ident, <ident_list>
An Example Grammar:
<program> -> <stmts>
<stmts> -> <stmt> | <stmt> ; <stmts>
<stmt> -> <var> = <expr>
<expr> -> <term> + <term> | <term> - <term>
<term> -> <var> | const
<var> -> a | b | c | d
An Example Derivation and Parse Tree:
Sentence: a = b + const
<program> => <stmts>
          => <stmt>
          => <var> = <expr>
          => a = <expr>
          => a = <term> + <term>
          => a = <var> + <term>
          => a = b + <term>
          => a = b + const
Parse Tree:
            <program>
                |
             <stmts>
                |
             <stmt>
            /   |   \
       <var>    =    <expr>
         |          /   |   \
         a     <term>   +   <term>
                  |            |
               <var>         const
                  |
                  b
Ambiguity in Grammars:
A grammar is ambiguous if some valid sentence has more than one distinct parse tree. To show that a grammar is ambiguous, take a sentence in the language and do a leftmost and a rightmost derivation. If they produce two different parse trees, the grammar is ambiguous. Example:
AMBIGUOUS GRAMMAR
<expr> -> <expr> <op> <expr> | const
<op> -> / | -
Sample expression in the grammar: const - const / const
Rightmost Derivation and Rightmost Parse Tree.
<expr> => <expr> <op> <expr>
=> <expr> <op> const
=> <expr> / const
=> <expr> <op> <expr> / const
=> <expr> <op> const / const
=> <expr> - const / const
=> const - const / const
                  <expr>
                /   |    \
          <expr>   <op>   <expr>
         /  |   \    |       |
   <expr> <op> <expr> /    const
      |     |     |
    const   -   const
Leftmost Derivation and Leftmost Parse Tree.
<expr> => <expr> <op> <expr>
=> const <op> <expr>
=> const - <expr>
=> const - <expr> <op> <expr>
=> const - const <op> <expr>
=> const - const / <expr>
=> const - const / const
        <expr>
       /   |   \
  <expr>  <op>  <expr>
     |      |   /  |   \
   const    - <expr> <op> <expr>
                |     |     |
              const   /   const
Sometimes ambiguity is not a problem semantically (i.e., the meaning of the sentence remains unchanged). It is a problem when the semantic *meaning* of the sentence differs between parse trees. The meaning of "const - const / const" differs because the rightmost tree groups it as (const - const) / const while the leftmost tree groups it as const - (const / const). So ambiguity in this expression grammar is a problem. To fix it, you need to create a grammar that enforces correct precedence for operators (in this case division over subtraction). A grammar that has double recursion (the non-terminal on the LHS is repeated twice in the RHS) is always ambiguous. So you will need to remove the double recursion by replacing the non-terminal with an "intermediate" non-terminal.
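The difference in meaning can be made concrete by enumerating every parse tree the ambiguous grammar allows and evaluating each one. The sketch below substitutes the assumed values 8, 4 and 2 for the three const lexemes and expects a well-formed token list; the names parses and evaluate are hypothetical.

```python
def parses(tokens):
    """Yield every tree that <expr> -> <expr> <op> <expr> | const allows for tokens."""
    if len(tokens) == 1:
        yield tokens[0]                      # <expr> -> const
        return
    for i in range(1, len(tokens), 2):       # operators sit at odd positions
        for left in parses(tokens[:i]):
            for right in parses(tokens[i + 1:]):
                yield (left, tokens[i], right)

def evaluate(tree):
    """Evaluate a nested (left, op, right) tuple bottom-up."""
    if not isinstance(tree, tuple):
        return tree
    left, op, right = tree
    a, b = evaluate(left), evaluate(right)
    return a - b if op == "-" else a / b

trees = list(parses([8, "-", 4, "/", 2]))
for tree in trees:
    print(tree, "=", evaluate(tree))
```

Two trees are found: (8, '-', (4, '/', 2)) evaluates to 6.0 and ((8, '-', 4), '/', 2) evaluates to 2.0, so the sentence has two meanings under this grammar.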
An easy method to create the correct unambiguous grammar is to follow the behavior of the parse tree that gives you the correct result, creating production rules and replacing NTs with intermediate NTs as you go in order to force the outcome. In this case, the Leftmost Parse Tree is the one that gives division precedence over subtraction.
<expr> -> <expr> - <term> | <term>
<term> -> <term> / const | const
Operator Precedence in Expression Grammars
The parse tree indicates precedence levels of the operators - nodes farthest from the root are evaluated first (since trees are displayed upside down these are the lowest levels). Evaluation is a LVR traversal of the parse tree.
Example Grammar
<expr> -> <expr> + <term> | <term>
<term> -> <term> * <factor> | <factor>
<factor> -> ( <expr> ) | <id>
<id> -> A | B | C
Which operator (+ or *) has precedence?
What happens if you swap the + and * operators in the grammar?
<term> is defined as <term> * <factor> before it is defined as <id>. What does this mean in terms of a parse tree?
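One way to explore these questions is to build the tree for a small input and look at which operator ends up deeper. The sketch below is a hand-written recursive-descent parser for the example grammar; as an assumption of this sketch, the left-recursive rules are implemented with loops, tokens are single characters, and the function name parse is hypothetical.

```python
def parse(tokens):
    """Parse a token list with the <expr>/<term>/<factor> grammar; return a nested tuple."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():      # <expr> -> <expr> + <term> | <term>
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())
        return node

    def term():      # <term> -> <term> * <factor> | <factor>
        node = factor()
        while peek() == "*":
            advance()
            node = ("*", node, factor())
        return node

    def factor():    # <factor> -> ( <expr> ) | <id>
        tok = advance()
        if tok == "(":
            node = expr()
            if advance() != ")":
                raise SyntaxError("expected )")
            return node
        if tok in ("A", "B", "C"):
            return tok
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree

print(parse(list("A+B*C")))    # ('+', 'A', ('*', 'B', 'C'))
print(parse(list("(A+B)*C")))  # ('*', ('+', 'A', 'B'), 'C')
```

In the first tree the * node sits below the + node, so it is farther from the root.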
Associativity of Operators in Expression Grammars
The property of associativity for binary operators means that ((a op b) op c) is the same as (a op (b op c)). Addition is associative; subtraction is not.
Associativity in a programming language defines the order of operations for "like" operators that are non-associative.
Right associative means that operators of equal precedence are evaluated from right to left. Right recursion enforces right association. In right recursion the nonterminal in the LHS appears at the right end of the RHS.
Left associative means that operators of equal precedence are evaluated from left to right. Left recursion enforces left association. In left recursion the nonterminal in the LHS appears at the left end of the RHS.
Operator associativity can be indicated by a grammar. For example:
7 - 4 - 2
Left associativity means the 4 *associates* left: (7 - 4) - 2 = 1
Right associativity means the 4 *associates* right: 7 - (4 - 2)= 5
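The two groupings can be reproduced as a left fold and a right fold over the operands. This is only an arithmetic illustration; the variable names are hypothetical.

```python
from functools import reduce

operands = [7, 4, 2]

# Left associative: ((7 - 4) - 2)
left = reduce(lambda acc, x: acc - x, operands)

# Right associative: (7 - (4 - 2)); fold starting from the right end
right = reduce(lambda acc, x: x - acc, reversed(operands))

print(left, right)  # 1 5
```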
Example of recursion in the rule to control associativity with corresponding parse trees for 7 - 4 - 2.
<expr> -> <expr> - <term> | <term>     (left recursive)
<term> -> 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

              <expr>
             /  |  \
        <expr>  -  <term>        Semantic meaning: (7-4)-2=1
        /  |  \       |
   <expr>  -  <term>  2
      |         |
   <term>       4
      |
      7

<expr> -> <term> - <expr> | <term>     (right recursive)
<term> -> 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

        <expr>
       /  |  \
  <term>  -  <expr>              Semantic meaning: 7-(4-2)=5
     |      /  |  \
     7  <term> -  <expr>
           |        |
           4     <term>
                    |
                    2

Extended BNF (EBNF)
<proc_call> -> ident [(<expr_list>)]
<term> -> <term> (+ | -) const
<ident> -> letter {letter | digit}
BNF
<expr> -> <expr> + <term> | <expr> - <term> | <term>
<term> -> <term> * <factor> | <term> / <factor> | <factor>
EBNF
<expr> -> <term> {(+ | -) <term>}
<term> -> <factor> {(* | /) <factor>}
note: EBNF does not specify associativity--the syntax analyzer must do this
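The note about associativity can be seen in code: the braces of the EBNF rule turn into a loop that only collects terms and operators, and the analyzer then chooses how to group them. The sketch below assumes each <term> is a single token and the token list is well formed; the function name parse_expr and its assoc parameter are hypothetical.

```python
def parse_expr(tokens, assoc="left"):
    """Parse <term> {(+ | -) <term>} where each <term> is a single token."""
    terms, ops = [tokens[0]], []
    for i in range(1, len(tokens), 2):      # the EBNF braces become a loop
        ops.append(tokens[i])
        terms.append(tokens[i + 1])
    if assoc == "left":                     # group from the left end
        tree = terms[0]
        for op, t in zip(ops, terms[1:]):
            tree = (tree, op, t)
        return tree
    tree = terms[-1]                        # group from the right end
    for op, t in zip(reversed(ops), reversed(terms[:-1])):
        tree = (t, op, tree)
    return tree

print(parse_expr([7, "-", 4, "-", 2]))                 # ((7, '-', 4), '-', 2)
print(parse_expr([7, "-", 4, "-", 2], assoc="right"))  # (7, '-', (4, '-', 2))
```

Both trees come from the same EBNF rule; only the grouping step inside the analyzer differs.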
Example 3.4:
<assign> -> <id> = <expr>
<id> -> A | B | C
<expr> -> <expr> + <term> | <term>
<term> -> <term> * <factor> | <factor>
<factor> -> ( <expr> ) | <id>
Example 3.2:
<assign> -> <id> = <expr>
<id> -> A | B | C
<expr> -> <id> + <expr> | <id> * <expr> | ( <expr> ) | <id>
#1. Consider this Grammar
<assign> -> <id> = <expr>
<expr> -> <expr> + <term> | <term>
<term> -> <term> * <factor> | <factor>
<factor> -> ( <expr> ) | <id>
<id> -> A | B | C
Is + right or left associative? How would you change the associativity of +?
#2. Consider the following two grammars, each of which generates strings of correctly balanced parentheses and brackets. Determine if either or both is ambiguous. The letter "e" represents the Greek letter epsilon; i.e., the empty string.
(a) <string> ::= <string> <string> | (<string>) | [<string>] | e
(b) <string> ::= (<string>) <string> | [<string>] <string> | e
#3. Consider the following specification of expressions:
<expr> ::= <element> | <expr> <weak op> <expr>
<element> ::= <numeral> | <variable>
<weak op> ::= + | -
Demonstrate its ambiguity by displaying two derivation trees for the expression a - b - c.
#4. Modify the following grammar to add a unary minus operator that has higher precedence than either + or *.
<assign> -> <id> = <expr>
<id> -> A | B | C
<expr> -> <expr> + <term> | <term>
<term> -> <term> * <factor> | <factor>
<factor> -> ( <expr> ) | <id>
#5. Consider the following grammar:
<expr> ::= <term> | <expr> + <term>
<term> ::= <element> | <term> * <element>
<element> ::= <num> | <var>
Draw the parse tree for this valid expression: 5 * Z + 30 + 7
#6. Consider the grammar
S -> AB | AD
A -> Aa | b
B -> Bb | a
D -> cDc | d
Are these valid strings in the language generated by the grammar?
ba ? bbbbab ? aaaacccc ? bcdcc ? baaab ? baccdcc ?
#8.
Consider this grammar:
<assign> -> <id> = <expr>
<id> -> A | B | C
<expr> -> <id> + <expr> | <id> * <expr> | ( <expr> ) | <id>
Show a parse tree and a leftmost derivation for the following statement: A = A * ( B + ( C * A))
 1. <assign> => <id> = <expr>
 2.          => A = <expr>
 3.          => A = <id> * <expr>
 4.          => A = A * <expr>
 5.          => A = A * (<expr>)
 6.          => A = A * (<id> + <expr>)
 7.          => A = A * (B + <expr>)
 8.          => A = A * (B + (<expr>))
 9.          => A = A * (B + (<id> * <expr>))
10.          => A = A * (B + ( C * <expr>))
11.          => A = A * (B + ( C * <id>))
12.          => A = A * (B + ( C * A))
Parse tree.
      <assign>
     /   |   \
  <id>   =   <expr>
   |        /   |   \
   A     <id>   *   <expr>
          |        /   |   \
          A       (  <expr>  )
                    /   |   \
                 <id>   +   <expr>
                  |        /   |   \
                  B       (  <expr>  )
                            /   |   \
                         <id>   *   <expr>
                          |            |
                          C          <id>
                                       |
                                       A

To prove that a grammar is ambiguous, provide a sentence in the grammar that has two different parse trees. (You can't show this with derivations.)
<S> -> <A>
<A> -> <A> + <A> | <id>
<id> -> a | b | c
Sentence: a + b + c
Tree 1.
        <S>
         |
        <A>
      /  |  \
   <A>   +   <A>
    |       / | \
  <id>   <A>  +  <A>
    |     |       |
    a   <id>    <id>
          |       |
          b       c
Tree 2.
            <S>
             |
            <A>
          /  |  \
       <A>   +   <A>
      / | \       |
   <A>  +  <A>  <id>
    |       |     |
  <id>    <id>    c
    |       |
    a       b

Describe the language defined by the following grammar:
<S> -> <A> a <B> b
<A> -> <A> b | b
<B> -> a <B> | a
Start by finding the behavior of <A> and <B>:
<A> -> b  bb  bbb ...    <A> is 1 or more "b"s.
<B> -> a  aa  aaa ...    <B> is 1 or more "a"s.
Thus, <A>a<B>b defines the language of all strings of 1 or more "b"s followed by 2 or more "a"s ending with a "b".
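The description of this last language can be cross-checked by brute force. The sketch below writes <S>, <A>, <B> as the single letters S, A, B (an encoding chosen for this sketch), expands the grammar up to a length bound, and compares the result with the regular expression b+a{2,}b; the names RULES, PATTERN and language are hypothetical.

```python
import re
from collections import deque
from itertools import product

# <S> -> <A> a <B> b,  <A> -> <A> b | b,  <B> -> a <B> | a
RULES = {"S": ["AaBb"], "A": ["Ab", "b"], "B": ["aB", "a"]}
PATTERN = re.compile(r"b+a{2,}b")   # 1 or more b's, 2 or more a's, then one b

def language(max_len):
    """All sentences of length <= max_len, found by leftmost expansion."""
    out, seen, queue = set(), {"S"}, deque(["S"])
    while queue:
        form = queue.popleft()
        nt = next((c for c in form if c in RULES), None)
        if nt is None:
            out.add(form)                    # only terminals left: a sentence
            continue
        for rhs in RULES[nt]:
            new = form.replace(nt, rhs, 1)   # expand the leftmost nonterminal
            # No rule shrinks a form, so longer forms can be pruned.
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return out

expected = {"".join(p) for n in range(1, 8) for p in product("ab", repeat=n)
            if PATTERN.fullmatch("".join(p))}
print(language(7) == expected)  # True
```

The shortest sentence found is baab, which matches the description: one "b", two "a"s, and a final "b".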