Resources:
ascii chart
flex/bison pdf
ANSI C Ref
phaseI files
The goal of Phase I is to produce a lexical analyzer (also called scanner or lexer) for your compiler front-end for a language called cflat. The lexical structure of cflat is provided below. You do not need to know the syntax of cflat yet. The job of the lexical analyzer in this phase is to simply identify and classify tokens, add identifiers to a symbol table, and report error messages. In Phase II you will pass these tokens to the parser. One approach to Phase I is to start with the scanner you wrote for the second laboratory.
Keywords
The entire set of lexemes (the lexicon) for cflat
can be classified into tokens by regular expressions.
Keywords are grabbed by their literal name; e.g.,
keyword else
is grabbed as else
.
Each keyword should be assigned a different token id.
Why? Because in the next phase the parser will need to
identify the exact keyword in order to perform syntax analysis.
(If you were coding your scanner by hand you
would write a transition diagram for each keyword and implement it
or do a keyword lookup into a hash table.)
Symbol Table
For phase I you must create a data structure for a simple symbol
table (you will be changing the structure a bit for subsequent phases
as you learn more about symbol table usage).
For Phase I scanning the primary use of a symbol table is
for internal error-checking (you can dump it at the end of scanning).
The scanner will simply
add an entry into the symbol table whenever it encounters
an identifier. The parser will eventually
use the symbol table during parsing for syntactic and
semantic analysis and thus what you store there may change.
For now, simply add an entry for identifiers.
A good structure for a symbol table is a chained hash table. Since C does not have a hash table in its standard library, you can either code one yourself, borrow one from the Internet or use the one I provide for you. For modularity, put the code for your symbol table in separate files (symtab.c and symtab.h). The code I provide for a simple chained hash table is in phaseI_files. You can also use the linear probing hash table that is part of the concordance from lab04.
What goes in the symbol table? It helps to know what the end product of compilation will be. View the assembly code for constants.c generated by stopping compilation at the assembly code stage
$ gcc -S constants.c $ gcc -S num_lits.cIf you look at constants.s you should notice that all literals are either embedded into the .data segment as is the case with globals or emcoded into the assembly instruction that pushes the data onto the stack frame. Unless you have a language with odd semantics the parser does not need to store information about constant literals. Note that the .type assembler directive distinguishes between labels that are data and labels that are functions. Data typing (float, integer,..) is handled by the parser during semantic analysis. If assembly is produced you know the code passed semantic analysis.
**************************** * Cflat Lexical Structure * **************************** TokenType Lexemes --------- ----------------------------------------------------------------- Comments C-style comments (see definition below) Keyword int, char, float, if, else, return, while Operator + - * / % == != > >= < <= && || = ++ -- += -= FloatConst C-style floating point number IntConst (+|-)integer CharConst C-style character literals Your lexer should accept any ascii character enclosed in single quotes including escape characters. Examples: 'a' '\n' '\t' '%' (the null character '' is invalid) StringConst C-style string literals (see defintion below) Identifier C-style identifiers Punctuator [ ] ( ) ; , --------------------- Definitions & Notes --------------------- Cflat is case sensitive. Keywords are all lowercase. C floating point: "an integer part, a decimal point, a fraction part, an [eE], and an optionally signed integer exponent. The integer and fraction parts both are one or more digits. Either the integer part or the fraction part (but not both) can be missing. Either the decimal point or the [eE] and the exponent (not both) can be missing. " (C manual Ch. 3) C-style string literals: A string literal is a sequence of characters surrounded by double quotation marks, as in "...". A string literal has type array of char and is initialized with the given characters. The compiler places a null byte (\0) at the end of each string literal so that programs that scan the string literal can find its end. A double-quotation character (") in a string literal must be preceded by a backslash (\). In addition, the same escapes as those described for character constants can be used. Your lexer should accept these strings as valid: "*" "hi there" "This is a string that has a \" inside it" "\"" "" C-style identifiers: An identifier is a sequence of letters, digits, and underscores (_). The first character cannot be a digit. Uppercase and lowercase letters are distinct. Name length is unlimited.Error Handling
Jumping to the next token is not always easy to do. In general if an illegal symbol (character) is encountered the scanner should eat input until the it hits the next whitespace. Skipping to the next token in this fashion will avoid the extra work in trying to classify the remaining characters past the illegal token. For example, without this feature your scanner will generate this output when scanning bad.cf. Obviously, not a good idea.
Finding the next token is not always easy to do.
For example, if you encounter " hi* there
with no
closing double-quote your scanner will start scanning at h. This
will generate some unnessary scanning but that is OK.
You do not want to over complicate your regexes just to
catch every esoteric error. Catching 4num
as an illegal
identifier is also tricky. The cleanest thing to do is
scan an integer 4 and an identifer num. Let the parser sort it out.
For each error you must display an error message along with line # and column #. The yyleng global variable will be useful for displaying the column number. Then toss out the token (all characters until the next whitespace) and continue scanning. These are some of the errors your scanner should catch:
o Invalid C string o Encountering a symbol that is not in the alphabet of the language. If a symbol is not in one of the regexes defined in your scanner then that symbol is not in the language. For an occurrence of an invalid character (e.g. $), display the invalid symbol and the line # and col #. Then spit the symbol out and continue scanning. o Character constants with the following characteristics: - a missing quote at the end of the character constant ('x) - an empty character constant ('')Bookkeeping
/* comments */ int dummy() { int x = 5; return 0; }Your scanner should produce output similar to this:
============== SCANNER OUTPUT ============== int Keyword dummy Identifier ( Punctuator ) Punctuator { Punctuator int Keyword x Identifier = Operator 5 intConst ; Punctuator return Keyword 0 intConst ; Punctuator } Punctuator ================== Symbol Table Dump ================== NAME LINE ------------- ---- dummy 2 x 4 End Symbol Table Dump @@ ==================== Token Count Summary ===================== Keywords: 3 Operators: 1 Punctuators: 6 intConsts: 2 floatConsts: 0 charConsts: 0 Identifiers: 2 strConsts: 0 TOTAL TOKENS:14 BAD TOKENS: 0WHAT TO SUBMIT FOR PHASE I You do not need to submit anything for Phase I unless you do not complete Phase II. In either case, your code is not due until Monday Jun 9. Should you not complete Phase II attach scan.l in an email and send it to me. To test your scanner I will run your code against good.cf and bad.cf. These two input files test all constructs in the lexicon and test all errors. Sample run:
$ cat good.cf | ./scan > out $ cat bad.cf | ./scan > badoutSend scan.l to donna@cs.csub.edu or demonstrate to me in class.
NOTES: For debugging lex compile with LFLAGS = -d and add to the lex file: extern int yy_flex_debug; yy_flex_debug = 1;