CMPS 450 Compilers

Front End Phase I "Lexical Analysis"

Resources:
ascii chart
flex/bison pdf
ANSI C Ref
phaseI files

The goal of Phase I is to produce a lexical analyzer (also called scanner or lexer) for your compiler front-end for a language called cflat. The lexical structure of cflat is provided below. You do not need to know the syntax of cflat yet. The job of the lexical analyzer in this phase is to simply identify and classify tokens, add identifiers to a symbol table, and report error messages. In Phase II you will pass these tokens to the parser. One approach to Phase I is to start with the scanner you wrote for the second laboratory.

Keywords
The entire set of lexemes (the lexicon) for cflat can be classified into tokens by regular expressions. Keywords are grabbed by their literal name; e.g., keyword else is grabbed as else. Each keyword should be assigned a different token id. Why? Because in the next phase the parser will need to identify the exact keyword in order to perform syntax analysis. (If you were coding your scanner by hand you would write a transition diagram for each keyword and implement it or do a keyword lookup into a hash table.)

Symbol Table
For phase I you must create a data structure for a simple symbol table (you will be changing the structure a bit for subsequent phases as you learn more about symbol table usage). For Phase I scanning the primary use of a symbol table is for internal error-checking (you can dump it at the end of scanning). The scanner will simply add an entry into the symbol table whenever it encounters an identifier. The parser will eventually use the symbol table during parsing for syntactic and semantic analysis and thus what you store there may change. For now, simply add an entry for identifiers.

A good structure for a symbol table is a chained hash table. Since C does not have a hash table in its standard library, you can either code one yourself, borrow one from the Internet or use the one I provide for you. For modularity, put the code for your symbol table in separate files (symtab.c and symtab.h). The code I provide for a simple chained hash table is in phaseI_files. You can also use the linear probing hash table that is part of the concordance from lab04.

What goes in the symbol table? It helps to know what the end product of compilation will be. View the assembly code for constants.c generated by stopping compilation at the assembly code stage

           $ gcc -S constants.c
           $ gcc -S num_lits.c

If you look at constants.s you should notice that all literals are either embedded into the .data segment as is the case with globals or emcoded into the assembly instruction that pushes the data onto the stack frame. Unless you have a language with odd semantics the parser does not need to store information about constant literals. Note that the .type assembler directive distinguishes between labels that are data and labels that are functions. Data typing (float, integer,..) is handled by the parser during semantic analysis. If assembly is produced you know the code passed semantic analysis.

****************************
* Cflat Lexical Structure  *
****************************

TokenType   Lexemes
---------   ----------------------------------------------------------------- 
Comments    C-style comments  (see definition below)

Keyword     int, char, float, if, else, return, while

Operator    +  -  *  /  %  ==  !=   >  >=  <  <=  &&  ||  =   ++  --  += -=

FloatConst  C-style floating point number 
                  
IntConst    (+|-)integer
 
CharConst   C-style character literals
            Your lexer should accept any ascii character enclosed in single
            quotes including escape characters. Examples:
            'a'   '\n'   '\t'   '%'  (the null character '' is invalid)

StringConst C-style string literals (see defintion below)

Identifier  C-style identifiers 

Punctuator  [  ]  (  )  ;  ,


---------------------
 Definitions & Notes
---------------------
Cflat is case sensitive. Keywords are all lowercase.

C floating point:  "an integer part, a decimal point, a fraction part, an [eE],
                   and an optionally signed  integer exponent. The integer and
                   fraction parts both are one or more digits. Either the 
                   integer part or the fraction part (but not both) can be 
                   missing. Either the decimal point or the [eE] and the 
                   exponent (not both) can be missing. " (C manual Ch. 3)

C-style string literals:
    A string literal is a sequence of characters surrounded by double quotation
    marks, as in "...". A string literal has type array of char and is 
    initialized with the given characters. The compiler places a null byte 
    (\0) at the end of each string literal so that programs that scan the 
    string literal can find its end. A double-quotation character (") in a 
    string literal must be preceded by a backslash (\). In addition, the same 
    escapes as those described for character constants can be used.  

   Your lexer should accept these strings as valid: 
            "*"  
            "hi there"  
       "This is a string that has a \" inside it" 
             "\""  
             ""

C-style identifiers:
    An identifier is a sequence of letters, digits, and underscores (_). The 
    first character cannot be a digit. Uppercase and lowercase letters are 
    distinct. Name length is unlimited.

Error Handling
Your scanner must include error handling. Since scanning is so time intensive (as much as 70% of compilation time is spent in the scanner) the general approach for error handling is to display an error message, jump to the next token, and continue scanning. The scanner should not try to catch anything related to syntax or implementation rules. For example, integer or float constants that will produce overflow or underflow can be caught be the parser.

Jumping to the next token is not always easy to do. In general if an illegal symbol (character) is encountered the scanner should eat input until the it hits the next whitespace. Skipping to the next token in this fashion will avoid the extra work in trying to classify the remaining characters past the illegal token. For example, without this feature your scanner will generate this output when scanning bad.cf. Obviously, not a good idea.

Finding the next token is not always easy to do. For example, if you encounter " hi* there with no closing double-quote your scanner will start scanning at h. This will generate some unnessary scanning but that is OK. You do not want to over complicate your regexes just to catch every esoteric error. Catching 4num as an illegal identifier is also tricky. The cleanest thing to do is scan an integer 4 and an identifer num. Let the parser sort it out.

For each error you must display an error message along with line # and column #. The yyleng global variable will be useful for displaying the column number. Then toss out the token (all characters until the next whitespace) and continue scanning. These are some of the errors your scanner should catch:

o Invalid C string  
 
o Encountering a symbol that is not in the alphabet of the language. If a 
  symbol is not in one of the regexes defined in your scanner then that symbol 
  is not in the language.  For an occurrence of an invalid character (e.g. $),
  display the invalid symbol and the line # and col #. Then spit the symbol out
  and continue scanning. 

o Character constants with the following characteristics:
      - a missing quote at the end of the character constant ('x)
      - an empty character constant ('')

Bookkeeping
As a sanity check have your scanner keep a running total of each type of token it encounters and an overall token count. While scanning display information about each valid token. At the end of scanning, display the counters and dump the symbol table. For example, on this valid input file

/*  comments */
int dummy()
{
  int x = 5;
  return 0;
}

Your scanner should produce output similar to this:

 
==============
SCANNER OUTPUT  
==============
int     Keyword 
dummy   Identifier
(       Punctuator
)       Punctuator
{       Punctuator
int     Keyword 
x       Identifier
=       Operator
5       intConst
;       Punctuator
return  Keyword 
0       intConst
;       Punctuator
}       Punctuator

==================
 Symbol Table Dump
==================
NAME           LINE
-------------  ----
dummy           2 
x               4 
End Symbol Table Dump
@@

====================
 Token Count Summary
=====================
Keywords:    3
Operators:   1
Punctuators: 6
intConsts:   2
floatConsts: 0
charConsts:  0
Identifiers: 2
strConsts:   0

TOTAL TOKENS:14
BAD TOKENS:  0

WHAT TO SUBMIT FOR PHASE I

You do not need to submit anything for Phase I unless you do not complete Phase II. In either case, your code is not due until Monday Jun 9. Should you not complete Phase II attach scan.l in an email and send it to me. To test your scanner I will run your code against good.cf and bad.cf. These two input files test all constructs in the lexicon and test all errors. Sample run:

   $ cat good.cf | ./scan > out
   $ cat bad.cf | ./scan > badout

Send scan.l to donna@cs.csub.edu or demonstrate to me in class.

NOTES: For debugging lex compile with
      LFLAGS = -d

and add to the lex file:
   extern int yy_flex_debug;
   yy_flex_debug = 1;