CS 410 Programming Assignment 1

Fall 2006 [Bono]
Due: Thursday, September 14, 11:59pm

Introduction

In this assignment you will write a lexical analyzer for the Espresso language, using the tool flex. The flex documentation is available here. There is also a pdf version linked from the Documentation section of the course web page. The definition for the Espresso language can be found at Espresso Language Reference (pdf file). There will be no face-to-face grading meeting (i.e., demo) required for this assignment. Students are required to work individually on this assignment (no groups). Save your solution to this assignment, because you will need it for later assignments.

Files and development environment

All of the software used for this course is available on aludra. For this assignment we will be using using gmake, g++, and flex to compile and gdb (or ddd) for debugging. We will test your program on aludra, so if you do your development somewhere other than on an scf Sun machine (this includes aludra and the Sun workstations in Sal 125), you'll need to make sure it also works on the scf environment before you submit your work.

Getting access to flex. To be able to use flex you will need to put the following lines at the end of your .login file:

if (-f /usr/usc/gnu/flex/default/setup.csh) then
        source /usr/usc/gnu/flex/default/setup.csh
endif
You will need to run
source .login
to have this change take effect immediately. You only need to do this one time.

Getting the assignment files. To be able to work on the assignment, create a directory for your assignment, cd into it, and then give the following Unix command.

 gmake -f ~csci410/espresso/lexer/Makefile getfiles 
This Makefile copies the files that you will need to modify for the assignment to your directory, but also symbolically links the files you will need access to, but that you should not modify.

A note on symbolic links. A symbolic link means that the file appears to be in your directory, but you are actually looking at the original from the course account. This has several advantages: it makes it easy for you to not modify these files by mistake, it doesn't use use up space on your account for files, and it makes it easy to give you a new version of the file if we need to update it (the new version will be there automatically without your doing anything). If you do the command "ls -F" it shows symbolically linked files and directories differently, with @ and / suffix, respectively, to distinguish them from other files. We recommend you alias ls to "ls -F" on your account.

The assignment files. Most of these files are described in more detail in later parts of this document. The files in bold below are ones you create and/or modify. The files are:

Lexical analyzer

When you feed a flex input file to flex, it creates a function called yylex. The yylex created by your "flex program" will return the next token in a source code file that is input to a compiler. If this token has an associated lexical value of interest to the rest of the compiler, this is communicated to the caller via the global variable yylval. When getting the next token, it will skip over any intervening comments and whitespace. The token returned by the function is an integer code. These codes, as well as the definition of yylval's type, called YYSTYPE, have all been defined for you in espresso.tab.h.

You will need to read or skim the flex documentation as necessary to learn how to write a flex input file, and for further information about how the lexical analyzer produced by flex matches its input.

Special case for single-char lexemes. As you are probably aware, the internal representation of character literals in C/C++ is just an integer called its ASCII value. We will just use that integer code as the token for single-character lexemes. To make this work, all the token values defined in espresso.tab.h are at least 256 (bigger than ASCII values), and you may notice that NO codes are defined for our single-character lexemes (e.g., '{' or '+' ). So, for a valid single-character lexeme, you should have the flex program return the character itself as the token (i.e., as the return value for yylex; there will be no additional lexical value to go with it).

The only tokens that have lexical values are INT_LITERAL, BOOL_LITERAL, ID, and ERROR, and they use the yylval fields intLiteral, boolLiteral, id, and errorMsg, respectively. The other fields defined for yylval are only used by the parser.

The lexical structure of Espresso is outlined in the section entitled Espresso Lexemes in the Espresso Language Reference (pdf file).

Your flex specification must be complete: all input should be matched by some regular expression. Put another way, no input should trigger the flex default action of echoing the input to the stdout.

Your lexical analyzer must also keep track of the current line number of the source program by maintaining the global variable linenum. This will be used by other parts of the compiler to identify where errors occurred.

When you are skipping whitespace, don't forget the DOS carriage return character, which can be denoted by '\r' in flex. DOS files have this in addition to the normal '\n' at the end of each line.

Compiling and testing the Lexical analyzer

As mentioned earlier, we have written the test driver for you, as well as the Makefile for compiling your code. The test driver repeatedly gets a token using yylex, each time printing out the token, line number, and value. After reaching the end of the input file it dumps the contents of the string table.

The command

gmake testlexer
compiles the lexical analyzer. That is, it feeds espresso.flex to flex, then compiles the resulting partial C++ program (which is in espresso-lex.cc) with the test driver (in testlexer.cc) to make the executable, testlexer. As mentioned earlier, please look at the Makefile to see exactly what happens.

The command

testlexer inputfile
runs the resulting lexical analyzer on the file inputfile, which does not have to be a valid espresso program in other ways, nor does it have to have the .esp suffix.

The version of the code you initially download compiles and runs (try it on test.esp), but, of course, it doesn't do a whole lot.

The StringTab and Symbol classes

Programs usually have the same identifiers repeated in many places. To save space and time while compiling we're going to use a data structure called a string table to store the identifiers, such that it only stores one copy of the string for each identifier. As mentioned previously, we have written this tool for you.

Once you enter an identifier into the string table it gives you back a pointer to a Symbol object (the rest of the compiler will use this pointer to access the identifier). StringTab only creates one Symbol object for each string. So the second time you add the same identifier, it gives back a pointer to the Symbol it created the first time. Because we know there is only one Symbol for any unique identifier, we can just use pointer-compare on Symbol pointers to find out if two identifiers are the same (as opposed to comparing character by character).

You need to add the identifier to the string table right when you scan it, and you'll return the Symbol* in yylval (see definition of YYSTYPE in espresso.tab.h). The global variable stringtab is declared in lexer.h. See StringTab.h for detailed documentation for the interface for StringTab and Symbol.

The test driver, testlexer, prints out the Symbol for the identifier when it gets one, but also, after processing the whole program, it prints out the whole string table. In this string table listing each identifier will appear only once.

Handling lexical errors

Instead of printing out error messages itself, the lexical analyzer is going to return a special token, ERROR, when it encounters an error in the input. When you return ERROR, you will also return a char* lexical value in yylval.errorMsg. There are two kinds of values that will be passed back, depending on the type of error encountered: Here is a list of the other errors you can encounter:

Development strategy

We strongly encourage you to get a subset of the program working before you continue with development. This is also sometimes called incremental development. We've given you a very small working subset, but you should feel free to create many more as you go. For this assignment it is fairly easy to do this since each of the tokens or set of tokens you write code for are somewhat independent from each other. Another good reason to create subsets for this assignment is that when you are using a new language (flex input specifications here), it can save you a lot of typing if you test small programs, figuring out how things work in flex, before investing a lot of time on creating a larger program that would have to be completely rewritten if it used faulty syntax or was based on a flawed model of the semantics.

Developing in subsets means that at the end if you don't complete the assignment, you will still have a working piece of software to submit, rather than a large mess that doesn't compile. The former will earn a higher score than the latter. Since the grading is primarily on correctness, this becomes even more crucial: no cushion of style and documentation points.

What to turn in

Before submitting, make sure your name and loginid appear in a comment at the beginning of every source file you submit and in the README file. It's ok if it doesn't appear in test data files (e.g., test.esp).

The Makefile includes a rule for submitting the assignment. You run it by typing gmake submit. It copies your espresso.flex, README, and test.esp to the course account using the local submit utility. The last version submitted is the one that will be graded: each submit overwrites the previous one.

The University of Southern California does not screen or control the content on this website and thus does not guarantee the accuracy, integrity, or quality of such content. All content on this website is provided by and is the sole responsibility of the person from which such content originated, and such content does not necessarily reflect the opinions of the University administration or the Board of Trustees