CS 410 Programming Assignment 1
Fall 2006 [Bono]
Due: Thursday, September 14, 11:59pm
Introduction
In this assignment you will write a lexical analyzer for the Espresso
language, using the tool flex. The flex
documentation is available here.
There is also a pdf version linked from the Documentation section of
the course web page.
The definition for the Espresso language can be found at
Espresso Language
Reference (pdf file).
There will be no face-to-face grading meeting
(i.e., demo) required for this assignment. Students are required to
work individually on this assignment (no groups). Save your solution
to this assignment, because you will need it for later assignments.
Files and development environment
All of the software used for this course is available on aludra. For
this assignment we will be using using gmake, g++,
and flex to compile and gdb (or ddd) for
debugging. We will test your program on aludra, so if you do your
development somewhere other than on an scf Sun machine (this includes
aludra and the Sun workstations in Sal 125), you'll need to make sure
it also works on the scf environment before you submit your work.
Getting access to flex. To be able to use
flex you will need to put the following lines at the end of
your .login file:
if (-f /usr/usc/gnu/flex/default/setup.csh) then
source /usr/usc/gnu/flex/default/setup.csh
endif
You will need to run
source .login
to have this change take effect immediately. You only need to do this
one time.
Getting the assignment files. To be able to work on
the assignment, create a directory for your assignment, cd
into it, and then give the following Unix command.
gmake -f ~csci410/espresso/lexer/Makefile getfiles
This Makefile copies the files that you will need to modify for the
assignment to your directory, but also symbolically links the files
you will need access to, but that you should not modify.
A note on symbolic links. A symbolic link means that
the file appears to be in your directory, but you are actually
looking at the original from the course account. This has several
advantages: it makes it easy for you to not modify these files by
mistake, it doesn't use use up space on your account for files, and
it makes it easy to give you a new version of the file if we need to
update it (the new version will be there automatically without your
doing anything). If you do the command "ls -F" it shows
symbolically linked files and directories differently, with @ and /
suffix, respectively, to distinguish them from other files. We
recommend you alias ls to "ls -F" on your account.
The assignment files. Most of these files are described in
more detail in later parts of this document. The files in bold below
are ones you create and/or modify. The files are:
-
espresso.flex. The flex input file.
This is where you will write all of your code for this assignment.
-
README.
See directions in the file itself.
-
test.esp.
Include a thorough test of your
lexical analyzer in this file: it can show lexical errors as well as
correct lexemes, since the lexical analyzer should be able to continue
after encountering errors. (Of course, an unclosed comment is an
error that would take you to the end of the file.) This file does not
have to look much like a valid Espresso program since
testlexer just recognizes individual lexemes, but cares
nothing about how they are put together.
-
testlexer.cc.
The test driver for the lexical analyzer. We
have written this for you. You will find it useful to look at to see
what the program should do.
-
espresso.tab.h.
The declarations that need to be known to
both the lexical analyzer and the parser. This file was automatically
created by bison. Do not modify this file.
-
parser.h. Some declarations of class names used in
espresso.tab.h. You don't need to use these otherwise until the parser.
-
lexer.h.
Some declarations that need to be known to both
testlexer.cc and the code produced by flex.
-
Makefile. This has a rule for getting the files
(getfiles, described previously), for compiling the program,
and for submitting the program. See comments at the top of the file
for details. You should not need to modify this file unless you add
additional source files (for which you would need to add make rules
and change the submit rule). We recommend you familiarize yourself
with the workings of this Makefile.
-
StringTab.h
and
StringTab.cc.
A string table class (StringTab)and associated Symbol
class. You will be using these classes in your lexical analyzer.
They are documented in the .h file.
-
printToken.h.
and
printToken.cc.
A function for printing out the name of a
token when given its number. It knows about the #defines
in espresso.tab.h. This is used by testlexer; you will not need to
call it yourself.
-
~csci410/espresso/lexer/Small.out.
(not linked to your local directory) Sample lexical analyzer output
for Small.esp
example. Note: Small.esp, which among other examples is linked from
the project section of the course web page, would not be a complete
test of your lexical analyzer, since it doesn't include all the tokens
of the language.
Lexical analyzer
When you feed a flex input file to flex, it creates a function called
yylex. The yylex created by your "flex program"
will return the next token in a source code file that is input to a
compiler. If this token has an associated lexical value of interest
to the rest of the compiler, this is communicated to the caller via
the global variable yylval. When getting the next token,
it will skip over any intervening comments and whitespace. The token
returned by the function is an integer code. These codes, as well as
the definition of yylval's type, called YYSTYPE,
have all been defined for you in
espresso.tab.h.
You will need to read or skim the flex
documentation as necessary to learn how to write a flex input
file, and for further information about how the lexical analyzer
produced by flex matches its input.
Special case for single-char lexemes. As you are
probably aware, the internal representation of character literals in
C/C++ is just an integer called its ASCII value. We will just use
that integer code as the token for single-character lexemes. To make
this work, all the token values defined in espresso.tab.h are at least
256 (bigger than ASCII values), and you may notice that NO codes are
defined for our single-character lexemes (e.g., '{' or '+' ). So, for
a valid single-character lexeme, you should have the flex program
return the character itself as the token (i.e., as the return value
for yylex; there will be no additional lexical value to go with it).
The only tokens that have lexical values are INT_LITERAL,
BOOL_LITERAL, ID, and ERROR, and they
use the yylval fields intLiteral,
boolLiteral, id, and errorMsg,
respectively. The other fields defined for yylval are only
used by the parser.
The lexical structure of Espresso is outlined in the section entitled
Espresso Lexemes in the Espresso Language
Reference (pdf file).
Your flex specification must be complete: all input
should be matched by some regular expression. Put another way, no
input should trigger the flex default action of echoing the
input to the stdout.
Your lexical analyzer must also keep track of the current line number
of the source program by maintaining the global variable
linenum. This
will be used by other parts of the compiler to identify where errors
occurred.
When you are skipping whitespace, don't forget the DOS carriage return
character, which can be denoted by '\r' in flex. DOS files have this
in addition to the normal '\n' at the end of each line.
Compiling and testing the Lexical analyzer
As mentioned earlier, we have written the test driver for you, as well
as the Makefile for compiling your code. The test driver repeatedly
gets a token using yylex, each time printing out the token,
line number, and value. After reaching the end of the input file it
dumps the contents of the string table.
The command
gmake testlexer
compiles the lexical analyzer.
That is, it feeds
espresso.flex
to flex, then
compiles the resulting partial C++ program (which is in
espresso-lex.cc) with the test driver (in
testlexer.cc) to make the executable, testlexer.
As mentioned earlier, please look at the
Makefile
to see exactly what happens.
The command
testlexer inputfile
runs the resulting lexical
analyzer on the file inputfile, which does not have to be a
valid espresso program in other ways, nor does it have to have the
.esp suffix.
The version of the code you initially download compiles and runs (try
it on test.esp), but, of course, it doesn't do a whole lot.
The StringTab and Symbol classes
Programs usually have the same identifiers repeated in many places.
To save space and time while compiling we're going to use a data
structure called a string table to store the identifiers, such that it
only stores one copy of the string for each identifier. As mentioned
previously, we have written this tool for you.
Once you enter an identifier into the string table it gives you back a
pointer to a Symbol object (the rest of the compiler will
use this pointer to access the identifier). StringTab only
creates one Symbol object for each string. So the second
time you add the same identifier, it gives back a pointer to the
Symbol it created the first time. Because we know there is
only one Symbol for any unique identifier, we can just use
pointer-compare on Symbol pointers to find out if two identifiers
are the same (as opposed to comparing character by character).
You need to add the identifier to the string table right when you scan it,
and you'll return the Symbol* in yylval (see
definition of YYSTYPE in
espresso.tab.h). The
global variable stringtab is declared in
lexer.h. See
StringTab.h
for detailed documentation for the interface
for StringTab and Symbol.
The test driver, testlexer, prints out the
Symbol for the identifier when it gets one, but also, after
processing the whole program, it prints out the whole string table.
In this string table listing each identifier will appear only
once.
Handling lexical errors
Instead of printing out error messages itself, the lexical analyzer is
going to return a special token, ERROR, when it encounters
an error in the input. When you return ERROR, you will
also return a char* lexical value in
yylval.errorMsg.
There are two kinds of values that will
be passed back, depending on the type of error encountered:
- invalid character. Put it and any other invalid
characters that immediately follow it into one string that you will
return in errorMsg.
- all other lexical errors. put an informative error
message in errorMsg.
Here is a list of the other errors you can encounter:
- unterminated comment
- integer out of range. HINT: The C function
strtol (declared in stdlib.h) will be helpful
for for figuring out if an integer is out of range. Java (and
Espresso) ints and C/C++ long ints both have the
same range (32 bit integers). You will only have to worry about the
positive part of the range since we have no unary minus.
- '&' not followed by another
'&'. '&&' is a valid lexeme, but
'&' by itself is not a lexeme in Espresso.
- '/' not followed by '*' or another
'/'. When used in the correct combination, '/'
starts the two forms of comments, but otherwise is illegal.
- integers with leading zeroes. Espresso only accepts Java
decimal integer literals. In Java, integers with leading
zeroes are interpreted as octal integer literals, but are errors in
Espresso. Examples of this are "032"," 00549". Of course, zero by
itself is still a valid decimal integer. So, for example, "0xy" would
generate two tokens, INT_LITERAL followed by ID,
with no error produced.
Development strategy
We strongly encourage you to get a subset of the program working
before you continue with development. This is also sometimes called
incremental development. We've given you a very small working subset,
but you should feel free to create many more as you go. For this
assignment it is fairly easy to do this since each of the tokens or
set of tokens you write code for are somewhat independent from each
other. Another good reason to create subsets for this assignment is
that when you are using a new language (flex input specifications
here), it can save you a lot of typing if you test small programs,
figuring out how things work in flex, before investing a lot of time on
creating a larger program that would have to be completely rewritten
if it used faulty syntax or was based on a flawed model of the semantics.
Developing in subsets means that at the end if you don't complete the
assignment, you will still have a working piece of software to submit,
rather than a large mess that doesn't compile. The former will earn a
higher score than the latter. Since the grading is primarily on
correctness, this becomes even more crucial: no cushion of style and
documentation points.
What to turn in
Before submitting, make sure your name and loginid appear in a comment
at the beginning of every source file you submit and in the README
file. It's ok if it doesn't appear in test data files (e.g., test.esp).
The
Makefile
includes a rule for submitting the assignment. You run
it by typing gmake submit. It copies your
espresso.flex,
README,
and
test.esp
to the course account using the local submit utility. The
last version submitted is the one that will be graded: each submit
overwrites the previous one.
The University of Southern California does not screen or control the content on this website and thus does not guarantee the accuracy, integrity, or quality of such content. All content on this website is provided by and is the sole responsibility of the person from which such content originated, and such content does not necessarily reflect the opinions of the University administration or the Board of Trustees