CS 410 Programming Assignment 3

Fall 2006 [Bono]
Due: Tuesday, October 31, 11:59pm

Table of Contents


Introduction

In this assignment you will write a semantic analyzer for the Espresso language. This phase of the compiler will uncover any remaining errors in the Espresso program that the parser and lexical analyzer didn't find, printing informative error messages as it goes. Besides detecting errors, it will create the symbol tables for symbols used in the program, and it will use the symbol table and the AST to do type-checking. In the type-checking pass you will sometimes uncover errors, but you'll also "decorate" the AST with type information. Besides possible error messages, the output of the semantic analyzer is the decorated AST and the symbol tables.

To complete this program you'll need your completed lexical analyzer and parser.

The Espresso Language Reference describes some of the semantics of Espresso, and we will discuss some more about semantic rules here. For anything not described in these places, Espresso follows the rules for Java, which are described in the Java Language Reference. Both of these documents are also linked from the course web page.

There will be a face-to-face grading meeting (i.e., demo) required for this assignment; you'll be getting more information about that as deadline approaches. Students are required to work individually on this assignment (no groups). Save your solution to this assignment, because you will need it for later assignments.

This assignment is much larger than the first two, so you'll need to start early.

Getting the assignment files

To be able to work on the assignment, create a new directory for your assignment, cd into it, and then give the following Unix command:

 gmake -f ~csci410/espresso/semant/Makefile getfiles 
This Makefile copies the files that you will need to modify for the assignment to your directory, but also symbolically links the files you will need access to, but that you should not modify.

Also, copy your espresso.flex and espresso.y files from your PA2 directory. You should not copy other files from your PA1 or PA2 directories. The "getfiles" command, shown above, links all the other files necessary to compile testsemant, as well as testparser and testlexer.

The assignment files

The files in bold below are ones you create and/or modify and submit. The files are:

Major passes of the semantic analyzer

Because not all entities have to be declared before they are used in an Espresso program, semantic analysis will require multiple passes over all or part of the AST. In particular, class names may be used before they are defined in the file, and fields and methods of a class may be used before they are defined in the file. Your semantic analyzer will have the following three passes:
  1. Get all class names, making sure that each one has a unique name.

  2. Get all methods and fields for each class, making sure that each one has a unique name in that class. Checks during this pass include making sure the types for fields are defined, and the return types for methods are defined. You do not need to process the rest of the method header (i.e., formal parameter list) during this pass.

  3. Type-check the program. This will involve a pass over the whole AST, and will involve adding declarations for formals and locals to the appropriate symbol tables, and processing all the statements, expressions, and return "statements" in the tree. See section on Type-checking for more details.
Note: the later section on Scoping Rules will provide more information about the use of symbol tables to store the information for the first two passes and how the symbol tables will be used in the third pass.

Semantic Rules

The beginning of this document gives links to the two documents you need to refer to for Espresso semantic rules. For instance, we recommend you reread the sections of the Espresso reference on Espresso Semantics and on the "main" method. As mentioned in the Espresso Language Reference, if you are unsure of some Java rule, try out a small test program on the Java compiler. In some cases this will be easier than trying to find it in the Java Language Specification. Although in other parts of this document we discuss in detail the Java scoping rules that apply to Espresso, there are many other semantic rules that are not enumerated here, primarily related to type-checking the various types of statements and expressions in the Espresso subset of Java.

As mentioned in the Espresso Language Reference, methods in Espresso may not be overloaded. This means that each class may have at most one method with a given name. In contrast, overloading is allowed in Java (and C++). Only one version of a method per class means we can identify a method uniquely by knowing its class and name -- we don't have to examine its parameter list to identify it.

Two "special" rules you will need to check are the restriction on the filename of a Java program, and the other is related to this in a static routine. Espresso doesn't have static routines in general, but main is static. Java does not allow the use of this in static functions, since there is no associated object instance.

The SymTab template class

The Symtab template class is a lookup table that maps keys -- represented as Symbol* objects -- to values whose type is specified by the template parameter; it can also maintain an arbitrary number of nested scopes. When you instantiate the template parameter with a type, it actually is saying that the values will be pointers to objects of that type. Please see the header file, SymTab.h, for complete documentation of this class, and the testSymTab.cc file for a small example of its use. We also went over this class in lecture.

In the next section, you will see that you will need multiple symbol tables, some of them needing a different value type (i.e., template instantiation) than others. This class is general enough to use in all the cases mentioned.

Scoping Rules and using the symbol table

The section on semantic analysis passes described the part of the scoping rules related to whether something needs to be declared before it is used. We'll summarize some of the other relevant Java scoping rules here: The first rule mentioned above implies that we'll need separate symbol tables for types, fields (one table per class), and methods (one table per class). The second and third rules imply that we'll have two scope levels for variables, since they can be defined at the class level or the method level.

Because of the scoping rules, you'll need to save the classSymTab, fieldSymTab (per class), and methodSymTab (per class) in the appropriate tree node for later use in type-checking (i.e., these get created in passes (1) and (2) and used in pass (3) -- see previous section on Major passes for information on what these passes are). You should also save the varSymTab variable environment for each method (i.e., don't throw it out after you type-check a method, but save it in the method node). The reason we need to save them is we will be using these symbol tables again when we do code generation (PA4) -- if we don't save them, the code generation pass will have to rebuild them.

In the later section on Type-Checking, the type environment, which we pass down to the type checking routines, refers to all the names in scope in a portion of the program. For a statement, for example, the names visible include the classes, all the methods of the classes, and all the variables that are in scope. This can be represented as two symbol tables: (1) the classSymTab and (2) the varSymTab.

This type environment gets passed around a lot, but where does it get used? The class table gets used any time we encounter a type name, to make sure that type is defined, including every time we process a declaration, and in new expressions. The var table is used for type-checking variable references and assignment expressions. We also need the type environment to type-check method calls. Recall that a method is identified by its class and name. To look up the method for a class, first you look up the class in the classSymTab, and part of its value will be its methodSymTab, from which you can look up the method you are interested in.

Please see the section on Output of the Semantic Analyzer for more on printing the contents of the symbol tables you create.

The VarInfo classes

We created a set of classes for you, defined in semant.h, to use as the types for values associated with symbols in your variable symbol tables. For other symbol tables, you will probably be using some AST node type for the value type. However, for variables, there is no one AST node that is used for all the declarations. Remember, when you refer to a variable, it might have been declared as a local, a formal, a field, or it might be this, which has no declaration in the tree. That means there are four kinds of variables. The inheritance hierarchy has a base class called VarInfo and four subclasses, one for each of the kinds of variables mentioned: LocalInfo, FormalInfo, FieldInfo, ThisInfo,

Since its Espresso type is the only piece of information we need to know about a variable during type-checking, we could have used Symbol as the value type to associate with any of the kinds of variables in the symbol table. However, this will require us to rebuild these symbol tables during code generation, since during code generation we need to store additional information about the variable, namely its location; and we can't modify the Symbol class. (Note: the example program testSymTab.cc uses Symbol as the value type.)

The reason why there is not just one type for VarInfo, but an inheritance hierarchy, is that the way you generate code for assignments to and references to the four different kinds of variables is slightly different, because they will be found in different places in memory. Thus, this hierarchy will allow us to use polymorphic function(s) to do this task. You don't have to understand how to do this now, but by using these classes now, you will save yourself some work in PA4.

The idea of using these classes is you want to be able to have entries of any of the different kinds of variable in the same variable symbol table. But, when you create an single entry it needs to be of the particular kind you are declaring. For example, if you are adding a local variable declaration to a symbol table, you will create the value by doing

 new LocalInfo(type) 

You may be wondering why we aren't just using the AST nodes related to the variable declarations (e.g., Field, Local, etc.) for this. It's because they aren't related by inheritance in the AST class hierarchy, so we wouldn't be able to stick instances of them in the same symbol table in a type-safe way.

Dealing with this

The only place this is used in the AST is as a VarRef. To type-check such nodes with it we need to know what type this is. One way to handle the issue is to add a special symbol table entry for this in our variable symbol table. By "special", we mean that it won't correspond to any declaration in the AST. Since this is in scope and has the same type for any method of a particular class, you could add it to your symbol table for fields for the class (even though it's not really a field). Note: the VarInfo inheritance hierarchy described in the previous section has a special subtype ThisInfo for for entries for this.

Type-checking

Type-checking is done as a traversal of the AST. Usually each statement and expression has its own rule for doing type-checking. This will be implemented as polymorphic typeCheck functions for the various node types. That is, most AST node types will have their own typeCheck method to typeCheck that particular construct of Espresso. They won't all necessarily have the same interface, however. But all Stmts, for example, will need to have the same polymorphic typeCheck interface, and similar with Exprs.

The general form of a typeCheck routine on an expression is:

Symbol *myexpr::typeCheck(theTypeEnvironment)   // this is just pseudo-code
{
  // type-check subexpressions
  Symbol *type1 = subtree1->typeCheck(theTypeEnvironment);
  Symbol *type2 = subtree2->typeCheck(theTypeEnvironment);

  do type check rule for this kind of expression, which probably
  involves looking at types of subtrees (type1, and type2) to figure out type
  of this instance (call it mytype).  this part will print out error messages
  if the type rules are not followed.

  setType(mytype);   // set type field in node
  return mytype;
}
So we pass the type environment (see Scoping rules) down the AST, but compute the inferred types in a bottom-up manner (i.e., typecheck subtrees of node before typechecking node). We can see the advantage of polymorphism here, because we never have to check what kind of expression is actually in the subtrees, because it will automatically use the correct version of typeCheck based on the type of the object that the pointer is pointing to.

You need to use the type field we have already defined for you to decorate the AST with types as you infer them. This is accessed via the getType and setType methods in the Expr class (i.e., all expression nodes inherit this field). The AST dump routines already print this field for us, so we will be able to see the results of type-checking via the existing call that dumps the whole tree (in testsemant.cc). Note: If we have an invariant that all typeCheck routines for Expr subclasses set the type field, then we don't have to actually return the type too, because the caller can find it in the subtrees.

Type-checking for non-Expr constructs will still involve making the calls on the subtrees, passing the correct environment down. For statements, we still need to check that the types of the subparts are correct although statements themselves have no type. For methods (i.e., definitions, not calls) the method-scope variables will be added to the type environment before calling typeCheck on the statements and return expression in that method.

The dump routines declared and implemented ast.h and astDump.cc, respectively, will be useful for you to study as a model for how to do a polymorphic tree traversal. One aspect you can look at is how how common code between subclasses is put into superclasses to avoid repetition, and how that code gets called. The diagrams of the AST class hierarchy given in the AST document may also be helpful when you want a big picture of the classes involved in type-checking and the relationships between them The diagram does not show the part-subpart relationships between objects in an AST; but the section describing each of the the constructors does.

Error reporting and recovery

Error output should be sent to the stream cerr. Unlike in the parser, here it will be easy to give informative error messages, because you will report the error at the point that you detect it. If you are unsure of how to describe an error, or whether something should even be an error, for things that would also be errors in a Java compiler you can try out a sample Espresso program in the Java compiler. You are not required to give the same messages as a Java compiler, and in some instances your message might be very different than Java's, because it might arise from restrictions in the Espresso grammar. For example, the expression
this=x
would produce a parse error from the Espresso compiler (since we treat this as a keyword), whereas the Java compiler would generate a message about not being able to use this on the left-hand side of an assignment.

We have provided you with a convenient class that has a routine to help format error messages (it will print the line number where the error occurred). See SError.h for details of how to use it. Your error messages should include any relevant identifying information, such as the identifier involved. Here is an example error message:

line 9: error: type 'Blob' for field 'y' is undefined.

Unlike in the parser, here you should be able to recover from most errors and continue processing. Of course, this can sometimes lead to many cascading errors, since you are making some kind of assumption about what was intended. You will have to decide how you want to do this for the various major categories of errors. For example, if something is doubly-defined, you can just ignore the second instance of it, and use the first one. If it is a class that is defined twice, you could still process the innards of the second class, but any other reference to that class will still mean the first one defined. If it's something like an invalid operand to a + expression, you will generate an error message, but you should just set the type of the plus expression to int, so higher parts of the tree will consider it legal.

Recovering from reference to an undefined type, you can again use int as a substitute, although as you may guess, this is not a panacea, since int, unlike all user-defined types, isn't even a reference type (i.e., you aren't allowed to call methods on int).

The general idea of error recovery is to put things in such a way that the program won't crash if you continue (e.g., always assign a valid type value in the AST while type-checking), and such that you avoid generating spurious errors. The second part of that is much more difficult to do; we will mainly be concerned here that you do the first. You are going to be documenting your your error recovery methods in the README.

Output of the Semantic Analyzer

Besides any error output, described above, your program must print the AST for the program and, when requested by the user, the symbol tables as well. We have already provided the call to dump the AST, but you need to write the calls to dump the symbol tables. This output, which will only be generated when the debug flag is set, should be sent to the stream cout. (As mentioned previously, the error output should be sent to the stream cerr.) This way, if error output interleaves with other output, it will be easy to separate them by sending the streams two different places.

The output of the symbol tables is controlled by an optional debug option to testsemant. Running it as

testsemant -d Test.esp
sets the global flag semantDebug. You will only generate your symbol table output when this flag is set. The symbol table output will appear before the AST output.

Note: debug output is to help you debug! Don't wait until you have completed your program to add the code to print out the symbol tables. This output will be helpful to verify that subsets work as you are incrementally developing the program. More on this in the next section.

For each symbol table you print, it should begin with a header line describing what it is a symbol table for, for example:

Methods of class Foo:
Then it should use the provided dump or dumpKeys functions of Symtab to do the actual printing. Please see SymTab.h for more about using these functions. These functions print the key (i.e., name) for each entry; in addition, for each of the kinds of symbol tables you must print the following information about the value of each entry: Note: Using the dump function for the SymTab class may require you to define your own << operator: see documentation in SymTab.h for more details. There is an example of defining an overloaded << operator for a user-defined class in StringTab.h and StringTab.cc.

You can either print the symbol tables as you create them, or else, after you are done with the rest of a pass, make another pass over the tree to print the symbol tables created in that pass. The advantage to the latter is it helps you check that you actually saved the symbol tables (as opposed to putting them in a local var that disappears when you go out of scope for this pass). Either way all the symbol table printing needs to happen before the return from Semant::analyzeProg.

The main program we gave you calls dump on the AST after it returns from the semantic analyzer module. This will print out the type you added to each expression node during the type-checking pass (the part that came out as no_type during PA2).

Development strategy

Once again, we strongly encourage you to use incremental development. It's especially important here, because there is a lot more code to write than in the previous assignments in this class. In the grading, as usual, we will favor a working subset over a large mess that doesn't compile or doesn't work. Fortunately this program (unlike some) lends itself to building testable subsets.

Some natural subsets here are each of the different major passes mentioned near the beginning of the document. When apportioning your time, do not think of these as three equal pieces: the first pass is a small amount of code, the second pass is a little larger, and the third pass is a lot more code than the first two. After you develop each of the first two passes you can check your results by dumping the appropriate symbol tables: this output is required for the assignment anyway.

The type-checking pass (third pass) itself can be broken into separately testable subsets. Adding the formal/local variables for each method to symbol tables can be the first subset. You can test this subset by printing the resulting symbol tables. And, of course, you can incrementally build the type-checking routines for different constructs without doing all of them at once, testing the current subset by running it with an Espresso program that only uses the constructs you have implemented.

A few other hints for successful development:

What to turn in

The Makefile includes a rule for submitting the assignment. You run it by typing gmake submit. It copies your ast.h, semant.h, semant.cc, README, Test.esp, and TestBad.esp to the course account using the local submit utility. The last version submitted is the one that will be graded: each submit overwrites the previous one.

The University of Southern California does not screen or control the content on this website and thus does not guarantee the accuracy, integrity, or quality of such content. All content on this website is provided by and is the sole responsibility of the person from which such content originated, and such content does not necessarily reflect the opinions of the University administration or the Board of Trustees