CS 410 Programming Assignment 4

Fall 2006 [Bono]
Early Bonus Deadline: Tuesday, November 21, 11:59pm
Due: Tuesday, Nov 28, 11:59pm

Table of Contents


Introduction

In this assignment you will write a code generator for the Espresso language. When you are done the executable created by the Makefile will be a complete Espresso compiler! And you will be able to run the code generated by your compiler (more on that later).

To complete this program you'll need your completed lexical analyzer, parser, and semantic analyzer. The semantic analyzer and code generator are much more closely connected in code than the previous parts were, in that they both operate on the same data structures (the AST and the symbol tables). In fact, doing this assignment may involve making small modifications to your semantic analyzer code, so when you submit, you will be resubmitting your code for the semantic analyzer too. We will not be testing the error-checking parts of the previous phases, so if you didn't have time to get those implemented, you do not need to do so now. Put another way, we will only test your Espresso compiler with correct Espresso programs, since they are the only ones for which code gets generated.

In the last assignment we were concerned with static semantics, i.e., the parts of the semantic rules that can be checked at compile-time. In this assignment you will be concerned with operational semantics, or how an Espresso program behaves in execution. While the Espresso Language Reference describes the subset we are using, it doesn't expressly describe the run-time semantics. For anything not described in the Espresso document, Espresso follows the rules for Java, which are described in the Java Language Reference. Both of these documents are also linked from the course web page. Recall also that you can write little test Java programs if you are unsure of the behavior of some feature.

There will be a face-to-face grading meeting (i.e., demo) required for this assignment; you'll be getting more information about that as deadline approaches. Students are required to work individually on this assignment (no groups). This assignment is at least as big as PA3, so start right away.

Oh, yeah, the early bonus deadline. This 10% bonus is encouragement for you to complete the assignment early.

Setting up your development environment

First, you will need to modify your path variable so that you can use the spim MIPS simulator. Put the following line at the end of your .login file:
set path=(~csci410/espresso/bin $path)
To have the change take effect immediately, do the Unix command source .login.

Getting the assignment files. To be able to work on the assignment, create a new directory for your assignment, cd into it, and then give the following Unix command:

 gmake -f ~csci410/espresso/codegen/Makefile getfiles 
This Makefile copies the files that you will need to modify for the assignment to your directory, but also symbolically links the files you will need access to, but that you should not modify.

You will need to copy the following files from your PA3 directory to your PA4 directory:

The "getfiles" command, shown above, links all the other files necessary to compile espc (your complete compiler), testsemant, testparser and testlexer.

The assignment files

The files in bold below are ones you create and/or modify and submit. The files are:

Major passes of the code generator

Because fields can be used before they are defined in an Espresso program, you will need at least two passes to do code generation. The first pass assigns offsets for all fields inside each class and determine the size of each object type, and then the second pass does the rest of code generation.

The "rest" includes assigning offsets for formals and locals, but this can be combined with the rest of code generation in a single pass, since the declarations come before any uses.

Entry point for the code generator

The main program in espc.cc calls the other parts of the compiler, then, if there were not errors in the other parts, it opens the ".s" file that the MIPS instructions will get written to, it creates a CodeGen object, and then calls the member function genCode on this object. You may add code to the CodeGen constructor as well, but the main entry point to your code-generation code is CodeGen::genCode.

The CodeGen constructor gets three things from the driver program:

  1. the ast, which in turn should contain all the symbol tables except the one for classes (because that one is not stored in the ast).
  2. the classSymTab that was created in semantic analysis
  3. the output stream to write your MIPS instructions to.

In PA3 we mentioned that you were allowed to change what the valuetype was for entries in classSymTab. If you did that, i.e., if your classSymTab is not type SymTab<Class>, you will have to make your own version of esp.cc that passes in the correct variable for the class table, and you will need to change the Makefile so that it submits your version of esp.cc.

Use of the symbol tables in code generation

We will need to add additional information to our symbol table entries to do code generation. What that code looks like depends on how you instantiated the symbol table class for the various symbol tables.

In PA4 we are using symbol tables to get information about generating code for symbols. That means we need to store additional information to associate with a name. For example, when we generate code for the expression new Foo() we need to get information about the size of a Foo object. When we generate code for a reference to a variable (AST node VarRef), we need to know whether this variable is on the heap (fields) or on the stack (formals and locals), and what its offset is (depending on what kind of variable it is, the offset will be relative to the this object or in the stack frame). These are not the only uses of the symbol tables, just a few examples.

So what does this imply for the symbol tables we already created? In code generation we will need to traverse over all the declarations of symbols (in the AST) to add the additional information needed to their symbol table entries for generating code. When we generate code for expressions, such as the ones mentioned above, is where we have to use the information added when we processed the associated declaration.

The implementation issue that comes up is: is the value-type used in a given symbol table amenable to adding fields to store this additional information? For a class symbol table with value type Class, i.e., symbol table is type SymTab<Class>, the answer is yes; and similarly if the value type for a method table is type Method. We can just add any fields or methods needed for code generation to the Class or Method AST classes.

If you used VarInfo and associated subclass objects for entries in your variable symbol tables, then this will not be a problem (more on this in the next section). However, if you used SymTab<Symbol>, you will not be able to add additional C++ fields for code generation, because the Symbol represents the name of a type: it's not an object specific to one symbol table entry (e.g., all variable table entries with type Foo use the same Symbol object for Foo).

If your variable table values are Symbols, we recommend you just leave the semantic analysis module as is, and just build new variable symbol tables for use in code generation. The VarInfo class hierarchy was discussed in more detail in a section on it in the PA3 assignment description, and in comments in semant.h, where the classes are defined.

Using the VarInfo hierarchy in code generation

As we explained in PA3, the reason we have different classes for the symbol table entries for different kinds of variables is that the way you generate code for assignments to and references to the four different kinds of variables may be slightly different, because they will be found in different places in memory. Thus, this hierarchy will allow us to use polymorphic function(s) to do this task; e.g., you could have polymorphic genVarRef and genVarUpdate function which behaves differently for each of the four subclasses. Just like with any such hierarchy, if there is code which is common to all or some of the subclasses, you can push it up to the superclass. One style of doing this is to make a protected superclass function to do the common part; this function may be called by the virtual function whose implementation is class-specific: defaultDump in the VarRef hierarchy is an example of this.

MIPS and the spim MIPS Simulator

Your code generator is going to generate instructions for the MIPS machine. We are going to run that generated code under the MIPS simulator called spim.

For information on the MIPS architecture and instruction set, please refer to the documentation on spim that is accessible in the Documentation section of the web page. This documentation also explains how to use the spim and xspim tools.

In an earlier section we described how to set up your environment so you can run spim. There are actually two versions of this tool, a command-line version called spim, and a GUI version called xspim. The first version is probably easier for just running a MIPS program generated by your compiler, but the second one might be easier for debugging a MIPS program generated by your compiler (although both have the same basic debugging commands available).

The program testemit.cc generates a small MIPS program, which you can run on the simulator to try it out. The next section spells out how to compile and run testemit.

Emit functions and NameGenerator

The files emit.h and emit.cc define a few useful functions for formatting MIPS instructions. A few of these functions don't actually emit code to a file, but instead generate a string that can be passed to an emit function. See testemit.cc for an example of how to use given the functions together. You can also compile and run it to see the code it generates. You can then run the MIPS program it generates under spim. Here is a sample command sequence to do this. (Note that, unlike your Espresso compiler, testemit generates its code to cout.)
gmake testemit
testemit > testemit.s
spim testemit.s

Feel free to define similar functions and constants (there are some of the latter defined in testemit.cc) to help simplify your code generation code. E.g., you could write a function for generating the code to push something on the stack. You should put these definitions in your espc.cc file.

The NameGenerator class is the other thing defined in emit.h. It is a class for generating unique names for labels when you generate code. When you generate code for control structures, you will need labels for locations that your code jumps to. If, for example, there are 10 loops in the generated code they can't all use the same label names, so this class gives you a new unique label name every time you call gen(), e.g., the first three calls to gen() on a NameGenerator return the strings "label0", "label1", "label2". We created a NameGenerator object for you: the global variable, lgen is defined for you near the top of codegen.cc.

For more details on the tools provided in the emit package, see the comments in emit.h

The Run-Time System

The run-time system consists of some hand-coded MIPS code that is automatically loaded when you run spim (or xspim). We will document the interface to that code here, but if you want to look at this code it is in ~csci410/espresso/lib/exceptions.s

The code you generate will have to conform to the interface of the run-time system, which we will describe here. The run-time system has these main parts: startup code, which invokes your main method; routines for doing I/O and for allocating dynamic data; one routine for a runtime error you need to catch; and default exception handling code for MIPS. The rest of this section describes what the run-time system assumes about the generated code, and what the run-time system provides for the generated code.

Execution start-up

When you start spim, the following things happen: a function called main is called with no arguments. Register $ra contains the return address. When control returns from main, execution halts with the message "Espresso program successfully executed."

Your code generator must define the global label main.

Run-time system routines

These routines only write scratch register $v0, except _walloc, which writes $a0 (more about that below).
_println
Prints a newline. Takes no arguments. Register $a0 is unmodified by this function.

_int_println
Prints an integer followed by a newline. Integer is passed in register $a0. Register $a0 is unmodified by this function.

_walloc
_w(ord)alloc allocates n contiguous words on the heap. The argument, n, is passed in register $a0. Address of the start of the memory allocated is returned in $a0. All words in the memory allocated are initialized to the value 0. On MIPS the heap grows towards larger addresses, so the last address of the memory allocated will be greater than (or possibly equal to, if we only consider word addresses) the address returned by _walloc.

_null_abort
Should be called when a method call is attempted on a null object. Prints the line number of the source code, prints an informative message, and aborts the program. Line number is passed in register $a0.

MIPS Stack and Register conventions

Section A.6 (starting on p. 22) of the spim documentation describes the register and calling conventions used in MIPS programs. You should stick to the general conventions for register use. However, your call/return sequence will probably be different from the example shown. In particular, we recommend you don't pass parameters in registers, because it's more difficult, especially since it means the code for some parameters would need to be different than for others. However, we discuss below how you might want to pass this in a register.

As discussed in lecture, you will have to design an activation record layout and a call and return sequence for methods, since all methods have to be in agreement as to how this proceeds. For function calls and other expressions you will need to be concerned with Java requirements for the order of evaluation of subexpressions: the rules are generally that things must be evaluated from left to right.

Since every method call has the one parameter that appears before the dot (i.e., the object we're calling the method on), you may want to pass it in a register, such as $a0, rather than putting it in the activation record. That value corresponds to this in the called function. You may also want to keep the current value of this in a register during method execution, because any reference to a field is going to involve its value, since it holds the starting address of the object containing the field. This is an example of a long-lasting value that would be preserved across function calls, so you would want to use a callee-saved register, such as $s0. What that means is that the callee saves the old value of the register (here, the "old $s0", that is, the caller's this pointer), before writing the new value of the register (the current this pointer), and then restores the old value before returning. Note that storing this in a register implies the way you generate code for references to this will be different from how you generate code for uses of other variables.

Naming conventions

We recommend you use a logical naming scheme for labels for methods so that it is easy to reference them. For method <method> in class <class> , you should use the name
<class>.<method>
One exception to this is the previously mentioned main method, which has to have that exact name, since the run-time system is going to call it.

You won't be able to use fixed names for labels related to control structures. Handling this is discussed in more detail in the section on the name generator.

Your generated code will not need any static data, so you won't need any names for addresses in the data segment.

Hints about a few language features

One big simplification with how we can do things over what Java has to do concerns method calls. In particular, since we have no inheritance or overriding, we don't need to implement dynamic method binding. Instead, our method calls will involve jumping (using instruction jal) to fixed locations (whose naming scheme was described in the previous section).

In Java (and Espresso) when two reference type values (i.e., anything that's not int or boolean) are compared for equality (== or !=), a pointer compare is done. Put another way, we are comparing if the two operands refer to the same object (same identity), not whether the two objects have the same value. This has implications for the way we create objects. In the case of creating instances of a class that has no fields, it means we still need to create an object on the heap for it, so it has a different identity other instances of the same class. You can achieve this by allocating a (dummy) word for those objects. Alternatively you can have an extra header field at the top off all of your objects; you could use it to store the size of the object, although the generated code for Espresso won't ever need to use the size information (but it could be useful if we were recovering unused memory via garbage collection).

Also, don't forget that && is a conditional and (i.e., a control structure). See Java manual for details.

Running the Compiler

Once you compile the compiler, you can run it with a command such as,
espc Test.esp
which will produce the file Test.s. You can then run the resulting MIPS program with the command
spim Test.s
In addition you can run the compiler in two kinds of debug modes, controlled with the following compiler options:
-s
Prints your debugging output for semantic analysis phase. (This was controlled by a -d flag in PA3.)
-c
Print debugging output for code generation phase.
The main program we have given you handles these options, and sets the global flags semantDebug and/or codegenDebug based on them. The semantDebug flag will just do whatever it did in your PA3, except we added a line so that it also prints your AST after semantic analysis only if this flag is on (in contrast, the PA3 driver always printed the AST). The second flag, codegenDebug, you can use as you wish to have the program print additional output to cout (not to the .s file) for debugging purposes. This is only for your convenience: you are not required to produce any output based on the -c option.

Development strategy

You can develop the code for the first pass (see section on passes) as a testable subset, where you check your results by dumping the symbol tables, including additional information you add to them for code generation (with printing controlled by codegenDebug flag described above). Note: the VarInfo class hierarchy has dump functions defined for each class that you can modify to print additional information.

But to get a subset that generates code that will run in spim you will need to do somewhat more. In particular, you'll need to design your activation record and call and return sequence before you'll be able to generate much testable MIPS code. This is because you need the activation record to access local variables or parameters.

Once you design your activation record, you can build subsets that dump more symbol tables as you modify them, or that generate code for part of the call and return sequence (i.e., code that we probably can't run in spim yet).

Once you get calls and returns working, you can build code generation for the various language features incrementally.

If you do not complete the project, a desirable working subset is a working compiler for a subset of Espresso (i.e., generates code that runs). In contrast, generating correct instructions for individual features is not so great if the instructions aren't put together correctly to make a MIPS program that is equivalent to the given Espresso program (e.g., if you generate the correct code to access a parameter, but your code to set up the parameters doesn't work).

What to turn in

The Makefile includes a rule for submitting the assignment. You run it by typing gmake submit. It copies your ast.h, semant.h, semant.cc, codegen.h, codegen.cc, README, and Test.esp to the course account using the local submit utility. The last version submitted is the one that will be graded: each submit overwrites the previous one.

The University of Southern California does not screen or control the content on this website and thus does not guarantee the accuracy, integrity, or quality of such content. All content on this website is provided by and is the sole responsibility of the person from which such content originated, and such content does not necessarily reflect the opinions of the University administration or the Board of Trustees