The basic architecture is described well by this diagram:
V
---------- ---> Pretty-printer i
| Source | --------- ---------- / T s
| Code | ---> | Lexer |->| Parser | -----> Type-checker r i
| File | --------- ---------- ^ \ e t
---------- ^ ---> Interpreter e o
Parse Tree r
s
The lexer and the parser are on the front end of the system. As you
know, the lexer's job is to break up the source code into language tokens.
The parser then uses these tokens as terminal symbols to parse the
language grammar. The output of the parser is a parse tree, which
serves as our internal representation for the program.
Once we have the parse tree, we can do a variety of meaningful things with the program. In the diagram above are three examples of things to do with the parse tree: pretty print the source code, do static type checking on the program, or run (interpret) the program. For this assignment, we will consider all three.
In this section we describe some details of the design of each component of the system.
The Lexer class takes the entire source code as input to the constructor
via a single string. It then provides a narrow interface to the parser
for consuming tokens of the program string one by one:
Token next() consumes the next token from the input and returns itToken look() returns the next token in the input without consuming it (lookahead)Token match( TokenType tt ) consumes and returns the next token only if it is the expected type, otherwise throws a ParseError
The Lexer implementation keeps track of its location in the original
source code (line number and column number). This information is passed
on to each token, which knows its "location in the source" as well as
its token type and lexeme.
The Parser is implemented as a recursive descent parser.
Each production in the language grammar translates into a method in the
parser; each non-terminal in the grammar corresponds to a type of node
in the parse tree. The structure of each node in the parse tree is
determined by the nature of that program construct. For example, class
If has three main components: an Expression
pointer, which points to a node in the tree representing the
conditional expressions, and two Block pointers, which
point to the "then" and "else" blocks of code associated with the
If. (There is also a Token field named
"anchor", which just hangs around to aid in error reporting--recall
that tokens know what line number of the file they came from.)
The parser recognizes three "top level structures" in the program
source: Programs, Procedures, and
Functions. A parse tree is associated with each of
these. Since one source code file can hold any number of
procedure/function definitions, the parser returns a structure
called a File to respresent all the code in the source
code file. The File contains two
Dictionaries, which allow us to lookup procedures and
functions by name, and a pointer to the main program itself.
Once we have the parse tree, there are a variety of things we may want to do with it, for example: interpret the code, pretty-print the code, or typecheck it. In order to decouple the parse tree data structure from the various actors which want to traverse it, we use the Visitor design pattern.
Each node in the parse tree inherits from an abstract class
ParseTreeNode. Then, there is another abstract class
called ParseTreeVisitor from which the
PrettyPrinter, Typechecker, and
Interpreter inherit. Each ParseTreeNode can
"accept" a visitor, and each visitor knows how to visit each type of
node in the parse tree. That is to say, the
ParseTreeVisitor interface looks something like
class ParseTreeVisitor {
public:
virtual void visit( Procedure* x )=0;
virtual void visit( Function* x )=0;
virtual void visit( Declaration* x )=0;
virtual void visit( Block* x )=0;
virtual void visit( ProcedureCall* x )=0;
virtual void visit( Assign* x )=0;
virtual void visit( If* x )=0;
virtual void visit( While* x )=0;
...
only with a lot more methods (one for each separate type of node in the
parse tree). Visitors then walk the tree by visiting each node. For
example, the Interpreter, when visiting an If
node, would first visit the conditional Expression in the
node. If the expression evaluated to true, it would then visit the
"then" Block of the If node, else it would
visit the "else" Block.
TypeChecker is a subclass of
ParseTreeVisitor. Its job is to verify that the code
meets the language's static semantics. For the most part, this means
"type checking". The TypeChecker maintains a
Dictionary which maps variable names to their types, and
traverses the parse tree for each module, ensuring that variables are
used in accordance with their declaration. For example, when the
TypeChecker visits an If node, it check that
the Expression is a boolean. When it visits an
Assign node (assignment statement), it ensures that the
type of the variables on the left-hand-side matches the type of the
expression on the right-hand-side.
Procedure and function calls take a little more work to typecheck--the
TypeChecker must ensure that a procedure of function with
the right name exists, and looks up the type and number of its formal
parameters, and check them against the actual parameters. It must also
ensure that only expressions with lvalues are passed as
inout or out parameters.
The Interpreter is another ParseTreeVisitor
subclass. It takes a File parameter to its constructor
and interprets the Program in that File.
Interpreting the program is as simple as traversing the parse tree for
the program. The Interpreter maintains a
Stack of referencing environments--one for each module on
the current activation stack. The referencing enviornments are just
Dictionaries which map variable names to
Values. Value is an abstract class with
three concrete subclasses (Bool, Int, and
String) which model the values of the three types of
variables in our language.
The semantics for visiting each type of node in the parse tree are
intuitive. For example, visiting a Declaration causes us
to add new entries in the current referencing environment. To visit an
Assign statement, we visit the right-hand-side
Expression to evaluate it, and then look up the
left-hand-side Var in the referencing environment and
set() a new Value. The result of visiting
the Expression in an If determines whether we
visit the "then" Block or the "else" Block.
To call a procedure or function, all the actual parameters are evaluated and stored, and then the called module is visited. When that module begins, it pushes a new referencing environment onto the stack, adds the parameters to its environment, and then runs its course.
A pretty-printer is a tool which reformats source code by manipulating
whitespace so the source code is indented "pretty". It can be
implemented as yet another subclass of ParseTreeVisitor.
As each node is visited, the code corresponding to the node is printed
with the correct indentation and newlines.
A precise specification of the language would be a pain in the neck, and no fun to read. So instead, this section merely points out some highlights of the language.
To give a feel for the language, here's a small program that computes the greatest common factor of two numbers:
function gcf( in int x, in int y ) returns int
begin
if( y > x ) then
return gcf( y, x );
else
if( x mod y = 0 ) then
return y;
else
return gcf( x mod y, y );
endif
endif
endfunction
program foo
int x, y;
begin
print( "enter two numbers (0's to quit): " );
read( x );
read( y );
while( x>0 and y>0 ) do
print( "gcf is ", gcf(x,y) );
print( "enter two numbers (0's to quit): " );
read( x );
read( y );
endwhile
endprogram
There is one parameter-less main program, accompanied by any number of
procedures and functions. There are three kinds of data:
int, string and bool. There are
also three kinds of parameters: in, out, and
inout. Functions may not have side-effects. Declarations
for variables appear at the top of a block for each module. There are
no global variables. Statements are terminated with semicolons. The
two main control constructs are if and
while. Recursion is supported.
There are a number of builtin operators. They are listed here, from highest to lowest precedence:
* div mod + - . < <= > >= = <> not and orThe "dot" is the string concatenation operator, as in perl. All operators associate left-to-right.
In addition to those operators, there are a few named operations:
substr( str, start, length ) -> string length( str ) -> int print( <exprlist> ) read( <varlist> )
String literals appear in double-quotes. true and
false are boolean values. Integer literals are
supported. The language is case-sensitive. Comments are not
supported.
The sample programs are suggestive of the grammar that describes the language, so no BNF is given.
We could just have you write the whole thing, but that'd be no fun, because it's big (about 2000 lines of code) and you only have a short period of time to do this project. So instead, we're giving you most of the code, and you just have to add a few features and make a few modifications. Hopefully in the process you'll learn a bit about C++ and also about programming languages in general.
Before you do anything, you should grab a copy of the code (which is located in the directory
~lorgon/Inton lennon.cc), compile it, and run it on some of the sample test files. Then you should try to read over some of the C++ code and compare it to the architecture described above, to get a feel for what source code files map into which parts of the architecture. Once you've a basic feel for the code, you can actually get started working on the project.
First, a number of method bodies have been removed and replaced with a
comment saying "WRITE THIS". You should implement those
methods--until you do, many of the sample programs we've given you won't
parse, typecheck, or run properly.
Second, the length operator for strings has been completely
omitted from the code. To add this feature to the interpreter "from
scratch", you will need to:
LENGTH to the TokenType enumerated type in lexer.hLENGTH in tttos() and test_keyword() in lexer.ccLength class in parse_tree.h (Hint: it will look a lot like Not) and add a visit(Length*) method to the ParseTreeVisitor interfacequx() in parser.ccLength in typechecker.cc and interpreter.cc (as well as declarations in their .h files)At this point, the interpreter should conform to the language spec.
Next, you must implement an interesting feature of your own. What you do is up to you. Here are a few suggestions you might consider, in approximately increasing order of difficulty:
loop...exitif...endloop in 1501 pseudocode.
if( x = 1 ) then
do_something();
elseif( x = 2 ) then
do_something_else();
else
meow();
endif
Modify the interpreter so this works.There are, as always, a number of resources available to you. For general questions/discussion, the newsgroup is good; since all three sections of 3411 are doing this assignment, you may wish to crosspost to all three groups. TA/instructor office hours or email are available for more specific questions.
There will also be a C++ help session centered around the assignment (time and date to be announced).
Get started early! Even if you're not writing any code yet, start looking over the given code and digesting it.
You should turn in all of your source code and the Makefile. In addition, you should include a README file which describes (1) any problems you are aware of with your program and (2) a description of the feature you added. You should also turn in at least two sample code files (call them "mytest01" and "mytest02") which demonstrate your language feature.
Put everything into a directory, go into workon, and type
turnin p1 directory_nameto turn in your assignment.
The due dates vary across sections; you can always use
turnin --scheduleto find out when workon thinks it is due for your section. Maybe.