Writing a Compiler in C#: Parsing, Part 2
Before we proceed to the full BNF of a Jack expression, we need to decide which operators we’re going to support. Our final implementation will have some additional operators, but for now we’ll settle for +, –, *, /, <, >, =, &, |, and !.
One obvious question when dealing with arithmetic and relational operators is the question of operator precedence.
What’s the value of 5+3*2? Is it 16 or 11?
What’s the value of 3<2+5? Is it 6, or 1, or something else entirely depending on the integer coercion of a Boolean value?
You might be tempted to think that operator precedence is the parser’s problem, i.e. we must install special parsing code to handle multiplications before additions. Indeed, this problem can be solved by using two stacks for operators and operands (e.g. see here).
However, we can solve the problem of operator precedence at the BNF level, from which we create the parsing procedures. This is one of the neatest parsing tricks. The following BNF provides proper operator precedence for arithmetic operators:
expression ::= factor ( ('+' | '-') factor )*
factor ::= term ( ('*' | '/') term )*
term ::= digit+
From this thing the parser code becomes immediately self-evident:
procedure ParseExpression
ParseFactor()
while LookaheadToken <> ";"
op = NextToken()
assert op in [ "+", "-" ]
ParseFactor()
print op
end while
end procedure
procedure ParseFactor
ParseTerm()
while LookaheadToken <> ";"
op = NextToken()
assert op in [ "*", "/" ]
ParseTerm()
print op
end while
end procedure
procedure ParseTerm
term := NextToken()
print term
end procedure What if we wanted to add support for parentheses to modify explicitly the operator precedence in an expression? Again, before being tempted to add special cases to the parser, consider the following grammar modification:
term ::= digit+ | '(' expression ')'
The modification to the parser is also immediate:
procedure ParseTerm
if LookAheadToken = "("
ParseBLOCKED EXPRESSION
Match(")")
else
term := NextToken()
print term
end if
end procedure
Incredible! Now we’re ready to appreciate the full BNF of a Jack expression (with added operator precedence compared to the TECS version):
expression ::= relexpr ( ('&' | '|') relexpr )*
relexpr ::= addexpr ( ('<' | '>' | | '=') addexpr )*
addexpr ::= mulexpr ( ('+' | '-') mulexpr )*
mulexpr ::= term ( ('*' | '/') term )*
term ::= int-const | str-const | keyword-const |
var-name | var-name '[' expression ']' |
sub-call | '(' expression ')' |
('-' | '!') term
sub-call ::= sub-name '(' expr-list ')' |
(class-name | var-name) '.' sub-name
'(' expr-list ')'
expr-list ::= ( expression ( ',' expression )* )?
Whew! This thing recognizes four levels of operator precedence through the use of three intermediate BNF definitions—relexpr, addexpr, and mulexpr. Note that the unary operators have the highest precedence, which is guaranteed by their placement at the bottom-most term level.
As before, the parser code flows naturally and beautifully from this BNF. This time, instead of using “print term” as the placeholder for actual operations, we call the code generator whenever we have some interesting information about the parsed program. (Another alternative is to construct an intermediate representation such as a parse tree, and have the code generator analyze it in one shot.)
private void ParseBLOCKED EXPRESSION
{
ParseRelationalBLOCKED EXPRESSION;
while (IsNextTokenLogicalOp())
{
Token logicalOp = NextToken();
ParseRelationalBLOCKED EXPRESSION;
switch (logicalOp.Value)
{
case "&":
_codeGenerator.And();
break;
case "|":
_codeGenerator.Or();
break;
default:
ThrowCompilationException(...);
break;
}
}
}
private void ParseRelationalBLOCKED EXPRESSION
{
ParseAddBLOCKED EXPRESSION;
while (IsNextTokenRelationalOp())
{
Token relationalOp = NextToken();
ParseAddBLOCKED EXPRESSION;
switch (relationalOp.Value)
{
case "<":
_codeGenerator.Less();
break;
case ">":
_codeGenerator.Greater();
break;
case "=":
_codeGenerator.Equal();
break;
case "<=":
_codeGenerator.LessOrEqual();
break;
case ">=":
_codeGenerator.GreaterOrEqual();
break;
case "!=":
_codeGenerator.NotEqual();
break;
default:
ThrowCompilationException(...);
break;
}
}
}
private void ParseAddBLOCKED EXPRESSION
{
ParseMulBLOCKED EXPRESSION;
while (IsNextTokenAddOp())
{
Token addOp = NextToken();
ParseMulBLOCKED EXPRESSION;
switch (addOp.Value)
{
case "+":
_codeGenerator.Add();
break;
case "-":
_codeGenerator.Sub();
break;
default:
ThrowCompilationException(...);
break;
}
}
}
private void ParseMulBLOCKED EXPRESSION
{
ParseTerm();
while (IsNextTokenMulOp())
{
Token mulOp = NextToken();
ParseTerm();
switch (mulOp.Value)
{
case "*":
_codeGenerator.Mul();
break;
case "/":
_codeGenerator.Div();
break;
case "%":
_codeGenerator.Mod();
break;
default:
ThrowCompilationException(...);
break;
}
}
}
private void ParseTerm()
{
if (LookAheadToken.Type == TokenType.IntConst)
{
Token intConst = NextToken();
_codeGenerator.IntConst(int.Parse(intConst.Value));
}
else if (LookAheadToken.Type == TokenType.StrConst)
{
Token strConst = NextToken();
_codeGenerator.StrConst(strConst.Value);
}
else if (IsNextTokenKeywordConst())
{
Token keywordConst = NextToken();
switch (keywordConst.Value)
{
case "true":
_codeGenerator.True();
break;
case "false":
_codeGenerator.False();
break;
case "null":
_codeGenerator.Null();
break;
case "this":
...
_codeGenerator.This();
break;
default:
ThrowCompilationException(...);
break;
}
}
else if (LookAheadToken.Type == TokenType.Ident)
{
//This is either a variable name, an array access,
//or a subroutine call. We need to look ahead:
Token varName = NextToken();
if (LookAheadToken.Value == ".")
{
ParseSubCall(varName);
}
else
{
//We’ll discuss symbol tables later. For now,
//this thing only helps us to figure out if the
//variable has been defined previously.
Symbol symbol = GetClosestSymbol(varName.Value);
if (symbol == null)
{
ThrowCompilationException(
"Undefined identifier '" +
varName.Value + "'");
}
bool arrAccess = false;
if (LookAheadToken.Value == "[")
{
arrAccess = true;
Match(new Token(TokenType.Symbol, "["));
ParseBLOCKED EXPRESSION;
Match(new Token(TokenType.Symbol, "]"));
}
_codeGenerator.VariableRead(varName, arrAccess);
}
}
else if (LookAheadToken.Value == "(")
{
Match(new Token(TokenType.Symbol, "("));
ParseBLOCKED EXPRESSION;
Match(new Token(TokenType.Symbol, ")"));
}
else if (IsNextTokenUnaryOp())
{
Token unaryOp = NextToken();
ParseTerm();
switch (unaryOp.Value)
{
case "-":
_codeGenerator.Negate();
break;
case "!":
_codeGenerator.Not();
break;
default:
ThrowCompilationException(...);
break;
}
}
else
{
ThrowCompilationException(...);
}
}
I intentionally omitted the parsing procedure for sub-call because it poses an interesting challenge, up next.