Building a simple JavaScript like interpreted language from scratch
This blog post will walk through how to build a simple interpreted language with JavaScript like syntax that will be able to run the following program.
var x = 5;
var y = 6;
function greaterThanTen(value) {
if (value > 10) {
return true;
}
else {
return false;
}
}
greaterThanTen(x + y);
This language will have some large restrictions. In particular...
- only two data types (booleans, integers)
- support for only binary expressions
for example 1 + 2 + 3 will need to be var x = 1 + 2; var y = x + 3; - no loops or data structures
None the less this this guide will still demonstrate the basics of how an interpreted language can work and the project result will be a great sandbox that can be extended and modified to learn even more.
Initializing the project
mkdir interpreter_from_scratch && cd interpreter_from_scratch
dotnet new console
Tokenizing Input
The first step in an interpreter is to tokenize or lex input.
The "code" input is nothing more than a long string that must be analyzed character by character. Lexical analysis or tokenizing is the process of translating this string input into a series of easier to process tokens containing only the relevant information.
Something like "var a = 5;" will be translated into 5 tokens
(VAR, IDENTIFIER, ASSIGN, INTEGER, SEMICOLON)
These tokens will contain the type of token and the literal string value that comprises the token.
Starting with defining the TokenType and Token models.
The TokenType model will define all the vocabulary that makes up the programming language.
TokenType.cs
namespace interpreter_from_scratch
{
public enum TokenType
{
IDENTIFIER,
INTEGER,
TRUE,
FALSE,
FUNCTION,
VAR,
IF,
ELSE,
RETURN,
ASSIGN,
EQUALS,
DOESNOTEQUAL,
GREATERTHAN,
LESSTHAN,
PLUS,
MINUS,
SLASH,
ASTERISK,
LEFTPAREN,
RIGHTPAREN,
LEFTBRACE,
RIGHTBRACE,
COMMA,
SEMICOLON,
ENDOFFILE,
}
}
Token.cs
namespace interpreter_from_scratch
{
public class Token
{
public TokenType Type { get; set; }
public string Literal { get; set; }
public Token(TokenType type, string literal)
{
Type = type;
Literal = literal;
}
}
}
Next is to create a Lexer class. This will take the code string input as an initialization parameter and walk through this input producing Tokens along the way based on that input.
Lexer.cs
namespace interpreter_from_scratch
{
public class Lexer
{
public int Position { get; set; }
public string Input { get; set; }
public Token CurrentToken { get; set; }
public Token PeekToken { get; set; }
public Lexer(string input)
{
Input = input;
Position = 0;
}
}
}
Besides the input we will also use the properties Position, CurrentToken and PeekToken. Position is the index of the input string the Lexer analyzes. The PeekToken will be used to look ahead at the next token input . The primary method in this Lexer class will be to fetch the next token which will populate the CurrentToken and PeekToken. The first thing that needs to be done to get the next token is to get the next relevant character in the string. This means skipping over any whitespace. Once a relevant character is found, based on the character type a Token can be created.
CurrentToken will take the existing value of PeekToken and PeekToken will take the value of the new token produced. If the input runs out an ENDOFILE token is created.
NextToken will be called twice in the initializer. This is to populate both the CurrentToken and PeekToken on initialization.
public class Lexer
{
// other code
public Lexer(string input)
{
Input = input;
Position = 0;
// Load first two tokens
NextToken();
NextToken();
}
public void NextToken()
{
SkipWhitespace();
Token? token = null;
CurrentToken = PeekToken;
if (Position >= Input.Length)
{
PeekToken = new Token(TokenType.ENDOFFILE, "");
return;
}
// Get token type
if (token != null)
{
PeekToken = token;
}
}
private void SkipWhitespace()
{
while (Position < Input.Length && char.IsWhiteSpace(Input[Position]))
{
Position++;
}
}
}
Once the Lexer finds a relevant character that needs to be processed, it will create a new token and assign it a type based on the character type. The easiest characters to tokenize are the symbol characters.
// other code
public class Lexer
{
// other code
public void NextToken()
{
SkipWhitespace();
Token? token = null;
CurrentToken = PeekToken;
public void NextToken()
if (Position >= Input.Length)
{
PeekToken = new Token(TokenType.ENDOFFILE, "");
return;
}
var character = Input[Position];
switch (character)
{
case '=':
if (Position < Input.Length - 1 && Input[Position + 1] == '=')
{
token = new Token(TokenType.EQUALS, "==");
Position += 2;
}
else
{
token = new Token(TokenType.ASSIGN, "=");
Position++;
}
break;
case '+':
token = new Token(TokenType.PLUS, "+");
Position++;
break;
case '-':
token = new Token(TokenType.MINUS, "-");
Position++;
break;
case '*':
token = new Token(TokenType.ASTERISK, "*");
Position++;
break;
case '/':
token = new Token(TokenType.SLASH, "/");
Position++;
break;
case '!':
if (Input[Position + 1] == '=')
{
token = new Token(TokenType.DOESNOTEQUAL, "!=");
Position += 2;
}
break;
case '>':
token = new Token(TokenType.GREATERTHAN, ">");
Position++;
break;
case '<':
token = new Token(TokenType.LESSTHAN, "<");
Position++;
break;
case '(':
token = new Token(TokenType.LEFTPAREN, "(");
Position++;
break;
case ')':
token = new Token(TokenType.RIGHTPAREN, ")");
Position++;
break;
case '{':
token = new Token(TokenType.LEFTBRACE, "{");
Position++;
break;
case '}':
token = new Token(TokenType.RIGHTBRACE, "}");
Position++;
break;
case ',':
token = new Token(TokenType.COMMA, ",");
Position++;
break;
case ';':
token = new Token(TokenType.SEMICOLON, ";");
Position++;
break;
default:
break;
}
if (token != null)
{
PeekToken = token;
}
}
}
Some extra consideration needs to be made EQUALS and DOESNOTEQUAL as these symbols require two different specific characters in order.
If the input character is not a symbol this means it needs to be either a keyword (FOR, IF etc.), identifier, integer or invalid value.
In this programming language identifiers must start with a letter character and not a number. Therefore, we can assume if we run into a number it is an integer.
public class Lexer
{
// other code
public void NextToken()
{
// other code
switch (character)
{
// other code
default:
if (char.IsNumber(character))
{
var number = "";
while (Position < Input.Length && char.IsNumber(Input[Position]))
{
number += Input[Position];
Position++;
}
token = new Token(TokenType.INTEGER, number);
break;
}
// other code
}
}
}
If the character is not a number, it must be a keyword or identifier. To determine if the sequence is a keyword the Lexer will contain a map of keywords and their token type. If the result is found in the map, the new Token will be set to the adjacent token type. If not, it will be considered an identifier.
If the character is neither a letter nor an integer, it is some sort of unknown character that the interpreter does not support. For this an error is thrown.
public class Lexer
{
private readonly Dictionary<string, TokenType> _keywords = new Dictionary<string, TokenType>
{
{ "var", TokenType.VAR },
{ "true", TokenType.TRUE },
{ "false", TokenType.FALSE },
{ "if", TokenType.IF },
{ "else", TokenType.ELSE },
{ "function", TokenType.FUNCTION },
{ "return", TokenType.RETURN }
};
// other code
public void NextToken()
{
// other code
switch (character)
{
// other code
default:
else if (char.IsLetter(character))
{
var identifier = "";
while (Position < Input.Length && (char.IsLetter(Input[Position]) || char.IsDigit(Input[Position])))
{
identifier += Input[Position];
Position++;
}
if (_keywords.ContainsKey(identifier))
{
token = new Token(_keywords[identifier], identifier);
break;
}
token = new Token(TokenType.IDENTIFIER, identifier);
break;
}
throw new Exception($"Unknown token: {Input[Position]} at position {Position}");
}
// other code
}
}
And that's it for the Lexer.
Feel free to sanity check your code with these unit tests LexerTests.cs (opens in a new tab)
Parsing an Abstract Syntax Tree
The next step after the tokens can be created is to construct an abstract syntax tree. This will be a tree representation of the program which can be walked down to evaluate.
The nodes of the tree will also be known as statements. A statement is just an instruction for the computer to do something. The only required property this class will hold is the Token it is derived from. This will also be defined as abstract since it will only be an inherited class. Since there will be several new classes for this parser a separate directory Ast is created.
Ast/Statement.cs
namespace interpreter_from_scratch.Ast
{
public abstract class Statement
{
public Token Token { get; set; }
}
}
A Parser class will be created which takes a Lexer as a required property on initialization. This Parser class will be used to generate an InterprterProgram or the abstract syntax tree which is a list of statements.
Ast/InterpreterProgram.cs
namespace interpreter_from_scratch.Ast
{
public class InterpreterProgram
{
public IEnumerable<Statement> Statements { get; set; }
public InterpreterProgram(IEnumerable<Statement> statements)
{
Statements = statements;
}
}
}
Parser.cs
using interpreter_from_scratch.Ast;
namespace interpreter_from_scratch
{
public class Parser
{
public Lexer Lexer { get; set; }
public Parser(Lexer lexer)
{
Lexer = lexer;
}
public InterpreterProgram ParseProgram()
{
var statements = new List<Statement>();
return new InterpreterProgram(statements);
}
}
}
Now to go through each of the features the language supports and implement a way to parse them into a statement node. The features this language supports include.
- Variable assignment (var x = 5;)
- Return statements (return 5;)
- If/else statements (if (true) { return 5; } else { return 10; })
- Functions (function add(x, y) { return x + y; })
- Expressions
- Booleans
- Integers
- Binary Expressions (5 + 5;)
An expression is anything that produces a value. The REPL will be able to support something like "2 + 2;", and have it produce the value "4".
Since many of the other statement types will contain an Expression, it's best to implement parsing of Expressions first.
Like statements there will be a base expression class that will look very similar, only containing a Token value.
Ast/Expression.cs
namespace interpreter_from_scratch.Ast
{
public abstract class Expression
{
public Token Token { get; set; }
}
}
Next two models are created for the two data types, Bools, and Integers. These inherit the Expression base class but also contain a value.
Ast/Bool.cs
namespace interpreter_from_scratch.Ast
{
public class Bool : Expression
{
public bool Value { get; set; }
public Bool(Token token, bool value)
{
Token = token;
Value = value;
}
}
}
Ast/Integer.cs
namespace interpreter_from_scratch.Ast
{
public class Integer : Expression
{
public int Value { get; set; }
public Integer(Token token, int value)
{
Token = token;
Value = value;
}
}
}
With these two models the ParseExpression method can be started. Depending on the TokenType this will create a Bool or Integer Expression object.
using interpreter_from_scratch.Ast;
namespace interpreter_from_scratch
{
public class Parser
{
// other code
private Expression ParseExpression()
{
var token = Lexer.CurrentToken;
switch (token.Type)
{
case TokenType.INTEGER:
return new Integer(token, int.Parse(token.Literal));
case TokenType.TRUE:
case TokenType.FALSE:
return new Bool(token, token.Type == TokenType.TRUE);
default:
return null;
}
}
}
}
Next the parser needs the ability to parse identifiers into expressions. Something like "var x = 5; x;" in a REPL will require the interpreter to return the value of x. The identifier expression object will look like the two above except it will hold the identifier name. In this case x. These values will be translated into their bool or integer values later when evaluating the tree.
Ast/Identifier
namespace interpreter_from_scratch.Ast
{
public class Identifier : Expression
{
public string Value { get; set; }
public Identifier(Token token)
{
Token = token;
Value = token.Literal;
}
}
}
This case can now also be parsed.
Parser.cs
// other code
private Expression ParseExpression()
{
var token = Lexer.CurrentToken;
switch (token.Type)
{
case TokenType.IDENTIFIER:
return new Identifier(token);
case TokenType.INTEGER:
return new Integer(token, int.Parse(token.Literal));
case TokenType.TRUE:
case TokenType.FALSE:
return new Bool(token, token.Type == TokenType.TRUE);
default:
return null;
}
}
Finally, the parser needs to parse binary expressions. If an INTEGER Token is encountered there is a chance it could be succeeded by a PLUS and another INTEGER.
Rather than creating an integer expression a binary expression will be created which can then be evaluated and produce a value.
The binary expression model will hold a left expression, right expression, and operation token type.
Ast/BinaryExpression.cs
namespace interpreter_from_scratch.Ast
{
public class BinaryExpression : Expression
{
public Expression Left { get; set; }
public TokenType Operation { get; set; }
public Expression Right { get; set; }
public BinaryExpression(Expression left, TokenType operation, Expression right)
{
Left = left;
Operation = operation;
Right = right;
}
}
}
In the ParseExpression method instead of assuming an integer or boolean will be a standalone value, the parser will place the value in a left expression variable. If the next token is a binary operation, the parser then knows to expect a binary expression, if not it returns the left expression.
Parser.cs
public class Parser
{
private readonly HashSet<TokenType> _binaryOperations = new HashSet<TokenType>
{
TokenType.PLUS,
TokenType.MINUS,
TokenType.SLASH,
TokenType.ASTERISK,
TokenType.EQUALS,
TokenType.DOESNOTEQUAL,
TokenType.GREATERTHAN,
TokenType.LESSTHAN,
};
// other code
private Expression ParseExpression()
{
var token = Lexer.CurrentToken;
Expression leftExpression;
switch (token.Type)
{
case TokenType.IDENTIFIER:
leftExpression = new Identifier(token);
break;
case TokenType.INTEGER:
leftExpression = new Integer(token, int.Parse(token.Literal));
break;
case TokenType.TRUE:
case TokenType.FALSE:
leftExpression = new Bool(token, token.Type == TokenType.TRUE);
break;
default:
return null;
}
if (_binaryOperations.Contains(Lexer.PeekToken.Type))
{
return ParseBinaryExpression(leftExpression);
}
return leftExpression;
}
private BinaryExpression ParseBinaryExpression(Expression left)
{
Lexer.NextToken();
var operation = Lexer.CurrentToken;
Lexer.NextToken();
var right = ParseExpression();
return new BinaryExpression(left, operation.Type, right);
}
}
That covers parsing expressions. Next is statements starting with parsing variable assignments.
First is to create a Var model. This will inherit the statement class and contain an identifier or variable name, and a value it is being assigned to.
Ast/Var.cs
namespace interpreter_from_scratch.Ast
{
public class Var : Statement
{
public Identifier Identifier { get; set; }
public Expression Value { get; set; }
public Var(Token token, Identifier identifier, Expression value)
{
Token = token;
Identifier = identifier;
Value = value;
}
}
}
Next, the ParseStatement method can be started. This method will determine the type of statement based on the token value. If the token type is anything else, it will assume it is an expression. For example, in a REPL the code "5 + 5;". To handle this, a ExpressionStatement model is created that will just contain an expression.
ExpressionStatement.cs
namespace interpreter_from_scratch.Ast
{
public class ExpressionStatement : Statement
{
public Expression Expression { get; set; }
public ExpressionStatement(Token token, Expression expression)
{
Token = token;
Expression = expression;
}
}
}
Parser.cs
// other code
private Statement ParseStatement()
{
switch (Lexer.CurrentToken.Type)
{
case TokenType.VAR:
return ParseVarStatement();
default:
return ParseExpressionStatement();
};
}
private Var ParseVarStatement()
{
var token = Lexer.CurrentToken;
ExpectNextToken(TokenType.IDENTIFIER);
var identifier = new Identifier(Lexer.CurrentToken);
ExpectNextToken(TokenType.ASSIGN);
Lexer.NextToken();
var value = ParseExpression();
ExpectNextToken(TokenType.SEMICOLON);
return new Var(token, identifier, value);
}
private ExpressionStatement ParseExpressionStatement()
{
var token = Lexer.CurrentToken;
var value = ParseExpression();
ExpectNextToken(TokenType.SEMICOLON);
return new ExpressionStatement(token, value);
}
private void ExpectNextToken(TokenType type)
{
if (Lexer.PeekToken.Type != type)
{
throw new Exception($"Expected next token to be {type}, got {Lexer.PeekToken.Type} instead.");
}
Lexer.NextToken();
}
With this change a helper function ExpectNextToken is introduced. This checks that the next token is exactly what is expected based on the syntax rules. If it is it proceeds to the next token. If not, it will throw an error.
ParseExpressionStatement gets the expression from the current Token as its value and expects a semicolon. This is purely a syntax style decision. The expectation is expression statements will end with a semicolon.
ParseVarStatement uses ExpectNextToken to enforce syntax rules on variable assignments. The sequence should contain an identifier, assignment, and expression. All of these are verified, and a new Var statement is created.
Parsing return statements will look similar. The return statement will only hold the return value.
Ast/Return.cs
namespace interpreter_from_scratch.Ast
{
public class Return : Statement
{
public Expression Value { get; set; }
public Return(Token token, Expression value)
{
Token = token;
Value = value;
}
}
}
Parser.cs
// other code
private Statement ParseStatement()
{
switch (Lexer.CurrentToken.Type)
{
case TokenType.VAR:
return ParseVarStatement();
case TokenType.RETURN:
return ParseReturnStatement();
default:
return ParseExpressionStatement();
};
}
private Return ParseReturnStatement()
{
var token = Lexer.CurrentToken;
Lexer.NextToken();
var value = ParseExpression();
ExpectNextToken(TokenType.SEMICOLON);
return new Return(token, value);
}
// other code
IfElse statements will contain a consequence in the form of an expression, as well as a consequence and alternative in the form of a block. A block is just sequence of statements similar to our top level interpreter program.
Ast/Block.cs
namespace interpreter_from_scratch.Ast
{
public class Block : Statement
{
public IEnumerable<Statement> Statements { get; set; }
public Block(IEnumerable<Statement> statements)
{
Statements = statements;
}
}
}
Ast/IfElse.cs
namespace interpreter_from_scratch.Ast
{
public class IfElse : Statement
{
public Expression Condition { get; set; }
public Block Consequence { get; set; }
public Block Alternative { get; set; }
public IfElse(Token token, Expression condition, Block consequence, Block alternative)
{
Token = token;
Condition = condition;
Consequence = consequence;
Alternative = alternative;
}
}
}
Parsing IfElse is like the above. ExpectNextTokens are used to enforce syntax rules. The condition is parsed using ParseExpression, and consequence is parsed using ParseBlockStatement which parses statements until it encounters a right brace. If an else is found the alternative is parsed as well and added to the IfElse statement.
Parser.cs
private Statement ParseStatement()
{
switch (Lexer.CurrentToken.Type)
{
case TokenType.VAR:
return ParseVarStatement();
case TokenType.RETURN:
return ParseReturnStatement();
case TokenType.IF:
return ParseIfElseStatement();
default:
return ParseExpressionStatement();
};
}
// other code
private IfElse ParseIfElseStatement()
{
var token = Lexer.CurrentToken;
ExpectNextToken(TokenType.LEFTPAREN);
Lexer.NextToken();
var condition = ParseExpression();
ExpectNextToken(TokenType.RIGHTPAREN);
ExpectNextToken(TokenType.LEFTBRACE);
var block = ParseBlockStatement();
if (Lexer.PeekToken.Type == TokenType.ELSE)
{
Lexer.NextToken();
ExpectNextToken(TokenType.LEFTBRACE);
var elseBlock = ParseBlockStatement();
return new IfElse(token, condition, block, elseBlock);
}
return new IfElse(token, condition, block, null);
}
private Block ParseBlockStatement()
{
var statements = new List<Statement>();
Lexer.NextToken();
while (Lexer.CurrentToken.Type != TokenType.RIGHTBRACE)
{
var statement = ParseStatement();
if (statement != null)
{
statements.Add(statement);
}
Lexer.NextToken();
}
return new Block(statements);
}
Finally parsing functions and function calls. A function statement contains an identifier or the function name, a list of identifiers or the parameters, and the function body which will be a block.
Ast/Function.cs
namespace interpreter_from_scratch.Ast
{
public class Function : Statement
{
public Identifier Identifier { get; set; }
public IEnumerable<Identifier> Parameters { get; set; }
public Block Body { get; set; }
public Function(Token token, Identifier identifier, IEnumerable<Identifier> parameters, Block body)
{
Token = token;
Identifier = identifier;
Parameters = parameters;
Body = body;
}
}
}
Parsing the function statement.
Parser.cs
// other code
private Statement ParseStatement()
{
switch (Lexer.CurrentToken.Type)
{
case TokenType.VAR:
return ParseVarStatement();
case TokenType.RETURN:
return ParseReturnStatement();
case TokenType.IF:
return ParseIfElseStatement();
case TokenType.FUNCTION:
return ParseFunctionStatement();
default:
return ParseExpressionStatement();
};
}
private Function ParseFunctionStatement()
{
var token = Lexer.CurrentToken;
Lexer.NextToken();
var identifierToken = Lexer.CurrentToken;
var functionIdentifier = new Identifier(identifierToken);
ExpectNextToken(TokenType.LEFTPAREN);
var parameters = new List<Identifier>();
while (Lexer.CurrentToken.Type != TokenType.RIGHTPAREN)
{
Lexer.NextToken();
var parameter = new Identifier(Lexer.CurrentToken);
parameters.Add(parameter);
if (Lexer.PeekToken.Type != TokenType.RIGHTPAREN)
{
ExpectNextToken(TokenType.COMMA);
}
else
{
Lexer.NextToken();
}
}
ExpectNextToken(TokenType.LEFTBRACE);
var block = ParseBlockStatement();
// Let the user optionally add a semicolon at the end of the function declaration
if (Lexer.PeekToken.Type == TokenType.SEMICOLON)
{
Lexer.NextToken();
}
return new Function(token, functionIdentifier, parameters, block);
}
The function will have the form of "function functionName(x, y) { return x; }.
After the function keyword Token an identifier or the name of the function is expected. This is followed by a left parenthesis. Finally, the parameters are parsed, which is a list of identifiers until encountering a right parenthesis signifying the end of the parameter list. Next, a left brace is expected followed by parsing the block statement.
I decided to make the semicolon optional so a user can write a function as
"function functionName(x, y) { return x; }" or "function functionName(x, y) { return x; };"
Finally, parsing function calls. For example calling the example function above would look like "functionName(1, 2);". A function call is an expression since it returns a value. This model will also contain an identifier or the function name, as well as a list of expressions or the parameters in the function call.
Ast/FunctionCall.cs
namespace interpreter_from_scratch.Ast;
public class FunctionCall : Expression
{
public Identifier Function { get; set; }
public IEnumerable<Expression> Parameters { get; set; }
public FunctionCall(Token token, Identifier function, IEnumerable<Expression> parameters)
{
Function = function;
Parameters = parameters;
Token = token;
}
}
Parsing function calls will happen in ParseExpression when encountering an identifier, instead of assuming it is an identifier expression the parser checks for a left parenthesis which instead tells it that is a function call. This function call is then parsed returning the correct expression type.
Parser.cs
// other code
private Expression ParseExpression()
{
var token = Lexer.CurrentToken;
Expression leftExpression;
switch (token.Type)
{
case TokenType.IDENTIFIER:
var identifier = new Identifier(token);
if (Lexer.PeekToken.Type == TokenType.LEFTPAREN)
{
Lexer.NextToken();
leftExpression = ParseFunctionCall(identifier);
}
else
{
leftExpression = identifier;
}
break;
case TokenType.INTEGER:
leftExpression = new Integer(token, int.Parse(token.Literal));
break;
case TokenType.TRUE:
case TokenType.FALSE:
leftExpression = new Bool(token, token.Type == TokenType.TRUE);
break;
default:
return null;
}
if (_binaryOperations.Contains(Lexer.PeekToken.Type))
{
return ParseBinaryExpression(leftExpression);
}
return leftExpression;
}
private FunctionCall ParseFunctionCall(Identifier identifier)
{
var token = Lexer.CurrentToken;
return new FunctionCall(token, identifier, ParseFunctionCallParameters());
}
private IEnumerable<Expression> ParseFunctionCallParameters()
{
var parameters = new List<Expression>();
Lexer.NextToken();
while (Lexer.CurrentToken.Type != TokenType.RIGHTPAREN)
{
var parameter = ParseExpression();
parameters.Add(parameter);
if (Lexer.PeekToken.Type != TokenType.RIGHTPAREN)
{
ExpectNextToken(TokenType.COMMA);
}
Lexer.NextToken();
}
return parameters;
}
One last thing to update the ParseProgram function so that it parses as many statements as it can until hitting the end of file token.
Parser.cs
// other code
public InterpreterProgram ParseProgram()
{
var statements = new List<Statement>();
while (Lexer.CurrentToken.Type != TokenType.ENDOFFILE)
{
var statement = ParseStatement();
if (statement != null)
{
statements.Add(statement);
}
Lexer.NextToken();
}
return new InterpreterProgram(statements);
}
That’s all for the Parser. The program is now creating an abstract syntax tree in the form of a list of statements which are linked to other statements or expressions.
Pause and sanity check your code with these unit tests ParserTests.cs (opens in a new tab)
Evaluating the abstract syntax tree
The final step is to evaluate the statements nodes produced by the parser. The evaluation traverses each statement node doing what each node signifies before eventually arriving at a final result which will be printed to the terminal. Each value encountered will be represented as a type of object inheriting from a new class called InterpreterObject. This class will hold a single value called InterpreterObjectType which contains all the possible object types the interpreter can produce
Evaluation/InterpreterObjectType.cs
namespace interpreter_from_scratch.Evaluation;
public enum InterpreterObjectType
{
INTEGER,
BOOLEAN,
NULL,
RETURNVALUE,
FUNCTION
}
Evaluation/InterpreterObject.cs
namespace interpreter_from_scratch.Evaluation;
public abstract class InterpreterObject
{
public InterpreterObjectType Type { get; set; }
}
The Evaluator class will contain the public method Evaluate which will take in a InterpreterProgram or list of Statements and return an InterpreterObject. This will loop through each statement, evaluate each and return the last result. Evaluate methods also need to be created for single statements, expressions and a list of expressions. For right now these will just return null.
Evaluator.cs
using interpreter_from_scratch.Ast;
using interpreter_from_scratch.Evaluation;
namespace interpreter_from_scratch;
public class Evaluator
{
public InterpreterObject Evaluate(InterpreterProgram program)
{
InterpreterObject result = null;
for (var i = 0; i < program.Statements.Count(); i++)
{
result = Evaluate(program.Statements.ElementAt(i));
}
return result;
}
private InterpreterObject Evaluate(Statement statement)
{
return null;
}
private InterpreterObject Evaluate(Expression expression)
{
return null;
}
private IEnumerable<InterpreterObject> Evaluate(IEnumerable<Expression> expressions)
{
var result = new List<InterpreterObject>();
foreach (var expression in expressions)
{
result.Add(Evaluate(expression));
}
return result;
}
}
Inside of the evaluate function for statements will be a switch statement which will change the type of evaluation depending on the statement type.
If an expression statement is encountered. Evaluate will be called on the expression and returned.
// other code
private InterpreterObject Evaluate(Statement statement)
{
switch (statement)
{
case ExpressionStatement expressionStatement:
return Evaluate(expressionStatement.Expression);
default:
return null;
}
}
A new property will need to be introduced before evaluating Var statements. This property will be a key value store which will map the string identifier name with the corresponding value it represents. When a Var statement is encountered its value will be evaluated and both the identifier name and value will be stored in the map to be used for future statements. This map will be called EnvironmentVariables.
One other property this class will need is access to an outer environment if it exists. When we encounter a function call a new EnvironmentVariables object will be created since the variable assignment that happens within the function statement is separate from the rest of the program. However, there may be cases in which variables declared outside the function statement need to be accessed.
Evaluation/EnvironmentVariables.cs
namespace interpreter_from_scratch.Evaluation;
public class EnvironmentVariables
{
public Dictionary<string, InterpreterObject> Variables = new Dictionary<string, InterpreterObject>();
public EnvironmentVariables OuterEnvironment = null;
public EnvironmentVariables()
{
Variables = new Dictionary<string, InterpreterObject>();
}
public EnvironmentVariables(EnvironmentVariables outer)
{
Variables = new Dictionary<string, InterpreterObject>();
OuterEnvironment = outer;
}
public InterpreterObject Get(string key)
{
if (Variables.ContainsKey(key))
{
return Variables[key];
}
if (OuterEnvironment != null)
{
return OuterEnvironment.Get(key);
}
return null;
}
}
A new EnvironmentVariables object will be created and passed into and through the various Evaluate methods.
Evaluator.cs
// other code
public InterpreterObject Evaluate(InterpreterProgram program, EnvironmentVariables environmentVariables)
{
InterpreterObject result = null;
for (var i = 0; i < program.Statements.Count(); i++)
{
result = Evaluate(program.Statements.ElementAt(i), environmentVariables);
}
return result;
}
private InterpreterObject Evaluate(Statement statement, EnvironmentVariables environmentVariables)
{
switch (statement)
{
case ExpressionStatement expressionStatement:
return Evaluate(expressionStatement.Expression, environmentVariables);
default:
return null;
}
}
private InterpreterObject Evaluate(Expression expression, EnvironmentVariables environmentVariables)
{
return null;
}
private IEnumerable<InterpreterObject> Evaluate(IEnumerable<Expression> expressions, EnvironmentVariables environmentVariables)
{
var result = new List<InterpreterObject>();
foreach (var expression in expressions)
{
result.Add(Evaluate(expression, environmentVariables));
}
return result;
}
With this in place Var statements can be evaluated. This will evaluate the expression and map the identifier value to the returned evaluated expression object.
private InterpreterObject Evaluate(Statement statement, EnvironmentVariables environmentVariables)
{
switch (statement)
{
case ExpressionStatement expressionStatement:
return Evaluate(expressionStatement.Expression, environmentVariables);
case Var varStatement:
var value = Evaluate(varStatement.Value, environmentVariables);
environmentVariables.Variables.Add(varStatement.Identifier.Value, value);
break;
default:
return null;
}
return null;
}
Return statement evaluation is also very simple. The expression will be evaluated, and the response will instead be a return object type. This is just a wrapper class around the evaluated expression value to signify that this is a return value.
Evaluation/ReturnObject.cs
namespace interpreter_from_scratch.Evaluation
{
public class ReturnObject : InterpreterObject
{
public InterpreterObject Value { get; set; }
public ReturnObject(InterpreterObject value)
{
Value = value;
Type = InterpreterObjectType.RETURNVALUE;
}
}
}
Evaluator.cs
private InterpreterObject Evaluate(Statement statement, EnvironmentVariables environmentVariables)
{
switch (statement)
{
case ExpressionStatement expressionStatement:
return Evaluate(expressionStatement.Expression, environmentVariables);
case Var varStatement:
var value = Evaluate(varStatement.Value, environmentVariables);
environmentVariables.Variables.Add(varStatement.Identifier.Value, value);
break;
case Return returnStatement:
var returnValue = Evaluate(returnStatement.Value, environmentVariables);
return new ReturnObject(returnValue);
default:
return null;
}
return null;
}
For if else evaluation the condition expression is evaluated. This should return a boolean object. If this boolean value is true, the consequence is evaluated. If false and the alternative exists, the alternative is evaluated.
Evaluator.cs
// other code
private InterpreterObject Evaluate(Statement statement, EnvironmentVariables environmentVariables)
{
switch (statement)
{
// other code
case IfElse ifElseStatement:
return EvaluateIfElse(ifElseStatement, environmentVariables);
// other code
}
}
private InterpreterObject EvaluateIfElse(IfElse ifElse, EnvironmentVariables environmentVariables)
{
var condition = Evaluate(ifElse.Condition, environmentVariables);
if (condition is not BoolObject)
{
throw new Exception($"Condition must be a boolean. Got {condition.GetType()}");
}
var boolCondition = (BoolObject)condition;
if (boolCondition.Value)
{
return Evaluate(ifElse.Consequence, environmentVariables);
}
else if (ifElse.Alternative != null)
{
return Evaluate(ifElse.Alternative, environmentVariables);
}
return null;
}
For function evaluation a function object is created which contains the block body, parameters, and environment variables. The identifier is then mapped to the function object in the environment variables.
Evaluation/FunctionObject.cs
using interpreter_from_scratch.Ast;
namespace interpreter_from_scratch.Evaluation
{
public class FunctionObject : InterpreterObject
{
public Block Body { get; set; }
public IEnumerable<Identifier> Parameters { get; set; }
public EnvironmentVariables EnvironmentVariables { get; set; }
public FunctionObject(Block body, IEnumerable<Identifier> parameters, EnvironmentVariables environmentVariables)
{
Body = body;
Parameters = parameters;
EnvironmentVariables = environmentVariables;
Type = InterpreterObjectType.FUNCTION;
}
}
}
Evaluator.cs
private InterpreterObject Evaluate(Statement statement, EnvironmentVariables environmentVariables)
{
switch (statement)
{
// other code
case Function functionStatement:
environmentVariables.Variables.Add(functionStatement.Identifier.Value, new FunctionObject(functionStatement.Body, functionStatement.Parameters, environmentVariables));
break;
// other code
}
// other code
}
Finally evaluating block statements which is like evaluating the whole program. Each statement in the block is evaluated and if a return value is encountered, execution stops, and the value is returned. If return is not encountered the last evaluated value is returned.
Evaluator.cs
private InterpreterObject Evaluate(Statement statement, EnvironmentVariables environmentVariables)
{
switch (statement)
{
// other code
case Block blockStatement:
return EvaluateBlock(blockStatement, environmentVariables);
// other code
}
// other code
}
private InterpreterObject EvaluateBlock(Block block, EnvironmentVariables environmentVariables)
{
InterpreterObject result = null;
for (var i = 0; i < block.Statements.Count(); i++)
{
result = Evaluate(block.Statements.ElementAt(i), environmentVariables);
if (result != null && result.Type == InterpreterObjectType.RETURNVALUE)
{
return result;
}
}
return result;
}
Next is evaluating the expressions. The easiest expression types to evaluate would be integer and boolean expressions. For both types a new object representation needs to be created.
Evaluation/BoolObject.cs
namespace interpreter_from_scratch.Evaluation;
public class BoolObject : InterpreterObject
{
public bool Value { get; set; }
public BoolObject(bool value)
{
Value = value;
Type = InterpreterObjectType.BOOLEAN;
}
}
Evaluation/IntegerObject.cs
namespace interpreter_from_scratch.Evaluation;
public class IntegerObject : InterpreterObject
{
public int Value { get; set; }
public IntegerObject(int value)
{
Type = InterpreterObjectType.INTEGER;
Value = value;
}
}
If an integer or boolean expression is encountered during evaluation, the corresponding object will be created and returned.
Evaluator.cs
// other code
private InterpreterObject Evaluate(Expression expression)
{
return null;
switch (expression)
{
case Integer integer:
return new IntegerObject(integer.Value);
case Bool boolean:
return new BoolObject(boolean.Value);
default:
return null;
}
}
To evaluate an identifier the evaluator will check if the identifier is contained in the environment variables and return the InterpreterObject value. If the identifier is not found in the environment variables an error will be thrown.
Evaluator.cs
private InterpreterObject Evaluate(Expression expression, EnvironmentVariables environmentVariables)
{
switch(expression)
{
// other code
case Identifier identifier:
return EvaluateIdentifier(identifier, environmentVariables);
// other code
}
}
private InterpreterObject EvaluateIdentifier(Identifier identifier, EnvironmentVariables environmentVariables)
{
var value = environmentVariables.Get(identifier.Value);
if (value == null)
{
throw new Exception($"Identifier {identifier.Value} not found");
}
return value;
}
For binary expressions the left and right expressions are evaluated independently. The operation is then applied depending on the operation token type. For this language only integers will be supported. So something like true + 5 will not be supported and throw an error.
Evaluator.cs
private InterpreterObject Evaluate(Expression expression)
{
return null;
switch (expression)
{
// other code
case BinaryExpression binaryExpression:
var left = Evaluate(binaryExpression.Left, environmentVariables);
var right = Evaluate(binaryExpression.Right, environmentVariables);
return EvaluateBinaryExpression(left, right, binaryExpression.Operation);
// other code
}
}
private InterpreterObject EvaluateBinaryExpression(InterpreterObject left, InterpreterObject right, TokenType operation)
{
if (left is ReturnObject leftReturnObject)
{
left = leftReturnObject.Value;
}
if (right is ReturnObject rightReturnObject)
{
right = rightReturnObject.Value;
}
if (left is not IntegerObject || right is not IntegerObject)
{
throw new Exception($"Only integers are capable of binary expressions. Got {left.GetType()}, {right.GetType()}");
}
var leftIneger = (IntegerObject)left;
var rightInteger = (IntegerObject)right;
switch(operation)
{
case TokenType.PLUS:
return new IntegerObject(leftIneger.Value + rightInteger.Value);
case TokenType.MINUS:
return new IntegerObject(leftIneger.Value - rightInteger.Value);
case TokenType.ASTERISK:
return new IntegerObject(leftIneger.Value * rightInteger.Value);
case TokenType.SLASH:
return new IntegerObject(leftIneger.Value / rightInteger.Value);
case TokenType.GREATERTHAN:
return new BoolObject(leftIneger.Value > rightInteger.Value);
case TokenType.LESSTHAN:
return new BoolObject(leftIneger.Value < rightInteger.Value);
case TokenType.EQUALS:
return new BoolObject(leftIneger.Value == rightInteger.Value);
case TokenType.DOESNOTEQUAL:
return new BoolObject(leftIneger.Value != rightInteger.Value);
default:
return null;
}
}
Finally, to evaluate function calls first the function identifier is evaluated and returns the function object. Second each of the parameters will be evaluated returning a list of identifiers.
An apply function method will be called which first checks that the input parameters match the function definition number of parameters. Then it will create a new environment variables object with the mapped input identifiers and values. Finally, it will create a new evaluator object which will evaluate the function block with the new environment variables object.
Evaluator.cs
private InterpreterObject Evaluate(Expression expression, EnvironmentVariables environmentVariables)
{
switch(expression)
{
// other code
case FunctionCall functionCall:
var function = Evaluate(functionCall.Function, environmentVariables);
var parameters = Evaluate(functionCall.Parameters, environmentVariables);
if (function is FunctionObject functionObject)
{
return ApplyFunction(functionObject, parameters);
}
throw new Exception($"Expected a function. Got {function.Type}");
// other code
}
}
private InterpreterObject ApplyFunction(FunctionObject function, IEnumerable<InterpreterObject> parameters)
{
if (function.Parameters.Count() != parameters.Count())
{
throw new Exception($"Expected {function.Parameters.Count()} parameters. Got {parameters.Count()}");
}
var extendedEnvironment = new EnvironmentVariables(function.EnvironmentVariables);
for (var i = 0; i < function.Parameters.Count(); i++)
{
extendedEnvironment.Variables.Add(function.Parameters.ElementAt(i).Value, parameters.ElementAt(i));
}
var evaluator = new Evaluator();
return evaluator.Evaluate(function.Body, extendedEnvironment);
}
And that's it for the Evaluator.
Check your code with these unit tests EvaluatorTests.cs (opens in a new tab)
REPL and Code Execution
One last thing needed is to create a way to use the interpreter. There will be two ways to use it a REPL and passing a text file for evaluation.
Program.cs
using interpreter_from_scratch.Evaluation;
namespace interpreter_from_scratch;
public class Program
{
public static void Main(string[] args)
{
if (args.Count() > 0)
{
var text = File.ReadAllText(args[0]);
var lexer = new Lexer(text);
var parser = new Parser(lexer);
var evaluator = new Evaluator();
var program = parser.ParseProgram();
var response = evaluator.Evaluate(program, new EnvironmentVariables());
PrintInterpreterObject(response);
}
else
{
Console.WriteLine("Staring REPL...");
var environment = new EnvironmentVariables();
while (true)
{
Console.Write(">> ");
var input = Console.ReadLine();
var lexer = new Lexer(input);
var parser = new Parser(lexer);
var evaluator = new Evaluator();
var program = parser.ParseProgram();
var response = evaluator.Evaluate(program, environment);
PrintInterpreterObject(response);
}
}
}
private static void PrintInterpreterObject(InterpreterObject interpreterObject)
{
if (interpreterObject == null)
{
return;
}
switch (interpreterObject.Type)
{
case InterpreterObjectType.INTEGER:
Console.WriteLine(((IntegerObject)interpreterObject).Value);
break;
case InterpreterObjectType.BOOLEAN:
Console.WriteLine(((BoolObject)interpreterObject).Value);
break;
case InterpreterObjectType.RETURNVALUE:
var returnObject = (ReturnObject)interpreterObject;
PrintInterpreterObject(returnObject.Value);
break;
default:
break;
}
}
}
And that's it. You now have a fully functioning interpreted programing language that can run the code below.
example_code.txt
var x = 5;
var y = 6;
function greaterThanTen(value) {
if (value > 10) {
return true;
}
else {
return false;
}
}
greaterThanTen(x + y);
dotnet run example_code.txt
True
While this is a good start there are obviously a lot of features that need to be added to make this usable. What immediately comes to mind includes arrays, hashmaps, removing the binary expression restriction (i.e. (1 + 1 + 1)), and loops. I will leave attempting to implement these as an exercise for the reader.