Building a language interpreter in JavaScript - Part 2

Welcome to part 2 of building an interpreter in pure JavaScript.

In the previous article, we introduced the language printly, explained its syntax, and built a lexer for it. The lexer allowed us to convert the code into tokens, which are much more manageable for us to read. That is mainly because we no longer need to think about things like case sensitivity, how many spaces there are between the parts we need, etc.

At this point, we are ready to go to the next step. So what are we doing now?

Now, we parse.

Building a parser

A parser is an algorithm that receives the tokens, runs them against the language's grammar, and makes sense of them. In the end, we get a nice array of structured statements which we can easily traverse and make use of in our interpreter.

As you can see from that sentence, there are two things we need to create:

  1. Grammar for our language to verify that the code was written correctly.
  2. Parser to transform the tokens into structured statements.

So to start, let's take our tokens from the previous part as an example:

[
  {
    type: 'name',
    value: 'variable',
    start: 0,
    end: 8,
    line: 0,
    character: 0
  },
  {
    type: 'operator',
    value: '=',
    start: 9,
    end: 10,
    line: 0,
    character: 9
  },
  {
    type: 'number',
    value: '5',
    start: 11,
    end: 12,
    line: 0,
    character: 11
  }
];

For parsing, we need to take these tokens, verify them against the grammar, and transform them into a statement.

So how do we detect that?

One implementation would be to read a token, try to find a corresponding rule, apply it, and then transform it. It would be really hard to match everything that way because there is an infinite number of combinations between code parts. You can have an assignment that calls a function, performs addition, and does a boolean check, in infinite variations.

So, that approach will not work. How about we approach the problem differently?

What if we create the rules first, declare one of our rules a start rule, and then try to match as much as we can? As long as the rules match, we can go as deep as needed until we match everything. It sounds complicated at first, but it will get clearer as we continue.

And a set of these rules will also make up the grammar for our language.

Defining the grammar of our language

Our grammar needs to be concise and understandable. We could use words to explain it, but we are programmers so we will complicate it with a notation. :)

We will invent a notation for our language:

// Rule is matched only if SubRule1 and SubRule2 
// match in this exact order.
// This is our start rule.
Rule -> SubRule1 SubRule2

// SubRule1 is matched if SubRule4 matches zero or more times
SubRule1 -> SubRule4*

// SubRule4 is matched if either rule TokenA or the group (TokenB TokenC) matches exactly.
SubRule4 -> TokenA | (TokenB TokenC)

// SubRule2 is matched if rule TokenC optionally matches (it can match or not), and rule TokenD matches one or more times.
SubRule2 -> TokenC? TokenD+

// TokenA rule matches if the current token is of type 'keyword' and value is 'if'
TokenA -> <Token('keyword', 'if')>

// TokenB rule matches if the current token is of type 'number'
TokenB -> <Token('number')>

// TokenC rule matches if the current token is of type 'operator' and value '!'
TokenC -> <Token('operator', '!')>

// TokenD rule matches if the current token is of type 'name'.
TokenD -> <Token('name')>

In this notation, we start by trying to satisfy the rule Rule and then go deeper into each of the sub-rules, resolving them recursively until we reach the tokens. Once we find a token and it matches what we specified, we go back up and say that the rule matched.

In our explanation, we will always read top to bottom to make things simple.

In this notation, you can also see the quantifiers. These decide how many times each rule check is applied:

  • * - zero or more times, as many times as it can be matched
  • + - one or more times, as many times as it can be matched
  • ? - zero or one time

We will also use parentheses ( and ) for grouping the rules together.
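
For example, in this notation, a hypothetical rule

List -> Name (Comma Name)*

would match a single name like a, as well as a comma-separated list of names like a, b, c.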

With this information, we are ready to write out the grammar for printly:

// Start rule of our grammar,
// we will match this until we reach the end of our token list.
// This will essentially give us one statement per line.
LineStatement -> IfExpressionStatement | AssignmentStatement | FunctionStatement

// An if expression has an expression check
// and a code block containing multiple LineStatements
IfExpressionStatement -> IfKeyword PStart Expression PEnd CodeBlock

// A code block can have zero or more line statements.
// In this case our grammar goes back to the top for each
// line statement, but we specified zero or more because
// we can have empty if statements too.
CodeBlock -> BStart LineStatement* BEnd

// Our function statement is a rule for FunctionExpression plus
// end of the line.
FunctionStatement -> FunctionExpression Eol

// We separated this rule into its own because we can use
// a function call in other parts of expressions (or even in other function calls),
// so we need a separate version without the end-of-line token.
// Our function parameters are optional because we can have
// functions with zero parameters.
FunctionExpression -> Name PStart FunctionParameters? PEnd

// Function parameters can have one expression or multiple
// expressions separated by a comma. Comma is grouped
// inside parentheses because we only need it if we 
// have more than one parameter.
FunctionParameters -> Expression (Comma Expression)*

// And finally the assignment, which also contains an expression.
// We need a Name token for the variable name
// and an equals sign to complete the statement.
AssignmentStatement -> Name Equals Expression Eol

// And here are the rules we have below
// which resolve directly to tokens.
// After our parser reaches these tokens it does not 
// need to look deeper.
Name -> <Token('name')>
Equals -> <Token('operator', '=')>
Comma -> <Token('comma')>
BStart -> <Token('codeBlockStart')>
BEnd -> <Token('codeBlockEnd')>
PStart -> <Token('parenStart')>
PEnd -> <Token('parenEnd')>
IfKeyword -> <Token('keyword', 'if')>
Eol -> <Token('endOfLine')>
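
For example, the statement i = 5; matches like this: LineStatement picks AssignmentStatement, which then consumes the tokens in order:

LineStatement
  AssignmentStatement
    Name       -> 'i'
    Equals     -> '='
    Expression -> '5'
    Eol        -> ';'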

If you read this carefully, you will notice that the rule Expression is missing. We omitted it on purpose and will define it in a separate part to keep things understandable.

Expression in our case means every kind of expression that can appear inside an if check, an assignment, or a function parameter. We can also have expressions inside expressions. And aside from this, we have one more problem to deal with - operator precedence. Yep, you heard right, we need to teach our language some math and logic. :)

Operator precedence allows us to evaluate some operators before others. This is the case with our arithmetic operators, where the multiplication * and division / operators need to come before addition + and subtraction -. Logical operators need to have the lowest precedence so that everything works as we expect.

Here is the full table of our operator precedence, from the least important to the most important:

Operator      Description
&&, ||        And and Or logical operators
==, !=        Equality logical operators
<, >, <=, >=  Relational operators
+, -          Addition and subtraction operators
*, /          Multiplication and division operators
!             Not logical operator
()            Group operator

For the purposes of our language and to keep things simple, all our operators are read from the left to right (also known as left-associative).

So with this in mind, here are the rules for Expression:

// We start with detection of the && and || logical operators.
// Grouping ((And | Or) EqualityTerm)* like this allows us to
// have three or more operands, like a || b && c || d.
// Note that in our language these two operators have
// the same priority, but usually in a lot of languages the
// && operator has higher priority than ||. We are keeping things simple though.
Expression -> EqualityTerm ((And | Or) EqualityTerm)*

// This allows us to detect == and != equality expressions
// the reasoning for this rule is the same as the previous one.
EqualityTerm -> RelationTerm ((DoubleEquals | NotEquals) RelationTerm)*

// This allows us to detect the >, <, >= and <= relational expressions
RelationTerm -> AddSubTerm ((Less | Greater | LessEquals | GreaterEquals) AddSubTerm)*

// This allows us to detect + and - math expressions
AddSubTerm -> MulDivTerm ((Add | Subtract) MulDivTerm)*

// This allows us to detect * and / math expressions
MulDivTerm -> UnaryTerm ((Multiply | Divide) UnaryTerm)*

// This allows us to define the ! boolean negation of a factor
UnaryTerm -> Not? Factor

// Factor allows an operand to be
// a number, a string, or another expression in parentheses.
// If it is another expression in parentheses, we go back
// up to Expression and check the rules for that expression
// as many times as we need to for our code.
Factor -> GroupExpression | FunctionExpression | NumberExpression | VariableExpression | StringExpression

// Group expression allows us nest additional
// expressions and it has the highest precedence
// in expression so we can use it to clarify
// specific parts of an expression.
GroupExpression -> PStart Expression PEnd

// These last three are alias rules for
// Name, Number and String. Yes we could use
// them directly but we will need this for something else
// later in the code implementation.
VariableExpression -> Name
NumberExpression -> Number
StringExpression -> String

// And finally we need to define rules for tokens:
And -> <Token('operator', '&&')>
Or -> <Token('operator', '||')>
DoubleEquals -> <Token('operator', '==')>
NotEquals -> <Token('operator', '!=')>
Less -> <Token('operator', '<')>
Greater -> <Token('operator', '>')>
LessEquals -> <Token('operator', '<=')>
GreaterEquals -> <Token('operator', '>=')>
Add -> <Token('operator', '+')>
Subtract -> <Token('operator', '-')>
Multiply -> <Token('operator', '*')>
Divide -> <Token('operator', '/')>
Not -> <Token('operator', '!')>
String -> <Token('string')>
Number -> <Token('number')>
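
To see how these rules encode precedence, let's trace how an expression like a + 5 * 2 is matched. AddSubTerm consumes a as its first MulDivTerm, then the + operator, and then the whole 5 * 2 as a single second MulDivTerm, so the multiplication naturally ends up grouped below the addition:

Expression
  EqualityTerm
    RelationTerm
      AddSubTerm
        MulDivTerm -> UnaryTerm -> Factor -> a
        Add (+)
        MulDivTerm
          UnaryTerm -> Factor -> 5
          Multiply (*)
          UnaryTerm -> Factor -> 2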

Implementing the parser

Now that we have written the grammar, we can start implementing our parser.

How do we implement this? We could manually create functions that call other functions,
starting from the LineStatement rule and going down all the way until we reach the tokens.

That would result in a lot of code and duplication, and a lot of bugs to deal with.

How about we try something simpler? Let's implement a way to define the rules exactly as they appear in the grammar. Of course, we will need some helper functions in
JavaScript to make that happen, but we can do this.

But before that, we need a way to read through tokens and keep the state.

For that, we will need something similar to what we used in the lexer, a TokenReader class :)

class TokenReader {
    constructor(tokens) {
        this.tokens = tokens; // store tokens for further use
        this.position = 0; // current position in token list
        this.stateStack = []; // state stack so that we can rollback if we do not match something.
    }

    // Push current state to the stack
    // this allows us to go back to this state
    // if we do not match anything
    pushState() {
        this.stateStack.push(this.position);
    }

    // Restore last pushed state
    // we will call this when we read as far
    // as we could but we didn't match what we need.
    restoreState() {
        this.position = this.popState();
    }

    // Pops the last state from the list and returns it.
    // We will call this when we need to lose the 
    // last saved state when we matched something and we
    // do not need to go back.
    popState() {
        return this.stateStack.pop();
    }

    // Checks whether the current token is of a
    // specific type or not.
    isType(type) {
        return this.hasNext() && this.getType() === type;
    }

    // Returns the type of the current token.
    getType() {
        return this.get().type;
    }

    // Returns the value of the current token.
    getValue() {
        return this.get().value;
    }

    // Checks whether the value in the current token
    // matches.
    isValue(value) {
        return this.getValue() === value;
    }

    // Returns the token at the current position.
    get() {
        return this.tokens[this.position];
    }

    // Returns the very last token in the list.
    getLastToken() {
        return this.tokens[this.tokens.length - 1];
    }

    // Move to the next token position.
    next() {
        this.position++;
    }

    // Check and return whether there are more tokens to
    // consume.
    hasNext() {
        return this.position < this.tokens.length;
    }
}
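
As a quick illustration of how the state stack works (using the three example tokens from the start of this article), here is what a manual session with the reader could look like:

const reader = new TokenReader(tokens);

reader.pushState();                 // remember position 0
console.log(reader.isType('name')); // true, the current token is 'variable'
reader.next();                      // move on to the '=' token
console.log(reader.getValue());     // '='
reader.restoreState();              // roll back to position 0
console.log(reader.getValue());     // 'variable' again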

Now that we have a token reader, let's implement a few functions which will allow us to define grammar rules.

First, we need to implement the function which defines a rule:
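
It needs two things: a function that lazily produces the checks (we pass a function instead of the checks directly so that rules can reference other rules defined further down in the file, even recursively) and a transform which shapes the matched result into the structure we want. A minimal version looks like this:

// rule returns a function which works on a token reader.
// getChecks is a function so that the checks are only
// built when the rule actually runs -- this lets rules
// reference rules that are defined later in the file.
// transform shapes the raw match into our statement structure.
const rule = (getChecks, transform) => reader => {
    const result = getChecks()(reader);

    // No match means this rule does not apply here.
    if (!result) {
        return null;
    }

    // We matched, so we shape and return the result.
    return transform(result);
};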

The next one is a function that matches a sequence of checks exactly in the given order. We will call it... exactly :)

// Exactly is a function which returns a function
// which runs each of the checks against
// a reader in order. If just one of the checks fails,
// the whole function will return null, meaning
// that we couldn't match anything.
const exactly = (...checks) => reader => {
    // First we store the current position in the  
    // token list so that we can go back if we don't
    // match what we need.
    reader.pushState();

    const results = [];

    for (const check of checks) {

        const match = check(reader); // Run the check against the reader

        if (!match) {
            // Didn't get a match so we
            // restore the position in the token
            // and exit the function returning null, meaning no match.
            reader.restoreState();
            return null;
        }

        // We found a match so we add it to our
        // results list
        results.push(match);
    }

    // We drop the state we saved because we found all matches
    // by this point so we do not need the
    // saved state anymore.
    reader.popState();

    // We return the matched items
    // so that they can be transformed as needed.
    return results;
};

After exactly, we need to implement the ability to check either Rule A or Rule B (the | sign in the grammar) and match if either one
of them is matched.

Yep, you guessed it, we will call the function either:

// Either returns a function which 
// runs against a reader and checks each
// of the passed checks and returns the first
// one which matches.
const either = (...checks) => reader => {
    for (const check of checks) {
        reader.pushState(); // Store the state so that we can go back if we do not match.

        const match = check(reader);
        if (match) {
            // We found a match here
            // so we remove the stored state
            // as we do not need it.
            reader.popState();

            return match; // We return the found match.
        }

        // We didn't find a match here so
        // we restore the reader to previous state
        // so that next check in line
        // can be checked on the same position.
        reader.restoreState();
    }

    // We didn't find anything at this point
    // so we return null.
    return null;
};

Now, we will add a way to mark part of a rule as optional (denoted by the ? sign in the grammar). We will call the function
optional:

// Optional function returns a function which works on a
// token reader, runs a check and returns a value
// denoted by defaultValue if the check does not match.
// Returning a defaultValue other than a null allows optional
// to always return something even if the check fails
// thus making the check optional. :)
const optional = (check, defaultValue = {type: 'optional'}) => reader => {
    reader.pushState(); // we store the state before the check
    const result = check(reader);

    if (!result) {
        // Our check failed
        // so we restore the previous state
        reader.restoreState();
    } else {
        // we had a match so we
        // don't need to keep the stored state anymore
        reader.popState();
    }

    // Here we return the match or the default value
    // as long as default value is not null this would
    // make the result optional.
    return result ? result : defaultValue;
};

After the optional check, we need to allow checks for the zero-or-more * and one-or-more + rules. We will create one function to handle both of these and call it minOf. We will pass it the minimum number of matches (0, 1, or even more) needed for the rule to pass, so minOf(0, check) implements the * quantifier and minOf(1, check) implements +.

// minOf returns a function which works on a token
// reader and runs a check repeatedly, up to infinity.
// If the check matches fewer than the minimum amount
// of times, null is returned; anything after the
// minimum is optional.
const minOf = (minAmount, check) => reader => {
    reader.pushState(); // first we store the current state.

    const results = [];
    
    let result = null;

    while(true) {
        // we run checks as many times
        // as we can in this loop.
        result = check(reader);

        if (!result) {
            if (results.length < minAmount) {
                // result was not found and we
                // didn't reach the minimum
                // amount so the matching is a failure
                // and we restore state before
                // return a null.
                reader.restoreState();
                return null;
            }

            // We didn't find a match 
            // so we do not need to be
            // in this loop anymore so we exit it.
            break;
        }

        results.push(result);
    }
    
    // We reached the end here so
    // we do not need the state anymore
    // so we remove it and return the results.
    reader.popState();
    return results;
};

And finally, we need a way to define a token rule (also known as a terminal symbol in our case) in our grammar. We will define a function token:

// Token function returns a function
// which works on a token reader, 
// checks the type of the token and (if specified)
// a value of the token.
const token = (type, value = null) => reader => {
    // First we check the token type (isType also guards
    // against reading past the end of the token list),
    // and then, only if a value parameter was given,
    // we compare the token value as well.
    if (reader.isType(type) && (value === null || reader.isValue(value))) {
        // Our type is correct and the value matches
        // so we return the matched token at this point.
        const result = reader.get();

        // this is also the only time we move to the
        // next token in the list, and it is because of this
        // that we need to push and pop the reader state
        // because if we do not go back on failures, we will not be
        // able to match everything correctly
        reader.next();

        // And finally we return the token result.
        return result;
    }

    // Here we didn't find the token we are looking for,
    // so we return null.
    return null;
};
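
With all of these helpers in place, we can already compose them by hand. As a quick sanity check (this snippet is just for illustration, using our three example tokens from the beginning of the article):

const assignment = exactly(
    token('name'),
    token('operator', '='),
    token('number')
);

const reader = new TokenReader(tokens);

// Prints the three matched tokens: 'variable', '=' and '5'.
console.log(assignment(reader));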

If you want to get up to this part in the repository, check out the part2-chapter1 branch.

Writing out the grammar

Now we are ready to write out the grammar. We will separate the implementation into parts.

We will use destructuring assignments and arrow functions to keep this short and concise, so if you are not familiar with these parts of JavaScript, please take a moment to read up on them first.

We import the functions for our rules and create the first set of rules:

const { rule, either, exactly, optional, minOf, token } = require('./rule-helpers');

// LineStatement -> IfExpressionStatement | AssignmentStatement | FunctionStatement
const LineStatement = rule(
    () => either(IfExpressionStatement, AssignmentStatement, FunctionStatement),
    expression => expression // We do not need to do anything with the result here
);

// IfExpressionStatement -> IfKeyword PStart Expression PEnd CodeBlock
const IfExpressionStatement = rule(
    () => exactly(IfKeyword, PStart, Expression, PEnd, CodeBlock),
    ([,, check,, statements]) => ({ type: 'if', check, statements }) // We transform the result into an if statement
)

// CodeBlock -> BStart LineStatement* BEnd
const CodeBlock = rule(
    () => exactly(BStart, minOf(0, LineStatement), BEnd),
    ([, statements]) => statements // We only take the statements from here at index 1.
);

// FunctionStatement -> FunctionExpression Eol
const FunctionStatement = rule(
    () => exactly(FunctionExpression, Eol),
    ([expression]) => expression // We only take the result of FunctionExpression at index 0.
);

// FunctionExpression -> Name PStart FunctionParameters? PEnd
const FunctionExpression = rule(
    () => exactly(Name, PStart, optional(FunctionParameters, []), PEnd),
    ([name, _, parameters]) => ({ // name is at index 0, parameters are at index 2
        type: 'function',
        name: name.value,
        parameters
    })
);

// FunctionParameters -> Expression (Comma Expression)*
const FunctionParameters = rule(
    () => exactly(Expression, minOf(0, exactly(Comma, Expression))),
    ([first, rest]) => [first, ...rest.map(([_, parameter]) => parameter)] // We combine first parameters with all of the rest into one array.
);

// AssignmentStatement -> Name Equals Expression Eol
const AssignmentStatement = rule(
    () => exactly(Name, Equals, Expression, Eol),
    ([name,, expression]) => ({ 
      type: 'assignment', 
      name: name.value, // name at index 0
      expression // expression at index 2
    })
);

Now for the Expression rule:

// We use this function for all binary operations in the
// Expression rules because all of them parse the same way.
// This will allow us to create nested operations.
const processBinaryResult = ([left, right]) => {
    let expression = left;

    // We need to go through all operators on the right side
    // because there can be 3 or more operators in an expression.
    for (const [operator, rightSide] of right) {

        // Each time we encounter an expression we put the
        // previous one in the left side.
        expression = {
            type: 'operation',
            operation: operator.value,
            left: expression,
            right: rightSide
        };
    }

    // Finally we return the expression structure.
    return expression;
};
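
// So, for an input like 1 + 2 + 3, the right side holds two
// [operator, rightSide] pairs and the loop nests them
// left-associatively, producing a structure like this
// (the unary/number wrappers are omitted for brevity):
//
// {
//     type: 'operation',
//     operation: '+',
//     left: { type: 'operation', operation: '+', left: 1, right: 2 },
//     right: 3
// }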

// Expression -> EqualityTerm ((And | Or) EqualityTerm)*
const Expression = rule(
    () => exactly(EqualityTerm, minOf(0, exactly(either(And, Or), EqualityTerm))),
    processBinaryResult
);

// EqualityTerm -> RelationTerm ((DoubleEquals | NotEquals) RelationTerm)*
const EqualityTerm = rule(
    () => exactly(RelationTerm, minOf(0, exactly(either(DoubleEquals, NotEquals), RelationTerm))),
    processBinaryResult
);

// RelationTerm -> AddSubTerm ((Less | Greater | LessEquals | GreaterEquals) AddSubTerm)*
const RelationTerm = rule(
    () => exactly(AddSubTerm, minOf(0, exactly(either(Less, Greater, LessEquals, GreaterEquals), AddSubTerm))),
    processBinaryResult
);

// AddSubTerm -> MulDivTerm ((Add | Subtract) MulDivTerm)*
const AddSubTerm = rule(
    () => exactly(MulDivTerm, minOf(0, exactly(either(Add, Subtract), MulDivTerm))),
    processBinaryResult
);

// MulDivTerm -> UnaryTerm ((Multiply | Divide) UnaryTerm)*
const MulDivTerm = rule(
    () => exactly(UnaryTerm, minOf(0, exactly(either(Multiply, Divide), UnaryTerm))),
    processBinaryResult
);

// UnaryTerm -> Not? Factor
const UnaryTerm = rule(
    () => exactly(optional(Not), Factor),
    ([addedNot, value]) => ({
        type: 'unary',
        withNot: addedNot.type !== 'optional', // We add this field to know whether we need to invert it or not.
        value
    })
);

// Factor -> GroupExpression | FunctionExpression | NumberExpression | VariableExpression | StringExpression
const Factor = rule(
    () => either(GroupExpression, FunctionExpression, NumberExpression, VariableExpression, StringExpression),
    factor => factor
);

// GroupExpression -> PStart Expression PEnd
const GroupExpression = rule(
    () => exactly(PStart, Expression, PEnd),
    ([, expression]) => expression
);

// VariableExpression -> Name
// Remember this part? We said we would need it. This is why.
// We need a way to structure variable names, numbers and strings,
// so we created alias rules where we structure the result tokens.
const VariableExpression = rule(
    () => Name,
    name => ({
        type: 'variable',
        name: name.value
    })
);

// NumberExpression -> Number
const NumberExpression = rule(
    () => Number,
    number => ({
        type: 'number',
        value: number.value
    })
);

// StringExpression -> String
const StringExpression = rule(
    () => String,
    string => ({
        type: 'string',
        value: string.value
    })
);

And finally, our token rules:

// Tokens
const Number = token('number');
const String = token('string');
const Name = token('name');
const Equals = token('operator', '=');
const PStart = token('parenStart');
const PEnd = token('parenEnd');
const BStart = token('codeBlockStart');
const BEnd = token('codeBlockEnd');
const Comma = token('comma');
const Eol = token('endOfLine');
const IfKeyword = token('keyword', 'if');
const And = token('operator', '&&');
const Or = token('operator', '||');
const DoubleEquals = token('operator', '==');
const NotEquals = token('operator', '!=');
const Less = token('operator', '<');
const Greater = token('operator', '>');
const LessEquals = token('operator', '<=');
const GreaterEquals = token('operator', '>=');
const Add = token('operator', '+');
const Subtract = token('operator', '-');
const Multiply = token('operator', '*');
const Divide = token('operator', '/');
const Not = token('operator', '!');

Now, we are ready to parse! Let's modify our index.js a bit to add the parsing functionality:

// Code which we want to parse
const code = `i = 5;`;

// Import the lexer
const analyseCode = require('./lexer-analyser');

// Run the lexer
const tokens = analyseCode(code);

// We include the grammar here.
// The grammar module exports the very first rule: LineStatement.
// That means parseGrammar is actually the same as the LineStatement constant.
const parseGrammar = require('./grammar');
const TokenReader = require('./token-reader');

// Create a reader for our tokens.
const reader = new TokenReader(tokens);

const statements = [];

while (reader.hasNext()) {
    // We keep parsing the grammar while there are tokens left.
    const statement = parseGrammar(reader);

    if (statement) {
        // Our statement was parsed successfully,
        // so we add it to the list of statements.
        statements.push(statement);
        continue;
    }

    // Something went wrong and we couldn't parse the statement,
    // so our language needs to throw a syntax error.
    let token = reader.hasNext() ? reader.get() : reader.getLastToken();
    throw new Error(`Syntax error on ${token.line}:${token.character} for "${token.value}". Expected an assignment, function call or an if statement.`);
}

// Finally we output the statements we parsed.
console.dir(statements, { depth: null }); // We set depth: null so that we can get a nice nested output.

You can also check out this part in the branch part2-chapter2 in the repository.

So for our code i = 5; we will get this:

[
  {
    type: 'assignment',
    name: 'i',
    expression: {
      type: 'unary',
      withNot: false,
      value: { type: 'number', value: '5' }
    }
  }
]

Let's test an if statement code:

if (a > 5) {
    a = 5;
}

We get:

[
  {
    type: 'if',
    check: {
      type: 'operation',
      operation: '>',
      left: {
        type: 'unary',
        withNot: false,
        value: { type: 'variable', name: 'a' }
      },
      right: {
        type: 'unary',
        withNot: false,
        value: { type: 'number', value: '5' }
      }
    },
    statements: [
      {
        type: 'assignment',
        name: 'a',
        expression: {
          type: 'unary',
          withNot: false,
          value: { type: 'number', value: '5' }
        }
      }
    ]
  }
]

And for a function call print('hello' + ' ' + 'world'); we get:

[
  {
    type: 'function',
    name: 'print',
    parameters: [
      {
        type: 'operation',
        operation: '+',
        left: {
          type: 'operation',
          operation: '+',
          left: {
            type: 'unary',
            withNot: false,
            value: { type: 'string', value: 'hello' }
          },
          right: {
            type: 'unary',
            withNot: false,
            value: { type: 'string', value: ' ' }
          }
        },
        right: {
          type: 'unary',
          withNot: false,
          value: { type: 'string', value: 'world' }
        }
      }
    ]
  }
]

As you can see, our grammar rules also transformed and nicely structured our statements, making them easy for us to interpret and do something with. You can play around with if statements and function calls to get different output.

In the next part, we will cover writing the interpreter, which will finally allow us to instruct our language to do some actual work.

See ya until then! :)
