
Building a language interpreter in JavaScript - Part 1

Building a language is hard, we all know that. So let's build one!

In this article series, we will build a usable language interpreter using pure JavaScript. Keep in mind that we will cover the most important (and most fun) parts of building a programming language. We will not concentrate on things like optimization or emitting bytecode.

The implementation will probably not be super optimal either, but the purpose is to learn. And the end product can still be used for many applications which do not require high-speed computing.

We will be using NodeJS for this project, but the end product can easily be tested in the browser too.

Another thing to note: this series will not use any existing libraries. We will learn much more this way than by just using a library. But to help those who just want the tl;dr, at the end I will show you an easier way to do this by using a library that solves a lot of the things we will implement manually.

Let's meet the language

Let's meet the language we will be making -- printly. Below is a simple snippet of what we will be interpreting:

firstName = 'John';
lastName = 'Smith';
age = 50;

print(firstName + ',' + lastName);

if (age > 40) {
    print(firstName + ' is over 40 years old');
}

So, let's see what we have here:

Variable assignments like firstName = 'John';

Print function call print(firstName + ',' + lastName);

Operators + , >

Parentheses ( )

If statement if (age > 40) {

Code blocks inside an if statement { }

String literals 'John'

Numbers 50

And of course, so that we do not forget... the dreaded semicolon ; :)

Overview

Here is the overview of stages we will perform in interpreting our language:

  1. We will read the code with the lexer and produce tokens
  2. We will read the tokens and create a structure that can be interpreted
  3. We will interpret the structure and execute it.
  4. We will return the result of the executed code.
Note: You can follow the parts of this article by checking out specific branches mentioned in the article.
Complete code is available at: https://github.com/ArekX/javascript-interpreter-example
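
To make the flow concrete, the end result will roughly look like the sketch below. The function names here (tokenize, parse, interpret) are placeholders for the pieces we will build throughout the series, not the final API:

// A rough sketch of the pipeline we are building in this series.
// The names below are placeholders, not the final code.
const run = code => {
    const tokens = tokenize(code); // 1. lexer: characters -> tokens
    const ast = parse(tokens);     // 2. parser: tokens -> interpretable structure
    return interpret(ast);         // 3 & 4. interpreter: execute and return the result
};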

Building a lexer

We will start our journey by building a lexer. A lexer is a piece of software that reads characters and transforms them into tokens. We will then use those tokens in the next stage.

You might think, "OK, this is easy, we can just use some regex to find the things we want in the code string and produce tokens." This is not a good approach, because you also need to detect the correct position of each token, which would require a really complicated regex (if it is possible at all) and can easily lead to bugs.

That is not to say that we won't use regex at all; we will, but only at the character level, to detect specific ranges of characters like uppercase/lowercase letters and digits.

Our first order of business is to read the characters one by one. We need to read characters and keep track of where we are, so that the lexer can look at characters ahead of (or before) the current position and decide whether it can detect a specific token.

So to start we need a character reader:

module.exports = class CharacterReader {
    constructor(code) {
        this.code = code; // store code which we will read through
        this.characterPosition = 0; // Current character in a line of code text
        this.linePosition = 0; // Current line in the code text
        this.position = 0; // Current character in the code string.
    }

    // Return the next `amount` characters
    // without advancing the character reader.
    peek(amount = 1) {
        return this.code.substring(this.position, this.position + amount);
    }

    // Advance the character reader by specified amount
    next(amount = 1) {
        // we need a loop to go through all of the characters
        // by the specified amount so that we can properly
        // determine when a new line happened so that we
        // can keep proper line and character position.
        for(let i = this.position; i < this.position + amount; i++) {
            if (this.code[i] == '\n') { // If a new line character is detected
                this.linePosition++; // Increase line position
                this.characterPosition = 0; // Reset character position as it is a new line.
                continue;
            }

            this.characterPosition++; // Increase character position for the line.
        }

        this.position += amount; // Change current reader position in code string.
    }

    // Getter to just return current character position in the line in the code.
    getCharacterPosition() {
        return this.characterPosition;
    }

    // Getter to return current line position in the code
    getLinePosition() {
        return this.linePosition;
    }

    // Check and return whether there is more code to parse.
    hasNext() {
        return this.position < this.code.length;
    }
}

And to use it let's write:

const CharacterReader = require('./character-reader');

// Code which we want to parse
const code = `variable = 5`;

// Create an instance of character reader.
const reader = new CharacterReader(code);

// Should output a 'v'
console.log(reader.peek());
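
As an extra illustration (not part of the original example), here are a few more calls that show how peek, next and the position tracking work together:

const CharacterReader = require('./character-reader');

const reader = new CharacterReader('a = 5;\nb = 6;');

console.log(reader.peek(4));               // 'a = '
reader.next(7);                             // consume 'a = 5;' and the new line character
console.log(reader.peek());                 // 'b'
console.log(reader.getLinePosition());      // 1 -- we are now on the second line
console.log(reader.getCharacterPosition()); // 0 -- first character of that line
console.log(reader.hasNext());              // true -- there is more code to read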

If you wish to follow this part through the code, please check out the branch part1-chapter1.

Reading characters into Tokens

A token is a representation of a concept we support in our language, like a variable name, an equal sign, a string, etc.

There are multiple reasons why we need tokens:

  • We can perform syntax validation to ensure that the code is written correctly -- for example, that token B comes after token A.
  • We can process tokens instead of going one character at a time, without having to worry whether there is one space between a variable and an equal sign or fifty.
  • Tokens are normalized, meaning it is easy to know what they are and where they are in the text, without writing different if statements for edge cases such as whether a number is floating point (contains a .) or not.
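
As a preview, once the lexer is done, each token will be a small object like this one -- a type, the matched value, and position information that the detection loop attaches:

{
    type: 'name',
    value: 'firstName',
    start: 0,
    end: 9,
    line: 0,
    character: 0
}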

To read tokens we first need to define them. In our language we have the following tokens:

  1. Numbers
  2. Strings between ' character
  3. Operators !, +, -, *, /, ==, !=, &&, ||, <, >, <=, >=, =
  4. Keywords like if
  5. Names like variable names, function names, etc.
  6. Parentheses ()
  7. Code blocks between {}
  8. End of line ;
  9. Commas ,
  10. Whitespace like tabs, new lines, space characters

We need to create a detection algorithm for each of these. We will separate them into functions.

Keep in mind that the order of detection also matters here. If we, for instance, detect variable names before keywords, we will never detect keywords. The order of detection we will use is the order of the list above; we will register the detectors in that order at the end of this part.

Implementing a detection loop

Our detection loop will detect characters using the character reader implemented. We will create it like this:

// Character reader we implemented
const CharacterReader = require('./character-reader');

// List of token detector functions we will implement.
const tokenDetectors = [];

const detectTokens = code => {
    // Create character reader for our code.
    const reader = new CharacterReader(code);

    // List of tokens we found in the code.
    const foundTokens = [];

    // We loop until we go through all of the characters.
    while (reader.hasNext()) {
        let token = null;

        // Store the positions in case we detect the token
        let startPosition = reader.position;
        let linePosition = reader.getLinePosition();
        let characterPosition = reader.getCharacterPosition();

        // We go through each of the token detectors
        // and call the function for detecting each token.
        for (const detectToken of tokenDetectors) {
            token = detectToken(reader);

            if (token) {
                // Token is detected so we do not
                // continue detection.
                break;
            }
        }

        // If no detector recognized the character at this
        // position, it means we have a syntax error in our
        // language, so we should not continue.
        if (!token) {
            throw new Error(`Invalid character '${reader.peek()}' at ${linePosition}:${characterPosition}`);
        }

        // If a token is found we store the token data
        // together with the position information.
        foundTokens.push({
            ...token,
            start: startPosition,
            end: reader.position,
            line: linePosition,
            character: characterPosition
        });
    }

    // After we found all of the tokens we remove the whitespace
    // tokens because we will not use them.
    return foundTokens.filter(i => i.type !== 'whitespace');
};

Now we will implement each of the token detectors.

Numbers

For number detection, we need to detect digit characters. To keep things simple we will only detect integer numbers here, but this function can easily be modified to support a full number specification (a possible extension is sketched after the code below).

const readNumberToken = reader => {
    let numberText = '';
    const numberMatch = /\d/; // Regex for detecting a digit.

    // We read while there are characters to read.
    while (reader.hasNext()) {
        if (reader.peek().match(numberMatch)) { 
            // If a number matches the regex we add the
            // character to our string
            numberText += reader.peek();
            reader.next();
        } else {
            // if the number is not matched we do not need to search anymore.
            break;
        }
    }

    if (numberText.length == 0) {
        // if no number was detected, return null meaning no token detected.
        return null;
    }

    // We found the token and we return type and value of the token.
    return { type: 'number', value: numberText };
}
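
As a sketch of the "full number specification" idea mentioned above -- this is not part of the article's implementation, and the function name is made up for illustration -- the detector could be extended to also accept a single decimal point:

// Hypothetical variant (not in the article's code): also accept
// one '.' so that floating point literals such as 3.14 are detected.
const readNumberTokenWithDecimals = reader => {
    let numberText = '';
    let hasDot = false;

    while (reader.hasNext()) {
        const character = reader.peek();

        if (character.match(/\d/)) {
            // A digit is always part of the number.
            numberText += character;
            reader.next();
        } else if (character === '.' && !hasDot && numberText.length > 0) {
            // Accept a single dot, but only after at least one digit.
            hasDot = true;
            numberText += character;
            reader.next();
        } else {
            break;
        }
    }

    // Return null when nothing was detected, same as before.
    return numberText.length > 0 ? { type: 'number', value: numberText } : null;
}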

Strings

For string literals, we need more code because we need to handle specific states. A string starts with a quote character ' and ends with another quote character.

Between those characters, we need to capture every character (this is also known as consuming characters). We also have a special case to solve: when we want to have a quote inside a string.

In our implementation, we decided that if you prepend a \ character, we will not treat the following quote character as the end of the string. This is called character escaping.
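
For example, this piece of printly code uses an escaped quote inside a string:

name = 'John\'s book';
print(name);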

This is the implementation function below:

const readString = reader => {
    let value = '';
    let startedReading = false; // Flag if we started reading a string
    let isEscaping = false; // Flag if we need to ignore the next character.

    // We read while there are characters to read.
    while (reader.hasNext()) {
        const matchFound = reader.peek() == "'";

        // if we didn't start reading the string and the string character didn't match
        // this means that we didn't encounter a string.
        if (!startedReading && !matchFound) {
            break;
        }

        // This allows us to have a ' character inside
        // our strings as long as we escape it.
        if (reader.peek() == '\\' && !isEscaping) {
            isEscaping = true;
            reader.next();
            continue; // we only consume this character and not add it to value.
        }

        // if we started reading and found a string character,
        // this means that we reached the end of string literal
        if (startedReading && matchFound && !isEscaping) {
            reader.next(); // move to a character after ' in the code.
            break;
        }

        // if we didn't start reading but we found a valid string start
        // we set the state for reading the string.
        if (!startedReading && matchFound) {
            startedReading = true;
            reader.next();
            continue;
        }

        // Add the character to our detected string.
        value += reader.peek();
        reader.next(); // Move the reader to a next character.
        isEscaping = false; // Reset escape flag so that we do not escape the next character.
    }

    if (value.length == 0) {
        return null; // if no string token was found
    }

    // return token of type string
    return { type: 'string', value };
}
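
A quick way to see the escaping in action (just for illustration, assuming readString and CharacterReader are in the same scope):

const reader = new CharacterReader("'John\\'s book'");

// Outputs: { type: 'string', value: "John's book" }
console.log(readString(reader));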

Operators

Now, let's detect our operators. We have different types of operators from
mathematical to relational. In this case, we will detect them all.

Since some operators have two characters, like ==, &&, ||, we need to peek at two characters as well as at one character in the character reader, and check the two-character match first. There are other ways to do this, but this way is easier to understand.

const readOperator = reader => {
    // Regex for operator characters we want to detect.
    const operatorMatch = /^(!|\+|-|\*|\/|==|!=|&&|\|\||<|>|<=|>=|=)$/;

    // Peek one character to detect one character operator
    const oneCharacterOperator = reader.peek();

    // Peek one character to detect two characters operator
    const twoCharacterOperator = reader.peek(2);

    let value = null;

    if (twoCharacterOperator.match(operatorMatch)) {
        reader.next(2);
        value = twoCharacterOperator; // two character operator was found
    } else if (oneCharacterOperator.match(operatorMatch)) {
        reader.next();
        value = oneCharacterOperator; // one character operator was found
    }

    if (value) {
        // Operator is found, we return the token.
        return { type: 'operator', value };
    }

    // Nothing was found so we return null that the token was not found.
    return null;
}
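
For illustration, this is why the two-character peek is checked first -- given the input below, we want to detect == rather than two separate = operators (assuming readOperator and CharacterReader are in scope):

const reader = new CharacterReader('== 5');

// Outputs: { type: 'operator', value: '==' } and not { type: 'operator', value: '=' }
console.log(readOperator(reader));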

Keywords

After that, we have to detect keywords. Keywords are names that we have decided have a special meaning in our language. Our language will only have one: the if keyword.

const readKeyword = reader => {
    if (reader.peek(2).match(/^if$/i)) {
        // We detected if keywords and return the token.
        reader.next(2);
        return { type: 'keyword', value: 'if' };
    }

    // No keyword detected
    return null;
}

Names

Now for variable and function names, we need to define what those are. Both a variable 'name' and a function 'name' are just that: names. We can detect them using the same token type. Then, depending on the context, we can deduce whether we are calling a function or referencing a variable.

For our language, a name must start with a lowercase letter and then needs to have a continuous string of letters and numbers.

We can represent a name with two regular expressions:

  • /[a-z]/ for the start of the name; if this is not matched, we do not detect anything further.
  • /[a-zA-Z0-9]+/ for the rest of the name; we keep consuming characters for as long as they match this.

Here is the implementation:
const readName = reader => {
    let value = '';
    const startOfVariableMatch = /[a-z]/;
    const restOfVariableMatch = /[a-zA-Z0-9]/;

    // If we did not match the variable, do not return a token.
    if (!reader.peek().match(startOfVariableMatch)) {
        return null;
    }

    value = reader.peek();
    reader.next();

    while (reader.hasNext() && reader.peek().match(restOfVariableMatch)) {
        // add a character to the value as long as we match the variable name.
        value += reader.peek();
        reader.next();
    }

    // we return a variable token
    return { type: 'name', value };
}

Parentheses

This is an easy one. We detect ( or ) and return a different token based on what we detected.

const readParentheses = reader => {
    if (reader.peek() == '(') {
        // We detected '(', start of parentheses
        reader.next();
        return { type: 'parenStart', value: '(' };
    }

    if (reader.peek() == ')') {
        // We detected ')', end of parentheses
        reader.next();
        return { type: 'parenEnd', value: ')' };
    }

    // No token was detected.
    return null;
}

Code blocks

Another easy one. We detect { or } and return a different token based on what we detected.

const readCodeBlocks = reader => {
    if (reader.peek() == '{') {
        // We detected '{', start of code block
        reader.next();
        return { type: 'codeBlockStart' };
    }

    if (reader.peek() == '}') {
        // We detected '}', end of code block
        reader.next();
        return { type: 'codeBlockEnd' };
    }

    // No token was detected.
    return null;
}

End of line

Ah yes, the semicolon :). Fortunately, it is easy to detect. We just detect a ; character and move on.

const readEndOfLine = reader => {
    if (reader.peek() == ';') {
        // Semicolon is detected
        reader.next();
        return { type: 'endOfLine', value: ';' };
    }

    // Semicolon is not detected
    return null;
}

Commas

Where would we use a comma? Well, we have a few places, function arguments for one. Similar to semicolons, we just detect the , character.

const readComma = reader => {
    if (reader.peek() == ',') {
        // Comma was detected
        reader.next();
        return { type: 'comma', value: ',' };
    }

    // Token was not detected.
    return null;
}

Whitespace

And finally, the last token we detect is whitespace. This token has the lowest priority because we do not want it detected ahead of something like strings, which can also contain whitespace characters. We will treat tabs, spaces, the new line character and the carriage return character as whitespace.

Remember that in the reader loop above, we remove this token from the list after it is detected because we do not have a need for it.

const readWhitespace = reader => {
    const whitespaceRegex = /[\t\r\n ]/; // Regex for detecting whitespace.
    let value = '';
    while(reader.hasNext() && reader.peek().match(whitespaceRegex)) {
        // add detected whitespace to the value
        value += reader.peek();
        reader.next();
    }

    if (value.length > 0) {
        // Return detected whitespace.
        return {type: 'whitespace', value};
    }

    // No whitespace token was detected.
    return null;
}
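
With all of the detectors written, the last thing to do is to register them in the tokenDetectors array used by the detection loop, in the priority order discussed earlier. The exact wiring may look slightly different in the repository, but it boils down to something like this:

// Register the detectors in priority order;
// the detection loop tries them from top to bottom.
tokenDetectors.push(
    readNumberToken,
    readString,
    readOperator,
    readKeyword,
    readName,
    readParentheses,
    readCodeBlocks,
    readEndOfLine,
    readComma,
    readWhitespace // lowest priority; filtered out after detection
);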

And our lexer is complete. Passing code to this lexer will result in a list of tokens we can use for parsing.

As a final step, we will modify our main function a little bit into the following code:

// Code which we want to parse
const code = `variable = 5`;

// Import the lexer
const analyseCode = require('./lexer-analyser');

// Run the lexer
const tokens = analyseCode(code);

// Should output the tokens
console.log(tokens);

And as an output we will get the following:

[
  {
    type: 'name',
    value: 'variable',
    start: 0,
    end: 8,
    line: 0,
    character: 0
  },
  {
    type: 'operator',
    value: '=',
    start: 9,
    end: 10,
    line: 0,
    character: 9
  },
  {
    type: 'number',
    value: '5',
    start: 11,
    end: 12,
    line: 0,
    character: 11
  }
]

And this is a token set for assigning the number 5 to a variable named variable.

In the next part, we will continue by implementing a parser that parses these tokens.

If you followed the code from GitHub you can get to this point by switching to the branch part1-chapter2.
