A High Level Architecture of the TypeScript compiler

2023-02-27

• 8 min read

#typescript #programming_language_design

grayscale photo of person running in panel paintings

Photo by David Werbrouck

Notes on how the TypeScript compiler works. Resources I took the notes from:

Program

The project coordinator:

Read the TSConfig: setup the program, get starting files
Pre-process files: follow imports to discover all possible files
Tokenize and Parse: convert text to a syntax tree
Binder: convert identifiers in syntax tree to symbols
Type Check: use binder and syntax tree to look for issues in code
Transform: changes the syntax tree to match tsconfig options
Emit: prints the syntax tree into .js, .d.ts, and other files

Let's see how that API looks like:

import * as ts from 'typescript';

// code to syntax tree
const program = ts.createProgram(files, opts);

// binding and type checking
const checker = program.getTypeChecker();

// syntax tree to code
program.emit();

The steps can broken down into three pieces

Source Code to Data: Tokenize and parsing phases
Type Checking: binding and type checking phases
Creating Files: emitting files into the disk

Source code to data

Converting code to a syntax tree

function welcome(str: string) {
  console.log(str);
}

const msg: string = 'Hello World!';

welcome(msg); // Hello World!

It creates the syntax tree using a scanner and a parser.

Scanner

The scanner receives the text (source code) and outputs a sequence of tokens: text -> tokens

const msg: string = 'Hello World';

It transforms this source code into tokens like this:

ConstKeyword WhitespaceTrivia Identifier ColonToken WhitespaceTrivia StringKeyword WhitespaceTrivia EqualsToken WhitespaceTrivia ...

The scanner also has diagnostics, an expressive way to show users their JavaScript/TypeScript code is invalid, has errors, or is missing anything.

No closing quote to the open quote:

const noEnd = '; // => Unterminated string literal.

An invalid character:

const 🇯🇵 = 'Japan Flag'; // Invalid character

Numeric separators can't be consecutive:

const num = 1__0; // Multiple consecutive numeric separators are not permitted.

And so on.

Parser

The parser gets the tokens generated by the scanner and creates the syntax tree: tokens -> AST

The AST is a series of Nodes. Each Node holds its position, the syntax kind, the value, etc.

A simple example of a syntax tree is the representation of a variable declaration. Given this source code:

const num = 10;

The parser will generate this syntax tree:

SourceFile
  - VariableStatement
    - VariableDeclarationList
      - VariableDeclaration
        - Identifier
        - NumericLiteral
  - EndOfFileToken

The VariableStatement has a list of VariableDeclaration, which has the Identifier (num) and the NumericLiteral (10) as children.

If we add the a number type to the num, the AST will have new child Node.

const num: number = 10;

The syntax will look like this:

SourceFile
  - VariableStatement
    - VariableDeclarationList
      - VariableDeclaration
        - Identifier
        - NumberKeyword // --> this node represents the `number` type
        - NumericLiteral
  - EndOfFileToken

There's a cool AST playground we can use to have a better understanding on how the source code and the generated syntax tree are related one to another.

The parser diagnostics shows the "right JavaScript code in the wrong place".

Private identifiers outside of a class:

#constant = 10; // The left-hand side of an assignment expression must be a variable or a property access.

BigInt should be integers not decimals:

const decimalBigInt = 1.2n; // A bigint literal must be an integer.

Reserved words in JavaScript:

const extends = 'extends'; // 'extends' is not allowed as a variable declaration name.

And so on.

Type Checking

Binder

The binder transforms the syntax into symbols: syntax -> symbols

It produces symbols in the form of symbol tables. It keeps track of the declaration(s) of identifiers, so if the checker wants to find the type of an expression, it can look up the identifier in the symbol table, get the declaration, and find its type annotation or initial value.

Processing code, there're different scopes. Let's see an example

const msg: string = 'Hello World!';

function welcome(str: string) {
  console.log(str);
}

welcome(msg);

For each scope, we have different symbols:

In the global scope, we have the msg and the welcome variables.
In the function scope, we have the str variable.

Symbols are tables for each scope in the program to store identifier with its metadata (where it was declared — line — and its flag).

Getting the syntax tree from the previous source code, the binder generates these symbol tables:

Global Scope:

msg:
- declared line 0
- flags: BlockScopedVariable
welcome:
- declared line 6
- flags: Function

welcome Function Scope:

parent: Global Scope
str:
- declared line 2
- flags: BlockScopedVariable

And the binder tries to keep track of the identifiers across files.

Another responsibility of the binder is to structure the flow nodes. It creates the flow graph to represent each scope and its types. It has different "containers" and each container could have different types for the same variable. Let's see an example.

function logValue(x: string | number) {
  if (typeof x === 'string') {
    console.log('string', x);
  } else {
    console.log('number', x);
  }

  x;
}

logValue function container
- x with type string | number
- "if condition is true" container
  - x with type string
- "if condition is false" container
  - x with type number
- x again with type string | number
  - it should check all the flow nodes because it's possible x has been changed by different conditionals
    - return statements
    - mutation
    - casting

The binder diagnostics show errors like:

The impossibility to declare a variable with the same naming it was used before for another variable:

const num = 1;
const num = 2; // Cannot redeclare block-scoped variable 'num'.

When identifiers are duplicated:

class User {}
type User = {}; // Duplicate identifier 'User'.

Checker

There're three main systems: how it checks, comparing types, and the inference system.

The checker checks each part of the syntax tree.

If we have a simple variable statement:

const msg: string = 'test';

It results into this syntax tree:

SourceFile
  - VariableStatement
    - VariableDeclarationList
      - VariableDeclaration
        - Identifier
        - StringKeyword
        - StringLiteral
  - EndOfFileToken

And the checker checks each node of the syntax tree.

checker
  .checkSourceElementWorker()
  .checkVariableStatement()
  .checkGrammarVariableDeclarationList()
  .checkVariableDeclaration()
  .checkVariableLikeDeclaration()
  .checkTypeAssignableToAndOptionallyElaborate()
  .isTypeRelatedTo(identifier, stringLiteral);

And the checker considers the source type and target string as the same type and returns true.

A more complex type checking analysis is when you need to check object type. e.g.

{ hello: number } = { hello: 'world' }

both are an object type
compare each object "field" (object's keys)
the return type (the value): are they assignable?
- if fails the check

Another example:

Promise<string> = Promise<{ hello: string }>;

both are a Promise object
compare the type arguments inside the promise
the string is not assignable to the object and it fails the check

The initializer inference

For type inference, the syntax tree, rather than providing the StringKeyword, there's no type given:

const msg = 'test';

SourceFile
  - VariableStatement
    - VariableDeclarationList
      - VariableDeclaration
        - Identifier
        // - No type given
        - StringLiteral
  - EndOfFileToken

So the type checker tries to understand the syntax shape to fill in the gaps: From StringLiteral, it understands that the type is a string (StringKeyword) and move that into the type.

Type parameter inference is a bit different. Take this code as an example:

declare function setup<T>(config: { initial(): T }): T;

For type parameter inference, it needs to be defined further down. For the type checker to understand or to infer the type of T, you should write the function passing an object with the a initial function which returns a value and the type checker will "infer" that the value's type is T.

Generic Function

Generic Arg T
Return value is T
Param is an object which has a function initial that returns a T

And you write the actual setup function call:

const test = setup({ initial: () { return 'test' }});

Now the return type of your implementation (string type) will be moved to the T type.

{ initial: (): T } = { initial: () { return 'test' }}

So now the type looks like this for this specific function:

declare function setup<string>(config: { initial(): string }): string;

Emitting files

Emitter

It transforms the syntax tree into files: syntax tree -> files

What to emit? e.g. *.js, *.map, *.d.ts
Printer: syntax tree to text
Tracking temp vars
Transformers: syntax tree to syntax tree

Tranforming the syntax tree into a new syntax tree to remove static types. Basically transforming TypeScript into legitimate JavaScript.

SourceFile                                        SourceFile
  - VariableStatement                               - VariableStatement
    - VariableDeclarationList                         - VariableDeclarationList
      - VariableDeclaration           --->              - VariableDeclaration
        - Identifier                                      - Identifier
        - StringKeyword                                   // removed type
        - StringLiteral                                   - StringLiteral
  - EndOfFileToken                                  - EndOfFileToken

The emitter remove the types, emits JavaScript output and .dts files (type check them).

Resources

Hey! You may like this newsletter if you're enjoying this blog. ❤

✖

Twitter · Github