We've been busy cooking up Dragon, our hobby programmable GPU. The GPU has a number of strange quirks, and programming it directly in assembly became quite complex. For these reasons, I decided to make a small programming language for writing high-level code that can run on the VPU processors found in the GPU. The language is called "Dragon Compute Language" or "DCL" for short.

Compiler Structure

The DCL compiler is written entirely in Python. Programs for DCL are necessarily small, since the entire program memory within a VPU holds only on the order of a thousand hardware instructions. This means the compiler can parse and optimize programs at fast (even interactive) rates even though it's written in Python.

The compiler itself is structured as a number of phases.

Lexer

DCL uses a very simple and traditional lexer. It walks sequentially over the DCL source code from start to end and produces a stream of Tokens. Each Token has a type ("Left Parenthesis", "Keyword", etc.), the matched string itself, and a note of where it appeared inside the source file. The location information is important for error reporting, for instance when there is a syntax error in the code.
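
To make this concrete, here is a minimal sketch of what the Token type and lexer loop might look like. The names and the set of token types here are illustrative, not the actual DCL compiler source.

    from dataclasses import dataclass
    from enum import Enum, auto

    class TokenType(Enum):
        KEYWORD = auto()
        IDENTIFIER = auto()
        NUMBER = auto()
        LPAREN = auto()
        RPAREN = auto()

    @dataclass
    class Token:
        type: TokenType
        text: str   # the exact characters from the source
        line: int   # location info, kept for error reporting
        col: int

    KEYWORDS = {"fn", "let", "const"}  # illustrative subset

    def lex(source: str) -> list[Token]:
        """Walk the source once, start to end, emitting Tokens."""
        tokens, i, line, col = [], 0, 1, 1
        while i < len(source):
            ch = source[i]
            if ch == "\n":
                i, line, col = i + 1, line + 1, 1
            elif ch.isspace():
                i, col = i + 1, col + 1
            elif ch in "()":
                kind = TokenType.LPAREN if ch == "(" else TokenType.RPAREN
                tokens.append(Token(kind, ch, line, col))
                i, col = i + 1, col + 1
            elif ch.isdigit():
                start, start_col = i, col
                while i < len(source) and source[i].isdigit():
                    i, col = i + 1, col + 1
                tokens.append(Token(TokenType.NUMBER, source[start:i], line, start_col))
            elif ch.isalpha() or ch == "_":
                start, start_col = i, col
                while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                    i, col = i + 1, col + 1
                text = source[start:i]
                kind = TokenType.KEYWORD if text in KEYWORDS else TokenType.IDENTIFIER
                tokens.append(Token(kind, text, line, start_col))
            else:
                # the stored line/col let syntax errors point back at the source
                raise SyntaxError(f"unexpected {ch!r} at {line}:{col}")
        return tokens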

Parser

DCL implements an LL(1) grammar and is parsed using Recursive Descent. The input to the parser is the sequence of Tokens from the lexer and the output is a tree of Nodes. The Nodes capture all the expressions, statements, functions, etc. Initially the parser parses a "Program"-type Node. It looks at the next token to determine what will be parsed next as part of the program, e.g. a function definition, a constant definition, etc. That decision descends into parsing a function (or whatever it is), which in turn looks at the next token to determine the next thing to parse.

The grammar of DCL is designed to make this style of parsing easy. For instance, suppose we are at the top-level scope of the source file and see that the next Token is int. In C-like languages, you cannot tell whether this is the return type of a function declaration or the start of a variable declaration of type int, and the parsing of those two things is quite different. With one token of "look ahead" you can't tell which it is; it's ambiguous. For reasons like this, functions are denoted fn my_function() { ... } in DCL (and many other languages). At the top-level program scope, seeing a fn keyword Token unambiguously tells you a function definition follows, and you are guaranteed the only valid next token is a function name, then a left parenthesis, etc.
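
Here is a rough sketch of how that top-level dispatch might look in a recursive descent parser, reusing the Token type from the lexer sketch above. The node classes and helper names are illustrative, not the actual compiler source.

    from dataclasses import dataclass

    @dataclass
    class FunctionNode:
        name: str
        body: list  # statement Nodes, elided in this sketch

    @dataclass
    class ProgramNode:
        items: list

    class TokenStream:
        """Thin wrapper providing the single token of lookahead LL(1) needs."""
        def __init__(self, tokens):
            self.tokens, self.pos = tokens, 0
        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else None
        def take(self):
            tok = self.peek()
            self.pos += 1
            return tok
        def expect(self, text):
            tok = self.take()
            if tok is None or tok.text != text:
                raise SyntaxError(f"expected {text!r}")
            return tok

    def parse_program(ts: TokenStream) -> ProgramNode:
        """Top-level rule: one token of lookahead decides what to parse next."""
        items = []
        while ts.peek() is not None:
            if ts.peek().text == "fn":
                items.append(parse_function(ts))
            else:
                raise SyntaxError(f"unexpected {ts.peek().text!r} at top level")
        return ProgramNode(items)

    def parse_function(ts: TokenStream) -> FunctionNode:
        ts.expect("fn")            # the caller guaranteed this keyword
        name = ts.take().text      # the only valid next token: the name
        ts.expect("(")
        ts.expect(")")             # parameters elided in this sketch
        ts.expect("{")
        ts.expect("}")             # body elided in this sketch
        return FunctionNode(name, [])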

Type Inference

DCL is strongly typed and has no implicit type casting anywhere. There are a number of built-in types, and users may define packs and structs when they want to define data in new ways (see the Language Reference for details). Type inference works by visiting the Node hierarchy output by the Parser from the bottom up. At the leaves of the Node tree, we have constants, variable declarations, and built-in types. Because we know the actual type of all of these, we can assign type information to each leaf Node. Then we can visit those leaves' parents and determine each parent's type information by looking at what type of Node it is and the type information of its children.

The way this is implemented in DCL is through a hierarchy of Tables. The responsibility of a Table is to define what variables and types are defined at a given logical scope of the program. For instance, the root of the program has a Table which is automatically assigned all the built-in types and variables. A function or a { ... } block inside a function each get a new Table to track what variables are defined at that scope. When a Table is made, it is assigned to the corresponding Node (e.g. a Function Node or a BlockScope Node in the Node tree). Additionally, Tables have a reference to their parent. When a line of code refers to something called foo, the Node where it is referenced looks for a Table assigned to itself, and if one is not found, it looks upward through its parent Nodes until it finds a Table. It then consults that Table to find out "what is foo?". If that Table does not have the definition, it consults its parent Table. If foo is never found after going through Tables all the way to the root, then this is an undefined variable of some sort and a programmer error.
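
A minimal sketch of this Table chain, with illustrative names (not the actual compiler source):

    class Table:
        """Tracks what names are defined at one logical scope."""
        def __init__(self, parent=None):
            self.parent = parent   # enclosing scope's Table; None at the root
            self.symbols = {}      # name -> type (or other definition info)

        def define(self, name, type_info):
            self.symbols[name] = type_info

        def lookup(self, name):
            """Resolve a name, walking outward through enclosing scopes."""
            table = self
            while table is not None:
                if name in table.symbols:
                    return table.symbols[name]
                table = table.parent
            raise NameError(f"undefined name {name!r}")  # programmer error

    # The root Table holds built-ins; each function/block nests under it.
    root = Table()
    root.define("vec2", "built-in type")
    fn_scope = Table(parent=root)
    fn_scope.define("foo", "float")
    block_scope = Table(parent=fn_scope)
    print(block_scope.lookup("foo"))   # found one scope up: 'float'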

The end result of Type Inference is that every Node which can have a type (every number, expression, etc.) will have a type assigned.

Type Checking

With type inference completed, we can check that the type information is semantically correct. This means making sure that if you define a float variable and try to assign a vector to it, that will fail because the types do not match. Similarly, adding things together requires matching types, and so on. In practice, type checking in the DCL compiler is done partly during type inference and partly in additional passes after all type inference has completed.
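
As a sketch of what such a check might look like, over simplified node shapes whose types were already filled in by inference (illustrative, not the actual compiler source):

    from types import SimpleNamespace as Node  # stand-in for real AST Nodes

    def check_binary_op(node):
        """Arithmetic like '+' requires both operands to have the same type."""
        if node.left.type != node.right.type:
            raise TypeError(f"operands of {node.op!r} differ: "
                            f"{node.left.type} vs {node.right.type}")

    def check_assignment(node):
        """Reject assignments whose two sides have different inferred types."""
        if node.target.type != node.value.type:
            raise TypeError(f"cannot assign {node.value.type} "
                            f"to {node.target.type}")

    check_binary_op(Node(op="+", left=Node(type="float"), right=Node(type="float")))  # fine
    check_assignment(Node(target=Node(type="float"), value=Node(type="vec2")))  # raises TypeError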

In situations where we need to change types, the cast(T, ...) and as(T, ...) built-in functions are used. These allow the programmer to perform a "semantic" cast (like converting a float to an integer, which might require some actual computation) using cast, or simply reinterpret the bits of a variable as a new type using as.
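
One way a compiler might treat the two differently when lowering (a sketch under assumed node and op names, not DCL's actual implementation):

    def lower_cast(node, emit):
        """cast(T, x): a semantic conversion, which may cost an instruction."""
        src, dst = node.value.type, node.target_type
        if (src, dst) == ("float", "int"):
            return emit("float_to_int", node.value)  # real computation happens
        raise TypeError(f"no conversion from {src} to {dst}")

    def lower_as(node):
        """as(T, x): reinterpret the existing bits; free at runtime."""
        node.value.type = node.target_type           # just retag, same bits
        return node.value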

Some built-in functions which read from memory are special. An arbitrary bit of data in memory has no type, so all memory read, tile read, and similar operations require the programmer to state the type of the thing being read, and the function returns that type.

Additional Validity Checks

In order to reduce complexity in the compiler, DCL has a number of restrictions. As an example, DCL has top-level functions which the VPU can run, called "Kernels". You can also define other functions in your source code and "call" them from a kernel. However, because the VPU itself has no call stack, you cannot actually call functions at runtime. Instead, the DCL compiler inlines all function calls, possibly many levels deep, by in effect copying the AST of a function's body into the place where it is called from. For this reason, function calls can never form a loop: recursion, direct or indirect, cannot be inlined.
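
That restriction is straightforward to enforce with a depth-first walk over the call graph. Here is a sketch (the graph representation is an assumption, not the actual compiler source):

    def check_no_call_cycles(call_graph):
        """call_graph maps a function name to the names it calls.
        Raise if any chain of calls eventually reaches itself."""
        DONE, IN_PROGRESS = "done", "in progress"
        state = {}

        def visit(fn, path):
            if state.get(fn) == DONE:
                return
            if state.get(fn) == IN_PROGRESS:
                raise ValueError("recursive calls: " + " -> ".join(path + [fn]))
            state[fn] = IN_PROGRESS
            for callee in call_graph.get(fn, ()):
                visit(callee, path + [fn])
            state[fn] = DONE

        for fn in call_graph:
            visit(fn, [])

    check_no_call_cycles({"kernel": ["a"], "a": ["b"], "b": []})  # fine
    # check_no_call_cycles({"a": ["b"], "b": ["a"]})              # raises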

There are other validity checks related to the VPU and language restrictions, but the point is made.

AST Optimization

At this stage, we know the program is syntactically valid, the types make sense, and we've done the necessary checks to ensure that this program is something that could logically run on the VPU. A straightforward translation of the current AST into VPU instructions may be quite inefficient, though. So at this point, over a sequence of many passes, the compiler walks over the AST to find and simplify things where possible. As a very basic example, if you have some code like let x = 1 + 2;, the optimizer will see an Addition Node with two compile-time constant children. This can simply be turned into a new compile-time constant, 3 in this case. There are less trivial examples.
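
A sketch of that constant-folding pass over simplified node shapes (illustrative, not the actual compiler source):

    from dataclasses import dataclass

    @dataclass
    class Const:
        value: int

    @dataclass
    class Add:
        left: object
        right: object

    def fold(node):
        """Bottom-up pass: fold the children first, then try this node."""
        if isinstance(node, Add):
            left, right = fold(node.left), fold(node.right)
            if isinstance(left, Const) and isinstance(right, Const):
                return Const(left.value + right.value)  # 1 + 2 becomes 3
            return Add(left, right)
        return node

    # let x = 1 + 2;  -->  let x = 3;
    print(fold(Add(Const(1), Const(2))))  # Const(value=3)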

Lowering AST to SSA

At this point, we have inlined all function calls for a kernel and the entire kernel is one long series of operations on data types, some complex, some mapping exactly to a built-in instruction. The next step is to transform the AST into an intermediate representation ("IR") and express the kernel in SSA form. Instead of a tree structure, the kernel is now a long list of register reads, writes, intermediate values, etc. The DCL compiler has its own IR for representing the different possible pseudo-instructions. These are not yet VPU machine instructions, but we're getting closer. The reason for lowering from AST to SSA is that we can now reason about how and when register assignments occur in the program, and where branches will take place.
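
To illustrate the shape of this transformation, here is a sketch that flattens the Const/Add expression trees from the folding example into SSA-style pseudo-instructions. The op names and %N value naming are illustrative, not DCL's actual IR.

    from dataclasses import dataclass

    @dataclass
    class Const:
        value: int

    @dataclass
    class Add:
        left: object
        right: object

    class SSABuilder:
        """Flatten an expression tree into straight-line pseudo-instructions.
        Every intermediate result gets a fresh name, assigned exactly once."""
        def __init__(self):
            self.ops = []
            self.n = 0

        def fresh(self):
            self.n += 1
            return f"%{self.n}"

        def lower(self, node):
            if isinstance(node, Const):
                name = self.fresh()
                self.ops.append((name, "const", node.value))
                return name
            if isinstance(node, Add):
                a, b = self.lower(node.left), self.lower(node.right)
                name = self.fresh()
                self.ops.append((name, "add", a, b))
                return name
            raise NotImplementedError(type(node))

    b = SSABuilder()
    b.lower(Add(Add(Const(1), Const(2)), Const(3)))
    for op in b.ops:
        print(op)
    # ('%1', 'const', 1)
    # ('%2', 'const', 2)
    # ('%3', 'add', '%1', '%2')
    # ('%4', 'const', 3)
    # ('%5', 'add', '%3', '%4')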

Note that actually assigning local register numbers to instructions in the program has not happened yet. We need to be able to reason about how branches occur in the program (i.e. control flow) before we can do that.

Control Flow Graph and BB Generation

With a means for generating SSA, we simultaneously need to think about the Control Flow Graph (CFG), which lets us understand which groups of instructions "flow" into which other groups of instructions. One of the primary drivers for this is the eventual need to allocate registers. Because we have limited registers and there are multiple branching paths through the program, we have to be able to reason about which branches are mutually exclusive, which results are "live" and overlap others, etc. We will discuss this more in Part 2 of this article.

The CFG itself is composed of Basic Blocks which are sequences of operations that have no branching in the middle. There may be multiple ways to jump into the first instruction of a basic block, and the final instruction of a basic block is typically a branch to another basic block. Within a Basic Block we can more easily reason about variable/register assignment.
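
A minimal sketch of how Basic Blocks and their edges might be modeled (illustrative, not the actual compiler source):

    from dataclasses import dataclass, field

    @dataclass
    class BasicBlock:
        """A straight-line run of ops: entered only at the top, left at the end."""
        name: str
        ops: list = field(default_factory=list)          # no branches in the middle
        successors: list = field(default_factory=list)   # blocks this one can jump to

    # A tiny CFG for: if (cond) { ... } else { ... } followed by a join point.
    entry, then_bb, else_bb, join = (BasicBlock(n) for n in
                                     ("entry", "then", "else", "join"))
    entry.successors = [then_bb, else_bb]  # final op of 'entry' is the branch
    then_bb.successors = [join]
    else_bb.successors = [join]            # two ways to jump into 'join'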

DCL explicitly models SSA IR operations, Basic Blocks, and the Control Flow Graph. SSA operation inputs and results have a "type" which is simply a VecN or BoolN, specifying the number of 16-bit slots occupied by a result, but not where those slots are located inside a register. That placement decision comes later, which allows us to express, for instance, (A+B) and (C+D) where all four variables are vec2 (32 bits each), and later realize that both adds can be packed into a single 64-bit addition operation.

Until Next Time...

This requires a Part 2. At this point we have the CFG and the Basic Blocks defining the kernel's instructions. We still need a few things: