Note: This document is copied from work-in-progress Dragon documentation. It is subject to change.

Summary

Dragon Compute Language (DCL) is a language for writing compute/shader programs for execution on the Dragon VPU cores. This guide covers the syntax and semantics of the language. The intended audience is Dragon VPU programmers writing graphics or compute programs to run on the VPU.

GPU Architecture Summary

What follows is a brief discussion of the most important components for people programming the VPU. For a complete discussion of the content in this section, see the "Dragon GPU Architecture" document. While much of the material below is at the instruction or "assembly" level of the VPU, some of these details may be informative even as you write in the higher-level DCL.

The Dragon GPU is composed of a Control Unit (DCU) and one or more Vector Processors (VPUs). The DCU and VPUs of a GPU share a single 64-bit endpoint on the external bus. The DCU is a small RISCV processor which is responsible for communicating with remote processors and scheduling and managing work on the VPUs. The DCU is responsible for DMA'ing VPU program binaries from external memory into the VPU before launching thread groups. The DCU may launch threads in 1D or 2D arrangements.

A VPU is a barrel processor which may have a maximum resident set of 64 "thread groups", of which up to 20 may be "active" at any moment. Each thread group is composed of 4 threads. The VPU issues an instruction for one thread group, then on the following cycle issues an instruction for the next thread group, and so on. Twenty cycles later, the first thread group issues its next instruction, so every instruction has an effective latency of 20 cycles. If a thread group is blocked when it would otherwise issue its next instruction, a thread group which is not currently active but is ready is swapped in at zero cost. To achieve good performance, it is therefore important that there are enough threads to hide delays such as memory access latency.

Within the VPU ISA, each thread has access to 16 64-bit 'local' registers and 32 'global' registers. In general, every instruction may read from up to two input registers and write back to a single local register. On each input register, 16-bit components may be freely swizzled, copied, or zeroed; see the 'Swizzling' section below. Register writeback may also be masked on a 16-bit basis. The DCU is capable of writing to VPU global registers but not reading them. In DCL, this means that uniform variables are set by the DCU and the VPU merely reads these values.

The GPU has a single cache shared by up to 4 VPUs (the maximum configuration in Dragon V1). The cache is composed of 64B cache lines in a two-way set-associative configuration. The VPU may optionally configure any subset of the first half (i.e. first way) of the cache to be a coherent R/W storage between the DCU and VPUs. This enables things like VPU/DCU communication, fast atomic operations, and results and calculations that can span multiple threads' lifetimes.

All memory writes are posted, and all read operations require initiating a load into a register followed by a second instruction to await the read completion. This allows the compiler to potentially hide some load latency with compute work. DCL abstracts both of these with the built-in load_ext and store_ext functions.

The VPU has no notion of a stack, and so there are no function calls. DCL supports 'functions' in a limited sense by inlining all (possibly nested) function calls. Importantly this means function calls within a DCL/VPU program must form a DAG and have no cycles, i.e. no recursion.

Language basics

Comments may be defined using //, after which all remaining content on that line is ignored:

// This is a comment

Whitespace is generally ignored throughout the language.

Built-in Types

DCL has built-in scalar types (bool, i16, i32, u32, f16) and vector types (bool4, vec2, vec3, vec4, vec4i); these appear throughout the examples in this guide.

The constant for true in DCL is 1 and false is 0.

A bool4 may be assigned an integer value; the bottom four bits determine which components of the bool4 are set true or false, with the most significant of those bits mapping to x and the least significant to w. A constant 0b1010 assigned to a bool4 will set the x and z components to true and the y and w components to false.
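
A short illustration, assuming a bool4 can be declared and initialized like the other built-in types:

// The most significant of the four bits maps to x, the least significant to w
let flags : bool4 = 0b0011; // flags == (false, false, true, true)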

For all built-in types, component-wise arithmetic +, -, *, and / is available. Note that all division/inversion operations are serialized: on a single VPU, a division instruction acting on the 4 components of a vec4 will serialize and require 4 clock cycles. Matrix-matrix and matrix-vector multiplication are also supported, using the binary infix @ operator.
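
A brief sketch of component-wise arithmetic (using the declaration and initializer syntax introduced in the sections below):

let a = vec4{ 1.0, 2.0, 3.0, 4.0 };
let b = vec4{ 4.0, 3.0, 2.0, 1.0 };

let sum  = a + b;  // (5, 5, 5, 5)
let prod = a * b;  // (4, 6, 6, 4)
let quot = a / b;  // component-wise division; serialized, one component per cycle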

Declaring variables

Variables may be declared using the var keyword:

// Declare a mutable (i.e. changeable) variable of type i16
var x : i16;
// Declare a mutable variable whose type is inferred from the right-hand side
var y = x + 1;

If your intention is for a variable to be immutable (i.e. non-modifiable), use the let keyword instead:

let x : f16 = 123.0;
let y = x + 1.0;

// The below would result in an error. `let` declares immutable storage
// y = 2;

Integers may be defined in decimal, hexadecimal, or binary. Any number of _ characters may be placed in the middle of a constant for readability as desired.

// All three of these declare the same numeric value
let x = 32;
let y = 0x0000_0020;
let z = 0b10_0000;

Vector types may be declared and initialized using a special initializer list syntax:

let position = vec2{ 1.0, 2.0 + x };

Swizzling

All built-in vector types may be 'swizzled'. After the variable name of a vector, a period and then up to 4 of the x, y, z, w component selectors may be specified in order to reorder or duplicate elements of one vector into another. This does not modify the original vector. In addition, any slot may be _ to place a zero into that component.

Swizzling leverages the built-in hardware capability of re-ordering vector registers when input to various instructions, so these swizzling operations are zero-cost.

Examples:

let v2: vec2 = vec2 {1.0, 2.0};
let v4: vec4 = vec4 {3.0, 4.0, 5.0, 6.0};

// Reverse components
let z1: vec2 = v2.yx; // (2, 1)

// Produce a vector which is the same as the original but x-component is zero
let z2: vec2 = v2._y; // (0, 2)

// Create arbitrary selections of a larger vector
let z3: vec2 = v4.xw; // (3, 6)

// Components may be duplicated to copy into multiple output positions.
let z4: vec4 = v4.xxzz; // (3, 3, 5, 5)

// You can duplicate elements of a small vector to create a larger vector
// up to a maximum size of vec4
let z5: vec4 = v2.x_yy; // (1, 0, 2, 2)

// Interleaving two vectors
let z6: vec4 = v2.xxxx + v4.zxwy; // (1+5, 1+3, 1+6, 1+4)

Functions

Functions are defined as in the following examples:

fn my_func(arg1: i16, arg2: i16) -> i16 {
    return arg1 + arg2;
}

fn double_tm0() {
    var x = tm_read(0, vec4);
    x *= 2.0;
    tm_write(0, x);
}

The argument types must be provided. If the return type -> T is not provided, then the function does not return a value.

Return statements are required to be at the outermost scope of a function definition. This means there can be no conditional logic which returns. Consequently, the set of threads which are active at the beginning of a function will be the same set of active threads on return.

Named Arguments

When defining a function, it is also possible to define arguments with default values.

fn atan2(x:f16, y:f16, .radians:bool=0) -> f16 { ... }

Because named arguments have default values, it is not necessary to provide them when calling the function. If you do want to provide a value (using the example declaration above):

let radians = atan2(1.2, 2.4, .radians=1);

When a named argument is a bool and you want to pass 1 as its value, the following equivalent form, called "flag notation", may be used:

let radians = atan2(1.2, 2.4, .radians);

Memory Loads and Stores

VPUs may load from external memory via load_ext by passing a valid address. Stores work similarly via store_ext, which takes the value to store as its second argument.

Load examples:

let address : u32 = 0x0100_0000;

// A 16-bit load (the address must still be 64-bit aligned)
let X = load_ext(f16, address);

// Load a 16-bit value from within the 64-bit aligned word;
// the .y swizzle selects the second 16-bit lane
let Y = load_ext(f16, address).y;

// Load 32-bit or 64-bit values
let Z = load_ext(vec2, address);
let W = load_ext(vec4, address);

struct BigStruct {
    A : vec4;
    B : vec4;
    C : vec4;
};

// Loads 3x 64-bit values and populates the entire
// contents of the struct. This will become multiple
// load operations when lowered to the ISA.
let S = load_ext(BigStruct, address);

Store examples:

let X : vec4 = ...;

// Store X to memory. Size is inferred from X
store_ext(address, X);

// Same as above, but only store x, y, and w components.
// The position in memory containing the z component
// will be left as it was before the store executed.
store_ext(address, X, .mask = 0b1101);

Tile Memory Access

All VPU thread groups are launched at a logical position within a 32x32 'tile'. Each position in the tile stores a 64-bit value (such as a vec4). The VPU program is able to read and write four tile memories, called TM0-TM3. A 'local' read/write to tile memory uses the thread's logical position within the tile (determined/provided when the thread group was launched) as the location in the tile memory to access. A 'non-local' read/write allows the thread to provide an arbitrary location in tile memory, at the cost of additional access latency.

Because the tile memory index is compiled into the VPU program instructions, the tile memory index must be specified at compile-time and may not be provided by a variable.

// Read a vec4 value from TM0 at the same logical location as the thread.
// The first argument is the tile memory number, and the second argument is used to 
// specify a type for how the returned data should be interpreted.
let x = tm_read(0, vec4);

// Perform some calculation on the vector
let doubled = x * 2.0;

// Store the result into the same position in TM1
tm_write(1, doubled);

More specifically:

// - Write to TM[tm_index] the value 'value' and only write components indicated in 'mask'.
// - If 'location' named argument is provided, then this is a non-local write to the indicated
//   position within the tile.
fn tm_write(tm_index: constant, value:T, .location:i16, .mask:bool4=0b1111) -> void;

// - Read 64 bits of data from TM[tm_index]. 
// - The second argument determines the interpretation of the data returned
// - If a location is provided, then this is a non-local read from the indicated position
//   within the tile.
fn tm_read(tm_index: constant, T:typename, .location:i16) -> T;
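
A hedged sketch of non-local access using these signatures; how a tile position is encoded into the i16 location argument (for example as a linear index) is an assumption not specified here:

// Non-local read from TM2 at an explicit tile location
let other = tm_read(2, vec4, .location = 40);

// Non-local write to TM3 at the same location, updating only the x and y components
tm_write(3, other, .location = 40, .mask = 0b1100);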

Structure Definitions

struct Particle {
    position : vec2;
    velocity : vec2;
};
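
A short usage sketch, assuming struct fields are accessed with the same '.' member syntax shown for packs below:

var p : Particle;
p.position = vec2{ 0.0, 0.0 };
p.velocity = vec2{ 1.0, -1.0 };

// Advance the particle by one step (component-wise add)
p.position = p.position + p.velocity;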

Packs

Aside from structs, it is also possible to define a 'pack'. The purpose of this language feature is to provide meaningful names for the components of a vector.

// Define a named pack, the same size as a vec4, but with distinct component names.
// Note that a type is also assigned to the pack so the entire pack can be used as
// that 64-bit type
pack Rectangle : vec4 { f16 xmin; f16 xmax; f16 ymin; f16 ymax; };

// May be initialized from any 64-bit-wide value.
var rect : Rectangle = vec4{ ... };
let v0 : vec4 = vec4{ ... };

// Because Rectangle pack is defined using 'Rectangle : vec4', it may participate
// in operations requiring a vec4.
let v1 : vec4 = rect + v0;

Some constraints on packs:

Packs may also be declared within a struct. In this case, the components of the pack are accessible directly from the struct, as is the name of the pack, which refers to the entire 64-bit vector.

struct MyStruct {
    // Stores the bounds of a rectangle
    pack rect : vec4 { f16 xmin; f16 xmax; f16 ymin; f16 ymax; };
};

var ss : MyStruct;

// The entire rectangle can be accessed ...
ss.rect = vec4{ 1.0, 2.0, 3.0, 4.0 };

// ... or single components by the pack element names.
ss.xmin = 7.0;
ss.xmax = -2.0;

Uniform Variables

Uniform variables are written by the DCU and are readable, but not writable, by a VPU program. This allows configurability at runtime.

uniform settings : vec4i;
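
A minimal sketch of reading a uniform from a VPU program; referring to the uniform by name inside a function, and the particular computation, are assumptions for illustration:

fn apply_offset() {
    // Read this thread's value from TM0, interpreted as a vec4i
    let x = tm_read(0, vec4i);
    // Offset by the DCU-provided settings and write back (component-wise add)
    tm_write(0, x + settings);
}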

Built-in Utility Functions

thread_position()

Returns a vec4i of details regarding the thread's location.

This maps to a single VPU instruction.
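
A hedged example; the meaning of the individual components is not covered in this excerpt, and the use of the x and y components as a 2D launch position below is only an assumption:

// Fetch details about this thread's location
let pos = thread_position();

// Assumption: x and y carry the thread's 2D position
let xy = pos.xy;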

cast(T, value) (Semantic Cast)

Semantically cast value to the indicated type. If value is a multi-component vector type, then each component of value will be cast. The output must fit into 64 bits; for example, it is NOT possible to cast 3 or 4 components to i32. A cast operation from T -> T is a no-op.

This maps to a single VPU instruction.
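
A few hedged examples; the rounding behavior of float-to-integer casts, and the assumption that vec4i is the integer counterpart of vec4, are not specified in this excerpt:

// Per-component conversion from a float vector to an integer vector
let f : vec4 = vec4{ 1.0, 2.0, 3.0, 4.0 };
let i = cast(vec4i, f);

// Scalar conversion from integer to float
let n : i16 = 7;
let g = cast(f16, n); // 7.0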

as(T, value) (Reinterpret Cast)

Similar to cast, except the data is merely reinterpreted as the indicated type; no conversion takes place. The total bit width of value must be the same as the width of T.

This is a zero-cost operation, simply allowing for proper type tracking in DCL.
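
A short sketch, assuming vec4 and vec4i are both 64 bits wide:

// Reinterpret the raw bits of a vec4 as integers; no numeric conversion occurs
let raw : vec4 = vec4{ 1.0, 2.0, 3.0, 4.0 };
let bits = as(vec4i, raw);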

Built-in Math Functions

cross(vec3, vec3) -> vec3

Computes a right-handed cross-product. Specifically the result is

cross(A,B) == vec3 { 
    (A.y * B.z) - (A.z * B.y),
    (A.z * B.x) - (A.x * B.z),
    (A.x * B.y) - (A.y * B.x)
}
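
For example, applying the formula above to the unit X and Y vectors (using the initializer-list syntax from earlier):

// Right-handed: X cross Y = Z
let n = cross(vec3{ 1.0, 0.0, 0.0 }, vec3{ 0.0, 1.0, 0.0 }); // (0, 0, 1)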

dot(vecN, vecN) -> f16

Multiplies corresponding components of the two vectors and sums the results to produce a single f16.
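
For example:

// (1 * 3) + (2 * 4) = 11
let d = dot(vec2{ 1.0, 2.0 }, vec2{ 3.0, 4.0 }); // 11.0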

inverse(f16) -> f16

Returns an approximation of 1.0 / x for input x. The same operation is also available via the normal f16 / f16 division operator.
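
For example:

// Approximately 0.25
let r = inverse(4.0);

// Equivalent use of the division operator
let s = 1.0 / 4.0;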

Example Programs

Example:
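
The following is a minimal sketch of a complete program that scales each thread's tile-memory value by a DCU-provided uniform. The entry-point convention is not covered in this excerpt, so the plain fn named main below is an assumption:

// Scale factor written by the DCU at runtime
uniform scale : vec4;

// Assumed entry point
fn main() {
    // Read this thread's vec4 from TM0 (local access)
    let x = tm_read(0, vec4);

    // Component-wise multiply by the uniform scale
    let scaled = x * scale;

    // Write the result to the same logical position in TM1
    tm_write(1, scaled);
}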

Appendix: Undefined Behaviors (UB)