Dragon
Table of Contents
High Level Specification
Primary CPU Specification
GPU Specification
Design Lessons
What is Dragon?
Dragon is a "fantasy console", where we imagine that shortly after the PlayStation and the Nintendo 64 both came out, Dragon Corp, a new developer, put out a new console. Dragon is purposefully intended to be technologically feasible (mostly) so that it could have existed in the late 1990s, but borrows ideas and concepts which wouldn't become popular for another 5-10 years, primarily general-purpose programmable GPUs and shaders. For context on what graphics and gaming consoles in the '90s were like, I suggest you look at this 3DFX Oral History and Rodrigo Copetti's console architecture reviews.
Dragon is designed by me (sh4) and uchinokitsune. The project itself is an exploration of what hardware design involves: the complexity and realities of developing a full stack of hardware design, firmware, software, debug tooling, etc., and how all of these fit together.
System and Project Goals
Introduction to FPGAs
We are never going to be able to afford to produce an ASIC (it's sadly still ridiculously expensive). However, Dragon does run on real physical hardware via Field Programmable Gate Arrays (FPGAs). If you are not familiar with FPGAs, here's a super short introduction to what is inside them.
FPGAs boil down to a giant 2D grid of Cells (they go by many names; another is Programmable Logic Block). What exactly is inside a cell depends on the FPGA manufacturer, but typically there is a Look-up Table (LUT), some single-bit register(s), and some "hard" adder logic. Anything "hard" ("Hard IP") means that the silicon in the FPGA literally has that thing implemented, so it's generally pretty fast. In addition to cells, scattered throughout your FPGA you will typically have some Digital Signal Processor (DSP) elements and "Block RAM" (sometimes called "BRAM(s)"). DSPs serve the really important function of providing hard multiplier capabilities, which we obviously need for many things, but especially for doing 3D graphics work. Block RAM is like little islands of hard memory that you can read and write. Amongst this sea of cells, DSPs, Block RAM, etc. is a huge number of "wires".
Now, something to realize is that whether a LUT has 4 inputs or 6, the size and access modes of the Block RAMs, the number of cells, the total amount of Block RAM, etc. -- all of these particulars depend both on the manufacturer and the specific FPGA part. Even the naming of these things varies by manufacturer. You can get smaller ($) and larger ($$$) FPGAs, where the larger ones will usually have more DSPs, more cells, and more Block RAM.
What makes an FPGA programmable is that when it is powered on, it reads its 'configuration' from some other chip and takes on the design described by that configuration. That means every look-up table, every DSP configuration, etc. is loaded from a file. Because every cell can be loaded with arbitrary data and the wiring allows you to connect cells together in nearly arbitrarily complex ways, FPGAs are capable of turning a hardware design into a physical thing.
The flow for an FPGA developer is something like: write your design in an HDL, simulate it to check its behavior, synthesize it into the part's primitives, place-and-route it onto the specific FPGA, generate a bitstream, and load that bitstream onto the device.
There is so much more to FPGAs, but hopefully this tells you enough to understand the later parts of this document where we reference how/why we made certain design decisions.
Target Hardware Platforms
Today the design runs on and 'targets' two different FPGA parts: the Lattice ECP5 and the Xilinx Artix-7 (A200T).
Developer Notes
From a system developer perspective, the synth+PNR workflows for these two parts are different. There are also some slight differences in how memories and DSPs work between the two, which means we must be careful about how we expose a single hardware design that makes good use of the available hardware on each.
System Components
Dragon is made up of several key components. Each of these is mentioned here and described in more detail later in the document.
Primary CPU
The Primary CPU is a RISCV-32IM core. We selected this core because it is reasonably simple to implement and reason about, it is fairly easy to reach a good clock rate, the RISCV standard is intentionally very customizable (which we have leveraged), and, perhaps most importantly, it means we can use existing developer toolchains for compiling high-level languages into working firmware and software. Originally we experimented with the picorv32 HDL core but have since migrated to our own design. On Dragon, the Primary CPU nominally executes at 100 MHz.
System Memory
System memory represents the single shared pool of memory which is shared between the CPU, GPU, Display Controller, and Audio system. The actual amount of memory depends on the target FPGA development board the system is being built for, but as a minimum configuration we expect at least 32MB with SDRAM-like access semantics. Because all of these devices may contend for access at once, there is a priority mechanism in CXB (see below) to control access. Note that in the case of our ECP5 boards, we have implemented our own SDRAM controller with the proper refresh timing, etc., but for the A200T target, which utilizes the much more complex DDR3, this memory controller is left as a later exercise once the rest of the project is working.
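A priority mechanism like CXB's can be sketched as a fixed-priority arbiter: among all devices requesting the memory in a given cycle, grant the highest-priority one. The requester ordering below is purely illustrative, not Dragon's actual priorities:

```cpp
#include <cstdint>

// Hypothetical fixed-priority arbiter: lower index = higher priority.
// Illustrative ordering: 0=Display, 1=Audio, 2=GPU, 3=CPU.
int arbitrate(uint8_t request_mask) {
    for (int i = 0; i < 4; ++i)
        if (request_mask & (1u << i))
            return i;   // grant the highest-priority pending requester
    return -1;          // no requests pending
}
```

Real-time consumers like display scanout typically need the highest priority, since missing their deadline produces visible glitches, while the CPU can tolerate extra stall cycles.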
Display Controller
We target GDPI (which is amazingly/suspiciously similar to HDMI without the branding/licensing requirements) for display output. The Display Controller is responsible for feeding that display output with a proper signal so that a framebuffer sitting in system memory will actually appear correctly on a display.
GPU: Dragon Control Unit (DCU)
The Dragon GPU is composed of two "sides", a control side and a work side. The Dragon Control Unit (DCU) is a second core identical to the Primary CPU; after some setup from the Primary CPU, it is commanded to asynchronously execute a "Control Program" which operates the rest of the GPU, which only the DCU can talk to. The Control Program is responsible for looking at the compute and rendering work which has been enqueued by the Primary CPU and ensuring the work side is kept busy operating on those tasks.
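The Control Program's job can be sketched roughly as the loop below. All names, the task structure, and the slot count are illustrative, not the real Dragon API:

```cpp
#include <cstdint>
#include <queue>

// Hypothetical sketch of a DCU control program's dispatch step: drain work
// enqueued by the Primary CPU and keep the work side busy.
struct Task { uint32_t kind; uint32_t payload; };

std::queue<Task> work_queue;   // in reality, filled by the Primary CPU
int free_slots = 4;            // assumed number of schedulable work units

// Hand off as many queued tasks as there are free work-side slots.
int dispatch_pending() {
    int dispatched = 0;
    while (free_slots > 0 && !work_queue.empty()) {
        work_queue.pop();      // hand the task to a work-side unit
        --free_slots;
        ++dispatched;
    }
    return dispatched;
}
```

The real Control Program would also poll for completed work to reclaim slots, but the core responsibility is the same: never let the work side sit idle while tasks are queued.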
GPU: Vector Processing Units
The primary work of the Dragon GPU is performed by the Vector Processing Units (VPUs).
TODO
CXB Fabric
Between the various components of Dragon we need to route requests from one device to another. There is one I/O request/routing system, somewhat akin to a Network-on-Chip (NoC), called the Cheshire Bus (CXB).
Performance Targets
TODO
CPU: RISCV Core
TODO
CPU: Instruction and Data Caches
TODO
Dragon GPU Frontend
TODO
Dragon Control Unit (DCU)
TODO
Vector Processing Units (VPU)
The VPU is a custom processor which performs all graphics and compute work within the GPU. It effectively manages state for 64 "waves", each wave containing 4 threads, and each thread operating on 64-bit registers commonly representing 4-component vectors of 16-bit floats or integers. To work around the extremely limited memory within the FPGA, a limited number of BRAMs are used by only loading/storing the register state associated with a single thread at a time.
A wave refers to 4 threads, a single instruction which all active threads in the wave execute together, and a mask of whether each thread within the wave is active or inactive. Some kinds of operations may 'deactivate' a wave, bringing in one of the other pending waves which is not currently in the barrel but is ready to execute work.
Because the GPU only needs simultaneous read/write access to a few sets of registers at any moment, the VPU is always "cycling through" 20 waves in round-robin fashion. This means that when an instruction is issued on a wave, the next instruction for that wave will not take place for another 20 cycles. For this reason, it is important to note that the VPU is focused primarily on throughput rather than latency; tasks which can saturate the GPU with a lot of compute and keep a large number of threads working will enjoy the best performance. Any work which serializes against other threads or experiences divergence within a wave will have a significant impact on performance.
Dragon VPU: Core Architecture
The VPU is a custom architecture. A single VPU has 64 total waves within an internal scheduler. At any moment, 20 waves are "active"; on each clock cycle, the VPU issues a single instruction for the current wave, and the next wave is handled in the next cycle. This round-robin over 20 waves is the Barrel Processor paradigm. Each wave refers to 4 threads, and each thread performs 4-way SIMD operations, with each instruction being 16 bits. Waves are further divided into even/odd; even and odd waves have separate register banks and scheduling queues. Because the issue latency is 20 cycles, there can be 20 waves active at any time (10 even / 10 odd), and a total of up to 64 waves (32 even / 32 odd) either active or queued at any point in time. Queued waves can be resumed/paused with 0 latency, and waves are dynamically descheduled if they run a blocking operation (e.g. a memory read) to avoid loss of throughput. Each thread has 16 local registers, with each register being a 4x16b vector.
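A minimal sketch of the issue pattern this implies (the slot count comes from the spec above; everything else is illustrative):

```cpp
#include <cstdint>

// With 20 active wave slots issuing round-robin, the slot that issues on
// cycle c is simply c mod 20, so any given wave sees its next instruction
// exactly 20 cycles after its previous one.
constexpr int kActiveSlots = 20;

int issuing_slot(uint64_t cycle) {
    return static_cast<int>(cycle % kActiveSlots);
}

// Cycles between consecutive issues for the same wave.
uint64_t reissue_gap() { return kActiveSlots; }
```

This is the key trade of a barrel processor: per-wave latency is fixed at 20 cycles regardless of hazards, which lets the pipeline skip forwarding/interlock logic entirely as long as enough waves are resident to fill every slot.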
VPU thread state has no stack, and so there is no concept of dynamic function calls. A VPU thread which has any type of divergence (conditional branching, looping, etc.) must sacrifice a single register to track which threads within the wave are active when entering some conditional logic and which threads are active after leaving that scope.
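The register-sacrifice pattern can be sketched as follows, with the execution mask modeled as a 4-bit value (one bit per thread in the wave; all names illustrative):

```cpp
#include <cstdint>

// The wave's current execution mask: bit i = thread i active.
uint8_t exec_mask = 0b1111;

// On entering a conditional region, save the current mask (this is the
// "sacrificed register") and restrict execution to threads whose
// condition was true.
uint8_t enter_if(uint8_t cond_mask) {
    uint8_t saved = exec_mask;   // spilled to the sacrificed register
    exec_mask &= cond_mask;      // only threads taking the branch stay on
    return saved;
}

// On leaving the region, restore the pre-branch mask.
void leave_if(uint8_t saved) {
    exec_mask = saved;
}
```

Nested conditionals would each need their own saved mask, which is why divergence-heavy code eats into the small register budget.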
With respect to memory I/O, a VPU may access SysMem via a single 64b port/interface. Memory access includes support for atomics; if all 4 threads want to perform an atomic memory operation, it takes 4 instruction "slots" to issue all of them, i.e. the atomic accesses for all threads within a wave which wish to use atomics will serialize. VPU writes are asynchronous through an internal store queue. Reads are also asynchronous: a read is issued in one instruction, but the programmer must separately block on completion of the read. This enables a programmer/compiler to potentially hide read latency with some compute task.
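A toy model of the split issue/wait read makes the latency-hiding opportunity concrete. The latency value here is assumed for illustration, not Dragon's real figure:

```cpp
#include <cstdint>

// Toy model: issuing a read returns immediately, and the data becomes
// available after a fixed latency. Waiting before then stalls the wave
// for the remaining cycles; waiting after then is free.
constexpr uint64_t kReadLatency = 12;   // illustrative, not the real value

struct PendingRead { uint64_t ready_at; };

PendingRead issue_read(uint64_t now) { return {now + kReadLatency}; }

// Number of cycles the wave stalls if it blocks on the read at 'now'.
uint64_t wait_stall(const PendingRead& r, uint64_t now) {
    return now >= r.ready_at ? 0 : r.ready_at - now;
}
```

If the compiler can schedule at least `kReadLatency` cycles of independent ALU work between issue and wait, the read costs effectively nothing.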
VPU threads each have 16 total local registers and XXX global registers. Local registers may be read from or written to via normal instructions while global registers may only be read from. The DCU may write to global registers. The global registers are commonly used to store constants and "uniforms" commanded from the CPU, though there is no distinction in hardware.
Dragon VPU: Look-Up Table
In order to support palettized color and other functions which would otherwise consume a lot of memory bandwidth or be wasteful to store in the limited cache, the VPU has a Look-up Table (LUT), not to be confused with the FPGA LUTs. The VPU LUT functionality allows a single byte to index into a 256-entry table of 64-bit vectors. This table can be programmed via DMA from the DCU. Potential uses include palette expansion of indexed color and table-driven approximation of other functions.
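A sketch of the palette use case, with illustrative table contents (the entry here stands in for a 4x16b color vector):

```cpp
#include <cstdint>

// The VPU LUT as a palette: one byte indexes a 256-entry table of 64-bit
// values, e.g. a 4x16b RGBA color per entry. In hardware the DCU fills
// this table via DMA; here we just model the lookup itself.
uint64_t lut[256];

uint64_t lut_lookup(uint8_t index) {
    return lut[index];
}
```

The bandwidth win is that a framebuffer or texture can be stored as 1 byte per texel in SysMem, with the expansion to a full 64-bit vector happening inside the VPU for free.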
Dragon VPU: Tile Memory
In order to facilitate rendering and compute beyond local registers, four separate "Tile Memories" (TMs), called TM0-TM3, are available. Each TM is a 32x32 region of 64-bit values, each value the same size as a local register. The four threads of a wave may access a 2x2 region of a TM simultaneously so long as all four accesses remain within the same region. If threads issue reads/writes spanning multiple TM regions, those accesses will be serialized.
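Under one possible interpretation of the constraint above (this is an assumption for illustration, not the documented banking scheme), the TM is banked in aligned 2x2 regions, so a wave's 2x2 access is conflict-free only when its top-left coordinate is even in both axes:

```cpp
// Assumed interpretation: the 32x32 tile memory is banked in aligned 2x2
// regions. A 2x2 access with top-left corner (x, y) completes without
// serialization only when it lands on an even boundary in both axes and
// fits within the 32x32 tile.
bool single_cycle_2x2(int x, int y) {
    bool in_bounds = x >= 0 && y >= 0 && x + 1 < 32 && y + 1 < 32;
    return in_bounds && (x % 2 == 0) && (y % 2 == 0);
}
```

Whatever the exact hardware scheme, the practical takeaway is the same: shaders should arrange their TM accesses on the natural 2x2 grid to avoid serialized accesses.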
It is intended that a TM might store depth information, rendered color data, and the like in the case of graphics rendering. Similarly, if implementing a deferred rendering binner, the TMs may be used to store triangle index data or similar other information.
Dragon VPU: Memory Access
TODO: Cache Behavior
Dragon VPU: FP16 Format
The VPU utilizes a custom floating-point format internally called FP16. Similar in design to IEEE 754 variants, FP16 in the VPU has no concept of infinities, NaNs, underflow or overflow, and no exception signaling. These edge cases do not serve the general usage of the VPU for graphics work, and the additional logic complexity was deemed not worth it. VPU registers are commonly interpreted as four FP16 values called x, y, z, and w.
Dragon VPU: Instruction Set
TODO
Design Lessons: High-Level Emulation/Simulation and Throughput Planning
"Throughput and Latency are Everything"
The most important lesson of all: almost every design decision in this project has in the end come down to data latency and throughput. As an example, "How many triangles can we render at 60 FPS?" becomes a sequence of questions which are ultimately bottlenecked either by FPGA resources or external I/O. We've performed exercises like this many, many times because almost all questions of "will this be fast enough" or "how much geometry can we render" come down to bottlenecks in the physical hardware.
Questions about throughput usually fall into two categories: ones you can quickly reason about and "ball park", and ones that are too complex to reason about. For the latter (systems with a lot of dynamic behavior, or cache hierarchies where the cache setup and access patterns have a substantial impact on performance), it's crucial that you get to a simulation of some fidelity (see below).
Chip selection and interconnect decisions for systems of this era were clearly key decision points for architects. Napkin math like this also alludes to a very obvious reason why GPUs commonly had dedicated VRAM (avoiding contention with the CPU, RAM port sizes optimized for the common I/O size, along with several other reasons).
"Quickly figuring out what is possible"
When we began designing Dragon, before diving into SystemVerilog HDL, we wanted some sense of whether the rasterization, memory access, VPU instructions, etc. could be done fast enough to do interesting rendering on FPGAs like the ones we have. We created a quick C++-based high-level emulator which would simulate various aspects of the system and output a Chrome Profiler trace JSON which could be easily loaded into Chrome or Perfetto for visualization. This exercise was critical for convincing ourselves that we had enough ALU and other operations to hide memory accesses, to find out how important binning/tile-based rendering was, etc.
Simulators, Emulators, and tooling/progress trade-offs
Given the above, there is a spectrum of choices on how to determine what your final design is capable of.
On one extreme, you could say "Well, let's just implement the design and see how it does". Writing HDL, debugging its errors, and the develop-flash-debug loop itself can all be quite slow. At the same time, if you get to a working solution, you may be done, which is great. So there can be little wasted time if the implementation is fairly straightforward and you're already positive that a component is needed.
On the other extreme, you can do napkin math and approximate many things. This has the advantage of being very quick, but is often not applicable. If you have multiple cache layers, multiple systems contending at once, dynamic and branching code, etc., it becomes increasingly difficult to come to any conclusion with any kind of certainty.
During the development of Dragon we have used several "middle" solutions. We have simulations of our actual Verilog code via Verilator. This is great because even when I'm on a flight I can continue answering questions about the real behavior of the logic, delays for memory accesses, etc. At the same time, HDL is time-consuming to reason about and write. We have also written "High-Level Emulators" which abstract every component into something that vaguely performs some function every cycle, can block on other components, etc. This can be very useful, but it is clearly something that doesn't directly contribute to the "end product", so we have to be careful how we invest our time into these kinds of things. Lastly, I want to point out that we've written a more detailed "C++ simulation" of the GPU by itself. This has helped us understand the impact of various memory access latencies and the importance of certain GPU programs being very fast, and it has influenced the creation of new instructions (and the realization that we didn't need others). The C++ GPU simulator is also something that we can ostensibly hand to others in the future who want to build something for our GPU (write shaders, etc.) but don't have access to hardware, or use for automated correctness testing, automated performance regression testing, etc.
All of these things certainly have value, but you need to be honest with yourself about the time investment and how you think it will pay off in the grand scheme of the project.
Design Lessons: FPGA Technology
Design Lessons: The Dragon GPU
Rasterizer Hardware or Software?
TODO
Why a Barrel Processor?
TODO