Dragon

Table of Contents

High Level Specification

Primary CPU Specification

GPU Specification

Design Lessons

What is Dragon?

Dragon is a "fantasy console": we imagine that shortly after the PlayStation and the Nintendo 64 came out, Dragon Corp, a new developer, put out a console of its own. Dragon is purposefully intended to be (mostly) technologically feasible, so that it could have existed in the late 1990s, but it borrows ideas and concepts which wouldn't become popular for another 5-10 years, primarily general-purpose programmable GPUs and shaders.

Dragon is designed by me (sh4) and uchinokitsune. The project itself is an exploration of what hardware design involves: the complexity and realities of developing a full stack of hardware design, firmware, software, debug tooling, etc., and how all of these fit together.

System and Project Goals

Introduction to FPGAs

We are never going to be able to afford to produce an ASIC (it's sadly still ridiculously expensive). However, Dragon does run on real physical hardware via Field Programmable Gate Arrays (FPGAs). If you are not familiar with FPGAs, here's a super short introduction to what is inside them.

FPGAs boil down to a giant 2D grid of Cells (they go by many names; another is Programmable Logic Block). What exactly is inside a cell depends on the FPGA manufacturer, but typically there is a Look-Up Table (LUT), some single-bit register(s), and some "hard" adder logic. Anything "hard" ("Hard IP") means the silicon in the FPGA literally has that thing implemented, so it's generally pretty fast. In addition to cells, scattered throughout your FPGA you will typically have some Digital Signal Processor (DSP) elements and "Block RAM" (sometimes called "BRAM(s)"). DSPs serve the really important function of giving you hard multiplier capability, which we obviously need for many things, but especially for doing 3D graphics work. Block RAMs are little islands of hard memory that you can read and write. Amongst this sea of cells, DSPs, Block RAMs, etc. is a huge number of "wires".
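
To make the LUT idea concrete, here is a minimal software sketch (not any vendor's actual cell): a 4-input LUT is just a 16-entry truth table, and the four input bits select which entry drives the output.

```
#include <cstdint>
#include <cstdio>

// A 4-input LUT modeled in software: "init" is the 16-entry truth table and the
// four input bits pick which entry drives the output. The init value 0x8000
// happens to implement a 4-input AND (only index 15 holds a 1).
static bool lut4(uint16_t init, bool a, bool b, bool c, bool d) {
    unsigned index = (d << 3) | (c << 2) | (b << 1) | (a << 0);
    return (init >> index) & 1;
}

int main() {
    printf("AND(1,1,1,1) = %d\n", lut4(0x8000, true, true, true, true));  // 1
    printf("AND(1,0,1,1) = %d\n", lut4(0x8000, true, false, true, true)); // 0
}
```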

Now, something to realize is that whether a LUT has 4 inputs or 6, the size and access modes of the Block RAMs, the number of cells, the amount of Block RAM, and so on -- all of these particulars depend both on the manufacturer and on the specific FPGA part. Even the naming of these things varies by manufacturer. You can get smaller ($) and larger ($$$) FPGAs, and the larger ones will usually have more cells, more DSPs, and more Block RAM.

Now, what makes an FPGA programmable is that when it is powered on, it reads its 'configuration' from some other chip and takes on the design described by that configuration. That means every look-up table, every DSP configuration, etc. is loaded from a file. Because every cell can be loaded with arbitrary contents and the wiring lets you connect cells together in nearly arbitrarily complex ways, FPGAs are capable of turning a hardware design into a physical thing.

The flow for an FPGA developer is something like:

    Design: A developer implements a logic design in some Hardware Description Language (HDL) such as Verilog, SystemVerilog, VHDL, Chisel, etc. This is a language that can describe the logic without referring (too much) to how exactly this is implemented in a particular "technology".
    Synthesis: You use a "synthesis" tool which takes your design, written in whatever language, and transforms it into a set of cells, DSPs, memories, etc. that can actually work on your target FPGA platform. Note that this is essentially a list of the resources a design uses and how they need to be 'connected', but says nothing about how they're physically arranged in the FPGA's 'grid' of resources.
    PNR: You now use a "Place and Route" ("PNR") tool which takes that description of resources and their connections, takes a description of an actual FPGA and where its resources are physically located on the chip, and tries to work out how to hook up those real resources in such a way that it matches what the synthesis tool produced.
    Clock Rates: The PNR tool will hopefully find some solution, but depending on the solution it finds, the wires and arrangement of cells will have some physical delay which limits how fast signals can travel through the design, and that effectively determines the maximum clock rate your design can run at (there is a small worked sketch of this right after this list). If that rate is less than what you need or expected, you get to spend a lot of time figuring out where the bottleneck is so you can speed the design up.
    Debugging: Much like in software, it is easy to create bugs in FPGA hardware designs, and they can sometimes be extremely hard to debug, so there is a debug-and-fix "loop" just like in software development. I will note that serious hardware development leans heavily on formal verification methods, which can literally prove that certain error cases are impossible in your design.
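
As a tiny illustration of the clock-rate point above (the critical-path number here is made up for illustration, not a figure from our design):

```
#include <cstdio>

int main() {
    // Hypothetical numbers: if PNR reports a worst-case (critical) path of
    // 8.5 ns through logic and routing, the fastest clock that design can
    // close timing at is roughly 1 / 8.5 ns. If you needed 125 MHz, you now
    // get to go find and shorten that path.
    double critical_path_ns = 8.5;                  // assumed, for illustration
    double fmax_mhz = 1000.0 / critical_path_ns;    // ns -> MHz
    printf("fmax ~= %.1f MHz\n", fmax_mhz);         // ~117.6 MHz
}
```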

There is so much more to FPGAs, but hopefully this tells you enough to understand the later parts of this document where we reference how/why we made certain design decisions.

Target Hardware Platforms

Today the design runs on, and 'targets', two different FPGA parts:

    ECP5-85F - This FPGA is our "smaller" variant. This part from Lattice Semiconductor has 85 thousand cells. The ECP5 family uses 4-input LUT cells. In reality this part will likely become more like a "reduced" version, with software potentially running at a reduced resolution or reduced framerate.
    Artix 7 A200T - This is our "larger" variant. This part from Xilinx has roughly 200 thousand cells, and these are 6-input LUTs.

Developer Notes
From a system developer perspective, the synthesis+PNR workflows for these two parts are different. There are also some slight differences in how the memories and DSPs behave on the two parts, which means we must be careful about how we structure a single hardware design so that it makes good use of the available hardware on both.

System Components

Dragon is made up of several key components. Each of these is mentioned here and described in more detail later in the document.

Primary CPU
The Primary CPU is a RISCV-32IM core. We selected this core because it is reasonably simple to implement and reason about, it is fairly easy to reach a good clock rate, the RISCV standard intentionally allows a lot of customization (which we have leveraged), and, perhaps most importantly, it means we can make use of existing developer toolchains for compiling high-level languages into working firmware and software. Originally we experimented with the picorv32 HDL core, but we have since migrated to our own design. On Dragon, the Primary CPU nominally executes at 100 MHz.

System Memory
System memory is the single pool of memory shared between the CPU, GPU, Display Controller, and Audio system. The actual amount of memory depends on the target FPGA development board the system is being built for, but as a minimum configuration we expect at least 32MB with SDRAM-like access semantics. Because all of these devices may contend for access at once, there is a priority mechanism in CXB (see below) to control access. Note that for our ECP5 boards we have implemented our own SDRAM controller with the proper refresh timing and so on, but the A200T target uses much more complex DDR3, so that memory controller is left as a later exercise once the rest of the project is working.

Display Controller
We target GDPI (which is amazingly/suspiciously similar to HDMI without the branding/licensing requirements) for display output. The Display Controller is responsible for feeding that display output with a proper signal so that a framebuffer sitting in system memory actually shows up correctly on a display.
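
Because the framebuffer lives in shared system memory, scanout alone is a steady drain on memory bandwidth. A rough sketch with assumed numbers (640x480, 16 bits per pixel, 60 Hz; these are not the documented Dragon display mode):

```
#include <cstdio>

int main() {
    // Assumed mode, purely for illustration: 640x480, 16 bits per pixel, 60 Hz.
    const double width = 640, height = 480, bytes_per_pixel = 2, refresh_hz = 60;
    double scanout_bytes_per_sec = width * height * bytes_per_pixel * refresh_hz;
    // ~35 MB/s of shared SDRAM bandwidth spent before the CPU or GPU touch a byte.
    printf("scanout: %.1f MB/s\n", scanout_bytes_per_sec / (1024.0 * 1024.0));
}
```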

GPU: Dragon Control Unit (DCU)
The Dragon GPU is composed of two "sides": a control side and a work side. The Dragon Control Unit (DCU) is a second core, identical to the Primary CPU. After some setup by the Primary CPU, the DCU is commanded to asynchronously execute a "Control Program" which operates the rest of the GPU, which only the DCU can talk to. The Control Program is responsible for looking at the compute and rendering work that has been enqueued by the Primary CPU and ensuring the work side is kept busy operating on those tasks.
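
To give a feel for what that means, here is a rough, hypothetical C++ sketch of the shape of such a loop; the names and the queue are placeholders, not the real firmware interface:

```
#include <cstdio>
#include <queue>

// Hypothetical stand-ins: a std::queue plays the role of whatever shared-memory
// queue the Primary CPU enqueues work into, and vpu_kick() plays the role of
// handing a task to the work side (which only the DCU can talk to).
struct WorkItem { int id; };                 // e.g. a draw call or compute dispatch

static std::queue<WorkItem> work_queue;

static void vpu_kick(const WorkItem& item) {
    printf("dispatching work item %d to the work side\n", item.id);
}

static void control_program_main() {
    // Real firmware would poll forever; here we just drain what was enqueued.
    while (!work_queue.empty()) {
        WorkItem item = work_queue.front();
        work_queue.pop();
        vpu_kick(item);                      // keep the work side busy
    }
}

int main() {
    for (int i = 0; i < 3; ++i) work_queue.push({i});  // "Primary CPU" enqueues work
    control_program_main();                            // DCU runs the Control Program
}
```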

GPU: Vector Processing Units
TODO

CXB Fabric
Between the various components of Dragon we need to route requests from one device to another. This is handled by a single I/O request/routing system, somewhat akin to a Network-on-Chip (NoC), called the Cheshire Bus (CXB).

Performance Targets

TODO

RISCV Core

TODO

Instruction and Data Caches

TODO

Dragon GPU Frontend

TODO

Dragon Control Unit

TODO

Vector Processing Units

TODO

Dragon VPU: Core Architecture

TODO

Dragon VPU: Instruction Set

TODO

Design Lessons: High-Level Emulation and Throughput Planning

"Throughput and Latency are Everything"
The most important lesson of all: almost every design decision in this project has, in the end, come down to data latency and throughput. As an example, "How many triangles can we render at 60 FPS?" becomes a sequence of questions which are ultimately bottlenecked either by FPGA resources or by external I/O. Below is an example of how that question gets answered. We've performed exercises like this many, many times, because almost every question of "will this be fast enough" or "how much geometry can we render" comes down to bottlenecks in the physical hardware.
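
Here is the flavor of such a back-of-envelope calculation. Every number below is an assumption chosen for illustration (bus width, clock, efficiency, per-triangle costs), not a measured Dragon figure:

```
#include <cstdio>

int main() {
    // All numbers below are assumptions for illustration, not Dragon's real specs.
    // Assume a 16-bit SDRAM bus at 100 MHz with ~75% usable efficiency after
    // refresh, bank management, and arbitration overhead.
    double usable_bw = 2.0 * 100e6 * 0.75;               // ~150 MB/s

    // Scanout of an assumed 640x480, 16bpp, 60 Hz framebuffer is a fixed cost.
    double scanout_bw = 640.0 * 480.0 * 2.0 * 60.0;      // ~36.9 MB/s

    // Say each triangle reads 3 vertices x 16 bytes of attributes and writes
    // ~100 covered pixels at 2 bytes each (no overdraw, no texturing, no Z).
    double bytes_per_triangle = 3 * 16 + 100 * 2;        // 248 bytes

    double per_frame_budget = (usable_bw - scanout_bw) / 60.0;
    printf("memory-bound triangle budget: ~%.0f triangles/frame\n",
           per_frame_budget / bytes_per_triangle);       // ~7600 at these numbers
}
```

Change any one of those assumptions (overdraw, texturing, a wider or dedicated memory bus) and the answer moves dramatically, which is exactly why we ended up repeating exercises like this constantly.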

This will sound obvious, but after doing many calculations like this it becomes really clear that chip and interconnect choices were key decisions that architects of systems of this era had to make. The above calculation also alludes to a very obvious reason why GPUs commonly had dedicated VRAM (avoiding contention with the CPU, RAM port sizes optimized for the common I/O size, along with several other reasons).

"Quickly figuring out what is possible"
When we began designing Dragon, before diving into SystemVerilog HDL, we wanted some sense of whether the rasterization, memory accesses, VPU instructions, etc. could be done fast enough to do interesting rendering on FPGAs like the ones we have. We created a quick C++-based high-level emulator which would simulate performing various aspects of the system's work and output a Chrome Profiler trace JSON that could easily be loaded into Chrome or Perfetto for visualization [1]. This exercise was critical for convincing ourselves that we had enough ALU and other operations to hide memory accesses, for finding out how important binning/tile-based rendering was, and so on.
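
For reference, the Chrome trace event format is just JSON. A minimal sketch of emitting it (not our emulator's actual code, and the event names here are made up) looks like this:

```
#include <cstdio>

// Emit "complete" events ("ph":"X", timestamps/durations in microseconds) in the
// Chrome trace event format so the file opens in chrome://tracing or Perfetto.
static void emit_event(FILE* f, const char* name, long ts_us, long dur_us,
                       int tid, bool first) {
    fprintf(f, "%s{\"name\":\"%s\",\"ph\":\"X\",\"ts\":%ld,\"dur\":%ld,"
               "\"pid\":1,\"tid\":%d}",
            first ? "" : ",\n", name, ts_us, dur_us, tid);
}

int main() {
    FILE* f = fopen("trace.json", "w");
    if (!f) return 1;
    fprintf(f, "{\"traceEvents\":[\n");
    // Hypothetical timeline: a memory burst overlapping with VPU ALU work.
    emit_event(f, "sdram_burst_read", /*ts*/ 0,  /*dur*/ 40,  /*tid*/ 0, true);
    emit_event(f, "vpu_alu_block",    /*ts*/ 10, /*dur*/ 120, /*tid*/ 1, false);
    fprintf(f, "\n]}\n");
    fclose(f);
}
```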

Design Lessons: FPGA Technology

    Leverage simulation, fuzzing, and formal verification wherever possible. TODO (verilator testing)

Design Lessons: The Dragon GPU

TODO