Dragon
Table of Contents
High Level Specification
Primary CPU Specification
GPU Specification
Design Lessons
What is Dragon?
Dragon is a "fantasy console", where we imagine that shortly after the Playstation and the Nintendo 64 both came out, Dragon Corp, a new developer, put out a new console. Dragon is purposefully intended to be technologically feasible (mostly) so that it could have existed in the late 1990's, but borrows ideas and concepts which wouldn't become popular for another 5-10 years, primarily general purpose programmable GPUs and shaders.
Dragon is designed by myself (sh4) and uchinokitsune. The project itself is an exploration of what hardware design involves: the complexity and realities of developing a full stack of hardware design, firmware, software, debug tooling, etc., and how all of these fit together.
System and Project Goals
Introduction to FPGAs
We are never going to be able to afford producing an ASIC (it's sadly still ridiculously expensive). However, Dragon does run on real physical hardware via Field Programmable Gate Arrays (FPGAs). If you are not familiar with FPGAs, here's a super short introduction to what is inside them.
FPGAs boil down to a giant 2D grid of Cells (they go by many names; another is Programmable Logic Block). What exactly is inside a cell depends on the FPGA manufacturer, but typically there is a Look-Up Table (LUT), some single-bit register(s), and some "hard" adder logic. Anything "hard" ("Hard IP") means the silicon in the FPGA literally implements that function, so it's generally pretty fast. In addition to cells, scattered throughout your FPGA you will typically find some Digital Signal Processor (DSP) elements and "Block RAM" (sometimes called "BRAM"). DSPs serve the really important function of providing hard multiplier capabilities, which we obviously need for many things, but especially for doing 3D graphics work. Block RAM is like little islands of hard memory that you can read and write. Amongst this sea of cells, DSPs, Block RAM, etc. is a huge number of "wires".
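A useful mental model is that a 4-input LUT is nothing more than a 16-entry truth table: the four input bits select one of 16 stored configuration bits. Here is a minimal C++ sketch of that model (the Lut4 type is ours for illustration, not any vendor's actual cell):

```cpp
#include <cstdint>
#include <cstdio>

// Mental model of a 4-input LUT: the 4 input bits form an index into a
// 16-bit truth table that was filled in at configuration time.
struct Lut4 {
    uint16_t config;  // one output bit per possible input combination
    bool eval(bool a, bool b, bool c, bool d) const {
        unsigned idx = (a << 0) | (b << 1) | (c << 2) | (d << 3);
        return (config >> idx) & 1;
    }
};

int main() {
    // Configure the LUT as a 4-input AND: only index 0b1111 outputs 1.
    Lut4 and4{1u << 15};
    printf("%d %d\n", and4.eval(true, true, true, true),
                      and4.eval(true, false, true, true));  // prints: 1 0
    return 0;
}
```

Reloading `config` with a different 16-bit value turns the same physical cell into any other 4-input boolean function, which is the whole trick.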
Now, something to realize is that all of these particulars -- whether a LUT has 4 inputs or 6, the size and access modes of the Block RAMs, the number of cells, the amount of Block RAM, etc. -- depend both on the manufacturer and on the specific FPGA part. Even the naming of these things varies by manufacturer. You can get smaller ($) and larger ($$$) FPGAs, where the larger ones will usually have more DSPs, more cells, and more Block RAM.
Now, what makes an FPGA programmable is that when it powers on, it reads its 'configuration' from some other chip and takes on the design that was loaded: every look-up table, every DSP configuration, etc. is loaded from a file. Because every cell can be loaded with arbitrary data, and the wiring lets you connect cells together in nearly arbitrarily complex ways, FPGAs are capable of turning a hardware design into a physical thing.
The flow for an FPGA developer is something like: write the hardware design in an HDL (SystemVerilog, in our case), simulate it to verify behavior, synthesize it into a netlist, place-and-route (PNR) that netlist onto the specific FPGA part's resources, produce a bitstream, and finally load that bitstream onto the FPGA as its configuration.
There is so much more to FPGAs, but hopefully this tells you enough to understand the later parts of this document where we reference how/why we made certain design decisions.
Target Hardware Platforms
Today the design runs on and 'targets' two different FPGA parts: a Lattice ECP5 and a Xilinx Artix-7 200T (A200T).
Developer Notes
From a system developer perspective, the synth+PNR workflows for these two parts are different. There are also some slight differences in how memories and DSPs work between the two, which means we must be careful that a single hardware design still makes good use of each part's available hardware.
System Components
Dragon is made up of several key components. Each of these is mentioned here and described in more detail later in the document.
Primary CPU
The Primary CPU is a RISCV-32IM (RV32IM) core. We selected this core because it is reasonably simple to implement and reason about, it is fairly easy to reach a good clock rate with, the RISCV standard makes it intentionally very customizable (which we have leveraged), and, perhaps most importantly, we could make use of existing developer toolchains for compiling high-level languages into working firmware and software. Originally we experimented with the picorv32 HDL core for this design but have since migrated to our own. On Dragon, the Primary CPU nominally executes at 100 MHz.
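Part of what keeps an RV32I-family core simple is that the instruction field layout is fixed by the spec. As a quick C++ sketch (illustrative, not our actual HDL), extracting the common fields of a 32-bit instruction looks like this:

```cpp
#include <cstdint>
#include <cstdio>

// Field extraction for a 32-bit RISC-V instruction (RV32I base layout).
struct RvFields {
    uint32_t opcode, rd, funct3, rs1, rs2, funct7;
};

RvFields decode(uint32_t insn) {
    return RvFields{
        insn & 0x7f,           // opcode  bits [6:0]
        (insn >> 7) & 0x1f,    // rd      bits [11:7]
        (insn >> 12) & 0x7,    // funct3  bits [14:12]
        (insn >> 15) & 0x1f,   // rs1     bits [19:15]
        (insn >> 20) & 0x1f,   // rs2     bits [24:20]
        (insn >> 25) & 0x7f,   // funct7  bits [31:25]
    };
}

int main() {
    // "add x3, x1, x2" encodes as 0x002081b3.
    RvFields f = decode(0x002081b3);
    printf("opcode=0x%02x rd=%u rs1=%u rs2=%u\n", f.opcode, f.rd, f.rs1, f.rs2);
    return 0;
}
```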
System Memory
System memory is the single shared pool of memory used by the CPU, GPU, Display Controller, and Audio system. The actual amount of memory depends on the target FPGA development board for which the system is built, but as a minimum configuration we expect at least 32MB with SDRAM-like access semantics. Because all of these devices may contend for access at once, there is a priority mechanism in CXB (see below) to control access. Note that for our ECP5 boards we have implemented our own SDRAM controller with proper refresh timing etc., but for the A200T target, which uses much more complex DDR3, that memory controller is deferred as a later exercise until the rest of the project is working.
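As a hedged illustration of what such a priority mechanism can look like (the requester ordering below is an assumption for the example, not the actual CXB policy), a fixed-priority arbiter can be modeled like this:

```cpp
#include <array>
#include <cstdio>

// Fixed-priority arbiter model: grant the highest-priority requester.
// The ordering below is illustrative only; display scanout is assumed
// highest priority because a missed deadline there produces visible
// artifacts rather than just a stall.
enum Requester { kDisplay, kAudio, kCpu, kGpu, kNumRequesters };

int arbitrate(const std::array<bool, kNumRequesters>& req) {
    for (int i = 0; i < kNumRequesters; ++i)
        if (req[i]) return i;
    return -1;  // no requests this cycle
}

int main() {
    std::array<bool, kNumRequesters> req = {false, false, true, true};
    printf("granted requester %d\n", arbitrate(req));  // CPU wins over GPU
    return 0;
}
```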
Display Controller
We target GDPI (which is amazingly/suspiciously similar to HDMI, minus the branding/licensing requirements) for display output. The Display Controller is responsible for feeding that output with a properly timed signal so that a framebuffer sitting in system memory actually shows up correctly on a display.
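In broad strokes, scanout walks the framebuffer in raster order, fetching each pixel from system memory. A small sketch of the address arithmetic, with a hypothetical base/stride/bytes-per-pixel configuration (none of these values are specified by the actual hardware):

```cpp
#include <cstdint>
#include <cstdio>

// Linear framebuffer address generation for scanout. The parameters here
// are hypothetical, purely to show the arithmetic.
struct Framebuffer {
    uint32_t base;    // byte address of pixel (0, 0) in system memory
    uint32_t stride;  // bytes per scanline
    uint32_t bpp;     // bytes per pixel

    uint32_t pixel_addr(uint32_t x, uint32_t y) const {
        return base + y * stride + x * bpp;
    }
};

int main() {
    Framebuffer fb{0x100000, 640, 2};  // e.g. 320 pixels * 2 bytes/pixel
    printf("addr of (10, 3) = 0x%x\n", fb.pixel_addr(10, 3));
    return 0;
}
```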
GPU: Dragon Control Unit (DCU)
The Dragon GPU is composed of two "sides": a control side and a work side. The Dragon Control Unit (DCU) is a second core identical to the Primary CPU. After some setup by the Primary CPU, it is commanded to asynchronously execute a "Control Program" which operates the rest of the GPU, which only the DCU can talk to. The Control Program is responsible for looking at the compute and rendering work enqueued by the Primary CPU and keeping the work side busy operating on those tasks.
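Conceptually, a Control Program is a dispatch loop over enqueued work. The sketch below is hypothetical: the Task layout and the idle/dispatch interface are invented for illustration, since the real GPU-side registers are not described here.

```cpp
#include <cstdint>
#include <cstdio>
#include <queue>

// Hypothetical shape of a DCU Control Program main loop.
struct Task { uint32_t kind; uint32_t args_addr; };

std::queue<Task> work_queue;           // filled by the Primary CPU
bool vpu_idle() { return true; }       // stand-in for polling VPU status
void dispatch_to_vpu(const Task& t) {  // stand-in for kicking a VPU
    printf("dispatch kind=%u args=0x%x\n", t.kind, t.args_addr);
}

int main() {
    work_queue.push({/*kind=*/1, /*args_addr=*/0x2000});
    while (!work_queue.empty()) {
        if (vpu_idle()) {              // keep the work side busy
            dispatch_to_vpu(work_queue.front());
            work_queue.pop();
        }
    }
    return 0;
}
```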
GPU: Vector Processing Units
TODO
CXB Fabric
Between the various components of Dragon we need to route requests from one device to another. There is a single I/O request/routing system, somewhat akin to a Network-on-Chip (NoC), called the Cheshire Bus (CXB).
Performance Targets
TODO
RISCV Core
TODO
Instruction and Data Caches
TODO
Dragon GPU Frontend
TODO
Dragon Control Unit
TODO
Vector Processing Units
TODO
Dragon VPU: Core Architecture
TODO
Dragon VPU: Instruction Set
TODO
Design Lessons: High-Level Emulation and Throughput Planning
"Throughput and Latency are Everything"
The most important lesson of all: almost every design decision in this project has ultimately come down to data latency and throughput. As an example, "How many triangles can we render at 60 FPS?" becomes a sequence of questions which are ultimately bottlenecked either by FPGA resources or by external I/O. Below is an example of how that question gets answered. We have performed exercises like this many, many times, because almost all questions of "will this be fast enough" or "how much geometry can we render" come down to bottlenecks in the physical hardware.
Suppose each triangle takes 24 bytes of vertex data plus 8 bytes of index data, and we want 1000 triangles at 60 FPS. That is (24 vertex + 8 index bytes) * (1000 triangles) * (60 FPS) = 1,920,000 bytes/sec ≈ 1.83 MiB/sec, which is not bad at all from a system bandwidth perspective. Given that the VPU will be doing these operations, we also consider the total latency. A single VPU has a 64-bit port over the CXB to system memory, so (32 * 1000 * 60) / 8 = 240,000 accesses will be required, which equates to approximately 2.4 milliseconds at 100 MHz, again assuming constant back-to-back access is possible (it will not be). A single VPU has this kind of small "lightspeed" for data in and out, which is another pressure for there to be multiple VPUs on a single GPU when possible.

Writing out rendered tiles is a similar exercise. A 32x32 tile with 5 writes per pixel is 32 * 32 * 5 = 5120 writes to a tile memory. Assuming we can issue these writes perfectly back-to-back and the VPU core is running at 100 MHz, that's 10 nanoseconds per clock * (32 * 32 * 5 writes per tile) * (80 total tiles) = 4.096 milliseconds. If we're able to 'fit' 2 VPUs on the target FPGA, the effective number of tiles handled by any one VPU is halved, so the total time becomes 2.048 milliseconds.

This will sound obvious, but after doing many calculations like this it becomes really clear that chip and interconnect decisions for systems of this era were key decision points that architects had to make. The above calculation also alludes to a very obvious reason why GPUs commonly had dedicated VRAM (avoiding contention with the CPU, RAM port widths optimized for the common I/O size, along with several other reasons).
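For reference, here is a small C++ program that reproduces the arithmetic above. The constants are the worked example's inputs, not hardware measurements; treat it as a planning sketch, not a cycle-accurate model.

```cpp
#include <cstdio>

int main() {
    const double kClockHz    = 100e6;          // VPU / CXB clock
    const double kNsPerClock = 1e9 / kClockHz; // 10 ns at 100 MHz

    // Geometry upload: 24B vertex + 8B index per triangle, 1000 tris, 60 FPS.
    const double bytes_per_sec = (24.0 + 8.0) * 1000.0 * 60.0;
    printf("geometry bandwidth: %.2f MiB/s\n",
           bytes_per_sec / (1024.0 * 1024.0));

    // A single VPU moves data over a 64-bit (8-byte) CXB port.
    const double accesses = bytes_per_sec / 8.0;
    printf("geometry access time: %.2f ms (assuming back-to-back access)\n",
           accesses * kNsPerClock / 1e6);

    // Tile writeback: 32x32 tile, 5 writes per pixel, 80 tiles total,
    // split evenly across however many VPUs we can fit.
    for (int vpus = 1; vpus <= 2; ++vpus) {
        const double writes = 32.0 * 32.0 * 5.0 * (80.0 / vpus);
        printf("tile writeback with %d VPU(s): %.3f ms\n",
               vpus, writes * kNsPerClock / 1e6);
    }
    return 0;
}
```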
"Quickly figuring out what is possible"
When we began designing Dragon, before diving into SystemVerilog HDL, we wanted some sense of whether the rasterization, memory access, VPU instructions, etc. could be done fast enough to do interesting rendering on FPGAs like the ones we have. We created a quick C++-based high-level emulator which simulated various aspects of the system and output a Chrome Profiler trace JSON that could be easily loaded into Chrome or Perfetto for visualization [1]. This exercise was critical for convincing ourselves that we had enough ALU and other operations to hide memory accesses, to find out how important binning/tile-based rendering was, etc.
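The Chrome trace format itself is just a JSON array of event objects. A minimal sketch of emitting "complete" (ph = "X") events, similar in spirit to what such an emulator does (the event names here are made up):

```cpp
#include <cstdio>

// Emit one Chrome-trace "complete" event: a named span with a start
// timestamp and duration in microseconds. A file of these, wrapped in a
// JSON array, loads directly into chrome://tracing or Perfetto.
void emit_event(FILE* out, const char* name, long ts_us, long dur_us,
                int pid, int tid, bool last) {
    fprintf(out,
            "  {\"name\": \"%s\", \"ph\": \"X\", \"ts\": %ld, "
            "\"dur\": %ld, \"pid\": %d, \"tid\": %d}%s\n",
            name, ts_us, dur_us, pid, tid, last ? "" : ",");
}

int main() {
    FILE* out = fopen("trace.json", "w");
    if (!out) return 1;
    fprintf(out, "[\n");
    emit_event(out, "rasterize_tile", 0, 150, /*pid=*/1, /*tid=*/0, false);
    emit_event(out, "tile_writeback", 150, 40, /*pid=*/1, /*tid=*/0, true);
    fprintf(out, "]\n");
    fclose(out);
    return 0;
}
```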
Design Lessons: FPGA Technology
Design Lessons: The Dragon GPU
TODO