# draw

2D rendering library built on SDL3 GPU, providing a unified shape-drawing and text-rendering API with
Clay UI integration.

## Current state

The renderer uses a single unified `Pipeline_2D_Base` (`TRIANGLELIST` pipeline) with two submission
modes dispatched by a push constant:

- **Mode 0 (Tessellated):** Vertex buffer contains real geometry. Used for text (indexed draws into
  SDL_ttf atlas textures), axis-aligned sharp-corner rectangles (already optimal as 2 triangles),
  per-vertex color gradients (`rectangle_gradient`, `circle_gradient`), angular-clipped circle
  sectors (`circle_sector`), and arbitrary user geometry (`triangle`, `triangle_fan`,
  `triangle_strip`). The fragment shader computes `out = color * texture(tex, uv)`.

- **Mode 1 (SDF):** A static 6-vertex unit-quad buffer is drawn instanced, with per-primitive
  `Primitive` structs uploaded each frame to a GPU storage buffer. The vertex shader reads
  `primitives[gl_InstanceIndex]` and computes world-space position from unit quad corners + primitive
  bounds. The fragment shader dispatches on `Shape_Kind` to evaluate the correct signed distance
  function analytically.

Seven SDF shape kinds are implemented:

1. **RRect** — rounded rectangle with per-corner radii (iq's `sdRoundedBox`)
2. **Circle** — filled or stroked circle
3. **Ellipse** — exact signed-distance ellipse (iq's iterative `sdEllipse`)
4. **Segment** — capsule-style line segment with rounded caps
5. **Ring_Arc** — annular ring with angular clipping for arcs
6. **NGon** — regular polygon with arbitrary side count and rotation
7. **Polyline** — decomposed into independent `Segment` primitives per adjacent point pair

All SDF shapes support fill and stroke modes via `Shape_Flags`, and produce mathematically exact
curves with analytical anti-aliasing via `smoothstep` — no tessellation, no piecewise-linear
approximation. A rounded rectangle is 1 primitive (64 bytes) instead of ~250 vertices (~5000 bytes).

MSAA is opt-in (default `._1`, no MSAA) via `Init_Options.msaa_samples`. SDF rendering does not
benefit from MSAA because fragment coverage is computed analytically. MSAA remains useful for text
glyph edges and tessellated user geometry if desired.

## 2D rendering pipeline plan

This section documents the planned architecture for levlib's 2D rendering system. The design is driven
by three goals: **draw quality** (mathematically exact curves with perfect anti-aliasing), **efficiency**
(minimal vertex bandwidth, high GPU occupancy, low draw-call count), and **extensibility** (new
primitives and effects can be added to the library without architectural changes).

### Overview: three pipelines

The 2D renderer will use three GPU pipelines, split by **register pressure compatibility** and
**render-state requirements**:

1. **Main pipeline** — shapes (SDF and tessellated) and text. Low register footprint (~18–22
   registers per thread). Runs at high GPU occupancy. Handles 90%+ of all fragments in a typical
   frame.

2. **Effects pipeline** — drop shadows, inner shadows, outer glow, and similar ALU-bound blur
   effects. Medium register footprint (~48–60 registers). Each effects primitive includes the base
   shape's SDF so that it can draw both the effect and the shape in a single fragment pass, avoiding
   redundant overdraw.

3. **Backdrop-effects pipeline** — frosted glass, refraction, and any effect that samples the current
   render target as input. High register footprint (~70–80 registers) and structurally requires a
   `CopyGPUTextureToTexture` from the render target before drawing. Separated both for register
   pressure and because the texture-copy requirement forces a render-pass-level state change.

A typical UI frame with no effects uses 1 pipeline bind and 0 switches. A frame with drop shadows
uses 2 pipelines and 1 switch. A frame with shadows and frosted glass uses all 3 pipelines and 2
switches plus 1 texture copy. At ~5μs per pipeline bind on modern APIs, worst-case switching overhead
is under 0.15% of an 8.3ms (120 FPS) frame budget.

### Why three pipelines, not one or seven

The natural question is whether we should use a single unified pipeline (fewer state changes, simpler
code) or many per-primitive-type pipelines (no branching overhead, lean per-shader register usage).

The dominant cost factor is **GPU register pressure**, not pipeline switching overhead or fragment
shader branching. A GPU shader core has a fixed register pool shared among all concurrent threads. The
compiler allocates registers pessimistically based on the worst-case path through the shader. If the
shader contains both a 20-register RRect SDF and a 72-register frosted-glass blur, _every_ fragment
— even trivial RRects — is allocated 72 registers. This directly reduces **occupancy** (the number of
warps that can run simultaneously), which reduces the GPU's ability to hide memory latency.

Concrete example on a modern NVIDIA SM with 65,536 registers:

| Register allocation       | Max concurrent threads | Occupancy |
| ------------------------- | ---------------------- | --------- |
| 20 regs (RRect only)      | 3,276                  | ~100%     |
| 48 regs (+ drop shadow)   | 1,365                  | ~42%      |
| 72 regs (+ frosted glass) | 910                    | ~28%      |

For a 4K frame (3840×2160) at 1.5× overdraw (~12.4M fragments), running all fragments at 28%
occupancy instead of 100% roughly triples fragment shading time. At 4K this is severe: if the main
pipeline's fragment work at full occupancy takes ~2ms, a single unified shader containing the glass
branch would push it to ~6ms — consuming 72% of the 8.3ms budget available at 120 FPS and leaving
almost nothing for CPU work, uploads, and presentation. This is a per-frame multiplier, not a
per-primitive cost — it applies even when the heavy branch is never taken.
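
The table and the fragment-count figure can be reproduced with a few lines of arithmetic; the
65,536-register pool is the assumed SM parameter from the table above:

```python
REGISTER_POOL = 65_536  # registers per SM (modern NVIDIA, per the table above)

# Register-limited concurrent thread count for each shader variant.
for label, regs in [("20 regs (RRect only)", 20),
                    ("48 regs (+ drop shadow)", 48),
                    ("72 regs (+ frosted glass)", 72)]:
    threads = REGISTER_POOL // regs
    print(f"{label:26s} -> {threads:5d} concurrent threads")

# Fragment count for a 4K frame at 1.5x overdraw.
fragments = int(3840 * 2160 * 1.5)
print(f"4K @ 1.5x overdraw: {fragments / 1e6:.1f}M fragments")  # 12.4M
```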

The three-pipeline split groups primitives by register footprint so that:

- Main pipeline (~20 regs): 90%+ of fragments run at near-full occupancy.
- Effects pipeline (~55 regs): shadow/glow fragments run at moderate occupancy; unavoidable given the
  blur math complexity.
- Backdrop-effects pipeline (~75 regs): glass fragments run at low occupancy; also unavoidable, and
  structurally separated anyway by the texture-copy requirement.

This avoids the register-pressure tax of a single unified shader while keeping pipeline count minimal
(3 vs. Zed GPUI's 7). The effects that drag occupancy down are isolated to the fragments that
actually need them.

**Why not per-primitive-type pipelines (GPUI's approach)?** Zed's GPUI uses 7 separate shader pairs:
quad, shadow, underline, monochrome sprite, polychrome sprite, path, surface. This eliminates all
branching and gives each shader minimal register usage. Three concrete costs make this approach wrong
for our use case:

**Draw call count scales with kind variety, not just scissor count.** With a unified pipeline,
one instanced draw call per scissor covers all primitive kinds from a single storage buffer. With
per-kind pipelines, each scissor requires one draw call and one pipeline bind per kind used. For a
typical UI frame with 15 scissors and 3–4 primitive kinds per scissor, per-kind splitting produces
~45–60 draw calls and pipeline binds; our unified approach produces ~15–20 draw calls and 1–5
pipeline binds. At ~5μs each for CPU-side command encoding on modern APIs, per-kind splitting adds
375–500μs of CPU overhead per frame — **4.5–6% of an 8.3ms (120 FPS) budget** — with no
compensating GPU-side benefit, because the register-pressure savings within the simple-SDF tier are
negligible (all members cluster at 12–22 registers).

**Z-order preservation forces the API to expose layers.** With a single pipeline drawing all kinds
from one storage buffer, submission order equals draw order — Clay's painterly render commands flow
through without reordering. With separate pipelines per kind, primitives can only batch with
same-kind neighbors, which means interleaved kinds (e.g., `[rrect, circle, text, rrect, text]`) must
either issue one draw call per primitive (defeating batching entirely) or force the user to pre-sort
by kind and reason about explicit layers. GPUI chose the latter, baking layer semantics into their
API where each layer draws shadows before quads before glyphs. Our design avoids this constraint:
submission order is draw order, no layer juggling required.

**PSO compilation costs multiply.** Each pipeline takes 1–50ms to compile on Metal/Vulkan/D3D12 at
first use. 7 pipelines is ~175ms cold startup; 3 pipelines is ~75ms. Adding state axes (MSAA
variants, blend modes, color formats) multiplies combinatorially — a 2.3× larger variant matrix per
additional axis with 7 pipelines vs 3.

**Branching cost comparison: unified vs per-kind in the effects pipeline.** The effects pipeline is
the strongest candidate for per-kind splitting because effect branches are heavier than shape
branches (~80 instructions for drop shadow vs ~20 for an SDF). Even here, per-kind splitting loses.
Consider a worst-case scissor with 15 drop-shadowed cards and 2 inner-shadowed elements interleaved
in submission order:

- _Unified effects pipeline (our plan):_ 1 pipeline bind, 1 instanced draw call. Category-3
  divergence occurs at drop-shadow/inner-shadow boundaries where ~4 warps straddle per boundary × 2
  boundaries = ~8 divergent warps out of ~19,924 total (0.04%). Each divergent warp pays ~80 extra
  instructions. Total divergence cost: 8 × 32 × 80 / 12G inst/sec ≈ **1.7μs**.

- _Per-kind effects pipelines (GPUI-style):_ 2 pipeline binds + 2 draw calls. But submission order
  is `[drop, drop, inner, drop, drop, inner, drop, ...]` — the two inner-shadow primitives split the
  drop-shadow run into three segments. To preserve Z-order, this requires 5 draw calls and 4 pipeline
  switches, not 2. Cost: 5 × 5μs + 4 × 5μs = **45μs**.

The per-kind approach costs **26× more** than the unified approach's divergence penalty (45μs vs
1.7μs), while eliminating only 0.04% warp divergence that was already negligible. Even in the most
extreme stacked-effects scenario (10 cards each with both drop shadow and inner shadow, producing
~60 boundary warps at ~80 extra instructions each), unified divergence costs ~13μs — still 3.5×
cheaper than the pipeline-switching alternative.
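
Both estimates above are straightforward arithmetic; this sketch reproduces them using the figures
assumed in the text (12G instructions/sec, 5μs per bind or draw call):

```python
INST_PER_SEC = 12e9           # assumed mid-range GPU instruction throughput
BIND_COST = DRAW_COST = 5e-6  # assumed CPU encode cost per bind / draw call

# Unified: ~8 divergent boundary warps x 32 threads x ~80 extra instructions.
unified_divergence = 8 * 32 * 80 / INST_PER_SEC
# Per-kind: preserving Z-order needs 5 draw calls + 4 pipeline switches.
per_kind = 5 * DRAW_COST + 4 * BIND_COST

print(f"unified divergence: {unified_divergence * 1e6:.1f}us")  # 1.7us
print(f"per-kind switching: {per_kind * 1e6:.0f}us")            # 45us
print(f"ratio: {per_kind / unified_divergence:.0f}x")           # 26x

# Extreme stacked-effects case: ~60 boundary warps.
stacked = 60 * 32 * 80 / INST_PER_SEC
print(f"stacked-effects divergence: {stacked * 1e6:.0f}us")     # 13us
```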

The split we _do_ perform (main / effects / backdrop-effects) is motivated by register-pressure tier
boundaries where occupancy differences are catastrophic at 4K (see numbers above). Within a tier,
unified is strictly better by every measure: fewer draw calls, simpler Z-order, lower CPU overhead,
and negligible GPU-side branching cost.

**References:**

- Zed GPUI blog post on their per-primitive pipeline architecture:
  https://zed.dev/blog/videogame
- Zed GPUI Metal shader source (7 shader pairs):
  https://github.com/zed-industries/zed/blob/cb6fc11/crates/gpui/src/platform/mac/shaders.metal
- NVIDIA Nsight Graphics 2024.3 documentation on active-threads-per-warp and divergence analysis:
  https://developer.nvidia.com/blog/optimize-gpu-workloads-for-graphics-applications-with-nvidia-nsight-graphics/

### Why fragment shader branching is safe in this design

There is longstanding folklore that "branches in shaders are bad." This was true on pre-2010 hardware
where shader cores had no branch instructions at all — compilers emitted code for both sides of every
branch and used conditional select to pick the result. On modern GPUs (everything from ~2012 onward),
this is no longer the case. Native dynamic branching is fully supported on all current hardware.
However, branching _can_ still be costly in specific circumstances. Understanding which circumstances
apply to our design — and which do not — is critical to justifying the unified-pipeline approach.

#### How GPU branching works

GPUs execute fragment shaders in **warps** (NVIDIA/Intel, 32 threads) or **wavefronts** (AMD, 32 or
64 threads). All threads in a warp execute the same instruction simultaneously (SIMT model). When a
branch condition evaluates the same way for every thread in a warp, the GPU simply jumps to the taken
path and skips the other — **zero cost**, identical to a CPU branch. This is called a **uniform
branch** or **warp-coherent branch**.

When threads within the same warp disagree on which path to take, the warp must execute both paths
sequentially, masking off threads that don't belong to the active path. This is called **warp
divergence** and it causes the warp to pay the cost of both sides of the branch. In the worst case
(50/50 split), throughput halves for that warp.

There are four categories of branch condition in a fragment shader, ranked by cost:

| Category                            | Condition source                                                  | GPU behavior                                                                                   | Cost                  |
| ----------------------------------- | ----------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- | --------------------- |
| 1. **Compile-time constant**        | `#ifdef`, `const bool`                                            | Dead code eliminated by compiler                                                               | Zero                  |
| 2. **Uniform / push constant**      | Same value for entire draw call                                   | Warp-coherent; GPU skips dead path                                                             | Effectively zero      |
| 3. **Per-primitive `flat` varying** | Same value across all fragments of a primitive                    | Warp-coherent for all warps fully inside one primitive; divergent only at primitive boundaries | Near zero (see below) |
| 4. **Per-fragment varying**         | Different value per pixel (e.g., texture lookup, screen position) | Potentially divergent within every warp                                                        | Can be expensive      |

#### Which category our branches fall into

Our design has two branch points:

1. **`mode` (push constant): tessellated vs. SDF.** This is category 2 — uniform per draw call.
   Every thread in every warp of a draw call sees the same `mode` value. **Zero divergence, zero
   cost.**

2. **`shape_kind` (flat varying from storage buffer): which SDF to evaluate.** This is category 3.
   The `flat` interpolation qualifier ensures that all fragments rasterized from one primitive's quad
   receive the same `shape_kind` value. Divergence can only occur at the **boundary between two
   adjacent primitives of different kinds**, where the rasterizer might pack fragments from both
   primitives into the same warp.

For category 3, the divergence analysis depends on primitive size:

- **Large primitives** (buttons, panels, containers — 50+ pixels on a side): a 200×100 rect
  produces ~20,000 fragments = ~625 warps. At most ~4 boundary warps might straddle a neighbor of a
  different kind. Divergence rate: **0.6%** of warps.

- **Small primitives** (icons, dots — 16×16): 256 fragments = ~8 warps. At most 2 boundary warps
  diverge. Divergence rate: **25%** of warps for that primitive, but the primitive itself covers a
  tiny fraction of the frame's total fragments.

- **Worst realistic case**: a dense grid of alternating shape kinds (e.g., circle-rect-circle-rect
  icons). Even here, the interior warps of each primitive are coherent. Only the edges diverge. Total
  frame-level divergence is typically **1–3%** of all warps.

At 1–3% divergence, the throughput impact is negligible. At 4K with 12.4M total fragments
(~387,000 warps), divergent boundary warps number in the low thousands. Each divergent warp pays at
most ~25 extra instructions (the cost of the longest untaken SDF branch). At ~12G instructions/sec
on a mid-range GPU, that totals ~4μs — under 0.05% of an 8.3ms (120 FPS) frame budget. This is
confirmed by production renderers that use exactly this pattern:

- **vger / vger-rs** (Audulus): single pipeline, 11 primitive kinds dispatched by a `switch` on a
  flat varying `prim_type`. Ships at 120 FPS on iPads. The author (Taylor Holliday) replaced nanovg
  specifically because CPU-side tessellation was the bottleneck, not fragment branching:
  https://github.com/audulus/vger-rs

- **Randy Gaul's 2D renderer**: single pipeline with `shape_type` encoded as a vertex attribute.
  Reports that warp divergence "really hasn't been an issue for any game I've seen so far" because
  "games tend to draw a lot of the same shape type":
  https://randygaul.github.io/graphics/2025/03/04/2D-Rendering-SDF-and-Atlases.html
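
The frame-level estimate above can be checked the same way, using the assumed inputs from the text
(12.4M fragments, 32-thread warps, roughly 2,000 divergent boundary warps paying ~25 extra warp-level
instructions each):

```python
fragments = 12.4e6               # 4K at 1.5x overdraw
warps = fragments / 32
divergent_warps = 2_000          # "low thousands" of boundary warps
extra_inst = 25                  # longest untaken SDF branch
cost = divergent_warps * extra_inst / 12e9

print(f"total warps: {warps:,.0f}")                   # 387,500
print(f"divergence cost: {cost * 1e6:.1f}us")         # 4.2us
print(f"share of 8.3ms budget: {cost / 8.3e-3:.2%}")  # 0.05%
```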

#### What kind of branching IS expensive

For completeness, here are the cases where shader branching genuinely hurts — none of which apply to
our design:

1. **Per-fragment data-dependent branches with high divergence.** Example:
   `if (texture(noise, uv).r > 0.5)` where the noise texture produces a random pattern. Every warp
   has ~50% divergence. Every warp pays for both paths. This is the scenario the "branches are bad"
   folklore warns about. We have no per-fragment data-dependent branches in the main pipeline.

2. **Branches where both paths are very long.** If both sides of a branch are 500+ instructions,
   divergent warps pay the full cost of both long paths. Our SDF functions are 10–25 instructions
   each. Even fully divergent, the penalty is ~25 extra instructions — less than a single texture
   sample's latency.

3. **Branches that prevent compiler optimizations.** Some compilers cannot schedule instructions
   across branch boundaries, reducing VLIW utilization on older architectures. Modern GPUs (NVIDIA
   Volta+, AMD RDNA+, Apple M-series) use scalar+vector execution models where this is not a
   concern.

4. **Register pressure from the union of all branches.** This is the real cost, and it is why we
   split heavy effects (shadows, glass) into separate pipelines. Within the main pipeline, all SDF
   branches have similar register footprints (12–22 registers), so combining them causes negligible
   occupancy loss.

**References:**

- ARM solidpixel blog on branches in mobile shaders — a comprehensive taxonomy of branch execution
  models across GPU generations; confirms uniform and warp-coherent branches are free on modern
  hardware:
  https://solidpixel.github.io/2021/12/09/branches_in_shaders.html
- Peter Stefek's "A Note on Branching Within a Shader" — practical measurements showing that
  warp-coherent branches have zero overhead on Pascal/Volta/Ampere, with a clear explanation of the
  SIMT divergence mechanism:
  https://www.peterstefek.me/shader-branch.html
- NVIDIA Volta architecture whitepaper — documents independent thread scheduling, which allows
  divergent threads to reconverge more efficiently than on older architectures:
  https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
- Randy Gaul on warp divergence in practice with per-primitive `shape_type` branching:
  https://randygaul.github.io/graphics/2025/03/04/2D-Rendering-SDF-and-Atlases.html

### Main pipeline: SDF + tessellated (unified)

The main pipeline serves two submission modes through a single `TRIANGLELIST` pipeline and a single
vertex input layout, distinguished by a push constant:

- **Tessellated mode** (`mode = 0`): direct vertex buffer with explicit geometry. Unchanged from
  today. Used for text (SDL_ttf atlas sampling), polylines, triangle fans/strips, gradient-filled
  shapes, and any user-provided raw vertex geometry.
- **SDF mode** (`mode = 1`): shared unit-quad vertex buffer + GPU storage buffer of `Primitive`
  structs, drawn instanced. Used for all shapes with closed-form signed distance functions.

Both modes converge on the same fragment shader, which dispatches on a `shape_kind` discriminant
carried either in the vertex data (tessellated, always `Solid = 0`) or in the storage-buffer
primitive struct (SDF mode).
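
The dispatch structure can be sketched outside the shader. Python stands in for the GLSL switch
here; the `Shape_Kind` members shown, the parameter layout, and the simplified uniform-radius
`sdRoundedBox` are illustrative, not the actual shader source:

```python
import math
from enum import IntEnum

class Shape_Kind(IntEnum):
    Solid  = 0  # tessellated path: no SDF, coverage comes from texture * color
    RRect  = 1
    Circle = 2

def evaluate_sdf(kind, px, py, params):
    """Fragment-shader dispatch on the flat shape_kind varying (illustrative)."""
    if kind == Shape_Kind.Circle:
        cx, cy, r = params
        return math.hypot(px - cx, py - cy) - r
    if kind == Shape_Kind.RRect:
        # Simplified sdRoundedBox: centered box, half-extents (bx, by), one radius r.
        bx, by, r = params
        qx, qy = abs(px) - bx + r, abs(py) - by + r
        outside = math.hypot(max(qx, 0.0), max(qy, 0.0))
        inside = min(max(qx, qy), 0.0)
        return outside + inside - r
    return -1.0  # Solid: treated as fully inside

# A point at the center of a radius-10 circle is 10 units inside.
print(evaluate_sdf(Shape_Kind.Circle, 0, 0, (0, 0, 10)))  # -10.0
```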

#### Why SDF for shapes

CPU-side adaptive tessellation for curved shapes (the current approach) has three problems:

1. **Vertex bandwidth.** A rounded rectangle with four corner arcs produces ~250 vertices × 20 bytes
   = 5 KB. An SDF rounded rectangle is one `Primitive` struct (~56 bytes) plus 4 shared unit-quad
   vertices. That is roughly a 90× reduction per shape.

2. **Quality.** Tessellated curves are piecewise-linear approximations. At high DPI or under
   animation/zoom, faceting is visible at any practical segment count. SDF evaluation produces
   mathematically exact boundaries with perfect anti-aliasing via `smoothstep` in the fragment
   shader.

3. **Feature cost.** Adding soft edges, outlines, stroke effects, or rounded-cap line segments
   requires extensive per-shape tessellation code. With SDF, these are trivial fragment shader
   operations: `abs(d) - thickness` for stroke, `smoothstep(-soft, soft, d)` for soft edges.
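
The bandwidth claim in point 1 is easy to check (sizes as stated: 20-byte vertices, ~56-byte
primitive struct):

```python
tess_bytes = 250 * 20  # ~250 tessellated vertices x 20-byte Vertex
sdf_bytes = 56         # one Primitive struct; the unit quad is shared across all shapes
print(tess_bytes, sdf_bytes, f"~{tess_bytes / sdf_bytes:.0f}x reduction")
```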

**References:**

- Inigo Quilez's 2D SDF primitive catalog (primary source for all SDF functions used):
  https://iquilezles.org/articles/distfunctions2d/
- Valve's 2007 SIGGRAPH paper on SDF for vector textures and glyphs (foundational reference):
  https://steamcdn-a.akamaihd.net/apps/valve/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf
- Randy Gaul's practical writeup on SDF 2D rendering with shape-type branching, attribute layout,
  warp divergence tradeoffs, and polyline rendering:
  https://randygaul.github.io/graphics/2025/03/04/2D-Rendering-SDF-and-Atlases.html
- Audulus vger-rs — production 2D renderer using a single unified pipeline with SDF type
  discriminant, same architecture as this plan. Replaced nanovg, achieving 120 FPS where nanovg fell
  to 30 FPS due to CPU-side tessellation:
  https://github.com/audulus/vger-rs

#### Storage-buffer instancing for SDF primitives

SDF primitives are submitted via a GPU storage buffer indexed by `gl_InstanceIndex` in the vertex
shader, rather than encoding per-primitive data redundantly in vertex attributes. This follows the
pattern used by both Zed GPUI and vger-rs.

Each SDF shape is described by a single `Primitive` struct (~56 bytes) in the storage buffer. The
vertex shader reads `primitives[gl_InstanceIndex]`, computes the quad corner position from the unit
vertex and the primitive's bounds, and passes shape parameters to the fragment shader via `flat`
interpolated varyings.

Compared to encoding per-primitive data in vertex attributes (the "fat vertex" approach),
storage-buffer instancing eliminates the 4–6× data duplication across quad corners. A rounded
rectangle costs 56 bytes instead of 4 vertices × 40+ bytes = 160+ bytes.

The tessellated path retains the existing direct vertex buffer layout (20 bytes/vertex, no storage
buffer access). The vertex shader branch on `mode` (push constant) is warp-uniform — every invocation
in a draw call has the same mode — so it is effectively free on all modern GPUs.

#### Shape kinds

Primitives in the main pipeline's storage buffer carry a `Shape_Kind` discriminant:

| Kind       | SDF function                           | Notes                                                     |
| ---------- | -------------------------------------- | --------------------------------------------------------- |
| `RRect`    | `sdRoundedBox` (iq)                    | Per-corner radii. Covers all Clay rectangles and borders. |
| `Circle`   | `sdCircle`                             | Filled and stroked.                                       |
| `Ellipse`  | `sdEllipse`                            | Exact (iq's closed-form).                                 |
| `Segment`  | `sdSegment` capsule                    | Rounded caps, correct sub-pixel thin lines.               |
| `Ring_Arc` | `abs(sdCircle) - thickness` + arc mask | Rings, arcs, circle sectors unified.                      |
| `NGon`     | `sdRegularPolygon`                     | Regular n-gon for n ≥ 5.                                  |

The `Solid` kind (value 0) is reserved for the tessellated path, where `shape_kind` is implicitly
zero because the fragment shader receives it from zero-initialized vertex attributes.

Stroke/outline variants of each shape are handled by the `Shape_Flags` bit set rather than separate
shape kinds. The fragment shader transforms `d = abs(d) - stroke_width` when the `Stroke` flag is
set.
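
The stroke transform and the smoothstep anti-aliasing are small enough to sketch outside the shader.
Python stands in for GLSL here; the `aa` half-width is an assumed value, not a constant from the
codebase:

```python
import math

def sd_circle(px, py, cx, cy, r):
    """Signed distance to a circle: negative inside, positive outside."""
    return math.hypot(px - cx, py - cy) - r

def apply_stroke(d, stroke_width):
    """Turn a fill SDF into a stroke SDF centered on the boundary: abs(d) - w/2."""
    return abs(d) - stroke_width * 0.5

def smoothstep(e0, e1, x):
    t = max(0.0, min(1.0, (x - e0) / (e1 - e0)))
    return t * t * (3.0 - 2.0 * t)

def coverage(d, aa=0.75):
    """Analytical anti-aliased coverage from a signed distance (aa is assumed)."""
    return 1.0 - smoothstep(-aa, aa, d)

print(coverage(sd_circle(0, 0, 0, 0, 10)))                      # 1.0, deep inside the fill
print(coverage(apply_stroke(sd_circle(10, 0, 0, 0, 10), 2.0)))  # 1.0, on the stroke centerline
```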

**What stays tessellated:**

- Text (SDL_ttf atlas, pending future MSDF evaluation)
- `rectangle_gradient`, `circle_gradient` (per-vertex color interpolation)
- `triangle_fan`, `triangle_strip` (arbitrary user-provided point lists)
- `line_strip` / polylines (SDF polyline rendering is possible but complex; deferred)
- Any raw vertex geometry submitted via `prepare_shape`

The rule: if the shape has a closed-form SDF, it goes SDF. If it's described only by a vertex list or
needs per-vertex color interpolation, it stays tessellated.

### Effects pipeline

The effects pipeline handles blur-based visual effects: drop shadows, inner shadows, outer glow, and
similar. It uses the same storage-buffer instancing pattern as the main pipeline's SDF path, with a
dedicated pipeline state object that has its own compiled fragment shader.

#### Combined shape + effect rendering

When a shape has an effect (e.g., a rounded rectangle with a drop shadow), the shape is drawn
**once**, entirely in the effects pipeline. The effects fragment shader evaluates both the effect
(blur math) and the base shape's SDF, compositing them in a single pass. The shape is not duplicated
across pipelines.

This avoids redundant overdraw. Consider a 200×100 rounded rect with a drop shadow offset by (5, 5)
and blur sigma 10:

- **Separate-primitive approach** (shape in main pipeline + shadow in effects pipeline): the shadow
  quad covers ~230×130 = 29,900 pixels, the shape quad covers 200×100 = 20,000 pixels. The ~18,500
  shadow fragments underneath the shape run the expensive blur shader only to be overwritten by the
  shape. Total fragment invocations: ~49,900.

- **Combined approach** (one primitive in effects pipeline): one quad covers ~230×130 = 29,900
  pixels. The fragment shader evaluates the blur, then evaluates the shape SDF, composites the shape
  on top. Total fragment invocations: ~29,900. The 20,000 shape-region fragments run the blur+shape
  shader, but the shape SDF evaluation adds only ~15 ops to an ~80 op blur shader.

The combined approach uses **~40% fewer fragment invocations** per effected shape (29,900 vs 49,900)
in the common opaque case. The shape-region fragments pay a small additional cost for shape SDF
evaluation in the effects shader (~15 ops), but this is far cheaper than running 18,500 fragments
through the full blur shader (~80 ops each) and then discarding their output. For a UI with 10
shadowed elements, the combined approach saves roughly 200,000 fragment invocations per frame.
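
The overdraw arithmetic above checks out:

```python
shadow_quad = 230 * 130  # shadow bounds grown by blur + offset
shape_quad = 200 * 100

separate = shadow_quad + shape_quad  # shape drawn over the shadow quad
combined = shadow_quad               # one quad, blur + shape SDF in one pass
print(separate, combined)                                # 49900 29900
print(f"saved per shape: {separate - combined}")         # 20000
print(f"10 shadowed elements: {10 * (separate - combined):,}")  # 200,000
```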

An `Effect_Flag.Draw_Base_Shape` flag controls whether the sharp shape layer composites on top
(default true for drop shadow, always true for inner shadow). Standalone effects (e.g., a glow with
no shape on top) clear this flag.

Shapes without effects are submitted to the main pipeline as normal. Only shapes that have effects
are routed to the effects pipeline.

#### Drop shadow implementation

Drop shadows use the analytical blurred-rounded-rectangle technique. Raph Levien's 2020 blog post
describes an erf-based approximation that computes a Gaussian-blurred rounded rectangle in closed
form along one axis and with a 4-sample numerical integration along the other. Total fragment cost is
~80 FLOPs, one sqrt, no texture samples. This is the same technique used by Zed GPUI (via Evan
Wallace's variant) and vger-rs.
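
For a sharp-cornered box the blur is fully separable, so a 1-D slice of the closed form makes a
useful sanity check. This sketch uses the stdlib `erf`; Levien's technique adds the 4-sample
numerical integration for rounded corners, which is omitted here:

```python
import math

def blurred_box_1d(x, half_width, sigma):
    """Gaussian-blurred 1-D box: the convolution of a box of width 2*half_width
    with a Gaussian of std dev sigma, in closed form via erf."""
    s = sigma * math.sqrt(2.0)
    return 0.5 * (math.erf((x + half_width) / s) - math.erf((x - half_width) / s))

# Deep inside a wide box the shadow is fully opaque; far outside it vanishes.
print(round(blurred_box_1d(0.0, 100.0, 10.0), 4))    # 1.0
print(round(blurred_box_1d(200.0, 100.0, 10.0), 4))  # 0.0
```

The 2-D sharp-box shadow is just the product of this function along x and y; only the rounded
corners break separability and force the numerical integration along one axis.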

**References:**

- Raph Levien's blurred rounded rectangles post (erf approximation, squircle contour refinement):
  https://raphlinus.github.io/graphics/2020/04/21/blurred-rounded-rects.html
- Evan Wallace's original WebGL implementation (used by Figma):
  https://madebyevan.com/shaders/fast-rounded-rectangle-shadows/
- Vello's implementation of blurred rounded rectangle as a gradient type:
  https://github.com/linebender/vello/pull/665

### Backdrop-effects pipeline

The backdrop-effects pipeline handles effects that sample the current render target as input: frosted
glass, refraction, mirror surfaces. It is structurally separated from the effects pipeline for two
reasons:

1. **Render-state requirement.** Before any backdrop-sampling fragment can run, the current render
   target must be copied to a separate texture via `CopyGPUTextureToTexture`. This is a
   command-buffer-level operation that cannot happen mid-render-pass. The copy naturally creates a
   pipeline boundary.

2. **Register pressure.** Backdrop-sampling shaders read from a texture with Gaussian kernel weights
   (multiple texture fetches per fragment), pushing register usage to ~70–80. Including this in the
   effects pipeline would reduce occupancy for all shadow/glow fragments from ~30% to ~20%, costing
   measurable throughput on the common case.

The backdrop-effects pipeline binds a secondary sampler pointing at the captured backdrop texture. When
no backdrop effects are present in a frame, this pipeline is never bound and the texture copy never
happens — zero cost.

### Vertex layout

The vertex struct is unchanged from the current 20-byte layout:

```
Vertex :: struct {
    position: [2]f32, // 0: screen-space position
    uv:       [2]f32, // 8: atlas UV (text) or unused (shapes)
    color:    Color,  // 16: u8x4, GPU-normalized to float
}
```

This layout is shared between the tessellated path and the SDF unit-quad vertices. For tessellated
draws, `position` carries actual world-space geometry. For SDF draws, `position` carries unit-quad
corners (0,0 to 1,1) and the vertex shader computes world-space position from the storage-buffer
primitive's bounds.
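
The SDF-mode vertex transform described above is a two-component lerp from unit-quad corner to
primitive bounds; sketched outside the shader, with Python standing in for the GLSL `mix`:

```python
def quad_world_position(unit_x, unit_y, bounds):
    """Map a unit-quad corner (0..1) into the primitive's bounds
    (min_x, min_y, max_x, max_y), as the SDF vertex shader does."""
    min_x, min_y, max_x, max_y = bounds
    return (min_x + (max_x - min_x) * unit_x,
            min_y + (max_y - min_y) * unit_y)

print(quad_world_position(0.0, 0.0, (10, 20, 110, 70)))  # (10.0, 20.0)
print(quad_world_position(1.0, 1.0, (10, 20, 110, 70)))  # (110.0, 70.0)
```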

The `Primitive` struct for SDF shapes lives in the storage buffer, not in vertex attributes:

```
Primitive :: struct {
    kind:   Shape_Kind,   // 0: enum u8
    flags:  Shape_Flags,  // 1: bit_set[Shape_Flag; u8]
    _pad:   u16,          // 2: reserved
    bounds: [4]f32,       // 4: min_x, min_y, max_x, max_y
    color:  Color,        // 20: u8x4
    _pad2:  [3]u8,        // 24: alignment
    params: Shape_Params, // 28: raw union, 32 bytes
}
// Total: 60 bytes (padded to 64 for GPU alignment)
```

`Shape_Params` is a `#raw_union` with named variants per shape kind (`rrect`, `circle`, `segment`,
etc.), ensuring type safety on the CPU side and zero-cost reinterpretation on the GPU side.
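
The offsets in the struct comments follow from ordinary alignment rules; a quick check, assuming
4-byte alignment for the f32-bearing `bounds` and `params` fields:

```python
def align_up(offset, alignment):
    return (offset + alignment - 1) // alignment * alignment

offset = 0
offset += 1                   # kind:  enum u8          -> offset 0
offset += 1                   # flags: bit_set u8       -> offset 1
offset += 2                   # _pad:  u16              -> offset 2
bounds = offset               # bounds: [4]f32          -> offset 4
offset += 16
color = offset                # color: u8x4             -> offset 20
offset += 4 + 3               # color + _pad2: [3]u8
params = align_up(offset, 4)  # params holds f32s       -> offset 28
size = params + 32            # 32-byte Shape_Params
print(bounds, color, params, size, align_up(size, 16))  # 4 20 28 60 64
```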

### Draw submission order

Within each scissor region, draws are issued in submission order to preserve the painter's algorithm:

1. Bind **effects pipeline** → draw all queued effects primitives for this scissor (instanced, one
   draw call). Each effects primitive includes its base shape and composites internally.
2. Bind **main pipeline, tessellated mode** → draw all queued tessellated vertices (non-indexed for
   shapes, indexed for text). Pipeline state unchanged from today.
3. Bind **main pipeline, SDF mode** → draw all queued SDF primitives (instanced, one draw call).
4. If backdrop effects are present: copy render target, bind **backdrop-effects pipeline** → draw
   backdrop primitives.

The exact ordering within a scissor may be refined based on actual Z-ordering requirements. The key
invariant is that each primitive is drawn exactly once, in the pipeline that owns it.
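
The per-scissor encoding loop, sketched in Python; every name here is illustrative, with strings
standing in for the actual SDL3 GPU calls:

```python
from dataclasses import dataclass, field

@dataclass
class Scissor:
    # Per-scissor queues, filled in submission order (names are hypothetical).
    effects: list = field(default_factory=list)
    tessellated: list = field(default_factory=list)
    sdf: list = field(default_factory=list)
    backdrop: list = field(default_factory=list)

def encode_scissor(scissor):
    """Emit the command sequence for one scissor, skipping empty queues."""
    cmds = []
    if scissor.effects:
        cmds += ["bind effects", "draw effects (instanced)"]
    if scissor.tessellated:
        cmds += ["bind main (tessellated)", "draw tessellated"]
    if scissor.sdf:
        cmds += ["bind main (SDF)", "draw SDF (instanced)"]
    if scissor.backdrop:
        cmds += ["copy render target", "bind backdrop-effects", "draw backdrop"]
    return cmds

print(encode_scissor(Scissor(effects=["shadowed card"], sdf=["rrect", "circle"])))
```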

### Text rendering

Text rendering currently uses SDL_ttf's GPU text engine, which rasterizes glyphs per `(font, size)`
pair into bitmap atlases and emits indexed triangle data via `GetGPUTextDrawData`. This path is
**unchanged** by the SDF migration — text continues to flow through the main pipeline's tessellated
mode with `shape_kind = Solid`, sampling the SDL_ttf atlas texture.

A future phase may evaluate MSDF (multi-channel signed distance field) text rendering, which would
allow resolution-independent glyph rendering from a single small atlas per font. This would involve:

- Offline atlas generation via Chlumský's msdf-atlas-gen tool.
- Runtime glyph metrics via `vendor:stb/truetype` (already in the Odin distribution).
- A new `Shape_Kind.MSDF_Glyph` variant in the main pipeline's fragment shader.
- Potential removal of the SDL_ttf dependency.

This is explicitly deferred. The SDF shape migration is independent of and does not block text
changes.

**References:**

- Viktor Chlumský's MSDF master's thesis and msdfgen tool:
  https://github.com/Chlumsky/msdfgen
- MSDF atlas generator for font atlas packing:
  https://github.com/Chlumsky/msdf-atlas-gen
- Valve's original SDF text rendering paper (SIGGRAPH 2007):
  https://steamcdn-a.akamaihd.net/apps/valve/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf

## 3D rendering

3D pipeline architecture is under consideration and will be documented separately. The current
expectation is that 3D rendering will use dedicated pipelines (separate from the 2D pipelines)
sharing GPU resources (textures, samplers, command buffer lifecycle) with the 2D renderer.

## Building shaders

GLSL shader sources live in `shaders/source/`. Compiled outputs (SPIR-V and Metal Shading Language)
are generated into `shaders/generated/` via the meta tool:

```
odin run meta -- gen-shaders
```

Requires `glslangValidator` and `spirv-cross` on PATH.