Added draw package as renderer focused on mixed use layout / 2D / 3D scene applications (#7)

Co-authored-by: Zachary Levy <zachary@sunforge.is>
Reviewed-on: #7
This commit was merged in pull request #7.
2026-04-20 20:14:56 +00:00
parent 59c600d630
commit 274289bd47
26 changed files with 5331 additions and 1 deletions

draw/README.md
# draw
2D rendering library built on SDL3 GPU, providing a unified shape-drawing and text-rendering API with
Clay UI integration.
## Current state
The renderer uses a single unified `Pipeline_2D_Base` (`TRIANGLELIST` pipeline) with two submission
modes dispatched by a push constant:
- **Mode 0 (Tessellated):** Vertex buffer contains real geometry. Used for text (indexed draws into
SDL_ttf atlas textures), axis-aligned sharp-corner rectangles (already optimal as 2 triangles),
per-vertex color gradients (`rectangle_gradient`, `circle_gradient`), angular-clipped circle
sectors (`circle_sector`), and arbitrary user geometry (`triangle`, `triangle_fan`,
`triangle_strip`). The fragment shader computes `out = color * texture(tex, uv)`.
- **Mode 1 (SDF):** A static 6-vertex unit-quad buffer is drawn instanced, with per-primitive
`Primitive` structs uploaded each frame to a GPU storage buffer. The vertex shader reads
`primitives[gl_InstanceIndex]`, computes world-space position from unit quad corners + primitive
bounds. The fragment shader dispatches on `Shape_Kind` to evaluate the correct signed distance
function analytically.
Seven SDF shape kinds are implemented:
1. **RRect** — rounded rectangle with per-corner radii (iq's `sdRoundedBox`)
2. **Circle** — filled or stroked circle
3. **Ellipse** — exact signed-distance ellipse (iq's iterative `sdEllipse`)
4. **Segment** — capsule-style line segment with rounded caps
5. **Ring_Arc** — annular ring with angular clipping for arcs
6. **NGon** — regular polygon with arbitrary side count and rotation
7. **Polyline** — decomposed into independent `Segment` primitives per adjacent point pair
All SDF shapes support fill and stroke modes via `Shape_Flags`, and produce mathematically exact
curves with analytical anti-aliasing via `smoothstep` — no tessellation, no piecewise-linear
approximation. A rounded rectangle is 1 primitive (64 bytes) instead of ~250 vertices (~5000 bytes).
MSAA is opt-in (default `._1`, no MSAA) via `Init_Options.msaa_samples`. SDF rendering does not
benefit from MSAA because fragment coverage is computed analytically. MSAA remains useful for text
glyph edges and tessellated user geometry if desired.
## 2D rendering pipeline plan
This section documents the planned architecture for levlib's 2D rendering system. The design is driven
by three goals: **draw quality** (mathematically exact curves with perfect anti-aliasing), **efficiency**
(minimal vertex bandwidth, high GPU occupancy, low draw-call count), and **extensibility** (new
primitives and effects can be added to the library without architectural changes).
### Overview: three pipelines
The 2D renderer will use three GPU pipelines, split by **register pressure compatibility** and
**render-state requirements**:
1. **Main pipeline** — shapes (SDF and tessellated) and text. Low register footprint (~18–22
registers per thread). Runs at high GPU occupancy. Handles 90%+ of all fragments in a typical
frame.
2. **Effects pipeline** — drop shadows, inner shadows, outer glow, and similar ALU-bound blur
effects. Medium register footprint (~48–60 registers). Each effects primitive includes the base
shape's SDF so that it can draw both the effect and the shape in a single fragment pass, avoiding
redundant overdraw.
3. **Backdrop-effects pipeline** — frosted glass, refraction, and any effect that samples the current
render target as input. High register footprint (~70–80 registers) and structurally requires a
`CopyGPUTextureToTexture` from the render target before drawing. Separated both for register
pressure and because the texture-copy requirement forces a render-pass-level state change.
A typical UI frame with no effects uses 1 pipeline bind and 0 switches. A frame with drop shadows
uses 2 pipelines and 1 switch. A frame with shadows and frosted glass uses all 3 pipelines and 2
switches plus 1 texture copy. At ~5μs per pipeline bind on modern APIs, worst-case switching overhead
is under 0.15% of an 8.3ms (120 FPS) frame budget.
### Why three pipelines, not one or seven
The natural question is whether we should use a single unified pipeline (fewer state changes, simpler
code) or many per-primitive-type pipelines (no branching overhead, lean per-shader register usage).
The dominant cost factor is **GPU register pressure**, not pipeline switching overhead or fragment
shader branching. A GPU shader core has a fixed register pool shared among all concurrent threads. The
compiler allocates registers pessimistically based on the worst-case path through the shader. If the
shader contains both a 20-register RRect SDF and a 72-register frosted-glass blur, _every_ fragment
— even trivial RRects — is allocated 72 registers. This directly reduces **occupancy** (the number of
warps that can run simultaneously), which reduces the GPU's ability to hide memory latency.
Concrete example on a modern NVIDIA SM with 65,536 registers:
| Register allocation | Max concurrent threads | Occupancy |
| ------------------------- | ---------------------- | --------- |
| 20 regs (RRect only) | 3,276 | ~100% |
| 48 regs (+ drop shadow) | 1,365 | ~42% |
| 72 regs (+ frosted glass) | 910 | ~28% |
For a 4K frame (3840×2160) at 1.5× overdraw (~12.4M fragments), running all fragments at 28%
occupancy instead of 100% roughly triples fragment shading time. At 4K this is severe: if the main
pipeline's fragment work at full occupancy takes ~2ms, a single unified shader containing the glass
branch would push it to ~6ms — consuming 72% of the 8.3ms budget available at 120 FPS and leaving
almost nothing for CPU work, uploads, and presentation. This is a per-frame multiplier, not a
per-primitive cost — it applies even when the heavy branch is never taken.
The three-pipeline split groups primitives by register footprint so that:
- Main pipeline (~20 regs): 90%+ of fragments run at near-full occupancy.
- Effects pipeline (~55 regs): shadow/glow fragments run at moderate occupancy; unavoidable given the
blur math complexity.
- Backdrop-effects pipeline (~75 regs): glass fragments run at low occupancy; also unavoidable, and
structurally separated anyway by the texture-copy requirement.
This avoids the register-pressure tax of a single unified shader while keeping pipeline count minimal
(3 vs. Zed GPUI's 7). The effects that drag occupancy down are isolated to the fragments that
actually need them.
**Why not per-primitive-type pipelines (GPUI's approach)?** Zed's GPUI uses 7 separate shader pairs:
quad, shadow, underline, monochrome sprite, polychrome sprite, path, surface. This eliminates all
branching and gives each shader minimal register usage. Three concrete costs make this approach wrong
for our use case:
**Draw call count scales with kind variety, not just scissor count.** With a unified pipeline,
one instanced draw call per scissor covers all primitive kinds from a single storage buffer. With
per-kind pipelines, each scissor requires one draw call and one pipeline bind per kind used. For a
typical UI frame with 15 scissors and 3–4 primitive kinds per scissor, per-kind splitting produces
~45–60 draw calls and pipeline binds; our unified approach produces ~15–20 draw calls and 15
pipeline binds. At ~5μs each for CPU-side command encoding on modern APIs, per-kind splitting adds
375–500μs of CPU overhead per frame — **4.5–6% of an 8.3ms (120 FPS) budget** — with no
compensating GPU-side benefit, because the register-pressure savings within the simple-SDF tier are
negligible (all members cluster at 12–22 registers).
**Z-order preservation forces the API to expose layers.** With a single pipeline drawing all kinds
from one storage buffer, submission order equals draw order — Clay's painterly render commands flow
through without reordering. With separate pipelines per kind, primitives can only batch with
same-kind neighbors, which means interleaved kinds (e.g., `[rrect, circle, text, rrect, text]`) must
either issue one draw call per primitive (defeating batching entirely) or force the user to pre-sort
by kind and reason about explicit layers. GPUI chose the latter, baking layer semantics into their
API where each layer draws shadows before quads before glyphs. Our design avoids this constraint:
submission order is draw order, no layer juggling required.
**PSO compilation costs multiply.** Each pipeline takes 15–50ms to compile on Metal/Vulkan/D3D12 at
first use. 7 pipelines is ~175ms cold startup; 3 pipelines is ~75ms. Adding state axes (MSAA
variants, blend modes, color formats) multiplies combinatorially — a 2.3× larger variant matrix per
additional axis with 7 pipelines vs 3.
**Branching cost comparison: unified vs per-kind in the effects pipeline.** The effects pipeline is
the strongest candidate for per-kind splitting because effect branches are heavier than shape
branches (~80 instructions for drop shadow vs ~20 for an SDF). Even here, per-kind splitting loses.
Consider a worst-case scissor with 15 drop-shadowed cards and 2 inner-shadowed elements interleaved
in submission order:
- _Unified effects pipeline (our plan):_ 1 pipeline bind, 1 instanced draw call. Category-3
divergence occurs at drop-shadow/inner-shadow boundaries where ~4 warps straddle per boundary × 2
boundaries = ~8 divergent warps out of ~19,924 total (0.04%). Each divergent warp pays ~80 extra
instructions. Total divergence cost: 8 × 32 × 80 / 12G inst/sec ≈ **1.7μs**.
- _Per-kind effects pipelines (GPUI-style):_ 2 pipeline binds + 2 draw calls. But submission order
is `[drop, drop, inner, drop, drop, inner, drop, ...]` — the two inner-shadow primitives split the
drop-shadow run into three segments. To preserve Z-order, this requires 5 draw calls and 4 pipeline
switches, not 2. Cost: 5 × 5μs + 4 × 5μs = **45μs**.
The per-kind approach costs **26× more** than the unified approach's divergence penalty (45μs vs
1.7μs), while eliminating only 0.04% warp divergence that was already negligible. Even in the most
extreme stacked-effects scenario (10 cards each with both drop shadow and inner shadow, producing
~60 boundary warps at ~80 extra instructions each), unified divergence costs ~13μs — still 3.5×
cheaper than the pipeline-switching alternative.
The split we _do_ perform (main / effects / backdrop-effects) is motivated by register-pressure tier
boundaries where occupancy differences are catastrophic at 4K (see numbers above). Within a tier,
unified is strictly better by every measure: fewer draw calls, simpler Z-order, lower CPU overhead,
and negligible GPU-side branching cost.
**References:**
- Zed GPUI blog post on their per-primitive pipeline architecture:
https://zed.dev/blog/videogame
- Zed GPUI Metal shader source (7 shader pairs):
https://github.com/zed-industries/zed/blob/cb6fc11/crates/gpui/src/platform/mac/shaders.metal
- NVIDIA Nsight Graphics 2024.3 documentation on active-threads-per-warp and divergence analysis:
https://developer.nvidia.com/blog/optimize-gpu-workloads-for-graphics-applications-with-nvidia-nsight-graphics/
### Why fragment shader branching is safe in this design
There is longstanding folklore that "branches in shaders are bad." This was true on pre-2010 hardware
where shader cores had no branch instructions at all — compilers emitted code for both sides of every
branch and used conditional select to pick the result. On modern GPUs (everything from ~2012 onward),
this is no longer the case. Native dynamic branching is fully supported on all current hardware.
However, branching _can_ still be costly in specific circumstances. Understanding which circumstances
apply to our design — and which do not — is critical to justifying the unified-pipeline approach.
#### How GPU branching works
GPUs execute fragment shaders in **warps** (NVIDIA/Intel, 32 threads) or **wavefronts** (AMD, 32 or
64 threads). All threads in a warp execute the same instruction simultaneously (SIMT model). When a
branch condition evaluates the same way for every thread in a warp, the GPU simply jumps to the taken
path and skips the other — **zero cost**, identical to a CPU branch. This is called a **uniform
branch** or **warp-coherent branch**.
When threads within the same warp disagree on which path to take, the warp must execute both paths
sequentially, masking off threads that don't belong to the active path. This is called **warp
divergence** and it causes the warp to pay the cost of both sides of the branch. In the worst case
(50/50 split), throughput halves for that warp.
There are four categories of branch condition in a fragment shader, ranked by cost:
| Category | Condition source | GPU behavior | Cost |
| -------------------------------- | ----------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- | --------------------- |
| **Compile-time constant** | `#ifdef`, `const bool` | Dead code eliminated by compiler | Zero |
| **Uniform / push constant** | Same value for entire draw call | Warp-coherent; GPU skips dead path | Effectively zero |
| **Per-primitive `flat` varying** | Same value across all fragments of a primitive | Warp-coherent for all warps fully inside one primitive; divergent only at primitive boundaries | Near zero (see below) |
| **Per-fragment varying** | Different value per pixel (e.g., texture lookup, screen position) | Potentially divergent within every warp | Can be expensive |
#### Which category our branches fall into
Our design has two branch points:
1. **`mode` (push constant): tessellated vs. SDF.** This is category 2 — uniform per draw call.
Every thread in every warp of a draw call sees the same `mode` value. **Zero divergence, zero
cost.**
2. **`shape_kind` (flat varying from storage buffer): which SDF to evaluate.** This is category 3.
The `flat` interpolation qualifier ensures that all fragments rasterized from one primitive's quad
receive the same `shape_kind` value. Divergence can only occur at the **boundary between two
adjacent primitives of different kinds**, where the rasterizer might pack fragments from both
primitives into the same warp.
For category 3, the divergence analysis depends on primitive size:
- **Large primitives** (buttons, panels, containers — 50+ pixels on a side): a 200×100 rect
produces ~20,000 fragments = ~625 warps. At most ~4 boundary warps might straddle a neighbor of a
different kind. Divergence rate: **0.6%** of warps.
- **Small primitives** (icons, dots — 16×16): 256 fragments = ~8 warps. At most 2 boundary warps
diverge. Divergence rate: **25%** of warps for that primitive, but the primitive itself covers a
tiny fraction of the frame's total fragments.
- **Worst realistic case**: a dense grid of alternating shape kinds (e.g., circle-rect-circle-rect
icons). Even here, the interior warps of each primitive are coherent. Only the edges diverge. Total
frame-level divergence is typically **1–3%** of all warps.
At 1–3% divergence, the throughput impact is negligible. At 4K with 12.4M total fragments
(~387,000 warps), divergent boundary warps number in the low thousands. Each divergent warp pays at
most ~25 extra instructions (the cost of the longest untaken SDF branch). At ~12G instructions/sec
on a mid-range GPU, that totals ~4μs — under 0.05% of an 8.3ms (120 FPS) frame budget. This is
confirmed by production renderers that use exactly this pattern:
- **vger / vger-rs** (Audulus): single pipeline, 11 primitive kinds dispatched by a `switch` on a
flat varying `prim_type`. Ships at 120 FPS on iPads. The author (Taylor Holliday) replaced nanovg
specifically because CPU-side tessellation was the bottleneck, not fragment branching:
https://github.com/audulus/vger-rs
- **Randy Gaul's 2D renderer**: single pipeline with `shape_type` encoded as a vertex attribute.
Reports that warp divergence "really hasn't been an issue for any game I've seen so far" because
"games tend to draw a lot of the same shape type":
https://randygaul.github.io/graphics/2025/03/04/2D-Rendering-SDF-and-Atlases.html
#### What kind of branching IS expensive
For completeness, here are the cases where shader branching genuinely hurts — none of which apply to
our design:
1. **Per-fragment data-dependent branches with high divergence.** Example:
   `if (texture(noise, uv).r > 0.5)` where the noise texture produces a random pattern. Every warp
   has ~50% divergence and pays for both paths. This is the scenario the "branches are bad" folklore
   warns about. We have no per-fragment data-dependent branches in the main pipeline.
2. **Branches where both paths are very long.** If both sides of a branch are 500+ instructions,
   a divergent warp pays for both, doubling a large cost. Our SDF functions are 10–25 instructions
   each. Even fully divergent, the penalty is ~25 extra instructions — less than a single texture
   sample's latency.
3. **Branches that prevent compiler optimizations.** Some compilers cannot schedule instructions
across branch boundaries, reducing VLIW utilization on older architectures. Modern GPUs (NVIDIA
Volta+, AMD RDNA+, Apple M-series) use scalar+vector execution models where this is not a
concern.
4. **Register pressure from the union of all branches.** This is the real cost, and it is why we
split heavy effects (shadows, glass) into separate pipelines. Within the main pipeline, all SDF
branches have similar register footprints (12–22 registers), so combining them causes negligible
occupancy loss.
**References:**
- ARM solidpixel blog on branches in mobile shaders — comprehensive taxonomy of branch execution
models across GPU generations, confirms uniform and warp-coherent branches are free on modern
hardware:
https://solidpixel.github.io/2021/12/09/branches_in_shaders.html
- Peter Stefek's "A Note on Branching Within a Shader" — practical measurements showing that
warp-coherent branches have zero overhead on Pascal/Volta/Ampere, with clear explanation of the
SIMT divergence mechanism:
https://www.peterstefek.me/shader-branch.html
- NVIDIA Volta architecture whitepaper — documents independent thread scheduling which allows
divergent threads to reconverge more efficiently than older architectures:
https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
- Randy Gaul on warp divergence in practice with per-primitive shape_type branching:
https://randygaul.github.io/graphics/2025/03/04/2D-Rendering-SDF-and-Atlases.html
### Main pipeline: SDF + tessellated (unified)
The main pipeline serves two submission modes through a single `TRIANGLELIST` pipeline and a single
vertex input layout, distinguished by a push constant:
- **Tessellated mode** (`mode = 0`): direct vertex buffer with explicit geometry. Unchanged from
today. Used for text (SDL_ttf atlas sampling), polylines, triangle fans/strips, gradient-filled
shapes, and any user-provided raw vertex geometry.
- **SDF mode** (`mode = 1`): shared unit-quad vertex buffer + GPU storage buffer of `Primitive`
structs, drawn instanced. Used for all shapes with closed-form signed distance functions.
Both modes converge on the same fragment shader, which dispatches on a `shape_kind` discriminant
carried either in the vertex data (tessellated, always `Solid = 0`) or in the storage-buffer
primitive struct (SDF mode).
#### Why SDF for shapes
CPU-side adaptive tessellation for curved shapes (the current approach) has three problems:
1. **Vertex bandwidth.** A rounded rectangle with four corner arcs produces ~250 vertices × 20 bytes
= 5 KB. An SDF rounded rectangle is one `Primitive` struct (~56 bytes) plus 4 shared unit-quad
vertices. That is roughly a 90× reduction per shape.
2. **Quality.** Tessellated curves are piecewise-linear approximations. At high DPI or under
animation/zoom, faceting is visible at any practical segment count. SDF evaluation produces
mathematically exact boundaries with perfect anti-aliasing via `smoothstep` in the fragment
shader.
3. **Feature cost.** Adding soft edges, outlines, stroke effects, or rounded-cap line segments
requires extensive per-shape tessellation code. With SDF, these are trivial fragment shader
operations: `abs(d) - thickness` for stroke, `smoothstep(-soft, soft, d)` for soft edges.
**References:**
- Inigo Quilez's 2D SDF primitive catalog (primary source for all SDF functions used):
https://iquilezles.org/articles/distfunctions2d/
- Valve's 2007 SIGGRAPH paper on SDF for vector textures and glyphs (foundational reference):
https://steamcdn-a.akamaihd.net/apps/valve/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf
- Randy Gaul's practical writeup on SDF 2D rendering with shape-type branching, attribute layout,
warp divergence tradeoffs, and polyline rendering:
https://randygaul.github.io/graphics/2025/03/04/2D-Rendering-SDF-and-Atlases.html
- Audulus vger-rs — production 2D renderer using a single unified pipeline with SDF type
discriminant, same architecture as this plan. Replaced nanovg, achieving 120 FPS where nanovg fell
to 30 FPS due to CPU-side tessellation:
https://github.com/audulus/vger-rs
#### Storage-buffer instancing for SDF primitives
SDF primitives are submitted via a GPU storage buffer indexed by `gl_InstanceIndex` in the vertex
shader, rather than encoding per-primitive data redundantly in vertex attributes. This follows the
pattern used by both Zed GPUI and vger-rs.
Each SDF shape is described by a single `Primitive` struct (~56 bytes) in the storage buffer. The
vertex shader reads `primitives[gl_InstanceIndex]`, computes the quad corner position from the unit
vertex and the primitive's bounds, and passes shape parameters to the fragment shader via `flat`
interpolated varyings.
Compared to encoding per-primitive data in vertex attributes (the "fat vertex" approach),
storage-buffer instancing eliminates the 4–6× data duplication across quad corners. A rounded
rectangle costs 56 bytes instead of 4 vertices × 40+ bytes = 160+ bytes.
The tessellated path retains the existing direct vertex buffer layout (20 bytes/vertex, no storage
buffer access). The vertex shader branch on `mode` (push constant) is warp-uniform — every invocation
in a draw call has the same mode — so it is effectively free on all modern GPUs.
#### Shape kinds
Primitives in the main pipeline's storage buffer carry a `Shape_Kind` discriminant:
| Kind | SDF function | Notes |
| ---------- | -------------------------------------- | --------------------------------------------------------- |
| `RRect` | `sdRoundedBox` (iq) | Per-corner radii. Covers all Clay rectangles and borders. |
| `Circle` | `sdCircle` | Filled and stroked. |
| `Ellipse` | `sdEllipse` | Exact (iq's closed-form). |
| `Segment` | `sdSegment` capsule | Rounded caps, correct sub-pixel thin lines. |
| `Ring_Arc` | `abs(sdCircle) - thickness` + arc mask | Rings, arcs, circle sectors unified. |
| `NGon` | `sdRegularPolygon` | Regular n-gon for n ≥ 5. |
The `Solid` kind (value 0) is reserved for the tessellated path, where `shape_kind` is implicitly
zero because the fragment shader receives it from zero-initialized vertex attributes.
Stroke/outline variants of each shape are handled by the `Shape_Flags` bit set rather than separate
shape kinds. The fragment shader transforms `d = abs(d) - stroke_width` when the `Stroke` flag is
set.
**What stays tessellated:**
- Text (SDL_ttf atlas, pending future MSDF evaluation)
- `rectangle_gradient`, `circle_gradient` (per-vertex color interpolation)
- `triangle_fan`, `triangle_strip` (arbitrary user-provided point lists)
- `line_strip` / polylines (SDF polyline rendering is possible but complex; deferred)
- Any raw vertex geometry submitted via `prepare_shape`
The rule: if the shape has a closed-form SDF, it goes SDF. If it's described only by a vertex list or
needs per-vertex color interpolation, it stays tessellated.
### Effects pipeline
The effects pipeline handles blur-based visual effects: drop shadows, inner shadows, outer glow, and
similar. It uses the same storage-buffer instancing pattern as the main pipeline's SDF path, with a
dedicated pipeline state object that has its own compiled fragment shader.
#### Combined shape + effect rendering
When a shape has an effect (e.g., a rounded rectangle with a drop shadow), the shape is drawn
**once**, entirely in the effects pipeline. The effects fragment shader evaluates both the effect
(blur math) and the base shape's SDF, compositing them in a single pass. The shape is not duplicated
across pipelines.
This avoids redundant overdraw. Consider a 200×100 rounded rect with a drop shadow offset by (5, 5)
and blur sigma 10:
- **Separate-primitive approach** (shape in main pipeline + shadow in effects pipeline): the shadow
quad covers ~230×130 = 29,900 pixels, the shape quad covers 200×100 = 20,000 pixels. The ~18,500
shadow fragments underneath the shape run the expensive blur shader only to be overwritten by the
shape. Total fragment invocations: ~49,900.
- **Combined approach** (one primitive in effects pipeline): one quad covers ~230×130 = 29,900
pixels. The fragment shader evaluates the blur, then evaluates the shape SDF, composites the shape
on top. Total fragment invocations: ~29,900. The 20,000 shape-region fragments run the blur+shape
shader, but the shape SDF evaluation adds only ~15 ops to an ~80 op blur shader.
The combined approach uses **~40% fewer fragment invocations** per effected shape (29,900 vs 49,900)
in the common opaque case. The shape-region fragments pay a small additional cost for shape SDF
evaluation in the effects shader (~15 ops), but this is far cheaper than running 18,500 fragments
through the full blur shader (~80 ops each) and then discarding their output. For a UI with 10
shadowed elements, the combined approach saves roughly 200,000 fragment invocations per frame.
An `Effect_Flag.Draw_Base_Shape` flag controls whether the sharp shape layer composites on top
(default true for drop shadow, always true for inner shadow). Standalone effects (e.g., a glow with
no shape on top) clear this flag.
Shapes without effects are submitted to the main pipeline as normal. Only shapes that have effects
are routed to the effects pipeline.
#### Drop shadow implementation
Drop shadows use the analytical blurred-rounded-rectangle technique. Raph Levien's 2020 blog post
describes an erf-based approximation that computes a Gaussian-blurred rounded rectangle in closed
form along one axis and with a 4-sample numerical integration along the other. Total fragment cost is
~80 FLOPs, one sqrt, no texture samples. This is the same technique used by Zed GPUI (via Evan
Wallace's variant) and vger-rs.
**References:**
- Raph Levien's blurred rounded rectangles post (erf approximation, squircle contour refinement):
https://raphlinus.github.io/graphics/2020/04/21/blurred-rounded-rects.html
- Evan Wallace's original WebGL implementation (used by Figma):
https://madebyevan.com/shaders/fast-rounded-rectangle-shadows/
- Vello's implementation of blurred rounded rectangle as a gradient type:
https://github.com/linebender/vello/pull/665
### Backdrop-effects pipeline
The backdrop-effects pipeline handles effects that sample the current render target as input: frosted
glass, refraction, mirror surfaces. It is structurally separated from the effects pipeline for two
reasons:
1. **Render-state requirement.** Before any backdrop-sampling fragment can run, the current render
target must be copied to a separate texture via `CopyGPUTextureToTexture`. This is a
command-buffer-level operation that cannot happen mid-render-pass. The copy naturally creates a
pipeline boundary.
2. **Register pressure.** Backdrop-sampling shaders read from a texture with Gaussian kernel weights
(multiple texture fetches per fragment), pushing register usage to ~70–80. Including this in the
effects pipeline would reduce occupancy for all shadow/glow fragments from ~30% to ~20%, costing
measurable throughput on the common case.
The backdrop-effects pipeline binds a secondary sampler pointing at the captured backdrop texture. When
no backdrop effects are present in a frame, this pipeline is never bound and the texture copy never
happens — zero cost.
### Vertex layout
The vertex struct is unchanged from the current 20-byte layout:
```
Vertex :: struct {
position: [2]f32, // 0: screen-space position
uv: [2]f32, // 8: atlas UV (text) or unused (shapes)
color: Color, // 16: u8x4, GPU-normalized to float
}
```
This layout is shared between the tessellated path and the SDF unit-quad vertices. For tessellated
draws, `position` carries actual world-space geometry. For SDF draws, `position` carries unit-quad
corners (0,0 to 1,1) and the vertex shader computes world-space position from the storage-buffer
primitive's bounds.
The `Primitive` struct for SDF shapes lives in the storage buffer, not in vertex attributes:
```
Primitive :: struct {
kind: Shape_Kind, // 0: enum u8
flags: Shape_Flags, // 1: bit_set[Shape_Flag; u8]
_pad: u16, // 2: reserved
bounds: [4]f32, // 4: min_x, min_y, max_x, max_y
color: Color, // 20: u8x4
_pad2: [3]u8, // 24: alignment
params: Shape_Params, // 28: raw union, 32 bytes
}
// Total: 60 bytes (padded to 64 for GPU alignment)
```
`Shape_Params` is a `#raw_union` with named variants per shape kind (`rrect`, `circle`, `segment`,
etc.), ensuring type safety on the CPU side and zero-cost reinterpretation on the GPU side.
### Draw submission order
Within each scissor region, draws are issued in submission order to preserve the painter's algorithm:
1. Bind **effects pipeline** → draw all queued effects primitives for this scissor (instanced, one
draw call). Each effects primitive includes its base shape and composites internally.
2. Bind **main pipeline, tessellated mode** → draw all queued tessellated vertices (non-indexed for
shapes, indexed for text). Pipeline state unchanged from today.
3. Bind **main pipeline, SDF mode** → draw all queued SDF primitives (instanced, one draw call).
4. If backdrop effects are present: copy render target, bind **backdrop-effects pipeline** → draw
backdrop primitives.
The exact ordering within a scissor may be refined based on actual Z-ordering requirements. The key
invariant is that each primitive is drawn exactly once, in the pipeline that owns it.
### Text rendering
Text rendering currently uses SDL_ttf's GPU text engine, which rasterizes glyphs per `(font, size)`
pair into bitmap atlases and emits indexed triangle data via `GetGPUTextDrawData`. This path is
**unchanged** by the SDF migration — text continues to flow through the main pipeline's tessellated
mode with `shape_kind = Solid`, sampling the SDL_ttf atlas texture.
A future phase may evaluate MSDF (multi-channel signed distance field) text rendering, which would
allow resolution-independent glyph rendering from a single small atlas per font. This would involve:
- Offline atlas generation via Chlumský's msdf-atlas-gen tool.
- Runtime glyph metrics via `vendor:stb/truetype` (already in the Odin distribution).
- A new `Shape_Kind.MSDF_Glyph` variant in the main pipeline's fragment shader.
- Potential removal of the SDL_ttf dependency.
This is explicitly deferred. The SDF shape migration is independent of and does not block text
changes.
**References:**
- Viktor Chlumský's MSDF master's thesis and msdfgen tool:
https://github.com/Chlumsky/msdfgen
- MSDF atlas generator for font atlas packing:
https://github.com/Chlumsky/msdf-atlas-gen
- Valve's original SDF text rendering paper (SIGGRAPH 2007):
https://steamcdn-a.akamaihd.net/apps/valve/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf
## 3D rendering
3D pipeline architecture is under consideration and will be documented separately. The current
expectation is that 3D rendering will use dedicated pipelines (separate from the 2D pipelines)
sharing GPU resources (textures, samplers, command buffer lifecycle) with the 2D renderer.
## Building shaders
GLSL shader sources live in `shaders/source/`. Compiled outputs (SPIR-V and Metal Shading Language)
are generated into `shaders/generated/` via the meta tool:
```
odin run meta -- gen-shaders
```
Requires `glslangValidator` and `spirv-cross` on PATH.
### Shader format selection
The library embeds shader bytecode per compile target — MSL + `main0` entry point on Darwin (via
`spirv-cross --msl`, which renames `main` because it is reserved in Metal), SPIR-V + `main` entry
point elsewhere. Three compile-time constants in `draw.odin` expose the build's shader configuration:
| Constant | Type | Darwin | Other |
| ----------------------------- | ------------------------- | --------- | ---------- |
| `PLATFORM_SHADER_FORMAT_FLAG` | `sdl.GPUShaderFormatFlag` | `.MSL` | `.SPIRV` |
| `PLATFORM_SHADER_FORMAT` | `sdl.GPUShaderFormat` | `{.MSL}` | `{.SPIRV}` |
| `SHADER_ENTRY` | `cstring` | `"main0"` | `"main"` |
Pass `PLATFORM_SHADER_FORMAT` to `sdl.CreateGPUDevice` so SDL selects a backend compatible with the
embedded bytecode:
```
gpu := sdl.CreateGPUDevice(draw.PLATFORM_SHADER_FORMAT, true, nil)
```
At init time the library calls `sdl.GetGPUShaderFormats(device)` to verify the active backend
accepts `PLATFORM_SHADER_FORMAT_FLAG`. If it does not, `draw.init` returns `false` with a
descriptive log message showing both the embedded and active format sets.