Update draw README

Zachary Levy
2026-04-28 16:04:44 -07:00
parent c59858dcd4
commit ff29dbd92f
+235 -176
@@ -9,51 +9,53 @@ The renderer uses a single unified `Pipeline_2D_Base` (`TRIANGLELIST` pipeline)
modes dispatched by a push constant:
- **Mode 0 (Tessellated):** Vertex buffer contains real geometry. Used for text (indexed draws into
  SDL_ttf atlas textures), single-pixel points (`tess.pixel`), arbitrary user geometry
  (`tess.triangle`, `tess.triangle_aa`, `tess.triangle_lines`, `tess.triangle_fan`,
  `tess.triangle_strip`), and any raw vertex geometry submitted via `prepare_shape`. The fragment
  shader premultiplies the texture sample (`t.rgb *= t.a`) and computes `out = color * t`.
- **Mode 1 (SDF):** A static 6-vertex unit-quad buffer is drawn instanced, with per-primitive
  `Primitive` structs (80 bytes each) uploaded each frame to a GPU storage buffer. The vertex shader
  reads `primitives[gl_InstanceIndex]`, computes world-space position from unit quad corners +
  primitive bounds. The fragment shader dispatches on `Shape_Kind` (encoded in the low byte of
  `Primitive.flags`) to evaluate one of four signed distance functions:
  - **RRect** (kind 1) — `sdRoundedBox` with per-corner radii. Covers rectangles (sharp or rounded),
    circles (uniform radii = half-size), and line segments / capsules (rotated RRect with uniform
    radii = half-thickness). Covers filled, outlined, textured, and gradient-filled variants.
  - **NGon** (kind 2) — `sdRegularPolygon` for regular N-sided polygons.
  - **Ellipse** (kind 3) — `sdEllipseApprox`, an approximate ellipse SDF suitable for UI rendering.
  - **Ring_Arc** (kind 4) — annular ring with optional angular clipping via pre-computed edge
    normals. Covers full rings, partial arcs, and pie slices (`inner_radius = 0`).
All SDF shapes support fill, outline, solid color, 2-color linear gradients, 2-color radial
gradients, and texture fills via `Shape_Flags` (see `pipeline_2d_base.odin`). Gradient and outline
parameters are packed into the same 16 bytes as the texture UV rect via a `Uv_Or_Effects` raw union
— zero size increase to the 80-byte `Primitive` struct. Gradient/outline and texture are mutually
exclusive.
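The zero-size overlay can be illustrated with a C union (the field names here are hypothetical; the actual `Uv_Or_Effects` layout lives in `pipeline_2d_base.odin`):

```c
#include <stdint.h>

/* Illustration of the 16-byte overlay. Field names are hypothetical,
   not the actual levlib layout. */
typedef struct { float u0, v0, u1, v1; } Uv_Rect;   /* texture path */
typedef struct {
    uint32_t color_a;     /* packed RGBA8 gradient endpoint */
    uint32_t color_b;     /* packed RGBA8 gradient endpoint */
    float    outline_px;  /* outline thickness              */
    float    gradient_t;  /* gradient direction parameter   */
} Effects_Params;                                   /* gradient/outline path */

/* Raw union: both interpretations occupy the same 16 bytes, so the
   gradient/outline parameters add zero bytes to the 80-byte Primitive. */
typedef union { Uv_Rect uv; Effects_Params effects; } Uv_Or_Effects;
```

Because gradient/outline and texture are mutually exclusive, only one interpretation of the 16 bytes is ever live for a given primitive.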
All SDF shapes produce mathematically exact curves with analytical anti-aliasing via `smoothstep` —
no tessellation, no piecewise-linear approximation. A rounded rectangle is 1 primitive (80 bytes)
instead of ~250 vertices (~5000 bytes).
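The analytical anti-aliasing step is a small function of the signed distance. A minimal C sketch (assuming the edge width `aa` comes from something like `fwidth(d)` in the shader; that detail is an assumption, not the exact levlib code):

```c
/* GLSL-style smoothstep: cubic ramp from 0 at e0 to 1 at e1. */
static float smoothstep_f(float e0, float e1, float x) {
    float t = (x - e0) / (e1 - e0);
    t = t < 0.0f ? 0.0f : (t > 1.0f ? 1.0f : t);
    return t * t * (3.0f - 2.0f * t);
}

/* Analytic coverage from a signed distance d (negative = inside):
   fully opaque one aa-width inside the edge, fully transparent one
   aa-width outside, smooth in between. */
static float sdf_coverage(float d, float aa) {
    return 1.0f - smoothstep_f(-aa, aa, d);
}
```

This is why MSAA adds nothing for SDF shapes: the fragment's coverage is already computed exactly from the distance field.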
The main pipeline's register budget is **≤24 registers** (see "Main/effects split: register pressure"
in the pipeline plan below for the full cliff/margin analysis and SBC architecture context). The
fragment shader's estimated peak footprint is ~22–26 fp32 VGPRs (~16–22 fp16 VGPRs on architectures
with native mediump) via manual live-range analysis. The dominant peak is the Ring_Arc kind path
(wedge normals + inner/outer radii + dot-product temporaries live simultaneously with carried state
like `f_color`, `f_uv_or_effects`, and `half_size`). RRect is 1–2 regs lower (`corner_radii` vec4
replaces the separate inner/outer + normal pairs). NGon and Ellipse are lighter still. Real compilers
apply live-range coalescing, mediump-to-fp16 promotion, and rematerialization that typically shave
2–4 regs from hand-counted estimates — the conservative 26-reg upper bound is expected to compile
down to within the 24-register budget, but this must be verified with `malioc` (see "Verifying
register counts" below). On V3D and Bifrost architectures (16-register cliff), the compiler
statically allocates registers for the worst-case path (Ring_Arc) regardless of which kind any given
fragment actually evaluates, so all fragments pay the occupancy cost of the heaviest branch. This is
a documented limitation, not a design constraint (see "Known limitations: V3D and Bifrost" below).
MSAA is opt-in (default `._1`, no MSAA) via `Init_Options.msaa_samples`. SDF rendering does not
benefit from MSAA because fragment coverage is computed analytically. MSAA remains useful for text
glyph edges and tessellated user geometry if desired.
## 2D rendering pipeline plan
This section documents the planned architecture for levlib's 2D rendering system. The design is driven
@@ -66,22 +68,23 @@ primitives and effects can be added to the library without architectural changes
The 2D renderer uses three GPU pipelines, split by **register pressure** (main vs effects) and
**render-pass structure** (everything vs backdrop):
1. **Main pipeline** — shapes (SDF and tessellated), text, and textured rectangles. Register budget:
   **≤24 registers** (full occupancy on Valhall and all desktop GPUs). Handles 90%+ of all fragments
   in a typical frame.
2. **Effects pipeline** — drop shadows, inner shadows, outer glow, and similar ALU-bound blur
   effects. Register budget: **≤56 registers** (targets Valhall's second cliff at 64; reduced
   occupancy at the first cliff is accepted by design). Each effects primitive includes the base
   shape's SDF so that it can draw both the effect and the shape in a single fragment pass, avoiding
   redundant overdraw. Separated from the main pipeline to protect main-pipeline occupancy on
   low-end hardware (see register analysis below).
3. **Backdrop pipeline** — frosted glass, refraction, and any effect that samples the current render
   target as input. Implemented as a multi-pass sequence (downsample, separable blur, composite),
   where each individual sub-pass has a register budget of **≤24 registers** (full occupancy on
   Valhall). Separated from the other pipelines because it structurally requires ending the current
   render pass and copying the render target before any backdrop-sampling fragment can execute — a
   command-buffer-level boundary that cannot be avoided regardless of shader complexity.
A typical UI frame with no effects uses 1 pipeline bind and 0 switches. A frame with drop shadows
uses 2 pipelines and 1 switch. A frame with shadows and frosted glass uses all 3 pipelines and 2
@@ -97,56 +100,113 @@ code) or many per-primitive-type pipelines (no branching overhead, lean per-shad
A GPU shader core has a fixed register pool shared among all concurrent threads. The compiler
allocates registers pessimistically based on the worst-case path through the shader. If the shader
contains both a 24-register RRect SDF and a 56-register drop-shadow blur, _every_ fragment — even
trivial RRects — is allocated 56 registers. This directly reduces **occupancy** (the number of
warps/wavefronts that can run simultaneously), which reduces the GPU's ability to hide memory
latency.
Each GPU architecture has discrete **occupancy cliffs** — register counts above which the number of
concurrent threads drops in a step. Below the cliff, adding registers has zero occupancy cost. One
register over, throughput drops sharply.
**Target architecture: ARM Mali Valhall (32-register first cliff).** The binding constraint for our
register budgets comes from the SBC (single-board computer) market, where Mali Valhall is the
dominant current GPU architecture:

- **RK3588-class boards** (Orange Pi 5, Radxa Rock 5, Khadas Edge 2, NanoPi R6, Banana Pi M7) ship
  **Mali-G610** (Valhall). This is the dominant non-Pi SBC platform. First occupancy cliff at **32
  registers**, second cliff at **64 registers**.
- **ARM Mali Valhall** (G57, G77, G78, G610, G710, G715; 2019+) and **5th-gen / Mali-G1** (2024+):
  same cliff structure — first at 32, second at 64.
- **ARM Mali Bifrost** (G31, G51, G52, G71, G72, G76; ~2016–2018): first cliff at **16 registers**.
  Legacy; found on older budget boards (Allwinner H6/H618, Amlogic S922X). See Known limitations
  below.
- **Broadcom V3D 4.x / 7.x** (Raspberry Pi 4 / Pi 5): first cliff at **16 registers**. Outlier in
  the current SBC market. See Known limitations below.
- **Apple M3+**: Dynamic Caching (register file virtualization) eliminates the static cliff entirely.
  Register allocation happens at runtime based on actual usage.
- **Qualcomm Adreno**: dynamic register allocation with soft thresholds; no hard cliff.
- **NVIDIA desktop** (Ampere/Ada): cliff at ~43 registers. Not a constraint for any of our pipelines.
**Register budgets and margin.** We target Valhall's 32-register first cliff for the main and
backdrop pipelines, and Valhall's 64-register second cliff for the effects pipeline, each with **8
registers of margin**:

| Pipeline            | Cliff targeted         | Margin | Register budget   | Rationale                                                                                     |
| ------------------- | ---------------------- | ------ | ----------------- | --------------------------------------------------------------------------------------------- |
| Main pipeline       | 32 (Valhall 1st cliff) | 8      | **≤24 regs**      | Handles 90%+ of frame fragments; must run at full occupancy                                   |
| Backdrop sub-passes | 32 (Valhall 1st cliff) | 8      | **≤24 regs** each | Multi-pass structure keeps each pass small; no reason to give up occupancy                    |
| Effects pipeline    | 64 (Valhall 2nd cliff) | 8      | **≤56 regs**      | Reduced occupancy at 1st cliff accepted by design — the entire point of splitting effects out |
**Why 8 registers of margin.** Targeting the cliff exactly is fragile. Three forces push register
counts upward over a shader's lifetime:

1. **Compiler version changes.** Mali driver releases (r35p0 → r55p0 etc.) ship new register
   allocators. Shaders typically drift ±2–3 registers between versions on unchanged source.
2. **Feature additions.** Each new effect, flag, or uniform adds 1–4 live registers. A new gradient
   mode or outline option lands in this range.
3. **Precision regressions.** A `mediump` promoted to `highp` (by bug fix, compiler heuristic
   change, or a contributor not knowing) costs 2 registers per affected `vec4`.
Realistic creep over a couple of years is 4–8 registers. The cost of conservatism is zero — a shader
at 24 regs runs identically to one at 32 on every Valhall device. The cost of crossing the cliff is
a 2× throughput drop with no warning. Asymmetric costs justify a generous margin.
**Why the main/effects split exists.** If the main pipeline shader contained both the 24-register
SDF path and the ~50-register drop-shadow blur, every fragment — even trivial RRects — would be
allocated ~50 registers. On Valhall this crosses the 32-register first cliff, halving occupancy for
90%+ of the frame's fragments. Separating effects into their own pipeline means the main pipeline
stays at ≤24 registers (full Valhall occupancy), and only the small fraction of fragments that
actually render effects (~5–10% in a typical UI) run at reduced occupancy.
For the effects pipeline's drop-shadow shader — analytical erf-approximation blur (~80 FLOPs, no
texture samples) — 50% occupancy on Valhall roughly halves throughput. At 4K with 1.5× overdraw
(~12.4M fragments), a single unified shader containing the shadow branch would cost ~4ms instead of
~2ms on Valhall. This is a per-frame multiplier even when the heavy branch is never taken, because
the compiler allocates registers for the worst-case path.
The effects pipeline's ≤56-register budget keeps it under Valhall's second cliff at 64, yielding
50–67% occupancy on effected shapes. This is acceptable for the small fraction of frame fragments
that effects cover.
**Note on Apple M3+ GPUs:** Apple's M3 Dynamic Caching allocates registers at runtime based on
actual usage rather than worst-case. This eliminates the static register-pressure argument on M3 and
later, but the split remains useful for isolating blur ALU complexity and keeping the backdrop
texture-copy out of the main render pass.
**Note on NVIDIA desktop GPUs:** On consumer Ampere/Ada (cliff at ~43 regs), even the effects
pipeline's ≤56-register budget only reduces occupancy to ~89% — well within noise. On Volta/A100
(cliff at ~32 regs), the effects pipeline drops to ~67%. In both cases the main pipeline runs at
100% occupancy. Desktop GPUs are not the binding constraint; Valhall is.
#### Known limitations: V3D and Bifrost (16-register cliff)
Broadcom V3D 4.x / 7.x (Raspberry Pi 4 / Pi 5) and ARM Mali Bifrost (G31, G51, G52, G71, G72, G76)
have a first occupancy cliff at **16 registers**. All three of our pipelines exceed this cliff — even
the main pipeline's ≤24-register budget is above 16. On these architectures, every shader runs at
reduced occupancy regardless of which shape kind or effect is active.
Restoring full occupancy on V3D / Bifrost would require a fundamentally different shader
architecture: per-shape-kind pipeline splitting (one pipeline per SDF kind, each with a minimal
register footprint under 16). This conflicts with the unified-pipeline design that enables single
draw calls per scissor, submission-order Z preservation, and low PSO compilation cost. It would
effectively be the GPUI-style approach whose tradeoffs are analyzed in "Why not per-primitive-type
pipelines" below.
We treat this as a documented limitation, not a design constraint. The 16-register cliff is legacy
(Bifrost) or a single-vendor outlier (V3D). The dominant current SBC platform (RK3588 / Mali-G610)
and all mainstream mobile and desktop GPUs have cliffs at 32 or higher. The long-term direction in
GPU architecture is toward eliminating static cliffs entirely (Apple Dynamic Caching, Adreno dynamic
allocation).
#### Verifying register counts
The register estimates in this document are hand-counted via manual live-range analysis (see Current
state). Shader changes that affect the main or effects pipeline should be verified with `malioc`
(ARM Mali Offline Compiler) against current Valhall driver versions before merging. `malioc` reports
exact register allocation, spilling, and occupancy for each Mali generation. On desktop, Radeon GPU
Analyzer (RGA) and NVIDIA Nsight provide equivalent data. Replacing the hand-counted estimates with
measured `malioc` numbers is a follow-up task.
#### Backdrop split: render-pass structure
@@ -156,10 +216,11 @@ render target must be copied to a separate texture via `CopyGPUTextureToTexture`
level operation that requires ending the current render pass. This boundary exists regardless of
shader complexity and cannot be optimized away.
The backdrop pipeline's individual shader passes (downsample, separable blur, composite) are budgeted
at ≤24 registers each (same as the main pipeline), so merging them into the effects pipeline would
cause no occupancy problem. But the render-pass boundary makes merging structurally impossible —
effects draws happen inside the main render pass, backdrop draws happen inside their own bracketed
pass sequence.
#### Why not per-primitive-type pipelines (GPUI's approach)
@@ -271,18 +332,23 @@ There are three categories of branch condition in a fragment shader, ranked by c
#### Which category our branches fall into
Our design has three branch points:
1. **`mode` (push constant): tessellated vs. SDF.** This is category 2 — uniform per draw call.
   Every thread in every warp of a draw call sees the same `mode` value. **Zero divergence, zero
   cost.**
2. **`kind` (flat varying from storage buffer): SDF shape kind dispatch.** This is category 3.
   The low byte of `Primitive.flags` encodes `Shape_Kind` (RRect, NGon, Ellipse, Ring_Arc), passed
   to the fragment shader as a `flat` varying. All fragments of one primitive's quad receive the
   same kind value. The fragment shader's `if/else if` chain selects the appropriate SDF function
   (~15–30 instructions per kind). Divergence occurs only at primitive boundaries where adjacent
   quads have different kinds.
3. **`flags` (flat varying from storage buffer): gradient/texture/outline mode.** Also category 3.
   The upper bits of `Primitive.flags` encode `Shape_Flags`, controlling gradient vs. texture vs.
   solid color selection and outline rendering — all lightweight branches (3–8 instructions per
   path). Divergence at primitive boundaries between different flag combinations has negligible
   cost.
For category 3, the divergence analysis depends on primitive size:
@@ -299,11 +365,12 @@ For category 3, the divergence analysis depends on primitive size:
frame-level divergence is typically **1–3%** of all warps.
At 1–3% divergence, the throughput impact is negligible. At 4K with 12.4M total fragments
(~387,000 warps), divergent boundary warps number in the low thousands. The longest SDF kind branch
is Ring_Arc (~30 instructions); when a divergent warp straddles two different kinds, it pays the
cost of both (~45–60 instructions total). Each divergent warp's extra cost is modest — at ~12G
instructions/sec on a mid-range GPU, even 3,000 divergent warps × 60 extra instructions totals
~15μs, under 0.2% of an 8.3ms (120 FPS) frame budget. This is confirmed by production renderers
that use exactly this pattern:
- **vger / vger-rs** (Audulus): single pipeline, 11 primitive kinds dispatched by a `switch` on a
  flat varying `prim_type`. Ships at 120 FPS on iPads. The author (Taylor Holliday) replaced nanovg
@@ -327,10 +394,10 @@ our design:
> have no per-fragment data-dependent branches in the main pipeline.
2. **Branches where both paths are very long.** If both sides of a branch are 500+ instructions,
   divergent warps pay a large cost twice. Our SDF kind branches are short (~15–30 instructions
   each), and the gradient/texture/solid color selection branches are shorter still (3–8
   instructions each). Even fully divergent, the combined penalty is ~30–60 extra instructions —
   comparable to a single texture sample's latency.
3. **Branches that prevent compiler optimizations.** Some compilers cannot schedule instructions
   across branch boundaries, reducing VLIW utilization on older architectures. Modern GPUs (NVIDIA
@@ -338,9 +405,10 @@ our design:
concern.
4. **Register pressure from the union of all branches.** This is the real cost, and it is why we
   split heavy effects into separate pipelines. Within the main pipeline, the four SDF kind branches
   and flag-based color selection cluster at ~22–26 hand-counted registers (see register analysis in
   Current state), expected to compile to within the ≤24-register budget that guarantees full
   occupancy on Valhall and all desktop architectures. See Known limitations for V3D / Bifrost.
**References:**
@@ -361,19 +429,20 @@ our design:
### Main pipeline: SDF + tessellated (unified)
The main pipeline serves two submission modes through a single `TRIANGLELIST` pipeline and a single
vertex input layout, distinguished by a `mode` field in the `Vertex_Uniforms` push constant
(`Draw_Mode.Tessellated = 0`, `Draw_Mode.SDF = 1`), pushed per draw call via `push_globals`. The
vertex shader branches on this uniform to select the tessellated or SDF code path.
- **Tessellated mode** (`mode = 0`): direct vertex buffer with explicit geometry. Used for text
  (SDL_ttf atlas sampling), triangles, triangle fans/strips, single-pixel points, and any
  user-provided raw vertex geometry.
- **SDF mode** (`mode = 1`): shared unit-quad vertex buffer + GPU storage buffer of `Primitive`
  structs, drawn instanced. Used for all shapes with closed-form signed distance functions.
Both modes use the same fragment shader. The fragment shader checks `Shape_Kind` (low byte of
`Primitive.flags`): kind 0 (`Solid`) is the tessellated path, which premultiplies the texture sample
and computes `out = color * t`; kinds 1–4 dispatch to one of four SDF functions (RRect, NGon,
Ellipse, Ring_Arc) and apply gradient/texture/outline/solid color based on `Shape_Flags` bits.
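The flags decoding described above can be sketched in C (the kind numbering follows this README; the `KIND_*` constant names and the exact bit split for `Shape_Flags` are illustrative assumptions, not the actual Odin declarations):

```c
#include <stdint.h>

/* Shape_Kind occupies the low byte of Primitive.flags; the remaining
   bits carry Shape_Flags. Numbering follows this README; the names
   and the >>8 flag split are illustrative. */
typedef enum {
    KIND_SOLID    = 0,  /* tessellated path   */
    KIND_RRECT    = 1,  /* sdRoundedBox       */
    KIND_NGON     = 2,  /* sdRegularPolygon   */
    KIND_ELLIPSE  = 3,  /* sdEllipseApprox    */
    KIND_RING_ARC = 4,  /* annular ring / arc */
} Shape_Kind;

static Shape_Kind kind_from_flags(uint32_t flags)  { return (Shape_Kind)(flags & 0xFFu); }
static uint32_t   shape_flags_bits(uint32_t flags) { return flags >> 8; }
```

Because the kind reaches the fragment shader as a `flat` varying, every fragment of one quad decodes the same value, which is what keeps the dispatch warp-coherent away from primitive boundaries.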
#### Why SDF for shapes
@@ -425,47 +494,39 @@ The tessellated path retains the existing direct vertex buffer layout (20 bytes/
buffer access). The vertex shader branch on `mode` (push constant) is warp-uniform — every invocation
in a draw call has the same mode — so it is effectively free on all modern GPUs.
#### Shape folding #### Shape kinds and SDF dispatch
The SDF path evaluates a single function — `sdRoundedBox` — for all primitives. There is no The fragment shader dispatches on `Shape_Kind` (low byte of `Primitive.flags`) to evaluate one of
`Shape_Kind` enum or per-primitive kind dispatch in the fragment shader. Shapes that are algebraically four signed distance functions. The `Shape_Kind` enum and per-kind `*_Params` structs are defined in
special cases of a rounded rectangle are emitted as RRect primitives by the CPU-side drawing procs: `pipeline_2d_base.odin`. CPU-side drawing procs in `shapes.odin` build the appropriate `Primitive`
| User-facing shape | RRect mapping | Notes | | User-facing proc | Shape_Kind | SDF function | Notes |
| ---------------------------- | -------------------------------------------- | ---------------------------------------- | | -------------------- | ---------- | ------------------ | ---------------------------------------------------------- |
| Rectangle (sharp or rounded) | Direct | Per-corner radii from `radii` param | | `rectangle` | `RRect` | `sdRoundedBox` | Per-corner radii from `radii` param |
| Circle | `half_size = (r, r)`, `radii = (r, r, r, r)` | Uniform radii = half-size | | `rectangle_texture` | `RRect` | `sdRoundedBox` | Textured fill; `.Textured` flag set |
| Line segment / capsule | Rotated RRect, `radii = half_thickness` | Stadium shape (fully-rounded minor axis) | | `circle` | `RRect` | `sdRoundedBox` | Uniform radii = half-size (circle is a degenerate RRect) |
| Full ring / annulus | Stroked circle at mid-radius | `stroke_px = outer - inner` | | `line`, `line_strip` | `RRect` | `sdRoundedBox` | Rotated capsule — stadium shape (radii = half-thickness) |
Shapes without a closed-form RRect reduction are drawn via the tessellated path: The `Shape_Flags` bit set controls per-primitive rendering mode (outline, gradient, texture, rotation,
| Shape | Tessellated proc | Method | definitions and bit assignments.
| ------------------------- | ---------------------------------- | -------------------------- |
| Ellipse | `tes_ellipse`, `tes_ellipse_lines` | Triangle fan approximation |
| Regular polygon (N-gon) | `tes_polygon`, `tes_polygon_lines` | Triangle fan from center |
| Circle sector (pie slice) | `tes_sector` | Triangle fan arc |
The `Shape_Flags` bit set controls rendering mode per primitive:
| Flag | Bit | Effect |
| ----------------- | --- | -------------------------------------------------------------------- |
| `Stroke` | 0 | Outline instead of fill (`d = abs(d) - stroke_width/2`) |
| `Textured` | 1 | Sample texture using `uv.uv_rect` (mutually exclusive with Gradient) |
| `Gradient` | 2 | Bilinear 4-corner interpolation from `uv.corner_colors` |
| `Gradient_Radial` | 3 | Radial 2-color falloff (inner/outer) from `uv.corner_colors[0..1]` |
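The `pack_flags` layout described above (mode marker in the low byte, shape flags from bit 8) can be sketched as follows — any detail beyond that split is an assumption:

```c
#include <stdint.h>

// Shape flag bit indices from the table above.
enum { FLAG_STROKE, FLAG_TEXTURED, FLAG_GRADIENT, FLAG_GRADIENT_RADIAL };

// Low byte: tessellated/SDF mode marker. Bits 8+: shape flags.
static uint32_t pack_flags(uint32_t mode, uint32_t shape_flags) {
    return (mode & 0xFF) | (shape_flags << 8);
}

static uint32_t unpack_mode(uint32_t packed)  { return packed & 0xFF; }
static uint32_t unpack_flags(uint32_t packed) { return packed >> 8; }
```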
**What stays tessellated:** **What stays tessellated:**
- Text (SDL_ttf atlas, pending future MSDF evaluation) - Text (SDL_ttf atlas, pending future MSDF evaluation)
- Ellipses (`tes_ellipse`, `tes_ellipse_lines`) - `tess.pixel` (single-pixel points)
- Regular polygons (`tes_polygon`, `tes_polygon_lines`) - `tess.triangle`, `tess.triangle_aa`, `tess.triangle_lines` (single triangles)
- Circle sectors / pie slices (`tes_sector`) - `tess.triangle_fan`, `tess.triangle_strip` (arbitrary user-provided geometry)
- `tes_triangle`, `tes_triangle_fan`, `tes_triangle_strip` (arbitrary user-provided geometry)
- Any raw vertex geometry submitted via `prepare_shape` - Any raw vertex geometry submitted via `prepare_shape`
The design rule: if the shape reduces to `sdRoundedBox`, it goes SDF. If it requires a different SDF The design rule: if the shape has a closed-form SDF, it goes through the SDF path with its own
function or is described by a vertex list, it stays tessellated. `Shape_Kind`. If it is described by a vertex list or has no practical SDF, it stays tessellated.
### Effects pipeline ### Effects pipeline
**Multi-pass implementation.** Backdrop effects are implemented as separable multi-pass sequences **Multi-pass implementation.** Backdrop effects are implemented as separable multi-pass sequences
(downsample → horizontal blur → vertical blur → composite), following the standard approach used by (downsample → horizontal blur → vertical blur → composite), following the standard approach used by
iOS `UIVisualEffectView`, Android `RenderEffect`, and Flutter's `BackdropFilter`. Each individual iOS `UIVisualEffectView`, Android `RenderEffect`, and Flutter's `BackdropFilter`. Each individual
pass has a low-to-medium register footprint (~15–40 registers), well within the main pipeline's
occupancy range. The multi-pass approach avoids the monolithic 70+ register shader that a single-pass multi-pass approach avoids the monolithic 70+ register shader that a single-pass Gaussian blur would
Gaussian blur would require, making backdrop effects viable on low-end mobile GPUs (including require, keeping each sub-pass well under the 32-register cliff.
Mali-G31 and VideoCore VI) where per-thread register limits are tight.
**Bracketed execution.** All backdrop draws in a frame share a single bracketed region of the command **Bracketed execution.** All backdrop draws in a frame share a single bracketed region of the command
buffer: end the current render pass, copy the render target, execute all backdrop sub-passes, then buffer: end the current render pass, copy the render target, execute all backdrop sub-passes, then
resume normal drawing. The entry/exit cost (texture copy + render-pass break) is paid once per frame,
regardless of how many backdrop effects are visible. When no backdrop effects are present, the bracket
is never entered and the texture copy never happens — zero cost. is never entered and the texture copy never happens — zero cost.
**Why not split the backdrop sub-passes into separate pipelines?** The individual passes range from **Why not split the backdrop sub-passes into separate pipelines?** Each sub-pass is budgeted at ≤24
~15 to ~40 registers, which does cross Mali's 32-register cliff. However, the register-pressure argument registers, well under Valhall's 32-register cliff, so there is no occupancy motivation for splitting.
that justifies the main/effects split does not apply here. The main/effects split protects the The sub-passes also have no common-vs-uncommon distinction — if backdrop effects are active, every
_common path_ (90%+ of frame fragments) from the uncommon path's register cost. Inside the backdrop sub-pass runs; if not, none run. The backdrop pipeline either executes as a complete unit or not at
pipeline there is no common-vs-uncommon distinction — if backdrop effects are active, every sub-pass all. Additionally, backdrop effects cover a small fraction of the frame's total fragments (~5% at
runs; if not, none run. The backdrop pipeline either executes as a complete unit or not at all. typical UI scales), so even if a sub-pass did cross a cliff, the occupancy variation within the
Additionally, backdrop effects cover a small fraction of the frame's total fragments (~5% at typical bracket would have negligible impact on frame time.
UI scales), so the occupancy variation within the bracket has negligible impact on frame time.
### Vertex layout ### Vertex layout
```
Primitive :: struct {
	// ... (field list elided in this excerpt) ...
}
// Total: 80 bytes (std430 aligned)
```
`RRect_Params` holds the rounded-rectangle parameters directly — there is no `Shape_Params` union. `Shape_Params` is a `#raw_union` over `RRect_Params`, `NGon_Params`, `Ellipse_Params`, and
`Uv_Or_Gradient` is a `#raw_union` that aliases `[4]f32` (texture UV rect) with `[4]Color` (gradient `Ring_Arc_Params` (plus a `raw: [8]f32` view), defined in `pipeline_2d_base.odin`. Each SDF kind
corner colors, clockwise from top-left: TL, TR, BR, BL). The `flags` field encodes both the writes its own params variant; the fragment shader reads the appropriate fields based on `Shape_Kind`.
tessellated/SDF mode marker (low byte) and shape flags (bits 8+) via `pack_flags`. `Uv_Or_Effects` is a `#raw_union` that aliases `[4]f32` (texture UV rect: u_min, v_min, u_max,
### Draw submission order ### Draw submission order
MSDF atlases would allow resolution-independent glyph rendering from a single small atlas per font. The migration would involve:
- Offline atlas generation via Chlumský's msdf-atlas-gen tool. - Offline atlas generation via Chlumský's msdf-atlas-gen tool.
- Runtime glyph metrics via `vendor:stb/truetype` (already in the Odin distribution). - Runtime glyph metrics via `vendor:stb/truetype` (already in the Odin distribution).
- A new MSDF glyph mode in the fragment shader, which would require reintroducing a mode/kind - A new MSDF glyph `Shape_Kind` in the fragment shader (additive — the kind dispatch infrastructure
distinction (the current shader evaluates only `sdRoundedBox` with no kind dispatch). already exists for the four current SDF kinds).
- Potential removal of the SDL_ttf dependency. - Potential removal of the SDL_ttf dependency.
This is explicitly deferred. The SDF shape migration is independent of and does not block text This is explicitly deferred. The SDF shape migration is independent of and does not block text
#### Textured draw procs #### Textured draw procs
Textured rectangles route through the existing SDF path via `sdf_rectangle_texture` and Textured rectangles route through the existing SDF path via `rectangle_texture`, which mirrors
`sdf_rectangle_texture_corners`, mirroring `sdf_rectangle` and `sdf_rectangle_corners` exactly — `rectangle` exactly — same parameters for radii, origin, rotation, feather — with the `color`
same parameters, same naming — with the color parameter replaced by a texture ID plus an optional parameter replaced by a `Texture_Id`, an optional `tint`, a `uv_rect`, and a `Sampler_Preset`.
tint.
An earlier iteration of this design considered a separate tessellated proc for "simple" fullscreen An earlier iteration of this design considered a separate tessellated proc for "simple" fullscreen
quads, on the theory that the tessellated path's lower register count (~16 regs vs ~18 for the SDF quads, on the theory that the tessellated path's lower register count would improve occupancy at
textured branch) would improve occupancy at large fragment counts. Applying the register-pressure large fragment counts. Both paths are well within the ≤24-register main pipeline budget — both run at
analysis from the pipeline-strategy section above shows this is wrong: both 16 and 18 registers are full occupancy on every target architecture (Valhall and above). The remaining ALU difference (~15
well below the register cliff (~43 regs on consumer Ampere/Ada, ~32 on Volta/A100), so both run at extra instructions for the SDF evaluation) amounts to ~20μs at 4K — below noise. Meanwhile,
100% occupancy. The remaining ALU difference (~15 extra instructions for the SDF evaluation) amounts splitting into a separate pipeline would add ~15μs per pipeline bind on the CPU side per scissor,
to ~20μs at 4K — below noise. Meanwhile, splitting into a separate pipeline would add ~15μs per matching or exceeding the GPU-side savings. Within the main pipeline, unified remains strictly better.
pipeline bind on the CPU side per scissor, matching or exceeding the GPU-side savings. Within the
main pipeline, unified remains strictly better.
The naming convention uses `sdf_` and `tes_` prefixes to indicate the rendering path, with suffixes SDF drawing procs live in the `draw` package with unprefixed names (`rectangle`, `rectangle_texture`,
for modifiers: `sdf_rectangle_texture` and `sdf_rectangle_texture_corners` sit alongside `circle`, `ellipse`, `polygon`, `ring`, `line`, `line_strip`). Gradients and outlines are optional
`sdf_rectangle` (solid or gradient overload). Proc groups like `sdf_rectangle` dispatch to parameters on each proc rather than separate overloads. Future per-shape texture variants
`sdf_rectangle_solid` or `sdf_rectangle_gradient` based on argument count. Future per-shape texture (`circle_texture`, `ellipse_texture`) are additive.
variants (`sdf_circle_texture`) are additive.
#### What SDF anti-aliasing does and does not do for textured draws #### What SDF anti-aliasing does and does not do for textured draws
The SDF path anti-aliases the **shape's outer silhouette** — rounded-corner edges, rotated edges, The SDF path anti-aliases the **shape's outer silhouette** — rounded-corner edges, rotated edges,
stroke outlines. It does not anti-alias or sharpen the texture content. Inside the shape, fragments outline edges. It does not anti-alias or sharpen the texture content. Inside the shape, fragments
sample through the chosen `Sampler_Preset`, and image quality is whatever the sampler produces from sample through the chosen `Sampler_Preset`, and image quality is whatever the sampler produces from
the source texels. A low-resolution texture displayed at a large size shows bilinear blur regardless the source texels. A low-resolution texture displayed at a large size shows bilinear blur regardless
of which draw proc is used. This matches the current text-rendering model, where glyph sharpness of which draw proc is used. This matches the current text-rendering model, where glyph sharpness
Clay's `RenderCommandType.Image` is handled by dereferencing `imageData: rawptr` as a pointer to a Clay's `RenderCommandType.Image` is handled by dereferencing `imageData: rawptr` as a pointer to a
`Clay_Image_Data` struct containing a `Texture_Id`, `Fit_Mode`, and tint color. Routing mirrors the `Clay_Image_Data` struct containing a `Texture_Id`, `Fit_Mode`, and tint color. Routing mirrors the
existing rectangle handling: zero `cornerRadius` dispatches to `sdf_rectangle_texture` (SDF, sharp existing rectangle handling: `fit_params` computes UVs from the fit mode, then
corners), nonzero dispatches to `sdf_rectangle_texture_corners` (SDF, per-corner radii). A `rectangle_texture` is called with the appropriate radii (zero for sharp corners, per-corner values
`fit_params` call computes UVs from the fit mode before dispatch. from Clay's `cornerRadius` otherwise).
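A sketch of what a cover-style fit computation might look like — `fit_cover` is a hypothetical helper; the document does not specify `fit_params`' actual fit modes beyond "computes UVs from the fit mode":

```c
typedef struct { float u0, v0, u1, v1; } Uv_Rect;

// "Cover" fit: fill the destination completely, cropping the texture on the
// axis that would overflow, keeping the sampled window centered.
static Uv_Rect fit_cover(float tex_w, float tex_h, float dst_w, float dst_h) {
    Uv_Rect uv = { 0.0f, 0.0f, 1.0f, 1.0f };
    float tex_aspect = tex_w / tex_h;
    float dst_aspect = dst_w / dst_h;
    if (tex_aspect > dst_aspect) {             // texture wider: crop sides
        float half = 0.5f * dst_aspect / tex_aspect;
        uv.u0 = 0.5f - half; uv.u1 = 0.5f + half;
    } else if (tex_aspect < dst_aspect) {      // texture taller: crop top/bottom
        float half = 0.5f * tex_aspect / dst_aspect;
        uv.v0 = 0.5f - half; uv.v1 = 0.5f + half;
    }
    return uv;
}
```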
#### Deferred features #### Deferred features
The following are plumbed in the descriptor but not implemented in phase 1:
- **3D textures, arrays, cube maps**: `Texture_Desc.type` and `depth_or_layers` fields exist. - **3D textures, arrays, cube maps**: `Texture_Desc.type` and `depth_or_layers` fields exist.
- **Additional samplers**: anisotropic, trilinear, clamp-to-border — additive enum values. - **Additional samplers**: anisotropic, trilinear, clamp-to-border — additive enum values.
- **Atlas packing**: internal optimization for sub-batch coalescing; invisible to callers. - **Atlas packing**: internal optimization for sub-batch coalescing; invisible to callers.
- **Per-shape texture variants**: `sdf_circle_texture`, `tes_ellipse_texture`, `tes_polygon_texture` — potential future additions, reserved by naming convention. - **Per-shape texture variants**: `circle_texture`, `ellipse_texture`, `polygon_texture` — potential future additions, following the existing naming convention.
**References:** **References:**