# draw

2D rendering library built on SDL3 GPU, providing a unified shape-drawing and text-rendering API with
Clay UI integration.

## Current state

The renderer uses a single unified `Pipeline_2D_Base` (`TRIANGLELIST` pipeline) with two submission
modes dispatched by a push constant:

- **Mode 0 (Tessellated):** Vertex buffer contains real geometry. Used for text (indexed draws into
  SDL_ttf atlas textures), axis-aligned sharp-corner rectangles (already optimal as 2 triangles),
  per-vertex color gradients (`rectangle_gradient`, `circle_gradient`), angular-clipped circle
  sectors (`circle_sector`), and arbitrary user geometry (`triangle`, `triangle_fan`,
  `triangle_strip`). The fragment shader computes `out = color * texture(tex, uv)`.

- **Mode 1 (SDF):** A static 6-vertex unit-quad buffer is drawn instanced, with per-primitive
  `Primitive` structs uploaded each frame to a GPU storage buffer. The vertex shader reads
  `primitives[gl_InstanceIndex]` and computes world-space position from unit quad corners + primitive
  bounds. The fragment shader dispatches on `Shape_Kind` to evaluate the correct signed distance
  function analytically.

Seven SDF shape kinds are implemented:

1. **RRect** — rounded rectangle with per-corner radii (iq's `sdRoundedBox`)
2. **Circle** — filled or stroked circle
3. **Ellipse** — exact signed-distance ellipse (iq's iterative `sdEllipse`)
4. **Segment** — capsule-style line segment with rounded caps
5. **Ring_Arc** — annular ring with angular clipping for arcs
6. **NGon** — regular polygon with arbitrary side count and rotation
7. **Polyline** — decomposed into independent `Segment` primitives per adjacent point pair

All SDF shapes support fill and stroke modes via `Shape_Flags`, and produce mathematically exact
curves with analytical anti-aliasing via `smoothstep` — no tessellation, no piecewise-linear
approximation. A rounded rectangle is 1 primitive (64 bytes) instead of ~250 vertices (~5000 bytes).

MSAA is opt-in (default `._1`, no MSAA) via `Init_Options.msaa_samples`. SDF rendering does not
benefit from MSAA because fragment coverage is computed analytically. MSAA remains useful for text
glyph edges and tessellated user geometry if desired.

## 2D rendering pipeline plan

This section documents the planned architecture for levlib's 2D rendering system. The design is driven
by three goals: **draw quality** (mathematically exact curves with perfect anti-aliasing), **efficiency**
(minimal vertex bandwidth, high GPU occupancy, low draw-call count), and **extensibility** (new
primitives and effects can be added to the library without architectural changes).

### Overview: three pipelines

The 2D renderer will use three GPU pipelines, split by **register pressure compatibility** and
**render-state requirements**:

1. **Main pipeline** — shapes (SDF and tessellated) and text. Low register footprint (~18–22
   registers per thread). Runs at high GPU occupancy. Handles 90%+ of all fragments in a typical
   frame.

2. **Effects pipeline** — drop shadows, inner shadows, outer glow, and similar ALU-bound blur
   effects. Medium register footprint (~48–60 registers). Each effects primitive includes the base
   shape's SDF so that it can draw both the effect and the shape in a single fragment pass, avoiding
   redundant overdraw.

3. **Backdrop-effects pipeline** — frosted glass, refraction, and any effect that samples the current
   render target as input. High register footprint (~70–80 registers) and structurally requires a
   `CopyGPUTextureToTexture` from the render target before drawing. Separated both for register
   pressure and because the texture-copy requirement forces a render-pass-level state change.

A typical UI frame with no effects uses 1 pipeline bind and 0 switches. A frame with drop shadows
uses 2 pipelines and 1 switch. A frame with shadows and frosted glass uses all 3 pipelines and 2
switches plus 1 texture copy. At ~5μs per pipeline bind on modern APIs, worst-case switching overhead
is under 0.15% of an 8.3ms (120 FPS) frame budget.

### Why three pipelines, not one or seven

The natural question is whether we should use a single unified pipeline (fewer state changes, simpler
code) or many per-primitive-type pipelines (no branching overhead, lean per-shader register usage).

The dominant cost factor is **GPU register pressure**, not pipeline switching overhead or fragment
shader branching. A GPU shader core has a fixed register pool shared among all concurrent threads. The
compiler allocates registers pessimistically based on the worst-case path through the shader. If the
shader contains both a 20-register RRect SDF and a 72-register frosted-glass blur, _every_ fragment
— even trivial RRects — is allocated 72 registers. This directly reduces **occupancy** (the number of
warps that can run simultaneously), which reduces the GPU's ability to hide memory latency.

Concrete example on a modern NVIDIA SM with 65,536 registers:

| Register allocation       | Max concurrent threads | Occupancy |
| ------------------------- | ---------------------- | --------- |
| 20 regs (RRect only)      | 3,276                  | ~100%     |
| 48 regs (+ drop shadow)   | 1,365                  | ~42%      |
| 72 regs (+ frosted glass) | 910                    | ~28%      |

For a 4K frame (3840×2160) at 1.5× overdraw (~12.4M fragments), running all fragments at 28%
occupancy instead of 100% roughly triples fragment shading time. At 4K this is severe: if the main
pipeline's fragment work at full occupancy takes ~2ms, a single unified shader containing the glass
branch would push it to ~6ms — consuming 72% of the 8.3ms budget available at 120 FPS and leaving
almost nothing for CPU work, uploads, and presentation. This is a per-frame multiplier, not a
per-primitive cost — it applies even when the heavy branch is never taken.
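
The table and the fragment-count figure can be reproduced with a few lines of arithmetic; the
65,536-register pool is the assumed SM parameter from the table above:

```python
REGISTER_POOL = 65_536  # registers per SM (modern NVIDIA, per the table above)

# Register-limited concurrent thread count for each shader variant.
for label, regs in [("20 regs (RRect only)", 20),
                    ("48 regs (+ drop shadow)", 48),
                    ("72 regs (+ frosted glass)", 72)]:
    threads = REGISTER_POOL // regs
    print(f"{label:26s} -> {threads:5d} concurrent threads")

# Fragment count for a 4K frame at 1.5x overdraw.
fragments = int(3840 * 2160 * 1.5)
print(f"4K @ 1.5x overdraw: {fragments / 1e6:.1f}M fragments")  # 12.4M
```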

The three-pipeline split groups primitives by register footprint so that:

- Main pipeline (~20 regs): 90%+ of fragments run at near-full occupancy.
- Effects pipeline (~55 regs): shadow/glow fragments run at moderate occupancy; unavoidable given the
  blur math complexity.
- Backdrop-effects pipeline (~75 regs): glass fragments run at low occupancy; also unavoidable, and
  structurally separated anyway by the texture-copy requirement.

This avoids the register-pressure tax of a single unified shader while keeping pipeline count minimal
(3 vs. Zed GPUI's 7). The effects that drag occupancy down are isolated to the fragments that
actually need them.

**Why not per-primitive-type pipelines (GPUI's approach)?** Zed's GPUI uses 7 separate shader pairs:
quad, shadow, underline, monochrome sprite, polychrome sprite, path, surface. This eliminates all
branching and gives each shader minimal register usage. Three concrete costs make this approach wrong
for our use case:

**Draw call count scales with kind variety, not just scissor count.** With a unified pipeline,
one instanced draw call per scissor covers all primitive kinds from a single storage buffer. With
per-kind pipelines, each scissor requires one draw call and one pipeline bind per kind used. For a
typical UI frame with 15 scissors and 3–4 primitive kinds per scissor, per-kind splitting produces
~45–60 draw calls and pipeline binds; our unified approach produces ~15–20 draw calls and 1–5
pipeline binds. At ~5μs each for CPU-side command encoding on modern APIs, per-kind splitting adds
375–500μs of CPU overhead per frame — **4.5–6% of an 8.3ms (120 FPS) budget** — with no
compensating GPU-side benefit, because the register-pressure savings within the simple-SDF tier are
negligible (all members cluster at 12–22 registers).

**Z-order preservation forces the API to expose layers.** With a single pipeline drawing all kinds
from one storage buffer, submission order equals draw order — Clay's painterly render commands flow
through without reordering. With separate pipelines per kind, primitives can only batch with
same-kind neighbors, which means interleaved kinds (e.g., `[rrect, circle, text, rrect, text]`) must
either issue one draw call per primitive (defeating batching entirely) or force the user to pre-sort
by kind and reason about explicit layers. GPUI chose the latter, baking layer semantics into their
API where each layer draws shadows before quads before glyphs. Our design avoids this constraint:
submission order is draw order, no layer juggling required.

**PSO compilation costs multiply.** Each pipeline takes 1–50ms to compile on Metal/Vulkan/D3D12 at
first use. 7 pipelines is ~175ms cold startup; 3 pipelines is ~75ms. Adding state axes (MSAA
variants, blend modes, color formats) multiplies combinatorially — a 2.3× larger variant matrix per
additional axis with 7 pipelines vs 3.

**Branching cost comparison: unified vs per-kind in the effects pipeline.** The effects pipeline is
the strongest candidate for per-kind splitting because effect branches are heavier than shape
branches (~80 instructions for drop shadow vs ~20 for an SDF). Even here, per-kind splitting loses.
Consider a worst-case scissor with 15 drop-shadowed cards and 2 inner-shadowed elements interleaved
in submission order:

- _Unified effects pipeline (our plan):_ 1 pipeline bind, 1 instanced draw call. Category-3
  divergence occurs at drop-shadow/inner-shadow boundaries where ~4 warps straddle per boundary × 2
  boundaries = ~8 divergent warps out of ~19,924 total (0.04%). Each divergent warp pays ~80 extra
  instructions. Total divergence cost: 8 × 32 × 80 / 12G inst/sec ≈ **1.7μs**.

- _Per-kind effects pipelines (GPUI-style):_ 2 pipeline binds + 2 draw calls. But submission order
  is `[drop, drop, inner, drop, drop, inner, drop, ...]` — the two inner-shadow primitives split the
  drop-shadow run into three segments. To preserve Z-order, this requires 5 draw calls and 4 pipeline
  switches, not 2. Cost: 5 × 5μs + 4 × 5μs = **45μs**.

The per-kind approach costs **26× more** than the unified approach's divergence penalty (45μs vs
1.7μs), while eliminating only 0.04% warp divergence that was already negligible. Even in the most
extreme stacked-effects scenario (10 cards each with both drop shadow and inner shadow, producing
~60 boundary warps at ~80 extra instructions each), unified divergence costs ~13μs — still 3.5×
cheaper than the pipeline-switching alternative.
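
Both estimates above are straightforward arithmetic; this sketch reproduces them using the figures
assumed in the text (12G instructions/sec, 5μs per bind or draw call):

```python
INST_PER_SEC = 12e9           # assumed mid-range GPU instruction throughput
BIND_COST = DRAW_COST = 5e-6  # assumed CPU encode cost per bind / draw call

# Unified: ~8 divergent boundary warps x 32 threads x ~80 extra instructions.
unified_divergence = 8 * 32 * 80 / INST_PER_SEC
# Per-kind: preserving Z-order needs 5 draw calls + 4 pipeline switches.
per_kind = 5 * DRAW_COST + 4 * BIND_COST

print(f"unified divergence: {unified_divergence * 1e6:.1f}us")  # 1.7us
print(f"per-kind switching: {per_kind * 1e6:.0f}us")            # 45us
print(f"ratio: {per_kind / unified_divergence:.0f}x")           # 26x

# Extreme stacked-effects case: ~60 boundary warps.
stacked = 60 * 32 * 80 / INST_PER_SEC
print(f"stacked-effects divergence: {stacked * 1e6:.0f}us")     # 13us
```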

The split we _do_ perform (main / effects / backdrop-effects) is motivated by register-pressure tier
boundaries where occupancy differences are catastrophic at 4K (see numbers above). Within a tier,
unified is strictly better by every measure: fewer draw calls, simpler Z-order, lower CPU overhead,
and negligible GPU-side branching cost.

**References:**

- Zed GPUI blog post on their per-primitive pipeline architecture:
  https://zed.dev/blog/videogame
- Zed GPUI Metal shader source (7 shader pairs):
  https://github.com/zed-industries/zed/blob/cb6fc11/crates/gpui/src/platform/mac/shaders.metal
- NVIDIA Nsight Graphics 2024.3 documentation on active-threads-per-warp and divergence analysis:
  https://developer.nvidia.com/blog/optimize-gpu-workloads-for-graphics-applications-with-nvidia-nsight-graphics/

### Why fragment shader branching is safe in this design

There is longstanding folklore that "branches in shaders are bad." This was true on pre-2010 hardware
where shader cores had no branch instructions at all — compilers emitted code for both sides of every
branch and used conditional select to pick the result. On modern GPUs (everything from ~2012 onward),
this is no longer the case. Native dynamic branching is fully supported on all current hardware.
However, branching _can_ still be costly in specific circumstances. Understanding which circumstances
apply to our design — and which do not — is critical to justifying the unified-pipeline approach.

#### How GPU branching works

GPUs execute fragment shaders in **warps** (NVIDIA/Intel, 32 threads) or **wavefronts** (AMD, 32 or
64 threads). All threads in a warp execute the same instruction simultaneously (SIMT model). When a
branch condition evaluates the same way for every thread in a warp, the GPU simply jumps to the taken
path and skips the other — **zero cost**, identical to a CPU branch. This is called a **uniform
branch** or **warp-coherent branch**.

When threads within the same warp disagree on which path to take, the warp must execute both paths
sequentially, masking off threads that don't belong to the active path. This is called **warp
divergence** and it causes the warp to pay the cost of both sides of the branch. In the worst case
(50/50 split), throughput halves for that warp.

There are four categories of branch condition in a fragment shader, ranked by cost:

| Category                            | Condition source                                                  | GPU behavior                                                                                   | Cost                  |
| ----------------------------------- | ----------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- | --------------------- |
| 1. **Compile-time constant**        | `#ifdef`, `const bool`                                            | Dead code eliminated by compiler                                                               | Zero                  |
| 2. **Uniform / push constant**      | Same value for entire draw call                                   | Warp-coherent; GPU skips dead path                                                             | Effectively zero      |
| 3. **Per-primitive `flat` varying** | Same value across all fragments of a primitive                    | Warp-coherent for all warps fully inside one primitive; divergent only at primitive boundaries | Near zero (see below) |
| 4. **Per-fragment varying**         | Different value per pixel (e.g., texture lookup, screen position) | Potentially divergent within every warp                                                        | Can be expensive      |

#### Which category our branches fall into

Our design has two branch points:

1. **`mode` (push constant): tessellated vs. SDF.** This is category 2 — uniform per draw call.
   Every thread in every warp of a draw call sees the same `mode` value. **Zero divergence, zero
   cost.**

2. **`shape_kind` (flat varying from storage buffer): which SDF to evaluate.** This is category 3.
   The `flat` interpolation qualifier ensures that all fragments rasterized from one primitive's quad
   receive the same `shape_kind` value. Divergence can only occur at the **boundary between two
   adjacent primitives of different kinds**, where the rasterizer might pack fragments from both
   primitives into the same warp.

For category 3, the divergence analysis depends on primitive size:

- **Large primitives** (buttons, panels, containers — 50+ pixels on a side): a 200×100 rect
  produces ~20,000 fragments = ~625 warps. At most ~4 boundary warps might straddle a neighbor of a
  different kind. Divergence rate: **0.6%** of warps.

- **Small primitives** (icons, dots — 16×16): 256 fragments = ~8 warps. At most 2 boundary warps
  diverge. Divergence rate: **25%** of warps for that primitive, but the primitive itself covers a
  tiny fraction of the frame's total fragments.

- **Worst realistic case**: a dense grid of alternating shape kinds (e.g., circle-rect-circle-rect
  icons). Even here, the interior warps of each primitive are coherent. Only the edges diverge. Total
  frame-level divergence is typically **1–3%** of all warps.

At 1–3% divergence, the throughput impact is negligible. At 4K with 12.4M total fragments
(~387,000 warps), divergent boundary warps number in the low thousands. Each divergent warp pays at
most ~25 extra instructions (the cost of the longest untaken SDF branch). At ~12G instructions/sec
on a mid-range GPU, that totals ~4μs — under 0.05% of an 8.3ms (120 FPS) frame budget. This is
confirmed by production renderers that use exactly this pattern:

- **vger / vger-rs** (Audulus): single pipeline, 11 primitive kinds dispatched by a `switch` on a
  flat varying `prim_type`. Ships at 120 FPS on iPads. The author (Taylor Holliday) replaced nanovg
  specifically because CPU-side tessellation was the bottleneck, not fragment branching:
  https://github.com/audulus/vger-rs

- **Randy Gaul's 2D renderer**: single pipeline with `shape_type` encoded as a vertex attribute.
  Reports that warp divergence "really hasn't been an issue for any game I've seen so far" because
  "games tend to draw a lot of the same shape type":
  https://randygaul.github.io/graphics/2025/03/04/2D-Rendering-SDF-and-Atlases.html
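
The frame-level estimate above can be checked the same way, using the assumed inputs from the text
(12.4M fragments, 32-thread warps, roughly 2,000 divergent boundary warps paying ~25 extra warp-level
instructions each):

```python
fragments = 12.4e6               # 4K at 1.5x overdraw
warps = fragments / 32
divergent_warps = 2_000          # "low thousands" of boundary warps
extra_inst = 25                  # longest untaken SDF branch
cost = divergent_warps * extra_inst / 12e9

print(f"total warps: {warps:,.0f}")                   # 387,500
print(f"divergence cost: {cost * 1e6:.1f}us")         # 4.2us
print(f"share of 8.3ms budget: {cost / 8.3e-3:.2%}")  # 0.05%
```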

#### What kind of branching IS expensive

For completeness, here are the cases where shader branching genuinely hurts — none of which apply to
our design:

1. **Per-fragment data-dependent branches with high divergence.** Example:
   `if (texture(noise, uv).r > 0.5)` where the noise texture produces a random pattern. Every warp
   has ~50% divergence. Every warp pays for both paths. This is the scenario the "branches are bad"
   folklore warns about. We have no per-fragment data-dependent branches in the main pipeline.

2. **Branches where both paths are very long.** If both sides of a branch are 500+ instructions,
   divergent warps pay the full cost of both long paths. Our SDF functions are 10–25 instructions
   each. Even fully divergent, the penalty is ~25 extra instructions — less than a single texture
   sample's latency.

3. **Branches that prevent compiler optimizations.** Some compilers cannot schedule instructions
   across branch boundaries, reducing VLIW utilization on older architectures. Modern GPUs (NVIDIA
   Volta+, AMD RDNA+, Apple M-series) use scalar+vector execution models where this is not a
   concern.

4. **Register pressure from the union of all branches.** This is the real cost, and it is why we
   split heavy effects (shadows, glass) into separate pipelines. Within the main pipeline, all SDF
   branches have similar register footprints (12–22 registers), so combining them causes negligible
   occupancy loss.

**References:**

- ARM solidpixel blog on branches in mobile shaders — a comprehensive taxonomy of branch execution
  models across GPU generations; confirms uniform and warp-coherent branches are free on modern
  hardware:
  https://solidpixel.github.io/2021/12/09/branches_in_shaders.html
- Peter Stefek's "A Note on Branching Within a Shader" — practical measurements showing that
  warp-coherent branches have zero overhead on Pascal/Volta/Ampere, with a clear explanation of the
  SIMT divergence mechanism:
  https://www.peterstefek.me/shader-branch.html
- NVIDIA Volta architecture whitepaper — documents independent thread scheduling, which allows
  divergent threads to reconverge more efficiently than on older architectures:
  https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
- Randy Gaul on warp divergence in practice with per-primitive `shape_type` branching:
  https://randygaul.github.io/graphics/2025/03/04/2D-Rendering-SDF-and-Atlases.html

### Main pipeline: SDF + tessellated (unified)

The main pipeline serves two submission modes through a single `TRIANGLELIST` pipeline and a single
vertex input layout, distinguished by a push constant:

- **Tessellated mode** (`mode = 0`): direct vertex buffer with explicit geometry. Unchanged from
  today. Used for text (SDL_ttf atlas sampling), polylines, triangle fans/strips, gradient-filled
  shapes, and any user-provided raw vertex geometry.
- **SDF mode** (`mode = 1`): shared unit-quad vertex buffer + GPU storage buffer of `Primitive`
  structs, drawn instanced. Used for all shapes with closed-form signed distance functions.

Both modes converge on the same fragment shader, which dispatches on a `shape_kind` discriminant
carried either in the vertex data (tessellated, always `Solid = 0`) or in the storage-buffer
primitive struct (SDF mode).
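
The dispatch structure can be sketched outside the shader. Python stands in for the GLSL switch
here; the `Shape_Kind` members shown, the parameter layout, and the simplified uniform-radius
`sdRoundedBox` are illustrative, not the actual shader source:

```python
import math
from enum import IntEnum

class Shape_Kind(IntEnum):
    Solid  = 0  # tessellated path: no SDF, coverage comes from texture * color
    RRect  = 1
    Circle = 2

def evaluate_sdf(kind, px, py, params):
    """Fragment-shader dispatch on the flat shape_kind varying (illustrative)."""
    if kind == Shape_Kind.Circle:
        cx, cy, r = params
        return math.hypot(px - cx, py - cy) - r
    if kind == Shape_Kind.RRect:
        # Simplified sdRoundedBox: centered box, half-extents (bx, by), one radius r.
        bx, by, r = params
        qx, qy = abs(px) - bx + r, abs(py) - by + r
        outside = math.hypot(max(qx, 0.0), max(qy, 0.0))
        inside = min(max(qx, qy), 0.0)
        return outside + inside - r
    return -1.0  # Solid: treated as fully inside

# A point at the center of a radius-10 circle is 10 units inside.
print(evaluate_sdf(Shape_Kind.Circle, 0, 0, (0, 0, 10)))  # -10.0
```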

#### Why SDF for shapes

CPU-side adaptive tessellation for curved shapes (the current approach) has three problems:

1. **Vertex bandwidth.** A rounded rectangle with four corner arcs produces ~250 vertices × 20 bytes
   = 5 KB. An SDF rounded rectangle is one `Primitive` struct (~56 bytes) plus 4 shared unit-quad
   vertices. That is roughly a 90× reduction per shape.

2. **Quality.** Tessellated curves are piecewise-linear approximations. At high DPI or under
   animation/zoom, faceting is visible at any practical segment count. SDF evaluation produces
   mathematically exact boundaries with perfect anti-aliasing via `smoothstep` in the fragment
   shader.

3. **Feature cost.** Adding soft edges, outlines, stroke effects, or rounded-cap line segments
   requires extensive per-shape tessellation code. With SDF, these are trivial fragment shader
   operations: `abs(d) - thickness` for stroke, `smoothstep(-soft, soft, d)` for soft edges.
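
The bandwidth claim in point 1 is easy to check (sizes as stated: 20-byte vertices, ~56-byte
primitive struct):

```python
tess_bytes = 250 * 20  # ~250 tessellated vertices x 20-byte Vertex
sdf_bytes = 56         # one Primitive struct; the unit quad is shared across all shapes
print(tess_bytes, sdf_bytes, f"~{tess_bytes / sdf_bytes:.0f}x reduction")
```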

**References:**

- Inigo Quilez's 2D SDF primitive catalog (primary source for all SDF functions used):
  https://iquilezles.org/articles/distfunctions2d/
- Valve's 2007 SIGGRAPH paper on SDF for vector textures and glyphs (foundational reference):
  https://steamcdn-a.akamaihd.net/apps/valve/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf
- Randy Gaul's practical writeup on SDF 2D rendering with shape-type branching, attribute layout,
  warp divergence tradeoffs, and polyline rendering:
  https://randygaul.github.io/graphics/2025/03/04/2D-Rendering-SDF-and-Atlases.html
- Audulus vger-rs — production 2D renderer using a single unified pipeline with SDF type
  discriminant, same architecture as this plan. Replaced nanovg, achieving 120 FPS where nanovg fell
  to 30 FPS due to CPU-side tessellation:
  https://github.com/audulus/vger-rs

#### Storage-buffer instancing for SDF primitives

SDF primitives are submitted via a GPU storage buffer indexed by `gl_InstanceIndex` in the vertex
shader, rather than encoding per-primitive data redundantly in vertex attributes. This follows the
pattern used by both Zed GPUI and vger-rs.

Each SDF shape is described by a single `Primitive` struct (~56 bytes) in the storage buffer. The
vertex shader reads `primitives[gl_InstanceIndex]`, computes the quad corner position from the unit
vertex and the primitive's bounds, and passes shape parameters to the fragment shader via `flat`
interpolated varyings.

Compared to encoding per-primitive data in vertex attributes (the "fat vertex" approach),
storage-buffer instancing eliminates the 4–6× data duplication across quad corners. A rounded
rectangle costs 56 bytes instead of 4 vertices × 40+ bytes = 160+ bytes.

The tessellated path retains the existing direct vertex buffer layout (20 bytes/vertex, no storage
buffer access). The vertex shader branch on `mode` (push constant) is warp-uniform — every invocation
in a draw call has the same mode — so it is effectively free on all modern GPUs.

#### Shape kinds

Primitives in the main pipeline's storage buffer carry a `Shape_Kind` discriminant:

| Kind       | SDF function                           | Notes                                                     |
| ---------- | -------------------------------------- | --------------------------------------------------------- |
| `RRect`    | `sdRoundedBox` (iq)                    | Per-corner radii. Covers all Clay rectangles and borders. |
| `Circle`   | `sdCircle`                             | Filled and stroked.                                       |
| `Ellipse`  | `sdEllipse`                            | Exact (iq's closed-form).                                 |
| `Segment`  | `sdSegment` capsule                    | Rounded caps, correct sub-pixel thin lines.               |
| `Ring_Arc` | `abs(sdCircle) - thickness` + arc mask | Rings, arcs, circle sectors unified.                      |
| `NGon`     | `sdRegularPolygon`                     | Regular n-gon for n ≥ 5.                                  |

The `Solid` kind (value 0) is reserved for the tessellated path, where `shape_kind` is implicitly
zero because the fragment shader receives it from zero-initialized vertex attributes.

Stroke/outline variants of each shape are handled by the `Shape_Flags` bit set rather than separate
shape kinds. The fragment shader transforms `d = abs(d) - stroke_width` when the `Stroke` flag is
set.
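
The stroke transform and the smoothstep anti-aliasing are small enough to sketch outside the shader.
Python stands in for GLSL here; the `aa` half-width is an assumed value, not a constant from the
codebase:

```python
import math

def sd_circle(px, py, cx, cy, r):
    """Signed distance to a circle: negative inside, positive outside."""
    return math.hypot(px - cx, py - cy) - r

def apply_stroke(d, stroke_width):
    """Turn a fill SDF into a stroke SDF centered on the boundary: abs(d) - w/2."""
    return abs(d) - stroke_width * 0.5

def smoothstep(e0, e1, x):
    t = max(0.0, min(1.0, (x - e0) / (e1 - e0)))
    return t * t * (3.0 - 2.0 * t)

def coverage(d, aa=0.75):
    """Analytical anti-aliased coverage from a signed distance (aa is assumed)."""
    return 1.0 - smoothstep(-aa, aa, d)

print(coverage(sd_circle(0, 0, 0, 0, 10)))                      # 1.0, deep inside the fill
print(coverage(apply_stroke(sd_circle(10, 0, 0, 0, 10), 2.0)))  # 1.0, on the stroke centerline
```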

**What stays tessellated:**

- Text (SDL_ttf atlas, pending future MSDF evaluation)
- `rectangle_gradient`, `circle_gradient` (per-vertex color interpolation)
- `triangle_fan`, `triangle_strip` (arbitrary user-provided point lists)
- `line_strip` / polylines (SDF polyline rendering is possible but complex; deferred)
- Any raw vertex geometry submitted via `prepare_shape`

The rule: if the shape has a closed-form SDF, it goes SDF. If it's described only by a vertex list or
needs per-vertex color interpolation, it stays tessellated.

### Effects pipeline

The effects pipeline handles blur-based visual effects: drop shadows, inner shadows, outer glow, and
similar. It uses the same storage-buffer instancing pattern as the main pipeline's SDF path, with a
dedicated pipeline state object that has its own compiled fragment shader.

#### Combined shape + effect rendering

When a shape has an effect (e.g., a rounded rectangle with a drop shadow), the shape is drawn
**once**, entirely in the effects pipeline. The effects fragment shader evaluates both the effect
(blur math) and the base shape's SDF, compositing them in a single pass. The shape is not duplicated
across pipelines.

This avoids redundant overdraw. Consider a 200×100 rounded rect with a drop shadow offset by (5, 5)
and blur sigma 10:

- **Separate-primitive approach** (shape in main pipeline + shadow in effects pipeline): the shadow
  quad covers ~230×130 = 29,900 pixels, the shape quad covers 200×100 = 20,000 pixels. The ~18,500
  shadow fragments underneath the shape run the expensive blur shader only to be overwritten by the
  shape. Total fragment invocations: ~49,900.

- **Combined approach** (one primitive in effects pipeline): one quad covers ~230×130 = 29,900
  pixels. The fragment shader evaluates the blur, then evaluates the shape SDF, composites the shape
  on top. Total fragment invocations: ~29,900. The 20,000 shape-region fragments run the blur+shape
  shader, but the shape SDF evaluation adds only ~15 ops to an ~80 op blur shader.

The combined approach uses **~40% fewer fragment invocations** per effected shape (29,900 vs 49,900)
in the common opaque case. The shape-region fragments pay a small additional cost for shape SDF
evaluation in the effects shader (~15 ops), but this is far cheaper than running 18,500 fragments
through the full blur shader (~80 ops each) and then discarding their output. For a UI with 10
shadowed elements, the combined approach saves roughly 200,000 fragment invocations per frame.
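
The overdraw arithmetic above checks out:

```python
shadow_quad = 230 * 130  # shadow bounds grown by blur + offset
shape_quad = 200 * 100

separate = shadow_quad + shape_quad  # shape drawn over the shadow quad
combined = shadow_quad               # one quad, blur + shape SDF in one pass
print(separate, combined)                                # 49900 29900
print(f"saved per shape: {separate - combined}")         # 20000
print(f"10 shadowed elements: {10 * (separate - combined):,}")  # 200,000
```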

An `Effect_Flag.Draw_Base_Shape` flag controls whether the sharp shape layer composites on top
(default true for drop shadow, always true for inner shadow). Standalone effects (e.g., a glow with
no shape on top) clear this flag.

Shapes without effects are submitted to the main pipeline as normal. Only shapes that have effects
are routed to the effects pipeline.

#### Drop shadow implementation

Drop shadows use the analytical blurred-rounded-rectangle technique. Raph Levien's 2020 blog post
describes an erf-based approximation that computes a Gaussian-blurred rounded rectangle in closed
form along one axis and with a 4-sample numerical integration along the other. Total fragment cost is
~80 FLOPs, one sqrt, no texture samples. This is the same technique used by Zed GPUI (via Evan
Wallace's variant) and vger-rs.
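
For a sharp-cornered box the blur is fully separable, so a 1-D slice of the closed form makes a
useful sanity check. This sketch uses the stdlib `erf`; Levien's technique adds the 4-sample
numerical integration for rounded corners, which is omitted here:

```python
import math

def blurred_box_1d(x, half_width, sigma):
    """Gaussian-blurred 1-D box: the convolution of a box of width 2*half_width
    with a Gaussian of std dev sigma, in closed form via erf."""
    s = sigma * math.sqrt(2.0)
    return 0.5 * (math.erf((x + half_width) / s) - math.erf((x - half_width) / s))

# Deep inside a wide box the shadow is fully opaque; far outside it vanishes.
print(round(blurred_box_1d(0.0, 100.0, 10.0), 4))    # 1.0
print(round(blurred_box_1d(200.0, 100.0, 10.0), 4))  # 0.0
```

The 2-D sharp-box shadow is just the product of this function along x and y; only the rounded
corners break separability and force the numerical integration along one axis.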

**References:**

- Raph Levien's blurred rounded rectangles post (erf approximation, squircle contour refinement):
  https://raphlinus.github.io/graphics/2020/04/21/blurred-rounded-rects.html
- Evan Wallace's original WebGL implementation (used by Figma):
  https://madebyevan.com/shaders/fast-rounded-rectangle-shadows/
- Vello's implementation of blurred rounded rectangle as a gradient type:
  https://github.com/linebender/vello/pull/665

### Backdrop-effects pipeline

The backdrop-effects pipeline handles effects that sample the current render target as input: frosted
glass, refraction, mirror surfaces. It is structurally separated from the effects pipeline for two
reasons:

1. **Render-state requirement.** Before any backdrop-sampling fragment can run, the current render
   target must be copied to a separate texture via `CopyGPUTextureToTexture`. This is a
   command-buffer-level operation that cannot happen mid-render-pass. The copy naturally creates a
   pipeline boundary.

2. **Register pressure.** Backdrop-sampling shaders read from a texture with Gaussian kernel weights
   (multiple texture fetches per fragment), pushing register usage to ~70–80. Including this in the
   effects pipeline would reduce occupancy for all shadow/glow fragments from ~30% to ~20%, costing
   measurable throughput on the common case.

The backdrop-effects pipeline binds a secondary sampler pointing at the captured backdrop texture. When
no backdrop effects are present in a frame, this pipeline is never bound and the texture copy never
happens — zero cost.

### Vertex layout

The vertex struct is unchanged from the current 20-byte layout:

```
Vertex :: struct {
    position: [2]f32, // 0: screen-space position
    uv:       [2]f32, // 8: atlas UV (text) or unused (shapes)
    color:    Color,  // 16: u8x4, GPU-normalized to float
}
```

This layout is shared between the tessellated path and the SDF unit-quad vertices. For tessellated
draws, `position` carries actual world-space geometry. For SDF draws, `position` carries unit-quad
corners (0,0 to 1,1) and the vertex shader computes world-space position from the storage-buffer
primitive's bounds.
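
The SDF-mode vertex transform described above is a two-component lerp from unit-quad corner to
primitive bounds; sketched outside the shader, with Python standing in for the GLSL `mix`:

```python
def quad_world_position(unit_x, unit_y, bounds):
    """Map a unit-quad corner (0..1) into the primitive's bounds
    (min_x, min_y, max_x, max_y), as the SDF vertex shader does."""
    min_x, min_y, max_x, max_y = bounds
    return (min_x + (max_x - min_x) * unit_x,
            min_y + (max_y - min_y) * unit_y)

print(quad_world_position(0.0, 0.0, (10, 20, 110, 70)))  # (10.0, 20.0)
print(quad_world_position(1.0, 1.0, (10, 20, 110, 70)))  # (110.0, 70.0)
```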

The `Primitive` struct for SDF shapes lives in the storage buffer, not in vertex attributes:

```
Primitive :: struct {
    kind:   Shape_Kind,   // 0: enum u8
    flags:  Shape_Flags,  // 1: bit_set[Shape_Flag; u8]
    _pad:   u16,          // 2: reserved
    bounds: [4]f32,       // 4: min_x, min_y, max_x, max_y
    color:  Color,        // 20: u8x4
    _pad2:  [3]u8,        // 24: alignment
    params: Shape_Params, // 28: raw union, 32 bytes
}
// Total: 60 bytes (padded to 64 for GPU alignment)
```

`Shape_Params` is a `#raw_union` with named variants per shape kind (`rrect`, `circle`, `segment`,
etc.), ensuring type safety on the CPU side and zero-cost reinterpretation on the GPU side.
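
The offsets in the struct comments follow from ordinary alignment rules; a quick check, assuming
4-byte alignment for the f32-bearing `bounds` and `params` fields:

```python
def align_up(offset, alignment):
    return (offset + alignment - 1) // alignment * alignment

offset = 0
offset += 1                   # kind:  enum u8          -> offset 0
offset += 1                   # flags: bit_set u8       -> offset 1
offset += 2                   # _pad:  u16              -> offset 2
bounds = offset               # bounds: [4]f32          -> offset 4
offset += 16
color = offset                # color: u8x4             -> offset 20
offset += 4 + 3               # color + _pad2: [3]u8
params = align_up(offset, 4)  # params holds f32s       -> offset 28
size = params + 32            # 32-byte Shape_Params
print(bounds, color, params, size, align_up(size, 16))  # 4 20 28 60 64
```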

### Draw submission order

Within each scissor region, draws are issued in submission order to preserve the painter's algorithm:

1. Bind **effects pipeline** → draw all queued effects primitives for this scissor (instanced, one
   draw call). Each effects primitive includes its base shape and composites internally.
2. Bind **main pipeline, tessellated mode** → draw all queued tessellated vertices (non-indexed for
   shapes, indexed for text). Pipeline state unchanged from today.
3. Bind **main pipeline, SDF mode** → draw all queued SDF primitives (instanced, one draw call).
4. If backdrop effects are present: copy render target, bind **backdrop-effects pipeline** → draw
   backdrop primitives.

The exact ordering within a scissor may be refined based on actual Z-ordering requirements. The key
invariant is that each primitive is drawn exactly once, in the pipeline that owns it.
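
The per-scissor encoding loop, sketched in Python; every name here is illustrative, with strings
standing in for the actual SDL3 GPU calls:

```python
from dataclasses import dataclass, field

@dataclass
class Scissor:
    # Per-scissor queues, filled in submission order (names are hypothetical).
    effects: list = field(default_factory=list)
    tessellated: list = field(default_factory=list)
    sdf: list = field(default_factory=list)
    backdrop: list = field(default_factory=list)

def encode_scissor(scissor):
    """Emit the command sequence for one scissor, skipping empty queues."""
    cmds = []
    if scissor.effects:
        cmds += ["bind effects", "draw effects (instanced)"]
    if scissor.tessellated:
        cmds += ["bind main (tessellated)", "draw tessellated"]
    if scissor.sdf:
        cmds += ["bind main (SDF)", "draw SDF (instanced)"]
    if scissor.backdrop:
        cmds += ["copy render target", "bind backdrop-effects", "draw backdrop"]
    return cmds

print(encode_scissor(Scissor(effects=["shadowed card"], sdf=["rrect", "circle"])))
```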

### Text rendering

Text rendering currently uses SDL_ttf's GPU text engine, which rasterizes glyphs per `(font, size)`
pair into bitmap atlases and emits indexed triangle data via `GetGPUTextDrawData`. This path is
**unchanged** by the SDF migration — text continues to flow through the main pipeline's tessellated
mode with `shape_kind = Solid`, sampling the SDL_ttf atlas texture.

A future phase may evaluate MSDF (multi-channel signed distance field) text rendering, which would
allow resolution-independent glyph rendering from a single small atlas per font. This would involve:

- Offline atlas generation via Chlumský's msdf-atlas-gen tool.
- Runtime glyph metrics via `vendor:stb/truetype` (already in the Odin distribution).
- A new `Shape_Kind.MSDF_Glyph` variant in the main pipeline's fragment shader.
- Potential removal of the SDL_ttf dependency.

This is explicitly deferred. The SDF shape migration is independent of and does not block text
changes.

**References:**

- Viktor Chlumský's MSDF master's thesis and msdfgen tool:
  https://github.com/Chlumsky/msdfgen
- MSDF atlas generator for font atlas packing:
  https://github.com/Chlumsky/msdf-atlas-gen
- Valve's original SDF text rendering paper (SIGGRAPH 2007):
  https://steamcdn-a.akamaihd.net/apps/valve/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf

## 3D rendering

3D pipeline architecture is under consideration and will be documented separately. The current
expectation is that 3D rendering will use dedicated pipelines (separate from the 2D pipelines)
sharing GPU resources (textures, samplers, command buffer lifecycle) with the 2D renderer.

## Building shaders

GLSL shader sources live in `shaders/source/`. Compiled outputs (SPIR-V and Metal Shading Language)
are generated into `shaders/generated/` via the meta tool:

```
odin run meta -- gen-shaders
```

Requires `glslangValidator` and `spirv-cross` on PATH.