Mode 0 (Tessellated): Vertex buffer contains real geometry. Used for text (indexed draws into SDL_ttf atlas textures), axis-aligned sharp-corner rectangles (already optimal as 2 triangles), per-vertex color gradients (rectangle_gradient, circle_gradient), angular-clipped circle sectors (circle_sector), and arbitrary user geometry (triangle, triangle_fan, triangle_strip). The fragment shader computes out = color * texture(tex, uv).
Mode 1 (SDF): A static 6-vertex unit-quad buffer is drawn instanced, with per-primitive Primitive structs uploaded each frame to a GPU storage buffer. The vertex shader reads primitives[gl_InstanceIndex], computes world-space position from unit quad corners + primitive bounds. The fragment shader dispatches on Shape_Kind to evaluate the correct signed distance function analytically.

Seven SDF shape kinds are implemented:

RRect — rounded rectangle with per-corner radii (iq's sdRoundedBox)
Circle — filled or stroked circle
Ellipse — exact signed-distance ellipse (iq's iterative sdEllipse)
Segment — capsule-style line segment with rounded caps
Ring_Arc — annular ring with angular clipping for arcs
NGon — regular polygon with arbitrary side count and rotation
Polyline — decomposed into independent Segment primitives per adjacent point pair

All SDF shapes support fill and stroke modes via Shape_Flags, and produce mathematically exact curves with analytical anti-aliasing via smoothstep — no tessellation, no piecewise-linear approximation. A rounded rectangle is 1 primitive (64 bytes) instead of ~250 vertices (~5000 bytes).

MSAA is opt-in (default ._1, no MSAA) via Init_Options.msaa_samples. SDF rendering does not benefit from MSAA because fragment coverage is computed analytically. MSAA remains useful for text glyph edges and tessellated user geometry if desired.

2D rendering pipeline plan

This section documents the planned architecture for levlib's 2D rendering system. The design is driven by three goals: draw quality (mathematically exact curves with perfect anti-aliasing), efficiency (minimal vertex bandwidth, high GPU occupancy, low draw-call count), and extensibility (new primitives and effects can be added to the library without architectural changes).

Overview: three pipelines

The 2D renderer uses three GPU pipelines, split by register pressure (main vs effects) and render-pass structure (everything vs backdrop):

Main pipeline: shapes (SDF and tessellated), text, and textured rectangles. Low register footprint (~18–24 registers per thread). Runs at full GPU occupancy on every architecture. Handles 90%+ of all fragments in a typical frame.
Effects pipeline: drop shadows, inner shadows, outer glow, and similar ALU-bound blur effects. Medium register footprint (~48–60 registers). Each effects primitive includes the base shape's SDF so that it can draw both the effect and the shape in a single fragment pass, avoiding redundant overdraw. Separated from the main pipeline to protect main-pipeline occupancy on low-end hardware (see register analysis below).
Backdrop pipeline: frosted glass, refraction, and any effect that samples the current render target as input. Implemented as a multi-pass sequence (downsample, separable blur, composite), where each individual pass has a low-to-medium register footprint (~15–40 registers). Separated from the other pipelines because it structurally requires ending the current render pass and copying the render target before any backdrop-sampling fragment can execute, a command-buffer-level boundary that cannot be avoided regardless of shader complexity.

A typical UI frame with no effects uses 1 pipeline bind and 0 switches. A frame with drop shadows uses 2 pipelines and 1 switch. A frame with shadows and frosted glass uses all 3 pipelines and 2 switches plus 1 texture copy. At ~1–5μs per pipeline bind on modern APIs, worst-case switching overhead is negligible relative to an 8.3ms (120 FPS) frame budget.

Why three pipelines, not one or seven

The natural question is whether we should use a single unified pipeline (fewer state changes, simpler code) or many per-primitive-type pipelines (no branching overhead, lean per-shader register usage).

Main/effects split: register pressure

A GPU shader core has a fixed register pool shared among all concurrent threads. The compiler allocates registers pessimistically based on the worst-case path through the shader. If the shader contains both a 20-register RRect SDF and a 48-register drop-shadow blur, every fragment — even trivial RRects — is allocated 48 registers. This directly reduces occupancy (the number of warps/wavefronts that can run simultaneously), which reduces the GPU's ability to hide memory latency.

Each GPU architecture has a register cliff — a threshold above which occupancy starts dropping. Below the cliff, adding registers has zero occupancy cost.

On consumer Ampere/Ada GPUs (RTX 30xx/40xx, 65,536 regs/SM, max 1,536 threads/SM, cliff at ~43 regs):

Register allocation	Reg-limited threads	Actual (hw-capped)	Occupancy
20 regs (main pipeline)	3,276	1,536	100%
32 regs	2,048	1,536	100%
48 regs (effects)	1,365	1,365	~89%

On Volta/A100 GPUs (65,536 regs/SM, max 2,048 threads/SM, cliff at ~32 regs):

Register allocation	Reg-limited threads	Actual (hw-capped)	Occupancy
20 regs (main pipeline)	3,276	2,048	100%
32 regs	2,048	2,048	100%
48 regs (effects)	1,365	1,365	~67%

On low-end mobile (ARM Mali Bifrost/Valhall, 64 regs/thread, cliff fixed at 32 regs):

Register allocation	Occupancy
0–32 regs (main)	100% (full thread count)
33–64 regs (effects)	~50% (thread count halves)

Mali's cliff at 32 registers is the binding constraint. On consumer desktop GPUs the occupancy difference between 20 and 48 registers is modest (100% vs ~89%), and on Volta/A100 it is a drop to ~67%; on Mali it is a hard 2× throughput reduction. The main/effects split protects 90%+ of a frame's fragments (shapes, text, textures) from the effects pipeline's register cost.
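The occupancy figures in the tables above follow from integer arithmetic on the register file. The following is a minimal Python sketch, assuming registers are allocated per thread with no allocation granularity (real hardware rounds allocations up, so actual cliffs can sit slightly lower). The function names are illustrative and not part of levlib.

```python
def sm_occupancy(regs_per_thread, regs_per_sm=65_536, max_threads_per_sm=1_536):
    """Return (register-limited threads, hardware-capped threads, occupancy)."""
    reg_limited = regs_per_sm // regs_per_thread
    actual = min(reg_limited, max_threads_per_sm)
    return reg_limited, actual, actual / max_threads_per_sm


def mali_occupancy(regs_per_thread):
    """Mali model from the table: thread count halves above 32 registers."""
    return 1.0 if regs_per_thread <= 32 else 0.5


if __name__ == "__main__":
    for regs in (20, 32, 48):
        consumer = sm_occupancy(regs)                               # Ampere/Ada consumer
        datacenter = sm_occupancy(regs, max_threads_per_sm=2_048)   # Volta/A100
        print(regs, consumer, datacenter, mali_occupancy(regs))
```

Changing `max_threads_per_sm` is the only difference between the two desktop tables, which is why the same 48-register shader lands at different occupancy on each.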

For the effects pipeline's drop-shadow shader (ALU-bound erf-approximation blur math with no texture samples), 50% occupancy on Mali roughly halves throughput. At 4K with 1.5× overdraw (~12.4M fragments), a single unified shader containing the shadow branch would cost ~4ms instead of ~2ms on low-end mobile. This is a per-frame multiplier even when the heavy branch is never taken, because the compiler allocates registers for the worst-case path.

All main-pipeline members (SDF shapes, tessellated geometry, text, textured rectangles) cluster at 12–24 registers — below the cliff on every architecture — so unifying them costs nothing in occupancy.

Note on Apple M3+ GPUs: Apple's M3 introduces Dynamic Caching (register file virtualization), which allocates registers at runtime based on actual usage rather than worst-case. This weakens the static register-pressure argument on M3 and later, but the split remains useful for isolating blur ALU complexity and keeping the backdrop texture-copy out of the main render pass.

Backdrop split: render-pass structure

The backdrop pipeline (frosted glass, refraction, mirror surfaces) is separated for a structural reason unrelated to register pressure. Before any backdrop-sampling fragment can execute, the current render target must be copied to a separate texture via CopyGPUTextureToTexture, a command-buffer-level operation that requires ending the current render pass. This boundary exists regardless of shader complexity and cannot be optimized away.

The backdrop pipeline's individual shader passes (downsample, separable blur, composite) are register-light (~15–40 regs each), so merging them into the effects pipeline would cause no occupancy problem. But the render-pass boundary makes merging structurally impossible — effects draws happen inside the main render pass, backdrop draws happen inside their own bracketed pass sequence.

Why not per-primitive-type pipelines (GPUI's approach)

Zed's GPUI uses 7 separate shader pairs: quad, shadow, underline, monochrome sprite, polychrome sprite, path, surface. This eliminates all branching and gives each shader minimal register usage. Three concrete costs make this approach wrong for our use case:

Draw call count scales with kind variety, not just scissor count. With a unified pipeline, one instanced draw call per scissor covers all primitive kinds from a single storage buffer. With per-kind pipelines, each scissor requires one draw call and one pipeline bind per kind used. For a typical UI frame with 15 scissors and 3–4 primitive kinds per scissor, per-kind splitting produces ~45–60 draw calls and as many pipeline binds; our unified approach produces ~15–20 draw calls and 1–5 pipeline binds. That is roughly 75–100 extra draw calls and binds. At ~5μs each for CPU-side command encoding on modern APIs, per-kind splitting adds 375–500μs of CPU overhead per frame, or 4.5–6% of an 8.3ms (120 FPS) budget, with no compensating GPU-side benefit, because the register-pressure savings within the simple-SDF range are negligible (all members cluster at 12–22 registers).

Z-order preservation forces the API to expose layers. With a single pipeline drawing all kinds from one storage buffer, submission order equals draw order — Clay's painterly render commands flow through without reordering. With separate pipelines per kind, primitives can only batch with same-kind neighbors, which means interleaved kinds (e.g., [rrect, circle, text, rrect, text]) must either issue one draw call per primitive (defeating batching entirely) or force the user to pre-sort by kind and reason about explicit layers. GPUI chose the latter, baking layer semantics into their API where each layer draws shadows before quads before glyphs. Our design avoids this constraint: submission order is draw order, no layer juggling required.

PSO compilation costs multiply. Each pipeline takes 1–50ms to compile on Metal/Vulkan/D3D12 at first use. Taking ~25ms as a mid-range figure, 7 pipelines is ~175ms cold startup; 3 pipelines is ~75ms. Adding state axes (MSAA variants, blend modes, color formats) multiplies combinatorially, and at every step the variant matrix for 7 pipelines stays 7/3 ≈ 2.3× larger than for 3.

Branching cost comparison: unified vs per-kind in the effects pipeline. The effects pipeline is the strongest candidate for per-kind splitting because effect branches are heavier than shape branches (~80 instructions for drop shadow vs ~20 for an SDF). Even here, per-kind splitting loses. Consider a worst-case scissor with 15 drop-shadowed cards and 2 inner-shadowed elements interleaved in submission order:

Unified effects pipeline (our plan): 1 pipeline bind, 1 instanced draw call. Category-3 divergence occurs at drop-shadow/inner-shadow boundaries where ~4 warps straddle per boundary × 2 boundaries = ~8 divergent warps out of ~19,924 total (0.04%). Each divergent warp pays ~80 extra instructions. Total divergence cost: 8 × 32 × 80 / 12G inst/sec ≈ 1.7μs.
Per-kind effects pipelines (GPUI-style): 2 pipeline binds + 2 draw calls. But submission order is [drop, drop, inner, drop, drop, inner, drop, ...] — the two inner-shadow primitives split the drop-shadow run into three segments. To preserve Z-order, this requires 5 draw calls and 4 pipeline switches, not 2. Cost: 5 × 5μs + 4 × 5μs = 45μs.

The per-kind approach costs 26× more than the unified approach's divergence penalty (45μs vs 1.7μs), while eliminating only 0.04% warp divergence that was already negligible. Even in the most extreme stacked-effects scenario (10 cards each with both drop shadow and inner shadow, producing ~60 boundary warps at ~80 extra instructions each), unified divergence costs ~13μs — still 3.5× cheaper than the pipeline-switching alternative.
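The comparison above can be reproduced with a short calculation. The following Python sketch assumes the figures quoted in the text (~5μs per draw call or pipeline bind, ~12G instructions/sec, 32 threads per warp, ~80 extra instructions per divergent warp). The exact interleaving of the 15 drop shadows and 2 inner shadows is hypothetical, as are the function names.

```python
from itertools import groupby

CPU_COST_US = 5.0      # per draw call or pipeline bind (figure from the text)
INSTR_PER_SEC = 12e9   # mid-range GPU throughput assumed in the text


def per_kind_cost_us(kinds):
    """CPU cost when every run of same-kind primitives needs its own draw call."""
    runs = sum(1 for _ in groupby(kinds))
    switches = max(runs - 1, 0)
    return runs, switches, (runs + switches) * CPU_COST_US


def unified_divergence_cost_us(divergent_warps, extra_instr=80, warp_size=32):
    """GPU cost of divergent boundary warps inside one unified draw call."""
    return divergent_warps * warp_size * extra_instr / INSTR_PER_SEC * 1e6


# 15 drop shadows with 2 inner shadows interleaved (hypothetical ordering).
scissor = ["drop"] * 5 + ["inner"] + ["drop"] * 5 + ["inner"] + ["drop"] * 5

if __name__ == "__main__":
    print(per_kind_cost_us(scissor))        # runs, switches, microseconds
    print(unified_divergence_cost_us(8))    # microseconds
```

Counting runs with `groupby` is what makes Z-order preservation visible in the cost: the number of draw calls depends on how the kinds interleave, not on how many kinds exist.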

The split we do perform (main / effects / backdrop) is motivated by register-pressure boundaries and structural render-pass requirements (see analysis above). Within a pipeline, unified is strictly better by every measure: fewer draw calls, simpler Z-order, lower CPU overhead, and negligible GPU-side branching cost.

References:

Zed GPUI blog post on their per-primitive pipeline architecture: https://zed.dev/blog/videogame
Zed GPUI Metal shader source (7 shader pairs): https://github.com/zed-industries/zed/blob/cb6fc11/crates/gpui/src/platform/mac/shaders.metal
NVIDIA Nsight Graphics 2024.3 documentation on active-threads-per-warp and divergence analysis: https://developer.nvidia.com/blog/optimize-gpu-workloads-for-graphics-applications-with-nvidia-nsight-graphics/
NVIDIA Ampere GPU Architecture Tuning Guide — SM specs, max warps per SM (48 for cc 8.6, 64 for cc 8.0), register file size (64K), occupancy factors: https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html
NVIDIA Ada GPU Architecture Tuning Guide — SM specs, max warps per SM (48 for cc 8.9): https://docs.nvidia.com/cuda/ada-tuning-guide/index.html
CUDA Occupancy Calculation walkthrough (register allocation granularity, worked examples): https://leimao.github.io/blog/CUDA-Occupancy-Calculation/
Apple M3 GPU architecture — Dynamic Caching (register file virtualization) eliminates static worst-case register allocation, reducing the occupancy penalty for high-register shaders: https://asplos.dev/wiki/m3-chip-explainer/gpu/index.html

Why fragment shader branching is safe in this design

There is longstanding folklore that "branches in shaders are bad." This was true on pre-2010 hardware where shader cores had no branch instructions at all — compilers emitted code for both sides of every branch and used conditional select to pick the result. On modern GPUs (everything from ~2012 onward), this is no longer the case. Native dynamic branching is fully supported on all current hardware. However, branching can still be costly in specific circumstances. Understanding which circumstances apply to our design — and which do not — is critical to justifying the unified-pipeline approach.

How GPU branching works

GPUs execute fragment shaders in warps (NVIDIA/Intel, 32 threads) or wavefronts (AMD, 32 or 64 threads). All threads in a warp execute the same instruction simultaneously (SIMT model). When a branch condition evaluates the same way for every thread in a warp, the GPU simply jumps to the taken path and skips the other — zero cost, identical to a CPU branch. This is called a uniform branch or warp-coherent branch.

When threads within the same warp disagree on which path to take, the warp must execute both paths sequentially, masking off threads that don't belong to the active path. This is called warp divergence and it causes the warp to pay the cost of both sides of the branch. In the worst case (50/50 split), throughput halves for that warp.

There are four categories of branch condition in a fragment shader, ranked by cost:

#	Category	Condition source	GPU behavior	Cost
1	Compile-time constant	`#ifdef`, `const bool`	Dead code eliminated by compiler	Zero
2	Uniform / push constant	Same value for entire draw call	Warp-coherent; GPU skips dead path	Effectively zero
3	Per-primitive `flat` varying	Same value across all fragments of a primitive	Warp-coherent for all warps fully inside one primitive; divergent only at primitive boundaries	Near zero (see below)
4	Per-fragment varying	Different value per pixel (e.g., texture lookup, screen position)	Potentially divergent within every warp	Can be expensive

Which category our branches fall into

Our design has two branch points:

mode (push constant): tessellated vs. SDF. This is category 2 — uniform per draw call. Every thread in every warp of a draw call sees the same mode value. Zero divergence, zero cost.
shape_kind (flat varying from storage buffer): which SDF to evaluate. This is category 3. The flat interpolation qualifier ensures that all fragments rasterized from one primitive's quad receive the same shape_kind value. Divergence can only occur at the boundary between two adjacent primitives of different kinds, where the rasterizer might pack fragments from both primitives into the same warp.

For category 3, the divergence analysis depends on primitive size:

Large primitives (buttons, panels, containers — 50+ pixels on a side): a 200×100 rect produces ~20,000 fragments = ~625 warps. At most ~4 boundary warps might straddle a neighbor of a different kind. Divergence rate: 0.6% of warps.
Small primitives (icons, dots — 16×16): 256 fragments = ~8 warps. At most 2 boundary warps diverge. Divergence rate: 25% of warps for that primitive, but the primitive itself covers a tiny fraction of the frame's total fragments.
Worst realistic case: a dense grid of alternating shape kinds (e.g., circle-rect-circle-rect icons). Even here, the interior warps of each primitive are coherent. Only the edges diverge. Total frame-level divergence is typically 1–3% of all warps.
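The per-primitive divergence rates quoted above come from dividing the number of boundary warps by the total warps a primitive produces. The following Python sketch covers that arithmetic, assuming 32-thread warps and taking the boundary-warp counts from the text as inputs; the helper name is illustrative.

```python
WARP_SIZE = 32


def divergence_rate(width_px, height_px, boundary_warps):
    """Return (total warps, fraction of warps that may diverge) for one primitive."""
    warps = width_px * height_px / WARP_SIZE
    return warps, boundary_warps / warps


if __name__ == "__main__":
    print(divergence_rate(200, 100, 4))   # large panel
    print(divergence_rate(16, 16, 2))     # small icon
```

The small-icon rate looks alarming in isolation, but it is weighted by a primitive that contributes only 8 warps to the frame.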

At 1–3% divergence, the throughput impact is negligible. At 4K with 12.4M total fragments (~387,000 warps), divergent boundary warps number in the low thousands. Each divergent warp pays at most ~25 extra instructions (the cost of the longest untaken SDF branch). At ~12G instructions/sec on a mid-range GPU, that totals ~4μs — under 0.05% of an 8.3ms (120 FPS) frame budget. This is confirmed by production renderers that use exactly this pattern:

vger / vger-rs (Audulus): single pipeline, 11 primitive kinds dispatched by a switch on a flat varying prim_type. Ships at 120 FPS on iPads. The author (Taylor Holliday) replaced nanovg specifically because CPU-side tessellation was the bottleneck, not fragment branching: https://github.com/audulus/vger-rs
Randy Gaul's 2D renderer: single pipeline with shape_type encoded as a vertex attribute. Reports that warp divergence "really hasn't been an issue for any game I've seen so far" because "games tend to draw a lot of the same shape type": https://randygaul.github.io/graphics/2025/03/04/2D-Rendering-SDF-and-Atlases.html

What kind of branching IS expensive

For completeness, here are the cases where shader branching genuinely hurts — none of which apply to our design:

Per-fragment data-dependent branches with high divergence. Example: `if (texture(noise, uv).r > 0.5)` where the noise texture produces a random pattern. Every warp has ~50% divergence. Every warp pays for both paths. This is the scenario the "branches are bad" folklore warns about. We have no per-fragment data-dependent branches in the main pipeline.
Branches where both paths are very long. If both sides of a branch are 500+ instructions, divergent warps pay double a large cost. Our SDF functions are 10–25 instructions each. Even fully divergent, the penalty is ~25 extra instructions — less than a single texture sample's latency.
Branches that prevent compiler optimizations. Some compilers cannot schedule instructions across branch boundaries, reducing VLIW utilization on older architectures. Modern GPUs (NVIDIA Volta+, AMD RDNA+, Apple M-series) use scalar+vector execution models where this is not a concern.
Register pressure from the union of all branches. This is the real cost, and it is why we split heavy effects (shadows, glass) into separate pipelines. Within the main pipeline, all SDF branches have similar register footprints (12–22 registers), so combining them causes negligible occupancy loss.

References:

ARM solidpixel blog on branches in mobile shaders — comprehensive taxonomy of branch execution models across GPU generations, confirms uniform and warp-coherent branches are free on modern hardware: https://solidpixel.github.io/2021/12/09/branches_in_shaders.html
Peter Stefek's "A Note on Branching Within a Shader" — practical measurements showing that warp-coherent branches have zero overhead on Pascal/Volta/Ampere, with clear explanation of the SIMT divergence mechanism: https://www.peterstefek.me/shader-branch.html
NVIDIA Volta architecture whitepaper — documents independent thread scheduling which allows divergent threads to reconverge more efficiently than older architectures: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
Randy Gaul on warp divergence in practice with per-primitive shape_type branching: https://randygaul.github.io/graphics/2025/03/04/2D-Rendering-SDF-and-Atlases.html

Main pipeline: SDF + tessellated (unified)

The main pipeline serves two submission modes through a single TRIANGLELIST pipeline and a single vertex input layout, distinguished by a push constant:

Tessellated mode (mode = 0): direct vertex buffer with explicit geometry. Unchanged from today. Used for text (SDL_ttf atlas sampling), polylines, triangle fans/strips, gradient-filled shapes, and any user-provided raw vertex geometry.
SDF mode (mode = 1): shared unit-quad vertex buffer + GPU storage buffer of Primitive structs, drawn instanced. Used for all shapes with closed-form signed distance functions.

Both modes converge on the same fragment shader, which dispatches on a shape_kind discriminant carried either in the vertex data (tessellated, always Solid = 0) or in the storage-buffer primitive struct (SDF mode).

Why SDF for shapes

CPU-side adaptive tessellation for curved shapes (the current approach) has three problems:

Vertex bandwidth. A rounded rectangle with four corner arcs produces ~250 vertices × 20 bytes = 5 KB. An SDF rounded rectangle is one 80-byte Primitive struct plus the 6 shared unit-quad vertices. That is roughly a 60× reduction per shape.
Quality. Tessellated curves are piecewise-linear approximations. At high DPI or under animation/zoom, faceting is visible at any practical segment count. SDF evaluation produces mathematically exact boundaries with perfect anti-aliasing via smoothstep in the fragment shader.
Feature cost. Adding soft edges, outlines, stroke effects, or rounded-cap line segments requires extensive per-shape tessellation code. With SDF, these are trivial fragment shader operations: abs(d) - thickness for stroke, smoothstep(-soft, soft, d) for soft edges.
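To make the fill, stroke, and soft-edge operations concrete, here is a CPU-side Python sketch of what the fragment shader evaluates per pixel. It assumes a single uniform corner radius rather than the per-corner radii used by the library, and the function names and the `soft` default are illustrative only.

```python
import math


def sd_rounded_box(px, py, half_w, half_h, radius):
    """Signed distance to a rounded box centered at the origin (uniform radius)."""
    qx = abs(px) - half_w + radius
    qy = abs(py) - half_h + radius
    outside = math.hypot(max(qx, 0.0), max(qy, 0.0))
    inside = min(max(qx, qy), 0.0)
    return outside + inside - radius


def smoothstep(edge0, edge1, x):
    t = min(max((x - edge0) / (edge1 - edge0), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)


def coverage(d, soft=0.5):
    """Anti-aliased alpha: 1 inside, 0 outside, smooth across 2 * soft pixels."""
    return 1.0 - smoothstep(-soft, soft, d)


def stroke(d, thickness):
    """Turn a fill distance into an outline distance: abs(d) - thickness."""
    return abs(d) - thickness


if __name__ == "__main__":
    d = sd_rounded_box(100.0, 0.0, 100.0, 50.0, 10.0)  # point on the right edge
    print(d, coverage(d), coverage(stroke(d, 2.0)))
```

Stroke and soft edges are each one extra line applied to the same distance value, which is the feature-cost argument in miniature.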

References:

Inigo Quilez's 2D SDF primitive catalog (primary source for all SDF functions used): https://iquilezles.org/articles/distfunctions2d/
Valve's 2007 SIGGRAPH paper on SDF for vector textures and glyphs (foundational reference): https://steamcdn-a.akamaihd.net/apps/valve/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf
Randy Gaul's practical writeup on SDF 2D rendering with shape-type branching, attribute layout, warp divergence tradeoffs, and polyline rendering: https://randygaul.github.io/graphics/2025/03/04/2D-Rendering-SDF-and-Atlases.html
Audulus vger-rs — production 2D renderer using a single unified pipeline with SDF type discriminant, same architecture as this plan. Replaced nanovg, achieving 120 FPS where nanovg fell to 30 FPS due to CPU-side tessellation: https://github.com/audulus/vger-rs

Storage-buffer instancing for SDF primitives

SDF primitives are submitted via a GPU storage buffer indexed by gl_InstanceIndex in the vertex shader, rather than encoding per-primitive data redundantly in vertex attributes. This follows the pattern used by both Zed GPUI and vger-rs.

Each SDF shape is described by a single 80-byte Primitive struct in the storage buffer. The vertex shader reads primitives[gl_InstanceIndex], computes the quad corner position from the unit vertex and the primitive's bounds, and passes shape parameters to the fragment shader via flat interpolated varyings.
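The vertex-stage mapping described above amounts to a linear interpolation of the unit-quad corner across the primitive's bounds. The Python sketch below stands in for the shader logic; the vertex order of the 6-vertex quad and the dictionary layout are assumptions, and any padding of the quad for anti-aliasing or rotation is ignored.

```python
# Two triangles covering the unit square, as a non-indexed triangle list.
# The winding/order shown here is an assumption, not the library's layout.
UNIT_QUAD = [(0, 0), (1, 0), (0, 1), (0, 1), (1, 0), (1, 1)]


def quad_position(corner, bounds):
    """Map a unit-quad corner into bounds = (min_x, min_y, max_x, max_y)."""
    min_x, min_y, max_x, max_y = bounds
    u, v = corner
    return (min_x + u * (max_x - min_x), min_y + v * (max_y - min_y))


def vertex_stage(primitives, instance_index, vertex_index):
    """Stand-in for the vertex shader: one storage-buffer read per invocation."""
    prim = primitives[instance_index]
    return quad_position(UNIT_QUAD[vertex_index], prim["bounds"])


if __name__ == "__main__":
    prims = [{"bounds": (10.0, 20.0, 110.0, 70.0)}]
    print([vertex_stage(prims, 0, i) for i in range(6)])
```

The same six unit vertices serve every instance; only the storage-buffer entry selected by the instance index changes.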

Compared to encoding per-primitive data in vertex attributes (the "fat vertex" approach), storage-buffer instancing eliminates the 4–6× data duplication across quad corners. A rounded rectangle costs one 80-byte Primitive instead of 4 vertices × 40+ bytes = 160+ bytes.

The tessellated path retains the existing direct vertex buffer layout (20 bytes/vertex, no storage buffer access). The vertex shader branch on mode (push constant) is warp-uniform — every invocation in a draw call has the same mode — so it is effectively free on all modern GPUs.

Shape kinds

Primitives in the main pipeline's storage buffer carry a Shape_Kind discriminant:

Kind	SDF function	Notes
`RRect`	`sdRoundedBox` (iq)	Per-corner radii. Covers all Clay rectangles and borders.
`Circle`	`sdCircle`	Filled and stroked.
`Ellipse`	`sdEllipse`	Exact (iq's closed-form).
`Segment`	`sdSegment` capsule	Rounded caps, correct sub-pixel thin lines.
`Ring_Arc`	`abs(sdCircle) - thickness` + arc mask	Rings, arcs, circle sectors unified.
`NGon`	`sdRegularPolygon`	Regular n-gon for n ≥ 5.

The Solid kind (value 0) is reserved for the tessellated path, where shape_kind is implicitly zero because the fragment shader receives it from zero-initialized vertex attributes.

Stroke/outline variants of each shape are handled by the Shape_Flags bit set rather than separate shape kinds. The fragment shader transforms d = abs(d) - stroke_width when the Stroke flag is set.

What stays tessellated:

Text (SDL_ttf atlas, pending future MSDF evaluation)
rectangle_gradient, circle_gradient (per-vertex color interpolation)
triangle_fan, triangle_strip (arbitrary user-provided point lists)
line_strip / polylines (SDF polyline rendering is possible but complex; deferred)
Any raw vertex geometry submitted via prepare_shape

The rule: if the shape has a closed-form SDF, it goes SDF. If it's described only by a vertex list or needs per-vertex color interpolation, it stays tessellated.

Effects pipeline

The effects pipeline handles blur-based visual effects: drop shadows, inner shadows, outer glow, and similar. It uses the same storage-buffer instancing pattern as the main pipeline's SDF path, with a dedicated pipeline state object that has its own compiled fragment shader.

Combined shape + effect rendering

When a shape has an effect (e.g., a rounded rectangle with a drop shadow), the shape is drawn once, entirely in the effects pipeline. The effects fragment shader evaluates both the effect (blur math) and the base shape's SDF, compositing them in a single pass. The shape is not duplicated across pipelines.

This avoids redundant overdraw. Consider a 200×100 rounded rect with a drop shadow offset by (5, 5) and blur sigma 10:

Separate-primitive approach (shape in main pipeline + shadow in effects pipeline): the shadow quad covers ~230×130 = 29,900 pixels, the shape quad covers 200×100 = 20,000 pixels. The ~18,500 shadow fragments underneath the shape run the expensive blur shader only to be overwritten by the shape. Total fragment invocations: ~49,900.
Combined approach (one primitive in effects pipeline): one quad covers ~230×130 = 29,900 pixels. The fragment shader evaluates the blur, then evaluates the shape SDF, composites the shape on top. Total fragment invocations: ~29,900. The 20,000 shape-region fragments run the blur+shape shader, but the shape SDF evaluation adds only ~15 ops to an ~80 op blur shader.

The combined approach uses ~40% fewer fragment invocations per effected shape (29,900 vs 49,900) in the common opaque case. The shape-region fragments pay a small additional cost for shape SDF evaluation in the effects shader (~15 ops), but this is far cheaper than running 18,500 fragments through the full blur shader (~80 ops each) and then discarding their output. For a UI with 10 shadowed elements, the combined approach saves roughly 200,000 fragment invocations per frame.
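The invocation counts in this comparison can be checked with a few lines of Python. The sketch below uses the quad sizes from the example (200×100 shape, ~230×130 shadow quad); the function name is illustrative.

```python
def fragment_invocations(shape_w, shape_h, shadow_w, shadow_h):
    """Return (separate-primitive total, combined total, fractional saving)."""
    shape = shape_w * shape_h
    shadow = shadow_w * shadow_h
    separate = shape + shadow   # shape quad in main pipeline + shadow quad in effects
    combined = shadow           # one effects quad draws both layers
    return separate, combined, 1.0 - combined / separate


if __name__ == "__main__":
    separate, combined, saving = fragment_invocations(200, 100, 230, 130)
    print(separate, combined, round(saving, 2))
    print("saved over 10 elements:", 10 * (separate - combined))
```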

An Effect_Flag.Draw_Base_Shape flag controls whether the sharp shape layer composites on top (default true for drop shadow, always true for inner shadow). Standalone effects (e.g., a glow with no shape on top) clear this flag.

Shapes without effects are submitted to the main pipeline as normal. Only shapes that have effects are routed to the effects pipeline.

Drop shadow implementation

Drop shadows use the analytical blurred-rounded-rectangle technique. Evan Wallace's formulation computes a Gaussian-blurred rounded rectangle in closed form (via erf) along one axis and with a 4-sample numerical integration along the other; Raph Levien's 2020 blog post describes an erf-based approximation that avoids the numerical integration. Total fragment cost is ~80 FLOPs, one sqrt, no texture samples. This is the same family of techniques used by Zed GPUI (via Evan Wallace's variant) and vger-rs.
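The closed-form building block behind these erf-based techniques is the Gaussian blur of a box along one axis. The Python sketch below shows only that building block, applied to a sharp-cornered rectangle where the 2D blur separates into a product of two 1D terms. It does not reproduce the rounded-corner handling from the referenced posts, and the function names are illustrative.

```python
import math


def blurred_box_1d(x, lo, hi, sigma):
    """Gaussian-blurred indicator of the interval [lo, hi], evaluated at x."""
    s = sigma * math.sqrt(2.0)
    return 0.5 * (math.erf((hi - x) / s) - math.erf((lo - x) / s))


def blurred_rect(x, y, rect, sigma):
    """Shadow intensity of a sharp-cornered rect (min_x, min_y, max_x, max_y)."""
    min_x, min_y, max_x, max_y = rect
    return (blurred_box_1d(x, min_x, max_x, sigma)
            * blurred_box_1d(y, min_y, max_y, sigma))


if __name__ == "__main__":
    rect = (0.0, 0.0, 200.0, 100.0)
    for point in [(100.0, 50.0), (0.0, 50.0), (0.0, 0.0), (-100.0, 50.0)]:
        print(point, blurred_rect(*point, rect, sigma=10.0))
```

Intensity is close to 1 deep inside the rectangle, about 0.5 on a long edge, and about 0.25 at a corner, with no texture samples involved.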

References:

Raph Levien's blurred rounded rectangles post (erf approximation, squircle contour refinement): https://raphlinus.github.io/graphics/2020/04/21/blurred-rounded-rects.html
Evan Wallace's original WebGL implementation (used by Figma): https://madebyevan.com/shaders/fast-rounded-rectangle-shadows/
Vello's implementation of blurred rounded rectangle as a gradient type: https://github.com/linebender/vello/pull/665

Backdrop pipeline

The backdrop pipeline handles effects that sample the current render target as input: frosted glass, refraction, mirror surfaces. It is separated from the effects pipeline for a structural reason, not register pressure.

Render-pass boundary. Before any backdrop-sampling fragment can run, the current render target must be copied to a separate texture via CopyGPUTextureToTexture. This is a command-buffer-level operation that cannot happen mid-render-pass. The copy naturally creates a pipeline boundary that no amount of shader optimization can eliminate — it is a fundamental requirement of sampling a surface while also writing to it.

Multi-pass implementation. Backdrop effects are implemented as separable multi-pass sequences (downsample → horizontal blur → vertical blur → composite), following the standard approach used by iOS UIVisualEffectView, Android RenderEffect, and Flutter's BackdropFilter. Each individual pass has a low-to-medium register footprint (~15–40 registers), at or modestly above the main pipeline's range. The multi-pass approach avoids the monolithic 70+ register shader that a single-pass Gaussian blur would require, making backdrop effects viable on low-end mobile GPUs (including Mali-G31 and VideoCore VI) where per-thread register limits are tight.

Bracketed execution. All backdrop draws in a frame share a single bracketed region of the command buffer: end the current render pass, copy the render target, execute all backdrop sub-passes, then resume normal drawing. The entry/exit cost (texture copy + render-pass break) is paid once per frame regardless of how many backdrop effects are visible. When no backdrop effects are present, the bracket is never entered and the texture copy never happens — zero cost.

Why not split the backdrop sub-passes into separate pipelines? The individual passes range from ~15 to ~40 registers, which does cross Mali's 32-register cliff. However, the register-pressure argument that justifies the main/effects split does not apply here. The main/effects split protects the common path (90%+ of frame fragments) from the uncommon path's register cost. Inside the backdrop pipeline there is no common-vs-uncommon distinction — if backdrop effects are active, every sub-pass runs; if not, none run. The backdrop pipeline either executes as a complete unit or not at all. Additionally, backdrop effects cover a small fraction of the frame's total fragments (~5% at typical UI scales), so the occupancy variation within the bracket has negligible impact on frame time.

Vertex layout

The vertex struct is unchanged from the current 20-byte layout:

Vertex :: struct {
    position: [2]f32,  //  0: screen-space position
    uv:       [2]f32,  //  8: atlas UV (text) or unused (shapes)
    color:    Color,   // 16: u8x4, GPU-normalized to float
}

This layout is shared between the tessellated path and the SDF unit-quad vertices. For tessellated draws, position carries actual world-space geometry. For SDF draws, position carries unit-quad corners (0,0 to 1,1) and the vertex shader computes world-space position from the storage-buffer primitive's bounds.

The Primitive struct for SDF shapes lives in the storage buffer, not in vertex attributes:

Primitive :: struct {
    bounds:     [4]f32,         //  0: min_x, min_y, max_x, max_y
    color:      Color,          // 16: u8x4, unpacked in shader via unpackUnorm4x8
    kind_flags: u32,            // 20: (kind as u32) | (flags as u32 << 8)
    rotation:   f32,            // 24: shader self-rotation in radians
    _pad:       f32,            // 28: alignment
    params:     Shape_Params,   // 32: raw union, 32 bytes (two vec4s of shape-specific data)
    uv_rect:    [4]f32,         // 64: texture UV sub-region (u_min, v_min, u_max, v_max)
}
// Total: 80 bytes (std430 aligned)

Shape_Params is a #raw_union with named variants per shape kind (rrect, circle, segment, etc.). This gives each shape kind named, typed fields on the CPU side, while the GPU side reinterprets the same 32 bytes at zero cost. Because a raw union is untagged, the kind stored in kind_flags is the only discriminant. The uv_rect field is used by textured SDF primitives (Shape_Flag.Textured); non-textured primitives leave it zeroed.
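
A minimal sketch of how the documented offsets and the kind_flags packing could be checked on the CPU side. The enum members, their ordering, the Shape_Params variants, and the pack/unpack proc names are assumptions made for illustration; only the Primitive field layout is taken from the struct above.

```odin
package primitive_sketch

// Hypothetical stand-ins; the library's real enums may differ.
Shape_Kind :: enum u8 {
	Solid,
	RRect,
	Circle,
}

Shape_Flag :: enum u8 {
	Stroke,
	Textured,
}
Shape_Flags :: bit_set[Shape_Flag; u8]

Color :: [4]u8

// Untagged union: every variant aliases the same 32 bytes.
Shape_Params :: struct #raw_union {
	circle: struct {
		radius: f32,
	},
	raw:    [8]f32, // two vec4s of shape-specific data
}

Primitive :: struct {
	bounds:     [4]f32,
	color:      Color,
	kind_flags: u32,
	rotation:   f32,
	_pad:       f32,
	params:     Shape_Params,
	uv_rect:    [4]f32,
}

// Compile-time checks that the CPU layout matches the documented offsets.
#assert(size_of(Primitive) == 80)
#assert(offset_of(Primitive, kind_flags) == 20)
#assert(offset_of(Primitive, params) == 32)
#assert(offset_of(Primitive, uv_rect) == 64)

// Kind goes in bits 0..7, flags in bits 8..15.
pack_kind_flags :: proc(kind: Shape_Kind, flags: Shape_Flags) -> u32 {
	return u32(kind) | (u32(transmute(u8)flags) << 8)
}

unpack_kind :: proc(kind_flags: u32) -> Shape_Kind {
	return Shape_Kind(u8(kind_flags & 0xFF))
}

unpack_flags :: proc(kind_flags: u32) -> Shape_Flags {
	bits := u8((kind_flags >> 8) & 0xFF)
	return transmute(Shape_Flags)bits
}
```

The #assert lines fail the build if a field is added or reordered without updating the shader-side struct, which is the main reason to keep them next to the definition.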

Draw submission order

Within each scissor region, draws are issued in submission order to preserve the painter's algorithm:

Bind effects pipeline → draw all queued effects primitives for this scissor (instanced, one draw call). Each effects primitive includes its base shape and composites internally.
Bind main pipeline, tessellated mode → draw all queued tessellated vertices (non-indexed for shapes, indexed for text). Pipeline state unchanged from today.
Bind main pipeline, SDF mode → draw all queued SDF primitives (instanced, one draw call).
If backdrop effects are present: copy render target, bind backdrop pipeline → draw backdrop primitives.

The exact ordering within a scissor may be refined based on actual Z-ordering requirements. The key invariant is that each primitive is drawn exactly once, in the pipeline that owns it.

Text rendering

Text rendering currently uses SDL_ttf's GPU text engine, which rasterizes glyphs per (font, size) pair into bitmap atlases and emits indexed triangle data via GetGPUTextDrawData. This path is unchanged by the SDF migration — text continues to flow through the main pipeline's tessellated mode with shape_kind = Solid, sampling the SDL_ttf atlas texture.

A future phase may evaluate MSDF (multi-channel signed distance field) text rendering, which would allow resolution-independent glyph rendering from a single small atlas per font. This would involve:

Offline atlas generation via Chlumský's msdf-atlas-gen tool.
Runtime glyph metrics via vendor:stb/truetype (already in the Odin distribution).
A new Shape_Kind.MSDF_Glyph variant in the main pipeline's fragment shader.
Potential removal of the SDL_ttf dependency.

This is explicitly deferred. The SDF shape migration is independent of and does not block text changes.

References:

Viktor Chlumský's MSDF master's thesis and msdfgen tool: https://github.com/Chlumsky/msdfgen
MSDF atlas generator for font atlas packing: https://github.com/Chlumsky/msdf-atlas-gen
Valve's original SDF text rendering paper (SIGGRAPH 2007): https://steamcdn-a.akamaihd.net/apps/valve/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf

Textures

Textures plug into the existing main pipeline, with no additional GPU pipeline and no shader rewrite. The work is a resource layer (registration, upload, sampling, lifecycle) plus two textured-draw procs (draw.rectangle_texture and draw.rectangle_texture_corners) that route into the existing SDF path.

Why draw owns registered textures

A texture's GPU resource (the ^sdl.GPUTexture, transfer buffer, shader resource view) is created and destroyed by draw. The user provides raw bytes and a descriptor at registration time; draw uploads synchronously and returns an opaque Texture_Id handle. The user can free their CPU-side bytes immediately after register_texture returns.

This follows the model used by the RAD Debugger's render layer (src/render/render_core.h in EpicGamesExt/raddebugger, MIT license), where r_tex2d_alloc takes (kind, size, format, data) and returns an opaque handle that the renderer owns and releases. The single-owner model eliminates an entire class of lifecycle bugs (double-free, use-after-free across subsystems, unclear cleanup responsibility) that dual-ownership designs introduce.

If advanced interop is ever needed (e.g., a future 3D pipeline or compute shader sharing the same GPU texture), the clean extension is a borrowed-reference accessor (get_gpu_texture(id)) that returns the underlying handle without transferring ownership. This is purely additive and does not require changing the registration API.

Why `Texture_Kind` exists

Texture_Kind (Static / Dynamic / Stream) is a driver hint for update frequency, adopted from the RAD Debugger's R_ResourceKind. It maps directly to SDL3 GPU usage patterns:

Static: uploaded once, never changes. Covers QR codes, decoded PNGs, icons — the 90% case.
Dynamic: updatable via update_texture_region. Covers font atlas growth, procedural updates.
Stream: frequent full re-uploads. Covers video playback, per-frame procedural generation.

This costs one byte in the descriptor and lets the backend pick optimal memory placement without a future API change.

Why samplers are per-draw, not per-texture

A sampler describes how to filter and address a texture during sampling — nearest vs bilinear, clamp vs repeat. This is a property of the draw, not the texture. The same QR code texture should be sampled with Nearest_Clamp when displayed at native resolution but could reasonably be sampled with Linear_Clamp in a zoomed-out thumbnail. The same icon atlas might be sampled with Nearest_Clamp for pixel art or Linear_Clamp for smooth scaling.

The RAD Debugger follows this pattern: R_BatchGroup2DParams carries tex_sample_kind alongside the texture handle, chosen per batch group at draw time. We do the same — Sampler_Preset is a parameter on the draw procs, not a field on Texture_Desc.

Internally, draw keeps a small pool of pre-created ^sdl.GPUSampler objects (one per preset, lazily initialized). Sub-batch coalescing keys on (kind, texture_id, sampler_preset) — draws with the same texture but different samplers produce separate draw calls, which is correct.
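
The coalescing rule can be sketched as follows. All type, field, and proc names here are hypothetical; the sketch only shows the idea of extending the previous sub-batch when the (kind, texture_id, sampler_preset) key matches and starting a new one otherwise.

```odin
package coalesce_sketch

// Hypothetical stand-ins for the library's handle and enum types.
Texture_Id :: distinct u32

Sampler_Preset :: enum u8 {
	Nearest_Clamp,
	Linear_Clamp,
}

Batch_Kind :: enum u8 {
	Tessellated,
	SDF,
}

Sub_Batch :: struct {
	kind:    Batch_Kind,
	texture: Texture_Id,
	sampler: Sampler_Preset,
	offset:  u32, // first primitive (or vertex) in the staging buffer
	count:   u32,
}

// Records `count` items starting at `offset`. The last sub-batch is extended
// when its key matches and the ranges are contiguous; otherwise a new
// sub-batch (and therefore a new draw call) is started.
push_sub_batch :: proc(
	batches: ^[dynamic]Sub_Batch,
	kind: Batch_Kind,
	texture: Texture_Id,
	sampler: Sampler_Preset,
	offset, count: u32,
) {
	if n := len(batches^); n > 0 {
		last := &batches^[n - 1]
		same_key := last.kind == kind && last.texture == texture && last.sampler == sampler
		contiguous := last.offset + last.count == offset
		if same_key && contiguous {
			last.count += count
			return
		}
	}
	append(batches, Sub_Batch{kind, texture, sampler, offset, count})
}
```

Only the most recent sub-batch is compared, so submission order is never reordered to gain a merge.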

Textured draw procs

Textured rectangles route through the existing SDF path via draw.rectangle_texture and draw.rectangle_texture_corners, mirroring draw.rectangle and draw.rectangle_corners exactly — same parameters, same naming — with the color parameter replaced by a texture ID plus an optional tint.

An earlier iteration of this design considered a separate tessellated draw.texture proc for "simple" fullscreen quads, on the theory that the tessellated path's lower register count (~16 regs vs ~24 for the SDF textured branch) would improve occupancy at large fragment counts. Applying the register-pressure analysis from the pipeline-strategy section above shows this is wrong: both 16 and 24 registers are well below the register cliff (~43 regs on consumer Ampere/Ada, ~32 on Volta/A100), so both run at 100% occupancy. The remaining ALU difference (~15 extra instructions for the SDF evaluation) amounts to ~20μs at 4K — below noise. Meanwhile, splitting into a separate pipeline would add ~1–5μs per pipeline bind on the CPU side per scissor, matching or exceeding the GPU-side savings. Within the main pipeline, unified remains strictly better.

The naming convention follows the existing shape API: rectangle_texture and rectangle_texture_corners sit alongside rectangle and rectangle_corners, mirroring the rectangle_gradient / circle_gradient pattern where the shape is the primary noun and the modifier (gradient, texture) is secondary. This groups related procs together in autocomplete (rectangle_*) and reads as natural English ("draw a rectangle with a texture").

Future per-shape texture variants (circle_texture, ellipse_texture, polygon_texture) are reserved by this naming convention and require only a Shape_Flag.Textured bit plus a small per-shape UV mapping function in the fragment shader. These are additive.

What SDF anti-aliasing does and does not do for textured draws

The SDF path anti-aliases the shape's outer silhouette — rounded-corner edges, rotated edges, stroke outlines. It does not anti-alias or sharpen the texture content. Inside the shape, fragments sample through the chosen Sampler_Preset, and image quality is whatever the sampler produces from the source texels. A low-resolution texture displayed at a large size shows bilinear blur regardless of which draw proc is used. This matches the current text-rendering model, where glyph sharpness depends on how closely the display size matches the SDL_ttf atlas's rasterized size.

Fit modes are a computation layer, not a renderer concept

Standard image-fit behaviors (stretch, fill/cover, fit/contain, tile, center) are expressed as UV sub-region computations on top of the uv_rect parameter that both textured-draw procs accept. The renderer has no knowledge of fit modes — it samples whatever UV region it is given.

A fit_params helper computes the appropriate uv_rect, sampler preset, and (for letterbox/fit mode) shrunken inner rect from a Fit_Mode enum, the target rect, and the texture's pixel size. Users who need custom UV control (sprite atlas sub-regions, UV animation, nine-patch slicing) skip the helper and compute uv_rect directly. This keeps the renderer primitive minimal while making the common cases convenient.
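
A rough sketch of what such a helper might compute for three of the modes. The type names, the enum members, the proc name, and the result struct are assumptions; the sampler-preset selection and the tile and center modes are omitted to keep the example short.

```odin
package fit_sketch

Rect :: struct {
	x, y, w, h: f32,
}

// Hypothetical subset of fit modes.
Fit_Mode :: enum u8 {
	Stretch, // fill the target, ignoring aspect ratio
	Cover,   // fill the target, cropping the texture
	Contain, // show the whole texture, letterboxing the target
}

Fit_Result :: struct {
	uv_rect: [4]f32, // u_min, v_min, u_max, v_max
	rect:    Rect,   // rect to draw into (shrunk for Contain)
}

fit_params_sketch :: proc(mode: Fit_Mode, target: Rect, tex_w, tex_h: f32) -> Fit_Result {
	result := Fit_Result{uv_rect = [4]f32{0, 0, 1, 1}, rect = target}
	if target.w <= 0 || target.h <= 0 || tex_w <= 0 || tex_h <= 0 {
		return result // degenerate input: fall back to stretch
	}
	target_aspect := target.w / target.h
	tex_aspect := tex_w / tex_h

	switch mode {
	case .Stretch:
		// Full UV range over the full target; nothing to adjust.
	case .Cover:
		if tex_aspect > target_aspect {
			// Texture is relatively wider: crop left and right.
			margin := (1 - target_aspect / tex_aspect) * 0.5
			result.uv_rect = [4]f32{margin, 0, 1 - margin, 1}
		} else {
			// Texture is relatively taller: crop top and bottom.
			margin := (1 - tex_aspect / target_aspect) * 0.5
			result.uv_rect = [4]f32{0, margin, 1, 1 - margin}
		}
	case .Contain:
		if tex_aspect > target_aspect {
			// Texture is relatively wider: bars above and below.
			h := target.w / tex_aspect
			result.rect = Rect{target.x, target.y + (target.h - h) * 0.5, target.w, h}
		} else {
			// Texture is relatively taller: bars left and right.
			w := target.h * tex_aspect
			result.rect = Rect{target.x + (target.w - w) * 0.5, target.y, w, target.h}
		}
	}
	return result
}
```

Cover only changes the UV sub-region and Contain only changes the destination rect, which is why the renderer itself never needs to know which mode was requested.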

Deferred release

unregister_texture does not immediately release the GPU texture. It queues the slot for release at the end of the current frame, after SubmitGPUCommandBuffer has handed work to the GPU. This prevents a race condition where a texture is freed while the GPU is still sampling from it in an already-submitted command buffer. The same deferred-release pattern is applied to clear_text_cache and clear_text_cache_entry, fixing a pre-existing latent bug where destroying a cached ^sdl_ttf.Text mid-frame could free an atlas texture still referenced by in-flight draw batches.

This pattern is standard in production renderers — the RAD Debugger's r_tex2d_release queues textures onto a free list that is processed in r_end_frame, not at the call site.
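
The deferred-release pattern might look roughly like the following. The pool layout and proc names are hypothetical, and the actual GPU release call is replaced by a comment so the sketch stays self-contained.

```odin
package deferred_release_sketch

Texture_Id :: distinct u32

// Hypothetical slot bookkeeping; the real slot would also hold the GPU handle.
Texture_Slot :: struct {
	live:            bool,
	pending_release: bool,
}

Texture_Pool :: struct {
	slots:   [dynamic]Texture_Slot,
	pending: [dynamic]Texture_Id, // released after this frame's submit
}

// Marks the slot for release. The GPU resource stays valid until
// process_pending_releases runs at the end of the frame.
unregister_texture_sketch :: proc(pool: ^Texture_Pool, id: Texture_Id) {
	index := int(id)
	if index >= len(pool.slots) {
		return
	}
	slot := &pool.slots[index]
	if !slot.live || slot.pending_release {
		return // unknown or already queued: ignore the duplicate request
	}
	slot.pending_release = true
	append(&pool.pending, id)
}

// Call once per frame, after the command buffer has been submitted.
// Returns how many textures were released.
process_pending_releases :: proc(pool: ^Texture_Pool) -> int {
	released := len(pool.pending)
	for id in pool.pending {
		// The real library would release the GPU texture here
		// (for example with sdl.ReleaseGPUTexture) before clearing the slot.
		pool.slots[int(id)] = Texture_Slot{}
	}
	clear(&pool.pending)
	return released
}
```

Guarding on pending_release makes a second unregister call on the same ID harmless, so the queue never contains a slot twice.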

Clay integration

Clay's RenderCommandType.Image is handled by dereferencing imageData: rawptr as a pointer to a Clay_Image_Data struct containing a Texture_Id, Fit_Mode, and tint color. Routing mirrors the existing rectangle handling: zero cornerRadius dispatches to draw.rectangle_texture, nonzero dispatches to draw.rectangle_texture_corners. Both take the SDF path, because the separate tessellated draw.texture proc was dropped (see "Textured draw procs"). A fit_params call computes UVs from the fit mode before dispatch.

Deferred features

The following are plumbed in the descriptor but not implemented in phase 1:

Mipmaps: Texture_Desc.mip_levels field exists; generation via SDL3 deferred.
Compressed formats: Texture_Desc.format accepts BC/ASTC; upload path deferred.
Render-to-texture: Texture_Desc.usage accepts .COLOR_TARGET; render-pass refactor deferred.
3D textures, arrays, cube maps: Texture_Desc.type and depth_or_layers fields exist.
Additional samplers: anisotropic, trilinear, clamp-to-border — additive enum values.
Atlas packing: internal optimization for sub-batch coalescing; invisible to callers.
Per-shape texture variants: circle_texture, ellipse_texture, etc. — reserved by naming.

References:

RAD Debugger render layer (ownership model, deferred release, sampler-at-draw-time): https://github.com/EpicGamesExt/raddebugger — src/render/render_core.h, src/render/d3d11/render_d3d11.c
Casey Muratori, Handmade Hero day 472 — texture handling as a renderer-owned resource concern, atlases as a separate layer above the renderer.

3D rendering

3D pipeline architecture is under consideration and will be documented separately. The current expectation is that 3D rendering will use dedicated pipelines (separate from the 2D pipelines) sharing GPU resources (textures, samplers, command buffer lifecycle) with the 2D renderer.

Multi-window support

The renderer currently assumes a single window via the global GLOB state. Multi-window support is deferred but anticipated. When revisited, the RAD Debugger's bucket + pass-list model (src/draw/draw.h, src/draw/draw.c in EpicGamesExt/raddebugger) is worth studying as a reference.

RAD separates draw submission from rendering via buckets. A DR_Bucket is an explicit handle that accumulates an ordered list of render passes (R_PassList). The user creates a bucket, pushes it onto a thread-local stack, issues draw calls (which target the top-of-stack bucket), then submits the bucket to a specific window. Multiple buckets can exist simultaneously — one per window, or one per UI panel that gets composited into a parent bucket via dr_sub_bucket. Implicit draw parameters (clip rect, 2D transform, sampler mode, transparency) are managed via push/pop stacks scoped to each bucket, so different windows can have independent clip and transform state without interference.

The key properties this gives RAD:

Per-window isolation. Each window builds its own bucket with its own pass list and state stacks. No global contention.
Thread-parallel building. Each thread has its own draw context and arena. Multiple threads can build buckets concurrently, then submit them to the render backend sequentially.
Compositing. A pre-built bucket (e.g., a tooltip or overlay) can be injected into another bucket with a transform applied, without rebuilding its draw calls.

For our library, the likely adaptation would be replacing the single GLOB with a per-window draw context that users create and pass to begin/end, while keeping the explicit-parameter draw call style rather than adopting RAD's implicit state stacks. Texture and sampler resources would remain global (shared across windows), with only the per-frame staging buffers and layer/scissor state becoming per-context.

Building shaders

GLSL shader sources live in shaders/source/. Compiled outputs (SPIR-V and Metal Shading Language) are generated into shaders/generated/ via the meta tool:

odin run meta -- gen-shaders

Requires glslangValidator and spirv-cross on PATH.

Shader format selection

The library embeds shader bytecode per compile target — MSL + main0 entry point on Darwin (via spirv-cross --msl, which renames main because it is reserved in Metal), SPIR-V + main entry point elsewhere. Three compile-time constants in draw.odin expose the build's shader configuration:

| Constant | Type | Darwin | Other |
| --- | --- | --- | --- |
| `PLATFORM_SHADER_FORMAT_FLAG` | `sdl.GPUShaderFormatFlag` | `.MSL` | `.SPIRV` |
| `PLATFORM_SHADER_FORMAT` | `sdl.GPUShaderFormat` | `{.MSL}` | `{.SPIRV}` |
| `SHADER_ENTRY` | `cstring` | `"main0"` | `"main"` |

Pass PLATFORM_SHADER_FORMAT to sdl.CreateGPUDevice so SDL selects a backend compatible with the embedded bytecode:

gpu := sdl.CreateGPUDevice(draw.PLATFORM_SHADER_FORMAT, true, nil)

At init time the library calls sdl.GetGPUShaderFormats(device) to verify the active backend accepts PLATFORM_SHADER_FORMAT_FLAG. If it does not, draw.init returns false with a descriptive log message showing both the embedded and active format sets.
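
One way the per-target constants and the init-time check could be expressed is sketched below. The Shader_Format_Flag and Shader_Format types are local stand-ins for sdl.GPUShaderFormatFlag and sdl.GPUShaderFormat so the example compiles without SDL, and the proc name is hypothetical; this is not necessarily how draw.odin defines them.

```odin
package shader_format_sketch

// Stand-ins for the SDL types; the real ones come from vendor:sdl3.
Shader_Format_Flag :: enum u32 {
	SPIRV,
	MSL,
}
Shader_Format :: bit_set[Shader_Format_Flag; u32]

// Compile-time selection per target, mirroring the table above.
when ODIN_OS == .Darwin {
	PLATFORM_SHADER_FORMAT_FLAG :: Shader_Format_Flag.MSL
	SHADER_ENTRY :: cstring("main0")
} else {
	PLATFORM_SHADER_FORMAT_FLAG :: Shader_Format_Flag.SPIRV
	SHADER_ENTRY :: cstring("main")
}
PLATFORM_SHADER_FORMAT :: Shader_Format{PLATFORM_SHADER_FORMAT_FLAG}

// `active` stands for the set returned by sdl.GetGPUShaderFormats(device).
backend_accepts_embedded_shaders :: proc(active: Shader_Format) -> bool {
	return PLATFORM_SHADER_FORMAT_FLAG in active
}
```

When the check fails, logging both PLATFORM_SHADER_FORMAT and the active set gives the caller enough information to see which backend SDL actually picked.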
