Texture Rendering (#9)
Co-authored-by: Zachary Levy <zachary@sunforge.is>
Reviewed-on: #9
This commit was merged in pull request #9.
374
draw/README.md
@@ -47,68 +47,107 @@ primitives and effects can be added to the library without architectural changes
|
||||
|
||||
### Overview: three pipelines
|
||||
|
||||
The 2D renderer will use three GPU pipelines, split by **register pressure compatibility** and
|
||||
**render-state requirements**:
|
||||
The 2D renderer uses three GPU pipelines, split by **register pressure** (main vs effects) and
|
||||
**render-pass structure** (everything vs backdrop):
|
||||
|
||||
1. **Main pipeline** — shapes (SDF and tessellated) and text. Low register footprint (~18–22
|
||||
registers per thread). Runs at high GPU occupancy. Handles 90%+ of all fragments in a typical
|
||||
frame.
|
||||
1. **Main pipeline** — shapes (SDF and tessellated), text, and textured rectangles. Low register
|
||||
footprint (~18–24 registers per thread). Runs at full GPU occupancy on every architecture.
|
||||
Handles 90%+ of all fragments in a typical frame.
|
||||
|
||||
2. **Effects pipeline** — drop shadows, inner shadows, outer glow, and similar ALU-bound blur
|
||||
effects. Medium register footprint (~48–60 registers). Each effects primitive includes the base
|
||||
shape's SDF so that it can draw both the effect and the shape in a single fragment pass, avoiding
|
||||
redundant overdraw.
|
||||
redundant overdraw. Separated from the main pipeline to protect main-pipeline occupancy on
|
||||
low-end hardware (see register analysis below).
|
||||
|
||||
3. **Backdrop-effects pipeline** — frosted glass, refraction, and any effect that samples the current
|
||||
render target as input. High register footprint (~70–80 registers) and structurally requires a
|
||||
`CopyGPUTextureToTexture` from the render target before drawing. Separated both for register
|
||||
pressure and because the texture-copy requirement forces a render-pass-level state change.
|
||||
3. **Backdrop pipeline** — frosted glass, refraction, and any effect that samples the current render
|
||||
target as input. Implemented as a multi-pass sequence (downsample, separable blur, composite),
|
||||
where each individual pass has a low-to-medium register footprint (~15–40 registers). Separated
|
||||
from the other pipelines because it structurally requires ending the current render pass and
copying the render target before any backdrop-sampling fragment can execute. This is a
command-buffer-level boundary that cannot be avoided regardless of shader complexity.
|
||||
|
||||
A typical UI frame with no effects uses 1 pipeline bind and 0 switches. A frame with drop shadows
|
||||
uses 2 pipelines and 1 switch. A frame with shadows and frosted glass uses all 3 pipelines and 2
|
||||
switches plus 1 texture copy. At ~5μs per pipeline bind on modern APIs, worst-case switching overhead
|
||||
is under 0.15% of an 8.3ms (120 FPS) frame budget.
|
||||
switches plus 1 texture copy. At ~1–5μs per pipeline bind on modern APIs, worst-case switching
|
||||
overhead is negligible relative to an 8.3ms (120 FPS) frame budget.
|
||||
|
||||
### Why three pipelines, not one or seven
|
||||
|
||||
The natural question is whether we should use a single unified pipeline (fewer state changes, simpler
|
||||
code) or many per-primitive-type pipelines (no branching overhead, lean per-shader register usage).
|
||||
|
||||
The dominant cost factor is **GPU register pressure**, not pipeline switching overhead or fragment
|
||||
shader branching. A GPU shader core has a fixed register pool shared among all concurrent threads. The
|
||||
compiler allocates registers pessimistically based on the worst-case path through the shader. If the
|
||||
shader contains both a 20-register RRect SDF and a 72-register frosted-glass blur, _every_ fragment
|
||||
— even trivial RRects — is allocated 72 registers. This directly reduces **occupancy** (the number of
|
||||
warps that can run simultaneously), which reduces the GPU's ability to hide memory latency.
|
||||
#### Main/effects split: register pressure
|
||||
|
||||
Concrete example on a modern NVIDIA SM with 65,536 registers:
|
||||
A GPU shader core has a fixed register pool shared among all concurrent threads. The compiler
|
||||
allocates registers pessimistically based on the worst-case path through the shader. If the shader
|
||||
contains both a 20-register RRect SDF and a 48-register drop-shadow blur, _every_ fragment — even
|
||||
trivial RRects — is allocated 48 registers. This directly reduces **occupancy** (the number of
|
||||
warps/wavefronts that can run simultaneously), which reduces the GPU's ability to hide memory
|
||||
latency.
|
||||
|
||||
| Register allocation       | Max concurrent threads | Occupancy |
| ------------------------- | ---------------------- | --------- |
| 20 regs (RRect only)      | 3,276                  | ~100%     |
| 48 regs (+ drop shadow)   | 1,365                  | ~42%      |
| 72 regs (+ frosted glass) | 910                    | ~28%      |
|
||||
Each GPU architecture has a **register cliff** — a threshold above which occupancy starts dropping.
|
||||
Below the cliff, adding registers has zero occupancy cost.
|
||||
|
||||
For a 4K frame (3840×2160) at 1.5× overdraw (~12.4M fragments), running all fragments at 28%
|
||||
occupancy instead of 100% roughly triples fragment shading time. At 4K this is severe: if the main
|
||||
pipeline's fragment work at full occupancy takes ~2ms, a single unified shader containing the glass
|
||||
branch would push it to ~6ms — consuming 72% of the 8.3ms budget available at 120 FPS and leaving
|
||||
almost nothing for CPU work, uploads, and presentation. This is a per-frame multiplier, not a
|
||||
per-primitive cost — it applies even when the heavy branch is never taken.
|
||||
On consumer Ampere/Ada GPUs (RTX 30xx/40xx, 65,536 regs/SM, max 1,536 threads/SM, cliff at ~43 regs):
|
||||
|
||||
The three-pipeline split groups primitives by register footprint so that:
|
||||
| Register allocation     | Reg-limited threads | Actual (hw-capped) | Occupancy |
| ----------------------- | ------------------- | ------------------ | --------- |
| 20 regs (main pipeline) | 3,276               | 1,536              | 100%      |
| 32 regs                 | 2,048               | 1,536              | 100%      |
| 48 regs (effects)       | 1,365               | 1,365              | ~89%      |
|
||||
|
||||
- Main pipeline (~20 regs): 90%+ of fragments run at near-full occupancy.
|
||||
- Effects pipeline (~55 regs): shadow/glow fragments run at moderate occupancy; unavoidable given the
|
||||
blur math complexity.
|
||||
- Backdrop-effects pipeline (~75 regs): glass fragments run at low occupancy; also unavoidable, and
|
||||
structurally separated anyway by the texture-copy requirement.
|
||||
On Volta/A100 GPUs (65,536 regs/SM, max 2,048 threads/SM, cliff at ~32 regs):
|
||||
|
||||
This avoids the register-pressure tax of a single unified shader while keeping pipeline count minimal
|
||||
(3 vs. Zed GPUI's 7). The effects that drag occupancy down are isolated to the fragments that
|
||||
actually need them.
|
||||
| Register allocation     | Reg-limited threads | Actual (hw-capped) | Occupancy |
| ----------------------- | ------------------- | ------------------ | --------- |
| 20 regs (main pipeline) | 3,276               | 2,048              | 100%      |
| 32 regs                 | 2,048               | 2,048              | 100%      |
| 48 regs (effects)       | 1,365               | 1,365              | ~67%      |
|
||||
|
||||
**Why not per-primitive-type pipelines (GPUI's approach)?** Zed's GPUI uses 7 separate shader pairs:
|
||||
On low-end mobile (ARM Mali Bifrost/Valhall, 64 regs/thread, cliff fixed at 32 regs):
|
||||
|
||||
| Register allocation  | Occupancy                  |
| -------------------- | -------------------------- |
| 0–32 regs (main)     | 100% (full thread count)   |
| 33–64 regs (effects) | ~50% (thread count halves) |
|
||||
|
||||
Mali's cliff at 32 registers is the binding constraint. On desktop the occupancy difference between
|
||||
20 and 48 registers is modest (89–100%); on Mali it is a hard 2× throughput reduction. The
|
||||
main/effects split protects 90%+ of a frame's fragments (shapes, text, textures) from the effects
|
||||
pipeline's register cost.
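
The desktop occupancy figures in the tables above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the library: the proc names `resident_threads` and `occupancy_percent` are hypothetical, and the model deliberately ignores register-allocation granularity and warp rounding, as the tables do.

```odin
package occupancy_sketch

import "core:fmt"

// Threads that fit on one SM: limited by the register file, then capped by
// the hardware thread limit. Simplified model: no allocation granularity.
resident_threads :: proc(regs_per_sm, regs_per_thread, max_threads: int) -> int {
	return min(regs_per_sm / regs_per_thread, max_threads)
}

// Occupancy as a percentage of the hardware thread limit.
occupancy_percent :: proc(regs_per_sm, regs_per_thread, max_threads: int) -> f64 {
	threads := resident_threads(regs_per_sm, regs_per_thread, max_threads)
	return 100.0 * f64(threads) / f64(max_threads)
}

main :: proc() {
	REGS_PER_SM :: 65_536
	allocations := [3]int{20, 32, 48}
	for regs in allocations {
		// Thread caps taken from the tables: 1,536 (consumer Ampere/Ada), 2,048 (Volta/A100).
		fmt.println(
			regs, "regs:",
			occupancy_percent(REGS_PER_SM, regs, 1_536), "% on Ampere/Ada,",
			occupancy_percent(REGS_PER_SM, regs, 2_048), "% on Volta/A100",
		)
	}
}
```

In this model the cliff is the largest allocation for which the register-limited thread count still reaches the hardware cap: 65,536 / 1,536 ≈ 42.7 registers on consumer Ampere/Ada and 65,536 / 2,048 = 32 on Volta/A100, matching the ~43 and ~32 figures quoted above.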
|
||||
|
||||
For the effects pipeline's drop-shadow shader — erf-approximation blur math with several texture
|
||||
fetches — 50% occupancy on Mali roughly halves throughput. At 4K with 1.5× overdraw (~12.4M
|
||||
fragments), a single unified shader containing the shadow branch would cost ~4ms instead of ~2ms on
|
||||
low-end mobile. This is a per-frame multiplier even when the heavy branch is never taken, because the
|
||||
compiler allocates registers for the worst-case path.
|
||||
|
||||
All main-pipeline members (SDF shapes, tessellated geometry, text, textured rectangles) cluster at
|
||||
12–24 registers — below the cliff on every architecture — so unifying them costs nothing in
|
||||
occupancy.
|
||||
|
||||
**Note on Apple M3+ GPUs:** Apple's M3 introduces Dynamic Caching (register file virtualization),
|
||||
which allocates registers at runtime based on actual usage rather than worst-case. This weakens the
|
||||
static register-pressure argument on M3 and later, but the split remains useful for isolating blur
|
||||
ALU complexity and keeping the backdrop texture-copy out of the main render pass.
|
||||
|
||||
#### Backdrop split: render-pass structure
|
||||
|
||||
The backdrop pipeline (frosted glass, refraction, mirror surfaces) is separated for a structural
|
||||
reason unrelated to register pressure. Before any backdrop-sampling fragment can execute, the current
|
||||
render target must be copied to a separate texture via `CopyGPUTextureToTexture`, a
command-buffer-level operation that requires ending the current render pass. This boundary exists
regardless of shader complexity and cannot be optimized away.
|
||||
|
||||
The backdrop pipeline's individual shader passes (downsample, separable blur, composite) are
|
||||
register-light (~15–40 regs each), so merging them into the effects pipeline would cause no occupancy
|
||||
problem. But the render-pass boundary makes merging structurally impossible — effects draws happen
|
||||
inside the main render pass, backdrop draws happen inside their own bracketed pass sequence.
|
||||
|
||||
#### Why not per-primitive-type pipelines (GPUI's approach)
|
||||
|
||||
Zed's GPUI uses 7 separate shader pairs:
|
||||
quad, shadow, underline, monochrome sprite, polychrome sprite, path, surface. This eliminates all
|
||||
branching and gives each shader minimal register usage. Three concrete costs make this approach wrong
|
||||
for our use case:
|
||||
@@ -120,7 +159,7 @@ typical UI frame with 15 scissors and 3–4 primitive kinds per scissor, per-kin
|
||||
~45–60 draw calls and pipeline binds; our unified approach produces ~15–20 draw calls and 1–5
|
||||
pipeline binds. At ~5μs each for CPU-side command encoding on modern APIs, per-kind splitting adds
|
||||
375–500μs of CPU overhead per frame — **4.5–6% of an 8.3ms (120 FPS) budget** — with no
|
||||
compensating GPU-side benefit, because the register-pressure savings within the simple-SDF tier are
|
||||
compensating GPU-side benefit, because the register-pressure savings within the simple-SDF range are
|
||||
negligible (all members cluster at 12–22 registers).
|
||||
|
||||
**Z-order preservation forces the API to expose layers.** With a single pipeline drawing all kinds
|
||||
@@ -159,10 +198,10 @@ in submission order:
|
||||
~60 boundary warps at ~80 extra instructions each), unified divergence costs ~13μs — still 3.5×
|
||||
cheaper than the pipeline-switching alternative.
|
||||
|
||||
The split we _do_ perform (main / effects / backdrop-effects) is motivated by register-pressure tier
|
||||
boundaries where occupancy differences are catastrophic at 4K (see numbers above). Within a tier,
|
||||
unified is strictly better by every measure: fewer draw calls, simpler Z-order, lower CPU overhead,
|
||||
and negligible GPU-side branching cost.
|
||||
The split we _do_ perform (main / effects / backdrop) is motivated by register-pressure boundaries
|
||||
and structural render-pass requirements (see analysis above). Within a pipeline, unified is
|
||||
strictly better by every measure: fewer draw calls, simpler Z-order, lower CPU overhead, and
|
||||
negligible GPU-side branching cost.
|
||||
|
||||
**References:**
|
||||
|
||||
@@ -172,6 +211,16 @@ and negligible GPU-side branching cost.
|
||||
https://github.com/zed-industries/zed/blob/cb6fc11/crates/gpui/src/platform/mac/shaders.metal
|
||||
- NVIDIA Nsight Graphics 2024.3 documentation on active-threads-per-warp and divergence analysis:
|
||||
https://developer.nvidia.com/blog/optimize-gpu-workloads-for-graphics-applications-with-nvidia-nsight-graphics/
|
||||
- NVIDIA Ampere GPU Architecture Tuning Guide — SM specs, max warps per SM (48 for cc 8.6, 64 for
|
||||
cc 8.0), register file size (64K), occupancy factors:
|
||||
https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html
|
||||
- NVIDIA Ada GPU Architecture Tuning Guide — SM specs, max warps per SM (48 for cc 8.9):
|
||||
https://docs.nvidia.com/cuda/ada-tuning-guide/index.html
|
||||
- CUDA Occupancy Calculation walkthrough (register allocation granularity, worked examples):
|
||||
https://leimao.github.io/blog/CUDA-Occupancy-Calculation/
|
||||
- Apple M3 GPU architecture — Dynamic Caching (register file virtualization) eliminates static
|
||||
worst-case register allocation, reducing the occupancy penalty for high-register shaders:
|
||||
https://asplos.dev/wiki/m3-chip-explainer/gpu/index.html
|
||||
|
||||
### Why fragment shader branching is safe in this design
|
||||
|
||||
@@ -442,25 +491,40 @@ Wallace's variant) and vger-rs.
|
||||
- Vello's implementation of blurred rounded rectangle as a gradient type:
|
||||
https://github.com/linebender/vello/pull/665
|
||||
|
||||
### Backdrop-effects pipeline
|
||||
### Backdrop pipeline
|
||||
|
||||
The backdrop-effects pipeline handles effects that sample the current render target as input: frosted
|
||||
glass, refraction, mirror surfaces. It is structurally separated from the effects pipeline for two
|
||||
reasons:
|
||||
The backdrop pipeline handles effects that sample the current render target as input: frosted glass,
|
||||
refraction, mirror surfaces. It is separated from the effects pipeline for a structural reason, not
|
||||
register pressure.
|
||||
|
||||
1. **Render-state requirement.** Before any backdrop-sampling fragment can run, the current render
|
||||
target must be copied to a separate texture via `CopyGPUTextureToTexture`. This is a command-
|
||||
buffer-level operation that cannot happen mid-render-pass. The copy naturally creates a pipeline
|
||||
boundary.
|
||||
**Render-pass boundary.** Before any backdrop-sampling fragment can run, the current render target
|
||||
must be copied to a separate texture via `CopyGPUTextureToTexture`. This is a command-buffer-level
|
||||
operation that cannot happen mid-render-pass. The copy naturally creates a pipeline boundary that no
|
||||
amount of shader optimization can eliminate — it is a fundamental requirement of sampling a surface
|
||||
while also writing to it.
|
||||
|
||||
2. **Register pressure.** Backdrop-sampling shaders read from a texture with Gaussian kernel weights
|
||||
(multiple texture fetches per fragment), pushing register usage to ~70–80. Including this in the
|
||||
effects pipeline would reduce occupancy for all shadow/glow fragments from ~30% to ~20%, costing
|
||||
measurable throughput on the common case.
|
||||
**Multi-pass implementation.** Backdrop effects are implemented as separable multi-pass sequences
|
||||
(downsample → horizontal blur → vertical blur → composite), following the standard approach used by
|
||||
iOS `UIVisualEffectView`, Android `RenderEffect`, and Flutter's `BackdropFilter`. Each individual
|
||||
pass has a low-to-medium register footprint (~15–40 registers), well within the main pipeline's
|
||||
occupancy range. The multi-pass approach avoids the monolithic 70+ register shader that a single-pass
|
||||
Gaussian blur would require, making backdrop effects viable on low-end mobile GPUs (including
|
||||
Mali-G31 and VideoCore VI) where per-thread register limits are tight.
|
||||
|
||||
The backdrop-effects pipeline binds a secondary sampler pointing at the captured backdrop texture. When
|
||||
no backdrop effects are present in a frame, this pipeline is never bound and the texture copy never
|
||||
happens — zero cost.
|
||||
**Bracketed execution.** All backdrop draws in a frame share a single bracketed region of the command
|
||||
buffer: end the current render pass, copy the render target, execute all backdrop sub-passes, then
|
||||
resume normal drawing. The entry/exit cost (texture copy + render-pass break) is paid once per frame
|
||||
regardless of how many backdrop effects are visible. When no backdrop effects are present, the bracket
|
||||
is never entered and the texture copy never happens — zero cost.
|
||||
|
||||
**Why not split the backdrop sub-passes into separate pipelines?** The individual passes range from
|
||||
~15 to ~40 registers, which does cross Mali's 32-register cliff. However, the register-pressure argument
|
||||
that justifies the main/effects split does not apply here. The main/effects split protects the
|
||||
_common path_ (90%+ of frame fragments) from the uncommon path's register cost. Inside the backdrop
|
||||
pipeline there is no common-vs-uncommon distinction — if backdrop effects are active, every sub-pass
|
||||
runs; if not, none run. The backdrop pipeline either executes as a complete unit or not at all.
|
||||
Additionally, backdrop effects cover a small fraction of the frame's total fragments (~5% at typical
|
||||
UI scales), so the occupancy variation within the bracket has negligible impact on frame time.
|
||||
|
||||
### Vertex layout
|
||||
|
||||
@@ -483,19 +547,21 @@ The `Primitive` struct for SDF shapes lives in the storage buffer, not in vertex
|
||||
|
||||
```
 Primitive :: struct {
-    kind:       Shape_Kind,   // 0: enum u8
-    flags:      Shape_Flags,  // 1: bit_set[Shape_Flag; u8]
-    _pad:       u16,          // 2: reserved
-    bounds:     [4]f32,       // 4: min_x, min_y, max_x, max_y
-    color:      Color,        // 20: u8x4
-    _pad2:      [3]u8,        // 24: alignment
-    params:     Shape_Params, // 28: raw union, 32 bytes
+    bounds:     [4]f32,       // 0: min_x, min_y, max_x, max_y
+    color:      Color,        // 16: u8x4, unpacked in shader via unpackUnorm4x8
+    kind_flags: u32,          // 20: (kind as u32) | (flags as u32 << 8)
+    rotation:   f32,          // 24: shader self-rotation in radians
+    _pad:       f32,          // 28: alignment
+    params:     Shape_Params, // 32: raw union, 32 bytes (two vec4s of shape-specific data)
+    uv_rect:    [4]f32,       // 64: texture UV sub-region (u_min, v_min, u_max, v_max)
 }
-// Total: 60 bytes (padded to 64 for GPU alignment)
+// Total: 80 bytes (std430 aligned)
```
|
||||
|
||||
`Shape_Params` is a `#raw_union` with named variants per shape kind (`rrect`, `circle`, `segment`,
|
||||
etc.), ensuring type safety on the CPU side and zero-cost reinterpretation on the GPU side.
|
||||
etc.), ensuring type safety on the CPU side and zero-cost reinterpretation on the GPU side. The
|
||||
`uv_rect` field is used by textured SDF primitives (Shape_Flag.Textured); non-textured primitives
|
||||
leave it zeroed.
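
The offsets in the new layout can be checked at compile time. The sketch below is illustrative only: `Color` is assumed to be `[4]u8`, `Shape_Params` is reduced to a placeholder 32-byte union, and `pack_kind_flags`, `kind_of`, and `flags_of` are hypothetical helpers that follow the packing described in the `kind_flags` comment.

```odin
package layout_sketch

import "core:fmt"

Color :: [4]u8 // assumption: four 8-bit channels

// Placeholder: the real union has named variants per shape kind.
Shape_Params :: struct #raw_union {
	raw: [8]f32, // 32 bytes, i.e. two vec4s
}

Primitive :: struct {
	bounds:     [4]f32,
	color:      Color,
	kind_flags: u32,
	rotation:   f32,
	_pad:       f32,
	params:     Shape_Params,
	uv_rect:    [4]f32,
}

// Compile-time checks mirroring the offsets documented in the struct comments.
#assert(offset_of(Primitive, color) == 16)
#assert(offset_of(Primitive, kind_flags) == 20)
#assert(offset_of(Primitive, rotation) == 24)
#assert(offset_of(Primitive, params) == 32)
#assert(offset_of(Primitive, uv_rect) == 64)
#assert(size_of(Primitive) == 80)

// Kind in the low byte, flags in the next byte.
pack_kind_flags :: proc(kind: u8, flags: u8) -> u32 {
	return u32(kind) | (u32(flags) << 8)
}

kind_of :: proc(kind_flags: u32) -> u8 {
	return u8(kind_flags & 0xFF)
}

flags_of :: proc(kind_flags: u32) -> u8 {
	return u8((kind_flags >> 8) & 0xFF)
}

main :: proc() {
	packed := pack_kind_flags(3, 1)
	fmt.println(size_of(Primitive), kind_of(packed), flags_of(packed))
}
```

Every field is 4-byte aligned and `params` starts on a 16-byte boundary, so the CPU-side struct needs no implicit padding to reach the 80-byte total.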
|
||||
|
||||
### Draw submission order
|
||||
|
||||
@@ -506,7 +572,7 @@ Within each scissor region, draws are issued in submission order to preserve the
|
||||
2. Bind **main pipeline, tessellated mode** → draw all queued tessellated vertices (non-indexed for
|
||||
shapes, indexed for text). Pipeline state unchanged from today.
|
||||
3. Bind **main pipeline, SDF mode** → draw all queued SDF primitives (instanced, one draw call).
|
||||
4. If backdrop effects are present: copy render target, bind **backdrop-effects pipeline** → draw
|
||||
4. If backdrop effects are present: copy render target, bind **backdrop pipeline** → draw
|
||||
backdrop primitives.
|
||||
|
||||
The exact ordering within a scissor may be refined based on actual Z-ordering requirements. The key
|
||||
@@ -539,12 +605,180 @@ changes.
|
||||
- Valve's original SDF text rendering paper (SIGGRAPH 2007):
|
||||
https://steamcdn-a.akamaihd.net/apps/valve/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf
|
||||
|
||||
### Textures
|
||||
|
||||
Textures plug into the existing main pipeline: no additional GPU pipeline, no shader rewrite. The
work is a resource layer (registration, upload, sampling, lifecycle) plus two textured-draw procs
(`draw.rectangle_texture` and `draw.rectangle_texture_corners`) that route into the existing SDF
path.
|
||||
|
||||
#### Why draw owns registered textures
|
||||
|
||||
A texture's GPU resource (the `^sdl.GPUTexture`, transfer buffer, shader resource view) is created
|
||||
and destroyed by draw. The user provides raw bytes and a descriptor at registration time; draw
|
||||
uploads synchronously and returns an opaque `Texture_Id` handle. The user can free their CPU-side
|
||||
bytes immediately after `register_texture` returns.
|
||||
|
||||
This follows the model used by the RAD Debugger's render layer (`src/render/render_core.h` in
|
||||
EpicGamesExt/raddebugger, MIT license), where `r_tex2d_alloc` takes `(kind, size, format, data)`
|
||||
and returns an opaque handle that the renderer owns and releases. The single-owner model eliminates
|
||||
an entire class of lifecycle bugs (double-free, use-after-free across subsystems, unclear cleanup
|
||||
responsibility) that dual-ownership designs introduce.
|
||||
|
||||
If advanced interop is ever needed (e.g., a future 3D pipeline or compute shader sharing the same
|
||||
GPU texture), the clean extension is a borrowed-reference accessor (`get_gpu_texture(id)`) that
|
||||
returns the underlying handle without transferring ownership. This is purely additive and does not
|
||||
require changing the registration API.
|
||||
|
||||
#### Why `Texture_Kind` exists
|
||||
|
||||
`Texture_Kind` (Static / Dynamic / Stream) is a driver hint for update frequency, adopted from the
|
||||
RAD Debugger's `R_ResourceKind`. It maps directly to SDL3 GPU usage patterns:
|
||||
|
||||
- **Static**: uploaded once, never changes. Covers QR codes, decoded PNGs, icons — the 90% case.
|
||||
- **Dynamic**: updatable via `update_texture_region`. Covers font atlas growth, procedural updates.
|
||||
- **Stream**: frequent full re-uploads. Covers video playback, per-frame procedural generation.
|
||||
|
||||
This costs one byte in the descriptor and lets the backend pick optimal memory placement without a
|
||||
future API change.
|
||||
|
||||
#### Why samplers are per-draw, not per-texture
|
||||
|
||||
A sampler describes how to filter and address a texture during sampling — nearest vs bilinear, clamp
|
||||
vs repeat. This is a property of the _draw_, not the texture. The same QR code texture should be
|
||||
sampled with `Nearest_Clamp` when displayed at native resolution but could reasonably be sampled
|
||||
with `Linear_Clamp` in a zoomed-out thumbnail. The same icon atlas might be sampled with
|
||||
`Nearest_Clamp` for pixel art or `Linear_Clamp` for smooth scaling.
|
||||
|
||||
The RAD Debugger follows this pattern: `R_BatchGroup2DParams` carries `tex_sample_kind` alongside
|
||||
the texture handle, chosen per batch group at draw time. We do the same — `Sampler_Preset` is a
|
||||
parameter on the draw procs, not a field on `Texture_Desc`.
|
||||
|
||||
Internally, draw keeps a small pool of pre-created `^sdl.GPUSampler` objects (one per preset,
|
||||
lazily initialized). Sub-batch coalescing keys on `(kind, texture_id, sampler_preset)` — draws
|
||||
with the same texture but different samplers produce separate draw calls, which is correct.
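
To make the coalescing rule concrete, here is a small sketch of how consecutive draws could be grouped by key. It assumes a simple run-length comparison over submission order; `Batch_Kind`, `Batch_Key`, and `count_draw_calls` are hypothetical names, and only `Nearest_Clamp` and `Linear_Clamp` are taken from the text above.

```odin
package batch_sketch

import "core:fmt"

Texture_Id     :: distinct u32
Batch_Kind     :: enum u8 {Tessellated, SDF, Text} // hypothetical
Sampler_Preset :: enum u8 {Nearest_Clamp, Linear_Clamp}

Batch_Key :: struct {
	kind:       Batch_Kind,
	texture_id: Texture_Id,
	sampler:    Sampler_Preset,
}

// Consecutive submissions with an identical key share one draw call;
// any change in kind, texture, or sampler starts a new one.
count_draw_calls :: proc(keys: []Batch_Key) -> int {
	calls := 0
	for key, i in keys {
		if i == 0 || key != keys[i - 1] {
			calls += 1
		}
	}
	return calls
}

main :: proc() {
	qr := Texture_Id(1)
	keys := []Batch_Key{
		{.SDF, qr, .Nearest_Clamp},
		{.SDF, qr, .Nearest_Clamp}, // coalesces with the previous draw
		{.SDF, qr, .Linear_Clamp},  // same texture, different sampler: new draw call
	}
	fmt.println(count_draw_calls(keys)) // 3 submissions, 2 draw calls
}
```

Comparing only against the previous key keeps submission order intact, which is what Z-order preservation requires; sorting by key would reduce draw calls further but would reorder draws.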
|
||||
|
||||
#### Textured draw procs
|
||||
|
||||
Textured rectangles route through the existing SDF path via `draw.rectangle_texture` and
|
||||
`draw.rectangle_texture_corners`, mirroring `draw.rectangle` and `draw.rectangle_corners` exactly —
|
||||
same parameters, same naming — with the color parameter replaced by a texture ID plus an optional
|
||||
tint.
|
||||
|
||||
An earlier iteration of this design considered a separate tessellated `draw.texture` proc for
|
||||
"simple" fullscreen quads, on the theory that the tessellated path's lower register count (~16 regs
|
||||
vs ~24 for the SDF textured branch) would improve occupancy at large fragment counts. Applying the
|
||||
register-pressure analysis from the pipeline-strategy section above shows this is wrong: both 16 and
|
||||
24 registers are well below the register cliff (~43 regs on consumer Ampere/Ada, ~32 on Volta/A100),
|
||||
so both run at 100% occupancy. The remaining ALU difference (~15 extra instructions for the SDF
|
||||
evaluation) amounts to ~20μs at 4K — below noise. Meanwhile, splitting into a separate pipeline
|
||||
would add ~1–5μs per pipeline bind on the CPU side per scissor, matching or exceeding the GPU-side
|
||||
savings. Within the main pipeline, unified remains strictly better.
|
||||
|
||||
The naming convention follows the existing shape API: `rectangle_texture` and
|
||||
`rectangle_texture_corners` sit alongside `rectangle` and `rectangle_corners`, mirroring the
|
||||
`rectangle_gradient` / `circle_gradient` pattern where the shape is the primary noun and the
|
||||
modifier (gradient, texture) is secondary. This groups related procs together in autocomplete
|
||||
(`rectangle_*`) and reads as natural English ("draw a rectangle with a texture").
|
||||
|
||||
Future per-shape texture variants (`circle_texture`, `ellipse_texture`, `polygon_texture`) are
|
||||
reserved by this naming convention and require only a `Shape_Flag.Textured` bit plus a small
|
||||
per-shape UV mapping function in the fragment shader. These are additive.
|
||||
|
||||
#### What SDF anti-aliasing does and does not do for textured draws
|
||||
|
||||
The SDF path anti-aliases the **shape's outer silhouette** — rounded-corner edges, rotated edges,
|
||||
stroke outlines. It does not anti-alias or sharpen the texture content. Inside the shape, fragments
|
||||
sample through the chosen `Sampler_Preset`, and image quality is whatever the sampler produces from
|
||||
the source texels. A low-resolution texture displayed at a large size shows bilinear blur regardless
|
||||
of which draw proc is used. This matches the current text-rendering model, where glyph sharpness
|
||||
depends on how closely the display size matches the SDL_ttf atlas's rasterized size.
|
||||
|
||||
#### Fit modes are a computation layer, not a renderer concept
|
||||
|
||||
Standard image-fit behaviors (stretch, fill/cover, fit/contain, tile, center) are expressed as UV
|
||||
sub-region computations on top of the `uv_rect` parameter that both textured-draw procs accept. The
|
||||
renderer has no knowledge of fit modes — it samples whatever UV region it is given.
|
||||
|
||||
A `fit_params` helper computes the appropriate `uv_rect`, sampler preset, and (for letterbox/fit
|
||||
mode) shrunken inner rect from a `Fit_Mode` enum, the target rect, and the texture's pixel size.
|
||||
Users who need custom UV control (sprite atlas sub-regions, UV animation, nine-patch slicing) skip
|
||||
the helper and compute `uv_rect` directly. This keeps the renderer primitive minimal while making
|
||||
the common cases convenient.
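
As an illustration of fit modes being plain UV arithmetic, the sketch below computes two of them. It is not the library's `fit_params`: the proc names, parameter lists, and return shapes are assumptions chosen for the example, and both procs assume non-zero sizes.

```odin
package fit_sketch

import "core:fmt"

// Fill/cover: crop the texture symmetrically so it fills the target without
// distortion. Returns (u_min, v_min, u_max, v_max).
cover_uv_rect :: proc(target_w, target_h, tex_w, tex_h: f32) -> [4]f32 {
	target_aspect := target_w / target_h
	tex_aspect := tex_w / tex_h
	if tex_aspect > target_aspect {
		// Texture is relatively wider: keep full height, crop left and right.
		visible := target_aspect / tex_aspect
		margin := (1 - visible) / 2
		return [4]f32{margin, 0, 1 - margin, 1}
	}
	// Texture is relatively taller (or equal): keep full width, crop top and bottom.
	visible := tex_aspect / target_aspect
	margin := (1 - visible) / 2
	return [4]f32{0, margin, 1, 1 - margin}
}

// Fit/contain: the full texture is shown (uv_rect stays 0..1) inside a
// shrunken, centered inner rect. Returns (x, y, width, height).
contain_rect :: proc(x, y, target_w, target_h, tex_w, tex_h: f32) -> [4]f32 {
	scale := min(target_w / tex_w, target_h / tex_h)
	w := tex_w * scale
	h := tex_h * scale
	return [4]f32{x + (target_w - w) / 2, y + (target_h - h) / 2, w, h}
}

main :: proc() {
	// A square 100x100 texture drawn into a 200x100 target.
	fmt.println(cover_uv_rect(200, 100, 100, 100))
	fmt.println(contain_rect(0, 0, 200, 100, 100, 100))
}
```

For the square texture in a 2:1 target, cover keeps the middle half of the texture vertically (v from 0.25 to 0.75), while contain draws the whole texture in a 100×100 rect offset 50 units from the left edge. Neither result requires the renderer to know which mode produced it.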
|
||||
|
||||
#### Deferred release
|
||||
|
||||
`unregister_texture` does not immediately release the GPU texture. It queues the slot for release at
|
||||
the end of the current frame, after `SubmitGPUCommandBuffer` has handed work to the GPU. This
|
||||
prevents a race condition where a texture is freed while the GPU is still sampling from it in an
|
||||
already-submitted command buffer. The same deferred-release pattern is applied to `clear_text_cache`
|
||||
and `clear_text_cache_entry`, fixing a pre-existing latent bug where destroying a cached
|
||||
`^sdl_ttf.Text` mid-frame could free an atlas texture still referenced by in-flight draw batches.
|
||||
|
||||
This pattern is standard in production renderers — the RAD Debugger's `r_tex2d_release` queues
|
||||
textures onto a free list that is processed in `r_end_frame`, not at the call site.
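
A minimal sketch of the deferred-release queue, under stated assumptions: `Renderer`, `end_frame`, and the `released` list are hypothetical stand-ins, and the actual GPU texture destruction is replaced by recording the id so the example runs without a GPU device.

```odin
package release_sketch

import "core:fmt"

Texture_Id :: distinct u32

Renderer :: struct {
	pending_release: [dynamic]Texture_Id,
	released:        [dynamic]Texture_Id, // stands in for real GPU texture destruction
}

// Queue the texture; nothing is freed at the call site.
unregister_texture :: proc(r: ^Renderer, id: Texture_Id) {
	append(&r.pending_release, id)
}

// Called once per frame, after the command buffer has been submitted.
end_frame :: proc(r: ^Renderer) {
	for id in r.pending_release {
		append(&r.released, id)
	}
	clear(&r.pending_release)
}

main :: proc() {
	r: Renderer
	defer delete(r.pending_release)
	defer delete(r.released)

	unregister_texture(&r, Texture_Id(7))
	fmt.println(len(r.released)) // 0: the texture survives until the frame ends
	end_frame(&r)
	fmt.println(len(r.released)) // 1: released after submission
}
```

Any draw batch recorded earlier in the same frame can therefore still reference texture 7 safely, because the release only happens once the frame's work has been handed off.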
|
||||
|
||||
#### Clay integration
|
||||
|
||||
Clay's `RenderCommandType.Image` is handled by dereferencing `imageData: rawptr` as a pointer to a
`Clay_Image_Data` struct containing a `Texture_Id`, `Fit_Mode`, and tint color. Routing mirrors the
existing rectangle handling: zero `cornerRadius` dispatches to `draw.rectangle_texture`, nonzero
dispatches to `draw.rectangle_texture_corners`. Both take the SDF path, consistent with the decision
above not to add a separate tessellated `draw.texture` proc. A `fit_params` call computes UVs from
the fit mode before dispatch.
|
||||
|
||||
#### Deferred features
|
||||
|
||||
The following are plumbed in the descriptor but not implemented in phase 1:
|
||||
|
||||
- **Mipmaps**: `Texture_Desc.mip_levels` field exists; generation via SDL3 deferred.
|
||||
- **Compressed formats**: `Texture_Desc.format` accepts BC/ASTC; upload path deferred.
|
||||
- **Render-to-texture**: `Texture_Desc.usage` accepts `.COLOR_TARGET`; render-pass refactor deferred.
|
||||
- **3D textures, arrays, cube maps**: `Texture_Desc.type` and `depth_or_layers` fields exist.
|
||||
- **Additional samplers**: anisotropic, trilinear, clamp-to-border — additive enum values.
|
||||
- **Atlas packing**: internal optimization for sub-batch coalescing; invisible to callers.
|
||||
- **Per-shape texture variants**: `circle_texture`, `ellipse_texture`, etc. — reserved by naming.
|
||||
|
||||
**References:**
|
||||
|
||||
- RAD Debugger render layer (ownership model, deferred release, sampler-at-draw-time):
|
||||
https://github.com/EpicGamesExt/raddebugger — `src/render/render_core.h`, `src/render/d3d11/render_d3d11.c`
|
||||
- Casey Muratori, Handmade Hero day 472 — texture handling as a renderer-owned resource concern,
|
||||
atlases as a separate layer above the renderer.
|
||||
|
||||
## 3D rendering
|
||||
|
||||
3D pipeline architecture is under consideration and will be documented separately. The current
|
||||
expectation is that 3D rendering will use dedicated pipelines (separate from the 2D pipelines)
|
||||
sharing GPU resources (textures, samplers, command buffer lifecycle) with the 2D renderer.
|
||||
|
||||
## Multi-window support
|
||||
|
||||
The renderer currently assumes a single window via the global `GLOB` state. Multi-window support is
|
||||
deferred but anticipated. When revisited, the RAD Debugger's bucket + pass-list model
|
||||
(`src/draw/draw.h`, `src/draw/draw.c` in EpicGamesExt/raddebugger) is worth studying as a reference.
|
||||
|
||||
RAD separates draw submission from rendering via **buckets**. A `DR_Bucket` is an explicit handle
|
||||
that accumulates an ordered list of render passes (`R_PassList`). The user creates a bucket, pushes
|
||||
it onto a thread-local stack, issues draw calls (which target the top-of-stack bucket), then submits
|
||||
the bucket to a specific window. Multiple buckets can exist simultaneously — one per window, or one
|
||||
per UI panel that gets composited into a parent bucket via `dr_sub_bucket`. Implicit draw parameters
|
||||
(clip rect, 2D transform, sampler mode, transparency) are managed via push/pop stacks scoped to each
|
||||
bucket, so different windows can have independent clip and transform state without interference.
|
||||
|
||||
The key properties this gives RAD:
|
||||
|
||||
- **Per-window isolation.** Each window builds its own bucket with its own pass list and state stacks.
|
||||
No global contention.
|
||||
- **Thread-parallel building.** Each thread has its own draw context and arena. Multiple threads can
|
||||
build buckets concurrently, then submit them to the render backend sequentially.
|
||||
- **Compositing.** A pre-built bucket (e.g., a tooltip or overlay) can be injected into another
|
||||
bucket with a transform applied, without rebuilding its draw calls.
|
||||
|
||||
For our library, the likely adaptation would be replacing the single `GLOB` with a per-window draw
|
||||
context that users create and pass to `begin`/`end`, while keeping the explicit-parameter draw call
|
||||
style rather than adopting RAD's implicit state stacks. Texture and sampler resources would remain
|
||||
global (shared across windows), with only the per-frame staging buffers and layer/scissor state
|
||||
becoming per-context.
|
||||
|
||||
## Building shaders
|
||||
|
||||
GLSL shader sources live in `shaders/source/`. Compiled outputs (SPIR-V and Metal Shading Language)
|
||||
|
||||