Update draw README
+235
-176
@@ -9,51 +9,53 @@ The renderer uses a single unified `Pipeline_2D_Base` (`TRIANGLELIST` pipeline)
|
|||||||
modes dispatched by a push constant:
|
modes dispatched by a push constant:
|
||||||
|
|
||||||
- **Mode 0 (Tessellated):** Vertex buffer contains real geometry. Used for text (indexed draws into
|
- **Mode 0 (Tessellated):** Vertex buffer contains real geometry. Used for text (indexed draws into
|
||||||
SDL_ttf atlas textures), single-pixel points (`tes_pixel`), arbitrary user geometry (`tes_triangle`,
|
SDL_ttf atlas textures), single-pixel points (`tess.pixel`), arbitrary user geometry
|
||||||
`tes_triangle_fan`, `tes_triangle_strip`), and shapes without a closed-form rounded-rectangle
|
(`tess.triangle`, `tess.triangle_aa`, `tess.triangle_lines`, `tess.triangle_fan`,
|
||||||
reduction: ellipses (`tes_ellipse`), regular polygons (`tes_polygon`), and circle sectors
|
`tess.triangle_strip`), and any raw vertex geometry submitted via `prepare_shape`. The fragment
|
||||||
(`tes_sector`). The fragment shader computes `out = color * texture(tex, uv)`.
|
shader premultiplies the texture sample (`t.rgb *= t.a`) and computes `out = color * t`.
|
||||||
|
|
||||||
- **Mode 1 (SDF):** A static 6-vertex unit-quad buffer is drawn instanced, with per-primitive
|
- **Mode 1 (SDF):** A static 6-vertex unit-quad buffer is drawn instanced, with per-primitive
|
||||||
`Primitive` structs (80 bytes each) uploaded each frame to a GPU storage buffer. The vertex shader
|
`Primitive` structs (80 bytes each) uploaded each frame to a GPU storage buffer. The vertex shader
|
||||||
reads `primitives[gl_InstanceIndex]`, computes world-space position from unit quad corners +
|
reads `primitives[gl_InstanceIndex]`, computes world-space position from unit quad corners +
|
||||||
primitive bounds. The fragment shader always evaluates `sdRoundedBox` — there is no per-primitive
|
primitive bounds. The fragment shader dispatches on `Shape_Kind` (encoded in the low byte of
|
||||||
kind dispatch.
|
`Primitive.flags`) to evaluate one of four signed distance functions:
|
||||||
|
- **RRect** (kind 1) — `sdRoundedBox` with per-corner radii. Covers rectangles (sharp or rounded),
|
||||||
|
circles (uniform radii = half-size), and line segments / capsules (rotated RRect with uniform
|
||||||
|
radii = half-thickness). Covers filled, outlined, textured, and gradient-filled variants.
|
||||||
|
- **NGon** (kind 2) — `sdRegularPolygon` for regular N-sided polygons.
|
||||||
|
- **Ellipse** (kind 3) — `sdEllipseApprox`, an approximate ellipse SDF suitable for UI rendering.
|
||||||
|
- **Ring_Arc** (kind 4) — annular ring with optional angular clipping via pre-computed edge
|
||||||
|
normals. Covers full rings, partial arcs, and pie slices (`inner_radius = 0`).
|
||||||
|
|
||||||
The SDF path handles all shapes that are algebraically reducible to a rounded rectangle:
|
All SDF shapes support fill, outline, solid color, 2-color linear gradients, 2-color radial
|
||||||
|
gradients, and texture fills via `Shape_Flags` (see `pipeline_2d_base.odin`). Gradient and outline
|
||||||
- **Rounded rectangles** — per-corner radii via `sdRoundedBox` (iq). Covers filled, stroked,
|
parameters are packed into the same 16 bytes as the texture UV rect via a `Uv_Or_Effects` raw union
|
||||||
textured, and gradient-filled rectangles.
|
— zero size increase to the 80-byte `Primitive` struct. Gradient/outline and texture are mutually
|
||||||
- **Circles** — uniform radii equal to half-size. Covers filled, stroked, and radial-gradient circles.
|
exclusive.
|
||||||
- **Line segments / capsules** — rotated RRect with uniform radii equal to half-thickness (stadium shape).
|
|
||||||
- **Full rings / annuli** — stroked circle (mid-radius with stroke thickness = outer - inner).
|
|
||||||
|
|
||||||
All SDF shapes support fill, stroke, solid color, bilinear 4-corner gradients, radial 2-color
|
|
||||||
gradients, and texture fills via `Shape_Flags`. Gradient colors are packed into the same 16 bytes as
|
|
||||||
the texture UV rect via a `Uv_Or_Gradient` raw union — zero size increase to the 80-byte `Primitive`
|
|
||||||
struct. Gradient and texture are mutually exclusive.
|
|
||||||
|
|
||||||
All SDF shapes produce mathematically exact curves with analytical anti-aliasing via `smoothstep` —
|
All SDF shapes produce mathematically exact curves with analytical anti-aliasing via `smoothstep` —
|
||||||
no tessellation, no piecewise-linear approximation. A rounded rectangle is 1 primitive (80 bytes)
|
no tessellation, no piecewise-linear approximation. A rounded rectangle is 1 primitive (80 bytes)
|
||||||
instead of ~250 vertices (~5000 bytes).
|
instead of ~250 vertices (~5000 bytes).
|
||||||
|
|
||||||
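As a rough illustration of the analytical anti-aliasing described above, the following Python sketch ports the widely used `sdRoundedBox` formulation to scalar code and maps the signed distance to coverage with `smoothstep`. The corner ordering of `radii`, the one-pixel transition width, and the helper names are assumptions made for this sketch; the actual shader may differ.

```python
import math

def sd_rounded_box(px, py, half_w, half_h, radii):
    """Signed distance from (px, py) to a rounded box centered at the origin.

    radii = (x+ y+, x+ y-, x- y+, x- y-); this corner ordering is an
    assumption for the sketch, not taken from the shader source.
    """
    # Pick the radius of the quadrant the point lies in.
    rx, ry = (radii[0], radii[1]) if px > 0.0 else (radii[2], radii[3])
    r = rx if py > 0.0 else ry
    qx = abs(px) - half_w + r
    qy = abs(py) - half_h + r
    outside = math.hypot(max(qx, 0.0), max(qy, 0.0))
    inside = min(max(qx, qy), 0.0)
    return inside + outside - r

def smoothstep(edge0, edge1, x):
    """Hermite interpolation with clamping, matching GLSL smoothstep."""
    t = min(max((x - edge0) / (edge1 - edge0), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)

def coverage(distance, aa_width=1.0):
    """Fragment coverage in [0, 1]; aa_width (pixels) is a hypothetical choice."""
    half = 0.5 * aa_width
    return 1.0 - smoothstep(-half, half, distance)
```

With uniform radii equal to the half-size, the same function evaluates a circle: `sd_rounded_box(7, 0, 5, 5, (5, 5, 5, 5))` is the distance from a point 7 units out to a circle of radius 5.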
The fragment shader's estimated register footprint is ~20–23 VGPRs via static live-range analysis.
|
The main pipeline's register budget is **≤24 registers** (see "Main/effects split: register pressure"
|
||||||
RRect and Ring_Arc are roughly tied at peak pressure — RRect carries `corner_radii` (4 regs) plus
|
in the pipeline plan below for the full cliff/margin analysis and SBC architecture context). The
|
||||||
`sdRoundedBox` temporaries, Ring_Arc carries wedge normals plus dot-product temporaries. Both land
|
fragment shader's estimated peak footprint is ~22–26 fp32 VGPRs (~16–22 fp16 VGPRs on architectures
|
||||||
comfortably under Mali Valhall's 32-register occupancy cliff (G57/G77/G78 and later) and well under
|
with native mediump) via manual live-range analysis. The dominant peak is the Ring_Arc kind path
|
||||||
desktop limits. On older Bifrost Mali (G71/G72/G76, 16-register cliff) either shape kind may incur
|
(wedge normals + inner/outer radii + dot-product temporaries live simultaneously with carried state
|
||||||
partial occupancy reduction. These estimates are hand-counted; exact numbers require `malioc` or
|
like `f_color`, `f_uv_or_effects`, and `half_size`). RRect is 1–2 regs lower (`corner_radii` vec4
|
||||||
Radeon GPU Analyzer against the compiled SPIR-V.
|
replaces the separate inner/outer + normal pairs). NGon and Ellipse are lighter still. Real compilers
|
||||||
|
apply live-range coalescing, mediump-to-fp16 promotion, and rematerialization that typically shave
|
||||||
|
2–4 regs from hand-counted estimates — the conservative 26-reg upper bound is expected to compile
|
||||||
|
down to within the 24-register budget, but this must be verified with `malioc` (see "Verifying
|
||||||
|
register counts" below). On V3D and Bifrost architectures (16-register cliff), the compiler
|
||||||
|
statically allocates registers for the worst-case path (Ring_Arc) regardless of which kind any given
|
||||||
|
fragment actually evaluates, so all fragments pay the occupancy cost of the heaviest branch. This is
|
||||||
|
a documented limitation, not a design constraint (see "Known limitations: V3D and Bifrost" below).
|
||||||
|
|
||||||
MSAA is opt-in (default `._1`, no MSAA) via `Init_Options.msaa_samples`. SDF rendering does not
|
MSAA is opt-in (default `._1`, no MSAA) via `Init_Options.msaa_samples`. SDF rendering does not
|
||||||
benefit from MSAA because fragment coverage is computed analytically. MSAA remains useful for text
|
benefit from MSAA because fragment coverage is computed analytically. MSAA remains useful for text
|
||||||
glyph edges and tessellated user geometry if desired.
|
glyph edges and tessellated user geometry if desired.
|
||||||
|
|
||||||
All public drawing procs use prefixed names for clarity: `sdf_*` for SDF-path shapes, `tes_*` for
|
|
||||||
tessellated-path shapes. Proc groups provide a single entry point per shape concept (e.g.,
|
|
||||||
`sdf_rectangle` dispatches to `sdf_rectangle_solid` or `sdf_rectangle_gradient` based on argument
|
|
||||||
count).
|
|
||||||
|
|
||||||
## 2D rendering pipeline plan
|
## 2D rendering pipeline plan
|
||||||
|
|
||||||
This section documents the planned architecture for levlib's 2D rendering system. The design is driven
|
This section documents the planned architecture for levlib's 2D rendering system. The design is driven
|
||||||
@@ -66,22 +68,23 @@ primitives and effects can be added to the library without architectural changes
|
|||||||
The 2D renderer uses three GPU pipelines, split by **register pressure** (main vs effects) and
|
The 2D renderer uses three GPU pipelines, split by **register pressure** (main vs effects) and
|
||||||
**render-pass structure** (everything vs backdrop):
|
**render-pass structure** (everything vs backdrop):
|
||||||
|
|
||||||
1. **Main pipeline** — shapes (SDF and tessellated), text, and textured rectangles. Low register
|
1. **Main pipeline** — shapes (SDF and tessellated), text, and textured rectangles. Register budget:
|
||||||
footprint (~18–24 registers per thread). Runs at full GPU occupancy on every architecture.
|
**≤24 registers** (full occupancy on Valhall and all desktop GPUs). Handles 90%+ of all fragments
|
||||||
Handles 90%+ of all fragments in a typical frame.
|
in a typical frame.
|
||||||
|
|
||||||
2. **Effects pipeline** — drop shadows, inner shadows, outer glow, and similar ALU-bound blur
|
2. **Effects pipeline** — drop shadows, inner shadows, outer glow, and similar ALU-bound blur
|
||||||
effects. Medium register footprint (~48–60 registers). Each effects primitive includes the base
|
effects. Register budget: **≤56 registers** (targets Valhall's second cliff at 64; reduced
|
||||||
|
occupancy at the first cliff is accepted by design). Each effects primitive includes the base
|
||||||
shape's SDF so that it can draw both the effect and the shape in a single fragment pass, avoiding
|
shape's SDF so that it can draw both the effect and the shape in a single fragment pass, avoiding
|
||||||
redundant overdraw. Separated from the main pipeline to protect main-pipeline occupancy on
|
redundant overdraw. Separated from the main pipeline to protect main-pipeline occupancy on
|
||||||
low-end hardware (see register analysis below).
|
low-end hardware (see register analysis below).
|
||||||
|
|
||||||
3. **Backdrop pipeline** — frosted glass, refraction, and any effect that samples the current render
|
3. **Backdrop pipeline** — frosted glass, refraction, and any effect that samples the current render
|
||||||
target as input. Implemented as a multi-pass sequence (downsample, separable blur, composite),
|
target as input. Implemented as a multi-pass sequence (downsample, separable blur, composite),
|
||||||
where each individual pass has a low-to-medium register footprint (~15–40 registers). Separated
|
where each individual sub-pass has a register budget of **≤24 registers** (full occupancy on
|
||||||
from the other pipelines because it structurally requires ending the current render pass and
|
Valhall). Separated from the other pipelines because it structurally requires ending the current
|
||||||
copying the render target before any backdrop-sampling fragment can execute — a command-buffer-
|
render pass and copying the render target before any backdrop-sampling fragment can execute — a
|
||||||
level boundary that cannot be avoided regardless of shader complexity.
|
command-buffer-level boundary that cannot be avoided regardless of shader complexity.
|
||||||
|
|
||||||
A typical UI frame with no effects uses 1 pipeline bind and 0 switches. A frame with drop shadows
|
A typical UI frame with no effects uses 1 pipeline bind and 0 switches. A frame with drop shadows
|
||||||
uses 2 pipelines and 1 switch. A frame with shadows and frosted glass uses all 3 pipelines and 2
|
uses 2 pipelines and 1 switch. A frame with shadows and frosted glass uses all 3 pipelines and 2
|
||||||
@@ -97,56 +100,113 @@ code) or many per-primitive-type pipelines (no branching overhead, lean per-shad
|
|||||||
|
|
||||||
A GPU shader core has a fixed register pool shared among all concurrent threads. The compiler
|
A GPU shader core has a fixed register pool shared among all concurrent threads. The compiler
|
||||||
allocates registers pessimistically based on the worst-case path through the shader. If the shader
|
allocates registers pessimistically based on the worst-case path through the shader. If the shader
|
||||||
contains both a 20-register RRect SDF and a 48-register drop-shadow blur, _every_ fragment — even
|
contains both a 24-register RRect SDF and a 56-register drop-shadow blur, _every_ fragment — even
|
||||||
trivial RRects — is allocated 48 registers. This directly reduces **occupancy** (the number of
|
trivial RRects — is allocated 56 registers. This directly reduces **occupancy** (the number of
|
||||||
warps/wavefronts that can run simultaneously), which reduces the GPU's ability to hide memory
|
warps/wavefronts that can run simultaneously), which reduces the GPU's ability to hide memory
|
||||||
latency.
|
latency.
|
||||||
|
|
||||||
Each GPU architecture has a **register cliff** — a threshold above which occupancy starts dropping.
|
Each GPU architecture has discrete **occupancy cliffs** — register counts above which the number of
|
||||||
Below the cliff, adding registers has zero occupancy cost.
|
concurrent threads drops in a step. Below the cliff, adding registers has zero occupancy cost. One
|
||||||
|
register over, throughput drops sharply.
|
||||||
|
|
||||||
On consumer Ampere/Ada GPUs (RTX 30xx/40xx, 65,536 regs/SM, max 1,536 threads/SM, cliff at ~43 regs):
|
**Target architecture: ARM Mali Valhall (32-register first cliff).** The binding constraint for our
|
||||||
|
register budgets comes from the SBC (single-board computer) market, where Mali Valhall is the
|
||||||
|
dominant current GPU architecture:
|
||||||
|
|
||||||
| Register allocation | Reg-limited threads | Actual (hw-capped) | Occupancy |
|
- **RK3588-class boards** (Orange Pi 5, Radxa Rock 5, Khadas Edge 2, NanoPi R6, Banana Pi M7) ship
|
||||||
| ------------------------ | ------------------- | ------------------ | --------- |
|
**Mali-G610** (Valhall). This is the dominant non-Pi SBC platform. First occupancy cliff at **32
|
||||||
| ~16 regs (main pipeline) | 4,096 | 1,536 | 100% |
|
registers**, second cliff at **64 registers**.
|
||||||
| 32 regs | 2,048 | 1,536 | 100% |
|
- **ARM Mali Valhall** (G57, G77, G78, G610, G710, G715; 2019+) and **5th-gen / Mali-G1** (2024+):
|
||||||
| 48 regs (effects) | 1,365 | 1,365 | ~89% |
|
same cliff structure — first at 32, second at 64.
|
||||||
|
- **ARM Mali Bifrost** (G31, G51, G52, G71, G72, G76; ~2016–2018): first cliff at **16 registers**.
|
||||||
|
Legacy; found on older budget boards (Allwinner H6/H618, Amlogic S922X). See Known limitations
|
||||||
|
below.
|
||||||
|
- **Broadcom V3D 4.x / 7.x** (Raspberry Pi 4 / Pi 5): first cliff at **16 registers**. Outlier in
|
||||||
|
the current SBC market. See Known limitations below.
|
||||||
|
- **Apple M3+**: Dynamic Caching (register file virtualization) eliminates the static cliff entirely.
|
||||||
|
Register allocation happens at runtime based on actual usage.
|
||||||
|
- **Qualcomm Adreno**: dynamic register allocation with soft thresholds; no hard cliff.
|
||||||
|
- **NVIDIA desktop** (Ampere/Ada): cliff at ~43 registers. Not a constraint for any of our pipelines.
|
||||||
|
|
||||||
On Volta/A100 GPUs (65,536 regs/SM, max 2,048 threads/SM, cliff at ~32 regs):
|
**Register budgets and margin.** We target Valhall's 32-register first cliff for the main and
|
||||||
|
backdrop pipelines, and Valhall's 64-register second cliff for the effects pipeline, each with **8
|
||||||
|
registers of margin**:
|
||||||
|
|
||||||
| Register allocation | Reg-limited threads | Actual (hw-capped) | Occupancy |
|
| Pipeline | Cliff targeted | Margin | Register budget | Rationale |
|
||||||
| ------------------------ | ------------------- | ------------------ | --------- |
|
| ------------------- | ---------------------- | ------ | ----------------- | --------------------------------------------------------------------------------------------- |
|
||||||
| ~16 regs (main pipeline) | 4,096 | 2,048 | 100% |
|
| Main pipeline | 32 (Valhall 1st cliff) | 8 | **≤24 regs** | Handles 90%+ of frame fragments; must run at full occupancy |
|
||||||
| 32 regs | 2,048 | 2,048 | 100% |
|
| Backdrop sub-passes | 32 (Valhall 1st cliff) | 8 | **≤24 regs** each | Multi-pass structure keeps each pass small; no reason to give up occupancy |
|
||||||
| 48 regs (effects) | 1,365 | 1,365 | ~67% |
|
| Effects pipeline | 64 (Valhall 2nd cliff) | 8 | **≤56 regs** | Reduced occupancy at 1st cliff accepted by design — the entire point of splitting effects out |
|
||||||
|
|
||||||
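The budget column is simply the targeted cliff minus the margin. The sketch below reproduces that arithmetic and pairs it with a deliberately simplified two-step occupancy model for Valhall (full thread count up to 32 registers, half up to 64). The function names and the step model are assumptions for illustration; measured occupancy should come from `malioc`.

```python
VALHALL_CLIFFS = (32, 64)  # first and second occupancy cliffs, as stated in the text
MARGIN = 8                 # registers of headroom kept below the targeted cliff

def register_budget(cliff, margin=MARGIN):
    """Budget = targeted cliff minus margin."""
    return cliff - margin

def valhall_occupancy(registers):
    """Hypothetical step model: 100% up to the first cliff, 50% up to the second."""
    if registers <= VALHALL_CLIFFS[0]:
        return 1.0
    if registers <= VALHALL_CLIFFS[1]:
        return 0.5
    raise ValueError("beyond the second cliff; not modeled in this sketch")

BUDGETS = {
    "main": register_budget(VALHALL_CLIFFS[0]),
    "backdrop_sub_pass": register_budget(VALHALL_CLIFFS[0]),
    "effects": register_budget(VALHALL_CLIFFS[1]),
}
```

Under this model a shader at 24 registers and one at 32 are indistinguishable, while 33 registers already halves the thread count, which is the asymmetry the margin is meant to guard against.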
On low-end mobile (ARM Mali Bifrost/Valhall, 64 regs/thread, cliff fixed at 32 regs):
|
**Why 8 registers of margin.** Targeting the cliff exactly is fragile. Three forces push register
|
||||||
|
counts upward over a shader's lifetime:
|
||||||
|
|
||||||
| Register allocation | Occupancy |
|
1. **Compiler version changes.** Mali driver releases (r35p0 → r55p0 etc.) ship new register
|
||||||
| -------------------- | -------------------------- |
|
allocators. Shaders typically drift ±2–3 registers between versions on unchanged source.
|
||||||
| 0–32 regs (main) | 100% (full thread count) |
|
2. **Feature additions.** Each new effect, flag, or uniform adds 1–4 live registers. A new gradient
|
||||||
| 33–64 regs (effects) | ~50% (thread count halves) |
|
mode or outline option lands in this range.
|
||||||
|
3. **Precision regressions.** A `mediump` demoted to `highp` (by bug fix, compiler heuristic change,
|
||||||
|
or a contributor unfamiliar with precision qualifiers) costs 2 registers per affected `vec4`.
|
||||||
|
|
||||||
Mali's cliff at 32 registers is the binding constraint. On desktop the occupancy difference between
|
Realistic creep over a couple of years is 4–8 registers. The cost of conservatism is zero — a shader
|
||||||
20 and 48 registers is modest (89–100%); on Mali it is a hard 2× throughput reduction. The
|
at 24 regs runs identically to one at 32 on every Valhall device. The cost of crossing the cliff is
|
||||||
main/effects split protects 90%+ of a frame's fragments (shapes, text, textures) from the effects
|
a 2× throughput drop with no warning. Asymmetric costs justify a generous margin.
|
||||||
pipeline's register cost.
|
|
||||||
|
|
||||||
For the effects pipeline's drop-shadow shader — erf-approximation blur math with several texture
|
**Why the main/effects split exists.** If the main pipeline shader contained both the 24-register
|
||||||
fetches — 50% occupancy on Mali roughly halves throughput. At 4K with 1.5× overdraw (~12.4M
|
SDF path and the ~50-register drop-shadow blur, every fragment — even trivial RRects — would be
|
||||||
|
allocated ~50 registers. On Valhall this crosses the 32-register first cliff, halving occupancy for
|
||||||
|
90%+ of the frame's fragments. Separating effects into their own pipeline means the main pipeline
|
||||||
|
stays at ≤24 registers (full Valhall occupancy), and only the small fraction of fragments that
|
||||||
|
actually render effects (~5–10% in a typical UI) run at reduced occupancy.
|
||||||
|
|
||||||
|
For the effects pipeline's drop-shadow shader — analytical erf-approximation blur (~80 FLOPs, no
|
||||||
|
texture samples) — 50% occupancy on Valhall roughly halves throughput. At 4K with 1.5× overdraw (~12.4M
|
||||||
fragments), a single unified shader containing the shadow branch would cost ~4ms instead of ~2ms on
|
fragments), a single unified shader containing the shadow branch would cost ~4ms instead of ~2ms on
|
||||||
low-end mobile. This is a per-frame multiplier even when the heavy branch is never taken, because the
|
Valhall. This is a per-frame multiplier even when the heavy branch is never taken, because the
|
||||||
compiler allocates registers for the worst-case path.
|
compiler allocates registers for the worst-case path.
|
||||||
|
|
||||||
All main-pipeline members (SDF shapes, tessellated geometry, text, textured rectangles) cluster at
|
The effects pipeline's ≤56-register budget keeps it under Valhall's second cliff at 64, yielding
|
||||||
12–24 registers — below the cliff on every architecture — so unifying them costs nothing in
|
50–67% occupancy on shapes with effects. This is acceptable for the small fraction of frame fragments
|
||||||
occupancy.
|
that effects cover.
|
||||||
|
|
||||||
**Note on Apple M3+ GPUs:** Apple's M3 introduces Dynamic Caching (register file virtualization),
|
**Note on Apple M3+ GPUs:** Apple's M3 Dynamic Caching allocates registers at runtime based on
|
||||||
which allocates registers at runtime based on actual usage rather than worst-case. This weakens the
|
actual usage rather than worst-case. This eliminates the static register-pressure argument on M3 and
|
||||||
static register-pressure argument on M3 and later, but the split remains useful for isolating blur
|
later, but the split remains useful for isolating blur ALU complexity and keeping the backdrop
|
||||||
ALU complexity and keeping the backdrop texture-copy out of the main render pass.
|
texture-copy out of the main render pass.
|
||||||
|
|
||||||
|
**Note on NVIDIA desktop GPUs:** On consumer Ampere/Ada (cliff at ~43 regs), a ~48-register effects
|
||||||
|
shader runs at ~89% occupancy, and the full ≤56-register budget at ~76%. On Volta/A100
|
||||||
|
(cliff at ~32 regs), the same footprints give ~67% and ~57%. In both cases the main pipeline runs at
|
||||||
|
100% occupancy. Desktop GPUs are not the binding constraint; Valhall is.
|
||||||
|
|
||||||
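The desktop occupancy figures follow from dividing the per-SM register file by the per-thread allocation and capping at the hardware thread limit. The sketch below is a back-of-the-envelope model using the 65,536-register file and the 1,536 / 2,048 thread caps quoted for Ampere/Ada and Volta/A100. It ignores warp and register allocation granularity, and the function names are hypothetical.

```python
def sm_occupancy(regs_per_thread, regs_per_sm=65_536, max_threads_per_sm=1_536):
    """Fraction of the hardware thread cap that fits in the register file."""
    reg_limited_threads = regs_per_sm // regs_per_thread
    resident_threads = min(reg_limited_threads, max_threads_per_sm)
    return resident_threads / max_threads_per_sm

def register_cliff(regs_per_sm=65_536, max_threads_per_sm=1_536):
    """Per-thread register count above which occupancy starts to drop."""
    return regs_per_sm / max_threads_per_sm

# Consumer Ampere/Ada: 1,536 threads per SM.
ampere_at_48 = sm_occupancy(48)
# Volta/A100: 2,048 threads per SM.
volta_at_48 = sm_occupancy(48, max_threads_per_sm=2_048)
```

Evaluating `register_cliff()` for the two thread caps gives roughly 43 and exactly 32 registers, matching the cliffs quoted above.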
|
#### Known limitations: V3D and Bifrost (16-register cliff)
|
||||||
|
|
||||||
|
Broadcom V3D 4.x / 7.x (Raspberry Pi 4 / Pi 5) and ARM Mali Bifrost (G31, G51, G52, G71, G72, G76)
|
||||||
|
have a first occupancy cliff at **16 registers**. All three of our pipelines exceed this cliff — even
|
||||||
|
the main pipeline's ≤24-register budget is above 16. On these architectures, every shader runs at
|
||||||
|
reduced occupancy regardless of which shape kind or effect is active.
|
||||||
|
|
||||||
|
Restoring full occupancy on V3D / Bifrost would require a fundamentally different shader
|
||||||
|
architecture: per-shape-kind pipeline splitting (one pipeline per SDF kind, each with a minimal
|
||||||
|
register footprint under 16). This conflicts with the unified-pipeline design that enables single
|
||||||
|
draw calls per scissor, submission-order Z preservation, and low PSO compilation cost. It would
|
||||||
|
effectively be the GPUI-style approach whose tradeoffs are analyzed in "Why not per-primitive-type
|
||||||
|
pipelines" below.
|
||||||
|
|
||||||
|
We treat this as a documented limitation, not a design constraint. The 16-register cliff is legacy
|
||||||
|
(Bifrost) or a single-vendor outlier (V3D). The dominant current SBC platform (RK3588 / Mali-G610)
|
||||||
|
and all mainstream mobile and desktop GPUs have cliffs at 32 or higher. The long-term direction in
|
||||||
|
GPU architecture is toward eliminating static cliffs entirely (Apple Dynamic Caching, Adreno dynamic
|
||||||
|
allocation).
|
||||||
|
|
||||||
|
#### Verifying register counts
|
||||||
|
|
||||||
|
The register estimates in this document are hand-counted via manual live-range analysis (see Current
|
||||||
|
state). Shader changes that affect the main or effects pipeline should be verified with `malioc`
|
||||||
|
(ARM Mali Offline Compiler) against current Valhall driver versions before merging. `malioc` reports
|
||||||
|
exact register allocation, spilling, and occupancy for each Mali generation. On desktop, Radeon GPU
|
||||||
|
Analyzer (RGA) and NVIDIA Nsight provide equivalent data. Replacing the hand-counted estimates with
|
||||||
|
measured `malioc` numbers is a follow-up task.
|
||||||
|
|
||||||
#### Backdrop split: render-pass structure
|
#### Backdrop split: render-pass structure
|
||||||
|
|
||||||
@@ -156,10 +216,11 @@ render target must be copied to a separate texture via `CopyGPUTextureToTexture`
|
|||||||
level operation that requires ending the current render pass. This boundary exists regardless of
|
level operation that requires ending the current render pass. This boundary exists regardless of
|
||||||
shader complexity and cannot be optimized away.
|
shader complexity and cannot be optimized away.
|
||||||
|
|
||||||
The backdrop pipeline's individual shader passes (downsample, separable blur, composite) are
|
The backdrop pipeline's individual shader passes (downsample, separable blur, composite) are budgeted
|
||||||
register-light (~15–40 regs each), so merging them into the effects pipeline would cause no occupancy
|
at ≤24 registers each (same as the main pipeline), so merging them into the effects pipeline would
|
||||||
problem. But the render-pass boundary makes merging structurally impossible — effects draws happen
|
cause no occupancy problem. But the render-pass boundary makes merging structurally impossible —
|
||||||
inside the main render pass, backdrop draws happen inside their own bracketed pass sequence.
|
effects draws happen inside the main render pass, backdrop draws happen inside their own bracketed
|
||||||
|
pass sequence.
|
||||||
|
|
||||||
#### Why not per-primitive-type pipelines (GPUI's approach)
|
#### Why not per-primitive-type pipelines (GPUI's approach)
|
||||||
|
|
||||||
@@ -271,18 +332,23 @@ There are three categories of branch condition in a fragment shader, ranked by c
|
|||||||
|
|
||||||
#### Which category our branches fall into
|
#### Which category our branches fall into
|
||||||
|
|
||||||
Our design has two branch points:
|
Our design has three branch points:
|
||||||
|
|
||||||
1. **`mode` (push constant): tessellated vs. SDF.** This is category 2 — uniform per draw call.
|
1. **`mode` (push constant): tessellated vs. SDF.** This is category 2 — uniform per draw call.
|
||||||
Every thread in every warp of a draw call sees the same `mode` value. **Zero divergence, zero
|
Every thread in every warp of a draw call sees the same `mode` value. **Zero divergence, zero
|
||||||
cost.**
|
cost.**
|
||||||
|
|
||||||
2. **`flags` (flat varying from storage buffer): gradient/texture/stroke mode.** This is category 3.
|
2. **`kind` (flat varying from storage buffer): SDF shape kind dispatch.** This is category 3.
|
||||||
The `flat` interpolation qualifier ensures that all fragments rasterized from one primitive's quad
|
The low byte of `Primitive.flags` encodes `Shape_Kind` (RRect, NGon, Ellipse, Ring_Arc), passed
|
||||||
receive the same flag bits. However, since the SDF path now evaluates only `sdRoundedBox` with no
|
to the fragment shader as a `flat` varying. All fragments of one primitive's quad receive the same
|
||||||
kind dispatch, the only flag-dependent branches are gradient vs. texture vs. solid color selection
|
kind value. The fragment shader's `if/else if` chain selects the appropriate SDF function (~15–30
|
||||||
— all lightweight (3–8 instructions per path). Divergence at primitive boundaries between
|
instructions per kind). Divergence occurs only at primitive boundaries where adjacent quads have
|
||||||
different flag combinations has negligible cost.
|
different kinds.
|
||||||
|
|
||||||
|
3. **`flags` (flat varying from storage buffer): gradient/texture/outline mode.** Also category 3.
|
||||||
|
The upper bits of `Primitive.flags` encode `Shape_Flags`, controlling gradient vs. texture vs.
|
||||||
|
solid color selection and outline rendering — all lightweight branches (3–8 instructions per
|
||||||
|
path). Divergence at primitive boundaries between different flag combinations has negligible cost.
|
||||||
|
|
||||||
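To make the layout of `Primitive.flags` concrete, here is a small Python sketch of packing a shape kind into the low byte and flag bits into the upper bits. The kind values 1 to 4 come from the list of SDF kinds in this README; the individual flag bit positions, the 8-bit shift, and the Python names are hypothetical and only mirror the idea, not the definitions in `pipeline_2d_base.odin`.

```python
from enum import IntEnum, IntFlag

class ShapeKind(IntEnum):
    RRECT = 1
    NGON = 2
    ELLIPSE = 3
    RING_ARC = 4

class ShapeFlags(IntFlag):
    # Hypothetical bit assignments, for illustration only.
    TEXTURED = 1 << 0
    GRADIENT = 1 << 1
    GRADIENT_RADIAL = 1 << 2
    OUTLINE = 1 << 3

KIND_MASK = 0xFF   # low byte holds the shape kind
FLAGS_SHIFT = 8    # assumed: flag bits start directly above the kind byte

def pack_flags(kind, flags):
    """Combine kind (low byte) and flag bits (upper bits) into one integer."""
    return (int(flags) << FLAGS_SHIFT) | (int(kind) & KIND_MASK)

def unpack_flags(packed):
    """Split a packed value back into (kind, flags), as a shader would by masking."""
    return ShapeKind(packed & KIND_MASK), ShapeFlags(packed >> FLAGS_SHIFT)

packed = pack_flags(ShapeKind.RING_ARC, ShapeFlags.GRADIENT | ShapeFlags.OUTLINE)
```

Because both fields travel in one flat varying, every fragment of a primitive's quad decodes the same kind and the same flag bits, which is what keeps both branch points in category 3.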
For category 3, the divergence analysis depends on primitive size:
|
For category 3, the divergence analysis depends on primitive size:
|
||||||
|
|
||||||
@@ -299,11 +365,12 @@ For category 3, the divergence analysis depends on primitive size:
|
|||||||
frame-level divergence is typically **1–3%** of all warps.
|
frame-level divergence is typically **1–3%** of all warps.
|
||||||
|
|
||||||
At 1–3% divergence, the throughput impact is negligible. At 4K with 12.4M total fragments
|
At 1–3% divergence, the throughput impact is negligible. At 4K with 12.4M total fragments
|
||||||
(~387,000 warps), divergent boundary warps number in the low thousands. Without kind dispatch, the
|
(~387,000 warps), divergent boundary warps number in the low thousands. The longest SDF kind branch
|
||||||
longest untaken branch is the gradient evaluation (~8 instructions), not a different SDF function.
|
is Ring_Arc (~30 instructions); when a divergent warp straddles two different kinds, it pays the cost
|
||||||
Each divergent warp pays at most ~8 extra instructions. At ~12G instructions/sec on a mid-range GPU,
|
of both (~45–60 instructions total). Each divergent warp's extra cost is modest — at ~12G
|
||||||
that totals ~1.3μs — under 0.02% of an 8.3ms (120 FPS) frame budget. This is
|
instructions/sec on a mid-range GPU, even 3,000 divergent warps × 60 extra instructions totals
|
||||||
confirmed by production renderers that use exactly this pattern:
|
~15μs, under 0.2% of an 8.3ms (120 FPS) frame budget. This is confirmed by production renderers
|
||||||
|
that use exactly this pattern:
|
||||||
|
|
||||||
- **vger / vger-rs** (Audulus): single pipeline, 11 primitive kinds dispatched by a `switch` on a
|
- **vger / vger-rs** (Audulus): single pipeline, 11 primitive kinds dispatched by a `switch` on a
|
||||||
flat varying `prim_type`. Ships at 120 FPS on iPads. The author (Taylor Holliday) replaced nanovg
|
flat varying `prim_type`. Ships at 120 FPS on iPads. The author (Taylor Holliday) replaced nanovg
|
||||||
@@ -327,10 +394,10 @@ our design:
|
|||||||
> have no per-fragment data-dependent branches in the main pipeline.
|
> have no per-fragment data-dependent branches in the main pipeline.
|
||||||
|
|
||||||
2. **Branches where both paths are very long.** If both sides of a branch are 500+ instructions,
|
2. **Branches where both paths are very long.** If both sides of a branch are 500+ instructions,
|
||||||
divergent warps pay double a large cost. Without kind dispatch, the SDF path always evaluates
|
divergent warps pay double a large cost. Our SDF kind branches are short (~15–30 instructions
|
||||||
`sdRoundedBox`; the only branches are gradient/texture/solid color selection at 3–8 instructions
|
each), and the gradient/texture/solid color selection branches are shorter still (3–8 instructions
|
||||||
each. Even fully divergent, the penalty is ~8 extra instructions — less than a single texture
|
each). Even fully divergent, the combined penalty is ~30–60 extra instructions — comparable to a
|
||||||
sample's latency.
|
single texture sample's latency.
|
||||||
|
|
||||||
3. **Branches that prevent compiler optimizations.** Some compilers cannot schedule instructions
|
3. **Branches that prevent compiler optimizations.** Some compilers cannot schedule instructions
|
||||||
across branch boundaries, reducing VLIW utilization on older architectures. Modern GPUs (NVIDIA
|
across branch boundaries, reducing VLIW utilization on older architectures. Modern GPUs (NVIDIA
|
||||||
@@ -338,9 +405,10 @@ our design:
|
|||||||
concern.
|
concern.
|
||||||
|
|
||||||
4. **Register pressure from the union of all branches.** This is the real cost, and it is why we
|
4. **Register pressure from the union of all branches.** This is the real cost, and it is why we
|
||||||
split heavy effects (shadows, glass) into separate pipelines. Within the main pipeline, the SDF
|
split heavy effects into separate pipelines. Within the main pipeline, the four
|
||||||
path has a single evaluation (sdRoundedBox) with flag-based color selection, clustering at ~15–18
|
SDF kind branches and flag-based color selection cluster at ~22–26 registers (see register
|
||||||
registers, so there is negligible occupancy loss.
|
analysis in Current state), expected to compile down to the ≤24-register budget for full occupancy on
|
||||||
|
Valhall and all desktop architectures. See Known limitations for V3D / Bifrost.
|
||||||
|
|
||||||
**References:**
|
**References:**
|
||||||
|
|
||||||
@@ -361,19 +429,20 @@ our design:
|
|||||||
### Main pipeline: SDF + tessellated (unified)
|
### Main pipeline: SDF + tessellated (unified)
|
||||||
|
|
||||||
The main pipeline serves two submission modes through a single `TRIANGLELIST` pipeline and a single
|
The main pipeline serves two submission modes through a single `TRIANGLELIST` pipeline and a single
|
||||||
vertex input layout, distinguished by a mode marker in the `Primitive.flags` field (low byte:
|
vertex input layout, distinguished by a `mode` field in the `Vertex_Uniforms` push constant
|
||||||
0 = tessellated, 1 = SDF). The tessellated path sets this to 0 via zero-initialization in the vertex
|
(`Draw_Mode.Tessellated = 0`, `Draw_Mode.SDF = 1`), pushed per draw call via `push_globals`. The
|
||||||
shader; the SDF path sets it to 1 via `pack_flags`.
|
vertex shader branches on this uniform to select the tessellated or SDF code path.
|
||||||
|
|
||||||
- **Tessellated mode** (`mode = 0`): direct vertex buffer with explicit geometry. Used for text
|
- **Tessellated mode** (`mode = 0`): direct vertex buffer with explicit geometry. Used for text
|
||||||
(SDL_ttf atlas sampling), triangle fans/strips, ellipses, regular polygons, circle sectors, and
|
(SDL_ttf atlas sampling), triangles, triangle fans/strips, single-pixel points, and any
|
||||||
any user-provided raw vertex geometry.
|
user-provided raw vertex geometry.
|
||||||
- **SDF mode** (`mode = 1`): shared unit-quad vertex buffer + GPU storage buffer of `Primitive`
|
- **SDF mode** (`mode = 1`): shared unit-quad vertex buffer + GPU storage buffer of `Primitive`
|
||||||
structs, drawn instanced. Used for all shapes with closed-form signed distance functions.
|
structs, drawn instanced. Used for all shapes with closed-form signed distance functions.
|
||||||
|
|
||||||
Both modes use the same fragment shader. The fragment shader checks the mode marker: mode 0 computes
|
Both modes use the same fragment shader. The fragment shader checks `Shape_Kind` (low byte of
|
||||||
`out = color * texture(tex, uv)`; mode 1 always evaluates `sdRoundedBox` and applies
|
`Primitive.flags`): kind 0 (`Solid`) is the tessellated path, which premultiplies the texture sample
|
||||||
gradient/texture/solid color based on flag bits.
|
and computes `out = color * t`; kinds 1–4 dispatch to one of four SDF functions (RRect, NGon,
|
||||||
|
Ellipse, Ring_Arc) and apply gradient/texture/outline/solid color based on `Shape_Flags` bits.
|
||||||
|
|
||||||
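A minimal sketch of the kind 0 (tessellated) color computation described above, written on the CPU side in Odin for illustration. The proc names `premultiply` and `shade_tessellated` are hypothetical; only the `t.rgb *= t.a` and `out = color * t` steps come from the text.

```odin
package premultiply_sketch

import "core:fmt"

// Scale the color channels of a texture sample by its alpha (t.rgb *= t.a).
premultiply :: proc(t: [4]f32) -> [4]f32 {
	return {t.r * t.a, t.g * t.a, t.b * t.a, t.a}
}

// Tessellated path: out = color * t, with t premultiplied first.
shade_tessellated :: proc(color, sample: [4]f32) -> [4]f32 {
	return color * premultiply(sample) // component-wise array multiply
}

main :: proc() {
	white := [4]f32{1, 1, 1, 1}
	half_transparent_red := [4]f32{1, 0, 0, 0.5}
	fmt.println(shade_tessellated(white, half_transparent_red))
}
```

Premultiplying before the tint multiply keeps the output consistent with a premultiplied-alpha blend state, which is presumably why the shader does it in this order.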
#### Why SDF for shapes
|
#### Why SDF for shapes
|
||||||
|
|
||||||
@@ -425,47 +494,39 @@ The tessellated path retains the existing direct vertex buffer layout (20 bytes/
|
|||||||
buffer access). The vertex shader branch on `mode` (push constant) is warp-uniform — every invocation
|
buffer access). The vertex shader branch on `mode` (push constant) is warp-uniform — every invocation
|
||||||
in a draw call has the same mode — so it is effectively free on all modern GPUs.
|
in a draw call has the same mode — so it is effectively free on all modern GPUs.
|
||||||
|
|
||||||
#### Shape folding
|
#### Shape kinds and SDF dispatch
|
||||||
|
|
||||||
The SDF path evaluates a single function — `sdRoundedBox` — for all primitives. There is no
|
The fragment shader dispatches on `Shape_Kind` (low byte of `Primitive.flags`) to evaluate one of
|
||||||
`Shape_Kind` enum or per-primitive kind dispatch in the fragment shader. Shapes that are algebraically
|
four signed distance functions. The `Shape_Kind` enum and per-kind `*_Params` structs are defined in
|
||||||
special cases of a rounded rectangle are emitted as RRect primitives by the CPU-side drawing procs:
|
`pipeline_2d_base.odin`. CPU-side drawing procs in `shapes.odin` build the appropriate `Primitive`
|
||||||
|
and set the kind automatically:
|
||||||
|
|
||||||
| User-facing shape | RRect mapping | Notes |
|
| User-facing proc | Shape_Kind | SDF function | Notes |
|
||||||
| ---------------------------- | -------------------------------------------- | ---------------------------------------- |
|
| -------------------- | ---------- | ------------------ | ---------------------------------------------------------- |
|
||||||
| Rectangle (sharp or rounded) | Direct | Per-corner radii from `radii` param |
|
| `rectangle` | `RRect` | `sdRoundedBox` | Per-corner radii from `radii` param |
|
||||||
| Circle | `half_size = (r, r)`, `radii = (r, r, r, r)` | Uniform radii = half-size |
|
| `rectangle_texture` | `RRect` | `sdRoundedBox` | Textured fill; `.Textured` flag set |
|
||||||
| Line segment / capsule | Rotated RRect, `radii = half_thickness` | Stadium shape (fully-rounded minor axis) |
|
| `circle` | `RRect` | `sdRoundedBox` | Uniform radii = half-size (circle is a degenerate RRect) |
|
||||||
| Full ring / annulus | Stroked circle at mid-radius | `stroke_px = outer - inner` |
|
| `line`, `line_strip` | `RRect` | `sdRoundedBox` | Rotated capsule — stadium shape (radii = half-thickness) |
|
||||||
|
| `ellipse` | `Ellipse` | `sdEllipseApprox` | Approximate ellipse SDF (fast, suitable for UI) |
|
||||||
|
| `polygon` | `NGon` | `sdRegularPolygon` | Regular N-sided polygon inscribed in a circle |
|
||||||
|
| `ring` (full) | `Ring_Arc` | Annular radial SDF | `max(inner - r, r - outer)` with no angular clipping |
|
||||||
|
| `ring` (partial arc) | `Ring_Arc` | Annular radial SDF | Pre-computed edge normals for angular wedge mask |
|
||||||
|
| `ring` (pie slice) | `Ring_Arc` | Annular radial SDF | `inner_radius = 0`, angular clipping via `start/end_angle` |
|
||||||
|
|
||||||
Shapes without a closed-form RRect reduction are drawn via the tessellated path:
|
The `Shape_Flags` bit set controls per-primitive rendering mode (outline, gradient, texture, rotation,
|
||||||
|
arc geometry). See the `Shape_Flag` enum in `pipeline_2d_base.odin` for the authoritative flag
|
||||||
| Shape | Tessellated proc | Method |
|
definitions and bit assignments.
|
||||||
| ------------------------- | ---------------------------------- | -------------------------- |
|
|
||||||
| Ellipse | `tes_ellipse`, `tes_ellipse_lines` | Triangle fan approximation |
|
|
||||||
| Regular polygon (N-gon) | `tes_polygon`, `tes_polygon_lines` | Triangle fan from center |
|
|
||||||
| Circle sector (pie slice) | `tes_sector` | Triangle fan arc |
|
|
||||||
|
|
||||||
The `Shape_Flags` bit set controls rendering mode per primitive:
|
|
||||||
|
|
||||||
| Flag | Bit | Effect |
|
|
||||||
| ----------------- | --- | -------------------------------------------------------------------- |
|
|
||||||
| `Stroke` | 0 | Outline instead of fill (`d = abs(d) - stroke_width/2`) |
|
|
||||||
| `Textured` | 1 | Sample texture using `uv.uv_rect` (mutually exclusive with Gradient) |
|
|
||||||
| `Gradient` | 2 | Bilinear 4-corner interpolation from `uv.corner_colors` |
|
|
||||||
| `Gradient_Radial` | 3 | Radial 2-color falloff (inner/outer) from `uv.corner_colors[0..1]` |
|
|
||||||
|
|
||||||
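The table above can be illustrated with a CPU-side sketch of two of the distance functions. This is an assumption-laden illustration, not the shader source: `sd_rounded_box` here takes a single uniform radius (the real `sdRoundedBox` takes per-corner radii), `sd_ring` implements only the radial term `max(inner - r, r - outer)` without angular clipping, and both proc names are hypothetical.

```odin
package sdf_sketch

import "core:fmt"
import "core:math"

// Rounded-box signed distance with one uniform corner radius.
// p is the sample point relative to the box center.
sd_rounded_box :: proc(p, half_size: [2]f32, radius: f32) -> f32 {
	qx := abs(p.x) - half_size.x + radius
	qy := abs(p.y) - half_size.y + radius
	ox := max(qx, 0)
	oy := max(qy, 0)
	return math.sqrt(ox * ox + oy * oy) + min(max(qx, qy), 0) - radius
}

// Annular radial distance: negative between the inner and outer radius.
sd_ring :: proc(p: [2]f32, inner, outer: f32) -> f32 {
	r := math.sqrt(p.x * p.x + p.y * p.y)
	return max(inner - r, r - outer)
}

main :: proc() {
	// A circle is the rounded box with half_size = (r, r) and radius = r,
	// so the result reduces to length(p) - r.
	fmt.println(sd_rounded_box({3, 4}, {5, 5}, 5)) // point on the radius-5 circle
	fmt.println(sd_ring({3, 4}, 2, 6))             // point inside the annulus
}
```

With `half_size = (r, r)` and `radius = r` the box term collapses to `length(p) - r`, which is why `circle` can share the `RRect` kind instead of needing its own branch.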
**What stays tessellated:**
|
**What stays tessellated:**
|
||||||
|
|
||||||
- Text (SDL_ttf atlas, pending future MSDF evaluation)
|
- Text (SDL_ttf atlas, pending future MSDF evaluation)
|
||||||
- Ellipses (`tes_ellipse`, `tes_ellipse_lines`)
|
- `tess.pixel` (single-pixel points)
|
||||||
- Regular polygons (`tes_polygon`, `tes_polygon_lines`)
|
- `tess.triangle`, `tess.triangle_aa`, `tess.triangle_lines` (single triangles)
|
||||||
- Circle sectors / pie slices (`tes_sector`)
|
- `tess.triangle_fan`, `tess.triangle_strip` (arbitrary user-provided geometry)
|
||||||
- `tes_triangle`, `tes_triangle_fan`, `tes_triangle_strip` (arbitrary user-provided geometry)
|
|
||||||
- Any raw vertex geometry submitted via `prepare_shape`
|
- Any raw vertex geometry submitted via `prepare_shape`
|
||||||
|
|
||||||
The design rule: if the shape reduces to `sdRoundedBox`, it goes SDF. If it requires a different SDF
|
The design rule: if the shape has a closed-form SDF, it goes through the SDF path with its own
|
||||||
function or is described by a vertex list, it stays tessellated.
|
`Shape_Kind`. If it is described by a vertex list or has no practical SDF, it stays tessellated.
|
||||||
|
|
||||||
### Effects pipeline
|
### Effects pipeline
|
||||||
|
|
||||||
@@ -538,10 +599,9 @@ while also writing to it.
|
|||||||
**Multi-pass implementation.** Backdrop effects are implemented as separable multi-pass sequences
|
**Multi-pass implementation.** Backdrop effects are implemented as separable multi-pass sequences
|
||||||
(downsample → horizontal blur → vertical blur → composite), following the standard approach used by
|
(downsample → horizontal blur → vertical blur → composite), following the standard approach used by
|
||||||
iOS `UIVisualEffectView`, Android `RenderEffect`, and Flutter's `BackdropFilter`. Each individual
|
iOS `UIVisualEffectView`, Android `RenderEffect`, and Flutter's `BackdropFilter`. Each individual
|
||||||
pass has a low-to-medium register footprint (~15–40 registers), well within the main pipeline's
|
sub-pass is budgeted at **≤24 registers** (same as the main pipeline — full Valhall occupancy). The
|
||||||
occupancy range. The multi-pass approach avoids the monolithic 70+ register shader that a single-pass
|
multi-pass approach avoids the monolithic 70+ register shader that a single-pass Gaussian blur would
|
||||||
Gaussian blur would require, making backdrop effects viable on low-end mobile GPUs (including
|
require, keeping each sub-pass well under the 32-register cliff.
|
||||||
Mali-G31 and VideoCore VI) where per-thread register limits are tight.
|
|
||||||
|
|
||||||
**Bracketed execution.** All backdrop draws in a frame share a single bracketed region of the command
|
**Bracketed execution.** All backdrop draws in a frame share a single bracketed region of the command
|
||||||
buffer: end the current render pass, copy the render target, execute all backdrop sub-passes, then
|
buffer: end the current render pass, copy the render target, execute all backdrop sub-passes, then
|
||||||
@@ -549,14 +609,13 @@ resume normal drawing. The entry/exit cost (texture copy + render-pass break) is
|
|||||||
regardless of how many backdrop effects are visible. When no backdrop effects are present, the bracket
|
regardless of how many backdrop effects are visible. When no backdrop effects are present, the bracket
|
||||||
is never entered and the texture copy never happens — zero cost.
|
is never entered and the texture copy never happens — zero cost.
|
||||||
|
|
||||||
**Why not split the backdrop sub-passes into separate pipelines?** The individual passes range from
|
**Why not split the backdrop sub-passes into separate pipelines?** Each sub-pass is budgeted at ≤24
|
||||||
~15 to ~40 registers, which does cross Mali's 32-register cliff. However, the register-pressure argument
|
registers, well under Valhall's 32-register cliff, so there is no occupancy motivation for splitting.
|
||||||
that justifies the main/effects split does not apply here. The main/effects split protects the
|
The sub-passes also have no common-vs-uncommon distinction — if backdrop effects are active, every
|
||||||
_common path_ (90%+ of frame fragments) from the uncommon path's register cost. Inside the backdrop
|
sub-pass runs; if not, none run. The backdrop pipeline either executes as a complete unit or not at
|
||||||
pipeline there is no common-vs-uncommon distinction — if backdrop effects are active, every sub-pass
|
all. Additionally, backdrop effects cover a small fraction of the frame's total fragments (~5% at
|
||||||
runs; if not, none run. The backdrop pipeline either executes as a complete unit or not at all.
|
typical UI scales), so even if a sub-pass did cross a cliff, the occupancy variation within the
|
||||||
Additionally, backdrop effects cover a small fraction of the frame's total fragments (~5% at typical
|
bracket would have negligible impact on frame time.
|
||||||
UI scales), so the occupancy variation within the bracket has negligible impact on frame time.
|
|
||||||
|
|
||||||
### Vertex layout
|
### Vertex layout
|
||||||
|
|
||||||
@@ -590,10 +649,14 @@ Primitive :: struct {
|
|||||||
// Total: 80 bytes (std430 aligned)
|
// Total: 80 bytes (std430 aligned)
|
||||||
```
|
```
|
||||||
|
|
||||||
`RRect_Params` holds the rounded-rectangle parameters directly — there is no `Shape_Params` union.
|
`Shape_Params` is a `#raw_union` over `RRect_Params`, `NGon_Params`, `Ellipse_Params`, and
|
||||||
`Uv_Or_Gradient` is a `#raw_union` that aliases `[4]f32` (texture UV rect) with `[4]Color` (gradient
|
`Ring_Arc_Params` (plus a `raw: [8]f32` view), defined in `pipeline_2d_base.odin`. Each SDF kind
|
||||||
corner colors, clockwise from top-left: TL, TR, BR, BL). The `flags` field encodes both the
|
writes its own params variant; the fragment shader reads the appropriate fields based on `Shape_Kind`.
|
||||||
tessellated/SDF mode marker (low byte) and shape flags (bits 8+) via `pack_flags`.
|
`Uv_Or_Effects` is a `#raw_union` that aliases `[4]f32` (texture UV rect: u_min, v_min, u_max,
|
||||||
|
v_max) with a `Gradient_Outline` struct containing `gradient_color: Color`, `outline_color: Color`,
|
||||||
|
`gradient_dir_sc: u32` (packed f16 cos/sin pair), and `outline_packed: u32` (packed f16 outline
|
||||||
|
width). The `flags` field encodes the `Shape_Kind` in the low byte and `Shape_Flags` in bits 8+
|
||||||
|
via `pack_kind_flags`.
|
||||||
|
|
||||||
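The `flags` layout described above (kind in the low byte, flags in bits 8+) can be sketched as follows. This is not the code in `pipeline_2d_base.odin`: the `Shape_Kind` values are assumed from the order given in the text (0 = `Solid`, 1–4 = the four SDF kinds), the `Shape_Flag` names and bit positions are hypothetical, and `pack_kind_flags_sketch` and the two unpack procs are illustrative stand-ins.

```odin
package pack_sketch

import "core:fmt"

// Assumed numbering: 0 is the tessellated path, 1-4 are the SDF kinds.
Shape_Kind :: enum u8 {
	Solid    = 0,
	RRect    = 1,
	NGon     = 2,
	Ellipse  = 3,
	Ring_Arc = 4,
}

// Hypothetical flag names and bit positions, for illustration only.
Shape_Flag :: enum u8 {
	Textured = 0,
	Gradient = 1,
	Outline  = 2,
}
Shape_Flags :: bit_set[Shape_Flag; u8]

// Kind goes in the low byte, the flag bits start at bit 8.
pack_kind_flags_sketch :: proc(kind: Shape_Kind, flags: Shape_Flags) -> u32 {
	return u32(kind) | (u32(transmute(u8)flags) << 8)
}

unpack_kind :: proc(packed: u32) -> Shape_Kind {
	return Shape_Kind(packed & 0xFF)
}

unpack_flags :: proc(packed: u32) -> Shape_Flags {
	return transmute(Shape_Flags)u8((packed >> 8) & 0xFF)
}

main :: proc() {
	packed := pack_kind_flags_sketch(.Ring_Arc, {.Gradient, .Outline})
	fmt.println(unpack_kind(packed), unpack_flags(packed))
}
```

Keeping the kind in the low byte lets the fragment shader recover it with a single mask before testing any flag bits.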
### Draw submission order
|
### Draw submission order
|
||||||
|
|
||||||
@@ -622,8 +685,8 @@ allow resolution-independent glyph rendering from a single small atlas per font.
|
|||||||
|
|
||||||
- Offline atlas generation via Chlumský's msdf-atlas-gen tool.
|
- Offline atlas generation via Chlumský's msdf-atlas-gen tool.
|
||||||
- Runtime glyph metrics via `vendor:stb/truetype` (already in the Odin distribution).
|
- Runtime glyph metrics via `vendor:stb/truetype` (already in the Odin distribution).
|
||||||
- A new MSDF glyph mode in the fragment shader, which would require reintroducing a mode/kind
|
- A new MSDF glyph `Shape_Kind` in the fragment shader (additive — the kind dispatch infrastructure
|
||||||
distinction (the current shader evaluates only `sdRoundedBox` with no kind dispatch).
|
already exists for the four current SDF kinds).
|
||||||
- Potential removal of the SDL_ttf dependency.
|
- Potential removal of the SDL_ttf dependency.
|
||||||
|
|
||||||
This is explicitly deferred. The SDF shape migration is independent of and does not block text
|
This is explicitly deferred. The SDF shape migration is independent of and does not block text
|
||||||
@@ -692,31 +755,27 @@ with the same texture but different samplers produce separate draw calls, which
|
|||||||
|
|
||||||
#### Textured draw procs
|
#### Textured draw procs
|
||||||
|
|
||||||
Textured rectangles route through the existing SDF path via `sdf_rectangle_texture` and
|
Textured rectangles route through the existing SDF path via `rectangle_texture`, which mirrors
|
||||||
`sdf_rectangle_texture_corners`, mirroring `sdf_rectangle` and `sdf_rectangle_corners` exactly —
|
`rectangle` exactly — same parameters for radii, origin, rotation, feather — with the `color`
|
||||||
same parameters, same naming — with the color parameter replaced by a texture ID plus an optional
|
parameter replaced by a `Texture_Id`, an optional `tint`, a `uv_rect`, and a `Sampler_Preset`.
|
||||||
tint.
|
|
||||||
|
|
||||||
An earlier iteration of this design considered a separate tessellated proc for "simple" fullscreen
|
An earlier iteration of this design considered a separate tessellated proc for "simple" fullscreen
|
||||||
quads, on the theory that the tessellated path's lower register count (~16 regs vs ~18 for the SDF
|
quads, on the theory that the tessellated path's lower register count would improve occupancy at
|
||||||
textured branch) would improve occupancy at large fragment counts. Applying the register-pressure
|
large fragment counts. Both paths are well within the ≤24-register main pipeline budget — both run at
|
||||||
analysis from the pipeline-strategy section above shows this is wrong: both 16 and 18 registers are
|
full occupancy on every target architecture (Valhall and above). The remaining ALU difference (~15
|
||||||
well below the register cliff (~43 regs on consumer Ampere/Ada, ~32 on Volta/A100), so both run at
|
extra instructions for the SDF evaluation) amounts to ~20μs at 4K — below noise. Meanwhile,
|
||||||
100% occupancy. The remaining ALU difference (~15 extra instructions for the SDF evaluation) amounts
|
splitting into a separate pipeline would add ~1–5μs per pipeline bind on the CPU side per scissor,
|
||||||
to ~20μs at 4K — below noise. Meanwhile, splitting into a separate pipeline would add ~1–5μs per
|
matching or exceeding the GPU-side savings. Within the main pipeline, unified remains strictly better.
|
||||||
pipeline bind on the CPU side per scissor, matching or exceeding the GPU-side savings. Within the
|
|
||||||
main pipeline, unified remains strictly better.
|
|
||||||
|
|
||||||
The naming convention uses `sdf_` and `tes_` prefixes to indicate the rendering path, with suffixes
|
SDF drawing procs live in the `draw` package with unprefixed names (`rectangle`, `rectangle_texture`,
|
||||||
for modifiers: `sdf_rectangle_texture` and `sdf_rectangle_texture_corners` sit alongside
|
`circle`, `ellipse`, `polygon`, `ring`, `line`, `line_strip`). Gradients and outlines are optional
|
||||||
`sdf_rectangle` (solid or gradient overload). Proc groups like `sdf_rectangle` dispatch to
|
parameters on each proc rather than separate overloads. Future per-shape texture variants
|
||||||
`sdf_rectangle_solid` or `sdf_rectangle_gradient` based on argument count. Future per-shape texture
|
(`circle_texture`, `ellipse_texture`) are additive.
|
||||||
variants (`sdf_circle_texture`) are additive.
|
|
||||||
|
|
||||||
#### What SDF anti-aliasing does and does not do for textured draws
|
#### What SDF anti-aliasing does and does not do for textured draws
|
||||||
|
|
||||||
The SDF path anti-aliases the **shape's outer silhouette** — rounded-corner edges, rotated edges,
|
The SDF path anti-aliases the **shape's outer silhouette** — rounded-corner edges, rotated edges,
|
||||||
stroke outlines. It does not anti-alias or sharpen the texture content. Inside the shape, fragments
|
outline edges. It does not anti-alias or sharpen the texture content. Inside the shape, fragments
|
||||||
sample through the chosen `Sampler_Preset`, and image quality is whatever the sampler produces from
|
sample through the chosen `Sampler_Preset`, and image quality is whatever the sampler produces from
|
||||||
the source texels. A low-resolution texture displayed at a large size shows bilinear blur regardless
|
the source texels. A low-resolution texture displayed at a large size shows bilinear blur regardless
|
||||||
of which draw proc is used. This matches the current text-rendering model, where glyph sharpness
|
of which draw proc is used. This matches the current text-rendering model, where glyph sharpness
|
||||||
@@ -750,9 +809,9 @@ textures onto a free list that is processed in `r_end_frame`, not at the call si
|
|||||||
|
|
||||||
Clay's `RenderCommandType.Image` is handled by dereferencing `imageData: rawptr` as a pointer to a
|
Clay's `RenderCommandType.Image` is handled by dereferencing `imageData: rawptr` as a pointer to a
|
||||||
`Clay_Image_Data` struct containing a `Texture_Id`, `Fit_Mode`, and tint color. Routing mirrors the
|
`Clay_Image_Data` struct containing a `Texture_Id`, `Fit_Mode`, and tint color. Routing mirrors the
|
||||||
existing rectangle handling: zero `cornerRadius` dispatches to `sdf_rectangle_texture` (SDF, sharp
|
existing rectangle handling: `fit_params` computes UVs from the fit mode, then
|
||||||
corners), nonzero dispatches to `sdf_rectangle_texture_corners` (SDF, per-corner radii). A
|
`rectangle_texture` is called with the appropriate radii (zero for sharp corners, per-corner values
|
||||||
`fit_params` call computes UVs from the fit mode before dispatch.
|
from Clay's `cornerRadius` otherwise).
|
||||||
|
|
||||||
#### Deferred features
|
#### Deferred features
|
||||||
|
|
||||||
@@ -764,7 +823,7 @@ The following are plumbed in the descriptor but not implemented in phase 1:
|
|||||||
- **3D textures, arrays, cube maps**: `Texture_Desc.type` and `depth_or_layers` fields exist.
|
- **3D textures, arrays, cube maps**: `Texture_Desc.type` and `depth_or_layers` fields exist.
|
||||||
- **Additional samplers**: anisotropic, trilinear, clamp-to-border — additive enum values.
|
- **Additional samplers**: anisotropic, trilinear, clamp-to-border — additive enum values.
|
||||||
- **Atlas packing**: internal optimization for sub-batch coalescing; invisible to callers.
|
- **Atlas packing**: internal optimization for sub-batch coalescing; invisible to callers.
|
||||||
- **Per-shape texture variants**: `sdf_circle_texture`, `tes_ellipse_texture`, `tes_polygon_texture` — potential future additions, reserved by naming convention.
|
- **Per-shape texture variants**: `circle_texture`, `ellipse_texture`, `polygon_texture` — potential future additions, following the existing naming convention.
|
||||||
|
|
||||||
**References:**
|
**References:**
|
||||||
|
|
||||||
|
|||||||