|
|
Donner 0.5.1
Embeddable browser-grade SVG2 engine
|
Status: Phase 1 Implemented Author: Claude Opus 4.6 Created: 2026-04-07 Updated: 2026-04-08 Tracking: Org initiative
Developers run bazel test //... locally and expect it to catch all issues before pushing. In practice, 5-6 categories of failures routinely escape to CI, costing multiple round-trips per PR. This design doc catalogues every observed escape category from recent PRs, proposes concrete mitigations for each, and charts a path to zero escapes — where bazel test //... (plus a lightweight tools/presubmit.sh) catches everything CI would catch, with the sole exception of genuinely platform-dependent issues (Linux-only when developing on macOS).
Analysis of the last ~50 PRs and CI runs yields six categories of escapes, ordered by frequency and developer pain:
Affected PRs: #459 (8 CI iterations), #444 (linker errors + syntax), #481 (missing include)
Root cause: CMake is a completely separate build system maintained by gen_cmakelists.py. Developers add files to BUILD.bazel but the CMake side drifts because:
Observed failure modes:
Affected PRs: #415 (int64_t type width), #469 (HarfBuzz linker)
Root cause: macOS Clang and Linux Clang/GCC have different type widths, header transitivity, and linker behavior. std::int64_t is long long on macOS but long on Linux, causing template specialization collisions. Headers included transitively on macOS must be explicitly included on Linux.
Observed failure modes:
Affected PRs: #481 (bezier_utils_fuzzer assertion failure)
Root cause — investigated: Fuzzers run on Linux CI but bazel test //... on macOS skips them via fuzzer_compatible_with() in build_defs/rules.bzl. The restriction exists because Apple Clang does not ship libclang_rt.fuzzer_osx.a (confirmed on Apple Clang 17) — only the header FuzzedDataProvider.h is included.
However: The vendored LLVM 21 toolchain (downloaded via --config=latest_llvm) does ship libclang_rt.fuzzer_osx.a at external/toolchains_llvm++llvm+llvm_toolchain_llvm/lib/clang/21/lib/darwin/. Running bazel test --config=asan-fuzzer //... on macOS works today and catches fuzzer failures locally. The gap is purely in the default bazel test //... invocation (which uses Apple Clang for speed).
Affected PRs: #444 (SVGClipPath/SVGPattern ASCII test failures on Skia backend)
Root cause — clarified: Cross-backend testing already exists via donner_variant_cc_test in build_defs/rules.bzl, which creates separate _skia and _tiny_skia variants with backend transitions. These are exercised by bazel test //.... The #444 failure was macOS-specific ASCII rendering differences, not a missing backend test. No action needed for this category beyond monitoring the existing variant tests.
Affected PRs: #465, #480 (empty coverage.dat from transitioned tests)
Root cause: Custom Bazel rules (donner_transitioned_cc_test) didn't forward InstrumentedFilesInfo, causing coverage collection to silently produce empty results. This was only discoverable by checking Codecov reports, not by test failures.
Affected: Occasional CodeQL timeouts.
Root cause: Infrastructure issues unrelated to code changes. Not actionable via local testing.
Affected: Intermittent apt-get, bazel fetch, GitHub clone / checkout failures
Root cause: Flaky external services (GitHub / apt mirrors / Bazel Central Registry / chromium.googlesource.com). These aren't "escapes" in the code-quality sense but they do show up as red CI runs that waste developer attention. Mitigation: retry the network-sensitive steps automatically so a single transient failure doesn't block a PR.
Affected PRs: 8043ad7b, 1f147f2f, 43f42cf7, ab68092b, 8cd89ef7 — five threshold-widening hotfixes in 48h for AsyncRendererTest, FilterDragBench, CompositorBench.
Root cause: Perf tests written with ~2 ms thresholds on dev hardware fail on shared GHA macOS runners (~40 ms under load). Not a correctness escape, but it blocks PRs and churns the history.
Mitigation (tracked in doc 0030 M2): donner_perf_cc_test macro splitting correctness counters (PR-gate) from wall-clock thresholds (nightly, tagged perf). Nightly-only wall-clock runs stream into tools/ci_timing_report.py.
Affected: GitHub issue #552 — FiltersFeImage/* crashes in resvg_test_suite_geode. Root cause: heap UAF in GeodeDevice::counters_ (raw pointer into a sibling Impl that was freed) combined with iterator/reference invalidation in RendererDriver::traverse (a held RenderingInstanceView crossed a registry.sort<RenderingInstanceComponent> call inside preRenderFeImageFragments).
Root cause: bazel test //... runs the default config. Heap UAFs, ODR violations, and UB surface only when a developer opts into --config=asan or --config=ubsan. In #552 the UAF manifested as random SIGSEGV at unrelated allocation sites (tiny_free_list_remove_ptr, lvp_CreateBuffer → calloc); under ASan the exact free/use pair was identified in <1 second.
**Local repro:**
**Mitigation (tracked in doc 0030):** M1.2 adds a paths-scoped `asan-geode` PR gate for changes to `donner/svg/renderer/geode/**` and `donner/svg/renderer/RendererDriver.*`. M2 adds a nightly `sanitizers.yml` running ASan+UBSan across `//donner/...` (required for `main` merges, skip-idle when no commits landed that day).
The first phase of mitigations has landed. Recap:
The initial design proposed a regex-based check_includes.py. After review, that approach was dropped in favor of enabling clang-tidy's misc-include-cleaner (follow-up PR). Clang-tidy already runs via the existing --config=clang-tidy aspect, has real C++ semantics, no regex false positives, and no hand-maintained rule table.
2026-04 update (doc 0031 M2.9): misc-include-cleaner is now also emitted per-library inside bazel test //... for opt-in targets via include_cleaner = "strict" on donner_cc_library/_test/_binary (see build_defs/include_cleaner.bzl). Default off because of historical debt (#559, ~2,875 findings) — libraries opt in directory-by-directory as they're cleaned. The diff-only tools/run_misc_include_cleaner_diff.sh lint stays as the safety net for non-opted-in code.
Add a Bazel test target that:
This catches the most common escape: developers modify BUILD.bazel files but forget to regenerate CMake. The test is fast (just runs bazel queries + diffing) and integrates into bazel test //....
Implementation: sh_test that runs gen_cmakelists.py --check (new flag) which generates to a temp directory and diffs against the workspace. Exit code 1 on drift.
Extract dependency versions from MODULE.bazel instead of hardcoding them:
This eliminates the version drift between MODULE.bazel and CMake FetchContent (currently entt, zlib, woff2, and rules_cc are all out of sync).
Replace sys.platform with CMake conditionals for the Skia font defines:
Use bazel cquery for select-aware queries:
The current approach uses bazel query --output=xml which returns unconditional attribute values. select() branches are invisible, requiring all the manual CONDITIONAL_SOURCES and CONDITIONAL_TARGETS dicts. Using bazel cquery with multiple configurations can discover all variants:
This is the most ambitious change but would eliminate 4 of the 11 manual data structures (CONDITIONAL_TARGETS, OPTIONAL_DEPS, CONDITIONAL_SOURCES, CONDITIONAL_DEFINES).
Warn on unmapped external deps:
When query_deps returns an external label not in KNOWN_BAZEL_TO_CMAKE_DEPS, emit a warning (or error in --check mode) instead of silently dropping it.
Add find_package(Python3): The generated CMakeLists.txt uses ${Python3_EXECUTABLE} for embed_resources but never calls find_package(Python3 REQUIRED).
The script has zero tests. Add a gen_cmakelists_test.py that validates:
Missing includes were the second most common escape (PR #481). The initial plan was a regex-based check_includes.py, but that was dropped in review: clang-tidy's misc-include-cleaner does the same job with real C++ semantics, no hand-maintained rule tables, and already runs via the existing --config=clang-tidy aspect.
Follow-up PR plan:
For the #415-style escape where int64_t and long are the same type on Linux:
More generally: use if constexpr or SFINAE to avoid specializing templates for types that might alias on some platforms. Add a coding style rule and grep-based lint.
Add a Bazel test that greps for dangerous patterns:
Correction from initial design: libFuzzer does work on macOS via the vendored LLVM 21 toolchain. The --config=asan-fuzzer config chain activates --config=latest_llvm which downloads clang 21 and brings libclang_rt.fuzzer_osx.a with it. The donner_cc_fuzzer rule already has macOS-specific rpaths and runtime data deps.
The gap is purely that bazel test //... (default) uses Apple Clang for speed and skips fuzz targets via fuzzer_compatible_with(). The right fix is:
No corpus-replay workaround needed.
Correction from initial design: Cross-backend testing is already implemented via the donner_variant_cc_test macro in build_defs/rules.bzl, which creates _skia and _tiny_skia variants with backend transitions. These run as part of bazel test //... — no further action required for this category. PR #444's ASCII test diffs were due to platform-specific rendering differences on macOS CI, not a missing backend variant.
A single script that runs everything CI would catch:
Steps 2-4 are fast (<30s each). Step 1 is the main time cost. The script provides a single command that mirrors CI.
Where possible, make presubmit checks Bazel test targets so they're part of `bazel test //...`:
| Check | Bazel integration | Status |
|---|---|---|
| Banned source patterns | {name}_lint py_test per donner_cc_library/test/binary | Landed (Phase 1) |
| gen_cmakelists.py unit tests | //tools/cmake:gen_cmakelists_test py_test | Landed (Phase 1) |
| gen_cmakelists.py –check | Not yet (reentrant bazel issue) | Deferred to aspect PR |
| Include completeness | clang-tidy misc-include-cleaner aspect | Follow-up PR |
gen_cmakelists.py --check can't run from inside bazel test today because it calls bazel query and would deadlock on the outer command lock. The aspect refactor (Phase 3) replaces bazel query with aspect-provided JSON manifests computed during the outer bazel build, at which point the check becomes a normal sh_test with no reentrancy.
Formatting checks (clang-format, buildifier) are best left in presubmit.sh since they require external tools and are fast enough to run separately.
The CMake generator is the single largest source of escapes. Here is a prioritized improvement plan:
| # | Data Structure | Lines | Risk | Mitigation |
|---|---|---|---|---|
| 1 | KNOWN_BAZEL_TO_CMAKE_DEPS | 47-67 | External dep rename | Warn on unmapped deps |
| 2 | SKIPPED_PACKAGES | 71-77 | New package needs skip | Low risk, rarely changes |
| 3 | SKIPPED_TARGETS | 80-84 | New target needs skip | Low risk |
| 4 | CONDITIONAL_TARGETS | 93-122 | New select()-gated target | Eliminate via cquery |
| 5 | OPTIONAL_DEPS | 127-143 | Must track CONDITIONAL_TARGETS | Eliminate via cquery |
| 6 | CONDITIONAL_SOURCES | 148-157 | New select()-gated source | Eliminate via cquery |
| 7 | CONDITIONAL_DEFINES | 161-169 | New conditional define | Eliminate via cquery |
| 8 | EXTRA_LINK_DEPS | 173-176 | New system library dep | Keep (system deps aren't in Bazel graph) |
| 9 | EXTRA_INCLUDE_DIRS | 179-182 | New system include dir | Keep (same reason) |
| 10 | FetchContent versions | 469-476 | MODULE.bazel version bump | Auto-extract from MODULE.bazel |
| 11 | Platform-specific defines | 967-989 | Platform mismatch | Replace with CMake conditionals |
Tier 1 (Low effort, high impact):
Tier 2 (Medium effort, high impact):
Tier 3 (High effort, transformative):
After all mitigations, here is the expected escape coverage:
| Escape Category | Before | Phase 1 (landed) | Phase 2 (clang-tidy) | Phase 3 (aspect) |
|---|---|---|---|---|
| CMake generator bugs (missing srcs, undefined targets) | Escapes | Caught locally (--check) | Caught locally | Caught in bazel test //... |
| CMake version drift (MODULE vs FetchContent) | Escapes | Caught locally | Caught locally | Caught locally |
| long long / UDL / aligned_storage | Escapes | Caught in bazel test //... | Caught locally | Caught locally |
| Missing #include <optional> / Linux-only includes | Escapes | Escapes | Caught locally (clang-tidy) | Caught locally |
| Fuzzer regressions (parser bugs) | Linux-only | Caught on macOS via --config=asan-fuzzer | Same | Same |
| Cross-backend rendering diffs | Already caught | Already caught | Already caught | Already caught |
| Transient GitHub/apt/bazel-fetch outages | Red CI | Auto-retried 3× in CI | Same | Same |
| CodeQL flakiness | Infra-only | Infra-only (tracked separately) | Same | Same |
Acceptable remaining escapes:
All items below landed in a single PR (ci-escape-prevention branch):
| Task | Impact |
|---|---|
| gen_cmakelists.py --check + static validator | Blocks generator bugs before CI |
| Extract versions from MODULE.bazel | Eliminates entt/zlib/woff2 drift |
| Replace sys.platform with CMake conditionals | Cross-host-correct generation |
| Add find_package(Python3) | Fixes embed_resources custom commands |
| Warn/fail on unmapped external deps | Catches new dep additions |
| gen_cmakelists_test.py (23 unit tests in bazel) | Regression prevention |
| check_banned_patterns.py with per-library lint test in donner_cc_library | New long long/UDL/aligned_storage fails bazel test automatically |
| presubmit.sh + CI workflow retry for transient outages | Single-command local check; flaky networks don't break CI |
| macOS fuzzer step via --config=asan-fuzzer | Closes default macOS fuzzer gap |
| Task | Impact |
|---|---|
| Enable misc-include-cleaner in .clang-tidy | Catches PR #481-style missing-include escapes |
| Triage findings, NOLINT justified exceptions | Baseline the first run |
| Add bazel build --config=clang-tidy //donner/... to CI lint job | Continuous enforcement |
| Task | Impact |
|---|---|
| Write aspect that emits per-target JSON with srcs/hdrs/deps/defines/selects | Foundation for eliminating manual dicts |
| Refactor gen_cmakelists.py to consume the JSON manifest | Eliminates N+1 bazel query calls |
| Delete CONDITIONAL_TARGETS/OPTIONAL_DEPS/CONDITIONAL_SOURCES/CONDITIONAL_DEFINES | 4 of the 11 manual dicts removed |
| Move gen_cmakelists.py --check into bazel test //... | No more reentrant bazel issue |
Smaller follow-ups once Phase 1-3 are stable:
This complements the "Pull Request Workflow" section in AGENTS.md (landed in PR #483) with the new presubmit tooling and known CI behaviors.
Track escape rate over time:
| Risk | Mitigation |
|---|---|
| CMake validation test slows bazel test //... | gen_cmakelists.py runs bazel queries (~10-15s); tag test size = "large" so it runs but doesn't block fast iteration |
| cquery approach may not capture all select() semantics | Keep manual overrides as fallback; remove only when cquery output is validated against known-good CMake |
| IWYU may be too noisy / have false positives | Start with explicit-include lint (simpler, fewer false positives) before committing to IWYU |
| Presubmit script adds friction | Keep it optional; the Bazel tests catch the critical escapes automatically |
| Fuzzer corpus tests may be flaky | Corpus inputs are deterministic; failures are genuine bugs |
The script operates in two phases:
Key fragility: the depth-2 dep query, lack of select() awareness, and 11 manual dicts that must be maintained by hand whenever BUILD.bazel files change.
| Workflow | Platforms | What It Checks | Local Equivalent |
|---|---|---|---|
| main.yml | Linux + macOS | bazel build //... + bazel test //... | bazel test //... |
| cmake.yml | Linux + macOS | gen_cmakelists.py + CMake build (no tests) | None → Phase 1 adds |
| coverage.yml | Linux only | LCOV coverage + Codecov upload | tools/coverage.sh (manual) |
| codeql.yml | Linux | Static analysis | --config=clang-tidy (partial) |
| release.yml | Linux + macOS | -c opt release build | Manual |
| deploy_docs.yaml | Linux | Doxygen generation | tools/doxygen.sh (manual) |
| badges.yaml | Linux | Code metrics | tools/cloc.sh (manual) |