From 400fe313d9cecdc0cfe4e967e2ecd24b39c7c5db Mon Sep 17 00:00:00 2001
From: Lev Nachmanson <5377127+levnach@users.noreply.github.com>
Date: Thu, 4 Jun 2026 00:35:43 +0000
Subject: [PATCH] ci(nightly): always-on issue-oracle smoke test (~2 min, never fails the build) (#9688)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds a tightly-bounded issue-oracle smoke test as a sibling of the existing `test_benchmarks.py` step in the nightly's `ubuntu-build` job. The step always runs as part of every nightly, can never fail the build, and completes in ~2 min.

## Why

`Z3Prover/bench` ships a per-issue regression corpus (`inputs/issues/iss-N/`) plus a runner (`scripts/issues_check_oracle.py`) that diffs current z3 output against captured `.expected.out` byte streams. Wiring that into the nightly gives us a daily smoke signal that detects regressions on benchmarks distilled from real z3 issues, without requiring any z3 contributor to ever touch the bench repo.

## What

A two-step block added right after the existing `Clone z3test` + `Test` steps in `ubuntu-build`:

1. **Clone bench (sparse, ~800 MB of ~12 GB total)**
   `git clone --depth 1 --filter=blob:none --sparse https://github.com/Z3Prover/bench bench` then `sparse-checkout set scripts inputs/issues`.
2. **Run issue-oracle smoke test (~2 min)**
   ```yaml
   continue-on-error: true
   run: |
     timeout 90 python bench/scripts/issues_check_oracle.py \
       --z3 build-dist/z3 \
       --all bench/inputs/issues \
       --max 200 --timeout 5 --wallclock 60 \
       --jobs 0 --quiet \
       --json-report issue-oracle-report.json
   ```

The JSON report is then uploaded as a workflow artifact (`issue-oracle-report`, 7-day retention) for inspection.
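
For readers unfamiliar with the oracle idea, here is a minimal Python sketch of a single byte-for-byte check. It is not the code in `scripts/issues_check_oracle.py`: the helper name `check_one` and the status strings are assumptions made for illustration, and a child Python process stands in for the z3 binary so the sketch runs anywhere.

```python
import subprocess
import sys


def check_one(cmd, expected, timeout_s):
    """Run cmd and compare its stdout to the captured bytes.

    Hypothetical helper: returns "ok", "DIFF", "timeout" or "exec-error".
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The per-file cap fired before the solver finished.
        return "timeout"
    except OSError:
        # Covers a missing or non-executable binary.
        return "exec-error"
    return "ok" if proc.stdout == expected else "DIFF"


# Stand-in for `z3 file.smt2`: a child process that writes "sat\n".
fake_z3 = [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'sat\\n')"]

status_match = check_one(fake_z3, b"sat\n", timeout_s=5)
status_diff = check_one(fake_z3, b"unsat\n", timeout_s=5)
status_missing = check_one(["./no-such-z3-binary"], b"sat\n", timeout_s=5)
print(status_match, status_diff, status_missing)
```

Comparing raw bytes rather than decoded text keeps the check sensitive to whitespace and encoding changes, which is presumably the point of capturing `.expected.out` as a byte stream.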

### Wall-clock bounds (defense in depth)

| Bound | Where | Purpose |
|---|---|---|
| `--max 200` | issues_check_oracle CLI | walk only first 200 of ~2,700 `iss-*` dirs (alphabetic; stable across nightlies) |
| `--timeout 5` | issues_check_oracle CLI | per-file z3 cap |
| `--wallclock 60` | issues_check_oracle CLI | hard global cap inside the script |
| `timeout 90` | shell wrapper | belt-and-braces backstop, leaves 30 s headroom for the script to flush its JSON report before SIGTERM |
| `continue-on-error: true` | step gate | absorbs every failure mode (missing z3, sparse-clone failure, outer timeout firing, etc.) so the smoke test can **never** red the nightly build |

### Scope

Only `ubuntu-build` and only one place in `nightly.yml`. The push/PR lanes (`ci.yml`, `Windows.yml`) and the other scheduled/dispatch lanes (`coverage.yml`, `memory-safety.yml`, `nightly-validation.yml`, `release.yml`, `wip.yml`, `daily-test-improver`) are intentionally left untouched so this gate runs exactly once per night.

## Local verification

On Mac (16 cores, capped to 8 jobs by `--jobs 0` resolving to `min(jobs, cores)`):

```
[issues_check_oracle] 368 file-check(s) | timeout=5s | wallclock=60s
=== summary ===
total: 368
ok: 286
DIFF: 4 (per-file timeouts)
skipped: 78
elapsed: 8.3s / 60s
exit code: 0
```

GHA Ubuntu (4 cores → 4 jobs) extrapolation: ~17 s typical, well under all wall-clock caps.
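
The ~17 s estimate and the 30 s headroom follow from simple arithmetic on the figures quoted in this message. The sketch below assumes that wall time scales inversely with the job count, which is an idealization rather than a measurement; the variable names are illustrative only.

```python
# Figures taken from the local run and the wall-clock bounds table.
local_elapsed_s = 8.3   # measured on the Mac with 8 jobs
local_jobs = 8
gha_jobs = 4            # GHA Ubuntu runner: 4 cores -> 4 jobs

# Assumption: wall time is inversely proportional to the number of jobs.
gha_estimate_s = local_elapsed_s * local_jobs / gha_jobs

inner_wallclock_s = 60  # --wallclock 60 (cap inside the script)
outer_timeout_s = 90    # shell-level `timeout 90`
headroom_s = outer_timeout_s - inner_wallclock_s

print(f"estimated GHA runtime: {gha_estimate_s:.1f} s")
print(f"headroom before SIGTERM: {headroom_s} s")
```

Under that scaling assumption the estimate stays well below the inner cap, so the outer `timeout` should only matter if the script itself hangs.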

### Adversarial cases (all leave the workflow green via step-level `continue-on-error: true`)

| Failure mode | Result |
|---|---|
| z3 binary missing | each per-file run records `exec-error`, script summary-exits 0 → green |
| Sparse clone fails (previous step's continue-on-error absorbs it) | oracle finds no `bench/` → script `sys.exit(1)` → step's continue-on-error absorbs → green |
| Wallclock fires | script writes report with `wallclock_hit: true`, exits 0 → green |
| Outer `timeout 90` fires | SIGTERM → bash exits 124 → step's continue-on-error absorbs → green |

## Companion bench-repo PR

The data side of this (per-bench sidecar schema, `bug-K.json` + `.expected.out`, oracle rewrite) lands in `Z3Prover/bench` as PR [#2503](https://github.com/Z3Prover/bench/pull/2503). The nightly step here depends on that PR's `scripts/issues_check_oracle.py` and the migrated corpus. Both PRs should be merged together; bench can also merge first (the script handles a missing corpus gracefully).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .github/workflows/nightly.yml | 56 +++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/.github/workflows/nightly.yml b/.github/workflows/nightly.yml
index 548fca2eb..180dc6d90 100644
--- a/.github/workflows/nightly.yml
+++ b/.github/workflows/nightly.yml
@@ -244,6 +244,62 @@ jobs:
       - name: Test
         run: python z3test/scripts/test_benchmarks.py build-dist/z3 z3test/regressions/smt2
 
+      - name: Clone bench (for issue-oracle smoke test)
+        continue-on-error: true
+        run: |
+          # Sparse clone: bench is ~12 GB total but only scripts/ and
+          # inputs/issues/ (~800 MB) are needed by the oracle. The
+          # smoke-test runs on a sampled subset of iss-* dirs (see
+          # next step), so even this 800 MB pull is upper-bounded
+          # at the clone level rather than runtime.
+          git clone --depth 1 --filter=blob:none --sparse \
+            https://github.com/Z3Prover/bench bench
+          git -C bench sparse-checkout set scripts inputs/issues
+
+      - name: Run issue-oracle smoke test
+        # continue-on-error: true means this step can NEVER red the
+        # nightly build. Its job is to produce a JSON report artifact,
+        # not to gate the build. The bounds below (--max 200, --timeout
+        # 5, --wallclock 20, outer timeout 90) keep the step under ~2
+        # minutes regardless of corpus growth.
+        continue-on-error: true
+        run: |
+          # SMOKE-TEST budget (sampled subset of the corpus):
+          #   --max 200         first 200 of ~2,700 iss-* dirs
+          #                     (sorted, so the same sample every run
+          #                      and easy to diff across nightlies)
+          #   --timeout 5       per-file z3 cap (matches capture-time
+          #                     timeout/4, so any bench that escapes
+          #                     this cap was already on the edge)
+          #   --wallclock 20    hard global cap INSIDE the script
+          #   --quiet           suppress per-issue progress lines
+          #   timeout 90        shell-level belt-and-braces wrapper,
+          #                     leaves 30 s headroom for the script
+          #                     to flush its JSON report before SIGTERM
+          # The script ALWAYS exits 0 on normal operation, so the only
+          # ways this step can non-zero are: missing z3 binary, sparse
+          # clone failure (the previous step's continue-on-error
+          # absorbs that), or the outer `timeout` firing. All three
+          # are absorbed by this step's continue-on-error: true.
+          timeout 90 python bench/scripts/issues_check_oracle.py \
+            --z3 build-dist/z3 \
+            --all bench/inputs/issues \
+            --max 200 \
+            --timeout 5 \
+            --wallclock 20 \
+            --jobs 0 \
+            --quiet \
+            --json-report issue-oracle-report.json
+
+      - name: Upload issue-oracle report
+        if: always()
+        continue-on-error: true
+        uses: actions/upload-artifact@v7.0.1
+        with:
+          name: issue-oracle-report
+          path: issue-oracle-report.json
+          retention-days: 7
+
       - name: Upload artifact
         uses: actions/upload-artifact@v7.0.1