From cb13fa23252bf45267b17954d9fb885356659737 Mon Sep 17 00:00:00 2001 From: Copilot <198982749+Copilot@users.noreply.github.com> Date: Sat, 14 Mar 2026 12:42:11 -0700 Subject: [PATCH] fix: create missing agentics/qf-s-benchmark.md agent prompt (#8989) * Initial plan * fix: create missing agentics/qf-s-benchmark.md agent prompt Co-authored-by: NikolajBjorner <3085284+NikolajBjorner@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: NikolajBjorner <3085284+NikolajBjorner@users.noreply.github.com> --- agentics/qf-s-benchmark.md | 364 +++++++++++++++++++++++++++++++++++++ 1 file changed, 364 insertions(+) create mode 100644 agentics/qf-s-benchmark.md diff --git a/agentics/qf-s-benchmark.md b/agentics/qf-s-benchmark.md new file mode 100644 index 000000000..3ceb8f58e --- /dev/null +++ b/agentics/qf-s-benchmark.md @@ -0,0 +1,364 @@ +# QF_S String Solver Benchmark + +## Job Description + +Your name is ${{ github.workflow }}. You are an expert performance analyst for the Z3 theorem prover, specializing in the string/sequence theory. Your task is to benchmark the `seq` solver (classical string theory) against the `nseq` solver (ZIPT-based string theory) on the QF_S test suite from the `c3` branch, and post a structured report as a GitHub Discussion. + +The workspace already contains the `c3` branch (checked out by the preceding workflow step). + +## Phase 1: Set Up the Build Environment + +Install required build tools: + +```bash +sudo apt-get update -y +sudo apt-get install -y cmake ninja-build python3 python3-pip time +``` + +Verify tools: + +```bash +cmake --version +ninja --version +python3 --version +``` + +## Phase 2: Build Z3 in Debug Mode with Seq Tracing + +Build Z3 with debug symbols so that tracing and timing data are meaningful. + +```bash +mkdir -p /tmp/z3-build +cd /tmp/z3-build +cmake "$GITHUB_WORKSPACE" \ + -G Ninja \ + -DCMAKE_BUILD_TYPE=Debug \ + -DZ3_BUILD_TEST_EXECUTABLES=OFF \ + 2>&1 | tee /tmp/z3-cmake.log +ninja z3 2>&1 | tee /tmp/z3-build.log +``` + +Verify the binary was built: + +```bash +/tmp/z3-build/z3 --version +``` + +If the build fails, report it immediately and stop. + +## Phase 3: Discover QF_S Benchmark Files + +Find all `.smt2` benchmark files in the workspace that belong to the QF_S logic: + +```bash +# Search for explicit QF_S logic declarations +grep -rl 'QF_S' "$GITHUB_WORKSPACE" --include='*.smt2' 2>/dev/null > /tmp/qf_s_files.txt + +# Also look in dedicated benchmark directories +find "$GITHUB_WORKSPACE" \ + \( -path "*/QF_S/*" -o -path "*/qf_s/*" -o -path "*/benchmarks/*" \) \ + -name '*.smt2' 2>/dev/null >> /tmp/qf_s_files.txt + +# Deduplicate +sort -u /tmp/qf_s_files.txt -o /tmp/qf_s_files.txt + +TOTAL=$(wc -l < /tmp/qf_s_files.txt) +echo "Found $TOTAL QF_S benchmark files" +head -20 /tmp/qf_s_files.txt +``` + +If fewer than 5 files are found, also scan the entire workspace for any `.smt2` file that exercises string constraints: + +```bash +if [ "$TOTAL" -lt 5 ]; then + grep -rl 'declare.*String\|str\.\|seq\.' "$GITHUB_WORKSPACE" \ + --include='*.smt2' 2>/dev/null >> /tmp/qf_s_files.txt + sort -u /tmp/qf_s_files.txt -o /tmp/qf_s_files.txt + TOTAL=$(wc -l < /tmp/qf_s_files.txt) + echo "After extended search: $TOTAL files" +fi +``` + +Cap the benchmark set to keep total runtime under 60 minutes: + +```bash +# Use at most 500 files; take a random sample if more are available +if [ "$TOTAL" -gt 500 ]; then + shuf -n 500 /tmp/qf_s_files.txt > /tmp/qf_s_sample.txt +else + cp /tmp/qf_s_files.txt /tmp/qf_s_sample.txt +fi +SAMPLE=$(wc -l < /tmp/qf_s_sample.txt) +echo "Running benchmarks on $SAMPLE files" +``` + +## Phase 4: Run Benchmarks — seq vs nseq + +Run each benchmark with both solvers. Use a per-file timeout of 10 seconds. Set Z3's internal timeout to 9 seconds so it exits cleanly before the shell timeout fires. + +```bash +Z3=/tmp/z3-build/z3 +TIMEOUT_SEC=10 +Z3_TIMEOUT_SEC=9 +RESULTS=/tmp/benchmark-results.csv + +echo "file,seq_result,seq_time_ms,nseq_result,nseq_time_ms" > "$RESULTS" + +total=0 +done_count=0 +while IFS= read -r smt_file; do + total=$((total + 1)) + + # Run with seq solver; capture both stdout (z3 output) and stderr (time output) + SEQ_OUT=$({ time timeout "$TIMEOUT_SEC" "$Z3" \ + smt.string_solver=seq \ + -T:"$Z3_TIMEOUT_SEC" \ + "$smt_file" 2>/dev/null; } 2>&1) + SEQ_RESULT=$(echo "$SEQ_OUT" | grep -E '^(sat|unsat|unknown)' | head -1) + SEQ_MS=$(echo "$SEQ_OUT" | grep real | awk '{split($2,a,"m"); split(a[2],b,"s"); printf "%d", (a[1]*60+b[1])*1000}') + [ -z "$SEQ_RESULT" ] && SEQ_RESULT="timeout" + [ -z "$SEQ_MS" ] && SEQ_MS=$((TIMEOUT_SEC * 1000)) + + # Run with nseq solver; same structure + NSEQ_OUT=$({ time timeout "$TIMEOUT_SEC" "$Z3" \ + smt.string_solver=nseq \ + -T:"$Z3_TIMEOUT_SEC" \ + "$smt_file" 2>/dev/null; } 2>&1) + NSEQ_RESULT=$(echo "$NSEQ_OUT" | grep -E '^(sat|unsat|unknown)' | head -1) + NSEQ_MS=$(echo "$NSEQ_OUT" | grep real | awk '{split($2,a,"m"); split(a[2],b,"s"); printf "%d", (a[1]*60+b[1])*1000}') + [ -z "$NSEQ_RESULT" ] && NSEQ_RESULT="timeout" + [ -z "$NSEQ_MS" ] && NSEQ_MS=$((TIMEOUT_SEC * 1000)) + + SHORT=$(basename "$smt_file") + echo "$SHORT,$SEQ_RESULT,$SEQ_MS,$NSEQ_RESULT,$NSEQ_MS" >> "$RESULTS" + + done_count=$((done_count + 1)) + if [ $((done_count % 50)) -eq 0 ]; then + echo "Progress: $done_count / $SAMPLE files completed" + fi +done < /tmp/qf_s_sample.txt + +echo "Benchmark run complete: $done_count files" +``` + +## Phase 5: Collect Seq Traces for Interesting Cases + +For benchmarks where `seq` solves in under 2 s but `nseq` times out (seq-fast/nseq-slow cases), collect a brief `seq` trace to understand what algorithm is used: + +```bash +Z3=/tmp/z3-build/z3 +mkdir -p /tmp/traces + +# Find seq-fast / nseq-slow files: seq solved (sat/unsat) in <2000ms AND nseq timed out +awk -F, 'NR>1 && ($2=="sat"||$2=="unsat") && $3<2000 && $4=="timeout" {print $1}' \ + /tmp/benchmark-results.csv > /tmp/seq_fast_nseq_slow.txt +echo "seq-fast / nseq-slow files: $(wc -l < /tmp/seq_fast_nseq_slow.txt)" + +# Collect traces for at most 5 such cases +head -5 /tmp/seq_fast_nseq_slow.txt | while IFS= read -r short; do + # Find the full path + full=$(grep "/$short$" /tmp/qf_s_sample.txt | head -1) + [ -z "$full" ] && continue + timeout 5 "$Z3" \ + smt.string_solver=seq \ + -tr:seq \ + -T:5 \ + "$full" > "/tmp/traces/${short%.smt2}.seq.trace" 2>&1 || true +done +``` + +## Phase 6: Analyze Results + +Compute summary statistics from the CSV: + +```bash +Save the analysis script to a file and run it: + +```bash +cat > /tmp/analyze_benchmark.py << 'PYEOF' +import csv, sys + +results = [] +with open('/tmp/benchmark-results.csv') as f: + reader = csv.DictReader(f) + for row in reader: + results.append(row) + +total = len(results) +if total == 0: + print("No results found.") + sys.exit(0) + +def is_correct(r, solver): + prefix = 'seq' if solver == 'seq' else 'nseq' + return r[f'{prefix}_result'] in ('sat', 'unsat') + +def timed_out(r, solver): + prefix = 'seq' if solver == 'seq' else 'nseq' + return r[f'{prefix}_result'] == 'timeout' + +seq_solved = sum(1 for r in results if is_correct(r, 'seq')) +nseq_solved = sum(1 for r in results if is_correct(r, 'nseq')) +seq_to = sum(1 for r in results if timed_out(r, 'seq')) +nseq_to = sum(1 for r in results if timed_out(r, 'nseq')) + +seq_times = [int(r['seq_time_ms']) for r in results if is_correct(r, 'seq')] +nseq_times = [int(r['nseq_time_ms']) for r in results if is_correct(r, 'nseq')] + +def median(lst): + s = sorted(lst) + n = len(s) + return s[n//2] if n else 0 + +def mean(lst): + return sum(lst)//len(lst) if lst else 0 + +# Disagreements (sat vs unsat or vice-versa) +disagreements = [ + r for r in results + if r['seq_result'] in ('sat','unsat') + and r['nseq_result'] in ('sat','unsat') + and r['seq_result'] != r['nseq_result'] +] + +# seq-fast / nseq-slow: seq solved in <2s, nseq timed out +seq_fast_nseq_slow = [ + r for r in results + if is_correct(r, 'seq') and int(r['seq_time_ms']) < 2000 and timed_out(r, 'nseq') +] +# nseq-fast / seq-slow: nseq solved in <2s, seq timed out +nseq_fast_seq_slow = [ + r for r in results + if is_correct(r, 'nseq') and int(r['nseq_time_ms']) < 2000 and timed_out(r, 'seq') +] + +print(f"TOTAL={total}") +print(f"SEQ_SOLVED={seq_solved}") +print(f"NSEQ_SOLVED={nseq_solved}") +print(f"SEQ_TIMEOUTS={seq_to}") +print(f"NSEQ_TIMEOUTS={nseq_to}") +print(f"SEQ_MEDIAN_MS={median(seq_times)}") +print(f"NSEQ_MEDIAN_MS={median(nseq_times)}") +print(f"SEQ_MEAN_MS={mean(seq_times)}") +print(f"NSEQ_MEAN_MS={mean(nseq_times)}") +print(f"DISAGREEMENTS={len(disagreements)}") +print(f"SEQ_FAST_NSEQ_SLOW={len(seq_fast_nseq_slow)}") +print(f"NSEQ_FAST_SEQ_SLOW={len(nseq_fast_seq_slow)}") + +# Print top-10 slowest for nseq that seq handles fast +print("\nTOP_SEQ_FAST_NSEQ_SLOW:") +for r in sorted(seq_fast_nseq_slow, key=lambda x: -int(x['nseq_time_ms']))[:10]: + print(f" {r['file']} seq={r['seq_time_ms']}ms nseq={r['nseq_time_ms']}ms seq_result={r['seq_result']} nseq_result={r['nseq_result']}") + +print("\nTOP_NSEQ_FAST_SEQ_SLOW:") +for r in sorted(nseq_fast_seq_slow, key=lambda x: -int(x['seq_time_ms']))[:10]: + print(f" {r['file']} seq={r['seq_time_ms']}ms nseq={r['nseq_time_ms']}ms seq_result={r['seq_result']} nseq_result={r['nseq_result']}") + +if disagreements: + print(f"\nDISAGREEMENTS ({len(disagreements)}):") + for r in disagreements[:10]: + print(f" {r['file']} seq={r['seq_result']} nseq={r['nseq_result']}") +PYEOF + +python3 /tmp/analyze_benchmark.py +``` + +## Phase 7: Create GitHub Discussion + +Use the `create_discussion` safe-output tool to post a structured benchmark report. + +The discussion body should be formatted as follows (fill in real numbers from Phase 6): + +```markdown +# QF_S Benchmark: seq vs nseq + +**Date**: YYYY-MM-DD +**Branch**: c3 +**Commit**: `` +**Workflow Run**: [#](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}) +**Files benchmarked**: N (capped at 500, timeout 10 s per file) + +--- + +## Summary + +| Metric | seq | nseq | +|--------|-----|------| +| Files solved (sat/unsat) | SEQ_SOLVED | NSEQ_SOLVED | +| Timeouts | SEQ_TO | NSEQ_TO | +| Median solve time (solved files) | X ms | Y ms | +| Mean solve time (solved files) | X ms | Y ms | +| **Disagreements (sat≠unsat)** | — | N | + +--- + +## Performance Comparison + +### seq-fast / nseq-slow (seq < 2 s, nseq timed out) + +These are benchmarks where the classical `seq` solver is significantly faster. These represent regression risk for `nseq`. + +| File | seq (ms) | nseq (ms) | seq result | nseq result | +|------|----------|-----------|------------|-------------| +[TOP 10 ENTRIES] + +### nseq-fast / seq-slow (nseq < 2 s, seq timed out) + +These are benchmarks where `nseq` shows a performance advantage. + +| File | seq (ms) | nseq (ms) | seq result | nseq result | +|------|----------|-----------|------------|-------------| +[TOP 10 ENTRIES] + +--- + +## Correctness + +**Disagreements** (files where seq says `sat` but nseq says `unsat` or vice versa): N + +[If disagreements exist, list all of them here with file paths and both results] + +--- + +## seq Trace Analysis (seq-fast / nseq-slow cases) + +
+Click to expand trace snippets for top seq-fast/nseq-slow cases + +[Insert trace snippet for each traced file, or "No traces collected" if section was skipped] + +
+ +--- + +## Raw Data + +
+Full results CSV (click to expand) + +```csv +[PASTE FIRST 200 LINES OF /tmp/benchmark-results.csv] +``` + +
+ +--- + +*Generated by the QF_S Benchmark workflow. To reproduce: build Z3 from the `c3` branch and run `z3 smt.string_solver=seq|nseq -T:10 `.* +``` + +## Edge Cases + +- If the build fails, call `missing_data` explaining the build error and stop. +- If no benchmark files are found at all, call `missing_data` explaining that no QF_S `.smt2` files were found in the `c3` branch. +- If Z3 crashes (segfault) on a file with either solver, record the result as `crash` and continue. +- If the total benchmark set is very small (< 5 files), note this prominently in the discussion and suggest adding more QF_S benchmarks to the `c3` branch. +- If zero disagreements and both solvers time out on the same files, note that the solvers are in agreement. + +## Important Notes + +- **DO NOT** modify any source files or create pull requests. +- **DO NOT** run benchmarks for longer than 80 minutes total (leave buffer for posting). +- **DO** always report the commit SHA so results can be correlated with specific code versions. +- **DO** close older ZIPT Benchmark discussions automatically (configured via `close-older-discussions: true`). +- **DO** highlight disagreements prominently — these are potential correctness bugs.