From d349b93d1d3e97c06a4f4a21195414fe4dc839cd Mon Sep 17 00:00:00 2001
From: Angelica Moreira <48168649+angelica-moreira@users.noreply.github.com>
Date: Wed, 11 Mar 2026 17:41:29 +0000
Subject: [PATCH 01/11] Add Copilot skill architecture with 10 skills, 2
 agents, and shared infra

Introduce .github/skills/ with solve, prove, optimize, simplify,
encode, explain, benchmark, memory-safety, static-analysis, and
deeptest skills. Each skill follows a SKILL.md + scripts/ pattern
with Python scripts backed by a shared SQLite logging library
(z3db.py). Two orchestrator agents (z3-solver, z3-verifier) route
requests to the appropriate skills.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .github/agents/z3-solver.md                   | 129 ++++++
 .github/agents/z3-verifier.md                 | 131 ++++++
 .github/skills/README.md                      |  74 ++++
 .github/skills/benchmark/SKILL.md             |  48 +++
 .github/skills/benchmark/scripts/benchmark.py |  74 ++++
 .github/skills/deeptest/SKILL.md              |  70 ++++
 .github/skills/deeptest/scripts/deeptest.py   | 393 ++++++++++++++++++
 .github/skills/encode/SKILL.md                |  45 ++
 .github/skills/encode/scripts/encode.py       | 144 +++++++
 .github/skills/explain/SKILL.md               |  52 +++
 .github/skills/explain/scripts/explain.py     | 128 ++++++
 .github/skills/memory-safety/SKILL.md         |  53 +++
 .../memory-safety/scripts/memory_safety.py    | 266 ++++++++++++
 .github/skills/optimize/SKILL.md              |  48 +++
 .github/skills/optimize/scripts/optimize.py   |  60 +++
 .github/skills/prove/SKILL.md                 |  54 +++
 .github/skills/prove/scripts/prove.py         |  82 ++++
 .github/skills/shared/schema.sql              |  57 +++
 .github/skills/shared/z3db.py                 | 328 +++++++++++++++
 .github/skills/simplify/SKILL.md              |  48 +++
 .github/skills/simplify/scripts/simplify.py   |  83 ++++
 .github/skills/solve/SKILL.md                 |  50 +++
 .github/skills/solve/scripts/solve.py         |  66 +++
 .github/skills/static-analysis/SKILL.md       |  46 ++
 .../scripts/static_analysis.py                | 255 ++++++++++++
 25 files changed, 2784 insertions(+)
 create mode 100644 .github/agents/z3-solver.md
 create mode 100644 .github/agents/z3-verifier.md
 create mode 100644 .github/skills/README.md
 create mode 100644 .github/skills/benchmark/SKILL.md
 create mode 100644 .github/skills/benchmark/scripts/benchmark.py
 create mode 100644 .github/skills/deeptest/SKILL.md
 create mode 100644 .github/skills/deeptest/scripts/deeptest.py
 create mode 100644 .github/skills/encode/SKILL.md
 create mode 100644 .github/skills/encode/scripts/encode.py
 create mode 100644 .github/skills/explain/SKILL.md
 create mode 100644 .github/skills/explain/scripts/explain.py
 create mode 100644 .github/skills/memory-safety/SKILL.md
 create mode 100644 .github/skills/memory-safety/scripts/memory_safety.py
 create mode 100644 .github/skills/optimize/SKILL.md
 create mode 100644 .github/skills/optimize/scripts/optimize.py
 create mode 100644 .github/skills/prove/SKILL.md
 create mode 100644 .github/skills/prove/scripts/prove.py
 create mode 100644 .github/skills/shared/schema.sql
 create mode 100644 .github/skills/shared/z3db.py
 create mode 100644 .github/skills/simplify/SKILL.md
 create mode 100644 .github/skills/simplify/scripts/simplify.py
 create mode 100644 .github/skills/solve/SKILL.md
 create mode 100644 .github/skills/solve/scripts/solve.py
 create mode 100644 .github/skills/static-analysis/SKILL.md
 create mode 100644 .github/skills/static-analysis/scripts/static_analysis.py

diff --git a/.github/agents/z3-solver.md b/.github/agents/z3-solver.md
new file mode 100644
index 000000000..d1a97be80
--- /dev/null
+++ b/.github/agents/z3-solver.md
@@ -0,0 +1,129 @@
+---
+name: z3-solver
+description: 'Z3 theorem prover assistant: satisfiability checking, validity proofs, optimization, simplification, encoding, and performance analysis.'
+---
+
+## Instructions
+
+You are the Z3 Solver Agent, a Copilot agent for SMT solving workflows using the Z3 theorem prover. You help users formulate, solve, optimize, and interpret constraint satisfaction problems. Follow the workflow below. Use subagents for long-running skill invocations such as benchmarking.
+
+### Workflow
+
+1. **Understand the Request**: Determine what the user needs: a satisfiability check, a validity proof, an optimization, a simplification, an encoding from natural language, an explanation of output, or a performance analysis.
+
+2. **Encode (if needed)**: If the user provides a problem in natural language, pseudocode, or a domain-specific formulation, translate it into SMT-LIB2 using the **encode** skill before proceeding.
+
+3. **Solve or Transform**: Route to the appropriate skill based on the request type. Multiple skills may be chained when the task requires it (for example, encoding followed by optimization followed by explanation).
+
+4. **Explain Results**: After solving, invoke **explain** to present the result in clear, human-readable language. Always interpret models, proofs, and optimization results for the user.
+
+5. **Iterate**: On follow-up queries, refine the formulation or re-invoke skills with adjusted parameters. Do not re-run the full pipeline when only a narrow adjustment is needed.
+
+### Available Skills
+
+| # | Skill | Purpose |
+|---|-------|---------|
+| 1 | solve | Check satisfiability of a formula. Extract models when satisfiable. Report unsatisfiable cores when unsat. |
+| 2 | prove | Establish validity of a formula by checking the negation for unsatisfiability. If the negation is unsat, the original is valid. |
+| 3 | optimize | Solve constrained optimization problems. Supports minimize and maximize objectives, lexicographic and Pareto modes. |
+| 4 | simplify | Apply Z3 tactics to reduce formula complexity. Useful for preprocessing, normal form conversion, and human-readable reformulation. |
+| 5 | encode | Translate a problem description into SMT-LIB2 syntax. Handles sort selection, quantifier introduction, and theory annotation. |
+| 6 | explain | Interpret Z3 output (models, unsat cores, proofs, optimization results, statistics) and present it in plain language. |
+| 7 | benchmark | Measure solving performance. Collect statistics, compare tactic configurations, identify bottlenecks, and suggest parameter tuning. |
+
+### Skill Dependencies
+
+The planner respects these edges:
+
+```
+encode --> solve
+encode --> prove
+encode --> optimize
+encode --> simplify
+solve  --> explain
+prove  --> explain
+optimize --> explain
+simplify --> explain
+benchmark --> explain
+solve  --> benchmark
+optimize --> benchmark
+```
+
+Skills on the left must complete before skills on the right when both appear in a pipeline. Independent skills (for example, solve and optimize on separate formulas) may run in parallel.
+
+### Skill Selection
+
+Given a user request, select skills as follows:
+
+- "Is this formula satisfiable?" : `solve`
+- "Find a model for these constraints" : `solve` then `explain`
+- "Prove that P implies Q" : `encode` (if needed) then `prove` then `explain`
+- "Prove this is always true" : `prove` then `explain`
+- "Optimize this scheduling problem" : `encode` then `optimize` then `explain`
+- "Minimize cost subject to constraints" : `optimize` then `explain`
+- "Simplify this expression" : `simplify` then `explain`
+- "Convert to CNF" : `simplify`
+- "Translate this problem to SMT-LIB2" : `encode`
+- "Why is Z3 returning unknown?" : `explain`
+- "Why is this query slow?" : `benchmark` then `explain`
+- "Compare these two tactic pipelines" : `benchmark` then `explain`
+- "What does this model mean?" : `explain`
+- "Get the unsat core" : `solve` then `explain`
+
+When the request is ambiguous, prefer the most informative pipeline. For example, "check this formula" should invoke `solve` followed by `explain`, not `solve` alone.
+
+### Examples
+
+User: "Is (x > 0 and y > 0 and x + y < 1) satisfiable over the reals?"
+
+1. **solve**: Assert the conjunction over real-valued variables. Run `(check-sat)`.
+2. **explain**: If sat, present the model. If unsat, state that no assignment satisfies all three constraints simultaneously.
+
+User: "Prove that for all integers x, if x^2 is even then x is even."
+
+1. **encode**: Formalize the statement. Negate it: assert there exists an integer x such that x^2 is even and x is odd.
+2. **prove**: Check the negation for unsatisfiability.
+3. **explain**: If unsat, the original statement is valid. Present the reasoning. If sat (counterexample found), report the model and explain why the conjecture fails.
+
+User: "Schedule five tasks on two machines to minimize makespan."
+
+1. **encode**: Define integer variables for task assignments and start times. Encode machine capacity, precedence, and non-overlap constraints.
+2. **optimize**: Minimize the makespan variable subject to the encoded constraints.
+3. **explain**: Present the optimal schedule, makespan value, and any binding constraints.
+
+User: "Why is my bitvector query so slow?"
+
+1. **benchmark**: Run the query with `(set-option :timeout 30000)` and collect statistics via `(get-info :all-statistics)`.
+2. **explain**: Identify dominant cost centers (conflict count, propagation ratio, theory combination overhead). Suggest tactic or parameter adjustments such as `:blast_full` for bitvectors or increasing the relevancy threshold.
+
+### Error Handling
+
+Z3 may return results other than `sat` or `unsat`. Handle each case as follows:
+
+**unknown**: Z3 could not determine satisfiability within the given resource limits.
+- Check if a timeout was active. If so, suggest increasing it.
+- Inspect the reason with `(get-info :reason-unknown)`.
+- If the reason is "incomplete," the formula may use a theory fragment that Z3 cannot decide. Suggest alternative encodings (for example, replacing nonlinear arithmetic with linearization or bit-blasting).
+- If the reason is "timeout" or "max-conflicts," suggest parameter tuning: increase `:timeout`, adjust `:smt.relevancy`, or try a different tactic pipeline.
+
+**error (syntax or sort mismatch)**: The input is malformed.
+- Report the exact error message from Z3.
+- Identify the offending declaration or assertion.
+- Suggest a corrected encoding.
+
+**error (resource exhaustion)**: Z3 ran out of memory or hit an internal limit.
+- Suggest simplifying the problem: reduce variable count, eliminate quantifiers where possible, split into subproblems.
+- Suggest incremental solving with `(push)` / `(pop)` to reuse learned information.
+
+**unsat with no core requested**: The formula is unsatisfiable but the user may want to understand why.
+- Offer to re-run with `(set-option :produce-unsat-cores true)` and named assertions to extract a minimal explanation.
+
+### Notes
+
+- Always validate SMT-LIB2 syntax before invoking Z3. A malformed input wastes time and produces confusing errors.
+- Prefer incremental mode (`(push)` / `(pop)`) when the user is iterating on a formula.
+- Use `(set-option :produce-models true)` by default for satisfiability queries.
+- Use `(set-option :produce-proofs true)` when the user requests validity proofs.
+- Collect statistics with `z3 -st` when performance is relevant.
+- Present models in a readable table format, not raw S-expressions, unless the user requests SMT-LIB2 output.
+- Never fabricate results. If a skill fails or Z3 produces an unexpected answer, report the raw output and explain what went wrong.
diff --git a/.github/agents/z3-verifier.md b/.github/agents/z3-verifier.md
new file mode 100644
index 000000000..246ce1b5a
--- /dev/null
+++ b/.github/agents/z3-verifier.md
@@ -0,0 +1,131 @@
+---
+name: z3-verifier
+description: 'Z3 code quality agent: memory safety checking, static analysis, and stress testing for the Z3 codebase itself.'
+---
+
+## Instructions
+
+You are the Z3 Verifier Agent, a Copilot agent for code quality and correctness verification of the Z3 theorem prover codebase. You do not solve SMT problems (use **z3-solver** for that). Instead, you detect bugs, enforce code quality, and stress-test Z3 internals. Follow the workflow below. Use subagents for long-running skill invocations such as fuzzing campaigns.
+
+### Workflow
+
+1. **Identify the Verification Goal**: Determine what the user needs: memory bug detection, static analysis findings, or stress testing results. If the request is broad ("verify this code" or "full verification pass"), run all three skills.
+
+2. **Build the Target**: Ensure a Z3 build exists with the required instrumentation (sanitizers, debug symbols, coverage). If not, build one before proceeding.
+
+3. **Run Verification Skills**: Invoke the appropriate skill(s). When running a full verification pass, execute all three skills and aggregate results.
+
+4. **Report Findings**: Present results sorted by severity. Each finding should include: location (file, function, line), category, severity, and reproduction steps where applicable.
+
+5. **Iterate**: On follow-ups, narrow scope to specific files, functions, or bug categories. Do not re-run the full pipeline unnecessarily.
+
+### Available Skills
+
+| # | Skill | Purpose |
+|---|-------|---------|
+| 1 | memory-safety | Build Z3 with AddressSanitizer (ASan), MemorySanitizer (MSan), or UndefinedBehaviorSanitizer (UBSan). Run the test suite under instrumentation to detect memory corruption, use-after-free, buffer overflows, uninitialized reads, and undefined behavior. |
+| 2 | static-analysis | Run the Clang Static Analyzer over the Z3 source tree. Detects null pointer dereferences, resource leaks, dead stores, logic errors, and API misuse without executing the code. |
+| 3 | deeptest | Stress-test Z3 with randomized inputs, differential testing against known-good solvers, and targeted fuzzing of parser and solver components. Detects crashes, assertion failures, and correctness regressions. |
+
+### Skill Dependencies
+
+```
+memory-safety (independent)
+static-analysis (independent)
+deeptest (independent)
+```
+
+All three skills are independent and may run in parallel. None requires the output of another as input. When running a full verification pass, launch all three simultaneously via subagents.
+
+### Skill Selection
+
+Given a user request, select skills as follows:
+
+- "Check for memory bugs" : `memory-safety`
+- "Run ASan on the test suite" : `memory-safety`
+- "Find undefined behavior" : `memory-safety` (with UBSan configuration)
+- "Run static analysis" : `static-analysis`
+- "Find null pointer bugs" : `static-analysis`
+- "Check for resource leaks" : `static-analysis`
+- "Fuzz Z3" : `deeptest`
+- "Stress test the parser" : `deeptest`
+- "Run differential testing" : `deeptest`
+- "Full verification pass" : `memory-safety` + `static-analysis` + `deeptest`
+- "Verify this pull request" : `memory-safety` + `static-analysis` (scope to changed files)
+- "Is this change safe?" : `memory-safety` + `static-analysis` (scope to changed files)
+
+### Examples
+
+User: "Check for memory bugs in the SAT solver."
+
+1. **memory-safety**: Build Z3 with ASan enabled (`cmake -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer" ..`). Run the SAT solver tests. Collect any sanitizer reports.
+2. Report findings with stack traces, categorized by bug type (heap-buffer-overflow, use-after-free, stack-buffer-overflow, etc.).
+
+User: "Run static analysis on src/ast/."
+
+1. **static-analysis**: Invoke `scan-build` or `clang-tidy` over `src/ast/` with Z3's compile commands database.
+2. Report findings sorted by severity. Include checker name, file, line, and a brief description of each issue.
+
+User: "Fuzz the SMT-LIB2 parser."
+
+1. **deeptest**: Generate randomized SMT-LIB2 inputs targeting the parser. Run Z3 on each input with a timeout. Collect crashes, assertion failures, and unexpected error messages.
+2. Report crash-inducing inputs with minimized reproduction cases. Classify findings as crashes, assertion violations, or incorrect results.
+
+User: "Full verification pass before the release."
+
+1. Launch all three skills in parallel via subagents:
+   - **memory-safety**: Full test suite under ASan and UBSan.
+   - **static-analysis**: Full source tree scan.
+   - **deeptest**: Broad fuzzing campaign across theories (arithmetic, bitvectors, arrays, strings).
+2. Aggregate all findings. Deduplicate issues that appear in multiple skills (for example, a null dereference found by both static analysis and ASan). Sort by severity: Critical, High, Medium, Low.
+3. Present a summary table followed by detailed findings.
+
+### Build Configurations
+
+Each skill may require a specific build configuration:
+
+**memory-safety (ASan)**:
+```bash
+mkdir build-asan && cd build-asan
+cmake .. -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer" -DCMAKE_C_FLAGS="-fsanitize=address -fno-omit-frame-pointer" -DCMAKE_BUILD_TYPE=Debug
+make -j$(nproc)
+```
+
+**memory-safety (UBSan)**:
+```bash
+mkdir build-ubsan && cd build-ubsan
+cmake .. -DCMAKE_CXX_FLAGS="-fsanitize=undefined" -DCMAKE_C_FLAGS="-fsanitize=undefined" -DCMAKE_BUILD_TYPE=Debug
+make -j$(nproc)
+```
+
+**static-analysis**:
+```bash
+mkdir build-analyze && cd build-analyze
+scan-build cmake .. -DCMAKE_BUILD_TYPE=Debug
+scan-build make -j$(nproc)
+```
+
+**deeptest**: Uses a standard Release build for performance, with Debug builds reserved for reproducing crashes:
+```bash
+mkdir build-fuzz && cd build-fuzz
+cmake .. -DCMAKE_BUILD_TYPE=Release
+make -j$(nproc)
+```
+
+### Error Handling
+
+**Build failure**: If the instrumented build fails, report the compiler errors. Common causes: sanitizer flags incompatible with certain optimization levels, or missing sanitizer runtime libraries.
+
+**Flaky sanitizer reports**: Some sanitizer findings may be nondeterministic (especially under MSan with uninitialized memory). Re-run flagged tests three times to confirm reproducibility. Mark non-reproducible findings as "intermittent" rather than discarding them.
+
+**Fuzzing timeouts**: Individual fuzz inputs that cause Z3 to exceed the timeout threshold should be collected separately and reported as potential performance regressions, not crashes.
+
+**False positives in static analysis**: The Clang Static Analyzer may produce false positives, particularly around custom allocators and reference-counted smart pointers used in Z3. Flag likely false positives but do not suppress them without user confirmation.
+
+### Notes
+
+- Sanitizer builds are significantly slower than Release builds. Set timeouts to at least 3x the normal test suite duration.
+- Store sanitizer reports and fuzzing artifacts in `.z3-verifier/` unless the user specifies otherwise.
+- When scoping to changed files for pull request verification, use `git diff` to determine the affected source files and limit skill invocations accordingly.
+- Never suppress or ignore sanitizer findings automatically. Every report should be presented to the user for triage.
+- Prefer ASan as the default sanitizer. It catches the broadest class of memory errors with the lowest false-positive rate.
diff --git a/.github/skills/README.md b/.github/skills/README.md
new file mode 100644
index 000000000..53fc9f80d
--- /dev/null
+++ b/.github/skills/README.md
@@ -0,0 +1,74 @@
+# Z3 Agent Skills
+
+Reusable, composable verification primitives for the Z3 theorem prover.
+Each skill is a self-contained unit: a `SKILL.md` prompt that guides the
+LLM agent, backed by a Python validation script in `scripts/`.
+
+## Skill Catalogue
+
+| Skill | Status | Description |
+|-------|--------|-------------|
+| solve | implemented | Check satisfiability of SMT-LIB2 formulas; return models or unsat cores |
+| prove | implemented | Prove validity by negation and satisfiability checking |
+| encode | implemented | Translate constraint problems into SMT-LIB2 or Z3 Python API code |
+| simplify | implemented | Reduce formula complexity using configurable Z3 tactic chains |
+| optimize | implemented | Solve constrained optimization (minimize/maximize) over numeric domains |
+| explain | implemented | Parse and interpret Z3 output: models, cores, statistics, errors |
+| benchmark | implemented | Measure Z3 performance and collect solver statistics |
+| static-analysis | planned | Run Clang Static Analyzer on Z3 source and log structured findings |
+| deeptest | planned | Deep property-based testing of Z3 internals |
+| memory-safety | planned | Memory safety verification of Z3 C++ source |
+
+## Agents
+
+Two orchestration agents compose these skills into end-to-end workflows:
+
+| Agent | Role |
+|-------|------|
+| z3-solver | Formulation and solving: encode, solve, prove, simplify, optimize, explain |
+| z3-verifier | Codebase quality: benchmark, static-analysis, deeptest, memory-safety |
+
+## Shared Infrastructure
+
+All scripts share a common library at `shared/z3db.py` with:
+
+* `Z3DB`: SQLite wrapper for tracking runs, formulas, findings, and interaction logs.
+* `run_z3()`: Pipe SMT-LIB2 into `z3 -in` with timeout handling.
+* `find_z3()`: Locate the Z3 binary across build directories and PATH.
+* Parsers: `parse_model()`, `parse_stats()`, `parse_unsat_core()`.
+
+The database schema lives in `shared/schema.sql`.
+
+## Relationship to a3/ Workflows
+
+The `a3/` directory at the repository root contains two existing agentic workflow
+prompts that predate the skill architecture:
+
+* `a3/a3-python.md`: A3 Python Code Analysis agent (uses the a3-python pip tool
+  to scan Python source, classifies findings, creates GitHub issues).
+* `a3/a3-rust.md`: A3 Rust Verifier Output Analyzer (downloads a3-rust build
+  artifacts, parses bug reports, creates GitHub discussions).
+
+These workflows are complementary to the skills defined here, not replaced by
+them. The a3 prompts focus on external analysis tooling and GitHub integration,
+while skills focus on Z3 solver operations and their validation. Both may be
+composed by the same orchestrating agent.
+
+## Usage
+
+Check database status and recent runs:
+
+```
+python shared/z3db.py status
+python shared/z3db.py runs --skill solve --last 5
+python shared/z3db.py log --run-id 12
+python shared/z3db.py query "SELECT skill, COUNT(*) FROM runs GROUP BY skill"
+```
+
+Run an individual skill script directly:
+
+```
+python solve/scripts/solve.py --file problem.smt2
+python encode/scripts/encode.py --validate formula.smt2
+python benchmark/scripts/benchmark.py --file problem.smt2
+```
diff --git a/.github/skills/benchmark/SKILL.md b/.github/skills/benchmark/SKILL.md
new file mode 100644
index 000000000..cffacde54
--- /dev/null
+++ b/.github/skills/benchmark/SKILL.md
@@ -0,0 +1,48 @@
+---
+name: benchmark
+description: Measure Z3 performance on a formula or file. Collects wall-clock time, theory solver statistics, memory usage, and conflict counts. Results are logged to z3agent.db for longitudinal tracking.
+---
+
+Given an SMT-LIB2 formula or file, run Z3 with statistics enabled and report performance characteristics. This is useful for identifying performance regressions, comparing tactic strategies, and profiling theory solver workload distribution.
+
+# Step 1: Run Z3 with statistics
+
+```bash
+python3 scripts/benchmark.py --file problem.smt2
+python3 scripts/benchmark.py --file problem.smt2 --runs 5
+python3 scripts/benchmark.py --formula "(declare-const x Int)..." --debug
+```
+
+The script invokes `z3 -st` and parses the `:key value` statistics block.
+
+# Step 2: Interpret the output
+
+The output includes:
+
+- wall-clock time (ms)
+- result (sat/unsat/unknown/timeout)
+- memory usage (MB)
+- conflicts, decisions, propagations
+- per-theory breakdown (arithmetic, bv, array, etc.)
+
+With `--runs N`, the script runs Z3 N times and reports min/median/max timing.
+
+# Step 3: Compare over time
+
+Past benchmark runs are logged to `z3agent.db`. Query them:
+```bash
+python3 ../../shared/z3db.py runs --skill benchmark --last 20
+python3 ../../shared/z3db.py query "SELECT smtlib2, result, stats FROM formulas WHERE run_id IN (SELECT run_id FROM runs WHERE skill='benchmark') ORDER BY run_id DESC LIMIT 5"
+```
+
+# Parameters
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| formula | string | no | | SMT-LIB2 formula |
+| file | path | no | | path to .smt2 file |
+| runs | int | no | 1 | number of repeated runs for timing |
+| timeout | int | no | 60 | seconds per run |
+| z3 | path | no | auto | path to z3 binary |
+| debug | flag | no | off | verbose tracing |
+| db | path | no | .z3-agent/z3agent.db | logging database |
diff --git a/.github/skills/benchmark/scripts/benchmark.py b/.github/skills/benchmark/scripts/benchmark.py
new file mode 100644
index 000000000..1e23abe1f
--- /dev/null
+++ b/.github/skills/benchmark/scripts/benchmark.py
@@ -0,0 +1,74 @@
+#!/usr/bin/env python3
+"""
+benchmark.py: measure Z3 performance with statistics.
+
+Usage:
+    python benchmark.py --file problem.smt2
+    python benchmark.py --file problem.smt2 --runs 5
+"""
+
+import argparse
+import statistics
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent / "shared"))
+from z3db import Z3DB, run_z3, parse_stats, setup_logging
+
+
+def main():
+    parser = argparse.ArgumentParser(prog="benchmark")
+    parser.add_argument("--formula")
+    parser.add_argument("--file")
+    parser.add_argument("--runs", type=int, default=1)
+    parser.add_argument("--timeout", type=int, default=60)
+    parser.add_argument("--z3", default=None)
+    parser.add_argument("--db", default=None)
+    parser.add_argument("--debug", action="store_true")
+    args = parser.parse_args()
+
+    setup_logging(args.debug)
+
+    if args.file:
+        formula = Path(args.file).read_text()
+    elif args.formula:
+        formula = args.formula
+    else:
+        parser.error("provide --formula or --file")
+        return
+
+    db = Z3DB(args.db)
+    timings = []
+
+    for i in range(args.runs):
+        run_id = db.start_run("benchmark", formula)
+        result = run_z3(formula, z3_bin=args.z3, timeout=args.timeout,
+                        args=["-st"], debug=args.debug)
+
+        stats = parse_stats(result["stdout"])
+        db.log_formula(run_id, formula, result["result"], stats=stats)
+        db.finish_run(run_id, result["result"], result["duration_ms"],
+                      result["exit_code"])
+        timings.append(result["duration_ms"])
+
+        if args.runs == 1:
+            print(f"result: {result['result']}")
+            print(f"time: {result['duration_ms']}ms")
+            if stats:
+                print("statistics:")
+                for k, v in sorted(stats.items()):
+                    print(f"  :{k} {v}")
+
+    if args.runs > 1:
+        print(f"runs: {args.runs}")
+        print(f"min: {min(timings)}ms")
+        print(f"median: {statistics.median(timings):.0f}ms")
+        print(f"max: {max(timings)}ms")
+        print(f"result: {result['result']}")
+
+    db.close()
+    sys.exit(0 if result["exit_code"] == 0 else 1)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/.github/skills/deeptest/SKILL.md b/.github/skills/deeptest/SKILL.md
new file mode 100644
index 000000000..ead3f5b84
--- /dev/null
+++ b/.github/skills/deeptest/SKILL.md
@@ -0,0 +1,70 @@
+---
+name: deeptest
+description: Generate stress tests and differential tests for Z3 theories. Creates random or structured SMT-LIB2 formulas, runs them through Z3, and checks for crashes, assertion failures, or result inconsistencies. Inspired by fuzzing and metamorphic testing approaches applied to SMT solvers.
+---
+
+Given a strategy and count, generate SMT-LIB2 formulas targeting Z3 internals and report anomalies. Strategies range from pure random generation to structured metamorphic and cross-theory combinations. Every formula and finding is logged to z3agent.db.
+
+# Step 1: Choose a strategy and run
+
+```bash
+python3 scripts/deeptest.py --strategy random --count 100 --seed 42
+python3 scripts/deeptest.py --strategy metamorphic --seed-file base.smt2 --count 50
+python3 scripts/deeptest.py --strategy cross-theory --theories "LIA,BV" --count 80
+python3 scripts/deeptest.py --strategy incremental --count 60 --debug
+```
+
+Available strategies:
+
+- `random`: generate formulas with random declarations (Int, Bool, BitVec), random arithmetic and boolean assertions, and check-sat.
+- `metamorphic`: start from a base formula (generated or loaded from file), apply equisatisfiable transformations (tautology insertion, double negation, assertion duplication), and verify the result stays consistent.
+- `cross-theory`: combine multiple theories (LIA, Bool, BV) in a single formula with bridging constraints to stress theory combination.
+- `incremental`: generate push/pop sequences with per-frame assertions to stress incremental solving.
+
+# Step 2: Interpret the output
+
+The script prints a summary after completion:
+
+```
+strategy:  random
+seed:      42
+formulas:  100
+anomalies: 2
+elapsed:   4500ms
+```
+
+A nonzero anomaly count means the run detected crashes (nonzero exit code), assertion failures in stderr, solver errors, or result disagreements between a base formula and its metamorphic variants.
+
+# Step 3: Inspect findings
+
+Findings are logged to `z3agent.db` with category, severity, and details:
+
+```bash
+python3 ../../shared/z3db.py query "SELECT category, severity, message FROM findings WHERE run_id IN (SELECT run_id FROM runs WHERE skill='deeptest') ORDER BY finding_id DESC LIMIT 20"
+```
+
+Each finding includes the formula index, exit code, and a stderr excerpt for triage.
+
+# Step 4: Reproduce
+
+Use the `--seed` parameter to reproduce a run exactly:
+
+```bash
+python3 scripts/deeptest.py --strategy random --count 100 --seed 42
+```
+
+The seed is printed in every run summary and logged in the run record.
+
+# Parameters
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| strategy | string | no | random | test generation strategy: random, metamorphic, cross-theory, incremental |
+| count | int | no | 50 | number of formulas to generate |
+| seed | int | no | clock | random seed for reproducibility |
+| seed-file | path | no | | base .smt2 file for metamorphic strategy |
+| theories | string | no | LIA,BV | comma-separated theories for cross-theory strategy |
+| timeout | int | no | 10 | per-formula Z3 timeout in seconds |
+| z3 | path | no | auto | path to z3 binary |
+| debug | flag | no | off | verbose tracing |
+| db | path | no | .z3-agent/z3agent.db | logging database |
diff --git a/.github/skills/deeptest/scripts/deeptest.py b/.github/skills/deeptest/scripts/deeptest.py
new file mode 100644
index 000000000..5d513a6bd
--- /dev/null
+++ b/.github/skills/deeptest/scripts/deeptest.py
@@ -0,0 +1,393 @@
+#!/usr/bin/env python3
+"""
+deeptest.py: generate and run stress tests for Z3.
+
+Usage:
+    python deeptest.py --strategy random --count 100
+    python deeptest.py --strategy metamorphic --seed-file base.smt2
+    python deeptest.py --strategy cross-theory --theories "LIA,BV" --debug
+"""
+
+import argparse
+import logging
+import random
+import sys
+import time
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent / "shared"))
+from z3db import Z3DB, run_z3, setup_logging
+
+log = logging.getLogger("deeptest")
+
+# Sort and operator tables
+
+THEORY_SORTS = {
+    "LIA":  "Int",
+    "Bool": "Bool",
+    "BV":   "(_ BitVec 32)",
+}
+
+INT_ARITH = ["+", "-", "*"]
+INT_CMP   = [">", "<", ">=", "<=", "="]
+BV_ARITH  = ["bvadd", "bvsub", "bvand", "bvor", "bvxor"]
+BV_CMP    = ["bvslt", "bvsgt", "bvsle", "bvsge", "="]
+
+# Assertion generators (one per sort)
+
+
+def _int_assertion(rng, vs):
+    if len(vs) < 2:
+        return f"(assert ({rng.choice(INT_CMP)} {vs[0]} {rng.randint(-10, 10)}))"
+    a, b = rng.sample(vs, 2)
+    return f"(assert ({rng.choice(INT_CMP)} ({rng.choice(INT_ARITH)} {a} {b}) {rng.randint(-10, 10)}))"
+
+
+def _bool_assertion(rng, vs):
+    if len(vs) == 1:
+        return f"(assert {vs[0]})" if rng.random() < 0.5 else f"(assert (not {vs[0]}))"
+    a, b = rng.sample(vs, 2)
+    return f"(assert ({rng.choice(['and', 'or', '=>'])} {a} {b}))"
+
+
+def _bv_assertion(rng, vs):
+    lit = f"(_ bv{rng.randint(0, 255)} 32)"
+    if len(vs) < 2:
+        return f"(assert ({rng.choice(BV_CMP)} {vs[0]} {lit}))"
+    a, b = rng.sample(vs, 2)
+    return f"(assert ({rng.choice(BV_CMP)} ({rng.choice(BV_ARITH)} {a} {b}) {lit}))"
+
+
+SORT_ASSERTION = {
+    "Int":            _int_assertion,
+    "Bool":           _bool_assertion,
+    "(_ BitVec 32)":  _bv_assertion,
+}
+
+
+def _random_assertion(rng, vars_by_sort):
+    """Pick a populated sort and emit one random assertion."""
+    available = [s for s in vars_by_sort if vars_by_sort[s]]
+    if not available:
+        return None
+    sort = rng.choice(available)
+    return SORT_ASSERTION[sort](rng, vars_by_sort[sort])
+
+# Formula generators
+
+
+def gen_random_formula(rng, num_vars=5, num_assertions=5):
+    """Random declarations, random assertions, single check-sat."""
+    lines = []
+    vars_by_sort = {}
+    sorts = list(THEORY_SORTS.values())
+
+    for i in range(num_vars):
+        sort = rng.choice(sorts)
+        name = f"v{i}"
+        lines.append(f"(declare-const {name} {sort})")
+        vars_by_sort.setdefault(sort, []).append(name)
+
+    for _ in range(num_assertions):
+        a = _random_assertion(rng, vars_by_sort)
+        if a:
+            lines.append(a)
+
+    lines.append("(check-sat)")
+    return "\n".join(lines)
+
+
+def gen_metamorphic_variant(rng, base_formula):
+    """Apply an equisatisfiable transformation to a formula.
+
+    Transformations:
+      tautology   : insert (assert true) before check-sat
+      double_neg  : wrap one assertion body in double negation
+      duplicate   : repeat an existing assertion
+    """
+    lines = base_formula.strip().split("\n")
+    transform = rng.choice(["tautology", "double_neg", "duplicate"])
+    assertion_idxs = [i for i, l in enumerate(lines)
+                      if l.strip().startswith("(assert")]
+
+    if transform == "tautology":
+        pos = next((i for i, l in enumerate(lines) if "check-sat" in l),
+                   len(lines))
+        lines.insert(pos, "(assert true)")
+
+    elif transform == "double_neg" and assertion_idxs:
+        idx = rng.choice(assertion_idxs)
+        body = lines[idx].strip()
+        inner = body[len("(assert "):-1]
+        lines[idx] = f"(assert (not (not {inner})))"
+
+    elif transform == "duplicate" and assertion_idxs:
+        idx = rng.choice(assertion_idxs)
+        lines.insert(idx + 1, lines[idx])
+
+    return "\n".join(lines)
+
+
+def gen_cross_theory_formula(rng, theories, num_vars=4, num_assertions=6):
+    """Combine variables from multiple theories with bridging constraints."""
+    lines = []
+    vars_by_sort = {}
+    sorts = [THEORY_SORTS[t] for t in theories if t in THEORY_SORTS]
+    if not sorts:
+        sorts = list(THEORY_SORTS.values())
+
+    for i in range(num_vars):
+        sort = sorts[i % len(sorts)]
+        name = f"v{i}"
+        lines.append(f"(declare-const {name} {sort})")
+        vars_by_sort.setdefault(sort, []).append(name)
+
+    for _ in range(num_assertions):
+        a = _random_assertion(rng, vars_by_sort)
+        if a:
+            lines.append(a)
+
+    # Bridge Int and Bool when both present
+    int_vs = vars_by_sort.get("Int", [])
+    bool_vs = vars_by_sort.get("Bool", [])
+    if int_vs and bool_vs:
+        iv = rng.choice(int_vs)
+        bv = rng.choice(bool_vs)
+        lines.append(f"(assert (= {bv} (> {iv} 0)))")
+
+    lines.append("(check-sat)")
+    return "\n".join(lines)
+
+
+def gen_incremental_formula(rng, num_frames=3, num_vars=4,
+                            asserts_per_frame=3):
+    """Push/pop sequence: all variables declared globally, assertions scoped."""
+    lines = []
+    vars_by_sort = {}
+    sorts = list(THEORY_SORTS.values())
+
+    for i in range(num_vars):
+        sort = rng.choice(sorts)
+        name = f"v{i}"
+        lines.append(f"(declare-const {name} {sort})")
+        vars_by_sort.setdefault(sort, []).append(name)
+
+    for _ in range(num_frames):
+        lines.append("(push 1)")
+        for _ in range(asserts_per_frame):
+            a = _random_assertion(rng, vars_by_sort)
+            if a:
+                lines.append(a)
+        lines.append("(check-sat)")
+        lines.append("(pop 1)")
+
+    lines.append("(check-sat)")
+    return "\n".join(lines)
+
+# Anomaly detection
+
+
+def classify_result(result):
+    """Return an anomaly category string or None if the result looks normal."""
+    if result["exit_code"] != 0 and result["result"] != "timeout":
+        return "crash"
+    if "assertion" in result["stderr"].lower():
+        return "assertion_failure"
+    if result["result"] == "error":
+        return "error"
+    return None
+
+# Strategy runners
+
+
+def run_random(args, rng, db, run_id):
+    anomalies = 0
+    for i in range(args.count):
+        formula = gen_random_formula(rng, rng.randint(2, 8),
+                                     rng.randint(1, 10))
+        log.debug("formula %d:\n%s", i, formula)
+        result = run_z3(formula, z3_bin=args.z3, timeout=args.timeout,
+                        debug=args.debug)
+        db.log_formula(run_id, formula, result["result"])
+
+        cat = classify_result(result)
+        if cat:
+            anomalies += 1
+            db.log_finding(
+                run_id, cat,
+                f"random formula #{i}: {cat} (exit={result['exit_code']})",
+                severity="high" if cat == "crash" else "medium",
+                details={"formula_index": i,
+                         "exit_code": result["exit_code"],
+                         "stderr": result["stderr"][:500]})
+            log.warning("anomaly in formula %d: %s", i, cat)
+    return anomalies
+
+
+def run_metamorphic(args, rng, db, run_id):
+    if args.seed_file:
+        base = Path(args.seed_file).read_text()
+    else:
+        base = gen_random_formula(rng, num_vars=4, num_assertions=3)
+
+    base_out = run_z3(base, z3_bin=args.z3, timeout=args.timeout,
+                      debug=args.debug)
+    base_status = base_out["result"]
+    db.log_formula(run_id, base, base_status)
+    log.info("base formula result: %s", base_status)
+
+    if base_status not in ("sat", "unsat"):
+        db.log_finding(run_id, "skip",
+                       f"base formula not definite: {base_status}",
+                       severity="info")
+        return 0
+
+    anomalies = 0
+    for i in range(args.count):
+        variant = gen_metamorphic_variant(rng, base)
+        log.debug("variant %d:\n%s", i, variant)
+        result = run_z3(variant, z3_bin=args.z3, timeout=args.timeout,
+                        debug=args.debug)
+        db.log_formula(run_id, variant, result["result"])
+
+        cat = classify_result(result)
+        if cat:
+            anomalies += 1
+            db.log_finding(
+                run_id, cat,
+                f"metamorphic variant #{i}: {cat}",
+                severity="high",
+                details={"variant_index": i,
+                         "stderr": result["stderr"][:500]})
+            log.warning("anomaly in variant %d: %s", i, cat)
+            continue
+
+        if result["result"] in ("sat", "unsat") \
+                and result["result"] != base_status:
+            anomalies += 1
+            db.log_finding(
+                run_id, "disagreement",
+                f"variant #{i}: expected {base_status}, "
+                f"got {result['result']}",
+                severity="critical",
+                details={"variant_index": i,
+                         "expected": base_status,
+                         "actual": result["result"]})
+            log.warning("disagreement in variant %d: expected %s, got %s",
+                        i, base_status, result["result"])
+    return anomalies
+
+
+def run_cross_theory(args, rng, db, run_id):
+    theories = [t.strip() for t in args.theories.split(",")]
+    anomalies = 0
+    for i in range(args.count):
+        formula = gen_cross_theory_formula(rng, theories,
+                                           rng.randint(3, 8),
+                                           rng.randint(2, 10))
+        log.debug("cross-theory formula %d:\n%s", i, formula)
+        result = run_z3(formula, z3_bin=args.z3, timeout=args.timeout,
+                        debug=args.debug)
+        db.log_formula(run_id, formula, result["result"])
+
+        cat = classify_result(result)
+        if cat:
+            anomalies += 1
+            db.log_finding(
+                run_id, cat,
+                f"cross-theory #{i} ({','.join(theories)}): {cat}",
+                severity="high" if cat == "crash" else "medium",
+                details={"formula_index": i, "theories": theories,
+                         "exit_code": result["exit_code"],
+                         "stderr": result["stderr"][:500]})
+            log.warning("anomaly in cross-theory formula %d: %s", i, cat)
+    return anomalies
+
+
+def run_incremental(args, rng, db, run_id):
+    anomalies = 0
+    for i in range(args.count):
+        num_frames = rng.randint(2, 6)
+        formula = gen_incremental_formula(rng, num_frames)
+        log.debug("incremental formula %d:\n%s", i, formula)
+        result = run_z3(formula, z3_bin=args.z3, timeout=args.timeout,
+                        debug=args.debug)
+        db.log_formula(run_id, formula, result["result"])
+
+        cat = classify_result(result)
+        if cat:
+            anomalies += 1
+            db.log_finding(
+                run_id, cat,
+                f"incremental #{i} ({num_frames} frames): {cat}",
+                severity="high" if cat == "crash" else "medium",
+                details={"formula_index": i, "num_frames": num_frames,
+                         "exit_code": result["exit_code"],
+                         "stderr": result["stderr"][:500]})
+            log.warning("anomaly in incremental formula %d: %s", i, cat)
+    return anomalies
+
+
+STRATEGIES = {
+    "random":        run_random,
+    "metamorphic":   run_metamorphic,
+    "cross-theory":  run_cross_theory,
+    "incremental":   run_incremental,
+}
+
+# Entry point
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        prog="deeptest",
+        description="Generate and run stress tests for Z3.",
+    )
+    parser.add_argument("--strategy", choices=list(STRATEGIES),
+                        default="random",
+                        help="test generation strategy")
+    parser.add_argument("--count", type=int, default=50,
+                        help="number of formulas to generate")
+    parser.add_argument("--seed", type=int, default=None,
+                        help="random seed for reproducibility")
+    parser.add_argument("--seed-file", default=None,
+                        help="base .smt2 file for metamorphic strategy")
+    parser.add_argument("--theories", default="LIA,BV",
+                        help="comma-separated theories for cross-theory")
+    parser.add_argument("--timeout", type=int, default=10,
+                        help="per-formula Z3 timeout in seconds")
+    parser.add_argument("--z3", default=None, help="path to z3 binary")
+    parser.add_argument("--db", default=None, help="path to z3agent.db")
+    parser.add_argument("--debug", action="store_true")
+    args = parser.parse_args()
+
+    setup_logging(args.debug)
+
+    seed = args.seed if args.seed is not None else int(time.time())
+    rng = random.Random(seed)
+    log.info("seed: %d", seed)
+
+    db = Z3DB(args.db)
+    run_id = db.start_run(
+        "deeptest",
+        f"strategy={args.strategy} count={args.count} seed={seed}")
+
+    t0 = time.monotonic()
+    anomalies = STRATEGIES[args.strategy](args, rng, db, run_id)
+    elapsed_ms = int((time.monotonic() - t0) * 1000)
+
+    status = "success" if anomalies == 0 else "findings"
+    db.finish_run(run_id, status, elapsed_ms)
+
+    print(f"strategy:  {args.strategy}")
+    print(f"seed:      {seed}")
+    print(f"formulas:  {args.count}")
+    print(f"anomalies: {anomalies}")
+    print(f"elapsed:   {elapsed_ms}ms")
+
+    db.close()
+    sys.exit(1 if anomalies > 0 else 0)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/.github/skills/encode/SKILL.md b/.github/skills/encode/SKILL.md
new file mode 100644
index 000000000..eef343bef
--- /dev/null
+++ b/.github/skills/encode/SKILL.md
@@ -0,0 +1,45 @@
+---
+name: encode
+description: Translate constraint problems into SMT-LIB2 or Z3 Python API code. Handles common problem classes including scheduling, graph coloring, arithmetic puzzles, and verification conditions.
+---
+
+Given a problem description (natural language, pseudocode, or a partial formulation), produce a complete, syntactically valid SMT-LIB2 encoding or Z3 Python script. The encoding should declare all variables, assert all constraints, and include the appropriate check-sat / get-model commands.
+
+# Step 1: Identify the problem class
+
+Common encodings:
+
+| Problem class | Theory | Typical sorts |
+|---------------|--------|---------------|
+| Integer arithmetic | LIA / NIA | Int |
+| Real arithmetic | LRA / NRA | Real |
+| Bitvector operations | QF_BV | (_ BitVec N) |
+| Arrays and maps | QF_AX | (Array Int Int) |
+| Strings and regex | QF_S | String, RegLan |
+| Uninterpreted functions | QF_UF | custom sorts |
+| Mixed theories | AUFLIA, etc. | combination |
+
+# Step 2: Generate the encoding
+
+```bash
+python3 scripts/encode.py --problem "Find integers x, y such that x^2 + y^2 = 25 and x > 0" --format smtlib2
+python3 scripts/encode.py --problem "Schedule 4 tasks on 2 machines minimizing makespan" --format python
+```
+
+For `--format smtlib2`, the output is a complete .smt2 file ready for the **solve** skill.
+For `--format python`, the output is a standalone Z3 Python script.
+
+# Step 3: Validate the encoding
+
+The script checks that the generated formula is syntactically valid by running a quick `z3 -in` parse check (no solving, just syntax). Parse errors are reported with the offending line.
+
+# Parameters
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| problem | string | yes | | problem description |
+| format | string | no | smtlib2 | output format: smtlib2 or python |
+| output | path | no | stdout | write to file instead of stdout |
+| validate | flag | no | on | run syntax check on the output |
+| debug | flag | no | off | verbose tracing |
+| db | path | no | .z3-agent/z3agent.db | logging database |
diff --git a/.github/skills/encode/scripts/encode.py b/.github/skills/encode/scripts/encode.py
new file mode 100644
index 000000000..67f3ea87d
--- /dev/null
+++ b/.github/skills/encode/scripts/encode.py
@@ -0,0 +1,144 @@
+#!/usr/bin/env python3
+"""
+encode.py: validate and format SMT-LIB2 encodings.
+
+Usage:
+    python encode.py --validate formula.smt2
+    python encode.py --validate formula.smt2 --debug
+"""
+
+import argparse
+import re
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent / "shared"))
+from z3db import Z3DB, run_z3, setup_logging
+
+
+VALIDATION_TIMEOUT = 5
+
+
+def read_input(path_or_stdin: str) -> str:
+    """Read formula from a file path or stdin (when path is '-')."""
+    if path_or_stdin == "-":
+        return sys.stdin.read()
+    p = Path(path_or_stdin)
+    if not p.exists():
+        print(f"file not found: {p}", file=sys.stderr)
+        sys.exit(1)
+    return p.read_text()
+
+
+def find_errors(output: str) -> list:
+    """Extract (error ...) messages from Z3 output."""
+    return re.findall(r'\(error\s+"([^"]+)"\)', output)
+
+
+def validate(formula: str, z3_bin: str = None, debug: bool = False) -> dict:
+    """
+    Validate an SMT-LIB2 formula by piping it through z3 -in.
+    Returns a dict with 'valid' (bool), 'errors' (list), and 'raw' output.
+    """
+    result = run_z3(
+        formula, z3_bin=z3_bin, timeout=VALIDATION_TIMEOUT, debug=debug,
+    )
+    errors = find_errors(result["stdout"]) + find_errors(result["stderr"])
+
+    if result["result"] == "timeout":
+        # Timeout during validation is not a syntax error: the formula
+        # parsed successfully but solving exceeded the limit.  That counts
+        # as syntactically valid.
+        return {"valid": True, "errors": [], "raw": result}
+
+    if errors or result["exit_code"] != 0:
+        return {"valid": False, "errors": errors, "raw": result}
+
+    return {"valid": True, "errors": [], "raw": result}
+
+
+def report_errors(errors: list, formula: str):
+    """Print each syntax error with surrounding context."""
+    lines = formula.splitlines()
+    print(f"validation failed: {len(errors)} error(s)", file=sys.stderr)
+    for err in errors:
+        print(f"  : {err}", file=sys.stderr)
+    if len(lines) <= 20:
+        print("formula:", file=sys.stderr)
+        for i, line in enumerate(lines, 1):
+            print(f"  {i:3d}  {line}", file=sys.stderr)
+
+
+def write_output(formula: str, output_path: str, fmt: str):
+    """Write the validated formula to a file or stdout."""
+    if fmt == "python":
+        print("python format output is generated by the agent, "
+              "not by this script", file=sys.stderr)
+        sys.exit(1)
+
+    if output_path:
+        Path(output_path).write_text(formula)
+        print(f"written to {output_path}")
+    else:
+        print(formula)
+
+
+def main():
+    parser = argparse.ArgumentParser(prog="encode")
+    parser.add_argument(
+        "--validate",
+        metavar="FILE",
+        help="path to .smt2 file to validate, or '-' for stdin",
+    )
+    parser.add_argument(
+        "--format",
+        choices=["smtlib2", "python"],
+        default="smtlib2",
+        help="output format (default: smtlib2)",
+    )
+    parser.add_argument(
+        "--output",
+        metavar="FILE",
+        default=None,
+        help="write result to file instead of stdout",
+    )
+    parser.add_argument("--z3", default=None, help="path to z3 binary")
+    parser.add_argument("--db", default=None)
+    parser.add_argument("--debug", action="store_true")
+    args = parser.parse_args()
+
+    setup_logging(args.debug)
+
+    if not args.validate:
+        parser.error("provide --validate FILE")
+        return
+
+    formula = read_input(args.validate)
+
+    db = Z3DB(args.db)
+    run_id = db.start_run("encode", formula)
+
+    result = validate(formula, z3_bin=args.z3, debug=args.debug)
+
+    if result["valid"]:
+        db.log_formula(run_id, formula, "valid")
+        db.finish_run(run_id, "valid", result["raw"]["duration_ms"], 0)
+        write_output(formula, args.output, args.format)
+        db.close()
+        sys.exit(0)
+    else:
+        db.log_formula(run_id, formula, "error")
+        for err in result["errors"]:
+            db.log_finding(run_id, "syntax", err, severity="error")
+        db.finish_run(
+            run_id, "error",
+            result["raw"]["duration_ms"],
+            result["raw"]["exit_code"],
+        )
+        report_errors(result["errors"], formula)
+        db.close()
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/.github/skills/explain/SKILL.md b/.github/skills/explain/SKILL.md
new file mode 100644
index 000000000..515b51378
--- /dev/null
+++ b/.github/skills/explain/SKILL.md
@@ -0,0 +1,52 @@
+---
+name: explain
+description: Parse and interpret Z3 output for human consumption. Handles models, unsat cores, proofs, statistics, and error messages. Translates solver internals into plain-language explanations.
+---
+
+Given raw Z3 output (from the **solve**, **prove**, **optimize**, or **benchmark** skills), produce a structured explanation. This skill is for cases where the solver output is large, nested, or otherwise difficult to read directly.
+
+# Step 1: Identify the output type
+
+| Output contains | Explanation type |
+|----------------|-----------------|
+| `(define-fun ...)` blocks | model explanation |
+| unsat core labels | conflict explanation |
+| `:key value` statistics | performance breakdown |
+| `(error ...)` | error diagnosis |
+| proof terms | proof sketch |
+
+# Step 2: Run the explainer
+
+```bash
+python3 scripts/explain.py --file output.txt
+python3 scripts/explain.py --stdin < output.txt
+python3 scripts/explain.py --file output.txt --debug
+```
+
+The script auto-detects the output type and produces a structured summary.
+
+# Step 3: Interpret the explanation
+
+For models:
+- Each variable is listed with its value and sort
+- Array and function interpretations are expanded
+- Bitvector values are shown in decimal and hex
+
+For unsat cores:
+- The conflicting named assertions are listed
+- A minimal conflict set is highlighted
+
+For statistics:
+- Time breakdown by phase (preprocessing, solving, model construction)
+- Theory solver load distribution
+- Memory high-water mark
+
+# Parameters
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| file | path | no | | file containing Z3 output |
+| stdin | flag | no | off | read from stdin |
+| type | string | no | auto | force output type: model, core, stats, error |
+| debug | flag | no | off | verbose tracing |
+| db | path | no | .z3-agent/z3agent.db | logging database |
diff --git a/.github/skills/explain/scripts/explain.py b/.github/skills/explain/scripts/explain.py
new file mode 100644
index 000000000..d2704085a
--- /dev/null
+++ b/.github/skills/explain/scripts/explain.py
@@ -0,0 +1,128 @@
+#!/usr/bin/env python3
+"""
+explain.py: interpret Z3 output in a readable form.
+
+Usage:
+    python explain.py --file output.txt
+    echo "sat\n(model ...)" | python explain.py --stdin
+"""
+
+import argparse
+import re
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent / "shared"))
+from z3db import Z3DB, parse_model, parse_stats, parse_unsat_core, setup_logging
+
+
+def detect_type(text: str) -> str:
+    if "(define-fun" in text:
+        return "model"
+    if "(error" in text:
+        return "error"
+    if re.search(r':\S+\s+[\d.]+', text):
+        return "stats"
+    first = text.strip().split("\n")[0].strip()
+    if first == "unsat":
+        return "core"
+    return "unknown"
+
+
+def explain_model(text: str):
+    model = parse_model(text)
+    if not model:
+        print("no model found in output")
+        return
+    print("satisfying assignment:")
+    for name, val in model.items():
+        # show hex for large integers (likely bitvectors)
+        try:
+            n = int(val)
+            if abs(n) > 255:
+                print(f"  {name} = {val} (0x{n:x})")
+            else:
+                print(f"  {name} = {val}")
+        except ValueError:
+            print(f"  {name} = {val}")
+
+
+def explain_core(text: str):
+    core = parse_unsat_core(text)
+    if core:
+        print(f"conflicting assertions ({len(core)}):")
+        for label in core:
+            print(f"  {label}")
+    else:
+        print("unsat (no named assertions for core extraction)")
+
+
+def explain_stats(text: str):
+    stats = parse_stats(text)
+    if not stats:
+        print("no statistics found")
+        return
+    print("performance breakdown:")
+    for k in sorted(stats):
+        print(f"  :{k} {stats[k]}")
+
+    if "time" in stats:
+        print(f"\ntotal time: {stats['time']}s")
+    if "memory" in stats:
+        print(f"peak memory: {stats['memory']} MB")
+
+
+def explain_error(text: str):
+    errors = re.findall(r'\(error\s+"([^"]+)"\)', text)
+    if errors:
+        print(f"Z3 reported {len(errors)} error(s):")
+        for e in errors:
+            print(f"  {e}")
+    else:
+        print("error in output but could not parse message")
+
+
+def main():
+    parser = argparse.ArgumentParser(prog="explain")
+    parser.add_argument("--file")
+    parser.add_argument("--stdin", action="store_true")
+    parser.add_argument("--type", choices=["model", "core", "stats", "error", "auto"],
+                        default="auto")
+    parser.add_argument("--db", default=None)
+    parser.add_argument("--debug", action="store_true")
+    args = parser.parse_args()
+
+    setup_logging(args.debug)
+
+    if args.file:
+        text = Path(args.file).read_text()
+    elif args.stdin:
+        text = sys.stdin.read()
+    else:
+        parser.error("provide --file or --stdin")
+        return
+
+    output_type = args.type if args.type != "auto" else detect_type(text)
+
+    db = Z3DB(args.db)
+    run_id = db.start_run("explain", text[:200])
+
+    if output_type == "model":
+        explain_model(text)
+    elif output_type == "core":
+        explain_core(text)
+    elif output_type == "stats":
+        explain_stats(text)
+    elif output_type == "error":
+        explain_error(text)
+    else:
+        print("could not determine output type")
+        print("raw output:")
+        print(text[:500])
+
+    db.finish_run(run_id, "success", 0, 0)
+    db.close()
+
+
+if __name__ == "__main__":
+    main()
diff --git a/.github/skills/memory-safety/SKILL.md b/.github/skills/memory-safety/SKILL.md
new file mode 100644
index 000000000..75a7861c2
--- /dev/null
+++ b/.github/skills/memory-safety/SKILL.md
@@ -0,0 +1,53 @@
+---
+name: memory-safety
+description: Run AddressSanitizer and UndefinedBehaviorSanitizer on the Z3 test suite to detect memory errors, undefined behavior, and leaks. Logs each finding to z3agent.db.
+---
+
+Build Z3 with compiler-based sanitizer instrumentation, execute the test suite, and parse the runtime output for memory safety violations. Supported sanitizers are AddressSanitizer (heap and stack buffer overflows, use-after-free, double-free, memory leaks) and UndefinedBehaviorSanitizer (signed integer overflow, null pointer dereference, misaligned access, shift errors). Findings are deduplicated and stored in z3agent.db for triage and longitudinal tracking.
+
+# Step 1: Configure and build
+
+The script invokes cmake with the appropriate `-fsanitize` flags and builds the `test-z3` target. Each sanitizer uses a separate build directory to avoid flag conflicts. If a prior instrumented build exists with matching flags, only incremental compilation runs.
+
+```bash
+python3 scripts/memory_safety.py --sanitizer asan
+python3 scripts/memory_safety.py --sanitizer ubsan
+python3 scripts/memory_safety.py --sanitizer both
+```
+
+To reuse an existing build:
+```bash
+python3 scripts/memory_safety.py --sanitizer asan --skip-build --build-dir build/sanitizer-asan
+```
+
+# Step 2: Run and collect
+
+The test binary runs with `halt_on_error=0` so the sanitizer reports all violations rather than aborting on the first. The script parses `ERROR: AddressSanitizer`, `runtime error:`, and `ERROR: LeakSanitizer` patterns from the combined output, extracts source locations where available, and deduplicates by category, file, and line.
+
+```bash
+python3 scripts/memory_safety.py --sanitizer asan --timeout 900 --debug
+```
+
+# Step 3: Interpret results
+
+- `clean`: no sanitizer violations detected.
+- `findings`: one or more violations found. Each is printed with severity, category, message, and source location.
+- `timeout`: the test suite did not complete within the deadline. Increase the timeout or investigate a possible infinite loop.
+- `error`: build or execution failed before sanitizer output could be collected.
+
+Query past runs:
+```bash
+python3 ../../shared/z3db.py runs --skill memory-safety --last 10
+python3 ../../shared/z3db.py query "SELECT category, severity, file, line, message FROM findings WHERE run_id IN (SELECT run_id FROM runs WHERE skill='memory-safety') ORDER BY run_id DESC LIMIT 20"
+```
+
+# Parameters
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| sanitizer | choice | no | asan | which sanitizer to enable: asan, ubsan, or both |
+| build-dir | path | no | build/sanitizer-{name} | path to the build directory |
+| timeout | int | no | 600 | seconds before killing the test run |
+| skip-build | flag | no | off | reuse an existing instrumented build |
+| debug | flag | no | off | verbose cmake, make, and test output |
+| db | path | no | .z3-agent/z3agent.db | path to the logging database |
diff --git a/.github/skills/memory-safety/scripts/memory_safety.py b/.github/skills/memory-safety/scripts/memory_safety.py
new file mode 100644
index 000000000..cab818a63
--- /dev/null
+++ b/.github/skills/memory-safety/scripts/memory_safety.py
@@ -0,0 +1,266 @@
+#!/usr/bin/env python3
+"""
+memory_safety.py: run sanitizer checks on Z3 test suite.
+
+Usage:
+    python memory_safety.py --sanitizer asan
+    python memory_safety.py --sanitizer ubsan --debug
+"""
+
+import argparse
+import logging
+import os
+import re
+import subprocess
+import sys
+import time
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent / "shared"))
+from z3db import Z3DB, setup_logging
+
+logger = logging.getLogger("z3agent")
+
+SANITIZER_FLAGS = {
+    "asan": "-fsanitize=address -fno-omit-frame-pointer",
+    "ubsan": "-fsanitize=undefined -fno-omit-frame-pointer",
+}
+
+ASAN_ERROR = re.compile(r"ERROR:\s*AddressSanitizer:\s*(\S+)")
+UBSAN_ERROR = re.compile(r":\d+:\d+:\s*runtime error:\s*(.+)")
+LEAK_ERROR = re.compile(r"ERROR:\s*LeakSanitizer:")
+LOCATION = re.compile(r"(\S+\.(?:cpp|c|h|hpp)):(\d+)")
+
+
+def find_repo_root() -> Path:
+    d = Path.cwd()
+    for _ in range(10):
+        if (d / "CMakeLists.txt").exists() and (d / "src").is_dir():
+            return d
+        parent = d.parent
+        if parent == d:
+            break
+        d = parent
+    logger.error("could not locate Z3 repository root")
+    sys.exit(1)
+
+
+def build_is_configured(build_dir: Path, sanitizer: str) -> bool:
+    """Check whether the build directory already has a matching cmake config."""
+    cache = build_dir / "CMakeCache.txt"
+    if not cache.is_file():
+        return False
+    expected = SANITIZER_FLAGS[sanitizer].split()[0]
+    return expected in cache.read_text()
+
+
+def configure(build_dir: Path, sanitizer: str, repo_root: Path) -> bool:
+    """Run cmake with the requested sanitizer flags."""
+    flags = SANITIZER_FLAGS[sanitizer]
+    build_dir.mkdir(parents=True, exist_ok=True)
+    cmd = [
+        "cmake", str(repo_root),
+        f"-DCMAKE_C_FLAGS={flags}",
+        f"-DCMAKE_CXX_FLAGS={flags}",
+        f"-DCMAKE_EXE_LINKER_FLAGS={flags}",
+        "-DCMAKE_BUILD_TYPE=Debug",
+        "-DZ3_BUILD_TEST=ON",
+    ]
+    logger.info("configuring %s build in %s", sanitizer, build_dir)
+    logger.debug("cmake command: %s", " ".join(cmd))
+    proc = subprocess.run(cmd, cwd=build_dir, capture_output=True, text=True)
+    if proc.returncode != 0:
+        logger.error("cmake failed:\n%s", proc.stderr)
+        return False
+    return True
+
+
+def compile_tests(build_dir: Path) -> bool:
+    """Compile the test-z3 target."""
+    nproc = os.cpu_count() or 4
+    cmd = ["make", f"-j{nproc}", "test-z3"]
+    logger.info("compiling test-z3 (%d parallel jobs)", nproc)
+    proc = subprocess.run(cmd, cwd=build_dir, capture_output=True, text=True)
+    if proc.returncode != 0:
+        logger.error("compilation failed:\n%s", proc.stderr[-2000:])
+        return False
+    return True
+
+
+def run_tests(build_dir: Path, timeout: int) -> dict:
+    """Execute test-z3 under sanitizer runtime and capture output."""
+    test_bin = build_dir / "test-z3"
+    if not test_bin.is_file():
+        logger.error("test-z3 not found at %s", test_bin)
+        return {"stdout": "", "stderr": "binary not found", "exit_code": -1,
+                "duration_ms": 0}
+
+    env = os.environ.copy()
+    env["ASAN_OPTIONS"] = "detect_leaks=1:halt_on_error=0:print_stacktrace=1"
+    env["UBSAN_OPTIONS"] = "print_stacktrace=1:halt_on_error=0"
+
+    cmd = [str(test_bin), "/a"]
+    logger.info("running: %s", " ".join(cmd))
+    start = time.monotonic()
+    try:
+        proc = subprocess.run(
+            cmd, capture_output=True, text=True, timeout=timeout,
+            cwd=build_dir, env=env,
+        )
+    except subprocess.TimeoutExpired:
+        ms = int((time.monotonic() - start) * 1000)
+        logger.warning("test-z3 timed out after %dms", ms)
+        return {"stdout": "", "stderr": "timeout", "exit_code": -1,
+                "duration_ms": ms}
+
+    ms = int((time.monotonic() - start) * 1000)
+    logger.debug("exit_code=%d duration=%dms", proc.returncode, ms)
+    return {
+        "stdout": proc.stdout,
+        "stderr": proc.stderr,
+        "exit_code": proc.returncode,
+        "duration_ms": ms,
+    }
+
+
+def parse_findings(output: str) -> list:
+    """Extract sanitizer error reports from combined stdout and stderr."""
+    findings = []
+    lines = output.split("\n")
+
+    for i, line in enumerate(lines):
+        entry = None
+
+        m = ASAN_ERROR.search(line)
+        if m:
+            entry = {"category": "asan", "message": m.group(1),
+                     "severity": "high"}
+
+        if not entry:
+            m = LEAK_ERROR.search(line)
+            if m:
+                entry = {"category": "leak",
+                         "message": "detected memory leaks",
+                         "severity": "high"}
+
+        if not entry:
+            m = UBSAN_ERROR.search(line)
+            if m:
+                entry = {"category": "ubsan", "message": m.group(1),
+                         "severity": "medium"}
+
+        if not entry:
+            continue
+
+        file_path, line_no = None, None
+        window = lines[max(0, i - 2):i + 5]
+        for ctx in window:
+            loc = LOCATION.search(ctx)
+            if loc and "/usr/" not in loc.group(1):
+                file_path = loc.group(1)
+                line_no = int(loc.group(2))
+                break
+
+        entry["file"] = file_path
+        entry["line"] = line_no
+        entry["raw"] = line.strip()
+        findings.append(entry)
+
+    return findings
+
+
+def deduplicate(findings: list) -> list:
+    """Remove duplicate reports at the same category, file, and line."""
+    seen = set()
+    result = []
+    for f in findings:
+        key = (f["category"], f["file"], f["line"], f["message"])
+        if key in seen:
+            continue
+        seen.add(key)
+        result.append(f)
+    return result
+
+
+def main():
+    parser = argparse.ArgumentParser(prog="memory-safety")
+    parser.add_argument("--sanitizer", choices=["asan", "ubsan", "both"],
+                        default="asan",
+                        help="sanitizer to enable (default: asan)")
+    parser.add_argument("--build-dir", default=None,
+                        help="path to build directory")
+    parser.add_argument("--timeout", type=int, default=600,
+                        help="seconds before killing test run")
+    parser.add_argument("--skip-build", action="store_true",
+                        help="reuse existing instrumented build")
+    parser.add_argument("--db", default=None,
+                        help="path to z3agent.db")
+    parser.add_argument("--debug", action="store_true")
+    args = parser.parse_args()
+
+    setup_logging(args.debug)
+    repo_root = find_repo_root()
+
+    sanitizers = ["asan", "ubsan"] if args.sanitizer == "both" else [args.sanitizer]
+    all_findings = []
+
+    db = Z3DB(args.db)
+
+    for san in sanitizers:
+        if args.build_dir:
+            build_dir = Path(args.build_dir)
+        else:
+            build_dir = repo_root / "build" / f"sanitizer-{san}"
+
+        run_id = db.start_run("memory-safety", f"sanitizer={san}")
+        db.log(f"sanitizer: {san}, build: {build_dir}", run_id=run_id)
+
+        if not args.skip_build:
+            needs_configure = not build_is_configured(build_dir, san)
+            if needs_configure and not configure(build_dir, san, repo_root):
+                db.finish_run(run_id, "error", 0, exit_code=1)
+                print(f"FAIL: cmake configuration failed for {san}")
+                continue
+            if not compile_tests(build_dir):
+                db.finish_run(run_id, "error", 0, exit_code=1)
+                print(f"FAIL: compilation failed for {san}")
+                continue
+
+        result = run_tests(build_dir, args.timeout)
+        combined = result["stdout"] + "\n" + result["stderr"]
+        findings = deduplicate(parse_findings(combined))
+
+        for f in findings:
+            db.log_finding(
+                run_id,
+                category=f["category"],
+                message=f["message"],
+                severity=f["severity"],
+                file=f["file"],
+                line=f["line"],
+                details={"raw": f["raw"]},
+            )
+
+        status = "clean" if not findings else "findings"
+        if result["exit_code"] == -1:
+            status = "timeout" if "timeout" in result["stderr"] else "error"
+
+        db.finish_run(run_id, status, result["duration_ms"], result["exit_code"])
+        all_findings.extend(findings)
+        print(f"{san}: {len(findings)} finding(s), {result['duration_ms']}ms")
+
+    if all_findings:
+        print(f"\nTotal: {len(all_findings)} finding(s)")
+        for f in all_findings:
+            loc = f"{f['file']}:{f['line']}" if f["file"] else "unknown location"
+            print(f"  [{f['severity']}] {f['category']}: {f['message']} at {loc}")
+        db.close()
+        sys.exit(1)
+    else:
+        print("\nNo sanitizer findings.")
+        db.close()
+        sys.exit(0)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/.github/skills/optimize/SKILL.md b/.github/skills/optimize/SKILL.md
new file mode 100644
index 000000000..fc93a7e2c
--- /dev/null
+++ b/.github/skills/optimize/SKILL.md
@@ -0,0 +1,48 @@
+---
+name: optimize
+description: Solve constrained optimization problems using Z3. Supports minimization and maximization of objective functions over integer, real, and bitvector domains.
+---
+
+Given a set of constraints and an objective function, find the optimal value. Z3 supports both hard constraints (must hold) and soft constraints (weighted preferences), as well as lexicographic multi-objective optimization.
+
+# Step 1: Formulate the problem
+
+The formula uses the `(minimize ...)` or `(maximize ...)` directives followed by `(check-sat)` and `(get-model)`.
+
+Example: minimize `x + y` subject to `x >= 1`, `y >= 2`, `x + y <= 10`:
+```smtlib
+(declare-const x Int)
+(declare-const y Int)
+(assert (>= x 1))
+(assert (>= y 2))
+(assert (<= (+ x y) 10))
+(minimize (+ x y))
+(check-sat)
+(get-model)
+```
+
+# Step 2: Run the optimizer
+
+```bash
+python3 scripts/optimize.py --file scheduling.smt2
+python3 scripts/optimize.py --formula "<inline smt-lib2>" --debug
+```
+
+# Step 3: Interpret the output
+
+- `sat` with a model: the optimal assignment satisfying all constraints.
+- `unsat`: the constraints are contradictory; no feasible solution exists.
+- `unknown` or `timeout`: Z3 could not determine optimality.
+
+The script prints the objective value and the satisfying assignment.
+
+# Parameters
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| formula | string | no | | SMT-LIB2 formula with minimize/maximize |
+| file | path | no | | path to .smt2 file |
+| timeout | int | no | 60 | seconds |
+| z3 | path | no | auto | path to z3 binary |
+| debug | flag | no | off | verbose tracing |
+| db | path | no | .z3-agent/z3agent.db | logging database |
diff --git a/.github/skills/optimize/scripts/optimize.py b/.github/skills/optimize/scripts/optimize.py
new file mode 100644
index 000000000..8c7462ccb
--- /dev/null
+++ b/.github/skills/optimize/scripts/optimize.py
@@ -0,0 +1,60 @@
+#!/usr/bin/env python3
+"""
+optimize.py: solve constrained optimization problems via Z3.
+
+Usage:
+    python optimize.py --file scheduling.smt2
+    python optimize.py --formula "(declare-const x Int)..." --debug
+"""
+
+import argparse
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent / "shared"))
+from z3db import Z3DB, run_z3, parse_model, setup_logging
+
+
+def main():
+    parser = argparse.ArgumentParser(prog="optimize")
+    parser.add_argument("--formula")
+    parser.add_argument("--file")
+    parser.add_argument("--timeout", type=int, default=60)
+    parser.add_argument("--z3", default=None)
+    parser.add_argument("--db", default=None)
+    parser.add_argument("--debug", action="store_true")
+    args = parser.parse_args()
+
+    setup_logging(args.debug)
+
+    if args.file:
+        formula = Path(args.file).read_text()
+    elif args.formula:
+        formula = args.formula
+    else:
+        parser.error("provide --formula or --file")
+        return
+
+    db = Z3DB(args.db)
+    run_id = db.start_run("optimize", formula)
+
+    result = run_z3(formula, z3_bin=args.z3, timeout=args.timeout, debug=args.debug)
+
+    model = parse_model(result["stdout"]) if result["result"] == "sat" else None
+
+    db.log_formula(run_id, formula, result["result"],
+                   str(model) if model else None)
+    db.finish_run(run_id, result["result"], result["duration_ms"],
+                  result["exit_code"])
+
+    print(result["result"])
+    if model:
+        for name, val in model.items():
+            print(f"  {name} = {val}")
+
+    db.close()
+    sys.exit(0 if result["exit_code"] == 0 else 1)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/.github/skills/prove/SKILL.md b/.github/skills/prove/SKILL.md
new file mode 100644
index 000000000..a67d57758
--- /dev/null
+++ b/.github/skills/prove/SKILL.md
@@ -0,0 +1,54 @@
+---
+name: prove
+description: Prove validity of logical statements by negation and satisfiability checking. If the negation is unsatisfiable, the original statement is valid. Otherwise a counterexample is returned.
+---
+
+Given a conjecture (an SMT-LIB2 assertion or a natural language claim), determine whether it holds universally. The method is standard: negate the conjecture and check satisfiability. If the negation is unsatisfiable, the original is valid. If satisfiable, the model is a counterexample.
+
+# Step 1: Prepare the negated formula
+
+Wrap the conjecture in `(assert (not ...))` and append `(check-sat)(get-model)`.
+
+Example: to prove that `(> x 3)` implies `(> x 1)`:
+```smtlib
+(declare-const x Int)
+(assert (not (=> (> x 3) (> x 1))))
+(check-sat)
+(get-model)
+```
+
+# Step 2: Run the prover
+
+```bash
+python3 scripts/prove.py --conjecture "(=> (> x 3) (> x 1))" --vars "x:Int"
+```
+
+For file input where the file contains the full negated formula:
+```bash
+python3 scripts/prove.py --file negated.smt2
+```
+
+With debug tracing:
+```bash
+python3 scripts/prove.py --conjecture "(=> (> x 3) (> x 1))" --vars "x:Int" --debug
+```
+
+# Step 3: Interpret the output
+
+- `valid`: the negation was unsat, so the conjecture holds for all inputs.
+- `invalid` followed by a counterexample: the negation was sat; the model shows a concrete assignment where the conjecture fails.
+- `unknown` or `timeout`: Z3 could not decide. The conjecture may require auxiliary lemmas or induction.
+
+# Parameters
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| conjecture | string | no | | the assertion to prove (without negation) |
+| vars | string | no | | variable declarations as "name:sort" pairs, comma-separated |
+| file | path | no | | .smt2 file with the negated formula |
+| timeout | int | no | 30 | seconds |
+| z3 | path | no | auto | path to z3 binary |
+| debug | flag | no | off | verbose tracing |
+| db | path | no | .z3-agent/z3agent.db | logging database |
+
+Either `conjecture` (with `vars`) or `file` must be provided.
diff --git a/.github/skills/prove/scripts/prove.py b/.github/skills/prove/scripts/prove.py
new file mode 100644
index 000000000..b4656fdd7
--- /dev/null
+++ b/.github/skills/prove/scripts/prove.py
@@ -0,0 +1,82 @@
+#!/usr/bin/env python3
+"""
+prove.py: prove validity by negation + satisfiability check.
+
+Usage:
+    python prove.py --conjecture "(=> (> x 3) (> x 1))" --vars "x:Int"
+    python prove.py --file negated.smt2
+"""
+
+import argparse
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent / "shared"))
+from z3db import Z3DB, run_z3, parse_model, setup_logging
+
+
+def build_formula(conjecture: str, vars_str: str) -> str:
+    lines = []
+    if vars_str:
+        for v in vars_str.split(","):
+            v = v.strip()
+            name, sort = v.split(":")
+            lines.append(f"(declare-const {name.strip()} {sort.strip()})")
+    lines.append(f"(assert (not {conjecture}))")
+    lines.append("(check-sat)")
+    lines.append("(get-model)")
+    return "\n".join(lines)
+
+
+def main():
+    parser = argparse.ArgumentParser(prog="prove")
+    parser.add_argument("--conjecture", help="assertion to prove")
+    parser.add_argument("--vars", help="variable declarations, e.g. 'x:Int,y:Bool'")
+    parser.add_argument("--file", help="path to .smt2 file with negated formula")
+    parser.add_argument("--timeout", type=int, default=30)
+    parser.add_argument("--z3", default=None)
+    parser.add_argument("--db", default=None)
+    parser.add_argument("--debug", action="store_true")
+    args = parser.parse_args()
+
+    setup_logging(args.debug)
+
+    if args.file:
+        formula = Path(args.file).read_text()
+    elif args.conjecture:
+        formula = build_formula(args.conjecture, args.vars or "")
+    else:
+        parser.error("provide --conjecture or --file")
+        return
+
+    db = Z3DB(args.db)
+    run_id = db.start_run("prove", formula)
+
+    result = run_z3(formula, z3_bin=args.z3, timeout=args.timeout, debug=args.debug)
+
+    if result["result"] == "unsat":
+        verdict = "valid"
+    elif result["result"] == "sat":
+        verdict = "invalid"
+    else:
+        verdict = result["result"]
+
+    model = parse_model(result["stdout"]) if verdict == "invalid" else None
+
+    db.log_formula(run_id, formula, verdict, str(model) if model else None)
+    db.finish_run(run_id, verdict, result["duration_ms"], result["exit_code"])
+
+    print(verdict)
+    if model:
+        print("counterexample:")
+        for name, val in model.items():
+            print(f"  {name} = {val}")
+
+    db.close()
+    # Exit 0 when we successfully determined validity or invalidity;
+    # exit 1 only for errors/timeouts.
+    sys.exit(0 if verdict in ("valid", "invalid") else 1)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/.github/skills/shared/schema.sql b/.github/skills/shared/schema.sql
new file mode 100644
index 000000000..90c365e6d
--- /dev/null
+++ b/.github/skills/shared/schema.sql
@@ -0,0 +1,57 @@
+-- z3agent schema v1
+
+PRAGMA journal_mode=WAL;
+PRAGMA foreign_keys=ON;
+
+CREATE TABLE IF NOT EXISTS runs (
+    run_id      INTEGER PRIMARY KEY AUTOINCREMENT,
+    skill       TEXT NOT NULL,
+    input_hash  TEXT,
+    status      TEXT NOT NULL DEFAULT 'running',
+    duration_ms INTEGER,
+    exit_code   INTEGER,
+    timestamp   TEXT NOT NULL DEFAULT (datetime('now'))
+);
+
+CREATE INDEX IF NOT EXISTS idx_runs_skill ON runs(skill);
+CREATE INDEX IF NOT EXISTS idx_runs_status ON runs(status);
+
+CREATE TABLE IF NOT EXISTS formulas (
+    formula_id  INTEGER PRIMARY KEY AUTOINCREMENT,
+    run_id      INTEGER REFERENCES runs(run_id) ON DELETE CASCADE,
+    smtlib2     TEXT NOT NULL,
+    result      TEXT,
+    model       TEXT,
+    stats       TEXT,
+    timestamp   TEXT NOT NULL DEFAULT (datetime('now'))
+);
+
+CREATE INDEX IF NOT EXISTS idx_formulas_run ON formulas(run_id);
+CREATE INDEX IF NOT EXISTS idx_formulas_result ON formulas(result);
+
+CREATE TABLE IF NOT EXISTS findings (
+    finding_id  INTEGER PRIMARY KEY AUTOINCREMENT,
+    run_id      INTEGER REFERENCES runs(run_id) ON DELETE CASCADE,
+    category    TEXT NOT NULL,
+    severity    TEXT,
+    file        TEXT,
+    line        INTEGER,
+    message     TEXT NOT NULL,
+    details     TEXT,
+    timestamp   TEXT NOT NULL DEFAULT (datetime('now'))
+);
+
+CREATE INDEX IF NOT EXISTS idx_findings_run ON findings(run_id);
+CREATE INDEX IF NOT EXISTS idx_findings_category ON findings(category);
+CREATE INDEX IF NOT EXISTS idx_findings_severity ON findings(severity);
+
+CREATE TABLE IF NOT EXISTS interaction_log (
+    log_id      INTEGER PRIMARY KEY AUTOINCREMENT,
+    run_id      INTEGER REFERENCES runs(run_id) ON DELETE SET NULL,
+    level       TEXT NOT NULL DEFAULT 'info',
+    message     TEXT NOT NULL,
+    timestamp   TEXT NOT NULL DEFAULT (datetime('now'))
+);
+
+CREATE INDEX IF NOT EXISTS idx_log_run ON interaction_log(run_id);
+CREATE INDEX IF NOT EXISTS idx_log_level ON interaction_log(level);
diff --git a/.github/skills/shared/z3db.py b/.github/skills/shared/z3db.py
new file mode 100644
index 000000000..ca959073d
--- /dev/null
+++ b/.github/skills/shared/z3db.py
@@ -0,0 +1,328 @@
+#!/usr/bin/env python3
+"""
+z3db: shared library and CLI for Z3 skill scripts.
+
+Library usage:
+    from z3db import Z3DB, find_z3, run_z3
+
+CLI usage:
+    python z3db.py init
+    python z3db.py status
+    python z3db.py log [--run-id N]
+    python z3db.py runs [--skill solve] [--last N]
+    python z3db.py query "SELECT ..."
+"""
+
+import argparse
+import hashlib
+import json
+import logging
+import os
+import re
+import shutil
+import sqlite3
+import subprocess
+import sys
+import time
+from pathlib import Path
+from typing import Optional
+
+
+SCHEMA_PATH = Path(__file__).parent / "schema.sql"
+DEFAULT_DB_DIR = ".z3-agent"
+DEFAULT_DB_NAME = "z3agent.db"
+
+logger = logging.getLogger("z3agent")
+
+
+def setup_logging(debug: bool = False):
+    level = logging.DEBUG if debug else logging.INFO
+    fmt = "[%(levelname)s] %(message)s" if not debug else \
+          "[%(levelname)s %(asctime)s] %(message)s"
+    logging.basicConfig(level=level, format=fmt, stream=sys.stderr)
+
+
+class Z3DB:
+    """SQLite handle for z3agent.db, tracks runs, formulas, findings, logs."""
+
+    def __init__(self, db_path: Optional[str] = None):
+        if db_path is None:
+            db_dir = Path(DEFAULT_DB_DIR)
+            db_dir.mkdir(exist_ok=True)
+            db_path = str(db_dir / DEFAULT_DB_NAME)
+        self.db_path = db_path
+        self.conn = sqlite3.connect(db_path)
+        self.conn.execute("PRAGMA foreign_keys=ON")
+        self.conn.row_factory = sqlite3.Row
+        self._init_schema()
+
+    def _init_schema(self):
+        self.conn.executescript(SCHEMA_PATH.read_text())
+
+    def close(self):
+        self.conn.close()
+
+    def start_run(self, skill: str, input_text: str = "") -> int:
+        input_hash = hashlib.sha256(input_text.encode()).hexdigest()[:16]
+        cur = self.conn.execute(
+            "INSERT INTO runs (skill, input_hash) VALUES (?, ?)",
+            (skill, input_hash),
+        )
+        self.conn.commit()
+        run_id = cur.lastrowid
+        logger.debug("started run %d (skill=%s, hash=%s)", run_id, skill, input_hash)
+        return run_id
+
+    def finish_run(self, run_id: int, status: str, duration_ms: int,
+                   exit_code: int = 0):
+        self.conn.execute(
+            "UPDATE runs SET status=?, duration_ms=?, exit_code=? WHERE run_id=?",
+            (status, duration_ms, exit_code, run_id),
+        )
+        self.conn.commit()
+        logger.debug("finished run %d: %s (%dms)", run_id, status, duration_ms)
+
+    def log_formula(self, run_id: int, smtlib2: str, result: str = None,
+                    model: str = None, stats: dict = None) -> int:
+        cur = self.conn.execute(
+            "INSERT INTO formulas (run_id, smtlib2, result, model, stats) "
+            "VALUES (?, ?, ?, ?, ?)",
+            (run_id, smtlib2, result, model,
+             json.dumps(stats) if stats else None),
+        )
+        self.conn.commit()
+        return cur.lastrowid
+
+    def log_finding(self, run_id: int, category: str, message: str,
+                    severity: str = None, file: str = None,
+                    line: int = None, details: dict = None) -> int:
+        cur = self.conn.execute(
+            "INSERT INTO findings (run_id, category, severity, file, line, "
+            "message, details) VALUES (?, ?, ?, ?, ?, ?, ?)",
+            (run_id, category, severity, file, line, message,
+             json.dumps(details) if details else None),
+        )
+        self.conn.commit()
+        return cur.lastrowid
+
+    def log(self, message: str, level: str = "info", run_id: int = None):
+        """Write to stderr and to the interaction_log table."""
+        getattr(logger, level, logger.info)(message)
+        self.conn.execute(
+            "INSERT INTO interaction_log (run_id, level, message) "
+            "VALUES (?, ?, ?)",
+            (run_id, level, message),
+        )
+        self.conn.commit()
+
+    def get_runs(self, skill: str = None, last: int = 10):
+        sql = "SELECT * FROM runs"
+        params = []
+        if skill:
+            sql += " WHERE skill = ?"
+            params.append(skill)
+        sql += " ORDER BY run_id DESC LIMIT ?"
+        params.append(last)
+        return self.conn.execute(sql, params).fetchall()
+
+    def get_status(self) -> dict:
+        rows = self.conn.execute(
+            "SELECT status, COUNT(*) as cnt FROM runs GROUP BY status"
+        ).fetchall()
+        total = sum(r["cnt"] for r in rows)
+        by_status = {r["status"]: r["cnt"] for r in rows}
+        last = self.conn.execute(
+            "SELECT timestamp FROM runs ORDER BY run_id DESC LIMIT 1"
+        ).fetchone()
+        return {
+            "total": total,
+            **by_status,
+            "last_run": last["timestamp"] if last else None,
+        }
+
+    def get_logs(self, run_id: int = None, last: int = 50):
+        if run_id:
+            return self.conn.execute(
+                "SELECT * FROM interaction_log WHERE run_id=? "
+                "ORDER BY log_id DESC LIMIT ?", (run_id, last)
+            ).fetchall()
+        return self.conn.execute(
+            "SELECT * FROM interaction_log ORDER BY log_id DESC LIMIT ?",
+            (last,)
+        ).fetchall()
+
+    def query(self, sql: str):
+        return self.conn.execute(sql).fetchall()
+
+
+def find_z3(hint: str = None) -> str:
+    """Locate the z3 binary: explicit path > build dirs > PATH."""
+    candidates = []
+    if hint:
+        candidates.append(hint)
+
+    repo_root = _find_repo_root()
+    if repo_root:
+        for build_dir in ["build", "build/release", "build/debug"]:
+            candidates.append(str(repo_root / build_dir / "z3"))
+
+    path_z3 = shutil.which("z3")
+    if path_z3:
+        candidates.append(path_z3)
+
+    for c in candidates:
+        p = Path(c)
+        if p.is_file() and os.access(p, os.X_OK):
+            logger.debug("found z3: %s", p)
+            return str(p)
+
+    logger.error("z3 binary not found. Searched: %s", candidates)
+    sys.exit(1)
+
+
+def _find_repo_root() -> Optional[Path]:
+    d = Path.cwd()
+    for _ in range(10):
+        if (d / "CMakeLists.txt").exists() and (d / "src").is_dir():
+            return d
+        parent = d.parent
+        if parent == d:
+            break
+        d = parent
+    return None
+
+
+def run_z3(formula: str, z3_bin: str = None, timeout: int = 30,
+           args: list = None, debug: bool = False) -> dict:
+    """Pipe an SMT-LIB2 formula into z3 -in, return parsed output."""
+    z3_path = find_z3(z3_bin)
+    cmd = [z3_path, "-in"] + (args or [])
+
+    logger.debug("cmd: %s", " ".join(cmd))
+    logger.debug("stdin:\n%s", formula)
+
+    start = time.monotonic()
+    try:
+        proc = subprocess.run(
+            cmd, input=formula, capture_output=True, text=True,
+            timeout=timeout,
+        )
+    except subprocess.TimeoutExpired:
+        duration_ms = int((time.monotonic() - start) * 1000)
+        logger.warning("z3 timed out after %dms", duration_ms)
+        return {
+            "stdout": "", "stderr": "timeout", "exit_code": -1,
+            "duration_ms": duration_ms, "result": "timeout",
+        }
+
+    duration_ms = int((time.monotonic() - start) * 1000)
+
+    logger.debug("exit_code=%d duration=%dms", proc.returncode, duration_ms)
+    logger.debug("stdout:\n%s", proc.stdout)
+    if proc.stderr:
+        logger.debug("stderr:\n%s", proc.stderr)
+
+    first_line = proc.stdout.strip().split("\n")[0].strip() if proc.stdout else ""
+    result = first_line if first_line in ("sat", "unsat", "unknown") else "error"
+
+    return {
+        "stdout": proc.stdout,
+        "stderr": proc.stderr,
+        "exit_code": proc.returncode,
+        "duration_ms": duration_ms,
+        "result": result,
+    }
+
+
+def parse_model(stdout: str) -> Optional[dict]:
+    """Pull define-fun entries from a (get-model) response."""
+    model = {}
+    for m in re.finditer(
+        r'\(define-fun\s+(\S+)\s+\(\)\s+\S+\s+(.+?)\)', stdout
+    ):
+        model[m.group(1)] = m.group(2).strip()
+    return model if model else None
+
+
+def parse_stats(stdout: str) -> Optional[dict]:
+    """Parse :key value pairs from z3 -st output."""
+    stats = {}
+    for m in re.finditer(r':(\S+)\s+([\d.]+)', stdout):
+        key, val = m.group(1), m.group(2)
+        stats[key] = float(val) if '.' in val else int(val)
+    return stats if stats else None
+
+
+def parse_unsat_core(stdout: str) -> Optional[list]:
+    for line in stdout.strip().split("\n"):
+        line = line.strip()
+        if line.startswith("(") and not line.startswith("(error"):
+            labels = line.strip("()").split()
+            if labels:
+                return labels
+    return None
+
+
+def cli():
+    parser = argparse.ArgumentParser(
+        description="Z3 Agent database CLI",
+        prog="z3db",
+    )
+    parser.add_argument("--db", default=None, help="path to z3agent.db")
+    parser.add_argument("--debug", action="store_true", help="verbose output")
+
+    sub = parser.add_subparsers(dest="command")
+
+    sub.add_parser("init", help="initialize the database")
+
+    status_p = sub.add_parser("status", help="show run summary")
+
+    log_p = sub.add_parser("log", help="show interaction log")
+    log_p.add_argument("--run-id", type=int, help="filter by run ID")
+    log_p.add_argument("--last", type=int, default=50)
+
+    runs_p = sub.add_parser("runs", help="list runs")
+    runs_p.add_argument("--skill", help="filter by skill name")
+    runs_p.add_argument("--last", type=int, default=10)
+
+    query_p = sub.add_parser("query", help="run raw SQL")
+    query_p.add_argument("sql", help="SQL query string")
+
+    args = parser.parse_args()
+    setup_logging(args.debug)
+
+    db = Z3DB(args.db)
+
+    if args.command == "init":
+        print(f"Database initialized at {db.db_path}")
+
+    elif args.command == "status":
+        s = db.get_status()
+        print(f"Runs: {s['total']}"
+              f" | success: {s.get('success', 0)}"
+              f" | error: {s.get('error', 0)}"
+              f" | timeout: {s.get('timeout', 0)}"
+              f" | Last: {s['last_run'] or 'never'}")
+
+    elif args.command == "log":
+        for row in db.get_logs(args.run_id, args.last):
+            print(f"[{row['level']}] {row['timestamp']} "
+                  f"(run {row['run_id']}): {row['message']}")
+
+    elif args.command == "runs":
+        for row in db.get_runs(args.skill, args.last):
+            print(f"#{row['run_id']} {row['skill']} {row['status']} "
+                  f"{row['duration_ms']}ms @ {row['timestamp']}")
+
+    elif args.command == "query":
+        for row in db.query(args.sql):
+            print(dict(row))
+
+    else:
+        parser.print_help()
+
+    db.close()
+
+
+if __name__ == "__main__":
+    cli()
diff --git a/.github/skills/simplify/SKILL.md b/.github/skills/simplify/SKILL.md
new file mode 100644
index 000000000..5803b7148
--- /dev/null
+++ b/.github/skills/simplify/SKILL.md
@@ -0,0 +1,48 @@
+---
+name: simplify
+description: Reduce formula complexity using Z3 tactic chains. Supports configurable tactic pipelines for boolean, arithmetic, and bitvector simplification.
+---
+
+Given a formula, apply a sequence of Z3 tactics to produce an equivalent but simpler form. This is useful for understanding what Z3 sees after preprocessing, debugging tactic selection, and reducing formula size before solving.
+
+# Step 1: Choose tactics
+
+Z3 provides dozens of tactics. Common ones:
+
+| Tactic | What it does |
+|--------|-------------|
+| simplify | constant folding, algebraic identities |
+| propagate-values | substitute known equalities |
+| ctx-simplify | context-dependent simplification |
+| elim-uncnstr | remove unconstrained variables |
+| solve-eqs | Gaussian elimination |
+| bit-blast | reduce bitvectors to booleans |
+| tseitin-cnf | convert to CNF |
+| aig | and-inverter graph reduction |
+
+# Step 2: Run simplification
+
+```bash
+python3 scripts/simplify.py --formula "(assert (and (> x 0) (> x 0)))" --vars "x:Int"
+python3 scripts/simplify.py --file formula.smt2 --tactics "simplify,propagate-values,ctx-simplify"
+python3 scripts/simplify.py --file formula.smt2 --debug
+```
+
+Without `--tactics`, the script applies the default chain: `simplify`, `propagate-values`, `ctx-simplify`.
+
+# Step 3: Interpret the output
+
+The script prints the simplified formula in SMT-LIB2 syntax. Subgoals are printed as separate `(assert ...)` blocks.
+
+# Parameters
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| formula | string | no | | SMT-LIB2 formula to simplify |
+| vars | string | no | | variable declarations as "name:sort" pairs |
+| file | path | no | | path to .smt2 file |
+| tactics | string | no | simplify,propagate-values,ctx-simplify | comma-separated tactic names |
+| timeout | int | no | 30 | seconds |
+| z3 | path | no | auto | path to z3 binary |
+| debug | flag | no | off | verbose tracing |
+| db | path | no | .z3-agent/z3agent.db | logging database |
diff --git a/.github/skills/simplify/scripts/simplify.py b/.github/skills/simplify/scripts/simplify.py
new file mode 100644
index 000000000..9abef32fb
--- /dev/null
+++ b/.github/skills/simplify/scripts/simplify.py
@@ -0,0 +1,83 @@
+#!/usr/bin/env python3
+"""
+simplify.py: apply Z3 tactics to simplify an SMT-LIB2 formula.
+
+Usage:
+    python simplify.py --formula "(assert (and (> x 0) (> x 0)))" --vars "x:Int"
+    python simplify.py --file formula.smt2 --tactics "simplify,solve-eqs"
+"""
+
+import argparse
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent / "shared"))
+from z3db import Z3DB, run_z3, setup_logging
+
+
+DEFAULT_TACTICS = "simplify,propagate-values,ctx-simplify"
+
+
+def build_tactic_formula(base_formula: str, tactics: str) -> str:
+    tactic_list = [t.strip() for t in tactics.split(",")]
+    if len(tactic_list) == 1:
+        tactic_expr = f"(then {tactic_list[0]} skip)"
+    else:
+        tactic_expr = "(then " + " ".join(tactic_list) + ")"
+    return base_formula + f"\n(apply {tactic_expr})\n"
+
+
+def build_formula_from_parts(formula_str: str, vars_str: str) -> str:
+    lines = []
+    if vars_str:
+        for v in vars_str.split(","):
+            v = v.strip()
+            name, sort = v.split(":")
+            lines.append(f"(declare-const {name.strip()} {sort.strip()})")
+    lines.append(formula_str)
+    return "\n".join(lines)
+
+
+def main():
+    parser = argparse.ArgumentParser(prog="simplify")
+    parser.add_argument("--formula")
+    parser.add_argument("--vars")
+    parser.add_argument("--file")
+    parser.add_argument("--tactics", default=DEFAULT_TACTICS)
+    parser.add_argument("--timeout", type=int, default=30)
+    parser.add_argument("--z3", default=None)
+    parser.add_argument("--db", default=None)
+    parser.add_argument("--debug", action="store_true")
+    args = parser.parse_args()
+
+    setup_logging(args.debug)
+
+    if args.file:
+        base = Path(args.file).read_text()
+    elif args.formula:
+        base = build_formula_from_parts(args.formula, args.vars or "")
+    else:
+        parser.error("provide --formula or --file")
+        return
+
+    formula = build_tactic_formula(base, args.tactics)
+
+    db = Z3DB(args.db)
+    run_id = db.start_run("simplify", formula)
+
+    result = run_z3(formula, z3_bin=args.z3, timeout=args.timeout, debug=args.debug)
+
+    status = "success" if result["exit_code"] == 0 else "error"
+    db.log_formula(run_id, formula, status)
+    db.finish_run(run_id, status, result["duration_ms"], result["exit_code"])
+
+    print(result["stdout"])
+    if result["stderr"] and result["exit_code"] != 0:
+        print(result["stderr"], file=sys.stderr)
+
+    db.close()
+    sys.exit(0 if result["exit_code"] == 0 else 1)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/.github/skills/solve/SKILL.md b/.github/skills/solve/SKILL.md
new file mode 100644
index 000000000..a7385635b
--- /dev/null
+++ b/.github/skills/solve/SKILL.md
@@ -0,0 +1,50 @@
+---
+name: solve
+description: Check satisfiability of SMT-LIB2 formulas using Z3. Returns sat/unsat with models or unsat cores. Logs every invocation to z3agent.db for auditability.
+---
+
+Given an SMT-LIB2 formula (or a set of constraints described in natural language), determine whether the formula is satisfiable. If sat, extract a satisfying assignment. If unsat and tracking labels are present, extract the unsat core.
+
+# Step 1: Prepare the formula
+
+If the input is already valid SMT-LIB2, use it directly. If it is a natural language description, use the **encode** skill first to produce SMT-LIB2.
+
+The formula must include `(check-sat)` at the end. Append `(get-model)` for satisfiable queries or `(get-unsat-core)` when named assertions are used.
+
+# Step 2: Run Z3
+
+```bash
+python3 scripts/solve.py --formula "(declare-const x Int)(assert (> x 0))(check-sat)(get-model)"
+```
+
+For file input:
+```bash
+python3 scripts/solve.py --file problem.smt2
+```
+
+With debug tracing:
+```bash
+python3 scripts/solve.py --file problem.smt2 --debug
+```
+
+The script pipes the formula to `z3 -in` via subprocess (no shell expansion), logs the run to `.z3-agent/z3agent.db`, and prints the result.
+
+# Step 3: Interpret the output
+
+- `sat` followed by a model: the formula is satisfiable; the model assigns concrete values to each declared constant.
+- `unsat`: no assignment exists. If `(get-unsat-core)` was used, the conflicting named assertions are listed.
+- `unknown`: Z3 could not decide within the timeout. Consider increasing the timeout or simplifying the formula.
+- `timeout`: the process was killed after the deadline. Try the **simplify** skill to reduce complexity.
+
+# Parameters
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| formula | string | no | | SMT-LIB2 formula as a string |
+| file | path | no | | path to an .smt2 file |
+| timeout | int | no | 30 | seconds before killing z3 |
+| z3 | path | no | auto | explicit path to z3 binary |
+| debug | flag | no | off | print z3 command, stdin, stdout, stderr, timing |
+| db | path | no | .z3-agent/z3agent.db | path to the logging database |
+
+Either `formula` or `file` must be provided.
diff --git a/.github/skills/solve/scripts/solve.py b/.github/skills/solve/scripts/solve.py
new file mode 100644
index 000000000..b283243f2
--- /dev/null
+++ b/.github/skills/solve/scripts/solve.py
@@ -0,0 +1,66 @@
+#!/usr/bin/env python3
+"""
+solve.py: check satisfiability of an SMT-LIB2 formula via Z3.
+
+Usage:
+    python solve.py --formula "(declare-const x Int)(assert (> x 0))(check-sat)(get-model)"
+    python solve.py --file problem.smt2
+    python solve.py --file problem.smt2 --debug --timeout 60
+"""
+
+import argparse
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent / "shared"))
+from z3db import Z3DB, run_z3, parse_model, parse_unsat_core, setup_logging
+
+
+def main():
+    parser = argparse.ArgumentParser(prog="solve")
+    parser.add_argument("--formula", help="SMT-LIB2 formula string")
+    parser.add_argument("--file", help="path to .smt2 file")
+    parser.add_argument("--timeout", type=int, default=30)
+    parser.add_argument("--z3", default=None, help="path to z3 binary")
+    parser.add_argument("--db", default=None)
+    parser.add_argument("--debug", action="store_true")
+    args = parser.parse_args()
+
+    setup_logging(args.debug)
+
+    if args.file:
+        formula = Path(args.file).read_text()
+    elif args.formula:
+        formula = args.formula
+    else:
+        parser.error("provide --formula or --file")
+        return
+
+    db = Z3DB(args.db)
+    run_id = db.start_run("solve", formula)
+
+    result = run_z3(formula, z3_bin=args.z3, timeout=args.timeout, debug=args.debug)
+
+    model = parse_model(result["stdout"]) if result["result"] == "sat" else None
+    core = parse_unsat_core(result["stdout"]) if result["result"] == "unsat" else None
+
+    db.log_formula(run_id, formula, result["result"],
+                   str(model) if model else None)
+    db.finish_run(run_id, result["result"], result["duration_ms"],
+                  result["exit_code"])
+
+    print(result["result"])
+    if model:
+        for name, val in model.items():
+            print(f"  {name} = {val}")
+    if core:
+        print("unsat core:", " ".join(core))
+    if result["stderr"] and result["result"] == "error":
+        print(result["stderr"], file=sys.stderr)
+
+    db.close()
+    sys.exit(0 if result["exit_code"] == 0 else 1)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/.github/skills/static-analysis/SKILL.md b/.github/skills/static-analysis/SKILL.md
new file mode 100644
index 000000000..566999813
--- /dev/null
+++ b/.github/skills/static-analysis/SKILL.md
@@ -0,0 +1,46 @@
+---
+name: static-analysis
+description: Run Clang Static Analyzer (scan-build) on Z3 source and log structured findings to z3agent.db.
+---
+
+Run the Clang Static Analyzer over a CMake build of Z3, parse the resulting plist diagnostics, and record each finding with file, line, category, and description. This skill wraps scan-build into a reproducible, logged workflow suitable for regular analysis sweeps and regression tracking.
+
+# Step 1: Run the analysis
+
+```bash
+python3 scripts/static_analysis.py --build-dir build
+python3 scripts/static_analysis.py --build-dir build --output-dir /tmp/sa-results --debug
+python3 scripts/static_analysis.py --build-dir build --timeout 1800
+```
+
+The script invokes `scan-build cmake ..` followed by `scan-build make` inside the specified build directory. Clang checker output is written to `--output-dir` (defaults to a `scan-results` subdirectory of the build directory).
+
+# Step 2: Interpret the output
+
+Each finding is printed with its source location, category, and description:
+
+```
+[Dead store] src/ast/ast.cpp:142: Value stored to 'result' is never read
+[Null dereference] src/smt/theory_lra.cpp:87: Access to field 'next' results in a dereference of a null pointer
+```
+
+A summary table groups findings by category so that high-frequency classes are visible at a glance.
+
+# Step 3: Review historical findings
+
+All findings are logged to `z3agent.db`. Query them to track trends:
+
+```bash
+python3 ../../shared/z3db.py query "SELECT category, COUNT(*) as cnt FROM findings WHERE run_id IN (SELECT run_id FROM runs WHERE skill='static-analysis') GROUP BY category ORDER BY cnt DESC"
+python3 ../../shared/z3db.py runs --skill static-analysis --last 10
+```
+
+# Parameters
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| build-dir | path | yes | | path to the CMake build directory |
+| output-dir | path | no | BUILD/scan-results | directory for scan-build output |
+| timeout | int | no | 1200 | seconds allowed for the full build |
+| db | path | no | .z3-agent/z3agent.db | logging database |
+| debug | flag | no | off | verbose tracing |
diff --git a/.github/skills/static-analysis/scripts/static_analysis.py b/.github/skills/static-analysis/scripts/static_analysis.py
new file mode 100644
index 000000000..aa64d883d
--- /dev/null
+++ b/.github/skills/static-analysis/scripts/static_analysis.py
@@ -0,0 +1,255 @@
+#!/usr/bin/env python3
+"""
+static_analysis.py: run Clang Static Analyzer on Z3 source.
+
+Usage:
+    python static_analysis.py --build-dir build
+    python static_analysis.py --build-dir build --output-dir /tmp/sa-results
+    python static_analysis.py --build-dir build --debug
+"""
+
+import argparse
+import logging
+import os
+import plistlib
+import shutil
+import subprocess
+import sys
+import time
+from collections import Counter
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent / "shared"))
+from z3db import Z3DB, setup_logging
+
+logger = logging.getLogger("z3agent")
+
+SCAN_BUILD_NAMES = ["scan-build", "scan-build-14", "scan-build-15", "scan-build-16"]
+
+
+def find_scan_build() -> str:
+    """Locate the scan-build binary on PATH."""
+    for name in SCAN_BUILD_NAMES:
+        path = shutil.which(name)
+        if path:
+            logger.debug("found scan-build: %s", path)
+            return path
+    logger.error(
+        "scan-build not found. Install clang-tools or set PATH. "
+        "Searched: %s", ", ".join(SCAN_BUILD_NAMES)
+    )
+    sys.exit(1)
+
+
+def run_configure(scan_build: str, build_dir: Path, output_dir: Path,
+                  timeout: int) -> bool:
+    """Run scan-build cmake to configure the project."""
+    repo_root = build_dir.parent
+    cmd = [
+        scan_build,
+        "-o", str(output_dir),
+        "cmake",
+        str(repo_root),
+    ]
+    logger.info("configuring: %s", " ".join(cmd))
+    try:
+        proc = subprocess.run(
+            cmd, cwd=str(build_dir),
+            capture_output=True, text=True, timeout=timeout,
+        )
+    except subprocess.TimeoutExpired:
+        logger.error("cmake configuration timed out after %ds", timeout)
+        return False
+
+    if proc.returncode != 0:
+        logger.error("cmake configuration failed (exit %d)", proc.returncode)
+        logger.error("stderr: %s", proc.stderr[:2000])
+        return False
+
+    logger.info("configuration complete")
+    return True
+
+
+def run_build(scan_build: str, build_dir: Path, output_dir: Path,
+              timeout: int) -> bool:
+    """Run scan-build make to build and analyze."""
+    nproc = os.cpu_count() or 4
+    cmd = [
+        scan_build,
+        "-o", str(output_dir),
+        "--status-bugs",
+        "make",
+        f"-j{nproc}",
+    ]
+    logger.info("building with analysis: %s", " ".join(cmd))
+    try:
+        proc = subprocess.run(
+            cmd, cwd=str(build_dir),
+            capture_output=True, text=True, timeout=timeout,
+        )
+    except subprocess.TimeoutExpired:
+        logger.error("build timed out after %ds", timeout)
+        return False
+
+    # scan-build returns nonzero when bugs are found (due to --status-bugs),
+    # so a nonzero exit code is not necessarily a build failure.
+    if proc.returncode != 0:
+        logger.info(
+            "scan-build exited with code %d (nonzero may indicate findings)",
+            proc.returncode,
+        )
+    else:
+        logger.info("build complete, no bugs reported by scan-build")
+
+    if proc.stderr:
+        logger.debug("build stderr (last 2000 chars): %s", proc.stderr[-2000:])
+    return True
+
+
+def collect_plist_files(output_dir: Path) -> list:
+    """Recursively find all .plist diagnostic files under the output directory."""
+    plists = sorted(output_dir.rglob("*.plist"))
+    logger.debug("found %d plist files in %s", len(plists), output_dir)
+    return plists
+
+
+def parse_plist_findings(plist_path: Path) -> list:
+    """Extract findings from a single Clang plist diagnostic file.
+
+    Returns a list of dicts with keys: file, line, col, category, type, description.
+    """
+    findings = []
+    try:
+        with open(plist_path, "rb") as f:
+            data = plistlib.load(f)
+    except Exception as exc:
+        logger.warning("could not parse %s: %s", plist_path, exc)
+        return findings
+
+    source_files = data.get("files", [])
+    for diag in data.get("diagnostics", []):
+        location = diag.get("location", {})
+        file_idx = location.get("file", 0)
+        source_file = source_files[file_idx] if file_idx < len(source_files) else "<unknown>"
+        findings.append({
+            "file": source_file,
+            "line": location.get("line", 0),
+            "col": location.get("col", 0),
+            "category": diag.get("category", "uncategorized"),
+            "type": diag.get("type", ""),
+            "description": diag.get("description", ""),
+        })
+    return findings
+
+
+def collect_all_findings(output_dir: Path) -> list:
+    """Parse every plist file under output_dir and return merged findings."""
+    all_findings = []
+    for plist_path in collect_plist_files(output_dir):
+        all_findings.extend(parse_plist_findings(plist_path))
+    return all_findings
+
+
+def log_findings(db, run_id: int, findings: list):
+    """Persist each finding to z3agent.db."""
+    for f in findings:
+        db.log_finding(
+            run_id,
+            category=f["category"],
+            message=f["description"],
+            severity=f.get("type"),
+            file=f["file"],
+            line=f["line"],
+            details={"col": f["col"], "type": f["type"]},
+        )
+
+
+def print_findings(findings: list):
+    """Print individual findings and a category summary."""
+    if not findings:
+        print("No findings reported.")
+        return
+
+    for f in findings:
+        label = f["category"]
+        if f["type"]:
+            label = f["type"]
+        print(f"[{label}] {f['file']}:{f['line']}: {f['description']}")
+
+    print()
+    counts = Counter(f["category"] for f in findings)
+    print(f"Total findings: {len(findings)}")
+    print("By category:")
+    for cat, cnt in counts.most_common():
+        print(f"  {cat}: {cnt}")
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        prog="static_analysis",
+        description="Run Clang Static Analyzer on Z3 and log findings.",
+    )
+    parser.add_argument(
+        "--build-dir", required=True,
+        help="path to the CMake build directory",
+    )
+    parser.add_argument(
+        "--output-dir", default=None,
+        help="directory for scan-build results (default: BUILD/scan-results)",
+    )
+    parser.add_argument(
+        "--timeout", type=int, default=1200,
+        help="seconds allowed for the full analysis build",
+    )
+    parser.add_argument("--db", default=None, help="path to z3agent.db")
+    parser.add_argument("--debug", action="store_true", help="verbose tracing")
+    args = parser.parse_args()
+
+    setup_logging(args.debug)
+
+    scan_build = find_scan_build()
+
+    build_dir = Path(args.build_dir).resolve()
+    build_dir.mkdir(parents=True, exist_ok=True)
+
+    output_dir = Path(args.output_dir) if args.output_dir else build_dir / "scan-results"
+    output_dir = output_dir.resolve()
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    db = Z3DB(args.db)
+    run_id = db.start_run("static-analysis", f"build_dir={build_dir}")
+    start = time.monotonic()
+
+    if not run_configure(scan_build, build_dir, output_dir, timeout=args.timeout):
+        elapsed = int((time.monotonic() - start) * 1000)
+        db.finish_run(run_id, "error", elapsed, exit_code=1)
+        db.close()
+        sys.exit(1)
+
+    if not run_build(scan_build, build_dir, output_dir, timeout=args.timeout):
+        elapsed = int((time.monotonic() - start) * 1000)
+        db.finish_run(run_id, "error", elapsed, exit_code=1)
+        db.close()
+        sys.exit(1)
+
+    elapsed = int((time.monotonic() - start) * 1000)
+
+    findings = collect_all_findings(output_dir)
+    log_findings(db, run_id, findings)
+
+    status = "clean" if len(findings) == 0 else "findings"
+    db.finish_run(run_id, status, elapsed, exit_code=0)
+
+    db.log(
+        f"static analysis complete: {len(findings)} finding(s) in {elapsed}ms",
+        run_id=run_id,
+    )
+
+    print_findings(findings)
+
+    db.close()
+    sys.exit(0)
+
+
+if __name__ == "__main__":
+    main()

From 9d674404c87e0dfaf6b426d6bdbba612ae2f8191 Mon Sep 17 00:00:00 2001
From: Angelica Moreira <48168649+angelica-moreira@users.noreply.github.com>
Date: Wed, 11 Mar 2026 19:51:59 +0000
Subject: [PATCH 02/11] Add action/expectation/result structure to all skill
 definitions

Each step in every SKILL.md now carries labeled Action, Expectation,
and Result blocks so the agent can mechanically execute, verify, and
branch at each stage. Format chosen after comparing three variants
(indented blocks, inline keywords, tables) on a prove-validity
simulation; indented blocks scored highest on routing completeness
and checkability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .github/skills/benchmark/SKILL.md       | 37 ++++++++++++++--
 .github/skills/deeptest/SKILL.md        | 57 ++++++++++++++++++++++---
 .github/skills/encode/SKILL.md          | 38 ++++++++++++++---
 .github/skills/explain/SKILL.md         | 35 ++++++++++++++-
 .github/skills/memory-safety/SKILL.md   | 51 +++++++++++++++++++---
 .github/skills/optimize/SKILL.md        | 37 +++++++++++++---
 .github/skills/prove/SKILL.md           | 37 ++++++++++++++--
 .github/skills/simplify/SKILL.md        | 32 +++++++++++++-
 .github/skills/solve/SKILL.md           | 41 ++++++++++++++----
 .github/skills/static-analysis/SKILL.md | 47 +++++++++++++++++---
 10 files changed, 364 insertions(+), 48 deletions(-)

diff --git a/.github/skills/benchmark/SKILL.md b/.github/skills/benchmark/SKILL.md
index cffacde54..1d3494784 100644
--- a/.github/skills/benchmark/SKILL.md
+++ b/.github/skills/benchmark/SKILL.md
@@ -7,16 +7,37 @@ Given an SMT-LIB2 formula or file, run Z3 with statistics enabled and report per
 
 # Step 1: Run Z3 with statistics
 
+Action:
+    Invoke benchmark.py with the formula or file. Use `--runs N` for
+    repeated timing.
+
+Expectation:
+    The script invokes `z3 -st`, parses the statistics block, and prints
+    a performance summary. A run entry is logged to z3agent.db.
+
+Result:
+    Timing and statistics are displayed. Proceed to Step 2 to interpret.
+
 ```bash
 python3 scripts/benchmark.py --file problem.smt2
 python3 scripts/benchmark.py --file problem.smt2 --runs 5
 python3 scripts/benchmark.py --formula "(declare-const x Int)..." --debug
 ```
 
-The script invokes `z3 -st` and parses the `:key value` statistics block.
-
 # Step 2: Interpret the output
 
+Action:
+    Review wall-clock time, memory usage, conflict counts, and per-theory
+    breakdowns.
+
+Expectation:
+    A complete performance profile including min/median/max timing when
+    multiple runs are requested.
+
+Result:
+    If performance is acceptable, no action needed.
+    If slow, try **simplify** to reduce the formula or adjust tactic strategies.
+
 The output includes:
 
 - wall-clock time (ms)
@@ -29,7 +50,17 @@ With `--runs N`, the script runs Z3 N times and reports min/median/max timing.
 
 # Step 3: Compare over time
 
-Past benchmark runs are logged to `z3agent.db`. Query them:
+Action:
+    Query past benchmark runs from z3agent.db to detect regressions or
+    improvements.
+
+Expectation:
+    Historical run data is available for comparison, ordered by recency.
+
+Result:
+    If performance regressed, investigate recent formula or tactic changes.
+    If improved, record the successful configuration.
+
 ```bash
 python3 ../../shared/z3db.py runs --skill benchmark --last 20
 python3 ../../shared/z3db.py query "SELECT smtlib2, result, stats FROM formulas WHERE run_id IN (SELECT run_id FROM runs WHERE skill='benchmark') ORDER BY run_id DESC LIMIT 5"
diff --git a/.github/skills/deeptest/SKILL.md b/.github/skills/deeptest/SKILL.md
index ead3f5b84..2e2cd747e 100644
--- a/.github/skills/deeptest/SKILL.md
+++ b/.github/skills/deeptest/SKILL.md
@@ -7,6 +7,20 @@ Given a strategy and count, generate SMT-LIB2 formulas targeting Z3 internals an
 
 # Step 1: Choose a strategy and run
 
+Action:
+    Select a generation strategy and invoke the script with the desired
+    count and seed.
+
+Expectation:
+    The script generates SMT-LIB2 formulas according to the chosen
+    strategy, runs each through Z3, and records results to z3agent.db.
+
+Result:
+    On completion: a summary is printed with formula count, anomaly count,
+    and elapsed time. Proceed to Step 2.
+    On early exit: verify the Z3 binary is accessible and review timeout
+    settings.
+
 ```bash
 python3 scripts/deeptest.py --strategy random --count 100 --seed 42
 python3 scripts/deeptest.py --strategy metamorphic --seed-file base.smt2 --count 50
@@ -23,7 +37,19 @@ Available strategies:
 
 # Step 2: Interpret the output
 
-The script prints a summary after completion:
+Action:
+    Review the summary printed after the run completes.
+
+Expectation:
+    The summary shows strategy, seed, formula count, anomaly count, and
+    elapsed time.
+
+Result:
+    On zero anomalies: Z3 handled all generated formulas without issue.
+    On nonzero anomalies: crashes, assertion failures, solver errors, or
+    result disagreements were detected. Proceed to Step 3 for details.
+
+Example summary:
 
 ```
 strategy:  random
@@ -33,21 +59,38 @@ anomalies: 2
 elapsed:   4500ms
 ```
 
-A nonzero anomaly count means the run detected crashes (nonzero exit code), assertion failures in stderr, solver errors, or result disagreements between a base formula and its metamorphic variants.
-
 # Step 3: Inspect findings
 
-Findings are logged to `z3agent.db` with category, severity, and details:
+Action:
+    Query z3agent.db for detailed finding records from the run.
+
+Expectation:
+    Each finding includes category, severity, message, formula index, exit
+    code, and a stderr excerpt.
+
+Result:
+    Use findings to identify reproducible failure patterns and prioritize
+    fixes by severity. If a finding appears nondeterministic, proceed to
+    Step 4 with the same seed to confirm.
 
 ```bash
 python3 ../../shared/z3db.py query "SELECT category, severity, message FROM findings WHERE run_id IN (SELECT run_id FROM runs WHERE skill='deeptest') ORDER BY finding_id DESC LIMIT 20"
 ```
 
-Each finding includes the formula index, exit code, and a stderr excerpt for triage.
-
 # Step 4: Reproduce
 
-Use the `--seed` parameter to reproduce a run exactly:
+Action:
+    Re-run the script with the same seed to reproduce the exact sequence
+    of generated formulas.
+
+Expectation:
+    Identical formulas are generated, producing the same anomalies if the
+    underlying bug persists.
+
+Result:
+    On same anomalies: bug confirmed and suitable for a regression test.
+    On zero anomalies: the issue may be nondeterministic or already fixed;
+    investigate further before closing.
 
 ```bash
 python3 scripts/deeptest.py --strategy random --count 100 --seed 42
diff --git a/.github/skills/encode/SKILL.md b/.github/skills/encode/SKILL.md
index eef343bef..32d5dd1da 100644
--- a/.github/skills/encode/SKILL.md
+++ b/.github/skills/encode/SKILL.md
@@ -7,7 +7,17 @@ Given a problem description (natural language, pseudocode, or a partial formulat
 
 # Step 1: Identify the problem class
 
-Common encodings:
+Action:
+    Determine the SMT theory and variable sorts required by the problem
+    description.
+
+Expectation:
+    A clear mapping from the problem to one of the supported theories
+    (LIA, LRA, QF_BV, etc.).
+
+Result:
+    If the theory is identified, proceed to Step 2. If the problem spans
+    multiple theories, select the appropriate combined logic.
 
 | Problem class | Theory | Typical sorts |
 |---------------|--------|---------------|
@@ -21,17 +31,35 @@ Common encodings:
 
 # Step 2: Generate the encoding
 
+Action:
+    Invoke encode.py with the problem description and desired output format.
+
+Expectation:
+    The script produces a complete SMT-LIB2 file or Z3 Python script with
+    all declarations, constraints, and check-sat commands.
+
+Result:
+    For `smtlib2` format: pass the output to **solve**.
+    For `python` format: execute the script directly.
+    Proceed to Step 3 for validation.
+
 ```bash
 python3 scripts/encode.py --problem "Find integers x, y such that x^2 + y^2 = 25 and x > 0" --format smtlib2
 python3 scripts/encode.py --problem "Schedule 4 tasks on 2 machines minimizing makespan" --format python
 ```
 
-For `--format smtlib2`, the output is a complete .smt2 file ready for the **solve** skill.
-For `--format python`, the output is a standalone Z3 Python script.
-
 # Step 3: Validate the encoding
 
-The script checks that the generated formula is syntactically valid by running a quick `z3 -in` parse check (no solving, just syntax). Parse errors are reported with the offending line.
+Action:
+    The script runs a syntax check by piping the output through `z3 -in`
+    in parse-only mode.
+
+Expectation:
+    No parse errors. If errors occur, the offending line is reported.
+
+Result:
+    On success: the encoding is ready for **solve**, **prove**, or **optimize**.
+    On parse error: fix the reported line and re-run.
 
 # Parameters
 
diff --git a/.github/skills/explain/SKILL.md b/.github/skills/explain/SKILL.md
index 515b51378..b7ffe25af 100644
--- a/.github/skills/explain/SKILL.md
+++ b/.github/skills/explain/SKILL.md
@@ -7,6 +7,17 @@ Given raw Z3 output (from the **solve**, **prove**, **optimize**, or **benchmark
 
 # Step 1: Identify the output type
 
+Action:
+    Determine the category of Z3 output to explain: model, core,
+    statistics, error, or proof.
+
+Expectation:
+    The output type maps to one of the recognized formats in the table below.
+
+Result:
+    If the type is ambiguous, use `--type auto` and let the script detect it.
+    Proceed to Step 2.
+
 | Output contains | Explanation type |
 |----------------|-----------------|
 | `(define-fun ...)` blocks | model explanation |
@@ -17,16 +28,36 @@ Given raw Z3 output (from the **solve**, **prove**, **optimize**, or **benchmark
 
 # Step 2: Run the explainer
 
+Action:
+    Invoke explain.py with the output file or stdin.
+
+Expectation:
+    The script auto-detects the output type and produces a structured
+    plain-language summary.
+
+Result:
+    A formatted explanation is printed. If detection fails, re-run with
+    an explicit `--type` flag.
+
 ```bash
 python3 scripts/explain.py --file output.txt
 python3 scripts/explain.py --stdin < output.txt
 python3 scripts/explain.py --file output.txt --debug
 ```
 
-The script auto-detects the output type and produces a structured summary.
-
 # Step 3: Interpret the explanation
 
+Action:
+    Review the structured explanation for accuracy and completeness.
+
+Expectation:
+    Models list each variable with its value and sort. Cores list
+    conflicting assertions. Statistics show time and memory breakdowns.
+
+Result:
+    Use the explanation to answer the user query or to guide the next
+    skill invocation.
+
 For models:
 - Each variable is listed with its value and sort
 - Array and function interpretations are expanded
diff --git a/.github/skills/memory-safety/SKILL.md b/.github/skills/memory-safety/SKILL.md
index 75a7861c2..8e2eee686 100644
--- a/.github/skills/memory-safety/SKILL.md
+++ b/.github/skills/memory-safety/SKILL.md
@@ -7,7 +7,22 @@ Build Z3 with compiler-based sanitizer instrumentation, execute the test suite,
 
 # Step 1: Configure and build
 
-The script invokes cmake with the appropriate `-fsanitize` flags and builds the `test-z3` target. Each sanitizer uses a separate build directory to avoid flag conflicts. If a prior instrumented build exists with matching flags, only incremental compilation runs.
+Action:
+    Invoke the script with the desired sanitizer flag. The script calls
+    cmake with the appropriate `-fsanitize` flags and builds the `test-z3`
+    target. Each sanitizer uses a separate build directory to avoid flag
+    conflicts.
+
+Expectation:
+    cmake configures successfully and make compiles the instrumented binary.
+    If a prior build exists with matching flags, only incremental compilation
+    runs.
+
+Result:
+    On success: an instrumented `test-z3` binary is ready in the build
+    directory. Proceed to Step 2.
+    On failure: verify compiler support for the requested sanitizer flags
+    and review cmake output.
 
 ```bash
 python3 scripts/memory_safety.py --sanitizer asan
@@ -22,7 +37,21 @@ python3 scripts/memory_safety.py --sanitizer asan --skip-build --build-dir build
 
 # Step 2: Run and collect
 
-The test binary runs with `halt_on_error=0` so the sanitizer reports all violations rather than aborting on the first. The script parses `ERROR: AddressSanitizer`, `runtime error:`, and `ERROR: LeakSanitizer` patterns from the combined output, extracts source locations where available, and deduplicates by category, file, and line.
+Action:
+    Execute the instrumented test binary with halt_on_error=0 so all
+    violations are reported rather than aborting on the first.
+
+Expectation:
+    The script parses AddressSanitizer, UndefinedBehaviorSanitizer, and
+    LeakSanitizer patterns from combined output, extracts source locations,
+    and deduplicates by category/file/line.
+
+Result:
+    On `clean`: no violations detected.
+    On `findings`: one or more violations found, each printed with severity,
+    category, message, and source location.
+    On `timeout`: test suite did not finish; increase timeout or investigate.
+    On `error`: build or execution failed before sanitizer output.
 
 ```bash
 python3 scripts/memory_safety.py --sanitizer asan --timeout 900 --debug
@@ -30,10 +59,20 @@ python3 scripts/memory_safety.py --sanitizer asan --timeout 900 --debug
 
 # Step 3: Interpret results
 
-- `clean`: no sanitizer violations detected.
-- `findings`: one or more violations found. Each is printed with severity, category, message, and source location.
-- `timeout`: the test suite did not complete within the deadline. Increase the timeout or investigate a possible infinite loop.
-- `error`: build or execution failed before sanitizer output could be collected.
+Action:
+    Review printed findings and query z3agent.db for historical comparison.
+
+Expectation:
+    Each finding includes severity, category, message, and source location.
+    The database query returns prior runs for trend analysis.
+
+Result:
+    On `clean`: no action required; proceed.
+    On `findings`: triage by severity and category. Compare against prior
+    runs to distinguish new regressions from known issues.
+    On `timeout`: increase the deadline or investigate a possible infinite
+    loop.
+    On `error`: inspect build logs before re-running.
 
 Query past runs:
 ```bash
diff --git a/.github/skills/optimize/SKILL.md b/.github/skills/optimize/SKILL.md
index fc93a7e2c..0414e7e9f 100644
--- a/.github/skills/optimize/SKILL.md
+++ b/.github/skills/optimize/SKILL.md
@@ -7,7 +7,17 @@ Given a set of constraints and an objective function, find the optimal value. Z3
 
 # Step 1: Formulate the problem
 
-The formula uses the `(minimize ...)` or `(maximize ...)` directives followed by `(check-sat)` and `(get-model)`.
+Action:
+    Write constraints and an objective using `(minimize ...)` or
+    `(maximize ...)` directives, followed by `(check-sat)` and `(get-model)`.
+
+Expectation:
+    A valid SMT-LIB2 formula with at least one optimization directive and
+    all variables declared.
+
+Result:
+    If the formula is well-formed, proceed to Step 2. For multi-objective
+    problems, list directives in priority order for lexicographic optimization.
 
 Example: minimize `x + y` subject to `x >= 1`, `y >= 2`, `x + y <= 10`:
 ```smtlib
@@ -23,6 +33,17 @@ Example: minimize `x + y` subject to `x >= 1`, `y >= 2`, `x + y <= 10`:
 
 # Step 2: Run the optimizer
 
+Action:
+    Invoke optimize.py with the formula or file path.
+
+Expectation:
+    The script prints `sat` with the optimal assignment, `unsat`, `unknown`,
+    or `timeout`. A run entry is logged to z3agent.db.
+
+Result:
+    On `sat`: proceed to Step 3 to read the optimal values.
+    On `unsat` or `timeout`: check constraints for contradictions or simplify.
+
 ```bash
 python3 scripts/optimize.py --file scheduling.smt2
 python3 scripts/optimize.py --formula "<inline smt-lib2>" --debug
@@ -30,11 +51,17 @@ python3 scripts/optimize.py --formula "<inline smt-lib2>" --debug
 
 # Step 3: Interpret the output
 
-- `sat` with a model: the optimal assignment satisfying all constraints.
-- `unsat`: the constraints are contradictory; no feasible solution exists.
-- `unknown` or `timeout`: Z3 could not determine optimality.
+Action:
+    Parse the objective value and satisfying assignment from the output.
 
-The script prints the objective value and the satisfying assignment.
+Expectation:
+    `sat` with a model containing the optimal value, `unsat` indicating
+    infeasibility, or `unknown`/`timeout`.
+
+Result:
+    On `sat`: report the optimal value and assignment.
+    On `unsat`: the constraints are contradictory, no feasible solution exists.
+    On `unknown`/`timeout`: relax constraints or try **simplify**.
 
 # Parameters
 
diff --git a/.github/skills/prove/SKILL.md b/.github/skills/prove/SKILL.md
index a67d57758..60ddc8ea6 100644
--- a/.github/skills/prove/SKILL.md
+++ b/.github/skills/prove/SKILL.md
@@ -7,7 +7,17 @@ Given a conjecture (an SMT-LIB2 assertion or a natural language claim), determin
 
 # Step 1: Prepare the negated formula
 
-Wrap the conjecture in `(assert (not ...))` and append `(check-sat)(get-model)`.
+Action:
+    Wrap the conjecture in `(assert (not ...))` and append
+    `(check-sat)(get-model)`.
+
+Expectation:
+    A complete SMT-LIB2 formula that negates the original conjecture with
+    all variables declared.
+
+Result:
+    If the negation is well-formed, proceed to Step 2.
+    If the conjecture is natural language, run **encode** first.
 
 Example: to prove that `(> x 3)` implies `(> x 1)`:
 ```smtlib
@@ -19,6 +29,18 @@ Example: to prove that `(> x 3)` implies `(> x 1)`:
 
 # Step 2: Run the prover
 
+Action:
+    Invoke prove.py with the conjecture and variable declarations.
+
+Expectation:
+    The script prints `valid`, `invalid` (with counterexample), `unknown`,
+    or `timeout`. A run entry is logged to z3agent.db.
+
+Result:
+    On `valid`: proceed to **explain** if the user needs a summary.
+    On `invalid`: report the counterexample directly.
+    On `unknown`/`timeout`: try **simplify** first, or increase the timeout.
+
 ```bash
 python3 scripts/prove.py --conjecture "(=> (> x 3) (> x 1))" --vars "x:Int"
 ```
@@ -35,9 +57,16 @@ python3 scripts/prove.py --conjecture "(=> (> x 3) (> x 1))" --vars "x:Int" --de
 
 # Step 3: Interpret the output
 
-- `valid`: the negation was unsat, so the conjecture holds for all inputs.
-- `invalid` followed by a counterexample: the negation was sat; the model shows a concrete assignment where the conjecture fails.
-- `unknown` or `timeout`: Z3 could not decide. The conjecture may require auxiliary lemmas or induction.
+Action:
+    Read the prover output to determine validity of the conjecture.
+
+Expectation:
+    One of `valid`, `invalid` (with counterexample), `unknown`, or `timeout`.
+
+Result:
+    On `valid`: the conjecture holds universally.
+    On `invalid`: the model shows a concrete counterexample.
+    On `unknown`/`timeout`: the conjecture may require auxiliary lemmas or induction.
 
 # Parameters
 
diff --git a/.github/skills/simplify/SKILL.md b/.github/skills/simplify/SKILL.md
index 5803b7148..bab3ad5b6 100644
--- a/.github/skills/simplify/SKILL.md
+++ b/.github/skills/simplify/SKILL.md
@@ -7,7 +7,16 @@ Given a formula, apply a sequence of Z3 tactics to produce an equivalent but sim
 
 # Step 1: Choose tactics
 
-Z3 provides dozens of tactics. Common ones:
+Action:
+    Select a tactic chain from the available Z3 tactics based on the
+    formula's theory.
+
+Expectation:
+    A comma-separated list of tactic names suitable for the formula domain.
+
+Result:
+    If unsure, use the default chain: `simplify,propagate-values,ctx-simplify`.
+    For bitvector formulas, add `bit-blast`. Proceed to Step 2.
 
 | Tactic | What it does |
 |--------|-------------|
@@ -22,6 +31,17 @@ Z3 provides dozens of tactics. Common ones:
 
 # Step 2: Run simplification
 
+Action:
+    Invoke simplify.py with the formula and optional tactic chain.
+
+Expectation:
+    The script applies each tactic in sequence and prints the simplified
+    formula. A run entry is logged to z3agent.db.
+
+Result:
+    If the output is simpler, pass it to **solve** or **prove**.
+    If unchanged, try a different tactic chain.
+
 ```bash
 python3 scripts/simplify.py --formula "(assert (and (> x 0) (> x 0)))" --vars "x:Int"
 python3 scripts/simplify.py --file formula.smt2 --tactics "simplify,propagate-values,ctx-simplify"
@@ -32,7 +52,15 @@ Without `--tactics`, the script applies the default chain: `simplify`, `propagat
 
 # Step 3: Interpret the output
 
-The script prints the simplified formula in SMT-LIB2 syntax. Subgoals are printed as separate `(assert ...)` blocks.
+Action:
+    Read the simplified formula output in SMT-LIB2 syntax.
+
+Expectation:
+    One or more `(assert ...)` blocks representing equivalent subgoals.
+
+Result:
+    A smaller formula indicates successful reduction. Pass the result to
+    **solve**, **prove**, or **optimize** as needed.
 
 # Parameters
 
diff --git a/.github/skills/solve/SKILL.md b/.github/skills/solve/SKILL.md
index a7385635b..81293fc35 100644
--- a/.github/skills/solve/SKILL.md
+++ b/.github/skills/solve/SKILL.md
@@ -7,12 +7,31 @@ Given an SMT-LIB2 formula (or a set of constraints described in natural language
 
 # Step 1: Prepare the formula
 
-If the input is already valid SMT-LIB2, use it directly. If it is a natural language description, use the **encode** skill first to produce SMT-LIB2.
+Action:
+    Convert the input to valid SMT-LIB2. If the input is natural language,
+    use the **encode** skill first.
 
-The formula must include `(check-sat)` at the end. Append `(get-model)` for satisfiable queries or `(get-unsat-core)` when named assertions are used.
+Expectation:
+    A syntactically valid SMT-LIB2 formula ending with `(check-sat)` and
+    either `(get-model)` or `(get-unsat-core)` as appropriate.
+
+Result:
+    If valid SMT-LIB2 is ready, proceed to Step 2.
+    If encoding is needed, run **encode** first and return here.
 
 # Step 2: Run Z3
 
+Action:
+    Invoke solve.py with the formula string or file path.
+
+Expectation:
+    The script pipes the formula to `z3 -in`, logs the run to
+    `.z3-agent/z3agent.db`, and prints the result.
+
+Result:
+    Output is one of `sat`, `unsat`, `unknown`, or `timeout`.
+    Proceed to Step 3 to interpret.
+
 ```bash
 python3 scripts/solve.py --formula "(declare-const x Int)(assert (> x 0))(check-sat)(get-model)"
 ```
@@ -27,14 +46,20 @@ With debug tracing:
 python3 scripts/solve.py --file problem.smt2 --debug
 ```
 
-The script pipes the formula to `z3 -in` via subprocess (no shell expansion), logs the run to `.z3-agent/z3agent.db`, and prints the result.
-
 # Step 3: Interpret the output
 
-- `sat` followed by a model: the formula is satisfiable; the model assigns concrete values to each declared constant.
-- `unsat`: no assignment exists. If `(get-unsat-core)` was used, the conflicting named assertions are listed.
-- `unknown`: Z3 could not decide within the timeout. Consider increasing the timeout or simplifying the formula.
-- `timeout`: the process was killed after the deadline. Try the **simplify** skill to reduce complexity.
+Action:
+    Parse the Z3 output to determine satisfiability and extract any model
+    or unsat core.
+
+Expectation:
+    `sat` with a model, `unsat` optionally with a core, `unknown`, or
+    `timeout`.
+
+Result:
+    On `sat`: report the model to the user.
+    On `unsat`: report the core if available.
+    On `unknown`/`timeout`: try **simplify** or increase the timeout.
 
 # Parameters
 
diff --git a/.github/skills/static-analysis/SKILL.md b/.github/skills/static-analysis/SKILL.md
index 566999813..0fd9018da 100644
--- a/.github/skills/static-analysis/SKILL.md
+++ b/.github/skills/static-analysis/SKILL.md
@@ -7,28 +7,63 @@ Run the Clang Static Analyzer over a CMake build of Z3, parse the resulting plis
 
 # Step 1: Run the analysis
 
+Action:
+    Invoke the script pointing at the CMake build directory. The script
+    runs `scan-build cmake ..` followed by `scan-build make` and writes
+    checker output to the output directory.
+
+Expectation:
+    scan-build completes within the timeout, producing plist diagnostic
+    files in the output directory (defaults to a `scan-results` subdirectory
+    of the build directory).
+
+Result:
+    On success: diagnostics are parsed and findings are printed. Proceed to
+    Step 2.
+    On failure: verify that clang and scan-build are installed and that the
+    build directory contains a valid CMake configuration.
+
 ```bash
 python3 scripts/static_analysis.py --build-dir build
 python3 scripts/static_analysis.py --build-dir build --output-dir /tmp/sa-results --debug
 python3 scripts/static_analysis.py --build-dir build --timeout 1800
 ```
 
-The script invokes `scan-build cmake ..` followed by `scan-build make` inside the specified build directory. Clang checker output is written to `--output-dir` (defaults to a `scan-results` subdirectory of the build directory).
-
 # Step 2: Interpret the output
 
-Each finding is printed with its source location, category, and description:
+Action:
+    Review the printed findings and the summary table grouped by category.
+
+Expectation:
+    Each finding shows its source location, category, and description.
+    The summary table ranks categories by frequency for quick triage.
+
+Result:
+    On zero findings: the codebase passes all enabled static checks.
+    On findings: prioritize by category frequency and severity. Address
+    null dereferences and use-after-free classes first.
+
+Example output:
 
 ```
 [Dead store] src/ast/ast.cpp:142: Value stored to 'result' is never read
 [Null dereference] src/smt/theory_lra.cpp:87: Access to field 'next' results in a dereference of a null pointer
 ```
 
-A summary table groups findings by category so that high-frequency classes are visible at a glance.
-
 # Step 3: Review historical findings
 
-All findings are logged to `z3agent.db`. Query them to track trends:
+Action:
+    Query z3agent.db to compare current results against prior analysis
+    runs.
+
+Expectation:
+    Queries return category counts and run history, enabling regression
+    detection across commits.
+
+Result:
+    On stable or decreasing counts: no regressions introduced.
+    On increased counts: cross-reference new findings with recent commits
+    to identify the responsible change.
 
 ```bash
 python3 ../../shared/z3db.py query "SELECT category, COUNT(*) as cnt FROM findings WHERE run_id IN (SELECT run_id FROM runs WHERE skill='static-analysis') GROUP BY category ORDER BY cnt DESC"

From 621638abb96a472ac112cb972ff5f593371303c1 Mon Sep 17 00:00:00 2001
From: Angelica Moreira <48168649+angelica-moreira@users.noreply.github.com>
Date: Wed, 11 Mar 2026 21:53:10 +0000
Subject: [PATCH 03/11] run black and ruff on all skill scripts

---
 .github/skills/benchmark/scripts/benchmark.py |  14 ++-
 .github/skills/encode/scripts/encode.py       |  15 ++-
 .github/skills/explain/scripts/explain.py     |   7 +-
 .github/skills/optimize/scripts/optimize.py   |   6 +-
 .github/skills/shared/z3db.py                 | 114 ++++++++++++------
 .github/skills/simplify/scripts/simplify.py   |   1 -
 .github/skills/solve/scripts/solve.py         |   6 +-
 7 files changed, 103 insertions(+), 60 deletions(-)

diff --git a/.github/skills/benchmark/scripts/benchmark.py b/.github/skills/benchmark/scripts/benchmark.py
index 1e23abe1f..152f0f4a8 100644
--- a/.github/skills/benchmark/scripts/benchmark.py
+++ b/.github/skills/benchmark/scripts/benchmark.py
@@ -42,13 +42,19 @@ def main():
 
     for i in range(args.runs):
         run_id = db.start_run("benchmark", formula)
-        result = run_z3(formula, z3_bin=args.z3, timeout=args.timeout,
-                        args=["-st"], debug=args.debug)
+        result = run_z3(
+            formula,
+            z3_bin=args.z3,
+            timeout=args.timeout,
+            args=["-st"],
+            debug=args.debug,
+        )
 
         stats = parse_stats(result["stdout"])
         db.log_formula(run_id, formula, result["result"], stats=stats)
-        db.finish_run(run_id, result["result"], result["duration_ms"],
-                      result["exit_code"])
+        db.finish_run(
+            run_id, result["result"], result["duration_ms"], result["exit_code"]
+        )
         timings.append(result["duration_ms"])
 
         if args.runs == 1:
diff --git a/.github/skills/encode/scripts/encode.py b/.github/skills/encode/scripts/encode.py
index 67f3ea87d..87aaf6c04 100644
--- a/.github/skills/encode/scripts/encode.py
+++ b/.github/skills/encode/scripts/encode.py
@@ -15,7 +15,6 @@ from pathlib import Path
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent / "shared"))
 from z3db import Z3DB, run_z3, setup_logging
 
-
 VALIDATION_TIMEOUT = 5
 
 
@@ -41,7 +40,10 @@ def validate(formula: str, z3_bin: str = None, debug: bool = False) -> dict:
     Returns a dict with 'valid' (bool), 'errors' (list), and 'raw' output.
     """
     result = run_z3(
-        formula, z3_bin=z3_bin, timeout=VALIDATION_TIMEOUT, debug=debug,
+        formula,
+        z3_bin=z3_bin,
+        timeout=VALIDATION_TIMEOUT,
+        debug=debug,
     )
     errors = find_errors(result["stdout"]) + find_errors(result["stderr"])
 
@@ -72,8 +74,10 @@ def report_errors(errors: list, formula: str):
 def write_output(formula: str, output_path: str, fmt: str):
     """Write the validated formula to a file or stdout."""
     if fmt == "python":
-        print("python format output is generated by the agent, "
-              "not by this script", file=sys.stderr)
+        print(
+            "python format output is generated by the agent, " "not by this script",
+            file=sys.stderr,
+        )
         sys.exit(1)
 
     if output_path:
@@ -131,7 +135,8 @@ def main():
         for err in result["errors"]:
             db.log_finding(run_id, "syntax", err, severity="error")
         db.finish_run(
-            run_id, "error",
+            run_id,
+            "error",
             result["raw"]["duration_ms"],
             result["raw"]["exit_code"],
         )
diff --git a/.github/skills/explain/scripts/explain.py b/.github/skills/explain/scripts/explain.py
index d2704085a..d54bbc255 100644
--- a/.github/skills/explain/scripts/explain.py
+++ b/.github/skills/explain/scripts/explain.py
@@ -21,7 +21,7 @@ def detect_type(text: str) -> str:
         return "model"
     if "(error" in text:
         return "error"
-    if re.search(r':\S+\s+[\d.]+', text):
+    if re.search(r":\S+\s+[\d.]+", text):
         return "stats"
     first = text.strip().split("\n")[0].strip()
     if first == "unsat":
@@ -86,8 +86,9 @@ def main():
     parser = argparse.ArgumentParser(prog="explain")
     parser.add_argument("--file")
     parser.add_argument("--stdin", action="store_true")
-    parser.add_argument("--type", choices=["model", "core", "stats", "error", "auto"],
-                        default="auto")
+    parser.add_argument(
+        "--type", choices=["model", "core", "stats", "error", "auto"], default="auto"
+    )
     parser.add_argument("--db", default=None)
     parser.add_argument("--debug", action="store_true")
     args = parser.parse_args()
diff --git a/.github/skills/optimize/scripts/optimize.py b/.github/skills/optimize/scripts/optimize.py
index 8c7462ccb..bd9c46668 100644
--- a/.github/skills/optimize/scripts/optimize.py
+++ b/.github/skills/optimize/scripts/optimize.py
@@ -42,10 +42,8 @@ def main():
 
     model = parse_model(result["stdout"]) if result["result"] == "sat" else None
 
-    db.log_formula(run_id, formula, result["result"],
-                   str(model) if model else None)
-    db.finish_run(run_id, result["result"], result["duration_ms"],
-                  result["exit_code"])
+    db.log_formula(run_id, formula, result["result"], str(model) if model else None)
+    db.finish_run(run_id, result["result"], result["duration_ms"], result["exit_code"])
 
     print(result["result"])
     if model:
diff --git a/.github/skills/shared/z3db.py b/.github/skills/shared/z3db.py
index ca959073d..f0f7e3ea2 100644
--- a/.github/skills/shared/z3db.py
+++ b/.github/skills/shared/z3db.py
@@ -27,7 +27,6 @@ import time
 from pathlib import Path
 from typing import Optional
 
-
 SCHEMA_PATH = Path(__file__).parent / "schema.sql"
 DEFAULT_DB_DIR = ".z3-agent"
 DEFAULT_DB_NAME = "z3agent.db"
@@ -37,8 +36,11 @@ logger = logging.getLogger("z3agent")
 
 def setup_logging(debug: bool = False):
     level = logging.DEBUG if debug else logging.INFO
-    fmt = "[%(levelname)s] %(message)s" if not debug else \
-          "[%(levelname)s %(asctime)s] %(message)s"
+    fmt = (
+        "[%(levelname)s] %(message)s"
+        if not debug
+        else "[%(levelname)s %(asctime)s] %(message)s"
+    )
     logging.basicConfig(level=level, format=fmt, stream=sys.stderr)
 
 
@@ -73,8 +75,9 @@ class Z3DB:
         logger.debug("started run %d (skill=%s, hash=%s)", run_id, skill, input_hash)
         return run_id
 
-    def finish_run(self, run_id: int, status: str, duration_ms: int,
-                   exit_code: int = 0):
+    def finish_run(
+        self, run_id: int, status: str, duration_ms: int, exit_code: int = 0
+    ):
         self.conn.execute(
             "UPDATE runs SET status=?, duration_ms=?, exit_code=? WHERE run_id=?",
             (status, duration_ms, exit_code, run_id),
@@ -82,25 +85,44 @@ class Z3DB:
         self.conn.commit()
         logger.debug("finished run %d: %s (%dms)", run_id, status, duration_ms)
 
-    def log_formula(self, run_id: int, smtlib2: str, result: str = None,
-                    model: str = None, stats: dict = None) -> int:
+    def log_formula(
+        self,
+        run_id: int,
+        smtlib2: str,
+        result: str = None,
+        model: str = None,
+        stats: dict = None,
+    ) -> int:
         cur = self.conn.execute(
             "INSERT INTO formulas (run_id, smtlib2, result, model, stats) "
             "VALUES (?, ?, ?, ?, ?)",
-            (run_id, smtlib2, result, model,
-             json.dumps(stats) if stats else None),
+            (run_id, smtlib2, result, model, json.dumps(stats) if stats else None),
         )
         self.conn.commit()
         return cur.lastrowid
 
-    def log_finding(self, run_id: int, category: str, message: str,
-                    severity: str = None, file: str = None,
-                    line: int = None, details: dict = None) -> int:
+    def log_finding(
+        self,
+        run_id: int,
+        category: str,
+        message: str,
+        severity: str = None,
+        file: str = None,
+        line: int = None,
+        details: dict = None,
+    ) -> int:
         cur = self.conn.execute(
             "INSERT INTO findings (run_id, category, severity, file, line, "
             "message, details) VALUES (?, ?, ?, ?, ?, ?, ?)",
-            (run_id, category, severity, file, line, message,
-             json.dumps(details) if details else None),
+            (
+                run_id,
+                category,
+                severity,
+                file,
+                line,
+                message,
+                json.dumps(details) if details else None,
+            ),
         )
         self.conn.commit()
         return cur.lastrowid
@@ -109,8 +131,7 @@ class Z3DB:
         """Write to stderr and to the interaction_log table."""
         getattr(logger, level, logger.info)(message)
         self.conn.execute(
-            "INSERT INTO interaction_log (run_id, level, message) "
-            "VALUES (?, ?, ?)",
+            "INSERT INTO interaction_log (run_id, level, message) " "VALUES (?, ?, ?)",
             (run_id, level, message),
         )
         self.conn.commit()
@@ -144,11 +165,11 @@ class Z3DB:
         if run_id:
             return self.conn.execute(
                 "SELECT * FROM interaction_log WHERE run_id=? "
-                "ORDER BY log_id DESC LIMIT ?", (run_id, last)
+                "ORDER BY log_id DESC LIMIT ?",
+                (run_id, last),
             ).fetchall()
         return self.conn.execute(
-            "SELECT * FROM interaction_log ORDER BY log_id DESC LIMIT ?",
-            (last,)
+            "SELECT * FROM interaction_log ORDER BY log_id DESC LIMIT ?", (last,)
         ).fetchall()
 
     def query(self, sql: str):
@@ -192,8 +213,13 @@ def _find_repo_root() -> Optional[Path]:
     return None
 
 
-def run_z3(formula: str, z3_bin: str = None, timeout: int = 30,
-           args: list = None, debug: bool = False) -> dict:
+def run_z3(
+    formula: str,
+    z3_bin: str = None,
+    timeout: int = 30,
+    args: list = None,
+    debug: bool = False,
+) -> dict:
     """Pipe an SMT-LIB2 formula into z3 -in, return parsed output."""
     z3_path = find_z3(z3_bin)
     cmd = [z3_path, "-in"] + (args or [])
@@ -204,15 +230,21 @@ def run_z3(formula: str, z3_bin: str = None, timeout: int = 30,
     start = time.monotonic()
     try:
         proc = subprocess.run(
-            cmd, input=formula, capture_output=True, text=True,
+            cmd,
+            input=formula,
+            capture_output=True,
+            text=True,
             timeout=timeout,
         )
     except subprocess.TimeoutExpired:
         duration_ms = int((time.monotonic() - start) * 1000)
         logger.warning("z3 timed out after %dms", duration_ms)
         return {
-            "stdout": "", "stderr": "timeout", "exit_code": -1,
-            "duration_ms": duration_ms, "result": "timeout",
+            "stdout": "",
+            "stderr": "timeout",
+            "exit_code": -1,
+            "duration_ms": duration_ms,
+            "result": "timeout",
         }
 
     duration_ms = int((time.monotonic() - start) * 1000)
@@ -237,9 +269,7 @@ def run_z3(formula: str, z3_bin: str = None, timeout: int = 30,
 def parse_model(stdout: str) -> Optional[dict]:
     """Pull define-fun entries from a (get-model) response."""
     model = {}
-    for m in re.finditer(
-        r'\(define-fun\s+(\S+)\s+\(\)\s+\S+\s+(.+?)\)', stdout
-    ):
+    for m in re.finditer(r"\(define-fun\s+(\S+)\s+\(\)\s+\S+\s+(.+?)\)", stdout):
         model[m.group(1)] = m.group(2).strip()
     return model if model else None
 
@@ -247,9 +277,9 @@ def parse_model(stdout: str) -> Optional[dict]:
 def parse_stats(stdout: str) -> Optional[dict]:
     """Parse :key value pairs from z3 -st output."""
     stats = {}
-    for m in re.finditer(r':(\S+)\s+([\d.]+)', stdout):
+    for m in re.finditer(r":(\S+)\s+([\d.]+)", stdout):
         key, val = m.group(1), m.group(2)
-        stats[key] = float(val) if '.' in val else int(val)
+        stats[key] = float(val) if "." in val else int(val)
     return stats if stats else None
 
 
@@ -275,7 +305,7 @@ def cli():
 
     sub.add_parser("init", help="initialize the database")
 
-    status_p = sub.add_parser("status", help="show run summary")
+    sub.add_parser("status", help="show run summary")
 
     log_p = sub.add_parser("log", help="show interaction log")
     log_p.add_argument("--run-id", type=int, help="filter by run ID")
@@ -298,21 +328,27 @@ def cli():
 
     elif args.command == "status":
         s = db.get_status()
-        print(f"Runs: {s['total']}"
-              f" | success: {s.get('success', 0)}"
-              f" | error: {s.get('error', 0)}"
-              f" | timeout: {s.get('timeout', 0)}"
-              f" | Last: {s['last_run'] or 'never'}")
+        print(
+            f"Runs: {s['total']}"
+            f" | success: {s.get('success', 0)}"
+            f" | error: {s.get('error', 0)}"
+            f" | timeout: {s.get('timeout', 0)}"
+            f" | Last: {s['last_run'] or 'never'}"
+        )
 
     elif args.command == "log":
         for row in db.get_logs(args.run_id, args.last):
-            print(f"[{row['level']}] {row['timestamp']} "
-                  f"(run {row['run_id']}): {row['message']}")
+            print(
+                f"[{row['level']}] {row['timestamp']} "
+                f"(run {row['run_id']}): {row['message']}"
+            )
 
     elif args.command == "runs":
         for row in db.get_runs(args.skill, args.last):
-            print(f"#{row['run_id']} {row['skill']} {row['status']} "
-                  f"{row['duration_ms']}ms @ {row['timestamp']}")
+            print(
+                f"#{row['run_id']} {row['skill']} {row['status']} "
+                f"{row['duration_ms']}ms @ {row['timestamp']}"
+            )
 
     elif args.command == "query":
         for row in db.query(args.sql):
diff --git a/.github/skills/simplify/scripts/simplify.py b/.github/skills/simplify/scripts/simplify.py
index 9abef32fb..6621e9095 100644
--- a/.github/skills/simplify/scripts/simplify.py
+++ b/.github/skills/simplify/scripts/simplify.py
@@ -14,7 +14,6 @@ from pathlib import Path
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent / "shared"))
 from z3db import Z3DB, run_z3, setup_logging
 
-
 DEFAULT_TACTICS = "simplify,propagate-values,ctx-simplify"
 
 
diff --git a/.github/skills/solve/scripts/solve.py b/.github/skills/solve/scripts/solve.py
index b283243f2..dd0674695 100644
--- a/.github/skills/solve/scripts/solve.py
+++ b/.github/skills/solve/scripts/solve.py
@@ -44,10 +44,8 @@ def main():
     model = parse_model(result["stdout"]) if result["result"] == "sat" else None
     core = parse_unsat_core(result["stdout"]) if result["result"] == "unsat" else None
 
-    db.log_formula(run_id, formula, result["result"],
-                   str(model) if model else None)
-    db.finish_run(run_id, result["result"], result["duration_ms"],
-                  result["exit_code"])
+    db.log_formula(run_id, formula, result["result"], str(model) if model else None)
+    db.finish_run(run_id, result["result"], result["duration_ms"], result["exit_code"])
 
     print(result["result"])
     if model:

From 8eb35b3c46d88af1b7a39613ee5d633c189fc641 Mon Sep 17 00:00:00 2001
From: Angelica Moreira <48168649+angelica-moreira@users.noreply.github.com>
Date: Wed, 11 Mar 2026 21:53:16 +0000
Subject: [PATCH 04/11] remove deeptest skill (handled by external agent)

---
 .github/skills/deeptest/SKILL.md            | 113 ------
 .github/skills/deeptest/scripts/deeptest.py | 393 --------------------
 2 files changed, 506 deletions(-)
 delete mode 100644 .github/skills/deeptest/SKILL.md
 delete mode 100644 .github/skills/deeptest/scripts/deeptest.py

diff --git a/.github/skills/deeptest/SKILL.md b/.github/skills/deeptest/SKILL.md
deleted file mode 100644
index 2e2cd747e..000000000
--- a/.github/skills/deeptest/SKILL.md
+++ /dev/null
@@ -1,113 +0,0 @@
----
-name: deeptest
-description: Generate stress tests and differential tests for Z3 theories. Creates random or structured SMT-LIB2 formulas, runs them through Z3, and checks for crashes, assertion failures, or result inconsistencies. Inspired by fuzzing and metamorphic testing approaches applied to SMT solvers.
----
-
-Given a strategy and count, generate SMT-LIB2 formulas targeting Z3 internals and report anomalies. Strategies range from pure random generation to structured metamorphic and cross-theory combinations. Every formula and finding is logged to z3agent.db.
-
-# Step 1: Choose a strategy and run
-
-Action:
-    Select a generation strategy and invoke the script with the desired
-    count and seed.
-
-Expectation:
-    The script generates SMT-LIB2 formulas according to the chosen
-    strategy, runs each through Z3, and records results to z3agent.db.
-
-Result:
-    On completion: a summary is printed with formula count, anomaly count,
-    and elapsed time. Proceed to Step 2.
-    On early exit: verify the Z3 binary is accessible and review timeout
-    settings.
-
-```bash
-python3 scripts/deeptest.py --strategy random --count 100 --seed 42
-python3 scripts/deeptest.py --strategy metamorphic --seed-file base.smt2 --count 50
-python3 scripts/deeptest.py --strategy cross-theory --theories "LIA,BV" --count 80
-python3 scripts/deeptest.py --strategy incremental --count 60 --debug
-```
-
-Available strategies:
-
-- `random`: generate formulas with random declarations (Int, Bool, BitVec), random arithmetic and boolean assertions, and check-sat.
-- `metamorphic`: start from a base formula (generated or loaded from file), apply equisatisfiable transformations (tautology insertion, double negation, assertion duplication), and verify the result stays consistent.
-- `cross-theory`: combine multiple theories (LIA, Bool, BV) in a single formula with bridging constraints to stress theory combination.
-- `incremental`: generate push/pop sequences with per-frame assertions to stress incremental solving.
-
-# Step 2: Interpret the output
-
-Action:
-    Review the summary printed after the run completes.
-
-Expectation:
-    The summary shows strategy, seed, formula count, anomaly count, and
-    elapsed time.
-
-Result:
-    On zero anomalies: Z3 handled all generated formulas without issue.
-    On nonzero anomalies: crashes, assertion failures, solver errors, or
-    result disagreements were detected. Proceed to Step 3 for details.
-
-Example summary:
-
-```
-strategy:  random
-seed:      42
-formulas:  100
-anomalies: 2
-elapsed:   4500ms
-```
-
-# Step 3: Inspect findings
-
-Action:
-    Query z3agent.db for detailed finding records from the run.
-
-Expectation:
-    Each finding includes category, severity, message, formula index, exit
-    code, and a stderr excerpt.
-
-Result:
-    Use findings to identify reproducible failure patterns and prioritize
-    fixes by severity. If a finding appears nondeterministic, proceed to
-    Step 4 with the same seed to confirm.
-
-```bash
-python3 ../../shared/z3db.py query "SELECT category, severity, message FROM findings WHERE run_id IN (SELECT run_id FROM runs WHERE skill='deeptest') ORDER BY finding_id DESC LIMIT 20"
-```
-
-# Step 4: Reproduce
-
-Action:
-    Re-run the script with the same seed to reproduce the exact sequence
-    of generated formulas.
-
-Expectation:
-    Identical formulas are generated, producing the same anomalies if the
-    underlying bug persists.
-
-Result:
-    On same anomalies: bug confirmed and suitable for a regression test.
-    On zero anomalies: the issue may be nondeterministic or already fixed;
-    investigate further before closing.
-
-```bash
-python3 scripts/deeptest.py --strategy random --count 100 --seed 42
-```
-
-The seed is printed in every run summary and logged in the run record.
-
-# Parameters
-
-| Parameter | Type | Required | Default | Description |
-|-----------|------|----------|---------|-------------|
-| strategy | string | no | random | test generation strategy: random, metamorphic, cross-theory, incremental |
-| count | int | no | 50 | number of formulas to generate |
-| seed | int | no | clock | random seed for reproducibility |
-| seed-file | path | no | | base .smt2 file for metamorphic strategy |
-| theories | string | no | LIA,BV | comma-separated theories for cross-theory strategy |
-| timeout | int | no | 10 | per-formula Z3 timeout in seconds |
-| z3 | path | no | auto | path to z3 binary |
-| debug | flag | no | off | verbose tracing |
-| db | path | no | .z3-agent/z3agent.db | logging database |
diff --git a/.github/skills/deeptest/scripts/deeptest.py b/.github/skills/deeptest/scripts/deeptest.py
deleted file mode 100644
index 5d513a6bd..000000000
--- a/.github/skills/deeptest/scripts/deeptest.py
+++ /dev/null
@@ -1,393 +0,0 @@
-#!/usr/bin/env python3
-"""
-deeptest.py: generate and run stress tests for Z3.
-
-Usage:
-    python deeptest.py --strategy random --count 100
-    python deeptest.py --strategy metamorphic --seed-file base.smt2
-    python deeptest.py --strategy cross-theory --theories "LIA,BV" --debug
-"""
-
-import argparse
-import logging
-import random
-import sys
-import time
-from pathlib import Path
-
-sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent / "shared"))
-from z3db import Z3DB, run_z3, setup_logging
-
-log = logging.getLogger("deeptest")
-
-# Sort and operator tables
-
-THEORY_SORTS = {
-    "LIA":  "Int",
-    "Bool": "Bool",
-    "BV":   "(_ BitVec 32)",
-}
-
-INT_ARITH = ["+", "-", "*"]
-INT_CMP   = [">", "<", ">=", "<=", "="]
-BV_ARITH  = ["bvadd", "bvsub", "bvand", "bvor", "bvxor"]
-BV_CMP    = ["bvslt", "bvsgt", "bvsle", "bvsge", "="]
-
-# Assertion generators (one per sort)
-
-
-def _int_assertion(rng, vs):
-    if len(vs) < 2:
-        return f"(assert ({rng.choice(INT_CMP)} {vs[0]} {rng.randint(-10, 10)}))"
-    a, b = rng.sample(vs, 2)
-    return f"(assert ({rng.choice(INT_CMP)} ({rng.choice(INT_ARITH)} {a} {b}) {rng.randint(-10, 10)}))"
-
-
-def _bool_assertion(rng, vs):
-    if len(vs) == 1:
-        return f"(assert {vs[0]})" if rng.random() < 0.5 else f"(assert (not {vs[0]}))"
-    a, b = rng.sample(vs, 2)
-    return f"(assert ({rng.choice(['and', 'or', '=>'])} {a} {b}))"
-
-
-def _bv_assertion(rng, vs):
-    lit = f"(_ bv{rng.randint(0, 255)} 32)"
-    if len(vs) < 2:
-        return f"(assert ({rng.choice(BV_CMP)} {vs[0]} {lit}))"
-    a, b = rng.sample(vs, 2)
-    return f"(assert ({rng.choice(BV_CMP)} ({rng.choice(BV_ARITH)} {a} {b}) {lit}))"
-
-
-SORT_ASSERTION = {
-    "Int":            _int_assertion,
-    "Bool":           _bool_assertion,
-    "(_ BitVec 32)":  _bv_assertion,
-}
-
-
-def _random_assertion(rng, vars_by_sort):
-    """Pick a populated sort and emit one random assertion."""
-    available = [s for s in vars_by_sort if vars_by_sort[s]]
-    if not available:
-        return None
-    sort = rng.choice(available)
-    return SORT_ASSERTION[sort](rng, vars_by_sort[sort])
-
-# Formula generators
-
-
-def gen_random_formula(rng, num_vars=5, num_assertions=5):
-    """Random declarations, random assertions, single check-sat."""
-    lines = []
-    vars_by_sort = {}
-    sorts = list(THEORY_SORTS.values())
-
-    for i in range(num_vars):
-        sort = rng.choice(sorts)
-        name = f"v{i}"
-        lines.append(f"(declare-const {name} {sort})")
-        vars_by_sort.setdefault(sort, []).append(name)
-
-    for _ in range(num_assertions):
-        a = _random_assertion(rng, vars_by_sort)
-        if a:
-            lines.append(a)
-
-    lines.append("(check-sat)")
-    return "\n".join(lines)
-
-
-def gen_metamorphic_variant(rng, base_formula):
-    """Apply an equisatisfiable transformation to a formula.
-
-    Transformations:
-      tautology   : insert (assert true) before check-sat
-      double_neg  : wrap one assertion body in double negation
-      duplicate   : repeat an existing assertion
-    """
-    lines = base_formula.strip().split("\n")
-    transform = rng.choice(["tautology", "double_neg", "duplicate"])
-    assertion_idxs = [i for i, l in enumerate(lines)
-                      if l.strip().startswith("(assert")]
-
-    if transform == "tautology":
-        pos = next((i for i, l in enumerate(lines) if "check-sat" in l),
-                   len(lines))
-        lines.insert(pos, "(assert true)")
-
-    elif transform == "double_neg" and assertion_idxs:
-        idx = rng.choice(assertion_idxs)
-        body = lines[idx].strip()
-        inner = body[len("(assert "):-1]
-        lines[idx] = f"(assert (not (not {inner})))"
-
-    elif transform == "duplicate" and assertion_idxs:
-        idx = rng.choice(assertion_idxs)
-        lines.insert(idx + 1, lines[idx])
-
-    return "\n".join(lines)
-
-
-def gen_cross_theory_formula(rng, theories, num_vars=4, num_assertions=6):
-    """Combine variables from multiple theories with bridging constraints."""
-    lines = []
-    vars_by_sort = {}
-    sorts = [THEORY_SORTS[t] for t in theories if t in THEORY_SORTS]
-    if not sorts:
-        sorts = list(THEORY_SORTS.values())
-
-    for i in range(num_vars):
-        sort = sorts[i % len(sorts)]
-        name = f"v{i}"
-        lines.append(f"(declare-const {name} {sort})")
-        vars_by_sort.setdefault(sort, []).append(name)
-
-    for _ in range(num_assertions):
-        a = _random_assertion(rng, vars_by_sort)
-        if a:
-            lines.append(a)
-
-    # Bridge Int and Bool when both present
-    int_vs = vars_by_sort.get("Int", [])
-    bool_vs = vars_by_sort.get("Bool", [])
-    if int_vs and bool_vs:
-        iv = rng.choice(int_vs)
-        bv = rng.choice(bool_vs)
-        lines.append(f"(assert (= {bv} (> {iv} 0)))")
-
-    lines.append("(check-sat)")
-    return "\n".join(lines)
-
-
-def gen_incremental_formula(rng, num_frames=3, num_vars=4,
-                            asserts_per_frame=3):
-    """Push/pop sequence: all variables declared globally, assertions scoped."""
-    lines = []
-    vars_by_sort = {}
-    sorts = list(THEORY_SORTS.values())
-
-    for i in range(num_vars):
-        sort = rng.choice(sorts)
-        name = f"v{i}"
-        lines.append(f"(declare-const {name} {sort})")
-        vars_by_sort.setdefault(sort, []).append(name)
-
-    for _ in range(num_frames):
-        lines.append("(push 1)")
-        for _ in range(asserts_per_frame):
-            a = _random_assertion(rng, vars_by_sort)
-            if a:
-                lines.append(a)
-        lines.append("(check-sat)")
-        lines.append("(pop 1)")
-
-    lines.append("(check-sat)")
-    return "\n".join(lines)
-
-# Anomaly detection
-
-
-def classify_result(result):
-    """Return an anomaly category string or None if the result looks normal."""
-    if result["exit_code"] != 0 and result["result"] != "timeout":
-        return "crash"
-    if "assertion" in result["stderr"].lower():
-        return "assertion_failure"
-    if result["result"] == "error":
-        return "error"
-    return None
-
-# Strategy runners
-
-
-def run_random(args, rng, db, run_id):
-    anomalies = 0
-    for i in range(args.count):
-        formula = gen_random_formula(rng, rng.randint(2, 8),
-                                     rng.randint(1, 10))
-        log.debug("formula %d:\n%s", i, formula)
-        result = run_z3(formula, z3_bin=args.z3, timeout=args.timeout,
-                        debug=args.debug)
-        db.log_formula(run_id, formula, result["result"])
-
-        cat = classify_result(result)
-        if cat:
-            anomalies += 1
-            db.log_finding(
-                run_id, cat,
-                f"random formula #{i}: {cat} (exit={result['exit_code']})",
-                severity="high" if cat == "crash" else "medium",
-                details={"formula_index": i,
-                         "exit_code": result["exit_code"],
-                         "stderr": result["stderr"][:500]})
-            log.warning("anomaly in formula %d: %s", i, cat)
-    return anomalies
-
-
-def run_metamorphic(args, rng, db, run_id):
-    if args.seed_file:
-        base = Path(args.seed_file).read_text()
-    else:
-        base = gen_random_formula(rng, num_vars=4, num_assertions=3)
-
-    base_out = run_z3(base, z3_bin=args.z3, timeout=args.timeout,
-                      debug=args.debug)
-    base_status = base_out["result"]
-    db.log_formula(run_id, base, base_status)
-    log.info("base formula result: %s", base_status)
-
-    if base_status not in ("sat", "unsat"):
-        db.log_finding(run_id, "skip",
-                       f"base formula not definite: {base_status}",
-                       severity="info")
-        return 0
-
-    anomalies = 0
-    for i in range(args.count):
-        variant = gen_metamorphic_variant(rng, base)
-        log.debug("variant %d:\n%s", i, variant)
-        result = run_z3(variant, z3_bin=args.z3, timeout=args.timeout,
-                        debug=args.debug)
-        db.log_formula(run_id, variant, result["result"])
-
-        cat = classify_result(result)
-        if cat:
-            anomalies += 1
-            db.log_finding(
-                run_id, cat,
-                f"metamorphic variant #{i}: {cat}",
-                severity="high",
-                details={"variant_index": i,
-                         "stderr": result["stderr"][:500]})
-            log.warning("anomaly in variant %d: %s", i, cat)
-            continue
-
-        if result["result"] in ("sat", "unsat") \
-                and result["result"] != base_status:
-            anomalies += 1
-            db.log_finding(
-                run_id, "disagreement",
-                f"variant #{i}: expected {base_status}, "
-                f"got {result['result']}",
-                severity="critical",
-                details={"variant_index": i,
-                         "expected": base_status,
-                         "actual": result["result"]})
-            log.warning("disagreement in variant %d: expected %s, got %s",
-                        i, base_status, result["result"])
-    return anomalies
-
-
-def run_cross_theory(args, rng, db, run_id):
-    theories = [t.strip() for t in args.theories.split(",")]
-    anomalies = 0
-    for i in range(args.count):
-        formula = gen_cross_theory_formula(rng, theories,
-                                           rng.randint(3, 8),
-                                           rng.randint(2, 10))
-        log.debug("cross-theory formula %d:\n%s", i, formula)
-        result = run_z3(formula, z3_bin=args.z3, timeout=args.timeout,
-                        debug=args.debug)
-        db.log_formula(run_id, formula, result["result"])
-
-        cat = classify_result(result)
-        if cat:
-            anomalies += 1
-            db.log_finding(
-                run_id, cat,
-                f"cross-theory #{i} ({','.join(theories)}): {cat}",
-                severity="high" if cat == "crash" else "medium",
-                details={"formula_index": i, "theories": theories,
-                         "exit_code": result["exit_code"],
-                         "stderr": result["stderr"][:500]})
-            log.warning("anomaly in cross-theory formula %d: %s", i, cat)
-    return anomalies
-
-
-def run_incremental(args, rng, db, run_id):
-    anomalies = 0
-    for i in range(args.count):
-        num_frames = rng.randint(2, 6)
-        formula = gen_incremental_formula(rng, num_frames)
-        log.debug("incremental formula %d:\n%s", i, formula)
-        result = run_z3(formula, z3_bin=args.z3, timeout=args.timeout,
-                        debug=args.debug)
-        db.log_formula(run_id, formula, result["result"])
-
-        cat = classify_result(result)
-        if cat:
-            anomalies += 1
-            db.log_finding(
-                run_id, cat,
-                f"incremental #{i} ({num_frames} frames): {cat}",
-                severity="high" if cat == "crash" else "medium",
-                details={"formula_index": i, "num_frames": num_frames,
-                         "exit_code": result["exit_code"],
-                         "stderr": result["stderr"][:500]})
-            log.warning("anomaly in incremental formula %d: %s", i, cat)
-    return anomalies
-
-
-STRATEGIES = {
-    "random":        run_random,
-    "metamorphic":   run_metamorphic,
-    "cross-theory":  run_cross_theory,
-    "incremental":   run_incremental,
-}
-
-# Entry point
-
-
-def main():
-    parser = argparse.ArgumentParser(
-        prog="deeptest",
-        description="Generate and run stress tests for Z3.",
-    )
-    parser.add_argument("--strategy", choices=list(STRATEGIES),
-                        default="random",
-                        help="test generation strategy")
-    parser.add_argument("--count", type=int, default=50,
-                        help="number of formulas to generate")
-    parser.add_argument("--seed", type=int, default=None,
-                        help="random seed for reproducibility")
-    parser.add_argument("--seed-file", default=None,
-                        help="base .smt2 file for metamorphic strategy")
-    parser.add_argument("--theories", default="LIA,BV",
-                        help="comma-separated theories for cross-theory")
-    parser.add_argument("--timeout", type=int, default=10,
-                        help="per-formula Z3 timeout in seconds")
-    parser.add_argument("--z3", default=None, help="path to z3 binary")
-    parser.add_argument("--db", default=None, help="path to z3agent.db")
-    parser.add_argument("--debug", action="store_true")
-    args = parser.parse_args()
-
-    setup_logging(args.debug)
-
-    seed = args.seed if args.seed is not None else int(time.time())
-    rng = random.Random(seed)
-    log.info("seed: %d", seed)
-
-    db = Z3DB(args.db)
-    run_id = db.start_run(
-        "deeptest",
-        f"strategy={args.strategy} count={args.count} seed={seed}")
-
-    t0 = time.monotonic()
-    anomalies = STRATEGIES[args.strategy](args, rng, db, run_id)
-    elapsed_ms = int((time.monotonic() - t0) * 1000)
-
-    status = "success" if anomalies == 0 else "findings"
-    db.finish_run(run_id, status, elapsed_ms)
-
-    print(f"strategy:  {args.strategy}")
-    print(f"seed:      {seed}")
-    print(f"formulas:  {args.count}")
-    print(f"anomalies: {anomalies}")
-    print(f"elapsed:   {elapsed_ms}ms")
-
-    db.close()
-    sys.exit(1 if anomalies > 0 else 0)
-
-
-if __name__ == "__main__":
-    main()

From d74f610264abbe8c4d5c2dccb2cb3783313503b8 Mon Sep 17 00:00:00 2001
From: Angelica Moreira <48168649+angelica-moreira@users.noreply.github.com>
Date: Wed, 11 Mar 2026 21:53:21 +0000
Subject: [PATCH 05/11] merge z3-solver and z3-verifier into single z3 agent

---
 .github/agents/z3-solver.md   | 129 ------------------------
 .github/agents/z3-verifier.md | 131 -------------------------
 .github/agents/z3.md          | 180 ++++++++++++++++++++++++++++++++++
 3 files changed, 180 insertions(+), 260 deletions(-)
 delete mode 100644 .github/agents/z3-solver.md
 delete mode 100644 .github/agents/z3-verifier.md
 create mode 100644 .github/agents/z3.md

diff --git a/.github/agents/z3-solver.md b/.github/agents/z3-solver.md
deleted file mode 100644
index d1a97be80..000000000
--- a/.github/agents/z3-solver.md
+++ /dev/null
@@ -1,129 +0,0 @@
----
-name: z3-solver
-description: 'Z3 theorem prover assistant: satisfiability checking, validity proofs, optimization, simplification, encoding, and performance analysis.'
----
-
-## Instructions
-
-You are the Z3 Solver Agent, a Copilot agent for SMT solving workflows using the Z3 theorem prover. You help users formulate, solve, optimize, and interpret constraint satisfaction problems. Follow the workflow below. Use subagents for long-running skill invocations such as benchmarking.
-
-### Workflow
-
-1. **Understand the Request**: Determine what the user needs: a satisfiability check, a validity proof, an optimization, a simplification, an encoding from natural language, an explanation of output, or a performance analysis.
-
-2. **Encode (if needed)**: If the user provides a problem in natural language, pseudocode, or a domain-specific formulation, translate it into SMT-LIB2 using the **encode** skill before proceeding.
-
-3. **Solve or Transform**: Route to the appropriate skill based on the request type. Multiple skills may be chained when the task requires it (for example, encoding followed by optimization followed by explanation).
-
-4. **Explain Results**: After solving, invoke **explain** to present the result in clear, human-readable language. Always interpret models, proofs, and optimization results for the user.
-
-5. **Iterate**: On follow-up queries, refine the formulation or re-invoke skills with adjusted parameters. Do not re-run the full pipeline when only a narrow adjustment is needed.
-
-### Available Skills
-
-| # | Skill | Purpose |
-|---|-------|---------|
-| 1 | solve | Check satisfiability of a formula. Extract models when satisfiable. Report unsatisfiable cores when unsat. |
-| 2 | prove | Establish validity of a formula by checking the negation for unsatisfiability. If the negation is unsat, the original is valid. |
-| 3 | optimize | Solve constrained optimization problems. Supports minimize and maximize objectives, lexicographic and Pareto modes. |
-| 4 | simplify | Apply Z3 tactics to reduce formula complexity. Useful for preprocessing, normal form conversion, and human-readable reformulation. |
-| 5 | encode | Translate a problem description into SMT-LIB2 syntax. Handles sort selection, quantifier introduction, and theory annotation. |
-| 6 | explain | Interpret Z3 output (models, unsat cores, proofs, optimization results, statistics) and present it in plain language. |
-| 7 | benchmark | Measure solving performance. Collect statistics, compare tactic configurations, identify bottlenecks, and suggest parameter tuning. |
-
-### Skill Dependencies
-
-The planner respects these edges:
-
-```
-encode --> solve
-encode --> prove
-encode --> optimize
-encode --> simplify
-solve  --> explain
-prove  --> explain
-optimize --> explain
-simplify --> explain
-benchmark --> explain
-solve  --> benchmark
-optimize --> benchmark
-```
-
-Skills on the left must complete before skills on the right when both appear in a pipeline. Independent skills (for example, solve and optimize on separate formulas) may run in parallel.
-
-### Skill Selection
-
-Given a user request, select skills as follows:
-
-- "Is this formula satisfiable?" : `solve`
-- "Find a model for these constraints" : `solve` then `explain`
-- "Prove that P implies Q" : `encode` (if needed) then `prove` then `explain`
-- "Prove this is always true" : `prove` then `explain`
-- "Optimize this scheduling problem" : `encode` then `optimize` then `explain`
-- "Minimize cost subject to constraints" : `optimize` then `explain`
-- "Simplify this expression" : `simplify` then `explain`
-- "Convert to CNF" : `simplify`
-- "Translate this problem to SMT-LIB2" : `encode`
-- "Why is Z3 returning unknown?" : `explain`
-- "Why is this query slow?" : `benchmark` then `explain`
-- "Compare these two tactic pipelines" : `benchmark` then `explain`
-- "What does this model mean?" : `explain`
-- "Get the unsat core" : `solve` then `explain`
-
-When the request is ambiguous, prefer the most informative pipeline. For example, "check this formula" should invoke `solve` followed by `explain`, not `solve` alone.
-
-### Examples
-
-User: "Is (x > 0 and y > 0 and x + y < 1) satisfiable over the reals?"
-
-1. **solve**: Assert the conjunction over real-valued variables. Run `(check-sat)`.
-2. **explain**: If sat, present the model. If unsat, state that no assignment satisfies all three constraints simultaneously.
-
-User: "Prove that for all integers x, if x^2 is even then x is even."
-
-1. **encode**: Formalize the statement. Negate it: assert there exists an integer x such that x^2 is even and x is odd.
-2. **prove**: Check the negation for unsatisfiability.
-3. **explain**: If unsat, the original statement is valid. Present the reasoning. If sat (counterexample found), report the model and explain why the conjecture fails.
-
-User: "Schedule five tasks on two machines to minimize makespan."
-
-1. **encode**: Define integer variables for task assignments and start times. Encode machine capacity, precedence, and non-overlap constraints.
-2. **optimize**: Minimize the makespan variable subject to the encoded constraints.
-3. **explain**: Present the optimal schedule, makespan value, and any binding constraints.
-
-User: "Why is my bitvector query so slow?"
-
-1. **benchmark**: Run the query with `(set-option :timeout 30000)` and collect statistics via `(get-info :all-statistics)`.
-2. **explain**: Identify dominant cost centers (conflict count, propagation ratio, theory combination overhead). Suggest tactic or parameter adjustments such as `:blast_full` for bitvectors or increasing the relevancy threshold.
-
-### Error Handling
-
-Z3 may return results other than `sat` or `unsat`. Handle each case as follows:
-
-**unknown**: Z3 could not determine satisfiability within the given resource limits.
-- Check if a timeout was active. If so, suggest increasing it.
-- Inspect the reason with `(get-info :reason-unknown)`.
-- If the reason is "incomplete," the formula may use a theory fragment that Z3 cannot decide. Suggest alternative encodings (for example, replacing nonlinear arithmetic with linearization or bit-blasting).
-- If the reason is "timeout" or "max-conflicts," suggest parameter tuning: increase `:timeout`, adjust `:smt.relevancy`, or try a different tactic pipeline.
-
-**error (syntax or sort mismatch)**: The input is malformed.
-- Report the exact error message from Z3.
-- Identify the offending declaration or assertion.
-- Suggest a corrected encoding.
-
-**error (resource exhaustion)**: Z3 ran out of memory or hit an internal limit.
-- Suggest simplifying the problem: reduce variable count, eliminate quantifiers where possible, split into subproblems.
-- Suggest incremental solving with `(push)` / `(pop)` to reuse learned information.
-
-**unsat with no core requested**: The formula is unsatisfiable but the user may want to understand why.
-- Offer to re-run with `(set-option :produce-unsat-cores true)` and named assertions to extract a minimal explanation.
-
-### Notes
-
-- Always validate SMT-LIB2 syntax before invoking Z3. A malformed input wastes time and produces confusing errors.
-- Prefer incremental mode (`(push)` / `(pop)`) when the user is iterating on a formula.
-- Use `(set-option :produce-models true)` by default for satisfiability queries.
-- Use `(set-option :produce-proofs true)` when the user requests validity proofs.
-- Collect statistics with `z3 -st` when performance is relevant.
-- Present models in a readable table format, not raw S-expressions, unless the user requests SMT-LIB2 output.
-- Never fabricate results. If a skill fails or Z3 produces an unexpected answer, report the raw output and explain what went wrong.
diff --git a/.github/agents/z3-verifier.md b/.github/agents/z3-verifier.md
deleted file mode 100644
index 246ce1b5a..000000000
--- a/.github/agents/z3-verifier.md
+++ /dev/null
@@ -1,131 +0,0 @@
----
-name: z3-verifier
-description: 'Z3 code quality agent: memory safety checking, static analysis, and stress testing for the Z3 codebase itself.'
----
-
-## Instructions
-
-You are the Z3 Verifier Agent, a Copilot agent for code quality and correctness verification of the Z3 theorem prover codebase. You do not solve SMT problems (use **z3-solver** for that). Instead, you detect bugs, enforce code quality, and stress-test Z3 internals. Follow the workflow below. Use subagents for long-running skill invocations such as fuzzing campaigns.
-
-### Workflow
-
-1. **Identify the Verification Goal**: Determine what the user needs: memory bug detection, static analysis findings, or stress testing results. If the request is broad ("verify this code" or "full verification pass"), run all three skills.
-
-2. **Build the Target**: Ensure a Z3 build exists with the required instrumentation (sanitizers, debug symbols, coverage). If not, build one before proceeding.
-
-3. **Run Verification Skills**: Invoke the appropriate skill(s). When running a full verification pass, execute all three skills and aggregate results.
-
-4. **Report Findings**: Present results sorted by severity. Each finding should include: location (file, function, line), category, severity, and reproduction steps where applicable.
-
-5. **Iterate**: On follow-ups, narrow scope to specific files, functions, or bug categories. Do not re-run the full pipeline unnecessarily.
-
-### Available Skills
-
-| # | Skill | Purpose |
-|---|-------|---------|
-| 1 | memory-safety | Build Z3 with AddressSanitizer (ASan), MemorySanitizer (MSan), or UndefinedBehaviorSanitizer (UBSan). Run the test suite under instrumentation to detect memory corruption, use-after-free, buffer overflows, uninitialized reads, and undefined behavior. |
-| 2 | static-analysis | Run the Clang Static Analyzer over the Z3 source tree. Detects null pointer dereferences, resource leaks, dead stores, logic errors, and API misuse without executing the code. |
-| 3 | deeptest | Stress-test Z3 with randomized inputs, differential testing against known-good solvers, and targeted fuzzing of parser and solver components. Detects crashes, assertion failures, and correctness regressions. |
-
-### Skill Dependencies
-
-```
-memory-safety (independent)
-static-analysis (independent)
-deeptest (independent)
-```
-
-All three skills are independent and may run in parallel. None requires the output of another as input. When running a full verification pass, launch all three simultaneously via subagents.
-
-### Skill Selection
-
-Given a user request, select skills as follows:
-
-- "Check for memory bugs" : `memory-safety`
-- "Run ASan on the test suite" : `memory-safety`
-- "Find undefined behavior" : `memory-safety` (with UBSan configuration)
-- "Run static analysis" : `static-analysis`
-- "Find null pointer bugs" : `static-analysis`
-- "Check for resource leaks" : `static-analysis`
-- "Fuzz Z3" : `deeptest`
-- "Stress test the parser" : `deeptest`
-- "Run differential testing" : `deeptest`
-- "Full verification pass" : `memory-safety` + `static-analysis` + `deeptest`
-- "Verify this pull request" : `memory-safety` + `static-analysis` (scope to changed files)
-- "Is this change safe?" : `memory-safety` + `static-analysis` (scope to changed files)
-
-### Examples
-
-User: "Check for memory bugs in the SAT solver."
-
-1. **memory-safety**: Build Z3 with ASan enabled (`cmake -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer" ..`). Run the SAT solver tests. Collect any sanitizer reports.
-2. Report findings with stack traces, categorized by bug type (heap-buffer-overflow, use-after-free, stack-buffer-overflow, etc.).
-
-User: "Run static analysis on src/ast/."
-
-1. **static-analysis**: Invoke `scan-build` or `clang-tidy` over `src/ast/` with Z3's compile commands database.
-2. Report findings sorted by severity. Include checker name, file, line, and a brief description of each issue.
-
-User: "Fuzz the SMT-LIB2 parser."
-
-1. **deeptest**: Generate randomized SMT-LIB2 inputs targeting the parser. Run Z3 on each input with a timeout. Collect crashes, assertion failures, and unexpected error messages.
-2. Report crash-inducing inputs with minimized reproduction cases. Classify findings as crashes, assertion violations, or incorrect results.
-
-User: "Full verification pass before the release."
-
-1. Launch all three skills in parallel via subagents:
-   - **memory-safety**: Full test suite under ASan and UBSan.
-   - **static-analysis**: Full source tree scan.
-   - **deeptest**: Broad fuzzing campaign across theories (arithmetic, bitvectors, arrays, strings).
-2. Aggregate all findings. Deduplicate issues that appear in multiple skills (for example, a null dereference found by both static analysis and ASan). Sort by severity: Critical, High, Medium, Low.
-3. Present a summary table followed by detailed findings.
-
-### Build Configurations
-
-Each skill may require a specific build configuration:
-
-**memory-safety (ASan)**:
-```bash
-mkdir build-asan && cd build-asan
-cmake .. -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer" -DCMAKE_C_FLAGS="-fsanitize=address -fno-omit-frame-pointer" -DCMAKE_BUILD_TYPE=Debug
-make -j$(nproc)
-```
-
-**memory-safety (UBSan)**:
-```bash
-mkdir build-ubsan && cd build-ubsan
-cmake .. -DCMAKE_CXX_FLAGS="-fsanitize=undefined" -DCMAKE_C_FLAGS="-fsanitize=undefined" -DCMAKE_BUILD_TYPE=Debug
-make -j$(nproc)
-```
-
-**static-analysis**:
-```bash
-mkdir build-analyze && cd build-analyze
-scan-build cmake .. -DCMAKE_BUILD_TYPE=Debug
-scan-build make -j$(nproc)
-```
-
-**deeptest**: Uses a standard Release build for performance, with Debug builds reserved for reproducing crashes:
-```bash
-mkdir build-fuzz && cd build-fuzz
-cmake .. -DCMAKE_BUILD_TYPE=Release
-make -j$(nproc)
-```
-
-### Error Handling
-
-**Build failure**: If the instrumented build fails, report the compiler errors. Common causes: sanitizer flags incompatible with certain optimization levels, or missing sanitizer runtime libraries.
-
-**Flaky sanitizer reports**: Some sanitizer findings may be nondeterministic (especially under MSan with uninitialized memory). Re-run flagged tests three times to confirm reproducibility. Mark non-reproducible findings as "intermittent" rather than discarding them.
-
-**Fuzzing timeouts**: Individual fuzz inputs that cause Z3 to exceed the timeout threshold should be collected separately and reported as potential performance regressions, not crashes.
-
-**False positives in static analysis**: The Clang Static Analyzer may produce false positives, particularly around custom allocators and reference-counted smart pointers used in Z3. Flag likely false positives but do not suppress them without user confirmation.
-
-### Notes
-
-- Sanitizer builds are significantly slower than Release builds. Set timeouts to at least 3x the normal test suite duration.
-- Store sanitizer reports and fuzzing artifacts in `.z3-verifier/` unless the user specifies otherwise.
-- When scoping to changed files for pull request verification, use `git diff` to determine the affected source files and limit skill invocations accordingly.
-- Never suppress or ignore sanitizer findings automatically. Every report should be presented to the user for triage.
-- Prefer ASan as the default sanitizer. It catches the broadest class of memory errors with the lowest false-positive rate.
diff --git a/.github/agents/z3.md b/.github/agents/z3.md
new file mode 100644
index 000000000..77e51713e
--- /dev/null
+++ b/.github/agents/z3.md
@@ -0,0 +1,180 @@
+---
+name: z3
+description: 'Z3 theorem prover agent: SMT solving, code quality analysis, and verification.'
+---
+
+## Instructions
+
+You are the Z3 Agent, a Copilot agent for the Z3 theorem prover. You handle two classes of requests: (1) SMT solving workflows where users formulate, solve, and interpret constraint problems, and (2) code quality workflows where users verify the Z3 codebase itself for memory bugs, static analysis findings, and solver correctness. Route to the appropriate skills based on the request.
+
+### Workflow
+
+1. **Classify the request**: Is the user asking to solve an SMT problem, or to verify/test the Z3 codebase?
+
+2. **For SMT problems**:
+   - Encode the problem into SMT-LIB2 if needed (via **encode**).
+   - Route to the appropriate solving skill (**solve**, **prove**, **optimize**, **simplify**).
+   - Interpret the result (via **explain**).
+   - Measure performance if relevant (via **benchmark**).
+
+3. **For code quality**:
+   - Route to **memory-safety** or **static-analysis** depending on the goal.
+   - Independent skills may run in parallel.
+   - Aggregate and deduplicate findings across skills.
+
+4. **Report**: Present results clearly. For SMT problems, interpret models and proofs. For code quality, sort findings by severity with file locations.
+
+5. **Iterate**: On follow-ups, refine the formulation or narrow the scope. Do not re-run the full pipeline when only a narrow adjustment is needed.
+
+### Available Skills
+
+| # | Skill | Domain | Purpose |
+|---|-------|--------|---------|
+| 1 | solve | SMT | Check satisfiability. Extract models or unsat cores. |
+| 2 | prove | SMT | Establish validity by checking the negation for unsatisfiability. |
+| 3 | optimize | SMT | Minimize or maximize objectives subject to constraints. |
+| 4 | simplify | SMT | Apply tactic chains to reduce formula complexity. |
+| 5 | encode | SMT | Translate problem descriptions into SMT-LIB2 syntax. |
+| 6 | explain | SMT | Interpret Z3 output (models, cores, proofs, statistics) in plain language. |
+| 7 | benchmark | SMT | Measure solving performance, collect statistics, compare configurations. |
+| 8 | memory-safety | Quality | Run ASan/UBSan on the Z3 test suite to detect memory errors and undefined behavior. |
+| 9 | static-analysis | Quality | Run Clang Static Analyzer over Z3 source for null derefs, leaks, dead stores, logic errors. |
+
+### Skill Dependencies
+
+SMT solving skills have ordering constraints:
+
+```
+encode -> solve
+encode -> prove
+encode -> optimize
+encode -> simplify
+solve  -> explain
+prove  -> explain
+optimize -> explain
+simplify -> explain
+benchmark -> explain
+solve  -> benchmark
+optimize -> benchmark
+```
+
+Code quality skills are independent and may run in parallel:
+
+```
+memory-safety (independent)
+static-analysis (independent)
+```
+
+### Skill Selection
+
+**SMT problems:**
+
+- "Is this formula satisfiable?" : `solve`
+- "Find a model for these constraints" : `solve` then `explain`
+- "Prove that P implies Q" : `encode` (if needed) then `prove` then `explain`
+- "Optimize this scheduling problem" : `encode` then `optimize` then `explain`
+- "Simplify this expression" : `simplify` then `explain`
+- "Convert to CNF" : `simplify`
+- "Translate this problem to SMT-LIB2" : `encode`
+- "Why is Z3 returning unknown?" : `explain`
+- "Why is this query slow?" : `benchmark` then `explain`
+- "What does this model mean?" : `explain`
+- "Get the unsat core" : `solve` then `explain`
+
+**Code quality:**
+
+- "Check for memory bugs" : `memory-safety`
+- "Run ASan on the test suite" : `memory-safety`
+- "Find undefined behavior" : `memory-safety` (UBSan mode)
+- "Run static analysis" : `static-analysis`
+- "Find null pointer bugs" : `static-analysis`
+- "Full verification pass" : `memory-safety` + `static-analysis`
+- "Verify this pull request" : `memory-safety` + `static-analysis` (scope to changed files)
+
+When the request is ambiguous, prefer the most informative pipeline.
+
+### Examples
+
+User: "Is (x > 0 and y > 0 and x + y < 1) satisfiable over the reals?"
+
+1. **solve**: Assert the conjunction over real-valued variables. Run `(check-sat)`.
+2. **explain**: Present the model or state unsatisfiability.
+
+User: "Prove that for all integers x, if x^2 is even then x is even."
+
+1. **encode**: Formalize and negate the statement.
+2. **prove**: Check the negation for unsatisfiability.
+3. **explain**: Present the validity result or counterexample.
+
+User: "Schedule five tasks on two machines to minimize makespan."
+
+1. **encode**: Define integer variables, encode machine capacity and precedence constraints.
+2. **optimize**: Minimize the makespan variable.
+3. **explain**: Present the optimal schedule and binding constraints.
+
+User: "Why is my bitvector query so slow?"
+
+1. **benchmark**: Run with statistics collection.
+2. **explain**: Identify cost centers and suggest parameter adjustments.
+
+User: "Check for memory bugs in the SAT solver."
+
+1. **memory-safety**: Build with ASan, run SAT solver tests, collect sanitizer reports.
+2. Report findings with stack traces categorized by bug type.
+
+User: "Full verification pass before the release."
+
+1. Launch both quality skills in parallel:
+   - **memory-safety**: Full test suite under ASan and UBSan.
+   - **static-analysis**: Full source tree scan.
+2. Aggregate findings, deduplicate, sort by severity.
+
+### Build Configurations
+
+Code quality skills may require specific builds:
+
+**memory-safety (ASan)**:
+```bash
+mkdir build-asan && cd build-asan
+cmake .. -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer" -DCMAKE_C_FLAGS="-fsanitize=address -fno-omit-frame-pointer" -DCMAKE_BUILD_TYPE=Debug
+make -j$(nproc)
+```
+
+**memory-safety (UBSan)**:
+```bash
+mkdir build-ubsan && cd build-ubsan
+cmake .. -DCMAKE_CXX_FLAGS="-fsanitize=undefined" -DCMAKE_C_FLAGS="-fsanitize=undefined" -DCMAKE_BUILD_TYPE=Debug
+make -j$(nproc)
+```
+
+**static-analysis**:
+```bash
+mkdir build-analyze && cd build-analyze
+scan-build cmake .. -DCMAKE_BUILD_TYPE=Debug
+scan-build make -j$(nproc)
+```
+
+### Error Handling
+
+**unknown from Z3**: Check `(get-info :reason-unknown)`. If "incomplete," suggest alternative encodings. If "timeout," suggest parameter tuning or the **simplify** skill.
+
+**syntax or sort errors**: Report the exact Z3 error message, identify the offending declaration, suggest a correction.
+
+**resource exhaustion**: Suggest simplifying the problem, eliminating quantifiers, or using incremental solving.
+
+**build failure**: Report compiler errors. Common cause: sanitizer flags incompatible with optimization levels.
+
+**flaky sanitizer reports**: Re-run flagged tests three times to confirm reproducibility. Mark non-reproducible findings as "intermittent."
+
+**false positives in static analysis**: Flag likely false positives but do not suppress without user confirmation.
+
+### Notes
+
+- Validate SMT-LIB2 syntax before invoking Z3.
+- Prefer incremental mode (`(push)` / `(pop)`) when the user is iterating on a formula.
+- Use `(set-option :produce-models true)` by default for satisfiability queries.
+- Collect statistics with `z3 -st` when performance is relevant.
+- Present models in readable table format, not raw S-expressions.
+- Sanitizer builds are slower than Release builds. Set timeouts to at least 3x normal.
+- Store code quality artifacts in `.z3-agent/`.
+- Never fabricate results or suppress findings.

From ed5b1929f1af49cc0ec84cde03f677df4816dcc5 Mon Sep 17 00:00:00 2001
From: Angelica Moreira <48168649+angelica-moreira@users.noreply.github.com>
Date: Wed, 11 Mar 2026 21:53:27 +0000
Subject: [PATCH 06/11] add dependency checks to memory-safety and
 static-analysis

---
 .../memory-safety/scripts/memory_safety.py    | 21 +++++++++++++++++++
 .../scripts/static_analysis.py                | 11 +++++++---
 2 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/.github/skills/memory-safety/scripts/memory_safety.py b/.github/skills/memory-safety/scripts/memory_safety.py
index cab818a63..fa87f8f8c 100644
--- a/.github/skills/memory-safety/scripts/memory_safety.py
+++ b/.github/skills/memory-safety/scripts/memory_safety.py
@@ -11,6 +11,7 @@ import argparse
 import logging
 import os
 import re
+import shutil
 import subprocess
 import sys
 import time
@@ -32,6 +33,25 @@ LEAK_ERROR = re.compile(r"ERROR:\s*LeakSanitizer:")
 LOCATION = re.compile(r"(\S+\.(?:cpp|c|h|hpp)):(\d+)")
 
 
+def check_dependencies():
+    """Fail early if required build tools are not on PATH."""
+    missing = []
+    if not shutil.which("cmake"):
+        missing.append(("cmake", "sudo apt install cmake"))
+    if not shutil.which("make"):
+        missing.append(("make", "sudo apt install build-essential"))
+
+    cc = shutil.which("clang") or shutil.which("gcc")
+    if not cc:
+        missing.append(("clang or gcc", "sudo apt install clang"))
+
+    if missing:
+        print("required tools not found:", file=sys.stderr)
+        for tool, install in missing:
+            print(f"  {tool}: {install}", file=sys.stderr)
+        sys.exit(1)
+
+
 def find_repo_root() -> Path:
     d = Path.cwd()
     for _ in range(10):
@@ -199,6 +219,7 @@ def main():
     args = parser.parse_args()
 
     setup_logging(args.debug)
+    check_dependencies()
     repo_root = find_repo_root()
 
     sanitizers = ["asan", "ubsan"] if args.sanitizer == "both" else [args.sanitizer]
diff --git a/.github/skills/static-analysis/scripts/static_analysis.py b/.github/skills/static-analysis/scripts/static_analysis.py
index aa64d883d..65f87e731 100644
--- a/.github/skills/static-analysis/scripts/static_analysis.py
+++ b/.github/skills/static-analysis/scripts/static_analysis.py
@@ -34,9 +34,14 @@ def find_scan_build() -> str:
         if path:
             logger.debug("found scan-build: %s", path)
             return path
-    logger.error(
-        "scan-build not found. Install clang-tools or set PATH. "
-        "Searched: %s", ", ".join(SCAN_BUILD_NAMES)
+    print(
+        "scan-build not found on PATH.\n"
+        "Install one of the following:\n"
+        "  Ubuntu/Debian: sudo apt install clang-tools\n"
+        "  macOS:         brew install llvm\n"
+        "  Fedora:        sudo dnf install clang-tools-extra\n"
+        f"searched for: {', '.join(SCAN_BUILD_NAMES)}",
+        file=sys.stderr,
     )
     sys.exit(1)
 

From 90a4cdf8556fd5bc0b456f7e498fc790f230581d Mon Sep 17 00:00:00 2001
From: Angelica Moreira <48168649+angelica-moreira@users.noreply.github.com>
Date: Wed, 11 Mar 2026 21:53:32 +0000
Subject: [PATCH 07/11] update skills readme to match current state

---
 .github/skills/README.md | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/.github/skills/README.md b/.github/skills/README.md
index 53fc9f80d..5d58eade7 100644
--- a/.github/skills/README.md
+++ b/.github/skills/README.md
@@ -15,18 +15,16 @@ LLM agent, backed by a Python validation script in `scripts/`.
 | optimize | implemented | Solve constrained optimization (minimize/maximize) over numeric domains |
 | explain | implemented | Parse and interpret Z3 output: models, cores, statistics, errors |
 | benchmark | implemented | Measure Z3 performance and collect solver statistics |
-| static-analysis | planned | Run Clang Static Analyzer on Z3 source and log structured findings |
-| deeptest | planned | Deep property-based testing of Z3 internals |
-| memory-safety | planned | Memory safety verification of Z3 C++ source |
+| static-analysis | implemented | Run Clang Static Analyzer on Z3 source and log structured findings |
+| memory-safety | implemented | Run ASan/UBSan on Z3 test suite to detect memory errors and undefined behavior |
 
-## Agents
+## Agent
 
-Two orchestration agents compose these skills into end-to-end workflows:
+A single orchestration agent composes these skills into end-to-end workflows:
 
 | Agent | Role |
 |-------|------|
-| z3-solver | Formulation and solving: encode, solve, prove, simplify, optimize, explain |
-| z3-verifier | Codebase quality: benchmark, static-analysis, deeptest, memory-safety |
+| z3 | SMT solving, code quality analysis, and stress testing |
 
 ## Shared Infrastructure
 

From ed8f3ac438256f80b2b6651105508b0a365cd947 Mon Sep 17 00:00:00 2001
From: Angelica Moreira <48168649+angelica-moreira@users.noreply.github.com>
Date: Wed, 11 Mar 2026 21:53:37 +0000
Subject: [PATCH 08/11] add agent readme with usage examples

---
 .github/agents/README.md | 312 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 312 insertions(+)
 create mode 100644 .github/agents/README.md

diff --git a/.github/agents/README.md b/.github/agents/README.md
new file mode 100644
index 000000000..769b47116
--- /dev/null
+++ b/.github/agents/README.md
@@ -0,0 +1,312 @@
+# Z3 Agent
+
+A Copilot agent for the Z3 theorem prover. It wraps 9 skills that cover
+SMT solving and code quality analysis.
+
+## What it does
+
+The agent handles two kinds of requests:
+
+1. **SMT solving**: formulate constraints, check satisfiability, prove
+   properties, optimize objectives, simplify expressions.
+2. **Code quality**: run sanitizers (ASan, UBSan) and Clang Static Analyzer
+   against the Z3 codebase to catch memory bugs and logic errors.
+
+## Prerequisites
+
+You need a built Z3 binary. The scripts look for it in this order:
+
+1. Explicit `--z3 path/to/z3`
+2. `build/z3`, `build/release/z3`, `build/debug/z3` (relative to repo root)
+3. `z3` on your PATH
+
+For code quality skills you also need:
+
+- **memory-safety**: cmake, make, and a compiler with sanitizer support
+  (gcc or clang). The script checks at startup and tells you what is missing.
+- **static-analysis**: scan-build (part of clang-tools). Same early check
+  with install instructions if absent.
+
+## Using the agent in Copilot Chat
+
+Mention `@z3` and describe what you want:
+
+```
+@z3 is (x + y > 10) satisfiable?
+@z3 prove that x*x >= 0 for all integers
+@z3 run memory-safety checks on the test suite
+```
+
+The agent picks the right skill and runs it.
+
+## Using the scripts directly
+
+Every skill lives under `.github/skills/<name>/scripts/`. All scripts
+accept `--debug` for full tracing and `--db path` to specify where the
+SQLite log goes (defaults to `z3agent.db` in the current directory).
+
+### solve
+
+Check whether a set of constraints has a solution.
+
+```
+python3 .github/skills/solve/scripts/solve.py \
+  --z3 build/release/z3 \
+  --formula '
+(declare-const x Int)
+(declare-const y Int)
+(assert (> x 0))
+(assert (> y 0))
+(assert (< (+ x y) 5))
+(check-sat)
+(get-model)'
+```
+
+Output:
+
+```
+sat
+  x = 1
+  y = 1
+```
+
+### prove
+
+Check whether a property holds for all values. The script negates your
+conjecture and asks Z3 if the negation is satisfiable. If it is not,
+the property is valid.
+
+```
+python3 .github/skills/prove/scripts/prove.py \
+  --z3 build/release/z3 \
+  --conjecture '(>= (* x x) 0)' \
+  --vars 'x:Int'
+```
+
+Output:
+
+```
+valid
+```
+
+If Z3 finds a counterexample, it prints `invalid` followed by the
+counterexample values.
+
+### optimize
+
+Find the best value of an objective subject to constraints.
+
+```
+python3 .github/skills/optimize/scripts/optimize.py \
+  --z3 build/release/z3 \
+  --formula '
+(declare-const x Int)
+(declare-const y Int)
+(assert (>= x 1))
+(assert (>= y 1))
+(assert (<= (+ x y) 20))
+(maximize (+ (* 3 x) (* 2 y)))
+(check-sat)
+(get-model)'
+```
+
+Output:
+
+```
+sat
+  x = 19
+  y = 1
+```
+
+Here Z3 maximizes `3x + 2y` under the constraint `x + y <= 20`, so it
+pushes x as high as possible (19) and keeps y at its minimum (1),
+giving `3*19 + 2*1 = 59`.
+
+### simplify
+
+Reduce expressions using Z3 tactic chains.
+
+```
+python3 .github/skills/simplify/scripts/simplify.py \
+  --z3 build/release/z3 \
+  --formula '(declare-const x Int)(simplify (+ x 0 (* 1 x)))'
+```
+
+Output:
+
+```
+(* 2 x)
+(goals
+(goal
+  :precision precise :depth 1)
+)
+```
+
+Z3 simplified `x + 0 + 1*x` down to `2*x`.
+
+### benchmark
+
+Measure solving time over multiple runs.
+
+```
+python3 .github/skills/benchmark/scripts/benchmark.py \
+  --z3 build/release/z3 \
+  --runs 5 \
+  --formula '
+(declare-const x Int)
+(declare-const y Int)
+(assert (> x 0))
+(assert (> y 0))
+(assert (< (+ x y) 100))
+(check-sat)'
+```
+
+Output (times will vary on your machine):
+
+```
+runs: 5
+min: 27ms
+median: 28ms
+max: 30ms
+result: sat
+```
+
+### explain
+
+Interpret Z3 output in readable form. It reads from stdin:
+
+```
+echo 'sat
+(
+  (define-fun x () Int
+    19)
+  (define-fun y () Int
+    1)
+)' | python3 .github/skills/explain/scripts/explain.py --stdin --type model
+```
+
+Output:
+
+```
+satisfying assignment:
+  x = 19
+  y = 1
+```
+
+### encode
+
+Validate that an SMT-LIB2 file is well-formed by running it through Z3:
+
+```
+python3 .github/skills/encode/scripts/encode.py \
+  --z3 build/release/z3 \
+  --validate problem.smt2
+```
+
+If the file parses and runs without errors, it prints the formula back.
+If there are syntax or sort errors, it prints the Z3 error message.
+
+### memory-safety
+
+Build Z3 with AddressSanitizer or UndefinedBehaviorSanitizer, run the
+test suite, and collect any findings.
+
+```
+python3 .github/skills/memory-safety/scripts/memory_safety.py --sanitizer asan
+python3 .github/skills/memory-safety/scripts/memory_safety.py --sanitizer ubsan
+python3 .github/skills/memory-safety/scripts/memory_safety.py --sanitizer both
+```
+
+Use `--skip-build` to reuse a previous instrumented build. Use
+`--build-dir path` to control where the build goes (defaults to
+`build/sanitizer-asan` or `build/sanitizer-ubsan` under the repo root).
+
+If cmake, make, or a C compiler is not found, the script prints what
+you need to install and exits.
+
+### static-analysis
+
+Run Clang Static Analyzer over the Z3 source tree.
+
+```
+python3 .github/skills/static-analysis/scripts/static_analysis.py \
+  --build-dir build/scan
+```
+
+Results go to `build/scan/scan-results/` by default. Findings are
+printed grouped by category with file and line number.
+
+If `scan-build` is not on your PATH, the script prints install
+instructions for Ubuntu, macOS, and Fedora.
+
+## Debug tracing
+
+Add `--debug` to any command to see the full trace: run IDs, z3 binary
+path, the exact command and stdin sent to Z3, stdout/stderr received,
+timing, and database logging. Example:
+
+```
+python3 .github/skills/solve/scripts/solve.py \
+  --z3 build/release/z3 --debug \
+  --formula '
+(declare-const x Int)
+(declare-const y Int)
+(assert (> x 0))
+(assert (> y 0))
+(assert (< (+ x y) 5))
+(check-sat)
+(get-model)'
+```
+
+```
+[DEBUG] started run 31 (skill=solve, hash=d64beb5a61842362)
+[DEBUG] found z3: build/release/z3
+[DEBUG] cmd: build/release/z3 -in
+[DEBUG] stdin:
+(declare-const x Int)
+(declare-const y Int)
+(assert (> x 0))
+(assert (> y 0))
+(assert (< (+ x y) 5))
+(check-sat)
+(get-model)
+[DEBUG] exit_code=0 duration=28ms
+[DEBUG] stdout:
+sat
+(
+  (define-fun x () Int
+    1)
+  (define-fun y () Int
+    1)
+)
+[DEBUG] finished run 31: sat (28ms)
+sat
+  x = 1
+  y = 1
+```
+
+## Logging
+
+Every run is logged to a SQLite database (`z3agent.db` by default).
+You can query it directly:
+
+```
+sqlite3 z3agent.db "SELECT id, skill, status, duration_ms FROM runs ORDER BY id DESC LIMIT 10;"
+```
+
+Use `--db /path/to/file.db` on any script to put the database somewhere
+else.
+
+## Skill list
+
+| Skill | What it does |
+|-------|-------------|
+| solve | check satisfiability, extract models or unsat cores |
+| prove | prove validity by negating and checking unsatisfiability |
+| optimize | minimize or maximize objectives under constraints |
+| simplify | reduce formulas with Z3 tactic chains |
+| encode | translate problems into SMT-LIB2, validate syntax |
+| explain | interpret Z3 output (models, cores, stats, errors) |
+| benchmark | measure solving time, collect statistics |
+| memory-safety | run ASan/UBSan on Z3 test suite |
+| static-analysis | run Clang Static Analyzer on Z3 source |

From 14276fb193765dd5d35dbab1719a0912537c80a6 Mon Sep 17 00:00:00 2001
From: Angelica Moreira <48168649+angelica-moreira@users.noreply.github.com>
Date: Wed, 11 Mar 2026 22:04:15 +0000
Subject: [PATCH 09/11] ignore .z3-agent runtime directory

---
 .gitignore | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.gitignore b/.gitignore
index ee5df58ff..517fd3159 100644
--- a/.gitignore
+++ b/.gitignore
@@ -117,3 +117,4 @@ genaisrc/genblogpost.genai.mts
 bazel-*
 # Local issue tracking
 .beads
+.z3-agent/

From 68ea8d3a435ced2bcf588cdea04d02d85c09970a Mon Sep 17 00:00:00 2001
From: Angelica Moreira <48168649+angelica-moreira@users.noreply.github.com>
Date: Wed, 11 Mar 2026 22:27:29 +0000
Subject: [PATCH 10/11] move agent readme to repo root as Z3-AGENT.md

---
 .github/agents/README.md => Z3-AGENT.md | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename .github/agents/README.md => Z3-AGENT.md (100%)

diff --git a/.github/agents/README.md b/Z3-AGENT.md
similarity index 100%
rename from .github/agents/README.md
rename to Z3-AGENT.md

From f120cc690357fb32d764f00d9749a96b980ae296 Mon Sep 17 00:00:00 2001
From: Angelica Moreira <48168649+angelica-moreira@users.noreply.github.com>
Date: Thu, 12 Mar 2026 00:16:06 +0000
Subject: [PATCH 11/11] add per-skill @z3 usage examples to agent readme

---
 Z3-AGENT.md | 161 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 156 insertions(+), 5 deletions(-)

diff --git a/Z3-AGENT.md b/Z3-AGENT.md
index 769b47116..40990153e 100644
--- a/Z3-AGENT.md
+++ b/Z3-AGENT.md
@@ -29,15 +29,166 @@ For code quality skills you also need:
 
 ## Using the agent in Copilot Chat
 
-Mention `@z3` and describe what you want:
+Mention `@z3` and describe what you want in plain language.
+The agent figures out which skill to use, builds the formula if needed,
+runs Z3, and gives you the result.
+
+### solve: check satisfiability
 
 ```
-@z3 is (x + y > 10) satisfiable?
-@z3 prove that x*x >= 0 for all integers
-@z3 run memory-safety checks on the test suite
+@z3 is (x > 0 and y > 0 and x + y < 5) satisfiable over the integers?
 ```
 
-The agent picks the right skill and runs it.
+Expected response: **sat** with a model like `x = 1, y = 1`.
+
+```
+@z3 can x + y = 10 and x - y = 4 both hold at the same time?
+```
+
+Expected response: **sat** with `x = 7, y = 3`.
+
+```
+@z3 is there an integer x where x > 0, x < 0?
+```
+
+Expected response: **unsat** (no such integer exists).
+
+### prove: check if something is always true
+
+```
+@z3 prove that x * x >= 0 for all integers x
+```
+
+Expected response: **valid** (the negation is unsatisfiable, so the property holds).
+
+```
+@z3 is it true that (a + b)^2 >= 0 for all real a and b?
+```
+
+Expected response: **valid**.
+
+```
+@z3 prove that if x > y and y > z then x > z, for integers
+```
+
+Expected response: **valid** (transitivity of >).
+
+### optimize: find the best value
+
+```
+@z3 maximize 3x + 2y where x >= 1, y >= 1, and x + y <= 20
+```
+
+Expected response: **sat** with `x = 19, y = 1` (objective = 59).
+
+```
+@z3 minimize x + y where x >= 5, y >= 3, and x + y >= 10
+```
+
+Expected response: **sat** with `x = 5, y = 5` or similar (objective = 10).
+
+### simplify: reduce an expression
+
+```
+@z3 simplify x + 0 + 1*x
+```
+
+Expected response: `2*x`.
+
+```
+@z3 simplify (a and true) or (a and false)
+```
+
+Expected response: `a`.
+
+### encode: translate a problem to SMT-LIB2
+
+```
+@z3 encode this as SMT-LIB2: find integers x and y where x + y = 10 and x > y
+```
+
+Expected response: the SMT-LIB2 formula:
+```
+(declare-const x Int)
+(declare-const y Int)
+(assert (= (+ x y) 10))
+(assert (> x y))
+(check-sat)
+(get-model)
+```
+
+### explain: interpret Z3 output
+
+```
+@z3 what does this Z3 output mean?
+sat
+(
+  (define-fun x () Int 7)
+  (define-fun y () Int 3)
+)
+```
+
+Expected response: a readable summary like "satisfying assignment: x = 7, y = 3".
+
+```
+@z3 Z3 returned unknown, what does that mean?
+```
+
+Expected response: an explanation of common causes (timeout, incomplete theory, quantifiers).
+
+### benchmark: measure performance
+
+```
+@z3 how fast can Z3 solve (x > 0 and y > 0 and x + y < 100)? run it 5 times
+```
+
+Expected response: timing stats like min/median/max in milliseconds and the result.
+
+### memory-safety: find memory bugs in Z3
+
+```
+@z3 run AddressSanitizer on the Z3 test suite
+```
+
+Expected response: builds Z3 with ASan, runs the tests, reports any findings
+with category, file, and line number. If clean, reports no findings.
+
+```
+@z3 check for undefined behavior in Z3
+```
+
+Expected response: runs UBSan, same format.
+
+```
+@z3 run both sanitizers
+```
+
+Expected response: runs ASan and UBSan, aggregates findings from both.
+
+### static-analysis: find bugs without running the code
+
+```
+@z3 run static analysis on the Z3 source
+```
+
+Expected response: runs Clang Static Analyzer, reports findings grouped by
+category (null dereference, dead store, memory leak, etc.) with file and line.
+
+### multi-skill: the agent chains skills when needed
+
+```
+@z3 prove that for all integers, if x^2 is even then x is even
+```
+
+The agent uses **encode** to formalize and negate the statement, then
+**prove** to check it, then **explain** to present the result.
+
+```
+@z3 full verification pass before the release
+```
+
+The agent runs **memory-safety** (ASan + UBSan) and **static-analysis**
+in parallel, then aggregates and deduplicates findings sorted by severity.
 
 ## Using the scripts directly