mirror of https://github.com/Z3Prover/z3 synced 2026-03-14 17:19:59 +00:00

Add action/expectation/result structure to all skill definitions

Each step in every SKILL.md now carries labeled Action, Expectation,
and Result blocks so the agent can mechanically execute, verify, and
branch at each stage. Format chosen after comparing three variants
(indented blocks, inline keywords, tables) on a prove-validity
simulation; indented blocks scored highest on routing completeness
and checkability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Angelica Moreira 2026-03-11 19:51:59 +00:00
parent d349b93d1d
commit 9d674404c8
10 changed files with 364 additions and 48 deletions


@@ -7,16 +7,37 @@ Given an SMT-LIB2 formula or file, run Z3 with statistics enabled and report per
# Step 1: Run Z3 with statistics
Action:
Invoke benchmark.py with the formula or file. Use `--runs N` for
repeated timing.
Expectation:
The script invokes `z3 -st`, parses the statistics block, and prints
a performance summary. A run entry is logged to z3agent.db.
Result:
Timing and statistics are displayed. Proceed to Step 2 to interpret.
```bash
python3 scripts/benchmark.py --file problem.smt2
python3 scripts/benchmark.py --file problem.smt2 --runs 5
python3 scripts/benchmark.py --formula "(declare-const x Int)..." --debug
```
The script invokes `z3 -st` and parses the `:key value` statistics block.
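For intuition, the `:key value` parsing can be sketched in a few lines of Python (an illustration only, not benchmark.py's actual implementation; `parse_z3_stats` is a hypothetical name):

```python
import re

def parse_z3_stats(output: str) -> dict:
    """Parse a `z3 -st` statistics block of the form (:key value :key value ...)."""
    stats = {}
    for key, value in re.findall(r":([\w.-]+)\s+([\d.]+)", output):
        # Counters stay integers; timings and memory become floats.
        stats[key] = float(value) if "." in value else int(value)
    return stats

sample = "(:conflicts 42 :decisions 120 :memory 17.23 :time 0.01)"
print(parse_z3_stats(sample))
```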
# Step 2: Interpret the output
Action:
Review wall-clock time, memory usage, conflict counts, and per-theory
breakdowns.
Expectation:
A complete performance profile including min/median/max timing when
multiple runs are requested.
Result:
If performance is acceptable, no action needed.
If slow, try **simplify** to reduce the formula or adjust tactic strategies.
The output includes:
- wall-clock time (ms)
@@ -29,7 +50,17 @@ With `--runs N`, the script runs Z3 N times and reports min/median/max timing.
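The min/median/max aggregation that `--runs N` reports can be sketched as follows (`summarize_runs` is a hypothetical helper, not part of benchmark.py):

```python
import statistics

def summarize_runs(timings_ms):
    """Reduce repeated-run wall-clock timings to min/median/max."""
    return {
        "min_ms": min(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
    }

print(summarize_runs([120, 95, 130, 110, 105]))
```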
# Step 3: Compare over time
Action:
Query past benchmark runs from z3agent.db to detect regressions or
improvements.
Expectation:
Historical run data is available for comparison, ordered by recency.
Result:
If performance regressed, investigate recent formula or tactic changes.
If improved, record the successful configuration.
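The kind of query the commands below perform can be sketched against an in-memory database (the schema here is a simplified assumption, not the real z3agent.db layout):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE runs (run_id INTEGER PRIMARY KEY, skill TEXT, elapsed_ms INTEGER)")
con.executemany("INSERT INTO runs (skill, elapsed_ms) VALUES (?, ?)",
                [("benchmark", 120), ("benchmark", 95), ("solve", 40)])

# Most recent benchmark runs first, mirroring `runs --skill benchmark`.
rows = con.execute(
    "SELECT run_id, elapsed_ms FROM runs WHERE skill = ? "
    "ORDER BY run_id DESC LIMIT 20", ("benchmark",)).fetchall()
print(rows)
```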
```bash
python3 ../../shared/z3db.py runs --skill benchmark --last 20
python3 ../../shared/z3db.py query "SELECT smtlib2, result, stats FROM formulas WHERE run_id IN (SELECT run_id FROM runs WHERE skill='benchmark') ORDER BY run_id DESC LIMIT 5"
```


@@ -7,6 +7,20 @@ Given a strategy and count, generate SMT-LIB2 formulas targeting Z3 internals an
# Step 1: Choose a strategy and run
Action:
Select a generation strategy and invoke the script with the desired
count and seed.
Expectation:
The script generates SMT-LIB2 formulas according to the chosen
strategy, runs each through Z3, and records results to z3agent.db.
Result:
On completion: a summary is printed with formula count, anomaly count,
and elapsed time. Proceed to Step 2.
On early exit: verify the Z3 binary is accessible and review timeout
settings.
```bash
python3 scripts/deeptest.py --strategy random --count 100 --seed 42
python3 scripts/deeptest.py --strategy metamorphic --seed-file base.smt2 --count 50
```
@@ -23,7 +37,19 @@ Available strategies:
# Step 2: Interpret the output
Action:
Review the summary printed after the run completes.
Expectation:
The summary shows strategy, seed, formula count, anomaly count, and
elapsed time.
Result:
On zero anomalies: Z3 handled all generated formulas without issue.
On nonzero anomalies: crashes, assertion failures, solver errors, or
result disagreements were detected. Proceed to Step 3 for details.
Example summary:
```
strategy: random
@@ -33,21 +59,38 @@ anomalies: 2
elapsed: 4500ms
```
A nonzero anomaly count means the run detected crashes (nonzero exit code), assertion failures in stderr, solver errors, or result disagreements between a base formula and its metamorphic variants.
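The anomaly categories above can be illustrated with a small classifier (a sketch under assumed message formats; deeptest.py's real detection logic may differ):

```python
def classify(exit_code, stderr, base=None, variant=None):
    """Return an anomaly category for one Z3 invocation, or None if clean."""
    if exit_code != 0:
        return "crash"
    if "ASSERTION VIOLATION" in stderr:
        return "assertion-failure"
    if "error" in stderr.lower():
        return "solver-error"
    if base is not None and variant is not None and base != variant:
        return "result-disagreement"  # metamorphic variants must agree
    return None

print(classify(0, "", base="sat", variant="unsat"))
```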
# Step 3: Inspect findings
Action:
Query z3agent.db for detailed finding records from the run.
Expectation:
Each finding includes category, severity, message, formula index, exit
code, and a stderr excerpt.
Result:
Use findings to identify reproducible failure patterns and prioritize
fixes by severity. If a finding appears nondeterministic, proceed to
Step 4 with the same seed to confirm.
```bash
python3 ../../shared/z3db.py query "SELECT category, severity, message FROM findings WHERE run_id IN (SELECT run_id FROM runs WHERE skill='deeptest') ORDER BY finding_id DESC LIMIT 20"
```
# Step 4: Reproduce
Action:
Re-run the script with the same seed to reproduce the exact sequence
of generated formulas.
Expectation:
Identical formulas are generated, producing the same anomalies if the
underlying bug persists.
Result:
On same anomalies: bug confirmed and suitable for a regression test.
On zero anomalies: the issue may be nondeterministic or already fixed;
investigate further before closing.
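Why the same seed yields the same formulas can be shown with a toy seeded generator (an illustration of the principle, not deeptest.py's generator):

```python
import random

def generate(seed, count):
    """Toy stand-in for seeded formula generation: same seed, same formulas."""
    rng = random.Random(seed)  # local generator; global RNG state is untouched
    return [f"(assert (> x {rng.randint(0, 100)}))" for _ in range(count)]

print(generate(42, 3) == generate(42, 3))  # runs with equal seeds are identical
```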
```bash
python3 scripts/deeptest.py --strategy random --count 100 --seed 42
```


@@ -7,7 +7,17 @@ Given a problem description (natural language, pseudocode, or a partial formulat
# Step 1: Identify the problem class
Common encodings:
Action:
Determine the SMT theory and variable sorts required by the problem
description.
Expectation:
A clear mapping from the problem to one of the supported theories
(LIA, LRA, QF_BV, etc.).
Result:
If the theory is identified, proceed to Step 2. If the problem spans
multiple theories, select the appropriate combined logic.
| Problem class | Theory | Typical sorts |
|---------------|--------|---------------|
@@ -21,17 +31,35 @@ Common encodings:
# Step 2: Generate the encoding
Action:
Invoke encode.py with the problem description and desired output format.
Expectation:
The script produces a complete SMT-LIB2 file or Z3 Python script with
all declarations, constraints, and check-sat commands.
Result:
For `smtlib2` format: pass the output to **solve**.
For `python` format: execute the script directly.
Proceed to Step 3 for validation.
```bash
python3 scripts/encode.py --problem "Find integers x, y such that x^2 + y^2 = 25 and x > 0" --format smtlib2
python3 scripts/encode.py --problem "Schedule 4 tasks on 2 machines minimizing makespan" --format python
```
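For intuition, the smtlib2 output for the first example might look like the following hand-built sketch (`encode_sum_of_squares` is hypothetical; the real encode.py output will differ in detail):

```python
def encode_sum_of_squares(target):
    """Toy encoder for `x^2 + y^2 = target` with `x > 0`."""
    return "\n".join([
        "(declare-const x Int)",
        "(declare-const y Int)",
        f"(assert (= (+ (* x x) (* y y)) {target}))",
        "(assert (> x 0))",
        "(check-sat)",
        "(get-model)",
    ])

print(encode_sum_of_squares(25))
```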
# Step 3: Validate the encoding
Action:
The script runs a syntax check by piping the output through `z3 -in`
in parse-only mode.
Expectation:
No parse errors. If errors occur, the offending line is reported.
Result:
On success: the encoding is ready for **solve**, **prove**, or **optimize**.
On parse error: fix the reported line and re-run.
# Parameters


@@ -7,6 +7,17 @@ Given raw Z3 output (from the **solve**, **prove**, **optimize**, or **benchmark
# Step 1: Identify the output type
Action:
Determine the category of Z3 output to explain: model, core,
statistics, error, or proof.
Expectation:
The output type maps to one of the recognized formats in the table below.
Result:
If the type is ambiguous, use `--type auto` and let the script detect it.
Proceed to Step 2.
| Output contains | Explanation type |
|----------------|-----------------|
| `(define-fun ...)` blocks | model explanation |
@@ -17,16 +28,36 @@ Given raw Z3 output (from the **solve**, **prove**, **optimize**, or **benchmark
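The `--type auto` routing can be pictured as a marker check over the raw text. Only the `(define-fun ...)` row is taken from the table above; the other markers here are assumptions for illustration:

```python
def detect_output_type(text):
    """Guess which explanation type a chunk of raw Z3 output needs."""
    if "(define-fun" in text:
        return "model"
    if "(error" in text:
        return "error"
    if ":conflicts" in text or ":memory" in text:
        return "statistics"
    return "unknown"

print(detect_output_type("sat\n(define-fun x () Int 5)"))
```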
# Step 2: Run the explainer
Action:
Invoke explain.py with the output file or stdin.
Expectation:
The script auto-detects the output type and produces a structured
plain-language summary.
Result:
A formatted explanation is printed. If detection fails, re-run with
an explicit `--type` flag.
```bash
python3 scripts/explain.py --file output.txt
python3 scripts/explain.py --stdin < output.txt
python3 scripts/explain.py --file output.txt --debug
```
# Step 3: Interpret the explanation
Action:
Review the structured explanation for accuracy and completeness.
Expectation:
Models list each variable with its value and sort. Cores list
conflicting assertions. Statistics show time and memory breakdowns.
Result:
Use the explanation to answer the user query or to guide the next
skill invocation.
For models:
- Each variable is listed with its value and sort
- Array and function interpretations are expanded


@@ -7,7 +7,22 @@ Build Z3 with compiler-based sanitizer instrumentation, execute the test suite,
# Step 1: Configure and build
Action:
Invoke the script with the desired sanitizer flag. The script calls
cmake with the appropriate `-fsanitize` flags and builds the `test-z3`
target. Each sanitizer uses a separate build directory to avoid flag
conflicts.
Expectation:
cmake configures successfully and make compiles the instrumented binary.
If a prior build exists with matching flags, only incremental compilation
runs.
Result:
On success: an instrumented `test-z3` binary is ready in the build
directory. Proceed to Step 2.
On failure: verify compiler support for the requested sanitizer flags
and review cmake output.
```bash
python3 scripts/memory_safety.py --sanitizer asan
```
@@ -22,7 +37,21 @@ python3 scripts/memory_safety.py --sanitizer asan --skip-build --build-dir build
# Step 2: Run and collect
Action:
Execute the instrumented test binary with `halt_on_error=0` so the
sanitizer reports all violations rather than aborting on the first.
Expectation:
The script parses `ERROR: AddressSanitizer`, `runtime error:`, and
`ERROR: LeakSanitizer` patterns from the combined output, extracts
source locations where available, and deduplicates by category, file,
and line.
Result:
On `clean`: no violations detected.
On `findings`: one or more violations found, each printed with severity,
category, message, and source location.
On `timeout`: test suite did not finish; increase timeout or investigate.
On `error`: build or execution failed before sanitizer output.
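The deduplication by (category, file, line) can be sketched as follows (the regex and report format are simplified assumptions, not memory_safety.py's actual parser):

```python
import re

REPORT = re.compile(r"ERROR: (\w+Sanitizer): ([\w-]+).*?([\w./]+):(\d+)")

def dedup(output):
    """Keep one finding per (category, file, line) key."""
    seen, unique = set(), []
    for m in REPORT.finditer(output):
        sanitizer, category, path, line = m.groups()
        key = (category, path, line)
        if key not in seen:  # first occurrence wins
            seen.add(key)
            unique.append(f"{sanitizer}: {category} at {path}:{line}")
    return unique

log = ("ERROR: AddressSanitizer: heap-use-after-free on src/ast/ast.cpp:42\n"
       "ERROR: AddressSanitizer: heap-use-after-free on src/ast/ast.cpp:42\n"
       "ERROR: LeakSanitizer: detected-leaks in src/smt/theory.cpp:7\n")
print(dedup(log))
```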
```bash
python3 scripts/memory_safety.py --sanitizer asan --timeout 900 --debug
```
@@ -30,10 +59,20 @@ python3 scripts/memory_safety.py --sanitizer asan --timeout 900 --debug
# Step 3: Interpret results
Action:
Review printed findings and query z3agent.db for historical comparison.
Expectation:
Each finding includes severity, category, message, and source location.
The database query returns prior runs for trend analysis.
Result:
On `clean`: no action required; proceed.
On `findings`: triage by severity and category. Compare against prior
runs to distinguish new regressions from known issues.
On `timeout`: increase the deadline or investigate a possible infinite
loop.
On `error`: inspect build logs before re-running.
Query past runs:
```bash
```


@@ -7,7 +7,17 @@ Given a set of constraints and an objective function, find the optimal value. Z3
# Step 1: Formulate the problem
Action:
Write constraints and an objective using `(minimize ...)` or
`(maximize ...)` directives, followed by `(check-sat)` and `(get-model)`.
Expectation:
A valid SMT-LIB2 formula with at least one optimization directive and
all variables declared.
Result:
If the formula is well-formed, proceed to Step 2. For multi-objective
problems, list directives in priority order for lexicographic optimization.
Example: minimize `x + y` subject to `x >= 1`, `y >= 2`, `x + y <= 10`:
```smtlib
```
@@ -23,6 +33,17 @@ Example: minimize `x + y` subject to `x >= 1`, `y >= 2`, `x + y <= 10`:
# Step 2: Run the optimizer
Action:
Invoke optimize.py with the formula or file path.
Expectation:
The script prints `sat` with the optimal assignment, `unsat`, `unknown`,
or `timeout`. A run entry is logged to z3agent.db.
Result:
On `sat`: proceed to Step 3 to read the optimal values.
On `unsat` or `timeout`: check constraints for contradictions or simplify.
```bash
python3 scripts/optimize.py --file scheduling.smt2
python3 scripts/optimize.py --formula "<inline smt-lib2>" --debug
```
@@ -30,11 +51,17 @@ python3 scripts/optimize.py --formula "<inline smt-lib2>" --debug
# Step 3: Interpret the output
Action:
Parse the objective value and the satisfying assignment from the output.
Expectation:
`sat` with a model containing the optimal value, `unsat` indicating
infeasibility, or `unknown`/`timeout`.
Result:
On `sat`: report the optimal value and assignment.
On `unsat`: the constraints are contradictory; no feasible solution exists.
On `unknown`/`timeout`: Z3 could not determine optimality; relax
constraints or try **simplify**.
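Reading values out of the raw optimizer output can be sketched with a toy parser (`read_model_value` is hypothetical, and it only handles non-negative Int values in the assumed model layout):

```python
import re

def read_model_value(output, name):
    """Extract the status and one Int value from optimizer output."""
    status = output.splitlines()[0].strip()
    if status != "sat":
        return status, None
    m = re.search(rf"\(define-fun {name} \(\) Int (\d+)\)", output)
    return status, int(m.group(1)) if m else None

out = "sat\n((define-fun x () Int 1)\n (define-fun y () Int 2))"
print(read_model_value(out, "y"))
```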
# Parameters


@@ -7,7 +7,17 @@ Given a conjecture (an SMT-LIB2 assertion or a natural language claim), determin
# Step 1: Prepare the negated formula
Action:
Wrap the conjecture in `(assert (not ...))` and append
`(check-sat)(get-model)`.
Expectation:
A complete SMT-LIB2 formula that negates the original conjecture with
all variables declared.
Result:
If the negation is well-formed, proceed to Step 2.
If the conjecture is natural language, run **encode** first.
Example: to prove that `(> x 3)` implies `(> x 1)`:
```smtlib
```
@@ -19,6 +29,18 @@ Example: to prove that `(> x 3)` implies `(> x 1)`:
# Step 2: Run the prover
Action:
Invoke prove.py with the conjecture and variable declarations.
Expectation:
The script prints `valid`, `invalid` (with counterexample), `unknown`,
or `timeout`. A run entry is logged to z3agent.db.
Result:
On `valid`: proceed to **explain** if the user needs a summary.
On `invalid`: report the counterexample directly.
On `unknown`/`timeout`: try **simplify** first, or increase the timeout.
```bash
python3 scripts/prove.py --conjecture "(=> (> x 3) (> x 1))" --vars "x:Int"
```
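Assembling the refutation query from Step 1 can be sketched as follows (`build_query` is a hypothetical helper; the `--vars` syntax shown is taken from the example above):

```python
def build_query(conjecture, vars_spec):
    """Wrap a conjecture in (assert (not ...)) with its declarations.

    vars_spec mirrors the --vars form, e.g. "x:Int" or "x:Int,y:Real".
    """
    decls = [f"(declare-const {name} {sort})"
             for name, sort in (v.split(":") for v in vars_spec.split(","))]
    return "\n".join(decls + [f"(assert (not {conjecture}))",
                              "(check-sat)", "(get-model)"])

print(build_query("(=> (> x 3) (> x 1))", "x:Int"))
```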
@@ -35,9 +57,16 @@ python3 scripts/prove.py --conjecture "(=> (> x 3) (> x 1))" --vars "x:Int" --de
# Step 3: Interpret the output
Action:
Read the prover output to determine the validity of the conjecture.
Expectation:
One of `valid`, `invalid` (with counterexample), `unknown`, or `timeout`.
Result:
On `valid`: the negation was unsat, so the conjecture holds for all inputs.
On `invalid`: the negation was sat; the model is a concrete assignment
where the conjecture fails.
On `unknown`/`timeout`: Z3 could not decide; the conjecture may require
auxiliary lemmas or induction.
# Parameters


@@ -7,7 +7,16 @@ Given a formula, apply a sequence of Z3 tactics to produce an equivalent but sim
# Step 1: Choose tactics
Z3 provides dozens of tactics. Common ones:
Action:
Select a tactic chain from the available Z3 tactics based on the
formula's theory.
Expectation:
A comma-separated list of tactic names suitable for the formula domain.
Result:
If unsure, use the default chain: `simplify,propagate-values,ctx-simplify`.
For bitvector formulas, add `bit-blast`. Proceed to Step 2.
| Tactic | What it does |
|--------|-------------|
@@ -22,6 +31,17 @@ Z3 provides dozens of tactics. Common ones:
# Step 2: Run simplification
Action:
Invoke simplify.py with the formula and optional tactic chain.
Expectation:
The script applies each tactic in sequence and prints the simplified
formula. A run entry is logged to z3agent.db.
Result:
If the output is simpler, pass it to **solve** or **prove**.
If unchanged, try a different tactic chain.
```bash
python3 scripts/simplify.py --formula "(assert (and (> x 0) (> x 0)))" --vars "x:Int"
python3 scripts/simplify.py --file formula.smt2 --tactics "simplify,propagate-values,ctx-simplify"
```
@@ -32,7 +52,15 @@ Without `--tactics`, the script applies the default chain: `simplify`, `propagat
# Step 3: Interpret the output
Action:
Read the simplified formula output in SMT-LIB2 syntax.
Expectation:
One or more `(assert ...)` blocks representing equivalent subgoals.
Result:
A smaller formula indicates successful reduction. Pass the result to
**solve**, **prove**, or **optimize** as needed.
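The duplicated-assertion example from Step 2 shows the simplest kind of reduction. As a toy illustration (not the Z3 `simplify` tactic itself), duplicate top-level assertions can be collapsed like this:

```python
def dedupe_asserts(formula):
    """Drop repeated top-level (assert ...) lines from an SMT-LIB2 script."""
    seen, kept = set(), []
    for line in formula.splitlines():
        if line.startswith("(assert") and line in seen:
            continue  # a repeated assertion adds no constraint
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)

raw = "(assert (> x 0))\n(assert (> x 0))\n(check-sat)"
print(dedupe_asserts(raw))
```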
# Parameters


@@ -7,12 +7,31 @@ Given an SMT-LIB2 formula (or a set of constraints described in natural language
# Step 1: Prepare the formula
Action:
Convert the input to valid SMT-LIB2. If the input is natural language,
use the **encode** skill first.
Expectation:
A syntactically valid SMT-LIB2 formula ending with `(check-sat)`,
followed by `(get-model)` for satisfiable queries or `(get-unsat-core)`
when named assertions are used.
Result:
If valid SMT-LIB2 is ready, proceed to Step 2.
If encoding is needed, run **encode** first and return here.
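The completion check in this step can be sketched as follows (`finalize` is a hypothetical helper, not part of solve.py):

```python
def finalize(formula, want_model=True):
    """Ensure the formula ends with (check-sat) and, optionally, (get-model)."""
    if "(check-sat)" not in formula:
        formula += "(check-sat)"
    if want_model and "(get-model)" not in formula:
        formula += "(get-model)"
    return formula

print(finalize("(declare-const x Int)(assert (> x 0))"))
```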
# Step 2: Run Z3
Action:
Invoke solve.py with the formula string or file path.
Expectation:
The script pipes the formula to `z3 -in`, logs the run to
`.z3-agent/z3agent.db`, and prints the result.
Result:
Output is one of `sat`, `unsat`, `unknown`, or `timeout`.
Proceed to Step 3 to interpret.
```bash
python3 scripts/solve.py --formula "(declare-const x Int)(assert (> x 0))(check-sat)(get-model)"
```
@@ -27,14 +46,20 @@ With debug tracing:
```bash
python3 scripts/solve.py --file problem.smt2 --debug
```
The script pipes the formula to `z3 -in` via subprocess (no shell expansion), logs the run to `.z3-agent/z3agent.db`, and prints the result.
# Step 3: Interpret the output
Action:
Parse the Z3 output to determine satisfiability and extract any model
or unsat core.
Expectation:
`sat` with a model, `unsat` optionally with a core, `unknown`, or
`timeout`.
Result:
On `sat`: report the model; it assigns concrete values to each declared
constant.
On `unsat`: report the conflicting named assertions if `(get-unsat-core)`
was used.
On `unknown`: Z3 could not decide within the timeout; increase it or
simplify the formula.
On `timeout`: the process was killed after the deadline; try the
**simplify** skill to reduce complexity.
# Parameters


@@ -7,28 +7,63 @@ Run the Clang Static Analyzer over a CMake build of Z3, parse the resulting plis
# Step 1: Run the analysis
Action:
Invoke the script pointing at the CMake build directory. The script
runs `scan-build cmake ..` followed by `scan-build make` and writes
checker output to the output directory.
Expectation:
scan-build completes within the timeout, producing plist diagnostic
files in the output directory (defaults to a `scan-results` subdirectory
of the build directory).
Result:
On success: diagnostics are parsed and findings are printed. Proceed to
Step 2.
On failure: verify that clang and scan-build are installed and that the
build directory contains a valid CMake configuration.
```bash
python3 scripts/static_analysis.py --build-dir build
python3 scripts/static_analysis.py --build-dir build --output-dir /tmp/sa-results --debug
python3 scripts/static_analysis.py --build-dir build --timeout 1800
```
# Step 2: Interpret the output
Action:
Review the printed findings and the summary table grouped by category.
Expectation:
Each finding shows its source location, category, and description.
The summary table ranks categories by frequency for quick triage.
Result:
On zero findings: the codebase passes all enabled static checks.
On findings: prioritize by category frequency and severity. Address
null dereferences and use-after-free classes first.
Example output:
```
[Dead store] src/ast/ast.cpp:142: Value stored to 'result' is never read
[Null dereference] src/smt/theory_lra.cpp:87: Access to field 'next' results in a dereference of a null pointer
```
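Findings in the format above can be tallied per category, mirroring the summary table the script prints (a sketch against the example line format; `summarize` is a hypothetical name):

```python
import re
from collections import Counter

FINDING = re.compile(r"\[([^\]]+)\] ([^:]+):(\d+): (.+)")

def summarize(report):
    """Count findings per category for quick triage."""
    return Counter(m.group(1) for m in FINDING.finditer(report))

report = ("[Dead store] src/ast/ast.cpp:142: Value stored to 'result' is never read\n"
          "[Dead store] src/util/vector.h:9: Value stored to 'n' is never read\n"
          "[Null dereference] src/smt/theory_lra.cpp:87: null pointer dereference\n")
print(summarize(report).most_common())
```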
# Step 3: Review historical findings
Action:
Query z3agent.db to compare current results against prior analysis
runs.
Expectation:
Queries return category counts and run history, enabling regression
detection across commits.
Result:
On stable or decreasing counts: no regressions introduced.
On increased counts: cross-reference new findings with recent commits
to identify the responsible change.
```bash
python3 ../../shared/z3db.py query "SELECT category, COUNT(*) as cnt FROM findings WHERE run_id IN (SELECT run_id FROM runs WHERE skill='static-analysis') GROUP BY category ORDER BY cnt DESC"
```