3
0
Fork 0
mirror of https://github.com/Z3Prover/z3 synced 2026-05-20 08:59:34 +00:00
z3/specs/theory-nseq-zipt-port.md
2026-03-04 17:37:04 +01:00

657 lines
34 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# theory_nseq — ZIPT-Based String Theory Solver for Z3
| Document Metadata | Details |
| ---------------------- | ---------------------------------- |
| Author(s) | Nikolaj Bjorner |
| Status | In Review (RFC) |
| Team / Owner | Z3 / SMT Strings |
| Created / Last Updated | 2026-03-03 |
## 1. Executive Summary
This spec defines the completion of the ZIPT algorithm port into Z3's c3 branch as `theory_nseq` — a new string/sequence theory solver based on Nielsen-style word equation solving with lazy regex membership and Parikh image reasoning. The foundational data structures (sgraph, snode, seq_plugin, nielsen_graph) are already implemented and tested (38 unit tests). What remains is: (1) the theory solver skeleton and SMT integration, (2) the Nielsen transformation algorithms (modifier chain, graph expansion, iterative deepening, conflict detection), (3) Parikh image / length reasoning, (4) regex membership processing, (5) propagation engine, and (6) model generation. The new solver will coexist with `theory_seq` and be selectable via `smt.string_solver=nseq`.
## 2. Context and Motivation
### 2.1 Current State
Z3's string solver `theory_seq` uses an axiom-driven approach with a 20-phase `final_check_eh` cascade. While production-grade, it struggles with certain classes of word equations and regex membership constraints that the ZIPT algorithm handles more effectively.
The ZIPT algorithm (reference C# implementation at [github.com/CEisenhofer/ZIPT](https://github.com/CEisenhofer/ZIPT/tree/parikh/ZIPT)) uses a fundamentally different approach: Nielsen transformations on a search graph with iterative deepening, lazy regex evaluation, and Parikh image constraints.
**Already implemented on c3 branch** (per [research doc](../../research/docs/2026-03-03-zipt-string-theory-codebase-analysis.md)):
-`sgraph` + `snode` — String representation, metadata, hashing, substitution, derivatives, minterms (`src/ast/euf/euf_sgraph.h/.cpp`, `euf_snode.h`)
-`seq_plugin` — Egraph plugin for associativity, identity, absorption, star merge (`src/ast/euf/euf_seq_plugin.h/.cpp`)
- ✅ Nielsen graph data structures — `str_eq`, `str_mem`, `nielsen_subst`, `nielsen_edge`, `nielsen_node`, `nielsen_graph`, `dep_tracker` (`src/smt/seq/seq_nielsen.h/.cpp`)
- ✅ 38 unit tests (18 sgraph + 7 seq_plugin + 13 nielsen)
-`theory_nseq` skeleton and SMT integration (`src/smt/theory_nseq.h/.cpp`)
- ✅ Parameter integration — `smt.string_solver=nseq` and `smt.nseq.max_depth` (`src/params/smt_params*.{cpp,h,pyg}`, `src/smt/smt_setup.cpp`)
- ✅ Build system — `theory_nseq.cpp`, `nseq_state.cpp`, `nseq_regex.cpp`, `nseq_model.cpp` in `src/smt/CMakeLists.txt`
- ✅ All 13 Nielsen modifiers (`apply_det_modifier`, `apply_const_nielsen`, `apply_var_nielsen`, `apply_eq_split`, `apply_regex_char_split`, `apply_power_epsilon`, `apply_num_cmp`, `apply_const_num_unwinding`, `apply_star_intr`, `apply_gpower_intr`, `apply_regex_var_split`, `apply_power_split`, `apply_var_num_unwinding`)
- ✅ Graph expansion / iterative deepening DFS — `nielsen_graph::solve()` + `search_dfs()` (depth 3, +1/failure, bounded by `smt.nseq.max_depth`)
- ✅ Constraint simplification engine — `nielsen_node::simplify_and_init()`
- ✅ Conflict detection and explanation — `dep_tracker`-based conflict clauses
- ✅ Propagation engine — `theory_nseq::propagate()`, eq/diseq/regex membership wiring
- ✅ Model generation — `nseq_model` (`src/smt/nseq_model.h/.cpp`), wired via `init_model()` / `mk_value()` / `finalize_model()`
- ✅ Regex membership processing — `nseq_regex` (`src/smt/nseq_regex.h/.cpp`)
- ✅ Constraint state management — `nseq_state` (`src/smt/nseq_state.h/.cpp`) with push/pop backtracking
- ✅ Integration unit tests — `tst_nseq_basic` (`src/test/nseq_basic.cpp`), `tst_nseq_regex` (`src/test/nseq_regex.cpp`)
- ✅ Length constraint generation — `nielsen_graph::generate_length_constraints()` (length equalities, power non-negativity, regex bounds)
**Not yet implemented / incomplete:**
- ❌ Integer constraint reasoning via `lp::int_solver` — length constraints are generated but not yet discharged through `lp::int_solver`; currently propagated as theory lemmas via `arith_value`
- ❌ Correct results on Z3's full string regression suite (`C:\z3test\regressions\smt2\string*.smt2`)
### 2.2 The Problem
- **Algorithmic gap**: The core solving algorithms are implemented. The main remaining gap is full `lp::int_solver` integration for integer subproblems.
- **Regression coverage**: The solver has not yet been validated against Z3's full string regression suite.
### 2.3 Reference Documents
- `spec/reference.md` — Points to ZIPT repo (parikh branch), LazyMemberships.pdf, ClemensTableau.pdf
- `spec/plan.md` — Task description
- `spec/plan1.md` — 7-phase implementation plan with file list
- `research/docs/2026-03-03-zipt-string-theory-codebase-analysis.md` — Comprehensive codebase research
## 3. Goals and Non-Goals
### 3.1 Functional Goals
- [x] `theory_nseq` theory solver class implementing `smt::theory` interface
- [x] Selectable via `smt.string_solver=nseq` parameter
- [x] Nielsen transformation-based word equation solving
- [x] Lazy regex membership via Brzozowski derivatives (reusing `sgraph::brzozowski_deriv`)
- [ ] Integer constraint reasoning via `lp::int_solver` (length, power terms, Parikh image)
- [x] Iterative deepening search: initial depth 3, +1 per failure, upper bound via `smt.nseq.max_depth` (0 = infinity)
- [x] Conflict detection with dependency-tracked explanations
- [x] Model generation for satisfiable instances
- [ ] Correct results on Z3's existing string regression suite (`C:\z3test\regressions\smt2\string*.smt2`)
### 3.2 Non-Goals (Out of Scope)
- [ ] Replacing `theory_seq` as default — `theory_nseq` is an alternative, not a replacement
- [ ] Polymorphic sequences — focus on strings (`Seq[Char]`) initially; generic `Seq[T]` can follow later
- [ ] Higher-order sequence ops (`seq.map`, `seq.foldl`) — defer to `theory_seq` or axiomatize minimally
- [ ] Bitvector-to-string conversions (`ubv2s`, `sbv2s`) — defer
- [ ] Integer-to-string / string-to-integer conversions (`str.to_int`, `int.to_str`) — defer
- [ ] Extended string predicates: containment (`str.contains`), prefix/suffix (`str.prefixof`, `str.suffixof`) — defer
- [ ] String ordering (`str.<`, `str.<=`) — defer
- [ ] New Z3 API surface — `theory_nseq` is internal only, selected by parameter
- [ ] Performance parity with `theory_seq` on all benchmarks in Phase 1 — correctness first
## 4. Proposed Solution (High-Level Design)
### 4.1 System Architecture
```
┌─────────────────────────────────────────────────────────┐
│ smt::context │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ theory_arith │ │ theory_nseq │ │ theory_bv │ │
│ └──────────────┘ └──────┬───────┘ └──────────────┘ │
│ │ │
│ ┌────────────────────────┼──────────────────────────┐ │
│ │ theory_nseq internals │ │
│ │ ┌─────────────┐ ┌───┴───────┐ ┌────────────┐ │ │
│ │ │ nseq_regex │ │ propagator│ │ nseq_model │ │ │
│ │ └──────┬──────┘ └───┬───────┘ └────────────┘ │ │
│ │ │ │ │ │
│ │ ┌──────┴─────────────┴────────────────────────┐ │ │
│ │ │ nielsen_graph (search engine) │ │ │
│ │ │ ┌──────────┐ ┌──────────┐ │ │ │
│ │ │ │ modifier │ │ simplify │ │ │ │
│ │ │ │ chain │ │ engine │ │ │ │
│ │ │ └──────────┘ └──────────┘ │ │ │
│ │ └───────────────────┬─────────────────────────┘ │ │
│ │ │ │ │
│ │ ┌───────────────────┴─────────────────────────┐ │ │
│ │ │ sgraph + snode │ │ │
│ │ │ (string representation layer) │ │ │
│ │ └─────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ Shared infrastructure: seq_util, seq_rewriter, │
│ seq_skolem, arith_value, seq_factory │
└──────────────────────────────────────────────────────────┘
```
### 4.2 Architectural Pattern
**Nielsen transformation graph with iterative deepening DFS** — matching ZIPT's architecture:
1. **Internalization**: String terms/atoms are registered, creating theory variables and snode representations.
2. **Propagation**: Equality/disequality merges and boolean assignments update the constraint store.
3. **Final check**: Build a nielsen_graph from active constraints, run iterative deepening search.
4. **Search**: At each node, apply the modifier chain (13 modifiers in priority order) to generate child nodes. Check for conflicts and satisfaction.
5. **Conflict/Satisfaction**: If all leaves are satisfied → FC_DONE. If a conflict is found → generate explanation clause and return FC_CONTINUE. If stuck → increase depth bound and retry.
### 4.3 Key Components
| Component | Responsibility | Location | Reuse |
|-----------|---------------|----------|-------|
| `theory_nseq` | SMT theory interface, internalization, propagation | `src/smt/theory_nseq.h/.cpp` | New |
| `nseq_state` | Constraint store, variable tracking, backtrackable state | `src/smt/nseq_state.h/.cpp` | New |
| Nielsen graph (solver) | Graph expansion, iterative deepening, conflict detection | `src/smt/seq/seq_nielsen.h/.cpp` | Extend existing |
| Nielsen modifiers | 13 transformation rules (per ZIPT modifier chain) | `src/smt/seq/seq_nielsen.h/.cpp` | New (in existing files) |
| `nseq_regex` | Regex membership processing, derivative dispatch | `src/smt/nseq_regex.h/.cpp` | New |
| `nseq_model` | Model construction from solved constraints | `src/smt/nseq_model.h/.cpp` | New |
| Parikh/length | Length reasoning, integer subproblems | `src/smt/seq/seq_nielsen.h/.cpp` | New (in existing) |
| `lp::int_solver` | Integer subproblem solving (replaces ZIPT's PDD subsolver) | `src/math/lp/int_solver.h` | Existing ✅ |
| sgraph | String representation, substitution, derivatives | `src/ast/euf/euf_sgraph.*` | Existing ✅ |
| seq_plugin | Egraph associativity/simplification | `src/ast/euf/euf_seq_plugin.*` | Existing ✅ |
| seq_util | Sort/op declarations and utilities | `src/ast/seq_decl_plugin.*` | Existing ✅ |
| seq_rewriter | Rewrite rules, regex derivatives, nullability | `src/ast/rewriter/seq_rewriter.*` | Existing ✅ |
| seq_skolem | Skolem function introduction | `src/ast/rewriter/seq_skolem.*` | Existing ✅ |
| arith_value | Arithmetic solver queries | `src/smt/smt_arith_value.*` | Existing ✅ |
| seq_factory | Model value factory | `src/model/seq_factory.h` | Existing ✅ |
## 5. Detailed Design
### 5.1 Phase 1: Skeleton & Integration
#### 5.1.1 theory_nseq class (`src/smt/theory_nseq.h/.cpp`)
Implements `smt::theory` with these pure virtual methods:
```cpp
class theory_nseq : public theory {
// Required by smt::theory:
bool internalize_atom(app* atom, bool gate_ctx) override;
bool internalize_term(app* term) override;
void new_eq_eh(theory_var v1, theory_var v2) override;
void new_diseq_eh(theory_var v1, theory_var v2) override;
theory* mk_fresh(context* new_ctx) override;
void display(std::ostream& out) const override;
// Optional overrides:
void assign_eh(bool_var v, bool is_true) override;
bool can_propagate() override;
void propagate() override;
final_check_status final_check_eh() override;
void push_scope_eh() override;
void pop_scope_eh(unsigned num_scopes) override;
void init_model(model_generator& mg) override;
void finalize_model(model_generator& mg) override;
void collect_statistics(statistics& st) const override;
// Members:
seq_util m_seq;
arith_util m_autil;
arith_value m_arith_value;
seq_rewriter m_rewriter;
seq::skolem m_sk;
euf::egraph m_egraph;
euf::sgraph m_sgraph;
nielsen_graph m_nielsen;
// ... state management, constraint store
};
```
**family_id**: Uses `m_seq.get_family_id()` (same as `theory_seq`).
**Internalization pattern** (following `theory_seq`):
- `internalize_term`: For string-sorted terms, create theory variable, register with egraph/sgraph. For `str.++`, `str.len`, `str.in_re`, etc., recursively internalize children.
- `internalize_atom`: For `str.in_re` — create boolean variable. (`str.contains`, `str.prefixof`, `str.suffixof`, `str.<`, `str.<=` are out of scope for `theory_nseq`.)
#### 5.1.2 Parameter integration
**`src/params/smt_params.cpp`** — Add `"nseq"` to `validate_string_solver()`:
```cpp
void smt_params::validate_string_solver(symbol const& s) const {
if (s == "z3str3" || s == "seq" || s == "empty" || s == "auto" || s == "none" || s == "nseq")
return;
throw default_exception("Invalid string solver value. Legal values are z3str3, seq, empty, auto, none, nseq");
}
```
**`src/smt/smt_setup.cpp`** — Add `"nseq"` dispatch:
```cpp
void setup::setup_seq_str(static_features const& st) {
if (m_params.m_string_solver == "nseq") {
setup_nseq();
}
else if (m_params.m_string_solver == "seq") { ... }
// ...
}
void setup::setup_nseq() {
m_context.register_plugin(alloc(smt::theory_nseq, m_context));
}
```
**`src/smt/smt_setup.h`** — Declare `setup_nseq()`.
**`src/params/smt_params_helper.pyg`** — Document `nseq` in the `string_solver` parameter description, and add:
```python
def_param("nseq.max_depth", UINT, 0,
"Maximum Nielsen search depth for theory_nseq (0 = unlimited)")
```
#### 5.1.3 Build system
**`src/smt/CMakeLists.txt`** — Add `theory_nseq.cpp` and new .cpp files.
### 5.2 Phase 2: Constraint Store & State Management
#### 5.2.1 nseq_state (`src/smt/nseq_state.h/.cpp`)
Manages the backtrackable constraint store bridging SMT context to Nielsen graph:
```cpp
class nseq_state {
// Variable tracking
th_union_find m_find; // Union-find for theory variables
obj_map<expr, theory_var> m_var_map; // expr → theory_var
// Active constraints (backtrackable)
scoped_vector<str_eq> m_eqs; // Active word equations
scoped_vector<str_mem> m_mems; // Active regex memberships
svector<std::pair<theory_var, theory_var>> m_diseqs; // Disequalities
svector<bool_var> m_pending_bools; // Pending boolean assignments
// Scope management
void push();
void pop(unsigned n);
// Constraint registration
void add_eq(euf::snode* lhs, euf::snode* rhs, unsigned dep_id);
void add_mem(euf::snode* str, euf::snode* regex, unsigned dep_id);
void add_diseq(theory_var v1, theory_var v2);
};
```
Key design decisions:
- Use `scoped_vector` for backtrackable collections (same pattern as `theory_seq`)
- Map theory variables to snode representations via sgraph
- Dependency IDs correspond to `dep_tracker` bit positions for conflict explanation
### 5.3 Phase 3: Nielsen Transformation Algorithms
This is the core algorithmic phase. All additions go into `src/smt/seq/seq_nielsen.h/.cpp`.
#### 5.3.1 Simplification Engine
Add to `nielsen_node`:
```cpp
// In nielsen_node:
simplify_result simplify_and_init(nielsen_graph& g);
simplify_result simplify_str_eq(nielsen_graph& g, unsigned idx);
simplify_result simplify_str_mem(nielsen_graph& g, unsigned idx);
```
Simplification rules (matching ZIPT's `SimplifyAndInit`):
1. **Trivial equation removal**: If `lhs == rhs` (same snode), remove
2. **Empty propagation**: If `lhs = ε`, then `rhs` must be empty → conflict or propagate
3. **Constant prefix matching**: If both sides start with same character, strip it
4. **Constant mismatch**: If sides start with different characters → conflict (symbol_clash)
5. **Unit propagation**: If `x = c` where c is ground, substitute x→c everywhere
6. **Nullable regex**: If `str = ε` and regex is nullable → satisfied
#### 5.3.2 Modifier Chain
Add 13 modifiers matching ZIPT's priority order. Each modifier returns a list of `nielsen_edge` expansions:
```cpp
// In nielsen_graph:
struct modifier_result {
ptr_vector<nielsen_edge> edges;
bool deterministic; // true if only one branch is valid
};
// Modifiers (in priority order):
modifier_result apply_det_modifier(nielsen_node& n); // 1. Deterministic simplifications
modifier_result apply_power_epsilon(nielsen_node& n); // 2. Power-epsilon: u="" || n=0
modifier_result apply_num_cmp(nielsen_node& n); // 3. Numeric comparisons
modifier_result apply_const_num_unwinding(nielsen_node& n); // 4. Constant power unwinding
modifier_result apply_eq_split(nielsen_node& n); // 5. Equality prefix/suffix split
modifier_result apply_star_intr(nielsen_node& n); // 6. Stabilizer introduction
modifier_result apply_gpower_intr(nielsen_node& n); // 7. Power introduction
modifier_result apply_const_nielsen(nielsen_node& n); // 8. Nielsen vs constant
modifier_result apply_regex_char_split(nielsen_node& n); // 9. Regex character split
modifier_result apply_regex_var_split(nielsen_node& n); // 10. Regex variable split
modifier_result apply_power_split(nielsen_node& n); // 11. Power unwinding
modifier_result apply_var_nielsen(nielsen_node& n); // 12. Variable Nielsen split
modifier_result apply_var_num_unwinding(nielsen_node& n); // 13. Variable power unwinding
```
The modifier chain is applied in order; the first modifier that produces edges is used:
```cpp
modifier_result generate_extensions(nielsen_node& n) {
modifier_result r;
r = apply_det_modifier(n); if (!r.edges.empty()) return r;
r = apply_power_epsilon(n); if (!r.edges.empty()) return r;
// ... through all 13 modifiers
r = apply_var_num_unwinding(n);
return r;
}
```
#### 5.3.3 Graph Expansion & Iterative Deepening
Add to `nielsen_graph`:
```cpp
// Search result
enum class search_result { sat, unsat, unknown };
// Main search entry point
search_result solve();
// Iterative deepening DFS
search_result search_dfs(nielsen_node* node, unsigned depth);
// Expand a single node
search_result expand_node(nielsen_node* node, unsigned depth);
```
**Algorithm** (matching ZIPT's `NielsenGraph.Run()`):
1. Set initial depth bound to **3**
2. Run DFS from root with current depth bound
3. If all branches conflict → UNSAT
4. If a leaf is satisfied (all constraints trivial) → SAT
5. If depth bound hit → increase bound by **1**, restart (iterative deepening)
6. If `smt.nseq.max_depth > 0` and bound exceeds it → give up (`FC_GIVEUP`); if `smt.nseq.max_depth == 0` (default) → no upper limit
#### 5.3.4 Conflict Detection & Explanation
Add to `nielsen_graph`:
```cpp
// Extract conflict explanation as a set of input constraint indices
void explain_conflict(nielsen_node const& conflict_node, svector<unsigned>& deps);
```
The `dep_tracker` bitvectors on each constraint track which original input constraints contributed. On conflict, OR together the dep_trackers of the conflicting constraints to get the minimal explanation.
### 5.4 Phase 4: Integer Constraint Reasoning
Each Nielsen node accumulates integer constraints from all sources:
- **Length constraints**: from word equations `lhs = rhs``|lhs| = |rhs|`
- **Power terms**: `x^n` with symbolic exponent → `n ≥ 0`, `|x^n| = n * |x|`
- **Parikh image**: from regex membership `x ∈ L(r)` → length bounds/modular constraints (e.g. `x ∈ (ab)*``|x| mod 2 = 0`)
All of these are collected together into a single integer constraint system per node and discharged uniformly via Z3's `lp::int_solver` (`src/math/lp/int_solver.h`). This replaces ZIPT's PDD-based integer subsolver. There is no separate treatment per source — they are all integer constraints over the same set of length variables.
```cpp
// In nielsen_graph or a helper:
void generate_integer_constraints(nielsen_node const& node,
vector<expr_ref>& int_constraints);
```
**Length computation from snode**:
- `|ε|` = 0
- `|c|` = 1 (character)
- `|a · b|` = `|a| + |b|`
- `|x^n|` = `n * |x|` (power)
- `|x|` = length variable for x
**Integration**: Use `arith_value` to query current assignments and check feasibility. Propagate derived equalities/inequalities via `theory_nseq::propagate()`.
### 5.5 Phase 5: Regex Membership
#### 5.5.1 nseq_regex (`src/smt/nseq_regex.h/.cpp`)
Handles regex membership constraints using ZIPT's lazy approach:
```cpp
class nseq_regex {
euf::sgraph& m_sg;
seq_rewriter& m_rewriter;
// Process regex membership after character consumption
simplify_result process_str_mem(str_mem& mem, euf::snode* consumed_char);
// Compute derivative and create new membership constraint
str_mem derive(str_mem const& mem, euf::snode* char_node);
// Check if regex is empty (unsatisfiable)
bool is_empty_regex(euf::snode* regex);
// Check for regex cycles (stabilizer introduction)
bool detect_cycle(str_mem const& mem);
// Compute minterms for character splitting
void get_minterms(euf::snode* regex, vector<euf::snode*>& minterms);
};
```
**Lazy evaluation pattern** (per LazyMemberships.pdf):
- Only consume characters from the string when they become concrete (via substitution)
- Use Brzozowski derivatives: `D(c, r)` gives the derivative of regex `r` by character `c`
- Track history (consumed prefix) for cycle detection
- When a cycle is detected, introduce a stabilizer: `remaining ∈ base*`
**Reuse**: `sgraph::brzozowski_deriv()` and `sgraph::compute_minterms()` are already implemented.
### 5.6 Phase 6: Propagation Engine & Integration
#### 5.6.1 theory_nseq::final_check_eh()
The main solving loop:
```cpp
final_check_status theory_nseq::final_check_eh() {
// 1. Collect constraints from SMT context into nielsen_graph
m_nielsen.reset();
populate_nielsen_graph();
// 2. Run Nielsen search
auto result = m_nielsen.solve();
switch (result) {
case search_result::sat:
return FC_DONE;
case search_result::unsat:
// Extract conflict explanation
svector<unsigned> deps;
m_nielsen.explain_conflict(deps);
// Build conflict clause from deps
add_conflict_clause(deps);
return FC_CONTINUE;
case search_result::unknown:
// Depth bound hit or timeout
return FC_GIVEUP;
}
}
```
#### 5.6.2 populate_nielsen_graph()
Convert active SMT constraints to nielsen_graph format:
```cpp
void theory_nseq::populate_nielsen_graph() {
for (auto& eq : m_state.eqs()) {
euf::snode* lhs = m_sgraph.mk(eq.first);
euf::snode* rhs = m_sgraph.mk(eq.second);
m_nielsen.add_str_eq(lhs, rhs);
}
for (auto& mem : m_state.mems()) {
euf::snode* str = m_sgraph.mk(mem.str);
euf::snode* regex = m_sgraph.mk(mem.regex);
m_nielsen.add_str_mem(str, regex);
}
}
```
#### 5.6.3 theory_nseq::propagate()
Propagate consequences from Nielsen solving to the SMT context:
- New equalities (from substitutions) → `ctx.assign_eq()`
- New disequalities → `ctx.assign_diseq()`
- New literals (regex memberships) → `ctx.assign()`
- Length constraints → `ctx.assign()` for arithmetic literals
### 5.7 Phase 7: Model Generation
#### 5.7.1 nseq_model (`src/smt/nseq_model.h/.cpp`)
Build concrete string values from solved Nielsen graph:
```cpp
class nseq_model {
// Extract model assignments from a satisfied nielsen_node
void extract_assignments(nielsen_node const& leaf,
obj_map<expr, expr*>& assignments);
// Generate fresh string values for unconstrained variables
expr_ref mk_fresh_value(sort* s, seq_factory& factory);
// Ensure consistency with regex memberships
bool validate_regex(expr* str_val, expr* regex);
};
```
**Model construction algorithm**:
1. From the satisfied leaf node, collect all substitutions along the path from root
2. Compose substitutions to get final variable assignments
3. For unconstrained variables, use `seq_factory::get_fresh_value()`
4. Register with `model_generator` via `init_model()` / `finalize_model()`
## 6. Alternatives Considered
| Option | Pros | Cons | Decision |
|--------|------|------|----------|
| Extend `theory_seq` with Nielsen | Single solver, shared infrastructure | Massive refactor, high risk to production solver | **Rejected**: Too risky to production |
| Port ZIPT as standalone user propagator | Minimal Z3 changes | Poor integration, no sharing of Z3 infrastructure, slower | **Rejected**: Misses optimization opportunities |
| New `theory_nseq` (selected) | Clean separation, reuses infrastructure, safe coexistence | New code to maintain, some duplication | **Selected**: Best balance of risk and benefit |
| PDD for Parikh (full ZIPT port) | Exact polynomial reasoning | Complex, Z3 has `lp::int_solver` already | **Rejected**: Use `lp::int_solver` instead of porting ZIPT's PDD subsolver |
## 7. Cross-Cutting Concerns
### 7.1 Backtracking / Scope Management
`theory_nseq` must correctly implement `push_scope_eh()` / `pop_scope_eh()`:
- `nseq_state` uses `scoped_vector` (same as `theory_seq`)
- sgraph/egraph have their own `push()`/`pop()` — must be synchronized
- nielsen_graph is rebuilt from scratch on each `final_check_eh()` (no incremental state), so no backtracking needed for the graph itself
### 7.2 Conflict Clause Quality
Conflict explanations must be minimal for effective CDCL learning:
- `dep_tracker` bitvectors track which input constraints are responsible
- On conflict, extract the minimal set of input constraint IDs
- Map these back to Z3 literals for the conflict clause
### 7.3 Interaction with Arithmetic Solver
Length constraints create coupling between `theory_nseq` and the arithmetic solver:
- Use `arith_value` to query current arithmetic assignments (same pattern as `theory_seq`)
- Propagate derived length equalities as theory lemmas
- May need to handle `FC_CONTINUE` loops when arithmetic and string solvers ping-pong
### 7.4 Statistics
`collect_statistics()` should track:
- Number of Nielsen graph nodes explored
- Number of conflicts found
- Number of depth bound increases
- Number of regex derivatives computed
- Number of length constraints generated
## 8. Migration, Rollout, and Testing
### 8.1 Deployment Strategy
- [ ] Phase 1: `theory_nseq` returns `FC_GIVEUP` for all inputs — proves integration works
- [ ] Phase 2: Handle trivial word equations (constant strings, simple variables)
- [ ] Phase 3: Full modifier chain — handle the string regression suite
- [ ] Phase 4: Regex support — handle `re*.smt2` regression tests
- [ ] Phase 5: Performance tuning — compare against `theory_seq` on SMT-COMP benchmarks
### 8.2 Test Plan
#### Unit Tests (C++ tests in `src/test/`)
| Test | File | Scope |
|------|------|-------|
| `TST(nseq_basic)` | `src/test/nseq_basic.cpp` | theory_nseq instantiation, parameter selection, trivial sat/unsat |
| `TST(seq_nielsen)` | `src/test/seq_nielsen.cpp` (existing) | Extend with simplification, modifier, and search tests |
| `TST(euf_sgraph)` | `src/test/euf_sgraph.cpp` (existing) | Already passing; extend if sgraph changes are needed |
#### Regression Tests (SMT2 via z3 command line)
Use `C:\z3test\regressions\smt2\` (the same suite as nightly CI):
1. **String-specific files** with `smt.string_solver=nseq`:
- `string-concat.smt2`, `string-eval.smt2`, `string-itos.smt2`, `string-literals.smt2`, `string-ops.smt2`, `string-simplify.smt2`
- `string3.smt2`, `string6.smt2`, `string7.smt2`, `string9.smt2`, `string11.smt2``string14.smt2`
- `re.smt2`, `re_rewriter.smt2`
2. **Comparison testing**: Run same files with both `smt.string_solver=seq` and `smt.string_solver=nseq`, verify sat/unsat agreement (no soundness bugs).
3. **Full regression suite**: Ensure `smt.string_solver=nseq` doesn't break non-string tests.
#### Targeted SMT2 Tests (New)
Create focused test files in `tests/nseq/`:
- `eq_basic.smt2` — Simple word equations
- `eq_nielsen.smt2` — Equations requiring Nielsen transformations
- `length.smt2` — Length constraints
- `regex.smt2` — Regex membership
- `mixed.smt2` — Combined string + arithmetic
- `unsat.smt2` — Unsatisfiable benchmarks
## 9. Design Decisions (Resolved)
- [x] **Q1: egraph ownership**`theory_nseq` creates its own **private egraph/sgraph**, matching the ZIPT pattern where the string manager is independent. This avoids interference with other theories and keeps the string representation self-contained.
- [x] **Q2: Parikh implementation approach** — Use **Z3's built-in `lp::int_solver`** (`src/math/lp/int_solver.h`) for integer subproblems arising from Parikh/length constraints. Generate linear length constraints from word equations and regex membership, propagate them as theory lemmas via `arith_value`. ZIPT's PDD subsolver is **not ported**`lp::int_solver` directly replaces it.
- [x] **Q3: Nielsen graph persistence** — The Nielsen graph will be **maintained incrementally with backtracking support**. The graph persists across `final_check_eh()` calls with push/pop scope synchronization. This avoids redundant re-exploration of previously visited nodes when the SMT solver backtracks.
- [x] **Q4: Modifier chain completeness** — Implement an **essential subset of 5 modifiers first**: DetModifier, EqSplitModifier, ConstNielsenModifier, VarNielsenModifier, RegexCharSplitModifier. These cover basic word equations and simple regex membership. The remaining 8 modifiers (PowerEpsilon, NumCmp, ConstNumUnwinding, StarIntr, GPowerIntr, RegexVarSplit, PowerSplit, VarNumUnwinding) will be added incrementally in subsequent phases.
- [x] **Q5: Generic sequences**`theory_nseq` handles **all `Seq[T]` types from the start**, not just strings. Nielsen transformations work on any element type. Regex membership is string-specific but the word equation engine is generic.
- [x] **Q6: Higher-order ops****Axiomatize minimally** for `seq.map`, `seq.mapi`, `seq.foldl`, `seq.foldli`. Generate basic axioms (similar to theory_seq's approach) without deep reasoning. This ensures theory_nseq doesn't immediately give up on problems containing these operations.
- [x] **Q7: Disequality handling** — Use **separate tracking like ZIPT**. Port ZIPT's character disequality and string disequality tracking into the nielsen_graph data structures. This gives the modifier chain direct access to disequality information (particularly needed for character minterm computation in RegexCharSplitModifier).
## 10. File Summary
### New Files
| File | Phase | Description |
|------|-------|-------------|
| `src/smt/theory_nseq.h` | 1 | Theory class header |
| `src/smt/theory_nseq.cpp` | 1 | Theory class implementation |
| `src/smt/nseq_state.h` | 2 | Constraint store header |
| `src/smt/nseq_state.cpp` | 2 | Constraint store implementation |
| `src/smt/nseq_regex.h` | 5 | Regex handling header |
| `src/smt/nseq_regex.cpp` | 5 | Regex handling implementation |
| `src/smt/nseq_model.h` | 7 | Model generation header |
| `src/smt/nseq_model.cpp` | 7 | Model generation implementation |
| `src/test/nseq_basic.cpp` | 1 | Unit tests |
### Files to Extend
| File | Phase | Change |
|------|-------|--------|
| `src/smt/seq/seq_nielsen.h` | 3 | Add simplification, modifier chain, search methods |
| `src/smt/seq/seq_nielsen.cpp` | 3 | Implement above methods |
### Files to Modify (Integration)
| File | Phase | Change |
|------|-------|--------|
| `src/smt/smt_setup.h` | 1 | Declare `setup_nseq()` |
| `src/smt/smt_setup.cpp` | 1 | Add `"nseq"` dispatch, implement `setup_nseq()` |
| `src/params/smt_params.cpp` | 1 | Add `"nseq"` to `validate_string_solver()` |
| `src/params/smt_params_helper.pyg` | 1 | Document `nseq` option |
| `src/smt/CMakeLists.txt` | 1 | Add new .cpp files to SOURCES |
| `src/test/main.cpp` | 1 | Register `TST(nseq_basic)` |
| `src/test/CMakeLists.txt` | 1 | Add `nseq_basic.cpp` |
## 11. Implementation Order
1. **Phase 1** — Skeleton (`theory_nseq` class, parameter integration, build system) — compilable stub returning `FC_GIVEUP`
2. **Phase 2** — State management (`nseq_state`) — constraint store with backtracking
3. **Phase 3** — Nielsen algorithms (simplification, modifiers, search, conflict) — core solver
4. **Phase 4** — Integer constraint reasoning (`lp::int_solver`) — length, power, Parikh
5. **Phase 5** — Regex membership (`nseq_regex`) — lazy derivatives
6. **Phase 6** — Propagation wiring — connect all components in `final_check_eh`
7. **Phase 7** — Model generation (`nseq_model`) — satisfying assignments