# theory_nseq — ZIPT-Based String Theory Solver for Z3 | Document Metadata | Details | | ---------------------- | ---------------------------------- | | Author(s) | Nikolaj Bjorner | | Status | In Review (RFC) | | Team / Owner | Z3 / SMT Strings | | Created / Last Updated | 2026-03-03 | ## 1. Executive Summary This spec defines the completion of the ZIPT algorithm port into Z3's c3 branch as `theory_nseq` — a new string/sequence theory solver based on Nielsen-style word equation solving with lazy regex membership and Parikh image reasoning. The foundational data structures (sgraph, snode, seq_plugin, nielsen_graph) are already implemented and tested (38 unit tests). What remains is: (1) the theory solver skeleton and SMT integration, (2) the Nielsen transformation algorithms (modifier chain, graph expansion, iterative deepening, conflict detection), (3) Parikh image / length reasoning, (4) regex membership processing, (5) propagation engine, and (6) model generation. The new solver will coexist with `theory_seq` and be selectable via `smt.string_solver=nseq`. ## 2. Context and Motivation ### 2.1 Current State Z3's string solver `theory_seq` uses an axiom-driven approach with a 20-phase `final_check_eh` cascade. While production-grade, it struggles with certain classes of word equations and regex membership constraints that the ZIPT algorithm handles more effectively. The ZIPT algorithm (reference C# implementation at [github.com/CEisenhofer/ZIPT](https://github.com/CEisenhofer/ZIPT/tree/parikh/ZIPT)) uses a fundamentally different approach: Nielsen transformations on a search graph with iterative deepening, lazy regex evaluation, and Parikh image constraints. **Already implemented on c3 branch** (per [research doc](../../research/docs/2026-03-03-zipt-string-theory-codebase-analysis.md)): - ✅ `sgraph` + `snode` — String representation, metadata, hashing, substitution, derivatives, minterms (`src/ast/euf/euf_sgraph.h/.cpp`, `euf_snode.h`) - ✅ `seq_plugin` — Egraph plugin for associativity, identity, absorption, star merge (`src/ast/euf/euf_seq_plugin.h/.cpp`) - ✅ Nielsen graph data structures — `str_eq`, `str_mem`, `nielsen_subst`, `nielsen_edge`, `nielsen_node`, `nielsen_graph`, `dep_tracker` (`src/smt/seq/seq_nielsen.h/.cpp`) - ✅ 38 unit tests (18 sgraph + 7 seq_plugin + 13 nielsen) - ✅ `theory_nseq` skeleton and SMT integration (`src/smt/theory_nseq.h/.cpp`) - ✅ Parameter integration — `smt.string_solver=nseq` and `smt.nseq.max_depth` (`src/params/smt_params*.{cpp,h,pyg}`, `src/smt/smt_setup.cpp`) - ✅ Build system — `theory_nseq.cpp`, `nseq_state.cpp`, `nseq_regex.cpp`, `nseq_model.cpp` in `src/smt/CMakeLists.txt` - ✅ All 13 Nielsen modifiers (`apply_det_modifier`, `apply_const_nielsen`, `apply_var_nielsen`, `apply_eq_split`, `apply_regex_char_split`, `apply_power_epsilon`, `apply_num_cmp`, `apply_const_num_unwinding`, `apply_star_intr`, `apply_gpower_intr`, `apply_regex_var_split`, `apply_power_split`, `apply_var_num_unwinding`) - ✅ Graph expansion / iterative deepening DFS — `nielsen_graph::solve()` + `search_dfs()` (depth 3, +1/failure, bounded by `smt.nseq.max_depth`) - ✅ Constraint simplification engine — `nielsen_node::simplify_and_init()` - ✅ Conflict detection and explanation — `dep_tracker`-based conflict clauses - ✅ Propagation engine — `theory_nseq::propagate()`, eq/diseq/regex membership wiring - ✅ Model generation — `nseq_model` (`src/smt/nseq_model.h/.cpp`), wired via `init_model()` / `mk_value()` / `finalize_model()` - ✅ Regex membership processing — `nseq_regex` (`src/smt/nseq_regex.h/.cpp`) - ✅ Constraint state management — `nseq_state` (`src/smt/nseq_state.h/.cpp`) with push/pop backtracking - ✅ Integration unit tests — `tst_nseq_basic` (`src/test/nseq_basic.cpp`), `tst_nseq_regex` (`src/test/nseq_regex.cpp`) - ✅ Length constraint generation — `nielsen_graph::generate_length_constraints()` (length equalities, power non-negativity, regex bounds) **Not yet implemented / incomplete:** - ❌ Integer constraint reasoning via `lp::int_solver` — length constraints are generated but not yet discharged through `lp::int_solver`; currently propagated as theory lemmas via `arith_value` - ❌ Correct results on Z3's full string regression suite (`C:\z3test\regressions\smt2\string*.smt2`) ### 2.2 The Problem - **Algorithmic gap**: The core solving algorithms are implemented. The main remaining gap is full `lp::int_solver` integration for integer subproblems. - **Regression coverage**: The solver has not yet been validated against Z3's full string regression suite. ### 2.3 Reference Documents - `spec/reference.md` — Points to ZIPT repo (parikh branch), LazyMemberships.pdf, ClemensTableau.pdf - `spec/plan.md` — Task description - `spec/plan1.md` — 7-phase implementation plan with file list - `research/docs/2026-03-03-zipt-string-theory-codebase-analysis.md` — Comprehensive codebase research ## 3. Goals and Non-Goals ### 3.1 Functional Goals - [x] `theory_nseq` theory solver class implementing `smt::theory` interface - [x] Selectable via `smt.string_solver=nseq` parameter - [x] Nielsen transformation-based word equation solving - [x] Lazy regex membership via Brzozowski derivatives (reusing `sgraph::brzozowski_deriv`) - [ ] Integer constraint reasoning via `lp::int_solver` (length, power terms, Parikh image) - [x] Iterative deepening search: initial depth 3, +1 per failure, upper bound via `smt.nseq.max_depth` (0 = infinity) - [x] Conflict detection with dependency-tracked explanations - [x] Model generation for satisfiable instances - [ ] Correct results on Z3's existing string regression suite (`C:\z3test\regressions\smt2\string*.smt2`) ### 3.2 Non-Goals (Out of Scope) - [ ] Replacing `theory_seq` as default — `theory_nseq` is an alternative, not a replacement - [ ] Polymorphic sequences — focus on strings (`Seq[Char]`) initially; generic `Seq[T]` can follow later - [ ] Higher-order sequence ops (`seq.map`, `seq.foldl`) — defer to `theory_seq` or axiomatize minimally - [ ] Bitvector-to-string conversions (`ubv2s`, `sbv2s`) — defer - [ ] Integer-to-string / string-to-integer conversions (`str.to_int`, `int.to_str`) — defer - [ ] Extended string predicates: containment (`str.contains`), prefix/suffix (`str.prefixof`, `str.suffixof`) — defer - [ ] String ordering (`str.<`, `str.<=`) — defer - [ ] New Z3 API surface — `theory_nseq` is internal only, selected by parameter - [ ] Performance parity with `theory_seq` on all benchmarks in Phase 1 — correctness first ## 4. Proposed Solution (High-Level Design) ### 4.1 System Architecture ``` ┌─────────────────────────────────────────────────────────┐ │ smt::context │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │ theory_arith │ │ theory_nseq │ │ theory_bv │ │ │ └──────────────┘ └──────┬───────┘ └──────────────┘ │ │ │ │ │ ┌────────────────────────┼──────────────────────────┐ │ │ │ theory_nseq internals │ │ │ │ ┌─────────────┐ ┌───┴───────┐ ┌────────────┐ │ │ │ │ │ nseq_regex │ │ propagator│ │ nseq_model │ │ │ │ │ └──────┬──────┘ └───┬───────┘ └────────────┘ │ │ │ │ │ │ │ │ │ │ ┌──────┴─────────────┴────────────────────────┐ │ │ │ │ │ nielsen_graph (search engine) │ │ │ │ │ │ ┌──────────┐ ┌──────────┐ │ │ │ │ │ │ │ modifier │ │ simplify │ │ │ │ │ │ │ │ chain │ │ engine │ │ │ │ │ │ │ └──────────┘ └──────────┘ │ │ │ │ │ └───────────────────┬─────────────────────────┘ │ │ │ │ │ │ │ │ │ ┌───────────────────┴─────────────────────────┐ │ │ │ │ │ sgraph + snode │ │ │ │ │ │ (string representation layer) │ │ │ │ │ └─────────────────────────────────────────────┘ │ │ │ └────────────────────────────────────────────────────┘ │ │ │ │ Shared infrastructure: seq_util, seq_rewriter, │ │ seq_skolem, arith_value, seq_factory │ └──────────────────────────────────────────────────────────┘ ``` ### 4.2 Architectural Pattern **Nielsen transformation graph with iterative deepening DFS** — matching ZIPT's architecture: 1. **Internalization**: String terms/atoms are registered, creating theory variables and snode representations. 2. **Propagation**: Equality/disequality merges and boolean assignments update the constraint store. 3. **Final check**: Build a nielsen_graph from active constraints, run iterative deepening search. 4. **Search**: At each node, apply the modifier chain (13 modifiers in priority order) to generate child nodes. Check for conflicts and satisfaction. 5. **Conflict/Satisfaction**: If all leaves are satisfied → FC_DONE. If a conflict is found → generate explanation clause and return FC_CONTINUE. If stuck → increase depth bound and retry. ### 4.3 Key Components | Component | Responsibility | Location | Reuse | |-----------|---------------|----------|-------| | `theory_nseq` | SMT theory interface, internalization, propagation | `src/smt/theory_nseq.h/.cpp` | New | | `nseq_state` | Constraint store, variable tracking, backtrackable state | `src/smt/nseq_state.h/.cpp` | New | | Nielsen graph (solver) | Graph expansion, iterative deepening, conflict detection | `src/smt/seq/seq_nielsen.h/.cpp` | Extend existing | | Nielsen modifiers | 13 transformation rules (per ZIPT modifier chain) | `src/smt/seq/seq_nielsen.h/.cpp` | New (in existing files) | | `nseq_regex` | Regex membership processing, derivative dispatch | `src/smt/nseq_regex.h/.cpp` | New | | `nseq_model` | Model construction from solved constraints | `src/smt/nseq_model.h/.cpp` | New | | Parikh/length | Length reasoning, integer subproblems | `src/smt/seq/seq_nielsen.h/.cpp` | New (in existing) | | `lp::int_solver` | Integer subproblem solving (replaces ZIPT's PDD subsolver) | `src/math/lp/int_solver.h` | Existing ✅ | | sgraph | String representation, substitution, derivatives | `src/ast/euf/euf_sgraph.*` | Existing ✅ | | seq_plugin | Egraph associativity/simplification | `src/ast/euf/euf_seq_plugin.*` | Existing ✅ | | seq_util | Sort/op declarations and utilities | `src/ast/seq_decl_plugin.*` | Existing ✅ | | seq_rewriter | Rewrite rules, regex derivatives, nullability | `src/ast/rewriter/seq_rewriter.*` | Existing ✅ | | seq_skolem | Skolem function introduction | `src/ast/rewriter/seq_skolem.*` | Existing ✅ | | arith_value | Arithmetic solver queries | `src/smt/smt_arith_value.*` | Existing ✅ | | seq_factory | Model value factory | `src/model/seq_factory.h` | Existing ✅ | ## 5. Detailed Design ### 5.1 Phase 1: Skeleton & Integration #### 5.1.1 theory_nseq class (`src/smt/theory_nseq.h/.cpp`) Implements `smt::theory` with these pure virtual methods: ```cpp class theory_nseq : public theory { // Required by smt::theory: bool internalize_atom(app* atom, bool gate_ctx) override; bool internalize_term(app* term) override; void new_eq_eh(theory_var v1, theory_var v2) override; void new_diseq_eh(theory_var v1, theory_var v2) override; theory* mk_fresh(context* new_ctx) override; void display(std::ostream& out) const override; // Optional overrides: void assign_eh(bool_var v, bool is_true) override; bool can_propagate() override; void propagate() override; final_check_status final_check_eh() override; void push_scope_eh() override; void pop_scope_eh(unsigned num_scopes) override; void init_model(model_generator& mg) override; void finalize_model(model_generator& mg) override; void collect_statistics(statistics& st) const override; // Members: seq_util m_seq; arith_util m_autil; arith_value m_arith_value; seq_rewriter m_rewriter; seq::skolem m_sk; euf::egraph m_egraph; euf::sgraph m_sgraph; nielsen_graph m_nielsen; // ... state management, constraint store }; ``` **family_id**: Uses `m_seq.get_family_id()` (same as `theory_seq`). **Internalization pattern** (following `theory_seq`): - `internalize_term`: For string-sorted terms, create theory variable, register with egraph/sgraph. For `str.++`, `str.len`, `str.in_re`, etc., recursively internalize children. - `internalize_atom`: For `str.in_re` — create boolean variable. (`str.contains`, `str.prefixof`, `str.suffixof`, `str.<`, `str.<=` are out of scope for `theory_nseq`.) #### 5.1.2 Parameter integration **`src/params/smt_params.cpp`** — Add `"nseq"` to `validate_string_solver()`: ```cpp void smt_params::validate_string_solver(symbol const& s) const { if (s == "z3str3" || s == "seq" || s == "empty" || s == "auto" || s == "none" || s == "nseq") return; throw default_exception("Invalid string solver value. Legal values are z3str3, seq, empty, auto, none, nseq"); } ``` **`src/smt/smt_setup.cpp`** — Add `"nseq"` dispatch: ```cpp void setup::setup_seq_str(static_features const& st) { if (m_params.m_string_solver == "nseq") { setup_nseq(); } else if (m_params.m_string_solver == "seq") { ... } // ... } void setup::setup_nseq() { m_context.register_plugin(alloc(smt::theory_nseq, m_context)); } ``` **`src/smt/smt_setup.h`** — Declare `setup_nseq()`. **`src/params/smt_params_helper.pyg`** — Document `nseq` in the `string_solver` parameter description, and add: ```python def_param("nseq.max_depth", UINT, 0, "Maximum Nielsen search depth for theory_nseq (0 = unlimited)") ``` #### 5.1.3 Build system **`src/smt/CMakeLists.txt`** — Add `theory_nseq.cpp` and new .cpp files. ### 5.2 Phase 2: Constraint Store & State Management #### 5.2.1 nseq_state (`src/smt/nseq_state.h/.cpp`) Manages the backtrackable constraint store bridging SMT context to Nielsen graph: ```cpp class nseq_state { // Variable tracking th_union_find m_find; // Union-find for theory variables obj_map m_var_map; // expr → theory_var // Active constraints (backtrackable) scoped_vector m_eqs; // Active word equations scoped_vector m_mems; // Active regex memberships svector> m_diseqs; // Disequalities svector m_pending_bools; // Pending boolean assignments // Scope management void push(); void pop(unsigned n); // Constraint registration void add_eq(euf::snode* lhs, euf::snode* rhs, unsigned dep_id); void add_mem(euf::snode* str, euf::snode* regex, unsigned dep_id); void add_diseq(theory_var v1, theory_var v2); }; ``` Key design decisions: - Use `scoped_vector` for backtrackable collections (same pattern as `theory_seq`) - Map theory variables to snode representations via sgraph - Dependency IDs correspond to `dep_tracker` bit positions for conflict explanation ### 5.3 Phase 3: Nielsen Transformation Algorithms This is the core algorithmic phase. All additions go into `src/smt/seq/seq_nielsen.h/.cpp`. #### 5.3.1 Simplification Engine Add to `nielsen_node`: ```cpp // In nielsen_node: simplify_result simplify_and_init(nielsen_graph& g); simplify_result simplify_str_eq(nielsen_graph& g, unsigned idx); simplify_result simplify_str_mem(nielsen_graph& g, unsigned idx); ``` Simplification rules (matching ZIPT's `SimplifyAndInit`): 1. **Trivial equation removal**: If `lhs == rhs` (same snode), remove 2. **Empty propagation**: If `lhs = ε`, then `rhs` must be empty → conflict or propagate 3. **Constant prefix matching**: If both sides start with same character, strip it 4. **Constant mismatch**: If sides start with different characters → conflict (symbol_clash) 5. **Unit propagation**: If `x = c` where c is ground, substitute x→c everywhere 6. **Nullable regex**: If `str = ε` and regex is nullable → satisfied #### 5.3.2 Modifier Chain Add 13 modifiers matching ZIPT's priority order. Each modifier returns a list of `nielsen_edge` expansions: ```cpp // In nielsen_graph: struct modifier_result { ptr_vector edges; bool deterministic; // true if only one branch is valid }; // Modifiers (in priority order): modifier_result apply_det_modifier(nielsen_node& n); // 1. Deterministic simplifications modifier_result apply_power_epsilon(nielsen_node& n); // 2. Power-epsilon: u="" || n=0 modifier_result apply_num_cmp(nielsen_node& n); // 3. Numeric comparisons modifier_result apply_const_num_unwinding(nielsen_node& n); // 4. Constant power unwinding modifier_result apply_eq_split(nielsen_node& n); // 5. Equality prefix/suffix split modifier_result apply_star_intr(nielsen_node& n); // 6. Stabilizer introduction modifier_result apply_gpower_intr(nielsen_node& n); // 7. Power introduction modifier_result apply_const_nielsen(nielsen_node& n); // 8. Nielsen vs constant modifier_result apply_regex_char_split(nielsen_node& n); // 9. Regex character split modifier_result apply_regex_var_split(nielsen_node& n); // 10. Regex variable split modifier_result apply_power_split(nielsen_node& n); // 11. Power unwinding modifier_result apply_var_nielsen(nielsen_node& n); // 12. Variable Nielsen split modifier_result apply_var_num_unwinding(nielsen_node& n); // 13. Variable power unwinding ``` The modifier chain is applied in order; the first modifier that produces edges is used: ```cpp modifier_result generate_extensions(nielsen_node& n) { modifier_result r; r = apply_det_modifier(n); if (!r.edges.empty()) return r; r = apply_power_epsilon(n); if (!r.edges.empty()) return r; // ... through all 13 modifiers r = apply_var_num_unwinding(n); return r; } ``` #### 5.3.3 Graph Expansion & Iterative Deepening Add to `nielsen_graph`: ```cpp // Search result enum class search_result { sat, unsat, unknown }; // Main search entry point search_result solve(); // Iterative deepening DFS search_result search_dfs(nielsen_node* node, unsigned depth); // Expand a single node search_result expand_node(nielsen_node* node, unsigned depth); ``` **Algorithm** (matching ZIPT's `NielsenGraph.Run()`): 1. Set initial depth bound to **3** 2. Run DFS from root with current depth bound 3. If all branches conflict → UNSAT 4. If a leaf is satisfied (all constraints trivial) → SAT 5. If depth bound hit → increase bound by **1**, restart (iterative deepening) 6. If `smt.nseq.max_depth > 0` and bound exceeds it → give up (`FC_GIVEUP`); if `smt.nseq.max_depth == 0` (default) → no upper limit #### 5.3.4 Conflict Detection & Explanation Add to `nielsen_graph`: ```cpp // Extract conflict explanation as a set of input constraint indices void explain_conflict(nielsen_node const& conflict_node, svector& deps); ``` The `dep_tracker` bitvectors on each constraint track which original input constraints contributed. On conflict, OR together the dep_trackers of the conflicting constraints to get the minimal explanation. ### 5.4 Phase 4: Integer Constraint Reasoning Each Nielsen node accumulates integer constraints from all sources: - **Length constraints**: from word equations `lhs = rhs` → `|lhs| = |rhs|` - **Power terms**: `x^n` with symbolic exponent → `n ≥ 0`, `|x^n| = n * |x|` - **Parikh image**: from regex membership `x ∈ L(r)` → length bounds/modular constraints (e.g. `x ∈ (ab)*` → `|x| mod 2 = 0`) All of these are collected together into a single integer constraint system per node and discharged uniformly via Z3's `lp::int_solver` (`src/math/lp/int_solver.h`). This replaces ZIPT's PDD-based integer subsolver. There is no separate treatment per source — they are all integer constraints over the same set of length variables. ```cpp // In nielsen_graph or a helper: void generate_integer_constraints(nielsen_node const& node, vector& int_constraints); ``` **Length computation from snode**: - `|ε|` = 0 - `|c|` = 1 (character) - `|a · b|` = `|a| + |b|` - `|x^n|` = `n * |x|` (power) - `|x|` = length variable for x **Integration**: Use `arith_value` to query current assignments and check feasibility. Propagate derived equalities/inequalities via `theory_nseq::propagate()`. ### 5.5 Phase 5: Regex Membership #### 5.5.1 nseq_regex (`src/smt/nseq_regex.h/.cpp`) Handles regex membership constraints using ZIPT's lazy approach: ```cpp class nseq_regex { euf::sgraph& m_sg; seq_rewriter& m_rewriter; // Process regex membership after character consumption simplify_result process_str_mem(str_mem& mem, euf::snode* consumed_char); // Compute derivative and create new membership constraint str_mem derive(str_mem const& mem, euf::snode* char_node); // Check if regex is empty (unsatisfiable) bool is_empty_regex(euf::snode* regex); // Check for regex cycles (stabilizer introduction) bool detect_cycle(str_mem const& mem); // Compute minterms for character splitting void get_minterms(euf::snode* regex, vector& minterms); }; ``` **Lazy evaluation pattern** (per LazyMemberships.pdf): - Only consume characters from the string when they become concrete (via substitution) - Use Brzozowski derivatives: `D(c, r)` gives the derivative of regex `r` by character `c` - Track history (consumed prefix) for cycle detection - When a cycle is detected, introduce a stabilizer: `remaining ∈ base*` **Reuse**: `sgraph::brzozowski_deriv()` and `sgraph::compute_minterms()` are already implemented. ### 5.6 Phase 6: Propagation Engine & Integration #### 5.6.1 theory_nseq::final_check_eh() The main solving loop: ```cpp final_check_status theory_nseq::final_check_eh() { // 1. Collect constraints from SMT context into nielsen_graph m_nielsen.reset(); populate_nielsen_graph(); // 2. Run Nielsen search auto result = m_nielsen.solve(); switch (result) { case search_result::sat: return FC_DONE; case search_result::unsat: // Extract conflict explanation svector deps; m_nielsen.explain_conflict(deps); // Build conflict clause from deps add_conflict_clause(deps); return FC_CONTINUE; case search_result::unknown: // Depth bound hit or timeout return FC_GIVEUP; } } ``` #### 5.6.2 populate_nielsen_graph() Convert active SMT constraints to nielsen_graph format: ```cpp void theory_nseq::populate_nielsen_graph() { for (auto& eq : m_state.eqs()) { euf::snode* lhs = m_sgraph.mk(eq.first); euf::snode* rhs = m_sgraph.mk(eq.second); m_nielsen.add_str_eq(lhs, rhs); } for (auto& mem : m_state.mems()) { euf::snode* str = m_sgraph.mk(mem.str); euf::snode* regex = m_sgraph.mk(mem.regex); m_nielsen.add_str_mem(str, regex); } } ``` #### 5.6.3 theory_nseq::propagate() Propagate consequences from Nielsen solving to the SMT context: - New equalities (from substitutions) → `ctx.assign_eq()` - New disequalities → `ctx.assign_diseq()` - New literals (regex memberships) → `ctx.assign()` - Length constraints → `ctx.assign()` for arithmetic literals ### 5.7 Phase 7: Model Generation #### 5.7.1 nseq_model (`src/smt/nseq_model.h/.cpp`) Build concrete string values from solved Nielsen graph: ```cpp class nseq_model { // Extract model assignments from a satisfied nielsen_node void extract_assignments(nielsen_node const& leaf, obj_map& assignments); // Generate fresh string values for unconstrained variables expr_ref mk_fresh_value(sort* s, seq_factory& factory); // Ensure consistency with regex memberships bool validate_regex(expr* str_val, expr* regex); }; ``` **Model construction algorithm**: 1. From the satisfied leaf node, collect all substitutions along the path from root 2. Compose substitutions to get final variable assignments 3. For unconstrained variables, use `seq_factory::get_fresh_value()` 4. Register with `model_generator` via `init_model()` / `finalize_model()` ## 6. Alternatives Considered | Option | Pros | Cons | Decision | |--------|------|------|----------| | Extend `theory_seq` with Nielsen | Single solver, shared infrastructure | Massive refactor, high risk to production solver | **Rejected**: Too risky to production | | Port ZIPT as standalone user propagator | Minimal Z3 changes | Poor integration, no sharing of Z3 infrastructure, slower | **Rejected**: Misses optimization opportunities | | New `theory_nseq` (selected) | Clean separation, reuses infrastructure, safe coexistence | New code to maintain, some duplication | **Selected**: Best balance of risk and benefit | | PDD for Parikh (full ZIPT port) | Exact polynomial reasoning | Complex, Z3 has `lp::int_solver` already | **Rejected**: Use `lp::int_solver` instead of porting ZIPT's PDD subsolver | ## 7. Cross-Cutting Concerns ### 7.1 Backtracking / Scope Management `theory_nseq` must correctly implement `push_scope_eh()` / `pop_scope_eh()`: - `nseq_state` uses `scoped_vector` (same as `theory_seq`) - sgraph/egraph have their own `push()`/`pop()` — must be synchronized - nielsen_graph is rebuilt from scratch on each `final_check_eh()` (no incremental state), so no backtracking needed for the graph itself ### 7.2 Conflict Clause Quality Conflict explanations must be minimal for effective CDCL learning: - `dep_tracker` bitvectors track which input constraints are responsible - On conflict, extract the minimal set of input constraint IDs - Map these back to Z3 literals for the conflict clause ### 7.3 Interaction with Arithmetic Solver Length constraints create coupling between `theory_nseq` and the arithmetic solver: - Use `arith_value` to query current arithmetic assignments (same pattern as `theory_seq`) - Propagate derived length equalities as theory lemmas - May need to handle `FC_CONTINUE` loops when arithmetic and string solvers ping-pong ### 7.4 Statistics `collect_statistics()` should track: - Number of Nielsen graph nodes explored - Number of conflicts found - Number of depth bound increases - Number of regex derivatives computed - Number of length constraints generated ## 8. Migration, Rollout, and Testing ### 8.1 Deployment Strategy - [ ] Phase 1: `theory_nseq` returns `FC_GIVEUP` for all inputs — proves integration works - [ ] Phase 2: Handle trivial word equations (constant strings, simple variables) - [ ] Phase 3: Full modifier chain — handle the string regression suite - [ ] Phase 4: Regex support — handle `re*.smt2` regression tests - [ ] Phase 5: Performance tuning — compare against `theory_seq` on SMT-COMP benchmarks ### 8.2 Test Plan #### Unit Tests (C++ tests in `src/test/`) | Test | File | Scope | |------|------|-------| | `TST(nseq_basic)` | `src/test/nseq_basic.cpp` | theory_nseq instantiation, parameter selection, trivial sat/unsat | | `TST(seq_nielsen)` | `src/test/seq_nielsen.cpp` (existing) | Extend with simplification, modifier, and search tests | | `TST(euf_sgraph)` | `src/test/euf_sgraph.cpp` (existing) | Already passing; extend if sgraph changes are needed | #### Regression Tests (SMT2 via z3 command line) Use `C:\z3test\regressions\smt2\` (the same suite as nightly CI): 1. **String-specific files** with `smt.string_solver=nseq`: - `string-concat.smt2`, `string-eval.smt2`, `string-itos.smt2`, `string-literals.smt2`, `string-ops.smt2`, `string-simplify.smt2` - `string3.smt2`, `string6.smt2`, `string7.smt2`, `string9.smt2`, `string11.smt2`–`string14.smt2` - `re.smt2`, `re_rewriter.smt2` 2. **Comparison testing**: Run same files with both `smt.string_solver=seq` and `smt.string_solver=nseq`, verify sat/unsat agreement (no soundness bugs). 3. **Full regression suite**: Ensure `smt.string_solver=nseq` doesn't break non-string tests. #### Targeted SMT2 Tests (New) Create focused test files in `tests/nseq/`: - `eq_basic.smt2` — Simple word equations - `eq_nielsen.smt2` — Equations requiring Nielsen transformations - `length.smt2` — Length constraints - `regex.smt2` — Regex membership - `mixed.smt2` — Combined string + arithmetic - `unsat.smt2` — Unsatisfiable benchmarks ## 9. Design Decisions (Resolved) - [x] **Q1: egraph ownership** — `theory_nseq` creates its own **private egraph/sgraph**, matching the ZIPT pattern where the string manager is independent. This avoids interference with other theories and keeps the string representation self-contained. - [x] **Q2: Parikh implementation approach** — Use **Z3's built-in `lp::int_solver`** (`src/math/lp/int_solver.h`) for integer subproblems arising from Parikh/length constraints. Generate linear length constraints from word equations and regex membership, propagate them as theory lemmas via `arith_value`. ZIPT's PDD subsolver is **not ported** — `lp::int_solver` directly replaces it. - [x] **Q3: Nielsen graph persistence** — The Nielsen graph will be **maintained incrementally with backtracking support**. The graph persists across `final_check_eh()` calls with push/pop scope synchronization. This avoids redundant re-exploration of previously visited nodes when the SMT solver backtracks. - [x] **Q4: Modifier chain completeness** — Implement an **essential subset of 5 modifiers first**: DetModifier, EqSplitModifier, ConstNielsenModifier, VarNielsenModifier, RegexCharSplitModifier. These cover basic word equations and simple regex membership. The remaining 8 modifiers (PowerEpsilon, NumCmp, ConstNumUnwinding, StarIntr, GPowerIntr, RegexVarSplit, PowerSplit, VarNumUnwinding) will be added incrementally in subsequent phases. - [x] **Q5: Generic sequences** — `theory_nseq` handles **all `Seq[T]` types from the start**, not just strings. Nielsen transformations work on any element type. Regex membership is string-specific but the word equation engine is generic. - [x] **Q6: Higher-order ops** — **Axiomatize minimally** for `seq.map`, `seq.mapi`, `seq.foldl`, `seq.foldli`. Generate basic axioms (similar to theory_seq's approach) without deep reasoning. This ensures theory_nseq doesn't immediately give up on problems containing these operations. - [x] **Q7: Disequality handling** — Use **separate tracking like ZIPT**. Port ZIPT's character disequality and string disequality tracking into the nielsen_graph data structures. This gives the modifier chain direct access to disequality information (particularly needed for character minterm computation in RegexCharSplitModifier). ## 10. File Summary ### New Files | File | Phase | Description | |------|-------|-------------| | `src/smt/theory_nseq.h` | 1 | Theory class header | | `src/smt/theory_nseq.cpp` | 1 | Theory class implementation | | `src/smt/nseq_state.h` | 2 | Constraint store header | | `src/smt/nseq_state.cpp` | 2 | Constraint store implementation | | `src/smt/nseq_regex.h` | 5 | Regex handling header | | `src/smt/nseq_regex.cpp` | 5 | Regex handling implementation | | `src/smt/nseq_model.h` | 7 | Model generation header | | `src/smt/nseq_model.cpp` | 7 | Model generation implementation | | `src/test/nseq_basic.cpp` | 1 | Unit tests | ### Files to Extend | File | Phase | Change | |------|-------|--------| | `src/smt/seq/seq_nielsen.h` | 3 | Add simplification, modifier chain, search methods | | `src/smt/seq/seq_nielsen.cpp` | 3 | Implement above methods | ### Files to Modify (Integration) | File | Phase | Change | |------|-------|--------| | `src/smt/smt_setup.h` | 1 | Declare `setup_nseq()` | | `src/smt/smt_setup.cpp` | 1 | Add `"nseq"` dispatch, implement `setup_nseq()` | | `src/params/smt_params.cpp` | 1 | Add `"nseq"` to `validate_string_solver()` | | `src/params/smt_params_helper.pyg` | 1 | Document `nseq` option | | `src/smt/CMakeLists.txt` | 1 | Add new .cpp files to SOURCES | | `src/test/main.cpp` | 1 | Register `TST(nseq_basic)` | | `src/test/CMakeLists.txt` | 1 | Add `nseq_basic.cpp` | ## 11. Implementation Order 1. **Phase 1** — Skeleton (`theory_nseq` class, parameter integration, build system) — compilable stub returning `FC_GIVEUP` 2. **Phase 2** — State management (`nseq_state`) — constraint store with backtracking 3. **Phase 3** — Nielsen algorithms (simplification, modifiers, search, conflict) — core solver 4. **Phase 4** — Integer constraint reasoning (`lp::int_solver`) — length, power, Parikh 5. **Phase 5** — Regex membership (`nseq_regex`) — lazy derivatives 6. **Phase 6** — Propagation wiring — connect all components in `final_check_eh` 7. **Phase 7** — Model generation (`nseq_model`) — satisfying assignments