3
0
Fork 0
mirror of https://github.com/Z3Prover/z3 synced 2026-04-20 10:53:29 +00:00
z3/specs/theory-nseq-zipt-port.md
2026-03-04 17:37:04 +01:00

34 KiB
Raw Blame History

theory_nseq — ZIPT-Based String Theory Solver for Z3

Document Metadata Details
Author(s) Nikolaj Bjorner
Status In Review (RFC)
Team / Owner Z3 / SMT Strings
Created / Last Updated 2026-03-03

1. Executive Summary

This spec defines the completion of the ZIPT algorithm port into Z3's c3 branch as theory_nseq — a new string/sequence theory solver based on Nielsen-style word equation solving with lazy regex membership and Parikh image reasoning. The foundational data structures (sgraph, snode, seq_plugin, nielsen_graph) are already implemented and tested (38 unit tests). What remains is: (1) the theory solver skeleton and SMT integration, (2) the Nielsen transformation algorithms (modifier chain, graph expansion, iterative deepening, conflict detection), (3) Parikh image / length reasoning, (4) regex membership processing, (5) propagation engine, and (6) model generation. The new solver will coexist with theory_seq and be selectable via smt.string_solver=nseq.

2. Context and Motivation

2.1 Current State

Z3's string solver theory_seq uses an axiom-driven approach with a 20-phase final_check_eh cascade. While production-grade, it struggles with certain classes of word equations and regex membership constraints that the ZIPT algorithm handles more effectively.

The ZIPT algorithm (reference C# implementation at github.com/CEisenhofer/ZIPT) uses a fundamentally different approach: Nielsen transformations on a search graph with iterative deepening, lazy regex evaluation, and Parikh image constraints.

Already implemented on c3 branch (per research doc):

  • sgraph + snode — String representation, metadata, hashing, substitution, derivatives, minterms (src/ast/euf/euf_sgraph.h/.cpp, euf_snode.h)
  • seq_plugin — Egraph plugin for associativity, identity, absorption, star merge (src/ast/euf/euf_seq_plugin.h/.cpp)
  • Nielsen graph data structures — str_eq, str_mem, nielsen_subst, nielsen_edge, nielsen_node, nielsen_graph, dep_tracker (src/smt/seq/seq_nielsen.h/.cpp)
  • 38 unit tests (18 sgraph + 7 seq_plugin + 13 nielsen)
  • theory_nseq skeleton and SMT integration (src/smt/theory_nseq.h/.cpp)
  • Parameter integration — smt.string_solver=nseq and smt.nseq.max_depth (src/params/smt_params*.{cpp,h,pyg}, src/smt/smt_setup.cpp)
  • Build system — theory_nseq.cpp, nseq_state.cpp, nseq_regex.cpp, nseq_model.cpp in src/smt/CMakeLists.txt
  • All 13 Nielsen modifiers (apply_det_modifier, apply_const_nielsen, apply_var_nielsen, apply_eq_split, apply_regex_char_split, apply_power_epsilon, apply_num_cmp, apply_const_num_unwinding, apply_star_intr, apply_gpower_intr, apply_regex_var_split, apply_power_split, apply_var_num_unwinding)
  • Graph expansion / iterative deepening DFS — nielsen_graph::solve() + search_dfs() (depth 3, +1/failure, bounded by smt.nseq.max_depth)
  • Constraint simplification engine — nielsen_node::simplify_and_init()
  • Conflict detection and explanation — dep_tracker-based conflict clauses
  • Propagation engine — theory_nseq::propagate(), eq/diseq/regex membership wiring
  • Model generation — nseq_model (src/smt/nseq_model.h/.cpp), wired via init_model() / mk_value() / finalize_model()
  • Regex membership processing — nseq_regex (src/smt/nseq_regex.h/.cpp)
  • Constraint state management — nseq_state (src/smt/nseq_state.h/.cpp) with push/pop backtracking
  • Integration unit tests — tst_nseq_basic (src/test/nseq_basic.cpp), tst_nseq_regex (src/test/nseq_regex.cpp)
  • Length constraint generation — nielsen_graph::generate_length_constraints() (length equalities, power non-negativity, regex bounds)

Not yet implemented / incomplete:

  • Integer constraint reasoning via lp::int_solver — length constraints are generated but not yet discharged through lp::int_solver; currently propagated as theory lemmas via arith_value
  • Correct results on Z3's full string regression suite (C:\z3test\regressions\smt2\string*.smt2)

2.2 The Problem

  • Algorithmic gap: The core solving algorithms are implemented. The main remaining gap is full lp::int_solver integration for integer subproblems.
  • Regression coverage: The solver has not yet been validated against Z3's full string regression suite.

2.3 Reference Documents

  • spec/reference.md — Points to ZIPT repo (parikh branch), LazyMemberships.pdf, ClemensTableau.pdf
  • spec/plan.md — Task description
  • spec/plan1.md — 7-phase implementation plan with file list
  • research/docs/2026-03-03-zipt-string-theory-codebase-analysis.md — Comprehensive codebase research

3. Goals and Non-Goals

3.1 Functional Goals

  • theory_nseq theory solver class implementing smt::theory interface
  • Selectable via smt.string_solver=nseq parameter
  • Nielsen transformation-based word equation solving
  • Lazy regex membership via Brzozowski derivatives (reusing sgraph::brzozowski_deriv)
  • Integer constraint reasoning via lp::int_solver (length, power terms, Parikh image)
  • Iterative deepening search: initial depth 3, +1 per failure, upper bound via smt.nseq.max_depth (0 = infinity)
  • Conflict detection with dependency-tracked explanations
  • Model generation for satisfiable instances
  • Correct results on Z3's existing string regression suite (C:\z3test\regressions\smt2\string*.smt2)

3.2 Non-Goals (Out of Scope)

  • Replacing theory_seq as default — theory_nseq is an alternative, not a replacement
  • Polymorphic sequences — focus on strings (Seq[Char]) initially; generic Seq[T] can follow later
  • Higher-order sequence ops (seq.map, seq.foldl) — defer to theory_seq or axiomatize minimally
  • Bitvector-to-string conversions (ubv2s, sbv2s) — defer
  • Integer-to-string / string-to-integer conversions (str.to_int, int.to_str) — defer
  • Extended string predicates: containment (str.contains), prefix/suffix (str.prefixof, str.suffixof) — defer
  • String ordering (str.<, str.<=) — defer
  • New Z3 API surface — theory_nseq is internal only, selected by parameter
  • Performance parity with theory_seq on all benchmarks in Phase 1 — correctness first

4. Proposed Solution (High-Level Design)

4.1 System Architecture

┌─────────────────────────────────────────────────────────┐
│                    smt::context                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │ theory_arith │  │ theory_nseq  │  │ theory_bv    │  │
│  └──────────────┘  └──────┬───────┘  └──────────────┘  │
│                           │                              │
│  ┌────────────────────────┼──────────────────────────┐  │
│  │            theory_nseq internals                   │  │
│  │  ┌─────────────┐  ┌───┴───────┐  ┌────────────┐  │  │
│  │  │ nseq_regex  │  │ propagator│  │ nseq_model │  │  │
│  │  └──────┬──────┘  └───┬───────┘  └────────────┘  │  │
│  │         │             │                            │  │
│  │  ┌──────┴─────────────┴────────────────────────┐  │  │
│  │  │          nielsen_graph (search engine)       │  │  │
│  │  │  ┌──────────┐ ┌──────────┐                       │  │  │
│  │  │  │ modifier │ │ simplify │                       │  │  │
│  │  │  │  chain   │ │  engine  │                       │  │  │
│  │  │  └──────────┘ └──────────┘                       │  │  │
│  │  └───────────────────┬─────────────────────────┘  │  │
│  │                      │                             │  │
│  │  ┌───────────────────┴─────────────────────────┐  │  │
│  │  │              sgraph + snode                  │  │  │
│  │  │       (string representation layer)          │  │  │
│  │  └─────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────┘  │
│                                                          │
│  Shared infrastructure: seq_util, seq_rewriter,          │
│  seq_skolem, arith_value, seq_factory                    │
└──────────────────────────────────────────────────────────┘

4.2 Architectural Pattern

Nielsen transformation graph with iterative deepening DFS — matching ZIPT's architecture:

  1. Internalization: String terms/atoms are registered, creating theory variables and snode representations.
  2. Propagation: Equality/disequality merges and boolean assignments update the constraint store.
  3. Final check: Build a nielsen_graph from active constraints, run iterative deepening search.
  4. Search: At each node, apply the modifier chain (13 modifiers in priority order) to generate child nodes. Check for conflicts and satisfaction.
  5. Conflict/Satisfaction: If all leaves are satisfied → FC_DONE. If a conflict is found → generate explanation clause and return FC_CONTINUE. If stuck → increase depth bound and retry.

4.3 Key Components

Component Responsibility Location Reuse
theory_nseq SMT theory interface, internalization, propagation src/smt/theory_nseq.h/.cpp New
nseq_state Constraint store, variable tracking, backtrackable state src/smt/nseq_state.h/.cpp New
Nielsen graph (solver) Graph expansion, iterative deepening, conflict detection src/smt/seq/seq_nielsen.h/.cpp Extend existing
Nielsen modifiers 13 transformation rules (per ZIPT modifier chain) src/smt/seq/seq_nielsen.h/.cpp New (in existing files)
nseq_regex Regex membership processing, derivative dispatch src/smt/nseq_regex.h/.cpp New
nseq_model Model construction from solved constraints src/smt/nseq_model.h/.cpp New
Parikh/length Length reasoning, integer subproblems src/smt/seq/seq_nielsen.h/.cpp New (in existing)
lp::int_solver Integer subproblem solving (replaces ZIPT's PDD subsolver) src/math/lp/int_solver.h Existing
sgraph String representation, substitution, derivatives src/ast/euf/euf_sgraph.* Existing
seq_plugin Egraph associativity/simplification src/ast/euf/euf_seq_plugin.* Existing
seq_util Sort/op declarations and utilities src/ast/seq_decl_plugin.* Existing
seq_rewriter Rewrite rules, regex derivatives, nullability src/ast/rewriter/seq_rewriter.* Existing
seq_skolem Skolem function introduction src/ast/rewriter/seq_skolem.* Existing
arith_value Arithmetic solver queries src/smt/smt_arith_value.* Existing
seq_factory Model value factory src/model/seq_factory.h Existing

5. Detailed Design

5.1 Phase 1: Skeleton & Integration

5.1.1 theory_nseq class (src/smt/theory_nseq.h/.cpp)

Implements smt::theory with these pure virtual methods:

class theory_nseq : public theory {
    // Required by smt::theory:
    bool internalize_atom(app* atom, bool gate_ctx) override;
    bool internalize_term(app* term) override;
    void new_eq_eh(theory_var v1, theory_var v2) override;
    void new_diseq_eh(theory_var v1, theory_var v2) override;
    theory* mk_fresh(context* new_ctx) override;
    void display(std::ostream& out) const override;

    // Optional overrides:
    void assign_eh(bool_var v, bool is_true) override;
    bool can_propagate() override;
    void propagate() override;
    final_check_status final_check_eh() override;
    void push_scope_eh() override;
    void pop_scope_eh(unsigned num_scopes) override;
    void init_model(model_generator& mg) override;
    void finalize_model(model_generator& mg) override;
    void collect_statistics(statistics& st) const override;

    // Members:
    seq_util          m_seq;
    arith_util        m_autil;
    arith_value       m_arith_value;
    seq_rewriter      m_rewriter;
    seq::skolem       m_sk;
    euf::egraph       m_egraph;
    euf::sgraph       m_sgraph;
    nielsen_graph     m_nielsen;
    // ... state management, constraint store
};

family_id: Uses m_seq.get_family_id() (same as theory_seq).

Internalization pattern (following theory_seq):

  • internalize_term: For string-sorted terms, create theory variable, register with egraph/sgraph. For str.++, str.len, str.in_re, etc., recursively internalize children.
  • internalize_atom: For str.in_re — create boolean variable. (str.contains, str.prefixof, str.suffixof, str.<, str.<= are out of scope for theory_nseq.)

5.1.2 Parameter integration

src/params/smt_params.cpp — Add "nseq" to validate_string_solver():

void smt_params::validate_string_solver(symbol const& s) const {
    if (s == "z3str3" || s == "seq" || s == "empty" || s == "auto" || s == "none" || s == "nseq")
        return;
    throw default_exception("Invalid string solver value. Legal values are z3str3, seq, empty, auto, none, nseq");
}

src/smt/smt_setup.cpp — Add "nseq" dispatch:

void setup::setup_seq_str(static_features const& st) {
    if (m_params.m_string_solver == "nseq") {
        setup_nseq();
    }
    else if (m_params.m_string_solver == "seq") { ... }
    // ...
}

void setup::setup_nseq() {
    m_context.register_plugin(alloc(smt::theory_nseq, m_context));
}

src/smt/smt_setup.h — Declare setup_nseq().

src/params/smt_params_helper.pyg — Document nseq in the string_solver parameter description, and add:

def_param("nseq.max_depth", UINT, 0,
    "Maximum Nielsen search depth for theory_nseq (0 = unlimited)")

5.1.3 Build system

src/smt/CMakeLists.txt — Add theory_nseq.cpp and new .cpp files.

5.2 Phase 2: Constraint Store & State Management

5.2.1 nseq_state (src/smt/nseq_state.h/.cpp)

Manages the backtrackable constraint store bridging SMT context to Nielsen graph:

class nseq_state {
    // Variable tracking
    th_union_find           m_find;        // Union-find for theory variables
    obj_map<expr, theory_var> m_var_map;   // expr → theory_var

    // Active constraints (backtrackable)
    scoped_vector<str_eq>   m_eqs;         // Active word equations
    scoped_vector<str_mem>  m_mems;        // Active regex memberships
    svector<std::pair<theory_var, theory_var>> m_diseqs; // Disequalities
    svector<bool_var>       m_pending_bools; // Pending boolean assignments

    // Scope management
    void push();
    void pop(unsigned n);

    // Constraint registration
    void add_eq(euf::snode* lhs, euf::snode* rhs, unsigned dep_id);
    void add_mem(euf::snode* str, euf::snode* regex, unsigned dep_id);
    void add_diseq(theory_var v1, theory_var v2);
};

Key design decisions:

  • Use scoped_vector for backtrackable collections (same pattern as theory_seq)
  • Map theory variables to snode representations via sgraph
  • Dependency IDs correspond to dep_tracker bit positions for conflict explanation

5.3 Phase 3: Nielsen Transformation Algorithms

This is the core algorithmic phase. All additions go into src/smt/seq/seq_nielsen.h/.cpp.

5.3.1 Simplification Engine

Add to nielsen_node:

// In nielsen_node:
simplify_result simplify_and_init(nielsen_graph& g);
simplify_result simplify_str_eq(nielsen_graph& g, unsigned idx);
simplify_result simplify_str_mem(nielsen_graph& g, unsigned idx);

Simplification rules (matching ZIPT's SimplifyAndInit):

  1. Trivial equation removal: If lhs == rhs (same snode), remove
  2. Empty propagation: If lhs = ε, then rhs must be empty → conflict or propagate
  3. Constant prefix matching: If both sides start with same character, strip it
  4. Constant mismatch: If sides start with different characters → conflict (symbol_clash)
  5. Unit propagation: If x = c where c is ground, substitute x→c everywhere
  6. Nullable regex: If str = ε and regex is nullable → satisfied

5.3.2 Modifier Chain

Add 13 modifiers matching ZIPT's priority order. Each modifier returns a list of nielsen_edge expansions:

// In nielsen_graph:
struct modifier_result {
    ptr_vector<nielsen_edge> edges;
    bool deterministic;  // true if only one branch is valid
};

// Modifiers (in priority order):
modifier_result apply_det_modifier(nielsen_node& n);           // 1. Deterministic simplifications
modifier_result apply_power_epsilon(nielsen_node& n);          // 2. Power-epsilon: u="" || n=0
modifier_result apply_num_cmp(nielsen_node& n);                // 3. Numeric comparisons
modifier_result apply_const_num_unwinding(nielsen_node& n);    // 4. Constant power unwinding
modifier_result apply_eq_split(nielsen_node& n);               // 5. Equality prefix/suffix split
modifier_result apply_star_intr(nielsen_node& n);              // 6. Stabilizer introduction
modifier_result apply_gpower_intr(nielsen_node& n);            // 7. Power introduction
modifier_result apply_const_nielsen(nielsen_node& n);          // 8. Nielsen vs constant
modifier_result apply_regex_char_split(nielsen_node& n);       // 9. Regex character split
modifier_result apply_regex_var_split(nielsen_node& n);        // 10. Regex variable split
modifier_result apply_power_split(nielsen_node& n);            // 11. Power unwinding
modifier_result apply_var_nielsen(nielsen_node& n);            // 12. Variable Nielsen split
modifier_result apply_var_num_unwinding(nielsen_node& n);      // 13. Variable power unwinding

The modifier chain is applied in order; the first modifier that produces edges is used:

modifier_result generate_extensions(nielsen_node& n) {
    modifier_result r;
    r = apply_det_modifier(n);        if (!r.edges.empty()) return r;
    r = apply_power_epsilon(n);       if (!r.edges.empty()) return r;
    // ... through all 13 modifiers
    r = apply_var_num_unwinding(n);
    return r;
}

5.3.3 Graph Expansion & Iterative Deepening

Add to nielsen_graph:

// Search result
enum class search_result { sat, unsat, unknown };

// Main search entry point
search_result solve();

// Iterative deepening DFS
search_result search_dfs(nielsen_node* node, unsigned depth);

// Expand a single node
search_result expand_node(nielsen_node* node, unsigned depth);

Algorithm (matching ZIPT's NielsenGraph.Run()):

  1. Set initial depth bound to 3
  2. Run DFS from root with current depth bound
  3. If all branches conflict → UNSAT
  4. If a leaf is satisfied (all constraints trivial) → SAT
  5. If depth bound hit → increase bound by 1, restart (iterative deepening)
  6. If smt.nseq.max_depth > 0 and bound exceeds it → give up (FC_GIVEUP); if smt.nseq.max_depth == 0 (default) → no upper limit

5.3.4 Conflict Detection & Explanation

Add to nielsen_graph:

// Extract conflict explanation as a set of input constraint indices
void explain_conflict(nielsen_node const& conflict_node, svector<unsigned>& deps);

The dep_tracker bitvectors on each constraint track which original input constraints contributed. On conflict, OR together the dep_trackers of the conflicting constraints to get the minimal explanation.

5.4 Phase 4: Integer Constraint Reasoning

Each Nielsen node accumulates integer constraints from all sources:

  • Length constraints: from word equations lhs = rhs|lhs| = |rhs|
  • Power terms: x^n with symbolic exponent → n ≥ 0, |x^n| = n * |x|
  • Parikh image: from regex membership x ∈ L(r) → length bounds/modular constraints (e.g. x ∈ (ab)*|x| mod 2 = 0)

All of these are collected together into a single integer constraint system per node and discharged uniformly via Z3's lp::int_solver (src/math/lp/int_solver.h). This replaces ZIPT's PDD-based integer subsolver. There is no separate treatment per source — they are all integer constraints over the same set of length variables.

// In nielsen_graph or a helper:
void generate_integer_constraints(nielsen_node const& node,
                                  vector<expr_ref>& int_constraints);

Length computation from snode:

  • |ε| = 0
  • |c| = 1 (character)
  • |a · b| = |a| + |b|
  • |x^n| = n * |x| (power)
  • |x| = length variable for x

Integration: Use arith_value to query current assignments and check feasibility. Propagate derived equalities/inequalities via theory_nseq::propagate().

5.5 Phase 5: Regex Membership

5.5.1 nseq_regex (src/smt/nseq_regex.h/.cpp)

Handles regex membership constraints using ZIPT's lazy approach:

class nseq_regex {
    euf::sgraph&    m_sg;
    seq_rewriter&   m_rewriter;

    // Process regex membership after character consumption
    simplify_result process_str_mem(str_mem& mem, euf::snode* consumed_char);

    // Compute derivative and create new membership constraint
    str_mem derive(str_mem const& mem, euf::snode* char_node);

    // Check if regex is empty (unsatisfiable)
    bool is_empty_regex(euf::snode* regex);

    // Check for regex cycles (stabilizer introduction)
    bool detect_cycle(str_mem const& mem);

    // Compute minterms for character splitting
    void get_minterms(euf::snode* regex, vector<euf::snode*>& minterms);
};

Lazy evaluation pattern (per LazyMemberships.pdf):

  • Only consume characters from the string when they become concrete (via substitution)
  • Use Brzozowski derivatives: D(c, r) gives the derivative of regex r by character c
  • Track history (consumed prefix) for cycle detection
  • When a cycle is detected, introduce a stabilizer: remaining ∈ base*

Reuse: sgraph::brzozowski_deriv() and sgraph::compute_minterms() are already implemented.

5.6 Phase 6: Propagation Engine & Integration

5.6.1 theory_nseq::final_check_eh()

The main solving loop:

final_check_status theory_nseq::final_check_eh() {
    // 1. Collect constraints from SMT context into nielsen_graph
    m_nielsen.reset();
    populate_nielsen_graph();

    // 2. Run Nielsen search
    auto result = m_nielsen.solve();

    switch (result) {
    case search_result::sat:
        return FC_DONE;
    case search_result::unsat:
        // Extract conflict explanation
        svector<unsigned> deps;
        m_nielsen.explain_conflict(deps);
        // Build conflict clause from deps
        add_conflict_clause(deps);
        return FC_CONTINUE;
    case search_result::unknown:
        // Depth bound hit or timeout
        return FC_GIVEUP;
    }
}

5.6.2 populate_nielsen_graph()

Convert active SMT constraints to nielsen_graph format:

void theory_nseq::populate_nielsen_graph() {
    for (auto& eq : m_state.eqs()) {
        euf::snode* lhs = m_sgraph.mk(eq.first);
        euf::snode* rhs = m_sgraph.mk(eq.second);
        m_nielsen.add_str_eq(lhs, rhs);
    }
    for (auto& mem : m_state.mems()) {
        euf::snode* str = m_sgraph.mk(mem.str);
        euf::snode* regex = m_sgraph.mk(mem.regex);
        m_nielsen.add_str_mem(str, regex);
    }
}

5.6.3 theory_nseq::propagate()

Propagate consequences from Nielsen solving to the SMT context:

  • New equalities (from substitutions) → ctx.assign_eq()
  • New disequalities → ctx.assign_diseq()
  • New literals (regex memberships) → ctx.assign()
  • Length constraints → ctx.assign() for arithmetic literals

5.7 Phase 7: Model Generation

5.7.1 nseq_model (src/smt/nseq_model.h/.cpp)

Build concrete string values from solved Nielsen graph:

class nseq_model {
    // Extract model assignments from a satisfied nielsen_node
    void extract_assignments(nielsen_node const& leaf,
                            obj_map<expr, expr*>& assignments);

    // Generate fresh string values for unconstrained variables
    expr_ref mk_fresh_value(sort* s, seq_factory& factory);

    // Ensure consistency with regex memberships
    bool validate_regex(expr* str_val, expr* regex);
};

Model construction algorithm:

  1. From the satisfied leaf node, collect all substitutions along the path from root
  2. Compose substitutions to get final variable assignments
  3. For unconstrained variables, use seq_factory::get_fresh_value()
  4. Register with model_generator via init_model() / finalize_model()

6. Alternatives Considered

Option Pros Cons Decision
Extend theory_seq with Nielsen Single solver, shared infrastructure Massive refactor, high risk to production solver Rejected: Too risky to production
Port ZIPT as standalone user propagator Minimal Z3 changes Poor integration, no sharing of Z3 infrastructure, slower Rejected: Misses optimization opportunities
New theory_nseq (selected) Clean separation, reuses infrastructure, safe coexistence New code to maintain, some duplication Selected: Best balance of risk and benefit
PDD for Parikh (full ZIPT port) Exact polynomial reasoning Complex, Z3 has lp::int_solver already Rejected: Use lp::int_solver instead of porting ZIPT's PDD subsolver

7. Cross-Cutting Concerns

7.1 Backtracking / Scope Management

theory_nseq must correctly implement push_scope_eh() / pop_scope_eh():

  • nseq_state uses scoped_vector (same as theory_seq)
  • sgraph/egraph have their own push()/pop() — must be synchronized
  • nielsen_graph is rebuilt from scratch on each final_check_eh() (no incremental state), so no backtracking needed for the graph itself

7.2 Conflict Clause Quality

Conflict explanations must be minimal for effective CDCL learning:

  • dep_tracker bitvectors track which input constraints are responsible
  • On conflict, extract the minimal set of input constraint IDs
  • Map these back to Z3 literals for the conflict clause

7.3 Interaction with Arithmetic Solver

Length constraints create coupling between theory_nseq and the arithmetic solver:

  • Use arith_value to query current arithmetic assignments (same pattern as theory_seq)
  • Propagate derived length equalities as theory lemmas
  • May need to handle FC_CONTINUE loops when arithmetic and string solvers ping-pong

7.4 Statistics

collect_statistics() should track:

  • Number of Nielsen graph nodes explored
  • Number of conflicts found
  • Number of depth bound increases
  • Number of regex derivatives computed
  • Number of length constraints generated

8. Migration, Rollout, and Testing

8.1 Deployment Strategy

  • Phase 1: theory_nseq returns FC_GIVEUP for all inputs — proves integration works
  • Phase 2: Handle trivial word equations (constant strings, simple variables)
  • Phase 3: Full modifier chain — handle the string regression suite
  • Phase 4: Regex support — handle re*.smt2 regression tests
  • Phase 5: Performance tuning — compare against theory_seq on SMT-COMP benchmarks

8.2 Test Plan

Unit Tests (C++ tests in src/test/)

Test File Scope
TST(nseq_basic) src/test/nseq_basic.cpp theory_nseq instantiation, parameter selection, trivial sat/unsat
TST(seq_nielsen) src/test/seq_nielsen.cpp (existing) Extend with simplification, modifier, and search tests
TST(euf_sgraph) src/test/euf_sgraph.cpp (existing) Already passing; extend if sgraph changes are needed

Regression Tests (SMT2 via z3 command line)

Use C:\z3test\regressions\smt2\ (the same suite as nightly CI):

  1. String-specific files with smt.string_solver=nseq:

    • string-concat.smt2, string-eval.smt2, string-itos.smt2, string-literals.smt2, string-ops.smt2, string-simplify.smt2
    • string3.smt2, string6.smt2, string7.smt2, string9.smt2, string11.smt2string14.smt2
    • re.smt2, re_rewriter.smt2
  2. Comparison testing: Run same files with both smt.string_solver=seq and smt.string_solver=nseq, verify sat/unsat agreement (no soundness bugs).

  3. Full regression suite: Ensure smt.string_solver=nseq doesn't break non-string tests.

Targeted SMT2 Tests (New)

Create focused test files in tests/nseq/:

  • eq_basic.smt2 — Simple word equations
  • eq_nielsen.smt2 — Equations requiring Nielsen transformations
  • length.smt2 — Length constraints
  • regex.smt2 — Regex membership
  • mixed.smt2 — Combined string + arithmetic
  • unsat.smt2 — Unsatisfiable benchmarks

9. Design Decisions (Resolved)

  • Q1: egraph ownershiptheory_nseq creates its own private egraph/sgraph, matching the ZIPT pattern where the string manager is independent. This avoids interference with other theories and keeps the string representation self-contained.

  • Q2: Parikh implementation approach — Use Z3's built-in lp::int_solver (src/math/lp/int_solver.h) for integer subproblems arising from Parikh/length constraints. Generate linear length constraints from word equations and regex membership, propagate them as theory lemmas via arith_value. ZIPT's PDD subsolver is not portedlp::int_solver directly replaces it.

  • Q3: Nielsen graph persistence — The Nielsen graph will be maintained incrementally with backtracking support. The graph persists across final_check_eh() calls with push/pop scope synchronization. This avoids redundant re-exploration of previously visited nodes when the SMT solver backtracks.

  • Q4: Modifier chain completeness — Implement an essential subset of 5 modifiers first: DetModifier, EqSplitModifier, ConstNielsenModifier, VarNielsenModifier, RegexCharSplitModifier. These cover basic word equations and simple regex membership. The remaining 8 modifiers (PowerEpsilon, NumCmp, ConstNumUnwinding, StarIntr, GPowerIntr, RegexVarSplit, PowerSplit, VarNumUnwinding) will be added incrementally in subsequent phases.

  • Q5: Generic sequencestheory_nseq handles all Seq[T] types from the start, not just strings. Nielsen transformations work on any element type. Regex membership is string-specific but the word equation engine is generic.

  • Q6: Higher-order opsAxiomatize minimally for seq.map, seq.mapi, seq.foldl, seq.foldli. Generate basic axioms (similar to theory_seq's approach) without deep reasoning. This ensures theory_nseq doesn't immediately give up on problems containing these operations.

  • Q7: Disequality handling — Use separate tracking like ZIPT. Port ZIPT's character disequality and string disequality tracking into the nielsen_graph data structures. This gives the modifier chain direct access to disequality information (particularly needed for character minterm computation in RegexCharSplitModifier).

10. File Summary

New Files

File Phase Description
src/smt/theory_nseq.h 1 Theory class header
src/smt/theory_nseq.cpp 1 Theory class implementation
src/smt/nseq_state.h 2 Constraint store header
src/smt/nseq_state.cpp 2 Constraint store implementation
src/smt/nseq_regex.h 5 Regex handling header
src/smt/nseq_regex.cpp 5 Regex handling implementation
src/smt/nseq_model.h 7 Model generation header
src/smt/nseq_model.cpp 7 Model generation implementation
src/test/nseq_basic.cpp 1 Unit tests

Files to Extend

File Phase Change
src/smt/seq/seq_nielsen.h 3 Add simplification, modifier chain, search methods
src/smt/seq/seq_nielsen.cpp 3 Implement above methods

Files to Modify (Integration)

File Phase Change
src/smt/smt_setup.h 1 Declare setup_nseq()
src/smt/smt_setup.cpp 1 Add "nseq" dispatch, implement setup_nseq()
src/params/smt_params.cpp 1 Add "nseq" to validate_string_solver()
src/params/smt_params_helper.pyg 1 Document nseq option
src/smt/CMakeLists.txt 1 Add new .cpp files to SOURCES
src/test/main.cpp 1 Register TST(nseq_basic)
src/test/CMakeLists.txt 1 Add nseq_basic.cpp

11. Implementation Order

  1. Phase 1 — Skeleton (theory_nseq class, parameter integration, build system) — compilable stub returning FC_GIVEUP
  2. Phase 2 — State management (nseq_state) — constraint store with backtracking
  3. Phase 3 — Nielsen algorithms (simplification, modifiers, search, conflict) — core solver
  4. Phase 4 — Integer constraint reasoning (lp::int_solver) — length, power, Parikh
  5. Phase 5 — Regex membership (nseq_regex) — lazy derivatives
  6. Phase 6 — Propagation wiring — connect all components in final_check_eh
  7. Phase 7 — Model generation (nseq_model) — satisfying assignments