diff --git a/research/docs/2026-03-03-zipt-string-theory-codebase-analysis.md b/research/docs/2026-03-03-zipt-string-theory-codebase-analysis.md new file mode 100644 index 000000000..28b17a0d4 --- /dev/null +++ b/research/docs/2026-03-03-zipt-string-theory-codebase-analysis.md @@ -0,0 +1,667 @@ +--- +date: 2026-03-03 20:46:00 UTC +researcher: Copilot CLI +git_commit: fb674ac5b2e722dfe1086c7d952054a0bc11c963 +branch: c3 +repository: c3 +topic: "ZIPT-related string theory components: sgraph, seq_plugin, Nielsen graphs, theory_seq, seq_rewriter, axioms" +tags: [research, codebase, zipt, sgraph, seq_plugin, nielsen, theory_seq, seq_rewriter, seq_axioms, seq_decl_plugin, strings, regex] +status: complete +last_updated: 2026-03-03 +last_updated_by: Copilot CLI +--- + +# Research: ZIPT-Related String Theory Components in Z3 + +## Research Question + +Examine the Z3 codebase and specifications in the `spec/` directory, focusing on code related to the ZIPT repository (https://github.com/CEisenhofer/ZIPT/tree/parikh/ZIPT) and the files for sgraph, seq_plugin, Nielsen graphs, theory_seq, and associated files for sequences (seq_rewriter, axioms for seq). + +## Summary + +This document provides a comprehensive analysis of Z3's string/sequence theory infrastructure and its relationship to the ZIPT reference implementation. Z3 has **two layers** of string solving: + +1. **Existing production solver** (`theory_seq`): A mature, axiom-driven solver in `src/smt/` with 20+ final_check phases, regex derivatives, and Skolemization. +2. **New ZIPT-inspired infrastructure** (in progress): `sgraph` (string graph), `seq_plugin` (egraph plugin), and `nielsen_graph` (Nielsen transformations) — implementing the ZIPT approach of Nielsen-style word equation solving with lazy regex membership and Parikh image reasoning. + +The spec directory (`spec/plan1.md`) describes the goal: implement `theory_nseq` as a new theory solver using the ZIPT algorithm, selectable via `smt.string_solver=nseq`. No `theory_nseq` files exist yet — only the foundational data structures (sgraph, snode, seq_plugin, nielsen_graph) are implemented and tested. + +--- + +## Detailed Findings + +### 1. ZIPT Reference Implementation (C#) + +**Repository**: https://github.com/CEisenhofer/ZIPT (parikh branch) + +The ZIPT solver implements Nielsen-style word equation solving with lazy regex membership and Parikh image analysis. It is a Z3 user-propagator written in C#. + +#### Architecture + +| Component | File(s) | Purpose | +|-----------|---------|---------| +| `StringPropagator` | `StringPropagator.cs` | Base user-propagator; handles Z3 callbacks (Fixed, Created, Eq, DisEq) | +| `NielsenGraph` | `Constraints/NielsenGraph.cs` | Search graph with iterative deepening DFS | +| `NielsenNode` | `Constraints/NielsenNode.cs` | Node = set of constraints; supports simplification and extension | +| `NielsenEdge` | `Constraints/NielsenEdge.cs` | Edge = substitutions + side constraints | +| `StrManager` | `Strings/StrManager.cs` | String canonicalization, concatenation, substitution | +| `PDD` | `IntUtils/PDD.cs` | Polynomial Decision Diagrams for integer reasoning | +| `Interval` | `IntUtils/Interval.cs` | Interval arithmetic for bounds | + +#### Constraint Types + +- **StrEq**: Word equations `LHS = RHS` (regex-free) +- **StrMem**: Regex membership `str ∈ L(regex)` with history tracking +- **IntEq**: Polynomial equality `P = 0` (over length variables) +- **IntLe**: Polynomial inequality `P ≤ 0` +- **IntNonEq**: Polynomial disequality `P ≠ 0` +- **Character ranges**: Symbolic char ∈ [a-z] +- **Character disequalities**: c₁ ≠ c₂ + +#### Modifier Chain (Priority Order) + +1. **DetModifier** — Deterministic simplifications (highest priority) +2. **PowerEpsilonModifier** — Handle empty power: `u = "" || n = 0` +3. **NumCmpModifier** — Numeric comparisons: `n ≤ m || n > m` +4. **ConstNumUnwindingModifier** — Constant power unwinding: `n = 0 || n > 0` +5. **EqSplitModifier** — Split equality into prefix/suffix parts +6. **StarIntrModifier** — Introduce stabilizer `x ∈ base*` +7. **GPowerIntrModifier** — Introduce power `x := u^n prefix(u)` +8. **ConstNielsenModifier** — Nielsen split against constant: `x := ax || x := ""` +9. **RegexCharSplitModifier** — Character-level regex split via minterms +10. **RegexVarSplitModifier** — Variable-level regex split: `x / "" || x / cx` +11. **PowerSplitModifier** — Power unwinding +12. **VarNielsenModifier** — Variable Nielsen split (lowest priority) +13. **VarNumUnwindingModifier** — Variable power unwinding + +#### Key Design Patterns + +- **Lazy regex evaluation**: Only consume characters when concrete +- **Canonicalization**: All strings/polynomials are canonicalized +- **Parikh constraints**: Length reasoning via PDD polynomials + interval arithmetic +- **Iterative deepening**: Depth-bounded search with increasing limits +- **Subsumption**: Avoid redundant branches via node subsumption checking +- **Modification counts**: Version string variables for Z3 (x → x_0, x_1, ...) + +--- + +### 2. sgraph — String Graph (`src/ast/euf/`) + +**Files**: `euf_sgraph.h`, `euf_sgraph.cpp`, `euf_snode.h` + +The sgraph maps Z3 AST expressions to `snode` (string node) structures organized as binary concatenation trees. This is the Z3-native equivalent of ZIPT's `StrManager`. + +#### snode Class (`euf_snode.h`) + +**snode_kind enum** (16 kinds): + +| Kind | AST Op | Description | +|------|--------|-------------| +| `s_empty` | `OP_SEQ_EMPTY` | Empty string | +| `s_char` | `OP_SEQ_UNIT` (literal) | Concrete character | +| `s_var` | uninterpreted const | String variable | +| `s_unit` | `OP_SEQ_UNIT` (non-literal) | Generic unit | +| `s_concat` | `OP_SEQ_CONCAT` | Binary concatenation | +| `s_power` | `OP_SEQ_POWER` | String exponentiation s^n | +| `s_star` | `OP_RE_STAR` | Kleene star r* | +| `s_loop` | `OP_RE_LOOP` | Bounded loop r{lo,hi} | +| `s_union` | `OP_RE_UNION` | Union r₁\|r₂ | +| `s_intersect` | `OP_RE_INTERSECT` | Intersection r₁&r₂ | +| `s_complement` | `OP_RE_COMPLEMENT` | Complement ~r | +| `s_fail` | `OP_RE_EMPTY_SET` | Empty language ∅ | +| `s_full_char` | `OP_RE_FULL_CHAR_SET` | Full character set . | +| `s_full_seq` | `OP_RE_FULL_SEQ_SET` | Full sequence set .* | +| `s_to_re` | `OP_SEQ_TO_RE` | String to regex | +| `s_in_re` | `OP_SEQ_IN_RE` | Regex membership | +| `s_other` | (default) | Other expression | + +**Key Members**: +- `expr* m_expr` — Z3 AST expression +- `snode_kind m_kind` — Classification +- `unsigned m_id` — Unique ID +- `bool m_ground` — No uninterpreted variables in subtree +- `bool m_regex_free` — No regex constructs in subtree +- `bool m_nullable` — Accepts empty string +- `unsigned m_level` — Tree depth +- `unsigned m_length` — Token count (leaf count) +- `unsigned m_hash_matrix[2][2]` — 2×2 polynomial hash matrix for O(1) associative hashing +- `snode_subst_cache* m_subst_cache` — ZIPT-style substitution cache +- `snode* m_args[0]` — Flexible array of children + +**Hash Matrix** (`HASH_BASE = 31`): +- Empty: identity matrix `[[1,0],[0,1]]` +- Leaf: `[[31, expr_id+1], [0, 1]]` +- Concat: matrix multiplication `M(left) * M(right)` +- Hash value: `m_hash_matrix[0][1]` (assoc_hash) + +**Navigation**: `first()`, `last()`, `at(i)`, `collect_tokens()`, `is_token()` (everything except empty and concat) + +#### sgraph Class (`euf_sgraph.h/cpp`) + +**Constructor**: `sgraph(ast_manager& m, egraph& eg, bool add_plugin = true)` +- Stores reference to egraph (not owned) +- If `add_plugin=true`, creates and registers `seq_plugin` +- Registers `on_make` callback with egraph: when any enode is created with seq/re sort, automatically creates corresponding snode + +**Core Operations**: + +| Method | Description | +|--------|-------------| +| `mk(expr* e)` | Create snode from expression (idempotent) | +| `find(expr* e)` | Lookup existing snode | +| `mk_var(symbol)` | Create variable snode | +| `mk_char(unsigned)` | Create character snode | +| `mk_empty()` | Create empty snode | +| `mk_concat(a, b)` | Create concat (with identity elimination) | +| `drop_first/last(n)` | Remove first/last token | +| `drop_left/right(n, k)` | Remove k tokens from left/right | +| `subst(n, var, repl)` | Substitute variable with replacement (cached) | +| `brzozowski_deriv(re, elem)` | Regex derivative via seq_rewriter | +| `compute_minterms(re, out)` | Extract character classes (capped at 20 predicates) | +| `push()/pop(n)` | Scope management synchronized with egraph | + +**Substitution Caching**: Each snode has lazy-allocated `snode_subst_cache` storing `(var_id, repl_id) → result` mappings, avoiding redundant substitution computations (ZIPT pattern). + +--- + +### 3. seq_plugin — Egraph Plugin (`src/ast/euf/`) + +**Files**: `euf_seq_plugin.h`, `euf_seq_plugin.cpp` + +The seq_plugin implements equality saturation for sequence/regex concatenation in the egraph. It handles associativity, identity/absorption rules, and Kleene star simplifications. + +#### Class Hierarchy + +``` +euf::plugin (abstract base) + └── euf::seq_plugin +``` + +**Constructor**: `seq_plugin(egraph& g, sgraph* sg = nullptr)` +- If `sg` is null, creates and owns a new sgraph +- Initializes `seq_util`, `seq_rewriter`, hash table with custom functors + +#### Concat Handling + +| Predicate | String | Regex | +|-----------|--------|-------| +| Concat | `is_str_concat()` / `OP_SEQ_CONCAT` | `is_re_concat()` / `OP_RE_CONCAT` | +| Identity | `is_str_empty()` / ε | `is_re_epsilon()` / to_re("") | +| Absorbing | (none) | `is_re_empty()` / ∅ | + +**Simplification Rules** (from `propagate_simplify`): + +1. **Kleene star merging**: `concat(v*, v*) = v*` when bodies are congruent (root equality) +2. **Extended star merging**: `concat(v*, concat(v*, c)) = concat(v*, c)` +3. **Nullable absorption**: `concat(.*, v) = .*` when v is nullable +4. **Symmetric nullable absorption**: `concat(v, .*) = .*` when v is nullable + +**Associativity** (from `propagate_assoc`): +- Maintains `hashtable` that respects associativity +- Hash: Uses sgraph's cached snode hash matrices for O(1) hashing, or falls back to leaf-based hashing +- Equality: Compares flattened leaf sequences pointwise +- When duplicate found, merges via `push_merge` + +**Nullability**: Delegates to `m_rewriter.is_nullable(e)` (from `seq_rewriter`) + +**Key Relationship**: seq_plugin does NOT create snodes. The sgraph's `on_make` callback handles snode creation. seq_plugin only observes egraph events and triggers merges. + +--- + +### 4. Nielsen Graph (`src/smt/seq/`) + +**Files**: `seq_nielsen.h`, `seq_nielsen.cpp`, `CMakeLists.txt` + +The Nielsen graph implements word equation solving via Nielsen transformations. This is the Z3-native port of ZIPT's `NielsenGraph`, `NielsenNode`, `NielsenEdge`, and constraint types. + +#### Classes + +##### dep_tracker (`seq_nielsen.h:81-94`) +Bitvector-based dependency tracker for conflict explanation. +- `merge(other)`: Bitwise OR +- `is_superset(other)`: All bits of other present +- `is_empty()`: No bits set + +##### str_eq (`seq_nielsen.h:98-119`) +Word equation `lhs = rhs` (both regex-free snode trees). +- `sort()`: Canonicalize by snode ID +- `is_trivial()`: Same snode or both empty +- `contains_var(var)`: Check if variable appears in lhs or rhs + +##### str_mem (`seq_nielsen.h:123-143`) +Regex membership `str ∈ L(regex)`. +- `m_history`: Consumed prefix for cycle detection +- `m_id`: Unique identifier +- `is_primitive()`: True when str is a single variable + +##### nielsen_subst (`seq_nielsen.h:147-162`) +Variable substitution `var → replacement`. +- `is_eliminating()`: True when variable doesn't appear in replacement (progress) + +##### nielsen_edge (`seq_nielsen.h:166-195`) +Directed edge in the Nielsen graph. +- `m_subst`: Vector of substitutions applied +- `m_side_str_eq/m_side_str_mem`: Side constraints added +- `m_is_progress`: Whether this is an eliminating step + +##### nielsen_node (`seq_nielsen.h:199-270`) +Node = set of constraints at a point in the search. +- `m_str_eq`: String equality constraints +- `m_str_mem`: Regex membership constraints +- `m_outgoing`: Outgoing edges +- `m_backedge`: For cycle detection +- `m_is_general_conflict`, `m_reason`: Conflict tracking +- `clone_from(parent)`: Copy constraints +- `apply_subst(sg, s)`: Apply substitution to all constraints + +##### nielsen_graph (`seq_nielsen.h:274-325`) +The overall graph structure. +- `m_sg`: Reference to sgraph infrastructure +- `m_root`: Root node with initial constraints +- `m_run_idx`, `m_depth_bound`: For iterative deepening +- `mk_node()`, `mk_child(parent)`, `mk_edge(src, tgt, progress)` +- `add_str_eq(lhs, rhs)`, `add_str_mem(str, regex)`: Populate from external solver +- `reset()`, `inc_run_idx()`: Lifecycle management + +#### Enums + +**simplify_result**: `proceed`, `conflict`, `satisfied`, `restart`, `restart_and_satisfied` + +**backtrack_reason**: `unevaluated`, `extended`, `symbol_clash`, `parikh_image`, `subsumption`, `arithmetic`, `regex`, `regex_widening`, `character_range`, `smt`, `children_failed` + +#### Build Configuration + +`src/smt/seq/CMakeLists.txt`: Component `smt_seq`, depends on `euf` and `rewriter`. + +#### Current Status + +The Nielsen graph provides **data structures and infrastructure** — node/edge/constraint/substitution management, dependency tracking, backtracking support, display. The **actual transformation algorithms** (modifier chain, graph expansion, conflict detection, Parikh image analysis) are not yet implemented. Tests cover all data structure operations but not algorithmic behavior. + +--- + +### 5. theory_seq — Existing String Theory Solver (`src/smt/`) + +**Files**: `theory_seq.h`, `theory_seq.cpp`, `theory_seq_empty.h` + +The production string solver implementing `theory` interface. Uses axiom-driven solving with 20+ cascading phases. + +#### Class Structure + +``` +smt::theory (base) +seq::eq_solver_context (interface) + └── smt::theory_seq +``` + +**Component Submodules** (declared in `theory_seq.h:356-366`): +- `seq::skolem m_sk` — Skolemization +- `seq_axioms m_ax` — Axiom generation (bridges rewriter and SMT layers) +- `seq::eq_solver m_eq` — Core equality solver (solver-agnostic) +- `seq_regex m_regex` — Regular expression handling +- `seq_offset_eq m_offset_eq` — Length offset tracking +- `th_union_find m_find` — Union-find for theory variables +- `dependency_manager m_dm` — Justification tracking +- `solution_map m_rep` — Variable → representative mapping + +#### Final Check Cascade (20 phases) + +``` +1. simplify_and_solve_eqs() — Solve unitary equalities +2. check_lts() — Lexicographic ordering +3. solve_nqs(0) — Disequalities +4. m_regex.propagate() — Regex propagation +5. check_contains() — Contains constraints +6. check_fixed_length(true,false)— Zero-length sequences +7. len_based_split() — Length-based splitting +8. check_fixed_length(false,false)— Fixed-length expansion +9. check_int_string() — int/string conversions +10. check_ubv_string() — bitvector/string conversions +11. reduce_length_eq() — Reduce by length equality +12. branch_unit_variable() — Branch on unit variables +13. branch_binary_variable() — Branch on binary patterns +14. branch_variable() — General variable branching +15. check_length_coherence() — Length coherence +16. check_extensionality() — Extensionality +17. branch_nqs() — Branch on disequalities +18. branch_itos() — Branch on int-to-string +19. check_fixed_length(false,true)— Fixed-length (long strings) +20. solve_recfuns() — Recursive functions +``` + +Returns `FC_CONTINUE` if progress, `FC_GIVEUP` if stuck, `FC_DONE` if solved. + +#### String Solver Selection (`smt_setup.cpp`) + +Parameter: `smt.string_solver` +- `"seq"` → `setup_seq()` → `alloc(smt::theory_seq, m_context)` +- `"empty"` → `setup_seq()` (with empty variant) +- `"none"` → No solver +- `"auto"` → `setup_seq()` +- `"nseq"` → **Not yet implemented** (planned per spec) + +#### Key Data Structures + +- `solution_map m_rep` — Maps variables to representatives (with normalization cache) +- `scoped_vector m_eqs` — Active equations +- `scoped_vector m_nqs` — Disequalities +- `scoped_vector m_ncs` — Negated contains +- `exclusion_table m_exclude` — Asserted disequalities +- `expr_ref_vector m_axioms` — Axiom queue (with deduplication) + +--- + +### 6. seq_rewriter — Sequence Rewriter (`src/ast/rewriter/`) + +**Files**: `seq_rewriter.h`, `seq_rewriter.cpp` + +Provides local syntactic simplification rules for all string/sequence operations. Entry point: `mk_app_core(func_decl* f, ...)`. + +#### Rewrite Categories + +| Category | Key Methods | Description | +|----------|-------------|-------------| +| Construction | `mk_seq_unit`, `mk_seq_concat` | Coalesce chars, right-associate, eliminate empty | +| Extraction | `mk_seq_extract`, `mk_seq_at`, `mk_seq_nth` | Substring, index access | +| Length | `mk_seq_length` | Compute length, simplify for concat/replace/extract | +| Search | `mk_seq_contains`, `mk_seq_index`, `mk_seq_last_index` | Containment, index-of | +| Replacement | `mk_seq_replace`, `mk_seq_replace_all`, `mk_seq_replace_re*` | String replacement | +| Comparison | `mk_str_le`, `mk_str_lt`, `mk_seq_prefix`, `mk_seq_suffix` | Lexicographic, prefix/suffix | +| Conversion | `mk_str_itos`, `mk_str_stoi`, `mk_str_from_code`, `mk_str_to_code` | Type conversions | +| Higher-order | `mk_seq_map`, `mk_seq_mapi`, `mk_seq_foldl`, `mk_seq_foldli` | Map, fold | +| Regex | `mk_re_concat`, `mk_re_union`, `mk_re_inter`, `mk_re_star`, etc. | Regex construction | +| Derivative | `mk_re_derivative` → `mk_antimirov_deriv` | Antimirov-style regex derivative | +| Nullability | `is_nullable` → `is_nullable_rec` | Nullable check for regex | + +#### Regex Derivative System + +Uses **Antimirov derivatives** with a BDD-like normal form: + +1. **Top level**: Union of Antimirov derivatives using `_OP_RE_ANTIMIROV_UNION` +2. **Middle level**: Nested if-then-else sorted by condition ID +3. **Leaf level**: Regular regex unions + +Key derivative rules: +- `D(e, r₁·r₂) = D(e,r₁)·r₂ ∪ (nullable(r₁) ? D(e,r₂) : ∅)` +- `D(e, r₁∪r₂) = D(e,r₁) ∪ D(e,r₂)` +- `D(e, r*) = D(e,r)·r*` +- `D(e, ¬r) = ¬D(e,r)` +- `D(e, [lo-hi]) = ε if lo ≤ e ≤ hi else ∅` + +Validated by `check_deriv_normal_form` in debug mode. + +--- + +### 7. Axiom System (`src/ast/rewriter/` + `src/smt/`) + +Two layers: **rewriter-level** (solver-independent) and **SMT-level** (bridges to Z3 context). + +#### Rewriter-Level Axioms (`seq::axioms`, `src/ast/rewriter/seq_axioms.h/cpp`) + +Solver-agnostic axiom generation via callbacks. Key axioms: + +| Operation | Axiom Summary | +|-----------|---------------| +| `extract(s,i,l)` | `0≤i≤\|s\| ∧ 0≤l → s=x·e·y ∧ \|x\|=i ∧ \|e\|=min(l,\|s\|-i)` | +| `indexof(t,s,off)` | `contains(t,s) → t=x·s·y ∧ indexof=\|x\|` + tightest prefix | +| `replace(u,s,t)` | `contains(u,s) → u=x·s·y ∧ r=x·t·y` | +| `at(s,i)` | `0≤i<\|s\| → s=x·e·y ∧ \|x\|=i ∧ \|e\|=1` | +| `itos(n)` | `n≥0 → stoi(itos(n))=n ∧ no leading zeros` | +| `stoi(s)` | `stoi(s)≥-1 ∧ stoi("")=-1` | + +Introduced via Skolem functions (see below). + +#### SMT-Level Axioms (`smt::seq_axioms`, `src/smt/seq_axioms.h/cpp`) + +Wraps `seq::axioms` with literal creation, clause addition, and phase forcing for Z3's SMT context. + +--- + +### 8. Skolemization (`src/ast/rewriter/seq_skolem.h/cpp`) + +`seq::skolem` introduces auxiliary Skolem functions for structural reasoning: + +| Skolem | Symbol | Purpose | +|--------|--------|---------| +| `seq.pre` | `m_pre` | Prefix of s up to index i | +| `seq.post` | `m_post` | Suffix of s from index i | +| `seq.tail` | `m_tail` | Rest after first element | +| `seq.first` | `m_seq_first` | First element | +| `seq.last` | `m_seq_last` | Last element | +| `seq.idx.l/r` | `m_indexof_left/right` | Decomposition for indexof: t = left·s·right | +| `seq.lidx.l/r` | `m_lindexof_left/right` | Decomposition for last_indexof | +| `seq.cnt.l/r` | — | Decomposition for contains | +| `aut.accept` | `m_accept` | Regex automaton acceptance | +| `aut.step` | `m_aut_step` | Automaton state transition | +| `seq.eq` | `m_eq` | Special equality predicate | +| `seq.length_limit` | `m_length_limit` | Length bound | + +--- + +### 9. seq_decl_plugin — Declaration Plugin (`src/ast/`) + +**Files**: `seq_decl_plugin.h`, `seq_decl_plugin.cpp` + +Provides the theory declarations for sequences, strings, and regular expressions. + +#### Operation Kinds (`seq_op_kind` enum) + +**Sequence**: `OP_SEQ_UNIT`, `OP_SEQ_EMPTY`, `OP_SEQ_CONCAT`, `OP_SEQ_PREFIX`, `OP_SEQ_SUFFIX`, `OP_SEQ_CONTAINS`, `OP_SEQ_EXTRACT`, `OP_SEQ_REPLACE`, `OP_SEQ_AT`, `OP_SEQ_NTH`, `OP_SEQ_NTH_I`, `OP_SEQ_NTH_U`, `OP_SEQ_LENGTH`, `OP_SEQ_INDEX`, `OP_SEQ_LAST_INDEX`, `OP_SEQ_TO_RE`, `OP_SEQ_IN_RE`, `OP_SEQ_REPLACE_RE_ALL`, `OP_SEQ_REPLACE_RE`, `OP_SEQ_REPLACE_ALL`, `OP_SEQ_MAP`, `OP_SEQ_MAPI`, `OP_SEQ_FOLDL`, `OP_SEQ_FOLDLI`, **`OP_SEQ_POWER`** + +**Regex**: `OP_RE_PLUS`, `OP_RE_STAR`, `OP_RE_OPTION`, `OP_RE_RANGE`, `OP_RE_CONCAT`, `OP_RE_UNION`, `OP_RE_DIFF`, `OP_RE_INTERSECT`, `OP_RE_LOOP`, `OP_RE_POWER`, `OP_RE_COMPLEMENT`, `OP_RE_EMPTY_SET`, `OP_RE_FULL_SEQ_SET`, `OP_RE_FULL_CHAR_SET`, `OP_RE_OF_PRED`, `OP_RE_REVERSE`, `OP_RE_DERIVATIVE` + +**String-specific**: `OP_STRING_CONST`, `OP_STRING_ITOS`, `OP_STRING_STOI`, `OP_STRING_UBVTOS`, `OP_STRING_SBVTOS`, `OP_STRING_LT`, `OP_STRING_LE`, `OP_STRING_IS_DIGIT`, `OP_STRING_TO_CODE`, `OP_STRING_FROM_CODE` + +#### seq_util Utility Class + +Nested classes for operation construction/recognition: +- `seq_util::str` — String/sequence operations (`mk_concat`, `mk_length`, `mk_empty`, `is_concat`, `is_empty`, `mk_power`, `is_power`, etc.) +- `seq_util::rex` — Regex operations (`mk_star`, `mk_union`, `mk_inter`, `mk_complement`, `mk_concat`, `is_star`, `is_concat`, `is_nullable`, etc.) + +#### Sort System + +- `SEQ_SORT` — Generic Seq[T] +- `RE_SORT` — RegEx[Seq[T]] +- `_STRING_SORT` — Seq[Char] (string) +- `_REGLAN_SORT` — RegEx[String] + +--- + +### 10. Sequence Equality Solver (`src/ast/rewriter/seq_eq_solver.h/cpp`) + +Solver-agnostic equality inference for sequence equations: + +- `reduce_unit`: Handle unit equations +- `reduce_itos1/2/3`: Handle `itos(s) = itos(t)`, `itos(s) = ""`, `itos(n) = "d1d2...dk"` +- `reduce_binary_eq`: Handle `x·xs = ys·y` patterns +- `reduce_nth_solved`: Solve nth equations +- `branch_unit_variable`: Split variable as empty or non-empty + +--- + +### 11. Test Infrastructure + +#### Test Files + +| File | Tests | Lines | Component | +|------|-------|-------|-----------| +| `src/test/euf_sgraph.cpp` | 18 | 769 | sgraph: classification, metadata, backtracking, navigation, substitution, derivatives, minterms | +| `src/test/euf_seq_plugin.cpp` | 7 | 242 | seq_plugin: associativity, identity elimination, star merging, nullable absorption, egraph sync | +| `src/test/seq_nielsen.cpp` | 13 | 480 | Nielsen: dep_tracker, str_eq, str_mem, subst, node, edge, graph population, expansion, backedge | + +#### Registration + +- `src/test/main.cpp`: `TST(euf_sgraph)` (line 284), `TST(euf_seq_plugin)` (line 285), `TST(seq_nielsen)` (line 289) +- `src/test/CMakeLists.txt`: All three .cpp files included; depends on `smt_seq` +- Build: `ninja test-z3`, Run: `./test-z3 euf_sgraph`, `./test-z3 euf_seq_plugin`, `./test-z3 seq_nielsen` + +#### Test Setup Pattern + +```cpp +ast_manager m; +reg_decl_plugins(m); +euf::egraph eg(m); +euf::sgraph sg(m, eg); // Creates seq_plugin automatically +seq_util seq(m); +sort_ref str_sort(seq.str.mk_string_sort(), m); +// For nielsen: +seq::nielsen_graph ng(sg); +``` + +--- + +## Code References + +### ZIPT Infrastructure (New) + +| File | Description | +|------|-------------| +| `src/ast/euf/euf_snode.h` | snode class, snode_kind enum, hash matrix, subst cache | +| `src/ast/euf/euf_sgraph.h` | sgraph class header | +| `src/ast/euf/euf_sgraph.cpp` | sgraph implementation (classify, metadata, hash, subst, derivatives, minterms) | +| `src/ast/euf/euf_seq_plugin.h` | seq_plugin class header | +| `src/ast/euf/euf_seq_plugin.cpp` | seq_plugin implementation (assoc, simplify, nullable) | +| `src/smt/seq/seq_nielsen.h` | Nielsen graph classes (str_eq, str_mem, subst, edge, node, graph) | +| `src/smt/seq/seq_nielsen.cpp` | Nielsen graph implementation | +| `src/smt/seq/CMakeLists.txt` | smt_seq component build config | + +### Existing String Theory (Production) + +| File | Description | +|------|-------------| +| `src/smt/theory_seq.h` | theory_seq class with 20-phase final_check | +| `src/smt/theory_seq.cpp` | Full theory solver implementation | +| `src/smt/theory_seq_empty.h` | Empty/stub variant | +| `src/smt/seq_axioms.h/cpp` | SMT-level axiom bridge | +| `src/smt/seq_regex.h/cpp` | Regex handling for theory_seq | +| `src/smt/seq_eq_solver.cpp` | SMT-level equality solving | +| `src/smt/seq_ne_solver.cpp` | Disequality solving | +| `src/smt/seq_offset_eq.h/cpp` | Length offset equality | +| `src/smt/smt_setup.h/cpp` | String solver selection/instantiation | + +### Rewriter Layer (Shared) + +| File | Description | +|------|-------------| +| `src/ast/rewriter/seq_rewriter.h/cpp` | Rewrite rules + regex derivatives | +| `src/ast/rewriter/seq_axioms.h/cpp` | Solver-independent axiom generation | +| `src/ast/rewriter/seq_skolem.h/cpp` | Skolem function introduction | +| `src/ast/rewriter/seq_eq_solver.h/cpp` | Solver-agnostic equality inference | + +### Declaration Layer + +| File | Description | +|------|-------------| +| `src/ast/seq_decl_plugin.h/cpp` | Sort/operation declarations, seq_util | +| `src/model/seq_factory.h` | Model value factory | +| `src/params/theory_seq_params.h/cpp` | Theory parameters | +| `src/params/seq_rewriter_params.pyg` | Rewriter parameters | + +### Tests + +| File | Description | +|------|-------------| +| `src/test/euf_sgraph.cpp` | 18 sgraph unit tests | +| `src/test/euf_seq_plugin.cpp` | 7 seq_plugin tests | +| `src/test/seq_nielsen.cpp` | 13 Nielsen graph tests | +| `src/test/main.cpp` | Test registration (lines 284, 285, 289) | +| `src/test/CMakeLists.txt` | Test build config (lines 54, 55, 134) | + +--- + +## Architecture Documentation + +### Component Dependency Graph + +``` +seq_decl_plugin (AST sort/op declarations) + ↓ +seq_util (utility wrapper) + ↓ +seq_rewriter (simplification rules, derivatives) + ↓ ↘ +seq_axioms (rewriter) seq_skolem + ↓ ↓ +seq_axioms (SMT) ←───────┘ + ↓ +┌────────────┬──────────────────────┐ +│ theory_seq │ theory_nseq (TODO) │ +│ (existing) │ (per spec/plan1.md) │ +├────────────┤──────────────────────┤ +│ seq_regex │ nielsen_graph ←──── sgraph ←── snode +│ seq_eq/ne │ (new ZIPT infra) seq_plugin +│ solution_map│ egraph +└────────────┴──────────────────────┘ +``` + +### ZIPT ↔ Z3 Component Mapping + +| ZIPT (C#) | Z3 (C++) | Location | +|------------|----------|----------| +| `StrManager` | `sgraph` + `snode` | `src/ast/euf/euf_sgraph.*`, `euf_snode.h` | +| `Str` (token tree) | `snode` (binary concat tree) | `src/ast/euf/euf_snode.h` | +| `StrEq` | `seq::str_eq` | `src/smt/seq/seq_nielsen.h:98` | +| `StrMem` | `seq::str_mem` | `src/smt/seq/seq_nielsen.h:123` | +| `NielsenGraph` | `seq::nielsen_graph` | `src/smt/seq/seq_nielsen.h:274` | +| `NielsenNode` | `seq::nielsen_node` | `src/smt/seq/seq_nielsen.h:199` | +| `NielsenEdge` | `seq::nielsen_edge` | `src/smt/seq/seq_nielsen.h:166` | +| `Subst` | `seq::nielsen_subst` | `src/smt/seq/seq_nielsen.h:147` | +| `DependencyTracker` | `seq::dep_tracker` | `src/smt/seq/seq_nielsen.h:81` | +| `SimplifyResult` | `seq::simplify_result` | `src/smt/seq/seq_nielsen.h:54` | +| `BacktrackReasons` | `seq::backtrack_reason` | `src/smt/seq/seq_nielsen.h:64` | +| `PDD` / `Interval` | (not yet ported) | — | +| Modifier chain | (not yet ported) | — | +| `StringPropagator` | `theory_nseq` (TODO) | — | + +### What Exists vs. What's Planned + +**Implemented**: +- ✅ sgraph + snode (string representation, metadata, hashing, substitution, derivatives, minterms) +- ✅ seq_plugin (associativity, identity, absorption, star merge, nullable absorption) +- ✅ Nielsen graph data structures (str_eq, str_mem, subst, edge, node, graph, dep_tracker) +- ✅ All unit tests passing (18 + 7 + 13 = 38 tests) + +**Not Yet Implemented** (per spec/plan1.md): +- ❌ `theory_nseq` skeleton and SMT integration +- ❌ Parameter integration (`smt.string_solver=nseq`) +- ❌ Nielsen transformation algorithms (modifier chain) +- ❌ Graph expansion / iterative deepening search +- ❌ Conflict detection and explanation +- ❌ Parikh image / PDD / Interval reasoning +- ❌ Regex splitting modifiers +- ❌ String propagator (propagation engine) +- ❌ Model generation +- ❌ Integration tests + +--- + +## Historical Context (from spec/) + +### spec/reference.md +Points to ZIPT repository (parikh branch) and two PDF papers: +- `LazyMemberships.pdf` — Lazy regex membership approach +- `ClemensTableau.pdf` — Additional tableau algorithms + +### spec/plan.md +Task description: Implement `theory_nseq` in `src/smt/`, add `smt.string_solver` parameter support, put self-contained utilities in `ast/rewriter/`. + +### spec/plan1.md +Detailed implementation plan with 7 phases: +1. Skeleton & Integration +2. Core Data Structures (state, constraints, dependencies) +3. Word Equation Solver (Nielsen transformations) +4. Length Constraint Integration (Parikh image) +5. Regex Membership +6. Propagation Engine +7. Model Generation + +--- + +## Open Questions + +1. **theory_nseq skeleton**: When will Phase 1 (skeleton + parameter integration) be implemented? +2. **Modifier chain priority**: Should Z3's modifier priorities match ZIPT's exactly, or be adjusted? +3. **PDD porting**: Will Z3 reuse existing polynomial infrastructure or port ZIPT's PDD? +4. **Regex derivatives**: Will `theory_nseq` reuse `seq_rewriter::mk_derivative` or implement ZIPT's approach? +5. **Integer solver integration**: Will `theory_nseq` use `arith_value` (like `theory_seq`) or a sub-solver (like ZIPT)? +6. **Backtracking scope**: How will Nielsen graph state interact with SMT solver backtracking? +7. **Performance target**: What benchmarks should `theory_nseq` match or exceed compared to `theory_seq`?