12 KiB
Implementation Plan: theory_nseq — New String Theory Solver for Z3
Problem Statement
Implement a new theory solver theory_nseq for strings/sequences in Z3's SMT solver framework (src/smt/). The solver is based on the algorithms described in LazyMemberships.pdf and ClemensTableau.pdf, with a reference C# implementation at ZIPT. The new solver should coexist with the existing theory_seq and be selectable via the smt.string_solver parameter.
Reference Architecture
The ZIPT reference implementation (C#) contains these key components:
- StringPropagator — core constraint propagation engine
- Nielsen Graph/Node/Edge — word equation solving via Nielsen transformations
- Constraint system — StrEq, StrMem (regex membership), StrContains, prefix/suffix, disequalities
- Modifier chain — decomposition, equation splitting, Nielsen transforms, regex splitting
- IntUtils (PDD, Interval) — Parikh image / integer constraint reasoning for lengths
- StrManager — string state tracking, canonicalization
Approach
Build theory_nseq as a new theory plugin in src/smt/, following the same patterns as theory_seq. Reuse existing Z3 infrastructure (seq_decl_plugin, seq_rewriter, seq_skolem, arith_value) and add new algorithm-specific modules. Self-contained rewriter utilities go in src/ast/rewriter/.
Module Breakdown
Phase 1: Skeleton & Integration
1.1 theory_nseq skeleton (src/smt/theory_nseq.h, src/smt/theory_nseq.cpp)
- New class
theory_nseq : public theory - Implement all required virtual methods from
smt::theory:internalize_atom,internalize_termnew_eq_eh,new_diseq_eh,assign_ehcan_propagate,propagatefinal_check_ehpush_scope_eh,pop_scope_ehmk_fresh,get_name(return"nseq")mk_value,init_model,finalize_modeldisplay,collect_statistics
- Initially stub all methods (FC_GIVEUP on final_check)
- Use
seq_utilandarith_utilfrom existing infrastructure - Hold references to
seq_rewriter,seq::skolem,arith_value
1.2 Parameter integration
src/params/smt_params_helper.pyg: Add'nseq'to string_solver options documentationsrc/params/smt_params.cpp: Add"nseq"tovalidate_string_solver()src/smt/smt_setup.cpp: Add"nseq"case insetup_QF_S()andsetup_seq_str(), calling newsetup_nseq()methodsrc/smt/smt_setup.h: Declaresetup_nseq()
1.3 Build system integration
src/smt/CMakeLists.txt: Addtheory_nseq.cppand all new.cppfiles to SOURCES
Phase 2: Core Data Structures
2.1 String representation & state management (src/smt/nseq_state.h/.cpp)
- String variable tracking (analogous to ZIPT's StrManager)
- Concatenation decomposition into token sequences
- Canonicalization of string terms
- Variable-to-representative mapping (union-find or solution map)
- Backtrackable state via
scoped_vector/ trail
2.2 Constraint representation (src/smt/nseq_constraint.h)
- Constraint types: StrEq, StrNe, StrMem (regex membership), StrContains, StrPrefix, StrSuffix
- IntEq, IntLe, IntNe for length constraints
- Dependency tracking for conflict explanations
- Constraint simplification results
2.3 Dependency tracker (src/smt/nseq_dep.h/.cpp)
- Track justifications for derived facts
- Integrate with Z3's
scoped_dependency_manager<assumption>pattern (same as theory_seq) - Support linearization for conflict clause generation
Phase 3: Word Equation Solver (Nielsen Transformations)
3.1 Nielsen graph (src/ast/rewriter/nseq_nielsen.h/.cpp)
- Nielsen transformation graph: nodes represent equation states, edges represent transformations
- Node: stores a set of simplified word equations
- Edge: transformation applied (variable elimination, constant matching, split)
- Graph exploration with cycle detection
- Self-contained rewriter utility (no dependency on smt context)
3.2 Nielsen node operations
- Constant matching: if both sides start with same constant, strip it
- Variable elimination: if LHS starts with variable x, case-split: x=ε or x=a·x'
- Equation decomposition: split compound equations
- Determinism detection: identify forced assignments
- Directed Nielsen: apply transformations in a directed manner
3.3 Modifier chain (integrated into nielsen module)
- DecomposeModifier — decompose strings into concatenation components
- EqSplitModifier — case analysis on equalities
- VarNielsenModifier / ConstNielsenModifier / DirectedNielsenModifier
- StarIntrModifier — Kleene star introduction/elimination
- PowerSplitModifier / PowerEpsilonModifier
- RegexCharSplitModifier / RegexVarSplitModifier
Phase 4: Length Constraint Integration (Parikh Image)
4.1 Length reasoning (src/ast/rewriter/nseq_length.h/.cpp)
- Interval arithmetic for length bounds
- Length variable management
- Integration with Z3's arithmetic solver via
arith_value - Parikh image constraints: derive integer constraints from string equations
- Self-contained rewriter utility
4.2 Arithmetic interface in theory_nseq
- Query arithmetic solver for current length assignments
- Propagate length equalities/inequalities derived from string equations
- Use
arith_value(same pattern as theory_seq)
Phase 5: Regex Membership
5.1 Regex handling (src/smt/nseq_regex.h/.cpp)
- Lazy membership checking (per LazyMemberships.pdf)
- Regex derivative computation (reuse
seq_rewriter.mk_derivative) - Regex splitting for equation solving
- Integration with Nielsen transformations
- Nullable checking
Phase 6: Propagation Engine
6.1 String propagator (src/smt/nseq_propagator.h/.cpp)
- Main propagation loop integrating all components
- Equation simplification pipeline
- Constraint interaction (string eq + length + regex)
- Case splitting decisions
- Conflict detection and clause generation
6.2 Wire into theory_nseq
internalize_term/atom: register string terms, create theory variables, enqueue axiomsassign_eh: process boolean assignments (e.g., regex memberships, contains)new_eq_eh/new_diseq_eh: process equality/disequality mergespropagate: run propagation enginefinal_check_eh: comprehensive consistency check, case splits
Phase 7: Model Generation
7.1 Model construction
- Build string values from solved equations
- Handle unconstrained variables
- Generate models consistent with length constraints and regex memberships
- Register
seq_factoryfor value generation
Testing Plan
Unit Tests
T1: Basic smoke test (src/test/nseq_basic.cpp)
- Register test in
src/test/main.cppviaTST(nseq_basic) - Test that theory_nseq can be selected via
smt.string_solver=nseq - Verify basic sat/unsat on trivial string equalities
T2: Nielsen transformation tests
- Unit tests for the Nielsen graph/node rewriter module
- Test constant matching, variable elimination, equation decomposition
T3: Length reasoning tests
- Test interval propagation, Parikh image constraint generation
Integration Tests (via z3 command line)
T4: z3test regression benchmarks (local at C:\z3test)
The z3test repository is available locally at C:\z3test. It contains the same regression suite used by nightly CI (.github/workflows/nightly.yml:198).
Key string/regex benchmarks in C:\z3test\regressions\smt2:
string-concat.smt2,string-eval.smt2,string-itos.smt2,string-literals.smt2,string-ops.smt2,string-simplify.smt2string3.smt2,string6.smt2,string7.smt2,string9.smt2,string11.smt2throughstring14.smt2re.smt2,re_rewriter.smt2testfoldl.smt2,testmap.smt2,testmapi.smt2(sequence higher-order)byte_string.smt2(disabled)
Testing approach:
- Run full regression suite:
python C:\z3test\scripts\test_benchmarks.py <z3exe> C:\z3test\regressions\smt2- The test harness runs each
.smt2file withmodel_validate=trueand compares output to.expected.out - Script supports parallel execution (uses all CPU cores by default)
- The test harness runs each
- Run string-specific subset with
smt.string_solver=nseq:- For each
string*.smt2andre*.smt2inC:\z3test\regressions\smt2, run:z3 smt.string_solver=nseq <file> - Compare output against
.expected.outfiles
- For each
- Also test
C:\z3test\regressions\issues(8 issue-specific regression tests)
Other regression directories to verify (non-string, but ensure no regressions):
C:\z3test\regressions\smt2-extraC:\z3test\regressions\smt2-debug
T5: Targeted string SMT2 tests
Create focused test files exercising specific features:
tests/nseq/eq_basic.smt2— simple word equations (x·a = a·x)tests/nseq/eq_nielsen.smt2— equations requiring Nielsen transformationstests/nseq/concat.smt2— concatenation reasoningtests/nseq/contains.smt2— str.contains, str.prefixof, str.suffixoftests/nseq/length.smt2— length constraints and Parikh imagetests/nseq/regex.smt2— regex membershiptests/nseq/mixed.smt2— combinations of string ops with arithmetictests/nseq/unsat.smt2— unsatisfiable benchmarks
T6: Comparison testing with parallel runner
- Use
~/dev/test_scripts/src/run_in_parallel.pyfor parallel testing - Run the same SMT files with both
smt.string_solver=seqandsmt.string_solver=nseq - Verify no soundness bugs (sat/unsat agree)
Performance Testing
T7: Performance benchmarks (release builds only)
- Use SMT-COMP string benchmarks (QF_S, QF_SLIA tracks)
- Compare solve times between theory_seq and theory_nseq
- Use
z3 -stfor statistics - Only on release builds (never debug builds)
File Summary
New files to create:
| File | Description |
|---|---|
src/smt/theory_nseq.h |
Theory class header |
src/smt/theory_nseq.cpp |
Theory class implementation |
src/smt/nseq_state.h |
String state management header |
src/smt/nseq_state.cpp |
String state management impl |
src/smt/nseq_constraint.h |
Constraint type definitions |
src/smt/nseq_dep.h |
Dependency tracking header |
src/smt/nseq_dep.cpp |
Dependency tracking impl |
src/smt/nseq_propagator.h |
Propagation engine header |
src/smt/nseq_propagator.cpp |
Propagation engine impl |
src/smt/nseq_regex.h |
Regex handling header |
src/smt/nseq_regex.cpp |
Regex handling impl |
src/ast/rewriter/nseq_nielsen.h |
Nielsen transformation header |
src/ast/rewriter/nseq_nielsen.cpp |
Nielsen transformation impl |
src/ast/rewriter/nseq_length.h |
Length/Parikh reasoning header |
src/ast/rewriter/nseq_length.cpp |
Length/Parikh reasoning impl |
src/test/nseq_basic.cpp |
Unit tests |
Existing files to modify:
| File | Change |
|---|---|
src/smt/smt_setup.h |
Declare setup_nseq() |
src/smt/smt_setup.cpp |
Add "nseq" to string_solver dispatch, implement setup_nseq() |
src/smt/CMakeLists.txt |
Add new .cpp files |
src/ast/rewriter/CMakeLists.txt |
Add new rewriter .cpp files |
src/params/smt_params_helper.pyg |
Document nseq option |
src/params/smt_params.cpp |
Add "nseq" to validation |
src/test/main.cpp |
Register TST(nseq_basic) |
src/test/CMakeLists.txt |
Add nseq_basic.cpp (if needed) |
Implementation Order
- Phase 1 (Skeleton & Integration) — get a compilable stub that Z3 can load
- Phase 2 (Core Data Structures) — state management and constraint representation
- Phase 3 (Nielsen Transformations) — core word equation solving
- Phase 4 (Length/Parikh) — arithmetic integration
- Phase 5 (Regex) — regex membership
- Phase 6 (Propagation Engine) — wire everything together
- Phase 7 (Model Generation) — produce satisfying assignments
- Testing — runs in parallel with each phase; benchmarks after Phase 6
Notes
- The ZIPT C# reference at
github.com/CEisenhofer/ZIPT(parikh branch) is the primary algorithm reference - LazyMemberships.pdf describes the lazy regex membership approach
- ClemensTableau.pdf describes additional algorithms
- Reuse Z3 infrastructure wherever possible:
seq_util,seq_rewriter,seq_skolem,arith_value,seq_factory - Self-contained utilities (Nielsen graph, length reasoning) go in
src/ast/rewriter/per the spec - All timing-sensitive builds require 30+ minute timeouts