z3/research/docs/nseq-issues/10-performance-regression-regex-word-equation.md

# [nseq] Performance regression: nseq times out on regex + word equation benchmarks

**Labels**: performance, c3, nseq

## Summary

The nseq solver times out (exceeds 5 seconds) on several benchmarks that the seq solver
solves in under 30ms. These represent cases where nseq's Nielsen-graph search either
explores too many branches or does not apply available pruning heuristics effectively.

## Affected benchmarks

| File | seq verdict | seq time | nseq verdict | nseq time |
|------|-------------|----------|--------------|-----------|
| `concat-regex.smt2` | unsat | 0.018s | **unknown** (timeout) | >5s |
| `concat-regex3.smt2` | unsat | 0.024s | **unknown** (timeout) | >5s |
| `loop.smt2` | sat | 0.026s | **unknown** (timeout) | >5s |
| `simple-concat-4.smt2` | sat | 0.028s | **unknown** (timeout) | >5s |
| `all-quantifiers.smt2` | unknown | 0.020s | **unknown** (timeout) | >5s |

Data from: https://github.com/Z3Prover/z3/discussions/9071

## Reproducing examples

```smt2
; concat-regex.smt2 — seq: unsat in 18ms, nseq: TIMEOUT
(set-logic QF_S)
(declare-fun a () String)
(declare-fun b () String)
(declare-fun c () String)
(declare-fun d () String)
(assert (= a (str.++ b c)))
(assert (= b (str.++ d c)))
(assert (str.in.re a (re.+ (re.union (str.to.re "x") (str.to.re "y")))))
(assert (str.in.re c (re.+ (str.to.re "x"))))
(check-sat)
```

Why unsat: `a = b ++ c = d ++ c ++ c`. With `a ∈ (x|y)+` and `c ∈ x+`,
all characters in `a` are 'x' or 'y', but `c` forces all its chars to be 'x'.
The word equation creates constraints that (with the regex) force a contradiction.
The seq solver can prove this quickly, but nseq's iterative deepening DFS exhausts
its depth budget.

```smt2
; loop.smt2 — seq: sat in 26ms, nseq: TIMEOUT
(declare-fun TestC0 () String)
(declare-fun |1 Fill 1| () String)
(declare-fun |1 Fill 0| () String)
(declare-fun |1 Fill 2| () String)
(assert (str.in.re TestC0
           (re.++ (re.* (re.range "\x00" "\xff"))
                  ((_ re.loop 3 3) (re.range "0" "9"))
                  (re.* (re.range "0" "9"))
                  (re.* (re.range "\x00" "\xff")))))
; ... (additional constraints)
(check-sat)
```

```smt2
; simple-concat-4.smt2 — seq: sat in 28ms, nseq: TIMEOUT
; Uses recursive function definition (define-fun-rec) with transducer-style constraints
; and regex membership
(check-sat)
```

## Analysis

### concat-regex and concat-regex3

These benchmarks use word equations `a = b ++ c`, `b = d ++ c` combined with regex
memberships. The seq solver uses a length-based approach or Parikh constraints that
quickly prune the search. The nseq solver's Nielsen graph approach uses iterative
deepening DFS (starting at depth 10, doubling each iteration, up to 6 iterations per
`seq_nielsen.cpp:solve()`). For these formulas:

- The Nielsen graph may generate many extensions at each level (characters from the
  regex alphabet, word splits)
- The Parikh pre-check may not prune these cases effectively because the Parikh
  constraints for `(x|y)+` allow many length values
- The solver exhausts all 6 iterations without finding the contradiction

**Fix proposal**: Improve the pre-check for membership intersection. When all strings
in a word equation are constrained to finite-alphabet regexes (e.g., `(x|y)+ ∩ x+`),
use a more aggressive alphabet-intersection analysis to prune branches early.

### loop.smt2 and simple-concat-4.smt2

These benchmarks involve:
- `re.loop` (counted repetition) for `loop.smt2`
- `define-fun-rec` (recursive function definitions for transducer-style constraints) for `simple-concat-4.smt2`

For `loop.smt2`: the `(_ re.loop 3 3)` constraint requires exactly 3 digits to appear
at a specific position. The nseq solver may not have efficient handling for counted
repetitions in the Parikh constraint generator.

For `simple-concat-4.smt2`: the `define-fun-rec` introduces recursive predicates.
These are handled by `unfold_ho_terms()` in `theory_nseq.cpp`, which iteratively
unfolds the recursive definition. The timeout may be due to slow convergence of the
unfolding or the combination with regex constraints.

### Performance improvement opportunities

1. **Better Parikh stride computation for counted repetitions** (`seq_parikh.cpp`):
   `(_ re.loop N N)` has stride 0 and min_len = N * char_len. Ensure this is handled.

2. **Alphabet intersection pre-pruning** in the Nielsen graph: before generating all
   possible character extensions, check if the intersection of the alphabets of all
   regex constraints on a variable is consistent with the required character.

3. **Increase Nielsen DFS depth budget adaptively** or use a BFS strategy for small
   alphabets.

4. **Cache Parikh constraints** across Nielsen nodes that share the same membership constraints.

## Files to investigate

- `src/smt/seq/seq_nielsen.cpp` — `solve()`, `generate_extensions()`, depth budget
- `src/smt/seq/seq_parikh.cpp` — `compute_length_stride()` for `re.loop`, constraint generation
- `src/smt/seq/seq_regex.cpp` — alphabet-intersection analysis
- `src/smt/theory_nseq.cpp` — `unfold_ho_terms()` convergence for recursive definitions