The symbolic derivative of a left-associated concat (a*b)*r2 recurses
through the entire left spine, exceeding m_max_depth and emitting stuck
symbolic re.derivative terms that accumulate across NFA states and cause
an exponential blow-up on contains-pattern intersections. Right-
associate the concat (via derive::mk_concat, a single linear pass that
does not touch the derivative depth counter) before deriving, so the head
stays atomic and recursion is shallow.
Also restore the L(a)subset L(b) subsumption in mk_union_core (collapses
semantically-equal union states in antimirov intersection derivatives),
and fix a latent sort bug in mk_regex_concat where the '..* = .+'
rewrite passed the element sort instead of the regex sort to
mk_full_char, triggering a 're.+' sort-mismatch exception once exercised
by the derivative path.
Result on QF_S/20250410-matching wildcard-matching-regex set: 02 15s->0.6s,
04 15s->0.7s, 05 timeout->4.2s, 59/62 timeout->0.1-0.3s (vs master).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mk_re_concat only merged Sigma* Sigma* when both stars were direct
siblings. Once a literal nests the concatenation, e.g. (Sigma* x Sigma*) ++ Sigma*,
the trailing duplicate Sigma* survived because the left operand is a concat
that ends with Sigma* rather than a bare full_seq. Redundant adjacent stars
then multiply derivative states during bisimulation.
Add two grouping-insensitive rules using the existing
starts_with_full_seq/ends_with_full_seq helpers:
R ++ Sigma* -> R when R ends with Sigma*
Sigma* ++ R -> R when R starts with Sigma*
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Guard regex_to_range_predicate so it only collapses regexes whose
element sort is a character sort. Previously a regex over a non-char
sequence sort (e.g. (re (Seq Int))) could be silently mis-collapsed
into bogus [0, max_char] ranges. Add a negative unit test covering
re.empty/re.full_char over (Seq Int).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
In the bisimulation equivalence path, derivative leaves are now
extracted via the new seq_rewriter::brz_derivative_cofactors
(derive::derivative_cofactors), which computes the symbolic
derivative and enumerates its reachable leaves in fully ITE-hoisted
normal form: every if-then-else over the input character (including
ones previously buried under a concat or union) is hoisted to the top
via decompose_ite, infeasible minterms are pruned, and unions are kept
intact as single states. Each leaf is therefore a ground regex free of
(:var 0), so its nullability is always decidable.
This replaces collect_leaves (which only split top-level ITEs and left
buried (:var 0) ITEs inside leaves), the root cause of bisim returning
l_undef and falling through to the slow theory solver.
Validation on the regex-equivalence corpus (1523 files, -T:5, 8 workers):
- vs master: total solved 1394 vs 1378 (+16), soft_timeouts 129 vs 145,
0 soundness disagreements (was 18 -> 5 -> 0).
- vs derive: +242 solved (1394 vs 1152), 25.4% faster on commonly-solved
files, fixes 18 soundness disagreements, only 6 regressions.
- corpus wall time halved (172s vs 332s/349s).
- All 91 unit tests pass, including seq_regex_bisim.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Introduce src/ast/rewriter/regex_range_collapse.{h,cpp}, a translator
between the boolean-combination-of-character-class fragment of regexes
and the range_predicate value type added in Stage 2.
Recognized fragment (translates to range_predicate):
re.empty, re.full_char, re.range, re.union, re.intersection, re.diff
of operands recursively in the fragment. Range bounds are accepted in
three encodings: string constant ("a"), seq.unit of a const char
(seq.unit (Char 97)), and length-1 zstring literal.
NOT translated:
re.complement -- this is sequence-level complement (Sigma* \ L), not
character-class complement. Translating it would incorrectly turn
re.comp(re.range "a" "z") into the character class [^a-z], which would
drop the empty string and all length>=2 strings.
Hook the translator into seq_rewriter at mk_re_union0, mk_re_union,
mk_re_inter0, mk_re_inter, and mk_re_diff so that boolean combinations
of character classes always reduce to a single canonical range-set
form. mk_re_complement is intentionally not hooked.
Materialization uses the canonical (seq.unit (Char N)) bound form
(matching the rest of seq_rewriter) and right-associates the union
with operands sorted by expr_id so the result matches the invariant
expected by merge_regex_sets.
Unit tests in src/test/regex_range_collapse.cpp cover the recognized
fragment, the non-translatable cases, and round-trip identity for
multi-range predicates.
Corpus validation on bench/inputs/regex-equivalence (1523 .smt2):
- 0 soundness regressions vs derive baseline.
- Resolves 4 previously-soft-timeout files (now solved correctly).
- Resolves 1 pre-existing wrong answer (mut_0404: master/derive say
unsat, ground-truth annotation and Stage 3 say sat).
- Wall-time: -2.2% vs Stage-3 starting point, -1.5% vs derive.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Introduce a specialized range-algebra over the unsigned character
domain [0, max_char], with canonical sorted-disjoint-non-adjacent
representation and linear-time union, intersection, complement,
difference, and symmetric difference operations.
This is Stage 1 of the derive-with-ranges plan: the value type only,
with unit tests covering factories, ordering, hashing, hand-picked
instances, and exhaustive de-Morgan / lattice laws over a small
domain (verified by enumerating all 64 subsets).
Integration with seq::derive's path conditions, the OneStep cache,
and the R&psi smart-constructor rewrite are deferred to later stages.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The single-ITE branch of hoist_ite was gated by 'm_path_stack.size() < 8'.
When the depth limit was reached, hoist_ite returned nullptr and callers
fell back to non-path-aware structural rewrites (mk_union0 / mk_xor0).
These rewrites simplify e.g. mk_union(empty, X) -> X and return X unchanged,
preserving inner ITE branches that were built at an earlier (less constrained)
path. Subsequent derivative computation never path-prunes those branches,
which can leak unreachable epsilon-leaves into the final t-regex and cause
the bisimulation algorithm to report inequivalence for equivalent regexes.
Concrete trigger: derivatives of unions/xors with >= 9 components produce
9-deep ITE chains; at depth 8 the inner ITE returns unprocessed, leaving an
unreachable epsilon-leaf that bisim mis-interprets as a distinguishing word.
Removing the guard restores soundness. The corpus run against
regex-equivalence (1523 files) fixes 22 triangulated soundness bugs
(mut_0013, mut_0241, mut_0257, mut_0301 among others) with zero regressions.
89/89 unit tests pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
`seq::derive::operator()(ele, r)` looks up `m_top_cache` keyed only by the
regex `r`, but on a miss it used to set `m_ele = ele` (a concrete char)
before calling `derive_rec(r)`. The resulting ITE-tree contained
constant-folded `(= ele c)` conditions, so the "symbolic" derivative
stored in the cache was actually specialized to that one ele. Subsequent
calls with the same `r` but a different ele hit the stale cached answer
and the substitution at the bottom was a no-op (no `v0` left to replace).
Simplest victim:
(str.in_re "aP" (re.++ (re.* "a") "P"))
returned `unsat`. The first call D_'a'(a*P) computed `a*P` and cached it
under key `a*P`; the next call D_'P'(a*P) hit that cache entry and
returned `a*P` instead of epsilon, so the membership check ended on a
non-nullable state.
Fix: set `m_ele = v` (the canonical fresh var) so the derivative is
genuinely symbolic. Concrete-ele callers go through the existing
substitution at the bottom of `operator()`.
Adds a regression test in src/test/seq_regex_bisim.cpp checking that
D_'a'(a*P) is not nullable while D_'P'(a*P) is.
Note: this is independent of the mut_0013 bisim-level unsoundness;
that case still fails and is being tracked separately.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The intersect_intervals routine in seq_derive.cpp maintains a path-tracking
disjoint union of character intervals. When intersecting the active suffix
with a new constraint [lo, hi], it iterated the suffix and, on encountering
the first interval disjoint from [lo, hi], reset the output cursor to the
end-marker and broke out of the loop. This both threw away the intervals it
had already kept and skipped every remaining interval, so e.g.
[(0, 96), (98, max)] intersected with [98, 98] became empty instead of
[(98, 98)].
Inside derive that silently killed valid branches of symbolic derivatives.
For example D(a|b) collapsed to ite(c='a', eps, empty) -- the 'b' branch
vanished -- which made the bisimulation procedure conclude bogus equalities
such as a* == (a|b)*. On the regex-equivalence corpus this single bug
accounted for ~510 false-unsat results vs master.
Fix: drop only the disjoint interval and continue scanning the rest of the
suffix. Add a small assertion-based regression test that builds D(a|b),
checks both branches survive, and runs bisim on a* vs (a|b)*.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Introduce exclusion intervals alongside the existing path-based condition
tracking in simplify_ite_rec. The intervals track which character values
are still possible at each point in the ITE tree, enabling simplification
of nested range conditions that the per-entry path approach cannot handle.
Key additions:
- intervals_t type and push_intervals() to maintain live character ranges
- eval_range_cond() checks AND-of-char_le conditions against intervals
- intersect_intervals/exclude_interval utilities from seq_rewriter pattern
- Negated AND handling: ¬(lo<=x ∧ x<=hi) excludes [lo,hi] from intervals
The interval check runs before the existing eval_path_cond logic, catching
cases like: if(0<=x<=10, t, if(1<=x<=8, t2, e2)) → if(0<=x<=10, t, e2)
where the inner range [1,8] is fully contained in the excluded outer range.
Fixes remaining regression timeouts on 5728 P2 and 5731 P4.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Simplify trivial range bounds in derive_range: when lo=0, omit
the lo<=x condition; when hi=max_char, omit the x<=hi condition.
Full charset ranges return epsilon directly.
- Add char_le(0,x)=true and char_le(x,max)=true to eval_cond for
always-valid bounds.
- Add range implication logic to simplify_ite_rec: when path has
negated/positive char_le constraints, detect implied or contradicted
char_le conditions (e.g., ¬(x<=127) implies 128<=x).
- Add is_subset(a, .+) check: non-nullable regexes are subsets of .+
- In update_state_graph, skip recursive exploration of nullable targets
to avoid state explosion.
These fixes resolve timeouts on 5724 (all problems), 5721 P1, and 5693.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add top-level cache (m_top_cache) to ensure stable AST node identity
across repeated derivative calls, preventing state graph divergence
- Add get_head_tail helper for derive_to_re with str.is_unit/str.is_concat
- Add ITE hoisting in mk_union/mk_inter to keep ITEs at top level
- Add De Morgan rule in mk_complement: ~(A∪B) → ~A ∩ ~B
- Add ~ε → .+ simplification in mk_complement
- Add prefix factoring: a·x ∪ a·y = a·(x∪y) and a·x ∩ a·y = a·(x∩y)
- Add r* ∩ .+ = r+ special case in mk_inter
- Enhance is_subset with union/intersection distributivity and complement
- Remove De Morgan from mk_inter to prevent infinite recursion loop
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>