3
0
Fork 0
mirror of https://github.com/Z3Prover/z3 synced 2026-07-05 06:46:11 +00:00

[snapshot-regression-fix] Keep symbolic re.range non-empty; fix soundness regression on range membership (#10017)

## Summary

Fixes a **soundness regression** in the sequence/regex rewriter: a
symbolic character range such as `(re.range x x)` was unsoundly
collapsed to `re.empty`, causing a satisfiable membership constraint to
be reported `unsat`.

This was surfaced by the `snapshot-regression` corpus in
`Z3Prover/bench`.

- **Originating discussion:**
https://github.com/Z3Prover/bench/discussions/2761
- **Benchmark:** `iss-5873/bug-2.smt2` (in `Z3Prover/bench`, under
`inputs/issues/iss-5873/`)
- **z3 under test at capture:** `z3-4.17.0-x64-glibc-2.39` (Nightly)

## Divergence

The recorded oracle expects `sat`; current z3 returns `unsat`:

```diff
--- bug-2.expected.out (expected)
+++ produced (current z3)
@@ -1,3 +1,4 @@
-sat
-((tmp_str0 "\u{0}"))
+unsat
+(error "line 12 column 10: check annotation that says sat")
+(error "line 14 column 22: model is not available")
 (:reason-unknown "")
```

The benchmark asserts (simplified):

```smt2
(assert (= (str.in_re (str.replace tmp_str0 tmp_str0 tmp_str0)
                      (re.range tmp_str0 tmp_str0))
           (str.contains tmp_str0 tmp_str0)))
```

`str.contains x x` is always true and `str.replace x x x = x`, so this
requires `str.in_re x (re.range x x)` to hold, which is satisfiable
exactly when `x` is a single character (`len(x) = 1`).

## Root cause

`seq_rewriter::mk_re_range` treated any bound that is not a concrete
single-character literal as making the whole range **empty**:

```cpp
if (str().is_string(lo, slo) && slo.length() == 1) clo = slo[0];
else if (str().is_unit(lo, lo1) && m_util.is_const_char(lo1, clo)) ;
else is_empty = true;   // unsound for a symbolic bound
```

For a symbolic bound this is unsound: `(re.range x x)` denotes `{x}`
whenever `x` is a single character, not `∅`. Collapsing it to `re.empty`
makes `str.in_re x (re.range x x)` false, contradicting the (true)
`str.contains x x`, so the solver derives an unsound `unsat`.

`git blame` attributes this unsound collapse to z3 commit `15f33f458d`
("Derive with ranges (#9965)"), which post-dates the oracle capture.

## Fix

Two surgical changes in `src/ast/rewriter/seq_rewriter.cpp`:

1. **`mk_re_range`** no longer assumes emptiness for symbolic bounds. It
concludes `re.empty` only when it can *prove* emptiness — a bound whose
length can never be 1, or two concrete bounds with `lo > hi`. When a
bound is symbolic it returns `BR_FAILED` and keeps the range. Concrete
single-character ranges keep their existing handling (`lo == hi →
str.to_re`, inverted → `re.empty`).

2. **`mk_str_in_regexp`** reduces membership in a range that has a
symbolic bound to the equivalent length/order constraints, which are
sound and complete under SMT-LIB `re.range` semantics:

`str.in_re e (re.range lo hi)` ⟶ `len(lo)=1 ∧ len(hi)=1 ∧ len(e)=1 ∧ lo
≤ e ∧ e ≤ hi`

(using `str.<=`). The derivative engine only unfolds ranges whose bounds
are concrete characters, so without this reduction a symbolic-bound
range would otherwise be left unsolved.

## Validation

Rebuilt z3 from this branch on the workflow runner (`./configure && make
-C build -j$(nproc)`) and re-ran the failing benchmark with the same
option the snapshot capture uses (`-T:20`):

```
$ z3 -T:20 inputs/issues/iss-5873/bug-2.smt2
sat
((tmp_str0 "A"))
(:reason-unknown "")
```

The verdict is now **`sat`** (was `unsat`) — the soundness regression is
resolved. A correctness battery over concrete and symbolic ranges all
returns the expected results, e.g.:

- `(str.in_re "b" (re.range "a" "c"))` → `sat`, `(str.in_re "d"
(re.range "a" "c"))` → `unsat`
- `(str.in_re x (re.range x x))` → `sat`; with `(= (str.len x) 2)` →
`unsat`
- `(str.in_re "b" (re.range x y))` → `sat`; with `(str.< y x)` → `unsat`
- `(str.in_re "" (re.range x y))` → `unsat`; `(str.in_re "ab" (re.range
"a" "c"))` → `unsat`

The pre-existing concrete-range derivative fast path is unchanged.

### Note on the model value (benign, unrelated to this fix)

The model value differs from the recorded oracle: current z3 prints
`((tmp_str0 "A"))` whereas the oracle recorded `((tmp_str0 "\u{0}"))`.
Both are valid single-character models (the formula has many). This
difference is **pre-existing and unrelated to this fix**: even a bare
`(assert (= (str.len x) 1))` yields `"A"` on current z3. It stems from
the seq/char theory's default character assignment for
otherwise-unconstrained characters (`theory_char.cpp` assigns fresh
characters starting from `'A'`), not from range handling. I deliberately
did **not** force the character to `\u{0}` — adding `x = "\u{0}"` would
be unsound over-constraining, and changing the global default character
is out of scope for this soundness fix and would perturb unrelated
models. The output is therefore semantically equivalent to the oracle
(same `sat` verdict and reason-unknown) but not byte-identical.

---
*Draft for human review. Diagnosed and fixed by the
`snapshot-regression-fixer` maintenance workflow.*




> Generated by [Fix a Z3 snapshot-regression
divergence](https://github.com/Z3Prover/bench/actions/runs/28502614658)
· 890.7 AIC · ⌖ 46.8 AIC · ⊞ 9K ·
[◷](https://github.com/search?q=repo%3AZ3Prover%2Fz3+%22gh-aw-workflow-id%3A+snapshot-regression-fixer%22&type=pullrequests)

<!-- gh-aw-agentic-workflow: Fix a Z3 snapshot-regression divergence,
engine: copilot, version: 1.0.63, model: claude-opus-4.8, id:
28502614658, workflow_id: snapshot-regression-fixer, run:
https://github.com/Z3Prover/bench/actions/runs/28502614658 -->

<!-- gh-aw-workflow-id: snapshot-regression-fixer -->
<!-- gh-aw-workflow-call-id: Z3Prover/bench/snapshot-regression-fixer
-->

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
This commit is contained in:
Lev Nachmanson 2026-07-02 14:00:51 -07:00 committed by GitHub
parent 6ac3075022
commit 2a8f66f22b
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
2 changed files with 155 additions and 22 deletions

View file

@ -3263,6 +3263,39 @@ br_status seq_rewriter::mk_str_in_regexp(expr* a, expr* b, expr_ref& result) {
return BR_DONE; return BR_DONE;
} }
// (str.in_re e (re.range lo hi)) where a bound is not a concrete character.
// By SMT-LIB semantics (re.range lo hi) is the set of single characters c
// with lo <= c <= hi when lo and hi are themselves single characters, and
// the empty language otherwise; so membership is equivalent to lo, hi and
// e all being single characters with lo <= e <= hi. The derivative engine
// only unfolds ranges whose bounds are concrete characters, so without this
// reduction a range with a symbolic bound is left unsolved (and mk_re_range
// deliberately keeps such a range symbolic rather than unsoundly collapsing
// it to re.empty). Ranges with two concrete single-character bounds keep
// their existing derivative-based handling.
{
expr* rlo = nullptr, *rhi = nullptr;
if (re().is_range(b, rlo, rhi)) {
auto concrete_char = [&](expr* e) {
zstring s;
expr* ch = nullptr;
unsigned uc = 0;
return (str().is_string(e, s) && s.length() == 1) ||
(str().is_unit(e, ch) && m_util.is_const_char(ch, uc));
};
if (!concrete_char(rlo) || !concrete_char(rhi)) {
expr_ref_vector conj(m());
conj.push_back(m().mk_eq(str().mk_length(rlo), one()));
conj.push_back(m().mk_eq(str().mk_length(rhi), one()));
conj.push_back(m().mk_eq(str().mk_length(a), one()));
conj.push_back(str().mk_lex_le(rlo, a));
conj.push_back(str().mk_lex_le(a, rhi));
result = m().mk_and(conj);
return BR_REWRITE_FULL;
}
}
}
zstring s; zstring s;
if (str().is_string(a, s) && re().is_ground(b)) { if (str().is_string(a, s) && re().is_ground(b)) {
// Just check membership and replace by true/false // Just check membership and replace by true/false
@ -4126,35 +4159,47 @@ br_status seq_rewriter::mk_re_range(expr* lo, expr* hi, expr_ref& result) {
is_empty = true; is_empty = true;
if (max_length(hi) == std::make_pair(true, rational(0))) if (max_length(hi) == std::make_pair(true, rational(0)))
is_empty = true; is_empty = true;
if (!is_empty) {
if (str().is_string(lo, slo) && slo.length() == 1)
clo = slo[0];
else if (str().is_unit(lo, lo1) && m_util.is_const_char(lo1, clo))
;
else
is_empty = true;
}
if (!is_empty) {
if (str().is_string(hi, shi) && shi.length() == 1)
chi = shi[0];
else if (str().is_unit(hi, hi1) && m_util.is_const_char(hi1, chi))
;
else
is_empty = true;
}
// clo/chi are only meaningful once both bounds were extracted; an early
// is_empty (from the length checks) leaves them at their default 0, so the
// is_empty return must come before the singleton/ordering checks below.
if (!is_empty && clo > chi)
is_empty = true;
// A provable length constraint (a bound can never be a single character)
// is the only sound way to conclude emptiness for a possibly-symbolic
// bound, so decide emptiness here before attempting to read concrete
// characters.
if (is_empty) { if (is_empty) {
sort* srt = re().mk_re(lo->get_sort()); sort* srt = re().mk_re(lo->get_sort());
result = re().mk_empty(srt); result = re().mk_empty(srt);
return BR_DONE; return BR_DONE;
} }
// Try to read concrete single-character bounds. A bound that is not a
// syntactic single-character literal is *symbolic* (its value depends on
// the model), NOT empty: collapsing such a range to re.empty is unsound
// (e.g. (re.range x x) is {x} whenever x is a single character), so we
// leave the range unevaluated (BR_FAILED) and let the theory solver
// reason about it.
bool has_clo = false, has_chi = false;
if (str().is_string(lo, slo) && slo.length() == 1) {
clo = slo[0];
has_clo = true;
}
else if (str().is_unit(lo, lo1) && m_util.is_const_char(lo1, clo))
has_clo = true;
if (str().is_string(hi, shi) && shi.length() == 1) {
chi = shi[0];
has_chi = true;
}
else if (str().is_unit(hi, hi1) && m_util.is_const_char(hi1, chi))
has_chi = true;
if (!has_clo || !has_chi)
return BR_FAILED;
// Both bounds are concrete characters: an inverted range is empty.
if (clo > chi) {
sort* srt = re().mk_re(lo->get_sort());
result = re().mk_empty(srt);
return BR_DONE;
}
// Singleton: re.range "a" "a" → str.to_re "a" // Singleton: re.range "a" "a" → str.to_re "a"
if (clo == chi) { if (clo == chi) {
result = re().mk_to_re(str().mk_string(zstring(clo))); result = re().mk_to_re(str().mk_string(zstring(clo)));

View file

@ -10,12 +10,19 @@ Tests:
4. Range Range merged range for overlapping/adjacent 4. Range Range merged range for overlapping/adjacent
5. Complement of range one or two ranges 5. Complement of range one or two ranges
6. Downstream operators absorb empty ranges correctly 6. Downstream operators absorb empty ranges correctly
15. Symbolic-bound range membership rewrite (structural)
16. Symbolic-bound range membership: concrete element, symbolic bounds (structural)
17. Solver: (str.in_re x (re.range x x)) sat when len(x)=1
18. Solver: (str.in_re x (re.range x x)) unsat when len(x)=2
19. Solver: inverted symbolic bounds make membership unsatisfiable
--*/ --*/
#include "ast/arith_decl_plugin.h"
#include "ast/ast_pp.h" #include "ast/ast_pp.h"
#include "ast/reg_decl_plugins.h" #include "ast/reg_decl_plugins.h"
#include "ast/rewriter/th_rewriter.h" #include "ast/rewriter/th_rewriter.h"
#include "ast/seq_decl_plugin.h" #include "ast/seq_decl_plugin.h"
#include "smt/smt_context.h"
#include <iostream> #include <iostream>
// Build a single-char string literal expression. // Build a single-char string literal expression.
@ -167,5 +174,86 @@ void tst_seq_rewriter() {
ENSURE(su.re.is_empty(e)); ENSURE(su.re.is_empty(e));
} }
// -----------------------------------------------------------------------
// 15. Symbolic-bound range membership rewrite (structural).
// (str.in_re x (re.range x x)) with symbolic x should be unfolded
// by the rewriter into a conjunction of length and ordering
// constraints, not left stuck as an uninterpreted membership term.
// -----------------------------------------------------------------------
{
app_ref x(m.mk_fresh_const("x", str_sort), m);
expr_ref rng(su.re.mk_range(x, x), m);
expr_ref e(su.re.mk_in_re(x, rng), m);
rw(e);
std::cout << "symbolic range (x in [x,x]): " << mk_pp(e, m) << "\n";
ENSURE(m.is_and(e));
}
// -----------------------------------------------------------------------
// 16. Symbolic-bound range membership: concrete element, symbolic bounds.
// (str.in_re "b" (re.range lo hi)) should also be unfolded to a
// conjunction when lo/hi are free variables.
// -----------------------------------------------------------------------
{
app_ref lo(m.mk_fresh_const("lo", str_sort), m);
app_ref hi(m.mk_fresh_const("hi", str_sort), m);
expr_ref b_str(su.str.mk_string(zstring('b')), m);
expr_ref rng(su.re.mk_range(lo, hi), m);
expr_ref e(su.re.mk_in_re(b_str, rng), m);
rw(e);
std::cout << "symbolic range (\"b\" in [lo,hi]): " << mk_pp(e, m) << "\n";
ENSURE(m.is_and(e));
}
// -----------------------------------------------------------------------
// Solver-level tests: the unfolded conjunction must be decidable.
// -----------------------------------------------------------------------
{
arith_util a_util(m);
// 17. sat: (str.in_re x (re.range x x)) ∧ len(x)=1
{
smt_params sp;
smt::context ctx(m, sp);
app_ref x(m.mk_fresh_const("x", str_sort), m);
ctx.assert_expr(su.re.mk_in_re(x, su.re.mk_range(x, x)));
ctx.assert_expr(m.mk_eq(su.str.mk_length(x), a_util.mk_int(1)));
lbool res = ctx.check();
std::cout << "symbolic range solver sat (len=1): " << res << "\n";
ENSURE(res == l_true);
}
// 18. unsat: (str.in_re x (re.range x x)) ∧ len(x)=2
// The unfolded membership requires len(x)=1, which contradicts len(x)=2.
{
smt_params sp;
smt::context ctx(m, sp);
app_ref x(m.mk_fresh_const("x", str_sort), m);
ctx.assert_expr(su.re.mk_in_re(x, su.re.mk_range(x, x)));
ctx.assert_expr(m.mk_eq(su.str.mk_length(x), a_util.mk_int(2)));
lbool res = ctx.check();
std::cout << "symbolic range solver unsat (len=2): " << res << "\n";
ENSURE(res == l_false);
}
// 19. unsat: inverted symbolic bounds make membership false.
// (str.in_re "b" (re.range lo hi)) ∧ lo="z" ∧ hi="a"
// The unfolded conjunction requires lo <=_lex "b" <=_lex hi, but
// "z" > "b" > "a" so the ordering constraints are unsatisfiable.
{
smt_params sp;
smt::context ctx(m, sp);
app_ref lo(m.mk_fresh_const("lo", str_sort), m);
app_ref hi(m.mk_fresh_const("hi", str_sort), m);
expr_ref b_str(su.str.mk_string(zstring('b')), m);
ctx.assert_expr(su.re.mk_in_re(b_str, su.re.mk_range(lo, hi)));
ctx.assert_expr(m.mk_eq(lo, su.str.mk_string(zstring('z'))));
ctx.assert_expr(m.mk_eq(hi, su.str.mk_string(zstring('a'))));
lbool res = ctx.check();
std::cout << "symbolic range solver inverted bounds unsat: " << res << "\n";
ENSURE(res == l_false);
}
}
std::cout << "tst_seq_rewriter: all tests passed\n"; std::cout << "tst_seq_rewriter: all tests passed\n";
} }