mirror of
https://github.com/Z3Prover/z3
synced 2026-07-05 06:46:11 +00:00
[snapshot-regression-fix] Keep symbolic re.range non-empty; fix soundness regression on range membership (#10017)
## Summary
Fixes a **soundness regression** in the sequence/regex rewriter: a
symbolic character range such as `(re.range x x)` was unsoundly
collapsed to `re.empty`, causing a satisfiable membership constraint to
be reported `unsat`.
This was surfaced by the `snapshot-regression` corpus in
`Z3Prover/bench`.
- **Originating discussion:**
https://github.com/Z3Prover/bench/discussions/2761
- **Benchmark:** `iss-5873/bug-2.smt2` (in `Z3Prover/bench`, under
`inputs/issues/iss-5873/`)
- **z3 under test at capture:** `z3-4.17.0-x64-glibc-2.39` (Nightly)
## Divergence
The recorded oracle expects `sat`; current z3 returns `unsat`:
```diff
--- bug-2.expected.out (expected)
+++ produced (current z3)
@@ -1,3 +1,4 @@
-sat
-((tmp_str0 "\u{0}"))
+unsat
+(error "line 12 column 10: check annotation that says sat")
+(error "line 14 column 22: model is not available")
(:reason-unknown "")
```
The benchmark asserts (simplified):
```smt2
(assert (= (str.in_re (str.replace tmp_str0 tmp_str0 tmp_str0)
(re.range tmp_str0 tmp_str0))
(str.contains tmp_str0 tmp_str0)))
```
`str.contains x x` is always true and `str.replace x x x = x`, so this
requires `str.in_re x (re.range x x)` to hold, which is satisfiable
exactly when `x` is a single character (`len(x) = 1`).
## Root cause
`seq_rewriter::mk_re_range` treated any bound that is not a concrete
single-character literal as making the whole range **empty**:
```cpp
if (str().is_string(lo, slo) && slo.length() == 1) clo = slo[0];
else if (str().is_unit(lo, lo1) && m_util.is_const_char(lo1, clo)) ;
else is_empty = true; // unsound for a symbolic bound
```
For a symbolic bound this is unsound: `(re.range x x)` denotes `{x}`
whenever `x` is a single character, not `∅`. Collapsing it to `re.empty`
makes `str.in_re x (re.range x x)` false, contradicting the (true)
`str.contains x x`, so the solver derives an unsound `unsat`.
`git blame` attributes this unsound collapse to z3 commit `15f33f458d`
("Derive with ranges (#9965)"), which post-dates the oracle capture.
## Fix
Two surgical changes in `src/ast/rewriter/seq_rewriter.cpp`:
1. **`mk_re_range`** no longer assumes emptiness for symbolic bounds. It
concludes `re.empty` only when it can *prove* emptiness — a bound whose
length can never be 1, or two concrete bounds with `lo > hi`. When a
bound is symbolic it returns `BR_FAILED` and keeps the range. Concrete
single-character ranges keep their existing handling (`lo == hi →
str.to_re`, inverted → `re.empty`).
2. **`mk_str_in_regexp`** reduces membership in a range that has a
symbolic bound to the equivalent length/order constraints, which are
sound and complete under SMT-LIB `re.range` semantics:
`str.in_re e (re.range lo hi)` ⟶ `len(lo)=1 ∧ len(hi)=1 ∧ len(e)=1 ∧ lo
≤ e ∧ e ≤ hi`
(using `str.<=`). The derivative engine only unfolds ranges whose bounds
are concrete characters, so without this reduction a symbolic-bound
range would otherwise be left unsolved.
## Validation
Rebuilt z3 from this branch on the workflow runner (`./configure && make
-C build -j$(nproc)`) and re-ran the failing benchmark with the same
option the snapshot capture uses (`-T:20`):
```
$ z3 -T:20 inputs/issues/iss-5873/bug-2.smt2
sat
((tmp_str0 "A"))
(:reason-unknown "")
```
The verdict is now **`sat`** (was `unsat`) — the soundness regression is
resolved. A correctness battery over concrete and symbolic ranges all
returns the expected results, e.g.:
- `(str.in_re "b" (re.range "a" "c"))` → `sat`, `(str.in_re "d"
(re.range "a" "c"))` → `unsat`
- `(str.in_re x (re.range x x))` → `sat`; with `(= (str.len x) 2)` →
`unsat`
- `(str.in_re "b" (re.range x y))` → `sat`; with `(str.< y x)` → `unsat`
- `(str.in_re "" (re.range x y))` → `unsat`; `(str.in_re "ab" (re.range
"a" "c"))` → `unsat`
The pre-existing concrete-range derivative fast path is unchanged.
### Note on the model value (benign, unrelated to this fix)
The model value differs from the recorded oracle: current z3 prints
`((tmp_str0 "A"))` whereas the oracle recorded `((tmp_str0 "\u{0}"))`.
Both are valid single-character models (the formula has many). This
difference is **pre-existing and unrelated to this fix**: even a bare
`(assert (= (str.len x) 1))` yields `"A"` on current z3. It stems from
the seq/char theory's default character assignment for
otherwise-unconstrained characters (`theory_char.cpp` assigns fresh
characters starting from `'A'`), not from range handling. I deliberately
did **not** force the character to `\u{0}` — adding `x = "\u{0}"` would
be unsound over-constraining, and changing the global default character
is out of scope for this soundness fix and would perturb unrelated
models. The output is therefore semantically equivalent to the oracle
(same `sat` verdict and reason-unknown) but not byte-identical.
---
*Draft for human review. Diagnosed and fixed by the
`snapshot-regression-fixer` maintenance workflow.*
> Generated by [Fix a Z3 snapshot-regression
divergence](https://github.com/Z3Prover/bench/actions/runs/28502614658)
· 890.7 AIC · ⌖ 46.8 AIC · ⊞ 9K ·
[◷](https://github.com/search?q=repo%3AZ3Prover%2Fz3+%22gh-aw-workflow-id%3A+snapshot-regression-fixer%22&type=pullrequests)
<!-- gh-aw-agentic-workflow: Fix a Z3 snapshot-regression divergence,
engine: copilot, version: 1.0.63, model: claude-opus-4.8, id:
28502614658, workflow_id: snapshot-regression-fixer, run:
https://github.com/Z3Prover/bench/actions/runs/28502614658 -->
<!-- gh-aw-workflow-id: snapshot-regression-fixer -->
<!-- gh-aw-workflow-call-id: Z3Prover/bench/snapshot-regression-fixer
-->
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
This commit is contained in:
parent
6ac3075022
commit
2a8f66f22b
2 changed files with 155 additions and 22 deletions
|
|
@ -3263,6 +3263,39 @@ br_status seq_rewriter::mk_str_in_regexp(expr* a, expr* b, expr_ref& result) {
|
|||
return BR_DONE;
|
||||
}
|
||||
|
||||
// (str.in_re e (re.range lo hi)) where a bound is not a concrete character.
|
||||
// By SMT-LIB semantics (re.range lo hi) is the set of single characters c
|
||||
// with lo <= c <= hi when lo and hi are themselves single characters, and
|
||||
// the empty language otherwise; so membership is equivalent to lo, hi and
|
||||
// e all being single characters with lo <= e <= hi. The derivative engine
|
||||
// only unfolds ranges whose bounds are concrete characters, so without this
|
||||
// reduction a range with a symbolic bound is left unsolved (and mk_re_range
|
||||
// deliberately keeps such a range symbolic rather than unsoundly collapsing
|
||||
// it to re.empty). Ranges with two concrete single-character bounds keep
|
||||
// their existing derivative-based handling.
|
||||
{
|
||||
expr* rlo = nullptr, *rhi = nullptr;
|
||||
if (re().is_range(b, rlo, rhi)) {
|
||||
auto concrete_char = [&](expr* e) {
|
||||
zstring s;
|
||||
expr* ch = nullptr;
|
||||
unsigned uc = 0;
|
||||
return (str().is_string(e, s) && s.length() == 1) ||
|
||||
(str().is_unit(e, ch) && m_util.is_const_char(ch, uc));
|
||||
};
|
||||
if (!concrete_char(rlo) || !concrete_char(rhi)) {
|
||||
expr_ref_vector conj(m());
|
||||
conj.push_back(m().mk_eq(str().mk_length(rlo), one()));
|
||||
conj.push_back(m().mk_eq(str().mk_length(rhi), one()));
|
||||
conj.push_back(m().mk_eq(str().mk_length(a), one()));
|
||||
conj.push_back(str().mk_lex_le(rlo, a));
|
||||
conj.push_back(str().mk_lex_le(a, rhi));
|
||||
result = m().mk_and(conj);
|
||||
return BR_REWRITE_FULL;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
zstring s;
|
||||
if (str().is_string(a, s) && re().is_ground(b)) {
|
||||
// Just check membership and replace by true/false
|
||||
|
|
@ -4126,35 +4159,47 @@ br_status seq_rewriter::mk_re_range(expr* lo, expr* hi, expr_ref& result) {
|
|||
is_empty = true;
|
||||
if (max_length(hi) == std::make_pair(true, rational(0)))
|
||||
is_empty = true;
|
||||
if (!is_empty) {
|
||||
if (str().is_string(lo, slo) && slo.length() == 1)
|
||||
clo = slo[0];
|
||||
else if (str().is_unit(lo, lo1) && m_util.is_const_char(lo1, clo))
|
||||
;
|
||||
else
|
||||
is_empty = true;
|
||||
}
|
||||
if (!is_empty) {
|
||||
if (str().is_string(hi, shi) && shi.length() == 1)
|
||||
chi = shi[0];
|
||||
else if (str().is_unit(hi, hi1) && m_util.is_const_char(hi1, chi))
|
||||
;
|
||||
else
|
||||
is_empty = true;
|
||||
}
|
||||
|
||||
// clo/chi are only meaningful once both bounds were extracted; an early
|
||||
// is_empty (from the length checks) leaves them at their default 0, so the
|
||||
// is_empty return must come before the singleton/ordering checks below.
|
||||
if (!is_empty && clo > chi)
|
||||
is_empty = true;
|
||||
|
||||
// A provable length constraint (a bound can never be a single character)
|
||||
// is the only sound way to conclude emptiness for a possibly-symbolic
|
||||
// bound, so decide emptiness here before attempting to read concrete
|
||||
// characters.
|
||||
if (is_empty) {
|
||||
sort* srt = re().mk_re(lo->get_sort());
|
||||
result = re().mk_empty(srt);
|
||||
return BR_DONE;
|
||||
}
|
||||
|
||||
// Try to read concrete single-character bounds. A bound that is not a
|
||||
// syntactic single-character literal is *symbolic* (its value depends on
|
||||
// the model), NOT empty: collapsing such a range to re.empty is unsound
|
||||
// (e.g. (re.range x x) is {x} whenever x is a single character), so we
|
||||
// leave the range unevaluated (BR_FAILED) and let the theory solver
|
||||
// reason about it.
|
||||
bool has_clo = false, has_chi = false;
|
||||
if (str().is_string(lo, slo) && slo.length() == 1) {
|
||||
clo = slo[0];
|
||||
has_clo = true;
|
||||
}
|
||||
else if (str().is_unit(lo, lo1) && m_util.is_const_char(lo1, clo))
|
||||
has_clo = true;
|
||||
if (str().is_string(hi, shi) && shi.length() == 1) {
|
||||
chi = shi[0];
|
||||
has_chi = true;
|
||||
}
|
||||
else if (str().is_unit(hi, hi1) && m_util.is_const_char(hi1, chi))
|
||||
has_chi = true;
|
||||
|
||||
if (!has_clo || !has_chi)
|
||||
return BR_FAILED;
|
||||
|
||||
// Both bounds are concrete characters: an inverted range is empty.
|
||||
if (clo > chi) {
|
||||
sort* srt = re().mk_re(lo->get_sort());
|
||||
result = re().mk_empty(srt);
|
||||
return BR_DONE;
|
||||
}
|
||||
|
||||
// Singleton: re.range "a" "a" → str.to_re "a"
|
||||
if (clo == chi) {
|
||||
result = re().mk_to_re(str().mk_string(zstring(clo)));
|
||||
|
|
|
|||
|
|
@ -10,12 +10,19 @@ Tests:
|
|||
4. Range ∪ Range → merged range for overlapping/adjacent
|
||||
5. Complement of range → one or two ranges
|
||||
6. Downstream operators absorb empty ranges correctly
|
||||
15. Symbolic-bound range membership rewrite (structural)
|
||||
16. Symbolic-bound range membership: concrete element, symbolic bounds (structural)
|
||||
17. Solver: (str.in_re x (re.range x x)) sat when len(x)=1
|
||||
18. Solver: (str.in_re x (re.range x x)) unsat when len(x)=2
|
||||
19. Solver: inverted symbolic bounds make membership unsatisfiable
|
||||
--*/
|
||||
|
||||
#include "ast/arith_decl_plugin.h"
|
||||
#include "ast/ast_pp.h"
|
||||
#include "ast/reg_decl_plugins.h"
|
||||
#include "ast/rewriter/th_rewriter.h"
|
||||
#include "ast/seq_decl_plugin.h"
|
||||
#include "smt/smt_context.h"
|
||||
#include <iostream>
|
||||
|
||||
// Build a single-char string literal expression.
|
||||
|
|
@ -167,5 +174,86 @@ void tst_seq_rewriter() {
|
|||
ENSURE(su.re.is_empty(e));
|
||||
}
|
||||
|
||||
// -----------------------------------------------------------------------
|
||||
// 15. Symbolic-bound range membership rewrite (structural).
|
||||
// (str.in_re x (re.range x x)) with symbolic x should be unfolded
|
||||
// by the rewriter into a conjunction of length and ordering
|
||||
// constraints, not left stuck as an uninterpreted membership term.
|
||||
// -----------------------------------------------------------------------
|
||||
{
|
||||
app_ref x(m.mk_fresh_const("x", str_sort), m);
|
||||
expr_ref rng(su.re.mk_range(x, x), m);
|
||||
expr_ref e(su.re.mk_in_re(x, rng), m);
|
||||
rw(e);
|
||||
std::cout << "symbolic range (x in [x,x]): " << mk_pp(e, m) << "\n";
|
||||
ENSURE(m.is_and(e));
|
||||
}
|
||||
|
||||
// -----------------------------------------------------------------------
|
||||
// 16. Symbolic-bound range membership: concrete element, symbolic bounds.
|
||||
// (str.in_re "b" (re.range lo hi)) should also be unfolded to a
|
||||
// conjunction when lo/hi are free variables.
|
||||
// -----------------------------------------------------------------------
|
||||
{
|
||||
app_ref lo(m.mk_fresh_const("lo", str_sort), m);
|
||||
app_ref hi(m.mk_fresh_const("hi", str_sort), m);
|
||||
expr_ref b_str(su.str.mk_string(zstring('b')), m);
|
||||
expr_ref rng(su.re.mk_range(lo, hi), m);
|
||||
expr_ref e(su.re.mk_in_re(b_str, rng), m);
|
||||
rw(e);
|
||||
std::cout << "symbolic range (\"b\" in [lo,hi]): " << mk_pp(e, m) << "\n";
|
||||
ENSURE(m.is_and(e));
|
||||
}
|
||||
|
||||
// -----------------------------------------------------------------------
|
||||
// Solver-level tests: the unfolded conjunction must be decidable.
|
||||
// -----------------------------------------------------------------------
|
||||
{
|
||||
arith_util a_util(m);
|
||||
|
||||
// 17. sat: (str.in_re x (re.range x x)) ∧ len(x)=1
|
||||
{
|
||||
smt_params sp;
|
||||
smt::context ctx(m, sp);
|
||||
app_ref x(m.mk_fresh_const("x", str_sort), m);
|
||||
ctx.assert_expr(su.re.mk_in_re(x, su.re.mk_range(x, x)));
|
||||
ctx.assert_expr(m.mk_eq(su.str.mk_length(x), a_util.mk_int(1)));
|
||||
lbool res = ctx.check();
|
||||
std::cout << "symbolic range solver sat (len=1): " << res << "\n";
|
||||
ENSURE(res == l_true);
|
||||
}
|
||||
|
||||
// 18. unsat: (str.in_re x (re.range x x)) ∧ len(x)=2
|
||||
// The unfolded membership requires len(x)=1, which contradicts len(x)=2.
|
||||
{
|
||||
smt_params sp;
|
||||
smt::context ctx(m, sp);
|
||||
app_ref x(m.mk_fresh_const("x", str_sort), m);
|
||||
ctx.assert_expr(su.re.mk_in_re(x, su.re.mk_range(x, x)));
|
||||
ctx.assert_expr(m.mk_eq(su.str.mk_length(x), a_util.mk_int(2)));
|
||||
lbool res = ctx.check();
|
||||
std::cout << "symbolic range solver unsat (len=2): " << res << "\n";
|
||||
ENSURE(res == l_false);
|
||||
}
|
||||
|
||||
// 19. unsat: inverted symbolic bounds make membership false.
|
||||
// (str.in_re "b" (re.range lo hi)) ∧ lo="z" ∧ hi="a"
|
||||
// The unfolded conjunction requires lo <=_lex "b" <=_lex hi, but
|
||||
// "z" > "b" > "a" so the ordering constraints are unsatisfiable.
|
||||
{
|
||||
smt_params sp;
|
||||
smt::context ctx(m, sp);
|
||||
app_ref lo(m.mk_fresh_const("lo", str_sort), m);
|
||||
app_ref hi(m.mk_fresh_const("hi", str_sort), m);
|
||||
expr_ref b_str(su.str.mk_string(zstring('b')), m);
|
||||
ctx.assert_expr(su.re.mk_in_re(b_str, su.re.mk_range(lo, hi)));
|
||||
ctx.assert_expr(m.mk_eq(lo, su.str.mk_string(zstring('z'))));
|
||||
ctx.assert_expr(m.mk_eq(hi, su.str.mk_string(zstring('a'))));
|
||||
lbool res = ctx.check();
|
||||
std::cout << "symbolic range solver inverted bounds unsat: " << res << "\n";
|
||||
ENSURE(res == l_false);
|
||||
}
|
||||
}
|
||||
|
||||
std::cout << "tst_seq_rewriter: all tests passed\n";
|
||||
}
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue