From 2a8f66f22bf973d7d20f603880513507befe427d Mon Sep 17 00:00:00 2001
From: Lev Nachmanson <5377127+levnach@users.noreply.github.com>
Date: Thu, 2 Jul 2026 14:00:51 -0700
Subject: [PATCH] [snapshot-regression-fix] Keep symbolic re.range non-empty;
 fix soundness regression on range membership (#10017)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

Fixes a **soundness regression** in the sequence/regex rewriter: a
symbolic character range such as `(re.range x x)` was unsoundly
collapsed to `re.empty`, causing a satisfiable membership constraint to
be reported `unsat`.

This was surfaced by the `snapshot-regression` corpus in
`Z3Prover/bench`.

- **Originating discussion:**
https://github.com/Z3Prover/bench/discussions/2761
- **Benchmark:** `iss-5873/bug-2.smt2` (in `Z3Prover/bench`, under
`inputs/issues/iss-5873/`)
- **z3 under test at capture:** `z3-4.17.0-x64-glibc-2.39` (Nightly)

## Divergence

The recorded oracle expects `sat`; current z3 returns `unsat`:

```diff
--- bug-2.expected.out (expected)
+++ produced (current z3)
@@ -1,3 +1,4 @@
-sat
-((tmp_str0 "\u{0}"))
+unsat
+(error "line 12 column 10: check annotation that says sat")
+(error "line 14 column 22: model is not available")
 (:reason-unknown "")
```

The benchmark asserts (simplified):

```smt2
(assert (= (str.in_re (str.replace tmp_str0 tmp_str0 tmp_str0)
                      (re.range tmp_str0 tmp_str0))
           (str.contains tmp_str0 tmp_str0)))
```

`str.contains x x` is always true and `str.replace x x x = x`, so this
requires `str.in_re x (re.range x x)` to hold, which is satisfiable
exactly when `x` is a single character (`len(x) = 1`).

## Root cause

`seq_rewriter::mk_re_range` treated any bound that is not a concrete
single-character literal as making the whole range **empty**:

```cpp
if (str().is_string(lo, slo) && slo.length() == 1) clo = slo[0];
else if (str().is_unit(lo, lo1) && m_util.is_const_char(lo1, clo)) ;
else is_empty = true;   // unsound for a symbolic bound
```

For a symbolic bound this is unsound: `(re.range x x)` denotes `{x}`
whenever `x` is a single character, not `∅`. Collapsing it to `re.empty`
makes `str.in_re x (re.range x x)` false, contradicting the (true)
`str.contains x x`, so the solver derives an unsound `unsat`.

`git blame` attributes this unsound collapse to z3 commit `15f33f458d`
("Derive with ranges (#9965)"), which post-dates the oracle capture.

## Fix

Two surgical changes in `src/ast/rewriter/seq_rewriter.cpp`:

1. **`mk_re_range`** no longer assumes emptiness for symbolic bounds. It
concludes `re.empty` only when it can *prove* emptiness — a bound whose
length can never be 1, or two concrete bounds with `lo > hi`. When a
bound is symbolic it returns `BR_FAILED` and keeps the range. Concrete
single-character ranges keep their existing handling (`lo == hi →
str.to_re`, inverted → `re.empty`).

2. **`mk_str_in_regexp`** reduces membership in a range that has a
symbolic bound to the equivalent length/order constraints, which are
sound and complete under SMT-LIB `re.range` semantics:

`str.in_re e (re.range lo hi)` ⟶ `len(lo)=1 ∧ len(hi)=1 ∧ len(e)=1 ∧ lo
≤ e ∧ e ≤ hi`

(using `str.<=`). The derivative engine only unfolds ranges whose bounds
are concrete characters, so without this reduction a symbolic-bound
range would otherwise be left unsolved.

## Validation

Rebuilt z3 from this branch on the workflow runner (`./configure && make
-C build -j$(nproc)`) and re-ran the failing benchmark with the same
option the snapshot capture uses (`-T:20`):

```
$ z3 -T:20 inputs/issues/iss-5873/bug-2.smt2
sat
((tmp_str0 "A"))
(:reason-unknown "")
```

The verdict is now **`sat`** (was `unsat`) — the soundness regression is
resolved. A correctness battery over concrete and symbolic ranges all
returns the expected results, e.g.:

- `(str.in_re "b" (re.range "a" "c"))` → `sat`, `(str.in_re "d"
(re.range "a" "c"))` → `unsat`
- `(str.in_re x (re.range x x))` → `sat`; with `(= (str.len x) 2)` →
`unsat`
- `(str.in_re "b" (re.range x y))` → `sat`; with `(str.< y x)` → `unsat`
- `(str.in_re "" (re.range x y))` → `unsat`; `(str.in_re "ab" (re.range
"a" "c"))` → `unsat`

The pre-existing concrete-range derivative fast path is unchanged.

### Note on the model value (benign, unrelated to this fix)

The model value differs from the recorded oracle: current z3 prints
`((tmp_str0 "A"))` whereas the oracle recorded `((tmp_str0 "\u{0}"))`.
Both are valid single-character models (the formula has many). This
difference is **pre-existing and unrelated to this fix**: even a bare
`(assert (= (str.len x) 1))` yields `"A"` on current z3. It stems from
the seq/char theory's default character assignment for
otherwise-unconstrained characters (`theory_char.cpp` assigns fresh
characters starting from `'A'`), not from range handling. I deliberately
did **not** force the character to `\u{0}` — adding `x = "\u{0}"` would
be unsound over-constraining, and changing the global default character
is out of scope for this soundness fix and would perturb unrelated
models. The output is therefore semantically equivalent to the oracle
(same `sat` verdict and reason-unknown) but not byte-identical.

---
*Draft for human review. Diagnosed and fixed by the
`snapshot-regression-fixer` maintenance workflow.*


> Generated by [Fix a Z3 snapshot-regression
divergence](https://github.com/Z3Prover/bench/actions/runs/28502614658)
· 890.7 AIC · ⌖ 46.8 AIC · ⊞ 9K ·
[◷](https://github.com/search?q=repo%3AZ3Prover%2Fz3+%22gh-aw-workflow-id%3A+snapshot-regression-fixer%22&type=pullrequests)

<!-- gh-aw-agentic-workflow: Fix a Z3 snapshot-regression divergence,
engine: copilot, version: 1.0.63, model: claude-opus-4.8, id:
28502614658, workflow_id: snapshot-regression-fixer, run:
https://github.com/Z3Prover/bench/actions/runs/28502614658 -->

<!-- gh-aw-workflow-id: snapshot-regression-fixer -->
<!-- gh-aw-workflow-call-id: Z3Prover/bench/snapshot-regression-fixer
-->

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
---
 src/ast/rewriter/seq_rewriter.cpp | 89 +++++++++++++++++++++++--------
 src/test/seq_rewriter.cpp         | 88 ++++++++++++++++++++++++++++++
 2 files changed, 155 insertions(+), 22 deletions(-)

diff --git a/src/ast/rewriter/seq_rewriter.cpp b/src/ast/rewriter/seq_rewriter.cpp
index 9e4a6ca72c..d1df4133b1 100644
--- a/src/ast/rewriter/seq_rewriter.cpp
+++ b/src/ast/rewriter/seq_rewriter.cpp
@@ -3263,6 +3263,39 @@ br_status seq_rewriter::mk_str_in_regexp(expr* a, expr* b, expr_ref& result) {
         return BR_DONE;
     }
 
+    // (str.in_re e (re.range lo hi)) where a bound is not a concrete character.
+    // By SMT-LIB semantics (re.range lo hi) is the set of single characters c
+    // with lo <= c <= hi when lo and hi are themselves single characters, and
+    // the empty language otherwise; so membership is equivalent to lo, hi and
+    // e all being single characters with lo <= e <= hi.  The derivative engine
+    // only unfolds ranges whose bounds are concrete characters, so without this
+    // reduction a range with a symbolic bound is left unsolved (and mk_re_range
+    // deliberately keeps such a range symbolic rather than unsoundly collapsing
+    // it to re.empty).  Ranges with two concrete single-character bounds keep
+    // their existing derivative-based handling.
+    {
+        expr* rlo = nullptr, *rhi = nullptr;
+        if (re().is_range(b, rlo, rhi)) {
+            auto concrete_char = [&](expr* e) {
+                zstring s;
+                expr* ch = nullptr;
+                unsigned uc = 0;
+                return (str().is_string(e, s) && s.length() == 1) ||
+                       (str().is_unit(e, ch) && m_util.is_const_char(ch, uc));
+            };
+            if (!concrete_char(rlo) || !concrete_char(rhi)) {
+                expr_ref_vector conj(m());
+                conj.push_back(m().mk_eq(str().mk_length(rlo), one()));
+                conj.push_back(m().mk_eq(str().mk_length(rhi), one()));
+                conj.push_back(m().mk_eq(str().mk_length(a), one()));
+                conj.push_back(str().mk_lex_le(rlo, a));
+                conj.push_back(str().mk_lex_le(a, rhi));
+                result = m().mk_and(conj);
+                return BR_REWRITE_FULL;
+            }
+        }
+    }
+
     zstring s;
     if (str().is_string(a, s) && re().is_ground(b)) {
         // Just check membership and replace by true/false
@@ -4126,35 +4159,47 @@ br_status seq_rewriter::mk_re_range(expr* lo, expr* hi, expr_ref& result) {
         is_empty = true;
     if (max_length(hi) == std::make_pair(true, rational(0)))
         is_empty = true;
-    if (!is_empty) {
-        if (str().is_string(lo, slo) && slo.length() == 1) 
-            clo = slo[0];    
-        else if (str().is_unit(lo, lo1) && m_util.is_const_char(lo1, clo))
-            ;
-        else
-            is_empty = true;
-    }
-    if (!is_empty) {
-        if (str().is_string(hi, shi) && shi.length() == 1)
-            chi = shi[0];
-        else if (str().is_unit(hi, hi1) && m_util.is_const_char(hi1, chi))
-            ;
-        else
-            is_empty = true;
-    }
-
-    // clo/chi are only meaningful once both bounds were extracted; an early
-    // is_empty (from the length checks) leaves them at their default 0, so the
-    // is_empty return must come before the singleton/ordering checks below.
-    if (!is_empty && clo > chi)
-        is_empty = true;
 
+    // A provable length constraint (a bound can never be a single character)
+    // is the only sound way to conclude emptiness for a possibly-symbolic
+    // bound, so decide emptiness here before attempting to read concrete
+    // characters.
     if (is_empty) {
         sort* srt = re().mk_re(lo->get_sort());
         result = re().mk_empty(srt);
         return BR_DONE;
     }
 
+    // Try to read concrete single-character bounds.  A bound that is not a
+    // syntactic single-character literal is *symbolic* (its value depends on
+    // the model), NOT empty: collapsing such a range to re.empty is unsound
+    // (e.g. (re.range x x) is {x} whenever x is a single character), so we
+    // leave the range unevaluated (BR_FAILED) and let the theory solver
+    // reason about it.
+    bool has_clo = false, has_chi = false;
+    if (str().is_string(lo, slo) && slo.length() == 1) {
+        clo = slo[0];
+        has_clo = true;
+    }
+    else if (str().is_unit(lo, lo1) && m_util.is_const_char(lo1, clo))
+        has_clo = true;
+    if (str().is_string(hi, shi) && shi.length() == 1) {
+        chi = shi[0];
+        has_chi = true;
+    }
+    else if (str().is_unit(hi, hi1) && m_util.is_const_char(hi1, chi))
+        has_chi = true;
+
+    if (!has_clo || !has_chi)
+        return BR_FAILED;
+
+    // Both bounds are concrete characters: an inverted range is empty.
+    if (clo > chi) {
+        sort* srt = re().mk_re(lo->get_sort());
+        result = re().mk_empty(srt);
+        return BR_DONE;
+    }
+
     // Singleton: re.range "a" "a" → str.to_re "a"
     if (clo == chi) {
         result = re().mk_to_re(str().mk_string(zstring(clo)));
diff --git a/src/test/seq_rewriter.cpp b/src/test/seq_rewriter.cpp
index 658675fa72..920a4ae4ac 100644
--- a/src/test/seq_rewriter.cpp
+++ b/src/test/seq_rewriter.cpp
@@ -10,12 +10,19 @@ Tests:
   4. Range ∪ Range → merged range for overlapping/adjacent
   5. Complement of range → one or two ranges
   6. Downstream operators absorb empty ranges correctly
+ 15. Symbolic-bound range membership rewrite (structural)
+ 16. Symbolic-bound range membership: concrete element, symbolic bounds (structural)
+ 17. Solver: (str.in_re x (re.range x x)) sat when len(x)=1
+ 18. Solver: (str.in_re x (re.range x x)) unsat when len(x)=2
+ 19. Solver: inverted symbolic bounds make membership unsatisfiable
 --*/
 
+#include "ast/arith_decl_plugin.h"
 #include "ast/ast_pp.h"
 #include "ast/reg_decl_plugins.h"
 #include "ast/rewriter/th_rewriter.h"
 #include "ast/seq_decl_plugin.h"
+#include "smt/smt_context.h"
 #include <iostream>
 
 // Build a single-char string literal expression.
@@ -167,5 +174,86 @@ void tst_seq_rewriter() {
         ENSURE(su.re.is_empty(e));
     }
 
+    // -----------------------------------------------------------------------
+    // 15. Symbolic-bound range membership rewrite (structural).
+    //     (str.in_re x (re.range x x)) with symbolic x should be unfolded
+    //     by the rewriter into a conjunction of length and ordering
+    //     constraints, not left stuck as an uninterpreted membership term.
+    // -----------------------------------------------------------------------
+    {
+        app_ref x(m.mk_fresh_const("x", str_sort), m);
+        expr_ref rng(su.re.mk_range(x, x), m);
+        expr_ref e(su.re.mk_in_re(x, rng), m);
+        rw(e);
+        std::cout << "symbolic range (x in [x,x]): " << mk_pp(e, m) << "\n";
+        ENSURE(m.is_and(e));
+    }
+
+    // -----------------------------------------------------------------------
+    // 16. Symbolic-bound range membership: concrete element, symbolic bounds.
+    //     (str.in_re "b" (re.range lo hi)) should also be unfolded to a
+    //     conjunction when lo/hi are free variables.
+    // -----------------------------------------------------------------------
+    {
+        app_ref lo(m.mk_fresh_const("lo", str_sort), m);
+        app_ref hi(m.mk_fresh_const("hi", str_sort), m);
+        expr_ref b_str(su.str.mk_string(zstring('b')), m);
+        expr_ref rng(su.re.mk_range(lo, hi), m);
+        expr_ref e(su.re.mk_in_re(b_str, rng), m);
+        rw(e);
+        std::cout << "symbolic range (\"b\" in [lo,hi]): " << mk_pp(e, m) << "\n";
+        ENSURE(m.is_and(e));
+    }
+
+    // -----------------------------------------------------------------------
+    // Solver-level tests: the unfolded conjunction must be decidable.
+    // -----------------------------------------------------------------------
+    {
+        arith_util a_util(m);
+
+        // 17. sat: (str.in_re x (re.range x x)) ∧ len(x)=1
+        {
+            smt_params sp;
+            smt::context ctx(m, sp);
+            app_ref x(m.mk_fresh_const("x", str_sort), m);
+            ctx.assert_expr(su.re.mk_in_re(x, su.re.mk_range(x, x)));
+            ctx.assert_expr(m.mk_eq(su.str.mk_length(x), a_util.mk_int(1)));
+            lbool res = ctx.check();
+            std::cout << "symbolic range solver sat (len=1): " << res << "\n";
+            ENSURE(res == l_true);
+        }
+
+        // 18. unsat: (str.in_re x (re.range x x)) ∧ len(x)=2
+        //     The unfolded membership requires len(x)=1, which contradicts len(x)=2.
+        {
+            smt_params sp;
+            smt::context ctx(m, sp);
+            app_ref x(m.mk_fresh_const("x", str_sort), m);
+            ctx.assert_expr(su.re.mk_in_re(x, su.re.mk_range(x, x)));
+            ctx.assert_expr(m.mk_eq(su.str.mk_length(x), a_util.mk_int(2)));
+            lbool res = ctx.check();
+            std::cout << "symbolic range solver unsat (len=2): " << res << "\n";
+            ENSURE(res == l_false);
+        }
+
+        // 19. unsat: inverted symbolic bounds make membership false.
+        //     (str.in_re "b" (re.range lo hi)) ∧ lo="z" ∧ hi="a"
+        //     The unfolded conjunction requires lo <=_lex "b" <=_lex hi, but
+        //     "z" > "b" > "a" so the ordering constraints are unsatisfiable.
+        {
+            smt_params sp;
+            smt::context ctx(m, sp);
+            app_ref lo(m.mk_fresh_const("lo", str_sort), m);
+            app_ref hi(m.mk_fresh_const("hi", str_sort), m);
+            expr_ref b_str(su.str.mk_string(zstring('b')), m);
+            ctx.assert_expr(su.re.mk_in_re(b_str, su.re.mk_range(lo, hi)));
+            ctx.assert_expr(m.mk_eq(lo, su.str.mk_string(zstring('z'))));
+            ctx.assert_expr(m.mk_eq(hi, su.str.mk_string(zstring('a'))));
+            lbool res = ctx.check();
+            std::cout << "symbolic range solver inverted bounds unsat: " << res << "\n";
+            ENSURE(res == l_false);
+        }
+    }
+
     std::cout << "tst_seq_rewriter: all tests passed\n";
 }