[snapshot-regression-fix] Keep symbolic re.range non-empty; fix soundness regression on range membership (#10017)

## Summary Fixes a **soundness regression** in the sequence/regex rewriter: a symbolic character range such as `(re.range x x)` was unsoundly collapsed to `re.empty`, causing a satisfiable membership constraint to be reported `unsat`. This was surfaced by the `snapshot-regression` corpus in `Z3Prover/bench`. - **Originating discussion:** https://github.com/Z3Prover/bench/discussions/2761 - **Benchmark:** `iss-5873/bug-2.smt2` (in `Z3Prover/bench`, under `inputs/issues/iss-5873/`) - **z3 under test at capture:** `z3-4.17.0-x64-glibc-2.39` (Nightly) ## Divergence The recorded oracle expects `sat`; current z3 returns `unsat`: ```diff --- bug-2.expected.out (expected) +++ produced (current z3) @@ -1,3 +1,4 @@ -sat -((tmp_str0 "\u{0}")) +unsat +(error "line 12 column 10: check annotation that says sat") +(error "line 14 column 22: model is not available") (:reason-unknown "") ``` The benchmark asserts (simplified): ```smt2 (assert (= (str.in_re (str.replace tmp_str0 tmp_str0 tmp_str0) (re.range tmp_str0 tmp_str0)) (str.contains tmp_str0 tmp_str0))) ``` `str.contains x x` is always true and `str.replace x x x = x`, so this requires `str.in_re x (re.range x x)` to hold, which is satisfiable exactly when `x` is a single character (`len(x) = 1`). ## Root cause `seq_rewriter::mk_re_range` treated any bound that is not a concrete single-character literal as making the whole range **empty**: ```cpp if (str().is_string(lo, slo) && slo.length() == 1) clo = slo[0]; else if (str().is_unit(lo, lo1) && m_util.is_const_char(lo1, clo)) ; else is_empty = true; // unsound for a symbolic bound ``` For a symbolic bound this is unsound: `(re.range x x)` denotes `{x}` whenever `x` is a single character, not `∅`. Collapsing it to `re.empty` makes `str.in_re x (re.range x x)` false, contradicting the (true) `str.contains x x`, so the solver derives an unsound `unsat`. `git blame` attributes this unsound collapse to z3 commit `15f33f458d` ("Derive with ranges (#9965)"), which post-dates the oracle capture. ## Fix Two surgical changes in `src/ast/rewriter/seq_rewriter.cpp`: 1. **`mk_re_range`** no longer assumes emptiness for symbolic bounds. It concludes `re.empty` only when it can *prove* emptiness — a bound whose length can never be 1, or two concrete bounds with `lo > hi`. When a bound is symbolic it returns `BR_FAILED` and keeps the range. Concrete single-character ranges keep their existing handling (`lo == hi → str.to_re`, inverted → `re.empty`). 2. **`mk_str_in_regexp`** reduces membership in a range that has a symbolic bound to the equivalent length/order constraints, which are sound and complete under SMT-LIB `re.range` semantics: `str.in_re e (re.range lo hi)` ⟶ `len(lo)=1 ∧ len(hi)=1 ∧ len(e)=1 ∧ lo ≤ e ∧ e ≤ hi` (using `str.<=`). The derivative engine only unfolds ranges whose bounds are concrete characters, so without this reduction a symbolic-bound range would otherwise be left unsolved. ## Validation Rebuilt z3 from this branch on the workflow runner (`./configure && make -C build -j$(nproc)`) and re-ran the failing benchmark with the same option the snapshot capture uses (`-T:20`): ``` $ z3 -T:20 inputs/issues/iss-5873/bug-2.smt2 sat ((tmp_str0 "A")) (:reason-unknown "") ``` The verdict is now **`sat`** (was `unsat`) — the soundness regression is resolved. A correctness battery over concrete and symbolic ranges all returns the expected results, e.g.: - `(str.in_re "b" (re.range "a" "c"))` → `sat`, `(str.in_re "d" (re.range "a" "c"))` → `unsat` - `(str.in_re x (re.range x x))` → `sat`; with `(= (str.len x) 2)` → `unsat` - `(str.in_re "b" (re.range x y))` → `sat`; with `(str.< y x)` → `unsat` - `(str.in_re "" (re.range x y))` → `unsat`; `(str.in_re "ab" (re.range "a" "c"))` → `unsat` The pre-existing concrete-range derivative fast path is unchanged. ### Note on the model value (benign, unrelated to this fix) The model value differs from the recorded oracle: current z3 prints `((tmp_str0 "A"))` whereas the oracle recorded `((tmp_str0 "\u{0}"))`. Both are valid single-character models (the formula has many). This difference is **pre-existing and unrelated to this fix**: even a bare `(assert (= (str.len x) 1))` yields `"A"` on current z3. It stems from the seq/char theory's default character assignment for otherwise-unconstrained characters (`theory_char.cpp` assigns fresh characters starting from `'A'`), not from range handling. I deliberately did **not** force the character to `\u{0}` — adding `x = "\u{0}"` would be unsound over-constraining, and changing the global default character is out of scope for this soundness fix and would perturb unrelated models. The output is therefore semantically equivalent to the oracle (same `sat` verdict and reason-unknown) but not byte-identical. --- *Draft for human review. Diagnosed and fixed by the `snapshot-regression-fixer` maintenance workflow.* > Generated by [Fix a Z3 snapshot-regression divergence](https://github.com/Z3Prover/bench/actions/runs/28502614658) · 890.7 AIC · ⌖ 46.8 AIC · ⊞ 9K · [◷](https://github.com/search?q=repo%3AZ3Prover%2Fz3+%22gh-aw-workflow-id%3A+snapshot-regression-fixer%22&type=pullrequests)    --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
2026-07-05 06:46:11 +00:00 · 2026-07-02 14:00:51 -07:00 · 2026-07-02 14:00:51 -07:00 · 2a8f66f22b
commit 2a8f66f22b
parent 6ac3075022
2 changed files with 155 additions and 22 deletions
--- a/src/ast/rewriter/seq_rewriter.cpp
+++ b/src/ast/rewriter/seq_rewriter.cpp
@ -3263,6 +3263,39 @@ br_status seq_rewriter::mk_str_in_regexp(expr* a, expr* b, expr_ref& result) {
        return BR_DONE;
    }

+    // (str.in_re e (re.range lo hi)) where a bound is not a concrete character.
+    // By SMT-LIB semantics (re.range lo hi) is the set of single characters c
+    // with lo <= c <= hi when lo and hi are themselves single characters, and
+    // the empty language otherwise; so membership is equivalent to lo, hi and
+    // e all being single characters with lo <= e <= hi.  The derivative engine
+    // only unfolds ranges whose bounds are concrete characters, so without this
+    // reduction a range with a symbolic bound is left unsolved (and mk_re_range
+    // deliberately keeps such a range symbolic rather than unsoundly collapsing
+    // it to re.empty).  Ranges with two concrete single-character bounds keep
+    // their existing derivative-based handling.
+    {
+        expr* rlo = nullptr, *rhi = nullptr;
+        if (re().is_range(b, rlo, rhi)) {
+            auto concrete_char = [&](expr* e) {
+                zstring s;
+                expr* ch = nullptr;
+                unsigned uc = 0;
+                return (str().is_string(e, s) && s.length() == 1) ||
+                       (str().is_unit(e, ch) && m_util.is_const_char(ch, uc));
+            };
+            if (!concrete_char(rlo) || !concrete_char(rhi)) {
+                expr_ref_vector conj(m());
+                conj.push_back(m().mk_eq(str().mk_length(rlo), one()));
+                conj.push_back(m().mk_eq(str().mk_length(rhi), one()));
+                conj.push_back(m().mk_eq(str().mk_length(a), one()));
+                conj.push_back(str().mk_lex_le(rlo, a));
+                conj.push_back(str().mk_lex_le(a, rhi));
+                result = m().mk_and(conj);
+                return BR_REWRITE_FULL;
+            }
+        }
+    }
+
    zstring s;
    if (str().is_string(a, s) && re().is_ground(b)) {
        // Just check membership and replace by true/false
@ -4126,35 +4159,47 @@ br_status seq_rewriter::mk_re_range(expr* lo, expr* hi, expr_ref& result) {
        is_empty = true;
    if (max_length(hi) == std::make_pair(true, rational(0)))
        is_empty = true;
-    if (!is_empty) {
-        if (str().is_string(lo, slo) && slo.length() == 1) 
-            clo = slo[0];    
-        else if (str().is_unit(lo, lo1) && m_util.is_const_char(lo1, clo))
-            ;
-        else
-            is_empty = true;
-    }
-    if (!is_empty) {
-        if (str().is_string(hi, shi) && shi.length() == 1)
-            chi = shi[0];
-        else if (str().is_unit(hi, hi1) && m_util.is_const_char(hi1, chi))
-            ;
-        else
-            is_empty = true;
-    }
-
-    // clo/chi are only meaningful once both bounds were extracted; an early
-    // is_empty (from the length checks) leaves them at their default 0, so the
-    // is_empty return must come before the singleton/ordering checks below.
-    if (!is_empty && clo > chi)
-        is_empty = true;

+    // A provable length constraint (a bound can never be a single character)
+    // is the only sound way to conclude emptiness for a possibly-symbolic
+    // bound, so decide emptiness here before attempting to read concrete
+    // characters.
    if (is_empty) {
        sort* srt = re().mk_re(lo->get_sort());
        result = re().mk_empty(srt);
        return BR_DONE;
    }

+    // Try to read concrete single-character bounds.  A bound that is not a
+    // syntactic single-character literal is *symbolic* (its value depends on
+    // the model), NOT empty: collapsing such a range to re.empty is unsound
+    // (e.g. (re.range x x) is {x} whenever x is a single character), so we
+    // leave the range unevaluated (BR_FAILED) and let the theory solver
+    // reason about it.
+    bool has_clo = false, has_chi = false;
+    if (str().is_string(lo, slo) && slo.length() == 1) {
+        clo = slo[0];
+        has_clo = true;
+    }
+    else if (str().is_unit(lo, lo1) && m_util.is_const_char(lo1, clo))
+        has_clo = true;
+    if (str().is_string(hi, shi) && shi.length() == 1) {
+        chi = shi[0];
+        has_chi = true;
+    }
+    else if (str().is_unit(hi, hi1) && m_util.is_const_char(hi1, chi))
+        has_chi = true;
+
+    if (!has_clo || !has_chi)
+        return BR_FAILED;
+
+    // Both bounds are concrete characters: an inverted range is empty.
+    if (clo > chi) {
+        sort* srt = re().mk_re(lo->get_sort());
+        result = re().mk_empty(srt);
+        return BR_DONE;
+    }
+
    // Singleton: re.range "a" "a" → str.to_re "a"
    if (clo == chi) {
        result = re().mk_to_re(str().mk_string(zstring(clo)));
--- a/src/test/seq_rewriter.cpp
+++ b/src/test/seq_rewriter.cpp
@ -10,12 +10,19 @@ Tests:
  4. Range ∪ Range → merged range for overlapping/adjacent
  5. Complement of range → one or two ranges
  6. Downstream operators absorb empty ranges correctly
+ 15. Symbolic-bound range membership rewrite (structural)
+ 16. Symbolic-bound range membership: concrete element, symbolic bounds (structural)
+ 17. Solver: (str.in_re x (re.range x x)) sat when len(x)=1
+ 18. Solver: (str.in_re x (re.range x x)) unsat when len(x)=2
+ 19. Solver: inverted symbolic bounds make membership unsatisfiable
 --*/

+#include "ast/arith_decl_plugin.h"
 #include "ast/ast_pp.h"
 #include "ast/reg_decl_plugins.h"
 #include "ast/rewriter/th_rewriter.h"
 #include "ast/seq_decl_plugin.h"
+#include "smt/smt_context.h"
 #include <iostream>

 // Build a single-char string literal expression.
@ -167,5 +174,86 @@ void tst_seq_rewriter() {
        ENSURE(su.re.is_empty(e));
    }

+    // -----------------------------------------------------------------------
+    // 15. Symbolic-bound range membership rewrite (structural).
+    //     (str.in_re x (re.range x x)) with symbolic x should be unfolded
+    //     by the rewriter into a conjunction of length and ordering
+    //     constraints, not left stuck as an uninterpreted membership term.
+    // -----------------------------------------------------------------------
+    {
+        app_ref x(m.mk_fresh_const("x", str_sort), m);
+        expr_ref rng(su.re.mk_range(x, x), m);
+        expr_ref e(su.re.mk_in_re(x, rng), m);
+        rw(e);
+        std::cout << "symbolic range (x in [x,x]): " << mk_pp(e, m) << "\n";
+        ENSURE(m.is_and(e));
+    }
+
+    // -----------------------------------------------------------------------
+    // 16. Symbolic-bound range membership: concrete element, symbolic bounds.
+    //     (str.in_re "b" (re.range lo hi)) should also be unfolded to a
+    //     conjunction when lo/hi are free variables.
+    // -----------------------------------------------------------------------
+    {
+        app_ref lo(m.mk_fresh_const("lo", str_sort), m);
+        app_ref hi(m.mk_fresh_const("hi", str_sort), m);
+        expr_ref b_str(su.str.mk_string(zstring('b')), m);
+        expr_ref rng(su.re.mk_range(lo, hi), m);
+        expr_ref e(su.re.mk_in_re(b_str, rng), m);
+        rw(e);
+        std::cout << "symbolic range (\"b\" in [lo,hi]): " << mk_pp(e, m) << "\n";
+        ENSURE(m.is_and(e));
+    }
+
+    // -----------------------------------------------------------------------
+    // Solver-level tests: the unfolded conjunction must be decidable.
+    // -----------------------------------------------------------------------
+    {
+        arith_util a_util(m);
+
+        // 17. sat: (str.in_re x (re.range x x)) ∧ len(x)=1
+        {
+            smt_params sp;
+            smt::context ctx(m, sp);
+            app_ref x(m.mk_fresh_const("x", str_sort), m);
+            ctx.assert_expr(su.re.mk_in_re(x, su.re.mk_range(x, x)));
+            ctx.assert_expr(m.mk_eq(su.str.mk_length(x), a_util.mk_int(1)));
+            lbool res = ctx.check();
+            std::cout << "symbolic range solver sat (len=1): " << res << "\n";
+            ENSURE(res == l_true);
+        }
+
+        // 18. unsat: (str.in_re x (re.range x x)) ∧ len(x)=2
+        //     The unfolded membership requires len(x)=1, which contradicts len(x)=2.
+        {
+            smt_params sp;
+            smt::context ctx(m, sp);
+            app_ref x(m.mk_fresh_const("x", str_sort), m);
+            ctx.assert_expr(su.re.mk_in_re(x, su.re.mk_range(x, x)));
+            ctx.assert_expr(m.mk_eq(su.str.mk_length(x), a_util.mk_int(2)));
+            lbool res = ctx.check();
+            std::cout << "symbolic range solver unsat (len=2): " << res << "\n";
+            ENSURE(res == l_false);
+        }
+
+        // 19. unsat: inverted symbolic bounds make membership false.
+        //     (str.in_re "b" (re.range lo hi)) ∧ lo="z" ∧ hi="a"
+        //     The unfolded conjunction requires lo <=_lex "b" <=_lex hi, but
+        //     "z" > "b" > "a" so the ordering constraints are unsatisfiable.
+        {
+            smt_params sp;
+            smt::context ctx(m, sp);
+            app_ref lo(m.mk_fresh_const("lo", str_sort), m);
+            app_ref hi(m.mk_fresh_const("hi", str_sort), m);
+            expr_ref b_str(su.str.mk_string(zstring('b')), m);
+            ctx.assert_expr(su.re.mk_in_re(b_str, su.re.mk_range(lo, hi)));
+            ctx.assert_expr(m.mk_eq(lo, su.str.mk_string(zstring('z'))));
+            ctx.assert_expr(m.mk_eq(hi, su.str.mk_string(zstring('a'))));
+            lbool res = ctx.check();
+            std::cout << "symbolic range solver inverted bounds unsat: " << res << "\n";
+            ENSURE(res == l_false);
+        }
+    }
+
    std::cout << "tst_seq_rewriter: all tests passed\n";
 }