mirror of https://github.com/Z3Prover/z3 synced 2025-04-23 17:15:31 +00:00

Regex solver updates (#4636)

* std::cout debugging statements

* comment out std::cout debugging as this is now a shared fork

* convert std::cout to TRACE statements for seq_rewriter and seq_regex

* add cases to min_length and max_length for regexes

* bug fix

* update min_length and max_length functions for REs

* initial pass on simplifying derivative normal forms by eliminating redundant predicates locally

* add seq_regex_brief trace statements

* working on debugging ref count issue

* fix ref count bug and convert trace statements to seq_regex_brief

* add compact tracing for cache hits/misses

* seq_regex fix cache hit/miss tracing and wrapper around is_nullable

* minor

* label and disable more experimental changes for testing

* minor documentation / tracing

* a few more @EXP annotations

* dead state elimination skeleton code

* progress on dead state elimination

* more progress on dead state elimination

* refactor dead state class to separate self-contained state_graph class

* finish factoring state_graph to only work with unsigned values, and implement separate functionality for expr* logic

* implement get_all_derivatives, add debug tracing

* trace statements for debugging is_nullable loop bug

* fix is_nullable loop bug

* comment out local nullable change and mark experimental

* pretty printing for state_graph

* rewrite state graph to remove the fragile assumption that all edges from a state are added at a time

* start of general cycle detection check + fix some comments

* implement full cycle detection procedure

* normalize derivative conditions to form 'ele <= a'

* order derivative conditions by character code

* fix confusing names m_to and m_from

* assign increasing state IDs from 1 instead of using get_id on AST node

* remove elim_condition call in get_all_derivatives

* use u_map instead of uint_map to avoid memory leak

* remove unnecessary call to is_ground

* debugging

* small improvements to seq_regex_brief tracing

* fix bug on evil2 example

* save work

* new propagate code

* work in progress on using same seq sort for deriv calls

* avoid re-computing derivatives: use same head var for every derivative call

* use min_length on regexes to prune search

* simple implementation of can_be_in_cycle using rank function idea

* add a disabled experimental change

* minor cleanup comments, etc.

* seq_rewriter cleanup for PR

* typo noticed by Nikolaj

* move state graph to util/state_graph

* re-add accidentally removed line

* clean up seq_regex code removing obsolete functions and comments

* a few more cleanup items

* oops, missed merge change to fix compilation

* disabled change to lift unions to the top level and treat them separately in seq_regex solver

* added get_overapprox_regex to over-approximate regex membership constraints

* replace calls to is_epsilon with a centrally available method in seq_decl_plugin

* simplifications and modifications in get_overapprox_regex and related

* added approximation support for sequence expressions that use ite

* removed is_app check that was redundant

* tweak differences with upstream

* rewrite derivative leaves

* enable Antimirov-style derivatives via lifting unions in the solver

* TODO placeholders for outputting state graph

* change order in seq_regex propagate_in_re

* implement a more restricted form of Antimirov derivatives via a special op code to indicate lifting unions

* minor

* new Antimirov optimizations based on BDD compatibility checking

* seq regex tracing for # of derivatives

* fix get_cofactors (currently this fix is buggy)

* partially revert get_cofactors buggy change

* re-implement get_cofactors to more efficiently explore nodes in the derivative expression

* dgml generation for state graph

* fix release build

* improved dgml output

* bug fixes in dgml generation

* dot output support for state_graph and moved dgml and dot output under CASSERT

* updated tracing of what regex corresponds to what state id with /tr:state_graph

* clean up & document Antimirov derivative support

* remove op cache tracing

* remove re_rank experimental idea

* small fix

* fix Antimirov derivative (important change for good performance)

* remove unused and unnecessary code

* implemented simpler efficient get_cofactors alternative mk_deriv_accept

* simplifications in propagate_accept, and trace unusual cases

* document the various seq_regex tracing & debugging command-line options

* fix debug build (broken tracing)

* guard eager Antimirov lifting for possible disabling

* fix bug in propagate_accept Rule 1

* disable eager version of Antimirov lifting for performance reasons

* remove some remaining obsolete comments

Co-authored-by: calebstanford-msr <t-casta@microsoft.com>
Co-authored-by: Margus Veanes <margus@microsoft.com>
Caleb Stanford 2020-08-13 15:47:36 -04:00 committed by GitHub
parent 9df6c10ad8
commit 2c02264a94
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
7 changed files with 556 additions and 140 deletions
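
Several items in the commit message concern `min_length` and its use to prune the search (Rule 1 in the diff: `(accept s i r) => len(s) >= i + min_len(r)`). The following is a minimal sketch of that idea on a toy regex AST, not Z3's implementation. The `Re` type, `mk`, `sat_add`, `min_length` and `rule1_bound` are hypothetical names introduced for illustration, and the saturating addition is an assumption about how an overflow-safe `i + min_len(r)` could be computed.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <memory>
#include <string>
#include <utility>

// Toy regex AST for illustration only (hypothetical; Z3 uses expr* nodes).
struct Re {
    enum Kind { Empty, Epsilon, Lit, Concat, Union, Star } kind;
    std::string text;            // literal text, used only by Lit
    std::shared_ptr<Re> a, b;    // children (b is unused by Star)
};
using ReP = std::shared_ptr<Re>;

ReP mk(Re::Kind k, std::string t = "", ReP a = nullptr, ReP b = nullptr) {
    return std::make_shared<Re>(Re{k, std::move(t), std::move(a), std::move(b)});
}

// Saturating addition: clamps to UINT_MAX instead of wrapping around.
unsigned sat_add(unsigned x, unsigned y) {
    return x > UINT_MAX - y ? UINT_MAX : x + y;
}

// Lower bound on the length of any string matched by r.
// UINT_MAX stands for "no string matches" (the empty language).
unsigned min_length(const ReP& r) {
    assert(r);
    switch (r->kind) {
    case Re::Empty:   return UINT_MAX;
    case Re::Epsilon: return 0;
    case Re::Lit:     return static_cast<unsigned>(r->text.size());
    case Re::Concat:  return sat_add(min_length(r->a), min_length(r->b));
    case Re::Union:   return std::min(min_length(r->a), min_length(r->b));
    case Re::Star:    return 0;
    }
    return 0; // unreachable; keeps compilers quiet
}

// Rule 1: (accept s i r) implies len(s) >= i + min_length(r).
unsigned rule1_bound(unsigned i, const ReP& r) {
    return sat_add(min_length(r), i);
}
```

For `ab` followed by `c*` at index 3, `rule1_bound` gives 5, so any candidate string shorter than 5 can be rejected without unfolding derivatives. When the bound is 0 the rule says nothing, which is why the diff falls back to an explicit nullable check in that case.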


@ -109,24 +109,6 @@ namespace smt {
return;
}
// Convert a non-ground sequence into an additional regex and
// strengthen the original regex constraint into an intersection
// for example:
// (x ++ "a" ++ y) in b*
// is converted to
// (x ++ "a" ++ y) in intersect((.* ++ "a" ++ .*), b*)
if (!m.is_value(s)) {
expr_ref s_approx = get_overapprox_regex(s);
if (!re().is_full_seq(s_approx)) {
r = re().mk_inter(r, s_approx);
TRACE("seq_regex", tout
<< "get_overapprox_regex(" << mk_pp(s, m)
<< ") = " << mk_pp(s_approx, m) << std::endl;);
STRACE("seq_regex_brief", tout
<< "overapprox=" << state_str(r) << " ";);
}
}
if (coallesce_in_re(lit)) {
TRACE("seq_regex", tout
<< "simplified conjunctions to an intersection" << std::endl;);
@ -141,6 +123,26 @@ namespace smt {
return;
}
// Convert a non-ground sequence into an additional regex and
// strengthen the original regex constraint into an intersection
// for example:
// (x ++ "a" ++ y) in b*
// is converted to
// (x ++ "a" ++ y) in intersect((.* ++ "a" ++ .*), b*)
expr_ref _r_temp_owner(m);
if (!m.is_value(s)) {
expr_ref s_approx = get_overapprox_regex(s);
if (!re().is_full_seq(s_approx)) {
r = re().mk_inter(r, s_approx);
_r_temp_owner = r;
TRACE("seq_regex", tout
<< "get_overapprox_regex(" << mk_pp(s, m)
<< ") = " << mk_pp(s_approx, m) << std::endl;);
STRACE("seq_regex_brief", tout
<< "overapprox=" << state_str(r) << " ";);
}
}
expr_ref zero(a().mk_int(0), m);
expr_ref acc = sk().mk_accept(s, zero, r);
literal acc_lit = th.mk_literal(acc);
@ -213,6 +215,7 @@ namespace smt {
*
* Rule 1. (accept s i r) => len(s) >= i + min_len(r)
* Rule 2. (accept s i r) & len(s) <= i => nullable(r)
* (only necessary if min_len fails and returns 0 for non-nullable r)
* Rule 3. (accept s i r) and len(s) > i =>
* (accept s (i + 1) (derivative s[i] r))
*
@ -258,24 +261,36 @@ namespace smt {
STRACE("seq_regex_brief", tout << "(unfold) ";);
// Rule 1: use min_length to prune search
expr_ref s_to_re(re().mk_to_re(s), m);
expr_ref s_plus_r(re().mk_concat(s_to_re, r), m);
unsigned min_len = re().min_length(s_plus_r);
literal len_s_ge_min = th.m_ax.mk_ge(th.mk_len(s), min_len);
unsigned min_len = re().min_length(r);
unsigned min_len_plus_i = u().max_plus(min_len, idx);
literal len_s_ge_min = th.m_ax.mk_ge(th.mk_len(s), min_len_plus_i);
th.propagate_lit(nullptr, 1, &lit, len_s_ge_min);
// Axiom equivalent to the above: th.add_axiom(~lit, len_s_ge_min);
// Rule 2: nullable check
literal len_s_le_i = th.m_ax.mk_le(th.mk_len(s), idx);
expr_ref is_nullable = is_nullable_wrapper(r);
if (m.is_false(is_nullable)) {
th.propagate_lit(nullptr, 1, &lit, ~len_s_le_i);
}
else if (!m.is_true(is_nullable)) {
// is_nullable did not simplify
literal is_nullable_lit = th.mk_literal(is_nullable_wrapper(r));
ctx.mark_as_relevant(is_nullable_lit);
th.add_axiom(~lit, ~len_s_le_i, is_nullable_lit);
if (min_len == 0) {
expr_ref is_nullable = is_nullable_wrapper(r);
if (m.is_false(is_nullable)) {
STRACE("seq_regex", tout
<< "Warning: min_length returned 0 for non-nullable regex"
<< std::endl;);
STRACE("seq_regex_brief", tout
<< " (Warning: min_length returned 0 for"
<< " non-nullable regex)";);
th.propagate_lit(nullptr, 1, &lit, ~len_s_le_i);
}
else if (!m.is_true(is_nullable)) {
// is_nullable did not simplify
STRACE("seq_regex", tout
<< "Warning: is_nullable did not simplify to true or false"
<< std::endl;);
STRACE("seq_regex_brief", tout
<< " (Warning: is_nullable did not simplify)";);
literal is_nullable_lit = th.mk_literal(is_nullable);
ctx.mark_as_relevant(is_nullable_lit);
th.add_axiom(~lit, ~len_s_le_i, is_nullable_lit);
}
}
// Rule 3: derivative unfolding
@ -283,24 +298,11 @@ namespace smt {
expr_ref hd = th.mk_nth(s, i);
expr_ref deriv(m);
deriv = derivative_wrapper(hd, r);
expr_ref accept_deriv(m);
accept_deriv = mk_deriv_accept(s, idx + 1, deriv);
accept_next.push_back(~lit);
accept_next.push_back(len_s_le_i);
expr_ref_pair_vector cofactors(m);
get_cofactors(deriv, cofactors);
for (auto const& p : cofactors) {
if (m.is_false(p.first) || re().is_empty(p.second)) continue;
expr_ref cond(p.first, m);
expr_ref deriv_leaf(p.second, m);
expr_ref acc = sk().mk_accept(s, a().mk_int(idx + 1), deriv_leaf);
expr_ref choice(m.mk_and(cond, acc), m);
literal choice_lit = th.mk_literal(choice);
accept_next.push_back(choice_lit);
// TBD: try prioritizing unvisited states here over visited
// ones (in the state graph), to improve performance
STRACE("seq_regex_verbose", tout << "added choice: "
<< mk_pp(choice, m) << std::endl;);
}
accept_next.push_back(th.mk_literal(accept_deriv));
th.add_axiom(accept_next);
}
@ -442,7 +444,7 @@ namespace smt {
expr_ref r = symmetric_diff(r1, r2);
expr_ref emp(re().mk_empty(m.get_sort(r)), m);
expr_ref n(m.mk_fresh_const("re.char", seq_sort), m);
expr_ref is_empty = sk().mk_is_empty(r, emp, n);
expr_ref is_empty = sk().mk_is_empty(r, r, n);
th.add_axiom(~th.mk_eq(r1, r2, false), th.mk_literal(is_empty));
}
@ -455,7 +457,7 @@ namespace smt {
expr_ref r = symmetric_diff(r1, r2);
expr_ref emp(re().mk_empty(m.get_sort(r)), m);
expr_ref n(m.mk_fresh_const("re.char", seq_sort), m);
expr_ref is_non_empty = sk().mk_is_non_empty(r, emp, n);
expr_ref is_non_empty = sk().mk_is_non_empty(r, r, n);
th.add_axiom(th.mk_eq(r1, r2, false), th.mk_literal(is_non_empty));
}
@ -517,22 +519,98 @@ namespace smt {
th.add_axiom(lits);
}
void seq_regex::get_cofactors(expr* r, expr_ref_vector& conds, expr_ref_pair_vector& result) {
expr* cond = nullptr, *th = nullptr, *el = nullptr;
if (m.is_ite(r, cond, th, el)) {
conds.push_back(cond);
get_cofactors(th, conds, result);
conds.pop_back();
conds.push_back(mk_not(m, cond));
get_cofactors(el, conds, result);
conds.pop_back();
}
else {
expr_ref conj = mk_and(conds);
result.push_back(conj, r);
/*
Given a string s, index i, and a derivative regex r, return an
expression that is equivalent to
accept s i r
but which pushes accept s i r into the leaves (next derivatives to
explore).
Input r is of type regex; output is of type bool.
Example:
mk_deriv_accept(s, i, (ite a r1 r2) u (ite b r3 r4))
= (or (ite a (accept s i r1) (accept s i r2))
(ite b (accept s i r3) (accept s i r4)))
*/
expr_ref seq_regex::mk_deriv_accept(expr* s, unsigned i, expr* r) {
vector<expr*> to_visit;
to_visit.push_back(r);
obj_map<expr, expr*> re_to_bool;
expr_ref_vector _temp_bool_owner(m); // temp owner for bools we create
// DFS
while (to_visit.size() > 0) {
expr* e = to_visit.back();
expr* econd = nullptr, *e1 = nullptr, *e2 = nullptr;
if (!re_to_bool.contains(e)) {
// First visit: add children
STRACE("seq_regex_verbose", tout << "1";);
if (m.is_ite(e, econd, e1, e2) ||
re().is_union(e, e1, e2)) {
to_visit.push_back(e1);
to_visit.push_back(e2);
}
// Mark first visit by adding nullptr to the map
re_to_bool.insert(e, nullptr);
}
else if (re_to_bool.find(e) == nullptr) {
// Second visit: set value
STRACE("seq_regex_verbose", tout << "2";);
to_visit.pop_back();
if (m.is_ite(e, econd, e1, e2)) {
expr* b1 = re_to_bool.find(e1);
expr* b2 = re_to_bool.find(e2);
expr* b = m.mk_ite(econd, b1, b2);
_temp_bool_owner.push_back(b);
re_to_bool.find(e) = b;
}
else if (re().is_union(e, e1, e2)) {
expr* b1 = re_to_bool.find(e1);
expr* b2 = re_to_bool.find(e2);
expr* b = m.mk_or(b1, b2);
_temp_bool_owner.push_back(b);
re_to_bool.find(e) = b;
}
else {
expr* iplus1 = a().mk_int(i);
_temp_bool_owner.push_back(iplus1);
expr_ref acc_leaf = sk().mk_accept(s, iplus1, e);
_temp_bool_owner.push_back(acc_leaf);
re_to_bool.find(e) = acc_leaf;
STRACE("seq_regex_verbose", tout
<< "mk_deriv_accept: added accept leaf: "
<< mk_pp(acc_leaf, m) << std::endl;);
}
}
else {
STRACE("seq_regex_verbose", tout << "3";);
// Remaining visits: skip
to_visit.pop_back();
}
}
// Finalize
expr_ref result(m);
result = re_to_bool.find(r); // Assigns ownership of all exprs in
// re_to_bool for after this completes
rewrite(result);
return result;
}
/*
Return a list of all leaves in the derivative of a regex r,
ignoring the conditions along each path.
Warning: Although the derivative
normal form tries to eliminate unsat condition paths, one cannot
assume that the path to each leaf is satisfiable in general
(e.g. when regexes are created using re.pred).
So not all results may correspond to satisfiable predicates.
It is OK to rely on the results being satisfiable for completeness,
but not soundness.
*/
void seq_regex::get_all_derivatives(expr* r, expr_ref_vector& results) {
// Get derivative
sort* seq_sort = nullptr;
@ -541,14 +619,74 @@ namespace smt {
expr_ref hd = mk_first(r, n);
expr_ref d(m);
d = derivative_wrapper(hd, r);
// Use get_cofactors method and try to filter out unsatisfiable conds
expr_ref_pair_vector cofactors(m);
get_cofactors(d, cofactors);
STRACE("seq_regex_verbose", tout << "getting all derivatives of: " << mk_pp(r, m) << std::endl;);
for (auto const& p : cofactors) {
if (m.is_false(p.first) || re().is_empty(p.second)) continue;
STRACE("seq_regex_verbose", tout << "adding derivative: " << mk_pp(p.second, m) << std::endl;);
results.push_back(p.second);
// DFS
vector<expr*> to_visit;
to_visit.push_back(d);
obj_map<expr, bool> visited; // set<expr> (bool is used as a unit type)
while (to_visit.size() > 0) {
expr* e = to_visit.back();
to_visit.pop_back();
if (visited.contains(e)) continue;
visited.insert(e, true);
expr* econd = nullptr, *e1 = nullptr, *e2 = nullptr;
if (m.is_ite(e, econd, e1, e2) ||
re().is_union(e, e1, e2)) {
to_visit.push_back(e1);
to_visit.push_back(e2);
}
else if (!re().is_empty(e)) {
results.push_back(e);
STRACE("seq_regex_verbose", tout
<< "get_all_derivatives: added deriv: "
<< mk_pp(e, m) << std::endl;);
}
}
STRACE("seq_regex", tout << "Number of derivatives: "
<< results.size() << std::endl;);
STRACE("seq_regex_brief", tout << "#derivs=" << results.size() << " ";);
}
/*
Return a list of all (cond, leaf) pairs in a given derivative
expression r.
Note: this recursive implementation is inefficient, since if nodes
are repeated often in the expression DAG, they may be visited
many times. For this reason, prefer mk_deriv_accept and
get_all_derivatives when possible.
This method is still used by:
propagate_is_empty
propagate_is_non_empty
*/
void seq_regex::get_cofactors(expr* r, expr_ref_pair_vector& result) {
expr_ref_vector conds(m);
get_cofactors_rec(r, conds, result);
STRACE("seq_regex", tout << "Number of derivatives: "
<< result.size() << std::endl;);
STRACE("seq_regex_brief", tout << "#derivs=" << result.size() << " ";);
}
void seq_regex::get_cofactors_rec(expr* r, expr_ref_vector& conds,
expr_ref_pair_vector& result) {
expr* cond = nullptr, *r1 = nullptr, *r2 = nullptr;
if (m.is_ite(r, cond, r1, r2)) {
conds.push_back(cond);
get_cofactors_rec(r1, conds, result);
conds.pop_back();
conds.push_back(mk_not(m, cond));
get_cofactors_rec(r2, conds, result);
conds.pop_back();
}
else if (re().is_union(r, r1, r2)) {
get_cofactors_rec(r1, conds, result);
get_cofactors_rec(r2, conds, result);
}
else {
expr_ref conj = mk_and(conds);
if (!m.is_false(conj) && !re().is_empty(r))
result.push_back(conj, r);
}
}
@ -618,6 +756,9 @@ namespace smt {
m_expr_to_state.insert(e, new_id);
STRACE("seq_regex_brief", tout << "new(" << expr_id_str(e)
<< ")=" << state_str(e) << " ";);
STRACE("seq_regex", tout
<< "New state ID: " << new_id
<< " = " << mk_pp(e, m) << std::endl;);
}
return m_expr_to_state.find(e);
}
@ -627,6 +768,7 @@ namespace smt {
return m_state_to_expr.get(id);
}
bool seq_regex::can_be_in_cycle(expr *r1, expr *r2) {
// TBD: This can be used to optimize the state graph:
// return false here if it is known that r1 -> r2 can never be
@ -649,10 +791,11 @@ namespace smt {
STRACE("seq_regex_brief", tout << "(MAX SIZE REACHED) ";);
return false;
}
STRACE("seq_regex", tout << "Updating state graph for regex "
<< mk_pp(r, m) << ") ";);
// Add state
m_state_graph.add_state(r_id);
STRACE("state_graph", tout << "regex(" << r_id << ") = " << mk_pp(r, m) << std::endl;);
STRACE("seq_regex", tout << "Updating state graph for regex "
<< mk_pp(r, m) << ") " << std::endl;);
STRACE("seq_regex_brief", tout << std::endl << "USG("
<< state_str(r) << ") ";);
expr_ref r_nullable = is_nullable_wrapper(r);
@ -663,18 +806,20 @@ namespace smt {
// Add edges to all derivatives
expr_ref_vector derivatives(m);
STRACE("seq_regex_verbose", tout
<< std::endl << " getting all derivs: " << r_id << " ";);
<< "getting all derivs: " << r_id << " " << std::endl;);
get_all_derivatives(r, derivatives);
for (auto const& dr: derivatives) {
unsigned dr_id = get_state_id(dr);
STRACE("seq_regex_verbose", tout
<< std::endl << " traversing deriv: " << dr_id << " ";);
<< " traversing deriv: " << dr_id << " " << std::endl;);
m_state_graph.add_state(dr_id);
STRACE("state_graph", tout << "regex(" << dr_id << ") = " << mk_pp(dr, m) << std::endl;);
bool maybecycle = can_be_in_cycle(r, dr);
m_state_graph.add_edge(r_id, dr_id, maybecycle);
}
m_state_graph.mark_done(r_id);
}
STRACE("seq_regex", m_state_graph.display(tout););
STRACE("seq_regex_brief", tout << std::endl;);
STRACE("seq_regex_brief", m_state_graph.display(tout););
return true;
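
The two-visit DFS used by `mk_deriv_accept` in the diff above can be sketched in isolation. This is a simplified illustration under stated assumptions, not the Z3 code: `Node`, `leaf`, `ite`, `uni` and `push_accept` are hypothetical names, conditions and regexes are plain strings, and the output is a printed formula rather than an `expr_ref`. What it does keep is the point of the iterative form: each shared node of the expression DAG is translated once and then looked up.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Toy derivative expression (hypothetical; Z3 uses expr* nodes).
struct Node {
    enum Kind { Ite, Union, Leaf } kind;
    std::string label;               // condition for Ite, regex name for Leaf
    std::shared_ptr<Node> e1, e2;    // children of Ite / Union
};
using NodeP = std::shared_ptr<Node>;

NodeP leaf(std::string n) {
    return std::make_shared<Node>(Node{Node::Leaf, std::move(n), nullptr, nullptr});
}
NodeP ite(std::string c, NodeP t, NodeP e) {
    return std::make_shared<Node>(Node{Node::Ite, std::move(c), std::move(t), std::move(e)});
}
NodeP uni(NodeP l, NodeP r) {
    return std::make_shared<Node>(Node{Node::Union, "", std::move(l), std::move(r)});
}

// Push "accept s i ." into the leaves of a derivative expression.
// Assumes the expression is acyclic (a DAG).
std::string push_accept(const NodeP& root, const std::string& s, unsigned i) {
    assert(root);
    std::unordered_map<const Node*, std::string> done;  // finished translations
    std::unordered_set<const Node*> seen;               // nodes visited once
    std::vector<const Node*> stack{root.get()};
    while (!stack.empty()) {
        const Node* e = stack.back();
        if (done.count(e)) {                 // later visits: nothing to do
            stack.pop_back();
        } else if (e->kind == Node::Leaf) {  // leaf: wrap in accept
            done[e] = "(accept " + s + " " + std::to_string(i) + " " + e->label + ")";
            stack.pop_back();
        } else if (seen.insert(e).second) {  // first visit: schedule children
            stack.push_back(e->e1.get());
            stack.push_back(e->e2.get());
        } else {                             // second visit: children are done
            std::string b1 = done.at(e->e1.get());
            std::string b2 = done.at(e->e2.get());
            std::string b = e->kind == Node::Ite
                ? "(ite " + e->label + " " + b1 + " " + b2 + ")"
                : "(or " + b1 + " " + b2 + ")";
            done[e] = std::move(b);
            stack.pop_back();
        }
    }
    return done.at(root.get());
}
```

Calling `push_accept` on `(ite a r1 r2) u (ite b r3 r4)` reproduces the shape given in the doc comment of `mk_deriv_accept`: an `or` of two `ite` terms whose branches are `accept` leaves.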


@ -23,6 +23,71 @@ Author:
#include "smt/smt_context.h"
#include "smt/seq_skolem.h"
/*
*** Tracing and debugging in this module and related modules ***
Tracing and debugging for the regex solver are split across several
command-line flags.
TRACING
-tr:seq_regex and -tr:seq_regex_brief
These are the main tags to trace what the regex solver is doing.
They mostly trace the same things, except that seq_regex_brief
avoids printing out expressions and tries to abbreviate the output
as much as possible. seq_regex_brief shows the following output:
Top-level propagations:
PIR: Propagating an in_re constraint
PE/PNE: Propagating an empty/non-empty constraint
PEQ/PNEQ: Propagating an equal/not-equal constraint
PA: Propagating an accept constraint
In tracing, arguments are generally put in parentheses.
To achieve abbreviated output, expressions are traced in one of two
ways:
id243 (expr ID): the regex or expression with id 243
3 (state ID): the regex with state ID 3
When a regex is newly assigned to a state ID, we print this:
new(id606)=4
Of these, PA is the most important, and traces as follows:
PA(x@i,r): propagate accept for string x at index i, regex r.
(empty), (dead), (blocked), (unfold): info about whether this
PA was cut off early, or unfolded into the derivatives
(next states)
d(r1)=r2: r2 is the derivative of r1
n(r1)=b: b = whether r1 is nullable or not
USG(r): updating state graph for regex r (add all derivatives)
-tr:state_graph
This is the tracing done by util/state_graph, the data structure
that seq_regex uses to track live and dead regexes, which can
altneratively be used to get a high-level picture of what states
are being explored and updated as the solver progresses.
-tr:seq_regex_verbose
Used for some more frequent tracing (in the style of seq_regex,
not in the style of seq_regex_brief)
-tr:seq and -tr:seq_verbose
These are the underlying sequence theory tracing, often used by
the rewriter.
DEBUGGING AND VIEWING STATE GRAPH GRAPHICAL OUTPUT
-dbg:seq_regex
Debugging that checks invariants. Currently, checks that derivative
normal form is correctly preserved in the rewriter.
-dbg:state_graph
Debugging for the state graph, which
1. Checks state graph invariants, and
2. Generates the files .z3-state-graph.dgml and .z3-state-graph.dot
which can be used to visually view the state graph being explored,
during or after executing Z3.
The output can be viewed:
- Using Visual Studio for .dgml
- Using a tool such as xdot (`xdot .z3-state-graph.dot`) for .dot
*/
namespace smt {
class theory_seq;
@ -93,12 +158,13 @@ namespace smt {
expr_ref is_nullable_wrapper(expr* r);
expr_ref derivative_wrapper(expr* hd, expr* r);
void get_cofactors(expr* r, expr_ref_vector& conds, expr_ref_pair_vector& result);
void get_cofactors(expr* r, expr_ref_pair_vector& result) {
expr_ref_vector conds(m);
get_cofactors(r, conds, result);
}
// Various support for unfolding derivative expressions that are
// returned by derivative_wrapper
expr_ref mk_deriv_accept(expr* s, unsigned i, expr* r);
void get_all_derivatives(expr* r, expr_ref_vector& results);
void get_cofactors(expr* r, expr_ref_pair_vector& result);
void get_cofactors_rec(expr* r, expr_ref_vector& conds,
expr_ref_pair_vector& result);
public:
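
For comparison with the declarations above, here is a rough sketch of what a recursive `(cond, leaf)` enumeration in the style of `get_cofactors_rec` might look like on a toy expression type. All names (`Node`, `mk`, `cofactors`, `cofactors_rec`) are hypothetical, and conditions are plain strings joined with `&` rather than Z3 expressions. Like the routine it imitates, it can revisit shared subexpressions many times.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy derivative expression (hypothetical; Z3 uses expr* nodes).
struct Node {
    enum Kind { Ite, Union, Leaf, EmptyLeaf } kind;
    std::string label;               // condition for Ite, regex name for Leaf
    std::shared_ptr<Node> e1, e2;    // children of Ite / Union
};
using NodeP = std::shared_ptr<Node>;
using Cofactor = std::pair<std::string, std::string>;  // (path condition, leaf)

NodeP mk(Node::Kind k, std::string l = "", NodeP a = nullptr, NodeP b = nullptr) {
    return std::make_shared<Node>(Node{k, std::move(l), std::move(a), std::move(b)});
}

void cofactors_rec(const NodeP& r, std::vector<std::string>& conds,
                   std::vector<Cofactor>& out) {
    assert(r);
    switch (r->kind) {
    case Node::Ite:
        conds.push_back(r->label);                 // then-branch: cond holds
        cofactors_rec(r->e1, conds, out);
        conds.back() = "(not " + r->label + ")";   // else-branch: cond fails
        cofactors_rec(r->e2, conds, out);
        conds.pop_back();
        break;
    case Node::Union:                              // both sides share the path
        cofactors_rec(r->e1, conds, out);
        cofactors_rec(r->e2, conds, out);
        break;
    case Node::EmptyLeaf:                          // empty regex: drop the path
        break;
    case Node::Leaf: {
        std::string conj;
        for (const auto& c : conds) {
            if (!conj.empty()) conj += " & ";
            conj += c;
        }
        if (conj.empty()) conj = "true";
        out.emplace_back(conj, r->label);
        break;
    }
    }
}

std::vector<Cofactor> cofactors(const NodeP& r) {
    std::vector<std::string> conds;
    std::vector<Cofactor> out;
    cofactors_rec(r, conds, out);
    return out;
}
```

The sketch only filters empty leaves; it does not attempt to detect unsatisfiable path conditions.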