mirror of
https://github.com/Z3Prover/z3
synced 2026-05-16 23:25:36 +00:00
Terminate on Demand and some algorithmic bugfixes in the search tree (#9336)
* first attempt with codex. Codex notes:
What changed:
- Each tree node now tracks:
- active worker count
- lease epoch
- cancel epoch
- get_cube() now hands each worker an explicit lease: (node, epoch, cancel_epoch).
- try_split() and backtrack() now operate on that lease, and the batch manager releases the worker’s lease under the tree lock before mutating the
node.
- If another worker closes the leased node or subtree, the batch manager cancels only the workers whose current leased nodes are now closed.
- Workers detect canceled leases after check(), reset their local cancel flag, abandon the stale lease, and continue instead of turning that into a
global exception.
- The “reopen immediately into the open queue” policy is preserved. I did not add a barrier waiting for all workers on a node to finish.
- Active-worker accounting is now separate from the open/active/closed scheduling status, so reopening a node no longer erases the fact that other
workers are still on it.
I also updated search_tree bookkeeping so:
- closure bumps node cancel/lease epochs
- active-node counting uses actual active-worker presence, not just status == active
* fix(parallel-smt): gate split/backtrack by lease epoch
What it changes:
- util/search_tree.h
- bumps node epoch on split
- threads epoch through should_split(...) and try_split(...)
- always records effort, but only split/reopen if the lease epoch still matches
- smt/smt_parallel.cpp
- requires is_lease_valid(..., lease.epoch) before backtrack(...)
- passes lease.epoch into m_search_tree.try_split(...)
* clean up code and add some comments
* fix bug about backtracking condition being too strict: The epoch guard should not block backtrack(...) the same way it blocks try_split(...). A stale worker that proves UNSAT for n should still be able to
close n, and that closure should then cancel the other workers on n and its subtree.
I changed smt/smt_parallel.cpp accordingly:
- try_split(...) still uses epoch to reject stale structural splits
- backtrack(...) no longer requires is_lease_valid(..., epoch); it only requires that the lease is not already canceled
So the intended asymmetry is now restored:
- stale split: reject
- stale unsat/backtrack: allow closure, then cancel affected workers
* ablate to no backtracking on stale leases
* revert codex change about exception handling
* remove old code
* ablate backtracking gate
* attempt to fix linux crashes
* try to fix bug about active worker counts/lease accounting. current policy should hold: - stale leases: release/decrement
- canceled leases: do not release/decrement (just ignore since we have an invariant that canceled leases mean closed nodes that are never revisited
* delay premature root activation
* fix major semantic bug about threads continually choosing the root if their lease is reset
* fix cancellation to unknown status
* fix very bad bug about all threads needing to start at the root
* ablate active ranking: now nodes are only reopened if they are truly inactive (active worker count is 0)
* fix some bugs about leases
* ablate adding static effort only
* fix some bugs about leases
* don't explode effort for portfolio nodes
* fix: still accumulate per-node effort, but don't over-accumulate on portfolio solves
* restore dynamically scaled effort
* lease cancellation doens't touch rlimit now, it just sets max conflicts to 0. also fix a VERY BAD BUG about effort never being updated until all leases are done on a node, which meant we never left the root
* cross-thread modification of max conflicts is unsafe, so create an atomic lease canceled variable that's ch
ecked in ctx where max conflicts is also checked
* move atomic lease check in the context to the more global get_cancel_flag function
* Fix new SIGSEV. issue: The root cause: get_cancel_flag() is called from within propagation loops (mid-BCP, mid-equality-propagation, mid-atom-propagation). When it returns true there, the solver exits early and leaves the context in an intermediate state —
propagation queues partially processed, theory state potentially inconsistent with boolean state.
For the global cancel (m.limit().cancel()), this is harmless: the worker exits entirely and the context is destroyed. Intermediate state doesn't matter.
For a lease cancel, the context is reused — the worker gets a new cube and calls ctx->check() again on the same context object. Re-entering check() on a context interrupted mid-propagation causes it to access that corrupted intermediate
state → SIGSEGV.
The m_max_conflicts check is the only checkpoint that's safe for re-entry: it only fires post-conflict-resolution, pre-decision, when propagation queues are empty and theory state is consistent.
Fix: Remove m_lease_canceled from get_cancel_flag(). Keep it only at safe, between-phase checkpoints where the context is in a known-consistent state. The result is two safe checkpoints for m_lease_canceled: after each conflict (post-resolution, queues empty) and before each theory final check (not yet entered the theory). Neither interrupts the solver mid-mutation. The SIGSEGV should be
gone, and NIA performance should improve because long theory final checks (where NIA burns most time) are now preemptable before they start.
* fix new inconsistent theory bug: The problem is returning FC_GIVEUP from inside final_check() after some theories have already run final_check_eh() and pushed propagations into the queue. Those pending propagations reference context state that gets invalidated on the next check() call → SIGSEGV. The fix: check m_lease_canceled before entering final_check() in bounded_search(), never from inside it. That way the context is always in a clean pre-final-check state when we bail out. This is safe: decide() returned false (all variables assigned, no pending propagations), theories haven't been touched yet, context is in a fully consistent state. For NIA, this is still a meaningful win — we avoid entering expensive arithmetic final checks entirely when the lease is already canceled.
* remove second lease cancel check in smt_context, not sure it's safe. only check where we do the max conflicts check
* check epoch match in release_lease_unlocked
* restore exception handling logic to master branch
* restore reslimit cancels since the bug appears to be latent
---------
Co-authored-by: Ilana Shapiro <ilanashapiro@Ilanas-MacBook-Pro.local>
Co-authored-by: Ilana Shapiro <ilanashapiro@Mac.lan1>
This commit is contained in:
parent
0669cdd829
commit
44966e1733
4 changed files with 242 additions and 120 deletions
|
|
@ -48,6 +48,10 @@ namespace search_tree {
|
|||
vector<literal> m_core;
|
||||
unsigned m_num_activations = 0;
|
||||
unsigned m_effort_spent = 0;
|
||||
unsigned m_round_max_effort = 0;
|
||||
unsigned m_active_workers = 0;
|
||||
unsigned m_epoch = 0;
|
||||
unsigned m_cancel_epoch = 0;
|
||||
|
||||
public:
|
||||
node(literal const &l, node *parent) : m_literal(l), m_parent(parent), m_status(status::open) {}
|
||||
|
|
@ -65,9 +69,6 @@ namespace search_tree {
|
|||
literal const &get_literal() const {
|
||||
return m_literal;
|
||||
}
|
||||
bool literal_is_null() const {
|
||||
return Config::is_null(m_literal);
|
||||
}
|
||||
void split(literal const &a, literal const &b) {
|
||||
SASSERT(!Config::literal_is_null(a));
|
||||
SASSERT(!Config::literal_is_null(b));
|
||||
|
|
@ -77,6 +78,7 @@ namespace search_tree {
|
|||
SASSERT(!m_right);
|
||||
m_left = alloc(node<Config>, a, this);
|
||||
m_right = alloc(node<Config>, b, this);
|
||||
inc_epoch();
|
||||
}
|
||||
|
||||
node* left() const { return m_left; }
|
||||
|
|
@ -94,22 +96,6 @@ namespace search_tree {
|
|||
return d;
|
||||
}
|
||||
|
||||
node *find_active_node() {
|
||||
if (m_status == status::active)
|
||||
return this;
|
||||
if (m_status == status::closed)
|
||||
return nullptr;
|
||||
node *nodes[2] = {m_left, m_right};
|
||||
for (unsigned i = 0; i < 2; ++i) {
|
||||
auto res = nodes[i] ? nodes[i]->find_active_node() : nullptr;
|
||||
if (res)
|
||||
return res;
|
||||
}
|
||||
if (m_left->get_status() == status::closed && m_right->get_status() == status::closed)
|
||||
m_status = status::closed;
|
||||
return nullptr;
|
||||
}
|
||||
|
||||
void display(std::ostream &out, unsigned indent) const {
|
||||
for (unsigned i = 0; i < indent; ++i)
|
||||
out << " ";
|
||||
|
|
@ -137,12 +123,40 @@ namespace search_tree {
|
|||
void mark_new_activation() {
|
||||
set_status(status::active);
|
||||
++m_num_activations;
|
||||
++m_active_workers;
|
||||
}
|
||||
void dec_active_workers() {
|
||||
if (m_active_workers > 0)
|
||||
--m_active_workers;
|
||||
if (m_active_workers == 0 && m_status == status::active) {
|
||||
m_round_max_effort = 0;
|
||||
m_status = status::open;
|
||||
}
|
||||
}
|
||||
bool has_active_workers() const {
|
||||
return m_active_workers > 0;
|
||||
}
|
||||
unsigned effort_spent() const {
|
||||
return m_effort_spent;
|
||||
}
|
||||
void add_effort(unsigned effort) {
|
||||
m_effort_spent += effort;
|
||||
void update_round_max_effort(unsigned effort) {
|
||||
if (effort <= m_round_max_effort)
|
||||
return;
|
||||
m_effort_spent -= m_round_max_effort;
|
||||
m_round_max_effort = effort;
|
||||
m_effort_spent += m_round_max_effort;
|
||||
}
|
||||
unsigned epoch() const {
|
||||
return m_epoch;
|
||||
}
|
||||
void inc_epoch() {
|
||||
++m_epoch;
|
||||
}
|
||||
unsigned get_cancel_epoch() const {
|
||||
return m_cancel_epoch;
|
||||
}
|
||||
void inc_cancel_epoch() {
|
||||
++m_cancel_epoch;
|
||||
}
|
||||
};
|
||||
|
||||
|
|
@ -161,13 +175,13 @@ namespace search_tree {
|
|||
|
||||
struct candidate {
|
||||
node<Config>* n = nullptr;
|
||||
unsigned effort_band = UINT_MAX;
|
||||
unsigned scaled_effort = UINT_MAX;
|
||||
unsigned depth = 0;
|
||||
};
|
||||
|
||||
// A measure of how much effort has been spent on the node, used for activation prioritization and expansion decisions
|
||||
// The effort unit is the workers' initial conflict budget, and effort spent grows by a factor defined in smt_parallel.h on each split attempt
|
||||
unsigned effort_band(node<Config> const* n) const {
|
||||
unsigned scaled_effort(node<Config> const* n) const {
|
||||
return n->effort_spent() / std::max<unsigned>(1, m_effort_unit);
|
||||
}
|
||||
|
||||
|
|
@ -177,8 +191,8 @@ namespace search_tree {
|
|||
return false;
|
||||
if (!b.n)
|
||||
return true;
|
||||
if (a.effort_band != b.effort_band)
|
||||
return a.effort_band < b.effort_band;
|
||||
if (a.scaled_effort != b.scaled_effort)
|
||||
return a.scaled_effort < b.scaled_effort;
|
||||
if (a.depth != b.depth)
|
||||
return a.depth > b.depth;
|
||||
return false;
|
||||
|
|
@ -191,7 +205,7 @@ namespace search_tree {
|
|||
if (cur->get_status() == target_status) {
|
||||
candidate cand;
|
||||
cand.n = cur;
|
||||
cand.effort_band = effort_band(cur);
|
||||
cand.scaled_effort = scaled_effort(cur);
|
||||
cand.depth = cur->depth();
|
||||
|
||||
if (better(cand, best))
|
||||
|
|
@ -202,20 +216,6 @@ namespace search_tree {
|
|||
select_next_node(cur->right(), target_status, best);
|
||||
}
|
||||
|
||||
node<Config>* activate_best_node() {
|
||||
candidate best;
|
||||
select_next_node(m_root.get(), status::open, best);
|
||||
if (!best.n) {
|
||||
IF_VERBOSE(1, verbose_stream() << "NO OPEN NODES, trying active nodes for portfolio solving\n";);
|
||||
select_next_node(m_root.get(), status::active, best); // If no open nodes, only then consider active nodes for selection
|
||||
}
|
||||
|
||||
if (!best.n)
|
||||
return nullptr;
|
||||
best.n->mark_new_activation();
|
||||
return best.n;
|
||||
}
|
||||
|
||||
bool has_unvisited_open_node(node<Config>* cur) const {
|
||||
if (!cur || cur->get_status() == status::closed)
|
||||
return false;
|
||||
|
|
@ -251,8 +251,8 @@ namespace search_tree {
|
|||
find_shallowest_timed_out_leaf_depth(cur->right(), best_depth);
|
||||
}
|
||||
|
||||
bool should_split(node<Config>* n) {
|
||||
if (!n || n->get_status() != status::active || !n->is_leaf())
|
||||
bool should_split(node<Config>* n, unsigned epoch) {
|
||||
if (!is_lease_valid(n, epoch) || !n->is_leaf())
|
||||
return false;
|
||||
|
||||
unsigned num_active_nodes = count_active_nodes(m_root.get());
|
||||
|
|
@ -344,6 +344,8 @@ namespace search_tree {
|
|||
void close(node<Config> *n, vector<literal> const &C) {
|
||||
if (!n || n->get_status() == status::closed)
|
||||
return;
|
||||
n->inc_epoch();
|
||||
n->inc_cancel_epoch();
|
||||
n->set_status(status::closed);
|
||||
n->set_core(C);
|
||||
close(n->left(), C);
|
||||
|
|
@ -441,40 +443,32 @@ namespace search_tree {
|
|||
|
||||
void reset() {
|
||||
m_root = alloc(node<Config>, m_null_literal, nullptr);
|
||||
m_root->mark_new_activation();
|
||||
}
|
||||
|
||||
// On timeout, either expand the current leaf or reopen the node for a
|
||||
// later revisit, depending on the tree-expansion heuristic.
|
||||
bool try_split(node<Config> *n, literal const &a, literal const &b, unsigned effort) {
|
||||
if (!n || n->get_status() != status::active)
|
||||
bool try_split(node<Config> *n, unsigned epoch, unsigned cancel_epoch, literal const &a, literal const &b, unsigned effort) {
|
||||
if (is_lease_canceled(n, cancel_epoch))
|
||||
return false;
|
||||
|
||||
n->add_effort(effort);
|
||||
// Record at most one effort contribution per concurrent round on this node.
|
||||
// Stale workers still contribute, but only via the round-local maximum.
|
||||
n->update_round_max_effort(effort);
|
||||
bool did_split = false;
|
||||
|
||||
if (should_split(n)) {
|
||||
if (should_split(n, epoch)) {
|
||||
n->split(a, b);
|
||||
did_split = true;
|
||||
}
|
||||
|
||||
// Reopen immediately on timeout, even if other workers are still active.
|
||||
// This keeps scheduling asynchronous: timeouts act as signals to reconsider
|
||||
// the search, not barriers requiring all workers to finish.
|
||||
//
|
||||
// Early reopening also creates a soft penalty (via accumulated effort),
|
||||
// reducing over-concentration while still allowing revisits.
|
||||
//
|
||||
// Waiting for all workers would introduce per-node synchronization, delay
|
||||
// diversification, and let a slow worker stall progress.
|
||||
n->set_status(status::open);
|
||||
|
||||
return did_split;
|
||||
}
|
||||
|
||||
// conflict is given by a set of literals.
|
||||
// they are subsets of the literals on the path from root to n AND the external assumption literals
|
||||
void backtrack(node<Config> *n, vector<literal> const &conflict) {
|
||||
if (!n)
|
||||
return;
|
||||
if (conflict.empty()) {
|
||||
close_with_core(m_root.get(), conflict);
|
||||
return;
|
||||
|
|
@ -507,17 +501,41 @@ namespace search_tree {
|
|||
UNREACHABLE();
|
||||
}
|
||||
|
||||
// return an active node in the tree, or nullptr if there is none
|
||||
// first check if there is a node to activate under n,
|
||||
// if not, go up the tree and try to activate a sibling subtree
|
||||
node<Config> *activate_node(node<Config> *n) {
|
||||
if (!n) {
|
||||
if (m_root->get_status() == status::active) {
|
||||
m_root->mark_new_activation();
|
||||
return m_root.get();
|
||||
}
|
||||
// Try to select an open node using the select_next_node policy
|
||||
// If there are no open nodes, try to select an active node for portfolio solving
|
||||
node<Config>* activate_best_node() {
|
||||
candidate best;
|
||||
select_next_node(m_root.get(), status::open, best);
|
||||
if (!best.n) {
|
||||
IF_VERBOSE(1, verbose_stream() << "NO OPEN NODES, trying active nodes for portfolio solving\n";);
|
||||
select_next_node(m_root.get(), status::active, best); // If no open nodes, only then consider active nodes for selection
|
||||
}
|
||||
return activate_best_node();
|
||||
|
||||
if (!best.n)
|
||||
return nullptr;
|
||||
best.n->mark_new_activation();
|
||||
return best.n;
|
||||
}
|
||||
|
||||
node<Config>* activate_root() {
|
||||
if (m_root->get_status() == status::closed)
|
||||
return nullptr;
|
||||
m_root->mark_new_activation();
|
||||
return m_root.get();
|
||||
}
|
||||
|
||||
void dec_active_workers(node<Config>* n) {
|
||||
if (!n)
|
||||
return;
|
||||
n->dec_active_workers();
|
||||
}
|
||||
|
||||
bool is_lease_valid(node<Config>* n, unsigned epoch) const {
|
||||
return n && n->get_status() == status::active && n->epoch() == epoch;
|
||||
}
|
||||
|
||||
bool is_lease_canceled(node<Config>* n, unsigned cancel_epoch) const {
|
||||
return !n || n->get_status() == status::closed || n->get_cancel_epoch() != cancel_epoch;
|
||||
}
|
||||
|
||||
vector<literal> const &get_core_from_root() const {
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue