mirror of
https://github.com/Z3Prover/z3
synced 2026-05-15 06:35:32 +00:00
Terminate on Demand and some algorithmic bugfixes in the search tree (#9336)
* first attempt with codex. Codex notes:
What changed:
- Each tree node now tracks:
- active worker count
- lease epoch
- cancel epoch
- get_cube() now hands each worker an explicit lease: (node, epoch, cancel_epoch).
- try_split() and backtrack() now operate on that lease, and the batch manager releases the worker’s lease under the tree lock before mutating the
node.
- If another worker closes the leased node or subtree, the batch manager cancels only the workers whose current leased nodes are now closed.
- Workers detect canceled leases after check(), reset their local cancel flag, abandon the stale lease, and continue instead of turning that into a
global exception.
- The “reopen immediately into the open queue” policy is preserved. I did not add a barrier waiting for all workers on a node to finish.
- Active-worker accounting is now separate from the open/active/closed scheduling status, so reopening a node no longer erases the fact that other
workers are still on it.
I also updated search_tree bookkeeping so:
- closure bumps node cancel/lease epochs
- active-node counting uses actual active-worker presence, not just status == active
* fix(parallel-smt): gate split/backtrack by lease epoch
What it changes:
- util/search_tree.h
- bumps node epoch on split
- threads epoch through should_split(...) and try_split(...)
- always records effort, but only split/reopen if the lease epoch still matches
- smt/smt_parallel.cpp
- requires is_lease_valid(..., lease.epoch) before backtrack(...)
- passes lease.epoch into m_search_tree.try_split(...)
* clean up code and add some comments
* fix bug about backtracking condition being too strict: The epoch guard should not block backtrack(...) the same way it blocks try_split(...). A stale worker that proves UNSAT for n should still be able to
close n, and that closure should then cancel the other workers on n and its subtree.
I changed smt/smt_parallel.cpp accordingly:
- try_split(...) still uses epoch to reject stale structural splits
- backtrack(...) no longer requires is_lease_valid(..., epoch); it only requires that the lease is not already canceled
So the intended asymmetry is now restored:
- stale split: reject
- stale unsat/backtrack: allow closure, then cancel affected workers
* ablate to no backtracking on stale leases
* revert codex change about exception handling
* remove old code
* ablate backtracking gate
* attempt to fix linux crashes
* try to fix bug about active worker counts/lease accounting. current policy should hold: - stale leases: release/decrement
- canceled leases: do not release/decrement (just ignore since we have an invariant that canceled leases mean closed nodes that are never revisited
* delay premature root activation
* fix major semantic bug about threads continually choosing the root if their lease is reset
* fix cancellation to unknown status
* fix very bad bug about all threads needing to start at the root
* ablate active ranking: now nodes are only reopened if they are truly inactive (active worker count is 0)
* fix some bugs about leases
* ablate adding static effort only
* fix some bugs about leases
* don't explode effort for portfolio nodes
* fix: still accumulate per-node effort, but don't over-accumulate on portfolio solves
* restore dynamically scaled effort
* lease cancellation doens't touch rlimit now, it just sets max conflicts to 0. also fix a VERY BAD BUG about effort never being updated until all leases are done on a node, which meant we never left the root
* cross-thread modification of max conflicts is unsafe, so create an atomic lease canceled variable that's ch
ecked in ctx where max conflicts is also checked
* move atomic lease check in the context to the more global get_cancel_flag function
* Fix new SIGSEV. issue: The root cause: get_cancel_flag() is called from within propagation loops (mid-BCP, mid-equality-propagation, mid-atom-propagation). When it returns true there, the solver exits early and leaves the context in an intermediate state —
propagation queues partially processed, theory state potentially inconsistent with boolean state.
For the global cancel (m.limit().cancel()), this is harmless: the worker exits entirely and the context is destroyed. Intermediate state doesn't matter.
For a lease cancel, the context is reused — the worker gets a new cube and calls ctx->check() again on the same context object. Re-entering check() on a context interrupted mid-propagation causes it to access that corrupted intermediate
state → SIGSEGV.
The m_max_conflicts check is the only checkpoint that's safe for re-entry: it only fires post-conflict-resolution, pre-decision, when propagation queues are empty and theory state is consistent.
Fix: Remove m_lease_canceled from get_cancel_flag(). Keep it only at safe, between-phase checkpoints where the context is in a known-consistent state. The result is two safe checkpoints for m_lease_canceled: after each conflict (post-resolution, queues empty) and before each theory final check (not yet entered the theory). Neither interrupts the solver mid-mutation. The SIGSEGV should be
gone, and NIA performance should improve because long theory final checks (where NIA burns most time) are now preemptable before they start.
* fix new inconsistent theory bug: The problem is returning FC_GIVEUP from inside final_check() after some theories have already run final_check_eh() and pushed propagations into the queue. Those pending propagations reference context state that gets invalidated on the next check() call → SIGSEGV. The fix: check m_lease_canceled before entering final_check() in bounded_search(), never from inside it. That way the context is always in a clean pre-final-check state when we bail out. This is safe: decide() returned false (all variables assigned, no pending propagations), theories haven't been touched yet, context is in a fully consistent state. For NIA, this is still a meaningful win — we avoid entering expensive arithmetic final checks entirely when the lease is already canceled.
* remove second lease cancel check in smt_context, not sure it's safe. only check where we do the max conflicts check
* check epoch match in release_lease_unlocked
* restore exception handling logic to master branch
* restore reslimit cancels since the bug appears to be latent
---------
Co-authored-by: Ilana Shapiro <ilanashapiro@Ilanas-MacBook-Pro.local>
Co-authored-by: Ilana Shapiro <ilanashapiro@Mac.lan1>
This commit is contained in:
parent
0669cdd829
commit
44966e1733
4 changed files with 242 additions and 120 deletions
|
|
@ -64,6 +64,12 @@ namespace smt {
|
|||
|
||||
namespace smt {
|
||||
|
||||
static bool is_cancellation_exception(char const* msg) {
|
||||
return msg &&
|
||||
(strstr(msg, "canceled") != nullptr ||
|
||||
strstr(msg, "cancelled") != nullptr);
|
||||
}
|
||||
|
||||
void parallel::sls_worker::run() {
|
||||
ptr_vector<expr> assertions;
|
||||
p.ctx.get_assertions(assertions);
|
||||
|
|
@ -111,25 +117,32 @@ namespace smt {
|
|||
}
|
||||
|
||||
void parallel::worker::run() {
|
||||
search_tree::node<cube_config> *node = nullptr;
|
||||
bool is_first_run = true;
|
||||
node_lease lease;
|
||||
expr_ref_vector cube(m);
|
||||
while (true) {
|
||||
|
||||
if (!b.get_cube(m_g2l, id, cube, node)) {
|
||||
if (!b.get_cube(m_g2l, id, cube, is_first_run, lease)) {
|
||||
LOG_WORKER(1, " no more cubes\n");
|
||||
return;
|
||||
}
|
||||
is_first_run = false;
|
||||
collect_shared_clauses();
|
||||
|
||||
check_cube_start:
|
||||
LOG_WORKER(1, " CUBE SIZE IN MAIN LOOP: " << cube.size() << "\n");
|
||||
lbool r = check_cube(cube);
|
||||
|
||||
if (!m.inc()) {
|
||||
b.set_exception("context cancelled");
|
||||
return;
|
||||
if (b.lease_canceled(lease)) {
|
||||
LOG_WORKER(1, " abandoning canceled lease\n");
|
||||
lease = {};
|
||||
m.limit().dec_cancel();
|
||||
continue;
|
||||
}
|
||||
|
||||
if (!m.inc())
|
||||
return;
|
||||
|
||||
switch (r) {
|
||||
case l_undef: {
|
||||
update_max_thread_conflicts();
|
||||
|
|
@ -140,7 +153,8 @@ namespace smt {
|
|||
auto atom = get_split_atom();
|
||||
if (!atom)
|
||||
goto check_cube_start;
|
||||
b.try_split(m_l2g, id, node, atom, m_config.m_threads_max_conflicts);
|
||||
b.try_split(m_l2g, id, lease, atom, m_config.m_threads_max_conflicts);
|
||||
lease = {};
|
||||
simplify();
|
||||
break;
|
||||
}
|
||||
|
|
@ -164,7 +178,8 @@ namespace smt {
|
|||
}
|
||||
|
||||
LOG_WORKER(1, " found unsat cube\n");
|
||||
b.backtrack(m_l2g, unsat_core, node);
|
||||
b.backtrack(m_l2g, id, unsat_core, lease);
|
||||
lease = {};
|
||||
|
||||
if (m_config.m_share_conflicts)
|
||||
b.collect_clause(m_l2g, id, mk_not(mk_and(unsat_core)));
|
||||
|
|
@ -314,18 +329,57 @@ namespace smt {
|
|||
m.limit().cancel();
|
||||
}
|
||||
|
||||
void parallel::batch_manager::backtrack(ast_translation &l2g, expr_ref_vector const &core,
|
||||
search_tree::node<cube_config> *node) {
|
||||
void parallel::worker::cancel_lease() {
|
||||
LOG_WORKER(1, " canceling lease\n");
|
||||
m.limit().inc_cancel();
|
||||
}
|
||||
|
||||
void parallel::batch_manager::release_lease_unlocked(unsigned worker_id, node* n, unsigned epoch) {
|
||||
if (worker_id >= m_worker_leases.size())
|
||||
return;
|
||||
auto &lease = m_worker_leases[worker_id];
|
||||
if (!lease.node || lease.node != n || lease.epoch != epoch)
|
||||
return;
|
||||
m_search_tree.dec_active_workers(lease.node);
|
||||
lease = {};
|
||||
}
|
||||
|
||||
void parallel::batch_manager::cancel_closed_leases_unlocked(unsigned source_worker_id) {
|
||||
unsigned n = std::min(m_worker_leases.size(), p.m_workers.size());
|
||||
for (unsigned worker_id = 0; worker_id < n; ++worker_id) {
|
||||
if (worker_id == source_worker_id)
|
||||
continue;
|
||||
auto const& lease = m_worker_leases[worker_id];
|
||||
|
||||
// only cancel workers that currently hold a lease, and whose lease is canceled
|
||||
if (lease.node && m_search_tree.is_lease_canceled(lease.node, lease.cancel_epoch))
|
||||
p.m_workers[worker_id]->cancel_lease();
|
||||
}
|
||||
}
|
||||
|
||||
void parallel::batch_manager::backtrack(ast_translation &l2g, unsigned worker_id, expr_ref_vector const &core,
|
||||
node_lease const &lease) {
|
||||
std::scoped_lock lock(mux);
|
||||
IF_VERBOSE(1, verbose_stream() << "Batch manager backtracking.\n");
|
||||
|
||||
if (m_state != state::is_running)
|
||||
return;
|
||||
|
||||
// we close/backtrack regardless of whether this lease is stale or not, as long as the lease isn't canceled
|
||||
// i.e. worker 1 splits this node, but then worker 2 determines UNSAT --> worker 2 is stale but we still close this node and backtrack
|
||||
if (m_search_tree.is_lease_canceled(lease.node, lease.cancel_epoch))
|
||||
return;
|
||||
|
||||
IF_VERBOSE(1, verbose_stream() << "Batch manager backtracking.\n");
|
||||
|
||||
release_lease_unlocked(worker_id, lease.node, lease.epoch);
|
||||
|
||||
vector<cube_config::literal> g_core;
|
||||
for (auto c : core) {
|
||||
expr_ref g_c(l2g(c), m);
|
||||
for (auto c : core)
|
||||
g_core.push_back(expr_ref(l2g(c), m));
|
||||
}
|
||||
m_search_tree.backtrack(node, g_core);
|
||||
m_search_tree.backtrack(lease.node, g_core);
|
||||
|
||||
// terminate on-demand the workers that are currently exploring the now-closed nodes
|
||||
cancel_closed_leases_unlocked(worker_id);
|
||||
|
||||
IF_VERBOSE(1, m_search_tree.display(verbose_stream() << bounded_pp_exprs(core) << "\n"););
|
||||
if (m_search_tree.is_closed()) {
|
||||
|
|
@ -338,25 +392,40 @@ namespace smt {
|
|||
}
|
||||
}
|
||||
|
||||
void parallel::batch_manager::try_split(ast_translation &l2g, unsigned source_worker_id,
|
||||
search_tree::node<cube_config> *node, expr *atom, unsigned effort) {
|
||||
void parallel::batch_manager::try_split(ast_translation &l2g, unsigned worker_id,
|
||||
node_lease const &lease, expr *atom, unsigned effort) {
|
||||
std::scoped_lock lock(mux);
|
||||
expr_ref lit(m), nlit(m);
|
||||
lit = l2g(atom);
|
||||
nlit = mk_not(m, lit);
|
||||
|
||||
if (m_state != state::is_running)
|
||||
return;
|
||||
|
||||
bool did_split = m_search_tree.try_split(node, lit, nlit, effort);
|
||||
if (m_search_tree.is_lease_canceled(lease.node, lease.cancel_epoch))
|
||||
return;
|
||||
|
||||
expr_ref lit(m), nlit(m);
|
||||
lit = l2g(atom);
|
||||
nlit = mk_not(m, lit);
|
||||
bool did_split = m_search_tree.try_split(lease.node, lease.epoch, lease.cancel_epoch, lit, nlit, effort);
|
||||
|
||||
release_lease_unlocked(worker_id, lease.node, lease.epoch);
|
||||
|
||||
if (did_split) {
|
||||
++m_stats.m_num_cubes;
|
||||
m_stats.m_max_cube_depth = std::max(m_stats.m_max_cube_depth, node->depth() + 1);
|
||||
m_stats.m_max_cube_depth = std::max(m_stats.m_max_cube_depth, lease.node->depth() + 1);
|
||||
IF_VERBOSE(1, verbose_stream() << "Batch manager splitting on literal: " << mk_bounded_pp(lit, m, 3) << "\n");
|
||||
}
|
||||
}
|
||||
|
||||
void parallel::batch_manager::release_lease(unsigned worker_id, node_lease const &lease) {
|
||||
std::scoped_lock lock(mux);
|
||||
release_lease_unlocked(worker_id, lease.node, lease.epoch);
|
||||
}
|
||||
|
||||
bool parallel::batch_manager::lease_canceled(node_lease const &lease) {
|
||||
std::scoped_lock lock(mux);
|
||||
return m_state == state::is_running && m_search_tree.is_lease_canceled(lease.node, lease.cancel_epoch);
|
||||
}
|
||||
|
||||
void parallel::batch_manager::collect_clause(ast_translation &l2g, unsigned source_worker_id, expr *clause) {
|
||||
std::scoped_lock lock(mux);
|
||||
expr *g_clause = l2g(clause);
|
||||
|
|
@ -400,9 +469,11 @@ namespace smt {
|
|||
try {
|
||||
r = ctx->check(asms.size(), asms.data());
|
||||
} catch (z3_error &err) {
|
||||
b.set_exception(err.error_code());
|
||||
if (!is_cancellation_exception(err.what()))
|
||||
b.set_exception(err.error_code());
|
||||
} catch (z3_exception &ex) {
|
||||
b.set_exception(ex.what());
|
||||
if (!is_cancellation_exception(ex.what()))
|
||||
b.set_exception(ex.what());
|
||||
} catch (...) {
|
||||
b.set_exception("unknown exception");
|
||||
}
|
||||
|
|
@ -501,9 +572,10 @@ namespace smt {
|
|||
}
|
||||
}
|
||||
|
||||
bool parallel::batch_manager::get_cube(ast_translation &g2l, unsigned id, expr_ref_vector &cube, node *&n) {
|
||||
bool parallel::batch_manager::get_cube(ast_translation &g2l, unsigned id, expr_ref_vector &cube, bool is_first_run, node_lease &lease) {
|
||||
std::scoped_lock lock(mux);
|
||||
cube.reset();
|
||||
std::unique_lock<std::mutex> lock(mux);
|
||||
|
||||
if (m_search_tree.is_closed()) {
|
||||
IF_VERBOSE(1, verbose_stream() << "all done\n";);
|
||||
return false;
|
||||
|
|
@ -512,11 +584,21 @@ namespace smt {
|
|||
IF_VERBOSE(1, verbose_stream() << "aborting get_cube\n";);
|
||||
return false;
|
||||
}
|
||||
node *t = m_search_tree.activate_node(n);
|
||||
|
||||
node *t = is_first_run ? m_search_tree.activate_root() : m_search_tree.activate_best_node();
|
||||
|
||||
if (!t)
|
||||
return false;
|
||||
|
||||
IF_VERBOSE(1, m_search_tree.display(verbose_stream()); verbose_stream() << "\n";);
|
||||
n = t;
|
||||
|
||||
lease.node = t;
|
||||
lease.epoch = t->epoch();
|
||||
lease.cancel_epoch = t->get_cancel_epoch();
|
||||
if (id >= m_worker_leases.size())
|
||||
m_worker_leases.resize(id + 1);
|
||||
m_worker_leases[id] = lease;
|
||||
|
||||
while (t) {
|
||||
if (cube_config::literal_is_null(t->get_literal()))
|
||||
break;
|
||||
|
|
@ -525,6 +607,7 @@ namespace smt {
|
|||
cube.push_back(std::move(lit));
|
||||
t = t->parent();
|
||||
}
|
||||
|
||||
return true;
|
||||
}
|
||||
|
||||
|
|
@ -532,6 +615,8 @@ namespace smt {
|
|||
m_state = state::is_running;
|
||||
m_search_tree.reset();
|
||||
m_search_tree.set_effort_unit(initial_max_thread_conflicts);
|
||||
m_worker_leases.reset();
|
||||
m_worker_leases.resize(p.m_workers.size());
|
||||
}
|
||||
|
||||
void parallel::batch_manager::collect_statistics(::statistics &st) const {
|
||||
|
|
@ -540,7 +625,12 @@ namespace smt {
|
|||
}
|
||||
|
||||
lbool parallel::operator()(expr_ref_vector const &asms) {
|
||||
IF_VERBOSE(1, verbose_stream() << "Parallel SMT with " << num_threads << " threads\n";);
|
||||
smt_parallel_params pp(ctx.m_params);
|
||||
unsigned num_workers = std::min((unsigned)std::thread::hardware_concurrency(), ctx.get_fparams().m_threads);
|
||||
unsigned num_sls_threads = (pp.sls() ? 1 : 0);
|
||||
unsigned total_threads = num_workers + num_sls_threads;
|
||||
|
||||
IF_VERBOSE(1, verbose_stream() << "Parallel SMT with " << total_threads << " threads\n";);
|
||||
ast_manager &m = ctx.m;
|
||||
|
||||
if (m.has_trace_stream())
|
||||
|
|
@ -556,33 +646,32 @@ namespace smt {
|
|||
};
|
||||
scoped_clear clear(*this);
|
||||
|
||||
m_batch_manager.initialize();
|
||||
m_workers.reset();
|
||||
|
||||
smt_parallel_params pp(ctx.m_params);
|
||||
m_should_run_sls = pp.sls();
|
||||
|
||||
scoped_limits sl(m.limit());
|
||||
flet<unsigned> _nt(ctx.m_fparams.m_threads, 1);
|
||||
SASSERT(num_threads > 1);
|
||||
for (unsigned i = 0; i < num_threads; ++i)
|
||||
SASSERT(num_workers > 1);
|
||||
for (unsigned i = 0; i < num_workers; ++i)
|
||||
m_workers.push_back(alloc(worker, i, *this, asms));
|
||||
for (auto w : m_workers)
|
||||
sl.push_child(&(w->limit()));
|
||||
if (m_should_run_sls) {
|
||||
|
||||
if (num_sls_threads == 1) {
|
||||
m_sls_worker = alloc(sls_worker, *this);
|
||||
sl.push_child(&(m_sls_worker->limit()));
|
||||
}
|
||||
|
||||
IF_VERBOSE(1, verbose_stream() << "Launched " << m_workers.size() << " CDCL threads, "
|
||||
<< (m_sls_worker ? 1 : 0) << " SLS threads\n";);
|
||||
|
||||
m_batch_manager.initialize();
|
||||
|
||||
// Launch threads
|
||||
vector<std::thread> threads;
|
||||
threads.resize(m_should_run_sls ? num_threads + 1 : num_threads); // +1 for sls worker
|
||||
for (unsigned i = 0; i < num_threads; ++i)
|
||||
threads[i] = std::thread([&, i]() { m_workers[i]->run(); });
|
||||
|
||||
// the final thread runs the sls worker
|
||||
if (m_should_run_sls)
|
||||
threads[num_threads] = std::thread([&]() { m_sls_worker->run(); });
|
||||
threads.resize(total_threads);
|
||||
unsigned thread_idx = 0;
|
||||
for (auto* w : m_workers)
|
||||
threads[thread_idx++] = std::thread([&, w]() { w->run(); });
|
||||
|
||||
// Wait for all threads to finish
|
||||
for (auto &th : threads)
|
||||
|
|
@ -591,7 +680,7 @@ namespace smt {
|
|||
for (auto w : m_workers)
|
||||
w->collect_statistics(ctx.m_aux_stats);
|
||||
m_batch_manager.collect_statistics(ctx.m_aux_stats);
|
||||
if (m_should_run_sls)
|
||||
if (m_sls_worker)
|
||||
m_sls_worker->collect_statistics(ctx.m_aux_stats);
|
||||
|
||||
return m_batch_manager.get_result();
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue