From 44966e17339c0c381b247805528e5098a5b0ffd3 Mon Sep 17 00:00:00 2001 From: Ilana Shapiro Date: Sun, 19 Apr 2026 07:21:41 -0700 Subject: [PATCH] Terminate on Demand and some algorithmic bugfixes in the search tree (#9336) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * first attempt with codex. Codex notes: What changed: - Each tree node now tracks: - active worker count - lease epoch - cancel epoch - get_cube() now hands each worker an explicit lease: (node, epoch, cancel_epoch). - try_split() and backtrack() now operate on that lease, and the batch manager releases the worker’s lease under the tree lock before mutating the node. - If another worker closes the leased node or subtree, the batch manager cancels only the workers whose current leased nodes are now closed. - Workers detect canceled leases after check(), reset their local cancel flag, abandon the stale lease, and continue instead of turning that into a global exception. - The “reopen immediately into the open queue” policy is preserved. I did not add a barrier waiting for all workers on a node to finish. - Active-worker accounting is now separate from the open/active/closed scheduling status, so reopening a node no longer erases the fact that other workers are still on it. I also updated search_tree bookkeeping so: - closure bumps node cancel/lease epochs - active-node counting uses actual active-worker presence, not just status == active * fix(parallel-smt): gate split/backtrack by lease epoch What it changes: - util/search_tree.h - bumps node epoch on split - threads epoch through should_split(...) and try_split(...) - always records effort, but only split/reopen if the lease epoch still matches - smt/smt_parallel.cpp - requires is_lease_valid(..., lease.epoch) before backtrack(...) - passes lease.epoch into m_search_tree.try_split(...) * clean up code and add some comments * fix bug about backtracking condition being too strict: The epoch guard should not block backtrack(...) the same way it blocks try_split(...). A stale worker that proves UNSAT for n should still be able to close n, and that closure should then cancel the other workers on n and its subtree. I changed smt/smt_parallel.cpp accordingly: - try_split(...) still uses epoch to reject stale structural splits - backtrack(...) no longer requires is_lease_valid(..., epoch); it only requires that the lease is not already canceled So the intended asymmetry is now restored: - stale split: reject - stale unsat/backtrack: allow closure, then cancel affected workers * ablate to no backtracking on stale leases * revert codex change about exception handling * remove old code * ablate backtracking gate * attempt to fix linux crashes * try to fix bug about active worker counts/lease accounting. current policy should hold: - stale leases: release/decrement - canceled leases: do not release/decrement (just ignore since we have an invariant that canceled leases mean closed nodes that are never revisited * delay premature root activation * fix major semantic bug about threads continually choosing the root if their lease is reset * fix cancellation to unknown status * fix very bad bug about all threads needing to start at the root * ablate active ranking: now nodes are only reopened if they are truly inactive (active worker count is 0) * fix some bugs about leases * ablate adding static effort only * fix some bugs about leases * don't explode effort for portfolio nodes * fix: still accumulate per-node effort, but don't over-accumulate on portfolio solves * restore dynamically scaled effort * lease cancellation doens't touch rlimit now, it just sets max conflicts to 0. also fix a VERY BAD BUG about effort never being updated until all leases are done on a node, which meant we never left the root * cross-thread modification of max conflicts is unsafe, so create an atomic lease canceled variable that's ch ecked in ctx where max conflicts is also checked * move atomic lease check in the context to the more global get_cancel_flag function * Fix new SIGSEV. issue: The root cause: get_cancel_flag() is called from within propagation loops (mid-BCP, mid-equality-propagation, mid-atom-propagation). When it returns true there, the solver exits early and leaves the context in an intermediate state — propagation queues partially processed, theory state potentially inconsistent with boolean state. For the global cancel (m.limit().cancel()), this is harmless: the worker exits entirely and the context is destroyed. Intermediate state doesn't matter. For a lease cancel, the context is reused — the worker gets a new cube and calls ctx->check() again on the same context object. Re-entering check() on a context interrupted mid-propagation causes it to access that corrupted intermediate state → SIGSEGV. The m_max_conflicts check is the only checkpoint that's safe for re-entry: it only fires post-conflict-resolution, pre-decision, when propagation queues are empty and theory state is consistent. Fix: Remove m_lease_canceled from get_cancel_flag(). Keep it only at safe, between-phase checkpoints where the context is in a known-consistent state. The result is two safe checkpoints for m_lease_canceled: after each conflict (post-resolution, queues empty) and before each theory final check (not yet entered the theory). Neither interrupts the solver mid-mutation. The SIGSEGV should be gone, and NIA performance should improve because long theory final checks (where NIA burns most time) are now preemptable before they start. * fix new inconsistent theory bug: The problem is returning FC_GIVEUP from inside final_check() after some theories have already run final_check_eh() and pushed propagations into the queue. Those pending propagations reference context state that gets invalidated on the next check() call → SIGSEGV. The fix: check m_lease_canceled before entering final_check() in bounded_search(), never from inside it. That way the context is always in a clean pre-final-check state when we bail out. This is safe: decide() returned false (all variables assigned, no pending propagations), theories haven't been touched yet, context is in a fully consistent state. For NIA, this is still a meaningful win — we avoid entering expensive arithmetic final checks entirely when the lease is already canceled. * remove second lease cancel check in smt_context, not sure it's safe. only check where we do the max conflicts check * check epoch match in release_lease_unlocked * restore exception handling logic to master branch * restore reslimit cancels since the bug appears to be latent --------- Co-authored-by: Ilana Shapiro Co-authored-by: Ilana Shapiro --- src/smt/smt_context.h | 1 + src/smt/smt_parallel.cpp | 175 +++++++++++++++++++++++++++++---------- src/smt/smt_parallel.h | 32 +++++-- src/util/search_tree.h | 154 +++++++++++++++++++--------------- 4 files changed, 242 insertions(+), 120 deletions(-) diff --git a/src/smt/smt_context.h b/src/smt/smt_context.h index 80ab6803d..12fbf0284 100644 --- a/src/smt/smt_context.h +++ b/src/smt/smt_context.h @@ -18,6 +18,7 @@ Revision History: --*/ #pragma once +#include #include "ast/quantifier_stat.h" #include "ast/simplifiers/dependent_expr_state.h" #include "smt/smt_clause.h" diff --git a/src/smt/smt_parallel.cpp b/src/smt/smt_parallel.cpp index d74b71df2..97c9a7105 100644 --- a/src/smt/smt_parallel.cpp +++ b/src/smt/smt_parallel.cpp @@ -64,6 +64,12 @@ namespace smt { namespace smt { + static bool is_cancellation_exception(char const* msg) { + return msg && + (strstr(msg, "canceled") != nullptr || + strstr(msg, "cancelled") != nullptr); + } + void parallel::sls_worker::run() { ptr_vector assertions; p.ctx.get_assertions(assertions); @@ -111,25 +117,32 @@ namespace smt { } void parallel::worker::run() { - search_tree::node *node = nullptr; + bool is_first_run = true; + node_lease lease; expr_ref_vector cube(m); while (true) { - if (!b.get_cube(m_g2l, id, cube, node)) { + if (!b.get_cube(m_g2l, id, cube, is_first_run, lease)) { LOG_WORKER(1, " no more cubes\n"); return; } + is_first_run = false; collect_shared_clauses(); check_cube_start: LOG_WORKER(1, " CUBE SIZE IN MAIN LOOP: " << cube.size() << "\n"); lbool r = check_cube(cube); - if (!m.inc()) { - b.set_exception("context cancelled"); - return; + if (b.lease_canceled(lease)) { + LOG_WORKER(1, " abandoning canceled lease\n"); + lease = {}; + m.limit().dec_cancel(); + continue; } + if (!m.inc()) + return; + switch (r) { case l_undef: { update_max_thread_conflicts(); @@ -140,7 +153,8 @@ namespace smt { auto atom = get_split_atom(); if (!atom) goto check_cube_start; - b.try_split(m_l2g, id, node, atom, m_config.m_threads_max_conflicts); + b.try_split(m_l2g, id, lease, atom, m_config.m_threads_max_conflicts); + lease = {}; simplify(); break; } @@ -164,7 +178,8 @@ namespace smt { } LOG_WORKER(1, " found unsat cube\n"); - b.backtrack(m_l2g, unsat_core, node); + b.backtrack(m_l2g, id, unsat_core, lease); + lease = {}; if (m_config.m_share_conflicts) b.collect_clause(m_l2g, id, mk_not(mk_and(unsat_core))); @@ -314,18 +329,57 @@ namespace smt { m.limit().cancel(); } - void parallel::batch_manager::backtrack(ast_translation &l2g, expr_ref_vector const &core, - search_tree::node *node) { + void parallel::worker::cancel_lease() { + LOG_WORKER(1, " canceling lease\n"); + m.limit().inc_cancel(); + } + + void parallel::batch_manager::release_lease_unlocked(unsigned worker_id, node* n, unsigned epoch) { + if (worker_id >= m_worker_leases.size()) + return; + auto &lease = m_worker_leases[worker_id]; + if (!lease.node || lease.node != n || lease.epoch != epoch) + return; + m_search_tree.dec_active_workers(lease.node); + lease = {}; + } + + void parallel::batch_manager::cancel_closed_leases_unlocked(unsigned source_worker_id) { + unsigned n = std::min(m_worker_leases.size(), p.m_workers.size()); + for (unsigned worker_id = 0; worker_id < n; ++worker_id) { + if (worker_id == source_worker_id) + continue; + auto const& lease = m_worker_leases[worker_id]; + + // only cancel workers that currently hold a lease, and whose lease is canceled + if (lease.node && m_search_tree.is_lease_canceled(lease.node, lease.cancel_epoch)) + p.m_workers[worker_id]->cancel_lease(); + } + } + + void parallel::batch_manager::backtrack(ast_translation &l2g, unsigned worker_id, expr_ref_vector const &core, + node_lease const &lease) { std::scoped_lock lock(mux); - IF_VERBOSE(1, verbose_stream() << "Batch manager backtracking.\n"); + if (m_state != state::is_running) return; + + // we close/backtrack regardless of whether this lease is stale or not, as long as the lease isn't canceled + // i.e. worker 1 splits this node, but then worker 2 determines UNSAT --> worker 2 is stale but we still close this node and backtrack + if (m_search_tree.is_lease_canceled(lease.node, lease.cancel_epoch)) + return; + + IF_VERBOSE(1, verbose_stream() << "Batch manager backtracking.\n"); + + release_lease_unlocked(worker_id, lease.node, lease.epoch); + vector g_core; - for (auto c : core) { - expr_ref g_c(l2g(c), m); + for (auto c : core) g_core.push_back(expr_ref(l2g(c), m)); - } - m_search_tree.backtrack(node, g_core); + m_search_tree.backtrack(lease.node, g_core); + + // terminate on-demand the workers that are currently exploring the now-closed nodes + cancel_closed_leases_unlocked(worker_id); IF_VERBOSE(1, m_search_tree.display(verbose_stream() << bounded_pp_exprs(core) << "\n");); if (m_search_tree.is_closed()) { @@ -338,25 +392,40 @@ namespace smt { } } - void parallel::batch_manager::try_split(ast_translation &l2g, unsigned source_worker_id, - search_tree::node *node, expr *atom, unsigned effort) { + void parallel::batch_manager::try_split(ast_translation &l2g, unsigned worker_id, + node_lease const &lease, expr *atom, unsigned effort) { std::scoped_lock lock(mux); - expr_ref lit(m), nlit(m); - lit = l2g(atom); - nlit = mk_not(m, lit); if (m_state != state::is_running) return; - bool did_split = m_search_tree.try_split(node, lit, nlit, effort); + if (m_search_tree.is_lease_canceled(lease.node, lease.cancel_epoch)) + return; + + expr_ref lit(m), nlit(m); + lit = l2g(atom); + nlit = mk_not(m, lit); + bool did_split = m_search_tree.try_split(lease.node, lease.epoch, lease.cancel_epoch, lit, nlit, effort); + + release_lease_unlocked(worker_id, lease.node, lease.epoch); if (did_split) { ++m_stats.m_num_cubes; - m_stats.m_max_cube_depth = std::max(m_stats.m_max_cube_depth, node->depth() + 1); + m_stats.m_max_cube_depth = std::max(m_stats.m_max_cube_depth, lease.node->depth() + 1); IF_VERBOSE(1, verbose_stream() << "Batch manager splitting on literal: " << mk_bounded_pp(lit, m, 3) << "\n"); } } + void parallel::batch_manager::release_lease(unsigned worker_id, node_lease const &lease) { + std::scoped_lock lock(mux); + release_lease_unlocked(worker_id, lease.node, lease.epoch); + } + + bool parallel::batch_manager::lease_canceled(node_lease const &lease) { + std::scoped_lock lock(mux); + return m_state == state::is_running && m_search_tree.is_lease_canceled(lease.node, lease.cancel_epoch); + } + void parallel::batch_manager::collect_clause(ast_translation &l2g, unsigned source_worker_id, expr *clause) { std::scoped_lock lock(mux); expr *g_clause = l2g(clause); @@ -400,9 +469,11 @@ namespace smt { try { r = ctx->check(asms.size(), asms.data()); } catch (z3_error &err) { - b.set_exception(err.error_code()); + if (!is_cancellation_exception(err.what())) + b.set_exception(err.error_code()); } catch (z3_exception &ex) { - b.set_exception(ex.what()); + if (!is_cancellation_exception(ex.what())) + b.set_exception(ex.what()); } catch (...) { b.set_exception("unknown exception"); } @@ -501,9 +572,10 @@ namespace smt { } } - bool parallel::batch_manager::get_cube(ast_translation &g2l, unsigned id, expr_ref_vector &cube, node *&n) { + bool parallel::batch_manager::get_cube(ast_translation &g2l, unsigned id, expr_ref_vector &cube, bool is_first_run, node_lease &lease) { + std::scoped_lock lock(mux); cube.reset(); - std::unique_lock lock(mux); + if (m_search_tree.is_closed()) { IF_VERBOSE(1, verbose_stream() << "all done\n";); return false; @@ -512,11 +584,21 @@ namespace smt { IF_VERBOSE(1, verbose_stream() << "aborting get_cube\n";); return false; } - node *t = m_search_tree.activate_node(n); + + node *t = is_first_run ? m_search_tree.activate_root() : m_search_tree.activate_best_node(); + if (!t) return false; + IF_VERBOSE(1, m_search_tree.display(verbose_stream()); verbose_stream() << "\n";); - n = t; + + lease.node = t; + lease.epoch = t->epoch(); + lease.cancel_epoch = t->get_cancel_epoch(); + if (id >= m_worker_leases.size()) + m_worker_leases.resize(id + 1); + m_worker_leases[id] = lease; + while (t) { if (cube_config::literal_is_null(t->get_literal())) break; @@ -525,6 +607,7 @@ namespace smt { cube.push_back(std::move(lit)); t = t->parent(); } + return true; } @@ -532,6 +615,8 @@ namespace smt { m_state = state::is_running; m_search_tree.reset(); m_search_tree.set_effort_unit(initial_max_thread_conflicts); + m_worker_leases.reset(); + m_worker_leases.resize(p.m_workers.size()); } void parallel::batch_manager::collect_statistics(::statistics &st) const { @@ -540,7 +625,12 @@ namespace smt { } lbool parallel::operator()(expr_ref_vector const &asms) { - IF_VERBOSE(1, verbose_stream() << "Parallel SMT with " << num_threads << " threads\n";); + smt_parallel_params pp(ctx.m_params); + unsigned num_workers = std::min((unsigned)std::thread::hardware_concurrency(), ctx.get_fparams().m_threads); + unsigned num_sls_threads = (pp.sls() ? 1 : 0); + unsigned total_threads = num_workers + num_sls_threads; + + IF_VERBOSE(1, verbose_stream() << "Parallel SMT with " << total_threads << " threads\n";); ast_manager &m = ctx.m; if (m.has_trace_stream()) @@ -556,33 +646,32 @@ namespace smt { }; scoped_clear clear(*this); - m_batch_manager.initialize(); m_workers.reset(); - - smt_parallel_params pp(ctx.m_params); - m_should_run_sls = pp.sls(); scoped_limits sl(m.limit()); flet _nt(ctx.m_fparams.m_threads, 1); - SASSERT(num_threads > 1); - for (unsigned i = 0; i < num_threads; ++i) + SASSERT(num_workers > 1); + for (unsigned i = 0; i < num_workers; ++i) m_workers.push_back(alloc(worker, i, *this, asms)); for (auto w : m_workers) sl.push_child(&(w->limit())); - if (m_should_run_sls) { + + if (num_sls_threads == 1) { m_sls_worker = alloc(sls_worker, *this); sl.push_child(&(m_sls_worker->limit())); } + IF_VERBOSE(1, verbose_stream() << "Launched " << m_workers.size() << " CDCL threads, " + << (m_sls_worker ? 1 : 0) << " SLS threads\n";); + + m_batch_manager.initialize(); + // Launch threads vector threads; - threads.resize(m_should_run_sls ? num_threads + 1 : num_threads); // +1 for sls worker - for (unsigned i = 0; i < num_threads; ++i) - threads[i] = std::thread([&, i]() { m_workers[i]->run(); }); - - // the final thread runs the sls worker - if (m_should_run_sls) - threads[num_threads] = std::thread([&]() { m_sls_worker->run(); }); + threads.resize(total_threads); + unsigned thread_idx = 0; + for (auto* w : m_workers) + threads[thread_idx++] = std::thread([&, w]() { w->run(); }); // Wait for all threads to finish for (auto &th : threads) @@ -591,7 +680,7 @@ namespace smt { for (auto w : m_workers) w->collect_statistics(ctx.m_aux_stats); m_batch_manager.collect_statistics(ctx.m_aux_stats); - if (m_should_run_sls) + if (m_sls_worker) m_sls_worker->collect_statistics(ctx.m_aux_stats); return m_batch_manager.get_result(); diff --git a/src/smt/smt_parallel.h b/src/smt/smt_parallel.h index bddb10b63..32517eb31 100644 --- a/src/smt/smt_parallel.h +++ b/src/smt/smt_parallel.h @@ -35,14 +35,26 @@ namespace smt { class parallel { context& ctx; - unsigned num_threads; - bool m_should_run_sls = false; struct shared_clause { unsigned source_worker_id; expr_ref clause; }; + struct node_lease { + search_tree::node* node = nullptr; + // Version counter for structural mutations of this node (e.g., split/close). + // Used to detect stale leases: if a worker's lease.epoch != node.epoch, + // the node has changed since it was acquired and must not be mutated. + unsigned epoch = 0; + + // Cancellation generation counter for this node/subtree. + // Incremented when the node is closed; used to signal that all + // workers holding leases on this node (or its descendants) + // must abandon work immediately. + unsigned cancel_epoch = 0; + }; + class batch_manager { enum state { @@ -66,6 +78,7 @@ namespace smt { stats m_stats; using node = search_tree::node; search_tree::tree m_search_tree; + vector m_worker_leases; unsigned m_exception_code = 0; std::string m_exception_msg; @@ -91,7 +104,8 @@ namespace smt { cancel_sls_worker(); } - void init_parameters_state(); + void release_lease_unlocked(unsigned worker_id, node* n, unsigned epoch); + void cancel_closed_leases_unlocked(unsigned source_worker_id); public: batch_manager(ast_manager& m, parallel& p) : m(m), p(p), m_search_tree(expr_ref(m)) { } @@ -104,9 +118,11 @@ namespace smt { void set_exception(unsigned error_code); void collect_statistics(::statistics& st) const; - bool get_cube(ast_translation& g2l, unsigned id, expr_ref_vector& cube, node*& n); - void backtrack(ast_translation& l2g, expr_ref_vector const& core, node* n); - void try_split(ast_translation& l2g, unsigned id, node* n, expr* atom, unsigned effort); + bool get_cube(ast_translation& g2l, unsigned id, expr_ref_vector& cube, bool is_first_run, node_lease& lease); + void backtrack(ast_translation& l2g, unsigned worker_id, expr_ref_vector const& core, node_lease const& lease); + void try_split(ast_translation& l2g, unsigned worker_id, node_lease const& lease, expr* atom, unsigned effort); + void release_lease(unsigned worker_id, node_lease const& lease); + bool lease_canceled(node_lease const& lease); void collect_clause(ast_translation& l2g, unsigned source_worker_id, expr* clause); expr_ref_vector return_shared_clauses(ast_translation& g2l, unsigned& worker_limit, unsigned worker_id); @@ -164,6 +180,7 @@ namespace smt { void collect_shared_clauses(); void cancel(); + void cancel_lease(); void collect_statistics(::statistics& st) const; reslimit& limit() { @@ -198,9 +215,6 @@ namespace smt { public: parallel(context& ctx) : ctx(ctx), - num_threads(std::min( - (unsigned)std::thread::hardware_concurrency(), - ctx.get_fparams().m_threads)), m_batch_manager(ctx.m, *this) {} lbool operator()(expr_ref_vector const& asms); diff --git a/src/util/search_tree.h b/src/util/search_tree.h index bac6ce0d0..ebfc18e63 100644 --- a/src/util/search_tree.h +++ b/src/util/search_tree.h @@ -48,6 +48,10 @@ namespace search_tree { vector m_core; unsigned m_num_activations = 0; unsigned m_effort_spent = 0; + unsigned m_round_max_effort = 0; + unsigned m_active_workers = 0; + unsigned m_epoch = 0; + unsigned m_cancel_epoch = 0; public: node(literal const &l, node *parent) : m_literal(l), m_parent(parent), m_status(status::open) {} @@ -65,9 +69,6 @@ namespace search_tree { literal const &get_literal() const { return m_literal; } - bool literal_is_null() const { - return Config::is_null(m_literal); - } void split(literal const &a, literal const &b) { SASSERT(!Config::literal_is_null(a)); SASSERT(!Config::literal_is_null(b)); @@ -77,6 +78,7 @@ namespace search_tree { SASSERT(!m_right); m_left = alloc(node, a, this); m_right = alloc(node, b, this); + inc_epoch(); } node* left() const { return m_left; } @@ -94,22 +96,6 @@ namespace search_tree { return d; } - node *find_active_node() { - if (m_status == status::active) - return this; - if (m_status == status::closed) - return nullptr; - node *nodes[2] = {m_left, m_right}; - for (unsigned i = 0; i < 2; ++i) { - auto res = nodes[i] ? nodes[i]->find_active_node() : nullptr; - if (res) - return res; - } - if (m_left->get_status() == status::closed && m_right->get_status() == status::closed) - m_status = status::closed; - return nullptr; - } - void display(std::ostream &out, unsigned indent) const { for (unsigned i = 0; i < indent; ++i) out << " "; @@ -137,12 +123,40 @@ namespace search_tree { void mark_new_activation() { set_status(status::active); ++m_num_activations; + ++m_active_workers; + } + void dec_active_workers() { + if (m_active_workers > 0) + --m_active_workers; + if (m_active_workers == 0 && m_status == status::active) { + m_round_max_effort = 0; + m_status = status::open; + } + } + bool has_active_workers() const { + return m_active_workers > 0; } unsigned effort_spent() const { return m_effort_spent; } - void add_effort(unsigned effort) { - m_effort_spent += effort; + void update_round_max_effort(unsigned effort) { + if (effort <= m_round_max_effort) + return; + m_effort_spent -= m_round_max_effort; + m_round_max_effort = effort; + m_effort_spent += m_round_max_effort; + } + unsigned epoch() const { + return m_epoch; + } + void inc_epoch() { + ++m_epoch; + } + unsigned get_cancel_epoch() const { + return m_cancel_epoch; + } + void inc_cancel_epoch() { + ++m_cancel_epoch; } }; @@ -161,13 +175,13 @@ namespace search_tree { struct candidate { node* n = nullptr; - unsigned effort_band = UINT_MAX; + unsigned scaled_effort = UINT_MAX; unsigned depth = 0; }; // A measure of how much effort has been spent on the node, used for activation prioritization and expansion decisions // The effort unit is the workers' initial conflict budget, and effort spent grows by a factor defined in smt_parallel.h on each split attempt - unsigned effort_band(node const* n) const { + unsigned scaled_effort(node const* n) const { return n->effort_spent() / std::max(1, m_effort_unit); } @@ -177,8 +191,8 @@ namespace search_tree { return false; if (!b.n) return true; - if (a.effort_band != b.effort_band) - return a.effort_band < b.effort_band; + if (a.scaled_effort != b.scaled_effort) + return a.scaled_effort < b.scaled_effort; if (a.depth != b.depth) return a.depth > b.depth; return false; @@ -191,7 +205,7 @@ namespace search_tree { if (cur->get_status() == target_status) { candidate cand; cand.n = cur; - cand.effort_band = effort_band(cur); + cand.scaled_effort = scaled_effort(cur); cand.depth = cur->depth(); if (better(cand, best)) @@ -202,20 +216,6 @@ namespace search_tree { select_next_node(cur->right(), target_status, best); } - node* activate_best_node() { - candidate best; - select_next_node(m_root.get(), status::open, best); - if (!best.n) { - IF_VERBOSE(1, verbose_stream() << "NO OPEN NODES, trying active nodes for portfolio solving\n";); - select_next_node(m_root.get(), status::active, best); // If no open nodes, only then consider active nodes for selection - } - - if (!best.n) - return nullptr; - best.n->mark_new_activation(); - return best.n; - } - bool has_unvisited_open_node(node* cur) const { if (!cur || cur->get_status() == status::closed) return false; @@ -251,8 +251,8 @@ namespace search_tree { find_shallowest_timed_out_leaf_depth(cur->right(), best_depth); } - bool should_split(node* n) { - if (!n || n->get_status() != status::active || !n->is_leaf()) + bool should_split(node* n, unsigned epoch) { + if (!is_lease_valid(n, epoch) || !n->is_leaf()) return false; unsigned num_active_nodes = count_active_nodes(m_root.get()); @@ -344,6 +344,8 @@ namespace search_tree { void close(node *n, vector const &C) { if (!n || n->get_status() == status::closed) return; + n->inc_epoch(); + n->inc_cancel_epoch(); n->set_status(status::closed); n->set_core(C); close(n->left(), C); @@ -441,40 +443,32 @@ namespace search_tree { void reset() { m_root = alloc(node, m_null_literal, nullptr); - m_root->mark_new_activation(); } // On timeout, either expand the current leaf or reopen the node for a // later revisit, depending on the tree-expansion heuristic. - bool try_split(node *n, literal const &a, literal const &b, unsigned effort) { - if (!n || n->get_status() != status::active) + bool try_split(node *n, unsigned epoch, unsigned cancel_epoch, literal const &a, literal const &b, unsigned effort) { + if (is_lease_canceled(n, cancel_epoch)) return false; - n->add_effort(effort); + // Record at most one effort contribution per concurrent round on this node. + // Stale workers still contribute, but only via the round-local maximum. + n->update_round_max_effort(effort); bool did_split = false; - if (should_split(n)) { + if (should_split(n, epoch)) { n->split(a, b); did_split = true; } - // Reopen immediately on timeout, even if other workers are still active. - // This keeps scheduling asynchronous: timeouts act as signals to reconsider - // the search, not barriers requiring all workers to finish. - // - // Early reopening also creates a soft penalty (via accumulated effort), - // reducing over-concentration while still allowing revisits. - // - // Waiting for all workers would introduce per-node synchronization, delay - // diversification, and let a slow worker stall progress. - n->set_status(status::open); - return did_split; } // conflict is given by a set of literals. // they are subsets of the literals on the path from root to n AND the external assumption literals void backtrack(node *n, vector const &conflict) { + if (!n) + return; if (conflict.empty()) { close_with_core(m_root.get(), conflict); return; @@ -507,17 +501,41 @@ namespace search_tree { UNREACHABLE(); } - // return an active node in the tree, or nullptr if there is none - // first check if there is a node to activate under n, - // if not, go up the tree and try to activate a sibling subtree - node *activate_node(node *n) { - if (!n) { - if (m_root->get_status() == status::active) { - m_root->mark_new_activation(); - return m_root.get(); - } + // Try to select an open node using the select_next_node policy + // If there are no open nodes, try to select an active node for portfolio solving + node* activate_best_node() { + candidate best; + select_next_node(m_root.get(), status::open, best); + if (!best.n) { + IF_VERBOSE(1, verbose_stream() << "NO OPEN NODES, trying active nodes for portfolio solving\n";); + select_next_node(m_root.get(), status::active, best); // If no open nodes, only then consider active nodes for selection } - return activate_best_node(); + + if (!best.n) + return nullptr; + best.n->mark_new_activation(); + return best.n; + } + + node* activate_root() { + if (m_root->get_status() == status::closed) + return nullptr; + m_root->mark_new_activation(); + return m_root.get(); + } + + void dec_active_workers(node* n) { + if (!n) + return; + n->dec_active_workers(); + } + + bool is_lease_valid(node* n, unsigned epoch) const { + return n && n->get_status() == status::active && n->epoch() == epoch; + } + + bool is_lease_canceled(node* n, unsigned cancel_epoch) const { + return !n || n->get_status() == status::closed || n->get_cancel_epoch() != cancel_epoch; } vector const &get_core_from_root() const {