From 44966e17339c0c381b247805528e5098a5b0ffd3 Mon Sep 17 00:00:00 2001
From: Ilana Shapiro <ishapiro00@gmail.com>
Date: Sun, 19 Apr 2026 07:21:41 -0700
Subject: [PATCH] Terminate on Demand and some algorithmic bugfixes in the
 search tree (#9336)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* first attempt with codex. Codex notes:
What changed:

  - Each tree node now tracks:
      - active worker count
      - lease epoch
      - cancel epoch
  - get_cube() now hands each worker an explicit lease: (node, epoch, cancel_epoch).
  - try_split() and backtrack() now operate on that lease, and the batch manager releases the worker’s lease under the tree lock before mutating the
    node.
  - If another worker closes the leased node or subtree, the batch manager cancels only the workers whose current leased nodes are now closed.
  - Workers detect canceled leases after check(), reset their local cancel flag, abandon the stale lease, and continue instead of turning that into a
    global exception.
  - The “reopen immediately into the open queue” policy is preserved. I did not add a barrier waiting for all workers on a node to finish.
  - Active-worker accounting is now separate from the open/active/closed scheduling status, so reopening a node no longer erases the fact that other
    workers are still on it.

  I also updated search_tree bookkeeping so:

  - closure bumps node cancel/lease epochs
  - active-node counting uses actual active-worker presence, not just status == active

* fix(parallel-smt): gate split/backtrack by lease epoch

What it changes:

  - util/search_tree.h
      - bumps node epoch on split
      - threads epoch through should_split(...) and try_split(...)
      - always records effort, but only split/reopen if the lease epoch still matches
  - smt/smt_parallel.cpp
      - requires is_lease_valid(..., lease.epoch) before backtrack(...)
      - passes lease.epoch into m_search_tree.try_split(...)

* clean up code and add some comments

* fix bug about backtracking condition being too strict: The epoch guard should not block backtrack(...) the same way it blocks try_split(...). A stale worker that proves UNSAT for n should still be able to
  close n, and that closure should then cancel the other workers on n and its subtree.

  I changed smt/smt_parallel.cpp accordingly:

  - try_split(...) still uses epoch to reject stale structural splits
  - backtrack(...) no longer requires is_lease_valid(..., epoch); it only requires that the lease is not already canceled

  So the intended asymmetry is now restored:

  - stale split: reject
  - stale unsat/backtrack: allow closure, then cancel affected workers

* ablate to no backtracking on stale leases

* revert codex change about exception handling

* remove old code

* ablate backtracking gate

* attempt to fix linux crashes

* try to fix bug about active worker counts/lease accounting. current policy should hold: - stale leases: release/decrement
  - canceled leases: do not release/decrement (just ignore since we have an invariant that canceled leases mean closed nodes that are never revisited

* delay premature root activation

* fix major semantic bug about threads continually choosing the root if their lease is reset

* fix cancellation to unknown status

* fix very bad bug about all threads needing to start at the root

* ablate active ranking: now nodes are only reopened if they are truly inactive (active worker count is 0)

* fix some bugs about leases

* ablate adding static effort only

* fix some bugs about leases

* don't explode effort for portfolio nodes

* fix: still accumulate per-node effort, but don't over-accumulate on portfolio solves

* restore dynamically scaled effort

* lease cancellation doens't touch rlimit now, it just sets max conflicts to 0. also fix a VERY BAD BUG about effort never being updated until all leases are done on a node, which meant we never left the root

* cross-thread modification of max conflicts is unsafe, so create an atomic lease canceled variable that's ch
ecked in ctx where max conflicts is also checked

* move atomic lease check in the context to the more global get_cancel_flag function

* Fix new SIGSEV. issue: The root cause: get_cancel_flag() is called from within propagation loops (mid-BCP, mid-equality-propagation, mid-atom-propagation). When it returns true there, the solver exits early and leaves the context in an intermediate state —
  propagation queues partially processed, theory state potentially inconsistent with boolean state.

  For the global cancel (m.limit().cancel()), this is harmless: the worker exits entirely and the context is destroyed. Intermediate state doesn't matter.

  For a lease cancel, the context is reused — the worker gets a new cube and calls ctx->check() again on the same context object. Re-entering check() on a context interrupted mid-propagation causes it to access that corrupted intermediate
  state → SIGSEGV.

  The m_max_conflicts check is the only checkpoint that's safe for re-entry: it only fires post-conflict-resolution, pre-decision, when propagation queues are empty and theory state is consistent.

  Fix: Remove m_lease_canceled from get_cancel_flag(). Keep it only at safe, between-phase checkpoints where the context is in a known-consistent state. The result is two safe checkpoints for m_lease_canceled: after each conflict (post-resolution, queues empty) and before each theory final check (not yet entered the theory). Neither interrupts the solver mid-mutation. The SIGSEGV should be
   gone, and NIA performance should improve because long theory final checks (where NIA burns most time) are now preemptable before they start.

* fix new inconsistent theory bug: The problem is returning FC_GIVEUP from inside final_check() after some theories have already run final_check_eh() and pushed propagations into the queue. Those pending propagations reference context state that gets invalidated on the next check() call → SIGSEGV. The fix: check m_lease_canceled before entering final_check() in bounded_search(), never from inside it. That way the context is always in a clean pre-final-check state when we bail out. This is safe: decide() returned false (all variables assigned, no pending propagations), theories haven't been touched yet, context is in a fully consistent state. For NIA, this is still a meaningful win — we avoid entering expensive arithmetic final checks entirely when the lease is already canceled.

* remove second lease cancel check in smt_context, not sure it's safe. only check where we do the max conflicts check

* check epoch match in release_lease_unlocked

* restore exception handling logic to master branch

* restore reslimit cancels since the bug appears to be latent

---------

Co-authored-by: Ilana Shapiro <ilanashapiro@Ilanas-MacBook-Pro.local>
Co-authored-by: Ilana Shapiro <ilanashapiro@Mac.lan1>
---
 src/smt/smt_context.h    |   1 +
 src/smt/smt_parallel.cpp | 175 +++++++++++++++++++++++++++++----------
 src/smt/smt_parallel.h   |  32 +++++--
 src/util/search_tree.h   | 154 +++++++++++++++++++---------------
 4 files changed, 242 insertions(+), 120 deletions(-)

diff --git a/src/smt/smt_context.h b/src/smt/smt_context.h
index 80ab6803d..12fbf0284 100644
--- a/src/smt/smt_context.h
+++ b/src/smt/smt_context.h
@@ -18,6 +18,7 @@ Revision History:
 --*/
 #pragma once
 
+#include <atomic>
 #include "ast/quantifier_stat.h"
 #include "ast/simplifiers/dependent_expr_state.h"
 #include "smt/smt_clause.h"
diff --git a/src/smt/smt_parallel.cpp b/src/smt/smt_parallel.cpp
index d74b71df2..97c9a7105 100644
--- a/src/smt/smt_parallel.cpp
+++ b/src/smt/smt_parallel.cpp
@@ -64,6 +64,12 @@ namespace smt {
 
 namespace smt {
 
+    static bool is_cancellation_exception(char const* msg) {
+        return msg &&
+            (strstr(msg, "canceled") != nullptr ||
+             strstr(msg, "cancelled") != nullptr);
+    }
+
     void parallel::sls_worker::run() {
         ptr_vector<expr> assertions;
         p.ctx.get_assertions(assertions);
@@ -111,25 +117,32 @@ namespace smt {
     }
 
     void parallel::worker::run() {
-        search_tree::node<cube_config> *node = nullptr;
+        bool is_first_run = true;
+        node_lease lease;
         expr_ref_vector cube(m);
         while (true) {
 
-            if (!b.get_cube(m_g2l, id, cube, node)) {
+            if (!b.get_cube(m_g2l, id, cube, is_first_run, lease)) {
                 LOG_WORKER(1, " no more cubes\n");
                 return;
             }
+            is_first_run = false;
             collect_shared_clauses();
 
         check_cube_start:
             LOG_WORKER(1, " CUBE SIZE IN MAIN LOOP: " << cube.size() << "\n");
             lbool r = check_cube(cube);
 
-            if (!m.inc()) {
-                b.set_exception("context cancelled");
-                return;
+            if (b.lease_canceled(lease)) {
+                LOG_WORKER(1, " abandoning canceled lease\n");
+                lease = {};
+                m.limit().dec_cancel();
+                continue;
             }
 
+             if (!m.inc())
+                return;
+
             switch (r) {
             case l_undef: {
                 update_max_thread_conflicts();
@@ -140,7 +153,8 @@ namespace smt {
                 auto atom = get_split_atom();
                 if (!atom)
                     goto check_cube_start;
-                b.try_split(m_l2g, id, node, atom, m_config.m_threads_max_conflicts);
+                b.try_split(m_l2g, id, lease, atom, m_config.m_threads_max_conflicts);
+                lease = {};
                 simplify();
                 break;
             }
@@ -164,7 +178,8 @@ namespace smt {
                 }
 
                 LOG_WORKER(1, " found unsat cube\n");
-                b.backtrack(m_l2g, unsat_core, node);
+                b.backtrack(m_l2g, id, unsat_core, lease);
+                lease = {};
 
                 if (m_config.m_share_conflicts)
                     b.collect_clause(m_l2g, id, mk_not(mk_and(unsat_core)));
@@ -314,18 +329,57 @@ namespace smt {
         m.limit().cancel();
     }
 
-    void parallel::batch_manager::backtrack(ast_translation &l2g, expr_ref_vector const &core,
-                                            search_tree::node<cube_config> *node) {
+    void parallel::worker::cancel_lease() {
+        LOG_WORKER(1, " canceling lease\n");
+        m.limit().inc_cancel();
+    }
+
+    void parallel::batch_manager::release_lease_unlocked(unsigned worker_id, node* n, unsigned epoch) {
+        if (worker_id >= m_worker_leases.size())
+            return;
+        auto &lease = m_worker_leases[worker_id];
+        if (!lease.node || lease.node != n || lease.epoch != epoch)
+            return;
+        m_search_tree.dec_active_workers(lease.node);
+        lease = {};
+    }
+
+    void parallel::batch_manager::cancel_closed_leases_unlocked(unsigned source_worker_id) {
+        unsigned n = std::min(m_worker_leases.size(), p.m_workers.size());
+        for (unsigned worker_id = 0; worker_id < n; ++worker_id) {
+            if (worker_id == source_worker_id)
+                continue;
+            auto const& lease = m_worker_leases[worker_id];
+            
+            // only cancel workers that currently hold a lease, and whose lease is canceled
+            if (lease.node && m_search_tree.is_lease_canceled(lease.node, lease.cancel_epoch))
+                p.m_workers[worker_id]->cancel_lease();
+        }
+    }
+
+    void parallel::batch_manager::backtrack(ast_translation &l2g, unsigned worker_id, expr_ref_vector const &core,
+                                            node_lease const &lease) {
         std::scoped_lock lock(mux);
-        IF_VERBOSE(1, verbose_stream() << "Batch manager backtracking.\n");
+
         if (m_state != state::is_running)
             return;
+        
+        // we close/backtrack regardless of whether this lease is stale or not, as long as the lease isn't canceled
+        // i.e. worker 1 splits this node, but then worker 2 determines UNSAT --> worker 2 is stale but we still close this node and backtrack
+        if (m_search_tree.is_lease_canceled(lease.node, lease.cancel_epoch))
+            return;
+
+        IF_VERBOSE(1, verbose_stream() << "Batch manager backtracking.\n");
+        
+        release_lease_unlocked(worker_id, lease.node, lease.epoch);
+        
         vector<cube_config::literal> g_core;
-        for (auto c : core) {
-            expr_ref g_c(l2g(c), m);
+        for (auto c : core)
             g_core.push_back(expr_ref(l2g(c), m));
-        }
-        m_search_tree.backtrack(node, g_core);
+        m_search_tree.backtrack(lease.node, g_core);
+
+        // terminate on-demand the workers that are currently exploring the now-closed nodes
+        cancel_closed_leases_unlocked(worker_id);
 
         IF_VERBOSE(1, m_search_tree.display(verbose_stream() << bounded_pp_exprs(core) << "\n"););
         if (m_search_tree.is_closed()) {
@@ -338,25 +392,40 @@ namespace smt {
         }
     }
 
-    void parallel::batch_manager::try_split(ast_translation &l2g, unsigned source_worker_id,
-                                        search_tree::node<cube_config> *node, expr *atom, unsigned effort) {
+    void parallel::batch_manager::try_split(ast_translation &l2g, unsigned worker_id,
+                                        node_lease const &lease, expr *atom, unsigned effort) {
         std::scoped_lock lock(mux);
-        expr_ref lit(m), nlit(m);
-        lit = l2g(atom);
-        nlit = mk_not(m, lit);
         
         if (m_state != state::is_running)
             return;
 
-        bool did_split = m_search_tree.try_split(node, lit, nlit, effort);
+        if (m_search_tree.is_lease_canceled(lease.node, lease.cancel_epoch))
+            return;
+
+        expr_ref lit(m), nlit(m);
+        lit = l2g(atom);
+        nlit = mk_not(m, lit);
+        bool did_split = m_search_tree.try_split(lease.node, lease.epoch, lease.cancel_epoch, lit, nlit, effort);
+
+        release_lease_unlocked(worker_id, lease.node, lease.epoch);
 
         if (did_split) {
             ++m_stats.m_num_cubes;
-            m_stats.m_max_cube_depth = std::max(m_stats.m_max_cube_depth, node->depth() + 1);
+            m_stats.m_max_cube_depth = std::max(m_stats.m_max_cube_depth, lease.node->depth() + 1);
             IF_VERBOSE(1, verbose_stream() << "Batch manager splitting on literal: " << mk_bounded_pp(lit, m, 3) << "\n");
         }
     }
 
+    void parallel::batch_manager::release_lease(unsigned worker_id, node_lease const &lease) {
+        std::scoped_lock lock(mux);
+        release_lease_unlocked(worker_id, lease.node, lease.epoch);
+    }
+
+    bool parallel::batch_manager::lease_canceled(node_lease const &lease) {
+        std::scoped_lock lock(mux);
+        return m_state == state::is_running && m_search_tree.is_lease_canceled(lease.node, lease.cancel_epoch);
+    }
+
     void parallel::batch_manager::collect_clause(ast_translation &l2g, unsigned source_worker_id, expr *clause) {
         std::scoped_lock lock(mux);
         expr *g_clause = l2g(clause);
@@ -400,9 +469,11 @@ namespace smt {
         try {
             r = ctx->check(asms.size(), asms.data());
         } catch (z3_error &err) {
-            b.set_exception(err.error_code());
+            if (!is_cancellation_exception(err.what()))
+                b.set_exception(err.error_code());
         } catch (z3_exception &ex) {
-            b.set_exception(ex.what());
+            if (!is_cancellation_exception(ex.what()))
+                b.set_exception(ex.what());
         } catch (...) {
             b.set_exception("unknown exception");
         }
@@ -501,9 +572,10 @@ namespace smt {
         }
     }
 
-    bool parallel::batch_manager::get_cube(ast_translation &g2l, unsigned id, expr_ref_vector &cube, node *&n) {
+    bool parallel::batch_manager::get_cube(ast_translation &g2l, unsigned id, expr_ref_vector &cube, bool is_first_run, node_lease &lease) {
+        std::scoped_lock lock(mux);
         cube.reset();
-        std::unique_lock<std::mutex> lock(mux);
+        
         if (m_search_tree.is_closed()) {
             IF_VERBOSE(1, verbose_stream() << "all done\n";);
             return false;
@@ -512,11 +584,21 @@ namespace smt {
             IF_VERBOSE(1, verbose_stream() << "aborting get_cube\n";);
             return false;
         }
-        node *t = m_search_tree.activate_node(n);
+        
+        node *t = is_first_run ? m_search_tree.activate_root() : m_search_tree.activate_best_node();
+        
         if (!t)
             return false;
+        
         IF_VERBOSE(1, m_search_tree.display(verbose_stream()); verbose_stream() << "\n";);
-        n = t;
+        
+        lease.node = t;
+        lease.epoch = t->epoch();
+        lease.cancel_epoch = t->get_cancel_epoch();
+        if (id >= m_worker_leases.size())
+            m_worker_leases.resize(id + 1);
+        m_worker_leases[id] = lease;
+        
         while (t) {
             if (cube_config::literal_is_null(t->get_literal()))
                 break;
@@ -525,6 +607,7 @@ namespace smt {
             cube.push_back(std::move(lit));
             t = t->parent();
         }
+        
         return true;
     }
 
@@ -532,6 +615,8 @@ namespace smt {
         m_state = state::is_running;
         m_search_tree.reset();
         m_search_tree.set_effort_unit(initial_max_thread_conflicts);
+        m_worker_leases.reset();
+        m_worker_leases.resize(p.m_workers.size());
     }
 
     void parallel::batch_manager::collect_statistics(::statistics &st) const {
@@ -540,7 +625,12 @@ namespace smt {
     }
 
     lbool parallel::operator()(expr_ref_vector const &asms) {
-        IF_VERBOSE(1, verbose_stream() << "Parallel SMT with " << num_threads << " threads\n";);
+        smt_parallel_params pp(ctx.m_params);
+        unsigned num_workers = std::min((unsigned)std::thread::hardware_concurrency(), ctx.get_fparams().m_threads);
+        unsigned num_sls_threads = (pp.sls() ? 1 : 0);    
+        unsigned total_threads = num_workers + num_sls_threads;
+
+        IF_VERBOSE(1, verbose_stream() << "Parallel SMT with " << total_threads << " threads\n";);
         ast_manager &m = ctx.m;
 
         if (m.has_trace_stream())
@@ -556,33 +646,32 @@ namespace smt {
         };
         scoped_clear clear(*this);
 
-        m_batch_manager.initialize();
         m_workers.reset();
-
-        smt_parallel_params pp(ctx.m_params);
-        m_should_run_sls = pp.sls();
         
         scoped_limits sl(m.limit());
         flet<unsigned> _nt(ctx.m_fparams.m_threads, 1);
-        SASSERT(num_threads > 1);
-        for (unsigned i = 0; i < num_threads; ++i)
+        SASSERT(num_workers > 1);
+        for (unsigned i = 0; i < num_workers; ++i)
             m_workers.push_back(alloc(worker, i, *this, asms));
         for (auto w : m_workers)
             sl.push_child(&(w->limit()));
-        if (m_should_run_sls) {
+
+        if (num_sls_threads == 1) {
             m_sls_worker = alloc(sls_worker, *this);
             sl.push_child(&(m_sls_worker->limit()));
         }
 
+        IF_VERBOSE(1, verbose_stream() << "Launched " << m_workers.size() << " CDCL threads, "
+                                       << (m_sls_worker ? 1 : 0) << " SLS threads\n";);
+
+        m_batch_manager.initialize();
+        
         // Launch threads
         vector<std::thread> threads;
-        threads.resize(m_should_run_sls ? num_threads + 1 : num_threads); // +1 for sls worker
-        for (unsigned i = 0; i < num_threads; ++i)
-            threads[i] = std::thread([&, i]() { m_workers[i]->run(); });
-        
-        // the final thread runs the sls worker
-        if (m_should_run_sls)
-            threads[num_threads] = std::thread([&]() { m_sls_worker->run(); });
+        threads.resize(total_threads);
+        unsigned thread_idx = 0;
+        for (auto* w : m_workers)
+            threads[thread_idx++] = std::thread([&, w]() { w->run(); });
 
         // Wait for all threads to finish
         for (auto &th : threads)
@@ -591,7 +680,7 @@ namespace smt {
         for (auto w : m_workers)
             w->collect_statistics(ctx.m_aux_stats);
         m_batch_manager.collect_statistics(ctx.m_aux_stats);
-        if (m_should_run_sls)
+        if (m_sls_worker)
             m_sls_worker->collect_statistics(ctx.m_aux_stats);
 
         return m_batch_manager.get_result();
diff --git a/src/smt/smt_parallel.h b/src/smt/smt_parallel.h
index bddb10b63..32517eb31 100644
--- a/src/smt/smt_parallel.h
+++ b/src/smt/smt_parallel.h
@@ -35,14 +35,26 @@ namespace smt {
 
     class parallel {
         context& ctx;
-        unsigned num_threads;
-        bool m_should_run_sls = false;
 
         struct shared_clause {
             unsigned source_worker_id;
             expr_ref clause;
         };
 
+        struct node_lease {
+            search_tree::node<cube_config>* node = nullptr;
+            // Version counter for structural mutations of this node (e.g., split/close).
+            // Used to detect stale leases: if a worker's lease.epoch != node.epoch,
+            // the node has changed since it was acquired and must not be mutated.
+            unsigned epoch = 0; 
+
+            // Cancellation generation counter for this node/subtree.
+            // Incremented when the node is closed; used to signal that all
+            // workers holding leases on this node (or its descendants)
+            // must abandon work immediately.
+            unsigned cancel_epoch = 0; 
+        };
+
         class batch_manager {        
 
             enum state {
@@ -66,6 +78,7 @@ namespace smt {
             stats m_stats;
             using node = search_tree::node<cube_config>;
             search_tree::tree<cube_config> m_search_tree;
+            vector<node_lease> m_worker_leases;
             
             unsigned m_exception_code = 0;
             std::string m_exception_msg;
@@ -91,7 +104,8 @@ namespace smt {
                 cancel_sls_worker();  
             }
 
-            void init_parameters_state();
+            void release_lease_unlocked(unsigned worker_id, node* n, unsigned epoch);
+            void cancel_closed_leases_unlocked(unsigned source_worker_id);
 
         public:
             batch_manager(ast_manager& m, parallel& p) : m(m), p(p), m_search_tree(expr_ref(m)) { }
@@ -104,9 +118,11 @@ namespace smt {
             void set_exception(unsigned error_code);
             void collect_statistics(::statistics& st) const;
 
-            bool get_cube(ast_translation& g2l, unsigned id, expr_ref_vector& cube, node*& n);
-            void backtrack(ast_translation& l2g, expr_ref_vector const& core, node* n);
-            void try_split(ast_translation& l2g, unsigned id, node* n, expr* atom, unsigned effort);
+            bool get_cube(ast_translation& g2l, unsigned id, expr_ref_vector& cube, bool is_first_run, node_lease& lease);
+            void backtrack(ast_translation& l2g, unsigned worker_id, expr_ref_vector const& core, node_lease const& lease);
+            void try_split(ast_translation& l2g, unsigned worker_id, node_lease const& lease, expr* atom, unsigned effort);
+            void release_lease(unsigned worker_id, node_lease const& lease);
+            bool lease_canceled(node_lease const& lease);
 
             void collect_clause(ast_translation& l2g, unsigned source_worker_id, expr* clause);
             expr_ref_vector return_shared_clauses(ast_translation& g2l, unsigned& worker_limit, unsigned worker_id);
@@ -164,6 +180,7 @@ namespace smt {
             void collect_shared_clauses();
 
             void cancel();
+            void cancel_lease();
             void collect_statistics(::statistics& st) const;
 
             reslimit& limit() {
@@ -198,9 +215,6 @@ namespace smt {
     public:
         parallel(context& ctx) : 
             ctx(ctx),
-            num_threads(std::min(
-                (unsigned)std::thread::hardware_concurrency(),
-                ctx.get_fparams().m_threads)),
             m_batch_manager(ctx.m, *this) {}
 
         lbool operator()(expr_ref_vector const& asms);
diff --git a/src/util/search_tree.h b/src/util/search_tree.h
index bac6ce0d0..ebfc18e63 100644
--- a/src/util/search_tree.h
+++ b/src/util/search_tree.h
@@ -48,6 +48,10 @@ namespace search_tree {
         vector<literal> m_core;
         unsigned m_num_activations = 0;
         unsigned m_effort_spent = 0;
+        unsigned m_round_max_effort = 0;
+        unsigned m_active_workers = 0;
+        unsigned m_epoch = 0;
+        unsigned m_cancel_epoch = 0;
 
     public:
         node(literal const &l, node *parent) : m_literal(l), m_parent(parent), m_status(status::open) {}
@@ -65,9 +69,6 @@ namespace search_tree {
         literal const &get_literal() const {
             return m_literal;
         }
-        bool literal_is_null() const {
-            return Config::is_null(m_literal);
-        }
         void split(literal const &a, literal const &b) {
             SASSERT(!Config::literal_is_null(a));
             SASSERT(!Config::literal_is_null(b));
@@ -77,6 +78,7 @@ namespace search_tree {
             SASSERT(!m_right);
             m_left = alloc(node<Config>, a, this);
             m_right = alloc(node<Config>, b, this);
+            inc_epoch();
         }
 
         node* left() const { return m_left; }
@@ -94,22 +96,6 @@ namespace search_tree {
             return d;
         }
 
-        node *find_active_node() {
-            if (m_status == status::active)
-                return this;
-            if (m_status == status::closed)
-                return nullptr;
-            node *nodes[2] = {m_left, m_right};
-            for (unsigned i = 0; i < 2; ++i) {
-                auto res = nodes[i] ? nodes[i]->find_active_node() : nullptr;
-                if (res)
-                    return res;
-            }
-            if (m_left->get_status() == status::closed && m_right->get_status() == status::closed)
-                m_status = status::closed;
-            return nullptr;
-        }
-
         void display(std::ostream &out, unsigned indent) const {
             for (unsigned i = 0; i < indent; ++i)
                 out << " ";
@@ -137,12 +123,40 @@ namespace search_tree {
         void mark_new_activation() {
             set_status(status::active);
             ++m_num_activations;
+            ++m_active_workers;
+        }
+        void dec_active_workers() {
+            if (m_active_workers > 0)
+                --m_active_workers;
+            if (m_active_workers == 0 && m_status == status::active) {
+                m_round_max_effort = 0;
+                m_status = status::open;
+            }
+        }
+        bool has_active_workers() const {
+            return m_active_workers > 0;
         }
         unsigned effort_spent() const {
             return m_effort_spent;
         }
-        void add_effort(unsigned effort) {
-            m_effort_spent += effort;
+        void update_round_max_effort(unsigned effort) {
+            if (effort <= m_round_max_effort)
+                return;
+            m_effort_spent -= m_round_max_effort;
+            m_round_max_effort = effort;
+            m_effort_spent += m_round_max_effort;
+        }
+        unsigned epoch() const {
+            return m_epoch;
+        }
+        void inc_epoch() {
+            ++m_epoch;
+        }
+        unsigned get_cancel_epoch() const {
+            return m_cancel_epoch;
+        }
+        void inc_cancel_epoch() {
+            ++m_cancel_epoch;
         }
     };
 
@@ -161,13 +175,13 @@ namespace search_tree {
 
         struct candidate {
             node<Config>* n = nullptr;
-            unsigned effort_band = UINT_MAX;
+            unsigned scaled_effort = UINT_MAX;
             unsigned depth = 0;
         };
 
         // A measure of how much effort has been spent on the node, used for activation prioritization and expansion decisions
         // The effort unit is the workers' initial conflict budget, and effort spent grows by a factor defined in smt_parallel.h on each split attempt
-        unsigned effort_band(node<Config> const* n) const {
+        unsigned scaled_effort(node<Config> const* n) const {
             return n->effort_spent() / std::max<unsigned>(1, m_effort_unit);
         }
 
@@ -177,8 +191,8 @@ namespace search_tree {
                 return false;
             if (!b.n)
                 return true;
-            if (a.effort_band != b.effort_band)
-                return a.effort_band < b.effort_band;
+            if (a.scaled_effort != b.scaled_effort)
+                return a.scaled_effort < b.scaled_effort;
             if (a.depth != b.depth)
                 return a.depth > b.depth;
             return false;
@@ -191,7 +205,7 @@ namespace search_tree {
             if (cur->get_status() == target_status) {
                 candidate cand;
                 cand.n = cur;
-                cand.effort_band = effort_band(cur);
+                cand.scaled_effort = scaled_effort(cur);
                 cand.depth = cur->depth();
 
                 if (better(cand, best))
@@ -202,20 +216,6 @@ namespace search_tree {
             select_next_node(cur->right(), target_status, best);
         }
 
-        node<Config>* activate_best_node() {
-            candidate best;
-            select_next_node(m_root.get(), status::open, best);
-            if (!best.n) {
-                IF_VERBOSE(1, verbose_stream() << "NO OPEN NODES, trying active nodes for portfolio solving\n";);
-                select_next_node(m_root.get(), status::active, best); // If no open nodes, only then consider active nodes for selection
-            }
-
-            if (!best.n)
-                return nullptr;
-            best.n->mark_new_activation();
-            return best.n;
-        }
-
         bool has_unvisited_open_node(node<Config>* cur) const {
             if (!cur || cur->get_status() == status::closed)
                 return false;
@@ -251,8 +251,8 @@ namespace search_tree {
             find_shallowest_timed_out_leaf_depth(cur->right(), best_depth);
         }
 
-        bool should_split(node<Config>* n) {
-            if (!n || n->get_status() != status::active || !n->is_leaf())
+        bool should_split(node<Config>* n, unsigned epoch) {
+            if (!is_lease_valid(n, epoch) || !n->is_leaf())
                 return false;
 
             unsigned num_active_nodes = count_active_nodes(m_root.get());
@@ -344,6 +344,8 @@ namespace search_tree {
         void close(node<Config> *n, vector<literal> const &C) {
             if (!n || n->get_status() == status::closed)
                 return;
+            n->inc_epoch();
+            n->inc_cancel_epoch();
             n->set_status(status::closed);
             n->set_core(C);
             close(n->left(), C);
@@ -441,40 +443,32 @@ namespace search_tree {
 
         void reset() {
             m_root = alloc(node<Config>, m_null_literal, nullptr);
-            m_root->mark_new_activation();
         }
 
         // On timeout, either expand the current leaf or reopen the node for a
         // later revisit, depending on the tree-expansion heuristic.
-        bool try_split(node<Config> *n, literal const &a, literal const &b, unsigned effort) {
-            if (!n || n->get_status() != status::active)
+        bool try_split(node<Config> *n, unsigned epoch, unsigned cancel_epoch, literal const &a, literal const &b, unsigned effort) {
+            if (is_lease_canceled(n, cancel_epoch))
                 return false;
 
-            n->add_effort(effort);
+            // Record at most one effort contribution per concurrent round on this node.
+            // Stale workers still contribute, but only via the round-local maximum.
+            n->update_round_max_effort(effort);
             bool did_split = false;
 
-            if (should_split(n)) {
+            if (should_split(n, epoch)) {
                 n->split(a, b);
                 did_split = true;
             }
 
-            // Reopen immediately on timeout, even if other workers are still active.
-            // This keeps scheduling asynchronous: timeouts act as signals to reconsider
-            // the search, not barriers requiring all workers to finish.
-            //
-            // Early reopening also creates a soft penalty (via accumulated effort),
-            // reducing over-concentration while still allowing revisits.
-            //
-            // Waiting for all workers would introduce per-node synchronization, delay
-            // diversification, and let a slow worker stall progress.
-            n->set_status(status::open);
-            
             return did_split;
         }
 
         // conflict is given by a set of literals.
         // they are subsets of the literals on the path from root to n AND the external assumption literals
         void backtrack(node<Config> *n, vector<literal> const &conflict) {
+            if (!n)
+                return;
             if (conflict.empty()) {
                 close_with_core(m_root.get(), conflict);
                 return;
@@ -507,17 +501,41 @@ namespace search_tree {
             UNREACHABLE();
         }
 
-        // return an active node in the tree, or nullptr if there is none
-        // first check if there is a node to activate under n,
-        // if not, go up the tree and try to activate a sibling subtree
-        node<Config> *activate_node(node<Config> *n) {
-            if (!n) {
-                if (m_root->get_status() == status::active) {
-                    m_root->mark_new_activation();
-                    return m_root.get();
-                }
+        // Try to select an open node using the select_next_node policy
+        // If there are no open nodes, try to select an active node for portfolio solving
+        node<Config>* activate_best_node() {
+            candidate best;
+            select_next_node(m_root.get(), status::open, best);
+            if (!best.n) {
+                IF_VERBOSE(1, verbose_stream() << "NO OPEN NODES, trying active nodes for portfolio solving\n";);
+                select_next_node(m_root.get(), status::active, best); // If no open nodes, only then consider active nodes for selection
             }
-            return activate_best_node();
+
+            if (!best.n)
+                return nullptr;
+            best.n->mark_new_activation();
+            return best.n;
+        }
+
+        node<Config>* activate_root() {
+            if (m_root->get_status() == status::closed)
+                return nullptr;
+            m_root->mark_new_activation();
+            return m_root.get();
+        }
+
+        void dec_active_workers(node<Config>* n) {
+            if (!n)
+                return;
+            n->dec_active_workers();
+        }
+
+        bool is_lease_valid(node<Config>* n, unsigned epoch) const {
+            return n && n->get_status() == status::active && n->epoch() == epoch;
+        }
+        
+        bool is_lease_canceled(node<Config>* n, unsigned cancel_epoch) const {
+            return !n || n->get_status() == status::closed || n->get_cancel_epoch() != cancel_epoch;
         }
 
         vector<literal> const &get_core_from_root() const {