mirror of
https://github.com/Z3Prover/z3
synced 2026-06-02 15:17:54 +00:00
Terminate on Demand and some algorithmic bugfixes in the search tree (#9336)
* first attempt with codex. Codex notes:
What changed:
- Each tree node now tracks:
- active worker count
- lease epoch
- cancel epoch
- get_cube() now hands each worker an explicit lease: (node, epoch, cancel_epoch).
- try_split() and backtrack() now operate on that lease, and the batch manager releases the worker’s lease under the tree lock before mutating the
node.
- If another worker closes the leased node or subtree, the batch manager cancels only the workers whose current leased nodes are now closed.
- Workers detect canceled leases after check(), reset their local cancel flag, abandon the stale lease, and continue instead of turning that into a
global exception.
- The “reopen immediately into the open queue” policy is preserved. I did not add a barrier waiting for all workers on a node to finish.
- Active-worker accounting is now separate from the open/active/closed scheduling status, so reopening a node no longer erases the fact that other
workers are still on it.
I also updated search_tree bookkeeping so:
- closure bumps node cancel/lease epochs
- active-node counting uses actual active-worker presence, not just status == active
* fix(parallel-smt): gate split/backtrack by lease epoch
What it changes:
- util/search_tree.h
- bumps node epoch on split
- threads epoch through should_split(...) and try_split(...)
- always records effort, but only split/reopen if the lease epoch still matches
- smt/smt_parallel.cpp
- requires is_lease_valid(..., lease.epoch) before backtrack(...)
- passes lease.epoch into m_search_tree.try_split(...)
* clean up code and add some comments
* fix bug about backtracking condition being too strict: The epoch guard should not block backtrack(...) the same way it blocks try_split(...). A stale worker that proves UNSAT for n should still be able to
close n, and that closure should then cancel the other workers on n and its subtree.
I changed smt/smt_parallel.cpp accordingly:
- try_split(...) still uses epoch to reject stale structural splits
- backtrack(...) no longer requires is_lease_valid(..., epoch); it only requires that the lease is not already canceled
So the intended asymmetry is now restored:
- stale split: reject
- stale unsat/backtrack: allow closure, then cancel affected workers
* ablate to no backtracking on stale leases
* revert codex change about exception handling
* remove old code
* ablate backtracking gate
* attempt to fix linux crashes
* try to fix bug about active worker counts/lease accounting. current policy should hold: - stale leases: release/decrement
- canceled leases: do not release/decrement (just ignore since we have an invariant that canceled leases mean closed nodes that are never revisited
* delay premature root activation
* fix major semantic bug about threads continually choosing the root if their lease is reset
* fix cancellation to unknown status
* fix very bad bug about all threads needing to start at the root
* ablate active ranking: now nodes are only reopened if they are truly inactive (active worker count is 0)
* fix some bugs about leases
* ablate adding static effort only
* fix some bugs about leases
* don't explode effort for portfolio nodes
* fix: still accumulate per-node effort, but don't over-accumulate on portfolio solves
* restore dynamically scaled effort
* lease cancellation doens't touch rlimit now, it just sets max conflicts to 0. also fix a VERY BAD BUG about effort never being updated until all leases are done on a node, which meant we never left the root
* cross-thread modification of max conflicts is unsafe, so create an atomic lease canceled variable that's ch
ecked in ctx where max conflicts is also checked
* move atomic lease check in the context to the more global get_cancel_flag function
* Fix new SIGSEV. issue: The root cause: get_cancel_flag() is called from within propagation loops (mid-BCP, mid-equality-propagation, mid-atom-propagation). When it returns true there, the solver exits early and leaves the context in an intermediate state —
propagation queues partially processed, theory state potentially inconsistent with boolean state.
For the global cancel (m.limit().cancel()), this is harmless: the worker exits entirely and the context is destroyed. Intermediate state doesn't matter.
For a lease cancel, the context is reused — the worker gets a new cube and calls ctx->check() again on the same context object. Re-entering check() on a context interrupted mid-propagation causes it to access that corrupted intermediate
state → SIGSEGV.
The m_max_conflicts check is the only checkpoint that's safe for re-entry: it only fires post-conflict-resolution, pre-decision, when propagation queues are empty and theory state is consistent.
Fix: Remove m_lease_canceled from get_cancel_flag(). Keep it only at safe, between-phase checkpoints where the context is in a known-consistent state. The result is two safe checkpoints for m_lease_canceled: after each conflict (post-resolution, queues empty) and before each theory final check (not yet entered the theory). Neither interrupts the solver mid-mutation. The SIGSEGV should be
gone, and NIA performance should improve because long theory final checks (where NIA burns most time) are now preemptable before they start.
* fix new inconsistent theory bug: The problem is returning FC_GIVEUP from inside final_check() after some theories have already run final_check_eh() and pushed propagations into the queue. Those pending propagations reference context state that gets invalidated on the next check() call → SIGSEGV. The fix: check m_lease_canceled before entering final_check() in bounded_search(), never from inside it. That way the context is always in a clean pre-final-check state when we bail out. This is safe: decide() returned false (all variables assigned, no pending propagations), theories haven't been touched yet, context is in a fully consistent state. For NIA, this is still a meaningful win — we avoid entering expensive arithmetic final checks entirely when the lease is already canceled.
* remove second lease cancel check in smt_context, not sure it's safe. only check where we do the max conflicts check
* check epoch match in release_lease_unlocked
* restore exception handling logic to master branch
* restore reslimit cancels since the bug appears to be latent
---------
Co-authored-by: Ilana Shapiro <ilanashapiro@Ilanas-MacBook-Pro.local>
Co-authored-by: Ilana Shapiro <ilanashapiro@Mac.lan1>
This commit is contained in:
parent
0669cdd829
commit
44966e1733
4 changed files with 242 additions and 120 deletions
|
|
@ -35,14 +35,26 @@ namespace smt {
|
|||
|
||||
class parallel {
|
||||
context& ctx;
|
||||
unsigned num_threads;
|
||||
bool m_should_run_sls = false;
|
||||
|
||||
struct shared_clause {
|
||||
unsigned source_worker_id;
|
||||
expr_ref clause;
|
||||
};
|
||||
|
||||
struct node_lease {
|
||||
search_tree::node<cube_config>* node = nullptr;
|
||||
// Version counter for structural mutations of this node (e.g., split/close).
|
||||
// Used to detect stale leases: if a worker's lease.epoch != node.epoch,
|
||||
// the node has changed since it was acquired and must not be mutated.
|
||||
unsigned epoch = 0;
|
||||
|
||||
// Cancellation generation counter for this node/subtree.
|
||||
// Incremented when the node is closed; used to signal that all
|
||||
// workers holding leases on this node (or its descendants)
|
||||
// must abandon work immediately.
|
||||
unsigned cancel_epoch = 0;
|
||||
};
|
||||
|
||||
class batch_manager {
|
||||
|
||||
enum state {
|
||||
|
|
@ -66,6 +78,7 @@ namespace smt {
|
|||
stats m_stats;
|
||||
using node = search_tree::node<cube_config>;
|
||||
search_tree::tree<cube_config> m_search_tree;
|
||||
vector<node_lease> m_worker_leases;
|
||||
|
||||
unsigned m_exception_code = 0;
|
||||
std::string m_exception_msg;
|
||||
|
|
@ -91,7 +104,8 @@ namespace smt {
|
|||
cancel_sls_worker();
|
||||
}
|
||||
|
||||
void init_parameters_state();
|
||||
void release_lease_unlocked(unsigned worker_id, node* n, unsigned epoch);
|
||||
void cancel_closed_leases_unlocked(unsigned source_worker_id);
|
||||
|
||||
public:
|
||||
batch_manager(ast_manager& m, parallel& p) : m(m), p(p), m_search_tree(expr_ref(m)) { }
|
||||
|
|
@ -104,9 +118,11 @@ namespace smt {
|
|||
void set_exception(unsigned error_code);
|
||||
void collect_statistics(::statistics& st) const;
|
||||
|
||||
bool get_cube(ast_translation& g2l, unsigned id, expr_ref_vector& cube, node*& n);
|
||||
void backtrack(ast_translation& l2g, expr_ref_vector const& core, node* n);
|
||||
void try_split(ast_translation& l2g, unsigned id, node* n, expr* atom, unsigned effort);
|
||||
bool get_cube(ast_translation& g2l, unsigned id, expr_ref_vector& cube, bool is_first_run, node_lease& lease);
|
||||
void backtrack(ast_translation& l2g, unsigned worker_id, expr_ref_vector const& core, node_lease const& lease);
|
||||
void try_split(ast_translation& l2g, unsigned worker_id, node_lease const& lease, expr* atom, unsigned effort);
|
||||
void release_lease(unsigned worker_id, node_lease const& lease);
|
||||
bool lease_canceled(node_lease const& lease);
|
||||
|
||||
void collect_clause(ast_translation& l2g, unsigned source_worker_id, expr* clause);
|
||||
expr_ref_vector return_shared_clauses(ast_translation& g2l, unsigned& worker_limit, unsigned worker_id);
|
||||
|
|
@ -164,6 +180,7 @@ namespace smt {
|
|||
void collect_shared_clauses();
|
||||
|
||||
void cancel();
|
||||
void cancel_lease();
|
||||
void collect_statistics(::statistics& st) const;
|
||||
|
||||
reslimit& limit() {
|
||||
|
|
@ -198,9 +215,6 @@ namespace smt {
|
|||
public:
|
||||
parallel(context& ctx) :
|
||||
ctx(ctx),
|
||||
num_threads(std::min(
|
||||
(unsigned)std::thread::hardware_concurrency(),
|
||||
ctx.get_fparams().m_threads)),
|
||||
m_batch_manager(ctx.m, *this) {}
|
||||
|
||||
lbool operator()(expr_ref_vector const& asms);
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue