The core issue here is that we need to ensure `num_active_worker_threads_`
is read before incrementing `done_workers`. See the comments
added in this PR to explain why, and why the resulting code is
race-free.
Large circuits can run hundreds or thousands of ABCs in a single AbcPass.
For some circuits, some of those ABC runs can run for hundreds of seconds.
Running ABCs in parallel with each other and in parallel with main-thread
processing (reading and writing BLIF files, copying ABC BLIF output into
the design) can give large speedups.