[My apologies for the mangled formatting of the diffs in [2] and [3] above, at least as shown on the web view of this list. I will happily repost or resend by direct email if anyone expresses interest.]
Presumably there's also a chance for the runner to detect and log the expired lock before it dies.
This may require some cleverness: the delays introduced by recording a log message, e.g. "I'm the nntp runner holding the lock, and it has not expired yet.", are likely to change the outcome of the lock competition, perhaps even allowing all runner processes to continue as intended. It is also unclear to me whether a process would ever get an opportunity to record a log message such as "I'm the nntp runner holding the lock, which has expired, but I am not dead yet." before it dies. If such an opportunity exists, the process should extend the lock's lifetime via e.g. Lock.refresh(timedelta(seconds=10)).
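To make the idea concrete, here is a minimal sketch (not Mailman's actual runner code) of a loop that logs whether its lock is still within its lifetime and refreshes it before another process is allowed to break it. The lock path, the fixed-iteration "work" loop, and the timings are invented for illustration; only Lock, refresh(), is_locked, and NotLockedError are real flufl.lock API.

    # Sketch only -- not Mailman's runner code; path and timings are invented.
    import logging
    import time
    from datetime import datetime, timedelta
    from flufl.lock import Lock, NotLockedError

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger('mailman.runner')

    LIFETIME = timedelta(seconds=20)
    lock = Lock('/tmp/nntp-runner.lck', lifetime=LIFETIME)
    lock.lock()
    deadline = datetime.now() + LIFETIME

    for _ in range(5):          # stand-in for the runner's real work loop
        time.sleep(3)           # stand-in for one unit of work
        if datetime.now() < deadline:
            log.info('nntp runner holds the lock; it has not expired yet')
        else:
            log.warning('nntp runner lock has expired, but I am not dead yet')
        try:
            # Extend the lock before another process is allowed to break it.
            lock.refresh(timedelta(seconds=10))
            deadline = datetime.now() + timedelta(seconds=10)
        except NotLockedError:
            log.error('lock was already broken by another process; exiting')
            break

    if lock.is_locked:
        lock.unlock()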
As I attempted to summarize at the end of [3] above, each time I tried shorter sleep intervals (1, 10, 15 seconds) within .../flufl/lock/_lockfile.py, the original problem remained: one or more locks were broken and runner processes died.
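For anyone who wants to reproduce the competition without patching _lockfile.py, here is a rough standalone sketch using only flufl.lock's public API. The path, process count, and timings are made up, and this is not the Mailman master/runner startup code; it merely shows several processes racing for one short-lived lock so that a slow winner can have its lock broken out from under it.

    # Rough reproduction sketch, not Mailman code.
    import multiprocessing
    import time
    from datetime import timedelta
    from flufl.lock import Lock, TimeOutError

    def runner(name):
        lock = Lock('/tmp/competition.lck', lifetime=timedelta(seconds=5))
        try:
            lock.lock(timeout=timedelta(seconds=20))
        except TimeOutError:
            print(f'{name}: gave up waiting for the lock')
            return
        print(f'{name}: acquired the lock')
        time.sleep(8)   # longer than the lifetime, so the lock may be broken
        print(f'{name}: finished work, is_locked={lock.is_locked}')
        if lock.is_locked:
            lock.unlock()

    if __name__ == '__main__':
        procs = [multiprocessing.Process(target=runner, args=(f'runner-{i}',))
                 for i in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()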
On 6/19/23 03:00, Stephen J. Turnbull wrote:
The best "performance" I have yet obtained on this single core system is [by] [3] [which waits 20s before trying to obtain the lock].
Is 20s necessary? There are a lot of runners, you may be able to cut the startup time by 2-3 minutes if you can cut that to 5s or less. That would still be 100x longer than the naive #cores * seconds estimate, but perhaps significant.