Details
Bug
Status: Resolved
Resolution: Fixed
Description
The Thor manager watchdog runs at the start of each graph and waits for watchdog/progress packets from the workers.
If an exception is thrown while processing one of those packets, the watchdog stops.
Workers continue to send progress packets, and the MP messaging system keeps all of them pending, waiting to be read.
Over time this causes a massive build-up of pending messages, which wastes memory and, I think, also causes a huge slowdown in MP communication between manager and workers (seen primarily as very slow sorts).
I believe this is being seen now because a serialization/deserialization issue related to the sub file stats has been introduced in recent builds.
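The failure mode described above can be illustrated with a small sketch. This is not Thor's code, and it is not necessarily how the issue was fixed: the queue, the `deserialize` helper, and both watchdog functions are hypothetical stand-ins. It only shows why a receive loop that exits on the first bad packet leaves a growing backlog, while one that handles the exception per packet keeps draining the queue.

```python
from collections import deque


def deserialize(packet):
    # Hypothetical stand-in for progress-packet deserialization.
    if packet == "corrupt":
        raise ValueError("cannot deserialize progress packet")
    return packet


def fragile_watchdog(pending):
    # The whole loop sits inside one try block, so the first exception
    # ends it and everything still queued stays pending.
    handled = 0
    try:
        while pending:
            deserialize(pending.popleft())
            handled += 1
    except ValueError:
        pass  # loop has exited; nothing reads the queue any more
    return handled


def tolerant_watchdog(pending):
    # The exception is handled per packet, so the loop keeps draining.
    handled = 0
    dropped = 0
    while pending:
        try:
            deserialize(pending.popleft())
            handled += 1
        except ValueError:
            dropped += 1
    return handled, dropped


def make_queue():
    # One bad packet followed by packets that workers kept sending.
    return deque(["progress", "corrupt", "progress", "progress"])
```

With the fragile loop, the packets queued behind the bad one are never read. In a live system workers keep adding to that backlog for the rest of the graph.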