The process runs fine for up to 68 iterations. On the 69th iteration (depending on the data), the job either fails to complete or runs out of memory. This occurs when non-separable data points in the training set cause the process to run to its maximum depth.
The allocated node ids are reorganized every 32 iterations to avoid overflow, but under certain conditions the node id can wrap before 32 iterations have elapsed. This causes a mismatch in ids and confounds the JOIN, producing potentially billions of output records.
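The failure mode can be illustrated with a small sketch (Python here for brevity; the allocator name and shape are hypothetical, not the actual ECL code). When ids wrap modulo 2**32, as an UNSIGNED4 field would, freshly allocated ids can collide with ids still in use, and a JOIN keyed on nodeId then matches unrelated records:

```python
MASK32 = 0xFFFFFFFF  # UNSIGNED4 arithmetic wraps at 2**32

def alloc_ids(start, count):
    # Hypothetical id allocator: ids wrap modulo 2**32,
    # mimicking overflow of an UNSIGNED4 field.
    return [(start + i) & MASK32 for i in range(count)]

old_ids = alloc_ids(0, 4)                  # [0, 1, 2, 3]
new_ids = alloc_ids(2**32 - 2, 4)          # wraps: [4294967294, 4294967295, 0, 1]

# Wrapped ids collide with ids already allocated.
collisions = set(old_ids) & set(new_ids)   # {0, 1}
```

Each collision makes the JOIN pair records from unrelated nodes, which is how the output can balloon to billions of records.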
The fix is two-fold:
- Increase the size of the nodeId field from UNSIGNED4 to UNSIGNED8
- Create a positive test for overflow rather than depending on a fixed count of iterations.
The nodeId field should still be constrained to <= 2**48, as this is the limit of what can be held in a Layout_Model2 field.
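The fixed scheme could look roughly like the following sketch (Python, with a hypothetical `next_node_id` helper; the real implementation is in ECL). The field is widened to 64 bits, and instead of renumbering on a fixed iteration count, the allocator tests positively for overflow against the 2**48 Layout_Model2 limit:

```python
MAX_NODE_ID = 2**48  # limit imposed by the Layout_Model2 field

def next_node_id(current, step=1):
    # Hypothetical allocator for the widened UNSIGNED8 field:
    # detect overflow directly rather than assuming it cannot
    # occur within 32 iterations.
    nxt = current + step
    if nxt > MAX_NODE_ID:
        # Signal that ids must be reorganized before continuing.
        raise OverflowError("nodeId would exceed 2**48")
    return nxt

# Normal allocation succeeds; crossing the limit is caught explicitly.
ok = next_node_id(100)          # 101
try:
    next_node_id(MAX_NODE_ID)   # would exceed the limit
    overflowed = False
except OverflowError:
    overflowed = True
```

The positive check means id reorganization is triggered exactly when needed, regardless of how quickly a deep, non-separable tree consumes ids.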