- Type: Bug
- Status: Accepted
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: None
- Fix Version/s: None
- Component/s: Core Libraries
- Labels: None
- Compatibility: Minor
I am training a neural network using back-propagation, and I was having severe performance problems on Thor. I have isolated the problem to the behavior of LOOP, though I haven't been able to create a small test program that reproduces it. Here's the background:
- The BP method used does not lend itself to parallelization, so I put all the data on the first node.
- The algorithm consists of three levels of loop, with two loops at the innermost level (a minimal sketch of this nesting appears after this list):
  - IterLoop – the training iterations (n = 100)
  - DataLoop – loop through the datapoints, adjusting the weights after each one (n = 100)
  - FFLoop – loop through the layers (n = 2)
  - DeltaLoop – loop through the layers (n = 2)
- Running on Thor takes hundreds of times longer than running on hthor.
- Having determined that LOOP was an issue, I recoded all of the LOOPs to use LOCAL ITERATE instead (a sketch of that form of the DataLoop also appears after this list).
- Now running on Thor is only twice as slow as hthor – I can live with that.
- Next I tried to pin it down further, so I started to put the LOOPs back in.
- I was able to change all of the ITERATEs back to LOOP without affecting performance substantially, except for the DataLoop (the middle loop).
- Whenever I change the DataLoop back to using LOOP, performance drops by orders of magnitude.
- The performance degradation is accompanied by thousands of warnings displayed in ECLWatch.
- Those warnings include Deadman Timer Expiries, which may be related.
- When that loop is done via ITERATE, I get no warnings.
- So, attached are two ZAP reports:
  - One with the DataLoop using ITERATE – NNIter (Runtime 2:34)
  - One with the DataLoop using LOOP – NNLoop (Runtime 2:37:54 – roughly 60x longer)
- All of the other loops are implemented with LOOP.
- Keep in mind that all of the data is on Node 1, so only that node has any work to do during the three levels of loop.
- If you would like to look at the code in question, the differences are localized to the attribute IterStep (around line 360 in NeuralNetworks.ecl).
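To make the structure concrete, here is a minimal sketch of the loop nesting described above. This is not the attached NeuralNetworks.ecl code; the record layout, the placeholder weight update, and the names (WeightRec, LayerPass, DataStep) are illustrative assumptions only.

```ecl
// Minimal sketch of the three-level loop nesting (not the real NN code).
WeightRec := RECORD
    UNSIGNED4 layer;
    REAL8     w;
END;

initWeights := DATASET([{1, 0.5}, {2, -0.3}], WeightRec);

// Innermost level: stands in for the FFLoop/DeltaLoop passes over the layers (n = 2).
LayerPass(DATASET(WeightRec) ws) :=
    LOOP(ws, 2,
         PROJECT(ROWS(LEFT),
                 TRANSFORM(WeightRec, SELF.w := LEFT.w * 0.99; SELF := LEFT)));

// DataLoop: one weight adjustment per datapoint (n = 100) – the problem loop.
DataStep(DATASET(WeightRec) ws) := LOOP(ws, 100, LayerPass(ROWS(LEFT)));

// IterLoop: the training iterations (n = 100).
trained := LOOP(initWeights, 100, DataStep(ROWS(LEFT)));

OUTPUT(trained);
```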
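For comparison, here is a sketch of the kind of ITERATE rewrite referred to above: the datapoint-level LOOP is replaced with a single LOCAL ITERATE that carries the running weight from record to record on the one working node. Again, PointRec, the field names, and the toy update rule are assumptions for illustration, not the actual code from NeuralNetworks.ecl.

```ecl
// Sketch of the DataLoop recoded as a LOCAL ITERATE (placeholder names/logic).
PointRec := RECORD
    UNSIGNED4 id;
    REAL8     x;   // the datapoint value
    REAL8     w;   // running weight carried from record to record
END;

rawPoints := DATASET([{1, 0.2, 0.0}, {2, 0.7, 0.0}, {3, 0.4, 0.0}], PointRec);

// Keep everything on a single node, as described above.
points := DISTRIBUTE(rawPoints, 0);

// LOOP form (the slow case): one loop iteration per datapoint.
// adjusted := LOOP(points, COUNT(points), DoOnePoint(ROWS(LEFT), COUNTER));

// ITERATE form (the fast case): a single local pass in which each record's
// weight is derived from the previous record's weight.
adjusted := ITERATE(points,
                    TRANSFORM(PointRec,
                              SELF.w := LEFT.w + 0.1 * RIGHT.x;  // toy update, not real back-propagation
                              SELF := RIGHT),
                    LOCAL);

OUTPUT(adjusted);
```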
If the problem is inherent to the distributed LOOP functionality, then perhaps a LOCAL variant of LOOP could be made available.
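Purely to illustrate the suggestion (LOCAL is not an existing LOOP option; this is only the shape of the request, using the placeholder names from the sketches above):

```ecl
// Hypothetical syntax only – not valid ECL today.
adjusted := LOOP(points, COUNT(points), DoOnePoint(ROWS(LEFT), COUNTER), LOCAL);
```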