At some point we will want to implement a stranded IF() (and other variants). Opening now to document the thoughts I have had so far.
There are a few key variants which are worth thinking about. IF/CASE, NWaySelect, and also nwayinput. They each introduce slightly different complications.
One consideration is that all the activities need to work with a childquery/loop so the output steams should only need connecting once.
- It needs to create a consistent number of streams for the two inputs, and have n output streams which redirect to the appropriate input depending on the condition.
- Simplest is to call getStreams() on true input. Pass the number of streams returned by true input to the getStreams() request on false input. If they number of streams do not match, force them both to a single output stream.
- For IF(), and especially CASE() the first input may not be the one that naturally strands - so it may be better to calculate in 2 passes.
- Some activities that wouldn't naturally be stranded (e.g., FAIL) should support stranding to prevent the true branch of the IF() from being forced to a single strand. Possibly ALL source activities should be stranded - with the default being all the work is done on on strand 0??
- The current model of creating a paired M:1 junction doesn't quite work. If you have two inputs, each creating a junction, which one will be used in the output stream? The underlying implementation would actually work (it relies on a callback notification), so this code would need refactoring.
- Splitters could be got to work by receiving the callback notification, inserting a pseudo row into the buffered stream and processing accordingly. It is likely to be complicated though!
- This should be implemented very similarly to IF(). The complication is that which inputs are available as well as which of those inputs to select could theoretically change each child query execution. Which leads to the next activity...
- Currently this is implemented with numConcreteOutputs() and queryConcreteOutput() calls. This was quickly put together for the benefit of GRAPH and NWAY-JOIN, and I don't think it is really correct.
Once stranding is introduced the problem is that changing the possible datasets each time means that you either need to force any input to an nwayinput to a single stream, or you need to reconnect streams each start() - which we have decided isn't allowed.
I think the correct fix is for the number of outputs generated by a nway activity to never change. Instead of calling queryConcreteOutput() the caller would call queryOutput(), and there would be a form of link that indicated all outputs from an activity fed into the next activity. There would be a new isActive() function on the input which could be called by the caller to indicate whether the input was active for this iteration. It would remove the need for the queryOutputStream()/queryOutputJunction() code that Jake has added recently, and allow nway activities to be processed in a more consistent way.
This would place a minor restriction - it would not support an arbitrary set of datasets sets implementation since that isn't a selection from a predefined maximum set.
NB: Need to be careful that use of ROWSETS inside graph is still as efficient.