Being able to join two datasets efficiently on one condition, while having the result grouped by a different condition, would make it possible to solve some relationship-matching problems efficiently.
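As a rough illustration of the idea only (plain Python, not ECL and not the code generator's actual implementation), the sketch below joins two lists of records on one condition and returns the results grouped by a value taken from the left row. The names `grouped_join`, `key`, `group_by`, `transform` and `skip_self` are hypothetical and exist only for this example.

```python
from collections import defaultdict

def grouped_join(left, right, key, group_by, transform, skip_self=False):
    """Join left and right where key(l) == key(r); group output by group_by(l)."""
    # Index the right-hand side by join key, similar in spirit to a lookup table.
    index = defaultdict(list)
    for r in right:
        index[key(r)].append(r)

    groups = defaultdict(list)
    for l in left:
        for r in index.get(key(l), ()):
            # Optionally exclude matches of a row with itself (self-join case).
            if skip_self and l is r:
                continue
            groups[group_by(l)].append(transform(l, r))
    return dict(groups)

# Self join: match people on address, group the matches by the left person's id.
people = [
    {"id": 1, "addr": "A"},
    {"id": 2, "addr": "A"},
    {"id": 3, "addr": "B"},
]
result = grouped_join(
    people, people,
    key=lambda row: row["addr"],
    group_by=lambda row: row["id"],
    transform=lambda l, r: r["id"],
    skip_self=True,
)
print(result)  # {1: [2], 2: [1]}
```

Person 3 shares an address with nobody else, so once self-matches are excluded it produces no group at all.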
As a first approximation, the following ECL:
where leftId is a value assigned from LEFT.Id inside the transform t()
Could be translated to:
There are several cases that are worth taking care of:
- A self join.
  - Here SL should be used for both the left and the right of the JOIN.
  - There should be a LOOKUP SELF JOIN activity (if there isn't already).
  - It may be more efficient to combine the SORT with the JOIN, since that may remove an extra pointer array.
- ATMOST with optional fields.
  - The right-hand side of the join would probably need to be sorted by those optional fields, so that the matches for a particular key are in order and can be narrowed down efficiently. (This means the two sides of the join would be sorted differently.)
- Large keys.
  - Instead of storing the key values in the records, it might be more efficient to store the keys locally in a hash table and save pointers to them in the records (ideally just between the two distributes). This would save memory and allow equality comparisons to be done as pointer compares.
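The large-keys idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the engine's design: `KeyTable` and `intern` are hypothetical names, and Python's identity test (`is`) stands in for a pointer comparison.

```python
class KeyTable:
    """Stores one canonical copy of each distinct key value."""

    def __init__(self):
        self._keys = {}

    def intern(self, key):
        # Return the stored copy if an equal key exists, else store this one.
        return self._keys.setdefault(key, key)

    def __len__(self):
        return len(self._keys)


table = KeyTable()

# Two equal but separately allocated key values.
key_a = tuple(["smith", "12 high street", "london"])
key_b = tuple(["smith", "12 high street", "london"])

# Records hold a reference to the shared key rather than their own copy.
rec_a = {"id": 1, "key": table.intern(key_a)}
rec_b = {"id": 2, "key": table.intern(key_b)}

# Equality of keys reduces to an identity (pointer-like) comparison.
same_key = rec_a["key"] is rec_b["key"]
```

Each distinct key is stored once, so memory use grows with the number of distinct keys rather than the number of records.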
Please add any comments/observations.
|Implement syntax for GROUPED JOIN||Resolved|
|Add a hint to exclude matches of the row with itself.||Resolved|
|Generate a SORTED-output LOOKUP JOIN||Accepted|