HPCC-9951

JOIN(a,b,LEFT.key=RIGHT.key,GROUP(LEFT.x))

    Details

    • Type: Bug
    • Status: Accepted
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Code Generator, Thor
    • Labels:
      None

      Description

      Being able to efficiently join two datasets on one condition, while having the result grouped by a different condition, would make it possible to solve some relationship-matching problems efficiently.

      As a first approximation, the following ECL:

      Result := JOIN(L, R, LEFT.key = RIGHT.key, t(LEFT,RIGHT), GROUP(leftId))
      

      where leftId is a value assigned from LEFT.Id inside the transform t(),

      could be translated to:

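      DL := DISTRIBUTE(L, HASH(key));                      // co-locate matching keys
      DR := DISTRIBUTE(R, HASH(key));
      SL := SORT(DL, id, LOCAL);                           // order the left side by id on each node
      J := JOIN(SL, DR, LEFT.key = RIGHT.key, t(LEFT,RIGHT), LOOKUP MANY, LOCAL);
      DJ := DISTRIBUTE(J, HASH(leftId), MERGE(leftId));    // redistribute, preserving the local sort order
      Result := GROUP(DJ, leftId, LOCAL);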
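
      The data flow of this translation can be modelled outside ECL. The following is a minimal Python sketch, not the Thor implementation: the node count, the CRC-based hash, the dictionary row layout and all function names are assumptions made for illustration. It relies on the lookup join streaming the left side in order, so each node's join output is already sorted by leftId and the final redistribution only needs to merge.

```python
import heapq
import zlib
from collections import defaultdict
from itertools import groupby

NODES = 3  # hypothetical cluster width


def node_of(value, nodes=NODES):
    """Deterministic stand-in for HASH(value) modulo the number of nodes."""
    return zlib.crc32(repr(value).encode()) % nodes


def distribute(rows, key_fn, nodes=NODES):
    """Model DISTRIBUTE(ds, HASH(k)): route each row to one node."""
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[node_of(key_fn(row), nodes)].append(row)
    return parts


def local_lookup_join(left_part, right_part, transform):
    """Model a LOCAL lookup join: hash the right side, stream the left in order."""
    table = defaultdict(list)
    for r in right_part:
        table[r["key"]].append(r)
    return [transform(l, r) for l in left_part for r in table.get(l["key"], [])]


def distribute_merge(parts, key_fn, nodes=NODES):
    """Model DISTRIBUTE(ds, HASH(k), MERGE(k)): every source node sends a
    stream that is already sorted by k, and each receiver merges its streams."""
    streams = [[[] for _ in parts] for _ in range(nodes)]
    for src, part in enumerate(parts):
        for row in part:
            streams[node_of(key_fn(row), nodes)][src].append(row)
    return [list(heapq.merge(*incoming, key=key_fn)) for incoming in streams]


def t(l, r):
    """Hypothetical transform: leftId is taken from the left row's id."""
    return {"leftId": l["id"], "key": l["key"], "val": r["val"]}


def join_grouped(left, right):
    dl = distribute(left, lambda row: row["key"])
    dr = distribute(right, lambda row: row["key"])
    sl = [sorted(part, key=lambda row: row["id"]) for part in dl]
    j = [local_lookup_join(lp, rp, t) for lp, rp in zip(sl, dr)]
    dj = distribute_merge(j, lambda row: row["leftId"])
    groups = {}
    for part in dj:  # GROUP(DJ, leftId, LOCAL): rows are already contiguous
        for left_id, rows in groupby(part, key=lambda row: row["leftId"]):
            groups[left_id] = list(rows)
    return groups
```

      Calling join_grouped(left, right) with lists of dictionaries carrying "id"/"key" and "key"/"val" fields returns a mapping from each matched leftId to its joined rows. No global sort is performed at any stage.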
      

      There are several cases that are worth taking care of:

      • A self join.
        • Here SL should be used for both the left and the right of the JOIN.
        • There should be a LOOKUP SELF JOIN activity (if there isn't already).
        • It may be more efficient to combine the SORT with the JOIN, since that may remove an extra pointer array.
      • ATMOST with optional fields.
        • The right-hand side of the join would probably need to be sorted by those optional fields, so that the matches for a particular key are in order and can be narrowed down efficiently. (This means the two sides of the join would be sorted differently.)
      • Large keys.
        • Instead of storing the key values in each record, it might be more efficient to store the keys locally in a hash table and save pointers to them in the records (ideally just between the two distributes). This would save memory and allow pointer comparisons to be used for equality tests.
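
      The large-key idea can be illustrated with a small Python sketch. It is a model under stated assumptions, not engine code: KeyPool and intern_rows are hypothetical names, and an integer handle stands in for the pointer. Each distinct key is stored once per node and rows carry only the handle, so comparing two keys becomes comparing two handles.

```python
class KeyPool:
    """Hypothetical node-local table that stores each distinct key value once."""

    def __init__(self):
        self._index = {}   # key value -> handle
        self._values = []  # handle -> key value

    def intern(self, value):
        """Return the handle for value, adding it to the pool on first sight."""
        handle = self._index.get(value)
        if handle is None:
            handle = len(self._values)
            self._index[value] = handle
            self._values.append(value)
        return handle

    def value(self, handle):
        """Recover the original key value from a handle."""
        return self._values[handle]

    def __len__(self):
        return len(self._values)


def intern_rows(rows, pool):
    """Replace each row's (possibly large) key with its handle in the pool."""
    return [{**row, "key": pool.intern(row["key"])} for row in rows]
```

      A handle is only meaningful on the node that owns the pool, which is presumably why the ticket limits the technique to the stretch between the two distributes; the key values would have to be restored with value() before rows leave the node.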

      Jake Smith, Richard Chapman, David Bayliss

      Please add any comments/observations.

            People

            Assignee:
            ghalliday Gavin Halliday
            Reporter:
            ghalliday Gavin Halliday
            Votes:
            0
            Watchers:
            4

              Dates

              Created:
              Updated: