  1. HPCC
  2. HPCC-13668

Introduce a non-partitioning spray implementation (e.g. round-robin)

    Details

    • Type: New Feature
    • Status: Scheduled
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 7.0.0
    • Component/s: DFU Server
    • Labels:

      Description

      Some formats are very expensive to partition; CSV with possible quoted terminators (the default) is one of them.
      The spray implementation must walk the source file from start to finish, to establish record boundaries and therefore split points.

      This can mean the partitioning time equals or exceeds the actual data transfer time.

      The option 'quotedTerminator=0' has been available since 5.0 (see HPCC-10961). It allows partitioning points to be discovered quickly, with the caveat that if the CSV file does contain quoted terminators, the chosen split points may well fall in the middle of a record.
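
      To illustrate the trade-off described above, here is a minimal Python sketch. It is not the DFU Server code; the function names and the sample data are hypothetical. The first function tracks quote state over every byte, which is why the whole file has to be walked. The second treats every terminator as a record end, which is roughly the assumption that makes quick split-point discovery possible and also what breaks when a terminator is quoted.

```python
def record_boundaries(data: bytes, quote: int = ord('"'),
                      terminator: int = ord("\n")) -> list[int]:
    """Return the offset just past each record terminator, honouring quotes.

    Every byte must be examined, because whether a terminator ends a record
    depends on the quote state accumulated since the start of the data.
    """
    boundaries = []
    in_quotes = False
    for offset, byte in enumerate(data):
        if byte == quote:
            # An escaped quote ("") toggles twice, so the state is unchanged.
            in_quotes = not in_quotes
        elif byte == terminator and not in_quotes:
            boundaries.append(offset + 1)
    return boundaries


def naive_boundaries(data: bytes, terminator: int = ord("\n")) -> list[int]:
    """Treat every terminator as a record end (assumes none are quoted)."""
    return [offset + 1 for offset, byte in enumerate(data) if byte == terminator]


# Hypothetical two-record CSV; the first record has a quoted newline.
sample = b'1,"line one\nline two"\n2,plain\n'

print(record_boundaries(sample))  # [22, 30]
print(naive_boundaries(sample))   # [12, 22, 30]
```

      The extra offset 12 in the naive result falls inside the quoted field of the first record, so a split point chosen there would cut that record in two.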

      A spray implementation that forgoes partitioning and streams the source file to the target nodes in a round-robin fashion would be sensible.

      This may also be useful when spraying other formats where partitioning is either expensive or impractical. Zip files may be a good candidate.

      The implementation would:
      + Read the source file up to the next record boundary, or until a reasonably sized send buffer has been filled with complete records.
      + Send the collated records to a target node in the destination cluster, round-robin or similar.
      + Repeat until the source file is exhausted.
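
      The steps above can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not a proposed implementation: `batches`, `spray_round_robin`, `buffer_size` and `num_targets` are hypothetical names, records are assumed to arrive already split, and Python lists stand in for the network connections to the target nodes.

```python
from typing import Iterable, Iterator


def batches(records: Iterable[bytes], buffer_size: int) -> Iterator[bytes]:
    """Collate complete records into buffers of at most buffer_size bytes.

    A single record larger than buffer_size is emitted on its own, so a
    record is never split across two buffers.
    """
    buf = bytearray()
    for rec in records:
        if buf and len(buf) + len(rec) > buffer_size:
            yield bytes(buf)
            buf.clear()
        buf += rec
    if buf:
        yield bytes(buf)


def spray_round_robin(records: Iterable[bytes], num_targets: int,
                      buffer_size: int) -> list[list[bytes]]:
    """Hand each filled buffer to the next target in turn.

    The returned lists stand in for what each target node would receive.
    """
    targets: list[list[bytes]] = [[] for _ in range(num_targets)]
    for i, batch in enumerate(batches(records, buffer_size)):
        targets[i % num_targets].append(batch)
    return targets


# Hypothetical input: five 4-byte records, two targets, 8-byte buffers.
records = [b"a,1\n", b"b,2\n", b"c,3\n", b"d,4\n", b"e,5\n"]
parts = spray_round_robin(records, num_targets=2, buffer_size=8)
print(parts)
# [[b'a,1\nb,2\n', b'e,5\n'], [b'c,3\nd,4\n']]
```

      Note that with this scheme consecutive source records end up on different parts, so reading the parts in sequence does not reproduce the original record order.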

      Gavin Halliday, Attila Vamos

              People

              • Assignee:
                attilavamos Attila Vamos
                Reporter:
                jakesmith Jake Smith