Some formats are very expensive to partition; CSV with possible quoted terminators (the default) is one of them.
The spray implementation must walk the source file from start to finish to establish record boundaries, and therefore split points.
This can mean the partitioning time equals or exceeds the actual data transfer time.
The option 'quotedTerminator=0' has been available since 5.0 (see HPCC-10961). It allows partitioning points to be discovered quickly, with the caveat that if the CSV file does contain quoted terminators, split points may fall inside a record and break record boundaries.
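To illustrate why the default mode needs a full pass, here is a minimal Python sketch of boundary detection. It is not the spray code: the function name `record_ends` and its `quoted_terminators` flag are hypothetical, and it assumes a newline terminator, a double-quote quoting character and no escape character. The flag only mimics the effect of ignoring quotes; it does not claim to show how 'quotedTerminator=0' is implemented.

```python
def record_ends(data: bytes, quoted_terminators: bool = True) -> list[int]:
    """Return the offset just past each record terminator in `data`.

    Hypothetical helper. Assumes '\n' terminates records and '"' toggles
    the quoted state (a doubled quote toggles twice, so it nets out).
    """
    ends = []
    in_quotes = False  # depends on every byte seen so far, hence the full walk
    for i, byte in enumerate(data):
        if quoted_terminators and byte == ord('"'):
            in_quotes = not in_quotes
        elif byte == ord("\n") and not in_quotes:
            ends.append(i + 1)
    return ends


# The first record contains a newline inside a quoted field.
sample = b'1,"line one\nline two"\n2,plain\n'

quote_aware = record_ends(sample, quoted_terminators=True)
quotes_ignored = record_ends(sample, quoted_terminators=False)
print(quote_aware)     # [22, 30]
print(quotes_ignored)  # [12, 22, 30]
```

With quotes honoured, the quoted state at any offset is only known after reading everything before it, so a split point cannot be chosen by jumping into the middle of the file. With quotes ignored, offset 12 is reported as a boundary even though it falls inside the first record.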
A spray implementation that forgoes partitioning and streams the source file to the target nodes in a round-robin fashion would be sensible.
This may also be useful when spraying other formats where partitioning is either expensive or impractical. Zip files may be a good candidate.
The implementation would:
+ read the source file, up to the next record boundary or until it has filled a reasonably sized send buffer with complete records.
+ send collated records to a target node in the destination cluster - round robin or similar.
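The two steps above could be sketched roughly as follows, in Python for brevity. Everything named here is an assumption made for illustration: `read_records`, `spray_round_robin` and `buffer_size` are hypothetical, the targets are plain file-like objects standing in for connections to the target nodes, and the record reader assumes newline terminators with double-quote quoting.

```python
import io
from typing import BinaryIO, Iterator, Sequence


def read_records(source: BinaryIO, chunk_size: int = 65536) -> Iterator[bytes]:
    """Yield complete records, honouring newlines inside double quotes."""
    record = bytearray()
    in_quotes = False
    while chunk := source.read(chunk_size):
        for byte in chunk:
            record.append(byte)
            if byte == ord('"'):
                in_quotes = not in_quotes
            elif byte == ord("\n") and not in_quotes:
                yield bytes(record)
                record.clear()
    if record:
        yield bytes(record)  # last record had no trailing terminator


def spray_round_robin(source: BinaryIO, targets: Sequence[BinaryIO],
                      buffer_size: int = 1 << 20) -> None:
    """Stream whole records to the targets in turn, one buffer per target."""
    buf = bytearray()
    node = 0
    for record in read_records(source):
        # Flush before the buffer would overflow; never split a record.
        if buf and len(buf) + len(record) > buffer_size:
            targets[node].write(bytes(buf))
            buf.clear()
            node = (node + 1) % len(targets)
        buf += record
    if buf:
        targets[node].write(bytes(buf))


# Tiny demonstration with in-memory "nodes" and a deliberately small buffer.
source = io.BytesIO(b'1,"a\nb"\n2,c\n3,d\n4,e\n')
targets = [io.BytesIO(), io.BytesIO()]
spray_round_robin(source, targets, buffer_size=10)
for n, target in enumerate(targets):
    print(n, target.getvalue())
```

The source is read exactly once and no split points are computed up front. The trade-off in this sketch is that the original record order is interleaved across the target parts, and a record larger than `buffer_size` is sent in a buffer of its own.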