Currently, during a remote file read, dafilesrv returns 100 records at a time to the client. This is a significant bottleneck for read performance: because of this issue, the Spark-HPCC connector currently reads somewhere between 20x and 100x slower (depending on record size) than the Spark-Thor POC connector.
A simple solution would be to allow the client to set the maximum number of rows it would like to receive at one time. However, I think a better solution would be for the client to request a certain amount of data, i.e. "give me as many rows as you can fit into 4MB".
The reasoning here is that, in most circumstances, the client will not know the exact row size. So if it tried to calculate the number of rows to request based on its I/O limits, it would be wrong for anything other than fixed-size records. Dafilesrv, however, does have the information required.
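To illustrate the idea, here is a minimal sketch of the server-side packing loop such a protocol implies. This is not dafilesrv code; the function name, parameters, and the use of std::string/std::vector as stand-ins for the real row serialization are all hypothetical. The point is that the server, which knows each row's serialized size, fills the reply up to the client's byte budget and tells the client where to resume:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: pack whole serialized rows into a reply buffer
// until the client-specified byte budget would be exceeded. Row sizes
// vary, so the server (which knows each row's size) decides the count.
// Returns the index of the next unsent row, which the client echoes
// back in its follow-up request.
std::size_t packRows(const std::vector<std::string> &rows,
                     std::size_t startIndex,
                     std::size_t byteBudget,
                     std::string &reply)
{
    std::size_t i = startIndex;
    while (i < rows.size())
    {
        const std::string &row = rows[i];
        // Always send at least one row, even if it alone exceeds the
        // budget, so a single oversized record cannot stall the stream.
        if (!reply.empty() && reply.size() + row.size() > byteBudget)
            break;
        reply += row;
        ++i;
    }
    return i;
}
```

With this shape of interface, a client that knows only its memory or I/O limits (e.g. 4MB per reply) never has to guess a row count, and fixed-size and variable-size records are handled by the same code path.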