For cloud systems, the default storage for files and indexes is likely to be blob/S3 storage, which has reasonably high throughput but poor latency.
When Thor performs keyed joins, performance is likely to be very poor if the indexes are stored on blob storage. A couple of options spring to mind:
1. Write all indexes to a faster storage plane
HPCC-28502 may help in this case. The disadvantage is that if files are kept for a long time and not actively used, the storage costs will be much higher.
2. Write indexes as normal, and then create copies on another storage plane for the active indexes.
The issue with option (2) is that the file will exist on multiple storage planes - how will Thor know which one to prefer? It is possible to include the plane name in the logical filename, e.g. a::b::c@myplane, but that is not a great solution if the filenames are part of a superfile.
Two possibilities suggest themselves (both might be useful):
i) Each Thor (or Roxie) instance could have an ordered list of preferred planes. If a file is on more than one plane, the list would be used to select which one is preferred.
ii) Add ,CLUSTER('x' [,OPT]) syntax to indexes and files to indicate the preferred source.
On reflection, (ii) doesn't seem like a very good solution - it isn't the correct logical place for that information.
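To make possibility (i) concrete, here is a minimal Python sketch of how a preferred plane list might be applied. It is only an illustration: the function name `choose_plane`, the plane names, and the fallback to the first plane the file is on are all assumptions, not existing Thor or Roxie behaviour.

```python
from typing import Optional, Sequence


def choose_plane(file_planes: Sequence[str],
                 preferred: Sequence[str]) -> Optional[str]:
    """Pick the plane to read a file from (hypothetical helper).

    file_planes: planes the file currently exists on.
    preferred:   the instance's ordered preference list, best first.
    """
    available = set(file_planes)
    # Walk the preference list in order; the first match wins.
    for plane in preferred:
        if plane in available:
            return plane
    # Assumption: if no preferred plane holds the file, fall back to
    # the first plane it is on (or None if it is on no plane at all).
    return file_planes[0] if file_planes else None
```

For example, with a hypothetical preference list of `["fastssd", "blob"]`, a file present on both planes would be read from `fastssd`, while a file present only on `blob` would still resolve to `blob`.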
So tonymkirk, are we going to face this problem (a file on multiple planes), and would the preferred plane list be a good solution? (It could apply equally well to bare-metal, where the decision is currently (poorly) made based on the IP distance.)
Unrelated: it also strikes me that it might be useful to have a dfu copy command that copies the contents of a superfile, but only copies those files that do not already exist. Does this already exist? If not, would it be useful?
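As a rough illustration of the behaviour I have in mind for that copy, here is a small Python sketch. It does not use any real dfu API: `copy_missing_subfiles`, `exists_on_target` and `copy_file` are hypothetical names, and the two callbacks stand in for whatever lookup and copy operations would actually be used.

```python
from typing import Callable, Iterable, List


def copy_missing_subfiles(subfiles: Iterable[str],
                          exists_on_target: Callable[[str], bool],
                          copy_file: Callable[[str], None]) -> List[str]:
    """Copy only the files of a superfile that the target lacks.

    Returns the names that were actually copied (hypothetical helper).
    """
    copied: List[str] = []
    for name in subfiles:
        # Skip anything the target already has.
        if exists_on_target(name):
            continue
        copy_file(name)
        copied.append(name)
    return copied
```

The point of the sketch is just that the existence check happens per file, so repeating the copy of a large superfile would only transfer the files added since the last run.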