Hash dedup/aggregate currently has a fixed maximum row limit, based on a pre-execution estimate of how many rows can be loaded into memory.
We want to lift this restriction so that the activity dynamically copes with inputs larger than memory.
There was a previous discussion about coping with similar situations, which I have briefly revisited with @ghalliday .
I think the approach needs to be:
+ Load input rows and create the hash key; if the key already exists in the HT, dispose of the key/row, else insert it and output the row.
+ Repeat, expanding the HT as needed (expansion will necessitate rehashing the HT, so it should be done at a reasonably large granularity).
+ On a spill request:
a) Continue to load input rows and create keys; filter a row out if it matches the HT, otherwise spill it to disk into one of N buckets.
b) Continue until end-of-input.
c) Clear the HT.
[NB: If a critical spill request occurs, the HT can be reduced in size or cleared completely.]
+ Repeat the process for each bucket. Ideally this will not cause further spilling; if it does, each affected bucket will in turn be split into a further N buckets and the process repeats.
After the 1st spill, an alternative is to load and sort each bucket (which may itself necessitate spilling) and merge the resulting sorted streams, deduping as they merge.
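A minimal sketch of that alternative, again using in-memory string vectors in place of on-disk buckets (the function name and shape are assumptions, not an API proposal): sort each bucket, then k-way merge the sorted streams via a min-heap, emitting each distinct row once.

```cpp
#include <algorithm>
#include <queue>
#include <string>
#include <tuple>
#include <vector>

// Illustrative sketch: sort each spilled bucket, then merge the sorted
// streams while dropping duplicates (a merge-dedup).
std::vector<std::string> mergeDedup(std::vector<std::vector<std::string>> buckets)
{
    // Sort each bucket; in a real implementation this load+sort step
    // may itself spill.
    for (auto &b : buckets)
        std::sort(b.begin(), b.end());

    // Min-heap of (row, bucketIdx, posInBucket) seeded with each
    // bucket's first row.
    using Entry = std::tuple<std::string, size_t, size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    for (size_t i = 0; i < buckets.size(); i++)
        if (!buckets[i].empty())
            pq.emplace(buckets[i][0], i, 0);

    std::vector<std::string> out;
    while (!pq.empty())
    {
        auto [row, b, pos] = pq.top();
        pq.pop();
        if (out.empty() || out.back() != row)   // equal rows emerge
            out.push_back(row);                 // adjacently => dedup here
        if (pos + 1 < buckets[b].size())
            pq.emplace(buckets[b][pos + 1], b, pos + 1);
    }
    return out;
}
```

The trade-off versus re-splitting is that sorting bounds the recursion (each bucket is processed exactly once) at the cost of losing the original input order.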
The process will be similar for hash aggregate, but this discussion focuses on dedup for simplicity.
@ghalliday, @richardkchapman - thoughts?