  1. HPCC
  2. HPCC-12251 Create cassandra plugin for workunit storage
  3. HPCC-13735

Add support for sorting/filtering WU lists in Cassandra



    • Sub-task
    • Status: Resolved
    • Not specified
    • Resolution: Fixed
    • None
    • 6.0.0
    • Plugins
    • None


      At the same time, we may want to rationalize the existing API.

      The WuQuery ESP call is used in 2 modes: get a filtered list (should be lightweight) and get a single WU's current (summary) state.

      EclWatch expects to be able to specify a single sort order (defaults to wuid desc) and multiple filters, and to page through the results.

      It's a bit of a challenge to make this efficient (especially for fuzzy filters). For an in-memory DB like Dali it is POSSIBLE to do some of this reasonably efficiently (though we don't...) but in Cassandra it's a lot harder, so we need to think about what we actually NEED.

      While we present things as paged, I suspect users never make it to page 3. But if the results are sorted, I have to fetch the other 1000000 pages to make sure you see the right values on pages 1, 2, and 3...

      My view is that MOST of the things we allow you to sort by are useless: as soon as you get a large result set, there's no point sorting it by a low-cardinality field, since you would have to page to page 6743 to find the entries for 'owner=richard'. You should be filtering rather than sorting (but this may need UI changes...)

      Really, all the sorts are topN, where N is determined by how many times the user is prepared to hit next, but the only sorts where topN really makes much sense are 'date desc' and 'total thor time desc'.
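To illustrate the "every sort is a topN" point, here is a minimal Python sketch, not code from the plugin. The function name `sorted_page`, the field `thor_ms`, and the sample rows are all made up for the example. It shows that serving page p of a sorted list still means scanning every row, but only the top (p + 1) * page_size rows ever need to be kept.

```python
import heapq

def sorted_page(rows, key, page_no, page_size):
    """Return page `page_no` (0-based) of `rows` ordered by `key`, descending.

    Hypothetical helper: every row is still examined, but heapq.nlargest
    only retains the top (page_no + 1) * page_size rows while scanning.
    """
    n = (page_no + 1) * page_size
    top = heapq.nlargest(n, rows, key=key)
    # The requested page is the tail of the top-N list.
    return top[page_no * page_size:]

# Illustrative data only; field names are assumptions.
wus = [
    {"wuid": "W20150601-000001", "thor_ms": 500},
    {"wuid": "W20150601-000002", "thor_ms": 9000},
    {"wuid": "W20150601-000003", "thor_ms": 120},
    {"wuid": "W20150601-000004", "thor_ms": 7000},
    {"wuid": "W20150601-000005", "thor_ms": 3000},
]

page_two = sorted_page(wus, key=lambda wu: wu["thor_ms"], page_no=1, page_size=2)
```

Each press of "next" just increases N, which is why the cost is bounded by how far the user is prepared to page rather than by the page size.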

      I think what I will do is implement all the filters (using indexes for the common non-fuzzy ones, post-filtering for the others). But I will implement all sort orders other than wuid as post-sorts, and fail if I hit a threshold (10000, say) while gathering the data to post-sort. I may also fail if I hit a threshold of records rejected while post-filtering (so if you want to do a fuzzy search by jobname you may have to restrict it to a date range first, for example).
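The post-filter and post-sort thresholds could look roughly like the following Python sketch. It is only an illustration of the idea under stated assumptions: `query_wus`, `ThresholdExceeded`, and the parameter names are hypothetical, and `candidates` stands in for rows that the indexed (non-fuzzy) filters have already narrowed down and returned in wuid order.

```python
class ThresholdExceeded(Exception):
    """Raised when a query would need too much post-processing (hypothetical)."""

def query_wus(candidates, fuzzy_filter=None, sort_key=None,
              max_sort_rows=10000, max_rejected=10000):
    """Post-filter and optionally post-sort rows that arrive in wuid order.

    fuzzy_filter: predicate applied after the fetch (post-filtering).
    sort_key:     None keeps the native wuid order; anything else is a post-sort.
    """
    kept = []
    rejected = 0
    for wu in candidates:
        if fuzzy_filter is not None and not fuzzy_filter(wu):
            rejected += 1
            if rejected > max_rejected:
                # Too many rows thrown away: ask the user to narrow the query.
                raise ThresholdExceeded("too many rows rejected by post-filter")
            continue
        kept.append(wu)
        if sort_key is not None and len(kept) > max_sort_rows:
            # A post-sort must gather every match first, so cap the gathering.
            raise ThresholdExceeded("too many rows to post-sort")
    if sort_key is not None:
        kept.sort(key=sort_key)
    return kept

# Illustrative rows only.
rows = [{"wuid": "W%d" % i, "jobname": "job%d" % (i % 3)} for i in range(6)]
matches = query_wus(rows, fuzzy_filter=lambda wu: "job1" in wu["jobname"])
```

With this shape, a fuzzy jobname search over everything fails fast, while the same search restricted to a date range (fewer candidates) succeeds.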

      The biggest concern I have with the above proposal is that it doesn't cover some 'topN' type use cases, e.g. give me the top 10 longest-running WUs.
      I can special-case that to allow it unfiltered (but it would have the same threshold limit if there WAS a filter). I can also support running time as a filter condition, which would make the above limitation more palatable, i.e. show me the longest-running WUs sorted by runtime desc BUT filtered to those that ran for more than an hour...
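As a rough sketch of the "runtime as a filter condition" idea (again hypothetical: the function `longest_running` and the field `runtime_s` are invented for the example, and this is not how the plugin necessarily does it), a minimum-runtime filter shrinks the candidate set before the runtime-descending topN is taken:

```python
import heapq

def longest_running(wus, n=10, min_runtime_s=None):
    """Return up to `n` rows ordered by runtime descending.

    If min_runtime_s is given, only rows that ran strictly longer than
    that are considered, which keeps the set to be sorted small.
    """
    if min_runtime_s is None:
        eligible = wus
    else:
        eligible = (wu for wu in wus if wu["runtime_s"] > min_runtime_s)
    return heapq.nlargest(n, eligible, key=lambda wu: wu["runtime_s"])

# Illustrative rows only.
sample = [
    {"wuid": "W1", "runtime_s": 120},
    {"wuid": "W2", "runtime_s": 7200},
    {"wuid": "W3", "runtime_s": 3600},
    {"wuid": "W4", "runtime_s": 5400},
]

over_an_hour = longest_running(sample, n=10, min_runtime_s=3600)
```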




            richardkchapman Richard Chapman