HPCC / HPCC-23163

Keyed Join 'Failed to open' error when reading index files from a larger, overlapping cluster

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 7.0.0
    • Fix Version/s: 7.6.16, 7.4.40
    • Component/s: Thor
    • Labels:
      None

      Description

      An error like:

      Error: System error: 0:Graph graph1[5], keyedjoin[10]: SLAVE #3 [192.168.10.3:20010]: Failed to open index file /var/lib/HPCCSystems/hpcc-data/thor/someindex._5_of_80

      This error can occur when a job running on a smaller Thor cluster reads index files that belong to a larger cluster, and the smaller Thor runs on a subset of the larger cluster's nodes, i.e. the two clusters overlap.

      In this setup, the code that maps keyed-join (KJ) index parts to slaves can find that a copy of a part from the larger cluster, one that is not the primary copy, is local to a node of the smaller cluster. The map tells the other slaves to send their requests for that part to the slave on that node, but does not tell that slave that its local file is a non-primary copy. As a consequence, the slave serving those remote requests fails to open the part.
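
The failure mode can be illustrated with a small, purely hypothetical model (this is not Thor's code or its real replication layout). It assumes a larger 8-node cluster where part p's primary copy lives on node p and its replica on node (p + 1) % 8, and a smaller Thor running on nodes {0, 2, 4, 6}. Parts whose only local copy is the replica get routed to a slave that, if told just the part number, looks for the primary and fails. Passing the copy number along is shown as one plausible remedy; the actual fix in 7.6.16/7.4.40 may differ.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical layout, for illustration only: an 8-part index on an 8-node
# "large" cluster; copy 0 (primary) of part p is on node p, copy 1 (replica)
# on node (p + 1) % 8. Real HPCC layouts may differ.
LARGE_NODES = 8
SMALL_NODES = {0, 2, 4, 6}  # the smaller Thor runs on a subset of the same nodes


def copy_node(part: int, copy: int) -> int:
    """Node holding the given copy (0 = primary, 1 = replica) of a part."""
    return (part + copy) % LARGE_NODES


@dataclass
class Assignment:
    part: int
    node: int  # node of the slave chosen to serve lookups for this part
    copy: int  # which physical copy is local to that node


def map_parts() -> List[Assignment]:
    """Assign each part to a small-cluster node holding *some* copy locally."""
    result = []
    for part in range(LARGE_NODES):
        for copy in (0, 1):  # prefer the primary, fall back to the replica
            node = copy_node(part, copy)
            if node in SMALL_NODES:
                result.append(Assignment(part, node, copy))
                break
    return result


def remote_open(node: int, part: int, copy: Optional[int]) -> bool:
    """Simulate the serving slave opening the part from its local disk.

    A request that carries no copy number is treated as asking for the primary.
    """
    assumed_copy = 0 if copy is None else copy
    return copy_node(part, assumed_copy) == node


def failures(send_copy: bool) -> List[int]:
    """Parts whose serving slave cannot open them locally."""
    return [a.part for a in map_parts()
            if not remote_open(a.node, a.part, a.copy if send_copy else None)]


print("copy number omitted:", failures(send_copy=False))  # [1, 3, 5, 7]
print("copy number sent:   ", failures(send_copy=True))   # []
```

In this model every part served from its replica fails when the request omits the copy number, which matches the behaviour described above.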

      A workaround is to add this option:

      #option('remoteKeyedLookup', false);

      which forces all KJ part handler threads to access the index parts directly, rather than routing lookups through the slave on which a part is local.


    People

    • Assignee: Jake Smith (jakesmith)
    • Reporter: Jake Smith (jakesmith)
    • Votes: 0
    • Watchers: 6
