HPCC-16927

Global Sort socket timeouts silently reported, not causing job to abort

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.2.8
    • Component/s: Thor
    • Labels:
      None

      Description

      There have been a few incidents where workunits stalled indefinitely whilst performing a large global sort.
      On examination of the logs, a slave had hit a socket timeout error, which was reported in the slave log but nowhere else.

      This should percolate up as a fatal error and be seen in the workunit.
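The requested behaviour, where a failure on one worker surfaces as a fatal error instead of only being logged locally, can be sketched generically. This is not HPCC code and makes no claim about how the fix was implemented: `FirstErrorCollector` and `runAllOrFail` are hypothetical names, and plain `std::thread` workers stand in for Thor slaves.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <functional>
#include <mutex>
#include <stdexcept>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper (not an HPCC class): remembers the first exception
// raised by any worker so the coordinator can rethrow it.
class FirstErrorCollector {
public:
    void record(std::exception_ptr error) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!first_) first_ = error;  // keep only the first failure
    }

    void rethrowIfAny() {
        std::exception_ptr error;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            error = first_;
        }
        if (error) std::rethrow_exception(error);
    }

private:
    std::mutex mutex_;
    std::exception_ptr first_;
};

// Runs each task on its own thread. A worker that fails hands its exception
// to the collector instead of swallowing it; once every worker has been
// joined, the first failure is rethrown to the caller.
inline void runAllOrFail(const std::vector<std::function<void()>>& tasks) {
    FirstErrorCollector errors;
    std::vector<std::thread> workers;
    workers.reserve(tasks.size());
    for (std::size_t i = 0; i < tasks.size(); ++i) {
        const std::function<void()>* task = &tasks[i];
        workers.emplace_back([&errors, task] {
            try {
                (*task)();
            } catch (...) {
                errors.record(std::current_exception());
            }
        });
    }
    for (std::thread& worker : workers) worker.join();
    errors.rethrowIfAny();
}
```

Calling `runAllOrFail` with a task that throws `std::runtime_error("timeout expired")` makes the call itself throw that exception once all workers have finished. In a distributed sort the coordinator would also have to tell the remaining workers to stop waiting on the failed peer; that part is not shown.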

      Example from logs:

      00000F7C 2017-01-20 10:26:10.109 28215 29351 "SYS: PU=  5% MU= 25% MAL=3676429824 MMP=3674476544 SBK=1953280 TOT=3593484K RAM=12096640K SWP=1659344K"
      00000F7D 2017-01-20 10:26:10.109 28215 29351 "DSK: [sda] r/s=0.0 kr/s=0.0 w/s=0.0 kw/s=0.0 bsy=0 [sdb] r/s=142.1 kr/s=16941.4 w/s=1.6 kw/s=12.9 bsy=3 NIC: rxp/s=0.0 rxk/s=0.0 txp/s=0.0 txk/s=0.0 CPU: usr=0 sys=4 iow=0 idle=94"
      00000F7E 2017-01-20 10:26:32.583 28215 30198 "ERROR: -6: /jenkins/workspace/LN-Candidate-withplugins-5.6.8-1/LN/centos-6.0-x86_64/HPCC-Platform/thorlcr/msort/tsorts1.cpp(505) : **MultiMerge.2 : timeout expired
      Target: C!10.241.20.79, Raised in: /jenkins/workspace/LN-Candidate-withplugins-5.6.8-1/LN/centos-6.0-x86_64/HPCC-Platform/system/jlib/jsocket.cpp, line 1524"
      00000F7F 2017-01-20 10:26:32.583 28215 30198 "ERROR: -6: /jenkins/workspace/LN-Candidate-withplugins-5.6.8-1/LN/centos-6.0-x86_64/HPCC-Platform/thorlcr/msort/tsortmp.cpp(563) : SortSlaveMP::marshall : timeout expired
      Target: C!10.241.20.79, Raised in: /jenkins/workspace/LN-Candidate-withplugins-5.6.8-1/LN/centos-6.0-x86_64/HPCC-Platform/system/jlib/jsocket.cpp, line 1524"
      00000F80 2017-01-20 10:26:32.583 28215 30198 "ERROR: -6: /jenkins/workspace/LN-Candidate-withplugins-5.6.8-1/LN/centos-6.0-x86_64/HPCC-Platform/thorlcr/msort/tsorts.cpp(621) : **Exception(10) : timeout expired
      Target: C!10.241.20.79, Raised in: /jenkins/workspace/LN-Candidate-withplugins-5.6.8-1/LN/centos-6.0-x86_64/HPCC-Platform/system/jlib/jsocket.cpp, line 1524"
      

      Other examples:

      000D00B2 2017-01-18 18:43:16.479  5924 30847 "ERROR: 110: /jenkins/workspace/LN-Candidate-withplugins-5.6.8-1/LN/centos-6.0-x86_64/HPCC-Platform/thorlcr/msort/tsorts1.cpp(207) : CSortMergeBase processRows : ETIMEDOUT - Connection timed out
      000D00B3 2017-01-18 18:43:16.479  5924 30847 "ERROR: 110: /jenkins/workspace/LN-Candidate-withplugins-5.6.8-1/LN/centos-6.0-x86_64/HPCC-Platform/thorlcr/msort/tsorts1.cpp(258) : CSortMerge notifySelected.2 : ETIMEDOUT - Connection timed out
      000D00B4 2017-01-18 18:43:16.480  5924 30847 "ERROR: -10: /jenkins/workspace/LN-Candidate-withplugins-5.6.8-1/LN/centos-6.0-x86_64/HPCC-Platform/thorlcr/msort/tsorts1.cpp(548) : **Exception(4c) : connection closed other end
      000D00B5 2017-01-18 18:43:16.480  5924 30847 "ERROR: -10: /jenkins/workspace/LN-Candidate-withplugins-5.6.8-1/LN/centos-6.0-x86_64/HPCC-Platform/thorlcr/msort/tsorts1.cpp(247) : CSortMerge notifySelected.1 : connection closed other end
      
      
      000CFC11 2017-01-18 18:43:15.684 13539 30512 "ERROR: -4: /jenkins/workspace/LN-Candidate-withplugins-5.6.8-1/LN/centos-6.0-x86_64/HPCC-Platform/thorlcr/msort/tsorts1.cpp(207) : CSortMergeBase processRows : connection is broken
      000CFC12 2017-01-18 18:43:15.684 13539 30512 "ERROR: -4: /jenkins/workspace/LN-Candidate-withplugins-5.6.8-1/LN/centos-6.0-x86_64/HPCC-Platform/thorlcr/msort/tsorts1.cpp(258) : CSortMerge notifySelected.2 : connection is broken
      
      000D0011 2017-01-18 18:58:41.626 13377 16887 "ERROR: -4: /jenkins/workspace/LN-Candidate-withplugins-5.6.8-1/LN/centos-6.0-x86_64/HPCC-Platform/thorlcr/msort/tsorts1.cpp(207) : CSortMergeBase processRows : connection is broken
      000D0012 2017-01-18 18:58:41.626 13377 16887 "ERROR: -4: /jenkins/workspace/LN-Candidate-withplugins-5.6.8-1/LN/centos-6.0-x86_64/HPCC-Platform/thorlcr/msort/tsorts1.cpp(258) : CSortMerge notifySelected.2 : connection is broken
      
      thorslave.183.2017_01_18.log:000CFE42 2017-01-18 19:27:50.937  4038 26333 "ERROR: -6: /jenkins/workspace/LN-Candidate-withplugins-5.6.8-1/LN/centos-6.0-x86_64/HPCC-Platform/thorlcr/msort/tsorts1.cpp(505) : **MultiMerge.2 : timeout expired
      thorslave.183.2017_01_18.log:000CFE43 2017-01-18 19:27:50.937  4038 26333 "ERROR: -6: /jenkins/workspace/LN-Candidate-withplugins-5.6.8-1/LN/centos-6.0-x86_64/HPCC-Platform/thorlcr/msort/tsortmp.cpp(563) : SortSlaveMP::marshall : timeout expired
      thorslave.183.2017_01_18.log:000CFE44 2017-01-18 19:27:50.938  4038 26333 "ERROR: -6: /jenkins/workspace/LN-Candidate-withplugins-5.6.8-1/LN/centos-6.0-x86_64/HPCC-Platform/thorlcr/msort/tsorts.cpp(621) : **Exception(10) : timeout expired
      
      000CFD1D 2017-01-18 19:27:50.154  4041 26334 "ERROR: -6: /jenkins/workspace/LN-Candidate-withplugins-5.6.8-1/LN/centos-6.0-x86_64/HPCC-Platform/thorlcr/msort/tsorts1.cpp(505) : **MultiMerge.2 : timeout expired
      000CFD1E 2017-01-18 19:27:50.154  4041 26334 "ERROR: -6: /jenkins/workspace/LN-Candidate-withplugins-5.6.8-1/LN/centos-6.0-x86_64/HPCC-Platform/thorlcr/msort/tsortmp.cpp(563) : SortSlaveMP::marshall : timeout expired
      000CFD1F 2017-01-18 19:27:50.155  4041 26334 "ERROR: -6: /jenkins/workspace/LN-Candidate-withplugins-5.6.8-1/LN/centos-6.0-x86_64/HPCC-Platform/thorlcr/msort/tsorts.cpp(621) : **Exception(10) : timeout expired
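
Because these errors appear only in the slave logs, a small scanner can help when triaging a stalled workunit. The sketch below pulls the error code, source file, source line and message out of a line shaped like the excerpts above. The layout is inferred from those excerpts alone, not from any HPCC logging specification, and `SlaveError` and `parseSlaveError` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical record for one ERROR entry found in a slave log line.
struct SlaveError {
    int code = 0;            // e.g. -6, -4, 110
    std::string sourceFile;  // file name without directories, e.g. tsorts.cpp
    int sourceLine = 0;      // number in parentheses after the file name
    std::string detail;      // remainder of the line after "file(line) : "
};

// Returns true and fills `out` if `logLine` contains an entry of the form
//   "ERROR: <code>: <path>(<line>) : <detail>
// Assumption: this shape is taken from the log excerpts in this issue.
inline bool parseSlaveError(const std::string& logLine, SlaveError& out) {
    static const std::regex pattern(
        R"re("ERROR: (-?\d+): (\S+)\((\d+)\) : (.*))re");
    std::smatch match;
    if (!std::regex_search(logLine, match, pattern)) return false;

    const std::string path = match[2].str();
    const std::size_t slash = path.find_last_of('/');

    out.code = std::stoi(match[1].str());
    out.sourceFile =
        (slash == std::string::npos) ? path : path.substr(slash + 1);
    out.sourceLine = std::stoi(match[3].str());
    out.detail = match[4].str();
    // Drop the closing quote when the log line carries one.
    if (!out.detail.empty() && out.detail.back() == '"') out.detail.pop_back();
    return true;
}
```

Feeding each line of a `thorslave` log through `parseSlaveError` and reporting any line for which it returns true gives a quick list of candidate failures; lines such as the `SYS:` and `DSK:` statistics entries are skipped because they do not match.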
      

            People

            • Assignee:
              mckellyln Mark Kelly
            • Reporter:
              jakesmith Jake Smith