HPCC-10234

ThorMaster segmentation fault

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0.2
    • Fix Version/s: None
    • Component/s: Thor
    • Labels:
      None

      Description

      We had a workunit (one that has run fine many times in the past) fail overnight. It got stuck for a long time and then failed with this error:

      eclagent 	-1: System error: -1: Failed to receive reply from thor 10.220.5.10:16520; (-1, Failed to receive reply from thor 10.220.5.10:16520)

      It appears ThorMaster crashed after deciding to abort the job for some reason. As a guess, one of the slaves had trouble reading the file, but I don't really know, since what got reported was ThorMaster crashing rather than the true source of the problem.
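      One thing the register dump in the master log does seem to show is where the faulting read sat relative to a page boundary. The sketch below is only arithmetic on values copied from that log. Treating ESI as the memcpy source and EDX as its length (the usual x86-64 calling convention, assuming the registers still hold the entry arguments at the fault) and assuming 4 KiB pages are both assumptions, and the helper names are made up for illustration.

```cpp
#include <cstdint>

// Values copied from the master log's register dump. Reading them as the
// memcpy arguments (ESI = source, EDX = length) is an assumption.
constexpr std::uint64_t kFaultAddr = 0x00007FB85C11F000ULL; // "Accessing:"
constexpr std::uint64_t kSource    = 0x00007FB85C11EFFBULL; // ESI
constexpr std::uint64_t kLength    = 8;                     // EDX
constexpr std::uint64_t kPageSize  = 4096;                  // assumed 4 KiB pages

// First address of the page that follows the one containing addr.
constexpr std::uint64_t nextPageStart(std::uint64_t addr)
{
    return (addr / kPageSize + 1) * kPageSize;
}

// True when the half-open range [src, src + len) reaches into the next page.
constexpr bool crossesPage(std::uint64_t src, std::uint64_t len)
{
    return len != 0 && src + len > nextPageStart(src);
}

// Number of bytes of the range that lie on the source's own page.
constexpr std::uint64_t bytesBeforeBoundary(std::uint64_t src, std::uint64_t len)
{
    return crossesPage(src, len) ? nextPageStart(src) - src : len;
}
```

      With the logged values, nextPageStart(kSource) equals the "Accessing:" address and only 5 of the 8 requested bytes lie before that boundary. That would be consistent with a read running past the end of a mapped buffer, but it does not by itself say why the buffer was short.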

      Here's the log from the master:

      0004358E 2013-10-16 14:01:40.858 28636 28636 "WARNING: Graph wait cancelled, aborted=false - graph(graph1, 4)"
      0004358F 2013-10-16 14:01:40.859 28636 16505 "Abort condition set - activity(diskwrite, 10)"
      00043590 2013-10-16 14:01:40.859 28636 16505 "Abort condition set - activity(iterate, 8)"
      00043591 2013-10-16 14:01:40.859 28636 16505 "Abort condition set - activity(split, 6)"
      00043592 2013-10-16 14:01:40.859 28636 16505 "Abort condition set - activity(diskread, 5)"
      00043593 2013-10-16 14:01:40.882 28636 28636 "================================================"
      00043594 2013-10-16 14:01:40.882 28636 28636 "Signal:    11 Segmentation fault"
      00043595 2013-10-16 14:01:40.882 28636 28636 "Fault IP:  0000003481288AE6"
      00043596 2013-10-16 14:01:40.882 28636 28636 "Accessing: 00007FB85C11F000"
      00043597 2013-10-16 14:01:40.882 28636 28636 "Registers:"
      00043598 2013-10-16 14:01:40.882 28636 28636 "EAX:0000000001A38F90  EBX:00007FFF2306B450  ECX:000000000180C5C0  EDX:0000000000000008  ESI:00007FB85C11EFFB  EDI:0000000001A38F90"
      00043599 2013-10-16 14:01:40.882 28636 28636 "CS:EIP:0033:0000003481288AE6"
      0004359A 2013-10-16 14:01:40.882 28636 28636 "   ESP:00007FFF2306B368  EBP:0000000000000008"
      0004359B 2013-10-16 14:01:40.882 28636 28636 "Stack[00007FFF2306B368]: 00007FB86664ABF5 00CC78B000007FB8 0000000000CC78B0 01A38F7000000000 0000000001A38F70 00CC78B000000000 0000000000CC78B0 66652F8C00000000"
      0004359C 2013-10-16 14:01:40.882 28636 28636 "Stack[00007FFF2306B388]: 00007FB866652F8C 0000001700007FB8 0000000000000017 00CC78D000000000 0000000000CC78D0 0000000100000000 0000000000000001 FE7B52B800000000"
      0004359D 2013-10-16 14:01:40.882 28636 28636 "Stack[00007FFF2306B3A8]: 04FF39A0FE7B52B8 0000000004FF39A0 0000000000000000 00ABFD9000000000 0000000000ABFD90 2306B45000000000 00007FFF2306B450 0000001700007FFF"
      0004359E 2013-10-16 14:01:40.882 28636 28636 "Stack[00007FFF2306B3C8]: 0000000000000017 006588F800000000 00000000006588F8 FE7B52B800000000 00000000FE7B52B8 0000000000000000 0000000000000000 63DC786700000000"
      0004359F 2013-10-16 14:01:40.882 28636 28636 "Stack[00007FFF2306B3E8]: 00007FB863DC7867 00ABFD9000007FB8 0000000000ABFD90 2306B54F00000000 00007FFF2306B54F 6697DBB000007FFF 00007FB86697DBB0 2306B4D800007FB8"
      000435A0 2013-10-16 14:01:40.883 28636 28636 "Stack[00007FFF2306B408]: 00007FFF2306B4D8 2306B4DF00007FFF 00007FFF2306B4DF 64C21E1800007FFF 0000001764C21E18 0000000000000017 0000000000000000 0000000000000000"
      000435A1 2013-10-16 14:01:40.883 28636 28636 "Stack[00007FFF2306B428]: 0000000000000000 2306B4B800000000 00007FFF2306B4B8 0000003600007FFF FFFFFFFF00000036 2306B490FFFFFFFF 00007FFF2306B490 2306B4D400007FFF"
      000435A2 2013-10-16 14:01:40.883 28636 28636 "Stack[00007FFF2306B448]: 00007FFF2306B4D4 0011DE8B00007FFF 000000000011DE8B 5C00117000000000 00007FB85C001170 0000001500007FB8 0000001500000015 0000000100000015"
      000435A3 2013-10-16 14:01:40.883 28636 28636 "Backtrace:"
      000435A4 2013-10-16 14:01:40.886 28636 28636 "  /opt/HPCCSystems/lib/libjlib.so(_Z16PrintStackReportv+0x28) [0x7fb86665b1b8]"
      000435A5 2013-10-16 14:01:40.886 28636 28636 "  /opt/HPCCSystems/lib/libjlib.so(_Z13excsighandleriP7siginfoPv+0x9da) [0x7fb86665be5a]"
      000435A6 2013-10-16 14:01:40.886 28636 28636 "  /lib64/libpthread.so.0() [0x348160f500]"
      000435A7 2013-10-16 14:01:40.886 28636 28636 "  /lib64/libc.so.6(memcpy+0x46) [0x3481288ae6]"
      000435A8 2013-10-16 14:01:40.886 28636 28636 "  /opt/HPCCSystems/lib/libjlib.so(_ZN12MemoryBuffer10readEndianEjPv+0x65) [0x7fb86664abf5]"
      000435A9 2013-10-16 14:01:40.886 28636 28636 "  /opt/HPCCSystems/lib/libjlib.so(_Z21createStdTimeReporterR12MemoryBuffer+0x11c) [0x7fb866652f8c]"
      000435AA 2013-10-16 14:01:40.886 28636 28636 "  /opt/HPCCSystems/lib/libgraphmaster_lcr.so(_ZN12CMasterGraph16getFinalProgressEv+0x257) [0x7fb863dc7867]"
      000435AB 2013-10-16 14:01:40.886 28636 28636 "  /opt/HPCCSystems/lib/libgraphmaster_lcr.so(_ZN12CMasterGraph4doneEv+0x170) [0x7fb863dc7cb0]"
      000435AC 2013-10-16 14:01:40.886 28636 28636 "  /opt/HPCCSystems/lib/libgraph_lcr.so(_ZN10CGraphBase9doExecuteEjPKhb+0x19b) [0x7fb864c0f3cb]"
      000435AD 2013-10-16 14:01:40.886 28636 28636 "  /opt/HPCCSystems/lib/libgraph_lcr.so(_ZN10CGraphBase15executeSubGraphEjPKh+0xb4) [0x7fb864c104b4]"
      000435AE 2013-10-16 14:01:40.886 28636 28636 "  /opt/HPCCSystems/lib/libgraphmaster_lcr.so(_ZN12CMasterGraph15executeSubGraphEjPKh+0x23b) [0x7fb863dc94bb]"
      000435AF 2013-10-16 14:01:40.886 28636 28636 "  /opt/HPCCSystems/lib/libgraphmaster_lcr.so(_ZN10CJobMaster2goEv+0x76b) [0x7fb863dcb7cb]"
      000435B0 2013-10-16 14:01:40.886 28636 28636 "  /var/lib/HPCCSystems/mythor2/thormaster_mythor2(_ZN11CJobManager12executeGraphER14IConstWorkUnitPKcRK14SocketEndpoint+0x5ef) [0x40f88f]"
      000435B1 2013-10-16 14:01:40.886 28636 28636 "  /var/lib/HPCCSystems/mythor2/thormaster_mythor2(_ZN11CJobManager4doitEP14IConstWorkUnitPKcRK14SocketEndpoint+0x201) [0x4102b1]"
      000435B2 2013-10-16 14:01:40.886 28636 28636 "  /var/lib/HPCCSystems/mythor2/thormaster_mythor2(_ZN11CJobManager3runEv+0xd5d) [0x41121d]"
      000435B3 2013-10-16 14:01:40.886 28636 28636 "  /var/lib/HPCCSystems/mythor2/thormaster_mythor2(_Z8thorMainP14ILogMsgHandler+0x26e) [0x411ace]"
      000435B4 2013-10-16 14:01:40.886 28636 28636 "  /var/lib/HPCCSystems/mythor2/thormaster_mythor2(main+0x1205) [0x4140a5]"
      000435B5 2013-10-16 14:01:40.886 28636 28636 "  /lib64/libc.so.6(__libc_start_main+0xfd) [0x348121ecdd]"
      000435B6 2013-10-16 14:01:40.886 28636 28636 "  /var/lib/HPCCSystems/mythor2/thormaster_mythor2() [0x40a459]"
      000435B7 2013-10-16 14:01:40.886 28636 28636 "ThreadList:
      7FB862203700 140429896988416 28637: CMPNotifyClosedThread
      7FB861802700 140429886498560 28638: MP Connection Thread
      7FB85BFFF700 140429794211584 28640: CSocketSelectThread
      7FB85ABFD700 140429773231872 28667: LogMsgParentReceiver
      7FB8517FB700 140429618034432 28668: LogMsgFilterReceiver
      7FB8521FC700 140429628524288 28669: CMemoryUsageReporter
      7FB860DE1700 140429875877632 28670: CMasterWatchdogBase
      7FB7ADEFE700 140426873923328 28671: CDeregistrationWatch
      7FB7AD4FD700 140426863433472 28672: CDaliConnectionValidator
      7FB7ACAFC700 140426852943616 28673: CDaliConnectionValidator
      7FB85A1FC700 140429762742016 28675: CDaliPublisherClient
      7FB8597FB700 140429752252160 28676: Member of thread pool: CDaliPublisherClientMessages
      7FB858DFA700 140429741762304 1118: Member of thread pool: CDaliPublisherClientMessages
      7FB79FFFF700 140426640094976 16503: ReleaseBufferThread
      7FB85B5FE700 140429783721728 16504: TimeoutTrigger
      7FB853DCF700 140429657700096 16505: CSlaveMessageHandler
      7FB79F357700 140426626823936 16506: WorkunitAbortHandler
      7FB787FFF700 140426237441792 16558: Stopped CMasterActivity
      7FB79DF55700 140426605844224 16559: Stopped CMasterActivity
      7FB79E956700 140426616334080 16560: TimeoutTrigger
      7FB787FFF700 140426237441792 16609: Stopped CMasterActivity
      7FB786BFD700 140426216462080 16610: Stopped CMasterActivity
      7FB787FFF700 140426237441792 16611: Stopped CMasterActivity
      7FB786BFD700 140426216462080 16612: Stopped CMasterActivity
      "
      

      It generated a core, which I have, but gzipped it is 15.6 MB and it appears this ticket system won't allow attachments bigger than 10 MB.
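      The frames in the backtrace above are mangled C++ names, which makes the call chain harder to read than it needs to be. A minimal sketch for demangling them follows, assuming a GCC or Clang toolchain that provides abi::__cxa_demangle in <cxxabi.h>. The helper names demangle and demangleFrame are hypothetical and are not part of HPCC.

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle (GCC and Clang)
#include <cstddef>    // std::size_t
#include <cstdlib>    // std::free
#include <string>

// Return the demangled form of an Itanium-ABI symbol, or the input
// unchanged when it cannot be demangled.
std::string demangle(const std::string &mangled)
{
    int status = 0;
    char *out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || out == nullptr)
        return mangled;
    std::string result(out);
    std::free(out);   // the buffer is malloc-allocated by __cxa_demangle
    return result;
}

// Pull the symbol out of one backtrace frame of the form
// "path(symbol+0xOFF) [0xADDR]" and demangle it. Returns an empty string
// for frames that carry no symbol, such as "libpthread.so.0()".
std::string demangleFrame(const std::string &frame)
{
    const std::size_t open = frame.find('(');
    if (open == std::string::npos)
        return "";
    const std::size_t end = frame.find_first_of("+)", open + 1);
    if (end == std::string::npos || end == open + 1)
        return "";
    return demangle(frame.substr(open + 1, end - open - 1));
}
```

      Applied to the three frames just above memcpy, this gives MemoryBuffer::readEndian(unsigned int, void*), createStdTimeReporter(MemoryBuffer&) and CMasterGraph::getFinalProgress(). So the crash happened while the master was reading timing data out of a MemoryBuffer as part of collecting final graph progress. Piping the log through c++filt should give the same names without writing any code.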

            People

            • Assignee: Jake Smith (jakesmith)
            • Reporter: lneric
            • Votes: 0
            • Watchers: 2
