We had a workunit (that has run fine many times in the past) fail after getting stuck for a long time overnight with this error:
eclagent -1: System error: -1: Failed to receive reply from thor 10.220.5.10:16520; (-1, Failed to receive reply from thor 10.220.5.10:16520)
It appears ThorMaster crashed after deciding to abort the job for some reason. My guess is that one of the slaves had trouble reading the file, but I don't really know, since what got reported was ThorMaster crashing and not the true source of the problem.
Here's the log from the master:
0004358E 2013-10-16 14:01:40.858 28636 28636 "WARNING: Graph wait cancelled, aborted=false - graph(graph1, 4)"
0004358F 2013-10-16 14:01:40.859 28636 16505 "Abort condition set - activity(diskwrite, 10)"
00043590 2013-10-16 14:01:40.859 28636 16505 "Abort condition set - activity(iterate, 8)"
00043591 2013-10-16 14:01:40.859 28636 16505 "Abort condition set - activity(split, 6)"
00043592 2013-10-16 14:01:40.859 28636 16505 "Abort condition set - activity(diskread, 5)"
00043593 2013-10-16 14:01:40.882 28636 28636 "================================================"
00043594 2013-10-16 14:01:40.882 28636 28636 "Signal: 11 Segmentation fault"
00043595 2013-10-16 14:01:40.882 28636 28636 "Fault IP: 0000003481288AE6"
00043596 2013-10-16 14:01:40.882 28636 28636 "Accessing: 00007FB85C11F000"
00043597 2013-10-16 14:01:40.882 28636 28636 "Registers:"
00043598 2013-10-16 14:01:40.882 28636 28636 "EAX:0000000001A38F90 EBX:00007FFF2306B450 ECX:000000000180C5C0 EDX:0000000000000008 ESI:00007FB85C11EFFB EDI:0000000001A38F90"
00043599 2013-10-16 14:01:40.882 28636 28636 "CS:EIP:0033:0000003481288AE6"
0004359A 2013-10-16 14:01:40.882 28636 28636 " ESP:00007FFF2306B368 EBP:0000000000000008"
0004359B 2013-10-16 14:01:40.882 28636 28636 "Stack[00007FFF2306B368]: 00007FB86664ABF5 00CC78B000007FB8 0000000000CC78B0 01A38F7000000000 0000000001A38F70 00CC78B000000000 0000000000CC78B0 66652F8C00000000"
0004359C 2013-10-16 14:01:40.882 28636 28636 "Stack[00007FFF2306B388]: 00007FB866652F8C 0000001700007FB8 0000000000000017 00CC78D000000000 0000000000CC78D0 0000000100000000 0000000000000001 FE7B52B800000000"
0004359D 2013-10-16 14:01:40.882 28636 28636 "Stack[00007FFF2306B3A8]: 04FF39A0FE7B52B8 0000000004FF39A0 0000000000000000 00ABFD9000000000 0000000000ABFD90 2306B45000000000 00007FFF2306B450 0000001700007FFF"
0004359E 2013-10-16 14:01:40.882 28636 28636 "Stack[00007FFF2306B3C8]: 0000000000000017 006588F800000000 00000000006588F8 FE7B52B800000000 00000000FE7B52B8 0000000000000000 0000000000000000 63DC786700000000"
0004359F 2013-10-16 14:01:40.882 28636 28636 "Stack[00007FFF2306B3E8]: 00007FB863DC7867 00ABFD9000007FB8 0000000000ABFD90 2306B54F00000000 00007FFF2306B54F 6697DBB000007FFF 00007FB86697DBB0 2306B4D800007FB8"
000435A0 2013-10-16 14:01:40.883 28636 28636 "Stack[00007FFF2306B408]: 00007FFF2306B4D8 2306B4DF00007FFF 00007FFF2306B4DF 64C21E1800007FFF 0000001764C21E18 0000000000000017 0000000000000000 0000000000000000"
000435A1 2013-10-16 14:01:40.883 28636 28636 "Stack[00007FFF2306B428]: 0000000000000000 2306B4B800000000 00007FFF2306B4B8 0000003600007FFF FFFFFFFF00000036 2306B490FFFFFFFF 00007FFF2306B490 2306B4D400007FFF"
000435A2 2013-10-16 14:01:40.883 28636 28636 "Stack[00007FFF2306B448]: 00007FFF2306B4D4 0011DE8B00007FFF 000000000011DE8B 5C00117000000000 00007FB85C001170 0000001500007FB8 0000001500000015 0000000100000015"
000435A3 2013-10-16 14:01:40.883 28636 28636 "Backtrace:"
000435A4 2013-10-16 14:01:40.886 28636 28636 " /opt/HPCCSystems/lib/libjlib.so(_Z16PrintStackReportv+0x28) [0x7fb86665b1b8]"
000435A5 2013-10-16 14:01:40.886 28636 28636 " /opt/HPCCSystems/lib/libjlib.so(_Z13excsighandleriP7siginfoPv+0x9da) [0x7fb86665be5a]"
000435A6 2013-10-16 14:01:40.886 28636 28636 " /lib64/libpthread.so.0() [0x348160f500]"
000435A7 2013-10-16 14:01:40.886 28636 28636 " /lib64/libc.so.6(memcpy+0x46) [0x3481288ae6]"
000435A8 2013-10-16 14:01:40.886 28636 28636 " /opt/HPCCSystems/lib/libjlib.so(_ZN12MemoryBuffer10readEndianEjPv+0x65) [0x7fb86664abf5]"
000435A9 2013-10-16 14:01:40.886 28636 28636 " /opt/HPCCSystems/lib/libjlib.so(_Z21createStdTimeReporterR12MemoryBuffer+0x11c) [0x7fb866652f8c]"
000435AA 2013-10-16 14:01:40.886 28636 28636 " /opt/HPCCSystems/lib/libgraphmaster_lcr.so(_ZN12CMasterGraph16getFinalProgressEv+0x257) [0x7fb863dc7867]"
000435AB 2013-10-16 14:01:40.886 28636 28636 " /opt/HPCCSystems/lib/libgraphmaster_lcr.so(_ZN12CMasterGraph4doneEv+0x170) [0x7fb863dc7cb0]"
000435AC 2013-10-16 14:01:40.886 28636 28636 " /opt/HPCCSystems/lib/libgraph_lcr.so(_ZN10CGraphBase9doExecuteEjPKhb+0x19b) [0x7fb864c0f3cb]"
000435AD 2013-10-16 14:01:40.886 28636 28636 " /opt/HPCCSystems/lib/libgraph_lcr.so(_ZN10CGraphBase15executeSubGraphEjPKh+0xb4) [0x7fb864c104b4]"
000435AE 2013-10-16 14:01:40.886 28636 28636 " /opt/HPCCSystems/lib/libgraphmaster_lcr.so(_ZN12CMasterGraph15executeSubGraphEjPKh+0x23b) [0x7fb863dc94bb]"
000435AF 2013-10-16 14:01:40.886 28636 28636 " /opt/HPCCSystems/lib/libgraphmaster_lcr.so(_ZN10CJobMaster2goEv+0x76b) [0x7fb863dcb7cb]"
000435B0 2013-10-16 14:01:40.886 28636 28636 " /var/lib/HPCCSystems/mythor2/thormaster_mythor2(_ZN11CJobManager12executeGraphER14IConstWorkUnitPKcRK14SocketEndpoint+0x5ef) [0x40f88f]"
000435B1 2013-10-16 14:01:40.886 28636 28636 " /var/lib/HPCCSystems/mythor2/thormaster_mythor2(_ZN11CJobManager4doitEP14IConstWorkUnitPKcRK14SocketEndpoint+0x201) [0x4102b1]"
000435B2 2013-10-16 14:01:40.886 28636 28636 " /var/lib/HPCCSystems/mythor2/thormaster_mythor2(_ZN11CJobManager3runEv+0xd5d) [0x41121d]"
000435B3 2013-10-16 14:01:40.886 28636 28636 " /var/lib/HPCCSystems/mythor2/thormaster_mythor2(_Z8thorMainP14ILogMsgHandler+0x26e) [0x411ace]"
000435B4 2013-10-16 14:01:40.886 28636 28636 " /var/lib/HPCCSystems/mythor2/thormaster_mythor2(main+0x1205) [0x4140a5]"
000435B5 2013-10-16 14:01:40.886 28636 28636 " /lib64/libc.so.6(__libc_start_main+0xfd) [0x348121ecdd]"
000435B6 2013-10-16 14:01:40.886 28636 28636 " /var/lib/HPCCSystems/mythor2/thormaster_mythor2() [0x40a459]"
000435B7 2013-10-16 14:01:40.886 28636 28636 "ThreadList:
7FB862203700 140429896988416 28637: CMPNotifyClosedThread
7FB861802700 140429886498560 28638: MP Connection Thread
7FB85BFFF700 140429794211584 28640: CSocketSelectThread
7FB85ABFD700 140429773231872 28667: LogMsgParentReceiver
7FB8517FB700 140429618034432 28668: LogMsgFilterReceiver
7FB8521FC700 140429628524288 28669: CMemoryUsageReporter
7FB860DE1700 140429875877632 28670: CMasterWatchdogBase
7FB7ADEFE700 140426873923328 28671: CDeregistrationWatch
7FB7AD4FD700 140426863433472 28672: CDaliConnectionValidator
7FB7ACAFC700 140426852943616 28673: CDaliConnectionValidator
7FB85A1FC700 140429762742016 28675: CDaliPublisherClient
7FB8597FB700 140429752252160 28676: Member of thread pool: CDaliPublisherClientMessages
7FB858DFA700 140429741762304 1118: Member of thread pool: CDaliPublisherClientMessages
7FB79FFFF700 140426640094976 16503: ReleaseBufferThread
7FB85B5FE700 140429783721728 16504: TimeoutTrigger
7FB853DCF700 140429657700096 16505: CSlaveMessageHandler
7FB79F357700 140426626823936 16506: WorkunitAbortHandler
7FB787FFF700 140426237441792 16558: Stopped CMasterActivity
7FB79DF55700 140426605844224 16559: Stopped CMasterActivity
7FB79E956700 140426616334080 16560: TimeoutTrigger
7FB787FFF700 140426237441792 16609: Stopped CMasterActivity
7FB786BFD700 140426216462080 16610: Stopped CMasterActivity
7FB787FFF700 140426237441792 16611: Stopped CMasterActivity
7FB786BFD700 140426216462080 16612: Stopped CMasterActivity"
It generated a core, which I have, but gzipped it is 15.6 MB, and it appears this ticket system won't allow attachments bigger than 10 MB.
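One possible workaround for the attachment limit is to split the gzipped core into pieces that each fit under 10 MB and attach them separately. The following is only a minimal sketch of that idea: the helper names `split_file` and `join_files`, the `.partNNN` naming scheme, and the file name `core.gz` are all hypothetical and not part of HPCC or the ticket system.

```python
from pathlib import Path


def split_file(src, chunk_bytes, out_dir):
    """Split src into numbered parts of at most chunk_bytes bytes each.

    Returns the list of part paths in order. The ".partNNN" suffix is an
    illustrative naming choice, not an established convention.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    with src.open("rb") as f:
        index = 0
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            part = out_dir / f"{src.name}.part{index:03d}"
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    return parts


def join_files(parts, dest):
    """Concatenate the parts (sorted by name) back into a single file."""
    with Path(dest).open("wb") as out:
        for part in sorted(Path(p) for p in parts):
            out.write(part.read_bytes())


# Example (hypothetical file name): 9 MiB parts stay under a 10 MB limit.
# parts = split_file("core.gz", 9 * 1024 * 1024, "core_parts")
# join_files(parts, "core_rejoined.gz")
```

Whoever downloads the parts would run `join_files` on them and then gunzip the result as usual. Each part is read fully into memory, which should be fine at this size.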