HPCC-21091

Segfault during Smart Join

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.4.24
    • Fix Version/s: 7.0.6
    • Component/s: Thor
    • Labels: None

      Description

      The thorslave segfaulted while running this job. It appears to have generated core files as well.

       

      TS logs

      10.173.71.44:/var/lib/HPCCSystems/thor100_71_5/thorslave.94.2018_11_29.log

      0032F51B 2018-11-29 16:17:00.714 189606 1471742 "recvLoop - received bcast_stop, from : node=3, slave=3 - activity(ch=0, smartjoin, 1317)"

      0032F51C 2018-11-29 16:17:00.785 189606 1471729 "clearNonLocalRows[slave=2], numCommitted=87621, totalRows(inc uncommitted)=87646, flushMarker=0 - activity(ch=0, smartjoin, 1317)"

      0032F51D 2018-11-29 16:17:00.789 189606 1471729 "clearAllNonLocalRows(100): CThorSpillableRowArray::save (skipNulls=true, emptyRowSemantics=0) max rows = 87621 - activity(ch=0, smartjoin, 1317)"

      0032F51E 2018-11-29 16:17:00.802 189606 1471729 "clearAllNonLocalRows(100): CThorSpillableRowArray::save done, rows written = 87621, bytes = 1051452 - activity(ch=0, smartjoin, 1317)"

      0032F51F 2018-11-29 16:17:00.944 189606 1471729 "clearNonLocalRows[slave=2], numCommitted=67, totalRows(inc uncommitted)=87711, flushMarker=0 - activity(ch=0, smartjoin, 1317)"

      0032F520 2018-11-29 16:17:00.944 189606 1471729 "================================================"

      0032F521 2018-11-29 16:17:00.944 189606 1471729 "Program:   10.173.71.44:/mnt/disk1/HPCCSystems/bin/thorslave_lcr"

      0032F522 2018-11-29 16:17:00.944 189606 1471729 "Signal:    11 Segmentation fault"

      0032F523 2018-11-29 16:17:00.944 189606 1471729 "Fault IP:  00007FF65D81EF22"

      0032F524 2018-11-29 16:17:00.944 189606 1471729 "Accessing: 0000000000000000"

      0032F525 2018-11-29 16:17:00.944 189606 1471729 "Backtrace:"

      0032F526 2018-11-29 16:17:00.961 189606 1471729 "  /var/lib/HPCCSystems/queries/thor100_71_5_24200/V4167451050_libW20181129-154014.so(+0x6d3f22) [0x7ff65d81ef22]"

      0032F527 2018-11-29 16:17:00.961 189606 1471729 "  /opt/HPCCSystems/lib/libactivityslaves_lcr.so(+0xe44ab) [0x7ff669eca4ab]"

      0032F528 2018-11-29 16:17:00.961 189606 1471729 "  /opt/HPCCSystems/lib/libactivityslaves_lcr.so(+0xe45e3) [0x7ff669eca5e3]"

      0032F529 2018-11-29 16:17:00.961 189606 1471729 "  /opt/HPCCSystems/lib/libroxiemem.so(+0x16432) [0x7ff66488a432]"

      0032F52A 2018-11-29 16:17:00.961 189606 1471729 "  /opt/HPCCSystems/lib/libroxiemem.so(+0x16660) [0x7ff66488a660]"

      0032F52B 2018-11-29 16:17:00.962 189606 1471729 "  /opt/HPCCSystems/lib/libjlib.so(_ZN6Thread5beginEv+0x2c) [0x7ff6645c5cbc]"

      0032F52C 2018-11-29 16:17:00.962 189606 1471729 "  /opt/HPCCSystems/lib/libjlib.so(_ZN6Thread11_threadmainEPv+0x1e) [0x7ff6645c768e]"

      0032F52D 2018-11-29 16:17:00.962 189606 1471729 "  /lib64/libpthread.so.0(+0x7e25) [0x7ff6631dee25]"

      0032F52E 2018-11-29 16:17:00.962 189606 1471729 "  /lib64/libc.so.6(clone+0x6d) [0x7ff662f08bad]"

      0032F52F 2018-11-29 16:17:00.962 189606 1471729 "Registers:"

      0032F530 2018-11-29 16:17:00.962 189606 1471729 "EAX:0000000000000000  EBX:0000000000000000  ECX:0000000000000000  EDX:00007FF64D9C0080  ESI:0000000000000000  EDI:00000000015F4168"

      0032F531 2018-11-29 16:17:00.962 189606 1471729 "R8 :0000000000000001  R9 :00007FF662E5716D  R10:61202D20303D7265  R11:0000000000000000"

      0032F532 2018-11-29 16:17:00.962 189606 1471729 "R12:00007FF410008440  R13:0000000000000000  R14:0000000000000043  R15:00007FF4100084B0"

      0032F533 2018-11-29 16:17:00.962 189606 1471729 "CS:EIP:0033:00007FF65D81EF22"

      0032F534 2018-11-29 16:17:00.962 189606 1471729 "   ESP:00007FF40DFFA320  EBP:00007FF40DFFA350"

      0032F535 2018-11-29 16:17:00.962 189606 1471729 "Stack[00007FF40DFFA320]: 0000000000000000 015F416800000000 00000000015F4168 0DFFA3C000000000 00007FF40DFFA3C0 645402DC00007FF4 00007FF6645402DC 0000000000007FF6"

      0032F536 2018-11-29 16:17:00.962 189606 1471729 "Stack[00007FF40DFFA340]: 00007FF400000000 0000000000007FF4 0000000000000000 0193A07000000000 000000000193A070 69ECA4AB00000000 00007FF669ECA4AB 0DFFA3E000007FF6"

      0032F537 2018-11-29 16:17:00.962 189606 1471729 "Stack[00007FF40DFFA360]: 00007FF40DFFA3E0 1000844000007FF4 00007FF410008440 0193A07000007FF4 000000000193A070 0000000200000000 0000000000000002 0000006400000000"

      0032F538 2018-11-29 16:17:00.962 189606 1471729 "Stack[00007FF40DFFA380]: 0000000000000064 0000000200000000 0000000000000002 0193A4C000000000 000000000193A4C0 69ECA5E300000000 00007FF669ECA5E3 100360F000007FF6"

      0032F539 2018-11-29 16:17:00.962 189606 1471729 "Stack[00007FF40DFFA3A0]: 00007FF4100360F0 6487FE8100007FF4 00007FF66487FE81 0193A53000007FF6 000000000193A530 8870000000000000 0000000188700000 0000000100000001"

      0032F53A 2018-11-29 16:17:00.962 189606 1471729 "Stack[00007FF40DFFA3C0]: 0000000000000001 100015D000000000 00007FF4100015D0 540022A000007FF4 00007FF6540022A0 0000000A00007FF6 000008000000000A 0000000000000800"

      0032F53B 2018-11-29 16:17:00.962 189606 1471729 "Stack[00007FF40DFFA3E0]: 0000000000000000 648899E700000000 00007FF6648899E7 0000000000007FF6 00007FF600000000 0000004E00007FF6 000000800000004E 8884003800000080"

      0032F53C 2018-11-29 16:17:00.962 189606 1471729 "Stack[00007FF40DFFA400]: 00007FF488840038 1000163000007FF4 00007FF410001630 5400185000007FF4 00007FF654001850 64889D1400007FF6 00007FF664889D14 0000021000007FF6"

      0032F53D 2018-11-29 16:17:00.962 189606 1471729 "ThreadList:

      7FF6617E6700 140696174356224 189613: CMPNotifyClosedThread

      7FF660FE5700 140696165963520 189614: CSocketBaseThread

      7FF6607E4700 140696157570816 189615: MP Connection Thread

      7FF65FFE3700 140696149178112 189617: CMemoryUsageReporter

      7FF65F7E2700 140696140785408 189619: CBackupHandler

      7FF65EFE1700 140696132392704 189621: CGraphProgressHandler

      7FF40DFFB700 140686183610112 1471729: BackgroundReleaseBufferThread

      7FF42CBF2700 140686699472640 1471733: ProcessSlaveActivity

      7FF4529B6700 140687334663936 1471734: CGraphExecutor pool

      7FF41D7FA700 140686443652864 1471742: CBroadcaster::CRecv

      7FF4539B8700 140687351449344 1471743: CBroadcaster::CSend

      7FF4531B7700 140687343056640 1471744: CRowProcessor

      7FF4521B5700 140687326271232 1471745: CDistributorBase::cRecvThread

      7FF41FFFF700 140686485616384 1471746: CDistributorBase::cSendThread

      7FF41F7FE700 140686477223680 1471747: CDistributorBase::cRecvThread

      7FF41EFFD700 140686468830976 1471748: CDistributorBase::cSendThread

      7FF41DFFB700 140686452045568 1471752: CRowStreamLookAhead
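
The backtrace above can be turned into inputs for a symbolizer. The following is a minimal sketch rather than part of the original report. It assumes that each frame has the form `module(symbol+0xOFFSET) [0xADDRESS]`, that an empty symbol means the offset is relative to the module's load address, and that the fifth field of each log line is the id of the thread that wrote it. The helper names (`parse_frame`, `addr2line_hint`, `thread_name`) are hypothetical, and the suggested `addr2line` invocation only resolves names if the matching binary with debug information is available.

```python
import re

# Frame text copied from the backtrace above, with the log prefix removed.
FRAMES = [
    "/var/lib/HPCCSystems/queries/thor100_71_5_24200/"
    "V4167451050_libW20181129-154014.so(+0x6d3f22) [0x7ff65d81ef22]",
    "/opt/HPCCSystems/lib/libactivityslaves_lcr.so(+0xe44ab) [0x7ff669eca4ab]",
    "/opt/HPCCSystems/lib/libjlib.so(_ZN6Thread5beginEv+0x2c) [0x7ff6645c5cbc]",
]

# Thread-list entries copied from the log above.
THREADS = [
    "7FF40DFFB700 140686183610112 1471729: BackgroundReleaseBufferThread",
    "7FF42CBF2700 140686699472640 1471733: ProcessSlaveActivity",
]

FRAME_RE = re.compile(
    r"^(?P<module>\S+)\((?P<symbol>[^+)]*)\+0x(?P<offset>[0-9a-fA-F]+)\)"
    r" \[0x(?P<addr>[0-9a-fA-F]+)\]$"
)
THREAD_RE = re.compile(r"^[0-9A-Fa-f]+ \d+ (?P<tid>\d+): (?P<name>.+)$")


def parse_frame(line):
    """Split one backtrace frame into module, symbol, offset and address."""
    m = FRAME_RE.match(line.strip())
    if m is None:
        return None
    symbol = m.group("symbol") or None
    offset = int(m.group("offset"), 16)
    addr = int(m.group("addr"), 16)
    # Assumption: with no symbol, the offset is module-relative, so
    # address minus offset gives the module's load base.
    load_base = addr - offset if symbol is None else None
    return {"module": m.group("module"), "symbol": symbol,
            "offset": offset, "addr": addr, "load_base": load_base}


def addr2line_hint(frame):
    """Build an addr2line command for a module-relative frame, else None."""
    if frame is None or frame["symbol"] is not None:
        return None
    return "addr2line -C -f -e {} {}".format(frame["module"], hex(frame["offset"]))


def thread_name(tid, thread_lines):
    """Look up a thread id in ThreadList-style lines."""
    for line in thread_lines:
        m = THREAD_RE.match(line.strip())
        if m and int(m.group("tid")) == tid:
            return m.group("name")
    return None


if __name__ == "__main__":
    for text in FRAMES:
        frame = parse_frame(text)
        print(frame["module"], frame["symbol"], hex(frame["offset"]))
        hint = addr2line_hint(frame)
        if hint:
            print("  " + hint)
    # 1471729 is the thread id on every line of the crash report above.
    print(thread_name(1471729, THREADS))
```

Under those assumptions, the faulting frame sits at offset 0x6d3f22 in the generated query library, which implies a load base of 0x7ff65d14b000, and thread 1471729 is listed as BackgroundReleaseBufferThread. The two mangled names in the libjlib.so frames demangle to `Thread::begin()` and `Thread::_threadmain(void*)`.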

       

      From Jake:

      Not sure what the root cause is, but it's crashing during a Smart Join, whilst spilling.
      As the RHS of this join is pretty big and it's spilling quite a bit, it may be better to use a standard join rather than a smart join.
      If the bug is in Smart Join, as it appears, using a standard join will work around the problem.

            People

            • Assignee: Jake Smith (jakesmith)
            • Reporter: Russell Wagner (rwagner42)