HPCC / HPCC-15737

sasha crashing when archiving a workunit


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Timed Out
    • Affects Version/s: 6.0.0
    • Fix Version/s: None
    • Component/s: Sasha
    • Labels: None

    Description

      Sasha dumped core while archiving workunits on the 190 cluster.

      The node in question is 10.239.190.101.

      Here is the excerpt from the logs:
      00001E73 2016-06-14 12:43:51.866 3701 3709 "ARCHIVE: Scanning WorkUnits limit=1000"
      00001E74 2016-06-14 12:43:52.385 3701 3709 "ARCHIVE count=1001 ignored=436 later=7 nulltimes=0 protected=0"
      00001E75 2016-06-14 12:43:52.385 3701 3709 "ARCHIVE: WorkUnits - 1 to archive, 0 to backup"
      00001E76 2016-06-14 12:43:58.006 3701 3709 "================================================"
      00001E77 2016-06-14 12:43:58.006 3701 3709 "Signal: 11 Segmentation fault"
      00001E78 2016-06-14 12:43:58.006 3701 3709 "Fault IP: 00007F523795C1AE"
      00001E79 2016-06-14 12:43:58.006 3701 3709 "Accessing: 0000000000000000"
      00001E7A 2016-06-14 12:43:58.006 3701 3709 "Registers:"
      00001E7B 2016-06-14 12:43:58.006 3701 3709 "EAX:737953434350482F EBX:00007F521863AAC0 ECX:000000032BCB6980 EDX:0000000072657673 ESI:0000000000000000 EDI:0000000000000001"
      00001E7C 2016-06-14 12:43:58.006 3701 3709 "CS:EIP:0033:00007F523795C1AE"
      00001E7D 2016-06-14 12:43:58.006 3701 3709 " ESP:00007F5231EA4800 EBP:00007F521863AAD8"
      00001E7E 2016-06-14 12:43:58.006 3701 3709 "Stack[00007F5231EA4800]: 00007F521863AAD0 364CA6B100007F52 00007F52364CA6B1 1867099000007F52 00007F5218670990 1867099000007F52 00007F5218670990 3795C45000007F52"
      00001E7F 2016-06-14 12:43:58.006 3701 3709 "Stack[00007F5231EA4820]: 00007F523795C450 180B5B2000007F52 00007F52180B5B20 31EA48C000007F52 00007F5231EA48C0 180B5B3800007F52 00007F52180B5B38 1863AAD000007F52"
      00001E80 2016-06-14 12:43:58.006 3701 3709 "Stack[00007F5231EA4840]: 00007F521863AAD0 3B80022600007F52 00007F523B800226 0000000000007F52 00007F5200000000 31EA490000007F52 00007F5231EA4900 31EA4AE000007F52"
      00001E81 2016-06-14 12:43:58.006 3701 3709 "Stack[00007F5231EA4860]: 00007F5231EA4AE0 1809D50000007F52 00007F521809D500 31EA495800007F52 00007F5231EA4958 31EA499800007F52 00007F5231EA4998 31EA48E000007F52"
      00001E82 2016-06-14 12:43:58.007 3701 3709 "Stack[00007F5231EA4880]: 00007F5231EA48E0 1809F4B000007F52 000000011809F4B0 0064244000000001 0000000000642440 1809F30000000000 00007F521809F300 31EA495800007F52"
      00001E83 2016-06-14 12:43:58.007 3701 3709 "Stack[00007F5231EA48A0]: 00007F5231EA4958 31EA494000007F52 00007F5231EA4940 31EA498000007F52 00007F5231EA4980 180320E000007F52 00007F52180320E0 0000000000007F52"
      00001E84 2016-06-14 12:43:58.007 3701 3709 "Stack[00007F5231EA48C0]: 0000000000000000 FFFF000000000000 16BEEF0AFFFF0000 0000000016BEEF0A 0000000000000000 0041CF0C00000000 000000000041CF0C 3B8354E000000000"
      00001E85 2016-06-14 12:43:58.007 3701 3709 "Stack[00007F5231EA48E0]: 00007F523B8354E0 186727E000007F52 00007F52186727E0 94D8231900007F52 00F0A84394D82319 0122126000F0A843 0000000001221260 0064339000000000"
      00001E86 2016-06-14 12:43:58.007 3701 3709 "Backtrace:"
      00001E87 2016-06-14 12:43:58.007 3701 3709 " /opt/HPCCSystems/lib/libjlib.so(+0xe09e8) [0x7f52378b19e8]"
      00001E88 2016-06-14 12:43:58.007 3701 3709 " /opt/HPCCSystems/lib/libjlib.so(_Z13excsighandleriP7siginfoPv+0x21c) [0x7f52378b33fc]"
      00001E89 2016-06-14 12:43:58.007 3701 3709 " /lib64/libpthread.so.0(+0xf710) [0x7f52367f3710]"
      00001E8A 2016-06-14 12:43:58.007 3701 3709 " /opt/HPCCSystems/lib/libjlib.so(_ZN16CWorkQueueThread4postEP14IWorkQueueItem+0x7e) [0x7f523795c1ae]"
      00001E8B 2016-06-14 12:43:58.007 3701 3709 " /opt/HPCCSystems/lib/libworkunit.so(_ZN14CLocalWorkUnit16cleanupAndDeleteEbbPK11StringArray+0x646) [0x7f523b800226]"
      00001E8C 2016-06-14 12:43:58.007 3701 3709 " /opt/HPCCSystems/lib/libworkunit.so(_ZN13CDaliWorkUnit16cleanupAndDeleteEbbPK11StringArray+0x10) [0x7f523b819f80]"
      00001E8D 2016-06-14 12:43:58.007 3701 3709 " /opt/HPCCSystems/lib/libworkunit.so(_ZN14CLocalWorkUnit15archiveWorkUnitEPKcbbbb+0x9aa) [0x7f523b7fe3da]"
      00001E8E 2016-06-14 12:43:58.007 3701 3709 " saserver() [0x4167bb]"
      00001E8F 2016-06-14 12:43:58.007 3701 3709 " saserver(_ZN17CWorkUnitArchiver13cWUBranchItem7archiveEv+0x52) [0x41bbd2]"
      00001E90 2016-06-14 12:43:58.007 3701 3709 " saserver(_ZN15CBranchArchiver6actionEv+0x563) [0x41b6f3]"
      00001E91 2016-06-14 12:43:58.008 3701 3709 " saserver(_ZN20CSashaArchiverServer3runEv+0xab4) [0x41e004]"
      00001E92 2016-06-14 12:43:58.008 3701 3709 " /opt/HPCCSystems/lib/libjlib.so(_ZN6Thread5beginEv+0x2c) [0x7f5237957abc]"
      00001E93 2016-06-14 12:43:58.008 3701 3709 " /opt/HPCCSystems/lib/libjlib.so(_ZN6Thread11_threadmainEPv+0x1e) [0x7f523795945e]"
      00001E94 2016-06-14 12:43:58.008 3701 3709 " /lib64/libpthread.so.0(+0x79d1) [0x7f52367eb9d1]"
      00001E95 2016-06-14 12:43:58.008 3701 3709 " /lib64/libc.so.6(clone+0x6d) [0x7f52365388fd]"
      00001E96 2016-06-14 12:43:58.008 3701 3709 "ThreadList:
      7F52350AB700 139991053940480 3702: CMPNotifyClosedThread
      7F52346AA700 139991043450624 3703: CSocketBaseThread
      7F5233CA9700 139991032960768 3704: MP Connection Thread
      7F52332A8700 139991022470912 3706: LogMsgParentReceiver
      7F522BFFF700 139990902241024 3707: LogMsgFilterReceiver
      7F52328A7700 139991011981056 3708: CMemoryUsageReporter
      7F5231EA6700 139991001491200 3709: CSashaArchiverServer
      7F52314A5700 139990991001344 3710: CSashaSDSCoalescingServer
      7F5230AA4700 139990980511488 3711: CSashaXRefServer
      7F522B5FE700 139990891751168 3712: Stopped CSashaDaFSMonitorServer
      7F522ABFD700 139990881261312 3713: Stopped CSashaQMonitorServer
      7F522A1FC700 139990870771456 3714: CSashaExpiryServer
      7F522B5FE700 139990891751168 3761: CStopThread
      7F522ABFD700 139990881261312 13026: Member of thread pool: sachaCmdPool
      "
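      One detail in the register dump can be checked mechanically. The sketch below reinterprets a register value as eight little-endian bytes and prints any printable ASCII. The helper name `register_as_ascii` is hypothetical, and the register names are taken as logged (the values are 64 bits wide). It is offered as an observation, not a diagnosis: a register holding text rather than an address may be consistent with a pointer that was overwritten by string data, but nothing in this report confirms that.

```python
def register_as_ascii(hex_value: str) -> str:
    """Render a 64-bit register value as little-endian bytes, keeping printable ASCII.

    Non-printable bytes are shown as '.'.
    """
    raw = int(hex_value, 16).to_bytes(8, "little")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# Values copied from the "Registers:" line of the log excerpt.
registers = {
    "EAX": "737953434350482F",
    "EDX": "0000000072657673",
}

for name, value in registers.items():
    print(f"{name} {value} -> {register_as_ascii(value)!r}")
# EAX decodes to '/HPCCSys' and EDX to 'sver....'
```

      EAX reads as the start of a path-like string, which is unusual content for a register at the point of a null-pointer access and may be worth examining in the core file.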

      The core file can be found at /var/lib/HPCCSystems/mysasha.

      Here is the info from the core:
      Program terminated with signal 11, Segmentation fault.
      #0 0x00007f523795c1ae in CWorkQueueThread::post(IWorkQueueItem*) () from /opt/HPCCSystems/lib/libjlib.so
      Missing separate debuginfos, use: debuginfo-install hpccsystems-platform-6.0.0-2.x86_64
      (gdb) where
      #0 0x00007f523795c1ae in CWorkQueueThread::post(IWorkQueueItem*) () from /opt/HPCCSystems/lib/libjlib.so
      #1 0x00007f523b800226 in CLocalWorkUnit::cleanupAndDelete(bool, bool, StringArray const*) ()
      from /opt/HPCCSystems/lib/libworkunit.so
      #2 0x00007f523b819f80 in CDaliWorkUnit::cleanupAndDelete(bool, bool, StringArray const*) ()
      from /opt/HPCCSystems/lib/libworkunit.so
      #3 0x00007f523b7fe3da in CLocalWorkUnit::archiveWorkUnit(char const*, bool, bool, bool, bool) ()
      from /opt/HPCCSystems/lib/libworkunit.so
      #4 0x00000000004167bb in _start ()
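      Because the installed binaries have no debuginfo, gdb stops resolving frames at `_start ()`. As a minimal sketch, the backtrace lines in the log can still be converted into module-relative offsets. This assumes that a frame printed without a symbol, such as `libjlib.so(+0xe09e8)`, gives its offset relative to the library's load address, which is how glibc's `backtrace_symbols` normally formats it. `FRAME_RE` and `parse_frame` are hypothetical helpers written for this illustration.

```python
import re

# Matches frames of the form: module(symbol+0xOFFSET) [0xADDRESS]
# The symbol part may be empty, e.g. "libjlib.so(+0xe09e8) [0x7f52378b19e8]".
FRAME_RE = re.compile(
    r"(?P<module>\S+)\((?P<symbol>[^)+]*)\+0x(?P<offset>[0-9a-fA-F]+)\)"
    r" \[0x(?P<addr>[0-9a-fA-F]+)\]"
)


def parse_frame(line):
    """Return (module, symbol, offset, address) or None if the line has no offset."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return (
        m.group("module"),
        m.group("symbol"),
        int(m.group("offset"), 16),
        int(m.group("addr"), 16),
    )


# Two libjlib.so frames copied from the log excerpt.
anchor = parse_frame(" /opt/HPCCSystems/lib/libjlib.so(+0xe09e8) [0x7f52378b19e8]")
fault = parse_frame(
    " /opt/HPCCSystems/lib/libjlib.so"
    "(_ZN16CWorkQueueThread4postEP14IWorkQueueItem+0x7e) [0x7f523795c1ae]"
)

# Assumption: the symbol-less frame's offset is relative to the load address.
load_base = anchor[3] - anchor[2]
fault_offset = fault[3] - load_base

print(hex(load_base))     # 0x7f52377d1000
print(hex(fault_offset))  # 0x18b1ae
```

      If the assumption holds, the faulting instruction sits at offset 0x18b1ae in libjlib.so, and that offset could be passed to a tool such as `addr2line -e` against a copy of the library with matching debuginfo to obtain a source line.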

      jakesmith stuartort

          People

            Assignee: anybody (Available for anyone)
            Reporter: cloln (Chris Lo)