HPCC / HPCC-22175

Apparent Deadlock on python threads acquiring lock

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Not specified
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.2.16
    • Component/s: None
    • Labels: None

      Description

      My test program always runs successfully on hthor, but often freezes when running on Thor. The same job will sometimes run and sometimes freeze, so there is probably a race condition.

      The program calls embedded Python activities that both receive and return STREAMED DATASETs. I tried to reproduce this with a small test program, but was not able to get the right combination to occur.
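      For context, the embed pattern described above looks roughly like the following (a hypothetical sketch, not the actual failing job): the embedded body is a Python generator that consumes the incoming streamed dataset and yields rows back. Each row pulled from the input iterator re-enters the engine (via ECLDatasetIterator_iternext in the traces below), and each yielded row is pulled by the engine via PyIter_Next, so the Python code and the engine call into each other on every row.

      ```python
      # Hypothetical shape of the embedded Python body the reporter describes.
      # 'ds' stands in for the incoming STREAMED DATASET (an iterator of rows);
      # the generator returned stands in for the outgoing STREAMED DATASET.
      def stream_passthrough(ds):
          for row in ds:    # each pull here calls back into the engine for the next input row
              yield row     # each yield here is pulled by the engine via PyIter_Next

      # Standalone demonstration with a plain Python iterator as the input stream:
      rows = list(stream_passthrough(iter([(1,), (2,), (3,)])))
      print(rows)  # [(1,), (2,), (3,)]
      ```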

      Jake captured the following traces of the two threads' activities:

      Thread 31 (Thread 0x7f3cf5dfe700 (LWP 16437)):
      #0  0x00007f3dced34adb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
      #1  0x00007f3dced34b6f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
      #2  0x00007f3dced34c0b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
      #3  0x00007f3dc5316735 in PyThread_acquire_lock () from /lib64/libpython2.7.so.1.0
      #4  0x00007f3dc52e37de in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
      #5  0x00007f3dc526b2b8 in gen_send_ex.isra.0 () from /lib64/libpython2.7.so.1.0
      #6  0x00007f3dc5251e3b in PyIter_Next () from /lib64/libpython2.7.so.1.0
      #7  0x00007f3dc55e8d8d in py2embed::PythonRowStream::nextRow (this=0x7f3cdc001b80) at /mnt/disk1/jenkins/workspace/LN-with-Plugins-Spark-7.2.x-Nightly-Build/LN/centos-7.0-x86_64/HPCC-Platform/plugins/pyembed/pyembed.cpp:1327
      #8  0x00007f3dd64cdc0c in nextRowNoCatch (this=0x11d4010) at /mnt/disk1/jenkins/workspace/LN-with-Plugins-Spark-7.2.x-Nightly-Build/LN/centos-7.0-x86_64/HPCC-Platform/thorlcr/activities/iterate/thiterateslave.cpp:414

       

      Thread 29 (Thread 0x7f3ce3fff700 (LWP 16439)):
      #0  0x00007f3dced354ed in __lll_lock_wait () from /lib64/libpthread.so.0
      #1  0x00007f3dced30de6 in _L_lock_941 () from /lib64/libpthread.so.0
      #2  0x00007f3dced30cdf in pthread_mutex_lock () from /lib64/libpthread.so.0
      #3  0x00007f3dd652dba2 in enter (this=0x11d4c78) at /mnt/disk1/jenkins/workspace/LN-with-Plugins-Spark-7.2.x-Nightly-Build/LN/centos-7.0-x86_64/HPCC-Platform/thorlcr/activities/./../../system/jlib/jmutex.hpp:300
      #4  CriticalBlock (c=..., this=<synthetic pointer>) at /mnt/disk1/jenkins/workspace/LN-with-Plugins-Spark-7.2.x-Nightly-Build/LN/centos-7.0-x86_64/HPCC-Platform/thorlcr/activities/./../../system/jlib/jmutex.hpp:341
      #5  writeahead (outIdx=0, writeBlockSem=..., stopped=<optimized out>, current=0, this=<optimized out>, this=<optimized out>) at /mnt/disk1/jenkins/workspace/LN-with-Plugins-Spark-7.2.x-Nightly-Build/LN/centos-7.0-x86_64/HPCC-Platform/thorlcr/activities/nsplitter/thnsplitterslave.cpp:268
      #6  CSplitterOutput::nextRow (this=0x11d4ef0) at /mnt/disk1/jenkins/workspace/LN-with-Plugins-Spark-7.2.x-Nightly-Build/LN/centos-7.0-x86_64/HPCC-Platform/thorlcr/activities/nsplitter/thnsplitterslave.cpp:479
      #7  0x00007f3dd64a02e6 in nextRowNoCatch (this=0x11d5b20) at /mnt/disk1/jenkins/workspace/LN-with-Plugins-Spark-7.2.x-Nightly-Build/LN/centos-7.0-x86_64/HPCC-Platform/thorlcr/activities/filter/thfilterslave.cpp:83
      #8  CFilterSlaveActivity::nextRow (this=0x11d5b20) at /mnt/disk1/jenkins/workspace/LN-with-Plugins-Spark-7.2.x-Nightly-Build/LN/centos-7.0-x86_64/HPCC-Platform/thorlcr/activities/filter/thfilterslave.cpp:78
      #9  0x00007f3dc55e52e9 in ungroupedNextRow (this=0x11d5c68) at /mnt/disk1/jenkins/workspace/LN-with-Plugins-Spark-7.2.x-Nightly-Build/LN/centos-7.0-x86_64/HPCC-Platform/plugins/pyembed/./../../system/jlib/jio.hpp:171
      #10 py2embed::ECLDatasetIterator_iternext (self=0x7f3dd68bd530) at /mnt/disk1/jenkins/workspace/LN-with-Plugins-Spark-7.2.x-Nightly-Build/LN/centos-7.0-x86_64/HPCC-Platform/plugins/pyembed/pyembed.cpp:1201
      #11 0x00007f3dc52e4c41 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
      #12 0x00007f3dc526b2b8 in gen_send_ex.isra.0 () from /lib64/libpython2.7.so.1.0
      #13 0x00007f3dc5251e3b in PyIter_Next () from /lib64/libpython2.7.so.1.0
      #14 0x00007f3dc55e8d8d in py2embed::PythonRowStream::nextRow (this=0x7f3dc00035b0) at /mnt/disk1/jenkins/workspace/LN-with-Plugins-Spark-7.2.x-Nightly-Build/LN/centos-7.0-x86_64/HPCC-Platform/plugins/pyembed/pyembed.cpp:1327
      #15 0x00007f3dd649a523 in nextRowNoCatch (this=0x11d5df0) at /mnt/disk1/jenkins/workspace/LN-with-Plugins-Spark-7.2.x-Nightly-Build/LN/centos-7.0-x86_64/HPCC-Platform/thorlcr/activities/external/thexternalslave.cpp:101

       
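      Read together, the two traces suggest a classic lock-order inversion: Thread 29 holds the Python GIL (it is executing inside PyEval_EvalFrameEx) and is blocked on the nsplitter critical section in writeahead(), while Thread 31 is blocked waiting for the GIL inside PyIter_Next, presumably while holding the lock Thread 29 wants (its trace is truncated, so that side is an inference). The general pattern can be sketched with plain Python locks as stand-ins for the GIL and the CriticalSection; timeouts are used so the demonstration reports the mutual blocking instead of actually hanging:

      ```python
      import threading

      gil = threading.Lock()   # stand-in for CPython's GIL
      crit = threading.Lock()  # stand-in for the splitter's CriticalBlock mutex

      a_holds = threading.Event()
      b_holds = threading.Event()
      blocked = []             # records which threads timed out waiting

      def splitter_side():
          # Like Thread 31: holds the native critical section, then needs the GIL.
          with crit:
              a_holds.set()
              b_holds.wait()                    # ensure the other thread holds the GIL
              if not gil.acquire(timeout=0.2):  # PyIter_Next blocks here in the real trace
                  blocked.append("splitter")
              else:
                  gil.release()

      def python_side():
          # Like Thread 29: holds the GIL, then calls back into the engine,
          # which needs the same critical section inside writeahead().
          with gil:
              b_holds.set()
              a_holds.wait()                     # ensure the other thread holds crit
              if not crit.acquire(timeout=0.2):  # pthread_mutex_lock blocks here in the real trace
                  blocked.append("python")
              else:
                  crit.release()

      t1 = threading.Thread(target=splitter_side)
      t2 = threading.Thread(target=python_side)
      t1.start(); t2.start()
      t1.join(); t2.join()
      print(sorted(blocked))  # ['python', 'splitter'] -- each thread waits on the lock the other holds
      ```

      In the real code neither acquire has a timeout, so the two threads block each other permanently; the usual remedy for this shape of bug is to release the GIL before taking the native lock (or never call into Python while holding it).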

      Here is a job that hung on the 160 cluster (http://10.173.160.101:8010/): W20190520-174357

      People

      • Assignee: richardkchapman Richard Chapman
      • Reporter: rdev Roger Dev
      • Votes: 0
      • Watchers: 3