  1. HPCC
  2. HPCC-14606

Esp can deadlock whilst trying to update a logical file description.


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.0.0
    • Component/s: DFS
    • Labels:
      None

      Description

      This may not be specific to Esp; I think it could also happen via some fileservice calls.

      This is a section from a reply where the incident was spotted on a stalled esp:

      It is choking because one of its threads is blocked indefinitely. This is the thread:
      > Thread 71 (Thread 0x7ff36d5e8700 (LWP 65792)):
      > #0 0x00007ff3f2dd1720 in sem_wait () from /lib64/libpthread.so.0
      > #1 0x00007ff3f4b8b2a5 in Semaphore::wait(unsigned int) () from /opt/HPCCSystems/lib/libjlib.so
      > #2 0x00007ff3f4815dff in CMPServer::recv(CMessageBuffer&, SocketEndpoint const*, mptag_t, CTimeMon&) () from /opt/HPCCSystems/lib/libmp.so
      > #3 0x00007ff3f481a9b0 in CCommunicator::recv(CMessageBuffer&, unsigned int, mptag_t, unsigned int*, unsigned int) () from /opt/HPCCSystems/lib/libmp.so
      > #4 0x00007ff3f3dbd3c1 in CCovenClient::sendRecv(CMessageBuffer&, unsigned int, mptag_t, unsigned int) () from /opt/HPCCSystems/lib/libdalibase.so
      > #5 0x00007ff3f3dc6f91 in CClientSDSManager::changeMode(CRemoteConnection&, unsigned int, unsigned int, bool) () from /opt/HPCCSystems/lib/libdalibase.so
      > #6 0x00007ff3f3e66963 in safeChangeModeWrite(IRemoteConnection*, char const*, bool&, unsigned int) () from /opt/HPCCSystems/lib/libdalibase.so
      > #7 0x00007ff3f3df0adc in CDistributedFileBase<IDistributedFile>::lockProperties(unsigned int) () from /opt/HPCCSystems/lib/libdalibase.so
      > #8 0x00007ff3b4a8cd25 in CWsDfuEx::doGetFileDetails(IEspContext&, IUserDescriptor*, char const*, char const*, char const*, IEspDFUFileDetail&) () from /opt/HPCCSystems/lib/libws_dfu.so
      > #9 0x00007ff3b4a8e7e1 in CWsDfuEx::onDFUInfo(IEspContext&, IEspDFUInfoRequest&, IEspDFUInfoResponse&) () from /opt/HPCCSystems/lib/libws_dfu.so
      > #10 0x00007ff3b4a0e1e1 in ws_dfu::CWsDfuSoapBinding::onGetInstantQuery(IEspContext&, CHttpRequest*, CHttpResponse*, char const*, char const*) () from /opt/HPCCSystems/

      It looks like other threads are piling up behind it, trying to get details on the same file.
      There are 15 threads in 'doGetFileDetails'.

      The code here locks the file for read access, then tries to update 'description'.

      What I think is happening is that the update to description tries to get an exclusive lock. That will fail if there are other readers, e.g. any one of the other 15 threads reading the same file details.
      When it times out trying to get the lock, it releases the lock completely, sleeps a bit to give someone else a chance, then tries again.
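To illustrate why the exclusive request fails while other readers are present, here is a minimal sketch of a shared/exclusive lock. It is a toy model, not the Dali/SDS implementation: `SharedLock`, `acquire_read`, `try_upgrade`, `release` and the thread names are all hypothetical, and the lock never blocks, it only reports success or failure.

```python
class SharedLock:
    """Toy shared/exclusive lock (hypothetical). Tracks holders by name."""

    def __init__(self):
        self.readers = set()
        self.writer = None

    def acquire_read(self, who):
        # Readers are admitted unless someone holds the lock exclusively.
        if self.writer is not None:
            return False
        self.readers.add(who)
        return True

    def try_upgrade(self, who):
        # Promotion to exclusive succeeds only if `who` is the sole reader.
        if self.writer is None and self.readers == {who}:
            self.readers.clear()
            self.writer = who
            return True
        return False

    def release(self, who):
        self.readers.discard(who)
        if self.writer == who:
            self.writer = None


lock = SharedLock()
lock.acquire_read("thread-A")
lock.acquire_read("thread-B")

# Both threads hold read locks and both want exclusive access:
# each one is blocked by the other's read lock.
a_upgraded = lock.try_upgrade("thread-A")
b_upgraded = lock.try_upgrade("thread-B")

# Only once B gives up its read lock can A be promoted.
lock.release("thread-B")
a_after_release = lock.try_upgrade("thread-A")

print(a_upgraded, b_upgraded, a_after_release)  # False False True
```

In this model, every reader that keeps its read lock while waiting for the exclusive lock is itself one of the reasons the other readers cannot be promoted.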

      This is not a good mechanism. It means that, with multiple threads/clients interacting, any one of them that is not in the 'sleep-unlock' phase will block out everyone else.
      With enough clients/threads, at least one will always be awake and retrying while holding a read lock.

      We had a similar situation elsewhere, and the change was to make it release the lock(s) altogether and sleep immediately when there was contention, IOW as soon as it failed to get the exclusive lock it wanted.
      That change meant it was no longer holding read locks for prolonged periods of time.
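The release-then-sleep approach described above could look roughly like the following sketch. It is not the actual HPCC change: `SharedLock` and `update_description` are hypothetical names, the lock is a non-blocking toy model, and the randomized delay is my own assumption (added so that retrying clients do not wake in lockstep), not something stated in this ticket.

```python
import random
import time


class SharedLock:
    """Toy shared/exclusive lock (hypothetical). Never blocks."""

    def __init__(self):
        self.readers = set()
        self.writer = None

    def acquire_read(self, who):
        if self.writer is not None:
            return False
        self.readers.add(who)
        return True

    def try_upgrade(self, who):
        # Promotion succeeds only if `who` is the sole reader.
        if self.writer is None and self.readers == {who}:
            self.readers.clear()
            self.writer = who
            return True
        return False

    def release(self, who):
        self.readers.discard(who)
        if self.writer == who:
            self.writer = None


def update_description(lock, who, apply_update, max_attempts=50,
                       base_delay=0.01, sleep=time.sleep):
    """Retry loop that never keeps the read lock while waiting."""
    for attempt in range(max_attempts):
        if lock.acquire_read(who):
            if lock.try_upgrade(who):
                try:
                    apply_update()
                finally:
                    lock.release(who)
                return True
            # Contention: drop the read lock at once so that this client
            # is not what prevents someone else from being promoted.
            lock.release(who)
        # Assumption: randomized, growing delay before the next attempt.
        sleep(random.uniform(0, base_delay * (attempt + 1)))
    return False


# Usage: another reader holds the file, then goes away while we sleep.
lock = SharedLock()
lock.acquire_read("other-reader")
updated = []


def fake_sleep(_seconds):
    # Stand-in for real time passing: the other reader finishes.
    lock.release("other-reader")


ok = update_description(lock, "esp-thread",
                        lambda: updated.append("new description"),
                        sleep=fake_sleep)
print(ok, updated)  # True ['new description']
```

The point of the design is that a client waiting to retry holds nothing, so the set of read locks can drain and one client can eventually be promoted.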

            People

            • Assignee:
              mckellyln Mark Kelly
              Reporter:
              jakesmith Jake Smith
