  1. HPCC
  2. HPCC-9860

Potential Dali deadlock in session manager if session dies.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • None
    • 4.0.2
    • Component: Dali
    • None

    Description

      A Dali deadlock was observed when a node that was running Dali clients died.
      Dali threads started deadlocking when calling session manager routines, mainly getClientProcessEndpoint(), with stacks like (1):

      000F8EBF 2013-08-14 17:20:00.873 11202 11606 " /opt/HPCCSystems/lib/libjlib.so(_Z16PrintStackReportv+0x26) [0x2ba3d6920166]"
      000F8EC0 2013-08-14 17:20:00.873 11202 11606 " /opt/HPCCSystems/lib/libjlib.so(_ZN20CheckedCriticalBlockC1ER5MutexjPKcj+0x36) [0x2ba3d6962d16]"
      000F8EC1 2013-08-14 17:20:00.873 11202 11606 " /opt/HPCCSystems/lib/libdalibase.so(_ZN20CCovenSessionManager24getClientProcessEndpointExR12StringBuffer+0x46) [0x2ba3d71e1c46]"
      000F8EC2 2013-08-14 17:20:00.873 11202 11606 " /opt/HPCCSystems/lib/libdalibase.so(_ZN9CLockInfo11getLockInfoER12StringBuffer+0x29c) [0x2ba3d71d4b8c]"
      000F8EC3 2013-08-14 17:20:00.873 11202 11606 " /opt/HPCCSystems/lib/libdalibase.so(_ZN16CCovenSDSManager4lockER17CServerRemoteTreePKcxxjjR15IUnlockCallback+0x2be) [0x2ba3d71ba6ce]"
      000F8EC4 2013-08-14 17:20:00.873 11202 11606 " /opt/HPCCSystems/lib/libdalibase.so(_ZN16CCovenSDSManager16createConnectionExjjPKcRP17CServerRemoteTreeRxbR5OwnedI20LinkingCriticalBlockE+0x7e2) [0x2ba3d71c5612]"

      The session manager's mutex was being held by onClose in this thread stack (2):

      000F8F0F 2013-08-14 17:20:00.965 11202 11203 " /opt/HPCCSystems/lib/libjlib.so(_Z16PrintStackReportv+0x26) [0x2ba3d6920166]"
      000F8F10 2013-08-14 17:20:00.965 11202 11203 " /opt/HPCCSystems/lib/libjlib.so(_ZN20CheckedCriticalBlockC1ER5MutexjPKcj+0x36) [0x2ba3d6962d16]"
      000F8F11 2013-08-14 17:20:00.965 11202 11203 " /opt/HPCCSystems/lib/libdalibase.so(_ZN16CCovenSDSManager9unlockAllEx+0x3b) [0x2ba3d71b74eb]"
      000F8F12 2013-08-14 17:20:00.965 11202 11203 " /opt/HPCCSystems/lib/libdalibase.so(_ZN17CServerRemoteTree14COrphanHandler8onRemoveEPv+0x83) [0x2ba3d71d4403]"
      000F8F13 2013-08-14 17:20:00.966 11202 11203 " /opt/HPCCSystems/lib/libjlib.so(_ZN14SuperHashTable10releaseAllEv+0x34) [0x2ba3d69b60a4]"
      000F8F14 2013-08-14 17:20:00.966 11202 11203 " /opt/HPCCSystems/lib/libjlib.so(_ZN14SuperHashTable4killEv+0x16) [0x2ba3d69b62a6]"
      000F8F15 2013-08-14 17:20:00.966 11202 11203 " /opt/HPCCSystems/lib/libdalibase.so(_ZN17CServerRemoteTree14COrphanHandlerD0Ev+0x24) [0x2ba3d71cdf74]"
      000F8F16 2013-08-14 17:20:00.966 11202 11203 " /opt/HPCCSystems/lib/libjlib.so(_ZNK8ChildMap7ReleaseEv+0x3c) [0x2ba3d697819c]"
      000F8F17 2013-08-14 17:20:00.966 11202 11203 " /opt/HPCCSystems/lib/libjlib.so(_ZN5PTreeD2Ev+0x53) [0x2ba3d696a773]"
      000F8F18 2013-08-14 17:20:00.966 11202 11203 " /opt/HPCCSystems/lib/libdalibase.so(_ZN17CServerRemoteTreeD0Ev+0xf6) [0x2ba3d71d16f6]"
      000F8F19 2013-08-14 17:20:00.966 11202 11203 " /opt/HPCCSystems/lib/libdalibase.so(_ZNK15CRemoteTreeBase7ReleaseEv+0x3c) [0x2ba3d71d33dc]"
      000F8F1A 2013-08-14 17:20:00.966 11202 11203 " /opt/HPCCSystems/lib/libdalibase.so(_ZN17CServerConnectionD0Ev+0x14a) [0x2ba3d71c489a]"
      000F8F1B 2013-08-14 17:20:00.966 11202 11203 " /opt/HPCCSystems/lib/libdalibase.so(_ZNK17CServerConnection7ReleaseEv+0x3c) [0x2ba3d71d3b1c]"
      000F8F1C 2013-08-14 17:20:00.966 11202 11203 " /opt/HPCCSystems/lib/libdalibase.so(_ZN19CSessionManagerBase25CSessionSubscriptionProxyD0Ev+0x31) [0x2ba3d71e3af1]"
      000F8F1D 2013-08-14 17:20:00.966 11202 11203 " /opt/HPCCSystems/lib/libdalibase.so(_ZNK19CSessionManagerBase25CSessionSubscriptionProxy7ReleaseEv+0x3c) [0x2ba3d71e5acc]"
      000F8F1E 2013-08-14 17:20:00.966 11202 11203 " /opt/HPCCSystems/lib/libdalibase.so(_ZN20CCovenSessionManager24CSessionSubscriptionStubD0Ev+0x1c) [0x2ba3d71e360c]"
      000F8F1F 2013-08-14 17:20:00.966 11202 11203 " /opt/HPCCSystems/lib/libjlib.so(_ZN13OwningArrayOfIP10CInterfaceRS0_E4killEb+0x8a) [0x2ba3d691018a]"
      000F8F20 2013-08-14 17:20:00.966 11202 11203 " /opt/HPCCSystems/lib/libdalibase.so(_ZN20CCovenSessionManager11stopSessionExb+0x17a) [0x2ba3d71e541a]"
      000F8F21 2013-08-14 17:20:00.966 11202 11203 " /opt/HPCCSystems/lib/libdalibase.so(_ZN20CCovenSessionManager7onCloseER14SocketEndpoint+0x13f) [0x2ba3d71e1ebf]"
      000F8F22 2013-08-14 17:20:00.966 11202 11203 " /opt/HPCCSystems/lib/libmp.so(_ZN21CMPNotifyClosedThread3runEv+0x174) [0x2ba3d6c58fa4]"

      ... which in turn was blocked in CCovenSDSManager::unlockAll().
      This appears to be the responsible thread stack (3):

      000F8EBF 2013-08-14 17:20:00.873 11202 11606 " /opt/HPCCSystems/lib/libjlib.so(_Z16PrintStackReportv+0x26) [0x2ba3d6920166]"
      000F8EC0 2013-08-14 17:20:00.873 11202 11606 " /opt/HPCCSystems/lib/libjlib.so(_ZN20CheckedCriticalBlockC1ER5MutexjPKcj+0x36) [0x2ba3d6962d16]"
      000F8EC1 2013-08-14 17:20:00.873 11202 11606 " /opt/HPCCSystems/lib/libdalibase.so(_ZN20CCovenSessionManager24getClientProcessEndpointExR12StringBuffer+0x46) [0x2ba3d71e1c46]"
      000F8EC2 2013-08-14 17:20:00.873 11202 11606 " /opt/HPCCSystems/lib/libdalibase.so(_ZN9CLockInfo11getLockInfoER12StringBuffer+0x29c) [0x2ba3d71d4b8c]"
      000F8EC3 2013-08-14 17:20:00.873 11202 11606 " /opt/HPCCSystems/lib/libdalibase.so(_ZN16CCovenSDSManager4lockER17CServerRemoteTreePKcxxjjR15IUnlockCallback+0x2be) [0x2ba3d71ba6ce]"
      000F8EC4 2013-08-14 17:20:00.873 11202 11606 " /opt/HPCCSystems/lib/libdalibase.so(_ZN16CCovenSDSManager16createConnectionExjjPKcRP17CServerRemoteTreeRxbR5OwnedI20LinkingCriticalBlockE+0x7e2) [0x2ba3d71c5612]"

      getLockInfo(), which is called to gather lock information following a lock timeout, also fetches session information. It holds the global 'lockCrit' at the time and blocks trying to acquire the session manager's 'sessmanagersect'. That mutex is held by the onClose thread, which is waiting in unlockAll() for 'lockCrit'. Each thread is waiting on a lock the other holds, hence the deadlock.

      That is the diagnosis of the problem.
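
The lock-order inversion described above can be reproduced in miniature. The following is a minimal sketch, not Dali code: the two `std::timed_mutex` objects only borrow the names 'lockCrit' and 'sessmanagersect', and `tryInOrder` and `inversionDetected` are hypothetical helpers. Timed locks are used so the demo reports the cycle instead of hanging.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for the two locks named in the description.
std::timed_mutex lockCrit;         // stands in for the global SDS 'lockCrit'
std::timed_mutex sessmanagersect;  // stands in for the session manager mutex

// Take `first`, wait until the other thread holds its own first lock, then
// try `second` with a timeout. `first` is kept until both threads have
// finished their attempt, so neither can succeed by accident.
bool tryInOrder(std::timed_mutex &first, std::timed_mutex &second,
                std::atomic<int> &holding, std::atomic<int> &finished)
{
    std::lock_guard<std::timed_mutex> outer(first);
    ++holding;
    while (holding.load() < 2)
        std::this_thread::yield();
    bool got = second.try_lock_for(std::chrono::milliseconds(100));
    if (got)
        second.unlock();
    ++finished;
    while (finished.load() < 2)
        std::this_thread::yield();
    return got;
}

// Returns true when both threads timed out on their second lock, i.e. the
// two acquisition orders form a cycle that would deadlock with plain locks.
bool inversionDetected()
{
    std::atomic<int> holding{0}, finished{0};
    bool gotA = true, gotB = true;
    // Like getLockInfo(): holds lockCrit, then wants the session mutex.
    std::thread a([&] { gotA = tryInOrder(lockCrit, sessmanagersect, holding, finished); });
    // Like onClose()/unlockAll(): holds the session mutex, then wants lockCrit.
    std::thread b([&] { gotB = tryInOrder(sessmanagersect, lockCrit, holding, finished); });
    a.join();
    b.join();
    assert(holding.load() == 2);  // both threads reached the rendezvous
    return !gotA && !gotB;
}
```

Calling `inversionDetected()` from a test or a small `main` should report the cycle; with untimed locks the same two threads would block forever.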

      I think one solution is to avoid having stopSession() destroy the subscription stubs while it holds the session mutex (see thread stack (2)). Destroying the subscription stubs can involve quite a lot of work, including, as in this case, trying to acquire other, unrelated mutexes. Here the mutex it needed was held by stack (3), which was itself blocked on the session manager mutex.

      stopSession() can remove the stubs while holding the mutex but destroy them after releasing it. It already performs its remove and notify steps outside of the mutex lock.
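
The proposed change can be sketched as a "detach under the lock, destroy outside it" pattern. This is an illustration under stated assumptions rather than the actual fix: `SessionManager`, `SubscriptionStub`, `SessionLock` and the counters are hypothetical names, and the stub destructor simply takes a second mutex to mimic teardown that needs 'lockCrit'.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <vector>

std::mutex sessionMutex;  // hypothetical stand-in for the session manager mutex
std::mutex lockCrit;      // hypothetical stand-in for the lock needed during teardown
thread_local bool holdsSessionMutex = false;
int destroyedTotal = 0;      // stubs destroyed so far
int destroyedUnderLock = 0;  // stubs destroyed while the session mutex was held

// RAII guard that also records that this thread owns the session mutex.
struct SessionLock
{
    std::lock_guard<std::mutex> guard;
    SessionLock() : guard(sessionMutex) { holdsSessionMutex = true; }
    ~SessionLock() { holdsSessionMutex = false; }
};

struct SubscriptionStub
{
    ~SubscriptionStub()
    {
        if (holdsSessionMutex)
            ++destroyedUnderLock;
        std::lock_guard<std::mutex> sds(lockCrit);  // teardown takes another lock
        ++destroyedTotal;
    }
};

struct SessionManager
{
    std::vector<std::unique_ptr<SubscriptionStub>> stubs;

    void addStub()
    {
        SessionLock lock;
        stubs.push_back(std::make_unique<SubscriptionStub>());
    }

    void stopSession()
    {
        std::vector<std::unique_ptr<SubscriptionStub>> doomed;
        {
            SessionLock lock;
            doomed.swap(stubs);  // detach under the mutex: cheap, no callbacks
        }
        doomed.clear();  // destructors run here, after the mutex is released
    }
};

// Returns how many stubs were destroyed while the session mutex was held.
int stubsDestroyedUnderLock()
{
    destroyedTotal = 0;
    destroyedUnderLock = 0;
    SessionManager mgr;
    for (int i = 0; i < 3; ++i)
        mgr.addStub();
    mgr.stopSession();
    assert(destroyedTotal == 3);
    return destroyedUnderLock;
}
```

Because the destructors run only after the session mutex has been released, a thread that holds 'lockCrit' and wants the session mutex is no longer part of a cycle.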

          People

            jakesmith Jake Smith