HPCC-24069

Regression in SDS transaction locks causing lengthy backlog

    Details

    • Type: Regression
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.8.16
    • Component/s: Dali
    • Labels:
      None

      Description

      HPCC-23433 introduced a change which closed a race condition that occurred when the RTM_DELETE_ON_DISCONNECT logic tried to delete the root of the connection on the last disconnect release.

      When it detected it was the last lock, it used to release the SDS global readwrite transaction lock, then try to gain an exclusive write lock, before deleting root.

      However, that meant there was a window where other transactions could get in, and lock the same node.

      When the RTM_DELETE_ON_DISCONNECT thread gained the write lock, it would proceed to delete the node, meaning that other connections which were now established were dealing with an orphaned node.

      This meant that clients would see 'Transaction to orphaned node' errors (and warnings in the dali server log).
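
      The interleaving described above can be replayed step by step. The following is a minimal single-threaded sketch, not the Dali code: Node, ToyLock, Store, runTransaction and replayOldDisconnect are all hypothetical names, and ToyLock only does bookkeeping rather than real locking. It shows how releasing the read lock before taking the write lock leaves a window in which another client can attach to the node that is about to be deleted.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-ins for illustration; none of these names come from the HPCC sources.
struct Node {
    bool orphaned = false;  // set once the node has been deleted from the tree
};

// Single-threaded bookkeeping model of a read/write lock (not a real lock).
struct ToyLock {
    int readers = 0;
    bool writer = false;
    void lockRead()    { assert(!writer); ++readers; }
    void unlockRead()  { assert(readers > 0); --readers; }
    void lockWrite()   { assert(!writer && readers == 0); writer = true; }
    void unlockWrite() { assert(writer); writer = false; }
};

struct Store {
    ToyLock dataLock;  // plays the role of the global read/write transaction lock
    std::shared_ptr<Node> root = std::make_shared<Node>();
};

// A later client transaction against a node the client already holds.
std::string runTransaction(Store &store, const std::shared_ptr<Node> &node) {
    assert(node != nullptr);
    store.dataLock.lockRead();
    std::string result = node->orphaned ? "Transaction to orphaned node" : "ok";
    store.dataLock.unlockRead();
    return result;
}

// Replays the old release-then-reacquire sequence in the problematic order.
std::string replayOldDisconnect() {
    Store store;

    // Deleter: sees it holds the last lock, then releases the read lock.
    store.dataLock.lockRead();
    store.dataLock.unlockRead();

    // Window: another client connects to the same node.
    store.dataLock.lockRead();
    std::shared_ptr<Node> clientNode = store.root;
    store.dataLock.unlockRead();

    // Deleter: gains the write lock and deletes the root regardless.
    store.dataLock.lockWrite();
    store.root->orphaned = true;
    store.root.reset();
    store.dataLock.unlockWrite();

    // The other client is now working against an orphaned node.
    return runTransaction(store, clientNode);
}
```

      In this model, replayOldDisconnect() returns the error string, because the client attached during the window between the read unlock and the write lock.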

      The change in HPCC-23433 closed the window, by not releasing the read lock, and instead using a new changeToWrite (in ReadWriteLock).

      As a consequence, it blocked anybody else from performing write transactions.
      In effect, the deleting thread stalled until all other existing readers had finished or timed out, and while it waited it blocked any write transactions.

      The SDS global read write lock (dataRWLock) should not be held for any significant periods of time.
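
      The stall can be illustrated with a toy model of a read-to-write upgrade. This is a hedged sketch only: ToyUpgradableLock, StallReport and replayUpgradeStall are hypothetical names, the model is single-threaded and uses try-style calls instead of blocking, and it does not reproduce the real changeToWrite in ReadWriteLock. How new readers are treated while an upgrade is pending is not stated in this issue and is not modelled.

```cpp
#include <cassert>

// Hypothetical single-threaded model; not HPCC's ReadWriteLock.
class ToyUpgradableLock {
public:
    bool tryLockRead() {
        if (writer) return false;
        ++readers;
        return true;
    }
    void unlockRead() { assert(readers > 0); --readers; }

    // Writers must wait for readers, a current writer, or a pending upgrade.
    bool tryLockWrite() {
        if (writer || upgradePending || readers > 0) return false;
        writer = true;
        return true;
    }
    void unlockWrite() { assert(writer); writer = false; }

    // Caller must already hold a read lock. Returns true once upgraded;
    // returns false (still pending) while other readers remain.
    bool tryChangeToWrite() {
        assert(readers > 0 && !writer);
        upgradePending = true;
        if (readers > 1) return false;  // other readers are still active
        readers = 0;
        upgradePending = false;
        writer = true;
        return true;
    }

private:
    int readers = 0;
    bool writer = false;
    bool upgradePending = false;
};

struct StallReport {
    bool upgradedImmediately;
    bool writerBlockedWhileWaiting;
    bool upgradedAfterReadersLeft;
};

// Replays the upgrade with one other reader still inside a transaction.
StallReport replayUpgradeStall() {
    ToyUpgradableLock lock;
    lock.tryLockRead();  // the disconnecting (deleting) thread
    lock.tryLockRead();  // another reader with a transaction in flight

    StallReport report{};
    report.upgradedImmediately = lock.tryChangeToWrite();       // must wait
    report.writerBlockedWhileWaiting = !lock.tryLockWrite();    // writes stall too
    lock.unlockRead();                                          // other reader finishes
    report.upgradedAfterReadersLeft = lock.tryChangeToWrite();  // now succeeds
    return report;
}
```

      In the model, the upgrade cannot complete while the other reader is active, and a write attempted in the meantime is refused, which mirrors the backlog described above: the lock stays held for as long as the slowest outstanding reader takes.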

            People

            • Assignee:
              jakesmith Jake Smith
              Reporter:
              jakesmith Jake Smith