HPCC-9588

XREF "Delete Empty Directories" runs the ESP out of handles

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0.2
    • Component/s: XREF
    • Labels:
      None
    • Environment:
      USLM

      Description

      I keep seeing issues with the ESPs in USLM's 702 systems getting clobbered by too many open file handles (I'm just focusing on the ESP at 10.144.140.19 for now). I wrote a script that monitors the number of file handles in use, and I've been seeing it max out even after bumping the limit up a few times. As an example:
      10.144.140.19:/c$/espfd.log
      Fri Jun 28 09:01:01 EDT 2013 - ESP PID 6766 has 26737 open handles.
      Fri Jun 28 09:02:01 EDT 2013 - ESP PID 6766 has 26855 open handles.
      Fri Jun 28 09:03:01 EDT 2013 - ESP PID 6766 has 32512 open handles.
      Fri Jun 28 09:04:02 EDT 2013 - ESP PID 6766 has 32514 open handles.
      Fri Jun 28 09:05:01 EDT 2013 - ESP PID 6766 has 29430 open handles.
      Fri Jun 28 09:06:02 EDT 2013 - ESP PID 6766 has 29481 open handles.
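
      The monitoring script itself is not attached to this report. As a rough sketch of one way to produce log lines shaped like the ones above, assuming a Linux /proc filesystem (the function names `count_fds` and `log_fds` are hypothetical, not taken from the original script):

```shell
#!/bin/sh
# Hypothetical sketch: count the open file descriptors of a process via /proc.
# Assumes Linux; reading another user's /proc/<pid>/fd needs sufficient privileges.

# Print the number of entries in /proc/<pid>/fd.
# Returns 1 if the directory does not exist (no such process).
count_fds() {
    pid="$1"
    [ -d "/proc/$pid/fd" ] || return 1
    ls "/proc/$pid/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Print one log line in the same shape as the espfd.log entries.
log_fds() {
    pid="$1"
    n=$(count_fds "$pid") || return 1
    echo "$(date) - ESP PID $pid has $n open handles."
}
```

      With these functions sourced, something like `log_fds 6766 >> espfd.log` run once a minute (for example from cron) would build up a comparable log. The exact timestamp format depends on the local `date` defaults.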

      So I checked /proc/6766/fd, and almost all of the entries are sockets. I ran netstat -plan and saw a crazy number of connections from the ESP to dafilesrv on thor50_90 (10.144.90.1-51):
      10.144.140.19:/c$/netstatfd.log
      tcp 0 0 10.144.140.19:60914 10.144.90.10:7100 ESTABLISHED 6766/esp
      tcp 0 0 10.144.140.19:43105 10.144.90.47:7100 ESTABLISHED 6766/esp
      tcp 0 0 10.144.140.19:39728 10.144.90.47:7100 ESTABLISHED 6766/esp
      tcp 0 0 10.144.140.19:38771 10.144.90.47:7100 ESTABLISHED 6766/esp
      tcp 0 0 10.144.140.19:58215 10.144.90.38:7100 ESTABLISHED 6766/esp
      tcp 0 0 10.144.140.19:35375 10.144.90.47:7100 ESTABLISHED 6766/esp
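
      The check of /proc/6766/fd described above can be approximated with a short helper. This is only a sketch under the assumption of a Linux /proc layout and an available `readlink`; the function name `fd_breakdown` is hypothetical:

```shell
#!/bin/sh
# Hypothetical sketch: summarise what the descriptors of a process point at.
# Each entry in /proc/<pid>/fd is a symlink; sockets read as "socket:[inode]",
# pipes as "pipe:[inode]", and regular files or devices as an absolute path.
fd_breakdown() {
    pid="$1"
    for fd in /proc/"$pid"/fd/*; do
        readlink "$fd" 2>/dev/null
    done |
        sed -e 's|^socket:.*|socket|' -e 's|^pipe:.*|pipe|' \
            -e 's|^anon_inode:.*|anon_inode|' -e 's|^/.*|file|' |
        sort | uniq -c | sort -rn
}
```

      Calling `fd_breakdown 6766` would print one count per descriptor type, largest first. A process leaking connections would be expected to show `socket` dominating the list, and the netstat output then shows where those sockets are connected.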

      I did a count on the number of connections, and there are thousands of them to dafilesrv on some of the thorslaves:
      grep 10.144.90 netstatfd.log | awk '{ print $5 }' | sort | uniq -c
      3884 10.144.90.11:7100
      1501 10.144.90.19:7100
      720 10.144.90.1:7100
      2840 10.144.90.22:7100
      4084 10.144.90.23:7100
      4178 10.144.90.29:7100
      3885 10.144.90.2:7100
      1527 10.144.90.31:7100
      3771 10.144.90.34:7100
      1756 10.144.90.35:7100
      2277 10.144.90.38:7100
      646 10.144.90.43:7100
      1291 10.144.90.44:7100
      4130 10.144.90.48:7100
      2037 10.144.90.4:7100
      696 10.144.90.7:7100
      2158 10.144.90.9:7100

      There were no thor or DFU jobs running against thor50_90 at the time. After further investigation, I was able to duplicate the problem by running XREF "Delete Empty Directories" against a thor that had built up a large number of empty directories.

      — Kevin Wang said —

      When cleaning up empty directories in XREF, ESP calls a dali/dfuXRefLib method:

      CXRefNode::removeEmptyDirectories(StringBuffer &errstr).

      I guess the problem happens inside the method.

      Kevin

            People

            • Assignee: Jake Smith (jakesmith)
            • Reporter: Russell Wagner (rwagner42)