Uploaded image for project: 'HPCC'
  1. HPCC
  2. HPCC-15480

Slow ftslave transfers

    XMLWordPrintable

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Invalid
    • 5.4.6
    • None
    • FTSlave
    • None

    Description

      Spray test: D20160502-074136 = perf_test::filespraytest::myfile, 2,425,473 bytes on a single node 10.147.64.151, 0 bytes on other nodes
      10.147.64.208:/var/log/HPCCSystems/dfuserver/dfuserver.2016_05_02.log


      001B3B8B 2016-05-02 07:41:36.827 27927 27937 "DFU Server running job: D20160502-074136"
      001B3B8C 2016-05-02 07:41:36.840 27927 27937 "DFS: import(CASELAW-Exports-Man,)"
      001B3B8D 2016-05-02 07:41:36.840 27927 27937 "Using transfer buffer size 100000000"
      001B3B8E 2016-05-02 07:41:36.840 27927 27937 "Start gathering file sizes..."
      001B3B8F 2016-05-02 07:41:36.840 27927 27937 "Gathering 1 file sizes on 1 threads"
      001B3B90 2016-05-02 07:41:37.240 27927 27937 "Finished gathering file sizes..."
      001B3B91 2016-05-02 07:41:37.240 27927 27937 "compressedInput:0, compressOutput:0"
      001B3B92 2016-05-02 07:41:37.240 27927 27937 "Unable to find machine for 10.120.83.18"
      001B3B93 2016-05-02 07:41:37.240 27927 27937 "Use pull operation because 10.120.83.18 doesn't appear to have an ftslave"
      001B3B94 2016-05-02 07:41:37.240 27927 27937 "Start gathering file sizes..."
      001B3B95 2016-05-02 07:41:37.240 27927 27937 "Finished gathering file sizes..."
      001B3B96 2016-05-02 07:41:37.240 27927 27937 "Calculate partition information"
      001B3B97 2016-05-02 07:41:37.240 27927 27937 "Setting up no split partition"
      001B3B98 2016-05-02 07:41:37.240 27927 27937 "Insert headers"
      001B3B99 2016-05-02 07:41:37.240 27927 27937 "Insert a dummy entry for target 1"
      001B3B9A 2016-05-02 07:41:37.240 27927 27937 "Insert a dummy entry for target 2"
      001B3B9B 2016-05-02 07:41:37.240 27927 27937 "Insert a dummy entry for target 3"
      001B3B9C 2016-05-02 07:41:37.240 27927 27937 "Insert a dummy entry for target 4"
      001B3B9D 2016-05-02 07:41:37.241 27927 27937 "Partition //10.120.83.18:7100/nfd/shared/mncr/CASELAW-Exports-Man

      {0}[0 size 2425473]->//10.147.64.151:7100/var/lib/HPCCSystems/hpcc-data/thor/perf_test/filespraytest/myfile._1_of_5{0}

      [0 size 2425473]"
      001B3B9E 2016-05-02 07:41:37.241 27927 27937 "Partition //10.120.83.18:7100/nfd/shared/mncr/CASELAW-Exports-Man

      {0}[0 size 0]->//10.147.64.152:7100/var/lib/HPCCSystems/hpcc-data/thor/perf_test/filespraytest/myfile._2_of_5{1}[0 size 0]"
      001B3B9F 2016-05-02 07:41:37.241 27927 27937 "Partition //10.120.83.18:7100/nfd/shared/mncr/CASELAW-Exports-Man{0}

      [0 size 0]->//10.147.64.153:7100/var/lib/HPCCSystems/hpcc-data/thor/perf_test/filespraytest/myfile._3_of_5

      {2}[0 size 0]"
      001B3BA0 2016-05-02 07:41:37.241 27927 27937 "Partition //10.120.83.18:7100/nfd/shared/mncr/CASELAW-Exports-Man{0}[0 size 0]->//10.147.64.154:7100/var/lib/HPCCSystems/hpcc-data/thor/perf_test/filespraytest/myfile._4_of_5{3}[0 size 0]"
      001B3BA1 2016-05-02 07:41:37.241 27927 27937 "Partition //10.120.83.18:7100/nfd/shared/mncr/CASELAW-Exports-Man{0}[0 size 0]->//10.147.64.155:7100/var/lib/HPCCSystems/hpcc-data/thor/perf_test/filespraytest/myfile._5_of_5{4}[0 size 0]"
      001B3BA2 2016-05-02 07:41:37.241 27927 27937 "Calculate CRC = 1"
      001B3BA3 2016-05-02 07:41:37.241 27927 27937 "Begin to transfer parts (5 threads)
      001B3C4B 2016-05-02 07:43:29.605 27927 27937 ",FileAccess,DfuPlus,IMPORT,10.147.64.208,mncr-sprayer,perf_test::filespraytest::myfile"
      001B3C4C 2016-05-02 07:43:29.610 27927 27937 "DFU Server finished job: D20160502-074136"
      ----


      The ftslave log from the server with the data, 10.147.64.151:
      ----
      00000012 2016-05-02 07:41:36.979 36705 36705 "Begin to transfer chunk 0: Start at length 0"
      00000013 2016-05-02 07:43:29.048 36705 36705 "Progress: 100% done. [2425473]"
      ----


      Ftslave logs from one of the servers with 0 byte file part, shows it finishing instantly.


      When Tony copies the same file directly from the LZ 10.120.83.18 to 10.147.64.151, it finishes in less than a second:
      ----
      nfd@psc1817:/fab/mncr $ scp CASELAW-Exports-Man wuebbeae1@10.147.64.151:/mnt/disk1/home/wuebbeae1
      CASELAW-Exports-Man 100% 2369KB 2.3MB/s 00:00
      ----


      Despray test: D20160502-074526 = perf_test::filespraytest::myfile, 2,425,473 bytes on a single node 10.147.64.151, 0 bytes on other nodes
      10.147.64.207:/var/log/HPCCSystems/dfuserver/dfuserver.log
      ----
      0057E290 2016-05-02 07:45:23.969 27332 27339 "DFU Server running job: D20160502-074526"
      0057E291 2016-05-02 07:45:23.975 27332 27339 "DFS: export(perf_test::filespraytest::myfile,myfile)"
      0057E292 2016-05-02 07:45:23.975 27332 27339 "Using transfer buffer size 100000000"
      0057E293 2016-05-02 07:45:23.975 27332 27339 "Start gathering file sizes..."
      0057E294 2016-05-02 07:45:23.975 27332 27339 "Gathering 5 file sizes on 2 threads"
      0057E295 2016-05-02 07:45:23.977 27332 27339 "Finished gathering file sizes..."
      0057E296 2016-05-02 07:45:23.977 27332 27339 "compressedInput:0, compressOutput:0"
      0057E297 2016-05-02 07:45:23.978 27332 27339 "Unable to find machine for 10.120.83.19"
      0057E298 2016-05-02 07:45:23.978 27332 27339 "Use push operation because 10.120.83.19 doesn't appear to have an ftslave"
      0057E299 2016-05-02 07:45:23.978 27332 27339 "Start gathering file sizes..."
      0057E29A 2016-05-02 07:45:23.978 27332 27339 "Finished gathering file sizes..."
      0057E29B 2016-05-02 07:45:23.978 27332 27339 "Calculate partition information"
      0057E29C 2016-05-02 07:45:23.978 27332 27339 "Setting up many2one partition"
      0057E29D 2016-05-02 07:45:23.978 27332 27339 "Insert headers"
      0057E29E 2016-05-02 07:45:23.978 27332 27339 "Partition //10.147.64.151:7100/var/lib/HPCCSystems/hpcc-data/thor/perf_test/filespraytest/myfile._1_of_5{0}[0 size 2425473]->//10.120.83.19:7100/nfd/shared/mncr/myfile2{0}[0 size 2425473]"
      0057E29F 2016-05-02 07:45:23.978 27332 27339 "Partition //10.147.64.152:7100/var/lib/HPCCSystems/hpcc-data/thor/perf_test/filespraytest/myfile._2_of_5{1}[0 size 0]->//10.120.83.19:7100/nfd/shared/mncr/myfile2{0}[2425473 size 0]"
      0057E2A0 2016-05-02 07:45:23.978 27332 27339 "Partition //10.147.64.153:7100/var/lib/HPCCSystems/hpcc-data/thor/perf_test/filespraytest/myfile._3_of_5{2}

      [0 size 0]->//10.120.83.19:7100/nfd/shared/mncr/myfile2

      {0}[2425473 size 0]"
      0057E2A1 2016-05-02 07:45:23.979 27332 27339 "Partition //10.147.64.154:7100/var/lib/HPCCSystems/hpcc-data/thor/perf_test/filespraytest/myfile._4_of_5{3}[0 size 0]->//10.120.83.19:7100/nfd/shared/mncr/myfile2{0}

      [2425473 size 0]"
      0057E2A2 2016-05-02 07:45:23.979 27332 27339 "Partition //10.147.64.155:7100/var/lib/HPCCSystems/hpcc-data/thor/perf_test/filespraytest/myfile._5_of_5

      {4}

      [0 size 0]->//10.120.83.19:7100/nfd/shared/mncr/myfile2

      {0}

      [2425473 size 0]"
      0057E2A3 2016-05-02 07:45:24.117 27332 27339 "Begin to transfer parts (5 threads)
      0057E38E 2016-05-02 07:46:24.469 27332 27339 "WARNING: No response from any slaves in last 60 seconds."
      0057E38F 2016-05-02 07:46:24.469 27332 27339 "WARNING: Still waiting for slave 10.147.64.151"
      0057E3AC 2016-05-02 07:47:19.366 27332 27339 "Checking input CRCs"
      0057E3AD 2016-05-02 07:47:19.429 27332 27339 ",FileAccess,DfuPlus,EXPORT,10.147.64.207,mncr-sprayer,perf_test::filespraytest::myfile"
      0057E3AE 2016-05-02 07:47:19.435 27332 27339 "DFU Server finished job: D20160502-074526"


      The ftslave log from the server with the data, 10.147.64.151:


      00000010 2016-05-02 07:45:26.785 102399 102399 "Begin to transfer chunk 0: Start at length 0"
      00000011 2016-05-02 07:47:21.604 102399 102399 "Progress: 100% done. [2425473]"


      Copying the file from the thorslave to the LZ, and it takes 1 second:


      [wuebbeae1@lnp_prod_thor200_64:node010147064151 ~]$ scp CASELAW-Exports-Man nfd@10.120.83.18:/fab/mncr/myfile3
      CASELAW-Exports-Man 100% 2369KB 2.3MB/s 00:01


      In both cases, it appears to hang during the "Begin to transfer parts (5 threads)" on a ~2.5MB file.

      Doing the exact same spray/despray test on their old Prod environment is significantly faster, and it is running the same build on older hardware.

      Attachments

        Activity

          People

            attilavamos Attila Vamos
            rwagner42 Russell Wagner
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: