Uploaded image for project: 'HPCC'
  1. HPCC
  2. HPCC-28788

Regression query 'supercopy.ecl' causes dfuserver to crash and runs until timeout

    XMLWordPrintable

Details

    • Bug
    • Status: Resolved
    • Not specified
    • Resolution: Fixed
    • 8.12.0
    • 8.12.0
    • hthor, Roxie
    • None

    Description

      The reported errors on both engines (Thor, Roxie):

      811. Test: supercopy.ecl
      811. supercopy.ecl has not started yet. Reset due to timeout after 720 sec (2 retry attempt(s) remain).
      811. supercopy.ecl has not started yet. Reset due to timeout after 720 sec (1 retry attempt(s) remain).
      811. Create Stack Trace for hthor (pid:9580) into '/home/centos/HPCCSystems-regression/log/W20230117-075307-4-hthor.trace'
      811. Execute pre abort script '/home/centos/build/bin/preabort.sh'
      811. Pre abort script result stored into '/home/centos/HPCCSystems-regression/log/W20230117-075307-4-preAbort.log'
      811. Timeout occured for supercopy.ecl and no more attempt left. Force to abort... 
      811. Fail supercopy.ecl - W20230117-075307-4 (2206 sec)
      811. URL http://127.0.0.1:8010/?Widget=WUDetailsWidget&Wuid=W20230117-075307-4
      811. Zipped Analysis Package: ZAP file written to /home/centos/HPCCSystems-regression/zap/ZAPReport_W20230117-075307-4_regress.zip.
      
      Error:
      811. Test: supercopy.ecl
          Aborted ( reason: Timeout and retry count exhausted! )
       

      Test and pre abort logs, ZAP files are attached.

      Environment
      (Thor config)
      Hthor (runtime) Core file pid Roxie (runtime) Core file pid
      ​4sl/4ch W20230117-075307-4 (2206 sec) 10800, 12880 ​W20230117-085005-4 (2209 sec) 11109
      4cl/1ch W20230117-105755-3 (2206 sec)   W20230117-115100-3 (2208 sec)  

      In the DFUServer directory I found 3 related core files. 

      Excerpt from the first core trace:

      Core was generated by `dfuserver'.
      Program terminated with signal SIGSEGV, Segmentation fault.
      #0  FileSprayer::setSource (this=0x7fc158026710, source=0x7fc15802d750) at /home/centos/build/CE/platform/HPCC-Platform/dali/ft/./../../system/jlib/jsocket.hpp:151
      151        SocketEndpoint() = default;
      [Current thread is 1 (Thread 0x7fc163fff700 (LWP 10811))]
      .
      .
      .
      Thread 1 (Thread 0x7fc163fff700 (LWP 10811)):
      #0  FileSprayer::setSource (this=0x7fc158026710, source=0x7fc15802d750) at /home/centos/build/CE/platform/HPCC-Platform/dali/ft/./../../system/jlib/jsocket.hpp:151
      #1  0x00007fc19bbb727b in CDistributedFileSystem::copy (this=<optimized out>, from=0x7fc15802d750, to=0x7fc1580279a0, recovery=<optimized out>, recoveryConnection=0x7fc15802ed08, filter=0x0, options=0x7fc158025930, progress=0x7fc163ffe2d0, abort=0x7fc163ffe1d0, wuid=0x7fc158001b30 "D20230117-075313") at /home/centos/build/CE/platform/HPCC-Platform/dali/ft/daft.cpp:58
      #2  0x0000000000412ad6 in CDFUengine::runWU(char const*) ()
      #3  0x000000000040999f in CDFUengine::cDFUlistener::run() ()
      #4  0x00007fc197a3984e in Thread::begin (this=0x107c240) at /home/centos/build/CE/platform/HPCC-Platform/system/jlib/jthread.cpp:307
      #5  0x00007fc197a38da7 in Thread::_threadmain (v=0x107c240) at /home/centos/build/CE/platform/HPCC-Platform/system/jlib/jthread.cpp:176
      #6  0x00007fc1968dbea5 in start_thread () from /lib64/libpthread.so.0
      #7  0x00007fc196604b0d in clone () from /lib64/libc.so.6

      Excerpt from the DFUServer log:

      00000250 USR 2023-01-17 07:53:13.976 10800 10811 "DFU Server running job: D20230117-075313"
      00000251 PRG 2023-01-17 07:53:13.984 10800 10811 "DFS: copy(regress::single::hthor::w20230117-075307-4::superalbums1sub,)"
      00000252 USR 2023-01-17 07:53:13.984 10800 10811 "================================================"
      00000253 USR 2023-01-17 07:53:13.984 10800 10811 "Program:   10.20.0.182:/opt/HPCCSystems/bin/dfuserver"
      00000254 USR 2023-01-17 07:53:13.984 10800 10811 "Signal:    11 Segmentation fault"
      00000255 USR 2023-01-17 07:53:13.984 10800 10811 "Fault IP:  00007FC19BBCFEB8"
      00000256 USR 2023-01-17 07:53:13.984 10800 10811 "Accessing: 0000000000000000"
      00000257 PRG 2023-01-17 07:53:13.984 10800 10811 "Backtrace:"
      00000258 PRG 2023-01-17 07:53:13.985 10800 10811 "  /opt/HPCCSystems/lib/libdalift.so(_ZN11FileSprayer9setSourceEP16IDistributedFile+0xf8) [0x7fc19bbcfeb8]"
      00000259 PRG 2023-01-17 07:53:13.985 10800 10811 "  /opt/HPCCSystems/lib/libdalift.so(_ZN22CDistributedFileSystem4copyEP16IDistributedFileS1_P13IPropertyTreeP17IRemoteConnectionP13IDFPartFilterS3_P13IDaftProgressP21IAbortRequestCallbackPKc+0xdb) [0x7fc19bbb727b]"
      0000025A PRG 2023-01-17 07:53:13.985 10800 10811 "  dfuserver() [0x412ad6]"
      0000025B PRG 2023-01-17 07:53:13.985 10800 10811 "  dfuserver() [0x40999f]"
      0000025C PRG 2023-01-17 07:53:13.985 10800 10811 "  /opt/HPCCSystems/lib/libjlib.so(_ZN6Thread5beginEv+0x2e) [0x7fc197a3984e]"
      0000025D PRG 2023-01-17 07:53:13.985 10800 10811 "  /opt/HPCCSystems/lib/libjlib.so(_ZN6Thread11_threadmainEPv+0x27) [0x7fc197a38da7]"
      0000025E PRG 2023-01-17 07:53:13.985 10800 10811 "  /lib64/libpthread.so.0(+0x7ea5) [0x7fc1968dbea5]"
      0000025F PRG 2023-01-17 07:53:13.985 10800 10811 "  /lib64/libc.so.6(clone+0x6d) [0x7fc196604b0d]"
      00000260 USR 2023-01-17 07:53:13.985 10800 10811 "Registers:"
      00000261 USR 2023-01-17 07:53:13.985 10800 10811 "EAX:0000000000000000  EBX:0000000000000000  ECX:0000000000000072  EDX:0000000000000000  ESI:0000000000000000  EDI:00007FC15802D7B8"
      00000262 USR 2023-01-17 07:53:13.985 10800 10811 "R8 :0000000000000000  R9 :0000000000000001  R10:00007FC196645290  R11:0000000000000000"
      00000263 USR 2023-01-17 07:53:13.985 10800 10811 "R12:00007FC15802D750  R13:00007FC158026710  R14:00007FC15802D750  R15:00007FC163FFDE50"
      00000264 USR 2023-01-17 07:53:13.985 10800 10811 "CS:EIP:0033:00007FC19BBCFEB8"
      00000265 USR 2023-01-17 07:53:13.985 10800 10811 "   ESP:00007FC163FFDE40  EBP:0000000000000000"
      00000266 USR 2023-01-17 07:53:13.985 10800 10811 "Stack[00007FC163FFDE40]: 00007FC158025930 0000000100007FC1 0000000100000001 63FFE1F000000001 00007FC163FFE1F0 58001B3000007FC1 00007FC158001B30 5802593000007FC1"
      00000267 USR 2023-01-17 07:53:13.985 10800 10811 "Stack[00007FC163FFDE60]: 00007FC158025930 5802671000007FC1 00007FC158026710 5802CC1800007FC1 00007FC15802CC18 5802ED0800007FC1 00007FC15802ED08 5802593000007FC1"
      00000268 USR 2023-01-17 07:53:13.985 10800 10811 "Stack[00007FC163FFDE80]: 00007FC158025930 58001B3000007FC1 00007FC158001B30 580279A000007FC1 00007FC1580279A0 5802D75000007FC1 00007FC15802D750 5802671000007FC1"
      00000269 USR 2023-01-17 07:53:13.985 10800 10811 "Stack[00007FC163FFDEA0]: 00007FC158026710 5802ED0800007FC1 00007FC15802ED08 5802593000007FC1 00007FC158025930 9BBB727B00007FC1 00007FC19BBB727B 0000000000007FC1"
      0000026A USR 2023-01-17 07:53:13.985 10800 10811 "Stack[00007FC163FFDEC0]: 0000000000000000 63FFE2D000000000 00007FC163FFE2D0 63FFE1D000007FC1 00007FC163FFE1D0 9AAF618700007FC1 00007FC19AAF6187 580025D800007FC1"
      0000026B USR 2023-01-17 07:53:13.985 10800 10811 "Stack[00007FC163FFDEE0]: 00007FC1580025D8 5802A31000007FC1 00007FC15802A310 63FFE3F000007FC1 00007FC163FFE3F0 0000000000007FC1 0000000000000000 01076E9000000000"
      0000026C USR 2023-01-17 07:53:13.985 10800 10811 "Stack[00007FC163FFDF00]: 0000000001076E90 5802D75000000000 00007FC15802D750 5802593000007FC1 00007FC158025930 00412AD600007FC1 0000000000412AD6 5802593000000000"
      0000026D USR 2023-01-17 07:53:13.985 10800 10811 "Stack[00007FC163FFDF20]: 00007FC158025930 63FFE2D000007FC1 00007FC163FFE2D0 63FFE1D000007FC1 00007FC163FFE1D0 58001B3000007FC1 00007FC158001B30 0000000300007FC1" 

      In 8.10.x the supercopy finishes in ~50 sec on Hthor and in ~160s on Roxie. 

      jakesmith 

      Attachments

        1. core_dfuserver.10800.trace
          22 kB
        2. core_dfuserver.11109.trace
          22 kB
        3. core_dfuserver.12880.trace
          22 kB
        4. dfuserver.2023_01_17.log
          1.94 MB
        5. hthor.23-01-17-07-49-15.log
          186 kB
        6. hthor.23-01-17-10-54-06.log
          186 kB
        7. roxie.23-01-17-08-45-07.log
          198 kB
        8. roxie.23-01-17-11-46-13.log
          198 kB
        9. W20230117-075307-4-preAbort.log
          134 kB
        10. W20230117-085005-4-preAbort.log
          137 kB
        11. W20230117-105755-3-preAbort.log
          134 kB
        12. W20230117-115100-3-preAbort.log
          141 kB
        13. ZAPReport_W20230117-075307-4_regress.zip
          165 kB
        14. ZAPReport_W20230117-085005-4_regress.zip
          155 kB
        15. ZAPReport_W20230117-105755-3_regress.zip
          165 kB
        16. ZAPReport_W20230117-115100-3_regress.zip
          154 kB

        Issue Links

          Activity

            People

              jakesmith Jake Smith
              attilavamos Attila Vamos
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: