Details
-
Bug
-
Status: Resolved
-
Not specified
-
Resolution: Fixed
-
8.12.0
-
None
-
OBT:
Build system: CentOS_Linux_7
Hardware: CPU/Cores: 16, RAM: 30 GB
IP Address: 10.20.0.182 (OBT-AWS01)
Build on git branch: candidate-8.12.x
Date: Fri Jan 13 15:45:50 2023
SHA: 75F62343 (8adc4e77f3d5d9082a5f5696b28317cd)
Number of commits (24h): 0
Build type: RelWithDebInfo
Number of Thor slaves: 4
Number of channels per Thor slave: 4 or 1OBT: Build system: CentOS_Linux_7 Hardware: CPU/Cores: 16, RAM: 30 GB IP Address: 10.20.0.182 (OBT-AWS01) Build on git branch: candidate-8.12.x Date: Fri Jan 13 15:45:50 2023 SHA: 75F62343 (8adc4e77f3d5d9082a5f5696b28317cd) Number of commits (24h): 0 Build type: RelWithDebInfo Number of Thor slaves: 4 Number of channels per Thor slave: 4 or 1
Description
The reported errors on both engines (Thor, Roxie):
811. Test: supercopy.ecl 811. supercopy.ecl has not started yet. Reset due to timeout after 720 sec (2 retry attempt(s) remain). 811. supercopy.ecl has not started yet. Reset due to timeout after 720 sec (1 retry attempt(s) remain). 811. Create Stack Trace for hthor (pid:9580) into '/home/centos/HPCCSystems-regression/log/W20230117-075307-4-hthor.trace' 811. Execute pre abort script '/home/centos/build/bin/preabort.sh' 811. Pre abort script result stored into '/home/centos/HPCCSystems-regression/log/W20230117-075307-4-preAbort.log' 811. Timeout occured for supercopy.ecl and no more attempt left. Force to abort... 811. Fail supercopy.ecl - W20230117-075307-4 (2206 sec) 811. URL http://127.0.0.1:8010/?Widget=WUDetailsWidget&Wuid=W20230117-075307-4 811. Zipped Analysis Package: ZAP file written to /home/centos/HPCCSystems-regression/zap/ZAPReport_W20230117-075307-4_regress.zip. Error: 811. Test: supercopy.ecl Aborted ( reason: Timeout and retry count exhausted! )
Test and pre abort logs, ZAP files are attached.
Environment (Thor config) |
Hthor (runtime) | Core file pid | Roxie (runtime) | Core file pid |
---|---|---|---|---|
4sl/4ch | W20230117-075307-4 (2206 sec) | 10800, 12880 | W20230117-085005-4 (2209 sec) | 11109 |
4cl/1ch | W20230117-105755-3 (2206 sec) | W20230117-115100-3 (2208 sec) |
In the DFUServer directory I found 3 related core files.
Excerpt from the first core trace:
Core was generated by `dfuserver'. Program terminated with signal SIGSEGV, Segmentation fault. #0 FileSprayer::setSource (this=0x7fc158026710, source=0x7fc15802d750) at /home/centos/build/CE/platform/HPCC-Platform/dali/ft/./../../system/jlib/jsocket.hpp:151 151 SocketEndpoint() = default; [Current thread is 1 (Thread 0x7fc163fff700 (LWP 10811))] . . . Thread 1 (Thread 0x7fc163fff700 (LWP 10811)): #0 FileSprayer::setSource (this=0x7fc158026710, source=0x7fc15802d750) at /home/centos/build/CE/platform/HPCC-Platform/dali/ft/./../../system/jlib/jsocket.hpp:151 #1 0x00007fc19bbb727b in CDistributedFileSystem::copy (this=<optimized out>, from=0x7fc15802d750, to=0x7fc1580279a0, recovery=<optimized out>, recoveryConnection=0x7fc15802ed08, filter=0x0, options=0x7fc158025930, progress=0x7fc163ffe2d0, abort=0x7fc163ffe1d0, wuid=0x7fc158001b30 "D20230117-075313") at /home/centos/build/CE/platform/HPCC-Platform/dali/ft/daft.cpp:58 #2 0x0000000000412ad6 in CDFUengine::runWU(char const*) () #3 0x000000000040999f in CDFUengine::cDFUlistener::run() () #4 0x00007fc197a3984e in Thread::begin (this=0x107c240) at /home/centos/build/CE/platform/HPCC-Platform/system/jlib/jthread.cpp:307 #5 0x00007fc197a38da7 in Thread::_threadmain (v=0x107c240) at /home/centos/build/CE/platform/HPCC-Platform/system/jlib/jthread.cpp:176 #6 0x00007fc1968dbea5 in start_thread () from /lib64/libpthread.so.0 #7 0x00007fc196604b0d in clone () from /lib64/libc.so.6
Excerpt from the DFUServer log:
00000250 USR 2023-01-17 07:53:13.976 10800 10811 "DFU Server running job: D20230117-075313" 00000251 PRG 2023-01-17 07:53:13.984 10800 10811 "DFS: copy(regress::single::hthor::w20230117-075307-4::superalbums1sub,)" 00000252 USR 2023-01-17 07:53:13.984 10800 10811 "================================================" 00000253 USR 2023-01-17 07:53:13.984 10800 10811 "Program: 10.20.0.182:/opt/HPCCSystems/bin/dfuserver" 00000254 USR 2023-01-17 07:53:13.984 10800 10811 "Signal: 11 Segmentation fault" 00000255 USR 2023-01-17 07:53:13.984 10800 10811 "Fault IP: 00007FC19BBCFEB8" 00000256 USR 2023-01-17 07:53:13.984 10800 10811 "Accessing: 0000000000000000" 00000257 PRG 2023-01-17 07:53:13.984 10800 10811 "Backtrace:" 00000258 PRG 2023-01-17 07:53:13.985 10800 10811 " /opt/HPCCSystems/lib/libdalift.so(_ZN11FileSprayer9setSourceEP16IDistributedFile+0xf8) [0x7fc19bbcfeb8]" 00000259 PRG 2023-01-17 07:53:13.985 10800 10811 " /opt/HPCCSystems/lib/libdalift.so(_ZN22CDistributedFileSystem4copyEP16IDistributedFileS1_P13IPropertyTreeP17IRemoteConnectionP13IDFPartFilterS3_P13IDaftProgressP21IAbortRequestCallbackPKc+0xdb) [0x7fc19bbb727b]" 0000025A PRG 2023-01-17 07:53:13.985 10800 10811 " dfuserver() [0x412ad6]" 0000025B PRG 2023-01-17 07:53:13.985 10800 10811 " dfuserver() [0x40999f]" 0000025C PRG 2023-01-17 07:53:13.985 10800 10811 " /opt/HPCCSystems/lib/libjlib.so(_ZN6Thread5beginEv+0x2e) [0x7fc197a3984e]" 0000025D PRG 2023-01-17 07:53:13.985 10800 10811 " /opt/HPCCSystems/lib/libjlib.so(_ZN6Thread11_threadmainEPv+0x27) [0x7fc197a38da7]" 0000025E PRG 2023-01-17 07:53:13.985 10800 10811 " /lib64/libpthread.so.0(+0x7ea5) [0x7fc1968dbea5]" 0000025F PRG 2023-01-17 07:53:13.985 10800 10811 " /lib64/libc.so.6(clone+0x6d) [0x7fc196604b0d]" 00000260 USR 2023-01-17 07:53:13.985 10800 10811 "Registers:" 00000261 USR 2023-01-17 07:53:13.985 10800 10811 "EAX:0000000000000000 EBX:0000000000000000 ECX:0000000000000072 EDX:0000000000000000 ESI:0000000000000000 EDI:00007FC15802D7B8" 00000262 USR 2023-01-17 07:53:13.985 10800 10811 "R8 :0000000000000000 R9 :0000000000000001 R10:00007FC196645290 R11:0000000000000000" 00000263 USR 2023-01-17 07:53:13.985 10800 10811 "R12:00007FC15802D750 R13:00007FC158026710 R14:00007FC15802D750 R15:00007FC163FFDE50" 00000264 USR 2023-01-17 07:53:13.985 10800 10811 "CS:EIP:0033:00007FC19BBCFEB8" 00000265 USR 2023-01-17 07:53:13.985 10800 10811 " ESP:00007FC163FFDE40 EBP:0000000000000000" 00000266 USR 2023-01-17 07:53:13.985 10800 10811 "Stack[00007FC163FFDE40]: 00007FC158025930 0000000100007FC1 0000000100000001 63FFE1F000000001 00007FC163FFE1F0 58001B3000007FC1 00007FC158001B30 5802593000007FC1" 00000267 USR 2023-01-17 07:53:13.985 10800 10811 "Stack[00007FC163FFDE60]: 00007FC158025930 5802671000007FC1 00007FC158026710 5802CC1800007FC1 00007FC15802CC18 5802ED0800007FC1 00007FC15802ED08 5802593000007FC1" 00000268 USR 2023-01-17 07:53:13.985 10800 10811 "Stack[00007FC163FFDE80]: 00007FC158025930 58001B3000007FC1 00007FC158001B30 580279A000007FC1 00007FC1580279A0 5802D75000007FC1 00007FC15802D750 5802671000007FC1" 00000269 USR 2023-01-17 07:53:13.985 10800 10811 "Stack[00007FC163FFDEA0]: 00007FC158026710 5802ED0800007FC1 00007FC15802ED08 5802593000007FC1 00007FC158025930 9BBB727B00007FC1 00007FC19BBB727B 0000000000007FC1" 0000026A USR 2023-01-17 07:53:13.985 10800 10811 "Stack[00007FC163FFDEC0]: 0000000000000000 63FFE2D000000000 00007FC163FFE2D0 63FFE1D000007FC1 00007FC163FFE1D0 9AAF618700007FC1 00007FC19AAF6187 580025D800007FC1" 0000026B USR 2023-01-17 07:53:13.985 10800 10811 "Stack[00007FC163FFDEE0]: 00007FC1580025D8 5802A31000007FC1 00007FC15802A310 63FFE3F000007FC1 00007FC163FFE3F0 0000000000007FC1 0000000000000000 01076E9000000000" 0000026C USR 2023-01-17 07:53:13.985 10800 10811 "Stack[00007FC163FFDF00]: 0000000001076E90 5802D75000000000 00007FC15802D750 5802593000007FC1 00007FC158025930 00412AD600007FC1 0000000000412AD6 5802593000000000" 0000026D USR 2023-01-17 07:53:13.985 10800 10811 "Stack[00007FC163FFDF20]: 00007FC158025930 63FFE2D000007FC1 00007FC163FFE2D0 63FFE1D000007FC1 00007FC163FFE1D0 58001B3000007FC1 00007FC158001B30 0000000300007FC1"
In 8.10.x the supercopy finishes in ~50 sec on Hthor and in ~160s on Roxie.
Attachments
Issue Links
- relates to
-
HPCC-28505 Clean up invalid null reference
-
- Resolved
-