HPCC-28209

Message in esp log: Cannot find node group for thor400_112_x


Details

    Description

       

      00003EA0 2022-09-02 15:54:47.505 1391677 1397223 "Cannot find node group for thor400_112_7"

      00003EA1 2022-09-02 15:54:47.514 1391677 1397223 "Cannot find node group for thor400_112_9"

      00003EA2 2022-09-02 15:54:49.830 1391677 1397227 "incorrect data for process /var/run/HPCCSystems/thor400_112_slave_1:: 02:54:51"

      00003EA3 2022-09-02 15:54:49.830 1391677 1397227 "incorrect data for process /var/run/HPCCSystems/thor400_112_slave_2:: 02:54:51"

       

      I'm not sure what these are or why they're happening, but they look like they're related to preflight.

      I don't think they're related to the slowdown.

      Can you open a JIRA for this?

      I'm also adding Kevin.


       

      We had an issue, two days in a row, with the MVR V1 Key Build job taking longer than normal to run. The job starts around 1 PM and should be done before 2 PM; the last two days it took over two hours to complete. While digging through the logs, I saw an SDS lock timeout in the dali log today, as well as yesterday:

      04A0FFF3 2022-09-02 14:44:17.884 1660355 1665322 "WARNING: 1: dasds.cpp(4588) : EXCEPTION in reply to client 10.194.112.103:7137 : SDS: Lock timeout

      Lock timeout trying to establish lock to /JobQueues/, existing lock info: Locks on path: /JobQueues/

      Endpoint            |SessionId       |ConnectionId    |mode    |time(duration)]

      10.194.112.101:7386 |b106454d88      |b1064ed1c9      |1       |2022-09-02T18:44:17(66 ms)

      10.194.112.106:7180 |b020b2e0b6      |b1064df7f1      |1       |2022-09-02T18:42:50(87233 ms)

      10.194.112.106:7180 |b020b2e0b6      |b1064e6a97      |1       |2022-09-02T18:43:47(30453 ms)"
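
      A quick way to spot the long-held locks in output like the table above is to pull the millisecond durations out of each lock line and flag anything over a threshold. This is only a sketch against the text format shown in this excerpt (the column layout and the "(NNN ms)" suffix come from the lines above, not from any documented dali interface); pipe the lock-info lines from the dali log through it:

      #!/usr/bin/env python3
      # Sketch: parse dali "Locks on path" lines (format as in the excerpt above)
      # and flag any lock held longer than a threshold.
      import re
      import sys

      # e.g. "10.194.112.106:7180 |b020b2e0b6 |b1064df7f1 |1 |2022-09-02T18:42:50(87233 ms)"
      LOCK_LINE = re.compile(
          r'(?P<endpoint>\d+\.\d+\.\d+\.\d+:\d+)\s*\|'
          r'(?P<session>\S+)\s*\|(?P<connection>\S+)\s*\|(?P<mode>\d+)\s*\|'
          r'(?P<start>\S+?)\((?P<ms>\d+) ms\)')

      THRESHOLD_MS = 60_000  # flag anything held for more than a minute (arbitrary cut-off)

      for line in sys.stdin:
          m = LOCK_LINE.search(line)
          if m and int(m.group('ms')) > THRESHOLD_MS:
              print(f"long lock: {m.group('endpoint')} session={m.group('session')} "
                    f"held {int(m.group('ms')) / 1000:.0f}s since {m.group('start')}")

      Run against the three lines above, this would flag the two b020b2e0b6 entries from 10.194.112.106:7180 (roughly 87 and 30 seconds) and skip the 66 ms one.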

       

      The session IDs indicated these were esp processes on 101 and 106. I bounced the production esp on 101 and the SDS lock seemed to go away, but the job was still moving slowly. I couldn't find anything wrong with the hardware for the MW/slaves/NFS servers, so I kept watching, and the SDS lock on 106 popped up again in the dali log later. I looked in the production_esp_1_https log on 106 and saw these messages:

      "incorrect data for process /var/run/HPCCSystems/thor400_112_slave_1:: 02:54:51"

       

      I bounced the esp on 106 and the MVR job took off and started flying. I kept tailing the esp log on 106 and it happened again after the service was bounced. This is the whole sequence:

      00003E9A 2022-09-02 15:54:47.452 1391677 1397223 "HTTP First Line: POST /ws_machine/GetTargetClusterInfo HTTP/1.1"

      00003E9B 2022-09-02 15:54:47.453 1391677 1397223 "POST /ws_machine/GetTargetClusterInfo, from 10.145.44.133"

      00003E9C 2022-09-02 15:54:47.454 1391677 1397223 "Updated @timeoutAt for (/WS_MACHINE/GETTARGETCLUSTERINFO) : 1662162887"

      00003E9D 2022-09-02 15:54:47.475 1391677 1397223 "Cannot find node group for thor400_112"

      00003E9E 2022-09-02 15:54:47.485 1391677 1397223 "Cannot find node group for thor400_112_3"

      00003E9F 2022-09-02 15:54:47.494 1391677 1397223 "Cannot find node group for thor400_112_5"

      00003EA0 2022-09-02 15:54:47.505 1391677 1397223 "Cannot find node group for thor400_112_7"

      00003EA1 2022-09-02 15:54:47.514 1391677 1397223 "Cannot find node group for thor400_112_9"

      00003EA2 2022-09-02 15:54:49.830 1391677 1397227 "incorrect data for process /var/run/HPCCSystems/thor400_112_slave_1:: 02:54:51"

      00003EA3 2022-09-02 15:54:49.830 1391677 1397227 "incorrect data for process /var/run/HPCCSystems/thor400_112_slave_2:: 02:54:51"

      00003EA4 2022-09-02 15:54:49.831 1391677 1397227 "incorrect data for process /var/run/HPCCSystems/thor400_112_slave_4:: 02:54:51"

      00003EA5 2022-09-02 15:54:49.831 1391677 1397227 "incorrect data for process /var/run/HPCCSystems/thor400_112_3_slave_1:: 02:54:51"

      00003EA6 2022-09-02 15:54:49.831 1391677 1397227 "incorrect data for process /var/run/HPCCSystems/thor400_112_3_slave_2:: 02:54:51"

      00003EA7 2022-09-02 15:54:49.831 1391677 1397227 "incorrect data for process /var/run/HPCCSystems/thor400_112_3_slave_4:: 02:54:51"

      00003EA8 2022-09-02 15:54:49.831 1391677 1397227 "incorrect data for process /var/run/HPCCSystems/thor400_112_5_slave_1:: 02:54:51"

      00003EA9 2022-09-02 15:54:49.831 1391677 1397227 "incorrect data for process /var/run/HPCCSystems/thor400_112_5_slave_2:: 02:54:51"

      00003EAA 2022-09-02 15:54:49.831 1391677 1397227 "incorrect data for process /var/run/HPCCSystems/thor400_112_5_slave_4:: 02:54:52"

      00003EAB 2022-09-02 15:54:49.831 1391677 1397227 "incorrect data for process /var/run/HPCCSystems/thor400_112_7_slave_1:: 02:54:52"

      00003EAC 2022-09-02 15:54:49.831 1391677 1397227 "incorrect data for process /var/run/HPCCSystems/thor400_112_7_slave_2:: 02:54:52"

      00003EAD 2022-09-02 15:54:49.831 1391677 1397227 "incorrect data for process /var/run/HPCCSystems/thor400_112_7_slave_4:: 02:54:52"

      00003EAE 2022-09-02 15:54:49.831 1391677 1397227 "incorrect data for process /var/run/HPCCSystems/thor400_112_9_slave_1:: 02:54:52"

      00003EAF 2022-09-02 15:54:49.831 1391677 1397227 "incorrect data for process /var/run/HPCCSystems/thor400_112_9_slave_2:: 02:54:52"

      00003EB0 2022-09-02 15:54:49.831 1391677 1397227 "incorrect data for process /var/run/HPCCSystems/thor400_112_9_slave_4:: 02:54:52"

      I searched backwards and all the instances of this message I could find were for the odd-numbered thors (1, 3, 5, 7 and 9) and only the first 4 slaves. The job that was running slowly was not using an odd-numbered TM instance; it was using even-numbered thors. I worked with Sushant, paused all the odd-numbered thor job queues and bounced 1, 3, 5, 7, 9 and 11. The errors stopped for a bit, but they have started again.
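
      In the excerpt above, the "Cannot find node group" messages are logged on the same thread (1397223) that handled the POST to /ws_machine/GetTargetClusterInfo, so one way to test whether they still reproduce after bouncing things is to replay that request outside of ECL Watch and then watch the esp log. A minimal sketch follows; only the URL path comes from the log, while the HTTPS port, the credentials and the captured request body are assumptions (the easiest way to get a valid body is to copy the POST payload ECL Watch sends, using the browser's developer tools):

      #!/usr/bin/env python3
      # Sketch: replay the ws_machine/GetTargetClusterInfo preflight request and print the
      # response, so the "Cannot find node group" messages can be provoked on demand.
      import base64
      import ssl
      import urllib.request

      ESP_URL = "https://10.194.112.106:18010/ws_machine/GetTargetClusterInfo"  # port assumed
      USER, PASSWORD = "someuser", "somepassword"                               # placeholders

      # Request body captured from ECL Watch (browser dev tools) and saved to a file.
      with open("request_body.txt", "rb") as f:
          body = f.read()

      ctx = ssl.create_default_context()
      ctx.check_hostname = False
      ctx.verify_mode = ssl.CERT_NONE    # internal certificate; skip verification for the test

      req = urllib.request.Request(ESP_URL, data=body, method="POST")
      req.add_header("Content-Type", "application/x-www-form-urlencoded")
      req.add_header("Authorization",
                     "Basic " + base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode())

      with urllib.request.urlopen(req, context=ctx, timeout=60) as resp:
          print(resp.status)
          print(resp.read()[:2000].decode("utf-8", "replace"))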

       

      I went back to see if these errors happened prior to today and, yes, they were happening a lot yesterday, on both esps, on 101 and 106. They had also occurred on previous days, though not consistently, and they go back to before we upgraded to OSS 8.6. Do you know what these errors mean and what could be causing them? The slow-running job is becoming an issue.

       

      There are only 60 days of logs to look back through:

       

      On 10.194.112.101

       

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_07_03.log:21792

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_07_06.log:73260

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_07_14.log:19008

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_07_18.log:43560

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_07_22.log:84

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_07_28.log:110880

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_08_01.log:6695

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_08_02.log:40

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_08_17.log:21780

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_09_01.log:53460

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_09_02.log:75200

       

      On 10.194.112.106

       

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_07_03.log:33660

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_07_06.log:69300

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_07_14.log:66528

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_07_18.log:59400

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_07_22.log:8

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_07_28.log:93020

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_08_01.log:11880

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_08_17.log:15840

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_09_01.log:19780

      /var/log/HPCCSystems/production_esp_1_https/esp_main.2022_09_02.log:70485
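
      For what it's worth, both the per-file numbers above and the odd-thor/slave breakdown can be regenerated with a small script; the only things it assumes are the log path and the two message texts quoted earlier (the log file naming comes from the listings above):

      #!/usr/bin/env python3
      # Sketch: count "Cannot find node group" / "incorrect data for process" messages
      # per esp log file, and tally which thor group or slave each one names.
      import glob
      import re
      from collections import Counter

      LOG_GLOB = "/var/log/HPCCSystems/production_esp_1_https/esp_main.*.log"
      NODE_GROUP = re.compile(r'Cannot find node group for (\S+?)"')
      BAD_PROC = re.compile(r'incorrect data for process /var/run/HPCCSystems/(\S+?)::')

      per_file = Counter()
      per_name = Counter()

      for path in sorted(glob.glob(LOG_GLOB)):
          with open(path, errors="replace") as f:
              for line in f:
                  m = NODE_GROUP.search(line) or BAD_PROC.search(line)
                  if m:
                      per_file[path] += 1
                      per_name[m.group(1)] += 1

      for path, n in sorted(per_file.items()):
          print(f"{path}:{n}")
      print()
      for name, n in per_name.most_common():
          print(f"{name:40} {n}")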

       

       

       

       

       

      Lisa Frates

      Senior Systems Engineer

      HPCC Operations

      LexisNexis | Risk Solutions

      lisa.frates@lexisnexisrisk.com

       

          People

            wangkx Kevin Wang
            frateslx Lisa Frates