HPCC-26761

Thor Slave not starting


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 8.4.2
    • Fix Version/s: 8.2.34, 8.4.10, 7.12.80, 8.0.50
    • Component/s: Thor
    • Labels:
      None

      Description

      Buckle up...

      Note: I believe more versions are affected by this.

      I'm on Ubuntu 20.04 and built the source code from tag community_8.4.2-1 (available here: https://github.com/lpezet/HPCC-Platform/tree/community_8.4.2-1-lpezet)

      After fixing a couple of other small things, I'm still not able to get the Thor Slave to start.

      I install the package and simply run:

      systemctl start hpccsystems-platform.target

      I don't see any thorslave_lcr (that's what I'm expecting, but please correct me if I'm wrong):

      lpezet@lpezet-Virtual-Machine:/volumes/disk1/Work/git/HPCC-Platform-build$ ps auxwww | grep hpcc
      lpezet 24808 0.0 0.0 17532 736 pts/1 S+ 19:58 0:00 grep --color=auto hpcc
      lpezet@lpezet-Virtual-Machine:/volumes/disk1/Work/git/HPCC-Platform-build$ systemctl start hpccsystems-platform.target 
      lpezet@lpezet-Virtual-Machine:/volumes/disk1/Work/git/HPCC-Platform-build$ ps auxwww | grep hpcc
      hpcc 24990 0.0 0.0 130560 6692 ? Ssl 19:58 0:00 /opt/HPCCSystems/bin/dafilesrv --logDir=/var/log/HPCCSystems --name=mydafilesrv --daemon
      hpcc 25005 0.0 0.0 87164 5480 ? Ssl 19:58 0:00 /opt/HPCCSystems/bin/toposerver --daemon mytoposerver
      hpcc 25006 0.0 0.0 347308 9120 ? Ssl 19:58 0:00 /opt/HPCCSystems/bin/agentexec --daemon myeclagent
      hpcc 25007 0.0 0.0 642284 9216 ? Ssl 19:58 0:00 /opt/HPCCSystems/bin/eclccserver --daemon myeclccserver
      hpcc 25008 0.1 0.1 739784 21688 ? Ssl 19:58 0:00 /opt/HPCCSystems/bin/dfuserver --daemon mydfuserver
      hpcc 25013 0.0 0.0 1011156 9352 ? Ssl 19:58 0:00 /opt/HPCCSystems/bin/saserver --daemon mysasha
      hpcc 25016 0.0 0.0 863524 9064 ? Ssl 19:58 0:00 /opt/HPCCSystems/bin/eclscheduler --daemon myeclscheduler
      hpcc 25017 0.6 0.3 2746872 37856 ? Ssl 19:58 0:00 /opt/HPCCSystems/bin/daserver --daemon mydali
      hpcc 25023 0.4 0.3 1454476 47080 ? Ssl 19:58 0:00 /opt/HPCCSystems/bin/esp --daemon myesp
      hpcc 25053 0.2 0.3 2422336 39788 ? Ssl 19:58 0:00 /opt/HPCCSystems/bin/roxie --topology=RoxieTopology.xml --logfile --restarts=2 --stdlog=0 --daemon myroxie
      hpcc 25352 0.0 0.2 230100 23980 ? Ssl 19:58 0:00 /opt/HPCCSystems/bin/thormaster_lcr --daemon mythor MASTER=172.25.115.182:20000
      lpezet 25715 0.0 0.0 17664 728 pts/1 S+ 19:58 0:00 grep --color=auto hpcc
      lpezet@lpezet-Virtual-Machine:/volumes/disk1/Work/git/HPCC-Platform-build$

      I see that the log file /var/log/HPCCSystems/mythor/thorslaves-launch.debug was created, and its content is:

      $ cat /var/log/HPCCSystems/mythor/thorslaves-launch.debug 
      + [[ -z mythor ]]
      + [[ -z stop ]]
      ++ pwd
      + cwd=/var/lib/HPCCSystems/mythor
      + [[ /var/lib/HPCCSystems/mythor != \/\v\a\r\/\l\i\b\/\H\P\C\C\S\y\s\t\e\m\s\/\m\y\t\h\o\r ]]
      + source mythor.cfg
      ++ PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin:/opt/HPCCSystems/bin:/opt/HPCCSystems/sbin:/var/lib/HPCCSystems/mythor
      ++ THORNAME=mythor
      ++ THORMASTER=172.25.115.182
      ++ THORMASTERPORT=20000
      ++ THORSLAVEPORT=20100
      ++ localthorportinc=20
      ++ slavespernode=1
      ++ channelsperslave=1
      ++ DALISERVER=172.25.115.182:7070
      ++ localthor=true
      ++ breakoutlimit=3600
      ++ refreshrate=3
      ++ autoSwapNode=false
      ++ SSHidentityfile=/home/hpcc/.ssh/id_rsa
      ++ SSHusername=hpcc
      ++ SSHpassword=
      ++ SSHtimeout=0
      ++ SSHretries=3
      ++ SSHsudomount=
      + slaveIps=($(/opt/HPCCSystems/bin/daliadmin server=$DALISERVER clusternodes ${THORNAME} slaves timeout=2 1>/dev/null 2>&1; uniq slaves))
      ++ /opt/HPCCSystems/bin/daliadmin server=172.25.115.182:7070 clusternodes mythor slaves timeout=2
      ++ uniq slaves
      + [[ -z 172.25.115.182 ]]
      + [[ -z 172.25.115.182 ]]
      + numOfNodes=1
      + (( i=0 ))
      + (( i<1 ))
      + (( c=0 ))
      + (( c<1 ))
      + __slavePort=20100
      + __slaveNum=1
      + ssh -o LogLevel=QUIET -o StrictHostKeyChecking=no -o BatchMode=yes -i /home/hpcc/.ssh/id_rsa hpcc@172.25.115.182 '/bin/bash -c '\''/opt/HPCCSystems/sbin/thorslaves-exec.sh stop thorslave_mythor_1 20100 1 mythor 172.25.115.182 20000'\'''
      + (( c++ ))
      + (( c<1 ))
      + (( i++ ))
      + (( i<1 ))
      + exit 0
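
To make the trace above easier to follow, here is a minimal sketch of the nested loop it appears to execute. The variable names and values are taken from mythor.cfg as shown in the trace; the two arithmetic formulas and the `list_slaves` function name are assumptions that merely reproduce the single-node values seen here (port 20100, slave number 1), and the real thorslaves-launch script may compute them differently.

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of the launch loop seen in the trace.
# Values mirror mythor.cfg from the report; the formulas are assumptions.
THORSLAVEPORT=20100
localthorportinc=20
slavespernode=1
slaveIps=(172.25.115.182)
numOfNodes=${#slaveIps[@]}

list_slaves() {
    local i c port num
    for ((i = 0; i < numOfNodes; i++)); do
        for ((c = 0; c < slavespernode; c++)); do
            # Assumed: each additional slave on a node uses the next port block.
            port=$((THORSLAVEPORT + c * localthorportinc))
            # Assumed: slave numbers are 1-based and unique across nodes.
            num=$((i + 1 + c * numOfNodes))
            echo "thorslave_mythor_${num} ${slaveIps[i]} ${port} ${num}"
        done
    done
}

list_slaves
```

With the configuration from the trace this prints a single line, matching the one ssh invocation that was logged.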

      For testing purposes here, I believe we can just do:

      sudo -u hpcc /opt/HPCCSystems/sbin/thorslaves-exec.sh stop thorslave_mythor_1 20100 1 mythor 172.25.115.182 20000

      Now /opt/HPCCSystems/sbin/thorslaves-exec.sh does several things, and I have one comment on it.
      It checks that a couple of directories exist, BUT it creates any missing ones without "sudo", and I believe that would fail since the parent directories are writable by root only (e.g. /var/log, /var/run).
      It then creates a configuration file in /var/lib/HPCCSystems/thorslaves and finally runs:

      systemctl ${ACTION} thorslave@${INSTANCENAME}.service

      (which ends up being "sudo systemctl start thorslave@thorslave_mythor_1.service").
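
One way to check the directory concern above is to ask, as the hpcc user, whether each directory either exists and is writable or could be created under its nearest existing ancestor. This is only a sketch: `can_create_dir` is a hypothetical helper, not part of thorslaves-exec.sh, and the paths to check are whatever the script actually uses on your system.

```shell
#!/usr/bin/env bash
# Succeeds if the current user could create DIR without elevated rights:
# DIR (or its nearest existing ancestor) must be a writable directory.
can_create_dir() {
    local dir=$1
    # Walk up until we reach a path that actually exists.
    while [ ! -e "$dir" ]; do
        dir=$(dirname "$dir")
    done
    [ -d "$dir" ] && [ -w "$dir" ]
}

# Check every path given on the command line.
for d in "$@"; do
    if can_create_dir "$d"; then
        echo "ok:      $d"
    else
        echo "blocked: $d"
    fi
done
```

Saved under a hypothetical name such as check_dirs.sh, it could be run as `sudo -u hpcc bash check_dirs.sh /var/log/HPCCSystems/mythor /var/lib/HPCCSystems/thorslaves`.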

      Which fails:

       

      lpezet@lpezet-Virtual-Machine:/volumes/disk1/Work/git/HPCC-Platform-build$ systemctl status thorslave@thorslave_mythor_1.service 
      ● thorslave@thorslave_mythor_1.service - thorslave_mythor_1
       Loaded: loaded (/etc/systemd/system/thorslave@.service; static; vendor preset: enabled)
       Active: failed (Result: exit-code) since Fri 2021-10-29 20:14:46 MDT; 9s ago
       Process: 28754 ExecStart=/opt/HPCCSystems/bin/thorslave_lcr --daemon thorslave_mythor_1 master=${THORMASTER}:${THORMASTERPORT} slave=.:${SLAVEPORT} slaven>
       Main PID: 28754 (code=exited, status=1/FAILURE)
      Oct 29 20:14:46 lpezet-Virtual-Machine systemd[1]: Started thorslave_mythor_1.
      Oct 29 20:14:46 lpezet-Virtual-Machine systemd[1]: thorslave@thorslave_mythor_1.service: Main process exited, code=exited, status=1/FAILURE
      Oct 29 20:14:46 lpezet-Virtual-Machine systemd[1]: thorslave@thorslave_mythor_1.service: Failed with result 'exit-code'.
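
The status output above truncates the ExecStart line and only reports the exit code. The following sketch gathers a bit more detail using standard systemctl and journalctl options; the function names `unit_name` and `diagnose_unit` are hypothetical, and whether thorslave_lcr writes anything useful to the journal is an assumption.

```shell
#!/usr/bin/env bash
# Build the instance unit name from the thorslave@ template.
unit_name() {
    printf 'thorslave@%s.service' "$1"
}

# Print information that may explain an exit-code failure.
diagnose_unit() {
    local unit
    unit=$(unit_name "$1")
    # Untruncated status, so the whole ExecStart line is visible.
    systemctl status --full --no-pager "$unit"
    # The unit file as systemd loaded it, including any drop-ins.
    systemctl cat "$unit"
    # The last 50 journal entries for this unit only.
    journalctl -u "$unit" --lines=50 --no-pager
}
```

For the instance in this report, the call would be `diagnose_unit thorslave_mythor_1`.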
      

      Now I was fiddling with thslavemain.cpp and I believe it's exiting because parameters are not passed properly.
      This is what's being called in the end:

      /opt/HPCCSystems/bin/thorslave_lcr --daemon %i master=${THORMASTER}:${THORMASTERPORT} slave=.:${SLAVEPORT} slavenum=${SLAVENUM} logDir=/var/log/HPCCSystems/${THORNAME}

      I believe this is what it should be:

      /opt/HPCCSystems/bin/thorslave_lcr --daemon %i --master=${THORMASTER}:${THORMASTERPORT} --slaveport=.:${SLAVEPORT} --slavenum=${SLAVENUM} --logDir=/var/log/HPCCSystems/${THORNAME}

      (notice the "--" before each argument name)
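
To try the proposed form without editing the unit file by hand each time, a small helper can prefix bare name=value arguments with "--". This is a sketch only: `dashify_args` is a hypothetical name, it does not rename any argument (for example slave to slaveport), and whether thorslave_lcr really requires the "--" form is the belief stated above, not something confirmed here.

```shell
#!/usr/bin/env bash
# Print each argument on its own line, adding "--" to bare name=value pairs.
dashify_args() {
    local arg out=()
    for arg in "$@"; do
        case $arg in
            --*) out+=("$arg") ;;    # already dashed, keep as-is
            *=*) out+=("--$arg") ;;  # bare name=value, add the prefix
            *)   out+=("$arg") ;;    # positional argument, keep as-is
        esac
    done
    printf '%s\n' "${out[@]}"
}

dashify_args --daemon thorslave_mythor_1 master=172.25.115.182:20000 slavenum=1
```

The positional instance name and the existing --daemon flag pass through unchanged; only the name=value pairs gain the prefix.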

       

      This seems to be the same recurring pattern as in HPCC-26757.

            People

            Assignee:
            Michael-Gardner Michael Gardner
            Reporter:
            lpezet Luc