Uploaded image for project: 'HPCC'
  1. HPCC
  2. HPCC-13752

Temporary Spill File Lost on Job Pause/Resume Action.

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Accepted
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 5.0.16
    • Fix Version/s: None
    • Component/s: Build process
    • Labels:
    • Compatibility:
      Minor

      Description

      Job is unable to locate temporary spill file created when the job is paused/resumed.
      Job is an over night job so the pause/resume was not of a user action.

      Email Thread of research so far.
      I'm not sure what the bug is, but this is something to do with the fact the job was paused then resumed.
      i.e. the pause/resume mechanism (which is seldom used afaik) failed here:
      Evidence of pause/resume
      000000F 2015-06-16 03:14:14.385 88634 88634 "Thor on 10.241.30.201:21000 running W20150616-021342"
      00000010 2015-06-16 03:36:08.662 88634 88634 "Job W20150616-021342 paused, waiting for resume/abort"
      00000011 2015-06-16 03:56:47.034 88634 88634 "Enqueuing on thor400_30.thor to run wuid=W20150616-021342, graph=graph4, timelimit=172800 seconds, priority=0"
      From thor master log:
      000B05AF 2015-06-16 03:36:00.938 117040 117460 "Pausing job"
      ...
      000B05BB 2015-06-16 03:36:03.194 117040 117040 "Paused, saving spills.."
      000B05BC 2015-06-16 03:36:03.327 117040 117040 "createLogicalFile ( hpccinternal::dknowles_prod::temporary::spill::cnsf4__w20150616-021342 )"
      ...
      000B08DD 2015-06-16 03:36:04.243 117040 117040 "Paused, 1 spill(s) saved."

      Can you open a JIRA and add any other info. about how it was paused/resumed etc. to the issue please?


      Jake.

      http://10.241.20.202:8010/?Wuid=W20150616-021342&Widget=WUDetailsWidget#/stub/Summary

      I think its related…to (see below) are you doing that rename? Adding Jake for comment.

      00B05AC 2015-06-16 03:35:57.855 117040 117040 "createLogicalFile ( hpccinternal::dknowles_prod::temporary::spill::cnsf4__w20150616-021342 )"
      000B05AD 2015-06-16 03:35:58.040 117040 117040 ",Progress,Thor,StartSubgraph,thor400_30_2,W20150616-021342,4,1006,thor400_30,thor400_30.thor"
      000B05AE 2015-06-16 03:35:58.040 117040 117040 "allocateMPTag: tag = 66512"
      000B05AF 2015-06-16 03:36:00.938 117040 117460 "Pausing job"
      000B05B0 2015-06-16 03:36:02.081 117040 117040 "sendGraph took 4041 ms - graph(graph4, 1006)"
      000B05B1 2015-06-16 03:36:02.081 117040 117040 "Processing graph - graph(graph4, 1006)"
      000B05B2 2015-06-16 03:36:02.156 117040 117040 "preStart - activity(join, 1010)"
      000B05B3 2015-06-16 03:36:02.389 117040 36109 "process - activity(join, 1010)"
      000B05B4 2015-06-16 03:36:02.389 117040 36109 "process exit - activity(join, 1010)"
      000B05B5 2015-06-16 03:36:02.460 117040 117040 "registerTmpFile name=hpccinternal::dknowles_prod::temporary::spill::cnsf4__w20150616-021342, usageCount=1"
      000B05B6 2015-06-16 03:36:02.462 117040 117040 "deregisterTmpFile name=hpccinternal::dknowles_prod::temporary::spill::ansf4__w20150616-021342"
      000B05B7 2015-06-16 03:36:02.462 117040 117040 "deregisterTmpFile name=hpccinternal::dknowles_prod::temporary::spill::bnsf4__w20150616-021342"
      000B05B8 2015-06-16 03:36:02.635 117040 117040 ",Timing,ThorGraph,thor400_30_2,W20150616-021342,4,1006,1,4776,SUCCESS,thor400_30,thor400_30.thor"
      000B05B9 2015-06-16 03:36:02.889 117040 117040 "Graph Done - graph(graph4, 1006)"
      000B05BA 2015-06-16 03:36:02.891 117040 117040 "PU= 0% MU= 7% MAL=3135479552 MMP=3055468544 SBK=80011008 TOT=3205756K RAM=4253552K SWP=0K DSK: [sda] r/s=0.0 kr/s=0.0 w/s=2.2 k
      w/s=24.4 bsy=0 NIC: rxp/s=0.0 rxk/s=0.0 txp/s=0.0 txk/s=0.0 CPU: usr=0 sys=0 iow=0 idle=99 - graph(graph4, 1006)"
      000B05BB 2015-06-16 03:36:03.194 117040 117040 "Paused, saving spills.."
      000B05BC 2015-06-16 03:36:03.327 117040 117040 "createLogicalFile ( hpccinternal::dknowles_prod::temporary::spill::cnsf4__w20150616-021342 )"
      000B05BD 2015-06-16 03:36:03.328 117040 117040 "fullname = /var/lib/HPCCSystems/thor400_30_2/hpccinternal/dknowles_prod/thorpause/cnsf4_w20150616-021342_w20150616-021342._1

      And then

      "Succeeded rename //10.241.30.37/var/lib/HPCCSystems/thor400_30_2/temp/cnsf4__w20150616-021342._37_of_400 to //10.241.30.37/var/lib
      /HPCCSystems/thor400_30_2/hpccinternal/dknowles_prod/thorpause/cnsf4_w20150616-021342_w20150616-021342._37_of_400"

        Attachments

          Activity

            People

            Assignee:
            jakesmith Jake Smith
            Reporter:
            edmants Tamika Edman
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Dates

              Due:
              Created:
              Updated: