HPCC / HPCC-10698

Superfile transaction gets stuck on safeChangeModeWrite

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.2.2
    • Fix Version/s: 4.2.4
    • Component/s: Thor
    • Labels: None
    • Environment: USLM Infra

      Description

      Workunit W20140124-175930 appears to be stuck on the USLM system. The eclagent log has:

      0000000A 2014-01-24 17:59:32.325 6157 6157 "setResultString(gl7A2A1,-3,'~oss_issue7::sub_W20140124-175930')"
      0000000B 2014-01-24 17:59:32.343 6157 6157 "Enqueuing on thor10_64_a.thor to run wuid=W20140124-175930, graph=graph1, timelimit=172800 seconds, priority=0"
      0000000C 2014-01-24 17:59:32.352 6157 6157 "Thor on 10.144.64.11:16500 running W20140124-175930"
      0000000D 2014-01-24 17:59:37.700 6157 6157 "WARNING: safeChangeModeWrite - temporarily releasing lock on oss_issue7::super to avoid deadlock"
      0000000E 2014-01-24 17:59:42.705 6157 6157 "WARNING: safeChangeModeWrite on oss_issue7::super waiting for 10s"
      0000000F 2014-01-24 17:59:42.705 6157 6157 "Backtrace:"
      00000010 2014-01-24 17:59:42.707 6157 6157 " /opt/HPCCSystems/lib/libjlib.so(_Z16PrintStackReportv+0x26) [0x2b92b0b017f6]"
      00000011 2014-01-24 17:59:42.707 6157 6157 " /opt/HPCCSystems/lib/libdalibase.so(_Z19safeChangeModeWriteP17IRemoteConnectionPKcRbj+0x1fc) [0x2b92af7e84dc]"
      00000012 2014-01-24 17:59:42.707 6157 6157 " /opt/HPCCSystems/lib/libdalibase.so(_ZN20CDistributedFileBaseI21IDistributedSuperFileE14lockPropertiesEj+0x83) [0x2b92af782013]"
      00000013 2014-01-24 17:59:42.707 6157 6157 " /opt/HPCCSystems/lib/libdalibase.so(_ZN9CDFAction4lockEPb+0x54) [0x2b92af77eee4]"
      00000014 2014-01-24 17:59:42.707 6157 6157 " /opt/HPCCSystems/lib/libdalibase.so(_ZN21CDistributedSuperFile20cRemoveSubFileAction7prepareEv+0x8b) [0x2b92af784fab]"
      00000015 2014-01-24 17:59:42.707 6157 6157 " /opt/HPCCSystems/lib/libdalibase.so(_ZN27CDistributedFileTransaction14prepareActionsEv+0x38) [0x2b92af77b158]"
      00000016 2014-01-24 17:59:42.707 6157 6157 " /opt/HPCCSystems/lib/libdalibase.so(_ZN27CDistributedFileTransaction6commitEv+0x57) [0x2b92af775d27]"
      00000017 2014-01-24 17:59:42.707 6157 6157 " /opt/HPCCSystems/lib/libdalibase.so(_ZN27CDistributedFileTransaction10autoCommitEv+0x3d) [0x2b92af7759cd]"
      00000018 2014-01-24 17:59:42.707 6157 6157 " /opt/HPCCSystems/lib/libdalibase.so(_ZN21CDistributedSuperFile13removeSubFileEPKcbbP27IDistributedFileTransaction+0x33e) [0x2b92af7989be]"
      00000019 2014-01-24 17:59:42.707 6157 6157 " /opt/HPCCSystems/plugins/libfileservices.so(fslRemoveSuperFile+0xca) [0x2aaaaae84b3a]"
      0000001A 2014-01-24 17:59:42.707 6157 6157 " /var/lib/HPCCSystems/dllserver/temp/libW20140124-175930.so [0x2aaaad079f59]"
      0000001B 2014-01-24 17:59:42.707 6157 6157 " /opt/HPCCSystems/lib/libworkunit.so(_ZN15WorkflowMachine11performItemEjj+0x54) [0x2b92aef9d474]"
      0000001C 2014-01-24 17:59:42.707 6157 6157 " /opt/HPCCSystems/lib/libworkunit.so(_ZN15WorkflowMachine13doExecuteItemER20IRuntimeWorkflowItemj+0x3f) [0x2b92aef9e0ef]"
      0000001D 2014-01-24 17:59:42.707 6157 6157 " /opt/HPCCSystems/lib/libworkunit.so(_ZN15WorkflowMachine11executeItemEjj+0x26a) [0x2b92aef9db8a]"
      0000001E 2014-01-24 17:59:42.707 6157 6157 " /opt/HPCCSystems/lib/libworkunit.so(_ZN15WorkflowMachine7performEP18IGlobalCodeContextP11IEclProcess+0x139) [0x2b92aef9e799]"
      0000001F 2014-01-24 17:59:42.707 6157 6157 " /opt/HPCCSystems/lib/libhthor.so(_ZN8EclAgent10runProcessEP11IEclProcess+0x1d7) [0x2b92add12cf7]"
      00000020 2014-01-24 17:59:42.707 6157 6157 " /opt/HPCCSystems/lib/libhthor.so(_ZN8EclAgent9doProcessEv+0x2f6) [0x2b92add19b06]"
      00000021 2014-01-24 17:59:42.707 6157 6157 " /opt/HPCCSystems/lib/libhthor.so(_Z13eclagent_mainiPPKcP12StringBufferb+0x6cc) [0x2b92add1ac1c]"
      00000022 2014-01-24 17:59:42.707 6157 6157 " eclagent(main+0x66) [0x401226]"
      00000023 2014-01-24 17:59:42.707 6157 6157 " /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b92b3353994]"
      00000024 2014-01-24 17:59:42.707 6157 6157 " eclagent(__gxx_personality_v0+0xe9) [0x4010f9]"
      00000025 2014-01-24 17:59:42.707 6157 6157 "CDFAction lock timed out on oss_issue7::super"
      00000026 2014-01-24 17:59:42.707 6157 6157 "CDistributedFileTransaction: Transaction pausing"
      00000027 2014-01-24 18:06:08.908 6157 6157 "WARNING: safeChangeModeWrite - temporarily releasing lock on oss_issue7::sub_w20140124-175907 to avoid deadlock"
      00000028 2014-01-24 18:11:43.954 6157 6157 "WARNING: safeChangeModeWrite on oss_issue7::sub_w20140124-175907 waiting for 635s"
      00000029 2014-01-24 18:11:43.954 6157 6157 "Backtrace:"
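
The frames in the backtrace above are mangled C++ symbols, which makes the call chain hard to read. The following is a minimal sketch, in Python, that recovers the qualified function name from each frame. It is an illustration only: `qualified_name` and `call_chain` are hypothetical helper names, and the parser handles just the simple length-prefixed names (plus one flat template argument list) that occur in this log. A complete demangler such as `c++filt` should be preferred when available.

```python
import re

# Matches "(symbol+0xoffset)" inside one backtrace frame.
FRAME = re.compile(r"\((\w+)\+0x[0-9a-f]+\)")


def read_name(sym, pos):
    """Read one <length><identifier> source name starting at pos."""
    m = re.match(r"\d+", sym[pos:])
    if not m:
        return None, pos
    start = pos + len(m.group())
    end = start + int(m.group())
    return sym[start:end], end


def qualified_name(sym):
    """Return 'Class::method' for a simple Itanium-mangled symbol, else None."""
    if not sym.startswith("_Z"):
        return None
    pos = 2
    nested = sym.startswith("N", pos)  # 'N' opens a nested (qualified) name
    if nested:
        pos += 1
    parts = []
    while pos < len(sym):
        name, pos = read_name(sym, pos)
        if name is None:
            break
        if sym.startswith("I", pos):  # flat template argument list: I <names> E
            pos += 1
            args = []
            while True:
                arg, pos = read_name(sym, pos)
                if arg is None:
                    break
                args.append(arg)
            if sym.startswith("E", pos):
                pos += 1
            name += "<" + ", ".join(args) + ">"
        parts.append(name)
        if not nested:
            break
    return "::".join(parts) or None


def call_chain(lines):
    """Extract readable function names from backtrace lines, innermost first."""
    chain = []
    for line in lines:
        m = FRAME.search(line)
        if m:
            sym = m.group(1)
            chain.append(qualified_name(sym) or sym)  # plain C symbols pass through
    return chain


if __name__ == "__main__":
    frames = [
        ' /opt/HPCCSystems/lib/libdalibase.so(_Z19safeChangeModeWriteP17IRemoteConnectionPKcRbj+0x1fc) [0x2b92af7e84dc]',
        ' /opt/HPCCSystems/lib/libdalibase.so(_ZN9CDFAction4lockEPb+0x54) [0x2b92af77eee4]',
        ' /opt/HPCCSystems/lib/libdalibase.so(_ZN27CDistributedFileTransaction6commitEv+0x57) [0x2b92af775d27]',
    ]
    print(call_chain(frames))
    # ['safeChangeModeWrite', 'CDFAction::lock', 'CDistributedFileTransaction::commit']
```

Applied to the full backtrace, this reads as: the workflow item calls `fslRemoveSuperFile`, which reaches `CDistributedSuperFile::removeSubFile`, then the transaction commit, and finally `safeChangeModeWrite`, where the wait occurs.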

      This workunit was started while workunit W20140124-175917, which reads the superfile the stuck workunit wants to write, was running.

      The equivalent pair of workunits runs fine in a 702 environment.

      The ECL for both these workunits is:

      unsigned4 plus4(unsigned4 invalue) := beginc++
      	sleep(1);
      	return invalue + 4;
      endc++;
      
      layout := {unsigned4 num};
      
      seed := dataset([{1}, {2}, {3}, {4}, {5}, {6}],layout);
      
      superName := '~oss_issue7::super';
      subName := '~oss_issue7::sub_' + workunit;
      
      blocked := sequential(
        output(seed,,subName,overwrite),
        if(FileServices.SuperFileExists(superName),
           FileServices.ClearSuperFile(superName)),
        FileServices.StartSuperFileTransaction(),
        FileServices.AddSuperFile(superName,subName),
        FileServices.FinishSuperFileTransaction()
      );
      
      super := dataset(superName,layout,thor);
      blockerName := '~oss_issue7::blocker_output';
      
      ds1b := normalize(super, 10, transform(layout,self.num := plus4(counter)));
      
      blockerSubgraph := output(ds1b,,blockerName,overwrite);
      
      blocked;
      // blockerSubgraph;
      

      The "stuck" workunit executes the "blocked" attribute, which must also be run once beforehand to set up the initial conditions. The other workunit executes "blockerSubgraph", which reads the superfile and holds its lock long enough for "blocked" to be kicked off while it is still running.
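
The interaction described above (one workunit holding a read lock on the superfile while another tries to obtain a write lock) can be modelled with a small, single-threaded simulation. This is a sketch under stated assumptions, not the Dali or `safeChangeModeWrite` implementation: `LockTable` and `simulate` are hypothetical names, and the poll interval and timeouts are illustrative values rather than values taken from the platform.

```python
from dataclasses import dataclass, field

SUPER = "oss_issue7::super"


@dataclass
class LockTable:
    """Toy lock service: any number of readers, or one writer, per file name."""
    readers: dict = field(default_factory=dict)  # name -> set of holder ids
    writers: dict = field(default_factory=dict)  # name -> holder id

    def lock_read(self, name, holder):
        if name in self.writers:
            return False
        self.readers.setdefault(name, set()).add(holder)
        return True

    def unlock_read(self, name, holder):
        self.readers.get(name, set()).discard(holder)

    def try_lock_write(self, name, holder):
        # A write lock needs every other reader and any writer to be gone.
        others = self.readers.get(name, set()) - {holder}
        if others or name in self.writers:
            return False
        self.writers[name] = holder
        return True


def simulate(blocker_finishes_at, timeout, poll_interval=5):
    """Model 'blocked' upgrading to a write lock while 'blocker' is reading.

    Times are simulated seconds. Returns (acquired, log).
    """
    table = LockTable()
    table.lock_read(SUPER, "blocker")  # workunit running blockerSubgraph
    table.lock_read(SUPER, "blocked")  # workunit running blocked
    log = []

    # Give up our own read lock first, so two upgraders cannot deadlock.
    table.unlock_read(SUPER, "blocked")
    log.append(f"released own read lock on {SUPER}")

    waited = 0
    while True:
        if waited >= blocker_finishes_at:
            table.unlock_read(SUPER, "blocker")  # the reader has finished
        if table.try_lock_write(SUPER, "blocked"):
            log.append(f"write lock acquired after {waited}s")
            return True, log
        if waited >= timeout:
            log.append(f"timed out after {waited}s")
            return False, log
        waited += poll_interval


if __name__ == "__main__":
    print(simulate(blocker_finishes_at=20, timeout=10))  # reader outlives the timeout
    print(simulate(blocker_finishes_at=20, timeout=60))  # reader finishes in time
```

In this model the writer always succeeds once the reader finishes, provided the timeout is long enough. The log in this report shows the real workunit still waiting after 635s, which is the behaviour being reported as a bug.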

            People

            • Assignee: Jake Smith (jakesmith)
            • Reporter: Joe Cella (joecella)