HPCC-23282

Correct reading of a UTF-16 file

    Details

    • Type: Bug
    • Status: Scheduled
    • Priority: Not specified
    • Resolution: Unresolved
    • Affects Version/s: 6.4.38
    • Fix Version/s: None
    • Component/s: DFS
    • Labels:
      None

      Description

      I have a file encoded in the following manner:

      Little-endian UTF-16 Unicode text, with very long lines, with CRLF, CR line terminators
      

      I sprayed this to the cluster with a call to FileServices.SprayVariable. (It seems our version doesn't support this in STD.File.SprayDelimited, even though the docs show it.)

      With this I have specified encoding := 'utf16le'

      At this point the file 'looks' fine through ECL Watch and shows a record structure of:

      RECORD
          UTF8 field1;
          UTF8 field2;
          ...
      END;
      

      I am trying to read the data within a DATASET, but cannot get it to display correctly.

      Sample ECL IDE output of the dataset:

      "Instalment

      Expected output: Instalment

      Octal dump of raw file:

      0003760 " \0 I \0 n \0 s \0 t \0 a \0 l \0 m \0
      0004000 e \0 n \0 t

      Hex dump of part of the file:

      00000000 ff fe 22 00 44 00 41 00 54 00 45 00 5f 00 53 00 |..".D.A.T.E._.S.|
      00000010 54 00 41 00 52 00 54 00 22 00 2c 00 22 00 44 00 |T.A.R.T.".,.".D.|
      00000020 41 00 54 00 45 00 5f 00 45 00 4e 00 44 00 22 00 |A.T.E._.E.N.D.".|
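
      As a sanity check outside the cluster, the bytes in the hex dump above can be decoded directly. The following is a minimal Python sketch, not part of the ECL job: the names HEX_DUMP and decode_utf16_sample are hypothetical, and the fallback to little-endian when no byte-order mark is present is an assumption. It shows that the file starts with the UTF-16LE byte-order mark (ff fe) followed by a quoted CSV header, and that reading the same bytes as single-byte text leaves a NUL after every character.

```python
import codecs

# First 48 bytes of the file, copied from the hex dump above.
HEX_DUMP = (
    "ff fe 22 00 44 00 41 00 54 00 45 00 5f 00 53 00 "
    "54 00 41 00 52 00 54 00 22 00 2c 00 22 00 44 00 "
    "41 00 54 00 45 00 5f 00 45 00 4e 00 44 00 22 00"
)


def decode_utf16_sample(hex_text: str) -> str:
    """Decode a hex-dumped UTF-16 sample, dropping a leading byte-order mark."""
    raw = bytes.fromhex(hex_text)
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # Assumption: without a BOM, treat the sample as little-endian.
    return raw.decode("utf-16-le")


header = decode_utf16_sample(HEX_DUMP)
print(header)

# The same bytes read as single-byte text keep a NUL after each character.
single_byte_view = bytes.fromhex(HEX_DUMP).decode("latin-1")
print(repr(single_byte_view[2:10]))
```

      If the decoded header looks right here, the raw file itself is intact, and the display problem lies in how the sprayed data is interpreted when it is read.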

      I have tried using the following DATASET variations:

      DATASET('logical_file_name', layout, CSV(HEADING(1),SEPARATOR(',')))
      DATASET('logical_file_name', layout, CSV(HEADING(1),SEPARATOR(','), UNICODE))
      DATASET('logical_file_name', layout, CSV(HEADING(1),SEPARATOR(','), UNICODE16))
      

      I have also adjusted the RECORD definition:

      input_lay := RECORD
      UNICODE field1;
      UNICODE field2;
      UNICODE field3;
      ...
      END;

      input_lay := RECORD
      STRING field1;
      STRING field2;
      STRING field3;
      ...
      END;

      input_lay := RECORD
      UTF8 field1;
      UTF8 field2;
      UTF8 field3;
      ...
      END;

      None of the above combinations parses the file successfully. Could you please comment on how to read a UTF-16 file that has been sprayed to the cluster?

            People

            Assignee:
            attilavamos Attila Vamos
            Reporter:
            Stuart Stuart Chatman