DATA-INTO crashing with UTF16 in CTL-OPT

Post by **emaxt6** » Thu Sep 26, 2024 8:44 am

Hi all,
this is just a technical curiosity for which I couldn't find an easy explanation.

I have a small program that retrieves machine-aided translations. It uses HTTPAPI to receive a UTF-8 string (multilingual, so it can contain 4-byte characters etc.), which I then DATA-INTO into a DS whose string subfields are all explicitly defined, field by field, with CCSID(*UTF8).

For completeness, the module also has some UCS-2 fields with an explicit CCSID(*UTF16), used to pass the information on to the UI or business layer (by convention the business layer is UTF-16 internally, as are the GUI and the database, while UTF-8 tends to be used for network transport, external JSON serialization, etc.).

Everything works fine, but the moment I put CTL-OPT CCSID(*UCS2 : *UTF16) in the module, to avoid specifying an explicit CCSID on my UCS-2 variables, this statement fails with a parse error. Without it, everything works:

DATA-INTO res %DATA(res_s : 'allowmissing=yes allowextra=yes +
case=any countprefix=cnt_ ' )
%PARSER('YAJLINTO');

res_s is a UTF-8 JSON string.

With the CTL-OPT off, the DATA-INTO trace says:
Data length 138 bytes
Data CCSID 13488
Converting data to UTF-8
Allocating YAJL stream parser
Parsing JSON data (yajl_parse)
[...all ok....]

With the CTL-OPT on, the trace says:

---------------- Start ---------------
Data length 138 bytes
Data CCSID 1200
Converting data to UTF-8
Allocating YAJL stream parser
Parsing JSON data (yajl_parse)
YAJL parser status 0 after 0 bytes
YAJL parser final status: 2
YAJL error: parse error: premature EOF
ReportError, RC = 1002
- Bytes parsed: 0
Terminating due to error

So it apparently changes the internal behaviour of the parser itself, and not just of the module? Intuitively, CCSID 1200 should be even "more correct" for converting UTF-8, since it covers the whole range of Unicode code points...

Any insights? Or is there a parameter I can pass to the parser to force a particular working CCSID (such as UTF-8, since both the parse source and the destination are UTF-8)?

thanks

Post by **Scott Klement** » Sat Sep 28, 2024 6:42 am

I'm not 100% following you.

Can you tell me what I would need to do to reproduce this problem? Which fields are defined as UCS-2 without a CCSID keyword (those would be the only thing affected by the CTL-OPT you posted)? How are these fields used?

If at all possible, provide an example that reproduces the problem you cite.

Post by **emaxt6** » Mon Sep 30, 2024 9:08 am

I had the same question, because UTF-16 isn't even used in that part... so mine was just a technical curiosity, since I don't know whether the CTL-OPT has any influence on the parser.

To trace:
ADDENVVAR ENVVAR(QIBM_RPG_DATA_INTO_TRACE_PARSER) VALUE(*STDOUT)


//***relevant part

DCL-DS *N;
  res_s VARCHAR(50000) CCSID(*UTF8) INZ('');
  res_ss VARCHAR(50000) SAMEPOS(res_s);
END-DS;

DCL-DS res LIKEDS(deeplTranslateResponse_t) INZ;

DCL-DS deeplTranslateResponse_t TEMPLATE QUALIFIED;
  cnt_translations INT(10) INZ(0);
  DCL-DS translations DIM(NTEXTS);
    detected_source_language VARCHAR(10) CCSID(*UTF8) INZ('');
    text VARCHAR(NBYTESTEXT) CCSID(*UTF8) INZ('');
    billed_characters INT(10) INZ(0);
  END-DS;
END-DS;

DATA-INTO res %DATA(res_s : 'allowmissing=yes allowextra=yes +
                    case=any countprefix=cnt_ ' )
              %PARSER('YAJLINTO');
//*****                                  


//FULL CODE: in case anyone needs to call the DeepL translation API from ILE,
//feel free to reuse, readapt or recycle the following, or use it as a working
//exercise for HTTPAPI tutorials. It was tested in production and returns
//correct Unicode (e.g. English to Chinese, German to Russian):
//
       CTL-OPT DFTACTGRP(*NO) BNDDIR('HTTPAPI');
       CTL-OPT CCSIDCVT(*LIST : *EXCP);
       //CTL-OPT CCSID(*UCS2 : *UTF16);

       /COPY DEEPL_H
       /COPY HTTPAPI_H

       DCL-PI *N;
         srcText_in VARUCS2(2000) CCSID(*UTF16);
         dstText_out VARUCS2(2000) CCSID(*UTF16);
         srcLang_in CHAR(10) OPTIONS(*OMIT) CONST;
         dstLang_in CHAR(10) CONST;
         context_in VARUCS2(1000) CCSID(*UTF16) OPTIONS(*NOPASS:*OMIT);
       END-PI;

       DCL-S srcText VARUCS2(2000) CCSID(*UTF16);
       DCL-S dstText VARUCS2(2000) CCSID(*UTF16);
       DCL-S context VARCHAR(2000) CCSID(*UTF8) INZ;
       DCL-S srcLang VARCHAR(10) CCSID(*UTF8) INZ;
       DCL-S dstLang VARCHAR(10) CCSID(*UTF8);
       DCL-S apiKey VARCHAR(255)
             INZ('mykey:fx');

       DCL-DS req LIKEDS(deeplTranslateRequest_t)  INZ;
       DCL-DS res LIKEDS(deeplTranslateResponse_t) INZ;

       DCL-DS *N;
         req_s VARCHAR(50000) CCSID(*UTF8) INZ('');
         req_ss VARCHAR(50000) SAMEPOS(req_s);
       END-DS;

       DCL-DS *N;
         res_s VARCHAR(50000) CCSID(*UTF8) INZ('');
         res_ss VARCHAR(50000) SAMEPOS(res_s);
       END-DS;
       DCL-S respCode INT(10);
       DCL-S errorNo INT(10);
       DCL-S errormex CHAR(80);

       *INLR = *ON;

       IF %PASSED(srcLang_in);
         srcLang = srcLang_in;
       ENDIF;
       IF %PASSED(context_in);
         context = context_in;
       ENDIF;
       srcText = srcText_in;
       dstLang = dstLang_in;

       req.text(1) = srcText;
       req.cnt_text = 1;
       req.source_lang = srcLang;
       req.target_lang = dstLang;
       req.context = context;

       DATA-GEN req %DATA(req_s : 'countprefix=cnt_') %GEN('YAJLDTAGEN');

       http_debug(*ON: '/tmp/httpapi_deepl.txt');
       http_setOption('network-ccsid': '1208');
       http_setOption('local-ccsid'  : '1208');
       http_setOption('timeout' : '8');
       http_xproc(HTTP_POINT_ADDL_HEADER: %PADDR(addHeaders));

       MONITOR;
       res_ss = http_string('POST':
                            DEEPL_BASE+DEEPL_TRANSLATE:
                            req_ss: 'application/json');

       DATA-INTO res %DATA(res_s : 'allowmissing=yes allowextra=yes +
                           case=any countprefix=cnt_ ' )
                                    %PARSER('YAJLINTO');

       dstText = res.translations(1).text;
       dstText_out = dstText;

       ON-ERROR;
         errormex = http_error(errorNo:respCode);
         SND-MSG %PROC+' '+ errormex +' '+%CHAR(errorNo)+' '+%CHAR(respCode);
       ENDMON;
       RETURN;

       DCL-PROC addHeaders;
         DCL-PI *N;
           toBeAdded VARCHAR(32767);
         END-PI;
         toBeAdded = 'Authorization: DeepL-Auth-Key '+ apiKey + X'0D25';
       END-PROC; 


//DEEPL_H

       /IF DEFINED(DEEPL_H)
       /EOF
       /ENDIF
       /DEFINE DEEPL_H

       DCL-C DEEPL_BASE 'https://api-free.deepl.com';
       //DCL-C DEEPL_BASE 'https://api.deepl.com';
       DCL-C DEEPL_TRANSLATE '/v2/translate';

       DCL-DS deeplAllowedLangs QUALIFIED;
         listSource CHAR(300)
              INZ('BG CS DA DE EL EN ES ET FI FR HU ID IT JA KO +
                   LT LV NB NL PL PT RO RU SK SL SV TR UK ZH');
         listDestination CHAR(300)
              INZ('AR BG CS DA DE EL EN-GB EN-US ES ET FI FR HU ID +
                   IT JA KO LT LV NB NL PL PT-BR PT-PT RO RU SK SL +
                   SV TR UK ZH ZH-HANS ZH-HANT');
       END-DS;

       DCL-C NTEXTS 5;
       DCL-C NBYTESTEXT 4000;

       DCL-PR DEEPLTR01R EXTPGM;
         srcText_in VARUCS2(2000) CCSID(*UTF16);
         dstText_out VARUCS2(2000) CCSID(*UTF16);
         srcLang_in CHAR(10) OPTIONS(*OMIT) CONST;
         dstLang_in CHAR(10) CONST;
         context_in VARUCS2(1000) CCSID(*UTF16) OPTIONS(*OMIT:*NOPASS);
       END-PR;

       DCL-DS deeplTranslateRequest_t TEMPLATE QUALIFIED;
         cnt_text INT(10) INZ(0);
         text VARCHAR(NBYTESTEXT) DIM(NTEXTS) CCSID(*UTF8) INZ('');
         source_lang VARCHAR(10) CCSID(*UTF8) INZ('');
         target_lang VARCHAR(10) CCSID(*UTF8) INZ('');
         context VARCHAR(2000) CCSID(*UTF8) INZ;
       END-DS;

       DCL-DS deeplTranslateResponse_t TEMPLATE QUALIFIED;
         cnt_translations INT(10) INZ(0);
         DCL-DS translations DIM(NTEXTS);
           detected_source_language VARCHAR(10) CCSID(*UTF8) INZ('');
           text VARCHAR(NBYTESTEXT) CCSID(*UTF8) INZ('');
           billed_characters INT(10) INZ(0);
         END-DS;
       END-DS;

thanks

Post by **Scott Klement** » Tue Oct 01, 2024 9:53 pm

Hello,

Thanks for providing an example.

Unfortunately, if I wanted to use this, it appears that I would have to pay for a commercial API.

Can you provide a sample program that can be used to reproduce the problem? I don't need to call the API. I don't need to do any useful work. Feel free to just hardcode some simple data in the program. The important part is that it reproduces the problem.

Post by **emaxt6** » Wed Oct 02, 2024 2:39 pm

Sure Scott.
By the way, the service used above also has a free tier (it requires registration for the key, of course). I'm not affiliated in any way, I just happen to use the service. Anyone feel free to copy the example.

To recreate the problem, see below.
It fails on my machine (V7R4).
Remove CTL-OPT CCSID(*UCS2 : *UTF16) and it works as is (text in the DS gets populated with the characters).
Reduce the "s" variable from 50000 to 5000 and it works even with the CTL-OPT on.
Is there some requirement on s or on the expected length?
A length of 32768 works, so is there a hardwired "magic number" of 32768 somewhere?

With the CTL-OPT off, it apparently still works with 50000.
Just a technical curiosity, since I don't know the DATA-INTO internals... after some attempts I got it to work, but it was weird at first...

       CTL-OPT DFTACTGRP(*NO) OPTION(*SRCSTMT);
       CTL-OPT CCSID(*UCS2 : *UTF16);

       DCL-PI *N;
       END-PI;

       DCL-DS res QUALIFIED;
         cnt_translations INT(10);
         DCL-DS translations;
           detected_source_language VARCHAR(10) CCSID(*UTF8) INZ;
           text VARCHAR(1000) CCSID(*UTF8) INZ;
         END-DS;
       END-DS;

       DCL-S s VARCHAR(50000) CCSID(*UTF8) INZ;
       DCL-S x VARUCS2(50000) CCSID(*UTF16);

       s = '{"extratag" : "IT" , +
             "translations": { "detected_source_language" : "EN", +
             "text" : "';
       s = s+U'D83DDE0AD83DDE0AD83DDE0AD83DDE0A';
       s = s+'"}}';

       DATA-INTO res %DATA(s : 'allowmissing=yes allowextra=yes +
                           case=any countprefix=cnt_ ' )
                                    %PARSER('YAJLINTO');



       *INLR = *ON;
       RETURN;

Post by **Scott Klement** » Wed Oct 02, 2024 3:18 pm

I was able to reproduce the problem. This appears to be a bug in RPG (rather than YAJL).

RPG is calling YAJLINTO and passing it parameters. One of the parameters is the CCSID of the input data. In your case, the input is the string 's', which is CCSID(*UTF8), so the CCSID it passes to YAJLINTO should be 1208 for UTF-8. However, when the CCSID(*UCS2:*UTF16) keyword is added and the length of 's' is > 32k, it passes 1200 (UTF-16) in the CCSID parameter to YAJLINTO. That is a bug: the CCSID of your 's' variable hasn't changed. It is still 1208 regardless of the length or the CTL-OPT CCSID keyword, because that CTL-OPT should not affect a field that has an explicit CCSID set.

Since RPG is telling YAJLINTO that the data is in CCSID 1200, YAJLINTO tries to convert it from 1200 to UTF-8, and the parse then fails with an error because the data isn't 1200, it's actually 1208.

To fix this, you'll have to report it to IBM (create a case) and have them create a PTF. Unfortunately, I don't have an active SWMA with IBM (since I'm a consultant, not a direct IBM customer) so you'll have to do this. But I'm happy to assist in any way I can.

Post by **emaxt6** » Thu Oct 03, 2024 12:43 pm

Thanks, Scott, for your support, time and insights.
I called IBM; they confirmed the bug, and a PTF will be released in the future.
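
Until that PTF is installed, one possible workaround, based only on the observations in this thread, is to leave CTL-OPT CCSID(*UCS2 : *UTF16) out of the module and keep CCSID(*UTF16) explicit on each UCS-2 field, which is the configuration reported above as working with a 50000-byte buffer. A minimal sketch, assuming YAJL (with YAJLINTO) is installed and in the library list; the JSON literal and the field x are illustrative only and not taken from the production program:

```rpgle
       CTL-OPT DFTACTGRP(*NO) OPTION(*SRCSTMT);
       // Workaround: no CTL-OPT CCSID(*UCS2 : *UTF16) in this module.

       DCL-DS res QUALIFIED;
         cnt_translations INT(10);
         DCL-DS translations;
           detected_source_language VARCHAR(10) CCSID(*UTF8) INZ;
           text VARCHAR(1000) CCSID(*UTF8) INZ;
         END-DS;
       END-DS;

       // Input buffer larger than 32k, explicitly UTF-8.
       DCL-S s VARCHAR(50000) CCSID(*UTF8) INZ;
       // UCS-2 field with its CCSID spelled out instead of defaulted.
       DCL-S x VARUCS2(1000) CCSID(*UTF16);

       // Hardcoded sample document (illustrative data).
       s = '{"translations": { "detected_source_language" : "EN", +
             "text" : "hello"}}';

       DATA-INTO res %DATA(s : 'allowmissing=yes allowextra=yes +
                           case=any countprefix=cnt_ ' )
                                    %PARSER('YAJLINTO');

       // Implicit UTF-8 to UTF-16 conversion on assignment.
       x = res.translations.text;

       *INLR = *ON;
       RETURN;
```

The other observation from the tests above, that the CTL-OPT can stay in place when the buffer is declared at 32768 bytes or less, may be an alternative if the responses are known to fit, but it has only been checked on the one V7R4 machine mentioned in this thread.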

scottklement.com
