RE: Problem parsing XML with EXPAT
Thank you Scott,
I appreciate you taking the time to explain that and I'm glad you did. I
thought the problem was with the CCSID I was using. The fact that my
receiver routines were being called should have been my first clue that
the data being passed to EXPAT was in an acceptable format and that I
just wasn't handling the data being returned correctly.
You were right about me using %str() to decode the data being returned.
Changing my program to receive the string in the C (UCS-2) data type
solved my problem.
Thanks again, I really appreciate your help.
Griz
-----Original Message-----
From: ftpapi-bounces@xxxxxxxxxxxxxxxxxxxxxx
[mailto:ftpapi-bounces@xxxxxxxxxxxxxxxxxxxxxx] On Behalf Of Scott
Klement
Sent: Wednesday, November 28, 2007 11:19 PM
To: HTTPAPI and FTPAPI Projects
Subject: Re: Problem parsing XML with EXPAT
Hi Griz,
Grab a sandwich and relax, while I give you the long-winded
explanation...
It's important to understand that XML supports many different character
sets, and that XML is not specifically designed for i5/OS.
i5/OS has this really neat feature where every file in its file systems
has a "ccsid" in the object description. This way, when you save a file
to disk, you can store the CCSID that represents its character set with
the file, and other applications/users/utilities, etc. can read that
CCSID and know what character set the data is in.
However, AFAIK, no other computer system has this feature. Windows,
Unix, Mac, etc... none of them have this CCSID feature. Therefore, it
can't be used for the XML standard.
So the XML standard says that XML documents will designate their
character set by putting an "encoding" in the opening <?xml> tag. And
if there's no encoding there, a parser should just ASSUME that the data
is in UTF-8 format.
<?xml version="1.0" encoding="iso-8859-1"?>
So if your document starts with something containing "encoding", like
the preceding example, then that encoding tells the XML parser what
format it's in. If not, then it's assumed to be in the UTF-8 flavor of
Unicode.
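To see that rule in action, here is a minimal sketch in Python, whose standard library ships a binding to Expat (xml.parsers.expat). The helper names parse_text and parses_ok and the two sample documents are made up for illustration; they are not part of the Expat download.

```python
import xml.parsers.expat


def parse_text(data: bytes) -> str:
    """Parse an XML document and return all of its character data."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()  # no encoding override
    # Expat may deliver text in several pieces, so collect and join them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)  # True = this is the final chunk of input
    return "".join(chunks)


def parses_ok(data: bytes) -> bool:
    """Return True if Expat accepts the document, False if it rejects it."""
    try:
        parse_text(data)
        return True
    except xml.parsers.expat.ExpatError:
        return False


# x'e9' is an accented e in ISO-8859-1, but it is not valid UTF-8 by itself.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?><name>Ren\xe9</name>'
undeclared = b'<?xml version="1.0"?><name>Ren\xe9</name>'
```

With the encoding declared, the parser should hand back the accented name. Without it, the parser assumes UTF-8 and should reject the same bytes.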
But -- there's still a problem. Did you spot it?
HOW THE HECK CAN AN XML PARSER READ THE "ENCODING" TAG IF IT DOESN'T
ALREADY KNOW WHAT CHARACTER SET THE DATA IS IN? Think about that. It's
a catch-22. The parser has to know which character set it's reading in
order to be able to understand the "encoding" attribute, and therefore
discover the character set.
So the symbols that make up the opening XML tag must always have
particular hex codes.
Fortunately, it's pretty easy. The four basic encodings of XML are
US-ASCII, ISO-8859-1, UTF-8 and UTF-16. In the first three (US-ASCII,
ISO-8859-1 and UTF-8) the hex code of the < character is always x'3c'.
The hex code of the ? character is always x'3f'. So there's no
conflict. In UTF-16, the < character is always either x'003c' (big
endian) or x'3c00' (little endian). So it's pretty easy for a program
to read the first two bytes of a file and determine enough about the
encoding to be able to read the opening <?xml> tag to determine what the
actual encoding is.
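The two-byte check described above could be sketched like this in Python. This is only an illustration of the idea, not how Expat itself is written; the function name sniff_xml_family and the labels it returns are invented for the example, and it deliberately ignores anything (such as byte order marks) that isn't discussed here.

```python
def sniff_xml_family(data: bytes) -> str:
    """Guess the encoding family from the first two bytes of a document.

    Only covers the cases discussed above: a document that starts
    with the '<' of the opening <?xml tag.
    """
    head = data[:2]
    if head == b"\x00\x3c":      # '<' as a big-endian 16-bit unit
        return "utf-16-be"
    if head == b"\x3c\x00":      # '<' as a little-endian 16-bit unit
        return "utf-16-le"
    if head == b"\x3c\x3f":      # '<?' in US-ASCII, ISO-8859-1 or UTF-8
        return "ascii-compatible"
    return "unknown"             # e.g. EBCDIC, where '<' is not x'3c'


sample = '<?xml version="1.0"?><a/>'
```

Once the family is known, a parser can read the rest of the opening tag and pick up the real encoding from it. Note that the same text in EBCDIC (for example Python's cp037 codec) falls through to "unknown".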
With those rules, the XML standard will work with any flavor of ASCII or
single or double-byte Unicode without having to know the encoding before
opening the file. However, it'll NEVER work with EBCDIC. A proper XML
parser does not work with EBCDIC data.
Whew. Like I said... that was long winded. But it explains why you
have to translate your data to ASCII to parse it with Expat.
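As a rough demonstration, the following Python sketch feeds the same document to Expat twice: once as EBCDIC and once after translating it to ASCII. It assumes Python's cp037 codec is a fair stand-in for the EBCDIC on your system (your CCSID may differ), and the helper names expat_text and translate_to_ascii are hypothetical.

```python
import xml.parsers.expat

doc = '<?xml version="1.0"?><greeting>hello</greeting>'
ebcdic_bytes = doc.encode("cp037")  # assumption: cp037 stands in for your CCSID


def expat_text(data: bytes) -> str:
    """Run the bytes through Expat and return the character data."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    return "".join(chunks)


def translate_to_ascii(data: bytes, ebcdic_codec: str = "cp037") -> bytes:
    """Translate EBCDIC bytes to ASCII before handing them to the parser."""
    return data.decode(ebcdic_codec).encode("ascii")


try:
    expat_text(ebcdic_bytes)
    ebcdic_accepted = True
except xml.parsers.expat.ExpatError:
    # In EBCDIC the document does not even start with x'3c'.
    ebcdic_accepted = False

translated_text = expat_text(translate_to_ascii(ebcdic_bytes))
```

On i5/OS the translation step would be done with whatever conversion routine you already use; the point is only that it has to happen before the data reaches Expat.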
Note that IBM's XML parser that's built into ILE RPG (V5R4 feature) does
work with EBCDIC. I talked to Barbara Morris about this, and she said
that this built-in XML parser always uses the CCSID of your job for
alphanumeric fields, UCS-2 for UCS-2 data type fields, and the CCSID that
the stream file is tagged with when reading a file. Technically, this
behavior is wrong, because it completely ignores the "encoding"
attribute -- which violates the XML spec. It also means that if you
transfer a file from another system, you have to make darned sure that
you set the CCSID correctly on the file. And I have no idea how you'd
be sure of that without writing your own code to interrogate the file!
But all of this is moot since you're using Expat, and Expat does
respect the XML standard. (It's just the built-in one in RPG that does
not.)
Okay... your 2nd question about getting blanks back from Expat... that
one I don't understand. I guess it could potentially be because you
haven't specified an encoding, so it'll default to UTF-8. But if your
data is in ASCII rather than UTF-8, you might have problems with some
characters. However, this probably ISN'T it, since most characters in
ASCII are the same as UTF-8, and therefore this wouldn't happen on every
element.
The only other thing I can think of is that your handler procedures are
written incorrectly, and are using %STR() to decode the data sent to
them. Expat hands the data back as double-byte Unicode, so only less
common strings (such as some accented characters, Asian alphabets, etc.)
would start with anything other than x'00'. And %str() uses x'00' to
denote "end-of-data"... so your data would always come back as
zero-length strings.
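The effect is easy to reproduce outside of RPG. In this Python sketch, c_string is a made-up helper that mimics a null-terminated read the way %str() does, and the sample strings are assumed to arrive as big-endian double-byte Unicode (the byte order is an assumption based on the platform, not something Expat guarantees everywhere).

```python
def c_string(buf: bytes) -> bytes:
    """Mimic a null-terminated read: return the bytes before the first x'00'."""
    end = buf.find(b"\x00")
    return buf if end == -1 else buf[:end]


# Plain Latin text: every character's first byte is x'00'.
latin = "Griz".encode("utf-16-be")

# CJK text: neither byte of these characters is x'00'.
cjk = "日本".encode("utf-16-be")

truncated = c_string(latin)          # stops at the very first byte
untouched = c_string(cjk)            # no x'00' found, nothing is cut off
decoded = latin.decode("utf-16-be")  # treating the data as double-byte works
```

For ordinary English element names and values, the null-terminated read ends at byte one, which matches the "blanks on every element" symptom. Reading the same bytes as double-byte characters recovers the text.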
To fix it, you'll either have to compile Expat to output UTF-8 (which I
don't recommend, since UTF-8 is not easy to deal with in RPG), or
you'll have to receive the strings using RPG's C (UCS-2) data type like
I do in the sample programs included with the Expat download.
-----------------------------------------------------------------------
This is the FTPAPI mailing list. To unsubscribe, please go to:
http://www.scottklement.com/mailman/listinfo/ftpapi
-----------------------------------------------------------------------