RE: japanese xml

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Thu Aug 30 2001 - 13:14:49 EDT


Misha Wolf wrote:
> You seem to be implying that Viranga's question was:
> "Can one encode all Unicode code points using EUC?"
>
> That is a strange interpretation of:
> "Is it ok for Unicode code points to be encoded/serialized
> using EUC?"

In fact, that is exactly my interpretation. I think that it may appear
"strange" only if you ignore (I don't know whether deliberately or not) the
Unicode meaning of a basic term like "to encode".

> Furthermore, Viranga's context appears to be XML, in which
> case it *is* possible to encode *all* Unicode code points
> using EUC (or ISO-8859-1 or ASCII or ...)

Yes, yes. XML documents can represent characters in at least two ways:

1) By encoding them directly with the underlying plain-text encoding. This
encoding is either declared in documents or defaults to UTF-8 or UTF-16.

2) By representing them with numeric references in the form "&#1234;" etc.
The numeric references themselves are sequences of characters ("&" + "#" +
one or more of "0".."9" + ";") expressed in the underlying plain text
encoding. The meaning of a numeric reference for an XML parser is the single
Unicode character whose code is written between "&#" and ";".
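
As a rough illustration of the two ways above, here is a minimal Python sketch. The sample string is my own choice, and the codec names "euc_jp" and "ascii" and the "xmlcharrefreplace" error handler belong to Python's standard library, not to XML or Unicode:

```python
text = "日本語"  # sample text, chosen only for illustration

# Way 1: the characters themselves, as bytes of the declared encoding.
direct = text.encode("euc_jp")

# Way 2: numeric references; every byte of the result is plain ASCII.
refs = text.encode("ascii", errors="xmlcharrefreplace")

print(direct)  # two bytes per character in EUC-JP
print(refs)    # b'&#26085;&#26412;&#35486;'
```

Only the first result is Japanese text as far as a text editor is concerned; the second is a run of ASCII punctuation and digits.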

In the context of Unicode and, more generally, of plain-text encoding, "to
encode" means only point 1 above, and "&#1234;" is just a seven-character
string. BTW, this is also the interpretation of the tools (text editors, etc.)
used to manipulate XML files -- so it is not a pointless distinction for
someone working in XML.
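
To make the difference concrete, here is a small sketch (assuming Python and its standard xml.etree.ElementTree parser; the element name "t" is made up for the example) that looks at the same reference first as plain text and then through an XML parser:

```python
import xml.etree.ElementTree as ET

ref = "&#1234;"

# As plain text, the reference is a string of ordinary characters.
as_plain_text = len(ref)

# To an XML parser, it stands for the single character U+04D2.
as_xml = ET.fromstring("<t>" + ref + "</t>").text

print(as_plain_text)  # 7
print(len(as_xml))    # 1
```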

Point 2, in Unicode parlance, is called a "higher-level protocol", and it is
considered out of the scope of the standard. If you use the term "encoding"
for this kind of syntax you are deliberately using a term outside its normal
technical meaning, so you are causing confusion, like it or not.

Numeric references are a well-known feature of both XML and HTML. There is
very little chance that Viranga or other people working with XML don't know
such a trivial thing. My impression (and perhaps not only *my* impression)
is that Viranga was asking for help with arcane text encoding issues, rather
than with trivial XML syntax.

In my experience, it is quite plausible that an experienced software
engineer may have problems understanding the caveats of Unicode and DBCSs,
and how they differ. It is not so likely that someone needs help with the
basics of his own field.

So, how helpful can it be to say: "Oh, sure you can encode Unicode in EUC
(implied: as long as all characters >= U+0080 are represented with numeric
references, so that EUC practically becomes ASCII, and the Japanese text
becomes unreadable within a text editor)"?

Or were you perhaps trying to say: "Oh, sure you can encode Unicode in EUC
(implied: as long as you convert Unicode to JIS)"? In this case, the answer
is not even formally correct: the question was how to encode/serialize
Unicode, not how to change it into something else (that was his next
question, BTW).
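
For what it is worth, the limit of EUC as an encoding can be sketched like this (assuming Python's "euc_jp" codec as a stand-in for EUC; the Devanagari letter is just an arbitrary example of a character outside the JIS repertoire):

```python
mixed = "日本語 अ"  # Japanese text plus one character that JIS does not contain

# Encoding directly only works for characters that exist in the JIS sets.
try:
    mixed.encode("euc_jp")
    direct_ok = True
except UnicodeEncodeError:
    direct_ok = False

# Anything else has to fall back on the higher-level protocol.
fallback = mixed.encode("euc_jp", errors="xmlcharrefreplace")

print(direct_ok)  # False
print(fallback)   # the last character comes out as b'&#2309;'
```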

_ Marco


