Re: japanese xml

From: Viranga Ratnaike (viranga@mds.rmit.edu.au)
Date: Fri Aug 31 2001 - 05:09:32 EDT


...apologies if I've posted this twice, mutt crashed on me as I was trying
to post it the first time.

On Thu, Aug 30, 2001 at 11:55:24PM -0500, Peter_Constable@sil.org wrote:
> So, it comes down to a question of how we define "encode", and of the
> usage context that determines our definition. Marco was assuming a
> definition as it would be used internal to Unicode. Misha apparently
> was using a broader definition that is valid in other contexts, though
> not internally to Unicode.
>
> So, they were both right in relation to the assumptions they were
> making. The question, though, is what definition or context Viranga
> was assuming when the question was asked.
 
Hi All,
 
        I started writing the context, but it soon turned into my work
        history. This is my second attempt : ) Thanks for your patience.
        And apologies for not replying to the thread sooner. I work in
        Australia, which puts me slightly out of phase with most people.
        And apologies for my previously vague questions. Tho' I must
        admit that, in hindsight, I'm glad the questions were open to
        interpretation, as I have learned much from the thread : )
 
    My (Viranga's) original question was:
> Is it ok for Unicode code points to be encoded/serialized using EUC?
> I'm not planning on doing this; just wondering what (?if any?)
> restrictions there are on the choice of transformation format.
 
        Perhaps I can ask another question (with a slightly wider scope).
 
        When I came across the weekly-euc-jp.xml document, I was rapt; an
        xml document with japanese tags. But when I looked at the underlying
        hex, it clearly wasn't "encoded" using a UTF. This confused me,
        as I was (?mistakenly?) under the impression that XML required unicode.
        I have read W3C's XML spec
        (see http://www.w3.org/TR/2000/REC-xml-20001006#sec-well-formed)
 
            "2.2 Characters
             [Definition: A parsed entity contains text, a sequence of
             characters, which may represent markup or character data.]
             [Definition: A character is an atomic unit of text as specified
             by ISO/IEC 10646 [ISO/IEC 10646] (see also [ISO/IEC 10646-2000]).
             Legal characters are tab, carriage return, line feed, and the
             legal characters of Unicode and ISO/IEC 10646."
 
             [rest of paragraph deleted]
 
            "Character Range
 
             [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
                           [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 
             /* any Unicode character, excluding the
                surrogate blocks, FFFE, and FFFF. */ "
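        The Char production above is mechanical enough to sketch as code.
        A minimal predicate (Python, written purely for illustration; the
        ranges are copied straight from production [2]):

```python
def is_xml_char(cp: int) -> bool:
    """True if code point cp matches XML 1.0 production [2] Char."""
    return (cp in (0x9, 0xA, 0xD)           # tab, line feed, carriage return
            or 0x20 <= cp <= 0xD7FF         # BMP, below the surrogate blocks
            or 0xE000 <= cp <= 0xFFFD       # BMP, above the surrogates; excludes FFFE/FFFF
            or 0x10000 <= cp <= 0x10FFFF)   # supplementary planes
```

        So, for example, a surrogate code point such as 0xD800 is rejected,
        while supplementary-plane characters are accepted.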
 
 
        However, in the spec's next paragraph...
 
            "The mechanism for encoding character code points into bit
             patterns may vary from entity to entity. All XML processors
             must accept the UTF-8 and UTF-16 encodings of 10646; the
             mechanisms for signaling which of the two is in use, or for
             bringing other encodings into play, are discussed later, in
             4.3.3 Character Encoding in Entities."
 
        If the character set is specified as ISO/IEC 10646, in what
        circumstances would it be appropriate to use an "encoding" other
        than UTF-8 or UTF-16?
 
        Further questions are:
 
            Could I, theoretically, invent my own encoding and say that this
            is conformant XML?
 
            Would the character set I use have to be Unicode/10646?
            Or could an XML document use one of the JIS character sets?
 
        The page at...
 
            http://java.sun.com/xml/jaxp-1.1/examples/samples/weekly-euc-jp.xml
 
        ...states
 
            <?xml version="1.0" encoding="euc-jp"?>
 
        Does '-jp' (or "euc-jp" collectively) imply JIS?
 
            If so, does this violate section 2.2 from the XML 1.0 standard?
            Can you have a document that simultaneously satisfies Unicode and
            JIS? Or (as is more likely : ) is my understanding flawed?
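        As I understand it (and someone please correct me if this is wrong),
        the encoding declaration only names the byte serialization of the
        entity; the characters themselves are 10646 either way. A small
        Python sketch of that distinction, using a made-up tag name:

```python
tag = "週報"                   # a hypothetical japanese element name
euc = tag.encode("euc-jp")    # bytes as an euc-jp entity would store them
utf8 = tag.encode("utf-8")    # bytes as a UTF-8 entity would store them

assert euc != utf8                                    # different serializations...
assert euc.decode("euc-jp") == utf8.decode("utf-8")   # ...identical 10646 characters
```

        On this reading, "euc-jp" is a JIS-based byte encoding of (a subset
        of) the 10646 character set, rather than a switch to a different
        character set.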
 
Regards,
 
        Viranga
 
P.S. I am interested in the DoCoMo/WAP stuff purely as a source of "real"
        japanese XML/XHTML documents; we're not in the phone business.
 
P.P.S. I have looked at ICU but have had difficulty compiling it on a
        Solaris box (our principal OS for new development is Solaris 8).
        I'm a lurker on the icu list; noting with some hope the increased
        success other people seem to be having re: compiling with solaris.
 
P.P.P.S. For those who might be interested:
 
        The group I work for is planning on going to Japan to find a
        Japanese partner for the software we produce. We're essentially
        an SGML/XML group that writes document management systems,
        high performance information retrieval engines, ...
        But we don't really have much (any) experience with East Asian
        scripts and languages.
 
        I'm one of the people responsible for making us unicode conformant,
        and for keeping an eye on the unicode mailing list. Most of my job
        involves writing C++ class libraries, database (for want of a better
        word) "wizards", and most recently helping out with a demo (to show
        that we can supply a toolkit for japanese developers to use).
 
        So, I was spending some of my time hunting for japanese documents.
        Preferably in unicode, because we can (hopefully) do intelligent things
        with the character properties in word parsers, finite state machines,
        etc. (I'm sure there are other things; I just can't think of them right now :)
 
        Our string classes are essentially smart arrays of 8-bit, 16-bit
        and 32-bit code units. We also use James Clark's (SP and Expat)
        parsers. We have seen references to JIS in his stuff, but would
        rather stick to interfacing with the Unicode stuff (mainly because
        it's so much easier supporting just the one thing internally, and
        we can deal with other character sets either by converting them to
        Unicode or by promising only storage and retrieval of raw data
        without interpreting it in any way).
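        To make the "convert at the boundary" idea concrete, a rough sketch
        (in Python rather than our C++, and with made-up data):

```python
raw = "日本語".encode("euc-jp")      # stand-in for EUC-JP bytes read from a file

text = raw.decode("euc-jp")          # decode once, at the boundary -> Unicode
units32 = [ord(c) for c in text]     # 32-bit code units (a UTF-32 view)
units8 = list(text.encode("utf-8"))  # 8-bit code units (a UTF-8 view)
```

        Everything downstream then sees code points; whether the string class
        holds 8-, 16- or 32-bit code units becomes a representation detail.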
 
P.P.P.P.S. Which leads me to ask for a clarification of the interoperability
            issue that David Starner introduced:
 
> So EUC-JP <-> Shift-JIS produces different results than
> EUC-JP <-> Unicode <-> Shift-JIS.
 
        Does one of the transformations produce lossy output or mutations,
        or is it some other issue?
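        For what it's worth, here is how I picture the second path (Python
        sketch; the remark about ambiguous mappings is my reading of the
        thread, not something I've verified):

```python
euc = "日本語".encode("euc-jp")

# Path 2: pivot through Unicode (EUC-JP -> Unicode -> Shift-JIS).
via_unicode = euc.decode("euc-jp").encode("shift_jis")

# Path 1 would be a direct table lookup on JIS row/cell numbers, which
# Python doesn't expose. The two paths can presumably differ for characters
# whose Unicode mappings are ambiguous (wave dash vs. fullwidth tilde,
# vendor extensions), even though plain kanji like these round-trip fine.
back = via_unicode.decode("shift_jis").encode("euc-jp")
assert back == euc
```

        So the question reduces to: for which characters do the conversion
        tables of the two paths disagree, and is the disagreement lossy?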



This archive was generated by hypermail 2.1.2 : Fri Aug 31 2001 - 06:33:26 EDT