From: Kenneth Whistler (firstname.lastname@example.org)
Date: Wed Jan 05 2005 - 16:24:20 CST
> On Wednesday, January 5th, 2005 19:17Z, Kenneth Whistler wrote:
> >> The Tibetan characters are _never_ encoded using Unicode in this
> >> process, are they?
> >> Looks like a clear case of nonconformance to me.
> > Not at all.
> Indeed, it seems there is no necessity to use Unicode-defined code points to
> represent anything.
Not quite. They represent neither more nor less than what
they are supposed to. An assigned Unicode code point associates
that code point with a particular abstract character, to create
an encoded character.
U+0062 is the encoded character LATIN SMALL LETTER B, neither
more nor less.
What people choose to *do* with that "b" is their own business,
including any and all weird semiotic usages they may choose
to put it to.
As I said, somebody may decide that the letter "b" is then
used to represent a chocolate chip cookie recipe, if they
want. Who's to stop them? Who's to stop them from doing so
now, *regardless* of the encoding? That's the *point*.
> > The Unicode *conformance* issue there is whether the Latin
> > letter "b" used in the Wylie transliteration is correctly
> > represented as U+0062, and whether, if using UTF-16, that
> > shows up in stored data and strings as a 16-bit code unit,
> > 0x0062, or if using UTF-8, that shows up in stored data
> > and strings as an 8-bit code unit, 0x62, and so on.
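The code-unit behavior described in that quoted paragraph can be checked directly. A minimal sketch (mine, not from the original mail), showing that the Wylie "b" is just U+0062 and that its code units follow mechanically from the chosen encoding form:

```python
# U+0062 LATIN SMALL LETTER B, and its representation in two encoding forms.
b = "\u0062"
print(b.encode("utf-16-be"))  # b'\x00b' -> one 16-bit code unit, 0x0062
print(b.encode("utf-8"))      # b'b'     -> one 8-bit code unit, 0x62
```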
> But there is no Latin letter "b" here; we are dealing with Tibetan
> letters, aren't we?
No, we are dealing with the encoded Latin letter "b" that someone is then
using to represent a Tibetan letter.
In some other context, they might be using it to represent the second
element of the English alphabet, or they might be using it to represent
a bilabial voiced stop, or ..., or...
I think you may be confused simply because transliteration involves
the symbolic use of characters from one script to represent
characters from another script, and then people may invent creative
ways of displaying transliterations that involve protocols other
than simple plain text.
> Or did you switch one level lower, disregarding the semantic meaning of the
> transliteration text, to focus only on the graphemes used in the
Yes. Which is the appropriate level to consider here.
> which happen to be English letters in ASCII/UTF-8
Latin letters. In Unicode. (But it doesn't really matter, because
the argument would be exactly the same for *any* character encoding
that includes characters from more than one script, being used
in this symbolic way.)
> To make a more extreme (and dumb) example, let's assume I have an
> ISCII-based rendering system, using Roman (reversed for you)
> transliterations but not plain English (that is, both A and a would be
> written \xA4 if we speak about the grapheme, or \xAC if we speak about the
> English letter).
This is mixing a couple of things -- writing "A" or "a" with \xA4
(= U+0905 DEVANAGARI LETTER A) would be a transliteration system;
writing the English phoneme /ey/ (the pronunciation of the
letter "A") with \xAC (= U+090F DEVANAGARI LETTER E) would be
a transcription system. But never mind, since it doesn't impact
the answer to your question below.
> Furthermore it exchanges them by adding a signaling 0xEC00
> to the ISCII code points, while adding nothing to the ASCII code points,
> resulting in using the ranges 0x000A-0x0040, 0x005B-0x0060, 0x007B-0x007E,
> and 0xECA1-0xECFA.
> Can I claim conformance to Unicode/10646 on the basis I am using codepoints
> 0020 for SPACE, 002C for COMMA etc., that I do not destroy surrogates, I do
> not emit FFFF etc. etc.?
What you do with U+ECA1..U+ECFA is your own private business. And
if you want to define those code points as being an EC00-shift
ISCII transliteration (or transcription) system for English, more
power to you.
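For concreteness, here is a hedged sketch of the hypothetical EC00-shift scheme described in the quoted question (the function name and structure are mine, purely illustrative; no real system is implied):

```python
def iscii_shift_encode(byte: int) -> int:
    """Map an ISCII letter byte into the PUA by adding 0xEC00,
    passing ASCII bytes through unchanged, as in the example above."""
    if 0xA1 <= byte <= 0xFA:
        return 0xEC00 + byte   # e.g. 0xA4 -> U+ECA4, a Private Use code point
    return byte                # e.g. 0x2C -> U+002C COMMA, unchanged

print(hex(iscii_shift_encode(0xA4)))  # 0xeca4
print(hex(iscii_shift_encode(0x2C)))  # 0x2c
```

Conformance, on this reading, constrains only the second guarantee: that U+002C is COMMA and U+ECA4 is a Private Use code point, not what the private scheme means by them.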
> [ Or is there a special case for the Latin letters that disallows this? ]
> Second question, if the above is "Yes I can claim conformance", what is the
> point of claiming conformance to Unicode/10646 (in such a case)?
The point is that you would be guaranteeing to a recipient of your
data that (assuming you were using UTF-16), 0x0020 was SPACE and
0x002C was COMMA and 0xECA1 was PRIVATE USE CHARACTER-ECA1 and so on.
And you would be guaranteeing to a recipient that such data was
not JPEG or MIME or GB2312 plain text or any other conceivable thing
that some bag of binary coming down a wire could be.
What conformance to the Unicode Standard won't buy you is any
comprehension by your recipient of what your strange use of
PUA code points and the particulars of your Devanagari transliteration
of English are, nor how to convert it to display on an ISCII system.
For that, you need to convey your higher-level protocol to your
recipient by some other means.
> I remember Peter Constable remarking once that a process that rings the bell
> when submitted the code 7 is Unicode-conformant.
And he's right.
For that matter, a process that dispenses the cup of hot
tea when submitted the code U+2615 is Unicode-conformant.
In either case, the conformance issue comes down to some
pattern of binary bits in a data stream being interpreted
as a character, according to the assignments and code charts
of the standard.
What happens as a result of that interpretation, or what
protocol might be layered on top of that interpretation, is
up to the creative minds of everybody using characters
to do whatever they want.
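That split between interpretation and behavior can be sketched in a few lines (the `handle` function and its responses are my own illustration, assuming nothing beyond the code charts):

```python
import unicodedata

def handle(cp: int) -> str:
    # The *interpretation* (code point -> character) follows the Unicode
    # code charts; the *behavior* layered on top is the process's own business.
    if cp == 0x07:       # U+0007 BELL, a C0 control
        return "ring the bell"
    if cp == 0x2615:     # U+2615 HOT BEVERAGE
        return "dispense the tea"
    return unicodedata.name(chr(cp), "unassigned or control")

print(handle(0x2615))  # dispense the tea
```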
I think perhaps the difficulty you are expressing comes from
the assumption that "X conforms to the Unicode Standard" should
imply something about a coverage of some particular repertoire
with some minimum standards of input and rendering, and so
on. But I think that constitutes a different class of claims
than claims of conformance to the standard itself.
Consider it this way. Suppose I have some software that
purports to be an editor that "supports Greek". Now a claim
like that would reasonably be interpreted as being able to
input, edit, display, and print Greek text, and also to
perform other typical tasks, perhaps including spellchecking,
and so on. I would expect such things *regardless* of
whether the implementation internally was using 8859-7 or
Unicode or something else to represent the characters.
People might expect more of an editor that claims to be
"a Unicode implementation that supports Greek", simply because
Unicode contains more Greek characters than 8859-7, and because
you might then expect it to support Greek *and* one or more
other scripts as well. But that is really orthogonal to
the fundamental conformance issues of ensuring that
inside, deep under the covers, 0x039C is being interpreted
as GREEK CAPITAL LETTER MU and not some other random thing,
and that 0x03AC is treated equivalently to <0x03B1, 0x0301>,
and so on.
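Those under-the-covers guarantees are directly checkable; a short sketch using Python's standard `unicodedata` module:

```python
import unicodedata

# 0x039C must be interpreted as GREEK CAPITAL LETTER MU, nothing else.
print(unicodedata.name("\u039C"))  # GREEK CAPITAL LETTER MU

# U+03AC is canonically equivalent to <U+03B1, U+0301>; normalization
# converts between the precomposed and decomposed representations.
print(unicodedata.normalize("NFC", "\u03B1\u0301") == "\u03AC")   # True
print(unicodedata.normalize("NFD", "\u03AC") == "\u03B1\u0301")   # True
```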
This archive was generated by hypermail 2.1.5 : Wed Jan 05 2005 - 16:33:15 CST