Re: IJ joint in spaced lettering

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jan 11 2006 - 19:28:19 CST


    > From: "Kenneth Whistler" <kenw@sybase.com>
    > >> Another related question: Why isn't there a standard 16-bit UTF
    > >> that preserves the binary ordering of codepoints?
    > >> (I mean for example UTF-16 modified simply by moving all
    > >> code units or code points in E000..FFFF down to D800..F7FF
    > >> and moving surrogate code units in D800..DFFF up to F800..FFFF).
    > >
    > > Huh? Because it would confuse the hell out of everybody and lead
    > > to problems, just like any other putative fixes by proliferation
    > > of UTF's.
    > >
    > > Sorting UTF-16 in binary order is easy. See "UTF-16 in UTF-8 Order",
    > > p. 136 of TUS 4.0.
    >
    > I don't say it is not easy to do. What I just indicated is
    > that there are applications where one really wants pure binary
    > sort order, where it would also be good that it preserves the order
    > of codepoints (like with UTF-8 and UTF-32, but not in UTF-16).

    So?

    Given that UTF-16 doesn't sort in binary code point order for
    supplementary characters, you program around the problem if you
    need to.

    Advocating changing the encoding of the *data* to work around a
    limitation of an algorithm when dealing with that data strikes
    me as just another invitation to bad engineering practice.
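
    For what it's worth, here is a minimal sketch in C of what
    "programming around the problem" can look like: a comparison
    routine that remaps code units on the fly (E000..FFFF down by
    0x800, surrogates up by 0x2000 -- the same rearrangement proposed
    above) instead of re-encoding the stored data. The function and
    type names here are just illustrative, not from any standard API.

#include <stddef.h>
#include <stdint.h>

/* Remap one UTF-16 code unit so that comparing remapped values
   yields code point order: E000..FFFF move down to D800..F7FF,
   surrogates D800..DFFF move up to F800..FFFF. */
static uint16_t cp_order_fixup(uint16_t u)
{
    if (u >= 0xE000) return (uint16_t)(u - 0x800);
    if (u >= 0xD800) return (uint16_t)(u + 0x2000);
    return u;
}

/* Compare two well-formed UTF-16 strings in code point order,
   without changing how the data is encoded or stored. */
int utf16_cmp_cp_order(const uint16_t *a, size_t alen,
                       const uint16_t *b, size_t blen)
{
    size_t i, n = (alen < blen) ? alen : blen;
    for (i = 0; i < n; i++) {
        uint16_t ua = cp_order_fixup(a[i]);
        uint16_t ub = cp_order_fixup(b[i]);
        if (ua != ub) return (ua < ub) ? -1 : 1;
    }
    return (alen == blen) ? 0 : ((alen < blen) ? -1 : 1);
}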

    >
    > Maybe what you are replying there is that Unicode does not want
    > to add more standard UTFs,

    Well, roughly, yes. More precisely, I would put it that the UTC
    is absolutely on record as not wanting to modify (or add to) the
    3 standard Unicode Encoding Forms (UTF-8, UTF-16, and UTF-32) in
    any way, whatsoever, period, end of story.

    > and instead prefer to insist that such UTFs

    You are already off the rails here. "UTFs", as an indefinite plural,
    is an undefined concept as far as the UTC is concerned. The Unicode
    Standard doesn't define some generic concept of encoding bijections,
    call them "UTFs", claim to be standardizing 3 of them, and then
    invite others to make up however many more they want to.

    The Unicode Standard specifies and standardizes 3 "Unicode Encoding
    Forms", which are designed as bijections, and says that to conform
    to the standard, you use one of those, period.

    > should remain private (requiring explicit agreements between users,
    > or using private internal interfaces and APIs, so that no public
    > standard will need to be standardized).

    The UTC can't prevent people from doing whatever odd things pop
    into their heads, but I can assure you there isn't any sentiment
    on the UTC that implementers should be off making up more
    "UTFs" in efforts to solve sorting problems at the encoding level,
    and then exchanging such data in putatively "private", "internal"
    interfaces. The chances are better than even that somewhere down
    the line such data will leak into public contexts and create
    data corruption that somebody *else*, not the originator, is going
    to have to deal with.

    >
    > It's just that alternative UTFs are still possible without
    > affecting full conformance with the Unicode standard: with
    > the same required properties for all UTFs that they MUST
    > preserve the exact encoding of all valid codepoints between
    > U+0000 and U+10FFFF, including non-characters,

    You're just making this up, right?

    > and that they must not change their relative encoding order
    > in strings so that all normalization forms and denormalizations
    > are preserved,

    How are these connected? UTF-16 and UTF-32 don't have the same
    "relative encoding order in strings", but do preserve normalization
    forms. Again, you're just making this up, right?

    > all this meaning there must exist a bijection between all
    > UTFs applied to all Unicode strings.

    Who said?

    >
    > If this is still not clear enough, the standard should insist
    > that it documents 3 UTFs explicitly with several byte ordering
    > options for endianness,

    It says perfectly clearly that it documents 3 Unicode Encoding Forms.
    What is unclear about that?

    > but this still does not restrict full conformance only to these.

    Actually, it does.

    > In fact Unicode also approves SCSU and BOCU-8, and because they
    > respect the bijection rule, they are already compliant UTFs.

    The UTC (not "Unicode") approved SCSU as a Unicode Technical Standard.
    It is not part of the Unicode Standard, and it isn't a Unicode
    Encoding Form. SCSU is losslessly convertible to/from Unicode,
    does *not* sort in code point order, meets the criteria for an
    IANA charset, and is not MIME-compatible. It is a stateful encoding,
    and is non-deterministic (different encoders may produce different
    actual SCSU sequences as output).

    BOCU-*1* is not something approved by the UTC at all. It is an
    independent specification (not a standard) developed by a member
    of the Unicode Consortium. It is not part of the Unicode Standard,
    and it isn't a Unicode Encoding Form. BOCU-1 is losslessly convertible
    to/from Unicode, *does* sort in code point order, meets the
    criteria for an IANA charset, and is MIME-compatible. It is a
    stateful encoding, and has deterministic output.

    > But it should be clear in the standard that they are just
    > examples of valid UTFs,

    No, that is not at all clear, nor is that the intent in the standard
    whatsoever.

    > recommended for interchange across heterogeneous systems or
    > networks, and that applications can use their own alternate
    > representation, as needed to comply with other needs

    The last part of this is certainly true. The use of BOCU-1 as
    a compression in a database would be an example of an application
    using its own alternate representation of data.
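
    As a rough illustration (assuming ICU4C is available -- the
    converter name "BOCU-1" and the ucnv_* calls below are ICU's
    API, not part of the Unicode Standard), lossless round-tripping
    through such an alternate representation looks like this:

#include <stdio.h>
#include <unicode/ucnv.h>

int main(void)
{
    UErrorCode status = U_ZERO_ERROR;
    UConverter *cnv = ucnv_open("BOCU-1", &status);

    /* U+4E8C followed by U+10000 (as a surrogate pair), NUL-terminated. */
    const UChar src[] = { 0x4E8C, 0xD800, 0xDC00, 0 };
    char bytes[64];
    UChar back[8];

    /* UTF-16 -> BOCU-1 ... */
    int32_t blen = ucnv_fromUChars(cnv, bytes, (int32_t)sizeof bytes,
                                   src, -1, &status);
    /* ... and back again, after resetting the stateful converter. */
    ucnv_reset(cnv);
    int32_t ulen = ucnv_toUChars(cnv, back, 8, bytes, blen, &status);

    printf("BOCU-1: %d bytes, round trip: %d UTF-16 code units (%s)\n",
           (int)blen, (int)ulen,
           U_SUCCESS(status) ? "lossless" : "error");
    ucnv_close(cnv);
    return 0;
}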

    > (for example any attempt to make any standard UTF fit on platforms
    > with 64-bit or 80-bit word size would already require an extension,
    > which cannot strictly be equal to any standardized UTF, even if
    > it's just a simple zero-bit padding, which requires an additional
    > specification for the validity of binary interfaces).

    The implementation of encoding forms on platforms whose
    native word sizes exceed the size of code units has never
    been considered an issue of "requir[ing] an extension ...
    to any standardized UTF". It is just a special case of the
    very general issue (handled by compilers below the level that
    most programmers have to worry about) of putting numbers of
    defined sizes into registers of defined sizes.

    On a Z80 8-bit computer, I would have represented ASCII
    "cat" as an array <63 61 74> pushed through registers as:

    01100011
    01100001
    01110100

    On a 64-bit processor these days, I would represent ASCII
    "cat" as an array <63 61 74> pushed through registers as:

0000000000000000000000000000000000000000000000000000000001100011
0000000000000000000000000000000000000000000000000000000001100001
0000000000000000000000000000000000000000000000000000000001110100

    It's still ASCII, and it's still handled logically as 8-bit characters,
    although they may get pushed through big registers with lots of zeroes.

    On a Z80 8-bit computer, I would have represented UTF-8 for
    U+4E8C as an array <E4 BA 8C> pushed through registers as:

    11100100
    10111010
    10001100

    And likewise, on the 64-bit processor it would be:

0000000000000000000000000000000000000000000000000000000011100100
0000000000000000000000000000000000000000000000000000000010111010
0000000000000000000000000000000000000000000000000000000010001100

    In either case, it is just UTF-8, conformant to the specification
    in the Unicode Standard, and neither I nor you should care how many
    bits got set to zero in the register when the load register instruction
    was executed by the hardware. The guys who write assembly code and
    microcode on chips may need to care -- the rest of us don't.
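
    In C terms, a throwaway sketch of the same point (the variable
    names are just illustrative):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* UTF-8 for U+4E8C, exactly the byte array <E4 BA 8C> above. */
    const uint8_t utf8[3] = { 0xE4, 0xBA, 0x8C };
    size_t i;

    for (i = 0; i < 3; i++) {
        /* Widening into a 64-bit value just zero-extends it;
           the code unit itself is untouched. */
        uint64_t reg = utf8[i];
        printf("0x%02X -> 0x%016llX\n",
               (unsigned)utf8[i], (unsigned long long)reg);
    }
    return 0;
}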

    --Ken


