Re: VISCII (was: Re: [BULK] - Re: MCW encoding of Hebrew)

From: Kenneth Whistler
Date: Tue May 25 2004 - 15:13:21 CDT

  • Next message: Mark Davis: "Re: Response to Everson Ph and why Jun 7? fervor"

    John Cowan asked:

    > Doug Ewell scripsit:
    > > > So is [VIQR] a 7-bit encoding, or a scheme layered on top of ASCII?
    > >
    > > It's a scheme layered on top of ASCII
    > > > And what is KOI-7?
    > >
    > > A true 7-bit encoding for Russian, in which Cyrillic letters (small and
    > > capital respectively) were encoded in the ranges where ASCII has Latin
    > > letters (capital and small respectively).
    > Ah. And on what principle do you distinguish them?

    VIQR uses (for example) a sequence of two ASCII characters 'd' + 'd'
    to represent, conventionally, the Vietnamese barred-d, i.e.,
    U+0111 LATIN SMALL LETTER D WITH STROKE. However, that is the
    convention for the use of a sequence of two ASCII characters --
    not a direct encoding of the character.
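    To make the distinction concrete, here is a minimal sketch of the
    kind of conventional mapping described above. The table covers only
    a handful of illustrative sequences (the full VIQR convention has
    many more); the greedy longest-match strategy is an assumption for
    illustration, not a claim about any particular VIQR implementation.

```python
# Illustrative sketch: a few conventional VIQR ASCII sequences and the
# Unicode characters they are read as. NOT a complete VIQR decoder.
VIQR_SEQUENCES = {
    "e^'": "\u1EBF",  # LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE
    "dd":  "\u0111",  # LATIN SMALL LETTER D WITH STROKE
    "DD":  "\u0110",  # LATIN CAPITAL LETTER D WITH STROKE
    "a^":  "\u00E2",  # LATIN SMALL LETTER A WITH CIRCUMFLEX
    "o+":  "\u01A1",  # LATIN SMALL LETTER O WITH HORN
}

def viqr_to_unicode(text: str) -> str:
    """Greedily replace known VIQR sequences, longest match first."""
    keys = sorted(VIQR_SEQUENCES, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(VIQR_SEQUENCES[k])
                i += len(k)
                break
        else:
            # No convention applies; the byte is just its ASCII self.
            out.append(text[i])
            i += 1
    return "".join(out)
```

    Note that the input is a perfectly valid ASCII string throughout;
    the mapping is a reading convention applied on top of it.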

    It is correct (and appropriate) to display VIQR with an ASCII
    font, in conformance with the ASCII standard. People then learn
    to interpret the various sequences of letters or letters plus
    ASCII punctuation and symbols as representing "real" Vietnamese.
    KOI-7, on the other hand, is an encoded character set. The
    *definition* of the code points is as representing the
    Cyrillic letters. 0x40 encodes CYRILLIC SMALL LETTER YU. It
    is not AT SIGN masquerading as YU. It is correct (and
    appropriate) to display KOI-7 with a KOI-7 font, in
    conformance with the KOI-7 standard; it is *not* correct to
    display it with an ASCII font.
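    A sketch of a KOI-7 decoder, assuming the letter layout the post
    describes, which mirrors the upper half of KOI8-R (small Cyrillic
    letters starting at 0x40, capitals at 0x60): setting the high bit
    and decoding as KOI8-R recovers the intended letters. The exact
    treatment of 0x5B-0x5F and 0x7B-0x7E varied between KOI-7 variants,
    so this is illustrative only.

```python
def koi7_to_unicode(data: bytes) -> str:
    """Decode KOI-7 letter bytes via the corresponding KOI8-R bytes.

    Assumption for illustration: bytes 0x40-0x7E are the Cyrillic
    letter range, aligned with KOI8-R's 0xC0-0xFE.
    """
    out = []
    for b in data:
        if 0x40 <= b <= 0x7E:
            # The *definition* of the code point is a Cyrillic letter:
            # 0x40 is CYRILLIC SMALL LETTER YU, not AT SIGN.
            out.append(bytes([b | 0x80]).decode("koi8-r"))
        else:
            # Controls, digits, and low punctuation coincide with ASCII.
            out.append(chr(b))
    return "".join(out)
```

    Decoding b"\x40" yields U+044E CYRILLIC SMALL LETTER YU, matching
    the point above: the code point *encodes* YU directly.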

    The fact that KOI-7 was designed the way it was to make it
    feasible to do Cyrillic on devices that could only handle
    ASCII data is beside the point -- it was simply a clever
    way to get around the then 7-bit limitations of devices.

    > The IETF clearly
    > treats them both as charsets, within its definitions.

    The IETF definition of "charset" is underdetermined for
    distinguishing these kinds of cases. Any specification that
    allows you to map unambiguously from a sequence of bytes
    to a sequence of abstract characters is, potentially, considered
    a "charset" in the IETF sense, right?

    As such, it cannot readily distinguish between true coded
    character sets and conventional orthographies built on
    top of ASCII, for example.


    This archive was generated by hypermail 2.1.5 : Tue May 25 2004 - 15:14:16 CDT