Re: ASCII and Unicode lifespan

From: Alexander Kh. (alexkh@writeme.com)
Date: Thu May 19 2005 - 01:50:55 CDT


    Greetings!

    > On 5/18/05, Alexander Kh. <alexkh@writeme.com> wrote:
    > > the fact that local encodings are more well-thought in design.
    >
    > That's absurd, considering that most local encodings in use were the
    > basis for the Unicode encoding of that script--in fact, many
    > complexities of Unicode can be attributed to a need for compatibility
    > with those local encodings--or were designed as a subset of Unicode.

    Huh? Most of them are not even ISO standards, what are you talking about? New
    ones are emerging even today, for example KOI8-C, which unifies Russian,
    Ukrainian, Belarusian, Serbian and Macedonian, plus 3 letters used in Russia
    before the 1920s: YAT', FITA, IZHITSA, which were previously missing. That
    encoding is alive and kicking, with only 8 bits per character, thanks
    to open source. I did not pay anything. With that shift key, even more
    letters would fit in!

    > > Also, consider this idea: how about using a code for "shift" key
    > > which will reduce
    > > in 2 usage of code space.
    >
    > No one cares. Really. If you want something like this, look up SCSU on
    > the Unicode website. But the number of cases where the amount of space
    > wasted is important and standard compression algorithms can't be used
    > is rare. Adding additional complexity for saving a few bytes isn't a
    > good trade off.

    SCSU? Now, that's what I call another level of complexity. Gzipping is
    enough. However, things like indexes in a database cannot be gzipped, and
    they sometimes make up 70% of the database. Unicode itself is not perfectly
    suited for alphabetical sorting, mind you.

    >
    > > Consider this example: suppose I have a bilingual database:
    > > English-Russian for
    > > example. I am not planning to use all the Chinese Hieroglyphs, so
    > > why would I use
    > > 16-bit characters???
    >
    > There is no 8-bit character set that supports both English and
    > Russian; the standard Russian character sets don't support accented
    > English characters. Besides which, it's rare that you have a large
    > stream of "English" data without any Spanish, French or German. I'm
    > sure Serbian, Ukrainian and other odd letters slip into Russian text as
    > names and other ways.

    KOI8-C is not too bad. It would be better if it used a shift key encoded into
    the ASCII part: as I mentioned above, the considerable freed space could be
    used up by those missing characters. And again, for embedded text in a
    different language I mentioned an encoding selector sequence (an escape code).
    Since this would still be a UTF-8 mod, the last resort would be the usual UTF-8
    way of representing Unicode. Hieroglyphs won't benefit from UTF-8's
    compactness anyway.
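    To put rough numbers on the size argument, here is a minimal sketch that
    counts encoded bytes for three short samples. It assumes Python and its
    built-in "koi8_r" codec as a stand-in for an 8-bit Cyrillic encoding
    (KOI8-C itself is not assumed to be available); the sample strings and the
    function name encoded_sizes are made up for illustration.

```python
# Illustrative only: compare storage cost per encoding for short samples.
# "koi8_r" stands in for an 8-bit Cyrillic encoding; KOI8-C is not assumed here.
SAMPLES = {
    "english": "hello",
    "russian": "привет",
    "chinese": "你好",
}

def encoded_sizes(text):
    """Return the encoded length in bytes per codec (None if unencodable)."""
    sizes = {}
    for codec in ("koi8_r", "utf-8", "utf-16-le"):
        try:
            sizes[codec] = len(text.encode(codec))
        except UnicodeEncodeError:
            # The 8-bit encoding simply has no slot for this character.
            sizes[codec] = None
    return sizes

for name, text in SAMPLES.items():
    print(name, encoded_sizes(text))
```

    In this sketch the Russian sample takes 6 bytes in the 8-bit encoding and 12
    in both UTF-8 and UTF-16, while the Chinese sample takes 6 bytes in UTF-8 but
    only 4 in UTF-16 and cannot be encoded in the 8-bit set at all.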

    > Besides which, it's painful to handle a huge collection of encodings
    > and constantly have to do interconversions (which always fail in some
    > way, because two 8-bit encodings never have one to one mappings.)

    Mapping is always a problem. Unicode itself has to be mapped for sorting in
    alphabetical order for some scripts. I guess it would make sense
    to map the letter "A" of all scripts (where a similar letter exists) into one
    place, and then specify which script it is. This would simplify transliteration
    for similar scripts at least. What do you think? I have not thought about it
    much yet myself.
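    As a small illustration of why code point order needs a mapping before it
    gives alphabetical order, here is a sketch in Python. The alphabet string,
    the RANK table and the alphabet_key function are hypothetical helpers written
    for this example, not part of any standard; a real system would more likely
    rely on a collation library.

```python
# Modern Russian alphabet in dictionary order; note where "ё" sits.
RUSSIAN_ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
RANK = {ch: i for i, ch in enumerate(RUSSIAN_ALPHABET)}

def alphabet_key(word):
    """Sort key: position in the alphabet; unknown characters go last."""
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word.lower()]

words = ["яма", "ёж", "дом"]

# Plain sorting compares code points, and "ё" (U+0451) lies after "я" (U+044F).
by_code_point = sorted(words)

# Sorting through the hand-made map restores dictionary order.
by_alphabet = sorted(words, key=alphabet_key)

print(by_code_point)
print(by_alphabet)
```

    Sorting by code point puts "ёж" last, while the mapped sort puts it between
    "дом" and "яма", where a dictionary would have it.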

    > > And also, every script has its own particular properties, for
    > > example, letter ordering,
    > > case sensitivity, numeric systems etc. It will be difficult to
    > > maintain all those
    > > special particularities of every script in a rigid standard
    > > anyway. This will result
    > > in big overhead, requiring huge amounts of programming and
    > > resources to map all those
    > > orderings and other particularities into one standard interface.
    > > The local encodings
    > > are aware of those particularities and are designed for a
    > > particular purpose each.
    >
    > Local encodings aren't aware of anything. Code is "aware" of those
    > particularities, and all the local encodings do is make the code more
    > complex. Unicode lets the code handle those particularities as
    > consistently as possible.

    What I meant, for example, is that KOI-8 was designed for simple
    transliteration: the order of the letters more or less coincided with
    that of the similar Latin letters, so I could read Russian texts even
    on systems where no Russian font was installed. The very design of the
    encoding provided such a convenience without adding much complexity to the
    code. That's just an example of what I meant by "being designed for a
    particular purpose". For sorting purposes, of course, it is better if the
    letters are in alphabetical order. For example, if I were to sort an
    Old-Slavonic text, I would have to make my own character map in order to put
    the Unicode letters in proper order. I don't really see another way to sort
    those letters.
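    The transliteration property can be seen by clearing the high bit of each
    byte, which is roughly what a 7-bit channel or an ASCII-only display does.
    This is a minimal sketch assuming Python's built-in "koi8_r" codec; the
    function name strip_high_bit is made up for the example.

```python
def strip_high_bit(text):
    """Encode text as KOI8-R, clear bit 7 of every byte, and read it as ASCII."""
    raw = text.encode("koi8_r")
    # Masking with 0x7F moves each Cyrillic letter onto a similar Latin letter.
    return bytes(b & 0x7F for b in raw).decode("ascii")

print(strip_high_bit("привет"))
print(strip_high_bit("Москва"))
```

    The result is a readable Latin transliteration with the letter case swapped:
    lowercase Cyrillic comes out as uppercase Latin and vice versa.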

    I am sure Unicode will be popular the way pop music is. Most people
    don't use old scripts. Me, I can't even write a simple text in Old-Slavonic
    because there are letters missing. Same thing with Glagolitic. Maybe
    this kind of ignorance is directed only at Slavic scripts, which are being
    stepped on. I imagine most people will never understand what my problem is.

    Best regards,

    Alexander Kh.

    


    This archive was generated by hypermail 2.1.5 : Thu May 19 2005 - 01:53:11 CDT