Re: Greek chars encoded twice -- why?

From: vanisaac@boil.afraid.org
Date: Fri Feb 19 2010 - 05:19:32 CST

    From: spir (denis.spir@free.fr)
    Date: Fri Feb 19 2010 - 04:27:43 CST

    > On Thu, 18 Feb 2010 19:14:58 +0200
    > Apostolos Syropoulos <ijdt.editor@gmail.com> wrote:
    >
    >> > Yes, it is absolutely necessary. Converting from a legacy encoding to
    >> > Unicode and back should be a lossless operation. How else would interchange
    >> > between legacy systems, and Unicode systems work?
    >>
    >> That's a problem that should concern those who still use legacy systems. In
    >> addition, today to the best of my knowledge no one
    >> is using 8bit Greek encodings. Finally, just because there are some people
    >> using legacy systems, should we continue
    >> supporting something that is wrong?
    >
    > I start to have the impression that, supposedly, compatibility with legacy
    > character sets was (and still is) the source of various Unicode design flaws
    > (*). Typically, they seem to add unneeded complication to a basically
    > complicated problem. Maybe it's only me.

    No, that's about as accurate a characterization as possible. Aside from plain screw-ups (Myanmar, anyone?), pretty much all of the counter-intuitive things are inherited.

    > Where can one find rationales for design decisions? I would like to change my
    > mind for sensible reasons.

    The sensible reason is that, for early adoption, Unicode needed to be a superset of existing character encodings. Some of those encodings were just cobbled together, working around unsuitable rendering technology with insane encoding models. That's it. There's no great secret fount of elucidation. Any character encoding in existence as of (1993?) passed its illogic on to Unicode.
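
    To make the superset point concrete, here is a minimal sketch of the round trip, assuming Python 3 and its built-in "latin-1" and "iso8859_7" codecs. The byte values are the micro sign in Latin-1 and the small letter mu in ISO 8859-7; the variable names are only for illustration.

```python
# Two legacy encodings each carry a "mu"-like character at a different byte.
latin1_micro = b"\xb5"   # MICRO SIGN in ISO 8859-1 (Latin-1)
greek_mu = b"\xec"       # GREEK SMALL LETTER MU in ISO 8859-7

# Decoding maps each byte to its own Unicode code point.
micro_cp = latin1_micro.decode("latin-1")    # U+00B5
mu_cp = greek_mu.decode("iso8859_7")         # U+03BC

# Encoding back recovers the original bytes, so nothing is lost either way.
micro_back = micro_cp.encode("latin-1")
mu_back = mu_cp.encode("iso8859_7")

print(f"U+{ord(micro_cp):04X} U+{ord(mu_cp):04X}")
```

    If both legacy characters had been merged into a single code point, one of the two round trips could not return the byte it started from without extra context.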

    > Denis
    >
    > (*) Including the #1 flaw imo: precomposed characters -- but again maybe it's
    > only me. (As legacy formatted texts need to be "transcoded" anyway, mapping
    > to a couple of codes in some cases is no big deal, is it?

    Not now. At one time, the technical limitations were such that supporting the full complement of existing character encodings on a one-to-one basis was deemed necessary. Precomposed characters are currently discouraged from use, and I believe NFKD is the preferred normalization. Personally, I would put precomposed characters, the C0 and C1 controls, and the Latin-1 micro sign at the top of my list of what you characterize as "flaws".
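
    As a rough illustration of how normalization treats these inherited characters, here is a short sketch using Python's standard unicodedata module (Python 3 assumed; the variable names are mine). It shows a precomposed letter splitting into base plus combining mark, and the micro sign folding to Greek mu only under the compatibility forms.

```python
import unicodedata

e_acute = "\u00e9"   # precomposed LATIN SMALL LETTER E WITH ACUTE
micro = "\u00b5"     # MICRO SIGN, inherited from Latin-1
mu = "\u03bc"        # GREEK SMALL LETTER MU

# Canonical decomposition splits the precomposed letter into "e" + U+0301.
decomposed = unicodedata.normalize("NFD", e_acute)
# Canonical composition puts it back together.
recomposed = unicodedata.normalize("NFC", decomposed)

# The micro sign has only a compatibility mapping, so NFD leaves it alone
# while NFKD replaces it with Greek mu.
nfd_micro = unicodedata.normalize("NFD", micro)
nfkd_micro = unicodedata.normalize("NFKD", micro)

print([f"U+{ord(c):04X}" for c in decomposed])
print(f"U+{ord(nfkd_micro):04X}")
```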

    > Also, this has to
    > be done only once... And on the software side, unicode-aware apps *must* be
    > able to cope with decomposed characters.)
    > Ditto for eg duplicate codes, and for allowing "unordered" combining marks.
    > (But these issues are not as problematic as precomposed characters.)

    Fortunately, they are not really all that problematic. The normalization forms (NFx) and canonical combining class (ccc) reordering are implemented in the standard libraries of many languages, so you really just need to make sure an implementation uses them. They are part of the conformance requirements, so none of this is even remotely secret.
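
    For the "unordered" combining marks, a minimal sketch with Python's unicodedata module (one of the standard libraries in question; Python 3 assumed, sample strings chosen by me) shows the ccc values and the canonical reordering at work:

```python
import unicodedata

acute = "\u0301"       # COMBINING ACUTE ACCENT, sits above the base
dot_below = "\u0323"   # COMBINING DOT BELOW

# Canonical combining classes decide the canonical order of the marks.
ccc_acute = unicodedata.combining(acute)          # 230
ccc_dot_below = unicodedata.combining(dot_below)  # 220

# The same visible text, typed with the marks in two different orders.
typed_one = "a" + acute + dot_below
typed_two = "a" + dot_below + acute

# Normalization sorts marks by ccc, so both spellings become identical.
norm_one = unicodedata.normalize("NFD", typed_one)
norm_two = unicodedata.normalize("NFD", typed_two)

print(typed_one == typed_two, norm_one == norm_two)
```

    Comparing or searching the normalized strings rather than the raw input is what makes the two spellings interchangeable in practice.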

    > ________________________________
    >
    > la vita e estrany
    >
    > http://spir.wikidot.com/

    Van


