RE: Brahmic list ? (was: Oriya: mba / mwa ?)

From: Philippe Verdy (
Date: Sun Nov 30 2003 - 16:55:39 EST

    > As long as you are sure that this will not leak out into the outside
    > world, you are free to use the UTF-8 mechanism internally to represent
    > any type of 31-bit data you like, including this private replacement for
    > allkeys.txt. (You do know about allkeys.txt, don't you? And the fact
    > that UCA is heavily customizable?)

    Yes, I know allkeys.txt, and the fact that UCA is highly customizable.
    It is still far too complex for handling many languages consistently.
    I prefer rules that define a hierarchy tree of languages for sorting
    or collating, so that a single reset of a language root moves all its
    collation keys along with the related characters that are normally
    collated with them, even if those characters are not used in the
    typical orthography of that language.

    Also, UCA still does not order all the characters in the [variable]
    section very precisely: they are mostly sorted by script type and then
    by code point, but many of them could be rearranged to sit next to
    related characters.

    > It would seem to make sense primarily for retaining ASCII compatibility
    > and representing smaller values in fewer bytes than larger values, so
    > you would want to be sure these are your design goals too.

    Unfortunately, this is IMPOSSIBLE! I need code positions between
    successive ASCII positions. All I can do is to preserve 1 byte for
    the ASCII character in the encoding scheme for the code position, but
    other bytes will be prepended and appended.

    Due to this constraint, any ASCII character will really be represented
    by at least 3 bytes. This is not intended for interchange of text, only
    for internal representation during processing, for lookup tables, or
    for extracting some binary-coded character properties (I have more
    properties than those listed in Unicode, simply because I have added
    the properties needed for UCA and tailored collation).

    > But things like this do have a tendency to leak into the outside world,
    > and if this ever happens with your collation keys, you will have
    > unleashed something like CESU-8 that fails the "duck test": it walks and
    > talks like UTF-8, but it's not.

    Rest assured that this won't leak out: this internal encoding is used
    strictly as an intermediate step in internal processing. It is not
    efficient enough to become a true encoding, because it uses one
    code per function instead of packing several functions into bitfields.

    As I have not yet determined the correct size of these bitfields, I need
    an intermediate solution that packs them a little, and the UTF-8 TES (not
    the UTF-8 CES used by Unicode) is convenient for now, until I change it
    to a better encoding, which may or may not leak out (I am not sure that I
    need to make the encoding accessible from an interface, except for
    debugging).
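
    To illustrate what using the UTF-8 mechanism for arbitrary 31-bit
    values can look like, here is a minimal Python sketch of the original
    UTF-8 bit layout (1 to 6 bytes). It is not the internal encoding
    described in this message, whose layout is not given here. The names
    encode_utf8_31bit and decode_utf8_31bit are hypothetical, and the
    decoder does not reject overlong forms.

```python
# Lead-byte forms of the original 31-bit UTF-8 layout:
# (mask selecting the marker bits, marker value, number of continuation bytes)
_FORMS = (
    (0x80, 0x00, 0),  # 0xxxxxxx
    (0xE0, 0xC0, 1),  # 110xxxxx
    (0xF0, 0xE0, 2),  # 1110xxxx
    (0xF8, 0xF0, 3),  # 11110xxx
    (0xFC, 0xF8, 4),  # 111110xx
    (0xFE, 0xFC, 5),  # 1111110x
)


def encode_utf8_31bit(value: int) -> bytes:
    """Encode 0 <= value < 2**31 as 1 to 6 bytes."""
    if not 0 <= value < (1 << 31):
        raise ValueError("value out of 31-bit range")
    for _mask, marker, ncont in _FORMS:
        # A form with ncont continuation bytes carries 7 bits (ncont == 0)
        # or (6 - ncont) + 6 * ncont payload bits otherwise.
        bits = 7 if ncont == 0 else (6 - ncont) + 6 * ncont
        if value < (1 << bits):
            break
    out = [marker | (value >> (6 * ncont))]
    for i in range(ncont - 1, -1, -1):
        out.append(0x80 | ((value >> (6 * i)) & 0x3F))
    return bytes(out)


def decode_utf8_31bit(data: bytes) -> int:
    """Decode exactly one value produced by encode_utf8_31bit."""
    if not data:
        raise ValueError("empty input")
    first = data[0]
    for mask, marker, ncont in _FORMS:
        if first & mask == marker:
            break
    else:
        raise ValueError("invalid lead byte")
    if len(data) != ncont + 1:
        raise ValueError("wrong number of continuation bytes")
    value = first & ~mask & 0xFF  # payload bits of the lead byte
    for b in data[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (b & 0x3F)
    return value
```

    With this layout, values below 0x80 stay one byte and larger values
    take more bytes, and comparing the encoded byte strings gives the same
    order as comparing the numeric values.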

    After all, the intermediate tables computed by the ICU builder are
    completely internal, and their format is not guaranteed to be supported
    elsewhere: these tables use their own encoding and conventions, and are
    strictly bound to the internal implementation of the ICU runtime. It is
    the same for me.

