RE: Internal Representation of Unicode

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Fri Sep 26 2003 - 06:42:50 EDT


    myrkraverk@users.sourceforge.net wrote:
    > In a plain text environment, there is often a need to encode more than
    > just the plain character. A console, or terminal emulator, is such an
    > environment. Therefore I propose the following as a technical report
    > for internal encoding of unicode characters; with one goal in mind:
    > character equalence is binary equalence.

    I guess you meant "equivalence".

    Q1: But what are "character equivalence" and "binary equivalence", and
    why did you choose them as your goals?

    > I thought of dividing the 64 bit code space into 32 variably wide
    > plains,

    Q2: What are these "plains" for? Why are there 32 of them?

    > one for control characters, one for latin characters, one for
    > han characters,

    Q3: Why do you want to treat Latin characters and Han characters
    differently?

    There is nothing special about Latin or Han characters in Unicode: Latin and
    Han are just two of the roughly 50 scripts currently supported by Unicode (see
    http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and
    http://www.unicode.org/Public/UNIDATA/Scripts.txt)

    Q4: And how do you plan to distinguish them?

    Both Latin and Han characters are scattered all over the Unicode space, so
    you need to check many ranges to determine which character belongs to which
    category.
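
    Just to give an idea, here is a rough Python sketch with a deliberately
    *partial* list of ranges taken from Scripts.txt (the real file has many more
    ranges for both scripts, which is exactly the problem):

        # Partial, illustrative ranges only -- see Scripts.txt for the full list.
        LATIN_RANGES = [
            (0x0041, 0x005A), (0x0061, 0x007A),   # ASCII letters
            (0x00C0, 0x00D6), (0x00D8, 0x00F6),   # Latin-1 letters
            (0x0100, 0x017F),                     # Latin Extended-A
            (0x1E00, 0x1EFF),                     # Latin Extended Additional
            (0xFF21, 0xFF3A), (0xFF41, 0xFF5A),   # fullwidth Latin
        ]
        HAN_RANGES = [
            (0x3400, 0x4DBF),                     # CJK Extension A
            (0x4E00, 0x9FFF),                     # CJK Unified Ideographs
            (0xF900, 0xFAFF),                     # CJK Compatibility Ideographs
            (0x20000, 0x2A6DF),                   # CJK Extension B
        ]

        def in_ranges(cp, ranges):
            return any(lo <= cp <= hi for lo, hi in ranges)

        print(in_ranges(0x00E9, LATIN_RANGES))    # True: U+00E9, e with acute
        print(in_ranges(0x4E2D, HAN_RANGES))      # True: U+4E2D, a Han ideograph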

    Q5: And what about all characters which are neither Latin nor Han?

    > and so on; using 5 bits and the next 3 fixed to zero
    > (for future expansion and alignment to an octet).
    > I call plain 0 control characters and won't discuss it further.

    Q6: Why do control characters get special handling?

    Q7: Don't control characters have properties attached like any other
    characters?

    One example of a property which could be useful to attach to control
    characters is directionality. E.g., a TAB is always a TAB but, after it has
    passed through the Bidirectional Algorithm, its directionality can be
    resolved to either LTR or RTL.

    > Plain 1, I had intended for latin characters with the following
    > encoding method in mind:
    >
    > bits 63..59 58..56 55..40 39..32 31..24 23..16 15..8 7..0
    > +-------+------+------+------+------+------+------+------+
    > | plain | zero | attr | res | uacc | lacc | res | char |
    > +-------+------+------+------+------+------+------+------+
    >
    > * Plain Plain (5 bits)
    > * Zero Zero bits (3 bits)
    > * Attr Attributes (16 bits)

    Q8: What kind of information are these three fields for?

    Q9: In case your answer to Q8 is "they are application-defined", then
    what is the rationale for defining and naming more than one field? I mean:
    if they are application-defined, why not leave the task of defining
    sub-fields to the application?

    > * Res Reserved (8 bits)
    > * Uacc Upper Accent (8 bits)
    > * Lacc Lower Accent (8 bits)

    Q10: Why do you treat "accents" specially?

    They are just characters like any others. In Unicode there is no special
    limitation as to how many "accents" can be applied to a base character.
    There is also no obligation for accents to have a base character.

    > * Res Reserved (8 bits)
    > * Char Character (8 bits)

    Q11: How can you store a Latin character in 8 bits?

    Unicode has 938 Latin characters, and their codes range from U+0041 to
    U+FF5A.

    > All of these fields are actually implementation defined, with just one
    > rule for char: don't include characters that can be made with
    > combinations, that's what the accent fields are for.

    But characters do not necessarily decompose into one "Latin character" with
    one "upper accent" and one "lower accent". E.g., U+01D5 (LATIN CAPITAL
    LETTER U WITH DIAERESIS AND MACRON) decomposes to U+0055 U+0308 U+0304
    (LATIN CAPITAL LETTER U, COMBINING DIAERESIS, COMBINING MACRON). Both
    COMBINING DIAERESIS and COMBINING MACRON are "upper accents".
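
    A quick way to check this (a Python sketch; combining class 230 means the
    mark is rendered above its base letter):

        import unicodedata

        for ch in unicodedata.normalize('NFD', '\u01D5'):
            print('U+%04X %3d %s' % (ord(ch), unicodedata.combining(ch),
                                     unicodedata.name(ch)))
        # U+0055   0 LATIN CAPITAL LETTER U
        # U+0308 230 COMBINING DIAERESIS
        # U+0304 230 COMBINING MACRON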

    Q12: How are you going to deal with a combination of, e.g., a base letter
    + 5 "upper accents" + 3 "lower accents"?

    > This allows for 255 upper and lower accents which should be enough -- for
    > now.

    I counted 129 "upper accents". But their codes range from U+0300 to U+1D1AD.

    Q13: How are you going to compress these codes into 8 bits? Are you
    planning to use a conversion table from the Unicode code to your internal
    8-bit code?

    > For Han characters I thought of the following encoding method (with no
    > particular plain in mind):
    >
    > bits 63..59 58..56 55..40 39..32 31 .. 0
    > +-------+------+------+-------+--------------------------+
    > | plain | zero | attr | style | char |
    > +-------+------+------+-------+--------------------------+
    >
    > * Plain Plain (5 bits)
    > * Zero Zero bits (3 bits)
    > * Attr Attributes (16 bits)
    > * Style Stylistic Variation (8 bits)

    Q14: What kind of information is in field "Style"?

    Q15: Why do only Han characters have this?

    Letters in many other scripts may have stylistic variations. E.g., "é" is
    one and the same character (or combination of characters), but its
    typographical shape is different in Italian and Polish.

    > * Char Character (32 bits)

    Q16: Why 32 bits?

    *All* Unicode code points range from U+0000 to U+10FFFF, so any of them can
    fit in 21 bits (or 24, if you want to stick to 8-bit boundaries).

    > Again, all fields are implementation defined. Telling something like
    > a terminal emulator what stylistic variation to use is outside the
    > scope of this email, but for attributes, there are standardized escape
    > sequences; but I suspect language tags can be used.

    Q17: Why are you mentioning language tags? What do they have to do with
    escape sequences?

    > I was also thinking of a plain for punctuation and symbolic
    > characters.

    Q18: And what about all other characters, e.g., Arabic letters?

    > I will be pleased if anyone can come up with better encoding methods
    > than I did, and I call upon other people to come up with encodings for
    > scripts I know nothing about, such as arabic and others. Then let's
    > wrap it up in a technical report and be done with it ;)
    >
    > Any comments?

    See my 18 questions above.

    Some more general comments now. I understand that it can be useful in many
    circumstances to internally store character codes together with some kind of
    properties. But I fail to understand most points of the architecture you are
    proposing. Particularly:

    A) I don't see why you want to treat characters of different scripts in
    different ways. The purpose of Unicode is exactly to encode any character
    from any kind of script in a uniform way. Moreover, determining the script
    to which a character belongs is a relatively complex and time-consuming
    operation.

    B) I don't see why you make all those assumptions about the structure of
    the properties attached to characters. If these properties are to be
    application-defined, let them be application-defined... I don't see a reason
    for defining all those "Plain", "Zero", "Attr", "Res", "Style" fields: just
    put all the available bits together and call them "Properties": it will be
    the task of the application programmer to decide how to use these bits.

    C) I don't see why you want to store a letter and its "accents" as a single
    unit. Besides the fact that this is an impossible task, because a letter can
    have an arbitrary number of "accents", I fail to see any need for it. Also
    consider that, by doing this, a letter and its accent(s) cannot have
    *different* properties, which can be useful in a number of cases. One example
    which comes to mind: in the most widespread Italian dictionary, the entries
    are typed in "bold" type, but the accents on the letters are in "bold" type
    only if they are mandatory in the orthography and in "normal" type if they
    are optional, so you can have a "bold" letter with a "normal" accent.
    Another, better-known example is that of Arabic religious texts, where the
    letters are normally black and the "accents" (representing vowels and other
    phonetic data) are red.

    If your assumption is that each character plus its attributes will take 64
    bits, the logical partition of these bits would be:

            Application-defined Properties: 43 bits
            Character code: 21 bits

    If, for any reason, you want to stick to 8-bit boundaries, you can use this
    alternative partition:

            Application-defined Properties: 40 bits
            Character code: 24 bits
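
    For illustration, a minimal Python sketch of the 43 + 21 variant (the field
    names and helpers here are mine, not part of any standard):

        CODE_BITS = 21
        CODE_MASK = (1 << CODE_BITS) - 1      # enough for U+0000..U+10FFFF

        def pack_cell(props, code):
            # 43 bits of application-defined properties + 21-bit code point
            assert 0 <= code <= 0x10FFFF
            assert 0 <= props < (1 << 43)
            return (props << CODE_BITS) | code

        def unpack_cell(cell):
            return cell >> CODE_BITS, cell & CODE_MASK

        cell = pack_cell(props=0b101, code=0x1D1AD)
        assert unpack_cell(cell) == (0b101, 0x1D1AD)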

    In either case, 40 or 43 bits is a huge space for the properties of a single
    character, and there are plenty of possible uses for that space. In the
    unlikely case that the needed properties do not fit in 40-43 bits, the field
    can be used to store an index into an external array of properties. However
    long and complex a text may be, I doubt that 1 or 8 *trillion* different
    character properties will not suffice!

    If an application needs "accents" to have the same properties as their base
    character, the application can define a special property value which means
    "this characters inherits the properties of previous character/the character
    on its left/the character on its right/the character at position N/etc.".
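
    One possible spelling of this idea (again only a sketch, with a made-up
    sentinel value, reusing the 43 + 21 layout from above):

        CODE_BITS = 21
        CODE_MASK = (1 << CODE_BITS) - 1
        INHERIT = (1 << 43) - 1       # hypothetical all-ones "inherit" value

        def resolve_properties(cells):
            # Copy properties forward whenever a cell carries the INHERIT value.
            current = 0
            for cell in cells:
                props, code = cell >> CODE_BITS, cell & CODE_MASK
                if props != INHERIT:
                    current = props
                yield current, code

        base   = (1 << CODE_BITS) | 0x0055          # letter U, property 1 ("bold")
        accent = (INHERIT << CODE_BITS) | 0x0308    # diaeresis inherits "bold"
        for props, code in resolve_properties([base, accent]):
            print(props, 'U+%04X' % code)
        # 1 U+0055
        # 1 U+0308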

    But if you really want to stick to your (sorry, insane) idea of storing a
    letter and its accent in the same unit, I would suggest at least to:
            1) limit this to the *first* accent only, without distinguishing
    between "upper" and "lower" accents (any subsequent "accent" will take its
    own 64-bit entry);
            2) encode the "accent" with its regular Unicode code point, rather
    than with an ad-hoc 8-bit code.

    This would result in this partition:

            Application-defined Properties: 22 bits
            Base character code: 21 bits
            Accent character code: 21 bits

    Or, if you need to stick to 8-bit boundaries:

            Application-defined Properties: 16 bits
            Base character code: 24 bits
            Accent character code: 24 bits

    A field of 16 or 22 bits is still a fair amount of space, especially if the
    application uses it as an index to an external table.
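
    A quick sketch of that 22 + 21 + 21 layout, with helper names of my own
    choosing:

        CP_BITS = 21
        CP_MASK = (1 << CP_BITS) - 1

        def pack_pair(props, base, accent=0):
            # accent == 0 can mean "no accent": U+0000 is never a combining mark
            return (props << (2 * CP_BITS)) | (base << CP_BITS) | accent

        def unpack_pair(cell):
            return (cell >> (2 * CP_BITS),
                    (cell >> CP_BITS) & CP_MASK,
                    cell & CP_MASK)

        cell = pack_pair(props=1, base=0x0055, accent=0x0308)  # U + diaeresis
        assert unpack_pair(cell) == (1, 0x0055, 0x0308)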

    And now comes my last and most fundamental question:

    Q0: why do you want to propose all this as a Unicode "technical report"?

    Internal data structures and algorithms are, by definition, "internal", so I
    see no need to standardize them, or even to publish them.

    _ Marco


