RE: Internal Representation of Unicode

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Fri Sep 26 2003 - 06:42:50 EDT

Next message: Peter Kirk: "Re: Unicode Normalisaton Optimisation Experiments"

Previous message: jon@spin.ie: "RE: AddDefaultCharset considered harmful (was: Mojibake on my Web pages)"
Maybe in reply to: myrkraverk@users.sourceforge.net: "Internal Representation of Unicode"
Next in thread: Rick McGowan: "Re: Internal Representation of Unicode"

myrkraverk@users.sourceforge.net wrote:
> In a plain text environment, there is often a need to encode more than
> just the plain character. A console, or terminal emulator, is such an
> environment. Therefore I propose the following as a technical report
> for internal encoding of unicode characters; with one goal in mind:
> character equalence is binary equalence.

I guess you meant "equivalence".

Q1: But what are "character equivalence" and "binary equivalence", and
why did you choose them as your goals?

> I thought of dividing the 64 bit code space into 32 variably wide
> plains,

Q2: What are these "plains" for? Why are there 32 of them?

> one for control characters, one for latin characters, one for
> han characters,

Q3: Why do you want to treat Latin characters and Han characters
differently?

There is nothing special about Latin or Han characters in Unicode: they are
just two of the roughly 50 scripts currently supported in Unicode. (See
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and
http://www.unicode.org/Public/UNIDATA/Scripts.txt)

Q4: And how do you plan to distinguish them?

Both Latin and Han characters are scattered all over the Unicode space, so
you need to check many ranges to determine which character belongs to which
category.
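
To illustrate the point about ranges, here is a minimal sketch in Python. The
table LATIN_RANGES and the function is_latin are hypothetical names, and the
table is a hand-picked, deliberately incomplete subset; a real implementation
would presumably generate it from Scripts.txt.

```python
from bisect import bisect_right

# Hypothetical, incomplete table of Latin letter ranges (inclusive).
# Assumption: hand-picked for illustration, not derived from Scripts.txt.
LATIN_RANGES = [
    (0x0041, 0x005A),  # Basic Latin capital letters
    (0x0061, 0x007A),  # Basic Latin small letters
    (0x00C0, 0x00D6),  # Latin-1 letters up to MULTIPLICATION SIGN
    (0x00D8, 0x00F6),  # Latin-1 letters up to DIVISION SIGN
    (0x00F8, 0x024F),  # rest of Latin-1, Latin Extended-A and -B
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0xFF21, 0xFF3A),  # fullwidth capital letters
    (0xFF41, 0xFF5A),  # fullwidth small letters
]
_STARTS = [lo for lo, _ in LATIN_RANGES]


def is_latin(cp: int) -> bool:
    """Binary search over the sorted, non-overlapping ranges above."""
    i = bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= LATIN_RANGES[i][1]


print(is_latin(0x0041), is_latin(0xFF5A), is_latin(0x4E00))
```

Even this partial table needs eight separate ranges, and it has to step around
U+00D7 and U+00F7, which sit in the middle of the Latin-1 letters.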

Q5: And what about all the characters that are neither Latin nor Han?

> and so on; using 5 bits and the next 3 fixed to zero
> (for future expansion and alignment to an octet).
> I call plain 0 control characters and won't discuss it further.

Q6: Why do control characters get special handling?

Q7: Don't control characters have properties attached like any other
characters?

One example of a property that could usefully be attached to a control
character is directionality. E.g., a TAB is always a TAB but, after it has
passed through the Bidirectional Algorithm, its directionality can be
resolved to either LTR or RTL.

> Plain 1, I had intended for latin characters with the following
> encoding method in mind:
>
> bits   63..59 58..56 55..40 39..32 31..24 23..16  15..8  7..0
>      +-------+------+------+------+------+------+------+------+
>      | plain | zero | attr | res  | uacc | lacc | res  | char |
>      +-------+------+------+------+------+------+------+------+
>
> * Plain Plain (5 bits)
> * Zero Zero bits (3 bits)
> * Attr Attributes (16 bits)

Q8: What kind of information are these three fields for?

Q9: In case your answer to Q8 is "they are application-defined", then
what is the rationale for defining and naming more than one field? I mean:
if they are application-defined, why not leave the task of defining
sub-fields to the application?

> * Res Reserved (8 bits)
> * Uacc Upper Accent (8 bits)
> * Lacc Lower Accent (8 bits)

Q10: Why do you treat "accents" specially?

They are just characters like any others. In Unicode there is no special
limit on how many "accents" can be applied to a base character.
There is also no obligation for accents to have a base character.

> * Res Reserved (8 bits)
> * Char Character (8 bits)

Q11: How can you store a Latin character in 8 bits?

Unicode has 938 Latin characters, and their codes range from U+0041 to
U+FF5A.

> All of these fields are actually implementation defined, with just one
> rule for char: don't include characters that can be made with
> combinations, that's what the accent fields are for.

But characters do not necessarily decompose into one "Latin character" with
one "upper accent" and one "lower accent". E.g., U+01D5 (LATIN CAPITAL
LETTER U WITH DIAERESIS AND MACRON) decomposes to U+0055 U+0308 U+0304
(LATIN CAPITAL LETTER U, COMBINING DIAERESIS, COMBINING MACRON). Both
COMBINING DIAERESIS and COMBINING MACRON are "upper accents".
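
The decomposition can be checked with a short Python sketch using the standard
unicodedata module. It treats canonical combining class 230 ("above") as a
stand-in for "upper accent", which is an assumption about what the proposal
means by that term.

```python
import unicodedata

composed = "\u01D5"  # LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON
decomposed = unicodedata.normalize("NFD", composed)

# Print each code point with its canonical combining class (ccc).
# ccc 0 marks a base character; ccc 230 marks a mark placed above the base.
for ch in decomposed:
    ccc = unicodedata.combining(ch)
    print(f"U+{ord(ch):04X}  ccc={ccc:3d}  {unicodedata.name(ch)}")

# Count the marks that would compete for a single "upper accent" field.
upper = [ch for ch in decomposed if unicodedata.combining(ch) == 230]
print(len(upper), "marks above one base letter")
```

One base letter carries two marks of the same class here, so a single "uacc"
field cannot hold both.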

Q12: How are you going to deal with a combination of, e.g., a base letter
+ 5 "upper accents" + 3 "lower accents"?

> This allows for 255 upper and lower accents which should be enough -- for
> now.

I counted 129 "upper accents". But their codes range from U+0300 to U+1D1AD.

Q13: How are you going to compress these codes into 8 bits? Are you
planning to use a conversion table from the Unicode code to your internal
8-bit code?

> For Han characters I thought of the following encoding method (with no
> particular plain in mind):
>
> bits   63..59 58..56 55..40  39..32 31          ..           0
>      +-------+------+------+-------+--------------------------+
>      | plain | zero | attr | style |           char           |
>      +-------+------+------+-------+--------------------------+
>
> * Plain Plain (5 bits)
> * Zero Zero bits (3 bits)
> * Attr Attributes (16 bits)
> * Style Stylistic Variation (8 bits)

Q14: What kind of information is in field "Style"?

Q15: Why do only Han characters have this?

Letters in many other scripts may have stylistic variations. E.g., "é" is
one and the same character (or combination of characters), but its
typographical shape is different in Italian and Polish.

> * Char Character (32 bits)

Q16: Why 32 bits?

*All* Unicode code points lie in the range U+0000 to U+10FFFF, so any of them
fits in 21 bits (or 24, if you want to stick to 8-bit boundaries).
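
The arithmetic is easy to verify; a small Python check (the variable names are
only illustrative):

```python
import sys

MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point

# Number of bits needed to represent the largest code point.
bits_needed = MAX_CODE_POINT.bit_length()

# The same width rounded up to a whole number of octets.
octet_aligned_bits = ((bits_needed + 7) // 8) * 8

print(bits_needed, octet_aligned_bits)

# Python's own string type uses the same upper limit.
print(sys.maxunicode == MAX_CODE_POINT)
```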

> Again, all fields are implementation defined. Telling something like
> a terminal emulator what stylistic variation to use is outside the
> scope of this email, but for attributes, there are standardized escape
> sequences; but I suspect language tags can be used.

Q17: Why are you mentioning language tags? What do they have to do with
escape sequences?

> I was also thinking of a plain for punctuation and symbolic
> characters.

Q18: And what about all other characters, e.g., Arabic letters?

> I will be pleased if anyone can come up with better encoding methods
> than I did, and I call upon other people to come up with encodings for
> scripts I know nothing about, such as arabic and others. Then let's
> wrap it up in a technical report and be done with it ;)
>
> Any comments?

See my 18 questions above.

Some more general comments now. I understand that it can be useful in many
circumstances to internally store character codes together with some kind of
properties. But I fail to understand most points of the architecture you are
proposing. Particularly:

A) I don't see why you want to treat characters of different scripts in
different ways. The purpose of Unicode is exactly to encode any character
from any kind of script in a uniform way. Moreover, determining the script
to which a character belongs is a relatively complex and time-consuming
operation.

B) I don't see why you make all those assumptions about the structure of
the properties attached to characters. If these properties are to be
application-defined, let them be application-defined. I don't see a reason
for defining all those "Plain", "Zero", "Attr", "Res", "Style" fields: just
put all the available bits together and call them "Properties"; it will be
the task of the application programmer to decide how to use these bits.

C) I don't see why you want to store a letter and its "accents" as a single
unit. Besides the fact that this is an impossible task, because a letter can
have an arbitrary number of "accents", I fail to see any need for it. Also
consider that, this way, a letter and its accent(s) cannot have
*different* properties, which can be useful in a number of cases. One
example that comes to mind: the entries in the most widespread Italian
dictionary are set in "bold" type, but the accents on the letters are in
"bold" type if they are mandatory in the orthography and in "normal" type if
they are optional, so you can have a "bold" letter with a "normal" accent.
Another, better-known example is that of Arabic religious texts, where the
letters are normally black and the "accents" (representing vowels and other
phonetic data) are red.

If your assumption is that each character plus its attributes will take 64
bits, the logical partition of these bits would be:

Application-defined Properties: 43 bits
Character code: 21 bits

If, for any reason, you want to stick to 8-bit boundaries, you can use this
alternative partition:

Application-defined Properties: 40 bits
Character code: 24 bits

In either case, 40 or 43 bits is a huge space for the properties of a single
character, and there are plenty of possible uses for that space. In the
unlikely case that the needed properties do not fit in 40-43 bits, the
field can be used to store an index into an external array of properties.
However long and complex a text may be, I doubt that 1 or 8 *trillion*
different sets of character properties will not suffice!

If an application needs "accents" to have the same properties as their base
character, the application can define a special property value which means
"this character inherits the properties of the previous character/the
character on its left/the character on its right/the character at position
N/etc.".
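
A sketch of the "inherit from the previous character" variant in Python. The
reserved value INHERIT_PREVIOUS and the function resolve are hypothetical;
which property value is reserved, and what inheriting means exactly, would be
up to the application.

```python
# Hypothetical reserved property value meaning "same as the previous
# character". Assumption: the all-ones 43-bit pattern is otherwise unused.
INHERIT_PREVIOUS = (1 << 43) - 1


def resolve(cells):
    """Replace INHERIT_PREVIOUS with the resolved properties of the
    preceding cell. `cells` is a list of (character, props) pairs."""
    resolved = []
    last_props = None
    for ch, props in cells:
        if props == INHERIT_PREVIOUS:
            if last_props is None:
                raise ValueError("nothing to inherit from")
            props = last_props
        resolved.append((ch, props))
        last_props = props
    return resolved


BOLD = 1  # hypothetical application-defined property value
text = [("e", BOLD), ("\u0301", INHERIT_PREVIOUS), ("x", 0)]
print(resolve(text))
```

An accent that needs its own properties simply stores them instead of the
reserved value.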

But if you really want to stick to your (sorry, insane) idea of storing a
letter and its accent in the same unit, I would suggest at least to:
1) limit this to the *first* accent only, without distinguishing
between "upper" and "lower" accents (any subsequent "accent" will take its
own 64-bit entry);
2) encode the "accent" with its regular Unicode code point, rather
than with an ad-hoc 8-bit code.

This would result in this partition:

        Application-defined Properties: 22 bits
        Base character code: 21 bits
        Accent character code: 21 bits

Or, if you need to stick to 8-bit boundaries:

        Application-defined Properties: 16 bits
        Base character code: 24 bits
        Accent character code: 24 bits

A field of 16 or 22 bits is still a fair amount of space, especially if the
application uses it as an index to an external table.

And now comes my last and most fundamental question:

Q0: why do you want to propose all this as a Unicode "technical report"?

Internal data structures and algorithms are, by definition, "internal", so I
see no need to standardize them, or even to publish them.

_ Marco


This archive was generated by hypermail 2.1.5 : Fri Sep 26 2003 - 08:22:34 EDT