Re: [long] Use of Unicode in AbiWord

From: schererm@us.ibm.com
Date: Thu Mar 18 1999 - 15:39:25 EST


hello,

i am another "software craftsman" with too little time to actively
participate in projects like yours, but i have used and studied unicode for
quite a while now and hope that i can give you some tips.

first of all, i believe that choosing unicode will make it a lot easier for
you to deal with text of any language.
however, there are some things that need to be mentioned.

> All document
> characters are defined to be part of the Unicode space. Our internal
> data structures always use a 16-bit integer to represent a character.

the most important thing is that unicode and its twin standard, iso-10646-1,
are not (not really, anyway) 16-bit character standards in the naive
sense, and never could be (this will cause some uproar here, but it's true
:-). the common goal of unicode (http://www.unicode.org/) and iso-10646 (
http://www.dkuug.dk/jtc1/sc2/wg2/) is to define a _universal_ character
set, one with all characters ever used. now, there are between 60000 and
80000 characters for chinese alone. at some point, more than 100000
characters will be officially assigned, and unicode 2.0 provides 128k+6400
codes for private use - expect someone to use some of them!

in short, there are three accepted ways to represent a character:
1) with one 32-bit word (typically wchar_t), holding values up to
   0x10ffff; in Unicode, this is a Unicode Scalar Value.
   the only possible fixed-width encoding for this is UCS-4
2) with one or two 16-bit words (Windows WCHAR, Java char); this is
   UTF-16. Unicode calls a pair that encodes one character a pair of
   "surrogates"
   (the first code is in 0xd800-0xdbff, the second in 0xdc00-0xdfff)
3) with one to four 8-bit bytes; this is UTF-8
   (bytes <128 are straight ASCII)

transformation between them is pure, fast bit-shifting.
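
to illustrate, here is a minimal sketch of the UTF-16 surrogate
transformation in C; the function names are made up for illustration,
not taken from any library:

    /* encode a scalar value c (0x10000..0x10ffff) as a surrogate pair */
    void encodeSurrogates(unsigned long c,
                          unsigned short *s1, unsigned short *s2) {
        c -= 0x10000;                                  /* 20 bits remain */
        *s1 = (unsigned short)(0xd800 + (c >> 10));    /* first surrogate */
        *s2 = (unsigned short)(0xdc00 + (c & 0x3ff));  /* second surrogate */
    }

    /* decode a surrogate pair back into a scalar value */
    unsigned long decodeSurrogates(unsigned short s1, unsigned short s2) {
        return 0x10000 + ((unsigned long)(s1 - 0xd800) << 10)
                       + (s2 - 0xdc00);
    }

values <=0xffff are their own UTF-16 code words, so this pair logic is
the only overhead UTF-16 adds over plain UCS-2.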

officially since 1996, with Unicode 2.0, UTF-16 is the preferred encoding
for unicode, not the 16-bit, fixed-width UCS-2.

see http://www.unicode.org/unicode/reports/tr17/
also ftp://ftp.ietf.org/internet-drafts/draft-hoffman-utf16-02.txt

so, your basic choice is whether you want to use a fixed-width encoding
or a variable-width one.
fixed-width means UCS-4, 32b per character.
variable-width means UTF-16, 1x16b or 2x16b per character, or UTF-8,
(1-4)x8b per character.

consider something else for the second choice between UTF-16 and UTF-8:
unicode defines a standard line separator (0x2028) and a standard paragraph
separator (0x2029) to resolve ambiguous use of CR, LF, VT, and other
controls.
if you use UTF-8, then these characters become pretty ugly 3-byte
sequences that you would have to check for in plain text all the time.
all the other formatting characters, e.g., for right-to-left writing,
will also need 3 bytes. in UTF-16 (and, of course, in UCS-4), they are
single code words.
on the separators, see http://www.unicode.org/unicode/reports/tr13/

the next version of the standard will make this point clearer.
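
to make the cost concrete, here is a sketch of the check in each
encoding, assuming well-formed text (the function names are mine):

    /* UTF-16: the paragraph separator is a single comparison */
    int isParaSepUTF16(const unsigned short *p) {
        return *p == 0x2029;
    }

    /* UTF-8: the same character is the 3-byte sequence 0xe2 0x80 0xa9 */
    int isParaSepUTF8(const unsigned char *p) {
        return p[0] == 0xe2 && p[1] == 0x80 && p[2] == 0xa9;
    }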

there is a third choice to make: aside from what datatype to use for
strings and streams of text, what datatype do you use for a single
character?

again, you could use a 32b int/wchar_t for any character, or you could
use two 16b unsigned short/WCHAR values that contain either (value, 0)
if value<=0xffff, or (s1, s2) with s1 and s2 being the two surrogates
according to the UTF-16 algorithm. UTF-8 hardly makes sense as a basis
for single characters.

in other words, in text arguments for functions, a string may be of type
(unsigned short *), and a single character may be of type (int). note that
bit 31 is definitely not used in unicode/iso-10646, so a signed int works
just fine.
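
as a minimal sketch of that convention, assuming nul-terminated UTF-16
strings (nextChar is a made-up name):

    /* return the character starting at s[*pi] and advance *pi past it */
    int nextChar(const unsigned short *s, int *pi) {
        int c = s[(*pi)++];
        if (c >= 0xd800 && c <= 0xdbff &&
            s[*pi] >= 0xdc00 && s[*pi] <= 0xdfff) {
            /* combine a surrogate pair into one scalar value */
            c = 0x10000 + ((c - 0xd800) << 10) + (s[(*pi)++] - 0xdc00);
        }
        return c;  /* never exceeds 0x10ffff, so bit 31 stays clear */
    }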

> We also think that we should switch our representation to UTF-8. On
> every platform we current plan to support, this would eliminate the
> encoding conversion step (as well as a lot of memory usage) for any
> run of text which includes only ASCII characters. For obvious
> reasons, and with no offense intended to the majority of the world who
> primarily use double-byte encoded characters, we believe this to be a
> common case worth optimizing.

this optimizes for one language while making just about every other
language a lot less efficient to handle, including all european ones - not
to mention the line and paragraph separators, the euro, the formatting
characters, dingbats, etc.
is it really so bad to double the size of text in memory in exchange for
simpler, faster processing? you can still save it to a file in UTF-8 or in
a traditional charset.

> 4. Change our rendering code to allow for the conversion (if any)
> between the font's encoding, whatever it may be, and the UTF-8
> representation of Unicode character which we'll be using for our
> data structures.

you will still need to check each byte for <128 before passing it through
unchanged. doing the same check on 16b or even 32b units should not cost
much more, if anything. and using a 16b x-font with UTF-8 will likely
degrade your performance.
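
for comparison, a sketch of the decoding loop the renderer would need
for UTF-8, covering only well-formed 1- to 3-byte sequences for brevity:

    /* decode one character at p; store its byte length in *len */
    int decodeUTF8(const unsigned char *p, int *len) {
        if (p[0] < 0x80) {         /* the ASCII fast path: the <128 test */
            *len = 1;
            return p[0];
        } else if (p[0] < 0xe0) {  /* 2-byte sequence */
            *len = 2;
            return ((p[0] & 0x1f) << 6) | (p[1] & 0x3f);
        } else {                   /* 3-byte sequence */
            *len = 3;
            return ((p[0] & 0x0f) << 12) | ((p[1] & 0x3f) << 6)
                 | (p[2] & 0x3f);
        }
    }

with 16b code words, the only comparable test is the check for the
surrogate ranges, and most characters need no test at all.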

one more point: more characters are going to be added over time, and
occasionally a character's properties are modified. you should look into
using something like a database of character properties that can be
updated from
ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData-Latest.txt
see here for more: http://www.unicode.org/unicode/onlinedat/online.html
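
a sketch of reading the first three semicolon-separated fields (code
point, name, general category) out of that file:

    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("UnicodeData-Latest.txt", "r");
        char line[512];
        if (f == NULL) return 1;
        while (fgets(line, sizeof line, f) != NULL) {
            unsigned long code;
            char name[128], category[8];
            if (sscanf(line, "%lx;%127[^;];%7[^;]",
                       &code, name, category) == 3) {
                /* build or update the property database here */
                printf("U+%04lX %s (%s)\n", code, name, category);
            }
        }
        fclose(f);
        return 0;
    }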

i can imagine that there is already some open-source work on a common
implementation of this.

good luck!

markus

ps:

for some advanced text handling in Java, see
http://www.ibm.com/java/education/international-text/
with links from there, you should also get a lot more basic information.
see also http://www.ibm.com/java/tools/international-classes/

Markus Scherer IBM RTP +1 919 486 1135 Dept. Fax +1 919 254 6430
schererm@us.ibm.com
                        Unicode is here! --> http://www.unicode.org/


