Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Dec 03 2004 - 17:19:38 CST

Next message: Rene Hache: "latin equivalent to specific indian characters"

Previous message: Philippe Verdy: "Re: OpenType vs TrueType (was current version of unicode-font)"
In reply to: Theo: "Re: Nicest UTF"
Next in thread: Philippe Verdy: "Re: Nicest UTF"

From: "Theo" <delete@elfdata.com>
>> From: Asmus Freytag <asmusf@ix.netcom.com>
> So, despite it being UTF-8 case insensitive, it was totally blastingly
> fast. (One person reported counting words at 1MB/second of pure text, from
> within a mixed Basic / C environment). You'll need to keep in mind, that
> the counter must look up through thousands of words (Every single word its
> come across in the text), on every single word lookup.
>
> Anyhow, from my experience, UTF-8 is great for speed and RAM.

Probably true for English or most Western European Latin-based languages
(plus Greek and Coptic).

But for other languages that still use lots of characters in the range
U+0000 to U+07FF (C0 and C1 controls, Basic Latin, Latin-1 Supplement, Latin
Extended-A and -B, IPA Extensions, Spacing Modifier Letters, Combining
Diacritical Marks, Greek and Coptic, Cyrillic, Armenian, Hebrew, Arabic),
UTF-8 and UTF-16 are about equally efficient: every code point from U+0080
to U+07FF takes two bytes in either encoding.

For all others, which need lots of characters above U+07FF (Indic scripts,
Thai, CJK, and most other Asian, Native American or African scripts in the
BMP, or even the BMP private-use area), UTF-16 is better: two bytes per
character instead of three, so more compact in memory and usually faster.
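
These size comparisons are easy to check by encoding a few sample strings.
The sketch below is only an illustration: the helper name encoded_sizes and
the sample words are my own assumptions, not something from the earlier
messages. The "-le" codecs are used so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# Hypothetical sample words, written with escapes so the code points are explicit.
samples = {
    "English": "word",                               # Basic Latin
    "Greek": "\u03bb\u03cc\u03b3\u03bf\u03c2",       # Greek and Coptic block
    "Russian": "\u0441\u043b\u043e\u0432\u043e",     # Cyrillic block
    "Hindi": "\u0936\u092c\u094d\u0926",             # Devanagari block
    "Japanese": "\u8a00\u8449",                      # CJK Unified Ideographs
}

for name, text in samples.items():
    print(name, encoded_sizes(text))
```

Real text also contains ASCII spaces, digits and punctuation, which cost one
byte in UTF-8 and two in UTF-16, so measuring a representative corpus says
more than measuring single words.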

UTF-32 will be better only for historic texts written almost entirely with
characters outside the BMP (for now, only Old Italic, Gothic, Ugaritic,
Deseret, Shavian, Osmanya, Cypriot Syllabary), and only if C0 controls (such
as TAB, CR and LF), ASCII SPACE, and NBSP are a minority.
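
To see why the spaces and controls matter: a code point outside the BMP takes
four bytes in UTF-8, in UTF-16 (as a surrogate pair) and in UTF-32, so on
size UTF-32 can at best tie, and every BMP character mixed in tips the
balance against it. A small sketch, where the function name sizes and the
three Gothic letters are arbitrary choices of mine:

```python
def sizes(text):
    """Byte lengths of text as (UTF-8, UTF-16, UTF-32), without a BOM."""
    codecs = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(text.encode(codec)) for codec in codecs)

# Three letters from the Gothic block (U+10330..U+1034F), all outside the BMP.
gothic_word = "\U00010332\U0001033F\U00010344"

# Only supplementary characters: all three encoding forms tie.
print(sizes(gothic_word))                      # (12, 12, 12)

# Two such words separated by one ASCII space: UTF-32 is now the largest.
print(sizes(gothic_word + " " + gothic_word))  # (25, 26, 28)
```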


This archive was generated by hypermail 2.1.5 : Fri Dec 03 2004 - 17:21:32 CST