Gary responded to Samir's question:
> Characters <128 take one byte.
> Characters <2048 take two bytes.
> All others in the 64K normal range take three bytes each.
> There are provisions for characters above 2,097,152 to use three bytes, but
> normally unicode is only up to 64K. However when additional space is used,
> it still uses two bytes up to the 2,097,152 point.
Unicode (= UTF-16) is defined for the Unicode scalar range 0 .. 0x10FFFF.
Unicode scalar values in the range 0x10000..0x10FFFF (decimal 65,536..1,114,111)
require *four* bytes when expressed in UTF-8. (And also four bytes when
expressed as a surrogate pair for UTF-16, by the way.)
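To make the byte counts concrete, here is a minimal Python sketch of the range boundaries described above. The function names utf8_length and utf16_length are made up for this illustration; the loop merely cross-checks them against Python's built-in codecs.

```python
def utf8_length(code_point: int) -> int:
    """Bytes needed to encode one Unicode scalar value in UTF-8."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


def utf16_length(code_point: int) -> int:
    """Bytes needed in UTF-16: one 16-bit unit, or a surrogate pair."""
    return 2 if code_point < 0x10000 else 4


# Cross-check the boundary values against Python's built-in codecs.
for cp in (0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    ch = chr(cp)
    assert len(ch.encode("utf-8")) == utf8_length(cp)
    assert len(ch.encode("utf-16-be")) == utf16_length(cp)
    print(f"U+{cp:04X}: UTF-8 {utf8_length(cp)} bytes, "
          f"UTF-16 {utf16_length(cp)} bytes")
```

Note that "utf-16-be" is used rather than "utf-16" so that Python does not prepend a byte order mark, which would inflate the count by two bytes.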
> Using Character Agent from Bjondi, we find that 2048 (hex 0800) is in the
> middle of the arabic characters.
Actually, that boundary is *above* all the Arabic characters, and also above
Syriac and Thaana.
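This is easy to check. The sketch below assumes the block ranges Arabic U+0600..U+06FF, Syriac U+0700..U+074F, and Thaana U+0780..U+07BF, and encodes every code point in each block (assigned or not) to see which UTF-8 widths occur. The names BLOCKS, TWO_BYTE_LIMIT, and block_widths are only for illustration.

```python
# Block ranges (first, last) for the three scripts just below U+0800.
BLOCKS = {
    "Arabic": (0x0600, 0x06FF),
    "Syriac": (0x0700, 0x074F),
    "Thaana": (0x0780, 0x07BF),
}
TWO_BYTE_LIMIT = 0x0800  # first code point that needs three bytes

block_widths = {}
for name, (first, last) in BLOCKS.items():
    # Collect the distinct UTF-8 byte counts seen anywhere in the block.
    widths = {len(chr(cp).encode("utf-8")) for cp in range(first, last + 1)}
    block_widths[name] = widths
    print(f"{name}: ends at U+{last:04X}, "
          f"below limit: {last < TWO_BYTE_LIMIT}, widths: {sorted(widths)}")
```

Every block ends below U+0800, so only the 2-byte form should show up for these three scripts.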
> Below this are things like hebrew,
> armenian, cyrillic, greek, and some other misc stuff. All the asian sets are
> above this point.
If Samir's question is reinterpreted as "what are those languages which
require use of 3-byte UTF-8 forms?", then the answer is roughly:
Any language written using one of the Indic scripts (e.g. Hindi, Marathi,
Nepali, etc. using the Devanagari script, and so on); Sinhalese; Thai
and Lao; Tibetan and Dzongkha; Myanmar and any other language written
using the Myanmar script (e.g. Shan, Karen, etc.); Georgian; any
language using the Ethiopic script (e.g. Amharic, Tigré, etc.); Cherokee;
any language using Unified Canadian Aboriginal Syllabics (e.g. Cree,
Inuktitut, etc.); Old Irish written in the Ogham script; old European
languages written with Runes; Khmer; any language written with the
Mongolian script (Mongolian, Manchu, Todo, etc.); Vietnamese and a
number of minority European languages (Livonian, Welsh, etc.) if
using Latin extended precomposed characters; polytonic Greek, if using
Greek extended precomposed characters; Japanese, Chinese, Korean, and Yi.
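As a rough spot check of that list, the following sketch takes one sample character from each of a number of the scripts named above and confirms that it falls in U+0800..U+FFFF and encodes to three bytes. The particular sample characters are my own arbitrary picks, one per block, and SAMPLES and sample_widths are illustrative names; this is not an exhaustive test of every script.

```python
# One arbitrarily chosen character per script block (illustrative picks).
SAMPLES = {
    "Devanagari": "\u0915",
    "Sinhala": "\u0D85",
    "Thai": "\u0E01",
    "Lao": "\u0E81",
    "Tibetan": "\u0F40",
    "Myanmar": "\u1000",
    "Georgian": "\u10D0",
    "Ethiopic": "\u1200",
    "Cherokee": "\u13A0",
    "Canadian Aboriginal Syllabics": "\u1401",
    "Ogham": "\u1681",
    "Runic": "\u16A0",
    "Khmer": "\u1780",
    "Mongolian": "\u1820",
    "Latin Extended Additional": "\u1EA1",
    "Greek Extended": "\u1F00",
    "Hiragana": "\u3042",
    "CJK ideograph": "\u4E2D",
    "Yi": "\uA000",
    "Hangul": "\uAC00",
}

# Map each script name to the UTF-8 byte count of its sample character.
sample_widths = {name: len(ch.encode("utf-8")) for name, ch in SAMPLES.items()}

for name, ch in SAMPLES.items():
    print(f"{name}: U+{ord(ch):04X} -> {sample_widths[name]} bytes")
```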
Furthermore, since most other languages, including English, make use of
some of the general punctuation characters in U+2000..U+206F, including,
notably, the curly quote marks, en dash, em dash, and bullet, it can be
expected that 3-byte UTF-8 forms will be found mixed into Unicode text
for almost any language, even when all of the letters of that text only
require 1- or 2-byte UTF-8.
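A small sketch of that effect, using an English sample string of my own invention that contains curly quotes and an em dash. The helper name utf8_width_histogram is hypothetical; it just counts how many characters need each UTF-8 width.

```python
from collections import Counter


def utf8_width_histogram(text: str) -> Counter:
    """Count characters in text by the number of UTF-8 bytes each needs."""
    return Counter(len(ch.encode("utf-8")) for ch in text)


# English text with left/right double quotes (U+201C, U+201D)
# and an em dash (U+2014); every letter is plain ASCII.
sample = "\u201cHello\u201d \u2014 world"

histogram = utf8_width_histogram(sample)
total_bytes = len(sample.encode("utf-8"))

print(dict(histogram))
print(f"{len(sample)} characters, {total_bytes} bytes")
```

Here 3 of the 15 characters are general punctuation, yet they account for 9 of the 21 bytes.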
> ----- Original Message -----
> > Forwarding for Samir...
> > > Subject: Largest character
> > > Date: Fri, 31 Mar 2000 10:16:33 +0530
> > >
> > > Hi,
> > > Which are those languages whose characters requires maximum number
> > > bytes to store using UTF 8?
> > >
> > > - Samir Mehrotra,
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:00 EDT