Re: [very OT] Documentation: beyond 65,536 ; misc Semitic ?s

From: Kenneth Whistler
Date: Fri Feb 16 2001 - 13:20:18 EST

Elaine Keown asked:

> Within the book, Unicode 3.0, is there somewhere a long section I
> missed about all the stuff that happens beyond the "first 65,536,"
> in addition to surrogate stuff?


> Is there other documentation somewhere?

Yes -- in the next version of the standard. See:

> Today are there still 7,827 unused code values?

Actually, there are 880,325 reserved unassigned code points
(7,793 on the BMP and 872,532 on the supplementary planes).
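
As a quick check of the arithmetic above, here is a minimal Python sketch. The two per-area counts come from this message; the variable names are made up for illustration, and the 17-plane size of the code space is a general property of Unicode added here only for context.

```python
# Counts of reserved, unassigned code points quoted in this message.
bmp_unassigned = 7_793
supplementary_unassigned = 872_532

# Context (not from this message): the Unicode code space is
# 17 planes of 65,536 code points each.
PLANES = 17
CODE_POINTS_PER_PLANE = 65_536  # 2**16

total_unassigned = bmp_unassigned + supplementary_unassigned
code_space = PLANES * CODE_POINTS_PER_PLANE

print(f"unassigned: {total_unassigned:,}")  # unassigned: 880,325
print(f"code space: {code_space:,}")        # code space: 1,114,112
```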

> Will they be unassigned until version 4.0 gels?

No. Unicode 3.1 has already been approved, and is in the
last stages of publication. After that, Unicode 3.2 will
appear, adding over 1000 more characters to the BMP. Unicode
Version 4.0 is beyond that, and will, no doubt, add another
collection of characters.

> Also, is there a linguistic index to Unicode character
> database files, saying which mention Semitic languages?

No. But simple tools like grep enable you to pull out
all instances of ARABIC, HEBREW, or SYRIAC characters, if
you want.
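
For anyone without grep to hand, the same filtering can be sketched in Python. This is only an illustration: the function name grep_names and the sample data are hypothetical, and the sample lines are trimmed to the first three semicolon-separated fields (code point, name, general category) rather than copied in full from the database.

```python
# Illustrative sample in the semicolon-separated layout of UnicodeData.txt,
# trimmed to the first three fields (code point; name; general category).
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu
05D0;HEBREW LETTER ALEF;Lo
0627;ARABIC LETTER ALEF;Lo
0710;SYRIAC LETTER ALAPH;Lo
"""

SCRIPTS = ("ARABIC", "HEBREW", "SYRIAC")


def grep_names(lines, keywords=SCRIPTS):
    """Return (code point, name) pairs whose name contains any keyword."""
    hits = []
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        code_point, name = fields[0], fields[1]
        # Plain substring match, like grep without any options.
        if any(keyword in name for keyword in keywords):
            hits.append((code_point, name))
    return hits


matches = grep_names(SAMPLE.splitlines())
for code_point, name in matches:
    print(f"U+{code_point}  {name}")
```

With a local copy of the data file, the same function should work on the open file object, for example grep_names(open("UnicodeData.txt", encoding="utf-8")), assuming the file sits in the current directory.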

> And finally, is there documentation somewhere on whether 3.0
> has complete symbols for the 18 languages written in Arabic
> script that are mentioned in the book?

I presume you are talking about letters and points, rather
than symbols per se. The consortium doesn't have any explicit
language-by-language listing of Arabic alphabets and their
correlation with the encoded characters. However, the UTC does
consider the current encoding to be complete for the languages
that are explicitly mentioned, as well as for many others written
with the Arabic script that are not explicitly mentioned.

