Re: Unicode 3.0.1 update beta data files available

From: Christopher J. Fynn (cfynn@dircon.co.uk)
Date: Sat Aug 05 2000 - 00:10:06 EDT


Ken:

UnicodeData-3.0.1d2.beta.txt has:
<<
0F00;TIBETAN SYLLABLE OM;Lo;0;L;;;;;N;;;;;
...
0F0E;TIBETAN MARK NYIS SHAD;Po;0;L;;;;;N;TIBETAN DOUBLE SHAD;nyi shey;;;
0F0F;TIBETAN MARK TSHEG SHAD;Po;0;L;;;;;N;;tsek shey;;;
...
0F71;TIBETAN VOWEL SIGN AA;Mn;129;NSM;;;;;N;;;;;
0F72;TIBETAN VOWEL SIGN I;Mn;130;NSM;;;;;N;;;;;
0F73;TIBETAN VOWEL SIGN II;Mn;0;NSM;0F71 0F72;;;;N;;;;;
0F74;TIBETAN VOWEL SIGN U;Mn;132;NSM;;;;;N;;;;;
0F75;TIBETAN VOWEL SIGN UU;Mn;0;NSM;0F71 0F74;;;;N;;;;;
0F76;TIBETAN VOWEL SIGN VOCALIC R;Mn;0;NSM;0FB2 0F80;;;;N;;;;;
0F77;TIBETAN VOWEL SIGN VOCALIC RR;Mn;0;NSM;<compat> 0FB2 0F81;;;;N;;;;;
0F78;TIBETAN VOWEL SIGN VOCALIC L;Mn;0;NSM;0FB3 0F80;;;;N;;;;;
0F79;TIBETAN VOWEL SIGN VOCALIC LL;Mn;0;NSM;<compat> 0FB3 0F81;;;;N;;;;;
0F7A;TIBETAN VOWEL SIGN E;Mn;130;NSM;;;;;N;;;;;
0F7B;TIBETAN VOWEL SIGN EE;Mn;130;NSM;;;;;N;TIBETAN VOWEL SIGN AI;;;;
0F7C;TIBETAN VOWEL SIGN O;Mn;130;NSM;;;;;N;;;;;
0F7D;TIBETAN VOWEL SIGN OO;Mn;130;NSM;;;;;N;TIBETAN VOWEL SIGN AU;;;;
>>

Shouldn't 0F00 have a decomposition / equivalence to 0F68 0F7C 0F7E?
It is essentially a ligature or precomposed form of those characters.

Shouldn't 0F7B have a decomposition / equivalence to 0F7A 0F7A
and 0F7D have a decomposition / equivalence to 0F7C 0F7C ?

Are 0F77 and 0F79 marked as <compat> because, once decomposed,
the subjoined RA [0FB2] or LA [0FB3] cannot easily be identified as part of
a vowel? (Most Tibetans will enter these in decomposed form anyway.)

Shouldn't 0F0E have a decomposition to 0F0F 0F0F?
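
To make it easier to check which of the entries quoted above actually carry
a decomposition, here is a minimal Python sketch. It assumes only the
semicolon-separated layout of UnicodeData.txt, with the decomposition mapping
in the sixth field. The names Decomposition and parse_decomposition are
hypothetical, made up for this illustration, and are not part of any Unicode
tool.

```python
from typing import NamedTuple, Optional, Tuple


class Decomposition(NamedTuple):
    tag: Optional[str]        # "compat" etc.; None means a canonical mapping
    mapping: Tuple[int, ...]  # code points the character decomposes to


def parse_decomposition(line: str) -> Tuple[int, Optional[Decomposition]]:
    """Return (code point, decomposition or None) for one UnicodeData line."""
    fields = line.strip().split(";")
    code = int(fields[0], 16)
    raw = fields[5]  # sixth field: character decomposition mapping
    if not raw:
        return code, None
    parts = raw.split()
    tag = None
    if parts[0].startswith("<"):
        # A leading <tag> marks a compatibility mapping of that type.
        tag = parts[0].strip("<>")
        parts = parts[1:]
    return code, Decomposition(tag, tuple(int(p, 16) for p in parts))


# Lines copied from the beta excerpt quoted above.
SAMPLE = [
    "0F00;TIBETAN SYLLABLE OM;Lo;0;L;;;;;N;;;;;",
    "0F0E;TIBETAN MARK NYIS SHAD;Po;0;L;;;;;N;TIBETAN DOUBLE SHAD;nyi shey;;;",
    "0F73;TIBETAN VOWEL SIGN II;Mn;0;NSM;0F71 0F72;;;;N;;;;;",
    "0F77;TIBETAN VOWEL SIGN VOCALIC RR;Mn;0;NSM;<compat> 0FB2 0F81;;;;N;;;;;",
    "0F7B;TIBETAN VOWEL SIGN EE;Mn;130;NSM;;;;;N;TIBETAN VOWEL SIGN AI;;;;",
]

for line in SAMPLE:
    code, decomp = parse_decomposition(line)
    if decomp is None:
        print(f"{code:04X}: no decomposition listed")
    else:
        kind = decomp.tag or "canonical"
        seq = " ".join(f"{cp:04X}" for cp in decomp.mapping)
        print(f"{code:04X}: {kind} -> {seq}")
```

Run over the whole Tibetan block, something like this would list every
character with an empty decomposition field, which is where 0F00, 0F0E, 0F7B
and 0F7D show up in the beta file.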

==

0F14 TIBETAN MARK GTER TSHEG has an alternative name "TIBETAN COMMA".
This alternative is completely misleading and should be removed if possible.
0F14 basically has the same function as 0F0F - and is used in certain types
of text instead of 0F0F. Text that has this character instead of 0F0F is
"treasure text" [gter-ma] or revealed text.

==
PropList-3_0_1d2_beta.txt has:
<<
Property dump for: 0x20001000 (Punctuation)
...
0F04..0F12 (15 chars)
0F3A..0F3D (4 chars)
0F85
...
>>

The first range should probably begin at 0F02 instead of 0F04,
as 0F02 and 0F03 are special forms of YIG MGO.

0F14 is also a punctuation character (just like 0F0F which it replaces in
certain types of text).

I don't think 0F85 TIBETAN MARK PALUTA should be counted as a "punctuation
character".

 ===

<<
Property dump for: 0x00800000 (Delimiter)
...
0F0B
0F0D..0F12 (6 chars)
0F3A..0F3D (4 chars)
>>

0F0C, like 0F0B, is a word delimiter - the difference is that a break cannot
occur after 0F0C while it can after 0F0B.

0F14 should also be included here for the same reasons as mentioned earlier.

0F34 and 0FBE should probably also be included here - although they indicate
something like "etc." or "and so forth", they also contain an implicit 0F0B
and can occur only at the end of a phrase, not in the middle.
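
For anyone who wants to check the property dumps mechanically, here is a
rough Python sketch. It assumes the dump lines look exactly like the ones
quoted above (a single code point, or "first..last (N chars)"). The helper
names parse_ranges and contains are hypothetical, and the lists hold only the
Tibetan lines quoted in this message, not the full PropList file.

```python
import re
from typing import Iterable, List, Tuple

# One code point, optionally "..last", optionally "(N chars)".
RANGE_RE = re.compile(
    r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?(?:\s+\((\d+) chars\))?$"
)


def parse_ranges(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Turn dump lines into inclusive (first, last) code point ranges."""
    ranges = []
    for line in lines:
        m = RANGE_RE.match(line.strip())
        if not m:
            continue  # headers and "..." elisions are skipped
        first = int(m.group(1), 16)
        last = int(m.group(2), 16) if m.group(2) else first
        # Cross-check the "(N chars)" count when the dump provides one.
        if m.group(3) is not None and int(m.group(3)) != last - first + 1:
            raise ValueError(f"count mismatch in {line!r}")
        ranges.append((first, last))
    return ranges


def contains(ranges: List[Tuple[int, int]], cp: int) -> bool:
    return any(first <= cp <= last for first, last in ranges)


# Tibetan lines as quoted from the two property dumps above.
PUNCTUATION = ["0F04..0F12 (15 chars)", "0F3A..0F3D (4 chars)", "0F85"]
DELIMITER = ["0F0B", "0F0D..0F12 (6 chars)", "0F3A..0F3D (4 chars)"]

punct = parse_ranges(PUNCTUATION)
delim = parse_ranges(DELIMITER)

for cp in (0x0F02, 0x0F03, 0x0F14, 0x0F85):
    print(f"{cp:04X} listed as Punctuation: {contains(punct, cp)}")
for cp in (0x0F0C, 0x0F14, 0x0F34, 0x0FBE):
    print(f"{cp:04X} listed as Delimiter: {contains(delim, cp)}")
```

The code points tested are just the ones raised in this message. Whether each
of them should be in the listing is of course the question for the beta
review, not something the script decides.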

- Chris

----- Original Message -----
From: "Kenneth Whistler" <kenw@sybase.com>
To: "Unicode List" <unicode@unicode.org>
Cc: <unicode@unicode.org>; <kenw@sybase.com>
Sent: Saturday, August 05, 2000 2:24 AM
Subject: Unicode 3.0.1 update beta data files available

> The beta directory for the Unicode 3.0.1 update has been created.
>
> Due to the current problem with anonymous ftp on www.unicode.org,
> only the http version of this directory is currently available:
>
> http://www.unicode.org/Public/3.0-Update1/
>
> The updated beta files at that location for the Unicode 3.0.1 update are:
>
> 5179 Jul 31 21:14 ArabicShaping-3d1.beta.txt
> 43559 Jul 31 21:14 CaseFolding-2d1.beta.txt
> 5085 Jul 31 21:14 CompositionExclusions-2d1.beta.txt
> 55254 Jul 31 21:14 PropList-3.0.1d2.beta.txt
> 13841 Jul 31 21:14 SpecialCasing-3d2.beta.txt
> 48261 Jul 31 21:14 UnicodeData-3.0.1d1.beta.html
> 636269 Jul 31 21:15 UnicodeData-3.0.1d2.beta.txt
>
> These are temporary names. Once the beta review closes, the "beta" and
> the delta number on the files will be dropped for the permanent
> versioned filename, and the latest versions of the files will
> be copied into the UNIDATA directory minus the version extension.

> And comparable changes will be made in the ftp hierarchy as well, as
> soon as regular ftp service can be restored on the server.

> Before that happens, however, we would like to invite all interested
> implementers to examine the data files and report any problems you
> find in them, so that any problem can be corrected before the finalization
> of the Unicode 3.0.1 update.

> Note that UnicodeData.txt and PropList.txt now explicitly contain
> codepoint listings using the 5- or 6-digit UTF-32 notation. If
> you are using automated parsers on either of those files, be aware
> of this change in convention and make sure your code is prepared
> to handle parsing of codepoint values greater than 0xFFFF.
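
As an illustration of the parsing change described in the paragraph above,
here is a minimal Python sketch of reading a 4-, 5- or 6-digit hex code point
field and, for values above 0xFFFF, deriving the UTF-16 surrogate pair. The
function names parse_codepoint and to_utf16 are hypothetical, and the field
values used are only examples of private use code points in Planes 15 and 16,
not lines taken from the beta files.

```python
import re
from typing import Tuple

HEX_FIELD = re.compile(r"[0-9A-Fa-f]{4,6}")


def parse_codepoint(field: str) -> int:
    """Parse a 4- to 6-digit hex field into a code point value."""
    if not HEX_FIELD.fullmatch(field):
        raise ValueError(f"not a 4-6 digit hex code point: {field!r}")
    cp = int(field, 16)
    if cp > 0x10FFFF:
        raise ValueError(f"beyond the UTF-16 range: {field!r}")
    return cp


def to_utf16(cp: int) -> Tuple[int, ...]:
    """Return the UTF-16 code unit(s) for a code point.

    Surrogate code points themselves (D800..DFFF) are not rejected here.
    """
    if cp < 0x10000:
        return (cp,)
    v = cp - 0x10000
    return (0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF))


for field in ("0F00", "F0000", "10FFFD"):
    cp = parse_codepoint(field)
    units = " ".join(f"{u:04X}" for u in to_utf16(cp))
    print(f"{field} -> U+{cp:04X} -> UTF-16 {units}")
```

A parser that stores code points in a 16-bit type, or that splits fields on a
fixed width of four characters, would fail on the second and third inputs,
which is the kind of problem worth finding before Unicode 3.1.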

> We are introducing this change now with the relatively trivial
> listing of user-defined, unassigned, and not-a-character codepoints
> past U+FFFF, so people can test out their implementations before
> they get whumped with 40,000+ new characters from Planes 1, 2,
> and 14 for the upcoming Unicode 3.1.

> --Ken Whistler
>
> ===================================================================
>
> The changes in the data files from the 3.0.0 release version are
> as follows:
>
> ArabicShaping.txt
>
> Updated the shaping class for 0671.
>
> CaseFolding.txt
>
> This is a new contributory data file. See UTR #21, Case Mappings.
>
> CompositionExclusions.txt
>
> Fixed a comment in the file.
> Added a minimal label/version comment at the top of the file.
>
> PropList.txt
>
> Removed F8F0..F8FF from a listing of several properties. (Bug)
> Fixed the default bidi property to LR for all user-defined character
> ranges.
> Updated properties for 0E47. (removed from alphabetics, added to
> diacritics)
> Extended property listing to full UTF-16 range for user-defined
> characters (including Planes 15 and 16), for bidi LR, and
> for unassigned characters.
> Added not-a-character property (a property of codepoints, not
> of characters), and provided listing for full UTF-16 range.
>
> SpecialCasing.txt
>
> Minor fixes to the BNF syntax.
> Addition of Lithuanian AFTER condition.
> Addition to notes in the comments in the file.
>
> UnicodeData.html
>
> Corrected a bullet numbering problem.
> Added documentation of range listing for Plane 15 and Plane 16
> user-defined characters.
> Added documentation of 4/5/6 digit hex notation conventions.
>
> UnicodeData.txt
>
> Added definition ranges for Plane 15, and Plane 16 user-defined
> characters.
> Added "dena sum" in the ISO comment field for 0FCF.
> Added 10646-1 Annex P asterisk comments to 01A6, 0280.


