Re: Clarification of character classes

From: Asmus Freytag (
Date: Thu Dec 14 2000 - 19:25:06 EST

For line breaking you should refer to Unicode Technical Report #14 and not
the (more cursory) discussion in chapter 5 of the book. The character
classes are defined in much greater detail there, and it links to the
data file that spells them all out (again, for line breaking).
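
As a rough illustration of how such a data file can be consumed, here is a
minimal Python sketch. It assumes lines of the form
"start[..end];class # comment"; the sample entries and the helper names
parse_classes and line_break_class are made up for this example, and the
actual file linked from UTR #14 remains the authority on both layout and
values.

```python
import bisect

# Hypothetical excerpt, assuming a "start[..end];class # comment" layout.
# The real data file linked from UTR #14 is the authority.
SAMPLE = """\
000A;LF # LINE FEED
000D;CR # CARRIAGE RETURN
0020;SP # SPACE
002D;HY # HYPHEN-MINUS
0030..0039;NU # DIGIT ZERO..DIGIT NINE
200B;ZW # ZERO WIDTH SPACE
"""


def parse_classes(text):
    """Return a sorted list of (first, last, line_break_class) tuples."""
    ranges = []
    for line in text.splitlines():
        # Drop the trailing comment and skip blank lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        span, cls = (field.strip() for field in line.split(";")[:2])
        first, _, last = span.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), cls))
    return sorted(ranges)


def line_break_class(ranges, ch, default="XX"):
    """Look up one character; 'XX' is used here to mean 'not listed'."""
    cp = ord(ch)
    # Find the last range whose first code point is <= cp.
    i = bisect.bisect_right(ranges, (cp, 0x110000, "")) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return default


if __name__ == "__main__":
    table = parse_classes(SAMPLE)
    for ch in "5-A":
        print("U+%04X" % ord(ch), line_break_class(table, ch))
```

The binary search only matters for the full file; for a handful of entries
a linear scan would do just as well.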

For the other types of boundaries, chapter 5 has all the available information.
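
The property lookups that the quoted question below relies on (General
Category and Canonical Combining Class) can be tried out with a short
Python sketch. It encodes the questioner's tentative reading of Table 5-3,
which is not confirmed here, and the function name grapheme_class_guess is
hypothetical. As a simplification, everything that matches no earlier rule
is reported as "Other".

```python
import unicodedata


def grapheme_class_guess(ch):
    """Classify one character per the questioner's unverified guesses."""
    cp = ord(ch)
    if cp == 0x000D:
        return "CR"
    if cp == 0x000A:
        return "LF"
    ccc = unicodedata.combining(ch)  # Canonical Combining Class as an int
    if ccc == 9:
        return "Virama"
    if ccc != 0:
        return "Joining"
    # Hangul Jamo ranges exactly as listed in the question.
    if 0x1100 <= cp <= 0x115F:
        return "L"
    if 0x1160 <= cp <= 0x11A7:
        return "V"
    if 0x11A8 <= cp <= 0x11FF:
        return "T"
    cat = unicodedata.category(ch)  # two-letter General Category
    if cat.startswith("C"):
        return "Format"
    if cat == "Lo":
        return "Lo"
    return "Other"  # simplification: catch-all for the rest


if __name__ == "__main__":
    for ch in ["\r", "\u094d", "\u0301", "\u1100", "\u4e00", "a"]:
        print("U+%04X" % ord(ch), grapheme_class_guess(ch))
```

The results depend on the Unicode version shipped with the Python
interpreter, which is newer than the version the book describes.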

At 03:43 AM 12/14/00 -0800, Tobias Hunger wrote:
>I have some more questions on the character classes used in chapter 5 of the
>book. Here is a list of classes mentioned and what I made them out to be. It
>would be great if someone could verify that I got it right:-)
>Table 5-3:
>CR: U+000D
>LF: U+000A
>Format: Everything with General Category of C? besides CR/LF
>Virama: Every character with Canonical Combining Class of 9
>Joining: Every character with Canonical Combining Class != 0 (but not
> a Virama) (?)
>L: U+1100-U+115F
>V: U+1160-U+11A7
>T: U+11A8-U+11FF
>Lo: Everything besides the above with General Category of Lo (?)
>Other: Every letter (besides those above) with General Category L? (?)
>Table 5-4:
>Sep: Paragraph Separator (U+2029) / Line Separator (U+2028)
>TAB: Now which one is this? U+0009? U+000B?
>Let: Everything with General Category L?
>Com: Same as Joining above? Or Combining Property from PropList.txt?
>Hira: U+3040-U+309F
>Kata: U+30A0-U+30FF
>Han: All the CJK-Ranges (?)
>Table 5-5:
>ZWSP U+200B
>Sp: Every letter with General Category Zp (?)
>Break: LS/PS. What else?
>Com: same as above (?)
>Ideographic: Same as Han above or everything with Ideographic Property in
> PropList.txt?
>Alphabetic: Everything with Alphabetic Property in PropList.txt
>Exclam: Now how do I figure this one out? Terminal Punctuation Property?
>Syntax: What is a Solidus? Which characters belong here?
>Open: General Category Ps
>Close: General Category Pe
>Quote: General Category Pi and Pf
>NonStarter: Which Hiragana and Katakana characters are small?
>HyphenMinus: U+002D
>Insep: Ellipsis Characters and leaders (?)
>Number: General Category Nd
>NumericPrefix: How do I figure this one out? With Bidi-Properties?
>NumericPostfix: and this one?
>NumericInfix: how about this one?
>Base: Canonical Combining Class == 0
>NonBase: Canonical Combining Class != 0 (?)
>All: Everything
>Table 5-6:
>Sp: Same as in 5-5?
>Term: Terminal Punctuation Property?
>Dot: U+00B7 (?)
>Cap: General Category Lu, Lt, Lo
>Lower: General Category Ll
>Open: same as in 5-5
>Close: Is this the same as in 5-5? This one includes period, comma, ... which
> the one in 5-5 does not.
>Thank you for your help. Probably I am just a bit confused and should be able
>to figure this out on my own, but I just don't get it.
>--
>-------------------------------------------------------------------
>Tobias Hunger The box said: 'Windows 95 or better'
> So I installed Linux.
>- -------------------------------------------------------------------
