Re: Is it safe to dig into comment contents of PropList.txt?

From: Steffen <sdaoden_at_gmail.com>
Date: Thu, 07 Nov 2013 12:35:36 +0100

Markus Scherer <markus.icu_at_gmail.com> wrote:
 |On Wed, Nov 6, 2013 at 2:43 AM, Steffen Daode <sdaoden_at_gmail.com> wrote:
 |> Nope according to POSIX, Vol. 1: Base Definitions, 7.3.1. LC_CTYPE ([1]):
 |
 |There is a Unicode spec for these properties:
 |http://www.unicode.org/reports/tr18/#Compatibility_Properties

Yeaaah.. you know, i'm not completely convinced of that list yet,
in that i need to dig into that complex stuff myself. At the
moment i don't include \p{gc=Symbol} in punctuation; symbols are
only in GRAPH and PRINT (i don't see evidence that «POSIX adds
symbols» to PUNCT, isn't that dangerous propaganda?).
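
A minimal sketch of the two readings of PUNCT discussed here, using
Python's unicodedata module as a stand-in for the UCD. The function
names are hypothetical, and the second variant is simplified: it does
not subtract alphabetic characters, which a fuller definition might
do.

```python
import unicodedata

def is_punct_strict(ch: str) -> bool:
    """PUNCT as punctuation only: general categories P*."""
    return unicodedata.category(ch).startswith("P")

def is_punct_with_symbols(ch: str) -> bool:
    """PUNCT as punctuation plus symbols: general categories P* and S*."""
    return unicodedata.category(ch)[0] in ("P", "S")

# Show how a few ASCII characters fall under each reading.
for ch in "!$+<^`|~":
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} "
          f"strict={is_punct_strict(ch)} "
          f"with_symbols={is_punct_with_symbols(ch)}")
```

The two readings differ only on characters whose general category is
one of the S* categories, such as '$' (Sc) and '+' (Sm).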

(But for quite a while i've really been dealing with basic
library layout development rather than anything else. I think
i've already said that i want to support platforms which don't
have dynamic linkers; also, maybe someone is only interested in
specific datasets as such. So what i did is to completely detach
the UCD data handling from my upcoming library in order to make
this datapool usable by itself, and to change things so as to be
able to offer two variants: a space-restricted, property-reduced
one based on binary search arrays, and a fully featured one that
is not yet totally clear but most likely based on multistage
arrays (at the moment that one exists only for properties).
So i'm off; give me a decade, and i promise i'll use this list to
capacity...)
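
The two storage layouts mentioned above can be sketched roughly as
follows. This is only an illustration under stated assumptions, not
the layout of the library being described: the example property
(gc=Zs), the block size of 256, and all function names are
hypothetical, and Python's unicodedata module stands in for the UCD.

```python
import bisect
import unicodedata

MAX_CP = 0x110000   # one past the last Unicode code point
BLOCK = 256         # hypothetical block size for the two-stage table

def has_prop(cp: int) -> bool:
    """Example binary property: general category Zs (space separator)."""
    return unicodedata.category(chr(cp)) == "Zs"

# Variant 1: sorted, non-overlapping [start, end] ranges, binary-searched.
def build_ranges():
    starts, ends = [], []
    for cp in range(MAX_CP):
        if has_prop(cp):
            if ends and ends[-1] == cp - 1:
                ends[-1] = cp          # extend the current range
            else:
                starts.append(cp)      # open a new range
                ends.append(cp)
    return starts, ends

def lookup_ranges(starts, ends, cp: int) -> bool:
    # Find the last range starting at or before cp, then check its end.
    i = bisect.bisect_right(starts, cp) - 1
    return i >= 0 and cp <= ends[i]

# Variant 2: two-stage table; identical blocks are stored only once.
def build_two_stage():
    stage1, stage2, seen = [], [], {}
    for base in range(0, MAX_CP, BLOCK):
        block = tuple(has_prop(cp) for cp in range(base, base + BLOCK))
        if block not in seen:
            seen[block] = len(stage2)
            stage2.append(block)
        stage1.append(seen[block])
    return stage1, stage2

def lookup_two_stage(stage1, stage2, cp: int) -> bool:
    # Two array indexing steps, no searching.
    return stage2[stage1[cp // BLOCK]][cp % BLOCK]
```

The range array is compact for sparse properties but costs a search
per lookup; the two-stage table trades some space for constant-time
access.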

And i'm not alone; e.g., perl(1) 5.18.1 doesn't include U+00A0 in
[:blank:] (of which 'man perlrecharclass' states that it is a GNU
extension, though it's a regular part of POSIX:

  blank
  Define characters to be classified as <blank> characters.)
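
For comparison, a minimal sketch of a Unicode-aware <blank> test,
assuming the compatibility definition from the UTS #18 link quoted
above (space separators plus CHARACTER TABULATION). The function name
is hypothetical, and the result depends on the Unicode version that
Python's unicodedata module ships with.

```python
import unicodedata

def is_blank(ch: str) -> bool:
    """True for TAB and for characters with general category Zs."""
    return ch == "\t" or unicodedata.category(ch) == "Zs"

# U+00A0 NO-BREAK SPACE has general category Zs, so it counts as blank
# under this definition; line terminators do not.
for ch in (" ", "\t", "\u00a0", "\n"):
    print(f"U+{ord(ch):04X} blank={is_blank(ch)}")
```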

 |ICU should be implementing that, for example
 |[:print:]<http://unicode.org/cldr/utility/list-unicodeset.\
 |jsp?a=%5B%3Aprint%3A%5D&g=>

But not behind GPRS, sorry -- locally i have a nice crash-free
ucd(1) which uses the current ICU dataset, no clumsy
i-was-Corel-Office.
Ciao,

 |markus

--steffen

attached mail follows:


On Wed, Nov 6, 2013 at 2:43 AM, Steffen Daode <sdaoden_at_gmail.com> wrote:

> |TAB is "printable" (for the isprint() macro in standard C libraries)
> because
> |it has a whitespace property, even if its general category is very weakly
>
> Nope according to POSIX, Vol. 1: Base Definitions, 7.3.1. LC_CTYPE ([1]):
>
> print
> Define characters to be classified as printable characters,
> including the <space>.
>
> In the POSIX locale, all characters in class graph shall be
> included; no characters in class cntrl shall be included.
>
> In a locale definition file, characters specified for the
> keywords upper, lower, alpha, digit, xdigit, punct, graph, and
> the <space> are automatically included in this class. No
> character specified for the keyword cntrl shall be specified.
>
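
For the POSIX locale, the rule quoted above can be written out as a
small sketch over ASCII. It assumes the usual POSIX-locale class
membership (cntrl is U+0000..U+001F plus U+007F, graph is
U+0021..U+007E); the set and function names are hypothetical.

```python
# Class membership in the POSIX locale, expressed as sets of code points.
CNTRL = set(range(0x00, 0x20)) | {0x7F}
GRAPH = set(range(0x21, 0x7F))
PRINT = GRAPH | {0x20}   # all of graph, plus <space>

def is_print(ch: str) -> bool:
    """True if ch is in class print of the POSIX locale."""
    return ord(ch) in PRINT

# TAB (U+0009) is in cntrl, so it is not in print; <space> is.
print("TAB printable:", is_print("\t"))
print("SPACE printable:", is_print(" "))
```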

There is a Unicode spec for these properties:
http://www.unicode.org/reports/tr18/#Compatibility_Properties

ICU should be implementing that, for example
[:print:]<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3Aprint%3A%5D&g=>

markus
Received on Thu Nov 07 2013 - 05:38:54 CST
