Re: Is it safe to dig into comment contents of PropList.txt?

From: Steffen <sdaoden_at_gmail.com>
Date: Wed, 06 Nov 2013 11:43:23 +0100

Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
 |2013/11/5 Steffen Daode <sdaoden_at_gmail.com>
 |> (The problem i'm facing is that _PRINT and _GRAPH cannot be set
 |> for some properties from PropList.txt, say, _PRINT can't be set
 |> for U+0009, CHARACTER TABULATION (ht), since it's a Cc, but in
 |
 |TAB is "printable" (for the isprint() macro in standard C libraries) because
 |it has a whitespace property, even if its general category is very weakly

Nope according to POSIX, Vol. 1: Base Definitions, 7.3.1. LC_CTYPE ([1]):

  print
  Define characters to be classified as printable characters,
  including the <space>.

  In the POSIX locale, all characters in class graph shall be
  included; no characters in class cntrl shall be included.

  In a locale definition file, characters specified for the
  keywords upper, lower, alpha, digit, xdigit, punct, graph, and
  the <space> are automatically included in this class. No
  character specified for the keyword cntrl shall be specified.

  [1] <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_03_01>

Verifiable under LC_ALL=en_GB.UTF-8 in Mac OS X Snow Leopard
(which admittedly uses very old Citrus data, i always wonder why all
those Gigabytes of «Software Update»s don't tweak that, not to
talk about GNU make 3.81 and all the other buggy or non-compliant
stuff, but that is a different story):

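  #define _XOPEN_SOURCE 700  /* exposes wcwidth() on some systems */
  #include <ctype.h>
  #include <locale.h>
  #include <stdio.h>
  #include <wchar.h>  /* wcwidth() is declared here, not in <wctype.h> */
  int main(void) {
    setlocale(LC_ALL, "");  /* pick up LC_ALL from the environment */
    printf("%d %d\n", isprint('\t'), wcwidth(L'\t'));
    return 0;
  }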

  ?0[steffen_at_sherwood tmp]$ cc -o zt t.c && ./zt
  0 -1

 |The character mapping for the isprint() macro is defined by an expression
 |based on existing Unicode properties. Most C libraries optimize this

But i agree that POSIX has to move towards Unicode definitions,
and more byte- than bitwise.

--steffen

attached mail follows:


2013/11/5 Steffen Daode <sdaoden_at_gmail.com>

> Hello,
> ...i came to this solution in order to generate test data with
> awk(1) in a memory-friendly way?
>
> (The problem i'm facing is that _PRINT and _GRAPH cannot be set
> for some properties from PropList.txt, say, _PRINT can't be set
> for U+0009, CHARACTER TABULATION (ht), since it's a Cc, but in
> order to know that i had to parse UnicodeData.txt and store
> character information in memory first, (not thinking about further
> options), but that requires a lot of memory, more than is
> available on low-end machines.)

TAB is "printable" (for the isprint() macro in standard C libraries) because
it has a whitespace property, even if its general category is very weakly
defined (kept for upward compatibility, the GC property is not enough for
most applications). It is treated for example in word and line breaking
properties.

The character mapping for the isprint() macro is defined by an expression
based on existing Unicode properties. Most C libraries optimize this
expression using a fast compressed lookup table, except those legacy
libraries built only for 7-bit or 8-bit encodings based on ISO 646
(including ASCII, ISO 8859, and national encodings from Russia, Ukraine,
India, Japan, Korea, China -- VISCII needing a special exception as it
allocates some printable characters needed for accented letters, at code
positions of ISO 646 controls not needed and rarely used for plain text ;
same remark about old PC codepages where additional symbols are mapped in
those positions and found in old encoded texts for PCDOS/MSDOS..) or
EBCDIC, where this may be a very weak test on some 8-bit value ranges.
Received on Wed Nov 06 2013 - 04:46:40 CST
