Re: Is it safe to dig into comment contents of PropList.txt?

From: Steffen <sdaoden_at_gmail.com>
Date: Tue, 05 Nov 2013 21:57:48 +0100

Markus Scherer <markus.icu_at_gmail.com> wrote:
 |On Tue, Nov 5, 2013 at 5:38 AM, Steffen Daode <sdaoden_at_gmail.com> wrote:
 |
 |> Hello,
 |> ...i came to this solution in order to generate test data with
 |> awk(1) in a memory-friendly way?
 |
 |Comments like at the end of this line?
 |
 |0009..000D ; White_space # Cc [5] <control>..<control>

Yes, the 'Cc' in particular.
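
For illustration, a minimal sketch of what digging into such a comment
could look like, written in Python rather than awk(1). The regular
expression and the name parse_proplist_line are assumptions made for
this example and are derived only from the line quoted above. Since the
comment format is not guaranteed, the sketch raises an error instead of
guessing when a line does not look as expected.

```python
import re

# Assumed layout, taken from the quoted line:
#   0009..000D ; White_space # Cc [5] <control>..<control>
# A single code point has no "..XXXX" part and no "[n]" count.
_LINE_RE = re.compile(
    r"^(?P<first>[0-9A-F]{4,6})(?:\.\.(?P<last>[0-9A-F]{4,6}))?"
    r"\s*;\s*(?P<prop>\w+)"
    r"\s*#\s*(?P<gc>[A-Z][a-z&])"
    r"(?:\s+\[(?P<count>\d+)\])?\s"
)


def parse_proplist_line(line):
    """Return (first, last, property, general_category) for a data line.

    Blank lines and pure comment lines must be skipped by the caller.
    Anything that does not match the assumed layout is an error.
    """
    m = _LINE_RE.match(line)
    if m is None:
        raise ValueError("unexpected line format: %r" % line)
    first = int(m.group("first"), 16)
    last = int(m.group("last") or m.group("first"), 16)
    if last < first:
        raise ValueError("descending range: %r" % line)
    count = m.group("count")
    if count is not None and int(count) != last - first + 1:
        raise ValueError("range count mismatch: %r" % line)
    return first, last, m.group("prop"), m.group("gc")
```

The "[5]" count in the comment is used as a cheap consistency check
against the range, so a silently changed layout is more likely to be
noticed.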

 |> (The problem i'm facing is that _PRINT and _GRAPH cannot be set
 |> for some properties from PropList.txt, say, _PRINT can't be set
 |> for U+0009, CHARACTER TABULATION (ht), since it's a Cc, but in
 |> order to know that i had to parse UnicodeData.txt and store
 |> character information in memory first, (not thinking about further
 |> options), but that requires a lot of memory, more than is
 |> available on low-end machines.)
 |
 |The comments are just that, comments, for human consumption, and their
 |format may change without notice. One exception is the syntax in the
 |@missing lines.

yup, it was a stupid question on auto-generated data, i've
realized that, but too late. So your answer deserves special
thanks. But imho it's a pity that it was felt that the
General_Category belongs into the comment only.

(And on the other hand stupid questions are better than shit
decisions :)

 |It is normal that you need to parse multiple Unicode data files for
 |extracting useful data.

Yes, and i'm in the process of getting used to it, but

 |It also does not require "a lot of memory" considering how much memory is
 |available even on ten-year-old clunkers at this point, unless you are
 |especially extravagant with how you store the data. Besides, after parsing,
 |you would normally build more compact data structures for the data you need.

the problem is that i want to be double-tracked here: On the
one hand i (will) have an ISO C parser that (will) mince(s) all the
Unicode data files in order to generate dainty nibbles for the
C code which performs the tasks i'm after, say, uppercasing.

But on the other i (actually have learned the lesson that i) need
to generate test data so that a 'make test' is possible. Well, in
a not too far future i'm gonna add 'make devel-test-perl' and even
later i hope for 'make devel-test-icu', so that it is possible to
compare the outcome against long-developed codebases, but still
a normal user should be able to run 'make test' and get some
useful testing. Etc. etc. (I.e., i once found two compiler bugs
because of testing, and have been able to correct my code so that
the bugs did not occur.)

Today, and as of Unicode 6.2.0 and 6.3.0, it seems to be possible
to gain that level of testing for rather simple tasks like case
transliterations and visual code point widths, which is all i know
yet, simply by sequentially iterating over the data and having
a test protocol that knows a bit of data precedence, i.e., pushing
the burden of duplicate entry elimination on the test parser, not
the test generator. awk(1) arrays are pretty memory hungry, and
having arrays of '0x10FFFFu - (2 * (0xFFFFu + 1))' elements will
maybe not blow a Raspberry Pi Model A with its 256 MB of RAM, but
maybe the administrator-imposed datasize limit... effectively
resulting in an abort.
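
To make the idea concrete, here is a sketch in Python (not the awk(1)
script itself). The record format "XXXX value", the names emit_records
and load_records, and the rule that a later record replaces an earlier
one are all assumptions invented for this illustration. The generator
side keeps nothing in memory and passes duplicates through; only the
test-side reader resolves them.

```python
# Size of the array mentioned above, spelled out.
ARRAY_ELEMENTS = 0x10FFFF - 2 * (0xFFFF + 1)  # 983039


def emit_records(lines):
    """Yield one 'XXXX value' record per code point, in input order.

    Nothing is accumulated here; duplicate code points are emitted
    again and left for the reader to sort out.
    """
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue  # blank or comment-only line
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2:
            raise ValueError("unexpected line: %r" % line)
        first, _, last = fields[0].partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        for cp in range(lo, hi + 1):
            yield "%04X %s" % (cp, fields[1])


def load_records(records):
    """Test-side reader: a later record replaces an earlier one.

    'Last one wins' is an assumed precedence rule for this sketch.
    """
    table = {}
    for rec in records:
        cp, value = rec.split(" ", 1)
        table[int(cp, 16)] = value
    return table
```

Because emit_records is a generator, its output can be written line by
line to the test data file, so memory use on the generating side does
not grow with the number of code points.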

In fact i even had implemented a LOWMEM= hook for exactly this
case before (back in June), because GNU awk(1) required so much
memory that it happened to happen on a testbox. Just like this:

  Note that generating the test data requires a noticeable amount of
  memory, especially with GNU awk(1) (gawk(1)). If compilation fails
  during this step, try setting the LOWMEM environment variable -- this
  will generate incomplete test data, but which shouldn't worry most
  users, since the generated test data is always the same for a specific
  Unicode version and thus has been tested before shipout. Note that
  the `test' make(1) target will produce errors if LOWMEM is used (due to
  the incomplete data).

 |especially extravagant with how you store the data. Besides, after parsing,
 |you would normally build more compact data structures for the data you need.

Saving space is a particularly frightening topic, but i'm far away
from that (since there is so little functionality yet). The ICU 52.1
data library i've compiled two weeks ago is an incredible 23.5 MB,
and that after more than a decade of engineering experience, so...
the lunatic is at least ahead.
Anyway, that is the installed data, not the throw-away test data.
Having

  ?1[steffen_at_sherwood ]$ for i in test/t_*; do wc -cl $i; done
       412 4108 test/t_ext_ctrans.dat
   1026577 11491666 test/t_props.dat
      9669 141462 test/t_simple_ctrans.dat
    336880 3031339 test/t_widths.dat

of test data is better than not having it, especially if generated
from the same input data as the library data, but in a slightly
different way.

 |Having said that, if your parsing works with the files you see and the data
 |you want to extract, then go for it. Just make sure that if the format
 |changes, you have enough checks in your parser so that it fails with an
 |error rather than silently producing garbage. You should also spot-check
 |that the data you get from the comments does indeed match the real data.

Well. If it changes, it changes. I hope it doesn't. I really
hope i'll never have to use that XML data, which seems to be
tremendously dubious (for the cheap-propaganda agitator in me) and
is also terribly huge.

  While every effort has been made to ensure consistency of the
  XML representation with the UCD files, there may be some errors;
  the UCD files are authoritative.

  ?0[steffen_at_sherwood tmp]$ ll ucd*
  144096 -rw-r--r-- 1 147551826 26 Sep 08:38 ucd.all.flat.xml
    3616 -rw-r--r-- 1 3700010 26 Sep 08:38 ucd.nounihan.grouped.xml

(I was not behind GPRS when i did those downloads, for the
empathic fellow human beings on this list, thank you.)

Ok i spot and go.
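
Such a spot check could be as small as the following Python sketch. It
compares General_Category values taken from comments against Python's
unicodedata module; the name spot_check is made up for this example.
Note the assumption: unicodedata reflects the Unicode version the
Python build ships with (see unicodedata.unidata_version), which need
not be the version of the data files, and a comment value like "L&"
stands for several categories and would have to be handled separately.

```python
import unicodedata


def spot_check(samples):
    """Return (code point, claimed, actual) for every disagreement.

    samples is an iterable of (code point, General_Category) pairs
    as extracted from the comments.
    """
    mismatches = []
    for cp, claimed_gc in samples:
        actual = unicodedata.category(chr(cp))
        if actual != claimed_gc:
            mismatches.append((cp, claimed_gc, actual))
    return mismatches
```

An empty result means the sampled comment values agree with the
reference; anything else deserves a closer look before the comment
data is trusted.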

 |markus

--steffen
Received on Tue Nov 05 2013 - 15:00:55 CST
