Re: Parsers for the UnicodeSet notation?

From: Eric Muller <>
Date: Wed, 23 Jul 2014 23:51:15 -0700

Thanks for the answers.

I take it from Steve's answer that Roozbeh's parser may work today but
may break tomorrow.

A couple of suggestions:

- a full "parser" of UnicodeSet is non-trivial, since it involves having
access to property values. That does not seem really necessary for
exemplars, so maybe it would be good to restrict the UnicodeSet there.

- alternatively, since the extent of a UnicodeSet can involve property
values, the extent can depend on the Unicode version from which those
values come. That means there ought to be a Unicode version number in
the CLDR data; it would be nice for that number to be present in the
data files (I don't see one in he.xml).
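
To make the first suggestion concrete, here is a rough sketch (Python) of
what a parser for a restricted, property-free subset could look like. The
function name parse_simple_set and the exact subset it accepts (literals,
a-b ranges, {strings}, backslash escapes, \uXXXX) are assumptions made for
illustration; this is not the ICU implementation and not a proposal for the
actual restriction.

```python
from __future__ import annotations


def parse_simple_set(pattern: str) -> set[str]:
    """Parse one [...] set without properties, nesting, or set operations."""
    if len(pattern) < 2 or pattern[0] != "[" or pattern[-1] != "]":
        raise ValueError("expected a single [...] set")
    body = pattern[1:-1]
    result: set[str] = set()
    prev: str | None = None   # last single character seen; possible range start
    in_range = False          # True right after an unescaped '-'
    i = 0
    while i < len(body):
        ch = body[i]
        i += 1
        if ch.isspace():
            continue          # unescaped whitespace is not significant
        if ch == "{":
            # Multi-character string, taken literally (no escapes inside).
            end = body.index("}", i)   # ValueError if the brace is unterminated
            result.add(body[i:end].replace(" ", ""))
            i = end + 1
            prev = None
            continue
        if ch == "-" and prev is not None and not in_range:
            in_range = True
            continue
        if ch == "\\":
            if body.startswith("u", i):
                ch = chr(int(body[i + 1:i + 5], 16))   # \uXXXX
                i += 5
            elif i < len(body) and body[i] not in "pPN":
                ch = body[i]                           # escaped literal
                i += 1
            else:
                raise ValueError("unsupported or incomplete escape")
        elif ch in "[]^&$":
            raise ValueError(f"unsupported syntax: {ch!r}")
        if in_range:
            if ord(ch) < ord(prev):
                raise ValueError("range end precedes range start")
            result.update(chr(c) for c in range(ord(prev), ord(ch) + 1))
            in_range, prev = False, None
        else:
            result.add(ch)
            prev = ch
    if in_range:
        raise ValueError("dangling '-' at end of set")
    return result


print(sorted(parse_simple_set(r"[a-c {ch} \- \u05BE]")))
```

Because nothing here consults property data, the result does not depend on
a Unicode version; anything like \p{...} is rejected rather than guessed at.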

> Incidentally, I copy/pasted the punctuation exemplar characters for
> he.xml into the utility, and it reported that the set contains 8,130
> code points, including the ascii letters. Somehow, that seems
> incorrect. What did I do wrong?

Sorry, I took the UnicodeSet straight out of he/characters.json, without
handling the json serialization (or rather deserialization) of strings.

Taking it straight out of he.xml (where there is no serialization
effect) gives a much more reasonable set of twenty strings. XML wins
again ;-)
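
For anyone hitting the same thing, a small sketch (Python) of the
difference between copying the text out of the JSON file and deserializing
it first. The pattern below is a made-up fragment in the style of an
exemplar entry, not the actual contents of he/characters.json, and the
"punctuation" key is used only for illustration.

```python
import json

# Hypothetical JSON text; the backslashes are doubled because JSON
# itself escapes them.
raw_json_text = r'{"punctuation": "[\\- , ; \\: ! ? . ( ) \\[ \\]]"}'

# Copying what sits between the quotes keeps the JSON-level escaping.
copied_verbatim = raw_json_text.split('"')[3]

# Deserializing first yields the UnicodeSet pattern the data encodes.
decoded = json.loads(raw_json_text)["punctuation"]

print(copied_verbatim)
print(decoded)
```

In the copied text each UnicodeSet escape has turned into an escaped
backslash followed by a bare character, so a sequence meant as an escaped
hyphen can be read as a literal backslash followed by a range operator.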


Unicode mailing list
Received on Thu Jul 24 2014 - 01:52:14 CDT
