Re: preparing a PUA specification (for historical Polish text)

From: Doug Ewell
Date: Sun Apr 11 2010 - 11:19:21 CDT


    "Janusz S. "Bień"" <jsbien at mimuw dot edu dot pl> wrote:

    > The next stage will be to assign PUA code point to them, primarily for
    > the purpose to encode the texts systematically for inclusion in the
    > search engine
    > At this stage my question is purely technical: what is the best form
    > to prepare and maintain such a specification?

    By definition, a PUA specification will not be reviewed or approved by
    the Unicode or 10646 technical committees. It is a private
    specification. You can encode text in it, teach your search engine to
    recognize it, and distribute it to other interested parties ("private"
    does not mean "secret"), but if you want any of these characters to be
    formally encoded in Unicode/10646, you should follow Tex's link and
    prepare a "real" Unicode/10646 proposal form. Characters do not enjoy
    any special advantage in consideration for formal encoding merely by
    having been listed in a PUA spec.

    If these characters are only used in books *about* proposed new
    orthographies, not books written *in* the orthographies, then a PUA
    solution seems especially appropriate.

    If you do want PUA assignments, it might be most appropriate to propose
    these for inclusion in MUFI. They are medieval Latin glyphs, not a
    completely different invented script, which would be appropriate for the
    ConScript registry. However, you may wish to use the ConScript model in
    writing your proposal, since it provides structure to describe many of
    the important encoding and display issues.

    Presenting the General Property values of these characters in a format
    similar to UnicodeData.txt is probably a good idea (although you can
    augment this with prose as well). You can describe the combining
    sequences using the NamedSequences format, which is a better choice than
    encoding them as precomposed characters.
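
    As a rough sketch of what such a specification could look like in
    machine-readable form, the Python below parses a UnicodeData.txt-style
    listing (15 semicolon-separated fields per line) and a
    NamedSequences-style listing (name, then space-separated code points).
    The two PUA entries, the sequence, and all of their names and property
    values are hypothetical placeholders, not characters from the actual
    Polish material; the helper function names are likewise my own.

```python
# Field order used by UnicodeData.txt (15 semicolon-separated fields).
FIELDS = [
    "code", "name", "general_category", "combining_class", "bidi_class",
    "decomposition", "decimal", "digit", "numeric", "mirrored",
    "unicode_1_name", "iso_comment", "upper", "lower", "title",
]

# Hypothetical PUA assignments, written the way UnicodeData.txt writes them.
PUA_DATA = """\
E000;HYPOTHETICAL LATIN CAPITAL LETTER EXAMPLE;Lu;0;L;;;;;N;;;;E001;
E001;HYPOTHETICAL LATIN SMALL LETTER EXAMPLE;Ll;0;L;;;;;N;;;E000;;E000
"""

# Hypothetical combining sequence in NamedSequences.txt style:
# a base letter plus standard combining marks, no precomposed PUA character.
NAMED_SEQUENCES = """\
HYPOTHETICAL LATIN SMALL LETTER A WITH ACUTE AND DOT ABOVE;0061 0301 0307
"""


def is_private_use(cp):
    """True if cp lies in the BMP PUA or in one of the two PUA planes."""
    return (
        0xE000 <= cp <= 0xF8FF
        or 0xF0000 <= cp <= 0xFFFFD
        or 0x100000 <= cp <= 0x10FFFD
    )


def parse_unicode_data(text):
    """Map each code point to a dict of its UnicodeData.txt-style fields."""
    records = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        values = line.split(";")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields: {line!r}")
        record = dict(zip(FIELDS, values))
        records[int(record["code"], 16)] = record
    return records


def parse_named_sequences(text):
    """Map each sequence name to the string it denotes."""
    sequences = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, code_points = line.partition(";")
        sequences[name.strip()] = "".join(
            chr(int(cp, 16)) for cp in code_points.split()
        )
    return sequences


records = parse_unicode_data(PUA_DATA)
sequences = parse_named_sequences(NAMED_SEQUENCES)
```

    Keeping the data in these two formats means existing tools that
    already read UnicodeData.txt and NamedSequences.txt need little or no
    change to read the private specification, and a check such as
    is_private_use() can confirm that no assignment strays outside the
    PUA ranges.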

    The XeTeX approximations don't seem to shed much light on any issues
    involving these letters.

    Doug Ewell  |  Thornton, Colorado, USA  |
    RFC 5645, 4645, UTN #14  |  ietf-languages @
