Re: UNICODE & OTHER STANDARDS

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Dec 29 2003 - 11:14:29 EST


    From: "Christopher John Fynn" <cfynn@gmx.net>
    > Anyone have a list of other standards, protocols, RFCs etc. which specify
    > Unicode (in any of its encoding formats) as the base, default or preferred
    > character set to be used?

    For RFCs it's not difficult to get this list using the RFCeditor.org
    built-in search engine.

    However, a more interesting list would be of standards that were built
    on non-Unicode, non-ISO/IEC 10646 charsets registered with IANA and that
    have since been mapped onto Unicode, where these standards may perform
    some string processing that does not conform to Unicode processing rules.

    For example, these other standards may specify canonical equivalences
    which do not exist in Unicode:

    - I am thinking of some ETSI standards for Teletext, which may
    contain more combining marks than those currently encoded in Unicode,
    and may create some canonical or compatibility equivalences.

    - Or of Asian string processing algorithms, notably for Hangul, Han
    and Hiragana/Katakana.
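
To make the distinction concrete, here is a minimal Python sketch of what Unicode itself treats as equivalent, using only the standard `unicodedata` module. The helper names `canonically_equivalent` and `compatibility_equivalent` are made up for this illustration; the sample characters are just convenient Hangul and kana examples, not taken from any of the standards mentioned above.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    # Two strings are canonically equivalent if their NFD forms match.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def compatibility_equivalent(a: str, b: str) -> bool:
    # Two strings are compatibility equivalent if their NFKD forms match.
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

# Hangul: conjoining jamo sequence vs. precomposed syllable U+D55C.
jamo = "\u1112\u1161\u11ab"
syllable = "\ud55c"

# Kana: halfwidth katakana KA, fullwidth katakana KA, hiragana KA.
halfwidth_ka = "\uff76"
katakana_ka = "\u30ab"
hiragana_ka = "\u304b"

print(canonically_equivalent(jamo, syllable))               # canonical
print(canonically_equivalent(halfwidth_ka, katakana_ka))    # not canonical
print(compatibility_equivalent(halfwidth_ka, katakana_ka))  # compatibility only
print(compatibility_equivalent(hiragana_ka, katakana_ka))   # distinct in Unicode
```

The last case is the kind of pair that stays distinct under every Unicode normalization form, yet that another standard or a string processing algorithm might well treat as the same.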

    These standards could be supported by documenting the additional
    equivalences as Unicode folding rules. So far Unicode and ISO/IEC
    have focused on preserving the distinctions in supported character sets,
    but I think that there's some work to do with grapheme clusters that are
    now distinct in Unicode but equivalent or compatibility equivalent in
    other standards.
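
As a rough sketch of what such a documented folding might look like in practice, the Python below layers one extra equivalence on top of NFKC: katakana folded to hiragana. This is purely an assumption for illustration. The name `fold_kana` and the table `KATAKANA_TO_HIRAGANA` are hypothetical, and the rule is not taken from any particular standard; it only relies on the fact that the katakana letters U+30A1..U+30F6 sit exactly 0x60 above their hiragana counterparts.

```python
import unicodedata

# Hypothetical folding table: katakana U+30A1..U+30F6 -> hiragana U+3041..U+3096.
KATAKANA_TO_HIRAGANA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}

def fold_kana(text: str) -> str:
    # First apply Unicode's own compatibility normalization (this maps
    # halfwidth katakana to fullwidth), then the extra, non-Unicode folding.
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.translate(KATAKANA_TO_HIRAGANA)

# Three spellings of "kana": halfwidth katakana, katakana, hiragana.
samples = ["\uff76\uff85", "\u30ab\u30ca", "\u304b\u306a"]
print({fold_kana(s) for s in samples})  # a single folded form
```

A folding like this is lossy by design, so it would suit comparison or searching rather than storage; the original text should be kept unchanged.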

    Documenting the folding algorithms that may be used with Unicode is
    probably a huge task, as complex as the unification of repertoires in
    the ISO/IEC 10646 code point assignments or the definition of Unicode
    canonical equivalences. Knowing them would certainly help in handling
    texts safely with Unicode when they were initially encoded with legacy
    charsets.


