Re: Hebrew script in IDN

From: Mark Davis
Date: Mon Nov 21 2005 - 10:57:55 CST

    For stability of normalization, there is an absolute ban on normalizing
    any existing sequence to a new precomposed character (NFC).

    Thus, since any new precomposed character would be normalized away,
    there is little point in introducing one, and the committee has a
    policy of not encoding them.
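    A minimal sketch of what "normalized away" means in practice, using
    Python's standard unicodedata module. The choice of U+2ADC FORKING as
    the example is an assumption made for illustration (it is not mentioned
    in this thread); it was encoded after the composition set was frozen,
    so NFC never produces it.

```python
import unicodedata

# Assumption for illustration: U+2ADC FORKING stands in for "a new
# precomposed character" whose decomposition already existed.
precomposed = "\u2ADC"
sequence = "\u2ADD\u0338"  # NONFORKING + COMBINING LONG SOLIDUS OVERLAY

# NFD decomposes the precomposed character into the existing sequence.
nfd_of_precomposed = unicodedata.normalize("NFD", precomposed)

# NFC does not keep the precomposed form either: it is excluded from
# composition, so it comes out as the sequence.
nfc_of_precomposed = unicodedata.normalize("NFC", precomposed)

# The existing sequence is left untouched by NFC, which is the
# stability guarantee described above.
nfc_of_sequence = unicodedata.normalize("NFC", sequence)

print(nfd_of_precomposed == sequence)
print(nfc_of_precomposed == sequence)
print(nfc_of_sequence == sequence)
```

    Text that was already in NFC stays in NFC, and the precomposed form
    survives only in unnormalized text.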

    Note: there is one possible exception. If a precomposed character and at
    least one character of its decomposition were both encoded in a new
    version of Unicode, it would be possible to normalize to the precomposed
    character in that new version. That would be a case like:

    X ~ Y + Z

    where X and either Y or Z are new. I don't think we've ever done that,
    since introducing NFC. It's unlikely that that situation would come up
    with an existing script, but might possibly come up with a new script.

    I don't recall exactly why the Yiddish characters are treated in that
    fashion; it was some years ago. Perhaps Ken or someone else recalls.

    See also the Unicode FAQs on characters and combining marks, and on
    normalization.


    Cary Karp wrote:

    > Quoting Mark E. Shoulson:
    >> I'd venture to say that double-vav, vav-yod, and yod-yod ligatures
    >> should have *canonical* decomposition to their constituent letters!
    >> I'm sure that would cause problems of some sort, but at least
    >> compatibility decomposition is necessary.
    >> Doesn't really matter which is the more frequently entered; we
    >> normalize strings all the time in Unicode.
    > Why are they not being normalized here?
    > I assume that at least part of the answer lies in the fourth Yiddish
    > digraph 'pasekh tsvey yudn', HEBREW LIGATURE YIDDISH DOUBLE YOD WITH
    > HEBREW POINT PATAH (U+05F2 U+05B7). Which (I further assume) would
    > decompose and recompose correctly only if the YIDDISH DOUBLE YOD
    > ligature were the canonical form. What I don't understand, is why the
    > entire pointed digraph wasn't represented as a single precombined
    > character, with it then being possible to decompose the other three
    > ligatures as Mark suggests.
    > With apologies for not having been able to locate the answers to the
    > following questions and thus needing to pose them on this list:
    > Is there a categorical ban on the assignment of code points to new
    > characters that can be represented by combining preexisting characters
    > and, if so, where will I find a citable reference to it?
    > /Cary
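    For readers who want to check how the characters discussed above behave,
    here is a small sketch using Python's standard unicodedata module. It
    only inspects the decomposition data shipped with the interpreter. The
    use of U+FB1F (the precomposed pointed digraph) is an addition for
    illustration and is not named in the messages above.

```python
import unicodedata

double_yod = "\u05F2"         # HEBREW LIGATURE YIDDISH DOUBLE YOD
patah = "\u05B7"              # HEBREW POINT PATAH
pointed_digraph = "\uFB1F"    # precomposed form (illustrative addition)

# The double-yod ligature carries no decomposition mapping at all,
# canonical or compatibility, so no normalization form splits it.
ligature_mapping = unicodedata.decomposition(double_yod)
nfkd_of_ligature = unicodedata.normalize("NFKD", double_yod)

# The precomposed pointed digraph decomposes canonically to
# ligature + patah.
digraph_mapping = unicodedata.decomposition(pointed_digraph)

# NFC does not recompose the pair; the two-character sequence is the
# normalized form.
nfc_of_digraph = unicodedata.normalize("NFC", pointed_digraph)
nfc_of_pair = unicodedata.normalize("NFC", double_yod + patah)

print(repr(ligature_mapping))
print(digraph_mapping)
print(nfc_of_digraph == double_yod + patah)
print(nfc_of_pair == double_yod + patah)
```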
