Re: Unicode normalization in CSS

From: Mark Davis ☕ (mark@macchiato.com)
Date: Fri Apr 08 2011 - 17:34:04 CDT

    > without a non-lossy normalization scheme, which Unicode currently lacks

    Depending on what is meant, this is either trivially true, trivially false,
    or materially false. Any sense of "non-lossy" can only be measured against
    what is expected to be maintained. A normalization scheme that preserves all
    code points and their order is completely lossless under any measure; it is
    the identity operation. And Unicode has that ;-).

    Any other normalization scheme loses some information; that is the purpose,
    after all, of normalization. The question is what that information is. NFC,
    for example, maintains all Unicode canonical equivalences, since that is
    what it is measured against. That is, two strings are canonically equivalent
    iff their NFC forms are the same.
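
    As a rough illustration of that last statement, here is a minimal Python
    sketch using the standard unicodedata module. The helper name
    canonically_equivalent is made up for this example; it simply compares
    NFC forms.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: True when a and b have the same NFC form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE as one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The code point sequences differ, but the NFC forms agree.
print(precomposed == decomposed)
print(canonically_equivalent(precomposed, decomposed))

# The "fi" ligature U+FB01 is only a compatibility equivalent of "fi",
# so NFC keeps the two distinct.
print(canonically_equivalent("\uFB01", "fi"))
```

    The distinction between the two code point sequences is exactly the
    information NFC discards; a compatibility-only distinction such as the
    ligature is left alone.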

    > (NFC having been hijacked by the anti-compatibility-character crusades)

    That's a myth; there was no hijacking going on. What did happen was a
    mistake early on: the CJK compatibility characters were categorized as
    canonical equivalents. That was recognized later on, but by then stability
    considerations prevented it from being fixed. Excluding those, there are
    relatively few characters (currently) that are not allowed in NFC.
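
    To make this concrete, here is a small Python sketch (again using the
    standard unicodedata module). The helper survives_nfc is hypothetical,
    and the two sample characters are simply ones chosen for illustration:
    U+F900, a CJK compatibility ideograph, and U+0958, a Devanagari letter
    that is excluded from composition.

```python
import unicodedata


def survives_nfc(ch: str) -> bool:
    """Hypothetical helper: True if NFC leaves this single character unchanged."""
    return unicodedata.normalize("NFC", ch) == ch


compat_ideograph = "\uF900"   # CJK COMPATIBILITY IDEOGRAPH-F900
unified_ideograph = "\u8C48"  # the unified ideograph it canonically maps to

# NFC replaces the compatibility ideograph with the unified one.
print(unicodedata.normalize("NFC", compat_ideograph) == unified_ideograph)

# Outside CJK, composition exclusions have the same effect:
# U+0958 DEVANAGARI LETTER QA comes out of NFC as U+0915 U+093C.
print(survives_nfc("\u0958"))

# The unified ideograph itself passes through NFC untouched.
print(survives_nfc(unified_ideograph))
```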

    However, the CJK compatibility characters were themselves a rather broken
    approach, and a much better one has developed in the meantime: the
    Ideographic Variation Database, or IVD (http://www.unicode.org/ivd/). And
    those variation sequences are maintained by NFC.
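
    A quick way to see this is to run a variation sequence through NFC, as in
    the Python sketch below. The pair U+845B U+E0100 is picked only as an
    example of a base ideograph followed by a variation selector; the check
    does not depend on whether that particular sequence is registered.

```python
import unicodedata

base = "\u845B"           # a CJK unified ideograph
selector = "\U000E0100"   # VARIATION SELECTOR-17
sequence = base + selector

normalized = unicodedata.normalize("NFC", sequence)

# The variation selector has no decomposition and combining class 0,
# so NFC neither removes nor reorders it.
print(unicodedata.combining(selector))
print(normalized == sequence)
print(len(normalized))
```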

    Mark

    *— Il meglio è l’inimico del bene —*

    On Thu, Apr 7, 2011 at 17:11, fantasai <fantasai.lists@inkedblade.net> wrote:

    > There was a very very very long thread on Unicode normalization in CSS
    > back in January/February of 2009. IIRC the conclusion was that the
    > problem is much bigger than CSS, and I18n had some work yet to do to
    > figure it all out.
    >
    > Is that a correct recollection?
    >
    > Daniel Glazman has been collecting outstanding issues filed against
    > CSS Namespaces since we now have the implementations to move to PR,
    > and this was one of them. But I couldn't find any conclusions to the
    > discussion.
    >
    > I think realistically we have two options here:
    >   1. Nothing is normalized in CSS.
    >   2. CSS-internal user-defined identifiers are normalized to NFC, i.e.
    >        - counter names
    >        - namespace prefixes
    >        - etc.
    >      We already make a distinction between user-defined and CSS-defined
    >      names in that user-defined names are case-sensitive.
    >        http://www.w3.org/blog/CSS/2007/12/12/case_sensitivity
    >
    > Within #2 we could
    >   - Normalize at "parse" time, i.e. before exposing such identifiers
    >     to the CSSOM.
    >       - In this case we need to decide whether unquoted font names are
    >         also affected. Probably yes.
    >   - Normalize at "match" time, i.e. store and expose the identifiers
    >     unnormalized, but define that they represent the same thing.
    >
    > The third option would be to normalize the whole CSS file, but from
    > the discussions about interactions with XML, HTML, the DOM, etc. this
    > did not seem feasible, at least not without a non-lossy normalization
    > scheme, which Unicode currently lacks (NFC having been hijacked by the
    > anti-compatibility-character crusades).
    >
    > So I guess the question is, what's the right way forward here?
    >
    > ~fantasai
    >
    >
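
    For what it's worth, the difference between the "parse" time and "match"
    time variants quoted above could be sketched roughly as follows. This is
    Python rather than anything a CSS implementation would use, and the class
    and method names (ParseTimeTable, MatchTimeTable, define, lookup, names)
    are invented for illustration; they do not correspond to any CSSOM API.

```python
import unicodedata


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


class ParseTimeTable:
    """Normalizes identifiers when they are stored; callers only ever see NFC."""

    def __init__(self):
        self._values = {}

    def define(self, name, value):
        self._values[nfc(name)] = value

    def lookup(self, name):
        return self._values.get(nfc(name))

    def names(self):
        return list(self._values)


class MatchTimeTable:
    """Stores identifiers as written; normalizes only when comparing."""

    def __init__(self):
        self._entries = []  # (name as written, value)

    def define(self, name, value):
        self._entries.append((name, value))

    def lookup(self, name):
        key = nfc(name)
        for stored, value in self._entries:
            if nfc(stored) == key:
                return value
        return None

    def names(self):
        return [stored for stored, _ in self._entries]


written = "cafe\u0301"    # identifier as authored, in decomposed form
referenced = "caf\u00E9"  # the same identifier referenced in precomposed form

parse_time = ParseTimeTable()
match_time = MatchTimeTable()
parse_time.define(written, 1)
match_time.define(written, 1)

# Both variants match the reference...
print(parse_time.lookup(referenced), match_time.lookup(referenced))
# ...but only the match-time variant still exposes the authored spelling.
print(parse_time.names() == [referenced], match_time.names() == [written])
```

    Either way the two spellings refer to the same thing; the variants differ
    only in which spelling is exposed afterwards.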



    This archive was generated by hypermail 2.1.5 : Fri Apr 08 2011 - 17:39:10 CDT