Re: Nicest UTF

From: D. Starner (shalesller@writeme.com)
Date: Wed Dec 08 2004 - 18:10:58 CST


    "Marcin 'Qrczak' Kowalczyk" writes:
    > String equality in a programming language should not treat composed
    > and decomposed forms as equal. Not this level of abstraction.

    This implies that every programmer needs an in-depth knowledge of Unicode
    to handle simple strings. The concept makes me want to replace Unicode;
    spending the rest of my life explaining to programmers, and to the people
    who use their programs, why a search for "Römische Elegien" isn't bringing
    up the book is not my idea of happiness.
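
    To make the search problem concrete, here is a minimal Python sketch,
    assuming the title is stored with a precomposed ö and typed with a
    decomposed one. The helper name canonically_equal is hypothetical;
    unicodedata.normalize is the standard library call.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "R\u00f6mische Elegien"     # ö as one code point, U+00F6
decomposed = "Ro\u0308mische Elegien"  # o followed by U+0308 COMBINING DIAERESIS

print(composed == decomposed)                   # False: code points differ
print(canonically_equal(composed, decomposed))  # True: canonically equivalent
```

    Both strings display identically, so the user has no way to tell which
    one they typed.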

    > IMHO splitting into graphemes is the job of a rendering engine, not of
    > a function which extracts a part of a string which matches a regex.

    So S should _sometimes_ match an accented S? Again, I can feel the extended
    misery of explaining to people why things aren't working right coming on.
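
    A small sketch of the "sometimes" in question, using Python's re module,
    which matches code point by code point. The choice of Ś (S with acute)
    is only an illustration.

```python
import re
import unicodedata

composed = "\u015a"     # Ś as one code point
decomposed = "S\u0301"  # S followed by U+0301 COMBINING ACUTE ACCENT

# The two spellings are canonically equivalent.
assert unicodedata.normalize("NFD", composed) == decomposed

print(re.search("S", composed) is not None)    # False: no bare S code point
print(re.search("S", decomposed) is not None)  # True: matches the base letter
```

    Whether the pattern matches depends on how the text happened to be
    encoded, which the person searching cannot see.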

    > They are supposed to be equivalent when they are actual characters.
    > What if they are numeric character references? Should "<&#824;"
    > (7 characters) represent a valid plain-text character or be a broken
    > opening tag?

    Which 7 characters? My email "client" turned them into the actual characters.
    But I think it's fairly obvious that XML added entities in part so you
    could include '<'s and other characters without them getting interpreted as
    markup rather than as part of the text of the document. Similarly, a
    combining character entity following an actual < should be the start of a tag.
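
    For illustration, this is how one existing parser (Python's
    xml.etree.ElementTree, backed by expat) treats the two spellings. It shows
    current behaviour of that parser only, not what any specification or
    proposal requires. The element name t is arbitrary.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Escaped '<' followed by a reference to U+0338 COMBINING LONG SOLIDUS OVERLAY.
escaped = ET.fromstring("<t>&lt;&#824;</t>").text
print([f"U+{ord(c):04X}" for c in escaped])               # ['U+003C', 'U+0338']
print(unicodedata.normalize("NFC", escaped) == "\u226e")  # True: NOT LESS-THAN

# A literal '<' followed by the same reference is read as a broken tag.
try:
    ET.fromstring("<t><&#824;</t>")
    literal_ok = True
except ET.ParseError:
    literal_ok = False
print(literal_ok)  # False
```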

    > Note that if it's a valid plain-text character, it's impossible
    > to represent isolated combining code points in XML,

    No more than it's impossible to represent '<' in the text.

    > I expect breakage of XML-based protocols if implementations are
    > actually changed to conform to these rules (I bet they don't now).

    Really? In what cases are you storing isolated combining code points
    in XML as text? I can think of hypothetical cases, but most real-world
    use isn't going to be affected. If I were designing such an XML protocol,
    I'd probably store it as a decimal number anyway; XML is designed to
    be human-readable, and an isolated combining character that randomly
    combines with other characters that it's not logically associated with
    when displayed isn't particularly human readable.

    > Implementing an API which works in terms of graphemes over an API
    > which works in terms of code points is more sane than the converse,
    > which suggests that the core API should use code points if both APIs
    > are sometimes needed at all.

    Implementing an API which works in terms of lists over an API which works
    in terms of pointers is more sane than the converse, which suggests that the
    core API should use pointers if both APIs are sometimes needed at all.

    > While I'm not obsessed with efficiency, it would be nice if changing
    > the API would not slow down string processing too much.

    Who knows how much it would slow down string processing? If I get around
    to writing the test code, I'll try and see how much it slows stuff down,
    but right now we don't know.
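
    One rough shape such test code could take, as a sketch only: time plain
    code point equality against equality after NFC normalization on the same
    pair of strings. The helper nfc_equal, the sample text, and the repeat
    counts are all arbitrary assumptions, and the numbers will vary by
    machine and by implementation.

```python
import timeit
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Equality after normalizing both operands to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

a = "Ro\u0308mische Elegien " * 100  # decomposed spelling
b = "R\u00f6mische Elegien " * 100   # precomposed spelling

raw = timeit.timeit(lambda: a == b, number=10_000)
normalized = timeit.timeit(lambda: nfc_equal(a, b), number=10_000)

print(f"code point comparison: {raw:.4f} s")
print(f"NFC comparison:        {normalized:.4f} s")
```

    A naive version like this normalizes on every comparison; an
    implementation that normalized once, when the string is created, would
    presumably pay a different cost.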


