Re: outside decomposed, inside precomposed

From: Richard Cook (rscook@socrates.berkeley.edu)
Date: Wed Oct 13 2004 - 09:11:11 CST


    Jon,

    Thanks for your reply.

    On Oct 13, 2004, at 3:15 AM, you wrote:

    >> imported UTF-8 sequences like [U+0065][U+0303] <e, tilde> get
    >> remapped internally to [U+1ebd] LATIN SMALL LETTER E WITH TILDE.
    >>
    >> Is this kind of behavior what one would expect?
    >
    > That's conformant; if it causes problems with any other process
    > (including other processes that are part of the system in question)

    Like, for example, a rendering process?

    > then that other
    > process isn't complying with conformance clause C9.
    >
    > At a guess I'd say it's probably normalising to NFC, which is
    > advantageous in a lot of ways (for example, you should do this with
    > data that has to conform with the web's [draft] character model).
    >
    > One of the clearest advantages is that it makes searching a lot more
    > efficient, as only one of the potentially very many canonically
    > equivalent sequences will have to be searched for

    Yes.

    > (though case-insensitive and/or
    > diacritical-insensitive searches will still have many possible matching
    > strings).

    Yup.
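
    To make the normalization point concrete, here is a minimal Python
    sketch using the standard unicodedata module. It is only an
    illustration, not anything FileMaker is known to do; the helper name
    nfc_find is made up for this example.

```python
import unicodedata

decomposed = "\u0065\u0303"   # <e, combining tilde>
precomposed = "\u1ebd"        # LATIN SMALL LETTER E WITH TILDE

# NFC composes the pair; NFD splits it again.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed


def nfc_find(haystack: str, needle: str) -> int:
    """Search after normalizing both strings to NFC (hypothetical helper).

    The returned index refers to the NFC-normalized haystack,
    or is -1 when the needle does not occur.
    """
    h = unicodedata.normalize("NFC", haystack)
    n = unicodedata.normalize("NFC", needle)
    return h.find(n)
```

    With both sides in NFC, a search for the precomposed character also
    finds text that was typed in decomposed form, and vice versa.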

    > On the other hand there are potential security risks with such
    > normalisation, and perhaps therefore it is something that should be
    > configurable.
    >
    >> It's problematic (and buglike) for at least one reason: one needs to
    >> put all these precomposed things in one's font, or FileMaker doesn't
    >> display them properly.
    >
    > That's where the problem lies, not in the normalisation.

    Maybe they ought to be rendering the glyphs according to the characters
    in the font, with a fallback via decomposition. If they normalize and
    simply throw up the missing character empty box, this is not very
    helpful.
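
    Roughly what I have in mind, as a Python sketch under stated
    assumptions: font_coverage is a hypothetical set of the code points
    the font has glyphs for, and chars_for_rendering is a made-up name.
    A real renderer would work on glyph runs, not strings.

```python
import unicodedata


def chars_for_rendering(text: str, font_coverage: set[int]) -> str:
    """Prefer precomposed characters, falling back to decomposition.

    font_coverage is an assumed set of code points present in the font.
    """
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ord(ch) in font_coverage:
            out.append(ch)          # the font has this character as-is
            continue
        parts = unicodedata.normalize("NFD", ch)
        if all(ord(c) in font_coverage for c in parts):
            out.append(parts)       # base + combining marks are available
        else:
            out.append(ch)          # nothing better; the empty box remains
    return "".join(out)
```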

    I built a tidy IPA transcription font, lacking many precomposed things.
    Importing and exporting a data subset in FM7 reveals a total of 113
    characters not displaying properly. This is annoying, to say the least.
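
    A count like this could be reproduced outside the application with
    a short Python sketch. The names missing_after_nfc and font_coverage
    are hypothetical, and the coverage set would have to come from the
    font's character map by some other means.

```python
import unicodedata


def missing_after_nfc(text: str, font_coverage: set[int]) -> list[str]:
    """List the distinct characters that NFC produces but the font lacks.

    font_coverage is an assumed set of code points present in the font.
    Whitespace is ignored, since it needs no visible glyph.
    """
    nfc = unicodedata.normalize("NFC", text)
    return sorted(
        {ch for ch in nfc if ord(ch) not in font_coverage and not ch.isspace()}
    )
```

    Running len() on the result gives the number of distinct characters
    that would show up as empty boxes.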

    One reason I wanted a *small* font is that in PDF generation big fonts
    may not always be subsetted properly, and even a single page PDF will
    end up embedding the whole font.

    Also, there is extra overhead with a big font that seems to slow things
    up a bit, even on a fast machine.

    >> I'm assuming it will export the data in decomposed form ...
    >> but haven't actually tried that yet ...
    >
    > I wouldn't assume anything of the sort. Normalising to NFD would be
    > quite unusual.

    Yes, I realize that now. And my test confirms that the internal
    normalization is also what you get on export. And hence those 113 empty
    boxes ...

    >> BTW, this application supports import of UTF-8, but will not export
    >> UTF-8. That's odd, isn't it? It'll only export UTF-16 (its internal
    >> storage form).
    >
    > Odd indeed.

    Well, maybe they're saving UTF-8 export for a future release ... though
    I can't imagine why.
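
    In the meantime, a UTF-16 export could be converted after the fact.
    A minimal Python sketch, assuming the exported data starts with a
    byte order mark (without one, Python's "utf-16" codec falls back to
    the platform's native byte order); the function name is made up:

```python
import unicodedata


def utf16_export_to_utf8_nfd(data: bytes) -> bytes:
    """Convert UTF-16 bytes to decomposed (NFD) UTF-8 bytes.

    Assumes the input begins with a BOM so the byte order is unambiguous.
    """
    text = data.decode("utf-16")                 # BOM is consumed here
    return unicodedata.normalize("NFD", text).encode("utf-8")
```

    Note that NFD decomposes everything with a canonical decomposition,
    including characters the font does have in precomposed form.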

    -Richard


