From: Richard Cook (rscook@socrates.berkeley.edu)
Date: Wed Oct 13 2004 - 09:11:11 CST
Jon,
Thanks for your reply.
On Oct 13, 2004, at 3:15 AM, you wrote:
>> imported UTF-8 sequences like [U+0065][U+0303] <e, tilde> get
>> remapped internally to [U+1EBD] LATIN SMALL LETTER E WITH TILDE.
>>
>> Is this kind of behavior what one would expect?
>
> That's conformant; if it causes problems with any other process (including
> other processes that are part of the system in question)
Like, for example, a rendering process?
> then that other
> process isn't complying with conformance clause C9.
>
> At a guess I'd say it's probably normalising to NFC, which is advantageous
> in a lot of ways (for example, you should do this with data that has to
> conform with the web's [draft] character model).
>
> One of the clearest advantages is that it makes searching a lot more
> efficient, as only one of the potentially very many canonically equivalent
> sequences will have to be searched for
Yes.
> (though case-insensitive and/or
> diacritical-insensitive searches will still have many possible matching
> strings).
Yup.
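
For illustration, a minimal Python sketch of the two points above, using only
the standard unicodedata module: NFC folds <e, combining tilde> into U+1EBD,
and normalizing both the text and the search key lets a single search match
either spelling. The helper names are made up for the example, and nothing
here is a claim about how FileMaker does it internally.

```python
import unicodedata

DECOMPOSED = "e\u0303"   # U+0065 U+0303: e followed by COMBINING TILDE
PRECOMPOSED = "\u1ebd"   # U+1EBD LATIN SMALL LETTER E WITH TILDE


def nfc(text):
    """Return text in Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


def normalized_find(haystack, needle):
    """Search after normalizing both sides, so that canonically
    equivalent spellings of the needle all match.
    The returned index refers to the NFC form of haystack."""
    return nfc(haystack).find(nfc(needle))


# A plain search misses the equivalent spelling ...
plain_hit = ("p" + DECOMPOSED + "n").find(PRECOMPOSED)
# ... while the normalized search finds it.
normalized_hit = normalized_find("p" + DECOMPOSED + "n", PRECOMPOSED)
```

Case-insensitive or diacritic-insensitive matching would still need more
than this (case folding, stripping combining marks), as noted above.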
> On the other hand there are potential security risks with such
> normalisation, and perhaps therefore it is something that should be
> configurable.
>
>> It's problematic (and buglike) for at least one reason: one needs to
>> put all these precomposed things in one's font, or FileMaker doesn't
>> display them properly.
>
> That's where the problem lies, not in the normalisation.
Maybe they ought to be rendering the glyphs according to the characters
in the font, with a fallback via decomposition. If they normalize and
simply show the empty missing-character box, this is not very
helpful.
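
Roughly what I have in mind, as a Python sketch. It assumes the font's
coverage is available as a plain set of characters (a hypothetical stand-in
for the font's cmap), and the function name is invented for the example. A
real renderer would also have to position the combining marks, which this
ignores.

```python
import unicodedata

MISSING = "\ufffd"  # stand-in for the empty "missing character" box


def render_with_fallback(text, coverage):
    """Choose the characters to hand to the font.

    coverage: set of characters the font has glyphs for (assumed input).
    A character the font has is used as is; otherwise its canonical
    decomposition (NFD) is tried; only if that also fails is the
    missing-character stand-in emitted.
    """
    out = []
    for ch in text:
        if ch in coverage:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        if all(part in coverage for part in decomposed):
            out.append(decomposed)
        else:
            out.append(MISSING)
    return "".join(out)


# A font with e and the combining tilde, but no precomposed U+1EBD:
small_font = set("e\u0303x")
shown = render_with_fallback("\u1ebdx", small_font)
```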
I built a tidy IPA transcription font, lacking many precomposed things.
Importing and exporting a data subset in FM7 reveals a total of 113
characters not displaying properly. This is annoying, to say the least.
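
A count like that can be reproduced outside the application with a few lines
of Python. This is only a sketch: the coverage set is again a hypothetical
stand-in for whatever the font actually maps, and the function name is
invented for the example.

```python
import unicodedata


def missing_after_nfc(text, coverage):
    """List the distinct characters that appear once text is normalized
    to NFC but that the font (coverage, a set of characters) lacks."""
    normalized = unicodedata.normalize("NFC", text)
    return sorted({ch for ch in normalized if ch not in coverage})


# The data uses e + combining tilde, and the font covers exactly those
# pieces; after NFC the precomposed U+1EBD is what goes missing.
font = set("e\u0303 a")
gaps = missing_after_nfc("e\u0303 a", font)
```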
One reason I wanted a *small* font is that in PDF generation big fonts
may not always be subsetted properly, and even a single page PDF will
end up embedding the whole font.
Also, there is extra overhead with a big font that seems to slow things
up a bit, even on a fast machine.
>> I'm assuming it will export the data in decomposed form ...
>> but haven't actually tried that yet ...
>
> I wouldn't assume anything of the sort. Normalising to NFD would be quite
> unusual.
Yes, I realize that now. And my test confirms that the internal
normalization is also what you get on export. And hence those 113 empty
boxes ...
>> BTW, this application supports import of UTF-8, but will not export
>> UTF-8. That's odd, isn't it? It'll only export UTF-16 (its internal
>> storage form).
>
> Odd indeed.
Well, maybe they're saving UTF-8 export for a future release ... though
I can't imagine why.
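
In the meantime the export can be post-processed. A minimal Python sketch,
assuming the exported bytes are UTF-16 with a byte order mark (I have not
checked what FM7 actually writes); the function name is made up. It
transcodes to UTF-8 and, optionally, re-decomposes to NFD so that a font
without the precomposed characters can still be used.

```python
import unicodedata


def utf16_export_to_utf8(data, form="NFD"):
    """Convert UTF-16 bytes to UTF-8 bytes.

    data: bytes of the exported file (assumed UTF-16 with a BOM).
    form: normalization form to apply, or None to leave the text as is.
    """
    text = data.decode("utf-16")  # the BOM selects the byte order
    if form is not None:
        text = unicodedata.normalize(form, text)
    return text.encode("utf-8")


# U+1EBD as it might come out of a UTF-16 export:
exported = "\u1ebd".encode("utf-16")
converted = utf16_export_to_utf8(exported)
```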
-Richard