Re: outside decomposed, inside precomposed

From: Richard Cook (rscook@socrates.berkeley.edu)
Date: Wed Oct 13 2004 - 09:11:11 CST

Next message: Eric Muller: "Re: outside decomposed, inside precomposed"

Previous message: Jon Hanna: "RE: outside decomposed, inside precomposed"
In reply to: Jon Hanna: "RE: outside decomposed, inside precomposed"
Next in thread: Eric Muller: "Re: outside decomposed, inside precomposed"

Jon,

Thanks for your reply.

On Oct 13, 2004, at 3:15 AM, you wrote:

>> imported UTF-8 sequences like [U+0065][U+0303] <e, tilde> get
>> remapped internally to [U+1ebd] LATIN SMALL LETTER E WITH TILDE.
>>
>> Is this kind of behavior what one would expect?
>
> That's conformant, if it causes problems with any other process
> (including other processes that are part of the system in question)

Like, for example, a rendering process?

> then that other
> process isn't complying with conformance clause C9.
>
> At a guess I'd say it's probably normalising to NFC which is
> advantageous in a lot of ways (for example you should do this with
> data that has to conform with the web's [draft] character model).
>
> One of the clearest advantages is that it makes searching a lot more
> efficient, as only one of the potentially very many canonically
> equivalent sequences will have to be searched for

Yes.

> (though case-insensitive and/or
> diacritical-insensitive searches will still have many possible matching
> strings).

Yup.
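
For what it's worth, the equivalence we're talking about can be checked
with Python's standard unicodedata module. This is only a sketch of the
idea: the helper name find_normalized is made up for illustration, and
none of this says anything about what FileMaker actually does
internally.

```python
import unicodedata

decomposed = "e\u0303"   # U+0065 U+0303 <e, combining tilde>
precomposed = "\u1ebd"   # U+1EBD LATIN SMALL LETTER E WITH TILDE


def find_normalized(haystack, needle):
    """Search after bringing both strings to NFC (hypothetical helper).

    Returns the index of the match in the normalized haystack, or -1.
    """
    h = unicodedata.normalize("NFC", haystack)
    n = unicodedata.normalize("NFC", needle)
    return h.find(n)
```

With both sides normalized, a search for the precomposed form also
finds text that was typed in decomposed form, which is the efficiency
gain Jon describes.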

> On the other hand there are potential security risks with such
> normalisation, and perhaps therefore it is something that should be
> configurable.
>
>> It's problematic (and buglike) for at least one reason: one needs to
>> put all these precomposed things in one's font, or FileMaker doesn't
>> display them properly.
>
> That's where the problem lies, not in the normalisation.

Maybe they ought to be rendering the glyphs according to the characters
in the font, with a fallback via decomposition. If they normalize and
then simply show the empty missing-character box, this is not very
helpful.
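
Roughly what I have in mind, as a Python sketch. The function name and
the font_coverage set (the characters a font has glyphs for) are my own
inventions for illustration; a real renderer would consult the font's
cmap table and do proper mark positioning, which this ignores.

```python
import unicodedata


def render_sequence(ch, font_coverage):
    """Return the characters to draw for ch, or None if nothing fits.

    font_coverage is a hypothetical set of characters the font covers.
    """
    # Preferred case: the font has a glyph for the character itself.
    if ch in font_coverage:
        return [ch]
    # Fallback: try the canonical decomposition (base + combining marks).
    parts = unicodedata.normalize("NFD", ch)
    if parts != ch and all(p in font_coverage for p in parts):
        return list(parts)
    # Nothing usable: the caller would draw the missing-glyph box.
    return None
```

So a font holding only "e" and U+0303 could still display U+1EBD,
instead of the box.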

I built a tidy IPA transcription font, lacking many precomposed things.
Importing and exporting a data subset in FM7 reveals a total of 113
characters not displaying properly. This is annoying, to say the least.

One reason I wanted a *small* font is that in PDF generation big fonts
may not always be subsetted properly, and even a single page PDF will
end up embedding the whole font.

Also, there is extra overhead with a big font that seems to slow things
down a bit, even on a fast machine.

>> I'm assuming it will export the data in decomposed form ...
>> but haven't actually tried that yet ...
>
> I wouldn't assume anything of the sort. Normalising to NFD would be
> quite unusual.

Yes, I realize that now. And my test confirms that the internal
normalization is also what you get on export. And hence those 113 empty
boxes ...
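
One workaround outside the application would be to post-process the
export. A minimal sketch in Python, assuming the exported text has
already been read into a string; both function names and the
font_coverage set are hypothetical, not anything FM7 provides.

```python
import unicodedata


def to_decomposed(text):
    """Bring exported (NFC) text back to decomposed form (NFD)."""
    return unicodedata.normalize("NFD", text)


def missing_from_font(text, font_coverage):
    """List the distinct non-space characters the font has no glyph for.

    font_coverage is a hypothetical set of characters the font covers.
    """
    return sorted(
        {ch for ch in text if not ch.isspace() and ch not in font_coverage}
    )
```

Running missing_from_font on the raw export and again on the
to_decomposed result would show which of the empty boxes are due purely
to the precomposed forms.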

>> BTW, this application supports import of UTF-8, but will not export
>> UTF-8. That's odd, isn't it? It'll only export UTF-16 (its internal
>> storage form).
>
> Odd indeed.

Well, maybe they're saving UTF-8 export for a future release ... though
I can't imagine why.
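
In the meantime the UTF-16 export can be transcoded after the fact. A
minimal Python sketch, assuming the exported bytes start with a byte
order mark (I haven't verified that FM7 writes one); the function name
is just for illustration.

```python
def utf16_to_utf8(data):
    """Transcode UTF-16 bytes to UTF-8 bytes.

    Assumes data begins with a BOM; Python's "utf-16" codec uses it to
    pick the byte order and drops it from the decoded text.
    """
    return data.decode("utf-16").encode("utf-8")
```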

-Richard


This archive was generated by hypermail 2.1.5 : Wed Oct 13 2004 - 09:13:57 CST