Re: RTL PUA? from Philippe Verdy on 2011-08-22 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 22 Aug 2011 22:02:02 +0200

2011/8/22 William_J_G Overington <wjgo_10009_at_btinternet.com>:
> Having selected a platform, one may view the text content of various fields for that platform, such as font family name and copyright notice, version string and postscript name. There is then a button that is labelled Advanced... that, if clicked, opens another dialogue panel with various other text fields, including Font Designer and Description, which are the two that I often use.
>
> Now, when the text values in the fields are stored in the font file, the values for the Macintosh Roman platform are stored in plain text and the values for the Microsoft Unicode BMP only platform are stored in some encoded format.

Note "some" encoded format. The strings are encoded using the encoding
specified in the platform selectors. The strings for the Macintosh
Roman platform will be encoded using MacRoman. The strings for the MS
Unicode BMP platform will be encoded with the BMP part of UTF-16
(without support for surrogates). The strings for the Unicode platform
will also use UTF-16.
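
To make the difference concrete, here is a minimal Python sketch that
encodes the same name string the way the two platforms discussed above
would store it. The sample string is hypothetical, and the big-endian
byte order is an assumption based on how TrueType stores 16-bit values.

```python
# Hypothetical sample value for a font name field.
name = "Example Font"

# Macintosh Roman platform: one byte per character. For ASCII-only
# text the bytes are identical to plain ASCII.
mac_bytes = name.encode("mac_roman")

# Microsoft Unicode BMP platform: two bytes per character. Big-endian
# order is assumed here, so each ASCII character is preceded by a
# null byte.
win_bytes = name.encode("utf-16-be")

print(mac_bytes)
print(win_bytes)
```

For ASCII-only text the second form is twice as long as the first, and
every other byte in it is a null.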

> So, if one opens a TrueType font file in WordPad and one searches for an item of plain text that is in one of the fields of the font, then the text that is in the Macintosh platform can be found:

It just happens that you are opening the TrueType font as if it were
plain text encoded in Windows-1252, or in some other 8-bit encoding
based on ASCII. You are also searching for ASCII characters, which are
encoded identically in Windows-1252 and in MacRoman, so you find a
match.

> yet the text that is in the Microsoft Unicode BMP only platform cannot be found.

Because you would have to insert null bytes in your search string to
find an exact match in a UTF-16 encoded string. Without these nulls,
you'll get no match. What you are doing is searching a text that was
loaded under the wrong encoding assumption. TrueType fonts are binary
containers that can mix several encodings for their plain-text
elements, and that also embed a lot of non-text data. This happens
even if your text editor is capable of loading Unicode-encoded texts:
loading the font as UTF-16 fails, because the TTF container as a whole
cannot meet the conformance requirements for correctly encoded UTF-16
text; only fragments of it can. By contrast, there's no conformance
problem if you try to read the file as if it were Windows-1252 or
ISO-8859-1...
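
The same effect can be reproduced on raw bytes. The following Python
sketch uses a small made-up byte string as a stand-in for a font file
(it is not a real TrueType structure) and assumes the embedded name is
stored as big-endian UTF-16.

```python
# Hypothetical stand-in for a font file: arbitrary binary data around
# one name string stored as big-endian UTF-16.
blob = b"\x00\x01\x00\x00\xd8\x00\xff" + "My Font".encode("utf-16-be") + b"\x9c\x00"

# Searching for the plain 8-bit form finds nothing (result is -1) ...
plain_hit = blob.find(b"My Font")

# ... while the form with a null byte before each character matches.
utf16_hit = blob.find("My Font".encode("utf-16-be"))

# Every byte value is valid ISO-8859-1, so this decode cannot fail.
as_latin1 = blob.decode("latin-1")

# Decoding the whole container as UTF-16 is rejected, because the
# surrounding binary data is not well-formed UTF-16.
try:
    blob.decode("utf-16-be")
    whole_file_is_utf16 = True
except UnicodeDecodeError:
    whole_file_is_utf16 = False

print(plain_hit, utf16_hit, whole_file_is_utf16)
```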
Received on Mon Aug 22 2011 - 15:05:12 CDT
