RE: &#61623 ?

From: Chris Pratley (chrispr@MICROSOFT.com)
Date: Mon Apr 03 2000 - 18:37:47 EDT


My guess is that Wordmail is *not* the culprit, since we are careful to
handle that case (thanks for the support Murray!). If someone can show that
Word generated a PUA symbol rather than a proper bullet in HTML I'd be glad
to take the bug report, but Word2000 at least doesn't spit out symbols for
bullets in HTML unless you tell us to by picking a custom bullet symbol.
Given that the sources below were hotmail and yahoo, I'd suppose that it is
some HTML control that is generating it.

It is much more likely that some innocent piece of software received a
symbol in that range, whether pasted in or automatically generated, and then
naively converted it to an NCR with no special handling of the PUA. Actually,
this is desired behaviour if the text has no semantic meaning (i.e. it truly
is a symbol) or the text was not generated by the software in some default
setting that has an analog in Unicode or HTML constructs like <li>. For
example that character might be a user-defined Chinese character being sent
to a server that is using the same definition, and this would be expected to
work as with any value in the PUA. The error lies in using an automatically
generated symbol value that merely happens to be a bullet in some symbol
fonts where an actual bullet should be used, and then outputting it in HTML.
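
To make the distinction concrete, here is a minimal sketch of an NCR
converter with and without special handling of the PUA. It is not Word's
code or anyone else's; the function name to_ncr, the remap_symbols flag, and
the one-entry mapping table are all assumptions made up for illustration.

```python
import html

# Illustrative assumption: a one-entry table mapping a Symbol-font PUA code
# to its standard Unicode counterpart. A real table would cover more glyphs.
SYMBOL_PUA_TO_UNICODE = {0xF0B7: 0x2022}  # Symbol bullet -> U+2022 BULLET


def to_ncr(text, remap_symbols):
    """Encode non-ASCII characters as decimal NCRs.

    remap_symbols=True models text the software generated itself (for
    example a default list bullet); False models user-chosen PUA
    characters, which are passed through unchanged.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if remap_symbols:
            cp = SYMBOL_PUA_TO_UNICODE.get(cp, cp)
        if cp < 0x80:
            # ASCII stays literal, apart from HTML-special characters.
            out.append(html.escape(chr(cp)))
        else:
            out.append("&#%d;" % cp)
    return "".join(out)


print(to_ncr("\uf0b7 item", remap_symbols=False))  # naive pass-through
print(to_ncr("\uf0b7 item", remap_symbols=True))   # bullet remapped
```

The flag is the whole point: passing a PUA value through is right when the
user chose it, and wrong only when the software picked it in place of a real
bullet.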

Chris Pratley
Group Program Manager
Microsoft Word

Sent with Office2000 SR1 wordmail

-----Original Message-----
From: Murray Sargent [mailto:murrays@microsoft.com]
Sent: Sunday, April 02, 2000 3:50 PM
To: Unicode List
Cc: Unicode List
Subject: RE: &#61623 ?

It's true that SYMBOL_CHARSET fonts are represented in TrueType fonts by
codes from 0xF020 through 0xF0FF. Microsoft Word uses these codepoints
internally as well, although RichEdit (also used for an email editor in
Outlook) doesn't. The 0xF0B7 is a bullet (similar to the U+2022) in the
Symbol font, which has been distributed since Windows 1. My guess is that
WordMail leaked the 0xF0B7 code out, but it would be good to have a
reproducible scenario. It should most definitely be fixed...
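
The arithmetic behind these numbers can be checked with a short sketch. The
helper name describe_ncr and the two range constants are hypothetical,
chosen only for this illustration; the range itself is the 0xF020 through
0xF0FF span described above.

```python
import re

# Range used for SYMBOL_CHARSET fonts, as described in the message above.
SYMBOL_FIRST, SYMBOL_LAST = 0xF020, 0xF0FF


def describe_ncr(ncr):
    """Decode a decimal NCR and report whether it falls in the symbol range.

    Returns (code point, hex string, in symbol range, 8-bit symbol code).
    The 8-bit code is None when the value is outside the symbol range.
    """
    match = re.fullmatch(r"&#(\d+);?", ncr)
    if match is None:
        raise ValueError("not a decimal NCR: %r" % ncr)
    cp = int(match.group(1))
    in_range = SYMBOL_FIRST <= cp <= SYMBOL_LAST
    symbol_code = cp - 0xF000 if in_range else None
    return cp, "0x%04X" % cp, in_range, symbol_code


print(describe_ncr("&#61623;"))
```

For "&#61623;" this gives 0xF0B7, inside the symbol range, with 8-bit code
0xB7, which is the bullet position in the Symbol font.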

Thanks
Murray

-----Original Message-----
From: Juliusz Chroboczek [mailto:jec@dcs.ed.ac.uk]
Sent: Saturday, April 01, 2000 9:43 AM
To: Unicode List
Subject: Re: &#61623 ?

"Tony Harminc" <tzha1@ibm.net>:

TH> I have recently received a couple of emails from unrelated people
TH> (one at yahoo.com and the other at hotmail.com) containing the string
TH> "&#61623;" apparently as a list item bullet. This is hex F0B7, which
TH> is in the private use area.

TH> Does anyone know what character this is trying to be, and what evil
TH> software is generating such a thing?

Michael Everson:

ME> Tsk. Software making use of the Private Use Area is not evil per
ME> se; the evil creeps in where the sender and receiver have not
ME> agreed what the character is intended to represent.

The actual behaviour is somewhat more interesting. Lend me your ears.

Microsoft TrueType fonts may either contain glyphs indexed by Unicode
codepoints (``Microsoft Unicode encoding''), or glyphs indexed by
``symbol font'' glyph index (``Microsoft symbol encoding'').
Microsoft Symbol fonts contain 224 glyphs, starting, depending on the
font, at index 0x20 or 0xF020. It is not known how Windows
distinguishes between the two cases, but consulting usFirstCharIndex
in the OS/2 table works fine in all the fonts we have checked. (This
was explained to me by Richard Griffith, to whom I am very grateful.)

When using symbol fonts in some Windows software, the document
contains the glyph indices. When converting to HTML, to RTF, or to
Unicode plain text, the glyph indices are treated as Unicode
codepoints. They will therefore appear as either private zone
codepoints or Latin-1 codepoints depending on the internal
organisation of the font used.
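
The effect described above can be sketched as follows. This is a simplified
model under stated assumptions, not the actual behaviour of any Windows
component: the function names are hypothetical, and first_char_index stands
for the first-character index recorded in the font's OS/2 table (the field
the OpenType specification calls usFirstCharIndex).

```python
def exported_codepoint(char_code, first_char_index):
    """Return the value that ends up in HTML, RTF, or plain text.

    char_code is the 8-bit position in the symbol font (0x20..0xFF).
    Fonts whose glyphs start at 0xF020 yield private use values; fonts
    whose glyphs start at 0x20 yield the 8-bit value unchanged.
    """
    base = 0xF000 if first_char_index >= 0xF000 else 0x0000
    return base + char_code


def classify(cp):
    """Name the Unicode region a leaked glyph index appears to belong to."""
    if 0xE000 <= cp <= 0xF8FF:
        return "private use"
    if cp <= 0xFF:
        return "Latin-1"
    return "other"


for first in (0xF020, 0x0020):
    cp = exported_codepoint(0xB7, first)
    print("first index 0x%04X -> U+%04X (%s)" % (first, cp, classify(cp)))
```

With the bullet at position 0xB7, the first kind of font leaks U+F0B7 and the
second leaks U+00B7, which a receiver will read as MIDDLE DOT.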

Sincerely,

                                        Juliusz Chroboczek


