RE: Microsoft Word Query

From: Marco Cimarosti (
Date: Mon Mar 19 2001 - 09:46:21 EST

Peter Constable wrote:
> On 03/19/2001 04:42:39 AM Sam Chapman wrote:
> >Slightly off topic word query - I'd be grateful for any help or advice,
> >
> >I'm curious on how to extract Unicode information from Microsoft word, It
> >appears to apply implicit conversions to numerous characters,
> >
> >e.g. 0x00BC '¼'
> >is stored as
> >0x0031, 0x002F, 0x0034 i.e. '1/4'
> No, that's wrong. It's stored as 0x00BC. How did you conclude that it
> stores "1/4"?
> >Is there anyone out there who knows how to force word to extract the
> >expected values as input and represented visually.
> What do you mean by "extract"? Do you mean "export"? Or do you mean "show
> me the actual Unicode scalar value for this character"?

I think I see now what Sam meant: if you save the file as "Text Only",
the fraction "¼" (U+00BC) is actually converted to "1/4".
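
To check what actually ended up in a saved file, one option is to dump the
code points of the decoded text. The Python sketch below is only an
illustration: the helper name `code_points` is hypothetical, and it assumes
the file has already been decoded with the encoding it was saved in.

```python
def code_points(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

# The single character VULGAR FRACTION ONE QUARTER.
print(code_points("\u00bc"))   # ['U+00BC']

# The three-character ASCII fallback described above.
print(code_points("1/4"))      # ['U+0031', 'U+002F', 'U+0034']
```

A file that kept the fraction shows one code point; a file that went through
the conversion shows three.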

I don't know the reason for this strange behavior, considering that the "Text
Only" character set on my system is Latin-1, which could accommodate "¼".

This also happens if you save it as "Encoded Text": only "Unicode" (i.e.
UCS-2) maintains "¼"; all other encodings (including "Unicode UTF-8" and
"Unicode UTF-7") convert the character to "1/4".
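
As a rough cross-check of which encodings are able to hold the character at
all, here is a small Python sketch. It says nothing about what Word does
internally; the codec names are Python's own and are only assumed to
correspond to the options in Word's dialog, and `can_encode` is a
hypothetical helper.

```python
FRACTION = "\u00bc"  # VULGAR FRACTION ONE QUARTER

def can_encode(ch, encoding):
    """Return True if ch can be represented in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Report, for each codec, whether the fraction survives encoding.
for name in ("ascii", "latin-1", "cp1252", "utf-7", "utf-8", "utf-16-le"):
    print(name, can_encode(FRACTION, name))
```

Of these, only plain ASCII has no place for the character, so a fallback to
"1/4" is not forced by Latin-1 or by the UTF encodings themselves.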

Of course, this may be OK for *some* encodings (because they don't have a "¼"
character), but not for all of them. E.g., I am sure that "vulgar fractions"
do exist in JIS and other Far East encodings.

The problem does not happen if you copy and paste the text from Word to
another application.

_ Marco
