Peter Constable wrote:
> On 03/19/2001 04:42:39 AM Sam Chapman wrote:
> >Slightly off topic word query - I'd be grateful for any help or advice,
> >I'm curious on how to extract Unicode information from Microsoft word, It
> >appears to apply implicit conversions to numerous characters,
> >e.g. 0x00BC '¼'
> >is stored as
> >0x0031, 0x002F, 0x0034 i.e. '1/4'
> No, that's wrong. It's stored as 0x00BC. How did you conclude that it
> stores "1/4"?
> >Is there anyone out their who knows how to force word to extract the
> >expected values as input and represented visually.
> What do you mean by "extract"? Do you mean "export"? Or do you mean
> "show me the actual Unicode scalar value for this character"?
I think I see now what Sam meant: if you save the file as "Text Only",
the fraction "¼" is actually converted to "1/4".
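To make the difference concrete, here is a minimal Python sketch that lists the code points of the original character and of the "1/4" that ends up in the saved file. The helper name code_points is my own, purely for illustration. For comparison it also shows the Unicode compatibility decomposition (NFKD) of U+00BC, which uses FRACTION SLASH (U+2044) rather than the plain solidus (U+002F), so the result Sam reported is not identical to what that decomposition alone would produce. How Word actually performs the substitution is not something I know.

```python
import unicodedata


def code_points(text):
    """Return the code points of text in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


original = "\u00bc"        # VULGAR FRACTION ONE QUARTER
word_text_only = "1/4"     # what Sam reported finding in the saved file
nfkd = unicodedata.normalize("NFKD", original)  # compatibility decomposition

print(code_points(original))
print(code_points(word_text_only))
print(code_points(nfkd))
```

The second and third lists differ only in the middle character: U+002F in the saved text versus U+2044 in the decomposition.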
I don't know the reason for this strange behavior, considering that the "Text
Only" character set on my system is Latin-1, which could accommodate "¼".
This also happens if you save it as "Encoded Text": only "Unicode" (i.e.
UCS-2) maintains "¼"; all other encodings (including "Unicode UTF-8" and
"Unicode UTF-7") convert the character to "1/4".
Of course, this may be OK for *some* encodings (because they don't have a "¼"
character), but not for all of them. E.g., I am sure that "vulgar fractions"
do exist in JIS and other Far East encodings.
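Whether a given encoding can hold "¼" at all is easy to check outside Word. The following is only a sketch using Python's standard codecs; the function name encode_report and the particular list of encodings are my own choices, and the codecs need not match the converters Word ships with. It reports the bytes each encoding produces for U+00BC, or None when the encoding has no such character.

```python
FRACTION = "\u00bc"  # VULGAR FRACTION ONE QUARTER


def encode_report(text, encodings=("latin-1", "utf-8", "utf-16-be", "ascii")):
    """Map each encoding name to the hex bytes for text, or None if unencodable."""
    report = {}
    for name in encodings:
        try:
            report[name] = text.encode(name).hex()
        except UnicodeEncodeError:
            # The encoding has no representation for this character.
            report[name] = None
    return report


report = encode_report(FRACTION)
for name, hex_bytes in report.items():
    print(name, hex_bytes)
```

Latin-1, UTF-8 and UTF-16 all represent the character directly, so a fallback to "1/4" is only needed for something like plain ASCII. Other codec names can be passed in the encodings argument to test further cases.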
The problem does not happen if you copy and paste the text from Word to
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:20 EDT