Peter Constable wrote:
> On 03/19/2001 04:42:39 AM Sam Chapman wrote:
> 
> >Slightly off topic word query - I'd be grateful for any help or advice,
> >
> >I'm curious on how to extract Unicode information from Microsoft word, It
> >appears to apply implicit conversions to numerous characters,
> >
> >e.g. 0x00BC '¼'
> >is stored as
> >0x0031, 0x002F, 0x0034 i.e. '1/4'
> 
> No, that's wrong. It's stored as 0x00BC. How did you conclude that it
> stores "1/4"?
> 
> >Is there anyone out there who knows how to force Word to extract the
> >expected values as input and represented visually.
> 
> What do you mean by "extract"? Do you mean "export"? Or do you mean
> "show me the actual Unicode scalar value for this character"?

I think I see now what Sam meant: if you save the file as "Text Only", the
fraction "¼" is actually converted to "1/4".

I don't know the reason for this strange behavior, considering that the "Text
Only" character set on my system is Latin-1, which could accommodate "¼"
nicely.

This also happens if you save it as "Encoded Text": only "Unicode" (i.e.
UCS-2) preserves "¼"; all other encodings (including "Unicode UTF-8" and
"Unicode UTF-7") convert the character to "1/4".

Of course, this may be OK for *some* encodings (because they don't have a "¼"
character), but not for all of them. E.g., I am sure that "vulgar fractions"
do exist in JIS and other Far East encodings.
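
To make the encoding point concrete, here is a minimal sketch in Python (chosen only for illustration; it says nothing about how Word's export filters work internally). The helper name `can_encode` and the list of codecs are assumptions made for this example. It simply asks Python's standard codecs whether U+00BC is representable in each of them.

```python
FRACTION = "\u00bc"  # VULGAR FRACTION ONE QUARTER

def can_encode(text, codec):
    """Return True if `text` is representable in `codec` (hypothetical helper)."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Codecs picked for illustration only; this is not Word's own list.
for codec in ("ascii", "latin-1", "cp1252", "utf-8", "utf-7", "utf-16-le"):
    print(f"{codec:10} {can_encode(FRACTION, codec)}")

latin1_bytes = FRACTION.encode("latin-1")  # a single byte, 0xBC
utf8_bytes = FRACTION.encode("utf-8")      # two bytes, 0xC2 0xBC
```

Of the codecs listed, only ASCII should reject the character, so a fallback to "1/4" would seem necessary only there.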

The problem does not happen if you copy and paste the text from Word to
another application.
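
One way to check what a saved or pasted text actually contains is to dump its code points. The sketch below is in Python, and the function name `code_points` is hypothetical. For comparison it also prints the Unicode compatibility decomposition of U+00BC, which uses FRACTION SLASH (U+2044) rather than the plain solidus (U+002F) reported above, so the exported "1/4" is at least not identical to that decomposition.

```python
import unicodedata

def code_points(text):
    """List the code points of `text` in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

print(code_points("\u00bc"))  # the single character ¼
print(code_points("1/4"))     # the three-character replacement
# Compatibility decomposition (NFKD) of ¼, for comparison
print(code_points(unicodedata.normalize("NFKD", "\u00bc")))
```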

_ Marco