RE: Microsoft Word Query

Date: Mon Mar 19 2001 - 10:58:56 EST

On 03/19/2001 08:21:39 AM Marco Cimarosti wrote:

>I think I see now what Sam meant: if you save the file as "Text Only",
>fraction "¼" is actually converted to "1/4".
>I don't know the reason for this strange behavior, considering that the "Text
>Only" character set on my system is Latin-1, which could accommodate "¼".
>This also happens if you save it as "Encoded Text": only "Unicode" (i.e.
>UCS-2) maintains "¼"; all other encodings (including "Unicode UTF-8" and
>"Unicode UTF-7") convert the character to "1/4".
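
For reference, a minimal Python sketch of the point Marco raises. It assumes
the character under discussion is U+00BC (VULGAR FRACTION ONE QUARTER), and
the variable names are illustrative only. It shows that the character has a
single-byte form in Latin-1 and cp1252 and a two-byte form in UTF-8, so none
of these target encodings forces a fallback to "1/4".

```python
import unicodedata

quarter = "\u00bc"  # assumed to be the character under discussion

name = unicodedata.name(quarter)
latin1_bytes = quarter.encode("latin-1")  # one byte, same value as the code point
cp1252_bytes = quarter.encode("cp1252")   # cp1252 agrees with Latin-1 at this position
utf8_bytes = quarter.encode("utf-8")      # two-byte UTF-8 sequence

print(name, latin1_bytes, cp1252_bytes, utf8_bytes)
```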

This is a bug that was introduced in Word 97 and that MS has acknowledged.
It affected several characters in the upper half of cp1252 (if my memory
serves me correctly, hex 82, 84, 85, 8b, 91, 92, 93, 94, 96, 97, 99, 9b,
a0, a9, ab, ad, ae, bb, bc, bd, be). I reported the problem to Chris
Pratley while Word 2000 was still in beta, but the fix didn't make it in
time. It has been largely addressed in Word XP - and very nicely, too: the
user gets the option as to whether they want character translations or not
(some do want them).
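
To see which characters those byte values stand for, here is a small Python
sketch that decodes each one as cp1252 and prints its Unicode code point and
name. The list is copied from the recollection above and has not been
independently verified; `AFFECTED` and `describe` are names made up for this
illustration.

```python
import unicodedata

# Byte values as recalled in the message (from memory, not verified).
AFFECTED = [
    0x82, 0x84, 0x85, 0x8B, 0x91, 0x92, 0x93, 0x94, 0x96, 0x97, 0x99, 0x9B,
    0xA0, 0xA9, 0xAB, 0xAD, 0xAE, 0xBB, 0xBC, 0xBD, 0xBE,
]

def describe(byte_value):
    """Decode one cp1252 byte and report its code point and Unicode name."""
    ch = bytes([byte_value]).decode("cp1252")
    return f"{byte_value:02x}  U+{ord(ch):04X}  {unicodedata.name(ch)}"

for b in AFFECTED:
    print(describe(b))
```

The values from a0 upward sit at the same positions in Latin-1; the values
below a0 map to punctuation in cp1252 but fall in the C1 control range in
Latin-1.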

There are still a couple of items in Word XP's export to text that I think
are not quite right. I'll quote from what I wrote to Chris Pratley:

I was a little surprised at first when checking "Insert Line Breaks" caused
NBSP to change to SP, but I guess this makes some sense since asking it to
enter line breaks suggests you want a particular layout. The user could
possibly want to edit the resulting plain text afterward and might prefer
the NBSP kept as is, but I guess it makes good sense that if they wanted to
edit further then they probably wouldn't have asked for the line breaks. I
just now noticed a couple of things I wonder about, though, and one thing I
think is a bug. With "Insert Line Breaks" checked, the pilcrow sign (00B6)
disappears. That doesn't seem right to me, since it isn't a line break
request; it's a graphic character used to represent a line break or end of
paragraph [in meta-discussion]. I think it should be kept in the text.
Also, the NOT SIGN (00AC) disappears. I'm sure this is an error -- that
someone was off by one and meant the soft hyphen (00AD). It would be
reasonable to remove this *if it's not at the end of a line* but currently
the (wrong) character is removed whether or not it's at the end of a line.
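
The behavior suggested here can be sketched in Python as follows. This is
only an illustration of the proposal, not Word's actual export logic:
`clean_line` is a hypothetical helper that takes one line of text whose line
breaks have already been inserted. It drops soft hyphens (00AD) except at the
end of the line, maps NBSP to SP as observed above, and leaves the pilcrow
(00B6) and the NOT SIGN (00AC) untouched.

```python
SOFT_HYPHEN = "\u00ad"
NBSP = "\u00a0"

def clean_line(line):
    """Hypothetical per-line cleanup for text with line breaks already inserted."""
    # A soft hyphen at the very end of the line marks a real break; keep it.
    keep_trailing = line.endswith(SOFT_HYPHEN)
    body = line[:-1] if keep_trailing else line
    # Soft hyphens inside the line are invisible break hints; drop them.
    body = body.replace(SOFT_HYPHEN, "")
    # Layout is now fixed, so NBSP can become an ordinary space.
    body = body.replace(NBSP, " ")
    # U+00AC and U+00B6 are graphic characters and pass through unchanged.
    return body + (SOFT_HYPHEN if keep_trailing else "")
```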

- Peter

Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
