Re: SHY, CGJ, etc. (was: unicode Digest V12 #108)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Fri, 8 Jul 2011 20:34:42 +0200

After more tests, it seems that Word actually changes a SOFT HYPHEN
(U+00AD) on input into <control US> (U+001F), which it uses not as a
regular "soft hyphen" but as its own "optional hyphen".

This is then changed back to a regular soft hyphen when copying to the
clipboard in a rich-text format or when saving in a rich-text format
like HTML, but it is changed into a visible U+00AC in the plain-text
version placed on the clipboard or when saving to a plain-text file,
most probably because of a legacy use of "¬" in word processors
initially made for old MS-DOS (probably WordPerfect, or even old
versions of Word) for encoding their own "optional hyphen".

When I tested it with Word 2010 on Windows 7, I did not see that the
regular soft hyphen had been converted internally to U+001F, because
there was no U+001F in the Unicode-encoded HTML file I had saved
(U+001F is invalid in HTML, so it must be converted, and Word does
choose the correct SHY character U+00AD in this case).
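
To repeat this kind of test on a saved file, it helps to count the
three code points involved instead of inspecting the file by eye. The
following is a minimal sketch in Python; the function names and the
labels are hypothetical, chosen only for this illustration, and the
file is assumed to be readable in the encoding you pass in.

```python
from collections import Counter

# The three code points discussed in this thread (labels are shorthand).
WATCHED = {
    "\u001f": "U+001F",  # <control US>, reported as Word's internal optional hyphen
    "\u00ac": "U+00AC",  # NOT SIGN
    "\u00ad": "U+00AD",  # SOFT HYPHEN
}


def count_hyphen_codepoints(text: str) -> dict:
    """Return how many times each watched code point occurs in `text`."""
    seen = Counter(ch for ch in text if ch in WATCHED)
    return {label: seen[ch] for ch, label in WATCHED.items()}


def count_in_file(path: str, encoding: str = "utf-8") -> dict:
    """Same count for a saved file; `encoding` must match how it was saved."""
    with open(path, encoding=encoding, newline="") as handle:
        return count_hyphen_codepoints(handle.read())
```

Running count_in_file() on the HTML export and on the plain-text export
of the same document should make the difference between the two visible
at a glance.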

Yes, all this is a mess, and I see no reason why it still internally
changes a regular "soft hyphen" into its own legacy "optional hyphen",
which it cannot preserve, even when saving to a UTF-8 plain-text
document (which assumes full support of Unicode, and not a legacy
8-bit "OEM" encoding such as the "IBM graphics" charset used, for
example, on the DOS-like console, where "¬" sits at 0xAA).

I admit that Word could do that, but only when saving to a non-"ANSI"
text file, because SHY is not even present in some of those OEM PC
charsets (437, for example). When saving to an "ANSI" file (some
codepage based on ISO 8859), SHY has long been usable as well (note
that U+00AC "¬" is mapped in OEM codepages 437 and 850 on 0xAA;
codepage 437 has no mapping for SHY, while codepage 850 has one on
0xF0).
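
These mappings are easy to check against the codec tables shipped with
Python, which I assume follow the usual vendor mapping files. The
helper name byte_for is hypothetical. This is only a lookup sketch, not
a statement about what Word or Windows do internally. On the tables I
know, codepage 850 does report a byte for SHY (0xF0), so that column is
worth running rather than trusting memory.

```python
NOT_SIGN = "\u00ac"
SOFT_HYPHEN = "\u00ad"


def byte_for(char: str, codepage: str):
    """Return the byte(s) `char` encodes to in `codepage`, or None if unmapped."""
    try:
        return char.encode(codepage)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    # Two OEM codepages, then two "ANSI"-style ones for comparison.
    for codepage in ("cp437", "cp850", "cp1252", "latin-1"):
        print(codepage,
              "NOT SIGN:", byte_for(NOT_SIGN, codepage),
              "SHY:", byte_for(SOFT_HYPHEN, codepage))
```

A None in the SHY column marks a codepage in which a plain-text export
has no way to keep the soft hyphen as such.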

When pasting into the command-line console, which now has full support
for Unicode in its display buffer, it should still be a regular SHY
U+00AD. It is only when an application reads characters from the
console with an input codepage that is not Unicode (e.g. reading from
legacy BIOS keystrokes, from standard input, or from other legacy
Windows APIs working in the OEM charset) that the Unicode SHY U+00AD in
the display buffer "may" be changed: to 0xAA when the legacy
application uses codepage 437 or 850, and to 0x1F otherwise (if there
is not even a "¬" mapped in that legacy codepage). Windows APIs working
in an "ANSI" charset should still return SHY on 0xAD.

All this looks like a confusion between internal storage and
processing in Word and what should be part of specific text file
format converters (they are extension DLL plugins in the "converters"
directory), rather than being hardwired in this way into Word's core
engine.
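
As an illustration of that separation, here is a minimal sketch in
Python of what a converter layer could look like. It is not Word's
actual code or API: import_text and export_plain_text are hypothetical
names, and dropping the hyphenation hint when the target encoding
cannot represent SHY is my own assumption about a reasonable fallback.
The point is only that the internal U+001F never leaks out and is never
turned into a NOT SIGN.

```python
WORD_OPTIONAL_HYPHEN = "\u001f"  # internal control, as described in this thread
SOFT_HYPHEN = "\u00ad"

_IMPORT = str.maketrans({SOFT_HYPHEN: WORD_OPTIONAL_HYPHEN})
_EXPORT = str.maketrans({WORD_OPTIONAL_HYPHEN: SOFT_HYPHEN})


def import_text(external: str) -> str:
    """Map SOFT HYPHEN to the internal optional-hyphen control on input."""
    return external.translate(_IMPORT)


def export_plain_text(internal: str, encoding: str = "utf-8") -> bytes:
    """Map the internal control back to SOFT HYPHEN, then encode.

    Assumption: if `encoding` has no SOFT HYPHEN, the invisible hint is
    dropped instead of being replaced by a visible character.
    """
    text = internal.translate(_EXPORT)
    try:
        SOFT_HYPHEN.encode(encoding)
    except UnicodeEncodeError:
        text = text.replace(SOFT_HYPHEN, "")
    return text.encode(encoding)
```

With this shape, a real NOT SIGN in the document passes through
untouched, and the choice of fallback lives in one converter function
instead of in the core engine.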

In the DOM-like Visual Basic interface of Word (or COM/DCOM), there may
exist macros (or even extension plugins for various linguistic
correctors such as external dictionaries) that still depend on
detecting U+001F in the internal work buffer, or on generating it when
working with those "optional hyphens". But here too, this should just
be part of this VB or COM interface, and subject to versioning
(version tracking of interfaces is a required component of COM
programming), which may use any one of the various text format
converters.

Word should still make every effort to maintain the distinction between
the SOFT HYPHEN and the NOT SIGN, even with its legacy "optional
hyphen" control mapped on U+001F.

To make complete tests, you should know that the Windows clipboard
exposes several parallel versions of the same source text. These are
either exposed by negotiation and collaboration with the source
application, which just indicates which output formats it supports, or
stored by the Windows clipboard itself, in memory or in a temporary
swap file using a standard text format, but only if the source
application must exit. The standard clipboard then becomes the
effective intermediate target, capable of storing a rich-text format
that it can expose itself and convert later for any other target
application.

And you do not get the same results: they depend on which source or
target application or file format you use through the Windows
clipboard or to/from Word itself (this consideration is not specific
to text; you have the same problem with images, even if Windows
defines its own "portable" DIB format, with various capabilities for
color spaces, color depths, pixel aspect ratios, logical resolutions
in twips, and so on... plus a legacy BMP format, also supported by
internal lossy image converters).

-- Philippe.

2011/7/7 Andreas Prilop <prilop4321_at_trashmail.net>:
> On Sat, 2 Jul 2011, Jukka K. Korpela wrote:
>
>> And there is really no guarantee that programs support the
>> soft hyphen. For one, Microsoft Word doesn’t—it treats it
>> as just another printable character.
>
>  ... and also:
>  http://www.cs.tut.fi/~jkorpela/shy.html#word
>
> MS Word's behaviour depends on the setting
> File > Options > Advanced > Cut, copy, and paste >
> Pasting from other programs.
> "Keep Text Only"  : U+00AD remains U+00AD.
> "Merge Formatting": U+00AD is changed to U+001F.
>
> When I copy MS Word's own soft hyphen (i.e. U+001F)
> from MS Word into any other program, I get U+00AC (¬).
> :-(
>
> --
>  From the New World:
>  http://www.google.co.uk/search?ie=ISO-8859-2&q=Dvofi%E1k
>
>
>
Received on Fri Jul 08 2011 - 13:36:49 CDT