Re: SHY, CGJ, etc. (was: unicode Digest V12 #108)

From: Philippe Verdy <>
Date: Tue, 5 Jul 2011 04:39:47 +0200

2011/7/4 Andreas Prilop <>:
> On Sun, 3 Jul 2011, Jukka K. Korpela wrote:
>>> You're wrong, it DOES. I just tested it (in Microsoft Word 2010 for
>>> Windows 7) within a random long word (aaaaaaaaaa....) and the SHY
>>> is recognized to generate the intended hyphenation break.
>> That's good news, if your analysis is correct, but the problem still
>> exists in all Word versions up to and including Word 2007.
> Philippe Verdy does not understand the difference between U+001F
> and U+00AD. Even MS Word 2010 continues to use U+001F as soft hyphen
> but does not recognize U+00AD as soft hyphen.

I do know the difference, thanks.

I have not said anything about U+001F and have not even tested it (in
any case it does not mean anything, and certainly not a soft hyphen,
except possibly in old legacy word-processing formats converted to Word;
it is just an ASCII control with unspecified behavior, not suitable for
plain-text interchange).

I have entered TRUE soft hyphens as U+00AD in a plain-text document
and opened it in Word, and this works as expected. I could also
copy-paste a SHY from a plain-text document or from the Charmap
utility, or type it from my keyboard, and it works as well.
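
To make the distinction concrete, here is a minimal sketch of how one
could check which of the two characters a plain-text string really
contains. It only uses Python's standard unicodedata module; the helper
names describe and find_soft_hyphens and the sample word are
illustrative assumptions, not part of my test in Word.

```python
import unicodedata

SHY = "\u00ad"  # U+00AD SOFT HYPHEN
US = "\u001f"   # U+001F, a C0 control (unit separator in ASCII)


def describe(ch):
    """Return (code point label, general category, character name)."""
    label = f"U+{ord(ch):04X}"
    # Control characters have no formal name, so supply a fallback.
    name = unicodedata.name(ch, "<control, no name>")
    return (label, unicodedata.category(ch), name)


def find_soft_hyphens(text):
    """Return the offsets of every U+00AD found in text."""
    return [i for i, ch in enumerate(text) if ch == SHY]


# Hypothetical sample: a word with two soft hyphens typed as U+00AD.
sample = "hy" + SHY + "phen" + SHY + "ation"

print(describe(SHY))
print(describe(US))
print(find_soft_hyphens(sample))
```

The character database classifies U+00AD as a format character (Cf)
and U+001F as a plain control (Cc), which matches the point that only
the former has soft-hyphen semantics in plain text.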

Saving the document back in XML format confirms that it remains
U+00AD. U+001F can only be a legacy of the past; it is certainly not
valid for XML, and current Word formats are XML-based
(I don't know what Word uses in its older binary format compatible with
Word 6, but that binary format does not have to obey the same rules, as
it is clearly not plain text; the same remark applies to the legacy RTF
format, still supported and used in Windows Write/WordPad, which
contains lots of legacy hacks and was not designed for Unicode
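
The XML point can be illustrated with a small sketch, assuming XML 1.0
and Python's standard xml.etree.ElementTree parser (which is built on
expat). It says nothing about what Word itself does internally; the
helper name is_well_formed and the wrapper element <t> are
hypothetical.

```python
import xml.etree.ElementTree as ET

SHY = "\u00ad"  # U+00AD SOFT HYPHEN
US = "\u001f"   # U+001F, a C0 control


def is_well_formed(text):
    """Wrap text in a dummy element and report whether it parses as XML."""
    # Assumes text contains no markup characters such as '<' or '&'.
    doc = "<t>" + text + "</t>"
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        # The parser rejects characters outside the XML 1.0 Char range.
        return False


print(is_well_formed("hy" + SHY + "phen"))  # literal U+00AD in content
print(is_well_formed("hy" + US + "phen"))   # literal U+001F in content
```

A literal U+00AD is an ordinary character for the parser, while a
literal U+001F is outside the set of characters that XML 1.0 permits.
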

Received on Mon Jul 04 2011 - 21:45:11 CDT
