Get the RTF specification for complete details of RTF format:
Word (or the application in question) stores text internally in Unicode.
The file header contains the \ansicpg control word that specifies the codepage.
For characters that are in the codepage, the normal RTF is written (the
character itself if ASCII, else \'##).
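To make the \ansicpg and \'## pieces concrete, here is a minimal Python sketch. The function names are made up for illustration, falling back to codepage 1252 when the header has no \ansicpg is an assumption, and it only handles single-byte codepages that Python ships a "cpNNNN" codec for.

```python
import re

def header_codepage(rtf: str, default: int = 1252) -> int:
    """Return the number given by \\ansicpgN, or `default` (an assumption) if absent."""
    m = re.search(r"\\ansicpg(\d+)", rtf)
    return int(m.group(1)) if m else default

def decode_hex_escapes(rtf: str) -> str:
    """Replace each \\'hh escape with the character that byte maps to in the codepage.

    Single-byte codepages only; double-byte text needs pairs of escapes
    to be decoded together, which this sketch does not attempt.
    """
    codec = "cp%d" % header_codepage(rtf)

    def repl(m: re.Match) -> str:
        byte = bytes([int(m.group(1), 16)])
        # Bytes with no mapping in the codepage become U+FFFD instead of raising.
        return byte.decode(codec, errors="replace")

    return re.sub(r"\\'([0-9a-fA-F]{2})", repl, rtf)

if __name__ == "__main__":
    print(decode_hex_escapes(r"{\rtf1\ansi\ansicpg1252 caf\'e9}"))
```

With codepage 1252, \'e9 comes out as U+00E9 and \'98 as U+02DC (the small tilde).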
For characters that do not exist in the codepage, the Unicode value is
written (\u####), followed by an approximation for the character in the
codepage. For double-byte characters, it gets a little more complicated.
What you've described is this case. The Unicode character doesn't exist in
the codepage, and the best thing the RTF writer came up with for an
approximation is '~'.
You'll need to teach your program to recognize the Unicode control words.
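Recognizing the control word could look roughly like the sketch below. The number after \u is the character's Unicode value written in decimal as a signed 16-bit integer, so \u8776 is U+2248 (ALMOST EQUAL TO); infinity, U+221E, would be \u8734. The sketch assumes the default of one fallback item after each \u (the spec's \ucN control word can change that count, and it is not handled here). The function name is hypothetical.

```python
import re

# \uN, an optional delimiting space, then at most one fallback item:
# either a \'hh escape or a single plain character.
UNICODE_ESCAPE = re.compile(r"\\u(-?\d+) ?(?:\\'[0-9a-fA-F]{2}|[^\\{}])?")

def replace_unicode_escapes(rtf: str) -> str:
    """Replace each \\uN (plus its fallback) with the Unicode character N.

    Assumes one fallback item per \\uN. Values from the surrogate range
    are passed through as-is rather than combined into one character.
    """
    def repl(m: re.Match) -> str:
        n = int(m.group(1))
        if n < 0:
            n += 65536  # values above 32767 are written as negative numbers
        return chr(n)

    return UNICODE_ESCAPE.sub(repl, rtf)

if __name__ == "__main__":
    print(replace_unicode_escapes(r"a \u8776\'98 b"))
```

Because the fallback is consumed along with the \u word, the \'98 never reaches the rest of the program and no stray '~' is produced.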
On Windows, you can just use WideCharToMultiByte and MultiByteToWideChar to
map the text between Unicode and the codepage specified in the file.
The Unicode web site has mapping tables for many codepages and encodings.
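Outside Windows, a codec library can stand in for those two API calls. As an illustration of the writer side described above, here is a sketch using Python's built-in codecs. The function name is hypothetical, and '?' is used as the approximation for every unmapped character, which is an assumption; a real writer picks a closer-looking character where it can.

```python
import struct

def rtf_escape(text: str, codepage: str = "cp1252") -> str:
    """Escape text for an RTF body, following the three cases described above.

    Control characters such as newlines are not treated specially here.
    """
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)            # RTF syntax characters
        elif ord(ch) < 128:
            out.append(ch)                   # plain ASCII is written as-is
        else:
            try:
                encoded = ch.encode(codepage)
            except UnicodeEncodeError:
                encoded = b""                # not in the codepage
            if len(encoded) == 1:
                out.append("\\'%02x" % encoded[0])
            else:
                # Write \uN for each UTF-16 unit as a signed 16-bit decimal,
                # each followed by the '?' placeholder approximation.
                raw = ch.encode("utf-16-le")
                for unit in struct.unpack("<%dh" % (len(raw) // 2), raw):
                    out.append("\\u%d?" % unit)
    return "".join(out)

if __name__ == "__main__":
    print(rtf_escape("caf\u00e9 \u2248 1"))
```

Here U+00E9 is in codepage 1252 and becomes \'e9, while U+2248 is not and becomes \u8776 followed by the placeholder.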
--- Paul Chase Dempsey
Microsoft Visual Studio Text Editor Developer
From: Alfinito, Charles [mailto:AlfinitoC@cadmus.com]
Sent: Tuesday, January 12, 1999 9:11 AM
To: Unicode List
Subject: New on list
Unicode is presenting a problem. For example, a ~ may be the character in a
file. Normally in RTF this would be shown as \'98. Recently I had a file
with the Unicode, \u8776\'98. This character should have been an
"infinity". Since my program can't handle the Unicode RTF (\u8776) it
ignores it and changes the \'98 to a ~ which obviously is wrong.
Does anyone know how Unicode is deriving the number (as in \u8776)? I know
it has to do with the ANSI code page but I can't figure out if there is any
rhyme or reason to the Unicode numbers it is assigning or the combination of
Unicode and RTF (\u8776\'98). If I know then I could program the Unicode
characters. I've been looking for some sort of table.