Re: FW: Help with win 1251 into utf-8

From: Otto Stolz (Otto.Stolz@uni-konstanz.de)
Date: Mon Jun 07 1999 - 19:10:20 EDT


Hello Sam,

On 1999-06-03 at 11:22 h, your question was posted to the Unicode List
<unicode@unicode.org>, where I have seen it today.

On 1999-05-25 at 3:41 h, <sam.dunham@globalone.net> wrote:
> I am trying to convert a Russian win 1251 doc into utf-8 for an online
> Access database. It works perfectly well within Internet Explorer, but
> not unfortunately in Netscape.

What do you mean by "it"? Probably not that either of these programs
is supposed to do that conversion for you. So, what did you really try
to do?

And what do you mean by "does/doesn't work"? What are the exact symptoms?

And which versions of IE, and Netscape, respectively, did you try? On which
system?

> Can anyone tell me how I can convert these docs so it is acceptable and read
> by both. and in the cheapest manner possible as I am in the process of
> starting up.

As I did not quite understand your problem, I'll give you some starting-points.

The code page is best described in
  <http://czyborra.com/charsets/cyrillic.html#CP1251>
  <http://czyborra.com/charsets/codepages.html#CP1251>
  You may also wish to read those WWW pages in full.

The mapping from CP 1251 to Unicode is authoritatively described in:
   <ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1251.TXT>
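To give an idea of what that mapping does, here is a small sketch in Python.
It assumes that Python's built-in "cp1251" codec follows the table above; the
function name cp1251_to_code_points is only chosen for this illustration.

```python
def cp1251_to_code_points(data: bytes) -> list[int]:
    """Map each CP 1251 byte to its Unicode code point.

    Assumption: Python's built-in "cp1251" codec agrees with CP1251.TXT.
    """
    return [ord(char) for char in data.decode("cp1251")]


if __name__ == "__main__":
    # 0xC0, 0xE0, 0xFF, 0xA8 are Cyrillic letters in CP 1251.
    sample = bytes([0xC0, 0xE0, 0xFF, 0xA8])
    for byte, code_point in zip(sample, cp1251_to_code_points(sample)):
        print(f"0x{byte:02X} -> U+{code_point:04X}")
```

Every byte of the 8-bit code page becomes exactly one Unicode character,
e.g. 0xC0 becomes U+0410 (CYRILLIC CAPITAL LETTER A).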

UTF-8 is described in:
   <http://czyborra.com/utf/#UTF-8>
   Again, you may wish to read the whole WWW page.
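As a rough sketch of how UTF-8 turns one Unicode code point into bytes, the
following Python function spells out the bit shuffling. It assumes the
present-day definition of UTF-8 (at most four bytes, no surrogates); the
function name is hypothetical, and in practice you would simply use the
language's own encoder.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (illustrative only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp!r}")
    if cp < 0x80:
        # 7 bits: one byte, identical to ASCII
        return bytes([cp])
    if cp < 0x800:
        # up to 11 bits: 110xxxxx 10xxxxxx (all of Cyrillic falls here)
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

For U+0410 this yields the two bytes 0xD0 0x90, so a Russian text roughly
doubles in size when converted from CP 1251 to UTF-8.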

You must also mark the HTML version and the character encoding in your HTML
source, cf.
   <http://www.w3.org/TR/REC-html40/struct/global.html#h-7.2>
   <http://www.w3.org/TR/REC-html40/charset.html#h-5.2.2>
   (Earlier HTML versions are based on ISO 8859-1, so you cannot legally
   display Cyrillic characters in pre-4.0 HTML.)
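For illustration, a minimal Python sketch that wraps plain text in an HTML 4.0
skeleton carrying both markings. The function name wrap_as_html and the choice
of the strict DTD are assumptions made for this example, not something
prescribed by the specification for your documents.

```python
from html import escape

# Document type declaration for HTML 4.0 Strict (marks the HTML version).
HTML40_DOCTYPE = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"\n'
    '  "http://www.w3.org/TR/REC-html40/strict.dtd">'
)
# META element declaring the character encoding of the document.
CHARSET_META = (
    '<META http-equiv="Content-Type" content="text/html; charset=UTF-8">'
)


def wrap_as_html(title: str, paragraphs: list[str]) -> bytes:
    """Return a UTF-8 encoded HTML 4.0 page built from plain-text paragraphs."""
    lines = [HTML40_DOCTYPE, "<HTML>", "<HEAD>", CHARSET_META,
             f"<TITLE>{escape(title)}</TITLE>", "</HEAD>", "<BODY>"]
    # escape() protects "&", "<" and ">" in the text from being read as markup.
    lines += [f"<P>{escape(paragraph)}</P>" for paragraph in paragraphs]
    lines += ["</BODY>", "</HTML>", ""]
    return "\n".join(lines).encode("utf-8")
```

The bytes returned must be stored or sent unchanged; if the server announces
a different charset in its HTTP header, that header takes precedence over the
META element.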

In contrast to the HTML 4.0 specification (cf.
<http://www.w3.org/TR/REC-html40/charset.html>, particularly
<http://www.w3.org/TR/REC-html40/charset.html#h-5.3.1> and
<http://www.w3.org/TR/REC-html40/charset.html#h-5.1>), Netscape 4.05 can
display characters above the 8-bit range *only* when a Unicode transfer
encoding is chosen (UTF-7 or UTF-8), either via a Meta tag, or manually
via the View/Encoding menu.

You may also wish to read other parts of the HTML 4.0 specification,
and hints for HTML authors:
  <http://www.w3.org/TR/REC-html40/>
  <http://www.w3.org/WAI/GL/#Current_Draft>
and to test your HTML source against pertinent validation services:
  <http://validator.w3.org/>
  <http://www.cast.org/bobby/>

I have tried the HTML assistants from two Word versions (with German texts
in CP 1252), and both of them have generated absolutely unacceptable
HTML versions. Unacceptable meaning: containing blatant HTML syntax errors
and a plethora of proprietary Microsoft features. Hence, I advise you *not*
to let Word generate the HTML source. If your "Russian win 1251 doc" is
really a Word document, then you'd better store it as bare text, and then
insert proper HTML tags manually (or with a good HTML editor). One program to
convert CP 1251 text to Unicode, edit it, and store it in UTF-8, is UniEdit:
  <http://www.lang.duke.edu/uniintro.htm>
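If you prefer a script to an editor, the conversion of a bare text file can be
sketched in a few lines of Python. This is only an illustration under the
assumption that the input really is CP 1251; the function name and the file
names in the usage note are hypothetical.

```python
def convert_cp1251_file_to_utf8(src_path: str, dst_path: str) -> None:
    """Read a CP 1251 text file and write the same text as UTF-8.

    Decoding is strict: a byte that CP 1251 leaves undefined (0x98)
    raises UnicodeDecodeError instead of being silently replaced.
    """
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding="cp1251", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

A call such as convert_cp1251_file_to_utf8("russian.txt", "russian-utf8.txt")
would leave the original file alone and write the converted copy next to it.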

If you still have problems, you may wish to store your partial result on
an HTTP, or FTP, server and provide the URL, together with a description
of the error symptoms encountered, so other people can test it and possibly
give advice on overcoming your problems.

Best wishes,
   Otto Stolz


