Re: utf-8 != latin-1

From: Michael (michka) Kaplan (michka@trigeminal.com)
Date: Sat Oct 14 2000 - 18:25:44 EDT


Even in Windows 2000, where Notepad supports UTF-8 and will try to
auto-detect it, a BOM-less file will not be assumed to be UTF-8 if there is
no reason why it cannot be represented as a non-Unicode text file using the
default system code page.
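
To make that behavior concrete, here is a rough Python sketch of the kind of
heuristic described above. It is not Notepad's actual algorithm; the function
name `guess_encoding` and the default code page `cp1252` are assumptions
chosen purely for illustration.

```python
import codecs

def guess_encoding(data: bytes, system_code_page: str = "cp1252") -> str:
    """Hypothetical detection heuristic (NOT Notepad's real code).

    1. A UTF-8 BOM settles the question immediately.
    2. Otherwise, if the bytes are valid in the system code page,
       assume the file is a non-Unicode text file.
    3. Only when that fails, try UTF-8.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        data.decode(system_code_page)
        return system_code_page
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"

# BOM-less UTF-8 for "é" (C3 A9) is also valid cp1252, so it is misdetected.
print(guess_encoding("\u00e9".encode("utf-8")))
# The same bytes with a BOM in front are recognized as UTF-8.
print(guess_encoding(codecs.BOM_UTF8 + "\u00e9".encode("utf-8")))
```

The point of the sketch is only that most BOM-less UTF-8 byte sequences are
also legal in a single-byte code page, so a detector that prefers the code
page will rarely have a "reason" to pick UTF-8.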

If Notepad had been used and the file had been saved with a BOM, then it
would have worked. I think a BOM is the answer. Certainly Notepad will never
look at the XML encoding declaration (it's not an XML parser, and technically
a valid XML parser does not have to respect the declaration, per the spec).
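
For anyone producing such files from a script rather than from Notepad, a
minimal Python sketch of writing and checking the BOM follows. The helper name
`has_utf8_bom` and the sample text are made up for the example; the
`utf-8-sig` codec is Python's standard codec that writes the BOM on encode and
strips it on decode.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

text = "r\u00e9sum\u00e9"                 # sample text, assumed for illustration

with_bom = text.encode("utf-8-sig")       # BOM is prepended
without_bom = text.encode("utf-8")        # no BOM

print(has_utf8_bom(with_bom), has_utf8_bom(without_bom))

# Decoding with utf-8-sig strips the BOM if present and tolerates its absence.
print(with_bom.decode("utf-8-sig") == text)
print(without_bom.decode("utf-8-sig") == text)
```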

michka

a new book on internationalization in VB at
http://www.i18nWithVB.com/

----- Original Message -----
From: "Steven R. Loomis" <srl@jtcsv.com>
To: "Unicode List" <unicode@unicode.org>
Sent: Saturday, October 14, 2000 2:48 PM
Subject: Re: utf-8 != latin-1

> Doug Ewell wrote:
> > Why? As an illegal UTF-8 sequence, it shouldn't be interpreted as
> > anything.
>
> It wasn't interpreted as anything. It halted processing at that point
> in the text, as an error.
>
> George Zeigler wrote:
> > I didn't get it. So what happens if a company had a Job site in
> > Unicode, and people were copying resume text from Word written in
> > ISO 8859-1 and pasting into a text window in the browser? Does the
> > character set automatically convert correctly. Or does the user need
> > to use a character set converter like Recode?
>
> It was pasted into Windows Notepad or some other editor editing an XML
> file. XML files unless otherwise tagged are UTF-8, but the editor
> thought it was something like Windows-1252. So, the right thing to do
> *might* be to tag the file as being 'windows-1252'. A better solution
> would be to use UTF-8 aware editors only.
>
> My point is that it was hard to tell visually whether the data being
> copied was a 'safe' subset of both utf-8 and windows-1252 [such as
> ASCII].
>
> -s
>
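
The failure described in the quoted message, and the 'safe' subset it
mentions, can be sketched in a few lines of Python. The function names
`decodes_as_utf8` and `is_safe_in_both` are hypothetical, and `cp1252` is
used here as Python's name for windows-1252.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes form a legal UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def is_safe_in_both(text: str) -> bool:
    """Return True if the text has identical bytes in UTF-8 and windows-1252.

    In practice this holds only for pure ASCII text.
    """
    try:
        return text.encode("utf-8") == text.encode("cp1252")
    except UnicodeEncodeError:
        # The text contains characters windows-1252 cannot represent at all.
        return False

pasted = "r\u00e9sum\u00e9".encode("cp1252")   # bytes as a windows-1252 editor saves them

print(decodes_as_utf8(pasted))                 # the E9 bytes are not legal UTF-8 here
print(is_safe_in_both("resume"))               # plain ASCII
print(is_safe_in_both("r\u00e9sum\u00e9"))     # accented text differs between encodings
```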



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:14 EDT