From: Rick McGowan (rick@unicode.org)
Date: Wed Mar 23 2005 - 18:44:33 CST
Uni-cadets,
Picking up this thread again from early March... People noted at that time
that text files served via HTTP from Unicode.org had no explicit charset, and
therefore defaulted to ISO 8859-1. However, most of our files are not 8859-1
at all.
We have attempted to remedy this situation as follows.
All ".txt" files served from Unicode.org now default to UTF-8. This is
in line with the long-standing fact that most of our data files are simple
ASCII anyway, and when they are not simple ASCII, they are mostly UTF-8
(such as the Unihan database). Because ASCII is a proper subset of UTF-8,
any pure-ASCII file is also a valid UTF-8 file with identical contents, so
this should work fine for most text files.
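
To illustrate the subset point, here is a minimal Python sketch. The function
name is_ascii_compatible is made up for this example and is not part of any
Unicode.org tooling. It checks that a byte string is pure ASCII, in which
case it decodes to the same text under UTF-8.

```python
def is_ascii_compatible(data: bytes) -> bool:
    """Return True if data is pure ASCII and therefore decodes
    to the same text whether it is read as ASCII or as UTF-8."""
    try:
        return data.decode("ascii") == data.decode("utf-8")
    except UnicodeDecodeError:
        # At least one byte is >= 0x80 (or the UTF-8 is malformed),
        # so the data is not plain ASCII.
        return False


# A typical ASCII-only data line is unaffected by the new default.
ascii_line = b"0041;LATIN CAPITAL LETTER A"
# A single 8859-1 byte such as 0xE9 is not ASCII (and not valid UTF-8 alone).
latin1_bytes = "\u00e9".encode("latin-1")
```

Files containing bytes above 0x7F are the ones where the declared charset
actually matters.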
Some files we serve are in fact encoded in 8859-1 -- specifically the
"NamesList.txt" files from various versions of the UCD. These files will
now all be explicitly served with the 8859-1 encoding.
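
The rule described above can be sketched roughly as follows. This is only an
illustration and not the actual server configuration: the function name
charset_for is hypothetical, and matching on a "NamesList" filename prefix is
an assumption about how the versioned files are named.

```python
import posixpath
from typing import Optional


def charset_for(path: str) -> Optional[str]:
    """Pick the charset label for a served file, per the rule in this post.

    Assumption: NamesList files are recognised by a "NamesList" prefix.
    """
    name = posixpath.basename(path)
    if not name.endswith(".txt"):
        return None  # the post only discusses ".txt" files
    if name.startswith("NamesList"):
        return "ISO-8859-1"  # the known exception
    return "UTF-8"  # new default for all other ".txt" files
```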
Addison Phillips, on March 4, remarked:
> Just out of curiosity, why *don't* all the UCD files use UTF-8?
and Erik van der Poel noted:
> It might be a good idea to convert all of unicode.org's
> non-UTF-8 *.txt files to UTF-8, if that wouldn't cause too
> many problems.
Only the NamesList.txt files are not UTF-8. There is some history to that,
and it has to do with the toolset used to build the UCD and to publish the
standard. That is unlikely to change in the near future. But at least now,
you should be served the correctly tagged text files.
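
For anyone who wants to check the tagging on a file they retrieve, here is a
small Python sketch that pulls the charset out of a Content-Type header
value. The function name and the sample header strings are assumptions for
illustration; the fallback to 8859-1 mirrors the old default behaviour
described at the top of this message.

```python
from email.message import Message


def charset_from_content_type(header_value: str,
                              default: str = "iso-8859-1") -> str:
    """Return the lowercased charset parameter of a Content-Type value,
    or the default when no charset is given."""
    msg = Message()
    msg["Content-Type"] = header_value
    charset = msg.get_content_charset()  # None if no charset parameter
    return charset if charset else default


tagged = charset_from_content_type("text/plain; charset=UTF-8")
untagged = charset_from_content_type("text/plain")
```

In practice the header value would come from the HTTP response for the file
in question.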
If anyone continues to have trouble with any text files retrieved via
HTTP, or finds any problem files (served with the wrong encoding), please
let me know off-list. I'll endeavor to fix the problem.
Cheers,
Rick