Re: Bad Content-type headers on Unicode web site?

From: Rick McGowan (rick@unicode.org)
Date: Wed Mar 23 2005 - 18:44:33 CST


    Uni-cadets,

    Picking up this thread again from early March... People noted at that time
    that text files served via HTTP from Unicode.org had no explicit charset,
    and therefore defaulted to ISO 8859-1. However, most of our files are not
    8859-1 at all.

    We have attempted to remedy this situation as follows.

    All ".txt" files served from Unicode.org now default to UTF-8. This is
    in line with the fact that most of our data files are plain ASCII anyway,
    and those that are not are mostly UTF-8 (such as the Unihan database).
    Because ASCII is a proper subset of UTF-8, this should work fine for most
    text files.
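
    To illustrate the subset point, here is a minimal Python sketch. The
    sample line and the choice of U+4E00 are illustrative assumptions, not
    taken from any particular file. ASCII-only bytes decode to the same text
    whether labelled ASCII, UTF-8, or 8859-1, while UTF-8 bytes for a
    non-ASCII character turn into mojibake under the old 8859-1 default.

```python
# Sketch: why a UTF-8 default is harmless for ASCII-only files.

# An illustrative ASCII-only data line (sample value, assumed for the demo).
ascii_line = b"0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;\n"

# ASCII bytes decode to the same string under all three labels.
same_for_ascii = (
    ascii_line.decode("ascii")
    == ascii_line.decode("utf-8")
    == ascii_line.decode("iso-8859-1")
)

# A non-ASCII character (U+4E00) encoded as UTF-8 occupies three bytes.
utf8_bytes = "\u4e00".encode("utf-8")

# Decoded with the right label, we get the one character back.
correct = utf8_bytes.decode("utf-8")

# Decoded as 8859-1, each byte becomes its own (wrong) character.
mislabeled = utf8_bytes.decode("iso-8859-1")

print(same_for_ascii, len(correct), len(mislabeled))
```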

    Some files we serve are in fact encoded in 8859-1 -- specifically the
    "NamesList.txt" files from various versions of the UCD. These files will
    now all be explicitly served with the 8859-1 encoding.
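
    The rule described above amounts to a per-filename charset choice. The
    following Python sketch shows one way to express it. It is not the actual
    server configuration; the function name content_type_for, the
    "NamesList" prefix match for versioned copies, and the fallback type for
    non-.txt files are all assumptions made for illustration.

```python
import posixpath


def content_type_for(path: str) -> str:
    """Pick a Content-Type header value for a served file (hypothetical)."""
    name = posixpath.basename(path)
    # Assumption: every NamesList file, including versioned copies,
    # has a name that starts with "NamesList" and ends in ".txt".
    if name.startswith("NamesList") and name.endswith(".txt"):
        return "text/plain; charset=ISO-8859-1"
    # All other .txt files default to UTF-8.
    if name.endswith(".txt"):
        return "text/plain; charset=UTF-8"
    # Anything else is outside the rule described here (assumed fallback).
    return "application/octet-stream"


# Example paths are illustrative only.
print(content_type_for("/some/dir/NamesList.txt"))
print(content_type_for("/some/dir/Unihan.txt"))
```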

    Addison Phillips, on March 4, remarked:

    > Just out of curiosity, why *don't* all the UCD files use UTF-8?

    and Erik van der Poel noted:

    > It might be a good idea to convert all of unicode.org's
    > non-UTF-8 *.txt files to UTF-8, if that wouldn't cause too
    > many problems.

    Only the NamesList.txt files are not UTF-8. There is some history to that,
    having to do with the toolset used to build the UCD and to publish the
    standard. That is unlikely to change in the near future. But at least now
    you should be served correctly tagged text files.

    If anyone continues to have trouble with any text files retrieved via
    HTTP, or finds any problem files (served with the wrong encoding), please
    let me know off-list. I'll endeavor to fix the problem.
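
    For anyone who wants to check a file before reporting it, here is a rough
    Python sketch. The helper names declared_charset and check_served_text
    are hypothetical, and the test for "tagged 8859-1 but really UTF-8" is
    only a heuristic: every byte sequence is valid 8859-1, so the sketch can
    do no more than flag non-ASCII bodies that also happen to be valid UTF-8.

```python
from email.message import Message


def declared_charset(content_type: str):
    """Return the lowercased charset parameter of a Content-Type, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def check_served_text(content_type: str, body: bytes) -> str:
    """Compare the declared charset with the bytes actually served."""
    charset = declared_charset(content_type)
    if charset is None:
        return "no charset declared"
    try:
        body.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return f"body is not valid {charset}"
    # Heuristic: 8859-1 never fails to decode, so look for non-ASCII
    # bodies that are also well-formed UTF-8 (a likely mis-tag).
    if charset == "iso-8859-1" and not body.isascii():
        try:
            body.decode("utf-8")
        except UnicodeDecodeError:
            return "ok"
        return "tagged iso-8859-1 but body is valid UTF-8"
    return "ok"


print(check_served_text("text/plain; charset=UTF-8", b"plain ASCII"))
```

    The header and body would typically come from an HTTP response, for
    example via urllib.request.urlopen(url), reading
    response.headers.get("Content-Type") and response.read().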

    Cheers,
            Rick


