RE: cp1252 decoder implementation

From: Shawn Steele <Shawn.Steele_at_microsoft.com>
Date: Wed, 21 Nov 2012 16:58:56 +0000

I’ll be more definitive than Murray ☺ Our legacy code pages aren’t going to change. We won’t add more characters to 1252. We won’t add new code pages. We aren’t going to change names (since that’ll break anyone already using them), and we probably won’t recognize new names (since a new name wouldn’t work on millions of existing computers, so no one would add it).

The churn is too painful for customers. If there’s a new character that everyone “must” use, we’ll point them at UTF-8 or UTF-16. Any request to change codepage behavior would have to meet a very high bar.

The status of these 5 characters is already in the best fit mappings document pointed to by the IANA registry entry for windows-1252, which is as strong as I’m willing to go for them.
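A minimal sketch of what this means for a decoder, assuming the five characters are the bytes windows-1252 leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D) and that the best fit document maps each to the C1 control with the same code point. Python's built-in cp1252 codec rejects these bytes in strict mode; the function name `decode_cp1252_bestfit` is hypothetical and chosen only for illustration.

```python
# Assumption: these are the five bytes with no assigned character in
# windows-1252; the best fit mapping treats them as U+0081, U+008D, etc.
UNDEFINED_CP1252_BYTES = (0x81, 0x8D, 0x8F, 0x90, 0x9D)


def decode_cp1252_bestfit(data: bytes) -> str:
    """Decode cp1252, passing the five unassigned bytes through as C1 controls."""
    out = []
    for b in data:
        if b in UNDEFINED_CP1252_BYTES:
            # Same numeric value, interpreted as a Unicode code point.
            out.append(chr(b))
        else:
            out.append(bytes([b]).decode("cp1252"))
    return "".join(out)


def strict_decode_fails(b: int) -> bool:
    """Report whether Python's strict cp1252 codec rejects a single byte."""
    try:
        bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        return True
    return False
```

With this sketch, `decode_cp1252_bestfit(b"\x80\x81")` yields the euro sign followed by U+0081, whereas a strict `b"\x81".decode("cp1252")` raises `UnicodeDecodeError`.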

The last thing I did WRT code page standards was to ask for the best fit mappings to be posted so that the IANA charset registry would have something to reference to clarify the existing names. It’s possible (if I find the time) that a few of the IANA charset entries could be updated to emphasize that some common names have differing implementations by different vendors/OSes, as was done for shift_jis (http://www.iana.org/assignments/charset-reg/shift_jis), or the updates to point out the best fit mapping for 1252 at http://www.iana.org/assignments/charset-reg/windows-1252. In other words, the trend is to clarify that there are variations in behavior, and to please use Unicode.

Also see:
http://blogs.msdn.com/b/shawnste/archive/2007/09/24/are-we-going-to-update-or-maintain-the-best-fit-or-code-page-mappings.aspx
http://blogs.msdn.com/b/shawnste/archive/2008/01/17/code-pages-and-security-issues.aspx
http://blogs.msdn.com/b/shawnste/archive/2007/03/20/some-reasons-to-make-your-application-unicode.aspx

(and http://blogs.msdn.com/b/shawnste/archive/2012/06/16/building-the-lego-disney-wonder.aspx just because I think it’s cool)

I can see why HTML5 might think windows-1252 support is a good idea, but personally I’d’ve been happier if it wasn’t a requirement. Too much code page corruption happens on the web, and most of the badly-tagged content probably misdeclares itself as 1252. UTF-8 is a WAY better choice, particularly for the characters in the set supported by windows-1252.
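To make the corruption concrete, here is a small sketch of what happens when UTF-8 bytes are decoded as windows-1252. The sample string and the helper name `misdecode_as_cp1252` are made up for illustration; only Python's standard codecs are used.

```python
def misdecode_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode the bytes as cp1252."""
    return text.encode("utf-8").decode("cp1252")


original = "café ’"
garbled = misdecode_as_cp1252(original)

# In this case the damage is reversible, because every byte involved
# happens to have a cp1252 mapping. That is not true for all input.
repaired = garbled.encode("cp1252").decode("utf-8")
```

Here `garbled` is `"cafÃ© â€™"`: each non-ASCII character turns into two or three unrelated ones. The round trip back works only because none of the UTF-8 bytes fall on the positions cp1252 leaves unassigned.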

-Shawn
( )

SSDE,
Microsoft

From: unicode-bounce_at_unicode.org [mailto:unicode-bounce_at_unicode.org] On Behalf Of Murray Sargent
Sent: Tuesday, November 20, 2012 8:55 PM
To: verdy_p_at_wanadoo.fr; Doug Ewell
Cc: Unicode Mailing List; Buck Golemon
Subject: RE: cp1252 decoder implementation

Philippe commented: “(even if later Microsoft decides to map some other characters in its own "windows-1252" charset, like it did several times and notably when the Euro symbol was mapped)”.

Personal opinion, but I’d be very surprised if Microsoft ever changed the 1252 charset. The euro was added back in 1999 when code pages were still used a lot. Code pages in general are pretty much irrelevant today except for reading legacy documents. They are virtually never used internally in modern software. UTF-8, UTF-16, and UTF-32 are what are used these days.

(But code pages do have the advantage that they are associated with specific character repertoires, which amounts to a great hint for font binding…)

Murray
Received on Wed Nov 21 2012 - 11:05:01 CST
