Farsi and SQL Server 2000

From: Michael (michka) Kaplan (michka@trigeminal.com)
Date: Fri Jul 28 2000 - 19:20:03 EDT


I need to clarify something because I was partially wrong in my original
discussions on Farsi and SQL Server 2000.

HISTORY: Windows 2000, SQL Server 2000, and Jet 4.0 all use the same
codebase for string normalization (though the SQL 2000 case is missing some
scripts in Unicode 3.0 and Jet 4.0 is missing a few more). Farsi is included
in all of them, however. So if you choose an Arabic collation, Farsi
characters will be properly normalized and thus should sort as a Farsi user
would expect.

HOWEVER, this only applies to NTEXT, NVARCHAR, and NCHAR fields.

If you are using any other kind of field (TEXT, VARCHAR, or CHAR) then the
collation also controls the code page, and cp1256 does not completely
support Farsi, so some characters will be converted to ? on insert.
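
To make the data loss concrete, here is a rough sketch in Python. It is
only an illustration: Python's "cp1256" codec stands in for the conversion
the server performs on a non-Unicode field, and the real conversion may not
behave identically. The helper name store_as_cp1256 is made up for this
example.

```python
# Illustrative only: Python's "cp1256" codec is an assumed stand-in for the
# Unicode-to-code-page conversion applied to a non-Unicode field.

def store_as_cp1256(text: str) -> str:
    """Simulate writing text to a cp1256 field and reading it back."""
    # errors="replace" turns every unmappable character into b"?"
    raw = text.encode("cp1256", errors="replace")
    return raw.decode("cp1256")

# The word "Farsi" in Farsi, ending in FARSI YEH (U+06CC).
word = "\u0641\u0627\u0631\u0633\u06cc"
stored = store_as_cp1256(word)

print(stored == word)  # False: the final letter did not survive
print(stored[-1])      # ?
```

The first four letters are shared with Arabic and pass through unchanged;
only the last one is replaced.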

However, in a Unicode field, you should find that Farsi data will sort
properly, without any special work needing to be done (so I am glad that I
did the original Farsi work free of charge so I do not feel like I was
wasting their money!).

Therefore, Farsi should probably have been included as a separate collation
under SQL Server marked as "Unicode only" which would cause a runtime error
if you tried to add it to a non-Unicode text field, to avoid the two worst
words to the SQL Server team (for those who do not know, the words are "Data
Corruption").

My suggestion to the SQL Server team is to handle the above and correct the
fact that many people will be perhaps not offended but at least a little
miffed at having to use a collation that looks a lot like a language that
historically would cause minor problems (which is how smart Farsi users who
understand the limitations of cp1256 will view the situation).

Windows 2000 might want to do the same thing, since it will have the same
problem with ANSI apps and a default system language of "Farsi", will it
not? I have verified that a GetLocaleInfo call for
LOCALE_IDEFAULTANSICODEPAGE returns 1256, which in this case is actually
French for "Data Loss" for a few characters, as far as I can tell.

<PERSONAL_OPINION>
Any time there is not a 100% lossless conversion between Unicode and MBCS
for a given language, GetLocaleInfo should not say that this is a valid code
page, and SQL Server should not document the usage of that code page for the
language.
</PERSONAL_OPINION>
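
One way to picture the "100% lossless" test is the following Python sketch.
It is not how GetLocaleInfo or SQL Server decide anything; the function
name unmappable and the FARSI_SAMPLE string are invented for the example,
and the sample is only a handful of letters used in Farsi, not a complete
alphabet.

```python
def unmappable(repertoire: str, codepage: str) -> list:
    """Return the characters that do not round-trip through codepage."""
    bad = []
    for ch in repertoire:
        try:
            if ch.encode(codepage).decode(codepage) != ch:
                bad.append(ch)
        except UnicodeError:
            # The code page has no mapping for this character at all.
            bad.append(ch)
    return bad

# Hypothetical sample: a few letters used in Farsi, ending in FARSI YEH.
FARSI_SAMPLE = "\u067e\u0686\u0698\u06af\u06a9\u06cc"

# A code page would count as valid for the language only if this is empty.
print(unmappable(FARSI_SAMPLE, "cp1256"))
```

Under this rule a single entry in the returned list would be enough to
disqualify the code page for the language.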

michka

(special thanks to Michael Kung, as both of us walked out of a conversation
a few minutes ago a lot more intelligent than when we walked in!)


