From petercon@microsoft.com Thu Jan 13 13:21:40 2005 Received: with ECARTIS (v1.0.0; list hebrew); Thu, 13 Jan 2005 14:38:43 -0600 (CST) Received: from mail3.microsoft.com (mail3.microsoft.com [131.107.3.123]) by unicode.org (8.12.11/8.12.11) with ESMTP id j0DJLVeQ019282; Thu, 13 Jan 2005 13:21:40 -0600 Received: from mailout2.microsoft.com ([157.54.1.120]) by mail3.microsoft.com with Microsoft SMTPSVC(6.0.3790.211); Thu, 13 Jan 2005 11:21:29 -0800 Received: from RED-MSG-52.redmond.corp.microsoft.com ([157.54.12.12]) by mailout2.microsoft.com with Microsoft SMTPSVC(6.0.3790.211); Thu, 13 Jan 2005 11:21:15 -0800 X-MimeOLE: Produced By Microsoft Exchange V6.5.7226.0 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Subject: [hebrew] Re: ISO 10646 compliance and EU law Date: Thu, 13 Jan 2005 11:21:25 -0800 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: ISO 10646 compliance and EU law Thread-Index: AcT5oM7/ul/ibm5VSw+jVvyq13CIRQABBtZQ From: "Peter Constable" To: "E. Keown" , "Antoine Leca" , Cc: X-OriginalArrivalTime: 13 Jan 2005 19:21:15.0384 (UTC) FILETIME=[0CDB5F80:01C4F9A5] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by unicode.org id j0DJLVeQ019282 X-archive-position: 2905 X-Approved-By: jcowan@reutershealth.com X-ecartis-version: Ecartis v1.0.0 Sender: hebrew-bounce@unicode.org Errors-to: hebrew-bounce@unicode.org X-original-sender: petercon@microsoft.com Precedence: bulk X-list: hebrew > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf > Of E. Keown > So another question. Are the incorrect standard > combining classes for the Tiberian Hebrew diacritics > only part of Unicode, or are they also part of ISO > 10646? ISO/IEC 10646 does not define normalization, canonical ordering, or canonical combining classes. 
Peter Constable From rosennej@qsm.co.il Thu Jan 13 14:15:07 2005 Received: with ECARTIS (v1.0.0; list hebrew); Thu, 13 Jan 2005 14:40:21 -0600 (CST) Received: from mx-out.daemonmail.net (mx-out.daemonmail.net [216.104.160.39]) by unicode.org (8.12.11/8.12.11) with ESMTP id j0DKF54h031914; Thu, 13 Jan 2005 14:15:07 -0600 Received: from localhost.daemonmail.net (localhost.daemonmail.net [127.0.0.1]) by mx-out.daemonmail.net (8.12.9p2/8.12.9) with SMTP id j0DKF3nJ026651; Thu, 13 Jan 2005 12:15:03 -0800 (PST) (envelope-from rosennej@qsm.co.il) Received: from [217.132.7.113] (via account 11756) by mx-out.daemonmail.net with ESMTP id gv60C0D2 authenticated by POP; Thu, 13 Jan 2005 12:15:00 -0800 (PST) From: "Jony Rosenne" To: Cc: Subject: [hebrew] Re: Hebrew combining classes (was ISO 10646 compliance and EU law) Date: Thu, 13 Jan 2005 22:15:01 +0200 Message-ID: <000301c4f9ac$91cd0480$0701c80a@QSM7> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook, Build 10.0.6626 In-Reply-To: <20050113182046.10879.qmail@web53805.mail.yahoo.com> X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2527 Importance: Normal Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by unicode.org id j0DKF54h031914 X-archive-position: 2906 X-Approved-By: jcowan@reutershealth.com X-ecartis-version: Ecartis v1.0.0 Sender: hebrew-bounce@unicode.org Errors-to: hebrew-bounce@unicode.org X-original-sender: rosennej@qsm.co.il Precedence: bulk X-list: hebrew > -----Original Message----- > From: unicode-bounce@unicode.org > [mailto:unicode-bounce@unicode.org] On Behalf Of E. Keown > Sent: Thursday, January 13, 2005 8:21 PM > To: Antoine Leca; unicode@unicode.org > Cc: E. Keown; hebrew@unicode.org > Subject: Re: ISO 10646 compliance and EU law > > > Elaine Keown > Seattle again > > Hi, > > Thanks to all who took the trouble to write me back. 
> I gathered from what you wrote that there is more > conceptual 'distance' than I realized between Unicode > and ISO 10646. > > It appeared to me that it's possible to be ISO > 10646-compliant but, at the same time, to *not* be > Unicode-compliant, since apparently Unicode has > greater innate complexity. > > So another question. Are the incorrect standard > combining classes for the Tiberian Hebrew diacritics > only part of Unicode, or are they also part of ISO > 10646? I object to the term "incorrect" relative to the "standard combining classes for the Tiberian Hebrew diacritics". Possibly they are not what some people would want them to be, but that does not make them incorrect. Could we please be civil? > ....thanks for all help--Elaine > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > > > From petercon@microsoft.com Thu Jan 13 14:50:34 2005 Received: with ECARTIS (v1.0.0; list hebrew); Thu, 13 Jan 2005 15:06:11 -0600 (CST) Received: from mail2.microsoft.com (mail2.microsoft.com [131.107.3.124]) by unicode.org (8.12.11/8.12.11) with ESMTP id j0DKoPsA003910; Thu, 13 Jan 2005 14:50:34 -0600 Received: from mailout2.microsoft.com ([157.54.1.120]) by mail2.microsoft.com with Microsoft SMTPSVC(6.0.3790.211); Thu, 13 Jan 2005 12:50:19 -0800 Received: from RED-MSG-52.redmond.corp.microsoft.com ([157.54.12.12]) by mailout2.microsoft.com with Microsoft SMTPSVC(6.0.3790.211); Thu, 13 Jan 2005 12:50:18 -0800 X-MimeOLE: Produced By Microsoft Exchange V6.5.7226.0 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Subject: [hebrew] Re: Hebrew combining classes (was ISO 10646 compliance and EU law) Date: Thu, 13 Jan 2005 12:50:16 -0800 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: Hebrew combining classes (was ISO 10646 compliance and EU law) Thread-Index: AcT5r9SWcPJokf6IR6C3Bd2+46UDmAAATgYA 
From: "Peter Constable" To: "Jony Rosenne" , Cc: X-OriginalArrivalTime: 13 Jan 2005 20:50:18.0956 (UTC) FILETIME=[7DDFC0C0:01C4F9B1] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by unicode.org id j0DKoPsA003910 X-archive-position: 2907 X-Approved-By: jcowan@reutershealth.com X-ecartis-version: Ecartis v1.0.0 Sender: hebrew-bounce@unicode.org Errors-to: hebrew-bounce@unicode.org X-original-sender: petercon@microsoft.com Precedence: bulk X-list: hebrew > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf > Of Jony Rosenne > I object to the term "incorrect" relative to the "standard combining classes > for the Tiberian Hebrew diacritics"... > Could we please be civil? We all know that there are different opinions regarding the appropriateness of "incorrect" or other such descriptions when applied to this issue, but I don't think any incivility was at all implied or intended by Elaine's choice of words. Peter Constable From tiro@tiro.com Thu Jan 13 17:30:27 2005 Received: with ECARTIS (v1.0.0; list hebrew); Thu, 13 Jan 2005 18:23:46 -0600 (CST) Received: from priv-edtnes40.telusplanet.net (outbound05.telus.net [199.185.220.224]) by unicode.org (8.12.11/8.12.11) with ESMTP id j0DNUP1i026593 for ; Thu, 13 Jan 2005 17:30:26 -0600 Received: from [64.180.191.178] by priv-edtnes40.telusplanet.net (InterMail vM.6.01.04.00 201-2131-118-20041027) with ESMTP id <20050113233019.IRND19857.priv-edtnes40.telusplanet.net@[64.180.191.178]>; Thu, 13 Jan 2005 16:30:19 -0700 Message-ID: <41E70487.9030602@tiro.com> Date: Thu, 13 Jan 2005 15:30:15 -0800 From: John Hudson User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206) X-Accept-Language: en-us, en MIME-Version: 1.0 To: Jony Rosenne CC: Eilidh Mackeown , hebrew@unicode.org Subject: [hebrew] Re: Hebrew combining classes (was ISO 10646 compliance and EU law) References: <000301c4f9ac$91cd0480$0701c80a@QSM7> In-Reply-To: <000301c4f9ac$91cd0480$0701c80a@QSM7> Content-Type: 
text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 2908 X-Approved-By: jcowan@reutershealth.com X-ecartis-version: Ecartis v1.0.0 Sender: hebrew-bounce@unicode.org Errors-to: hebrew-bounce@unicode.org X-original-sender: tiro@tiro.com Precedence: bulk X-list: hebrew Jony Rosenne wrote: > I object to the term "incorrect" relative to the "standard combining classes > for the Tiberian Hebrew diacritics". > > Possibly they are not what some people would want them to be, but that does > not make them incorrect. > > Could we please be civil? I don't think Elaine intended to be uncivil. It is not as if she personally insulted anyone. In section 4.3 of the Unicode Standard, it states: The canonical order of character sequences does *not* imply any kind of linguistic correctness or linguistic preference for ordering of combining marks in sequence. The need for such a statement, to allay possible criticisms of preferential assignments favouring rendering or collation of particular languages, is obvious, but the statement more generally makes it difficult to say with certainty that any aspect of the canonical ordering is 'incorrect' per se. So perhaps that word should indeed be avoided. There are perfectly valid criticisms to be made that certain canonical class assignments are far from optimal for many purposes, create potential for ambiguities in text that need to be addressed e.g. by inserting the CGJ character to prevent reordering, and in some cases make it impossible to render normalised sequences with any existing font or layout technology. These are significant problems that might prompt anyone to conclude that the assignments are 'incorrect', without intending any lack of civility. 
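A minimal sketch of the reordering and the CGJ workaround mentioned above, using only Python's standard `unicodedata` module. The choice of lamed with patah and hiriq is an illustrative assumption (it is the kind of sequence usually cited for this problem), not something taken from this message. The combining class values come from whatever Unicode data ships with the Python build in use.

```python
import unicodedata

LAMED = "\u05DC"   # HEBREW LETTER LAMED (base letter, combining class 0)
PATAH = "\u05B7"   # HEBREW POINT PATAH
HIRIQ = "\u05B4"   # HEBREW POINT HIRIQ
CGJ = "\u034F"     # COMBINING GRAPHEME JOINER (combining class 0)

def classes(text):
    """Return the canonical combining class of each code point."""
    return [unicodedata.combining(ch) for ch in text]

# Author's intended order: patah first, then hiriq.
plain = LAMED + PATAH + HIRIQ
# Same sequence with a CGJ between the two points.
guarded = LAMED + PATAH + CGJ + HIRIQ

nfc_plain = unicodedata.normalize("NFC", plain)
nfc_guarded = unicodedata.normalize("NFC", guarded)

print("classes before:", classes(plain))
print("classes after: ", classes(nfc_plain))
print("order changed without CGJ:", nfc_plain != plain)
print("order kept with CGJ:      ", nfc_guarded == guarded)
```

Because hiriq has a lower class than patah, normalisation swaps the two points in the unguarded sequence. The CGJ has class 0, so it splits the run of marks and each side is ordered separately, which leaves the guarded sequence as typed.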
John Hudson -- Tiro Typeworks www.tiro.com Vancouver, BC tiro@tiro.com Currently reading: The peasant of the Garonne, by Jacques Maritain The meaning of everything, by Simon Winchester From k_isoetc@yahoo.com Thu Jan 13 12:21:07 2005 Received: with ECARTIS (v1.0.0; list hebrew); Fri, 14 Jan 2005 14:39:04 -0600 (CST) Received: from web53805.mail.yahoo.com (web53805.mail.yahoo.com [206.190.36.200]) by unicode.org (8.12.11/8.12.11) with SMTP id j0DIKvpL030991 for ; Thu, 13 Jan 2005 12:21:07 -0600 Received: (qmail 10881 invoked by uid 60001); 13 Jan 2005 18:20:46 -0000 Comment: DomainKeys? See http://antispam.yahoo.com/domainkeys DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com; b=wx21IqWr7KQ3r59zluzVxWomJF88Jd15yg98ezXKOg66SRp5iA8pz8e2M0kp8efQbnUTyR53/MnJeEOHGvc+8p3IVCsnJlMWlWTjZpJwc9uxSsQN5a/qbOUzpw/yfJ8Suy3hdsPGH87bYojgqM7RIh3AuvYEZIKQcMGC5g4x3Lg= ; Message-ID: <20050113182046.10879.qmail@web53805.mail.yahoo.com> Received: from [24.18.170.151] by web53805.mail.yahoo.com via HTTP; Thu, 13 Jan 2005 10:20:46 PST Date: Thu, 13 Jan 2005 10:20:46 -0800 (PST) From: "E. Keown" Subject: [hebrew] Re: ISO 10646 compliance and EU law To: Antoine Leca , unicode@unicode.org Cc: "E. Keown" , hebrew@unicode.org In-Reply-To: <009301c4ec03$e6438c20$46c86464@arcesa.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 2909 X-Approved-By: jcowan@reutershealth.com X-ecartis-version: Ecartis v1.0.0 Sender: hebrew-bounce@unicode.org Errors-to: hebrew-bounce@unicode.org X-original-sender: k_isoetc@yahoo.com Precedence: bulk X-list: hebrew Elaine Keown Seattle again Hi, Thanks to all who took the trouble to write me back. I gathered from what you wrote that there is more conceptual 'distance' than I realized between Unicode and ISO 10646. It appeared to me that it's possible to be ISO 10646-compliant but, at the same time, to *not* be Unicode-compliant, since apparently Unicode has greater innate complexity. So another question. 
Are the incorrect standard combining classes for the Tiberian Hebrew diacritics only part of Unicode, or are they also part of ISO 10646? ....thanks for all help--Elaine __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From k_isoetc@yahoo.com Fri Jan 14 20:55:50 2005 Received: with ECARTIS (v1.0.0; list hebrew); Fri, 14 Jan 2005 21:01:53 -0600 (CST) Received: from web53804.mail.yahoo.com (web53804.mail.yahoo.com [206.190.36.199]) by unicode.org (8.12.11/8.12.11) with SMTP id j0F2tmOh005985 for ; Fri, 14 Jan 2005 20:55:50 -0600 Received: (qmail 22934 invoked by uid 60001); 15 Jan 2005 02:55:42 -0000 Comment: DomainKeys? See http://antispam.yahoo.com/domainkeys DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com; b=DC4jq3z2sY7MtWF/BuP0G0lg7AHWTmlFTp1Sxp4YuBfaMvWbta3RCYnTIazD4vbxqkWf6lkESEFcLz+Kl4mDXeE1xSdn6X9qKa41eP7yqKk/12tdzuycM+sVa2uXdb6kfgY3YONfQTE4cMjxg3Ls83JXtpIymy9mR6HKzh5a2QI= ; Message-ID: <20050115025542.22932.qmail@web53804.mail.yahoo.com> Received: from [24.18.170.151] by web53804.mail.yahoo.com via HTTP; Fri, 14 Jan 2005 18:55:42 PST Date: Fri, 14 Jan 2005 18:55:42 -0800 (PST) From: "E. Keown" Subject: [hebrew] Re: Hebrew combining classes (was ISO 10646 compliance and EU law) To: rosennej@qsm.co.il, hebrew@unicode.org MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 2910 X-Approved-By: cowan@ccil.org X-ecartis-version: Ecartis v1.0.0 Sender: hebrew-bounce@unicode.org Errors-to: hebrew-bounce@unicode.org X-original-sender: k_isoetc@yahoo.com Precedence: bulk X-list: hebrew Elaine in Seattle Dear Jony, John and List: I did not _intend_ my remark on the combining classes to be uncivil. I intended it to be short, dry, and factual. I tried to be brief, since the main Unicode listers would probably respond quickly, and they have limited interest in Hebrew. 
In 15 months I hope to start uploading the Tanakh to some computer code or another. I'm not interested in uploading it twice!

Obviously, I am not going to use the current Hebrew block precisely as written. I have to decide what on earth to do between now and then.

So I have a lot of technical questions to solve, and soon. There are many, many things I do not understand.

Eighteen months ago they conclusively showed, in a large software company, that the existing Unicode Hebrew combining classes, when applied to 'ivrit mikrait,' slow the software up even in a monolingual situation. Since I hope to produce trilingual parallel text software, obviously it won't work for that, if it's not good enough for the monolingual work.

Since then, very skillful font designers have produced 'alternative combining classes' for ivrit mikrait. So I expect to use those--thank God. One problem probably solved.....

But I want to deviate as little as possible from one of the standards. It's not smart to cavalierly deviate.....the *tiniest* functional deviation is the way to go....Shabbat shalom!---Elaine

Elaine wrote:
> So another question. Are the incorrect standard
> combining classes for the Tiberian Hebrew diacritics
> only part of Unicode, or are they also part of ISO
> 10646?

Jony Rosenne wrote:
I object to the term "incorrect" relative to the "standard combining classes for the Tiberian Hebrew diacritics".

Possibly they are not what some people would want them to be, but that does not make them incorrect.

Could we please be civil?

__________________________________
Do you Yahoo!?
Yahoo! Mail - Helps protect you from nasty viruses.
http://promotions.yahoo.com/new_mail From cowan@ccil.org Fri Jan 14 21:03:19 2005 Received: with ECARTIS (v1.0.0; list hebrew); Fri, 14 Jan 2005 21:03:19 -0600 (CST) Received: from mercury.ccil.org (mercury.ccil.org [192.190.237.100]) by unicode.org (8.12.11/8.12.11) with ESMTP id j0F33Gam007358 for ; Fri, 14 Jan 2005 21:03:19 -0600 Received: from cowan by mercury.ccil.org with local (Exim 4.34) id 1CpeDc-0007jI-N7 for hebrew@unicode.org; Fri, 14 Jan 2005 22:03:16 -0500 Date: Fri, 14 Jan 2005 22:03:16 -0500 To: hebrew@unicode.org Subject: [hebrew] ADMIN: posting delay on hebrew@unicode.org Message-ID: <20050115030316.GB28092@ccil.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.28i From: John Cowan X-archive-position: 2911 X-Approved-By: cowan@ccil.org X-ecartis-version: Ecartis v1.0.0 Sender: hebrew-bounce@unicode.org Errors-to: hebrew-bounce@unicode.org X-original-sender: cowan@ccil.org Precedence: bulk X-list: hebrew There will be no further postings approved on hebrew@unicode.org for the next 12-24 hours. There may be additional delays after that until Tuesday morning, New York time. -- We pledge allegiance to the penguin John Cowan and to the intellectual property regime cowan@ccil.org for which he stands, one world under http://www.ccil.org/~cowan Linux, with free music and open source http://www.reutershealth.com software for all. 
--Julian Dibbell on Brazil, edited From peterkirk@qaya.org Sat Jan 15 08:23:20 2005 Received: with ECARTIS (v1.0.0; list hebrew); Sat, 15 Jan 2005 11:34:12 -0600 (CST) Received: from pan.hu-pan.com (hu-pan.com [67.15.6.3]) by unicode.org (8.12.11/8.12.11) with ESMTP id j0FENGmH024664 for ; Sat, 15 Jan 2005 08:23:20 -0600 Received: from 213-162-124-237.peterk253.adsl.metronet.co.uk ([213.162.124.237] helo=[10.0.0.1]) by pan.hu-pan.com with esmtpa (Exim 4.43) id 1Cpope-0007mA-2n; Sat, 15 Jan 2005 14:23:14 +0000 Received: from 127.0.0.1 (AVG SMTP 7.0.300 [265.6.12]); Sat, 15 Jan 2005 14:23:05 +0000 Message-ID: <41E92749.8000604@qaya.org> Date: Sat, 15 Jan 2005 14:23:05 +0000 From: Peter Kirk User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.3) Gecko/20040910 X-Accept-Language: en-gb, en, en-us, az, ru, tr, he, el, fr, de To: "E. Keown" CC: hebrew@unicode.org Subject: [hebrew] Re: Hebrew combining classes (was ISO 10646 compliance and EU law) References: <20050115025542.22932.qmail@web53804.mail.yahoo.com> In-Reply-To: <20050115025542.22932.qmail@web53804.mail.yahoo.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii; format=flowed X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - pan.hu-pan.com X-AntiAbuse: Original Domain - unicode.org X-AntiAbuse: Originator/Caller UID/GID - [0 0] / [47 12] X-AntiAbuse: Sender Address Domain - qaya.org X-Source: X-Source-Args: X-Source-Dir: X-archive-position: 2912 X-Approved-By: jcowan@reutershealth.com X-ecartis-version: Ecartis v1.0.0 Sender: hebrew-bounce@unicode.org Errors-to: hebrew-bounce@unicode.org X-original-sender: peterkirk@qaya.org Precedence: bulk X-list: hebrew On 15/01/2005 02:55, E. Keown wrote: >... > >In 15 months I hope to start uploading the Tanakh to >some computer code or another. I'm not interested in >uploading it twice! 
> >Obviously, I am not going to use the current Hebrew >block precisely as written. I have to decide what on >earth to do between now and then. > >... > >Since then, very skillful font designers have produced >'alternative combining classes' for ivrit mikrait. So >I expect to use those--thank God. One problem >probably solved..... > >But I want to deviate as little as possible from one >of the standards. > > > Elaine, the good news for you is that if you order your Unicode Hebrew text according to these 'alternative combining classes' you will not be deviating at all from the Unicode standard. Your text will not be normalised in any of the standard normalisation forms, but the standard nowhere specifies that texts must be normalised. Of course you need to ensure that your text is not normalised by other processes, or that if it is you then restore it to the order of the 'alternative combining classes' - a process which should be reversible. -- Peter Kirk peter@qaya.org (personal) peterkirk@qaya.org (work) http://www.qaya.org/ -- No virus found in this outgoing message. Checked by AVG Anti-Virus. 
Version: 7.0.300 / Virus Database: 265.6.12 - Release Date: 14/01/2005 From petercon@microsoft.com Sat Jan 15 13:53:55 2005 Received: with ECARTIS (v1.0.0; list hebrew); Sat, 15 Jan 2005 16:20:47 -0600 (CST) Received: from mail3.microsoft.com (mail3.microsoft.com [131.107.3.123]) by unicode.org (8.12.11/8.12.11) with ESMTP id j0FJrrAI019996 for ; Sat, 15 Jan 2005 13:53:55 -0600 Received: from mailout1.microsoft.com ([157.54.1.117]) by mail3.microsoft.com with Microsoft SMTPSVC(6.0.3790.211); Sat, 15 Jan 2005 11:53:48 -0800 Received: from RED-MSG-52.redmond.corp.microsoft.com ([157.54.12.12]) by mailout1.microsoft.com with Microsoft SMTPSVC(6.0.3790.1289); Sat, 15 Jan 2005 11:53:49 -0800 X-MimeOLE: Produced By Microsoft Exchange V6.5.7226.0 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Subject: [hebrew] Re: Hebrew combining classes (was ISO 10646 compliance and EU law) Date: Sat, 15 Jan 2005 11:53:42 -0800 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: [hebrew] Re: Hebrew combining classes (was ISO 10646 compliance and EU law) Thread-Index: AcT7KRffesc+gpbrT7qlSRmLAHBPtgAEo/cA From: "Peter Constable" To: "Peter Kirk" , "E. Keown" Cc: X-OriginalArrivalTime: 15 Jan 2005 19:53:49.0521 (UTC) FILETIME=[EE708410:01C4FB3B] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by unicode.org id j0FJrrAI019996 X-archive-position: 2913 X-Approved-By: jcowan@reutershealth.com X-ecartis-version: Ecartis v1.0.0 Sender: hebrew-bounce@unicode.org Errors-to: hebrew-bounce@unicode.org X-original-sender: petercon@microsoft.com Precedence: bulk X-list: hebrew > From: hebrew-bounce@unicode.org [mailto:hebrew-bounce@unicode.org] On > Behalf Of Peter Kirk > Of course you need to > ensure that your text is not normalised by other processes, or that if > it is you then restore it to the order of the 'alternative combining > classes' - a process which should be reversible. 
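A rough sketch of what "restoring the order of the alternative combining classes" could look like in practice. The `ALT_CLASS` table below is hypothetical and invented purely for illustration; it is not the set of values from any published proposal, and the function names are likewise assumptions. The idea is the same stable sort that canonical ordering uses, applied to each run of combining marks but keyed on the custom table.

```python
import unicodedata

SHIN = "\u05E9"      # HEBREW LETTER SHIN
QAMATS = "\u05B8"    # HEBREW POINT QAMATS
DAGESH = "\u05BC"    # HEBREW POINT DAGESH OR MAPIQ
SHIN_DOT = "\u05C1"  # HEBREW POINT SHIN DOT
SIN_DOT = "\u05C2"   # HEBREW POINT SIN DOT

# HYPOTHETICAL custom classes (illustration only): marks that modify the
# consonant sort before vowels. Unlisted marks keep their standard class.
ALT_CLASS = {SHIN_DOT: 5, SIN_DOT: 5, DAGESH: 6}

def alt_class(ch):
    """Custom sort key; base characters (standard class 0) stay at 0."""
    std = unicodedata.combining(ch)
    return 0 if std == 0 else ALT_CLASS.get(ch, std)

def reorder_marks(text, key=alt_class):
    """Stable-sort each run of combining marks by the given class key."""
    out, run = [], []
    for ch in text:
        if key(ch) == 0:
            out.extend(sorted(run, key=key))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=key))
    return "".join(out)

typed = SHIN + SHIN_DOT + DAGESH + QAMATS
standard = unicodedata.normalize("NFD", typed)   # standard canonical order
custom = reorder_marks(standard)                 # custom order restored

print([unicodedata.name(c) for c in standard])
print([unicodedata.name(c) for c in custom])
```

For this sequence the round trip works in both directions: normalising `custom` gives `standard` again, and reordering `standard` gives `custom`. That will not hold in general if the custom table gives the same value to two marks whose standard classes differ, since their original relative order is then lost once the text has been normalised.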
You would need to restore it to some alternate-canonical order only in case you have to use some process that requires that order. Peter Constable From tiro@tiro.com Sat Jan 15 16:49:37 2005 Received: with ECARTIS (v1.0.0; list hebrew); Sat, 15 Jan 2005 22:17:23 -0600 (CST) Received: from priv-edtnes40.telusplanet.net (outbound05.telus.net [199.185.220.224]) by unicode.org (8.12.11/8.12.11) with ESMTP id j0FMnYoC010026 for ; Sat, 15 Jan 2005 16:49:37 -0600 Received: from [64.180.191.178] by priv-edtnes40.telusplanet.net (InterMail vM.6.01.04.00 201-2131-118-20041027) with ESMTP id <20050115224926.FDMO14143.priv-edtnes40.telusplanet.net@[64.180.191.178]>; Sat, 15 Jan 2005 15:49:26 -0700 Message-ID: <41E99DF2.10807@tiro.com> Date: Sat, 15 Jan 2005 14:49:22 -0800 From: John Hudson User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206) X-Accept-Language: en-us, en MIME-Version: 1.0 To: Peter Constable CC: Peter Kirk , "E. Keown" , hebrew@unicode.org Subject: [hebrew] Re: Hebrew combining classes (was ISO 10646 compliance and EU law) References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 2914 X-Approved-By: jcowan@reutershealth.com X-ecartis-version: Ecartis v1.0.0 Sender: hebrew-bounce@unicode.org Errors-to: hebrew-bounce@unicode.org X-original-sender: tiro@tiro.com Precedence: bulk X-list: hebrew Peter Constable wrote: >>Of course you need to >>ensure that your text is not normalised by other processes, or that if >>it is you then restore it to the order of the 'alternative combining >>classes' - a process which should be reversible. > You would need to restore it to some alternate-canonical order only in > case you have to use some process that requires that order. Depending on how the software interacts with the system, e.g. 
whether it uses a system engine for Hebrew mark positioning and whether that engine orders marks for display purposes during rendering, it may be necessary to re-order normalised sequences in order to facilitate display. Elaine is interested in text analysis, e.g. searches and comparisons, so probably wants to work with customised normalisations. This is the issue that Libronix was dealing with when we started looking at custom combining classes for Hebrew marks: they need to be able to do search comparisons, so need some form of normalisation, but also needed to be able to display the results in a reliable and unambiguous way. John Hudson -- Tiro Typeworks www.tiro.com Vancouver, BC tiro@tiro.com Currently reading: The peasant of the Garonne, by Jacques Maritain The meaning of everything, by Simon Winchester From peterkirk@qaya.org Wed Jan 19 07:22:26 2005 Received: with ECARTIS (v1.0.0; list hebrew); Wed, 19 Jan 2005 07:38:32 -0600 (CST) Received: from pan.hu-pan.com (hu-pan.com [67.15.6.3]) by unicode.org (8.12.11/8.12.11) with ESMTP id j0JDMOIH023037 for ; Wed, 19 Jan 2005 07:22:26 -0600 Received: from 213-162-124-237.peterk253.adsl.metronet.co.uk ([213.162.124.237] helo=[10.0.0.1]) by pan.hu-pan.com with esmtpa (Exim 4.43) id 1CrFmu-0007cD-Uw; Wed, 19 Jan 2005 13:22:21 +0000 Received: from 127.0.0.1 (AVG SMTP 7.0.300 [265.7.0]); Wed, 19 Jan 2005 13:22:34 +0000 Message-ID: <41EE5F1A.3080309@qaya.org> Date: Wed, 19 Jan 2005 13:22:34 +0000 From: Peter Kirk User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.3) Gecko/20040910 X-Accept-Language: en-gb, en, en-us, az, ru, tr, he, el, fr, de To: Jony Rosenne CC: hebrew@unicode.org Subject: [hebrew] Re: Hebrew combining classes, and other Unicode 4.1.0 changes References: <000301c4f9ac$91cd0480$0701c80a@QSM7> In-Reply-To: <000301c4f9ac$91cd0480$0701c80a@QSM7> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii; format=flowed X-AntiAbuse: This header was added to 
track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - pan.hu-pan.com X-AntiAbuse: Original Domain - unicode.org X-AntiAbuse: Originator/Caller UID/GID - [0 0] / [47 12] X-AntiAbuse: Sender Address Domain - qaya.org X-Source: X-Source-Args: X-Source-Dir: X-archive-position: 2915 X-Approved-By: jcowan@reutershealth.com X-ecartis-version: Ecartis v1.0.0 Sender: hebrew-bounce@unicode.org Errors-to: hebrew-bounce@unicode.org X-original-sender: peterkirk@qaya.org Precedence: bulk X-list: hebrew On 13/01/2005 20:15, Jony Rosenne wrote: > ... > > >I object to the term "incorrect" relative to the "standard combining classes >for the Tiberian Hebrew diacritics". > >Possibly they are not what some people would want them to be, but that does >not make them incorrect. > > > I note the following text has been added to the beta version of Unicode 4.1.0, in the context of a revised discussion of CGJ, see http://www.unicode.org/versions/Unicode4.1.0/: > the less-than-optimal assignment of fixed-position combining classes > to certain Hebrew accents and marks which do in fact interact > typographically This seems good wording to me. Jony, are you happy with this? If not, you have a few days left to make an official objection. A section on Meteg will also be added to Unicode 4.1.0, but this text has not yet been drafted. I hope there will be an opportunity to review any such text. There is also text describing each of the proposed new characters. I would anticipate some objections to the description of qamats qatan, especially that it is used only in biblical texts, but I will leave it to others to clarify this point. Again, just a few days remain to comment on this, as the deadline for substantive comments is 31st January. -- Peter Kirk peter@qaya.org (personal) peterkirk@qaya.org (work) http://www.qaya.org/ -- No virus found in this outgoing message. Checked by AVG Anti-Virus. 
Version: 7.0.300 / Virus Database: 265.7.0 - Release Date: 17/01/2005 From jcowan@reutershealth.com Wed Jan 19 13:08:06 2005 Received: with ECARTIS (v1.0.0; list hebrew); Wed, 19 Jan 2005 13:08:06 -0600 (CST) Received: from ratanakiri.reutershealth.com ([65.246.141.37]) by unicode.org (8.12.11/8.12.11) with ESMTP id j0JJ859E006876 for ; Wed, 19 Jan 2005 13:08:06 -0600 Received: from skunk.reutershealth.com (ratanakiri [65.246.141.37]) by ratanakiri.reutershealth.com (8.13.1/8.13.1) with SMTP id j0JJ7x1g004529; Wed, 19 Jan 2005 14:07:59 -0500 (EST) Received: by skunk.reutershealth.com (sSMTP sendmail emulation); Wed, 19 Jan 2005 14:08:14 -0500 Date: Wed, 19 Jan 2005 14:08:14 -0500 From: John Cowan To: "Hart, Edwin F." Cc: hebrew@unicode.org Subject: [hebrew] Re: Hebrew combining classes, and other Unicode 4.1.0 changes Message-ID: <20050119190814.GK3213@skunk.reutershealth.com> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.1i X-archive-position: 2916 X-Approved-By: jcowan@reutershealth.com X-ecartis-version: Ecartis v1.0.0 Sender: hebrew-bounce@unicode.org Errors-to: hebrew-bounce@unicode.org X-original-sender: jcowan@reutershealth.com Precedence: bulk X-list: hebrew Hart, Edwin F. scripsit: > Given that several of us are not experts in Hebrew, I assume that no one > has produced a paper on updating the combining class assignments for > Hebrew marks and anything else that is required. Can one of the experts > do this so that we can update Unicode as appropriate? Combining class assignments are immutable, because normalization (which is immutable) depends on them. -- "By Elbereth and Luthien the Fair, you shall jcowan@reutershealth.com have neither the Ring nor me!" 
--Frodo http://www.ccil.org/~cowan From peterkirk@qaya.org Wed Jan 19 14:30:46 2005 Received: with ECARTIS (v1.0.0; list hebrew); Wed, 19 Jan 2005 14:55:16 -0600 (CST) Received: from pan.hu-pan.com (hu-pan.com [67.15.6.3]) by unicode.org (8.12.11/8.12.11) with ESMTP id j0JKUjf3020313 for ; Wed, 19 Jan 2005 14:30:46 -0600 Received: from 213-162-124-237.peterk253.adsl.metronet.co.uk ([213.162.124.237] helo=[10.0.0.1]) by pan.hu-pan.com with esmtpa (Exim 4.43) id 1CrMTT-0004Qa-QT; Wed, 19 Jan 2005 20:30:44 +0000 Received: from 127.0.0.1 (AVG SMTP 7.0.300 [265.7.0]); Wed, 19 Jan 2005 20:30:42 +0000 Message-ID: <41EEC372.3080709@qaya.org> Date: Wed, 19 Jan 2005 20:30:42 +0000 From: Peter Kirk User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.3) Gecko/20040910 X-Accept-Language: en-gb, en, en-us, az, ru, tr, he, el, fr, de To: "Hart, Edwin F." CC: hebrew@unicode.org Subject: [hebrew] Re: Hebrew combining classes, and other Unicode 4.1.0 changes References: <20050119190814.GK3213@skunk.reutershealth.com> In-Reply-To: <20050119190814.GK3213@skunk.reutershealth.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii; format=flowed X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - pan.hu-pan.com X-AntiAbuse: Original Domain - unicode.org X-AntiAbuse: Originator/Caller UID/GID - [0 0] / [47 12] X-AntiAbuse: Sender Address Domain - qaya.org X-Source: X-Source-Args: X-Source-Dir: X-archive-position: 2917 X-Approved-By: jcowan@reutershealth.com X-ecartis-version: Ecartis v1.0.0 Sender: hebrew-bounce@unicode.org Errors-to: hebrew-bounce@unicode.org X-original-sender: peterkirk@qaya.org Precedence: bulk X-list: hebrew On 19/01/2005 19:08, John Cowan wrote: >Hart, Edwin F. 
scripsit: > > > >>Given that several of us are not experts in Hebrew, I assume that no one >>has produced a paper on updating the combining class assignments for >>Hebrew marks and anything else that is required. Can one of the experts >>do this so that we can update Unicode as appropriate? >> >> Several of us have done so at various times. The "alternative combining classes" previously mentioned here come from one such attempt. But we have always been told: > >Combining class assignments are immutable, because normalization (which is >immutable) depends on them. > > > Well, at least in the draft Unicode 4.1.0 Unicode admits to "less-than-optimal assignment", but they still refuse to correct their mistakes, because of the big meta-mistake which was to freeze normalisation so tightly. -- Peter Kirk peter@qaya.org (personal) peterkirk@qaya.org (work) http://www.qaya.org/ -- No virus found in this outgoing message. Checked by AVG Anti-Virus. Version: 7.0.300 / Virus Database: 265.7.0 - Release Date: 17/01/2005 From jcowan@reutershealth.com Wed Jan 19 14:58:58 2005 Received: with ECARTIS (v1.0.0; list hebrew); Wed, 19 Jan 2005 14:58:58 -0600 (CST) Received: from ratanakiri.reutershealth.com ([65.246.141.37]) by unicode.org (8.12.11/8.12.11) with ESMTP id j0JKwwsR014935 for ; Wed, 19 Jan 2005 14:58:58 -0600 Received: from skunk.reutershealth.com (ratanakiri [65.246.141.37]) by ratanakiri.reutershealth.com (8.13.1/8.13.1) with SMTP id j0JKwnfl005111; Wed, 19 Jan 2005 15:58:49 -0500 (EST) Received: by skunk.reutershealth.com (sSMTP sendmail emulation); Wed, 19 Jan 2005 15:58:55 -0500 Date: Wed, 19 Jan 2005 15:58:55 -0500 From: John Cowan To: John Hudson Cc: hebrew@unicode.org Subject: [hebrew] Re: Hebrew combining classes, and other Unicode 4.1.0 changes Message-ID: <20050119205855.GQ3213@skunk.reutershealth.com> References: <41EEBEEF.5050507@tiro.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: 
<41EEBEEF.5050507@tiro.com> User-Agent: Mutt/1.4.1i X-archive-position: 2918 X-Approved-By: jcowan@reutershealth.com X-ecartis-version: Ecartis v1.0.0 Sender: hebrew-bounce@unicode.org Errors-to: hebrew-bounce@unicode.org X-original-sender: jcowan@reutershealth.com Precedence: bulk X-list: hebrew John Hudson scripsit: > Microsoft recently bit the bullet and accepted the performance hit > necessary to do buffered re-ordering of Hebrew text during display, because > this is what the combining classes force them to do. This means that Hebrew > text processing becomes slower than it has any rational need to be, which > hurts Hebrew users. I can't agree with this characterization. It would be nice if the canonical order were also the efficient order, but a conformant rendering engine should not depend on the order in which combining marks of different classes are received, since Unicode defines this order as irrelevant to processing (and provides a canonical order at all only so that a canonical form for easy comparison exists). So Microsoft is now doing the Right Thing, not merely the necessary thing. > Given this, I think the proposed wording > 'less-than-optimal' is quite reasonable. I agree. -- "But I am the real Strider, fortunately," John Cowan he said, looking down at them with his face jcowan@reutershealth.com softened by a sudden smile. "I am Aragorn son http://www.ccil.org/~/cowan of Arathorn, and if by life or death I can http://www.reutershealth.com save you, I will." 
--LotR Book I Chapter 10 From tiro@tiro.com Wed Jan 19 15:11:45 2005 Received: with ECARTIS (v1.0.0; list hebrew); Wed, 19 Jan 2005 15:15:51 -0600 (CST) Received: from priv-edtnes40.telusplanet.net (outbound05.telus.net [199.185.220.224]) by unicode.org (8.12.11/8.12.11) with ESMTP id j0JLBir5022119 for ; Wed, 19 Jan 2005 15:11:45 -0600 Received: from [64.180.191.178] by priv-edtnes40.telusplanet.net (InterMail vM.6.01.04.00 201-2131-118-20041027) with ESMTP id <20050119211139.BJND607.priv-edtnes40.telusplanet.net@[64.180.191.178]>; Wed, 19 Jan 2005 14:11:39 -0700 Message-ID: <41EECD06.4030800@tiro.com> Date: Wed, 19 Jan 2005 13:11:34 -0800 From: John Hudson User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206) X-Accept-Language: en-us, en MIME-Version: 1.0 To: John Cowan CC: hebrew@unicode.org Subject: [hebrew] Re: Hebrew combining classes, and other Unicode 4.1.0 changes References: <41EEBEEF.5050507@tiro.com> <20050119205855.GQ3213@skunk.reutershealth.com> In-Reply-To: <20050119205855.GQ3213@skunk.reutershealth.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 2919 X-Approved-By: jcowan@reutershealth.com X-ecartis-version: Ecartis v1.0.0 Sender: hebrew-bounce@unicode.org Errors-to: hebrew-bounce@unicode.org X-original-sender: tiro@tiro.com Precedence: bulk X-list: hebrew John Cowan wrote: >>Microsoft recently bit the bullet and accepted the performance hit >>necessary to do buffered re-ordering of Hebrew text during display, because >>this is what the combining classes force them to do. This means that Hebrew >>text processing becomes slower than it has any rational need to be, which >>hurts Hebrew users. > I can't agree with this characterization. 
It would be nice if the canonical > order were also the efficient order, but a conformant rendering engine should > not depend on the order in which combining marks of different classes are > received, since Unicode defines this order as irrelevant to processing > (and provides a canonical order at all only so that a canonical form for > easy comparison exists). So Microsoft is now doing the Right Thing, not > merely the necessary thing. I agree entirely. But the Right Thing would not have required this particular necessary thing if the canonical combining classes for Hebrew marks had been assigned in a way that took into account the specific typographic interaction required. There is no benefit to the canonical order that would not also be available if the 'efficient order' had been used. Hence my disagreement with Jony about the use of the phrase 'less-than-optimal' to describe the canonical assignments. They could have been better. John Hudson -- Tiro Typeworks www.tiro.com Vancouver, BC tiro@tiro.com Currently reading: The peasant of the Garonne, by Jacques Maritain The meaning of everything, by Simon Winchester From petercon@microsoft.com Wed Jan 19 15:38:34 2005 Received: with ECARTIS (v1.0.0; list hebrew); Wed, 19 Jan 2005 16:18:38 -0600 (CST) Received: from mail3.microsoft.com (mail3.microsoft.com [131.107.3.123]) by unicode.org (8.12.11/8.12.11) with ESMTP id j0JLcXaZ030970 for ; Wed, 19 Jan 2005 15:38:34 -0600 Received: from mailout2.microsoft.com ([157.54.1.120]) by mail3.microsoft.com with Microsoft SMTPSVC(6.0.3790.211); Wed, 19 Jan 2005 13:38:30 -0800 Received: from RED-MSG-52.redmond.corp.microsoft.com ([157.54.12.12]) by mailout2.microsoft.com with Microsoft SMTPSVC(6.0.3790.211); Wed, 19 Jan 2005 13:38:28 -0800 X-MimeOLE: Produced By Microsoft Exchange V6.5.7226.0 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Subject: [hebrew] Re: Hebrew combining classes, and other Unicode 4.1.0 changes 
Date: Wed, 19 Jan 2005 13:38:34 -0800 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: [hebrew] Re: Hebrew combining classes, and other Unicode 4.1.0 changes Thread-Index: AcT+ai456R7mh34eSdy9HJ2TasVaMQAA8KSA From: "Peter Constable" To: "John Cowan" , "John Hudson" Cc: X-OriginalArrivalTime: 19 Jan 2005 21:38:28.0241 (UTC) FILETIME=[36802810:01C4FE6F] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by unicode.org id j0JLcXaZ030970 X-archive-position: 2920 X-Approved-By: jcowan@reutershealth.com X-ecartis-version: Ecartis v1.0.0 Sender: hebrew-bounce@unicode.org Errors-to: hebrew-bounce@unicode.org X-original-sender: petercon@microsoft.com Precedence: bulk X-list: hebrew > From: hebrew-bounce@unicode.org [mailto:hebrew-bounce@unicode.org] On Behalf > Of John Cowan > > Microsoft recently bit the bullet and accepted the performance hit > > necessary to do buffered re-ordering of Hebrew text during display, because > > this is what the combining classes force them to do. This means that Hebrew > > text processing becomes slower than it has any rational need to be, which > > hurts Hebrew users. > > I can't agree with this characterization. It would be nice if the canonical > order were also the efficient order, but a conformant rendering engine should > not depend on the order in which combining marks of different classes are > received, since Unicode defines this order as irrelevant to processing > (and provides a canonical order at all only so that a canonical form for > easy comparison exists). So Microsoft is now doing the Right Thing, not > merely the necessary thing. John Hudson's characterization is appropriate here. 
The point is that, while alternate equivalent orders are valid at the data/rendering-API interface it is not feasible for font implementers to support arbitrary orders when they do glyph processing for mark positioning: they must have glyphs presented to them in some pre-determined order or have a means of re-ordering themselves prior to doing glyph positioning. Thus, for font implementations, it is both the Right Thing and a necessary thing. For OpenType implementations, we have defined an ordering that font implementers can anticipate at the rendering engine/OpenType Layout interface. That means Uniscribe spends some cycles on ordering during the layout process. Peter Constable From peterkirk@qaya.org Wed Jan 19 16:22:13 2005 Received: with ECARTIS (v1.0.0; list hebrew); Wed, 19 Jan 2005 19:53:04 -0600 (CST) Received: from pan.hu-pan.com (hu-pan.com [67.15.6.3]) by unicode.org (8.12.11/8.12.11) with ESMTP id j0JMMDaB008671 for ; Wed, 19 Jan 2005 16:22:13 -0600 Received: from 213-162-124-237.peterk253.adsl.metronet.co.uk ([213.162.124.237] helo=[10.0.0.1]) by pan.hu-pan.com with esmtpa (Exim 4.43) id 1CrODI-00082U-Qs; Wed, 19 Jan 2005 22:22:09 +0000 Received: from 127.0.0.1 (AVG SMTP 7.0.300 [265.7.0]); Wed, 19 Jan 2005 22:22:07 +0000 Message-ID: <41EEDD8F.4080800@qaya.org> Date: Wed, 19 Jan 2005 22:22:07 +0000 From: Peter Kirk User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.3) Gecko/20040910 X-Accept-Language: en-gb, en, en-us, az, ru, tr, he, el, fr, de To: John Hudson CC: John Cowan , hebrew@unicode.org Subject: [hebrew] Re: Hebrew combining classes, and other Unicode 4.1.0 changes References: <41EEBEEF.5050507@tiro.com> <20050119205855.GQ3213@skunk.reutershealth.com> <41EECD06.4030800@tiro.com> In-Reply-To: <41EECD06.4030800@tiro.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii; format=flowed X-AntiAbuse: This header was added to track abuse, please include it with any abuse report 
X-AntiAbuse: Primary Hostname - pan.hu-pan.com X-AntiAbuse: Original Domain - unicode.org X-AntiAbuse: Originator/Caller UID/GID - [0 0] / [47 12] X-AntiAbuse: Sender Address Domain - qaya.org X-Source: X-Source-Args: X-Source-Dir: X-archive-position: 2921 X-Approved-By: jcowan@reutershealth.com X-ecartis-version: Ecartis v1.0.0 Sender: hebrew-bounce@unicode.org Errors-to: hebrew-bounce@unicode.org X-original-sender: peterkirk@qaya.org Precedence: bulk X-list: hebrew On 19/01/2005 21:11, John Hudson wrote: > ... > I agree entirely. But the Right Thing would not have required this > particular necessary thing if the canonical combining classes for > Hebrew marks had been assigned in a way that took into account the > specific typographic interaction required. There is no benefit to the > canonical order that would not also be available if the 'efficient > order' had been used. Hence my disagreement with Jony about the use of > the phrase 'less-than-optimal' to describe the canonical assignments. > They could have been better. > Well, even if the classes were perfect (and, even more hypothetically, everyone could agree on that!) such that normalised text is already in the optimal form for rendering, there could be no guarantee that every text presented for rendering is normalised. So there would still be a need for the reordering step introduced by Microsoft - which would in that case be equivalent to Unicode normalisation. This limits the efficiency advantage as the string must in any case be normalised, although presumably it is slightly more efficient to normalise a string which is in fact already normalised. Perhaps the more serious problem with the canonical class assignments is in the effect they have on collation. 
In particular, it is almost impossible to tailor collation according to the ordering preferred in most biblical Hebrew dictionaries which sort sin and shin separately at the top level, because this involves treating as a unit a pair of characters (shin and shin/sin dot) which can be separated by four other characters - potentially thousands of collation contractions would be needed to hack one's way round this. -- Peter Kirk peter@qaya.org (personal) peterkirk@qaya.org (work) http://www.qaya.org/ -- No virus found in this outgoing message. Checked by AVG Anti-Virus. Version: 7.0.300 / Virus Database: 265.7.0 - Release Date: 17/01/2005 From kenw@sybase.com Wed Jan 19 21:14:05 2005 Received: with ECARTIS (v1.0.0; list hebrew); Wed, 19 Jan 2005 22:41:51 -0600 (CST) Received: from inergen.sybase.com (inergen.sybase.com [192.138.151.43]) by unicode.org (8.12.11/8.12.11) with ESMTP id j0K3E4vB004700 for ; Wed, 19 Jan 2005 21:14:05 -0600 Received: from smtp2.sybase.com (sybgate2 [10.22.97.85]) by inergen.sybase.com with ESMTP id j0K3Duj22687; Wed, 19 Jan 2005 19:13:56 -0800 (PST) Received: from birdie.sybase.com (localhost [127.0.0.1]) by smtp2.sybase.com with ESMTP id TAA27245; Wed, 19 Jan 2005 19:13:55 -0800 (PST) Received: from birdie (birdie [10.22.85.43]) by birdie.sybase.com (8.11.6+Sun/8.11.6) with SMTP id j0K3Dte11823; Wed, 19 Jan 2005 19:13:55 -0800 (PST) Message-Id: <200501200313.j0K3Dte11823@birdie.sybase.com> Date: Wed, 19 Jan 2005 19:13:55 -0800 (PST) From: Kenneth Whistler Reply-To: Kenneth Whistler Subject: [hebrew] Re: Hebrew combining classes, and other Unicode 4.1.0 changes To: peterkirk@qaya.org Cc: hebrew@unicode.org, kenw@sybase.com MIME-Version: 1.0 Content-Type: TEXT/plain; charset=us-ascii Content-MD5: joIhtjEbtPBCMOQMcOQfSA== X-Mailer: dtmail 1.3.0 @(#)CDE Version 1.4.6_06 SunOS 5.8 sun4u sparc X-archive-position: 2922 X-Approved-By: jcowan@reutershealth.com X-ecartis-version: Ecartis v1.0.0 Sender: hebrew-bounce@unicode.org Errors-to: 
hebrew-bounce@unicode.org X-original-sender: kenw@sybase.com Precedence: bulk X-list: hebrew Peter, > Perhaps the more serious problem with the canonical class assignments is > in the effect they have on collation. In particular, it is almost > impossible to tailor collation according to the ordering preferred in > most biblical Hebrew dictionaries which sort sin and shin separately at > the top level, because this involves treating as a unit a pair of > characters (shin and shin/sin dot) which can be separated by four other > characters - potentially thousands of collation contractions would be > needed to hack one's way round this. "Almost impossible" only if one is insufficiently agile in how one approaches the problem. First of all, this is a misunderstanding of the fine points of handling combining mark weighting in the UCA. The relevant text of UCA is: ================================================================ Note: A combining mark in a string is called blocked if there is another combining mark of the same canonical combining class or zero between it and the last character of canonical combining class 0. S2.1 Find the longest initial substring S at each point that has a match in the table. S2.1.1 If there are any combining marks following S, process each combining mark C. S2.1.2 If C is not blocked, find if S + C has a match in the table. S2.1.3 If there is a match, replace S by S + C, and remove C. ================================================================ Now consider weighting of sin-dagesh versus shin-dagesh. You could represent these in text as: shin + sin-dot + dagesh and shin + shin-dot + dagesh. Now, when you normalize to NFD as the first step of UCA, you get, of course: shin + dagesh + sin-dot and shin + dagesh + shin-dot. Because ccc(dagesh)=21, ccc(sin-dot)=25, ccc(shin-dot)=24. And there could be other intervening points which also have canonical combining class < 24, and perhaps a sequence of them, as you indicated. But...
sin-dot and shin-dot are not *blocked*, in the sense defined by the algorithm, since dagesh is not a combining mark with the *same* canonical combining class as sin-dot or shin-dot, nor is it ccc=0. Therefore, if I have a tailoring in my collation weight table that gives a tailored primary weight to the sequences shin + sin-dot and shin + shin-dot, steps S2.1.2 and S2.1.3 in the algorithm will append that tailored primary weight to the collation key and remove the sin-dot or shin-dot (logically) from the input string being weighted. Then the dagesh would be matched against a secondary weight in the table (or be unweighted if ignored, depending on your table). And you would end up with the result you were after. No need for "thousands" of contractions in the table -- just the two obvious ones. That should get you the right answer for any fully conformant implementation of the Unicode Collation Algorithm, working on Hebrew text, whether prenormalized or not. You could also approach the problem differently, if your data is in a controlled corpus, as it might well be if you were developing a specialized dictionary sort (which could have requirements beyond what you would expect in an off-the-shelf implementation of collation). Under such circumstances, you could simply declare that your collation handling of sin and shin would be treated *as if* they were represented in sequences with a CGJ, thus: Then if your data is maintained *in corpus*, in the actual forms: or even uses the compatibility characters: Then a collation that weights shin + sin-dot or shin + shin-dot or the compatibility characters *without* normalizing the strings first would in fact be producing the same collation keys as application of the UCA to the strings including the CGJ *with* normalization. This would be an example of producing equivalent results by means of a slightly different strategy in the implementation.
And that falls within the bounds of the conformance to the algorithm, which only requires that "a conformant implementation shall replicate the same comparison of strings as those produced by Section 4 Main Algorithm", not that the exact steps of that algorithm be followed. If your claim is that your idealized text representation is with the CGJ, as shown above, then processing text strings without the CGJ, in the order shown above, without any fancy handling of combining mark sequences, will *replicate* the same string comparisons as if the full algorithm were applied to the strings actually containing CGJ's. So there are two approaches for you, both of which get you the reasonable results you are after, and neither of which requires you to crud up tailored collation weighting tables with thousands of contractions. So please don't cite Unicode collation as demonstrating that there is a "serious problem with the canonical class assignments". We can all agree that different canonical class assignments might have made things easier for some aspects of Biblical Hebrew text rendering, but the collation argument is simply a wash here.
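The first approach above can be illustrated with a rough Python sketch. It is only an illustration under stated assumptions, not a UCA implementation: `contract` is a hypothetical helper that mimics steps S2.1.1 to S2.1.3 for a single base letter, the "table" is just a set of strings standing in for tailored contraction entries, and no collation weights are computed. The only library calls are `unicodedata.normalize` and `unicodedata.combining` from the standard library.

```python
import unicodedata

SHIN = "\u05E9"      # HEBREW LETTER SHIN (ccc=0)
DAGESH = "\u05BC"    # HEBREW POINT DAGESH OR MAPIQ (ccc=21)
SHIN_DOT = "\u05C1"  # HEBREW POINT SHIN DOT (ccc=24)
SIN_DOT = "\u05C2"   # HEBREW POINT SIN DOT (ccc=25)
CGJ = "\u034F"       # COMBINING GRAPHEME JOINER (ccc=0)


def ccc(ch):
    """Canonical combining class of a single character."""
    return unicodedata.combining(ch)


def contract(text, table):
    """Hypothetical sketch of S2.1.1-S2.1.3 for a one-letter initial match.

    S starts as the first character (assumed to be in the table). Each
    following combining mark C is tried in turn: if no earlier remaining
    mark has the same ccc (C is not blocked) and S + C is in the table,
    C is appended to S and removed from the remainder. Scanning stops at
    the first ccc=0 character, which blocks everything after it.
    Returns (S, remaining characters).
    """
    s = text[0]
    rest = list(text[1:])
    i = 0
    while i < len(rest) and ccc(rest[i]) != 0:
        c = rest[i]
        blocked = any(ccc(m) == ccc(c) for m in rest[:i])
        if not blocked and s + c in table:
            s += c
            del rest[i]
        else:
            i += 1
    return s, "".join(rest)


# Assumption: a tailoring with just the two contractions discussed above.
TABLE = {SHIN, SHIN + SIN_DOT, SHIN + SHIN_DOT}

# NFD moves the dagesh (ccc=21) in front of the sin dot (ccc=25).
nfd = unicodedata.normalize("NFD", SHIN + SIN_DOT + DAGESH)
matched, leftover = contract(nfd, TABLE)

# A real CGJ (ccc=0) between the marks blocks the contraction.
with_cgj = unicodedata.normalize("NFD", SHIN + DAGESH + CGJ + SIN_DOT)
matched_cgj, leftover_cgj = contract(with_cgj, TABLE)
```

In this sketch the normalized string is shin, dagesh, sin dot; the sin dot is not blocked by the dagesh, so the shin + sin-dot contraction matches and the dagesh is left over to be weighted on its own. With an actual CGJ in between, only the bare shin matches.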
--Ken From smontagu@smontagu.org Fri Jan 21 06:34:25 2005 Received: with ECARTIS (v1.0.0; list hebrew); Fri, 21 Jan 2005 06:42:02 -0600 (CST) Received: from pillage.dreamhost.com (postfix@pillage.dreamhost.com [66.33.213.23]) by unicode.org (8.12.11/8.12.11) with ESMTP id j0LCYMLw005725 for ; Fri, 21 Jan 2005 06:34:25 -0600 Received: from [127.0.0.1] (l192-115-31-204.broadband.actcom.net.il [192.115.31.204]) by pillage.dreamhost.com (Postfix) with ESMTP id 63C5714969D for ; Fri, 21 Jan 2005 04:34:17 -0800 (PST) Message-ID: <41F0F6B9.8000002@smontagu.org> Date: Fri, 21 Jan 2005 14:34:01 +0200 From: Simon Montagu User-Agent: Mozilla Thunderbird 1.0RC1 (Windows/20041201) X-Accept-Language: en-us, en MIME-Version: 1.0 To: "'hebrew List'" Subject: [hebrew] Re: qamats qatan and other Unicode 4.1.0 changes Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 2923 X-Approved-By: jcowan@reutershealth.com X-ecartis-version: Ecartis v1.0.0 Sender: hebrew-bounce@unicode.org Errors-to: hebrew-bounce@unicode.org X-original-sender: smontagu@smontagu.org Precedence: bulk X-list: hebrew Peter Kirk wrote: > There is also text describing each of the proposed new characters. I > would anticipate some objections to the description of qamats qatan, > especially that it is used only in biblical texts, but I will leave it > to others to clarify this point. Again, just a few days remain to > comment on this, as the deadline for substantive comments is 31st January. > The text describing qamats qatan "Similarly, a number of publishers of Biblical texts have introduced a typographic distinction ..." could certainly do with rewording IMO. Will this become part of the prose of TUS? As far as I remember, many or even most of the examples in the qamats qatan proposal itself were from non-Biblical texts (prayerbooks). Does anyone have a reference to the proposal handy? 
From peterkirk@qaya.org Fri Jan 21 07:05:09 2005 Received: with ECARTIS (v1.0.0; list hebrew); Fri, 21 Jan 2005 08:32:51 -0600 (CST) Received: from pan.hu-pan.com (hu-pan.com [67.15.6.3]) by unicode.org (8.12.11/8.12.11) with ESMTP id j0LD56QL017234 for ; Fri, 21 Jan 2005 07:05:09 -0600 Received: from 213-162-124-237.peterk253.adsl.metronet.co.uk ([213.162.124.237] helo=[10.0.0.1]) by pan.hu-pan.com with esmtpa (Exim 4.43) id 1CryTI-0002jY-CM; Fri, 21 Jan 2005 13:05:04 +0000 Received: from 127.0.0.1 (AVG SMTP 7.0.300 [265.7.1]); Fri, 21 Jan 2005 13:04:54 +0000 Message-ID: <41F0FDF6.4050601@qaya.org> Date: Fri, 21 Jan 2005 13:04:54 +0000 From: Peter Kirk User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.3) Gecko/20040910 X-Accept-Language: en-gb, en, en-us, az, ru, tr, he, el, fr, de To: Kenneth Whistler CC: hebrew@unicode.org Subject: [hebrew] Re: Hebrew combining classes, and other Unicode 4.1.0 changes References: <200501200313.j0K3Dte11823@birdie.sybase.com> In-Reply-To: <200501200313.j0K3Dte11823@birdie.sybase.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii; format=flowed X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - pan.hu-pan.com X-AntiAbuse: Original Domain - unicode.org X-AntiAbuse: Originator/Caller UID/GID - [0 0] / [47 12] X-AntiAbuse: Sender Address Domain - qaya.org X-Source: X-Source-Args: X-Source-Dir: X-archive-position: 2924 X-Approved-By: jcowan@reutershealth.com X-ecartis-version: Ecartis v1.0.0 Sender: hebrew-bounce@unicode.org Errors-to: hebrew-bounce@unicode.org X-original-sender: peterkirk@qaya.org Precedence: bulk X-list: hebrew On 20/01/2005 03:13, Kenneth Whistler wrote: >Peter, > > > >>Perhaps the more serious problem with the canonical class assignments is >>in the effect they have on collation. 
In particular, it is almost >>impossible to tailor collation according to the ordering preferred in >>most biblical Hebrew dictionaries ... >> >> > >"Almost impossible" only if one is insufficiently agile in how one >approaches the problem. > >First of all, this is a misunderstanding of the fine points of >handling combining mark weighting in the UCA. ... > Thank you, Ken, for your explanation. I had not fully understood the effect of this part of the algorithm in allowing contractions to pass over a number of other combining characters - although not, I note, a real CGJ (ccc=0) rather than a notional one. Of course this complex algorithm (first normalising the shin/sin dot to the end of the combining character sequence then detecting the unblocked contraction and reordering again) reduces efficiency, but it does produce the correct result. I note that collation will also work correctly on a string "normalised" with the "alternative combining classes" at least if these are slightly adjusted to preserve canonical equivalence. The implication of this is that the ill effects of the less than optimal combining class allocations are less serious than many including myself have thought. These effects are it seems mainly reduced implementation efficiency, including both rendering and collation. -- Peter Kirk peter@qaya.org (personal) peterkirk@qaya.org (work) http://www.qaya.org/ -- No virus found in this outgoing message. Checked by AVG Anti-Virus. Version: 7.0.300 / Virus Database: 265.7.1 - Release Date: 19/01/2005