This page contains notes recovered from old email archives, pertinent to decisions taken at early Unicode Working Group meetings from 1989 through 1990. These are not official records or meeting minutes; rather, they are commentary by contemporary participants in those meetings, which sheds light on the kinds of decisions that were taken and how people understood them at the time. In some cases, proposed additions to the agendas of immediately upcoming meetings are also listed, on the presumption that those topics would have been discussed at those meetings.
Some of the entries are short attributed quotes snipped from the email archives. Other entries are longer quotes of entire email messages, including headers. These lengthy citations often provide more context about the decisions taken, and explain more about the thinking at the time as the drafts were developed. Some of the extended quotes have sections of text elided. Such elisions are indicated with "[...]".
From Joseph_D._Becker.OSBU_North@xerox.com Wed Jan 9 19:35:27 1991
Date: Wed, 9 Jan 1991 11:36:46 PST
From: becker.osbu_north@xerox.com
Subject: Unicode Conformance Clause Proposal
To: u-core%noddy@Sun.COM
Cc: becker.osbu_north@xerox.com
Mark wrote a draft on Unicode Conformance a while back, but it seemed to me
fairly discursive; I was imagining that we might want a concise normative
statement. So, below is a draft of such a conformance clause, for discussion
at this Friday's meeting.
----------------------------------------------------------------
CONFORMANCE
1. Interchange
Interchange refers to processes which transmit and receive (including store and
retrieve) sequences of text characters. The conformance requirement for a
Unicode system with regard to Interchange is:
> Except insofar as a system makes intentional changes in a character
sequence consistent with the Unicode definition of the character semantics
(e.g. case forcing, filtering, Compatibility Zone mapping), a conforming system
must be able to retransmit the text as the same sequence of numerical code
values that was received.
2. Presentation (aka Rendering)
Presentation refers to processes having access to fonts and other resources,
which take in sequences of text characters and produce a visible graphic
depiction of the text. The conformance requirement for a Unicode system with
regard to Presentation is:
> For any given character sequence that a conforming system is able to
render legibly, the graphic depiction must have a reading consistent with the
Unicode definition of the character semantics.
3. Interpretation
Interpretation refers to processes which take in sequences of text characters
and produce results based on the content of the text (e.g. spell-checking,
indexing, transliteration). The conformance requirement for a Unicode system
with regard to Interpretation is:
> For any given character sequence that a conforming system is able to
interpret, the interpretation must be consistent with the Unicode definition of
the character semantics.
----------------------------------------------------------------
Examples:
A conforming system may receive any text sequence and retransmit it
unchanged. Whether or not it could have performed any other process on the
text (e.g. display it, spell-check it, etc.) is immaterial.
A conforming system may receive a sequence of English text and retransmit
it all converted to uppercase (presumably an intentional change consistent with
the text's semantics).
A conforming system may NOT receive a sequence of English text and
retransmit it all converted to random Bengali characters, or vice versa
(presumably an unintentional change inconsistent with the text's semantics).
A conforming system may NOT take in a sequence of Unicode characters and
treat it (i.e. present or interpret it) as though it were a sequence of ASCII
bytes.
A conforming system may be unable to render a given character or sequence
legibly (e.g. a system with only Latin font resources given a sequence of
Bengali characters).
A conforming system may render a given sequence of English text in any
Latin font style, line length, page layout, etc. of its choice, so long as the
text is conventionally readable with the intended interpretation.
Discussion:
This design is predicated on the assumption that with a fully multilingual
character set in use worldwide in various editions on all manner of equipment,
it is impossible to preclude by legislation the common occurrence that a system
may receive an unfamiliar character code, i.e. one that it is unable to present
or interpret. Rather than trying to legislate away the commonplace as ISO
does, we merely make one extremely simple provision for handling such cases:
insisting on the ability to retransmit even unfamiliar codes unchanged. We
thus define an entirely workable scheme that avoids the futility of the ISO
conformance clauses.
Joe
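[[Editorial note, not part of the message above: the Interchange requirement can be illustrated with a minimal C sketch. The function name and buffer handling are illustrative only; the point is simply that unfamiliar code values pass through unchanged.]]

    #include <stddef.h>
    #include <stdint.h>

    /* Interchange conformance sketch: retransmit the text as the same
     * sequence of 16-bit code values that was received.  A system may be
     * unable to render or interpret a given value, but it must not alter
     * it (unless it is making an intentional change consistent with the
     * character semantics, such as case forcing or Compatibility Zone
     * mapping).  'out' is assumed to have room for 'n' code values. */
    size_t retransmit(const uint16_t *in, size_t n, uint16_t *out)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = in[i];   /* unfamiliar codes pass through unchanged;
                               * truncating to 8 bits (treating the text as
                               * ASCII bytes) would not be conforming */
        return n;
    }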
From whistler Mon Mar 11 16:41:47 1991
To: Becker.OSBU_North@Xerox.Com
Subject: Size & Shape of Unicode (Minutes of 12/14/90 WG Meeting)
Cc: unicore@Sun.Com, whistler@zarasun
O.k., Unicampers, I will pay the piper for not having written up the
minutes of the meeting of December 14, 1990.
Here are the Minutes [mainjuts] from the December 14 meeting regarding
the question Joe raised. [Just the relevant section.]
Corporate Zone
There was a general discussion of the pros and cons of defining an
explicit corporate zone.
The four possible types of characters were classified as:
Normal characters
Compatibility Zone characters
Corporate Zone characters (in question now)
User characters
The problem was pointed out that there is no feasible way to define
a vendor. This essentially makes it impossible to draw a clear
distinction between general user characters and corporate use
characters.
Asmus proposed a compromise way of separating the two. Let general
user characters "grow" in one direction in user space, and corporate
use characters in the other. This would not formally distinguish
them, but would allow companies to separate them practically.
There was some discussion of how big the corporate use areas should
be, and the group converged on a proposal to expand the user area
by another 2K to accommodate the largest sets.
Result:
User space will be 6K minus the Compatibility Zone at the
top of Unicode; i.e. there will be a 5-1/2 K user space.
New user space definition: E800 - FDFF
Compatibility zone: FE00 - FFEF
Conventions of assignment:
Corporate use characters start at FDFF and grow down.
General user characters start at E800 and grow up.
====================================
Latefully submitted.
--Ken Whistler, Unicode Secretary.
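[[Editorial note, not part of the minutes above: the allocation conventions can be summarized in a short C sketch. The range constants are the draft-era values recorded in these minutes, not the ranges of the published standard; since no formal boundary between general user and corporate use characters was defined, the sketch only reports which zone a code value falls in.]]

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Zone boundaries as recorded in the 12/14/90 minutes (draft-era values). */
    #define USER_FIRST   0xE800u  /* general user characters grow up from here    */
    #define USER_LAST    0xFDFFu  /* corporate use characters grow down from here */
    #define COMPAT_FIRST 0xFE00u
    #define COMPAT_LAST  0xFFEFu

    static const char *zone(uint16_t c)
    {
        if (c >= USER_FIRST && c <= USER_LAST)     return "user space (E800-FDFF)";
        if (c >= COMPAT_FIRST && c <= COMPAT_LAST) return "Compatibility Zone (FE00-FFEF)";
        return "regular code space";
    }

    int main(void)
    {
        const uint16_t samples[] = { 0xE800, 0xFDFF, 0xFE30, 0x0041 };
        for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
            printf("U+%04X: %s\n", (unsigned)samples[i], zone(samples[i]));
        return 0;
    }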
At the 12/14/90 Unicode meeting we decided to try to set a date for the first Unicode Inc. Directors meeting.
—cited from email: Kernaghan to Unicore, December 18, 1990
From decwrl!metaphor!HQ.M4.metaphor.com!kernaghan Wed Dec 5 03:47:06 1990
Date: Tue, 4 Dec 90 14:22:30 PST
To: u-core@noddy.Eng.Sun.COM
Subject: Licensor Agreement for review by 12/14/90.
This is a draft of a License Agreement between Unicode and a Licensor to Unicode (a similar one is forthcoming between Unicode and the Licensee). Please review it and provide any feedback to me on or before the Unicode meeting of 12/14/90.
FYI - This draft was developed between Mark Davis, Mike Kernaghan, and a Metaphor lawyer as an action item Mike took from the last By-Laws meeting on 11/15/90. The plan is to try to have the License Agreements ready by the time we incorporate (although it is not required). Although most of this is fairly standard wording, we do want to make sure it conveys the Unicode spirit of "spreading the standard far and wide". Therefore, please send me your comments for inclusion, and we will try to come to consensus on this agreement at the 12/14 meeting.
Mike Kernaghan - Metaphor
At the November 30, 1990 meeting, in reviewing what needed to be added to the charts for the Draft Standard Final Review Document, Becker proposed to make U+FFF0 - FFFE, as a group, "Special" characters, i.e. graphic character codes not in the Compatibility Zone. This was agreed to.
—cited from email: Whistler to Unicore, March 11, 1991
All my time until Friday is going to be devoted to the database update to get the cross-mappings back in sync with Microsoft. I'll bring hard copy of that (and soft copy) to Friday's meeting.
—cited from email: Whistler to Caldwell, November 28, 1990
At the November 16, 1990 meeting, it was agreed to add the REPLACEMENT CHARACTER at U+FFFE.
—cited from email: Whistler to Unicore, March 11, 1991
From whistler Mon Nov 19 13:14:10 1990
Date: Mon, 19 Nov 90 13:14:04 PST
From: whistler (Ken Whistler)
To: fdc@watsun.cc.columbia.edu
Subject: Re: line & paragraph separators
No, I wasn't talking about the Bremmer & Kroese "Waka waka bang splat", which I did enjoy, too. When we were discussing the proper name for apostrophe awhile back, I distributed the following to a number of people--though not to the entire unicode mailing list:
We set out to fix the apostrophes,
And avoid any coding catastrophes--
We'll take a new vote:
'APOSTROPHE-QUOTE!'
Requiescat in pacem these daft'strophes.
In any case, the mailer in question destroyed line breaks in both!
The Unicode meeting last Friday (16th) decided to add a LINE SEPARATOR and a PARAGRAPH SEPARATOR as distinct, unambiguous characters--basically my Proposal A from the earlier discussion on this.
We also talked about the guidelines for conversion when converting ASCII-based code to Unicode. We agreed that it would be quite useful to have a standard enumeration of how to deal with common formats for lines and paragraphs (Unix, PC, Mac, ...). The first order conversion for control codes is simply to sign extend them to 16 bits (for both C0 and C1). That is really all that a Unicode conformant "device" should have to do. But for lines and paragraphs, there are a number of specific interpretations of various sequences of CARRIAGE RETURN, LINE FEED, FORM FEED, etc. which a Unicode application could convert to unambiguous codes, if it so desired.
Otherwise Unicode's intention is to leave the C0/C1 codes uninterpreted. They mean whatever an application intends them to mean. If some application wants to use a whole raft of specialized C1 codes from whichever ISO standard, it could, with the proviso that in Unicode text, the C1 codes are 16-bit sign extended (U+0080 .. U+009F) to conform with the 16-bit architecture of Unicode. (In earlier drafts of Unicode, we had omitted the C1 space, but sometime last spring it seemed advisable to vacate the C1 space and just let the semantics of those 32 positions be specified by the pertinent standards.)
On the other hand, I don't think that anyone intends that Unicode will be implemented (except in marginal ways) with character-oriented devices, which is part of the reason why Unicode is nearly silent about control codes. "Control sequences" are simply in another space, as far as Unicode is concerned, and text is not modeled as something which "controls" a device. Instead, a text store is acted upon by a rendering algorithm which maps it to a rendering device (typically a screen raster or a printer raster). The controlling language for the device itself (e.g. Display PostScript) has no direct relation to Unicode.
I'm sure that the fact that Unicode is not an 8-bit standard (unlike 10646) will hinder its acceptance on DEC terminals. But the first implementations will all be in bitmapped graphics workstation/PC platforms, and the implementors don't much care about controlling terminals. The considered opinion seems to be that the control codes in text approach, however expanded, simply can't be scaled up to deal with the generic problems of multilingual software. The architecture is just not right for effective computerization of really multilingual software.
--Ken Whistler
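[[Editorial note, not part of the message above: the "first order conversion" described here amounts to widening each 8-bit code to 16 bits, with an optional application-level mapping of platform line-end conventions onto the unambiguous separator. The sketch below is illustrative only: it uses U+2028, the code point LINE SEPARATOR eventually received in the published standard (not yet fixed at the time of this email), and treats CR, LF, and CR LF as line ends, which is just one of the platform conventions the email mentions.]]

    #include <stddef.h>
    #include <stdint.h>

    #define LINE_SEPARATOR 0x2028u  /* code point assigned in the published standard */

    /* First-order conversion of 8-bit text: widen each byte to a 16-bit code
     * (C0 and C1 controls land at U+0000..U+001F and U+0080..U+009F), and,
     * as one possible application-level policy, turn CR, LF, and CR LF into
     * the unambiguous LINE SEPARATOR.  Returns the number of 16-bit codes
     * written; 'out' must have room for at least 'n' codes. */
    size_t widen_and_map(const unsigned char *in, size_t n, uint16_t *out)
    {
        size_t j = 0;
        for (size_t i = 0; i < n; i++) {
            if (in[i] == '\r') {
                if (i + 1 < n && in[i + 1] == '\n')
                    i++;                    /* swallow the LF of a CR LF pair */
                out[j++] = LINE_SEPARATOR;
            } else if (in[i] == '\n') {
                out[j++] = LINE_SEPARATOR;
            } else {
                out[j++] = (uint16_t)in[i]; /* plain widening, including C1 bytes */
            }
        }
        return j;
    }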
From Joseph_D._Becker.OSBU_North@xerox.com Tue Nov 13 09:39:27 1990
Subject: 10,532 More Hangul Syllables in Forthcoming Korean Standards
To: u-core@noddy.Eng.Sun.COM
I'm not sure how public the following was intended to be, so it's probably best not to forward it, but in case you hadn't received this information I think we should add it to the agenda for the 11/16 meeting.
Joe
----------------------------------------------------------------
Date: 5 Nov 90 07:11:23 PST (Monday)
Subject: Progress of KS Expansion Project
From: ksri%halla.dacom.co.kr%uunet.uucp%autodesk.uucp%sun.Eng.Sun.COM@SGI:COM
To: Becker:OSBU North
Hi, This is Jeoung Sung Cho. I have received all messages distributed by the UNICODE people. Thanks for allowing me to get that valuable information. I am sending this message to inform you of the progress of the KS Expansion Project for the Hangul and Hanja code.
The committee decided to define two supplementary sets. The first set, which we call the Hangul Supplementary Set, will include 8832 Hangul characters. This set and the 2350 Hangul characters of KS C 5601 will cover all the Hangul characters that are used currently. The second set, which we call the Old Hangul and Hanja Supplementary Set, will include 1700 old Hangul and approximately 5000 Hanja characters. The committee is reviewing the second set now. The new sets will be announced as KS by the middle of 1991.
I wonder if you can assign the 8832 Hangul characters in UNICODE. I know that it is very difficult to assign additional characters in version 1.0. If you can't assign the additional Hangul characters in this version, is there any possibility of assigning those characters in the next version? The reason why we define the additional Hangul characters is that there is a very strong requirement from Korean users that KS should include all Hangul characters that are used currently.
If you need any additional information about the progress of the project, let me know. Your prompt response to this message will be very much appreciated.
Sincerely yours,
Jeoung Sung Cho
KSRI
----------------------------------------------------------------
From whistler Tue Nov 13 14:15:28 1990
Date: Tue, 13 Nov 90 14:15:24 PST
From: whistler (Ken Whistler)
To: glenn@ila.com
Subject: Re: Unicode Consortium
It doesn't officially exist yet. That's what the Thursday meeting is about--to sort out any remaining legal problems in getting the thing incorporated. So far the Unicode "consortium" consists of those companies who show up at the regular RLG meetings. But once the Consortium exists, its formal members will be those companies which officially join and pay annual dues--like almost any consortium. We should have more details available this Friday, after the Thursday meeting sorts out whatever needs to be resolved.
--Ken
Bidi Algorithm
We'll have a subcommittee meeting on the 11th at noon in Redmond.
—cited from email: Freytag to Unicore, October 7, 1990
I am currently maintaining the database which contains the non-CJK part of the Unicode draft. At each Unicode meeting I bring updates of the database in the form of various TAB-delimited text files representing reports from the database. One of these is the master Unicode names list and another is the alphabetical listing of character names.
—cited from email: Whistler to mrfung, October 11, 1990
Michel's "topics":
Looking at my notes I found:
We raised 2 at the last meeting:
- Statement on migration strateg(ies). What are you supposed
to do with the Middle High Norse characters that aren't in Unicode
- How to map from old standards: "Unicode allows to maintain intent"
when there is no 1:1 mapping.
—cited from email: Freytag to Unicore, October 7, 1990
On September 14, 1990, at roughly 2 p.m., the Compatibility Zone was added with considerable reluctance, originally defined to be 511 codes: U+FE00 - FFFE.
—cited from email: Whistler to Unicore, March 11, 1991
Michel's "topics":
Looking at my notes I found:
...
From the previous meeting we had
- Floating Diacritic handling
—cited from email: Freytag to Unicore, October 7, 1990
At the 9/14 meeting, we discussed producing an "Answers to the Top 10 Most Asked Questions" section, and I said I would distribute my old draft. It is below. I haven't re-read it since it was written 1.5 years ago, so undoubtedly it is out of date. Perhaps someone can use it as the basis for producing a modern version of such a summary.
—cited from email: Becker to Unicore, September 15, 1990
From daemon@Metaphor.COM Wed Oct 3 10:06:49 1990
Received: from YKTVMV by CUNYVM.CUNY.EDU (IBM VM SMTP R1.2.2MX) with BSMTP id 6536; Wed, 03 Oct 90 08:27:06 EDT
Date: 03 Oct 1990 08:25:19 EDT
From: dan%ibm.com@CUNYVM.CUNY.EDU (Walt Daniels)
To: unicode@Sun.COM
Subject: order of floating diacritics
> As recently as the August 17, 1990, Unicode
> consortium meeting, this very topic was discussed and the policy that the
> order of multiple diacritics would not be specified was reaffirmed.
> J VanStee - private mail
I do not remember seeing this on the mailing list but I do remember a long discussion about the order being inside-out, top-bottom, etc. Were these just suggestions or will they be part of the standard?
From Joseph_D._Becker.osbunorth@Xerox.COM Mon Jul 30 17:38:32 1990
Subject: Re: Coding of accented characters
To: ma_hasegawa@jrdv04.enet.dec.com
Cc: Unicode@Sun.COM, Becker.osbunorth@Xerox.COM
In-Reply-To: "ma_hasegawa"%jrdv04.enet.dec.com%Xerox:COM's message of 22
Jul 90 23:19:12 PDT (Sunday)
Masami,
Thanks to your call for clarification, the Unicode meeting of 7/27
re-addressed the question of the ordering of multiple diacritical marks ...
especially since I had mis-documented the group's previous decision (for
which I had failed to find notes). Here is the correct statement:
----------------------------------------------------------------
...
Sequence order of multiple diacritical marks: In case of multiple
diacritical marks applied to the same base character, if the result is
unambiguous there is no reason to specify a sequence order for the mark
characters. In particular, marks may fall into four categories: above the
baseform, below the baseform, superimposed on the body of the baseform, and
surrounding the baseform. Between two marks that are in different
categories, there is never an ambiguity, hence never a need to specify
sequence order. In the relatively rare cases where an unambiguous sequence
order of multiple marks of the same category is necessary, that order
should be: FROM THE BASELINE OUTWARD.
...
----------------------------------------------------------------
We had a fairly careful discussion of which was more beneficial:
(A) to specify a canonical order in all cases
(B) to leave the order flexible where possible
We decided that it really came down to a question of WHEN it was more
efficient to filter the character sequence into canonical order: (A) when
the sequence is created (e.g. by input or editing), or (B) when the
sequence is interpreted (e.g. by a comparison routine). It seemed clear to
all of us that (B) is more effective, since it is nearly impossible to
control all means of assembling character sequences, and the final
interpretation is best left to the end-user routine anyhow.
So, the correct semantics would be:
a) LATIN CAPITAL LETTER A + NON-SPACING MACRON + NON-SPACING DIAERESIS (3
characters) [indicates the diaeresis above the macron]
b) LATIN CAPITAL LETTER A + NON-SPACING DIAERESIS + NON-SPACING MACRON (3
characters) [indicates the macron above the diaeresis]
Meanwhile, in response to my message of 23 Jul 90 17:23:57 PDT concerning
10646 support for Rhade, etc., you replied 23 Jul 90 21:33:10 PDT:
>> As for missing characters for 10646, we have been allocating additional
characters based on request with justifications. If you think some
characters are missing, you should submit the request through the
established process like everyone else (through ANSI) to ISO.
I have been trying since that time to get you to confirm my understanding
of this reply. My understanding is that 10646 is unable to represent
> extensions of the Latin script
> marked symbols for mathematics & physics
> the International Phonetic Alphabet
> pointed Arabic & Hebrew
> Hindi & Sanskrit (and by extension all South and Southeast Asian
scripts)
UNLESS each and every possible combination of base characters and marks is
submitted for registration through ISO.
Is the above statement correct or not?
Meanwhile, in your message of 29 Jul 90 18:39:40 PDT, you said:
>> ISO 10646 can be used with a control code standard (ISO 6429). In ISO
6429, there is a control function GCC (Graphic Character Composition). So
for THOSE APPLICATIONS which need to combine graphic symbols, there is a
way.
Now I am extremely curious. Does 10646 plain text need to be encoded
differently for some applications than for others? Have you never had the
experience of transferring text between two systems or applications that
were not designed to expect such a transfer? It seems to me that such
"blind" transfer is a normal everyday part of text interchange, especially
in systems integrated from multiple-vendor components.
I do not understand whether the GCC control code is permitted, or optional,
or mandatory, for each of the cases:
> extensions of the Latin script
> marked symbols for mathematics & physics
> the International Phonetic Alphabet
> pointed Arabic & Hebrew
> Hindi & Sanskrit and other South / Southeast Asian scripts
Please make that clear to us in each case.
Finally, it would be valuable to return with this new knowledge to your
original question:
>> What I want to know is the "correct" representation of, say LATIN
CAPITAL LETTER A WITH MACRON AND DIAERESIS, a character needed for Lappish.
Now we are asking this question with regard to ISO 10646, assuming that the
Lappish character is not already registered (I don't think it is in 2nd
DP), and (I guess) using the GCC control code.
Possible representations are:
a) LATIN CAPITAL LETTER A + GCC + NON-SPACING MACRON + GCC + NON-SPACING
DIAERESIS (5 characters)
b) LATIN CAPITAL LETTER A + GCC + NON-SPACING DIAERESIS + GCC + NON-SPACING
MACRON (5 characters)
c) LATIN CAPITAL LETTER A WITH DIAERESIS + GCC + NON-SPACING MACRON (3
characters)
d) LATIN CAPITAL LETTER A WITH MACRON + GCC + NON-SPACING DIAERESIS (3
characters)
f) NON-SPACING MACRON + GCC + NON-SPACING DIAERESIS + GCC + LATIN CAPITAL
LETTER A (5 characters)
g) NON-SPACING DIAERESIS + GCC + NON-SPACING MACRON + GCC + LATIN CAPITAL
LETTER A (5 characters)
h) NON-SPACING MACRON + GCC + LATIN CAPITAL LETTER A WITH DIAERESIS (3
characters)
i) NON-SPACING DIAERESIS + GCC + LATIN CAPITAL LETTER A WITH MACRON (3
characters)
Which of the above representations are valid? (Also please indicate which
ones specify whether the macron is above or below the diaeresis).
Thanks,
Joe
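[[Editorial note, not part of the message above: the baseline-outward rule for marks of the same category, stated in the sequence-order discussion near the top of this message, can be shown with the two Lappish examples given there. The sketch uses the code points the characters received in the published standard (U+0041, U+0304, U+0308); the draft under discussion names the marks NON-SPACING MACRON and NON-SPACING DIAERESIS.]]

    #include <stdint.h>

    /* U+0041 LATIN CAPITAL LETTER A, U+0304 macron, U+0308 diaeresis.
     * Both marks sit above the baseform, i.e. the same category, so their
     * relative order is significant and runs from the baseline outward. */

    /* (a) A, then macron, then diaeresis: the diaeresis renders above the macron. */
    static const uint16_t a_macron_then_diaeresis[] = { 0x0041, 0x0304, 0x0308 };

    /* (b) A, then diaeresis, then macron: the macron renders above the diaeresis. */
    static const uint16_t a_diaeresis_then_macron[] = { 0x0041, 0x0308, 0x0304 };

    /* A mark of a different category (e.g. one placed below the baseform)
     * could appear anywhere in the sequence without changing the reading. */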
From BOSURGI1@AppleLink.Apple.COM Thu Aug 2 18:09:22 1990
Cc: AUTH1@AppleLink.Apple.COM (Auth, Michael,CLA),
RICHARDSON7@AppleLink.Apple.COM (Richardson, Lee,CLA)
Subject: Precomposed, compatible, etc.
To: U-CORE@NODDY.Eng.Sun.COM
From: BOSURGI1@AppleLink.Apple.COM (Bosurgi, Joe,CLA)
Date: 03 Aug 90 00:28 GMT
> I agree with both Glenn and Lee on this subject. All our mappings between
> existing PC standards and Unicode have been using precomposed characters. It
> is not even obvious for me that the first system implementations of Unicode
> should have full floating diacritic support, including:
> - collating,
> - rendering, etc...
> The precomposed character set included in Unicode will cover 99% of our
> current need and it will be difficult to justify the large investment
> required by the full support RIGHT NOW. I understand that we have to add the
> full support later, but this looks like a medium term goal for me.
> Michel Suignard, Microsoft
We've been having some discussions here on the European pre-composed/full
floating diacritic issue, and have been leaning in the same direction that
Michel has so eloquently expressed above. This in no way indicates that we
want to do away with floating diacritics or anything like that.
But it seems a mistake to take the floating diacritic version of a character to
be the standard, or even "preferred" way of representation, while denigrating
the pre-composed form. I think that many developers will want to transition to
Unicode WITHOUT implementing a generalized look-ahead function for handling
floating diacritics "in the first release". As Michel notes, this will "cover
99% of the current need". I _do_ think it is inevitable to move to the full
floating diacritics and a generalized look-ahead implementation at some point,
but I'm not sure what we'd gain (except resistance) by requiring this
initially.
Not only does this impinge on pre-composed European characters, but also on
IBM's suggestion of re-opening the "compatibility zone" at our last meeting.
Specifically, the only mechanism available for getting isolated Arabic letters,
zenkaku Roman characters, and hankaku katakana at present involves zero-width
non-joining characters, and the hankaku/zenkaku "diacritics". The same
argument might be applied here.
Of course, Arabic and Indic scripts will always involve look-ahead for
rendering - but there are probably some developers out there that are only
interested in Europe, the Americas, and East Asia "for now". Despite not
wanting to "cover the earth" all at once, they can further the quick acceptance
of Unicode. And for them, this will "cover 99% of the current need" for these
characters at a lower initial cost.
I'd still support the idea of the zero-width non-joiner in a "compatibility
zone" scenario. The zenkaku/hankaku "diacritics" could probably be tossed.
We can work with what is currently defined, but feel that IBM's suggestion has
merit. In terms of the "denigration" issue for European characters, we agree
with Michel.
Joe Bosurgi
Claris Corporation
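[[Editorial note, not part of the message above: one way to read the "precomposed first, floating diacritics later" position is as a small fallback table rather than a generalized look-ahead. The sketch below is hypothetical; the precomposed values happen to be the Latin-1 codes, and anything outside the table would simply be deferred to a later, fully general floating-diacritic implementation.]]

    #include <stdint.h>

    /* Hypothetical "first release" fallback: fold a base letter plus one
     * non-spacing mark into a precomposed code where one exists in a fixed
     * table (0x0308 is the diaeresis in the published standard; the result
     * values are the Latin-1 precomposed characters).  Returns 0 when the
     * pair is not in the table. */
    static uint16_t precompose(uint16_t base, uint16_t mark)
    {
        if (mark == 0x0308) {                 /* non-spacing diaeresis */
            switch (base) {
            case 0x0041: return 0x00C4;       /* A -> A with diaeresis */
            case 0x004F: return 0x00D6;       /* O -> O with diaeresis */
            case 0x0055: return 0x00DC;       /* U -> U with diaeresis */
            case 0x0061: return 0x00E4;       /* a -> a with diaeresis */
            case 0x006F: return 0x00F6;       /* o -> o with diaeresis */
            case 0x0075: return 0x00FC;       /* u -> u with diaeresis */
            }
        }
        return 0;                             /* no precomposed form in this table */
    }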
I, too, am engaged in coding case relations and character properties. It would be nice if we agreed about most of these things, but I don't think there is any plan to have Unicode 1.0 also publish a full list of character properties. The names list (Joe Becker has completed a draft) will be invaluable for character identification, and it also contains information about case pairs. Joe should have a corrected draft of that available at the May 18 meeting.
The Apple FileMaker database has some character properties (direction and major class [letter vs. symbol vs. numeric vs. punctuation]) coded, but it is only partially up-to-date. Perhaps we should bring up as an agenda item for the May 18 meeting the coordination of efforts to systematically agree upon and develop lists of at least those character properties.
—cited from email: Whistler to Freytag, May 10, 1990
From BL.KSS%RLG@Forsythe.Stanford.EDU Wed May 2 12:04:40 1990
Date: Wed, 2 May 90 11:56:54 PDT
To: u-core@noddy.Eng.Sun.COM
Subject: Mtg on May 4? (Inquiring Minds Want to Know) Re: What Character Names to List
U-Core Folks --
1. PLEASE let me know 1) if there IS a 5/4 mtg; 2) if it will be here at RLG; 3) what times it will be. (If we're not meeting at RLG, I'm obliged to release the room)
2. Although I have no opinion/advice on the Hamiltonian vs Nabla topic, I do have an opinion on the general question of what names you list for a given character: AT MINIMUM: reference any other name used in a character set standard. (On a case-by-case basis you may consider including names "well-known" but that are not documented in a standard; if the character does not appear in any standard you may have no choice but to use the "well-known" name. But if the character appears in more than one standard under different names, then I would argue for cross-referencing the names in those standards.)
Karen
I vote to wait until someone willing to implement Unicode shows us that they really need the Mosaics before we put them back. The lesson here is that people interested in the content of Unicode must make an effort to attend the meetings, and if that is not possible, to at least read the mail and minutes. We did vote, and no one voted in favor.
[[This refers to a decision taken during the April 20, 1990 meeting to remove the Videotex Mosaics from the Unicode draft. That decision engendered an extended discussion on the Unicore list on May 22, 1990.]]
—cited from email: Collins to Unicore, May 22, 1990
From Joseph_D._Becker.osbunorth@Xerox.COM Thu Apr 26 10:35:09 1990
Sender: "Joseph_D._Becker.osbunorth"@Xerox.COM
Date: 26 Apr 90 09:22:34 PDT (Thursday)
Subject: Send 'em in
From: Becker.osbunorth@Xerox.COM
To: lcollins@apple.COM, BOSURGI1@applelink.apple.COM,
BL.KSS%RLG@Forsythe.Stanford.EDU, microsoft!michelsu@Sun.COM,
glennw@Sun.COM, zarasun!whistler@metaphor.com, James_Higa@NeXT.COM,
BR.JMA%RLG@Forsythe.Stanford.EDU
Cc: Becker.osbunorth@Xerox.COM
... those alphabet names lists and section introductions that folks said
they'd write at the 4/20 meeting. Let's aim for early next week. I have
written drafts for Diacritics, Greek, Cyrillic, Georgian, Armenian, Arabic,
and Ethiopian, and it'd be nice to try to have a whole package for the May
4 meeting.
(Is anyone calling the May 4 meeting?)
Joe
From Joseph_D._Becker.osbunorth@Xerox.COM Mon Apr 16 10:23:19 1990
Date: 15 Apr 90 16:24:55 PDT (Sunday)
Subject: Re: Miscellaneous characters
From: Becker.osbunorth@Xerox.COM
To: microsoft!michelsu@Sun.COM, microsoft!michelsu@Sun.COM
Cc: lcollins@apple.COM, zarasun!whistler@metaphor.com,
Becker.osbunorth@Xerox.COM
In-Reply-To: microsoft!michelsu%Sun:COM's message of 10 Apr 90 18:59:12 PDT
(Tuesday)
Hello again,
I agree with all comments, including:
SM720000 = 0x21b5 (bent arrow / Enter (Return) symbol)
JX710000 = 0x309b (daku-on)
JX720000 = 0x309c (han-daku-on)
JQ740000 = 0x00b7 (middle dot)
... and the fact that SP500000 was among the list of IBM symbols that I
took back out because I was not sure which ones were useful.
It sounds like we need to add SP500000 back into Unicode. We can confirm
this and a few other additions at the meeting next week.
Joe
We are getting significant pressure here to have more dingbats in Unicode. In particular, did you already look at the ITC ZAPF DINGBATS series 100, 200 and 300? Did you develop a standard Unicode position about how to transport them in a Unicode string? I would like that matter to be discussed at the next Unicode meeting. (I am personally fairly reluctant to add random dingbats, as it is not clear then that you wouldn't have to add zillions of them used in European or Asian publications, but again if these ZAPF DINGBATS are widely used we need at least an explicit position about their support.)
—cited from email: Suignard to Collins, April 16, 1990
Would like to put onto tomorrow's (3/23) agenda, for 2:00 pm, a discussion of whether there is a need for right-to-left punctuation in Unicode. (We need to give a specific starting time for this topic, because there are RLG staff who will be joining us specifically for this discussion.)
—cited from email: Smith-Yoshimura to Unicode, March 22, 1990
Something you may also need is the mapping tables between Codepages and our UGL; however, this represents a lot of information to be sent by mail. If you have some specific ones I can do it, or we may as well wait for our next Unicode meeting (March 23rd), when I can bring soft and hard copies.
—cited from email: Suignard to Becker, March 8, 1990
From Joseph_D._Becker.osbunorth@Xerox.COM Mon Mar 19 21:52:43 1990
Date: 19 Mar 90 21:46:06 PST (Monday)
Subject: Unicode Architecture Proto-Decisions
From: Becker.osbunorth@Xerox.COM
To: Unicode@Sun.COM
Cc: Becker.osbunorth@Xerox.COM
An impromptu micro-Unicode meeting was held today to discuss two smallish
Unicode architectural proposals that we thought we could (and did) get
agreement on. We present them here in the hopes that the group will
rubber-stamp them (yeah, sure) on Friday. Although, if the following folks
can agree on something, it must be pretty agreeable (-: Whistler, Morse,
Kernaghan, Collins, Bosurgi, Becker.
> Proposal: C1 pullout: Designate the 32 "C1" cells in the range 0080-009F
as "Control" (interpretation unspecified); distribute 30 of the characters
now in this range to appropriate blocks of punctuation, math operators,
etc.; zap the duplicated script-f and pi
Con: We had already decided against this (see below); however, the
situation has somewhat changed.
-------------------------------------------------------
> C1 pullout: leave unassigned the "C1" range 0080-009F
Decision: Status Quo, i.e. leave the 32 miscellaneous letters and
stuff there.
Reason: We might have been willing to concede if we thought that
C1 had any widely-accepted standard semantics (as C0 does), but since we
don't, there's no point in just leaving the space open.
-------------------------------------------------------
Pro: This will make the first 256 Unicodes PRECISELY identical to Latin1,
thereby making Unicode more acceptable to many people. We have learned
that ISO is indeed playing control code games in the C1 space; we have no
pressing reason to prevent them from doing so. For example, some ISO DIS
10538 & DP 10646 C1 control functions:
-------------------------------------------------------
* indicates DP 10646
C1 Set
80* PAD PAD OCTET
81* HOP HIGH OCTET PRESET
82 BPH BREAK PERMITTED HERE
83 NBH NO BREAK HERE
85 NEL NEXT LINE
8B PLD PARTIAL LINE FORWARD
8C PLU PARTIAL LINE BACKWARD
99* SGCI SINGLE GRAPHIC CHARACTER INTRODUCER
9B CSI CONTROL SEQUENCE INTRODUCER
-------------------------------------------------------
The main people who had wanted the set of 32 characters that we put in
these cells were Apple, but they now feel that there would be greater
benefit to them in making Unicode more acceptable. If we make this change
now, and regret it in half a year, very little damage would be done; but if
we DON'T make it now, and regret it in half a year, we'd be badly stuck.
> Proposal: Move User Space: Move the User range 3000-3FFF to F000-FFFE;
move up CJK Aux to 3000-3FFF, and move up Han to start at 4000.
Con: None except for some hassle to make the change.
Pro: Not much gain either, but might give us greater flexibility if we
need to tweedle User Space in the future (for example, we're told the max
IBM user space is 6K, and we would be able to expand ours from 4K to 6K if
we felt like it). This is different from the old "Compatibility" proposal,
although it does permit us to refer to User Space as F-Space ...
Joe
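[[Editorial note, not part of the message above: the practical payoff of the C1 pullout is that the first 256 Unicode values coincide with ISO 8859-1 (Latin-1), so conversion from Latin-1 is a pure widening with no lookup table, and the 32 C1 cells are simply carried along as uninterpreted controls. A minimal, illustrative sketch:]]

    #include <stdbool.h>
    #include <stdint.h>

    /* With the C1 pullout adopted, the first 256 Unicode values match
     * ISO 8859-1, so conversion is a plain widening with no table. */
    static inline uint16_t latin1_to_unicode(unsigned char b)
    {
        return (uint16_t)b;
    }

    /* The 32 cells U+0080..U+009F are designated "Control"; Unicode itself
     * assigns them no interpretation. */
    static inline bool is_c1_control(uint16_t c)
    {
        return c >= 0x0080 && c <= 0x009F;
    }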
I understand from an informal meeting earlier this week that there will be separate code points for the KS "phonetically-distinct" hanja, to guarantee a one-to-one mapping to KS for text marked as Korean. If so, we like it. What were the statistics on how many there were? Maybe someone can present this at Friday's meeting.
—cited from email: Bosurgi (Claris) to Unicode, March 21, 1990
As a follow-up to the last Unicode meeting, I took responsibility for producing a mapping from some IBM code pages to Unicode.
—cited from email: Suignard to Unicode, March 1, 1990
I think we really need to get more serious about having a meeting schedule determined in advance - announcing whole-day meetings over E-mail on just a few days' notice is unreal. I seem to recall Michel Suignard suggesting reserving one day out of each month (or two weeks, or some regular interval) and attempting to have meetings on a *regular* basis, instead of just setting them up in an ad hoc fashion. That would be a lot easier.
—cited from email: Bosurgi (Claris) to Unicode, February 19, 1990
This is to confirm that RLG will be hosting the next Unicode meeting this Friday, Jan 26, in the conference room on the first floor called "OAK". (Ask the receptionist at the front desk for directions when you come in.)
I leave the actual agenda to others. Although we said we'd have a "full day meeting", RLG staff cannot attend before 9:30 am, so let's start then. (Wayne Davison, Associate Director for Development, will also be attending from RLG central staff, and will be there before me.)
—cited from email: Smith-Yoshimura to Unicode, January 22, 1990
I cannot make the Jan 26th meeting, but Asmus Freytag from my team will go, and possibly also David Wood (National Language Support Program Manager). Unicode is getting a lot of attention in Microsoft and is clearly in a good position to be used as an internal unique character representation.
—cited from email: Suignard to Unicode, January 17, 1990
At 7:00 pm on Friday, those present decided to extend the technical discussion of non-CJK and other issues to another meeting on Monday, 29 January. This meeting will be held at Claris (5201 Patrick Henry Drive, Santa Clara), beginning at 5:00 pm. I will try to arrange for food, but given the short notice we might have to fax an order to Togo's.
—cited from email: Bosurgi (Claris) to Unicode, January 28, 1990
From ksar@hpcea.ce.hp.com Wed Jan 31 16:37:41 1990
To: microsoft!michelsu@Sun.COM
Date: Wed, 31 Jan 90 8:43:38 PST
Subject: Re: Re Compatibility Space
Cc: unicode@Sun.COM
In-Reply-To: Message from "michelsu@Sun.COM" of Jan 30, 90 at 7:49 pm
[...]
I like this idea of a compatibility block in Unicode, and it appears that it could resolve the impasse we had at the last meeting. It does not have to be in the "primary space" but there is a need for it. The issue now is what to include in it. Why not start with what the 10646 2nd DP has and provide feedback to SC2/WG2, through ANSI/X3L2, on what should be added to or deleted from the 2nd DP of 10646?
Regards,
Mike KSAR/HP
From lcollins@apple.com Fri Dec 15 17:24:00 1989
Date: Fri, 15 Dec 89 17:16:20 PST
To: microsoft!michelsu@Sun.COM, unicode@Sun.COM
Subject: Re: Dec 18th meeting
Michel,
Apple has always assumed that in the worst case we would have to go it alone with Unicode rather than accept a bad standard. If there is a vote to be taken, then we will opt for Unicode, since I doubt that we can rely on 10646. I think most of the Unicoders agree. We are counting on the final freezing of Unicode in early 1990.
I think a formal meeting once a month may now be justified since you and others have to come from outside the area. Please get us a list of the discrepancies in mapping and any missing symbols as soon as you can.
Lee
From Joseph_D._Becker.osbunorth@Xerox.COM Fri Dec 15 19:09:44 1989
Date: 15 Dec 89 19:01:50 PST (Friday)
Subject: Re: Dec 18th meeting
From: Becker.osbunorth@Xerox.COM
To: microsoft!michelsu@Sun.COM
Cc: unicode@Sun.COM, Becker.osbunorth@Xerox.COM
In-Reply-To: microsoft!michelsu%Sun:COM's message of 15 Dec 89 16:59:21 PST
(Friday)
Michel,
[...]
Re: meetings more formal with a longer agenda
> Well, if/when we attract the money and personnel, we might envision
genuine Unicode conferences, teach-ins, even the greatly-to-be-desired
Unicoeds ... But as things stand, I think the main technical work will be
established before we get formally organized. You have raised many
interesting issues, so I'm glad you can at least participate by E-mail on
days when you can't justify a trip down here. Until we get formal, we
would certainly be glad to build a meeting around any other opportunity you
might have to visit the Bay Area.
Joe
From glennw@Sun.COM Thu Dec 21 15:11:41 1989
Date: Thu, 21 Dec 89 11:37:04 PST
From: glennw@Sun.COM (Glenn P. Wright)
To: microsoft!michelsu@Sun.COM
Subject: Re: Dec 18th meeting
Cc: unicode@Sun.COM
I think we should still give a try to converge ISO 10646 and Unicode,
but let's say after late March or early April we have to proceed.
I agree with you.
I assume here you mean "finish Unicode", followed by "Discuss
merge issues". I don't think we have ANYTHING to say until we know we have
a draft. I'm sure X3L2 are sick of listening to us bemoaning 10646 when what
we have is not in final draft form.
| Re: meetings more formal with a longer agenda
No, I am not asking for full blown conference but more like a monthly
meeting which may last a bit longer with a predetermined agenda.
I agree. I believe we need to commit more time in January to stop the thrashing
we seem to be doing.
We urgently need to get written consensus on decisions.
I propose that our next meeting be a whole day meeting sometime second or
third week of January.
I would like to see us close all issues on non-CJK (Including symbols)
at that meeting.
(yes, I know most work is done outside meetings, but....)
Glenn.
From BOSURGI1@applelink.apple.com Thu Dec 21 19:41:50 1989
Date: 21 Dec 89 18:58:00 PST
From: BOSURGI1@applelink.apple.com
To: unicode@Sun.COM
Subject: January Meeting
Hi ... This is just to second Glenn's recent suggestion of an all-day meeting next month to finalize non-CJK for this revision. I think we really need to close these points as soon as possible, maybe even start some test implementations using Unicode, see how transmission of a subset of Unicode feels, etc., etc. That feedback would be valuable to get before finalizing the whole wad and allow us to bring any hidden flaws in our assumptions (especially in handling of diacritical marks) to the surface quickly.
I think the problems we have been pointing out in 10646 have received more acknowledgement recently, and we should take the opportunity soon to start working with (as well as on) what we think is a much more viable standard, to demonstrate Unicode's relative merits concretely. It could be dramatic to compare development time, testing time, and/or execution time of some sample international routines which used Unicode, or a representative subset, with other possible methods. A cross-script, international "Find/Change" routine immediately comes to mind as an example.
Yoi o-toshi-o y'all,
Joe Bosurgi
Claris Corporation
Manager, Software Internationalization
From lcollins@apple.com Tue Nov 28 13:50:57 1989
Date: Tue, 28 Nov 89 13:26:55 PST
To: unicode@Sun.COM
Subject: Letter to ANSI
Here is my first crack at the letter we discussed at the Unicode meeting last night. Please review it and let me know if it is acceptable to use your and/or your company names in the final draft to ANSI.
Lee
------------------------------------
Subject: Flaws in ISO DP 10646
Status: Industry Group Position
Action requested: Consideration by X3L2
As software and computer systems producers attempting to meet growing international requirements, we would like to express our concern with the unsatisfactory direction being taken in the development of the ISO multi-octet character code standard, DP 10646. The failure of ISO SC2 WG2 to incorporate the modifications to DP 10646 proposed in X3L2/89-195 threatens to render 10646 unacceptable as an internal process code. Specifically, we are concerned that DP 10646 is marred by three serious flaws:
1. DP 10646 places unnecessary restrictions on the encoding of graphic characters. This forces the use of 24 or 32 bit characters, even though a fully coded 16 bits is more than sufficient for representing all but the most obsolete and obscure of the world's characters. We do not find compelling the argument that this allows backwards compatibility with 7 and 8 bit terminals, since it is obvious that existing hardware will require major revisions to adequately handle the character code repertoire made available by DP 10646.
2. If 10646 fails to establish unification of the Han characters, then it will not be possible to represent the standard Han characters used by Chinese and the proposed extensions for Japanese within the much sought after 16-bit Basic Multilingual Plane. Any standard that discriminates against such a large segment of computer users is clearly unacceptable.
3. If 10646 is implemented to allow large numbers of presentational forms in the basic multilingual plane, this will confuse the highly desirable distinction between text content (character codes, the jurisdiction of SC2) and form (glyph identifiers, the jurisdiction of SC18). In practice, this will mean that we have to recognize multiple encodings of characters while gaining nothing, since it is neither possible, useful, nor practical for a character code standard to specify all possible presentation forms of a character. Moreover, it is irresponsible to allow large numbers of glyphs to be defined within a character code space already much reduced by the above restrictions on graphic character encoding and multiple encodings of Han characters.
As a result of these flaws, unless we implement the extravagant and unnecessary 32 bit characters, we will be forced to live with variable-width encodings, losing the advantages of a fixed-width encoding already noted in X3L2/89-195. Frankly, this result is no better than the current state of the world in multilingual computing. Moreover, it is not clear that 10646 even represents a major advance over ISO 2022. As a result, we foresee the development of a de-facto industry standard encoding based on fixed-width, 16 bit characters.
----------
From Joseph_D._Becker.osbunorth@Xerox.COM Fri Nov 17 12:04:37 1989
Sender: "Joseph_D._Becker.osbunorth"@Xerox.COM
Date: 17 Nov 89 11:34:22 PST (Friday)
Subject: 11/16 Decisions on Unicode Architecture & "Symbols"
From: Becker.osbunorth@Xerox.COM
To: Unicode@Sun.COM
Those who, despite the untimely death of E-mail, actually attended
yesterday's meeting held some pretty thorough pro-and-con discussions on
the various issues, and made decisions summarized as follows:
ARCHITECTURAL STATUS QUO'S:
> NUL bytes: leave unassigned any codes ending in a 0 byte
Decision: Status Quo, i.e. "no code ranges or byte values are
systematically excluded from use".
Reason: Unitext is a new type that can't be interpreted in any "8-bit
mode" anyhow, so there's no gain in acceding to misguided 8-bit thinking.
> C1 pullout: leave unassigned the "C1" range 0080-009F
Decision: Status Quo, i.e. leave the 32 miscellaneous letters and stuff
there.
Reason: We might have been willing to concede if we thought that C1 had
any widely-accepted standard semantics (as C0 does), but since we don't,
there's no point in just leaving the space open.
> Moving unassigned alpha-symbol space
Decision: Status Quo, i.e. leave the overall allocations just as they
are.
Reason: We decided that we do NOT intend all code points after 5000 to be
exclusively Kanjilando, i.e. we ARE willing to put non-Han characters at
later code ranges as things overflow the currently-assigned regions. Then,
for "symbolic characters" (see below), we might as well make the "didactic"
assignment which states that there won't be more than 4K symbols that we
approve of. We discussed doing the same thing for "alphabets" or non-CJK
scripts (shrinking that block to 4K), but decided that in a character code
standard it seemed appropriate to allocate a generous 8K up front to
scripts. Hence, we arrive back at the current structure, but with a better
feel for what we mean by it.
> Nuke the umgels (new item): remove the absurd 2,500-4,000 prefab Korean
umgels, [...]
Decision: Not terminally decided, but so far willing to remain Status
Quo.
[...]
ARCHITECTURAL CHANGE:
> Adopt 8859 alphabet structures: restructure those alphabets that have
national standard encodings to use the arrangement of those encodings
insofar as possible
Decision: For those alphabets where national standards exist, including
the 8859/n sets, change back to using those arrangements. Existing holes
will remain as holes, and a few new holes will be created when we zap out
duplicates of characters that are already coded somewhere else. (By the
way, for better or worse this is the identical approach taken by 10646, so
we should look at the 10646 layouts as well as the old standards.) Our
extension letters will be added afterwards, starting at the first available
multiple of 16. We may also want to add a bit more expansion space after
the end of some alphabets.
Reason: It's just not sensible to enrage everyone in half a dozen nations
for no particular technical gain. If they want to make a disaster area out
of their own alphabet, that's their privilege. The position we took with
ANSI & ISO is that we are NOT trying to design the world over from scratch
because we're smarter than everyone else, but rather that we have a few
sacred principles (e.g. 16 bit encoding) and are trying to weave together
existing standards except where doing so would trash those principles (e.g.
non-unified CJK). [...]
"SYMBOLIC CHARACTERS":
Ken suggested we abandon the word "Symbols" for the phrase "Symbolic
Characters", which we liked as a more precise expression of what we intend.
The list of criteria was expanded slightly, although it still needs work:
Criteria for inclusion:
> If the symbol itself has a name, e.g. "ampersand",
"hammer-and-sickle", "one-snake caduceus"
> If the symbol is commonly used amidst text, e.g. the Japanese
ZipCode-san face that is on the inside cover of the JIS standard but not
among the JIS standard symbols
> If the symbol is widespread, i.e. actually found used in materials of
diverse types/contexts by diverse publishers, including governmental (still
need a more cogent statement of this)
Criteria for exclusion:
> If the symbol is MERELY a drawing (stylized or not) of something,
e.g. this is intended to exclude pictures of cows, dragons, etc.
> If the symbol is usually used in 2-Dimensional diagrams, e.g. circuit
components, weather chart symbols
> If the symbol is composable, e.g. a slash through some other symbol
indicating negation, APL composites(?!)
> If the symbol is recognized only by a small group of people, e.g.
technical symbols for some special field ... analogous to the Buginese
alphabet: these characters exist but are just not "common" enough, at least
for Unicode 1.0
Joe
From glennw@Sun.COM Tue Nov 14 15:41:40 1989
Date: Tue, 14 Nov 89 13:02:40 PST
From: glennw@Sun.COM (Glenn P. Wright)
To: unicode@Sun.COM
Subject: Unicode Consortium details
Dear all.
During the progress of the last two or three Unicode committee meetings we have
been discussing the notion of turning the committee into a consortium.
In following mail I will outline the proposed charter and ground rules
for the consortium.
I believe the proposed rules and charter for the consortium are roughly
in-line with the wishes of the existing committee.
I personally believe that the formation of this consortium is critical
to the dispersal of information regarding the Unicode scheme.
The consortium should allow us to have involvement from a broader range of
organisations and regions.
Please take time to review the mail that will shortly follow.
Unless there are specific objections to the layout I will begin the process of
identifying an organization, spokesperson and location for the Unicode
consortium.
In particular I would like to hear suggestions regarding other organisations
and individuals that you feel should be added to the following list of
interested parties:
Unicode Interest, Electronic mailing list, to date:
# James Higa NeXT
# Paul Hegarty NeXT
# Matt Morse NeXT
# Lee Collins Apple
# Joe Becker Xerox
# Jackson Adobe systems
# Tom Yap Sun Intercon
# Rick Kwan Sun Intercon
# Albert Fung Sun Intercon
# Nelson Ng Sun
# Bill English Sun
# Teruhiko Kurosaka Sun Intercon
# Karen Smith-Yoshimura Research Libraries Group
# Mike Kernaghan Metaphor
# Ken Whistler Metaphor/Berkeley
# Erik Wendelboe HP
# Wayne Krone HP
# Mike Ksar HP
# Gary Miller IBM
# Joe Bosurgi Claris
# Rick Mcgowan AT&T USO Japan
# Hiromichi Kogure AT&T USO Japan
# Doug Merritt Hunter Systems
Glenn Wright
================================
Sun Microsystems
2550 Garcia Avenue
Mountain View
California CA 94043
USA. Tel (1) 415 336 6983
gwright@sun.com or {..sun}!gwright
From lcollins@apple.com Tue Nov 14 17:55:20 1989
Date: Tue, 14 Nov 89 17:52:21 PST
To: unicode@Sun.COM
Subject: Microsoft report
I talked to Michel Suignard, who handles international code sets for Microsoft. He is very interested in Unicode, likes the separation of text content and form, and has taken the Unicode charts on a trip to Asia where they were well received even in Japan. He is faxing down a list of questions and concerns which I hope to be able to discuss at this week's meeting. He would like to be invited to future meetings (given time to make the travel arrangements). Apparently dealing with code pages is such a pain that some at Microsoft have understood the vision of Unicode. We could see it on a future version of OS/2.
Lee