Some questions about DOM(Core) Level 1 Draft 11-September-1997

From: Yung-Fong Tang (
Date: Tue Sep 16 1997 - 22:37:41 EDT

I just did a quick scan of the document and have
some questions:

> Character
> A character is an atomic unit of information with no fixed binding to either character code, or glyph/glyph image.
> It provides a large number of methods which allow its properties to be queried.
> string getName()
> Return the canonical name of the character. For characters in the repertoire of ISO 10646, this should be the
> ISO 10646 name.

Does this imply that an implementation needs to include a table of
6588 strings (the line number of file )? Why do you
need such an interface? What is the benefit of having it?

[My personal opinion is to delete this interface unless someone has a
good reason to keep it.]
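
To make the cost concrete, here is a minimal Python sketch of what
getName() amounts to. It assumes the implementation delegates to a
bundled copy of the Unicode character database (here Python's standard
unicodedata module). get_name is a hypothetical name for illustration,
not something from the draft.

```python
import unicodedata

def get_name(ch: str) -> str:
    """Return the Unicode character name, or "" if the character has none."""
    # The lookup is backed by a name table shipped with the runtime.
    return unicodedata.name(ch, "")

print(get_name("A"))       # LATIN CAPITAL LETTER A
print(get_name("\u00e9"))  # LATIN SMALL LETTER E WITH ACUTE
```

Every implementation offering this method would have to carry an
equivalent name table.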

> boolean isSimpleHan()
> boolean isTraditionalHan()
> boolean isKanji()
> boolean isHanzi()
Since ISO 10646 is based on Unicode, and Unicode unifies Han, I don't
think it makes sense to distinguish SimpleHan, TraditionalHan, Kanji and
Hanzi. The difference between them is in typeface (assuming we are
talking about Unicode now) rather than in code point. Basically all
SimpleHan are TraditionalHan and Kanji and Hanzi. Setting the typeface
issue aside, the only difference is how FREQUENTLY a character is used
in Japan and how FREQUENTLY it is used in China. And the typeface issue
should be solved at the font level, not at the text processing level.
[My personal opinion is to delete this interface unless someone has
a good reason to keep it.]
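
As an illustration of the point about code points, the sketch below
looks up one Han character that appears in both Chinese and Japanese
text. It assumes Python's standard unicodedata module. is_han is a
hypothetical helper based on a name-prefix heuristic, not a method from
the draft.

```python
import unicodedata

def is_han(ch: str) -> bool:
    """Heuristic: true if the character name marks a unified CJK ideograph."""
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")

ch = "\u6f22"  # the "Han" character itself, written the same in hanzi and kanji
print(f"U+{ord(ch):04X}", unicodedata.name(ch), is_han(ch))
# U+6F22 CJK UNIFIED IDEOGRAPH-6F22 True
```

The character data carries one code point and one name, with nothing
that says "kanji" or "hanzi", so a predicate on a single character has
no basis for telling them apart.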

> string toString()
> Conversion to native representation. Conversion may not always be perfect. String is required for
> characters that may not have a composed form in the native representation.
> UcString asUnicode()
> Conversion to Unicode representation. Conversion may not always be perfect. String is required for
> characters that may not have a composed form in the native representation.

I do not understand the difference between these two. What is string if
it is not UcString? If string is not encoded as UCS-2, what is its
charset, and how does one figure out what the character set is? (I am
sorry that I am not an expert in IDL...)
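
One possible reading, and it is only a guess at the draft's intent, is
that string holds bytes in some platform charset while UcString always
holds Unicode code units. The Python sketch below shows why the charset
question matters under that assumption. Latin-1 is picked arbitrarily as
the "native" charset.

```python
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

as_ucs2 = text.encode("utf-16-be")   # one 16-bit code unit for this character
as_latin1 = text.encode("latin-1")   # assumed "native" charset, for illustration
print(as_ucs2.hex(), as_latin1.hex())  # 00e9 e9

# "Conversion may not always be perfect": a Han character has no Latin-1 form,
# so the encoder substitutes a question mark.
lossy = "\u6f22".encode("latin-1", errors="replace")
print(lossy)  # b'?'
```

The same character yields different bytes depending on the charset, so
a caller of toString() cannot interpret the result without knowing which
charset was used.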

Also, to make the specification of the isXxx methods in the Character
object more specific, I would suggest the following changes:

> boolean isControl()
> Returns true if this character is classified as a control character.
(classified as "Cc = Other, Control" in )

> boolean isLetter()
> Returns true if this character is classified as a letter.
(classified as either "Lu = Letter, Uppercase", "Ll = Letter,
Lowercase", "Lt = Letter, Titlecase", "Lm = Letter, Modifier" or "Lo
= Letter, Other" in )

> boolean isDiacritic()
> Returns true if this character is diacritical mark.
(classified as "Sk = Symbol, Modifier" in )

> boolean isNumeric()
> [Ed: I do not know what this is]
(classified as either "Nd = Number, Decimal Digit", "Nl = Number,
Letter" or "No = Number, Other" in )

> boolean isDigit()
> Returns true if this character is a digit in some writing system, i.e. it may return true for more than just the
> characters '0' through '9'.
(classified as "Nd = Number, Decimal Digit" in )

> boolean isPunctuation()
> Returns true if the character represents a punctuation mark.
(classified as either "Pc = Punctuation, Connector", "Pd = Punctuation,
Dash", "Ps = Punctuation, Open", "Pe = Punctuation, Close", or "Po =
Punctuation, Other" in )

> boolean isSeparator()
(classified as either "Zs = Separator, Space", "Zl = Separator, Line",
or "Zp = Separator, Paragraph" in )

> boolean isSpaceSeparator()
(classified as "Zs = Separator, Space" in )

> boolean isLineSeparator()
(classified as "Zl = Separator, Line" in )

> boolean isParagraphSeparator()
(classified as "Zp = Separator, Paragraph" in )

> boolean isSymbol()
(classified as either "Sm = Symbol, Math", "Sc = Symbol, Currency",
"Sk = Symbol, Modifier", or "So = Symbol, Other" in )

> boolean isMathSymbol()
(classified as "Sm = Symbol, Math" in )

> boolean isCurrencySymbol()
(classified as "Sc = Symbol, Currency" in )

> boolean isUnclassified()
(classified as "Cn = Other, Not Assigned" in )

> boolean isUpperCase()
(classified as "Lu = Letter, Uppercase" in )

> boolean isLowerCase()
(classified as "Ll = Letter, Lowercase" in )

> boolean isTitleCase()
(classified as "Lt = Letter, Titlecase" in )

> boolean isModifier()
(classified as "Lm = Letter, Modifier" in )

> boolean isOpenPunctuation()
(classified as "Ps = Punctuation, Open" in )

> boolean isClosePunctuation()
(classified as "Pe = Punctuation, Close" in )

> boolean isSurrogate()
(classified as "Cs = Other, Surrogate" in )

However:

> boolean isBase()
> boolean isCombining()
> boolean isComposite()
> boolean isCompatibility()
> boolean isNonSpacing()
> boolean isSmall()
> boolean isNormal()
> boolean isCapital()
> boolean isFullwidth()
> boolean isHalfwidth()
> boolean isAlphabetic()
> Returns true if the character is an alphabetic character within some character set; false otherwise.

We should include a specification of, or a reference for, the is
functions above. Does ANYONE understand what these functions mean?

isFullwidth, isHalfwidth and isAlphabetic seem to be carried over from
the old C interface, which does not quite fit into the Unicode-centric
world (and the proportional-font world...). But what do the rest of the
functions mean?
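
By contrast, the predicates earlier in this message that do map onto a
general category are easy to pin down. The Python sketch below encodes a
subset of those suggested mappings as a table, assuming Python's standard
unicodedata module as the data source. CATEGORIES and check are
hypothetical names. Only categories listed in this message are used, and
newer Unicode versions define more (for example Pi and Pf), which this
sketch does not cover.

```python
import unicodedata

# Predicate name -> general category codes, following the suggestions above.
CATEGORIES = {
    "isControl": {"Cc"},
    "isLetter": {"Lu", "Ll", "Lt", "Lm", "Lo"},
    "isDiacritic": {"Sk"},
    "isNumeric": {"Nd", "Nl", "No"},
    "isDigit": {"Nd"},
    "isPunctuation": {"Pc", "Pd", "Ps", "Pe", "Po"},
    "isSeparator": {"Zs", "Zl", "Zp"},
    "isSymbol": {"Sm", "Sc", "Sk", "So"},
    "isUpperCase": {"Lu"},
    "isSurrogate": {"Cs"},
}

def check(predicate: str, ch: str) -> bool:
    """True if the character's general category is in the predicate's set."""
    return unicodedata.category(ch) in CATEGORIES[predicate]

print(check("isDigit", "\u0663"))    # ARABIC-INDIC DIGIT THREE is Nd
print(check("isNumeric", "\u2167"))  # ROMAN NUMERAL EIGHT is Nl
print(check("isDigit", "\u2167"))    # Nl is not in the isDigit set
```

With a table like this, each predicate has a definition that can be
checked against the character database rather than left to guesswork.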

> boolean isLatin()
> boolean isGreek()
> boolean isCyrillic()
> boolean isArmenian()
> boolean isHebrew()
> boolean isArabic()
> boolean isIndic()
> boolean isDevanagari()
> boolean isBengali()
> boolean isGurmukhi()
> boolean isOriya()
> boolean isTamil()
> boolean isTeluga()
> boolean isKannada()
> boolean isMalayam()
> boolean isThai()
> boolean isLao()
> boolean isKhmer()
> boolean isBurmese()
> boolean isGeorgian()
> boolean isIdeographic()
> boolean isHan()
I also feel a little bit bad about providing such functions. The problem
is that they make it more likely that callers will misuse them.

> boolean isScript(in string name)
> A more generic form of the character classification predicates above. The name parameter is the name of
> the character set as defined in the Unicode 2.0 specification. [Ed note: I think we should be
> more concrete here. Is the name interpreted case sensitively or not? Is there a
> specification that we can more concretely refer to? Is this really a "character set"
> name, or is there a more correct term that should be used?]

???? I think many people on the Unicode mailing list could help you
write a better specification.
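
To show why the open questions in the Ed note matter, here is a rough
Python sketch of a generic isScript(). It assumes Python's standard
unicodedata module and uses the first word of the character name as a
stand-in for the script, which is only a heuristic and not a real script
property. is_script is a hypothetical name, and matching the name case
insensitively is an arbitrary choice made here, not something the draft
specifies.

```python
import unicodedata

def is_script(ch: str, name: str) -> bool:
    """Heuristic: compare the first word of the character name, ignoring case."""
    first_word = unicodedata.name(ch, "").split(" ")[0]
    return bool(first_word) and first_word == name.upper()

print(is_script("\u03b1", "Greek"))  # name is GREEK SMALL LETTER ALPHA
print(is_script("a", "latin"))       # name is LATIN SMALL LETTER A
print(is_script("1", "latin"))       # name is DIGIT ONE, tied to no script
```

The last call is the kind of case where a caller could easily misuse a
script predicate: digits and punctuation appear in Latin text but do not
belong to any one script.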
