Re: whitespace problem

From: Mark Davis (markdavis@ispchannel.com)
Date: Sat Apr 29 2000 - 17:40:46 EDT


Tim is quite right, and has spurred me into providing more information. If you are doing wordbreak or linebreak, there are products available that you can use without having to roll your own. I'll mention IBM's open-source products (there are others available as well), which can be obtained at:

For C/C++: http://oss.software.ibm.com/icu/
For Java: http://oss.software.ibm.com/icu4j

Both packages provide dictionary-based and rule-based code. Either one can be customized, so that the behavior can be tuned for different locales. (Note: wordbreak is different from linebreak. For more information, see Version 3.0 of The Unicode Standard.)

If you are working in Java, the base system from Sun has support in the BreakIterator class. ICU4J contains extensions that support Thai and a new regular-expression-like mechanism for building the rule-based BreakIterators. For examples, look at the files that start with "Break" in the following directories:

code: http://oss.software.ibm.com/developerworks/opensource/cvs/icu4j/icu4j/src/com/ibm/text/
data: http://oss.software.ibm.com/developerworks/opensource/cvs/icu4j/icu4j/src/com/ibm/text/resources/
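
As a rough illustration of the BreakIterator support mentioned above, here is a minimal sketch that uses only the standard java.text.BreakIterator API from the JDK (not the ICU4J extensions). The class name WordBreakDemo and the helper segments() are made up for this example, and the exact boundaries you get may vary with the JDK version and locale.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; only BreakIterator itself comes from the JDK.
class WordBreakDemo {

    // Cuts the text at every boundary reported by the word BreakIterator.
    // Note that whitespace and punctuation come back as segments too.
    static List<String> segments(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // Print each segment in brackets so whitespace segments are visible.
        for (String s : segments("Hello, world!", Locale.ENGLISH)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

The same loop works with BreakIterator.getLineInstance() if what you want is line-break opportunities rather than word boundaries.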

For example, here are the default wordbreak rules:

        // default rules for finding word boundaries
        { "WordBreakRules",
            // ignore non-spacing marks, enclosing marks, and format characters,
            // all of which should not influence the algorithm
            "$ignore=[[:Mn:][:Me:][:Cf:]];"

            // Hindi phrase separator, kanji, katakana, hiragana, CJK diacriticals,
            // other letters, and digits
            + "danda=[\u0964\u0965];"
            + "kanji=[\u3005\u4e00-\u9fa5\uf900-\ufa2d];"
            + "kata=[\u3099-\u309c\u30a1-\u30fe];"
            + "hira=[\u3041-\u309e\u30fc];"
            + "let=[[[:L:][:Mc:]]-[{kanji}{kata}{hira}]];"
            + "dgt=[:N:];"

            // punctuation that can occur in the middle of a word: currently
            // dashes, apostrophes, quotation marks, and periods
            + "mid-word=[[:Pd:]\u00ad\u2027\\\"\\\'\\.];"

            // punctuation that can occur in the middle of a number: currently
            // apostrophes, quotation marks, periods, commas, and the Arabic
            // decimal point
            + "mid-num=[\\\"\\\'\\,\u066b\\.];"

            // punctuation that can occur at the beginning of a number: currently
            // the period, the number sign, and all currency symbols except the cents sign
            + "pre-num=[[[:Sc:]-[\u00a2]]\\#\\.];"

            // punctuation that can occur at the end of a number: currently
            // the percent, per-thousand, per-ten-thousand, and Arabic percent
            // signs, the cents sign, and the ampersand
            + "post-num=[\\%\\&\u00a2\u066a\u2030\u2031];"

            // line separators: currently LF, FF, PS, and LS
            + "ls=[\n\u000c\u2028\u2029];"

            // whitespace: all space separators and the tab character
            + "ws=[[:Zs:]\t];"

            // a word is a sequence of letters that may contain internal
            // punctuation, as long as it begins and ends with a letter and
            // never contains two punctuation marks in a row
            + "word=({let}+({mid-word}{let}+)*{danda}?);"

            // a number is a sequence of digits that may contain internal
            // punctuation, as long as it begins and ends with a digit and
            // never contains two punctuation marks in a row.
            + "number=({dgt}+({mid-num}{dgt}+)*);"

            // break after every character, with the following exceptions
            // (this will cause punctuation marks that aren't considered
            // part of words or numbers to be treated as words unto themselves)
            + ".;"

            // keep together any sequence of contiguous words and numbers
            // (including just one of either), plus an optional trailing
            // number-suffix character
            + "{word}?({number}{word})*({number}{post-num}?)?;"

            // keep together any sequence of contiguous words and numbers
            // that starts with a number-prefix character and a number,
            // and may end with a number-suffix character
            + "{pre-num}({number}{word})*({number}{post-num}?)?;"

            // keep together runs of whitespace (optionally with a single trailing
            // line separator or CRLF sequence)
            + "{ws}*\r?{ls}?;"

            // keep together runs of Katakana
            + "{kata}*;"

            // keep together runs of Hiragana
            + "{hira}*;"

            // keep together runs of Kanji
            + "{kanji}*;"
        },
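
To make the "word", "number", and "ws" definitions above a little more concrete, here is a loose approximation of them using java.util.regex and java.lang.Character. This is not how the rule-based BreakIterator is implemented, and it does not use the rule syntax itself. The class WordRuleSketch and its member names are invented for this sketch, and the "let" class here skips the subtraction of kanji, katakana, and hiragana.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: approximates three of the definitions with plain regexes.
class WordRuleSketch {
    // Approximation of "let": letters and spacing combining marks
    // (assumption: the CJK subtraction from the real rule is omitted).
    static final String LET = "[\\p{L}\\p{Mc}]";
    // "dgt": any character in the Number categories.
    static final String DGT = "\\p{N}";
    // "mid-word": dashes, soft hyphen, hyphenation point, quotes, period.
    static final String MID_WORD = "[\\p{Pd}\\u00ad\\u2027\"'.]";
    // "mid-num": quotes, comma, Arabic decimal separator, period.
    static final String MID_NUM = "[\"',\\u066b.]";
    // "danda": the two Devanagari phrase separators.
    static final String DANDA = "[\\u0964\\u0965]";

    // word = letters, optionally joined by single mid-word marks, optional danda.
    static final Pattern WORD =
        Pattern.compile(LET + "+(?:" + MID_WORD + LET + "+)*" + DANDA + "?");
    // number = digits, optionally joined by single mid-num marks.
    static final Pattern NUMBER =
        Pattern.compile(DGT + "+(?:" + MID_NUM + DGT + "+)*");

    // "ws": any space separator (category Zs) or the tab character.
    static boolean isRuleWhitespace(int codePoint) {
        return Character.getType(codePoint) == Character.SPACE_SEPARATOR
            || codePoint == '\t';
    }

    public static void main(String[] args) {
        System.out.println(WORD.matcher("don't").matches());
        System.out.println(NUMBER.matcher("1,234.56").matches());
        System.out.println(isRuleWhitespace(0x00A0));
    }
}
```

Note that isRuleWhitespace() deliberately follows the "ws" definition, so line separators such as LF are not counted. The JDK's own Character.isWhitespace() and Character.isSpaceChar() draw the line differently.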

Mark

Timothy Partridge wrote:

> Mark Davis wrote
>
> > You should look at the Unicode Character Database on
> > www.unicode.org
>
> > Tristan Rybak wrote:
>
> > > Hello
> > > I have another problem with my string class...
> > > How can I find out if unicode character is white space or
> > > not?
> > > please help!
>
> While I agree with Mark's reply, I'll answer your question
> with another question.
>
> Why do you want to know which characters are white space?
> If it is connected with finding ends of words or looking for
> somewhere to break a line, the various writing systems
> and cultural conventions worldwide make this process far more
> complex than in English.
> The Unicode Standard and technical reports give useful
> guidance on this.
>
> Tim


