L2/01-204

From: Karlsson Kent - keka [keka@im.se]
Sent: Wednesday, May 16, 2001 12:38 PM

Line oriented parsing

This is a suggestion to add text to UAX 13 regarding the parsing of lines, of paragraphs and lines, or of forms and lines, as the case may be. I think it is helpful to give concrete examples. Note that they are examples; UAX 13 is only a guideline, and the appropriate choice is application dependent. The suggested text is not in final form, but rather just a basis for discussion about what should be added to UAX 13.

The examples are given in a "pseudo-flex" style, so familiarity with flex (or lex) is helpful for the reader. The coding could be done in C (say), but the examples would then be much longer (this paper would be at least 20 pages, maybe even 50, rather than just 2), much less readable, and would not give the few-line overview that the "pseudo-flex" style gives.

Character escape notation, in C/lex/flex style, for use in the pseudo-lex/flex code below:

    \n    line feed
    \r    carriage return
    \m    next line
    \f    form feed
    \v    vertical tabulation
    \l    line separator
    \p    paragraph separator
    \w    [Zs]*

For "record oriented" file systems, NEXT LINE is assumed to terminate the records (lines).

FILE SEPARATOR, GROUP SEPARATOR, and RECORD SEPARATOR are not dealt with here, since UAX 13 does not cover them (yet). A short mention may be useful to indicate that they have not simply been forgotten in UAX 13, but are not in common enough use to warrant explicit coverage there.

(I'm not sure if NEXT LINE is used as a line terminator or a line separator.)


1. Orientation to logical lines, where lines have termination indications (in the view of the program, not necessarily actually so in the "backing store"/file)

(The lines may still be wrapped further on physical display or print; it is the characters in the "backing store" that count here.)

    \f|\v|\l|\p|\n|\r|\r\n|\m    {yylval=cpy(yytext); return EOL;}
    (\n|\r|\r\n|\m)/<<EOF>>      {yylval=cpy(yytext); BEGIN END; return EOL;}
    <<EOF>>                      {yylval=cpy(""); BEGIN END; return EOL;}
    <END><<EOF>>                 {yylval=""; return EOF;}

Note that the first EOF returns EOL unless an end-of-line mark (\n, \r, \r\n, or \m) already stands immediately before it. This is important: otherwise the last line may be lost or otherwise mistreated, a common bug in many programs.

ALT (for applications that are forms oriented):

    \f                           {yylval=cpy(yytext); return FSEP;}
    \v|\l|\p|\n|\r|\r\n|\m       {yylval=cpy(yytext); return EOL;}
    (\n|\r|\r\n|\m)/<<EOF>>      {yylval=cpy(yytext); BEGIN END; return EOL;}
    <<EOF>>                      {yylval=cpy(""); BEGIN END; return EOL;}
    <END><<EOF>>                 {yylval=""; return EOF;}

Note that \f (and \v) have, in legacy usage, always been interpreted as separators, not terminators.

Further alternatives that cover white space inside the line or form separators may be suitable in some applications.


2. Orientation to logical lines, where lines have separation indications (in the view of the program)

    \f|\v|\l|\p                  {yylval=cpy(yytext); return LSEP;}
    (\n|\r|\r\n|\m)/.            {yylval=cpy(yytext); return LSEP;}
    (\n|\r|\r\n|\m)/<<EOF>>      {yylval=cpy(yytext); return EOF;}
    <<EOF>>                      {yylval=""; return EOF;}

ALT (for applications that are forms oriented):

    \f                           {yylval=cpy(yytext); return FSEP;}
    \v|\l|\p                     {yylval=cpy(yytext); return LSEP;}
    (\n|\r|\r\n|\m)/.            {yylval=cpy(yytext); return LSEP;}
    (\n|\r|\r\n|\m)/<<EOF>>      {yylval=cpy(yytext); return EOF;}
    <<EOF>>                      {yylval=""; return EOF;}

Further alternatives that cover white space inside the line or form separators may be suitable in some applications.


3. Orientation to paragraphs (and lines within them), where the program view is that paragraphs and lines have separation indications

    \f|\v|\l                     {yylval=cpy(yytext); return LSEP;}
    \p                           {yylval=cpy(yytext); return PSEP;}
    (\n|\r|\r\n|\m)/.            {yylval=cpy(yytext); return PSEP;}
    (\n|\r|\r\n|\m)/<<EOF>>      {yylval=cpy(yytext); return EOF;}
    <<EOF>>                      {yylval=""; return EOF;}

ALT1 (follows common "plain text" conventions):

    \f|\v|\l                            {yylval=cpy(yytext); return LSEP;}
    (\n|\r|\r\n|\m)/.                   {yylval=cpy(yytext); return LSEP;}
    \p                                  {yylval=cpy(yytext); return PSEP;}
    (\n|\r|\r\n|\m)(\n|\r|\r\n|\m)/.    {yylval=cpy(yytext); return PSEP;}
    (\n|\r|\r\n|\m)/<<EOF>>             {yylval=cpy(yytext); return EOF;}
    <<EOF>>                             {yylval=""; return EOF;}

ALT2 (like ALT1, but also allows white space to be part of the legacy paragraph separator):

    \f|\v|\l                                {yylval=cpy(yytext); return LSEP;}
    (\n|\r|\r\n|\m)/.                       {yylval=cpy(yytext); return LSEP;}
    \p                                      {yylval=cpy(yytext); return PSEP;}
    (\n|\r|\r\n|\m)\w(\n|\r|\r\n|\m)\w/.    {yylval=cpy(yytext); return PSEP;}
    (\n|\r|\r\n|\m)/<<EOF>>                 {yylval=cpy(yytext); return EOF;}
    <<EOF>>                                 {yylval=""; return EOF;}

Further alternatives that cover more of the white space inside the paragraph, line, or page (form) separators may be suitable. And, as hinted, the form separator may be a separator class of its own (FSEP, in parallel with LSEP and PSEP) in some applications, as exemplified above in the other "orientations".
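
For readers less familiar with flex, the first rule set of orientation 1 (lines with termination indications) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the suggested text. The function name split_terminated_lines and the (line, terminator) pair representation are invented for the sketch. NEXT LINE is taken to be U+0085, LINE SEPARATOR U+2028, and PARAGRAPH SEPARATOR U+2029. The "ended" flag plays the role of the END start condition.

```python
import re

# Terminator alternatives from the first rule of orientation 1.
# "\r\n" is listed first so that it is consumed as one unit, which flex
# would get from its longest-match rule.
_EOL = re.compile("\r\n|[\n\r\x85\f\v\u2028\u2029]")

# The marks that may stand directly before end of input without a further
# (empty) line being reported: \n, \r, \r\n and \m in the notation above.
_LEGACY = ("\n", "\r", "\r\n", "\x85")


def split_terminated_lines(text):
    """Return a list of (line, terminator) pairs for text.

    Each pair corresponds to one EOL token; the terminator is the matched
    mark, or "" for the EOL that is synthesized at end of input.
    """
    lines = []
    pos = 0
    ended = False  # stands in for the END start condition
    for m in _EOL.finditer(text):
        lines.append((text[pos:m.start()], m.group()))
        pos = m.end()
        # Only a legacy newline mark sitting exactly at end of input
        # suppresses the synthesized final EOL.
        ended = pos == len(text) and m.group() in _LEGACY
    if not ended:
        # First EOF: report the last (possibly empty) line so it is not lost.
        lines.append((text[pos:], ""))
    return lines
```

With this sketch, "a\nb" and "a\nb\n" both yield two lines, so a missing final newline does not lose the last line. Input ending in \f, \v, \l, or \p yields an additional empty last line, in keeping with the separator reading of those characters noted under orientation 1.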