@q Copyright 2012-2022 Alexander Shibakov@> @q Copyright 2002-2014 Free Software Foundation, Inc.@> @q This file is part of SPLinT@> @q SPLinT is free software: you can redistribute it and/or modify@> @q it under the terms of the GNU General Public License as published by@> @q the Free Software Foundation, either version 3 of the License, or@> @q (at your option) any later version.@> @q SPLinT is distributed in the hope that it will be useful,@> @q but WITHOUT ANY WARRANTY; without even the implied warranty of@> @q MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the@> @q GNU General Public License for more details.@> @q You should have received a copy of the GNU General Public License@> @q along with SPLinT. If not, see .@> @** The lexer. \ifbootstrapmode \input limbo.sty \input yystype.sty \input grabstates.sty \immediate\openout\stlist=ldl_states.h \def\MRI{} \def\ld{} \fi The lexer used by \ld\ is almost straightforward. A few facilities needed by the lexer (\Cee\ header files, some output functions) are conveniently coded into the \Cee\ code run by the driver routines. They make the lexer more complex than it should have been, but the function of each such facility can be easily clarified using this documentation and occasionally referring to the manual for the \bison\ parser, which is part of this distribution. @(ldl.ll@>= @G @> @<\ld\ lexer definitions@> @= %{@> @<\ld\ lexer \Cee\ preamble@> @=%} @> @<\ld\ lexer options@> @= %% @> @ @= %% @O void define_all_states( void ) { @@; } @o @g @ @<\ld\ lexer options@>= @G(fs1) %option bison-bridge %option noyywrap nounput noinput reentrant %option noyy_top_state %option debug %option stack %option outfile="ldl.c" @g @ @<\ld\ lexer \Cee\ preamble@>= @ The file \.{ldl\_states.h} below contains the names of all the start conditions@^start conditions@> (or states) collected by the bootstrap parser. 
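The registration code in this section redefines a one-argument macro and then includes a list of its invocations, a technique often called an `X-macro'. The following is only a minimal \Cee\ sketch of that idiom, not the code used by the driver: the list macro |LDL_STATE_LIST| stands in for the generated header, and the names |ldl_state|, |ldl_state_names| and |ldl_state_index| are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the contents of ldl_states.h: one
   _register_name() entry per start condition.  In the real build the
   list lives in a generated header; here it is an ordinary macro. */
#define LDL_STATE_LIST         \
    _register_name(SCRIPT)     \
    _register_name(EXPRESSION) \
    _register_name(BOTH)

/* First expansion: turn the list into an enumeration. */
#define _register_name(name) STATE_##name,
enum ldl_state { LDL_STATE_LIST STATE_COUNT };
#undef _register_name

/* Second expansion: turn the same list into a table of printable names. */
#define _register_name(name) #name,
static const char *const ldl_state_names[] = { LDL_STATE_LIST };
#undef _register_name

/* Look up a state by name; returns -1 if the name is not in the list. */
static int ldl_state_index(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < (size_t)STATE_COUNT; i++)
        if (strcmp(ldl_state_names[i], name) == 0)
            return (int)i;
    return -1;
}
```

Because both expansions read the same list, adding a state in one place keeps the enumeration and the name table in step, which is the property the generated header relies on.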
@= #define _register_name( name ) @[Define_State( #name, name )@] #include "ldl_states.h" #undef _register_name @ The character classes used by the scanner as well as lexer state declarations have been put in the definitions section of the input file. No attempt has been made to clean up the definitions of the character classes. @<\ld\ lexer definitions@>= @<\ld\ lexer states@>@; @G(fs1) CMDFILENAMECHAR [_a-zA-Z0-9\/\.\\_\+\$\:\[\]\\\,\=\&\!\<\>\-\~] CMDFILENAMECHAR1 [_a-zA-Z0-9\/\.\\_\+\$\:\[\]\\\,\=\&\!\<\>\~] FILENAMECHAR1 [_a-zA-Z\/\.\\\$\_\~] SYMBOLCHARN [_a-zA-Z\/\.\\\$\_\~0-9] FILENAMECHAR [_a-zA-Z0-9\/\.\-\_\+\=\$\:\[\]\\\,\~] WILDCHAR [_a-zA-Z0-9\/\.\-\_\+\=\$\:\[\]\\\,\~\?\*\^\!] WHITE [ \t\n\r]+ NOCFILENAMECHAR [_a-zA-Z0-9\/\.\-\_\+\$\:\[\]\\\~] V_TAG [.$_a-zA-Z][._a-zA-Z0-9]* V_IDENTIFIER [*?.$_a-zA-Z\[\]\-\!\^\\]([*?.$_a-zA-Z0-9\[\]\-\!\^\\]|::)* @g @ The lexer uses different sets of rules depending on the context and the current state. These can be changed from the lexer itself or externally by the parser (as is the case in \ld\ implementation). \locallink{stateswitchers}Later\endlink, a number of helper macros implement state switching so that the state names are very rarely used explicitly. Keeping all the state declarations in the same section simplifies the job of the \locallink{bootstrapstates}bootstrap parser\endlink, as well. \ifbootstrapmode\immediate\openout\stlist=ldl_states.h\fi @<\ld\ lexer states@>= @G(fs1) %s SCRIPT %s EXPRESSION %s BOTH %s DEFSYMEXP %s MRI %s VERS_START %s VERS_SCRIPT %s VERS_NODE @g @*1 Macros for lexer functions. The \locallink{pingpong}state switching\endlink\ `ping-pong' between the lexer and the parser aside, the \ld\ lexer is very traditional. One implementation choice deserving some attention is the treatment of comments. 
The difficulty of implementing \Cee\ style comment scanning using regular expressions is well known, so an often-used alternative is a special function that simply skips to the end of the comment. This is exactly what the \ld\ lexer does with an aptly named |comment()| function. The typesetting parser uses the \.{\\ldcomment} macro for the same purpose. For the curious, here is a \flex\ style regular expression defining \Cee\ comments\footnote{Taken from W.~McKeeman's site at \url{http://www.cs.dartmouth.edu/~mckeeman/cs118/assignments/comment.html} and adapted to \flex\ syntax. Here is the same regular expression pretty printed by \splint: \flexrestyle{"/*"("/"`[\^*/]`"*"+[\^*/])*"*"+"/"}}: $$ \hbox{\.{"/*" ("/"\yl[\^*/]\yl"*"+[\^*/])* "*"+ "/"}} $$ This expression does not handle {\it every\/} practical situation, however, since it assumes that the end of line character can be matched like any other. Neither does it detect some often-made mistakes, such as attempting to nest comments. A few minor modifications can fix this deficiency, as well as add some error handling; however, for the sake of consistency, the approach taken here mirrors the one in the original \ld. The top level of the \.{\\ldcomment} macro simply bypasses the state setup of the lexer and enters a `|while| loop' in the input routine. This macro is a reasonable approximation of the functionality provided by |comment()|. @= @G(t) \def\ldcomment{% \let\oldyyreturn\yyreturn \let\oldyylextail\yylextail \let\yylextail\yymatch %/* start inputting characters until {\tt *}{\tt /} is seen */ \let\yyreturn\ldcommentskipchars } @g @ The rest of the |while| loop merely waits for the \.{*/} combination. 
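For comparison, the same wait can be written in \Cee\ as a two-state scan. This is only a sketch under stated assumptions, not the |comment()| function of the original \ld: the function name |skip_comment| and the state names are hypothetical, and the input is taken to be a null-terminated string positioned just after the opening delimiter.

```c
#include <assert.h>
#include <stddef.h>

/* Scan states, named after the two TeX macros that play the same role. */
enum scan_state { SKIP_CHARS, SEEK_SLASH };

/* Given text positioned just after an opening slash-star, return the
   number of characters consumed up to and including the closing
   star-slash, or -1 if the input ends first. */
static long skip_comment(const char *s)
{
    assert(s != NULL);
    enum scan_state state = SKIP_CHARS;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (state == SKIP_CHARS) {
            if (s[i] == '*')
                state = SEEK_SLASH;    /* star found, look for a slash */
        } else {
            if (s[i] == '/')
                return (long)(i + 1);  /* slash found, comment is over */
            if (s[i] != '*')
                state = SKIP_CHARS;    /* neither star nor slash: start over */
            /* on another star, remain in SEEK_SLASH */
        }
    }
    return -1;                         /* unterminated comment */
}
```

A run of stars keeps the scan in the second state, so a comment ending in several stars followed by a slash terminates correctly.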
@= @G(t) \def\ldcommentskipchars{% \ifnum\yycp@@=`* \yybreak{\let\yyreturn\ldcommentseekslash\yyinput}% %/* {\tt *} found, look for {\tt /} */ \else \yybreak{\yyinput}% %/* keep skipping characters */ \yycontinue }% \def\ldcommentseekslash{% \ifnum\yycp@@=`/ \yybreak{\ldcommentfinish}%/* {\tt /} found, exit */ \else \ifnum\yycp@@=`* \yybreak@@{\yyinput}% %/* keep skipping {\tt *}'s looking for a {\tt /} */ \else \yybreak@@{\let\yyreturn\ldcommentskipchars\yyinput}% %/* found a character other than {\tt *} or {\tt /} */ \fi \yycontinue }% @g @ Once the end of the comment has been found, resume lexing the input stream. @= @G(t) \def\ldcommentfinish{% \let\yyreturn\oldyyreturn \let\yylextail\oldyylextail \yylextail } @g @ The semantics of the macros defined above do not quite match those of the |comment()| function. The most significant difference is that the portion of the action following \.{\\ldcomment} expands {\it before\/} the comment characters are skipped. In most applications, |comment()| is the last function called, so this would not limit the use of \.{\\ldcomment} too dramatically. A more intuitive and easier to use version of \.{\\ldcomment} is possible, however, if \.{\\yylextail} is not used inside actions (in the case of an `optimized' lexer the restriction is even weaker, namely, \.{\\yylextail} merely has to be absent in the portion of the action following \.{\\ldcomment}). Another remark might be in order. It would seem more appropriate to employ \TeX's native grouping mechanism to avoid the side effects caused by the assignments performed by the macros (such as \.{\\let\\oldyyreturn\\yyreturn}). While this is possible with some careful macro writing, a na\:\i ve grouping attempt would interfere with the assignments performed by \.{\\yymatch} (e.g.~\.{\\yyresetstreams}). Avoiding assignments like these is still possible, although the effort required borders on excessive. 
@= @G(t) \def\ldcomment#1\yylextail{% \let\oldyyreturn\yyreturn \def\yylexcontinuation{#1\yylextail}% \let\yyreturn\ldcommentskipchars %/* start inputting characters until {\tt *}{\tt /} is seen */ \yymatch } \def\ldcommentfinish{% \let\yyreturn\oldyyreturn \yylexcontinuation } @g @ \namedspot{pretendbufferswlex}The same idea can be applied to `\locallink{pretendbuffersw}pretend buffer switching\endlink'. Whenever the `real' \ld\ parser encounters an \prodstyle{INCLUDE} command, it switches the input buffer for the lexer and waits for the lexer to return the tokens from the file it just opened. When the lexer scans the end of the included file, it returns a special token, \prodstyle{END}, that completes the appropriate production and lets the parser continue with its job. We would like to simulate the file inclusion by inserting the appropriate end of file marker for the lexer (a double \.{\\yyeof}). After the relevant production completes, the marker has to be cleaned up from the input stream (the lexer is designed to leave it intact to be able to read the end of file multiple times while looking for the longest match). The macro below is designed to handle this task. The idea is to replace the double \.{\\yyeof} at the beginning of the input with an appropriate lexer action. The \.{\\yyreadinput} macro handles the input buffer and inserts the tail portion of the current \flex\ action in front of it. @= @G(t) \def\ldcleanyyeof#1\yylextail{% \yyreadinput{\ldcl@@anyyeof{#1\yylextail}}{\romannumeral0\yyr@@@@dinput}% } \def\ldcl@@anyyeof#1#2#3{% #3\ldcl@@anyye@@f{#1}#2% } \def\ldcl@@anyye@@f#1#2\yyeof\yyeof{#1} @g @*1 Regular expressions. The `heart' of any lexer is the collection of regular expressions that describe the {\it tokens\/} of the appropriate language. The variety of tokens recognized by \ld\ is quite extensive and is described in the sections that follow. Variable names, constants, and algebraic operations come first. 
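Among the constants, the pattern that accepts an optional \.{K} or \.{M} suffix deserves a short illustration. In the original \ld\ such a suffix scales the value by 1024 or by $1024^2$, and a leading \.{\$} marks a hexadecimal number. The \Cee\ sketch below shows one way to compute the value of an already matched token. It is not the action used by either lexer: the name |ld_constant_value| is hypothetical, and the standard |strtoull()| stands in for the linker's own conversion routine, so a decimal string with a leading zero is read as octal here.

```c
#include <assert.h>
#include <stdlib.h>

/* Convert a constant such as "4K", "0x10", "$ff" or "2M" to its value.
   The text is assumed to have matched the lexer pattern already, so no
   error reporting is attempted. */
static unsigned long long ld_constant_value(const char *text)
{
    assert(text != NULL && text[0] != '\0');
    int base = 0;                  /* let strtoull honor a 0x prefix */
    if (text[0] == '$') {          /* dollar prefix: hexadecimal */
        text++;
        base = 16;
    }
    char *end;
    unsigned long long value = strtoull(text, &end, base);
    switch (*end) {                /* optional scale suffix */
    case 'K': case 'k': value *= 1024ULL; break;
    case 'M': case 'm': value *= 1024ULL * 1024ULL; break;
    default: break;
    }
    return value;
}
```

The suffix letters are not hexadecimal digits, so the conversion stops exactly at the suffix even when the digits themselves are hexadecimal.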
@= @G(fs2) { "/*" {@> @[TeX_( "/ldcomment/yylexnext" );@]@=} } { "-" {@> @[TeX_( "/yylexreturnchar" );@]@=} "+" {@> @[TeX_( "/yylexreturnchar" );@]@=} {FILENAMECHAR1}{SYMBOLCHARN}* {@> @[TeX_( "/yylexreturnsym{NAME}" );@]@=} "=" {@> @[TeX_( "/yylexreturnchar" );@]@=} } { "$"([0-9A-Fa-f])+ {@> @ @=} ([0-9A-Fa-f])+(H|h|X|x|B|b|O|o|D|d) {@> @@=} } { ((("$"|0[xX])([0-9A-Fa-f])+)|(([0-9])+))(M|K|m|k)? {@> @@=} } { "<<=" {@> @[TeX_( "/yylexreturnptr{LSHIFTEQ}" );@]@=} ">>=" {@> @[TeX_( "/yylexreturnptr{RSHIFTEQ}" );@]@=} "||" {@> @[TeX_( "/yylexreturnptr{OROR}" );@]@=} "==" {@> @[TeX_( "/yylexreturnptr{EQ}" );@]@=} "!=" {@> @[TeX_( "/yylexreturnptr{NE}" );@]@=} ">=" {@> @[TeX_( "/yylexreturnptr{GE}" );@]@=} "<=" {@> @[TeX_( "/yylexreturnptr{LE}" );@]@=} "<<" {@> @[TeX_( "/yylexreturnptr{LSHIFT}" );@]@=} ">>" {@> @[TeX_( "/yylexreturnptr{RSHIFT}" );@]@=} "+=" {@> @[TeX_( "/yylexreturnptr{PLUSEQ}" );@]@=} "-=" {@> @[TeX_( "/yylexreturnptr{MINUSEQ}" );@]@=} "*=" {@> @[TeX_( "/yylexreturnptr{MULTEQ}" );@]@=} "/=" {@> @[TeX_( "/yylexreturnptr{DIVEQ}" );@]@=} "&=" {@> @[TeX_( "/yylexreturnptr{ANDEQ}" );@]@=} "|=" {@> @[TeX_( "/yylexreturnptr{OREQ}" );@]@=} "&&" {@> @[TeX_( "/yylexreturnptr{ANDAND}" );@]@=} [&|~!?*+\-/%=<>{}()\[\]:;,] {@> @[TeX_( "/yylexreturnchar" );@]@=} } @g @ The bulk of tokens produced by the lexer are the keywords used inside script files. File name syntax is listed as well, along with the miscellanea like whitespace and version symbols. 
@= @G(fs2) { "MEMORY" {@> @[TeX_( "/yylexreturnptr{MEMORY}" );@]@=} "REGION_ALIAS" {@> @[TeX_( "/yylexreturnptr{REGION_ALIAS}" );@]@=} "LD_FEATURE" {@> @[TeX_( "/yylexreturnptr{LD_FEATURE}" );@]@=} "VERSION" {@> @[TeX_( "/yylexreturnptr{VERSIONK}" );@]@=} "TARGET" {@> @[TeX_( "/yylexreturnptr{TARGET_K}" );@]@=} "SEARCH_DIR" {@> @[TeX_( "/yylexreturnptr{SEARCH_DIR}" );@]@=} "OUTPUT" {@> @[TeX_( "/yylexreturnptr{OUTPUT}" );@]@=} "INPUT" {@> @[TeX_( "/yylexreturnptr{INPUT}" );@]@=} "ENTRY" {@> @[TeX_( "/yylexreturnptr{ENTRY}" );@]@=} "MAP" {@> @[TeX_( "/yylexreturnptr{MAP}" );@]@=} "CREATE_OBJECT_SYMBOLS" {@> @[TeX_( "/yylexreturnptr{CREATE_OBJECT_SYMBOLS}" );@]@=} "CONSTRUCTORS" {@> @[TeX_( "/yylexreturnptr{CONSTRUCTORS}" );@]@=} "FORCE_COMMON_ALLOCATION" {@> @[TeX_( "/yylexreturnptr{FORCE_COMMON_ALLOCATION}" );@]@=} "INHIBIT_COMMON_ALLOCATION" {@> @[TeX_( "/yylexreturnptr{INHIBIT_COMMON_ALLOCATION}" );@]@=} "SECTIONS" {@> @[TeX_( "/yylexreturnptr{SECTIONS}" );@]@=} "INSERT" {@> @[TeX_( "/yylexreturnptr{INSERT_K}" );@]@=} "AFTER" {@> @[TeX_( "/yylexreturnptr{AFTER}" );@]@=} "BEFORE" {@> @[TeX_( "/yylexreturnptr{BEFORE}" );@]@=} "FILL" {@> @[TeX_( "/yylexreturnptr{FILL}" );@]@=} "STARTUP" {@> @[TeX_( "/yylexreturnptr{STARTUP}" );@]@=} "OUTPUT_FORMAT" {@> @[TeX_( "/yylexreturnptr{OUTPUT_FORMAT}" );@]@=} "OUTPUT_ARCH" {@> @[TeX_( "/yylexreturnptr{OUTPUT_ARCH}" );@]@=} "HLL" {@> @[TeX_( "/yylexreturnptr{HLL}" );@]@=} "SYSLIB" {@> @[TeX_( "/yylexreturnptr{SYSLIB}" );@]@=} "FLOAT" {@> @[TeX_( "/yylexreturnptr{FLOAT}" );@]@=} "QUAD" {@> @[TeX_( "/yylexreturnptr{QUAD}" );@]@=} "SQUAD" {@> @[TeX_( "/yylexreturnptr{SQUAD}" );@]@=} "LONG" {@> @[TeX_( "/yylexreturnptr{LONG}" );@]@=} "SHORT" {@> @[TeX_( "/yylexreturnptr{SHORT}" );@]@=} "BYTE" {@> @[TeX_( "/yylexreturnptr{BYTE}" );@]@=} "NOFLOAT" {@> @[TeX_( "/yylexreturnptr{NOFLOAT}" );@]@=} "OVERLAY" {@> @[TeX_( "/yylexreturnptr{OVERLAY}" );@]@=} "SORT_BY_NAME" {@> @[TeX_( "/yylexreturnptr{SORT_BY_NAME}" );@]@=} 
"SORT_BY_ALIGNMENT" {@> @[TeX_( "/yylexreturnptr{SORT_BY_ALIGNMENT}" );@]@=} "SORT" {@> @[TeX_( "/yylexreturnptr{SORT_BY_NAME}" );@]@=} "SORT_BY_INIT_PRIORITY" {@> @[TeX_( "/yylexreturnptr{SORT_BY_INIT_PRIORITY}" );@]@=} "SORT_NONE" {@> @[TeX_( "/yylexreturnptr{SORT_NONE}" );@]@=} "EXTERN" {@> @[TeX_( "/yylexreturnptr{EXTERN}" );@]@=} "o"|"org" {@> @[TeX_( "/yylexreturnptr{ORIGIN}" );@]@=} "l"|"len" {@> @[TeX_( "/yylexreturnptr{LENGTH}" );@]@=} "PHDRS" {@> @[TeX_( "/yylexreturnptr{PHDRS}" );@]@=} } { "BLOCK" {@> @[TeX_( "/yylexreturnptr{BLOCK}" );@]@=} "BIND" {@> @[TeX_( "/yylexreturnptr{BIND}" );@]@=} "LENGTH" {@> @[TeX_( "/yylexreturnptr{LENGTH}" );@]@=} "ORIGIN" {@> @[TeX_( "/yylexreturnptr{ORIGIN}" );@]@=} "ALIGN" {@> @[TeX_( "/yylexreturnptr{ALIGN_K}" );@]@=} "DATA_SEGMENT_ALIGN" {@> @[TeX_( "/yylexreturnptr{DATA_SEGMENT_ALIGN}" );@]@=} "DATA_SEGMENT_RELRO_END" {@> @[TeX_( "/yylexreturnptr{DATA_SEGMENT_RELRO_END}" );@]@=} "DATA_SEGMENT_END" {@> @[TeX_( "/yylexreturnptr{DATA_SEGMENT_END}" );@]@=} "ADDR" {@> @[TeX_( "/yylexreturnptr{ADDR}" );@]@=} "LOADADDR" {@> @[TeX_( "/yylexreturnptr{LOADADDR}" );@]@=} "ALIGNOF" {@> @[TeX_( "/yylexreturnptr{ALIGNOF}" );@]@=} "ASSERT" {@> @[TeX_( "/yylexreturnptr{ASSERT_K}" );@]@=} "NEXT" {@> @[TeX_( "/yylexreturnptr{NEXT}" );@]@=} "sizeof_headers" {@> @[TeX_( "/yylexreturnptr{SIZEOF_HEADERS}" );@]@=} "SIZEOF_HEADERS" {@> @[TeX_( "/yylexreturnptr{SIZEOF_HEADERS}" );@]@=} "SEGMENT_START" {@> @[TeX_( "/yylexreturnptr{SEGMENT_START}" );@]@=} "SIZEOF" {@> @[TeX_( "/yylexreturnptr{SIZEOF}" );@]@=} "GROUP" {@> @[TeX_( "/yylexreturnptr{GROUP}" );@]@=} "AS_NEEDED" {@> @[TeX_( "/yylexreturnptr{AS_NEEDED}" );@]@=} "DEFINED" {@> @[TeX_( "/yylexreturnptr{DEFINED}" );@]@=} "NOCROSSREFS" {@> @[TeX_( "/yylexreturnptr{NOCROSSREFS}" );@]@=} "NOLOAD" {@> @[TeX_( "/yylexreturnptr{NOLOAD}" );@]@=} "DSECT" {@> @[TeX_( "/yylexreturnptr{DSECT}" );@]@=} "COPY" {@> @[TeX_( "/yylexreturnptr{COPY}" );@]@=} "INFO" {@> @[TeX_( "/yylexreturnptr{INFO}" 
);@]@=} "OVERLAY" {@> @[TeX_( "/yylexreturnptr{OVERLAY}" );@]@=} "ONLY_IF_RO" {@> @[TeX_( "/yylexreturnptr{ONLY_IF_RO}" );@]@=} "ONLY_IF_RW" {@> @[TeX_( "/yylexreturnptr{ONLY_IF_RW}" );@]@=} "SPECIAL" {@> @[TeX_( "/yylexreturnptr{SPECIAL}" );@]@=} "INPUT_SECTION_FLAGS" {@> @[TeX_( "/yylexreturnptr{INPUT_SECTION_FLAGS}" );@]@=} "INCLUDE" {@> @[TeX_( "/yylexreturnptr{INCLUDE}" );@]@=} "AT" {@> @[TeX_( "/yylexreturnptr{AT}" );@]@=} "ALIGN_WITH_INPUT" {@> @[TeX_( "/yylexreturnptr{ALIGN_WITH_INPUT}" );@]@=} "SUBALIGN" {@> @[TeX_( "/yylexreturnptr{SUBALIGN}" );@]@=} "HIDDEN" {@> @[TeX_( "/yylexreturnptr{HIDDEN}" );@]@=} "PROVIDE" {@> @[TeX_( "/yylexreturnptr{PROVIDE}" );@]@=} "PROVIDE_HIDDEN" {@> @[TeX_( "/yylexreturnptr{PROVIDE_HIDDEN}" );@]@=} "KEEP" {@> @[TeX_( "/yylexreturnptr{KEEP}" );@]@=} "EXCLUDE_FILE" {@> @[TeX_( "/yylexreturnptr{EXCLUDE_FILE}" );@]@=} "CONSTANT" {@> @[TeX_( "/yylexreturnptr{CONSTANT}" );@]@=} "\n" {@> @[TeX_( "/yylexnext" );@]@=} } { "MAX" {@> @[TeX_( "/yylexreturnptr{MAX_K}" );@]@=} "MIN" {@> @[TeX_( "/yylexreturnptr{MIN_K}" );@]@=} "LOG2CEIL" {@> @[TeX_( "/yylexreturnptr{LOG2CEIL}" );@]@=} } { "ABSOLUTE"|"absolute" {@> @[TeX_( "/yylexreturnptr{ABSOLUTE}" );@]@=} [ \t\r]+ {@> @[TeX_( "/yylexnext" );@]@=} } { {FILENAMECHAR1}{FILENAMECHAR}* {@> @[TeX_( "/yylexreturnsym{NAME}" );@]@=} "-l"{FILENAMECHAR}+ {@> @[TeX_( "/yylexreturnsym{NAME}" );@]@=} } { {FILENAMECHAR1}{NOCFILENAMECHAR}* {@> @[TeX_( "/yylexreturnsym{NAME}" );@]@=} "-l"{NOCFILENAMECHAR}+ {@> @[TeX_( "/yylexreturnsym{NAME}" );@]@=} }