% Copyright 2012-2024, Alexander Shibakov
% This file is part of SPLinT
%
% SPLinT is free software: you can redistribute it and/or modify
% it under the terms of the GNU General Public License as published by
% the Free Software Foundation, either version 3 of the License, or
% (at your option) any later version.
%
% SPLinT is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
% GNU General Public License for more details.
%
% You should have received a copy of the GNU General Public License
% along with SPLinT. If not, see .

@s TeX_ TeX

\def\optimization{5}
\input ldman.sty
\input ldsetup.sty
\modenormal
\input ldfrontmatter.sty

% multi-column output
\input dcols.sty
\topskip=9pt
\let\oldN\N
\let\N\chapterN
\let\M\textM
\initauxstream

\showlastactiontrue
@** Introduction.
This is a manual documenting the development of a parser that can be used to typeset \ld\ files (linker scripts) with or without the help of \CWEB. An existing parser for \ld\ has been adopted as a base, with appropriately designed actions specific to the task of typesetting.

The appendix to this manual contains the full source code (including the parts written in \Cee) of both the scanner and the parser for \ld, used in the original program. Some very minor modifications have been made to make the programs more `presentable' in \CWEB\ (in particular, the file had to be split into smaller chunks to satisfy \CWEAVE's limitations).

Nearly every aspect of the design is discussed, including the supporting \TeX\ macros that make both the parser and this documentation possible. The \TeX\ macros presented here are collected in \.{ldman.sty} and \.{ldsetup.sty} which are later included in the \TeX\ file produced by \CWEAVE.
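
The setup code that follows first supplies a default for \.{\\optimization} in case the including file has not set it, and later consults that value to decide whether the fast stack macros are loaded. A minimal sketch of the two tests in isolation (with \.{2} written out in place of \.{\\tw@@}, and assuming, as the code itself does, that \.{\\UNDEFINED} is never defined) is
\begindemo
^\ifx\optimization\UNDEFINED
^\def\optimization{0}
^\fi
^\ifnum\optimization>2
^\input yyfaststack.sty
^\fi
\enddemo
When no value has been set beforehand, both \.{\\optimization} and \.{\\UNDEFINED} are undefined at the first test, so \.{\\ifx} finds them equal and the default of \.{0} is installed. This manual sets \.{\\optimization} to \.{5} before inputting \.{ldman.sty} and \.{ldsetup.sty}, so here the default is skipped and \.{yyfaststack.sty} is input.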
@=
@G(t)
\ifx\optimization\UNDEFINED %/* this trick is based on the premise that \.{\\UNDEFINED} */
    \def\optimization{0} %/* is never defined nor created with \.{\\csname$\ldots$\\endcsname} */
\fi
\let\nx\noexpand %/* a convenient shortcut*/
\input limbo.sty %/* general setup macros */
\input yycommon.sty %/* general routines for stack and array access */
\input yymisc.sty %/* helper macros (stack manipulation, table processing, value stack pointers */
                  %/* parser initialization, optimization) */
\input yyinput.sty %/* input functions */
\input yyparse.sty %/* parser machinery */
\input flex.sty %/* lexer functions */
\ifnum\optimization>\tw@
    \input yyfaststack.sty
\fi
\input yystype.sty %/* scanner auxiliary types and functions */
\input yyunion.sty %/* parser data structures */
\amendswitch %/* adjust the \.{\\yyinput} to recognize \.{\\yyendgame} */
    \multicharswitch\near\yyeof\by\yyendgame\to\multicharswitch % /* add a new label */
\replaceaction\multicharswitch\at\yyendgame % /* replace the new empty action */
    \by{{\yyinput\yyeof\yyeof\endparseinput\removefinalvb}}\to\multicharswitch
\def\MRI{{\sc MRI}} %/* so we can say that \MRI\ script typesetting is not supported$\ldots$ */
@g

@ Up to this point, the parser initialization process has not been parser specific (although it is sensitive to the namespace chosen). The next command inputs the data structures used by the \ld\ parser. The reader should consult the file below for the details on the \AST\ produced by the parser and its semantics.

@=
@G(t)
\input ldunion.sty %@>/* \ld\ parser data structures */@+
@g

@*1 Bootstrapping.
To produce a usable parser/scanner duo, several pieces of code must be generated. The most important of these are the {\it table files\/} (\.{ldptab.tex} and \.{ldltab.tex}) for the parser and the scanner. These consist of the integer tables defining the operation of the parser and scanner automata, the values of some constants, and the `action switch'.
Just like in the case of `real' parsers and scanners, in order to make the parser and the scanner interact seamlessly, some amount of `glue' is required. As an example, a file containing the (numerical) definitions of the token values is generated by \bison\ to be used by a \flex\ generated scanner. Unfortunately, this file has too little structure for our purposes (it contains definitions of token values mixed in with other constants making it hard to distinguish one kind of definition from another). Therefore, the `glue' is generated by parsing our grammar once again, this time with a \bison\ grammar designed for typesetting \bison\ files. A special {\it bootstrapping\/} mode is used to extract the appropriate information. The name `bootstrapping' notwithstanding, the parser and lexer used in the bootstrapping phase are not necessarily the minimized versions used in bootstrapping the \bison\ parser. One component generated during the bootstrapping pass is a list of `token equivalences' (or `aliases') to be used by the lexer. Every token (to be precise, every {\it named token type}) used in a \bison\ grammar is declared using one of the \prodstyle{\%token}, \prodstyle{\%left}, \prodstyle{\%right}, \prodstyle{\%precedence}, or \prodstyle{\%nonassoc} declarations. If no {\it alias\/} (see below) has been declared using a \prodstyle{\%token} declaration, this name ends up in the |yytname| array output by \bison\ and can be used by the lexer after associating the token names with their numerical values (accomplished by \.{\\settokens}). If all tokens are named tokens, no token equivalence list is necessary to set up the interaction between the lexer and the parser. In this case (the present \ld\ parser is a typical example), the token list may also serve a secondary role: to provide hints for the macros that typeset the grammar terms, after the \.{\\tokeneq} macro is redefined for this purpose. 
On the other hand, after a declaration such as `\prodstyle{\%token} \.{CHAR} \.{"char"}' the string \.{"char"} becomes an alias for the named token \.{CHAR}. Only the string version gets recorded in the |yytname| array. Establishing the equivalence between the two token forms now can be accomplished only by examining the grammar source file and is delegated to the bootstrapping phase parser. \leavevmode\namedspot{bootstrapstates}Another (perhaps most important) goal of the bootstrapping phase is to extract the information about \flex\ {\it states\/} used by the lexer from the appropriate source file. As is the case with token names, this information is output in a rather chaotic fashion by the scanner generator and is all but useless for our purposes. The bootstrapping macros are designed to handle \flex's \prodstyle{\%x} and \prodstyle{\%s} state declarations and produce a \Cee\ file with the appropriate definitions. This file can later be included by the `driver' routine to generate the appropriate table file for the lexer. To round off the bootstrapping mode we only need to establish the output streams for the tokens and the states, supply the appropriate file names for the two lists, flag the bootstrapping mode for the bootstrapping macros and inline typesetting (\.{\\prodstyle} macros) and input the appropriate machinery. This is done by the macros below. The bootstrap lexer setup (\.{\\bootstraplexersetup}) consists of inputting the token equivalence table for the \bison\ parser (i.e.~the parser that processes the \bison\ grammar file) and defining a robust token output function which simply ignores the token values the lexer is not aware of (it should not be necessary in our case since we are using a full featured lexer). 
@=
@G(t)
\def\modebootstrap{%
    \def\bootstraplexersetup{%
        \let\yylexreturn\yylexreturnbootstrap /* only return tokens whose value is known */
        %\let\yylexreturn\yylexreturnregular /* should also work */
    }%
    \input yybootstrap.sty%
    \input yytexlex.sty%
}
@g

@*1 Namespaces and modes.
Every parser/lexer pair (as well as some other macros) operates within a set of dedicated {\it namespaces\/}. This simply means that the macros that output token values, switch lexer states and access various tables `tack on' a string of characters representing the current namespace to the `low level' control sequence name that performs the actual output or access. Say, \.{\\yytname} becomes an alias of \.{\\yytname[main]} while in the \.{[main]} namespace. When a parser or lexer is initialized, the appropriate tables are aliased with a generic name in the case of an `unoptimized' parser or lexer. The optimized parser or lexer handles the namespace referencing internally.

The mode setup macros for this manual define several separate namespaces:
{%
\def\aterm#1{\item{\sqebullet}{\ttl #1}: \ignorespaces}%
\setbox0=\hbox{\sqebullet\enspace}
\parindent=0pt
\advance\parindent by \wd0
\smallskip
\aterm{main}the \.{[main]} namespace is established for the parser that does the typesetting of the grammar.

\aterm{ld}every time a term name is processed, the token names are looked up in the \.{[ld]} namespace. The same namespace is used by the parser that typesets \ld\ script examples in the manual (i.e.~the parser described here). This is done to provide visual consistency between the description of the parser and its output.

\aterm{small{ \rm and} ldsmall}the \.{[small]} namespace is used by the term name parser itself. Since we use a customized version of the name parser, we dedicate a separate namespace for this purpose, \.{[ldsmall]}.

\aterm{prologue}the parser based on a subset of the full \bison\ grammar describing prologue declarations uses the \.{[prologue]} namespace.

\aterm{index}the \.{[index]} namespace is used for typesetting the index entries for the terms that are automatically inserted by the main parser (such as an empty right hand side of a production or an inline action).

\aterm{index:tex{\rm, }index:visual{\rm, and }texline} the macros that typeset \TeX\ entries use \.{index:tex} as a pseudonamespace to display \TeX\ terms in the index (due to the design of these typesetting macros, many of them take parameters, which can lead to chaos in the index). The \.{index:visual} pseudonamespace is used for ordering the index entries for \TeX\ macros (this is how the entry for \.{\\getsecond} (\texref{/getsecond}) ends up in the {\ssf P} section of the index). Finally, the \TeX\ typesetting macros use the \.{texline} pseudonamespace to process the names of \TeX\ command sequences in action code. The reason we refer to these as {\it pseudo\/}namespaces is that only the token (or term) names are aliased, and not, say, the finite automata names in the case of the \TeX\ typesetting.

\aterm{flexre{\rm, }flexone{, \rm and} flextwo}the parsers for \flex\ input use the \.{[flexre]}, \.{[flexone]}, and~\.{[flextwo]} namespaces for their operation. Another convention is to use the \.{\\flexpseudonamespace} to typeset \flex\ state names, and the \.{\\flexpseudorenamespace} for typesetting the names of \flex\ regular expressions. Currently, \.{\\flexpseudo...} namespaces are set equal to their non-\.{pseudo} versions by default. This setting may be changed whenever several parsers are used in the same document and tokens with the same names must be typeset in different styles. All \flex\ namespaces, as well as~\.{[main]}, \.{[small]}, and~\.{[ldsmall]} are defined by the \.{\\genericparser} macros.

\aterm{cwebclink}finally, the \.{[cwebclink]} namespace is used for typesetting the variables {\it inside\/} \ld\ scripts.
This way, the symbols exported by the linker may be typeset in a style similar to \Cee\ variables, if desired (as they play very similar roles).
}

@=
@G(t)
\def\indexpseudonamespace{[index]}
\def\cwebclinknamespace{[cwebclink]}
\let\parsernamespace\empty
@g

@ After all the appropriate tables and `glue' have been generated, the typesetting of this manual can be handled by the {\tt normal} mode. Note that this requires the \ld\ parser, as well as the \bison\ parser, including all the appropriate machinery.

The normal mode is started by including the tables and lists and initializing the \bison\ parser (accomplished by inputting \.{yyinit.sty}), followed by handling the token typesetting for the \ld\ grammar.

@=
@G(t)
\newtoks\ldcmds

\def\modenormal{%
    \commonstartup
    \input ldtexlex.sty% /* \TeX\ typesetting specific to \ld */
    \expandafter\def\csname index domain translation [L]\endcsname{%
        {\noexpand\ld\ index}% /* used to typeset the table of contents */
        {ld index}% /* outline entry */
        {L\sc D INDEX}% /* index header */
    }%
    \def\otherlangindexheader{%
        B{\sc ISON}, F{\sc LEX}, \TeX, {\sc AND} L{\sc D INDICES}%
    }% /* modify the index header */
    @>@[@@]
    @>@[@@]
}
@g

@ @=
@G(t)
\def\commonstartup{
    \def\appendr##1##2{\edef\appnext{##1{\the##1##2}}\appnext}%
    \def\appendl##1##2{\edef\appnext{##1{##2\the##1}}\appnext}%
    \input yyinit.sty %
    \input yytexlex.sty% /* \TeX\ typesetting macros */
    \input gindex.sty % /* indexing macros specific to \flex, \bison, and \ld */
    \input noweb.sty % /* \noweb\ style references */
    \xreflocaltrue
    \let\sectionlistsetup\lxrefseparator
    \let\inx\inxmod
    \let\fin\finmod
    \termindextrue
    \immediate\openout\gindex=\jobname.gdx
    \let\hostparsernamespace\ldnamespace /* the namespace where tokens are looked up for typesetting purposes */
}
@g

@ The \ld\ parser initialization requires setting a few global variables, as well as entering the \.{INITIAL} state for the \ld\ lexer.
The latter is somewhat counterintuitive and is necessitated by the ability of the parser to switch lexer states. Thus, the parser can switch the lexer state before the lexer is invoked for the first time, wreaking havoc on the lexer state stack.

@=
@G(t)
\def\ldparserinit{%
    \basicparserinit
    \includestackptr=\@@ne
    \versnodenesting=\z@@
    \ldcmds{}%
    \yyBEGIN{INITIAL}%
}
@g

@ This is the \ld\ parser invocation routine. It follows a straightforward sequence: initialize, invoke, then execute or fall back. The \.{\\preparseld} macro is invoked by \.{\\lsectionbegin} (see \.{limbo.sty}). It starts by defining the postprocessing and typesetting macro (\.{\\postparseld}) followed by the parser setup.

@=
@G(t)
\def\preparseld{%
    \let\postparse\postparseld
    \hidecslist\ldunion % /* inhibit expansion so that fewer \.{\\noexpand}s are necessary */
    \toldparser
    \displayrawtable % /* do this after the parser namespaces are set up */
    \ldparserinit
    \yyparse
}
@g

@ The postprocessing macro defines a procedure for typesetting the output and saving the parsed result to a file. After that, a generic postprocessing macro is executed.

@=
@G(t)
\def\postparseld{%
    \let\saveparsedtable\saveldtable
    \let\typesetparsedtables\typesetldtables
    \postparsegeneric{(ld script)}%
}
@g

@ The table output and typesetting macros are responsible for the parsed table output to a log file and the typographic representation of the result. The table is output after expanding its contents. The commands
\begindemo
^\hidecslist\cwebstreamchars
^\restorecslist{ld-parser-debug}\ldunion
\enddemo
are meant to inhibit most expansions so that only the list (iterator) macros are expanded. The parser for \ld\ does not currently use the list macros to speed up the parsing process but the general setup is here in case it is needed in the future.
Unfortunately, the same technique cannot be applied to the display of the \.{\\yystash} and \.{\\yyformat} streams (that use linked lists by default), since there are too many random sequences that can appear in such streams. Therefore, to facilitate debugging, one should expand the lists before displaying them (see \.{yyinput.sty} for details; lists with iterators are a bit more convenient due to their flexibility).

@=
@G(t)
\def\saveldtable#1{{%
    \hidecslist\cwebstreamchars
    \restorecslist{ld-parser-debug}\ldunion
    \expandafter\saveoutputcode\expandafter{\the\ldcmds}\exampletable{#1}%
}}

\def\typesetldtables{%
    \begingroup
        \displayparsedoutput\ldcmds
        \restorecslist{ld-parser:restash}\ldunion % /* mark variables, preprocess stash */
        \setprodtable
        \the\ldcmds
        \restorecslist{ld-display}\ldunion
        \setprodtable /* use the \bison's parser typesetting definitions */
        \restorecs{ld-display}{\anint\bint\hexint} % /* $\ldots$ except for integer typesetting */
        \the\ldcmds
        \par
        \vskip-\baselineskip
        \the\lddisplay
    \endgroup
}
@g

@ The parsing routine defined above is the first macro in the \ld\ parser stack. The only remaining procedure on the stack is an error reporting macro in case the parsing pass failed.

@=
@G(t)
\fillpstack{l}{%
    \preparseld
    {\preparsefallback{++}}% /* skip this section if parsing failed, put \.{++} on the screen */
    \relax % /* this \.{\\relax} `guards' the braces above during \.{\\poppstack} */
}
@g

@ Unless they are being bootstrapped, the \ld\ parser and its term parser are initialized by the normal mode. The token typesetting of \ld\ grammar tokens is adjusted at the same time (see the remarks above about the mechanism that is responsible for this). Most terminals (such as keywords, etc.) may be displayed unchanged (provided the names used by the lexer agree with their appearance in the script file, see below), while the typesetting of others is modified in \.{ltokenset.sty}.

In the original \bison-\flex\ interface, token names are defined as straightforward macros (a poor choice as will be seen shortly) which can sometimes clash with the standard \Cee\ macros. This is why the \ld\ lexer returns \prodstyle{ASSERT} as \prodstyle{ASSERT\_K}. The name parser treats \.{K} as a suffix to supply a visual reminder of this flaw. Note that the `suffixless' part of these tokens (such as \prodstyle{ASSERT}) is never declared and thus has to be entered in \.{ltokenset.sty} by hand.

The tokens that never appear as part of the input (such as \prodstyle{END} and \prodstyle{UNARY}) or those that do but have no fixed appearance (for example, \prodstyle{NAME}) are typeset in a style that indicates their origin. The details can be found by examining \.{ltokenset.sty}.

@=
@G(t)
\genericparser
    name: ld,
    ptables: ldptab.tex,
    ltables: ldltab.tex,
    tokens: {},
    asetup: {},
    dsetup: {},
    rsetup: {},
    optimization: {};% /* the parser and lexer are optimized when output */
\genericprettytokens
    namespace: ld,
    tokens: ldp.tok,
    correction: ltokenset.sty,
    host: ld;%
@g

@ We also need some modifications to the indexing macros in order to typeset \ld\ terms and variables separately.

@=
@G(t)
@g

@ The macros are collected in two files included at the beginning of this documentation\footnote{An attentive reader may have noticed that the files have the extension \.{.stx} instead of the traditional \.{.sty}. The reason for this is the postprocessing step that `cleans' the generated files of various \Cee\ artifacts output by \CWEB\ and turns \.{ldman.stx} and \.{ldsetup.stx} into \.{ldman.sty} and \.{ldsetup.sty} included by \.{ldman.tex}$\ldots$}.

@(ldman.stx@>=
@@;
@@;
@@;
@@;

@ The macros collected in the file below are not specific to this manual but are needed in order to use the generated \ld\ parser and are thus put in a separate file which can be included by other \TeX\ programs that wish to use them.

@(ldsetup.stx@>=
@@;
@@;

@i ldgram.x

@i ldlex.x

@** Example output.
Here is an example output of the \ld\ parser designed in this document. The original linker script is presented in the section that follows. The same parser can be used to present examples of \ld\ scripts in text similar to the one below.

\beginldprod
MEMORY
{
RAM (xrw) : ORIGIN = 0x20000000, LENGTH = 20K
FLASH (rx) : ORIGIN = 0x8000000, LENGTH = 128K
ASH (rx) : ORIGIN = 8000000, LENGTH = 128K
CLASH (rx) : ORIGIN = 700000, LENGTH = 128K
ASH (rx) : ORIGIN = $8000000, LENGTH = 128K
CLASH (rx) : ORIGIN = 700000B, LENGTH = 128K
INCLUDE file.mem
}
\endprod

\noindent The syntax of \ld\ is modular enough that there does not seem to be a need for a `parser stack' as in the case of the \bison\ parser. If one must be able to display still smaller segments of \ld\ code, using `hidden context' tricks (discussed elsewhere) seems to be a better approach.

%\saveparseoutputtrue\checktabletrue
%\saveparseoutputfalse\checktablefalse
@=
@i ldexample_l.hx

@ @=
@i ldexample_t.hx

@ @=
@i ldnp.x

@** Appendix.
The original code of the \ld\ parser and lexer is reproduced below. It is left mostly intact and is typeset by the pretty printing parser for \bison\ input. The lexer (\flex) input is given a similar treatment.

The treatment of comments is a bit more invasive. \CWEB\ silently assumes that the comment refers to the preceding statement or a group of statements which is reflected in the way the comment is typeset. The comments in \ld\ source files use the opposite convention. For the sake of consistency, such comments have been moved so as to make them fit the \CWEB\ style. The comments meant to refer to a sizable portion of the program (such as a whole function or a group of functions) are put at the beginning of a \CWEB\ section containing the appropriate part of the program.

\CWEB\ treats comments as ordinary \TeX\ so the comments are changed to take advantage of \TeX\ formatting and introduce some visual cues.
The convention of using {\it italics\/} for the original comments has been reversed: the italicized comments are the ones introduced by the author, {\it not\/} the original creators of \ld.%\checktabletrue\saveparseoutputtrue

@i ldgramo.x
@i ldlexo.x

@q Include the list of index section markers; this is a hack to get around @>
@q the lack of control over the generation of \CWEB's index; the correct order @>
@q of index entries depends on the placement of this inclusion @>

@i alphas.hx

@** Index. \checktablefalse\saveparseoutputtrue
\global\let\secrangedisplay\empty
This section lists the variable names and (in some cases) the keywords used inside the `language sections' of the \CWEB\ source. It takes advantage of the built-in facility of \CWEB\ to supply references for both definitions (set in {\it italic}) as well as uses for each \Cee\ identifier in the text. Special facilities have been added to extend indexing to \bison\ grammar terms, \TeX\ control sequences encountered in \bison\ actions, and file and section names encountered in \ld\ scripts. For a detailed description of the various conventions adhered to by the index entries the reader is encouraged to consult the remarks preceding the index of the document describing the core of the \splint\ suite. We will only mention here that (consistent with the way \bison\ references are treated) a script example:
$$
\vbox{
\beginldprod
MEMORY
{
MEMORY1 (xrw) : ORIGIN = 0x20000000, LENGTH = 20K
MEMORY2 (rx) : ORIGIN = 0x8000000, LENGTH = 128K
}
_var_1 = 0x20005000;
\endprod
}%
$$
\noindent inside the \TeX\ part of a \CWEB\ section will generate several index entries, as well, mimicking \CWEB's behavior for the {\it inline \Cee\/} (\.{\yl}$\ldots$\.{\yl}). Such entries are labeled with $^\circ$, to provide a reminder of their origin.

\makeunindexablex{{\csstring\the}{\csstring\nx}{\csstring\yy}{\csstring\yylexnext}%
                  {\csstring\else}{\csstring\fi}{\csstring\yyBEGIN}{\csstring\next}}

\let\oldMRL\MRL
\def\MRL#1{\smash{\oldMRL{#1}}}
% a more sophisticated way to handle it would be to add a \smash whenever we are
% in the [index] namespace but this is simpler and works as well