% Copyright 2012-2024, Alexander Shibakov
% This file is part of SPLinT
%
% SPLinT is free software: you can redistribute it and/or modify
% it under the terms of the GNU General Public License as published by
% the Free Software Foundation, either version 3 of the License, or
% (at your option) any later version.
%
% SPLinT is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
% GNU General Public License for more details.
%
% You should have received a copy of the GNU General Public License
% along with SPLinT.  If not, see <http://www.gnu.org/licenses/>.
@s TeX_ TeX
\def\optimization{5}
\input ldman.sty
\input ldsetup.sty
\modenormal
\input ldfrontmatter.sty
% multi-column output
\input dcols.sty
\topskip=9pt

\let\oldN\N
\let\N\chapterN
\let\M\textM
\initauxstream

\showlastactiontrue
@** Introduction.
This is a manual documenting the development of a parser that can be
used to typeset \ld\ files (linker scripts) with or without the help
of \CWEB. An existing parser for \ld\ has been adopted as a base, with
appropriately designed actions specific to the task of
typesetting. The appendix to this manual contains the full source code
(including the parts written in \Cee) of both the scanner and the parser for \ld,
used in the original program. Some very minor modifications have been
made to make the programs more `presentable' in \CWEB\ (in particular, the
file had to be split into smaller chunks to satisfy \CWEAVE's
limitations). 

Nearly every aspect of the design is discussed, including the
supporting \TeX\ macros that make both the parser and this
documentation possible. The \TeX\ macros presented here are collected
in \.{ldman.sty} and \.{ldsetup.sty}, which are later included in the
\TeX\ file produced by \CWEAVE.
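
For instance, a \TeX\ document that wants to reuse these macros could start roughly
the way this manual does. The lines below are only a sketch mirroring the preamble of
this manual; the value assigned to \.{\\optimization} is the one used here and is
not a requirement:
\begindemo
^\def\optimization{5}
^\input ldman.sty
^\input ldsetup.sty
^\modenormal
\enddemo
If \.{\\optimization} is left undefined, it is set to~\.{0} when the generic parser
machinery is loaded.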
@<Set up the generic parser machinery@>=
@G(t)
\ifx\optimization\UNDEFINED %/* this trick is based on the premise that \.{\\UNDEFINED} */
    \def\optimization{0}    %/* is never defined nor created with \.{\\csname$\ldots$\\endcsname} */
\fi

\let\nx\noexpand    %/* a convenient shortcut*/

\input limbo.sty    %/* general setup macros */
\input yycommon.sty %/* general routines for stack and array access */
\input yymisc.sty   %/* helper macros (stack manipulation, table processing, value stack pointers, */
                    %/* parser initialization, optimization) */
\input yyinput.sty  %/* input functions */
\input yyparse.sty  %/* parser machinery */
\input flex.sty     %/* lexer functions */

\ifnum\optimization>\tw@
    \input yyfaststack.sty
\fi

\input yystype.sty  %/* scanner auxiliary types and functions */
\input yyunion.sty  %/* parser data structures */

\amendswitch        %/* adjust the \.{\\yyinput} to recognize \.{\\yyendgame} */
    \multicharswitch\near\yyeof\by\yyendgame\to\multicharswitch % /* add a new label */
\replaceaction\multicharswitch\at\yyendgame % /* replace the new empty action */
    \by{{\yyinput\yyeof\yyeof\endparseinput\removefinalvb}}\to\multicharswitch

\def\MRI{{\sc MRI}} %/* so we can say that \MRI\ script typesetting is not supported$\ldots$ */
@g

@ Up to this point, the parser initialization process has not been parser specific (although
sensitive to the namespace chosen). The next command inputs the data structures used
by the \ld\ parser. The reader should consult the file below for the details on the \AST\ produced
by the parser and its semantics.
@<Define the normal mode@>=
@G(t)
\input ldunion.sty  %@>/* \ld\ parser data structures */@+
@g

@*1 Bootstrapping. 
To produce a usable parser/scanner duo, several pieces of code must
be generated. The most important of these are the {\it table files\/}
(\.{ldptab.tex} and \.{ldltab.tex}) for the parser and the scanner. These
consist of the integer tables defining the operation of the parser and
scanner automata, the values of some constants, and the `action switch'.

Just like in the case of `real' parsers and scanners, in order to
make the parser and the scanner interact seamlessly, some amount of `glue'
is required. As an example, a file containing the (numerical)
definitions of the token values is generated by \bison\ to be used by
a \flex\ generated scanner. Unfortunately, this file has too little
structure for our purposes (it contains definitions of token values
mixed in with other constants, making it hard to distinguish one kind of
definition from another). Therefore, the `glue' is generated by
parsing our grammar once again, this time with a \bison\ grammar
designed for typesetting \bison\ files. A special {\it
bootstrapping\/} mode is used to extract the appropriate
information. The name `bootstrapping' notwithstanding, the parser and
lexer used in the bootstrapping phase are not necessarily the minimized versions
used in bootstrapping the \bison\ parser.

One component generated during the bootstrapping pass is a list
of `token equivalences' (or `aliases') to be used by the lexer. Every
token (to be precise, every {\it named token type\/}) used in
a \bison\ grammar is declared using one of the 
\prodstyle{\%token}, \prodstyle{\%left}, \prodstyle{\%right},
\prodstyle{\%precedence}, or \prodstyle{\%nonassoc} declarations. If
no {\it alias\/} (see below) has been declared using a
\prodstyle{\%token} declaration, this name ends up in the |yytname|
array output by \bison\ and can be used by the lexer after associating
the token names with their numerical values (accomplished by
\.{\\settokens}). If all tokens are named tokens, no token equivalence
list is necessary to set up the interaction between the lexer and the
parser. In this case (the present \ld\ parser is a typical
example), the token list may also serve a secondary role: to provide hints
for the macros that typeset the grammar terms, after the \.{\\tokeneq}
macro is redefined for this purpose.

On the other hand, after a declaration such as `\prodstyle{\%token}
\.{CHAR} \.{"char"}' the string \.{"char"} becomes an alias
for the named token \.{CHAR}. Only the string version gets recorded in
the |yytname| array. Establishing the equivalence between the two
token forms can now be accomplished only by examining the grammar
source file and is delegated to the bootstrapping phase parser.

\leavevmode\namedspot{bootstrapstates}Another (perhaps the most important)
goal of the bootstrapping phase is to extract the information about \flex\
{\it states\/} used by the lexer from the appropriate source file. As
is the case with token names, this information is output in a rather
chaotic fashion by the scanner generator and is all but useless for
our purposes. The bootstrapping macros are designed to
handle \flex's \prodstyle{\%x} and \prodstyle{\%s} state declarations
and produce a \Cee\ file with the appropriate definitions. This file
can later be included by the `driver' routine to generate the
appropriate table file for the lexer.  To round off the bootstrapping
mode, we only need to establish the output streams for the tokens and
the states, supply the appropriate file names for the two lists, flag
the bootstrapping mode for the bootstrapping macros and inline
typesetting (\.{\\prodstyle} macros), and input the appropriate
machinery.

This is done by the macros below. The bootstrap lexer setup
(\.{\\bootstraplexersetup}) consists of inputting the token
equivalence table for the \bison\ parser (i.e.~the parser that
processes the \bison\ grammar file) and defining a robust token output
function which simply ignores the token values the lexer is not aware
of (this should not be necessary in our case since we are using a
full-featured lexer).
@<Define the bootstrapping mode@>=
@G(t)
\def\modebootstrap{%
    \def\bootstraplexersetup{%
        \let\yylexreturn\yylexreturnbootstrap /* only return tokens whose value is known */
        %\let\yylexreturn\yylexreturnregular /* should also work */
    }%
    \input yybootstrap.sty%
    \input yytexlex.sty%
}
@g

@*1 Namespaces and modes. 
Every parser/lexer pair (as well as some other macros) operates
within a set of dedicated {\it namespaces\/}. This simply means that the macros
that output token values, switch lexer states and access various
tables `tack on' a string of characters representing the current
namespace to the `low level' control sequence name that performs the
actual output or access. Say, \.{\\yytname} becomes an alias of
\.{\\yytname[main]} while in the \.{[main]} namespace. When a parser
or lexer is initialized, the appropriate tables are aliased with a
generic name in the case of an `unoptimized' parser or lexer. The
optimized parser or lexer handles the namespace referencing internally.
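
In plain \TeX\ terms, the aliasing described above amounts to something like the
following assignment. This is only an illustrative sketch under the default category
codes, not the code used by the parser macros themselves, which may be organized
quite differently:
\begindemo
^\expandafter\let\expandafter\yytname\csname yytname[main]\endcsname
\enddemo
After this assignment, \.{\\yytname} has the same meaning as the control sequence
named \.{yytname[main]}. The latter has to be constructed with \.{\\csname}
since its name contains brackets.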

The mode setup macros for this manual define several separate
namespaces:
{%
\def\aterm#1{\item{\sqebullet}{\ttl #1}: \ignorespaces}%
\setbox0=\hbox{\sqebullet\enspace}
\parindent=0pt
\advance\parindent by \wd0
\smallskip
\aterm{main}the \.{[main]} namespace is established for the parser
that does the typesetting of the grammar.

\aterm{ld}every time a term name is
processed, the token names are looked up in the \.{[ld]}
namespace. The same namespace is used by the parser that typesets \ld\
script examples in the manual (i.e.~the parser described here). This
is done to provide visual consistency between the description of the
parser and its output.

\aterm{small{ \rm and} ldsmall}the \.{[small]} namespace is used by the term
name parser itself. Since we use a customized version of the name parser, we
dedicate a separate namespace, \.{[ldsmall]}, to this purpose.

\aterm{prologue}the parser based on a subset of the full \bison\
grammar describing prologue declarations uses the \.{[prologue]}
namespace.

\aterm{index}the \.{[index]} namespace is used for typesetting the
index entries for the terms that are automatically inserted by the main parser
(such as an empty right hand side of a production or an inline action).

\aterm{index:tex{\rm, }index:visual{\rm, and }texline}
the macros that typeset \TeX\ entries use
\.{index:tex} as a pseudonamespace to display \TeX\
terms in the index (due to the design of these typesetting macros, many
of them take parameters, which can lead to chaos in the index).
The \.{index:visual} pseudonamespace is used for ordering the index entries for
\TeX\ macros (this is how the entry for \.{\\getsecond} (\texref{/getsecond})
ends up in the {\ssf P} section of the index). Finally, the \TeX\ typesetting macros
use the \.{texline} pseudonamespace to process the names of \TeX\ control sequences
in action code. The reason we refer to these as {\it pseudo\/}namespaces is that
only the token (or term) names are aliased, and not, say, the finite automata names
in the case of the \TeX\ typesetting.

\aterm{flexre{\rm, }flexone{, \rm and} flextwo}the parsers for
\flex\ input use the \.{[flexre]}, \.{[flexone]}, and~\.{[flextwo]} namespaces for
their operation. Another convention is to use the \.{\\flexpseudonamespace}
to typeset \flex\ state names, and the \.{\\flexpseudorenamespace} for typesetting
the names of \flex\ regular
expressions. Currently, \.{\\flexpseudo...} namespaces are set equal to
their non-\.{pseudo} versions by default. This setting may be changed
whenever several parsers are used in the same document and tokens with
the same names must be typeset in different styles.
All \flex\ namespaces, as well as~\.{[main]}, \.{[small]},
and~\.{[ldsmall]} are defined by the \.{\\genericparser}
macros. 

\aterm{cwebclink}finally, the \.{[cwebclink]} namespace is used for
typesetting the variables {\it inside\/} \ld\ scripts. This way, the
symbols exported by the linker may be typeset in a style similar to
\Cee\ variables, if desired (as they play very similar roles).

}
@<Begin namespace setup@>=
@G(t)
\def\indexpseudonamespace{[index]}
\def\cwebclinknamespace{[cwebclink]}
\let\parsernamespace\empty
@g

@ After all the appropriate tables and `glue' have been generated, the
typesetting of this manual can be handled by the {\tt normal} mode. Note
that this requires the \ld\ parser, as well as the \bison\ parser,
including all the appropriate machinery.

The normal mode is started by including the tables and lists and
initializing the \bison\ parser (accomplished by inputting
\.{yyinit.sty}), followed by handling the token typesetting for the
\ld\ grammar. 
@<Define the normal mode@>=
@G(t)
\newtoks\ldcmds

\def\modenormal{%
    \commonstartup
    \input ldtexlex.sty% /* \TeX\ typesetting specific to \ld */
    \expandafter\def\csname index domain translation [L]\endcsname{%
        {\noexpand\ld\ index}% /* used to typeset the table of contents */
        {ld index}% /* outline entry */
        {L\sc D INDEX}% /* index header */
    }%
    \def\otherlangindexheader{%
        B{\sc ISON}, F{\sc LEX}, \TeX, {\sc AND} L{\sc D INDICES}%
    }%               /* modify the index header */
    @>@[@<Initialize \ld\ parsers@>@]
    @>@[@<Modified name parser for \ld\ grammar@>@]
}
@g

@ @<Common startup routine@>=
@G(t)
\def\commonstartup{
    \def\appendr##1##2{\edef\appnext{##1{\the##1##2}}\appnext}%
    \def\appendl##1##2{\edef\appnext{##1{##2\the##1}}\appnext}%
    \input yyinit.sty  %
    \input yytexlex.sty% /* \TeX\ typesetting macros */
    \input gindex.sty  % /* indexing macros specific to \flex, \bison, and \ld */
    \input noweb.sty   % /* \noweb\ style references */
        \xreflocaltrue
        \let\sectionlistsetup\lxrefseparator
    \let\inx\inxmod
    \let\fin\finmod
    \termindextrue
    \immediate\openout\gindex=\jobname.gdx
    \let\hostparsernamespace\ldnamespace /* the namespace where tokens are looked up for typesetting purposes */
}
@g

@ The \ld\ parser initialization requires setting a few global
variables, as well as entering the \.{INITIAL} state for the \ld\
lexer. The latter is somewhat counterintuitive and is necessitated by
the ability of the parser to switch lexer states. In particular, the parser can
switch the lexer state before the lexer is invoked for the first time,
wreaking havoc on the lexer state stack unless the initial state has been
entered explicitly.
@<Define the normal mode@>=
@G(t)
\def\ldparserinit{%
    \basicparserinit
    \includestackptr=\@@ne
    \versnodenesting=\z@@
    \ldcmds{}%
    \yyBEGIN{INITIAL}%
}
@g

@ This is the \ld\ parser invocation routine. It follows
a straightforward initialize-invoke-execute-or-fall-back sequence. The
\.{\\preparseld} macro is invoked by \.{\\lsectionbegin} (see \.{limbo.sty}).
It starts by defining the postprocessing and typesetting macro (\.{\\postparseld})
followed by the parser setup.
@<Define the normal mode@>=
@G(t)
\def\preparseld{%
    \let\postparse\postparseld
    \hidecslist\ldunion % /* inhibit expansion so that fewer \.{\\noexpand}s are necessary */
    \toldparser
    \displayrawtable % /* do this after the parser namespaces are set up */
    \ldparserinit
    \yyparse
}
@g

@ The postprocessing macro defines a procedure for typesetting the output and saving the
parsed result to a file. After that, a generic postprocessing macro is executed.
@<Define the normal mode@>=
@G(t)
\def\postparseld{%
    \let\saveparsedtable\saveldtable
    \let\typesetparsedtables\typesetldtables
    \postparsegeneric{(ld script)}%
}
@g

@ The table output and typesetting macros are responsible for the parsed table output
to a log file and the typographic representation of the result. The table is output
after expanding its contents. The commands
\begindemo
^\hidecslist\cwebstreamchars
^\restorecslist{ld-parser-debug}\ldunion
\enddemo
are meant to inhibit most expansions so that only the list (iterator) macros are expanded.
The parser for \ld\ does not currently use the list macros to speed up the parsing process,
but the general setup is in place in case it is needed in the future. Unfortunately, the same technique
cannot be applied to the display of the \.{\\yystash} and \.{\\yyformat} streams (that use linked lists
by default), since there are too many random sequences that can appear in such streams. Therefore,
to facilitate debugging, one should expand the lists before displaying them (see \.{yyinput.sty} for
details; lists with iterators are a bit more convenient due to their flexibility).
@<Define the normal mode@>=
@G(t)
\def\saveldtable#1{{%
    \hidecslist\cwebstreamchars
    \restorecslist{ld-parser-debug}\ldunion
    \expandafter\saveoutputcode\expandafter{\the\ldcmds}\exampletable{#1}%
}}

\def\typesetldtables{%
    \begingroup
        \displayparsedoutput\ldcmds
        \restorecslist{ld-parser:restash}\ldunion % /* mark variables, preprocess stash */
        \setprodtable
        \the\ldcmds
        \restorecslist{ld-display}\ldunion
        \setprodtable /* use the \bison's parser typesetting definitions */
        \restorecs{ld-display}{\anint\bint\hexint} % /* $\ldots$ except for integer typesetting */
        \the\ldcmds
        \par
        \vskip-\baselineskip
        \the\lddisplay
    \endgroup
}
@g

@ The parsing routine defined above is the first macro in the \ld\ parser stack.
The only remaining procedure on the stack is an error reporting macro in case
the parsing pass failed.
@<Define the normal mode@>=
@G(t)
\fillpstack{l}{%
    \preparseld
    {\preparsefallback{++}}% /* skip this section if parsing failed, put \.{++} on the screen */
    \relax %                 /* this \.{\\relax} `guards' the braces above during \.{\\poppstack} */
}
@g

@ Unless they are being bootstrapped, the \ld\ parser and its
term parser are initialized by the normal mode. The token typesetting
of \ld\ grammar tokens is adjusted at the same time (see the remarks
above about the mechanism that is responsible for this). Most terminals
(such as keywords, etc.) may be displayed unchanged (provided the
names used by the lexer agree with their appearance in the script file,
see below), while the typesetting of others is modified in
\.{ltokenset.sty}. 

In the original \bison-\flex\ interface, token names are
defined as straightforward macros (a poor choice as will be seen
shortly) which can sometimes clash with the standard \Cee\ macros. 
This is why the \ld\ lexer returns \prodstyle{ASSERT} as
\prodstyle{ASSERT\_K}. The name parser treats \.{K} as a suffix to
supply a visual reminder of this flaw. Note that the `suffixless' part
of these tokens (such as \prodstyle{ASSERT}) is never declared and
thus has to be entered in \.{ltokenset.sty} by hand.

The tokens that never appear as part of the input (such as
\prodstyle{END} and \prodstyle{UNARY}) or those that do but have no
fixed appearance (for example, \prodstyle{NAME}) are typeset in a
style that indicates their origin. The details can be found by
examining \.{ltokenset.sty}.
@<Initialize \ld\ parsers@>=
@G(t)
\genericparser 
    name: ld, 
    ptables: ldptab.tex, 
    ltables: ldltab.tex, 
    tokens: {}, 
    asetup: {}, 
    dsetup: {}, 
    rsetup: {},
    optimization: {};% /* the parser and lexer are optimized when output */
\genericprettytokens 
    namespace: ld, 
    tokens: ldp.tok, 
    correction: ltokenset.sty, 
    host: ld;%
@g

@ We also need some modifications to the indexing macros in order to typeset \ld\ terms
and variables separately.
@<Additional macros for the \ld\ lexer/parser@>=
@G(t)
@g

@ The macros are collected in two files included at the beginning
of this documentation\footnote{An attentive reader may have noticed that the files
have the extension \.{.stx} instead of the traditional \.{.sty}. The reason for this 
is the postprocessing step that `cleans' the generated files of various \Cee\
artifacts output by \CWEB\ and turns \.{ldman.stx} and \.{ldsetup.stx} into 
\.{ldman.sty} and \.{ldsetup.sty} included by \.{ldman.tex}$\ldots$}.
@(ldman.stx@>=
@<Set up the generic parser machinery@>@;
@<Begin namespace setup@>@;
@<Define the bootstrapping mode@>@;
@<Common startup routine@>@;

@ The macros collected in the file below are not specific to this manual but are needed in
order to use the generated \ld\ parser and are thus put in a separate file which can be included
by other \TeX\ programs that wish to use them.
@(ldsetup.stx@>=
@<Define the normal mode@>@;
@<Additional macros for the \ld\ lexer/parser@>@;

@i ldgram.x
@i ldlex.x

@** Example output. Here is an example output of the \ld\ parser designed in this
document. The original linker script is presented in the section that
follows. The same parser can be used to present examples of \ld\ scripts
in text similar to the one below.
\beginldprod
MEMORY
{
  RAM (xrw) : ORIGIN = 0x20000000, LENGTH = 20K
  FLASH (rx) : ORIGIN = 0x8000000, LENGTH = 128K
  ASH (rx) : ORIGIN = 8000000, LENGTH = 128K
  CLASH (rx) : ORIGIN = 700000, LENGTH = 128K
  ASH (rx) : ORIGIN = $8000000, LENGTH = 128K
  CLASH (rx) : ORIGIN = 700000B, LENGTH = 128K
  INCLUDE file.mem
}
\endprod
\noindent The syntax of \ld\ is modular enough that there does not seem to be a
need for a `parser stack' as in the case of the \bison\
parser. If one must be able to display still smaller segments of \ld\
code, using `hidden context' tricks (discussed elsewhere) seems to be
a better approach.
%\saveparseoutputtrue\checktabletrue
%\saveparseoutputfalse\checktablefalse 
@<Example \ld\ script@>=
@i ldexample_l.hx

@ @<The same example of an \ld\ script@>=
@i ldexample_t.hx

@ @<Some random portion of \ld\ code@>=

@i ldnp.x

@** Appendix. The original code of the \ld\ parser and lexer is reproduced below. It
is left mostly intact and is typeset by the pretty printing parser for
\bison\ input. The lexer (\flex) input is given a similar treatment.

The treatment of comments is a bit more invasive. \CWEB\ silently
assumes that the comment refers to the preceding statement or a group
of statements which is reflected in the way the comment is
typeset. The comments in \ld\ source files use the
opposite convention. For the sake of consistency, such comments have
been moved so as to make them fit the \CWEB\ style. The comments meant to refer to a
sizable portion of the program (such as a whole function or a group
of functions) are put at the beginning of a \CWEB\ section containing
the appropriate part of the program.

\CWEB\ treats comments as ordinary \TeX\ so the comments are changed
to take advantage of \TeX\ formatting and introduce some visual
cues. The convention of using {\it italics\/} for the original
comments has been reversed: the italicized comments are the ones
introduced by the author, {\it not\/} the original
creators of \ld.%\checktabletrue\saveparseoutputtrue

@i ldgramo.x
@i ldlexo.x
@q Include the list of index section markers; this is a hack to get around @>
@q the lack of control over the generation of \CWEB's index; the correct order @>
@q of index entries depends on the placement of this inclusion @>
@i alphas.hx

@** Index. \checktablefalse\saveparseoutputtrue \global\let\secrangedisplay\empty
This section lists the variable names and (in some cases)
the keywords used inside the `language sections' of the \CWEB\
source. It takes advantage of the built-in facility of \CWEB\ to supply
references for both the definitions (set in {\it
italic\/}) and the uses of each \Cee\ identifier in the text.

Special facilities have been added to extend indexing to
\bison\ grammar terms, \TeX\ control sequences encountered in
\bison\ actions, and file and section names encountered in \ld\
scripts. For a detailed description of the various conventions adhered
to by the index entries the reader is encouraged to consult the
remarks preceding the index of the document describing the core of
the \splint\ suite. We will only mention here that (consistent with
the way \bison\ references are treated) a script example:
$$
\vbox{
\beginldprod
MEMORY
{
  MEMORY1 (xrw) : ORIGIN = 0x20000000, LENGTH = 20K
  MEMORY2 (rx) : ORIGIN = 0x8000000, LENGTH = 128K
}
_var_1 = 0x20005000;
\endprod
}%
$$
\noindent inside the \TeX\ part of a \CWEB\ section will generate several
index entries, as well, mimicking \CWEB's behavior for the
{\it inline \Cee\/} (\.{\yl}$\ldots$\.{\yl}). Such entries are labeled
with $^\circ$, to provide a reminder of their origin.
\makeunindexablex{{\csstring\the}{\csstring\nx}{\csstring\yy}{\csstring\yylexnext}%
                 {\csstring\else}{\csstring\fi}{\csstring\yyBEGIN}{\csstring\next}}
\let\oldMRL\MRL
\def\MRL#1{\smash{\oldMRL{#1}}} % a more sophisticated way to handle it would be to add a \smash whenever we are
                                % in the [index] namespace but this is simpler and works as well