diff --git a/util/newconfig/yapps2.tex b/util/newconfig/yapps2.tex
deleted file mode 100644
index 9d2bddf19c9a..000000000000
--- a/util/newconfig/yapps2.tex
+++ /dev/null
@@ -1,1225 +0,0 @@
-\documentclass[10pt]{article}
-\usepackage{palatino}
-\usepackage{html}
-\usepackage{color}
-
-\setlength{\headsep}{0in}
-\setlength{\headheight}{0in}
-\setlength{\textheight}{8.5in}
-\setlength{\textwidth}{5.9in}
-\setlength{\oddsidemargin}{0.25in}
-
-\definecolor{darkblue}{rgb}{0,0,0.6}
-\definecolor{darkerblue}{rgb}{0,0,0.3}
-
-%% \newcommand{\mysection}[1]{\section{\textcolor{darkblue}{#1}}}
-%% \newcommand{\mysubsection}[1]{\subsection{\textcolor{darkerblue}{#1}}}
-\newcommand{\mysection}[1]{\section{#1}}
-\newcommand{\mysubsection}[1]{\subsection{#1}}
-
-\bodytext{bgcolor=white text=black link=#004080 vlink=#006020}
-
-\newcommand{\first}{\textsc{first}}
-\newcommand{\follow}{\textsc{follow}}
-
-\begin{document}
-
-\begin{center}
-\hfill \begin{tabular}{c}
-{\Large The \emph{Yapps} Parser Generator System}\\
-\verb|http://theory.stanford.edu/~amitp/Yapps/|\\
- Version 2\\
-\\
-Amit J. Patel\\
-\htmladdnormallink{http://www-cs-students.stanford.edu/~amitp/}
-{http://www-cs-students.stanford.edu/~amitp/}
-
-\end{tabular} \hfill \rule{0in}{0in}
-\end{center}
-
-\mysection{Introduction}
-
-\emph{Yapps} (\underline{Y}et \underline{A}nother \underline{P}ython
-\underline{P}arser \underline{S}ystem) is an easy-to-use parser
-generator that is written in Python and generates Python code.  There
-are several parser generator systems already available for Python,
-including \texttt{PyLR, kjParsing, PyBison,} and \texttt{mcf.pars,}
-but I had different goals for my parser.  Yapps is simple, easy to
-use, and produces human-readable parsers.  It is not the fastest or
-most powerful parser generator.  Yapps is designed to be used when
-regular expressions are not enough and other parser systems are too
-much: situations where you might otherwise write your own recursive
-descent parser.
-
-Some unusual features of Yapps that may be of interest are:
-
-\begin{enumerate}
-
- \item Yapps produces recursive descent parsers that are readable by
- humans, as opposed to table-driven parsers that are difficult to
- read. A Yapps parser for a simple calculator looks similar to the
- one that Mark Lutz wrote by hand for \emph{Programming Python.}
-
- \item Yapps also allows for rules that accept parameters and pass
- arguments to be used while parsing subexpressions. Grammars that
- allow for arguments to be passed to subrules and for values to be
- passed back are often called \emph{attribute grammars.} In many
- cases parameterized rules can be used to perform actions at ``parse
- time'' that are usually delayed until later. For example,
- information about variable declarations can be passed into the
- rules that parse a procedure body, so that undefined variables can
- be detected at parse time. The types of defined variables can be
- used in parsing as well---for example, if the type of {\tt X} is
- known, we can determine whether {\tt X(1)} is an array reference or
- a function call.
-
- \item Yapps grammars are fairly easy to write, although there are
- some inconveniences having to do with ELL(1) parsing that have to be
- worked around. For example, rules have to be left factored and
- rules may not be left recursive. However, neither limitation seems
- to be a problem in practice.
-
- Yapps grammars look similar to the notation used in the Python
- reference manual, with operators like \verb:*:, \verb:+:, \verb:|:,
- \verb:[]:, and \verb:(): for patterns, names ({\tt tim}) for rules,
- regular expressions (\verb:"[a-z]+":) for tokens, and \verb:#: for
- comments.
-
- \item The Yapps parser generator is written as a single Python module
- with no C extensions. Yapps produces parsers that are written
- entirely in Python, and require only the Yapps run-time module (5k)
- for support.
-
- \item Yapps's scanner is context-sensitive, picking tokens based on
- the types of the tokens accepted by the parser. This can be
- helpful when implementing certain kinds of parsers, such as for a
- preprocessor.
-
-\end{enumerate}
-
-There are several disadvantages of using Yapps over another parser system:
-
-\begin{enumerate}
-
- \item Yapps parsers are \texttt{ELL(1)} (Extended LL(1)), which is
- less powerful than \texttt{LALR} (used by \texttt{PyLR}) or
- \texttt{SLR} (used by \texttt{kjParsing}), so Yapps would not be a
- good choice for parsing complex languages. For example, allowing
- both \texttt{x := 5;} and \texttt{x;} as statements is difficult
- because we must distinguish based on only one token of lookahead.
- Seeing only \texttt{x}, we cannot decide whether we have an
- assignment statement or an expression statement. (Note however
- that this kind of grammar can be matched with backtracking; see
- section \ref{sec:future}.)
-
- \item The scanner that Yapps provides can only read from strings, not
- files, so an entire file has to be read in before scanning can
-  begin.  It is possible to build a custom scanner, though, so in
-  cases where stream input is needed (for example, from the console, a
-  network connection, or a large file), the Yapps parser can be given
-  a custom scanner that reads from a stream instead of a string.
-
- \item Yapps is not designed with efficiency in mind.
-
-\end{enumerate}
-
-Yapps provides an easy-to-use parser generator that produces parsers
-similar to what you might write by hand. It is not meant to be a
-solution for all parsing problems, but instead an aid for those times
-you would write a parser by hand rather than using one of the more
-powerful parsing packages available.
-
-Yapps 2.0 is easier to use than Yapps 1.0. New features include a
-less restrictive input syntax, which allows mixing of sequences,
-choices, terminals, and nonterminals; optional matching; the ability
-to insert single-line statements into the generated parser; and
-looping constructs \verb|*| and \verb|+| similar to the repetitive
-matching constructs in regular expressions. Unfortunately, the
-addition of these constructs has made Yapps 2.0 incompatible with
-Yapps 1.0, so grammars will have to be rewritten. See section
-\ref{sec:Upgrading} for tips on changing Yapps 1.0 grammars for use
-with Yapps 2.0.
-
-\mysection{Examples}
-
-In this section are several examples that show the use of Yapps.
-First, an introduction shows how to construct grammars and write them
-in Yapps form. This example can be skipped by someone familiar with
-grammars and parsing. Next is a Lisp expression grammar that produces
-a parse tree as output. This example demonstrates the use of tokens
-and rules, as well as returning values from rules. The third example
-is an expression evaluation grammar that evaluates during parsing
-(instead of producing a parse tree).
-
-\mysubsection{Introduction to Grammars}
-
-A \emph{grammar} for a natural language specifies how words can be put
-together to form large structures, such as phrases and sentences. A
-grammar for a computer language is similar in that it specifies how
-small components (called \emph{tokens}) can be put together to form
-larger structures. In this section we will write a grammar for a tiny
-subset of English.
-
-Simple English sentences can be described as being a noun phrase
-followed by a verb followed by a noun phrase. For example, in the
-sentence, ``Jack sank the blue ship,'' the word ``Jack'' is the first
-noun phrase, ``sank'' is the verb, and ``the blue ship'' is the second
-noun phrase. In addition we should say what a noun phrase is; for
-this example we shall say that a noun phrase is an optional article
-(a, an, the) followed by any number of adjectives followed by a noun.
-The tokens in our language are the articles, nouns, verbs, and
-adjectives. The \emph{rules} in our language will tell us how to
-combine the tokens together to form lists of adjectives, noun phrases,
-and sentences:
-
-\begin{itemize}
- \item \texttt{sentence: noun\_phrase verb noun\_phrase}
- \item \texttt{noun\_phrase: [article] adjective* noun}
-\end{itemize}
-
-Notice that some things that we said easily in English, such as
-``optional article,'' are expressed using special syntax, such as
-brackets. When we said, ``any number of adjectives,'' we wrote
-\texttt{adjective*}, where the \texttt{*} means ``zero or more of the
-preceding pattern''.
-
-The grammar given above is close to a Yapps grammar. We also have to
-specify what the tokens are, and what to do when a pattern is matched.
-For this example, we will do nothing when patterns are matched; the
-next example will explain how to perform match actions.
-
-\begin{verbatim}
-parser TinyEnglish:
- ignore: "\\W+"
- token noun: "(Jack|spam|ship)"
- token verb: "(sank|threw)"
- token article: "(a|an|the)"
- token adjective: "(blue|red|green)"
-
- rule sentence: noun_phrase verb noun_phrase
- rule noun_phrase: [article] adjective* noun
-\end{verbatim}
-
-The tokens are specified as Python \emph{regular expressions}. Since
-Yapps produces Python code, you can write any regular expression that
-would be accepted by Python. (\emph{Note:} These are Python 1.5
-regular expressions from the \texttt{re} module, not Python 1.4
-regular expressions from the \texttt{regex} module.) In addition to
-tokens that you want to see (which are given names), you can also
-specify tokens to ignore, marked by the \texttt{ignore} keyword. In
-this parser we want to ignore whitespace.
-
-The TinyEnglish grammar shows how you define tokens and rules, but it
-does not specify what should happen once we've matched the rules. In
-the next example, we will take a grammar and produce a \emph{parse
-tree} from it.
-
-\mysubsection{Lisp Expressions}
-
-Lisp syntax, although hated by many, has a redeeming quality: it is
-simple to parse. In this section we will construct a Yapps grammar to
-parse Lisp expressions and produce a parse tree as output.
-
-\subsubsection*{Defining the Grammar}
-
-The syntax of Lisp is simple. It has expressions, which are
-identifiers, strings, numbers, and lists. A list is a left
-parenthesis followed by some number of expressions (separated by
-spaces) followed by a right parenthesis. For example, \verb|5|,
-\verb|"ni"|, and \verb|(print "1+2 = " (+ 1 2))| are Lisp expressions.
-Written as a grammar,
-
-\begin{verbatim}
- expr: ID | STR | NUM | list
- list: ( expr* )
-\end{verbatim}
-
-In addition to having a grammar, we need to specify what to do every
-time something is matched. For the tokens, which are strings, we just
-want to get the ``value'' of the token, attach its type (identifier,
-string, or number) in some way, and return it. For the lists, we want
-to construct and return a Python list.
-
-Once some pattern is matched, we add a return statement enclosed
-in \verb|{{...}}|.  The braces allow us to insert any one-line
-statement into the parser. Within this statement, we can refer to the
-values returned by matching each part of the rule. After matching a
-token such as \texttt{ID}, ``ID'' will be bound to the text of the
-matched token. Let's take a look at the rule:
-
-\begin{verbatim}
- rule expr: ID {{ return ('id', ID) }}
- ...
-\end{verbatim}
-
-In a rule, tokens return the text that was matched. For identifiers,
-we just return the identifier, along with a ``tag'' telling us that
-this is an identifier and not a string or some other value. Sometimes
-we may need to convert this text to a different form. For example, if
-a string is matched, we want to remove quotes and handle special forms
-like \verb|\n|. If a number is matched, we want to convert it into a
-number. Let's look at the return values for the other tokens:
-
-\begin{verbatim}
- ...
- | STR {{ return ('str', eval(STR)) }}
- | NUM {{ return ('num', atoi(NUM)) }}
- ...
-\end{verbatim}
-
-If we get a string, we want to remove the quotes and process any
-special backslash codes, so we run \texttt{eval} on the quoted string.
-If we get a number, we convert it to an integer with \texttt{atoi} and
-then return the number along with its type tag.
-
-For matching a list, we need to do something slightly more
-complicated. If we match a Lisp list of expressions, we want to
-create a Python list with those values.
-
-\begin{verbatim}
- rule list: "\\(" # Match the opening parenthesis
- {{ result = [] }} # Create a Python list
- (
- expr # When we match an expression,
- {{ result.append(expr) }} # add it to the list
- )* # * means repeat this if needed
- "\\)" # Match the closing parenthesis
- {{ return result }} # Return the Python list
-\end{verbatim}
-
-In this rule we first match the opening parenthesis, then go into a
-loop. In this loop we match expressions and add them to the list.
-When there are no more expressions to match, we match the closing
-parenthesis and return the resulting list.  Note that \verb:#: is used for
-comments, just as in Python.
-
-The complete grammar is specified as follows:
-\begin{verbatim}
-parser Lisp:
- ignore: '\\s+'
- token NUM: '[0-9]+'
- token ID: '[-+*/!@%^&=.a-zA-Z0-9_]+'
- token STR: '"([^\\"]+|\\\\.)*"'
-
- rule expr: ID {{ return ('id', ID) }}
- | STR {{ return ('str', eval(STR)) }}
- | NUM {{ return ('num', atoi(NUM)) }}
- | list {{ return list }}
- rule list: "\\(" {{ result = [] }}
- ( expr {{ result.append(expr) }}
- )*
- "\\)" {{ return result }}
-\end{verbatim}
-
-One thing you may have noticed is that \verb|"\\("| and \verb|"\\)"|
-appear in the \texttt{list} rule. These are \emph{inline tokens}:
-they appear in the rules without being given a name with the
-\texttt{token} keyword. Inline tokens are more convenient to use, but
-since they do not have a name, the text that is matched cannot be used
-in the return value. They are best used for short simple patterns
-(usually punctuation or keywords).
-
-Another thing to notice is that the number and identifier tokens
-overlap. For example, ``487'' matches both NUM and ID. In Yapps, the
-scanner only tries to match tokens that are acceptable to the parser.
-This rule doesn't help here, since both NUM and ID can appear in the
-same place in the grammar. There are two rules used to pick tokens if
-more than one matches. One is that the \emph{longest} match is
-preferred. For example, ``487x'' will match as an ID (487x) rather
-than as a NUM (487) followed by an ID (x). The second rule is that if
-the two matches are the same length, the \emph{first} one listed in
-the grammar is preferred.  For example, ``487'' will match as a NUM
-rather than an ID because NUM is listed first in the grammar.  Inline
-tokens take precedence over any tokens you have listed.  We will
-verify these rules interactively below.
-
-Now that our grammar is defined, we can run Yapps to produce a parser,
-and then run the parser to produce a parse tree.
-
-\subsubsection*{Running Yapps}
-
-In the Yapps module is a function \texttt{generate} that takes an
-input filename and writes a parser to another file. We can use this
-function to generate the Lisp parser, which is assumed to be in
-\texttt{lisp.g}.
-
-\begin{verbatim}
-% python
-Python 1.5.1 (#1, Sep 3 1998, 22:51:17) [GCC 2.7.2.3] on linux-i386
-Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
->>> import yapps
->>> yapps.generate('lisp.g')
-\end{verbatim}
-
-At this point, Yapps has written a file \texttt{lisp.py} that contains
-the parser. In that file are two classes (one scanner and one parser)
-and a function (called \texttt{parse}) that puts things together for
-you.
-
-Alternatively, we can run Yapps from the command line to generate the
-parser file:
-
-\begin{verbatim}
-% python yapps.py lisp.g
-\end{verbatim}
-
-After running Yapps either from within Python or from the command
-line, we can use the Lisp parser by calling the \texttt{parse}
-function. The first parameter should be the rule we want to match,
-and the second parameter should be the string to parse.
-
-\begin{verbatim}
->>> import lisp
->>> lisp.parse('expr', '(+ 3 4)')
-[('id', '+'), ('num', 3), ('num', 4)]
->>> lisp.parse('expr', '(print "3 = " (+ 1 2))')
-[('id', 'print'), ('str', '3 = '), [('id', '+'), ('num', 1), ('num', 2)]]
-\end{verbatim}
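-
-We can also use the parser to check the token-preference rules
-described earlier (the results follow from longest-match and from the
-order in which NUM and ID are declared):
-
-\begin{verbatim}
->>> lisp.parse('expr', '487')
-('num', 487)
->>> lisp.parse('expr', '487x')
-('id', '487x')
-\end{verbatim}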
-
-The \texttt{parse} function is not the only way to use the parser;
-section \ref{sec:Parser-Objects} describes how to access parser objects
-directly.
-
-We've now gone through the steps in creating a grammar, writing a
-grammar file for Yapps, producing a parser, and using the parser. In
-the next example we'll see how rules can take parameters and also how
-to do computations instead of just returning a parse tree.
-
-\mysubsection{Calculator}
-
-A common example parser given in many textbooks is that for simple
-expressions, with numbers, addition, subtraction, multiplication,
-division, and parenthesization of subexpressions. We'll write this
-example in Yapps, evaluating the expression as we parse.
-
-Unlike \texttt{yacc}, Yapps does not have any way to specify
-precedence rules, so we have to do it ourselves. We say that an
-expression is the sum of terms, and that a term is the product of
-factors, and that a factor is a number or a parenthesized expression:
-
-\begin{verbatim}
- expr: factor ( ("+"|"-") factor )*
- factor: term ( ("*"|"/") term )*
- term: NUM | "(" expr ")"
-\end{verbatim}
-
-In order to evaluate the expression as we go, we keep an accumulator
-while evaluating the lists of terms or factors.  Just as
-we kept a ``result'' variable to build a parse tree for Lisp
-expressions, we will use a variable to evaluate numerical
-expressions. The full grammar is given below:
-
-\begin{verbatim}
-parser Calculator:
- token END: "$" # $ means end of string
- token NUM: "[0-9]+"
-
- rule goal: expr END {{ return expr }}
-
- # An expression is the sum and difference of factors
- rule expr: factor {{ v = factor }}
- ( "[+]" factor {{ v = v+factor }}
- | "-" factor {{ v = v-factor }}
- )* {{ return v }}
-
- # A factor is the product and division of terms
- rule factor: term {{ v = term }}
- ( "[*]" term {{ v = v*term }}
- | "/" term {{ v = v/term }}
- )* {{ return v }}
-
- # A term is either a number or an expression surrounded by parentheses
- rule term: NUM {{ return atoi(NUM) }}
- | "\\(" expr "\\)" {{ return expr }}
-\end{verbatim}
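-
-Assuming this grammar is saved as \texttt{calc.g} (and that
-\texttt{from string import atoi} is supplied as pre-parser code, as
-described later), a session might look like this:
-
-\begin{verbatim}
-% python yapps.py calc.g
-% python
->>> import calc
->>> calc.parse('goal', '(1+2)*3+4')
-13
-\end{verbatim}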
-
-The top-level rule is \emph{goal}, which says that we are looking for
-an expression followed by the end of the string. The \texttt{END}
-token is needed because without it, it isn't clear when to stop
-parsing. For example, the string ``1+3'' could be parsed either as
-the expression ``1'' followed by the string ``+3'' or it could be
-parsed as the expression ``1+3''. By requiring expressions to end
-with \texttt{END}, the parser is forced to take ``1+3''.
-
-In the two rules with repetition, the accumulator is named \texttt{v}.
-After matching the first factor or term, we initialize the
-accumulator.  Each time through the loop, we update the accumulator by
-adding, subtracting, multiplying, or dividing it by the factor or term
-that has just been parsed.  At the end of the rule, we return the
-accumulator.
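-
-To make the control flow concrete, here is a small hand-written
-evaluator that mirrors the structure of these rules (a sketch for
-comparison only; the code Yapps actually generates differs in its
-details):
-
-\begin{verbatim}
-import re
-
-def tokenize(text):              # crude stand-in for the scanner
-    return re.findall(r'[0-9]+|[-+*/()]', text)
-
-def expr(toks):                  # expr: factor ( ("+"|"-") factor )*
-    v = factor(toks)             # initialize the accumulator
-    while toks and toks[0] in '+-':
-        op = toks.pop(0)
-        f = factor(toks)
-        if op == '+': v = v + f
-        else: v = v - f
-    return v                     # return the accumulator
-
-def factor(toks):                # factor: term ( ("*"|"/") term )*
-    v = term(toks)
-    while toks and toks[0] in '*/':
-        op = toks.pop(0)
-        t = term(toks)
-        if op == '*': v = v * t
-        else: v = v / t
-    return v
-
-def term(toks):                  # term: NUM | "(" expr ")"
-    if toks[0] == '(':
-        toks.pop(0)              # match "("
-        v = expr(toks)
-        toks.pop(0)              # match ")"
-        return v
-    return int(toks.pop(0))
-
-# expr(tokenize('(1+2)*3+4')) ==> 13
-\end{verbatim}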
-
-The calculator example shows how to process lists of elements using
-loops, as well as how to handle precedence of operators.
-
-\emph{Note:} It is usually important to include the \texttt{END}
-token, so put it in unless you are sure that your grammar has some
-other unambiguous token marking the end of the program.
-
-\mysubsection{Calculator with Memory}
-
-In the previous example we learned how to write a calculator that
-evaluates simple numerical expressions. In this section we will
-extend the example to support both local and global variables.
-
-To support global variables, we will add assignment statements to the
-``goal'' rule.
-
-\begin{verbatim}
- rule goal: expr END {{ return expr }}
- | 'set' ID expr END {{ global_vars[ID] = expr }}
- {{ return expr }}
-\end{verbatim}
-
-To use these variables, we need a new kind of terminal:
-
-\begin{verbatim}
- rule term: ... | ID {{ return global_vars[ID] }}
-\end{verbatim}
-
-So far, these changes are straightforward. We simply have a global
-dictionary \texttt{global\_vars} that stores the variables and values,
-we modify it when there is an assignment statement, and we look up
-variables in it when we see a variable name.
-
-To support local variables, we will add variable declarations to the
-set of allowed expressions.
-
-\begin{verbatim}
- rule term: ... | 'let' VAR '=' expr 'in' expr ...
-\end{verbatim}
-
-This is where it becomes tricky. Local variables should be stored in
-a local dictionary, not in the global one. One trick would be to save
-a copy of the global dictionary, modify it, and then restore it
-later. In this example we will instead use \emph{attributes} to
-create local information and pass it to subrules.
-
-A rule can optionally take parameters. When we invoke the rule, we
-must pass in arguments. For local variables, let's use a single
-parameter, \texttt{local\_vars}:
-
-\begin{verbatim}
- rule expr<<local_vars>>: ...
- rule factor<<local_vars>>: ...
- rule term<<local_vars>>: ...
-\end{verbatim}
-
-Each time we want to match \texttt{expr}, \texttt{factor}, or
-\texttt{term}, we will pass the local variables in the current rule to
-the subrule. One interesting case is when we pass as an argument
-something \emph{other} than \texttt{local\_vars}:
-
-\begin{verbatim}
- rule term<<local_vars>>: ...
- | 'let' VAR '=' expr<<local_vars>>
- {{ local_vars = [(VAR, expr)] + local_vars }}
- 'in' expr<<local_vars>>
- {{ return expr }}
-\end{verbatim}
-
-Note that the assignment to the local variables list does not modify
-the original list. This is important to keep local variables from
-being seen outside the ``let''.
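-
-The distinction is the same as in ordinary Python code: list
-concatenation builds a new list, leaving the caller's list untouched.
-
-\begin{verbatim}
-outer = [('y', 1)]
-inner = [('x', 3)] + outer   # a brand-new list
-assert outer == [('y', 1)]   # the outer bindings are unchanged
-\end{verbatim}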
-
-The other interesting case is when we find a variable:
-
-\begin{verbatim}
-global_vars = {}
-
-def lookup(map, name):
-    for x, v in map:
-        if x == name: return v
-    return global_vars[name]
-%%
- ...
-  rule term<<local_vars>>: ...
- | VAR {{ return lookup(local_vars, VAR) }}
-\end{verbatim}
-
-The lookup function will search through the local variable list, and
-if it cannot find the name there, it will look it up in the global
-variable dictionary.
-
-A complete grammar for this example, including a read-eval-print loop
-for interacting with the calculator, can be found in the examples
-subdirectory included with Yapps.
-
-In this section we saw how to insert code before the parser. We also
-saw how to use attributes to transmit local information from one rule
-to its subrules.
-
-\mysection{Grammars}
-
-Each Yapps grammar has a name, a list of tokens, and a set of
-production rules. A grammar named \texttt{X} will be used to produce
-a parser named \texttt{X} and a scanner named \texttt{XScanner}.  As
-in Python, names are case sensitive, start with a letter, and contain
-letters, numbers, and underscores (\_).
-
-There are three kinds of tokens in Yapps: named, inline, and ignored.
-As their name implies, named tokens are given a name, using the token
-construct: \texttt{token \emph{name} : \emph{regexp}}. In a rule, the
-token can be matched by using the name. Inline tokens are regular
-expressions that are used in rules without being declared. Ignored
-tokens are declared using the ignore construct: \texttt{ignore:
- \emph{regexp}}. These tokens are ignored by the scanner, and are
-not seen by the parser. Often whitespace is an ignored token. The
-regular expressions used to define tokens should use the syntax
-defined in the \texttt{re} module, so some symbols may have to be
-backslashed.
-
-Production rules in Yapps have a name and a pattern to match. If the
-rule is parameterized, the name should be followed by a list of
-parameter names in \verb|<<...>>|. A pattern can be a simple pattern
-or a compound pattern. Simple patterns are the name of a named token,
-a regular expression in quotes (inline token), the name of a
-production rule (followed by arguments in \verb|<<...>>|, if the rule
-has parameters), and single-line Python statements (\verb|{{...}}|).
-Compound patterns are sequences (\verb|A B C ...|), choices (
-\verb:A | B | C | ...:), options (\verb|[...]|), zero-or-more repetitions
-(\verb|...*|), and one-or-more repetitions (\verb|...+|). Like
-regular expressions, repetition operators have a higher precedence
-than sequences, and sequences have a higher precedence than choices.
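-
-For example, the two patterns below are equivalent; the parentheses on
-the right make the implicit grouping explicit:
-
-\begin{verbatim}
-  A B* | C D        is grouped as        ( A ( B )* ) | ( C D )
-\end{verbatim}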
-
-Whenever \verb|{{...}}| is used, a legal one-line Python statement
-should be put inside the braces. The token \verb|}}| should not
-appear within the \verb|{{...}}| section, even within a string, since
-Yapps does not attempt to parse the Python statement. A workaround
-for strings is to put two strings together (\verb|"}" "}"|), or to use
-backslashes (\verb|"}\}"|). At the end of a rule you should use a
-\verb|{{ return X }}| statement to return a value. However, you
-should \emph{not} use any control statements (\texttt{return},
-\texttt{continue}, \texttt{break}) in the middle of a rule. Yapps
-needs to make assumptions about the control flow to generate a parser,
-and any changes to the control flow will confuse Yapps.
-
-The \verb|<<...>>| form can occur in two places: to define parameters
-to a rule and to give arguments when matching a rule. Parameters use
-the syntax used for Python functions, so they can include default
-arguments and the special forms (\verb|*args| and \verb|**kwargs|).
-Arguments use the syntax for Python function call arguments, so they
-can include normal arguments and keyword arguments. The token
-\verb|>>| should not appear within the \verb|<<...>>| section.
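-
-For instance, a rule can declare a parameter with a default value and
-pass arguments by keyword when invoking a subrule (an illustrative
-sketch, not from a real grammar):
-
-\begin{verbatim}
-  rule expr<<local_vars, depth=0>>:
-        ... term<<local_vars, depth=depth+1>> ...
-\end{verbatim}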
-
-In both the statements and rule arguments, you can use names defined
-by the parser to refer to matched patterns. You can refer to the text
-matched by a named token by using the token name. You can use the
-value returned by a production rule by using the name of that rule.
-If a name \texttt{X} is matched more than once (such as in loops), you
-will have to save the earlier value(s) in a temporary variable, and
-then use that temporary variable in the return value. The next
-section has an example of a name that occurs more than once.
-
-\mysubsection{Left Factoring}
-\label{sec:Left-Factoring}
-
-Yapps produces ELL(1) parsers, which determine which clause to match
-based on the first token available. Sometimes the leftmost tokens of
-several clauses may be the same. The classic example is the
-\emph{if/then/else} construct in Pascal:
-
-\begin{verbatim}
-rule stmt: "if" expr "then" stmt {{ then_part = stmt }}
- "else" stmt {{ return ('If',expr,then_part,stmt) }}
- | "if" expr "then" stmt {{ return ('If',expr,stmt,[]) }}
-\end{verbatim}
-
-(Note that we have to save the first \texttt{stmt} into a variable
-because there is another \texttt{stmt} that will be matched.) The
-left portions of the two clauses are the same, which presents a
-problem for the parser. The solution is \emph{left-factoring}: the
-common parts are put together, and \emph{then} a choice is made about
-the remaining part:
-
-\begin{verbatim}
-rule stmt: "if" expr
- "then" stmt {{ then_part = stmt }}
- {{ else_part = [] }}
- [ "else" stmt {{ else_part = stmt }} ]
- {{ return ('If', expr, then_part, else_part) }}
-\end{verbatim}
-
-Unfortunately, the classic \emph{if/then/else} situation is
-\emph{still} ambiguous when you left-factor. Yapps can deal with this
-situation, but will report a warning; see section
-\ref{sec:Ambiguous-Grammars} for details.
-
-In general, replace rules of the form:
-
-\begin{verbatim}
-rule A: a b1 {{ return E1 }}
- | a b2 {{ return E2 }}
- | c3 {{ return E3 }}
- | c4 {{ return E4 }}
-\end{verbatim}
-
-with rules of the form:
-
-\begin{verbatim}
-rule A: a ( b1 {{ return E1 }}
- | b2 {{ return E2 }}
- )
- | c3 {{ return E3 }}
- | c4 {{ return E4 }}
-\end{verbatim}
-
-\mysubsection{Left Recursion}
-
-A common construct in grammars is matching a list of patterns,
-sometimes separated with delimiters such as commas or semicolons. In
-LR-based parser systems, we can parse a list with something like this:
-
-\begin{verbatim}
-rule sum: NUM {{ return NUM }}
- | sum "+" NUM {{ return (sum, NUM) }}
-\end{verbatim}
-
-Parsing \texttt{1+2+3+4} would produce the output
-\texttt{(((1,2),3),4)}, which is what we want from a left-associative
-addition operator. Unfortunately, this grammar is \emph{left
-recursive,} because the \texttt{sum} rule contains a clause that
-begins with \texttt{sum}. (The recursion occurs at the left side of
-the clause.)
-
-We must restructure this grammar to be \emph{right recursive} instead:
-
-\begin{verbatim}
-rule sum: NUM {{ return NUM }}
- | NUM "+" sum {{ return (NUM, sum) }}
-\end{verbatim}
-
-Unfortunately, using this grammar, \texttt{1+2+3+4} would be parsed as
-\texttt{(1,(2,(3,4)))}, which no longer follows left associativity.
-The rule also needs to be left-factored.  Instead, we write the
-pattern as a loop:
-
-\begin{verbatim}
-rule sum: NUM {{ v = NUM }}
- ( "[+]" NUM {{ v = (v,NUM) }} )*
- {{ return v }}
-\end{verbatim}
-
-In general, replace rules of the form:
-
-\begin{verbatim}
-rule A: A a1 -> << E1 >>
- | A a2 -> << E2 >>
- | b3 -> << E3 >>
- | b4 -> << E4 >>
-\end{verbatim}
-
-with rules of the form:
-
-\begin{verbatim}
-rule A: ( b3 {{ A = E3 }}
- | b4 {{ A = E4 }} )
- ( a1 {{ A = E1 }}
- | a2 {{ A = E2 }} )*
- {{ return A }}
-\end{verbatim}
-
-We have taken a rule that was problematic with recursion and turned
-it into a rule that works well with looping constructs.  Parsing
-\texttt{1+2+3+4} with the loop version again produces the
-left-associative grouping \texttt{(((1,2),3),4)}.
-
-\mysubsection{Ambiguous Grammars}
-\label{sec:Ambiguous-Grammars}
-
-In section \ref{sec:Left-Factoring} we saw the classic if/then/else
-ambiguity, which occurs because the ``else \ldots'' portion of an ``if
-\ldots then \ldots else \ldots'' construct is optional. Programs with
-nested if/then/else constructs can be ambiguous when one of the else
-clauses is missing:
-\begin{verbatim}
-if 1 then if 1 then
- if 5 then if 5 then
- x := 1; x := 1;
- else else
- y := 9; y := 9;
-\end{verbatim}
-
-The indentation shows that the program can be parsed in two different
-ways. (Of course, if we all would adopt Python's indentation-based
-structuring, this would never happen!) Usually we want the parsing on
-the left: the ``else'' should be associated with the closest ``if''
-statement. In section \ref{sec:Left-Factoring} we ``solved'' the
-problem by using the following grammar:
-
-\begin{verbatim}
-rule stmt: "if" expr
- "then" stmt {{ then_part = stmt }}
- {{ else_part = [] }}
- [ "else" stmt {{ else_part = stmt }} ]
- {{ return ('If', expr, then_part, else_part) }}
-\end{verbatim}
-
-Here, we have an optional match of ``else'' followed by a statement.
-The ambiguity is that if an ``else'' is present, it is not clear
-whether you want it parsed immediately or if you want it to be parsed
-by the outer ``if''.
-
-Yapps deals with this situation by matching the else pattern whenever
-it can.  The parser works in this case because it prefers the
-\emph{first} matching clause, which tells Yapps to parse the ``else''.
-That is exactly what we want!
-
-For ambiguity cases with choices, Yapps will choose the \emph{first}
-matching choice. However, remember that Yapps only looks at the first
-token to determine its decision, so {\tt (a b | a c)} will result in
-Yapps choosing {\tt a b} even when the input is {\tt a c}.  It only
-looks at the first token, {\tt a}, to make its decision.  The remedy
-is to left-factor the choice into {\tt a (b | c)}, as described in
-section \ref{sec:Left-Factoring}.
-
-\mysection{Customization}
-
-Both the parsers and the scanners can be customized. The parser is
-usually extended by subclassing, and the scanner can either be
-subclassed or completely replaced.
-
-\mysubsection{Customizing Parsers}
-
-If additional fields and methods are needed in order for a parser to
-work, Python subclassing can be used. (This is unlike parser classes
-written in static languages, in which these fields and methods must be
-defined in the generated parser class.) We simply subclass the
-generated parser, and add any fields or methods required. Expressions
-in the grammar can call methods of the subclass to perform any actions
-that cannot be expressed as a simple expression. For example,
-consider this simple grammar:
-
-\begin{verbatim}
-parser X:
- rule goal: "something" {{ self.printmsg() }}
-\end{verbatim}
-
-The \texttt{printmsg} function need not be implemented in the parser
-class \texttt{X}; it can be implemented in a subclass:
-
-\begin{verbatim}
-import Xparser
-
-class MyX(Xparser.X):
- def printmsg(self):
- print "Hello!"
-\end{verbatim}
-
-\mysubsection{Customizing Scanners}
-
-The generated parser class is not dependent on the generated scanner
-class. A scanner object is passed to the parser object's constructor
-in the \texttt{parse} function. To use a different scanner, write
-your own function to construct parser objects, with an instance of a
-different scanner. Scanner objects must have a \texttt{token} method
-that accepts an integer \texttt{N} as well as a list of allowed token
-types, and returns the Nth token, as a tuple. The default scanner
-raises \texttt{NoMoreTokens} if no tokens are available, and
-\texttt{SyntaxError} if no token could be matched. However, the
-parser does not rely on these exceptions; only the \texttt{parse}
-convenience function (which calls \texttt{wrap\_error\_reporter}) and
-the \texttt{print\_error} error display function use those exceptions.
-
-The tuples representing tokens have four elements. The first two are
-the beginning and ending indices of the matched text in the input
-string. The third element is the type tag, matching either the name
-of a named token or the quoted regexp of an inline or ignored token.
-The fourth element of the token tuple is the matched text. If the
-input string is \texttt{s}, and the token tuple is
-\texttt{(b,e,type,val)}, then \texttt{val} should be equal to
-\texttt{s[b:e]}.
-
-The generated parsers do not use the beginning or ending indices.  They use
-only the token type and value. However, the default error reporter
-uses the beginning and ending index to show the user where the error
-is.
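-
-As an illustration, here is a minimal scanner that serves tokens from
-a pre-tokenized list (a sketch only; the parameter name
-\texttt{restrict} is illustrative, and the default scanner in
-\texttt{yappsrt.py} remains the reference for the exact interface):
-
-\begin{verbatim}
-class ListScanner:
-    # Serve tokens from a list of (type, text) pairs.
-    def __init__(self, pairs):
-        self.tokens = []
-        pos = 0
-        for type, text in pairs:
-            self.tokens.append((pos, pos + len(text), type, text))
-            pos = pos + len(text)
-    def token(self, i, restrict=None):
-        # 'restrict' lists the token types acceptable to the parser
-        # at this point; a pre-tokenized scanner is free to ignore it.
-        return self.tokens[i]
-\end{verbatim}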
-
-\mysection{Parser Mechanics}
-
-The base parser class (Parser) defines two methods, \texttt{\_scan}
-and \texttt{\_peek}, and two fields, \texttt{\_pos} and
-\texttt{\_scanner}. The generated parser inherits from the base
-parser, and contains one method for each rule in the grammar. To
-avoid name clashes, do not use names that begin with an underscore
-(\texttt{\_}).
-
-\mysubsection{Parser Objects}
-\label{sec:Parser-Objects}
-
-Yapps produces as output two exception classes, a scanner class, a
-parser class, and a function \texttt{parse} that puts everything
-together. The \texttt{parse} function does not have to be used;
-instead, one can create a parser and scanner object and use them
-together for parsing.
-
-\begin{verbatim}
- def parse(rule, text):
- P = X(XScanner(text))
- return wrap_error_reporter(P, rule)
-\end{verbatim}
-
-The \texttt{parse} function takes the name of a rule and an input
-string.  It creates a scanner and parser object, then calls
-\texttt{wrap\_error\_reporter} to execute the method in the parser
-object named \texttt{rule}. The wrapper function will call the
-appropriate parser rule and report any parsing errors to standard
-output.
-
-There are several situations in which the \texttt{parse} function
-would not be useful. If a different parser or scanner is being used,
-or exceptions are to be handled differently, a new \texttt{parse}
-function would be required. The supplied \texttt{parse} function can
-be used as a template for writing a function for your own needs. An
-example of a custom parse function is the \texttt{generate} function
-in \texttt{Yapps.py}.
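-
-For example, a variant that returns \texttt{None} instead of printing
-an error message might look like this (a sketch; it assumes only the
-generated \texttt{X} and \texttt{XScanner} classes and the scanner
-behavior described earlier):
-
-\begin{verbatim}
-def quiet_parse(rule, text):
-    P = X(XScanner(text))
-    try:
-        return getattr(P, rule)()  # call the parser method named by rule
-    except SyntaxError:
-        return None                # the scanner could not match a token
-\end{verbatim}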
-
-\mysubsection{Context Sensitive Scanner}
-
-Unlike most scanners, the scanner produced by Yapps can take into
-account the context in which tokens are needed, and try to match only
-good tokens. For example, in the grammar:
-
-\begin{verbatim}
-parser IniFile:
- token ID: "[a-zA-Z_0-9]+"
- token VAL: ".*"
-
- rule pair: ID "[ \t]*=[ \t]*" VAL "\n"
-\end{verbatim}
-
-we would like to scan lines of text and pick out a name/value pair.
-In a conventional scanner, the input string \texttt{shell=progman.exe}
-would be turned into a single token of type \texttt{VAL}. The Yapps
-scanner, however, knows that at the beginning of the line, an
-\texttt{ID} is expected, so it will return \texttt{"shell"} as a token
-of type \texttt{ID}. Later, it will return \texttt{"progman.exe"} as
-a token of type \texttt{VAL}.
-
-Context sensitivity decreases the separation between scanner and
-parser, but it is useful in parsers like \texttt{IniFile}, where the
-tokens themselves are ambiguous in isolation, but \emph{are}
-unambiguous given a particular stage in the parsing process.
-
-Unfortunately, context sensitivity can make it more difficult to
-detect errors in the input. For example, in parsing a Pascal-like
-language with ``begin'' and ``end'' as keywords, a context sensitive
-scanner would only match ``end'' as the END token if the parser is in
-a place that will accept the END token. If not, then the scanner
-would match ``end'' as an identifier. To disable the context
-sensitive scanner in Yapps, add the
-\texttt{context-insensitive-scanner} option to the grammar:
-
-\begin{verbatim}
-parser X:
- option: "context-insensitive-scanner"
-\end{verbatim}
-
-Context-insensitive scanning makes the parser look cleaner as well.
-
-\mysubsection{Internal Variables}
-
-The parser object has two internal fields that may be of use:
-\texttt{\_pos}, the index of the token currently being matched, and
-\texttt{\_scanner}, the scanner object.  The token itself can be
-retrieved by accessing the scanner object and calling the
-\texttt{token} method with the token index.  However, if you call
-\texttt{token} before the token has been requested by the parser, it
-may confuse a context-sensitive scanner.\footnote{When using a
-context-sensitive scanner, the parser tells the scanner which token
-types are valid at each point.  If you call \texttt{token} before the
-parser can tell the scanner the valid token types, the scanner will
-attempt to match without considering the context.}  A potentially
-useful combination of these fields is to extract the portion of the
-input matched by the current rule.  To do this, save the scanner
-position (\texttt{self.\_scanner.pos}) before the text is matched and
-then again after it is matched:
-
-\begin{verbatim}
- rule R:
- {{ start = self._scanner.pos }}
- a b c
- {{ end = self._scanner.pos }}
- {{ print 'Text is', self._scanner.input[start:end] }}
-\end{verbatim}
-
-\mysubsection{Pre- and Post-Parser Code}
-
-Sometimes the parser code needs to rely on helper variables,
-functions, and classes. A Yapps grammar can optionally be surrounded
-by double percent signs, to separate the grammar from Python code.
-
-\begin{verbatim}
-... Python code ...
-%%
-... Yapps grammar ...
-%%
-... Python code ...
-\end{verbatim}
-
-The second \verb|%%| can be omitted if there is no Python code at the
-end, and the first \verb|%%| can be omitted if there is no extra
-Python code at all. (To have code only at the end, both separators
-are required.)
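-
-For example, the Calculator grammar uses \texttt{atoi} in its
-\verb|{{...}}| actions; the import can be supplied as pre-parser code:
-
-\begin{verbatim}
-from string import atoi   # used by the grammar actions below
-
-%%
-
-parser Calculator:
-    ...
-\end{verbatim}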
-
-If the second \verb|%%| is omitted, Yapps will insert testing code
-that allows you to use the generated parser to parse a file.
-
-The extended calculator example in the Yapps examples subdirectory
-includes both pre-parser and post-parser code.
-
-\mysubsection{Representation of Grammars}
-
-For each kind of pattern there is a class derived from Pattern. Yapps
-has classes for Terminal, NonTerminal, Sequence, Choice, Option, Plus,
-Star, and Eval. Each of these classes has the following interface:
-
-\begin{itemize}
- \item[setup(\emph{gen})] Set accepts-$\epsilon$, and call
- \emph{gen.changed()} if it changed. This function can change the
- flag from false to true but \emph{not} from true to false.
-  \item[update(\emph{gen})] Set \first{} and \follow, and call
- \emph{gen.changed()} if either changed. This function can add to
- the sets but \emph{not} remove from them.
- \item[output(\emph{gen}, \emph{indent})] Generate code for matching
- this rule, using \emph{indent} as the current indentation level.
- Writes are performed using \emph{gen.write}.
- \item[used(\emph{vars})] Given a list of variables \emph{vars},
- return two lists: one containing the variables that are used, and
- one containing the variables that are assigned. This function is
- used for optimizing the resulting code.
-\end{itemize}
-
-Both \emph{setup} and \emph{update} monotonically increase the
-variables they modify.  Since the variables can only increase a finite
-number of times, we can repeatedly call these functions until the
-variables stabilize.  The \emph{used} function is not currently
-implemented.
-
-With each pattern in the grammar Yapps associates three pieces of
-information: the \first{} set, the \follow{} set, and the
-accepts-$\epsilon$ flag.
-
-The \first{} set contains the tokens that can appear as we start
-matching the pattern.  The \follow{} set contains the tokens that can
-appear immediately after we match the pattern.  The accepts-$\epsilon$
-flag is true if the pattern can match no tokens.  In this case,
-\first{} will contain all the elements in \follow.  The \follow{} set
-is not needed when accepts-$\epsilon$ is false, and may not be
-accurate in those cases.
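-
-As a concrete example, consider the \texttt{list} rule from the Lisp
-grammar.  The subpattern \verb|( expr )*| can match zero expressions,
-so it accepts $\epsilon$; its \follow{} set holds the closing
-parenthesis, and so its \first{} set contains the \first{} tokens of
-\texttt{expr} plus the closing parenthesis:
-
-\begin{verbatim}
-  pattern      accepts-eps   FIRST                    FOLLOW
-  expr         no            ID STR NUM "\\("         (not needed)
-  ( expr )*    yes           ID STR NUM "\\(" "\\)"   "\\)"
-\end{verbatim}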
-
-Yapps does not compute these sets precisely. Its approximation can
-miss certain cases, such as this one:
-
-\begin{verbatim}
- rule C: ( A* | B )
- rule B: C [A]
-\end{verbatim}
-
-Yapps will calculate {\tt C}'s \follow{} set to include {\tt A}.
-However, {\tt C} will always match all the {\tt A}'s, so {\tt A} will
-never follow it. Yapps 2.0 does not properly handle this construct,
-but if it seems important, I may add support for it in a future
-version.
-
-Yapps also cannot handle constructs that depend on the calling
-sequence. For example:
-
-\begin{verbatim}
- rule R: U | 'b'
- rule S: | 'c'
- rule T: S 'b'
- rule U: S 'a'
-\end{verbatim}
-
-The \follow{} set for {\tt S} includes {\tt a} and {\tt b}.  Since {\tt
- S} can be empty, the \first{} set for {\tt S} should include {\tt a},
-{\tt b}, and {\tt c}. However, when parsing {\tt R}, if the lookahead
-is {\tt b} we should \emph{not} parse {\tt U}. That's because in {\tt
- U}, {\tt S} is followed by {\tt a} and not {\tt b}. Therefore in
-{\tt R}, we should choose rule {\tt U} only if there is an {\tt a} or
-{\tt c}, but not if there is a {\tt b}. Yapps and many other LL(1)
-systems do not distinguish {\tt S b} and {\tt S a}, making {\tt
- S}'s \follow{} set {\tt a, b}, and making {\tt R} always try to match
-{\tt U}. In this case we can solve the problem by changing {\tt R} to
-\verb:'b' | U: but it may not always be possible to solve all such
-problems in this way.
-
-\appendix
-
-\mysection{Grammar for Parsers}
-
-This is the grammar for parsers, without any Python code mixed in.
-The complete grammar can be found in \texttt{parsedesc.g} in the Yapps
-distribution.
-
-\begin{verbatim}
-parser ParserDescription:
- ignore: "\\s+"
- ignore: "#.*?\r?\n"
- token END: "$" # $ means end of string
- token ATTR: "<<.+?>>"
- token STMT: "{{.+?}}"
- token ID: '[a-zA-Z_][a-zA-Z_0-9]*'
- token STR: '[rR]?\'([^\\n\'\\\\]|\\\\.)*\'|[rR]?"([^\\n"\\\\]|\\\\.)*"'
-
- rule Parser: "parser" ID ":"
- Options
- Tokens
- Rules
- END
-
- rule Options: ( "option" ":" STR )*
- rule Tokens: ( "token" ID ":" STR | "ignore" ":" STR )*
- rule Rules: ( "rule" ID OptParam ":" ClauseA )*
-
- rule ClauseA: ClauseB ( '[|]' ClauseB )*
- rule ClauseB: ClauseC*
- rule ClauseC: ClauseD [ '[+]' | '[*]' ]
- rule ClauseD: STR | ID [ATTR] | STMT
-              | '\\(' ClauseA '\\)' | '\\[' ClauseA '\\]'
-\end{verbatim}
-
-\mysection{Upgrading}
-
-Yapps 2.0 is not backwards compatible with Yapps 1.0. In this section
-are some tips for upgrading:
-
-\begin{enumerate}
- \item Yapps 1.0 was distributed as a single file. Yapps 2.0 is
- instead distributed as two Python files: a \emph{parser generator}
- (26k) and a \emph{parser runtime} (5k). You need both files to
- create parsers, but you need only the runtime (\texttt{yappsrt.py})
- to use the parsers.
-
- \item Yapps 1.0 supported Python 1.4 regular expressions from the
- \texttt{regex} module. Yapps 2.0 uses Python 1.5 regular
- expressions from the \texttt{re} module. \emph{The new syntax for
- regular expressions is not compatible with the old syntax.}
- Andrew Kuchling has a \htmladdnormallink{guide to converting
- regular
- expressions}{http://www.python.org/doc/howto/regex-to-re/} on his
- web page.
-
- \item Yapps 1.0 wants a pattern and then a return value in \verb|->|
- \verb|<<...>>|. Yapps 2.0 allows patterns and Python statements to
- be mixed. To convert a rule like this:
-
-\begin{verbatim}
-rule R: A B C -> << E1 >>
- | X Y Z -> << E2 >>
-\end{verbatim}
-
- to Yapps 2.0 form, replace the return value specifiers with return
- statements:
-
-\begin{verbatim}
-rule R: A B C {{ return E1 }}
- | X Y Z {{ return E2 }}
-\end{verbatim}
-
- \item Yapps 2.0 does not perform tail recursion elimination. This
- means any recursive rules you write will be turned into recursive
- methods in the parser. The parser will work, but may be slower.
- It can be made faster by rewriting recursive rules, using instead
- the looping operators \verb|*| and \verb|+| provided in Yapps 2.0.
-
-\end{enumerate}
-
-\mysection{Troubleshooting}
-
-\begin{itemize}
- \item A common error is to write a grammar that doesn't have an END
- token. End tokens are needed when it is not clear when to stop
- parsing. For example, when parsing the expression {\tt 3+5}, it is
- not clear after reading {\tt 3} whether to treat it as a complete
- expression or whether the parser should continue reading.
- Therefore the grammar for numeric expressions should include an end
- token. Another example is the grammar for Lisp expressions. In
- Lisp, it is always clear when you should stop parsing, so you do
- \emph{not} need an end token. In fact, it may be more useful not
- to have an end token, so that you can read in several Lisp expressions.
- \item If there is a chance of ambiguity, make sure to put the choices
- in the order you want them checked. Usually the most specific
- choice should be first. Empty sequences should usually be last.
-  \item The context sensitive scanner is not appropriate for all
-    grammars.  You might try using the context-insensitive scanner,
-    enabled with the {\tt context-insensitive-scanner} option in the
-    grammar.
- \item If performance turns out to be a problem, try writing a custom
- scanner. The Yapps scanner is rather slow (but flexible and easy
- to understand).
-\end{itemize}
-
-\mysection{History}
-
-Yapps 1 had several limitations that bothered me while writing
-parsers:
-
-\begin{enumerate}
- \item It was not possible to insert statements into the generated
-    parser.  A common workaround was to write an auxiliary function
- that executed those statements, and to call that function as part
- of the return value calculation. For example, several of my
- parsers had an ``append(x,y)'' function that existed solely to call
- ``x.append(y)''.
- \item The way in which grammars were specified was rather
- restrictive: a rule was a choice of clauses. Each clause was a
- sequence of tokens and rule names, followed by a return value.
- \item Optional matching had to be put into a separate rule because
- choices were only made at the beginning of a rule.
- \item Repetition had to be specified in terms of recursion. Not only
- was this awkward (sometimes requiring additional rules), I had to
- add a tail recursion optimization to Yapps to transform the
- recursion back into a loop.
-\end{enumerate}
-
-Yapps 2 addresses each of these limitations.
-
-\begin{enumerate}
- \item Statements can occur anywhere within a rule. (However, only
- one-line statements are allowed; multiline blocks marked by
- indentation are not.)
- \item Grammars can be specified using any mix of sequences, choices,
- tokens, and rule names. To allow for complex structures,
- parentheses can be used for grouping.
- \item Given choices and parenthesization, optional matching can be
- expressed as a choice between some pattern and nothing. In
- addition, Yapps 2 has the convenience syntax \verb|[A B ...]| for
- matching \verb|A B ...| optionally.
- \item Repetition operators \verb|*| for zero or more and \verb|+| for
- one or more make it easy to specify repeating patterns.
-\end{enumerate}
-
-It is my hope that Yapps 2 will be flexible enough to meet my needs
-for another year, yet simple enough that I do not hesitate to use it.
-
-\mysection{Future Extensions}
-\label{sec:future}
-
-I am still investigating the possibility of LL(2) and higher
-lookahead. However, it looks like the resulting parsers will be
-somewhat ugly.
-
-It would be nice to control choices with user-defined predicates.
-
-The most likely future extension is backtracking. A grammar pattern
-like \verb|(VAR ':=' expr)? {{ return Assign(VAR,expr) }} : expr {{ return expr }}|
-would turn into code that attempted to match \verb|VAR ':=' expr|. If
-it succeeded, it would run \verb|{{ return ... }}|. If it failed, it
-would match \verb|expr {{ return expr }}|. Backtracking may make it
-less necessary to write LL(2) grammars.
-
-\mysection{References}
-
-\begin{enumerate}
- \item The \htmladdnormallink{Python-Parser
- SIG}{http://www.python.org/sigs/parser-sig/} is the first place
- to look for a list of parser systems for Python.
-
- \item ANTLR/PCCTS, by Terrence Parr, is available at
- \htmladdnormallink{The ANTLR Home Page}{http://www.antlr.org/}.
-
- \item PyLR, by Scott Cotton, is at \htmladdnormallink{his Starship
- page}{http://starship.skyport.net/crew/scott/PyLR.html}.
-
- \item John Aycock's \htmladdnormallink{Compiling Little Languages
- Framework}{http://www.csr.UVic.CA/~aycock/python/}.
-
- \item PyBison, by Scott Hassan, can be found at
- \htmladdnormallink{his Python Projects
- page}{http://coho.stanford.edu/\~{}hassan/Python/}.
-
- \item mcf.pars, by Mike C. Fletcher, is available at
- \htmladdnormallink{his web
- page}{http://www.golden.net/\~{}mcfletch/programming/}.
-
- \item kwParsing, by Aaron Watters, is available at
- \htmladdnormallink{his Starship
- page}{http://starship.skyport.net/crew/aaron_watters/kwParsing/}.
-\end{enumerate}
-
-\end{document}