RunCC - A Java Runtime Compiler Compiler
RunCC is a new kind of parsergenerator that generates parsers
and lexers at runtime. Source generation is only optional. It features
simplicity, a standalone useable UNICODE enabled lexer, and a simple EBNF
dialect for specifying syntaxes in a File. Although intended for small languages,
it comes with Java and XML example parsers.
This package is open source, published under Lesser GNU Public License
(LGPL).
This document applies to runcc 0.6. Old documentation is contained in old
download archives.
Quick Start
A parser generator creates a parser from syntax rules. Rules are
used to describe the inner structure of a text to analyze. When analyzed
by the generated parser, the parts of the text can be processed by a semantic
implementation. In short: a parser generator is a tool to create analyzers.
RunCC is written in Java. No additional libraries (except JRE)
are required. It needs Java 2 collections (does not run on JDK 1.1 or
older).
RunCC provides a bottom-up (LR) parser that can be SLR, LR or LALR.
It provides a lexer that actually is a top-down (LL) parser and can read
character (UNICODE) as well as byte input.
RunCC features simplicity and the absence of any cryptography except
an EBNF dialect.
RunCC serializes built parsers and lexers to speed up the building
process next time they are needed, so
RunCC provides source generation only as option. It can generate
Java code for ParserTables, Syntax, and Semantic
skeletons.
RunCC defines a clear separation of responsibilities: Syntax/Rule,
Lexer, Parser, ParserTables, Semantic. No source fragments are possible
within the syntax notation. Responsibilities are:
- The Syntax holding the specified language
rules for lexer and parser,
- the builders that serializes lexers and parsers after
build to bridge the building process next time the parser is needed,
- source generation utilities,
- the universal bottom-up Parser algorithm,
loaded with
- ParserTables that hold the syntax processing
tables ("Goto", "Parse-Action"),
- the Lexer that scans tokens from input
to feed the parser
- and (last but not least) the Semantic
that processes the parsing results
The syntax is written either embedded in Java as String arrays
(representing rules), or within a separate file, using an EBNF dialect.
No source fragments are possible within the syntax notation. This
is done by Semantic implementations.
To understand those principles look at the following (classical)
Calculator example, showing the elegance of ReflectSemantic.
Try it by typing
java -jar runcc.jar '(4+2.3) *(2 - -6) + 3*2'
or
java -cp runcc.jar fri.patterns.interpreter.parsergenerator.examples.Calculator
'(4+2.3) *(2 - -6) + 3*2'
- i hope it will be 56.4 !
public class
Calculator extends ReflectSemantic
{
private static String [][] rules
= {
{ "EXPRESSION",
"TERM" },
{ "EXPRESSION",
"EXPRESSION", "'+'", "TERM" },
{ "EXPRESSION",
"EXPRESSION", "'-'", "TERM" },
{ "TERM",
"FACTOR", },
{ "TERM",
"TERM", "'*'", "FACTOR" },
{ "TERM",
"TERM", "'/'", "FACTOR" },
{ "FACTOR",
"`number`", },
{ "FACTOR",
"'-'", "FACTOR" },
{ "FACTOR",
"'('", "EXPRESSION", "')'" },
{ Token.IGNORED,
"`whitespaces`" },
};
public Object EXPRESSION(Object TERM)
{
return TERM;
}
public Object EXPRESSION(Object EXPRESSION,
Object operator, Object TERM) {
if (operator.equals("+"))
return new Double(((Double) EXPRESSION).doubleValue() + ((Double) TERM).doubleValue());
return new Double(((Double)
EXPRESSION).doubleValue() - ((Double) TERM).doubleValue());
}
public Object TERM(Object FACTOR)
{
return FACTOR;
}
public Object TERM(Object TERM, Object
operator, Object FACTOR) {
if (operator.equals("*"))
return new Double(((Double) TERM).doubleValue() * ((Double) FACTOR).doubleValue());
return new Double(((Double)
TERM).doubleValue() / ((Double) FACTOR).doubleValue());
}
public Object FACTOR(Object number)
{
return Double.valueOf((String)
number);
}
public Object FACTOR(Object minus,
Object FACTOR) {
return new Double(
- ((Double) FACTOR).doubleValue() );
}
public Object FACTOR(Object leftParenthesis,
Object EXPRESSION, Object rightParenthesis) {
return EXPRESSION;
}
public static void main(String
[] args) throws Exception {
String input =
args[0]; // define some arithmetic expressions on commandline
Parser parser
= new SerializedParser().get(rules, "Calculator");
// allocates a default lexer
boolean ok = parser.parse(input,
new Calculator());
System.err.println("Parse
return "+ok+", result: "+parser.getResult());
}
}
You can find this sample in package fri/patterns/interpreter/parsergenerator/examples/Calculator.java.
Implementation Steps
Following are the steps to achieve your language implementation
with RunCC.
- Define
the target language syntax.
- Decide if the syntax is complicated and big and must
be put into a separate file, or can be embedded as String arrays within
Java source.
You can write both parser and lexer rules into the syntax.
You can use the special nonterminals "token" (Java:
Token.TOKEN) and "ignored" (Token.IGNORED)
as token markers to draw the line between lexer and parser rules.
You can use a lot of predefined lexer rules
by referencing them within `backquotes`. Look
at javadoc of StandardLexerRules class for a
list of available symbols.
You can use ".." to define character sets like A..Z
(Java: Token.UPTO).
You can use "-" to define intersections like `char`
- `newline`(Java: Token.BUTNOT).
Following notations are possible, whereby the first is written
into Java source, the second is written into a separate file that can
be read in using SyntaxBuilder class (see below):
- // Embedded Java notation
private static final String [][] rules = {
{ Token.TOKEN, "others" },
// TOKEN defines what we want to receive
{ Token.TOKEN, "`stringdef`"
},
{ Token.IGNORED, "`cstylecomment`"
}, // IGNORED defines what we want to ignore
{ "others", "others", "other" },
{ "others", "other" },
{ "other", "`char`", Token.BUTNOT,
"`cstylecomment`", Token.BUTNOT, "`stringdef`" },
};
- # EBNF notation (Extended Backus-Naur
Form)
token ::= others ; // TOKEN defines
what we want to receive
token ::= `stringdef` ;
ignored ::= `cstylecomment` ;
// IGNORED defines what we want to ignore
others ::= others other ;
others ::= other ;
other ::= `char` - `cstylecomment` - `stringdef`
;
In EBNF notation you can use quantifiers like *, +,
? to mark symbols as 0-n (*, optional list), 1-n (+, list) or 0-1
(?, optional), for every symbol on right side of the rule.
IMPORTANT: Please do not use nonterminal symbols starting
with '_', as such symbols are used for artificial rules which get optimized,
and probably you would not receive them in the semantic module!
- You need not to define "token" and "ignored".
When you mix lexer and parser rules, RunCC will try to separate lexer and
parser rules automatically.
But when there are complex lexing rules that require nonterminals on the
right side, it is necessary to define "token" and "ignored" to tell
the LexerBuilder which rules are for the Lexer. Rules marked as "token" or
"ignored" and all underlying ones will be handled by the Lexer (that is
in fact implemented as top-down parser) and will never reach the Parser.
Furthermore it is possible to use the Lexer standalone, then you must mark
the nonterminal of the root rule as "token".
- Implement a Semantic
The Semantic interface defines one method that receives
all callbacks during parsing:
doSemantic(Rule rule, List parseResult,
List resultRanges)
The parseResult will contain ...
- When using Parser: a List of Objects as long as the count
of symbols on the right side of the rule, every Object is a result from
an underlying doSemantic() call.
- When using standalone Lexer: one token text that is the
result of the rule
The resultRanges will contain the range(s) of the token
texts within input text (e.g. for syntax-highlighting). The ranges List
will be as long as parseResult list. It contains elements of type Token.Range.
For convenience you can use ReflectSemantic. This class
provides rule recognition by reflection (java.lang.reflect):
- the nonterminal on the left side is considered to
be a method name,
- every symbol on the right side is passed
as Object argument to that method.
When the method is not found, a fallback() method
is called. This returns the token when it is just one, when the rule is
left recursive and the first argument is instanceof List, the second argument
is added to that List (can be used for automatic list aggregation). The
fallback() is overrideable (protected).
The implemented semantic must be passed to the Parser for processing:
- parser.parse(new MySemantic());
For a sample look at the HelloWorld example.
For standalone lexers you need to use a
implementation. The interface method receives every evaluated rule top-down.
A ReflectLexerSemantic is provided in a sub-package, which works similar to
ReflectSemantic for Parsers. The difference is that it can provide a Set of
nonterminal Strings (left side of the rule) it wants to receive.
- Make the syntax
available.
Write a Java main class that contains the embedded syntax or
uses SyntaxBuilder to read the EBNF file.
When using Java-embedded rules (2D String array), the following
will do it:
String [][] rules = ...
Syntax syntax = new Syntax(rules);
When using an external EBNF file, the following will read it
and convert it to a Syntax object (assuming the file is in
same directory as the implementation):
Reader syntaxInput = new InputStreamReader(MyLanguage.class.getResourceAsStream("MyLanguage.syntax"));
SyntaxBuilder builder = new
SyntaxBuilder(syntaxInput);
Syntax syntax = builder.getSyntax();
SyntaxBuilder resolves all wildcard directives and alternative
sentences within rules to plain rules. Doing that several new rules must
be constructed. All nonterminals on the left sind of such artificial rules
will start with "_".
- Load the lexer with
input.
Decide if this is a lexer problem or a parser is the better
choice. Most likely you will try a lexer/parser combination first, but
when there are too many shift/reduce conflicts, a standalone lexer might
do it as well (the contained lexer can process even recursive rules, in
fact it is a top-down parser).
Assuming you defined a mixed parser and lexer syntax:
SyntaxSeparation separation
= new SyntaxSeparation(syntax); // takes off TOKEN and
IGNORED
LexerBuilder builder = new LexerBuilder(separation.getLexerSyntax(),
separation.getIgnoredSymbols());
Lexer lexer = builder.getLexer();
lexer.setInput("Very complicated text to analyze
...");
Assuming you defined a pure lexer syntax (this is a complicated
way, in fact you will use the lexer builder):
SyntaxSeparation separation
= new SyntaxSeparation(syntax); // takes off TOKEN and
IGNORED
LexerBuilder builder = new
LexerBuilder(separation.getLexerSyntax(), separation.getIgnoredSymbols());
Lexer lexer = builder.getLexer();
lexer.setTerminals(separation.getTokenSymbols());
lexer.setInput("Very complicated text to analyze
...");
- Parse the input.
The lexer was loaded with the input to parse, it represents
the input of the parser.
When using a parser-lexer combination, create the parser and
parse:
ParserTables parserTables = new LALRParserTables(separation.getParserSyntax());
Parser parser = new Parser(parserTables);
parser.parse(lexer, new MySemantic());
// here the missed setTerminals() happens
When using a standalone lexer, you can lex using two different
methods.
- You just want to receive tokens and output certain
token texts (like in C-style-comment-strip example):
Token token;
do {
token = lexer.getNextToken(null);
// null: no hints what is expected
if (token.symbol == null)
lexer.dump(System.err);
else
if (token.text != null)
System.out.write(token.text.toString());
}
while (token.symbol != null && Token.isEpsilon(token)
== false);
- You want to parse top-down using the lexer (like in XML example),
this evaluates the input (which means the Lexer returns false if there is
more input to read!):
LexerImpl lexer = (LexerImpl)
builder.getLexer(); // need to cast to LexerImpl for lex() method
boolean ok = lexer.lex(new MyLexerSemantic());
IMPORTANT:
When studying how to use RunCC, do not miss to read the examples. RunCC
can be used in a lot of different ways for special purposes. The more complicated
the problem, the deeper you will have to go into the framework. There are
several examples packages, the most important is fri.patterns.interpreter.parsergenerator.examples.
Using the builders
Parser builder
To speed up the building of parsers and to write some of those
standard building sequences shorter you can use the builders in package
fri/patterns/interpreter/parsergenerator/builders.
Lets look at "HelloWorld" example. Here is the first version
of HelloWorld.java without builder (this simple Syntax can be
done by simple SLRParserTables,
normally you need not define the ParserTables class).
private static final String [][] syntax =
{
{ "Start", "\"Hello\"",
"\"World\"" },
{ Token.IGNORED, "`whitespaces`"
},
};
public static void main(String [] args) throws
Exception {
// separate parser and lexer
rules
SyntaxSeparation separation
= new SyntaxSeparation(new Syntax(syntax));
// build the lexer from the
lexer rules part
LexerBuilder builder
= new LexerBuilder(separation.getLexerSyntax(), separation.getIgnoredSymbols());
Lexer lexer = builder.getLexer("\tHello
\r\n\tWorld\n"); // the text to parse
// build the parser
from the parser rules part
Parser parser = new Parser(new
SLRParserTables(separation.getParserSyntax()));
parser.parse(lexer, new
PrintSemantic());
}
To write this shorter and to speed up loading, a builder is
used in HelloWorld2.java. This version lasts a little longer
at the first time, but loads much faster at any further time.
public static void main(String [] args)
throws Exception {
// build SLRParserTables,
this is the simplest one, default is LALRParserTables
Parser parser = new SerializedParser().get(SLRParserTables.class,
syntax, "HelloWorld2");
// generates "$HOME/.friware/parsers/HelloWorld2Parser.ser"
parser.setInput("\tHello
\r\n\tWorld\n"); // there was a default Lexer provided
parser.parse(new PrintSemantic());
}
Lexer builder
The following example shows how to build and use a Lexer.
Reader syntaxInput = new InputStreamReader(XmlParser.class.getResourceAsStream("Xml.syntax"));
Lexer lexer = new SerializedLexer().get(syntaxInput,
"Xml");
Reader parseInput = new FileReader(parseFile);
lexer.setInput(parseInput);
// scan the input with some LexerSemantic
boolean ok = lexer.lex(new PrintLexerSemantic());
The PrintLexerSemantic implements LexerSemantic and communicates
the rules it wants to receives by providing a Set of nonterminal Strings
(that must match those written in the Syntax). See XmlLexer for
a full example how to do this.
Lexer services
You can use the LexerImpl as top-down parser with a special semantic:
LexerSemantic, or its utility implementation LexerReflectSemantic.
You will receive the Rule and its ResultTree object in interface method, for
LexerReflectSemantic you can simply implement the wanted callback methods,
the Set of wanted nonterminal Strings will be provided automatically from
those method names.
Example (assumed there is rule like "EncodingDecl ::= IsoString;"
in lexing Syntax):
public class MyLexerSemantic extends LexerReflectSemantic
{
private String encodingDecl;
...
public void EncodingDecl(ResultTree
resultTree) {
// process the result
this.encodingDecl
= resultTree.toString(); // you will get e.g. "ISO-8859-1"
}
...
}
Lexer token listener
To receive every Token the Lexer has scanned, you can install a
listener to the Lexer. This is for catching comments and spaces/newlines
that are marked as ignored in the EBNF (will not reach the Parser). The
start and end Token.Range is contained within the Token
object. You can associate those ranges with the result ranges received in
Semantic callbacks (where you have the ranges list of all parsed right side
objects) to reconstruct the whole text. Token.Range implements hashCode(),
equals() and Comparable to be manageable in some Map.
You install the listener by following code:
Lexer.TokenListener tokenListener = new Lexer.TokenListener()
{
public void tokenReceived(Token token, boolean
ignored) {
if (ignored)
{
System.err.println("Having
received ignored token: >"+token.getText()+"<");
}
}
};
parser.getLexer().addTokenListener(tokenListener);
parser.parse(new SomeSemantic());
Lexing without end-of-input
When you want to read some input that does not provide an end-of-input
mark (EOF), but might contain valid text, you can use the getNextToken(LexerSemantic)
method of LexerImpl:
LexerImpl lexer = (LexerImpl) builder.getLexer(); //
need to cast to LexerImpl for getNextToken(LexerSemantic) method
Token token = getNextToken(someLexerSemantic);
if (token.symbol == null)
System.out.println("error!");
System.out.println("Unread text starts at offset "+token.range.end.offset);
Three kinds of ParserTables
The provided bottom-up parser is a generic algorithm that runs
with every syntax. It is driven by a ParserTables object that represents
the syntax to be used on processing input. ParserTables contain the so-called
GOTO-Table, the PARSE-ACTION table and some helper lists.
When implementing a language that uses a parser you must to decide
which kind of parser tables you need. RunCC provides three kinds of
parser tables:
- LALRParserTables
The most popular solution for bottom-up parsers, the Java parser
example and the Calculator sample use it. This is default when no ParserTables
class is passed to the parser builder.
- SLRParserTables
A small implementation for simple languages.
- LRParserTables
Creates very big tables. Not well tested, more theory than practice.
The ParserTables implementations build on each other in an
objectoriented way: SLRParserTables is the base, LRParserTables adds some
functionality, and LALRParserTables is the most specialized derivation.
Optional source generation
When you want to increase the parser loading performance you can
generate Java source code for following parts:
- Syntax
Commandline:
java fri.patterns.interpreter.parsergenerator.util.SourceGenerator
MyEbnf.syntax
This commandline generates Java code representing the syntax written
within the passed .syntax file. Wildcards *+? are resolved to artificial
rules that start with "_". The result would be a file named MyEbnfSyntax.java.
- ParserTables
Commandline:
java fri.patterns.interpreter.parsergenerator.util.SourceGenerator
LALR MyEbnf.syntax
This commandline generates Java code representing the ParserTables
implementation for the syntax written within the passed .syntax file.
The result would be a file named MyEbnfParserTables.java.
- Semantic skeleton (code base for semantic implementations
of big languages)
Commandline:
java fri.patterns.interpreter.parsergenerator.util.SourceGenerator
semantic MyEbnf.syntax
This commandline generates Java code that contains a do-nothing
Semantic implementation for the syntax written within the
passed .syntax file. The result would be a file named MyEbnfSemantic.java,
that (when compiled and passed to the parser) just prints out the method
names and all arguments the method is called with.
If the target source file exists, it will not be overwritten, as
it could already contain a manual implementation.
When launched without arguments, the SourceGenerator outputs the
following message:
SYNTAX: java fri.patterns.interpreter.parsergenerator.util.SourceGenerator
[semantic|LALR|SLR|LR] file.grammar [file.syntax ...]
Predefined semantics
RunCC provides three Semantic implementations:
- PrintSemantic
This is a very simple helper semantic that just prints out the
rule and the parsed values of that rule. Nice for testing during development.
-
ReflectSemantic
This is a semantic that assumes that rules are directly mapped
to methods:
- every nonterminal on the left side
of a rule will be interpreted as a method name,
- every symbol on the right side of
that rule will be an argument for that method.
All arguments are of type Object. Look at the Calculator sample.
ReflectSemantic contains a fallback method that can
be used to aggregate lists. You need to read and test the source to understand
what you can do with that class.
-
TreeBuilderSemantic
This is a semantic to build a syntax instance tree from some input
(like DOM in XML). The tree node is a simple inner class, made to be
extended, or to be iterated by some follower semantic that needs a ready-made
parse tree instance.
Syntax checker
RunCC provides a utility class to print diagnostics of an EBNF:
java fri.patterns.interpreter.parsergenerator.util.SyntaxChecker
MySyntax.syntax
It detects
- unresolved nonterminals (nonterminals without rule)
- singular rules (nonterminal can be substituted by its singular
right symbol)
- isolated rules (redundant, can be removed)
- none or more than one toplevel rule
EBNF dialect
Following is the EBNF of the syntax definition language that you
can use when not using embedded String array rules. It is described
in its own syntax.
This file can be found in parsergenerator/syntax/builder/examples/SyntaxBuilder.syntax,
but this is just a sample. The actual rules are embedded within SyntaxBuilderSemantic.java.
Mind that you should not use a starting '_' for nonterminals, as
artificial rules do so (added when resolving parentheses and wildcards),
and those will be optimized. It could happen that a rule with such a nonterminal
will be removed before building the parser, which has the effect that the
associated semantic never will be called.
syntax ::= rule +
// a syntax consists of one or more rules
;
set ::= `bnf_chardef` ".."
`bnf_chardef` // character set definition
;
intersectionstartunit ::= set
| `identifier` | `ruleref` // intersection of character
sets
;
intersectionunit ::= `bnf_chardef`
| `stringdef` | intersectionstartunit
;
intersection ::= intersectionstartunit
('-' intersectionunit)+
;
sequnit ::= intersection |
intersectionunit | '(' unionseq ')' // unit of a
sequence
;
quantifiedsequnit ::= sequnit
`quantifier` | sequnit // unit can be quantified
;
sequence ::= quantifiedsequnit
* // sequence of units with significant order
;
unionseq ::= sequence ('|'
sequence)* // rule alternatives
;
rule ::= `identifier` "::="
unionseq ';' // syntax rule
;
ignored ::= `comment` | `whitespaces`
;
EBNF syntax symbols
Following symbols can be used within this language (as described
above):
Symbol
|
Meaning
|
..
|
For character set definitions, Java symbol
is Token.UPTO, for example 'A' .. 'Z'
|
-
|
For character set intersections (set constraints),
Java symbol is Token.BUTNOT, for example `char` - `newline`
|
`
|
The backquote (a UNIX shell reminiscence)
is used to mark predefined lexer rules, these are implemented as String
arrays within StandardLexerRules . Using `whitespaces` will import
all rules describing whitespaces into the syntax. At the time there is
no possibility to write your own importable lexer rules. |
"
|
Double quote is used to mark string literal
like "for" or "if". It can also be used for characters: "c" is equal
to 'c'.
|
'
|
Single quote is used to mark characters
or escape sequences like 'A' or '\n'. But you can NOT write 'string'!
|
10 |
Numbers express characters. The decimal number
10 would be \n. Stay within UNICODE range!
|
0xA
|
A hexadecimal number, expressing the character
\n.
|
012
|
An octal number (starting with zero), expressing
the character \n.
|
::=
|
The original EBNF symbol to express "derives
to".
|
|
|
The pipe expresses "or", used for alternative
rules: "a ::= b | c ;" means "a" derives to "b" or "c". Will be resolved
to "a ::= b; a ::= c;"
|
;
|
The rule (or rule alternation) terminator.
|
()
|
Parenthesis are grouping markers. Such a
group can be quantified by *+?, e.g. "(`char` - `newline`)*".
|
*
|
Star marks a symbol or group that can repeat
0-n times ("optional list").
|
+
|
Plus marks a symbol/group that can repeat
1-n times ("list").
|
?
|
Question mark is used to express 0-1 occurences
of a symbol/group ("optional").
|
/* ... */
|
Comment marker, like in Java or C/C++.
|
//
|
Comment marker
|
#
|
Comment marker
|
Package summary
This is a short introduction into the contents of every
package contained in RunCC.
- fri/patterns/interpreter/parsergenerator
Contains toplevel classes like the universal bottom-up
Parser algorithm, the Token class with constants definitions, and all
interfaces to drive the parser: Lexer, ParserTables, Semantic.
- fri/patterns/interpreter/parsergenerator/builder
Contains all builders that use serialization to speed up building
parsers and lexers. The default source sequences to build a parser manually
can be taken from here.
- fri/patterns/interpreter/parsergenerator/examples
Contains examples: a Java parser, a XML lexer, a DTD lexer that
imports the XML syntax, a Calculator parser for arithmetic expressions,
and two HelloWorld parsers.
- fri/patterns/interpreter/parsergenerator/lexer
Contains all lexer implementations and the lexer builder (not related
to serialization), and the StandarLexerRules rule library.
- fri/patterns/interpreter/parsergenerator/parsertables
Contains all bottom-up syntax analysis implementations like SLR,
LR, LALR, building on an AbstractParserTables class that can
generate source.
- fri/patterns/interpreter/parsergenerator/semantics
Contains some default Semantic implementations like
ReflectSemantic.
- fri/patterns/interpreter/parsergenerator/syntax
Contains the Syntax and Rule classes and helpers.
- fri/patterns/interpreter/parsergenerator/syntax/builder
Contains the implementations for the EBNF language, and a class
to separate lexer and parser rules within a syntax.
- fri/patterns/interpreter/parsergenerator/util
Contains most source generator implementations, a SyntaxChecker
class to print diagnostics for an EBNF file, and helper classes.
There are some other packages containing context-specific examples.
ParserTables sample dump
Dumping parser tables can be useful when studying the complicated
ways of bottom-up parsing.
The follwing dump can be printed by writing the following Java source
lines into a main class:
String [][] rules = ...;
ParserTables parserTables = new SLRParserTables(new Syntax(rules));
parserTables.dump(System.out);
The dump method lists rules, FIRST- and FOLLOW-sets, syntax nodes,
and GOTO- and PARSE/ACTION-tables:
(Rule 0) <START> : EXPR
(Rule 1) EXPR : TERM
(Rule 2) EXPR : EXPR '+' TERM
(Rule 3) EXPR : EXPR '-' TERM
(Rule 4) TERM : FAKT
(Rule 5) TERM : TERM '*' FAKT
(Rule 6) TERM : TERM '/' FAKT
(Rule 7) FAKT : `NUMBER`
(Rule 8) FAKT : '(' EXPR ')'
FIRST(EXPR) = [`NUMBER`, '(']
FIRST(FAKT) = [`NUMBER`, '(']
FIRST(<START>) = [`NUMBER`, '(']
FIRST(TERM) = [`NUMBER`, '(']
FOLLOW(FAKT) = [EOF, '+', '-', ')', '*', '/']
FOLLOW(EXPR) = [EOF, '+', '-', ')']
FOLLOW(<START>) = [EOF]
FOLLOW(TERM) = [EOF, '+', '-', ')', '*',
'/']
State 0
(Rule 0) <START> : .EXPR
-> State 2
(Rule 1) EXPR : .TERM -> State
1
(Rule 2) EXPR : .EXPR '+' TERM
-> State 2
(Rule 3) EXPR : .EXPR '-' TERM
-> State 2
(Rule 4) TERM : .FAKT -> State
4
(Rule 5) TERM : .TERM '*' FAKT
-> State 1
(Rule 6) TERM : .TERM '/' FAKT
-> State 1
(Rule 7) FAKT : .`NUMBER` ->
State 5
(Rule 8) FAKT : .'(' EXPR ')' ->
State 3
State 1
(Rule 1) EXPR : TERM .
(Rule 5) TERM : TERM .'*' FAKT
-> State 7
(Rule 6) TERM : TERM .'/' FAKT
-> State 6
State 2
(Rule 0) <START> : EXPR .
(Rule 2) EXPR : EXPR .'+' TERM
-> State 9
(Rule 3) EXPR : EXPR .'-' TERM
-> State 8
State 3
(Rule 1) EXPR : .TERM -> State
1
(Rule 2) EXPR : .EXPR '+' TERM
-> State 10
(Rule 3) EXPR : .EXPR '-' TERM
-> State 10
(Rule 4) TERM : .FAKT -> State
4
(Rule 5) TERM : .TERM '*' FAKT
-> State 1
(Rule 6) TERM : .TERM '/' FAKT
-> State 1
(Rule 7) FAKT : .`NUMBER` ->
State 5
(Rule 8) FAKT : '(' .EXPR ')' ->
State 10
(Rule 8) FAKT : .'(' EXPR ')' ->
State 3
State 4
(Rule 4) TERM : FAKT .
State 5
(Rule 7) FAKT : `NUMBER` .
State 6
(Rule 6) TERM : TERM '/' .FAKT
-> State 11
(Rule 7) FAKT : .`NUMBER` ->
State 5
(Rule 8) FAKT : .'(' EXPR ')' ->
State 3
State 7
(Rule 5) TERM : TERM '*' .FAKT
-> State 12
(Rule 7) FAKT : .`NUMBER` ->
State 5
(Rule 8) FAKT : .'(' EXPR ')' ->
State 3
State 8
(Rule 3) EXPR : EXPR '-' .TERM
-> State 13
(Rule 4) TERM : .FAKT -> State
4
(Rule 5) TERM : .TERM '*' FAKT
-> State 13
(Rule 6) TERM : .TERM '/' FAKT
-> State 13
(Rule 7) FAKT : .`NUMBER` ->
State 5
(Rule 8) FAKT : .'(' EXPR ')' ->
State 3
State 9
(Rule 2) EXPR : EXPR '+' .TERM
-> State 14
(Rule 4) TERM : .FAKT -> State
4
(Rule 5) TERM : .TERM '*' FAKT
-> State 14
(Rule 6) TERM : .TERM '/' FAKT
-> State 14
(Rule 7) FAKT : .`NUMBER` ->
State 5
(Rule 8) FAKT : .'(' EXPR ')' ->
State 3
State 10
(Rule 2) EXPR : EXPR .'+' TERM
-> State 9
(Rule 3) EXPR : EXPR .'-' TERM
-> State 8
(Rule 8) FAKT : '(' EXPR .')' ->
State 15
State 11
(Rule 6) TERM : TERM '/' FAKT .
State 12
(Rule 5) TERM : TERM '*' FAKT .
State 13
(Rule 3) EXPR : EXPR '-' TERM .
(Rule 5) TERM : TERM .'*' FAKT
-> State 7
(Rule 6) TERM : TERM .'/' FAKT
-> State 6
State 14
(Rule 2) EXPR : EXPR '+' TERM .
(Rule 5) TERM : TERM .'*' FAKT
-> State 7
(Rule 6) TERM : TERM .'/' FAKT
-> State 6
State 15
(Rule 8) FAKT : '(' EXPR ')' .
GOTO TABLE
==========
| <START>
EXPR TERM FAKT
'+' '-' '*'
'/' `NUMBER '(' ')'
________________________________________________________________________________________________
0 |
- 2
1 4
- -
- -
5 3
-
1 |
- -
- -
- -
7 6
- -
-
2 |
- -
- -
9 8
- -
- -
-
3 |
- 10
1 4
- -
- -
5 3
-
4 |
- -
- -
- -
- -
- -
-
5 |
- -
- -
- -
- -
- -
-
6 |
- -
- 11
- -
- -
5 3
-
7 |
- -
- 12
- -
- -
5 3
-
8 |
- -
13 4
- -
- -
5 3
-
9 |
- -
14 4
- -
- -
5 3
-
10 |
- -
- -
9 8
- -
- -
15
11 |
- -
- -
- -
- -
- -
-
12 |
- -
- -
- -
- -
- -
-
13 |
- -
- -
- -
7 6
- -
-
14 |
- -
- -
- -
7 6
- -
-
15 |
- -
- -
- -
- -
- -
-
PARSE-ACTION TABLE
==================
|
'+' '-' '*'
'/' `NUMBER '(' ')'
<EOF>
________________________________________________________________________
0 |
- -
- -
SH SH
- -
1 |
1 1
SH SH
- -
1 1
2 |
SH SH
- -
- -
- AC
3 |
- -
- -
SH SH
- -
4 |
4 4
4 4
- -
4 4
5 |
7 7
7 7
- -
7 7
6 |
- -
- -
SH SH
- -
7 |
- -
- -
SH SH
- -
8 |
- -
- -
SH SH
- -
9 |
- -
- -
SH SH
- -
10 |
SH SH
- -
- -
SH -
11 |
6 6
6 6
- -
6 6
12 |
5 5
5 5
- -
5 5
13 |
3 3
SH SH
- -
3 3
14 |
2 2
SH SH
- -
2 2
15 |
8 8
8 8
- -
8 8
Undone
- The parser (and the lexer) breaks at the first syntax error.
No error recovery is implemented. Start implementing at Parser.recover().
- There is no "import" statement in the EBNF dialect. This
would be useful for syntaxes that use other syntaxes, like DTD that uses
XML. At the time such syntaxes must be associated programmatically by
Syntax.resolveFrom() , quite inflexible.
- There is no graphical userinterface. It would be nice
to have a syntax editor, and a panel that shows a syntax-highlighting
test view. Syntax-highlighting would be easy with RunCC, as you only
need a new EBFN file instead of a syntax-plugin.
- Requirements like "read next 100 characters" can not
be expressed by the current EBNF dialect. This is needed for byte formats
(first 10 bytes of this, then 40 bytes of that ...). The way to express
repetitions is the list aggregation (left recursion), but its repetition
can not be limited without a major change of lexer implementation.
- Reorganize StandardLexerRules.java, divide the
monolithic arrays into smaller units.
- JUnit test cases
- Writing XML instead of the EBNF dialect to define a syntax
Other parser generators
Thank god - nobody is alone!