B 2 J


  1. Add bex.zip to your CLASSPATH (globally or on the command line).
  2. Start de.bb.bex.B2J (e.g. java -classpath %CLASSPATH%;bex.zip de.bb.bex.B2J <filename>).
  3. The target is written into the current directory.
  4. The optional 2nd parameter specifies a source directory.
  5. The generated class "<name>Lex" is always derived from "<name>" ( XyzLex -> Xyz ).
  6. The real code should be implemented in "<name>". All functions from "<name>" can be used in your BNF file.

BNF - File:


<terminal> is any text within  ' '	e.g.:	'include'

The file consists of nonterminals, which are defined in terms of other
nonterminals and/or terminals. Use the following syntax:

//		single line comment
/* ... */	multi line comment
( ... )		group elements
{ ... }		embed Java source code
|		or
?		0 .. 1 occurrences
*		0 .. n occurrences
+		1 .. n occurrences

nonterminal = (
             terminal
           | nonterminal
           | '(' terminal ')'
           | '{' javaCode '}'
           | '|'
           | '?'
           | '*'
           | '+'
           | '//' single line comment
           | '/*' multi line comment '*/'
           ) *
;


Special keywords:

ALPHA = <terminal>;	// mark characters as alpha (default: A-Z, a-z)
PACKAGE = <terminal>;	// name of used Java package
IMPORT = <terminal>;    // emit an import statement into the generated source code


How does it work


B2J uses simple LL grammars: "eat what you can". That means that if there
are alternatives, the first one is tried; it either succeeds or fails.
If an alternative fails, the next one is tried, and so on.
If no alternative is left, the nonterminal fails.
If the outermost nonterminal fails, a syntax error is reported.
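
The "try the alternatives in order" strategy can be sketched in plain Java.
This is only an illustration and not the code that B2J generates: the class
OrderedChoiceDemo, its methods and the rule Greeting are hypothetical names,
and tokens are simplified to plain strings. The alternatives were chosen so
that they differ in their first token.

```java
import java.util.List;

// Hypothetical illustration of ordered alternatives; not B2J's generated code.
class OrderedChoiceDemo {
    private final List<String> tokens;
    private int pos = 0;

    OrderedChoiceDemo(List<String> tokens) {
        this.tokens = tokens;
    }

    // Consume the current token if it equals the expected terminal.
    private boolean eat(String terminal) {
        if (pos < tokens.size() && tokens.get(pos).equals(terminal)) {
            pos++;
            return true;
        }
        return false;
    }

    // Greeting = 'hello' 'world' | 'bye' ;
    boolean isGreeting() {
        if (eat("hello")) {
            // First alternative started to match, so 'world' must follow.
            return eat("world");
        }
        // First alternative failed on its first token: try the next one.
        return eat("bye");
    }

    // Outermost rule: the whole input must be exactly one Greeting.
    static boolean parse(List<String> tokens) {
        OrderedChoiceDemo p = new OrderedChoiceDemo(tokens);
        return p.isGreeting() && p.pos == tokens.size();
    }

    public static void main(String[] args) {
        System.out.println(parse(List.of("hello", "world"))); // true
        System.out.println(parse(List.of("bye")));            // true
        System.out.println(parse(List.of("hello", "bye")));   // false: syntax error
    }
}
```

If every alternative fails, isGreeting() returns false, and because it is the
outermost rule in this sketch, the caller would report a syntax error.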

It is important to avoid direct recursion! At least one token must be
consumed before a recursion may occur:

FALSCH = FALSCH 'x' | 'x' ; // wrong: leads to a stack overflow
                            // ... this is detected by B2J!

BESSER = 'x' BESSER*; // better: may still lead to a stack overflow
 	              // (if the input consists of many 'x')

GUT = 'x'*;	// good: no recursion


Every grammar can be written without recursion!
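
Why the recursive form is risky can be seen in a small hand-written sketch.
The class RecursionDemo and its methods are hypothetical and not part of B2J;
countRecursive only mimics the style of BESSER (one nested call per 'x'),
while countIterative mimics GUT (a plain loop).

```java
import java.util.Arrays;

// Hypothetical hand-written recognizers; the names are not part of B2J.
class RecursionDemo {
    // In the style of BESSER: every 'x' costs one additional stack frame.
    static int countRecursive(String in, int pos) {
        if (pos < in.length() && in.charAt(pos) == 'x') {
            return 1 + countRecursive(in, pos + 1);
        }
        return 0;
    }

    // In the style of GUT: a loop, so the stack depth stays constant.
    static int countIterative(String in) {
        int pos = 0;
        while (pos < in.length() && in.charAt(pos) == 'x') {
            pos++;
        }
        return pos;
    }

    // Helper: build an input consisting of n times 'x'.
    static String xs(int n) {
        char[] buf = new char[n];
        Arrays.fill(buf, 'x');
        return new String(buf);
    }

    public static void main(String[] args) {
        System.out.println(countRecursive("xxxx", 0));      // 4
        System.out.println(countIterative("xxxx"));         // 4
        System.out.println(countIterative(xs(1_000_000)));  // 1000000
        // countRecursive(xs(1_000_000), 0) would need a million nested calls
        // and is likely to end in a StackOverflowError with a default stack size.
    }
}
```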

Builtin Features


The parser is able to detect several kinds of nonterminals:

  public final static int LX_LITERAL = 1;      // default a-z A-Z
  public final static int LX_STRING = 2;       // "..."
  public final static int LX_CHAR = 3;         // '...'
  public final static int LX_INT = 4;          // 123 0x22 0777 -5
  public final static int LX_LONG = 5;         // 123L 0x22l 0777L -5l
  public final static int LX_DOUBLE = 6;       // 123. 123.0 -1e2 -1E+2 1.2e-2
  public final static int LX_COMMENTLINE = 7;  // //...
  public final static int LX_COMMENTBLOCK = 8; // /*...*/

The type can be read from the variable 'type'.
The value can also be read:

  public int type;    // one of above defined types
  public char cval;   // valid if LX_CHAR
  public int  ival;   // valid if LX_INT
  public long lval;   // valid if LX_LONG
  public int base;    // base for number: 8, 10, 16, 0 == double, 1 == Double
  public double dval; // valid if LX_DOUBLE
  public String sval; // valid if LX_STRING

To detect terminals, the following groups of characters are used:

  public static boolean digits[];                         // 0..9
  public static boolean hexdigits[];                      // 0..9 a..f A..F
  public boolean alpha[]; // can be modified by subclass  default: a..z A..Z

All characters <= 32 (space) are treated as white space.
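
How such lookup tables can work is sketched below. This is an assumption
about the general technique, not B2J's actual source: the class names
CharClassDemo and UnderscoreAlpha, the table size of 256 and the helper
isWhiteSpace() are made up for illustration. Only the field names digits,
hexdigits and alpha are taken from the declarations above.

```java
// Hypothetical sketch of boolean lookup tables indexed by character code.
class CharClassDemo {
    static final boolean[] digits = new boolean[256];     // 0..9
    static final boolean[] hexdigits = new boolean[256];  // 0..9 a..f A..F
    final boolean[] alpha = new boolean[256];              // per instance

    static {
        for (char c = '0'; c <= '9'; c++) {
            digits[c] = true;
            hexdigits[c] = true;
        }
        for (char c = 'a'; c <= 'f'; c++) {
            hexdigits[c] = true;
            hexdigits[Character.toUpperCase(c)] = true;
        }
    }

    CharClassDemo() {
        // Default: a..z A..Z
        for (char c = 'a'; c <= 'z'; c++) {
            alpha[c] = true;
            alpha[Character.toUpperCase(c)] = true;
        }
    }

    // All characters <= 32 (space) count as white space.
    static boolean isWhiteSpace(int c) {
        return c >= 0 && c <= 32;
    }

    public static void main(String[] args) {
        System.out.println(digits['7']);                       // true
        System.out.println(hexdigits['F']);                    // true
        System.out.println(new CharClassDemo().alpha['_']);    // false
        System.out.println(new UnderscoreAlpha().alpha['_']);  // true
        System.out.println(isWhiteSpace('\t'));                // true
    }
}

// A subclass can extend the alpha set, here by the underscore.
class UnderscoreAlpha extends CharClassDemo {
    UnderscoreAlpha() {
        alpha['_'] = true;
    }
}
```

Because alpha is an instance field in this sketch, changing it in a subclass
does not affect other parsers, while digits and hexdigits are shared.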

The automatic detection of some nonterminals can be switched off.
The default settings are:

  public boolean useString = true;  // LX_STRING
  public boolean useChar   = true;  // LX_CHAR
  public boolean useComment= true;  // LX_COMMENTLINE, LX_COMMENTBLOCK
  public boolean useNumber = true;  // LX_INT, LX_LONG, LX_DOUBLE

In addition, the text of each token can be retrieved using:

  #1.getString()  // get the text of the last token
  #2.getString()  // get the text of the 2nd last token
  ...

  where the number marks the nth token, counted from right to left.

  #1.getStringS()
  also retrieves the text. BUT: if the token is a nonterminal that consists
  of several tokens, they are all separated by a space.


Functions as Nonterminals


For each nonterminal Xxxxx a function isXxxxx() is generated.
The use of a nonterminal Xxxxx *is* a function call to isXxxxx()!

=> Instead of nonterminals, functions can be used.

See Calc1.bnf: Number is used, but not defined in BNF.

=> There is the function boolean isNumber() in the base class.

Every function which replaces a nonterminal must behave correctly,
that is, it

- must call the function next() and return true
  if the input stream is correct (parseable)
- must return false otherwise.

The function next() calls the parser again and detects the next token.
Because of that, next() must be called if the processing in your function
was successful!
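
The contract above can be sketched with a tiny stand-in class. Everything in
it is hypothetical: FunctionRuleDemo is not the B2J base class, its next()
simply walks over an int array, and isEven() is an invented rule that accepts
one even token value. Only the pattern (call next() and return true on
success, return false otherwise) follows the description above.

```java
// Hypothetical stand-in for a parser base class; not the real B2J API.
class FunctionRuleDemo {
    int token;                 // current token, -1 at the end of the input
    private final int[] input;
    private int pos = 0;

    FunctionRuleDemo(int... input) {
        this.input = input;
        next();                // load the first token
    }

    // Stand-in for the real next(): fetch the following token.
    void next() {
        token = pos < input.length ? input[pos++] : -1;
    }

    // Hand-written replacement for a nonterminal "Even".
    boolean isEven() {
        if (token >= 0 && token % 2 == 0) {
            next();            // success: advance to the next token
            return true;
        }
        return false;          // failure: the current token is not consumed
    }

    public static void main(String[] args) {
        FunctionRuleDemo p = new FunctionRuleDemo(4, 7);
        System.out.println(p.isEven());  // true, token is now 7
        System.out.println(p.isEven());  // false, token stays 7
        System.out.println(p.token);     // 7
    }
}
```

In a grammar, such a function would be used by writing Even where a
nonterminal is expected, in the same way Calc1.bnf uses Number.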

The current token is always available as a value in the variable token:

	-1 at the end of the input stream
	some other value otherwise.
        If type == LX_LITERAL, the name can be retrieved using:

        String getString(int tokenVal); // name lookup

The function boolean run(String args[])


The generated code contains a tiny main, which starts the processing.
You may implement your own startup code, or add the code to a bigger
project.
To start the parser the function __run() is used.

History


V 1.6

bugfix: source tracking for error reports tracked all lines instead of only the last line.

  When the parser detected an error, not only the last line was displayed.
  It is still not perfect: when an error is at a line break, the following empty line may be shown.
  In that case, look at the previous line in your code.

bugfix: redesigned tokenizer:
since any alpha-literal is accepted and new tokens are generated,
there was a problem with keywords consisting of alpha and non-alpha characters.
e.g. having 'div' and 'div=' as keywords:
  'div=' was recognized as 'div' since the alpha-literal was used.
now:
  'div=' is recognized correctly

new: added IMPORT = ; to add import statements to generated source.

  This is especially useful when your inlined code needs other classes: you can now import them.


V 1.5

change: enableStrings() and disableStrings() also affect useChar

Either you need to parse strings *and* chars, or not. So both are switched simultaneously.

V 1.4

bugfix: sometimes EOF was unread as char 0xff.

V 1.3

bugfix: sometimes a white space was not properly unread.

V 1.2

Did some internal redesign to enhance the parser's capabilities.

V 1.1

added public functions

  public void enableComments() throws IOException
  public void disableComments() throws IOException
  public void enableStrings() throws IOException
  public void disableStrings() throws IOException

to switch the flags

  useComment
  useString

on or off. The advantage of these functions: The current token gets
reparsed.

E.g.

assume:
  useString is true
  token = 0
  type = LX_STRING
  sval = "test"

disableStrings() causes:

  useString = false;
  token = <token value for ">

  and the remaining characters     test"    are unread.

etc.

V 1.0

public release