
scanatra

scanatra is a toolset to create parsers, compilers, and other similar text processing tools. It implements regular expressions and context-free grammars (LL(1), SLR, LR(1)).

Content:

  • rautomaton
    • finite automaton (regular expressions, dfa, nfa)
    • cfg (ll1, lr0/slr, lr1 parsers)
  • rautomaton-macro
    • cfg_grammar_const - compile time cfg parse table generation
  • scanatra-core
    • double_enum - to create token enums for scanning
    • scanner - lex text to tokens, remove irrelevant data
    • ast_gen - macro creating ast gen code
      • creates a grammar
      • converts the grammar to an ast
    • regex - a regex syntax parser, to use regex for creating regular expressions
  • scanatra-macro
    • regex - parses a regex string at compile time

Usage

Tokens

For lexing, two kinds of tokens are needed: tokens with data, and bare tokens without data that represent only the token kind.

Create them like this:

use scanatra::prelude::*;

double_enum!(
    BareTokens, Tokens {
        WhiteSpace,
        Semicolon,
        Ident(String),
        // ...
    }
);

This will create two enums (with some impls):

#[derive(Debug, Clone, PartialEq)]
enum Tokens {
    WhiteSpace,
    Semicolon,
    Ident(String),
    // ...
}

#[derive(Terminal, Debug, Clone, PartialEq, Eq, Hash, Ord, PartialOrd)]
enum BareTokens {
    WhiteSpace,
    Semicolon,
    Ident,
    // ...
}

impl PartialEq<Tokens> for BareTokens { /* ... */ }
impl PartialEq<BareTokens> for Tokens { /* ... */ }
impl Form<Tokens> for BareTokens { /* ... */ }

You can create them yourself, but remember to implement those traits.
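
For reference, a hand-written version of the comparison impls might look like the following sketch (only the PartialEq side is shown here; the exact shape of the Form trait is defined by scanatra-core):

impl PartialEq<Tokens> for BareTokens {
    fn eq(&self, other: &Tokens) -> bool {
        // a bare token equals a data-carrying token of the same kind,
        // regardless of the attached data
        matches!(
            (self, other),
            (BareTokens::WhiteSpace, Tokens::WhiteSpace)
                | (BareTokens::Semicolon, Tokens::Semicolon)
                | (BareTokens::Ident, Tokens::Ident(_))
        )
    }
}

impl PartialEq<BareTokens> for Tokens {
    fn eq(&self, other: &BareTokens) -> bool {
        other == self
    }
}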

Scanner

For lexing, create a scanner. A scanner just needs an enum like Tokens above; this enum must implement the trait CreateMatchTable<Self>.

Write it yourself or use the following macro:

use scanatra::prelude::*;

token_scanner!(
    Tokens,
    or!(' ', '\t', '\n', '\r') ;-> |_| {
        Some(WhiteSpace)
    }
    ';' ;-> |_| {
        Some(Semicolon)
    }
    or!('a'..='z', 'A'..='Z') * star!(word!()) ;-> |m: &str| {
        Some(Ident(String::from(m)))
    }
    regex!(r#""( [^\\"] | \\[\\"] )*""#) ;-> |m| {
        //...
    }
    // ...
);

The macro uses the regular expressions from rautomaton.

regex

If you do not want to construct the regular expressions yourself, you can use the regex!() macro.

This macro currently supports a subset of regex syntax.

features:

  • a | b: matches a or b.
  • ab: matches a followed by b.
  • (): match group
  • [abc0-9]: matches any character in the brackets. Matching ranges (e.g. 0-9) is supported.
  • [^abc0-9]: same as above, but matches anything besides the given characters / ranges.
  • ?: matches element before 0 or 1 times.
  • +: matches element before 1 or more times.
  • *: matches element before 0 or more times.
  • \x00: matches an escaped one-byte character.
  • \u0000 or \u{00}: matches an escaped Unicode character.
  • \/: matches an escaped character (but not x, c or u).
  • special characters: \0, \n, \f, \v, \r, \t
  • control characters: \cA to \cZ
  • \w and \W: only word characters / every other character
  • \d and \D: only digits / every other character
  • \s and \S: only whitespace characters / every other character

The regex! macro parses the regex at compile time, if possible. When the regex should be parsed at runtime, use regex_live!.
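
A short sketch of both macros (the patterns are hypothetical examples; the returned values are the rautomaton regular expressions used by token_scanner! above):

use scanatra::prelude::*;

// checked and converted at compile time
let number = regex!(r"[0-9]+(\.[0-9]+)?");

// parsed at runtime, e.g. for patterns not known at compile time
let pattern = r"[a-zA-Z_]\w*";
let ident = regex_live!(pattern);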

Scanning

To actually scan text, create a scanner, and supply it with some text.

Note: the finite automatons from rautomaton can lex other kinds of data, but the Scanner is limited to text.

use scanatra::prelude::*;

fn main() {
    let code = String::from("...");
    let scanner = Scanner::<Tokens>::new()
        .with_skipping(Tokens::WhiteSpace);
    let token_iter = scanner.iter(code);
    // ...
}

Use .with_skipping to ignore tokens.
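
The iterator can then be consumed like any other Rust iterator. A minimal sketch, assuming the yielded items can be debug-printed:

for token in token_iter {
    println!("{:?}", token);
}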

Parser

Preparation

A grammar needs non-terminals to create the rules.

Create them like this:

#[derive(NoneTerminal, Debug, PartialEq, Eq, Hash, Clone, PartialOrd, Ord)]
enum NoneTerminals {
    SomeNoneTerminal,
    S, A, B,
    // ...
}

To be used in a grammar, terminals and non-terminals have to implement the Terminal and NoneTerminal traits; double_enum! implements Terminal automatically.

As an alternative, a conversion method can be implemented:

  • for NoneTerminals: fn to_sentential<T>(self) -> Sentential<Self, T>
  • for Terminals: fn to_sentential<N>(self) -> Sentential<N, Self>
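
A sketch of such an implementation for the NoneTerminals enum shown above (the Sentential constructor used here is hypothetical; check rautomaton for the actual variants):

impl NoneTerminals {
    fn to_sentential<T>(self) -> Sentential<Self, T> {
        // hypothetical constructor: wrap the non-terminal into a sentential form
        Sentential::NoneTerminal(self)
    }
}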

Grammar

A grammar can be created with Grammar::<NoneTerminals, BareTokens>::new() or with the following macro:

fn grammar() -> Grammar<NoneTerminals, BareTokens> {
    use BareTokens::*;
    use NoneTerminals::*;
    cfg_grammar![
        start: P;
        P => E;
        E => E, Add, T;
        E => T;
        T => Ident, LBrace, E, RBrace;
        T => Ident;
        // ...
    ]
}

AST

For this, an AST datatype is needed, for example:

#[derive(Clone)]
enum AST {
    Var1(data, ...),
    // ...
} 

When generating an AST, the following macro will help. It replaces cfg_grammar! and the previously mentioned grammar function.

gen_ast!(
    (trait: a_custom_trait)?
    types: AST;
    pos_type: ParsePos;
    custom_err: String;
    tokens: Tokens, BareTokens;
    none_terminals: NoneTerminals;
    start: P;
    P => Pi=Var1(v) => {
        Ok(AST::Var1(v))
    },
    Pi in Some(pos) => Pi=Var1(v), Comma, Ident(name) => {
        println!("{} in {}", name, pos);
        Ok(...)
    }
)

Content:

  • trait: can optionally be given to generate the implementations inside a custom trait.
  • types and custom_err: every rule has to return a Result<types, custom_err>. When an Err is returned, parsing fails.
  • pos_type: the position type of the used scanner. The built-in scanner uses ParsePos.
  • tokens: The Terminals defined by double_enum!.
  • none_terminals: the defined NoneTerminals.
  • start: The starting rule.
  • grammar:
    • rule (the element before =>): the rule to derive content from.
    • derived elements:
      • for a non-terminal, rule=pattern: the pattern can be used to deconstruct the AST data.
      • for a terminal, token(args, ...): the terminal token, with optional () to destructure a token carrying data.
    • in pos can be appended behind every element (or behind the complete rule) to get an Option<pos_type> with that element's position. pos is a pattern that can be used for destructuring.

In the {} brackets, code can be added to create an AST element out of the parsed rules and terminals.

Note: this macro will try to create the least complex parse table, in this order: LL(1), SLR, LR(1). If the grammar is not deterministic context-free, a compiler error will be thrown.

For a complete example, take a look at the built-in regex parser or the JSON example.