Skip to content

tstih/lingo

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

29 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

status.badge language.badge standard.badge license.badge

Lingo

Lingo is a C++20 single-header library for building recursive descent parsers. Following a simplicity before performance philosophy, Lingo is intended for parsing small to medium-sized source files or structured text.

Lingo is currently under development. Expect strange things.

Introduction

With Lingo, you can express grammars in a syntax resembling EBNF (Extended Backus–Naur Form) directly in C++ code using operator overloading. There are no external dependencies — just drop include/lingo.h into your project.

Defining a grammar

Here's an example grammar for parsing a list of numbers separated by commas:

<comma>    = ","
<digit>    = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<number>   = <digit>+
<num_list> = <number> (<comma> <number>)*

In Lingo:

lingo::rule comma(',');
lingo::rule digit('0', '9');
lingo::rule number   = lingo::repeat(digit, 1);       // at least one digit
lingo::rule num_list = number + lingo::repeat(comma + number, 0);

Grammars

You build grammars using lingo::rule and lingo::placeholder. Placeholders allow recursive grammars by acting as forward declarations.

Rule constructors

Form Matches
rule('a') single character a
rule('a', 'z') any character in range az
rule({'x','y','z'}) any character from the list
rule("hello") the literal string hello

Operators

Operator Meaning
a + b sequence: a followed by b
a | b alternation: a or b
!a negation lookahead: succeeds if a does not match (does not consume input)
repeat(r, min) repeat r at least min times
repeat(r, min, max) repeat r between min and max times (0 = unlimited)

Example: recursive expression grammar

<expression> = ["+"|"-"] <term> {("+"|"-") <term>}
<term>       = <factor> {("*"|"/") <factor>}
<factor>     = <number> | "(" <expression> ")"

In Lingo:

lingo::rule lparen('('), rparen(')');
lingo::rule sign({'+', '-'});
lingo::rule plus('+'), minus('-'), mul('*'), div('/');
lingo::rule digit('0', '9');
lingo::rule number = lingo::repeat(digit, 1);

lingo::placeholder expression_placeholder;
lingo::rule factor = number | (lparen + expression_placeholder + rparen);
lingo::rule term   = factor + lingo::repeat((mul | div) + factor, 0);
lingo::rule expression =
    lingo::repeat(sign, 0, 1) +
    term +
    lingo::repeat((plus | minus) + term, 0);

expression_placeholder.set(expression);

Named rules

To produce meaningful error messages, assign names to rules using the named constructor variants:

lingo::rule anon_digit('0', '9');               // anonymous
lingo::rule named_digit("digit", '0', '9');     // named
lingo::rule named_number("number", lingo::repeat(digit, 1));

Only some rules need names — diagnostics reference the nearest named ancestor.

Parsing

Source input

Lingo includes a lightweight source class that owns a string and tracks position, line, and column.

// From a string
lingo::source src("1 + 2 * 3");

// From a file
lingo::source src = lingo::source::from_file("data/example.txt");

Parsing is done by constructing a lingo::grammar object from a rule, then calling one of its methods against a lingo::source.

Syntax check only

lingo::source src("1 + 2 * 3");
lingo::grammar g(expression);
if (!g.check(src))
    std::cerr << "parse failed\n";

Check with error message

lingo::source src("1 + * 3");
std::string error;
if (!g.check(src, error))
    std::cerr << error << '\n';
// unexpected '*', expected number at line 1 column 5

Error messages include the unexpected character, the nearest named rule that was expected, and the line and column.

Building a parse tree

lingo::parse_node tree;
std::string error;
if (g.parse(src, tree, error)) {
    // tree contains named nodes with source spans
}

parse_node stores:

  • name — the rule name (only named rules appear in the tree)
  • start / end — byte offsets into the source
  • children — child named nodes

Helper methods on parse_node:

node.text(src);               // matched source text
node.find("name");            // first direct child by name
node.all("name");             // all direct children by name
node.find_deep("name");       // first descendant DFS
node.collect("name");         // all descendants DFS

Evaluation

Lingo does not provide built-in evaluation. You call grammar::parse() to get a named parse tree, then walk it yourself. The samples/ directory contains two complete working examples that illustrate the two main approaches.

Expression evaluator (samples/eval.cpp)

A recursive descent evaluator for arithmetic expressions with operator precedence, unary minus, and parentheses. It demonstrates the direct walk pattern: the parse tree is walked in a single recursive function without building any intermediate representation.

The grammar produces named nodes number, sign, addop, mulop, unary, factor, term, and expression. Because the grammar is left-recursive by construction, children of expression and term already alternate value/operator/value:

expression
  ├─ term          ← left operand
  ├─ addop ("+")
  └─ term          ← right operand

The entire evaluator is a single eval() function that pattern-matches on the node name:

double eval(const parse_node &n, const source &src)
{
    if (n.name == "number") return std::stod(n.text(src));
    if (n.name == "factor") return eval(n.children[0], src);
    if (n.name == "unary")  return sign * eval(n.children[1], src);
    // expression / term: fold [v, op, v, op, ...]
    double result = eval(n.children[0], src);
    for (size_t i = 1; i + 1 < n.children.size(); i += 2) { ... }
    return result;
}

This approach works well when the parse tree structure maps cleanly onto the computation — no intermediate AST is needed.

BASIC interpreter (samples/basic.cpp)

A tiny BASIC interpreter supporting LET, PRINT, IF/THEN, GOTO, and END. It demonstrates the two-phase pattern: first build an intermediate AST from the parse tree, then execute the AST.

An intermediate AST is necessary here because:

  • Variables need to be resolved at runtime, not at parse time.
  • GOTO needs a map from line number to program index, which requires all lines to be collected first.

The two phases are:

  1. Parse → AST: build_program_ast() walks the parse tree using collect("line"), then for each line uses find("statement") / find("identifier") / all("expression") etc. to build line_ast and statement_ast structs.

  2. Execute: A simple interpreter loop with a program counter (pc), a variable map (vars), and a line-number index (index_by_line) for GOTO resolution.

Use the two-phase approach when you need to look ahead, resolve references across the tree, or separate parsing concerns cleanly from execution.

Project structure

lingo
├── README.md              ← This file
├── LICENSE                ← MIT license
├── CMakeLists.txt         ← Build configuration
├── include/
│   └── lingo.h            ← Single header (drop this into your project)
├── samples/
│   ├── eval.cpp           ← Arithmetic expression parser/evaluator
│   ├── eval-main.cpp      ← CLI entry point
│   ├── basic.cpp          ← Tiny BASIC interpreter
│   └── basic-main.cpp     ← CLI entry point
├── tests/
│   ├── parse-tests.cpp    ← Core parse tests
│   ├── basic-tests.cpp    ← BASIC grammar tests
│   └── data/              ← Input files for tests
└── build/                 ← CMake build output

Architecture

Lingo is organized in four layers, each with a single responsibility:

┌─────────────────────────────────────────────────────────┐
│  rule / placeholder          (grammar layer)            │
│  User-facing combinators. Build a shared_ptr<node>      │
│  tree via operator+, operator|, operator!, repeat().    │
│  rule describes structure — it never parses anything.   │
├─────────────────────────────────────────────────────────┤
│  node hierarchy              (grammar representation)   │
│  literal, or_node, and_node, not_node, repeat_node,     │
│  placeholder_node. Each node implements accept() for    │
│  the visitor pattern. Nodes are ref-counted and shared  │
│  across rules to avoid copying.                         │
├─────────────────────────────────────────────────────────┤
│  detail::parser              (parsing engine)           │
│  Implements node_visitor. Walks the grammar node tree   │
│  against a source, tracks the furthest failure for      │
│  error messages, and — when a root parse_node is        │
│  supplied — incrementally builds the named parse tree   │
│  with checkpoint/rollback on backtrack.                 │
├─────────────────────────────────────────────────────────┤
│  grammar                     (user-facing facade)       │
│  Owns the root node (shared_ptr). Exposes check() and   │
│  parse() which construct a detail::parser internally    │
│  and run it. Users never touch detail::parser directly. │
└─────────────────────────────────────────────────────────┘

Key design decisions:

  • rule is immutable after construction — combinators return new rules, never modify existing ones.
  • Only named nodes (rule("name", ...)) appear in the parse tree. Unnamed nodes are transparent glue — they match input but produce no tree output.
  • detail::parser is stateless between calls; grammar::check() / grammar::parse() construct one per invocation.
  • placeholder breaks recursive grammar cycles: it holds a shared_ptr<node> that is filled in after the recursive rule is defined.

Building

No dependencies — GoogleTest is fetched automatically by CMake.

cmake -B build
cmake --build build

Run tests:

./bin/test-lingo

Build options

Option Default Description
BUILD_TESTS ON Build the test executable
BUILD_SAMPLES ON Build the sample executables

Running samples

./bin/sample-evaluator "1 + 2 * (3 + 4)"
./bin/sample-basic path/to/program.bas

About

Lingo is a C++20 framework for creating simple parsers.

Topics

Resources

License

Stars

Watchers

Forks

Contributors