re2c | |
---|---|
Original author(s) | Peter Bumbulis |
Initial release | around 1994[1] |
Stable release | 4.0 [2]
/ 19 November 2024 |
Repository | github |
Type | Lexical analyzer generator |
License | Public domain |
Website | re2c |
re2c is a free and open-source lexer generator for C, C++, D, Go, Haskell, Java, JavaScript, OCaml, Python, Rust, V and Zig. It compiles declarative regular expression specifications to deterministic finite automata. Originally written by Peter Bumbulis and described in his paper,[1] re2c was put in public domain and has been since maintained by volunteers.[3] It is the lexer generator adopted by projects such as PHP,[4] SpamAssassin,[5] Ninja build system[6] and others. Together with the Lemon parser generator, re2c is used in BRL-CAD.[7] This combination is also used with STEPcode, an implementation of ISO 10303 standard.[8]
The main goal of re2c is generating fast lexers:[1] at least as fast as reasonably optimized C lexers coded by hand. Instead of using traditional table-driven approach, re2c encodes the generated finite-state machine directly in the form of conditional jumps and comparisons. The resulting program is faster than its table-driven counterpart[1] and much easier to debug and understand. Moreover, this approach often results in smaller lexers,[1] as re2c applies a number of optimizations such as DFA minimization and the construction of tunnel automaton.[9] Another distinctive feature of re2c is its flexible interface: instead of assuming a fixed program template, re2c lets the programmer write most of the interface code and adapt the generated lexer to any particular environment. The main idea is that re2c should be a zero-cost abstraction for the programmer: using it should never result in a slower program than the corresponding hand-coded implementation.
re2c program can contain any number of /*!re2c ... */
blocks.
Each block consists of a sequence of rules, definitions and configurations
(they can be intermixed, but it is generally better to put configurations first, then definitions and then rules).
Rules have the form REGEXP { CODE }
or REGEXP := CODE;
where REGEXP
is a regular expression and CODE
is a block of C code. When REGEXP
matches the input string, control flow is transferred to the associated CODE
. There is one special rule: the default rule with *
instead of REGEXP
; it is triggered if no other rules matches. re2c has greedy matching semantics: if multiple rules match, the rule that matches longer prefix is preferred; if the conflicting rules match the same prefix, the earlier rule has priority.
Definitions have the form NAME = REGEXP;
(and also NAME { REGEXP }
in Flex compatibility mode).
Configurations have the form re2c:CONFIG = VALUE;
where CONFIG
is the name of the particular configuration and VALUE
is a number or a string.
For more advanced usage see the official re2c manual.[22]
re2c uses the following syntax for regular expressions:
"foo"
case-sensitive string literal'foo'
case-insensitive string literal[a-xyz]
, [^a-xyz]
character class (possibly negated).
any character except newlineR \ S
difference of character classesR*
zero or more occurrences of R
R+
one or more occurrences of R
R?
zero or one occurrence of R
R{n}
repetition of R
exactly n
timesR{n,}
repetition of R
at least n
timesR{n,m}
repetition of R
from n
to m
times(R)
just R
; parentheses are used to override precedence or for POSIX-style submatchR S
concatenation: R
followed by S
R | S
alternative: R
or S
R / S
lookahead: R
followed by S
, but S
is not consumedname
the regular expression defined as name
(except in Flex compatibility mode)@stag
an s-tag: saves the last input position at which @stag
matches in a variable named stag
#mtag
an m-tag: saves all input positions at which #mtag
matches in a variable named mtag
Character classes and string literals may contain the following escape sequences: \a
, \b
, \f
, \n
, \r
, \t
, \v
, \\
, octal escapes \ooo
and hexadecimal escapes \xhh
, \uhhhh
and \Uhhhhhhhh
.
Here is a very simple program in re2c (example.re).
It checks that all input arguments are hexadecimal numbers.
The code for re2c is enclosed in comments /*!re2c ... */
, all the rest is plain C code.
See the official re2c website for more complex examples.[23]
#include <stdio.h>
static int lex(const char *YYCURSOR)
{
const char *YYMARKER;
/*!re2c
re2c:define:YYCTYPE = char;
re2c:yyfill:enable = 0;
end = "\x00";
hex = "0x" [0-9a-fA-F]+;
* { printf("err\n"); return 1; }
hex end { printf("hex\n"); return 0; }
*/
}
int main(int argc, char **argv)
{
for (int i = 1; i < argc; ++i) {
lex(argv[i]);
}
return 0;
}
Given that, re2c -is -o example.c example.re
generates the code below (example.c).
The contents of the comment /*!re2c ... */
are substituted with a deterministic finite automaton
encoded in the form of conditional jumps and comparisons; the rest of the program is copied verbatim into the output file.
There are several code generation options; normally re2c uses switch
statements,
but it can use nested if
statements (as in this example with -s
option),
or generate bitmaps and jump tables.
Which option is better depends on the C compiler;
re2c users are encouraged to experiment.
/* Generated by re2c 1.2.1 on Fri Aug 23 21:59:00 2019 */
#include <stdio.h>
static int lex(const char *YYCURSOR)
{
const char *YYMARKER;
{
char yych;
yych = *YYCURSOR;
if (yych == '0') goto yy4;
++YYCURSOR;
yy3:
{ printf("err\n"); return 1; }
yy4:
yych = *(YYMARKER = ++YYCURSOR);
if (yych != 'x') goto yy3;
yych = *++YYCURSOR;
if (yych >= 0x01) goto yy8;
yy6:
YYCURSOR = YYMARKER;
goto yy3;
yy7:
yych = *++YYCURSOR;
yy8:
if (yych <= '@') {
if (yych <= 0x00) goto yy9;
if (yych <= '/') goto yy6;
if (yych <= '9') goto yy7;
goto yy6;
} else {
if (yych <= 'F') goto yy7;
if (yych <= '`') goto yy6;
if (yych <= 'f') goto yy7;
goto yy6;
}
yy9:
++YYCURSOR;
{ printf("hex\n"); return 0; }
}
}
int main(int argc, char **argv)
{
for (int i = 1; i < argc; ++i) {
lex(argv[i]);
}
return 0;
}