FLEX(1) FLEX(1)
NAME
flex - fast lexical analyzer generator
SYNOPSIS
flex [-bcdfhilnpstvwBFILTV78+? -C[aefFmr] -ooutput -Pprefix -Sskeleton] [--help --version]
[filename ...]
OVERVIEW
This manual describes flex, a tool for generating programs that perform pattern-matching on text. The
manual includes both tutorial and reference sections:
Description
    a brief overview of the tool
Some Simple Examples
Format Of The Input File
Patterns
    the extended regular expressions used by flex
How The Input Is Matched
    the rules for determining what has been matched
Actions
    how to specify what to do when a pattern is matched
The Generated Scanner
    details regarding the scanner that flex produces;
    how to control the input source
Start Conditions
    introducing context into your scanners, and
    managing "mini-scanners"
Multiple Input Buffers
    how to manipulate multiple input sources; how to
    scan from strings instead of files
End-of-file Rules
    special rules for matching the end of the input
Miscellaneous Macros
    a summary of macros available to the actions
Values Available To The User
    a summary of values available to the actions
Interfacing With Yacc
    connecting flex scanners together with yacc parsers
Options
    flex command-line options, and the "%option"
    directive
Performance Considerations
    how to make your scanner go as fast as possible
Version 2.5 April 1995 1
Generating C++ Scanners
    the (experimental) facility for generating C++
    scanner classes
Incompatibilities With Lex And POSIX
    how flex differs from AT&T lex and the POSIX lex
    standard
Diagnostics
    those error messages produced by flex (or scanners
    it generates) whose meanings might not be apparent
Files
    files used by flex
Deficiencies / Bugs
    known problems with flex
See Also
    other documentation, related tools
Author
    includes contact information
DESCRIPTION
flex is a tool for generating scanners: programs which recognize lexical patterns in text. flex reads the
given input files, or its standard input if no file names are given, for a description of a scanner to generate.
The description is in the form of pairs of regular expressions and C code, called rules. flex generates
as output a C source file, lex.yy.c, which defines a routine yylex(). This file is compiled and linked
with the -lfl library to produce an executable. When the executable is run, it analyzes its input for
occurrences of the regular expressions. Whenever it finds one, it executes the corresponding C code.
SOME SIMPLE EXAMPLES
First, some simple examples to get the flavor of how one uses flex. The following flex input specifies a
scanner which, whenever it encounters the string "username", will replace it with the user’s login name:
%%
username printf( "%s", getlogin() );
By default, any text not matched by a flex scanner is copied to the output, so the net effect of this scan-
ner is to copy its input file to its output with each occurrence of "username" expanded. In this input,
there is just one rule. "username" is the pattern and the "printf" is the action. The "%%" marks the
beginning of the rules.
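The behavior just described (unmatched text copied through, each match replaced by the action's output) can be sketched in plain C. This is a hand-written illustration working on an in-memory string, not the code flex generates; the function name expand_username and its parameters are hypothetical, and the login name is passed in rather than obtained from getlogin().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `in` to `out`, replacing each occurrence of
   "username" with `login`.  Returns the number of bytes written (not
   counting the terminating NUL), or (size_t)-1 if `out` is too small. */
size_t expand_username(const char *in, const char *login,
                       char *out, size_t outsize)
{
    static const char pat[] = "username";
    const size_t patlen = sizeof pat - 1;
    size_t loglen;
    size_t n = 0;

    assert(in != NULL && login != NULL && out != NULL);
    loglen = strlen(login);

    while (*in != '\0') {
        if (strncmp(in, pat, patlen) == 0) {
            /* the single rule matched: run its "action" */
            if (n + loglen >= outsize)
                return (size_t)-1;
            memcpy(out + n, login, loglen);
            n += loglen;
            in += patlen;
        } else {
            /* default behavior: unmatched text is copied unchanged */
            if (n + 1 >= outsize)
                return (size_t)-1;
            out[n++] = *in++;
        }
    }
    if (n >= outsize)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

Calling expand_username() with a sufficiently large buffer yields the input text with every "username" expanded and everything else untouched, which is the net effect attributed to the scanner above.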
Here’s another simple example:
        int num_lines = 0, num_chars = 0;
%%
\n      ++num_lines; ++num_chars;
.       ++num_chars;
%%
int main( void )
{
    yylex();
    printf( "# of lines = %d, # of chars = %d\n",
            num_lines, num_chars );
    return 0;
}
This scanner counts the number of characters and the number of lines in its input (it produces no output
other than the final report on the counts). The first line declares two globals, "num_lines" and
"num_chars", which are accessible both inside yylex() and in the main() routine declared after the sec-
ond "%%". There are two rules, one which matches a newline ("\n") and increments both the line
count and the character count, and one which matches any character other than a newline (indicated by
the "." regular expression).
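For comparison, the two rules can be mirrored by a small hand-written loop. This is only a sketch of the counting logic under the assumption that the input is a NUL-terminated string; the names struct counts and count_text are hypothetical and do not appear in flex or its generated code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type holding the two counters from the example. */
struct counts {
    int num_lines;
    int num_chars;
};

/* Hypothetical helper: apply the two rules to every character of `text`. */
struct counts count_text(const char *text)
{
    struct counts c = { 0, 0 };

    assert(text != NULL);

    for (; *text != '\0'; ++text) {
        if (*text == '\n') {
            /* rule "\n": a newline counts as a line and as a character */
            ++c.num_lines;
            ++c.num_chars;
        } else {
            /* rule ".": any character other than a newline */
            ++c.num_chars;
        }
    }
    return c;
}
```

Because every character is matched by exactly one of the two rules, nothing falls through to the default copy-to-output behavior, which is why the scanner prints only the final report.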
A somewhat more complicated example:
/* scanner for a toy Pascal-like language */
%{
/* need this for the calls to atoi() and atof() below */
#include <stdlib.h>
%}
DIGIT    [0-9]
ID       [a-z][a-z0-9]*
%%
{DIGIT}+    {
            printf( "An integer: %s (%d)\n", yytext,
                    atoi( yytext ) );
            }
{DIGIT}+"."{DIGIT}*        {
            printf( "A float: %s (%g)\n", yytext,
                    atof( yytext ) );
            }
if|then|begin|end|procedure|function        {
            printf( "A keyword: %s\n", yytext );
            }
{ID}        printf( "An identifier: %s\n", yytext );
"+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );
"{"[^}\n]*"}"     /* eat up one-line comments */
[ \t\n]+          /* eat up whitespace */
.           printf( "Unrecognized character: %s\n", yytext );
%%
int main( int argc, char **argv )
{
    ++argv, --argc;  /* skip over program name */
    if ( argc > 0 )
        yyin = fopen( argv[0], "r" );
    else
        yyin = stdin;
    yylex();
    return 0;
}
This is the beginnings of a simple scanner for a language like Pascal. It identifies different types of
tokens and reports on what it has seen.
The details of this example will be explained in the following sections.
FORMAT OF THE INPUT FILE
The flex input file consists of three sections, separated by a line with just %% in it:
definitions
%%
rules
%%
user code
The definitions section contains declarations of simple name definitions to simplify the scanner specifi-
cation, and declarations of start conditions, which are explained in a later section.
Name definitions have the form:
name definition
The "name" is a word beginning with a letter or an underscore (’_’) followed by zero or more letters,
digits, ’_’, or ’-’ (dash). The definition is taken to begin at the first non-white-space character follow-
ing the name and continuing to the end of the line. The definition can subsequently be referred to using
"{name}", which will expand to "(definition)". For example,
DIGIT [0-9]
ID [a-z][a-z0-9]*
defines "DIGIT" to be a regular expression which matches a single digit, and "ID" to be a regular
expression which matches a letter followed by zero-or-more letters-or-digits. A subsequent reference
to
{DIGIT}+"."{DIGIT}*
is identical to
([0-9])+"."([0-9])*
and matches one-or-more digits followed by a ’.’ followed by zero-or-more digits.
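The substitution of "{name}" by "(definition)" can be sketched as a simple text rewrite. The code below is an illustration only, not how flex itself is implemented: the names struct definition and expand_names are hypothetical, and the sketch deliberately ignores quoting and character classes, so a "{" inside quotes or brackets is not treated specially. Brace expressions that do not name a known definition, such as a repeat count, are copied unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: one name definition from the definitions section. */
struct definition {
    const char *name;
    const char *text;
};

/* Hypothetical helper: write `pattern` to `out`, replacing each {name}
   that matches a table entry with "(definition)".  Returns 0 on success,
   -1 if `out` is too small. */
int expand_names(const char *pattern,
                 const struct definition *defs, size_t ndefs,
                 char *out, size_t outsize)
{
    size_t n = 0;

    assert(pattern != NULL && defs != NULL && out != NULL);

    while (*pattern != '\0') {
        const char *close = NULL;
        const char *text = NULL;

        if (*pattern == '{')
            close = strchr(pattern, '}');
        if (close != NULL) {
            size_t len = (size_t)(close - pattern) - 1;  /* name length */
            for (size_t i = 0; i < ndefs; ++i)
                if (strlen(defs[i].name) == len &&
                    strncmp(defs[i].name, pattern + 1, len) == 0)
                    text = defs[i].text;
        }
        if (text != NULL) {
            size_t tlen = strlen(text);
            if (n + tlen + 2 >= outsize)
                return -1;
            out[n++] = '(';              /* parentheses keep the definition */
            memcpy(out + n, text, tlen); /* together as a single unit       */
            n += tlen;
            out[n++] = ')';
            pattern = close + 1;
        } else {
            if (n + 1 >= outsize)
                return -1;
            out[n++] = *pattern++;
        }
    }
    if (n >= outsize)
        return -1;
    out[n] = '\0';
    return 0;
}
```

The parentheses matter once an operator follows the reference: with the ID definition above, "{ID}+" becomes "([a-z][a-z0-9]*)+", so the "+" applies to the whole definition rather than only to its final "[a-z0-9]*".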
The rules section of the flex input contains a series of rules of the form:
pattern action
where the pattern must be unindented and the action must begin on the same line.
See below for a further description of patterns and actions.
Finally, the user code section is simply copied to lex.yy.c verbatim. It is used for companion routines
which call or are called by the scanner. The presence of this section is optional; if it is missing, the sec-
ond %% in the input file may be skipped, too.
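The three-section layout can be illustrated by a short routine that locates the "%%" separator lines in an input held in memory. This is a rough sketch, assuming a separator is a line consisting of exactly "%%"; the name find_sections and its parameters are hypothetical, and real flex input handling is more involved than this.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: scan `src` line by line for "%%" separators.
   On return, *rules_start is the offset just past the first separator and
   *user_start the offset just past the second (each is set only if the
   corresponding separator exists).  Returns the number of separators
   found: 0, 1, or 2. */
int find_sections(const char *src, size_t *rules_start, size_t *user_start)
{
    size_t pos = 0;
    int found = 0;

    assert(src != NULL && rules_start != NULL && user_start != NULL);

    while (src[pos] != '\0' && found < 2) {
        size_t len = strcspn(src + pos, "\n");   /* length of this line */
        size_t next = pos + len + (src[pos + len] == '\n' ? 1 : 0);

        if (len == 2 && src[pos] == '%' && src[pos + 1] == '%') {
            if (found == 0)
                *rules_start = next;
            else
                *user_start = next;
            ++found;
        }
        pos = next;
    }
    return found;
}
```

A return value of 1 corresponds to the case described above, where the user code section and the second %% are both omitted.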
In the definitions and rules sections, any indented text or text enclosed in %{ and %} is copied verba-
tim to the output (with the %{}’s removed). The %{}’s must appear unindented on lines by them-
selves.
In the rules section, any indented or %{} text appearing before the first rule may be used to declare
variables which are local to the scanning routine and (after the declarations) code which is to be
executed whenever the scanning routine is entered. Other indented or %{} text in the rule section is
still copied to the output, but its meaning is not well-defined and it may well cause compile-time errors
(this feature is present for POSIX compliance; see below for other such features).
In the definitions section (but not in the rules section), an unindented comment (i.e., a line beginning
with "/*") is also copied verbatim to the output up to the next "*/".
PATTERNS
The patterns in the input are written using an extended set of regular expressions. These are:
x           match the character ’x’
.           any character (byte) except newline
[xyz]       a "character class"; in this case, the pattern
            matches either an ’x’, a ’y’, or a ’z’
[abj-oZ]    a "character class" with a range in it; matches
            an ’a’, a ’b’, any letter from ’j’ through ’o’,
            or a ’Z’
[^A-Z]      a "negated character class", i.e., any character
            but those in the class. In this case, any
            character EXCEPT an uppercase letter.
[^A-Z\n]    any character EXCEPT an uppercase letter or
            a newline
r*          zero or more r’s, where r is any regular expression
r+          one or more r’s
r?          zero or one r’s (that is, "an optional r")
r{2,5}      anywhere from two to five r’s
r{2,}       two or more r’s
r{4}        exactly 4 r’s
{name}      the expansion of the "name" definition
            (see above)
"[xyz]\"foo"
            the literal string: [xyz]"foo
\X          if X is an ’a’, ’b’, ’f’, ’n’, ’r’, ’t’, or ’v’,
            then the ANSI-C interpretation of \X.
            Otherwise, a literal ’X’ (used to escape
            operators such as ’*’)
\0          a NUL character (ASCII code 0)
\123        the character with octal value 123
\x2a        the character with hexadecimal value 2a
(r)         match an r; parentheses are used to override
            precedence (see below)
rs          the regular expression r followed by the
            regular expression s; called "concatenation"
r|s         either an r or an s
r/s         an r but only if it is followed by an s. The
            text matched by s is included when determining
            whether this rule is the "longest match",
            but is then returned to the input before
            the action is executed. So the action only
            sees the text matched by r. This type
            of pattern is called "trailing context".
            (There are some combinations of r/s that flex
            cannot match correctly; see notes in the
            Deficiencies / Bugs section below regarding
            "dangerous trailing context".)
^r          an r, but only at the beginning of a line (i.e.,
            when just starting to scan, or right after a
            newline has been scanned).