| This is Info file flex.info, produced by Makeinfo-1.55 from the input
 | ||
| file flex.texi.
 | ||
| 
 | ||
| START-INFO-DIR-ENTRY
 | ||
| * Flex: (flex).         A fast scanner generator.
 | ||
| END-INFO-DIR-ENTRY
 | ||
| 
 | ||
|    This file documents Flex.
 | ||
| 
 | ||
|    Copyright (c) 1990 The Regents of the University of California.  All
 | ||
| rights reserved.
 | ||
| 
 | ||
|    This code is derived from software contributed to Berkeley by Vern
 | ||
| Paxson.
 | ||
| 
 | ||
|    The United States Government has rights in this work pursuant to
 | ||
| contract no. DE-AC03-76SF00098 between the United States Department of
 | ||
| Energy and the University of California.
 | ||
| 
 | ||
|    Redistribution and use in source and binary forms with or without
 | ||
| modification are permitted provided that: (1) source distributions
 | ||
| retain this entire copyright notice and comment, and (2) distributions
 | ||
| including binaries display the following acknowledgement:  "This
 | ||
| product includes software developed by the University of California,
 | ||
| Berkeley and its contributors" in the documentation or other materials
 | ||
| provided with the distribution and in all advertising materials
 | ||
| mentioning features or use of this software.  Neither the name of the
 | ||
| University nor the names of its contributors may be used to endorse or
 | ||
| promote products derived from this software without specific prior
 | ||
| written permission.
 | ||
| 
 | ||
|    THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
 | ||
| WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
 | ||
| MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Top,  Next: Name,  Prev: (dir),  Up: (dir)
 | ||
| 
 | ||
| flex
 | ||
| ****
 | ||
| 
 | ||
|    This manual documents `flex'.  It covers release 2.5.
 | ||
| 
 | ||
| * Menu:
 | ||
| 
 | ||
| * Name::                        Name
 | ||
| * Synopsis::                    Synopsis
 | ||
| * Overview::                    Overview
 | ||
| * Description::                 Description
 | ||
| * Examples::                    Some simple examples
 | ||
| * Format::                      Format of the input file
 | ||
| * Patterns::                    Patterns
 | ||
| * Matching::                    How the input is matched
 | ||
| * Actions::                     Actions
 | ||
| * Generated scanner::           The generated scanner
 | ||
| * Start conditions::            Start conditions
 | ||
| * Multiple buffers::            Multiple input buffers
 | ||
| * End-of-file rules::           End-of-file rules
 | ||
| * Miscellaneous::               Miscellaneous macros
 | ||
| * User variables::              Values available to the user
 | ||
| * YACC interface::              Interfacing with `yacc'
 | ||
| * Options::                     Options
 | ||
| * Performance::                 Performance considerations
 | ||
| * C++::                         Generating C++ scanners
 | ||
| * Incompatibilities::           Incompatibilities with `lex' and POSIX
 | ||
| * Diagnostics::                 Diagnostics
 | ||
| * Files::                       Files
 | ||
| * Deficiencies::                Deficiencies / Bugs
 | ||
| * See also::                    See also
 | ||
| * Author::                      Author
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Name,  Next: Synopsis,  Prev: Top,  Up: Top
 | ||
| 
 | ||
| Name
 | ||
| ====
 | ||
| 
 | ||
|    flex - fast lexical analyzer generator
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Synopsis,  Next: Overview,  Prev: Name,  Up: Top
 | ||
| 
 | ||
| Synopsis
 | ||
| ========
 | ||
| 
 | ||
|      flex [-bcdfhilnpstvwBFILTV78+? -C[aefFmr] -ooutput -Pprefix -Sskeleton]
 | ||
|      [--help --version] [FILENAME ...]
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Overview,  Next: Description,  Prev: Synopsis,  Up: Top
 | ||
| 
 | ||
| Overview
 | ||
| ========
 | ||
| 
 | ||
|    This manual describes `flex', a tool for generating programs that
 | ||
| perform pattern-matching on text.  The manual includes both tutorial
 | ||
| and reference sections:
 | ||
| 
 | ||
| Description
 | ||
|      a brief overview of the tool
 | ||
| 
 | ||
| Some Simple Examples
 | ||
| Format Of The Input File
 | ||
| Patterns
 | ||
|      the extended regular expressions used by flex
 | ||
| 
 | ||
| How The Input Is Matched
 | ||
|      the rules for determining what has been matched
 | ||
| 
 | ||
| Actions
 | ||
|      how to specify what to do when a pattern is matched
 | ||
| 
 | ||
| The Generated Scanner
 | ||
|      details regarding the scanner that flex produces; how to control
 | ||
|      the input source
 | ||
| 
 | ||
| Start Conditions
 | ||
|      introducing context into your scanners, and managing
 | ||
|      "mini-scanners"
 | ||
| 
 | ||
| Multiple Input Buffers
 | ||
|      how to manipulate multiple input sources; how to scan from strings
 | ||
|      instead of files
 | ||
| 
 | ||
| End-of-file Rules
 | ||
|      special rules for matching the end of the input
 | ||
| 
 | ||
| Miscellaneous Macros
 | ||
|      a summary of macros available to the actions
 | ||
| 
 | ||
| Values Available To The User
 | ||
|      a summary of values available to the actions
 | ||
| 
 | ||
| Interfacing With Yacc
 | ||
|      connecting flex scanners together with yacc parsers
 | ||
| 
 | ||
| Options
 | ||
|      flex command-line options, and the "%option" directive
 | ||
| 
 | ||
| Performance Considerations
 | ||
|      how to make your scanner go as fast as possible
 | ||
| 
 | ||
| Generating C++ Scanners
 | ||
|      the (experimental) facility for generating C++ scanner classes
 | ||
| 
 | ||
| Incompatibilities With Lex And POSIX
 | ||
|      how flex differs from AT&T lex and the POSIX lex standard
 | ||
| 
 | ||
| Diagnostics
 | ||
|      those error messages produced by flex (or scanners it generates)
 | ||
|      whose meanings might not be apparent
 | ||
| 
 | ||
| Files
 | ||
|      files used by flex
 | ||
| 
 | ||
| Deficiencies / Bugs
 | ||
|      known problems with flex
 | ||
| 
 | ||
| See Also
 | ||
|      other documentation, related tools
 | ||
| 
 | ||
| Author
 | ||
|      includes contact information
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Description,  Next: Examples,  Prev: Overview,  Up: Top
 | ||
| 
 | ||
| Description
 | ||
| ===========
 | ||
| 
 | ||
|    `flex' is a tool for generating "scanners": programs which
 | ||
| recognize lexical patterns in text.  `flex' reads the given input
 | ||
| files, or its standard input if no file names are given, for a
 | ||
| description of a scanner to generate.  The description is in the form
 | ||
| of pairs of regular expressions and C code, called "rules". `flex'
 | ||
| generates as output a C source file, `lex.yy.c', which defines a
 | ||
| routine `yylex()'.  This file is compiled and linked with the `-lfl'
 | ||
| library to produce an executable.  When the executable is run, it
 | ||
| analyzes its input for occurrences of the regular expressions.
 | ||
| Whenever it finds one, it executes the corresponding C code.
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Examples,  Next: Format,  Prev: Description,  Up: Top
 | ||
| 
 | ||
| Some simple examples
 | ||
| ====================
 | ||
| 
 | ||
|    First some simple examples to get the flavor of how one uses `flex'.
 | ||
| The following `flex' input specifies a scanner which whenever it
 | ||
| encounters the string "username" will replace it with the user's login
 | ||
| name:
 | ||
| 
 | ||
|      %%
 | ||
|      username    printf( "%s", getlogin() );
 | ||
| 
 | ||
|    By default, any text not matched by a `flex' scanner is copied to
 | ||
| the output, so the net effect of this scanner is to copy its input file
 | ||
| to its output with each occurrence of "username" expanded.  In this
 | ||
| input, there is just one rule.  "username" is the PATTERN and the
 | ||
| "printf" is the ACTION.  The "%%" marks the beginning of the rules.
 | ||
| 
 | ||
|    Here's another simple example:
 | ||
| 
 | ||
|              int num_lines = 0, num_chars = 0;
 | ||
|      
 | ||
|      %%
 | ||
|      \n      ++num_lines; ++num_chars;
 | ||
|      .       ++num_chars;
 | ||
|      
 | ||
|      %%
 | ||
|      int main( void )
 | ||
|              {
 | ||
|              yylex();
 | ||
|              printf( "# of lines = %d, # of chars = %d\n",
 | ||
|                      num_lines, num_chars );
 | ||
|              }
 | ||
| 
 | ||
|    This scanner counts the number of characters and the number of lines
 | ||
| in its input (it produces no output other than the final report on the
 | ||
| counts).  The first line declares two globals, "num_lines" and
 | ||
| "num_chars", which are accessible both inside `yylex()' and in the
 | ||
| `main()' routine declared after the second "%%".  There are two rules,
 | ||
| one which matches a newline ("\n") and increments both the line count
 | ||
| and the character count, and one which matches any character other than
 | ||
| a newline (indicated by the "." regular expression).
 | ||
| 
 | ||
|    A somewhat more complicated example:
 | ||
| 
 | ||
|      /* scanner for a toy Pascal-like language */
 | ||
|      
 | ||
|      %{
 | ||
|      /* need this for the calls to atoi() and atof() below */
 | ||
|      #include <stdlib.h>
 | ||
|      %}
 | ||
|      
 | ||
|      DIGIT    [0-9]
 | ||
|      ID       [a-z][a-z0-9]*
 | ||
|      
 | ||
|      %%
 | ||
|      
 | ||
|      {DIGIT}+    {
 | ||
|                  printf( "An integer: %s (%d)\n", yytext,
 | ||
|                          atoi( yytext ) );
 | ||
|                  }
 | ||
|      
 | ||
|      {DIGIT}+"."{DIGIT}*        {
 | ||
|                  printf( "A float: %s (%g)\n", yytext,
 | ||
|                          atof( yytext ) );
 | ||
|                  }
 | ||
|      
 | ||
|      if|then|begin|end|procedure|function        {
 | ||
|                  printf( "A keyword: %s\n", yytext );
 | ||
|                  }
 | ||
|      
 | ||
|      {ID}        printf( "An identifier: %s\n", yytext );
 | ||
|      
 | ||
|      "+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );
 | ||
|      
 | ||
|      "{"[^}\n]*"}"     /* eat up one-line comments */
 | ||
|      
 | ||
|      [ \t\n]+          /* eat up whitespace */
 | ||
|      
 | ||
|      .           printf( "Unrecognized character: %s\n", yytext );
 | ||
|      
 | ||
|      %%
 | ||
|      
 | ||
|      int main( int argc, char **argv )
 | ||
|          {
 | ||
|          ++argv, --argc;  /* skip over program name */
 | ||
|          if ( argc > 0 )
 | ||
|                  yyin = fopen( argv[0], "r" );
 | ||
|          else
 | ||
|                  yyin = stdin;
 | ||
|      
 | ||
|          yylex();
 | ||
|          }
 | ||
| 
 | ||
|    This is the beginning of a simple scanner for a language like
 | ||
| Pascal.  It identifies different types of TOKENS and reports on what it
 | ||
| has seen.
 | ||
| 
 | ||
|    The details of this example will be explained in the following
 | ||
| sections.
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Format,  Next: Patterns,  Prev: Examples,  Up: Top
 | ||
| 
 | ||
| Format of the input file
 | ||
| ========================
 | ||
| 
 | ||
|    The `flex' input file consists of three sections, separated by a
 | ||
| line with just `%%' in it:
 | ||
| 
 | ||
|      definitions
 | ||
|      %%
 | ||
|      rules
 | ||
|      %%
 | ||
|      user code
 | ||
| 
 | ||
|    The "definitions" section contains declarations of simple "name"
 | ||
| definitions to simplify the scanner specification, and declarations of
 | ||
| "start conditions", which are explained in a later section.  Name
 | ||
| definitions have the form:
 | ||
| 
 | ||
|      name definition
 | ||
| 
 | ||
|    The "name" is a word beginning with a letter or an underscore ('_')
 | ||
| followed by zero or more letters, digits, '_', or '-' (dash).  The
 | ||
| definition is taken to begin at the first non-white-space character
 | ||
| following the name and continuing to the end of the line.  The
 | ||
| definition can subsequently be referred to using "{name}", which will
 | ||
| expand to "(definition)".  For example,
 | ||
| 
 | ||
|      DIGIT    [0-9]
 | ||
|      ID       [a-z][a-z0-9]*
 | ||
| 
 | ||
| defines "DIGIT" to be a regular expression which matches a single
 | ||
| digit, and "ID" to be a regular expression which matches a letter
 | ||
| followed by zero-or-more letters-or-digits.  A subsequent reference to
 | ||
| 
 | ||
|      {DIGIT}+"."{DIGIT}*
 | ||
| 
 | ||
| is identical to
 | ||
| 
 | ||
|      ([0-9])+"."([0-9])*
 | ||
| 
 | ||
| and matches one-or-more digits followed by a '.' followed by
 | ||
| zero-or-more digits.
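   To make the expansion rule concrete, the following C sketch models it as plain text substitution.  It is an illustration only and says nothing about how `flex' is implemented; the function name `expand_name' and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of flex): copy pattern to out, replacing
 * each "{name}" with "(definition)".  Returns 0 on success, or -1 if the
 * result does not fit in outsz bytes.
 */
int expand_name(const char *pattern, const char *name,
                const char *definition, char *out, size_t outsz)
{
    size_t nlen = strlen(name);
    size_t used = 0;

    assert(out != NULL && outsz > 0);
    out[0] = '\0';
    while (*pattern != '\0') {
        if (pattern[0] == '{' && strncmp(pattern + 1, name, nlen) == 0
            && pattern[1 + nlen] == '}') {
            /* A reference: emit the definition wrapped in parentheses. */
            int n = snprintf(out + used, outsz - used, "(%s)", definition);
            if (n < 0 || (size_t)n >= outsz - used)
                return -1;
            used += (size_t)n;
            pattern += nlen + 2;    /* skip '{', the name, and '}' */
        } else {
            /* Any other character is copied through unchanged. */
            if (used + 1 >= outsz)
                return -1;
            out[used++] = *pattern++;
            out[used] = '\0';
        }
    }
    return 0;
}
```

   The added parentheses are what keep a following operator such as '+' applied to the whole definition rather than to its last element only.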
 | ||
| 
 | ||
|    The RULES section of the `flex' input contains a series of rules of
 | ||
| the form:
 | ||
| 
 | ||
|      pattern   action
 | ||
| 
 | ||
| where the pattern must be unindented and the action must begin on the
 | ||
| same line.
 | ||
| 
 | ||
|    See below for a further description of patterns and actions.
 | ||
| 
 | ||
|    Finally, the user code section is simply copied to `lex.yy.c'
 | ||
| verbatim.  It is used for companion routines which call or are called
 | ||
| by the scanner.  The presence of this section is optional; if it is
 | ||
| missing, the second `%%' in the input file may be skipped, too.
 | ||
| 
 | ||
|    In the definitions and rules sections, any *indented* text or text
 | ||
| enclosed in `%{' and `%}' is copied verbatim to the output (with the
 | ||
| `%{}''s removed).  The `%{}''s must appear unindented on lines by
 | ||
| themselves.
 | ||
| 
 | ||
|    In the rules section, any indented or %{} text appearing before the
 | ||
| first rule may be used to declare variables which are local to the
 | ||
| scanning routine and (after the declarations) code which is to be
 | ||
| executed whenever the scanning routine is entered.  Other indented or
 | ||
| %{} text in the rule section is still copied to the output, but its
 | ||
| meaning is not well-defined and it may well cause compile-time errors
 | ||
| (this feature is present for `POSIX' compliance; see below for other
 | ||
| such features).
 | ||
| 
 | ||
|    In the definitions section (but not in the rules section), an
 | ||
| unindented comment (i.e., a line beginning with "/*") is also copied
 | ||
| verbatim to the output up to the next "*/".
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Patterns,  Next: Matching,  Prev: Format,  Up: Top
 | ||
| 
 | ||
| Patterns
 | ||
| ========
 | ||
| 
 | ||
|    The patterns in the input are written using an extended set of
 | ||
| regular expressions.  These are:
 | ||
| 
 | ||
| `x'
 | ||
|      match the character `x'
 | ||
| 
 | ||
| `.'
 | ||
|      any character (byte) except newline
 | ||
| 
 | ||
| `[xyz]'
 | ||
|      a "character class"; in this case, the pattern matches either an
 | ||
|      `x', a `y', or a `z'
 | ||
| 
 | ||
| `[abj-oZ]'
 | ||
|      a "character class" with a range in it; matches an `a', a `b', any
 | ||
|      letter from `j' through `o', or a `Z'
 | ||
| 
 | ||
| `[^A-Z]'
 | ||
|      a "negated character class", i.e., any character but those in the
 | ||
|      class.  In this case, any character EXCEPT an uppercase letter.
 | ||
| 
 | ||
| `[^A-Z\n]'
 | ||
|      any character EXCEPT an uppercase letter or a newline
 | ||
| 
 | ||
| `R*'
 | ||
|      zero or more R's, where R is any regular expression
 | ||
| 
 | ||
| `R+'
 | ||
|      one or more R's
 | ||
| 
 | ||
| `R?'
 | ||
|      zero or one R's (that is, "an optional R")
 | ||
| 
 | ||
| `R{2,5}'
 | ||
|      anywhere from two to five R's
 | ||
| 
 | ||
| `R{2,}'
 | ||
|      two or more R's
 | ||
| 
 | ||
| `R{4}'
 | ||
|      exactly 4 R's
 | ||
| 
 | ||
| `{NAME}'
 | ||
|      the expansion of the "NAME" definition (see above)
 | ||
| 
 | ||
| `"[xyz]\"foo"'
 | ||
|      the literal string: `[xyz]"foo'
 | ||
| 
 | ||
| `\X'
 | ||
|      if X is an `a', `b', `f', `n', `r', `t', or `v', then the ANSI-C
 | ||
|      interpretation of \X.  Otherwise, a literal `X' (used to escape
 | ||
|      operators such as `*')
 | ||
| 
 | ||
| `\0'
 | ||
|      a NUL character (ASCII code 0)
 | ||
| 
 | ||
| `\123'
 | ||
|      the character with octal value 123
 | ||
| 
 | ||
| `\x2a'
 | ||
|      the character with hexadecimal value `2a'
 | ||
| 
 | ||
| `(R)'
 | ||
|      match an R; parentheses are used to override precedence (see below)
 | ||
| 
 | ||
| `RS'
 | ||
|      the regular expression R followed by the regular expression S;
 | ||
|      called "concatenation"
 | ||
| 
 | ||
| `R|S'
 | ||
|      either an R or an S
 | ||
| 
 | ||
| `R/S'
 | ||
|      an R but only if it is followed by an S.  The text matched by S is
 | ||
|      included when determining whether this rule is the "longest
 | ||
|      match", but is then returned to the input before the action is
 | ||
|      executed.  So the action only sees the text matched by R.  This
 | ||
|      type of pattern is called "trailing context".  (There are some
 | ||
|      combinations of `R/S' that `flex' cannot match correctly; see
 | ||
|      notes in the Deficiencies / Bugs section below regarding
 | ||
|      "dangerous trailing context".)
 | ||
| 
 | ||
| `^R'
 | ||
|      an R, but only at the beginning of a line (i.e., when just
 | ||
|      starting to scan, or right after a newline has been scanned).
 | ||
| 
 | ||
| `R$'
 | ||
|      an R, but only at the end of a line (i.e., just before a newline).
 | ||
|      Equivalent to "R/\n".
 | ||
| 
 | ||
|      Note that flex's notion of "newline" is exactly whatever the C
 | ||
|      compiler used to compile flex interprets '\n' as; in particular,
 | ||
|      on some DOS systems you must either filter out \r's in the input
 | ||
|      yourself, or explicitly use R/\r\n for "R$".
 | ||
| 
 | ||
| `<S>R'
 | ||
|      an R, but only in start condition S (see below for discussion of
 | ||
|      start conditions)
 | ||
| 
 | ||
| `<S1,S2,S3>R'
 | ||
|      same, but in any of start conditions S1, S2, or S3
 | ||
| 
 | ||
| `<*>R'
 | ||
|      an R in any start condition, even an exclusive one.
 | ||
| 
 | ||
| `<<EOF>>'
 | ||
|      an end-of-file
 | ||
| 
 | ||
| `<S1,S2><<EOF>>'
 | ||
|      an end-of-file when in start condition S1 or S2
 | ||
| 
 | ||
|    Note that inside of a character class, all regular expression
 | ||
| operators lose their special meaning except escape ('\') and the
 | ||
| character class operators, '-', ']', and, at the beginning of the
 | ||
| class, '^'.
 | ||
| 
 | ||
|    The regular expressions listed above are grouped according to
 | ||
| precedence, from highest precedence at the top to lowest at the bottom.
 | ||
| Those grouped together have equal precedence.  For example,
 | ||
| 
 | ||
|      foo|bar*
 | ||
| 
 | ||
| is the same as
 | ||
| 
 | ||
|      (foo)|(ba(r*))
 | ||
| 
 | ||
| since the '*' operator has higher precedence than concatenation, and
 | ||
| concatenation higher than alternation ('|').  This pattern therefore
 | ||
| matches *either* the string "foo" *or* the string "ba" followed by
 | ||
| zero-or-more r's.  To match "foo" or zero-or-more "bar"'s, use:
 | ||
| 
 | ||
|      foo|(bar)*
 | ||
| 
 | ||
| and to match zero-or-more "foo"'s-or-"bar"'s:
 | ||
| 
 | ||
|      (foo|bar)*
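   The three patterns above can be checked experimentally.  The sketch below uses the POSIX `regcomp()' and `regexec()' functions with `REG_EXTENDED' as a stand-in for a `flex' scanner; this is an assumption made for illustration, relying on POSIX extended regular expressions giving '*', concatenation, and '|' the same relative precedence.  The helper name `full_match' is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * Return 1 if the whole of text matches pattern, else 0.  Uses POSIX
 * regcomp()/regexec() as a stand-in for flex's matcher (an assumption
 * made for illustration; flex does not work this way internally).
 */
int full_match(const char *pattern, const char *text)
{
    char anchored[128];
    regex_t re;
    int n, matched;

    assert(pattern != NULL && text != NULL);
    /* Anchor at both ends so that only whole-string matches count. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return 0;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    matched = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

   With this helper, `full_match("foo|bar*", "barrr")' succeeds while `full_match("foo|bar*", "barbar")' fails, and `full_match("foo|(bar)*", "barbar")' succeeds.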
 | ||
| 
 | ||
|    In addition to characters and ranges of characters, character
 | ||
| classes can also contain character class "expressions".  These are
 | ||
| expressions enclosed inside `[:' and `:]' delimiters (which themselves
 | ||
| must appear between the '[' and ']' of the character class; other
 | ||
| elements may occur inside the character class, too).  The valid
 | ||
| expressions are:
 | ||
| 
 | ||
|      [:alnum:] [:alpha:] [:blank:]
 | ||
|      [:cntrl:] [:digit:] [:graph:]
 | ||
|      [:lower:] [:print:] [:punct:]
 | ||
|      [:space:] [:upper:] [:xdigit:]
 | ||
| 
 | ||
|    These expressions all designate a set of characters equivalent to
 | ||
| the corresponding standard C `isXXX' function.  For example,
 | ||
| `[:alnum:]' designates those characters for which `isalnum()' returns
 | ||
| true - i.e., any alphabetic or numeric.  Some systems don't provide
 | ||
| `isblank()', so flex defines `[:blank:]' as a blank or a tab.
 | ||
| 
 | ||
|    For example, the following character classes are all equivalent:
 | ||
| 
 | ||
|      [[:alnum:]]
 | ||
|      [[:alpha:][:digit:]]
 | ||
|      [[:alpha:]0-9]
 | ||
|      [a-zA-Z0-9]
 | ||
| 
 | ||
|    If your scanner is case-insensitive (the `-i' flag), then
 | ||
| `[:upper:]' and `[:lower:]' are equivalent to `[:alpha:]'.
 | ||
| 
 | ||
|    Some notes on patterns:
 | ||
| 
 | ||
|    - A negated character class such as the example "[^A-Z]" above *will
 | ||
|      match a newline* unless "\n" (or an equivalent escape sequence) is
 | ||
|      one of the characters explicitly present in the negated character
 | ||
|      class (e.g., "[^A-Z\n]").  This is unlike how many other regular
 | ||
|      expression tools treat negated character classes, but
 | ||
|      unfortunately the inconsistency is historically entrenched.
 | ||
|      Matching newlines means that a pattern like [^"]* can match the
 | ||
|      entire input unless there's another quote in the input.
 | ||
| 
 | ||
|    - A rule can have at most one instance of trailing context (the '/'
 | ||
|      operator or the '$' operator).  The start condition, '^', and
 | ||
|      "<<EOF>>" patterns can only occur at the beginning of a pattern,
 | ||
|      and, as well as with '/' and '$', cannot be grouped inside
 | ||
|      parentheses.  A '^' which does not occur at the beginning of a
 | ||
|      rule or a '$' which does not occur at the end of a rule loses its
 | ||
|      special properties and is treated as a normal character.
 | ||
| 
 | ||
|      The following are illegal:
 | ||
| 
 | ||
|           foo/bar$
 | ||
|           <sc1>foo<sc2>bar
 | ||
| 
 | ||
|      Note that the first of these can be written "foo/bar\n".
 | ||
| 
 | ||
|      The following will result in '$' or '^' being treated as a normal
 | ||
|      character:
 | ||
| 
 | ||
|           foo|(bar$)
 | ||
|           foo|^bar
 | ||
| 
 | ||
|      If what's wanted is a "foo" or a bar-followed-by-a-newline, the
 | ||
|      following could be used (the special '|' action is explained
 | ||
|      below):
 | ||
| 
 | ||
|           foo      |
 | ||
|           bar$     /* action goes here */
 | ||
| 
 | ||
|      A similar trick will work for matching a foo or a
 | ||
|      bar-at-the-beginning-of-a-line.
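   The first note above, on negated character classes matching newlines, can be illustrated with a small C model.  The two functions are hypothetical hand-written equivalents of the patterns `[^"]*' and `[^"\n]*'; they are a sketch of the matching behavior, not code that `flex' generates.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the longest prefix of s matched by [^"]* . */
size_t match_not_quote(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    /* A newline is not a quote, so it is consumed like any other byte. */
    while (s[n] != '\0' && s[n] != '"')
        ++n;
    return n;
}

/* Length of the longest prefix of s matched by [^"\n]* . */
size_t match_not_quote_or_newline(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    /* Listing \n in the class makes the match stop at the line end. */
    while (s[n] != '\0' && s[n] != '"' && s[n] != '\n')
        ++n;
    return n;
}
```

   On the input `ab', newline, `cd"x', the first function consumes five characters (running through the newline), while the second stops after two.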
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Matching,  Next: Actions,  Prev: Patterns,  Up: Top
 | ||
| 
 | ||
| How the input is matched
 | ||
| ========================
 | ||
| 
 | ||
|    When the generated scanner is run, it analyzes its input looking for
 | ||
| strings which match any of its patterns.  If it finds more than one
 | ||
| match, it takes the one matching the most text (for trailing context
 | ||
| rules, this includes the length of the trailing part, even though it
 | ||
| will then be returned to the input).  If it finds two or more matches
 | ||
| of the same length, the rule listed first in the `flex' input file is
 | ||
| chosen.
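   The selection rule (longest match first, earliest rule on a tie) can be sketched in C for the simplified case where every pattern is a literal string.  This is a model for illustration only, since a real scanner uses a state machine rather than trying each rule in turn; the name `choose_rule' is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the index of the rule that would be chosen for the start of
 * input, or -1 if no rule matches there.  Rules are literal strings.
 */
int choose_rule(const char *const rules[], size_t nrules, const char *input)
{
    int best = -1;
    size_t best_len = 0;
    size_t i;

    assert(rules != NULL && input != NULL);
    for (i = 0; i < nrules; ++i) {
        size_t len = strlen(rules[i]);

        /* Only a strictly longer match replaces the current choice, so
         * among equal-length matches the earliest rule is kept. */
        if (len > best_len && strncmp(input, rules[i], len) == 0) {
            best = (int)i;
            best_len = len;
        }
    }
    return best;
}
```

   When `choose_rule' returns -1, the corresponding scanner behavior is the default rule.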
 | ||
| 
 | ||
|    Once the match is determined, the text corresponding to the match
 | ||
| (called the TOKEN) is made available in the global character pointer
 | ||
| `yytext', and its length in the global integer `yyleng'.  The ACTION
 | ||
| corresponding to the matched pattern is then executed (a more detailed
 | ||
| description of actions follows), and then the remaining input is
 | ||
| scanned for another match.
 | ||
| 
 | ||
|    If no match is found, then the "default rule" is executed: the next
 | ||
| character in the input is considered matched and copied to the standard
 | ||
| output.  Thus, the simplest legal `flex' input is:
 | ||
| 
 | ||
|      %%
 | ||
| 
 | ||
|    which generates a scanner that simply copies its input (one
 | ||
| character at a time) to its output.
 | ||
| 
 | ||
|    Note that `yytext' can be defined in two different ways: either as a
 | ||
| character *pointer* or as a character *array*.  You can control which
 | ||
| definition `flex' uses by including one of the special directives
 | ||
| `%pointer' or `%array' in the first (definitions) section of your flex
 | ||
| input.  The default is `%pointer', unless you use the `-l' lex
 | ||
| compatibility option, in which case `yytext' will be an array.  The
 | ||
| advantage of using `%pointer' is substantially faster scanning and no
 | ||
| buffer overflow when matching very large tokens (unless you run out of
 | ||
| dynamic memory).  The disadvantage is that you are restricted in how
 | ||
| your actions can modify `yytext' (see the next section), and calls to
 | ||
| the `unput()' function destroy the present contents of `yytext', which
 | ||
| can be a considerable porting headache when moving between different
 | ||
| `lex' versions.
 | ||
| 
 | ||
|    The advantage of `%array' is that you can then modify `yytext' to
 | ||
| your heart's content, and calls to `unput()' do not destroy `yytext'
 | ||
| (see below).  Furthermore, existing `lex' programs sometimes access
 | ||
| `yytext' externally using declarations of the form:
 | ||
| 
 | ||
|      extern char yytext[];
 | ||
| 
 | ||
|    This definition is erroneous when used with `%pointer', but correct
 | ||
| for `%array'.
 | ||
| 
 | ||
|    `%array' defines `yytext' to be an array of `YYLMAX' characters,
 | ||
| which defaults to a fairly large value.  You can change the size by
 | ||
| simply #define'ing `YYLMAX' to a different value in the first section
 | ||
| of your `flex' input.  As mentioned above, with `%pointer' yytext grows
 | ||
| dynamically to accommodate large tokens.  While this means your
 | ||
| `%pointer' scanner can accommodate very large tokens (such as matching
 | ||
| entire blocks of comments), bear in mind that each time the scanner
 | ||
| must resize `yytext' it also must rescan the entire token from the
 | ||
| beginning, so matching such tokens can prove slow.  `yytext' presently
 | ||
| does *not* dynamically grow if a call to `unput()' results in too much
 | ||
| text being pushed back; instead, a run-time error results.
 | ||
| 
 | ||
|    Also note that you cannot use `%array' with C++ scanner classes (the
 | ||
| `c++' option; see below).
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Actions,  Next: Generated scanner,  Prev: Matching,  Up: Top
 | ||
| 
 | ||
| Actions
 | ||
| =======
 | ||
| 
 | ||
|    Each pattern in a rule has a corresponding action, which can be any
 | ||
| arbitrary C statement.  The pattern ends at the first non-escaped
 | ||
| whitespace character; the remainder of the line is its action.  If the
 | ||
| action is empty, then when the pattern is matched the input token is
 | ||
| simply discarded.  For example, here is the specification for a program
 | ||
| which deletes all occurrences of "zap me" from its input:
 | ||
| 
 | ||
|      %%
 | ||
|      "zap me"
 | ||
| 
 | ||
|    (It will copy all other characters in the input to the output since
 | ||
| they will be matched by the default rule.)
 | ||
| 
 | ||
|    Here is a program which compresses multiple blanks and tabs down to
 | ||
| a single blank, and throws away whitespace found at the end of a line:
 | ||
| 
 | ||
|      %%
 | ||
|      [ \t]+        putchar( ' ' );
 | ||
|      [ \t]+$       /* ignore this token */
 | ||
| 
 | ||
|    If the action contains a '{', then the action spans till the
 | ||
| balancing '}' is found, and the action may cross multiple lines.
 | ||
| `flex' knows about C strings and comments and won't be fooled by braces
 | ||
| found within them, but also allows actions to begin with `%{' and will
 | ||
| consider the action to be all the text up to the next `%}' (regardless
 | ||
| of ordinary braces inside the action).
 | ||
| 
 | ||
|    An action consisting solely of a vertical bar ('|') means "same as
 | ||
| the action for the next rule." See below for an illustration.
 | ||
| 
 | ||
|    Actions can include arbitrary C code, including `return' statements
 | ||
| to return a value to whatever routine called `yylex()'.  Each time
 | ||
| `yylex()' is called it continues processing tokens from where it last
 | ||
| left off until it either reaches the end of the file or executes a
 | ||
| return.
 | ||
| 
 | ||
|    Actions are free to modify `yytext' except for lengthening it
 | ||
| (adding characters to its end; these will overwrite later characters in
 | ||
| the input stream).  This however does not apply when using `%array'
 | ||
| (see above); in that case, `yytext' may be freely modified in any way.
 | ||
| 
 | ||
|    Actions are free to modify `yyleng' except they should not do so if
 | ||
| the action also includes use of `yymore()' (see below).
 | ||
| 
 | ||
|    There are a number of special directives which can be included
 | ||
| within an action:
 | ||
| 
 | ||
|    - `ECHO' copies yytext to the scanner's output.
 | ||
| 
 | ||
|    - `BEGIN' followed by the name of a start condition places the
 | ||
|      scanner in the corresponding start condition (see below).
 | ||
| 
 | ||
|    - `REJECT' directs the scanner to proceed on to the "second best"
 | ||
|      rule which matched the input (or a prefix of the input).  The rule
 | ||
|      is chosen as described above in "How the Input is Matched", and
 | ||
|      `yytext' and `yyleng' set up appropriately.  It may either be one
 | ||
|      which matched as much text as the originally chosen rule but came
 | ||
|      later in the `flex' input file, or one which matched less text.
 | ||
|      For example, the following will both count the words in the input
 | ||
|      and call the routine special() whenever "frob" is seen:
 | ||
| 
 | ||
|                   int word_count = 0;
 | ||
|           %%
 | ||
|           
 | ||
|           frob        special(); REJECT;
 | ||
|           [^ \t\n]+   ++word_count;
 | ||
| 
 | ||
|      Without the `REJECT', any "frob"'s in the input would not be
 | ||
|      counted as words, since the scanner normally executes only one
 | ||
|      action per token.  Multiple uses of `REJECT' are allowed, each one
 | ||
|      finding the next best choice to the currently active rule.  For
 | ||
|      example, when the following scanner scans the token "abcd", it
 | ||
|      will write "abcdabcaba" to the output:
 | ||
| 
 | ||
|           %%
 | ||
|           a        |
 | ||
|           ab       |
 | ||
|           abc      |
 | ||
|           abcd     ECHO; REJECT;
 | ||
|           .|\n     /* eat up any unmatched character */
 | ||
| 
 | ||
|      (The first three rules share the fourth's action since they use
 | ||
|      the special '|' action.)  `REJECT' is a particularly expensive
 | ||
|      feature in terms of scanner performance; if it is used in *any* of
 | ||
|      the scanner's actions it will slow down *all* of the scanner's
 | ||
|      matching.  Furthermore, `REJECT' cannot be used with the `-Cf' or
 | ||
|      `-CF' options (see below).
 | ||
| 
 | ||
|      Note also that unlike the other special actions, `REJECT' is a
 | ||
|      *branch*; code immediately following it in the action will *not*
 | ||
|      be executed.
 | ||
| 
 | ||
|    - `yymore()' tells the scanner that the next time it matches a rule,
 | ||
|      the corresponding token should be *appended* onto the current
 | ||
|      value of `yytext' rather than replacing it.  For example, given
 | ||
|      the input "mega-kludge" the following will write
 | ||
|      "mega-mega-kludge" to the output:
 | ||
| 
 | ||
|           %%
 | ||
|           mega-    ECHO; yymore();
 | ||
|           kludge   ECHO;
 | ||
| 
 | ||
|      First "mega-" is matched and echoed to the output.  Then "kludge"
 | ||
|      is matched, but the previous "mega-" is still hanging around at
 | ||
|      the beginning of `yytext' so the `ECHO' for the "kludge" rule will
 | ||
|      actually write "mega-kludge".
 | ||
| 
 | ||
|    Two notes regarding use of `yymore()'.  First, `yymore()' depends on
 | ||
| the value of `yyleng' correctly reflecting the size of the current
 | ||
| token, so you must not modify `yyleng' if you are using `yymore()'.
 | ||
| Second, the presence of `yymore()' in the scanner's action entails a
 | ||
| minor performance penalty in the scanner's matching speed.
 | ||
| 
 | ||
|    - `yyless(n)' returns all but the first N characters of the current
 | ||
|      token back to the input stream, where they will be rescanned when
 | ||
|      the scanner looks for the next match.  `yytext' and `yyleng' are
 | ||
|      adjusted appropriately (e.g., `yyleng' will now be equal to N).
 | ||
|      For example, on the input "foobar" the following will write out
 | ||
|      "foobarbar":
 | ||
| 
 | ||
|           %%
 | ||
|           foobar    ECHO; yyless(3);
 | ||
|           [a-z]+    ECHO;
 | ||
| 
 | ||
|      An argument of 0 to `yyless' will cause the entire current input
 | ||
|      string to be scanned again.  Unless you've changed how the scanner
 | ||
|      will subsequently process its input (using `BEGIN', for example),
 | ||
|      this will result in an endless loop.
 | ||
| 
 | ||
|      Note that `yyless' is a macro and can only be used in the flex
 | ||
|      input file, not from other source files.
 | ||
| 
 | ||
|    - `unput(c)' puts the character `c' back onto the input stream.  It
 | ||
|      will be the next character scanned.  The following action will
 | ||
|      take the current token and cause it to be rescanned enclosed in
 | ||
|      parentheses.
 | ||
| 
 | ||
|           {
 | ||
|           int i;
 | ||
|           /* Copy yytext because unput() trashes yytext */
 | ||
|           char *yycopy = strdup( yytext );
 | ||
|           unput( ')' );
 | ||
|           for ( i = yyleng - 1; i >= 0; --i )
 | ||
|               unput( yycopy[i] );
 | ||
|           unput( '(' );
 | ||
|           free( yycopy );
 | ||
|           }
 | ||
| 
 | ||
|      Note that since each `unput()' puts the given character back at
 | ||
|      the *beginning* of the input stream, pushing back strings must be
 | ||
|      done back-to-front.  An important potential problem when using
 | ||
|      `unput()' is that if you are using `%pointer' (the default), a
 | ||
|      call to `unput()' *destroys* the contents of `yytext', starting
 | ||
|      with its rightmost character and devouring one character to the
 | ||
|      left with each call.  If you need the value of yytext preserved
 | ||
|      after a call to `unput()' (as in the above example), you must
 | ||
|      either first copy it elsewhere, or build your scanner using
 | ||
|      `%array' instead (see How The Input Is Matched).
 | ||
| 
 | ||
|      Finally, note that you cannot put back `EOF' to attempt to mark
 | ||
|      the input stream with an end-of-file.
 | ||
| 
 | ||
|    - `input()' reads the next character from the input stream.  For
 | ||
|      example, the following is one way to eat up C comments:
 | ||
| 
 | ||
|           %%
 | ||
|           "/*"        {
 | ||
|                       register int c;
 | ||
|           
 | ||
|                       for ( ; ; )
 | ||
|                           {
 | ||
|                           while ( (c = input()) != '*' &&
 | ||
|                                   c != EOF )
 | ||
|                               ;    /* eat up text of comment */
 | ||
|           
 | ||
|                           if ( c == '*' )
 | ||
|                               {
 | ||
|                               while ( (c = input()) == '*' )
 | ||
|                                   ;
 | ||
|                               if ( c == '/' )
 | ||
|                                   break;    /* found the end */
 | ||
|                               }
 | ||
|           
 | ||
|                           if ( c == EOF )
 | ||
|                               {
 | ||
|                               error( "EOF in comment" );
 | ||
|                               break;
 | ||
|                               }
 | ||
|                           }
 | ||
|                       }
 | ||
| 
 | ||
|      (Note that if the scanner is compiled using `C++', then `input()'
 | ||
|      is instead referred to as `yyinput()', in order to avoid a name
 | ||
|      clash with the `C++' stream by the name of `input'.)
 | ||
| 
 | ||
|    - YY_FLUSH_BUFFER flushes the scanner's internal buffer so that the
 | ||
|      next time the scanner attempts to match a token, it will first
 | ||
|      refill the buffer using `YY_INPUT' (see The Generated Scanner,
 | ||
|      below).  This action is a special case of the more general
 | ||
|      `yy_flush_buffer()' function, described below in the section
 | ||
|      Multiple Input Buffers.
 | ||
| 
 | ||
|    - `yyterminate()' can be used in lieu of a return statement in an
 | ||
|      action.  It terminates the scanner and returns a 0 to the
 | ||
|      scanner's caller, indicating "all done".  By default,
 | ||
|      `yyterminate()' is also called when an end-of-file is encountered.
 | ||
|      It is a macro and may be redefined.
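The comment-eating loop shown for `input()' can be tried outside a scanner. The following is a minimal sketch in plain C, assuming the text after the opening delimiter is already held in a NUL-terminated string; the helper name `skip_comment_body' is hypothetical and is not part of flex. A NULL return corresponds to the "EOF in comment" case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of flex: skips the body of a C comment.
 * `s' points just past the opening slash-star.  Returns a pointer just
 * past the closing star-slash, or NULL if the string ends first. */
const char *skip_comment_body(const char *s)
{
    assert(s != NULL);
    for (;;) {
        /* eat up text of comment */
        while (*s != '*' && *s != '\0')
            ++s;

        if (*s == '*') {
            /* skip a run of '*'s, then look for the closing '/' */
            while (*s == '*')
                ++s;
            if (*s == '/')
                return s + 1;   /* found the end */
        }

        if (*s == '\0')
            return NULL;        /* unterminated comment */
    }
}
```

The control flow mirrors the scanner action: one inner loop for ordinary text, one for runs of '*', and a separate end-of-input check after both.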
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Generated scanner,  Next: Start conditions,  Prev: Actions,  Up: Top
 | ||
| 
 | ||
| The generated scanner
 | ||
| =====================
 | ||
| 
 | ||
|    The output of `flex' is the file `lex.yy.c', which contains the
 | ||
| scanning routine `yylex()', a number of tables used by it for matching
 | ||
| tokens, and a number of auxiliary routines and macros.  By default,
 | ||
| `yylex()' is declared as follows:
 | ||
| 
 | ||
|      int yylex()
 | ||
|          {
 | ||
|          ... various definitions and the actions in here ...
 | ||
|          }
 | ||
| 
 | ||
|    (If your environment supports function prototypes, then it will be
 | ||
| "int yylex( void )".)  This definition may be changed by defining
 | ||
| the "YY_DECL" macro.  For example, you could use:
 | ||
| 
 | ||
|      #define YY_DECL float lexscan( a, b ) float a, b;
 | ||
| 
 | ||
|    to give the scanning routine the name `lexscan', returning a float,
 | ||
| and taking two floats as arguments.  Note that if you give arguments to
 | ||
| the scanning routine using a K&R-style/non-prototyped function
 | ||
| declaration, you must terminate the definition with a semi-colon (`;').
 | ||
| 
 | ||
|    Whenever `yylex()' is called, it scans tokens from the global input
 | ||
| file `yyin' (which defaults to stdin).  It continues until it either
 | ||
| reaches an end-of-file (at which point it returns the value 0) or one
 | ||
| of its actions executes a `return' statement.
 | ||
| 
 | ||
|    If the scanner reaches an end-of-file, subsequent calls are undefined
 | ||
| unless either `yyin' is pointed at a new input file (in which case
 | ||
| scanning continues from that file), or `yyrestart()' is called.
 | ||
| `yyrestart()' takes one argument, a `FILE *' pointer (which can be nil,
 | ||
| if you've set up `YY_INPUT' to scan from a source other than `yyin'),
 | ||
| and initializes `yyin' for scanning from that file.  Essentially there
 | ||
| is no difference between just assigning `yyin' to a new input file or
 | ||
| using `yyrestart()' to do so; the latter is available for compatibility
 | ||
| with previous versions of `flex', and because it can be used to switch
 | ||
| input files in the middle of scanning.  It can also be used to throw
 | ||
| away the current input buffer, by calling it with an argument of
 | ||
| `yyin'; but it is better to use `YY_FLUSH_BUFFER' (see above).  Note that
 | ||
| `yyrestart()' does *not* reset the start condition to `INITIAL' (see
 | ||
| Start Conditions, below).
 | ||
| 
 | ||
|    If `yylex()' stops scanning due to executing a `return' statement in
 | ||
| one of the actions, the scanner may then be called again and it will
 | ||
| resume scanning where it left off.
 | ||
| 
 | ||
|    By default (and for purposes of efficiency), the scanner uses
 | ||
| block-reads rather than simple `getc()' calls to read characters from
 | ||
| `yyin'.  The nature of how it gets its input can be controlled by
 | ||
| defining the `YY_INPUT' macro.  YY_INPUT's calling sequence is
 | ||
| "YY_INPUT(buf,result,max_size)".  Its action is to place up to MAX_SIZE
 | ||
| characters in the character array BUF and return in the integer
 | ||
| variable RESULT either the number of characters read or the constant
 | ||
| YY_NULL (0 on Unix systems) to indicate EOF.  The default YY_INPUT
 | ||
| reads from the global file-pointer "yyin".
 | ||
| 
 | ||
|    A sample definition of YY_INPUT (in the definitions section of the
 | ||
| input file):
 | ||
| 
 | ||
|      %{
 | ||
|      #define YY_INPUT(buf,result,max_size) \
 | ||
|          { \
 | ||
|          int c = getchar(); \
 | ||
|          result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
 | ||
|          }
 | ||
|      %}
 | ||
| 
 | ||
|    This definition will change the input processing to occur one
 | ||
| character at a time.
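A redefined `YY_INPUT' usually hands the real work to an ordinary C function. The following sketch assumes the input lives in memory rather than in `yyin'; the names `mem_source' and `mem_source_read' are hypothetical and do not come from flex. The function follows the contract described above: it places at most MAX_SIZE characters in BUF and reports how many it stored, with 0 standing for end of input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory source; none of these names come from flex. */
struct mem_source {
    const char *data;   /* text still to be handed to the scanner */
    size_t      left;   /* number of bytes remaining in `data' */
};

/* Copies up to max_size bytes into buf and returns the count.
 * A return of 0 means end of input (YY_NULL on Unix systems). */
size_t mem_source_read(struct mem_source *src, char *buf, size_t max_size)
{
    size_t n;

    assert(src != NULL && buf != NULL);
    n = src->left < max_size ? src->left : max_size;
    memcpy(buf, src->data, n);
    src->data += n;
    src->left -= n;
    return n;
}
```

A `YY_INPUT' definition could then assign the function's return value to RESULT. Unlike the one-character version above, this keeps the block-read behaviour.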
 | ||
| 
 | ||
|    When the scanner receives an end-of-file indication from YY_INPUT,
 | ||
| it then checks the `yywrap()' function.  If `yywrap()' returns false
 | ||
| (zero), then it is assumed that the function has gone ahead and set up
 | ||
| `yyin' to point to another input file, and scanning continues.  If it
 | ||
| returns true (non-zero), then the scanner terminates, returning 0 to
 | ||
| its caller.  Note that in either case, the start condition remains
 | ||
| unchanged; it does *not* revert to `INITIAL'.
 | ||
| 
 | ||
|    If you do not supply your own version of `yywrap()', then you must
 | ||
| either use `%option noyywrap' (in which case the scanner behaves as
 | ||
| though `yywrap()' returned 1), or you must link with `-lfl' to obtain
 | ||
| the default version of the routine, which always returns 1.
 | ||
| 
 | ||
|    Three routines are available for scanning from in-memory buffers
 | ||
| rather than files: `yy_scan_string()', `yy_scan_bytes()', and
 | ||
| `yy_scan_buffer()'.  See the discussion of them below in the section
 | ||
| Multiple Input Buffers.
 | ||
| 
 | ||
|    The scanner writes its `ECHO' output to the `yyout' global (default,
 | ||
| stdout), which may be redefined by the user simply by assigning it to
 | ||
| some other `FILE' pointer.
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Start conditions,  Next: Multiple buffers,  Prev: Generated scanner,  Up: Top
 | ||
| 
 | ||
| Start conditions
 | ||
| ================
 | ||
| 
 | ||
|    `flex' provides a mechanism for conditionally activating rules.  Any
 | ||
| rule whose pattern is prefixed with "<sc>" will only be active when the
 | ||
| scanner is in the start condition named "sc".  For example,
 | ||
| 
 | ||
|      <STRING>[^"]*        { /* eat up the string body ... */
 | ||
|                  ...
 | ||
|                  }
 | ||
| 
 | ||
| will be active only when the scanner is in the "STRING" start
 | ||
| condition, and
 | ||
| 
 | ||
|      <INITIAL,STRING,QUOTE>\.        { /* handle an escape ... */
 | ||
|                  ...
 | ||
|                  }
 | ||
| 
 | ||
| will be active only when the current start condition is either
 | ||
| "INITIAL", "STRING", or "QUOTE".
 | ||
| 
 | ||
|    Start conditions are declared in the definitions (first) section of
 | ||
| the input using unindented lines beginning with either `%s' or `%x'
 | ||
| followed by a list of names.  The former declares *inclusive* start
 | ||
| conditions, the latter *exclusive* start conditions.  A start condition
 | ||
| is activated using the `BEGIN' action.  Until the next `BEGIN' action is
 | ||
| executed, rules with the given start condition will be active and rules
 | ||
| with other start conditions will be inactive.  If the start condition
 | ||
| is *inclusive*, then rules with no start conditions at all will also be
 | ||
| active.  If it is *exclusive*, then *only* rules qualified with the
 | ||
| start condition will be active.  A set of rules contingent on the same
 | ||
| exclusive start condition describes a scanner which is independent of
 | ||
| any of the other rules in the `flex' input.  Because of this, exclusive
 | ||
| start conditions make it easy to specify "mini-scanners" which scan
 | ||
| portions of the input that are syntactically different from the rest
 | ||
| (e.g., comments).
 | ||
| 
 | ||
|    If the distinction between inclusive and exclusive start conditions
 | ||
| is still a little vague, here's a simple example illustrating the
 | ||
| connection between the two.  The set of rules:
 | ||
| 
 | ||
|      %s example
 | ||
|      %%
 | ||
|      
 | ||
|      <example>foo   do_something();
 | ||
|      
 | ||
|      bar            something_else();
 | ||
| 
 | ||
| is equivalent to
 | ||
| 
 | ||
|      %x example
 | ||
|      %%
 | ||
|      
 | ||
|      <example>foo   do_something();
 | ||
|      
 | ||
|      <INITIAL,example>bar    something_else();
 | ||
| 
 | ||
|    Without the `<INITIAL,example>' qualifier, the `bar' pattern in the
 | ||
| second example wouldn't be active (i.e., couldn't match) when in start
 | ||
| condition `example'.  If we just used `<example>' to qualify `bar',
 | ||
| though, then it would only be active in `example' and not in `INITIAL',
 | ||
| while in the first example it's active in both, because in the first
 | ||
| example the `example' starting condition is an *inclusive* (`%s') start
 | ||
| condition.
 | ||
| 
 | ||
|    Also note that the special start-condition specifier `<*>' matches
 | ||
| every start condition.  Thus, the above example could also have been
 | ||
| written:
 | ||
| 
 | ||
|      %x example
 | ||
|      %%
 | ||
|      
 | ||
|      <example>foo   do_something();
 | ||
|      
 | ||
|      <*>bar    something_else();
 | ||
| 
 | ||
|    The default rule (to `ECHO' any unmatched character) remains active
 | ||
| in start conditions.  It is equivalent to:
 | ||
| 
 | ||
|      <*>.|\n     ECHO;
 | ||
| 
 | ||
|    `BEGIN(0)' returns to the original state where only the rules with
 | ||
| no start conditions are active.  This state can also be referred to as
 | ||
| the start-condition "INITIAL", so `BEGIN(INITIAL)' is equivalent to
 | ||
| `BEGIN(0)'.  (The parentheses around the start condition name are not
 | ||
| required but are considered good style.)
 | ||
| 
 | ||
|    `BEGIN' actions can also be given as indented code at the beginning
 | ||
| of the rules section.  For example, the following will cause the
 | ||
| scanner to enter the "SPECIAL" start condition whenever `yylex()' is
 | ||
| called and the global variable `enter_special' is true:
 | ||
| 
 | ||
|              int enter_special;
 | ||
|      
 | ||
|      %x SPECIAL
 | ||
|      %%
 | ||
|              if ( enter_special )
 | ||
|                  BEGIN(SPECIAL);
 | ||
|      
 | ||
|      <SPECIAL>blahblahblah
 | ||
|      ...more rules follow...
 | ||
| 
 | ||
|    To illustrate the uses of start conditions, here is a scanner which
 | ||
| provides two different interpretations of a string like "123.456".  By
 | ||
| default it will treat it as three tokens, the integer "123", a dot
 | ||
| ('.'), and the integer "456".  But if the string is preceded earlier in
 | ||
| the line by the string "expect-floats" it will treat it as a single
 | ||
| token, the floating-point number 123.456:
 | ||
| 
 | ||
|      %{
 | ||
|      #include <math.h>
 | ||
|      %}
 | ||
|      %s expect
 | ||
|      
 | ||
|      %%
 | ||
|      expect-floats        BEGIN(expect);
 | ||
|      
 | ||
|      <expect>[0-9]+"."[0-9]+      {
 | ||
|                  printf( "found a float, = %f\n",
 | ||
|                          atof( yytext ) );
 | ||
|                  }
 | ||
|      <expect>\n           {
 | ||
|                  /* that's the end of the line, so
 | ||
|                   * we need another "expect-floats"
 | ||
|                   * before we'll recognize any more
 | ||
|                   * numbers
 | ||
|                   */
 | ||
|                  BEGIN(INITIAL);
 | ||
|                  }
 | ||
|      
 | ||
|      [0-9]+      {
 | ||
|                  printf( "found an integer, = %d\n",
 | ||
|                          atoi( yytext ) );
 | ||
|                  }
 | ||
|      
 | ||
|      "."         printf( "found a dot\n" );
 | ||
| 
 | ||
|    Here is a scanner which recognizes (and discards) C comments while
 | ||
| maintaining a count of the current input line.
 | ||
| 
 | ||
|      %x comment
 | ||
|      %%
 | ||
|              int line_num = 1;
 | ||
|      
 | ||
|      "/*"         BEGIN(comment);
 | ||
|      
 | ||
|      <comment>[^*\n]*        /* eat anything that's not a '*' */
 | ||
|      <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
 | ||
|      <comment>\n             ++line_num;
 | ||
|      <comment>"*"+"/"        BEGIN(INITIAL);
 | ||
| 
 | ||
|    This scanner goes to a bit of trouble to match as much text as
 | ||
| possible with each rule.  In general, when attempting to write a
 | ||
| high-speed scanner try to match as much as possible in each rule, as it's
 | ||
| a big win.
 | ||
| 
 | ||
|    Note that start-condition names are really integer values and can
 | ||
| be stored as such.  Thus, the above could be extended in the following
 | ||
| fashion:
 | ||
| 
 | ||
|      %x comment foo
 | ||
|      %%
 | ||
|              int line_num = 1;
 | ||
|              int comment_caller;
 | ||
|      
 | ||
|      "/*"         {
 | ||
|                   comment_caller = INITIAL;
 | ||
|                   BEGIN(comment);
 | ||
|                   }
 | ||
|      
 | ||
|      ...
 | ||
|      
 | ||
|      <foo>"/*"    {
 | ||
|                   comment_caller = foo;
 | ||
|                   BEGIN(comment);
 | ||
|                   }
 | ||
|      
 | ||
|      <comment>[^*\n]*        /* eat anything that's not a '*' */
 | ||
|      <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
 | ||
|      <comment>\n             ++line_num;
 | ||
|      <comment>"*"+"/"        BEGIN(comment_caller);
 | ||
| 
 | ||
|    Furthermore, you can access the current start condition using the
 | ||
| integer-valued `YY_START' macro.  For example, the above assignments to
 | ||
| `comment_caller' could instead be written
 | ||
| 
 | ||
|      comment_caller = YY_START;
 | ||
| 
 | ||
|    Flex provides `YYSTATE' as an alias for `YY_START' (since that is
 | ||
| what's used by AT&T `lex').
 | ||
| 
 | ||
|    Note that start conditions do not have their own name-space; %s's
 | ||
| and %x's declare names in the same fashion as #define's.
 | ||
| 
 | ||
|    Finally, here's an example of how to match C-style quoted strings
 | ||
| using exclusive start conditions, including expanded escape sequences
 | ||
| (but not including checking for a string that's too long):
 | ||
| 
 | ||
|      %x str
 | ||
|      
 | ||
|      %%
 | ||
|              char string_buf[MAX_STR_CONST];
 | ||
|              char *string_buf_ptr;
 | ||
|      
 | ||
|      \"      string_buf_ptr = string_buf; BEGIN(str);
 | ||
|      
 | ||
|      <str>\"        { /* saw closing quote - all done */
 | ||
|              BEGIN(INITIAL);
 | ||
|              *string_buf_ptr = '\0';
 | ||
|              /* return string constant token type and
 | ||
|               * value to parser
 | ||
|               */
 | ||
|              }
 | ||
|      
 | ||
|      <str>\n        {
 | ||
|              /* error - unterminated string constant */
 | ||
|              /* generate error message */
 | ||
|              }
 | ||
|      
 | ||
|      <str>\\[0-7]{1,3} {
 | ||
|              /* octal escape sequence */
 | ||
|              int result;
 | ||
|      
 | ||
|              (void) sscanf( yytext + 1, "%o", &result );
 | ||
|      
 | ||
|              if ( result > 0xff )
 | ||
|                      { /* error, constant is out-of-bounds */ }
 | ||
|      
 | ||
|              *string_buf_ptr++ = result;
 | ||
|              }
 | ||
|      
 | ||
|      <str>\\[0-9]+ {
 | ||
|              /* generate error - bad escape sequence; something
 | ||
|               * like '\48' or '\0777777'
 | ||
|               */
 | ||
|              }
 | ||
|      
 | ||
|      <str>\\n  *string_buf_ptr++ = '\n';
 | ||
|      <str>\\t  *string_buf_ptr++ = '\t';
 | ||
|      <str>\\r  *string_buf_ptr++ = '\r';
 | ||
|      <str>\\b  *string_buf_ptr++ = '\b';
 | ||
|      <str>\\f  *string_buf_ptr++ = '\f';
 | ||
|      
 | ||
|      <str>\\(.|\n)  *string_buf_ptr++ = yytext[1];
 | ||
|      
 | ||
|      <str>[^\\\n\"]+        {
 | ||
|              char *yptr = yytext;
 | ||
|      
 | ||
|              while ( *yptr )
 | ||
|                      *string_buf_ptr++ = *yptr++;
 | ||
|              }
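The octal-escape rule above can be checked in isolation. The following is a minimal sketch in plain C of the same decoding step, assuming the matched text (backslash included) is passed in as a string; `decode_octal_escape' is a hypothetical name, and the sketch reads into an `unsigned int', which is the type the `%o' conversion expects.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: decodes the text of an octal escape such as "\101".
 * Returns 1 and stores the character code in *out on success; returns 0
 * if the text is malformed or the value does not fit in one byte. */
int decode_octal_escape(const char *text, unsigned int *out)
{
    unsigned int value;

    assert(text != NULL && out != NULL);
    if (text[0] != '\\')
        return 0;
    /* skip the backslash, as the rule's action does with yytext + 1 */
    if (sscanf(text + 1, "%o", &value) != 1)
        return 0;
    if (value > 0xff)
        return 0;       /* constant is out-of-bounds */
    *out = value;
    return 1;
}
```

Returning a status instead of storing the value unconditionally keeps an out-of-range constant from being copied into the string buffer.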
 | ||
| 
 | ||
|    Often, such as in some of the examples above, you wind up writing a
 | ||
| whole bunch of rules all preceded by the same start condition(s).  Flex
 | ||
| makes this a little easier and cleaner by introducing a notion of start
 | ||
| condition "scope".  A start condition scope is begun with:
 | ||
| 
 | ||
|      <SCs>{
 | ||
| 
 | ||
| where SCs is a list of one or more start conditions.  Inside the start
 | ||
| condition scope, every rule automatically has the prefix `<SCs>'
 | ||
| applied to it, until a `}' which matches the initial `{'.  So, for
 | ||
| example,
 | ||
| 
 | ||
|      <ESC>{
 | ||
|          "\\n"   return '\n';
 | ||
|          "\\r"   return '\r';
 | ||
|          "\\f"   return '\f';
 | ||
|          "\\0"   return '\0';
 | ||
|      }
 | ||
| 
 | ||
| is equivalent to:
 | ||
| 
 | ||
|      <ESC>"\\n"  return '\n';
 | ||
|      <ESC>"\\r"  return '\r';
 | ||
|      <ESC>"\\f"  return '\f';
 | ||
|      <ESC>"\\0"  return '\0';
 | ||
| 
 | ||
|    Start condition scopes may be nested.
 | ||
| 
 | ||
|    Three routines are available for manipulating stacks of start
 | ||
| conditions:
 | ||
| 
 | ||
| `void yy_push_state(int new_state)'
 | ||
|      pushes the current start condition onto the top of the start
 | ||
|      condition stack and switches to NEW_STATE as though you had used
 | ||
|      `BEGIN new_state' (recall that start condition names are also
 | ||
|      integers).
 | ||
| 
 | ||
| `void yy_pop_state()'
 | ||
|      pops the top of the stack and switches to it via `BEGIN'.
 | ||
| 
 | ||
| `int yy_top_state()'
 | ||
|      returns the top of the stack without altering the stack's contents.
 | ||
| 
 | ||
|    The start condition stack grows dynamically and so has no built-in
 | ||
| size limitation.  If memory is exhausted, program execution aborts.
 | ||
| 
 | ||
|    To use start condition stacks, your scanner must include a `%option
 | ||
| stack' directive (see Options below).
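The behaviour of the three stack routines can be modelled in a few lines of plain C. This is a toy model only, not flex's implementation: the `sc_' names are hypothetical, and the `current' field merely stands in for the scanner's current start condition.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model only; these names are hypothetical and this is not flex's code. */
struct sc_stack {
    int   *items;     /* saved start conditions */
    size_t depth;     /* number of saved entries */
    size_t capacity;  /* allocated slots */
    int    current;   /* stands in for the current start condition */
};

/* Like yy_push_state(): save the current condition, then switch. */
void sc_push(struct sc_stack *s, int new_state)
{
    assert(s != NULL);
    if (s->depth == s->capacity) {
        size_t new_cap = s->capacity ? s->capacity * 2 : 4;
        int *grown = realloc(s->items, new_cap * sizeof *grown);
        if (grown == NULL)
            abort();            /* memory exhausted: execution aborts */
        s->items = grown;
        s->capacity = new_cap;
    }
    s->items[s->depth++] = s->current;
    s->current = new_state;
}

/* Like yy_pop_state(): switch back to the most recently saved condition. */
void sc_pop(struct sc_stack *s)
{
    assert(s != NULL && s->depth > 0);
    s->current = s->items[--s->depth];
}

/* Like yy_top_state(): peek at the saved condition without popping. */
int sc_top(const struct sc_stack *s)
{
    assert(s != NULL && s->depth > 0);
    return s->items[s->depth - 1];
}
```

Note that the top of the stack is the condition that was active before the most recent push, not the condition currently in effect.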
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Multiple buffers,  Next: End-of-file rules,  Prev: Start conditions,  Up: Top
 | ||
| 
 | ||
| Multiple input buffers
 | ||
| ======================
 | ||
| 
 | ||
|    Some scanners (such as those which support "include" files) require
 | ||
| reading from several input streams.  As `flex' scanners do a large
 | ||
| amount of buffering, one cannot control where the next input will be
 | ||
| read from by simply writing a `YY_INPUT' which is sensitive to the
 | ||
| scanning context.  `YY_INPUT' is only called when the scanner reaches
 | ||
| the end of its buffer, which may be a long time after scanning a
 | ||
| statement such as an "include" which requires switching the input
 | ||
| source.
 | ||
| 
 | ||
|    To negotiate these sorts of problems, `flex' provides a mechanism
 | ||
| for creating and switching between multiple input buffers.  An input
 | ||
| buffer is created by using:
 | ||
| 
 | ||
|      YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
 | ||
| 
 | ||
| which takes a `FILE' pointer and a size and creates a buffer associated
 | ||
| with the given file and large enough to hold SIZE characters (when in
 | ||
| doubt, use `YY_BUF_SIZE' for the size).  It returns a `YY_BUFFER_STATE'
 | ||
| handle, which may then be passed to other routines (see below).  The
 | ||
| `YY_BUFFER_STATE' type is a pointer to an opaque `struct'
 | ||
| `yy_buffer_state' structure, so you may safely initialize
 | ||
| YY_BUFFER_STATE variables to `((YY_BUFFER_STATE) 0)' if you wish, and
 | ||
| also refer to the opaque structure in order to correctly declare input
 | ||
| buffers in source files other than that of your scanner.  Note that the
 | ||
| `FILE' pointer in the call to `yy_create_buffer' is only used as the
 | ||
| value of `yyin' seen by `YY_INPUT'; if you redefine `YY_INPUT' so it no
 | ||
| longer uses `yyin', then you can safely pass a nil `FILE' pointer to
 | ||
| `yy_create_buffer'.  You select a particular buffer to scan from using:
 | ||
| 
 | ||
|      void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
 | ||
| 
 | ||
| which switches the scanner's input buffer so subsequent tokens will come
 | ||
| from NEW_BUFFER.  Note that `yy_switch_to_buffer()' may be used by
 | ||
| `yywrap()' to set things up for continued scanning, instead of opening
 | ||
| a new file and pointing `yyin' at it.  Note also that switching input
 | ||
| sources via either `yy_switch_to_buffer()' or `yywrap()' does *not*
 | ||
| change the start condition.
 | ||
| 
 | ||
|      void yy_delete_buffer( YY_BUFFER_STATE buffer )
 | ||
| 
 | ||
| is used to reclaim the storage associated with a buffer.  You can also
 | ||
| clear the current contents of a buffer using:
 | ||
| 
 | ||
|      void yy_flush_buffer( YY_BUFFER_STATE buffer )
 | ||
| 
 | ||
|    This function discards the buffer's contents, so the next time the
 | ||
| scanner attempts to match a token from the buffer, it will first fill
 | ||
| the buffer anew using `YY_INPUT'.
 | ||
| 
 | ||
|    `yy_new_buffer()' is an alias for `yy_create_buffer()', provided for
 | ||
| compatibility with the C++ use of `new' and `delete' for creating and
 | ||
| destroying dynamic objects.
 | ||
| 
 | ||
|    Finally, the `YY_CURRENT_BUFFER' macro returns a `YY_BUFFER_STATE'
 | ||
| handle to the current buffer.
 | ||
| 
 | ||
|    Here is an example of using these features for writing a scanner
 | ||
| which expands include files (the `<<EOF>>' feature is discussed below):
 | ||
| 
 | ||
|      /* the "incl" state is used for picking up the name
 | ||
|       * of an include file
 | ||
|       */
 | ||
|      %x incl
 | ||
|      
 | ||
|      %{
 | ||
|      #define MAX_INCLUDE_DEPTH 10
 | ||
|      YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
 | ||
|      int include_stack_ptr = 0;
 | ||
|      %}
 | ||
|      
 | ||
|      %%
 | ||
|      include             BEGIN(incl);
 | ||
|      
 | ||
|      [a-z]+              ECHO;
 | ||
|      [^a-z\n]*\n?        ECHO;
 | ||
|      
 | ||
|      <incl>[ \t]*      /* eat the whitespace */
 | ||
|      <incl>[^ \t\n]+   { /* got the include file name */
 | ||
|              if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
 | ||
|                  {
 | ||
|                  fprintf( stderr, "Includes nested too deeply" );
 | ||
|                  exit( 1 );
 | ||
|                  }
 | ||
|      
 | ||
|              include_stack[include_stack_ptr++] =
 | ||
|                  YY_CURRENT_BUFFER;
 | ||
|      
 | ||
|              yyin = fopen( yytext, "r" );
 | ||
|      
 | ||
|              if ( ! yyin )
 | ||
|                  error( ... );
 | ||
|      
 | ||
|              yy_switch_to_buffer(
 | ||
|                  yy_create_buffer( yyin, YY_BUF_SIZE ) );
 | ||
|      
 | ||
|              BEGIN(INITIAL);
 | ||
|              }
 | ||
|      
 | ||
|      <<EOF>> {
 | ||
|              if ( --include_stack_ptr < 0 )
 | ||
|                  {
 | ||
|                  yyterminate();
 | ||
|                  }
 | ||
|      
 | ||
|              else
 | ||
|                  {
 | ||
|                  yy_delete_buffer( YY_CURRENT_BUFFER );
 | ||
|                  yy_switch_to_buffer(
 | ||
|                       include_stack[include_stack_ptr] );
 | ||
|                  }
 | ||
|              }
 | ||
| 
 | ||
|    Three routines are available for setting up input buffers for
 | ||
| scanning in-memory strings instead of files.  All of them create a new
 | ||
| input buffer for scanning the string, and return a corresponding
 | ||
| `YY_BUFFER_STATE' handle (which you should delete with
 | ||
| `yy_delete_buffer()' when done with it).  They also switch to the new
 | ||
| buffer using `yy_switch_to_buffer()', so the next call to `yylex()' will
 | ||
| start scanning the string.
 | ||
| 
 | ||
| `yy_scan_string(const char *str)'
 | ||
|      scans a NUL-terminated string.
 | ||
| 
 | ||
| `yy_scan_bytes(const char *bytes, int len)'
 | ||
|      scans `len' bytes (including possibly NUL's) starting at location
 | ||
|      BYTES.
 | ||
| 
 | ||
|    Note that both of these functions create and scan a *copy* of the
 | ||
| string or bytes.  (This may be desirable, since `yylex()' modifies the
 | ||
| contents of the buffer it is scanning.) You can avoid the copy by using:
 | ||
| 
 | ||
| `yy_scan_buffer(char *base, yy_size_t size)'
 | ||
|      which scans in place the buffer starting at BASE, consisting of
 | ||
|      SIZE bytes, the last two bytes of which *must* be
 | ||
|      `YY_END_OF_BUFFER_CHAR' (ASCII NUL).  These last two bytes are not
 | ||
|      scanned; thus, scanning consists of `base[0]' through
 | ||
|      `base[size-3]', inclusive.
 | ||
| 
 | ||
|      If you fail to set up BASE in this manner (i.e., forget the final
 | ||
|      two `YY_END_OF_BUFFER_CHAR' bytes), then `yy_scan_buffer()'
 | ||
|      returns a nil pointer instead of creating a new input buffer.
 | ||
| 
 | ||
|      The type `yy_size_t' is an integral type to which you can cast an
 | ||
|      integer expression reflecting the size of the buffer.
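The buffer layout that `yy_scan_buffer()' expects can be prepared with a small helper. The following is a sketch in plain C, assuming the text to scan starts out as an ordinary NUL-terminated string; `make_scan_buffer' is a hypothetical name, not a flex routine.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies `text' into fresh storage followed by two
 * NUL bytes.  On success returns the storage and sets *size_out to the
 * total size, including both terminators.  Returns NULL if malloc fails. */
char *make_scan_buffer(const char *text, size_t *size_out)
{
    size_t len;
    char *base;

    assert(text != NULL && size_out != NULL);
    len = strlen(text);
    base = malloc(len + 2);
    if (base == NULL)
        return NULL;
    memcpy(base, text, len);
    base[len] = '\0';       /* first YY_END_OF_BUFFER_CHAR */
    base[len + 1] = '\0';   /* second YY_END_OF_BUFFER_CHAR */
    *size_out = len + 2;
    return base;
}
```

The returned pointer and size are the two values `yy_scan_buffer()' takes. Because the scanner works in place, the storage must stay valid until the buffer has been deleted with `yy_delete_buffer()'.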
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: End-of-file rules,  Next: Miscellaneous,  Prev: Multiple buffers,  Up: Top
 | ||
| 
 | ||
| End-of-file rules
 | ||
| =================
 | ||
| 
 | ||
|    The special rule "<<EOF>>" indicates actions which are to be taken
 | ||
| when an end-of-file is encountered and yywrap() returns non-zero (i.e.,
 | ||
| indicates no further files to process).  The action must finish by
 | ||
| doing one of four things:
 | ||
| 
 | ||
|    - assigning `yyin' to a new input file (in previous versions of
 | ||
|      flex, after doing the assignment you had to call the special
 | ||
|      action `YY_NEW_FILE'; this is no longer necessary);
 | ||
| 
 | ||
|    - executing a `return' statement;
 | ||
| 
 | ||
|    - executing the special `yyterminate()' action;
 | ||
| 
 | ||
|    - or, switching to a new buffer using `yy_switch_to_buffer()' as
 | ||
|      shown in the example above.
 | ||
| 
 | ||
|    <<EOF>> rules may not be used with other patterns; they may only be
 | ||
| qualified with a list of start conditions.  If an unqualified <<EOF>>
 | ||
| rule is given, it applies to *all* start conditions which do not
 | ||
| already have <<EOF>> actions.  To specify an <<EOF>> rule for only the
 | ||
| initial start condition, use
 | ||
| 
 | ||
|      <INITIAL><<EOF>>
 | ||
| 
 | ||
|    These rules are useful for catching things like unclosed comments.
 | ||
| An example:
 | ||
| 
 | ||
|      %x quote
 | ||
|      %%
 | ||
|      
 | ||
|      ...other rules for dealing with quotes...
 | ||
|      
 | ||
|      <quote><<EOF>>   {
 | ||
|               error( "unterminated quote" );
 | ||
|               yyterminate();
 | ||
|               }
 | ||
|      <<EOF>>  {
 | ||
|               if ( *++filelist )
 | ||
|                   yyin = fopen( *filelist, "r" );
 | ||
|               else
 | ||
|                   yyterminate();
 | ||
|               }
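
   The example above assigns the result of `fopen()' to `yyin' without
checking it.  A more defensive variant might move the file-list
handling into a helper such as the following sketch.  The name
`open_next_input' and the choice to skip unreadable files are
assumptions made for illustration; they are not part of flex.

```c
#include <stdio.h>

/* Hypothetical helper: *listp points at the name currently being
 * scanned in a NULL-terminated list of file names.  Advance to the
 * next name that can be opened for reading and return the stream, or
 * return NULL once the list is exhausted.  Names that cannot be
 * opened are reported on stderr and skipped.  *listp is left at the
 * entry that was opened, or at the terminating NULL. */
FILE *open_next_input(char ***listp)
{
    char **list = *listp;
    FILE *fp = NULL;

    while (fp == NULL && *list != NULL && *++list != NULL) {
        fp = fopen(*list, "r");
        if (fp == NULL)
            fprintf(stderr, "cannot open %s\n", *list);
    }

    *listp = list;
    return fp;
}
```

   An unqualified <<EOF>> action could then call
`open_next_input( &filelist )', assign a non-nil result to `yyin', and
call `yyterminate()' otherwise.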
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Miscellaneous,  Next: User variables,  Prev: End-of-file rules,  Up: Top
 | ||
| 
 | ||
| Miscellaneous macros
 | ||
| ====================
 | ||
| 
 | ||
|    The macro `YY_USER_ACTION' can be defined to provide an action which
 | ||
| is always executed prior to the matched rule's action.  For example, it
 | ||
| could be #define'd to call a routine to convert yytext to lower-case.
 | ||
| When `YY_USER_ACTION' is invoked, the variable `yy_act' gives the
 | ||
| number of the matched rule (rules are numbered starting with 1).
 | ||
| Suppose you want to profile how often each of your rules is matched.
 | ||
| The following would do the trick:
 | ||
| 
 | ||
|      #define YY_USER_ACTION ++ctr[yy_act]
 | ||
| 
 | ||
|    where `ctr' is an array to hold the counts for the different rules.
 | ||
| Note that the macro `YY_NUM_RULES' gives the total number of rules
 | ||
| (including the default rule, even if you use `-s'), so a correct
 | ||
| declaration for `ctr' is:
 | ||
| 
 | ||
|      int ctr[YY_NUM_RULES];
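
   The lower-case conversion mentioned at the start of this section
could be written roughly as follows.  The routine name
`lowercase_token' is hypothetical.  The sketch relies only on the fact
that `yytext' may be modified in place as long as it is not lengthened.

```c
#include <ctype.h>

/* Hypothetical routine: fold the first `len' characters of `text' to
 * lower case in place.  The length never changes, so this is safe to
 * apply to yytext. */
void lowercase_token(char *text, int len)
{
    for (int i = 0; i < len; ++i)
        text[i] = (char) tolower((unsigned char) text[i]);
}
```

   In the first section of the scanner it would be hooked in with a
definition along the lines of `#define YY_USER_ACTION lowercase_token(
yytext, yyleng );'.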
 | ||
| 
 | ||
|    The macro `YY_USER_INIT' may be defined to provide an action which
 | ||
| is always executed before the first scan (and before the scanner's
 | ||
| internal initializations are done).  For example, it could be used to
 | ||
| call a routine to read in a data table or open a logging file.
 | ||
| 
 | ||
|    The macro `yy_set_interactive(is_interactive)' can be used to
 | ||
| control whether the current buffer is considered *interactive*.  An
 | ||
| interactive buffer is processed more slowly, but must be used when the
 | ||
| scanner's input source is indeed interactive to avoid problems due to
 | ||
| waiting to fill buffers (see the discussion of the `-I' flag below).  A
 | ||
| non-zero value in the macro invocation marks the buffer as interactive,
 | ||
| a zero value as non-interactive.  Note that use of this macro overrides
 | ||
| `%option always-interactive' or `%option never-interactive' (see
 | ||
| Options below).  `yy_set_interactive()' must be invoked prior to
 | ||
| beginning to scan the buffer that is (or is not) to be considered
 | ||
| interactive.
 | ||
| 
 | ||
|    The macro `yy_set_bol(at_bol)' can be used to control whether the
 | ||
| current buffer's scanning context for the next token match is done as
 | ||
| though at the beginning of a line.  A non-zero macro argument makes
 | ||
| rules anchored with '^' active, while a zero argument makes '^' rules inactive.
 | ||
| 
 | ||
|    The macro `YY_AT_BOL()' returns true if the next token scanned from
 | ||
| the current buffer will have '^' rules active, false otherwise.
 | ||
| 
 | ||
|    In the generated scanner, the actions are all gathered in one large
 | ||
| switch statement and separated using `YY_BREAK', which may be
 | ||
| redefined.  By default, it is simply a "break", to separate each rule's
 | ||
| action from the following rule's.  Redefining `YY_BREAK' allows, for
 | ||
| example, C++ users to #define YY_BREAK to do nothing (while being very
 | ||
| careful that every rule ends with a "break" or a "return"!) to avoid
 | ||
| suffering from unreachable-statement warnings where, because a rule's
 | ||
| action ends with "return", the `YY_BREAK' is inaccessible.
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: User variables,  Next: YACC interface,  Prev: Miscellaneous,  Up: Top
 | ||
| 
 | ||
| Values available to the user
 | ||
| ============================
 | ||
| 
 | ||
|    This section summarizes the various values available to the user in
 | ||
| the rule actions.
 | ||
| 
 | ||
|    - `char *yytext' holds the text of the current token.  It may be
 | ||
|      modified but not lengthened (you cannot append characters to the
 | ||
|      end).
 | ||
| 
 | ||
|      If the special directive `%array' appears in the first section of
 | ||
|      the scanner description, then `yytext' is instead declared `char
 | ||
|      yytext[YYLMAX]', where `YYLMAX' is a macro definition that you can
 | ||
|      redefine in the first section if you don't like the default value
 | ||
|      (generally 8KB).  Using `%array' results in somewhat slower
 | ||
|      scanners, but the value of `yytext' becomes immune to calls to
 | ||
|      `input()' and `unput()', which potentially destroy its value when
 | ||
|      `yytext' is a character pointer.  The opposite of `%array' is
 | ||
|      `%pointer', which is the default.
 | ||
| 
 | ||
|      You cannot use `%array' when generating C++ scanner classes (the
 | ||
|      `-+' flag).
 | ||
| 
 | ||
|    - `int yyleng' holds the length of the current token.
 | ||
| 
 | ||
|    - `FILE *yyin' is the file which by default `flex' reads from.  It
 | ||
|      may be redefined but doing so only makes sense before scanning
 | ||
|      begins or after an EOF has been encountered.  Changing it in the
 | ||
|      midst of scanning will have unexpected results since `flex'
 | ||
|      buffers its input; use `yyrestart()' instead.  Once scanning
 | ||
|      terminates because an end-of-file has been seen, you can assign
 | ||
|      `yyin' to the new input file and then call the scanner again to
 | ||
|      continue scanning.
 | ||
| 
 | ||
|    - `void yyrestart( FILE *new_file )' may be called to point `yyin'
 | ||
|      at the new input file.  The switch-over to the new file is
 | ||
|      immediate (any previously buffered-up input is lost).  Note that
 | ||
|      calling `yyrestart()' with `yyin' as an argument thus throws away
 | ||
|      the current input buffer and continues scanning the same input
 | ||
|      file.
 | ||
| 
 | ||
|    - `FILE *yyout' is the file to which `ECHO' actions are done.  It
 | ||
|      can be reassigned by the user.
 | ||
| 
 | ||
|    - `YY_CURRENT_BUFFER' returns a `YY_BUFFER_STATE' handle to the
 | ||
|      current buffer.
 | ||
| 
 | ||
|    - `YY_START' returns an integer value corresponding to the current
 | ||
|      start condition.  You can subsequently use this value with `BEGIN'
 | ||
|      to return to that start condition.
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: YACC interface,  Next: Options,  Prev: User variables,  Up: Top
 | ||
| 
 | ||
| Interfacing with `yacc'
 | ||
| =======================
 | ||
| 
 | ||
|    One of the main uses of `flex' is as a companion to the `yacc'
 | ||
| parser-generator.  `yacc' parsers expect to call a routine named
 | ||
| `yylex()' to find the next input token.  The routine is supposed to
 | ||
| return the type of the next token as well as putting any associated
 | ||
| value in the global `yylval'.  To use `flex' with `yacc', one specifies
 | ||
| the `-d' option to `yacc' to instruct it to generate the file `y.tab.h'
 | ||
| containing definitions of all the `%tokens' appearing in the `yacc'
 | ||
| input.  This file is then included in the `flex' scanner.  For example,
 | ||
| if one of the tokens is "TOK_NUMBER", part of the scanner might look
 | ||
| like:
 | ||
| 
 | ||
|      %{
 | ||
|      #include "y.tab.h"
 | ||
|      %}
 | ||
|      
 | ||
|      %%
 | ||
|      
 | ||
|      [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Options,  Next: Performance,  Prev: YACC interface,  Up: Top
 | ||
| 
 | ||
| Options
 | ||
| =======
 | ||
| 
 | ||
|    `flex' has the following options:
 | ||
| 
 | ||
| `-b'
 | ||
|      Generate backing-up information to `lex.backup'.  This is a list
 | ||
|      of scanner states which require backing up and the input
 | ||
|      characters on which they do so.  By adding rules one can remove
 | ||
|      backing-up states.  If *all* backing-up states are eliminated and
 | ||
|      `-Cf' or `-CF' is used, the generated scanner will run faster (see
 | ||
|      the `-p' flag).  Only users who wish to squeeze every last cycle
 | ||
|      out of their scanners need worry about this option.  (See the
 | ||
|      section on Performance Considerations below.)
 | ||
| 
 | ||
| `-c'
 | ||
|      is a do-nothing, deprecated option included for POSIX compliance.
 | ||
| 
 | ||
| `-d'
 | ||
|      makes the generated scanner run in "debug" mode.  Whenever a
 | ||
|      pattern is recognized and the global `yy_flex_debug' is non-zero
 | ||
|      (which is the default), the scanner will write to `stderr' a line
 | ||
|      of the form:
 | ||
| 
 | ||
|           --accepting rule at line 53 ("the matched text")
 | ||
| 
 | ||
|      The line number refers to the location of the rule in the file
 | ||
|      defining the scanner (i.e., the file that was fed to flex).
 | ||
|      Messages are also generated when the scanner backs up, accepts the
 | ||
|      default rule, reaches the end of its input buffer (or encounters a
 | ||
|      NUL; at this point, the two look the same as far as the scanner's
 | ||
|      concerned), or reaches an end-of-file.
 | ||
| 
 | ||
| `-f'
 | ||
|      specifies "fast scanner".  No table compression is done and stdio
 | ||
|      is bypassed.  The result is large but fast.  This option is
 | ||
|      equivalent to `-Cfr' (see below).
 | ||
| 
 | ||
| `-h'
 | ||
|      generates a "help" summary of `flex's' options to `stdout' and
 | ||
|      then exits.  `-?' and `--help' are synonyms for `-h'.
 | ||
| 
 | ||
| `-i'
 | ||
|      instructs `flex' to generate a *case-insensitive* scanner.  The
 | ||
|      case of letters given in the `flex' input patterns will be
 | ||
|      ignored, and tokens in the input will be matched regardless of
 | ||
|      case.  The matched text given in `yytext' will have the preserved
 | ||
|      case (i.e., it will not be folded).
 | ||
| 
 | ||
| `-l'
 | ||
|      turns on maximum compatibility with the original AT&T `lex'
 | ||
|      implementation.  Note that this does not mean *full*
 | ||
|      compatibility.  Use of this option costs a considerable amount of
 | ||
|      performance, and it cannot be used with the `-+', `-f', `-F', `-Cf', or
 | ||
|      `-CF' options.  For details on the compatibilities it provides, see
 | ||
|      the section "Incompatibilities With Lex And POSIX" below.  This
 | ||
|      option also results in the name `YY_FLEX_LEX_COMPAT' being
 | ||
|      #define'd in the generated scanner.
 | ||
| 
 | ||
| `-n'
 | ||
|      is another do-nothing, deprecated option included only for POSIX
 | ||
|      compliance.
 | ||
| 
 | ||
| `-p'
 | ||
|      generates a performance report to stderr.  The report consists of
 | ||
|      comments regarding features of the `flex' input file which will
 | ||
|      cause a serious loss of performance in the resulting scanner.  If
 | ||
|      you give the flag twice, you will also get comments regarding
 | ||
|      features that lead to minor performance losses.
 | ||
| 
 | ||
|      Note that the use of `REJECT', `%option yylineno' and variable
 | ||
|      trailing context (see the Deficiencies / Bugs section below)
 | ||
|      entails a substantial performance penalty; use of `yymore()', the
 | ||
|      `^' operator, and the `-I' flag entail minor performance penalties.
 | ||
| 
 | ||
| `-s'
 | ||
|      causes the "default rule" (that unmatched scanner input is echoed
 | ||
|      to `stdout') to be suppressed.  If the scanner encounters input
 | ||
|      that does not match any of its rules, it aborts with an error.
 | ||
|      This option is useful for finding holes in a scanner's rule set.
 | ||
| 
 | ||
| `-t'
 | ||
|      instructs `flex' to write the scanner it generates to standard
 | ||
|      output instead of `lex.yy.c'.
 | ||
| 
 | ||
| `-v'
 | ||
|      specifies that `flex' should write to `stderr' a summary of
 | ||
|      statistics regarding the scanner it generates.  Most of the
 | ||
|      statistics are meaningless to the casual `flex' user, but the
 | ||
|      first line identifies the version of `flex' (same as reported by
 | ||
|      `-V'), and the next line the flags used when generating the
 | ||
|      scanner, including those that are on by default.
 | ||
| 
 | ||
| `-w'
 | ||
|      suppresses warning messages.
 | ||
| 
 | ||
| `-B'
 | ||
|      instructs `flex' to generate a *batch* scanner, the opposite of
 | ||
|      *interactive* scanners generated by `-I' (see below).  In general,
 | ||
|      you use `-B' when you are *certain* that your scanner will never
 | ||
|      be used interactively, and you want to squeeze a *little* more
 | ||
|      performance out of it.  If your goal is instead to squeeze out a
 | ||
|      *lot* more performance, you should be using the `-Cf' or `-CF'
 | ||
|      options (discussed below), which turn on `-B' automatically anyway.
 | ||
| 
 | ||
| `-F'
 | ||
|      specifies that the "fast" scanner table representation should be
 | ||
|      used (and stdio bypassed).  This representation is about as fast
 | ||
|      as the full table representation (`-f'), and for some sets of
 | ||
|      patterns will be considerably smaller (and for others, larger).
 | ||
|      In general, if the pattern set contains both "keywords" and a
 | ||
|      catch-all, "identifier" rule, such as in the set:
 | ||
| 
 | ||
|           "case"    return TOK_CASE;
 | ||
|           "switch"  return TOK_SWITCH;
 | ||
|           ...
 | ||
|           "default" return TOK_DEFAULT;
 | ||
|           [a-z]+    return TOK_ID;
 | ||
| 
 | ||
|      then you're better off using the full table representation.  If
 | ||
|      only the "identifier" rule is present and you then use a hash
 | ||
|      table or some such to detect the keywords, you're better off using
 | ||
|      `-F'.
 | ||
| 
 | ||
|      This option is equivalent to `-CFr' (see below).  It cannot be
 | ||
|      used with `-+'.
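
     The "identifier rule plus keyword lookup" arrangement might look
     like the sketch below.  It uses a binary search over a sorted
     table rather than a hash table, purely to keep the example short.
     The token values and the names `classify_identifier' and
     `keywords' are hypothetical; in a real scanner the token codes
     would normally come from `y.tab.h'.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes, standing in for those from y.tab.h. */
enum { TOK_ID = 258, TOK_CASE, TOK_DEFAULT, TOK_SWITCH };

struct keyword {
    const char *name;
    int token;
};

/* Must stay sorted by name, because bsearch() requires it. */
static const struct keyword keywords[] = {
    { "case",    TOK_CASE },
    { "default", TOK_DEFAULT },
    { "switch",  TOK_SWITCH },
};

/* bsearch() comparison: the key is a string, the entry a keyword. */
static int compare_keyword(const void *key, const void *entry)
{
    return strcmp((const char *) key,
                  ((const struct keyword *) entry)->name);
}

/* Return the keyword's token code, or TOK_ID for anything else. */
int classify_identifier(const char *text)
{
    const struct keyword *hit =
        bsearch(text, keywords,
                sizeof keywords / sizeof keywords[0],
                sizeof keywords[0], compare_keyword);

    return hit != NULL ? hit->token : TOK_ID;
}
```

     The single remaining rule would then read something like
     `[a-z]+    return classify_identifier( yytext );'.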
 | ||
| 
 | ||
| `-I'
 | ||
|      instructs `flex' to generate an *interactive* scanner.  An
 | ||
|      interactive scanner is one that only looks ahead to decide what
 | ||
|      token has been matched if it absolutely must.  It turns out that
 | ||
|      always looking one extra character ahead, even if the scanner has
 | ||
|      already seen enough text to disambiguate the current token, is a
 | ||
|      bit faster than only looking ahead when necessary.  But scanners
 | ||
|      that always look ahead give dreadful interactive performance; for
 | ||
|      example, when a user types a newline, it is not recognized as a
 | ||
|      newline token until they enter *another* token, which often means
 | ||
|      typing in another whole line.
 | ||
| 
 | ||
|      `Flex' scanners default to *interactive* unless you use the `-Cf'
 | ||
|      or `-CF' table-compression options (see below).  That's because if
 | ||
|      you're looking for high-performance you should be using one of
 | ||
|      these options, so if you didn't, `flex' assumes you'd rather trade
 | ||
|      off a bit of run-time performance for intuitive interactive
 | ||
|      behavior.  Note also that you *cannot* use `-I' in conjunction
 | ||
|      with `-Cf' or `-CF'.  Thus, this option is not really needed; it
 | ||
|      is on by default for all those cases in which it is allowed.
 | ||
| 
 | ||
|      You can force a scanner to *not* be interactive by using `-B' (see
 | ||
|      above).
 | ||
| 
 | ||
| `-L'
 | ||
|      instructs `flex' not to generate `#line' directives.  Without this
 | ||
|      option, `flex' peppers the generated scanner with #line directives
 | ||
|      so error messages in the actions will be correctly located with
 | ||
|      respect to either the original `flex' input file (if the errors
 | ||
|      are due to code in the input file), or `lex.yy.c' (if the errors
 | ||
|      are `flex's' fault - you should report these sorts of errors to
 | ||
|      the email address given below).
 | ||
| 
 | ||
| `-T'
 | ||
|      makes `flex' run in `trace' mode.  It will generate a lot of
 | ||
|      messages to `stderr' concerning the form of the input and the
 | ||
|      resultant non-deterministic and deterministic finite automata.
 | ||
|      This option is mostly for use in maintaining `flex'.
 | ||
| 
 | ||
| `-V'
 | ||
|      prints the version number to `stdout' and exits.  `--version' is a
 | ||
|      synonym for `-V'.
 | ||
| 
 | ||
| `-7'
 | ||
|      instructs `flex' to generate a 7-bit scanner, i.e., one which can
 | ||
|      only recognize 7-bit characters in its input.  The advantage of
 | ||
|      using `-7' is that the scanner's tables can be up to half the size
 | ||
|      of those generated using the `-8' option (see below).  The
 | ||
|      disadvantage is that such scanners often hang or crash if their
 | ||
|      input contains an 8-bit character.
 | ||
| 
 | ||
|      Note, however, that unless you generate your scanner using the
 | ||
|      `-Cf' or `-CF' table compression options, use of `-7' will save
 | ||
|      only a small amount of table space, and make your scanner
 | ||
|      considerably less portable.  `Flex's' default behavior is to
 | ||
|      generate an 8-bit scanner unless you use `-Cf' or `-CF', in
 | ||
|      which case `flex' defaults to generating 7-bit scanners unless
 | ||
|      your site was always configured to generate 8-bit scanners (as
 | ||
|      will often be the case with non-USA sites).  You can tell whether
 | ||
|      flex generated a 7-bit or an 8-bit scanner by inspecting the flag
 | ||
|      summary in the `-v' output as described above.
 | ||
| 
 | ||
|      Note that if you use `-Cfe' or `-CFe' (those table compression
 | ||
|      options, but also using equivalence classes as discussed
 | ||
|      below), flex still defaults to generating an 8-bit scanner, since
 | ||
|      usually with these compression options full 8-bit tables are not
 | ||
|      much more expensive than 7-bit tables.
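
     Since a 7-bit scanner may misbehave on 8-bit input, a program
     that reads untrusted data might check it first.  The following is
     a minimal sketch of such a check; the name `first_8bit_byte' is
     hypothetical and nothing like it is provided by flex.

```c
#include <stddef.h>

/* Hypothetical helper: return the offset of the first byte whose high
 * bit is set, or `len' if every byte fits in 7 bits. */
size_t first_8bit_byte(const char *data, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        if ((unsigned char) data[i] > 0x7F)
            return i;

    return len;
}
```

     A caller would reject the input, or fall back to an 8-bit
     scanner, whenever the returned offset is less than `len'.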
 | ||
| 
 | ||
| `-8'
 | ||
|      instructs `flex' to generate an 8-bit scanner, i.e., one which can
 | ||
|      recognize 8-bit characters.  This flag is only needed for scanners
 | ||
|      generated using `-Cf' or `-CF', as otherwise flex defaults to
 | ||
|      generating an 8-bit scanner anyway.
 | ||
| 
 | ||
|      See the discussion of `-7' above for flex's default behavior and
 | ||
|      the tradeoffs between 7-bit and 8-bit scanners.
 | ||
| 
 | ||
| `-+'
 | ||
|      specifies that you want flex to generate a C++ scanner class.  See
 | ||
|      the section on Generating C++ Scanners below for details.
 | ||
| 
 | ||
| `-C[aefFmr]'
 | ||
|      controls the degree of table compression and, more generally,
 | ||
|      trade-offs between small scanners and fast scanners.
 | ||
| 
 | ||
|      `-Ca' ("align") instructs flex to trade off larger tables in the
 | ||
|      generated scanner for faster performance because the elements of
 | ||
|      the tables are better aligned for memory access and computation.
 | ||
|      On some RISC architectures, fetching and manipulating long-words
 | ||
|      is more efficient than with smaller-sized units such as
 | ||
|      shortwords.  This option can double the size of the tables used by
 | ||
|      your scanner.
 | ||
| 
 | ||
|      `-Ce' directs `flex' to construct "equivalence classes", i.e.,
 | ||
|      sets of characters which have identical lexical properties (for
 | ||
|      example, if the only appearance of digits in the `flex' input is
 | ||
|      in the character class "[0-9]" then the digits '0', '1', ..., '9'
 | ||
|      will all be put in the same equivalence class).  Equivalence
 | ||
|      classes usually give dramatic reductions in the final table/object
 | ||
|      file sizes (typically a factor of 2-5) and are pretty cheap
 | ||
|      performance-wise (one array look-up per character scanned).
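
     The idea behind equivalence classes can be illustrated with a
     simplified sketch.  This is not flex's actual algorithm: it
     assumes the patterns have already been reduced to a list of
     character sets, and it groups characters that belong to exactly
     the same sets.  All names here are hypothetical.

```c
#include <stddef.h>

#define NCHARS 256

/* Mark the characters lo..hi (inclusive) as members of `set'. */
void add_range(unsigned char set[NCHARS], int lo, int hi)
{
    for (int c = lo; c <= hi; ++c)
        set[c] = 1;
}

/* Assign every byte value an equivalence-class number.  sets[s][c] is
 * non-zero when character c belongs to set s.  Two characters share a
 * class exactly when they belong to the same sets.  Class numbers
 * start at 1 and are written to ec[]; the number of classes is
 * returned. */
int build_equivalence_classes(unsigned char sets[][NCHARS], size_t nsets,
                              int ec[NCHARS])
{
    int nclasses = 0;

    for (int c = 0; c < NCHARS; ++c) {
        ec[c] = 0;

        /* Look for an earlier character with identical membership. */
        for (int prev = 0; prev < c && ec[c] == 0; ++prev) {
            size_t s = 0;

            while (s < nsets && !sets[s][c] == !sets[s][prev])
                ++s;
            if (s == nsets)
                ec[c] = ec[prev];
        }

        if (ec[c] == 0)
            ec[c] = ++nclasses;   /* first character of a new class */
    }

    return nclasses;
}
```

     With the two sets "[0-9]" and "[a-z]", this yields three classes:
     the digits, the lower-case letters, and everything else.  The
     transition tables then need one column per class rather than one
     per character, which is where the size reduction comes from.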
 | ||
| 
 | ||
|      `-Cf' specifies that the *full* scanner tables should be generated
 | ||
|      - `flex' should not compress the tables by taking advantage of
 | ||
|      similar transition functions for different states.
 | ||
| 
 | ||
|      `-CF' specifies that the alternate fast scanner representation
 | ||
|      (described above under the `-F' flag) should be used.  This option
 | ||
|      cannot be used with `-+'.
 | ||
| 
 | ||
|      `-Cm' directs `flex' to construct "meta-equivalence classes",
 | ||
|      which are sets of equivalence classes (or characters, if
 | ||
|      equivalence classes are not being used) that are commonly used
 | ||
|      together.  Meta-equivalence classes are often a big win when using
 | ||
|      compressed tables, but they have a moderate performance impact
 | ||
|      (one or two "if" tests and one array look-up per character
 | ||
|      scanned).
 | ||
| 
 | ||
|      `-Cr' causes the generated scanner to *bypass* use of the standard
 | ||
|      I/O library (stdio) for input.  Instead of calling `fread()' or
 | ||
|      `getc()', the scanner will use the `read()' system call, resulting
 | ||
|      in a performance gain which varies from system to system, but in
 | ||
|      general is probably negligible unless you are also using `-Cf' or
 | ||
|      `-CF'.  Using `-Cr' can cause strange behavior if, for example,
 | ||
|      you read from `yyin' using stdio prior to calling the scanner
 | ||
|      (because the scanner will miss whatever text your previous reads
 | ||
|      left in the stdio input buffer).
 | ||
| 
 | ||
|      `-Cr' has no effect if you define `YY_INPUT' (see The Generated
 | ||
|      Scanner above).
 | ||
| 
 | ||
|      A lone `-C' specifies that the scanner tables should be compressed
 | ||
|      but neither equivalence classes nor meta-equivalence classes
 | ||
|      should be used.
 | ||
| 
 | ||
|      The options `-Cf' or `-CF' and `-Cm' do not make sense together -
 | ||
|      there is no opportunity for meta-equivalence classes if the table
 | ||
|      is not being compressed.  Otherwise the options may be freely
 | ||
|      mixed, and are cumulative.
 | ||
| 
 | ||
|      The default setting is `-Cem', which specifies that `flex' should
 | ||
|      generate equivalence classes and meta-equivalence classes.  This
 | ||
|      setting provides the highest degree of table compression.  You can
 | ||
|      trade off faster-executing scanners at the cost of larger tables
 | ||
|      with the following generally being true:
 | ||
| 
 | ||
|           slowest & smallest
 | ||
|                 -Cem
 | ||
|                 -Cm
 | ||
|                 -Ce
 | ||
|                 -C
 | ||
|                 -C{f,F}e
 | ||
|                 -C{f,F}
 | ||
|                 -C{f,F}a
 | ||
|           fastest & largest
 | ||
| 
 | ||
|      Note that scanners with the smallest tables are usually generated
 | ||
|      and compiled the quickest, so during development you will usually
 | ||
|      want to use the default, maximal compression.
 | ||
| 
 | ||
|      `-Cfe' is often a good compromise between speed and size for
 | ||
|      production scanners.
 | ||
| 
 | ||
| `-ooutput'
 | ||
|      directs flex to write the scanner to the file `output' instead
 | ||
|      of `lex.yy.c'.  If you combine `-o' with the `-t' option, then the
 | ||
|      scanner is written to `stdout' but its `#line' directives (see the
 | ||
|      `-L' option above) refer to the file `output'.
 | ||
| 
 | ||
| `-Pprefix'
 | ||
|      changes the default `yy' prefix used by `flex' for all
 | ||
|      globally-visible variable and function names to instead be PREFIX.
 | ||
|      For example, `-Pfoo' changes the name of `yytext' to `footext'.
 | ||
|      It also changes the name of the default output file from
 | ||
|      `lex.yy.c' to `lex.foo.c'.  Here are all of the names affected:
 | ||
| 
 | ||
|           yy_create_buffer
 | ||
|           yy_delete_buffer
 | ||
|           yy_flex_debug
 | ||
|           yy_init_buffer
 | ||
|           yy_flush_buffer
 | ||
|           yy_load_buffer_state
 | ||
|           yy_switch_to_buffer
 | ||
|           yyin
 | ||
|           yyleng
 | ||
|           yylex
 | ||
|           yylineno
 | ||
|           yyout
 | ||
|           yyrestart
 | ||
|           yytext
 | ||
|           yywrap
 | ||
| 
 | ||
|      (If you are using a C++ scanner, then only `yywrap' and
 | ||
|      `yyFlexLexer' are affected.) Within your scanner itself, you can
 | ||
|      still refer to the global variables and functions using either
 | ||
|      version of their name; but externally, they have the modified name.
 | ||
| 
 | ||
|      This option lets you easily link together multiple `flex' programs
 | ||
|      into the same executable.  Note, though, that using this option
 | ||
|      also renames `yywrap()', so you now *must* either provide your own
 | ||
|      (appropriately-named) version of the routine for your scanner, or
 | ||
|      use `%option noyywrap', as linking with `-lfl' no longer provides
 | ||
|      one for you by default.
 | ||
| 
 | ||
| `-Sskeleton_file'
 | ||
|      overrides the default skeleton file from which `flex' constructs
 | ||
|      its scanners.  You'll never need this option unless you are doing
 | ||
|      `flex' maintenance or development.
 | ||
| 
 | ||
|    `flex' also provides a mechanism for controlling options within the
 | ||
| scanner specification itself, rather than from the flex command-line.
 | ||
| This is done by including `%option' directives in the first section of
 | ||
| the scanner specification.  You can specify multiple options with a
 | ||
| single `%option' directive, and multiple directives in the first
 | ||
| section of your flex input file.  Most options are given simply as
 | ||
| names, optionally preceded by the word "no" (with no intervening
 | ||
| whitespace) to negate their meaning.  A number are equivalent to flex
 | ||
| flags or their negation:
 | ||
| 
 | ||
|      7bit            -7 option
 | ||
|      8bit            -8 option
 | ||
|      align           -Ca option
 | ||
|      backup          -b option
 | ||
|      batch           -B option
 | ||
|      c++             -+ option
 | ||
|      
 | ||
|      caseful or
 | ||
|      case-sensitive  opposite of -i (default)
 | ||
|      
 | ||
|      case-insensitive or
 | ||
|      caseless        -i option
 | ||
|      
 | ||
|      debug           -d option
 | ||
|      default         opposite of -s option
 | ||
|      ecs             -Ce option
 | ||
|      fast            -F option
 | ||
|      full            -f option
 | ||
|      interactive     -I option
 | ||
|      lex-compat      -l option
 | ||
|      meta-ecs        -Cm option
 | ||
|      perf-report     -p option
 | ||
|      read            -Cr option
 | ||
|      stdout          -t option
 | ||
|      verbose         -v option
 | ||
|      warn            opposite of -w option
 | ||
|                      (use "%option nowarn" for -w)
 | ||
|      
 | ||
|      array           equivalent to "%array"
 | ||
|      pointer         equivalent to "%pointer" (default)
 | ||
| 
 | ||
|    Some `%option's' provide features otherwise not available:
 | ||
| 
 | ||
| `always-interactive'
 | ||
|      instructs flex to generate a scanner which always considers its
 | ||
|      input "interactive".  Normally, on each new input file the scanner
 | ||
|      calls `isatty()' in an attempt to determine whether the scanner's
 | ||
|      input source is interactive and thus should be read a character at
 | ||
|      a time.  When this option is used, however, then no such call is
 | ||
|      made.
 | ||
| 
 | ||
| `main'
 | ||
|      directs flex to provide a default `main()' program for the
 | ||
|      scanner, which simply calls `yylex()'.  This option implies
 | ||
|      `noyywrap' (see below).
 | ||
| 
 | ||
| `never-interactive'
 | ||
|      instructs flex to generate a scanner which never considers its
 | ||
|      input "interactive" (again, no call made to `isatty()').  This is
 | ||
|      the opposite of `always-interactive'.
 | ||
| 
 | ||
| `stack'
 | ||
|      enables the use of start condition stacks (see Start Conditions
 | ||
|      above).
 | ||
| 
 | ||
| `stdinit'
 | ||
|      if unset (i.e., `%option nostdinit') initializes `yyin' and
 | ||
|      `yyout' to nil `FILE' pointers, instead of `stdin' and `stdout'.
 | ||
| 
 | ||
| `yylineno'
 | ||
|      directs `flex' to generate a scanner that maintains the number of
 | ||
|      the current line read from its input in the global variable
 | ||
|      `yylineno'.  This option is implied by `%option lex-compat'.
 | ||
| 
 | ||
| `yywrap'
 | ||
|      if unset (i.e., `%option noyywrap'), makes the scanner not call
 | ||
|      `yywrap()' upon an end-of-file, but simply assume that there are
 | ||
|      no more files to scan (until the user points `yyin' at a new file
 | ||
|      and calls `yylex()' again).
 | ||
| 
 | ||
|    `flex' scans your rule actions to determine whether you use the
 | ||
| `REJECT' or `yymore()' features.  The `reject' and `yymore' options are
 | ||
| available to override its decision as to whether you use the options,
 | ||
| either by setting them (e.g., `%option reject') to indicate the feature
 | ||
| is indeed used, or unsetting them to indicate it actually is not used
 | ||
| (e.g., `%option noyymore').
 | ||
| 
 | ||
|    Three options take string-delimited values, offset with '=':
 | ||
| 
 | ||
|      %option outfile="ABC"
 | ||
| 
 | ||
| is equivalent to `-oABC', and
 | ||
| 
 | ||
|      %option prefix="XYZ"
 | ||
| 
 | ||
| is equivalent to `-PXYZ'.
 | ||
| 
 | ||
|    Finally,
 | ||
| 
 | ||
|      %option yyclass="foo"
 | ||
| 
 | ||
| only applies when generating a C++ scanner (`-+' option).  It informs
 | ||
| `flex' that you have derived `foo' as a subclass of `yyFlexLexer' so
 | ||
| `flex' will place your actions in the member function `foo::yylex()'
 | ||
| instead of `yyFlexLexer::yylex()'.  It also generates a
 | ||
| `yyFlexLexer::yylex()' member function that emits a run-time error (by
 | ||
| invoking `yyFlexLexer::LexerError()') if called.  See Generating C++
 | ||
| Scanners, below, for additional information.
 | ||
| 
 | ||
|    A number of options are available for lint purists who want to
 | ||
| suppress the appearance of unneeded routines in the generated scanner.
 | ||
| Each of the following, if unset, results in the corresponding routine
 | ||
| not appearing in the generated scanner:
 | ||
| 
 | ||
|      input, unput
 | ||
|      yy_push_state, yy_pop_state, yy_top_state
 | ||
|      yy_scan_buffer, yy_scan_bytes, yy_scan_string
 | ||
| 
 | ||
| (though `yy_push_state()' and friends won't appear anyway unless you
 | ||
| use `%option stack').
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Performance,  Next: C++,  Prev: Options,  Up: Top
 | ||
| 
 | ||
| Performance considerations
 | ||
| ==========================
 | ||
| 
 | ||
|    The main design goal of `flex' is that it generate high-performance
 | ||
| scanners.  It has been optimized for dealing well with large sets of
 | ||
| rules.  Aside from the effects on scanner speed of the table
 | ||
| compression `-C' options outlined above, there are a number of
 | ||
| options/actions which degrade performance.  These are, from most
 | ||
| expensive to least:
 | ||
| 
 | ||
|      REJECT
 | ||
|      %option yylineno
 | ||
|      arbitrary trailing context
 | ||
|      
 | ||
|      pattern sets that require backing up
 | ||
|      %array
 | ||
|      %option interactive
 | ||
|      %option always-interactive
 | ||
|      
 | ||
|      '^' beginning-of-line operator
 | ||
|      yymore()
 | ||
| 
 | ||
|    with the first three all being quite expensive and the last two
 | ||
| being quite cheap.  Note also that `unput()' is implemented as a
 | ||
| routine call that potentially does quite a bit of work, while
 | ||
| `yyless()' is a quite-cheap macro; so if you are just putting back excess
 | ||
| text you scanned, use `yyless()'.
 | ||
| 
 | ||
|    `REJECT' should be avoided at all costs when performance is
 | ||
| important.  It is a particularly expensive feature.
 | ||
| 
 | ||
|    Getting rid of backing up is messy and often may be an enormous
 | ||
| amount of work for a complicated scanner.  In principle, one begins by
 | ||
| using the `-b' flag to generate a `lex.backup' file.  For example, on
 | ||
| the input
 | ||
| 
 | ||
|      %%
 | ||
|      foo        return TOK_KEYWORD;
 | ||
|      foobar     return TOK_KEYWORD;
 | ||
| 
 | ||
| the file looks like:
 | ||
| 
 | ||
|      State #6 is non-accepting -
 | ||
|       associated rule line numbers:
 | ||
|             2       3
 | ||
|       out-transitions: [ o ]
 | ||
|       jam-transitions: EOF [ \001-n  p-\177 ]
 | ||
|      
 | ||
|      State #8 is non-accepting -
 | ||
|       associated rule line numbers:
 | ||
|             3
 | ||
|       out-transitions: [ a ]
 | ||
|       jam-transitions: EOF [ \001-`  b-\177 ]
 | ||
|      
 | ||
|      State #9 is non-accepting -
 | ||
|       associated rule line numbers:
 | ||
|             3
 | ||
|       out-transitions: [ r ]
 | ||
|       jam-transitions: EOF [ \001-q  s-\177 ]
 | ||
|      
 | ||
|      Compressed tables always back up.
 | ||
| 
 | ||
|    The first few lines tell us that there's a scanner state in which it
 | ||
| can make a transition on an 'o' but not on any other character, and
 | ||
| that in that state the currently scanned text does not match any rule.
 | ||
| The state occurs when trying to match the rules found at lines 2 and 3
 | ||
| in the input file.  If the scanner is in that state and then reads
 | ||
| something other than an 'o', it will have to back up to find a rule
 | ||
| which is matched.  With a bit of head-scratching one can see that this
 | ||
| must be the state it's in when it has seen "fo".  When this has
 | ||
| happened, if anything other than another 'o' is seen, the scanner will
 | ||
| have to back up to simply match the 'f' (by the default rule).
 | ||
| 
 | ||
|    The comment regarding State #8 indicates there's a problem when
 | ||
| "foob" has been scanned.  Indeed, on any character other than an 'a',
 | ||
| the scanner will have to back up to accept "foo".  Similarly, the
 | ||
| comment for State #9 concerns when "fooba" has been scanned and an 'r'
 | ||
| does not follow.
 | ||
| 
 | ||
|    The final comment reminds us that there's no point going to all the
 | ||
| trouble of removing backing up from the rules unless we're using `-Cf'
 | ||
| or `-CF', since there's no performance gain doing so with compressed
 | ||
| scanners.
 | ||
| 
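
   As a rough illustration of what these messages describe, the
following C sketch imitates longest-match scanning by hand.  It is not
code that `flex' generates: the names `scan_one', `begins_rule',
`is_rule' and `struct scan_result' are invented for this example, and
the rule set is modeled as a list of literal strings plus the default
rule, which accepts any single character.  The `backed_up' count is the
number of characters read beyond the last accepting point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Outcome of one longest-match attempt (hypothetical type). */
struct scan_result {
    size_t matched;   /* length of the token finally accepted */
    size_t backed_up; /* characters read past the last accepting point */
};

/* Nonzero if the first n characters of text begin some rule. */
static int begins_rule(const char *const *rules, size_t nrules,
                       const char *text, size_t n)
{
    size_t i;

    for (i = 0; i < nrules; ++i)
        if (strlen(rules[i]) >= n && strncmp(rules[i], text, n) == 0)
            return 1;
    return 0;
}

/* Nonzero if the first n characters of text are exactly some rule. */
static int is_rule(const char *const *rules, size_t nrules,
                   const char *text, size_t n)
{
    size_t i;

    for (i = 0; i < nrules; ++i)
        if (strlen(rules[i]) == n && strncmp(rules[i], text, n) == 0)
            return 1;
    return 0;
}

/* Scan one token from the start of a non-empty input string. */
static struct scan_result scan_one(const char *const *rules, size_t nrules,
                                   const char *input)
{
    struct scan_result r;
    size_t len = strlen(input);
    size_t pos = 1; /* the default rule accepts the first character */

    assert(len > 0);
    r.matched = 1;

    /* Keep reading while the text so far can still grow into a rule. */
    while (pos < len && begins_rule(rules, nrules, input, pos + 1)) {
        ++pos;
        if (is_rule(rules, nrules, input, pos))
            r.matched = pos; /* new accepting point */
    }

    /* Anything read after the last accepting point must be given back. */
    r.backed_up = pos - r.matched;
    return r;
}
```

   With the literals "foo" and "foobar", calling `scan_one' on the
input "fooba!" accepts "foo" and reports two characters of backing up.
If the literals "fooba", "foob" and "fo" are added to the list, the
same input is accepted as "fooba" and the count drops to zero.
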
 | ||
|    The way to remove the backing up is to add "error" rules:
 | ||
| 
 | ||
|      %%
 | ||
|      foo         return TOK_KEYWORD;
 | ||
|      foobar      return TOK_KEYWORD;
 | ||
|      
 | ||
|      fooba       |
 | ||
|      foob        |
 | ||
|      fo          {
 | ||
|                  /* false alarm, not really a keyword */
 | ||
|                  return TOK_ID;
 | ||
|                  }
 | ||
| 
 | ||
|    Eliminating backing up among a list of keywords can also be done
 | ||
| using a "catch-all" rule:
 | ||
| 
 | ||
|      %%
 | ||
|      foo         return TOK_KEYWORD;
 | ||
|      foobar      return TOK_KEYWORD;
 | ||
|      
 | ||
|      [a-z]+      return TOK_ID;
 | ||
| 
 | ||
|    This is usually the best solution when appropriate.
 | ||
| 
 | ||
|    Backing up messages tend to cascade.  With a complicated set of
 | ||
| rules it's not uncommon to get hundreds of messages.  If one can
 | ||
| decipher them, though, it often only takes a dozen or so rules to
 | ||
| eliminate the backing up (though it's easy to make a mistake and have
 | ||
| an error rule accidentally match a valid token.  A possible future
 | ||
| `flex' feature will be to automatically add rules to eliminate backing
 | ||
| up).
 | ||
| 
 | ||
|    It's important to keep in mind that you gain the benefits of
 | ||
| eliminating backing up only if you eliminate *every* instance of
 | ||
| backing up.  Leaving just one means you gain nothing.
 | ||
| 
 | ||
|    VARIABLE trailing context (where both the leading and trailing parts
 | ||
| do not have a fixed length) entails almost the same performance loss as
 | ||
| `REJECT' (i.e., substantial).  So when possible a rule like:
 | ||
| 
 | ||
|      %%
 | ||
|      mouse|rat/(cat|dog)   run();
 | ||
| 
 | ||
| is better written:
 | ||
| 
 | ||
|      %%
 | ||
|      mouse/cat|dog         run();
 | ||
|      rat/cat|dog           run();
 | ||
| 
 | ||
| or as
 | ||
| 
 | ||
|      %%
 | ||
|      mouse|rat/cat         run();
 | ||
|      mouse|rat/dog         run();
 | ||
| 
 | ||
|    Note that here the special '|' action does *not* provide any
 | ||
| savings, and can even make things worse (see Deficiencies / Bugs below).
 | ||
| 
 | ||
|    Another area where the user can increase a scanner's performance
 | ||
| (and one that's easier to implement) arises from the fact that the
 | ||
| longer the tokens matched, the faster the scanner will run.  This is
 | ||
| because with long tokens the processing of most input characters takes
 | ||
| place in the (short) inner scanning loop, and does not often have to go
 | ||
| through the additional work of setting up the scanning environment
 | ||
| (e.g., `yytext') for the action.  Recall the scanner for C comments:
 | ||
| 
 | ||
|      %x comment
 | ||
|      %%
 | ||
|              int line_num = 1;
 | ||
|      
 | ||
|      "/*"         BEGIN(comment);
 | ||
|      
 | ||
|      <comment>[^*\n]*
 | ||
|      <comment>"*"+[^*/\n]*
 | ||
|      <comment>\n             ++line_num;
 | ||
|      <comment>"*"+"/"        BEGIN(INITIAL);
 | ||
| 
 | ||
|    This could be sped up by writing it as:
 | ||
| 
 | ||
|      %x comment
 | ||
|      %%
 | ||
|              int line_num = 1;
 | ||
|      
 | ||
|      "/*"         BEGIN(comment);
 | ||
|      
 | ||
|      <comment>[^*\n]*
 | ||
|      <comment>[^*\n]*\n      ++line_num;
 | ||
|      <comment>"*"+[^*/\n]*
 | ||
|      <comment>"*"+[^*/\n]*\n ++line_num;
 | ||
|      <comment>"*"+"/"        BEGIN(INITIAL);
 | ||
| 
 | ||
|    Now instead of each newline requiring the processing of another
 | ||
| action, recognizing the newlines is "distributed" over the other rules
 | ||
| to keep the matched text as long as possible.  Note that *adding* rules
 | ||
| does *not* slow down the scanner!  The speed of the scanner is
 | ||
| independent of the number of rules or (modulo the considerations given
 | ||
| at the beginning of this section) how complicated the rules are with
 | ||
| regard to operators such as '*' and '|'.
 | ||
| 
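
   The effect of distributing the newlines can be estimated by
counting how many actions each rule set runs on the same comment.  The
C sketch below is a hand-written imitation of the two rule sets, not
output from `flex'; the function name `comment_actions' and its
`merge_newlines' parameter are assumptions made for this example.  It
takes the text that follows the opening of the comment, up to and
including the closing delimiter.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the actions run while scanning a comment body.  With
 * merge_newlines == 0 a newline is always its own match; otherwise a
 * newline is absorbed into the text that precedes it.
 */
static size_t comment_actions(const char *body, int merge_newlines)
{
    const char *p = body;
    size_t actions = 0;

    assert(body != NULL);
    while (*p != '\0') {
        const char *q = p;

        if (*q == '*') {
            while (*q == '*')
                ++q; /* one or more stars */
            if (*q == '/')
                return actions + 1; /* stars plus slash end the comment */
            /* then a run containing no star, slash or newline */
            while (*q != '\0' && *q != '*' && *q != '/' && *q != '\n')
                ++q;
        } else if (*q != '\n') {
            /* a run containing no star or newline */
            while (*q != '\0' && *q != '*' && *q != '\n')
                ++q;
        }

        /* A newline is matched alone, or merged into a longer match. */
        if (*q == '\n' && (merge_newlines || q == p))
            ++q;

        ++actions;
        p = q;
    }
    return actions; /* unterminated comment: count what was seen */
}
```

   For a three-line body such as " line one\n * line two\n line
three\n */", the sketch counts nine actions with separate newline
matches and six with merged ones.
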
 | ||
|    A final example in speeding up a scanner: suppose you want to scan
 | ||
| through a file containing identifiers and keywords, one per line and
 | ||
| with no other extraneous characters, and recognize all the keywords.  A
 | ||
| natural first approach is:
 | ||
| 
 | ||
|      %%
 | ||
|      asm      |
 | ||
|      auto     |
 | ||
|      break    |
 | ||
|      ... etc ...
 | ||
|      volatile |
 | ||
|      while    /* it's a keyword */
 | ||
|      
 | ||
|      .|\n     /* it's not a keyword */
 | ||
| 
 | ||
|    To eliminate the back-tracking, introduce a catch-all rule:
 | ||
| 
 | ||
|      %%
 | ||
|      asm      |
 | ||
|      auto     |
 | ||
|      break    |
 | ||
|      ... etc ...
 | ||
|      volatile |
 | ||
|      while    /* it's a keyword */
 | ||
|      
 | ||
|      [a-z]+   |
 | ||
|      .|\n     /* it's not a keyword */
 | ||
| 
 | ||
|    Now, if it's guaranteed that there's exactly one word per line, then
 | ||
| we can reduce the total number of matches by a half by merging in the
 | ||
| recognition of newlines with that of the other tokens:
 | ||
| 
 | ||
|      %%
 | ||
|      asm\n    |
 | ||
|      auto\n   |
 | ||
|      break\n  |
 | ||
|      ... etc ...
 | ||
|      volatile\n |
 | ||
|      while\n  /* it's a keyword */
 | ||
|      
 | ||
|      [a-z]+\n |
 | ||
|      .|\n     /* it's not a keyword */
 | ||
| 
 | ||
|    One has to be careful here, as we have now reintroduced backing up
 | ||
| into the scanner.  In particular, while *we* know that there will never
 | ||
| be any characters in the input stream other than letters or newlines,
 | ||
| `flex' can't figure this out, and it will plan for possibly needing to
 | ||
| back up when it has scanned a token like "auto" and then the next
 | ||
| character is something other than a newline or a letter.  Previously it
 | ||
| would then just match the "auto" rule and be done, but now it has no
 | ||
| "auto" rule, only an "auto\n" rule.  To eliminate the possibility of
 | ||
| backing up, we could either duplicate all rules but without final
 | ||
| newlines, or, since we never expect to encounter such an input and
 | ||
| therefore don't care how it's classified, we can introduce one more
 | ||
| catch-all rule, this one which doesn't include a newline:
 | ||
| 
 | ||
|      %%
 | ||
|      asm\n    |
 | ||
|      auto\n   |
 | ||
|      break\n  |
 | ||
|      ... etc ...
 | ||
|      volatile\n |
 | ||
|      while\n  /* it's a keyword */
 | ||
|      
 | ||
|      [a-z]+\n |
 | ||
|      [a-z]+   |
 | ||
|      .|\n     /* it's not a keyword */
 | ||
| 
 | ||
|    Compiled with `-Cf', this is about as fast as one can get a `flex'
 | ||
| scanner to go for this particular problem.
 | ||
| 
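
   The saving from merging newline recognition into the word rules can
be checked with a small count.  The C sketch below does not run a
`flex' scanner; it only counts how many matches each tokenization would
make on input of the assumed form (one word per line).  The function
names `matches_separate' and `matches_merged' are invented for this
example.

```c
#include <assert.h>
#include <stddef.h>

/* Words and newlines are separate tokens: two matches per line. */
static size_t matches_separate(const char *input)
{
    size_t n = 0;

    assert(input != NULL);
    while (*input != '\0') {
        if (*input == '\n') {
            ++input; /* the newline is a match of its own */
        } else {
            while (*input != '\0' && *input != '\n')
                ++input; /* the word */
        }
        ++n;
    }
    return n;
}

/* A word and its trailing newline form one token: one match per line. */
static size_t matches_merged(const char *input)
{
    size_t n = 0;

    assert(input != NULL);
    while (*input != '\0') {
        while (*input != '\0' && *input != '\n')
            ++input; /* the word */
        if (*input == '\n')
            ++input; /* its newline, taken in the same match */
        ++n;
    }
    return n;
}
```

   On the input "auto\nbreak\nwhile\n" the first function counts six
matches and the second counts three.
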
 | ||
|    A final note: `flex' is slow when matching NUL's, particularly when
 | ||
| a token contains multiple NUL's.  It's best to write rules which match
 | ||
| *short* amounts of text if it's anticipated that the text will often
 | ||
| include NUL's.
 | ||
| 
 | ||
|    Another final note regarding performance: as mentioned above in the
 | ||
| section How the Input is Matched, dynamically resizing `yytext' to
 | ||
| accommodate huge tokens is a slow process because it presently requires
 | ||
| that the (huge) token be rescanned from the beginning.  Thus if
 | ||
| performance is vital, you should attempt to match "large" quantities of
 | ||
| text but not "huge" quantities, where the cutoff between the two is at
 | ||
| about 8K characters/token.
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: C++,  Next: Incompatibilities,  Prev: Performance,  Up: Top
 | ||
| 
 | ||
| Generating C++ scanners
 | ||
| =======================
 | ||
| 
 | ||
|    `flex' provides two different ways to generate scanners for use with
 | ||
| C++.  The first way is to simply compile a scanner generated by `flex'
 | ||
| using a C++ compiler instead of a C compiler.  You should not encounter
 | ||
| any compilation errors (please report any you find to the email address
 | ||
| given in the Author section below).  You can then use C++ code in your
 | ||
| rule actions instead of C code.  Note that the default input source for
 | ||
| your scanner remains `yyin', and default echoing is still done to
 | ||
| `yyout'.  Both of these remain `FILE *' variables and not C++ `streams'.
 | ||
| 
 | ||
|    You can also use `flex' to generate a C++ scanner class, using the
 | ||
| `-+' option (or, equivalently, `%option c++'), which is automatically
 | ||
| specified if the name of the flex executable ends in a `+', such as
 | ||
| `flex++'.  When using this option, flex defaults to generating the
 | ||
| scanner to the file `lex.yy.cc' instead of `lex.yy.c'.  The generated
 | ||
| scanner includes the header file `FlexLexer.h', which defines the
 | ||
| interface to two C++ classes.
 | ||
| 
 | ||
|    The first class, `FlexLexer', provides an abstract base class
 | ||
| defining the general scanner class interface.  It provides the
 | ||
| following member functions:
 | ||
| 
 | ||
| `const char* YYText()'
 | ||
|      returns the text of the most recently matched token, the
 | ||
|      equivalent of `yytext'.
 | ||
| 
 | ||
| `int YYLeng()'
 | ||
|      returns the length of the most recently matched token, the
 | ||
|      equivalent of `yyleng'.
 | ||
| 
 | ||
| `int lineno() const'
 | ||
|      returns the current input line number (see `%option yylineno'), or
 | ||
|      1 if `%option yylineno' was not used.
 | ||
| 
 | ||
| `void set_debug( int flag )'
 | ||
|      sets the debugging flag for the scanner, equivalent to assigning to
 | ||
|      `yy_flex_debug' (see the Options section above).  Note that you
 | ||
|      must build the scanner using `%option debug' to include debugging
 | ||
|      information in it.
 | ||
| 
 | ||
| `int debug() const'
 | ||
|      returns the current setting of the debugging flag.
 | ||
| 
 | ||
|    Also provided are member functions equivalent to
 | ||
| `yy_switch_to_buffer(), yy_create_buffer()' (though the first argument
 | ||
| is an `istream*' object pointer and not a `FILE*'), `yy_flush_buffer()',
 | ||
| `yy_delete_buffer()', and `yyrestart()' (again, the first argument is an
 | ||
| `istream*' object pointer).
 | ||
| 
 | ||
|    The second class defined in `FlexLexer.h' is `yyFlexLexer', which is
 | ||
| derived from `FlexLexer'.  It defines the following additional member
 | ||
| functions:
 | ||
| 
 | ||
| `yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )'
 | ||
|      constructs a `yyFlexLexer' object using the given streams for
 | ||
|      input and output.  If not specified, the streams default to `cin'
 | ||
|      and `cout', respectively.
 | ||
| 
 | ||
| `virtual int yylex()'
 | ||
|      performs the same role as `yylex()' does for ordinary flex
 | ||
|      scanners: it scans the input stream, consuming tokens, until a
 | ||
|      rule's action returns a value.  If you derive a subclass S from
 | ||
|      `yyFlexLexer' and want to access the member functions and
 | ||
|      variables of S inside `yylex()', then you need to use `%option
 | ||
|      yyclass="S"' to inform `flex' that you will be using that subclass
 | ||
|      instead of `yyFlexLexer'.  In this case, rather than generating
 | ||
|      `yyFlexLexer::yylex()', `flex' generates `S::yylex()' (and also
 | ||
|      generates a dummy `yyFlexLexer::yylex()' that calls
 | ||
|      `yyFlexLexer::LexerError()' if called).
 | ||
| 
 | ||
| `virtual void switch_streams(istream* new_in = 0, ostream* new_out = 0)'
 | ||
|      reassigns `yyin' to `new_in' (if non-nil) and `yyout' to `new_out'
 | ||
|      (ditto), deleting the previous input buffer if `yyin' is
 | ||
|      reassigned.
 | ||
| 
 | ||
| `int yylex( istream* new_in = 0, ostream* new_out = 0 )'
 | ||
|      first switches the input streams via `switch_streams( new_in,
 | ||
|      new_out )' and then returns the value of `yylex()'.
 | ||
| 
 | ||
|    In addition, `yyFlexLexer' defines the following protected virtual
 | ||
| functions which you can redefine in derived classes to tailor the
 | ||
| scanner:
 | ||
| 
 | ||
| `virtual int LexerInput( char* buf, int max_size )'
 | ||
|      reads up to `max_size' characters into BUF and returns the number
 | ||
|      of characters read.  To indicate end-of-input, return 0
 | ||
|      characters.  Note that "interactive" scanners (see the `-B' and
 | ||
|      `-I' flags) define the macro `YY_INTERACTIVE'.  If you redefine
 | ||
|      `LexerInput()' and need to take different actions depending on
 | ||
|      whether or not the scanner might be scanning an interactive input
 | ||
|      source, you can test for the presence of this name via `#ifdef'.
 | ||
| 
 | ||
| `virtual void LexerOutput( const char* buf, int size )'
 | ||
|      writes out SIZE characters from the buffer BUF, which, while
 | ||
|      NUL-terminated, may also contain "internal" NUL's if the scanner's
 | ||
|      rules can match text with NUL's in them.
 | ||
| 
 | ||
| `virtual void LexerError( const char* msg )'
 | ||
|      reports a fatal error message.  The default version of this
 | ||
|      function writes the message to the stream `cerr' and exits.
 | ||
| 
 | ||
|    Note that a `yyFlexLexer' object contains its *entire* scanning
 | ||
| state.  Thus you can use such objects to create reentrant scanners.
 | ||
| You can instantiate multiple instances of the same `yyFlexLexer' class,
 | ||
| and you can also combine multiple C++ scanner classes together in the
 | ||
| same program using the `-P' option discussed above.  Finally, note that
 | ||
| the `%array' feature is not available to C++ scanner classes; you must
 | ||
| use `%pointer' (the default).
 | ||
| 
 | ||
|    Here is an example of a simple C++ scanner:
 | ||
| 
 | ||
|          // An example of using the flex C++ scanner class.
 | ||
|      
 | ||
|      %{
 | ||
|      int mylineno = 0;
 | ||
|      %}
 | ||
|      
 | ||
|      string  \"[^\n"]+\"
 | ||
|      
 | ||
|      ws      [ \t]+
 | ||
|      
 | ||
|      alpha   [A-Za-z]
 | ||
|      dig     [0-9]
 | ||
|      name    ({alpha}|{dig}|\$)({alpha}|{dig}|[_.\-/$])*
 | ||
|      num1    [-+]?{dig}+\.?([eE][-+]?{dig}+)?
 | ||
|      num2    [-+]?{dig}*\.{dig}+([eE][-+]?{dig}+)?
 | ||
|      number  {num1}|{num2}
 | ||
|      
 | ||
|      %%
 | ||
|      
 | ||
|      {ws}    /* skip blanks and tabs */
 | ||
|      
 | ||
|      "/*"    {
 | ||
|              int c;
 | ||
|      
 | ||
|              while((c = yyinput()) != 0)
 | ||
|                  {
 | ||
|                  if(c == '\n')
 | ||
|                      ++mylineno;
 | ||
|      
 | ||
|                  else if(c == '*')
 | ||
|                      {
 | ||
|                      if((c = yyinput()) == '/')
 | ||
|                          break;
 | ||
|                      else
 | ||
|                          unput(c);
 | ||
|                      }
 | ||
|                  }
 | ||
|              }
 | ||
|      
 | ||
|      {number}  cout << "number " << YYText() << '\n';
 | ||
|      
 | ||
|      \n        mylineno++;
 | ||
|      
 | ||
|      {name}    cout << "name " << YYText() << '\n';
 | ||
|      
 | ||
|      {string}  cout << "string " << YYText() << '\n';
 | ||
|      
 | ||
|      %%
 | ||
|      
 | ||
|      int main( int /* argc */, char** /* argv */ )
 | ||
|          {
 | ||
|          FlexLexer* lexer = new yyFlexLexer;
 | ||
|          while(lexer->yylex() != 0)
 | ||
|              ;
 | ||
|          return 0;
 | ||
|          }
 | ||
| 
 | ||
|    If you want to create multiple (different) lexer classes, you use
 | ||
| the `-P' flag (or the `prefix=' option) to rename each `yyFlexLexer' to
 | ||
| some other `xxFlexLexer'.  You then can include `<FlexLexer.h>' in your
 | ||
| other sources once per lexer class, first renaming `yyFlexLexer' as
 | ||
| follows:
 | ||
| 
 | ||
|      #undef yyFlexLexer
 | ||
|      #define yyFlexLexer xxFlexLexer
 | ||
|      #include <FlexLexer.h>
 | ||
|      
 | ||
|      #undef yyFlexLexer
 | ||
|      #define yyFlexLexer zzFlexLexer
 | ||
|      #include <FlexLexer.h>
 | ||
| 
 | ||
|    if, for example, you used `%option prefix="xx"' for one of your
 | ||
| scanners and `%option prefix="zz"' for the other.
 | ||
| 
 | ||
|    IMPORTANT: the present form of the scanning class is *experimental*
 | ||
| and may change considerably between major releases.
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Incompatibilities,  Next: Diagnostics,  Prev: C++,  Up: Top
 | ||
| 
 | ||
| Incompatibilities with `lex' and POSIX
 | ||
| ======================================
 | ||
| 
 | ||
|    `flex' is a rewrite of the AT&T Unix `lex' tool (the two
 | ||
| implementations do not share any code, though), with some extensions
 | ||
| and incompatibilities, both of which are of concern to those who wish
 | ||
| to write scanners acceptable to either implementation.  Flex is fully
 | ||
| compliant with the POSIX `lex' specification, except that when using
 | ||
| `%pointer' (the default), a call to `unput()' destroys the contents of
 | ||
| `yytext', which is counter to the POSIX specification.
 | ||
| 
 | ||
|    In this section we discuss all of the known areas of incompatibility
 | ||
| between flex, AT&T lex, and the POSIX specification.
 | ||
| 
 | ||
|    `flex's' `-l' option turns on maximum compatibility with the
 | ||
| original AT&T `lex' implementation, at the cost of a major loss in the
 | ||
| generated scanner's performance.  We note below which incompatibilities
 | ||
| can be overcome using the `-l' option.
 | ||
| 
 | ||
|    `flex' is fully compatible with `lex' with the following exceptions:
 | ||
| 
 | ||
|    - The undocumented `lex' scanner internal variable `yylineno' is not
 | ||
|      supported unless `-l' or `%option yylineno' is used.  `yylineno'
 | ||
|      should be maintained on a per-buffer basis, rather than a
 | ||
|      per-scanner (single global variable) basis.  `yylineno' is not
 | ||
|      part of the POSIX specification.
 | ||
| 
 | ||
|    - The `input()' routine is not redefinable, though it may be called
 | ||
|      to read characters following whatever has been matched by a rule.
 | ||
|      If `input()' encounters an end-of-file the normal `yywrap()'
 | ||
|      processing is done.  A "real" end-of-file is returned by `input()'
 | ||
|      as `EOF'.
 | ||
| 
 | ||
|      Input is instead controlled by defining the `YY_INPUT' macro.
 | ||
| 
 | ||
|      The `flex' restriction that `input()' cannot be redefined is in
 | ||
|      accordance with the POSIX specification, which simply does not
 | ||
|      specify any way of controlling the scanner's input other than by
 | ||
|      making an initial assignment to `yyin'.
 | ||
| 
 | ||
|    - The `unput()' routine is not redefinable.  This restriction is in
 | ||
|      accordance with POSIX.
 | ||
| 
 | ||
|    - `flex' scanners are not as reentrant as `lex' scanners.  In
 | ||
|      particular, if you have an interactive scanner and an interrupt
 | ||
|      handler which long-jumps out of the scanner, and the scanner is
 | ||
|      subsequently called again, you may get the following message:
 | ||
| 
 | ||
|           fatal flex scanner internal error--end of buffer missed
 | ||
| 
 | ||
|      To reenter the scanner, first use
 | ||
| 
 | ||
|           yyrestart( yyin );
 | ||
| 
 | ||
|      Note that this call will throw away any buffered input; usually
 | ||
|      this isn't a problem with an interactive scanner.
 | ||
| 
 | ||
|      Also note that flex C++ scanner classes *are* reentrant, so if
 | ||
|      using C++ is an option for you, you should use them instead.  See
 | ||
|      "Generating C++ Scanners" above for details.
 | ||
| 
 | ||
|    - `output()' is not supported.  Output from the `ECHO' macro is done
 | ||
|      to the file-pointer `yyout' (default `stdout').
 | ||
| 
 | ||
|      `output()' is not part of the POSIX specification.
 | ||
| 
 | ||
|    - `lex' does not support exclusive start conditions (%x), though
 | ||
|      they are in the POSIX specification.
 | ||
| 
 | ||
|    - When definitions are expanded, `flex' encloses them in
 | ||
|      parentheses.  With lex, the following:
 | ||
| 
 | ||
|           NAME    [A-Z][A-Z0-9]*
 | ||
|           %%
 | ||
|           foo{NAME}?      printf( "Found it\n" );
 | ||
|           %%
 | ||
| 
 | ||
|      will not match the string "foo" because when the macro is expanded
 | ||
|      the rule is equivalent to "foo[A-Z][A-Z0-9]*?" and the precedence
 | ||
|      is such that the '?' is associated with "[A-Z0-9]*".  With `flex',
 | ||
|      the rule will be expanded to "foo([A-Z][A-Z0-9]*)?" and so the
 | ||
|      string "foo" will match.
 | ||
| 
 | ||
|      Note that if the definition begins with `^' or ends with `$' then
 | ||
|      it is *not* expanded with parentheses, to allow these operators to
 | ||
|      appear in definitions without losing their special meanings.  But
 | ||
|      the `<s>', `/', and `<<EOF>>' operators cannot be used in a `flex'
 | ||
|      definition.
 | ||
| 
 | ||
|      Using `-l' results in the `lex' behavior of no parentheses around
 | ||
|      the definition.
 | ||
| 
 | ||
|      The POSIX specification is that the definition be enclosed in
 | ||
|      parentheses.
 | ||
| 
 | ||
|    - Some implementations of `lex' allow a rule's action to begin on a
 | ||
|      separate line, if the rule's pattern has trailing whitespace:
 | ||
| 
 | ||
|           %%
 | ||
|           foo|bar<space here>
 | ||
|             { foobar_action(); }
 | ||
| 
 | ||
|      `flex' does not support this feature.
 | ||
| 
 | ||
|    - The `lex' `%r' (generate a Ratfor scanner) option is not
 | ||
|      supported.  It is not part of the POSIX specification.
 | ||
| 
 | ||
|    - After a call to `unput()', `yytext' is undefined until the next
 | ||
|      token is matched, unless the scanner was built using `%array'.
 | ||
|      This is not the case with `lex' or the POSIX specification.  The
 | ||
|      `-l' option does away with this incompatibility.
 | ||
| 
 | ||
|    - The precedence of the `{}' (numeric range) operator is different.
 | ||
|      `lex' interprets "abc{1,3}" as "match one, two, or three
 | ||
|      occurrences of 'abc'", whereas `flex' interprets it as "match 'ab'
 | ||
|      followed by one, two, or three occurrences of 'c'".  The latter is
 | ||
|      in agreement with the POSIX specification.
 | ||
| 
 | ||
|    - The precedence of the `^' operator is different.  `lex' interprets
 | ||
|      "^foo|bar" as "match either 'foo' at the beginning of a line, or
 | ||
|      'bar' anywhere", whereas `flex' interprets it as "match either
 | ||
|      'foo' or 'bar' if they come at the beginning of a line".  The
 | ||
|      latter is in agreement with the POSIX specification.
 | ||
| 
 | ||
|    - The special table-size declarations such as `%a' supported by
 | ||
|      `lex' are not required by `flex' scanners; `flex' ignores them.
 | ||
| 
 | ||
|    - The name FLEX_SCANNER is #define'd so scanners may be written for
 | ||
|      use with either `flex' or `lex'.  Scanners also include
 | ||
|      `YY_FLEX_MAJOR_VERSION' and `YY_FLEX_MINOR_VERSION' indicating
 | ||
|      which version of `flex' generated the scanner (for example, for the
 | ||
|      2.5 release, these defines would be 2 and 5 respectively).
 | ||
| 
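
   Two of the precedence differences in the list above (the
parenthesizing of expanded definitions and the binding of the `{}'
operator) can be tried out with the POSIX `regcomp()' interface.  This
is only an analogy: `regcomp()' is not the matcher used by either `lex'
or `flex'.  The helper name `full_match' and the pattern constants are
invented for this example, and the grouping that each tool applies is
spelled out with explicit parentheses and anchors.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* "foo{NAME}?" with NAME defined as [A-Z][A-Z0-9]* */
static const char LEX_STYLE_DEF[]  = "^foo[A-Z]([A-Z0-9]*)?$";
static const char FLEX_STYLE_DEF[] = "^foo([A-Z][A-Z0-9]*)?$";

/* "abc{1,3}" */
static const char LEX_STYLE_RANGE[]  = "^(abc){1,3}$";
static const char FLEX_STYLE_RANGE[] = "^ab(c){1,3}$";

/* Return 1 if text matches the POSIX extended regex, else 0. */
static int full_match(const char *pattern, const char *text)
{
    regex_t re;
    int rc;
    int status;

    rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    assert(rc == 0); /* the patterns above are expected to compile */
    (void)rc;

    status = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return status == 0;
}
```

   Under these assumptions, `full_match(FLEX_STYLE_DEF, "foo")' succeeds
while `full_match(LEX_STYLE_DEF, "foo")' fails, and the string "abcabc"
is matched by `LEX_STYLE_RANGE' but not by `FLEX_STYLE_RANGE'.
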
 | ||
|    The following `flex' features are not included in `lex' or the POSIX
 | ||
| specification:
 | ||
| 
 | ||
|      C++ scanners
 | ||
|      %option
 | ||
|      start condition scopes
 | ||
|      start condition stacks
 | ||
|      interactive/non-interactive scanners
 | ||
|      yy_scan_string() and friends
 | ||
|      yyterminate()
 | ||
|      yy_set_interactive()
 | ||
|      yy_set_bol()
 | ||
|      YY_AT_BOL()
 | ||
|      <<EOF>>
 | ||
|      <*>
 | ||
|      YY_DECL
 | ||
|      YY_START
 | ||
|      YY_USER_ACTION
 | ||
|      YY_USER_INIT
 | ||
|      #line directives
 | ||
|      %{}'s around actions
 | ||
|      multiple actions on a line
 | ||
| 
 | ||
| plus almost all of the flex flags.  The last feature in the list refers
 | ||
| to the fact that with `flex' you can put multiple actions on the same
 | ||
| line, separated with semicolons, while with `lex', the following
 | ||
| 
 | ||
|      foo    handle_foo(); ++num_foos_seen;
 | ||
| 
 | ||
| is (rather surprisingly) truncated to
 | ||
| 
 | ||
|      foo    handle_foo();
 | ||
| 
 | ||
|    `flex' does not truncate the action.  Actions that are not enclosed
 | ||
| in braces are simply terminated at the end of the line.
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Diagnostics,  Next: Files,  Prev: Incompatibilities,  Up: Top
 | ||
| 
 | ||
| Diagnostics
 | ||
| ===========
 | ||
| 
 | ||
| `warning, rule cannot be matched'
 | ||
|      indicates that the given rule cannot be matched because it follows
 | ||
|      other rules that will always match the same text as it.  For
 | ||
|      example, in the following "foo" cannot be matched because it comes
 | ||
|      after an identifier "catch-all" rule:
 | ||
| 
 | ||
|           [a-z]+    got_identifier();
 | ||
|           foo       got_foo();
 | ||
| 
 | ||
|      Using `REJECT' in a scanner suppresses this warning.
 | ||
| 
 | ||
| `warning, -s option given but default rule can be matched'
 | ||
|      means that it is possible (perhaps only in a particular start
 | ||
|      condition) that the default rule (match any single character) is
 | ||
|      the only one that will match a particular input.  Since `-s' was
 | ||
|      given, presumably this is not intended.
 | ||
| 
 | ||
| `reject_used_but_not_detected undefined'
 | ||
| `yymore_used_but_not_detected undefined'
 | ||
|      These errors can occur at compile time.  They indicate that the
 | ||
|      scanner uses `REJECT' or `yymore()' but that `flex' failed to
 | ||
|      notice the fact, meaning that `flex' scanned the first two sections
 | ||
|      looking for occurrences of these actions and failed to find any,
 | ||
|      but somehow you snuck some in (via a #include file, for example).
 | ||
|      Use `%option reject' or `%option yymore' to indicate to flex that
 | ||
|      you really do use these features.
 | ||
| 
 | ||
| `flex scanner jammed'
 | ||
|      a scanner compiled with `-s' has encountered an input string which
 | ||
|      wasn't matched by any of its rules.  This error can also occur due
 | ||
|      to internal problems.
 | ||
| 
 | ||
| `token too large, exceeds YYLMAX'
 | ||
|      your scanner uses `%array' and one of its rules matched a string
 | ||
|      longer than the `YYLMAX' constant (8K bytes by default).  You
 | ||
|      can increase the value by #define'ing `YYLMAX' in the definitions
 | ||
|      section of your `flex' input.
 | ||
| 
 | ||
| `scanner requires -8 flag to use the character 'X''
 | ||
|      Your scanner specification includes recognizing the 8-bit
 | ||
|      character X and you did not specify the -8 flag, and your scanner
 | ||
|      defaulted to 7-bit because you used the `-Cf' or `-CF' table
 | ||
|      compression options.  See the discussion of the `-7' flag for
 | ||
|      details.
 | ||
| 
 | ||
| `flex scanner push-back overflow'
 | ||
|      you used `unput()' to push back so much text that the scanner's
 | ||
|      buffer could not hold both the pushed-back text and the current
 | ||
|      token in `yytext'.  Ideally the scanner should dynamically resize
 | ||
|      the buffer in this case, but at present it does not.
 | ||
| 
 | ||
| `input buffer overflow, can't enlarge buffer because scanner uses REJECT'
 | ||
|      the scanner was working on matching an extremely large token and
 | ||
|      needed to expand the input buffer.  This doesn't work with
 | ||
|      scanners that use `REJECT'.
 | ||
| 
 | ||
| `fatal flex scanner internal error--end of buffer missed'
 | ||
|      This can occur in a scanner which is reentered after a long-jump
 | ||
|      has jumped out (or over) the scanner's activation frame.  Before
 | ||
|      reentering the scanner, use:
 | ||
| 
 | ||
|           yyrestart( yyin );
 | ||
| 
 | ||
|      or, as noted above, switch to using the C++ scanner class.
 | ||
| 
 | ||
| `too many start conditions in <> construct!'
 | ||
|      you listed more start conditions in a <> construct than exist (so
 | ||
|      you must have listed at least one of them twice).
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Files,  Next: Deficiencies,  Prev: Diagnostics,  Up: Top
 | ||
| 
 | ||
| Files
 | ||
| =====
 | ||
| 
 | ||
| `-lfl'
 | ||
|      library with which scanners must be linked.
 | ||
| 
 | ||
| `lex.yy.c'
 | ||
|      generated scanner (called `lexyy.c' on some systems).
 | ||
| 
 | ||
| `lex.yy.cc'
 | ||
|      generated C++ scanner class, when using `-+'.
 | ||
| 
 | ||
| `<FlexLexer.h>'
 | ||
|      header file defining the C++ scanner base class, `FlexLexer', and
 | ||
|      its derived class, `yyFlexLexer'.
 | ||
| 
 | ||
| `flex.skl'
 | ||
|      skeleton scanner.  This file is only used when building flex, not
 | ||
|      when flex executes.
 | ||
| 
 | ||
| `lex.backup'
 | ||
|      backing-up information for `-b' flag (called `lex.bck' on some
 | ||
|      systems).
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Deficiencies,  Next: See also,  Prev: Files,  Up: Top
 | ||
| 
 | ||
| Deficiencies / Bugs
 | ||
| ===================
 | ||
| 
 | ||
|    Some trailing context patterns cannot be properly matched and
 | ||
| generate warning messages ("dangerous trailing context").  These are
 | ||
| patterns where the ending of the first part of the rule matches the
 | ||
| beginning of the second part, such as "zx*/xy*", where the 'x*' matches
 | ||
| the 'x' at the beginning of the trailing context.  (Note that the POSIX
 | ||
| draft states that the text matched by such patterns is undefined.)
 | ||
| 
 | ||
|    For some trailing context rules, parts which are actually
 | ||
| fixed-length are not recognized as such, leading to the performance
 | ||
| loss of variable trailing context (see the Performance node).  In
 | ||
| particular, parts using '|' or {n} (such as "foo{3}") are always
 | ||
| considered variable-length.
 | ||
| 
 | ||
|    Combining trailing context with the special '|' action can result in
 | ||
| *fixed* trailing context being turned into the more expensive
 | ||
| *variable* trailing context.  This can happen, for example, with the
 | ||
| following rules:
 | ||
| 
 | ||
|      %%
 | ||
|      abc      |
 | ||
|      xyz/def
 | ||
| 
 | ||
|    Use of `unput()' invalidates yytext and yyleng, unless the `%array'
 | ||
| directive or the `-l' option has been used.
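| 
|    One way to cope with this, sketched here under the assumption that
| the default `%pointer' mode is in effect, is to copy `yytext' and
| `yyleng' before the first call to `unput()'.  The rule below, which
| rescans each lower-case word in upper case, and the names `copy' and
| `len' are purely illustrative:
| 
|      %{
|      #include <ctype.h>
|      #include <stdlib.h>
|      #include <string.h>
|      %}
| 
|      %option noyywrap
| 
|      %%
|      [a-z]+  {
|              /* Save both values first; unput() may clobber them. */
|              int i, len = yyleng;
|              char *copy = malloc( len + 1 );
| 
|              if ( copy == NULL )
|                  exit( 1 );
|              memcpy( copy, yytext, len + 1 );
| 
|              /* Push back right to left so the text rescans in order. */
|              for ( i = len - 1; i >= 0; --i )
|                  unput( toupper( (unsigned char) copy[i] ) );
| 
|              free( copy );
|              }
|      %%
| 
|      int main( void )
|          {
|          yylex();
|          return 0;
|          }
| 
|    The pushed-back upper-case text no longer matches `[a-z]+', so it
| falls through to the default rule and is echoed.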
 | ||
| 
 | ||
|    Pattern-matching of NULs is substantially slower than matching
 | ||
| other characters.
 | ||
| 
 | ||
|    Dynamic resizing of the input buffer is slow, as it entails
 | ||
| rescanning all the text matched so far by the current (generally huge)
 | ||
| token.
 | ||
| 
 | ||
|    Due to both buffering of input and read-ahead, you cannot intermix
 | ||
| calls to <stdio.h> routines, such as `getchar()', with `flex' rules
 | ||
| and expect it to work.  Call `input()' instead.
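| 
|    For instance, an action that needs to consume characters itself can
| read them through `input()' so that the scanner's buffer stays
| consistent.  This sketch discards `#' comments through the end of the
| line.  The rule and the variable `c' are illustrative, and treating a
| zero return as end of input is a defensive assumption, since the
| end-of-file value returned by `input()' may differ between `flex'
| versions:
| 
|      %option noyywrap
| 
|      %%
|      "#"     {
|              int c;
| 
|              /* Read via input(), not getchar(): the scanner has
|               * already buffered text that stdio would never see.
|               */
|              while ( (c = input()) != '\n' && c != EOF && c != 0 )
|                  ;
|              }
|      %%
| 
|      int main( void )
|          {
|          yylex();
|          return 0;
|          }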
 | ||
| 
 | ||
|    The total number of table entries reported by the `-v' flag excludes
 | ||
| the entries needed to determine which rule has been matched.  The
 | ||
| number of these excluded entries equals the number of DFA states if
 | ||
| the scanner does not use `REJECT', and is somewhat greater than the
 | ||
| number of states if it does.
 | ||
| 
 | ||
|    `REJECT' cannot be used with the `-f' or `-F' options.
 | ||
| 
 | ||
|    The `flex' internal algorithms need documentation.
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: See also,  Next: Author,  Prev: Deficiencies,  Up: Top
 | ||
| 
 | ||
| See also
 | ||
| ========
 | ||
| 
 | ||
|    `lex'(1), `yacc'(1), `sed'(1), `awk'(1).
 | ||
| 
 | ||
|    John Levine, Tony Mason, and Doug Brown: Lex & Yacc; O'Reilly and
 | ||
| Associates.  Be sure to get the 2nd edition.
 | ||
| 
 | ||
|    M. E. Lesk and E. Schmidt, LEX - Lexical Analyzer Generator.
 | ||
| 
 | ||
|    Alfred Aho, Ravi Sethi and Jeffrey Ullman: Compilers: Principles,
 | ||
| Techniques and Tools; Addison-Wesley (1986).  Describes the
 | ||
| pattern-matching techniques used by `flex' (deterministic finite
 | ||
| automata).
 | ||
| 
 | ||
| 
 | ||
| File: flex.info,  Node: Author,  Prev: See also,  Up: Top
 | ||
| 
 | ||
| Author
 | ||
| ======
 | ||
| 
 | ||
|    Vern Paxson, with the help of many ideas and much inspiration from
 | ||
| Van Jacobson.  Original version by Jef Poskanzer.  The fast table
 | ||
| representation is a partial implementation of a design done by Van
 | ||
| Jacobson.  The implementation was done by Kevin Gong and Vern Paxson.
 | ||
| 
 | ||
|    Thanks to the many `flex' beta-testers, feedbackers, and
 | ||
| contributors, especially Francois Pinard, Casey Leedom, Stan Adermann,
 | ||
| Terry Allen, David Barker-Plummer, John Basrai, Nelson H.F. Beebe,
 | ||
| `benson@odi.com', Karl Berry, Peter A. Bigot, Simon Blanchard, Keith
 | ||
| Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher, Brian
 | ||
| Clapper, J.T. Conklin, Jason Coughlin, Bill Cox, Nick Cropper, Dave
 | ||
| Curtis, Scott David Daniels, Chris G. Demetriou, Theo Deraadt, Mike
 | ||
| Donahue, Chuck Doucette, Tom Epperly, Leo Eskin, Chris Faylor, Chris
 | ||
| Flatters, Jon Forrest, Joe Gayda, Kaveh R. Ghazi, Eric Goldman,
 | ||
| Christopher M. Gould, Ulrich Grepel, Peer Griebel, Jan Hajic, Charles
 | ||
| Hemphill, NORO Hideo, Jarkko Hietaniemi, Scott Hofmann, Jeff Honig,
 | ||
| Dana Hudes, Eric Hughes, John Interrante, Ceriel Jacobs, Michal
 | ||
| Jaegermann, Sakari Jalovaara, Jeffrey R. Jones, Henry Juengst, Klaus
 | ||
| Kaempf, Jonathan I. Kamens, Terrence O Kane, Amir Katz,
 | ||
| `ken@ken.hilco.com', Kevin B. Kenny, Steve Kirsch, Winfried Koenig,
 | ||
| Marq Kole, Ronald Lamprecht, Greg Lee, Rohan Lenard, Craig Leres, John
 | ||
| Levine, Steve Liddle, Mike Long, Mohamed el Lozy, Brian Madsen, Malte,
 | ||
| Joe Marshall, Bengt Martensson, Chris Metcalf, Luke Mewburn, Jim
 | ||
| Meyering, R. Alexander Milowski, Erik Naggum, G.T. Nicol, Landon Noll,
 | ||
| James Nordby, Marc Nozell, Richard Ohnemus, Karsten Pahnke, Sven Panne,
 | ||
| Roland Pesch, Walter Pelissero, Gaumond Pierre, Esmond Pitt, Jef
 | ||
| Poskanzer, Joe Rahmeh, Jarmo Raiha, Frederic Raimbault, Pat Rankin,
 | ||
| Rick Richardson, Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto
 | ||
| Santini, Andreas Scherer, Darrell Schiebel, Raf Schietekat, Doug
 | ||
| Schmidt, Philippe Schnoebelen, Andreas Schwab, Alex Siegel, Eckehard
 | ||
| Stolz, Jan-Erik Strömquist, Mike Stump, Paul Stuart, Dave Tallman, Ian
 | ||
| Lance Taylor, Chris Thewalt, Richard M. Timoney, Jodi Tsai, Paul
 | ||
| Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams, Ken
 | ||
| Yap, Ron Zellar, Nathan Zelle, David Zuhn, and those whose names have
 | ||
| slipped my marginal mail-archiving skills but whose contributions are
 | ||
| appreciated all the same.
 | ||
| 
 | ||
|    Thanks to Keith Bostic, Jon Forrest, Noah Friedman, John Gilmore,
 | ||
| Craig Leres, John Levine, Bob Mulcahy, G.T. Nicol, Francois Pinard,
 | ||
| Rich Salz, and Richard Stallman for help with various distribution
 | ||
| headaches.
 | ||
| 
 | ||
|    Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
 | ||
| to Benson Margulies and Fred Burke for C++ support; to Kent Williams
 | ||
| and Tom Epperly for C++ class support; to Ove Ewerlid for support of
 | ||
| NULs; and to Eric Hughes for support of multiple buffers.
 | ||
| 
 | ||
|    This work was primarily done when I was with the Real Time Systems
 | ||
| Group at the Lawrence Berkeley Laboratory in Berkeley, CA.  Many thanks
 | ||
| to all there for the support I received.
 | ||
| 
 | ||
|    Send comments to `vern@ee.lbl.gov'.
 | ||
| 
 | ||
| 
 | ||
| 
 | ||
| Tag Table:
 | ||
| Node: Top1430
 | ||
| Node: Name2808
 | ||
| Node: Synopsis2933
 | ||
| Node: Overview3145
 | ||
| Node: Description4986
 | ||
| Node: Examples5748
 | ||
| Node: Format8896
 | ||
| Node: Patterns11637
 | ||
| Node: Matching18138
 | ||
| Node: Actions21438
 | ||
| Node: Generated scanner30560
 | ||
| Node: Start conditions34988
 | ||
| Node: Multiple buffers45069
 | ||
| Node: End-of-file rules50975
 | ||
| Node: Miscellaneous52508
 | ||
| Node: User variables55279
 | ||
| Node: YACC interface57651
 | ||
| Node: Options58542
 | ||
| Node: Performance78234
 | ||
| Node: C++87532
 | ||
| Node: Incompatibilities94993
 | ||
| Node: Diagnostics101853
 | ||
| Node: Files105094
 | ||
| Node: Deficiencies105715
 | ||
| Node: See also107684
 | ||
| Node: Author108216
 | ||
| 
 | ||
| End Tag Table
 | 
