
lex/yacc efficiency tradeoffs

Discussion in 'Embedded' started by D Yuniskis, Jul 23, 2010.

  1. D Yuniskis

    D Yuniskis Guest

    Hi,

    Does anyone have any rule-of-thumb guidelines for
    the relative efficiencies of doing promotions in
    the lexer vs the formal grammar? Both in terms of
    time and space?

    E.g., imagine:

    [1-9][0-9]{2,4} return VALUE;

    in the lexer vs. the same explicit definition in
    the *grammar*? (i.e., using ". return (int) yytext[0]"
    in the lexer)
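
    To make the comparison concrete, a rough sketch of what I mean
    (token names purely illustrative):

        /* promotion done in the lexer */
        [1-9][0-9]{2,4}    return VALUE;

        /* promotion done in the grammar, lexer passing characters through */
        digit : NONZERO | ZERO ;
        value : NONZERO digit digit
              | NONZERO digit digit digit
              | NONZERO digit digit digit digit
              ;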

    Are there any easily identifiable criteria that affect
    the tradeoff between the two approaches?

    Is there any (significant) difference in lex/yacc
    implementations wrt this issue (e.g., vs flex/bison)?

    I can run some empirical tests but I am not expert enough
    on either to know how to push the limits with test cases...

    Thx,
    --don
     
    D Yuniskis, Jul 23, 2010
    #1

  2. Peter Keller

    Peter Keller Guest

    Unless your compiler pass is compiling 24/7 and has to be real-time
    like, you absolutely shouldn't care and should write your compiler
    to be the most maintainable thing ever. Speed doesn't matter for 99%
    of compilation passes.

    Type promotion is done either in the type checking phase of the
    compiler, after parsing, or in the code generator. It is better to
    make type promotions explicit in a pass of your compiler, since then
    you can warn when type promotions lose data (among other things). Of
    course, all of the phases can be ad-hoc'ed together, but then you end
    up with unmaintainable and fragile junk.

    Well, that's my opinion anyway... :)

    -pete
     
    Peter Keller, Jul 23, 2010
    #2

  3. D Yuniskis

    D Yuniskis Guest

    Hi Peter,

    (sigh) Sorry, I should have clarified. I'm not using them
    to write a compiler but, rather, a parser for "messages".
    While speed isn't a real issue (though I wouldn't want it to
    be a *pig*!), I *am* concerned with "footprint": size, stack
    penetration, etc.

    As such, I am concerned with how often the machine "backtracks",
    how much state it carries, etc.
     
    D Yuniskis, Jul 23, 2010
    #3
  4. D Yuniskis

    D Yuniskis Guest

    Hi Walter,

    Not part of a compiler -- please see my reply to Peter Keller's
    post.
    The point of using a formal grammar instead of an ad-hoc
    implementation is exactly that -- to make the formats
    of the messages parsed very obvious. And, to make it easy for
    others to write similar parsers just by cutting and pasting
    from an existing one.

    However, since it *does* sit in a run-time loop, I can't
    *ignore* runtime costs (time *and* space).
     
    D Yuniskis, Jul 23, 2010
    #4
  5. D Yuniskis

    D Yuniskis Guest

    Hi Joe,

    Well, in some cases, I have complete control over the grammar.
    In other cases, I am interfacing to existing protocols so I'm
    stuck with whatever was thrown together (usually with little
    concern for this sort of thing!) previously.
    Currently, I'm experimenting with the lexer doing damn little
    at all! E.g., things like:

    \r return CR
    \n return LF
    [folks seem to more readily relate to seeing "CR LF" in messages]

    \040 return SP
    [eliminate the ambiguity of ' ' being a space or a coincidental tab]

    [\200-\277] return NONASCII
    [8b clear channel can see bogus "characters" if UART misconfigured]

    . return (int) yytext[0];
    [push everything up to the formal grammar]
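
    Pulled together, the entire lexer spec for this approach is
    something like (rough sketch; the token names come from the
    yacc-generated header):

        %{
        #include "y.tab.h"   /* defines CR, LF, SP, NONASCII */
        %}
        %%
        \r           return CR;
        \n           return LF;
        \040         return SP;
        [\200-\277]  return NONASCII;
        .            return (int) yytext[0];
        %%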

    But, doing this (last rule) leaves me worrying: "lex exists for a
    reason; I've just made it redundant! :< "
    I'm more concerned with rules that have long chains of terminals
    that can force lots of (consecutive) states into the FSM. I've
    not looked at the actual code generated yet to see how well
    yacc "folds" those tables (implication analysis). Nor if it is
    clever enough to shrink the "other" tables to their minimum sizes
    (i.e., omit the "can't happen" state transitions).

    For example, parsing a fixed width data field:

    value: [1-9][0-9][0-9][0-9][0-9]'.'[0-9][0-9]
    | SP [1-9][0-9][0-9][0-9]'.'[0-9][0-9]
    | SP SP [1-9][0-9][0-9]'.'[0-9][0-9]
    | SP SP SP [1-9][0-9]'.'[0-9][0-9]
    | SP SP SP SP [1-9]'.'[0-9][0-9]
    | undefined

    Note these sorts of rules are more expensive than
    a more typical:

    value: [0-9]+.[0-9]+

    As I say, lex exists for a reason. Which god am I angering by
    failing to pay it tribute? :-/
     
    D Yuniskis, Jul 24, 2010
    #5
  6. D Yuniskis

    D Yuniskis Guest

    Hi Walter,

    Agreed as to "clarity", ease of maintenance, etc.

    But, you can push some stuff into the lexer instead of putting
    it *all* in the yacc grammar. Roughly the same in terms of
    readability:

    [1-9] return NONZERO;
    0 return ZERO;

    with:

    digit: NONZERO | ZERO
    value: NONZERO digit digit digit digit '.' digit digit
    | SP NONZERO digit digit digit '.' digit digit

    (ad nauseam)

    [this is admittedly a bad example]

    My question regards the differences in the generated code
    and memory requirements (e.g., transition tables in the FSM)
    that are consequential to one approach over another.

    (and, whether different lex/yacc implementations are better
    or worse than others)

    Keep in mind, lex and yacc were intended to generate code that
    runs in resource rich environments...
     
    D Yuniskis, Jul 24, 2010
    #6
  7. Hi Don,
    Point of clarification: a lexer specification *is* a formal grammar
    ... only for a simpler language (in the Chomsky sense).

    For one thing, using (f)lex, it won't work if you are using a yacc or
    bison parser above it. The parser expects the return value to be a
    numeric token identifier ... returning an arbitrary integer extracted
    from the input will cause a fatal error.

    What you want to do is something like the following:

    [1-9][0-9]{2,4} { yylval = atoi(yytext); return VALUE; }

    The token's value is passed separately through a variable called
    "yylval", which defaults to an integer but can be redefined to be a
    union of differing types.
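
    For example, a minimal sketch of the union approach (the names
    here are purely illustrative):

        %union {
            int   ival;
            char *sval;
        }
        %token <ival> VALUE
        %token <sval> NAME

    with the corresponding lexer rules setting the right member:

        [1-9][0-9]{2,4}   { yylval.ival = atoi(yytext); return VALUE; }
        [a-zA-Z]+         { yylval.sval = strdup(yytext); return NAME; }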

    lex/yacc is before my time, but using flex/bison you can redefine the
    default parser and lexer functions to include arbitrary parameters.
    I've used this ability to, e.g., parse out of memory buffers rather
    than from file streams.
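
    With flex, for example, scanning from a memory buffer is roughly
    (sketch only; error handling omitted):

        YY_BUFFER_STATE buf = yy_scan_string(msg);  /* or yy_scan_bytes(msg, len) */
        yyparse();                                  /* parser pulls tokens from the buffer */
        yy_delete_buffer(buf);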

    In general there is no performance difference between extracting
    non-character data in the lexer vs extracting it at a higher level ...
    you'll make the same number of passes over it.

    However, the lexer does not have to preserve input characters across
    calls to yylex(), so extraction at a higher level may mean making a
    copy of the input characters.

    No ... your issue is outside the scope of either lexer or parser
    generation.

    flex and bison are MAJOR improvements on lex and yacc, but the
    differences are optimizations of their internal DFA and PDA processes,
    support for programmer extensions, and for bison, support for handling
    more complex grammars: yacc is LALR, bison supports LALR, LR, SLR and
    GLR (depending on version).


    George
     
    George Neuner, Jul 24, 2010
    #7
  8. The lexer has to examine each and every character of input and there
    is overhead for each call to it. The more times the parsing code
    calls the lexing code, the slower the parse becomes.

    Regardless of implementation, tokenization and lexical analysis of the
    input stream are the limiting factors for parser performance. In a
    typical optimizing compiler, lexing accounts for 70-80% of the running
    time while parsing accounts for less than 1%.

    No they aren't. lex and yacc are products of the 1960's when the
    average program executing on a mainframe had 32KB of RAM. They were
    designed around a particular space/speed performance point that was
    acceptable to programmers and which the tools themselves could manage
    the limited space available.

    Newer tools like flex and bison have other options. bison has even
    tighter table packing modes than yacc and can produce smaller parsers
    using the same grammars. But bison can also produce much faster
    parsers, or reentrant parsers, or handle more complex grammars, etc.
    Similarly, flex can make smaller lexers than lex or it can make lexers
    that run somewhat faster.

    George
     
    George Neuner, Jul 24, 2010
    #8
  9. D Yuniskis

    D Yuniskis Guest

    Hi George,

    Yes, but the grammars that lex can parse are quite simple.
    It's mainly used to do the "grunt work" (e.g., gobbling
    up groups of characters that the yacc grammar would
    eventually consider as "nonterminals" -- NUMBER, STRING,
    PATHNAME, etc.) that would consume lots of state transitions
    in the yacc FSM.
    [I can't speak for flex...]

    lex begins assigning token values at 256 -- conveniently so that
    you can pass any ASCII value (any 8 bit value, to be pedantic)
    directly through. E.g., "(int) yytext[0]" just takes the
    "current character" (matching "dot") and returns it AS IF
    it were a "token". (technically, the cast should be a bit more
    explicit to handle signed char's properly)

    So, the rule I cited essentially turns lex into a raw pipe...
    Yes, I understand. I was pointing out the difference between the
    lex+yacc approach and the "dumb lex"+"do-it-all-in-yacc" approach.

    I.e., the traditional approach would be to let lex do the
    grunt work of gathering up all the "digits", converting them
    to an integer (or whatever the grammar really wants), stuffing
    the value of that "integer" into u.int and reducing the
    character sequence to a single non-terminal: INTEGER (etc.).

    By contrast, I am comparing the approach where the individual
    digits are passed directly through lex as tokens in their
    own right. I.e., '0' is the token having the value 0x30
    (call it ZERO if you like); '1' is the token having the
    value 0x31 ("ONE"), etc.

    Then, letting yacc directly operate on these "characters"
    (tokens) and fabricate its own concept of "integer:"
    (vs. INTEGER).

    So, my question is, how expensive does this get to be by
    doing that grunt work in yacc (instead of lex)? E.g.,
    yacc has to track N different states (for the N digit
    "integer:") instead of responding to a single INTEGER
    token in *a* particular state.

    [grrr... this is clear in my head but I seem to be
    having a tough time expressing it :< ]

    Consider:

    You can "scan" a 5 character sequence of digits with
    a counter and a single state (if char is digit, stay
    in this state and increment count; if char is not digit,
    move to someother state as defined by the grammar).

    If, however, yacc "can't count", then it effectively
    counts a 5 character sequence of digits by:

    state1: if digit, move to state 2, else state X
    state2: if digit, move to state 3, else state ...
    state3: if digit, move to state 4, else state ...
    state4: if digit, move to state 5, else state ...
    state5: if digit, move to state 6, else state ...

    i.e., in state2, one digit has been accumulated; in
    state3, two digits have been accumulated; ... until,
    finally, in state6, five (consecutive) digits have
    been accumulated.

    If each state requires a transition table with 256+ entries
    (for each of the possible stimuli -- tokens -- that can be
    received *in* that state), then having "state strings" like
    this gets expensive.

    E.g., imagine a grammar that parses:

    sentence: "The quick brown fox stepped on the lazy dog's nose"

    entirely within yacc (I am guessing that implication analysis
    won't let you reduce that to anything less than strlen(...)
    discrete states!)
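
    To put it concretely, compare the shape of the two rule styles
    (just a sketch, not from a real grammar):

        /* "counted": roughly one parser state per position */
        value5 : digit digit digit digit digit ;

        /* left recursion: a small, fixed set of states that loop,
           but "exactly five" is no longer enforced by the grammar */
        value  : value digit
               | digit
               ;
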
    Yes, this is an inherited capability.
    I'm concerned with the efficiency of the FSM generation.
    (consider the "sentence" above) And, how deep the pushdown
    stack gets to cover backing out of "wrong promotions"
    that are attempted as yacc encounters new tokens (which,
    in the case I am suggesting, are actually *characters*!)
    The grammars are all relatively simple. Rather, I was interested in
    any differences in the automata they construct (for the previously
    illustrated examples).

    I should just build some bogus grammars and tweak them to see
    how the tables change in size. Perhaps I can deduce the basic
    behavior of the FSM generator empirically (I suspect these
    things are too clever for that sort of simplistic experimentation)

    [if I've still not managed to highlight the issue that is
    of interest to me, let me know and I'll try to come up with
    some real examples to *quantitatively* illustrate]
     
    D Yuniskis, Jul 24, 2010
    #9
    The pattern language lex accepts is a superset of regular expressions.
    http://en.wikipedia.org/wiki/Chomsky_hierarchy

    lex adds to regex limited context sensitivity and limited ability to
    do counting. Both of these abilities are simple hacks, however, and
    neither is sufficient to promote the recognized language to the next
    Chomsky level.

    Read (f)lex as shorthand for "flex or lex"
    So does flex.
    Yes, bison predefines 1..255 as tokens (I don't know about yacc).
    However zero is an illegal token - (f)lex returns zero to mean EOF. So
    you can't parse binary that includes zero by doing this.

    I don't think anyone has measured this.

    I would expect the __raw__ regex ((f)lex) and PDA (yacc/bison) state
    machines to be roughly comparable in performance. The main difference
    is in where and when any additional user code is executed.

    For regex, the user code is always executed after the FSM has finished
    so there is no impact on recognition speed. If you try to recognize
    simple things in the PDA however, the intermediate rules may have to
    execute code: consider the difference between recognizing the word
    "integer" and recognizing an actual integer value:

    %%

    input : word
            { printf( "recognized word \"integer\"\n"); }
          | value
            { printf( "recognized value : %d\n", $1 ); }
          ;

    word : 'i' 'n' 't' 'e' 'g' 'e' 'r';

    value : value digit
            { $$ = ($1 * 10) + $2; }
          | digit
          ;

    digit : '0' { $$ = 0; }
          | '1' { $$ = 1; }
          | '2' { $$ = 2; }
          | '3' { $$ = 3; }
          | '4' { $$ = 4; }
          | '5' { $$ = 5; }
          | '6' { $$ = 6; }
          | '7' { $$ = 7; }
          | '8' { $$ = 8; }
          | '9' { $$ = 9; }
          ;

    %%

    Accumulating any kind of value requires execution of user code. To
    accumulate an integer, you *must* construct a value for EACH digit
    individually and pass it up the chain.

    So for example, the following won't work:

    value : value digit
            { $$ = ($1 * 10) + ($2 - '0'); }
          | digit
            { $$ = $1 - '0'; }
          ;

    because by default the current state identifier is passed up the chain
    ... that is, the subrule digit will receive some bogus internal yacc
    value instead of the input character.

    As a matter of fact, yacc CAN count - indirectly via the PDA's stack -
    which is why yacc can handle recursive grammars while lex cannot.

    But I know what you mean.


    (f)lex constructs a pure state machine. yacc/bison constructs a state
    machine with a stack. Aside from that you are dealing with how they
    pass user values around.

    George
     
    George Neuner, Jul 26, 2010
    #10
  11. Cesar Rabak

    Cesar Rabak Guest

    On 23/7/2010 05:47, D Yuniskis wrote:
    [snipped]

    Don,

    Having arrived late in this thread, and having understood that you're
    trying to use lex/yacc for a [comm] protocol, I wonder whether in your
    searches you considered using tools closer to your problem at hand.

    Something lighter than Estelle, for example?
     
    Cesar Rabak, Jul 26, 2010
    #11
  12. D Yuniskis

    D Yuniskis Guest

    Hi Cesar,

    I've never used Estelle. But, it seems a lot more heavy-handed
    than simply specifying a grammar in lex/yacc. I.e., it seems like
    it wants to extend *beyond* just the "interface specification"
    and deeper into the interactions between competing/cooperating
    processes that happen to be "communicating" (gee, how's that
    for a mouthful? :> )

    I just want to create a simple template/framework that allows
    others to create similar interfaces just by tweaking an
    existing one (since most folks seem to go "deer-in-headlights"
    when faced with creating something "from scratch").

    lex/yacc can do that. But, the question is, where various bits
    of the grammar should reside to most effectively use these
    capabilities (as with most things, there are lots of different
    ways to approach it WITHIN the lex/yacc framework; but, there
    are consequences to each approach!). Since I am anticipating
    others will copy/modify what I have done, there is extra incentive
    to "get it right" (lest *everyone* ends up repeating my mistakes!)

    :-/
     
    D Yuniskis, Jul 27, 2010
    #12
  13. If all you need is regex there is a simple preprocessing tool called
    RE2C that lets you embed the expression right into your C code. I
    used it a number of years ago and it worked quite well. The project
    has a new owner and supposedly there have been improvements since
    then. http://re2c.org/
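
    To give the flavor, an RE2C scanner is ordinary C with the
    patterns inside a special comment -- very roughly (this is from
    memory, so the exact directives may differ by version):

        /* token codes invented for the example */
        enum { TOK_ERR, TOK_NUM, TOK_WORD };

        static int scan(const char *YYCURSOR)
        {
            const char *YYMARKER;
            /*!re2c
                re2c:define:YYCTYPE = char;
                re2c:yyfill:enable  = 0;

                [0-9]+      { return TOK_NUM; }
                [a-zA-Z]+   { return TOK_WORD; }
                *           { return TOK_ERR; }
            */
        }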

    If you can use a C++ compiler (even for writing C) Boost Spirit is a
    lightweight library for regex and LL parsing. A number of Boost
    libraries are function based rather than object based and intermix
    easily with C code ... but I haven't used Spirit so I don't know which
    category it is in. http://www.boost.org/

    George
     
    George Neuner, Jul 28, 2010
    #13
  14. D Yuniskis

    D Yuniskis Guest

    Hi George,

    Read (f)lex as shorthand for "flex or lex"

    Yes, I understood your notation. Rather, I was commenting
    that I don't know if *flex* makes the same assumptions
    re: token "values" (starting at 256)
    I'm not worried about binary values -- just ASCII
    (I don't consider 'NUL' :> )
    lex seems to like to "write spaghetti code" (an overstatement
    but, contrasted to yacc's more rigid "tables", I think you
    know what I mean).

    For the sorts of "traditional" tokens that it is meant to
    parse, I think lex is probably a win. E.g.,

    [0-9]+ return INTEGER;
    [a-zA-Z]+ return WORD;

    etc. yacc would probably not take a big hit, either, as
    it's just a state looping back to itself repeatedly.
    Even ad hoc approaches can be efficient with these.
    (i.e., lots of actions result in "same state" transitions
    so the total number of states is "fewer" than if you had
    to *count* "digits" or "letters")

    OTOH, if your grammar has more rigidly defined tokens like:

    [0-9]{4,4} return INTEGER;

    or (gasp):

    [1-9][0-9][0-9][0-9]
    | ' '[1-9][0-9][0-9]
    | ' ' ' '[1-9][0-9]
    | ' ' ' ' ' '[0-9]

    then there are a lot *more* state transitions involved.
    Here, yacc's approach could get pricey (since you can't
    "cheat" and wrap lots of character sequences into a
    *few* consecutive states like:

    {' '}*[1-9]?[0-9]+

    (which isn't a 1:1 replacement for the above)

    I.e., for rigid message formats, you can end up
    with lots of sequential states, each of which
    defines the *few* expected "inputs" that will
    advance to the "next state" while many/most
    other "inputs" will signal an error (syntax).
    Yes, but you could do:

    integer: digit digit digit digit
        {
            $$ = ( (( (( ( ($1 - '0') * 10) +
                           ($2 - '0')) * 10) +
                         ($3 - '0')) * 10) +
                       ($4 - '0'));
        }

    (assuming I've counted my parens properly :-/ )
    I'll have to build some toy grammars and move the promotions
    around between lex and yacc and see what the resulting code
    looks like. I'll try to get around to that in the next few
    days and post some examples

    [currently preoccupied retiring a server and replacing another
    so I live in fear of "forgetting something" if distracted while
    in the process :-/ Unfortunately, I should have installed
    the Gb switch *before* doing this as moving that much data at
    100M is *painfully* slow! :< Sheesh! To think back to the
    SneakerNet days... :-/ ]
     
    D Yuniskis, Jul 28, 2010
    #14
  15. D Yuniskis

    D Yuniskis Guest

    Hi George,

    Agreed. This suggests letting lex gobble up as much
    input as it can (reasonably).

    OTOH, the parser has more knowledge of what's going on
    than the lexer. Once invoked, the lexer will gladly
    process lots of input... THAT THE PARSER MIGHT CONSIDER
    AS "INAPPROPRIATE" (given context)!

    It's this that I think will force me to move everything
    into yacc, regardless of "efficiency". Thinking aloud:
    if context suggest to yacc that <blah> is *required*,
    lex can merrily encounter <string> in the input and
    patiently gobble up as many [A-Za-z]+ as happen to be
    present -- EVEN THOUGH "return STRING" will cause yacc
    to signal an error!

    If, OTOH, lex simply passed each character (in essence)
    to yacc as discrete tokens and let yacc promote each
    accordingly based on the rules of the grammar, then
    yacc can detect invalid "messages" earlier!

    Recall, the goal here is to parse messages, not "text
    files" (e.g., "source code"). And, that processing is not
    "off-line"/batch oriented. I.e., lex's "input()" will
    be pulling characters off of a live input stream.

    That input stream is not "received" in nice clean units
    (i.e., it's not like a text file where full lines can be
    extracted at a time). E.g., it could *pause* "mid-message"
    for an indeterminate amount of time (subject to the
    communication protocol in place).

    As a (trivial) example, consider a grammar that supports
    two message types:

    "Hello, my name is <string>"
    | "<expr> <op> <expr>"

    (where <expr> and <op> are what you suspect them to be)

    [granted, a silly example but should suffice to illustrate my
    point]

    If the parser has processed "Hello, my name is ", then it
    is obviously awaiting <string>. If, OTOH, the input contains
    "1234567.8" and has *stalled* at that point, then having
    lex process <number> as "[0-9]+'.'[0-9]*" means lex will
    dutifully hang waiting the next character (after the '8')
    to see if it can fold that into it's <number> rule. Only
    when *lex* has decided where that token's boundary lies
    will it pass NUMBER to yacc.

    OTOH, if the "1" had been passed to yacc, it would have
    already decided that "Hello, my name is 1..." is not a
    valid sentence in the grammar and signalled an error.
    (by contrast, if it has to wait for lex -- and lex has to
    wait for a *paused* input stream -- then the error is
    not detected until well after the fact!)

    So, it looks like lex can only do context independent
    promotions *if* I want to keep the error path expedited.
    But, they had secondary storage available, could make multiple
    passes over the input, weren't time constrained, etc. I.e.,
    they could concentrate on *just* that one aspect.
    The details of their respective (and relative) automata are
    needed to sort out just how efficient each *might* be.
     
    D Yuniskis, Jul 28, 2010
    #15
  16. That's not a convincing argument.

    First of all, every successful non-human language and protocol is
    context free (in the Chomsky sense). "Context free" does not mean the
    parser works without knowledge of its surroundings ... it means that
    the syntax is such that any legal "sentence" can have only one
    possible interpretation.

    Secondly, to be efficient the parser needs to work at a higher level
    than input characters. Lexers were created specifically so that
    parsers can work with abstract idea "tokens" rather than strings.

    That's right. But if the surroundings (let's avoid the word "context"
    to prevent Chomsky based misunderstandings) dictate that <blah> is
    required, then it would be a syntax error if the lexer found a string.

    But at the expense of a huge increase in grammar complexity and
    correspondingly larger parsing code, increased processing per input
    character and possibly a dead programmer who, in deep despair, blew
    his own brains out before finishing the job because he didn't realize
    what he was getting himself into.

    Live input doesn't matter in any theoretical sense. As a practical
    matter, however, if you must provide real time feedback you simply
    create an input driven "push" parser rather than an expectation driven
    "pull" parser. In a push parser, the lexer drives parsing, throwing
    tokens at the parser which accumulates them until they make sense.
    You'll get your error at the first "word" that doesn't fit into any of
    the possible "sentences" in the match set.

    yacc/lex can't create push parsers, but bison/flex and some other
    modern tools like ANTLR and Yacc++ can.
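
    For reference, the bison push interface looks roughly like this
    (a sketch; it assumes a bison version with push support, and
    next_token() is your own routine, not part of bison):

        /* in the .y file */
        %define api.push-pull push

        /* in the driver: feed tokens to the parser as they arrive */
        yypstate *ps = yypstate_new();
        int status;
        do {
            YYSTYPE lval;
            int tok = next_token(&lval);
            status = yypush_parse(ps, tok, &lval);
        } while (status == YYPUSH_MORE);
        yypstate_delete(ps);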

    That's as it should be. The "words" of a "sentence" have to be
    delineated. Parsing 1492 as "I've got '149' so far and I'm stuck
    waiting for more input" is semantically wrong. Until you see
    something that doesn't fit into the assumptions of the current match
    set, you *cannot* know that you've matched anything.

    Honestly, it doesn't matter whether you are matching characters or
    abstract tokens. Whenever matching is applied over sequences, the
    matching *process* itself is the limitation on what's possible.

    You're going to have to take my word for it that you likely will be
    unable to write a non-trivial grammar that can do that. Go ahead
    and try, but if you blow your brains out don't say I didn't warn you.

    My advice is to investigate push parsers instead.

    (f)lex has a feature that can help: start states (flex calls them
    start "conditions"). Normally all the individual regex patterns are
    coalesced into one giant DFA which matches input against all of them
    simultaneously and stops when it has a match against any of them.
    Start states group related patterns into sub-DFAs which are only
    matched against when the lexer is in their related state.

    Flex is more flexible 8) than lex in that it allows the same pattern
    to belong to more than one state group (in lex you have to repeat the
    pattern with a different state prefix). State changes are entirely
    handled by user code, so, for example, the state could be a parameter
    passed in by the parser.
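
    A rough sketch of the idea (condition and token names invented):

        %x VALUE_FIELD

        %%
        <INITIAL>"GET"          { BEGIN(VALUE_FIELD); return GET; }
        <VALUE_FIELD>[0-9]+     { yylval = atoi(yytext); return VALUE; }
        <VALUE_FIELD>\r\n       { BEGIN(INITIAL); return CRLF; }
        %%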

    Overall flex is a much better tool than lex. It is *much* more
    flexible in the way it interfaces to the parser and in how it handles
    input. Unlike lex, there are well documented hooks into most of the
    core functions. Flex lexers are intended to be customizable. Flex
    also can create measurably faster lexers.

    George
     
    George Neuner, Jul 29, 2010
    #16
  17. D Yuniskis

    D Yuniskis Guest

    Hi George,

    Sorry, I meant "context" in the sense that it is used in
    defining syntax (e.g., "left context", "right context")
    of symbols. E.g., <number> may be a valid token in
    *some* portions of a particular grammar but would never
    be valid with a left context of "GET".
    But wasn't *efficiency* the whole point of my query? :>
    I.e., I am already acknowledging this (potential) "hit"
    and trying to gauge how big it will be.

    OTOH, eliminating the lexer results in a savings, there.

    As always, the devil is in the details (of the grammar, etc.)
    But, if <string> is something like [a-zA-Z]+, then
    the lexer can sit there gobbling an indeterminate number
    of characters even though the *first* such character
    would be enough for the parser to realize a syntax
    error has occurred!
    I don't see that. You're thinking in terms of languages,
    not communication protocols. E.g., you could parse SMTP/NNTP/etc.
    messages without a lexer and without unduly complicating the
    parser's grammar. The question is, how much cost in terms of
    performance (space+time) does such an approach incur (?).
    Note that SMTP/NNTP/etc. don't have the same issues that I'm
    addressing so this approach is needlessly complex, there
    (e.g., who cares if the SMTP client/server gobbles up a
    200 character "string" as <email_address> and *later*
    (algorithmically) determines it to be invalid *as* an
    email address).
    Gries' top-down vs. bottom up.
    The goal here is to stick with lex and yacc (with some *possible*
    allowances for flex/bison). Otherwise, it would be easier to
    just write some tight "hand-crafted" parsers and document how they
    could be modified for other, similar grammars.
    But that is not consistent with the goal of detecting
    errors *early*.

    Unlike pulling text from a file (foo.c, etc.), this is a
    real-time application -- there is information to be gleaned
    from the temporal aspects of the data stream. E.g., if characters
    are to arrive over a serial link at a particular rate,
    then they *must* arrive at that rate... waiting indefinitely
    ignores an error condition. Likewise, waiting some amount of time
    "per character" allows a trickle of characters to keep tickling
    the timeout (thus avoiding a temporal error indication) yet
    *still* satisfying an inappropriate (for the current "surroundings")
    token for the particular grammar.

    Lex and yacc weren't written for the use to which I am applying them.
    But, there are enough hooks in their implementations that I think
    I *can* trick them into doing what I want within my design
    constraints (I have to look more closely at how to throttle
    their stack usage, etc.).
    I must be missing something. :< I threw together some simple
    grammars (<verb> <object> <modifier>?) and got yacc to perform
    as I had expected. I was unhappy with the footprint but haven't
    yet looked to see how much initialization code is dragged in
    by default, how I can tweak the size of the pushdown stack, etc.
    (but I suspect there is a way to adjust those things).

    Moving recognition of the individual tokens into a lexer
    front end only marginally cleans up the grammars. And, adds
    in the "uncertainty" that I mentioned above. It also adds
    more of those "configuration" items that I'll have to chase down
    (to more closely "fit" the lexer -- to whatever extent it *is*
    used -- to the particulars of the grammar in question)
    Won't help. Unless I move parts of the grammar specification
    into this realm just as a "hack".

    The advantage of lex/yacc is that it would let developers
    "simply" formalize the interface grammars and leave the
    code writing to lex/yacc. Cleaner implementation that
    should be a lot easier to validate (instead of walking through
    a hand-crafted parser verifying that each path in the syntax
    tree is handled correctly).
    I'll be traveling for the next few weeks. I'll try to
    tabulate the results I got from the first tests I ran
    (along with the grammars) so you can see the issues
    and their resolutions in the two approaches (lex vs. yacc).

    Thx,
    --don
     
    D Yuniskis, Aug 25, 2010
    #17
  18. Not really ... pull vs push parsing is more like the difference
    between function call-return and CPS (continuation-passing style):
    the control path is inverted.

    In a push design the lexer is fed characters one at a time and tries
    incrementally to match its partial input against the recognizer set.
    When a valid token is identified, that token is fed to the parser
    which tries incrementally to match partial token sequences against its
    rule set. When a rule is matched, the parser executes the associated
    user code as is usual.

    The only additional complexity vs the normal pull design is that the
    lexer and parser have to remember the current state of their FSM
    between activations. Because bison/flex FSMs are table driven, in
    practice that amounts to just a few integer values.

    WRT memory use, a push lexer doesn't need any character buffering at
    all unless it must pass a matched string on to the parser. Input
    stream buffering (if any) is normally done by the surrounding program
    code. Character strings passed from lexer to parser need to be heap
    allocated, but the same is generally true in pull designs. Overall
    the push design can use less memory.

    As far as catching errors early, the lexer is essentially fed one
    character at a time ... if it can tell that the character is out of
    place for a particular recognizer context, it can signal the error
    immediately (either directly or by passing an error token to the
    parser).

    Yacc and lex are severely outdated - they now are kept only for
    backward compatibility. Bison and flex now are standard issue on
    nearly every Unix/Linux system. Besides which, anyone likely to
    understand an LALR grammar well enough to maintain/modify the code
    likely already knows about them.

    Unless the language[*] is simple or the target platform extremely
    limited, a hand-crafted parser is almost never the right way to go.
    Good CCG tools like bison, Yacc++, ANTLR, LLgen, etc. have sound
    theory behind them and their results are predictable and reproducible.
    For non-trivial grammars, generated parsers are likely to be both
    smaller and faster than most people could hand craft.

    [*] a comm protocol is a language just as is a Turing complete
    programming language. The only difference is complexity.


    Lexers are a different story. Lexers are relatively easy and when the
    set of recognizers is limited, table driven regex can be improved upon
    by hand coding. Quite a few real-world systems use a generated parser
    with a hand coded lexer.

    Most use of the CPU stack is from user added code. FSM use in both
    tools is actually very modest as the generated code is basically a
    loop containing a big switch statement - there is no recursion. Yacc's
    PDA stack is optionally a static array or a heap allocation and each
    stack element is quite small: a couple of integer indexes and a
    "token" type (which by default is an integer but can be a user defined
    union or structure).

    Push parsers/lexers created by bison/flex are just big switch
    statements - they don't need the control loop.

    I'm not sure what you mean by that - yacc/bison and (f)lex don't have
    any "configuration" problems that aren't part of any multiple input
    file project. However, there are better tools that will generate
    parser and lexer from a single grammar file.

    I still think you should look into using a push parser.


    George
     
    George Neuner, Aug 26, 2010
    #18
  19. D Yuniskis

    D Yuniskis Guest

    Hi George,

    Yes, this is top down vs. bottom up (in Gries-speak).

    A bottom up parser tries to "promote" (Gries would say
    "reduce") the terminals to nonterminals with the eventual
    (ultimate) goal of reducing them to the "distinguished
    symbol" (e.g., "S" or "G" or whatever you use to represent
    a "valid 'sentence'" in that grammar).

    A top down parser starts at the distinguished symbol and
    tries to come up with a set of non terminals that map
    onto the input stream of terminals.

    In the bottom up case, if a character ("terminal")
    can be promoted (reduced) to *any* nonterminal, then
    the parser has to accept it IN CASE that nonterminal
    might eventually be recognized and promoted (reduced)
    to yet another nonterminal and, ultimately, to the
    distinguished symbol (WIN). Only when the parser
    can determine that the character *doesn't* allow a
    promotion to the "distinguished symbol" can the input
    stream be considered "in error" (i.e., can't get a valid
    sentence from this input).

    In the top down case, as long as the input (stream)
    continues to match *some* possible set of terminals
    (and nonterminals) that can be derived *FROM* the
    "distinguished symbol", parsing is "on track". I.e.,
    "so far, this input stream *could* potentially form
    a valid sentence". OTOH, s soon as a terminal is
    encountered that *doesn't* fit a derivation from
    the distinguished symbol, *everything* up to this point
    must be considered as bogus.

    Assume S ::= VERB NOUN VALUE?
    where VERB and NOUN are specific "words" and VALUE is
    a "number". (S is the "distinguished symbol" for
    this grammar)

    In practical terms, if I see a set of digits, I keep
    gobbling them up as they can represent a "number"
    (VALUE in my example). Once I see the end of that
    sequence of digits, I can promote this to a VALUE.
    But, then I realize that "VALUE isn't valid in this
    'context'" (surroundings). So, my error is detected
    late instead of early.

    OTOH, if you proceed top down, you expand the (single)
    rule for the distinguished symbol (S) to get VERB NOUN VALUE?
    You then expand VERB and find that the *first* character
    in your input stream (a digit!) doesn't satisfy the (single!)
    rule for VERB (a string of letters, for example). So,
    you immediately know the "digit" represents a syntax error
    without waiting around to accumulate more "digits"
    Exactly. Communication protocols tend to be very narrowly
    defined. E.g., cc(1) will gladly interpret:
    "123.000000000000000000 + -456.000000000000"
    whereas conveying that same *information* could be done with
    "123 + -456" (assuming the protocol implicitly promotes
    to floats)
    I'm hoping to use *one* set of tools to implement protocols
    on a range of devices from resource rich "hosts" to
    resource *starved* "motes". The appeal is significant
    since it lets protocols et al. be defined and implemented
    before target hardware is selected. However, if the
    resulting code ends up placing a burden that effectively
    *requires* a specific class of processor just to parse
    messages, then the cost will exceed the benefit.
    Look at your comments again with "I have a little 8b micro"
    in mind. Ask yourself how much of that processor's resources
    you want to set aside a priori for lex/yacc's needs. Then,
    ask yourself what you could do "by hand".
    No, I meant configuring the *runtime* that they generate.
    E.g., so that UNCONSTRAINED input doesn't cause them to
    crash; so they can handle the intended syntax accurately
    yet without bloat, etc.

    E.g., if I *know* all "VALUE"s are 4 "DIGIT"s and DIGITs
    are never used elsewhere, I (writing a hand-crafted parser)
    can set aside a single "short" and parse the VALUE directly
    into it -- without ever having to track the individual DIGITs
    that comprise those VALUEs (in case they might be used in
    other nonterminals).
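
    E.g., a hand-rolled fragment along these lines (just a sketch;
    names invented) needs nothing beyond that "short" and a loop
    counter:

        /* parse exactly four digits into 'value'; no per-digit tokens kept */
        short value = 0;
        for (int i = 0; i < 4; i++) {
            int c = next_char();            /* whatever feeds the input stream */
            if (c < '0' || c > '9')
                return SYNTAX_ERROR;        /* invented error code */
            value = (short)(value * 10 + (c - '0'));
        }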

    As I said in a previous post, I'll put together an example
    that points out each of these items and puts real numbers
    on them so you can better see the "costs" involved. For
    even trivial grammars, they are significant! :<
     
    D Yuniskis, Aug 26, 2010
    #19
  20. This proves that while you understand some of the theory, you clearly
    do not have a firm grasp on the technology.

    What you are describing is the difference between "sequence" matching
    (also called "tuple" matching) and "predictive" matching. Either form
    of matching can be applied to either parsing or to lexing (tokenizing)
    ... and you are improperly conflating these levels.

    All available LR tools - yacc, bison, etc. - are bottom-up parallel
    sequence match parsers that use a finite state machine implementation
    (almost all use a ND-PDA) ... the parser accumulates tokens (context)
    as long as the set of partial rule matches is non-empty and reduces
    (executes) a rule when it can be exactly matched to the tail of the
    sequence.

    All available LL tools - ANTLR, llgen, pccts, etc. - are top-down
    sequential predictive parsers that use a recursive descent coded
    implementation ... the parser chooses a rule to pursue based on the
    current "look-ahead" token(s) (the next to be input). Predictive
    parsing has trouble when multiple rules have common prefixes - unlike
    sequence matching, predictive matching has no viable FSM
    implementation for parallel matching, so common prefixes have to be
    handled by
    a) refactoring the grammar,
    b) backtracking (trying each alternative),
    c) increasing look-ahead (syntactic predication),
    d) forking the parser,
    or e) all of the above.


    "LR" and "LL" refer to Chomsky language groupings and to different
    ways of decomposing the input. It is logically incorrect to lump "LL"
    with predictive parsing and "LR" with sequence parsing ... the reasons
    for these associations are historical and are based on the most
    efficient ways known to implement them on serial computers. There is
    no technical reason why a LR parser can't be predictive and recursive
    descent coded or why a LL parser can't be parallel and FSM coded ...
    they aren't because those implementations would be (much) less
    efficient than the ones now used.

    There are both top-down and bottom-up lexing (tokenizing) tools.
    (f)lex is a bottom-up sequence matching FSM implementation based on
    regular expressions. ANTLR is a top-down predictive recursive descent
    code implementation based on EBNF plus regular expressions. There are
    a number of other lexer tools (mostly top-down) that use EBNF for
    specifying patterns.

    ================================


    What I was talking about with "pull" vs "push" is control flow. The
    conventional pull case is diagramed this way:

    application <-- parser <-- lexer <-- char-stream

    or more generally as:

    application <-- parser <------+-- char-stream
                       ^          |
                       +- lexer <-+

    where the application invokes a parser and the parser drives input
    from the stream and subsequent lexing. This is the "typical" use by
    compilers, etc.


    The push case is quite different. It is diagramed as:

    application <----------------+-- char-stream
        ^    ^                   |
        |    +--------+- lexer <-+
        |             |
        +-- parser <--+

    Here the lexer and parser are independent of the stream and operate as
    filtering stages that can be applied to it. The application maintains
    control of the stream and decides whether to pipe (push) it through a
    lexer or lexer/parser combination.

    Push parsing is usually simplified to:

    application <-------------+-- char-stream
        ^                     |
        +- parser <-- lexer <-+

    because where a parser is needed there typically is not a use for
    tokenization alone. But the general case is applicable where there
    are different lexer and/or parser plug-ins for different purposes.

    Push is normally done with bottom-up parsers. It's harder to do with
    a top-down parser because the partial parse has to be stored between
    activations and recursive descent code has a lot more state to be saved
    than does a FSM. But it can be done.

    Sigh. For the, I think 3rd, time, this has NOTHING to do with parsing
    strategy ... it is a tokenization, i.e. lexer, issue.

    All that is required to catch syntax errors at the character level is
    for the parser to provide the needed context information so that the
    tokenizer can make informed choices.

    If you really want/need to do this then (f)lex probably is the wrong
    tool to use and you may want to use a predictive tokenizer. But the
    choice of lexing strategy has absolutely zero bearing on the choice of
    parsing strategy ... and not a whole lot of bearing on the choice of
    parser tool either given that most expect a separate lexer and will
    work with anything that obeys their interface convention. (There are
    a few tools which expect the parser will directly control the input
    stream, but none of them are terribly popular).

    George
     
    George Neuner, Aug 26, 2010
    #20
