#
# Module Parse::Yapp.pm.
#
# Copyright (c) 1998-2001, Francois Desarmenien, all right reserved.
#
# See the Copyright section at the end of the Parse/Yapp.pm pod section
# for usage and distribution rights.
#
#
package Parse::Yapp;

use strict;
use vars qw($VERSION @ISA);
@ISA = qw(Parse::Yapp::Output);

use Parse::Yapp::Output;

# $VERSION is in Parse/Yapp/Driver.pm


1;

__END__

=head1 NAME

Parse::Yapp - Perl extension for generating and using LALR parsers.

=head1 SYNOPSIS

  yapp -m MyParser grammar_file.yp

  ...

  use MyParser;

  $parser=MyParser->new();
  $value=$parser->YYParse(yylex => \&lexer_sub, yyerror => \&error_sub);

  $nberr=$parser->YYNberr();

  $parser->YYData->{DATA}= [ 'Anything', 'You Want' ];

  $data=$parser->YYData->{DATA}[0];

=head1 DESCRIPTION

Parse::Yapp (Yet Another Perl Parser compiler) is a collection of modules
that let you generate and use yacc-like, thread-safe (reentrant) parsers with
a Perl object-oriented interface.

The script yapp is a front-end to the Parse::Yapp module and lets you
easily create a Perl OO parser from an input grammar file.

=head2 The Grammar file

=over 4

=item C<Comments>

Throughout your files, comments are either Perl style, introduced by I<#>
up to the end of line, or C style, enclosed between I</*> and I<*/>.


=item C<Tokens and string literals>


Throughout the grammar files, two kinds of symbols may appear:
I<Non-terminal> symbols, also called I<left-hand-side> symbols,
which are the names of your rules, and I<Terminal> symbols, also
called I<Tokens>.

Tokens are the symbols your lexer function will feed your parser with
(see below). They come in two flavours: symbolic tokens and string
literals.

Non-terminals and symbolic tokens share the same identifier syntax:

    [A-Za-z][A-Za-z0-9_]*

String literals are enclosed in single quotes and can contain almost
anything. They are written to your parser file double-quoted, so any
special character is interpreted as such. '"', '$' and '@' are automatically
quoted with '\', which makes writing them more natural. On the other hand,
if you need a single quote inside your literal, just quote it with '\'.

You cannot have a literal I<'error'> in your grammar as it would
confuse the driver with the I<error> token. Use a symbolic token instead.
If you inadvertently use it, Parse::Yapp issues a warning telling you
that you should have written I<error>, and treats it as if it were the
I<error> token, which is certainly NOT what you meant.


=item C<Grammar file syntax>

It is very close to yacc syntax (in fact, I<Parse::Yapp> should compile
a clean I<yacc> grammar without any modification, whereas the opposite
is not true).

This file is divided into three sections, separated by C<%%>:

    header section
    %%
    rules section
    %%
    footer section

=over 4

=item B<The Header Section> may optionally contain:

=item *

One or more code blocks enclosed inside C<%{> and C<%}> just like in
yacc. They may contain any valid Perl code and will be copied verbatim
at the very beginning of the parser module. They are not as useful as
they are in yacc, but you can use them, for example, for global variable
declarations, though you will notice later that such global variables can
be avoided to make a reentrant parser module.

=item *

Precedence declarations, introduced by C<%left>, C<%right> and C<%nonassoc>
specifying associativity, followed by the list of tokens or literals
having the same precedence and associativity.
Precedence levels increase with declaration order: the tokens declared
last have the highest precedence
(see the yacc or bison manuals for a full explanation of how they work,
as they are implemented exactly the same way in Parse::Yapp).

=item *

C<%start> followed by a rule's left-hand side, declaring this rule to
be the starting rule of your grammar. The default, when C<%start> is not
used, is the first rule in your grammar section.

=item *

C<%token> followed by a list of symbols, forcing them to be recognized
as tokens and generating a syntax error if they are used as the left-hand
side of a rule declaration.
Note that in Parse::Yapp, you I<don't> need to declare tokens as in yacc: any
symbol not appearing as the left-hand side of a rule is considered to be
a token.
Other yacc declarations or constructs such as C<%type> and C<%union> are
parsed but (almost) ignored.

=item *

C<%expect> followed by a number, which suppresses warnings about the number
of shift/reduce conflicts when both numbers match, a la bison.


=item B<The Rule Section> contains your grammar rules:

A rule is made of a left-hand-side symbol, followed by a C<':'> and one
or more right-hand sides separated by C<'|'> and terminated by a C<';'>:

    exp:    exp '+' exp
        |   exp '-' exp
        ;

A right-hand side may be empty:

    input:  #empty
        |   input line
        ;

(If you have more than one empty rhs, Parse::Yapp will issue a warning,
as this is usually a mistake, and you will certainly have a reduce/reduce
conflict.)


A rhs may be followed by an optional C<%prec> directive, followed
by a token, giving the rule an explicit precedence (see the yacc manuals
for its precise meaning), and by an optional semantic action code block
(see below).

    exp:   '-' exp %prec NEG { -$_[2] }
        |  exp '+' exp       { $_[1] + $_[3] }
        |  NUM
        ;

Note that in Parse::Yapp, a lhs I<cannot> appear more than once as
a rule name (this differs from yacc).


=item B<The Footer Section>

may contain any valid Perl code and will be appended at the very end
of your parser module. Here you can write your lexer, error report
subs and anything relevant to your parser.

=back

=item C<Semantic actions>

Semantic actions are run every time a I<reduction> occurs in the
parsing flow and they must return a semantic value.

They are (usually, but see below C<In rule actions>) written at
the very end of the rhs, enclosed with C<{ }>, and are copied verbatim
to your parser file, inside the rules table.

Be aware that matching braces in Perl is much more difficult than
in C: inside strings they don't need to match. While in C it is
very easy to detect the beginning of a string construct, or a
single character, it is much more difficult in Perl, as there
are so many ways of writing such literals. So there is no check
for that today. If you need a brace in a double-quoted string, just
quote it (C<\{> or C<\}>). For single-quoted strings, you will need
to add a comment containing the matching brace, I<in the right order>.
Sorry for the inconvenience.

    {
        "{ My string block }".
        "\{ My other string block \}".
        qq/ My unmatched brace \} /.
        # Force the match: {
        q/ for my closing brace } /.
        q/ My opening brace { /
        # must be closed: }
    }

All of these constructs should work.


In Parse::Yapp, semantic actions are called like normal Perl sub calls,
with their arguments passed in C<@_>, and their semantic values are
their return values.

$_[1] to $_[n] are the parameters just as $1 to $n in yacc, while
$_[0] is the parser object itself.

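As a rough illustration of this calling convention, the sketch below calls
an action body by hand. The empty hash standing in for the parser object
and the variable names are assumptions made for the example; in a real
parser the driver makes this call itself on each reduction.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parser object the driver passes as $_[0].
my $parser = {};

# Same body as the action in:  exp: exp '+' exp { $_[1] + $_[3] }
my $add_action = sub { $_[1] + $_[3] };

# Hand-made call: the parser first, then one value per rhs symbol.
my $sum = $add_action->( $parser, 2, '+', 3 );

print "sum = $sum\n";
```

Note that the value of the C<'+'> literal occupies $_[2], which is why the
second operand is read from $_[3].
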
Because $_[0] is the parser object itself, you can call
parser methods. That's how the yacc macros are implemented:

    yyerrok is done by calling $_[0]->YYErrok
    YYERROR is done by calling $_[0]->YYError
    YYACCEPT is done by calling $_[0]->YYAccept
    YYABORT is done by calling $_[0]->YYAbort

All those methods explicitly return I<undef>, for convenience.

    YYRECOVERING is done by calling $_[0]->YYRecovering

Four methods are useful in the error recovery sub:

    $_[0]->YYCurtok
    $_[0]->YYCurval
    $_[0]->YYExpect
    $_[0]->YYLexer

They return, respectively, the current input token that made the parse fail,
its semantic value (both can be used to modify their values too, but
I<know what you are doing>! See the I<Error reporting routine> section for
an example), a list which contains the tokens the parser expected when
the failure occurred, and a reference to the lexer routine.

Note that if C<$_[0]-E<gt>YYCurtok> is declared as a C<%nonassoc> token,
it can be included in the C<$_[0]-E<gt>YYExpect> list whenever the input
tries to use it in an associative way. This is not a bug: the token
IS expected to report an error if encountered.

To detect such a thing in your error reporting sub, the following
example should do the trick:

    grep { $_[0]->YYCurtok eq $_ } $_[0]->YYExpect
    and do {
        #Non-associative token used in an associative expression
    };

Accessing semantic values on the left of your reducing rule is done
through the method

    $_[0]->YYSemval( index )

where index is an integer. Values I<1 .. n> return the same values
as I<$_[1] .. $_[n]>, while I<-n .. 0> return values on the left of the rule
being reduced. (This is related to I<$-n .. $0 .. $n> in yacc, but you
cannot use I<$_[0]> or I<$_[-n]> constructs in Parse::Yapp: I<$_[0]> is
the parser object, and negative indices count from the end of C<@_>.)


There is also a provision for a user data area in the parser object,
accessed by the method:

    $_[0]->YYData

which returns a reference to an anonymous hash, which lets you keep
all of your parsing data inside the object (see the Calc.yp
or ParseYapp.yp files in the distribution for some examples).
That's how you can make your parser module reentrant: all of your
module states and variables are held inside the parser object.

Note: unfortunately, method calls in Perl have a lot of overhead,
and when YYData is used, it may be called a huge number
of times. If you are not a *real* purist and efficiency
is your concern, you may directly access the user space
in the object, $parser->{USER}, which is a reference to an
anonymous hash, and then benchmark.

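The following sketch shows why keeping state in the user data area matters
for reentrancy. C<SketchParser> is a hypothetical stand-in that models only
C<new> and C<YYData>; the C<VARS> key and the C<assign> rule are likewise
invented for the example and are not part of Parse::Yapp.

```perl
use strict;
use warnings;

# Hypothetical minimal parser class: only new() and YYData() are modelled,
# with YYData returning the user hash reference as documented.
package SketchParser;
sub new    { my $class = shift; return bless { USER => {} }, $class }
sub YYData { return $_[0]->{USER} }

package main;

# Action body for a rule such as:  assign: VAR '=' exp
# The variable table lives in the parser object, not in a global.
my $assign = sub {
    my ( $parser, $name, undef, $value ) = @_;
    return $parser->YYData->{VARS}{$name} = $value;
};

my $p1 = SketchParser->new;
my $p2 = SketchParser->new;
$assign->( $p1, 'x', '=', 1 );
$assign->( $p2, 'x', '=', 2 );

# Each parser keeps its own variables, so the two do not interfere.
# The first value is read through direct {USER} access, the second via YYData.
print "p1: $p1->{USER}{VARS}{x}, p2: ", $p2->YYData->{VARS}{x}, "\n";
```

With a global C<%vars> hash instead, the second assignment would have
overwritten the first, which is exactly what a reentrant parser must avoid.
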
If no action is specified for a rule, the equivalent of a default
action is run, which returns the first parameter:

    { $_[1] }

=item C<In rule actions>

It is also possible to embed semantic actions inside a rule:

    typedef:    TYPE { $type = $_[1] } identlist { ... } ;

When the Parse::Yapp parser encounters such an embedded action, it modifies
the grammar as if you wrote (although @x-1 is not a legal lhs value):

    @x-1:   /* empty */ { $type = $_[1] };
    typedef:    TYPE @x-1 identlist { ... } ;

where I<x> is a sequential number incremented for each "in rule" action,
and I<-1> represents the "dot position" in the rule where the action arises.

In such actions, you can use I<$_[1]..$_[n]> variables, which are the
semantic values on the left of your action.

Be aware that the way Parse::Yapp modifies your grammar because of
I<in rule actions> can produce, in some cases, spurious conflicts
that wouldn't happen otherwise.

=item C<Generating the Parser Module>

Now that your grammar file is written, you can use yapp on it
to generate your parser module:

    yapp -v Calc.yp

will create two files: F<Calc.pm>, your parser module, and F<Calc.output>,
a verbose output of your parser rules, conflicts, warnings, states
and summary.

What you are missing now is a lexer routine.

=item C<The Lexer sub>

is called each time the parser needs to read the next token.

It is called with only one argument, the parser object itself,
so you can access its methods, especially the

    $_[0]->YYData

data area.

It is its duty to return the next token and value to the parser.
They C<must> be returned as a list of two values: the first one
is the token known by the parser (symbolic or literal), the second
one can be anything you want (usually the content of the token, or the
literal value), from a simple scalar value to any complex reference,
as the parsing driver never uses it except to pass it to semantic actions:

    ( 'NUMBER', $num )

or

    ( '>=', '>=' )

or

    ( 'ARRAY', [ @values ] )

When the lexer reaches the end of input, it must return the C<''>
empty token with an undef value:

    ( '', undef )

Note that your lexer should I<never> return C<'error'> as a token
value: for the driver, this is the error token used for error
recovery, and returning it would lead to odd reactions.

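A lexer following these rules might look like the sketch below. It is only
an illustration: C<SketchParser> is a hypothetical stand-in that models
C<YYData> alone, the token names C<NUM> and C<VAR> are invented, and
keeping the remaining input in C<YYData-E<gt>{INPUT}> is an assumption of
this example rather than a requirement of Parse::Yapp.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the generated parser: only YYData is modelled.
package SketchParser;
sub new    { my $class = shift; return bless { USER => {} }, $class }
sub YYData { return $_[0]->{USER} }

package main;

# Lexer for numbers, identifiers and single-character literals.
# The unread input is assumed to live in YYData->{INPUT}.
sub lexer {
    my ($parser) = @_;
    for ( $parser->YYData->{INPUT} ) {      # alias $_ to the input buffer
        s/^\s+//;                           # skip leading blanks
        return ( '', undef ) if $_ eq '';   # end of input
        return ( 'NUM', $1 ) if s/^(\d+(?:\.\d+)?)//;
        return ( 'VAR', $1 ) if s/^([A-Za-z][A-Za-z0-9_]*)//;
        return ( $1, $1 )    if s/^(.)//s;  # literal such as '+'
    }
}

my $parser = SketchParser->new;
$parser->YYData->{INPUT} = 'x = 3 + 42';

# Drain the lexer by hand, the way the driver would during a parse.
my @tokens;
while (1) {
    my ( $tok, $val ) = lexer($parser);
    last if $tok eq '';
    push @tokens, "$tok($val)";
}
print join( ' ', @tokens ), "\n";   # VAR(x) =(=) NUM(3) +(+) NUM(42)
```

Because the buffer is stored in the parser object, the same C<lexer> sub
can serve several parser instances at once.
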
Now that you have your lexer written, maybe you will need to output
meaningful error messages, instead of the default which is to print
'Parse error.' on STDERR.

So you will need an Error reporting sub.

=item C<Error reporting routine>

If you want one, write it knowing that the parser object is passed
to it as its parameter, so you can share information with the lexer
routine quite easily.

You can also use the C<$_[0]-E<gt>YYErrok> method in it, which will
resume parsing as if no error occurred. Of course, since the invalid
token is still invalid, you're supposed to fix the problem by
yourself.

The method C<$_[0]-E<gt>YYLexer> may help you, as it returns a reference
to the lexer routine, and can be called as

    ($tok,$val)=&{$_[0]->YYLexer}

to get the next token and semantic value from the input stream. To
make them current for the parser, use:

    ($_[0]->YYCurtok, $_[0]->YYCurval) = ($tok, $val)

and know what you're doing...

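An error reporting sub built on C<YYCurtok>, C<YYCurval> and C<YYExpect>
could be sketched as follows. C<SketchParser> is a hypothetical stand-in
whose accessors simply return canned values, and the message wording and
the helper name C<error_message> are assumptions of this example.

```perl
use strict;
use warnings;

# Hypothetical stand-in exposing the three accessors used below; a real
# generated parser gets them from the driver.
package SketchParser;
sub new {
    my ( $class, %state ) = @_;
    my $self = {%state};
    return bless $self, $class;
}
sub YYCurtok { return $_[0]->{TOKEN} }
sub YYCurval { return $_[0]->{VALUE} }
sub YYExpect { return @{ $_[0]->{EXPECT} } }

package main;

# Builds the message separately from printing it, so it is easy to check.
sub error_message {
    my ($parser) = @_;
    my $tok      = $parser->YYCurtok;
    my $val      = $parser->YYCurval;
    my $got      = $tok eq '' ? 'end of input' : "'$val' ($tok)";
    my @expected = $parser->YYExpect;
    my @shown    = map { $_ eq '' ? 'end of input' : "'$_'" } sort @expected;
    return "Syntax error: found $got, expected " . join( ' or ', @shown );
}

# Suitable as:  yyerror => \&error_sub
sub error_sub { print STDERR error_message( $_[0] ), "\n" }

my $parser = SketchParser->new( TOKEN => 'NUM', VALUE => 42, EXPECT => [ '+', '-' ] );
error_sub($parser);
```

The empty token is reported as "end of input" because that is what the
lexer returns when the input is exhausted.
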
=item C<Parsing>

Now you've got everything to do the parsing.

First, use the parser module:

    use Calc;

Then create the parser object:

    $parser=Calc->new;

Now, call the YYParse method, telling it where to find the lexer
and error report subs:

    $result=$parser->YYParse(yylex => \&Lexer,
                           yyerror => \&ErrorReport);

(assuming Lexer and ErrorReport subs have been written in your current
package)

The order in which parameters appear is unimportant.

Et voila.

The YYParse method will do the parse, then return the last semantic
value returned, or undef if error recovery cannot recover.

If you need to be sure the parse has been successful (in case your
last returned semantic value I<is> undef) make a call to:

    $parser->YYNberr()

which returns the total number of times the error reporting sub has been
called.

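One way to combine the two calls is a small wrapper that reports success
separately from the returned value. This is only a sketch: C<StubParser>
is a hypothetical class returning canned results in place of a generated
parser such as Calc, and C<parse_checked> is a helper name made up for the
example.

```perl
use strict;
use warnings;

# Hypothetical stub in place of a generated parser: YYParse returns a
# preset value and YYNberr a preset error count.
package StubParser;
sub new {
    my ( $class, %args ) = @_;
    my $self = {%args};
    return bless $self, $class;
}
sub YYParse { return $_[0]->{RESULT} }
sub YYNberr { return $_[0]->{NBERR} }

package main;

# Returns (ok, value): ok is true only if the error sub was never called.
sub parse_checked {
    my ( $parser, %args ) = @_;
    my $value = $parser->YYParse(%args);
    return ( $parser->YYNberr == 0, $value );
}

# A successful parse whose last semantic value happens to be undef...
my ( $ok1, $v1 ) = parse_checked( StubParser->new( RESULT => undef, NBERR => 0 ) );

# ...versus a parse during which two errors were reported.
my ( $ok2, $v2 ) = parse_checked( StubParser->new( RESULT => undef, NBERR => 2 ) );

print $ok1 ? "first parse ok\n"  : "first parse failed\n";
print $ok2 ? "second parse ok\n" : "second parse failed\n";
```

Treating any reported error as a failure is a design choice of this sketch;
a parse that recovered from an error still returns a value, and you may
prefer to accept it.
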
=item C<Error Recovery>

in Parse::Yapp is implemented the same way it is in yacc.

=item C<Debugging Parser>

To debug your parser, you can call the YYParse method with a debug parameter:

    $parser->YYParse( ... , yydebug => value, ... )

where value is a bitfield, each bit representing a specific debug output:

    Bit Value    Outputs
    0x01         Token reading (useful for Lexer debugging)
    0x02         States information
    0x04         Driver actions (shifts, reduces, accept...)
    0x08         Parse Stack dump
    0x10         Error Recovery tracing

To have a full debugging output, use

    yydebug => 0x1F

Debugging output is sent to STDERR, and be aware that it can produce
C<huge> outputs.

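Since the value is a bitfield, individual outputs can be combined with a
bitwise or. The constant names in this sketch are illustrative only and
are not defined by Parse::Yapp, which only sees the resulting number.

```perl
use strict;
use warnings;

# Illustrative names for the documented bit values.
use constant {
    DBG_TOKENS   => 0x01,    # token reading
    DBG_STATES   => 0x02,    # states information
    DBG_ACTIONS  => 0x04,    # driver actions
    DBG_STACK    => 0x08,    # parse stack dump
    DBG_RECOVERY => 0x10,    # error recovery tracing
};

# Trace only what the lexer returns and what the driver does with it.
my $lexer_and_actions = DBG_TOKENS | DBG_ACTIONS;

# All five bits together give the full-debugging value.
my $everything = DBG_TOKENS | DBG_STATES | DBG_ACTIONS | DBG_STACK | DBG_RECOVERY;

printf "lexer+actions: 0x%02X, everything: 0x%02X\n",
    $lexer_and_actions, $everything;
```

The chosen mask is then passed to the parser as
C<yydebug =E<gt> $lexer_and_actions>.
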
=item C<Standalone Parsers>

By default, the generated parser modules need the Parse::Yapp
module installed on the system to run. They use the Parse::Yapp::Driver
which can be safely shared between parsers in the same script.

If you'd prefer to have a standalone module generated, use
the C<-s> switch with yapp: this will automagically copy the driver
code into your module so you can use/distribute it without the need
of the Parse::Yapp module, making it really a C<Standalone Parser>.

If you do so, please remember to include Parse::Yapp's copyright notice
in your main module copyright, so others can know about the Parse::Yapp
module.

=item C<Source file line numbers>

by default will be included in the generated parser module, which will help
to find the guilty line in your source file in case of a syntax error.
You can disable this feature by compiling your grammar with yapp using
the C<-n> switch.

=back

=head1 BUGS AND SUGGESTIONS

If you find bugs, think of anything that could improve Parse::Yapp
or have any questions related to it, feel free to contact the author.

=head1 AUTHOR

Francois Desarmenien  <francois@fdesar.net>

=head1 SEE ALSO

yapp(1) perl(1) yacc(1) bison(1).

=head1 COPYRIGHT

The Parse::Yapp module and its related modules and shell scripts are copyright
(c) 1998-2001 Francois Desarmenien, France. All rights reserved.

You may use and distribute them under the terms of either
the GNU General Public License or the Artistic License,
as specified in the Perl README file.

If you use the "standalone parser" option so people don't need to install
Parse::Yapp on their systems in order to run your software, this copyright
notice should be included in your software copyright too, and the copyright
notice in the embedded driver should be left untouched.

=cut