While a parser-generator takes as input an input-specification to generate a parser, PGG takes as input a translation module to generate a parser-generator. The translation module plays a role to translate an abstracted parser into a concrete parser written in your language. Implementing the translation module may seem difficult, but don't worry. PGG provides you with most implementation of it and you only have to implement simple functions and customize some variables with provided information. See Implementing Translation module .
The input-specification format of PGG-generated parser-generator(PGG parser-generator) is based on that of the existing parser-generator Ocamlyacc. If you are already familiar to Ocamlyacc, you will easily adapt yourself to the input specification format of PGG parser-generator. Between Ocamlyacc and PGG parser-generator input-specifications, only three differences exist; 1) header and trailer sections are divided into two subsections, 2) types of non-terminal symbols should be explicitly declared, 3) a new keyword '@' is introduced as a method to access information about token position. See Input-specification of PGG parser-generator.
According to our experiments, performances of parsers generated from PGG parser-generators are generally poor than parsers generated from existing parsers. When we measured elapsed time to parse all implementation files(*.ml, 597files, 85384lines) of Ocaml sources distribution(version 3.10.0) by using two Ocaml parsers generated from PGG parser-generator and Ocamlyacc, PGG parser is 4~5 seconds slower than Ocamlyacc parser. However, think about effort to imlpement a parser-generator. You can choose: struggle against complex algorithms to implement your parser-generator for a week, or use PGG to implement your own parser-generator just in several hours.
(1) Customizing variables
char *output_suffix = ".ml"; char *token_suffix = "_token.ml";These two variables decide suffixes of two output files which are generated by PGG-generated parser; token-output file that contains the definition of token type and parser-output file that contains the main parser program. For example, if you defines them as above and PGG-generated parser takes 'a.y' as input-specification, it yields a.ml and a_token.ml as the parser-output file and the token-output file respectively.
char *token_tycon = "token"; char *state_tycon = "state"; char *location_tycon = "location"; char *stack_elem_tycon = "stack_elem"; char *stack_tycon = "stack"; char *next_action_tycon = "next_act"; char *result_tycon = "result";Since the lexical convention of type constructor names is different according to the target language, you have to define each type constructor name to be appropriate to the lexical convention of your language. For example, for Haskell, you should replace the first character of each name with a upper-case character due to its lexical convention. (eg. "Token", "State" ..).
char *define_stack_type = "stack_elem list"; char *define_location_type = "int";You may simply define the stack type and location type by using list and int as above. Otherwise, you may modify these. For example, if you want more detailed location including a start location and an end location, you may define the location type as int * int to represents start and end locations.
char *define_dummy_location = "-1"; char *define_stack_empty = "[]";define_dummy_location represents the dummy location which is necessary for PGG. You should assign this variable a meaningless value whose type is previously defined location type. If the location type is int * int, you should assign a value such as (-1,-1). define stack empty represents the empty stack implementation.
push : stack elem -> stack -> stack pop : stack -> stack elem * stackpush has two arguments whose names are; elem and st. pop has one arguments whose name is st. You may simply implement these two functions by using these arguments as follows:
char *define_stack_push = "elem::st"; char *define_stack_pop = " match st with | head::st' -> (head,st')";In the pop function, since stack is never empty before parsing is completed, you need not concern such case.
* Declaration
void prt_decl_TYPE(FILE * fp ,struct d_type *d); void prt_decl_TYPE_ALIAS(FILE * fp ,struct d_type_alias *d); void prt_decl_VAL(FILE *fp, struct d_val *d); void prt_decl_FUNC(FILE *fp, struct d_func *d);* Type
void prt_type_IDENT (FILE *fp, struct t_ident *t); void prt_type_TUPLE (FILE *fp, struct t_tuple *t);* Pattern
void prt_pattern_IDENT (FILE *fp, struct p_ident *p); void prt_pattern_TUPLE (FILE *fp, struct p_tuple *p); void prt_pattern_CONSTRUCTOR (FILE *fp, struct p_constructor *p); void prt_pattern_CONSTANT (FILE *fp, struct p_constant *p);* Expression
void prt_expression_IDENT (FILE *fp, struct e_ident *e); void prt_expression_TUPLE (FILE *fp, struct e_tuple *e); void prt_expression_CONSTRUCTOR (FILE *fp, struct e_constructor *e); void prt_expression_CONSTANT (FILE *fp, struct e_constant *e); void prt_expression_APPLY (FILE *fp, struct e_apply *e); void prt_expression_LET_VAL (FILE *fp, struct e_let_val *e); void prt_expression_LET_FUNC (FILE *fp, struct e_let_func *e); void prt_expression_MATCH (FILE *fp, struct e_match *e);
void prt_decl_TYPE(FILE * fp ,struct d_type *d)fp is the file pointer of outputs and d includes all information to describe type declaration syntax. PGG comments information about this function like this:
/* prt_decl_TYPE
*
* 1) references
* struct d_type { int num_ctlist; string tcon_id; CON_TYPE* ctlist; };
* struct con_type { string con_id; TYPEXPR t;};
*
* 2) output
* type <tcon_id> =
* | <ctlist[0]->con_id> { of <ctlist[0] -> t> }
* | <ctlist[1]->con_id> { of <ctlist[1] -> t> }
* | ..
*/
'1) references' shows member-variables of given data structures and
'2) output' shows output code written in Ocaml syntax. PGG also provides implementation for Ocaml as follow:
void prt_decl_TYPE(FILE* fp, struct d_type *d)
{
int i;
fprintf(fp,"type %s = ",d->tcon_id);
for(i=0; i<d->num_ctlist; i++)
{
fprintf(fp,"\n | ");
if (d->ctlist[i]->t)
{
fprintf(fp,"%s of ",d->ctlist[i]->con_id);
prt_type(fp,d->ctlist[i]->t);
}
else fprintf(fp,"%s",d->ctlist[i]->con_id);
}
}
From this, you can find how to access information to describe type declaration syntax.
datatype typename =
constructor {of typexpr}
| constructor {of typexpr}
| ...
You can easily modify void prt_decl_TYPE() to express above syntax like this:
void prt_decl_TYPE(FILE* fp, struct d_type *d)
{
int i;
fprintf(fp,"datatype %s = ",d->tcon_id);
for(i=0; inum_ctlist; i++)
{
fprintf(fp,"\n ");
if (i!=0)
fprintf(fp,"| ");
if (d->ctlist[i]->t)
{
fprintf(fp,"%s of ",d->ctlist[i]->con_id);
prt_type(fp,d->ctlist[i]->t);
}
else fprintf(fp,"%s",d->ctlist[i]->con_id);
}
}
| Ocamlyacc input-specification | PGG input-specification |
|
/* token header*/
%[
%]
/* parser header */
%{
open Test_token
let lexfun lb =
let token = Lexer.token lb
in (token,lb.Lexing.lex_curr_pos,lb)
%}
/* beginning of declarations section */
%token DOT
%token <string> CHAR
%token EOF
%start start_list
%type <string list> start_list
%type <string list> list
%type <string> element
%%
/* beginning of rules section */
start_list : list EOF {$1}
;
list : list DOT element { $3 :: $1 }
| element {[$1]}
;
element : CHAR { print string "CHAR location :";
print_int @1;
$1 }
;
%%
/* token trailer */
%%
/* parser trailer */
|
lexfun : lexbuf -> token * location * lexbuflexbuf is the type of input stream buffer. lexfun takes as input lexbuf and outputs a token(token), position of the token(location) and an updated input stream buffer(lexbuf). Lexer.token is an external lexical analyzer to produce tokens.
In the Ocamlyacc input-specification, tokens (terminal-symbols) and the semantic type of start symbol (start list) are declared. However, PGG input-specification needs to know additional information about non-terminals such as list and element. You should additionally declare the semantic value type of each non-terminal symbols. If no type declaration about non-terminal list is specified, PGG parser-generator outputs an error message:
No type has been declared for the non-terminal symbol "list"
type result_start_list = OK_start_list of string_list | Fail_start_list of locationPGG parser informs you of the result of parsing by providing OK_ with parsing result or Fail with the location of an error.
let program inputfile = let ic = open_in_bin inputfile in let lexbuf = Lexing.from_channel ic in let result = MyParser.start_list lexbuf in match result with | MyParser.OK_start_list r -> r | MyParser.Fail_start_list loc -> print_error loc
.
Programming Language Laboratory Department of Computer Science and Engineering Pohang University of Science and Technology Republic of Korea