mpc is a lightweight but powerful Parser Combinator library for C.
Using mpc might be of interest to you if you are...
- Building a new programming language
- Building a new data format
- Parsing an existing data format
- Embedding a Domain Specific Language
- (http://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule)[Adding an ad-hoc embedded Lisp to your C program]
- Type-Generic Parser Combinators
- Error Message Support
- Regular Expression Support
- Abstract Syntax Tree Support
- Easy to Integrate (One Source File in ANSI C)
The current main alternative C based parser combinator is a branch of (https://github.com/wbhart/Cesium3/tree/combinators)[Cesium3].
The main advantages of mpc over this are:
- Works for Generic Types
- Doesn't rely on Boehm-Demers-Weiser Garbage Collection
- Doesn't use
setjmp
andlongjmp
for errors - Doesn't pollute namespace
In this example I specify a grammar for a basic maths language and parse a string from it.
The output is an instance of the included mpc_ast_t
type.
#include "mpc.h"
mpc_ast_t* parse_maths(const char* input) {
mpc_parser_t* Expr = mpc_new("expression");
mpc_parser_t* Prod = mpc_new("product");
mpc_parser_t* Value = mpc_new("value");
mpc_parser_t* Maths = mpc_new("maths");
mpc_define(Expr, mpca_grammar(" <product> (('+' | '-') <product>)* ", Prod));
mpc_define(Prod, mpca_grammar(" <value> (('*' | '/') <value>)* ", Value));
mpc_define(Value, mpca_grammar(" /[0-9]+/ | '(' <expression> ')' ", Expr));
mpc_define(Maths, mpca_total(Expr));
mpc_result_t r;
if (!mpc_parse("parse_maths", input, Maths, &r)) {
mpc_err_print(r.error);
abort();
}
mpc_cleanup(4, Expr, Prod, Value, Maths);
return r.output;
}
Then the output for the parse of an expression (4 * 2 * 11 + 2) + 5
would look something like this
<root>
<value>
<char> '('
<expression>
<product>
<value> '4'
<char> '*'
<value> '2'
<char> '*'
<value> '11'
<char> '+'
<value> '2'
<char> ')'
<char> '+'
<value> '5'
Parser Combinators are structures that encode how to parse a particular language. They can be combined using a number of intuitive operators to create new parsers of ever increasing complexity. With these, complex grammars and languages can be processed easily.
The trick behind Parser Combinators is the observation that by structuring the library in a particular way one can make building parser combinators look like writing a grammar itself. Therefore instead of describing how to parse a language, a user must only specify the language itself, and the computer will work out how to parse it ... as if by magic!
The Parser Combinator type in mpc is mpc_parser_t
. This encodes a function that attempts to parse some string and, if successful, returns a pointer to some data. Otherwise it returns some error. A parser can be run using mpc_parse
.
bool mpc_parse(const char* f, const char* s, mpc_parser_t* p, mpc_result_t* r);
This function returns true
on success and false
on failure. It takes as input some parser p
, some input string s
, and some filename f
. It outputs into r
the result of the parse - which is either a pointer to some data object, or an error. The type mpc_result_t
is a union type defined as follows.
typedef union {
mpc_err_t* error;
mpc_val_t* output;
} mpc_result_t;
where mpc_val_t
is synonymous with void*
and simply represents some pointer to data - the exact type of which is dependant on the parser.
All the following functions return basic parsers. All of those parsers return strings with the character(s) matched. They have the following functionality.
mpc_parser_t* mpc_any(void);
- Matches any charactermpc_parser_t* mpc_char(char c);
- Matches a characterc
mpc_parser_t* mpc_range(char s, char e);
- Matches any character in the ranges
toe
(inclusive)mpc_parser_t* mpc_oneof(const char* s);
- Matches any character in provided stringmpc_parser_t* mpc_noneof(const char* s);
- Matches any character not in provided stringmpc_parser_t* mpc_satisfy(bool(*f)(char));
- Matches any character satisfying functionf
mpc_parser_t* mpc_string(const char* s);
- Matches strings
Several other functions exist that return basic parsers with special functionality.
mpc_parser_t* mpc_pass(void);
- Always successful, returnsNULL
mpc_parser_t* mpc_fail(void);
- Always failsmpc_parser_t* mpc_lift(mpc_lift_t f);
- Always successful, returns the result of functionf
mpc_parser_t* mpc_lift_val(mpc_val_t* x);
- Always successful, returnsx
Combinators are functions that take one or more parsers and return a new one. These combinators work independent of what types the input parsers return. In languages such as Haskell ensuring you don't ferry one type of data into a parser requiring a different type of data is done by the compiler. But in C we don't have that luxury, so it is at the discretion of the programmer to ensure that he or she deals correctly with the output types of different parsers.
A second annoyance in C is that of manual memory management. Some parsers might get half-way and then fail, meaning they need to clean up any partial data that has been collected in the parse. In Haskell this is handled by the Garbage Collector but in C these functions will need to take destructors - functions which clean up and partial data that has been collected.
Here are the main combinators and how to use then.
mpc_parser_t* mpc_expect(mpc_parser_t* a, const char* e);
Returns a parser that attempts a
. On success returns the result of a
. On failure reports that e
was expected.
This is useful for improving the readability of error messages. For example:
mpc_or(2, mpc_char('0'), mpc_char('1'))
might report error: expected '0' or '1' at 'x'
, while
mpc_expect(mpc_or(2, mpc_char('0'), mpc_char('1')), "binary digit")
will report error: expected binary digit at 'x'
which in some circumstances can drastically improve readability of error messages.
mpc_parser_t* mpc_apply(mpc_parser_t* a, mpc_apply_t f);
mpc_parser_t* mpc_apply_to(mpc_parser_t* a, mpc_apply_to_t f, void* x);
Applies function f
(optionality taking extra input x
) to the result of parser a
.
mpc_parser_t* mpc_not(mpc_parser_t* a, mpc_dtor_t da);
mpc_parser_t* mpc_not_else(mpc_parser_t* a, mpc_dtor_t da, mpc_lift_t lf);
Returns a parser with the following behaviour. If parser a
succeeds, the output parser fails. If parser a
fails, the output parser succeeds and returns NULL
(or the result of lift function lf
).
Destructor da
is used to destroy the result of a
.
mpc_parser_t* mpc_maybe(mpc_parser_t* a);
mpc_parser_t* mpc_maybe_else(mpc_parser_t* a, mpc_lift_t lf);
Attempts to parser a
. If this fails then succeeds and returns NULL
(or the result of lf
).
mpc_parser_t* mpc_many(mpc_parser_t* a, mpc_fold_t f);
mpc_parser_t* mpc_many_else(mpc_parser_t* a, mpc_fold_t f, mpc_lift_t lf);
Attempts to parse zero or more a
. If zero instances are found then succeeds and returns NULL
(or the result of lf
).
If more than zero instances are found, results of a
are combined using fold function f
. See the Function Types section for more details.
mpc_parser_t* mpc_many1(mpc_parser_t* a, mpc_fold_t f);
Attempts to parse one or more a
. Results are combined with fold function f
.
mpc_parser_t* mpc_count(mpc_parser_t* a, mpc_dtor_t da, mpc_fold_t f, int n);
mpc_parser_t* mpc_count_else(mpc_parser_t* a, mpc_dtor_t da, mpc_fold_t f, int n, mpc_lift_t lf);
Attempts to parse exactly n
of a
. If it fails the result output by the fold function f
is destructed with da
and either it returns NULL
or the result of lift function lf
.
Results of a
are combined using fold function f
.
mpc_parser_t* mpc_else(mpc_parser_t* a, mpc_parser_t* b);
Attempts to parse a
and if fails attempts to parse b
. If both fail, returns an error.
mpc_parser_t* mpc_also(mpc_parser_t* a, mpc_parser_t* b, mpc_dtor_t da, mpc_fold_t f);
mpc_parser_t* mpc_bind(mpc_parser_t* a, mpc_parser_t* b, mpc_dtor_t da, mpc_fold_t f);
Attempts to parse a
and then attempts to parse b
. If b
fails it destructs the result of a
using da
. If both succeed it returns the result of a
and b
combined using the fold function f
.
mpc_parser_t* mpc_or(int n, ...);
Attempts to parse n
parsers in sequence, returning the first one that succeeds. If all fail, returns an error.
For example: mpc_or(3, mpc_char('a'), mpc_char('b'), mpc_char('c'))
would attempt to match either an 'a'
or a 'b'
or a 'c'
.
mpc_parser_t* mpc_and(int n, mpc_afold_t f, ...);
Attempts to parse n
parsers in sequence, returning the fold of them using fold function f
. Parsers must be specified in series, followed by all the destructors for the types they return minus the last. These are used in case of partial success.
For example: mpc_and(3, mpcf_astrfold, mpc_char('a'), mpc_char('b'), mpc_char('c'), free, free),
would attempt to match 'a'
followed by 'b'
followed by 'c'
and if successful would concatenate them using mpcf_astrfold
.
The combinator functions take a number of special function types as function pointers. Here is a short explanation of those types are how they are expected to behave.
typedef void(*mpc_dtor_t)(mpc_val_t*);
Destructor function. Given some pointer to a data value it will ensure the memory it points to is freed correctly.
typedef mpc_val_t*(*mpc_apply_t)(mpc_val_t*);
typedef mpc_val_t*(*mpc_apply_to_t)(mpc_val_t*,void*);
Application function. This takes in some pointer to data and outputs some new or modified pointer to data, ensuring to free and old data no longer required. The apply_to
variation takes in an extra pointer to some data such as state of the system.
typedef mpc_val_t*(*mpc_fold_t)(mpc_val_t*,mpc_val_t*);
Fold function. This takes two pointers to data and must output some new combined pointer to data, ensuring to free and old data no longer required. When used with the many
, many1
and count
functions this initially takes in NULL
for it's first argument and following that takes in for it's first argument whatever was previously returned by the function itself. In this way users have a chance to build some initial data structure before populating it with whatever is passed as the second argument.
typedef mpc_val_t*(*mpc_afold_t)(int,mpc_val_t**);
AFold Function. Similar to the above but it is passed in a list of pointers to data values which must all be folded together and output as a new single data value.
typedef mpc_val_t*(*mpc_lift_t)(void);
Lift Function. This function returns some data value when called. It can be used to create empty versions of data types when certain combinators have no known default value to return.
Using the above we can already create a parser that matches a C identifier with relative ease.
First we build a fold function that will concatenate two strings together.
mpc_val_t* parse_fold_string(mpc_val_t* x, mpc_val_t* y) {
if (x == NULL) { return y; }
if (y == NULL) { return x; }
char* x = realloc(x, strlen(x) + strlen(y) + 1);
strcat(x, y);
free(y);
return x;
}
Then we can actually specify the grammar.
char* parse_ident(char* input) {
mpc_parser_t* alpha = mpc_oneof("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ");
mpc_parser_t* digit = mpc_oneof("0123456789");
mpc_parser_t* underscore = mpc_char('_');
mpc_parser_t* ident0 = mpc_else(alpha, underscore);
mpc_parser_t* ident1 = mpc_many(mpc_or(3, alpha, digit, underscore), parse_fold_string);
mpc_parser_t* ident = mpc_also(ident0, ident1, free, parse_fold_string);
mpc_result_t r;
if (!mpc_parse("parse_ident", input, ident, &r)) {
mpc_err_print(r.error);
abort();
}
mpc_delete(ident);
return r.output;
}
Building parsers in the above way can have issues with self reference and left handed recursion.
To overcome this we separate the construction of parsers into two different steps. Construction and Definition.
mpc_parser_t* mpc_new(const char* name);
This will construct a parser called name
which can then be used by others, including itself. Any parser created using mpc_new
is said to be retained. This means it will behave slightly differently to a normal parser. For example when deleting a parser that includes a retained parser, the retained parser it will not be deleted along with it. To delete a retained parser mpc_delete
must be used on it directly.
A retained parser can then be defined using...
mpc_parser_t* mpc_define(mpc_parser_t* p, mpc_parser_t* a);
This assigns the contents of parser a
to p
, and frees and memory used by a
. With this technique parsers can now reference each other, as well as themselves, without trouble.
mpc_parser_t* mpc_undefine(mpc_parser_t* p);
A final step is required. Parsers that reference each other must all be undefined before they are deleted. It is important to do any undefining before deletion. The reason for this is that to delete a parser it must look at each sub-parser that is used by it. If any of these have already been deleted a segfault is unavoidable.
void mpc_cleanup(int n, ...);
To ease the task of undefining and then deleting parsers mpc_cleanup
can be used. It takes n
parsers as input, and undefines them all, before deleting them all.
A number of common parsers are included.
-
mpc_parser_t* mpc_eoi(void);
- Matches only the end of input -
mpc_parser_t* mpc_soi(void);
- Matches only the start of input , -
mpc_parser_t* mpc_space(void);
- Matches some whitespace character (" \f\n\r\t\v") -
mpc_parser_t* mpc_spaces(void);
- Matches zero or more whitespace characters -
mpc_parser_t* mpc_whitespace(void);
- Matches zero or more whitespace characters and frees the result -
mpc_parser_t* mpc_newline(void);
- Matches'\n'
-
mpc_parser_t* mpc_tab(void);
- Matches'\t'
-
mpc_parser_t* mpc_escape(void);
- Matches a backslash followed by any character -
mpc_parser_t* mpc_digit(void);
- Matches any character in the range'0'
-'9'
-
mpc_parser_t* mpc_hexdigit(void);
- Matches any character in the range'0'
-'9'
as well as'A'
-'F'
and'a'
-'f'
-
mpc_parser_t* mpc_octdigit(void);
- Matches any character in the range'0'
-'7'
-
mpc_parser_t* mpc_digits(void);
- Matches one or more digit -
mpc_parser_t* mpc_hexdigits(void);
- Matches one or more hexdigit -
mpc_parser_t* mpc_octdigits(void);
- Matches one or more octdigit -
mpc_parser_t* mpc_lower(void);
- Matches and lower case character -
mpc_parser_t* mpc_upper(void);
- Matches any upper case character -
mpc_parser_t* mpc_alpha(void);
- Matches and alphabet character -
mpc_parser_t* mpc_underscore(void);
- Matches'_'
-
mpc_parser_t* mpc_alphanum(void);
- Matches any alphabet character, underscore or digit -
mpc_parser_t* mpc_int(void);
- Matches digits and converts to anint*
-
mpc_parser_t* mpc_hex(void);
- Matches hexdigits and converts to anint*
-
mpc_parser_t* mpc_oct(void);
- Matches octdigits and converts to anint*
-
mpc_parser_t* mpc_number(void);
- Matchesmpc_int
,mpc_hex
ormpc_oct
-
mpc_parser_t* mpc_real(void);
- Matches some floating point number as a string -
mpc_parser_t* mpc_float(void);
- Matches some floating point number and converts tofloat*
-
mpc_parser_t* mpc_char_lit(void);
- Matches some character literal surrounded by'
-
mpc_parser_t* mpc_string_lit(void);
- Matches some string literal surrounded by"
-
mpc_parser_t* mpc_regex_lit(void);
- Matches some regex literal surrounded by/
-
mpc_parser_t* mpc_ident(void);
- Matches a C identifier
-
mpc_parser_t* mpc_start(mpc_parser_t* a);
- Matches the start of input ana
-
mpc_parser_t* mpc_end(mpc_parser_t* a, mpc_dtor_t da);
- Matchesa
followed by the end of input -
mpc_parser_t* mpc_enclose(mpc_parser_t* a, mpc_dtor_t da);
- Matches the start of input,a
and then the end of input -
mpc_parser_t* mpc_strip(mpc_parser_t* a);
- Matchesa
striping any surrounding whitespace -
mpc_parser_t* mpc_tok(mpc_parser_t* a);
- Matchesa
and strips any trailing whitespace -
mpc_parser_t* mpc_sym(const char* s);
- Matches strings
and strips any trailing whitespace -
mpc_parser_t* mpc_total(mpc_parser_t* a, mpc_dtor_t da);
- Matches the whitespace strippeda
, enclosed in the start and end of input -
mpc_parser_t* mpc_between(mpc_parser_t* a, mpc_dtor_t ad, const char* o, const char* c);
- Matchesa
between stringso
andc
-
mpc_parser_t* mpc_parens(mpc_parser_t* a, mpc_dtor_t ad);
- Matchesa
between"("
and")"
-
mpc_parser_t* mpc_braces(mpc_parser_t* a, mpc_dtor_t ad);
- Matchesa
between"<"
and">"
-
mpc_parser_t* mpc_brackets(mpc_parser_t* a, mpc_dtor_t ad);
- Matchesa
between"{"
and"}"
-
mpc_parser_t* mpc_squares(mpc_parser_t* a, mpc_dtor_t ad);
- Matchesa
between"["
and"]"
-
mpc_parser_t* mpc_tok_between(mpc_parser_t* a, mpc_dtor_t ad, const char* o, const char* c);
- Matchesa
betweeno
andc
, whereo
andc
have their trailing whitespace striped. -
mpc_parser_t* mpc_tok_parens(mpc_parser_t* a, mpc_dtor_t ad);
- Matchesa
between trailing whitespace stripped"("
and")"
-
mpc_parser_t* mpc_tok_braces(mpc_parser_t* a, mpc_dtor_t ad);
- Matchesa
between trailing whitespace stripped"<"
and">"
-
mpc_parser_t* mpc_tok_brackets(mpc_parser_t* a, mpc_dtor_t ad);
- Matchesa
between trailing whitespace stripped"{"
and"}"
-
mpc_parser_t* mpc_tok_squares(mpc_parser_t* a, mpc_dtor_t ad);
- Matchesa
between trailing whitespace stripped"["
and"]"
A number of common fold functions a user might want are included. They reside under the mpcf_*
namespace.
-
void mpcf_dtor_null(mpc_val_t* x);
- Empty destructor. Does nothing -
mpc_val_t* mpcf_lift_null(void);
- ReturnsNULL
-
mpc_val_t* mpcf_lift_emptystr(void);
- Returns newly allocated empty string -
mpc_val_t* mpcf_free(mpc_val_t* x);
- Freesx
and returnsNULL
-
mpc_val_t* mpcf_int(mpc_val_t* x);
- Converts a decimal stringx
to anint*
-
mpc_val_t* mpcf_hex(mpc_val_t* x);
- Converts a hex stringx
to anint*
-
mpc_val_t* mpcf_oct(mpc_val_t* x);
- Converts a oct stringx
to anint*
-
mpc_val_t* mpcf_float(mpc_val_t* x);
- Converts a stringx
to afloat*
-
mpc_val_t* mpcf_escape(mpc_val_t* x);
- Converts a stringx
to an escaped version -
mpc_val_t* mpcf_unescape(mpc_val_t* x);
- Converts a stringx
to an unescaped version -
mpc_val_t* mpcf_fst(mpc_val_t* x, mpc_val_t* y);
- Returnsx
-
mpc_val_t* mpcf_snd(mpc_val_t* x, mpc_val_t* y);
- Returnsy
-
mpc_val_t* mpcf_fst_free(mpc_val_t* x, mpc_val_t* y);
- Returnsx
and freesy
-
mpc_val_t* mpcf_snd_free(mpc_val_t* x, mpc_val_t* y);
- Returnsy
and freesx
-
mpc_val_t* mpcf_freefold(mpc_val_t* t, mpc_val_t* x);
- ReturnsNULL
and freesx
-
mpc_val_t* mpcf_strfold(mpc_val_t* t, mpc_val_t* x);
- Concatenatest
andx
and returns result -
mpc_val_t* mpcf_afst(int n, mpc_val_t** xs);
- Returns first argument -
mpc_val_t* mpcf_asnd(int n, mpc_val_t** xs);
- Returns second argument -
mpc_val_t* mpcf_atrd(int n, mpc_val_t** xs);
- Returns third argument -
mpc_val_t* mpcf_astrfold(int n, mpc_val_t** xs);
- Concatenates and returns all input strings -
mpc_val_t* mpcf_between_free(int n, mpc_val_t** xs);
- Frees first and third argument and returns second -
mpc_val_t* mpcf_maths(int n, mpc_val_t** xs);
- Examines second argument as string to see which operator it is, then operators on first and third argument as if they areint*
.
Here is another example to show of the stuff learnt so far.
Passing around all these function pointers might seem clumsy, but having parsers be type-generic is important as it lets users define their own syntax tree types as well as perform specific house-keeping or processing in the parsing phase. For example we can specify a simple maths grammar that computes the result of the expression as it goes.
We start with a fold function that will fold int*
types based on some char*
operator.
mpc_val_t* mpcf_maths(int n, mpc_val_t** xs) {
int** vs = (int**)xs;
if (strcmp(xs[1], "*") == 0) { *vs[0] *= *vs[2]; }
if (strcmp(xs[1], "/") == 0) { *vs[0] /= *vs[2]; }
if (strcmp(xs[1], "%") == 0) { *vs[0] %= *vs[2]; }
if (strcmp(xs[1], "+") == 0) { *vs[0] += *vs[2]; }
if (strcmp(xs[1], "-") == 0) { *vs[0] -= *vs[2]; }
free(xs[1]); free(xs[2]);
return xs[0];
}
And then we use this to specify how the grammar folds.
int parse_maths(char* input) {
mpc_parser_t* Expr = mpc_new("expr");
mpc_parser_t* Factor = mpc_new("factor");
mpc_parser_t* Term = mpc_new("term");
mpc_parser_t* Maths = mpc_new("maths");
mpc_define(Expr, mpc_else(
mpc_and(3, mpcf_maths, Factor, mpc_oneof("*/"), Factor, free, free),
Factor
));
mpc_define(Factor, mpc_else(
mpc_and(3, mpcf_maths, Term, mpc_oneof("+-"), Term, free, free),
Term
));
mpc_define(Term, mpc_else(mpc_int(), mpc_parens(Expr, free)));
mpc_define(Maths, mpc_enclose(Expr, free));
mpc_result_t r;
if (!mpc_parse("parse_maths", input, Maths, &r)) {
mpc_err_print(r.error);
abort();
}
int result = *r.output;
free(r.output);
return result;
}
Supplied with something like (4*2)+5
this will output 13
.
Even with all that has been explained above, specifying parts of text can be a tedious task requiring many lines of code. So mpc provides a simple regular expression matcher.
mpc_parser_t* mpc_re(const char* re);
This returns a parser that will attempt to match the given regular expression pattern, and return the matched string on success. It does not have support for groups and match objects, but should be sufficient for simple tasks.
A cute thing about this is that it uses previous parts of the library to parse the user input string - and because mpc is type generic, the parser spits out a mpc_parser_t
directly! It even uses many of the combinator functions as fold functions! This is a great case study in learning how to use mpc, so those curious are encouraged to find it in the source code.
For those that really do not care what data they get out a basic abstract syntax tree type mpc_ast_t
has been included. Along with this are included some combinator functions which work specifically on this type. They reside under mpca_*
and you will notice they do not require fold functions or destructors to be specified.
Doing things via this method means that all the data processing must take place after the parsing - but to many this will be preferable. It also allows for one more trick...
If all the fold and destructor functions are implicit then the user can simply specify the grammar in some nice way and the system can try to build an AST for them from this alone.
mpc_parser_t* mpca_grammar(const char* grammar, ...);
This can be used to do exactly that. It takes in some grammar, as well as a list of named parsers - and outputs a parser that does exactly what is specified.