
extracting probabilities? #5

Open
regehr opened this issue Nov 21, 2019 · 3 comments

Comments

@regehr

regehr commented Nov 21, 2019

Hello! I'm sorry to abuse github issues in this fashion, but I'm not sure how else to ask for a bit of advice.

I have a simple random program generator that looks like a decision tree. It is not context free, but it is not all that far from being context free. When making a uniform choice among alternatives at every decision point, it produces extremely skewed output. I want to be able to (approximately) uniformly sample the leaves of my decision tree.

I am happy to extract a CFG from my generator (by hand) so that I can run it through gramtropy, but what I want is not for gramtropy to sample from the resulting grammar, but rather to get a list of branch probabilities that I can paste back into my program so that it (approximately) uniformly samples from its leaves.

Is this a feasible thing to ask your tool for? Any help appreciated, thanks.

@regehr

regehr commented Nov 21, 2019

If you wish a bit of additional background, this is the generator I'm working on:
https://github.com/regehr/opt-fuzz
and here's a related blog post:
https://blog.regehr.org/archives/1700

@sipa

sipa commented Nov 21, 2019

Interesting use case!

Depending on how closely you want the output to match the input grammar, that may be easy or hard to do.

The output of gramc is a file that contains a representation of a DAG, where each leaf is a list (disjunction) of strings, and each internal node is either a concatenation or disjunction of 2 or more nodes. Each node also has a count associated with it: the total number of combinations that node (including all its children) can expand to. This representation avoids cycles by specializing each symbol in the input grammar for each length (in number of string characters), so while SYM = "" | (("a" | "b") SYM) is cyclic and has an infinite number of expansions, the specialization of SYM for 7 characters is acyclic, and has a finite number of expansions.
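The count bookkeeping described above follows two simple rules: a disjunction's count is the sum of its children's counts, and a concatenation's count is their product. A minimal sketch of that arithmetic, with node classes that are illustrative rather than gramtropy's actual representation:

```python
from dataclasses import dataclass
from math import prod

# Illustrative node types, not gramtropy's real data structures.
@dataclass
class Leaf:
    """A disjunction of literal strings."""
    strings: list
    def count(self):
        return len(self.strings)

@dataclass
class Disjunction:
    """A choice between 2 or more nodes: counts add."""
    children: list
    def count(self):
        return sum(child.count() for child in self.children)

@dataclass
class Concatenation:
    """A sequence of 2 or more nodes: counts multiply."""
    children: list
    def count(self):
        return prod(child.count() for child in self.children)

def branch_probabilities(disj):
    """For a disjunction, the branch probabilities that make every
    expansion of the node equally likely."""
    total = disj.count()
    return [child.count() / total for child in disj.children]
```

For example, a disjunction between a 2-string leaf and a concatenation expanding to 1 × 2 strings has count 4 and branch probabilities [0.5, 0.5]. Because the length specialization described above makes the DAG acyclic, these counts are finite and can be computed bottom-up in one pass.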

Furthermore, a number of optimizations are applied, where (specialized) symbols with only a small number of possible expansions are just pre-expanded into all possible strings.

Outputting the DAG in a human-readable format with branch probabilities for all disjunctions wouldn't be very hard, but mapping it back to the original CFG would be a lot harder.

@regehr

regehr commented Nov 29, 2019

Interesting, thanks!
Hmm, I definitely want to map back to the original grammar.
Do you mind saying where the method you implemented originated? For example I've been reading this one and some of its predecessors and successors: https://ir.canterbury.ac.nz/handle/10092/11231
