Implement optional random feature sampling for rs extract #7

Open
3 tasks
daniel-j-h opened this issue Jun 10, 2018 · 0 comments

Comments

@daniel-j-h
Collaborator

For features like buildings we want to sample OpenStreetMap features when extracting geometries in rs extract.

The osmium handlers in robosat.osm should take a sampler and then, in every OpenStreetMap entity callback, ask the sampler whether to handle this entity or not.

For the sampler we have a few options:

  1. let the user pass a number n of samples (e.g. 20k); we take the first n and after that just drop features. Problem: we don't randomly sample from all geographical areas; not a good idea.
  2. let the user pass a fraction f of samples (e.g. 0.1); in the osm callbacks we take a random number r in [0, 1] and keep the sample if r < f. Problem: users want a fixed number of samples (e.g. 20k) but the number a fraction yields will change depending on how many features there are in osm. For example with parking lots a fraction of 0.1 is maybe a few thousand, with buildings it's millions.
  3. do two passes over the data; in the first pass count how many features there are in osm, then come up with a fraction to keep; then in the second pass we use approach 2. Problem: needs two passes over the data, and two separate handlers for one feature.
  4. use an online algorithm for random sampling: reservoir sampling. It's an algorithm for randomly sampling k items out of a stream of unknown size. This is a good read.
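To make the difference between options 2 and 4 concrete, here is a minimal sketch in plain Python. The function names `fraction_sample` and `reservoir_sample` are made up for illustration and are not part of robosat; the reservoir variant follows the classic "Algorithm R" formulation, which may or may not be what we end up using.

```python
import random


def fraction_sample(stream, f, rng):
    """Option 2: keep each item independently with probability f.

    The result size depends on how long the stream is.
    """
    return [item for item in stream if rng.random() < f]


def reservoir_sample(stream, k, rng):
    """Option 4: uniform random sample of at most k items in one pass."""
    reservoir = []

    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i + 1 replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item

    return reservoir


rng = random.Random(0)

# Same requested size for very different stream lengths.
few = reservoir_sample(range(1_000), 100, rng)
many = reservoir_sample(range(100_000), 100, rng)

# Same fraction, very different result sizes.
few_by_fraction = fraction_sample(range(1_000), 0.1, rng)
many_by_fraction = fraction_sample(range(100_000), 0.1, rng)
```

With the reservoir the caller always gets at most k items regardless of how many features the stream contains, which is the property option 2 lacks.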

Tasks:

  • Implement a ReservoirSampler class; it takes a size n, the max. number of items to randomly sample from a stream of unknown size.
  • Let our osmium handlers take a ReservoirSampler; in the osm entity callbacks they push features into the reservoir, and in the save function they save features from the reservoir. The reservoir is responsible for keeping or discarding features, i.e. for doing the sampling.
  • Add an optional argument to the rs extract tool for users to set the sample size; pass this argument to the sampler.
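A rough sketch of how the first two tasks could fit together. Everything beyond the `ReservoirSampler` name is an assumption: the `push` method, the `reservoir` attribute, the `seed` parameter, and the `BuildingHandlerSketch` class are hypothetical, and the handler is a plain-Python stand-in that takes dicts instead of real osmium entities so the example runs without pyosmium.

```python
import random


class ReservoirSampler:
    """Keeps a uniform random sample of at most `capacity` pushed items."""

    def __init__(self, capacity, seed=None):
        if capacity < 0:
            raise ValueError("capacity must be non-negative")

        self.capacity = capacity
        self.seen = 0  # number of items pushed so far
        self.reservoir = []
        self.rng = random.Random(seed)

    def push(self, feature):
        self.seen += 1

        if len(self.reservoir) < self.capacity:
            self.reservoir.append(feature)
            return

        # Keep the new feature with probability capacity / seen,
        # overwriting a uniformly chosen slot.
        j = self.rng.randrange(self.seen)
        if j < self.capacity:
            self.reservoir[j] = feature


class BuildingHandlerSketch:
    """Hypothetical stand-in for an osmium handler in robosat.osm."""

    def __init__(self, sampler):
        self.sampler = sampler

    def way(self, w):
        # Entity callback: only push matching features; the sampler
        # decides whether each one is kept or discarded.
        if w.get("building"):
            self.sampler.push(w)

    def save(self):
        # A real handler would write these features out; here we return them.
        return list(self.sampler.reservoir)


handler = BuildingHandlerSketch(ReservoirSampler(capacity=50, seed=0))

for i in range(10_000):
    handler.way({"id": i, "building": i % 2 == 0})

sampled = handler.save()
```

The handler never needs to know the total feature count up front, so one pass and one handler per feature type are enough.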

Note: now that we have the rs dedupe tool deduplicating detections against OpenStreetMap, we need to think about how to design the interface here. The dedupe tool currently reads in the OpenStreetMap features created by the extract tool. If we randomly sample features in extract, we can no longer use its output for deduplication.
