This repository was archived by the owner on Aug 26, 2022. It is now read-only.

Collect samples based on distinct input values #37

Open
cgendreau opened this issue Sep 11, 2017 · 1 comment

Comments

@cgendreau
Contributor

The current collector only considers the type of evaluation and the size of the sample.
We often get samples like:

lineNumber dwc:scientificName
82 Merismodes anomalus (Pers.) Singer
471 Merismodes anomalus (Pers.) Singer
1402 Merismodes anomalus (Pers.) Singer
1969 Merismodes anomalus (Pers.) Singer
2791 Merismodes anomalus (Pers.) Singer

It would be preferable to consider the input data and only accumulate a record when its input differs from those already collected (up to the predefined size of the sample).
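
To make the idea concrete, here is a minimal sketch in Python (not the project's code). The class name `DistinctSampleCollector` and its `accept` method are hypothetical, and it assumes each record is reduced to a line number plus a single hashable input value:

```python
from typing import List, Set, Tuple

Record = Tuple[int, str]  # (line number, input value); hypothetical shape


class DistinctSampleCollector:
    """Hypothetical collector that keeps at most one record per distinct input."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self.samples: List[Record] = []
        self._seen: Set[str] = set()

    def accept(self, line_number: int, value: str) -> bool:
        # Skip the record when the sample is full or the input was already sampled.
        if len(self.samples) >= self.max_size or value in self._seen:
            return False
        self._seen.add(value)
        self.samples.append((line_number, value))
        return True


collector = DistinctSampleCollector(max_size=5)
for line_number in (82, 471, 1402, 1969, 2791):
    collector.accept(line_number, "Merismodes anomalus (Pers.) Singer")
```

With the five rows above, only line 82 would be kept, which leaves room in the sample for other input values.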

cgendreau pushed a commit that referenced this issue Sep 11, 2017
@cgendreau
Contributor Author

There is no perfect solution: with this change some samples are better, but not all of them.
Maybe we should implement a hybrid solution: accumulate up to the predefined size of the sample, and from there only replace an element if the new one is linked to a new set of inputs (until we reach a completely distinct sample). The other problem is that collectors operate in different threads, so each of their samples is re-sampled once we aggregate all results together. We may want to increase the size of the sample in each collector to lower the chance of collision.
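
A rough sketch of the hybrid approach in Python, under the same assumptions as before (a record is a line number plus one hashable input value). `HybridSampleCollector` and `merge_samples` are hypothetical names, and the merge step simply replays each per-thread sample through a fresh collector, which is only one possible way to re-sample:

```python
from collections import Counter
from typing import Iterable, List, Tuple

Record = Tuple[int, str]  # (line number, input value); hypothetical shape


class HybridSampleCollector:
    """Hypothetical collector: fill first, then swap duplicates for new inputs."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self.samples: List[Record] = []
        self._counts: Counter = Counter()  # how often each input is in the sample

    def accept(self, line_number: int, value: str) -> bool:
        # Phase 1: accumulate anything until the sample is full.
        if len(self.samples) < self.max_size:
            self.samples.append((line_number, value))
            self._counts[value] += 1
            return True
        # Phase 2: only a previously unseen input may enter the sample.
        if value in self._counts:
            return False
        # Replace the first element whose input is duplicated in the sample.
        for i, (_, existing) in enumerate(self.samples):
            if self._counts[existing] > 1:
                self._counts[existing] -= 1
                self.samples[i] = (line_number, value)
                self._counts[value] = 1
                return True
        return False  # the sample is already fully distinct


def merge_samples(collectors: Iterable[HybridSampleCollector], max_size: int) -> List[Record]:
    """Re-sample the per-thread samples into one sample of at most max_size."""
    merged = HybridSampleCollector(max_size)
    for collector in collectors:
        for line_number, value in collector.samples:
            merged.accept(line_number, value)
    return merged.samples
```

Giving each per-thread collector a larger `max_size` than the final one means `merge_samples` sees more candidate inputs, which is the reason for increasing the per-collector sample size.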
