Collect samples based on distinct input values #37

cgendreau · 2017-09-11T09:37:59Z

The current collector only considers the type of evaluation and the size of the sample.
We often get samples like:

lineNumber	dwc:scientificName
82	Merismodes anomalus (Pers.) Singer
471	Merismodes anomalus (Pers.) Singer
1402	Merismodes anomalus (Pers.) Singer
1969	Merismodes anomalus (Pers.) Singer
2791	Merismodes anomalus (Pers.) Singer

It would be preferable to consider the input data and only accumulate a result when its input differs from those already collected (up to the predefined size of the sample).
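A minimal sketch of that idea, written in Python for brevity. The class name `DistinctInputCollector` and its methods are hypothetical and are not taken from the project's code; it only illustrates "accumulate when the input is different, up to the sample size".

```python
from typing import Dict, Hashable, List, Tuple


class DistinctInputCollector:
    """Hypothetical collector that keeps at most one sample per distinct input."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        # Maps each distinct input value to the first line number where it was seen.
        self._samples: Dict[Hashable, int] = {}

    def accept(self, line_number: int, input_value: Hashable) -> None:
        # Accumulate only while there is room and the input has not been seen yet.
        if len(self._samples) < self.max_size and input_value not in self._samples:
            self._samples[input_value] = line_number

    def sample(self) -> List[Tuple[int, Hashable]]:
        # Returned in order of first appearance (dicts preserve insertion order).
        return [(line, value) for value, line in self._samples.items()]
```

With the rows shown above, only line 82 would be kept for `Merismodes anomalus (Pers.) Singer`, leaving the remaining slots free for other input values.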


cgendreau · 2017-09-11T20:12:37Z

There is no perfect solution: doing that improves some samples but not all of them.
Maybe we should implement a hybrid solution: accumulate up to the predefined size of the sample and, from there, only replace an element if the new result is linked to a new set of inputs (until we reach a completely distinct sample). The other problem is that collectors operate in different threads, so each of their samples is re-sampled once we aggregate all results together. We may want to increase the size of the sample in each collector to lower the chance of collision.

cgendreau pushed a commit that referenced this issue Sep 11, 2017

testing collector with distinct input values

048f783

Issue #37
