Outlier Detection Performance #2278

Open
phorne-uncharted opened this issue Feb 16, 2021 · 1 comment

@phorne-uncharted
Contributor

For some datasets (or possibly specific fields), outlier detection takes an extremely long time to run. Attempting to run it on the ACLED dataset resulted in it running for over 12 hours without completing. When the process was killed, it was stuck in the sklearn imputer primitive "d3m.primitives.data_cleaning.imputer.SKlearn".

@phorne-uncharted
Contributor Author

The root cause of this issue is an explosion in features due to text encoding. The client currently uses the first variable as the target variable rather than the selected target variable. For ACLED, that happens to be the data id (unique for each row), so the text encoder ends up creating nearly 16k features.
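For illustration, here is a minimal sketch of the failure mode. It is not the actual d3m text encoder, just a stand-in using scikit-learn's OneHotEncoder, and the ~16k row count is an assumption inferred from the feature count mentioned above:

```python
# Illustration only: a unique-per-row ID column produces one encoded feature
# per row, so everything downstream (e.g. the imputer) operates on ~16k columns.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

n_rows = 16000  # assumed: roughly matches the "nearly 16k features" above
df = pd.DataFrame({"data_id": [f"event-{i}" for i in range(n_rows)]})

encoder = OneHotEncoder()  # stand-in for the pipeline's text encoder
features = encoder.fit_transform(df[["data_id"]])
print(features.shape)  # (16000, 16000) -- one feature per distinct ID
```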

At the very least, the client needs to be updated to use the correct target. Ideally, the server would also have some boundary or sanity checking to make sure outlier detection only runs in cases where it makes sense, or would otherwise limit the feature explosion that can occur.
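A rough sketch of what such a server-side guard could look like (hypothetical function names and thresholds, not existing server code): skip text encoding for near-unique, ID-like columns and refuse to run outlier detection when the feature count is too large.

```python
# Hypothetical guard, not existing server code: drop ID-like text columns
# before encoding and cap the feature count allowed into outlier detection.
import pandas as pd

MAX_OUTLIER_FEATURES = 1000  # assumed cap; tune to what the pipeline can handle
MAX_UNIQUE_RATIO = 0.5       # columns more unique than this look like row IDs

def columns_safe_to_encode(df: pd.DataFrame) -> list:
    """Return text columns whose cardinality will not explode the feature space."""
    safe = []
    for col in df.select_dtypes(include="object").columns:
        if df[col].nunique() / len(df) <= MAX_UNIQUE_RATIO:
            safe.append(col)
    return safe

def should_run_outlier_detection(n_features: int) -> bool:
    """Sanity check before invoking the outlier detection pipeline."""
    return n_features <= MAX_OUTLIER_FEATURES
```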
