Outlier Detection Performance #2278

Open
phorne-uncharted opened this issue Feb 16, 2021 · 1 comment

@phorne-uncharted
Contributor

For some datasets (or possibly specific fields), outlier detection takes an extremely long time to run. Attempting to run it on the ACLED dataset resulted in it running for over 12 hours without completing. When the process was killed, it was stuck in the sklearn imputer primitive "d3m.primitives.data_cleaning.imputer.SKlearn".

@phorne-uncharted
Contributor Author

The root cause of this issue is an explosion in features due to text encoding. The client currently uses the first variable as the target variable rather than the selected target variable. For ACLED, that happens to be the data id (unique for each row), so the text encoder ends up creating nearly 16k features.
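For illustration, here is a minimal sketch of the failure mode. It is not the actual d3m text encoder, just a stand-in using scikit-learn's OneHotEncoder, and the ~16k row count is an assumption inferred from the feature count mentioned above:

```python
# Illustration only: a unique-per-row ID column produces one encoded feature
# per row, so everything downstream (e.g. the imputer) operates on ~16k columns.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

n_rows = 16000  # assumed: roughly matches the "nearly 16k features" above
df = pd.DataFrame({"data_id": [f"event-{i}" for i in range(n_rows)]})

encoder = OneHotEncoder()  # stand-in for the pipeline's text encoder
features = encoder.fit_transform(df[["data_id"]])
print(features.shape)  # (16000, 16000) -- one feature per distinct ID
```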

At the very least, the client needs to be updated to use the correct target. Ideally, the server would also have some boundary or sanity checking to make sure outlier detection only runs in cases where it makes sense, or would otherwise limit the feature explosion that can occur.
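A rough sketch of what such a server-side guard could look like (hypothetical function names and thresholds, not existing server code): skip text encoding for near-unique, ID-like columns and refuse to run outlier detection when the feature count is too large.

```python
# Hypothetical guard, not existing server code: drop ID-like text columns
# before encoding and cap the feature count allowed into outlier detection.
import pandas as pd

MAX_OUTLIER_FEATURES = 1000  # assumed cap; tune to what the pipeline can handle
MAX_UNIQUE_RATIO = 0.5       # columns more unique than this look like row IDs

def columns_safe_to_encode(df: pd.DataFrame) -> list:
    """Return text columns whose cardinality will not explode the feature space."""
    safe = []
    for col in df.select_dtypes(include="object").columns:
        if df[col].nunique() / len(df) <= MAX_UNIQUE_RATIO:
            safe.append(col)
    return safe

def should_run_outlier_detection(n_features: int) -> bool:
    """Sanity check before invoking the outlier detection pipeline."""
    return n_features <= MAX_OUTLIER_FEATURES
```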
