-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adapt to scikit-learn 1.6 estimator tag changes #11021
Adapt to scikit-learn 1.6 estimator tag changes #11021
Conversation
@@ -1481,7 +1537,7 @@ def _cls_predict_proba(n_classes: int, prediction: PredtT, vstack: Callable) -> | |||
Number of boosting rounds. | |||
""", | |||
) | |||
class XGBClassifier(XGBModel, XGBClassifierBase): | |||
class XGBClassifier(XGBClassifierBase, XGBModel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of scikit-learn/scikit-learn#30234 (which will be in scikit-learn
1.6), the estimator checks raise an error like the following:
XGBRegressor is inheriting from mixins in the wrong order. In general, in mixin inheritance, more specialized mixins must come before more general ones. This means, for instance,
BaseEstimator
should be on the right side of most other mixins. You need to change the order...
That check is new, but it enforced behavior that's been documented in scikit-learn
's estimator development docs for a long time. See the "BaseEstimator
and mixins" section in https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator.
It is particularly important to notice that mixins should be “on the left” while the BaseEstimator should be “on the right” in the inheritance list for proper MRO.
That new check error led to these inheritance-order changes, which led to the XGBModel.get_params()
changes.
demo/**/*.txt | ||
*.dmatrix | ||
.hypothesis | ||
__MACOSX/ | ||
model*.json | ||
/tests/python/models/models/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noticed some model files left behind from running all the Python tests locally while developing this. These .gitignore
rules prevent checking them into source control.
python-package/xgboost/compat.py
Outdated
class LabelEncoder: # type: ignore[no-redef] | ||
"""Dummy class for sklearn.preprocessing.LabelEncoder.""" | ||
|
||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With all of these placeholder classes set to object
, using the re-arranged inheritance order (https://github.com/dmlc/xgboost/pull/11021/files#r1857885039) results in errors like this when importing xgboost
when scikit-learn
is not installed:
TypeError: Cannot create a consistent method resolution order (MRO) for bases object, XGBModel
Having each one be a different class resolves that. I forgot until I saw this test failure that we faced a similar thing a few years ago in LightGBM: microsoft/LightGBM#3192
Just copying @StrikerRUS 's solution from there here... it's worked well for lightgbm
.
@@ -63,6 +63,8 @@ disable = [ | |||
"import-error", | |||
"attribute-defined-outside-init", | |||
"import-outside-toplevel", | |||
"too-few-public-methods", | |||
"too-many-ancestors", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switching the placeholder classes in the scikit-learn-is-not-available branch of compat.py
to be individual, differently-named, empty classes (instead of all using object
), led to these several of these pylint
errors, in sklearn.py
and dask.py
:
R0901: Too many ancestors (9/7) (too-many-ancestors)
R0903: Too few public methods (0/2) (too-few-public-methods)
It seems that there are already many other places in this codebase where those warnings are being suppressed with # pylint: disable
comments. So instead of adding more such comments (some of which would have to share a line with # type: ignore
comments for mypy
), I'm proposing:
- just globally ignore these
pylint
warnings for the whole project - remove any existing
# pylint: disable
comments about them
I don't feel that strongly... if you'd prefer to keep suppressing individual cases of these, please let me know and I'll happily switch back to #pylint: disable
comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
@RAMitchell find the pylint checks helpful. I myself prefer mypy checks and think the pylint is not particularly suitable for ML libraries like XGBoost. In general, I don't have a strong opinion about these "structural" or naming warnings and care mostly about warnings like unused imports or use before definition.
Seeing a lot of CI passing, so I think this is ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR! Overall looks good to me. Some tests for the get_params
would be appreciated.
@@ -55,20 +56,43 @@ def lazy_isinstance(instance: Any, module: str, name: str) -> bool: | |||
from sklearn.cross_validation import KFold as XGBKFold | |||
from sklearn.cross_validation import StratifiedKFold as XGBStratifiedKFold | |||
|
|||
# sklearn.utils Tags types can be imported unconditionally once |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do that once the next sklearn is published.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should.
That'd effectively raise xgboost
's requirement all the way to scikit-learn>=1.6
.
Because it would result in compat.SKLEARN_INSTALLED
being False
for scikit-learn < 1.6
:
xgboost/python-package/xgboost/compat.py
Lines 60 to 61 in 5826b02
except ImportError: | |
SKLEARN_INSTALLED = False |
Which would make all the estimators unusable on those versions.
xgboost/python-package/xgboost/sklearn.py
Lines 754 to 757 in 5826b02
if not SKLEARN_INSTALLED: | |
raise ImportError( | |
"sklearn needs to be installed in order to use this module" | |
) |
python-package/xgboost/compat.py
Outdated
class XGBRegressorBase: # type: ignore[no-redef] | ||
"""Dummy class for sklearn.base.RegressorMixin.""" | ||
|
||
class LabelEncoder: # type: ignore[no-redef] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove the label encoder for now. It's not used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh great! I just removed that in a511848
Noticed that KFold
was also unused, so I removed that as well.
@@ -63,6 +63,8 @@ disable = [ | |||
"import-error", | |||
"attribute-defined-outside-init", | |||
"import-outside-toplevel", | |||
"too-few-public-methods", | |||
"too-many-ancestors", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
@RAMitchell find the pylint checks helpful. I myself prefer mypy checks and think the pylint is not particularly suitable for ML libraries like XGBoost. In general, I don't have a strong opinion about these "structural" or naming warnings and care mostly about warnings like unused imports or use before definition.
@@ -1526,6 +1527,58 @@ def test_tags() -> None: | |||
assert "multioutput" not in tags | |||
|
|||
|
|||
# the try-excepts in this test should be removed once xgboost's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use pytest.mark.skipif
to skip tests. Seems simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that it's useful to check that the exact, expected AttributeError
is raised...there are so many layers of inheritance involved here, it wouldn't be hard to implement this in a way that accidentally raises some totally unrelated error accessing a non-existent property or something.
If you read that and still think skipif()
would be preferable, I'll happy change to that and remove the try-except
, just wanted to explain my thinking.
python-package/xgboost/sklearn.py
Outdated
@@ -1497,6 +1553,15 @@ def _more_tags(self) -> Dict[str, bool]: | |||
tags["multilabel"] = True | |||
return tags | |||
|
|||
def __sklearn_tags__(self) -> _sklearn_Tags: | |||
tags = XGBModel.__sklearn_tags__(self) | |||
tags.estimator_type = "classifier" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this if we inherit the classifier mixin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah you are totally right, I don't think we do:
Removed this, the corresponding LGBRegressor
code, and the imports of ClassifierTags
/ RegressorTags
in a511848
# 2. Return whatever in `**kwargs`. | ||
# 3. Merge them. | ||
# | ||
# This needs to accommodate being called recursively in the following |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please help add a test for this? The hierarchy and the Python introspection are getting a bit confusing now. ;-(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! I just added one in a511848, let me know if there are other conditions you'd like to see tested.
Between that and the existing test:
xgboost/tests/python/test_with_sklearn.py
Line 758 in 5826b02
def test_parameters_access(): |
I think this behavior should be well-covered.
python-package/xgboost/sklearn.py
Outdated
params = super().get_params(deep) | ||
cp = copy.copy(self) | ||
cp.__class__ = cp.__class__.__bases__[0] | ||
# if the immediate parent is a mixin, skip it (mixins don't define get_params()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's more general to check for the get_params
attribute instead of checking hardcoded mixins? The current check seems to defeat the purpose of having a polymorphic structure (inheritance).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's a good idea! Just pushed 18c602f implementing that.
dmlc/xgboost#11021 provides more info. Since I don't want to use a from-source version of xgboost, Ill have to downgrade sklearn.
I could solved locally only fixing the version of scikit-learn to 1.5.2 |
fixes #10896
In
scikit-learn
1.6 (coming very soon), there are some significant changes to estimator tags and estimator checks. This PR makesxgboost
compatible with those changes.In short:
estimator._more_tags() -> dict
interface is deprecated, and estimators are encouraged to instead useestimator.__sklearn_tags__() -> sklearn.utils.Tags
parametrize_with_checks()
, and others have been made stricterNotes for Reviewers
How I tested this
See #10896. I tested these changes locally (M2 macOS, Python 3.11) against the following
scikit-learn
versions:1.7.0.dev0
(latest nightly)1.6.0.rc1
(latest release candidate for 1.6)1.5.2
(latest stable version)The patterns proposed here closely follow similar changes we made over in
lightgbm
:scikit-learn>=0.24.2
, make scikit-learn estimators compatible withscikit-learn>=1.6.0dev
microsoft/LightGBM#6651Shouldn't this require changes on the C++ side?
No. The errors I reported in #10896, which looked like they might be coming from assertions raised on the C++ side, were only showing up because
xgboost
estimators weren't yet compliant withscikit-learn
's new methods likeis_regressor()
andis_classifier()
, and therefore weren't being checked against the correct set of expectations.ref: https://github.com/scikit-learn/scikit-learn/blob/fa5d7275ba4dd2627b6522e1ec4eaf0f3a2e3c05/sklearn/utils/estimator_checks.py#L385-L387
How to review the tag changes
It may be hard to review changes with assignments to these new nested attributes, like
See the implementation of
sklearn.utils.Tags
(and the other*Tags
classes), along with their docs, at https://github.com/scikit-learn/scikit-learn/blob/1.6.X/sklearn/utils/_tags.py