add FastTreeSHAPv1 implementation for EvoTrees #270

Draft
wants to merge 1 commit into
base: main

kpa28-git

I was sad to find Julia doesn't have a TreeSHAP implementation so I ported FastTreeSHAPv1 for my favorite boosted trees library. This PR is in case others find it useful.

This is marked as a draft for a few reasons:

  1. To gauge whether SHAP values belong in this package at all.
  2. The code runs, but the results haven't been tested or compared against other SHAP value libraries yet.
  3. I'm not very familiar with the FastTreeSHAP algorithm; this was a simple 1:1 port of the pseudocode, so suggestions are very welcome.
  4. There may be a better way to parallelize this without using Channels. I usually use a tmap implementation for these kinds of parallel tasks, but I didn't want to add dependencies to the package just for this. Again, suggestions are welcome.
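
For illustration, a dependency-free alternative to Channels could be sketched with `Threads.@spawn` from Base. This is only a sketch under assumptions: `threaded_map` and `toy_shap` are hypothetical names, not functions from this PR, and `toy_shap` merely stands in for a per-observation SHAP routine.

```julia
# Minimal sketch of a threaded map using only Base (no extra dependencies).
# `f` stands in for a per-observation SHAP routine; it is hypothetical here.
function threaded_map(f, xs::AbstractVector)
    # Spawn one task per element; the scheduler distributes them over threads.
    tasks = map(xs) do x
        Threads.@spawn f(x)
    end
    # Fetching in order returns the results in the same order as the inputs.
    return fetch.(tasks)
end

# Toy usage: a stand-in function instead of a real SHAP computation.
toy_shap(x) = x .^ 2
results = threaded_map(toy_shap, [[1.0, 2.0], [3.0, 4.0]])
```

Spawning one task per observation keeps the sketch short; for large inputs, splitting the observations into a few chunks per thread would likely reduce scheduling overhead.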

@tecosaur

You should be able to compare SHAP values to those from https://github.com/nredell/ShapML.jl, which has been tested against other implementations.

@jeremiedb
Member

Sorry for the late feedback. I think it would be a nice addition to the package to support direct SHAP derivation from the modeled tree.

As @tecosaur mentioned, a comparison with ShapML.jl would be a good preliminary step to assess the algorithm's correctness.
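
Independently of any cross-library comparison, SHAP values satisfy an additivity property: for one observation, the baseline (expected model output) plus the sum of the per-feature SHAP values should equal the model's prediction. A minimal sketch of that sanity check follows; `is_additive` and its arguments are hypothetical names, and the numbers are toy values, not EvoTrees output.

```julia
# Sketch: check that SHAP values for one observation sum to (prediction - baseline).
# `phi`, `baseline` and `pred` are hypothetical inputs, not EvoTrees outputs.
function is_additive(phi::AbstractVector, baseline::Real, pred::Real; atol=1e-6)
    # Compare with a tolerance because the sum accumulates floating-point error.
    return isapprox(baseline + sum(phi), pred; atol=atol)
end

# Toy numbers: 1.0 + (0.5 - 0.2 + 0.1) is approximately 1.4.
ok = is_additive([0.5, -0.2, 0.1], 1.0, 1.4)
```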

Then, regarding the interface for the SHAP calculation: in its current form, the implementation seems intended to be applied to a single observation. I think it would be interesting to support an actual DataFrame/Table of many observations, to be consistent with the existing interface, which accepts a DataFrame whose features can be numeric, Bool, or Categorical. I should be able to take a quick look at this in the coming days.
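
As a rough sketch of what a many-observation interface could look like, the snippet below applies a per-observation routine to each row of a plain matrix and collects the results into an observations-by-features matrix. The names `shap_matrix` and `toy_shap_one` are hypothetical, and a real version would presumably accept a Tables.jl-compatible input rather than a `Matrix`.

```julia
# Sketch: apply a single-observation SHAP routine to every row of a matrix.
# `shap_one` is a hypothetical function taking one feature vector and
# returning one SHAP value per feature.
function shap_matrix(shap_one, X::AbstractMatrix)
    n, p = size(X)
    out = Matrix{Float64}(undef, n, p)
    for i in 1:n
        # A view avoids copying the row before passing it on.
        out[i, :] = shap_one(view(X, i, :))
    end
    return out
end

# Toy stand-in for the real per-observation computation.
toy_shap_one(x) = x .- 1.0
S = shap_matrix(toy_shap_one, [1.0 2.0; 3.0 4.0])
```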

@kpa28-git
Author

Cool. I don't have an immediate need for this anymore, but I'll do a comparison with an existing approximate SHAP value package sometime in the next couple of weeks.

Your second point was my initial thought as well. However, there doesn't seem to be a canonical way to aggregate SHAP values, unless you know of one. A ShapML.jl example (see the "Global feature importance" heading) averages the absolute SHAP values across observations, which could be one generally useful approach. I believe this would differ enough from the existing gain-based feature importance to be useful, but I'd need to play with it to get an idea.
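
The mean-absolute aggregation described above can be sketched in a few lines, assuming the SHAP values are stored in a matrix with one row per observation and one column per feature. `global_importance` is a hypothetical name and the matrix holds toy values.

```julia
using Statistics: mean

# Sketch: global importance as the mean absolute SHAP value per feature.
# Rows of `S` are observations, columns are features (an assumed layout).
global_importance(S::AbstractMatrix) = vec(mean(abs.(S); dims=1))

# Toy SHAP matrix: 2 observations, 2 features.
S_toy = [0.5 -1.0;
         -1.5 3.0]
importance = global_importance(S_toy)
```

Taking absolute values first keeps positive and negative contributions from cancelling, which is why this differs from simply averaging the raw SHAP values.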

We could also plot all the SHAP values with a rug or beeswarm plot. I have some code that creates rug plots to visualize missing data, which could be adapted for SHAP values. This would be cool, but it depends on how much plotting you want in this package.


3 participants