❗ This is a read-only mirror of the CRAN R package repository. rpart — Recursive Partitioning and Regression Trees. Homepage: https://github.com/bethatkinson/rpart, https://cran.r-project.org/package=rpart Report bugs for this package: https://github.com/bethatkinson/rpart/issues
cran/rpart
This represents a major revision of the rpart code, driven by the desire to add user-written split routines.

1. Bugfix -- the "maxdepth" option needed to be in front of "xval" in the return list from rpart.control. The xval option can be of length 1 or n, and the C code was assuming length 1. (If xval was a vector, its second value was used for maxdepth.)

2. Changes to the rpart object

This was motivated partly by the fact that StatSci wants to incorporate rpart into the product. We fixed a couple of design flaws now, before they got cast into stone.

  a. Removed the frame$splits component. The only routine that used it was labels.rpart, and even that did not use it by default. Now we compute the labels as needed. Pre-computing was a bad idea when more or fewer digits were wanted on the printout.

  b. Added a component having to do with user-written split functions, see below.

  c. The component yval is now always the first component of the prediction. If the prediction is of length >1, a second yval2 component is returned also. For instance, when the prediction is (event rate, #events) for poisson trees, or (predicted class, class counts, class probabilities) for classification trees, yval2 will be a matrix containing the full response. Before, yval2 was the number of events for poisson, and the class counts for classification, with yet another optional vector yprob for the class probabilities. More discussion of why we did this is below.

3. Simple printout change. Per the ongoing suggestion of Brian Ripley, the print and summary routines now use options(digits), not digits-3. This should be repaired in the survival routines as well; the -3 was not one of my better ideas.

4. User-written splitting rules

A user can create their own splitting rules, and pass them to rpart as a list of 3 functions: initialization, response, and splitting.

4a. Printing

One important side effect of this update concerns the printing of trees.
The print, summary, and text routines all had special if-then-else code to treat each of the 4 current splitting methods as a special case for printout. In order to make them extensible, this all had to go. The initialization functions rpart.class, rpart.exp, rpart.poisson, and rpart.anova now each return a set of formatting functions:

  summary <- function(yval, dev, wt, ylevel, digits)
      yval:   a vector or matrix of response values
      dev:    a vector of deviance values
      wt:     a vector of weights
      ylevel: if the left-hand side of the model equation was a factor, this contains its levels, otherwise NULL
      digits: number of significant digits
    The result should be a vector of character strings. For poisson splits, for instance, "events=54, estimated rate=0.057, mean deviance=1.32" is the string that is created.

  print <- function(yval, ylevel, digits)
    Optional, currently only used by rpart.class. If missing, the default is to use yval as the last part of the line in print.rpart.

  text <- Not written yet

As a consequence, the summary and print routines no longer have special code per method. (And soon text.rpart)

4b. The number of y variables needs to be passed into the routine, rather than being a part of the func_table.h file. Solution: all of the init functions (rpart.anova, rpart.exp, etc) now return 'numy' as a part of their list.

4c. Callback

In order to get decent speed using a user-written routine, I needed to use the "trick" found originally in glm code and then later in penalized survival. In 3.4 the technique is completely undocumented -- but I once got to see the C code for glm as a Bell beta tester and copied it blindly. The code here uses the approach outlined (thinly) in the green book. The heart of the work is found in rpartcallback.s and rpart_callback.c. It is intended that the same approach will replace what is currently used in the survival routines.

Note: I avoided the "ASSIGN_IN_FRAME" macro because of a deficiency pointed out by Bill Dunlap.
An open question is how these can be mapped into R. I'm hoping, since it's mostly macros, that it will be fairly easy.

At present, rpartcallback.s isn't really used. I'd like to keep its functionality as a separate routine, particularly since the same lines of code appear in both rpart() and xpred.rpart() (see rpart2.s for instance). But as soon as rpartcallback returns, some memory that I need gets released, in particular the two expressions. I've tried putting a copy of them into eframe, using COPY_ALL in the .c code, and a few other things, and nothing works. The working code, rpart.s, has all the lines from rpartcallback copied inside it, right where the call to rpartcallback would have been. Of course, most of the things I tried are pure guesswork, given the sparseness of the documentation for .Call.

4d. The routines

In the test directory is a file "anovatest.s" that shows how to create 3 routines that replicate the built-in anova splitting method. It is fairly heavily commented.

4e. Speed

The last few lines of anovatest show an approximately 5-fold penalty for doing the splits outside of C. But the ability to prototype a new idea quickly is really nice. For a simple example see anovatest2.s.

5. Cleaned up the labels.rpart() function.

  a. Nothing calls rplabel.c or prlab anymore. The first of these had been hard to standardize across the Unix `strings' libraries because of one of the routines that I used. Most of what these routines did has been moved into the labels.rpart code itself, hopefully allowing for more transparency.

  b. Deprecated the "pretty" arg to labels.rpart, replacing it with "minlength", which is much more sensibly set up (see the comments at the head of the routine for details). Allow access to more of the arguments of abbreviate().

  c. In many places the code now makes use of a new routine formatg(), which gives us the "g" format of printf. The routine is reminiscent of "formatc" found on statlib, but with fewer options (and fewer checks).
If a more flexible format() appears one day in standard S we could convert to using it.
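
To make the idea in 5c concrete, here is a minimal sketch of a "g"-format helper written in present-day R. It is not the package's formatg(); the name formatg_sketch and its arguments are hypothetical, and it assumes only that the goal is printf-style "%g" output that keeps the shape of its input.

```r
# Hypothetical sketch, not rpart's own formatg(): format numbers with the
# printf "%g" conversion, using `digits` significant digits.
formatg_sketch <- function(x, digits = getOption("digits")) {
    if (!is.numeric(x)) stop("'x' must be numeric")
    fmt <- paste0("%.", digits, "g")   # e.g. "%.3g" when digits = 3
    out <- sprintf(fmt, x)
    # Put back the shape of the input so that a matrix stays a matrix
    # and a named vector keeps its names.
    if (!is.null(dim(x))) {
        dim(out) <- dim(x)
        dimnames(out) <- dimnames(x)
    } else {
        names(out) <- names(x)
    }
    out
}

# Small and large magnitudes each get a compact representation.
formatg_sketch(c(0.05712, 1234567, 1.32), digits = 3)
```

Unlike a fixed number of decimal places, "%g" switches to exponent notation for very large or very small values, which is convenient when one column of a printout mixes rates and counts.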
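
As an illustration of the summary formatting function described under 4a, the sketch below builds the kind of string quoted there for poisson splits. It is an assumption-laden example rather than the code returned by rpart.poisson: the function name is hypothetical, yval is assumed to be a two-column matrix of (event rate, number of events) as in item 2c, and "mean deviance" is assumed to be dev/wt.

```r
# Hypothetical sketch of a summary formatting function with the signature
# given in 4a. Assumptions: yval[, 1] is the event rate, yval[, 2] is the
# number of events, and the mean deviance is dev / wt.
poisson_summary_sketch <- function(yval, dev, wt, ylevel, digits) {
    # Local "%g"-style formatter with `digits` significant digits
    g <- function(x) sprintf(paste0("%.", digits, "g"), x)
    paste0("events=", g(yval[, 2]),
           ", estimated rate=", g(yval[, 1]),
           ", mean deviance=", g(dev / wt))
}

# Two made-up nodes, purely for illustration
yval <- cbind(rate = c(0.057, 0.12), events = c(54, 10))
poisson_summary_sketch(yval, dev = c(132, 30), wt = c(100, 20),
                       ylevel = NULL, digits = 3)
```

The function returns one character string per node, so print and summary code can stay generic and leave the method-specific wording to the function supplied by the initialization routine.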