Decision Trees

Covers how similarity is measured and how to choose the best split, based on SSE (sum of squared errors, for regression) and Gini (for classification); good info about Gini here.
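As a rough illustration of the SSE criterion, the sketch below scans every candidate threshold on a single numeric feature and keeps the one with the lowest total SSE across the two child nodes. The function names `sse` and `best_split_sse` are hypothetical, and the single-feature, midpoint-threshold setup is an assumption made to keep the example small.

```python
def sse(values):
    """Sum of squared errors of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split_sse(xs, ys):
    """Return (threshold, score) for the split x < threshold that minimises
    SSE(left) + SSE(right). Thresholds are midpoints between sorted x values."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = sse(left) + sse(right)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
threshold, score = best_split_sse(xs, ys)
```

A full tree builder would repeat this scan over every feature and recurse on the two halves; this sketch stops at one split.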

  • For classification, the Gini cost function is used; it indicates how “pure” the leaf nodes are (how mixed the training data assigned to each node is).

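Gini = sum(pk * (1 - pk)), where pk is the proportion of training samples of class k in the node and the sum runs over all classes k.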
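The Gini formula can be sketched in a few lines of Python. The helper names `gini` and `split_gini` are hypothetical; weighting each child node by its share of the samples is the usual way to score a candidate split, and is assumed here rather than taken from the notes.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one node: sum over classes of p_k * (1 - p_k)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def split_gini(left, right):
    """Size-weighted Gini of a candidate split (lower means purer children)."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


pure = gini(["a", "a", "a", "a"])    # a single class: perfectly pure
mixed = gini(["a", "a", "b", "b"])   # two classes, evenly mixed
perfect_split = split_gini(["a", "a"], ["b", "b"])
```

A node holding a single class scores 0, and an even two-class mix scores 0.5, the maximum for two classes.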

  • Early stopping - require a minimum number of samples per leaf node; 1 sample per node overfits, 5-10 is a good range.
  • Pruning - evaluate what happens if the leaf nodes are removed; if performance drops sharply, the nodes are needed and should be kept.

KDTREE

  1. Similar to a binary search tree, except that each level splits the points at the median of one feature, chosen at random for that level (cycling through the features in order is also common).
  2. Used to find nearest neighbours.
  3. KD trees have many applications: reducing a color space, database key search, etc.
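The median-split construction and the nearest-neighbour lookup described above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the names `build_kdtree`, `sq_dist` and `nearest` are hypothetical, nodes are plain dicts, and the split feature cycles with depth instead of being drawn at random (an assumption made so the output is deterministic).

```python
def build_kdtree(points, depth=0):
    """Build a KD tree by splitting at the median of one feature per level."""
    if not points:
        return None
    axis = depth % len(points[0])  # assumption: cycle features by depth
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, target, best=None):
    """Return the stored point closest to target, pruning subtrees that
    lie farther from target than the best point found so far."""
    if node is None:
        return best
    if best is None or sq_dist(node["point"], target) < sq_dist(best, target):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, target, best)
    # Only search the far side if the splitting plane is closer than best.
    if diff ** 2 < sq_dist(best, target):
        best = nearest(far, target, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
closest = nearest(tree, (9, 2))
```

The pruning test is what makes the lookup cheaper than a linear scan: a whole subtree is skipped whenever the splitting plane is already farther away than the best candidate.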

RANDOM FOREST

Uses an ensemble of trees to create a high-dimensional, sparse representation of the data, which is then classified with a linear classifier.


How to deal with imbalanced data in a random forest:

  1. One approach is based on cost-sensitive learning.
  2. The other is based on a sampling technique.
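Both approaches can be sketched without any library. The first function computes class weights inversely proportional to class frequency (one common cost-sensitive heuristic); the second randomly duplicates minority-class rows until every class matches the largest one (one simple sampling technique). The names `balanced_class_weights` and `random_oversample` are hypothetical, and the notes do not say which specific variants the linked material uses.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so that errors on rare classes cost more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(y)
    return out_rows, out_labels


labels = [0] * 8 + [1] * 2
rows = [[i] for i in range(10)]
weights = balanced_class_weights(labels)
new_rows, new_labels = random_oversample(rows, labels)
```

In scikit-learn, the same weighting heuristic is available by passing `class_weight="balanced"` to `RandomForestClassifier`, so the weights rarely need to be computed by hand.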

EXTRA TREES

  1. A comparison between random forest and extra trees. Fig. 1: Comparison of random forests and extra trees in the presence of irrelevant predictors. Results for the random forest are shown in blue and for the extra trees in red. The results are quite striking: extra trees perform consistently better when there are a few relevant predictors and many noisy ones.
  2. Difference between RF and ET
  3. Differences #2