update ml-coding

Alireza Dirafzoon committed Mar 16, 2023
1 parent bf182d5 commit 4346ffd
Showing 6 changed files with 540 additions and 2 deletions.
6 changes: 4 additions & 2 deletions src/MLC/ml-coding.md
@@ -2,11 +2,13 @@
 An ML coding module may or may not exist in a particular company's interview process. The good news is that there are only a limited number of ML algorithms that candidates are expected to be able to code. The most common ones include:
 
 ## ML Algorithms
-- K-means clustering ([code](./notebooks/k_means_1.ipynb))
+- K-means clustering ([code](./notebooks/k_means.ipynb))
 
 - K-nearest neighbors ([code 1](./notebooks/knn.ipynb) - [code 2](https://github.com/MahanFathi/CS231/blob/master/assignment1/cs231n/classifiers/k_nearest_neighbor.py))
 
-- Decision trees ([code](https://github.com/random-forests/tutorials/blob/master/decision_tree.py))
+- Decision trees ([code](./notebooks/decision_tree.ipynb))
+<!-- (https://github.com/random-forests/tutorials/blob/master/decision_tree.py) -->
+
 
 - Perceptron and FeedForward NN ([code 1](https://github.com/alirezadir/deep-learning/blob/master/first-neural-network/my_answers.py), [code 2](https://github.com/MahanFathi/CS231/blob/master/assignment1/cs231n/classifiers/neural_net.py))
 
193 changes: 193 additions & 0 deletions src/MLC/notebooks/decision_tree.ipynb
@@ -0,0 +1,193 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"A decision tree is a type of machine learning algorithm used for classification and regression tasks. It consists of a tree-like structure where each internal node represents a feature or attribute, each branch represents a decision based on that feature, and each leaf node represents a predicted output.\n",
"\n",
"To **train** a decision tree, the algorithm uses a dataset with labeled examples to create the tree structure. It starts with the root node, which includes all the examples, and selects the feature that provides the most information gain to split the data into two subsets. It then repeats this process for each subset until it reaches a stopping criterion, such as a maximum tree depth or minimum number of examples in a leaf node.\n",
"\n",
"Once the decision tree is trained, it can be used to **predict** the output for new, unseen examples. To make a prediction, the algorithm starts at the root node and follows the branches based on the values of the input features until it reaches a leaf node. The predicted output for that example is the value associated with the leaf node.\n",
"\n",
"Decision trees have several advantages, such as being easy to interpret and visualize, handling both numerical and categorical data, and handling missing values. However, they can also suffer from overfitting if the tree is too complex or if there is noise or outliers in the data. \n",
"\n",
"To address this issue, various techniques such as pruning, ensemble methods, and regularization can be used to simplify the decision tree or combine multiple trees to improve generalization performance. Additionally, decision trees may not perform well with highly imbalanced datasets or datasets with many irrelevant features, and they may not be suitable for tasks where the relationships between features and outputs are highly nonlinear or complex."
]
},
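{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Before the full implementation, here is a minimal sketch of an entropy-based information-gain computation. This cell is an illustration added for clarity (the `entropy` and `information_gain` helpers are made-up names, not from any library); note that the `DecisionTree` class below uses the closely related Gini impurity as its split criterion rather than entropy."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def entropy(y):\n",
"    # H(y) = -sum(p_c * log2(p_c)) over the class proportions p_c.\n",
"    _, counts = np.unique(y, return_counts=True)\n",
"    p = counts / counts.sum()\n",
"    return -np.sum(p * np.log2(p))\n",
"\n",
"def information_gain(y, left_mask):\n",
"    # Gain = parent entropy minus the size-weighted entropy of the children.\n",
"    # `left_mask` is a boolean array marking examples sent to the left child.\n",
"    y_left, y_right = y[left_mask], y[~left_mask]\n",
"    n = len(y)\n",
"    children = (len(y_left) / n) * entropy(y_left) + (len(y_right) / n) * entropy(y_right)\n",
"    return entropy(y) - children\n",
"\n",
"# Toy check: a split that separates the classes perfectly recovers the\n",
"# full parent entropy (1 bit for a balanced binary problem).\n",
"y_toy = np.array([0, 0, 1, 1])\n",
"print(information_gain(y_toy, np.array([True, True, False, False])))  # 1.0\n"
]
},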
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"class DecisionTree:\n",
" def __init__(self, max_depth=None):\n",
" self.max_depth = max_depth\n",
" \n",
" def fit(self, X, y):\n",
" self.n_classes_ = len(np.unique(y))\n",
" self.n_features_ = X.shape[1]\n",
" self.tree_ = self._grow_tree(X, y)\n",
" \n",
" def predict(self, X):\n",
" return [self._predict(inputs) for inputs in X]\n",
" \n",
" def _gini(self, y):\n",
" _, counts = np.unique(y, return_counts=True)\n",
" impurity = 1 - np.sum([(count / len(y)) ** 2 for count in counts])\n",
" return impurity\n",
" \n",
" def _best_split(self, X, y):\n",
" m = y.size\n",
" if m <= 1:\n",
" return None, None\n",
" \n",
" num_parent = [np.sum(y == c) for c in range(self.n_classes_)]\n",
" best_gini = 1.0 - sum((n / m) ** 2 for n in num_parent)\n",
" best_idx, best_thr = None, None\n",
" \n",
" for idx in range(self.n_features_):\n",
" thresholds, classes = zip(*sorted(zip(X[:, idx], y)))\n",
" num_left = [0] * self.n_classes_\n",
" num_right = num_parent.copy()\n",
" for i in range(1, m):\n",
" c = classes[i - 1]\n",
" num_left[c] += 1\n",
" num_right[c] -= 1\n",
" gini_left = 1.0 - sum(\n",
" (num_left[x] / i) ** 2 for x in range(self.n_classes_)\n",
" )\n",
" gini_right = 1.0 - sum(\n",
" (num_right[x] / (m - i)) ** 2 for x in range(self.n_classes_)\n",
" )\n",
" gini = (i * gini_left + (m - i) * gini_right) / m\n",
" if thresholds[i] == thresholds[i - 1]:\n",
" continue\n",
" if gini < best_gini:\n",
" best_gini = gini\n",
" best_idx = idx\n",
" best_thr = (thresholds[i] + thresholds[i - 1]) / 2\n",
" \n",
" return best_idx, best_thr\n",
" \n",
" def _grow_tree(self, X, y, depth=0):\n",
" num_samples_per_class = [np.sum(y == i) for i in range(self.n_classes_)]\n",
" predicted_class = np.argmax(num_samples_per_class)\n",
" node = Node(predicted_class=predicted_class)\n",
" if depth < self.max_depth:\n",
" idx, thr = self._best_split(X, y)\n",
" if idx is not None:\n",
" indices_left = X[:, idx] < thr\n",
" X_left, y_left = X[indices_left], y[indices_left]\n",
" X_right, y_right = X[~indices_left], y[~indices_left]\n",
" node.feature_index = idx\n",
" node.threshold = thr\n",
" node.left = self._grow_tree(X_left, y_left, depth + 1)\n",
" node.right = self._grow_tree(X_right, y_right, depth + 1)\n",
" return node\n",
" \n",
" def _predict(self, inputs):\n",
" node = self.tree_\n",
" while node.left:\n",
" if inputs[node.feature_index] < node.threshold:\n",
" node = node.left\n",
" else:\n",
" node = node.right\n",
" return node.predicted_class\n",
" \n",
"class Node:\n",
" def __init__(self, *, predicted_class):\n",
" self.predicted_class = predicted_class\n",
" self.feature_index = 0\n",
" self.threshold = 0.0 \n",
" self.left = None\n",
" self.right = None\n",
"\n",
" def is_leaf_node(self):\n",
" return self.left is None and self.right is None\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 1.0\n"
]
}
],
"source": [
"from sklearn.datasets import load_iris\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"# Load the iris dataset\n",
"iris = load_iris()\n",
"X = iris.data\n",
"y = iris.target\n",
"\n",
"# Split the data into training and testing sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"# Train the decision tree\n",
"tree = DecisionTree(max_depth=3)\n",
"tree.fit(X_train, y_train)\n",
"\n",
"# Make predictions on the test set\n",
"y_pred = tree.predict(X_test)\n",
"\n",
"# Compute the accuracy of the predictions\n",
"accuracy = accuracy_score(y_test, y_pred)\n",
"\n",
"print(f\"Accuracy: {accuracy}\")\n"
]
},
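{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As noted at the top of the notebook, ensembling is one way to reduce overfitting. The cell below is a minimal bagging sketch added for illustration (the `bagged_predict` helper is a made-up name, not part of any library): it trains several trees on bootstrap resamples of the training set and takes a majority vote. It assumes the `DecisionTree` class and the iris train/test split defined above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def bagged_predict(X_train, y_train, X_test, n_trees=5, max_depth=3, seed=0):\n",
"    # Train each tree on a bootstrap resample of the training set,\n",
"    # then take a majority vote across trees for each test example.\n",
"    rng = np.random.default_rng(seed)\n",
"    all_preds = []\n",
"    for _ in range(n_trees):\n",
"        idx = rng.integers(0, len(X_train), size=len(X_train))\n",
"        tree = DecisionTree(max_depth=max_depth)\n",
"        tree.fit(X_train[idx], y_train[idx])\n",
"        all_preds.append(tree.predict(X_test))\n",
"    votes = np.array(all_preds)  # shape: (n_trees, n_test)\n",
"    return np.array([np.bincount(col).argmax() for col in votes.T])\n",
"\n",
"y_pred_bagged = bagged_predict(X_train, y_train, X_test)\n",
"print(f\"Bagged accuracy: {accuracy_score(y_test, y_pred_bagged)}\")\n"
]
},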
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
File renamed without changes.
Empty file.