
Commit

2024-10-31-10-37-52 - content-rough-draft
josephmachado committed Oct 31, 2024
1 parent f67c9b8 commit d5f0715
Showing 6 changed files with 263 additions and 29 deletions.
9 changes: 0 additions & 9 deletions 4-Data-Pipeline/Data-Pipeline.ipynb
@@ -901,15 +901,6 @@
"\n",
"Here when we run `dbt test` the customer_key column in stg_customer table will be checked for uniqueness and not nulls.\n"
]
},
{
"cell_type": "markdown",
"id": "d65261ae-0181-44b1-a347-f3a7841737fa",
"metadata": {},
"source": [
"\n",
"### Create SCD2 easily with dbt snapshot"
]
}
],
"metadata": {
103 changes: 94 additions & 9 deletions 5-Containerization/Containerization.ipynb
@@ -5,7 +5,9 @@
"id": "706bda97-2b18-4a20-b96b-c75a3411b66e",
"metadata": {},
"source": [
"# Ensure that your code works on any system with Docker"
"# Ensure that your code works on any system with Docker\n",
"\n",
"We use docker to run our code inside containers. The benefit to this approach is that tour code will work similarly in any OS. Containerization also makes it easy to deploy to cloud systems (or any system that can run containers)"
]
},
{
@@ -193,60 +195,143 @@
"## Define the OS you want to run your code on with an Image"
]
},
{
"cell_type": "markdown",
"id": "37095424-0b95-43fa-a26c-c17977c08a45",
"metadata": {},
"source": [
"add: image"
]
},
{
"cell_type": "markdown",
"id": "1ae80bbc-5d39-4f10-8bd7-6302db5f6ab9",
"metadata": {},
"source": [
"An image is a blueprint to create your docker container. You can define the modules to install, variables to set, etc. Let’s consider our example:"
]
},
{
"cell_type": "markdown",
"id": "fb74a151-b4f0-4dfc-a82c-d97b2efafda9",
"metadata": {},
"source": [
"## Containers are where your OS (& code) runs, they are created from Image"
"## Containers are where your OS (& code) runs, they are created from Image\n",
"\n",
"With a blueprint defined with an image we can use this to create one or more containers. Containers are the actual running OS where your code will be run.\n",
"\n",
"Note that we can create multiple containers from the same image."
]
},
{
"cell_type": "markdown",
"id": "c0adb19e-779d-41eb-b3c9-4e9b96702f77",
"metadata": {},
"source": [
"The image files are often named `Dockerfile` \n",
"\n",
"The commands in the docker image (usually called Dockerfile ) are run in order. Let’s go over the key commands:\n",
"\n",
" FROM: We need a base operating system on which to set our configurations. We can also use existing Docker images available at the Docker Hub and add our config on top of them. In our example, we use the official Delta Lake Docker image.\n",
" COPY: Copy is used to copy files or folders from our local filesystem to the image. The copy command is usually used when building the docker image to copy settings, static files, etc. In our example, we copy over the tpch-dbgen folder, which contains the logic to create tpch data. We also copy over our requirements.txt file and our entrypoint.sh file.\n",
" RUN: Run is used to run a command in the shell terminal of your image. It is typically used to install libraries, create folders, etc.\n",
" ENV: This command sets the image’s environment variables. In our example, we set Spark environment variables.\n",
" ENTRYPOINT: The entrypoint command executes a script when the image starts. In our example, we use a script file (entrypoint.sh) to start spark master and worker nodes depending on the inputs given to the docker cli when starting a container from this image.\n"
]
},
{
"cell_type": "markdown",
"id": "f0949982-d00a-400e-97f9-144e8822960a",
"metadata": {},
"source": [
"### Containers can be always running or only exist for the duration of your code"
"### Containers can be always running or only run for the duration of your code\n",
"\n",
"Docker containers are by default ephemeral, meaning that they only last for the duration of the process that is running in the container.\n",
"\n",
"In case of a webserver this means htat the container will e always on due to the nature of the process (webserver).\n",
"\n",
"In our case for running `dbt` commands, our container need only run for the duration of the execution of the command.\n",
"\n",
"In certain cases we will want our containers to be running always (e.g. Airflow scheduler, which we will see in the next chapter). \n",
"\n",
"We can use the `docker exec` command to run a command in existing containers,\n",
"docker run starts new containers from images."
]
},
{
"cell_type": "markdown",
"id": "eb725589-6053-44de-b4b3-93743ef6f4d7",
"metadata": {},
"source": [
"## Containers can interact with your local OS"
"## Containers can interact with your local OS\n",
"\n",
"When we run containers, we typically want to \n",
"* sync code changes, ie. when we are developing our IDEs often open the files in your os and thechanges you make here should be reflected inside the copy in the container.\n",
"* Open port. When running systems that have some UI/port access locally you want to ensure that these ports of the specified containers are open to your local os"
]
},
{
"cell_type": "markdown",
"id": "8d0a42a1-e13d-4aa9-a899-75b3a7c38da1",
"metadata": {},
"source": [
"### Ensure ports are open for your code to interact with other systems"
"### Ensure ports are open for your code to interact with other systems\n",
"\n",
"In our setup we want to ensure that the docs generated and served by the dbt cli (from inside the container) is accessible from our local os. \n",
"\n",
"To do this we keep port 8008 open. This will ensure that when we open http://localhost:8080 on our web browser we can actually see the dbt document UI"
]
},
{
"cell_type": "markdown",
"id": "f32fefd1-bb38-4299-a2cd-7400c5fde43d",
"metadata": {},
"source": [
"### Ensure code/data is synced between your local OS and your container with `volume mounts`"
"### Ensure code/data is synced between your local OS and your container with `volume mounts`\n",
"\n",
"Using mounted volumes, we can also ensure that files are synced between the containers and the local operating system. In addition to syncing local files, we can also create docker volumes to sync files between our containers.\n",
"\n",
"This is especially critical when we are developing locally, since we would want the changes to our code reflected inside the containers (where our code would actually run).\n",
"\n",
"add: volume to share data"
]
},
{
"cell_type": "markdown",
"id": "ce9de536-289e-466d-b25b-ef6140deed15",
"metadata": {},
"source": [
"## Let's run our dbt pipeline with `docker exec`\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "037b9a7f-5af3-4e5f-b07e-d7a5cc86829a",
"metadata": {},
"source": [
"## Orchestrate multiple containers with `docker compose`"
"## Orchestrate multiple containers with `docker compose`\n",
"\n",
"docker cli is simple to use, but when we need to start multiple containers or have containers start in a specific order using a docker compose yml file can greatly simplify our setup\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "ce9de536-289e-466d-b25b-ef6140deed15",
"id": "47b37956-21fd-44d6-babc-22e152dfa5b2",
"metadata": {},
"source": [
"## Start containers with `docker compose up`\n",
"\n",
"Docker compose will start all the defined containers in the `docker-compose.yml` file."
]
},
{
"cell_type": "markdown",
"id": "dae6fb91-df95-4df1-ba70-65899a11875c",
"metadata": {},
"source": [
"## Let's run our dbt pipeline with `docker exec`"
"## Run dbt commands with `docker exec`"
]
}
],
99 changes: 95 additions & 4 deletions 6-Scheduling-&-Orchestration/Scheduling-&-Orchestration.ipynb
@@ -5,31 +5,122 @@
"id": "5cecd71f-9f02-4c03-9405-59532ecf121f",
"metadata": {},
"source": [
"# Scheduler = service to start pipeline at specified times"
"# Scheduler = service to start pipeline at specified times\n",
"\n",
"A scheduler is a system that starts a process (in our case a data pipeline) as specified times or frequency\n",
"\n",
"A scheduler system is usually constantly running (e.g. Airflow scheduler, cron, etc) and will check the timetable information (metadata in Airflow, corntab file forcron, etc) periodically to figure out which (if any process) to start.\n",
"\n",
"With data pipelines you may need to know when a pipeline was supposed to run, for example if your pipeline is supposed to run at 12:00AM every morning, but gets delayed due to infrastructure scale/unavailability you would still get access to the execution time (12:00AM).\n",
"\n",
"Keeping track of exectution time in a pipeline is critical if you want your pipelines to work on a specified set of data per run\n",
"\n",
"add: execution time image from max's blog\n",
"\n",
"The execution time indicates the input data for most data pipelines and depending on how you desing your pipeline this will play a crucial role. Often times this is used as an unique identifier for a specific run of the pipeline (aka `run_id`).\n",
"Having a unique identifier per pipeline run( `run_id`) will help you design idempotent, easy to debug pipeline. \n",
" "
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "e1720a92-d783-40a7-b031-4b5845a780e5",
"metadata": {},
"outputs": [],
"source": [
"# add: airflow macros\n",
"# https://airflow.apache.org/docs/apache-airflow/1.10.3/macros.html"
]
},
{
"cell_type": "markdown",
"id": "3ccc3162-a380-4a3a-857f-7672fc8ec951",
"metadata": {},
"source": [
"While we saw the available macros in Airflow, most schedulers have similar options."
]
},
{
"cell_type": "markdown",
"id": "3bdad09a-9eb0-4060-96c9-f872c4530d06",
"metadata": {},
"source": [
"# Orchestrator = System to ensure tasks in a data pipeline are run in the correct order "
"# Orchestrator = System to ensure tasks in a data pipeline are run in the correct order \n",
"\n",
"Orchestrator is system that is reponsible for orderding the tasks in a data pipeline. With a scheduler your pipelines will always run the taks in a specifed order.\n",
"\n",
"An orchestrator is responsible for ensuring that your data pipeline is a DAG (directed acyclic graph), i.e that there are no infinte loops in your data pipeline.\n",
"\n",
"An orchestrator is often times a python library (e.g. Airflow, Dagster) with features that enable you to order your tasks as you see fit, or systems that have auto generate order of tasks from the code (e.g. dbt uses the `ref` feature to identify data task lineage)\n",
"\n",
"Modern orchestrators offer features that are hard to replicate in native Python without extensive code:\n",
"* branching\n",
"* dynamic task creation\n",
"* grouping related tasks \n",
"* conditional workflow (task a or B or c,,.. -> pass/fail)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8116949-2a0b-421a-888e-1364003a40b6",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "8530b61a-115f-4fc7-b9ac-a0991efaaee0",
"metadata": {},
"source": [
"**NOTE** ften times shedulers and orchestrators are considered as one system, this is not true. You want to understand where the scheduler ends and orchestrator begins to ensure that you use the right system for the use case."
]
},
{
"cell_type": "markdown",
"id": "0f3dd477-46ef-4563-951e-7d0fe592ffbf",
"metadata": {},
"source": [
"# Airflow is both a scheduler and an orchestrator"
"# Airflow is both a scheduler and an orchestrator\n",
"\n",
"Let's take a quick look at Airflow's architecture:\n",
"\n",
"add: image from https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/overview.html#basic-airflow-deployment\n",
"\n",
"When we start docker we can see Airflow webserver (responsible for the UI) and Airflow scheduler (responsbile for starting the DAGs at the right time)\n",
"\n",
"In addtion, we have installed the `apache-airflow-client` which is used for defining data pipelines in our code."
]
},
{
"cell_type": "markdown",
"id": "5f4907d0-2c47-4006-b805-784632a9eeeb",
"metadata": {},
"source": [
"# Define data pipeline as a DAG"
"# Define data pipeline as a DAG\n",
"\n",
"Most orchestrators have their own *way* to define a data pipeline (aka DAG). We can define a DAG in Airflow as shown below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dde10706-d312-48a8-bb53-3e4ebc45a8ae",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "37f168cb-0c8b-4456-9e74-a6dc928b688b",
"metadata": {},
"source": [
"Note that this is stored within the `./dag` folder which is synced to our containers (via volume mount). In the above DAG, we do the follwing\n",
"\n",
"1. run `dbt run`\n",
"2. Run data quality checks with `dbt test`\n",
"3. Move the metrics data to `sqlite3` which we will use in the next chapter for visualization."
]
},
{
3 changes: 2 additions & 1 deletion 6-Scheduling-&-Orchestration/dags/customer_outreach.py
@@ -33,7 +33,8 @@ def write_out_preagg_to_sqlite(sqlite3_db='./data/sqlite3_tpch_preagg_tbls.db',

conn = duckdb.connect(duckdb_file)
cursor = conn.cursor()


# This code gets data from our preagg tables and writes it into a sqlite3 format for visualization with Metabase
cursor.execute(f"ATTACH '{sqlite3_db}' AS sqlite_db (TYPE SQLITE)")
cursor.execute(f"CREATE TABLE sqlite_db.customer_outreach_metrics AS SELECT * FROM customer_outreach_metrics")
cursor.execute(f"CREATE TABLE sqlite_db.order_lineitem_metrics AS SELECT * FROM order_lineitem_metrics")
30 changes: 27 additions & 3 deletions 8-Visualizations/Visualization.ipynb
@@ -55,23 +55,47 @@
"id": "7fc407ea-3624-404f-87cc-e3fa449f3151",
"metadata": {},
"source": [
"# Visualizations makes it easy to see data trends"
"# Visualizations makes it easy to see data trends\n",
"\n",
"For non technical stake holders we need to ensure that they are able to quickly see trends, performance etc.\n",
"\n",
"Often times this is easier to define as a dashboard that end users can just view.\n",
"\n",
"Visualization tools also enable dynamic segmentation of data, allowing end users to easily explore the data without having to learn the intricacies of SQL.\n",
"\n",
"An additional benefit is that these tools allow the devs to define metrics in a way that enables end user to see the metrics under most conditions. For example if the end user filters the data by a specific country the metric will bereflected for the data post filter. While this may seound simple if you are using SQL, it is a lot of code. \n",
"\n",
"This is where BI tools shine are some even have their own scripting language (LOOKML, DAX, etc). **Note** that its best to keep your metric definition in your code repo when you can, as opposed to visualization tool. Having a single definition of metric will ensure that any downstream consumer of our data (BI tool, dashboard, APIs, Data exports, etc) will have the same view of the data. \n",
"\n",
"Also note that most paid BI tools charge per seat and can be very expensive if not carefully managed."
]
},
{
"cell_type": "markdown",
"id": "14b6decb-ffeb-40cd-a506-2ab2f1ef1de6",
"metadata": {},
"source": [
"In our example we will use the open source version of `Metabase` add link"
]
},
{
"cell_type": "markdown",
"id": "e5102035-0f96-41d1-87bb-230c1dd5416d",
"metadata": {},
"source": [
"# Warehouses power visualization tools"
"# Warehouses power visualization tools\n",
"\n",
"Visualization (BI) tools often run queries on your warehouse. When an enduser is exploring data via the UI, the visualization tool run a new query on your warehouse. Ensure that your warehouse has proper caching setup, otherwise each query (in response to a simple UI drag-drop) from the BI tool will result in a new query, causing warehouse usage cost to sky rocket.\n",
"\n",
"There is a different type visualization tool, where the visualizations are generated as static html (evidence.dev, quarto, dash, plotly) where one can generate static HTML periodically and the end user just sees those. The upside is that you don't need a separate tool and simplicity. The downside is the almost non-existent data exploration capabilities of static HTML pages."
]
},
{
"cell_type": "markdown",
"id": "c28080ad-426c-49cb-8b25-3b72db0e8ad9",
"metadata": {},
"source": [
"# Keep metric definition to a minimum"
"# Keep metric definition in BI tools to a minimum"
]
}
],
Expand Down