Commit 768b50e
completed thru Lab 3
davewentzel committed Sep 30, 2019
1 parent b27780c commit 768b50e
Showing 8 changed files with 598 additions and 848 deletions.
448 changes: 7 additions & 441 deletions Lab/Lab1/Lab1.md

916 changes: 561 additions & 355 deletions Lab/Lab2/Lab2.md

Binary file added Lab/Lab2/MDWLab2.pbit
Binary file not shown.
Binary file added Lab/Lab2/Media/workflow.png
Binary file not shown.
2 changes: 2 additions & 0 deletions Lab/Lab2/Solution/Create Staging NYCTaxiData.sql
@@ -1,5 +1,7 @@
create schema [Staging]
go
create schema [NYC]
GO

create table [Staging].[NYCTaxiData]
(
47 changes: 14 additions & 33 deletions Lab/Lab3/Lab3.md
@@ -13,10 +13,6 @@ Step | Description
## Create Azure Databricks Cluster
In this section you are going to create an Azure Databricks cluster that will be used to execute notebooks.

**IMPORTANT**|
-------------|
**Execute these steps on your host computer**|

1. In the Azure Portal, navigate to the MDW-Lab resource group and locate the Azure Databricks resource MDWDatabricks-*suffix*.
2. On the **MDWDatabricks-*suffix*** blade, click the **Launch Workspace** button. The Azure Databricks portal will open on a new browser tab.

@@ -27,18 +23,14 @@ In this section you are going to create an Azure Databricks cluster that will be

![](./Media/Lab3-Image03.png)

-5. On the Create Cluster blade, type MDWDatabricksCluster in the **Cluster Name** field. Leave all other fields with their default values.
+5. On the Create Cluster blade, type `MDWDatabricksCluster` in the **Cluster Name** field. Leave all other fields with their default values.
6. Click **Create Cluster**. It should take around 5 minutes for the cluster to be fully operational.

![](./Media/Lab3-Image04.png)

## Create an Azure Databricks Notebook
In this section you are going to create an Azure Databricks notebook that will be used to explore the taxi data files you copied to your data lake in the Lab 2.

**IMPORTANT**|
-------------|
**Execute these steps on your host computer**|

1. On the Azure Databricks portal, click the **Home** button on the left-hand side menu.
2. On the **Workspace** blade, click the down arrow next to your user name and then click **Create > Notebook**.

@@ -51,7 +43,7 @@ In this section you are going to create an Azure Databricks notebook that will b
![](./Media/Lab3-Image06.png)

6. On the **Cmd 1** cell, click the **Edit** button on the top right-hand corner of the cell and then click **Show Title**.
-7. Type Setup connection to MDWDataLake storage account in the cell title.
+7. Type `Setup connection to MDWDataLake storage account` in the cell title.

![](./Media/Lab3-Image07.png)
![](./Media/Lab3-Image08.png)
@@ -63,17 +55,21 @@ In this section you are going to create an Azure Databricks notebook that will b
9. Use the Python code below and replace *[your MDWDataLake storage account name]* with **mdwdatalake*suffix*** and *[your MDWDataLake storage account key]* with the storage account key.

```python
-spark.conf.set(
-  "fs.azure.account.key.[your MDWDataLake storage account name].blob.core.windows.net",
-  "[your MDWDataLake storage account key]")
+## vars to change
+acctname = "mdwdatalakeg3sve"
+acctkey = "/A2mGb+x4ZLpy1pGV4JzYA0YgQZ6gV0SaSFeIjvdhXcwTkQIzAwtYP5goo2vW6dYa1i1Ng9hLWwOiKv7XUxDIQ=="
+
+fullacctname = "fs.azure.account.key." + acctname + ".blob.core.windows.net"
+wasbs_location = "wasbs://nyctaxidata@" + acctname + ".blob.core.windows.net/"
+spark.conf.set(fullacctname, acctkey)

```
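For reference, the two derived strings in the cell above expand as shown in this plain-Python sketch. The account name here is a hypothetical stand-in for **mdwdatalake*suffix***, and the key is omitted entirely:

```python
# Sketch only: what the derived connection strings evaluate to.
# "mdwdatalakedemo" is a hypothetical stand-in for the real account name.
acctname = "mdwdatalakedemo"

fullacctname = "fs.azure.account.key." + acctname + ".blob.core.windows.net"
wasbs_location = "wasbs://nyctaxidata@" + acctname + ".blob.core.windows.net/"

print(fullacctname)    # fs.azure.account.key.mdwdatalakedemo.blob.core.windows.net
print(wasbs_location)  # wasbs://nyctaxidata@mdwdatalakedemo.blob.core.windows.net/
```

The first string is the Spark configuration key for the storage account; the second is the container URI the data frame reader will load from.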

10. Press **Shift + Enter** to execute and create a new notebook cell.
-Set the title of the **Cmd 2** cell to Define NYCTaxiData schema and load data into a Data Frame
+Set the title of the **Cmd 2** cell to `Define NYCTaxiData schema and load data into a Data Frame`

11. In the **Cmd 2** cell, define a new **StructType** object that will contain the definition of the data frame schema.
-12. Using the schema defined above, initialise a new data frame by invoking the Spark API to read the contents of the nyctaxidata container in the MDWDataLake storage account. Use the Python code below:
+12. Using the schema defined above, initialize a new data frame by invoking the Spark API to read the contents of the nyctaxidata container in the MDWDataLake storage account. Use the Python code below:

```python
from pyspark.sql.types import *
@@ -97,23 +93,16 @@ nycTaxiDataSchema = StructType([
, StructField("improvement_surcharge",DoubleType(),True)
, StructField("total_amount",DoubleType(),True)])

-dfNYCTaxiData = spark.read.format('csv').options(header='true', schema=nycTaxiDataSchema).load('wasbs://nyctaxidata@[your MDWDataLake storage account name].blob.core.windows.net/')
+dfNYCTaxiData = spark.read.format('csv').options(header='true', schema=nycTaxiDataSchema).load(wasbs_location)
```
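As a plain-Python aside, this is roughly what `header='true'` plus a typed schema accomplish: the header row names the columns, and the schema dictates the type each raw string value is cast to. The two-line sample below is invented and shows only a few of the schema's columns:

```python
import csv
import io

# Hypothetical sample in the shape of the NYC taxi CSV files
# (invented values; only a subset of the schema's columns).
sample = io.StringIO(
    "vendor_id,passenger_count,trip_distance,total_amount\n"
    "2,6,3.5,17.3\n"
)

reader = csv.DictReader(sample)  # header row names the fields, like header='true'
row = next(reader)               # every value arrives as a string

# The StructType schema is what tells Spark to cast each column;
# with stdlib csv we have to cast by hand.
typed = {
    "vendor_id": int(row["vendor_id"]),
    "passenger_count": int(row["passenger_count"]),
    "trip_distance": float(row["trip_distance"]),
    "total_amount": float(row["total_amount"]),
}
print(typed)
```

Without the schema every field would remain a string, which is why the lab defines the `StructType` before loading.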

13. Remember to replace *[your MDWDataLake storage account name]* with **mdwdatalake*suffix*** and *[your MDWDataLake storage account key]* with the storage account key. Your **Cmd 2** cell should look like this:

![](./Media/Lab3-Image09.png)

14. Hit **Shift + Enter** to execute the command and create a new cell.
-15. Set the title of the **Cmd 3** cell to “Display Data Frame Content”.
-16. In the **Cmd 3** cell, call the display function to show the contents of the data frame dfNYCTaxiData. Use the Python code below:
+15. Set the title of the **Cmd 3** cell to `Display Data Frame Content` with code:

```python
display(dfNYCTaxiData)
```
-17. Hit **Shift + Enter** to execute the command and create a new cell. You will see a data grid showing the top 1000 records from the dataframe:
-
-![](./Media/Lab3-Image10.png)
+17. Hit **Shift + Enter** to execute the command and create a new cell. You will see a data grid showing the top 1000 records from the dataframe

18. Set the title of the **Cmd 4** cell to “Create Temp View”
19. In the **Cmd 4** cell, call the **createOrReplaceTempView** method of the data frame object to create a temporary view of the data in memory. Use the Python code below:
@@ -134,10 +123,6 @@ dfNYCTaxiData.createOrReplaceTempView('NYCTaxiDataTable')
select count(*) from NYCTaxiDataTable
```

24. Hit **Shift + Enter** to execute the command and create a new cell. You will see the total number of records in the data frame at the bottom of the cell.

![](./Media/Lab3-Image11.png)

25. Set the title of the **Cmd 6** cell to “Use SQL to filter NYC Taxi Data records”

26. In the **Cmd 6** cell, write a SQL query to filter taxi rides that happened on April 7th, 2018 and had more than 5 passengers. Use the command below:
@@ -154,10 +139,6 @@ where cast(tpep_pickup_datetime as date) = '2018-04-07'
and passenger_count > 5
```
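As an aside, the predicate in this query can be sketched in stdlib Python over a few invented rows (the rows and values below are hypothetical, not taken from the lab data):

```python
from datetime import datetime

# Invented sample rides; timestamps follow the tpep_pickup_datetime format.
rides = [
    {"tpep_pickup_datetime": "2018-04-07 09:15:00", "passenger_count": 6},
    {"tpep_pickup_datetime": "2018-04-07 11:02:00", "passenger_count": 2},
    {"tpep_pickup_datetime": "2018-04-08 09:15:00", "passenger_count": 6},
]

# Same predicate as the SQL: pickup date is 2018-04-07 AND more than 5 passengers.
def matches(ride):
    pickup = datetime.strptime(ride["tpep_pickup_datetime"], "%Y-%m-%d %H:%M:%S")
    return pickup.date().isoformat() == "2018-04-07" and ride["passenger_count"] > 5

filtered = [r for r in rides if matches(r)]
print(len(filtered))  # 1
```

The `cast(tpep_pickup_datetime as date)` in the SQL plays the same role as `pickup.date()` here: it strips the time component so the comparison is against the calendar day.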

27. Hit **Shift + Enter** to execute the command. You will see a grid showing the filtered result set.

![](./Media/Lab3-Image12.png)

28. Set the title of the **Cmd 7** cell to “Use SQL to aggregate NYC Taxi Data records and visualize data”

29. In the **Cmd 7** cell, write a SQL query to aggregate records and return total number of rides by payment type. Use the command below:
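The idea behind this cell — total rides per payment type — is a group-by with a count, which stdlib Python can illustrate. The sample codes below are invented; the mapping of code 1 to credit card and 2 to cash follows the public NYC taxi data dictionary, so treat it as an assumption:

```python
from collections import Counter

# Invented sample of payment_type codes
# (assumed meanings: 1 = credit card, 2 = cash).
payment_types = [1, 1, 2, 1, 2, 1]

# Counter performs the same group-by-and-count the SQL aggregation does.
rides_by_payment_type = Counter(payment_types)
print(rides_by_payment_type.most_common())  # [(1, 4), (2, 2)]
```

In the notebook, the equivalent SQL `GROUP BY` result feeds directly into the cell's built-in visualization options.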
3 changes: 3 additions & 0 deletions README-Instructor.md
@@ -0,0 +1,3 @@
## One Day Course Setup Instructions

*
30 changes: 11 additions & 19 deletions README.md
@@ -1,15 +1,18 @@

-# Azure End-to-End Big Data
+# Azure End-to-End Big Data - One Day Event

Dave Wentzel
Microsoft MTC Architect: Data & AI
linkedin.com/in/dwentzel

Get Started NOW:
## Agenda

* `git clone https://github.com/`
Get Started as soon as possible:

* `git clone https://github.com/davew-msft/ADPE2E`
* **To get everyone started quickly, begin deploying the Azure infrastructure as soon as you can.** See [Lab Deployment](./Deploy/Deploy.md).
* Use `US East`, not `US East2`
* **If you get a failure message during deployment, let me know immediately**

## Background

@@ -28,7 +31,7 @@ New York City data used in this lab was obtained from the [New York City Open Da
## Lab Prerequisites and Deployment
The following prerequisites must be completed before you start these labs:

-* You must have an Azure account with administrator- or controbutor-level access to your subscription. If you don’t have an account, you can sign up for free following the instructions here: https://azure.microsoft.com/en-au/free/
+* You must have an Azure account with administrator- or contributor-level access to your subscription. If you don’t have an account, you can sign up for free following the instructions here: https://azure.microsoft.com/en-au/free/
* Lab 5 requires you to have a Twitter account. If you don’t have an account you can sign up for free following the instructions here: https://twitter.com/signup.
* Lab 5 requires you to have a Power BI Pro account. If you don’t have an account you can sign up for a 60-day trial for free here: https://powerbi.microsoft.com/en-us/power-bi-pro/

@@ -42,27 +45,16 @@ Throughout a series of 5 labs you will progressively implement the modern data p

### [Lab 1: Load Data into Azure SQL Data Warehouse using Azure Data Factory Pipelines](./Lab/Lab1/Lab1.md)

This lab sets up the basic tooling needed to complete the remaining labs.

### [Lab 2: Transform Big Data using Azure Data Factory and Azure SQL Data Warehouse](./Lab/Lab2/Lab2.md)
In this lab you will use Azure Data Factory to download large data files into your data lake and use an Azure SQL Data Warehouse stored procedure to generate a summary dataset and store it in the final table. The dataset you will use contains detailed New York City Yellow Taxi rides for 2018. You will generate a daily aggregated summary of all rides and save the result in your data warehouse. You will then use Power BI to visualise summarised data.

The estimated time to complete this lab is: **45 minutes**.
In this lab we will copy csv files from the NYC Taxi dataset to our local data lake and SQL Data Warehouse. We'll use Azure Data Factory to orchestrate a pipeline to do this.


Step | Description
-------- | -----
![](./Media/Green1.png) | Build an Azure Data Factory Pipeline to copy big data files from shared Azure Storage
![](./Media/Green2.png) | Save data files to your data lake
![](./Media/Green3.png) | Use Polybase to load data into staging tables in your Azure SQL Data Warehouse. Call a Stored Procedure to perform data aggregations and save results in the final table.
![](./Media/Green4.png) | Visualize data from your Azure SQL Data Warehouse using Power BI
### [Lab 3: Explore Big Data using Azure Databricks](./Lab/Lab3/Lab3.md)

### [Lab 3: Explore Big Data using Azure Databricks](./Lab/Lab3/Lab3.md)
In this lab you will use Azure Databricks to explore the New York Taxi data files you saved in your data lake in Lab 2. Using a Databricks notebook you will connect to the data lake and query taxi ride details.

The estimated time to complete this lab is: **20 minutes**.

Step | Description
-------- | -----
![](./Media/Red1.png) |Build an Azure Databricks notebook to explore the data files you saved in your data lake in the previous exercise. You will use Python and SQL commands to open a connection to your data lake and query data from data files.

### [Lab 4: Add AI to your Big Data Pipeline with Cognitive Services](./Lab/Lab4/Lab4.md)
In this lab you will use Azure Data Factory to download New York City images to your data lake. Then, as part of the same pipeline, you are going to use an Azure Databricks notebook to invoke the Computer Vision Cognitive Service to generate metadata documents and save them back in your data lake. The Azure Data Factory pipeline then finishes by saving all metadata information in a Cosmos DB collection. You will use Power BI to visualise NYC images and their AI-generated metadata.
