Commit 768b50e
completed thru Lab 3
davewentzel committed Sep 30, 2019
1 parent b27780c commit 768b50e
Showing 8 changed files with 598 additions and 848 deletions.
448 changes: 7 additions & 441 deletions Lab/Lab1/Lab1.md

916 changes: 561 additions & 355 deletions Lab/Lab2/Lab2.md

Binary file added Lab/Lab2/MDWLab2.pbit
Binary file not shown.
Binary file added Lab/Lab2/Media/workflow.png
Binary file not shown.
2 changes: 2 additions & 0 deletions Lab/Lab2/Solution/Create Staging NYCTaxiData.sql
@@ -1,5 +1,7 @@
create schema [Staging]
go
create schema [NYC]
GO

create table [Staging].[NYCTaxiData]
(
47 changes: 14 additions & 33 deletions Lab/Lab3/Lab3.md
@@ -13,10 +13,6 @@ Step | Description
## Create Azure Databricks Cluster
In this section you are going to create an Azure Databricks cluster that will be used to execute notebooks.

**IMPORTANT**|
-------------|
**Execute these steps on your host computer**|

1. In the Azure Portal, navigate to the MDW-Lab resource group and locate the Azure Databricks resource MDWDatabricks-*suffix*.
2. On the **MDWDatabricks-*suffix*** blade, click the **Launch Workspace** button. The Azure Databricks portal will open on a new browser tab.

@@ -27,18 +23,14 @@ In this section you are going to create an Azure Databricks cluster that will be

![](./Media/Lab3-Image03.png)

-5. On the Create Cluster blade, type MDWDatabricksCluster in the **Cluster Name** field. Leave all other fields with their default values.
+5. On the Create Cluster blade, type `MDWDatabricksCluster` in the **Cluster Name** field. Leave all other fields with their default values.
6. Click **Create Cluster**. It should take around 5 minutes for the cluster to be fully operational.

![](./Media/Lab3-Image04.png)

## Create an Azure Databricks Notebook
In this section you are going to create an Azure Databricks notebook that will be used to explore the taxi data files you copied to your data lake in the Lab 2.

**IMPORTANT**|
-------------|
**Execute these steps on your host computer**|

1. On the Azure Databricks portal, click the **Home** button on the left-hand side menu.
2. On the **Workspace** blade, click the down arrow next to your user name and then click **Create > Notebook**.

@@ -51,7 +43,7 @@ In this section you are going to create an Azure Databricks notebook that will b
![](./Media/Lab3-Image06.png)

6. On the **Cmd 1** cell, click the **Edit** button on the top right-hand corner of the cell and then click **Show Title**.
-7. Type Setup connection to MDWDataLake storage account in the cell title.
+7. Type `Setup connection to MDWDataLake storage account` in the cell title.

![](./Media/Lab3-Image07.png)
![](./Media/Lab3-Image08.png)
@@ -63,17 +55,21 @@ In this section you are going to create an Azure Databricks notebook that will b
9. Use the Python code below and replace *[your MDWDataLake storage account name]* with **mdwdatalake*suffix*** and *[your MDWDataLake storage account key]* with the storage account key.

```python
-spark.conf.set(
-  "fs.azure.account.key.[your MDWDataLake storage account name].blob.core.windows.net",
-  "[your MDWDataLake storage account key]")
+## vars to change
+acctname = "mdwdatalakeg3sve"
+acctkey = "/A2mGb+x4ZLpy1pGV4JzYA0YgQZ6gV0SaSFeIjvdhXcwTkQIzAwtYP5goo2vW6dYa1i1Ng9hLWwOiKv7XUxDIQ=="
+
+fullacctname = "fs.azure.account.key." + acctname + ".blob.core.windows.net"
+wasbs_location = "wasbs://nyctaxidata@" + acctname + ".blob.core.windows.net/"
+spark.conf.set(fullacctname, acctkey)

```
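For reference, the two derived strings in the cell above expand as shown in this plain-Python sketch. The account name here is a hypothetical stand-in for **mdwdatalake*suffix***, and the key is omitted entirely:

```python
# Sketch only: what the derived connection strings evaluate to.
# "mdwdatalakedemo" is a hypothetical stand-in for the real account name.
acctname = "mdwdatalakedemo"

fullacctname = "fs.azure.account.key." + acctname + ".blob.core.windows.net"
wasbs_location = "wasbs://nyctaxidata@" + acctname + ".blob.core.windows.net/"

print(fullacctname)    # fs.azure.account.key.mdwdatalakedemo.blob.core.windows.net
print(wasbs_location)  # wasbs://nyctaxidata@mdwdatalakedemo.blob.core.windows.net/
```

The first string is the Spark configuration key for the storage account; the second is the container URI the data frame reader will load from.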

10. Press **Shift + Enter** to execute and create a new notebook cell.
-Set the title of the **Cmd 2** cell to Define NYCTaxiData schema and load data into a Data Frame
+Set the title of the **Cmd 2** cell to `Define NYCTaxiData schema and load data into a Data Frame`

11. In the **Cmd 2** cell, define a new **StructType** object that will contain the definition of the data frame schema.
-12. Using the schema defined above, initialise a new data frame by invoking the Spark API to read the contents of the nyctaxidata container in the MDWDataLake storage account. Use the Python code below:
+12. Using the schema defined above, initialize a new data frame by invoking the Spark API to read the contents of the nyctaxidata container in the MDWDataLake storage account. Use the Python code below:

```python
from pyspark.sql.types import *
@@ -97,23 +93,16 @@ nycTaxiDataSchema = StructType([
, StructField("improvement_surcharge",DoubleType(),True)
, StructField("total_amount",DoubleType(),True)])

-dfNYCTaxiData = spark.read.format('csv').options(header='true', schema=nycTaxiDataSchema).load('wasbs://nyctaxidata@[your MDWDataLake storage account name].blob.core.windows.net/')
+dfNYCTaxiData = spark.read.format('csv').options(header='true', schema=nycTaxiDataSchema).load(wasbs_location)
```
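As a plain-Python aside, this is roughly what `header='true'` plus a typed schema accomplish: the header row names the columns, and the schema dictates the type each raw string value is cast to. The two-line sample below is invented and shows only a few of the schema's columns:

```python
import csv
import io

# Hypothetical sample in the shape of the NYC taxi CSV files
# (invented values; only a subset of the schema's columns).
sample = io.StringIO(
    "vendor_id,passenger_count,trip_distance,total_amount\n"
    "2,6,3.5,17.3\n"
)

reader = csv.DictReader(sample)  # header row names the fields, like header='true'
row = next(reader)               # every value arrives as a string

# The StructType schema is what tells Spark to cast each column;
# with stdlib csv we have to cast by hand.
typed = {
    "vendor_id": int(row["vendor_id"]),
    "passenger_count": int(row["passenger_count"]),
    "trip_distance": float(row["trip_distance"]),
    "total_amount": float(row["total_amount"]),
}
print(typed)
```

Without the schema every field would remain a string, which is why the lab defines the `StructType` before loading.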

13. Remember to replace *[your MDWDataLake storage account name]* with **mdwdatalake*suffix*** and *[your MDWDataLake storage account key]* with the storage account key. Your **Cmd 2** cell should look like this:

![](./Media/Lab3-Image09.png)

14. Hit **Shift + Enter** to execute the command and create a new cell.
-15. Set the title of the **Cmd 3** cell to “Display Data Frame Content”.
-16. In the **Cmd 3** cell, call the display function to show the contents of the data frame dfNYCTaxiData. Use the Python code below:
+15. Set the title of the **Cmd 3** cell to `Display Data Frame Content` with code:

```python
display(dfNYCTaxiData)
```
-17. Hit **Shift + Enter** to execute the command and create a new cell. You will see a data grid showing the top 1000 records from the dataframe:
-
-![](./Media/Lab3-Image10.png)
+17. Hit **Shift + Enter** to execute the command and create a new cell. You will see a data grid showing the top 1000 records from the dataframe

18. Set the title of the **Cmd 4** cell to “Create Temp View”
19. In the **Cmd 4** cell, call the **createOrReplaceTempView** method of the data frame object to create a temporary view of the data in memory. Use the Python code below:
@@ -134,10 +123,6 @@ dfNYCTaxiData.createOrReplaceTempView('NYCTaxiDataTable')
select count(*) from NYCTaxiDataTable
```

24. Hit **Shift + Enter** to execute the command and create a new cell. You will see the total number of records in the data frame at the bottom of the cell.

![](./Media/Lab3-Image11.png)

25. Set the title of the **Cmd 6** cell to “Use SQL to filter NYC Taxi Data records”

26. In the **Cmd 6** cell, write a SQL query to filter taxi rides that happened on April 7th, 2018 and had more than 5 passengers. Use the command below:
@@ -154,10 +139,6 @@ where cast(tpep_pickup_datetime as date) = '2018-04-07'
and passenger_count > 5
```
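As an aside, the predicate in this query can be sketched in stdlib Python over a few invented rows (the rows and values below are hypothetical, not taken from the lab data):

```python
from datetime import datetime

# Invented sample rides; timestamps follow the tpep_pickup_datetime format.
rides = [
    {"tpep_pickup_datetime": "2018-04-07 09:15:00", "passenger_count": 6},
    {"tpep_pickup_datetime": "2018-04-07 11:02:00", "passenger_count": 2},
    {"tpep_pickup_datetime": "2018-04-08 09:15:00", "passenger_count": 6},
]

# Same predicate as the SQL: pickup date is 2018-04-07 AND more than 5 passengers.
def matches(ride):
    pickup = datetime.strptime(ride["tpep_pickup_datetime"], "%Y-%m-%d %H:%M:%S")
    return pickup.date().isoformat() == "2018-04-07" and ride["passenger_count"] > 5

filtered = [r for r in rides if matches(r)]
print(len(filtered))  # 1
```

The `cast(tpep_pickup_datetime as date)` in the SQL plays the same role as `pickup.date()` here: it strips the time component so the comparison is against the calendar day.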

27. Hit **Shift + Enter** to execute the command. You will see a grid showing the filtered result set.

![](./Media/Lab3-Image12.png)

28. Set the title of the **Cmd 7** cell to “Use SQL to aggregate NYC Taxi Data records and visualize data”

29. In the **Cmd 7** cell, write a SQL query to aggregate records and return total number of rides by payment type. Use the command below:
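The idea behind this cell — total rides per payment type — is a group-by with a count, which stdlib Python can illustrate. The sample codes below are invented; the mapping of code 1 to credit card and 2 to cash follows the public NYC taxi data dictionary, so treat it as an assumption:

```python
from collections import Counter

# Invented sample of payment_type codes
# (assumed meanings: 1 = credit card, 2 = cash).
payment_types = [1, 1, 2, 1, 2, 1]

# Counter performs the same group-by-and-count the SQL aggregation does.
rides_by_payment_type = Counter(payment_types)
print(rides_by_payment_type.most_common())  # [(1, 4), (2, 2)]
```

In the notebook, the equivalent SQL `GROUP BY` result feeds directly into the cell's built-in visualization options.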
3 changes: 3 additions & 0 deletions README-Instructor.md
@@ -0,0 +1,3 @@
## One Day Course Setup Instructions

*
30 changes: 11 additions & 19 deletions README.md
@@ -1,15 +1,18 @@

-# Azure End-to-End Big Data
+# Azure End-to-End Big Data - One Day Event

Dave Wentzel
Microsoft MTC Architect: Data & AI
linkedin.com/in/dwentzel

Get Started NOW:
## Agenda

* `git clone https://github.com/`
Get Started as soon as possible:

* `git clone https://github.com/davew-msft/ADPE2E`
* **To get everyone started quickly, begin deploying the Azure infrastructure as soon as you can.** See [Lab Deployment](./Deploy/Deploy.md).
* Use `US East`, not `US East2`
* **If you get a failure message during deployment, let me know immediately**

## Background

@@ -28,7 +31,7 @@ New York City data used in this lab was obtained from the [New York City Open Da
## Lab Prerequisites and Deployment
The following prerequisites must be completed before you start these labs:

-* You must have an Azure account with administrator- or controbutor-level access to your subscription. If you don’t have an account, you can sign up for free following the instructions here: https://azure.microsoft.com/en-au/free/
+* You must have an Azure account with administrator- or contributor-level access to your subscription. If you don’t have an account, you can sign up for free following the instructions here: https://azure.microsoft.com/en-au/free/
* Lab 5 requires you to have a Twitter account. If you don’t have an account you can sign up for free following the instructions here: https://twitter.com/signup.
* Lab 5 requires you to have a Power BI Pro account. If you don’t have an account you can sign up for a 60-day trial for free here: https://powerbi.microsoft.com/en-us/power-bi-pro/

@@ -42,27 +45,16 @@ Throughout a series of 5 labs you will progressively implement the modern data p

### [Lab 1: Load Data into Azure SQL Data Warehouse using Azure Data Factory Pipelines](./Lab/Lab1/Lab1.md)

This lab sets up the basic tooling needed to complete the remaining labs.

### [Lab 2: Transform Big Data using Azure Data Factory and Azure SQL Data Warehouse](./Lab/Lab2/Lab2.md)
In this lab you will use Azure Data Factory to download large data files into your data lake and use an Azure SQL Data Warehouse stored procedure to generate a summary dataset and store it in the final table. The dataset you will use contains detailed New York City Yellow Taxi rides for 2018. You will generate a daily aggregated summary of all rides and save the result in your data warehouse. You will then use Power BI to visualise summarised data.

The estimated time to complete this lab is: **45 minutes**.
In this lab we will copy csv files from the NYC Taxi dataset to our local data lake and SQL Data Warehouse. We'll use Azure Data Factory to orchestrate a pipeline to do this.


Step | Description
-------- | -----
![](./Media/Green1.png) | Build an Azure Data Factory Pipeline to copy big data files from shared Azure Storage
![](./Media/Green2.png) | Save data files to your data lake
![](./Media/Green3.png) | Use Polybase to load data into staging tables in your Azure SQL Data Warehouse. Call a Stored Procedure to perform data aggregations and save results in the final table.
![](./Media/Green4.png) | Visualize data from your Azure SQL Data Warehouse using Power BI
### [Lab 3: Explore Big Data using Azure Databricks](./Lab/Lab3/Lab3.md)

### [Lab 3: Explore Big Data using Azure Databricks](./Lab/Lab3/Lab3.md)
In this lab you will use Azure Databricks to explore the New York Taxi data files you saved in your data lake in Lab 2. Using a Databricks notebook you will connect to the data lake and query taxi ride details.

The estimated time to complete this lab is: **20 minutes**.

Step | Description
-------- | -----
![](./Media/Red1.png) |Build an Azure Databricks notebook to explore the data files you saved in your data lake in the previous exercise. You will use Python and SQL commands to open a connection to your data lake and query data from data files.

### [Lab 4: Add AI to your Big Data Pipeline with Cognitive Services](./Lab/Lab4/Lab4.md)
In this lab you will use Azure Data Factory to download New York City images to your data lake. Then, as part of the same pipeline, you are going to use an Azure Databricks notebook to invoke the Computer Vision Cognitive Service to generate metadata documents and save them back in your data lake. The Azure Data Factory pipeline then finishes by saving all metadata information in a Cosmos DB collection. You will use Power BI to visualise NYC images and their AI-generated metadata.
