In this lab you will use Azure Databricks to explore the New York Taxi data files you saved in your data lake in Lab 2. Using a Databricks notebook you will connect to the data lake and query taxi ride details.
The estimated time to complete this lab is: 45 minutes.
IMPORTANT: Some of the Azure services provisioned require globally unique name and a “-suffix” has been added to their names to ensure this uniqueness. Please take note of the suffix generated as you will need it for the following resources in this lab:
Name | Type |
---|---|
mdwdatalakesuffix | Storage Account |
MDWDatabricks-suffix | Databricks Workspace |
mdwsqlvirtualserver-suffix | SQL server |
IMPORTANT: The code snippets below illustrate the simplest and quickest way to establish connections between Databricks and other Azure services. They ARE NOT considered best practices as they expose secrets and passwords in plain text. For a secure implementation following the security best practices, please consider the use of Azure Key Vault in conjuntion with Databricks Secret Scopes (https://docs.azuredatabricks.net/user-guide/secrets/secret-scopes.html).
In this section you are going to create an Azure Databricks cluster that will be used to execute notebooks.
IMPORTANT |
---|
Execute these steps on your host computer |
-
In the Azure Portal, navigate to the lab resource group and locate the Azure Databricks resource MDWDatabricks-suffix.
-
On the MDWDatabricks-suffix blade, click the Launch Workspace button. The Azure Databricks portal will open on a new browser tab.
-
On the Azure Databricks portal, click the Clusters button on the left-hand side menu.
-
On the Clusters blade, click + Create Cluster.
-
On the Create Cluster blade, enter the following connection details:
- Cluster Name: MDWDatabricksCluster
- Max Workers: 4Leave all other fields with their default values.
-
Click Create Cluster. It should take around 5 minutes for the cluster to be fully operational.
In this section you are going to create an Azure Databricks notebook that will be used to explore the taxi data files you copied to your data lake in the Lab 2.
IMPORTANT |
---|
Execute these steps on your host computer |
-
On the Azure Databricks portal, click the Home button on the left-hand side menu.
-
On the Workspace blade, click the down arrow next to your user name and then click Create > Notebook.
-
On the Create Notebook pop-up window type “NYCTaxiData” in the Name field.
-
Ensure you have the Language field set to Python and the Cluster field is set to MDWDatabricksCluster.
-
Click Create.
-
On the Cmd 1 cell, click the Edit button on the top right-hand corner of the cell and then click Show Title.
-
Type “Setup connection to MDWDataLake storage account” in the cell title.
-
On the Cmd 1 cell, you will invoke the Spark API to establish a connection to your MDWDataLake storage account. For this you will need to retrieve the name and key of your MDWDataLake storage account from the Azure Portal.
-
Use the Python code below and replace [your MDWDataLake storage account name] with mdwdatalakesuffix and to replace [your MDWDataLake storage account key] with the storage account key.
spark.conf.set(
"fs.azure.account.key.[your MDWDataLake storage account name].blob.core.windows.net",
"[your MDWDataLake storage account key]")
-
Press Shift + Enter to execute and create a new notebook cell. Set the title of the Cmd 2 cell to “Define NYCTaxiData schema and load data into a Data Frame”
-
In the Cmd 2 cell, define a new StructType object that will contain the definition of the data frame schema.
-
Using the schema defined above, initialise a new data frame by invoking the Spark API to read the contents of the nyctaxidata container in the MDWDataLake storage account. Use the Python code below:
from pyspark.sql.types import *
nycTaxiDataSchema = StructType([
StructField("VendorID",IntegerType(),True)
, StructField("tpep_pickup_datetime",DateType(),True)
, StructField("tpep_dropoff_datetime",DateType(),True)
, StructField("passenger_count",IntegerType(),True)
, StructField("trip_distance",DoubleType(),True)
, StructField("RatecodeID",IntegerType(),True)
, StructField("store_and_fwd_flag",StringType(),True)
, StructField("PULocationID",IntegerType(),True)
, StructField("DOLocationID",IntegerType(),True)
, StructField("payment_type",IntegerType(),True)
, StructField("fare_amount",DoubleType(),True)
, StructField("extra",DoubleType(),True)
, StructField("mta_tax",DoubleType(),True)
, StructField("tip_amount",DoubleType(),True)
, StructField("tolls_amount",DoubleType(),True)
, StructField("improvement_surcharge",DoubleType(),True)
, StructField("total_amount",DoubleType(),True)])
dfNYCTaxiData = spark.read.format('csv').options(header='true', schema=nycTaxiDataSchema).load('wasbs://nyctaxidata@[your MDWDataLake storage account name].blob.core.windows.net/')
-
Remember to replace [your MDWDataLake storage account name] with mdwdatalakesuffix and to replace [your MDWDataLake storage account key] with the storage account key. Your Cmd 2 cell should look like this:
-
Hit Shift + Enter to execute the command and create a new cell.
-
Set the title of the Cmd 3 cell to “Display Data Frame Content”.
-
In the Cmd 3 cell, call the display function to show the contents of the data frame dfNYCTaxiData. Use the Python code below:
display(dfNYCTaxiData)
-
Hit Shift + Enter to execute the command and create a new cell. You will see a data grid showing the top 1000 records from the dataframe:
-
Set the title of the Cmd 4 cell to “Use DataFrame Operations to Filter Data”
-
In the Cmd 4 cell, call the select() method of the data frame object to select the columns "tpep_pickup_datetime", "passenger_count" and "total_amount". Then use the filter() method to filter rows where "passenger_count > 6" and "total_amount > 50.0". Use the Python code below:
display(dfNYCTaxiData.select("tpep_pickup_datetime", "passenger_count", "total_amount").filter("passenger_count > 6 and total_amount > 50.0"))
-
Hit Shift + Enter to execute the command and create a new cell.
-
Set the title of the Cmd 5 cell to “Create Temp View”
-
In the Cmd 5 cell, call the createOrReplaceTempView method of the data frame object to create a temporary view of the data in memory. Use the Python code below:
dfNYCTaxiData.createOrReplaceTempView('NYCTaxiDataTable')
-
Hit Shift + Enter to execute the command and create a new cell.
-
Set the title of the Cmd 6 cell to “Use SQL to count NYC Taxi Data records”
-
In the Cmd 6 cell, change the default language to SQL using the %sql command.
-
Write a SQL query to retrieve the total number of records in the NYCTaxiDataTable view. Use the command below:
%sql
select count(*) from NYCTaxiDataTable
-
Hit Shift + Enter to execute the command and create a new cell. You will see the total number of records in the data frame at the bottom of the cell.
-
Set the title of the Cmd 7 cell to “Use SQL to filter NYC Taxi Data records”
-
In the Cmd 7 cell, write a SQL query to filter taxi rides that happened on the Apr, 7th 2018 that had more than 5 passengers. Use the command below:
%sql
select cast(tpep_pickup_datetime as date) as pickup_date
, tpep_dropoff_datetime
, passenger_count
, total_amount
from NYCTaxiDataTable
where cast(tpep_pickup_datetime as date) = '2018-04-07'
and passenger_count > 5
-
Hit Shift + Enter to execute the command and create a new cell. You will see a grid showing the filtered result set.
-
Set the title of the Cmd 9 cell to “Use SQL to aggregate NYC Taxi Data records and visualize data”
-
In the Cmd 9 cell, write a SQL query to aggregate records and return total number of rides by payment type. Use the command below:
%sql
select case payment_type
when 1 then 'Credit card'
when 2 then 'Cash'
when 3 then 'No charge'
when 4 then 'Dispute'
when 5 then 'Unknown'
when 6 then 'Voided trip'
end as PaymentType
, count(*) as TotalRideCount
from NYCTaxiDataTable
group by payment_type
order by TotalRideCount desc
-
Hit Shift + Enter to execute the command and create a new cell. Results will be displayed in a grid in the cell.
-
Click the Bar chart button to see results as a bar chart.
-
Set the title of the Cmd 10 cell to “Load Taxi Location Data from Azure SQL Data Warehouse”.
-
Using Python, open a JDBC connection to your Azure SQL Data Warehouse and load Taxi location lookup data from the Staging.NYCTaxiLocationLookup table into a new data frame called dfLocationLookup.
IMPORTANT: Don't forget to replace the 'mdwsqlvirtualserver-suffix' with your specific Azure SQL Data Warehouse server name.
In the same cell, create a temporary view called "NYCTaxiLocation" and display the contents of the data frame. Use the Python code below:
jdbcUrl = "jdbc:sqlserver://mdwsqlvirtualserver-suffix.database.windows.net:1433;database=MDWASQLDW"
connectionProperties = {
"user" : "mdwadmin",
"password" : "P@ssw0rd123!",
"driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
pushdown_query = '(select * from Staging.NYCTaxiLocationLookup) as t'
dfLookupLocation = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
dfLookupLocation.createOrReplaceTempView('NYCTaxiLocation')
display(dfLookupLocation)
-
Hit Shift + Enter to execute the command and create a new cell. Results will be displayed in a grid in the cell.
-
Set the title of the Cmd 11 cell to “Combine Data Lake and Data Warehouse data frames using SQL”.
-
In the Cmd 11 cell, write a SQL query to join the two dataframes using their view names. Write a SELECT statement to return the "Borough" column from NYCTaxiLocation view and the columns "tpep_pickup_datetime", "passenger_count" and "total_amount" from the NYCTaxiDataTable view. Use a WHERE clause to filter taxi rides that happened on the Apr, 7th of 2018 with passenger_count > 5 and total_amount > 50.0. Use the SQL command below:
%sql
select
pu.Borough
, cast(tpep_pickup_datetime as date) as pickup_date
, passenger_count
, total_amount
from NYCTaxiDataTable as rides
join NYCTaxiLocation as pu
on rides.PULocationID = pu.LocationID
where cast(tpep_pickup_datetime as date) = '2018-04-07'
and passenger_count > 5
and total_amount > 50.0