
Lab 1: Load Data into Azure Synapse Analytics using Azure Data Factory Pipelines

In this lab you will configure the Azure environment to allow relational data to be transferred from an Azure SQL Database to an Azure Synapse Analytics data warehouse using Azure Data Factory. The dataset you will use contains data about motor vehicle collisions that happened in New York City from 2012 to 2019. You will use Power BI to visualise collision data loaded from your Azure Synapse Analytics data warehouse.

The estimated time to complete this lab is: 45 minutes.

Microsoft Learn & Technical Documentation

The following Azure services will be used in this lab. If you need further training resources or access to technical documentation, the table below contains links to Microsoft Learn and to each service's technical documentation.

| Azure Service | Microsoft Learn | Technical Documentation |
| --- | --- | --- |
| Azure SQL Database | Work with relational data in Azure | Azure SQL Database Technical Documentation |
| Azure Data Factory | Data ingestion with Azure Data Factory | Azure Data Factory Technical Documentation |
| Azure Synapse Analytics | Implement a Data Warehouse with Azure Synapse Analytics | Azure Synapse Analytics Technical Documentation |
| Azure Data Lake Storage Gen2 | Large Scale Data Processing with Azure Data Lake Storage Gen2 | Azure Data Lake Storage Gen2 Technical Documentation |

Lab Architecture

(Lab architecture diagram)

| Step | Description |
| --- | --- |
| 1 | Build an Azure Data Factory pipeline to copy data from an Azure SQL Database table |
| 2 | Use Azure Data Lake Storage Gen2 as a staging area for PolyBase |
| 3 | Load data to an Azure Synapse Analytics table using PolyBase |
| 4 | Visualize data from Azure Synapse Analytics using Power BI |

IMPORTANT: Some of the Azure services provisioned require a globally unique name, so a “-suffix” has been added to their names to ensure uniqueness. Please take note of the generated suffix, as you will need it for the following resources in this lab:

| Name | Type |
| --- | --- |
| SynapseDataFactory-suffix | Data Factory (V2) |
| synapsedatalakesuffix | Data Lake Storage Gen2 |
| synapsesql-suffix | SQL server |
| operationalsql-suffix | SQL server |

Connect to ADPDesktop

In this section you are going to establish a Remote Desktop connection to the ADPDesktop virtual machine.

IMPORTANT
Execute these steps on your host computer
  1. In the Azure Portal, navigate to the lab resource group and click the ADPDesktop virtual machine.

  2. On the ADPDesktop blade, from the Overview menu, click the Connect button.

  3. On the Connect to virtual machine blade, click Download RDP File. This will download a .rdp file that you can use to establish a Remote Desktop Connection with the virtual machine.

  4. If you have issues connecting via Remote Desktop Protocol (RDP), you can then connect via Azure Bastion by clicking the Bastion tab and providing the credentials indicated in the next section. This will open a new browser tab with the remote connection via SSL and HTML5.

Install required software onto ADPDesktop

In this section you are going to install Power BI Desktop and Azure Data Studio on ADPDesktop.

IMPORTANT
Execute these steps inside the ADPDesktop remote desktop connection
  1. Once the RDP file is downloaded, click on it to establish an RDP connection with ADPDesktop

  2. Use the following credentials to authenticate:
    - User Name: ADPAdmin
    - Password: P@ssw0rd123!

  3. Once logged in, accept the default privacy settings.

  4. Using the browser, download and install the latest version of the following software. During the setup, accept all default settings:

    Azure Data Studio (User Installer)
    https://docs.microsoft.com/en-us/sql/azure-data-studio/download


    Power BI Desktop (64-bit)
    https://aka.ms/pbiSingleInstaller

Create Azure Synapse Analytics data warehouse objects

In this section you will connect to Azure Synapse Analytics to create the database objects used to host and process data.

IMPORTANT
Execute these steps inside the ADPDesktop remote desktop connection
  1. Open Azure Data Studio. On the Servers panel, click New Connection.

  2. On the Connection Details panel, enter the following connection details:
    - Server: synapsesql-suffix.database.windows.net
    - Authentication Type: SQL Login
    - User Name: ADPAdmin
    - Password: P@ssw0rd123!
    - Database: SynapseDW

  3. Click Connect.

  4. Right-click the server name and click New Query.

  5. In the new query window, create a new database schema named [NYC] using this SQL command:

create schema [NYC]
go
  6. Create a new round-robin distributed table named NYC.NYPD_MotorVehicleCollisions using the column definitions in the SQL command below:
create table [NYC].[NYPD_MotorVehicleCollisions](
	[UniqueKey] [int] NULL,
	[CollisionDate] [date] NULL,
	[CollisionDayOfWeek] [varchar](9) NULL,
	[CollisionTime] [time](7) NULL,
	[CollisionTimeAMPM] [varchar](2) NOT NULL,
	[CollisionTimeBin] [varchar](11) NULL,
	[Borough] [varchar](200) NULL,
	[ZipCode] [varchar](20) NULL,
	[Latitude] [float] NULL,
	[Longitude] [float] NULL,
	[Location] [varchar](200) NULL,
	[OnStreetName] [varchar](200) NULL,
	[CrossStreetName] [varchar](200) NULL,
	[OffStreetName] [varchar](200) NULL,
	[NumberPersonsInjured] [int] NULL,
	[NumberPersonsKilled] [int] NULL,
	[IsFatalCollision] [int] NOT NULL,
	[NumberPedestriansInjured] [int] NULL,
	[NumberPedestriansKilled] [int] NULL,
	[NumberCyclistInjured] [int] NULL,
	[NumberCyclistKilled] [int] NULL,
	[NumberMotoristInjured] [int] NULL,
	[NumberMotoristKilled] [int] NULL,
	[ContributingFactorVehicle1] [varchar](200) NULL,
	[ContributingFactorVehicle2] [varchar](200) NULL,
	[ContributingFactorVehicle3] [varchar](200) NULL,
	[ContributingFactorVehicle4] [varchar](200) NULL,
	[ContributingFactorVehicle5] [varchar](200) NULL,
	[VehicleTypeCode1] [varchar](200) NULL,
	[VehicleTypeCode2] [varchar](200) NULL,
	[VehicleTypeCode3] [varchar](200) NULL,
	[VehicleTypeCode4] [varchar](200) NULL,
	[VehicleTypeCode5] [varchar](200) NULL
) 
with (distribution = round_robin)
go
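
Optionally, you can confirm that the table was created with the expected distribution before moving on. The query below is a minimal sanity check, assuming the dedicated SQL pool exposes the standard sys.pdw_table_distribution_properties catalog view; it should return a single row with a distribution policy of ROUND_ROBIN.

```sql
-- Optional check: confirm NYC.NYPD_MotorVehicleCollisions exists and
-- was created with a round-robin distribution.
select
    s.name as schema_name,
    t.name as table_name,
    dp.distribution_policy_desc
from sys.tables t
join sys.schemas s
    on s.schema_id = t.schema_id
join sys.pdw_table_distribution_properties dp
    on dp.object_id = t.object_id
where s.name = 'NYC'
  and t.name = 'NYPD_MotorVehicleCollisions';
```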

Create Staging Container on Azure Data Lake Storage Gen2

In this section you will create a staging container in your data lake account that will be used by PolyBase as a staging area before data is loaded into Azure Synapse Analytics.

IMPORTANT
Execute these steps on your host computer
  1. In the Azure Portal, go to the lab resource group and locate the Azure Storage account synapsedatalakesuffix.

  2. On the Overview panel, click Containers.

  3. On the synapsedatalakesuffix – Containers blade, click + Container. On the New container blade, enter the following details:
    - Name: polybase
    - Public access level: Private (no anonymous access)

  4. Click OK to create the new container.

Create Azure Data Factory Pipeline to Copy Relational Data

In this section you will build an Azure Data Factory pipeline to copy a table from the NYCDataSets database to the Azure Synapse Analytics data warehouse.

Create Linked Service connections

IMPORTANT
Execute these steps on your host computer
  1. In the Azure Portal, go to the lab resource group and locate the Azure Data Factory resource SynapseDataFactory-suffix.

  2. On the Overview panel, click Author & Monitor. The Azure Data Factory portal will open in a new browser tab.

  3. In the Azure Data Factory portal, click the Author (pencil icon) option on the left-hand side panel. Under the Connections tab, click Linked Services and then click + New to create a new linked service connection.

  4. On the New Linked Service blade, type “Azure SQL Database” in the search box to find the Azure SQL Database linked service. Click Continue.

  5. On the New Linked Service (Azure SQL Database) blade, enter the following details:
    - Name: OperationalSQL_NYCDataSets
    - Account selection method: From Azure subscription
    - Azure subscription: [your subscription]
    - Server Name: operationalsql-suffix
    - Database Name: NYCDataSets
    - Authentication Type: SQL Authentication
    - User Name: ADPAdmin
    - Password: P@ssw0rd123!

  6. Click Test connection to make sure you entered the correct connection details and then click Finish.

  7. Repeat the process to create an Azure Synapse Analytics linked service connection.

  8. On the New Linked Service (Azure Synapse Analytics) blade, enter the following details:
    - Name: SynapseSQL_SynapseDW
    - Connect via integration runtime: AutoResolveIntegrationRuntime
    - Account selection method: From Azure subscription
    - Azure subscription: [your subscription]
    - Server Name: synapsesql-suffix
    - Database Name: SynapseDW
    - Authentication Type: SQL Authentication
    - User Name: ADPAdmin
    - Password: P@ssw0rd123!

  9. Click Test connection to make sure you entered the correct connection details and then click Finish.

  10. Repeat the process once again to create an Azure Blob Storage linked service connection.

  11. On the New Linked Service (Azure Blob Storage) blade, enter the following details:
    - Name: synapsedatalake
    - Connect via integration runtime: AutoResolveIntegrationRuntime
    - Authentication method: Account key
    - Account selection method: From Azure subscription
    - Azure subscription: [your subscription]
    - Storage account name: synapsedatalakesuffix

  12. Click Test connection to make sure you entered the correct connection details and then click Finish.

  13. You should now see the three linked service connections that will be used as the source, destination, and staging area.

Create Source and Destination Data Sets

IMPORTANT
Execute these steps on your host computer
  1. Open the Azure Data Factory portal and click the Author (pencil icon) option on the left-hand side panel. Under the Factory Resources tab, click the ellipsis (…) next to Datasets and then click New Dataset to create a new dataset.

  2. Type "Azure SQL Database" in the search box and select Azure SQL Database. Click Finish.

  3. On the New Data Set tab, enter the following details:
    - Name: NYCDataSets_MotorVehicleCollisions
    - Linked Service: OperationalSQL_NYCDataSets
    - Table: [NYC].[NYPD_MotorVehicleCollisions]

    Alternatively you can copy and paste the dataset JSON definition below:

    {
        "name": "NYCDataSets_MotorVehicleCollisions",
        "properties": {
            "linkedServiceName": {
                "referenceName": "OperationalSQL_NYCDataSets",
                "type": "LinkedServiceReference"
            },
            "folder": {
                "name": "Lab1"
            },
            "annotations": [],
            "type": "AzureSqlTable",
            "schema": [],
            "typeProperties": {
                "schema": "NYC",
                "table": "NYPD_MotorVehicleCollisions"
            }
        }
    }
  4. Leave remaining fields with default values and click Continue.

  5. Repeat the process to create a new Azure Synapse Analytics data set.

  6. On the New Data Set tab, enter the following details:
    - Name: SynapseDW_MotorVehicleCollisions
    - Linked Service: SynapseSQL_SynapseDW
    - Table: [NYC].[NYPD_MotorVehicleCollisions]

    Alternatively you can copy and paste the dataset JSON definition below:

    {
        "name": "SynapseDW_MotorVehicleCollisions",
        "properties": {
            "linkedServiceName": {
                "referenceName": "SynapseSQL_SynapseDW",
                "type": "LinkedServiceReference"
            },
            "folder": {
                "name": "Lab1"
            },
            "annotations": [],
            "type": "AzureSqlDWTable",
            "schema": [],
            "typeProperties": {
                "schema": "NYC",
                "table": "NYPD_MotorVehicleCollisions"
            }
        }
    }
  7. Leave remaining fields with default values and click Continue.

  8. Under the Factory Resources tab, click the ellipsis (…) next to Datasets and then click New folder to create a new folder. Name it Lab1.

  9. Drag the two datasets created into the Lab1 folder you just created.

  10. Publish your dataset changes by clicking the Publish All button on the top of the screen.
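
Optionally, before building the pipeline you can preview the source data from Azure Data Studio on ADPDesktop, connected to the operationalsql-suffix server and the NYCDataSets database with the same ADPAdmin credentials. This is an illustrative check only and assumes the source table carries the same schema and name referenced by the NYCDataSets_MotorVehicleCollisions dataset above.

```sql
-- Optional preview of the source table in the NYCDataSets database.
-- Assumes the source table matches the schema/name used by the
-- NYCDataSets_MotorVehicleCollisions dataset.
select top 10 *
from [NYC].[NYPD_MotorVehicleCollisions]
order by [CollisionDate] desc;
```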

Create and Execute Pipeline

IMPORTANT
Execute these steps on your host computer
  1. Open the Azure Data Factory portal and click the Author (pencil icon) option on the left-hand side panel. Under the Factory Resources tab, click the ellipsis (…) next to Pipelines and then click Add Pipeline to create a new pipeline.

  2. On the New Pipeline tab, enter the following details:
    - General > Name: Lab1 - Copy Collision Data

  3. Leave remaining fields with default values.

  4. From the Activities panel, type “Copy Data” in the search box. Drag the Copy Data activity on to the design surface.

  5. Select the Copy Data activity and enter the following details:
    - General > Name: CopyMotorVehicleCollisions
    - Source > Source dataset: NYCDataSets_MotorVehicleCollisions
    - Sink > Sink dataset: SynapseDW_MotorVehicleCollisions
    - Sink > Allow PolyBase: Checked
    - Settings > Enable staging: Checked
    - Settings > Staging account linked service: synapsedatalake
    - Settings > Storage Path: polybase

  6. Leave remaining fields with default values.

  7. Publish your pipeline changes by clicking the Publish all button.

  8. To execute the pipeline, click on Add trigger menu and then Trigger Now.

  9. On the Pipeline Run blade, click Finish.

  10. To monitor the execution of your pipeline, click on the Monitor menu on the left-hand side panel.

  11. You should be able to see the Status of your pipeline execution on the right-hand side panel.
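
Once the pipeline run shows a Succeeded status, you can optionally confirm the load from Azure Data Studio (inside the ADPDesktop remote desktop connection) by reconnecting to the SynapseDW database and running a quick check. The query below is a minimal sketch of such a check, not part of the lab's pipeline; it reports the number of rows loaded and the collision date range they cover.

```sql
-- Optional post-load check against SynapseDW: row count and date range
-- of the data copied by the pipeline.
select
    count_big(*)         as total_rows,
    min([CollisionDate]) as earliest_collision,
    max([CollisionDate]) as latest_collision
from [NYC].[NYPD_MotorVehicleCollisions];
```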

Visualize Data with Power BI

In this section you are going to use Power BI to visualize data from Azure Synapse Analytics. The Power BI report will use an Import connection to query Azure Synapse Analytics and visualise Motor Vehicle Collision data from the table you loaded in the previous exercise.
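
To give a sense of the data behind the report, the query below is an illustrative aggregation only (it is not taken from the report definition): it summarises collisions per borough from the table loaded in the previous exercise, including how many of them were fatal.

```sql
-- Illustrative aggregation; the actual queries issued by the Power BI
-- report when it imports data may differ.
select
    [Borough],
    count_big(*)                as total_collisions,
    sum([IsFatalCollision])     as fatal_collisions,
    sum([NumberPersonsInjured]) as persons_injured
from [NYC].[NYPD_MotorVehicleCollisions]
group by [Borough]
order by total_collisions desc;
```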

IMPORTANT
Execute these steps inside the ADPDesktop remote desktop connection
  1. On ADPDesktop, download the Power BI report from the link https://aka.ms/ADPLab1 and save it on the Desktop.
  2. Open the file ADPLab1.pbit with Power BI Desktop. Optionally sign up for the Power BI tips and tricks email; otherwise, click the option to sign in with an existing account and then press the Escape key to dismiss the dialog.
  3. When prompted to enter the value of the SynapseSQLEnpoint parameter, type the full server name: synapsesql-suffix.database.windows.net

  4. Click Load, and then Run to acknowledge the Native Database Query message.
  5. When prompted, enter the Database credentials:
    - User Name: adpadmin
    - Password: P@ssw0rd123!

  6. Once the data is finished loading, interact with the report by changing the CollisionDate slicer and by clicking on the other visualisations.

  7. Save your work and close Power BI Desktop.