title	titleSuffix	description	author	ms.author	ms.reviewer	ms.service	ms.subservice	ms.topic	ms.date
Apache Spark - Environment Configuration	Azure Machine Learning	Learn how to configure your Apache Spark environment for interactive data wrangling	ynpandey	franksolomon	franksolomon	machine-learning	mldata	how-to	03/06/2023

Quickstart: Interactive Data Wrangling with Apache Spark in Azure Machine Learning (preview)

[!INCLUDE preview disclaimer]

The Azure Machine Learning integration with Azure Synapse Analytics (preview) provides easy access to the Apache Spark framework, which enables interactive data wrangling in Azure Machine Learning Notebooks.

In this quickstart guide, you learn how to perform interactive data wrangling using Azure Machine Learning Managed (Automatic) Synapse Spark compute, an Azure Data Lake Storage (ADLS) Gen 2 storage account, and user identity passthrough.

Prerequisites

An Azure subscription; if you don't have an Azure subscription, create a free account before you begin.
An Azure Machine Learning workspace. See Create workspace resources.
An Azure Data Lake Storage (ADLS) Gen 2 storage account. See Create an Azure Data Lake Storage (ADLS) Gen 2 storage account.
To enable this feature:
1. Navigate to the Azure Machine Learning studio UI.
2. In the icon section at the top right of the screen, select Manage preview features (megaphone icon).
3. In the Managed preview feature panel, toggle the Run notebooks and jobs on managed Spark feature to on.

:::image type="content" source="./media/apache-spark-environment-configuration/how-to-enable-managed-spark-preview.png" lightbox="media/apache-spark-environment-configuration/how-to-enable-managed-spark-preview.png" alt-text="Screenshot showing the option to enable the Managed Spark preview.":::

Store Azure storage account credentials as secrets in Azure Key Vault

To store Azure storage account credentials as secrets in the Azure Key Vault using the Azure portal user interface:

1. Navigate to your Azure Key Vault in the Azure portal.
2. Select Secrets from the left panel.
3. Select + Generate/Import.

:::image type="content" source="media/apache-spark-environment-configuration/azure-key-vault-secrets-generate-import.png" alt-text="Screenshot showing the Azure Key Vault Secrets Generate Or Import tab.":::
4. At the Create a secret screen, enter a Name for the secret you want to create.
5. Navigate to the Azure Blob Storage Account in the Azure portal, as seen in this image:

:::image type="content" source="media/apache-spark-environment-configuration/storage-account-access-keys.png" alt-text="Screenshot showing the Azure access key and connection string values screen.":::
6. Select Access keys from the left panel of the Azure Blob Storage Account page.
7. Select Show next to Key 1, and then Copy to clipboard to get the storage account access key.

[!Note] When you create Azure Key Vault secrets for other credential types, select the appropriate options on their respective user interfaces to copy:
- Azure Blob storage container shared access signature (SAS) tokens
- Azure Data Lake Storage (ADLS) Gen 2 storage account service principal credentials:
  - tenant ID
  - client ID
  - secret

8. Navigate back to the Create a secret screen.
9. In the Secret value textbox, enter the access key credential for the Azure storage account, which you copied to the clipboard in the earlier step.
10. Select Create.

:::image type="content" source="media/apache-spark-environment-configuration/create-a-secret.png" alt-text="Screenshot showing the Azure secret creation screen.":::

Tip

Azure CLI and Azure Key Vault secret client library for Python can also create Azure Key Vault secrets.
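As a rough illustration of the Python route mentioned in the tip, the following sketch stores a storage credential with the `SecretClient` class from the `azure-keyvault-secrets` package, authenticated through `DefaultAzureCredential` from `azure-identity`. The helper names (`vault_url`, `build_secret_client`, `store_storage_credential`), the public-cloud vault URL pattern, and the local secret name check (letters, digits, and dashes, 1-127 characters) are assumptions made for this example and don't come from the article.

```python
import re

# Assumed naming rule for this sketch: 1-127 letters, digits, or dashes.
_SECRET_NAME_PATTERN = re.compile(r"[0-9a-zA-Z-]{1,127}")


def vault_url(vault_name: str) -> str:
    """Build the vault URL, assuming the Azure public cloud DNS suffix."""
    return f"https://{vault_name}.vault.azure.net"


def build_secret_client(vault_name: str):
    """Create a SecretClient (needs azure-identity and azure-keyvault-secrets)."""
    # Imported lazily so the other helpers work without the Azure packages.
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    return SecretClient(
        vault_url=vault_url(vault_name),
        credential=DefaultAzureCredential(),
    )


def store_storage_credential(client, secret_name: str, credential_value: str) -> None:
    """Store a storage credential, such as an access key, as a secret."""
    if not _SECRET_NAME_PATTERN.fullmatch(secret_name):
        raise ValueError(f"Invalid Key Vault secret name: {secret_name!r}")
    if not credential_value:
        raise ValueError("The credential value must not be empty.")
    # set_secret creates the secret, or adds a new version if it already exists.
    client.set_secret(secret_name, credential_value)


# Hypothetical usage; replace the placeholders with your own values:
# client = build_secret_client("<KEY_VAULT_NAME>")
# store_storage_credential(client, "<SECRET_NAME>", "<STORAGE_ACCESS_KEY>")
```

Passing the client in as a parameter keeps the credential handling separate from authentication, so the same function works with any object that exposes a `set_secret(name, value)` method.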

Add role assignments in Azure storage accounts

Before starting interactive data wrangling, ensure that the input and output data paths are accessible. First, assign the Reader and Storage Blob Data Reader roles to one of the following:

- the user identity of the logged-in user of the Notebooks session
- a service principal

The Reader and Storage Blob Data Reader roles provide read-only access to the user identity or service principal. In certain scenarios, however, you might want to write the wrangled data back to the Azure storage account. To enable read and write access, assign the Contributor and Storage Blob Data Contributor roles to the user identity or service principal. To assign appropriate roles to the user identity:

1. Open the Microsoft Azure portal.
2. Search and select the Storage accounts service.

:::image type="content" source="media/apache-spark-environment-configuration/find-storage-accounts-service.png" lightbox="media/apache-spark-environment-configuration/find-storage-accounts-service.png" alt-text="Expandable screenshot showing Storage accounts service search and selection, in Microsoft Azure portal.":::
3. On the Storage accounts page, select the Azure Data Lake Storage (ADLS) Gen 2 storage account from the list. A page showing the storage account Overview will open.

:::image type="content" source="media/apache-spark-environment-configuration/storage-accounts-list.png" lightbox="media/apache-spark-environment-configuration/storage-accounts-list.png" alt-text="Expandable screenshot showing selection of the Azure Data Lake Storage (ADLS) Gen 2 storage account Storage account.":::
4. Select Access Control (IAM) from the left panel.
5. Select Add role assignment.

:::image type="content" source="media/apache-spark-environment-configuration/storage-account-add-role-assignment.png" lightbox="media/apache-spark-environment-configuration/storage-account-add-role-assignment.png" alt-text="Screenshot showing the Azure access keys screen.":::
6. Find and select role Storage Blob Data Contributor.
7. Select Next.

:::image type="content" source="media/apache-spark-environment-configuration/add-role-assignment-choose-role.png" lightbox="media/apache-spark-environment-configuration/add-role-assignment-choose-role.png" alt-text="Screenshot showing the Azure add role assignment screen.":::
8. Select User, group, or service principal.
9. Select + Select members.
10. Search for the user identity below Select.
11. Select the user identity from the list, so that it shows under Selected members.
12. Select the appropriate user identity.
13. Select Next.

:::image type="content" source="media/apache-spark-environment-configuration/add-role-assignment-choose-members.png" lightbox="media/apache-spark-environment-configuration/add-role-assignment-choose-members.png" alt-text="Screenshot showing the Azure add role assignment screen Members tab.":::
14. Select Review + Assign.

:::image type="content" source="media/apache-spark-environment-configuration/add-role-assignment-review-and-assign.png" lightbox="media/apache-spark-environment-configuration/add-role-assignment-review-and-assign.png" alt-text="Screenshot showing the Azure add role assignment screen review and assign tab.":::
15. Repeat steps 2-13 for the Contributor role assignment.

Once the user identity has the appropriate roles assigned, data in the Azure storage account should become accessible.
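The same role assignments can also be scripted instead of made through the portal. The sketch below is one possible approach, not a procedure from this article: it builds the argument lists for the Azure CLI command `az role assignment create`, one per role named in this section. The function names, the `read` / `read_write` labels, and the placeholder values are assumptions made for illustration; the scope string follows the standard Azure resource ID layout for a storage account.

```python
from typing import List

# Roles named in this section, grouped by the access level they provide.
ROLES_BY_ACCESS = {
    "read": ("Reader", "Storage Blob Data Reader"),
    "read_write": ("Contributor", "Storage Blob Data Contributor"),
}


def storage_account_scope(subscription_id: str, resource_group: str, account_name: str) -> str:
    """Build the storage account resource ID used as the role assignment scope."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account_name}"
    )


def role_assignment_commands(assignee: str, scope: str, access: str = "read") -> List[List[str]]:
    """Return one `az role assignment create` argument list per required role."""
    if access not in ROLES_BY_ACCESS:
        raise ValueError(f"Unknown access level: {access!r}")
    return [
        ["az", "role", "assignment", "create",
         "--assignee", assignee, "--role", role, "--scope", scope]
        for role in ROLES_BY_ACCESS[access]
    ]


# Hypothetical usage; replace the placeholders with your own values:
# scope = storage_account_scope("<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<STORAGE_ACCOUNT>")
# for command in role_assignment_commands("<USER_OR_SERVICE_PRINCIPAL_ID>", scope, "read_write"):
#     subprocess.run(command, check=True)  # requires `import subprocess` and a signed-in Azure CLI
```

Returning argument lists rather than a single shell string avoids quoting problems with role names that contain spaces, such as Storage Blob Data Contributor.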

Note

If an attached Synapse Spark pool points to a Synapse Spark pool in an Azure Synapse workspace that has a managed virtual network associated with it, configure a managed private endpoint to the storage account to ensure data access.

Ensuring resource access for Spark jobs

Spark jobs can use either a managed identity or user identity passthrough to access data and other resources. The following table summarizes the different mechanisms for resource access while using Azure Machine Learning Managed (Automatic) Spark compute and attached Synapse Spark pool.

Spark pool	Supported identities	Default identity
Managed (Automatic) Spark compute	User identity and managed identity	User identity
Attached Synapse Spark pool	User identity and managed identity	Managed identity - compute identity of the attached Synapse Spark pool

If the CLI or SDK code defines an option to use managed identity, Azure Machine Learning Managed (Automatic) Spark compute relies on a user-assigned managed identity attached to the workspace. You can attach a user-assigned managed identity to an existing Azure Machine Learning workspace using Azure Machine Learning CLI v2, or with ARMClient.
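To make the table above easier to apply in code, here is a small sketch that encodes it as a lookup and resolves which identity a Spark job would use. The key names (`managed_spark_compute`, `attached_synapse_spark_pool`, `user_identity`, `managed_identity`) and the `resolve_identity` helper are hypothetical labels chosen for this example; they aren't values from the Azure Machine Learning CLI or SDK.

```python
from typing import Optional

# Supported and default identities per Spark compute type, transcribed from the table.
SPARK_IDENTITIES = {
    "managed_spark_compute": {
        "supported": ("user_identity", "managed_identity"),
        "default": "user_identity",
    },
    "attached_synapse_spark_pool": {
        "supported": ("user_identity", "managed_identity"),
        # The compute identity of the attached Synapse Spark pool.
        "default": "managed_identity",
    },
}


def resolve_identity(compute_type: str, requested: Optional[str] = None) -> str:
    """Return the requested identity if supported, else the compute type's default."""
    try:
        entry = SPARK_IDENTITIES[compute_type]
    except KeyError:
        raise ValueError(f"Unknown Spark compute type: {compute_type!r}") from None
    if requested is None:
        return entry["default"]
    if requested not in entry["supported"]:
        raise ValueError(f"{requested!r} is not supported on {compute_type!r}")
    return requested
```

For example, `resolve_identity("managed_spark_compute")` returns `"user_identity"`, while `resolve_identity("attached_synapse_spark_pool")` returns `"managed_identity"`.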

Next steps

Apache Spark in Azure Machine Learning (preview)
Attach and manage a Synapse Spark pool in Azure Machine Learning (preview)
Interactive Data Wrangling with Apache Spark in Azure Machine Learning (preview)
Submit Spark jobs in Azure Machine Learning (preview)
Code samples for Spark jobs using Azure Machine Learning CLI
Code samples for Spark jobs using Azure Machine Learning Python SDK
