---
title: Access data from Azure cloud storage during interactive development
titleSuffix: Azure Machine Learning
description: Access data from Azure cloud storage during interactive development
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: how-to
author: samuel100
ms.author: samkemp
ms.reviewer: franksolomon
ms.date: 11/17/2022
ms.custom: sdkv2
---

Access data from Azure cloud storage during interactive development

[!INCLUDE sdk v2]

Typically, a machine learning project begins with exploratory data analysis (EDA), data preprocessing (cleaning, feature engineering), and building prototypes of ML models to validate hypotheses. This prototyping phase is highly interactive and lends itself to development in a Jupyter notebook or an IDE with a Python interactive console. In this article you'll learn how to:

[!div class="checklist"]

  • Access data from an Azure Machine Learning datastore URI as if it were a file system.
  • Materialize data into Pandas using the mltable Python library.
  • Materialize Azure Machine Learning data assets into Pandas using the mltable Python library.
  • Materialize data through an explicit download with the azcopy utility.

Prerequisites

Tip

The guidance in this article to access data during interactive development applies to any host that can run a Python session - for example: your local machine, a cloud VM, a GitHub Codespace, etc. We recommend using an Azure Machine Learning compute instance - a fully managed and pre-configured cloud workstation. For more information, see Create and manage an Azure Machine Learning compute instance.

Important

Ensure you have the latest azureml-fsspec and mltable Python libraries installed in your Python environment:

pip install -U azureml-fsspec mltable

Access data from a datastore URI, like a filesystem (preview)

[!INCLUDE preview disclaimer]

An Azure Machine Learning datastore is a reference to an existing storage account on Azure. The benefits of creating and using a datastore include:

[!div class="checklist"]

  • A common and easy-to-use API to interact with different storage types (Blob/Files/ADLS).
  • Easier to discover useful datastores when working as a team.
  • Supports both credential-based (for example, SAS token) and identity-based (Azure Active Directory or managed identity) access to data.
  • When using credential-based access, the connection information is secured so you don't expose keys in scripts.
  • Browse data and copy-paste datastore URIs in the Studio UI.

A Datastore URI is a Uniform Resource Identifier, which is a reference to a storage location (path) on your Azure storage account. The format of the datastore URI is:

# Azure Machine Learning workspace details:
subscription = '<subscription_id>'
resource_group = '<resource_group>'
workspace = '<workspace>'
datastore_name = '<datastore>'
path_on_datastore = '<path>'

# long-form Datastore uri format:
uri = f'azureml://subscriptions/{subscription}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/{datastore_name}/paths/{path_on_datastore}'

These Datastore URIs are a known implementation of Filesystem spec (fsspec): A unified pythonic interface to local, remote and embedded file systems and bytes storage.

The Azure Machine Learning Datastore implementation of fsspec automatically handles credential/identity passthrough used by the Azure Machine Learning datastore. This means you don't need to expose account keys in your scripts or do additional sign-in procedures on a compute instance.

For example, you can directly use Datastore URIs in Pandas - below is an example of reading a CSV file:

import pandas as pd

df = pd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")
df.head()
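
Other Pandas readers that accept fsspec-style URLs work the same way. For example, a minimal sketch for reading a parquet file (the parquet file path is a placeholder):

import pandas as pd

df = pd.read_parquet("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.parquet")
df.head()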

Tip

Rather than remember the datastore URI format, you can copy-and-paste the datastore URI from the Studio UI by following these steps:

  1. Select Data from the left-hand menu followed by the Datastores tab.
  2. Select your datastore name and then Browse.
  3. Find the file/folder you want to read into pandas, select the ellipsis (...) next to it, and then select Copy URI from the menu. You can select the Datastore URI to copy into your notebook/script. :::image type="content" source="media/how-to-access-data-ci/datastore_uri_copy.png" alt-text="Screenshot highlighting the copy of the datastore URI.":::

You can also instantiate an Azure Machine Learning filesystem and run filesystem-like commands such as ls, glob, exists, and open. The open() method returns a file-like object, which can be passed to any other library that expects to work with Python files, or used by your own code as you would a normal Python file object. These file-like objects respect the use of with contexts, for example:

from azureml.fsspec import AzureMachineLearningFileSystem

# instantiate file system using datastore URI
fs = AzureMachineLearningFileSystem('azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>')

# list files in the path
fs.ls()
# output example:
# /datastore_name/folder/file1.csv
# /datastore_name/folder/file2.csv

# use an open context
with fs.open('/datastore_name/folder/file1.csv') as f:
    # do some process
    process_file(f)
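
The ls and open calls above can be combined with glob and exists. A minimal sketch, reusing the same placeholder filesystem and file names (exact path forms depend on your datastore layout):

# match only the CSV files under the folder
csv_paths = fs.glob('/datastore_name/folder/*.csv')

# check that a file exists before opening it
if fs.exists('/datastore_name/folder/file1.csv'):
    with fs.open('/datastore_name/folder/file1.csv') as f:
        process_file(f)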

Examples

In this section, we provide examples of how to use Filesystem spec for some common scenarios.

Read a single CSV file into pandas

If you have a single CSV file, then as outlined above you can read that into pandas with:

import pandas as pd

df = pd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")

Read a folder of CSV files into pandas

The Pandas read_csv() method doesn't support reading a folder of CSV files. You need to glob the CSV paths and concatenate them into a single data frame with the Pandas concat() method. The following code demonstrates how to achieve this concatenation with the Azure Machine Learning filesystem:

import pandas as pd
from azureml.fsspec import AzureMachineLearningFileSystem

# define the URI - update <> placeholders
uri = 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/*.csv'

# create the filesystem
fs = AzureMachineLearningFileSystem(uri)

# append csv files in folder to a list
dflist = []
for path in fs.ls():
    with fs.open(path) as f:
        dflist.append(pd.read_csv(f))

# concatenate data frames
df = pd.concat(dflist)
df.head()

Reading CSV files into Dask

Below is an example of reading a CSV file into a Dask data frame:

import dask.dataframe as dd

df = dd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")
df.head()
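
Dask can also read multiple files lazily in one call. As a sketch (assuming the folder contains only CSV files with matching schemas), you can pass a glob pattern:

import dask.dataframe as dd

# read every CSV file in the folder into a single lazy Dask data frame
df = dd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/*.csv")
df.head()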

Read a folder of parquet files into pandas

Parquet files are typically written to a folder as part of an ETL process, which can also emit auxiliary files such as progress and commit files. Below is an example of the files created by an ETL process (files beginning with _) alongside the parquet data files.

:::image type="content" source="media/how-to-access-data-ci/parquet-auxillary.png" alt-text="Screenshot showing the parquet etl process.":::

In these scenarios, you'll only want to read the parquet files in the folder and ignore the ETL process files. The code below shows how you can use glob patterns to read only parquet files in a folder:

import pandas as pd
from azureml.fsspec import AzureMachineLearningFileSystem

# define the URI - update <> placeholders
uri = 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/*.parquet'

# create the filesystem
fs = AzureMachineLearningFileSystem(uri)

# append parquet files in folder to a list
dflist = []
for path in fs.ls():
    with fs.open(path) as f:
        dflist.append(pd.read_parquet(f))

# concatenate data frames
df = pd.concat(dflist)
df.head()

Accessing data from your Azure Databricks filesystem (dbfs)

Filesystem spec (fsspec) has a range of known implementations, one of which is the Databricks Filesystem (dbfs).

To access data from dbfs, you need:

  • The instance name, which is in the form of adb-<some-number>.<two digits>.azuredatabricks.net. You can find this in the URL of your Azure Databricks workspace.
  • A Personal Access Token (PAT). For more information on creating a PAT, see Authentication using Azure Databricks personal access tokens.

Once you have these, create an environment variable on your compute instance for the PAT:

export ADB_PAT=<pat_token>

You can then access data in Pandas using:

import os
import pandas as pd

pat = os.getenv('ADB_PAT')
path_on_dbfs = '<absolute_path_on_dbfs>' # e.g. /folder/subfolder/file.csv

storage_options = {
    'instance':'adb-<some-number>.<two digits>.azuredatabricks.net', 
    'token': pat
}

df = pd.read_csv(f'dbfs://{path_on_dbfs}', storage_options=storage_options)
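
Because dbfs is a registered fsspec implementation, you can also browse it directly with fsspec. A minimal sketch, reusing the same instance and token (the folder path is a placeholder):

import os
import fsspec

# instantiate the Databricks filesystem registered with fsspec
fs = fsspec.filesystem(
    'dbfs',
    instance='adb-<some-number>.<two digits>.azuredatabricks.net',
    token=os.getenv('ADB_PAT')
)

# list the contents of a dbfs folder
fs.ls('<absolute_folder_path_on_dbfs>')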

Reading images with pillow

from PIL import Image
from azureml.fsspec import AzureMachineLearningFileSystem

# define the URI - update <> placeholders
uri = 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<image.jpeg>'

# create the filesystem
fs = AzureMachineLearningFileSystem(uri)

with fs.open() as f:
    img = Image.open(f)
    img.show()

PyTorch custom dataset example

In this example, you create a PyTorch custom dataset for processing images. The assumption is that an annotations file (in CSV format) exists that looks like:

image_path, label
0/image0.png, label0
0/image1.png, label0
1/image2.png, label1
1/image3.png, label1
2/image4.png, label2
2/image5.png, label2

The images are stored in subfolders according to their label:

/
└── 📁images
    ├── 📁0
    │   ├── 📷image0.png
    │   └── 📷image1.png
    ├── 📁1
    │   ├── 📷image2.png
    │   └── 📷image3.png
    └── 📁2
        ├── 📷image4.png
        └── 📷image5.png

A custom Dataset class in PyTorch must implement three functions: __init__, __len__, and __getitem__, which are implemented below:

import os
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class CustomImageDataset(Dataset):
    def __init__(self, filesystem, annotations_file, img_dir, transform=None, target_transform=None):
        self.fs = filesystem
        f = filesystem.open(annotations_file)
        self.img_labels = pd.read_csv(f)
        f.close()
        self.img_dir = img_dir
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        f = self.fs.open(img_path)
        image = Image.open(f)
        f.close()
        label = self.img_labels.iloc[idx, 1]
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        return image, label

You can then instantiate the dataset using:

from azureml.fsspec import AzureMachineLearningFileSystem
from torch.utils.data import DataLoader

# define the URI - update <> placeholders
uri = 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/'

# create the filesystem
fs = AzureMachineLearningFileSystem(uri)

# create the dataset
training_data = CustomImageDataset(
    filesystem=fs,
    annotations_file='<datastore_name>/<path>/annotations.csv', 
    img_dir='<datastore_name>/<path_to_images>/'
)

# Preparing your data for training with DataLoaders
train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
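
To confirm the wiring, you can pull a single batch from the DataLoader. This sketch assumes you pass a transform such as torchvision.transforms.ToTensor() when constructing the dataset (and that all images share the same dimensions) so the default collate function can stack them:

from torchvision import transforms

# rebuild the dataset with a tensor transform so batches can be collated
training_data = CustomImageDataset(
    filesystem=fs,
    annotations_file='<datastore_name>/<path>/annotations.csv',
    img_dir='<datastore_name>/<path_to_images>/',
    transform=transforms.ToTensor()
)
train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)

# fetch one batch of images and labels
images, labels = next(iter(train_dataloader))
print(images.shape)  # for example, torch.Size([64, 3, H, W])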

Materialize data into Pandas using mltable library

Another method for accessing data in cloud storage is to use the mltable library. The general format for reading data into pandas using mltable is:

import mltable

# define a path or folder or pattern
path = {
    'file': '<supported_path>'
    # alternatives
    # 'folder': '<supported_path>'
    # 'pattern': '<supported_path>'
}

# create an mltable from paths
tbl = mltable.from_delimited_files(paths=[path])
# alternatives
# tbl = mltable.from_parquet_files(paths=[path])
# tbl = mltable.from_json_lines_files(paths=[path])
# tbl = mltable.from_delta_lake(paths=[path])

# materialize to pandas
df = tbl.to_pandas_dataframe()
df.head()

Supported paths

The mltable library supports reading tabular data from the following path types:

| Location | Examples |
|----------|----------|
| A path on your local computer | ./home/username/data/my_data |
| A path on a public http(s) server | https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv |
| A path on Azure Storage | wasbs://<container_name>@<account_name>.blob.core.windows.net/<path> <br> abfss://<file_system>@<account_name>.dfs.core.windows.net/<path> |
| A long-form Azure Machine Learning datastore | azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<wsname>/datastores/<name>/paths/<path> |

Note

mltable does user credential passthrough for paths on Azure Storage and Azure Machine Learning datastores. If you don't have permission to read the data in the underlying storage, you won't be able to access it.

Files, folders and globs

mltable supports reading from:

  • file(s), for example: abfss://<file_system>@<account_name>.dfs.core.windows.net/my-csv.csv
  • folder(s), for example abfss://<file_system>@<account_name>.dfs.core.windows.net/my-folder/
  • glob pattern(s), for example abfss://<file_system>@<account_name>.dfs.core.windows.net/my-folder/*.csv
  • a combination of files, folders, and globbing patterns

The flexibility of mltable allows you to materialize data into a single dataframe from a combination of local/cloud storage and combinations of files/folder/globs. For example:

import mltable

path1 = {
    'file': 'abfss://<filesystem>@<account>.dfs.core.windows.net/my-csv.csv'
}

path2 = {
    'folder': './home/username/data/my_data'
}

path3 = {
    'pattern': 'abfss://<filesystem>@<account>.dfs.core.windows.net/<folder>/*.csv'
}

tbl = mltable.from_delimited_files(paths=[path1, path2, path3])

Supported file formats

mltable supports the following file formats:

  • Delimited Text (for example: CSV files): mltable.from_delimited_files(paths=[path])
  • Parquet: mltable.from_parquet_files(paths=[path])
  • Delta: mltable.from_delta_lake(paths=[path])
  • JSON lines format: mltable.from_json_lines_files(paths=[path])
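
Delta isn't shown in the examples that follow. A minimal sketch, using the from_delta_lake form listed above (the folder path is a placeholder and is assumed to contain a Delta table):

import mltable

path = {
    'folder': 'abfss://<filesystem>@<account>.dfs.core.windows.net/<delta_table_folder>'
}

tbl = mltable.from_delta_lake(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()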

Examples

Read a CSV file

Update the placeholders (<>) in the code snippet with your details.

import mltable

path = {
    'file': 'abfss://<filesystem>@<account>.dfs.core.windows.net/<folder>/<file_name>.csv'
}

tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Update the placeholders (<>) in the code snippet with your details.

import mltable

path = {
    'file': 'wasbs://<container>@<account>.blob.core.windows.net/<folder>/<file_name>.csv'
}

tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Update the placeholders (<>) in the code snippet with your details.

import mltable

path = {
    'file': 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<wsname>/datastores/<name>/paths/<folder>/<file>.csv'
}

tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Tip

Rather than remember the datastore URI format, you can copy-and-paste the datastore URI from the Studio UI by following these steps:

  1. Select Data from the left-hand menu followed by the Datastores tab.
  2. Select your datastore name and then Browse.
  3. Find the file/folder you want to read into pandas, select the ellipsis (...) next to it, and then select Copy URI from the menu. You can select the Datastore URI to copy into your notebook/script. :::image type="content" source="media/how-to-access-data-ci/datastore_uri_copy.png" alt-text="Screenshot highlighting the copy of the datastore URI.":::
import mltable

path = {
    'file': 'https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv'
}

tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Read parquet files in a folder

The example code below shows how mltable can use glob patterns - such as wildcards - to ensure only the parquet files are read.

Update the placeholders (<>) in the code snippet with your details.

import mltable

path = {
    'pattern': 'abfss://<filesystem>@<account>.dfs.core.windows.net/<folder>/*.parquet'
}

tbl = mltable.from_parquet_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Update the placeholders (<>) in the code snippet with your details.

import mltable

path = {
    'pattern': 'wasbs://<container>@<account>.blob.core.windows.net/<folder>/*.parquet'
}

tbl = mltable.from_parquet_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Update the placeholders (<>) in the code snippet with your details.

import mltable

path = {
    'pattern': 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<wsname>/datastores/<name>/paths/<folder>/*.parquet'
}

tbl = mltable.from_parquet_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Tip

Rather than remember the datastore URI format, you can copy-and-paste the datastore URI from the Studio UI by following these steps:

  1. Select Data from the left-hand menu followed by the Datastores tab.
  2. Select your datastore name and then Browse.
  3. Find the file/folder you want to read into pandas, select the ellipsis (...) next to it, and then select Copy URI from the menu. You can select the Datastore URI to copy into your notebook/script. :::image type="content" source="media/how-to-access-data-ci/datastore_uri_copy.png" alt-text="Screenshot highlighting the copy of the datastore URI.":::

Update the placeholders (<>) in the code snippet with your details.

Important

To glob the pattern on a public HTTP server, there must be access at the folder level.

import mltable

path = {
    'pattern': '<https_address>/<folder>/*.parquet'
}

tbl = mltable.from_parquet_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Reading data assets

In this section, you'll learn how to read your Azure Machine Learning data assets into pandas.

Table asset

If you've previously created a Table asset in Azure Machine Learning (an mltable, or a V1 TabularDataset), you can load that into pandas using:

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="<name_of_asset>", version="<version>")

tbl = mltable.load(f'azureml:/{data_asset.id}')
df = tbl.to_pandas_dataframe()
df.head()
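
If you don't want to pin a specific version, you may be able to resolve the most recent one by label instead. A sketch, assuming the label parameter of MLClient.data.get in the v2 SDK:

# resolve the latest registered version of the asset
data_asset = ml_client.data.get(name="<name_of_asset>", label="latest")

tbl = mltable.load(f'azureml:/{data_asset.id}')
df = tbl.to_pandas_dataframe()
df.head()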

File asset

If you've registered a File asset that you want to read into a Pandas data frame - for example, a CSV file - you can achieve this using:

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="<name_of_asset>", version="<version>")

path = {
    'file': data_asset.path
}

tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Folder asset

If you've registered a Folder asset (uri_folder or a V1 FileDataset) that you want to read into a Pandas data frame - for example, a folder containing CSV files - you can achieve this using:

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="<name_of_asset>", version="<version>")

path = {
    'folder': data_asset.path
}

tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

A note on reading and processing large data volumes with Pandas

Tip

Pandas is not designed to handle large datasets - you will only be able to process data that can fit into the memory of the compute instance.

For large datasets we recommend that you use Azure Machine Learning managed Spark, which provides the PySpark Pandas API.
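
A rough sketch of the Pandas API on Spark (this assumes you're in an Azure Machine Learning managed Spark session that has access to the storage; the path is a placeholder):

import pyspark.pandas as ps

# reads lazily and distributes the work across the Spark cluster
df = ps.read_csv('abfss://<filesystem>@<account>.dfs.core.windows.net/<folder>/<file_name>.csv')
df.head()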

You may wish to iterate quickly on a smaller subset of a large dataset before scaling up to a remote asynchronous job. mltable provides in-built functionality to get samples of large data using the take_random_sample method:

import mltable

path = {
    'file': 'https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv'
}

tbl = mltable.from_delimited_files(paths=[path])
# take a random 30% sample of the data
tbl = tbl.take_random_sample(probability=.3)
df = tbl.to_pandas_dataframe()
df.head()

You can also take subsets of large data by chaining other mltable transformations - such as take, skip, keep_columns, and drop_columns - before materializing to pandas, as sketched below.
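
A minimal sketch, reusing the titanic URL from above (the column names assume the standard titanic schema):

import mltable

path = {
    'file': 'https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv'
}

tbl = mltable.from_delimited_files(paths=[path])
# keep two columns and take only the first 1,000 rows
tbl = tbl.keep_columns(['Name', 'Survived'])
tbl = tbl.take(1000)
df = tbl.to_pandas_dataframe()
df.head()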

Downloading data using the azcopy utility

You may want to download data to the local SSD of your host (local machine, cloud VM, Azure Machine Learning compute instance) and use the local filesystem. You can do this with the azcopy utility, which is pre-installed on an Azure Machine Learning compute instance. If you aren't using an Azure Machine Learning compute instance or a Data Science Virtual Machine (DSVM), you may need to install azcopy. For more information, see azcopy.

Caution

We don't recommend downloading data to the /home/azureuser/cloudfiles/code location on a compute instance. This location is designed to store notebook and code artifacts, not data, and reading data from it incurs significant performance overhead during training. Instead, we recommend storing your data in /home/azureuser, which is on the local SSD of the compute node.

Open a terminal and create a new directory, for example:

mkdir /home/azureuser/data

Sign in to azcopy using:

azcopy login

Next, copy the data using a storage URI:

SOURCE=https://<account_name>.blob.core.windows.net/<container>/<path>
DEST=/home/azureuser/data
azcopy cp $SOURCE $DEST
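
To copy an entire folder rather than a single blob, azcopy supports recursive copies. A sketch, reusing the same SOURCE and DEST variables:

azcopy cp "$SOURCE" "$DEST" --recursive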

Next steps