title | titleSuffix | description | services | ms.service | ms.subservice | ms.topic | author | ms.author | ms.reviewer | ms.date | ms.custom |
---|---|---|---|---|---|---|---|---|---|---|---|
Access data from Azure cloud storage during interactive development |
Azure Machine Learning |
Access data from Azure cloud storage during interactive development |
machine-learning |
machine-learning |
core |
how-to |
samuel100 |
samkemp |
franksolomon |
11/17/2022 |
sdkv2 |
[!INCLUDE sdk v2]
Typically, the beginning of a machine learning project involves exploratory data analysis (EDA), data preprocessing (cleaning, feature engineering), and building prototypes of ML models to validate hypotheses. This prototyping phase is highly interactive, which lends itself to development in a Jupyter notebook or an IDE with a Python interactive console. In this article, you'll learn how to:
[!div class="checklist"]
- Access data from an Azure Machine Learning datastore URI as if it were a file system.
- Materialize data into Pandas using the `mltable` Python library.
- Materialize Azure Machine Learning data assets into Pandas using the `mltable` Python library.
- Materialize data through an explicit download with the `azcopy` utility.
- An Azure Machine Learning workspace. For more information, see Manage Azure Machine Learning workspaces in the portal or with the Python SDK (v2).
- An Azure Machine Learning Datastore. For more information, see Create datastores.
Tip
The guidance in this article to access data during interactive development applies to any host that can run a Python session - for example: your local machine, a cloud VM, a GitHub Codespace, etc. We recommend using an Azure Machine Learning compute instance - a fully managed and pre-configured cloud workstation. For more information, see Create and manage an Azure Machine Learning compute instance.
Important
Ensure you have the latest `azureml-fsspec` and `mltable` Python libraries installed in your Python environment:
pip install -U azureml-fsspec mltable
[!INCLUDE preview disclaimer]
An Azure Machine Learning datastore is a reference to an existing storage account on Azure. The benefits of creating and using a datastore include:
[!div class="checklist"]
- A common and easy-to-use API to interact with different storage types (Blob/Files/ADLS).
- Easier to discover useful datastores when working as a team.
- Supports both credential-based (for example, SAS token) and identity-based (Azure Active Directory or managed identity) access to data.
- When using credential-based access, the connection information is secured so you don't expose keys in scripts.
- Browse data and copy-paste datastore URIs in the Studio UI.
A Datastore URI is a Uniform Resource Identifier, which is a reference to a storage location (path) on your Azure storage account. The format of the datastore URI is:
# Azure Machine Learning workspace details:
subscription = '<subscription_id>'
resource_group = '<resource_group>'
workspace = '<workspace>'
datastore_name = '<datastore>'
path_on_datastore = '<path>'
# long-form Datastore uri format:
uri = f'azureml://subscriptions/{subscription}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/{datastore_name}/paths/{path_on_datastore}'
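The URI format above can be wrapped in a small helper so the components are filled in one place. This is a minimal sketch: `build_datastore_uri` is a hypothetical name (not part of any Azure SDK), the sample values are made up, and stripping a leading slash from the path is an assumption about how you want the path normalized.

```python
def build_datastore_uri(subscription, resource_group, workspace,
                        datastore_name, path_on_datastore):
    """Assemble a long-form datastore URI from its components (hypothetical helper)."""
    # Strip a leading slash so the result doesn't contain 'paths//...'
    path = path_on_datastore.lstrip('/')
    return (
        f'azureml://subscriptions/{subscription}'
        f'/resourcegroups/{resource_group}'
        f'/workspaces/{workspace}'
        f'/datastores/{datastore_name}'
        f'/paths/{path}'
    )


# Made-up values for illustration only
uri = build_datastore_uri('my-sub-id', 'my-rg', 'my-ws', 'my_datastore', '/folder/file.csv')
print(uri)
```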
These Datastore URIs are a known implementation of Filesystem spec (`fsspec`): a unified Pythonic interface to local, remote, and embedded file systems and bytes storage.
The Azure Machine Learning Datastore implementation of `fsspec` automatically handles the credential/identity passthrough used by the Azure Machine Learning datastore. This means you don't need to expose account keys in your scripts or do additional sign-in procedures on a compute instance.
For example, you can directly use Datastore URIs in Pandas - below is an example of reading a CSV file:
import pandas as pd
df = pd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")
df.head()
Tip
Rather than remember the datastore URI format, you can copy-and-paste the datastore URI from the Studio UI by following these steps:
- Select Data from the left-hand menu followed by the Datastores tab.
- Select your datastore name and then Browse.
- Find the file/folder you want to read into pandas and select the ellipsis (...) next to it. Select Copy URI from the menu. You can then select the Datastore URI to copy into your notebook/script. :::image type="content" source="media/how-to-access-data-ci/datastore_uri_copy.png" alt-text="Screenshot highlighting the copy of the datastore URI.":::
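After pasting a URI from the Studio UI, it can be useful to check that it matches the long-form format before passing it to other code. The sketch below is one way to do that with a regular expression; `parse_datastore_uri` is a hypothetical helper, and matching the segment names case-insensitively is an assumption rather than documented behavior.

```python
import re

# Pattern for the long-form datastore URI format described in this article.
# Segment names are matched case-insensitively (an assumption).
_DATASTORE_URI = re.compile(
    r'^azureml://subscriptions/(?P<subscription>[^/]+)'
    r'/resourcegroups/(?P<resource_group>[^/]+)'
    r'/workspaces/(?P<workspace>[^/]+)'
    r'/datastores/(?P<datastore>[^/]+)'
    r'/paths/(?P<path>.*)$',
    re.IGNORECASE,
)


def parse_datastore_uri(uri):
    """Split a long-form datastore URI into its components (hypothetical helper)."""
    match = _DATASTORE_URI.match(uri.strip())
    if match is None:
        raise ValueError(f'Not a long-form datastore URI: {uri!r}')
    return match.groupdict()


# Made-up values for illustration only
parts = parse_datastore_uri(
    'azureml://subscriptions/my-sub-id/resourcegroups/my-rg'
    '/workspaces/my-ws/datastores/my_datastore/paths/folder/file.csv'
)
print(parts['datastore'], parts['path'])
```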
You can also instantiate an Azure Machine Learning filesystem and run filesystem-like commands such as `ls`, `glob`, `exists`, and `open`. The `open()` method returns a file-like object, which can be passed to any other library that expects to work with Python files, or used by your own code as you would a normal Python file object. These file-like objects respect the use of `with` contexts, for example:
from azureml.fsspec import AzureMachineLearningFileSystem
# instantiate file system using datastore URI
fs = AzureMachineLearningFileSystem('azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>')
# list files in the path
fs.ls()
# output example:
# /datastore_name/folder/file1.csv
# /datastore_name/folder/file2.csv
# use an open context
with fs.open('/datastore_name/folder/file1.csv') as f:
    # do some process
    process_file(f)
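`process_file` in the snippet above is a placeholder. As a rough illustration of what it could look like, the sketch below reads a CSV header and counts the data rows from a binary file-like object. It uses `io.BytesIO` as a stand-in for the object returned by `fs.open()`, on the assumption that the returned object behaves like a file opened in binary mode.

```python
import csv
import io


def process_file(f):
    """Hypothetical processing step: return the CSV header and the number of data rows."""
    # Wrap the binary file-like object so the csv module can read text from it
    text = io.TextIOWrapper(f, encoding='utf-8', newline='')
    reader = csv.reader(text)
    header = next(reader)
    row_count = sum(1 for _ in reader)
    # Detach so the wrapper doesn't close the caller's file object
    text.detach()
    return header, row_count


# Stand-in for `with fs.open(...) as f:` so the sketch runs without Azure access
with io.BytesIO(b'id,value\n1,a\n2,b\n') as f:
    header, row_count = process_file(f)

print(header, row_count)
```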
In this section we provide some examples of how to use Filesystem spec, for some common scenarios.
If you have a single CSV file, then as outlined above you can read that into pandas with:
import pandas as pd
df = pd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")
The Pandas `read_csv()` method doesn't support reading a folder of CSV files. You need to glob the CSV paths and concatenate them into a data frame using the Pandas `concat()` method. The code below demonstrates how to achieve this concatenation with the Azure Machine Learning filesystem:
import pandas as pd
from azureml.fsspec import AzureMachineLearningFileSystem
# define the URI - update <> placeholders
uri = 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/*.csv'
# create the filesystem
fs = AzureMachineLearningFileSystem(uri)
# append csv files in folder to a list
dflist = []
for path in fs.ls():
    with fs.open(path) as f:
        dflist.append(pd.read_csv(f))
# concatenate data frames
df = pd.concat(dflist)
df.head()
Below is an example of reading a CSV file into a Dask data frame:
import dask.dataframe as dd
df = dd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")
df.head()
Parquet files are typically written to a folder as part of an ETL process, which can emit files pertaining to the ETL such as progress, commits, etc. Below is an example of files created from an ETL process (files beginning with `_`) to produce a parquet file of data.
:::image type="content" source="media/how-to-access-data-ci/parquet-auxillary.png" alt-text="Screenshot showing the parquet etl process.":::
In these scenarios, you'll only want to read the parquet files in the folder and ignore the ETL process files. The code below shows how you can use glob patterns to read only parquet files in a folder:
import pandas as pd
from azureml.fsspec import AzureMachineLearningFileSystem
# define the URI - update <> placeholders
uri = 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/*.parquet'
# create the filesystem
fs = AzureMachineLearningFileSystem(uri)
# append parquet files in folder to a list
dflist = []
for path in fs.ls():
    with fs.open(path) as f:
        dflist.append(pd.read_parquet(f))
# concatenate data frames
df = pd.concat(dflist)
df.head()
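If you already have a listing of paths (for example, from a filesystem `ls` call), the same filtering can be applied in plain Python. The sketch below keeps only `*.parquet` files and skips ETL bookkeeping files whose names begin with `_`. `select_parquet_files` and the file names in `listing` are hypothetical and used only for illustration.

```python
from fnmatch import fnmatchcase
import posixpath


def select_parquet_files(paths):
    """Keep *.parquet files and skip ETL files whose names start with '_' (hypothetical helper)."""
    selected = []
    for path in paths:
        name = posixpath.basename(path)
        # ETL process files such as progress/commit markers begin with '_'
        if name.startswith('_'):
            continue
        if fnmatchcase(name, '*.parquet'):
            selected.append(path)
    return selected


# Made-up listing for illustration only
listing = [
    '/datastore_name/folder/_SUCCESS',
    '/datastore_name/folder/_committed_123',
    '/datastore_name/folder/part-0000.parquet',
    '/datastore_name/folder/part-0001.parquet',
]
parquet_paths = select_parquet_files(listing)
print(parquet_paths)
```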
Filesystem spec (`fsspec`) has a range of known implementations, one of which is the Databricks Filesystem (`dbfs`).
To access data from `dbfs`, you need:
- Instance name, which is in the form of `adb-<some-number>.<two digits>.azuredatabricks.net`. You can glean this from the URL of your Azure Databricks workspace.
- Personal Access Token (PAT). For more information on creating a PAT, see Authentication using Azure Databricks personal access tokens.
Once you have these, create an environment variable on your compute instance for the PAT:
export ADB_PAT=<pat_token>
You can then access data in Pandas using:
import os
import pandas as pd
pat = os.getenv('ADB_PAT')
path_on_dbfs = '<absolute_path_on_dbfs>' # e.g. /folder/subfolder/file.csv
storage_options = {
'instance':'adb-<some-number>.<two digits>.azuredatabricks.net',
'token': pat
}
df = pd.read_csv(f'dbfs://{path_on_dbfs}', storage_options=storage_options)
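`os.getenv` returns `None` when the variable isn't set, which would only surface later as an authentication failure. A small guard can fail earlier with a clearer message. The sketch below is one possible way to build the `storage_options` dictionary; `dbfs_storage_options` is a hypothetical helper, and the `environ` parameter exists only so the example can run with a made-up token instead of a real one.

```python
import os


def dbfs_storage_options(instance, environ=None, env_var='ADB_PAT'):
    """Build storage options for dbfs, failing early if the PAT is missing (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    token = environ.get(env_var)
    if not token:
        raise RuntimeError(f'Environment variable {env_var} is not set')
    return {'instance': instance, 'token': token}


# Made-up token passed through a plain dict, for illustration only
options = dbfs_storage_options(
    'adb-<some-number>.<two digits>.azuredatabricks.net',
    environ={'ADB_PAT': 'example-token'},
)
print(sorted(options))
```

In a real session you would call `dbfs_storage_options(instance)` without the `environ` argument so the token is read from the environment variable created earlier.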
Below is an example of reading an image file with Pillow:
from PIL import Image
from azureml.fsspec import AzureMachineLearningFileSystem
# define the URI - update <> placeholders
uri = 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<image.jpeg>'
# create the filesystem
fs = AzureMachineLearningFileSystem(uri)
with fs.open() as f:
    img = Image.open(f)
    img.show()
In this example, you create a PyTorch custom dataset for processing images. The assumption is that an annotations file (in CSV format) exists that looks like:
image_path, label
0/image0.png, label0
0/image1.png, label0
1/image2.png, label1
1/image3.png, label1
2/image4.png, label2
2/image5.png, label2
The images are stored in subfolders according to their label:
/
└── 📁images
├── 📁0
│ ├── 📷image0.png
│ └── 📷image1.png
├── 📁1
│ ├── 📷image2.png
│ └── 📷image3.png
└── 📁2
├── 📷image4.png
└── 📷image5.png
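The annotations sample above has a space after each comma, which a CSV parser keeps as part of the value unless told otherwise. The sketch below parses that format with the standard library and joins each relative image path onto an image directory. `read_annotations` is a hypothetical helper, and it assumes the header row and spacing shown in the sample.

```python
import csv
import io
import posixpath

# Two rows copied from the sample annotations file above
ANNOTATIONS = """image_path, label
0/image0.png, label0
1/image2.png, label1
"""


def read_annotations(f, img_dir):
    """Return (full_image_path, label) pairs from an annotations file (hypothetical helper)."""
    # skipinitialspace drops the space that follows each comma in the sample
    reader = csv.DictReader(f, skipinitialspace=True)
    return [
        (posixpath.join(img_dir, row['image_path']), row['label'])
        for row in reader
    ]


samples = read_annotations(io.StringIO(ANNOTATIONS), 'images')
print(samples)
```

If you read the same file with pandas, `pd.read_csv(f, skipinitialspace=True)` has a similar effect on the leading spaces.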
A custom Dataset class in PyTorch must implement three functions: __init__
, __len__
, and __getitem__
, which are implemented below:
import os
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset
class CustomImageDataset(Dataset):
    def __init__(self, filesystem, annotations_file, img_dir, transform=None, target_transform=None):
        self.fs = filesystem
        # read the annotations file (image path, label) into a data frame
        with filesystem.open(annotations_file) as f:
            self.img_labels = pd.read_csv(f)
        self.img_dir = img_dir
        self.transform = transform
        self.target_transform = target_transform
    def __len__(self):
        return len(self.img_labels)
    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        with self.fs.open(img_path) as f:
            image = Image.open(f)
            # Pillow reads pixel data lazily, so load it before the file closes
            image.load()
        label = self.img_labels.iloc[idx, 1]
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        return image, label
You can then instantiate the dataset using:
from azureml.fsspec import AzureMachineLearningFileSystem
from torch.utils.data import DataLoader
# define the URI - update <> placeholders
uri = 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/'
# create the filesystem
fs = AzureMachineLearningFileSystem(uri)
# create the dataset
training_data = CustomImageDataset(
filesystem=fs,
annotations_file='<datastore_name>/<path>/annotations.csv',
img_dir='<datastore_name>/<path_to_images>/'
)
# Preparing your data for training with DataLoaders
train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
Another method for accessing data in cloud storage is to use the `mltable` library. The general format for reading data into pandas using `mltable` is:
import mltable
# define a path or folder or pattern
path = {
'file': '<supported_path>'
# alternatives
# 'folder': '<supported_path>'
# 'pattern': '<supported_path>'
}
# create an mltable from paths
tbl = mltable.from_delimited_files(paths=[path])
# alternatives
# tbl = mltable.from_parquet_files(paths=[path])
# tbl = mltable.from_json_lines_files(paths=[path])
# tbl = mltable.from_delta_lake(paths=[path])
# materialize to pandas
df = tbl.to_pandas_dataframe()
df.head()
You'll notice the `mltable` library supports reading tabular data from different path types:
Location | Examples |
---|---|
A path on your local computer | ./home/username/data/my_data |
A path on a public http(s) server | https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv |
A path on Azure Storage | wasbs://<container_name>@<account_name>.blob.core.windows.net/<path> abfss://<file_system>@<account_name>.dfs.core.windows.net/<path> |
A long-form Azure Machine Learning datastore | azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<wsname>/datastores/<name>/paths/<path> |
Note
`mltable` does user credential passthrough for paths on Azure Storage and Azure Machine Learning datastores. If you don't have permission to access the data on the underlying storage, you won't be able to access the data.
`mltable` supports reading from:
- file(s), for example: `abfss://<file_system>@<account_name>.dfs.core.windows.net/my-csv.csv`
- folder(s), for example: `abfss://<file_system>@<account_name>.dfs.core.windows.net/my-folder/`
- glob pattern(s), for example: `abfss://<file_system>@<account_name>.dfs.core.windows.net/my-folder/*.csv`
- Or, a combination of files, folders, and globbing patterns
The flexibility of `mltable` allows you to materialize data into a single dataframe from a combination of local/cloud storage and combinations of files/folders/globs. For example:
path1 = {
    'file': 'abfss://<file_system>@<account_name>.dfs.core.windows.net/my-csv.csv'
}
path2 = {
    'folder': './home/username/data/my_data'
}
path3 = {
    'pattern': 'abfss://<file_system>@<account_name>.dfs.core.windows.net/folder/*.csv'
}
tbl = mltable.from_delimited_files(paths=[path1, path2, path3])
`mltable` supports the following file formats:
- Delimited Text (for example: CSV files): `mltable.from_delimited_files(paths=[path])`
- Parquet: `mltable.from_parquet_files(paths=[path])`
- Delta: `mltable.from_delta_lake(paths=[path])`
- JSON lines format: `mltable.from_json_lines_files(paths=[path])`
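When paths come from configuration rather than being hard-coded, it can help to pick the loader from the file extension. The sketch below maps extensions to the `mltable` function names listed above. The extension mapping itself (`.csv`, `.parquet`, `.jsonl`) and the helper name `loader_name_for` are assumptions made for illustration, and Delta is left out because it is read from a folder rather than identified by a file extension.

```python
import posixpath

# Extension-to-loader mapping; the loader names are the mltable functions listed above,
# while the choice of extensions is an assumption.
LOADERS = {
    '.csv': 'from_delimited_files',
    '.parquet': 'from_parquet_files',
    '.jsonl': 'from_json_lines_files',
}


def loader_name_for(path):
    """Return the name of the mltable loader that matches a path's extension (hypothetical helper)."""
    extension = posixpath.splitext(path)[1].lower()
    try:
        return LOADERS[extension]
    except KeyError:
        raise ValueError(f'No loader known for extension {extension!r}') from None


name = loader_name_for('abfss://<file_system>@<account_name>.dfs.core.windows.net/my-folder/*.parquet')
print(name)
```

With `mltable` installed, the returned name could then be resolved with `getattr(mltable, name)` and called with `paths=[path]`.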
Update the placeholders (`<>`) in the code snippet with your details.
import mltable
path = {
'file': 'abfss://<filesystem>@<account>.dfs.core.windows.net/<folder>/<file_name>.csv'
}
tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
Update the placeholders (`<>`) in the code snippet with your details.
import mltable
path = {
'file': 'wasbs://<container>@<account>.blob.core.windows.net/<folder>/<file_name>.csv'
}
tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
Update the placeholders (`<>`) in the code snippet with your details.
import mltable
path = {
'file': 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<wsname>/datastores/<name>/paths/<folder>/<file>.csv'
}
tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
Tip
Rather than remember the datastore URI format, you can copy-and-paste the datastore URI from the Studio UI by following these steps:
- Select Data from the left-hand menu followed by the Datastores tab.
- Select your datastore name and then Browse.
- Find the file/folder you want to read into pandas and select the ellipsis (...) next to it. Select Copy URI from the menu. You can then select the Datastore URI to copy into your notebook/script. :::image type="content" source="media/how-to-access-data-ci/datastore_uri_copy.png" alt-text="Screenshot highlighting the copy of the datastore URI.":::
import mltable
path = {
'file': 'https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv'
}
tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
The example code below shows how `mltable` can use glob patterns - such as wildcards - to ensure only the parquet files are read.
Update the placeholders (`<>`) in the code snippet with your details.
import mltable
path = {
'pattern': 'abfss://<filesystem>@<account>.dfs.core.windows.net/<folder>/*.parquet'
}
tbl = mltable.from_parquet_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
Update the placeholders (`<>`) in the code snippet with your details.
import mltable
path = {
'pattern': 'wasbs://<container>@<account>.blob.core.windows.net/<folder>/*.parquet'
}
tbl = mltable.from_parquet_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
Update the placeholders (`<>`) in the code snippet with your details.
import mltable
path = {
'pattern': 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<wsname>/datastores/<name>/paths/<folder>/*.parquet'
}
tbl = mltable.from_parquet_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
Tip
Rather than remember the datastore URI format, you can copy-and-paste the datastore URI from the Studio UI by following these steps:
- Select Data from the left-hand menu followed by the Datastores tab.
- Select your datastore name and then Browse.
- Find the file/folder you want to read into pandas and select the ellipsis (...) next to it. Select Copy URI from the menu. You can then select the Datastore URI to copy into your notebook/script. :::image type="content" source="media/how-to-access-data-ci/datastore_uri_copy.png" alt-text="Screenshot highlighting the copy of the datastore URI.":::
Update the placeholders (`<>`) in the code snippet with your details.
Important
To glob the pattern on a public HTTP server, there must be access at the folder level.
import mltable
path = {
'pattern': '<https_address>/<folder>/*.parquet'
}
tbl = mltable.from_parquet_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
In this section, you'll learn how to load your Azure Machine Learning data assets into pandas.
If you've previously created a Table asset in Azure Machine Learning (an `mltable`, or a V1 `TabularDataset`), you can load that into pandas using:
import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="<name_of_asset>", version="<version>")
tbl = mltable.load(f'azureml:/{data_asset.id}')
df = tbl.to_pandas_dataframe()
df.head()
If you've registered a File asset that you want to read into a Pandas data frame - for example, a CSV file - you can achieve this using:
import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="<name_of_asset>", version="<version>")
path = {
'file': data_asset.path
}
tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
If you've registered a Folder asset (`uri_folder` or a V1 `FileDataset`) that you want to read into a Pandas data frame - for example, a folder containing CSV files - you can achieve this using:
import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="<name_of_asset>", version="<version>")
path = {
'folder': data_asset.path
}
tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()
Tip
Pandas is not designed to handle large datasets - you will only be able to process data that can fit into the memory of the compute instance.
For large datasets we recommend that you use Azure Machine Learning managed Spark, which provides the PySpark Pandas API.
You may wish to iterate quickly on a smaller subset of a large dataset before scaling up to a remote asynchronous job. `mltable` provides built-in functionality to get samples of large data using the `take_random_sample` method:
import mltable
path = {
'file': 'https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv'
}
tbl = mltable.from_delimited_files(paths=[path])
# take a random 30% sample of the data
tbl = tbl.take_random_sample(probability=.3)
df = tbl.to_pandas_dataframe()
df.head()
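A `probability` parameter suggests that each record is kept independently with the given chance, so the sample size is approximately, not exactly, 30% of the rows. The pure-Python sketch below illustrates that sampling idea under this assumption. It isn't `mltable`'s implementation, and `bernoulli_sample` is a hypothetical name.

```python
import random


def bernoulli_sample(rows, probability, seed=None):
    """Keep each row independently with the given probability (illustrative only)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < probability]


rows = list(range(10_000))
sample = bernoulli_sample(rows, probability=0.3, seed=0)

# The sample size varies around 30% of the input rather than matching it exactly
print(len(sample), len(sample) / len(rows))
```

Passing a seed makes the illustration repeatable; without one, each call can return a different subset of a different size.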
You can also take subsets of large data by using `mltable` methods such as `filter`, `keep_columns`, and `drop_columns`.
You may want to download the data to the local SSD of your host (local machine, cloud VM, Azure Machine Learning Compute Instance) and use the local filesystem. You can do this with the `azcopy` utility, which is pre-installed on an Azure Machine Learning compute instance. If you aren't using an Azure Machine Learning compute instance or a Data Science Virtual Machine (DSVM), you may need to install `azcopy`. For more information, see azcopy.
Caution
We do not recommend downloading data to the `/home/azureuser/cloudfiles/code` location on a compute instance. This location is designed to store notebook and code artifacts, not data. Reading data from this location will incur significant performance overhead when training. Instead, we recommend storing your data in `/home/azureuser`, which is the local SSD of the compute node.
Open a terminal and create a new directory, for example:
mkdir /home/azureuser/data
Sign in to azcopy using:
azcopy login
Next, you can copy data using a storage URI:
SOURCE=https://<account_name>.blob.core.windows.net/<container>/<path>
DEST=/home/azureuser/data
azcopy cp $SOURCE $DEST
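If you prefer to start the download from a notebook cell rather than a terminal, the same command can be assembled in Python. The sketch below only builds and prints the command. `azcopy_copy_command` is a hypothetical helper, and adding `--recursive` (needed when the source is a folder rather than a single file) is an assumption about your data layout.

```python
import shlex


def azcopy_copy_command(source, dest, recursive=False):
    """Build the argument list for an azcopy copy (hypothetical helper)."""
    cmd = ['azcopy', 'copy', source, dest]
    if recursive:
        # Copy the contents of a folder, not just a single file
        cmd.append('--recursive')
    return cmd


cmd = azcopy_copy_command(
    'https://<account_name>.blob.core.windows.net/<container>/<path>',
    '/home/azureuser/data',
    recursive=True,
)
# Print a shell-quoted version of the command for inspection
print(shlex.join(cmd))
```

To run it, you could pass the list to `subprocess.run(cmd, check=True)` after signing in with `azcopy login`.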