---
title: Access data from Azure cloud storage during interactive development
titleSuffix: Azure Machine Learning
description: Access data from Azure cloud storage during interactive development
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: how-to
author: samuel100
ms.author: samkemp
ms.reviewer: franksolomon
ms.date: 11/17/2022
ms.custom: sdkv2
---

Access data from Azure cloud storage during interactive development

[!INCLUDE sdk v2]

Typically, a machine learning project begins with exploratory data analysis (EDA), data preprocessing (cleaning, feature engineering), and building prototypes of ML models to validate hypotheses. This prototyping phase is highly interactive and lends itself to development in a Jupyter notebook or an IDE with a Python interactive console. In this article you'll learn how to:

[!div class="checklist"]

  • Access data from an Azure Machine Learning datastore URI as if it were a file system.
  • Materialize data into Pandas using the mltable Python library.
  • Materialize Azure Machine Learning data assets into Pandas using the mltable Python library.
  • Materialize data through an explicit download with the azcopy utility.

Prerequisites

Tip

The guidance in this article to access data during interactive development applies to any host that can run a Python session - for example: your local machine, a cloud VM, a GitHub Codespace, etc. We recommend using an Azure Machine Learning compute instance - a fully managed and pre-configured cloud workstation. For more information, see Create and manage an Azure Machine Learning compute instance.

Important

Ensure you have the latest azureml-fsspec and mltable Python libraries installed in your Python environment:

pip install -U azureml-fsspec mltable

Access data from a datastore URI, like a filesystem (preview)

[!INCLUDE preview disclaimer]

An Azure Machine Learning datastore is a reference to an existing storage account on Azure. The benefits of creating and using a datastore include:

[!div class="checklist"]

  • A common and easy-to-use API to interact with different storage types (Blob/Files/ADLS).
  • Easier to discover useful datastores when working as a team.
  • Supports both credential-based (for example, SAS token) and identity-based (Azure Active Directory or managed identity) access to data.
  • When using credential-based access, the connection information is secured so you don't expose keys in scripts.
  • Browse data and copy-paste datastore URIs in the Studio UI.

A Datastore URI is a Uniform Resource Identifier, which is a reference to a storage location (path) on your Azure storage account. The format of the datastore URI is:

# Azure Machine Learning workspace details:
subscription = '<subscription_id>'
resource_group = '<resource_group>'
workspace = '<workspace>'
datastore_name = '<datastore>'
path_on_datastore = '<path>'

# long-form Datastore uri format:
uri = f'azureml://subscriptions/{subscription}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/{datastore_name}/paths/{path_on_datastore}'

These Datastore URIs are a known implementation of Filesystem spec (fsspec): A unified pythonic interface to local, remote and embedded file systems and bytes storage.

The Azure Machine Learning Datastore implementation of fsspec automatically handles credential/identity passthrough used by the Azure Machine Learning datastore. This means you don't need to expose account keys in your scripts or do additional sign-in procedures on a compute instance.

For example, you can directly use Datastore URIs in Pandas - below is an example of reading a CSV file:

import pandas as pd

df = pd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")
df.head()
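
Other Pandas readers that accept fsspec-style URLs work the same way. For example, a minimal sketch for reading a parquet file (the parquet file path is a placeholder):

import pandas as pd

df = pd.read_parquet("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.parquet")
df.head()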

Tip

Rather than remember the datastore URI format, you can copy-and-paste the datastore URI from the Studio UI by following these steps:

  1. Select Data from the left-hand menu followed by the Datastores tab.
  2. Select your datastore name and then Browse.
  3. Find the file/folder you want to read into pandas, select the ellipsis (...) next to it, and then select Copy URI from the menu. You can select the Datastore URI to copy into your notebook/script. :::image type="content" source="media/how-to-access-data-ci/datastore_uri_copy.png" alt-text="Screenshot highlighting the copy of the datastore URI.":::

You can also instantiate an Azure Machine Learning filesystem and run filesystem-like commands such as ls, glob, exists, and open. The open() method returns a file-like object, which can be passed to any other library that expects to work with Python files, or used by your own code as you would a normal Python file object. These file-like objects respect the use of with contexts, for example:

from azureml.fsspec import AzureMachineLearningFileSystem

# instantiate file system using datastore URI
fs = AzureMachineLearningFileSystem('azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>')

# list files in the path
fs.ls()
# output example:
# /datastore_name/folder/file1.csv
# /datastore_name/folder/file2.csv

# use an open context
with fs.open('/datastore_name/folder/file1.csv') as f:
    # do some process
    process_file(f)
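
The ls and open calls above can be combined with glob and exists. A minimal sketch, reusing the same placeholder filesystem and file names (exact path forms depend on your datastore layout):

# match only the CSV files under the folder
csv_paths = fs.glob('/datastore_name/folder/*.csv')

# check that a file exists before opening it
if fs.exists('/datastore_name/folder/file1.csv'):
    with fs.open('/datastore_name/folder/file1.csv') as f:
        process_file(f)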

Examples

In this section, we provide examples of how to use Filesystem spec for some common scenarios.

Read a single CSV file into pandas

If you have a single CSV file, then as outlined above you can read that into pandas with:

import pandas as pd

df = pd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")

Read a folder of CSV files into pandas

The Pandas read_csv() method doesn't support reading a folder of CSV files. You need to glob the CSV paths and concatenate them into a single data frame with the Pandas concat() method. The following code demonstrates how to achieve this concatenation with the Azure Machine Learning filesystem:

import pandas as pd
from azureml.fsspec import AzureMachineLearningFileSystem

# define the URI - update <> placeholders
uri = 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/*.csv'

# create the filesystem
fs = AzureMachineLearningFileSystem(uri)

# append csv files in folder to a list
dflist = []
for path in fs.ls():
    with fs.open(path) as f:
        dflist.append(pd.read_csv(f))

# concatenate data frames
df = pd.concat(dflist)
df.head()

Reading CSV files into Dask

Below is an example of reading a CSV file into a Dask data frame:

import dask.dataframe as dd

df = dd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv")
df.head()
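
Dask can also read multiple files lazily in one call. As a sketch (assuming the folder contains only CSV files with matching schemas), you can pass a glob pattern:

import dask.dataframe as dd

# read every CSV file in the folder into a single lazy Dask data frame
df = dd.read_csv("azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/*.csv")
df.head()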

Read a folder of parquet files into pandas

Parquet files are typically written to a folder as part of an ETL process, which can also emit auxiliary files such as progress and commit files. Below is an example of the files created by an ETL process (files beginning with _) alongside the parquet data files.

:::image type="content" source="media/how-to-access-data-ci/parquet-auxillary.png" alt-text="Screenshot showing the parquet etl process.":::

In these scenarios, you'll only want to read the parquet files in the folder and ignore the ETL process files. The code below shows how you can use glob patterns to read only parquet files in a folder:

import pandas as pd
from azureml.fsspec import AzureMachineLearningFileSystem

# define the URI - update <> placeholders
uri = 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/*.parquet'

# create the filesystem
fs = AzureMachineLearningFileSystem(uri)

# append parquet files in folder to a list
dflist = []
for path in fs.ls():
    with fs.open(path) as f:
        dflist.append(pd.read_parquet(f))

# concatenate data frames
df = pd.concat(dflist)
df.head()

Accessing data from your Azure Databricks filesystem (dbfs)

Filesystem spec (fsspec) has a range of known implementations, one of which is the Databricks Filesystem (dbfs).

To access data from dbfs, you need:

  • The instance name, which is in the form of adb-<some-number>.<two digits>.azuredatabricks.net. You can find this in the URL of your Azure Databricks workspace.
  • A Personal Access Token (PAT). For more information on creating a PAT, see Authentication using Azure Databricks personal access tokens.

Once you have these, create an environment variable on your compute instance for the PAT:

export ADB_PAT=<pat_token>

You can then access data in Pandas using:

import os
import pandas as pd

pat = os.getenv('ADB_PAT')
path_on_dbfs = '<absolute_path_on_dbfs>' # e.g. /folder/subfolder/file.csv

storage_options = {
    'instance':'adb-<some-number>.<two digits>.azuredatabricks.net', 
    'token': pat
}

df = pd.read_csv(f'dbfs://{path_on_dbfs}', storage_options=storage_options)
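
Because dbfs is a registered fsspec implementation, you can also browse it directly with fsspec. A minimal sketch, reusing the same instance and token (the folder path is a placeholder):

import os
import fsspec

# instantiate the Databricks filesystem registered with fsspec
fs = fsspec.filesystem(
    'dbfs',
    instance='adb-<some-number>.<two digits>.azuredatabricks.net',
    token=os.getenv('ADB_PAT')
)

# list the contents of a dbfs folder
fs.ls('<absolute_folder_path_on_dbfs>')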

Reading images with pillow

from PIL import Image
from azureml.fsspec import AzureMachineLearningFileSystem

# define the URI - update <> placeholders
uri = 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<image.jpeg>'

# create the filesystem
fs = AzureMachineLearningFileSystem(uri)

with fs.open() as f:
    img = Image.open(f)
    img.show()

PyTorch custom dataset example

In this example, you create a PyTorch custom dataset for processing images. The assumption is that an annotations file (in CSV format) exists that looks like:

image_path, label
0/image0.png, label0
0/image1.png, label0
1/image2.png, label1
1/image3.png, label1
2/image4.png, label2
2/image5.png, label2

The images are stored in subfolders according to their label:

/
└── 📁images
    ├── 📁0
    │   ├── 📷image0.png
    │   └── 📷image1.png
    ├── 📁1
    │   ├── 📷image2.png
    │   └── 📷image3.png
    └── 📁2
        ├── 📷image4.png
        └── 📷image5.png

A custom Dataset class in PyTorch must implement three functions: __init__, __len__, and __getitem__, which are implemented below:

import os
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class CustomImageDataset(Dataset):
    def __init__(self, filesystem, annotations_file, img_dir, transform=None, target_transform=None):
        self.fs = filesystem
        f = filesystem.open(annotations_file)
        self.img_labels = pd.read_csv(f)
        f.close()
        self.img_dir = img_dir
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        f = self.fs.open(img_path)
        image = Image.open(f)
        f.close()
        label = self.img_labels.iloc[idx, 1]
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        return image, label

You can then instantiate the dataset using:

from azureml.fsspec import AzureMachineLearningFileSystem
from torch.utils.data import DataLoader

# define the URI - update <> placeholders
uri = 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/'

# create the filesystem
fs = AzureMachineLearningFileSystem(uri)

# create the dataset
training_data = CustomImageDataset(
    filesystem=fs,
    annotations_file='<datastore_name>/<path>/annotations.csv', 
    img_dir='<datastore_name>/<path_to_images>/'
)

# Preparing your data for training with DataLoaders
train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
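
To confirm the wiring, you can pull a single batch from the DataLoader. This sketch assumes you pass a transform such as torchvision.transforms.ToTensor() when constructing the dataset (and that all images share the same dimensions) so the default collate function can stack them:

from torchvision import transforms

# rebuild the dataset with a tensor transform so batches can be collated
training_data = CustomImageDataset(
    filesystem=fs,
    annotations_file='<datastore_name>/<path>/annotations.csv',
    img_dir='<datastore_name>/<path_to_images>/',
    transform=transforms.ToTensor()
)
train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)

# fetch one batch of images and labels
images, labels = next(iter(train_dataloader))
print(images.shape)  # for example, torch.Size([64, 3, H, W])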

Materialize data into Pandas using mltable library

Another method for accessing data in cloud storage is to use the mltable library. The general format for reading data into pandas using mltable is:

import mltable

# define a path or folder or pattern
path = {
    'file': '<supported_path>'
    # alternatives
    # 'folder': '<supported_path>'
    # 'pattern': '<supported_path>'
}

# create an mltable from paths
tbl = mltable.from_delimited_files(paths=[path])
# alternatives
# tbl = mltable.from_parquet_files(paths=[path])
# tbl = mltable.from_json_lines_files(paths=[path])
# tbl = mltable.from_delta_lake(paths=[path])

# materialize to pandas
df = tbl.to_pandas_dataframe()
df.head()

Supported paths

The mltable library supports reading tabular data from the following path types:

| Location | Examples |
|----------|----------|
| A path on your local computer | ./home/username/data/my_data |
| A path on a public http(s) server | https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv |
| A path on Azure Storage | wasbs://<container_name>@<account_name>.blob.core.windows.net/<path> <br> abfss://<file_system>@<account_name>.dfs.core.windows.net/<path> |
| A long-form Azure Machine Learning datastore | azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<wsname>/datastores/<name>/paths/<path> |

Note

mltable does user credential passthrough for paths on Azure Storage and Azure Machine Learning datastores. If you don't have permission to read the data in the underlying storage, you won't be able to access it.

Files, folders and globs

mltable supports reading from:

  • file(s), for example: abfss://<file_system>@<account_name>.dfs.core.windows.net/my-csv.csv
  • folder(s), for example abfss://<file_system>@<account_name>.dfs.core.windows.net/my-folder/
  • glob pattern(s), for example abfss://<file_system>@<account_name>.dfs.core.windows.net/my-folder/*.csv
  • a combination of files, folders, and globbing patterns

The flexibility of mltable allows you to materialize data into a single dataframe from a combination of local/cloud storage and combinations of files/folder/globs. For example:

import mltable

path1 = {
    'file': 'abfss://<filesystem>@<account>.dfs.core.windows.net/my-csv.csv'
}

path2 = {
    'folder': './home/username/data/my_data'
}

path3 = {
    'pattern': 'abfss://<filesystem>@<account>.dfs.core.windows.net/<folder>/*.csv'
}

tbl = mltable.from_delimited_files(paths=[path1, path2, path3])

Supported file formats

mltable supports the following file formats:

  • Delimited Text (for example: CSV files): mltable.from_delimited_files(paths=[path])
  • Parquet: mltable.from_parquet_files(paths=[path])
  • Delta: mltable.from_delta_lake(paths=[path])
  • JSON lines format: mltable.from_json_lines_files(paths=[path])
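
Delta isn't shown in the examples that follow. A minimal sketch, using the from_delta_lake form listed above (the folder path is a placeholder and is assumed to contain a Delta table):

import mltable

path = {
    'folder': 'abfss://<filesystem>@<account>.dfs.core.windows.net/<delta_table_folder>'
}

tbl = mltable.from_delta_lake(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()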

Examples

Read a CSV file

Update the placeholders (<>) in the code snippet with your details.

import mltable

path = {
    'file': 'abfss://<filesystem>@<account>.dfs.core.windows.net/<folder>/<file_name>.csv'
}

tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Update the placeholders (<>) in the code snippet with your details.

import mltable

path = {
    'file': 'wasbs://<container>@<account>.blob.core.windows.net/<folder>/<file_name>.csv'
}

tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Update the placeholders (<>) in the code snippet with your details.

import mltable

path = {
    'file': 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<wsname>/datastores/<name>/paths/<folder>/<file>.csv'
}

tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Tip

Rather than remember the datastore URI format, you can copy-and-paste the datastore URI from the Studio UI by following these steps:

  1. Select Data from the left-hand menu followed by the Datastores tab.
  2. Select your datastore name and then Browse.
  3. Find the file/folder you want to read into pandas, select the ellipsis (...) next to it, and then select Copy URI from the menu. You can select the Datastore URI to copy into your notebook/script. :::image type="content" source="media/how-to-access-data-ci/datastore_uri_copy.png" alt-text="Screenshot highlighting the copy of the datastore URI.":::
import mltable

path = {
    'file': 'https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv'
}

tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Read parquet files in a folder

The example code below shows how mltable can use glob patterns - such as wildcards - to ensure only the parquet files are read.

Update the placeholders (<>) in the code snippet with your details.

import mltable

path = {
    'pattern': 'abfss://<filesystem>@<account>.dfs.core.windows.net/<folder>/*.parquet'
}

tbl = mltable.from_parquet_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Update the placeholders (<>) in the code snippet with your details.

import mltable

path = {
    'pattern': 'wasbs://<container>@<account>.blob.core.windows.net/<folder>/*.parquet'
}

tbl = mltable.from_parquet_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Update the placeholders (<>) in the code snippet with your details.

import mltable

path = {
    'pattern': 'azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<wsname>/datastores/<name>/paths/<folder>/*.parquet'
}

tbl = mltable.from_parquet_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Tip

Rather than remember the datastore URI format, you can copy-and-paste the datastore URI from the Studio UI by following these steps:

  1. Select Data from the left-hand menu followed by the Datastores tab.
  2. Select your datastore name and then Browse.
  3. Find the file/folder you want to read into pandas, select the ellipsis (...) next to it, and then select Copy URI from the menu. You can select the Datastore URI to copy into your notebook/script. :::image type="content" source="media/how-to-access-data-ci/datastore_uri_copy.png" alt-text="Screenshot highlighting the copy of the datastore URI.":::

Update the placeholders (<>) in the code snippet with your details.

Important

To glob the pattern on a public HTTP server, there must be access at the folder level.

import mltable

path = {
    'pattern': '<https_address>/<folder>/*.parquet'
}

tbl = mltable.from_parquet_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Reading data assets

In this section, you'll learn how to read your Azure Machine Learning data assets into pandas.

Table asset

If you've previously created a Table asset in Azure Machine Learning (an mltable, or a V1 TabularDataset), you can load that into pandas using:

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="<name_of_asset>", version="<version>")

tbl = mltable.load(f'azureml:/{data_asset.id}')
df = tbl.to_pandas_dataframe()
df.head()
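
If you don't want to pin a specific version, you may be able to resolve the most recent one by label instead. A sketch, assuming the label parameter of MLClient.data.get in the v2 SDK:

# resolve the latest registered version of the asset
data_asset = ml_client.data.get(name="<name_of_asset>", label="latest")

tbl = mltable.load(f'azureml:/{data_asset.id}')
df = tbl.to_pandas_dataframe()
df.head()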

File asset

If you've registered a File asset that you want to read into a Pandas data frame - for example, a CSV file - you can achieve this using:

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="<name_of_asset>", version="<version>")

path = {
    'file': data_asset.path
}

tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

Folder asset

If you've registered a Folder asset (uri_folder or a V1 FileDataset) that you want to read into a Pandas data frame - for example, a folder containing CSV files - you can achieve this using:

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get(name="<name_of_asset>", version="<version>")

path = {
    'folder': data_asset.path
}

tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df.head()

A note on reading and processing large data volumes with Pandas

Tip

Pandas is not designed to handle large datasets - you will only be able to process data that can fit into the memory of the compute instance.

For large datasets we recommend that you use Azure Machine Learning managed Spark, which provides the PySpark Pandas API.
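
A rough sketch of the Pandas API on Spark (this assumes you're in an Azure Machine Learning managed Spark session that has access to the storage; the path is a placeholder):

import pyspark.pandas as ps

# reads lazily and distributes the work across the Spark cluster
df = ps.read_csv('abfss://<filesystem>@<account>.dfs.core.windows.net/<folder>/<file_name>.csv')
df.head()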

You may wish to iterate quickly on a smaller subset of a large dataset before scaling up to a remote asynchronous job. mltable provides in-built functionality to get samples of large data using the take_random_sample method:

import mltable

path = {
    'file': 'https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv'
}

tbl = mltable.from_delimited_files(paths=[path])
# take a random 30% sample of the data
tbl = tbl.take_random_sample(probability=.3)
df = tbl.to_pandas_dataframe()
df.head()

You can also take subsets of large data by chaining other mltable transformations - such as take, skip, keep_columns, and drop_columns - before materializing to pandas, as sketched below.
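
A minimal sketch, reusing the titanic URL from above (the column names assume the standard titanic schema):

import mltable

path = {
    'file': 'https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv'
}

tbl = mltable.from_delimited_files(paths=[path])
# keep two columns and take only the first 1,000 rows
tbl = tbl.keep_columns(['Name', 'Survived'])
tbl = tbl.take(1000)
df = tbl.to_pandas_dataframe()
df.head()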

Downloading data using the azcopy utility

You may want to download data to the local SSD of your host (local machine, cloud VM, Azure Machine Learning compute instance) and use the local filesystem. You can do this with the azcopy utility, which is pre-installed on an Azure Machine Learning compute instance. If you aren't using an Azure Machine Learning compute instance or a Data Science Virtual Machine (DSVM), you may need to install azcopy. For more information, see azcopy.

Caution

We don't recommend downloading data to the /home/azureuser/cloudfiles/code location on a compute instance. This location is designed to store notebook and code artifacts, not data, and reading data from it incurs significant performance overhead during training. Instead, we recommend storing your data in /home/azureuser, which is on the local SSD of the compute node.

Open a terminal and create a new directory, for example:

mkdir /home/azureuser/data

Sign in to azcopy using:

azcopy login

Next, copy the data using a storage URI:

SOURCE=https://<account_name>.blob.core.windows.net/<container>/<path>
DEST=/home/azureuser/data
azcopy cp $SOURCE $DEST
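
To copy an entire folder rather than a single blob, azcopy supports recursive copies. A sketch, reusing the same SOURCE and DEST variables:

azcopy cp "$SOURCE" "$DEST" --recursive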

Next steps