The Azure Storage blob inventory feature provides an overview of your containers, blobs, snapshots, and blob versions within a storage account. The inventory report can be used to understand various attributes of blobs and containers such as total data size, age, encryption status, immutability policy, legal hold and so on.
Inventory provides information about the containers and blobs within a storage account. This project demonstrates how to load inventory reports into Azure Synapse and run queries against them to derive insights, and it also describes how to visualize the resulting analytics in Power BI. Such analysis helps you understand the characteristics of a storage account, such as how it is used and what data it holds, and that information can in turn be used to optimize costs through lifecycle management policies.
- Overview
  - Overall growth in data over time
  - Amount of data added to the storage account over time
  - Number of files modified
  - Blob snapshot size (see Blob snapshots - Azure Storage | Microsoft Docs)
- Detailed analysis
  - Data access patterns across the storage tiers (Hot, Cool, Archive)
  - Distribution of data across tiers
  - Distribution of data across blob types
- Data profile
  - Distribution of blobs across blob types
  - Distribution of data across containers
  - Distribution of blobs across access tiers
  - Distribution of data by file type
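As a taste of what such queries look like, a distribution like "blobs per access tier" could be computed from a loaded inventory DataFrame roughly as follows. This is an illustrative sketch, not the notebook's actual implementation; the path is a placeholder, and the AccessTier and Content-Length column names are assumptions about the inventory schema.

```python
from pyspark.sql import functions as F

# Illustrative load: inventory reports in Parquet format at a placeholder path.
# (The built-in `spark` session is available in Synapse notebooks.)
df = spark.read.parquet("abfss://<container>@<account>.dfs.core.windows.net/<reports-path>")

# Distribution of blobs and bytes per access tier.
tier_distribution = (
    df.groupBy("AccessTier")
      .agg(
          F.count("*").alias("blobCount"),
          F.sum(F.col("Content-Length")).alias("totalBytes"),
      )
)
tier_distribution.show()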
- Enable inventory reports
- Create resources
- Import the PySpark notebook that has analytics queries
- Run the PySpark notebook
- Visualize the results of the above queries in Power BI reports
The first step is to enable blob inventory reports on your storage account. After the inventory reports are generated, proceed to the next steps. You may have to wait up to 24 hours after enabling inventory reports for your first report to be generated.
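Inventory can be enabled from the Azure portal; if you prefer the command line, a minimal Azure CLI sketch follows. It assumes a policy file (policy.json) defining your inventory rule already exists, and the resource names are placeholders.

```bash
# Enable blob inventory on the storage account using a prepared policy file.
az storage account blob-inventory-policy create \
    --resource-group <resource-group> \
    --account-name <storage-account> \
    --policy @policy.json
```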
An Azure Synapse workspace can be created either via the manual steps below or via the ARM template provided.
To create the workspace manually:
- Create an Azure Synapse workspace where you will execute a PySpark notebook to analyze the inventory report files.
- Grant the Storage Blob Data Contributor role on the storage account to the Synapse workspace's managed service identity (MSI).
- Create an Apache Spark pool in the Synapse workspace created above. This pool will execute the PySpark notebook that processes the reports generated by Azure Blob Storage inventory.
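For reference, a rough Azure CLI sketch of these manual steps is shown below. All names are placeholders, and the Spark version and pool size are examples rather than requirements.

```bash
# Create the Synapse workspace (requires an ADLS Gen2 account and container).
az synapse workspace create \
    --name <synapse-workspace> \
    --resource-group <resource-group> \
    --storage-account <adls-gen2-account> \
    --file-system <container> \
    --sql-admin-login-user <admin-user> \
    --sql-admin-login-password <admin-password> \
    --location <region>

# Grant the workspace's managed identity access to the storage account.
az role assignment create \
    --assignee <workspace-managed-identity-object-id> \
    --role "Storage Blob Data Contributor" \
    --scope <storage-account-resource-id>

# Create an Apache Spark pool to run the PySpark notebook.
az synapse spark pool create \
    --name <spark-pool> \
    --workspace-name <synapse-workspace> \
    --resource-group <resource-group> \
    --spark-version 3.3 \
    --node-count 3 \
    --node-size Medium
```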
Alternatively, deploy SynapseArmTemplate.json from the Azure portal using Deploy a custom template.
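The template can also be deployed from the command line; a sketch assuming the Azure CLI and an existing resource group:

```bash
# Deploy the provided ARM template into an existing resource group.
az deployment group create \
    --resource-group <resource-group> \
    --template-file SynapseArmTemplate.json
```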
Grant yourself, and any other users who will need to run the notebook or visualize the data, the following roles:
- Storage Blob Data Contributor role on the storage account used while creating the Synapse workspace
- Contributor role on the Synapse workspace
- Synapse Administrator role in Synapse Studio
Please note that this step must be repeated for each user who will run the PySpark notebook used later in this documentation, as well as for each user who will visualize the data using Power BI.
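These assignments can also be scripted. A sketch with the Azure CLI, where the user principal name and resource IDs are placeholders:

```bash
# Storage Blob Data Contributor on the workspace's storage account.
az role assignment create \
    --assignee <user-principal-name> \
    --role "Storage Blob Data Contributor" \
    --scope <storage-account-resource-id>

# Contributor on the Synapse workspace itself.
az role assignment create \
    --assignee <user-principal-name> \
    --role "Contributor" \
    --scope <synapse-workspace-resource-id>

# Synapse Administrator is assigned through Synapse RBAC, not Azure RBAC.
az synapse role assignment create \
    --workspace-name <synapse-workspace> \
    --role "Synapse Administrator" \
    --assignee <user-principal-name>
```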
Download BlobInventoryStorageAccountConfiguration.json and update the placeholders inside it as follows:
- storageAccountName: Name of the storage account on which inventory has been run
- destinationContainer: Name of the container where the inventory reports are stored
- blobInventoryRuleName: Name of the inventory rule whose results should be analyzed
- accessKey: Access key for the storage account in which the inventory reports are stored
After updating the above placeholders, upload the configuration file to the storage container used while creating the Synapse workspace.
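For illustration, a filled-in configuration might look like the following. All values are hypothetical, and the exact file structure is assumed from the placeholder names listed above:

```json
{
  "storageAccountName": "contosoinventorysa",
  "destinationContainer": "inventory-reports",
  "blobInventoryRuleName": "dailyinventoryrule",
  "accessKey": "<storage-account-access-key>"
}
```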
After you create the above resources and assign the permissions, perform the following steps:
- Download ReportAnalysis.ipynb
- Navigate to https://web.azuresynapse.net
- Select the Develop tab on the left edge
- Select the large plus sign (+) to add an item
- Select Import
- This opens a file selection dialog. Upload the ReportAnalysis.ipynb file downloaded above
- This will open a Properties tab on the right. Select Configure Session
- Select the Apache Spark pool created earlier in the Attach to dropdown and click Apply
- In the first cell of the notebook, update the values of the storage_account and container_name variables to the names of the storage account and container specified while creating the Synapse workspace (see the sketch after this list)
- Click the Publish button to save the notebook in Azure Synapse Studio, so that you don't have to upload it again the next time you run it
- In the PySpark notebook imported above, click Run all
- It will take 3-4 minutes to start the Spark session, and another couple of minutes to process the inventory reports
- Remember to publish the notebook again if you make any changes to it while running it
Please note that the first run could take a while if there are many inventory reports to process. Subsequent runs will process only the new inventory reports created since the last run.
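For orientation, the first cell you edit might look roughly like this. It is a sketch: the storage_account and container_name variable names come from the notebook, while everything else here is a placeholder, and the notebook itself derives the actual report locations from the uploaded configuration file.

```python
# Storage account and container chosen when the Synapse workspace was created.
storage_account = "<your-synapse-storage-account>"  # placeholder
container_name = "<your-synapse-container>"         # placeholder

# Illustrative ADLS Gen2 path built from those values.
base_path = f"abfss://{container_name}@{storage_account}.dfs.core.windows.net/"
print(base_path)
```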
Open the ReportAnalysis.pbit file using the Power BI Desktop application. Please note that this file may not render correctly in the web version of Power BI, so we recommend using the Desktop version. When the report opens, a popup like the one shown below appears. In this popup, enter the name of the Synapse workspace for synapse_workspace_name and enter reportdata for database_name.
If you see any error/warning in opening the PowerBI report, including but not limited to:
- The key did not match any row in the table
- Access to the resource is forbidden
The above errors are generic errors displayed by Power BI. Ensure that you have provided the correct inputs to the Power BI template. Additionally, the following actions can resolve the most frequently occurring errors:
- Missing permissions: Make sure the user logged into Power BI has the required access described in the "Configure permissions to allow Azure Synapse to store files in a Storage Account" step.
- Inventory reports not yet generated: Verify that the inventory run has succeeded at least once for the rule provided to the PySpark notebook, and verify the destination container provided to the notebook. You can also check this from the Azure Synapse Studio in which you ran the PySpark notebook: ensure that a database named "reportdata" is present under Data -> Workspace -> Lake Database.
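As a quick sanity check, you can run the following in a Synapse notebook cell (a sketch; the built-in spark session is available in Synapse notebooks):

```python
# Confirm the analysis notebook created the reportdata database and its tables.
spark.sql("SHOW TABLES IN reportdata").show(truncate=False)
```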
- Use Azure Storage blob inventory to manage blob data
- Optimize costs by automatically managing the data lifecycle
- Create a data trigger to automatically process inventory reports
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.