---
title: Upload large amounts of random data in parallel to Azure Storage | Microsoft Docs
description: Learn how to use the Azure SDK to upload large amounts of random data in parallel to an Azure Storage account
services: storage
author: roygara
ms.service: storage
ms.devlang: dotnet
ms.topic: tutorial
ms.date: 02/20/2018
ms.author: rogarana
ms.custom: mvc
ms.component: blobs
---
This tutorial is part two of a series. It shows you how to deploy an application that uploads large amounts of random data to an Azure storage account.
In part two of the series, you learn how to:
[!div class="checklist"]
- Configure the connection string
- Build the application
- Run the application
- Validate the number of connections
Azure blob storage provides a scalable service for storing your data. To ensure your application is as performant as possible, an understanding of how blob storage works is recommended. Knowledge of the limits for Azure blobs is important; to learn more about these limits, see blob storage scalability targets.
Partition naming is another important factor when designing a high-performance application that uses blobs. Azure storage uses a range-based partitioning scheme to scale and load balance. This configuration means that files with similar naming conventions or prefixes go to the same partition. This logic includes the name of the container that the files are being uploaded to. In this tutorial, you use files that have GUIDs for names as well as randomly generated content. They are then uploaded to five different containers with random names.
To complete this tutorial, you must have completed the previous Storage tutorial: Create a virtual machine and storage account for a scalable application.
Use the following command on your local machine to create a remote desktop session with the virtual machine. Replace the IP address with the publicIPAddress of your virtual machine. When prompted, enter the credentials you used when creating the virtual machine.
```
mstsc /v:<publicIpAddress>
```
In the Azure portal, navigate to your storage account. Select Access keys under Settings in your storage account. Copy the connection string from the primary or secondary key. Log in to the virtual machine you created in the previous tutorial. Open a Command Prompt as an administrator and run the `setx` command with the `/m` switch; this command saves a machine-level environment variable. The environment variable is not available until you reload the Command Prompt. Replace `<storageConnectionString>` in the following sample:

```
setx storageconnectionstring "<storageConnectionString>" /m
```
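The application reads the connection string from this environment variable at startup. A minimal sketch of that pattern, assuming the variable name `storageconnectionstring` used above and the `WindowsAzure.Storage` client library used throughout this series:

```csharp
using System;
using Microsoft.WindowsAzure.Storage;

class ConnectionStringExample
{
    static void Main()
    {
        // Read the connection string saved by setx. This assumes the
        // machine-level variable name used in the command above.
        string connectionString =
            Environment.GetEnvironmentVariable("storageconnectionstring");

        if (string.IsNullOrEmpty(connectionString))
        {
            Console.WriteLine("Set the storageconnectionstring environment variable first.");
            return;
        }

        // Parse the connection string and create a client for the blob service.
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        var blobClient = account.CreateCloudBlobClient();
        Console.WriteLine("Connected to {0}", blobClient.BaseUri);
    }
}
```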
When finished, open another Command Prompt, navigate to `D:\git\storage-dotnet-perf-scale-app`, and type `dotnet build` to build the application.
Navigate to `D:\git\storage-dotnet-perf-scale-app` and type `dotnet run` to run the application. The first time you run `dotnet`, it populates your local package cache to improve restore speed and enable offline access. This command takes up to a minute to complete and only happens once.

```
dotnet run
```
The application creates five randomly named containers and begins uploading the files in the staging directory to the storage account. The application sets the minimum threads to 100 and the DefaultConnectionLimit to 100 to ensure that a large number of concurrent connections are allowed when running the application.
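These settings are applied once at startup, before any uploads begin. A minimal sketch of how they can be configured, assuming the standard `System.Net` and `System.Threading` APIs (the class and method names here are illustrative, not part of the sample application):

```csharp
using System.Net;
using System.Threading;

static class ConcurrencySettings
{
    public static void Apply()
    {
        // Raise the minimum worker and I/O completion threads so the thread
        // pool does not throttle the 100 concurrent upload tasks while it ramps up.
        ThreadPool.SetMinThreads(100, 100);

        // The .NET Framework defaults to two connections per endpoint; raise
        // the limit so parallel block uploads to blob storage are not serialized.
        ServicePointManager.DefaultConnectionLimit = 100;
    }
}
```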
In addition to the threading and connection limit settings, the BlobRequestOptions for the UploadFromFileAsync method are configured to use parallelism and disable MD5 hash validation. The files are uploaded in 100-MB blocks. This configuration provides better performance but can be costly on a poorly performing network, because if there is a failure the entire 100-MB block is retried.
Property | Value | Description |
---|---|---|
ParallelOperationThreadCount | 8 | The number of blocks that are uploaded in parallel for a single blob. For highest performance, this value should be 8 times the number of cores. |
DisableContentMD5Validation | true | This property disables checking the MD5 hash of the content uploaded. Disabling MD5 validation produces a faster transfer, but does not confirm the validity or integrity of the files being transferred. |
StoreBlobContentMD5 | false | This property determines if an MD5 hash is calculated and stored with the file. |
RetryPolicy | 2-second backoff, 10 max retries | Determines the retry policy of requests. Connection failures are retried; in this example, an ExponentialRetry policy is configured with a 2-second backoff and a maximum retry count of 10. This setting is important when your application gets close to hitting the blob storage scalability targets. |
The `UploadFilesAsync` task is shown in the following example:
```csharp
private static async Task UploadFilesAsync()
{
    // Create five randomly named containers to upload the files to.
    CloudBlobContainer[] containers = await GetRandomContainersAsync();
    var currentdir = System.IO.Directory.GetCurrentDirectory();

    // Path to the directory to upload.
    string uploadPath = currentdir + "\\upload";
    Stopwatch time = Stopwatch.StartNew();
    try
    {
        Console.WriteLine("Iterating in directory: {0}", uploadPath);
        int count = 0;
        int max_outstanding = 100;
        int completed_count = 0;

        // Define the BlobRequestOptions on the upload.
        // An exponential retry policy ensures that failed connections are retried
        // with a backoff. Because multiple large files are uploaded with large
        // block sizes, transient failures can be an issue if a retry policy is
        // not defined. Parallel operations are enabled with a thread count of 8;
        // this value should be a multiple of the number of cores on the machine.
        // Lastly, MD5 hash validation is disabled for this example, which
        // improves the upload speed.
        BlobRequestOptions options = new BlobRequestOptions
        {
            ParallelOperationThreadCount = 8,
            DisableContentMD5Validation = true,
            StoreBlobContentMD5 = false,
            // ExponentialRetry is in Microsoft.WindowsAzure.Storage.RetryPolicies.
            RetryPolicy = new ExponentialRetry(TimeSpan.FromSeconds(2), 10)
        };

        // Create a new instance of the SemaphoreSlim class to define the number of threads to use in the application.
        SemaphoreSlim sem = new SemaphoreSlim(max_outstanding, max_outstanding);

        List<Task> tasks = new List<Task>();
        Console.WriteLine("Found {0} file(s)", Directory.GetFiles(uploadPath).Count());

        // Iterate through the files.
        foreach (string path in Directory.GetFiles(uploadPath))
        {
            // Round-robin the files across the five containers and set the block size used for the upload.
            var container = containers[count % 5];
            string fileName = Path.GetFileName(path);
            Console.WriteLine("Uploading {0} to container {1}.", path, container.Name);
            CloudBlockBlob blockBlob = container.GetBlockBlobReference(fileName);

            // Set the block size to 100 MB.
            blockBlob.StreamWriteSizeInBytes = 100 * 1024 * 1024;
            await sem.WaitAsync();

            // Create a task for each file that is uploaded. Each task is added to a collection that executes them all asynchronously.
            tasks.Add(blockBlob.UploadFromFileAsync(path, null, options, null).ContinueWith((t) =>
            {
                sem.Release();
                Interlocked.Increment(ref completed_count);
            }));
            count++;
        }

        // Create an asynchronous task that completes when all the uploads complete.
        await Task.WhenAll(tasks);

        time.Stop();
        Console.WriteLine("Upload has been completed in {0} seconds. Press any key to continue", time.Elapsed.TotalSeconds.ToString());
        Console.ReadLine();
    }
    catch (DirectoryNotFoundException ex)
    {
        Console.WriteLine("Error parsing files in the directory: {0}", ex.Message);
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.Message);
    }
}
```
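The `GetRandomContainersAsync` helper called at the start of `UploadFilesAsync` is not shown in this article. A minimal sketch of what it might look like, assuming the connection string environment variable configured earlier and GUIDs as the random container names (the helper body here is an assumption, not the sample's exact code):

```csharp
private static async Task<CloudBlobContainer[]> GetRandomContainersAsync()
{
    // Parse the connection string saved in the environment variable earlier.
    CloudStorageAccount account = CloudStorageAccount.Parse(
        Environment.GetEnvironmentVariable("storageconnectionstring"));
    CloudBlobClient blobClient = account.CreateCloudBlobClient();

    // Create five containers with random GUID names so that the uploads are
    // spread across different storage partitions.
    CloudBlobContainer[] containers = new CloudBlobContainer[5];
    for (int i = 0; i < containers.Length; i++)
    {
        containers[i] = blobClient.GetContainerReference(Guid.NewGuid().ToString());
        await containers[i].CreateIfNotExistsAsync();
        Console.WriteLine("Created container {0}", containers[i].Uri);
    }

    return containers;
}
```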
The following example shows truncated output from the application running on a Windows system.
```
Created container https://mystorageaccount.blob.core.windows.net/9efa7ecb-2b24-49ff-8e5b-1d25e5481076
Created container https://mystorageaccount.blob.core.windows.net/bbe5f0c8-be9e-4fc3-bcbd-2092433dbf6b
Created container https://mystorageaccount.blob.core.windows.net/9ac2f71c-6b44-40e7-b7be-8519d3ba4e8f
Created container https://mystorageaccount.blob.core.windows.net/47646f1a-c498-40cd-9dae-840f46072180
Created container https://mystorageaccount.blob.core.windows.net/38b2cdab-45fa-4cf9-94e7-d533837365aa
Iterating in directory: D:\git\storage-dotnet-perf-scale-app\upload
Found 50 file(s)
Uploading D:\git\storage-dotnet-perf-scale-app\upload\1d596d16-f6de-4c4c-8058-50ebd8141e4d.txt to container 9efa7ecb-2b24-49ff-8e5b-1d25e5481076.
Uploading D:\git\storage-dotnet-perf-scale-app\upload\242ff392-78be-41fb-b9d4-aee8152a6279.txt to container bbe5f0c8-be9e-4fc3-bcbd-2092433dbf6b.
Uploading D:\git\storage-dotnet-perf-scale-app\upload\38d4d7e2-acb4-4efc-ba39-f9611d0d55ef.txt to container 9ac2f71c-6b44-40e7-b7be-8519d3ba4e8f.
Uploading D:\git\storage-dotnet-perf-scale-app\upload\45930d63-b0d0-425f-a766-cda27ff00d32.txt to container 47646f1a-c498-40cd-9dae-840f46072180.
Uploading D:\git\storage-dotnet-perf-scale-app\upload\5129b385-5781-43be-8bac-e2fbb7d2bd82.txt to container 38b2cdab-45fa-4cf9-94e7-d533837365aa.
...
Upload has been completed in 142.0429536 seconds. Press any key to continue
```
While the files are being uploaded, you can verify the number of concurrent connections to your storage account. Open a Command Prompt and type `netstat -a | find /c "blob:https"`. This command shows the number of connections that are currently open, using `netstat`. The following example shows output similar to what you see when running the tutorial yourself. As you can see from the example, 800 connections were open when uploading the random files to the storage account: 100 concurrent uploads, each using a ParallelOperationThreadCount of 8, allow up to 800 simultaneous connections. This value changes throughout the upload. By uploading in parallel block chunks, the amount of time required to transfer the contents is greatly reduced.
```
C:\>netstat -a | find /c "blob:https"
800

C:\>
```
In part two of the series, you learned about uploading large amounts of random data to a storage account in parallel, such as how to:
[!div class="checklist"]
- Configure the connection string
- Build the application
- Run the application
- Validate the number of connections
Advance to part three of the series to download large amounts of data from a storage account.
[!div class="nextstepaction"] Download large amounts of random data from Azure storage