title | description | services | author | ms.topic | ms.author | ms.date | ms.service | ms.component
---|---|---|---|---|---|---|---|---
Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2 Preview | Upgrade your solution to use Azure Data Lake Storage Gen2 Preview | storage | normesta | conceptual | normesta | 12/06/2018 | storage | data-lake-storage-gen2
Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2 Preview
If you're using Azure Data Lake Storage Gen1 in your big data analytics solutions, this guide helps you to upgrade those solutions to use Azure Data Lake Storage Gen2 Preview. You can use this document to assess the dependencies that your solution has on Data Lake Storage Gen1. This guide also shows you how to plan and perform the upgrade.
We'll help you through the following tasks:
✔️ Assess your upgrade readiness
✔️ Plan for an upgrade
✔️ Perform the upgrade
Our goal is to make all the capabilities that are present in Data Lake Storage Gen1 available in Data Lake Storage Gen2. How those capabilities are exposed (for example, in the SDK or CLI) might differ between Data Lake Storage Gen1 and Data Lake Storage Gen2. Applications and services that work with Data Lake Storage Gen1 need to be able to work similarly with Data Lake Storage Gen2. Finally, some of the capabilities won't be available in Data Lake Storage Gen2 right away. As they become available, we'll announce them in this document.
These next sections will help you decide how best to upgrade to Data Lake Storage Gen2, and when it might make the most sense to do so.
Most likely, when you use Data Lake Storage Gen1 in your analytics solutions or pipelines, there are many additional technologies that you employ to achieve the overall end-to-end functionality. This article describes various components of the data flow that include ingesting, processing, and consuming data.
In addition, there are cross-cutting components to provision, manage, and monitor these components. Each of the components operates with Data Lake Storage Gen1 by using the interface best suited to it. When you're planning to upgrade your solution to Data Lake Storage Gen2, you'll need to be aware of the interfaces that are used. You'll need to upgrade both the management interfaces and the data interfaces, since each interface has distinct requirements.
Figure 1 above shows the functionality components that you would see in most analytics solutions.
Figure 2 shows an example of how those components will be implemented by using specific technologies.
The Storing functionality in Figure 1 is provided by Data Lake Storage Gen1 (Figure 2). Note how the various components in the data flow interact with Data Lake Storage Gen1 by using REST APIs or the Java SDK. Also note how the cross-cutting functionality components interact with Data Lake Storage Gen1. The Provisioning component uses Azure Resource Manager templates, whereas the Monitoring component, which uses Log Analytics, consumes operational data that comes from Data Lake Storage Gen1.
To upgrade a solution from Data Lake Storage Gen1 to Data Lake Storage Gen2, you'll need to copy the data and meta-data, re-hook the data flows, and then ensure that all of the components can work with Data Lake Storage Gen2.
The sections below provide information to help you make better decisions:
✔️ Platform capabilities
✔️ Programming interfaces
✔️ Azure ecosystem
✔️ Partner ecosystem
✔️ Operational information
In each section, you'll be able to determine the “must-haves” for your upgrade. After you're assured that the capabilities are available, or that there are reasonable workarounds in place, proceed to the Planning for an upgrade section of this guide.
This section describes which Data Lake Storage Gen1 platform capabilities are currently available in Data Lake Storage Gen2.
Capability | Data Lake Storage Gen1 | Data Lake Storage Gen2 - goal | Data Lake Storage Gen2 - availability status
---|---|---|---
Data organization | Supports data stored as folders and files | Supports data stored as objects/blobs as well as folders and files - Link | Supports data stored as folders and files – Available now. Supports data stored as objects/blobs – Not yet available |
Namespace | Hierarchical namespace | Flat namespace and Hierarchical namespaces | Flat namespace: Available now |
API | REST API over HTTPS | REST API over HTTP/HTTPS | Available now |
Server-side API | WebHDFS-compatible REST API | Azure Blob Service REST API. Data Lake Storage Gen2 REST API | Data Lake Storage Gen2 REST API – Available now. Azure Blob Service REST API – Not yet available |
Hadoop File System Client | Yes (Azure Data Lake Storage) | Yes (ABFS) | Available now |
Data Operations – Authorization | File and folder level POSIX Access Control Lists (ACLs) based on Azure Active Directory identities | File and folder level POSIX Access Control Lists (ACLs) based on Azure Active Directory identities. Shared Key for account-level authorization. Role Based Access Control (RBAC) to access containers | Available now |
Data Operations – Logs | Yes | One-off requests for logs for a specific duration using a support ticket. Azure Monitoring integration | One-off requests for logs for a specific duration using a support ticket – Available now. Azure Monitoring integration – Not yet available |
Encryption of data at rest | Transparent, server-side encryption with service-managed keys and with customer-managed keys in Azure Key Vault | Transparent, server-side encryption with service-managed keys and with customer-managed keys in Azure Key Vault | Service-managed keys – Available now. Customer-managed keys – Available now |
VNET | Virtual Network integration (Preview) | Service Endpoint | Available now |
Size limits | No limits on account sizes, file sizes, or number of files | No limits on account sizes or number of files. File size is limited to 5 TB. | Available now
Geo-redundancy | Locally-redundant (LRS) | Locally redundant (LRS), zone redundant (ZRS), globally redundant (GRS), read-access globally redundant (RA-GRS). See here for more information | Available now
Regional availability | See here | All Azure regions | Available now |
Price | See Pricing | See Pricing | |
Availability SLA | See SLA | See SLA | Available now |
Data Management | File Expiration | Lifecycle policies | Not yet available |
This table describes which API sets are available for your custom applications. To make things a bit clearer, we've separated these API sets into two types: management APIs and filesystem APIs.
Management APIs help you to manage accounts, while filesystem APIs help you to operate on the files and folders.
API set | Data Lake Storage Gen1 | Availability for Data Lake Storage Gen2 - with Shared Key auth | Availability for Data Lake Storage Gen2 - with OAuth auth |
---|---|---|---|
.NET SDK - management | Link | Not supported | Available now - Link |
.NET SDK – filesystem | Link | Not yet available | Not yet available |
Java SDK - management | Link | Not supported | Available now – Link |
Java SDK – filesystem | Link | Not yet available | Not yet available |
Node.js - management | Link | Not supported | Available now - Link |
Node.js - filesystem | Link | Not yet available | Not yet available |
Python - management | Link | Not supported | Available now - Link |
Python - filesystem | Link | Not yet available | Not yet available |
REST API - management | Link | Not supported | Available now - |
REST API - filesystem | Link | Available now | Available now - Link |
PowerShell - management and filesystem | Link | Management – Not supported. Filesystem – Not yet available | Management – Available now - Link. Filesystem – Not yet available
CLI – management | Link | Not supported | Available now - Link |
CLI - filesystem | Link | Not yet available | Not yet available |
Azure Resource Manager templates - management | Template1, Template2, Template3 | Not supported | Available now - Link
When using Data Lake Storage Gen1, you can use a variety of Microsoft services and products in your end-to-end pipelines. These services and products work with Data Lake Storage Gen1 either directly or indirectly. This table shows a list of the services that we've modified to work with Data Lake Storage Gen1, and shows which ones are currently compatible with Data Lake Storage Gen2.
Area | Availability for Data Lake Storage Gen1 | Availability for Data Lake Storage Gen2 – with Shared Key auth | Availability for Data Lake Storage Gen2 – with OAuth |
---|---|---|---|
Analytics framework | Apache Hadoop | Available now | Available now |
| | HDInsight | HDInsight 3.6 - Available now. HDInsight 4.0 - Not yet available | HDInsight 3.6 ESP – Not yet available. HDInsight 4.0 ESP - Not yet available |
| | Databricks Runtime 3.1 and above | Databricks Runtime 4.2 and above - Available now | Databricks Runtime 5.1 and above – Available now |
| | SQL Data Warehouse | Not supported | Available now - Link |
Data integration | Data Factory | Version 2 – Available now. Version 1 – Not supported | Version 2 – Available now. Version 1 – Not supported |
| | AdlCopy | Not supported | Not supported |
| | SQL Server Integration Services | Not yet available | Not yet available |
| | Data Catalog | Not yet available | Not yet available |
| | Logic Apps | Not yet available | Not yet available |
IoT | Event Hubs – Capture | Not yet available | Not yet available |
| | Stream Analytics | Not yet available | Not yet available |
Consumption | Power BI Desktop | Not yet available | Not yet available |
| | Excel | Not yet available | Not yet available |
| | Analysis Services | Not yet available | Not yet available |
Productivity | Azure Portal | Not supported | Account management – Available now. Data operations – Not yet available |
| | Data Lake Tools for Visual Studio | Not yet available | Not yet available |
| | Azure Storage Explorer | Available now | Available now |
| | Visual Studio Code | Not yet available | Not yet available |
This table shows a list of the third-party services and products that were modified to work with Data Lake Storage Gen1. It also shows which ones are currently compatible with Data Lake Storage Gen2.
Area | Partner | Product/Service | Availability for Data Lake Storage Gen1 | Availability for Data Lake Storage Gen2 – with Shared Key auth | Availability for Data Lake Storage Gen2 – with Oauth |
---|---|---|---|---|---|
Analytics framework | Cloudera | CDH | Link | Not yet available | Not yet available |
| | Cloudera | Altus | Link | NA | Not yet available |
| | HortonWorks | HDP 3.0 | Link | Not yet available | Not yet available |
| | Qubole | | Link | Not yet available | Not yet available |
ETL | StreamSets | | Link | Not yet available | Not yet available |
| | Informatica | | Link | Not yet available | Not yet available |
| | Attunity | | Link | Not yet available | Not yet available |
| | Alteryx | | Link | Not yet available | Not yet available |
| | ImanisData | | Link | Not yet available | Not yet available |
| | WANdisco | | Link | Link | Link |
Data Lake Storage Gen1 pushes specific information and data to other services, which helps you operationalize your pipelines. This table shows the availability of corresponding support in Data Lake Storage Gen2.
Type of data | Availability for Data Lake Storage Gen1 | Availability for Data Lake Storage Gen2 |
---|---|---|
Billing data - meters that are sent to commerce team for billing and then made available to customers | Available now | Available now |
Activity logs | Link | One-off requests for logs for a specific duration using a support ticket – Available now. Azure Monitoring integration – Not yet available |
Diagnostic logs | Link | One-off requests for logs for a specific duration using a support ticket – Available now. Azure Monitoring integration – Not yet available |
Metrics | Not supported | Available now - Link |
This section assumes that you've reviewed the Assess your upgrade readiness section of this guide, and that all of your dependencies are met. If there are capabilities that are still not available in Data Lake Storage Gen2, please proceed only if you know the corresponding workarounds. The following sections provide guidance on how you can plan for upgrade of your pipelines. Performing the actual upgrade will be described in the Performing the upgrade section of this guide.
The most critical part of the upgrade is deciding the strategy. This decision will determine the choices available to you.
This table lists some well-known strategies that have been used to migrate databases, Hadoop clusters, etc. We'll adopt similar strategies in our guidance, and adapt them to our context.
Strategy | Pros | Cons | When to use? |
---|---|---|---|
Lift-and-shift | Simplest. | Requires downtime for copying over data, moving jobs, and moving ingress and egress. | For simpler solutions, where multiple solutions don't access the same Gen1 account, and hence everything can be moved together in a quick, controlled fashion.
Copy-once-and-copy-incremental | Reduces downtime by performing the copy in the background while the source system is still active. | Requires downtime for moving ingress and egress. | The amount of data to be copied over is large, and the downtime associated with lift-and-shift is not acceptable. Testing may be required with significant production data on the target system before transition.
Parallel adoption | Least downtime. Allows time for applications to migrate at their own discretion. | Most elaborate, since a two-way sync is needed between the two systems. | For complex scenarios where applications built on Data Lake Storage Gen1 cannot be cut over all at once and must be moved over in an incremental fashion.
Below are more details on the steps involved in each of the strategies. The steps list what you would do with the components involved in the upgrade. These include the overall source system, the overall target system, the ingress sources for the source system, the egress destinations for the source system, and the jobs running on the source system.
These steps aren't meant to be prescriptive. They're meant to frame how we're thinking about each strategy. We'll provide case studies for each strategy as we see them being implemented.
Lift-and-shift:

1. Pause the source system: ingress sources, jobs, and egress destinations.
2. Copy all the data from the source system to the target system.
3. Point all the ingress sources to the target system, and point the egress destinations at the target system.
4. Move, modify, and run all the jobs on the target system.
5. Turn off the source system.
Copy-once-and-copy-incremental:

1. Copy over the data from the source system to the target system.
2. Copy over the incremental data from the source system to the target system at regular intervals.
3. Point the egress destinations at the target system.
4. Move, modify, and run all jobs on the target system.
5. Point the ingress sources incrementally to the target system at your convenience.
6. Once all the ingress sources are pointing to the target system, turn off incremental copying, and then turn off the source system.
Parallel adoption:

1. Set up the target system.
2. Set up two-way replication between the source system and the target system.
3. Point ingress sources incrementally to the target system.
4. Move, modify, and run jobs incrementally on the target system.
5. Point egress destinations incrementally at the target system.
6. After all of the original ingress sources, jobs, and egress destinations are working with the target system, turn off the source system.
The overall strategy that you use to perform your upgrade (described in the Upgrade strategy section of this guide) will determine the tools that you can use for your data upgrade. The tools listed below are based on current information and are suggestions.
Strategy | Tools | Pros | Considerations |
---|---|---|---|
Lift-and-shift | Azure Data Factory | Managed cloud service | Only copies over data. ACLs cannot be copied over currently. |
| | DistCp | Well-known Hadoop-provided tool. Permissions (ACLs) can be copied with this tool (see the example below). | Requires a cluster that can connect to both Data Lake Storage Gen1 and Gen2 at the same time. |
Copy-once-and-copy incremental | Azure Data Factory | Managed cloud service | To support incremental copying in ADF, data needs to be organized in a time-series fashion. Shortest interval between incremental copies is 15 minutes. For shorter intervals, ADF won't work. ACLs cannot be copied over currently. |
Parallel adoption | WANdisco | Supports consistent replication. If using a pure Hadoop environment connected to Azure Data Lake Storage, supports two-way replication | If not using a pure-Hadoop environment, there may be a delay in the replication.
Note that there are third parties that can handle the Data Lake Storage Gen1 to Data Lake Storage Gen2 upgrade without involving the above data/meta-data copying tools (for example, Cloudera). They provide a “one-stop shop” experience that performs data migration as well as workload migration. You may have to perform an out-of-band upgrade for any tools that are outside their ecosystem.
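For example, if you choose the DistCp option from the table above, the copy itself is a single command that you run from a cluster node that can reach both accounts. Here's a minimal sketch (wrapped in Python only so that the examples in this guide stay in one language); the account names, file system name, and paths are placeholders, and you should consult the DistCp documentation for the exact flags that preserve permissions and ACLs:

```python
import subprocess

# Placeholder source (Gen1) and destination (Gen2) URIs.
source = "adl://mygen1account.azuredatalakestore.net/raw/events"
destination = "abfss://myfilesystem@mygen2account.dfs.core.windows.net/raw/events"

# -update copies only files that are missing or changed at the destination,
# which also makes this usable for the incremental-copy strategy.
subprocess.run(["hadoop", "distcp", "-update", source, destination], check=True)
```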
- You'll need to manually create the Data Lake Storage Gen2 account before you start any part of the upgrade, because the current guidance doesn't include an automatic process that creates a Gen2 account based on your Gen1 account information. Make sure that you compare the account creation processes for Gen1 and Gen2. (A sketch of creating the account appears after this list.)
- Data Lake Storage Gen2 only supports files up to 5 TB in size. To upgrade to Data Lake Storage Gen2, you might need to resize the files in Data Lake Storage Gen1 so that they are smaller than 5 TB.
- If you use a tool that doesn't copy ACLs, or if you don't want to copy over the ACLs, then you'll need to set the ACLs on the destination manually at the appropriate top level. You can do that by using Storage Explorer. Ensure that those ACLs are default ACLs so that the files and folders that you copy over inherit them.
- In Data Lake Storage Gen1, the highest level at which you can set ACLs is the root of the account. In Data Lake Storage Gen2, however, the highest level at which you can set ACLs is the root folder of a file system, not the whole account. So, if you want to set default ACLs at the account level, you'll need to duplicate them across all the file systems in your Data Lake Storage Gen2 account.
- File naming restrictions are different between the two storage systems. These differences are especially concerning when copying from Data Lake Storage Gen2 to Data Lake Storage Gen1, since the latter has more constrained restrictions.
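As a starting point for the first consideration above, here's a minimal sketch of creating a Data Lake Storage Gen2 account (a general-purpose v2 storage account with the hierarchical namespace enabled) by using the Azure management SDK for Python. The subscription, resource group, account name, region, and SKU shown are placeholders, and the exact method names can vary between SDK versions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

subscription_id = "<subscription-id>"  # placeholder
client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# A general-purpose v2 account with the hierarchical namespace enabled
# is what makes this a Data Lake Storage Gen2 account.
poller = client.storage_accounts.begin_create(
    "my-resource-group",   # placeholder resource group
    "mygen2account",       # placeholder account name
    {
        "location": "eastus",
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},
        "is_hns_enabled": True,
    },
)
account = poller.result()
print(account.primary_endpoints.dfs)  # the dfs endpoint used by abfss:// URIs
```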
When you need to build applications on Data Lake Storage Gen1 or Data Lake Storage Gen2, you'll have to first choose an appropriate programming interface. When calling an API on that interface, you'll have to provide the appropriate URI and the appropriate credentials. The representation of these three elements (the API, the URI, and how the credentials are provided) differs between Data Lake Storage Gen1 and Data Lake Storage Gen2. So, as part of the application upgrade, you'll need to map these three constructs appropriately.
The main task here is to translate the adl:// URI that was being used in the existing workloads into an abfss:// URI.
The URI scheme for Data Lake Storage Gen1 is mentioned here in detail, but broadly speaking, it is adl://mydatalakestore.azuredatalakestore.net/<file_path>.
The URI scheme for accessing Data Lake Storage Gen2 files is explained here in detail, but broadly speaking, it is abfss://<FILE_SYSTEM_NAME>@<ACCOUNT_NAME>.dfs.core.windows.net/<PATH>.
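For example, a small helper like the following sketch can translate existing adl:// URIs into the corresponding abfss:// form. The account and file system names are hypothetical, and it assumes that you copied the data into a single, known file system:

```python
from urllib.parse import urlparse

def to_abfss_uri(adl_uri: str, gen2_account: str, file_system: str) -> str:
    """Map an adl:// URI to the equivalent abfss:// URI, keeping the path as-is."""
    parsed = urlparse(adl_uri)
    if parsed.scheme != "adl":
        raise ValueError(f"Not a Data Lake Storage Gen1 URI: {adl_uri}")
    return f"abfss://{file_system}@{gen2_account}.dfs.core.windows.net{parsed.path}"

# Hypothetical names, used for illustration only.
print(to_abfss_uri(
    "adl://mydatalakestore.azuredatalakestore.net/clickstream/2018/12/06",
    "mygen2account", "myfilesystem"))
# abfss://myfilesystem@mygen2account.dfs.core.windows.net/clickstream/2018/12/06
```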
You'll need to go through your existing applications and ensure that you've changed the URIs appropriately to point to Data Lake Storage Gen2 ones. Also, you'll need to add the appropriate credentials. Finally, how you retire the original applications and replace with the new application will have to be aligned closely to your overall upgrade strategy.
Depending on the interface your application uses with Data Lake Storage Gen1, you'll need to modify it to adapt it to Data Lake Storage Gen2.
If your application uses Data Lake Storage REST APIs, you'll need to modify your application to use the Data Lake Storage Gen2 REST APIs. Links are provided in the Programming interfaces section.
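For instance, where a Gen1 application called the WebHDFS-compatible endpoint, a Gen2 application calls the account's dfs.core.windows.net endpoint instead. Here's a minimal sketch that creates a file system and an empty file through the Data Lake Storage Gen2 REST API; the account name, file system name, path, and bearer token are placeholders:

```python
import requests

account = "mygen2account"          # placeholder account name
file_system = "myfilesystem"       # placeholder file system name
token = "<azure-ad-bearer-token>"  # placeholder OAuth token

headers = {"Authorization": f"Bearer {token}", "x-ms-version": "2018-11-09"}
base = f"https://{account}.dfs.core.windows.net"

# Create a file system (the container for folders and files in Gen2).
requests.put(f"{base}/{file_system}",
             params={"resource": "filesystem"}, headers=headers).raise_for_status()

# Create an empty file; separate append and flush calls upload its content.
requests.put(f"{base}/{file_system}/data/sample.txt",
             params={"resource": "file"}, headers=headers).raise_for_status()
```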
As called out in the Assess your upgrade readiness section, SDKs aren't currently available. If you want to port your applications over to Data Lake Storage Gen2, we recommend that you wait for supported SDKs to become available.
As called out in the Assess your upgrade readiness section, PowerShell support is not currently available for the data plane.
You could replace management plane cmdlets with the appropriate ones in Data Lake Storage Gen2. Links are provided in the Programming interfaces section.
As called out in Assess your upgrade readiness section, CLI support is not currently available for the data plane.
You could replace management plane commands with the appropriate ones in Data Lake Storage Gen2. Links are provided in the Programming interfaces section.
If your application creates meta-data about information in the store, such as explicit file and folder paths, you'll need to perform additional actions after the store data/meta-data upgrade. This is especially true of analytics frameworks such as Azure HDInsight and Azure Databricks, which usually create catalog data on top of the data in the store.
Analytics frameworks work with data and meta-data stored in the remote stores like Data Lake Storage Gen1 and Gen2. So, in theory, the engines can be ephemeral, and be brought up only when jobs need to run against the stored data.
However, to optimize performance, the analytics frameworks might create explicit references to the files and folders stored in the remote store, and then create a cache to hold them. If the URI of the remote data changes (for example, a cluster that previously stored data in Data Lake Storage Gen1 now stores it in Data Lake Storage Gen2), the URI for the same copied content will be different. So, after the data and meta-data upgrade, the caches for these engines also need to be updated or re-initialized.
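For example, if an engine keeps its table definitions in a Hive metastore, every table whose location still refers to an adl:// path needs to be pointed at the new abfss:// path after the copy. A minimal sketch of that kind of update from a Spark notebook, with hypothetical table and path names:

```python
# Assumes a notebook session where `spark` is already configured with
# credentials for the Data Lake Storage Gen2 account.
spark.sql("""
    ALTER TABLE sales.transactions
    SET LOCATION 'abfss://myfilesystem@mygen2account.dfs.core.windows.net/warehouse/sales/transactions'
""")
```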
As part of the planning process, you'll need to identify your application and figure out how meta-data information can be re-initialized to point to data that is now stored in Data Lake Storage Gen2. Below is guidance for commonly adopted analytics frameworks to help you with their upgrade steps.
Depending on the upgrade strategy you choose, the steps will differ. The current section assumes that you've chosen the “Lift-and-shift” strategy. Also, the existing Databricks workspace that used to access data in a Data Lake Storage Gen1 account is expected to work with the data that is copied over to the Data Lake Storage Gen2 account.
First, make sure that you've created the Gen2 account, and then copied over the data and meta-data from Gen1 to Gen2 by using an appropriate tool. Those tools are called out in the Data upgrade section of this guide.
Then, upgrade your existing Databricks cluster to start using Databricks runtime 5.1 or higher, which should support Data Lake Storage Gen2.
The steps thereafter are based on how the existing Databricks workspace accesses data in the Data Lake Storage Gen1 account. It can access the data either by calling adl:// URIs directly from notebooks, or through mount points.
If you are accessing directly from notebooks by providing the full adl:// URIs, you'll need to go through each notebook and change the configuration to access the corresponding Data Lake Storage Gen2 URI.
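For direct access, the notebook (or cluster) configuration also needs OAuth credentials for the ABFS driver. A sketch of what that change might look like, with placeholder account, service principal, and tenant values:

```python
# Placeholder account and service principal values.
account = "mygen2account"
spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net",
               "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net",
               dbutils.secrets.get(scope="my-scope", key="my-sp-secret"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# Read from the Gen2 URI instead of the old adl:// URI.
df = spark.read.csv("abfss://myfilesystem@mygen2account.dfs.core.windows.net/clickstream/")
```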
If the workspace accesses the data through a mount point, you'll need to reconfigure the mount point to point to the Data Lake Storage Gen2 account. After that, no more changes are needed, and the notebooks should be able to work as before.
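For mount points, one approach is to unmount the old Data Lake Storage Gen1 mount and re-create the same mount path over the Gen2 file system, so that notebooks that use the /mnt path keep working unchanged. A sketch with placeholder names:

```python
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="my-scope", key="my-sp-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Remove the old Gen1 mount and re-create it against the Gen2 file system.
dbutils.fs.unmount("/mnt/datalake")
dbutils.fs.mount(
    source="abfss://myfilesystem@mygen2account.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)
```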
If you are using any of the other upgrade strategies, you can create a variation of the above steps to meet your requirements.
Each of the tools and services called out in Azure ecosystem section of this guide will have to be configured to work with Data Lake Storage Gen2.
First, ensure that there is integration available with Data Lake Storage Gen2.
Then, the elements called out above (for example, the URI and credentials) will have to be changed. You could modify the existing instance that works with Data Lake Storage Gen1, or you could create a new instance that works with Data Lake Storage Gen2.
Please work with the partner providing the component and tools to ensure they can work with Data Lake Storage Gen2.
By this point, you'll have gone through the Assess your upgrade readiness section and the Planning for an upgrade section of this guide, gathered all of the necessary information, and created a plan that meets your needs. You'll probably also have a testing task during this phase.
Depending on the strategy you choose and the complexities of your solution, this phase could be a short one or an extended one where there are multiple workloads waiting to be incrementally moved over to Data Lake Storage Gen2. This will be the most critical part of your upgrade.
After you are done with the transition operation, the final steps will involve thorough verification. This would include, but not be limited to, verifying that data has been copied over reliably, verifying that ACLs have been set correctly, and verifying that end-to-end pipelines are functioning correctly. After the verifications have been completed, you can turn off your old pipelines, delete your source Data Lake Storage Gen1 accounts, and go full speed on your Data Lake Storage Gen2-based solutions.
The guidance provided in this document should have helped you upgrade your solution to use Data Lake Storage Gen2.
If you have more questions, or have feedback, provide comments below or provide feedback in the Azure Feedback Forum.