---
title: Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2 Preview
description: Upgrade your solution to use Azure Data Lake Storage Gen2 Preview
services: storage
author: normesta
ms.topic: conceptual
ms.author: normesta
ms.date: 12/06/2018
ms.service: storage
ms.component: data-lake-storage-gen2
---

Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2 Preview

If you're using Azure Data Lake Storage Gen1 in your big data analytics solutions, this guide helps you to upgrade those solutions to use Azure Data Lake Storage Gen2 Preview. You can use this document to assess the dependencies that your solution has on Data Lake Storage Gen1. This guide also shows you how to plan and perform the upgrade.

We'll help you through the following tasks:

✔️ Assess your upgrade readiness

✔️ Plan for an upgrade

✔️ Perform the upgrade

Assess your upgrade readiness

Our goal is to make all of the capabilities that are present in Data Lake Storage Gen1 available in Data Lake Storage Gen2. How those capabilities are exposed (for example, in the SDK or the CLI) might differ between Data Lake Storage Gen1 and Data Lake Storage Gen2. Applications and services that work with Data Lake Storage Gen1 need to be able to work similarly with Data Lake Storage Gen2. Finally, some of the capabilities won't be available in Data Lake Storage Gen2 right away. As they become available, we'll announce them in this document.

These next sections will help you decide how best to upgrade to Data Lake Storage Gen2, and when it might make the most sense to do so.

Data Lake Storage Gen1 solution components

Most likely, when you use Data Lake Storage Gen1 in your analytics solutions or pipelines, there are many additional technologies that you employ to achieve the overall end-to-end functionality. This article describes various components of the data flow that include ingesting, processing, and consuming data.

In addition, there are cross-cutting components to provision, manage, and monitor these components. Each of these components operates with Data Lake Storage Gen1 by using the interface best suited to it. When you're planning to upgrade your solution to Data Lake Storage Gen2, you'll need to be aware of the interfaces that are used. You'll need to upgrade both the management interfaces and the data interfaces, since each interface has distinct requirements.

Data Lake Storage Solution Components

Figure 1 above shows the functionality components that you would see in most analytics solutions.

Figure 2 shows an example of how those components will be implemented by using specific technologies.

The Storing functionality in Figure 1 is provided by Data Lake Storage Gen1 (Figure 2). Note how the various components in the data flow interact with Data Lake Storage Gen1 by using REST APIs or the Java SDK. Also note how the cross-cutting functionality components interact with Data Lake Storage Gen1. The Provisioning component uses Azure Resource Manager templates, whereas the Monitoring component uses Log Analytics to work with operational data that comes from Data Lake Storage Gen1.

To upgrade a solution from Data Lake Storage Gen1 to Data Lake Storage Gen2, you'll need to copy the data and metadata, re-hook the data flows, and make sure that all of the components can work with Data Lake Storage Gen2.

The sections below provide information to help you make better decisions:

✔️ Platform capabilities

✔️ Programming interfaces

✔️ Azure ecosystem

✔️ Partner ecosystem

✔️ Operational information

In each section, you'll be able to determine the “must-haves” for your upgrade. After you're assured that the capabilities you need are available, or that there are reasonable workarounds in place, proceed to the Planning for an upgrade section of this guide.

Platform capabilities

This section describes which Data Lake Storage Gen1 platform capabilities are currently available in Data Lake Storage Gen2.

|  | Data Lake Storage Gen1 | Data Lake Storage Gen2 - goal | Data Lake Storage Gen2 - availability status |
|---|---|---|---|
| Data organization | Supports data stored as folders and files | Supports data stored as objects/blobs as well as folders and files - Link | Supports data stored as folders and files – Available now<br>Supports data stored as objects/blobs – Not yet available |
| Namespace | Hierarchical namespace | Flat namespace and hierarchical namespaces | Flat namespace: Available now |
| API | REST API over HTTPS | REST API over HTTP/HTTPS | Available now |
| Server-side API | WebHDFS-compatible REST API | Azure Blob Service REST API<br>Data Lake Storage Gen2 REST API | Data Lake Storage Gen2 REST API – Available now<br>Azure Blob Service REST API – Not yet available |
| Hadoop File System Client | Yes (Azure Data Lake Storage) | Yes (ABFS) | Available now |
| Data Operations – Authorization | File and folder level POSIX Access Control Lists (ACLs) based on Azure Active Directory identities | File and folder level POSIX Access Control Lists (ACLs) based on Azure Active Directory identities<br>Shared Key for account-level authorization<br>Role Based Access Control (RBAC) to access containers | Available now |
| Data Operations – Logs | Yes | One-off requests for logs for a specific duration using a support ticket<br>Azure Monitoring integration | One-off requests for logs for a specific duration using a support ticket – Available now<br>Azure Monitoring integration – Not yet available |
| Encryption data at rest | Transparent, server side with service-managed keys and with customer-managed keys in Azure KeyVault | Transparent, server side with service-managed keys and with customer-managed keys in Azure KeyVault | Service-managed keys – Available now<br>Customer-managed keys – Available now |
| VNET | Virtual Network integration (Preview) | Service Endpoint | Available now |
| Size limits | No limits on account sizes, file sizes, or number of files | No limits on account sizes or number of files. File size limited to 5 TB. | Available now |
| Geo-redundancy | Locally-redundant (LRS) | Locally redundant (LRS)<br>Zone redundant (ZRS)<br>Globally redundant (GRS)<br>Read-access globally redundant (RA-GRS)<br>See here for more information | Available now |
| Regional availability | See here | All Azure regions | Available now |
| Price | See Pricing | See Pricing | |
| Availability SLA | See SLA | See SLA | Available now |
| Data Management | File Expiration | Lifecycle policies | Not yet available |

Programming interfaces

This table describes which API sets are available for your custom applications. To make things a bit clearer, we've separated these API sets into two types: management APIs and filesystem APIs.

Management APIs help you to manage accounts, while filesystem APIs help you to operate on the files and folders.

| API set | Data Lake Storage Gen1 | Availability for Data Lake Storage Gen2 - with Shared Key auth | Availability for Data Lake Storage Gen2 - with OAuth auth |
|---|---|---|---|
| .NET SDK - management | Link | Not supported | Available now - Link |
| .NET SDK - filesystem | Link | Not yet available | Not yet available |
| Java SDK - management | Link | Not supported | Available now - Link |
| Java SDK - filesystem | Link | Not yet available | Not yet available |
| Node.js - management | Link | Not supported | Available now - Link |
| Node.js - filesystem | Link | Not yet available | Not yet available |
| Python - management | Link | Not supported | Available now - Link |
| Python - filesystem | Link | Not yet available | Not yet available |
| REST API - management | Link | Not supported | Available now |
| REST API - filesystem | Link | Available now | Available now - Link |
| PowerShell - management and filesystem | Link | Management – Not supported<br>Filesystem – Not yet available | Management – Available now - Link<br>Filesystem – Not yet available |
| CLI - management | Link | Not supported | Available now - Link |
| CLI - filesystem | Link | Not yet available | Not yet available |
| Azure Resource Manager templates - management | Template1<br>Template2<br>Template3 | Not supported | Available now - Link |
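
As an illustration of the Data Lake Storage Gen2 filesystem REST API listed in the table above, here's a minimal sketch that creates a filesystem. It assumes you've already obtained an Azure AD (OAuth) bearer token for an identity that has access to the account; the account name, filesystem name, and REST API version are placeholder assumptions, not values from this guide.

```python
# Minimal sketch: create a filesystem with the Data Lake Storage Gen2 REST API.
# Token acquisition is not shown; the bearer token is assumed to come from Azure AD.
import requests

account_name = "mygen2account"          # hypothetical account name
filesystem_name = "myfilesystem"        # hypothetical filesystem name
bearer_token = "<oauth-access-token>"   # assumption: acquired for a service principal

url = f"https://{account_name}.dfs.core.windows.net/{filesystem_name}?resource=filesystem"
headers = {
    "Authorization": f"Bearer {bearer_token}",
    "x-ms-version": "2018-11-09",       # assumption: a service version current at the time of writing
}

response = requests.put(url, headers=headers)
response.raise_for_status()
print(f"Created filesystem '{filesystem_name}' in account '{account_name}'")
```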

Azure ecosystem

When using Data Lake Storage Gen1, you can use a variety of Microsoft services and products in your end-to-end pipelines. These services and products work with Data Lake Storage Gen1 either directly or indirectly. This table shows a list of the services that we've modified to work with Data Lake Storage Gen1, and shows which ones are currently compatible with Data Lake Storage Gen2.

| Area | Availability for Data Lake Storage Gen1 | Availability for Data Lake Storage Gen2 – with Shared Key auth | Availability for Data Lake Storage Gen2 – with OAuth |
|---|---|---|---|
| Analytics framework | Apache Hadoop | Available now | Available now |
| | HDInsight | HDInsight 3.6 – Available now<br>HDInsight 4.0 – Not yet available | HDInsight 3.6 ESP – Not yet available<br>HDInsight 4.0 ESP – Not yet available |
| | Databricks Runtime 3.1 and above | Databricks Runtime 4.2 and above – Available now | Databricks Runtime 5.1 and above – Available now |
| | SQL Data Warehouse | Not supported | Available now - Link |
| Data integration | Data Factory | Version 2 – Available now<br>Version 1 – Not supported | Version 2 – Available now<br>Version 1 – Not supported |
| | AdlCopy | Not supported | Not supported |
| | SQL Server Integration Services | Not yet available | Not yet available |
| | Data Catalog | Not yet available | Not yet available |
| | Logic Apps | Not yet available | Not yet available |
| IoT | Event Hubs – Capture | Not yet available | Not yet available |
| | Stream Analytics | Not yet available | Not yet available |
| Consumption | PowerBI Desktop | Not yet available | Not yet available |
| | Excel | Not yet available | Not yet available |
| | Analysis Services | Not yet available | Not yet available |
| Productivity | Azure Portal | Not supported | Account management – Available now<br>Data operations – Not yet available |
| | Data Lake Tools for Visual Studio | Not yet available | Not yet available |
| | Azure Storage Explorer | Available now | Available now |
| | Visual Studio Code | Not yet available | Not yet available |

Partner ecosystem

This table shows a list of the third-party services and products that were modified to work with Data Lake Storage Gen1. It also shows which ones are currently compatible with Data Lake Storage Gen2.

| Area | Partner | Product/Service | Availability for Data Lake Storage Gen1 | Availability for Data Lake Storage Gen2 – with Shared Key auth | Availability for Data Lake Storage Gen2 – with OAuth |
|---|---|---|---|---|---|
| Analytics framework | Cloudera | CDH | Link | Not yet available | Not yet available |
| | Cloudera | Altus | Link | NA | Not yet available |
| | HortonWorks | HDP 3.0 | Link | Not yet available | Not yet available |
| | Qubole | | Link | Not yet available | Not yet available |
| ETL | StreamSets | | Link | Not yet available | Not yet available |
| | Informatica | | Link | Not yet available | Not yet available |
| | Attunity | | Link | Not yet available | Not yet available |
| | Alteryx | | Link | Not yet available | Not yet available |
| | ImanisData | | Link | Not yet available | Not yet available |
| | WANdisco | | Link | Link | Link |

Operational information

Data Lake Storage Gen1 pushes specific information and data to other services which helps you to operationalize your pipelines. This table shows availability of corresponding support in Data Lake Storage Gen2.

| Type of data | Availability for Data Lake Storage Gen1 | Availability for Data Lake Storage Gen2 |
|---|---|---|
| Billing data - meters that are sent to commerce team for billing and then made available to customers | Available now | Available now |
| Activity logs | Link | One-off requests for logs for a specific duration using a support ticket – Available now<br>Azure Monitoring integration – Not yet available |
| Diagnostic logs | Link | One-off requests for logs for a specific duration using a support ticket – Available now<br>Azure Monitoring integration – Not yet available |
| Metrics | Not supported | Available now - Link |

Planning for an upgrade

This section assumes that you've reviewed the Assess your upgrade readiness section of this guide, and that all of your dependencies are met. If there are capabilities that are still not available in Data Lake Storage Gen2, please proceed only if you know the corresponding workarounds. The following sections provide guidance on how you can plan for upgrade of your pipelines. Performing the actual upgrade will be described in the Performing the upgrade section of this guide.

Upgrade strategy

The most critical part of the upgrade is deciding the strategy. This decision will determine the choices available to you.

This table lists some well-known strategies that have been used to migrate databases, Hadoop clusters, etc. We'll adopt similar strategies in our guidance, and adapt them to our context.

| Strategy | Pros | Cons | When to use? |
|---|---|---|---|
| Lift-and-shift | Simplest. | Requires downtime for copying over data, moving jobs, and moving ingress and egress. | For simpler solutions, where the same Gen1 account isn't accessed by multiple solutions and everything can therefore be moved together in a quick, controlled fashion. |
| Copy-once-and-copy-incremental | Reduces downtime by performing the copy in the background while the source system is still active. | Requires downtime for moving ingress and egress. | The amount of data to be copied over is large and the downtime associated with lift-and-shift is not acceptable. Testing may be required with significant production data on the target system before transition. |
| Parallel adoption | Least downtime.<br>Allows time for applications to migrate at their own discretion. | Most elaborate, since two-way sync is needed between the two systems. | For complex scenarios where applications built on Data Lake Storage Gen1 can't be cut over all at once and must be moved over in an incremental fashion. |

Below are more details on the steps involved in each strategy. The steps list what you would do with the components involved in the upgrade. This includes the overall source system, the overall target system, ingress sources for the source system, egress destinations for the source system, and jobs running on the source system.

These steps are not meant to be prescriptive. They're meant to set the framework for how we're thinking about each strategy. We'll provide case studies for each strategy as we see them being implemented.

Lift-and-shift

  1. Pause the source system – ingress sources, jobs, egress destinations.

  2. Copy all the data from the source system to the target system.

  3. Point all the ingress sources to the target system. Point the egress destinations to read from the target system.

  4. Move, modify, and run all the jobs on the target system.

  5. Turn off the source system.

Copy-once and copy-incremental

  1. Copy over the data from the source system to the target system.

  2. Copy over the incremental data from the source system to the target system at regular intervals.

  3. Point to the egress destination from the target system.

  4. Move, modify, run all jobs on the target system.

  5. Point ingress sources to the target system incrementally, at your convenience.

  6. Once all ingress sources are pointing to the target system:

    1. Turn off incremental copying.

    2. Turn off the source system.

Parallel adoption

  1. Set up target system.

  2. Set up a two-way replication between source system and target system.

  3. Point ingress sources incrementally to the target system.

  4. Move, modify, and run jobs incrementally on the target system.

  5. Incrementally point the egress destinations to read from the target system.

  6. After all the original ingress sources, jobs, and egress destinations are working with the target system, turn off the source system.

Data upgrade

The overall strategy that you use to perform your upgrade (described in the Upgrade strategy section of this guide), will determine the tools that you can use for your data upgrade. The tools listed below are based on current information and are suggestions.

Tools guidance

| Strategy | Tools | Pros | Considerations |
|---|---|---|---|
| Lift-and-shift | Azure Data Factory | Managed cloud service | Only copies over data. ACLs can't be copied over currently. |
| | DistCp | Well-known Hadoop-provided tool<br>Permissions (ACLs) can be copied with this tool | Requires a cluster that can connect to both Data Lake Storage Gen1 and Gen2 at the same time. |
| Copy-once-and-copy-incremental | Azure Data Factory | Managed cloud service | To support incremental copying in ADF, data needs to be organized in a time-series fashion. The shortest interval between incremental copies is 15 minutes; ADF won't work for shorter intervals. ACLs can't be copied over currently. |
| Parallel adoption | WANdisco | Supports consistent replication<br>Supports two-way replication if using a pure Hadoop environment connected to Azure Data Lake Storage | If not using a pure Hadoop environment, there may be a delay in the replication. |

Note that there are third parties that can handle the Data Lake Storage Gen1 to Data Lake Storage Gen2 upgrade without involving the above data/metadata copying tools (for example: Cloudera). They provide a “one-stop shop” experience that performs data migration as well as workload migration. You may have to perform an out-of-band upgrade for any tools that are outside their ecosystem.

Considerations

  • You'll need to manually create the Data Lake Storage Gen2 account before you start any part of the upgrade, since the current guidance doesn't include an automated way to create a Gen2 account based on your Gen1 account information. Ensure that you compare the account creation processes for Gen1 and Gen2. (A minimal provisioning sketch follows this list.)

  • Data Lake Storage Gen2 only supports files up to 5 TB in size. To upgrade to Data Lake Storage Gen2, you might need to resize the files in Data Lake Storage Gen1 so that they are smaller than 5 TB in size.

  • If you use a tool that doesn't copy ACLs or you don't want to copy over the ACLs, then you'll need to set the ACLs on the destination manually at the appropriate top level. You can do that by using Storage Explorer. Ensure that those ACLs are the default ACLs so that the files and folders that you copy over inherit them.

  • In Data Lake Storage Gen1, the highest level at which you can set ACLs is the root of the account. In Data Lake Storage Gen2, however, the highest level at which you can set ACLs is the root folder of a filesystem, not the whole account. So, if you want to set default ACLs at the account level, you'll need to duplicate those across all the filesystems in your Data Lake Storage Gen2 account.

  • File naming restrictions are different between the two storage systems. These differences are especially a concern when copying from Data Lake Storage Gen2 back to Data Lake Storage Gen1, since the latter has more restrictive naming rules.
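
As noted in the first consideration above, you create the Data Lake Storage Gen2 account yourself. Here's a minimal sketch that does this with the Azure management SDK for Python, assuming a service principal with rights to create resources; the resource group, account name, region, and credential values are hypothetical, and the exact model and method names can vary between SDK versions.

```python
# Minimal sketch: create a Data Lake Storage Gen2 account (a general-purpose v2
# storage account with the hierarchical namespace enabled).
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku, Kind

credentials = ServicePrincipalCredentials(
    client_id="<application-id>",   # assumption: an Azure AD service principal
    secret="<application-secret>",
    tenant="<tenant-id>",
)
client = StorageManagementClient(credentials, "<subscription-id>")

poller = client.storage_accounts.create(
    "my-resource-group",            # hypothetical resource group
    "mygen2account",                # hypothetical account name
    StorageAccountCreateParameters(
        sku=Sku(name="Standard_LRS"),
        kind=Kind.storage_v2,       # Gen2 requires a general-purpose v2 account
        location="eastus2",
        is_hns_enabled=True,        # enable the hierarchical namespace
    ),
)
account = poller.result()
print(account.name, account.provisioning_state)
```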

Application upgrade

When you build applications on Data Lake Storage Gen1 or Data Lake Storage Gen2, you first have to choose an appropriate programming interface. When calling an API on that interface, you have to provide the appropriate URI and the appropriate credentials. The representation of these three elements (the API, the URI, and how the credentials are provided) differs between Data Lake Storage Gen1 and Data Lake Storage Gen2. So, as part of the application upgrade, you'll need to map these three constructs appropriately.

URI changes

The main task here is to translate the adl:// URI that was being used in the existing workloads into an abfss:// URI.

The URI scheme for Data Lake Storage Gen1 is mentioned here in detail, but broadly speaking, it is adl://mydatalakestore.azuredatalakestore.net/<file_path>.

The URI scheme for accessing Data Lake Storage Gen2 files is explained here in detail, but broadly speaking, it is abfss://<FILE_SYSTEM_NAME>@<ACCOUNT_NAME>.dfs.core.windows.net/<PATH>.
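
To make the translation concrete, here's a small, purely illustrative helper (not part of any SDK) that maps a Gen1 adl:// URI to a Gen2 abfss:// URI, assuming the Gen1 folder structure was copied as-is into a single Gen2 filesystem; the account and filesystem names in the example are hypothetical.

```python
def adl_to_abfss(adl_uri: str, gen2_account: str, filesystem: str) -> str:
    """Translate an adl:// (Gen1) URI into an abfss:// (Gen2) URI."""
    prefix = "adl://"
    if not adl_uri.startswith(prefix):
        raise ValueError(f"Not a Data Lake Storage Gen1 URI: {adl_uri}")
    # Drop the Gen1 host (for example, mydatalakestore.azuredatalakestore.net) and keep the path.
    _, _, path = adl_uri[len(prefix):].partition("/")
    return f"abfss://{filesystem}@{gen2_account}.dfs.core.windows.net/{path}"

# Example:
# adl_to_abfss("adl://mydatalakestore.azuredatalakestore.net/raw/events/2018/12/06",
#              "mygen2account", "myfilesystem")
# -> "abfss://myfilesystem@mygen2account.dfs.core.windows.net/raw/events/2018/12/06"
```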

You'll need to go through your existing applications and ensure that you've changed the URIs appropriately to point to Data Lake Storage Gen2 ones. Also, you'll need to add the appropriate credentials. Finally, how you retire the original applications and replace them with the new ones will have to be aligned closely with your overall upgrade strategy.

Custom applications

Depending on the interface your application uses with Data Lake Storage Gen1, you'll need to modify it to adapt it to Data Lake Storage Gen2.

REST APIs

If your application uses the Data Lake Storage Gen1 REST APIs, you'll need to modify it to use the Data Lake Storage Gen2 REST APIs. Links are provided in the Programming interfaces section.

SDKs

As called out in the Assess your upgrade readiness section, SDKs aren't currently available. If you want to port your applications over to Data Lake Storage Gen2, we recommend that you wait for supported SDKs to be available.

PowerShell

As called out in the Assess your upgrade readiness section, PowerShell support is not currently available for the data plane.

You could replace management plane cmdlets with the appropriate ones in Data Lake Storage Gen2. Links are provided in the Programming interfaces section.

CLI

As called out in the Assess your upgrade readiness section, CLI support is not currently available for the data plane.

You could replace management plane commands with the appropriate ones in Data Lake Storage Gen2. Links are provided in the Programming interfaces section.

Analytics frameworks upgrade

If your application creates metadata about information in the store, such as explicit file and folder paths, you'll need to perform additional actions after the store data/metadata upgrade. This is especially true of analytics frameworks such as Azure HDInsight and Azure Databricks, which usually create catalog data over the data in the store.

Analytics frameworks work with data and meta-data stored in the remote stores like Data Lake Storage Gen1 and Gen2. So, in theory, the engines can be ephemeral, and be brought up only when jobs need to run against the stored data.

However, to optimize performance, the analytics frameworks might create explicit references to the files and folders stored in the remote store, and then create a cache to hold them. If the URI of the remote data changes (for example, a cluster that was storing data in Data Lake Storage Gen1 now stores it in Data Lake Storage Gen2), the URI for the same copied content will be different. So, after the data and metadata upgrade, the caches for these engines also need to be updated or re-initialized.
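
For engines that keep table definitions in a Hive-style metastore, that refresh typically means re-pointing each table at its new location and invalidating cached listings. Here's a minimal sketch in Spark SQL (run from a Python notebook with an active spark session); the table name and abfss:// path are hypothetical, and the exact statements you need depend on how your tables were defined.

```python
# Re-point an external table at its new Data Lake Storage Gen2 location.
spark.sql("""
  ALTER TABLE sales_events
  SET LOCATION 'abfss://myfilesystem@mygen2account.dfs.core.windows.net/raw/sales_events'
""")

# Invalidate any cached file listings and metadata for the table.
spark.sql("REFRESH TABLE sales_events")

# For partitioned tables, re-discover the partitions under the new location.
spark.sql("MSCK REPAIR TABLE sales_events")
```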

As part of the planning process, you'll need to identify your application and figure out how meta-data information can be re-initialized to point to data that is now stored in Data Lake Storage Gen2. Below is guidance for commonly adopted analytics frameworks to help you with their upgrade steps.

Azure Databricks

Depending on the upgrade strategy you choose, the steps will differ. The current section assumes that you've chosen the “Lift-and-shift” strategy. Also, the existing Databricks workspace that used to access data in a Data Lake Storage Gen1 account is expected to work with the data that is copied over to the Data Lake Storage Gen2 account.

First, make sure that you've created the Gen2 account, and then copied over the data and metadata from Gen1 to Gen2 by using an appropriate tool. Those tools are called out in the Data upgrade section of this guide.

Then, upgrade your existing Databricks cluster to start using Databricks runtime 5.1 or higher, which should support Data Lake Storage Gen2.

The steps thereafter are based on how the existing Databricks workspace accesses data in the Data Lake Storage Gen1 account. It can access the data either by calling adl:// URIs directly from notebooks, or through mount points.

If you are accessing directly from notebooks by providing the full adl:// URIs, you'll need to go through each notebook and change the configuration to access the corresponding Data Lake Storage Gen2 URI.

If you're accessing the data through a mount point, you'll need to reconfigure that mount point to point to the Data Lake Storage Gen2 account. After that, no further changes are needed, and the notebooks should be able to work as before.
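
For the mount point case, here's a minimal sketch of replacing a Gen1 mount with a Gen2 (abfss://) mount from a Databricks notebook, assuming a service principal that has access to the Gen2 account. The account, filesystem, mount point, and secret scope names are hypothetical, and the exact configuration keys can differ between Databricks runtime versions.

```python
# OAuth configuration for the ABFS driver, using a service principal (assumed values).
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="my-scope", key="my-sp-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Remove the old Gen1 mount and mount the corresponding Gen2 filesystem in its place.
dbutils.fs.unmount("/mnt/datalake")
dbutils.fs.mount(
    source="abfss://myfilesystem@mygen2account.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)
```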

If you are using any of the other upgrade strategies, you can create a variation of the above steps to meet your requirements.

Azure ecosystem upgrade

Each of the tools and services called out in the Azure ecosystem section of this guide will have to be configured to work with Data Lake Storage Gen2.

First, ensure that there is integration available with Data Lake Storage Gen2.

Then, the elements called out above (for example, the URI and credentials) will have to be changed. You could modify the existing instance that works with Data Lake Storage Gen1, or you could create a new instance that works with Data Lake Storage Gen2.

Partner ecosystem upgrade

Please work with the partner providing the components and tools to ensure that they can work with Data Lake Storage Gen2.

Performing the upgrade

Pre-upgrade

By this point, you've gone through the Assess your upgrade readiness and Planning for an upgrade sections of this guide, gathered all of the necessary information, and created a plan that meets your needs. You'll probably also have a testing task during this phase.

In-upgrade

Depending on the strategy you choose and the complexities of your solution, this phase could be a short one or an extended one where there are multiple workloads waiting to be incrementally moved over to Data Lake Storage Gen2. This will be the most critical part of your upgrade.

Post-upgrade

After you've completed the transition, the final steps involve thorough verification. This includes, but isn't limited to, verifying that data has been copied over reliably, that ACLs have been set correctly, and that end-to-end pipelines are functioning correctly. After these verifications are complete, you can turn off your old pipelines, delete your source Data Lake Storage Gen1 accounts, and go full speed on your Data Lake Storage Gen2-based solutions.

Conclusion

The guidance provided in this document should have helped you upgrade your solution to use Data Lake Storage Gen2.

If you have more questions, or have feedback, provide comments below or provide feedback in the Azure Feedback Forum.