---
title: Batch and HPC solutions in the cloud | Microsoft Docs
description: Learn about batch and high-performance computing (HPC and Big Compute) scenarios and solution options in Azure
services: batch, virtual-machines, cloud-services
documentationcenter: ''
author: dlepow
manager: timlt
editor: ''
ms.assetid: aab5401d-2baf-4cf2-bf20-ad224de33888
ms.service: batch
ms.devlang: NA
ms.topic: get-started-article
ms.tgt_pltfrm: NA
ms.workload: big-compute
ms.date: 11/17/2016
ms.author: danlep
---
Azure offers efficient, scalable cloud solutions for batch and high-performance computing (HPC), also called Big Compute. Learn here about Big Compute workloads and the Azure services that support them, or jump directly to the solution scenarios later in this article. This article is mainly for technical decision-makers, IT managers, and independent software vendors, but other IT professionals and developers can use it to familiarize themselves with these solutions.
Organizations have large-scale computing problems: engineering design and analysis, image rendering, complex modeling, Monte Carlo simulations, financial risk calculations, and others. Azure helps organizations solve these problems with the resources, scale, and schedule they need. With Azure, organizations can:
- Create hybrid solutions, extending an on-premises HPC cluster to offload peak workloads to the cloud
- Run HPC cluster tools and workloads entirely in Azure
- Use managed and scalable Azure services such as Batch to run compute-intensive workloads without having to deploy and manage compute infrastructure
Although beyond the scope of this article, Azure also provides developers and partners a full set of capabilities, architecture choices, and development tools to build large-scale, custom Big Compute workflows. And a growing partner ecosystem is ready to help you make your Big Compute workloads productive in the Azure cloud.
Unlike web applications and many line-of-business applications, batch and HPC applications have a defined beginning and end, and they can run on a schedule or on demand, sometimes for hours or longer. Most fall into two main categories: intrinsically parallel (sometimes called “embarrassingly parallel”, because the problems they solve lend themselves to running in parallel on multiple computers or processors) and tightly coupled, in which the parallel processes must interact or exchange intermediate results while they run. Some Azure solution approaches work better for one type than the other.
> [!NOTE]
> In Batch and HPC solutions, a running instance of an application is typically called a job, and each job might be divided into tasks. The clustered compute resources for the application are often called compute nodes.
Many applications designed to run in on-premises HPC clusters can readily migrate to Azure, or to a hybrid (cross-premises) environment. However, there may be some limitations or considerations, including:
- Availability of cloud resources - Depending on the type of cloud compute resources you use, you might not be able to rely on continuous machine availability while a job runs. State handling and progress checkpointing are common techniques to handle possible transient failures, and they become more important when using cloud resources (see the checkpointing sketch after this list).
- Data access - Data access techniques commonly available in enterprise clusters, such as NFS, may require special configuration in the cloud. Or, you might need to adopt different data access practices and patterns for the cloud.
- Data movement - For applications that process large amounts of data, strategies are needed to move the data into cloud storage and to compute resources. You might need high-speed cross-premises networking such as Azure ExpressRoute. Also consider legal, regulatory, or policy limitations for storing or accessing that data.
- Licensing - Check with the vendor of any commercial application for licensing or other restrictions for running in the cloud. Not all vendors offer pay-as-you-go licensing. You might need to plan for a licensing server in the cloud for your solution, or connect to an on-premises license server.
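To make the checkpointing consideration concrete, here is a minimal, hypothetical sketch in Python. The file name, work loop, and process function are placeholders and not part of any Azure SDK; in practice, you would persist the checkpoint to durable storage such as a blob or file share rather than local disk.

```python
import json
import os

CHECKPOINT_FILE = "checkpoint.json"  # placeholder path; use durable storage in practice

def load_checkpoint():
    """Resume from the last saved position, or start fresh."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["next_item"]
    return 0

def save_checkpoint(next_item):
    """Persist progress so a preempted or restarted node can resume."""
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"next_item": next_item}, f)

def process(item):
    pass  # placeholder for the real per-item compute step

def run_job(items):
    start = load_checkpoint()
    for i in range(start, len(items)):
        process(items[i])
        save_checkpoint(i + 1)  # record progress after each completed item
```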
The dividing line between Big Compute and Big Data applications isn't always clear, and some applications may have characteristics of both. Both involve running large-scale computations, usually on clusters of computers. But the solution approaches and supporting tools can differ.
- Big Compute tends to involve applications that rely on CPU power and memory, such as engineering simulations, financial risk modeling, and digital rendering. The infrastructure for a Big Compute solution might include computers with specialized multicore processors to perform raw computation, and specialized, high-speed networking hardware to connect the computers.
- Big Data solves data analysis problems that involve large amounts of data that can’t be managed by a single computer or database management system. Examples include large volumes of web logs or other business intelligence data. Big Data tends to rely more on disk capacity and I/O performance than on CPU power. There are also specialized Big Data tools such as Apache Hadoop to manage the cluster and partition the data. (For information about Azure HDInsight and other Azure Hadoop solutions, see Hadoop.)
Batch and HPC solutions often include a cluster manager and a job scheduler to help manage clustered compute resources and allocate them to the applications that run the jobs. These functions might be provided by separate tools, or by an integrated tool or service.
- Cluster manager - Provisions, releases, and administers compute resources (or compute nodes). A cluster manager might automate installation of operating system images and applications on compute nodes, scale compute resources according to demands, and monitor the performance of the nodes.
- Job scheduler - Specifies the resources (such as processors or memory) an application needs, and the conditions under which it runs. A job scheduler maintains a queue of jobs and allocates resources to them based on an assigned priority or other characteristics, as illustrated in the sketch after this list.
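To illustrate the job scheduler's role, here is a minimal toy sketch in Python of priority-based queuing and dispatch. It shows the concept only; it does not model the behavior of any specific scheduler product.

```python
import heapq

class JobScheduler:
    """Toy scheduler: keeps a priority queue of jobs and dispatches the
    highest-priority job whenever enough compute nodes are free."""

    def __init__(self, total_nodes):
        self.free_nodes = total_nodes
        self.queue = []  # min-heap; a lower number means a higher priority

    def submit(self, priority, job_name, nodes_needed):
        heapq.heappush(self.queue, (priority, job_name, nodes_needed))

    def dispatch(self):
        # Dispatch in priority order; stop when the next job doesn't fit.
        while self.queue and self.queue[0][2] <= self.free_nodes:
            priority, job_name, nodes_needed = heapq.heappop(self.queue)
            self.free_nodes -= nodes_needed
            print(f"Running {job_name} (priority {priority}) on {nodes_needed} nodes")

scheduler = JobScheduler(total_nodes=16)
scheduler.submit(1, "risk-model", 8)
scheduler.submit(2, "render-batch", 4)
scheduler.dispatch()
```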
Clustering and job scheduling tools for Windows-based and Linux-based clusters migrate well to Azure. For example, Microsoft HPC Pack, Microsoft’s free compute cluster solution for Windows and Linux HPC workloads, offers several options for running in Azure. You can also build Linux clusters that run open-source schedulers such as Torque and SLURM, or bring commercial grid solutions to Azure, such as TIBCO DataSynapse GridServer, IBM Spectrum Symphony and Spectrum LSF, and Univa Grid Engine.
As shown in the following sections, you can also take advantage of Azure services to manage compute resources and schedule jobs without (or in addition to) traditional cluster management tools.
Here are three common scenarios for running Big Compute workloads in Azure by using existing HPC cluster solutions, Azure services, or a combination of the two. Key considerations for choosing each scenario are listed but aren't exhaustive. More detail about the Azure services you might use in your solution appears later in this article.
| Scenario | Why choose it? |
| --- | --- |
| Burst an HPC cluster to Azure<br/><br/>Learn more:<br/>• Burst to Azure worker instances with HPC Pack<br/>• Set up a hybrid compute cluster with HPC Pack<br/>• Burst to Azure Batch with HPC Pack | • Combine your Microsoft HPC Pack or other on-premises cluster with additional Azure resources in a hybrid solution.<br/>• Extend your Big Compute workloads to run on platform as a service (PaaS) virtual machine instances (currently Windows Server only).<br/>• Access an on-premises license server or data store by using an optional Azure virtual network. |
| Create an HPC cluster entirely in Azure<br/><br/>Learn more:<br/>• HPC cluster solutions in Azure | • Quickly and consistently deploy your applications and cluster tools on standard or custom Windows or Linux infrastructure as a service (IaaS) virtual machines.<br/>• Run various Big Compute workloads by using the job scheduling solution of your choice.<br/>• Use additional Azure services, including networking and storage, to create complete cloud-based solutions. |
| Scale out a parallel application to Azure<br/><br/>Learn more:<br/>• Basics of Azure Batch<br/>• Get started with the Azure Batch library for .NET | • Develop with Azure Batch to scale out various Big Compute workloads to run on pools of Windows or Linux virtual machines.<br/>• Use an Azure platform service to manage deployment and autoscaling of virtual machines, job scheduling, disaster recovery, data movement, dependency management, and application deployment. |
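As a concrete illustration of the third scenario, the following sketch submits an intrinsically parallel job to Azure Batch with the Batch Python SDK (the table's "Learn more" links cover the .NET library). The account name, key, URL, pool name, and render command line are placeholder assumptions, and an existing pool named "render-pool" is assumed; exact parameter names can vary slightly across SDK versions.

```python
# pip install azure-batch
from azure.batch import BatchServiceClient
import azure.batch.batch_auth as batch_auth
import azure.batch.models as batchmodels

# Placeholder credentials and endpoint for a hypothetical Batch account.
creds = batch_auth.SharedKeyCredentials("<batch-account>", "<account-key>")
client = BatchServiceClient(creds, "https://<batch-account>.<region>.batch.azure.com")

# Create a job bound to an existing pool of compute nodes.
client.job.add(batchmodels.JobAddParameter(
    id="render-job",
    pool_info=batchmodels.PoolInformation(pool_id="render-pool")))

# Submit one task per input frame; this workload is intrinsically parallel.
for frame in range(100):
    client.task.add("render-job", batchmodels.TaskAddParameter(
        id="frame-{}".format(frame),
        command_line="render.exe --frame {}".format(frame)))
```

Batch then distributes the tasks across the pool's compute nodes as they become available, so scaling out is a matter of adding nodes rather than changing the application.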
The following sections describe the compute, data, networking, and related services you can combine for Big Compute solutions and workflows. For in-depth guidance, see the Azure services documentation. The scenarios earlier in this article show just some of the ways to use these services.
> [!NOTE]
> Azure regularly introduces new services that could be useful for your scenario. If you have questions, contact an Azure partner or email [email protected].
Azure compute services are the core of a Big Compute solution, and the different compute services offer advantages for different scenarios. At a basic level, these services offer different modes for applications to run on virtual machine-based compute instances that Azure provides using Windows Server Hyper-V technology. These instances can run standard and custom Linux and Windows operating systems and tools. Azure gives you a choice of instance sizes with different configurations of CPU cores, memory, disk capacity, and other characteristics. Depending on your needs, you can scale the instances to thousands of cores and then scale down when you need fewer resources.
> [!NOTE]
> Take advantage of the Azure compute-intensive instances such as the H-series to improve the performance and scalability of HPC workloads. These instances also support parallel MPI applications that require a low-latency, high-throughput application network. Also available are N-series VMs with NVIDIA GPUs, which expand the range of computing and visualization scenarios in Azure.
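As a minimal illustration of the tightly coupled, MPI-style workloads those instances target, here is a hypothetical sketch using the mpi4py package. It assumes an MPI library and mpi4py are installed on the nodes; the partial computation is a placeholder for real per-rank work.

```python
# Run with, for example: mpiexec -n 4 python allreduce_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank computes a partial result; allreduce combines the results
# across the low-latency application network that MPI workloads rely on.
partial = rank * rank  # placeholder per-rank computation
total = comm.allreduce(partial, op=MPI.SUM)

if rank == 0:
    print("Sum of partial results across all ranks:", total)
```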
| Service | Description |
| --- | --- |
| Virtual machines | • Provide compute infrastructure as a service (IaaS) using Microsoft Hyper-V technology<br/>• Enable you to flexibly provision and manage persistent cloud computers from standard Windows Server or Linux images from the Azure Marketplace, or from images and data disks you supply<br/>• Can be deployed and managed as VM Scale Sets to build large-scale services from identical virtual machines, with autoscaling to increase or decrease capacity automatically<br/>• Run on-premises compute cluster tools and applications entirely in the cloud |
| Cloud services | • Can run Big Compute applications in worker role instances, which are virtual machines running a version of Windows Server and are managed entirely by Azure<br/>• Enable scalable, reliable applications with low administrative overhead, running in a platform as a service (PaaS) model<br/>• May require additional tools or development to integrate with on-premises HPC cluster solutions |
| Batch | • Runs large-scale parallel and batch workloads in a fully managed service<br/>• Provides job scheduling and autoscaling of a managed pool of virtual machines<br/>• Allows developers to build and run applications as a service or cloud-enable existing applications |
A Big Compute solution typically operates on a set of input data, and generates data for its results. Some of the Azure storage services used in Big Compute solutions include:
- Blob, table, and queue storage - Manage large amounts of unstructured data, NoSQL data, and messages for workflow and communication, respectively. For example, you might use blob storage for large technical data sets, or for the input images or media files your application processes (see the upload sketch after this list). You might use queues for asynchronous communication in a solution. See Introduction to Microsoft Azure Storage.
- Azure File storage - Shares common files and data in Azure using the standard SMB protocol, which is needed for some HPC cluster solutions.
- Data Lake Store - Provides a hyperscale Apache Hadoop Distributed File System for the cloud, useful for batch, real-time, and interactive analytics.
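For example, moving an input data set into Blob storage before a job runs might look like the following sketch, which uses the current azure-storage-blob Python package. The connection string, container name, and file names are placeholders.

```python
# pip install azure-storage-blob
from azure.storage.blob import BlobServiceClient

# Placeholder connection string from the storage account's access keys.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("input-data")

# Upload each input file that the compute job will read.
for name in ["scene-001.dat", "scene-002.dat"]:
    with open(name, "rb") as data:
        container.upload_blob(name=name, data=data, overwrite=True)
```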
Some Big Compute scenarios involve large-scale data flows, or generate data that needs further processing or analysis. Azure offers several data and analysis services, including:
- Data Factory - Builds data-driven workflows (pipelines) that join, aggregate, and transform data from on-premises, cloud-based, and Internet data stores.
- SQL Database - Provides the key features of a Microsoft SQL Server relational database management system in a managed service.
- HDInsight - Deploys and provisions Windows Server or Linux-based Apache Hadoop clusters in the cloud to manage, analyze, and report on big data.
- Machine Learning - Helps you create, test, operate, and manage predictive analytics solutions in a fully managed service.
Your Big Compute solution might need other Azure services to connect to resources on-premises or in other environments. Examples include:
- Virtual Network - Creates a logically isolated section in Azure to connect Azure resources to each other or to your on-premises data center. With a cross-premises virtual network, Big Compute applications can access on-premises data, Active Directory services, and license servers.
- ExpressRoute - Creates a private connection between Microsoft data centers and infrastructure that’s on-premises or in a co-location environment. ExpressRoute provides higher security, more reliability, faster speeds, and lower latencies than typical connections over the Internet.
- Service Bus - Provides several mechanisms for applications to communicate or exchange data, whether they are located on Azure, on another cloud platform, or in a data center.
- See Technical Resources for Batch and HPC for guidance on building your solution.
- Discuss your Azure options with partners including Cycle Computing, Rescale, and UberCloud.
- Read about Azure Big Compute solutions delivered by Towers Watson, Altair, ANSYS, and d3VIEW.
- For the latest announcements, see the Microsoft HPC and Batch team blog and the Azure blog.