title	description	keywords	services	author	ms.author	ms.service	ms.custom	ms.topic	ms.date
What are HDInsight and the Apache Hadoop and Apache Spark technology stack? - Azure	An introduction to HDInsight, and to the Apache Hadoop and Apache Spark technology stack and components, including Kafka, Hive, Storm, and HBase for big data analysis.	azure hadoop, hadoop azure, hadoop intro, hadoop introduction, hadoop technology stack, intro to hadoop, introduction to hadoop, what is a hadoop cluster, what is hadoop cluster, what is hadoop used for	hdinsight	hrasheed-msft	hrasheed	hdinsight	hdinsightactive,hdiseo17may2017, mvc	overview	05/07/2018

What is Azure HDInsight and the Apache Hadoop technology stack

This article provides an introduction to Apache Hadoop on Azure HDInsight. Azure HDInsight is a fully managed, full-spectrum, open-source analytics service for enterprises. You can use open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, R, and more.

[!INCLUDE hdinsight-price-change]

What is HDInsight and the Hadoop technology stack?

Apache Hadoop was the original open-source framework for distributed processing and analysis of big data sets on clusters. The Hadoop technology stack includes related software and utilities, including Apache Hive, HBase, Spark, Kafka, and many others.

Azure HDInsight is a cloud distribution of the Hadoop components from the Hortonworks Data Platform (HDP). Azure HDInsight makes it easy, fast, and cost-effective to process massive amounts of data. You can use the most popular open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, R, and more. With these frameworks, you can enable a broad range of scenarios such as extract, transform, and load (ETL), data warehousing, machine learning, and IoT.

For the Hadoop technology stack components that are available on HDInsight, see Components and versions available with HDInsight. To read more about Hadoop in HDInsight, see the Azure features page for HDInsight.

What is big data?

Big data is collected in escalating volumes, at higher velocities, and in a greater variety of formats than ever before. It can be historical (meaning stored) or real time (meaning streamed from the source). See Scenarios for using HDInsight to learn about the most common use cases for big data.

Why should I use Hadoop on HDInsight?

This section lists the capabilities of Azure HDInsight.

Capability	Description
Cloud native	Azure HDInsight enables you to create optimized clusters for Hadoop, Spark, Interactive query (LLAP), Kafka, Storm, HBase, and ML Services on Azure. HDInsight also provides an end-to-end SLA on all your production workloads.
Low-cost and scalable	HDInsight enables you to scale workloads up or down. You can reduce costs by creating clusters on demand and paying only for what you use. You can also build data pipelines to operationalize your jobs. Decoupled compute and storage provide better performance and flexibility.
Secure and compliant	HDInsight enables you to protect your enterprise data assets with Azure Virtual Network, encryption, and integration with Azure Active Directory. HDInsight also meets the most popular industry and government compliance standards.
Monitoring	Azure HDInsight integrates with Azure Log Analytics to provide a single interface with which you can monitor all your clusters.
Global availability	HDInsight is available in more regions than any other big data analytics offering. Azure HDInsight is also available in Azure Government, China, and Germany, which allows you to meet your enterprise needs in key sovereign areas.
Productivity	Azure HDInsight enables you to use rich productivity tools for Hadoop and Spark with your preferred development environments. These development environments include Visual Studio, Visual Studio Code, Eclipse, and IntelliJ, with support for Scala, Python, R, Java, and .NET. Data scientists can also collaborate using popular notebooks such as Jupyter and Zeppelin.
Extensibility	You can extend the HDInsight clusters with installed components (Hue, Presto, and so on) by using script actions, by adding edge nodes, or by integrating with other big data certified applications. HDInsight enables seamless integration with the most popular big data solutions with a one-click deployment.

Scenarios for using HDInsight

Azure HDInsight can be used for a variety of scenarios in big data processing. The data can be historical (already collected and stored) or real-time (streamed directly from the source). The scenarios for processing such data can be summarized in the following categories:

Batch processing (ETL)

Extract, transform, and load (ETL) is a process where unstructured or structured data is extracted from heterogeneous data sources. It's then transformed into a structured format and loaded into a data store. You can use the transformed data for data science or data warehousing.
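
The three ETL stages can be illustrated with a minimal, single-machine sketch. It isn't HDInsight-specific code: the sample inputs, the `readings` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the target data store. On a cluster, the same stages would typically be expressed with a framework such as Hive or Spark.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw inputs standing in for two heterogeneous sources.
CSV_SOURCE = "device_id,temp_c\nsensor-1,21.5\nsensor-2,not_a_number\n"
JSON_SOURCE = '[{"device": "sensor-3", "temperature": {"celsius": 19.0}}]'


def extract():
    """Read raw records from each source without interpreting them."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    json_rows = json.loads(JSON_SOURCE)
    return csv_rows, json_rows


def transform(csv_rows, json_rows):
    """Normalize both shapes into (device_id, temp_c) tuples, dropping bad rows."""
    records = []
    for row in csv_rows:
        try:
            records.append((row["device_id"], float(row["temp_c"])))
        except ValueError:
            continue  # skip rows whose temperature is not numeric
    for row in json_rows:
        records.append((row["device"], float(row["temperature"]["celsius"])))
    return records


def load(records, conn):
    """Write the structured records into a table in the target data store."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (device_id TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", records)
    conn.commit()


def run_etl():
    """Run extract, transform, and load in order, then read back the result."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real data store
    load(transform(*extract()), conn)
    query = "SELECT device_id, temp_c FROM readings ORDER BY device_id"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(run_etl())
```

Keeping the stages as separate functions mirrors the way the process is described: the transform step is the only place that knows about the differing source formats, so the loaded table always has one consistent structure.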

Data warehousing

You can use HDInsight to perform interactive queries at petabyte scales over structured or unstructured data in any format. You can also build models and connect them to BI tools. For more information, read this customer story.

Internet of Things (IoT)

You can use HDInsight to process streaming data that's received in real time from a variety of devices. For more information, read this blog post from Azure that announces the public preview of Apache Kafka on HDInsight with Azure Managed disks.

Data science

You can use HDInsight to build applications that extract critical insights from data. You can also use Azure Machine Learning on top of that to predict future trends for your business. For more information, read this customer story.

Hybrid

You can use HDInsight to extend your existing on-premises big data infrastructure to Azure to leverage the advanced analytics capabilities of the cloud.

Cluster types in HDInsight

HDInsight includes specific cluster types and cluster customization capabilities, such as the capability to add components, utilities, and languages. HDInsight offers the following cluster types:

Apache Hadoop: A framework that uses HDFS, YARN resource management, and a simple MapReduce programming model to process and analyze batch data in parallel.
Apache Spark: An open-source, parallel-processing framework that supports in-memory processing to boost the performance of big-data analysis applications. See What is Apache Spark in HDInsight?.
Apache HBase: A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data--potentially billions of rows times millions of columns. See What is HBase on HDInsight?
ML Services: A server for hosting and managing parallel, distributed R processes. It provides data scientists, statisticians, and R programmers with on-demand access to scalable, distributed methods of analytics on HDInsight. See Overview of ML Services on HDInsight.
Apache Storm: A distributed, real-time computation system for processing large streams of data fast. Storm is offered as a managed cluster in HDInsight. See Analyze real-time sensor data using Storm and Hadoop.
Apache Interactive Query preview (also known as Live Long and Process, or LLAP): In-memory caching for interactive and faster Hive queries. See Use Interactive Query in HDInsight.
Apache Kafka: An open-source platform that's used for building streaming data pipelines and applications. Kafka also provides message-queue functionality that allows you to publish and subscribe to data streams. See Introduction to Apache Kafka on HDInsight.
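
The MapReduce programming model mentioned for the Apache Hadoop cluster type can be sketched as a word count in plain Python. This is only an illustration of the model, run in a single process: the function names are hypothetical and it doesn't use the Hadoop API. On a real cluster, the framework distributes the map and reduce calls across nodes and performs the grouping step between them.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data on Hadoop", "Hadoop on HDInsight"]))
```

Because each `map_phase` call looks at only one line and each `reduce_phase` call looks at only one key, the calls are independent of one another, which is what allows batch data to be processed in parallel.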

Open-source components in HDInsight

Azure HDInsight enables you to create clusters with open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, HBase, and R. By default, these clusters also include other open-source components, such as Ambari, Avro, Hive, HCatalog, Mahout, MapReduce, YARN, Phoenix, Pig, Sqoop, Tez, Oozie, and ZooKeeper.

Programming languages in HDInsight

HDInsight clusters, including Spark, HBase, Kafka, Hadoop, and others, support many programming languages. Some programming languages aren't installed by default. For libraries, modules, or packages that are not installed by default, use a script action to install the component.

Programming language	Information
Default programming language support	By default, HDInsight clusters support Java and Python. You can install additional languages by using script actions.
Java virtual machine (JVM) languages	Many languages other than Java can run on a Java virtual machine (JVM). However, if you run some of these languages, you might have to install additional components on the cluster. The following JVM-based languages are supported on HDInsight clusters: Clojure, Jython (Python for Java), and Scala.
Hadoop-specific languages	HDInsight clusters support the following languages that are specific to the Hadoop technology stack: Pig Latin for Pig jobs, and HiveQL for Hive jobs and SparkSQL.

Development tools for HDInsight

You can use HDInsight development tools, including IntelliJ, Eclipse, Visual Studio Code, and Visual Studio, to author and submit HDInsight data queries and jobs with seamless integration with Azure.

Azure Toolkit for IntelliJ
Azure Toolkit for Eclipse
Azure HDInsight Tools for VS Code
Azure Data Lake Tools for Visual Studio

Business intelligence on HDInsight

Familiar business intelligence (BI) tools retrieve, analyze, and report data that is integrated with HDInsight by using either the Power Query add-in or the Microsoft Hive ODBC Driver:

Apache Spark BI using data visualization tools with Azure HDInsight
Visualize Hive data with Microsoft Power BI in Azure HDInsight
Visualize Interactive Query Hive data with Power BI in Azure HDInsight
Connect Excel to Hadoop with Power Query (requires Windows)
Connect Excel to Hadoop with the Microsoft Hive ODBC Driver (requires Windows)
Use SQL Server Analysis Services with HDInsight
Use SQL Server Reporting Services with HDInsight

Next steps

In this article, you learned what Azure HDInsight is and how it provides Hadoop and other cluster types on Azure. Proceed to the next article to learn how to create an Apache Hadoop cluster in HDInsight.

[!div class="nextstepaction"] Create Hadoop cluster in HDInsight
