Apache Beam provides a simple, powerful programming model for building both batch and streaming parallel data processing pipelines. It also covers the data integration processus.
General usage is a good starting point for Apache Beam.
You can take a look on the Beam Examples.
Status Build Status
The key concepts in this programming model are:
PCollection
: represents a collection of data, which could be bounded or unbounded in size.PTransform
: represents a computation that transforms input PCollections into output PCollections.Pipeline
: manages a directed acyclic graph of PTransforms and PCollections that is ready for execution.PipelineRunner
: specifies where and how the pipeline should execute.
We provide the following PipelineRunners:
- The
DirectPipelineRunner
runs the pipeline on your local machine. - The
BlockingDataflowPipelineRunner
submits the pipeline to the Dataflow Service via theDataflowPipelineRunner
and then prints messages about the job status until the execution is complete. - The
SparkPipelineRunner
runs the pipeline on an Apache Spark cluster. - The
FlinkPipelineRunner
runs the pipeline on an Apache Flink cluster.
The following command will build both the sdk
and example
modules and
install them in your local Maven repository:
mvn clean install
You can speed up the build and install process by using the following options:
-
To skip execution of the unit tests, run:
mvn install -DskipTests
-
While iterating on a specific module, use the following command to compile and reinstall it. For example, to reinstall the
examples
module, run:mvn install -pl examples
Be careful, however, as this command will use the most recently installed SDK from the local repository (or Maven Central) even if you have changed it locally.
After building and installing, you can execute the WordCount
and other
example pipelines by following the instructions in this README.
You can subscribe on the mailing lists to discuss and get involved in Apache Beam:
- Subscribe on the [email protected]
- Subscribe on the [email protected]
You can report issue on Jira.