Skip to content

Commit

Permalink
Adds detailed instructions on how to execute ExampleEchoPipeline.java (
Browse files Browse the repository at this point in the history
…apache#33108)

* Adds detailed instructions on how to execute ExampleEchoPipeline.java

* spacing in exampleechopipeline

* revert header

* spacing

* spacing

* spacing

* license

* spotless
  • Loading branch information
svetakvsundhar authored Nov 14, 2024
1 parent 7accd8d commit f38edb7
Show file tree
Hide file tree
Showing 2 changed files with 87 additions and 10 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -33,16 +33,7 @@
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
* In this example batch pipeline we will invoke a simple Echo C++ library within a DoFn The sample
* makes use of a ExternalLibraryDoFn class which abstracts the setup and processing of the
* executable, logs and results. For this example we are using commands passed to the library based
* on ordinal position but for a production system you should use a mechanism like ProtoBuffers with
* Base64 encoding to pass the parameters to the library To test this example you will need to build
* the files Echo.cc and EchoAgain.cc in a linux env matching the runner that you are using (using
* g++ with static option). Once built copy them to the SourcePath defined in {@link
* SubProcessPipelineOptions}
*/
/** Please see the Readme.MD file for instructions to execute this pipeline. */
public class ExampleEchoPipeline {

public static void main(String[] args) throws Exception {
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Apache Beam Subprocess Example

This example demonstrates how to execute external C++ binaries as subprocesses within an Apache Beam pipeline using the `SubProcessKernel`.

## Prerequisites

* **Google Cloud Project:** A Google Cloud project with billing enabled.
* **Dataflow API:** Enable the Dataflow API for your project.
* **C++ compiler:** You'll need a C++ compiler (like g++) to compile the C++ binaries.

## Steps

1. **Create a [Maven Example project](https://beam.apache.org/get-started/quickstart-java/) that builds against the latest Beam release:**

```bash
mvn archetype:generate \
-DarchetypeGroupId=org.apache.beam \
-DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
-DarchetypeVersion=2.60.0 \
-DgroupId=org.example \
-DartifactId=word-count-beam \
-Dversion="0.1" \
-Dpackage=org.apache.beam.examples \
-DinteractiveMode=false
```

2. **Build the project:**

* Navigate to the root of the repository (`word-count-beam/`):

```bash
cd word-count-beam/
```

* Build the project using Maven:

```bash
mvn clean install
```

3. **Run the pipeline on Dataflow:**

```bash
mvn compile exec:java \
-Dexec.mainClass=org.apache.beam.examples.subprocess.ExampleEchoPipeline \
-Dexec.args="--sourcePath=/absolute/path/to/your/subprocess/directory \
--workerPath=/absolute/path/to/your/subprocess/directory \
--concurrency=5 \
--filesToStage=/absolute/path/to/your/subprocess/directory/echo,/absolute/path/to/your/subprocess/directory/echoagain \
--runner=DataflowRunner \
--project=your-project-id \
--region=your-gcp-region \
--tempLocation=gs://your-gcs-bucket/temp"
```

* Replace the placeholders with your actual paths, project ID, region, and Cloud Storage bucket.

## Important notes

* **Dependencies:** Ensure your `pom.xml` includes the Dataflow runner dependency (`beam-runners-google-cloud-dataflow-java`).
* **Authentication:** Authenticate your environment to your Google Cloud project.
* **DirectRunner:** On `DirectRunner`, you will see the error ` Process succeded but no result file was found`, showing that the Process is successful.

## Code overview

* **`ExampleEchoPipeline.java`:** This Java file defines the Beam pipeline that executes the `Echo` and `EchoAgain` binaries as subprocesses.
* **`Echo.cc` and `Echoagain.cc`:** These C++ files contain the code for the external binaries. These won't be visible when running the example with the created example project. You will need to compile these (using `g++ Echo.cc -o Echo` and `g++ EchoAgain.cc -o EchoAgain`), and then provide their path via the `sourcePath` and `workerPath` flags as listed above.
* **`SubProcessKernel.java`:** This class in the Beam Java SDK handles the execution of external binaries and captures their output.

0 comments on commit f38edb7

Please sign in to comment.