The examples in this module demonstrate the basic functionality of Google Cloud Dataflow and serve as starting points for developing more complex pipelines.
A good starting point for new users is our `WordCount` example, which runs over the provided input text file(s) and computes how many times each word occurs in the input.
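The core of that pipeline is only a few transforms. Below is a minimal sketch of the same pattern (not the example's exact code; the input and output paths are placeholders):

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;

public class MinimalWordCount {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply(TextIO.Read.from("gs://your-bucket/input-*.txt"))  // placeholder input
        // Split each line into individual words.
        .apply(ParDo.of(new DoFn<String, String>() {
          @Override
          public void processElement(ProcessContext c) {
            for (String word : c.element().split("[^a-zA-Z']+")) {
              if (!word.isEmpty()) {
                c.output(word);
              }
            }
          }
        }))
        // Count the occurrences of each distinct word.
        .apply(Count.<String>perElement())
        // Format each word/count pair as a line of text.
        .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
          @Override
          public void processElement(ProcessContext c) {
            c.output(c.element().getKey() + ": " + c.element().getValue());
          }
        }))
        .apply(TextIO.Write.to("gs://your-bucket/output"));  // placeholder output

    p.run();
  }
}
```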
Besides `WordCount`, the following examples are included:
- `AutoComplete` — An example that computes the most popular hash tags for every prefix, which can be used for auto-completion. Demonstrates how to use the same pipeline in both streaming and batch, combiners, and composite transforms.
- `BigQueryTornadoes` — An example that reads the public samples of weather data from Google BigQuery, counts the number of tornadoes that occur in each month, and writes the results to BigQuery. Demonstrates reading and writing BigQuery, counting a `PCollection`, and user-defined `PTransform`s.
- `CombinePerKeyExamples` — An example that reads the public "Shakespeare" data, and for each word in the dataset that exceeds a given length, generates a string containing the list of play names in which that word appears. Output is saved in a Google BigQuery table. Demonstrates the `Combine.perKey` transform, which lets you combine the values in a key-grouped `PCollection`, and how to use an `Aggregator` to track information in the Google Developers Console. (A minimal sketch of the `Combine.perKey` pattern follows this list.)
- `DatastoreWordCount` — An example that shows you how to read from Google Cloud Datastore.
- `DeDupExample` — An example that uses Shakespeare's plays as plain text files, and removes duplicate lines across all the files. Demonstrates the `RemoveDuplicates`, `TextIO.Read`, and `TextIO.Write` transforms, and how to wire transforms together.
- `FilterExamples` — An example that shows different approaches to filtering, including selection and projection. It also shows how to dynamically set parameters by defining and using new pipeline options, and how to use a value derived by a pipeline. Demonstrates the `Mean` transform, `Options` configuration, and using pipeline-derived data as a side input. (A sketch of the side-input pattern also follows this list.)
- `JoinExamples` — An example that shows how to join two collections. It uses a sample of the GDELT "world event" data, joining the event `action` country code against a table that maps country codes to country names. Demonstrates the `Join` operation and using multiple input sources.
- `MaxPerKeyExamples` — An example that reads the public samples of weather data from BigQuery, and finds the maximum temperature (`mean_temp`) for each month. Demonstrates the `Max` statistical combination transform and how to find the max-per-key group.
- `PubsubFileInjector` — A batch Cloud Dataflow pipeline for injecting a set of Cloud Storage files into a Google Cloud Pub/Sub topic, line by line. This example can be useful for testing streaming pipelines.
- `StreamingWordExtract` — A streaming pipeline example that inputs lines of text from a Cloud Pub/Sub topic, splits each line into individual words, capitalizes those words, and writes the output to a BigQuery table.
- `TfIdf` — An example that computes a basic TF-IDF search table for a directory or Cloud Storage prefix. Demonstrates joining data, side inputs, and logging.
- `TopWikipediaSessions` — An example that reads Wikipedia edit data from Cloud Storage and computes the user with the longest string of edits separated by no more than an hour within each month. Demonstrates using Cloud Dataflow `Windowing` to perform time-based aggregations of data.
- `TrafficMaxLaneFlow` — A streaming Cloud Dataflow example using BigQuery output in the traffic sensor domain. Demonstrates the Cloud Dataflow streaming runner, sliding windows, Cloud Pub/Sub topic ingestion, the use of the `AvroCoder` to encode a custom class, and custom `Combine` transforms.
- `TrafficRoutes` — A streaming Cloud Dataflow example using BigQuery output in the traffic sensor domain. Demonstrates the Cloud Dataflow streaming runner, `GroupByKey`, keyed state, sliding windows, and Cloud Pub/Sub topic ingestion.
- `WindowingWordCount` — An example that applies windowing to "Shakespeare" data in a `WordCount` pipeline.
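As noted in the `CombinePerKeyExamples` entry above, here is a minimal sketch of the `Combine.perKey` pattern it demonstrates (not the example's actual code; the input data is made up):

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Combine;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.transforms.SerializableFunction;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class CombinePerKeySketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Made-up (word, play) pairs standing in for the Shakespeare data.
    PCollection<KV<String, String>> wordToPlay = p.apply(Create.of(
        KV.of("brave", "Hamlet"),
        KV.of("brave", "Macbeth"),
        KV.of("honest", "Othello")));

    // Combine.perKey collapses all values for each key into one value --
    // here, a comma-separated list of play names per word.
    PCollection<KV<String, String>> playsPerWord = wordToPlay.apply(
        Combine.<String, String>perKey(
            new SerializableFunction<Iterable<String>, String>() {
              @Override
              public String apply(Iterable<String> plays) {
                StringBuilder sb = new StringBuilder();
                for (String play : plays) {
                  if (sb.length() > 0) {
                    sb.append(",");
                  }
                  sb.append(play);
                }
                return sb.toString();
              }
            }));

    p.run();
  }
}
```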
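And here is a sketch of the pipeline-derived side input pattern from `FilterExamples` (again hedged: the names and data are illustrative, not the example's own):

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.Mean;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.transforms.View;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionView;

public class SideInputSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<Double> temps = p.apply(Create.of(48.3, 52.0, 61.5, 40.1));

    // Compute the global mean once, and materialize it as a singleton
    // side input that every worker can read.
    final PCollectionView<Double> meanTemp =
        temps.apply(Mean.<Double>globally()).apply(View.<Double>asSingleton());

    // Keep only the readings above the pipeline-derived mean.
    temps.apply(ParDo.withSideInputs(meanTemp).of(new DoFn<Double, Double>() {
      @Override
      public void processElement(ProcessContext c) {
        if (c.element() > c.sideInput(meanTemp)) {
          c.output(c.element());
        }
      }
    }));

    p.run();
  }
}
```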
After building and installing the `SDK` and `Examples` modules, as explained in this README, you can execute the `WordCount` and other example pipelines using the `DirectPipelineRunner` on your local machine:
```
mvn exec:java -pl examples \
    -Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
    -Dexec.args="--input=<INPUT FILE PATTERN> --output=<OUTPUT FILE>"
```
You can use the `BlockingDataflowPipelineRunner` to execute the `WordCount` example on the Google Cloud Dataflow service using managed resources in the Google Cloud Platform. Start by following the general Cloud Dataflow Getting Started instructions. You should have a Google Cloud Platform project with the Cloud Dataflow API enabled, a Google Cloud Storage bucket that will serve as a staging location, and the Google Cloud SDK installed and authenticated. In this case, invoke the example as follows:
```
mvn exec:java -pl examples \
    -Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
    -Dexec.args="--project=<YOUR CLOUD PLATFORM PROJECT NAME> \
    --stagingLocation=<YOUR CLOUD STORAGE LOCATION> \
    --runner=BlockingDataflowPipelineRunner"
```
Your Cloud Storage location should be entered in the form `gs://bucket/path/to/staging/directory`. The Cloud Platform project is referred to by its name (not its number).
Alternatively, you may choose to bundle all dependencies into a single JAR and execute it outside of the Maven environment. For example, after building and installing as usual, you can execute the following commands to create the bundled JAR of the `Examples` module and execute it both locally and in Cloud Platform:
```
mvn bundle:bundle -pl examples

java -cp examples/target/google-cloud-dataflow-java-examples-all-bundled-manual_build.jar \
    com.google.cloud.dataflow.examples.WordCount \
    --input=<INPUT FILE PATTERN> --output=<OUTPUT FILE>

java -cp examples/target/google-cloud-dataflow-java-examples-all-bundled-manual_build.jar \
    com.google.cloud.dataflow.examples.WordCount \
    --project=<YOUR CLOUD PLATFORM PROJECT NAME> \
    --stagingLocation=<YOUR CLOUD STORAGE LOCATION> \
    --runner=BlockingDataflowPipelineRunner
```
Other examples can be run similarly by replacing the `WordCount` class name with `BigQueryTornadoes`, `DatastoreWordCount`, `TfIdf`, `TopWikipediaSessions`, etc., and adjusting the runtime options under the `-Dexec.args` parameter, as specified in the example itself. If you are running the streaming pipeline examples, see the additional setup instructions below.
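Each example declares its runtime options as a small `PipelineOptions` interface, and the flags passed under `-Dexec.args` are parsed into it. A hypothetical sketch of that mechanism (the option names here are illustrative, not taken from any particular example):

```java
import com.google.cloud.dataflow.sdk.options.Default;
import com.google.cloud.dataflow.sdk.options.Description;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class OptionsSketch {
  // Hypothetical options interface; each getter/setter pair becomes a
  // --flag that can be set under -Dexec.args.
  public interface MyExampleOptions extends PipelineOptions {
    @Description("Path of the file pattern to read from")
    @Default.String("gs://dataflow-samples/shakespeare/kinglear.txt")
    String getInput();
    void setInput(String value);

    @Description("Path prefix for the output files")
    String getOutput();
    void setOutput(String value);
  }

  public static void main(String[] args) {
    // Parses flags such as --input=... and --output=... from args.
    MyExampleOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(MyExampleOptions.class);
    System.out.println("Reading from: " + options.getInput());
  }
}
```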
Note that when running Maven on the Microsoft Windows platform, backslashes (`\`) under the `-Dexec.args` parameter should be escaped with another backslash. For example, an input file pattern of `c:\*.txt` should be entered as `c:\\*.txt`.
The `TrafficMaxLaneFlow` and `TrafficRoutes` pipelines, when run in streaming mode (with the `--streaming=true` option), require the publication of traffic sensor data to a Google Cloud Pub/Sub topic. You can run the example with default settings using the following command:
```
mvn exec:java -pl examples \
    -Dexec.mainClass=com.google.cloud.dataflow.examples.TrafficMaxLaneFlow \
    -Dexec.args="--project=<YOUR CLOUD PLATFORM PROJECT NAME> \
    --stagingLocation=<YOUR CLOUD STORAGE LOCATION> \
    --runner=DataflowPipelineRunner \
    --streaming=true"
```
By default, they use a separate batch pipeline to publish previously gathered traffic sensor data to the Cloud Pub/Sub topic, which is used as an input source for the streaming pipeline.
The default traffic sensor data for `--inputFile` can be downloaded as follows:

```
curl -O \
    http://storage.googleapis.com/aju-sd-traffic/unzipped/Freeways-5Minaa2010-01-01_to_2010-02-15_test2.csv
```
This file contains real traffic sensor data from San Diego freeways. See the file itself for copyright information.
You may override the default `--inputFile` with an alternative, complete data set (~2 GB), available in Google Cloud Storage at `gs://dataflow-samples/traffic_sensor/Freeways-5Minaa2010-01-01_to_2010-02-15.csv`.
You may also set `--inputFile` to an empty string, which disables the automatic Pub/Sub injection and allows you to use a separate tool to control the input to this example. Example code that publishes traffic sensor data to a Pub/Sub topic is provided in `traffic_pubsub_generator.py`.
Note: If you set `--streaming=false`, these traffic pipelines will run in batch mode, using the timestamps applied to the original dataset to process the data in a batch. For further information on how these pipelines operate, see `TrafficMaxLaneFlow` and `TrafficRoutes`.
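Both traffic pipelines aggregate sensor readings over sliding windows before combining per key. Below is a hedged sketch of that windowing step, assuming hypothetical (station, flow) pairs rather than the examples' real records, and made-up window sizes:

```java
import com.google.cloud.dataflow.sdk.transforms.windowing.SlidingWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowingSketch {
  // Assign each (station, flow) reading to overlapping 60-minute windows
  // that advance every 5 minutes, so a downstream per-key combine produces
  // one aggregate per station per window.
  static PCollection<KV<String, Double>> applySlidingWindows(
      PCollection<KV<String, Double>> readings) {
    return readings.apply(
        Window.<KV<String, Double>>into(
            SlidingWindows.of(Duration.standardMinutes(60))
                .every(Duration.standardMinutes(5))));
  }
}
```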