Skip to content

Commit

Permalink
Documentation for Samza SQL
Browse files Browse the repository at this point in the history
**Samza tools** :
Contains the following tools that can be used for playing with Samza sql or any other samza job.

1. Generate kafka events : Tool used to generate avro serialized kafka events
2. Event hub consumer : Tool used to consume events from event hubs topic. This can be used if the samza job writes events to event hubs.
3. Samza sql console : Tool used to execute SQL using samza sql.

Adds documentation on how to use Samza SQL on a local machine and on a yarn environment and their associated Samza tooling.

https://issues.apache.org/jira/browse/SAMZA-1526

Author: Srinivasulu Punuru <[email protected]>

Reviewers: Yi Pan<[email protected]>, Jagadish<[email protected]>

Closes apache#374 from srinipunuru/docs.1
  • Loading branch information
srinipunuru authored and xiliu committed Dec 22, 2017
1 parent ef506a3 commit 804aea2
Show file tree
Hide file tree
Showing 31 changed files with 1,886 additions and 4 deletions.
28 changes: 28 additions & 0 deletions build.gradle
Original file line number Diff line number Diff line change
Expand Up @@ -328,6 +328,34 @@ project(':samza-sql') {
}
}

project(':samza-tools') {
apply plugin: 'java'

dependencies {
compile project(':samza-sql')
compile project(':samza-api')
compile project(':samza-azure')
compile "log4j:log4j:$log4jVersion"
compile "org.slf4j:slf4j-api:$slf4jVersion"
compile "org.slf4j:slf4j-log4j12:$slf4jVersion"
compile "commons-cli:commons-cli:$commonsCliVersion"
compile "org.apache.avro:avro:$avroVersion"
compile "org.apache.commons:commons-lang3:$commonsLang3Version"
compile "org.apache.kafka:kafka-clients:$kafkaVersion"
}

tasks.create(name: "releaseToolsTarGz", dependsOn: configurations.archives.artifacts, type: Tar) {
into "samza-tools-${version}"
compression = Compression.GZIP
from(project.file("./scripts")) { into "scripts/" }
from(project.file("./config")) { into "config/" }
from(project(':samza-shell').file("src/main/bash/run-class.sh")) { into "scripts/" }
from(configurations.runtime) { into("lib/") }
from(configurations.archives.artifacts.files) { into("lib/") }
duplicatesStrategy 'exclude'
}
}

project(":samza-kafka_$scalaVersion") {
apply plugin: 'scala'

Expand Down
4 changes: 2 additions & 2 deletions docs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -83,8 +83,8 @@ Following can be done when updating the gradle.properties file

* if this is a major release, add the x.x.x release to Archive category in docs/_layouts/default.html and x.x.x release part in docs/archive/index.html

* update the download page to use x.x.x release
* docs/startup/download/index.md
* update the download page (docs/startup/download/index.md) to use x.x.x release
* Add an entry to the Sources releases and Samza Tools section to use the new x.x.x release

* update the version number in "tar -xvf ./target/hello-samza-y.y.y-dist.tar.gz -C deploy/samza" in each of the tutorials (and search for other uses of version x.x.x which may need to be replaced with y.y.y)
* docs/startup/hello-samza/versioned/index.md
Expand Down
3 changes: 3 additions & 0 deletions docs/learn/tutorials/versioned/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,9 @@ title: Tutorials

[Samza Async API and Multithreading User Guide](samza-async-user-guide.html)

[Samza SQL User Guide](samza-sql.html)


<!-- TODO a bunch of tutorials
[Log Walkthrough](log-walkthrough.html)
<a href="configuring-kafka-system.html">Configuring a Kafka System</a><br/>
Expand Down
123 changes: 123 additions & 0 deletions docs/learn/tutorials/versioned/samza-sql.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,123 @@
---
layout: page
title: How to use Samza SQL
---
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

There are couple of ways to use Samza SQL

1. Run Samza SQL on your local machine.
2. Run Samza SQL on YARN.

# Running Samza SQL on your local machine


Samza SQL console tool documented [here](samza-tools.html) uses Samza standalone to run the Samza SQL on your local machine. This is the quickest way to play with Samza SQL. Please follow the instructions [here](samza-tools.html) to get access to the Samza tools on your machine.

## Start the Kafka server

Please follow the instructions from the [Kafka quickstart](http://kafka.apache.org/quickstart) to start the zookeeper and Kafka server.

## Create ProfileChangeStream Kafka topic

The below sql statements requires a topic named ProfileChangeStream to be created on the Kafka broker. You can follow the instructions in the [Kafka quick start guide](http://kafka.apache.org/quickstart) to create a topic named "ProfileChangeStream".

{% highlight bash %}
./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic ProfileChangeStream
{% endhighlight %}

## Generate events into ProfileChangeStream topic

Use generate-kafka-events from [Samza tools](samza-tools.html) to generate events into the ProfileChangeStream

{% highlight bash %}
cd samza-tools-<version>
./scripts/generate-kafka-events.sh -t ProfileChangeStream -e ProfileChange
{% endhighlight %}

## Using Samza SQL Console to run Samza sql on your local machine

Below are some of the sql queries that you can execute using the samza-sql-console tool from [Samza tools](samza-tools.html) package.

{% highlight bash %}
# This command just prints out all the events in the Kafka topic ProfileChangeStream into console output as a json serialized payload.
./scripts/samza-sql-console.sh --sql "insert into log.consoleoutput select * from kafka.ProfileChangeStream"

# This command prints out the fields that are selected into the console output as a json serialized payload.
./scripts/samza-sql-console.sh --sql "insert into log.consoleoutput select Name, OldCompany, NewCompany from kafka.ProfileChangeStream"

# This command showcases the RegexMatch udf and filtering capabilities.
./scripts/samza-sql-console.sh --sql "insert into log.consoleoutput select Name as __key__, Name, NewCompany, RegexMatch('.*soft', OldCompany) from kafka.ProfileChangeStream where NewCompany = 'LinkedIn'"
{% endhighlight %}


# Running Samza SQL on YARN

The [hello-samza](https://github.com/apache/samza-hello-samza) project is an example project designed to help you run your first Samza application. It has examples of applications using the low level task API, high level API as well as Samza SQL.

This tutorial demonstrates a simple Samza application that uses SQL to perform stream processing.

## Get the hello-samza Code and Start the grid

Please follow the instructions from [hello-samza-high-level-yarn](hello-samza-high-level-yarn.html) on how to build the hello-samza repository and start the yarn grid.

## Create the topic and generate Kafka events

Please follow the steps in the section "Create ProfileChangeStream Kafka topic" and "Generate events into ProfileChangeStream topic" above.

## Build a Samza Application Package

Before you can run a Samza application, you need to build a package for it. Please follow the instructions from [hello-samza-high-level-yarn](hello-samza-high-level-yarn.html) on how to build the hello-samza application package.

## Run a Samza Application

After you've built your Samza package, you can start the app on the grid using the run-app.sh script.

{% highlight bash %}
./deploy/samza/bin/run-app.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/page-view-filter-sql.properties
{% endhighlight %}

The app executes the following SQL command :
{% highlight sql %}
insert into kafka.NewLinkedInEmployees select Name from ProfileChangeStream where NewCompany = 'LinkedIn'
{% endhighlight %}

This SQL performs the following

1. Consumes the Kafka topic ProfileChangeStreamStream which contains the avro serialized ProfileChangeEvent(s)
2. Deserializes the events and filters out only the profile change events where NewCompany = 'LinkedIn' i.e. Members who have moved to LinkedIn.
3. Writes the Avro serialized event that contains the Id and Name of those profiles to Kafka topic NewLinkedInEmployees.


Give the job a minute to startup, and then tail the Kafka topic:

{% highlight bash %}
./deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic NewLinkedInEmployees
{% endhighlight %}


Congratulations! You've now setup a local grid that includes YARN, Kafka, and ZooKeeper, and run a Samza SQL application on it.

## Shutdown and cleanup

To shutdown the app, use the same _run-app.sh_ script with an extra _--operation=kill_ argument
{% highlight bash %}
./deploy/samza/bin/run-app.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/page-view-filter-sql.properties --operation=kill
{% endhighlight %}

Please follow the instructions from [Hello Samza High Level API - YARN Deployment](hello-samza-high-level-yarn.html) on how to shutdown and cleanup the app.
109 changes: 109 additions & 0 deletions docs/learn/tutorials/versioned/samza-tools.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,109 @@
---
layout: page
title: How to use Samza tools
---

<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->


# Get Samza tools

Please visit the [Download page] (/startup/download) to download the Samza tools package

{% highlight bash %}
tar -xvzf samza-tools-*.tgz
cd samza-tools-<version>
{% endhighlight %}


# Using Samza tools


## Generate kafka events


Generate kafka events tool is used to insert avro serialized events into kafka topics. Right now it can insert two types of events [PageViewEvent](https://github.com/apache/samza/blob/master/samza-tools/src/main/java/org/com/linkedin/samza/tools/schemas/PageViewEvent.avsc) and [ProfileChangeEvent](https://github.com/apache/samza/blob/master/samza-tools/src/main/java/org/com/linkedin/samza/tools/schemas/ProfileChangeEvent.avsc)

Before you can generate kafka events, Please follow instructions [here](http://kafka.apache.org/quickstart) to start the zookeeper and kafka server on your local machine.

You can follow below instructions on how to use Generate kafka events tool.

{% highlight bash %}

# Usage of the tool

./scripts/generate-kafka-events.sh
usage: Error: Missing required options: t, e
generate-kafka-events.sh
-b,--broker <BROKER> Kafka broker endpoint Default (localhost:9092).
-n,--numEvents <NUM_EVENTS> Number of events to be produced,
Default - Produces events continuously every second.
-p,--partitions <NUM_PARTITIONS> Number of partitions in the topic,
Default (4).
-t,--topic <TOPIC_NAME> Name of the topic to write events to.
-e,--eventtype <EVENT_TYPE> Type of the event values can be (PageView|ProfileChange).


# Example command to generate 100 events of type PageViewEvent into topic named PageViewStream

./scripts/generate-kafka-events.sh -t PageViewStream -e PageView -n 100


# Example command to generate ProfileChange events continuously into topic named ProfileChangeStream

./scripts/generate-kafka-events.sh -t ProfileChangeStream -e ProfileChange

{% endhighlight %}

## Samza SQL console tool

Once you generated the events into the kafka topic. Now you can use samza-sql-console tool to perform processing on the events published into the kafka topic.

There are two ways to use the tool -

1. You can either pass the sql statement directly as an argument to the tool.
2. You can write the sql statement(s) into a file and pass the sql file as an argument to the tool.

Second option allows you to execute multiple sql statements, whereas the first one lets you execute one at a time.

Samza SQL needs all the events in the topic to be uniform schema. And it also needs access to the schema corresponding to the events in a topic. Typically in an organization, there is a deployment of schema registry which maps topics to schemas.

In the absence of schema registry, Samza SQL console tool uses the convention to identify the schemas associated with the topic. If the topic name has string "page" it assumes the topic has PageViewEvents else ProfileChangeEvents.

{% highlight bash %}

# Usage of the tool

./scripts/samza-sql-console.sh
usage: Error: One of the (f or s) options needs to be set
samza-sql-console.sh
-f,--file <SQL_FILE> Path to the SQL file to execute.
-s,--sql <SQL_STMT> SQL statement to execute.

# Example command to filter out all the users who have moved to LinkedIn

./scripts/samza-sql-console.sh --sql "Insert into log.consoleOutput select Name, OldCompany from kafka.ProfileChangeStream where NewCompany = 'LinkedIn'"

{% endhighlight %}

You can run below sql commands using Samza sql console. Please make sure you are running generate-kafka-events tool to generate events into ProfileChangeStream before running the below command.

{% highlight bash %}
./scripts/samza-sql-console.sh --sql "Insert into log.consoleOutput select Name, OldCompany from kafka.ProfileChangeStream where NewCompany = 'LinkedIn'"

{% endhighlight %}
7 changes: 7 additions & 0 deletions docs/startup/download/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,12 @@ If you just want to play around with Samza for the first time, go to [Hello Samz

Starting from 2016, Samza will begin requiring JDK8 or higher. Please see [this mailing list thread](http://mail-archives.apache.org/mod_mbox/samza-dev/201610.mbox/%3CCAHUevGGnOQD_VmLWEdpFNq3Lv%2B6gQQmw_JKx9jDr5Cw%2BxFfGtQ%40mail.gmail.com%3E) for details on this decision.

### Samza Tools

Samza tools package contains command line tools that user can run to use Samza and it's input/output systems.

* [samza-tools-0.14.0.tgz](tbd)

### Source Releases

* [samza-sources-0.13.1.tgz](http://www.apache.org/dyn/closer.lua/samza/0.13.1)
Expand All @@ -40,6 +46,7 @@ Starting from 2016, Samza will begin requiring JDK8 or higher. Please see [this
* [samza-sources-0.8.0-incubating.tgz](https://archive.apache.org/dist/incubator/samza/0.8.0-incubating)
* [samza-sources-0.7.0-incubating.tgz](https://archive.apache.org/dist/incubator/samza/0.7.0-incubating)


### Maven

All Samza JARs are published through [Apache's Maven repository](https://repository.apache.org/content/groups/public/org/apache/samza/).
Expand Down
1 change: 1 addition & 0 deletions gradle/dependency-versions.gradle
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@
apacheCommonsCollections4Version = "4.0"
avroVersion = "1.7.0"
calciteVersion = "1.14.0"
commonsCliVersion = "1.2"
commonsCodecVersion = "1.9"
commonsCollectionVersion = "3.2.1"
commonsHttpClientVersion = "3.1"
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -167,7 +167,11 @@ public SamzaSqlRelMessage convertToRelMessage(KV<Object, Object> samzaMessage) {

@Override
public KV<Object, Object> convertToSamzaMessage(SamzaSqlRelMessage relMessage) {
GenericRecord record = new GenericData.Record(this.avroSchema);
return convertToSamzaMessage(relMessage, this.avroSchema);
}

protected KV<Object, Object> convertToSamzaMessage(SamzaSqlRelMessage relMessage, Schema avroSchema) {
GenericRecord record = new GenericData.Record(avroSchema);
List<String> fieldNames = relMessage.getFieldNames();
List<Object> values = relMessage.getFieldValues();
for (int index = 0; index < fieldNames.size(); index++) {
Expand Down
35 changes: 35 additions & 0 deletions samza-tools/config/eh-consumer-log4j.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE log4j:configuration PUBLIC "-//APACHE//DTD LOG4J 1.2//EN" "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
<appender name="fileAppender" class="org.apache.log4j.FileAppender">
<param name="File" value="./eh-consumer.log" />
<param name="Append" value="false" />
<layout class="org.apache.log4j.PatternLayout">
<param name="ConversionPattern" value="%d{yyyy-MM-dd HH:mm:ss} %-5p [%c{1}:%L] - %m%n"/>
</layout>
</appender>
<root>
<priority value ="info" />
<appender-ref ref="fileAppender" />
</root>
</log4j:configuration>
Loading

0 comments on commit 804aea2

Please sign in to comment.