vagrant boot strap to get Kafka, Hadoop/YARN, Samza running and launching the sample wikipedia jobs
joestein committed Jan 13, 2014
1 parent c5ebd92 commit a75d797
Showing 6 changed files with 120 additions and 2 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -4,6 +4,7 @@
target/
.classpath
.project
.vagrant
.settings/
.idea/
.idea_modules/
33 changes: 32 additions & 1 deletion README.md
@@ -3,4 +3,35 @@ hello-samza

Hello Samza is a starter project for Samza jobs.

Please see [Hello Samza](http://samza.incubator.apache.org/) to get started.
Use Vagrant to get up and running.

1) Install Vagrant [http://www.vagrantup.com/](http://www.vagrantup.com/)
2) Install VirtualBox [https://www.virtualbox.org/](https://www.virtualbox.org/)

Once both are installed, clone this repository and boot the virtual machine.

    cd hello-samza
    vagrant up

This takes roughly 10-15 minutes: it installs Kafka, Hadoop/YARN, and Samza, configures them to work together, and launches the jobs.

Once the VM is launched and you are back at a command prompt, log in to the virtual machine and see what's running.

    vagrant ssh
    cd /vagrant

The wikipedia-feed Samza job consumes a feed of real-time edits from Wikipedia and produces them to a Kafka topic called "wikipedia-raw". You can watch this in real time by using the Kafka console consumer to view the topic.

    deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-raw

The wikipedia-parser Samza job parses the messages in wikipedia-raw and extracts information about the size of the edit, who made the change, etc. It outputs this information to the wikipedia-edits topic.

    deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-edits

The wikipedia-stats Samza job reads messages from the wikipedia-edits topic, and calculates counts, every ten seconds, for all edits that were made during that window. It outputs these counts to the wikipedia-stats topic.

    deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-stats
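
The ten-second windowing described for wikipedia-stats can be pictured with a small shell sketch. This is only a rough analogy and not how the Samza job is implemented: the function name `count_per_window` and the input format (`<epoch_seconds> <title>`, one edit per line) are assumptions made for illustration, and the sketch buckets edits by their timestamp into fixed windows.

```shell
#!/bin/sh
# Hypothetical illustration: count edits per fixed-size time window.
# Input on stdin: one line per edit, "<epoch_seconds> <title>".
# This input format is assumed; it is not the wikipedia-edits schema.
count_per_window() {
  window="${1:-10}"
  awk -v w="$window" '
    # Round each timestamp down to the start of its window and tally it.
    { start = $1 - ($1 % w); counts[start]++ }
    END { for (s in counts) print s, counts[s] }
  ' | sort -n
}

# Three edits fall in the window starting at 100, one in the window at 110.
printf '%s\n' "101 Foo" "105 Bar" "109 Baz" "112 Qux" | count_per_window 10
```

Each output line is a window start followed by the number of edits seen in that window.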

You can also view the running Samza jobs in the YARN UI at http://192.168.80.20:8088/cluster/apps.
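
Because provisioning takes several minutes, it can be handy to poll the YARN UI until it answers. The following is a minimal sketch, assuming `curl` is available on the machine where it runs; the function name `wait_for_url` and the `PROBE` override are hypothetical helpers introduced here, not part of this repository.

```shell
#!/bin/sh
# Poll a URL until it responds or the attempts run out.
# PROBE is a hypothetical hook for swapping the probe command;
# by default the sketch assumes curl is installed.
wait_for_url() {
  url="$1"
  attempts="${2:-30}"
  delay="${3:-10}"
  probe="${PROBE:-curl --silent --fail --output /dev/null}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Intentionally unquoted so the probe command splits into words.
    if $probe "$url"; then
      echo "up after $i attempt(s): $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "gave up after $attempts attempt(s): $url" >&2
  return 1
}

# Usage, with the address from the Vagrantfile's private network setting:
# wait_for_url http://192.168.80.20:8088/cluster/apps
```

With the defaults the function tries up to 30 times, ten seconds apart, and returns a non-zero status if the UI never responds.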

To see how this is set up and how it works, look at `vagrant/bootstrap.sh` and [Hello Samza](http://samza.incubator.apache.org/).
35 changes: 35 additions & 0 deletions Vagrantfile
@@ -0,0 +1,35 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# -*- mode: ruby -*-
# vi: set ft=ruby :

# Vagrantfile API/syntax version. Don't touch unless you know what you're doing!
VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "precise64"

  # The url from where the 'config.vm.box' box will be fetched if it
  # doesn't already exist on the user's system.
  config.vm.box_url = "http://files.vagrantup.com/precise64.box"

  config.vm.define "samza" do |samza|
    samza.vm.network :private_network, ip: "192.168.80.20"
    samza.vm.provider :virtualbox do |vb|
      vb.customize ["modifyvm", :id, "--memory", "2048"]
    end
    samza.vm.provision "shell", path: "vagrant/bootstrap.sh", :args => "1"
  end
end
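
The address used for the YARN UI in the README comes from the `private_network` line above. As a small sketch, that address could be read back out of a Vagrantfile with `sed`; the function name `private_ip` is a hypothetical helper, and the pattern assumes the `ip: "..."` form used in this file.

```shell
#!/bin/sh
# Print the private network IP found in a Vagrantfile.
# Reads the files given as arguments, or stdin when none are given.
# Assumes the 'ip: "a.b.c.d"' form used in this commit's Vagrantfile.
private_ip() {
  sed -n 's/.*private_network.*ip: *"\([^"]*\)".*/\1/p' "$@"
}

# Example using the line from the Vagrantfile above.
echo 'samza.vm.network :private_network, ip: "192.168.80.20"' | private_ip
```

Inside the repository, `private_ip Vagrantfile` would read the file directly.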
2 changes: 1 addition & 1 deletion bin/grid
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/bin/bash -x

# This script will download, setup, start, and stop servers for Kafka, YARN, and ZooKeeper,
# as well as downloading, building and locally publishing Samza
1 change: 1 addition & 0 deletions pom.xml
@@ -155,6 +155,7 @@
<version>0.9</version>
<configuration>
<excludes>
<exclude>.vagrant/**</exclude>
<exclude>.git/**</exclude>
<exclude>*.md</exclude>
<exclude>docs/**</exclude>
50 changes: 50 additions & 0 deletions vagrant/bootstrap.sh
@@ -0,0 +1,50 @@
#!/bin/bash -x
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apt-get -y update
apt-get install -y software-properties-common python-software-properties
add-apt-repository -y ppa:webupd8team/java
apt-get -y update
/bin/echo debconf shared/accepted-oracle-license-v1-1 select true | /usr/bin/debconf-set-selections
apt-get -y install oracle-java7-installer oracle-java7-set-default

apt-get -y install git vim wget screen curl

export JAVA_HOME=/usr

su vagrant -c "touch ~/.bashrc"
# Persist JAVA_HOME for the vagrant user's future shells.
su vagrant -c "echo 'export JAVA_HOME=/usr' >> ~/.bashrc"

cd /tmp
wget http://apache.spinellicreations.com/maven/maven-3/3.1.1/binaries/apache-maven-3.1.1-bin.tar.gz
mkdir -p /opt/apache
cd /opt/apache/
tar -xvf /tmp/apache-maven-3.1.1-bin.tar.gz
export PATH=/opt/apache/apache-maven-3.1.1/bin:$PATH
# Escape \$PATH so it is expanded when .bashrc is sourced, not now by the provisioning shell.
su vagrant -c "echo 'export PATH=/opt/apache/apache-maven-3.1.1/bin:\$PATH' >> ~/.bashrc"

cd /vagrant
su vagrant -c "bin/grid bootstrap"

su vagrant -c "mvn clean package"
su vagrant -c "mkdir -p deploy/samza"
su vagrant -c "tar -xvf ./samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz -C deploy/samza"
su vagrant -c "deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties"
sleep 10
su vagrant -c "deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-parser.properties"
sleep 10
su vagrant -c "deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-stats.properties"
sleep 10
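
The three launch commands above differ only in the job name, so they could be generated in a loop. The following is a sketch under that assumption: the paths and the config factory class are copied from the script, while the function name `job_command` and its base-directory parameter are illustrative. It only prints the commands (a dry run) rather than starting anything.

```shell
#!/bin/sh
# Build the run-job.sh command line for one job name.
# Paths and the config factory class are copied from bootstrap.sh;
# the function name and the base-directory argument are illustrative.
job_command() {
  job="$1"
  base="${2:-$PWD}"
  printf '%s %s %s\n' \
    "deploy/samza/bin/run-job.sh" \
    "--config-factory=org.apache.samza.config.factories.PropertiesConfigFactory" \
    "--config-path=file://$base/deploy/samza/config/$job.properties"
}

# Dry run: print the three launch commands in pipeline order.
for job in wikipedia-feed wikipedia-parser wikipedia-stats; do
  job_command "$job" /vagrant
done
```

To launch instead of print, each line could be passed to `su vagrant -c "..."` followed by `sleep 10`, mirroring the script.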
