fixing/adding sql docs to correct locations (apache#2849)
jerrypeng authored and merlimat committed Oct 26, 2018
1 parent 519cbe9 commit b797d7b
Showing 8 changed files with 765 additions and 0 deletions.
152 changes: 152 additions & 0 deletions site2/docs/sql-deployment-configurations.md
@@ -0,0 +1,152 @@
---
id: sql-deployment-configurations
title: Pulsar SQL Deployment and Configuration
sidebar_label: Deployment and Configuration
---

Below is a list of configurations for the Presto Pulsar connector and instructions on how to deploy a cluster.

## Presto Pulsar Connector Configurations
There are several configurations for the Presto Pulsar Connector. The properties file that contains these configurations can be found at ```${project.root}/conf/presto/catalog/pulsar.properties```.
The configurations for the connector and their default values are described below.

```properties
# name of the connector to be displayed in the catalog
connector.name=pulsar

# the url of Pulsar broker service
pulsar.broker-service-url=http://localhost:8080

# URI of Zookeeper cluster
pulsar.zookeeper-uri=localhost:2181

# minimum number of entries to read at a single time
pulsar.entry-read-batch-size=100

# default number of splits to use per query
pulsar.target-num-splits=4
```
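If you script your deployment, the same keys can be sanity-checked programmatically before starting a worker. A minimal sketch that treats the three connection-related keys above as required (an assumption for this sketch; the class and method names are hypothetical, not part of Pulsar):

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class PulsarCatalogCheck {
    // Keys from pulsar.properties that this sketch treats as required
    static final String[] REQUIRED_KEYS = {
        "connector.name", "pulsar.broker-service-url", "pulsar.zookeeper-uri"
    };

    /** Parses catalog properties text and returns how many required keys are missing. */
    public static int missingKeys(String propsText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propsText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        int missing = 0;
        for (String key : REQUIRED_KEYS) {
            if (!props.containsKey(key)) {
                missing++;
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String defaults = "connector.name=pulsar\n"
            + "pulsar.broker-service-url=http://localhost:8080\n"
            + "pulsar.zookeeper-uri=localhost:2181\n";
        System.out.println("missing keys: " + missingKeys(defaults));
    }
}
```

In a real deployment you would load the file from ```${project.root}/conf/presto/catalog/pulsar.properties``` instead of an inline string.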

## Query Pulsar from Existing Presto Cluster

If you already have an existing Presto cluster, you can copy the Presto Pulsar connector plugin to your existing cluster. You can download the archived plugin package via:

```bash
$ wget pulsar:binary_release_url
```

## Deploying a new cluster

Please note that the [Getting Started](sql-getting-started.md) guide shows you how to easily set up a standalone single-node environment to experiment with.

Pulsar SQL is powered by [Presto](https://prestodb.io), so many of the configurations for deploying a Presto cluster also apply to Pulsar SQL workers.

You can use the same CLI args as the Presto launcher:

```bash
$ ./bin/pulsar sql-worker --help
Usage: launcher [options] command

Commands: run, start, stop, restart, kill, status

Options:
-h, --help show this help message and exit
-v, --verbose Run verbosely
--etc-dir=DIR Defaults to INSTALL_PATH/etc
--launcher-config=FILE
Defaults to INSTALL_PATH/bin/launcher.properties
--node-config=FILE Defaults to ETC_DIR/node.properties
--jvm-config=FILE Defaults to ETC_DIR/jvm.config
--config=FILE Defaults to ETC_DIR/config.properties
--log-levels-file=FILE
Defaults to ETC_DIR/log.properties
--data-dir=DIR Defaults to INSTALL_PATH
--pid-file=FILE Defaults to DATA_DIR/var/run/launcher.pid
--launcher-log-file=FILE
Defaults to DATA_DIR/var/log/launcher.log (only in
daemon mode)
--server-log-file=FILE
Defaults to DATA_DIR/var/log/server.log (only in
daemon mode)
-D NAME=VALUE Set a Java system property

```

There is a set of default configurations for the cluster located in ```${project.root}/conf/presto```. You can change them to customize your deployment.

You can also set the worker to read from a different configuration directory as well as set a different directory for writing its data:

```bash
$ ./bin/pulsar sql-worker run --etc-dir /tmp/incubator-pulsar/conf/presto --data-dir /tmp/presto-1
```

You can also start the worker as a daemon process:

```bash
$ ./bin/pulsar sql-worker start
```

### Deploying to a 3-node cluster

For example, to deploy a Pulsar SQL/Presto cluster on 3 nodes, you can do the following:

First, copy the Pulsar binary distribution to all three nodes.

The first node will run the Presto coordinator. The minimal configuration in ```${project.root}/conf/presto/config.properties``` can be the following:

```properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=<coordinator-url>
```
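To reason about the two memory limits above: `query.max-memory` caps a query's total memory usage across the whole cluster, while `query.max-memory-per-node` caps its usage on any single node. A hypothetical helper (not part of Presto) that parses these size strings, useful when generating the config from a script:

```java
public class MemoryConfig {
    /** Parses a Presto-style memory size such as "50GB" or "1GB" into bytes. */
    public static long toBytes(String size) {
        String s = size.trim().toUpperCase();
        long multiplier;
        if (s.endsWith("GB"))      { multiplier = 1L << 30; }
        else if (s.endsWith("MB")) { multiplier = 1L << 20; }
        else if (s.endsWith("KB")) { multiplier = 1L << 10; }
        else { throw new IllegalArgumentException("unsupported unit: " + size); }
        return Long.parseLong(s.substring(0, s.length() - 2)) * multiplier;
    }

    public static void main(String[] args) {
        // With query.max-memory=50GB and query.max-memory-per-node=1GB, a single
        // query can use at most 1GB on any one node and 50GB cluster-wide.
        long total = toBytes("50GB");
        long perNode = toBytes("1GB");
        System.out.println("cluster-wide cap is " + (total / perNode) + "x the per-node cap");
    }
}
```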

Also, modify the ```pulsar.broker-service-url``` and ```pulsar.zookeeper-uri``` configs in ```${project.root}/conf/presto/catalog/pulsar.properties``` on those nodes accordingly.

Afterwards, you can start the coordinator by running:

```bash
$ ./bin/pulsar sql-worker run
```

For the other two nodes that will only serve as worker nodes, the configurations can be the following:

```properties
coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
discovery.uri=<coordinator-url>
```

Also, modify the ```pulsar.broker-service-url``` and ```pulsar.zookeeper-uri``` configs in ```${project.root}/conf/presto/catalog/pulsar.properties``` accordingly.

You can then start each worker by running:

```bash
$ ./bin/pulsar sql-worker run
```

You can check the status of your cluster from the SQL CLI. To start the SQL CLI:

```bash
$ ./bin/pulsar sql --server <coordinator_url>
```

You can then run the following command to check the status of your nodes:

```bash
presto> SELECT * FROM system.runtime.nodes;
node_id | http_uri | node_version | coordinator | state
---------+-------------------------+--------------+-------------+--------
1 | http://192.168.2.1:8081 | testversion | true | active
3 | http://192.168.2.2:8081 | testversion | false | active
2 | http://192.168.2.3:8081 | testversion | false | active
```
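If you monitor the cluster from a script rather than the interactive CLI, the same check amounts to counting rows whose `state` column is `active`. An illustrative sketch (a hypothetical class, not a Pulsar or Presto API) that parses CLI-style, pipe-separated output:

```java
import java.util.Arrays;

public class NodeStatus {
    /** Counts rows whose state column (the fifth field) equals "active".
     *  Input is the body of the system.runtime.nodes table, one row per line,
     *  columns separated by '|', as printed by the Presto CLI. */
    public static long activeNodes(String tableBody) {
        return Arrays.stream(tableBody.split("\n"))
                .map(row -> row.split("\\|"))
                .filter(cols -> cols.length >= 5)
                .filter(cols -> cols[4].trim().equals("active"))
                .count();
    }

    public static void main(String[] args) {
        String rows =
              " 1 | http://192.168.2.1:8081 | testversion | true  | active\n"
            + " 3 | http://192.168.2.2:8081 | testversion | false | active\n"
            + " 2 | http://192.168.2.3:8081 | testversion | false | active";
        System.out.println(activeNodes(rows) + " active nodes");
    }
}
```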


For more information about deployment in Presto, please refer to:

[Deploying Presto](https://prestodb.io/docs/current/installation/deployment.html)

142 changes: 142 additions & 0 deletions site2/docs/sql-getting-started.md
@@ -0,0 +1,142 @@
---
id: sql-getting-started
title: Pulsar SQL Getting Started
sidebar_label: Getting Started
---

It is super easy to get started querying data in Pulsar.

## Requirements
1. **Pulsar distribution**
    * If you haven't installed Pulsar, please refer to [Installing Pulsar](io-quickstart.md#installing-pulsar)
2. **Pulsar built-in connectors**
    * If you haven't installed the built-in connectors, please refer to [Installing Builtin Connectors](io-quickstart.md#installing-builtin-connectors)

First, start a Pulsar standalone cluster:

```bash
./bin/pulsar standalone
```

Next, start a Pulsar SQL worker:
```bash
./bin/pulsar sql-worker run
```

After both the Pulsar standalone cluster and the SQL worker are done initializing, run the SQL CLI:
```bash
./bin/pulsar sql
```

You can now start typing some SQL commands:


```bash
presto> show catalogs;
Catalog
---------
pulsar
system
(2 rows)

Query 20180829_211752_00004_7qpwh, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]


presto> show schemas in pulsar;
Schema
-----------------------
information_schema
public/default
public/functions
sample/standalone/ns1
(4 rows)

Query 20180829_211818_00005_7qpwh, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:00 [4 rows, 89B] [21 rows/s, 471B/s]


presto> show tables in pulsar."public/default";
Table
-------
(0 rows)

Query 20180829_211839_00006_7qpwh, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

```

Currently, there is no data in Pulsar that we can query. Let's start the built-in connector _DataGeneratorSource_ to ingest some mock data for us to query:

```bash
./bin/pulsar-admin source create --tenant test-tenant --namespace test-namespace --name generator --destinationTopicName generator_test --source-type data-generator
```

Afterwards, there will be a topic we can query in the namespace "public/default":

```bash
presto> show tables in pulsar."public/default";
Table
----------------
generator_test
(1 row)

Query 20180829_213202_00000_csyeu, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:02 [1 rows, 38B] [0 rows/s, 17B/s]
```

We can now query the data within the topic "generator_test":

```bash
presto> select * from pulsar."public/default".generator_test;

firstname | middlename | lastname | email | username | password | telephonenumber | age | companyemail | nationalidentitycardnumber |
-------------+-------------+-------------+----------------------------------+--------------+----------+-----------------+-----+-----------------------------------------------+----------------------------+
Genesis | Katherine | Wiley | [email protected] | genesisw | y9D2dtU3 | 959-197-1860 | 71 | [email protected] | 880-58-9247 |
Brayden | | Stanton | [email protected] | braydens | ZnjmhXik | 220-027-867 | 81 | [email protected] | 604-60-7069 |
Benjamin | Julian | Velasquez | [email protected] | benjaminv | 8Bc7m3eb | 298-377-0062 | 21 | [email protected] | 213-32-5882 |
Michael | Thomas | Donovan | [email protected] | michaeld | OqBm9MLs | 078-134-4685 | 55 | [email protected] | 443-30-3442 |
Brooklyn | Avery | Roach | [email protected] | broach | IxtBLafO | 387-786-2998 | 68 | [email protected] | 085-88-3973 |
Skylar | | Bradshaw | [email protected] | skylarb | p6eC6cKy | 210-872-608 | 96 | [email protected] | 453-46-0334 |
.
.
.
```
Now, you have some mock data to query and play around with!
If you want to try ingesting some of your own data to play around with, you can write a simple producer to write custom-defined data to Pulsar.
For example:
```java
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.impl.schema.AvroSchema;

public class Test {

    public static class Foo {
        private int field1;
        private String field2;
        private long field3;

        // Setters are needed so the loop below can populate the fields;
        // the Avro schema is generated from this class definition.
        public void setField1(int field1) { this.field1 = field1; }
        public void setField2(String field2) { this.field2 = field2; }
        public void setField3(long field3) { this.field3 = field3; }
    }

    public static void main(String[] args) throws Exception {
        PulsarClient pulsarClient = PulsarClient.builder().serviceUrl("pulsar://localhost:6650").build();
        Producer<Foo> producer = pulsarClient.newProducer(AvroSchema.of(Foo.class)).topic("test_topic").create();

        for (int i = 0; i < 1000; i++) {
            Foo foo = new Foo();
            foo.setField1(i);
            foo.setField2("foo" + i);
            foo.setField3(System.currentTimeMillis());
            producer.newMessage().value(foo).send();
        }
        producer.close();
        pulsarClient.close();
    }
}
```
Afterwards, you should be able to query the data you just wrote.
24 changes: 24 additions & 0 deletions site2/docs/sql-overview.md
@@ -0,0 +1,24 @@
---
id: sql-overview
title: Pulsar SQL Overview
sidebar_label: Overview
---

One of the common use cases of Pulsar is storing streams of event data. Often the event data is structured with predefined fields. There is tremendous value in being able to query the data already stored in Pulsar topics. With the implementation of the [Schema Registry](concepts-schema-registry.md), structured data can be stored in Pulsar, opening up the potential to query that data via SQL.

By leveraging [Presto](https://prestodb.io/), we have created a method for users to query structured data stored within Pulsar in a very efficient and scalable manner. We discuss why this is efficient and scalable in the [Performance](#performance) section below.

At the core of Pulsar SQL is the Presto Pulsar connector, which allows Presto workers within a Presto cluster to query data from Pulsar.


![Pulsar SQL architecture](assets/pulsar-sql-arch-2.png)


## Performance

Query performance is very efficient and highly scalable because of Pulsar's [two-level, segment-based architecture](concepts-architecture-overview.md#apache-bookkeeper).

Topics in Pulsar are stored as segments in [Apache BookKeeper](https://bookkeeper.apache.org/). Each topic segment is also replicated to a configurable (default 3) number of BookKeeper nodes, which allows for concurrent reads and high read throughput. The Presto Pulsar connector reads data directly from BookKeeper to take advantage of Pulsar's segment-based architecture, so Presto workers can read concurrently from a horizontally scalable number of BookKeeper nodes.


![Pulsar SQL architecture](assets/pulsar-sql-arch-1.png)
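To make the split-based parallelism concrete, here is an illustrative calculation (not the connector's actual code) of how a topic's entries could be divided into `pulsar.target-num-splits` contiguous ranges, each readable by a different Presto worker:

```java
public class SplitSketch {
    /** Divides totalEntries into numSplits contiguous [start, end) ranges,
     *  spreading any remainder across the first splits. */
    public static long[][] splitRanges(long totalEntries, int numSplits) {
        long[][] ranges = new long[numSplits][2];
        long base = totalEntries / numSplits;
        long remainder = totalEntries % numSplits;
        long start = 0;
        for (int i = 0; i < numSplits; i++) {
            long size = base + (i < remainder ? 1 : 0);
            ranges[i][0] = start;        // first entry in this split
            ranges[i][1] = start + size; // one past the last entry
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // With the default pulsar.target-num-splits=4, 1000 entries split into
        // four ranges of 250 entries each.
        for (long[] r : splitRanges(1000, 4)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

Because the underlying segments are replicated across BookKeeper nodes, such ranges can be read concurrently without contending on a single broker.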
5 changes: 5 additions & 0 deletions site2/website/sidebars.json
@@ -34,6 +34,11 @@
"io-connectors",
"io-develop"
],
"Pulsar SQL": [
"sql-overview",
"sql-getting-started",
"sql-deployment-configurations"
],
"Deployment": [
"deploy-aws",
"deploy-kubernetes",