Presto SQL by Facebook
backed by Teradata from mid 2015
Starburst - company backing Presto spun out from Teradata.
PrestoSQL.io community replacing Facebook PrestoDB.io
- written in Java
- 2-7.5x faster than Hive, 4-7x more CPU efficient
- Key Points
- Instant Presto using Docker
- High Availability
- Access Control
- Coordinators
- Workers
- Connectors
- Catalog
- Logs
- Presto Verifier
- Benchmark Tool
- Python library
- Setup
- Configs
- API
- Presto SQL
- OLAP for data warehousing
- not OLTP
- uses Hive Metastore
- external data sources eg. Cassandra
- ANSI SQL
- query federation via external catalogs
- CLI is a self-executing jar, downloaded separately (see the download sketch after this list)
- pipelined execution
- no wait time - starts all stages and streams data from one stage to another
- no fault tolerance
- no intermediate storage (all in memory)
- does not tolerate failures - queries fail outright like Impala / Drill etc, since there is no intermediate state to recover from
- when joining 2 tables, one of them must fit in memory
- HTTP for data transfer
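A minimal sketch of fetching and running the CLI - the download location is a placeholder (grab the presto-cli executable jar matching your server version), and catalog/schema are assumptions:

wget -O presto <download_site>/presto-cli-<version>-executable.jar
chmod +x presto
./presto --server localhost:8080 --catalog hive --schema default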
Port | Description |
---|---|
8080 | WebUI, inter-node communication, HTTP API (ODBC/JDBC drivers use this too) |
This docker command pulls the docker image and runs a container dropping you into an SQL shell on a one node Presto instance running in Docker:
docker run -ti harisekhon/presto
or
docker run -ti harisekhon/presto-dev
Then see the Presto SQL section further down.
See DockerHub for a selection of docker images - the Dockerfiles for various Presto builds and other technologies can be found in the HariSekhon/Dockerfiles GitHub repo.
Run 2 coordinators and load balance with stickiness in front of the coordinator nodes, eg. using HAProxy:
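A minimal HAProxy sketch, assuming two coordinators on port 8080 (hostnames are placeholders) - source-IP stickiness keeps a client's follow-up polling requests going to the same coordinator:

frontend presto_front
    bind *:8080
    mode http
    default_backend presto_coordinators

backend presto_coordinators
    mode http
    balance source                    # stick each client to one coordinator by source IP
    option httpchk GET /v1/info
    server coordinator1 coordinator1.example.com:8080 check
    server coordinator2 coordinator2.example.com:8080 check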
- Nodes communicate between each other using REST API
- Presto CLI, JDBC / ODBC drivers all use REST API
- separates authenticating user from authorized user
- requires customization to SystemAccessControlFactory and ConnectorAccessControlFactory for authorization
- JDBC driver
- Kerberos support (kerberized CLI sketch below)
- Presto does not allow drop table if you are not the owner of the table
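Relating to the Kerberos support above - a hedged sketch of connecting the CLI to a kerberized coordinator over HTTPS (hostname, principal, keytab and keystore values are assumptions, reusing the example values from the Configs section further down):

./presto --server https://coordinator.example.com:7778 \
         --krb5-config-path /etc/krb5.conf \
         --krb5-remote-service-name presto \
         --krb5-principal hari@REALM \
         --krb5-keytab-path /home/hari/hari.keytab \
         --keystore-path /etc/presto_keystore.jks \
         --keystore-password changeit \
         --catalog hive --schema default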
- parses SQL
- plans queries
- tracks query execution
- manages worker nodes
- communicates with workers via REST API
- fetches results from workers and returns them to client (bottleneck?!)
- executes tasks
- fetches data from connectors
- exchanges intermediate data with other workers
- advertises to discovery server in the coordinator
- communicates with other workers and coordinators via REST API
- Yarn support via Slider
Connectors are pluggable interfaces to other sources.
The catalog namespace becomes <name> from the etc/catalog/<name>.properties file name, and the connector type is taken from the connector.name property inside the file, eg.
- multiple Hive clusters: just create different etc/catalog/<name>.properties files, each containing connector.name=hive-hadoop2 (see the sketch after the connector list below)
- HDFS - if not using Kerberos, connects using the OS user of the Presto process
  - override in JVM Config with -DHADOOP_USER_NAME=hdfs_user
- Kerberos ticket cache not supported yet - does this mean it kinit's before every request or at daemon startup?
- Hive (metastore)
- Local Files
- HDFS
- MySQL
- PostgreSQL
- Kafka
- Cassandra
- Redis
- AWS Redshift
- JMX periodic dumping
- MongoDB
- Accumulo
- Microsoft SQL Server
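Sketch of the two-Hive-clusters case mentioned above - two catalog files, same connector, metastore hostnames are placeholders:

# etc/catalog/hive_prod.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-prod.example.com:9083

# etc/catalog/hive_dr.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-dr.example.com:9083

Tables are then addressable as hive_prod.<db>.<table> and hive_dr.<db>.<table> in the same query.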
Term | Description |
---|---|
Schema | way to organize tables |
Catalog | schema + reference to connector for data source |
Statement | textual SQL query |
Query | config + components instantiated to execute the statement |
- multiple catalogs can use the same connector eg. query 2 Hive clusters in 1 query
- table names are fully qualified rooted to the catalog eg. hive.<db>.<table>
- ACLs are only allow-all / read-only / file based / custom class:
  - access-control.name=allow-all
  - access-control.name=read-only
  - access-control.name=file
    - file method has allow: true/false for each catalog (json does not permit comments)
    - security.config-file=etc/rules.json (see the sketch below)
  - MyCustomClass
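A minimal sketch of etc/rules.json for the file method, assuming catalogs named hive and jmx - rules are evaluated top-down and the first match wins:

{
  "catalogs": [
    {"user": "admin", "catalog": ".*", "allow": true},
    {"user": ".*", "catalog": "hive", "allow": true},
    {"catalog": "jmx", "allow": false}
  ]
}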
Main log: server.log
Startup stdout/stderr log: launcher.log
HTTP request log: http-request.log
https://prestodb.io/docs/current/installation/verifier.html
- set up a MySQL db table definition
- create config.properties
Download jar and run it:
wget presto-verifier-<version>-executable.jar -O verifier &&
chmod +x verifier
./verifier config.properties
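A hedged sketch of the verifier config.properties, assuming a MySQL results database and two Presto gateways to compare (hosts and credentials are placeholders):

suite=my_suite
query-database=jdbc:mysql://localhost:3306/presto_verifier?user=verifier&password=verifier
control.gateway=jdbc:presto://control-coordinator:8080
test.gateway=jdbc:presto://test-coordinator:8080
thread-count=1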
https://prestodb.io/docs/current/installation/benchmark-driver.html
- No Parquet support
- No local/embedded mode
- Supports formats: Text, SequenceFile, RCFile, ORC
- Connectors for Hive, Cassandra, JMX, Kafka, MySQL, PostgreSQL, S3
Recommended to use presto-admin
Download Teradata distribution server tarball, which contains:
- prestoadmin
- presto-server-rpm
- presto-cli-...-executable.jar
prestoadmin:
- online installer downloads from the internet, more platform independent
- offline installer is recommended because it's faster
- uses paramiko SSH with keys or password
- basically just copies the RPM around and does start/stop, deploys configs etc (see the command sketch below)
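A hedged sketch of the usual presto-admin flow (RPM path is a placeholder):

./presto-admin server install /path/to/presto-server-rpm-<version>.rpm
./presto-admin configuration deploy     # push the etc/ configs out to all nodes
./presto-admin server start
./presto-admin server status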
etc/config.properties:
# set false on just workers
coordinator=true
# might want to set false on dedicated coordinators
node-scheduler.include-coordinator=true
http-server.http.port=8080
# tune this according to -Xmx in jvm.config
query.max-memory=2GB
query.max-memory-per-node=512MB
discovery-server.enabled=true
# must not end in a slash
discovery.uri=http://localhost:8080
# metrics monitoring
jmx.rmiserver.port =
http-server.max-request.header-size=128KB
http-server.authentication=KERBEROS
http-server.authentication.krb5.service=presto
http.server.authentication.krb5.keytab=/etc/presto/presto.keytab
http.authentication.krb5.config=/etc/krb5.conf
http.server.authentication.keytabs=keytab1,keytab2
http.server.authentication.principals=HTTP/<fqdn>@REALM,HTTP/<fqdn>@REALM2
http-server.https.enabled=true
http-server.https.port=7778
http-server.https.keystore.path=/etc/presto_keystore.jks
http-server.https.keystore.key=changeit
# Debug Krb using these in jvm.config
-Dsun.security.krb5.debug=true
-Dlog.enable-console=true
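The companion per-node files from the standard deployment layout - a minimal sketch, values are placeholders (query.max-memory-per-node above should fit inside the -Xmx here):

etc/node.properties:
node.environment=production
node.id=<unique-id-per-node>
node.data-dir=/var/presto/data

etc/jvm.config:
-server
-Xmx4G
-XX:+UseG1GC
-XX:+HeapDumpOnOutOfMemoryError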
etc/catalog/<name>.properties:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://<fqdn>:9083
hive.metastore-cache-ttl=30
# for HDFS HA setups add:
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
# this config must be on all presto nodes, as must the hadoop conf files
hive.metastore.authentication.type=KERBEROS
hive.hdfs.authentication.type=KERBEROS
# no authz in Presto yet so need to use HDFS ACLs and impersonation
hive.hdfs.impersonation.enabled=true
# _HOST will be auto-replaced by the hostname of the Presto worker
hive.metastore.service.principal=hive/_HOST@<REALM>
hive.metastore.client.principal=hive/_HOST@<REALM>
hive.metastore.client.keytab=/etc/security/keytabs/hive.service.keytab
hive.hdfs.presto.principal=hive/_HOST@<REALM>
hive.hdfs.presto.keytab=/etc/security/keytabs/hive.service.keytab
Presto /etc/presto/catalog/postgres.properties:
connector.name=postgresql
connection-url=jdbc:postgresql://<fqdn>/<db>?schema=<schema>
connection-user=hari
connection-password=test123
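With the hive and postgres catalogs above both defined, query federation is just a cross-catalog join - the table and column names here are made up for illustration:

SELECT o.customer_id, c.email
FROM hive.default.orders o
JOIN postgres.public.customers c ON o.customer_id = c.customer_id;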
TODO: docker system connector
Presto queries via HTTP API - the client sends the SQL to /v1/statement
=> follows the 2nd URL href returned
Would then have to poll that second ref for the query's state
- if using the API via a client library, instead use fetchall() to block on query completion and test the status (see the sketches below)
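A hedged illustration of the raw flow - POST the SQL with a user header, then keep GETing the nextUri from each JSON response until the query finishes (host and user are assumptions):

curl -s -X POST -H 'X-Presto-User: hari' -d 'SELECT 1' http://localhost:8080/v1/statement
# response JSON contains a nextUri - follow it repeatedly until no nextUri remains and the state is FINISHED/FAILED

And via a Python client - a minimal sketch assuming PyHive (pip install 'pyhive[presto]'); fetchall() does that polling for you and blocks until completion:

from pyhive import presto

# host / port / catalog / schema are assumptions - adjust for your cluster
conn = presto.connect(host='localhost', port=8080, username='hari', catalog='hive', schema='default')
cur = conn.cursor()
cur.execute('SHOW TABLES')
print(cur.fetchall())   # blocks until the query completes, then returns all rows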
SHOW CATALOGS;
USE sets <catalog>.<schema> or just <schema> - it won't take a bare <catalog>, so you must use <catalog>.<anything> to set the catalog before SHOW SCHEMAS will work:
USE hive.blah;
SHOW SCHEMAS;
SHOW TABLES FROM hive.default;
Variables:
SHOW SESSION;
USE memory.test;
Create table - can't use backticks, and there is no 'string' type like Hive's, use varchar:
CREATE TABLE test ("first" varchar(20));
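Continuing in the memory catalog set with USE memory.test above - a quick sketch of writing and reading back (the memory connector keeps data in RAM, testing only):

INSERT INTO test VALUES ('hello'), ('world');
SELECT * FROM test;
SELECT * FROM memory.test.test;   -- same table, fully qualified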
TODO document show nodes, schemas, tables, kill queries etc
Partial port from private Knowledge Base page 2015+