Python client for the Impala and Hive distributed query engines.
- Lightweight, pip-installable package for connecting to Impala and Hive databases
- Fully DB API 2.0 (PEP 249)-compliant Python client (similar to sqlite or MySQL clients) supporting Python 2.6+ and Python 3.3+
- Connects to HiveServer2; runs with Kerberos, LDAP, SSL
- SQLAlchemy connector
- Converter to pandas DataFrame, allowing easy integration into the Python data stack (including scikit-learn and matplotlib)
The following features will be removed in a future release:
- BigDataFrame
- beeswax support
- scikit-learn wrapper
- numba-compiled Python UDFs
See the Ibis project for continued development of these higher-level features.
Required:
- Python 2.6+ or 3.3+
- six
- thrift_sasl
- bit_array
- thrift (on Python 2.x) or thriftpy (on Python 3.x)
Optional:
- pandas for conversion to DataFrame objects
- python-sasl for Kerberos support (for Python 3.x support, requires laserson/python-sasl@cython)
- sqlalchemy for the SQLAlchemy engine
- pytest for running tests; unittest2 for testing on Python 2.6
Install the latest release (0.11.1) with pip:
pip install impyla
For the latest (dev) version, install directly from the repo:
pip install git+https://github.com/cloudera/impyla.git
or clone the repo:
git clone https://github.com/cloudera/impyla.git
cd impyla
python setup.py install
impyla uses the pytest toolchain, and depends on the following environment variables:
export IMPYLA_TEST_HOST=your.impalad.com
export IMPYLA_TEST_PORT=21050
export IMPYLA_TEST_AUTH_MECH=NOSASL
To run the maximal set of tests, run:
cd path/to/impyla
py.test --connect impyla
Leave out the --connect option to skip tests for DB API compliance.
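As a rough illustration of how the three environment variables map onto connection settings, the sketch below collects them into keyword arguments of the kind accepted by impala.dbapi.connect (host, port, auth_mechanism). The helper name connection_kwargs_from_env and its fallback values are assumptions made for this example; they are not part of impyla's test suite.

```python
import os


def connection_kwargs_from_env(env=None):
    """Build connect() keyword arguments from the IMPYLA_TEST_* variables.

    Hypothetical helper: the fallback values are assumptions chosen to
    mirror the example exports, not defaults defined by impyla.
    """
    env = os.environ if env is None else env
    return {
        'host': env.get('IMPYLA_TEST_HOST', 'localhost'),
        # Environment variables are strings; the port must be an integer.
        'port': int(env.get('IMPYLA_TEST_PORT', '21050')),
        'auth_mechanism': env.get('IMPYLA_TEST_AUTH_MECH', 'NOSASL'),
    }


# A plain dict stands in for os.environ so the example runs anywhere.
kwargs = connection_kwargs_from_env({
    'IMPYLA_TEST_HOST': 'your.impalad.com',
    'IMPYLA_TEST_PORT': '21050',
    'IMPYLA_TEST_AUTH_MECH': 'NOSASL',
})
print(kwargs)

# With impyla installed, the result could be passed on as:
#   from impala.dbapi import connect
#   conn = connect(**kwargs)
```

Converting the port up front keeps a malformed IMPYLA_TEST_PORT from surfacing later as a less obvious connection error.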
Impyla implements the Python DB API v2.0 (PEP 249) database interface (refer to it for API details):
from impala.dbapi import connect
conn = connect(host='my.host.com', port=21050)
cursor = conn.cursor()
cursor.execute('SELECT * FROM mytable LIMIT 100')
print(cursor.description)  # prints the result set's schema
results = cursor.fetchall()
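Because the interface is plain PEP 249, a cursor can be wrapped in contextlib.closing so that it is closed even if the query raises. The sketch below is a minimal illustration only: run_query is a hypothetical helper, and an in-memory sqlite3 connection stands in for the impyla connection so the example runs without a cluster.

```python
import sqlite3
from contextlib import closing


def run_query(conn, sql):
    """Run sql on any DB API 2.0 connection; return (column_names, rows)."""
    with closing(conn.cursor()) as cursor:
        cursor.execute(sql)
        # description holds one 7-item sequence per column; item 0 is the name.
        names = [col[0] for col in cursor.description]
        return names, cursor.fetchall()


# Stand-in connection (assumption for the demo); with impyla this would be
# the object returned by impala.dbapi.connect(host=..., port=...).
conn = sqlite3.connect(':memory:')
with closing(conn.cursor()) as setup:
    setup.execute('CREATE TABLE mytable (id INTEGER, name TEXT)')
    # The '?' placeholders are sqlite3's parameter style, used only for setup.
    setup.executemany('INSERT INTO mytable VALUES (?, ?)', [(1, 'a'), (2, 'b')])

names, rows = run_query(conn, 'SELECT * FROM mytable LIMIT 100')
print(names)
print(rows)
conn.close()
```

Returning the column names alongside the rows keeps the schema available after the cursor has been closed.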
The Cursor object also exposes the iterator interface, which is buffered (controlled by cursor.arraysize):
cursor.execute('SELECT * FROM mytable LIMIT 100')
for row in cursor:
    process(row)
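For explicit control over how many rows are handled at a time, PEP 249 also defines fetchmany, whose default size is cursor.arraysize. The sketch below shows one way to consume a result set in fixed-size batches. The helper iter_batches is hypothetical, and sqlite3 stands in for an impyla cursor so the code runs locally.

```python
import sqlite3


def iter_batches(cursor, batch_size):
    """Yield lists of at most batch_size rows until the cursor is exhausted."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


# Stand-in database (assumption for the demo): ten rows in an in-memory table.
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE mytable (n INTEGER)')
cursor.executemany('INSERT INTO mytable VALUES (?)', [(i,) for i in range(10)])

cursor.execute('SELECT n FROM mytable ORDER BY n LIMIT 100')
batch_sizes = [len(batch) for batch in iter_batches(cursor, 4)]
print(batch_sizes)
conn.close()
```

Batching like this keeps memory use bounded by the batch size rather than by the size of the full result set.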
You can also get back a pandas DataFrame object:
from impala.util import as_pandas
df = as_pandas(cursor)
# carry df through scikit-learn, for example
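Conceptually, a conversion of this kind pairs the column names in cursor.description with the fetched rows. The sketch below shows that idea in plain Python and does not claim to mirror how as_pandas is implemented. The helper rows_to_columns is hypothetical, and sqlite3 stands in for an impyla cursor. The resulting dict of columns is a shape that pandas.DataFrame accepts when pandas is installed.

```python
import sqlite3


def rows_to_columns(cursor):
    """Return {column_name: [values, ...]} for the cursor's pending result set."""
    names = [col[0] for col in cursor.description]
    rows = cursor.fetchall()
    if rows:
        # zip(*rows) transposes row tuples into per-column tuples.
        columns = [list(values) for values in zip(*rows)]
    else:
        columns = [[] for _ in names]
    return dict(zip(names, columns))


# Stand-in database (assumption for the demo).
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE mytable (x REAL, label TEXT)')
cursor.executemany('INSERT INTO mytable VALUES (?, ?)', [(0.5, 'a'), (1.5, 'b')])

cursor.execute('SELECT x, label FROM mytable ORDER BY x')
columns = rows_to_columns(cursor)
print(columns)
conn.close()
```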