This example program shows how to use the Kudu Python API to load data generated by an external program into a new or existing Kudu table.
Make sure the Kudu client library is installed and the kudu Python bindings are available. If the client library or the Python bindings are installed in a non-standard location, set the following environment variables to the corresponding directories:
LD_LIBRARY_PATH PYTHONPATH
You also need the dstat program, which should be available from your distribution's package repository.
In this example, the dstat program generates data about the system load and writes it to a named pipe, which is then read and piped to the Python program.
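The named-pipe handoff described above can be sketched roughly as follows. This is a minimal illustration, not the code in kudu_dstat.py: the helper names read_fifo_lines and demo, the file name dstat.fifo, and the two-column rows are all hypothetical, and a writer thread stands in for dstat so the sketch runs without it. It assumes a POSIX system where os.mkfifo is available.

```python
import os
import tempfile
import threading


def read_fifo_lines(path):
    """Yield stripped, non-empty lines from a named pipe until the writer closes it."""
    # Opening a FIFO for reading blocks until a writer opens the other end.
    with open(path, "r") as fifo:
        for line in fifo:
            line = line.strip()
            if line:
                yield line


def demo():
    # Hypothetical stand-in for dstat: a thread writes two CSV-like rows.
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "dstat.fifo")
    os.mkfifo(path)

    def writer():
        with open(path, "w") as out:
            out.write("1,0.5\n2,0.7\n")

    thread = threading.Thread(target=writer)
    thread.start()
    # Split each line into fields, as a loader would before inserting a row.
    rows = [line.split(",") for line in read_fifo_lines(path)]
    thread.join()
    os.remove(path)
    os.rmdir(tmpdir)
    return rows
```

In a real setup the writer side would be the dstat process rather than a thread, and each parsed row would be handed to the Kudu client instead of being collected in a list.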
To execute this script simply run:
python kudu_dstat.py
This creates the table, assuming that a kudu-master is running locally. You can use the Web UI to view some information about the table at http://localhost:8051. The program runs until it is terminated with Ctrl-C.
To drop the table in Kudu and start fresh, run the program with:
python kudu_dstat.py drop
To query the data via Impala, create an external table in Impala that maps to the Kudu table, using the following command in the impala-shell:
CREATE EXTERNAL TABLE dstat (
`ts` BIGINT,
`usr` FLOAT,
`sys` FLOAT,
`idl` FLOAT,
`wai` FLOAT,
`hiq` FLOAT,
`siq` FLOAT,
`read` FLOAT,
`writ` FLOAT,
`recv` FLOAT,
`send` FLOAT,
`in` FLOAT,
`out` FLOAT,
`int` FLOAT,
`csw` FLOAT
)
TBLPROPERTIES(
'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
'kudu.table_name' = 'dstat',
'kudu.master_addresses' = '127.0.0.1:7051',
'kudu.key_columns' = 'ts'
);
Now you can query your local system's load using:
-- How many rows are stored right now?
select count(*) from dstat;
-- Average load in 10s windows
select (ts - ts % 10) as mod_ts, avg(usr), avg(sys), avg(idl) from dstat group by mod_ts order by mod_ts;
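The windowing expression ts - ts % 10 rounds each timestamp down to the start of its 10-second bucket. As a rough illustration of what the last query computes, here is a small sketch that applies the same bucketing to in-memory samples. The function name average_by_window and the sample values are made up for the example, and only one metric column is averaged.

```python
from collections import defaultdict


def average_by_window(rows, window=10):
    """Group (ts, value) pairs into fixed windows and average each window."""
    buckets = defaultdict(list)
    for ts, value in rows:
        # Same expression as in the SQL query: round ts down to the window start.
        buckets[ts - ts % window].append(value)
    # Sort by window start, mirroring "order by mod_ts".
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Hypothetical samples: (timestamp in seconds, usr CPU percentage).
samples = [(100, 2.0), (105, 4.0), (112, 10.0)]
windows = average_by_window(samples)
# windows == [(100, 3.0), (110, 10.0)]
```

Timestamps 100 and 105 fall into the window starting at 100, while 112 falls into the window starting at 110.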