This repository was archived by the owner on Apr 11, 2024. It is now read-only.

Open partitioned parquet files #16

Closed
dvictori opened this issue Sep 21, 2023 · 9 comments

@dvictori

dvictori commented Sep 21, 2023

I'm trying out Metabase with DuckDB for processing and viewing large parquet files. I'm running Metabase from a Docker container following the instructions in this repo's README.

I was able to query several single parquet files using

select * from 'path/to/parquet_single_data/*.parquet'

However, I can't seem to figure out how to do the same for a parquet dataset that was created with partitions. My dataset is partitioned by year like so:

  • parquet_data/ano=2013/part-0.parquet
  • parquet_data/ano=2014/part-0.parquet
  • parquet_data/ano=2015/part-0.parquet
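(Editorial note: in a hive-partitioned layout like the one above, the data files live one directory level below the dataset root, which is why a single-level glob at the root matches nothing. A minimal standard-library Python sketch, with file names invented to mirror the layout above:)

```python
import glob
import os
import tempfile

# Recreate the hive-partitioned layout described above with empty files.
root = tempfile.mkdtemp()
for ano in (2013, 2014, 2015):
    part_dir = os.path.join(root, "parquet_data", f"ano={ano}")
    os.makedirs(part_dir)
    open(os.path.join(part_dir, "part-0.parquet"), "w").close()

# A single-level glob at the dataset root finds no files...
assert glob.glob(os.path.join(root, "parquet_data", "*.parquet")) == []

# ...because the files sit one level down, inside the ano=YYYY directories.
files = glob.glob(os.path.join(root, "parquet_data", "*", "*.parquet"))
assert len(files) == 3  # one part-0.parquet per partition
```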

This is what I tried:

select * from read_parquet('path/to/parquet_data/*.parquet')

and

select * from read_parquet('path/to/parquet_data/*/*.parquet')

Neither worked.

Is it possible to open partitioned parquet files in duckdb?

Edit: Looking at the DuckDB documentation, one should use parquet_scan for this. But I'm getting the error Cannot invoke "Object.getClass()" because "target" is null

https://duckdb.org/docs/archive/0.8.1/data/partitioning/hive_partitioning

@dvictori
Author

dvictori commented Sep 21, 2023

As a complement, I checked using the DuckDB CLI, and the following query works.

I'm using Metabase v1.46.2 and DuckDB v0.8.1

select * from read_parquet('/app/dados/dados_parquet/sicor_operacao_basica.parquet/*/*')

[screenshot: DuckDB CLI output of the query above]

@K377U

K377U commented Sep 27, 2023

I'm using DuckDB views to do this, and also setting the data type of each field instead of using star notation.

CREATE OR REPLACE VIEW {table_name} AS (
SELECT {fields} FROM read_parquet(['path/to/parquet_data/*/*.parquet'], HIVE_PARTITIONING=true)
)

With this you can add or remove data under the path and it will automatically be reflected in Metabase. Not sure how well the caching works when you change the data.
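(Editorial note: the {table_name} and {fields} placeholders above are filled in by the commenter's own tooling, which isn't shown. A minimal sketch of generating such a statement in Python; the view name and column list below are illustrative, borrowing column names from examples later in this thread:)

```python
def build_view_sql(table_name, fields, parquet_glob):
    """Render a CREATE OR REPLACE VIEW statement over a hive-partitioned
    parquet dataset, listing columns explicitly instead of using *."""
    field_list = ", ".join(fields)
    return (
        f"CREATE OR REPLACE VIEW {table_name} AS (\n"
        f"  SELECT {field_list} FROM "
        f"read_parquet(['{parquet_glob}'], HIVE_PARTITIONING=true)\n"
        f")"
    )

sql = build_view_sql(
    "sicor_operacao",                  # illustrative view name
    ["REF_BACEN", "NU_ORDEM", "ano"],  # illustrative column list
    "path/to/parquet_data/*/*.parquet",
)
print(sql)
```

The generated statement can then be executed once against the on-disk DuckDB database that Metabase loads.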

@dvictori
Author

I was having this issue when using an in-memory DuckDB database. But following what you did, I'm now creating a DuckDB database on disk, inserting a bunch of views that use read_parquet, and loading that file in Metabase. It's working just fine and I can use partitioned parquet files.

@dvictori
Author

I misread your comment. After some testing, I realized that when trying to open a partitioned parquet dataset inside Metabase, one must declare all fields contained in the files.

For instance, this works:

select REF_BACEN, NU_ORDEM from read_parquet('/dados/dados_parquet/sicor_operacao_basica.parquet/*/*')

but this does not

select * from read_parquet('/dados/dados_parquet/sicor_operacao_basica.parquet/*/*')

I'm reopening the issue because I think that select * from ... should work regardless of whether we have a single parquet file or a partitioned dataset.

In case it helps, I'm attaching the metabase log that I get when I execute the query.

metabase_partitioned_parquet_errorerr.log

Not sure if it's related, but I'm also seeing this warning when I re-scan the DuckDB fields in Metabase:

proagrodb_docker-metabase-1  | 2023-09-29 12:13:03,190 WARN query-processor.deprecated :: Atenção: driver :duckdb está usando Honey SQL 1. Este método foi descontinuado em 0.46.0 e será excluído em uma versão futura.

Sorry for the message in Portuguese. It's basically saying that the driver is using Honey SQL 1, which was deprecated in 0.46.0 and will be removed in a future release.

@dvictori dvictori reopened this Sep 29, 2023
@AlexR2D2
Owner

but this does not

select * from read_parquet('/dados/dados_parquet/sicor_operacao_basica.parquet/*/*')

Hi! Did you try this query directly in DuckDB? Does it work?

@dvictori
Author

but this does not

select * from read_parquet('/dados/dados_parquet/sicor_operacao_basica.parquet/*/*')

Hi! Did you try this query directly in DuckDB? Does it work?

Yes, it works using the DuckDB CLI. I pasted the output a couple of comments above:

#16 (comment)

@AlexR2D2
Owner

Hi, could you try this again using the latest version of the Metabase plugin?

select * from read_parquet('/Users/alex/Documents/Dev/duckdb/cars_part/**/*.parquet', hive_partitioning=true);
or
select * from read_parquet('/Users/alex/Documents/Dev/duckdb/cars_part/**/*.parquet');

[screenshots from 2023-11-30: both queries returning results in Metabase]

@dvictori
Author

Sorry for the long delay in answering. I'm installing Metabase and the DuckDB driver using the following Dockerfile. Will that give me the latest version of the Metabase plugin?

FROM openjdk:19-buster

ENV MB_PLUGINS_DIR=/home/plugins/

ADD https://downloads.metabase.com/v0.47.5/metabase.jar /home
ADD https://github.com/AlexR2D2/metabase_duckdb_driver/releases/download/0.2.3/duckdb.metabase-driver.jar /home/plugins/

RUN chmod 744 /home/plugins/duckdb.metabase-driver.jar

CMD ["java", "-jar", "/home/metabase.jar"]

@dvictori
Author

I just tested using a metabase installed as shown above and a partitioned parquet dataset. All worked fine!

The following syntaxes worked:

select * from read_parquet('/data/iris_part/**/*.parquet')
select * from read_parquet('/data/iris_part/**/*')

select * from read_parquet('/data/iris_part/*/*.parquet')
select * from read_parquet('/data/iris_part/*/*')

select * from read_parquet('/data/iris_part/**')
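(Editorial note: the difference between the `*/*` and `**` patterns above is single-level vs. recursive matching. A small standard-library Python illustration of the distinction, with invented file names; DuckDB's own globbing follows the same convention, where `**` also matches zero directory levels:)

```python
import glob
import os
import tempfile

root = tempfile.mkdtemp()
# One file directly under the root, one nested two levels deep.
os.makedirs(os.path.join(root, "species=setosa", "extra"))
open(os.path.join(root, "top.parquet"), "w").close()
open(os.path.join(root, "species=setosa", "extra", "deep.parquet"), "w").close()

# '*/*.parquet' matches only files exactly one directory level down:
# neither the root-level file nor the doubly nested one qualifies.
one_level = glob.glob(os.path.join(root, "*", "*.parquet"))
assert one_level == []

# '**/*.parquet' with recursive=True matches at any depth,
# including zero levels (the root itself), so it finds both files.
recursive = glob.glob(os.path.join(root, "**", "*.parquet"), recursive=True)
assert len(recursive) == 2
```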
