This repository was archived by the owner on Apr 11, 2024. It is now read-only.

Open partitioned parquet files #16

Closed
dvictori opened this issue Sep 21, 2023 · 9 comments

@dvictori

dvictori commented Sep 21, 2023

I'm trying out Metabase with DuckDB for processing and viewing large parquet files. I'm running Metabase from a Docker container following the instructions in this repo's README.

I was able to query several single parquet files using

select * from 'path/to/parquet_single_data/*.parquet'

However, I can't seem to figure out how to do the same for a parquet dataset that was created with partitions. My dataset is partitioned by year like so:

  • parquet_data/ano=2013/part-0.parquet
  • parquet_data/ano=2014/part-0.parquet
  • parquet_data/ano=2015/part-0.parquet
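(Editorial note: in a hive-partitioned layout like the one above, the data files live one directory level below the dataset root, which is why a single-level glob at the root matches nothing. A minimal standard-library Python sketch, with file names invented to mirror the layout above:)

```python
import glob
import os
import tempfile

# Recreate the hive-partitioned layout described above with empty files.
root = tempfile.mkdtemp()
for ano in (2013, 2014, 2015):
    part_dir = os.path.join(root, "parquet_data", f"ano={ano}")
    os.makedirs(part_dir)
    open(os.path.join(part_dir, "part-0.parquet"), "w").close()

# A single-level glob at the dataset root finds no files...
assert glob.glob(os.path.join(root, "parquet_data", "*.parquet")) == []

# ...because the files sit one level down, inside the ano=YYYY directories.
files = glob.glob(os.path.join(root, "parquet_data", "*", "*.parquet"))
assert len(files) == 3  # one part-0.parquet per partition
```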

This is what I tried:

select * from read_parquet('path/to/parquet_data/*.parquet')

and

select * from read_parquet('path/to/parquet_data/*/*.parquet')

Neither worked.

Is it possible to open partitioned parquet files in duckdb?

Edit: Looking at the DuckDB documentation, one should use parquet_scan for this. But I'm getting the error Cannot invoke "Object.getClass()" because "target" is null

https://duckdb.org/docs/archive/0.8.1/data/partitioning/hive_partitioning

@dvictori
Author

dvictori commented Sep 21, 2023

As a complement, I checked using the DuckDB CLI, and the following query works.

I'm using Metabase v1.46.2 and DuckDB v0.8.1

select * from read_parquet('/app/dados/dados_parquet/sicor_operacao_basica.parquet/*/*')

[screenshot: DuckDB CLI output of the query above]

@K377U

K377U commented Sep 27, 2023

I'm using DuckDB views to do this, and also setting the data type of each field instead of using star notation.

CREATE OR REPLACE VIEW {table_name} AS (
SELECT {fields} FROM read_parquet(['path/to/parquet_data/*/*.parquet'], HIVE_PARTITIONING=true)
)

With this you can add or remove data under the path and it will automatically be reflected in Metabase. Not sure how well the caching works when you change the data.
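(Editorial note: the {table_name} and {fields} placeholders above are filled in by the commenter's own tooling, which isn't shown. A minimal sketch of generating such a statement in Python; the view name and column list below are illustrative, borrowing column names from examples later in this thread:)

```python
def build_view_sql(table_name, fields, parquet_glob):
    """Render a CREATE OR REPLACE VIEW statement over a hive-partitioned
    parquet dataset, listing columns explicitly instead of using *."""
    field_list = ", ".join(fields)
    return (
        f"CREATE OR REPLACE VIEW {table_name} AS (\n"
        f"  SELECT {field_list} FROM "
        f"read_parquet(['{parquet_glob}'], HIVE_PARTITIONING=true)\n"
        f")"
    )

sql = build_view_sql(
    "sicor_operacao",                  # illustrative view name
    ["REF_BACEN", "NU_ORDEM", "ano"],  # illustrative column list
    "path/to/parquet_data/*/*.parquet",
)
print(sql)
```

The generated statement can then be executed once against the on-disk DuckDB database that Metabase loads.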

@dvictori
Author

I was having this issue when using an in-memory DuckDB database. But following what you did, I'm now creating a DuckDB database on disk, inserting a bunch of views that use read_parquet, and loading that file in Metabase. It's working just fine and I can use partitioned parquet files.

@dvictori
Author

I misread your comment. After some testing, I realized that when trying to open a partitioned parquet dataset inside Metabase, one must declare all fields contained in the files.

For instance, this works:

select REF_BACEN, NU_ORDEM from read_parquet('/dados/dados_parquet/sicor_operacao_basica.parquet/*/*')

but this does not

select * from read_parquet('/dados/dados_parquet/sicor_operacao_basica.parquet/*/*')

I'm reopening the issue because I think that select * from ... should work regardless of whether we have a single parquet file or a partitioned dataset.

In case it helps, I'm attaching the metabase log that I get when I execute the query.

metabase_partitioned_parquet_errorerr.log

Not sure if it's related, but I'm also seeing this warning when I re-scan the DuckDB fields in Metabase:

proagrodb_docker-metabase-1  | 2023-09-29 12:13:03,190 WARN query-processor.deprecated :: Atenção: driver :duckdb está usando Honey SQL 1. Este método foi descontinuado em 0.46.0 e será excluído em uma versão futura.

Sorry for the message in Portuguese. It's basically saying that the driver is using Honey SQL 1, which was deprecated in 0.46.0 and will be removed in a future release.

@dvictori dvictori reopened this Sep 29, 2023
@AlexR2D2
Owner

but this does not

select * from read_parquet('/dados/dados_parquet/sicor_operacao_basica.parquet/*/*')

Hi! Did you try this query directly in DuckDB? Does it work?

@dvictori
Author

but this does not

select * from read_parquet('/dados/dados_parquet/sicor_operacao_basica.parquet/*/*')

Hi! Did you try this query directly in DuckDB? Does it work?

Yes, it works using the DuckDB CLI. I pasted the output a couple of comments above:

#16 (comment)

@AlexR2D2
Owner

Hi, could you try this again using the latest version of the Metabase plugin?

select * from read_parquet('/Users/alex/Documents/Dev/duckdb/cars_part/**/*.parquet', hive_partitioning=true);
or
select * from read_parquet('/Users/alex/Documents/Dev/duckdb/cars_part/**/*.parquet');

[screenshots from 2023-11-30: both queries returning results in Metabase]

@dvictori
Author

Sorry for the long delay in answering. I'm installing Metabase and the DuckDB driver using the following Dockerfile. Will that give me the latest version of the Metabase plugin?

FROM openjdk:19-buster

ENV MB_PLUGINS_DIR=/home/plugins/

ADD https://downloads.metabase.com/v0.47.5/metabase.jar /home
ADD https://github.com/AlexR2D2/metabase_duckdb_driver/releases/download/0.2.3/duckdb.metabase-driver.jar /home/plugins/

RUN chmod 744 /home/plugins/duckdb.metabase-driver.jar

CMD ["java", "-jar", "/home/metabase.jar"]

@dvictori
Author

I just tested using a metabase installed as shown above and a partitioned parquet dataset. All worked fine!

The following syntaxes worked:

select * from read_parquet('/data/iris_part/**/*.parquet')
select * from read_parquet('/data/iris_part/**/*')

select * from read_parquet('/data/iris_part/*/*.parquet')
select * from read_parquet('/data/iris_part/*/*')

select * from read_parquet('/data/iris_part/**')
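(Editorial note: the difference between the `*/*` and `**` patterns above is single-level vs. recursive matching. A small standard-library Python illustration of the distinction, with invented file names; DuckDB's own globbing follows the same convention, where `**` also matches zero directory levels:)

```python
import glob
import os
import tempfile

root = tempfile.mkdtemp()
# One file directly under the root, one nested two levels deep.
os.makedirs(os.path.join(root, "species=setosa", "extra"))
open(os.path.join(root, "top.parquet"), "w").close()
open(os.path.join(root, "species=setosa", "extra", "deep.parquet"), "w").close()

# '*/*.parquet' matches only files exactly one directory level down:
# neither the root-level file nor the doubly nested one qualifies.
one_level = glob.glob(os.path.join(root, "*", "*.parquet"))
assert one_level == []

# '**/*.parquet' with recursive=True matches at any depth,
# including zero levels (the root itself), so it finds both files.
recursive = glob.glob(os.path.join(root, "**", "*.parquet"), recursive=True)
assert len(recursive) == 2
```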
