Pydala is a high-performance Python library for managing Parquet datasets with powerful metadata capabilities. Built on Apache Arrow, it provides an efficient, user-friendly interface for handling large-scale data operations.
- 📦 Smart Dataset Management: Efficient Parquet handling with metadata optimization
- 🔄 Robust Caching: Built-in support for faster data access
- 🔌 Seamless Integration: Works with Polars, PyArrow, and DuckDB
- 🔍 Advanced Querying: SQL-like filtering with predicate pushdown
- 🛠️ Schema Management: Automatic validation and tracking
```bash
pip install pydala2
```
```python
from pydala.dataset import ParquetDataset

dataset = ParquetDataset(
    path="path/to/dataset",
    partitioning="hive",           # Hive-style partitioning
    timestamp_column="timestamp",  # For time-based operations
    cached=True,                   # Enable performance caching
)
```
```python
import polars as pl
from datetime import date

# Create sample time-series data (1,000 daily rows)
df = pl.DataFrame({
    "timestamp": pl.date_range(date(2023, 1, 1), date(2025, 9, 26), "1d", eager=True),
    "value": range(1000),
})

# Write with smart partitioning and compression
dataset.write_to_dataset(
    data=df,                         # Polars/pandas DataFrame, Arrow Table/Dataset/RecordBatch, or DuckDB result
    mode="overwrite",                # Options: "overwrite", "append", "delta"
    row_group_size=250_000,          # Optimize chunk size
    compression="zstd",              # High-performance compression
    partition_by=["year", "month"],  # Auto-partition by time
    unique=True,                     # Ensure data uniqueness
)
```
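For incremental loads, the same call can be reused with `mode="delta"`, which is intended to write only rows not already present in the dataset. A minimal sketch continuing from the example above (`new_df` is a hypothetical incoming batch with the same schema):

```python
# Hypothetical incoming batch with the same schema as `df`
new_df = pl.DataFrame({
    "timestamp": pl.date_range(date(2025, 9, 27), date(2025, 10, 6), "1d", eager=True),
    "value": range(1000, 1010),
})

dataset.write_to_dataset(
    data=new_df,
    mode="delta",  # Skip rows already present in the dataset
)
```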
```python
# Load the dataset and pick up metadata for any new files
dataset.load(update_metadata=True)
```
```python
# Flexible data format conversion
pt = dataset.t              # PyDala Table
df_polars = pt.to_polars()  # Convert to Polars
df_pandas = pt.to_pandas()  # Convert to pandas
df_arrow = pt.to_arrow()    # Convert to Arrow
rel_ddb = pt.to_ddb()       # Convert to a DuckDB relation
# ...and many more
```
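Because `to_ddb()` returns a regular DuckDB relation, you can continue with DuckDB's relational API. A small sketch, assuming the `value` column from the earlier example:

```python
# Filter and aggregate via DuckDB, then materialize as pandas
result = (
    rel_ddb
    .filter("value > 100")
    .aggregate("count(*) AS n")
    .df()
)
```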
```python
# Efficient filtered reads with predicate pushdown
pt_filtered = dataset.filter("timestamp > '2023-01-01'")

# Chaining operations
df_filtered = (
    dataset
    .filter("column_name > 100")
    .pl.with_columns(
        pl.col("column_name").str.slice(0, 5).alias("new_column_name")
    )
    .to_pandas()
)
```
```python
# Fast metadata-only scans
pt_scanned = dataset.scan("column_name > 100")

# Access matching files
matching_files = dataset.scan_files
```
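If you need to process the matched files yourself, they can be handed to PyArrow directly. A sketch, assuming `scan_files` yields Parquet paths resolvable on your filesystem:

```python
import pyarrow.dataset as pads

# Build an Arrow dataset over only the matched Parquet files
subset = pads.dataset(matching_files, format="parquet").to_table()
```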
```python
# Incremental metadata update (for new files)
dataset.load(update_metadata=True)

# Full metadata refresh (reload all metadata)
dataset.load(reload_metadata=True)

# Repair schema/metadata
dataset.repair_schema()

# Automatic storage type optimization
dataset.opt_dtypes()
```
```python
# Smart file management
dataset.compact_by_rows(max_rows=100_000)           # Combine small files
dataset.repartition(partitioning_columns=["date"])  # Optimize partitions
dataset.compact_by_timeperiod(interval="1d")        # Time-based optimization
dataset.compact_partitions()                        # Partition structure optimization
```
- Type optimization involves a full dataset rewrite.
- Choose a compaction strategy based on your access patterns.
- Regular metadata updates ensure optimal query performance.
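Putting these together, a periodic housekeeping pass might look like the following sketch, built only from the calls shown above (`maintain` is a hypothetical helper, and the row threshold is a placeholder to tune for your workload):

```python
# Hypothetical maintenance routine composed of the calls shown above
def maintain(dataset):
    dataset.compact_by_rows(max_rows=100_000)  # Merge small files
    dataset.opt_dtypes()                       # Shrink storage types (full rewrite)
    dataset.load(update_metadata=True)         # Refresh metadata for the rewritten files
```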
For advanced usage and complete API documentation, visit our docs.
Contributions welcome! See our contribution guidelines.