Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet: Implement Variant readers #12139

Merged
merged 12 commits into from
Feb 18, 2025

Conversation

rdblue
Copy link
Contributor

@rdblue rdblue commented Jan 31, 2025

This adds Parquet readers for Variant data, including for shredded fields.

The readers are built using a new Variant visitor that is plugged into the existing Parquet type visitor by returning a Variant visitor class. When a Variant visitor is present, it is called to produce a result to pass into the Parquet variant visitor method. This PR adds a Variant visitor that produces Parquet readers that handle metadata, serialized values, and shredded values.

Tests are in TestVariantReaders.

This does not support Variant arrays. I intend to follow up with support after this is reviewed.

@@ -47,7 +47,7 @@ public Set<Integer> struct(Types.StructType struct, List<Set<Integer>> fieldResu

@Override
public Set<Integer> field(Types.NestedField field, Set<Integer> fieldResult) {
if ((includeStructIds && field.type().isStructType()) || field.type().isPrimitiveType()) {
if ((includeStructIds && field.type().isStructType()) || field.type().isPrimitiveType() || field.type() instanceof Types.VariantType) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was needed for reading Parquet files without dropping variant columns.

@@ -209,59 +213,59 @@ public static VariantPrimitive<Void> ofNull() {
return new PrimitiveWrapper<>(PhysicalType.NULL, null);
}

static VariantPrimitive<Boolean> of(boolean value) {
public static VariantPrimitive<Boolean> of(boolean value) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed to expose these to create test values in the Parquet package.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be fine to expose it here, but it we could also add some builders in the test module if we want to not have as much exposure. So TestParquet depends on TestCore and core has public builders for Variants.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we're going to want these in order to be able to construct variants anyway, so I'm inclined to leave them as public, at least in core.

* @param name a String name
* @return true if the group contains a field with the given name
*/
public static boolean hasField(GroupType group, String name) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New helper methods for working with Parquet fields.

* }
* </pre>
*/
public R object(GroupType object, R valueResult, List<R> fieldResults) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value, object, and array visitor methods are all used for a value / typed_value pair, depending on the structure of typed value (a shredded value, shredded array, or shredded object).

if (field.isPrimitive()) {
return false;
} else if (expected.type() == org.apache.iceberg.types.Types.VariantType.get()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes were needed to project the sub-fields of a variant because it looks like a struct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's another case where I feel like the Type:isVariantType would make things more clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll update this when the other PR from @aihuaxu is in.

@@ -117,6 +120,14 @@ public GroupType map(MapType map, Type.Repetition repetition, int id, String nam
.named(AvroSchemaUtil.makeCompatibleName(name));
}

public Type variant(Type.Repetition repetition, int id, String originalName) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implements converting an Iceberg variant schema into a Parquet schema, without shredding.

@@ -54,101 +57,108 @@ public static <T> T visit(
} else {
// if not a primitive, the typeId must be a group
GroupType group = type.asGroupType();
OriginalType annotation = group.getOriginalType();
Copy link
Contributor Author

@rdblue rdblue Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing code was refactored into methods for each case. Handling for LIST and MAP and struct fields did not change.

@@ -217,11 +238,19 @@ public T map(Types.MapType iMap, GroupType map, T key, T value) {
return null;
}

public T variant(Types.VariantType iVariant, T result) {
throw new UnsupportedOperationException("Not implemented for variant");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thrown to force implementations to update in order to handle variants.

@amogh-jahagirdar amogh-jahagirdar self-requested a review January 31, 2025 04:53
@rdblue rdblue force-pushed the variant-parquet-implementation branch from 1962051 to 215e4fd Compare February 5, 2025 23:05
Types.VariantType variant, GroupType group, TypeWithSchemaVisitor<T> visitor) {
ParquetVariantVisitor<T> variantVisitor = visitor.variantVisitor();
if (variantVisitor != null) {
T variantResult = ParquetVariantVisitor.visit(group, variantVisitor);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than creating a static visit() function, can we use variantVisitor.visit(group)?

With that, we have many functions in ParquetVariantVisitor.java to be class instance method rather than static method. Seems that will make the code there simpler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The project uses visitors to separate the logic of traversing a type from the logic of what to do with parts of the structure. While you could put the logic in visit into an instance method of the visitor (or other places) I think it is important to have that traversal logic in only one place across visitors with the same pattern (i.e. TypeWithSchemaVisitor). If we allow customization of the visit method, I think we would soon defeat the purpose of the visitors.

}

@Test
public void testValueAndTypedValueConflict() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOException is not thrown here.

@rdblue rdblue force-pushed the variant-parquet-implementation branch from db51974 to 125e715 Compare February 18, 2025 17:43
@github-actions github-actions bot removed the API label Feb 18, 2025
@rdblue rdblue force-pushed the variant-parquet-implementation branch from 52233ec to d752b89 Compare February 18, 2025 21:16
this.fieldReaders = fieldReaders.toArray(VariantValueReader[]::new);
this.fieldColumn = this.fieldReaders[0].column();
this.valueColumn = valueReader != null ? valueReader.column() : fieldColumn;
this.children = children(Iterables.concat(Arrays.asList(valueReader), fieldReaders));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Collections.singletonList is preferable to Arrays.asList()

Copy link
Contributor

@danielcweeks danielcweeks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Overall looks good. The one thing I don't love is that we don't have a good path for reuse, which makes me concerned that variant will be very noisy with object creation. I think it might be worth considering looking into a different reader generic wrapper like VariantExtract that can both be a container for the metadata with reuse and a placeholder for the value. It's not perfectly inline with the read contract for other types, but I still think it would be inline with the intent.

@rdblue rdblue merged commit 3c8f369 into apache:main Feb 18, 2025
46 checks passed
@rdblue
Copy link
Contributor Author

rdblue commented Feb 18, 2025

Thanks for reviewing, @aihuaxu, @danielcweeks, @RussellSpitzer, and @amogh-jahagirdar!

ankurbansal-tradedoubler added a commit to ankurbansal-tradedoubler/iceberg that referenced this pull request Feb 19, 2025
* Site: Learn More to point to Spark QuickStart Doc (apache#12272)

* Build: Bump datamodel-code-generator from 0.27.2 to 0.28.1 (apache#12290)

* Spark 3.5: Fix job description of RewriteTablePathSparkAction (apache#12282)

* Build: Bump io.netty:netty-buffer from 4.1.117.Final to 4.1.118.Final (apache#12287)

Bumps [io.netty:netty-buffer](https://github.com/netty/netty) from 4.1.117.Final to 4.1.118.Final.
- [Commits](netty/netty@netty-4.1.117.Final...netty-4.1.118.Final)

---
updated-dependencies:
- dependency-name: io.netty:netty-buffer
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Build: Bump software.amazon.awssdk:bom from 2.30.16 to 2.30.21 (apache#12286)

Bumps software.amazon.awssdk:bom from 2.30.16 to 2.30.21.

---
updated-dependencies:
- dependency-name: software.amazon.awssdk:bom
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* OpenAPI: Add overwrite option when registering a table (apache#12239)

* OpenAPI: Add optional overwrite when registering table

* simplify to overwrite

* Add the article to the description

Co-authored-by: Eduard Tudenhoefner <[email protected]>

* Update generated python as well

Signed-off-by: Hongyue Zhang <[email protected]>

* Fix import order

---------

Signed-off-by: Hongyue Zhang <[email protected]>
Co-authored-by: Eduard Tudenhoefner <[email protected]>

* Build: Bump mkdocs-material from 9.6.3 to 9.6.4 (apache#12284)

Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.6.3 to 9.6.4.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](squidfunk/mkdocs-material@9.6.3...9.6.4)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Core: Fix Enabling row-lineage during Create Table (apache#12307)

* API: Reject unknown type for required fields and validate defaults (apache#12302)

* API: Fix TestInclusiveMetricsEvaluator notStartsWith tests. (apache#12303)

* Core: Add variant type support to utils and visitors (apache#11831)

* Core: Fix CI: Update tests with UnknownType from required to optional (apache#12316)

* Docs: Refactor site navigation bar (apache#12289)

* Parquet: Implement Variant readers (apache#12139)

* Docs: Add rewrite_table_path Spark Procedure (apache#12115)

* Parquet: Fix errorprone warning (apache#12324)

* Docs: Add Apache Amoro docs (apache#11966)

* Parquet: Fix performance regression in reader init (apache#12305)

* Core: Fallback to GET requests for namespace/table/view exists checks (apache#12314)

Co-authored-by: Daniel Weeks <[email protected]>

* Docs: Fix refs in Apache Amoro docs (apache#12332)

* Revert "Core: Serialize `null` when there is no current snapshot (apache#11560)" (apache#12312)

This reverts commit bf8d25f.

* Parquet: Fix performance regression in reader init (apache#12305) (apache#12329)

Co-authored-by: Bryan Keller <[email protected]>

* Checkstyle: Apply the same generic type naming rules to interfaces and classes (apache#12333)

---------

Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Hongyue Zhang <[email protected]>
Co-authored-by: Danica Fine <[email protected]>
Co-authored-by: Manu Zhang <[email protected]>
Co-authored-by: Yuya Ebihara <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Hongyue/Steve Zhang <[email protected]>
Co-authored-by: Eduard Tudenhoefner <[email protected]>
Co-authored-by: Tom Tanaka <[email protected]>
Co-authored-by: Ryan Blue <[email protected]>
Co-authored-by: Aihua Xu <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: ConradJam <[email protected]>
Co-authored-by: Bryan Keller <[email protected]>
Co-authored-by: Daniel Weeks <[email protected]>
Co-authored-by: pvary <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants