-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parquet: Implement Variant readers #12139
Conversation
@@ -47,7 +47,7 @@ public Set<Integer> struct(Types.StructType struct, List<Set<Integer>> fieldResu | |||
|
|||
@Override | |||
public Set<Integer> field(Types.NestedField field, Set<Integer> fieldResult) { | |||
if ((includeStructIds && field.type().isStructType()) || field.type().isPrimitiveType()) { | |||
if ((includeStructIds && field.type().isStructType()) || field.type().isPrimitiveType() || field.type() instanceof Types.VariantType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was needed for reading Parquet files without dropping variant columns.
@@ -209,59 +213,59 @@ public static VariantPrimitive<Void> ofNull() { | |||
return new PrimitiveWrapper<>(PhysicalType.NULL, null); | |||
} | |||
|
|||
static VariantPrimitive<Boolean> of(boolean value) { | |||
public static VariantPrimitive<Boolean> of(boolean value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed to expose these to create test values in the Parquet package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be fine to expose it here, but it we could also add some builders in the test module if we want to not have as much exposure. So TestParquet depends on TestCore and core has public builders for Variants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we're going to want these in order to be able to construct variants anyway, so I'm inclined to leave them as public, at least in core.
* @param name a String name | ||
* @return true if the group contains a field with the given name | ||
*/ | ||
public static boolean hasField(GroupType group, String name) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New helper methods for working with Parquet fields.
* } | ||
* </pre> | ||
*/ | ||
public R object(GroupType object, R valueResult, List<R> fieldResults) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value
, object
, and array
visitor methods are all used for a value
/ typed_value
pair, depending on the structure of typed value (a shredded value, shredded array, or shredded object).
if (field.isPrimitive()) { | ||
return false; | ||
} else if (expected.type() == org.apache.iceberg.types.Types.VariantType.get()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes were needed to project the sub-fields of a variant because it looks like a struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's another case where I feel like the Type:isVariantType
would make things more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'll update this when the other PR from @aihuaxu is in.
@@ -117,6 +120,14 @@ public GroupType map(MapType map, Type.Repetition repetition, int id, String nam | |||
.named(AvroSchemaUtil.makeCompatibleName(name)); | |||
} | |||
|
|||
public Type variant(Type.Repetition repetition, int id, String originalName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implements converting an Iceberg variant schema into a Parquet schema, without shredding.
@@ -54,101 +57,108 @@ public static <T> T visit( | |||
} else { | |||
// if not a primitive, the typeId must be a group | |||
GroupType group = type.asGroupType(); | |||
OriginalType annotation = group.getOriginalType(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing code was refactored into methods for each case. Handling for LIST and MAP and struct fields did not change.
@@ -217,11 +238,19 @@ public T map(Types.MapType iMap, GroupType map, T key, T value) { | |||
return null; | |||
} | |||
|
|||
public T variant(Types.VariantType iVariant, T result) { | |||
throw new UnsupportedOperationException("Not implemented for variant"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thrown to force implementations to update in order to handle variants.
api/src/main/java/org/apache/iceberg/types/GetProjectedIds.java
Outdated
Show resolved
Hide resolved
parquet/src/main/java/org/apache/iceberg/parquet/ParquetVariantReaders.java
Show resolved
Hide resolved
parquet/src/main/java/org/apache/iceberg/parquet/ParquetVariantReaders.java
Outdated
Show resolved
Hide resolved
parquet/src/main/java/org/apache/iceberg/parquet/ParquetVariantReaders.java
Show resolved
Hide resolved
parquet/src/main/java/org/apache/iceberg/parquet/ParquetVariantReaders.java
Show resolved
Hide resolved
parquet/src/main/java/org/apache/iceberg/parquet/TypeToMessageType.java
Outdated
Show resolved
Hide resolved
1962051
to
215e4fd
Compare
Types.VariantType variant, GroupType group, TypeWithSchemaVisitor<T> visitor) { | ||
ParquetVariantVisitor<T> variantVisitor = visitor.variantVisitor(); | ||
if (variantVisitor != null) { | ||
T variantResult = ParquetVariantVisitor.visit(group, variantVisitor); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than creating a static visit() function, can we use variantVisitor.visit(group)
?
With that, we have many functions in ParquetVariantVisitor.java to be class instance method rather than static method. Seems that will make the code there simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The project uses visitors to separate the logic of traversing a type from the logic of what to do with parts of the structure. While you could put the logic in visit
into an instance method of the visitor (or other places) I think it is important to have that traversal logic in only one place across visitors with the same pattern (i.e. TypeWithSchemaVisitor
). If we allow customization of the visit
method, I think we would soon defeat the purpose of the visitors.
parquet/src/main/java/org/apache/iceberg/parquet/ParquetVariantVisitor.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
@Test | ||
public void testValueAndTypedValueConflict() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IOException is not thrown here.
parquet/src/test/java/org/apache/iceberg/parquet/TestVariantReaders.java
Show resolved
Hide resolved
db51974
to
125e715
Compare
52233ec
to
d752b89
Compare
this.fieldReaders = fieldReaders.toArray(VariantValueReader[]::new); | ||
this.fieldColumn = this.fieldReaders[0].column(); | ||
this.valueColumn = valueReader != null ? valueReader.column() : fieldColumn; | ||
this.children = children(Iterables.concat(Arrays.asList(valueReader), fieldReaders)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Collections.singletonList
is preferable to Arrays.asList()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Overall looks good. The one thing I don't love is that we don't have a good path for reuse, which makes me concerned that variant will be very noisy with object creation. I think it might be worth considering looking into a different reader generic wrapper like VariantExtract
that can both be a container for the metadata with reuse and a placeholder for the value. It's not perfectly inline with the read contract for other types, but I still think it would be inline with the intent.
Thanks for reviewing, @aihuaxu, @danielcweeks, @RussellSpitzer, and @amogh-jahagirdar! |
* Site: Learn More to point to Spark QuickStart Doc (apache#12272) * Build: Bump datamodel-code-generator from 0.27.2 to 0.28.1 (apache#12290) * Spark 3.5: Fix job description of RewriteTablePathSparkAction (apache#12282) * Build: Bump io.netty:netty-buffer from 4.1.117.Final to 4.1.118.Final (apache#12287) Bumps [io.netty:netty-buffer](https://github.com/netty/netty) from 4.1.117.Final to 4.1.118.Final. - [Commits](netty/netty@netty-4.1.117.Final...netty-4.1.118.Final) --- updated-dependencies: - dependency-name: io.netty:netty-buffer dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Build: Bump software.amazon.awssdk:bom from 2.30.16 to 2.30.21 (apache#12286) Bumps software.amazon.awssdk:bom from 2.30.16 to 2.30.21. --- updated-dependencies: - dependency-name: software.amazon.awssdk:bom dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * OpenAPI: Add overwrite option when registering a table (apache#12239) * OpenAPI: Add optional overwrite when registering table * simplify to overwrite * Add the article to the description Co-authored-by: Eduard Tudenhoefner <[email protected]> * Update generated python as well Signed-off-by: Hongyue Zhang <[email protected]> * Fix import order --------- Signed-off-by: Hongyue Zhang <[email protected]> Co-authored-by: Eduard Tudenhoefner <[email protected]> * Build: Bump mkdocs-material from 9.6.3 to 9.6.4 (apache#12284) Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.6.3 to 9.6.4. - [Release notes](https://github.com/squidfunk/mkdocs-material/releases) - [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG) - [Commits](squidfunk/mkdocs-material@9.6.3...9.6.4) --- updated-dependencies: - dependency-name: mkdocs-material dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Core: Fix Enabling row-lineage during Create Table (apache#12307) * API: Reject unknown type for required fields and validate defaults (apache#12302) * API: Fix TestInclusiveMetricsEvaluator notStartsWith tests. (apache#12303) * Core: Add variant type support to utils and visitors (apache#11831) * Core: Fix CI: Update tests with UnknownType from required to optional (apache#12316) * Docs: Refactor site navigation bar (apache#12289) * Parquet: Implement Variant readers (apache#12139) * Docs: Add rewrite_table_path Spark Procedure (apache#12115) * Parquet: Fix errorprone warning (apache#12324) * Docs: Add Apache Amoro docs (apache#11966) * Parquet: Fix performance regression in reader init (apache#12305) * Core: Fallback to GET requests for namespace/table/view exists checks (apache#12314) Co-authored-by: Daniel Weeks <[email protected]> * Docs: Fix refs in Apache Amoro docs (apache#12332) * Revert "Core: Serialize `null` when there is no current snapshot (apache#11560)" (apache#12312) This reverts commit bf8d25f. * Parquet: Fix performance regression in reader init (apache#12305) (apache#12329) Co-authored-by: Bryan Keller <[email protected]> * Checkstyle: Apply the same generic type naming rules to interfaces and classes (apache#12333) --------- Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Hongyue Zhang <[email protected]> Co-authored-by: Danica Fine <[email protected]> Co-authored-by: Manu Zhang <[email protected]> Co-authored-by: Yuya Ebihara <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Hongyue/Steve Zhang <[email protected]> Co-authored-by: Eduard Tudenhoefner <[email protected]> Co-authored-by: Tom Tanaka <[email protected]> Co-authored-by: Ryan Blue <[email protected]> Co-authored-by: Aihua Xu <[email protected]> Co-authored-by: Fokko Driesprong <[email protected]> Co-authored-by: ConradJam <[email protected]> Co-authored-by: Bryan Keller <[email protected]> Co-authored-by: Daniel Weeks <[email protected]> Co-authored-by: pvary <[email protected]>
This adds Parquet readers for Variant data, including for shredded fields.
The readers are built using a new Variant visitor that is plugged into the existing Parquet type visitor by returning a Variant visitor class. When a Variant visitor is present, it is called to produce a result to pass into the Parquet
variant
visitor method. This PR adds a Variant visitor that produces Parquet readers that handle metadata, serialized values, and shredded values.Tests are in
TestVariantReaders
.This does not support Variant arrays. I intend to follow up with support after this is reviewed.