-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] BIT/VARBIT type conversion between pg and duckdb #628
base: main
Are you sure you want to change the base?
Conversation
src/pgduckdb_types.cpp
Outdated
Datum pg_varbit = DirectFunctionCall3(varbit_in, CStringGetDatum(value_str.c_str()), ObjectIdGetDatum(VARBITOID), | ||
Int32GetDatum(-1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar feedback as for time/timetz PR. I think it should be fairly easy to convert the internal representations directly to eachother, without having to go through string formatting/parsing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In duckdb BIT
type is stored as VARCHAR
physically.
https://github.com/duckdb/duckdb/blob/057f9753cbd3eb6d07543c2f1f1085de961fffc4/src/common/types.cpp#L120-L122
Even if we want to the parsing ourselves (from char* and size_t), we need to have a way to get pointer and its length.
- We're able to get the pointer via
GetPointer
: https://github.com/duckdb/duckdb/blob/057f9753cbd3eb6d07543c2f1f1085de961fffc4/src/include/duckdb/common/types/value.hpp#L227 - But the length information is stored inside of extra value info: https://github.com/duckdb/duckdb/blob/057f9753cbd3eb6d07543c2f1f1085de961fffc4/src/include/duckdb/common/types/value.hpp#L341, the actual derived type is
StringValueInfo
, which lives inside of an anonymous namespace: https://github.com/duckdb/duckdb/blob/057f9753cbd3eb6d07543c2f1f1085de961fffc4/src/common/types/value.cpp#L74-L95, so we cannot get a string reference (which could save a copy)
So IMO, convert the Value
into a string is a simple and reasonable implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay looking closer at the way that the bit types are stored internally I now agree that going through the string representation is probably the best. I thought you could do something like this:
Do the same as ConvertBinaryDatum
to get the raw value, except that you need to use some functions from duckdb its src/common/types/bit.cpp
. I think you can do something very similar to what BitToBlob does: https://github.com/duckdb/duckdb/blob/057f9753cbd3eb6d07543c2f1f1085de961fffc4/src/common/types/bit.cpp#L175-L190
duckdb::string_t str = value.GetValueUnsafe<duckdb::string_t>();
// get padding
// copy raw bytes
Afaict that would work fine, were it not for the fact that the padding bit in the PG representation are on the end of the bitstring, and not on the beginning like in DuckDB. So to copy you couldn't use a plain memcpy
, because you'd need to copy the actual data with bit offset (that's not divisible by 8). That definitely seems too much complexity for this rather exotic data type. So yeah, let's go through the string representation instead.
Let's move this string conversion logic to pg/types.cpp
though just like you did for VARINT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank for explaining your thoughts! That's very helpful!
My initial thought is memcpy as well, I was just thinking if it's possible to avoid handling padding and byte order myself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this code looks good. But it still needs conversion the other way around, i.e. from PG to DuckDB (just like you did for the time
type).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I thought about it, for postgres write-side the entrypoint type if BITOID / VARBITOID, when duckdb takes on the read side the type is BIT, so there's no 1-1 mapping between pg/duckdb types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mapping, doesn't have to be 1-1. As long as each type converts into something sensible in the other typesystem it's fine. For example, if you have a PG table that stores a BITOID, and we then we convert that to a DuckDB BIT, and then back to PG VARBITOID, yes you've lost a bit of type information. But that's still a lot better than not being able to read a BITOID columns at all. It's a tradeoff, and so far we're trading nice interoperability for strictly correct 1-1 mappings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same feedback as for the time/timetz PR. This should also implement conversion from PG types to DuckDB types (both for bit
and bit varying
)
317d1c4
to
e9b2ba6
Compare
@@ -6,7 +6,6 @@ SELECT * exclude( | |||
tinyint, -- PG14 outputs this differently currently | |||
TIME, | |||
time_tz, | |||
bit, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs/types.md
file should also be updated that these types are now supported.
Resolves #597