Conversion of native str/String from/to HDF5 string types #108
Comments
Test file:
|
Prototype:
use std::alloc;
use std::error::Error;
use std::mem;
use std::slice;
use std::str;

use libc::{c_void, size_t};
use ndarray::Array1;

use hdf5_sys::h5::herr_t;
use hdf5_sys::h5d::H5Dread;
use hdf5_sys::h5i::hid_t;
use hdf5_sys::*;

const STRING_SIZE: usize = mem::size_of::<String>();

extern "C" fn conv_func(
    src_id: hid_t, dst_id: hid_t, cdata: *mut h5t::H5T_cdata_t, nelmts: size_t, buf_stride: size_t,
    _bkg_stride: size_t, buf: *mut c_void, _bkg: *mut c_void, _dset_xfer_plist: hid_t,
) -> herr_t {
    // TODO: the accepted function pointer should be unsafe by default
    unsafe {
        // check examples in H5Tconv.c, e.g. H5T__conv_s_s
        match (*cdata).command {
            h5t::H5T_CONV_INIT => {
                // initialization, checks, etc - ignore for now
                (*cdata).need_bkg = h5t::H5T_BKG_NO;
            }
            h5t::H5T_CONV_FREE => {
                // nothing to do here
            }
            h5t::H5T_CONV_CONV => {
                let buf = buf as *mut u8;
                let nullterm = match h5t::H5Tget_strpad(src_id) {
                    h5t::H5T_STR_NULLTERM | h5t::H5T_STR_NULLPAD => true,
                    h5t::H5T_STR_SPACEPAD => false,
                    _ => panic!("unsupported"),
                };
                let src_size = h5t::H5Tget_size(src_id) as usize;
                let dst_size = h5t::H5Tget_size(dst_id) as usize;
                // if the destination is larger than the source, walk the buffer backwards
                // so that writes never clobber source elements that are not yet converted
                let (dir, mut src_buf, mut dst_buf) = if src_size >= dst_size {
                    (1, buf, buf)
                } else {
                    let k = nelmts - 1;
                    (-1, buf.offset((k * src_size) as _), buf.offset((k * dst_size) as _))
                };
                let src_stride =
                    dir * (if buf_stride == 0 { src_size } else { buf_stride }) as isize;
                let dst_stride =
                    dir * (if buf_stride == 0 { dst_size } else { buf_stride }) as isize;
                for _ in 0..nelmts {
                    let mut len = src_size;
                    if nullterm {
                        // technically, nullpad has to be handled differently, but that's
                        // how it's done in the HDF5 library itself (H5T__conv_s_s in H5Tconv.c)
                        for i in 0..src_size {
                            if *src_buf.offset(i as _) == b'\0' {
                                len = i;
                                break;
                            }
                        }
                    } else {
                        // an element consisting only of spaces is an empty string
                        len = 0;
                        for i in (0..src_size).rev() {
                            if *src_buf.offset(i as _) != b' ' {
                                len = i + 1;
                                break;
                            }
                        }
                    }
                    // alternatively, could use str::from_utf8_unchecked()?
                    let s =
                        str::from_utf8(slice::from_raw_parts(src_buf, len)).unwrap().to_string();
                    // move the String into the destination slot bitwise, then forget the
                    // local so its destructor doesn't free the heap buffer
                    libc::memcpy(dst_buf as _, &s as *const _ as _, STRING_SIZE);
                    mem::forget(s);
                    src_buf = src_buf.offset(src_stride);
                    dst_buf = dst_buf.offset(dst_stride);
                }
            }
        }
    }
    0
}

fn main() -> Result<(), Box<dyn Error>> {
    unsafe {
        assert!(h5::H5open() >= 0);
        let type_id = h5t::H5Tcreate(h5t::H5T_OPAQUE, STRING_SIZE as _);
        assert!(type_id >= 0);
        assert!(h5t::H5Tset_tag(type_id, "rust::String\0".as_ptr() as *const _) >= 0);
        // let h5_type_id = h5t::H5Tcreate(h5t::H5T_STRING, 1);
        let h5_type_id = *hdf5::globals::H5T_C_S1;
        assert!(h5_type_id >= 0);
        assert!(
            h5t::H5Tregister(
                h5t::H5T_PERS_SOFT,
                "H5T_C_S1->rust::String\0".as_ptr() as _,
                h5_type_id,
                type_id,
                Some(conv_func),
            ) >= 0
        );
        let file =
            h5f::H5Fopen("test.h5\0".as_ptr() as *const _, h5f::H5F_ACC_RDONLY, h5p::H5P_DEFAULT);
        assert!(file >= 0);
        for name in &["a1\0", "a2\0"] {
            // don't print the trailing NUL that is only there for the C API
            println!("{}:", name.trim_end_matches('\0'));
            let ds = h5d::H5Dopen2(file, name.as_ptr() as *const _, h5p::H5P_DEFAULT);
            assert!(ds >= 0);
            let space = h5d::H5Dget_space(ds);
            assert!(space >= 0);
            let npoints = h5s::H5Sget_simple_extent_npoints(space);
            assert!(npoints >= 0);
            let npoints = npoints as usize;
            let layout =
                alloc::Layout::from_size_align(npoints * STRING_SIZE, mem::align_of::<String>())?;
            let buf = alloc::alloc(layout);
            assert!(
                H5Dread(ds, type_id, h5s::H5S_ALL, h5s::H5S_ALL, h5p::H5P_DEFAULT, buf as _) >= 0
            );
            // the buffer now holds `npoints` valid Strings; hand ownership over to a Vec
            let vec = Vec::<String>::from_raw_parts(buf as _, npoints, npoints);
            println!("{:#?}", vec);
            let arr: Array1<String> = Array1::from_shape_vec_unchecked(npoints, vec);
            println!("{:#?}", arr);
        }
    }
    Ok(())
}
Output:
TLDR: we have an HDF5 dataset with type
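For readers who want to poke at the per-element logic without linking HDF5, here is a minimal safe-Rust sketch of the length detection the prototype's conversion loop performs. The names `Pad`, `trimmed_len` and `decode_fixed` are hypothetical and not part of hdf5-rust or hdf5-sys; the null-padded case is treated the same as null-terminated, mirroring the simplification noted in the prototype's comments.

```rust
use std::str::Utf8Error;

/// Padding convention of a fixed-size string element.
/// Hypothetical stand-in for the H5T_STR_* constants, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Pad {
    NullTerm,
    NullPad,
    SpacePad,
}

/// Number of meaningful bytes in one fixed-size element.
pub fn trimmed_len(elem: &[u8], pad: Pad) -> usize {
    match pad {
        // Stop at the first NUL; if there is none, the whole element is data.
        Pad::NullTerm | Pad::NullPad => {
            elem.iter().position(|&b| b == 0).unwrap_or(elem.len())
        }
        // Strip trailing spaces; an all-space element is an empty string.
        Pad::SpacePad => elem.iter().rposition(|&b| b != b' ').map_or(0, |i| i + 1),
    }
}

/// Decode a contiguous buffer of fixed-size elements into owned strings.
/// `elem_size` must be non-zero (`chunks_exact` panics on zero).
pub fn decode_fixed(buf: &[u8], elem_size: usize, pad: Pad) -> Result<Vec<String>, Utf8Error> {
    buf.chunks_exact(elem_size)
        .map(|elem| std::str::from_utf8(&elem[..trimmed_len(elem, pad)]).map(str::to_owned))
        .collect()
}
```

Unlike the prototype, this returns a `Utf8Error` instead of unwrapping, which is probably what a real conversion callback would want as well, since panicking across an `extern "C"` boundary is best avoided.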
@magnusuMET There you go as promised ^ 😄
Just verified: the conversion routine does indeed run chunk by chunk. So, if you're converting a dataset with 1K strings but the chunk size is 100, you will hold memory for at most 1100 strings at a time (this is the advantage over the "read all, then convert" approach).
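The arithmetic behind the 1100 figure can be written down as a tiny back-of-the-envelope model. It counts elements rather than bytes and assumes the peak is "all converted output plus one chunk still in flight", versus "everything twice" for an unchunked dataset; the function names are hypothetical.

```rust
/// Rough upper bound on elements resident at once when conversion runs
/// chunk by chunk: the full output plus one chunk being converted.
pub fn peak_elements_chunked(total: usize, chunk: usize) -> usize {
    // A chunk can never hold more elements than the dataset itself.
    total + chunk.min(total)
}

/// Rough upper bound for an unchunked dataset: source and destination
/// representations both exist in full at some point.
pub fn peak_elements_unchunked(total: usize) -> usize {
    2 * total
}
```

With 1000 strings and a chunk size of 100 the chunked estimate is 1100, against 2000 for the unchunked case.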
@aldanor That is some really great stuff! So it sort of acts as an in-place conversion? Nasty trick, copying the layout of the String 👍
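The "copy the layout of the String" trick can be shown in isolation with only the standard library. This is a sketch of the idea, not the prototype's exact code: it uses `ptr::write_unaligned` and `ptr::read_unaligned` in place of `memcpy` plus `mem::forget`, and the function name is hypothetical.

```rust
use std::mem::size_of;
use std::ptr;

/// Move a `String` into a plain byte buffer and back out again.
pub fn roundtrip_through_bytes(s: String) -> String {
    // Scratch space of exactly size_of::<String>() bytes. A Vec<u8> is only
    // guaranteed 1-byte alignment, hence the unaligned accessors below.
    let mut raw = vec![0u8; size_of::<String>()];
    unsafe {
        // Bitwise move of `s` into the buffer. Ownership is transferred, so
        // no destructor runs for `s` here (the equivalent of memcpy + forget).
        ptr::write_unaligned(raw.as_mut_ptr() as *mut String, s);
        // Bitwise move back out. After this, the bytes in `raw` must not be
        // interpreted as a String again, or the heap buffer would be freed twice.
        ptr::read_unaligned(raw.as_ptr() as *const String)
    }
}
```

The important invariant is that exactly one owner of the heap allocation exists at any time; the byte buffer is just a parking spot for the pointer, length and capacity.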
Yea, it is in-place in the sense that One could argue it's not the most efficient way of doing things etc, but given that it allows you to map directly to Rust types, I think convenience outweighs everything else. Typically, if you want performance, you won't be using strings at all in the first place :) Note also that this would automatically work for structs as well, any
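To illustrate why an in-place conversion has to pick a direction, here is a small stand-alone sketch that widens `u8` elements to `u32` inside a single buffer. It is an analogy for the prototype's `src_size < dst_size` branch, not HDF5 code, and `widen_in_place` is a hypothetical name.

```rust
/// Widen `n` one-byte elements at the front of `buf` into `n` four-byte
/// (native-endian u32) elements occupying the same buffer.
pub fn widen_in_place(buf: &mut [u8], n: usize) {
    const SRC: usize = 1;
    const DST: usize = 4;
    assert!(buf.len() >= n * DST, "buffer too small for widened elements");
    // The destination element is larger than the source element, so walk
    // from the last element to the first: writing element k then only
    // touches bytes at or beyond offset k * DST, which lies past every
    // source element that has not been read yet.
    for k in (0..n).rev() {
        let value = buf[k * SRC] as u32;
        buf[k * DST..(k + 1) * DST].copy_from_slice(&value.to_ne_bytes());
    }
}
```

Walking forwards instead would overwrite source bytes 1 to 3 while writing element 0, which is the clobbering the reversed direction avoids.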
Just to add to the above so I don't forget, we could totally do something like that (I could probably take up on that once the dust settles over the current blockers), BUT: this will require splitting
I've spent some time digging into the HDF5 conversion API and it seems like it actually works! As in, we can force it to "understand" Rust string types and convert back and forth. Given the painful experience with strings and arrays (#86, #47, #85), this could be a huge win in usability.
The same can be done with varlen/fixed arrays/strings (direct conversions to/from `&[T]`, `Vec<T>`, `String`, `&str`, etc).
Price to pay: extra memory allocation. If the dataset is not chunked, it will (at some point in the conversion path) use double the required memory. If it is chunked, I think it will process it chunk by chunk so the cost could be negligible.
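Since the idea is to convert "back and forth", the write direction (Rust string to a fixed-size HDF5 string element) might look roughly like the sketch below. It only covers the null-padded case, and `EncodeError` and `encode_null_padded` are hypothetical names that do not exist in hdf5-rust.

```rust
/// Reasons a Rust string cannot be stored in a fixed-size, null-padded element.
#[derive(Debug, PartialEq)]
pub enum EncodeError {
    /// The string has more bytes than the element can hold.
    TooLong { len: usize, capacity: usize },
    /// The string contains a NUL byte, which would truncate it on read-back.
    InteriorNul,
}

/// Encode `s` into exactly `elem_size` bytes, padding the tail with NULs.
pub fn encode_null_padded(s: &str, elem_size: usize) -> Result<Vec<u8>, EncodeError> {
    let bytes = s.as_bytes();
    if bytes.contains(&0) {
        return Err(EncodeError::InteriorNul);
    }
    if bytes.len() > elem_size {
        return Err(EncodeError::TooLong { len: bytes.len(), capacity: elem_size });
    }
    // Start from an all-NUL element and copy the payload over the front.
    let mut out = vec![0u8; elem_size];
    out[..bytes.len()].copy_from_slice(bytes);
    Ok(out)
}
```

Whether an over-long string should be rejected or silently truncated is one of the design details that would need discussing; this sketch rejects it.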
There are many details to consider and discuss; this is just a start and an experiment. Details below.