Conversion of native str/String from/to HDF5 string types #108

Open
aldanor opened this issue Aug 6, 2020 · 9 comments

@aldanor
Owner

aldanor commented Aug 6, 2020

I've spent some time digging into the HDF5 conversion API and it seems like it actually works! As in, we can force it to "understand" Rust string types and convert back and forth. Given the painful experience with strings and arrays (#86, #47, #85), this could be a huge win in usability.

The same can be done with varlen/fixed arrays/strings (direct conversions to/from &[T], Vec<T>, String, &str, etc).

Price to pay: extra memory allocation. If the dataset is not chunked, it will (at some point in the conversion path) use double the required memory. If it is chunked, I think it will process it chunk by chunk so the cost could be negligible.

There are many details to consider and discuss; this is just a start and an experiment. Details below.

@aldanor
Owner Author

aldanor commented Aug 6, 2020

Test file:

HDF5 "test.h5" {
GROUP "/" {
   DATASET "a1" {
      DATATYPE  H5T_STRING {
         STRSIZE 26;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 5 ) / ( 5 ) }
      DATA {
      (0): "abcdefghijklmnopqrstuvwxyz", "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
      (2): "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
      (3): "123\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
      (4): "a\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
      }
   }
   DATASET "a2" {
      DATATYPE  H5T_STRING {
         STRSIZE 3;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
      DATA {
      (0): "abc", "1\000\000", "\000\000\000", "23\000"
      }
   }
}
}

@aldanor
Owner Author

aldanor commented Aug 6, 2020

Prototype:

use std::alloc;
use std::error::Error;
use std::mem;
use std::slice;
use std::str;

use libc::{c_void, size_t};
use ndarray::Array1;

use hdf5_sys::h5::herr_t;
use hdf5_sys::h5d::H5Dread;
use hdf5_sys::h5i::hid_t;
use hdf5_sys::*;

const STRING_SIZE: usize = mem::size_of::<String>();

extern "C" fn conv_func(
    src_id: hid_t, dst_id: hid_t, cdata: *mut h5t::H5T_cdata_t, nelmts: size_t, buf_stride: size_t,
    _bkg_stride: size_t, buf: *mut c_void, _bkg: *mut c_void, _dset_xfer_plist: hid_t,
) -> herr_t {
    // TODO: the accepted function pointer should be unsafe by default
    unsafe {
        // check examples in H5Tconv.c, e.g. H5T__conv_s_s
        match (*cdata).command {
            h5t::H5T_CONV_INIT => {
                // initialization, checks, etc - ignore for now
                (*cdata).need_bkg = h5t::H5T_BKG_NO;
            }
            h5t::H5T_CONV_FREE => {
                // nothing to do here
            }
            h5t::H5T_CONV_CONV => {
                let buf = buf as *mut u8;
                let nullterm = match h5t::H5Tget_strpad(src_id) {
                    h5t::H5T_STR_NULLTERM | h5t::H5T_STR_NULLPAD => true,
                    h5t::H5T_STR_SPACEPAD => false,
                    _ => panic!("unsupported"),
                };
                let src_size = h5t::H5Tget_size(src_id) as usize;
                let dst_size = h5t::H5Tget_size(dst_id) as usize;
                let (dir, mut src_buf, mut dst_buf) = if src_size >= dst_size {
                    (1, buf, buf)
                } else {
                    let k = nelmts - 1;
                    (-1, buf.offset((k * src_size) as _), buf.offset((k * dst_size) as _))
                };
                let src_stride =
                    dir * (if buf_stride == 0 { src_size } else { buf_stride }) as isize;
                let dst_stride =
                    dir * (if buf_stride == 0 { dst_size } else { buf_stride }) as isize;
                for _ in 0..nelmts {
                    let mut len = src_size;
                    if nullterm {
                        // technically, nullpad has to be handled differently, but that's
                        // how it's done in the HDF5 library itself (H5T__conv_s_s in H5Tconv.c)
                        for i in 0..src_size {
                            if *src_buf.offset(i as _) == b'\0' {
                                len = i;
                                break;
                            }
                        }
                    } else {
                        // an element made only of spaces decodes to the empty string
                        len = 0;
                        for i in (0..src_size).rev() {
                            if *src_buf.offset(i as _) != b' ' {
                                len = i + 1;
                                break;
                            }
                        }
                    }
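                    // alternatively, could use str::from_utf8_unchecked()?
                    let s =
                        str::from_utf8(slice::from_raw_parts(src_buf, len)).unwrap().to_string();
                    // move the String header into the output slot (replaces memcpy + forget);
                    // the conversion buffer may not be aligned for String, hence write_unaligned
                    std::ptr::write_unaligned(dst_buf as *mut String, s);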
                    src_buf = src_buf.offset(src_stride);
                    dst_buf = dst_buf.offset(dst_stride);
                }
            }
        }
    }
    0
}

fn main() -> Result<(), Box<dyn Error>> {
    unsafe {
        assert!(h5::H5open() >= 0);
        let type_id = h5t::H5Tcreate(h5t::H5T_OPAQUE, STRING_SIZE as _);
        assert!(type_id >= 0);
        assert!(h5t::H5Tset_tag(type_id, "rust::String\0".as_ptr() as *const _) >= 0);
        // let h5_type_id = h5t::H5Tcreate(h5t::H5T_STRING, 1);
        let h5_type_id = *hdf5::globals::H5T_C_S1;
        assert!(h5_type_id >= 0);
        assert!(
            h5t::H5Tregister(
                h5t::H5T_PERS_SOFT,
                "H5T_C_S1->rust::String\0".as_ptr() as _,
                h5_type_id,
                type_id,
                Some(conv_func),
            ) >= 0
        );
        let file =
            h5f::H5Fopen("test.h5\0".as_ptr() as *const _, h5f::H5F_ACC_RDONLY, h5p::H5P_DEFAULT);
        assert!(file >= 0);
        for name in &["a1\0", "a2\0"] {
            // strip the trailing NUL that is only there for the C API
            println!("{}:", name.trim_end_matches('\0'));
            let ds = h5d::H5Dopen2(file, name.as_ptr() as *const _, h5p::H5P_DEFAULT);
            assert!(ds >= 0);
            let space = h5d::H5Dget_space(ds);
            assert!(space >= 0);
            let npoints = h5s::H5Sget_simple_extent_npoints(space);
            assert!(npoints >= 0);
            let npoints = npoints as usize;
            let layout =
                alloc::Layout::from_size_align(npoints * STRING_SIZE, mem::align_of::<String>())?;
            let buf = alloc::alloc(layout);
            assert!(!buf.is_null());
            assert!(
                H5Dread(ds, type_id, h5s::H5S_ALL, h5s::H5S_ALL, h5p::H5P_DEFAULT, buf as _,) >= 0
            );
            let vec = Vec::<String>::from_raw_parts(buf as _, npoints, npoints);
            println!("{:#?}", vec);
            let arr: Array1<String> = Array1::from_shape_vec_unchecked(npoints, vec);
            println!("{:#?}", arr);
        }
    }

    Ok(())
}
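
The padding handling inside the conversion callback can also be looked at in isolation. Below is a minimal safe-Rust sketch of that step only; `trimmed_len` and `decode_fixed` are hypothetical helper names for illustration, not part of the prototype or of any crate. It assumes, like the prototype, that NULLTERM and NULLPAD are both cut at the first NUL, and it treats an all-space SPACEPAD element as empty.

```rust
use std::str::Utf8Error;

/// Length of one fixed-size HDF5 string element once padding is stripped.
/// `null_padded` covers both NULLTERM and NULLPAD (cut at the first NUL);
/// otherwise the element is assumed to be SPACEPAD (drop trailing spaces).
pub fn trimmed_len(raw: &[u8], null_padded: bool) -> usize {
    if null_padded {
        raw.iter().position(|&b| b == 0).unwrap_or(raw.len())
    } else {
        // No non-space byte at all means the element is empty.
        raw.iter().rposition(|&b| b != b' ').map_or(0, |i| i + 1)
    }
}

/// Decode one element into an owned String, reporting invalid UTF-8
/// as an error instead of panicking.
pub fn decode_fixed(raw: &[u8], null_padded: bool) -> Result<String, Utf8Error> {
    let len = trimmed_len(raw, null_padded);
    std::str::from_utf8(&raw[..len]).map(str::to_owned)
}
```

Returning a `Result` here is a design choice of the sketch: a real callback would probably want to map the error to a negative `herr_t` rather than unwrap inside an `extern "C"` function.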

@aldanor
Owner Author

aldanor commented Aug 6, 2020

Output:

a1:
[
    "abcdefghijklmnopqrstuvwxyz",
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "",
    "123",
    "a",
]
["abcdefghijklmnopqrstuvwxyz", "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "", "123", "a"], shape=[5], strides=[1], layout=C | F (0x3), const ndim=1
a2:
[
    "abc",
    "1",
    "",
    "23",
]
["abc", "1", "", "23"], shape=[4], strides=[1], layout=C | F (0x3), const ndim=1

@aldanor
Owner Author

aldanor commented Aug 6, 2020

TL;DR: we have an HDF5 dataset of type |S26, we read it directly into a Vec<String>, and it sort of seems to work.

@aldanor
Owner Author

aldanor commented Aug 6, 2020

@magnusuMET There you go as promised ^ 😄

@aldanor
Owner Author

aldanor commented Aug 6, 2020

Just verified: the conversion routine does indeed run chunk by chunk. So, if you're converting a dataset with 1K strings but the chunk size is 100, you will allocate memory for at most 1100 strings at a time (this is the advantage over the "read all, then convert" approach).
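
To make the arithmetic explicit, here is a tiny sketch of the memory model as described in this thread. `peak_elements` is a hypothetical name, and the model itself is an assumption drawn from the comments above (output buffer plus one chunk in flight when chunked, roughly double when not), not something measured inside HDF5.

```rust
/// Rough upper bound on the number of string elements held in memory at once.
/// `chunk` is the chunk size in elements, or `None` for a non-chunked dataset.
pub fn peak_elements(total: usize, chunk: Option<usize>) -> usize {
    match chunk {
        // Full output buffer plus at most one chunk being converted.
        Some(c) => total + c.min(total),
        // Non-chunked: assume the whole dataset is staged once more.
        None => total * 2,
    }
}
```

With 1000 strings and a chunk size of 100 this gives the 1100 mentioned above; without chunking it gives 2000.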

@magnusuMET
Contributor

@aldanor That is some really great stuff! So it sort of acts as an in-place conversion? Nasty trick of copying the layout of the String 👍

@aldanor
Owner Author

aldanor commented Aug 7, 2020

Yeah, it is in-place in the sense that the String body (pretty hefty, 24 B) is generated in place, though obviously not the heap data it points to.
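
As an illustration of what "the body is generated in place" means, here is a standalone sketch with no HDF5 involved. `strings_via_raw_buffer` and `header_size` are hypothetical names; the sketch only assumes the standard allocator and shows String headers being moved into a raw buffer that is then adopted as a `Vec<String>`, which is the same trick the prototype plays with the read buffer.

```rust
use std::alloc::{self, Layout};
use std::mem;
use std::ptr;

/// Size of the String header itself (pointer, capacity, length);
/// 24 bytes on typical 64-bit targets.
pub fn header_size() -> usize {
    mem::size_of::<String>()
}

/// Build a Vec<String> by writing String headers into a raw allocation.
pub fn strings_via_raw_buffer(items: &[&str]) -> Vec<String> {
    if items.is_empty() {
        // Zero-sized allocations are not allowed with `alloc::alloc`.
        return Vec::new();
    }
    let layout = Layout::array::<String>(items.len()).expect("layout overflow");
    unsafe {
        let buf = alloc::alloc(layout) as *mut String;
        if buf.is_null() {
            alloc::handle_alloc_error(layout);
        }
        for (i, item) in items.iter().enumerate() {
            // Only the header lands in slot `i`; the character data
            // lives in its own heap allocation owned by that header.
            ptr::write(buf.add(i), item.to_string());
        }
        // The buffer was allocated with the layout Vec expects for this
        // capacity, so Vec can take ownership and free it later.
        Vec::from_raw_parts(buf, items.len(), items.len())
    }
}
```

Dropping the returned vector frees both the header buffer and each string's heap data, so nothing leaks as long as every slot was initialized.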

One could argue it's not the most efficient way of doing things etc, but given that it allows you to map directly to Rust types, I think convenience outweighs everything else. Typically, if you want performance, you won't be using strings at all in the first place :)

Note also that this would automatically work for structs as well, any String field wrapped in a struct or array would automatically be decoded in place.

@aldanor
Owner Author

aldanor commented Jan 27, 2021

Just to add to the above so I don't forget: we could totally do something like that (I could probably take that on once the dust settles over the current blockers), BUT this will require splitting H5Type into H5Read and H5Write. I.e., you can write &str or String but you can only read String.
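
One possible shape for that split, purely as a sketch: the trait names come from the comment above, but the methods `to_h5_bytes` and `from_h5_bytes` and the `roundtrip` helper are invented here for illustration and say nothing about how the crate would actually define them. The point is only that a borrowed `&str` can implement the write side but not the read side, since reading has to produce owned data.

```rust
/// Hypothetical write-side trait: anything that can be turned into bytes.
pub trait H5Write {
    fn to_h5_bytes(&self) -> Vec<u8>;
}

/// Hypothetical read-side trait: the result must own its data.
pub trait H5Read: Sized {
    fn from_h5_bytes(raw: &[u8]) -> Option<Self>;
}

impl<'a> H5Write for &'a str {
    fn to_h5_bytes(&self) -> Vec<u8> {
        self.as_bytes().to_vec()
    }
}

impl H5Write for String {
    fn to_h5_bytes(&self) -> Vec<u8> {
        self.as_bytes().to_vec()
    }
}

// Deliberately no `H5Read for &str`: there is nothing to borrow from.
impl H5Read for String {
    fn from_h5_bytes(raw: &[u8]) -> Option<Self> {
        std::str::from_utf8(raw).ok().map(str::to_owned)
    }
}

/// Write any writable value and read it back as any readable type.
pub fn roundtrip<W: H5Write, R: H5Read>(value: &W) -> Option<R> {
    R::from_h5_bytes(&value.to_h5_bytes())
}
```

With these definitions, `roundtrip::<_, String>(&"abc")` compiles, while asking for `&str` as the read type is rejected at compile time.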
