h5flow.data

h5flow.data.lib.dereference(sel, ref, data=None, region=None, mask=None, ref_direction=(0, 1), indices_only=False, as_masked=True)[source]

Load data referred to by ref that corresponds to the desired positions specified in sel.

Parameters

sel – iterable of indices, an index, or a slice to match against ref[:,ref_direction[0]]. Return value will have same first dimension as sel, e.g. dereference(slice(100), ref, data).shape[0] == 100
ref – a shape (N,2) h5py.Dataset or array of pairs of indices linking sel and data
data – an h5py.Dataset or array to load dereferenced data from, can be omitted if indices_only==True
region – a 1D h5py.Dataset or array with a structured array type of [('start','i8'), ('stop','i8')]; 'start' defines the earliest index within the ref dataset for each value in sel, and 'stop' defines the last index + 1 within the ref dataset (optional). If an h5py.Dataset is used, the sel spec will be used to load data from the dataset (i.e. region[sel]), otherwise len(sel) == len(region) and a 1:1 correspondence is assumed
mask – mask off specific items in selection (boolean, True == don’t dereference selection), len(mask) == len(np.r_[sel])
ref_direction – defines how to interpret the second dimension of ref. ref[:,ref_direction[0]] are matched against items in sel, and ref[:,ref_direction[1]] are indices into the data array (default=(0,1)). For example, dereference([0,1,2], [[1,0], [2,1]], ['A','B','C','D'], ref_direction=(0,1)) returns an array equivalent to [[],['A'],['B']], whereas dereference([0,1,2], [[1,0], [2,1]], ['A','B','C','D'], ref_direction=(1,0)) returns an array equivalent to [['B'],['C'],[]]
indices_only – if True, only returns the indices into data, does not fetch data from data
as_masked – if True (default), return a numpy masked array; if False, return a list

Returns

A numpy masked array (or a list if as_masked=False) with the same length as sel
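
The matching rule described for ref_direction can be illustrated with a small pure-Python sketch. The function name dereference_sketch is hypothetical and this is not h5flow's implementation: it ignores region and mask, accepts only an iterable sel, and returns ragged lists rather than a masked array. It only reproduces the two ref_direction examples given in the parameter description.

```python
def dereference_sketch(sel, ref, data=None, ref_direction=(0, 1), indices_only=False):
    """Pure-Python illustration of the matching rule (hypothetical helper)."""
    match_col, data_col = ref_direction
    result = []
    for s in sel:
        # Collect every data index whose partner in the match column equals s.
        idcs = [pair[data_col] for pair in ref if pair[match_col] == s]
        # Either return the indices themselves or look them up in data.
        result.append(idcs if indices_only else [data[i] for i in idcs])
    return result


ref = [[1, 0], [2, 1]]
data = ['A', 'B', 'C', 'D']

# ref[:,0] is matched against sel, ref[:,1] indexes data
forward = dereference_sketch([0, 1, 2], ref, data, ref_direction=(0, 1))
# ref[:,1] is matched against sel, ref[:,0] indexes data
backward = dereference_sketch([0, 1, 2], ref, data, ref_direction=(1, 0))
# Same traversal as forward, but without touching data
indices = dereference_sketch([0, 1, 2], ref, ref_direction=(0, 1), indices_only=True)

print(forward)
print(backward)
print(indices)
```

Swapping ref_direction lets a single stored reference dataset be traversed in either direction, which is why data can be omitted entirely when only the indices are needed.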

h5flow.data.lib.dereference_chain(sel, refs, data=None, regions=None, mask=None, ref_directions=None, indices_only=False)[source]

Load a “chain” of references. Allows traversal of multiple layers of references, e.g. for three datasets A, B, and C linked A->B->C, one can use a selection in A to load the C data associated with it.

Example usage:

sel = slice(0,100)
refs = [f['A/ref/B/ref'], f['C/ref/B/ref']]
ref_dirs = [(0,1), (1,0)]
data = f['C/data']
regions = [f['A/ref/B/ref_region'], f['B/ref/C/ref_region']]
mask = np.r_[sel] > 50

c_data = dereference_chain(sel, refs, data, regions=regions, mask=mask, ref_directions=ref_dirs)
c_data.shape # (100, max_a2b_assoc, max_b2c_assoc)

Parameters

sel – iterable of indices, a slice, or an integer, see sel argument in dereference
refs – a list of reference datasets to load, in order, see ref argument in dereference
data – a dataset to load dereferenced data from, optional if indices_only=True
regions – lookup table into refs for each selection, see region argument in dereference
mask – a boolean mask into the first selection, true will not load the entry
ref_directions – interpretation of reference datasets, see ref_direction argument in dereference
indices_only – flag to skip loading the data and instead just return indices into the final dataset
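
To make the chained traversal concrete, here is a minimal pure-Python sketch. The names deref_indices and dereference_chain_sketch are hypothetical, the default direction of (0,1) for every link is an assumption, and the result is a ragged nested list, whereas the real function is documented to return an array of shape (len(sel), max_a2b_assoc, max_b2c_assoc). Regions and masks are left out.

```python
def deref_indices(sel, ref, ref_direction=(0, 1)):
    """Indices on the far side of ref for each entry of sel (hypothetical helper)."""
    match_col, data_col = ref_direction
    return [[p[data_col] for p in ref if p[match_col] == s] for s in sel]


def dereference_chain_sketch(sel, refs, data, ref_directions=None):
    """Follow each reference table in order, then load from data (illustration only)."""
    if ref_directions is None:
        # Assumption for this sketch: every link is read as (0, 1).
        ref_directions = [(0, 1)] * len(refs)

    def walk(index, level):
        if level == len(refs):
            return data[index]  # end of the chain: fetch the actual value
        children = deref_indices([index], refs[level], ref_directions[level])[0]
        return [walk(c, level + 1) for c in children]

    return [walk(s, 0) for s in sel]


a2b = [[0, 0], [0, 1], [1, 2]]   # pairs of (A index, B index)
c2b = [[0, 0], [1, 2], [2, 2]]   # pairs of (C index, B index), read backwards below
c_data = ['c0', 'c1', 'c2']

chained = dereference_chain_sketch([0, 1], [a2b, c2b], c_data,
                                   ref_directions=[(0, 1), (1, 0)])
print(chained)
```

As in the usage example above, the second link is stored as C->B and traversed with direction (1,0) to go from B to C.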

h5flow.data.lib.print_attr(grp)[source]: Print out all attributes in file (or group)

h5flow.data.lib.print_data(grp)[source]: Print out all datasets in file (or group)

h5flow.data.lib.print_ref(grp)[source]: Print out all references in file (or group)

class h5flow.data.h5flow_data_manager.H5FlowDataManager(filepath, mode='a', mpi=True, drop_list=None)[source]

Coordinates access to the output data file across multiple processes.

To initialize:

hfdm = H5FlowDataManager(<path to file>, mode=<'r'/'a'>, mpi=<True/False>)

Opening and closing the underlying resource is handled automatically when using the dedicated file access API, e.g.:

hfdm.dset_exists(...)
hfdm.create_dset(...)
hfdm.get_ref(...)
hfdm.reserve_data(...)
hfdm.write_ref(...)
hfdm[...]
...

attr_exists(name, key)[source]

Check if attribute key exists for name

Parameters

name – str path to object, e.g. stage0/obj0 or stage0
key – str attribute name

Returns

True if attribute exists

close_file()[source]: Force underlying hdf5 resource to close

create_dset(dataset_name, dtype, shape=())[source]

Create a 1D dataset of dataset_name with datatype dtype, if it doesn’t already exist

Parameters

dataset_name – str path to dataset, e.g. stage0/obj0
dtype – np.dtype of dataset, can be a structured dtype

create_ref(parent_dataset_name, child_dataset_name)[source]

Create a 1D dataset of references of parent_dataset_name -> child_dataset_name, if it doesn’t already exist. Both datasets must already exist.

Parameters

parent_dataset_name – str path to parent dataset, e.g. stage0/obj0
child_dataset_name – str path to child dataset, e.g. stage0/obj1

delete(name)[source]

Delete the object at name and any references to it. Ignored if path is in temp file.

Parameters: name – str path to dataset to be deleted

dset_exists(dataset_name)[source]

Check if data object of dataset_name exists

Parameters: dataset_name – str path to dataset, e.g. stage0/obj0
Returns: True if data object exists

exists(path)[source]

Check if a path exists

Parameters: path – str path to check
Returns: True if path is present

property fh: Direct access to the underlying h5py File object. Not recommended for use. Instead, use get_dset(...), write_data(...), or the implemented __getitem__().

finish()[source]: Deletes datasets specified in the drop_list before closing file handle.

get_attrs(name)[source]

Get attributes of name

Parameters: name – str path to object, e.g. stage0
Returns: h5py.AttributeManager

get_dset(dataset_name)[source]

Get dataset of dataset_name

Parameters: dataset_name – str path to dataset, e.g. stage0/obj0
Returns: h5py.Dataset, e.g. stage0/obj0/data

get_ref(parent_dataset_name, child_dataset_name)[source]

Get references of parent_dataset_name -> child_dataset_name

Parameters

parent_dataset_name – str path to parent dataset, e.g. stage0/obj0
child_dataset_name – str path to child dataset, e.g. stage0/obj1

Returns

tuple of h5py.Dataset, reference direction; e.g. (stage0/obj0/ref/stage0/obj1/ref, (0,1))

get_ref_region(parent_dataset_name, child_dataset_name)[source]

Get reference lookup regions for parent_dataset_name -> child_dataset_name

Parameters

parent_dataset_name – str path to parent dataset, e.g. stage0/obj0
child_dataset_name – str path to child dataset, e.g. stage0/obj1

Returns

h5py.Dataset, stage0/obj0/ref/stage0/obj1/ref_region, (0,1)
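
The purpose of a reference region table can be sketched in pure Python: for each parent index it stores the window of the reference dataset that needs to be searched, following the 'start' / 'stop' definition given for the region argument of dereference. The helper names below are hypothetical, and representing a parent with no references as start == stop == 0 is an assumption of this sketch, not a documented convention.

```python
def build_regions(ref, n_items, match_col=0):
    """One {'start', 'stop'} entry per parent index (hypothetical helper)."""
    regions = [{'start': 0, 'stop': 0} for _ in range(n_items)]
    seen = set()
    for i, pair in enumerate(ref):
        key = pair[match_col]
        if key not in seen:
            regions[key]['start'] = i   # earliest row of ref mentioning key
            seen.add(key)
        regions[key]['stop'] = i + 1    # last row mentioning key, plus one
    return regions


def deref_with_region(sel, ref, data, regions, ref_direction=(0, 1)):
    """Search only ref[start:stop] for each selected index (illustration only)."""
    match_col, data_col = ref_direction
    out = []
    for s in sel:
        window = ref[regions[s]['start']:regions[s]['stop']]
        out.append([data[p[data_col]] for p in window if p[match_col] == s])
    return out


ref = [[0, 0], [2, 1], [0, 2], [2, 3]]
data = ['a', 'b', 'c', 'd']
regions = build_regions(ref, n_items=3)
loaded = deref_with_region([0, 1, 2], ref, data, regions)
print(regions)
print(loaded)
```

The window only bounds the search, so rows inside it still have to be compared against the selected index.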

get_refs(dataset_name)[source]: Get all references involving dataset_name -> other

ref_exists(parent_dataset_name, child_dataset_name)[source]

Check if references for parent_dataset_name -> child_dataset_name exist

Parameters

parent_dataset_name – str path to parent dataset, e.g. stage0/obj0
child_dataset_name – str path to child dataset, e.g. stage0/obj1

Returns

True if references exist

ref_region_exists(parent_dataset_name, child_dataset_name)[source]

Check if reference table for parent_dataset_name -> child_dataset_name exists

Parameters

parent_dataset_name – str path to parent dataset, e.g. stage0/obj0
child_dataset_name – str path to child dataset, e.g. stage0/obj1

Returns

True if reference table exists

reserve_data(dataset_name, spec)[source]

Coordinate access into dataset_name. The access mode depends on the type of spec:

int: access in append mode - will grant access to spec rows at the end of the dataset

slice or list of int or list of slice: access a specific section(s) of the dataset - will resize dataset if section does not exist

Parameters

dataset_name – str path to dataset, e.g. stage0/obj0
spec – see function description

Returns

slice into dataset_name where access is given
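
The two access modes can be pictured with a single-process sketch. ReserveSketch is a hypothetical stand-in that only tracks a dataset length. It does not model coordination between processes, handles only an int or a single slice with explicit bounds, and is not how h5flow implements reserve_data.

```python
class ReserveSketch:
    """Tracks the length of one imaginary dataset (illustration only)."""

    def __init__(self):
        self.length = 0

    def reserve(self, spec):
        if isinstance(spec, int):
            # Append mode: grant `spec` rows at the current end of the dataset.
            granted = slice(self.length, self.length + spec)
            self.length += spec
            return granted
        if isinstance(spec, slice):
            # Explicit section: grow the dataset if the section does not exist yet.
            self.length = max(self.length, spec.stop)
            return spec
        raise TypeError('this sketch only handles int and slice specs')


dset = ReserveSketch()
first = dset.reserve(10)              # rows 0..9
second = dset.reserve(5)              # rows 10..14
section = dset.reserve(slice(20, 30)) # forces a resize to 30 rows
third = dset.reserve(3)               # appended after the resized end
print(first, second, section, third, dset.length)
```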

set_attrs(name, **attrs)[source]

Update attributes of name. Attribute key/value pairs are passed in as additional keyword arguments

Parameters: name – str path to object, e.g. stage0

write_data(dataset_name, spec, data)[source]

Write data into dataset_name at spec

Parameters

dataset_name – str path to dataset, e.g. stage0/obj0
spec – slice into dataset_name to write data
data – numpy array or iterable to write

write_ref(parent_dataset_name, child_dataset_name, refs)[source]

Add refs for parent_dataset_name -> child_dataset_name. Note that references are never updated and can’t be removed after they are created.

Parameters: refs – an integer array of shape (N,2) with refs[:,0] corresponding to the index in the parent dataset and refs[:,1] corresponding to the index in the child dataset
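
Because refs holds indices into the full parent and child datasets, links computed relative to a freshly written batch have to be offset by the slices that reserve_data returned. The sketch below shows one way this might look. append_linked and RecordingManager are hypothetical: the latter is a stand-in that mimics the documented reserve_data, write_data, and write_ref signatures so the snippet runs without a file. With a real H5FlowDataManager the datasets and reference would need to be created first, and refs would be passed as an integer array of shape (N,2) rather than a plain list.

```python
class RecordingManager:
    """Stand-in for H5FlowDataManager that only records calls (hypothetical)."""

    def __init__(self):
        self.lengths = {}
        self.calls = []

    def reserve_data(self, dataset_name, spec):
        start = self.lengths.get(dataset_name, 0)
        self.lengths[dataset_name] = start + spec
        return slice(start, start + spec)

    def write_data(self, dataset_name, spec, data):
        self.calls.append(('write_data', dataset_name, spec, list(data)))

    def write_ref(self, parent_dataset_name, child_dataset_name, refs):
        self.calls.append(('write_ref', parent_dataset_name, child_dataset_name, refs))


def append_linked(hfdm, parent_name, child_name, parent_rows, child_rows, local_links):
    """Append two batches and link them using dataset-wide indices."""
    parent_slice = hfdm.reserve_data(parent_name, len(parent_rows))
    hfdm.write_data(parent_name, parent_slice, parent_rows)
    child_slice = hfdm.reserve_data(child_name, len(child_rows))
    hfdm.write_data(child_name, child_slice, child_rows)
    # local_links index into the batches; shift them to positions in the datasets.
    refs = [[parent_slice.start + i, child_slice.start + j] for i, j in local_links]
    hfdm.write_ref(parent_name, child_name, refs)
    return refs


hfdm = RecordingManager()
hfdm.reserve_data('stage0/obj1', 4)  # pretend the child dataset already has 4 rows
refs = append_linked(hfdm, 'stage0/obj0', 'stage0/obj1',
                     ['p0', 'p1'], ['c0', 'c1', 'c2'],
                     [(0, 0), (0, 1), (1, 2)])
print(refs)
```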